The AI That Remembers: How a Hollywood Star Built the Memory System LLM Agents Have Been Missing
MemPalace went from zero to 23,000 GitHub stars in 72 hours. Here’s what the tool actually does — and where the marketing outran the code.
On April 5, 2026, Milla Jovovich — the actress who played Leeloo in The Fifth Element — pushed a Python repository to GitHub.
By the end of the weekend, it had 23,000 stars, over 3,000 forks, and the number one trending slot on the platform. Developer communities on Hacker News and Reddit were arguing simultaneously about whether the benchmark claims were fraudulent and whether Hollywood celebrities could actually write production Python. Tech media ran pieces with titles ranging from “The Future of AI Memory Is Here” to “Snake Oil.”
The reality, as usual, sits somewhere more useful than either end of that spectrum.
MemPalace is a local AI memory system. It solves a real problem. The marketing around it is significantly overstated. The code underneath is genuinely functional and, in one important respect, architecturally novel. Knowing the difference matters if you’re making decisions about how to build AI agents that do serious work.
---
The Problem That Made 23,000 People Care
Before getting into what MemPalace does, it’s worth dwelling on the problem it addresses, because that problem is the reason a repository from an actress’s personal GitHub account can go viral with developers in 48 hours.
AI agents forget everything.
This is not a subtle limitation or an edge case. It is the central architectural constraint of every LLM-based agent in production today. When a session ends, the context disappears. The model retains nothing about what was discussed, decided, built, or changed. Tomorrow, you start from zero.
For casual use — asking an AI to draft an email, summarize a document, answer a factual question — session amnesia is a manageable inconvenience. You provide the relevant context, the model helps, you move on.
For the AI use cases that actually generate significant productivity gains, session amnesia is a fundamental obstacle.
Consider what a capable AI coding agent needs to be useful across a real project over real time. It needs to know why you chose PostgreSQL over MySQL six months ago. It needs to know that the authentication module was refactored in February and that the old token validation logic should never be referenced. It needs to know the performance characteristics you’ve measured on the production API so it doesn’t propose solutions that ignore established constraints. It needs to know the decisions made in the last architecture review.
None of that context fits in a single session. None of it is stable — it evolves as the project evolves. And none of it can be efficiently recovered by simply pasting conversation history into each new session.
The workarounds that exist today are all compromises.
Context window stuffing — feeding the model the full history of all relevant conversations — works in theory and fails in practice. For a project of any meaningful duration, the accumulated context quickly runs to hundreds of thousands of tokens. At commercial inference pricing, this approach costs hundreds of dollars a year per active user. At scale, it costs much more. The compute overhead is also substantial regardless of cost: inference time grows with context length, which means slower responses as history accumulates.
LLM-generated summaries — having the AI periodically compress past conversations into a summary document — are the most common mitigation. Conventions like CLAUDE.md and equivalent system prompt files work this way: you maintain a running document of key facts, decisions, and context, and inject it into each new session. This is better than nothing, but it has a fundamental flaw. LLM summaries are lossy by design. Every summarization pass discards information. Details get flattened into generalizations. Specific constraints get abstracted into principles. Over months of a complex project, the “memory” document becomes a faded shadow of the actual history, missing exactly the specific facts you need when a related decision comes up.
Static files like CLAUDE.md are good for things that genuinely don’t change: your preferences, recurring abbreviations, standard project structure. They break down for anything dynamic, because they require manual maintenance and don’t scale with project complexity.
The net result is that serious long-form AI agent work — the kind that produces sustained value across weeks and months — requires either accepting significant context loss or paying substantial ongoing costs to simulate memory that the underlying architecture doesn’t provide.
This is the problem MemPalace is attempting to solve. And it’s why a repository from an actress’s GitHub account went to 23,000 stars before most people finished their Saturday morning coffee.
---
Who Actually Built This
The “Milla Jovovich built an AI memory system” framing is both true and somewhat misleading, and the distinction matters for evaluating the project’s credibility.
MemPalace was co-created by Jovovich and Ben Sigman, a crypto CEO and software developer who is the primary engineer on the project. Jovovich’s GitHub history shows 7 commits over 2 days — a level of involvement that generated immediate skepticism in developer communities. The architecture documents, the benchmark methodology, and the core retrieval code bear the marks of someone with serious engineering background. That’s Sigman’s work.
What Jovovich brought to the project was the distribution mechanism. Her decision to publish under her personal account rather than a separate organization account was almost certainly deliberate. The name recognition created the initial signal boost that pushed the repository onto GitHub’s trending page. Developer attention did the rest.
This is not unprecedented. Celebrity association with technical projects ranges from vanity credit (rare in open-source, more common in startup funding) to genuine collaboration where non-technical contributors shape direction and funding while technical contributors build. The available evidence suggests Jovovich falls somewhere between those poles — more involved than pure figurehead, less technical than Sigman.
The practical implication for evaluating MemPalace: judge the code and the benchmarks, not the commit history. The repository has been independently reviewed by multiple developers with relevant expertise. The architectural approach is real. The code runs. The marketing is overstated, but the underlying system is not a prop.
---
The Architecture: What “Memory Palace” Actually Means in Code
MemPalace takes its name from the Method of Loci — a mnemonic technique used since ancient Greece in which information is stored by associating it with specific physical locations in a mental “palace.” The technique works because human memory is better at spatial and narrative associations than at raw fact recall. By mentally “placing” information in a familiar location, retrieving it becomes a matter of mentally navigating to that location rather than trying to recall an isolated fact.
The software applies this organizational principle to vector storage and retrieval for AI memory.
Standard AI memory systems work like this: conversations get chunked into segments, each segment gets converted into an embedding (a high-dimensional numeric vector that encodes semantic meaning), and the embeddings get stored in a vector database. When you need to retrieve relevant memory, you convert your query into an embedding and search for the stored embeddings closest to it in vector space. This is semantic similarity search, and it’s the foundation of systems like Mem0, Zep, and Letta.
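The flat-index approach can be sketched in a few lines. This is a toy illustration of semantic similarity search, not MemPalace's actual code: the stored vectors are hand-written stand-ins for what an embedding model would produce, and the function names are invented for this example.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings": in a real system these come from an embedding model
# and live in a vector database rather than a dict.
store = {
    "we chose PostgreSQL for the main datastore": [0.9, 0.1, 0.0],
    "the auth module was refactored in February": [0.1, 0.8, 0.3],
    "API p99 latency is 120ms under load":        [0.0, 0.2, 0.9],
}

def retrieve(query_vec, top_k=2):
    # Rank every stored memory against the query — one flat namespace,
    # no distinction between kinds of memory.
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

print(retrieve([0.85, 0.15, 0.05], top_k=1))
# → ['we chose PostgreSQL for the main datastore']
```

Everything downstream — including MemPalace — is built on this primitive; the differences lie in how the store is organized and how results are filtered and ranked.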
The approach works reasonably well, with a consistent limitation: it treats all memories as equally retrievable, equally weighted, and organized in a flat namespace. There’s no structural discrimination between a factual recall question (“What database are we using?”), a temporal recall question (“What changed in the auth module last month?”), and a synthesis question (“Why did we make the architectural decisions we made in Q4?”). All queries hit the same flat index. Retrieval quality depends entirely on the similarity metric.
MemPalace introduces a hierarchical structure that maps to different types of memory and different retrieval strategies:
Wings are the top-level organizational units — one per project, person, or major relationship context. If you’re working on three projects simultaneously, each gets its own wing. Memories about a project live in its wing and don’t contaminate retrieval for other projects.
Halls within each wing correspond to memory types:
Fact recall — static facts that don’t change (what language is this written in, who is the primary stakeholder)
Temporal events — things that happened at a specific time (what changed in March, what decision was made last week)
Multi-hop reasoning — complex interconnected knowledge requiring synthesis
Knowledge updates — facts that supersede earlier facts
Synthesis — patterns, principles, and accumulated understanding
Rooms hold specific conversation threads or topic clusters within a hall.
Drawers contain the individual verbatim exchanges, stored in ChromaDB for semantic search.
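The four-level hierarchy can be modeled as plain data. The class and field names below are illustrative guesses at the shape of the structure, not MemPalace's actual schema (which lives across SQLite and ChromaDB rather than in-memory objects):

```python
from dataclasses import dataclass, field

# The five hall types described above; each maps to a retrieval strategy.
HALL_TYPES = ("fact", "temporal", "multi_hop", "update", "synthesis")

@dataclass
class Drawer:
    text: str            # a verbatim exchange, embedded for semantic search

@dataclass
class Room:
    topic: str           # a conversation thread or topic cluster
    drawers: list = field(default_factory=list)

@dataclass
class Hall:
    hall_type: str       # one of HALL_TYPES
    rooms: dict = field(default_factory=dict)    # topic -> Room

@dataclass
class Wing:
    name: str            # one wing per project, person, or relationship
    halls: dict = field(default_factory=dict)    # hall_type -> Hall

def new_wing(name):
    # A fresh wing starts with one empty hall per memory type.
    return Wing(name, {t: Hall(t) for t in HALL_TYPES})

palace = {"acme-backend": new_wing("acme-backend")}
print(sorted(palace["acme-backend"].halls))
```

The point of the model is isolation: a query scoped to one wing and one hall never touches drawers that belong to a different project or a different kind of memory.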
When a query arrives, MemPalace runs a two-pass retrieval. The first pass classifies the query by type — is this a factual lookup, a temporal question, or a synthesis query? — and searches only the relevant hall. This narrows the search space and reduces the chance of retrieving semantically similar but contextually irrelevant results. The second pass searches the full corpus with hall-specific score bonuses, catching anything that was miscategorized in the first pass.
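The two-pass idea compresses into a short sketch. Everything here is hypothetical — the keyword classifier, the hard-coded similarity scores, and the bonus value stand in for the real components, which would involve embeddings and a learned or prompted classifier:

```python
# Toy corpus: (hall_type, memory text, base similarity score for one fixed query).
CORPUS = [
    ("fact",      "primary database is PostgreSQL",          0.62),
    ("temporal",  "auth module refactored in February",      0.60),
    ("temporal",  "rate limiter added last week",            0.41),
    ("synthesis", "we prefer boring, proven infrastructure", 0.58),
]

def classify(query):
    # Pass 1 helper: a crude keyword classifier standing in for the real one.
    if any(w in query for w in ("when", "last", "changed", "month")):
        return "temporal"
    if "why" in query:
        return "synthesis"
    return "fact"

def retrieve(query, top_k=2, bonus=0.15):
    # Pass 1: classify the query to pick the relevant hall.
    hall = classify(query)
    # Pass 2: score the full corpus, boosting the classified hall so
    # in-hall hits outrank merely-similar results from other halls,
    # while miscategorized memories can still surface.
    scored = sorted(
        ((s + (bonus if h == hall else 0.0), t) for h, t, s in CORPUS),
        reverse=True,
    )
    return hall, [t for _, t in scored[:top_k]]

print(retrieve("what changed in the auth module last month?"))
```

Note how the bonus lifts the weaker temporal memory above the higher-base-score fact: the hall structure, not raw similarity alone, decides the final ranking.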
The entire system runs locally. ChromaDB handles vector storage and retrieval. SQLite manages the knowledge graph and metadata — the structural relationships between wings, halls, rooms, and drawers. No cloud services are required. No API keys for the core function. Memory is stored on your machine, under your control.
Every 15 messages, MemPalace automatically triggers a background process that sweeps the recent conversation, extracts topics, decisions, and code changes, and files them into the appropriate location in the palace structure. This happens without user intervention.
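The trigger logic for that sweep is simple to sketch. The class and callback below are invented for illustration; in the real system the consolidation step runs in the background and involves an extraction model, which this toy version omits:

```python
SWEEP_INTERVAL = 15  # messages between consolidation passes

class ConversationBuffer:
    def __init__(self, consolidate):
        self.consolidate = consolidate  # callback that files memories away
        self.pending = []

    def add(self, message):
        self.pending.append(message)
        if len(self.pending) >= SWEEP_INTERVAL:
            # The real system runs this asynchronously; inline here for clarity.
            batch, self.pending = self.pending, []
            self.consolidate(batch)

swept = []
buf = ConversationBuffer(swept.extend)
for i in range(31):
    buf.add(f"msg {i}")

print(len(swept), len(buf.pending))
# → 30 1  (two sweeps of 15 messages, one message still pending)
```

The user-facing property is the important part: consolidation is a side effect of conversation, not a chore the user has to remember.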
The system initializes with a 170-token startup load: the top two index layers (L0 and L1), which act as a table of contents for the palace. Deeper layers are pulled only when queried. This keeps per-session overhead close to zero.
---
The Benchmark Claims: What Held Up and What Didn’t
MemPalace launched with aggressive benchmark claims. The headline was 100% accuracy on LongMemEval — the standard benchmark for AI long-term memory systems.
The community caught the problem within days.
GitHub issue #29 documented the key finding: the 100% score was achieved by identifying which specific questions the system got wrong, engineering targeted fixes for those exact questions, and retesting on the same dataset. This is overfitting to the test set. It is not a benchmark result — it’s a demonstration that you can tune a system to pass a test when you know the answers in advance. After community pressure, the developers revised the headline number.
The 100% LoCoMo benchmark score has a different problem. LoCoMo conversation sessions contain 19 to 32 items. MemPalace ran the benchmark with top_k=50, meaning the retrieval window was larger than the entire candidate pool. When you retrieve more items than exist in the database, you retrieve everything by default. A 100% recall rate under these conditions tells you nothing about the system’s actual selectivity — it just means you asked for more than was there.
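The arithmetic is easy to reproduce. With a toy "database" smaller than top_k, recall is 100% no matter how badly the retriever ranks things (the numbers below are illustrative, not the actual LoCoMo data):

```python
import random

def recall_at_k(relevant, retrieved):
    # Fraction of relevant items that appear anywhere in the retrieved set.
    return len(set(relevant) & set(retrieved)) / len(relevant)

corpus = list(range(32))   # a LoCoMo-sized session: 32 candidate items
relevant = [3, 17, 29]     # the items that actually answer the query

# A retriever with the worst possible selectivity: a random ranking.
random.seed(0)
ranking = random.sample(corpus, len(corpus))

print(recall_at_k(relevant, ranking[:50]))  # top_k=50 > 32 items → 1.0, always
print(recall_at_k(relevant, ranking[:5]))   # top_k=5 actually tests selectivity
```

Even a random ranker scores perfect recall at top_k=50, which is exactly why the reported 100% carries no information about MemPalace's selectivity.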
The independently verified numbers, using correct methodology:
LongMemEval (raw mode): 96.6% accuracy. This is the pre-tuning result, before the fixes that inflated the score to 100%. Independent testers have confirmed this number is reproducible.
LongMemEval (hybrid mode): 88.9% Recall@10. Hybrid mode uses an optional LLM reranking step that incurs a small API cost (approximately $0.001 per query) to improve precision.
For reference, Mem0 scores approximately 85% on LongMemEval, and Zep scores around 82%. The 96.6% is a genuine result and represents a meaningful improvement over comparable tools.
There’s a useful technical debate happening about why MemPalace achieves this accuracy. Multiple independent analyses have found that ChromaDB’s underlying vector retrieval is doing most of the heavy lifting. The hierarchical palace structure — the wings, halls, and rooms — contributes a meaningful but not dominant portion of the accuracy advantage. Some developers argue the architecture’s primary value is organizational clarity for humans (which also helps the LLM navigate the memory structure) rather than fundamental retrieval improvement.
This is a legitimate question and the answer is probably “both.” The structured retrieval provides a real advantage by narrowing the search space and reducing interference between different types of queries. ChromaDB provides strong baseline retrieval. The combination produces better results than either alone. The specific contribution of each isn’t fully disentangled in the published benchmarks.
---
Where It Falls Short
MemPalace is not ready for production deployment without accepting some rough edges.
The MCP integration — the interface that allows Claude Code, ChatGPT, and Cursor to use MemPalace as a memory backend — ships with a known stdout bug that breaks integration with Claude Desktop. The bug was reported and acknowledged; whether it’s been fixed by the time you read this depends on when that is.
The README describes features that are not yet implemented in the code. This is common in fast-moving open-source projects, and in this case it appears to be a documentation problem rather than intentional misrepresentation — the feature set in the README reflects the roadmap, not the current state of the code. But for anyone trying to evaluate the tool for a specific use case, reading the README and reading the code will tell you different things.
The benchmark methodology problems have been corrected in the documentation, but the correction spread far more slowly than the original claim did. The 100% number circulated widely in media coverage that won’t be updated.
The project is also six days old as of this writing. The velocity of community interest is a positive signal — experienced developers have reviewed the code and found it functional — but the maturity that comes from months of production use by diverse organizations doesn’t exist yet.
---
What It’s Actually Good For, Right Now
Given all of that, what should you do with MemPalace?
If you’re a developer working on complex projects over extended timeframes — significant codebases, long-running research, anything where the accumulated history of decisions and changes matters — MemPalace is worth testing. Install it, configure it, run it against your actual workflow for a week. The core retrieval works. The local-first architecture means your data stays on your machine. The 96.6% recall accuracy on LongMemEval, even with appropriate caveats about benchmark methodology, represents genuinely capable retrieval.
If you’re evaluating AI memory solutions for an organization — deciding what tooling your AI agent infrastructure will use — treat MemPalace as a project to watch closely over the next 90 days rather than something to standardize on immediately. The architecture is sound. The implementation needs maturation. The community is engaged and the development velocity has been high. This could look very different by July.
If you’re thinking about what the MemPalace moment signals for AI infrastructure more broadly: the problem it addresses is the right one. The “goldfish memory” problem is not a niche edge case. It is the central limitation of deploying serious AI agents for sustained work. The architectural direction MemPalace represents — local, structured, hierarchical memory with near-zero startup overhead — is where this needs to go. Whether MemPalace specifically becomes the standard tool or gets superseded by something better, the design decisions it’s making are worth understanding.
---
The Bigger Picture
It is worth noting that a Hollywood actress co-created the project that is — at least for this moment — the leading open-source solution to one of AI’s most important practical limitations.
It says something about how AI development has changed. The barrier to building meaningful AI tooling has dropped enough that people outside the traditional ML research and engineering community are producing real artifacts. The tools — Claude Code, Cursor, other coding agents — are capable enough that someone with vision, a credible collaborator, and persistence can move from problem identification to functional code in a compressed timeframe.
It also says something about how the AI developer community processes new tools. 23,000 stars in 72 hours is partially a function of the celebrity association. But the technical discussion on Hacker News was substantive from the beginning. The benchmark problems were identified by people who actually read the code. The independent accuracy tests were run by people who understood what they were measuring. The community that drove the repository to trending is not uncritical — it just processes signal fast.
The memory problem for AI agents is real and important. MemPalace is a functional, architecturally interesting, marketing-overstated attempt to solve it. It will either mature into a significant tool or get superseded by something that learned from it. Either outcome is progress on a problem that needed more attention than it was getting.
---
MemPalace is available at github.com/milla-jovovich/mempalace. The independent benchmark analysis referenced in this piece was published by Nicholas Rhodes on Substack, and the technical review by Danilchenko at danilchenko.dev.

