
Very, very seldom! Obviously we need to be correct in 100% of cases, which is why the de-optimizer is there, but in practice it essentially never triggers. Right now the things that trigger a deopt are very limited -- malformed time literals, overflowing integer negations, and very little else. The performance impact is essentially zero.
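
To make that concrete, here's a rough sketch of the kind of expression that would force a fall back to the slow path (my own illustration, with a hypothetical table, not an exhaustive list):

    -- Hypothetical illustration: negating the minimum signed 64-bit value
    -- overflows BIGINT, so the compiled evaluator bails out and the
    -- expression is re-evaluated by the general-purpose slow path.
    SELECT -big_col FROM t WHERE big_col = -9223372036854775808;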


Our vector implementation is fully compatible with the syntax for vector fields and comparison functions in MySQL 9 -- we have carefully backported all the changes to our version of MySQL 8 to ensure full backwards and forwards compatibility. The index syntax is specific to PlanetScale because the open-source MySQL 9 preview does not yet support ANN indexes on vector columns, but we're committed to being as compatible as possible in the future and to minimizing fragmentation. I do agree that fragmenting the ecosystem is not great!
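
For reference, a minimal sketch of the MySQL 9 column and conversion syntax we stay compatible with (tiny dimensions just for illustration; see the docs for the exact distance and index functions we expose):

    -- Vector columns and conversion functions as introduced in MySQL 9:
    CREATE TABLE items (
      id BIGINT PRIMARY KEY,
      embedding VECTOR(3)
    );

    INSERT INTO items VALUES (1, STRING_TO_VECTOR('[0.1, 0.2, 0.3]'));

    SELECT id, VECTOR_TO_STRING(embedding) FROM items;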


Definitely yes to the first: the index implementation is fully integrated with PlanetScale's sharding layer, so you can scale horizontally as much as you need. This works very well in practice: the sharding layer, Vitess, is extremely battle-tested with petabyte-sized clusters in production, and our vector indexes are like any other table on the cluster, so they really scale horizontally with very predictable performance.

As for separating storage and compute: we don't do that. One of the key selling points here is that this is vector data fully integrated into your relational database with ACID semantics. It's very hard to separate storage and compute and keep this behavior!


Hi! I'm one of the authors of this feature. It's something quite novel, because it's not just an HNSW plug-in for MySQL (like e.g. pgvector is for Postgres). It's a fully transactional index, integrated into InnoDB.

We based the implementation on two recent papers from Microsoft Research, SPANN and SPFresh. SPANN is a hybrid graph/tree algorithm that does a fantastic job of scaling larger-than-RAM indexes (https://arxiv.org/abs/2111.08566), and SPFresh expands on it by defining a set of background operations that maintain the index's performance and recall while it's continuously updated in place (https://arxiv.org/html/2410.14452v1). The novel thing here is designing all the SPANN _and_ SPFresh operations to be transactional, and integrating them into MySQL's default storage engine.

This tight integration fundamentally means that inserting, updating, and deleting vector data from MySQL is always reflected immediately in the index as part of committing your transaction. But it also means that the indexes are fully covered by the MySQL binlog: they recover from hard crashes just fine. They're also managed by MySQL's buffer pool, so they scale to terabytes of data, just like any other table. And, crucially, they're fully integrated with the query planner, so they can be used in any query, including JOINs and WHERE clauses (something that many other vector indexes really struggle with).
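
Roughly, that means an ANN-backed search can sit inside an ordinary query plan; something like this (the function names are just for illustration, check the docs for the exact spelling):

    -- Illustrative only: an ANN-indexed distance search combined with a
    -- regular JOIN and WHERE clause, planned like any other query.
    SELECT p.id, p.title
    FROM products p
    JOIN inventory i ON i.product_id = p.id
    WHERE i.in_stock = 1
    ORDER BY DISTANCE(p.embedding, STRING_TO_VECTOR('[0.1, 0.2, 0.3]'))
    LIMIT 10;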

We plan to release a whitepaper on our transactionality extensions to SPFresh, which I think will be super interesting, but meanwhile please feel free to test the beta and ask me any questions (here, or by emailing PlanetScale support). Thanks!


Will this be open sourced and be upstreamed to MySQL? Or will it always be Planetscale specific?


It's already open source, because the GPL requires it. It's unlikely to be accepted as an upstream contribution, given that Oracle has its own vector type that is only available in their MySQL cloud service. Don't expect it to show up in a standard Oracle MySQL community release.


Is that the case when they're only providing it as a service rather than distributing it? And if it is, where's the code?


Can it be put into MariaDB? Or have the two projects diverged so much that that's not possible now?

To be honest, as a MySQL user, the fact that it is owned by a corp is quite a concern ATM.


Well. Let's see how it turns out. I am hoping the Vector type will be available in the community release in a later 9.x version.


I have the same question. I have a small DB on PlanetScale and happily pay 39 for it. IMO, vector DBs are the next frontier. I can understand the tough decision for PlanetScale, given that the cloud services can just offer it as a service and capture all the value of an open-source project.


very cool stuff! I just read the SPFresh paper a few days ago and was wondering if it's been implemented in industry (e.g. Turbopuffer's implementation of SPANN).

I'd be curious how y'all represent the posting lists for each partition in InnoDB:

- what IDs are you storing in the posting lists?

- how are the posting lists represented on disk? are they using compression and/or some form of skip indexing? the paper seemed to use a pretty simple block-based representation, but I'm curious what works well in practice.

- how do the posting list data structures themselves handle incremental updates and MVCC?


These are very relevant questions! Thank you!

We're storing IDs from a ghost column that is created in the table where you're inserting vector data. This works very well in practice and allows updating the value of the vectors in the table, because an update is translated into a delete + insert in the vector index by assigning a new ghost ID.
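
Conceptually it looks something like this (hypothetical names, just to illustrate the mechanism):

    -- The table you insert vectors into gets a hidden "ghost" column, e.g.:
    --   items(id BIGINT PRIMARY KEY, embedding VECTOR(3), <hidden ghost id>)
    -- The ANN index stores ghost IDs in its posting lists, so an update:
    UPDATE items SET embedding = STRING_TO_VECTOR('[0.4, 0.5, 0.6]') WHERE id = 1;
    -- becomes, at the index level: assign the row a fresh ghost ID, delete
    -- the old ghost ID from its posting list, and insert the new ghost ID
    -- into the posting list of whatever partition the new vector lands in.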

We have abstracted the quantization system away from the index; for the initial release, vector data is stored in raw blocks, like in the paper. Query performance is good, but disk usage is high. We're actively testing different quantization algorithms to see which ones we'll end up offering at GA. We're hoping our beta users will help guide this choice!

Incremental updates and MVCC are _extremely tricky_, for both correctness and performance. As you've surely noticed, the hard thing here is that the original paper is very focused on LSM trees, because it exploits the fact that LSM trees get compacted lazily to perform incremental updates to the posting lists ('merges'). MySQL (and Postgres, and all relational databases, really) is B-tree based, and in-place updates for B-trees are expensive! I think we came up with very interesting workarounds for the problem, but it's quite a bit to drill into in an HN comment. Please stay tuned for our whitepaper. :)


looking forward to it!

I'd be curious if y'all end up supporting adding filter attributes to the inverted index that can then be pushed down into the posting list traversal.

for example, a restaurant search app may have (1) an embedding for each restaurant but also (2) a cuisine. then, if a restaurant has `cuisine = Italian`, we'd also store its ghost ID in a `cuisine:Italian` posting list.

at query time, the query planner could take a query like `SELECT * FROM t1 WHERE cuisine = 'Italian' ORDER BY DISTANCE(..)` and emit a plan that efficiently intersects the `cuisine:Italian` posting list with the union of the partitions' posting lists.

this feels to me like a potential strength of the inverted indexing approach compared to graph-based approaches, which struggle with general filtering (e.g. the Filtered-DiskANN paper).


tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I'm aware of. I don't think it's in production at MSFT yet.


I'm curious what the use case/customer demand is for making something like vectors transactional. It strikes me as a use case that fundamentally needs strong consistency less than most database workloads. Don't get me wrong, if it were free to make it transactional, why not, but I can only assume this comes with profound tradeoffs in terms of workload contention.


A good example is TiDB's implementation of vector indexes. TiDB is also a MySQL-compatible distributed transactional database; it separates the storage engine for vector indexes from the transactional store by keeping an extra replica in its OLAP engine (TiFlash), avoiding resource contention.


Since you're here - why decide to write something from scratch for mysql instead of using/improving postgres? Seems like a very strange decision. MySQL is a poor RDBMS in comparison.



You bring up such a discredited blog post as proof?


PlanetScale is built on Vitess. Vitess was created in 2010 to help scale MySQL at YouTube.


MySQL powers much larger workloads than Postgres at a number of marquee Silicon Valley companies, though it's generally used as a sharded key-value store in those scenarios.


Hey Nican! Thanks for the feedback. It wasn’t clear from the blog post, but as the sibling poster points out, the system has full eventual consistency: it behaves like a replica, but it replicates a whole cluster of MySQL instances simultaneously (i.e. your full PlanetScale database). Because of this design, we never lock or affect the performance of writes to the main database.

As for predictability, we’re working on some interesting optimizations that allow similar queries to reuse the internal state of each other, so the system becomes more efficient the more queries it’s caching. Stay tuned!


Hey, author here. Thanks for the kind words! This is a custom pipeline that I designed for the article. It's implemented as a Node.js library using SVG.js, and it statically generates the interactive SVGs directly in the static site generator I was using (Eleventy) by calling out to the Go compiler and extracting assembly for any lines you mark as interesting. It turned out to be very handy for iterating, but it's not particularly reusable, I'm afraid!


I came here to ask about the same thing. Very cool! I would be very interested even in a blog post just on how you did the SVG generation.


I agree with the commenter you're replying to; I'd only add that Intel syntax is much more readable than AT&T.


FWIW, I'm fairly sure this is the assembly syntax used by Go -- the author may not have made a deliberate choice to use this over another.


Possibly, but objdump can disassemble into Intel syntax.


Oh OK, you have clearly spent more time thinking about this problem than the team of engineers at GitHub who've been researching code search at scale for more than four years. I bet they feel real silly right now knowing they could have shipped this search engine in a couple weeks taping together off-the-shelf libraries if they only had your talent for software architecture.


Sure, I did. I've implemented a proper search engine for a very big company's document and knowledge base. Similar to Gmane, which also used Xapian. Much better than what I see there, or than the old Google Code Search.


Actual author of Sundown here.

> even if it was active it's not a very common Markdown implementation

First, even if this was true, it's still _not_ OK to name your project Sundown.

Second, _my_ Sundown is the most used Markdown engine in the world. It has powered all Markdown rendering on GitHub.com and Reddit.com for more than 5 years, and it has rendered billions more documents than all other Markdown parsers out there.

Could you please not be an asshole and rename your project? It's really not that hard to come up with a name for a parser. Don't steal mine.


> Could you please not be an asshole and rename your project?

I have updated the blog post and changed the name to "SunSed Short-form HTML tags"; it also better describes what it is.

I have high respect for you and your work and admit that the way I came up with the name was sloppy!


Thank you, I appreciate that. I'll point people at SunSed the next time they ask about MD alternatives. :)


Hi there! As the spec explains, this is a Markdown-specific blacklist that prevents tags that would otherwise "break" the content of the Markdown document.

A document that contains these tags will not be parsed properly by an HTML5-compliant parser; the parser will "swallow" other chunks of Markdown content that come after the tags (for example, an unclosed <textarea> or <plaintext> causes everything after it to be treated as raw text rather than Markdown). Hence, we disable the tags altogether.

This is a UX feature, not a security feature. XSS prevention, and a plethora of other security checks, are performed by our user content stack -- but that functionality is shared across all markup languages at GitHub (MD, RST, AsciiDoc, ...), so it's not discussed in this spec.


Author here. 5 years ago I would have agreed with you and logged e.g. 10 million merge requests to replay them offline. But one thing I've found over the years (which may seem obvious in retrospect) is that staging environments are not identical to production. Particularly not when it comes to finding sneaky bugs and performance regressions -- the code doesn't run in the exact same environment it will run in once deployed (it sees different inputs and, most importantly, different load and performance characteristics).

The question then becomes "why would you run these experiments offline when you can run them online?". So we simply do. I personally feel it's a game changer.


> why would you run these experiments offline when you can run them online?

It probably doesn't apply in this case, but many bugs cannot be replicated in the production environment. At least, you don't want to break things in production to see if your software does the right thing in adverse scenarios.


Some would say that's a really good strategy. Keeps you on your toes...

http://techblog.netflix.com/2012/07/chaos-monkey-released-in...


Couldn't agree more. It is almost impossible to have a staging environment which is EXACTLY like the production environment. The slightest difference can introduce a bug in production that gets missed in staging. And if you can run your experiment in production without any downside, why not?


TL;DR: In theory, theory and practice are the same. In practice, they are not.

