
My team is currently using the hosted Citus Cloud, which uses PgBouncer. They note that the usual reason for sharding is high write volume, but we've actually seen big benefits from moving to a sharded setup via Citus just as much for the read performance.

By sharding by customer we're able to leverage Postgres caching more effectively and scale elastically. Since switching over, our database has performed and scaled much better than when we were on a single node.
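To illustrate the sharding-by-customer idea, here's a generic sketch of hash-based tenant routing (not Citus's actual implementation; the shard count, hash choice, and function names are all assumptions for illustration). The point is that every query for a given customer lands on the same shard, so that shard's cache stays hot for that customer's working set:

```python
# Hypothetical sketch: route each customer to one shard by hashing its id.
import hashlib

NUM_SHARDS = 16  # assumed shard count

def shard_for(customer_id: str) -> int:
    """Deterministically map a customer_id to a shard number."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every query for the same customer hits the same shard, so that
# shard's Postgres cache stays warm for that customer's data.
assert shard_for("acme") == shard_for("acme")
```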

There is some up-front migration work, but that was limited to some minor patching of the ActiveRecord ORM and changes to our migration scripts. We also considered Cassandra, Elasticsearch, and DynamoDB, but the changes they would have required in our application logic would have taken an unacceptable amount of time.


Craig from Citus here. Thanks for the kind words, Justin. As you mention, there is some upfront work, but once it's in place it becomes fairly manageable.

We've been working to make that upfront work easier as well, with libraries that allow things to be more drop-in (ActiveRecord-multi-tenant: https://github.com/citusdata/activerecord-multi-tenant and Django-multitenant: https://github.com/citusdata/django-multitenant).


Praying for a http://docs.sequelizejs.com/ plugin


For a customer DB, why was Citus required for your use-case? Do you have millions of users?


One of the things that I'm now excited about for Citus is multi-node IO capacity. It should improve write and read latencies in situations where a single node's disk is getting saturated.

* Edit. I AM excited for this. Typo'd now = not previously


Not sure I fully follow on the multi-node IO capacity. Can you share a bit more about the workload you're concerned about?

Edit: Thanks for the clarification; the typo part in particular threw me off. Makes sense now.


Let's say you're using RDS and have a single box capable of ~25k IOPS. If instead you have 16 boxes each capable of some number of IOPS (let's say 15k), then your total system capacity is significantly higher than 25k. Read workloads where pages need to be pulled from disk (high read IOPS) should see improvement in general.
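The back-of-the-envelope arithmetic for the example above (the 25k and 15k figures are the illustrative numbers from the comment, not measured values):

```python
# Aggregate IOPS across a sharded cluster vs. one large node.
single_node_iops = 25_000   # one big RDS box
nodes = 16
per_node_iops = 15_000      # each smaller box

cluster_iops = nodes * per_node_iops
print(cluster_iops)  # 240000, nearly 10x the single-node ceiling
```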

Secondaries cover this case for GitLab, it seems, but that comes with a set of caveats as well (namely async availability of data).


> Secondaries cover this case for GitLab, it seems, but that comes with a set of caveats as well (namely async availability of data).

Another caveat is that all secondaries will end up having more or less the same stuff in their cache. As your data set grows bigger that becomes unsustainable, because you're going to read from disk more and more.

When you shard across N nodes you can keep N times as much data in the cache. Combined with N times higher I/O and compute capacity, that can actually give you way more than N times higher read throughput than a single node (for data sets that don't fit in memory), and you can get much higher write throughput as well.
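A rough model of the cache argument above (the working-set and cache sizes are made-up numbers for illustration): each shard caches a disjoint 1/N slice of the data, so the cluster-wide cache is N times larger, whereas with read replicas every node caches roughly the same pages.

```python
# Toy model: fraction of the working set that fits in cache.
working_set_gb = 800   # assumed total hot data
node_cache_gb = 100    # assumed cache per node
nodes = 8

# Single node (or N replicas, which all cache the same pages):
single_node_hit_rate = min(1.0, node_cache_gb / working_set_gb)

# N shards, each caching a disjoint 1/N slice of the data:
sharded_hit_rate = min(1.0, nodes * node_cache_gb / working_set_gb)

print(single_node_hit_rate)  # 0.125 -> 87.5% of reads go to disk
print(sharded_hit_rate)      # 1.0   -> the whole working set fits in cache
```

Once the whole working set fits in the combined cache, reads stop hitting disk entirely, which is how the speedup can exceed the N-times IO capacity alone.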


Great points, thanks!

