Wednesday, September 5, 2012

Cassandra 1.1 - Tuning for Frequent Column Updates

Cassandra is known for its good write performance, but there are scenarios where you might run into trouble - especially when a particular use case generates heavy disk IO. This can be the case for columns that receive frequent updates. However, you can avoid those problems with proper configuration, or simply by upgrading to a recent Cassandra version. The good news is that this can be applied to an already running system, so even if you are having problems right now, there is still hope.

Memtables are flushed to immutable SSTables, so it is possible for a single column to end up in several SSTables when its value keeps changing over a long enough period of time. This guarantees fast inserts, because data is simply appended to disk. On the other hand, unnecessary writes decrease disk performance, not only because many SSTables have to be written, but mainly because the duplicates on disk have to be compacted later on.

The idea is to tune Cassandra in a way that takes advantage of frequent updates. This can be achieved by keeping data in memory and delaying disk flushes, so that new updates replace existing values in memory.
This generates less disk traffic, because fewer duplicates get flushed. And that is not all - it also creates a write-through cache, and read requests will benefit from it. Here are some configuration tips:
  • Make sure that you have at least Cassandra 1.1 - it contains an optimization for frequently changing values (CASSANDRA-2498). When a single value is stored in multiple SSTables, older Cassandra versions had to read the column value from all of them in order to find the most recent one. Now SSTables are sorted by modification time, so it's enough to read the most recent value and simply ignore the remaining outdated ones.
  • Increase the thresholds for flushing memtables (see the cassandra.yaml sketch after this list). Each update applied in the memtable means one less duplicate entry in an SSTable.
  • Each read operation checks the memtable first; if the data is still there, it is simply returned - this is the fastest possible access. It's like a non-blocking write-through cache (based on a skip list).
  • A too large memtable, on the other hand, results in a larger commit log. This is not a problem until your instance crashes - it will then need some time to start, because it has to replay the whole commit log.
  • Compaction merges SSTables together, which increases read performance, since there is less data to go through. But this process does not have high priority - when Cassandra is nearly exhausted it will skip compaction, and this can lead to data fragmentation.
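A minimal cassandra.yaml sketch for raising the flush thresholds - the numbers below are only examples and have to be matched to your heap size and write volume:

    # cassandra.yaml (Cassandra 1.1) - example values only
    # Total memory reserved for memtables; a larger value means less
    # frequent flushes and therefore fewer duplicated columns on disk.
    memtable_total_space_in_mb: 2048
    # The commit log must be allowed to grow together with the memtables,
    # otherwise hitting its limit will force early flushes.
    commitlog_total_space_in_mb: 4096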
Caching:
  • Row cache really makes sense for frequent reads of the same row(s), and additionally when you read most of the columns of each row.
  • With an active row cache, access to a single column of a particular row loads the whole row with all its columns into memory. Analyze your data access patterns and make sure that this is not an overhead and that you have enough memory. It would be a real waste of resources to load a million columns into memory just to access a few of them.
  • The row cache works as write-through for data that is already in it. Data is loaded into the row cache only when it's being read and was not found in the memtable. From that point on it gets updated on each write operation. A frequently changing entry without read access will not affect the row cache, because it is simply not there.
  • Updates on data in the row cache decrease performance, and those frequently changing columns are probably also available in the memtable anyway. The read process searches the memtable first and, in case of a hit, ignores the row cache. From this point of view the row cache makes sense if you also read other columns which are not changing frequently. For example, a single row has 200 columns: 50 receive frequent updates, 100 sporadic ones, and the read process always reads all of them. In this case the row cache makes sense - we have to refresh 50 columns on each insert, but we gain fast access to the remaining 150.
  • It might be a good idea to disable the row cache, increase the memtable size in the hope of reducing disk writes, and use the memtable as a cache.
  • Disabling the row cache does not necessarily mean additional disk seeks. Cassandra uses memory mapped files, which means that each file access is cached by the operating system. Relying on memory mapped files is nothing new - Mongo does not have a cache at all, because it's not needed, since the file system cache works just fine. But Mongo has a different data structure on disk: it stores BSON documents optimized for reads, all in one place, whereas Cassandra might (not always) first need to collect data from different locations.
  • The row cache would also help in a situation where a single row is spread over many SSTables. In this case putting all the data together is a CPU intensive operation, not to mention the possible disk accesses to read each column value.
  • When the row cache is disabled, the key cache must be used. The key cache is like an index - and you definitely want to load your whole index into memory.
  • When the row cache is disabled, the key cache is enabled, and read operations get hits on the key cache, we have quite a performant solution. Searching SSTables runs fully in memory; only reading the column value itself requires disk access. And maybe not even that, since it's a memory mapped file.
  • When disabling the row cache, remember to tune read ahead (see the sketch after this list). The idea is to read only the single column from disk, and not more data when it's not needed.
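A sketch of the corresponding cache and read-ahead settings; the sizes and the device name are placeholders, so adjust them to your hardware:

    # cassandra.yaml (Cassandra 1.1) - caches are configured globally here
    row_cache_size_in_mb: 0      # 0 disables the row cache
    key_cache_size_in_mb: 512    # keep the whole "index" in memory

    # Linux read ahead for the data disk (the value is in 512-byte sectors,
    # so 16 means 8 KB per read) - /dev/sdb is just an example device
    blockdev --setra 16 /dev/sdb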
To summarize: run performance tests, check the Cassandra statistics, and verify how many SSTables have to be searched to find the data and what the cache usage is. This is a good starting point for deciding whether to change the memtable size or to tune the caching. In my case disabling the row cache, a large key cache and increased memtable thresholds was the right decision.
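For those statistics, nodetool is the place to look - a quick sketch (keyspace and column family names are placeholders):

    # SSTables per read and read/write latencies for one column family
    nodetool -h localhost cfhistograms MyKeyspace MyColumnFamily
    # SSTable count, memtable size and flush counters
    nodetool -h localhost cfstats
    # global key cache / row cache hit rates (caches are global in 1.1)
    nodetool -h localhost info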