I've consulted on two projects with 15,000 events per second. Elasticsearch puts indexing requests in the Indexing Request Queue. When that overflows, ES will reject indexing requests and lose data. To prevent data loss it is prudent to use Kafka as a buffer to level out traffic.
Read the resiliency page in the right context. Elastic is perfectionist and Data Stax is cavalier. Lack of file checksums was identified and now fixed by Elastic. I asked a Data Stax Solutions Architect if Cassandra used file checksums and he said he didn't know.
Elasticsearch uses two-phase commit to update cluster state. Cassandra gossips cluster state among nodes making the eventual consistency model very complicated and needing repairs. For writing data, turning on WRITE_CONSISTENCY=ALL differs. If not all shards can be written, the document-index operation is rolled back by Elasticsearch. Cassandra warns the application to do a roll back.
Unlike Couchbase, Cassandra can not determine the most recent updates and can not update a parallel Elasticsearch cluster. Writing a Cassandra secondary-indexing plugin for Elasticsearch is very complicated, and will result in duplicate documents that must be filtered after search. It almost impossible to use Cassandra with Elasticsearch effectively.
In any system, there is a long multi-minute latency to satisfy a query over 4.2 billion documents. A 7-day query would be 29 billion documents. The only systems capable of handling this use case are Hadoop/Spark and/or Elasticsearch.
Exact choices depend on more details of the use case.
* One architecture is HDFS or S3 as event storage with Spark programs that write results into Elasticsearch for visualization.
* Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run on periodic jobs. ( See http://www.leapfire.com/elasticsearchjoin.html
* For analysis on time periods for 24 hours or less, consider Spark Streaming for short-term disposable summaries.
For one of my clients, I consulted for just 5 hours and they said I saved them weeks of work in trying to make a good decision. As part of that, I did a one-hour presentation on Time-Series Event Systems.
Consider Kafka, Logstash, Elasticsearch, Hadoop, HDFS, S3, Spark, and Spark Streaming.