We have a project that inserts 50,000 rows per second into a database.
Our tables have a schema...
After the inserts, we process the data and generate some reports and analysis...
Is Elasticsearch a good solution as a primary database for our case, or should we use another database and use Elasticsearch just for analysis?
And if we should use another database with Elasticsearch, which database is recommended for use alongside it?
---
There's a lot of "it depends" here.
What sort of data, what sort of analysis?
---
Our data is network packets from local networks, for example a small office network.
We analyse this network data to detect trends and to monitor network usage.
For now, our analysis is aggregation and filtering on some fields to identify trends and draw some charts.
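For example, with the Python Elasticsearch client a trend query of that kind might look roughly like the sketch below (the index name, field names, and interval are placeholders, and the exact client syntax varies by version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# Hourly traffic for one source IP -- the kind of aggregation used to
# spot trends and draw charts. All names here are hypothetical.
resp = es.search(
    index="packets-*",
    body={
        "size": 0,
        "query": {"term": {"src_ip": "10.0.0.1"}},
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "@timestamp", "interval": "1h"},
                "aggs": {"bytes": {"sum": {"field": "length"}}},
            }
        },
    },
)
for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["bytes"]["value"])
```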
---
How important is the data?
---
it is important : )
At insert time, losing 50 or 100 packets per 1,000,000 is acceptable,
but after the insert, losing data is not acceptable.
---
Similar use case, but we never use Elasticsearch as a primary database. Once the data is in our databases (mostly SQL), we transform and store it on an Elasticsearch cluster for analysis and some ad-hoc projects, but we do not use ES as the primary. That's because our systems were built long ago and they are critical. But if you are building completely new systems, then you have that freedom.
---
I checked this link before asking the question,
but I don't understand exactly what it means...
---
> Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.
Basically, you should be aware that something can go wrong, but we are trying hard to make sure that never happens.
---
Is the most important issue losing data at insert time?
Or can something go wrong after the data is indexed in Elasticsearch?
---
[quote="hossein_ey, post:10, topic:85733"]
Is the most important issue losing data at insert time?
[/quote]
Yes.
[quote="hossein_ey, post:10, topic:85733"]
Or can something go wrong after the data is indexed in Elasticsearch?
[/quote]
Yes. The cluster can in some extreme conditions become RED which means that some primary shards are not available anymore.
---
[quote="dadoonet, post:11, topic:85733"]
Yes. The cluster can in some extreme conditions become RED which means that some primary shards are not available anymore
[/quote]
So in this situation, is the data completely lost? Can we not restore it?
What is your recommendation? Should we use Elasticsearch as the primary database or another database? If we should use another, which database is the best match for Elasticsearch?
---
[quote="hossein_ey, post:12, topic:85733"]
So in this situation, is the data completely lost?
[/quote]
No. Not completely. Some shards might be missing. Not all.
[quote="hossein_ey, post:12, topic:85733"]
Can we not restore it?
[/quote]
It depends, but it might be hard to recover from that situation, depending on how the RED state actually occurred.
[quote="hossein_ey, post:12, topic:85733"]
What is your recommendation? Should we use Elasticsearch as the primary database or another database?
[/quote]
The one I already pasted:
> Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.
So it really depends on your case. IMO you have 3 options:
* You absolutely don't care about losing part of your data. Let's say "non-critical logs". Then maybe you will lose one day of data at some point, but maybe that's not critical enough for you to be worth adding another server just for storage.
* You absolutely care about your data and want to be able to reindex it in all cases. For that you need a datastore. A datastore can be a filesystem where you store JSON, HDFS, and/or whichever database you prefer and are confident with (a minimal sketch follows this list). On how to inject data into it, you may want to read:
http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/.
* Well, you know that you can lose data in some extreme corner cases, but you don't want to pay the price of extra servers. Do backups, use Elasticsearch replicas (increase to 2)... but you know the risk.
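As an illustration of the second option, here is a minimal Python sketch: the raw JSON on disk stays the source of truth, and you can always rebuild an index from it. The index name, file path, and bulk action format are assumptions and depend on your Elasticsearch version.

```python
import gzip
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])    # assumed cluster
ARCHIVE = "/data/events/2017-05-01.json.gz"      # hypothetical archive file

def archive_and_index(events):
    """Append raw JSON events to the archive (the source of truth),
    then bulk-index them into Elasticsearch."""
    with gzip.open(ARCHIVE, "at") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")
    helpers.bulk(es, ({"_index": "packets-2017.05.01", "_source": e} for e in events))

def reindex_from_archive(path, index):
    """Rebuild an index from the archived JSON if the cluster ever loses shards."""
    with gzip.open(path, "rt") as f:
        helpers.bulk(es, ({"_index": index, "_source": json.loads(line)} for line in f))
```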
---
Hi,
How are you getting your packet information into ES? If it goes through Logstash at some point in your pipeline, you could easily just configure another output to a more robust datastore if you're afraid of losing data. If it's just for backup purposes, I'd probably just dump it into a compressed file on my SAN through the File output plugin.
PS : 50K packets per second hardly seems like a small office's activity, that's almost what we get on one of our datacenters for 7K users :smiley:
PPS : also, unless you specifically require to store a copy of each and every network packet, I think you should look into using Netflow/IPFIX instead, that'd probably make it easier on your ES cluster than having all those packets going into it ;)
---
I would say use Cassandra as the primary database, and use Elasticsearch for analysis.
---
I've consulted on two projects with 15,000 events per second. Elasticsearch puts indexing requests in the Indexing Request Queue. When that overflows, ES will reject indexing requests and lose data. To prevent data loss it is prudent to use Kafka as a buffer to level out traffic.
Read the resiliency page in the right context. Elastic is perfectionist and DataStax is cavalier. The lack of file checksums was identified and has now been fixed by Elastic. I asked a DataStax Solutions Architect whether Cassandra used file checksums and he said he didn't know.
Elasticsearch uses a two-phase commit to update cluster state. Cassandra gossips cluster state among nodes, which makes the eventual-consistency model very complicated and requires repairs. For writing data, turning on WRITE_CONSISTENCY=ALL differs. If not all shards can be written, the document-index operation is rolled back by Elasticsearch. Cassandra only warns the application so it can do the rollback itself.
Unlike Couchbase, Cassandra cannot determine the most recent updates and cannot update a parallel Elasticsearch cluster. Writing a Cassandra secondary-indexing plugin for Elasticsearch is very complicated, and it will result in duplicate documents that must be filtered after search. It is almost impossible to use Cassandra with Elasticsearch effectively.
In any system, there is a long, multi-minute latency to satisfy a query over 4.2 billion documents (roughly one day of data at 50K events/sec). A 7-day query would be 29 billion documents. The only systems capable of handling this use case are Hadoop/Spark and/or Elasticsearch.
Exact choices depend on more details of the use case.
* One architecture is HDFS or S3 as event storage with Spark programs that write results into Elasticsearch for visualization.
* Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run on periodic jobs (see http://www.leapfire.com/elasticsearchjoin.html and the scroll sketch after this list).
* For analysis over time periods of 24 hours or less, consider Spark Streaming for short-term, disposable summaries.
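As a rough sketch of the second option, a scroll job in the Python client might look like the following (the index pattern and query are placeholders; `clear_scroll` releases the server-side context when the job is done):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # assumed cluster

# Page through every matching document with a scroll context.
resp = es.search(
    index="packets-*",                          # hypothetical index pattern
    body={"query": {"term": {"dst_port": 443}}},
    scroll="2m",
    size=1000,
)
scroll_id = resp["_scroll_id"]
total = 0
while resp["hits"]["hits"]:
    total += len(resp["hits"]["hits"])          # replace with the job's real work
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]
es.clear_scroll(scroll_id=scroll_id)            # release server-side resources
print("processed", total, "documents")
```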
For one of my clients, I consulted for just 5 hours and they said I saved them weeks of work in trying to make a good decision. As part of that, I did a one-hour presentation on Time-Series Event Systems.
Consider Kafka, Logstash, Elasticsearch, Hadoop, HDFS, S3, Spark, and Spark Streaming.
...Geena
---
[quote="geena.rollins, post:16, topic:85733"]
When that overflows, ES will reject indexing requests and lose data
[/quote]
It's up to the client to retry; ES has told it that it cannot accept anything else, so it's not losing anything.
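To make that concrete, here is a hedged sketch of what "back off and retry on rejection" can look like with the Python client and the bulk API. It assumes every action is an index operation with a source line; `helpers.streaming_bulk` can do similar retries for you.

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # assumed cluster

def bulk_with_backoff(actions, max_retries=5):
    """Send a bulk request and re-send any items rejected with HTTP 429.

    `actions` is a flat list [action_meta, source, action_meta, source, ...]
    as expected by the _bulk API, all of them index operations.
    """
    pending = actions
    for attempt in range(max_retries + 1):
        resp = es.bulk(body=pending)
        if not resp.get("errors"):
            return
        retry = []
        for i, item in enumerate(resp["items"]):
            result = next(iter(item.values()))
            if result.get("status") == 429:      # rejected: indexing queue was full
                retry.extend(pending[2 * i:2 * i + 2])
        if not retry:
            return    # remaining errors are not backpressure; handle elsewhere
        pending = retry
        time.sleep(2 ** attempt)                 # back off before re-sending
    raise RuntimeError("bulk items still rejected after retries")
```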
---
Well, ES is not losing the data, the ETL pipeline is losing the data. Kafka with a Logstash consumer is a lossless ETL pipeline.
---
Another reason Elasticsearch is a better primary store is that snapshots are consistent at a point in time. This is a natural benefit of Lucene's immutable files. Cassandra backups are eventually consistent and difficult to restore. I saw co-workers doing a Cassandra restore; it was not a good day for them.
For complete disaster recovery, restore a Snapshot from time T, then have Logstash consume messages after time T from Kafka. This will result in reading some messages twice. If the application assigns document ids (or Logstash computes them from other fields), then indexing requests will just replace a duplicate in Elasticsearch.
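A minimal sketch of that replay step, assuming kafka-python and event fields that uniquely identify a packet (the topic, index, and field names are made up):

```python
import hashlib
import json
from kafka import KafkaConsumer               # kafka-python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed cluster
consumer = KafkaConsumer(
    "network-events",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,                  # commit only after a successful index
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def doc_id(event):
    # Deterministic id built from stable event fields, so replaying the same
    # Kafka message simply overwrites the same Elasticsearch document.
    key = "{}/{}/{}".format(event["timestamp"], event["src_ip"], event["seq"])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

for msg in consumer:
    event = msg.value
    es.index(index="packets-2017.05.01", id=doc_id(event), body=event)
    consumer.commit()                          # acknowledge only after indexing
```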
If a shard goes RED and cannot be recovered, this disaster-recovery method will work well.
As a distributed primary store for Time-Series Event data, nothing is more solid than Elasticsearch using these techniques.
...Geena
---
[quote="geena.rollins, post:19, topic:85733"]
For complete disaster recovery, restore a Snapshot from time T, then have Logstash consume messages after time T from Kafka. This will result in reading some messages twice. If the application assigns document ids (or Logstash computes them from other fields), then indexing requests will just replace a duplicate in Elasticsearch.
[/quote]
I think Kafka does not support random access or filtering messages by time. How would you handle this issue?
---
[quote="geena.rollins, post:16, topic:85733"]
Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run on periodic jobs. ( See
http://www.leapfire.com/elasticsearchjoin.html )
[/quote]
Just so we're all clear: this is your own product that you are promoting.
There's no issue with that, but it's always good to be transparent when suggesting commercial solutions.
---
[quote="geena.rollins, post:16, topic:85733"]
When that overflows, ES will reject indexing requests and lose data.
[/quote]
No, it's the client's responsibility to back off and resend these requests. They would never be acknowledged by Elasticsearch.
[quote="geena.rollins, post:16, topic:85733"]
For writing data, turning on WRITE_CONSISTENCY=ALL differs. If not all shards can be written, the document-index operation is rolled back by Elasticsearch.
[/quote]
Write consistency `all` means that if not all replica shards are available at the start of an indexing request (up to a certain time window), Elasticsearch does not attempt to index the data at all and fails the request. If the write consistency check is met, an indexing operation is performed on the primary and sent to the replicas; if a replica fails to acknowledge the indexing request, we fail that replica and indicate in the response to the client that the write was not successful on all replicas. There is no rollback here.
---
[quote="jasontedor, post:22, topic:85733"]
Write consistency `all` means that if not all replica shards are available at the start of an indexing request (up to a certain time window), Elasticsearch does not attempt to index the data at all and fails the request.
[/quote]
Also, it's important to note that write consistency has been replaced by `wait_for_active_shards`, which we think clarifies that it is a pre-flight check.
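For illustration, in the Python client the pre-flight check is just a request parameter (the index name and document below are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed cluster

# Pre-flight check: only attempt the write if at least 2 shard copies
# (primary + 1 replica) are currently active; otherwise the request
# times out and fails without indexing anything.
es.index(
    index="packets-2017.05.01",                # hypothetical index name
    id="some-event-id",
    body={"src_ip": "10.0.0.1", "dst_port": 443},
    wait_for_active_shards=2,
)
```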
---
This is not a commercial product; nothing for sale. It is a technique anyone can implement themselves. This solution does not require any plugins and uses the existing Elasticsearch Query DSL.
The relevance to this thread is that the web page shows great performance results for Elasticsearch Scroll programs on big datasets and large result sets.
---
These internal details do not matter to the customer. **The point is that with Cassandra the client has to fix the problem itself.** I said the operation is rolled back; I never said the data was changed and then rolled back. Perhaps a better phrasing is "the operation is rejected with no changes to any shards".
It is not very nice for someone to nitpick details of phrasing when it doesn't affect the validity of the point being made.
---
You stated "ES will reject indexing requests and lose data". That is not an "internal detail"; it is a misleading statement that could be interpreted as saying that Elasticsearch is doing something wrong, or is implemented poorly, when in point of fact it is part of any resilient system to apply backpressure when it is overloaded. It is on the client to respond appropriately to these situations.
Pointing this out is not nitpicking; it is a very important distinction. It's the difference between Elasticsearch doing something wrong ("ES...lose data") and the client doing something wrong (it needs to respond correctly to backpressure).
Similarly, my explanation of how write consistency works is not nitpicking, and it is absolutely not merely an internal detail. The parameter `write_consistency` impacts client-facing behavior (so clearly not internal), and it's poorly understood (which again is why we changed the name to `wait_for_active_shards`). If you say that Elasticsearch rolls an operation back it is evocative of behavior that people think they understand in databases (e.g., a transaction rollback) that simply does not apply here. It's not nitpicking to point this out.
It's natural in a technical discussion, nay any discussion, for important points to be expanded upon and corrected. It's how we engage, progress, and learn from each other. Your expectations are wrong if you think it should be otherwise.
---
Fair enough; I don't want to imply that Elasticsearch is doing something wrong. David's short answer could imply the same.
[quote="hossein_ey, post:11, topic:85733"]
the most important issue is lost data in Insert?
[quote="dadoonet, post:11, topic:85733"]
Yes.
[/quote]
[/quote]
The parameter write_consistency impacts client-facing behavior, but could have a variety of implementations and the internals don't matter. I have been explaining how to design an ETL pipeline that does not lose data in the face of backpressure.
Unlike Filebeat, Logstash does not handle backpressure. When all Logstash instances are busy, the network router will still send events, but those will be lost. So, one could say that the Elastic Stack is doing something wrong and loses data.
I'm just saying that one can implement an ETL pipeline that handles indexing backpressure by using Kafka to buffer the incoming events for Logstash. This approach is endorsed by Elastic in the above-mentioned two-part blog.
My conclusion is that, following these design suggestions, Elasticsearch is a good option for the gold-copy primary database for Time-Series Event systems.
---
[quote="hossein_ey, post:1, topic:85733, full:true"]
We have a project that inserts 50,000 rows per second into a database.
After the inserts, we process the data and generate some reports and analysis...
[/quote]
We've all been assuming that your project is inserting 50K rows/sec 24x7. But, this text sounds like the project is inserting 50K rows/sec for a time period, stops inserting, and then generates reports.
My large-company customers get 15K network-events per second 24x7 for the entire corporation, over 1 billion documents per day.
Is your project (1) 50K rows/sec 24x7 or (2) 50K rows/sec for a time period?
---
[quote="geena.rollins, post:29, topic:85733"]
Is your project (1) 50K rows/sec 24x7 or (2) 50K rows/sec for a time period?
[/quote]
you are right, it is 50K rows/sec 24x7
---
Wow, at 500 bytes/event that is 2TB/day. How many days do you want for a Retention Period? That will be the primary driver of cost.
---
[quote="geena.rollins, post:28, topic:85733"]
Logstash does not handle backpressure. When all Logstash instances are busy, the network router will still send events, but those will be lost
[/quote]
This is not an accurate statement. Logstash has had a backpressure mechanism specifically to avoid data loss for many years.
For "the network router will still send events, but those will be lost" I assume you mean UDP syslog data? Your statement is not specific enough for me to really say with precision what you should expect.
The chosen communication protocol will dictate when and how data loss occurs. In a network partition (network outage, software failure, machine fault, software slowness, etc), UDP and TCP are both lossy under various scenarios. You can try to minimize these scenarios with buffering (Logstash's persistent queue feature, etc), but you cannot eliminate it without improving the communication protocol.
No application can prevent the OS or network from dropping UDP packets or prevent the OS from lying to the client about having successfully transmitted TCP payloads. It's good to design for your requirements, if you are able to do so; I know from experience that network appliances often don't give you a choice when it comes to accessing logs.
---
[quote="geena.rollins, post:31, topic:85733, full:true"]
Wow, at 500 bytes/event that is 2TB/day. How many days do you want for a Retention Period? That will be the primary driver of cost
[/quote]
Yes, this is an issue...
For now, we use HP Vertica for storing the data, and it has compression and encoding mechanisms that reduce disk usage...
But this issue is still important, so we keep the original data for 20 days and create summaries for the older data...
For Elasticsearch, we would need to estimate how many days we can keep the original (network packet) data, and for now I have no idea about this...
---
[quote="jordansissel, post:32, topic:85733"]
This is not an accurate statement. Logstash has had a backpressure mechanism specifically to avoid data loss for many years.
[/quote]
Sorry, I was inaccurate. Yes, Logstash handles backpressure from Elasticsearch by suspending indexing requests.
[quote="jordansissel, post:32, topic:85733"]
You can try to minimize these scenarios with buffering.
[/quote]
And that is why I recommended Kafka. Logstash persistent queues have been GA for just a few weeks, so they haven't been an option for product teams.
These scenarios have nothing to do with log files. My customer makes the router and programs its embedded Linux VM. As the router does its job, it also sends HTTP POSTs with event info such as "connection established".
We could have used the Logstash HTTP Input Plugin, but we wanted buffering. Until otherwise proven, Kafka is still the leading fire-hose drinker.
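A rough sketch of that kind of buffering front-end in Python: a tiny HTTP endpoint that just drops each event into Kafka, with Logstash or any other consumer draining the topic at its own pace. The broker address, topic, and port are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from kafka import KafkaProducer                # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",        # assumed broker
    acks="all",                                # wait for the in-sync replicas
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

class EventReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # Buffer the event in Kafka; downstream consumers index it into
        # Elasticsearch at whatever rate the cluster can sustain.
        producer.send("network-events", value=event)   # hypothetical topic
        self.send_response(202)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), EventReceiver).serve_forever()
```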
My customers have built many _non-log_ Event systems using Elasticsearch:
* Enterprise desktops, servers, phones and other devices send Security Events.
* QA Automation uses hundreds of VMs in Build, Test, Destroy. The testing VMs send QA Events.
* Network appliances send Network Events.
* Advertising widgets send Ad Events (impression, click, action, ...)
---
[quote="hossein_ey, post:33, topic:85733"]
For Elasticsearch, we would need to estimate how many days we can keep the original (network packet) data, and for now I have no idea about this...
[/quote]
Yes, you are right. Very roughly speaking, a single Elasticsearch node on an i2.4xlarge can handle 2TB of documents. With replication=1, you need two nodes per day, so 40 data nodes (for your 20-day retention) plus 3 dedicated master nodes would be a good data lake for this project. For Snapshots, you need 40TB of mountable storage just for one incrementally-maintained snapshot.
Elastic recommends a maximum of 50GB per shard, so you need 40 primary shards per day. At this scale, query latency will be so long that only periodic jobs are feasible.
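For reference, the arithmetic behind those round numbers (using 2TB/day, 50GB per shard, and the 20-day retention mentioned earlier):

```python
# Back-of-the-envelope sizing from the round figures in this thread.
tb_per_day     = 2                        # ~50K events/sec x ~500 bytes/event
shard_size_gb  = 50                       # recommended max shard size
retention_days = 20                       # from the Vertica setup described earlier

shards_per_day = tb_per_day * 1000 / shard_size_gb   # 40 primary shards/day
data_nodes     = 2 * retention_days                  # 2 nodes/day with replication=1
snapshot_tb    = tb_per_day * retention_days         # 40 TB of snapshot storage

print(shards_per_day, data_nodes, snapshot_tb)       # 40.0 40 40
```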
Also see:
https://www.elastic.co/elasticon/conf/2017/sf/small-medium-or-large-evolve-your-elastic-stack-to-fit
https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
---
Thanks, this is very helpful.
---
Is the 50K msg/sec to store raw packets, or just the metadata for the TCP/HTTP connections, like 5-tuples plus other attributes? We built a system (Kafka + ES) that can easily handle 60K (metadata) msgs/sec.
---
I'm not sure what you mean by metadata, but we need to save packet information like the protocols of each layer,
the source and destination IPs and ports, and so on.
How many nodes do you have in your Elasticsearch cluster?
And how many shards does it have?
---