I would be suspicious of most tech talks on this. If someone is giving a tech talk on their analytics system, they are either working at enormous scale (Facebook, Google), selling something (Splunk), or over-engineering their system (many startups).
I second the advice elsewhere in this thread: log it into PostgreSQL. If you start overloading that, look into sampling your data before you look into a fancier system. Make sure each row has identifiers for every entity it is part of: user, session, web request. If you're not building replicated PostgreSQL (which you probably won't need for this), log to files first and build another little process that tails the files and loads the rows into PostgreSQL. That log-then-load advice is hard-learned experience from working at Splunk.
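For example, a minimal events table with those identifiers might look like this (names are illustrative, not a prescription):

    -- One row per event, with an identifier for each entity the event
    -- belongs to: user, session, web request.
    CREATE TABLE events (
        event_id    bigserial PRIMARY KEY,
        user_id     bigint,
        session_id  uuid,
        request_id  uuid,
        event_type  text NOT NULL,
        occurred_at timestamptz NOT NULL DEFAULT now(),
        properties  jsonb
    );

    -- Indexes on the identifiers keep per-user / per-session questions cheap.
    CREATE INDEX ON events (user_id, occurred_at);
    CREATE INDEX ON events (session_id);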
That's actually what we do at Rakam. PostgreSQL fits many analytics workloads with partitioned tables, parallel queries, and BRIN indexes. The only limitation is that it's not horizontally scalable, so your data must fit on one server. `it just works` up to ~10M events per month.
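A minimal sketch of that setup, with monthly range partitions and a BRIN index on the timestamp (names are illustrative):

    -- Parent table partitioned by month; BRIN indexes stay tiny even on
    -- very large, append-only event data.
    CREATE TABLE events (
        occurred_at timestamptz NOT NULL,
        event_type  text NOT NULL,
        properties  jsonb
    ) PARTITION BY RANGE (occurred_at);

    CREATE TABLE events_2019_01 PARTITION OF events
        FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

    CREATE INDEX ON events_2019_01 USING brin (occurred_at);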
The SDKs provide ways to send event data as a JSON blob: an event type plus a set of attributes.
The event types and attributes are all dynamic. The API server automatically infers the attribute types from the JSON blob, creates a table for each event type with a column for each attribute, and inserts the data into that table. It also enriches the events with visitor information such as user agent, location, and referrer. Users then just run SQL queries like:
SELECT url, count(*) FROM pageview WHERE _city = 'New York' GROUP BY 1
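That query runs against an auto-created pageview table that looks roughly like this (simplified, with illustrative names for the enrichment columns):

    -- One column per event attribute, plus enriched visitor fields.
    CREATE TABLE pageview (
        url          text,          -- attribute inferred from the JSON blob
        _time        timestamptz,   -- event time
        _user_agent  text,          -- enrichment: user agent
        _city        text,          -- enrichment: location
        _referrer    text           -- enrichment: referrer
    );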
To scale a PG db horizontally, you may want to look at https://www.citusdata.com/ (they were recently acquired by Microsoft, but I don't expect any change to the open-source part).
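The basics look something like this (user_id as the distribution key is just an example):

    -- With the Citus extension installed, shard an existing table
    -- across worker nodes by a distribution column.
    CREATE EXTENSION citus;
    SELECT create_distributed_table('events', 'user_id');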
This also makes a lot of things really easy. Want to join against your products table to see what product categories are most popular for certain customer segments? It’s a single query or Tableau drag-and-drop away. You don’t know what you’ll need to access fast to answer business questions, so use a system designed for flexibility until you can’t.
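Something like this, with made-up table and column names:

    -- Most popular product categories for one customer segment.
    SELECT p.category, count(*) AS views
    FROM events e
    JOIN products  p ON p.id = e.product_id
    JOIN customers c ON c.id = e.user_id
    WHERE c.segment = 'enterprise'
      AND e.event_type = 'product_view'
    GROUP BY p.category
    ORDER BY views DESC;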
Dumping out to a text file quickly and having some asynchronous queue insert it into your database is one solution, but you have to watch unique identifiers, reconciliation, and handling of failed inserts so you can fix and re-inject any failed records.
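One way to keep re-injection safe is to give every log record a unique identifier and make the load idempotent, e.g. (sketch):

    -- A unique event id makes replaying a failed or partial batch harmless.
    CREATE TABLE raw_events (
        event_uuid uuid PRIMARY KEY,   -- generated by the app when the line is logged
        payload    jsonb NOT NULL,
        logged_at  timestamptz NOT NULL
    );

    INSERT INTO raw_events (event_uuid, payload, logged_at)
    VALUES ('9b2f6c1a-4d3e-4f5a-8b7c-1d2e3f4a5b6c', '{"type": "pageview"}', now())
    ON CONFLICT (event_uuid) DO NOTHING;   -- duplicates from a replay are ignored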