I worked for a startup implementing analytics tools. In my opinion our setup was over-engineered, but I wasn't there at the beginning, so I might be wrong. Requirements also changed a couple of times, which could explain why something that looked necessary for scaling and speed ended up as an over-engineered mess. This is how it worked: after the JavaScript tracker fired, we got log files, passed them through Kafka, then parsed them and performed calculations in Storm (Java). For storage, we used Cassandra. The system also had other parts, but I don't remember why they were there, tbh.
My thought process for solving your problem would be the following. First, you need to understand that what's good for you and what's good for your company might not be the same. You want the challenge, you want to implement something that could scale, and you want to use exotic tools to achieve this. It's interesting and looks good on your CV. Your company might just want the results. You need to decide which is more important.
If we prioritize your company's needs over keeping you entertained, I'd follow this thought process:
First question: can't you just use Google Analytics? You can also connect it to BigQuery and do lots of customizations. Maybe your time would be better spent learning GA. It's powerful, but most of us cannot use it well.
Second question: if for some reason you don't want to use Google Analytics, can you use another, possibly open-source and/or self-hosted, analytics solution? Just because you can design it from scratch doesn't mean you should.
Third: Alright, you want to implement something from scratch. For this scale, you can probably just log and store events in an SQL database, write the queries, and display it in a dashboard.
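To make the "just log events in an SQL database" option concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the names `open_store`, `log_event`, and `counts_by_source` are my own assumptions for illustration, not something from your setup; in practice you would probably point the same idea at PostgreSQL or whatever database you already run.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the event store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id     INTEGER PRIMARY KEY,
               ts     REAL NOT NULL,            -- Unix timestamp of the event
               source TEXT NOT NULL,            -- e.g. 'web' or 'api'
               name   TEXT NOT NULL,            -- e.g. 'page_view'
               props  TEXT NOT NULL DEFAULT '{}' -- free-form JSON payload
           )"""
    )
    # Index the columns the dashboard queries filter on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_events_name_ts ON events (name, ts)"
    )
    return conn


def log_event(conn, source, name, props=None, ts=None):
    """Insert one event; web and API clients would both end up calling this."""
    conn.execute(
        "INSERT INTO events (ts, source, name, props) VALUES (?, ?, ?, ?)",
        (time.time() if ts is None else ts, source, name, json.dumps(props or {})),
    )
    conn.commit()


def counts_by_source(conn, name):
    """One dashboard-style query: how many events of this name per source."""
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM events WHERE name = ? "
        "GROUP BY source ORDER BY source",
        (name,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = open_store()
    log_event(conn, "web", "page_view", {"path": "/pricing"})
    log_event(conn, "api", "page_view", {"path": "/v1/items"})
    log_event(conn, "web", "page_view", {"path": "/"})
    print(counts_by_source(conn, "page_view"))
```

The dashboard is then just a handful of queries like `counts_by_source` rendered on a page. Whether a single table holds up depends on your event volume, which I don't know.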
Then, if you really want to go further, there are many "big data" tools that are designed to scale well and perform analytics. By looking for talks about these tools, you will get a better understanding of how things work. There are various open-source projects you should read more about: Cassandra, Scylla, Spark, Storm, Flink, Hadoop, Kafka, Parquet, just to name a few.
The analytics are not limited to web clients. There will be API clients too. The deployment will be in a private enterprise VPN, so talking to external services may not be an option.
I am aware of tools like Cassandra, Flink, Spark, Kafka, etc. But I am more curious about the best tools and architectural patterns that work well with each other.
>>I'm more curious about the best tools and architectural patterns that work well with each other.
Well you can go with:
Fancy: HDFS (distributed file system) as storage - Oozie as workflow scheduler for your load (Python/Hive/Scala/Spark) - Tableau for visualization (your business users will love it).
Mid range: SQL Server as storage - Informatica for your workload - Power BI / SSRS for visuals
Open/low budget: PostgreSQL / Cassandra for storage - make your own scheduler (there was a post about open-source ETL yesterday that might help) - for visuals you can build them from scratch, but hire a good designer!
This is based on my experience in industries like gambling, banking and telecom.
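For the "make your own scheduler" route in the low-budget option, the core of it can be as small as running ETL steps in dependency order. Below is a rough sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The function name `run_pipeline` and the step names are hypothetical, and a real scheduler would also need timing, retries, and logging, none of which is shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, deps):
    """Run each step after the steps it depends on.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> iterable of step names that must run first
    Returns the list of step names in the order they were executed.
    """
    unknown = {d for ds in deps.values() for d in ds} - set(steps)
    if unknown:
        raise ValueError(f"dependencies without a step: {sorted(unknown)}")

    graph = {name: set(deps.get(name, ())) for name in steps}
    executed = []
    # static_order() yields predecessors first and raises
    # graphlib.CycleError if the dependencies are circular.
    for name in TopologicalSorter(graph).static_order():
        steps[name]()
        executed.append(name)
    return executed


if __name__ == "__main__":
    steps = {
        "load": lambda: print("loading into the warehouse"),
        "extract": lambda: print("extracting raw events"),
        "transform": lambda: print("aggregating events"),
    }
    deps = {"transform": ["extract"], "load": ["transform"]}
    print(run_pipeline(steps, deps))
```

You could trigger something like this from cron and grow it only when you hit a concrete limitation.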
I worked on a product with 15m/year revenue for 3 years. Our analytics system was screwed by cookie messages and now GDPR. Marketers wanted to serve analytics through Google Tag Manager, which made it easy for customers to block our analytics launcher, meaning zero data for most of the visits.