Big data and anomaly detection
This week I spent a really productive evening in Milpitas (near the In-N-Out Burger) learning more about big data analytics and anomaly detection. My interest in the topic comes from a network traffic analysis and intrusion detection perspective, and the talk was a great opportunity to pick up more data science along the way.
An unexpected part of the talk was that one of the founders of Streamlio was in the audience right in front of me, and he commented in depth on how their company’s offering differs from StreamAnalytix, primarily in that it is focused on processing streaming data as it arrives in real time. The StreamAnalytix demo, by contrast, was centered more on static data sets, which suggests that the two products aren’t directly competing with one another for now.
There were about 30-35 attendees at the talk, and the speakers were from the engineering team at Impetus Technologies, a company building software focused on big data analytics. Maxim Shkarayev, the Lead Data Scientist from Impetus, talked first about anomaly detection in general and approaches tailored to different disciplines. The use cases are quite varied, ranging from detecting credit card fraud and insider trading to detecting mechanical faults in machinery as the infrastructure ages.
There are three main approaches to anomaly detection:
- Point-based anomaly detection: This flags a single data point that stands out from the rest, such as an unusually large financial transaction
- Contextual anomaly detection: This flags a value that is only unusual in its context, such as a large power spike at night that would look normal during the day
- Collective anomaly detection: This looks for a group of related data instances that is anomalous with respect to the entire data set, even if each instance looks normal on its own
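To make the difference between the first two approaches concrete, here is a minimal sketch in Python. It is my own illustration and not something shown in the talk: the function names, the z-score method, the threshold of 3.0, and the sample data are all assumptions.

```python
from statistics import mean, stdev


def point_anomalies(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold.

    Assumption: a plain z-score against the whole data set is good enough
    for illustration; real systems often use more robust statistics.
    """
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def contextual_anomalies(readings, threshold=3.0):
    """Return indices of (context, value) pairs unusual for their context.

    Each value is compared only against other values sharing its context
    (for example "day" or "night"), not against the whole data set.
    """
    by_context = {}
    for context, value in readings:
        by_context.setdefault(context, []).append(value)

    flagged = []
    for i, (context, value) in enumerate(readings):
        group = by_context[context]
        if len(group) < 3:
            continue  # too few samples to estimate what is normal
        mu, sigma = mean(group), stdev(group)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical data: small everyday purchases plus one very large one.
transactions = [20, 25, 22, 19, 24, 21, 23, 20, 22, 21] * 2 + [5000]

# Hypothetical power readings in kW: high by day, low by night,
# plus one daytime-sized reading that happens at night.
readings = (
    [("day", v) for v in [48, 52, 50, 49, 51, 50, 47, 53, 50, 51] * 2]
    + [("night", v) for v in [5, 6, 4, 5, 6, 5, 4, 6, 5, 5] * 2]
    + [("night", 50)]
)

print(point_anomalies(transactions))
print(contextual_anomalies(readings))
print(point_anomalies([v for _, v in readings]))
```

With this sample data, the large transaction should be the only point anomaly, and the 50 kW night reading should only be caught by the contextual check, since 50 kW is perfectly ordinary when you look at the data set as a whole.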
There are several ways of classifying anomalous data:
- Supervised classification: You know what the anomalous behavior is, so you are able to label the data and train the model
- Semi-supervised classification: You learn over time what normal data looks like, and you assume that the data set will provide that definition to you
- Unsupervised classification: No details about the data are given, and you hope to see a dense collection of points representing normal behavior, with everything outside of that normal behavior falling into the anomalous category
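The unsupervised case is the easiest one to sketch without any labels. The snippet below is my own rough illustration, not the method presented in the talk: it assumes that "dense" can be approximated by the distance to a point's k-th nearest neighbor, and the function names, `k=3`, and the `factor=3.0` cutoff are arbitrary choices.

```python
import math
from statistics import median


def knn_distances(points, k=3):
    """For each point, return the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def unsupervised_outliers(points, k=3, factor=3.0):
    """Flag points that sit far outside the dense region of the data.

    A point is flagged when its k-th nearest neighbor is more than
    `factor` times farther away than is typical (the median) for the set.
    """
    scores = knn_distances(points, k)
    typical = median(scores)
    return [i for i, s in enumerate(scores) if s > factor * typical]


# Hypothetical data: a dense 5x5 grid of "normal" points and one stray point.
normal = [(float(x), float(y)) for x in range(5) for y in range(5)]
points = normal + [(20.0, 20.0)]

print(unsupervised_outliers(points))
```

This brute-force version compares every pair of points, so it is only suitable for small experiments, but it shows the idea: nothing tells the code what normal looks like, it just assumes that normal is wherever the points are packed together.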
The second part of the talk was primarily focused on the StreamAnalytix software product. The product focuses on stream processing and machine learning for the enterprise. Some of the use cases they are pushing are IoT and log analytics, fraud and risk detection, and predictive maintenance. I was impressed by the UI and how simple it looked to put together a very complex data processing stream without knowing much about the low-level technologies. The presenter was Punit Shah, the architect for StreamAnalytix, and he walked us through a real-time demo of the software.
The demo was quite impressive and focused on quickly and easily building a data analytics stream using your choice of channel (Kafka, S3, etc.), processors, analytics, and emitters. To me it seems that one of the main selling points of StreamAnalytix is that if you just want to focus on data science or analytics, you can do that quickly without having to know Spark, ZooKeeper, and other technologies first.
From my own experience in enterprise security, one of the immediate uses I could see for StreamAnalytix beyond network traffic anomaly detection would be in analyzing operating system audit trail data from a collection of systems. A useful and fun experiment in this area would be to generate days or weeks of security audit trail data from various systems in your data center, and then combine them in StreamAnalytix with various processors and analytics. You could then visualize which combination works best for pointing out the unusual security or administrative events that your security team would want to be alerted about.
They have a trial version of StreamAnalytix (limited to one node) available, which I plan to try out.