Time goes by very fast when exploring new territory…

Six months since my last post seems like a long time, but in terms of Big Data, 6 months of data is hardly enough to even consider unleashing the power of some cognitive neural network or any other machine learning algorithm. Not a single CPU or GPU would consider it a decent workout.

OK, I must admit, 6 months of real-time streaming data can become significant, but then we have a new top-level Apache project called Flink to deal with it, probably without even breaking a sweat.

The last 6 months have been quite interesting, and this is exactly one of the two reasons* why it took me so long between blog posts… A post-academic Big Data course at the University of Ghent (still ongoing and yes, I will have some official tests to complete as well), meetups with interesting startups (backdrop being one of them), training sessions, self-study (night versions), MOOCs, market research, product analysis… basically: what is alive, active and worth spending my time on in the space of Big Data?

To be honest… it's a scary jungle out there in the field of Big Data:

  • Everybody is screaming for attention and therefore creating too much material to read.
  • Tool after tool is released, all solving a small piece of the problem, but all claiming to be the holy grail (in some way). What was trending 2 months ago is legacy and dead by now…

Enough writing for now, I need to get back to my notebook in my overcry (beta) instance running on Google Cloud. I have my new friends Python, SQL and BigQuery waiting for me…

* I do have a full-time job as well… taking care of the Galileo and EGNOS ground network…

Roald Amundsen

Roald Engelbregt Gravning Amundsen (16 July 1872 – c. 18 June 1928) was a Norwegian explorer of polar regions. He led the Antarctic expedition which was the first to reach the South Pole, on 14 December 1911. In 1926, he was the first expedition leader for the air expedition to the North Pole.

Amundsen is recognized as the first person, without dispute, as having reached both poles. He is also known as having the first expedition to traverse the Northwest Passage (1903–06) in the Arctic.


To say I compare myself with a pure-bred polar explorer is a bridge too far, but I do believe we share something:

the excitement of the exploring process and the joy of arriving at your destination, unknown or not

The Weka is a flightless bird with an inquisitive nature, but it is also a collection of machine learning algorithms for data mining tasks. To say it differently: a tool to explore unexplored territory, like a dog sled for polar expeditions.

The success is small and the journey was long. Building my first classification tree from a data set, and being able to interpret the result, made me feel like… a polar explorer arriving.

PS: in case you wonder, at the age of 11 I had a school assignment to give a short talk about an explorer of my choice… Guess who I picked.

Weka Classifier


Neurons vs. Silicon

The nights are short but time flies by when you are having fun (and fun is independent from …).

The failed HiveQL jobs on my Hadoop cluster got solved the moment I understood that I first needed to create an external DB in my S3 bucket. This lets you create a table from raw data in S3 via Hive (the directory tree also seems to be important) and makes sure your data remains, even when you terminate your cluster.
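As a rough sketch of what such a setup can look like, the Python snippet below only renders the HiveQL statements as text; it does not talk to a cluster. The database name, table name, columns and bucket path are all hypothetical placeholders, and the comma-delimited text format is an assumption, not the layout I actually used. Note that `LOCATION` points at a directory rather than a single file, which is probably why the directory tree matters.

```python
def external_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Render HiveQL for a database plus an external table over files in S3.

    All names passed in are illustrative; nothing here is executed.
    """
    # One "name TYPE" entry per column, one per line for readability.
    column_list = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE DATABASE IF NOT EXISTS {database};\n"
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical example: raw CSV files under s3://example-bucket/raw/events/
ddl = external_table_ddl(
    database="demo_db",
    table="raw_events",
    columns=[("event_id", "STRING"), ("amount", "DOUBLE")],
    s3_location="s3://example-bucket/raw/events/",
)
print(ddl)
```

Because the table is declared `EXTERNAL`, dropping it (or terminating the cluster) removes only the table definition; the files in S3 stay where they are.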

Moving on, putting the (more or less complex) theory of predictive modelling into practice turned out to be a walk in the park. Create your schema from a data source, decide which target attribute your model needs to predict, run the evaluation to test your model's performance and execute a batch prediction when you are happy with that performance.

On your most recent evaluation, ev-stSZRRpvayR, the ML model’s quality score is better than the baseline.
Average F1 score: 0.94
Baseline F1 score: 0.44
Difference: 0.51

Conclusion: a stupid machine learns much faster than a smart human (no self-reflection of course)
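For reference, an F1 score is the harmonic mean of precision and recall. The sketch below assumes the "average" in the evaluation above is an unweighted (macro) average over the classes; the counts are made up for illustration and are not the numbers behind my own evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a single class."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_counts):
    """Unweighted average of the per-class F1 scores."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


# Made-up (tp, fp, fn) counts for three classes.
counts = [(90, 10, 10), (45, 5, 5), (8, 2, 2)]
average = macro_f1(counts)
print(f"Average F1 score: {average:.2f}")
```

The baseline in such a comparison is typically a trivial model, for example one that always predicts the most frequent class, scored with the same function.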


Learning Curve

The last couple of days I’ve been spending my evenings and nights looking at predictive modelling with tree induction and logistic regression.

Having an understanding of entropy and information gain, as well as Support Vector Machines and regression via mathematical functions, is one thing, but putting the theory into practice is something else…
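To make the first two concepts concrete, here is a minimal sketch of entropy and information gain as used in tree induction. The function names and the toy labels are my own illustrative choices, not taken from any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy example: 5 "yes" and 5 "no" labels, split on some attribute.
parent = ["yes"] * 5 + ["no"] * 5
noisy_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
perfect_split = [["yes"] * 5, ["no"] * 5]

print(information_gain(parent, noisy_split))
print(information_gain(parent, perfect_split))
```

A tree learner evaluates candidate splits this way and picks the attribute with the highest gain; the perfect split recovers the full 1 bit of entropy in the parent, the noisy one only part of it.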

Tonight I am having my first fight with HiveQL in an attempt to build myself a dataset to play with… No complaints of course, because I knew from the start that the learning curve would be steep.


Hive Fight



Data Strategy

Once we understand that data is an actual business asset, we should think about whether and how much we are willing to invest. A fundamental strategy of data science is to acquire the necessary data at a cost because:

The best data science team can yield little value without appropriate data; the right data often cannot substantially improve decisions without suitable data science talent.

The more data-driven a firm is, the more productive it is, even controlling for a wide range of possible confounding factors. One standard deviation higher on the DDD (Data-Driven Decision) scale is associated with a 4%-6% increase in productivity.

Today I invested as well… The startup of datafractive.com is now a fact, with an officially registered company number.


Big Data (could) tell us a lot because it's big, but nothing can be big without consisting of small fractions.

It is that combination of small data fractions and how they interlink with each other that will bring us new insights. It will teach us to understand the bigger picture, but only when we know what the details are telling us.

datafractions.com and datafractive.com are born


What if you step out of your little comfort zone and decide to start something new, unknown and without a clear end goal? Well, maybe I am about to find out…

The first small step towards Big Data has been taken, and it all started many years ago with a plate of spaghetti, an old college friend and a gap in the market. This dinner concept was last held on November 10, 2015 and resulted in where we are today.

On this blog I will try to keep a frequently updated logbook of our complete journey in the world of Hadoop and data analytics, and maybe one day I will run a MapReduce() against it.

In case you read this and you have a bright idea on what to do with Big Data: please let me know, because so far I have absolutely no clue.