First Principles – a data science use case

By Paul Koot, Data Scientist at Notilyze - As a data science consultant, I have often experienced that a project starts with many stakeholders, an abundance of cool ideas and a lot of enthusiasm to start using data. 

However, this enthusiasm can quickly vanish once the first, second and maybe third Proof of Value (PoV) all have one factor in common: a customer (or his/her manager) concluding with one of the following remarks: ‘but what can we do with this insight?’, ‘but now my problem is still not solved, because…’ or, even worse, ‘we cannot do anything with your analysis’. 

Although the idea is cool, the solution does not offer any new value to the company. In fact, Gartner (2019) predicted that 80% of analytics insights would not deliver business outcomes through 2022. 

The good news is that a data scientist can play a big role in reducing the risk of such an outcome. In this article, I would like to focus on a great tool for avoiding a disappointing PoV: First Principles thinking. I will do this by elaborating on two use cases that I have been working on: a life-science manufacturer that wants to improve its sustainability, and an organization that wants a better internal search system to find its documents. Before I do so, I will quickly go over the basics of First Principles. 

First Principles thinking
The basic idea of First Principles is to distinguish the most fundamental building blocks of a problem by answering the question: what are the basic elements of the complex problem that you encounter? By identifying and addressing these basic elements, it becomes easier to find creative ideas that solve the complex problem as a whole. As systems become more complex and interrelated, recognizing the building blocks of a problem does not get easier; however, that same trend underlines the importance of First Principles thinking. One strategy for detecting these basic elements is asking ‘why’ five times. For a more elaborate discussion of this First Principles framework, see

Sensor data in Manufacturing 
I will now describe the first use case: a life-sciences manufacturer that wanted to do ‘something with its data to develop a more sustainable business process’. Its data consisted of multiple sensors in a distillation process, manual quality samples and a number of gas meters. Two paths were initially discussed: 

  • Create a simulation of the distillation process to assess any hiccups in the production process; where is energy lost? 
  • Build a Machine Learning (ML) model to explain/forecast the gas usage of the process. 

Both paths sounded very reasonable. However, some important sensors were missing, which made the first path unviable, and data for the remaining sensors was too scarce to build a complicated ML model with any significant result. So we asked ourselves: ‘why are we doing this exercise?’ Basically, the answer was to save energy. Then we asked: ‘why do we actually want to save energy?’ Well, to become more sustainable and to save costs, as gas prices have risen by a factor of 10 over the last year. ‘Okay, but why do we think saving energy is possible in this process?’ This turned out to be an interesting question, as we discovered that no previous research had been done on the energy usage; this would be the first time energy usage was connected to the business process. 

So we started with the very basics. We looked at the products created on this line (7 products in total) and excluded all sensor data that was not related to Product B, which was produced almost 70% of the time. We converted the energy usage to an energy usage per hour and joined this information to the manual quality measures, which include the yield of the process. We then divided the yield by the energy usage and saw something that stood out: at some point in time there was a period with a decrease in energy usage of 18%, while yield decreased by only 5%. This means that almost 20% of the energy costs could be attributed to only 5% of the yield. This might have been profitable when gas prices were 10x lower, but it definitely was a first insight that management started to discuss. 

Note also that we started with two paths that were not viable because we missed some important sensor data, and we ended up with a very valuable, actionable insight by not using any sensor data at all. For the remaining data we also looked at only one of the seven products, reducing the amount of data used by another 30%. 
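The data preparation above can be sketched in a few lines of pandas. The column names and numbers below are illustrative assumptions, not the manufacturer's actual schema or measurements; the point is the sequence of steps: filter to the dominant product, join the manual quality samples, and compute yield per unit of energy.

```python
import pandas as pd

# Hypothetical hourly sensor readings and manual quality samples.
sensor = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00",
        "2022-01-01 02:00", "2022-01-01 03:00",
    ]),
    "product": ["B", "B", "A", "B"],
    "gas_usage_m3": [120.0, 95.0, 110.0, 98.0],
})
quality = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00", "2022-01-01 03:00",
    ]),
    "yield_kg": [50.0, 48.0, 49.0],
})

# 1. Keep only Product B, which ran almost 70% of the time.
product_b = sensor[sensor["product"] == "B"]

# 2. Join the manual quality measures on the hourly timestamps.
merged = product_b.merge(quality, on="timestamp", how="inner")

# 3. Yield per unit of energy: the ratio that exposed the
#    18% energy drop against only a 5% yield drop.
merged["yield_per_m3"] = merged["yield_kg"] / merged["gas_usage_m3"]
print(merged[["timestamp", "yield_per_m3"]])
```

Note that no ML model appears anywhere in this sketch; a filter, a join and a division were enough to produce the insight.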

Text analytics to create an internal search engine
A second customer wanted to improve their internal search engine, which was used to check historical communication with customers. The organization had a large corpus of documents, and answers to customer questions needed to be in agreement with earlier replies. In the current interface, documents were returned only if a 100% match could be found between the search string and the text in the document; no NLP was used, so typos and verb conjugations, for example, were not detected and dealt with. Another part of the search engine was a classification of the documents into different categories, which could be used as an initial filter when searching. Part of the PoV was therefore to see how this classification (labelling) could be done using ML. However, during the process we got stuck at an accuracy of around 60%. 

To get a better idea of possible ways to improve this performance, I asked the customer when and why they had started with this labelling. After some follow-up questions it became clear that the labelling was primarily done to overcome the problem that searches were done in two languages. As perfect matching did not work in that case, labels had been invented (1a, 2a, 2b) and translated to both languages, giving some way to filter the database down to a subset based on labels that were language-agnostic. So it became clear that by introducing NLP and some translation behind the search string, we could get rid of the labelling altogether. The labelling was purely a means to an end, and should never have been the goal of our PoV. 
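To make the difference concrete, here is a minimal sketch contrasting the exact substring matching described above with a typo-tolerant variant. It uses Python's standard-library difflib rather than any particular NLP stack, and the query, documents and threshold are illustrative assumptions, not the customer's actual system:

```python
from difflib import SequenceMatcher

def exact_search(query, documents):
    """The old behaviour: return a document only on a 100% substring match."""
    return [d for d in documents if query in d]

def fuzzy_search(query, documents, threshold=0.8):
    """Sketch of a typo-tolerant search: compare the query against every
    word in each document and keep documents with a close-enough word."""
    hits = []
    for doc in documents:
        best = max(
            (SequenceMatcher(None, query.lower(), word.lower()).ratio()
             for word in doc.split()),
            default=0.0,
        )
        if best >= threshold:
            hits.append(doc)
    return hits

docs = ["Reply about invoice corrections", "General complaint handling"]

# A typo ("invocie") defeats exact matching but not the fuzzy variant.
print(exact_search("invocie", docs))   # no results
print(fuzzy_search("invocie", docs))
```

A production system would of course go further, with stemming or lemmatization for verb conjugations and translation of the search string, but even this small step removes the failure mode that the labels were invented to work around.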

I think both of these cases expose an important pitfall for data scientists. As problem solvers, we often start too soon with answering the ‘how’: how can we solve this problem? What ML model should we use? What technical solution is best? Which database should we use? How do we connect to the data sources? How will the interface look? Instead, we should take more time answering the ‘why’: why does the customer want to answer this question? Why would it help his/her business? Why can’t this be solved without a complicated ML model? Why do we need data to solve this issue? Asking the ‘why’ questions helps us get closer to the basic elements of a business problem, and therefore helps us avoid creating ‘wrong solutions’, i.e. solutions that do not address the core problem and therefore do not deliver any value. 

// Contact

Notilyze B.V.
Stationsplein 45 A4.004
3013 AK Rotterdam
+31 10 798 62 95
