Some notes about machine learning
We thought you might like to read some notes about machine learning. It is a common phrase in the developer world, but it’s a complex and often misunderstood subject. So, here are some of the basics.
Data scientists study and research for years to qualify, so they’re the experts in this field. But developers are likely to find themselves needing to get up to speed in some way.
We’ll cover some of the fundamentals to think about, because it is possible to build a machine learning project without years of education.
What is data?
Data means different things to different people. For developers, we often think of it as the stuff you store in your database. Or import from CSV/JSON files. And that does qualify as data in the general sense.
For machine learning projects, data has a more precise definition. The first thing to keep in mind is:
The input data has to be in a format your computer can understand
That might be stating the obvious but let me extend that a little. It is helpful to think of data in a table with rows and columns. Each row of data represents a data point. Then, there are properties that make up that data point.
For example, let’s say you were storing data about people. Their properties might be height, weight, age and so on. You will often hear each data point described as an entity, and each entity has a collection of variables. In this example, the variables are height, weight and age.
Those variables are sometimes called features, which is another term worth knowing.
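To make the table idea concrete, here is a minimal Python sketch. The field names (`height_cm`, `weight_kg`, `age`) and the values are made up for illustration; they are not from a real dataset.

```python
# Each dict is one row of the table: a single entity (a person).
# The values are invented purely for illustration.
people = [
    {"height_cm": 170, "weight_kg": 65, "age": 34},
    {"height_cm": 182, "weight_kg": 80, "age": 51},
    {"height_cm": 158, "weight_kg": 54, "age": 27},
]

# The column names are the variables, also called features.
feature_names = list(people[0].keys())

# Build a plain numeric table: one row per entity, one column per feature.
feature_matrix = [[person[name] for name in feature_names] for person in people]

print(feature_names)
for row in feature_matrix:
    print(row)
```

A list of rows like this is roughly the shape most machine learning tools expect as input, though the exact format depends on the library you choose.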
Why input data is critical
The input data collected for your machine learning model has to be complete, meaning it has to contain the variables that carry the answer. Take the people entity example above. Then, imagine you were only collecting surname as a variable. There’s no algorithm that would be able to predict gender from that. The data isn’t there, pure and simple.
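One simple way to catch this early is to check your rows against the variables you expect to need. The sketch below assumes a hypothetical helper, `missing_features`, and made-up field names. It only checks that a value is present, not that the variable is actually useful for prediction.

```python
def missing_features(rows, required):
    """Return the required feature names that are absent or empty in at least one row."""
    missing = set()
    for row in rows:
        for name in required:
            # A key that is absent, or present with a value of None, counts as missing.
            if row.get(name) is None:
                missing.add(name)
    return sorted(missing)


# Hypothetical data: only the surname was collected.
rows = [{"surname": "Smith"}, {"surname": "Jones"}]
required = ["surname", "height_cm", "age"]

print(missing_features(rows, required))  # ['age', 'height_cm']
```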
That leads to some guiding questions when thinking about a machine learning project.
Stuff to think about
Here are some good things to keep in mind when you’re planning a project.
- Is the data right?
Be clear on the question(s) you are going to answer, because that is the only way you can be sure the data is able to answer them.
- How do I phrase my question(s)?
It’s one thing to think of a question, another to think of it in a machine learning context. Are you asking a question in a way that the data can answer?
- Do I have enough data?
This is an easy point to fall foul of. Jumping in to build the app with a small sample of migration data is a big fail. You have to have enough data in the system to represent the problem you are trying to solve.
- Have I got the right variables?
Remember those variables (features) for each entity? Did you collect and extract the ones you need to enable the right predictions?
- What does success look like?
Simple this one. How will you know it’s working?
Take an example. Imagine you were getting data from movement sensors around the home of an elderly person. You want an algorithm that triggers an alert when unusual movement (or a lack of movement) occurs. When the sensor detects movement in the kitchen (for example), the algorithm learns from it.
When there is movement in the kitchen at 3am that breaks the normal pattern, it raises questions. Why was this entity (person) in their kitchen at that time?
In the case of an elderly person, does it mean they are waking up dehydrated? Or is it that they can’t sleep?
Either way, the algorithm detects something out of the ordinary and triggers an alert.
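As a rough illustration of the sensor example, the sketch below "learns" by counting how many days movement was seen in each room at each hour, then flags movement at a time that has rarely or never been seen before. Everything in it is an assumption made for illustration: the `(day, room, hour)` event format, the function names and the 10% threshold. A real system would need something more robust, and this version does not handle the "lack of movement" case.

```python
from collections import Counter


def learn_pattern(events):
    """Count the distinct days on which movement was seen for each (room, hour)."""
    # De-duplicate so several sensor triggers in the same hour count once per day.
    seen = {(day, room, hour) for day, room, hour in events}
    return Counter((room, hour) for _, room, hour in seen)


def is_unusual(pattern, total_days, room, hour, min_fraction=0.1):
    """Flag movement at a (room, hour) seen on fewer than min_fraction of days."""
    return pattern[(room, hour)] / total_days < min_fraction


# Hypothetical history: kitchen movement at 8am and 6pm on each of 30 days.
history = [(day, "kitchen", hour) for day in range(30) for hour in (8, 18)]
pattern = learn_pattern(history)

print(is_unusual(pattern, 30, "kitchen", 8))  # False: part of the normal routine
print(is_unusual(pattern, 30, "kitchen", 3))  # True: never seen at 3am
```

This also gives one answer to the "what does success look like?" question: you can replay known events through the check and see whether the alerts match what a person would expect.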
One final thing for now
You will be able to read about supervised and unsupervised machine learning in upcoming posts. But since data is at the heart of any machine learning project, the information here is a small start.
Data is king, in the same way content is often said to be king on the web. And that is a fact that is growing in significance all the time.
If you have a data problem, we can help. Why not contact us and ask about our data science services today?