What is data – and what makes a dataset?
Here’s another of our technology background articles to help you become better acquainted with the fundamentals of our work. In this post we’ll take a short look at what data is and what makes a dataset. This may sound obvious, but data science depends on both ideas, so it is worth understanding them correctly.
The word data is the plural of datum: a piece of information that is an abstraction of a real-world entity. That could be:
- A person
- An object
- An event
You may find the terms variable, feature and attribute also crop up when describing an abstraction; all three refer to a single property of an entity. An entity definition consists of a set of attributes. A car, for example, might have colour, make, engine size, engine type, number of doors and so on.
Data can be continuous, meaning numeric values measured on a scale, such as stock prices or a car’s MPG. Or it can be categorical, meaning values drawn from a fixed set of labels, such as a car’s make or colour.
So, what makes a dataset?
A dataset contains the data relating to a collection of entities. Each entity is described by a set of attributes. In its most basic form the data is stored as an n * m matrix, referred to as the analytics record, where n is the number of entities (rows) and m is the number of attributes (columns). In data science, analytics record and dataset mean the same thing, and you’ll find both terms in use.
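As a rough illustration of the n * m layout, the sketch below builds a tiny analytics record in plain Python. The cars, attribute names and values are made up for the example, and the helper `shape` is hypothetical rather than part of any library.

```python
# A tiny analytics record: one row per entity, one column per attribute.
# The cars and their values are invented purely for illustration.
attributes = ["colour", "make", "engine_size_litres", "doors"]

records = [
    ["red", "Ford", 1.6, 5],
    ["blue", "Toyota", 1.8, 5],
    ["black", "BMW", 2.0, 3],
]


def shape(rows, columns):
    """Return (n, m): the number of entities and the number of attributes."""
    # Every row must supply exactly one value per attribute.
    if any(len(row) != len(columns) for row in rows):
        raise ValueError("each entity needs one value per attribute")
    return len(rows), len(columns)


n, m = shape(records, attributes)
print(f"{n} entities (rows) x {m} attributes (columns)")
```

In practice a dataframe library would usually hold this structure, but the row and column idea is the same.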
Creating a dataset
Building the analytics record is key to doing data science, and much of the effort goes into creating, cleaning and updating it. It takes time to merge data from a multitude of sources, including:
- Computer files (CSV, spreadsheets, JSON files for example)
- Data scraped from the web
- Social media streams
- Data warehouses
- Sensor readings
As a result, data management in itself can be tricky.
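To give a feel for why merging sources takes effort, here is a minimal sketch that combines a CSV source and a JSON source into one record per entity using only the Python standard library. The inline data, the `car_id` key and the `merge_sources` helper are all assumptions made for the example; real sources would be files, APIs or database tables.

```python
import csv
import io
import json

# Two hypothetical sources describing cars, both keyed by "car_id".
# Inline strings stand in for real files so the sketch is self-contained.
CSV_SOURCE = """car_id,make,colour
1,Ford,red
2,Toyota,blue
"""

JSON_SOURCE = '[{"car_id": "1", "mpg": 45.2}, {"car_id": "3", "mpg": 38.0}]'


def merge_sources(csv_text, json_text, key="car_id"):
    """Combine rows from both sources into one record per entity."""
    merged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        merged.setdefault(row[key], {}).update(row)
    for row in json.loads(json_text):
        # Normalise the key to a string so both sources line up.
        merged.setdefault(str(row[key]), {}).update(row)
    return merged


analytics_record = merge_sources(CSV_SOURCE, JSON_SOURCE)
for car_id, attrs in sorted(analytics_record.items()):
    print(car_id, attrs)
```

Note that car 2 ends up with no MPG and car 3 with no make or colour. Gaps like these are exactly what the cleaning step has to deal with.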
There are important questions to answer about which attributes to collect for an entity. Collecting data takes time, so what is collected and used matters; collecting attributes that are not useful in the analysis is wasteful.
That means you first need to create a plan for the data needed to answer the business question. The challenge is to keep the dataset specific enough to deliver useful insights, while ensuring its scope is wide enough to answer multiple related questions. This is important because redundant or irrelevant data can cause analytical algorithms to fail or to produce spurious results. Neither outcome is good.
Data types are also important. The standard types of data are:
- Numeric, which are measured on an interval or ratio scale, so arithmetic on the values is meaningful
- Nominal, also known as categorical, which are names for categories, classes or states of things
- Ordinal, which are similar to nominal attributes but whose values have a rank order. For example, a response to a survey question (dislike, neutral, like)
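The difference between the three types shows up in which operations make sense on them. The sketch below uses invented values and a hypothetical three-point survey scale to illustrate this; it is not a complete treatment of measurement scales.

```python
# Hypothetical survey scale, listed from lowest to highest rank.
SCALE = ["dislike", "neutral", "like"]
RANK = {label: position for position, label in enumerate(SCALE)}

engine_sizes = [1.6, 1.8, 2.0]                       # numeric
colours = ["red", "blue", "red"]                     # nominal
responses = ["like", "dislike", "neutral", "like"]   # ordinal

# Numeric: arithmetic such as a mean is meaningful.
mean_engine_size = sum(engine_sizes) / len(engine_sizes)

# Nominal: values can only be counted or compared for equality.
colour_counts = {c: colours.count(c) for c in set(colours)}

# Ordinal: sorting by rank is meaningful, but averaging the labels is not.
sorted_responses = sorted(responses, key=RANK.__getitem__)

print(mean_engine_size, colour_counts, sorted_responses)
```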
Are you data-driven?
While we are defining data, it’s worth considering what it means to be data-driven, or data-centric. More and more businesses across all sectors talk about being data-driven, which is one reason data science is gaining momentum. Both phrases mean essentially the same thing: strategic decisions are based on the analysis and interpretation of data. Those decisions are dependent on data being collected and analysed.
We’ll discuss what being data-driven means, and its benefits, in another post.