What is data - and what makes a dataset?

Here’s another of our technology background articles to help you become better acquainted with the fundamentals of our work. In this post we’ll take a short look at what data is and what makes a dataset. That may sound obvious, but data science depends on both ideas, so it is worth getting them right.

What is data?

The word data is the plural of datum: a piece of information that is an abstraction of a real-world entity. That entity could be:

A person
An object
An event

You may also find the terms variable, feature and attribute used when describing an abstraction. An entity definition consists of a set of attributes. A car, for example, might have colour, make, engine size, engine type, number of doors and so on.
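As a minimal sketch of the car example above, an entity can be written as a small Python class whose fields are its attributes. The class name, field names and values here are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass, fields


@dataclass
class Car:
    """One entity: an abstraction of a real-world car."""
    colour: str
    make: str
    engine_size_litres: float
    engine_type: str
    doors: int


# Hypothetical values, chosen only for illustration.
car = Car(
    colour="red",
    make="MakeA",
    engine_size_litres=1.6,
    engine_type="petrol",
    doors=5,
)

# The attribute (variable, feature) names that define the entity.
attribute_names = [f.name for f in fields(car)]
print(attribute_names)
```

Each field is one attribute, and each instance of the class is one datum about a real car.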

Data can be continuous: a numeric measurement that can take any value within a range, such as a stock price or a car’s MPG. Or it can be categorical: a value drawn from a fixed set of labels, such as a car’s engine type.

So, what makes a dataset?

A dataset contains the data relating to a collection of entities. Each entity is described by a set of attributes. In its basic form the data is stored as an n * m data matrix, where n is the number of entities (rows) and m is the number of attributes (columns). This matrix is referred to as the analytics record. In data science, analytics record and dataset mean the same thing, and you’ll find both terms in use.
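To make the n * m layout concrete, here is a small sketch using plain Python lists. The column names and values are made up for illustration.

```python
# Each row is one entity (a car); each column is one attribute.
# Column names and values are hypothetical.
columns = ["colour", "make", "engine_size_litres", "doors"]
rows = [
    ["red", "MakeA", 1.6, 5],
    ["blue", "MakeB", 2.0, 3],
    ["white", "MakeC", 1.2, 5],
]

n = len(rows)     # number of entities
m = len(columns)  # number of attributes

# Every row must supply a value for every attribute.
assert all(len(row) == m for row in rows)

# Reading down one column gives every entity's value for one attribute.
doors = [row[columns.index("doors")] for row in rows]

print(f"{n} entities x {m} attributes")
```

In practice a library table type would usually hold this data, but the shape is the same: rows for entities, columns for attributes.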

Creating a dataset

Building the analytics record is key to doing data science. Much of the effort involves creating, cleaning and updating the analytics record, and it takes time to merge data from a multitude of sources, including:

Computer files (CSV, spreadsheets and JSON files, for example)
Data scraped from the web
Social media streams
Data warehouses
Data derived from sensors

This means that data management in itself can be tricky.
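The merging step can be sketched with two of the source types listed above, a CSV file and a JSON file. This is an illustrative sketch only: the field names, the shared car_id key and the inline sample data are all assumptions, and real sources rarely line up this neatly.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export (inlined here instead of read from disk).
csv_text = "car_id,make,doors\n1,MakeA,5\n2,MakeB,3\n"

# Hypothetical source 2: a JSON file holding a different attribute.
json_text = '[{"car_id": 1, "mpg": 52.3}, {"car_id": 2, "mpg": 38.1}]'

# Index the CSV rows by the shared key, converting text to proper types.
records = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    car_id = int(row["car_id"])
    records[car_id] = {
        "car_id": car_id,
        "make": row["make"],
        "doors": int(row["doors"]),
    }

# Fold the JSON attributes into the matching entity.
for item in json.loads(json_text):
    records.setdefault(item["car_id"], {"car_id": item["car_id"]}).update(item)

# One row per entity, with attributes drawn from both sources.
analytics_record = [records[key] for key in sorted(records)]
print(analytics_record)
```

Even in this toy case the sources need a common key and type conversion before they can be combined, which is where much of the time goes.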

Choosing attributes

There are important questions to answer about the attributes collected for an entity. Collecting data takes time, so the choice of what is collected and used is vital; collecting attributes that are not useful in the analysis is wasteful.

That means you first need to create a plan for the data needed to answer the business question. The challenge is to keep the dataset specific enough to deliver useful insights, while ensuring its scope is wide enough to answer multiple related questions. This is important because redundant data can cause analytical algorithms to fail or to produce spurious results. Neither outcome is good.

Data types are also important. The standard types of data are:

Numeric, which are quantities measured on an interval or ratio scale
Nominal, also known as categorical attributes, which are names for categories, classes or states of things
Ordinal, which are similar to nominal ones but have a rank order applied to their values. For example, a response to a survey question (like, dislike etc.)
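The difference between the three types shows up in which operations make sense on them. The sketch below uses made-up values, and the "neutral" survey response and the rank numbers are assumptions added for illustration.

```python
from collections import Counter

# Numeric: arithmetic such as a mean is meaningful.
engine_sizes = [1.2, 1.6, 2.0]
mean_engine_size = sum(engine_sizes) / len(engine_sizes)

# Nominal: values are labels, so we can count them but not order or average them.
engine_types = ["petrol", "diesel", "petrol"]
type_counts = Counter(engine_types)

# Ordinal: labels with a rank order, so sorting is meaningful.
# The rank mapping is a hypothetical choice for this example.
RANK = {"dislike": 0, "neutral": 1, "like": 2}
responses = ["like", "dislike", "neutral", "like"]
ordered = sorted(responses, key=RANK.__getitem__)

print(mean_engine_size, type_counts, ordered)
```

The ranks only say that "like" sits above "neutral"; they do not say by how much, which is why ordinal values are not treated as numeric.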

Are you data-driven?

While we are defining data, it’s worth considering what it means to be data-driven, or data-centric. More and more businesses across all sectors talk about being data-driven, which is why data science is gaining momentum. Both phrases mean essentially the same thing: strategic decisions are made from data analysis and interpretation. Those decisions are dependent on data being collected and analysed.

We’ll discuss what being data-driven means, and its benefits, in another post.

image: Maksym Kaharlytskyi