What data do you have? A Data Definition Framework
In the last blog post, we discussed data visualisation mistakes that anyone can make. The quality of any data visualisation or data analysis depends on the quality of the underlying data: an analysis can reveal a genuinely new finding only if the data is accurate and trustworthy. So this time, we are going to look more closely at types of data, with an eye to ordering and understanding your data assets better.
Although the terminology is not precise, a common approach is to classify data as unstructured, semi-structured, or structured. Because big data refers to datasets so massive that they are hard to analyse with regular tools, it is closely connected with unstructured data. Unstructured data, such as images, audio files, presentations, and messages, does not have a pre-defined schema or any intrinsic data model. At the same time, such data can be very valuable for a company. Thanks to big data technologies such as Hadoop, previously challenging analysis of unstructured data is becoming more practical and thus more popular. Techniques such as data mining, Natural Language Processing (NLP), and text analytics equip data scientists with a variety of ways to discover patterns in unstructured data and interpret them.
In contrast to unstructured data, structured data such as web server logs and sensor data can be readily organised. It can be either human- or machine-generated. Structured data usually resides in a relational database, and therefore it is sometimes called relational data.
Besides structured and unstructured data, there is also an in-between category: semi-structured data. Semi-structured data does not “live” in a relational database, but it has some organising properties, such as tags or keys, that make it easier to interpret. For instance, XML documents and records in NoSQL databases are examples of semi-structured data.
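To make the three categories more concrete, here is a minimal Python sketch using only the standard library. All of the sample data (the customer rows, the JSON records, and the message) is invented for illustration, and the regular expression at the end is just one simple way to pull a fact out of free text, not a substitute for proper NLP.

```python
import csv
import io
import json
import re

# Structured: fixed columns, and every row follows the same schema.
csv_text = "customer_id,country,order_total\n101,UK,59.90\n102,DE,120.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
order_totals = [float(row["order_total"]) for row in rows]

# Semi-structured: keys describe the content, but records may differ in shape.
json_text = '[{"id": 101, "tags": ["vip"]}, {"id": 102, "address": {"city": "Berlin"}}]'
records = json.loads(json_text)
# A missing "address" key is normal here, so we fall back to None.
cities = [record.get("address", {}).get("city") for record in records]

# Unstructured: free text with no schema; any structure has to be extracted.
message = "Hi, my order 102 arrived late and the box was damaged."
order_mentions = re.findall(r"order (\d+)", message)

print(order_totals)
print(cities)
print(order_mentions)
```

Notice how the effort shifts: the structured rows can be read by column name straight away, the semi-structured records need defensive handling of missing keys, and the unstructured message yields nothing until we decide what pattern to look for.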
Returning to the subject of the data definition framework, we should also note that business data comes from both internal and external sources. Internal data is usually under the company’s control, as it resides in CRM, financial, and operational systems. Therefore, the quality of that data is mostly under the company’s control too. External data sources include social media and open data sources, to name a few.
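As a rough illustration of how these two dimensions (structure and source) can be used together, the following Python sketch tags a handful of data assets and lists the combinations that have no assets at all. The asset names, the `coverage` and `gaps` helpers, and the choice of dimensions are assumptions made for this example; it is not a reproduction of any published framework.

```python
from collections import Counter
from itertools import product

STRUCTURES = ("structured", "semi-structured", "unstructured")
SOURCES = ("internal", "external")

# Hypothetical inventory: (asset name, structure, source).
assets = [
    ("CRM customer table", "structured", "internal"),
    ("Support emails", "unstructured", "internal"),
    ("Product catalogue XML feed", "semi-structured", "internal"),
    ("Social media posts", "unstructured", "external"),
]


def coverage(assets):
    """Count assets in every (structure, source) combination."""
    counts = Counter((structure, source) for _, structure, source in assets)
    return {cell: counts.get(cell, 0) for cell in product(STRUCTURES, SOURCES)}


def gaps(assets):
    """Return the combinations for which no asset has been recorded."""
    return [cell for cell, count in coverage(assets).items() if count == 0]


for structure, source in gaps(assets):
    print(f"No {structure} data from {source} sources recorded")
```

With this made-up inventory, the output points at external structured and external semi-structured data as areas nobody has catalogued yet, which might be a real gap or simply an incomplete list.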
Below you can see a useful data framework produced by Bob Hayes, PhD, which breaks data sets down into six categories (originally published here).
This framework can help a business analyst or a novice data scientist organise and understand a company’s data assets, as well as identify strengths and gaps in data collection efforts.