  • Nikhil Kumawat

Types of Data: Structured, Semi-Structured and Unstructured

Updated: Jan 23, 2023



Data is a term you have all heard, so let's first define it. Data is any information stored in digital format, and in this era that information is growing at an exponential rate. The reason is the development, over the last decade, of technologies that create digital data. Some of the sources are self-driving cars, e-commerce websites, social media platforms, banking transactions, IoT devices, and smartphones. The data generated by these sources is huge in volume, and the term coined for it is Big Data.


Businesses analyze Big Data to make better decisions and succeed in the modern world. To make that analysis practical, technical teams classify data into three categories: Structured Data, Semi-Structured Data, and Unstructured Data. The category matters because it affects how data can be stored, how it should be organized, and how easy it is to process and analyze. So let's look at the definition of each of these three types of data.

 

Structured Data

Structured Data is data that has a pre-defined structure or schema. It is well organized and is typically categorized as quantitative data. Because of its pre-defined structure, it can be organized into tables of rows and columns, just like in a spreadsheet. When the data has relations between tables or is too large for spreadsheets, it is stored in relational databases. Let's see some tools and use cases for Structured Data.

Tools

  1. PostgreSQL: PostgreSQL is an object-relational database management system (ORDBMS). It supports a large part of the SQL standard and offers many modern features like complex queries, transactional integrity, and multi-version concurrency control.

  2. SQLite: SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite reads and writes directly to ordinary disk files.

  3. MySQL: MySQL is a widely used relational database management system (RDBMS). It is free and open-source and ideal for both small and large applications.

  4. Oracle Database: This is an advanced database management system with a multi-model structure. It can be used for data warehousing, online transaction processing, and mixed database workloads.

  5. Microsoft SQL Server: Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications.

Use Cases

  • Customer Relationship Management: CRM software runs structured data through analytical tools to create datasets that reveal customer behavior patterns and trends.

  • Online Booking: Hotel and ticket reservation data (e.g., dates, prices, destinations) fits the "rows and columns" format typical of a pre-defined data model.

  • Accounting: Accounting firms or departments use structured data to process and record financial transactions.

A glimpse of how structured data is organized in a database, that is, what structured data looks like.

In the above image, tables are defined with a fixed number of attributes and have relations with each other. This is how structured data is stored in a database: in an organized manner, following a pre-defined schema.
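To make the idea of fixed attributes and relations concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows (customers, bookings, "Asha", and so on) are assumptions made up for illustration; they are not taken from the image.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: every row must fit these pre-defined columns.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE bookings (
    booking_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    destination TEXT NOT NULL,
    price       REAL NOT NULL
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?)",
    [(10, 1, "Jaipur", 120.0), (11, 1, "Goa", 200.0)],
)

# The relation between the two tables lets us join them in a query.
rows = conn.execute("""
    SELECT c.name, b.destination, b.price
    FROM bookings AS b
    JOIN customers AS c ON c.customer_id = b.customer_id
    ORDER BY b.booking_id
""").fetchall()


def rejects_unknown_customer():
    """Return True if a booking for a non-existent customer is refused."""
    try:
        conn.execute("INSERT INTO bookings VALUES (12, 99, 'Delhi', 50.0)")
    except sqlite3.IntegrityError:
        return True
    return False


print(rows)
print(rejects_unknown_customer())
```

The schema is declared before any data arrives, and the database refuses a row that breaks the relation. That strictness is what makes structured data easy to query and analyze.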

Unstructured Data

Unstructured data is typically categorized as qualitative rather than quantitative. It doesn't have a pre-defined structure or specific format. Data in this category includes audio, video, images, and the contents of text files. Each of these has different properties, and none fits naturally into the tables of a relational database (how would you store images in interrelated spreadsheets?), because the data lacks attributes and relations that could be mapped to columns. So this data is stored in its raw format, and analysis is done by applying image processing, Natural Language Processing, and Machine Learning.
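As a rough illustration of why unstructured data needs extra processing, the sketch below takes a piece of raw text and derives a simple structure (word counts) from it. The review text and the word_counts helper are hypothetical, and counting words is only a stand-in for real Natural Language Processing.

```python
import re
from collections import Counter

# Hypothetical raw review text: no columns, no keys, just characters.
raw_review = (
    "Great phone, great camera. Battery life is not great, "
    "but the screen is sharp."
)


def word_counts(text):
    """Lower-case the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = word_counts(raw_review)
# Most frequent word and how often it occurs.
print(counts.most_common(1))
```

Nothing in the raw text says which parts are "attributes". The analysis step has to impose a structure, which is the opposite of a table, where the schema comes first.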

Tools

  1. MongoDB: MongoDB is a non-relational document database that provides support for JSON-like storage.

  2. Hadoop: A framework whose distributed file system (HDFS) can store any file format in a distributed and scalable manner.

  3. Data Lake: A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first.

Use Cases

  • Data Mining: Enables businesses to use unstructured data to identify consumer behavior, product sentiment, and purchasing patterns to better accommodate their customer base.

  • Predictive Data Analytics: Alerts businesses to important activity ahead of time so they can plan properly and adjust to significant market shifts.

  • Chatbots: Perform text analysis to route customer questions to appropriate answer sources.

In the above image, you can see that the data is in image format, which does not fit the rows and columns of a relational database and calls for non-relational storage and analysis. Hence this type of data, which doesn't have a pre-defined schema, is categorized as unstructured data. It is analyzed using image processing.

Semi-Structured Data

Semi-Structured data contains elements of both structured and unstructured data. Its schema is not fixed as in structured data, but with the help of metadata (which lets users define a partial structure or hierarchy) it can be organized to some extent, so it is not as unorganized as unstructured data. The metadata consists of tags and other markers, as in JSON, XML, or CSV, which separate the elements and enforce the hierarchy; the size of the elements varies and their order is not important. Tools and use cases for storing and processing it are:

Tools

  1. Cassandra: Apache Cassandra is an open-source NoSQL distributed database that offers scalability and high availability without compromising performance. In CAP theorem terms, it favors availability.

  2. MongoDB: MongoDB is a non-relational document database that provides support for JSON-like storage. In CAP theorem terms, it favors consistency.

Use Cases

  • E-commerce: Products have different attributes, and the tags that act as metadata make the data partially organized. We can see in the sample data below that different types of products have different attributes (a dynamic schema) that are not pre-defined, and attributes can be added on the fly.

    • For mobile phones: {"storage": "64GB", "network": "5G", "color": "black"}

    • For books: {"publisher": "Oxford Press", "writer": "John Doe", "pages": 250}

In the above image, we can see tags that define the attributes and act as metadata. The snippet is from a JSON file that is not completely structured: 'SALES' number 648666 has one more attribute than 'SALES' number 648229, and these attributes can vary between tags, which makes the data semi-structured. You can also see the inherent tree-like structure that gives some degree of organization, though it is weaker than that of a table.
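The two product records from the e-commerce use case can be loaded with Python's built-in json module to show what a dynamic schema means in practice. This is only a sketch: the warranty_years attribute is a hypothetical addition, used to show an attribute being added on the fly.

```python
import json

# The mobile phone and book records from the e-commerce example,
# written as one JSON array.
raw = """
[
  {"storage": "64GB", "network": "5G", "color": "black"},
  {"publisher": "Oxford Press", "writer": "John Doe", "pages": 250}
]
"""

products = json.loads(raw)

# Each record carries its own keys (the metadata), so the attribute
# sets differ from one product to the next.
key_sets = [sorted(p) for p in products]

# Attributes can be added on the fly; no schema change is needed.
# "warranty_years" is a hypothetical attribute used for illustration.
products[0]["warranty_years"] = 2

# Because a key may be absent, readers must handle missing attributes.
pages = [p.get("pages") for p in products]

print(key_sets)
print(pages)
```

The keys give each record some organization, but code that reads the data cannot assume any particular attribute exists. That is the practical difference from a table with fixed columns.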


Hopefully, this blog helps you understand the difference between structured data, semi-structured data, and unstructured data. If I have missed anything, leave a comment and I will try to update the post ASAP.


Thanks for visiting.
