Transforming Unstructured Data into Structured Using AI

mindsdbteam

MindsDB Team

Posted on November 22, 2024

Transforming Unstructured Data into Structured Using AI

Written by Martyna Slawinska, Technical Product Manager @ MindsDB

In this digital age, the vast majority of data generated is unstructured, ranging from emails and social media posts to audio and video content. This presents a significant challenge for organizations seeking to harness this data for insights and decision-making. Leveraging artificial intelligence (AI) to structure unstructured data is crucial, as AI can process and analyze large volumes of data quickly and accurately.

Traditional methods for structuring unstructured data involve manual processes or rule-based systems, which can be time-consuming, error-prone, and difficult to scale. While these methods can provide some level of organization, they lack the efficiency and adaptability of AI-powered solutions. AI, or specifically large language models (LLMs), simplifies the process of extracting relevant information from unstructured data. This accelerates data processing and enhances accuracy, enabling real-time data analysis.

One notable solution in this space is provided by MindsDB, an open source AI platform that brings structured and unstructured data together for enterprise AI. The solution utilizes MindsDB's integration with Large Language Models to extract relevant data from unstructured text and insert it into a database. This way you can convert large amounts of unstructured data into structured formats using only a few SQL commands. Read along to follow the tutorial.

Tutorial

Start by setting up MindsDB locally via Docker or Docker Desktop.

Connect your data source that contains unstructured data to MindsDB. See all supported data sources here.

In this example, we use the sample data containing property descriptions as follows.

We have unstructured data in the form of property descriptions that contain details about properties such as address, square footage, and number of bedrooms and bathrooms.

id description
1 Discover luxury at 123 Maple Street, Los Angeles, CA 90001, USA. This stunning 4-bedroom, 3-bathroom home spans 3,200 sq ft, featuring modern amenities and a spacious backyard perfect for entertaining.
2 Charming 3-bedroom, 2-bathroom cottage at 456 Oak Avenue, Austin, TX 73301, USA. Enjoy a cozy fireplace, updated kitchen, and lush garden in this 1,500 sq ft home located in a quiet neighborhood.
3 Beautiful waterfront condo at 789 Pine Lane, Miami, FL 33101, USA. This 2-bedroom, 2-bathroom gem boasts 1,200 sq ft of space, ocean views, a private balcony, and access to resort-style amenities.

We want to extract relevant data from descriptions and store it in another table. To do that we use the OpenAI model with the JSON Extract feature.

Firstly, we create an OpenAI engine providing the OpenAI API key.

CREATE ML_ENGINE openai_engine
FROM openai
USING
    openai_api_key = 'your-openai-api-key';
Enter fullscreen mode Exit fullscreen mode

Secondly, we create an AI model to extract relevant data values from descriptions.

CREATE MODEL extract_data_model
PREDICT json
USING
    engine = 'openai_engine',
    json_struct = {
        'nobath': 'number of bathrooms',
        'nobed': 'number of bedrooms',
        'sqft': 'square footage of the property (only integer)',
        'address': 'address includes house number and street name',
        'city': 'city',
        'zipcode': 'zipcode includes two letters (state) and five digits',
        'country': 'country'
    },
    prompt_template = '{{description}}';
Enter fullscreen mode Exit fullscreen mode

The json_struct parameter stores data fields and their descriptions; these data fields will be extracted from unstructured data. MindsDB has custom-implemented it to ease the process of extracting desired data from unstructured volumes of data.

The prompt_template parameter stores the column name that contains property descriptions in double curly braces, which will be replaced by property descriptions from the input data upon joining the model with the input data.

SELECT d.description, m.json
FROM data_source.property AS d
JOIN extract_data_model AS m;
Enter fullscreen mode Exit fullscreen mode

Here is the output:

json_table

Now that we have extracted relevant data from descriptions into JSON format, we can insert it into another table.

INSERT INTO data_source.property_details

    SELECT json_extract(json, '$.address') AS address,
           json_extract(json, '$.city') AS city,
           json_extract(json, '$.country') AS country,
           json_extract(json, '$.zipcode') AS zipcode,
           json_extract(json, '$.sqft') AS sqft,
           json_extract(json, '$.nobed') AS nobed,
           json_extract(json, '$.nobath') AS nobath

    FROM (SELECT m.json
          FROM data_source.property AS d
          JOIN extract_data_model AS m);
Enter fullscreen mode Exit fullscreen mode

Here is the resulting table:

address city country zipcode sqft nobed nobath
"123 Maple Street" "Los Angeles" "USA" "CA 90001" "3200" "4" "3"
"456 Oak Avenue" "Austin" "USA" "TX 73301" "1500" "3" "2"
"789 Pine Lane" "Miami" "USA" "FL 33101" "1200" "2" "2"

You can adapt this tutorial to your use case by modifying the json_struct parameter value when creating the model, which allows you to specify what data should be extracted from the given unstructured data.

Automating Data Structuring for Newly Added Unstructured Data

With MindsDB, you can automate the transformation of newly added unstructured data into a structured format. Setting up a job ensures that every time new data becomes available, it is automatically processed and stored in the desired format.

CREATE JOB structure_data (

INSERT INTO data_source.property_details

    SELECT json_extract(json, '$.address') AS address,
           json_extract(json, '$.city') AS city,
           json_extract(json, '$.country') AS country,
           json_extract(json, '$.zipcode') AS zipcode,
           json_extract(json, '$.sqft') AS sqft,
           json_extract(json, '$.nobed') AS nobed,
           json_extract(json, '$.nobath') AS nobath

    FROM (SELECT m.json
          FROM data_source.property AS d
          JOIN extract_data_model AS m
          WHERE d.id > LAST)
)
EVERY 1 day
IF (
SELECT *
FROM data_source.property AS d
WHERE d.id > LAST
);
Enter fullscreen mode Exit fullscreen mode

This job, upon its execution, transforms new loads of unstructured data into the defined format and inserts it into a data table.

This is a conditional job that attempts its execution EVERY 1 day but actually executes the INSERT INTO statement only IF there is new data available in the property table. The LAST keyword used in the WHERE clause checks if any new data has been added since the last check. If new data has been added, then the query inside the IF clause returns data, and subsequently, the INSERT INTO statement is executed.

The key benefits of this approach include:

  • Efficiency: It automates routine data transformation tasks.
  • Conditional Execution: It avoids unnecessary processing by executing only when new data is available.
  • Seamless Integration: It ensures that structured data tables stay up-to-date without manual intervention.

Note that conditional jobs can serve a multitude of different use cases. Check out more use cases of MindsDB here.

Conclusion

Transforming unstructured data into valuable insights doesn’t have to be complex. With MindsDB, you can quickly extract desired details from text using just a few simple SQL commands – making the process fast and straightforward. It’s also highly adaptable: customize the json_struct settings to align perfectly with your data requirements.

If you have any questions, join our Community Slack — we’re here to help!

💖 💪 🙅 🚩
mindsdbteam
MindsDB Team

Posted on November 22, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related