IntaLink: A New NL2SQL Technology Distinct from Large Models
intalink
Posted on October 29, 2024
Wide Application Scenarios of IntaLink
Background Review: In previous articles, it was mentioned that "the goal of IntaLink is to achieve automated data linking in the field of data integration." From that discussion, it is clear that IntaLink addresses the automatic linking of relational data across multiple tables.
Now, let's discuss whether this issue has broad application scenarios or if it is merely a pseudo-proposition without practical demand.
01 Relational Data Remains One of the Most Important Data Assets
Although large models, big data platforms, and other technologies can utilize many types of information, including documents, images, audio, and video (for example, multimodal generative AI that produces video and supports voice interaction), their results are often open-ended and subjective, and occasionally suffer from "hallucinations." Using them for reference or assistance is acceptable, but in rigorous working environments we cannot rely on such information, or on large models, to complete tasks. In sectors such as banking, finance, transportation, trading, accounting, production, and energy, core business data must be managed as structured relational data.
02 Data Construction is Inevitable and Distributed
(1) The Design Paradigm of Relational Databases requires data to be reasonably divided to avoid significant redundancy. If the data generated during the construction phase contains a lot of redundancy, not only is the data collection workload duplicated, but data consistency is also difficult to ensure. From another perspective, if all related data are stored in a single table, but the data items come from different business sources, with varying data collectors and generation times, maintaining such data records becomes impossible. Thus, data construction will inherently organize data based on object orientation and business activities, leading to its distribution across different tables.
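The normalization argument above can be illustrated with a minimal sketch. The student/class schema here is entirely hypothetical (it is not IntaLink code): duplicating a fact in every row multiplies the maintenance burden, while splitting the data into linked tables keeps a single authoritative copy.

```python
# Denormalized: the class teacher is repeated in every student row,
# so renaming the teacher means updating every copy consistently.
denormalized = [
    {"student": "Zhang San", "class": "2021_01", "teacher": "Ms. Liu"},
    {"student": "Li Si",     "class": "2021_01", "teacher": "Ms. Liu"},
]

# Normalized: each fact is stored once; tables are linked by class id.
students = [
    {"student": "Zhang San", "class": "2021_01"},
    {"student": "Li Si",     "class": "2021_01"},
]
classes = {"2021_01": {"teacher": "Ms. Liu"}}

def teacher_of(student_name: str) -> str:
    """Look up a student's teacher through the class link."""
    row = next(s for s in students if s["student"] == student_name)
    return classes[row["class"]]["teacher"]

print(teacher_of("Li Si"))  # -> Ms. Liu
```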
(2) Data Must Originate from Multiple Systems. Since information technology construction is not completed in one go, there will inevitably be a sequence of developments. Even within the same system, there may be variations in implementation timelines. Moreover, different application scenarios require different technological choices; for instance, business data, real-time data, and log information may be realized through various technologies, making data inherently multi-sourced.
03 Integration is the Most Effective Means of Unlocking Data Value
Data needs to be integrated for application. The demand for data integration applications has various possibilities. For example, integrating production data and planning data can assess the status of plan completion; integrating production data and sales data can identify product backlogs or fulfillment of order deliveries; and integrating production data with financial data can evaluate production costs and profitability. Therefore, data integration is the most effective way to maximize data value and empower business processes.
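The production/sales example above can be made concrete with a small SQLite sketch; the schema and figures are hypothetical, chosen only to show how joining the two tables reveals a backlog or an over-committed order book per product.

```python
import sqlite3

# Hypothetical production and sales tables; the schema is illustrative,
# not taken from any real system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production (product TEXT, produced INTEGER);
CREATE TABLE sales      (product TEXT, sold     INTEGER);
INSERT INTO production VALUES ('widget', 100), ('gadget', 50);
INSERT INTO sales      VALUES ('widget', 80),  ('gadget', 55);
""")

# Integrating the two tables shows backlog (produced > sold)
# or over-commitment (sold > produced) per product.
rows = conn.execute("""
    SELECT p.product, p.produced - s.sold AS backlog
    FROM production p JOIN sales s ON p.product = s.product
    ORDER BY p.product
""").fetchall()
print(rows)  # -> [('gadget', -5), ('widget', 20)]
```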
In summary, the integration application of relational data will remain one of the most important data application scenarios for a long time. As long as this scenario exists, IntaLink will have broad adaptability.
Comparison of IntaLink and Large Model Data Integration Methods
T2SQL (Text to SQL) and NL2SQL (Natural Language to SQL) automatically generate the required data queries from text or natural language input. The two terms describe the same concept: using AI technology to translate semantic understanding into data operations. This is an active research direction in data applications, and in recent years the emergence of large model technologies has advanced the field significantly. I have studied technical reports from Alibaba and Tencent and tried open-source projects such as DB-GPT. These technologies are largely similar, at least in their underlying technical logic, while IntaLink's approach is entirely different.
Let’s set aside the underlying technical logic for now and conduct a comparative analysis based on implementation methods:
1. Utilizing Large Model Technology for Automatic Data Queries Requires Data Training
Suppose we have a set of tables named T1, T2, ..., Tn, each containing several data items labeled C1, C2, ..., Cn, with varying counts of items per table. Consider a simulated dataset for table T1 as follows:
| C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|
| Orange | 5 | 3 | 3 | 2 | 1 |
From this content alone, we cannot derive any useful information. We are unclear about the meaning of the data above. Let’s simulate two meanings for the data:
| Fruit Type | Warehouse No. | Shelf No. | Stock | Shelf Life | Warehouse Manager ID |
|---|---|---|---|---|---|
| Orange | 5 | 3 | 3 | 2 | 1 |
| Hotel Name | Popularity Ranking | Star Rating | Years in Business | Remaining Rooms | Discount Available |
|---|---|---|---|---|---|
| Orange | 5 | 3 | 3 | 2 | 1 |
We won't dwell on the validity of these datasets or the existence of such tables. However, it is evident that without understanding the meaning of the tables and data items, the data cannot be applied. One cannot link data application needs to the data itself, let alone discuss more complex data operations.
Let's use an NL2SQL test dataset to illustrate the application pattern of large model technology in this field.
The Spider dataset is a T2SQL dataset for multi-database, multi-table, single-turn queries, and is recognized as one of the most challenging large-scale cross-domain benchmarks. Proposed by Yale University in 2018 and annotated by eleven Yale students, it contains 10,181 natural language questions and 5,693 SQL statements, covering more than 200 databases across 138 different domains; 7,000 questions are used for training, 1,034 for development, and 2,147 for testing. In other words, by providing questions along with their corresponding answers (SQL), the large model learns to utilize the data. For simplicity, we can condense the logic as follows:
- Question 1: How many red lipsticks are in stock?
- Answer 1: `select amount from warehouse where good_name='lipstick' and color='red'`
After training the model with such a dataset, we can pose the following test question:
- Test Question: How many blue lipsticks are in stock?
- Output Answer: `select amount from warehouse where good_name='lipstick' and color='blue'`
From this, we see that NL2SQL emphasizes deriving possible SQL queries based on semantic and contextual understanding, relying on a trained dataset.
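The train-then-ask pattern above can be caricatured as few-shot prompt assembly. This is a deliberate simplification with hypothetical helper names; a real large model generalizes far beyond template matching, but the sketch shows the input/output shape of the approach.

```python
# A caricature of the few-shot pattern: pair natural-language questions
# with gold SQL, then ask the model to complete a new question.
# No real model is invoked here; we only assemble the prompt text.
EXAMPLES = [
    ("How many red lipsticks are in stock?",
     "select amount from warehouse where good_name='lipstick' and color='red'"),
]

def build_prompt(question: str) -> str:
    """Concatenate training pairs, then leave the new question open
    for the model to complete."""
    parts = [f"Q: {q}\nSQL: {sql}" for q, sql in EXAMPLES]
    parts.append(f"Q: {question}\nSQL:")
    return "\n\n".join(parts)

prompt = build_prompt("How many blue lipsticks are in stock?")
print(prompt)
```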
IntaLink’s Data Integration Method
IntaLink's data integration does not require users to provide any training data. The relationships between data are generated through an inter-table relationship analysis model. This relationship generation does not require understanding the actual significance of the tables and data items but is derived through a set of methods that analyze the data's characteristic values to deduce associations between tables. Below, we illustrate the establishment of inter-table relationships using two sample tables.
Tab_1
| Name | Student_ID | CLASS | Age | Sex |
|---|---|---|---|---|
| Zhang San | 2021_0001 | 2021_01 | 19 | Male |
| Li Si | 2021_0002 | 2021_01 | 18 | Female |
| Wang Wu | 2021_0003 | 2021_01 | 19 | Male |
Tab_2
| Student_ID | Course | Grade | Rank |
|---|---|---|---|
| 2021_0001 | Math | 135 | 18 |
| 2021_0001 | Chinese | 110 | 23 |
| 2021_0002 | Math | 120 | 25 |
| 2021_0002 | Chinese | 125 | 10 |
In Tab_1, the Student_ID column shares the same characteristic values as the Student_ID column in Tab_2. Therefore, the two tables can be linked with the condition Tab_1.Student_ID = Tab_2.Student_ID. Analyzing inter-table linkage requires weighing numerous factors. In IntaLink, we build an in-memory database of data characteristic values as an analysis tool and apply a set of optimized analytical methods to produce the inter-table relationship results. Given the complexity of the content involved, we will not elaborate further here; a separate article will discuss the implementation logic.
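As an illustration only, the "characteristic value" idea can be sketched as a containment check over column value sets. IntaLink's actual analysis model weighs many more factors than this hypothetical function does.

```python
def join_candidates(tab1: dict, tab2: dict, threshold: float = 0.9):
    """Propose join conditions by measuring how much of one column's
    value set is contained in a column of the other table."""
    candidates = []
    for c1, v1 in tab1.items():
        for c2, v2 in tab2.items():
            s1, s2 = set(v1), set(v2)
            if not s1 or not s2:
                continue
            # Containment relative to the smaller set tolerates tables
            # that cover different subsets of the same entities.
            containment = len(s1 & s2) / min(len(s1), len(s2))
            if containment >= threshold:
                candidates.append((c1, c2))
    return candidates

tab_1 = {
    "Name":       ["Zhang San", "Li Si", "Wang Wu"],
    "Student_ID": ["2021_0001", "2021_0002", "2021_0003"],
    "CLASS":      ["2021_01", "2021_01", "2021_01"],
}
tab_2 = {
    "Student_ID": ["2021_0001", "2021_0001", "2021_0002", "2021_0002"],
    "Course":     ["Math", "Chinese", "Math", "Chinese"],
}

print(join_candidates(tab_1, tab_2))  # -> [('Student_ID', 'Student_ID')]
```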
Differences Between IntaLink and Large Model Technologies in Implementing NL2SQL
- 1) No training question set needs to be prepared for a large model; relationships are derived through data analysis. IntaLink can therefore be applied to a wide range of data, and the more data there is to integrate, the greater its advantage.
- 2) IntaLink focuses on data integration, specifically generating the join conditions used during integration, rather than on how the data is used. Note: data integration concerns establishing relationships between multiple tables, while data usage methods vary (summation, counting, averaging, minima and maxima, etc.); NL2SQL selects the appropriate operation based on semantics, such as SUM, COUNT, AVG, MIN, MAX.
- 3) High accuracy: setting aside data quality issues, the join conditions generated by IntaLink can in theory achieve 100% accuracy.
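The second point, the separation between generating join conditions and choosing a data usage method, can be sketched with SQLite. This is a hypothetical pipeline, not IntaLink's API: the join condition is one input (here hard-coded as if produced by relationship analysis), the aggregate chosen from semantics is another, and the final SQL combines them.

```python
import sqlite3

# Sample tables from the article, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tab_1 (Name TEXT, Student_ID TEXT, CLASS TEXT);
CREATE TABLE Tab_2 (Student_ID TEXT, Course TEXT, Grade INTEGER);
INSERT INTO Tab_1 VALUES ('Zhang San','2021_0001','2021_01'),
                         ('Li Si','2021_0002','2021_01');
INSERT INTO Tab_2 VALUES ('2021_0001','Math',135), ('2021_0001','Chinese',110),
                         ('2021_0002','Math',120), ('2021_0002','Chinese',125);
""")

join_condition = "Tab_1.Student_ID = Tab_2.Student_ID"  # from relationship analysis
aggregate      = "AVG(Tab_2.Grade)"                     # from semantic understanding

sql = (f"SELECT Tab_1.Name, {aggregate} FROM Tab_1 "
       f"JOIN Tab_2 ON {join_condition} "
       f"GROUP BY Tab_1.Name ORDER BY Tab_1.Name")
rows = conn.execute(sql).fetchall()
print(rows)  # -> [('Li Si', 122.5), ('Zhang San', 122.5)]
```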
Potential Combination of IntaLink and Large Model Technologies
Large model technologies excel at semantic understanding and content generation, while IntaLink has advantages in data association analysis, with lower upfront workload and higher accuracy. Ideally, a large model would interpret the user's input and translate it into the required data tables and items; IntaLink would then generate the integrated data set; and finally the large model would produce the desired outputs (e.g., reports, charts) for presentation to the user.
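The envisioned division of labor might look like the following sketch, in which every function is a stub with hypothetical names; neither the LLM calls nor an IntaLink API are real here.

```python
# A sketch of the envisioned pipeline. Every function is a stub:
# the two LLM steps and the IntaLink step are placeholders only.
def understand_request(text: str) -> dict:
    """LLM step (stubbed): map the user's request to tables and items."""
    return {"tables": ["production", "sales"], "items": ["product", "backlog"]}

def generate_dataset(spec: dict) -> list:
    """IntaLink step (stubbed): derive join conditions, return the data."""
    return [("widget", 20)]

def present_results(rows: list) -> str:
    """LLM step (stubbed): render the integrated data as a report."""
    return "\n".join(f"{name}: backlog {n}" for name, n in rows)

report = present_results(generate_dataset(understand_request(
    "Which products have a production backlog?")))
print(report)  # -> widget: backlog 20
```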
Join the IntaLink Community!
We would love for you to be a part of the IntaLink journey! Connect with us and contribute to our project:
🔗 GitHub Repository: IntaLink
💬 Join our Discord Community
Be a part of the open-source revolution and help us shape the future of intelligent data integration!