Mrunmay Shelar
Posted on August 23, 2024
LangDB provides a powerful arsenal of functions for developers who deal with unstructured data. These functions are designed to streamline common tasks in data extraction and text chunking. Let's dive into some of the key functions and see how they can make your life easier.
load
The load function converts any webpage or file into bytes. These bytes can then be used to extract text or layout information from the file or webpage.
SELECT * FROM load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf');
content |
---|
[37,80,68,70,45,49,46,54,10,37,-30,-29,-49,-45,10,53,32,48,32,111,98,106,10,60,60,10,47,66,77,32,47,78,111,114,109,97,108,10,47,99,97,32,49,10,62,62,10,101,110,100,111,98,106,10,56,32,48,32,111,98,106,10,60,60,10,47,70,105,108,116,101,114,32,47,70,108,97,116,101,68,101,99,111,100,101,10,47,76,101,110,103,116,104,32,50,57,54,10,47,78,32,51,10,62,62,10,115,116,114,101,97,109,10,120,-100,125,-112,-67,74,-61,96,20,-122,31,107,65,20,-59,65,-121,14,14,25,28,92,-44,-2,104,127,-64,-91,-83,88,92,91,-123,86,-89,52,77,-117,-40,-97,-112,-90,-24,5,-24,-26,-32,-22,38,46,-34,-128,-24,101,40,8,14,-30,-32,37,-120,-96,-77,111,26,36,5,-87,-25,-16,-26,123,120,-13,-110,47,-25,64,36,-122,42,26,-121,78,-41,115,-53,-91,-126,81,-83,29,24,83,-17,76,-88,-121,101,90,125,-121,-15,-91,-44,-9,75,-112,125,94,-3,39,55,-82,-90,27,118,-33,-46,-7,33,121,-82,46,-41,39,27,-30,-59,86,-64,-89,62,-41,3,-66,-16,-7,-60,115,60,-15,-75,-49,-18,94,-71,40,-66,19,-81,-76,70,-72,62,10] |
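The negative numbers in the content column suggest that each byte is displayed as a signed 8-bit integer. As a minimal sketch under that assumption (it is inferred from the output above, not taken from LangDB documentation), the following Python snippet maps the first few values back to raw bytes and recovers the PDF header. The variable names are illustrative only.

```python
# First values of the `content` column shown above.
# Assumption: each value is a signed 8-bit integer (-128..127).
signed_values = [37, 80, 68, 70, 45, 49, 46, 54, 10, 37, -30, -29, -49, -45, 10]

# Masking with 0xFF maps negative values back to the 0..255 byte range.
raw = bytes(value & 0xFF for value in signed_values)

# The first eight bytes are plain ASCII and form the PDF version header.
header = raw[:8].decode("ascii")
print(header)
```

This prints %PDF-1.6, which is consistent with the column holding the raw bytes of the file rather than any extracted text.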
extract_text
The extract_text() function extracts text from various file types, with specific options available for PDF files.
Parameters
Parameter | Type | Optional | Description | Possible Values | Sample Value |
---|---|---|---|---|---|
path | String | No | The file path to extract text from | Any valid URL | 'https://example.com' |
type | String | Yes | Type of file | PDF, Markdown, Text, HTML | 'pdf' |
page_range | Array(Int) | Yes | Extra parameter for PDF file type for the range of page numbers | Array of Start and Ending page numbers | [1, 10] |
per_page | Bool | Yes | Extra parameter for PDF file type to chunk per Page | true, false | true |
Usage with load function
SELECT * FROM extract_text((SELECT * from load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf')),
type => 'pdf' ,
per_page => false
);
content | metadata | page_no |
---|---|---|
JUST DESERTS Aniela Spring © Copyright Aniela Spring 2024 This is an authorised free edition from www.obooko.com Although you do not have to pay for this book, the author’s intellectual property rights remain fully protected by international Copyright laws. You are licensed to use this digital copy strictly for your personal enjoyment only: it must not be redistributed commercially or offered for sale in any form. If you paid for this free edition, or to gain access to it, we suggest you demand an immediate refund and report the transaction to the author and Obooko. All characters are fictitious and any resemblance to real persons, living or dead, is utterly coincidental. 1 | {"total_pages":2,"page_range":"(0, 2)"} | 0 |
The functions above are best suited for raw text. However, if you want to get the layout information from a document, LangDB has support for that too.
extract_layout
The extract_layout function enables structured data extraction with layout information from a document.
Parameters
Parameter | Type | Optional | Description | Possible Values | Sample Value |
---|---|---|---|---|---|
path | String | No | The file path to extract layout information from | Any valid file URL | 'https://example.pdf' |
type | String | Yes | Type of file | Raw, PDF, Image | 'pdf' |
page_range | Array(Int) | Yes | Extra parameter for PDF file type for the range of page numbers | Array of Start and Ending page numbers | [1, 10] |
parallelism | Int | Yes | Extra parameter for PDF file type to process pages in parallel | 2, 4, 5 | 2 |
Extracting Layout information from a PDF
SELECT * FROM extract_layout(
path => 's3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf',
type=> 'pdf'
);
page | block_idx | block_id | block_type | row_id | col_id | text | confidence | entity_types | relationships |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | c7261e9c-be58-4776-a1de-70adf6e4e6e6 | PAGE | 0 | 0 | | 0 | [] | [["CHILD",["23112b0d-4062-424d-bbb3-4f4aa82f4d80","3e3c5562-b018-4f75-85d9-6e7771489ba0","f08a9210-eedb-4150-99e2-a5d22b26e029","f3087bee-7680-4024-aeff-60ab0bdc1dac"]]] |
0 | 1 | 23112b0d-4062-424d-bbb3-4f4aa82f4d80 | LINE | 0 | 0 | Don't forget about your past, because it never forgets about you. | 99.88849 | [] | [["CHILD",["102e10d5-fd45-46ee-9890-b70279c6e532","af6bad3a-34fc-462e-9033-c1af2bd5aa1a","aab2849a-4a4b-499c-a16f-43c55fb5dffd","78ef1f76-d8a5-413f-be87-cff88194b7e1","9f41e657-f307-487f-872e-569272305ad4","01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4","7484031d-5259-48ad-a3c7-bbcb862d34f0","783ddab6-47a3-48aa-b56b-adc564daa8cd","d7a69ab3-c601-4d9d-9632-7d4f176b2462","d8f537c7-4c64-4a01-9792-088660b1631d","fb7ad2cb-e72d-4013-8399-fa32d46cb21d"]]] |
0 | 2 | 3e3c5562-b018-4f75-85d9-6e7771489ba0 | LINE | 0 | 0 | JUST DESERTS | 98.635315 | [] | [["CHILD",["5e4fc404-7326-4195-a3d7-343a4dea7a8f","f3efd0b2-0c54-49bc-9867-6830eab05403"]]] |
0 | 3 | f08a9210-eedb-4150-99e2-a5d22b26e029 | LINE | 0 | 0 | ANIELA SPRING | 99.87999 | [] | [["CHILD",["f4e95636-470f-45eb-a599-4d3e00f754d6","5e8676d2-f90c-4455-b179-083da72c647e"]]] |
0 | 4 | 102e10d5-fd45-46ee-9890-b70279c6e532 | WORD | 0 | 0 | Don't | 99.96765 | [] | [] |
0 | 5 | af6bad3a-34fc-462e-9033-c1af2bd5aa1a | WORD | 0 | 0 | forget | 99.908676 | [] | [] |
0 | 6 | aab2849a-4a4b-499c-a16f-43c55fb5dffd | WORD | 0 | 0 | about | 99.9353 | [] | [] |
0 | 7 | 78ef1f76-d8a5-413f-be87-cff88194b7e1 | WORD | 0 | 0 | your | 99.92315 | [] | [] |
0 | 8 | 9f41e657-f307-487f-872e-569272305ad4 | WORD | 0 | 0 | past, | 99.73978 | [] | [] |
0 | 9 | 01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4 | WORD | 0 | 0 | because | 99.9515 | [] | [] |
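The relationships column links each block to its children: the PAGE block points to its LINE blocks, and each LINE points to its WORD blocks. As a rough sketch of how that hierarchy could be walked on the client side, the Python snippet below rebuilds part of a line from its child words. It assumes the rows have already been fetched and the relationships column parsed into nested lists; the row dictionaries and the child_texts helper are hypothetical and not part of LangDB.

```python
# A few rows copied from the output above. Block ids are shortened to their
# first eight characters for readability; only the columns used here are kept.
rows = [
    {"block_id": "23112b0d", "block_type": "LINE",
     "text": "Don't forget about your past, because it never forgets about you.",
     "relationships": [["CHILD", ["102e10d5", "af6bad3a", "aab2849a",
                                  "78ef1f76", "9f41e657", "01ae2b2a",
                                  "7484031d"]]]},
    {"block_id": "102e10d5", "block_type": "WORD", "text": "Don't", "relationships": []},
    {"block_id": "af6bad3a", "block_type": "WORD", "text": "forget", "relationships": []},
    {"block_id": "aab2849a", "block_type": "WORD", "text": "about", "relationships": []},
    {"block_id": "78ef1f76", "block_type": "WORD", "text": "your", "relationships": []},
    {"block_id": "9f41e657", "block_type": "WORD", "text": "past,", "relationships": []},
    {"block_id": "01ae2b2a", "block_type": "WORD", "text": "because", "relationships": []},
]

# Index the blocks by id so children can be looked up directly.
by_id = {row["block_id"]: row for row in rows}


def child_texts(block):
    """Return the text of every CHILD block that is present in `rows`."""
    child_ids = [
        child_id
        for kind, ids in block["relationships"]
        if kind == "CHILD"
        for child_id in ids
    ]
    # Children missing from this sample (the table above is truncated) are skipped.
    return [by_id[child_id]["text"] for child_id in child_ids if child_id in by_id]


line = by_id["23112b0d"]
words = child_texts(line)
print(" ".join(words))
```

Only the six WORD rows shown in the table are included, so the sketch prints the first part of the line: Don't forget about your past, because.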
Extracting Layout information from an Image
Similarly, you can extract layout information from an image through the following code:
SELECT * FROM extract_layout(
path => 'https://langdb-sample-data.s3.ap-southeast-1.amazonaws.com/Screenshot+from+2024-08-09+09-49-18.png',
type => 'image'
);
chunk
The chunk function breaks down large texts into smaller, manageable pieces. This is particularly useful for processing long documents, especially when working with models that have input size limitations.
Parameters
Parameter | Type | Optional | Description | Possible Values | Sample Value |
---|---|---|---|---|---|
raw_text | String | No | The raw text that needs to be chunked | Any String | 'In a quaint village...' |
type | String | No | Unit of chunking | Char, Word, Sentence, Paragraph | 'Char' |
chunk_size | Int | Yes | Number of units to be present in a Chunk | Any non-negative integer | 100 |
overlap | Int | Yes | Number of units to overlap between consecutive chunks | Any non-negative integer | 20 |
trim | Bool | Yes | Whether to trim whitespace from the start and end of each chunk | true, false | true |
Chunking Raw text into Char with Chunk Size
SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
type => 'Char',
trim => true,
chunk_size => 200);
text | index |
---|---|
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. | 0 |
Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting. | 1 |
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. | 2 |
As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. | 3 |
The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. | 4 |
She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance. | 5 |
Chunking Raw text into Word with Chunk Size and Overlap
SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
type => 'Word',
chunk_size => 30,
overlap => 10);
text | index |
---|---|
In a quaint village nestled in the heart of the countryside there lived a young girl named Lily She was known throughout the village for her vibrant imagination and her | 0 |
known throughout the village for her vibrant imagination and her love for adventure Every day Lily would set out to explore the lush forests and rolling hills that surrounded her | 1 |
explore the lush forests and rolling hills that surrounded her home always eager to discover something new and exciting One particularly sunny morning Lily decided to venture deeper into the | 2 |
particularly sunny morning Lily decided to venture deeper into the woods than she ever had before As she walked she stumbled upon a hidden grove filled with the most beautiful | 3 |
stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen The colors were so vivid and the petals so delicate that Lily couldnt help but | 4 |
and the petals so delicate that Lily couldnt help but marvel at their beauty She spent hours in the grove carefully examining each flower and breathing in their sweet fragrance | 5 |
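The output above is consistent with a sliding window over words: each chunk holds 30 words, and each new chunk starts 20 words (chunk_size minus overlap) after the previous one, so consecutive chunks share 10 words. The Python sketch below reproduces that behaviour for illustration. It is not LangDB's implementation; chunk_words is a hypothetical helper, and it normalises all whitespace to single spaces.

```python
import re

text = (
    "In a quaint village nestled in the heart of the countryside, there lived "
    "a young girl named Lily. She was known throughout the village for her "
    "vibrant imagination and her love for adventure. Every day, Lily would set "
    "out to explore the lush forests and rolling hills that surrounded her "
    "home, always eager to discover something new and exciting.\n"
    "One particularly sunny morning, Lily decided to venture deeper into the "
    "woods than she ever had before. As she walked, she stumbled upon a hidden "
    "grove filled with the most beautiful flowers she had ever seen. The "
    "colors were so vivid and the petals so delicate that Lily couldnt help "
    "but marvel at their beauty. She spent hours in the grove, carefully "
    "examining each flower and breathing in their sweet fragrance."
)


def chunk_words(raw_text, chunk_size, overlap):
    """Split text into windows of `chunk_size` words sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = re.findall(r"\w+", raw_text)  # drops punctuation, as in the output above
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_words(text, chunk_size=30, overlap=10)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

With the sample text this yields six chunks of 30 words each, starting at the same words as the rows in the table above.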
Chunking Raw Text into Sentences
SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
type => 'Sentence');
text | index |
---|---|
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily | 0 |
She was known throughout the village for her vibrant imagination and her love for adventure | 1 |
Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting | 2 |
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before | 3 |
As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen | 4 |
The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty | 5 |
She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance | 6 |
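Note that the sentences in the output above no longer carry their closing period. A naive splitter that behaves the same way on this sample can be sketched in Python as follows. This is only an illustration under the assumption that sentences end with '.', '!' or '?'; the split_sentences helper is hypothetical, it would mishandle abbreviations such as "e.g.", and LangDB's own sentence detection may work differently.

```python
import re

paragraph = (
    "In a quaint village nestled in the heart of the countryside, there lived "
    "a young girl named Lily. She was known throughout the village for her "
    "vibrant imagination and her love for adventure. Every day, Lily would set "
    "out to explore the lush forests and rolling hills that surrounded her "
    "home, always eager to discover something new and exciting."
)


def split_sentences(raw_text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    pieces = re.split(r"[.!?]+", raw_text)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_sentences(paragraph)
for index, sentence in enumerate(sentences):
    print(index, sentence)
```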
Chunking Raw Text into Paragraphs
SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
type => 'Paragraph');
text | index |
---|---|
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting. | 0 |
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance. | 1 |
Combining functions
We have seen how these functions behave individually, but the real power of LangDB lies in combining them. Let's take a job description PDF as an example.
First, we will use load to convert the file into bytes and then extract_text to get all the raw text from it.
After that, we will chunk by Char with a chunk_size of 2000.
select * from chunk(
(
select content from extract_text((
select * from load('https://www.stjohneyehospital.org/wp-content/uploads/2024/05/Job-Description-Accountant.pdf',
type=> 'pdf')
))
),
chunk_size => 2000,
type => 'Char',
trim => false
)
text | index |
---|---|
ST. JOHN EYE HOSPITAL – JERUSALEM JOB DESCRIPTION Title Accountant Department Finance Section Reports to Director of Finance Hours 40 hrs per week (inc of lunch breaks) Date February 24 formulated/updated General Statement of Duties: To play a major role in controlling the costing system of purchases and payroll by supporting the existing accountants and providing reports as instructed by the Director of Finance. Main Responsibilities: | 0 |
1. All staff are expected to report for work on time and fulfil their hours of duty, from time to time some flexibility may be required in order to meet the needs of the job and this may be outside regular hours of work. | 1 |
sexual orientation. | 2 |
By leveraging functions like load, extract_text, extract_layout and chunk, LangDB equips developers with a powerful toolkit for overcoming unstructured data challenges. Whether you're dealing with disorganized text, intricate document layouts, or vast amounts of data, these functions provide the versatility and efficiency needed to convert raw information into structured, actionable insights. LangDB not only simplifies the complexity of data extraction and processing but also enhances the overall productivity of your development workflow.