Open-source SPL: The Breaker of Closed Database Computing System
jbx1279
Posted on June 11, 2023
Databases have a closed data processing capability. Being closed means that data must be loaded into the database before it can be computed or processed. There is a clear line between data inside the database and data outside it.
Databases are generally used for OLTP and OLAP. OLTP requires data consistency, which only makes sense within a defined scope. This naturally results in a closed database system, one that will and can only ensure the consistency of the data inside it.
OLAP, however, does not demand data consistency. But data warehouses that perform OLAP are customarily built on databases, so they become closed as well.
The “base” part of the name makes it easy to believe, wrongly, that databases are intended mainly for data storage, though that is not the case. In real-world situations they are mostly used for computing data, particularly in big data scenarios, and almost exclusively so in OLAP scenarios. Database storage primarily serves computation, too. Databases have rather good computing abilities, far better than most other software products, so they are often deployed to meet computing needs.
Yet for a product intended for computation, a closed system is not only unnecessary but harmful. That is why databases are becoming more and more awkward at handling data processing tasks.
Problems of the closed database system
ETL becomes ELT, or even LET
A typical phenomenon is that ETL has become ELT, or even LET. An ETL process covers three steps, E, T and L, in that order, and ideally this reasonably designed order would be maintained. But because databases lock their computing abilities inside themselves and cannot process outside data, we have to load data into them first in order to perform E and T, which are themselves kinds of computations. To make matters worse, the source data of an ETL process often is not database data, or at least is not in the database used for the current computation, but comes from multiple databases. In any case, the data has to be gathered into a single database before it can be computed, so the L step comes before E and T, disrupting the planned E-T-L order.
The closed database system also acts as a checkpoint: all data imported into or exported from the database must pass through it and undergo certain checks. This is necessary, though not IO-efficient, for OLTP, which requires data consistency; for ETL it is useless and a waste of time.
Intermediate tables that are resource-consuming and cause tight coupling
Intermediate tables are another problem. When the volume of original data is large, summarizing it directly (for report queries, say) gives poor performance and a bad user experience. The solution is to perform certain aggregate operations in advance and build further computations or data presentation/visualization on the intermediate results. At other times the computing logic is complex, and performing it ad hoc at report generation would make report development too complicated; to keep the process clear, the original data is pre-computed and the results are stored as intermediate data. These intermediate results are generally stored in the database as tables, known as intermediate tables. Storing them in the database makes it possible to use the database's computing abilities: intermediate tables are seldom used directly but need a bit more processing for report queries, and data tables, compared with other forms of storage, are convenient to compute on in SQL.
The aggregate operations needed at the front end are not static; they change and grow frequently, so more and more intermediate tables accumulate.
The ETL variations – ELT and LET – also result in many intermediate tables due to the compulsory prior data loading.
Intermediate tables, on the one hand, occupy a large amount of precious database storage space, which makes scaling expensive; on the other hand, they consume database computing resources, as they have to be updated periodically (usually by scheduled stored procedures), reducing database performance.
Too many intermediate tables also bring management problems. Databases organize and manage data tables in a flat, linear way, which works well when there are only a few of them but becomes a mess when there are too many (tens of thousands, say). The usual way to administer huge numbers of items is a multilevel tree structure, but relational databases do not support it (though they have the concept of a schema, which can be seen as a two-level structure). RDBs cope with this by giving tables long, awkward names to classify and distinguish them, which is inconvenient and demands high development and management discipline. And when naming rules are not observed and tables get improper names, especially under urgent deadlines, a huge number of messy intermediate tables build up over time.
Intermediate tables also cause tight coupling between applications. The database is a separate process providing independent computing abilities that do not belong to any application; multiple applications share one database and access its resources. It is therefore likely that intermediate tables generated by one application (or module) are used by other applications (or modules), coupling them tightly together. An intermediate table cannot be deleted, even when the application that created it is already offline, because it might still be used by others. The endless accumulation of intermediate tables aggravates the shortage of database storage space and computing resources.
Inability to handle diverse data source scenarios
Diverse data sources have become common for today's applications, and a lot of data comes from external services. Loading external data into a database before computing it is very inconvenient. Ad hoc loading is too inefficient (database IO is time-consuming) to serve intensive access requests, while batch loading at regular intervals tends to miss the latest data and produces less real-time results.
Loading and storing external data in a database also produces a lot of intermediate tables, with all the problems described above. Moreover, data scraped directly from websites usually arrives as multilevel JSON or XML, which requires multiple associated tables in a relational database to store, making the intermediate table problem even worse.
Security and coupling problems brought by stored procedures
The frequently used stored procedures have numerous disadvantages (they are hard to migrate, hard to code and hard to debug, for example), but they support stepwise coding and let users exploit the database's computing abilities. Since a closed database can only compute the data inside it, and complex computing logic is hard to express in a single SQL statement, stored procedures are often the only choice.
Stored procedures used for queries and analyses change frequently. Because stored procedures are tightly bound to the database, submitting every modification to the database administrator for review and compilation clearly reduces work efficiency, while granting programmers the privilege to compile stored procedures keeps efficiency but poses serious security threats: the privilege is far too high, and programmers may mistakenly delete stored procedure code, or even the data of other applications. In fact, if stored procedures are not used, read privilege alone is enough for report queries and cannot cause erroneous writes to the database.
Stored procedures can also create tight coupling between applications. The underlying reason is the same as with intermediate tables: one stored procedure may be shared by multiple applications (or modules).
Low big data processing performance
As the business expands, data accumulates to an immense size and query performance drops, sometimes to a level that affects the production system. To keep the system stable, people separate historical cold data from current hot data and store them in different databases. Queries over the huge volume of cold data then no longer threaten system stability, and queries over the small volume of hot data put little pressure on the production system.
Yet the separation of cold and hot data makes real-time (T+0) queries hard to implement. A closed database system does not allow, or at least makes difficult, computations over data outside the working database, such as cross-database queries, which are exactly what querying full data in real time requires once cold and hot data are stored separately. There are techniques for querying across databases of the same type, though their performance is poor. In real-world situations, cold and hot data are often kept in databases of different types: the hot data typically sits in an RDB good at OLTP, while the cold data sits in a data warehouse intended for OLAP. Performing queries across databases of different types is particularly difficult.
Problems triggered by the closed database system have become increasingly noticeable as technologies advance. The conventional “base”-style computing is becoming awkward at meeting the requirements of today's application frameworks.
The open-source esProc SPL offers an effective response to these database problems.
SPL (Structured Process Language) is a computing engine specializing in structured data computation. It provides a wealth of class libraries, supports procedural computation, and excels at accomplishing various complex computations. Its computing system is open: it supports mixed computations over diverse data sources without loading the data into any “base”. The language supports hot swapping and offers a standard JDBC driver.
SPL’s open computing solution
Compute diverse/multiple sources directly
SPL can compute data from different sources directly, avoiding the inefficient, non-real-time data loading that databases require. Each data source has its own strengths: files have efficient IO, NoSQL stores document data well, and RDBs have excellent computing abilities. Loading everything into a database throws these strengths away.
SPL's highly open data source support covers almost all common data sources, enabling both data retrieval and efficient cross-source mixed computations that make effective use of each source's advantages.
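As a minimal sketch of such a mixed computation (the data source name mysql1, the table orders and the file Category.csv are hypothetical), an SPL script can join a database table with a local CSV file directly:

A1  =connect("mysql1")
A2  =A1.query@x("select CategoryID, Amount from orders")
A3  =file("Category.csv").import@tc()
A4  =join(A2:o, CategoryID; A3:c, CategoryID)
A5  =A4.groups(c.CategoryName; sum(o.Amount):Total)

A1 and A2 fetch the database side and close the connection, A3 reads the CSV with its title row, A4 joins the two table sequences, and A5 aggregates the joined result; no data is loaded into any database along the way.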
Let ETL go as E-T-L
SPL has independent computational capabilities and can handle various data processing scenarios with diverse or multiple data sources. It can accomplish the E and T steps of ETL outside the database and then load the prepared data, restoring ETL to its original sequence.
Sometimes the result of ETL does not even need to be loaded into the database; it can be stored as text files or in SPL's proprietary data format, which delivers high query performance and reduces database load further.
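A simplified E-T-L sketch might look like the script below; the source file orders.csv, the target data source dw and the target table customer_monthly are hypothetical, and the update() call assumes the target table already exists with matching fields:

A1  =file("orders.csv").import@tc()
A2  =A1.select(Amount>0)
A3  =A2.groups(CustomerID, month(OrderDate):Month; sum(Amount):MonthlyAmount)
A4  =connect("dw")
A5  =A4.update@i(A3, customer_monthly)
A6  =A4.close()

A1 extracts the raw data, A2 and A3 transform it (cleansing and pre-aggregation) entirely outside the database, and only the prepared result is loaded in A5, so the E and T steps no longer consume database resources.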
Outside-database “stored procedures” avoid security problems
SPL's independent computational capacity enables it to compute data without a database. Like stored procedures, SPL supports procedural coding that implements arbitrarily complex computations step by step. The two strengths of stored procedures, computing ability and stepwise coding, thus find a better vehicle, which can be called the “outside database stored procedure”.
The creation and use of “outside database stored procedures” do not require database write privilege, getting rid of security risks brought by in-database stored procedures.
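Below is a minimal sketch of such an “outside database stored procedure”; the data source reportdb and the employee table are hypothetical. The script only needs a read-only database account, is saved as an ordinary SPL script file, and can be called from an application through SPL's JDBC driver much like a stored procedure:

A1  =connect("reportdb")
A2  =A1.query@x("select EmpID, Dept, Salary from employee")
A3  =A2.groups(Dept; count(~):Emps, avg(Salary):AvgSalary)
A4  =A3.select(Emps>=10).sort(-AvgSalary)
A5  return A4

The whole logic, query, aggregation, filtering and sorting, runs step by step in SPL rather than inside the database, so modifying it needs no database privilege beyond read access.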
Replace intermediate tables with files to reduce coupling and database load
Since SPL can compute files directly thanks to its diverse data source support, the data that used to live in indispensable intermediate tables can be moved out of the database and stored in the file system, and SPL's independent computational capabilities can then process this intermediate data.
Viewed from the perspective of in-database intermediate tables, this amounts to implementing “outside database intermediate tables”. They disengage intermediate data from the database, lessening its load, and allow the intermediate files to be managed with the file system's tree structure. Intermediate data can be stored together with the application or module it belongs to (just as the outside-database “stored procedures” are), preventing it from being shared by unrelated applications or modules and thus eliminating coupling.
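A sketch of the idea, with hypothetical names throughout (data source proddb, table sales, file sales_summary.btx kept in the reporting module's own directory): a scheduled script pre-aggregates the detail data and writes the result to a file in SPL's binary format.

A1  =connect("proddb")
A2  =A1.cursor@x("select ProductID, Region, Amount from sales")
A3  =A2.groups(ProductID, Region; sum(Amount):Total)
A4  =file("sales_summary.btx").export@b(A3)

A report query later computes on that file instead of an intermediate table, for example:

A1  =file("sales_summary.btx").import@b()
A2  =A1.select(Region=="East").sort(-Total)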
Separation of cold data and hot data to achieve T+0 queries
SPL offers a data routing capability that locates the right source(s) according to the computing requirements and performs cross-data-source mixed computations. Even after cold data and hot data are separated, full-data queries, including queries in SQL, remain convenient, and the routing is transparent to the upper-level application, as if everything ran against a single database.
We can store the historical cold data in files (which offer fast IO, flexible compression and convenient parallel processing) and use SPL's cross-data-source ability to perform mixed computations between the database holding the hot data and these files, achieving high-performance T+0 queries. SPL's high-performance proprietary data format can boost computing performance further.
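The following sketch, with hypothetical names (data source proddb for the hot data, file orders_hist.btx for the cold data), sums today's hot orders from the database and the historical cold orders from a file, then combines the two for a T+0 total:

A1  =connect("proddb")
A2  =A1.query@x("select sum(Amount) Amt from orders where OrderDate>=?", date(now()))
A3  =file("orders_hist.btx").cursor@b().groups(; sum(Amount):Amt)
A4  =A2(1).Amt + A3(1).Amt

The same pattern extends to grouped queries: compute the aggregation on each side and merge the partial results, so the application sees one full-data answer while the production database carries none of the historical load.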
The open SPL system also integrates well with today's application frameworks. SPL can be embedded into microservices to take over data processing jobs from Java, or act as the computing engine of a data center, which makes it highly adaptable to modern application architectures.
Origin: https://blog.scudata.com/open-source-spl%ef%bc%9a-the-breaker-of-closed-database-computing-system/
SPL Source Code: https://github.com/SPLWare/esProc