SPL: The Open-source Java Library to Process Structured Data
jbx1279
Posted on March 6, 2023
The general trend of contemporary Java application frameworks is towards the separation of data storage from data processing, aiming to be more maintainable, more scalable and more migratable. One typical example is the currently hot microservices. The new framework requires that business logic, rather than performed in the database as the conventional framework designs, should be implemented within the Java program.
On most occasions, the business logic in an application involves structured data processing. Databases (based on SQL) give a lot of supports for the processing, enabling to implement the business logic in a relatively simple way. Such supports, however, has always been absent in Java, making it complicated and inefficient to implement the business logic with the language. As a result, development efficiency becomes sharply lower while the advantages of framework are clear.
The Java difficulty would be solved if we could offer the high-level language a complete set of structured data processing and computing class libraries, letting it boasting the merits of framework and maintaining a high development efficiency.
What abilities the library should have?
What kind of features an ideal structured data processing class library under Java needs to have? Let’s try to sum them up according to SQL:
1 Ability to achieve set-oriented operations
Structured data often exists in batch (in the form of a set). The ability of performing set-oriented operations is thus important for computing structured data.
With only the basic array data type (which is the counterpart of set) and without a special class library of handling set-oriented operations, even a simple sum of members in a set needs a four or five lines of loop statement, and a filtering/grouping & aggregation operation involves hundreds of lines of code.
SQL gives direct support for many set-oriented operations, including SUM/COUNT for aggregation operations, WHERE for filtering, GROUP for grouping operation, and basic operations such as intersection, union and difference to produce quite short code.
2 Lambda syntax
Are the set-oriented abilities represented by a set of class libraries enough to make Java as capable as SQL?
The fact is that it takes more than those libraries to obtain SQL’s capabilities.
Take the filtering operation as an example. The operation selects certain members of a set according to a specified condition. SQL expresses the condition with an expression, such as WHERE x>0 for retaining members that make result of computing x>0 true. The expression x>0 is computed on each member of the set during traversal rather than in advance before execution of the SQL statement. The expression, in essence, is a function that takes members of the current set as parameters. And the WHERE operation, on the other hand, uses a function defined in the form of an expression as its parameter.
There is a special term for this kind of syntax, known as Lambda syntax, or functional language.
Without the Lambda syntax, we need to define functions on an ad hoc basis, which generates extremely complicated code and where a name collision is likely to occur.
The Lambda syntax is extensively used in SQL to significantly simplify code, not only for filtering and grouping operations that require it but also in scenarios, such as calculating computed columns, which do not have a necessary demand for it.
3 Direct references of fields in Lambda syntax
Structured data does not simply consist of single values, but records made up of fields.
In most cases, a SQL expression parameter references a field of record directly by the name without specifying the corresponding record. Only when fields to be referenced have same names that they need to be preceded by table names (or their aliases) to be distinguished from each other.
Java began to support the Lambda syntax in its newest versions. It takes the current record as a parameter to pass it to a function defined with the Lambda syntax, and then has to always write the record before each field. To calculate amount by multiplying the unit price by the quantity, for instance, the formula will be the complicated “x.UnitPrice*x.Quantity”. SQL, however, is able to write it in the intuitive form of “UnitPrice*Quantity”.
4 Support of dynamic data structure
SQL supports dynamic data structure excellently.
Often the return value of a structured data computation is structured data, too. The data structure of a result set is determined by the current operation, and cannot be created in advance before coding. It is thus necessary for a language to have good support for dynamic data structure.
Any SQL SELECT statement will generate a new data structure. SQL allows adding or deleting a field as needed during coding, without defining the structure (class) beforehand. Java is different. The high-level language requires defining the future structure (class) during code compilation phase, and does not permit, in principle, generating a new structure dynamically during execution.
5 The interpreted language
According to the above analysis, we conclude that Java is not suitable for processing structured data. Its Lambda syntax mechanism does no support feature 3, and as a compiled language, Java is unable to achieve feature 4.
Actually, a compiled language is not the best vehicle for the Lambda syntax. The compiler alone cannot decide whether it should first calculate the expression written in the place of a parameter immediately to pass the value, or it needs to compile the whole expression as a function to pass. Additional syntactic symbols are needed to help it determine what to do. An interpreted language does not have the problem. Whether to calculate the expression acting as a parameter instantly or during traversal of members of the set is determined by the function itself.
SQL is such an interpreted language.
Introduction of SPL
Stream is the structured data processing class library released officially with Java8. Yet it does not meet the five features’ requirements. It does not have special structured data types and lacks functions for many important structured data computations. It isn’t the interpreted language, does not support dynamic data types, and provides too complicated Lambda interface.
Kotlin is a part of the Java ecosystem. It is given small improvements based on Stream and is equipped with structured data types, but it does not have enough functions for structured data computations, isn’t the interpreted language, does not support dynamic data types and offers complicated Lambda interface, too. It is not a good choice for computing structured data.
Scala offers a larger number of structured data computing functions, but a compiled language cannot make a suitable structured data computation class library.
Is there a really suitable library that can run in a Java ecosystem to handle structured data?
We have esProc SPL.
The SPL programming language is interpreted and executed based on Java, and offers rich structured data computation class libraries, simple Lambda syntax, and convenient, easy-to-use dynamic data structures. These make it the best structured data computation class library run under Java.
Rich functions for set-oriented operations
SPL supplies table sequence, the special structured data type. Same as a SQL data table, a table sequence is a set made up of a batch of records and offers functionalities a structured data type generally has. For example:
To parse the source data and generate a table sequence:
Orders=T("d:/Orders.csv")
To generate a new table sequence with same column names from an existing table sequence:
Orders.new(OrderID, Amount, OrderDate)
To generate a computed column:
Orders.new(OrderID, Amount, year(OrderDate))
To rename fields:
Orders.new(OrderID:ID, SellerId, year(OrderDate):y)
To reference fields by their ordinal numbers:
Orders.groups(year(_5),_2; sum(_4))
To rename a table sequence (a left join):
join@1(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))
A table sequence allows all structured data computation functions to compute it, and the result is also a table sequence instead of any other data type, such as Map. To further process the structured result set returned by a grouping & aggregation operation, for instance, SPL has the following code:
Orders.groups(year(OrderDate):y; sum(Amount):m).new(y:OrderYear, m*0.2:discount)
Based on table sequence, SPL provides a great number of structured data processing functions, including filtering, sorting, grouping, distinct, rename, getting computed column, joins, subquery, set-oriented computations and order-based computations. These functions possess powerful computational abilities, enabling them to achieve computations independently without the help of hardcoding.
A composite query:
Orders.select(Amount>1000 && Amount<=3000 && like(Client,"*bro*"))
Sorting:
Orders.sort(-Client,Amount)
Grouping & aggregation:
Orders.groups(year(OrderDate),Client; sum(Amount))
Inner join:
join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))
Concise Lambda syntax
SPL supports simple Lambda syntax, without the need of defining function name and body. It can directly use an expression as a function’s parameter. To perform a filtering, for instance:
Orders.select(Amount<1000)
There is no need to reconstruct the function when business logic is modified. Users just need to change the expression:
Orders.select(Amount>1000 && Amount<2000)
As an interpreted language, SPL does not require defining the parameter type for a parameter expression, creating simpler Lambda. To try to calculate squares during sum for calculation of the sum of squares, for instance, the SPL code is intuitive:
Orders.sum(Amount*Amount)
Same as SQL, SPL syntax also supports using the field names directly when performing computations on a single table:
Orders.sort(-Client, Amount)
Dynamic data structure
SPL, as an interpreted language, intrinsically supports dynamic data structure. It can generate a new table sequence dynamically according to the structure of the result set, which makes it particularly suited to handling a series of scenarios, including getting computed column, grouping & aggregation and joins. To further compute the result set of a grouping & aggregation operation, for instance:
Orders.groups(Client;sum(Amount):amt).select(amt>1000 && like(Client,"*S*"))
And to compute the result set of a join directly:
join(Orders:o,SellerId ; Employees:e,Eid).groups(e.Dept; sum(o.Amount))
A complicated computation is often implemented step by step. The intermediate result of almost each step has its own data structure. With the support for dynamic data structure, SPL does not require defining structures of the intermediate results. For instance, we are trying to find customers whose payments rank in top 10 in each month based on the customer payment table in a certain year:
Sales2021.group(month(sellDate)).(~.groups(Client;sum(Amount):sumValue))
.(~.sort(-sumValue)) .(~.select(#<=10)).(~.(Client)).isect()
Executing SQL directly
SPL also implements SQL interpreter to execute SQL directly, from the basic WHERE and GROUP to JOIN and even WITH:
$select * from d:/Orders.csv
where (OrderDate<date('2020-01-01') and Amount<=100)
or (OrderDate>=date('2020-12-31') and Amount>100)
$select year(OrderDate),Client ,sum(Amount),count(1) from d:/Orders.csv
group by year(OrderDate),Client<br>having sum(Amount)<=100
$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o
join d:/Employees.csv e on o.SellerId=e.Eid
$with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t
left join ClientTable ct on t.Client=ct.Client
More syntactic advantages
As a special structured data processing language, SPL offers all computational capabilities SQL has, and moreover, boasts more merits in terms of syntax:
Thorough set-orientation based on discreteness
Set-orientation is one of SQL’s basic characteristics, which allows data to be computed in the form of a set. But SQL has weak discreteness, requiring that members of a set be computed as a whole instead of separately. High-level languages, such as Java, has good discreteness and supports computing members of an array individually.
To achieve deeper set-orientation needs high discreteness that allows members of a set to exist independently and to be able to form a new set together with other data for computations.
SPL incorporates SQL’s set-orientation and Java’s discreteness to achieve the thorough set-orientation.
For example, it is easy to express “a set of sets” in SPL, which makes it convenient to implement a post-grouping computation. To find students whose scores of all subjects rank in top 10:
A
1 =T("score.csv").group(subject)
2 =A2.(~.rank(score).pselect@a(~<=10))
3 =A1.(~(A3(#)).(name)).isect()
A SPL table sequence’s field can store records or sets of records, making it possible to express any number of join relationships intuitively using object references. To find male employees whose managers are female, for instance:
Employees.select(gender:"male",department.manager.gender:"female")
Order-based computations are the typical product of the cooperation of discreteness and set-orientation. A member gets orderliness only when it belongs to a set, which is the effect of set-orientation. An order-based computation, on the other hand, requires independent members that can be separated from their neighbors, which is the discreteness feature stresses. SPL synthesizes and reinforces set-orientation and discreteness to be able to support order-based computations intrinsically.
Specifically speaking, SPL can reference a member according to its absolute position. SPL gets the 3rd orders record, for instance, using Orders(3), and retrieves the 1st, the 3rd and the 5th records with Orders([1,3,5]).
SPL can also reference a member by its relative position. To calculate the growth rate of amount in each record compared with the directly previous amount, the SPL code is:
Orders.derive(amount/amount[-1]-1)
Employees.group(#%2==1) The language uses the pound sign # to represent the sequence number of the current record. For instance, to group employees by different types of sequence numbers (those with odd sequence numbers are put into one group and those with even ones are placed to the other group), SPL generates the code :
Employees.group(#%2==1)
Convenient function syntax
The numerous powerful structured data computation functions should have been a great advantage, but those with similar functionalities are inconvenient to distinguish from each other, making a language harder to learn.
SPL solves the problem by offering the unique function options syntax. Functions having the same kind of functionality share one name but use options to serve different purposes. The basic functionality of select function, for instance, is to perform filtering. To select only the first-found record that meets the specified condition, SPL provides @1 option:
Orders.select@1(Amount>1000)
When the volume of data to be handled is huge, SPL uses @m option to implement parallel processing in order to increase performance:
Orders.select@m(Amount>1000)
With the ordered data, SPL has @b option to perform a fast filtering using the binary search:
Orders.select@b(Amount>1000)
SPL allows function options to work together, for instance:
Orders.select@1b(Amount>1000)
Usually, parameters in a structured data computation function are complicated. SQL uses a lot of different keywords to divide parameters in one statement into groups, resulting in inconsistent statement structures.
SPL writes parameters on three different levels with semicolon, comma and colon in descending order. The general way greatly simplifies the expression of complicated parameters:
join(Orders:o,SellerId ; Employees:e,EId)
Extended Lambda syntax
The general Lambda syntax needs to both specify the expression (the functional parameter) and define a complete set of parameters for the expression, which makes it strictly logical in mathematical form but complicates the syntax. To filter set A using the loop function select and only retain members whose values are even numbers, the general Lambda formula is:
A.select(f(x):{x%2==0})
x%2==0 is the expression, whose parameter is the x in f(x). x is the loop variable representing a member in set A.
SPL uses the tilde ~ to represent a loop variable. Programmers do not need to specially define a parameter that is a loop variable. So, the above Lambda formula can be simply written as :
A.select(~%2==0)
Generally, Lambda syntax will define each parameter in an expression. Besides the loop variable, another commonly-used parameter is the loop count, whose definition makes a Lambda formula more complicated.
SPL offers character # to represent a loop count variable. To filter set A using the select function and only retain members whose sequence numbers are even, the SPL code can be
A.select(#%2==0)
Complex computations often involve the hard to express relative positions. Their expression is extremely complicated.
SPL represents a relative position using the fixed form of [sequence number]:
A B
1 =T("Orders.txt") /The orderstable sequence
2 =A1.groups(year(Date):y,month(Date):m; sum(Amount):amt) /Group and sum by the year and month
3 =A2.derive(amt/amt[-1]:lrr, amt[-1:1].avg():ma) /Calculate the link relative ration and moving average
//SPL code from http://c.raqsoft.com/article/1642764366712
Seamless integration, loose coupling & hot swap
As a scripting language interpreted by Java, SPL offers JDBC driver to be able to be integrated into a Java program seamlessly.
Similar to a SQL statement, A simple SPL statement can be executed directly:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
PrepareStatement st = conn.prepareStatement("=T(\"D:/Orders.txt\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))");
ResultSet result=st.execute();
A more complicated piece of SPL code can be stored as a script file and invoked in the way of calling a stored procedure:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call splscript1(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
Placing script files outside the Java program reduces code coupling, as well as achieving hot swap based on interpreted execution. Any change of the business logic will take into effect immediately by just modifying the script, rather than restarting the application as Java does. The mechanism is particularly suitable for writing business logic within the microservice architecture.
Origin: https://blog.scudata.com/spl-the-open-source-java-library-to-process-structured-data/
SPL Source code: https://github.com/SPLWare/esProc
Posted on March 6, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.