From Stream to Kotlin and Finally to SPL

jbx1279

Posted on January 5, 2023

During Java development we often encounter structured data computations where a database is unavailable or inconvenient to use. Early versions of Java provided no class library for computing structured data, so even basic operations such as sorting and grouping had to be hardcoded, and development efficiency was extremely low. Java 8 then introduced Stream. With Lambda expressions, chain coding and set-oriented functions, it finally gave the language a class library for structured data computation.

Stream simplifies structured data computations

For example, sorting:

Stream<Order> result=Orders
.sorted((sAmount1,sAmount2)->Double.compare(sAmount1.Amount,sAmount2.Amount))
.sorted((sClient1,sClient2)->CharSequence.compare(sClient2.Client,sClient1.Client));

In the above code, sorted is a set-oriented function for sorting data conveniently. The syntax "(parameter)->function body" is a Lambda expression that can simplify the definition of an anonymous function. The continuous use of two sorted functions is the style of chain coding, which makes the multi-step computing process more intuitive.

Stream does not have enough computing ability

Take the previous sorting example. It should be enough for the sorted function to know the sorting field and direction (ascending or descending), as in the SQL sort syntax "…from Orders order by Client desc, Amount", yet the data type of the sorting field must also be specified. The sorting direction could simply be represented by asc/desc (or +/-), but Stream uses a compare function to express it. Moreover, it is somewhat counterintuitive that the sorting fields must be written in the reverse of their intended priority: the minor key (Amount) comes first and the major key (Client) last.
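The reversed field order works because sorting an ordered stream is stable, so the second sort preserves the Amount order within each Client. The following is a minimal sketch of that behavior, not code from the original example: the Order class, the sample rows and the method names are hypothetical. It also shows a single-comparator form that lists the keys major first, for comparison.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortOrderDemo {
    // Hypothetical minimal Order type; field names follow the article's example.
    static class Order {
        final String Client;
        final double Amount;
        Order(String client, double amount) { this.Client = client; this.Amount = amount; }
        @Override public String toString() { return Client + ":" + Amount; }
    }

    // Hypothetical sample rows.
    static List<Order> sample() {
        return List.of(new Order("A", 30), new Order("B", 10),
                       new Order("A", 20), new Order("B", 40));
    }

    // Two chained sorts: the minor key (Amount) first, the major key (Client) last.
    static List<String> chained() {
        return sample().stream()
            .sorted((o1, o2) -> Double.compare(o1.Amount, o2.Amount))
            .sorted((o1, o2) -> o2.Client.compareTo(o1.Client))
            .map(Order::toString)
            .collect(Collectors.toList());
    }

    // One comparator, keys listed major first: Client descending, then Amount ascending.
    static List<String> singleComparator() {
        Comparator<Order> byClientDesc = Comparator.comparing((Order o) -> o.Client).reversed();
        return sample().stream()
            .sorted(byClientDesc.thenComparingDouble(o -> o.Amount))
            .map(Order::toString)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(chained());
        System.out.println(singleComparator());
    }
}
```

Both methods return the rows ordered by Client descending and then Amount ascending. Either way, the comparator still has to name the element type and spell out the direction through functions rather than a plain asc/desc marker.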

Let’s take a look at the grouping & aggregation operation:

Calendar cal = Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c = Orders.collect(Collectors.groupingBy(
    r -> {
        cal.setTime(r.OrderDate);
        return cal.get(Calendar.YEAR) + "_" + r.SellerId;
    },
    Collectors.summarizingDouble(r -> {
        return r.Amount;
    })
));
for (Object sellerid : c.keySet()) {
    DoubleSummaryStatistics r = c.get(sellerid);
    String[] year_sellerid = ((String) sellerid).split("_");
    System.out.println("group is (year):" + year_sellerid[0] + "\t (sellerid):" + year_sellerid[1] + "\t sum is:" + r.getSum() + "\t count is:" + r.getCount());
}

In the above code, each field name must be preceded by the corresponding table name, in the form "table name.field name", whereas the SQL equivalent just writes the field name. The anonymous function syntax is complex and quickly gets more complicated as the code grows longer; a nested query formed by two anonymous functions is even harder to decipher. A single grouping & aggregation computation involves multiple functions and classes, including groupingBy, collect, Collectors, summarizingDouble and DoubleSummaryStatistics, which makes the learning cost high. The result of grouping & aggregation is a Map rather than a structured data type, so further computation requires defining a new structure and converting types, which is not simple. Structured data computations commonly group by two fields, but the grouping function supports only one grouping variable. Making one variable represent two fields takes a trick: either create a two-field structured data type, or concatenate the two fields with an underscore. Either way complicates the code further.
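The first trick, a two-field structured data type used as the grouping key, might look like the sketch below. It assumes Java 16 or later for records, and the Order and YearSeller types, their fields and the sample rows are all hypothetical stand-ins (the year is stored directly to keep the example short).

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupByTwoFields {
    // Hypothetical order row; the year is stored directly to keep the sketch short.
    record Order(int year, int sellerId, double amount) {}

    // The two-field type that serves as the composite grouping key.
    // Records get equals/hashCode for free, which a Map key requires.
    record YearSeller(int year, int sellerId) {}

    static Map<YearSeller, DoubleSummaryStatistics> summarize(List<Order> orders) {
        return orders.stream().collect(Collectors.groupingBy(
            o -> new YearSeller(o.year(), o.sellerId()),
            Collectors.summarizingDouble(Order::amount)));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
            new Order(2021, 1, 100.0),
            new Order(2021, 1, 50.0),
            new Order(2022, 2, 70.0));
        summarize(orders).forEach((k, s) ->
            System.out.println(k.year() + "\t" + k.sellerId() + "\t" + s.getSum() + "\t" + s.getCount()));
    }
}
```

This avoids splitting a concatenated string afterwards, but the key type still has to be declared in advance and the result is still a Map.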

The underlying reason for Stream’s inadequate computing ability is that its base language, Java, is a compiled language and does not offer special structured data objects. This makes Stream lack solid low-level support.

As a compiled language, Java must define the structure of a result value in advance. A multi-step computing process therefore involves definitions of multiple data structures, which makes the code rather complicated and parameter handling inflexible, and a complicated set of rules is needed to implement the anonymous function syntax. An interpreted language, however, naturally supports dynamic structures and can conveniently treat a parameter expression as either a value parameter or a function parameter, generating an anonymous function in a much simpler way.

Kotlin was designed in part to improve this situation. It is a modern JVM-based programming language whose advances lie mainly in improving on Java syntax, particularly for Stream-style code. The result is more concise Lambda expressions and more set functions.

Kotlin has stronger computing ability than Stream

Take sorting as an example:

var result=Orders.sortedBy{it.Amount}.sortedByDescending{it.Client}

Kotlin does not need to specify the data type of the sorting field, express the sorting direction with a compare function, or define a named parameter for the anonymous function; it refers to “it” as the default parameter instead. The code is much shorter than the corresponding Stream code.

Advancements of Kotlin are not sufficient to meet computing needs

Let’s still look at the sorting operation. Although Kotlin provides “it” as the default parameter, the table name (it) still has to be written explicitly, even though in theory the field name alone is enough. In addition, a sorting function can sort by only one field; it cannot receive multiple fields dynamically.

Another instance is the grouping & aggregation:

data class Grp(var OrderYear: Int, var SellerId: Int)
data class Agg(var sumAmount: Double, var rowCount: Int)
var result = Orders.groupingBy { Grp(it.OrderDate.year + 1900, it.SellerId) }
    .fold(Agg(0.0, 0), {
        acc, elem -> Agg(acc.sumAmount + elem.Amount, acc.rowCount + 1)
    })
    .toSortedMap(compareBy<Grp> { it.OrderYear }.thenBy { it.SellerId })
result.forEach { println("group fields:${it.key.OrderYear}\t${it.key.SellerId}\t aggregate fields:${it.value.sumAmount}\t${it.value.rowCount}") }

In the above code, one grouping & aggregation action involves multiple functions, including a complicated nested function, and each field name is preceded by the table name. The grouping & aggregation result is not a structured data type, so a data structure has to be defined for each intermediate result.

After looking at more computations, such as set operations and joins, we find that even though Kotlin code is shorter than the Stream equivalent, all the Stream steps appear in it. The changes are mostly insignificant and trivial rather than radical.

Kotlin neither supports dynamic data structures nor offers special structured data objects. This means it cannot truly simplify Lambda syntax, reference a field without a table-name prefix, or perform dynamic multi-field computations (such as multi-field sorting) intuitively.

esProc SPL, however, gets structured data processing out of this dilemma in the Java ecosystem.

esProc SPL is an open-source structured query language that runs on the JVM. It provides dedicated structured data objects, a rich set of built-in functions, agile and concise syntax, and an integration-friendly JDBC driver, which makes it good at simplifying complex computations.

SPL has rich built-in functions to implement basic calculations

Sorting: =Orders.sort(-Client, Amount)

SPL does not need to specify the data type of the sorting field, use a function to specify the sorting direction, or precede a field with its table name. A single function also sorts by multiple fields dynamically.

Grouping & aggregation: =Orders.groups(year(OrderDate),Client; sum(Amount),count(1))

The result sets of both calculations are still structured data objects, which can be computed on directly in the next step. For grouping or summarizing by two fields, there is no need to define a data structure beforehand. Neither piece of SPL code contains extra functions, and the uses of the sum and count functions are concise and easy to understand, with little trace of nested anonymous functions.

Other calculations are coded just as simply:

Distinct:

=Orders.id(Client)

Fuzzy query:

=Orders.select(Amount*Quantity>3000 && like(Client,"*S*"))

Join:

=join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))
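For comparison, here is a rough Java Stream sketch of the same join-then-group computation. It assumes inner-join semantics and a unique EId per employee; the Order and Employee classes and the method name are hypothetical, and only the fields the join needs are modeled.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class JoinThenGroup {
    // Hypothetical row types; only the fields used by the join are modeled.
    static class Order {
        final int sellerId;
        final double amount;
        Order(int sellerId, double amount) { this.sellerId = sellerId; this.amount = amount; }
    }

    static class Employee {
        final int eId;
        final String dept;
        Employee(int eId, String dept) { this.eId = eId; this.dept = dept; }
    }

    static Map<String, Double> amountByDept(List<Order> orders, List<Employee> employees) {
        // Build a lookup from employee id to employee (assumes EId is unique).
        Map<Integer, Employee> byId = employees.stream()
            .collect(Collectors.toMap(e -> e.eId, Function.identity()));
        // Keep only orders with a matching employee (inner join on SellerId = EId),
        // then sum the amounts per department.
        return orders.stream()
            .filter(o -> byId.containsKey(o.sellerId))
            .collect(Collectors.groupingBy(
                o -> byId.get(o.sellerId).dept,
                Collectors.summingDouble(o -> o.amount)));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1, 100.0), new Order(1, 50.0), new Order(2, 70.0));
        List<Employee> employees = List.of(new Employee(1, "Sales"), new Employee(2, "R&D"));
        System.out.println(amountByDept(orders, employees));
    }
}
```

The join itself has to be hand-built from a lookup map and a filter, since Stream has no join function of its own.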

SPL offers a JDBC driver so that it can be invoked seamlessly from a Java program:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").groups(year(OrderDate),Client; sum(Amount))";
ResultSet result = statement.executeQuery(str);

SPL syntax is agile and concise, and has powerful computational capability

SPL streamlines computations with complex logic, such as stepwise computations, order-based computations and post-grouping computations, and easily handles many computations that are hard to express in SQL or stored procedures. For instance, suppose we want to find the top n customers whose order amounts together make up at least half of the total, with records sorted by amount in descending order:

    A                       B
1   …                       / Retrieve data
2   =A1.sort(amount:-1)     / Sort records by amount in descending order
3   =A2.cumulate(amount)    / Generate a sequence of cumulative amounts
4   =A3.m(-1)/2             / Get half of the total amount; the total is the last cumulative value
5   =A3.pselect(~>=A4)      / Get the position of the first record whose cumulative amount reaches half of the total
6   =A2(to(A5))             / Get eligible records
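To make the stepwise logic concrete, here is a minimal Java sketch of the same algorithm. It is an illustration only, not how SPL implements it; the Customer class, its fields and the method name are hypothetical, and the comments map each step to the grid cell it mirrors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopHalfCustomers {
    // Hypothetical customer row with a total order amount.
    static class Customer {
        final String name;
        final double amount;
        Customer(String name, double amount) { this.name = name; this.amount = amount; }
    }

    static List<Customer> topHalf(List<Customer> customers) {
        // A2: sort by amount in descending order (on a copy, leaving the input untouched).
        List<Customer> sorted = new ArrayList<>(customers);
        sorted.sort(Comparator.comparingDouble((Customer c) -> c.amount).reversed());

        // A4: half of the total amount.
        double half = sorted.stream().mapToDouble(c -> c.amount).sum() / 2;

        // A3 and A5: keep a running total and stop at the first position where it reaches half.
        double running = 0;
        int n = 0;
        for (Customer c : sorted) {
            running += c.amount;
            n++;
            if (running >= half) {
                break;
            }
        }

        // A6: the first n records.
        return sorted.subList(0, n);
    }

    public static void main(String[] args) {
        List<Customer> result = topHalf(List.of(
            new Customer("A", 10), new Customer("B", 40),
            new Customer("C", 30), new Customer("D", 20)));
        result.forEach(c -> System.out.println(c.name + "\t" + c.amount));
    }
}
```

With the sample rows in main, the total is 100, so the two largest customers (B and C, 70 together) are the first to reach half of it.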

Besides remarkable computational capability, SPL has unique advantages in system framework design, data source support, intermediate data storage and performance enhancement, enabling it to compute structured data outside of the database conveniently and efficiently.

SPL supports hot swap and stores code separately to reduce coupling

Let’s save the above SPL code as a script file and invoke it from Java by file name, in the same way a stored procedure is called:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call getClient()");

SPL is interpreted, so a modified script takes effect immediately, without recompiling or restarting the Java service. The SPL code is stored outside the Java application and is invoked by name. Keeping it independent of the Java code helps reduce coupling.

SPL supports diverse data sources and cross-data-source/cross-database mixed computations

SPL supports various types of databases and files (such as txt, csv and xls), NoSQL databases including MongoDB, Hadoop, Redis, ElasticSearch, Kafka and Cassandra, as well as multilevel data such as WebService XML and RESTful JSON:

    A
1   =json(file("d:/Orders.json").read())
2   =json(A1).conj()
3   =A2.select(Amount>p_start && Amount<=p_end)

To perform a cross-data-source join between the text file and the database in SPL:

    A
1   =T("Employees.csv")
2   =mysql1.cursor("select SellerId, Amount from Orders order by SellerId")
3   =joinx(A2:O,SellerId; A1:E,EId)
4   =A3.groups(E.Dept;sum(O.Amount))

SPL offers proprietary storage formats to store data temporarily or permanently and to enable high-performance computations

The language supports the btx storage format for temporarily storing data that comes from slow data sources, such as CSV:

    A   B
1   =[T("d:/orders1.csv"), T("d:/orders2.csv")].merge@u()   / Union records
2   file("d:/fast.btx").export@b(A1)    / Write to bin file

A btx file is small and fast to read and write. It can be computed on just like an ordinary text file:

=T("D:/fast.btx").sort(Client,-Amount)

Storing data in a btx file in a certain order enables high performance for computations such as parallel processing and binary search. SPL also supplies the ctx storage format, which brings extremely high performance. The ctx format supports data compression, column-wise/row-wise storage, distributed computing and high concurrency, and is suited to storing massive amounts of data permanently and computing on them at high performance.

In short, Stream made a breakthrough in implementing structured data computations outside the database. Kotlin has gone further, but the characteristics of a compiled language limit its progress. Only SPL, a language built specifically for structured data processing, offers a thorough solution to data handling outside the database.

Origin: https://blog.scudata.com/from-stream-to-kotlin-and-finally-to-spl/

SPL source code: https://github.com/SPLWare/esProc
