Memory views: Handling strings
Lorenzo (Mec-iS)
Posted on August 26, 2019
Brace yourself: this post is going to be a little less straightforward than the previous ones, as we are going to talk about how to read and write UTF-8 strings from and to WebAssembly (Wasm) memory views!
Unicode (a recap)
Feel free to skip to the next section if you are already familiar with Unicode.
Even though Rust compiled to Wasm and Python are very different languages (they probably sit at opposite ends of the spectrum among the most popular programming languages), they both take advantage of the Unicode Format for Network Interchange, which is fortunately nowadays an accepted and well-supported standard.
Some basic facts about UTF-8, the Unicode encoding used in this post:
... it encodes a character as a sequence of one to four bytes [1]
... only the shortest encoding for any given code point is considered well-formed; you can’t spend four bytes encoding a code point that would fit in three [1]
Every code point in the string is encoded in UTF-8 as a group of bytes, as small as 1 byte and up to 4 (in other words, UTF-8 is a variable-width encoding for Unicode built on 8-bit units); what a reader perceives as one character is usually a base code point, possibly followed by further code points that add details to the base. Unicode version 12.1 defines 137,994 characters, covering a large number of languages.
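The variable-width property described above can be checked directly in Python. This is a minimal sketch: the helper name `utf8_width` and the sample characters are chosen here for illustration only.

```python
import unicodedata


def utf8_width(ch: str) -> int:
    """Number of bytes that the UTF-8 encoding of a string takes."""
    return len(ch.encode("utf-8"))


# Escapes are used so the example does not depend on this file's encoding:
# U+0041 (A), U+00E9 (e with acute), U+2600 (black sun), U+1F600 (grinning face)
samples = ["A", "\u00e9", "\u2600", "\U0001F600"]
widths = [utf8_width(ch) for ch in samples]
print(widths)

# A base code point followed by a combining one:
# "e" + U+0301 COMBINING ACUTE ACCENT
combined = "e\u0301"
print(len(combined))                  # number of code points
print(len(combined.encode("utf-8")))  # number of UTF-8 bytes

# NFC normalization composes the pair into the single code point U+00E9
composed = unicodedata.normalize("NFC", combined)
print(composed == "\u00e9")
```

The four samples cover each of the possible widths, from one to four bytes.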
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee etc.), as well as many non-spoken languages (music notation, mathematical symbols, APL) and emoticons [source]
More information about this wonderful, Internet-omnipresent tool is available from the Unicode Consortium.
Strings
What we need to know for this experiment is that we can represent a string of characters as a list of 8-bit integers: the bytes of its UTF-8 encoding, where each code point takes one to four of them. To pass strings between Python and Wasm using python-ext-wasm we need to know:
- In Python, strings are immutable sequences of Unicode code points; open Python's IDLE and try:
# a Unicode string containing non-ASCII chars
>>> st = 'ABS@☀CC☯'
# the same string as a bytestring encoded in UTF-8 (hex)
>>> bytes(st, 'UTF-8')
b'ABS@\xe2\x98\x80CC\xe2\x98\xaf'
# UTF-8 (decimal) representation of the same bytestring:
# each element is one byte, as a Python integer
>>> [b for b in bytes(st, 'UTF-8')]
[65, 66, 83, 64, 226, 152, 128, 67, 67, 226, 152, 175]
The black sun with rays code point (U+2600) is represented in UTF-8 by three bytes, which can be written either as 3 hexadecimal values or, equivalently, as 3 positive decimal values:
#
# same code point, same encoding (different formats)
#
\xe2\x98\x80 ----> ☀ as UTF-8 (hex)
226 152 128 ----> ☀ as UTF-8 (decimal)
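To confirm that the two notations above describe the same three bytes, they can be compared and decoded back in Python. This is a small sketch that uses only the standard library; the variable names are illustrative.

```python
import unicodedata

# The same three bytes, written in decimal and in hexadecimal
sun_from_decimal = bytes([226, 152, 128])
sun_from_hex = b"\xe2\x98\x80"
same_bytes = sun_from_decimal == sun_from_hex

# Decoding the bytes as UTF-8 gives back a single code point
sun = sun_from_decimal.decode("utf-8")
print(same_bytes, hex(ord(sun)), unicodedata.name(sun))
```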
- Any UTF-8 encoded string can be represented as a List[int] in Python (one integer per byte); in this particular case, since we want to use UTF-8 to inter-operate with code in Rust, we need to use a Vec<u8> on its side; finally, at the Wasm level we have a linear memory, an array of cells addressed by index:
# pseudo-code
array_memory[0] = 226
array_memory[1] = 152
array_memory[2] = 128
It is now evident how we can store the UTF-8 bytes, written in decimal, in Wasm linear memory.
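The pseudo-code above can be turned into a runnable sketch. Note the assumption: a plain Python `bytearray` stands in for the Wasm linear memory here, and the helpers `store_utf8` and `load_utf8` are hypothetical names, not part of Wasmer.

```python
PAGE_SIZE = 64 * 1024  # a Wasm memory page is 64 KiB

# Assumption: a bytearray plays the role of the linear memory
array_memory = bytearray(PAGE_SIZE)


def store_utf8(memory: bytearray, offset: int, text: str) -> int:
    """Copy the UTF-8 bytes of `text` into `memory`, starting at `offset`.

    Returns the number of bytes written.
    """
    data = text.encode("utf-8")
    if offset < 0 or offset + len(data) > len(memory):
        raise IndexError("write would fall outside the linear memory")
    for i, byte in enumerate(data):
        memory[offset + i] = byte
    return len(data)


def load_utf8(memory: bytearray, offset: int, length: int) -> str:
    """Read `length` bytes from `offset` and decode them as UTF-8."""
    return bytes(memory[offset:offset + length]).decode("utf-8")


written = store_utf8(array_memory, 0, "\u2600")  # U+2600, black sun with rays
first_cells = list(array_memory[:3])
round_trip = load_utf8(array_memory, 0, written)
print(written, first_cells, round_trip == "\u2600")
```

The reader needs both the offset and the length, because the memory itself keeps no record of where a string ends.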
Wasmer memory views
We can think of every Wasm Instance as a mini-sandbox with its own abstractions over memory. Memory is accessed through "views": array-like objects that map the values of a given integer type to their positions in linear memory.
Details: the Wasmer runtime core provides these very powerful abstractions for working with Wasm memory; in this case we want to take a look at the python-ext-wasm Memory Views, whose interface methods are generated from a macro for every primitive type.
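As an analogy only, Python's built-in `memoryview` shows what a typed view over the same bytes looks like. This sketch does not use the Wasmer API; it assumes nothing beyond the standard library, and the variable names are illustrative.

```python
import sys

# Eight raw bytes standing in for a slice of linear memory
buffer = bytearray(8)

# Two views over the same bytes: one element per byte, and one element
# per native unsigned int (typically 4 bytes)
view_u8 = memoryview(buffer)
view_u32 = view_u8.cast("I")

# Write three bytes through the 8-bit view...
view_u8[0:3] = b"\xe2\x98\x80"

# ...and observe them through the wider view: same memory, different typing
first_word = view_u32[0]
expected = int.from_bytes(buffer[:view_u32.itemsize], sys.byteorder)
print(len(view_u8), len(view_u32), first_word == expected)
```

No data is copied: a write through one view is immediately visible through the other.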
Passing a string
Let's enter the main part of the experiment: how can we use all these powerful tools to make Python talk fluently to Rust/Wasm and receive an answer back?
Briefly:
- from Python, we are going to write our data to the Instance's memory using Wasmer views
- we are going to call a Wasm function (compiled from Rust), taken from the Instance's exported functions
- we are going to pass the result back to Python
The whole process is demonstrated by this function:
def test_reverse(instance, func, bytestr):
    # return a Wasmer Memory View for 8-bit integers
    mem_view = allocate_bytes(instance)
    # write the UTF-8 bytes (as decimal integers) to the view
    mem_view = write_to_memory(bytestr, mem_view, offset=0)
    # call the exported function, passing the offset of the data
    # in the view and the length of the bytes string (b'...')
    result = func(0, len(bytestr))
    # read the result value at a given position in linear memory
    return address_to_utf8(mem_view, result, len(bytestr))
The helper functions used are loaded from this mini Python-to-Wasmer API.
Running it on a sample byte string gives this output:
Reversing b'Test sTRing' >>>
b'gniRTs tseT'
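The helper functions are not reproduced in this post, so here is a hypothetical pure-Python stand-in that mimics the same flow without Wasmer. Everything in it is an assumption made for illustration: `FakeInstance` replaces a real Wasm Instance, its `reverse` method replaces the Rust-compiled function, and the bodies of `allocate_bytes`, `write_to_memory` and `address_to_utf8` are guesses based only on how they are called above.

```python
class FakeInstance:
    """Hypothetical stand-in for a Wasm Instance (not the Wasmer API)."""

    def __init__(self, size=64 * 1024):
        self.memory = bytearray(size)  # plays the role of linear memory
        self.exports = {"reverse": self._reverse}

    def _reverse(self, ptr, length):
        # Stand-in for the Rust function: reverse `length` bytes starting
        # at `ptr`, store them right after the input, return their address.
        out = ptr + length
        self.memory[out:out + length] = self.memory[ptr:ptr + length][::-1]
        return out


def allocate_bytes(instance):
    # Assumed behaviour: hand back a byte-level view of the memory
    return instance.memory


def write_to_memory(bytestr, mem_view, offset=0):
    # Assumed behaviour: copy the bytes into the view, one cell each
    for i, byte in enumerate(bytestr):
        mem_view[offset + i] = byte
    return mem_view


def address_to_utf8(mem_view, address, length):
    # Assumed behaviour: read `length` bytes back from `address`
    return bytes(mem_view[address:address + length])


def reverse_via_memory(instance, func, bytestr):
    # Same steps as test_reverse in the post
    mem_view = allocate_bytes(instance)
    mem_view = write_to_memory(bytestr, mem_view, offset=0)
    result = func(0, len(bytestr))
    return address_to_utf8(mem_view, result, len(bytestr))


instance = FakeInstance()
output = reverse_via_memory(instance, instance.exports["reverse"], b"Test sTRing")
print("Reversing b'Test sTRing' >>>")
print(output)
```

This stand-in reverses raw bytes, which is only safe for ASCII input: reversing the bytes of a multi-byte UTF-8 sequence no longer gives valid UTF-8.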
Thanks for your attention.
Sources:
[1] Jim Blandy, Jason Orendorff, "Programming Rust", O'Reilly, ISBN-13: 9781491927281, pp. 392-393
[2] kunststube.net/encoding