statically-allocated lists
- list1.nelua - stack-allocated lists
- list2.nelua - iota() with preallocated cons cells
- list3.nelua - 100 million
- list4.nelua - now with GC
- list5.nelua - slower than Lua?
- list6.nelua - arena allocation
You probably shouldn't use linked lists, but they're simple and they need a lot of allocation, which makes them good for illustration.
list1.nelua - stack-allocated lists
Consider the following program:
require 'stringbuilder'
local M = @record{}
-- singly linked list of integers. generics come later.
global M.Cons = @record{
  car: integer,
  cdr: *M.Cons,
}
-- for ZII: an empty List is a valid zero-length list
global M.List = @record{
  head: *M.Cons
}
function M.List:print()
  local sb: stringbuilder <close>
  local list = self.head
  sb:writebyte('{'_u8)
  while list ~= nilptr do
    sb:write(list.car)
    if list.cdr ~= nilptr then sb:write(', ') end
    list = list.cdr
  end
  sb:writebyte('}'_u8)
  local s <close> = sb:promote()
  print(s)
end
## if not pragmas.testlist1 then
return M
## end
local function main()
  local list: M.List
  list:print() -- output: {}
  local c3: M.Cons = {3, nilptr}
  local c2: M.Cons = {2, &c3}
  local c1: M.Cons = {1, &c2}
  list.head = &c1
  list:print() -- output: {1, 2, 3}
end
main()

This defines, using Lisp terminology, a cons cell that carries an integer value in its 'car' field, and either nilptr or a pointer to another cons cell in its 'cdr' field. Because the zero value of this type isn't terribly useful (it would be a one-element list containing zero), an additional List is created to hold the head cons cell of the singly-linked list. There's also a function to print lists as a Lua table, and some demonstration code.
The demonstration code hand-assembles a three-element list and prints it. Usage:
$ nelua -Ptestlist1 list1.nelua
{}
{1, 2, 3}
The module return and the demo code are slightly odd, but purposeful. Returning a module lets the following examples reuse this code easily. Returning it only conditionally lets the demo run as a plain script, rather than requiring something like nelua -Ptestlist1 -i 'local m = require "list1.nelua"'
If this is supposed to be more like a script, why not put the code at the toplevel instead of in main()?
That actually changes codegen here: instead of stack-allocated cons cells, these would become statically allocated cons cells.
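To illustrate, here's a sketch (not part of the original module): if the body of main() were moved to the toplevel of list1.nelua, it would look like this, and c1..c3 would then get static storage instead of stack slots.
-- sketch: the same demo at the toplevel, so the cons cells are statically allocated
local list: M.List
list:print() -- output: {}
local c3: M.Cons = {3, nilptr}
local c2: M.Cons = {2, &c3}
local c1: M.Cons = {1, &c2}
list.head = &c1
list:print() -- output: {1, 2, 3}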
Why not instead define M.List:__tostring()?
That would permit a more natural print(list), but that more natural usage would also leak under pragmas.nogc because creating the string requires allocation and print() doesn't free it (although io.print does!). So it'd have to be like this anyway, if a GC weren't used:
require 'string'
local s <close> = tostring(list)
print(s)
The rest of the explanation is that print() is meant to be for debugging output anyway.
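For reference, here's a rough sketch of what that metamethod could look like; it mirrors the body of print() above and returns the promoted string instead of printing it (untested):
function M.List:__tostring(): string
  local sb: stringbuilder <close>
  local list = self.head
  sb:writebyte('{'_u8)
  while list ~= nilptr do
    sb:write(list.car)
    if list.cdr ~= nilptr then sb:write(', ') end
    list = list.cdr
  end
  sb:writebyte('}'_u8)
  -- promote() hands the buffer over as a string; under nogc the caller owns it and must free it
  return sb:promote()
end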
Does the GC matter though?
Well... no. Not at all: the only GC allocation here would be a tiny amount for the I/O at the end of the program, and that's all the GC would ever get to do. I'm just keeping concerns like this in mind.
list2.nelua - iota() with preallocated cons cells
This example adds a function iota() that generates a list of numbers. Because this function certainly can't return cons cells that it allocated on the stack, the cons cells are provided to it.
local lists = require 'list1'
require 'span'
function lists.iota(upto: integer, cells: span(lists.Cons))
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = &cells[upto-1]
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end
## if not pragmas.testlist2 then
return lists
## end
local function main()
  local cells: [3]lists.Cons
  local list = lists.iota(3, cells)
  list:print() -- output: {1, 2, 3}
end
main()

This works, and it could be adapted in some obvious directions:
- pass a huge pool of cells instead of exactly the amount needed, so that more complex list functions can share it with the functions they call
- pass a function that returns cells on demand, by some means (sketched just below)
- produce blocks of memory instead of only cons cells, which could then be used for cons cells or for whatever other resources are required
i.e., it can be developed in the direction of being an allocator.
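As a sketch of the second option (hypothetical names, untested): iota could take a function that hands out one cell per call, leaving the question of where those cells come from to the caller.
local lists = require 'list1'
-- hypothetical variant: cons cells are produced on demand by a callback
function lists.iotacb(upto: integer, getcell: function(): *lists.Cons)
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = getcell()
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end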
list3.nelua - 100 million
Let's end with some cons-wasteful code, a program that builds a 100-million-long list with iota, only to fetch the contents of the last cell.
local lists = require 'list2'
function lists.List:last()
  assert(self.head)
  local list = self.head
  while list.cdr ~= nilptr do
    list = list.cdr
  end
  return list.car
end
## if not pragmas.testlist3 then
return lists
## end
## local function comma(s) s = string.gsub(s, ',', '') return tonumber(s) end
local cells: [#[comma '100,000,000']#]lists.Cons
local list = lists.iota(#[comma '100,000,000']#, cells)
print(list:last()) -- output: 100000000

And for reference, a Lisp version of this program:
(defun iota (n)
  (loop for m upto n collect m))

(defun main ()
  (princ (first (last (iota 100000000))))
  (terpri))

(main)
And a comparison:
$ nelua -o list3 -M -P testlist3 -P nogc list3.nelua
$ hyperfine ./list3 'sbcl --dynamic-space-size $((5*1024)) --script list3.lisp'
Benchmark 1: ./list3
Time (mean ± σ): 838.0 ms ± 44.0 ms [User: 417.4 ms, System: 420.2 ms]
Range (min … max): 752.3 ms … 886.8 ms 10 runs
Benchmark 2: sbcl --dynamic-space-size $((5*1024)) --script list3.lisp
Time (mean ± σ): 4.614 s ± 0.127 s [User: 3.477 s, System: 1.135 s]
Range (min … max): 4.454 s … 4.839 s 10 runs
Summary
'./list3' ran
5.51 ± 0.33 times faster than 'sbcl --dynamic-space-size $((5*1024)) --script list3.lisp'
That's not a fair comparison, but it's fun, and SBCL does incredibly well compared to some of its competitors. MaxRSS: 1.5GB for Nelua, 2.6GB for SBCL; the Nelua figure is roughly what 100 million 16-byte cons cells come to on their own.
Here's a better comparison:
list4.nelua - now with GC
local lists = require 'list3'
function lists.gciota(upto: integer)
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = new(lists.Cons)
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end
## if not pragmas.testlist4 then
return lists
## end
local list = lists.gciota(50e6)
print(list:last())

That's only 50 million, because this tablet only has 8GB of RAM, and 50 million cells already eat 4.6GB of it. From those numbers, that works out to roughly 90 bytes per cons cell once allocator and GC overhead are counted, against the 16 bytes of the cell itself.
$ nelua -o list4 -r -P testlist4 list4.nelua
$ hyperfine ./list4 ./list3
Benchmark 1: ./list4
Time (mean ± σ): 16.454 s ± 1.071 s [User: 14.539 s, System: 1.903 s]
Range (min … max): 14.952 s … 18.545 s 10 runs
Benchmark 2: ./list3
Time (mean ± σ): 929.3 ms ± 53.0 ms [User: 484.8 ms, System: 441.9 ms]
Range (min … max): 871.3 ms … 1014.2 ms 10 runs
Summary
'./list3' ran
17.71 ± 1.53 times faster than './list4'
Don't take from this that you should never use the GC; it looks a lot better when asked to do more serious work.
list5.nelua - slower than Lua?
Even a completely naive Lua program is faster and uses less memory than this, taking 8.5s and 2GB of RAM under nelua --script (which runs Lua code), or 900ms and 1GB of RAM under LuaJIT:
local function iota(n)
  local result = {}
  for i=1, n do table.insert(result, i) end
  return result
end
local list = iota(100e6)
print(list[#list])

But this program isn't using linked lists. The comparable Nelua program is twice as fast and uses three-quarters of the memory of LuaJIT:
require 'sequence'
local function iota(n: integer)
  local result: sequence(integer)
  for i=1, n do result:push(i) end
  return result
end
local list = iota(100e6)
print(list[#list])
list6.nelua - arena allocation
local lists = require 'list3'
function lists.allociota(upto: integer, alloc: auto)
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = alloc:new(lists.Cons)
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end
## if not pragmas.testlist6 then
return lists
## end
require 'allocators.arena'
local arena = new(@ArenaAllocator(2*1024*1024*1024))
defer delete(arena) end
local list = lists.allociota(100e6, arena)
print(list:last())

Though still using linked lists, this slightly outperforms the sequence version:
$ nelua -r -Ptestlist6 -o list6 list6.nelua
$ hyperfine ./list3 ./list5 ./list6
Benchmark 1: ./list3
Time (mean ± σ): 764.6 ms ± 56.6 ms [User: 399.8 ms, System: 363.5 ms]
Range (min … max): 697.1 ms … 898.6 ms 10 runs
Benchmark 2: ./list5
Time (mean ± σ): 1.659 s ± 0.060 s [User: 1.452 s, System: 0.205 s]
Range (min … max): 1.562 s … 1.770 s 10 runs
Benchmark 3: ./list6
Time (mean ± σ): 1.575 s ± 0.090 s [User: 1.144 s, System: 0.427 s]
Range (min … max): 1.383 s … 1.706 s 10 runs
Summary
'./list3' ran
2.06 ± 0.19 times faster than './list6'
2.16 ± 0.17 times faster than './list5'