Lua Notes: lists/statically-allocated

statically-allocated lists

list1.nelua - stack-allocated lists
list2.nelua - iota() with preallocated cons cells
list3.nelua - 100 million
list4.nelua - now with GC
list5.nelua - slower than Lua?
list6.nelua - arena allocation

You probably shouldn't use linked lists, but they're simple while needing a lot of allocation, so they're illustrative.

list1.nelua - stack-allocated lists

Considering the following program:

require 'stringbuilder'

local M = @record{}

-- singly linked list of integers. generics come later.
global M.Cons = @record{
  car: integer,
  cdr: *M.Cons,
}
-- for ZII: an empty List is a valid zero-length list
global M.List = @record{
  head: *M.Cons
}

function M.List:print()
  local sb: stringbuilder <close>
  local list = self.head
  sb:writebyte('{'_u8)
  while list ~= nilptr do
    sb:write(list.car)
    if list.cdr ~= nilptr then sb:write(', ') end
    list = list.cdr
  end
  sb:writebyte('}'_u8)
  local s <close> = sb:promote()
  print(s)
end

## if not pragmas.testlist1 then
  return M
## end

local function main()
  local list: M.List
  list:print() -- output: {}
  local c3: M.Cons = {3, nilptr}
  local c2: M.Cons = {2, &c3}
  local c1: M.Cons = {1, &c2}
  list.head = &c1
  list:print() -- output: {1, 2, 3}
end
main()

This defines, using Lisp terminology, a cons cell that carries an integer value in its 'car' field, and either nilptr or a pointer to another cons cell in its 'cdr' field. Because the zero value of this type isn't terribly useful (it would be a one-element list containing zero), an additional List is created to hold the head cons cell of the singly-linked list. There's also a function to print lists as a Lua table, and some demonstration code.

The demonstration code hand-assembles a three-element list and prints it. Usage:

$ nelua -Ptestlist1 list1.nelua 
{}
{1, 2, 3}

The module returns and the demo code are slightly odd, but are purposeful. Returning a module lets the following examples reuse this code easily. Conditionally doing that lets the demo be run more as a script, rather than requiring something like nelua -Ptestlist1 -i 'local m = require "list1.nelua"'

If this is supposed to be more like a script, why not put the code at the toplevel instead of in `main()`?

That actually changes codegen here: instead of stack-allocated cons cells, these would become statically allocated cons cells.

Why not instead define `M.List:__tostring()`?

That would permit a more natural print(list), but that more natural usage would also leak under pragmas.nogc because creating the string requires allocation and print() doesn't free it (although io.print does!). So it'd have to be like this anyway, if a GC weren't used:

require 'string'
local s <close> = tostring(list)
print(s)

The further explanation for this is that print() is supposed to be for debugging output.

Does the GC matter though?

Well... no. Not at all, because only a tiny amount of I/O would allocate with the GC at the end of the program, which is how the GC gets to do anything. I'm just keeping concerns like this in mind.

list2.nelua - iota() with preallocated cons cells

This example adds a function iota() that generates a list of numbers. Because this function certainly can't return cons cells that it allocated on the stack, the cons cells are provided to it.

local lists = require 'list1'
require 'span'

function lists.iota(upto: integer, cells: span(lists.Cons))
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = &cells[upto-1]
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end

## if not pragmas.testlist2 then
  return lists
## end

local function main()
  local cells: [3]lists.Cons
  local list = lists.iota(3, cells)
  list:print() -- output: {1, 2, 3}
end
main()

This works, and it could be adapted in some obvious directions:

pass a huge source of cells instead of exactly the amount needed, so that more complex list functions can share that with their own called functions.
pass a function instead that can return cells on demand, by some means
instead of only producing cons cells, blocks of memory could be produced, which could be used for cons cells or whatever other resources were required

i.e., it can be developed in the direction of being an allocator.

list3.nelua - 100 million

Let's end with some cons-wasteful code, a program that builds a 100-million-long list with iota, only to fetch the contents of the last cell.

local lists = require 'list2'

function lists.List:last()
  assert(self.head)
  local list = self.head
  while list.cdr ~= nilptr do
    list = list.cdr
  end
  return list.car
end

## local function comma(s) s = string.gsub(s, ',', '') return tonumber(s) end

local cells: [#[comma '100,000,000']#]lists.Cons
local list = lists.iota(#[comma '100,000,000']#, cells)
print(list:last()) -- output: 100000000

And for reference, a Lisp version of this program:

(defun iota (n)
  (loop for m upto n collect m))

(defun main ()
  (princ (first (last (iota 100000000))))
  (terpri))

(main)

And a comparison:

nelua -o list3 -M -P testlist3 -P nogc list3.nelua
$ hyperfine ./list3 'sbcl --dynamic-space-size $((5*1024)) --script list3.lisp'
Benchmark 1: ./list3
  Time (mean ± σ):     838.0 ms ±  44.0 ms    [User: 417.4 ms, System: 420.2 ms]
  Range (min … max):   752.3 ms … 886.8 ms    10 runs
 
Benchmark 2: sbcl --dynamic-space-size $((5*1024)) --script list3.lisp
  Time (mean ± σ):      4.614 s ±  0.127 s    [User: 3.477 s, System: 1.135 s]
  Range (min … max):    4.454 s …  4.839 s    10 runs
 
Summary
  './list3' ran
    5.51 ± 0.33 times faster than 'sbcl --dynamic-space-size $((5*1024)) --script list3.lisp'

That's not a fair comparison, but it's fun, and SBCL is doing incredibly well compared to some of its competitors. MaxRSS: Nelua with 1.5G, SBCL with 2.6G

Here's a better comparison:

list4.nelua - now with GC

local lists = require 'list3'

function lists.gciota(upto: integer)
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = new(lists.Cons)
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end

## if not pragmas.testlist4 then
  return lists
## end

local list = lists.gciota(50e6)
print(list:last())

That's only 50 million, because this tablet only has 8GB of RAM, and 50 million already eats 4.6G of it.

$ nelua -o list4 -r -P testlist4 list4.nelua
$ hyperfine ./list4 ./list3
Benchmark 1: ./list4
  Time (mean ± σ):     16.454 s ±  1.071 s    [User: 14.539 s, System: 1.903 s]
  Range (min … max):   14.952 s … 18.545 s    10 runs
 
Benchmark 2: ./list3
  Time (mean ± σ):     929.3 ms ±  53.0 ms    [User: 484.8 ms, System: 441.9 ms]
  Range (min … max):   871.3 ms … 1014.2 ms    10 runs
 
Summary
  './list3' ran
   17.71 ± 1.53 times faster than './list4'

Don't take it from this that you shouldn't ever use the GC, as it'll look a lot better when asked to do more serious things.

list5.nelua - slower than Lua?

Even a completely naive Lua program is faster and uses less memory than this. Taking 8.5s and 2G GB of RAM in nelua --script (which runs Lua code), or 900ms and 1G of RAM in LuaJIT:

local function iota(n)
  local result = {}
  for i=1, n do table.insert(result, i) end
  return result
end

local list = iota(100e6)
print(list[#list])

But, this program isn't using linked lists. The comparable Nelua program is twice as fast and uses 3/4ths the memory of LuaJIT.

require 'sequence'

local function iota(n: integer)
  local result: sequence(integer)
  for i=1, n do result:push(i) end
  return result
end

local list = iota(100e6)
print(list[#list])

list6.nelua - arena allocation

local lists = require 'list3'

function lists.allociota(upto: integer, alloc: auto)
  local list: lists.List
  while upto > 0 do
    local cdr = list.head
    list.head = alloc:new(lists.Cons)
    list.head.car = upto
    list.head.cdr = cdr
    upto = upto - 1
  end
  return list
end

## if not pragmas.testlist6 then
  return lists
## end

require 'allocators.arena'
local arena = new(@ArenaAllocator(2*1024*1024*1024))
defer delete(arena) end
local list = lists.allociota(100e6, arena)
print(list:last())

Though still using linked lists, this slightly outperforms the sequence version:

$ nelua -r -Ptestlist6 -o list6 list6.nelua
$ hyperfine ./list3 ./list5 ./list6
Benchmark 1: ./list3
  Time (mean ± σ):     764.6 ms ±  56.6 ms    [User: 399.8 ms, System: 363.5 ms]
  Range (min … max):   697.1 ms … 898.6 ms    10 runs
 
Benchmark 2: ./list5
  Time (mean ± σ):      1.659 s ±  0.060 s    [User: 1.452 s, System: 0.205 s]
  Range (min … max):    1.562 s …  1.770 s    10 runs
 
Benchmark 3: ./list6
  Time (mean ± σ):      1.575 s ±  0.090 s    [User: 1.144 s, System: 0.427 s]
  Range (min … max):    1.383 s …  1.706 s    10 runs
 
Summary
  './list3' ran
    2.06 ± 0.19 times faster than './list6'
    2.16 ± 0.17 times faster than './list5'

Lua Notes