Still, I wanted to build the fastest JSON decoder in Go.

Introduction

Sugawara Yuuta
5 min readApr 25, 2023

Hello, This is Sugawara Yuuta, the author of the previous article “How I created the fastest JSON decoder in Go”. If you haven’t read the previous article yet, I hope you can read it with me because it will help me explain the process.

Afterwards

After that, I received an issue from goccy, the author of go-json saying He did his own benchmarking and found that the performance was not as good as the README. goccy seemed to have done his benchmarking with relatively small JSON files, so I did it myself and found that the results were much, much slower than I expected. Perhaps the fact that I only benchmarked the decoding of larger JSON files backfired on me, but the difference was still significant.
So this time, I have made a number of improvements based on my reflections, and I would like to introduce them to you.

sugawarayuuta/sonnet

Features

  • Unlike last time, it passes all tests for the Go standard library, “encoding/json”.
  • Added encoder for compatibility. This also passes all tests.
  • Decoding speed to small structures is now twice as fast.
  • Several techniques (described below) have greatly reduced memory usage.

Benchmarks

Benchmarks were burrowed from goccy’s repository.

benchmark showing my library. (it’s the fastest in stream decoder, at least in this case)

If anyone is interested, I recommend that you take your own benchmarks

Implementation

  • How to create a map that can be accessed concurrently

Previously I used “sync.Mutex”, but if you can use “unsafe”, “sync/atomic.XXXPointer” often works faster for reading. (The reason I can’t say that about writes is that as far as I know, the only way to do it is to create a new map each time, copy the contents, and then insert the pointer if it hasn’t changed.)

  • Perfect hash functions

Previously, I used the Robin Hood hashing method was used, but after I finished creating it, I realized that I could create a perfect hash function if the key/value pair was known from the beginning and no changes could happen.
So I used a hash function that takes a seed (in this case “hash/maphash.String”, etc.) and shuffles the seed until there are no more duplicates on insertion, which is simple O(1) lookup time.

  • Hash function for the perfect hash functions

At first, hashing was done with the fast maphash package written in assembly, but benchmarks showed that direct use of memhash or strhash, which are used internally, was faster and eliminated unnecessary operations, so I switched to them.
In the end, I changed to FNV-1a, which is manually inlined and unrolled. The reason was that manual inlining was easier because it reads one character at a time (I was thinking of this because the bottleneck of the hash function was the function call, not the bitwise operation), and if the characters are all in ASCII, you can use a lookup table that changes all 256 characters to lowercase. ) If the characters consist entirely of ASCII, the lookup table with all 256 characters changed to lowercase can be used to simultaneously lowercase the characters and generate the hash.

  • Bitwise operations (SWAR)

SWAR is the topic I wanted to talk about the most in this session. The formal name is “SIMD within a register,” which simply means a technology that can process multiple bytes at the same time, even in environments where SIMD is not available. As you may have guessed when you heard the term “multiple bytes,” it can be used for character search. If the character you are looking for is singular, then “bytes.IndexByte”, which uses assembly behind the scenes, is probably best, but if not, you can use this. A working example is shown below.

package main

import (
"encoding/binary"
"fmt"
"math/bits"
)

const mapper uint64 = 0x0101010101010101

func main() {
u8s := []uint8("01234567")
// Use encoding/binary for simplicity
u64 := binary.LittleEndian.Uint64(u8s)
// XOR a uint64 with byte '5' mapped to all bytes.
// XOR outputs 0 if a match is found.
u64 ^= mapper * '5'
// Subtract uint64 where 1 is mapped to all bytes.
// overflows if the previous result is 0.
u64 -= mapper * 1
// AND uint64 with 0x80 mapped to every byte
// (scoop out the most significant bit of each uint8)
u64 &= mapper * 0x80
if u64 != 0 {
// If the output is not zero, it means it was found somewhere, so,
// divide by 8 to see how far the bits are 0.
fmt.Println("found at:", bits.TrailingZeros64(u64)/8)
return
}
fmt.Println("couldn't find the character")
}

In my code I use “unsafe” instead of “encoding/binary” but what I am doing is the same. If you want to see more tricks like this, I suggest you look at the article here

  • Staged pool

While “sync.Pool” is great — parallel access, faster than using channels, etc. — it can be inconvenient because you have no way of knowing which object will come back the next time you “Get”. (For example, “I want a byte slice of size 2048.”)
To solve this I created an array of pools and sized each one. To determine where to put them back in the array or where to get them from, I divide by 1024 and calculate the length on that bit. This allows me to create (and reuse) slices that increase by roughly a factor of 2 with little or no allocation.

- String Cutting

A problem that arises when reusing byte slices as above is how to separate the return value. If you cast the reused array to a string using “unsafe”, the return value, the content, will be changed when the value is rewritten later. In order to solve this problem, I had to make a copy of each string every time, but since the allocation became too large if I had to do it for each string, I needed a solution to this problem. It was to create an array of about 1024 strings first, cut out only what was needed, and pool the remainder for reuse.

--

--