My high-performance JSON decoder is now memory-safe and even faster.

Sugawara Yuuta
10 min read · Sep 21, 2023


Introduction

Hello, my name is Sugawara Yuuta. If you have read “How I created the fastest JSON decoder in Go” or “Still, I wanted to build the fastest JSON decoder in Go”, thank you for reading again. If you haven’t yet, I hope you’ll read those as well, as they make the background of this article easier to follow.

Since then

I discovered a repository called go-json-experiment/json, a new version of the JSON library being worked on by members of the Go team. (Strictly speaking, I knew about it, but had not looked at its contents.) I submitted an issue hoping that what I had learned would be of some use, but in the process I found things that were missing from my own JSON decoder at the time. In this article I would like to talk about the issues I found and how I tried to solve them.

Challenges found

Performance is important, but the Go language team also places great importance on making something that is “easy to use” and “hard to misuse”.
I was again impressed by the fact that such a great language was created by prioritising ‘correctness’ over speed.
There is one more thing I was interested in: the readability of the code. When I wrote the previous version I did not pay much attention to this, but I expect to work more and more with other programmers in the future, so I thought it was important to get into the habit of writing easy-to-read code.
Finally, memory safety. The previous version, like most other fast libraries, aimed to speed things up by using `unsafe`. Runtime reflection is heavy work, so the technique was to use `unsafe` directly: converting offsets by hand, casting to another type with the same memory layout, and so on. But given that this forfeits Go’s memory-safety guarantees, shouldn’t it be avoided, even if the particular usage is not dangerous? That question came up.
`unsafe` also brings a strong weapon in the form of `go:linkname`, which lets you create aliases for unexported functions in other packages. In the previous version this was used frequently to avoid unnecessary allocations and copying. Of course, since these are unexported functions, there is no guarantee of compatibility; in fact, several bugs related to their use occurred immediately after release.

How to solve the problem

Although it’s difficult, most of these problems can be solved by using `reflect`, which acts as a safe wrapper. Memory safety is essentially preserved and the code becomes more readable (partly because `reflect` sits at a higher layer than `unsafe`, so you also get helper functions and the like that the latter does not provide).
Of course, as mentioned earlier, reflection is heavy work. Is it possible to improve speed even after switching to this?

Solving the problem

Answer to the question in the previous section: it is possible.
First of all, look at the benchmark results!

Compared to the previous version, across the 12 test cases used (6 encoding, 6 decoding), 11 achieved faster results.
It also outperforms the `go-json-experiment/json` library introduced earlier under the same conditions in all cases, and the difference is obvious in most of them.
Of course, performance depends greatly on the machine used, its state, the size of the JSON, and so on, so if you are concerned about this, I strongly recommend measuring it yourself.

Why it’s possible — behind the scenes to combine safety and speed

I would like to give four examples of decoder speed-up.

  • Decoding integers at high speed

What do you use to decode integers? Most people would say `strconv`. However, no implementation is perfect, so we can try to improve on it for a more specific use case. To simplify matters and make changes easier to observe, consider the following problem.
‘Parse the string representation of an unsigned integer into a `uint64` as fast as possible.’
Of course, it is possible to go further than what we are making here by using SIMD, etc., but that would make it impossible to write in Go alone, and the effort to keep cross-platform cannot be ignored.
First, we’ll consider the simplest implementation we can think of. It might look something like the following.

func parseUint1(str string) (uint64, error) {
	var dst uint64
	for len(str) >= 1 {
		num := uint64(str[0] - '0')
		if num > 9 {
			return 0, strconv.ErrSyntax
		}
		add := dst*10 + num
		if add < dst {
			// simplified overflow check; a complete one would
			// also guard the dst*10 multiplication
			return 0, strconv.ErrRange
		}
		dst = add
		str = str[1:]
	}
	return dst, nil
}

In ASCII, `'0'` through `'9'` are contiguous, so you get the actual digit by subtracting the character `'0'` from the byte you are reading. Since we are parsing an unsigned integer, no characters other than digits may appear; because the subtraction is done on unsigned bytes, anything outside that range wraps around and fails the `num > 9` check, and an error is returned (here `ErrSyntax`, as this is a simplified implementation).
The `*10` is rewritten by the compiler into a combination of shifts and adds, so there is little risk of slowdown there. Now let’s see how this performs compared to `strconv`.

BenchmarkStrconvParseUint-4      1903638              3124 ns/op               0 B/op          0 allocs/op
BenchmarkParseUint1-4            2862594              2091 ns/op               0 B/op          0 allocs/op

It is quite good. It is about 1.5 times faster than the standard library (the first line). The likely reason is that the standard library supports more than just decimal input, so extra branches and function calls pile up.
Can it be made faster than this?
Yes, it can. Did you notice that most CPUs in use today have 64-bit registers, yet the code above processes only 8 bits at a time? By using the full register width, you can aim for even more speed. Here’s an example of how.

const (
	x01 = 0x0101010101010101
)

func makeUint64(u64 uint64) uint64 {
	const mask = 0x000000ff000000ff
	const fst = 100 + 1000000<<32
	const sec = 1 + 10000<<32
	u64 -= x01 * '0'
	u64 = u64*10 + u64>>8
	return (u64&mask*fst + u64>>16&mask*sec) >> 32
}

func canMakeUint64(u64 uint64) bool {
	return u64&(u64+x01*0x06)&(x01*0xf0) == x01*0x30
}

Aren’t both input and output integers? If you think so, you are right. However, this `uint64` argument is the result of a 64-bit load of the text, which is completely different from the output number. As I said earlier, a 64-bit register holds eight 8-bit units, that is, eight bytes, so you can operate on eight characters at the same time. The mechanics of this are well explained on this person’s blog, so please see there. As the implementation there is 32-bit, see Daniel Lemire’s article for a 64-bit implementation (he has also created a fast JSON library). Now let’s look at the performance of this as well.

BenchmarkParseUint2-4            4444332              1349 ns/op               0 B/op          0 allocs/op

It is also faster. One caveat: with input like JSON, where you do not know in advance how many digits there are, an 8-byte load may fetch bytes that turn out not to be digits, so some of the work can be wasted.

  • Fast checks of object names against struct fields

The field information of a struct is fixed per type, so it can be computed up front to check for name collisions and to apply priorities. However, matching this struct information against the names of the objects that actually flow in is a frequent and often time-consuming task. So this time we will consider speeding this up.
The first thing that comes to mind is a `map`, but since `map` is designed to also perform reasonably well for inserts and iteration, it may not be the best fit for lookup-heavy tasks. Here you can see from the explanation above that lookups matter far more than insertions. Not only that: all the information is known up front, so you can build the data structure once, make it read-only, and never insert into it again.
We had previously used a perfect hash function for this, shuffling the seed until there were no collisions, but the distribution was not the best because the hash was FNV-1a, loop-unrolled for simplicity. From this it follows that the hash itself needed improving. The slightly tricky part is that in many cases a case-insensitive match has to be performed.
The problem can also be simplified as follows.
‘Hash the input as fast as possible while lowercasing the alphabet.’
So first, we use 3000 English words to see which of the standard library hashes is fastest, lowercasing included.

BenchmarkFNV1a-4           35215            166097 ns/op           28488 B/op       3000 allocs/op
BenchmarkCRC32-4           34110            177475 ns/op           28488 B/op       3000 allocs/op
BenchmarkAdler32-4         35414            167569 ns/op           28488 B/op       3000 allocs/op
BenchmarkMaphash-4         37741            161913 ns/op           28488 B/op       3000 allocs/op

With `bytes.ToLower`, it looks like none of the hashes can show its real strength because of the allocations. Why not reuse a byte buffer instead of recreating it each time? The characters are all ASCII, so it is not difficult to implement. An example is shown below.

func appendToLower(dst, src []byte) []byte {
	for _, char := range src {
		if char >= 'A' && char <= 'Z' {
			dst = append(dst, char|('a'-'A'))
			continue
		}
		dst = append(dst, char)
	}
	return dst
}

The only interesting part here is probably the OR (`|`). If you are curious, an ASCII table makes it easy to see why it works. If you replace `ToLower` with the above implementation and measure again…

BenchmarkFNV1a-4          105688             55860 ns/op               0 B/op          0 allocs/op
BenchmarkCRC32-4           82906             70441 ns/op               0 B/op          0 allocs/op
BenchmarkAdler32-4         95647             61397 ns/op               0 B/op          0 allocs/op
BenchmarkMaphash-4        101689             59381 ns/op               0 B/op          0 allocs/op

The difference has widened. `maphash` and the loop-unrolled `FNV-1a` appear to be the fastest.
By the way, the speeds of the hash functions I actually use are as follows…

BenchmarkHash32Lo-4       346429             16303 ns/op               0 B/op          0 allocs/op

The result is many times faster than `maphash`, which uses assembly. This is because it makes full use of the 64-bit and 32-bit loads we talked about earlier. For example, if you use such a load to read the first 8 bytes, you can fold the `|('a'-'A')` lowercasing introduced above into the hashing itself (though this is only valid if you know the string consists of suitable ASCII characters, for example letters and digits). `'a'-'A'` is 0x20 in hexadecimal, so spread across 64 bits it becomes `0x2020202020202020`. This means you can simply OR that constant into each loaded word to hash the lowercase form regardless of the original case. An example implementation is shown below.

var (
	lit = binary.LittleEndian
)

const (
	wy1 = 0x53c5ca59
	wy2 = 0x74743c1b
	wy3 = wy1 | wy2<<32
)

const (
	lo8  uint8  = 0x20
	lo32 uint32 = 0x20202020
	lo64 uint64 = 0x2020202020202020
)

func hash32Lo(buf []byte, seed uint32) uint32 {
	hash := uint64(seed^wy1) * uint64(len(buf)^wy2)
	if len(buf) != 0 {
		for ; len(buf) > 8; buf = buf[8:] {
			hash = hash ^ (lit.Uint64(buf) | lo64) ^ wy3
			hash = (hash >> 32) * (hash & 0xffffffff)
		}
		hi, lo := hash>>32, hash&0xffffffff
		if len(buf) >= 4 {
			hi ^= uint64(lit.Uint32(buf) | lo32)
			lo ^= uint64(lit.Uint32(buf[len(buf)-4:]) | lo32)
		} else {
			lo ^= uint64(buf[0] | lo8)
			lo ^= uint64(buf[len(buf)>>1]|lo8) << 8
			lo ^= uint64(buf[len(buf)-1]|lo8) << 16
		}
		hash = (lo^wy1)*(hi^wy2) ^ wy3
		hash = (hash >> 32) * (hash & 0xffffffff)
	}
	return uint32(hash>>32) ^ uint32(hash)
}

The actual hash function uses the 32-bit version of Wyhash. This hash function is also used in the internal implementation of the Go language.

  • Fast conversion of decimal strings to `float64`

The truth is that this problem cannot be simplified down to a toy example. `float32`/`float64` cannot represent numbers as completely and exactly as integer types can, and `strconv` already uses fast algorithms on its hot paths. So where can it be improved? What I noticed was that I needed to find the end of the fractional literal before I could call `ParseFloat`: if you pass that function extra characters, it returns an error because the whole string is not a valid decimal. So I decided to port Daniel Lemire’s fast double parser to Go so that the buffer only needs to be iterated once. As you can see from the graph at the beginning, this resulted in a considerable speed-up (especially as canada_geometry is data with a lot of floating-point numbers).

  • Skip spaces at high speed

The most common characters in a formatted JSON document are spaces, tabs and line breaks, which can be ignored entirely, since indentation is irrelevant to the grammar. Now let’s consider how to skip them fast. Perhaps the simplest way is to compare against each of the four whitespace characters in order. However, there are other ways.

const spaces uint64 = 1<<' ' | 1<<'\n' | 1<<'\r' | 1<<'\t'

if 1<<char&spaces != 0 {
	// char is one of the four whitespace characters
}

This is an example of using the fact that each whitespace character’s value fits below 64: shift 1 left by the byte’s value and AND it with a previously built 64-bit constant to see if there is a match (in Go, shifting a `uint64` by 64 or more simply yields zero, so bytes above 63 are rejected automatically).
Another advantage of this method is that, unlike the method using a fixed-length table introduced in the previous section, it uses much less memory.
In addition, you can also use the speed-up (generally known as SWAR) by packing into 64-bit registers, which I introduced earlier.
What had been missing from my implementation was inlining. The function was called so often that call overhead added up, so I made it return immediately if the first character was not a space, and only otherwise call another function that skips whitespace using the technique described above. This reduced the slowdown on minified JSON files, where there are no spaces to skip. Also, when using SWAR, `math/bits.TrailingZeros64` can be used to find where the first non-space character appears.
It is a good idea to rely on the `math/bits` package here: you get not only the functions actually defined in the .go files, but also the possibility that the compiler replaces them directly with single instructions for performance.

Conclusion

This time I have introduced four examples of how I tackled the new problems I found.
I would also like to write a continuation, including an encoder edition, so feel free to check back for new articles. I will conclude by sharing the repository that I improved this time.
https://github.com/sugawarayuuta/sonnet
Thank you very much.
