How strings work in Zig?

A quick introduction to string literals in Zig, and how strings in Zig differ from strings in other programming languages.
Zig
Strings
Encoding
Author
Affiliation

Pedro Duarte Faria

Blip

Published

December 12, 2023

Introduction

Zig is a new general-purpose and low level programming language that is being very promising. The documentation of the language1 is very good in quality, but I missed some more specific details about strings in there.

That is why, I decided to write this article, to discuss in more depth how strings work in Zig, and give you some more specific details about it.

What strings in Zig are?

In Zig, a string literal (or a string object if you prefer) is a pointer to a null-terminated array of bytes. Each byte in this array is represented by an u8 value, which is an unsigned 8 bit integer.

Zig always assumes that this sequence of bytes is UTF-8 encoded. This might not be true for every sequence of bytes you have it, but is not really Zig’s job to fix the encoding of your string (you can use iconv for that). Today, most of text data in our modern world should be UTF-8 encoded. So if your string literal is not UTF-8 encoded, then, you will probably have problems in Zig.

Let’s take for example the word “Hello”. In UTF-8, this sequence of characters (H, e, l, l, o) is represented by the sequence of decimal numbers 72, 101, 108, 108, 111. In xecadecimal, this is the sequence 0x48, 0x65, 0x6C, 0x6C, 0x6F. So if I take this sequence of hexadecimal values, and ask Zig to print this sequence of bytes as a sequence of characters (i.e. a string), then, the text “Hello” will be printed into the terminal:

const std = @import("std");

pub fn main() !void {
    const bytes = [_]u8{0x48, 0x65, 0x6C, 0x6C, 0x6F};
    std.debug.print("{s}\n", .{bytes});
}
Hello

If you want to see the actual bytes that represents a string in Zig, you can use a for loop to loop trough each byte in the string, and ask Zig to print each byte as an hexadecimal value to the terminal, using a print() statement with the X formatting specifier. Like this:

const std = @import("std");

pub fn main() !void {
    const string_literal = "This is an example of string literal in Zig";
    std.debug.print("Bytes that represents the string object: ", .{});
    for (string_literal) |byte| {
        std.debug.print("{X} ", .{byte});
    }
    std.debug.print("\n", .{});
}
Bytes that represents the string object: 54 68 69 73 20 69 73 20 61 6E 20 65 78 61 6D 70 6C 65 20 6F 66 20 73 74 72 69 6E 67 20 6C 69 74 65 72 61 6C 20 69 6E 20 5A 69 67 

Strings in C

This is very similar to how C treats strings as well. That is, string values in C are also treated internally as an array of bytes, and this array is also null-terminated.

But one key difference between a Zig string and a C string, is that Zig also stores the length of the array inside the string object. This small detail makes your code safer, because is much easier for the Zig compiler to check if you are trying to access an element out of bounds, or if your trying to access memory that does not belong to you.

To achieve this same kind of safety in C, you have to do a lot of work, that kind of seems pointless. So getting this kind of safety is not automatic and much harder to do in C. For example, if you want to track the length of your string troughout your program in C, then, you first need to loop through the array of bytes that represents this string, and find the null element ('\0') position to discover where exactly the array ends, or, in other words, to find how much elements the array of bytes contain.

To do that, you would need something like this in C. In this example, the C string is 25 bytes long:

#include <stdio.h>
int main() {
    char* array = "An example of string in C";
    int index = 0;
    while (1) {
        if (array[index] == '\0') {
        break;
    }
        index++;
    }
    printf("Number of elements in the array: %d\n", index);
}
Number of elements in the array: 25

But in Zig, you do not have to do this, because the object already contains a len field which stores the length information of the array. As an example, the string_literal object below is 43 bytes long:

const std = @import("std");

pub fn main() !void {
    const string_literal = "This is an example of string literal in Zig";
    std.debug.print("{d}\n", .{string_literal.len});
}
43

A better look at the object type

Now, we can inspect better the type of objects that Zig create. To check the type of any object in Zig, you can use the @TypeOf() function.

If we look at the type of the simple_array object below, you will find that this object is a array of 4 elements. Each element is a signed integer of 32 bits (i32). That is what an object of type [4]i32 is.

But if we look closely at the type of the string_literal object below, you will find that this object is a constant pointer (*const) to an array of 43 elements (or 43 bytes). Each element is a single byte (more precisely, an unsigned 8 bit integer), that is why we have the [43:0]u8 portion of the type below. In other words, the string stored inside the string_literal object is 43 bytes long. That is why you have the type *const [43:0]u8 below.

Now, if we create an pointer to the array object, then, we get a constant pointer to an array of 4 elements (*const [4]i32), which is very similar to the type of the string_literal object. This demonstrates that a string object (or a string literal) in Zig is already a pointer to an array.

Just remember that a “pointer to an array” is different than an “array”. So a string object in Zig is a pointer to an array of bytes, and not simply an array of bytes.

const std = @import("std");
const print = std.debug.print;

pub fn main() !void {
    const string_literal = "This is an example of string literal in Zig";
    const simple_array = [_]i32{1, 2, 3, 4};


    print("Type of array object: {}\n", .{@TypeOf(simple_array)});
    print("Type of string object: {}\n", .{@TypeOf(string_literal)});
    print("Type of a pointer that points to the array object: {}\n", .{@TypeOf(&simple_array)});
}
Type of array object: [4]i32
Type of string object: *const [43:0]u8
Type of a pointer that points to the array object: *const [4]i32

Byte vs unicode points

Is important to point out that each byte in the array is not necessarily a single character. This fact arises from the difference between a single byte and a single unicode point.

The encoding UTF-8 works by assigning a number (which is called a unicode point) to each character in the string. For example, the character “H” is stored in UTF-8 as the decimal number 72. This means that the number 72 is the unicode point for the character “H”. Each possible character that can appear in a UTF-8 encoded string have its own unicode point.

For example, the Latin Capital Letter A With Stroke (Ⱥ) is represented by the number (or the unicode point) 570. However, this decimal number (570) is higher than the maximum number stored inside a single byte, which is 255. In other words, the maximum decimal number that can be represented with a single byte is 255. That is why, the unicode point 570 is actually stored inside the computer’s memory as the bytes C8 BA.

const std = @import("std");

pub fn main() !void {
    const string_literal = "Ⱥ";
    std.debug.print("Bytes that represents the string object: ", .{});
    for (string_literal) |char| {
        std.debug.print("{X} ", .{char});
    }
    std.debug.print("\n", .{});
}
Bytes that represents the string object: C8 BA

This means that to store the character Ⱥ in an UTF-8 encoded string, we need to use two bytes together to represent the number 570. That is why the relationship between bytes and unicode points is not always 1 to 1. Each unicode point is a single character in the string, but not always a single byte corresponds to a single unicode point.

All of this means that if you loop trough the elements of a string in Zig, you will be looping through the bytes that represents that string, and not through the characters of that string. In the Ⱥ example above, the for loop needed two iterations (instead of a single iteration) to print the two bytes that represents this Ⱥ letter.

Now, all english letters (or ASCII letters if you prefer) can be represented by a single byte in UTF-8. As a consequence, if your UTF-8 string contains only english letters (or ASCII letters), then, you are lucky. Because the number of bytes will be equal to the number of characters in that string. In other words, in this specific situation, the relationship between bytes and unicode points is 1 to 1.

But on the other side, if your string contains other types of letters… for example, you might be working with text data that contains, chinese, japanese or latin letters, then, the number of bytes necessary to represent your UTF-8 string will likely be much higher than the number of characters in that string.

If you need to iterate through the characters of a string, instead of its bytes, then, you can use the std.unicode.Utf8View struct to create an iterator that iterates through the unicode points of your string.

In the example below, we loop through the japanese characters “アメリカ”. Each of the four characters in this string is represented by three bytes. But the for loop iterates four times, one iteration for each character/unicode point in this string:

const std = @import("std");

pub fn main() !void {
    var utf8 = (try std.unicode.Utf8View.init("アメリカ")).iterator();
    while (utf8.nextCodepointSlice()) |codepoint| {
        std.debug.print("got codepoint {}\n", .{std.fmt.fmtSliceHexUpper(codepoint)});
    }
}
got codepoint E382A2
got codepoint E383A1
got codepoint E383AA
got codepoint E382AB