Simplistic Comptime Column Safety in SQLite Queries

12 Likes

Nice post, Loris.

Taking the query as a comptime string is interesting. I’d explore going the other direction: build a query as a type which has a way to render into the query string. I think that’s distinct from ORM because it does not try to force OOP; it only creates types for various kinds of queries.

i.e. something like this:

pub fn Rows(comptime T: type) type {
    return struct {
        db_rows: db.Rows,

        pub const Field = std.meta.FieldEnum(T);

        pub const Row = struct {
            db_row: db.Row,

            pub fn int(row: *Row, field: Field) u32 {
                // FieldEnum(T) tags are declared in field order, so the tag
                // value doubles as the column index.
                return row.db_row.int(@intFromEnum(field));
            }

            pub fn text(row: *Row, field: Field) []const u8 {
                return row.db_row.text(@intFromEnum(field));
            }
        };

        pub fn next(this: *@This()) ?Row {
            return .{ .db_row = this.db_rows.next() orelse return null };
        }

        pub fn deinit(this: *@This()) void {
            this.db_rows.deinit();
        }
    };
}

const TableName = enum {messages};

pub fn selectFrom(comptime T: type, comptime table_name: TableName) Rows(T) {
    const query = "SELECT " ++ queryColumns(T) ++ " FROM " ++ @tagName(table_name);
    return .{ .db_rows = db.rows(query, .{}) catch db.fatal(@src()) };
}

inline fn queryColumns(comptime T: type) []const u8 {
    // field names of T joined with commas
    return comptime blk: {
        var cols: []const u8 = "";
        for (std.meta.fields(T), 0..) |f, i| {
            if (i != 0) cols = cols ++ ", ";
            cols = cols ++ f.name;
        }
        break :blk cols;
    };
}

test {
    var rows = selectFrom(struct {
        id: u32,
        origin: u32,
        channel: u32,
        author: u32,
        body: []const u8,
    }, .messages);
    defer rows.deinit();

    while (rows.next()) |r| {
        const msg: Message = .{
            .id = r.int(.id),
            .origin = r.int(.origin),
            .channel = r.int(.channel),
            .author = r.int(.author),
            .text = gpa.dupe(u8, r.text(.body)) catch oom(),
        };
        try state.latest_messages.pushFront(gpa, msg);
        log.debug("loaded chat message: {f}", .{msg});
    }
}

Here I used a struct to hint that a further enhancement would be to add type safety, but I ignored the field types in this example to match your example code from the blog post more closely.

You could also more simply make it return instances of your struct type from next(). I think that’s the direction I would take this code.
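
For example, next() could fill in T directly. A rough sketch only, reusing the hypothetical db.Row API from above and assuming the struct has only u32 and []const u8 fields:

pub fn next(this: *@This()) ?T {
    var db_row = this.db_rows.next() orelse return null;
    var result: T = undefined;
    inline for (std.meta.fields(T), 0..) |f, i| {
        // column order matches the field order used to build the SELECT list
        @field(result, f.name) = switch (f.type) {
            u32 => db_row.int(i),
            []const u8 => db_row.text(i),
            else => @compileError("unsupported column type: " ++ @typeName(f.type)),
        };
    }
    return result;
}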

11 Likes

I think my personal preference is to use the SQL syntax itself as the “interface” as much as possible, but this is definitely a valid approach, and AFAIK not yet explored in Zig.

Prior art about query builders that I’m aware of:

Here are a couple of things that would complicate usage of struct definitions like in your example:

  1. How do you encode things like nested queries and usage of SQL functions? Like this query, for example:
    var rows = db.conn.rows(
    \\SELECT
    \\  messages_search.author,
    \\  users.handle,
    \\  messages_search.channel,
    \\  channels.name,
    \\  highlight(messages_search, 2, char(0x1b) || '[7m', char(0x1b) || '[27m')
    \\FROM messages_search
    \\INNER JOIN users ON messages_search.author == users.id
    \\INNER JOIN channels ON messages_search.channel == channels.id
    \\WHERE body MATCH ?;
    , .{query}) catch db.fatal(@src());
    defer rows.deinit();
    
  2. Defining types like you did in a struct can help pin down the expected result type, but it can’t be the authoritative source. The type of such a column is actually determined by the types of the original columns involved in the operation, and then by the result of running the query on them. So ultimately writing types there is essentially the same as calling SQLite’s r.int.

One last advantage of using naked SQL directly is that you can leverage existing SQL tooling, like editors that can help you with refactoring, which can be facilitated even more by keeping queries in .sql files and @embedFile-ing them.
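
For example (hypothetical file name, reusing the rows call from the snippet above):

const latest_messages_sql = @embedFile("queries/latest_messages.sql");

// ...

var rows = db.conn.rows(latest_messages_sql, .{query}) catch db.fatal(@src());
defer rows.deinit();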

4 Likes

I think the difference between these approaches is summed up with “which flavor of impedance mismatch do you prefer?” The type-first approach is limited in what valid queries it can generate, and the SQL-first approach is limited in what valid queries it can parse. Both require writing tons of glue code to marshal data between the db’s tabular format and the application’s in-memory data structures.

IMO the only way out is to use a database that works the same as in-memory data structures (like xitdb which I wrote) so there is no impedance mismatch at all. But if I had to stick to SQLite I would probably lean towards Loris’ approach because I’ve suffered through too many query builders and always end up needing to drop down to raw SQL at some point anyway.

3 Likes

My idea would be something like this as an API for the user of the library:

fn select(db: *DB, Columns: type, comptime restOfStatement: []const u8) !RowIter(Columns);

which would be used like this:

var answer = try db.select(
    struct {
        id: u64,
        origin: u64,
        channel: u64,
        author: u64,
        body: []const u8
    },
    "FROM messages",
);
defer answer.deinit();

while (answer.next()) |row| {
    //...
}

Internally it could (very simply; the field names are the column names in the query) create the query string at comptime, while still being typesafe and not requiring the user to repeat everything.

So internally it would do something like this (some things are omitted):

const fields = std.meta.fields(Columns);
const queryString: []const u8 = comptime queryString: {
    var queryString: []const u8 = "SELECT " ++ fields[0].name;
    for (fields[1..]) |f| {
        queryString = queryString ++ ", " ++ f.name;
    }
    break :queryString queryString ++ " " ++ restOfStatement;
};

1 Like

The more I think about it, the more I like the idea of just parsing naked SQL queries. The only downside I can think of is that, depending on which DB you decide to target, the syntax will have to change a little. Everything else will be pretty much the same compared to the query builder approach.

I’m not going to go down this rabbit hole any time soon, but a system that is able to analyze db queries will be able to look at schema definitions and use that information to resolve types of queries, and it will additionally be able to compute migration queries for schema upgrades.

That seems a really cool thing to have, similar to Django’s ORM (minus the ORM part), which in the past used to be a separate project called South, for those who remember.
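
To make that concrete, here is the kind of thing such a schema-aware tool could do (a purely hypothetical sketch):

// given this schema (e.g. kept in schema.sql and fed to the tool):
//
//   CREATE TABLE messages (id INTEGER PRIMARY KEY, channel INTEGER NOT NULL, body TEXT NOT NULL);
//
// it could resolve
//
//   SELECT id, body FROM messages WHERE channel = ?;
//
// to a single i64 bind parameter and a row type equivalent to:
const Row = struct { id: i64, body: []const u8 };
// and, when schema.sql changes, diff the old and new definitions to emit the
// corresponding ALTER TABLE migration statements.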

1 Like

I’m sure I could come up with a type-safe wrapper for that style of query. But you can also bypass the abstractions and use the lower level APIs for complex queries. Like you said, this is about getting a big bang for your buck.

That also looks like a PITA to parse at comptime.

Regarding separate SQL files and tooling and such, I really can’t fathom wanting to do that. I’ve made nontrivial applications before using SQL databases and all my queries were CRUD.

But hey, maybe there’s some software development landscape that I have a blind spot for.

Agreed, my preferred approach would be build time parsing and codegen for wrappers.

No, I think you’re right: if the schema is designed well enough you don’t do too much nesting most of the time, except when creating analytics dashboards, but even then you don’t want those to run off of your main db.

That being said, a wrapper alone doesn’t guarantee that columns have the type that you think they have, only knowledge of the db schema does.

Let me throw in my own simple SQLite wrapper: zigqlite.

The last update is over a year old, so I guess it no longer compiles without some work.

I wrote it to explore how it feels to combine the things I like most about Zig:

  • type safety
  • libraries that accept and adapt to the user’s structs instead of forcing their own types onto the user

And yes, using naked SQL is a big part of usability, IMHO. I’m still trying to figure out how to do some query construction without having to “hide” the SQL (and without really parsing it, if possible).

1 Like

Can somebody with much deeper understanding than mine connect the dots between this “idea of parsing naked SQL queries” and the lower-level Zitron @mnemnion just posted? I appreciate that Zitron is more “normally”(?) suited to providing a parser for the likes of an sqlite interface lib itself (perhaps a native-Zig sqlite replacement, as an alternative to Zig sqlite wrappers(?)), but could it be valuable to higher-level use? For instance, if @kristoff’s use case was NOT as “simple” as he proposes, but was a bigger project with nested queries, etc., and he still wanted type mapping, would a Zig-native Zitron-created sqlite parser be a resource? (Sorry if I’m way off base in my understanding - I’m more used to comfortable territory like Python sqlite3 usage as a magic black box.)

EDIT: I should clarify that I am a big raw-SQL fan, and have some distaste for ORMs, so this middle-ground of specified-fields “checking” (“simple ORMish flat-mapping”), but with raw SQL exposed right up at the front-end of the code, is my cup of tea. In case that leaning colors any interpretation of my interest in general use of Zitron.

1 Like

Indeed, being able to parse your (complex) SQL queries opens up the possibility of extracting their semantics in a structured way, which could lend itself to all kinds of additional advantages, such as linking each column with an SQL data type, validating grouping / windowing calls, etc.

1 Like

Nice article, Loris.

Some tangential ramblings related to the broader idea, and @andrewrk’s remarks:

A while ago at this point, I made a rendering interface that heavily used comptime in its ergonomics. Here’s an example shader definition:

const World = struct {
    u_offset: [2]f32,
};

const PerVertex = struct {
    a_position: [2]f32,
    a_color: [3]u8,

    pub const a_color_usage = rhi.IntegerUsage.normalized_float;
};

const variants = &[_]rhi.Variant{
    .{ .name = "v_color", .type = "vec3" },
};

const Program = rhi.Program(struct { World: World }, PerVertex, null, variants,
    \\void main(void) {
    \\  gl_Position = vec4(a_position + u_offset, 0.0, 1.0);
    \\  v_color = a_color * a_color;
    \\}
,
    \\out vec4 finalColor;
    \\void main(void) {
    \\  finalColor = vec4(sqrt(v_color), 1.0);
    \\}
);

This thing was pretty neat. It did things like automatically writing definitions for the types in glsl and prepending them to the shaders, so it was fully type safe between Zig and the shaders. All of the offsets were automatically calculated for setting up the VertexAttribPointer on the fields, for example. Then the draw commands would take arguments that match the types provided, e.g.:

global.program.draw(.{ .World = &global.uniform }, &global.vao);

global.uniform has type rhi.Uniform(World), and vao has type rhi.VertexArrayObject(PerVertex, null).

Despite being neat, ultimately when a future project needed a similar rendering interface, I pulled way back on the complexity of what I was doing with comptime. In practice, it didn’t actually help solve real problems, and it made actually getting stuff done more frustrating. Looking at the two major claimed benefits of the above API:

  • Autogeneration of glsl types from Zig types: “Write once” and “universal source of truth” sound really nice. But how much time does it actually take to write something twice? A few seconds? In practice, the drawbacks more than cancelled out that saving.
  • The type safety would catch any bugs… that I would have immediately found anyway. The nastiest of bugs are those which occur under specific values in large ranges of possible program states. Getting types wrong here is very likely to cause large and immediate problems, so running it once is often sufficient testing. Any half-decent integration test will give sufficient confidence in the code, I feel.

On the other hand, there were significant downsides:

  • The types weren’t actually always defined the same way, so more info had to be added on the Zig side. Even just the backwards thinking of “I want the generated glsl to be __, so I need to define it as _” can take longer than just writing it twice.
  • When debugging, I constantly had to compile and run the program in order to even see the generated glsl.
  • The error line numbers were always wrong, unless I was looking at the generated glsl.
  • The API was more complicated.
  • Worse, the strictness of the API meant I would take 2 steps of working on my actual problem before needing to come back to the API layer and change something. Instead of writing a few lines of OpenGL calls, I would have to spend a bunch of time thinking about the type design, making sure the API surface was right, debugging code generation, going and updating all of the old call sites, etc.

However, I should note that I didn’t remove all of the comptime from my graphics programming. For example, while programs and VAOs are now concrete types, buffers are still generic. I still get the benefit of types when writing the data to the buffer, and the definitions for binding the fields still get the benefit of being generated from fields and offsets, but in a much more flexible way:

    media.vao.beginInit();
    media.verts.vertexAttribField(.pos, false);
    media.verts.vertexAttribField(.color, false);
    media.vao.endInit();

So, if my previous experience is anything to go by, I have doubts on some proposals in this thread:

  • Parsing SQL: Handling all of the cases as the SQL gets complicated sounds like a massive headache. The answer here seems to be “just don’t handle them”, which I feel will be hit much more often than expected.
  • Generating SQL: This much more directly has the problems I talked about above, and it has the same problem that once things get complicated you end up working on the API layer instead of actually working on your problem.

But actually, I think my real takeaway is “Go a bit crazy with it, see what works, but plan to pull back to what was actually useful.”

5 Likes

Taking a few minutes to look at the SQLite API, I do have a direction I would try when writing a Zig API. I identify two priorities:

  1. Prepared statements look like they’re important for performance. It’s also worth pointing out that parameterized queries are how SQL injection attacks are avoided, so I’d also definitely want any queries to be declared statically at compile time.
  2. It should be super simple to actually write SQL. I don’t want to hit inflection points in my usage where having a join or subquery or something suddenly means that I have to pull back to a different API. Real-world queries tend to be a lot more complicated than SELECT id, origin, channel, author, body FROM messages.

From those priorities, where I’d start is an API that looks like this:

Queries.zig

messages: Query(
    "SELECT m.id, m.origin, u.name, m.body FROM messages m INNER JOIN users u on m.author = u.id WHERE m.channel = ? ORDER BY m.date ASC LIMIT 100",
    struct {
        channel: u64,
    },
    struct {
        id: u64,
        origin: u64,
        author: []const u8,
        body: []const u8,
    }),

main.zig

pub fn main() !void {
  var server: SQLiteServer(@import("Queries.zig")) = try .connectAndPrepare(...);
  // app code
  ...
}

fn loadMessages(allocator: Allocator, channel: u64) !void {
  var row_iterator = try server.begin(.messages, .{.channel = channel});
  defer row_iterator.release();
  while (try row_iterator.next(allocator)) |row| {
    defer allocator.free(row.author); // needed?
    defer allocator.free(row.body); // needed?
    //etc
  }
}

Basically, a connection to the server starts by preparing all of the queries the app will ever use. There are some notes about how multithreading and such work; not having taken the time to dig into them, I don’t know the best way to handle that. Probably something like having the prepared statements be thread-local. I would use debug.SafetyLock on prepared statements to check against double use.

I’m a bit unsure about how to handle memory for byte blobs. Again, I would have to dig into the details. If byte blobs have a lot of options that code will want to choose between, I’d probably switch out the struct types in the query definition for enums or something, then have getter methods closer to the OP for the different usages. Edit: I would probably start with iterator.next() and iterator.nextDupe(allocator) methods, where dupe causes things to be owned, and the plain next’s values become invalid once next is called again. If you have a mix of needs, then call dupe yourself on the fields you need. In that case you shouldn’t be holding on to the structs, which would have dangling pointers, anyway.
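
A rough sketch of that next/nextDupe split (hypothetical names, and assuming the result struct from the messages query above, where author and body are the byte-slice columns):

// next() returns a row whose slices point into SQLite's internal buffers and
// are only valid until the next call to next(); nextDupe() copies those
// columns with the given allocator so the caller owns (and must free) them.
pub fn nextDupe(it: *RowIterator, allocator: Allocator) !?MessagesRow {
    var row = (try it.next()) orelse return null;
    row.author = try allocator.dupe(u8, row.author);
    errdefer allocator.free(row.author);
    row.body = try allocator.dupe(u8, row.body);
    return row;
}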

1 Like

Writing a Zitron SQLite parser would be simpler than usual, due to its heritage. Zitron includes the actual SQLite parser as sample code for validating lemon.zig’s behavior; someone could take this (public domain) code and replace all the C in it with Zig-specific actions. Everything but the C should be readily understood by Zitron, and the parser tables should be identical, rather than merely equivalent.

What @kristoff is doing is a case of what we in langsec call “shotgun parsing”. In his case there’s nothing wrong with that at all, for two reasons: it’s happening during compilation, and (as a consequence) an ‘attacker-controlled’ string can’t get anywhere near the logic. Loris is careful in several places to note the limitations and the specific circumstances in which he does this; I think it’s clever, 10/10 no notes.

Shotgun parsing more generally has a bad reputation, especially when it gets anywhere near a string from ‘the wild’, and parsing in general has a way of starting in one specific place and growing until wild strings can reach the parser, often in surprising ways.

For the “doing cool stuff with Zig, SQLite, and Zitron” concept, there’s certainly potential there. At issue is that a lot of the “cool stuff” involves going from Zig code to SQL strings, which is something a parser can’t really help with. It could verify the syntax, but SQLite will do that for you.

For anything which starts with an SQL string, though, yeah, could do some good stuff there. That can include validating the structure of an arbitrarily-complex query against expected Zig types, although SQLite is dynamically typed so that has to be borne in mind with regard to anything returned.

1 Like

Slightly off topic:
The SQL-string versus code reminds me of my job (which I quit btw) using C#.
I made my own metadata ‘persistence’ tool extracting stuff from the database and only used SQL strings if my tool was not advanced enough to extract the needed data. The tool was written mainly for database conversions: read-convert-write.
I also had a brief look at the Entity Framework shipping with C#, and even that could very easily crash.
Up until now I have never encountered something that really can ‘translate’ code to SQL and back in a logical, consistent, fast, crash-free way. As programmers, I think we do not like to write SQL statements 🙂
Besides that: it is a very interesting area.

1 Like

I never had Entity Framework crash on me, except in the sense that it’s possible to write queries that either cannot be translated to SQL at all, or that perhaps the engine just didn’t support, in which case it would throw an exception. Needless to say, that’s an annoying way to develop, especially if an updated query was faulty and made it through QA. You can catch the exception, but the code still won’t work.