Semipure functions for web servers

Lately I’ve been thinking about a form of function purity which I’d like to call semipure functions. A semipure function gives no guarantee about its output the first time it is called, but calling it again with the same parameters is guaranteed to produce the same output as the first call.

To give an example, imagine a web server whose handlers are semipure. Upon calling an endpoint, the web server will give some additional parameter, maybe a context parameter or a special allocator, which will be used to guarantee semipurity. Let’s call this the semipure parameter.

When the handler makes an SQL call, the results would be saved in the semipure parameter. Querying the same rows in the same table again will make a new SQL call, adding the second version of these rows to the semipure parameter.

If the handler is executed again using the same semipure parameter, it will not make new SQL calls. Instead, it will load the data from the semipure parameter. This guarantees that the subsequent call does not depend on the current state of the database, meaning the handler will follow the same code path again.
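To make the record/replay idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `Row`, `FakeDb`, the fixed-size log and the `rewindForReplay` helper are hypothetical names, and a real version would sit behind an actual SQL client.

```zig
const std = @import("std");

/// Hypothetical row type standing in for the result of one SQL query.
const Row = struct { id: u32, balance: i64 };

/// Hypothetical stand-in for a database whose state can change between runs.
const FakeDb = struct {
    balance: i64,

    fn get(self: *const FakeDb, id: u32) !Row {
        if (id == 0) return error.NotFound;
        return Row{ .id = id, .balance = self.balance };
    }
};

/// Hypothetical semipure parameter: an append-only log of query results
/// plus a cursor that is only used while replaying.
const Semipure = struct {
    mode: enum { record, replay } = .record,
    log: [16]Row = undefined,
    len: usize = 0,
    cursor: usize = 0,

    /// Record mode: run the real query and append its result to the log.
    /// Replay mode: ignore the database and return the next logged result.
    fn query(self: *Semipure, db: *const FakeDb, id: u32) !Row {
        switch (self.mode) {
            .record => {
                if (self.len == self.log.len) return error.LogFull;
                const row = try db.get(id);
                self.log[self.len] = row;
                self.len += 1;
                return row;
            },
            .replay => {
                if (self.cursor == self.len) return error.LogExhausted;
                const row = self.log[self.cursor];
                self.cursor += 1;
                return row;
            },
        }
    }

    /// Switch to replay mode and start reading the log from the beginning.
    fn rewindForReplay(self: *Semipure) void {
        self.mode = .replay;
        self.cursor = 0;
    }
};

/// Hypothetical handler that queries the same row twice.
fn handler(semipure: *Semipure, db: *const FakeDb) !i64 {
    const first = try semipure.query(db, 1);
    const second = try semipure.query(db, 1);
    return first.balance + second.balance;
}

test "replay ignores the current database state" {
    var db = FakeDb{ .balance = 10 };
    var semipure = Semipure{};
    const recorded = try handler(&semipure, &db);

    // The database changes, but the replayed run reads from the log instead.
    db.balance = 999;
    semipure.rewindForReplay();
    const replayed = try handler(&semipure, &db);
    try std.testing.expectEqual(recorded, replayed);
}
```

Note that in record mode both queries reach the database, so the log ends up with two entries; only the replayed run reads from the log.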

The main benefit comes into play when you can serialize and deserialize this semipure parameter. If the web server detects that the handler encountered an error, it could serialize the semipure parameter and save it to disk. Then, in the future, you could load the semipure parameter from disk, restore the relevant database state to replay what has happened, and modify the handler to fix the bug. You could even create an automatically growing test suite that covers every single failed request your web server has ever encountered.
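As a rough sketch of what serializing the semipure parameter could look like, the example below copies a fixed-layout snapshot to bytes and back. The `Snapshot` and `Row` types and the raw byte-copy format are assumptions made for illustration; a real implementation would need a stable on-disk format and a way to handle strings and other pointer-carrying fields.

```zig
const std = @import("std");

/// Hypothetical logged row. `extern` gives the struct a defined memory layout,
/// so a plain byte copy can serve as a (non-portable) serialization format.
const Row = extern struct { id: u32, balance: i64 };

/// Hypothetical snapshot of a semipure parameter: a fixed-capacity query log.
const Snapshot = extern struct {
    len: u32 = 0,
    rows: [4]Row = [_]Row{.{ .id = 0, .balance = 0 }} ** 4,
};

const snapshot_size = @sizeOf(Snapshot);

/// Copy the snapshot into a byte array that could be written to disk.
fn serialize(snapshot: *const Snapshot) [snapshot_size]u8 {
    return std.mem.asBytes(snapshot).*;
}

/// Rebuild a snapshot from bytes previously produced by `serialize`.
fn deserialize(bytes: *const [snapshot_size]u8) Snapshot {
    return std.mem.bytesToValue(Snapshot, bytes);
}

test "snapshot survives a byte round trip" {
    var original = Snapshot{};
    original.rows[0] = .{ .id = 3, .balance = -20 };
    original.len = 1;

    const bytes = serialize(&original);
    const restored = deserialize(&bytes);
    try std.testing.expectEqual(original.len, restored.len);
    try std.testing.expectEqual(original.rows[0].id, restored.rows[0].id);
    try std.testing.expectEqual(original.rows[0].balance, restored.rows[0].balance);
}
```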

This method does rely on knowing the state of every queried row in the database. For simple queries this is fine, but if you introduce subqueries that do not return every touched column and row, you lose this information.

I think Zig makes this endeavor somewhat possible through comptime shenanigans. If one were to create an ORM that allows creating new types that are built up from existing database types, you could encode partial queries and queries over multiple tables inside the ORM. To give a crude example:

const User = struct {
    id: ID,
    firstName: []const u8,
    lastName: []const u8,
    balance: i64,

    pub const ID = enum(u32) {_};
};

const Address = struct{
    id: ID,
    userId: User.ID,
    city: []const u8,
    street: []const u8,

    pub const ID = enum(u32) {_};
};


const UserDB = DBType.fromType(User);
const AddressDB = DBType.fromType(Address);

const UserAddressDB = UserDB.join(AddressDB, User.ID).select(.{
    .userId = UserDB.fields.id,
    .firstName = UserDB.fields.firstName,
    .lastName = UserDB.fields.lastName,
    .city = AddressDB.fields.city,
});
const UserAddress = UserAddressDB.toType();



// Imagine this handler is supposed to do something more complex than simply retrieve some data
pub fn getUserAddress(semipure: Semipure, sqlClient: SqlClient, request: Request) !Response {
    var userAddress = try sqlClient.getOne(semipure, UserAddressDB, .{
        UserDB.equals(UserDB.fields.id, request.userId),
    });

    if (userAddress.userId == 3) {
        userAddress.city = "Atlantis";
        try sqlClient.set(semipure, userAddress);
    }

    return Response.encodeJson(userAddress);
}

Calling getUserAddress with userId = 1 might log the following data in the semipure:

Query 1 GET
User
id  firstName  lastName  balance
1   Bob        Smith     -
Address
id  userId  city   street
1   1       Paris  -

Calling getUserAddress with userId = 3 might log the following data in the semipure:

Query 1 GET
User
id  firstName  lastName  balance
3   Mark       Jones     -
Address
id  userId  city    street
3   3       London  -
Query 2 UPDATE
User
id  firstName  lastName  balance
3   Mark       Jones     -
Address
id  userId  city      street
3   3       Atlantis  -

The address id was not explicitly included in UserAddressDB, but the system might have figured out that this is the primary key of the address table, and therefore included it anyway. I’m not entirely certain how this would work internally. Maybe a comptime error would get emitted if these fields are not included in the type?

Do you guys have any ideas/feedback on such a system?

So… It’s an SQL client with a cache?

Not exactly. If we were to query UserAddressDB twice, both times for user id=3, it would fetch data from the database both times.
The behavior only differs when the handler is run again with the same semipure parameter. In that case, it would first try loading from the previously saved data. Under normal circumstances, this should only ever happen when debugging a specific HTTP request.

I think this is already called Idempotency

This is a form of idempotency, but as far as I’m concerned this is a different form of idempotency than traditionally understood.
In a web server, traditional idempotency seems to refer to sending the same HTTP request twice or thrice only resulting in the request being executed once.
In my example, the request would be processed twice or thrice because a new semipure parameter would be made for each individual request. It’s only when you load a semipure parameter from some backup that idempotency applies.

This sounds like memoization, no?

I still don’t understand the specific use case of this. Why use this design?

My understanding is that it preserves the error state when the system encounters an error? And also acts as a kind of caching mechanism?

It seems it might be useful in some scenarios, but not to a particularly large extent. Most systems probably don’t need such precision.

any ideas/feedback on such a system?

I’m not sure what you’re trying to accomplish or what problem you’re trying to solve here. If the idea is to allow the webserver to optimize by skipping re-reads from the DB, then that’s a cache.

If you’re trying to prevent making repeat writes for the same data, then what you really want is an idempotency key on the requests. In my experience, you usually need this in situations where the client isn’t going to be able to include your “semipure parameter” in the first place.

For example, most webhooks will include an event id or some other key and will explicitly say that requests will be made at least once but possibly more, and it’s your responsibility to only handle an event once. A client re-submitting a request also isn’t going to have received your response and then included this semipure parameter on the re-submission.
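A minimal sketch of that kind of idempotency key handling is shown below. The `Deduplicator` type, the fixed-size `seen` buffer and `handleWebhook` are hypothetical names chosen for illustration; a production system would keep the seen ids in durable storage instead of in memory.

```zig
const std = @import("std");

/// Hypothetical webhook deduplicator: remembers the ids of events that were
/// already handled so that a redelivered event is only processed once.
const Deduplicator = struct {
    seen: [64]u64 = undefined,
    len: usize = 0,

    /// Returns true the first time an event id is seen, false afterwards.
    fn markHandled(self: *Deduplicator, event_id: u64) !bool {
        for (self.seen[0..self.len]) |id| {
            if (id == event_id) return false;
        }
        if (self.len == self.seen.len) return error.TooManyEvents;
        self.seen[self.len] = event_id;
        self.len += 1;
        return true;
    }
};

/// Hypothetical handler: credits an account at most once per event id.
fn handleWebhook(dedup: *Deduplicator, balance: *i64, event_id: u64, amount: i64) !void {
    const first_time = try dedup.markHandled(event_id);
    if (!first_time) return; // duplicate delivery, nothing to do
    balance.* += amount;
}

test "a redelivered event is applied once" {
    var dedup = Deduplicator{};
    var balance: i64 = 0;
    try handleWebhook(&dedup, &balance, 42, 100);
    try handleWebhook(&dedup, &balance, 42, 100);
    try std.testing.expectEqual(@as(i64, 100), balance);
}
```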

I’m not sold, it sounds like your server is going to be holding a lot of your DB in memory, especially if you have many different requests that are all involving overlapping data. For example, a number of different users making requests on the same forum thread.

However, I think you can accomplish the end result (the replays) by using an event sourcing model. You store all the events, and you use those to transform the relevant data from state A → state B, and then you execute any other effects. When there’s an error, instead of writing the state of DB to disk, you just need to log where in the event stream you are (e.g. the id of the last event processed) and any relevant details about the request itself. You can then “replay” all events up to that point to get a test/dev DB to the same state.
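As a small sketch of the replay described above, assume a single account whose state is rebuilt from an in-memory event list. The `Event`, `State`, `apply` and `replay` names are hypothetical; a real event store would stream events from persistent storage.

```zig
const std = @import("std");

/// Hypothetical events for a single account.
const Event = union(enum) {
    deposited: i64,
    withdrawn: i64,
};

const State = struct { balance: i64 = 0 };

/// Pure transition: (State, Event) -> new State.
fn apply(state: State, event: Event) State {
    return switch (event) {
        .deposited => |amount| State{ .balance = state.balance + amount },
        .withdrawn => |amount| State{ .balance = state.balance - amount },
    };
}

/// Rebuild the state as it was after the first `upto` events.
fn replay(events: []const Event, upto: usize) State {
    var state = State{};
    for (events[0..upto]) |event| {
        state = apply(state, event);
    }
    return state;
}

test "replaying up to the logged position restores the state at the error" {
    const events = [_]Event{
        .{ .deposited = 100 },
        .{ .withdrawn = 30 },
        .{ .deposited = 5 },
    };
    // Suppose the error log says the failure happened after the second event.
    const at_error = replay(&events, 2);
    try std.testing.expectEqual(@as(i64, 70), at_error.balance);
}
```

Only the position in the event stream has to be logged with the error; the state itself is derived by replaying.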

TLDR, to me it sounds like a solution in search of a problem.

Also, it’s worth noting that the designers of HTTP gave us a lot of lovely tools to deal with caching, like ETag headers and other cache control headers (see If-Modified-Since).

The web has been around long enough for there to be forgotten solutions to problems, as well as new ones quietly being implemented.

As a web developer, there is always more to learn! Even if you completely ignore the whole React/Svelte/whatever side of things.

I apologize if all I’m doing is raining on your parade - If you want to make your own caching system, go nuts! Just be aware that there might be easier solutions if you need to be pragmatic :sweat_smile:

Indeed, it tries to save the full state that the program interacts with as it changes through time. I believe that this allows for tooling that gives deeper insight into what data a function is dealing with, which is especially useful when dealing with edge cases that are difficult to debug.
The main benefit is to save hours debugging what edge cases of data caused a specific function to fail.

I would second @unkempt6057’s suggestion to look into event sourcing. It’s also worth considering a tool like Sentry if your goal is tracking down errors. Zig’s excellent error handling makes it very easy to inject the context you need when you need it (though admittedly there’s no official binding, atm). Then you can see errors and whatever state caused them :slight_smile:

I believe the memory usage of this method is quite manageable. You’d only need to log anomalous requests, and within these requests you’d only need to save the touched state of the database. For non-anomalous requests, the semipure parameter can act as the allocator for the request, and you can simply clear it afterwards.

For sure! But wouldn’t this actually consume quite a lot of memory? You could construct this by using a daily backup, and then replaying every request that happened that day. This would require you to actually store all the data for every request made that day, and then replay them all in order to get to the desired state. When dealing with large servers handling a lot of requests I can imagine that this would take a significant amount of memory.
Wouldn’t this additionally take a very long time to actually execute? If you have a large server farm handling maybe millions of requests per second, you would need a testing environment capable of replaying that many requests within a reasonable timeframe. Or am I missing something?

+1, I’d say ESPECIALLY if you ignore that. The web platform has so much interesting stuff out there.

In a certain sense, I guess? Only this does not cache the returned SQL rows: if you were to query a user twice during the first run, it would go to the database twice. If you then re-run the handler with the semipure parameter set to repeat, it will use the recorded state that time around. In a certain sense, you’re simply saving and replaying every message sent to and from external servers.

I’ve actually tried implementing ETag caching at my work once! It’s lovely to read into obscure MDN articles to find such things.

Haha, no worries :smiley: I appreciate you taking time to read and evaluate my post!

I wouldn’t be worried about memory here. You’d be streaming events off the event store and applying them to your DB.

This would require you to actually store all the data for every request made that day

Generally, you would only store “events” (not the entire request). So if you had one million requests with people browsing an e-commerce store, but only a hundred purchases, you would only record the events around those purchases.

This. You could even do the serialization you’re thinking of and include that in the error log.

The more I think about this, the more it sounds like two different things.

  1. caching and idempotency handling, which there are plenty of known approaches to.
  2. when something goes wrong, being able to record the state of relevant tables. Which can be solved by … recording the state of those relevant tables at the time your error occurred.

You could then write something that takes this recorded state as well as the recorded request, and spits out a test for you.

If you need convincing about the benefits of an event sourced system, download an Elm app and play around with the dev tools. A simple (State → Event → New State) loop can give you time travel super powers! And if you are worried about storage space, there are established solutions. e.g. If you use aggregates to group resources, you could run a cron job to automatically remove events from resources deleted > 90 days ago.

Couldn’t this apply to my method as well? When an anomalous request is encountered, only the parts of the database that are touched are saved. This could then be handed off to some other database or server that stores it until needed.

Fair. I can imagine how this will work for most anomalous requests, and how off-the-shelf solutions will offer practical benefit over re-engineering your web server. Thanks!

This is just a pure function that depends on an additional context parameter, one that is shared between calls but generated at runtime or elsewhere.

Note: not pure in the sense of identical outputs, but in the weaker sense of having no side effects.