How to approach debugging the stdlib TLS client?

I have had success with using the standard library client without TLS. I have also had success with using the client with TLS for certain websites.

However, I have had issues with certain domains which, according to my understanding, should function.

In particular, on a RHEL 8 workstation, I know the CA bundle I use is picked up by rescan (although I also tried adding it by absolute path, just in case).

My understanding is also that the stdlib supports TLS 1.2 and TLS 1.3. I know the domain does not support TLS 1.3, but I did confirm via openssl s_client -connect <domain>:443 -tls1_2 that the domain supports TLS 1.2, with the handshake succeeding.

Thus the dilemma: when I attempt to connect, all I get is TlsAlert.

I was hoping to gather more information, as I notice that .level and .description are set if (options.alert). Two points, however. Since the call to init fails, there isn’t a TLS client from which to access the options and check the alert that way. But, if I’m correct, I can pass options to init (which I do), and the code should mutate that parameter and fill in the alert information.

Despite this, I don’t actually get the level or the description, as options.alert is just null. The only way I managed to jimmy it into telling me was by passing a dummy alert in my options (e.g. I just created an alert with level = warning, description = bad_mac). Only then did that if-condition succeed, populating the alert with level = fatal, description = handshake_failure.

In conclusion, is there a way to get that TlsAlert information without passing a dummy alert to the function via options? And, is there any remaining underlying snare for the stdlib TLS client that I should be aware of before falling back to trying tcpdump? (e.g. it doesn’t support X or Cipher AES256-GCM-SHA384 is not supported, or something).

The alert passed in Options is a key point. So you will need to pass in one that you created. This is expected. If you don’t pass that in the Options for init then it will have nothing to populate. That is the only way to get the Alert information.
With that said, it looks like your issue is that the handshake has failed. This could be a cert issue. Without knowing where you are connecting and the cert info, it may be hard to troubleshoot.

As far as figuring out what is wrong, I’ve been using a debugger to step through the tls.Client for one of my projects and that has helped.

Welcome to Ziggit. Glad to have you join us.

Regarding passing in an alert being the proper approach, that stumps me a bit, in particular because it’s not like you can pass in an empty Alert struct to be filled in. I have to populate it with dummy data, otherwise I get missing struct field: level, missing struct field: description. However, if I do that, is it not ambiguous? E.g., suppose I pass in

var alert = std.crypto.tls.Alert{ .level = ...fatal, .description = ...bad_record_mac };
const opt_alert: ?*std.crypto.tls.Alert = &alert;
const options = std.crypto.tls.Client.Options{
    ...,
    .alert = opt_alert,
};

but then, suppose the actual alert from the program is level = fatal, description = bad_record_mac. How does one know whether the alert is genuine or whether it’s the original passed-in alert?

I feel like I must be missing something obvious there.

As for stepping through, I’ll try that. I recently updated gdb so that it can even handle the zig binary. And thank you for the welcome.

This is a case where you want to initialize it as undefined:

var alert: std.crypto.tls.Alert = undefined;
const options = std.crypto.tls.Client.Options{
    ...,
    .alert = &alert,
};

This way you know that it will be populated and not be ambiguous. You will also know it’s populated by the TlsAlert error. If you don’t have that error, then the alert will not be populated.

Ah, thank you. Makes sense, like initializing a pointer to NULL in C when the callee does the malloc and puts the address where you want it.

Using good old std.log.debug it gets a bit more surprising for me.

The init “method” starts and reaches the while-loop

This executes only once.

During that one iteration, cipher_state = .cleartext

Hence

ctd, ct = .{ record_decoder, record_ct },

From the subsequent switch we know ct = .alert and, by logging, ctd = .{ .buf = { 2, 40 }, .idx = 0, .our_end = 0, .their_end = 2, .cap = 2, .disable_reads = true }

Which is spooky to me, since it means the first thing to populate the record buffer is a TLS alert, seemingly before anything else happens, yet the data is well-formed enough that none of the preceding errors occur.

Will analyse more and keep this thread posted in case anyone happens to have insight or, should I figure it out myself, whoever also gets this kind of issue can resolve it.

edit: so, on a working site, the first thing read is the record_header = {22, 3, 3, 0, 80} which, 22 = content_type = handshake, {3, 3} dropped, {0, 80} = record_length, where it then reads the next 80 bytes yielding a valid buffer. On a non-working site, the record_header = {21, 3, 3, 0, 2}, i.e. 21 = content_type = alert, {3, 3} dropped, {0, 2} = record_length, where it then reads the next 2 bytes yielding a valid alert buffer.
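For anyone following along, the record header decoding described above can be sketched in a few lines of Zig. This is only an illustration: RecordHeader and parseRecordHeader are hypothetical names for this sketch, not stdlib APIs, while the numeric values (content types 20 to 23, alert level 2 = fatal, description 40 = handshake_failure) come from the TLS 1.2 specification, RFC 5246.

```zig
const std = @import("std");

// Content type values from the TLS record layer (RFC 5246, section 6.2.1).
const ContentType = enum(u8) {
    change_cipher_spec = 20,
    alert = 21,
    handshake = 22,
    application_data = 23,
    _,
};

// Alert body is two bytes: level, then description (RFC 5246, section 7.2).
const alert_level_fatal: u8 = 2;
const alert_handshake_failure: u8 = 40;

// Hypothetical helper type for this sketch; not part of std.crypto.tls.
const RecordHeader = struct {
    content_type: ContentType,
    legacy_version: u16,
    length: u16,
};

// Splits the 5-byte record header into type, version and big-endian length.
fn parseRecordHeader(bytes: [5]u8) RecordHeader {
    return .{
        .content_type = @enumFromInt(bytes[0]),
        .legacy_version = (@as(u16, bytes[1]) << 8) | bytes[2],
        .length = (@as(u16, bytes[3]) << 8) | bytes[4],
    };
}

pub fn main() void {
    const headers = [_][5]u8{
        .{ 22, 3, 3, 0, 80 }, // header seen on the working site
        .{ 21, 3, 3, 0, 2 }, // header seen on the failing site
    };
    for (headers) |h| {
        const rec = parseRecordHeader(h);
        std.debug.print("type={d} length={d}\n", .{ @intFromEnum(rec.content_type), rec.length });
    }

    // The two-byte alert body logged earlier in the thread.
    const alert_body = [2]u8{ 2, 40 };
    const is_fatal_handshake_failure =
        alert_body[0] == alert_level_fatal and alert_body[1] == alert_handshake_failure;
    std.debug.print("fatal handshake_failure: {}\n", .{is_fatal_handshake_failure});
}
```

Read this way, the failing site answers the Client Hello directly with a fatal handshake_failure alert, which usually means the server rejected something in the Client Hello itself rather than anything certificate related.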

However, that’s just it, this is the first input read on the input stream in the TLS client init. How can I be getting a TLS Alert as my very first message? (I did at least print out the bundle to make sure the cert bundle I was expecting was loaded, which it seemingly was).

I have yet to isolate which of the following yielded the positive change, but I ensured no session id was sent in the client hello. I also ensured another cipher (a more archaic one) was included where it needed to be.

The result was that a client hello was sent to the server, and then a triple packet of “Server Hello, Certificate, Server Hello Done” was returned. Upon processing, Zig errors out. Better than before, when it failed to even get past the client hello. Again, unsure if it was the session id, the cipher, or both.

What’s weird here though is, we send a client hello, we get back a response.

The first iteration parses out the first handshake type and initializes the state, so (handshake_type, handshake_state) = (.server_hello, .hello), which is processed successfully, updating handshake_state to .certificate.

The second iteration parses out the next handshake type, so we have (handshake_type, handshake_state) = (.certificate, .certificate), which is also, seemingly, processed successfully. At the end of that iteration, handshake_state is updated to .trust_chain_established, from which we break out of the outer switch to process the next message.

It then processes the last message, which updates the handshake_type to .server_hello_done, i.e. (handshake_type, handshake_state) = (.server_hello_done, .trust_chain_established), which just errors since it expects the handshake_state to also be .server_hello_done.

I am a bit unsure whether .trust_chain_established should be acceptable for the .server_hello_done handshake_type, or whether the state should first be updated to server_hello_done. If only because, at least according to all the documentation I can find, client hello → server hello, certificate, server hello done is a valid sequence. And, indeed, https://github.com/ianic/tls.zig/blob/main/src/handshake_client.zig#L505 seems to implement it that way. I will experiment with that 3rd party library as well.
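To make the sequence question concrete, here is a toy checker for the server's first flight in TLS 1.2. It is a simplified sketch, not how either library is implemented: Msg and isValidServerFlight are hypothetical names, and it ignores other optional messages such as CertificateRequest. The one fact it relies on is that RFC 5246 (section 7.3) marks ServerKeyExchange as optional, so both the three-message and four-message flights are legal.

```zig
const std = @import("std");

// Handshake messages relevant to the server's first flight in TLS 1.2.
const Msg = enum { server_hello, certificate, server_key_exchange, server_hello_done };

// Hypothetical checker for this sketch: accepts the flight with or without
// ServerKeyExchange, which RFC 5246 section 7.3 marks as optional.
fn isValidServerFlight(msgs: []const Msg) bool {
    var i: usize = 0;
    if (i >= msgs.len or msgs[i] != .server_hello) return false;
    i += 1;
    if (i >= msgs.len or msgs[i] != .certificate) return false;
    i += 1;
    // Skip the optional ServerKeyExchange (present for DHE/ECDHE suites).
    if (i < msgs.len and msgs[i] == .server_key_exchange) i += 1;
    // Exactly one message must remain, and it must be ServerHelloDone.
    return i + 1 == msgs.len and msgs[i] == .server_hello_done;
}

pub fn main() void {
    const rsa_flight = [_]Msg{ .server_hello, .certificate, .server_hello_done };
    const ecdhe_flight = [_]Msg{ .server_hello, .certificate, .server_key_exchange, .server_hello_done };
    std.debug.print("rsa: {}, ecdhe: {}\n", .{
        isValidServerFlight(&rsa_flight),
        isValidServerFlight(&ecdhe_flight),
    });
}
```

The flight without ServerKeyExchange is what a server sends when a static RSA key exchange suite is negotiated, which matches the three messages seen in this thread.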

So, using that other TLS library, the one by ianic, which does permit the transition to server_hello_done from certificate, everything works and there are no issues.

Sadly, trying to just jimmy that into the stdlib doesn’t work: it then progresses to L712, where it panics because const pre_master_secret = key_share.getSharedSecret().?; is unwrapping a null value. According to the TLS documentation, the client should be generating this pre_master_secret, so I am a bit surprised it’s null.

The ianic implementation, after the server hello done, as this is TLS 1.2, proceeds to set the pre_master_secret (i.e. via h.dh_kp.sharedKey(…) or &h.rsa_secret.secret). In contrast, the stdlib implementation, as said above, calls key_share.getSharedSecret(). This “method” just accesses a slice of the key_share byte array, which is initialized at the start of the whole TLS function.

So, I am a bit unsure how the null case can even happen. Logically, only if sk_len == 0 for key_share.sk_buf. Which, I guess, can happen, as KeyShare.init initializes them to 0 and undefined, respectively. This buffer is ONLY filled out in exchange, i.e. when there is a named_group. My intuition tells me that this is where the discrepancy lies. ianic’s uses the named_group if it’s there and, if it’s not, uses the rsa_secret. Confirming via logs, that is what happens for my successful run with ianic’s.

Per the TLS documentation, “the premaster secret is set, either by direct transmission of the RSA-encrypted secret or by the transmission of the DH parameters that will allow each side to agree upon the same premaster secret.” I.e., ianic’s handles both scenarios (although, since it’s TLS 1.2, I am not sure if they should be called “named groups”, but maybe that is useful for consistent naming). Regardless, it seems that the stdlib does not support direct RSA transmission?

Obviously, separate to that potential bug or design decision, it is curious that my server doesn’t play nice with the curve-based ciphers sent by both ianic’s implementation and the stdlib. I say this because openssl just works (specifying both cipher and TLS version). The only difference I found there was that openssl sends the ‘ec_point_formats’ extension, whereas the two zig implementations do not. Confirmed: manually copying the byte sequence for ec_point_formats into the client hello makes the ianic implementation work with those ciphers.
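For reference, a minimal form of that extension can be written out by hand. The bytes below follow the layout in RFC 8422 (section 5.1.2) and advertise only the uncompressed point format; this is a sketch of the wire encoding, not the exact sequence openssl sends (it may list more formats), and ec_point_formats_ext is just a name chosen for this example.

```zig
const std = @import("std");

// ec_point_formats extension (type 11) advertising only the
// uncompressed format (0), laid out as in RFC 8422, section 5.1.2.
const ec_point_formats_ext = [_]u8{
    0x00, 0x0b, // extension type: ec_point_formats (11)
    0x00, 0x02, // extension data length: 2 bytes
    0x01, // ec_point_format_list length: 1 entry
    0x00, // ECPointFormat: uncompressed
};

pub fn main() void {
    // Dump the bytes as hex for comparison against a packet capture.
    for (ec_point_formats_ext) |b| {
        std.debug.print("{x:0>2} ", .{b});
    }
    std.debug.print("\n", .{});
}
```

When patching this into a Client Hello by hand, the bytes have to go inside the extensions block, and the enclosing extensions length (and the handshake and record lengths above it) must grow by the same six bytes.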

(As an aside, useful for debugging is editing modules under .cache/zig/*)

Conclusions:

Neither the stdlib nor ianic’s implementation passes ec_point_formats after the compression section. At least for my server, this results in a TLS alert immediately after the Client Hello when forcing a DH cipher.

For this reason, when many ciphers are allowed, all the DH ciphers are not chosen and the client and server agree on a direct transmission of the RSA secret. This works for ianic’s implementation as that fallback is implemented. For the stdlib, it seemingly is not.

Personal resolution: use ianic’s implementation as, at least, it’ll fall back to the RSA secret key.

Thread resolution: As much as I’ve learned through this analysis, TLS is not my forte. At most, someone with better sense than me can confirm whether the stdlib has a bug in not supporting the direct transition from certificate to server_hello_done, as well as in lacking the RSA fallback, and whether both the ianic and stdlib implementations have a bug in not providing ec_point_formats.

You may want to open an issue and start a discussion around this. Note that the answer may be something along the lines of “This is for supporting package management and it works for our needs”. However, this is still something that would go well in an issue, especially since you have put so much effort into figuring out what is wrong.