Featured image of post Troubleshooting a Google change

Troubleshooting a Google change

Google changed the way one of their services work. I spent a couple hours troubleshooting the resulting breakage.

At work, we have one central server running Postfix which accepts all internal email, and sends it out through our email provider (Google Workspace). So for example, alerts from Nagios, requests and responses in our ticketing system, and automated reports all flow through this server, which sends them to Workspace’s “smtp-relay” as a smarthost, for further delivery (to both internal and external addresses).

This presents several advantages, especially centralizing the control of (server) authentication in one place, but of course it comes with the risk that an issue can affect a wide variety of systems.

The error

I woke up on Monday morning to an eerily quiet inbox. It seemed too good to be true, and it was: I quickly saw reports that some email through the ticketing system hadn’t gotten through. Looking in the ticketing system however, the email had been received, and forwarded out; but Workspace never seemed to receive it.

A glance at our statistics (using Munin) confirmed that the emails were stuck on our mail server, with the queue reaching up to nearly 100 emails, from its usual 0-1.

Looking at the journal on the mail server quickly showed where they were getting stuck:

$ journalctl --facility=mail -e
Mar 18 05:15:15 rico postfix/smtp[3810878]: certificate verification failed for smtp-relay.gmail.com[142.251.116.28]:587: self-signed certificate
Mar 18 05:15:15 rico postfix/smtp[3810878]: A815632425E5: to=<john@foo.bar>, relay=smtp-relay.gmail.com[142.251.116.28]:587, delay=301, delays=300/0.04/0.19/0, dsn=4.7.5, status=deferred (Server certificate not verified)

But that’s not all there is to this story.

The red herring

First things first, I thought, and went about making sure I could connect with the same set of CA’s that Postfix was using. This was a bit difficult; first I had to figure out whether Postfix was using a “CAfile” or a “CApath”, and I also had to figure out whether it was using the copy inside its chroot or outside.

$ echo | openssl s_client -no-CApath -no-CAstore -no-CAfile -CAfile /etc/ssl/certs/ca-certificates.crt -connect smtp-relay.gmail.com:587 -starttls smtp |& grep Verification:
Verification: OK
$ echo | openssl s_client -no-CApath -no-CAstore -no-CAfile -CApath /etc/ssl/certs/ -connect smtp-relay.gmail.com:587 -starttls smtp |& grep Verification:
Verification: OK
$ echo | openssl s_client -no-CApath -no-CAstore -no-CAfile -CAfile /var/spool/postfix/etc/ssl/certs/ca-certificates.crt -connect smtp-relay.gmail.com:587 -starttls smtp |& grep Verification:
Verification: OK
$ echo | openssl s_client -no-CApath -no-CAstore -no-CAfile -CApath /var/spool/postfix/etc/ssl/certs/ca-certificates.crt -connect smtp-relay.gmail.com:587 -starttls smtp |& grep Verification:
Verification: OK

Everything seemed okay! I checked every way it could be validating the certificate, and it all seemed fine. So I decided that the “self-signed” bit of the error must be wrong, and went looking for other causes. I even went so far as to check a packet capture to make sure Postfix was reaching out to the right place. I went and looked at update logs, but the most recent system update was a couple days before the issue started, and Postfix hadn’t even been restarted since the last time the server was rebooted, a week earlier.

At this point I was pretty stumped and spent about 15 minutes looking for a Postfix configuration setting that could show me what I was missing. One of my friends jumped in and pointed out smtp_tls_loglevel.

The fix

Turning this setting straight up to 11 (okay, 4) immediately spit out much more helpful information:

Mar 18 13:25:57 rico postfix/smtp[22710]: smtp-relay.gmail.com[142.251.116.28]:587: depth=0 verify=0 subject=/OU=No SNI provided; please fix your client./CN=invalid2.invalid
Mar 18 13:25:57 rico postfix/smtp[22710]: smtp-relay.gmail.com[142.251.116.28]:587: depth=0 verify=0 subject=/OU=No SNI provided; please fix your client./CN=invalid2.invalid
Mar 18 13:25:57 rico postfix/smtp[22710]: smtp-relay.gmail.com[142.251.116.28]:587: depth=0 verify=1 subject=/OU=No SNI provided; please fix your client./CN=invalid2.invalid
...
Mar 18 13:25:57 rico postfix/smtp[22710]: certificate verification failed for smtp-relay.gmail.com[142.251.116.28]:587: self-signed certificate
Mar 18 13:25:57 rico postfix/smtp[22710]: smtp-relay.gmail.com[142.251.116.28]:587: subject_CN=invalid2.invalid, issuer_CN=invalid2.invalid, fingerprint=90:4A:C8:D5:44:5A:D0:6A:8A:10:FF:CD:8B:11:BE:16, pkey_fingerprint=34:F2:C3:F5:25:92:A4:49:5A:A6:0C:29:57:32:54:43

Yeah… it turns out Google just decided to change their servers in the middle of the night, without telling anyone or announcing it through any of their announcement channels. Thanks, Google, I really appreciate it, good value you’re giving us for our money here.

$ echo | openssl s_client -no-CApath -no-CAstore -no-CAfile -CAfile /etc/ssl/certs/ca-certificates.crt -connect smtp-relay.gmail.com:587 -starttls smtp -noservername |& grep Verification
Verification error: self-signed certificate

Bingo! Now the only thing left to do was tell Postfix to send SNI: smtp_tls_servername = nexthop.

Postfix’s default behavior is also somewhat to blame here; defaulting to not sending any SNI is most certainly the wrong choice. But, it’s also a much more established choice than Google’s (Postfix probably predates SNI, and certainly predates the use of TLS between MTAs!).

I’m not sure how no one at Google, with all their wealth and employees, thought to test this in advance with any major MTAs; they even advertise the smtp-relay as being useful for embedded devices like printers or fax machines which will almost certainly not support SNI. And it’s not like Google doesn’t have plenty of IP space!

No harm, no foul?

Thankfully, email is built to handle temporary failures and retry later, so none of these emails were actually dropped. However, they were delayed for several hours, which led to some concern; therefore, we are adding some alerting (which does NOT go through the central mail server) when the mailqueue gets large.

Why did you do this, Google?