The server has been brought to life

My last post talked about speccing out a server and buying a bunch of parts. This is a follow-up in which I declare success and complain about the incompatible parts I had originally selected.

When I first got started with the build, I was pretty confident, both that it would work and that I would do a good job with cable management.

The good news is that it did eventually work. The bad news is it was a huge pain in my ass and took over a week of debugging. As we’ll see later, I also stopped being so careful with cable management.

A series of problems arose; I’ll go through them chapter by chapter below.

Here’s the final BOM we ended up with:

Item                        Qty  Condition  $/each    Total
Sliger CX2137b                1  New        $199.00   $199.00
Arctic F8 PWM Fan             3  New        $8.00     $24.00
Asus Pro WS W680-ACE IPMI     1  New        $399.99   $399.99
Intel i9-13900K               1  New        $490.00   $490.00
Dynatron Q5 LGA1700 Cooler    1  New        $49.95    $49.95
Kingston DDR5-4800 32GB       4  New        $118.75   $475.00
GIGABYTE GeForce RTX 4060     1  New        $319.99   $319.99
Silverstone FX600             1  New        $174.99   $174.99
NVIDIA Tesla P4               1  Used       $126.13   $126.13
Intel DC S3500 480GB          2  Used       On hand   $0.00
Crucial P3 4TB                6  New        $236.99   $1421.94
Startech PEX8M2E2             1  New        $162.67   $162.67
Startech PEX4M2E1             1  New        $22.93    $22.93
Noctua NF-A6x25               1  New        $14.95    $14.95

Grand total before tax and shipping: $3881.54. Not really too bad.

Now that we’ve established how this build hurt my wallet, let’s talk about how it hurt me.

Chapter 1: My Intel SSD didn’t work

I installed Debian. I rebooted. The boot device was no longer there. The previous build I used this Intel NVMe in had a big chonky heatsink on it and a huge fan to keep it cool. This build did not really have sufficient cooling.

The drive kept appearing and disappearing, and I kept fiddling with flags in the UEFI thinking it was just mad about something. But it was just a thermal issue.
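If you ever suspect the same kind of flakiness, the drive will report its own temperature, which makes this much quicker to diagnose than I made it. A minimal check, assuming nvme-cli is installed and the drive happens to be enumerated as nvme0 at that moment:

# Composite temperature, warning/critical thresholds, and thermal throttle counters
nvme smart-log /dev/nvme0 | grep -iE 'temperature|thermal'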

I don’t think the controller got completely cooked, as it did appear again later once it cooled down. However, this was a non-starter. I gave up on my all-NVMe ideal and installed some SATA SSDs I had lying around: Intel DC S3500 480GB. I had two of ’em, so figured why not make it root-on-mirrored-ZFS.

I installed Debian again.
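For anyone who hasn’t done it, the mirrored-pool part of that is simple. The sketch below is illustrative only: the device paths are placeholders, and a real Debian root-on-ZFS install involves quite a few more steps (boot pool, EFI partition, debootstrap) than this:

# Just the mirror vdev layout, not the full root-on-ZFS procedure
zpool create -o ashift=12 rpool mirror \
  /dev/disk/by-id/ata-INTEL_SSDSC2BB480G4_SERIAL1 \
  /dev/disk/by-id/ata-INTEL_SSDSC2BB480G4_SERIAL2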

Chapter 2: The Teamgroup NVMe didn’t work

It’s quite disappointing to buy 6x4TB of NVMe and not be able to use it. The drives did appear in the UEFI, but the ones behind the PCIe switches on the Glotrends carrier cards just wouldn’t show up in Linux. The ones installed directly on the motherboard, however, worked fine.
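When you’re in this situation, two commands tell you what the kernel actually enumerated (assuming pciutils and nvme-cli are installed):

lspci -tv     # PCIe topology: the switch shows up as a bridge, with whatever it found behind it
nvme list     # the NVMe namespaces Linux actually bound a driver to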

I tested the Intel drive I had just removed and found that it did appear when installed behind the switch. I figure this is a firmware bug on the Team stuff, or in the Realtek controller they use. I sent the drives back and got Crucial P3 4TB as replacements. These ones appeared to work.

I went ahead and installed Proxmox and set up a big ZFS pool to use for my VMs.

I started migrating stuff from my old server, and then I hit another problem.

Chapter 3: The NICs didn’t work

While I was copying stuff over, I set the MTU to 9000. Then later, I was debugging some stuff on my desk and plugged the server into a 100Mbit dumb switch to get it connected. It did not like this.

The kernel kept panicking when systemd decided to bring the NIC up. I could boot into a livecd, but not my installed system. I once again started playing with UEFI toggles, and then with kernel flags too. It took me way too long to make the connection that the jumbo frames and the dumb switch were having a bad time with one another.

I didn’t feel like running a new cable across the room to the real switch, but I knew I had more desk-based debugging and testing to do. So, I just mounted / in the livecd and changed the MTU back to 1500.
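If you ever need to do the same, it only takes a couple of commands from the live environment. A sketch, assuming the usual rpool pool name and the stock Debian/Proxmox /etc/network/interfaces config (adjust for your setup):

zpool import -f -R /mnt rpool                               # import the root pool under a temporary mountpoint
sed -i 's/mtu 9000/mtu 1500/' /mnt/etc/network/interfaces   # put the MTU back
zpool export rpool
reboot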

Chapter 4: The PCIe switches didn’t work

I noticed my disk utilization frequently dropping to zero during the prior rsync’ing. I checked dmesg to see what was going on.

[  988.525697] pcieport 0000:00:1d.0: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:12:00.0
[  988.536154] nvme 0000:12:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[  988.547103] nvme 0000:12:00.0:   device [c0a9:540a] error status/mask=00004000/00400000
[  988.555098] nvme 0000:12:00.0:    [14] CmpltTO
[  988.560907] pcieport 0000:10:08.0: AER: device recovery successful
[  990.619281] nvme nvme3: I/O tag 204 (d0cc) opcode 0x1 (I/O Cmd) QID 2 timeout, aborting req_op:WRITE(1) size:8192
[  992.667249] nvme nvme3: I/O tag 204 (d0cc) opcode 0x1 (I/O Cmd) QID 2 timeout, reset controller
[  992.688623] nvme nvme3: Abort status: 0x371
[  992.721812] nvme nvme3: Shutdown timeout set to 2 seconds
[  992.933183] nvme nvme3: 8/0/0 default/read/poll queues
[  992.956487] nvme nvme3: Ignoring bogus Namespace Identifiers
[  999.730279] pcieport 0000:00:1d.0: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:12:00.0
[  999.740791] nvme 0000:12:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[  999.751728] nvme 0000:12:00.0:   device [c0a9:540a] error status/mask=00004000/00400000
[  999.759723] nvme 0000:12:00.0:    [14] CmpltTO
[  999.765495] pcieport 0000:10:08.0: AER: device recovery successful
[ 1000.859080] nvme nvme3: I/O tag 891 (d37b) opcode 0x1 (I/O Cmd) QID 2 timeout, aborting req_op:WRITE(1) size:122880
[ 1002.910082] nvme nvme3: I/O tag 891 (d37b) opcode 0x1 (I/O Cmd) QID 2 timeout, reset controller
[ 1002.932441] nvme nvme3: Abort status: 0x371
[ 1002.961263] nvme nvme3: Shutdown timeout set to 2 seconds
[ 1003.172257] nvme nvme3: 8/0/0 default/read/poll queues
[ 1003.193600] nvme nvme3: Ignoring bogus Namespace Identifiers
[ 1004.128115] pcieport 0000:00:1d.0: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:12:00.0
[ 1004.138653] nvme 0000:12:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1004.149592] nvme 0000:12:00.0:   device [c0a9:540a] error status/mask=00004000/00400000
[ 1004.157602] nvme 0000:12:00.0:    [14] CmpltTO
[ 1004.259772] pcieport 0000:10:08.0: AER: device recovery successful
[ 1005.211047] nvme nvme3: I/O tag 487 (d1e7) opcode 0x2 (I/O Cmd) QID 3 timeout, aborting req_op:READ(0) size:45056
[ 1006.235030] nvme nvme3: I/O tag 487 (d1e7) opcode 0x2 (I/O Cmd) QID 3 timeout, reset controller
[ 1006.255197] nvme nvme3: Abort status: 0x371
[ 1007.287473] nvme nvme3: Shutdown timeout set to 2 seconds
[ 1007.498696] nvme nvme3: 8/0/0 default/read/poll queues
[ 1007.519380] nvme nvme3: Ignoring bogus Namespace Identifiers

This is bad, awful, and not good. However, it was only occurring for the NVMe on the carrier cards. I sent them back and got a new one with a PLX switch instead of an ASMedia one.

Before I made that decision, I spent an entire day playing with kernel flags to change NVMe timeouts, turn off APST and ASPM, etc., all to no avail.
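For reference, those knobs all live on the kernel command line; the usual suspects look something like this (illustrative values, set via GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by update-grub):

nvme_core.io_timeout=255                  # raise the NVMe command timeout (seconds)
nvme_core.default_ps_max_latency_us=0     # effectively disables APST
pcie_aspm=off                             # disable PCIe link power management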

Since I had ditched root-on-NVMe, I had a spare slot on the board, so I now only needed one card with a switch, and one that was basically just a dumb adapter.

I didn’t really go out of my way to optimize this (e.g., the PCIe switch carrying two of the NVMe is installed in an x4 slot instead of an x8), but it gets pretty decent performance.
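If you want to confirm what a given slot or device actually negotiated, lspci will tell you when run as root (the address below is illustrative; substitute the switch or drive you care about):

lspci -vv -s 0000:12:00.0 | grep -E 'LnkCap|LnkSta'   # advertised vs. negotiated speed and width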

A quick test in fio returned some surprisingly round numbers.
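If you want to run a comparable test yourself, a job along these lines will do it (the mountpoint and sizes are placeholders):

fio --name=seqwrite --directory=/tank --size=8G --bs=1M --rw=write \
    --ioengine=libaio --iodepth=16 --numjobs=1 --end_fsync=1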

This is way more than good enough for my use case.

Chapter 5: I had to 3D print a cooling duct for the Tesla

The Tesla P4 is designed for real servers that have a bunch of beefy fans to push a ton of air through it. This server is not one of those.

I ran into a bunch of problems with my printer that I don’t feel like elaborating on, but ultimately I got a part made.

It’s ugly but it works.

Aside: This Tesla was sold as open box, but I could see evidence of a really janky refurb job, like poorly-cut thermal pads sticking out under the backplate. I decided to take it apart and redo all the pads and paste myself. Fortunately I did not brick the card.

Chapter 6: Thermal throttling

I knew I would have thermal issues with the 13900K, since I had originally specced the build around a run-of-the-mill 13900.

Ultimately, I made a handful of tweaks.

This got me down to an idle package temp (with VMs idling) of 35C, at 98W total system power draw. It briefly hits 90C and pulls 330W during stress -c 24, but then once the “long duration” power cap kicks in it drops to 80C/265W.
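That “long duration” cap is Intel’s PL1, by the way. If you’d rather poke at it from Linux than from the UEFI, the RAPL powercap interface exposes it; the paths below are for package 0, and the value is just an example, in microwatts:

cat /sys/class/powercap/intel-rapl:0/constraint_0_name                           # long_term, i.e. PL1
echo 125000000 > /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw    # as root: set PL1 to 125W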

While we’re here, I’ll also mention that:

I find this perfectly acceptable.

Conclusion

In the end, the inside of the thing looked like this:

My awful cable management job is fortunately hidden behind a rather pleasant facade.

Ignore the mess on my desk and the fact that the lid wasn’t closed in that picture.

I now have a nice 2U machine to replace both my ancient HP workstation tower and my old Synology NAS.