
Ethan Thornberg

Week 8: Prediction

15 min read
Deep Work · Networking · UDP · Client Prediction · Game Networking · Month 2


Week 7 ended with a clear diagnosis.

The game I'd spent three weeks building worked fine on localhost. The moment I ran it through ./proxy 50 — 50ms of latency, nothing unusual — it felt broken. 100ms was near-unplayable. 10% packet loss caused the game to fall apart in ways that were almost comically bad: silent failed movements, permanently desynced treasures, players teleporting across the map.

The problem wasn't bugs. It was assumptions. The architecture assumed a perfect network because that's what I'd been building against.

Week 8 was about fixing that.

Instant

Monday I implemented client-side prediction, and by the end of the session I understood why it's non-negotiable.

The core idea is simple: don't wait for the server to confirm a movement before rendering it. When the player presses 'w', move them up immediately on the client side, then tell the server what happened. The server is still the authority — if it disagrees with the client's position, it sends a correction. But for the common case, where the movement is valid, the client is already showing the right thing before any network round-trip completes.

The key was validate_movement(). When a keypress comes in, the client checks the move against boundary constraints and player collision data — the same logic the server uses — before applying it locally. If the move is valid, the client updates its position immediately, sends the update packet, and moves on. No waiting.
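A minimal sketch of that flow, assuming a grid map and an array of known players — the field names and helper signatures here are my own illustration, not the project's actual code:

```c
/* Hypothetical sketch of client-side prediction. MAP_W/MAP_H, the
 * struct layout, and predict_move() are assumptions for illustration. */

#define MAP_W 80
#define MAP_H 24

struct player { int id, x, y; };

/* Same boundary and collision logic the server runs: reject moves off
 * the map or onto another player's tile. */
static int validate_movement(const struct player *p, int nx, int ny,
                             const struct player *others, int n_others)
{
    if (nx < 0 || nx >= MAP_W || ny < 0 || ny >= MAP_H)
        return 0;
    for (int i = 0; i < n_others; i++)
        if (others[i].id != p->id && others[i].x == nx && others[i].y == ny)
            return 0;
    return 1;
}

/* Apply the keypress locally first, then notify the server. */
static int predict_move(struct player *p, char key,
                        const struct player *others, int n_others)
{
    int nx = p->x, ny = p->y;
    switch (key) {
    case 'w': ny--; break;
    case 's': ny++; break;
    case 'a': nx--; break;
    case 'd': nx++; break;
    default: return 0;
    }
    if (!validate_movement(p, nx, ny, others, n_others))
        return 0;
    p->x = nx;                 /* render immediately — no round-trip */
    p->y = ny;
    /* the update packet to the server would be sent here */
    return 1;
}
```

The server stays authoritative; this only covers the common case where the move is valid and the client can show the result before the round-trip completes.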

This required the client to know its own ID. Before Week 8, all clients received the same position broadcasts and the client had no reliable way to filter out its own updates. The server now sends a YOUR_ID_IS packet — via send_user_id() — early in the connection sequence. Once the client has its ID, handle_position_update() can recognize its own entries in broadcast packets and skip them, trusting its locally predicted state instead.

I also hit an ordering bug almost immediately. The server was sending position data before the client received its ID assignment. The client didn't yet know which entries in the broadcast were its own, so it would overwrite its locally predicted position with stale data from the server. Moving the send_user_id() call to happen before any position broadcasts fixed it.

Then I tested through the proxy.

50ms. I pressed 'w'. The character moved. Immediately.

The input lag that had frustrated me all of last week was completely gone. The server confirmation still happened in the background — the ACK would arrive 50ms later — but I didn't feel it. From a player's perspective, it was instant.

Client prediction is why modern games feel responsive. Without it, you're not controlling a character — you're suggesting to one.

100ms. Same thing. Completely responsive. I'd press keys and move naturally. The server was authoritative somewhere behind the scenes, but the player experience was smooth.

This was the most satisfying session of the month.

The Tension

Tuesday was about what happens when the client is wrong.

Client-side prediction works beautifully for the common case. But the client doesn't have the full picture. It has stale data about other players. It might predict a move into a space that another player just entered a fraction of a second ago. The server will reject it. The client and server now disagree about where the player is.

The fix: server-side position corrections via POSITION_CORRECTION_ID.

When the server processes a movement and finds it invalid — boundary violation, player collision — it sends back a correction packet. The client's handle_position_correction() receives it, checks that the ID matches the current player, and overrides the predicted position with the server's authoritative one. The client then re-renders.

From the player's perspective, this looks like a small snap: a correction arrives 100ms later and briefly pulls you back to where you actually are. Not ideal, but dramatically better than the old behavior, where you'd press a key and nothing would happen at all.
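The handler itself is simple. This is a sketch under my own assumptions about the packet layout — the struct fields and the `POSITION_CORRECTION_ID` value are illustrative, not the post's real definitions:

```c
/* Hypothetical sketch of the correction path; the message-type value
 * and struct layouts are assumptions. */

#define POSITION_CORRECTION_ID 7   /* assumed message type */

struct client_state { int my_id, x, y; };

struct correction_packet { int player_id, x, y; };

/* Server is authoritative: if the correction targets this client,
 * overwrite the predicted position and let the next render pick it up. */
static int handle_position_correction(struct client_state *c,
                                      const struct correction_packet *pkt)
{
    if (pkt->player_id != c->my_id)
        return 0;                  /* not our correction — ignore it */
    c->x = pkt->x;                 /* snap to the server's position */
    c->y = pkt->y;
    return 1;
}
```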

Treasure collection needed its own design decision. When the client steps onto a treasure's coordinates, it optimistically removes the treasure from the local list immediately, without waiting for server confirmation. Waiting for the round-trip made the pickup feel noticeably laggy; the visual feedback needs to be instant.

But what if the server rejects the movement? Maybe another player was already in that spot. The server would reject the movement, send a position correction, and resend the treasure back to the client. The client re-adds it. A small flicker in a rare edge case.

The key insight was that most of those rejection cases — another player in the same spot — meant the treasure was about to be collected by that player anyway. It was effectively gone. So the flicker wouldn't even occur in practice.

Score, though, was different. There was no optimistic score increase. The client waits for the server to report the new score in its next position broadcast. Only the visual removal was optimistic. The actual stat update stayed authoritative.
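That split — optimistic visual removal, server-authoritative score — can be sketched like this, assuming a simple array-backed treasure list (all names here are mine, not the project's):

```c
/* Hypothetical sketch of optimistic treasure collection; the storage
 * and function names are assumptions for illustration. */

#define MAX_TREASURES 32

struct treasure { int x, y, active; };

struct treasure_list { struct treasure items[MAX_TREASURES]; };

/* Optimistic: hide the treasure the moment the player steps on it.
 * Score is deliberately NOT touched here — that stays with the server. */
static int collect_treasure_optimistic(struct treasure_list *t, int x, int y)
{
    for (int i = 0; i < MAX_TREASURES; i++) {
        if (t->items[i].active && t->items[i].x == x && t->items[i].y == y) {
            t->items[i].active = 0;
            return 1;              /* removed locally; server decides score */
        }
    }
    return 0;
}

/* If the server rejects the move and resends the treasure, the client
 * re-adds it — the "small flicker" case. */
static void readd_treasure(struct treasure_list *t, int x, int y)
{
    for (int i = 0; i < MAX_TREASURES; i++) {
        if (!t->items[i].active) {
            t->items[i].x = x;
            t->items[i].y = y;
            t->items[i].active = 1;
            return;
        }
    }
}
```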

Optimistic updates work when the edge cases are rare and recoverable. Applying them everywhere indiscriminately is a mistake. Applying them nowhere is too conservative.

Making Critical Things Reliable

Wednesday I built the ACK system.

This was the fix for the 10% packet loss failures. The game doesn't need every UDP packet to arrive — position updates are fine as fire-and-forget. But some messages absolutely cannot be lost: connection handshakes, treasure spawns and removals, player join/leave events, ID assignments. Lose one of those and the client and server diverge permanently.

The solution was a reliable delivery layer on top of UDP. A ReliablePacket struct tracks each outgoing critical message: the original packet data, a sequence number, a timestamp of when it was sent, and a retry counter. A linked list — ReliablePacketSLL — maintains all the packets waiting for acknowledgment. Every critical message goes through send_ack_packet() instead of send_packet(), which inserts the sequence number into the packet and adds the entry to the waiting list.

On the receiving side, when a critical message arrives, the receiver immediately sends back an ACKID packet containing the sequence number. When the sender receives that ACK, handle_ack_packet() removes the corresponding entry from the waiting list. Every game loop tick, check_for_timeout() scans the list for entries older than 3x the current measured latency. If found, resend_ack_packet() retransmits up to three times before giving up.
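The bookkeeping behind that can be sketched roughly as follows. The struct layout, field names, and constants are assumptions read off the description above, not the actual project code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch of the reliable-delivery bookkeeping. */

#define MAX_PACKET  512
#define MAX_RETRIES 3

struct ReliablePacket {
    uint8_t  data[MAX_PACKET];    /* original bytes, kept for resends */
    size_t   len;
    uint16_t seq;                 /* sequence number echoed back in the ACK */
    double   sent_at;             /* send time in seconds, for timeouts */
    int      retries;
    struct ReliablePacket *next;  /* singly linked list of unacked packets */
};

/* ACK arrived: drop the matching entry from the waiting list. */
static void handle_ack_packet(struct ReliablePacket **head, uint16_t seq)
{
    for (struct ReliablePacket **p = head; *p; p = &(*p)->next) {
        if ((*p)->seq == seq) {
            struct ReliablePacket *dead = *p;
            *p = dead->next;
            free(dead);
            return;
        }
    }
}

/* Each tick: anything older than 3x the measured RTT gets retried,
 * up to MAX_RETRIES times, then dropped. */
static void check_for_timeout(struct ReliablePacket **head, double now,
                              double rtt)
{
    struct ReliablePacket **p = head;
    while (*p) {
        struct ReliablePacket *rp = *p;
        if (now - rp->sent_at > 3.0 * rtt) {
            if (rp->retries >= MAX_RETRIES) {   /* give up on this packet */
                *p = rp->next;
                free(rp);
                continue;
            }
            /* the actual retransmit of rp->data would happen here */
            rp->retries++;
            rp->sent_at = now;
        }
        p = &rp->next;
    }
}
```

The pointer-to-pointer traversal keeps deletion from the middle of the list to a single assignment, with no special case for the head.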

There was one detail in the sequence number insertion that slowed me down. My first instinct was to just append the sequence number to the end of each packet. But the packet header — APPID and message type at bytes 1 through 4 — needed to stay in its fixed position, and the receiving side already had parsing logic that expected the payload to start at byte 5. I couldn't just tack numbers on at the end without breaking everything that read the packets.

The solution was memmove() — shift the existing payload right by 2 bytes, write the sequence number into the vacated slot at byte 5, and on the receiving side reverse it. Simple once I saw it. But it was a good 20 minutes of staring at a packet that was arriving garbled before I understood why.
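The trick looks something like this. The offsets are my reading of the layout described above (fixed header, then payload), and the caller is assumed to have two spare bytes at the end of the buffer — none of this is the post's literal code:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of sequence-number insertion via memmove().
 * 'hdr' is the fixed header size; the buffer must have 2 spare bytes. */

/* Sender: open a 2-byte gap after the header and write a big-endian
 * 16-bit sequence number, so the header stays where parsers expect it.
 * Returns the new packet length. */
static size_t insert_seq(uint8_t *pkt, size_t len, size_t hdr, uint16_t seq)
{
    memmove(pkt + hdr + 2, pkt + hdr, len - hdr);  /* shift payload right */
    pkt[hdr]     = (uint8_t)(seq >> 8);
    pkt[hdr + 1] = (uint8_t)(seq & 0xff);
    return len + 2;
}

/* Receiver: read the seq, then close the gap so downstream parsing
 * sees exactly the packet it always saw. Returns the restored length. */
static size_t extract_seq(uint8_t *pkt, size_t len, size_t hdr, uint16_t *seq)
{
    *seq = (uint16_t)((pkt[hdr] << 8) | pkt[hdr + 1]);
    memmove(pkt + hdr, pkt + hdr + 2, len - hdr - 2);
    return len - 2;
}
```

`memmove()` rather than `memcpy()` matters here: the source and destination regions overlap, and only `memmove()` is defined for that.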

Once it was working, I tested it through the proxy with 10% drop rate. A treasure spawn packet went out. The proxy dropped it. Nothing on the client. The timer ticked. Timeout triggered. Resent. This time it got through. The $ appeared on screen.

That was the moment the whole system clicked into place. Not just conceptually — actually watching a message that would have silently vanished a week ago arrive after two retries. The game was no longer at the mercy of what the network decided to deliver.

The Segfault

Thursday I extended the reliability system to the server side and ran into the week's most interesting bug.

The server needed to send critical messages reliably too: new treasure notifications to all clients, user join/leave updates, position corrections. I built a server-side send_ack_packet() with a global sequence counter, added an ack_packets structure to track outstanding messages, and extended the game loop to check for timeouts and trigger retries.

Everything worked. Until a player disconnected mid-game.

Segfault.

The retry logic was scanning the pending ACK list and trying to resend packets to clients who no longer existed. The player's address data was freed when they disconnected, but the pointer in the ReliablePacket still pointed to where that data had been. Dereferencing a freed pointer is undefined behavior. In practice it crashed.

The fix was straightforward once I found it: before retrying a packet, check if the target player is still connected. If get_player_by_id() returns NULL, remove the pending packet from the list and skip the retry. One NULL check.
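In sketch form, with assumed names for the lookup and the player table (the real project's structures will differ):

```c
#include <stddef.h>

/* Hypothetical sketch of the disconnect guard in the retry loop. */

#define MAX_PLAYERS 8

struct player_slot { int id, connected; };

static struct player_slot players[MAX_PLAYERS];

/* Returns the connected player, or NULL if they've left — the lookup
 * that makes the retry loop safe. */
static struct player_slot *get_player_by_id(int id)
{
    for (int i = 0; i < MAX_PLAYERS; i++)
        if (players[i].connected && players[i].id == id)
            return &players[i];
    return NULL;
}

/* Before retrying a pending packet: 1 = safe to resend, 0 = the client
 * is gone, so drop the packet instead of dereferencing a freed address. */
static int should_retry(int target_id)
{
    return get_player_by_id(target_id) != NULL;
}
```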

There's a particular frustration to segfaults. The program doesn't fail at the line that's wrong — it fails wherever the dangling pointer happens to get dereferenced. I spent about 40 minutes slowly narrowing it down, adding printf statements to identify which function was crashing, then which specific pointer access. Eventually: a retry loop trying to call sendto() with an address pointer that pointed into freed memory.

Pointers don't know they're dangling. The responsibility is entirely yours.

After the fix, player disconnections became clean. The retry system removed pending packets for departed clients. No crashes.

The Last Piece

Friday I added connection handshaking and spent the second half of the session planning Month 3.

The connection handshake completed the reliability story. Before Week 8, a client would send a username and start receiving game state, with no formal acknowledgment that the server had actually accepted the connection. Under packet loss, the initial message could be dropped entirely. The client would be waiting for state updates that were never coming because the server didn't know it existed.

The new flow: client sends CONNECTION_REQUEST, server sends back CONNECTION_CONFIRMATION via the reliable ACK system, client sets user->connected = 1 when it arrives. While waiting for confirmation, the client displays a "connecting to server..." animation — a simple loop printing dots at 400ms intervals. After 3 seconds with no confirmation, it prints "failed to connect to server" and shuts down.
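The wait loop reduces to a small state check per tick. The constants and names below are assumptions matching the numbers in the flow above, not the project's actual code:

```c
/* Hypothetical sketch of the client-side handshake wait. */

#define CONNECT_TIMEOUT_MS 3000   /* give up after 3 seconds */
#define DOT_INTERVAL_MS    400    /* "connecting..." animation cadence */

enum connect_state { CONNECTING, CONNECTED, FAILED };

struct handshake { int confirmed; long started_ms; };

/* Called each loop tick: either CONNECTION_CONFIRMATION has arrived,
 * we're still inside the timeout window, or we report failure. */
static enum connect_state poll_handshake(const struct handshake *h,
                                         long now_ms)
{
    if (h->confirmed)
        return CONNECTED;          /* user->connected = 1 happens here */
    if (now_ms - h->started_ms >= CONNECT_TIMEOUT_MS)
        return FAILED;             /* "failed to connect to server" */
    return CONNECTING;             /* print a dot every DOT_INTERVAL_MS */
}
```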

Small detail, but it matters. Before, a connection failure looked like the program just... hanging. Nothing happened and you had no idea why. Now it's explicit: you either get confirmation or you get a clear failure message.

Then I tested everything through the proxy one more time. ./proxy 100 10. 100ms latency, 10% packet loss.

Movement: immediate. Client prediction working exactly as intended.

Treasure collection: stable. No more flickering, no more phantom treasures the server had already removed.

Other players: not teleporting. Position updates arrived when they arrived, and when they didn't, the last known position held until the next one came in.

Connection: the handshake succeeded with a couple retries visible in the proxy stats. Critical messages all delivered eventually.

The same setup that was "unbearable" last week was now a completely playable game.

I sat with that for a while. The proxy test from Week 7 had felt like a failure. This one felt like validation.

The second half of Friday was planning. Month 2 ends here — six weeks building game networking systems from the movement server through client prediction and reliable UDP. Month 3 starts Monday.

Month 3 is a task queue system. I applied to Roblox's infrastructure team for my internship this summer, and I'm optimistically preparing for it — concurrent systems, thread pools, work queues, mutexes. But honestly, the specific destination matters less than the fact that I want to learn this. I arbitrarily chose networking for this deep work journey, not because I knew I'd use it, but because I wanted to understand it. That's still the point. Learning hard things, building real systems, not because I have to, but because I want to.

Different domain. Same approach.

What Week 8 Proved

Week 7 was diagnosis. Week 8 was the prescription.

Client prediction made input feel instant. Not faster — the round-trip time didn't change. The server was still 100ms away. But the player experience became indistinguishable from a local game because movement rendered immediately without waiting for network confirmation. The 100ms still existed. It just stopped mattering.

Selective reliability fixed the broken invariants. The game had been losing critical state updates to packet loss — treasure spawns, player joins, connection handshakes. The ACK system made those reliable. But position updates deliberately stayed unreliable. Retransmitting old positions would have added latency for no benefit. Knowing what needs to be reliable and what doesn't is the actual engineering judgment, and it's not something you learn by reading about it.

And the proxy — the tool I built in Week 7 specifically to break things — became the tool I used in Week 8 to verify they were fixed. ./proxy 100 10 as a regression test. Run it after every change, see if things still hold. That feedback loop — build something, stress test it, fix it, stress test again — is how real systems get made. I didn't know when I built it that it would do double duty.

The moment that stuck with me was running that final proxy test and seeing all the parts working together. One week earlier I'd sat at the same table watching the game fall apart under conditions that aren't even unusual. This week I ran the same conditions and couldn't find a failure.

The gap between "works on my machine" and "works under real conditions" is the gap between a prototype and a product. Week 7 showed me the gap. Week 8 closed it.

Month 2 Complete

Forty days. Zero missed.

I want to spend a moment on that before moving forward.

When I started this in January, the point was discipline. I wrote in Week 1 about being a professional beginner — someone who gets excited about something, learns the basics fast, and bails when it gets hard or boring. The goal was to prove I could stick with something difficult long enough for it to matter.

I proved that. But something else happened that I didn't expect.

I started looking forward to it.

Not every day — Week 2's Day 9 at 3:01pm was genuinely miserable. Week 3's Friday was hollow and disconnected. Those sessions were discipline, nothing more. But somewhere between then and now, the ratio shifted. The sessions I dreaded became fewer. The sessions I lost track of time in became more common. By this week I was starting early on Friday because I wanted to, not because I was trying to get ahead.

I thought this journey would teach me to work hard. It did that. But more than that, it opened something up.

Two months of showing up every day has taught me more than just networking. I understand how real systems actually work differently now. I understand design tradeoffs. I understand what it means to build something that works under real conditions versus something that works in a demo. I understand, in a practical sense, what architecture means — not as a concept but as the thing you feel when your poor planning comes back as tedious debugging, and the thing you feel when your good planning pays off in five seconds of integration.

I understand myself better. What kinds of problems engage me. What makes work feel like flow versus cleanup. How much I'm capable of learning when I'm consistent about showing up.

The thing I keep coming back to is how much more effective this has been than any class or tutorial. Two and a half hours a day, on my own terms, building things I chose to build, solving problems I ran into naturally. No grade, no deadline, no audience. Just me and the work.

Learning something because you want to is categorically different from learning something because you have to. The same hours produce different results depending on why you're there.

I want to keep doing this. Not just for the remaining four months of this specific journey, but indefinitely. When this is over, I want to find the next hard thing and give it the same treatment. Not because it'll look good somewhere. Because I've discovered that I actually love this — the slow accumulation of real understanding, the moment a concept clicks because you've earned it through confusion, the quiet satisfaction of something you built yourself working correctly.

Anyone reading this who's been curious about trying something like it: do it. Pick something hard that you don't have to learn. Give it 2.5 hours a day. Don't measure it by what you accomplish. Measure it by whether you showed up. The understanding and the discipline will follow.

Month 3 starts Monday. Same table. 2pm. Day 41.