Investigating packet loss

Continuing the discussion from LAN protocol group support:

FWIW, I’m seeing a lot of “packet loss” too, and had been wondering whether it was the network or the bulbs. What sort of diagnostics do you use to differentiate one from the other?


I’m really interested in this too. If you find any evidence that it’s the bulbs, I would love to take it to the firmware teams.

Of course it’s difficult to confirm, because we can’t see what’s happening on the bulb’s network interface. The best I’ve been able to do is provide correlation: sending streams of unicast packets to other wireless clients to confirm that there are no drops to other devices, and sending broadcast LIFX packets, again confirming that they arrive at other wireless clients. This is obviously not fool-proof methodology, but since no other clients observe packet loss on the network, while LIFX bulbs require re-sends to operate correctly, I can only conclude that the bulbs are not processing all of the packets sent to them.

It’s been a little while since I ran these diagnostics, but from memory, reliability improved somewhat once I added per-device rate-limiting to my protocol implementation. That suggests there are states in which the firmware may not process input, because I don’t think I was sending significant quantities of packets (in which case I could see small buffers filling quickly), just packets close together. My recollection could be wrong here, though.

We do specifically call out a limit in the LAN documentation:

Maximum recommended message transmit rate to a device: 20 per second

The buffers on the device are quite small (like a couple of packets) so it wouldn’t surprise me if multiple packets in quick succession could cause it to drop them.
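
In concrete terms, that works out to at least 50 ms between packets addressed to the same device. Here is a minimal sketch of one way to pace sends (Python, with illustrative names; this isn’t from any of our libraries):

import time

MIN_INTERVAL = 1.0 / 20  # 20 messages/second, i.e. 50 ms between sends

class DevicePacer:
    """Enforces a minimum gap between sends to a single device."""

    def __init__(self):
        self.last_send = 0.0  # monotonic timestamp of the previous send

    def wait_for_slot(self):
        """Sleep until at least MIN_INTERVAL has passed since the
        previous send to this device, then record the new send time."""
        remaining = MIN_INTERVAL - (time.monotonic() - self.last_send)
        if remaining > 0:
            time.sleep(remaining)
        self.last_send = time.monotonic()

Keep one pacer per bulb and call wait_for_slot() just before each transmit.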

I’ve been good about heeding the rate limit in the case of failure (I wait 100ms before retransmitting), but I haven’t been paying too much attention to the rate limit in the case of success. In other words, as soon as I get back a State message (if I sent a Get) or an Ack (if I sent a Set), I’m feeling free to send another message right away. If the problem is limited receive buffers, then that should be okay, since we know it already received the previous message. But if it’s a matter of the firmware needing to go do other, non-network related things, then I might be overwhelming it.
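
In rough Python terms (my actual client is in Haskell), the pattern I’m describing looks like the sketch below; send_with_retry and its constants are made up for illustration, and it assumes a UDP socket already connected to the bulb:

import socket

RETRY_INTERVAL = 0.1  # wait 100 ms before retransmitting on failure
MAX_ATTEMPTS = 5

def send_with_retry(sock, packet):
    """Send one message and wait for its reply (a State for a Get, or
    an Ack for a Set). On silence, retransmit after 100 ms. On success,
    return immediately, leaving the caller free to send the next
    message right away."""
    sock.settimeout(RETRY_INTERVAL)
    for attempt in range(MAX_ATTEMPTS):
        sock.send(packet)
        try:
            return sock.recv(1024)
        except socket.timeout:
            continue  # no reply within 100 ms, so retransmit
    raise TimeoutError("no reply after %d attempts" % MAX_ATTEMPTS)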

Wow, a couple of packets is even smaller than I suspected. I haven’t yet added per-device rate-limit updates for broadcast packets, because that adds a bunch of complexity/coupling (broadcasts need to be aware of the last unicast send to any device, and unicast sends need to be aware of the last broadcast… ugly), and because very few broadcast packets are issued in general. But perhaps I need to bump that up the priority list. As my implementation stands, it’s possible for a couple of packets to be sent in quick succession when there is a mix of broadcast and unicast packets.

I’d have to check some debug logs, but I’m pretty sure there are moderately frequent retries that are not explainable by this phenomenon though.

Is the suggested rate limit just a best guesstimate for the maximum command execution time? Is there any thermal throttling that might impact that expectation? Is the rate limit suggested purely to avoid filling the buffer, or is there some other condition under which incoming packets may not be processed?

Sending one packet per second, I’m not seeing much packet loss. So it probably is rate-dependent. What’s interesting, though, is that the round-trip time doing EchoRequest is much larger and more variable than the round-trip time with ICMP ping:

WhiteAndNerdy% cabal run -- ping
Preprocessing executable 'lifx' for lifx-0.1.0.0...
Running lifx...
(Bulb en1 192.168.11.3:56700 d073d50225cd) #1 322.532ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #2 56.564ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #3 383.219ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #4 75.674ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #5 9.261ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #6 13.978ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #7 408.670ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #8 98.674ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #9 323.594ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #10 38.882ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #11 8.519ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #12 12.979ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #13 342.054ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #14 35.103ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #15 360.318ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #16 1020.353ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #17 23.098ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #18 61.994ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #19 9.930ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #20 893.613ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #22 14.104ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #23 424.729ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #24 112.509ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #25 8.983ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #26 27.317ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #27 11.486ms
(Bulb en1 192.168.11.3:56700 d073d50225cd) #28 41.697ms
^C
WhiteAndNerdy% ping 192.168.11.3
PING 192.168.11.3 (192.168.11.3): 56 data bytes
64 bytes from 192.168.11.3: icmp_seq=0 ttl=128 time=6.716 ms
64 bytes from 192.168.11.3: icmp_seq=1 ttl=128 time=5.978 ms
64 bytes from 192.168.11.3: icmp_seq=2 ttl=128 time=7.774 ms
64 bytes from 192.168.11.3: icmp_seq=3 ttl=128 time=9.813 ms
64 bytes from 192.168.11.3: icmp_seq=4 ttl=128 time=10.930 ms
64 bytes from 192.168.11.3: icmp_seq=5 ttl=128 time=12.476 ms
64 bytes from 192.168.11.3: icmp_seq=6 ttl=128 time=6.455 ms
64 bytes from 192.168.11.3: icmp_seq=7 ttl=128 time=11.092 ms
64 bytes from 192.168.11.3: icmp_seq=8 ttl=128 time=7.634 ms
64 bytes from 192.168.11.3: icmp_seq=9 ttl=128 time=7.197 ms
64 bytes from 192.168.11.3: icmp_seq=10 ttl=128 time=25.510 ms
64 bytes from 192.168.11.3: icmp_seq=11 ttl=128 time=11.167 ms
64 bytes from 192.168.11.3: icmp_seq=12 ttl=128 time=8.163 ms
64 bytes from 192.168.11.3: icmp_seq=13 ttl=128 time=8.055 ms
64 bytes from 192.168.11.3: icmp_seq=14 ttl=128 time=3.993 ms
64 bytes from 192.168.11.3: icmp_seq=15 ttl=128 time=7.537 ms
64 bytes from 192.168.11.3: icmp_seq=16 ttl=128 time=12.484 ms
64 bytes from 192.168.11.3: icmp_seq=17 ttl=128 time=6.194 ms
64 bytes from 192.168.11.3: icmp_seq=18 ttl=128 time=7.326 ms
64 bytes from 192.168.11.3: icmp_seq=19 ttl=128 time=7.302 ms
64 bytes from 192.168.11.3: icmp_seq=20 ttl=128 time=7.079 ms
64 bytes from 192.168.11.3: icmp_seq=21 ttl=128 time=6.866 ms
64 bytes from 192.168.11.3: icmp_seq=22 ttl=128 time=7.597 ms
64 bytes from 192.168.11.3: icmp_seq=23 ttl=128 time=8.541 ms
64 bytes from 192.168.11.3: icmp_seq=24 ttl=128 time=8.307 ms
64 bytes from 192.168.11.3: icmp_seq=25 ttl=128 time=7.170 ms
64 bytes from 192.168.11.3: icmp_seq=26 ttl=128 time=8.111 ms
64 bytes from 192.168.11.3: icmp_seq=27 ttl=128 time=7.997 ms
64 bytes from 192.168.11.3: icmp_seq=28 ttl=128 time=12.288 ms
^C
--- 192.168.11.3 ping statistics ---
29 packets transmitted, 29 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.993/8.888/25.510/3.744 ms

The longest round-trip time for ICMP is 25 ms, while the longest RTT for EchoRequest is over a second! This suggests that the firmware is, indeed, doing a lot of other stuff besides servicing network requests. I wonder why it’s able to service ICMP so much more quickly. The existence of two firmware version messages suggests there are two microcontrollers in the bulb. So perhaps ICMP is serviced directly by the “wifi” microcontroller, while LIFX protocol messages such as EchoRequest have to go through the “host” microcontroller, which apparently is very busy. That’s just a theory, though.

You pretty much got it in one. There are two microcontrollers: one responsible for managing the Wifi connection and networking, and another that manages the functions of the bulb. An EchoRequest message will traverse both of these controllers, whereas a ping is returned directly from the Wifi side.

(OK, this discussion has already gone a little bit over my head, so if any of the following is complete nonsense, I apologise! I am a total noob at this networking stuff.)

This is a really good point and something I hadn’t considered at all. I hadn’t even thought about the rate limits except in terms of not resending a failed message within less than 50ms. I’m sure I’ve been sending successful messages in very quick succession, which is probably not helping matters at all.

One issue that strikes me (you are probably all way ahead of me here) is that messages can be unicast or broadcast to the bulbs by various different clients (one or more instances of the mobile app, the Lifx Cloud, etc.) at any time. In particular, I know that Lifx Cloud checks in with the bulbs periodically (perhaps once per minute or something similar).

So even if your own implementation is squeaky clean in terms of how frequently it sends messages to each bulb, there will always be a small chance that one of your messages is sent less than 50ms before or after one of the Lifx Cloud or app’s messages. And if I understand correctly, that could potentially jam up the buffer?

The only way I can think of avoiding this is to keep a socket constantly listening for network traffic, and if it detects any, to tell your packet-sender to delay for at least 50ms before it sends its next packet. But I’ve no idea if this is possible, especially with unicast messages…
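
Something like the following is what I have in mind, as a very rough Python sketch (all the names are invented). Though I suspect an ordinary socket would only ever see broadcasts, since unicast messages from other clients go straight to the bulb:

import socket
import time

LIFX_PORT = 56700
last_seen = 0.0  # when we last observed someone else's broadcast

listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", LIFX_PORT))
listener.setblocking(False)

def note_other_traffic():
    """Drain any pending broadcasts, remembering when the last one arrived."""
    global last_seen
    while True:
        try:
            listener.recv(4096)
        except BlockingIOError:
            return
        last_seen = time.monotonic()

def safe_to_send(gap=0.05):
    """True if no other client's broadcast was seen in the last `gap` seconds."""
    note_other_traffic()
    return time.monotonic() - last_seen >= gap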

It appears to me that you understand it quite well, but if there’s anything you’d like me to explain further, just let me know!

That occurred to me, too, but I decided not to worry about it. If we assume the buffer can hold at least two packets, then two different clients could send to the bulb at once. Even if there are more than two clients, the chance that three or more would send a packet at the same time seems small.

In theory, you could put the network interface in promiscuous mode, which is what tools like Wireshark or tcpdump do. This allows receiving packets that aren’t addressed to you. But even if that works, I think it’s going way beyond the call of duty.

Since the wifi microcontroller seems to keep up pretty well, I’m not so much worried about the actual networking buffers. What I’m more worried about is what the communication between the two microcontrollers looks like, and how much buffering that has.

I wonder what the “host” microcontroller (is that the proper name for the one that isn’t the wifi microcontroller?) is doing that makes it go “out to lunch” for up to a second. If we’re retransmitting every 50 ms, then we will have sent 20 packets by then!

Perhaps Daniel can shed some light on this next week, but I believe it’s already Saturday in Australia, even though it’s just 11:30 Friday morning here in California.

We don’t expect you to be aware of other clients sending messages. The stated limit of 20 per second was found through experimentation with multiple clients, including the cloud. In my Python library I plan to implement this by keeping counters for each bulb: when I send a broadcast I increment all counters, and when I send a unicast I increment a single counter. I haven’t gotten there yet, though, and at the moment I just send packets as they are generated.
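
A minimal sketch of that bookkeeping (hypothetical names, with last-send timestamps standing in for the counters, but the idea is the same):

import time

MIN_INTERVAL = 1.0 / 20  # the 20 messages/second recommendation

last_send = {}  # bulb serial -> monotonic timestamp of its last send

def wait_for_unicast_slot(serial):
    """Wait until this bulb may receive another packet, then mark it."""
    wait = MIN_INTERVAL - (time.monotonic() - last_send.get(serial, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_send[serial] = time.monotonic()

def wait_for_broadcast_slot(known_serials):
    """A broadcast reaches every bulb, so wait for the busiest bulb's
    slot and then move every bulb's timestamp forward."""
    now = time.monotonic()
    waits = [MIN_INTERVAL - (now - last_send.get(s, 0.0)) for s in known_serials]
    longest = max(waits, default=0.0)
    if longest > 0:
        time.sleep(longest)
    now = time.monotonic()
    for s in known_serials:
        last_send[s] = now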

If you fill up the packet buffers the worst that should happen is that incoming packets may be lost. This is why you need to retransmit packets sometimes.

We also call it the host microcontroller here too. :smiley:

I haven’t seen device delays as big as the ones you saw in your tests. Any chance you can run those ICMP pings in parallel with the EchoRequest tests?

Sure; here they are in parallel.

However, I realized this may not be entirely fair, since Haskell is a garbage-collected language. To do a really fair comparison, I should write an EchoRequest client in C. But I’m going to be busy over the next week, so I might not get that done right away.

So, for a quicker answer to “is it the garbage collection?”, I used the runtime option “-S”, which prints statistics for each garbage collection. Here are those results. Somewhat suspiciously, the garbage collections and the >1 second EchoRequests seem to happen at roughly similar frequencies. However, they don’t seem to happen in lockstep. And the “gc elap” column is 0.00 in all cases except one, where it is 0.01. That column is in seconds, so nearly all garbage collections took less than 10 ms, and all took less than 20 ms.

So I don’t think garbage collection explains the echoes that take longer than a second.

I agree; I really doubt that Haskell’s GC is having any significant effect. One way to rule it out entirely, though, would be to perform a packet capture with Wireshark and simply correlate the request timings with the response timings. This would actually help the firmware devs recreate the issue, but you’ll probably want to share the capture in a PM, as it might contain sensitive data.
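
If the Wireshark GUI feels heavyweight, a capture script along these lines would gather the same timing data (a sketch only, using Python’s scapy library; BULB_IP is just the address from your transcripts, and sniffing requires elevated privileges):

from scapy.all import IP, sniff

BULB_IP = "192.168.11.3"  # the bulb from the transcripts above

def show(pkt):
    """Print a timestamp and direction for each LIFX packet, so
    request/response pairs can be matched up afterwards."""
    direction = "->" if pkt[IP].dst == BULB_IP else "<-"
    print("%.6f %s %d bytes" % (pkt.time, direction, len(pkt)))

sniff(filter="host %s and udp port 56700" % BULB_IP, prn=show)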

I’ll raise this matter with the firmware team and see what they say.