WLED supports a variety of network control protocols, including “realtime UDP” and (for audio-reactive ports) “Sound Sync”. UDP data “frames” comprising one or more packets are typically sent to WLED devices at fixed intervals, e.g. 20ms for the default 50Hz SoundSync update. Unfortunately, consumer-grade WiFi suffers from jitter of 20-80ms. Packets can occasionally arrive in rapid “floods” (several packets in <10ms), followed by long “droughts” of 75ms or more.
The solution? Store up a fixed number of “frames” from the sporadically arriving packet stream, then play them back at smooth intervals, essentially using the buffer to “weather the droughts and floods”. This dramatically smooths playback, effectively trading latency for smoothness. This buffering approach is referred to as Realtime Smoothing (RTS).
Note: In this description, “playing” a frame means updating the strip directly (for realtime UDP streams) or, for audio sync in the audio-reactive WLED usermod (AS), simply updating the volume, FFT, and related shared variables so that audio-reactive effects can pick them up and promptly update the display based on their values, typically in 10ms or less.
- Reference: RTS for UDP Realtime is described in the Realtime Smoothing WLED PR #1808.
- Algorithm Details (see the sketch after this list):
  - We keep track of frame arrival times and compute the intervals `I_i` (in ms) between adjacent frames.
  - During an initial training phase comprising a fixed number of frames, compute the full average and stddev scatter of the frame intervals.
  - After training is complete, begin tracking a trimmed exponential (decaying) average, `I`, of the sequence of intervals `I_i`, removing outliers.
  - Calculate the “bin error” as a similar exponentially decaying average of the offset between the current buffer fill and the desired buffer fullness (default: 3 of 5).
  - Based on the moving-average bin fullness, compute a guide factor `g = 1 - 2 * f * bin_error / buffer_size` by which the playback interval is guided up or down, via multiplication (`I_cor = g * I`). The factor `f` controls the maximum half-range of the guide factor, and should be `f < ~0.15`. In this manner, average bin fullness is smoothly maintained by slightly varying the playback intervals (`I_cor`). Possible guide factor values:
    - 1: target buffer fullness has been achieved; the discovered interval `I` will be used as-is.
    - <1: the buffer is over-full, and the interval should be reduced to empty the buffer somewhat faster than average.
    - >1: the buffer is under-full; frame playback is slowed down slightly to allow the buffer to refill back to the target level.
  - After each corrected interval `I_cor`, remove one frame from the end of the buffer and play it back.
  - If the buffer overflows, the oldest frame is immediately “played” to make room for the new frame; the reduced guide factor will eventually bring the buffer level back down to its target.
  - If the buffer underflows (“runs dry”), simply do nothing and wait for new packets; the buffer will eventually refill as the guide factor increases.
- Latency: the cost of smoothing the playback intervals for UDP WLED updates is additional latency, equal to the sending interval (e.g. 20ms) times the target buffer fullness, which by default is `(N+1)/2` where `N` is the buffer length (i.e. number of frames stored). The default of 5 frames with target fullness 3 adds 60ms of latency at a 50Hz frame rate.
- Structure: a simple struct array is used as a ring buffer, with a head position index that is moved forward (with wrap-around) when the oldest frame is shifted off the back of the buffer. In this manner a fixed, pre-allocated data store of size `N` can be continuously reused.
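Below is a minimal C++ sketch of the update loop described above, for illustration only; all names (`RtsBuffer`, `kGuideStrength`, `playbackStep`, ...) are hypothetical rather than taken from the actual PR #1808 code, and the bin error's sign is chosen so that an over-full buffer yields `g < 1`, matching the behavior listed above.

```cpp
#include <cstdint>

constexpr int   kBufSize       = 5;     // N: number of frames stored
constexpr float kTargetFill    = 3.0f;  // desired buffer fullness (3 of 5)
constexpr float kGuideStrength = 0.10f; // f: half-range of the guide factor, should be < ~0.15
constexpr float kAlpha         = 0.05f; // weight of the exponential (decaying) averages

struct Frame { uint8_t data[512]; };    // placeholder; payload size depends on the protocol

struct RtsBuffer {
  Frame frames[kBufSize];               // fixed, pre-allocated ring store
  int   head = 0;                       // index of the oldest frame (wraps around)
  int   count = 0;                      // current fill level
  float avgInterval = 20.0f;            // I: exponential average of arrival intervals, ms
  float binError = 0.0f;                // decaying average of (fill - target)

  // Called whenever a frame arrives; intervalMs is the gap since the previous arrival.
  void onFrameArrival(const Frame& f, float intervalMs) {
    avgInterval += kAlpha * (intervalMs - avgInterval);  // outlier trimming omitted here
    if (count == kBufSize) playOldest();                 // overflow: play oldest immediately
    frames[(head + count) % kBufSize] = f;
    ++count;
  }

  // Called by the playback timer; plays one frame and returns I_cor,
  // the corrected interval to wait before the next call.
  float playbackStep() {
    binError += kAlpha * ((count - kTargetFill) - binError);
    float g = 1.0f - 2.0f * kGuideStrength * binError / kBufSize; // guide factor
    if (count > 0) playOldest();   // underflow: play nothing, wait for refill
    return g * avgInterval;        // I_cor = g * I
  }

  void playOldest() {
    // ... hand frames[head] to the display here ...
    head = (head + 1) % kBufSize;
    --count;
  }
};
```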
An example frame sending, arrival, buffer state, and “playback” timeline for a sequence of 9 frame packets sent at uniform intervals, during steady-state RTS operation:

```
|--- time ------------------------------------------------------------------------------------>
| UDP packets sent, e.g. at I=20ms intervals
|   1       2       3       4       5       6       7       8       9
|   |<- I ->|       |       |       |       |       |       |       |
|--- time ------------------------------------------------------------------------------------>
| Frames arrive at WLED device via UDP uni/multicast, with delay and jitter
|     1        2   3         4     5              6   7   8
|     |        |   |         |     |              |   |   |
|--- time ------------------------------------------------------------------------------------>
| RTS Buffer State (@,* = older frames, - = empty). Here N_frame = 3, fill level = 2
|  @*   @   1@   1   21   2   32   3   43   4   -   5   65   765   76   7   87   8
|  |    |   |    |   |    |   |    |   |    |   |   |   |    |     |    |   |    |
|--- time ------------------------------------------------------------------------------------>
| RTS playback at approximately constant interval I, with ~2*I (e.g. 40ms) additional latency
|   *     @     1     2   under-    3     4   under-    5   over-    6     7
|   |     |     |     |    full     |     |    full     |    full    |     |
|--- time ------------------------------------------------------------------------------------>
```
See UDP-Sound-Sync info.
- Audio Sync (AS) is a feature of the sound-reactive fork that allows sending derived audio information via UDP, between WLED devices, or to other devices that can understand the AS format.
- AS via UDP simply replicates the FFT/volume information that would otherwise have been computed locally via sampling of an attached microphone. Frame rates up to 50Hz are possible.
- Unlike realtime UDP formats, which specify either all of the color/brightness data or an indexed list of such updates, AS sends a simple 44-byte `audioSyncPacket`. This packet contains information on the current and average sample volume, as well as 16 bins of FFT frequency information:

```cpp
#define UDP_SYNC_HEADER_V2 "00002"  // new "V2" audiosync struct - 44 Bytes

struct __attribute__ ((packed)) audioSyncPacket {  // WLEDMM "packed" ensures that there are no additional gaps
  char     header[6];         // 06 Bytes, offset 0  - "00002" for protocol version 2 (includes \0 for c-style string termination)
  uint8_t  pressure[2];       // 02 Bytes, offset 6  - optional - sound pressure as fixed point (8bit integer, 8bit fraction)
  float    sampleRaw;         // 04 Bytes, offset 8  - either "sampleRaw" or "rawSampleAgc" depending on soundAgc setting
  float    sampleSmth;        // 04 Bytes, offset 12 - either "sampleAvg" or "sampleAgc" depending on soundAgc setting
  uint8_t  samplePeak;        // 01 Bytes, offset 16 - 0 no peak; >=1 peak detected. In future, this will also provide peak Magnitude
  uint8_t  frameCounter;      // 01 Bytes, offset 17 - optional - rolling counter to track duplicate/out of order packets
  uint8_t  fftResult[16];     // 16 Bytes, offset 18 - 16 GEQ channels, each channel has one byte (uint8_t)
  uint16_t zeroCrossingCount; // 02 Bytes, offset 34 - optional - number of zero crossings seen in 23ms
  float    FFT_Magnitude;     // 04 Bytes, offset 36 - largest FFT result from a single run (raw value, can go up to 4096)
  float    FFT_MajorPeak;     // 04 Bytes, offset 40 - frequency (Hz) of largest FFT result
};
```
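On the receiving side, decoding reduces to a header/length check followed by a byte-wise copy, since the struct is packed. The sketch below is a hedged illustration (`decodeAudioSyncV2` is a made-up name, not WLED's actual handler), assuming the struct and header macro defined above:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative only: validate and decode a received V2 packet.
bool decodeAudioSyncV2(const uint8_t* buf, size_t len, audioSyncPacket& out) {
  if (len < sizeof(audioSyncPacket)) return false;            // not a full 44-byte packet
  if (memcmp(buf, UDP_SYNC_HEADER_V2, 6) != 0) return false;  // wrong protocol version
  memcpy(&out, buf, sizeof(audioSyncPacket));                 // packed layout: byte-wise copy is safe
  return true;
}
```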
Other information about this data structure, via softhack007:
> `MajorPeak` is in Hz, `Magnitude` has no units but usually is between 16 and 4096. Raw and smooth are very similar; as you said, raw is instant, smooth has some low-pass filtering applied to make it less “jumpy”. If you have a sender that does not provide both, it’s OK if the two variables are assigned the same value. 255 is the max of all 16 channels overall, but not the strongest per channel. There are different “frequency scaling” options in the audioreactive settings; the 16 `fftResult` elements are scaled accordingly - either linear, square root, or logarithmic.

- Uses UDP multicast to the group address 239.0.0.1 on port 11988 (by default).
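For a third-party receiver (e.g. a Linux tool) that wants to listen in on the default group, joining via plain POSIX sockets might look like the sketch below; this is an illustrative assumption, not code from the WLED project:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
  int sock = socket(AF_INET, SOCK_DGRAM, 0);   // plain UDP socket

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(11988);                // default AudioSync port
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  bind(sock, (sockaddr*)&addr, sizeof(addr));

  ip_mreq mreq{};                              // join the 239.0.0.1 multicast group
  mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
  mreq.imr_interface.s_addr = htonl(INADDR_ANY);
  setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

  uint8_t buf[64];                             // V2 packets are 44 bytes
  ssize_t n = recv(sock, buf, sizeof(buf), 0); // blocks until a packet arrives
  printf("received %zd bytes\n", n);
  close(sock);
  return 0;
}
```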
- Unlike the realtime UDP algorithms, AS (part of the `audioreactive` usermod) does not itself directly update the display. Instead it simply calculates and updates variables shared via a pointer array with the main WLED display code, which some effects pick up and use for their display.
- Currently, AS implements a simple interpolation scheme, effectively introducing a ~20ms (single-frame) latency and interpolating the FFT/volume data between the 2nd-to-last and most recent samples (see the sketch below). This allows audio-reactive effects, which may update at high rates (150Hz is possible) and at unpredictable times, to always receive new (synthetic) data, avoiding apparent “freezing” during packet droughts. This method does not mitigate jitter in AudioSync packet arrival.
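A rough sketch of how such a single-frame interpolation can work (hypothetical names and structure; the real usermod shares data through its pointer array rather than globals):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: blend between the two most recent AudioSync samples so
// effects polling at arbitrary times always see fresh (synthetic) data.
struct AsSample { float volume; uint8_t fft[16]; uint32_t arrivalMs; };

AsSample gPrev, gLast;  // 2nd-to-last and most recent packets (hypothetical globals)

// Called from the usermod loop at unpredictable times (as often as every 2ms).
void interpolateShared(uint32_t nowMs, float& volOut, uint8_t (&fftOut)[16]) {
  uint32_t span = gLast.arrivalMs - gPrev.arrivalMs;   // ~20ms at 50Hz
  float t = span ? std::min(1.0f, float(nowMs - gLast.arrivalMs) / span) : 1.0f;
  volOut = gPrev.volume + (gLast.volume - gPrev.volume) * t;
  for (int i = 0; i < 16; ++i)                         // blend each GEQ channel
    fftOut[i] = uint8_t(gPrev.fft[i] + (gLast.fft[i] - gPrev.fft[i]) * t);
}
```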
In addition to guiding playback toward constant intervals as in the timing diagram above, an interpolation scheme could be employed, combining the current rapid-update capability of AS with RTS interval smoothing. Interpolation would occur, as it currently does, whenever the usermod loop (UL) is called, which happens at unpredictable times, as frequently as every 2ms.
Changes Needed:

- Increase the ring buffer length.
- Optionally, target a higher buffer fill level, e.g. 4/6 or 5/8. This will increase the latency, to ensure the buffer contains at least 2 frames most of the time (but see below).
- Keep track of the playback interval `I`, in the same manner as above.
- When it would normally be time to “play back” a given frame `F` and remove it from the buffer (at `t = t_F`), instead retain it for an additional cycle.
- Whenever UL gets called after time `t_F`, play back interpolated values between `F` and `F+1` (the next newer frame, if available).
- The interpolation is based on the time since `F` was scheduled for and first contributed to playback, relative to the current estimated corrected interval `I_cor`: `val_UL = val_F + (val_{F+1} - val_F) * (t - t_F) / I_cor` (see the sketch after this list).
- When it is past time `t_{F+1}` at which `F+1` is to be played (i.e. at `t = t_F + I_cor`), remove `F` from the buffer and repeat this pattern for `F+1` and `F+2`.
- Note that at `t = t_F` the value of `F` is used as-is, and then the value gradually and smoothly morphs into that of `F+1` as its playback time approaches.
- If the buffer is empty when UL is called (hopefully a rare occurrence), simply do not play back anything; let the old values stand.
- If at that time there is only one frame available, just play it back (retaining it, unless the time has arrived for `F+1` playback and it should be removed).
- The goal is to configure the buffer size and fullness level (i.e. the average latency) so that the prior two conditions rarely occur.
- For the more typical case of 2 or more frames in the buffer, proceed with the normal interpolation scheme above.
- Note that the latency is now only `I * (N_fill - 1)`, since frame `F` is kept for an additional cycle after the time it first gets “played”.
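The read-out step of that formula might look like the following sketch (hypothetical names; the bookkeeping that drops `F` and advances to `F+1`/`F+2` when `t` passes `t_{F+1}` is left to the caller):

```cpp
#include <cstdint>

// Sketch of the proposed buffered-interpolation read-out (names hypothetical).
// Frame F was first "played" at t_F; F+1 is the next newer frame in the buffer.
float interpolateBI(float valF, float valF1, uint32_t nowMs, uint32_t tFMs, float iCorMs) {
  float t = (nowMs - tFMs) / iCorMs;  // 0 at t_F, reaching 1 at t_F + I_cor
  if (t > 1.0f) t = 1.0f;             // past t_{F+1}: caller should remove F and advance
  return valF + (valF1 - valF) * t;   // val_UL from the formula above
}
```

With the default `I = 20ms`, the `I * (N_fill - 1)` latency comes to 20ms at a fill target of 2 and 40ms at 3, i.e. the 20-40ms range raised in the discussion below.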
Buffered Interpolation (BI) sounds interesting.
If the RTS adds 60ms, would BI basically only add between 20 and 40ms?
I get the feeling that we might not be able to judge what's best until we can actually hear and see the results.
For comparison, we can benchmark WiFi vs. two wired Ethernet devices running unicast as our zero-latency baseline.