WLED AudioSync Realtime Smoothing with Interpolation Proposal

Background

WLED supports a variety of network control protocols, including “realtime UDP” and (for audio-reactive ports) “Sound Sync”. UDP data “frames” comprising one or more packets are typically sent to WLED devices at fixed intervals, e.g. 20ms for the default 50Hz SoundSync update. Unfortunately, consumer-grade WiFi suffers from jitter of 20-80ms. Packets can occasionally arrive in rapid “floods” (several packets in <10ms), followed by long “droughts” of 75ms or more.

The solution? Store up a fixed number of “frames” from the sporadically arriving packet stream, then play them back at smooth intervals, essentially using the buffer to “weather the droughts and floods”. This dramatically smooths playback, effectively trading latency for smoothness. This buffering approach is referred to as Realtime Smoothing (RTS).

Current RTS algorithm

Note: In this description, “playing” a frame means updating the strip directly for realtime UDP streams, or, for audio sync in the audio-reactive WLED usermod (AS), simply updating the volume, FFT and related shared variables so audio-reactive effects can pick them up and promptly update the display based on their values, typically in 10ms or less.

  • Reference: RTS for UDP Realtime is described in the Realtime Smoothing WLED PR #1808.
  • Algorithm Details:
    • We keep track of frame arrival times, and compute the intervals I_i (in ms) between adjacent frames.
    • During an initial training phase comprising a fixed number of frames, compute the full average and standard deviation of the frame intervals.
    • After training is complete, begin tracking a trimmed exponential (decaying) average, I, of the sequence of intervals I_i, removing outliers.
    • Calculate the “bin error” as a similar exponentially-decaying average of the offset between the desired buffer fullness (default: 3 of 5) and the current buffer fill.
    • Based on the moving-average bin fullness, compute a guide factor g = 1 - 2 * f * bin_error / buffer_size, by which the playback interval is guided up or down via multiplication (I_cor = g * I). The factor f controls the maximum half-range of the guide factor and should be f < ~0.15. In this manner, average bin fullness is smoothly maintained by slightly varying the playback intervals (I_cor); a sketch of this computation follows this list. Possible guide factor values:
      • g = 1: the target buffer fullness has been achieved; the discovered interval I is used as-is.
      • g < 1: the buffer is over-full; the interval is reduced to empty the buffer somewhat faster than average.
      • g > 1: the buffer is under-full; frame playback is slowed slightly to allow the buffer to refill to the target level.
    • After each corrected interval I_cor, remove one frame from the end of the buffer and play it back.
    • If the buffer overflows, the oldest frame is immediately “played” to make room for the new frame. The reduced guide factor will eventually bring the buffer level back down to its target.
    • If the buffer underflows (“runs dry”), simply do nothing and wait for new packets; it will eventually refill as guide factor increases.
  • Latency: the cost of smoothing the playback intervals for UDP WLED updates is additional latency, equal to the sending interval (e.g. 20ms) times the target buffer fullness, which by default is (N+1)/2 where N is the buffer length (i.e. the number of frames stored). The default of 5 frames with target fullness 3 adds 60ms of latency at a 50Hz frame rate.
  • Structure: a simple struct array is used as a ring buffer, with a head position index that is moved forward (with wrap-around) when the oldest frame is shifted off the back of the buffer. In this manner a fixed pre-allocated data store of size N can be continuously reused.
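
As a concrete illustration, here is a minimal C++ sketch of the interval tracking and guide-factor computation described above. All names (trackInterval, correctedInterval, etc.) are illustrative, not the identifiers used in PR #1808; the bin error is taken here as (current fill − target fill), so that an over-full buffer drives g below 1, consistent with the guide factor values listed above.

    #include <stdint.h>

    // Illustrative constants matching the defaults described above.
    constexpr int   BUF_SIZE    = 5;      // N: number of frames stored
    constexpr float TARGET_FILL = 3.0f;   // target buffer fullness (3 of 5)
    constexpr float F_GUIDE     = 0.10f;  // f: guide half-range factor, f < ~0.15
    constexpr float ALPHA       = 0.10f;  // weight of the exponential (decaying) averages

    static float avgInterval = 20.0f;     // I: trimmed moving average of intervals (ms)
    static float binError    = 0.0f;      // moving average of (fill - target)

    // Called for each arriving frame with the measured inter-arrival interval I_i.
    void trackInterval(float intervalMs) {
      // Trim outliers: ignore intervals far outside the current average
      // (floods and droughts should steer the guide factor, not corrupt I).
      if (intervalMs > 0.25f * avgInterval && intervalMs < 4.0f * avgInterval)
        avgInterval += ALPHA * (intervalMs - avgInterval);
    }

    // Called at each playback tick; returns the corrected interval I_cor = g * I.
    float correctedInterval(int fillLevel) {
      binError += ALPHA * ((float)fillLevel - TARGET_FILL - binError);
      float g = 1.0f - 2.0f * F_GUIDE * binError / BUF_SIZE;  // guide factor
      return g * avgInterval;
    }

With these defaults, a steady 50Hz stream settles at I ≈ 20ms, and the resting latency is TARGET_FILL × I ≈ 60ms, as noted in the Latency item above.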

Timing Diagram

An example frame sending, arrival, buffer state, and “playback” timeline for a sequence of 9 frame packets sent at uniform intervals, during steady-state RTS operation:

|--- time ----------------------------------------------------------------------------------->
| UDP packets sent, e.g. at I=20ms intervals
1         2         3         4         5         6         7         8         9  
|         | <- I -> |         |         |         |         |         |         |               
|--- time ----------------------------------------------------------------------------------->
| Frames arrive at WLED device via UDP uni/multicast, with delay and jitter
|           1            2      3          4                  5  6  7                8
|           |            |      |          |                  |  |  |                |
|--- time ----------------------------------------------------------------------------------->
| RTS Buffer State (@,* = older frames, - = empty).  Here N_frame = 3, fill level = 2
|           |            |      |          |                  |  |  |                |
| @*   @    1@    1      21  2  32     3   43     4         - 5  65 765 76       7   87    8
|      |          |          |         |          |         |           |        |         |
|--- time ----------------------------------------------------------------------------------->
| RTS playback at approximately constant interval I, with ~2*I (e.g. 40ms) additional latency
|      *          @          1         2  under-  3         4   under-  5  over- 6         7  
|      |          |          |         |  full    |         |   full    |  full  |         |
|--- time ----------------------------------------------------------------------------------->

Audio Sync Differences

See UDP-Sound-Sync info.

  • Audio Sync (AS) is a feature of the sound-reactive fork that allows sending derived audio information via UDP between WLED devices, or to other devices that understand the AS format.
  • AS via UDP simply replicates the FFT/volume information that would otherwise have been computed locally, via sampling of an attached microphone. Frame rates up to 50Hz are possible.
  • Unlike realtime UDP formats, which specify either all of the color/brightness data or an indexed list of such updates, AS sends a simple 44-byte audioSyncPacket. This packet contains information on the current and average sample volume, as well as 16 bins of FFT frequency information:
    #define UDP_SYNC_HEADER_V2 "00002"
    // new "V2" audiosync struct - 44 Bytes
    struct __attribute__ ((packed)) audioSyncPacket {  // WLEDMM: "packed" ensures that there are no additional gaps
      char     header[6];         // 06 Bytes, offset 0  - "00002" for protocol version 2 (includes \0 for C-style string termination)
      uint8_t  pressure[2];       // 02 Bytes, offset 6  - optional - sound pressure as fixed point (8-bit integer, 8-bit fraction)
      float    sampleRaw;         // 04 Bytes, offset 8  - either "sampleRaw" or "rawSampleAgc" depending on soundAgc setting
      float    sampleSmth;        // 04 Bytes, offset 12 - either "sampleAvg" or "sampleAgc" depending on soundAgc setting
      uint8_t  samplePeak;        // 01 Byte,  offset 16 - 0 no peak; >=1 peak detected. In future, this will also provide peak magnitude
      uint8_t  frameCounter;      // 01 Byte,  offset 17 - optional - rolling counter to track duplicate/out-of-order packets
      uint8_t  fftResult[16];     // 16 Bytes, offset 18 - 16 GEQ channels, one byte (uint8_t) each
      uint16_t zeroCrossingCount; // 02 Bytes, offset 34 - optional - number of zero crossings seen in 23ms
      float    FFT_Magnitude;     // 04 Bytes, offset 36 - largest FFT result from a single run (raw value, can go up to 4096)
      float    FFT_MajorPeak;     // 04 Bytes, offset 40 - frequency (Hz) of largest FFT result
    };

    Other information about this data structure, via softhack007:

    MajorPeak is in Hz; Magnitude has no units but is usually between 16 and 4096. Raw and smooth are very similar: as you said, raw is instant, while smooth has some low-pass filtering applied to make it less “jumpy”. If you have a sender that does not provide both, it’s OK if the two variables are assigned the same value.

    255 is the max of all 16 channels overall, but not the strongest per channel. There are different “frequency scaling” options in the audioreactive settings. The 16 fftResult elements are scaled accordingly: either linear, square root, or logarithmic.

  • Uses UDP multicast to the group address 239.0.0.1 on port 11988 (by default); a receive-side sketch follows this list.
  • Unlike the realtime UDP algorithms, AS, part of the audioreactive usermod, does not itself directly update the display. Instead it simply calculates and updates variables shared (via a pointer array) with the main WLED display code, which some effects pick up and use for their display.
  • Currently, AS implements a simple interpolation scheme, effectively introducing a ~20ms (single-frame) latency and interpolating the FFT/volume data between the second-to-last and most recent samples. This allows audio-reactive effects, which may update at high rates (150Hz is possible) and at unpredictable times, to always receive fresh (synthetic) data, avoiding apparent “freezing” during packet droughts. This method does not, however, mitigate jitter in AudioSync packet arrival.
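
To make the receive side concrete, the following is a hedged sketch of joining the multicast group and validating an incoming V2 packet, assuming the audioSyncPacket struct above and the ESP32 Arduino WiFiUDP API; the actual usermod’s networking code differs in detail.

    #include <WiFiUdp.h>
    #include <string.h>

    WiFiUDP audioSyncUdp;

    void audioSyncBegin() {
      // Default AudioSync multicast group and port, per the list above.
      audioSyncUdp.beginMulticast(IPAddress(239, 0, 0, 1), 11988);
    }

    // Returns true and fills `pkt` if a valid V2 AudioSync packet was received.
    bool audioSyncReceive(audioSyncPacket &pkt) {
      int len = audioSyncUdp.parsePacket();
      if (len < (int)sizeof(audioSyncPacket)) return false;    // too short for V2
      audioSyncUdp.read(reinterpret_cast<uint8_t *>(&pkt), sizeof(pkt));
      return strncmp(pkt.header, UDP_SYNC_HEADER_V2, 6) == 0;  // "00002" + '\0'
    }

On a valid packet, the volume fields and 16 fftResult bins would then be written into the shared variables, or, under this proposal, pushed into the RTS ring buffer.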

Addition: Buffered Interpolating Realtime Playback

As an addition to guiding for constant-interval playback as in the above timing diagram, an interpolation scheme could be employed, combining the current rapid-update capability of AS with RTS interval smoothing. This would occur, as it currently does, whenever the usermod loop (UL) is called, which happens at unpredictable times, as frequently as every 2ms.

Changes Needed:

  • Increase the ring buffer length.
  • Optionally, target a higher buffer fill level, e.g. 4/6 or 5/8. This will increase the latency to ensure the buffer contains at least 2 frames most of the time (but see below).
  • Keep track of the playback interval I, in the same manner as above.
  • When it would normally be time to “playback” a given frame F and remove it from the buffer (at t=t_F), instead, retain it for an additional cycle.
  • Whenever UL gets called after time t_F, play back interpolated values between F and F+1 (the next newer frame, if available).
  • The interpolation is based on the time since F was scheduled for and first contributed to playback, relative to the current estimated corrected interval I_cor: val_UL = val_F + (val_{F+1} - val_F) * (t - t_F) / I_cor. A sketch of this playback loop follows this list.
  • When it is past time t_{F+1} at which F+1 is to be played (i.e. at t=t_F + I_cor), remove F from the buffer and repeat this pattern for F+1 and F+2.
  • Note that at t=t_F, the value of F is used as-is, and then the value gradually and smoothly morphs into that at F+1 as its playback time approaches.
  • If at the time UL is called, the buffer is empty (hopefully a rare occurrence), simply do not play back anything; let the old values stand.
  • If at that time there is only one frame available, just play it back (retaining it, unless the time has arrived for F+1 playback and it should be removed).
  • The goal is to configure the buffer size and fullness level (i.e. the average latency) so the prior two conditions rarely occur.
  • For the more typical case of 2 or more frames in the buffer, proceed with normal interpolation scheme above.
  • Note that the latency is now only I*(N_fill-1), since the frame F is kept for an additional cycle after the time it first gets “played”.
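
A minimal C++ sketch of the proposed interpolating playback follows, reducing each frame to a single float for clarity (real frames carry the volume plus 16 FFT bins). Names and structure are illustrative only, not an actual implementation, and the single-frame edge case is simplified relative to the bullet above.

    #include <stdint.h>

    constexpr int N_BUF = 6;              // enlarged ring buffer (e.g. target fill 4 of 6)
    struct Frame { float value; };        // stand-in for the full volume/FFT frame

    static Frame    ring[N_BUF];
    static int      head = 0, count = 0;  // index of frame F, and current fill level
    static uint32_t tF   = 0;             // time F first contributed to playback (ms)
    static float    iCor = 20.0f;         // current corrected playback interval I_cor (ms)

    // Called when a new frame arrives off the network.
    void pushFrame(float v, uint32_t now) {
      if (count == N_BUF) {                        // overflow: "play" (drop) the oldest
        head = (head + 1) % N_BUF;
        count--;
        tF = now;
      }
      ring[(head + count) % N_BUF] = Frame{v};
      if (count++ == 0) tF = now;                  // refill after a drought resets the clock
    }

    // Called from the usermod loop (UL) at unpredictable times, as often as ~2ms.
    float interpolatedValue(uint32_t now, float previous) {
      if (count == 0) return previous;             // buffer dry: let the old value stand
      while (count > 1 && now >= tF + (uint32_t)iCor) {
        head = (head + 1) % N_BUF;                 // past t_{F+1}: remove F, advance
        count--;
        tF += (uint32_t)iCor;
      }
      const Frame &f0 = ring[head];
      if (count < 2) return f0.value;              // only F available: play it as-is
      const Frame &f1 = ring[(head + 1) % N_BUF];
      float frac = (float)(now - tF) / iCor;       // 0 at t_F, approaching 1 at t_{F+1}
      if (frac > 1.0f) frac = 1.0f;
      return f0.value + (f1.value - f0.value) * frac;  // val_UL = val_F + (val_{F+1} - val_F) * (t - t_F) / I_cor
    }

In a real implementation, iCor would be refreshed from the guide-factor computation above each time a frame is removed.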
@netmindz

Buffered Interpolating sounds interesting.

If the RTS adds 60ms, would BI basically only add between 20 and 40ms?

I get the feeling that we might not be able to judge what's best until we can actually hear and see the results

For comparison we can test WiFi vs two wired Ethernet devices running unicast as our zero-latency benchmark

@jdtsmith (Author) commented Jun 26, 2023

Those are just the defaults. The interpolating scheme actually doesn’t change the latency compared to RTS. It just keeps the last frame around for one more cycle so as to provide an on-demand updated interpolated value whenever asked for it. You can decide how much insurance you want:

  • small buffer: limited insurance against stutter and packet droughts, low latency;
  • large buffer: near perfect smooth playback, higher latency.

Right now the audioreactive usermod has a default 20ms latency, which is fine for local mic samples which can run at smooth intervals. But it doesn’t account at all for uneven packet delivery from network sources.

For my application I will be able to precisely tune sound and light sync using tunable latency on the sound side.
