Audacity 3.2.0
MIDIPlay Class Reference

Callbacks that AudioIO uses, to synchronize audio and MIDI playback. More...

Detailed Description

Callbacks that AudioIO uses, to synchronize audio and MIDI playback.

This class manages MIDI playback. It is decoupled from AudioIO by the abstract interface AudioIOExt. Some of its methods execute on the main thread and some on the low-latency PortAudio thread.

MIDI With Audio
When Audio and MIDI play simultaneously, MIDI synchronizes to Audio. This is necessary because the Audio sample clock is not the same hardware as the system time used to schedule MIDI messages. MIDI is synchronized to Audio because it is simple to pause or rush the dispatch of MIDI messages, but generally impossible to pause or rush synchronous audio samples (without distortion).
MIDI output is driven by the low latency thread (PortAudio's callback) that also sends samples to the output device. The relatively low latency to the output device allows Audacity to stop audio output quickly. We want the same behavior for MIDI, but there is not periodic callback from PortMidi (because MIDI is asynchronous).
When Audio is running, MIDI is synchronized to Audio. Globals are set in the Audio callback (audacityAudioCallback) for use by a time function that reports milliseconds to PortMidi. (Details below.)
MIDI Without Audio
When Audio is not running, PortMidi uses its own millisecond timer since there is no audio to synchronize to. (Details below.)
Implementation Notes and Details for MIDI
When opening devices, successAudio and successMidi indicate errors if false, so normally both are true. Use playbackChannels, captureChannels and mMidiPlaybackTracks.empty() to determine if Audio or MIDI is actually in use.
Audio Time
Normally, the current time during playback is given by the variable mTime. mTime normally advances by frames / samplerate each time an audio buffer is output by the audio callback. However, Audacity has a speed control that can perform continuously variable time stretching on audio. This is achieved in two places: the playback "mixer" that generates the samples for output processes the audio according to the speed control. In a separate algorithm, the audio callback updates mTime by (frames / samplerate) * factor, where factor reflects the speed at mTime. This effectively integrates speed to get position. Negative speeds are allowed too, for instance in scrubbing.
The Big Picture
Sample
Time (in seconds, = total_sample_count / sample_rate)
  ^
  |                                             /         /
  |             y=x-mSystemTimeMinusAudioTime /         /
  |                                         /     #   /
  |                                       /         /
  |                                     /   # <- callbacks (#) showing
  |                                   /#        /   lots of timing jitter.
  |       top line is "full buffer" /         /     Some are later,
  |                     condition /         /       indicating buffer is
  |                             /         /         getting low. Plot
  |                           /     #   /           shows sample time
  |                         /    #    /             (based on how many
  |                       /    #    /               samples previously
  |                     /         /                 *written*) vs. real
  |                   / #       /                   time.
  |                 /<------->/ audio latency
  |               /#       v/
  |             /         / bottom line is "empty buffer"
  |           /   #     /      condition = DAC output time =
  |         /         /
  |       /      # <-- rapid callbacks as buffer is filled
  |     /         /
0 +...+---------#---------------------------------------------------->
  0 ^ |         |                                            real time
    | |         first callback time
    | mSystemMinusAudioTime
    |
    Probably the actual real times shown in this graph are very large
    in practice (> 350,000 sec.), so the X "origin" might be when
    the computer was booted or 1970 or something.

To estimate the true DAC time (needed to synchronize MIDI), we need a mapping from track time to DAC time. The estimate is the theoretical time of the full buffer (top diagonal line) + audio latency. To estimate the top diagonal line, we "draw" the line to be at least as high as any sample time corresponding to a callback (#), and we slowly lower the line in case the sample clock is slow or the system clock is fast, preventing the estimated line from drifting too far from the actual callback observations. The line is occasionally "bumped" up by new callback observations, but continuously "lowered" at a very low rate. All adjustment is accomplished by changing mSystemMinusAudioTime, shown here as the X-intercept.
theoreticalFullBufferTime = realTime - mSystemMinusAudioTime
To estimate audio latency, notice that the first callback happens on an empty buffer, but the buffer soon fills up. This will cause a rapid re-estimation of mSystemMinusAudioTime. (The first estimate of mSystemMinusAudioTime will simply be the real time of the first callback time.) By watching these changes, which happen within ms of starting, we can estimate the buffer size and thus audio latency. So, to map from track time to real time, we compute:
DACoutputTime = trackTime + mSystemMinusAudioTime
There are some additional details to avoid counting samples while paused or while waiting for initialization, MIDI latency, etc. Also, in the code, track time is measured with respect to the track origin, so there's an extra term to add (mT0) if you start somewhere in the middle of the track. Finally, when a callback occurs, you might expect there is room in the output buffer for the requested frames, so maybe the "full buffer" sample time should be based not on the first sample of the callback, but the last sample time + 1 sample. I suspect, at least on Linux, that the callback occurs as soon as the last callback completes, so the buffer is really full, and the callback thread is going to block waiting for space in the output buffer.

Midi Time
MIDI is not warped according to the speed control. This might be something that should be changed. (Editorial note: Wouldn't it make more sense to display audio at the correct time and allow users to stretch audio the way they can stretch MIDI?) For now, MIDI plays at 1 second per second, so it requires an unwarped clock. In fact, MIDI time synchronization requires a millisecond clock that does not pause. Note that mTime will stop progress when the Pause button is pressed, even though audio samples (zeros) continue to be output.
Therefore, we define the following interface for MIDI timing:
  • AudioTime() is the time based on all samples written so far, including zeros output during pauses. AudioTime() is based on the start location mT0, not zero.
  • PauseTime() is the amount of time spent paused, based on a count of zero-padding samples output.
  • MidiTime() is an estimate in milliseconds of the current audio output time + 1s. In other words, what audacity track time corresponds to the audio (plus pause insertions) at the DAC output?
AudioTime() and PauseTime() computation
AudioTime() is simply mT0 + mNumFrames / mRate. mNumFrames is incremented in each audio callback. Similarly, PauseTime() is pauseFrames / rate. pauseFrames is also incremented in each audio callback when a pause is in effect or audio output is ready to start.
MidiTime() computation
MidiTime() is computed based on information from PortAudio's callback, which estimates the system time at which the current audio buffer will be output. Consider the (unimplemented) function RealToTrack() that maps real audio write time to track time. If writeTime is the system time for the first sample of the current output buffer, and if we are in the callback, so AudioTime() also refers to the first sample of the buffer, then
RealToTrack(writeTime) = AudioTime() - PauseTime()
We want to know RealToTrack of the current time (when we are not in the callback, so we use this approximation for small d:
RealToTrack(t + d) = RealToTrack(t) + d, or
Letting t = writeTime and d = (systemTime - writeTime), we can substitute to get:
RealToTrack(systemTime) = RealToTrack(writeTime) + systemTime - writeTime
= AudioTime() - PauseTime() + (systemTime - writeTime)
MidiTime() should include pause time, so that it increases smoothly, and audioLatency so that MidiTime() corresponds to the time of audio output rather than audio write times. Also MidiTime() is offset by 1 second to avoid negative time at startup, so add 1:
MidiTime(systemTime) in seconds
= RealToTrack(systemTime) + PauseTime() - audioLatency + 1
= AudioTime() + (systemTime - writeTime) - audioLatency + 1
(Note that audioLatency is called mAudioOutLatency in the code.) When we schedule a MIDI event with track time TT, we need to map TT to a PortMidi timestamp. The PortMidi timestamp is exactly MidiTime(systemTime) in ms units, and
MidiTime(x) = RealToTrack(x) + PauseTime() + 1, so
timestamp = TT + PauseTime() + 1 - midiLatency
Note 1: The timestamp is incremented by the PortMidi stream latency (midiLatency) so we subtract midiLatency here for the timestamp passed to PortMidi.
Note 2: Here, we're setting x to the time at which RealToTrack(x) = TT, so then MidiTime(x) is the desired timestamp. To be completely correct, we should assume that MidiTime(x + d) = MidiTime(x) + d, and consider that we compute MidiTime(systemTime) based on the current* system time, but we really want the MidiTime(x) for some future time corresponding when MidiTime(x) = TT.)
Also, we should assume PortMidi was opened with mMidiLatency, and that MIDI messages become sound with a delay of mSynthLatency. Therefore, the final timestamp calculation is:
timestamp = TT + PauseTime() + 1 - (mMidiLatency + mSynthLatency)
(All units here are seconds; some conversion is needed in the code.)
The difference AudioTime() - PauseTime() is the time "cursor" for MIDI. When the speed control is used, MIDI and Audio will become unsynchronized. In particular, MIDI will not be synchronized with the visual cursor, which moves with scaled time reported in mTime.
Timing in Linux
It seems we cannot get much info from Linux. We can read the time when we get a callback, and we get a variable frame count (it changes from one callback to the next). Returning to the RealToTrack() equations above:
RealToTrack(outputTime) = AudioTime() - PauseTime() - bufferDuration
where outputTime should be PortAudio's estimate for the most recent output buffer, but at least on my Dell Latitude E7450, PortAudio is getting zero from ALSA, so we need to find a proxy for this.
Estimating outputTime (Plan A, assuming double-buffered, fixed-size buffers, please skip to Plan B)
One can expect the audio callback to happen as soon as there is room in the output for another block of samples, so we could just measure system time at the top of the callback. Then we could add the maximum delay buffered in the system. E.g. if there is simple double buffering and the callback is computing one of the buffers, the callback happens just as one of the buffers empties, meaning the other buffer is full, so we have exactly one buffer delay before the next computed sample is output.

If computation falls behind a bit, the callback will be later, so the delay to play the next computed sample will be less. I think a reasonable way to estimate the actual output time is to assume that the computer is mostly keeping up and that most callbacks will occur immediately when there is space. Note that the most likely reason for the high-priority audio thread to fall behind is the callback itself, but the start of the callback should be pretty consistently keeping up.

Also, we do not have to have a perfect estimate of the time. Suppose we estimate a linear mapping from sample count to system time by saying that the sample count maps to the system time at the most recent callback, and set the slope to 1% slower than real time (as if the sample clock is slow). Now, at each callback, if the callback seems to occur earlier than expected, we can adjust the mapping to be earlier. The earlier the callback, the more accurate it must be. On the other hand, if the callback is later than predicted, it must be a delayed callback (or else the sample clock is more than 1% slow, which is really a hardware problem.) How bad can this be? Assuming callbacks every 30ms (this seems to be what I'm observing in a default setup), you'll be a maximum of 1ms off even if 2 out of 3 callbacks are late. This is pretty reasonable given that PortMIDI clock precision is 1ms. If buffers are larger and callback timing is more erratic, errors will be larger, but even a few ms error is probably OK.

Estimating outputTime (Plan B, variable framesPerBuffer in callback, please skip to Plan C)
ALSA is complicated because we get varying values of framesPerBuffer from callback to callback. Assume you get more frames when the callback is later (because there is more accumulated input to deliver and more more accumulated room in the output buffers). So take the current time and subtract the duration of the frame count in the current callback. This should be a time position that is relatively jitter free (because we estimated the lateness by frame count and subtracted that out). This time position intuitively represents the current ADC time, or if no input, the time of the tail of the output buffer. If we wanted DAC time, we'd have to add the total output buffer duration, which should be reported by PortAudio. (If PortAudio is wrong, we'll be systematically shifted in time by the error.)

Since there is still bound to be jitter, we can smooth these estimates. First, we will assume a linear mapping from system time to audio time with slope = 1, so really it's just the offset we need.

To improve the estimate, we get a new offset every callback, so we can create a "smooth" offset by using a simple regression model (also this could be seen as a first order filter). The following formula updates smooth_offset with a new offset estimate in the callback: smooth_offset = smooth_offset * 0.9 + new_offset_estimate * 0.1 Since this is smooth, we'll have to be careful to give it a good initial value to avoid a long convergence.

Estimating outputTime (Plan C)
ALSA is complicated because we get varying values of framesPerBuffer from callback to callback. It seems there is a lot of variation in callback times and buffer space. One solution would be to go to fixed size double buffer, but Audacity seems to work better as is, so Plan C is to rely on one invariant which is that the output buffer cannot overflow, so there's a limit to how far ahead of the DAC time we can be writing samples into the buffer. Therefore, we'll assume that the audio clock runs slow by about 0.2% and we'll assume we're computing at that rate. If the actual output position is ever ahead of the computed position, we'll increase the computed position to the actual position. Thus whenever the buffer is less than near full, we'll stay ahead of DAC time, falling back at a rate of about 0.2% until eventually there's another near-full buffer callback that will push the time back ahead.
Interaction between MIDI, Audio, and Pause
When Pause is used, PauseTime() will increase at the same rate as AudioTime(), and no more events will be output. Because of the time advance of mAudioOutputLatency + latency and the fact that AudioTime() advances stepwise by mAudioBufferDuration, some extra MIDI might be output, but the same is true of audio: something like mAudioOutputLatency audio samples will be in the output buffer (with up to mAudioBufferDuration additional samples, depending on when the Pause takes effect). When playback is resumed, there will be a slight delay corresponding to the extra data previously sent. Again, the same is true of audio. Audio and MIDI will not pause and resume at exactly the same times, but their pause and resume times will be within the low tens of milliseconds, and the streams will be synchronized in any case. I.e. if audio pauses 10ms earlier than MIDI, it will resume 10ms earlier as well.
PortMidi Latency Parameter
PortMidi has a "latency" parameter that is added to all timestamps. This value must be greater than zero to enable timestamp-based timing, but serves no other function, so we will set it to 1. All timestamps must then be adjusted down by 1 before messages are sent. This adjustment is on top of all the calculations described above. It just seem too complicated to describe everything in complete detail in one place.
Midi with a time track
When a variable-speed time track is present, MIDI events are output with the times used by the time track (rather than the raw times). This ensures MIDI is synchronized with audio.
Midi While Recording Only or Without Audio Playback
To reduce duplicate code and to ensure recording is synchronised with MIDI, a portaudio stream will always be used, even when there is no actual audio output. For recording, this ensures that the recorded audio will by synchronized with the MIDI (otherwise, it gets out-of- sync if played back with correct timing).
NoteTrack PlayLooped
When mPlayLooped is true, output is supposed to loop from mT0 to mT1. For NoteTracks, we interpret this to mean that any note-on or control change in the range mT0 <= t < mT1 is sent (notes that start before mT0 are not played even if they extend beyond mT0). Then, all notes are turned off. Events in the range mT0 <= t < mT1 are then repeated, offset by (mT1 - mT0), etc. We do NOT go back to the beginning and play all control changes (update events) up to mT0, nor do we "undo" any state changes between mT0 and mT1.
NoteTrack PlayLooped Implementation
The mIterator object (an Alg_iterator) returns NULL when there are no more events scheduled before mT1. At mT1, we want to output all notes off messages, but the FillOtherBuffers() loop will exit if mNextEvent is NULL, so we create a "fake" mNextEvent for this special "event" of sending all notes off. After that, we destroy the iterator and use PrepareMidiIterator() to set up a NEW one. At each iteration, time must advance by (mT1 - mT0), so the accumulated complete loop time (in "unwarped," track time) is computed by MidiLoopOffset().

The documentation for this class was generated from the following file: