Figuring out: Audio Pull up/down
When working with video, an audio pull up or pull down is needed when there’s been a change in the picture’s frame rate and you need to tweak the audio to make sure it stays in sync.
This subject always seems to be surrounded by a layer of mysticism and confusion, so this is my attempt at going through the basics and hopefully getting some clarity.
Source: https://bit.ly/2V1CfwX
Audio Sampling Rate
First, we need to understand some basic digital audio concepts. Feel free to skip this if you have it fresh.
Whenever we convert an audio signal from analogue to digital, all we are doing is checking where the waveform is at certain “points” in its oscillation. These “points” are usually called samples.
In order to get a faithful signal, we need to sample our waveforms many times. The number of times we do this per second is the sampling rate, which is measured in hertz (Hz).
Keep in mind that if our sampling rate is not fast enough, we won’t be able to “capture” the higher frequencies since these would fluctuate faster than we can measure. So how fast do we need to be for accurate results?
The Nyquist-Shannon sampling theorem gives us the answer. It basically says that the sampling rate needs to be at least twice the highest frequency we want to capture. Since the highest frequency humans can hear is around 20 kHz, a sampling rate of a little over 40 kHz should suffice. With this in mind, let’s see the most commonly used sampling rates:
| Sampling Rate | Use | 
|---|---|
| 8 kHz | Telephones, walkie-talkies | 
| 22 kHz | Low quality digital audio | 
| 44.1 kHz | CD quality, the music standard. | 
| 48 kHz | The standard for professional video. | 
| 96 kHz | DVD & Blu-ray audio | 
| 192 kHz | DVD & Blu-ray audio. This is usually the highest sampling rate for professional use. | 
As you can see, most professional formats use a sampling rate higher than 40 kHz to guarantee that we capture the full frequency spectrum. Something important to remember, which will become relevant later on, is that a piece of audio will always be the same length as long as it is played back at the same sample rate it was recorded at.
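To make that last point concrete, here is a minimal sketch of the relationship between sample count, playback rate and duration. The function name is just something I made up for illustration.

```python
def duration_seconds(num_samples: int, playback_rate_hz: float) -> float:
    """Length of a clip in seconds when its samples are played at the given rate."""
    return num_samples / playback_rate_hz


# Ten seconds of audio recorded at 48 kHz holds 480,000 samples per channel.
samples = 10 * 48_000

print(duration_seconds(samples, 48_000))  # played at the recording rate: 10.0
print(duration_seconds(samples, 50_000))  # played faster, so the clip is shorter: 9.6
```

The samples themselves never change. Only the rate at which we read them does, and that alone decides how long the clip lasts.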
For the sake of completeness, I just want to briefly mention audio resolution (or bit depth). This is the other parameter we need to take into consideration when converting to digital audio. It measures how many bits we use to encode the information of each of our samples. Higher values give us more dynamic range, since a bigger range of intensity values can be captured. This doesn’t really affect the pull up/down process.
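As a rough guide to how bit depth relates to dynamic range, the usual rule of thumb for ideal linear PCM is about 6 dB per bit. This sketch assumes that idealised case and ignores dither and converter noise, so treat the numbers as theoretical ceilings.

```python
import math


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of ideal linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (16, 24):
    print(f"{bits}-bit: about {dynamic_range_db(bits):.0f} dB")
# 16-bit: about 96 dB
# 24-bit: about 144 dB
```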
Frames per second in video
Let’s now jump to the realm of video. There’s a lot to be said on the subject of frame rate but I will keep it short. This value is simply how many pictures per second are put together to create our film or video. 24 frames per second (or just fps) is the standard for cinema, while TV uses 25 fps in Europe (PAL) and 29.97 fps in the US (NTSC).
Keep in mind that these frame rates are different not only on a technical level but also on a stylistic level. 24 fps “feels” cinematic and “premium” while sometimes the higher frame rates used in TV feel “cheap”. This is probably a cultural perception and is definitely changing. Videogames, which often use high frame rates like 60 fps and beyond, are partially responsible for this shift in taste. The amount of motion is also very important: higher frame rates are better at showing fast motion.
But how can these different frame rates affect audio sync? The problem usually starts when a project is filmed at a certain rate and then converted to a different one for distribution. This would happen if, for example, a movie (24 fps) is brought to European TV (25 fps) or an American TV programme (29.97 fps) is brought to India, which uses PAL (25 fps).
Let’s see how this kind of conversion is done.
Sampling Rate vs Frame Rate
Some people think that audio can be set to be recorded at a certain frame rate the same way it can be set to be recorded at a certain sampling frequency. This is not true. Audio doesn’t intrinsically have a frame rate value the same way it has a bit depth and sampling rate.
If I give you an audio file and nothing else, you could easily figure out the bit depth and sampling rate but you would have no idea about the frame rate used on the associated video. Now, and here comes the nuanced but important point, any audio recorded at the same time as video will sync with the specific frame rate used when recording that video. They will sync because they were recorded together. They will sync because what the camera registered as a second of video was also a second of audio in the sound recorder. Of course, machines are not perfect and their clocks may measure a second slightly differently, and that’s why we connect them via timecode, but that’s another story.
This session is set at 24 fps, so each second is divided into 24 frames.
Maybe this confusion comes from the fact that when you create a new session or project in your DAW, you basically set three things: sampling rate, bit depth and frame rate. So it feels like the audio that is going to be inside will have those three intrinsic values. But that is not the case with frame rate. In the context of the session, frame rate is only telling your DAW how to divide a second. Into 24 slices? That would be 24 fps. Into 60 slices? That’s 60 fps.
In this manner, when you bring your video into your DAW, the video’s burnt-in timecode and your DAW’s timecode will be perfectly in sync, but none of this changes anything about the duration or quality of the audio within the session.
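Here is a small sketch of that “how to divide a second” idea. It takes a position expressed in samples and labels it with a timecode at different session frame rates. It assumes whole-number, non-drop-frame rates only, and the function name is hypothetical.

```python
def sample_to_timecode(sample_index: int, sample_rate_hz: int, fps: int) -> str:
    """Label a sample position as HH:MM:SS:FF for a whole-number, non-drop frame rate."""
    total_frames = sample_index * fps // sample_rate_hz  # frame being shown at that sample
    frames = total_frames % fps
    whole_seconds = total_frames // fps
    hours, remainder = divmod(whole_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


position = 178_845_000  # a fixed spot in a 48 kHz recording, a bit over an hour in

for fps in (24, 25, 60):
    print(fps, sample_to_timecode(position, 48_000, fps))
# 24 01:02:05:22
# 25 01:02:05:23
# 60 01:02:05:56
```

The hours, minutes and seconds are identical in all three cases. Only the frame count, the way the last second is sliced, changes with the session frame rate.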
So, in summary, an audio file only has an associated frame rate in the context of the video it was recorded with or to, but this is not an intrinsic characteristic of the audio file and cannot be determined without the corresponding video.
Changing Frame Rate
A frame rate change is usually needed when the medium (cinema, TV, digital…) or the region changes. There are two basic ways of doing this. The first does it without changing the final duration of the film, usually by redistributing, duplicating or deleting frames to accommodate the new frame rate. I won’t go into detail on these methods, partly because they are quite complex but mostly because if the length of the final picture doesn’t change, we don’t need to do anything to the audio. It will be in sync anyway.
Think about this for a second. We have changed the frame rate of the video but, as long as the final length is the same, our audio is still in sync, which shows that audio has no intrinsic frame rate value. Disclaimer: This will be true as long as the audio and film are kept separate. If audio and picture are on the same celluloid and you start moving frames around, obviously you are going to mess up the audio, but in our current digital age we don’t need to worry about this.
The second method is the one that concerns us: the length of the picture is actually changed. This is done because it is the easiest way to fix the frame rate difference, especially if the difference is not very big.
Telecine. How video frame rate affects audio.
Let’s use the telecine case as an example. Telecine is the process of transferring old-fashioned analogue film to video. It is not always the case, but this usually also implies a change in frame rate. As we saw earlier, films are traditionally shot at 24 fps. If we want to broadcast a film on European television, which uses the PAL system at 25 fps, we need to go from 24 to 25 fps.
The easiest way to do this is to just play the original film about 4% faster. The pictures will move faster and the movie will finish earlier, but the difference is tolerable. Also, if you can show the same movie in less time on TV, that gives you more time for commercials, so win-win.
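To put numbers on “finishes earlier”, here is a quick sketch of the arithmetic. The two-hour runtime is just an example figure I picked.

```python
def converted_runtime_minutes(runtime_minutes: float, source_fps: float, target_fps: float) -> float:
    """Runtime after playing every original frame at the target frame rate."""
    return runtime_minutes * source_fps / target_fps


film_runtime = 120.0  # example: a two-hour film shot at 24 fps
pal_runtime = converted_runtime_minutes(film_runtime, 24, 25)

print(f"{film_runtime:.1f} min at 24 fps -> {pal_runtime:.1f} min at 25 fps")
print(f"speed-up: {(25 / 24 - 1) * 100:.2f}%")
# 120.0 min at 24 fps -> 115.2 min at 25 fps
# speed-up: 4.17%
```

So the “4%” is really 25/24, a bit under 4.2%, and it shaves almost five minutes off a two-hour film.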
What are the drawbacks? First, showing the pictures 4% faster may be tolerable but is not ideal and can be noticeable in quick action sequences. Second and more importantly, our audio will now be out of sync. We can always fix this by also playing the audio 4% faster (and this would traditionally be the case since audio and picture were embedded in the same film) but then the pitch will rise by about 0.7 semitones.
In the digital realm, we can achieve this by simply playing the audio at a different rate than it was recorded at. This is the digital equivalent of just cranking the projector faster. Remember when I said that an audio file will always be the same length if it is played at the same sample rate it was recorded at? This is where that becomes relevant. As you can see below, if we play a 48 kHz file at 50 kHz, we get the same speed-up that a change from 24 to 25 fps provides.
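The playback rate that matches a given frame rate change is a simple proportion. This sketch uses exact fractions so nothing gets lost to rounding. The function name is mine, and NTSC film rate is written as 24000/1001, the exact value behind the rounded 23.976.

```python
from fractions import Fraction


def playback_rate_hz(recorded_rate_hz: int, source_fps: Fraction, target_fps: Fraction) -> Fraction:
    """Sample rate to play the file at so it speeds up or slows down with the picture."""
    return recorded_rate_hz * target_fps / source_fps


film = Fraction(24)
pal = Fraction(25)
ntsc_film = Fraction(24000, 1001)  # usually quoted as 23.976

print(playback_rate_hz(48_000, film, pal))                     # 50000
print(round(float(playback_rate_hz(48_000, film, ntsc_film))))  # 47952
```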
This would solve our sync problems but, as we were saying, it would raise the final pitch of the audio by about 0.7 semitones.
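If you want to check the pitch figure yourself, the shift in semitones is 12 times the base-2 logarithm of the speed ratio. A quick sketch, showing why you will see slightly different numbers quoted depending on whether the exact 25/24 ratio or a rounded 4% is used:

```python
import math


def semitones(speed_ratio: float) -> float:
    """Pitch shift, in equal-tempered semitones, caused by a playback speed ratio."""
    return 12 * math.log2(speed_ratio)


print(round(semitones(25 / 24), 2))  # exact film-to-PAL ratio: 0.71
print(round(semitones(1.04), 2))     # rounded 4% speed-up: 0.68
```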
That increase in pitch may sound small but it can be quite noticeable, especially in dialogue and musical sections. So how do we solve this? For many years the simple answer was: we don’t. Just leave it as it is. But nowadays we are able to re-pitch the resulting audio so it matches its original sound or, alternatively, we can directly change the length of the audio file without affecting the pitch. More on these methods later, but first let’s see what happens if, instead of doing a reasonable jump from film to PAL, we need to go from film to NTSC.
Bigger frame rate jumps, bigger problems (but not for us).
If a jump from 24 to 25 is a 4% change, a jump from 24 to 29.97 would be a whopping 24.9%. That’s way too much and it would be very noticeable. Let’s not even think about the audio, everybody would sound like a chipmunk. So how is this accomplished? The method used is what is called a “2:3 pulldown”.
Now, this method is quite involved so I’m not going to explain the whole thing here, but let’s see the basics and how it affects our audio. Let’s start with 30 fps, as this was the original frame rate for TV in NTSC. This makes sense because the electrical grid runs at 60 Hz in the States. But, these being people who for some reason are happy living this way, things were bound to get messy: after color TV was introduced, and for reasons you can see here, the frame rate had to be lowered by about 1/1000th to 29.97.
A 2:3 pulldown uses the proportion of frames and the interlaced nature of the resulting video to make 4 film frames fit into 5 video frames. This works because a 24/30 proportion is equal to a 4/5 proportion. Again, this is complex and goes beyond the scope of this article, but if you want more details this video can help.
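For the curious, the frame cadence itself can be sketched in a few lines. This is a simplified illustration: it only shows which film frame feeds each field of the video frames and ignores field order (top/bottom) and everything else a real telecine deals with.

```python
from itertools import cycle


def pulldown_2_3(film_frames):
    """Spread film frames over interlaced video frames using a 2, 3, 2, 3... field cadence."""
    fields = []
    for frame, repeats in zip(film_frames, cycle((2, 3))):
        fields.extend([frame] * repeats)  # each film frame is held for 2 or 3 fields
    # Every video frame is made of two consecutive fields; a leftover odd field is dropped.
    return [tuple(fields[i:i + 2]) for i in range(0, len(fields) - 1, 2)]


print(pulldown_2_3(["A", "B", "C", "D"]))
# [('A', 'A'), ('B', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'D')]
```

Four film frames go in, five video frames come out, and nothing is sped up or slowed down in the process.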
But wait, we don’t want to end up with 30 frames, we need 29.97, and this is why the first step is to slow the film down from 24 fps to 23.976. This difference is impossible to detect but crucial to make our calculations work. Once this is done, we can do the actual pulldown, which doesn’t change the length of the film any further; it only rearranges the frames.
What does this all mean for us audio people? It means that we only need to worry about that initial change from 24 to 23.976, which is just a 0.1% change. That’s small but it will still throw your audio out of sync over the length of a movie. So we just need to adjust the speed in the same way we do for the 4% change. If you look again at the picture above, you’ll see that 0.1% is the change we need to use to go from film to NTSC.
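To see why such a tiny change still matters, here is a sketch of how far the picture drifts from untouched audio over a feature-length runtime. The two-hour length is just an example.

```python
def drift_seconds(runtime_seconds: float, source_fps: float, target_fps: float) -> float:
    """How much longer (+) or shorter (-) the picture runs after the speed change."""
    return runtime_seconds * (source_fps / target_fps - 1)


two_hours = 2 * 60 * 60
ntsc_film = 24000 / 1001  # the exact value behind "23.976"

drift = drift_seconds(two_hours, 24, ntsc_film)
print(f"{drift:.1f} s, or about {drift * ntsc_film:.0f} frames")
# 7.2 s, or about 173 frames
```

Seven seconds out by the end credits is way beyond what anyone would accept for lip sync, so the audio has to be slowed down by the same 0.1%.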
As for the change in pitch, it will be very small but we can still correct it if we need to with the methods shown below. But before that, here is a table for your convenience with all the usual frame rate changes and the associated audio change that would be needed.
| Frame Rate Change | Audio Speed Change | Pitch Correction (If needed) | 
|---|---|---|
| Film to PAL | 4.2% Up | 4% Down // 96% // -0.71 Semitones | 
| Film to NTSC | 0.1% Down | 0.1% Up // 100.1% // +0.02 Semitones | 
| PAL to Film | 4% Down | 4.2% Up // 104.2% // +0.71 Semitones | 
| PAL to NTSC | 4.1% Down | 4.3% Up // 104.3% // +0.72 Semitones | 
| NTSC to Film | 0.1% Up | 0.1% Down // 99.9% // -0.02 Semitones | 
| NTSC to PAL | 4.3% Up | 4.1% Down // 95.9% // -0.72 Semitones | 
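If you ever need a conversion that is not in the table, or want more decimals, the values can be derived from the frame rates alone. This is a sketch under my own assumptions: “NTSC” here means the 24000/1001 (23.976) rate the picture runs at before the 2:3 pulldown, and the names are made up. Since it uses exact ratios, its output may differ slightly from rounded figures.

```python
import math
from fractions import Fraction

RATES = {"Film": Fraction(24), "PAL": Fraction(25), "NTSC": Fraction(24000, 1001)}


def conversion_row(source: str, target: str) -> tuple:
    """Return (audio speed change %, pitch correction %, pitch correction in semitones)."""
    speed = RATES[target] / RATES[source]   # how much faster the picture now plays
    correction = 1 / speed                  # pitch ratio that undoes the speed change
    speed_change_pct = (float(speed) - 1) * 100
    correction_pct = float(correction) * 100
    correction_semitones = 12 * math.log2(float(correction))
    return speed_change_pct, correction_pct, correction_semitones


for source in RATES:
    for target in RATES:
        if source != target:
            speed_pct, corr_pct, semis = conversion_row(source, target)
            print(f"{source} to {target}: speed {speed_pct:+.2f}%, "
                  f"correction {corr_pct:.2f}% ({semis:+.2f} semitones)")
```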
Techniques & Plugins
There are two basic methods to do a pull up or pull down. The first involves two steps: first changing the duration of the file while also affecting its pitch (using a different sample rate, as explained before), and then applying pitch correction to match the original’s tone. How you actually do the first step depends on your DAW, but in Pro Tools, for example, you’ll see that when importing audio you have the option to apply SRC (Sample Rate Conversion) to the file, as pictured above.
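Outside of a DAW, the “play it at a different sample rate” part of the first step can be sketched with nothing but Python’s standard `wave` module. To be clear, this is not what Pro Tools does internally, just an illustration of the idea: the samples are copied untouched and only the sample rate declared in the header changes. A real delivery would still need converting back to the project rate, plus the pitch correction. The function name is hypothetical.

```python
import io
import wave


def retag_sample_rate(source, destination, new_rate_hz: int) -> None:
    """Copy a WAV unchanged except for the sample rate declared in its header."""
    with wave.open(source, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
    with wave.open(destination, "wb") as writer:
        writer.setparams(params)
        writer.setframerate(new_rate_hz)  # same samples, new playback speed
        writer.writeframes(frames)


# Demo with an in-memory file: one second of 16-bit mono silence tagged as 48 kHz.
original = io.BytesIO()
with wave.open(original, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(48_000)
    w.writeframes(b"\x00\x00" * 48_000)
original.seek(0)

sped_up = io.BytesIO()
retag_sample_rate(original, sped_up, 50_000)
sped_up.seek(0)

with wave.open(sped_up, "rb") as r:
    print(r.getnframes() / r.getframerate())  # 0.96 seconds instead of 1.0
```

The same function works with file paths instead of in-memory buffers, since `wave.open` accepts both.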
The second method is simply doing it all at once with a plugin capable of changing the length of an audio file without affecting its pitch.
Also, keep in mind that these techniques can be applied to not only the stereo or the surround final mix file but also the whole session itself, which would give you much more flexibility to adjust your mix on this new version. This makes sense because a 4% change in speed could be enough to put two short sounds too close together and/or the feel of the mix could be a bit different. Personally, I have only used this “whole session” technique with shorter material like commercials. Here is a nice blog post that goes into detail about how to accomplish this.
As for changing a mixed file as a whole, whether you use a one-step or two-step method, you will probably find that it is easy to introduce glitches, clicks and pops into the mix. Sometimes you get dialogue that sounds metallic. Phase is also an issue, since the time/pitch processing is not always consistent between channels.
The thing is, time/pitch shifting is not an easy thing to accomplish. Some plugins offer different algorithms to choose from depending on the type of material you have. These are designed with music in mind, not dialogue, so “Polyphonic” is usually the best option for whole mixes. Another trick you can use is to bounce your mix into stems (music, dialogue, FX, ambiences, etc.) and then apply the shift to each of them independently, using the best plugin and algorithm for each. This can be very time consuming but will probably give you the best results.
As you can see, this whole process is kind of tricky, particularly the pitch shift step, and this is why on some occasions the audio is corrected for sync but left at the wrong pitch. Nevertheless, nowadays we have better shifting plugins to do the job. Here are some of the most commonly used, although remember that none of these works perfectly on every occasion:
- Zplane Elastique: This is in my opinion the best plugin and the one I personally use. It produces the fewest artefacts, keeps phase coherent and works great on whole mixes, even with single-step processing.
- Pro Tools Pitch Shift: This is the stock time/pitch plugin that comes with Pro Tools. It is quite fast but is prone to creating artefacts.
- Pro Tools X-Form: This one is more advanced (it comes bundled with Pro Tools Ultimate) but it still suffers from some issues like giving dialogue a metallic tone or messing up the phase on stereo and surround. Also, it is slow. Veeeery slow.
- Serato Pitch n Time: I haven’t tried this one but I had to mention it since it is very commonly used and people swear by it.
- iZotope Time & Pitch: It can work well sometimes and offers many customizable settings that you can adjust to avoid artefacts.
- Waves Sound Shifter: I haven’t used it either but it’s another option that seems to work well for some applications.
Which one should you choose? There is no clear answer, you will need to experiment with some of them to see what works for each project. Here is a good article and video comparing some of them.
Conclusions
I hope you now have a somewhat better understanding of this messy subject. It is tricky on both a theoretical and a practical level, but I believe it is worth figuring out where things come from instead of just doing what others do without really knowing why. Here are some takeaways:
- Sampling rate and bit depth are intrinsic to an audio file. 
- At the same time, an audio file can be associated to a certain video frame rate when they are both in sync. 
- The frame rate change process is different depending on the magnitude of the change. 
- An audio pull up or pull down is needed when there is a frame rate change on the picture that affects its length. 
- The pull up/down can be done in two steps (length change first, then pitch correction) or it can be done in a single step. 
- Time/Pitch Shift is a complicated process that can produce artefacts, metallic timbres and phase issues. 
- Mixes can be processed by stems or even as whole sessions for more flexibility. 
- Try different plugins and algorithms to improve results. 
Thanks for reading!

 
                    