I have an ffmpeg command to concatenate 300+ videos of different formats. What is the proper syntax for the concat complex filter?

I plan to concatenate a large number of video files of different formats and resolutions, some without sound, and to add a short black-screen "pause" of about 0.5 s between each.

I wrote a Python script to generate such a command.

I created a 0.5s video file using ffmpeg.exe -t 0.5 -f lavfi -i color=c=black:s=640x480 -c:v libx264 -tune stillimage -pix_fmt yuv420p blank500ms.mp4.

I then added a silent audio track to it with -f lavfi -i anullsrc -c:v copy -c:a aac -shortest.

I now have the problem of adding a blank audio track for inputs without one, but I don't want to generate a new file; I want to do it inside my complex filter.

Here are my generated command and complex filter script.

The command (one argument per line, because I pass it as an argument list to Python's subprocess module):

ffmpeg.exe
-i
input0.mp4
-i
input1.mp4
-i
input2.mp4
-i
input3.mp4
-i
input4.mp4
-i
input5.mp4
-i
input6.mp4
-i
input7.mp4
-i
input8.mp4
-i
input9.mp4
-i
input10.mp4
-f
lavfi
-i
anullsrc
-filter_complex_script
C:/filter_complex_script.txt
-map
"[final_video]"
-map
"[final_audio]"
output.mp4

The filter_complex_script:

[0]fps=24[fps0];
[fps0]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled0];
[1]fps=24[fps1];
[fps1]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled1];
[2]fps=24[fps2];
[fps2]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled2];
[3]fps=24[fps3];
[fps3]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled3];
[4]fps=24[fps4];
[fps4]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled4];
[5]fps=24[fps5];
[fps5]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled5];
[6]fps=24[fps6];
[fps6]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled6];
[7]fps=24[fps7];
[fps7]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled7];
[8]fps=24[fps8];
[fps8]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled8];
[9]fps=24[fps9];
[fps9]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled9];
[10]fps=24[fps10];
[fps10]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2,setsar=1,setpts=PTS-STARTPTS[rescaled10];
[10]split=10[blank0][blank1][blank2][blank3][blank4][blank5][blank6][blank7][blank8][blank9];
[rescaled0:v][0:a][blank0][rescaled1:v][1:a][blank1][rescaled2:v][2:a][blank2][rescaled3:v][3:a][blank3][rescaled4:v][4:a][blank4][rescaled5:v][5:a][blank5][rescaled6:v][11:a][blank6][rescaled7:v][11:a][blank7][rescaled8:v][11:a][blank8][rescaled9:v][11:a][blank9]concat=n=22:v=1:a=1[final_video][final_audio]

As you can see, some videos use [11:a], because it's a silent audio stream (the anullsrc input).

input10.mp4, mapped to [10] and then split (or "cloned") into [blank0] to [blank9], is the short pause separator.

ffmpeg gives me this error:

[Parsed_split_55 @ 000001591c33b280] Media type mismatch between the 'Parsed_split_55' filter output pad 1 (video) and the 'Parsed_concat_56' filter input pad 5 (audio)
[AVFilterGraph @ 000001591bf1e6c0] Cannot create the link split:1 -> concat:5
Error initializing complex filters.
Invalid argument

I'm a bit lost when it comes to using the [X:Y:Z] syntax, and how the order matters in the concat argument list.

I'm open to any other suggestion to solve my problem. I would rather do this in a single command, without intermediate files.

EDIT:

For context: I already wrote a large concat+xstack filter that worked well with 8 GB of memory.

In this case there are a lot of inputs, but they are small (most between 1 and 10 MB), so it would probably not cause out-of-memory problems, although I'm not certain.



Solution 1:[1]

While theoretically doable, I don't recommend calling FFmpeg with so many input files. This will increase the memory footprint of the runtime and will likely bog down the speed (if not throw an out-of-memory error). Instead, my suggestion is to approach this in two steps:

  • Step 1: Transcode each video file so it is encoded exactly the way you want. Do this in a loop and save the results as intermediate files.
  • Step 2: Copy-concat all the intermediate files to form the final output.

The important part here is that all temp files have the exact same stream configuration. Video: codec, framerate (fps), pix_fmt (pfmt), size (w, h), and timebase. Audio: codec, sample_fmt (sfmt), sampling rate (fs), channel layout (layout), and timebase. (I use these "variables" inside curly braces in the command sketches below.)
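
If it helps, here is one way to collect those parameters per file from Python with a single ffprobe call. Untested sketch; the JSON field names are the standard ffprobe stream entries:

import json
import subprocess

def stream_params(path):
    # Probe a file and return (video, audio) stream dicts; audio is None
    # when the file has no audio stream.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        stdout=subprocess.PIPE, check=True,
    ).stdout
    streams = json.loads(out)["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")
    audio = next((s for s in streams if s["codec_type"] == "audio"), None)
    return video, audio

# e.g. video["codec_name"], video["width"], video["height"], video["pix_fmt"],
#      video["r_frame_rate"], video["time_base"]; audio["sample_fmt"],
#      audio["sample_rate"], audio["channel_layout"]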

Step 1 command sketches:

Below I assume that the video & audio configs are identical across the input files except for the size, which you already address in your code. If not, you may need additional filters.

  1. If the video file has both audio & video streams:
ffmpeg -i input.mp4 \
       -f lavfi -i color=c=black:s={w}x{h}:d=0.5:r={fps},format={pfmt} \
       -f lavfi -i aevalsrc=0:n=1:c={layout}:s={fs},aformat={sfmt} \
       -filter_complex "[0:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v]; \
                        [v][0:a][1:v][2:a]concat=n=2:v=1:a=1[vout][aout]" \
       -map [vout] -map [aout] -enc_time_base 0 output.mp4
  2. If the video file only has a video stream:
ffmpeg -i input.mp4 \
       -f lavfi -i color=c=black:s={w}x{h}:d=0.5:r={fps},format={pfmt} \
       -f lavfi -i aevalsrc=0:n=1:c={layout}:s={fs},aformat={sfmt} \
       -filter_complex "[0:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v]; \
                        [v][2:a][1:v][2:a]concat=n=2:v=1:a=1[vout][aout]" \
       -map [vout] -map [aout] -enc_time_base 0 output.mp4

Note that the only difference between 1 & 2 is the second audio input of the concat filter. If the audio is missing, just use the aevalsrc stream for the missing one, as in the sketch below.
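
Since you already generate your commands from Python, that difference can be a one-line switch when assembling the argument list. A minimal untested sketch (step1_args is a hypothetical helper; passing the arguments as a list to subprocess also sidesteps shell quoting):

def step1_args(src, dst, with_audio, w, h, fps, pfmt, layout, fs, sfmt):
    # Segment 1 audio: the file's own track, or the aevalsrc silence (input 2).
    audio_label = "[0:a]" if with_audio else "[2:a]"
    fc = (
        f"[0:v]scale={w}:{h}:force_original_aspect_ratio=decrease,"
        f"pad={w}:{h}:-1:-1,setsar=1[v];"
        f"[v]{audio_label}[1:v][2:a]concat=n=2:v=1:a=1[vout][aout]"
    )
    return [
        "ffmpeg", "-i", src,
        "-f", "lavfi", "-i", f"color=c=black:s={w}x{h}:d=0.5:r={fps},format={pfmt}",
        "-f", "lavfi", "-i", f"aevalsrc=0:n=1:c={layout}:s={fs},aformat={sfmt}",
        "-filter_complex", fc,
        "-map", "[vout]", "-map", "[aout]",
        "-enc_time_base", "0", dst,
    ]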

  3. No 0.5-s padding for the last input video:

With audio:

ffmpeg -i input.mp4 \
       -vf scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1 \
       -enc_time_base 0 output.mp4

Without audio:

ffmpeg -i input.mp4 \
       -f lavfi -i aevalsrc=0:n=1:c={layout}:s={fs},aformat={sfmt} \
       -filter_complex "[0:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v]; \
                        [v][1:a]concat=n=1:v=1:a=1[vout][aout]" \
       -map [vout] -map [aout] -enc_time_base 0 output.mp4
  4. Use ffprobe to identify whether a file has an audio stream (you can also use ffmpeg, but I prefer this approach):
ffprobe -of default=nk=1:nw=1 -select_streams a -show_entries stream input.mp4

In Python, you can run this command with subprocess.run with stdout=sp.PIPE and check the length of the captured stdout bytes (>0: has audio; =0: no audio), for example:
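
import subprocess as sp

def has_audio(path):
    # Non-empty stdout means ffprobe found at least one audio stream.
    result = sp.run(
        ["ffprobe", "-of", "default=nk=1:nw=1",
         "-select_streams", "a", "-show_entries", "stream", path],
        stdout=sp.PIPE, stderr=sp.DEVNULL,
    )
    return len(result.stdout) > 0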

  5. While running the per-input ffmpeg, also compose the ffconcat text file.

The concat demuxer takes a text file as input; it has the following format:

ffconcat version 1.0

file output1.mp4
file output2.mp4
...

where the output#.mp4 are the names of the files you generated in the loop. Build this file in the Step-1 loop and save it in the same directory as the intermediate video files (call it ffconcat.txt).
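
A sketch of that loop, reusing the hypothetical has_audio and step1_args helpers from above (the params values are placeholders you would fill in from your probing):

params = dict(w=480, h=270, fps=24, pfmt="yuv420p",
              layout="stereo", fs=44100, sfmt="fltp")  # example values

with open("ffconcat.txt", "w") as f:
    f.write("ffconcat version 1.0\n")
    for i, src in enumerate(inputs):  # inputs: your list of source files
        dst = f"output{i}.mp4"
        sp.run(step1_args(src, dst, has_audio(src), **params), check=True)
        f.write(f"file {dst}\n")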

Step 2 command sketch

Most of the work is done at this point, and you should be able to obtain the final video by:

ffmpeg -f concat -i ffconcat.txt -c copy final.mp4

Warning: I didn't test these commands, so if you encounter a typo that you cannot figure out, please leave a comment and I'll be happy to correct/clarify.

One-n-done sketch

What's written above can be extended to a single-run (or a partial-combo) approach. Assuming there are 100 files, you can do:

ffmpeg -i input0.mp4 -i input1.mp4 ... -i input99.mp4 \
       -f lavfi -i color=c=black:s={w}x{h}:d=0.5:r={fps},format={pfmt} \
       -f lavfi -i aevalsrc=0:n=1:c={layout}:s={fs},aformat={sfmt} \
       -filter_complex "\
         [0:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v0]; \
         [1:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v1]; \
         ...
         [99:v]scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:-1:-1,setsar=1[v99]; \
         [v0][0:a][100:v][101:a][v1][101:a][100:v][101:a]...[100:v][101:a][v99][99:a]concat=n=199:v=1:a=1[vout][aout]" \
       -map [vout] -map [aout] output.mp4

Here I assumed that the first and last files have audio and the second one does not. Input #100 is the color filter, input #101 the aevalsrc filter. The total number of video-audio segments to concatenate is 199 (100 videos plus 99 0.5-s pauses). The key here is that, unlike filter output labels, the lavfi input streams ([100:v], [101:a]) can be reused as many times as you need.
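
To tie this back to your Python generator, here is an untested sketch that builds the interleaved label list for any number of inputs (concat_labels is a hypothetical helper; n is the number of real input files, so the color and aevalsrc inputs land at indices n and n+1):

def concat_labels(n, audio_flags):
    # audio_flags[i] is True when input i has its own audio stream.
    parts = []
    for i in range(n):
        aud = f"[{i}:a]" if audio_flags[i] else f"[{n + 1}:a]"
        parts.append(f"[v{i}]{aud}")
        if i < n - 1:  # 0.5-s pause between files, none after the last
            parts.append(f"[{n}:v][{n + 1}:a]")
    return "".join(parts) + f"concat=n={2 * n - 1}:v=1:a=1[vout][aout]"

Prepending the per-input scale chains and joining everything with ";" gives the full graph, which you can write to a file for -filter_complex_script as you already do; that also sidesteps Windows command-line length limits.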

Sources

[1] Source: Stack Overflow, licensed under CC BY-SA 3.0, following Stack Overflow's attribution requirements.
