We recently released a project for a client where we integrated WebRTC video chat. The goal was to make an app for both Android and iOS that could cross connect to run a simple multiplayer game with a live video stream. The details of getting WebRTC up and running on both platforms are for another post, but here I'm going to focus on one specific client request for this project: recording video.
For reference, a lot of the original research and experimentation was carried out with Pierre Chabardes' AndroidRTC project and Gregg Ganley's implementation of Google's AppRTC demo. We used the most recent versions of libjingle_peerconnection at the time of development (Android (Maven) 11139, iOS (CocoaPods) 11177.2.0), which are not actually the most recent WebRTC sources.
Originally, we discussed going the whole nine yards: keeping a running buffer of maybe the last minute or so of video, sound included, that we could save off to an H.264-encoded MP4 or some such. The problem is that WebRTC delivers video and audio as two separate streams, and the Android and iOS SDKs don't easily expose the audio stream.
For the sake of development time, we decided to restrict our video recording to simple animated GIFs. Even though this was a vast simplification, it still proved to be a large development headache, especially on Android. On iOS, at least, StackOverflow has some pretty straightforward answers, like this one from rob mayoff. It was just a matter of getting things threaded and then we were off and running.
Actually, before I get to the GIF encoding, let me take a step back. Where are the frames we're going to use coming from? On both platforms, WebRTC has a public interface that feeds a series of custom I420Frame objects from the backend streaming to the frontend rendering. The I420Frames are really just YUV images. Documentation is light, but we were able to dig through the WebRTC source, at least. For Android, we have the VideoRenderer, which contains both the I420Frame class definition and the VideoRenderer.Callbacks interface, which is what actually gets handed a frame. On the iOS side, we have the RTCVideoRenderer, which has a renderFrame method that can be overridden to get at the I420Frame (in this case called RTCVideoFrame). More practically, the UIView you would actually use is an RTCEAGLVideoView, which you can subclass to grab the frame when renderFrame is called.
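For concreteness, here's the I420 layout in numbers: one full-resolution luma (Y) plane plus two chroma (U, V) planes subsampled 2x in each direction. The helper below is ours, not part of the SDK; it just makes the plane arithmetic explicit.

```java
// I420 memory layout: a full-resolution Y (luma) plane followed by U and V
// (chroma) planes, each subsampled by 2 horizontally and vertically.
class I420Layout {
    // Returns { ySize, uSize, vSize } in bytes for a frame of even dimensions.
    static int[] planeSizes(int width, int height) {
        int ySize = width * height;
        int chromaSize = (width / 2) * (height / 2); // U and V are quarter-size
        return new int[] { ySize, chromaSize, chromaSize };
    }
}
```

So a 480x480 frame carries 230,400 bytes of luma but only 57,600 bytes for each chroma plane — half the total bytes of an equivalent RGB888 image.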
Android is, again, trickier. When you receive a new remote video stream from WebRTC, you need a VideoRenderer.Callbacks implementation wrapped in a VideoRenderer object that you apply to the stream. The Android SDK provides a helper class (org.webrtc.VideoRendererGui) with static methods to create VideoRenderer.Callbacks implementations that can draw to a GLSurfaceView. However, that helper doesn't lend itself to subclassing the way the iOS classes do. Fortunately, you can add multiple renderers to a video stream. So we created our own implementation of VideoRenderer.Callbacks, wrapped it in a VideoRenderer, and added and removed it from the remote video stream as needed. Now renderFrame would be called on it, and we had access to the I420Frame. NOTE: We discovered we had to call VideoRenderer.renderFrameDone() at the end of renderFrame to clean things up. The WebRTC SDK creates a separate I420Frame object for each video renderer, and each renderer is responsible for its own cleanup; skip that call and you'll end up with a mysterious memory leak.
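To make the setup concrete, here's a minimal sketch of that extra, non-drawing renderer. The I420Frame and VideoRenderer classes below are stand-ins so the sketch compiles on its own; in the real app those types come from the libjingle_peerconnection SDK and the frames carry actual YUV planes.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the org.webrtc types, only so this sketch is self-contained.
class I420Frame {
    final int width, height;
    I420Frame(int width, int height) { this.width = width; this.height = height; }
}

class VideoRenderer {
    interface Callbacks {
        void renderFrame(I420Frame frame);
    }
    // In the real SDK this releases the renderer's private copy of the frame.
    static void renderFrameDone(I420Frame frame) { }
}

// Our extra renderer: it never draws, it only snapshots frames for the recorder.
class RecordingRenderer implements VideoRenderer.Callbacks {
    final List<I420Frame> captured = new ArrayList<>();

    @Override
    public void renderFrame(I420Frame frame) {
        // The real code deep-copies the YUV planes here, since the frame's
        // buffers are no longer ours once renderFrameDone() releases them.
        captured.add(frame);
        // Without this call, each renderer's private I420Frame copy leaks.
        VideoRenderer.renderFrameDone(frame);
    }
}
```

The real implementation also gates capture on a recording flag, since the renderer stays attached only while we want frames.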
So all of that is done, and now we're getting I420Frame objects as they're sent over the remote video stream, which we can copy to a local streaming buffer, data store, or whatever you like for later. But again, these are YUV images, not typical RGB, which means they need to be converted before they can be encoded with any sort of standard GIF library. On iOS, this is comparatively easy. Google's libyuv converter lives inside the WebRTC library, and we can just use it. We grabbed the header files, and then we could call the various functions to copy frames (libyuv::I420Copy) and convert to RGB (libyuv::I420ToABGR). Note the swapped order of ABGR: iOS image generation expects RGBA, but empirical testing showed the byte order was reversed, and converting with ABGR on the WebRTC side resulted in correctly ordered bytes when fed to iOS libraries. StackOverflow again has answers for getting a usable UIImage out of a byte array, such as this one by Ilanchezhian and Jhaliya.
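That "local streaming buffer" can be as simple as a bounded deque that drops the oldest frame once full — 90 slots for our eventual 6-second, 15fps GIF. A sketch, with our own class name and plain byte[] payloads standing in for copied YUV planes:

```java
import java.util.ArrayDeque;

// A bounded buffer holding only the most recent frames, e.g. the last
// 90 frames for a 6-second GIF at 15fps. Payloads here are plain byte[]
// copies of the YUV planes; the class name is ours, not WebRTC's.
class FrameRingBuffer {
    private final ArrayDeque<byte[]> frames = new ArrayDeque<>();
    private final int capacity;

    FrameRingBuffer(int capacity) { this.capacity = capacity; }

    // Called from the renderer thread, so keep access synchronized.
    synchronized void push(byte[] frame) {
        if (frames.size() == capacity) {
            frames.removeFirst(); // drop the oldest frame
        }
        frames.addLast(frame);
    }

    synchronized int size() { return frames.size(); }
}
```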
As is a running theme here, Android was not so easy. Technically, it has the same YUV converter buried in the native library, but we're operating in Java, and things are not easily exposed at that level. It turned out to be way easier to write a YUV converter class than try to get at the internal conversion utility. Starting from this StackOverflow answer by rics, we created YuvFrame.java, which we've posted here. (Edit 2/2020: when we upgraded our project to use Google's WebRTC library, we had to make a different YuvFrame.java that's compatible with the library. Also, here's an Objective-C version, I420Frame.)
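For reference, the core of such a converter boils down to a nested loop applying the standard BT.601 integer approximation. This is a simplified sketch of the idea, not our actual YuvFrame.java: it assumes each plane's stride equals the frame width, whereas real WebRTC frames carry explicit per-plane strides.

```java
// Pure-Java I420 -> ARGB conversion using the common BT.601 integer
// approximation. Strides are assumed equal to the width for brevity.
class Yuv {
    static int[] i420ToArgb(byte[] y, byte[] u, byte[] v, int width, int height) {
        int[] argb = new int[width * height];
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                int yVal = (y[row * width + col] & 0xFF) - 16;
                // Each 2x2 block of luma pixels shares one U and one V sample.
                int ci = (row / 2) * (width / 2) + (col / 2);
                int uVal = (u[ci] & 0xFF) - 128;
                int vVal = (v[ci] & 0xFF) - 128;
                int r = clamp((298 * yVal + 409 * vVal + 128) >> 8);
                int g = clamp((298 * yVal - 100 * uVal - 208 * vVal + 128) >> 8);
                int b = clamp((298 * yVal + 516 * uVal + 128) >> 8);
                argb[row * width + col] = 0xFF000000 | (r << 16) | (g << 8) | b;
            }
        }
        return argb;
    }

    static int clamp(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }
}
```

On Android, the resulting int[] drops straight into Bitmap.createBitmap with Bitmap.Config.ARGB_8888.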
Finally, we're at the point of actually saving the collection of WebRTC video frames to an animated GIF. I discussed the iOS method earlier. I also leave it as an exercise to the reader to record the variable framerate of the video stream and apply the frame timing reasonably to the animated GIF. The main discussion is once again Android.
We started out with a Java-based GIF encoder with high color accuracy. This got the job done well, but it had a drawback: on somewhat older devices, like the Nexus 5, encoding 2 seconds of video at 10fps with 480x480px frames (20 of them) could take upwards of 3 minutes to complete (though to be fair, with lots of background processes closed and a fresh boot, it could be down to 1 minute 15 seconds). Either way, this was unacceptable. All our tests on iOS, even with an older iPhone 5, showed much better quality encoding in 10-15 seconds. Step one was to increase the thread priority, since we were using an AsyncTask, which defaults to background thread priority and gets maybe 10% of the CPU. Bumping this up to normal and even high priority got us around a 40% speed increase. That's a lot, and since most phones have multiple CPU cores, it didn't affect the video stream performance. However, our actual target was a 6-second animated GIF at 15fps, which means 90 frames to encode. The next step was to dig up an NDK-based GIF encoder. That got us a further speed increase, and we were looking at just over a minute for the full 90-frame encode.
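The priority bump itself is a one-liner. On Android we adjusted it inside the AsyncTask via android.os.Process.setThreadPriority(); the portable sketch below uses Thread.setPriority to make the same point (helper name is ours):

```java
// Run the encode on a thread whose priority is raised above the default.
// On Android the real code used android.os.Process.setThreadPriority()
// inside the AsyncTask; Thread.setPriority is the closest plain-Java analog.
class EncoderThreads {
    static Thread startEncoder(Runnable encodeWork) {
        Thread encoder = new Thread(encodeWork, "gif-encoder");
        encoder.setPriority(Thread.MAX_PRIORITY); // default is NORM_PRIORITY (5)
        encoder.start();
        return encoder;
    }
}
```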
I instrumented the whole encoding process and found two major time sinks: creating a color palette for each frame, and mapping the frame's pixels to that palette. The former was maybe 20% of the frame encode time, while the latter was 70-75%. I played around a bit with global color palettes and with only generating a new palette every few frames. A single global palette caused a pretty bad quality drop in certain cases, but generating the palette once every 5 frames and reusing it for the intervening frames won back a decent amount of speed without a serious loss in quality. Still, that only sped up the step that accounted for a fifth of the total frame encoding time. The real cost was walking every pixel of each frame and finding its best match in the color palette.
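The every-5-frames trick is just memoization with a counter. A sketch (the class name and the pluggable buildPalette function are ours; in the real encoder, buildPalette is the expensive quantization step being skipped):

```java
import java.util.function.Function;

// Regenerate the color palette only every `interval` frames and reuse
// the stored one in between. buildPalette stands in for the expensive
// palette-quantization step in the real encoder.
class PaletteCache {
    private final int interval;
    private final Function<int[], int[]> buildPalette;
    private int[] cached;
    private int framesSinceBuild;
    int builds; // how many times we actually regenerated (instrumentation)

    PaletteCache(int interval, Function<int[], int[]> buildPalette) {
        this.interval = interval;
        this.buildPalette = buildPalette;
    }

    int[] paletteFor(int[] framePixels) {
        if (cached == null || framesSinceBuild >= interval) {
            cached = buildPalette.apply(framePixels);
            framesSinceBuild = 0;
            builds++;
        }
        framesSinceBuild++;
        return cached;
    }
}
```

Over a 90-frame encode with an interval of 5, that's 18 palette builds instead of 90.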
I can't say I came up with the idea (that credit belongs to Bill), but I did implement our final solution: we multi-threaded the palettization of each frame. We checked the device to see how many CPU cores it had (combining StackOverflow answers from both David and DanKodi), then set the encoding thread count to one less than that (so the video stream keeps running). We split the frame by rows into that many segments and palettized them concurrently. Now you may be asking: what about dithering? Strictly speaking, this method produces a slightly lower quality frame, because we can't dither across segment boundaries quite the same way. We dithered each segment as normal, but each segment after the first had to seed its dithering from the (un-dithered) last row of the previous segment. On its own, this would leave artifacts along the lines between segments, so after all the threads finished, we did one more custom dithering pass along the boundaries, using the final dithered values from the previous segment to update the first row of the next. This pretty much smoothed out all the noticeable artifacts.
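The core of the parallel pass looks roughly like this, minus the dithering and the boundary smoothing described above (the names are ours, and nearest-match here is plain squared RGB distance):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Split the frame into row segments, one per worker thread, and map each
// pixel to its nearest palette entry concurrently. Dithering omitted.
class ParallelPalettizer {
    static byte[] palettize(int[] argb, int width, int height,
                            int[] palette, int threads) {
        byte[] indexed = new byte[width * height];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> tasks = new ArrayList<>();
        int rowsPerSegment = (height + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int startRow = t * rowsPerSegment;
            final int endRow = Math.min(height, startRow + rowsPerSegment);
            tasks.add(pool.submit(() -> {
                for (int i = startRow * width; i < endRow * width; i++) {
                    indexed[i] = (byte) nearest(argb[i], palette);
                }
            }));
        }
        try {
            for (Future<?> f : tasks) f.get(); // propagate worker exceptions
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return indexed;
    }

    // Nearest palette entry by squared RGB distance.
    static int nearest(int argb, int[] palette) {
        int best = 0, bestDist = Integer.MAX_VALUE;
        for (int p = 0; p < palette.length; p++) {
            int dr = ((argb >> 16) & 0xFF) - ((palette[p] >> 16) & 0xFF);
            int dg = ((argb >> 8) & 0xFF) - ((palette[p] >> 8) & 0xFF);
            int db = (argb & 0xFF) - (palette[p] & 0xFF);
            int dist = dr * dr + dg * dg + db * db;
            if (dist < bestDist) { bestDist = dist; best = p; }
        }
        return best;
    }
}
```

Each segment writes to a disjoint slice of the output array, so the workers need no locking.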
We forked Wayne Jo's android-ndk-gif project with this new encoding method. This got us yet another 40% increase in encoding speed, bringing us under 40 seconds on average to encode 90 frames on an old Nexus 5, which we deemed acceptable. On a modern phone, this actually results in faster speeds than we saw on iOS.
In conclusion, I have failed to talk about other potentially useful pieces of this whole puzzle, including saving animated GIFs to the Android image gallery, saving animated GIFs to the iOS PhotoLibrary, getting WebRTC connections to persist across Android screen rotations, and the whole thing where we actually got the Android app and the iOS app to connect to each other.