There are two main factors that determine the viewer experience during the live streaming of sports: latency and stalls. Latency should be low and stalls should not occur. Yet, these two factors work against each other, and it is not trivial to strike the best trade-off between them. One of the best tools we have today to manage this trade-off is adaptive playback speed control, which allows the streaming client to slow down the playback when there is a risk of stalling and speed up the playback when there is no risk of stalling but the live latency is higher than desired. While adaptive playback generally works well, the artifacts due to the changes in playback speed should preferably be unnoticeable to the viewers. Whether they are noticeable largely depends on the part of the audio/video content subject to the speed change. In this talk, we discuss the details of a content-aware playback speed control (CAPSC) algorithm we developed for dash.js, along with the metadata we defined to indicate the event densities in a given content. CAPSC aims to keep the playback speed close to the nominal speed (1x) during the important parts of the content and gracefully adapts the playback speed elsewhere to provide a more pleasant viewing experience. Our ultimate goal is to apply CAPSC to live soccer games in real time through a third-party company (e.g., Wyscout, TheSports, or InStat) that provides real-time game data and statistics.
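A content-aware speed decision can be reduced to a small pure function. The sketch below is illustrative only: the function name, thresholds, and the 0-to-1 event-density scale are our own assumptions, not the dash.js API or the actual CAPSC algorithm.

```python
def choose_playback_rate(buffer_s, live_latency_s, event_density,
                         target_latency_s=3.0, max_drift=0.05):
    """Pick a playback rate near nominal (1.0x).

    event_density: 0.0 (nothing happening) .. 1.0 (critical action).
    During dense moments the allowed deviation shrinks toward zero,
    so important plays are watched at (or near) 1x.
    All names and thresholds are illustrative.
    """
    allowed = max_drift * (1.0 - event_density)   # e.g. 5% -> 0% drift
    if buffer_s < 1.0:                            # stall risk: slow down
        return 1.0 - allowed
    if live_latency_s > target_latency_s:         # behind live: catch up
        return 1.0 + allowed
    return 1.0                                    # healthy: stay nominal
```

During a goal (density near 1.0) the function pins the rate to 1x even when the client is behind live, deferring the catch-up to quieter content.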
Obtuse Crypto-Fish? Organic Chemotaxonomist’s Federation? Obvious Chafing Factor?
In the filmmaking process and within Netflix’s globally distributed collaboration environments, “OCF” refers to Original Camera Files (or Original Capture Formats, depending on who you ask). Call them what you like, but their importance cannot be overstated. They contain the data captured by a camera sensor, and represent the digital equivalent of a negative in film-based workflows of yore. Post-production processes like review, editorial and visual effects cannot begin without OCF.
This presentation will outline the unique challenges associated with managing digital camera files in the course of content creation and demonstrate how we bring joy to both our members and creative partners alike as we scale best practices of content production, leverage open standards like OpenTimelineIO and ACES, and implement new tools tailored to the specific needs of filmmakers. These solutions include fast and intelligent transfer of multi-terabyte camera cards, classification of shots and their relationships to the final cut, and media optimization for use throughout a complex production pipeline.
VR may sound like a gimmick at the moment, but there is no denying its potential. To prepare for the future, Evolution recently launched a live VR game. There are not many guides out there on how to produce a proper live VR stream, so we had to experiment and figure things out ourselves. The purpose of this talk is to share the dos and don'ts of live VR production we learned along the way, in the hope of helping others produce a better live VR experience.
Oftentimes, when people bemoan the state of accessibility for media playback, they’re talking about the media content itself: closed captioning, audio descriptions, available transcripts, seizure-causing flashes in video content, scene color contrast, and the like. These are unquestionably important considerations when attempting to make your users’ media playback experience accessible, but they are unfortunately also very hard and sometimes quite costly problems. Not only that, oftentimes these (at least currently) require considerations and buy-in at the production phases of content creation, well before the media makes its way into streaming media pipelines, platforms, and players.
Yet there is an area of much easier, cheaper, and more tractable improvements to accessibility available to video engineers: the media player itself. In this talk, I’ll give an overview of the primary considerations for making an accessible browser-based media player. At its core, good media player accessibility requires semantics that are consistent and make sense for both non-technical users and video developers alike. This goes hand in hand with coming up with the right metaphors for the different kinds of interactions and information surfacing—analogous to the physical metaphors at play for visual user experience. With these, we should be able to establish some best practices and reference implementations for common use cases. Finally, I’ll end by demoing a player-agnostic example implementation aiming toward these goals, along with some bonus wins you get for free when you make your media player accessible.
We have a complex rendering pipeline that requires temporal and graphical editing. Historically we've used ffmpeg and its filter graphs as our rendering engine.
However, this has led to duplicated work in our user interface to preview rendered output. We have invested time in understanding the demux/decode/filter/encode/mux pipeline.
We have come up with techniques combining Puppeteer, MP4Box.js, WebCodecs, <canvas>, and ffmpeg (final muxing) together to create a pipeline that gives web developers a familiar <canvas> graphics API, without sacrificing the performance characteristics of ffmpeg.
ffmpeg is best known for being the Swiss Army knife of video processing. In recent years, as AI-based video processing has become popular, ffmpeg has introduced many of these capabilities through a number of filters. These filters have dramatically lowered the technical barrier to doing AI-based video processing.
In this talk, we'll examine the various existing filters like SuperResolution or Derain. We will also talk about our experience creating a scene classification filter - training the models, implementing the filter using ffmpeg's tensorflow backend, and using GPUs to run the model for live streams. This filter is open source, and currently detects content types such as professional soccer games or adult content. We will share benchmarks and our learnings in tuning the performance for this filter.
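As a concrete starting point, the snippet below assembles a command line for ffmpeg's built-in `sr` (super resolution) filter. Check `ffmpeg -h filter=sr` on your build before relying on the option names, since the DNN filters require ffmpeg to be compiled with the TensorFlow backend; the function name and model path here are our own.

```python
def build_sr_command(src, dst, model_path, scale=2):
    """Build an ffmpeg invocation for the `sr` super-resolution filter.

    Options follow the FFmpeg `sr` filter documentation at the time of
    writing (dnn_backend, scale_factor, model); verify against your build.
    """
    vf = f"sr=dnn_backend=tensorflow:scale_factor={scale}:model={model_path}"
    return ["ffmpeg", "-i", src, "-vf", vf, dst]
```

Usage: `subprocess.run(build_sr_command("in.mp4", "out.mp4", "espcn.pb"))` (the `.pb` model file is a placeholder name).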
The interchange of content delivery configuration metadata between the various entities in the delivery ecosystem is essential for efficient interoperability. The need for an industry-standard API and metadata model becomes increasingly important as content and service providers automate more of their operations, and as technologies, such as open caching, require coordination of content delivery configurations. This talk discusses the Open Caching configuration interface, how the project was born, how it is moving through the SVA and IETF, and what it means for the industry.
The video streaming system of Netflix has hundreds of configuration parameters that influence many aspects of the playback behavior when using our service; for example, such configurations specify the amount of video content to load before we begin playback to balance play delay and risk of rebuffers.
Usually, we perform many iterations of A/B experiments to fine-tune these values and provide the best possible member experience across the wide range of platforms and networks we serve worldwide. Still, identifying good configurations that work well across diverse networks and devices, in particular when dealing with multi-dimensional parameters, is challenging given their complex interactions with various streaming metrics.
To help with these challenges, one powerful approach we have evaluated over the last few years is Bayesian optimization. With this method, we can efficiently explore and understand the relationship between configuration parameters and objective metrics (such as play delay, rebuffer rate, …) by building a surrogate model that incorporates experimental observations and guides future experiments.
In this presentation, we give a brief introduction to the world’s leading streaming entertainment service, Netflix, followed by an example use case showing how we use Bayesian optimization in conjunction with our A/B experimentation framework to deliver concrete service improvements for our users.
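The surrogate-model loop can be sketched in plain Python. This is a toy 1-D Gaussian-process surrogate with a lower-confidence-bound acquisition rule, not Netflix's production system; all names, the kernel length scale, and the acquisition constant are illustrative.

```python
import math

def rbf(a, b, ls=2.0):
    """Squared-exponential kernel: nearby configs behave similarly."""
    return math.exp(-((a - b) ** 2) / (2.0 * ls * ls))

def solve(a_mat, b_vec):
    """Solve A x = b by Gaussian elimination (fine for tiny systems)."""
    n = len(a_mat)
    m = [row[:] + [b_vec[i]] for i, row in enumerate(a_mat)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def gp_posterior(xs, ys, xq, noise=1e-4):
    """Posterior mean/variance of the GP surrogate at query point xq,
    given observed (config, metric) pairs (xs, ys)."""
    n = len(xs)
    k_mat = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    alpha = solve(k_mat, ys)
    k_vec = [rbf(xq, x) for x in xs]
    mean = sum(k_vec[i] * alpha[i] for i in range(n))
    v = solve(k_mat, k_vec)
    var = rbf(xq, xq) - sum(k_vec[i] * v[i] for i in range(n))
    return mean, max(var, 0.0)

def next_config(xs, ys, grid, kappa=1.0):
    """Lower-confidence-bound acquisition: pick the next config to test
    when minimising an objective such as play delay."""
    def lcb(x):
        mean, var = gp_posterior(xs, ys, x)
        return mean - kappa * math.sqrt(var)
    return min(grid, key=lcb)
```

Each A/B round feeds its measured metric back into `(xs, ys)`, and `next_config` trades off exploiting low predicted values against exploring high-uncertainty regions.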
The software I work on allows users to design video clips in the browser. Users can combine video clips with captions and other elements into multiple scenes, which we render to a rasterised video.
This means we need to keep different forms of playable media in sync over time, while playing, seeking, and scrubbing in the designer UI, as well as during rendering.
My talk will cover how we implemented our time state tracking in React, including:
• Embracing the concept of "derived state" for reliable, deterministic rendering
• Optimising performance through various techniques
• How to test time-based state (or, how to time-travel in your tests)
• How to sync various types of media (videos, captions, etc) with a single source of truth
This will help anyone who wants to build a video editor and/or rendering system in the browser set up a solid foundation for handling time in their UI.
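The "derived state" idea translates to any language: store only the base facts (media time at the last play/seek, the wall-clock moment playback started, the rate) and compute the playhead on demand, so rendering is deterministic and time-travel tests just pass in a fake clock. A minimal sketch with hypothetical field names (our real implementation is in React/TypeScript):

```python
def current_time(state, now):
    """Derive the playhead position instead of storing it.

    `state` is the single source of truth; `now` is an injected wall
    clock, which is what makes time-travel testing trivial.
    """
    if not state["playing"]:
        return state["media_time"]           # frozen while paused/scrubbing
    elapsed = now - state["started_at"]      # wall-clock seconds since play
    return state["media_time"] + elapsed * state["rate"]

def seek(state, target, now):
    """Seeking just rewrites the base facts; derived time follows."""
    return {**state, "media_time": target, "started_at": now}
```

Captions, overlays, and video elements all sample `current_time` from the same state, which keeps every media type in sync by construction.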
Apollo 350 built an in-browser platform for a startup to support live video AMA (ask me anything) events. We needed to support thousands of viewers and multiple video conference hosts, where the hosts could talk in real-time to each other. We wanted a solution that would allow us to customize the video that was available for viewers. In our case, we needed to shape the video of each host into a circle and animate them larger/smaller based on who was speaking. We also did not want to rely on any processing on the "hosts'" side due to host browser limitations. The "hosts" would simply log into the website and be able to broadcast from their browser (desktop or mobile).
We had to build this quickly and needed to leverage existing services like Twilio, but we also built some custom solutions of our own. We'll talk about the process of how we got to a working product and the decisions (both good and bad) that we had to make. We will share the architecture we ended up with.
Apollo 350 is a software consulting company that has extensive experience building and launching video products to millions of users with billions of views. We consist of tech executives and leads who launched VEVO, Condé Nast Entertainment, and LinkedIn Video.
I plan to talk about the changes in the FFmpeg community (and in the libav community), what has happened in the FFmpeg project code-wise, and what we realized along the way: FFmpeg 5.0, releases, the CoC, and a few other things. I also plan to speak about dav1d and x264/rav1e on ARM.
Content providers, who frequently rely on third party software and services (Players, CDNs, Origin Services), struggle to develop the observability necessary to achieve their QoE goals. They are essentially responsible for the entire customer experience but only able to fully observe what they themselves instrument and what their service providers are able/willing to share. In recent years, there has been some progress along these lines in the form of CMCD-based data propagation of player/app metrics and CDN log streaming but logs and metrics are still insufficient for enabling the deep observability necessary for consistently amazing user experiences.
In addition, 3rd party service providers (e.g. CDNs, origin services, packagers, etc.) struggle to provide optimal service to their customers (e.g. content publishers) due to the same observability challenge. In short, telemetry is fragmented and siloed, making it virtually impossible to get the complete architectural or operational picture.
As a result of these conditions, the Streaming Video Alliance's QoE working group is developing methods for augmenting logs and metrics with the third pillar of observability: distributed request tracing. Typically, distributed request tracing is performed using standardized methods based on the OpenTelemetry project in the context of a single-service architecture. Our project is designed to span 3rd party services across the video ecosystem, from player to CDN to origin and beyond, using logging mechanisms already in place.
In this talk we'll cover the methods for implementing the first phase of this initiative, from player to origin and back. We'll talk about early results and how tracing can be leveraged to turn high-level QoE warning signals into deep root cause analysis. If successful, this effort promises to enable game-changing quality not only for content publishers but also for the 3rd party services that support them.
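For context, standard distributed tracing works by forwarding a trace ID end to end while each hop (player, CDN, origin) mints a new span ID. The sketch below builds a W3C Trace Context `traceparent` header value; it illustrates the general propagation mechanism, not necessarily the working group's exact design.

```python
import secrets

def make_traceparent(parent=None):
    """Build a W3C Trace Context `traceparent` header value.

    Format per the W3C spec: version-traceid-spanid-flags, where
    trace-id is 16 random bytes and span-id is 8, lowercase hex.
    A child request reuses the parent's trace-id with a fresh span-id,
    which is what lets logs from player, CDN, and origin be joined.
    """
    trace_id = parent.split("-")[1] if parent else secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"
```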
In 2020 we decided to replace our aging Live 2 VOD system with a new system that would hopefully fix some of the issues with the system already in place, such as missing signaling about programming, having to maintain a separate L2V origin forever, and a lack of frame accuracy. We wanted to create a system that would run completely automated based on our playout system's schedule, requiring no manual intervention or manual cutting. We also wanted the system to allow users to immediately play content as soon as it was broadcast, with the option to extend to "start over" functionality later. Lastly, we wanted Live 2 VOD videos to be served from our normal VOD library.
The talk covers:
How we found a way to integrate with traditional broadcast systems, using the systems already in place, to get highly accurate, automated VOD assets from traditional broadcast channels across 31 channels (8 national, 8 regional and 15 events)
How we process 250 transmissions per day, of which ~40 are published to the service with a length of 5 mins to 8 hrs.
How we managed to wrangle the existing playout system to provide usable SCTE-104 insertion that we could convert to SCTE-35 and later read.
How we combined using a live buffer with producing VOD assets to simplify long term storage of captured VOD assets and avoid having a dedicated L2V origin while still having content available in under 1 minute.
How we handle situations with sports that have a break between multiple parts.
Why it is sometimes impossible to get accurate and automated markers within the reality of a broadcast world.
How we had a manual handling rate of less than 1% while handling content for Wimbledon, Tour de France, several ATP tournaments, and the Handball World Championship 2021.
You should pick this talk because it covers the interaction between the traditional broadcast domain and streaming services, something that wasn't covered much in the 2020 Demuxed talks. We take a look at a challenge and an infrastructure landscape that many broadcasters likely share as they work to shift focus to streaming services and internet delivery while customers continue to cut or shed the cord.
QUIC (RFC 9000) is a new network protocol designed to power HTTP/3, but it's also a powerful transport for other applications like video. There are multiple approaches for mapping video to the QUIC API, varying based on the target latency and user experience. At Twitch/IVS we've built a new distribution protocol (Warp) to replace our HLS stack, utilizing a unique prioritization scheme to minimize latency in the face of congestion.
It was a Monday like any other when a friend called us and said: "Last weekend we streamed a Pay Per View soccer match in Bolivia and we were pirated through Twitch, Facebook and YouTube. The next match is in 5 days. Can you help us?"
This talk tells the story of how we built and run a pilot of an end to end anti-piracy system in a week.
We will cover the challenges we faced, the strategies we used, and the architecture of the solution we built using AWS, VideoJS and Common Encryption.
The code will be available on GitHub celebrating the Demuxed 2021 edition.
During COVID, the importance of video calling grew as people were stuck and separated from their family and friends. But the 2D experience falls short of making you feel present in the same space together. Moving to a 3D calling experience (e.g. holographic calling) can help people feel closer and more immersed.
People are familiar with the concept of holographic video calling from sci-fi movies, and there have been some demos by different companies using highly specialized equipment. But what would it take to make it a reality, enabling this technology for the masses, and making it so that people can use it as easily as 2D calls? The answer to this question spans multiple domains, both on hardware and software, from computer vision and machine learning to compression and real-time transport. In this talk, I will focus on some of the technical challenges this brings for video and real time communication.
While idly scrolling through my Twitter feed one evening, I stumbled on a Tweet containing a video of Matt and Phil trying their absolute hardest to make us submit talk submissions before the last moment humanly possible.
As usual, I laughed in derision, and reached out with my oversized thumb to scroll away, but then Phil said something that stopped me in my tracks.
"Smell-o-vision", he said. Could it be that simple? Could this be my golden ticket to the Demuxed hall of fame?
So I started down the strangest, smelliest rabbit hole imaginable, researching the complete history of smell-o-vision, and let me tell you, my eBay recommendations will never be the same again.
This tongue-in-cheek talk is a whistle-stop tour of all things olfactory in the video technology space. We'll start in the early 1900s, with movie theatre owners selecting and blowing their own scents into the theatre, in many cases before the films even had audio tracks, traveling through to smell-o-vision and the competitive AromaRama standard wars (Spoiler: they both turned out to be betamax).
Next we'll jaunt through the deeply questionable history of "scratch-and-sniff" cards, both in the movie theatre, but also shoved in a DVD case for you to enjoy in your own home, culminating in a live, on stage, scratching and sniffing of a 40 year old "ODORAMA" card, with second opinion from Matt McClure himself.
Penultimately, we'll take an ill-advised detour into the noxious dark ages of smell-enabled VR headsets with a very unsocially distanced look at the Nosulus Rift (the only device ever designed to emit a custom Fartgrance), and ponder if any of the upcoming "Digital scent technology" devices on Indiegogo are actually worth buying.
Finally, we'll arrive in 2021 and review what's to become the future of streaming - O(dor)TT. It might surprise you to hear that there isn't currently a standard for delivering smells via HLS or DASH manifests, so first up, we'll fix that, and only then can we take a scene from one of the terrible scratch and sniff movies I bought, and wire up my very own home made smell-o-vision device. Join me, live on stage for the world's first demonstration of `EXT-X-SMELL` in action.
(This is not a joke, you can check my eBay history if you don't believe me...)
We ask the question:
“Can we compress AV content generated via webcams to just text and recover videos with similar Quality-of-Experience compared to standard codecs in a low bitrate regime?”
and answer it in affirmative using state-of-the-art deep learning models.
Video represents the majority of internet traffic today, leading to a continuous technological arms race between generating higher quality content, transmitting larger file sizes, and supporting network infrastructure. Adding to this is the surge in the use of video conferencing tools fueled by the COVID-19 pandemic. Since videos take up substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can have a substantial impact on billions of people in developing countries or other locations with limited or unreliable broadband connectivity. Moreover, a reduction in required bandwidth can significantly improve global network performance by decreasing the network load for live and pre-recorded content, providing broader access to multimedia content worldwide. In this talk, we present a novel video compression pipeline, called Txt2Vid, which substantially reduces data transmission rates by compressing webcam videos ("talking-head videos") to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep learning-based voice cloning and lip-syncing models. Our generative pipeline achieves a two to three orders of magnitude reduction in bitrate compared to standard audio-video codecs, while maintaining equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The code for this work is available as an open-source project on GitHub (https://github.com/tpulkit/txt2vid.git).
The focus of our work is on audio-video (AV) content transmitted from webcams during video conferencing or webinars. Current compression codecs (such as H.264 or AV1 for video, and AAC for audio) lossily compress the input AV content by discarding details that have the least impact on user experience. However, the distortion measures targeted by these codecs are often low-level and attempt to penalize deviation from the original pixel values or audio samples. But what matters most is the final quality-of-experience (QoE) when this media stream is shown to a human end-consumer. Thus, in our proposed pipeline, instead of working with pixel-wise fidelity metrics, we directly approximate the original content such that the QoE is maintained. Compressing to text, we can achieve bitrates of ~100 bps at similar QoE compared to a standard codec. The pipeline uses a state-of-the-art voice cloning model to convert text-to-speech (TTS), and a lip-syncing model to convert audio to reconstructed video using a driving video at the decoder. Our pipeline can be used for storing webcam AV content as a text file or for streaming this content on the fly. We evaluated our pipeline using a subjective study on Amazon MTurk to compare user preferences between Txt2Vid-generated videos and videos compressed with standard codecs at varying levels of compression, across multiple contents.
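A quick back-of-the-envelope check makes the ~100 bps figure plausible. The speaking rate and characters-per-word below are assumed round numbers, not measurements from the study:

```python
# Assumed conversational speaking rate: ~150 words/min,
# ~6 characters per word including the trailing space.
words_per_sec = 150 / 60           # 2.5 words/s
chars_per_sec = words_per_sec * 6  # 15 chars/s
bits_per_sec = chars_per_sec * 8   # 120 bps as plain 8-bit text
```

So a raw ASCII transcript of continuous speech lands right around the ~100 bps mark, versus hundreds of kilobits per second for a conventional webcam video stream.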
We believe the proposed framework has the potential to change the landscape of video storage and streaming. It can enable several applications with great potential for social good expanding the reach of video communication technology. Some examples include better accessibility in areas with poor internet availability, transmission of pedagogical content for remote learning, real-time machine translation of talks, etc. It can also enable some fun applications such as joining an AV call but just typing in your input instead of speaking.
While we used specific tools in our pipeline to demonstrate its capabilities, we envision significant progress in the components used over the coming years. We would like to highlight that the current implementation is just a prototype of the proposed pipeline, and this is the perfect time for the community to get involved in making it more practical and accessible. Potential improvements to the current framework and implementation include reducing computational complexity and latency for streaming, improving Quality-of-Experience to include more non-verbal cues, and assuaging ethical concerns over the usage of such a technology. We call upon the community to build upon the current implementation and adapt it for different applications.
Streaming media formats are continually being updated with new features. Additionally, some platforms and rights holders are starting to require certain features be implemented/adopted.
In the past, the solution to this problem has been to re-encode and/or remux your existing media library to add the new features. This is expensive, time-consuming, and at times requires re-architecting your encoding/muxing pipeline to accommodate.
This talk will delve into the ins-and-outs of using Edge Compute platforms across multiple vendors to implement new features to existing media streams with a just-in-time and globally scalable approach.
There will be demos showing features added to existing HLS streams:
- Adding Roku's JPEG-based Trick Play
- Converting MPEG-TS based HLS to fMP4 based HLS
- CDN Pre-Fetching/Pre-Warming
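As a flavor of the kind of logic an edge worker runs, here is a minimal and deliberately naive sketch that extracts segment URLs from an HLS media playlist so they can be pre-warmed on the CDN before clients request them; it ignores byteranges, variant playlists, and other real-world details, and is not tied to any particular edge platform.

```python
def prewarm_urls(playlist_text, base_url):
    """Collect segment URLs from an HLS media playlist so an edge
    worker can issue pre-warming requests ahead of client demand.
    Minimal sketch: every non-comment line is treated as a segment URI,
    resolved against base_url when relative.
    """
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            if line.startswith("http"):
                urls.append(line)
            else:
                urls.append(base_url.rstrip("/") + "/" + line)
    return urls
```

An edge function would run this against the just-served playlist and fire HEAD/GET requests at the resulting URLs in the background.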
WHIP WebRTC, into shape
Try to detect it, it's not too late
It's time to WHIP it, WHIP it good.
For many in broadcast and streaming, WebRTC is not “complete”, as it lacks a standard signaling protocol to make it work like RTMP or RTSP.
WHIP, the WebRTC HTTP Ingest Protocol, was developed to solve the biggest pain point with adopting WebRTC as a serious, professional, robust contribution protocol: Media Ingest.
WHIP enables WebRTC to retain its technical advantages over older protocols like RTMP when it comes to resiliency over bad network conditions, adaptability, end-to-end encryption, and new codec support (hello AV1 SVC).
It also removes a barrier WebRTC has long had: the lack of a standard signaling protocol has made it hard to support in software solutions, and difficult for hardware encoders to implement.
Developers love WebRTC because it is an IETF & W3C standard that makes it easy to write client applications with native broadcast and playback support on billions of devices worldwide. And the WISH working group at the IETF is currently reviewing WHIP with a milestone to publish it as a standard by December 2021.
Implementing the open source WHIP library in your software or hardware encoder is all you need to support the entire WebRTC stack on the sender side.
It’s time to WHIP WebRTC into shape and take advantage of WebRTC end-to-end, as it was meant to be, natively on every device.
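The whole WHIP exchange is a single HTTP POST. The sketch below shapes such a request per the WHIP draft: the SDP offer goes in the body with Content-Type `application/sdp` (plus an optional bearer token), and the server answers 201 Created with the SDP answer and a Location header later used to DELETE (tear down) the session. The function and argument names are ours, not from any particular library.

```python
def build_whip_request(endpoint, sdp_offer, token=None):
    """Shape a WHIP ingest request (single POST of the SDP offer)."""
    headers = {"Content-Type": "application/sdp"}
    if token:
        # Bearer auth as described in the WHIP draft
        headers["Authorization"] = f"Bearer {token}"
    return {"method": "POST", "url": endpoint,
            "headers": headers, "body": sdp_offer}
```

Feed the returned dict to any HTTP client; the 201 response body is the SDP answer you hand back to your WebRTC stack.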
As streaming exits its infancy and becomes a true contender to broadcast television, quality and reliability remain under scrutiny. Beyond improving the resiliency of streaming workflows, it is imperative that organizations operationalize the way streaming workflows are monitored if they wish to compete with the broadcast standard of five nines of reliability (99.999% uptime).
I would like to present a glimpse of how our team is solving this problem, the challenges we faced, and the lessons we learned doing this at "Super Bowl" scale.
For 2021+ I'd like to propose an architecture for web-based players that clearly separates the concerns while setting up an ecosystem for widespread community video development. The proposal builds on existing open web patterns and requires nothing proprietary.
A lot of HDR implementations are incredibly powerful, and the resulting demo reels are impressive. But video tech is mostly a tool to tell stories, and it turned out that HDR is overengineered compared to how people use it. That overengineering makes HDR difficult to work with, requiring complex and expensive solutions.
What would it take for us as an industry to replace complexity with convention, simplifying workflows and lowering costs? What would we lose, and what would we gain?
Server-side mp4 lossless precise seeking is hard. Most of the time, you jump between keyframes. Tracey will show the initial idea: speeding up the fps of the video frames just before the exact wanted seek point. This became early custom patches to the mod_h264_streaming abandonware in 2013.
In 2021, it was time to switch to the "house" nginx "mp4" module and update it to support exact video starts. Time to become contributor #56 to nginx and collaborate with some of the best C programmers on the planet [level unlocked!]. Let's hack some (dyslexia-unfriendly) STTS and STSS moov atoms...
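For orientation, the stts box is just a list of (sample_count, sample_delta) runs. Below is a Python sketch of parsing it and mapping a seek time to a sample number; the nginx patch itself is C, so this is illustrative only.

```python
import struct

def parse_stts(payload):
    """Parse the payload of an `stts` (decoding time-to-sample) box,
    i.e. the bytes after the 8-byte box header: version/flags (4 bytes),
    entry_count (4 bytes), then (sample_count, sample_delta) pairs,
    all big-endian per ISOBMFF."""
    _, entry_count = struct.unpack(">II", payload[:8])
    return [struct.unpack(">II", payload[8 + 8 * i:16 + 8 * i])
            for i in range(entry_count)]

def sample_at_time(entries, t, timescale):
    """Map a presentation time in seconds to a 1-based sample number."""
    target = int(t * timescale)
    elapsed, sample = 0, 1
    for count, delta in entries:
        if elapsed + count * delta > target:
            return sample + (target - elapsed) // delta
        elapsed += count * delta
        sample += count
    return sample - 1  # past the end: clamp to last sample
```

An exact-start seek then splits the run containing that sample (and consults stss to find the preceding keyframe to decode from).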
It's not a new trick, but I still enjoy a deliberately glitched video. It evokes nostalgia of the early days of video on the web, when the glitches were unintentional. It is a fun throwback aesthetic that utilizes some video compression and encoding techniques we mostly take for granted in the year 2021. It's time to revisit it, appreciate the "misuse", and maybe learn what a RIFF file is while we're at it.
This talk will go over what datamoshing is and the different types thereof. It will succinctly explain motion vectors and I- and P-frames, and answer the question: what is the AVI container, and why is it used in datamoshing? While there will be a slide with links to resources for making one's own glitched video, the talk will focus on a lower-level explanation of why and how a "glitch" occurs rather than a tutorial on how to accomplish it. And of course, there will be a few fun examples sprinkled in there.
In this talk I will review some odd-looking numbers and design aspects that exist in modern-day video and media systems and try to explain how and why they were derived, what their original intended utility was, and why we have stuck with them.
Among the things I will review are:
Interlace scan -- 1880 patent by Maurice Leblanc: reducing bandwidth for line-by-line transmission of 2D images
YUV color spaces -- 1938 patent by Georges Valensi: compatibility with Black&White TV
4:2:0 chroma subsampling: 1949 patent by Alda Bedford, RCA; reducing bandwidth
First color TV system: 1951: CBS field-sequential system; did not use YUV!
First HDTV System: 1979 Japanese MUSE system. Analog. Extreme form of interlace and subsampling of everything.
3:2 pulldowns -- 1950s “Flying Spot” machines (all analog, this was well before CCDs!)
25/30 fps framerates -- 1930s: use of 50/60Hz AC power frequency as base reference
29.97 framerate -- 1953: bandwidth constraints in the design of NTSC
44.1kHz sampling rate -- 1980: desire to fit Beethoven's 9th Symphony on a single CD
NTSC, EBU, and SMPTE-C colors -- a case where the industry ignored the standard
Standard resolutions: 1080p, 720p, 480i, 576i, etc. -- each has history!
Anamorphic formats: 10:11, 12:11, 40:33, 4:3, 3:2, 64:33, etc. -- each has history!
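Two of these numbers can be checked with simple arithmetic. The NTSC rate is exactly 30000/1001, and 44.1 kHz falls out of the early PCM adaptors that stored digital audio on video recorders, at 3 samples per active line on both 60-field and 50-field machines:

```python
from fractions import Fraction

# NTSC color dropped the nominal 30 fps to exactly 30000/1001 so the
# color subcarrier would not beat against the audio carrier within the
# existing channel bandwidth.
ntsc = Fraction(30000, 1001)
assert abs(float(ntsc) - 29.97) < 0.001

# 44.1 kHz: 3 samples per active line yields the same rate on both
# families of video recorder used by PCM adaptors.
assert 60 * 245 * 3 == 44100  # 245 active lines/field, 60 fields/s
assert 50 * 294 * 3 == 44100  # 294 active lines/field, 50 fields/s
```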
Imagine a live streaming viewer wants to know the delay of their live video, or a live infrastructure engineer wants to capture the latency performance of the pipeline: how could you systematically measure that latency, nearline and at scale?
In this talk, we will present a latency measurement framework we developed for LinkedIn live video. It supports capturing per-stream latency from ingestion to playback, leveraging technologies including media signaling, header manipulation, and stream processing.
We’ll be covering four major latency contributing factors in the live pipeline: ingestion to storage, origin server processing, origin to CDN, and CDN to client player latency. We will discuss how we work across the stack with Azure Media Service to capture each latency component per client in nearline fashion. We’ll also share some “glitches” we caught while building the flow.
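One generic way to measure the client-side leg is to compare the wall clock against the absolute timestamp of the frame currently being rendered, e.g. from an HLS EXT-X-PROGRAM-DATE-TIME tag. The sketch below illustrates that general technique; it is not necessarily the mechanism used in the LinkedIn pipeline.

```python
from datetime import datetime, timezone

def e2e_latency_s(program_date_time, playhead_offset_s, now=None):
    """Estimate end-to-end live latency in seconds.

    program_date_time: ISO-8601 wall-clock timestamp of the segment start
    (e.g. the EXT-X-PROGRAM-DATE-TIME value, with an explicit UTC offset).
    playhead_offset_s: how far into that segment the player currently is.
    """
    now = now or datetime.now(timezone.utc)
    segment_start = datetime.fromisoformat(program_date_time)
    shown = segment_start.timestamp() + playhead_offset_s
    return now.timestamp() - shown
```

Reporting this value per client into an analytics pipeline gives the nearline, per-stream view the abstract describes.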
This talk will cover the new features being added to the HLS spec, in particular Content Steering and Interstitials.
We will cover what the challenges have been in the past for Multi-CDN Steering and SSAI and how these systems have been built in the past. We will then go into how these new features are supported in HLS and what they enable/make easier.
Finally we will look at how you can start implementing these new features and talk about what it will mean for support across non-Apple devices.
In this talk, we will focus on Pseudo HDR technologies and their effective and efficient implementations, targeted at providing HDR-like experiences for SDR videos streamed to ordinary rendering devices that may not be HDR capable. In particular, we focus on enhancing the Quality of Experience (QoE) for UGC (User Generated Content) videos, which have seen explosive growth thanks to the evolution of affordable consumer devices and the tremendous popularity of social media platforms. The quality of UGC videos varies over a fairly wide range and is quite unpredictable. One category of methodologies to enhance QoE is to adopt a pre-processing stage (or a pre-encoder) before the videos are fed into the encoder/transcoder for distribution over the Internet. Our Pseudo HDR approach falls into this pre-processing category. It maintains the same bit depth (e.g. 8-bit) for the input videos, aiming to enhance content that contains significantly darker or lighter regions, to generate more visually impactful content and present PGC-like experiences on the end-user side without requiring special rendering platforms such as HDR-capable devices.
The essential algorithm behind our Pseudo HDR approach is the so-called Contrast Limited Adaptive Histogram Equalization (CL-AHE: https://en.wikipedia.org/wiki/Adaptive_histogram_equalization#Contrast_Limited_AHE), in which a different histogram-equalization transfer function is derived for each region of a video frame, specifically to enhance contrast in regions that are significantly brighter or darker. The contrast limit prevents excessive noise amplification that would otherwise deteriorate the overall visual quality.
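As a rough illustration, the clipping step can be sketched for a single 8-bit tile as follows. This is a minimal sketch, not our production code: full CL-AHE computes one such mapping per tile of the frame and bilinearly interpolates between neighboring tiles' mappings to avoid block seams.

```python
import numpy as np

def clipped_hist_equalize(tile: np.ndarray, clip_limit: float = 2.0) -> np.ndarray:
    """Contrast-limited histogram equalization for one 8-bit grayscale tile.

    Clipping the histogram before building the CDF bounds the slope of the
    mapping, which limits how much local noise can be amplified.
    """
    hist, _ = np.histogram(tile, bins=256, range=(0, 256))
    # Clip each bin at clip_limit times the average bin height, then
    # redistribute the clipped excess uniformly over all bins.
    limit = max(1, int(clip_limit * tile.size / 256))
    excess = int(np.maximum(hist - limit, 0).sum())
    hist = np.minimum(hist, limit) + excess // 256
    # Build the tone-mapping lookup table from the clipped histogram's CDF.
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1.0) * 255.0
    lut = cdf.astype(np.uint8)
    return lut[tile]
```

With a low `clip_limit` the mapping approaches the identity (little enhancement, little noise amplification); with a high limit it approaches plain histogram equalization.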
We adopt CL-AHE as part of the pre-processing procedure applied to videos before transcoding. The advantages of applying such a procedure at the source are twofold: uniform quality enhancement that does not depend on the capabilities of the rendering device, and the ability to leverage the relatively abundant computational resources available on the transcoder side. UGC is often uploaded to the cloud right after creation, before being widely distributed, and is ultimately consumed on mobile devices where power is limited. The power consumption of a mobile app has a significant impact on device heating and battery life, and ultimately on the overall QoE for end users.
Existing CL-AHE approaches have mainly been studied for and applied to still images. Even when considered for video applications, they are usually applied to premium content rather than UGC. Our goal is a Pseudo HDR approach for UGC videos whose implementation is effective and efficient enough to deploy in both VOD and live end-to-end streaming solutions.
The underlying challenges of Pseudo HDR pre-processing for video are indeed significant:
(a) Noise present in very bright or very dark areas, in particular UGC-specific noise such as compression artifacts, is easily amplified by the AHE transformation and is visually quite annoying.
(b) Preserving temporal consistency is critical: a video is not simply a series of still images. If the contrast-enhancement transformation ignores temporal-domain characteristics, artifacts such as frame-to-frame flickering, especially in dark areas, become very visible and deteriorate the overall visual experience. Temporal smoothness and fluidity are even more critical for video than the per-frame visual quality in the spatial domain.
(c) The transformed videos can be much larger in size. Edges and fine textures in dark areas, for instance, can be compressed away in the original SDR content without much visible degradation; after Pseudo HDR processing, those finer textures are exposed and demand more bits to achieve sufficiently good visual quality.
Pseudo HDR is an innovative direction we are driving so that video end users can enjoy the best possible visual quality. To address the above challenges and other major issues encountered while launching the product with our customers, we implemented and deployed the following methodologies:
1. In very bright regions, the contrast transformation is likely to push pixels even brighter. We developed brightness-level-adaptive approaches that prevent such defects while maintaining appropriate transformations for other regions.
2. AHE often introduces color distortion in skin areas, which is quite noticeable to end users; preserving accurate chroma properties is critical in Pseudo HDR processing. We developed several approaches, e.g., fast face detection, skin detection, and color-based region segmentation, that let Pseudo HDR distinguish skin regions from the rest of the frame and apply specific contrast transformations that preserve accurate skin color.
3. Flickering or flashing artifacts caused by inconsistent contrast transformations across adjacent frames appear most frequently in very dark areas, where finer textures are over-amplified by the transformation. We detect such regions and apply adaptive joint spatial-temporal filtering to constrain the temporal-domain difference and improve consistency across adjacent frames after the transformation.
4. As mentioned above, applying Pseudo HDR processing is likely to increase the encoding bitrate, by more than 20% compared to the unprocessed source. Besides the over-amplified fine textures making the encoder's job harder, banding artifacts may appear and existing compression artifacts in the source videos may be exaggerated, all of which reduce encoding effectiveness. With ROI detection, denoising, and debanding, we not only ameliorate visual artifacts after Pseudo HDR is applied but also greatly attenuate high-frequency components, so encoding bitrates can be greatly reduced.
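For the skin-preservation step (item 2 above), a coarse chroma-based mask illustrates the idea. The Cb/Cr bounds below are a commonly cited rule of thumb for skin tones in YCbCr, not the thresholds we ship; in practice the mask would be combined with the face detection and segmentation described above.

```python
import numpy as np

def skin_mask_ycbcr(ycbcr: np.ndarray) -> np.ndarray:
    """Return a boolean mask of likely skin pixels for an HxWx3 YCbCr frame.

    Pixels flagged here would receive a gentler, chroma-preserving
    transformation instead of the full contrast stretch.
    """
    cb = ycbcr[..., 1].astype(np.int32)
    cr = ycbcr[..., 2].astype(np.int32)
    # Rule-of-thumb skin ranges in YCbCr (illustrative thresholds).
    return (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)
```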
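The flicker suppression of item 3 can be approximated by smoothing the per-region tone-mapping curve over time rather than the pixels themselves. A minimal sketch, with an illustrative EMA coefficient rather than our tuned parameters:

```python
import numpy as np

class TemporalLUTSmoother:
    """Exponential moving average over per-frame tone-mapping LUTs.

    Smoothing the 256-entry mapping keeps the transform stable across frames,
    which suppresses flicker in dark regions where small histogram changes
    would otherwise swing the curve from one frame to the next.
    """

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha   # weight of the current frame's LUT
        self.state = None    # smoothed LUT, kept in float

    def update(self, lut: np.ndarray) -> np.ndarray:
        cur = lut.astype(np.float64)
        if self.state is None:
            self.state = cur
        else:
            self.state = self.alpha * cur + (1.0 - self.alpha) * self.state
        return np.clip(np.round(self.state), 0, 255).astype(np.uint8)
```

A sudden 50-level jump in the mapping would then reach the output only gradually, at roughly `alpha` of the jump per frame.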
It is always important to examine the computational resources consumed by Pseudo HDR processing. Quality-aware, bitrate-efficient Pseudo HDR comes at the cost of a noticeable increase in computational complexity and potentially a larger processing delay. We therefore developed a series of presets for our Pseudo HDR approach, where each preset corresponds to a set of algorithm parameters yielding a different joint trade-off among computational complexity (delay), visual quality, and generated bitrate, to serve different user scenarios. The Pseudo HDR pre-processing module can also be combined with encoder optimization. Our experimental results demonstrate that video encoders such as our Aurora1 AV1 encoder, facilitated by this pre-processing, can maintain the same bitrate, if not a lower one, while delivering much better visual quality to end users.
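A preset table of this kind might look like the following. The names and parameter values are hypothetical placeholders chosen for illustration, not the configuration shipped with our encoder.

```python
# Hypothetical Pseudo HDR presets: each trades computational cost (tile count,
# filtering work) against visual quality and output bitrate.
PSEUDO_HDR_PRESETS = {
    "fast":     {"tiles": (4, 4),   "clip_limit": 1.5, "temporal_alpha": 0.4, "denoise": False},
    "balanced": {"tiles": (8, 8),   "clip_limit": 2.0, "temporal_alpha": 0.2, "denoise": True},
    "quality":  {"tiles": (16, 16), "clip_limit": 2.5, "temporal_alpha": 0.1, "denoise": True},
}

def get_preset(name: str) -> dict:
    """Look up a preset, falling back to 'balanced' for unknown names."""
    return PSEUDO_HDR_PRESETS.get(name, PSEUDO_HDR_PRESETS["balanced"])
```

Finer tile grids and stronger denoising improve quality at higher cost, so a live pipeline would pick "fast" while a VOD pipeline could afford "quality".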