Sunday, March 16, 2014

My WebRTC related papers from 2013

This post gives an overview of my past and ongoing research on web-based multimedia communication. Last year I co-authored four published research papers answering these questions: How does WebRTC interoperate with existing SIP/VoIP systems, and how does SIP in JavaScript compare with other approaches? What challenges lie ahead for enterprises adopting WebRTC, and what can they do about them? How can developers take advantage of WebRTC and HTML5 to create rich internet applications in the cloud? How does emerging HTML5 technology allow better enterprise collaboration using rich application mash-ups? Read on if one or more of these questions interest you...

A Case for SIP in JavaScript

WebRTC (Web Real Time Communications) promises easy interaction with others using just your web browser on a website. SIP (Session Initiation Protocol) is the backbone of Internet communications, interconnecting many of the devices, servers and systems you use daily for your calls. Connecting WebRTC with SIP/RTP raises more questions than answers, e.g., does interoperability really work? What changes do you need in the existing devices and systems? Does the conversion happen at the server or in the browser? What are the trade-offs of doing the translation? What are the risks if you go full throttle with WebRTC in the browser? And many more. There is a need for a fair comparison to evaluate not only what is theoretically possible but also what is practically achievable with existing systems.
"This paper presents the challenges and compares the alternatives to interoperate between the Session Initiation Protocol (SIP)-based systems and the emerging standards for the Web Real-Time Communication (WebRTC). We argue for an end-point and web-focused architecture, and present both sides of the SIP in JavaScript approach. Until WebRTC has ubiquitous cross-browser availability, we suggest a fall back strategy for web developers — detect and use HTML5 if available, otherwise fall back to a browser plugin."

Taking on WebRTC in an enterprise

On one hand WebRTC enables end-users to communicate with others directly from their browsers; on the other, it opens a door to many novel communicating applications that scare your enterprise IT guy. Existing enterprise policies and tools are designed to work with established communication systems such as email and VoIP, and employ a bunch of edge devices such as firewalls, reverse proxies, email filters and SBCs (session border controllers) to protect enterprise data and interactions from the outside world. Unfortunately, session-less WebRTC media flows and, more importantly, end-to-end encrypted data channels pose the same threat to enterprise security that peer-to-peer applications did. The first reaction from your IT guy will likely be to block all such peer-to-peer flows at the firewall. There is a need to systematically understand the risks and propose potential solutions so that the pearls of WebRTC can co-exist with the chains-and-locks of enterprise security.
 "WebRTC will have a major impact on enterprise communications, as well as consumer communications and their interactions with enterprises. This article illustrates and discusses a number of issues that are specific to WebRTC enterprise usage. Some of these relate to security: firewall traversal, access control, and peer-to-peer data flows. Others relate to compliance: recording, logging, and enforcing enterprise policies. Additional enterprise considerations relate to integration and interoperation with existing communication infrastructure and session-centric telephony systems."

Building communicating web applications leveraging endpoints and cloud resource service

One problem with the growth of SIP was that there were not many applications demonstrating its capabilities beyond replacing a telephone call. This resulted in a telephony mindset dominating the evolution of both standards and deployments. At its core, WebRTC essentially does most of what SIP aims for - peer-to-peer media flows and control in the end-points. Unfortunately, the telephony influence on existing SIP devices, servers and providers has hidden that benefit from the end-user. There is a need to avoid such dominating telephony influence on WebRTC by demonstrating how easy it is to build communicating web applications without relying on existing (and expensive) VoIP infrastructure.
Secondly, the open web is threatened by closed social websites that tend to lock the user data in their ecosystem, and force the user to only use what they build, go where they exist and talk to whom they allow. This developer-focused research is an attempt to go against them, by showing a rich internet application architecture focused on application logic in the end-point and application mash-ups at the data-level. It shows my 15+ crucial web widgets, written in a few hundred lines of code each, covering a wide range of use cases from video click-to-call to automatic-call-distribution, call-queue and shared white-boards, and my web applications such as multiparty video collaboration, instant messaging/communicator, video presence and personal social wall, all written using HTML5 without depending on "the others" (imagine the tune from Lost playing here) - the others being the existing SIP, XMPP or Flash Player based systems.
"We describe a resource-based architecture to quickly and easily build communicating web applications. Resources are structured and hierarchical data stored in the server but accessed by the endpoint via the application logic running in the browser. The architecture enables deployments that are fully cloud based, fully on-premise or hybrid of the two. Unlike a single web application controlling the user's social data, this model allows any application to access the authenticated user's resources promoting application mash-ups. For example, user contacts are created by one application but used by another based on the permission from the user instead of the first application."

Private overlay of enterprise social data and interactions in the public web context

Continuing the previous research theme, this research focuses on enterprise collaboration and enterprise social interactions. The goal is to define a few architectural principles to facilitate rich user interactions, both real-time and offline in the form of a digital trail, in an enterprise environment. In particular, my proof-of-concept project described in the paper shows how to do web annotations, virtual presence, co-browsing, click-to-call from a corporate directory, and an internal context-sensitive personal wall. The basic idea is to separate the application logic from the user data, keep the user data protected within the enterprise network, use the public web as a context for storing the user interactions, and build a generic web application framework running in the browser.
"We describe our project, living-content, that creates a private overlay of enterprise social data and interactions on public websites via a browser extension and HTML5. It enables starting collaboration from and storing private interactions in the context of web pages. It addresses issues such as lack of adoption, privacy concerns and fragmented collaboration seen in enterprise social networks. In a data-centric approach, we show application scenarios for virtual presence, web annotations, interactions and mash-ups such as showing a user's presence on linked-in pages or embedding a social wall in corporate directory without help from those websites. The project enables new group interactions not found in existing social networks."

Monday, August 06, 2012

The dangers to the promise of WebRTC

WebRTC is an emerging HTML5 component to enable real-time audio and video communication from within the web browser of a desktop or mobile platform. In this article I present my critical views on what can potentially go wrong in achieving the promise of WebRTC.

The promise

The technology promises the web developers a plugin-free cross-browser standard that will allow them to create exciting new audio/video communication applications and bring them to the millions (or even billions!) of web users on the desktop and mobile platform.

The rest of the article assumes these premises:
  1. A technology is easy to invent but hard to change.
  2. Business motivation drives technology deployment.

The dangers

Following are just some of the questions I will explore in this article.
  1. What if Apple does not commit to WebRTC?
  2. What if Microsoft's implementation is different than Google's?
  3. What if there are different flavors of the API implementation?
  4. What if a web browser cannot talk to legacy SIP infrastructure without gateways?
  5. What if "they" cannot agree on a common video codec?
Let us begin by listing the existing technologies - real-time communication protocols (RTP, SIP, SDP), HTTP, web servers, Flash Player and related media servers on desktop - these are hard to change. On the other hand the emerging HTML5 technologies can be changed in the right direction if needed.

Now what are the business motivations for the browser vendors? Google is the most open, rich and web-centric browser vendor with a lot of money and a lot of motivation to drive the technology for a true browser-to-browser real-time communication. It makes sense to design a platform that enables web clients to connect to each other via a rendezvous or relay service that runs in the cloud. Improving and simplifying the tools available to web developers while hiding the complexity of protocol negotiation will improve the overall web experience, which in turn will drive business for web-centric companies such as Google. Moreover, the end-to-end encryption and open codecs are more important to Google than having to deal with legacy VoIP infrastructure -- which gets reflected in the choice of the several new RTP profiles for media path and VP8 for video.

On the other hand, Apple prefers closed systems with an integrated offering that works seamlessly across Apple platforms. Opening up FaceTime (or the related technology) to all web developers is interesting but not motivating enough, especially when the resulting web application and the WebRTC session are controlled by web developers and users instead of the iOS developer platform or the iTunes portal. This makes Apple, as a browser vendor, uncommitted to WebRTC until it proves its worth to the business.

Microsoft already has a huge communication server business that is largely SIP based. Any technology that makes it difficult to interoperate (at the media path/RTP level) with the existing SIP application servers is likely to cause trouble. They could build a gateway, but it will be hard to compete with others who can provide such services on the low cost cloud infrastructure. Hence, having the building blocks in place that will enable a web browser to talk to existing SIP devices in the media path is probably a core requirement for Microsoft. Thus, low level access to SDP and more control over media path is proposed in their draft.

For a web browser on the desktop, WebRTC is interesting in bringing a variety of new use cases and web applications. However, these use cases are already available to web users on the desktop via alternatives such as Adobe's Flash Player. It is not clear what WebRTC can contribute in terms of additional use cases for web applications on the desktop. However, for mobile platforms and smart phones it is a completely different story. Flash Player either does not work or is too difficult to use for video conferencing on mobile. HTML5 presents a great opportunity to bridge the differences between end user platforms when it comes to web applications. Thus, adding WebRTC to web browsers on mobile phones such as iPhone and Android adds a lot of value.

Let us consider a few "what if"s that pose danger to the promise of WebRTC.

What if Apple does not commit to WebRTC?

If Apple, with its iOS platform, variety of cool gadgets and frantic fan-base, decides to keep real-time communication out of reach of web developers -- people will still build such applications, albeit as standalone native applications sold over iTunes. Eventually the web developer may end up writing custom platform-dependent kludges, e.g., invoke the native API for iOS but use WebRTC for others. Once Apple realizes that web-based real-time communication is slipping from its grip, it may be too late. Moreover, other browsers besides Safari could run on iOS instead.

What if Microsoft's implementation is different than Google's?

Based on the historic browser interoperability issues and the recent announcement by Microsoft, this is quite likely. Fortunately, web developers are used to such custom browser-dependent kludges. However, WebRTC poses a greater risk of differences in operation and interoperability beyond just some HTML/Javascript hacks. Having a server side gateway for the media path to facilitate interoperability among browsers from different vendors is not in the spirit of the end-to-end principle and is not the correct motivation for WebRTC in my opinion. Moreover, the Chrome Frame extension for Internet Explorer could relieve web developers to some extent.

What if there are different flavors of the API in different browsers?

This is similar to the previous threat -- if the implementations among the browsers do not interwork, especially among the leading mobile platform vendors such as Apple, Google and Microsoft, then we are back to square one. As a web developer, what is the motivation to build three different applications using WebRTC when one can build once using Adobe Flash Player? Perhaps Adobe will revive the mobile version of Flash Player to solve the differences :)

What if a web browser cannot talk to legacy SIP infrastructure without gateways?

This issue has troubled both telecom and web-centric vendors alike. Web browsers do not want to blindly implement the legacy SIP family of standards (actually RTP) because the problems on the web are different, and need to be addressed differently. The telecom-centric vendors do not want to re-do everything to interoperate with web browsers. If they can get the web browser to talk to their existing SIP/RTP components by keeping the media path intact, it will be a big deal. Having to implement a variety of new RTP profiles in their equipment for interoperability with the web is not appealing.

This could go one of two ways - either browser vendors agree on not interoperating with legacy RTP devices, in which case we will need a gateway for interoperability, or the browser vendors implement proprietary extensions (back doors!) that allow such hooks. If it is the latter, it will get immediately picked up by the telecom-sponsored developers to build applications that use the legacy RTP mode only -- not a good thing!

What if "they" cannot agree on a common video codec?

The debate on which codec to pick is as old as the debate on WebRTC. Currently, "they" are forming two camps -- one that supports the royalty-free VP8 family of standards and the other that backs the ubiquitous H.264 family of standards (or keeping no mandatory codec). The advantage of H.264 is that much hardware already comes with fast H.264 processing, so it will be available natively on mobile platforms, whereas with VP8 you may end up spending your battery life on the codec -- until VP8 becomes available in the hardware platform. It is not clear what will happen. If VP8 gets picked as the mandatory-to-implement codec in the browser, I can see some browser vendors digressing from the standard and implementing only H.264 (the hardware-supported codec), or implementing both VP8 and H.264.

What are the other potential threats to this technology? What if the standardization work drags on for too long, until it is too late, and web users have moved on to something better! The web is driven by web applications. What a browser vendor does or what a standardization committee adopts is just a shadow of what the web applications do (or want to do). A web developer prefers to have a consistent, simple and unified API for real-time communication.

If it were in my hands, I would ban any IETF or W3C proposal without a running and open implementation. But the reality is far from this!

Saturday, March 31, 2012

Getting started with WebRTC

The emerging WebRTC standards and the corresponding Javascript APIs allow creating web-based real-time communication systems that use audio and video without depending on external plugins such as Flash Player. I got involved in a couple of interesting projects that use the pre-standard WebRTC API available in the Google Chrome Canary release. This article describes how to get started with these projects.

The first step is to download a browser that supports WebRTC, e.g.,  Google Chrome Canary. In your browser address bar, go to chrome://flags, search for Enable Media Stream and click to enable it. This enables the WebRTC APIs for your web applications running in the browser. Use this browser for getting started with the following two projects.

1. The first project is voice and video on web that enables web-based audio and video conferencing among multiple participants. It uses resource-oriented data access to move all the application logic to the client application running in HTML/Javascript, whereas the server merely acts as a data store interface. The architecture enables scalable and robust systems because of the thin server. The source code of the WebRTC integration is available in the SVN branch, webrtc. Follow the instructions on the project web page to install and configure the system, but use the link to the webrtc branch instead of trunk in SVN. After visiting the main page of the installed project, see the help tab to create a new conference. On the preferences page, make this conference persistent, check to enable WebRTC and uncheck to disable Flash Player for this conference. Visit the conference from multiple tabs in your browser, using different user names. Let the moderator participant enable the video checkboxes for various participants to see the audio and video flowing via WebRTC among the participants.

2. The second project is SIP in Javascript that creates a complete SIP endpoint running in your browser. It contains a SIP/SDP stack implemented in Javascript along with a SIP soft-phone demonstration application. It has two modes: Flash and WebRTC. In the WebRTC mode, it uses a variation of SIP over WebSocket and secure media over WebRTC. The default mode is to use Flash, which you can change to WebRTC as follows. Checkout the source code from SVN and put it on a web server. Install and run a SIP proxy server that supports SIP over WebSocket as mentioned on the project page. Now visit the page on your web server phone.html?network_type=WebRTC on two tabs of your browser. Follow the getting started instructions on that page to register on the first tab's phone with one name, and on the second with another, and try a call from the first to the second. This is an example program trace at the proxy server from a past experiment on how the WebRTC capabilities are negotiated between the two phones using SIP over WebSocket. In addition to the network_type, you can also supply additional parameters on the page URL to pre-configure the phone, e.g., username, displayname, outbound_proxy, target_value. Example links are available to configure as alice and bob to connect to the web and SIP server on localhost, assuming your webserver is running locally on port 8080 and SIP server on websocket port 5080. Just open the two URLs in the two tabs, click on Register on each, make a call from one to another by clicking on the Call button, accept the call from another by clicking on the Call button, and see the audio and video flowing via WebRTC between the two phones.

Sunday, January 22, 2012

Translating H.264 between Flash Player and SIP/RTP

Our SIP-RTMP gateway as part of the rtmplite project includes the translation of packetization between Flash Player's RTMP and SIP/RTP for H.264. There are some hurdles, but it is doable! In this article I present what it takes to do such an interoperability. If you are interested in looking at the implementation, please see the _rtmp2rtpH264 and _rtp2rtmpH264 functions in the module.

Before jumping into the details, let us review some background on H.264 packetization. The encoder encodes a sequence of frames or pictures to generate the encoded stream, which is consumed by the decoder to re-create the video. The encoder generates what is called a NALU, or network abstraction layer unit. The decoder works on a single NALU and needs a sequence of NALUs to decode. Each frame can have one or more slices. Each slice can be encoded in one or more NALUs. Certain pieces of information remain the same for all or many frames. For example, the sequence parameter set (SPS) and picture parameter set (PPS) are like configuration elements that need to be sent once or only occasionally instead of with every frame or NALU. The configuration parameters apply to the encoder, whereas the decoder should be able to decode any configuration.

RTMP Payload

Flash Player 11+ is capable of capturing from the camera and encoding in H.264 to send to an RTMP stream. Each RTMP message contains a header and data (or payload), where the header contains crucial information such as the timestamp and stream identifier, and the payload contains the encoded video NALUs or actual configuration data. The format of the payload is the same as that of the F4V/FLV tag for H.264 video in an FLV file. Each RTMP message contains one frame but may contain more than one NALU. The first byte contains the encoding type, and for H.264 is either 0x17 (for intra-frame) or 0x27 (for non-intra frame). The second byte contains the packet type and is either 0x00 (configuration data) or 0x01 (picture data). The configuration data contains both SPS and PPS as described below.

rtmp-payload := enc-type[1B] | type[1B] | remaining
enc-type := is-intra[4b] | codec-type[4b]
is-intra := 1 if intra and 2 if non-intra
codec-type := 7 for H.264/AVC
If the type is configuration data then the next four bytes are the configuration version (0x01), the profile index, the profile compatibility and the level index. This is followed by one byte whose least-significant two bits determine the number of bytes used for the length of the NALU in subsequent picture data messages. For example, if the bits are 11b then it indicates 3+1=4 bytes of NALU length, and if the bits are 01b then it indicates 1+1=2 bytes of NALU length. Let's call this the length-size; possible values are 1, 2 or 4. This is followed by a byte whose least-significant 5 bits give the number of subsequent SPS blocks. Each SPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the SPS as per the H.264 specification. This is followed by a byte containing the number of subsequent PPS blocks. Each PPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the PPS as per the H.264 specification. Typically only one SPS and one PPS block are present.

remaining for config := version[1B] | profile-idc[1B]
| profile-compat[1B]
| level-idc[1B]
| length-flag[1B]
| sps-count[1B] | sps0 ...
| pps-count[1B] | pps0 ...
length-flag := 0[6b] | value[2b] where value + 1 is length-size
sps-count := 0[3b] | count[5b] where count is number of sps
pps-count := number of pps elements
sps(n) := length[2B] | sps
pps(n) := length[2B] | pps
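The configuration layout above can be sketched as a small parser in Python. This is a simplified illustration of the byte layout just described, not the actual rtmplite code; the function name and return structure are my own.

```python
import struct

def parse_avc_config(data):
    """Parse the H.264 configuration payload described above.
    `data` starts after the 2-byte enc-type/type prefix."""
    version, profile_idc, profile_compat, level_idc = data[0], data[1], data[2], data[3]
    length_size = (data[4] & 0x03) + 1            # bytes per NALU length field
    pos = 5
    sps_count, sps_list = data[pos] & 0x1f, []    # count in least-significant 5 bits
    pos += 1
    for _ in range(sps_count):                    # each SPS: 16-bit length + bits
        n = struct.unpack_from('>H', data, pos)[0]
        pos += 2
        sps_list.append(data[pos:pos+n])
        pos += n
    pps_count, pps_list = data[pos], []
    pos += 1
    for _ in range(pps_count):                    # each PPS: 16-bit length + bits
        n = struct.unpack_from('>H', data, pos)[0]
        pos += 2
        pps_list.append(data[pos:pos+n])
        pos += n
    return {'version': version, 'profile': profile_idc, 'level': level_idc,
            'length_size': length_size, 'sps': sps_list, 'pps': pps_list}
```

For example, a payload with length flag 0xff yields length-size 3+1=4, matching the rule stated earlier.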
If the type is picture data then the next three bytes contain a 24-bit number for the decoder delay value of the frame, applicable only to B-frames. The default baseline profile does not include B-frames. Thus the first five bytes of the picture data payload are like header data. This is followed by one or more NALU blocks. Each NALU block is prefixed by the length of the encoded NALU that follows. The number of bytes used to encode this length is determined by the length-size mentioned earlier. Then the NALU is encoded as per the H.264 specification.

remaining-picture := delay[3B] | nalu0 | nalu1 ...
nalu(n) := length | nalu
length := number in length-size bytes
nalu := NAL unit as per H.264
Each NALU begins with a flags byte. The flags contain one most-significant forbidden bit, 2 bits of nri (NAL reference index) and 5 least-significant bits of nal-type. There are several nal-types, such as 0x01 for non-intra regular pictures, 0x05 for intra-pictures, etc. Please see the H.264 specification for the complete list.
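The picture payload and the flags byte can be sketched in Python as follows. Again, these are illustrative helpers with hypothetical names, not rtmplite functions.

```python
def parse_rtmp_picture(data, length_size=4):
    """Split the picture payload described above into NALUs.
    `data` starts after the 2-byte enc-type/type prefix."""
    delay = int.from_bytes(data[:3], 'big')       # 24-bit decoder delay (B-frames only)
    pos, nalus = 3, []
    while pos + length_size <= len(data):
        n = int.from_bytes(data[pos:pos+length_size], 'big')
        pos += length_size
        nalus.append(data[pos:pos+n])             # length-prefixed NALU block
        pos += n
    return delay, nalus

def nal_info(nalu):
    """Decode the flags byte: forbidden[1b] | nri[2b] | nal-type[5b]."""
    return {'forbidden': nalu[0] >> 7,
            'nri': (nalu[0] >> 5) & 0x03,
            'nal_type': nalu[0] & 0x1f}
```

For instance, a NALU starting with byte 0x65 decodes to forbidden 0, nri 3 and nal-type 5 (an intra picture).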

The camera captured and encoded data in Flash Player contains three NALUs in each RTMP message -- the access unit delimiter (nal-type 0x09), the SEI timing information (nal-type 0x06) and the picture slice (nal-type 0x01 or 0x05). The Flash Player is capable of decoding other nal-types as well, and does not require the access unit delimiter or timing information NALUs for decoding. I haven't seen any support for aggregated or fragmented NALUs in the Flash Player.

RTP Payload

The RTP payload format for H.264 is specified in RFC 6184 and is typically supported in SIP-based video phones. The RTP header contains crucial information such as the payload type, the timing data and the sequence number, whereas the actual configuration and picture NALUs are sent in the payload as specified by this RFC. The first byte is the type, containing one forbidden bit, two bits of nri and 5 bits of nal-type.

nalu := nal-flags[1B] | encoded-data
nal-flags := forbidden[1b] | nri[2b] | nal-type [5b]
In addition to the base nal-types of H.264, the RFC defines new nal-types for fragmentation and aggregation. The Internet, plagued by middle-boxes, NATs and firewalls, has traditionally imposed a limit on the size of UDP packets that can be pragmatically used, and the typical MTU is around 1400-1500 bytes. The H.264 encoder is capable of generating much larger encoded frames, which therefore cannot be sent as one frame per RTP packet over UDP in many cases. On the other hand, some encoded frames may be much smaller than the MTU, incurring additional overhead for RTP headers. These small frames can be aggregated for efficiency.

Many SIP video phones configure their H.264 encoders to use multiple slice NALUs in a single frame, unlike Flash Player which generates one picture NALU per frame. Thus traditional SIP video phones can keep each encoded slice small enough to be sent in a single RTP/UDP packet without the fragmentation mechanism of RFC 6184.

When a large encoded frame is fragmented to smaller fragments, the nal-type=28 is used in the first byte of each fragment, followed by the second byte containing the actual nal-type of the frame as well as the start and end markers. This is followed by the actual encoded data. The RTP header of all these fragments contain the same timestamp value. The last fragment of the frame contains the marker set to true, whereas all the previous ones set it to false. When multiple smaller frames are aggregated, the nal-type of 24 is used in the first byte of the aggregate payload, followed by one or more NAL data. Each NAL data is prefixed by 16-bit length of the encoded NALU. There are non-trivial rules on how the nri is obtained and we refer you to the RFC for the details.

To fragment:
let encoded-data = fragment0 | fragment1 | fragment2...
encoded-data of fragment(n) := orig-nal-flags[1B] | fragment(n)
orig-nal-flags := start[1b] | end[1b] | ignore[1b]
| orig-nal-type[5b]
start := 1 if first fragment else 0
end := 1 if last fragment else 0

To aggregate:
encoded-data of aggregate := nalu0 | nalu1 | nalu2 ...
nalu(n) := length[2B] | orig-nalu(n)
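The two packetization rules above can be sketched in Python. This is a simplified illustration (the nri handling in particular is reduced to a max over the inputs; real code must follow the full rules of the RFC), with function names of my own choosing.

```python
def fragment_nalu(nalu, max_size=1400):
    """Fragment one large NALU into FU-A packets (nal-type 28)."""
    fu_indicator = (nalu[0] & 0xe0) | 28          # keep forbidden+nri, type becomes 28
    orig_type = nalu[0] & 0x1f
    body = nalu[1:]
    chunks = [body[i:i+max_size] for i in range(0, len(body), max_size)]
    packets = []
    for i, chunk in enumerate(chunks):
        fu_header = orig_type                     # original nal-type in low 5 bits
        if i == 0:
            fu_header |= 0x80                     # start bit on first fragment
        if i == len(chunks) - 1:
            fu_header |= 0x40                     # end bit on last fragment
        packets.append(bytes([fu_indicator, fu_header]) + chunk)
    return packets

def aggregate_nalus(nalus):
    """Aggregate small NALUs into one STAP-A payload (nal-type 24)."""
    nri = max((n[0] >> 5) & 0x03 for n in nalus)  # simplified nri handling
    out = bytearray([(nri << 5) | 24])
    for n in nalus:
        out += len(n).to_bytes(2, 'big') + n      # 16-bit length prefix per NALU
    return bytes(out)
```

Fragmenting a 3001-byte intra NALU with a 1400-byte limit yields three packets, with the start bit set only on the first and the end bit only on the last.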
In addition to sending the SPS and PPS packets in RTP, video phones also negotiate the configuration data via an external protocol such as SIP/SDP. Since Flash Player does not do that, we will not discuss it further.


Now that we understand the packetization of H.264 for Flash Player as well as SIP/RTP, let us go over the details of the translation process.

The configuration data is sent periodically by Flash Player before every intra-picture frame. However, SIP phones may not send the configuration data periodically. It is desirable to cache the configuration data received from both sides, and re-use it when the other side connects. The first packet sent must contain the configuration data. It is also desirable to periodically send the configuration data to both the Flash Player and SIP sides from the translator, irrespective of whether the configuration data is received periodically. In our translator we send the configuration data before every intra frame.
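The caching strategy above can be sketched like this (a hypothetical helper to illustrate the idea, not the actual rtmplite code):

```python
class ConfigCache:
    """Cache the last seen SPS/PPS and replay them before every
    intra frame, as described above."""
    def __init__(self):
        self.config = None                        # last seen (sps, pps) pair

    def on_config(self, sps, pps):
        self.config = (sps, pps)                  # remember the latest config

    def packets_for_frame(self, nalu):
        """Return the NALUs to forward for this frame, prefixing the
        cached SPS and PPS when it is an intra frame (nal-type 5)."""
        out = []
        if (nalu[0] & 0x1f) == 5 and self.config:
            out.extend(self.config)               # resend SPS then PPS first
        out.append(nalu)
        return out
```

With this in place, every intra frame carries a fresh copy of the configuration even if the source sends it only once.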

In the Flash Player to SIP/RTP direction, when configuration data is received on RTMP, it is sent in two RTP packets, one for the SPS and one for the PPS. Both use the same timestamp and set the marker to true. When picture data is received on RTMP and needs to be sent to the RTP side, it is dropped until configuration data has previously been sent to the RTP side. If the picture data is not dropped, all the NALUs are extracted. The last of the three NALUs in each RTMP message is the actual picture NALU, which is sent to the RTP side as follows. Only nal-types 1 and 5 are used, whereas others are ignored. If the NAL size is less than 1500 bytes, it is used as is in the RTP payload with the marker set to true. If the NAL size is more, it is fragmented into smaller fragments, each of size at most 1500 bytes. Multiple fragmented RTP packets are generated as per the RFC. All but the last fragment have the marker set to false. An RTP marker of true indicates the end of a frame. All the fragments use the same timestamp value.

In the SIP/RTP to Flash Player direction, the configuration data is received in multiple RTP packets and is cached by the translator. When both SPS and PPS payloads have been received from the RTP side, we are ready to start streaming to the Flash Player side. Any incoming RTP packet is put in a queue. When the last packet in the queue (the most recently received) has the marker set to true, the queue is examined and RTMP messages are created to be sent to the Flash Player side. Since Flash Player handles complete frames in each RTMP message, we need to wait until the marker is set to true so that we only send complete frames to Flash Player. If the RTMP stream is ready but we have not received the configuration data from RTP, or we have not yet sent the first intra frame to RTMP, then received packets are dropped. If no intra frames are received for 5 seconds, then we send a fast-intra-update (FIR) request to the SIP/RTP side, which triggers the SIP phone to send an intra frame.

Once we decide that we can send packets to RTMP from the received RTP queue, we divide the queue into groups of packets with the same timestamp and same nal-type values while preserving the order of the packets. If the nal-type is 5, indicating that an intra-frame is being sent to RTMP, then we send configuration data too before the actual picture data. The configuration payload format was explained earlier and contains both the SPS and PPS along with other elements. Each group of packets with the same timestamp and same nal-type is sent as a single RTMP message, in the same order, containing one or more NALUs. If the nal-type is 1 or 5, the NALU from the RTP payload is used as is in the RTMP payload with five bytes of header as explained earlier. If the nal-type is 28, indicating fragmented packets, then all the fragmented payloads are combined into a single NALU. If the nal-type is 24, indicating an aggregated packet, then it is split into individual NALU data. Then the sequence of NALUs generated from this group of packets of the same timestamp and nal-type is combined into a single RTMP payload to be sent to the Flash Player.
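The grouping and reassembly steps just described can be sketched as follows. These are simplified, hypothetical helpers (queue entries are modeled as (timestamp, nal_type, payload) tuples) rather than the actual translator code.

```python
from itertools import groupby

def group_rtp_queue(queue):
    """Group queued (timestamp, nal_type, payload) tuples into runs of the
    same timestamp and nal-type, preserving order, as described above."""
    return [list(g) for _, g in groupby(queue, key=lambda p: (p[0], p[1]))]

def defragment(fragments):
    """Reassemble FU-A fragments (nal-type 28) into one NALU."""
    fu_indicator, fu_header = fragments[0][0], fragments[0][1]
    flags = (fu_indicator & 0xe0) | (fu_header & 0x1f)  # restore original type
    return bytes([flags]) + b''.join(f[2:] for f in fragments)

def split_stap(payload):
    """Split an aggregated payload (nal-type 24) into individual NALUs."""
    pos, nalus = 1, []                            # skip the STAP-A flags byte
    while pos + 2 <= len(payload):
        n = int.from_bytes(payload[pos:pos+2], 'big')
        pos += 2
        nalus.append(payload[pos:pos+n])
        pos += n
    return nalus
```

Each resulting group of NALUs would then be concatenated, with per-NALU length prefixes, into a single RTMP picture payload.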


As mentioned in my previous article, there are a few gotchas. You must use the new-style RTMP handshake, otherwise the Flash Player will not decode/display the received H.264 stream. You must use Flash Player 11.2 (beta) or later when using "live" mode, otherwise the Flash Player does not accept multiple slice NALUs of a single frame. If audio and video are enabled, then the timestamp of video must be synchronized with the timestamp of audio sent to RTMP. Note that RTP picks a random initial timestamp for each media stream, so the audio and video RTP timestamp values are not easily correlated without RTCP or an external mechanism. You need to correlate the RTP timestamps of audio and video to the single timestamp clock of RTMP.


It is possible to do re-packetization of H.264 between Flash Player's RTMP and standard SIP/RTP without having to do actual video transcoding. This article explains the tricks and gotchas of doing so!

The implementation works between Flash Player 11.2 and a few SIP video phones such as Ekiga and Bria 3.


Wednesday, December 14, 2011

Understanding RTMFP Handshake

(Disclaimer: the protocol description here is not an official specification of RTMFP but just the protocol understanding based on the OpenRTMFP's Cumulus project as well as the IETF presentation slides.)


What is RTMFP? RTMFP (Real-Time Media Flow Protocol) allows a UDP-based low-latency end-to-end media path between two Flash Player instances. Compared to the earlier RTMP-based media path, which runs over TCP, this new protocol enables actual real-time communication on the web. Although the end-to-end media path is not always possible when certain types of NATs and firewalls are present, it is possible to do end-to-end media across most residential-type NATs. The end-to-end media path between two Flash Player instances reduces latency and improves the scalability of the service (or server infrastructure), since most heavy media traffic can be sent without going through the hosted server. The UDP transport reduces latency compared to TCP transport even if the media path is client-server.

Why is understanding RTMFP important? Unlike the earlier RTMP, the new protocol RTMFP is still closed, with no open specification available. There have been some attempts at reverse engineering the protocol for interoperability, and some official slides explaining the core logic. Understanding the wire protocol is not important if you are building Flash-based applications that work among each other. However, for applications such as a Flash-to-SIP gateway or Flash-to-RTSP translator, where you may need to interoperate between RTMFP and SIP/RTP, it is important to understand the wire protocol in detail. For a Flash-to-SIP gateway, incorporating RTMFP on the Flash side in addition to the existing RTMP will enable a low-latency UDP media path between the web user and the translator service on the Internet.

The following description is reproduced from a contribution to my RTMP server project.


An RTMFP session is an end-to-end bi-directional pipe between two UDP transport addresses. A transport address contains an IP address and port number. A session can have one or more flows, where a flow is a logical path from one entity to another via zero or more intermediate entities. UDP packets containing encrypted RTMFP data are exchanged in a session. A packet contains one or more messages. A packet is always encrypted using AES with 128-bit keys.

In the protocol description below, all numbers are in network byte order (big-endian). The | operator indicates concatenation of data. The numbers are assumed to be unsigned unless mentioned explicitly.

Scrambled Session ID

The packet format is as follows. Each packet has the first 32 bits of scrambled session-id followed by the encrypted part. Scrambling (instead of sending the raw) session-id makes it difficult, if not impossible, for middle boxes such as NATs and layer-4 packet inspectors to mangle packets. The bit-wise XOR operator is used to scramble the first 32-bit number with the subsequent two 32-bit numbers. The XOR operator makes it possible to easily unscramble.
packet := scrambled-session-id | encrypted-part

To scramble a session-id,
scrambled-session-id = a^b^c

where ^ is the bit-wise XOR operator, a is session-id, and b and c are two 32-bit numbers from the first 8 bytes of the encrypted-part.

To unscramble,
session-id = z^b^c

where z is the scrambled-session-id, and b and c are the same two 32-bit numbers from the first 8 bytes of the encrypted-part.
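Here is a quick sketch of the scrambling in Python (helper names are mine); since XOR is its own inverse, the same operation unscrambles:

```python
import struct

def scramble_session_id(session_id, encrypted_part):
    """XOR the 32-bit session-id with the first two 32-bit words
    of the encrypted part (all in network byte order)."""
    b, c = struct.unpack('>II', encrypted_part[:8])
    return session_id ^ b ^ c

def unscramble_session_id(scrambled, encrypted_part):
    """Same operation: x ^ b ^ c ^ b ^ c == x."""
    return scramble_session_id(scrambled, encrypted_part)
```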

The session-id determines which session keys are used for encryption and decryption of the encrypted part. There is one exception for the fourth message in the handshake which contains the non-zero session-id but the handshake (symmetric) session keys are used for encryption/decryption. For the handshake messages, a symmetric AES (advanced encryption standard) with 128-bit (16 bytes) key of "Adobe Systems 02" (without quotes) is used. For subsequent in-session messages the established asymmetric session keys are used as described later.


Assuming that the AES keys are known, the encryption and decryption of the encrypted-part is done as follows. For decryption, an initialization vector of all zeros (0s) is used for every decryption operation. For encryption, the raw part is assumed to be padded as described later, and an initialization vector of all zeros (0s) is used for every encryption operation. The decryption operation does not add additional padding, and the byte-sizes of the encrypted part and the raw part must be the same.

The decrypted raw-part format is as follows. It starts with a 16-bit checksum, followed by variable bytes of network-layer data, followed by padding. The network-layer data ignores the padding for convenience.
raw-part := checksum | network-layer-data | padding

The padding is a sequence of zero or more bytes where each byte is \xff. Since AES uses a 128-bit (16-byte) key, padding ensures that the size in bytes of the decrypted part is a multiple of 16. Thus, the size of the padding is always less than 16 bytes and is calculated as follows:
len(padding) = 16*N - len(network-layer-data) - 1

where N is any positive number to make 0 <= padding-size < 16

For example, if network-layer-data is 84 bytes, then padding is 16*6-84-1=11 bytes. Adding a padding of 11 bytes makes the decrypted raw-part of size 96 which is a multiple of 16 (bytes) hence works with AES with 128-bit key.
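The padding-size formula can be written directly in Python; this sketch just mirrors the 16*N - len - 1 formula above (function name is mine):

```python
def padding_length(network_layer_len):
    """Number of \xff padding bytes so that the decrypted part size
    is a multiple of 16, per the formula 16*N - len(data) - 1."""
    return (16 - ((network_layer_len + 1) % 16)) % 16
```

For the 84-byte example above this yields 11, matching the text.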


The checksum is calculated over the concatenation of network-layer-data and padding. Thus for the encoding direction you should apply the padding followed by checksum calculation and then AES encrypt, and for the decoding direction you should AES decrypt, verify checksum and then remove the (optional) padding if needed. Usually padding removal is not needed because network-layer data decoders will ignore the remaining data anyway.

The 16-bit checksum number is calculated as follows. The concatenation of network-layer-data and padding is treated as a sequence of 16-bit numbers. If the size in bytes is not even, i.e., not divisible by 2, then the last 16-bit number used in the checksum calculation has that last byte in the least-significant position (weird!). All the 16-bit numbers are added into a 32-bit sum. The most-significant 16 bits and least-significant 16 bits of this sum are added together, and any carry from that addition is added back in. The least-significant 16 bits of the result are used as the checksum.
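A Python sketch of this checksum, assuming big-endian 16-bit words as elsewhere in the protocol (function name is mine):

```python
def rtmfp_checksum(data):
    """Sum big-endian 16-bit words, with an odd trailing byte placed
    in the least-significant position, then fold the sum into 16 bits."""
    s = 0
    for i in range(0, len(data) - 1, 2):
        s += (data[i] << 8) | data[i + 1]
    if len(data) % 2:
        s += data[-1]             # odd last byte in the low position
    s = (s >> 16) + (s & 0xffff)  # add upper and lower 16-bit halves
    s += s >> 16                  # add back any remaining carry
    return s & 0xffff
```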

Network Layer Data

The network-layer data contains flags, optional timestamp, optional timestamp echo and one or more chunks.
network-layer-data = flags | timestamp | timestamp-echo | chunks ...

The flags value is a single byte containing this information: time-critical forward notification, time-critical reverse notification, whether the timestamp is present, whether the timestamp echo is present, and the initiator/responder marker. The initiator/responder marker is useful when the symmetric (handshake) session keys are used for AES, since it protects against packet loopback to the sender.

The bit format of the flags is not entirely clear, but the following applies. For the handshake messages, the flags byte is \x0b. When the least-significant 4 bits of the flags are 1101b, the timestamp-echo is present. The timestamp seems to be always present. For in-session messages, the last 4 bits are either 1101b or 1001b.
flags meaning
0000 1011 setup/handshake
0100 1010 in-session no timestamp-echo (server to Flash Player)
0100 1110 in-session with timestamp-echo (server to Flash Player)
xxxx 1001 in-session no timestamp-echo (Flash Player to server)
xxxx 1101 in-session with timestamp-echo (Flash Player to server)

TODO: looks like bit \x04 indicates whether timestamp-echo is present. Probably \x80 indicates whether timestamp is present. last two bits of 11b indicates handshake, 10b indicates server to client and 01b indicates client to server.

The timestamp is a 16-bit number that represents the time with a 4-millisecond clock. The wall clock time can be used to generate this timestamp value. For example, if the current time in seconds is tm = 1319571285.9947701, then the timestamp is calculated as
int(tm * 1000 / 4) & 0xffff = 63994
i.e., assuming a 4-millisecond clock, calculate the clock units and use the least-significant 16 bits.
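In Python this is a one-liner; the function name is mine:

```python
import time

def rtmfp_timestamp(now=None):
    """Current time in 4-millisecond units, truncated to 16 bits."""
    if now is None:
        now = time.time()
    return int(now * 1000 / 4) & 0xffff
```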

The timestamp-echo is just the timestamp value that was received in the incoming request, echoed back. The timestamp and its echo allow the system to calculate the round-trip time (RTT) and keep it up to date.

Each chunk starts with an 8-bit type, followed by the 16-bit size of the payload, followed by the payload of that many bytes. Note that \xff is reserved and not used as a chunk-type. This is useful in detecting where the network-layer-data has finished and padding has started, because padding uses \xff. Alternatively, \x00 can also be used for padding, as that is a reserved type too!
chunk = type | size | payload
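A minimal chunk parser following this layout might look like the following sketch (stopping at the reserved padding types; the function name is mine):

```python
import struct

def parse_chunks(network_layer_data):
    """Return (type, payload) pairs from the chunk layout
    type | size | payload; \xff and \x00 are reserved types
    that mark padding, so stop when one is seen."""
    chunks, i = [], 0
    while i + 3 <= len(network_layer_data):
        ctype = network_layer_data[i]
        if ctype in (0xff, 0x00):  # padding, not a real chunk
            break
        size = struct.unpack('>H', network_layer_data[i+1:i+3])[0]
        chunks.append((ctype, network_layer_data[i+3:i+3+size]))
        i += 3 + size
    return chunks
```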

Message Flow

There are three types of session messages: session setup, control and flows. The session setup is part of the four-way handshake whereas control and flows are in-session messages. The session setup contains initiator hello, responder hello, initiator initial keying, responder initial keying, responder hello cookie change and responder redirect. The control messages are ping, ping reply, re-keying initiate, re-keying response, close, close acknowledge, forwarded initiator hello. The flow messages are user data, next user data, buffer probe, user data ack (bitmap), user data ack (ranges) and flow exception report.

A new session starts with a handshake of the session setup. In the normal client-server case, the message flow is as follows:
 initiator (client)                target (server)
|-------initiator hello---------->|
|<------responder hello-----------|

Under peer-to-peer session setup case for NAT traversal, the server acts as a forwarder and forwards the hello to another connected client as follows:
 initiator (client)        forwarder (server)              target (client)
|-------initiator hello------->|
|                              |-----forwarded initiator hello----->|
|                              |<----ack---------------------------|
|<---------------------responder hello------------------------------|

Alternatively, the server could redirect to another target by supplying an alternative list of target addresses as follows:
 initiator (client)                redirector (server)                     target (client)
|-------initiator hello---------->|
|<------responder redirect--------|
|-------------initiator hello-------------------------------------------->|
|<------------responder hello---------------------------------------------|

Note that the initiator, target, forwarder and redirector are just roles for session setup whereas client and server are specific implementations such as Flash Player and Flash Media Server, respectively. Even a server may initiate an initiator hello to a client in which case the server becomes the initiator and client becomes the target for that session. This mechanism is used for the man-in-middle mode in the Cumulus project.

The initiator hello may be forwarded to another target, but the responder hello is sent directly. After that, the initiator initial keying and the responder initial keying are exchanged (directly between the initiator and the target) to establish the session keys for the session between the initiator and the target. The four-way handshake prevents denial-of-service (DoS) via SYN flooding and port scanning.

As mentioned before the handshake messages for session-setup use the symmetric AES key "Adobe Systems 02" (without the quotes), whereas in-session messages use the established asymmetric AES keys. Intuitively, the session setup is sent over pre-established AES cryptosystem, and it creates new asymmetric AES cryptosystem for the new session. Note that a session-id is established for the new session during the initial keying process, hence the first three messages (initiator-hello, responder-hello and initiator-initial-keying) use session-id of 0, and the last responder-initial-keying uses the session-id sent by the initiator in the previous message. This is further explained later.

Message Types

The 8-bit type values and their meaning are shown below.
type meaning
\x30 initiator hello
\x70 responder hello
\x38 initiator initial keying
\x78 responder initial keying
\x0f forwarded initiator hello
\x71 forwarded hello response

\x10 normal user data
\x11 next user data
\x0c session failed on client side
\x4c session died
\x01 causes response with \x41, reset keep alive
\x41 reset times keep alive
\x5e negative ack
\x51 some ack
TODO: most likely the bit \x01 indicates whether the transport-address is present or not.

The contents of the various message payloads are described below.

Variable Length Data

The protocol uses variable length data and variable length number. Any variable length data is usually prefixed by its size-in-bytes encoded as a variable length number. A variable length number is an unsigned 28-bit number that is encoded in 1 to 4 bytes depending on its value. To get the bit-representation, first assume the number to be composed of four 7-bit numbers as follows
number = 0000dddd dddccccc ccbbbbbb baaaaaaa (in binary)
where A=aaaaaaa, B=bbbbbbb, C=ccccccc, D=ddddddd are the four 7-bit numbers

The variable length number representation is as follows, where the most-significant bit of each byte is 1 if another byte follows and 0 on the last byte:
0aaaaaaa (1 byte)  if B = C = D = 0
1bbbbbbb 0aaaaaaa (2 bytes) if C = D = 0 and B != 0
1ccccccc 1bbbbbbb 0aaaaaaa (3 bytes) if D = 0 and C != 0
1ddddddd 1ccccccc 1bbbbbbb 0aaaaaaa (4 bytes) if D != 0

Thus a 28-bit number is represented as 1 to 4 bytes of variable length number. This mechanism saves bandwidth since most numbers are small and fit in 1 or 2 bytes, while still allowing values up to 2^28-1.
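An encoder/decoder pair for this scheme can be sketched as follows, assuming the usual continuation-bit convention where the most-significant bit is set on every byte except the last (function names are mine):

```python
def encode_vlu(n):
    """Encode an unsigned 28-bit number as 1-4 bytes, 7 bits per byte,
    most-significant group first; the high bit marks continuation."""
    assert 0 <= n < (1 << 28)
    groups = [(n >> s) & 0x7f for s in (21, 14, 7, 0)]
    while len(groups) > 1 and groups[0] == 0:
        groups.pop(0)  # drop leading all-zero 7-bit groups
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def decode_vlu(data):
    """Decode a variable length number; return (value, bytes_consumed)."""
    n = 0
    for i, b in enumerate(data):
        n = (n << 7) | (b & 0x7f)
        if not b & 0x80:
            return n, i + 1
    raise ValueError('truncated variable length number')
```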


The initiator-hello payload contains an endpoint discriminator (EPD) and a tag. The payload format is as follows:
initiator-hello payload = first | epd | tag

The first field (8-bit) is of unknown purpose. The next field, epd, is variable length data that contains an epd-type (8-bit) and epd-value (the remaining bytes). Note that any variable length data is prefixed by its length as a variable length number. The epd is typically less than 127 bytes, so an 8-bit length is enough. The tag is fixed 128-bit (16 bytes) randomly generated data. The fixed-size tag does not encode its length.
epd = epd-type | epd-value

The epd-type is \x0a for a client-server session and \x0f for a peer-to-peer session. If the epd-type is peer-to-peer, then the epd-value is the peer-id, whereas if the epd-type is client-server, the epd-value is the RTMFP URL that the client connects to. The initiator sets the epd-value such that the responder can tell whether the initiator-hello is for them, but an eavesdropper cannot deduce the identity from that epd. This is done, for example, using a one-way hash function of the identity.

The tag is chosen randomly by the initiator, so that it can match the response against the pending session setup. Once the setup is complete the tag can be forgotten.

When the target receives the initiator-hello, it checks whether the epd is for this endpoint. If it is for "another" endpoint, the initiator-hello is silently discarded to avoid port scanning. If the target is an introducer (server), then it can respond with a responder-hello, or redirect/proxy the message with forwarded-initiator-hello to the actual target. In the general case, the target responds with a responder-hello.

The responder-hello payload contains the tag echo, a new cookie and the responder certificate. The payload format is as follows:
responder-hello payload = tag-echo | cookie | responder-certificate

The tag echo is the same as the original tag from the initiator-hello, but encoded as variable length data with a variable length size prefix. Since the tag is 16 bytes, the size fits in 8 bits.

The cookie is randomly and statelessly generated variable length data that lets the responder accept the next message only if this message was actually received by the initiator. This eliminates "SYN flood"-style attacks: if a server had to store the initial state, then an attacker could exhaust the state memory slots by flooding with bogus initiator-hello messages, preventing further legitimate initiator-hello messages. The SYN flooding attack is common against TCP servers. The length of the cookie is 64 bytes, but it is stored as variable length data.

The responder certificate is also a variable length data containing some opaque data that is understood by the higher level crypto system of the application. In this application, it uses the diffie-hellman (DH) secure key exchange as the crypto system.

Note that multiple EPDs might map to a single endpoint, and the endpoint has a single certificate. A server that does not care about man-in-the-middle attacks or does not create a secure EPD can generate a random certificate to be returned as the responder certificate.
certificate = \x01\x0A\x41\x0E | dh-public-num | \x02\x15\x02\x02\x15\x05\x02\x15\x0E

Here the dh-public-num is a 64-byte random number used for DH secure key exchange.

The initiator does not open another session to the same target identified by the responder certificate. If it detects that it already has an open session with the target, it moves the new flow requests to the existing open session and stops opening the new session. The responder has not stored any state, so it does not need to care. (In our implementation we do store the initial state for simplicity, which may change later.) This is one of the reasons why the API is flow-based rather than session-based, with the session handled implicitly at the lower layer.

If the initiator wants to continue opening the session, it sends the initiator-initial-keying message. The payload is as follows:
initiator-initial-keying payload = initiator-session-id | cookie-echo
| initiator-certificate | initiator-component | 'X'

Note that the payload is terminated by \x58 (or 'X' character).

The initiator picks a new session-id (a 32-bit number) to identify this new session, and uses it to demultiplex subsequent received packets. The responder uses this initiator-session-id as the session-id to form the scrambled session-id in packets sent in this session.

The cookie-echo is the same variable length data that was received in the responder-hello message. This allows the responder to relate this message with the previous responder-hello message. The responder will process this message only if it thinks that the cookie-echo is valid. If the responder thinks that the cookie-echo is valid except that the source address has changed since the cookie was generated it sends a cookie change message to the initiator.

In this DH crypto system, p and g are publicly known. In particular, g is 2, and p is a 1024-bit number. The initiator picks a new random 1024-bit DH private number (x1) and generates 1024-bit DH public number (y1) as follows.
y1 = g ^ x1 % p

The initiator-certificate is understood by the crypto system and contains the initiator's DH public number (y1) in the last 128 bytes.

The initiator-component is understood by the crypto system and contains an initiator-nonce to be used in DH algorithm as described later.

When the target receives this message, it generates a new random 1024-bit DH private number (x2) and generates 1024-bit DH public number (y2) as follows.
y2 = g ^ x2 % p

Now that the target knows the initiator's DH public number (y1), it generates the 1024-bit DH shared secret as follows.
shared-secret = y1 ^ x2 % p
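The whole exchange is easy to demonstrate with Python's three-argument pow(); the small modulus below is purely for illustration, whereas RTMFP uses g = 2 with a fixed 1024-bit prime p:

```python
import secrets

# Toy parameters for the demo only; RTMFP uses g = 2 and a 1024-bit prime.
g, p = 2, 0xFFFFFFFB  # 2**32 - 5, a prime

x1 = secrets.randbelow(p - 2) + 1  # initiator's private number
x2 = secrets.randbelow(p - 2) + 1  # target's private number

y1 = pow(g, x1, p)                 # initiator's public number
y2 = pow(g, x2, p)                 # target's public number

# Each side combines its own private number with the peer's public number.
shared1 = pow(y2, x1, p)           # computed by the initiator
shared2 = pow(y1, x2, p)           # computed by the target
assert shared1 == shared2          # both arrive at the same shared secret
```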

The target generates a responder-nonce to be sent back to the initiator. The responder-nonce is as follows.
responder-nonce = \x03\x1A\x00\x00\x02\x1E\x00\x81\x02\x0D\x02 | responder's DH public number

The peer-id is the 256-bit SHA256 (hash) of the certificate. At this time the responder knows the peer-id of the initiator from the initiator-certificate.

The target picks a new 32-bit responder's session-id number to demultiplex subsequent packets for this session. At this time the server creates a new session context to identify the new session. It also generates the asymmetric AES keys to be used for this session using the shared-secret and the initiator and responder nonces as follows.
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

The decode key is used by the target to AES-decode incoming packets containing this responder's session-id. The encode key is used by the target to AES-encode outgoing packets to the initiator's session-id. Only the first 16 bytes (128 bits) of each HMAC result are used as the actual AES encode and decode keys.

The target sends the responder-initial-keying message back to the initiator. The payload is as follows.
responder-initial-keying payload = responder session-id | responder's nonce | 'X'

Note that the payload is terminated by \x58 (or 'X' character). Note also that this handshake response is encrypted using the symmetric (handshake) AES key instead of the newly generated asymmetric keys.

When the initiator receives this message it also calculates the AES keys for this session.
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

As before, only the first 16 bytes (128-bits) are used as the AES keys. The encode key of initiator is same as the decode key of the responder and the decode key of the initiator is same as the encode key of the responder.
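Assuming that the first argument of HMAC-SHA256 above is the HMAC key, the key symmetry can be verified with Python's hmac module (helper names and sample byte strings are mine):

```python
import hmac, hashlib

def hmac_sha256(key, msg):
    """HMAC-SHA256(key, msg), assuming the first argument is the key."""
    return hmac.new(key, msg, hashlib.sha256).digest()

def session_keys(shared_secret, near_nonce, far_nonce):
    """Derive one side's (encode, decode) AES keys; near_nonce is this
    side's own nonce. Only the first 16 bytes of each HMAC are used."""
    encode = hmac_sha256(shared_secret,
                         hmac_sha256(far_nonce, near_nonce))[:16]
    decode = hmac_sha256(shared_secret,
                         hmac_sha256(near_nonce, far_nonce))[:16]
    return encode, decode

# One side's encode key must equal the other side's decode key.
i_enc, i_dec = session_keys(b'ss', b'initiator-nonce', b'responder-nonce')
r_enc, r_dec = session_keys(b'ss', b'responder-nonce', b'initiator-nonce')
assert i_enc == r_dec and i_dec == r_enc
```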

When a server acts as a forwarder, it receives an incoming initiator-hello and sends a forwarded-initiator-hello in an existing session to the target. The payload is as follows.
forwarded initiator hello payload := first | epd | transport-address | tag

The first 8-bit value is \x22. The epd value is same as that in the initiator-hello -- a variable length data containing epd-type and epd-value. The epd-type is \x0f for a peer-to-peer session. The epd-value is the target peer-id that was received as epd-value in the initiator-hello.

The tag is echoed from the incoming initiator-hello and is a fixed 16 bytes value.

The transport address contains a flag for indicating whether the address is private or public, the binary bits of IP address and optional port number. The transport address is that of the initiator as known to the forwarder.
transport-address := flag | ip-address | port-number

The flag is an 8-bit number with the most significant bit set to 1 if the port-number is present, otherwise 0. The least significant two bits are 10b for a public IP address and 01b for a private IP address.

The ip-address is either 4-bytes (IPv4) or 16-bytes (IPv6) binary representation of the IP address.

The optional port-number is a 16-bit number and is present when the flag indicates so.
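Putting the three fields together, a parser sketch might look as follows (names are mine; the total length of the blob is assumed known so the 4- vs 16-byte IP size can be inferred):

```python
import socket, struct

def parse_transport_address(data):
    """Parse flag | ip-address | optional port-number from a
    transport-address blob of known total length."""
    flag = data[0]
    has_port = bool(flag & 0x80)       # MSB set means port present
    is_public = (flag & 0x03) == 0x02  # 10b public, 01b private
    body = data[1:-2] if has_port else data[1:]
    family = socket.AF_INET if len(body) == 4 else socket.AF_INET6
    ip = socket.inet_ntop(family, bytes(body))
    port = struct.unpack('>H', data[-2:])[0] if has_port else None
    return ip, port, is_public
```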

The server then sends a forwarded-hello-response message back to the initiator with the transport-address of the target.
forwarded-hello-response = transport-address | transport-address | ...

The payload is basically one or more transport addresses of the intended target, with the public address first.

After this the initiator client directly sends subsequent messages to the responder, and vice-versa.

A normal-user-data message type is used to deal with any user data in the flows. The payload is shown below.
normal-user-data payload := flags | flow-id | seq | forward-seq-offset | options | data

The flags, an 8-bit number, indicate fragmentation, options-present, abandon and/or final. The following table shows the meaning of the bits from most significant to least significant.
bit   meaning
0x80 options are present if set, otherwise absent
0x20 with beforepart
0x10 with afterpart
0x02 abandon
0x01 final

The flow-id, seq and forward-seq-offset are all variable length numbers. The flow-id is the flow identifier. The seq is the sequence number. The forward-seq-offset is used for partially reliable in-order delivery.

The options are present only when the flags indicate so using the most significant bit as 1. The options are as follows.

TODO: define options

The subsequent data in the fragment may be sent using next-user-data message with the payload as follows:
next-user-data := flags | data

This is just a compact form of the user data, used when multiple user data messages are sent in the same packet. The flow-id, seq and forward-seq-offset are implicit: the flow-id is the same, and subsequent next-user-data messages have incrementing seq and forward-seq-offset. Options are not present. A single packet never contains data from more than one flow, to avoid head-of-line blocking and to enable priority inversion in case of problems.


Will update this article in future:
- Fill in the description of the remaining message flows beyond handshake.
- Describe the man-in-middle mode that enables audio/video flowing through the server.

Wednesday, December 07, 2011

Three Problems in Interoperating with H.264 of Flash Player

H.264 decoding has been part of Flash Player since version 9, but H.264 encoding was only recently added in version 11. Once the Flash Player 11 beta was out, I started looking into integrating video translation in the SIP-RTMP gateway project. For a Flash-to-Flash video conference you do not need to understand the problems related to H.264 in Flash Player, because everything is taken care of behind the scenes by Flash Player. Adding H.264 support in the flash-videoio project was relatively straightforward. However, if you are building your own translator to interoperate video between Flash Player and some other application, you will need to understand these problems.

1) The first problem is that Flash Player doesn't enable H.264 even for decoding if the RTMP connection does not use the new-style "secure" handshake. In older versions, handshaking with bytes containing zeros worked, but not when using H.264. Eventually I found out about this from an open-source-flash (osflash) forum post and incorporated it in my gateway.

2) The H.264 encoder generates sequence headers (called SPS and PPS) which are essential for decoding the rest of the video data packets. The same is true of the AAC audio codec. In particular, in live H.264 publish mode, Flash Player generates periodic SPS/PPS packets so that another Flash Player (or SIP phone) can join the call later and still be able to start decoding the stream. However, some existing SIP video phones generate the sequence packets only once, at the beginning. The SIP-RTMP gateway needed to be modified to cache the sequence packets received from the non-Flash Player client and re-send them with the correct timestamp to a Flash Player client that joins the stream late.

3) It looks like Flash Player 11.0 changed something related to buffering of live streams, which causes problems if the SIP side generates multiple slice NALUs (primitive data units in H.264) per frame. Flash Player itself generates one NALU per frame; however, some existing SIP video phones (e.g., Bria 3) generate old-style multiple slices per frame with one NALU per slice, which cannot be decoded and displayed in Flash Player 11 in live mode. This is not a problem for buffered playback though. (Update on 12/12/2011 -- I can verify that this bug has been fixed in Flash Player, and a video call now works fine between Bria 3 and Flash Player via my SIP-RTMP gateway.)

The Ekiga SIP phone uses the new-style RTP mechanism of fragmenting a full H.264 frame instead of using multiple slices in the H.264 encoding. This can easily be translated for Flash Player and works with my SIP-RTMP gateway. However, Ekiga has another problem: it incorrectly interprets the RTP timestamp of the received stream, which makes it play the stream much slower.

Monday, December 05, 2011

The Philosophy of Open Source

I recently read a book by Henrik Ingo, "Open Life: The Philosophy of Open Source". I strongly recommend that software developers as well as technical managers read it. Here I present some excerpts from the book that I found very interesting.

"If a buyer is willing to pay a lot for it, then a cheap product can be sold at a high price... It is not stupid to ask a high price, but it is to pay it."

"The law of supply and demand can lead to situations that seem strange when common sense is applied to them. ... The oil is no different to the oil that was on sale at a considerably cheaper price just the day before... When supply goes down, the price goes up - even if all else remains equal."

"Often, the kind of stuff branded a trade secret can also be absurdly insignificant, but the important thing is that they don't tell others about it. Today's companies are at least as interested in the things they don't do as the things they pretend to be doing and producing... There's an ominous sense that much of what we do is done with a logic of mean-spiritedness, whether it is in business or in our everyday lives!"

"In a word, Europe's farming policy is based on mean-spiritedness. The subsidies policy is based on farmers agreeing not to produce more food than their agreed quota (to keep the supply low and prices high)."

"The logic of mean-spiritedness that follows from the law of supply and demand, can also be found in all fields of commerce where there is any co called 'immaterial property', including IT, music, film, and other kinds of entertainment, but the most glaring examples of it occur within the world of computers."

"These three demands -- features, quality, and deadline -- would build certain tension into any project. If, for instance, the schedule is too tight, there may not be enough time to include all the features you want. But if a project manager leans on his team, demanding that all the features are included and the deadline be met, then they are compelled to do a rushed job and, inevitably, quality suffers. ... The Open Source community's no-deadlines principle makes excellent sense, and is probably one of the reasons Open Source programs are so successful... One of the most frequently asked questions at the time was, 'When will the next version of Linux be released?' Linus had a stock answer, which was always, 'When it's done.'"

"Why do people do things? The first reason is survival. The second reason is to have social life. And the third reason is that people do things for fun. In that order... Since we work to have fun, to enjoy it, then why do we drive ourselves into the ground trying to meet artificial deadlines?"

"Usually, the vision and business strategies which guide a company are created in the upper echelons of management, after which it's up to the employees to do whatever the boss requires of them...But the principle of 'do whatever you like' would suggest that the management should quit producing the whole vision and business strategies, and focus instead on making it possible for employees to realize their own vision as best they can. (Unfortunately) For many managers such a concept would seem totally alien."

"The lazier the programmer, the more code he writes. .. Typing is too arduous for him, so he writes the code for a word processing program... Because it's too much effort to print out a letter and take it to the postbox, he writes the code for e-mail... So, laziness is a programmer's prime virtue."

"It's not healthy for one's central motivation to be hatred and fear. And what if one day Linux did manage to bring down Microsoft? Would life then lose its meaning? In order to energize themselves, would the programmers then have to find some new and fearful threat to compete against?"

"Since the beginning of hacking, Open Source hackers have always made programs to suit their own needs. ... As a client, the Federal Republic of Germany accepted this logic, and they aren't likely to have any reason to complain. Not only did they get what they wanted, they got a high-quality solution, they got it cheap, and they got it fast. What could be unfair about that?"

"An interesting situation -- IBM had to keep developing Eclipse; yet, financially, investing in it was a bad idea. The solution, of course, was Open Source."

"A company that has calculated its tender openly is much easier to trust. If I were to receive an honest tender of 1,000,000 from a company that operated with open principles, and the tender from a closed company came in at 999,500, I am likely to laugh at the latter and accept the former."

"I once read somewhere about a study which showed that about 20 percent of ants in an anthill do totally stupid things, such as tear down walls the other ants have just built, or take recently gathered food and stow it where none of them will ever find it, or do other things to sabotage everything the other ants so diligently try to achieve. The theory is that these ants don't do these things out of malice but simply because they're stupid... Critics of Open Source projects claim that their non-existent hierarchy and lack of organization leads to inefficiency... If a number of people do some stupid things, we make a rule to say it mustn't be done. Then we need a rule that says everybody has to read the rules. Before long, we need managers and inspectors to make sure people read and follow the rules and that nobody does anything stupid, even by mistake. Finally, the organization has a large group of people who spend time thinking up and writing rules, and enforcing them. And those not involved in doing, are primarily concerned with not breaking the rules...However, Linux and Wikipedia prove the opposite is true... This is particularly true when you factor in that not all planners (managers) are all that smart. Which means organizations risk having their entire staff doing something really inane, because that's what somebody planned. So, it seems better to have a little overlapping and lack of planning, because at least you have better odds for some of the overlapping activities actually making sense..."

Saturday, December 03, 2011

Internet Video Communication: Past, Present and Future

I gave a presentation last month titled Hello 1 2 3, can you hear see me now? highlighting my point of view on the origins of Internet video communication technologies we see today. The full text of the presentation can be found at Internet video communication: past, present and future.

Modern video communication systems have roots in several technologies: transporting video over phone lines, using multicast on Internet2's Mbone, adding video to voice-over-IP (VoIP), and adding interactivity in existing streaming applications. Although the Internet telephony and multimedia communication protocols have matured over the last fifteen years, they are largely being used for interconnectivity among closed networks of telecom services. Recently, the world wide web has evolved as a popular platform for everything we do on the Internet including email, text chat, voice calls, discussions, enterprise applications and multi-party collaboration. Unfortunately, there is a disconnect between the web and traditional Internet telephony protocols as they have ignored the constraints and requirements of each other. Consequently, Adobe's Flash Player is being used as a web browser plugin by many developers for voice and video calls over the web.

Learning from the mistakes of the past and knowing where we stand at present will help us build the Internet video communication systems of the future. I present my point of view on the evolution, challenges and mistakes of the past, and, moving forward, describe the challenges in bridging the gap between web and VoIP. I highlight my contributions at various stages in the journey of Internet audio/video communication protocols.

Saturday, August 20, 2011

Flash-based audio and video communications in the cloud

Internet telephony and multimedia communication protocols have matured over the last fifteen years. Recently, the web has evolved into a popular platform for everything we do on the Internet, including email, text chat, voice calls, discussions, enterprise apps and multi-party collaboration. Unfortunately, there is a disconnect between web and traditional Internet telephony protocols, as they have ignored each other's constraints and requirements. Consequently, the Flash Player is being used as a web browser plugin by many developers for web-based voice and video calls. We describe the challenges of video communication using a web browser, present a simple API using a Flash Player application, show how it supports a wide range of web communication scenarios in the cloud, and describe how it can interoperate with Session Initiation Protocol (SIP)-based systems. We describe both the advantages and challenges of Flash Player-based communication applications. The presented API could guide future work on communication-related web protocol extensions.

More details are available in our white-paper. The associated software and example use cases are available as flash-videoio and siprtmp projects. The white-paper also serves as the architecture and design document of these projects.

Voice and Video Communications on Web

I co-authored and presented a paper on "SIP APIs for voice and video communications on the web" at IPTcomm 2011. The paper compares various alternative architectures, and presents the components of our ongoing project at IIT, Chicago. We are open to sponsorship of the project to further continue its R&D work. Please feel free to get in touch with me or Prof. Davids if you are interested in sponsoring student projects in her lab related to this technology.

The paper and the presentation slides are available. The project page, open source code, and free demonstration page are also available.

Abstract: Existing standard protocols for the web and Internet telephony fail to deliver real-time interactive communication from within a web browser. In particular, the client-server web protocol over reliable TCP is not always suitable for the end-to-end low-latency media path needed for interactive voice and video communication. To solve this, we compare the available platform options using existing technologies: modifying the web programming language and protocol, using an existing web browser plugin, and a separate host-resident application that the web browser can talk to. We argue that using a separate application as an adaptor is a promising short-term as well as long-term strategy for voice and video communications on the web. Our project aims at developing the open technology and sample implementations for web-based real-time voice and video communication applications. We describe the architecture of our project including (1) a RESTful web communication API over HTTP inspired by SIP message flows, (2) a web-friendly set of metadata for session description, and (3) a UDP-based end-to-end media path. All other telephony functions reside in the web application itself and/or in web feature servers. The adaptor approach allows us to easily add new voice and video codecs and NAT traversal technologies such as Host Identity Protocol. We want to make web-based communication accessible to millions of web developers, maximize the end user experience and security, and preserve the huge global investment in and experience from SIP systems while adhering to web standards and development tools as much as possible. We have created an open source prototype that allows you to freely use the conference application by directing a browser to the conference URL.

Thursday, June 16, 2011

A Proposal for Reference Implementation Repository of SIP-related RFCs

One of the root causes of non-interoperable implementations is misinterpretation of the specification. A number of people have claimed that SIP has become complicated and has failed to deliver on its promise of mix-and-match interoperability. There are two main reasons: (a) the number of SIP-related RFCs and drafts is growing faster than a developer or product life-cycle can keep up with, and (b) many of the RFCs and drafts are not supported by an open implementation, which results in misinterpretation of some aspects of the specification by programmers. The job of a SIP programmer is to (1) read the RFC or draft for SIP or its extensions, (2) understand the logic and figure out how it fits in the big picture or relates to the existing SIP source code, (3) come up with an object-oriented class diagram, the classes' properties and pseudo-code for their methods, and finally (4) implement the classes and methods.

Clearly the text in RFCs and drafts cannot be as unambiguous as the real source code of a computer program, so many programmers may read and implement some features differently, resulting in non-interoperable implementations. Having readily available pseudo-code for SIP and many of its extensions relieves the programmer of error-prone step (2) above, and resolves any misinterpretation at an early stage. The vendor or provider pays a huge cost for a programmer's misinterpretation of the specification.

This project proposal is to keep an open and public repository of reference implementation of RFC 3261 and other SIP-related extensions. If this repository is maintained by public bodies such as SIPForum and open source community, it will enable easy access to developers and enable better interoperability of new extensions.

The goal of this effort will be to encourage submission of reference implementations by RFC and Internet Draft authors. In case of any ambiguity, the clarification will be applied not only to the specification but also to the reference implementation.

If we use a very high-level language such as Python, then the reference implementation essentially also serves as pseudo-code, which can be ported to other programming languages. The goal is not to get involved in the syntax of a particular programming language, but just to express the ideas more formally to prevent misinterpretation of the specification. Perhaps if Python is not suitable, then a similar high-level language syntax can be defined.

This will greatly simplify the job of a programmer and, in the long term, will result in more interoperable and robust products seamlessly supporting new SIP extensions and features. Programmers will have fewer things to worry about, and hence can write more accurate code in a short time. From a specification author's point of view, it will encourage him/her to write a more solid and implementable specification without ambiguity, and to provide the pseudo-code in the draft. From a reviewer's point of view, one can easily judge the complexity of various alternatives or features, e.g., one can say that adding the extension 'foo' is just 10 lines of pseudo-code on top of the base SIP reference implementation.

It will help RFC and draft authors see the complexity and implementation aspects of their proposals. Sometimes an Internet draft proposes multiple solutions without any details on them, partially due to the lack of implementation and complexity evaluation of the various approaches. With a reference implementation and pseudo-code repository, an author can provide a patch to the existing code to evaluate the complexity of the proposal.

A few years ago I wrote a tool to annotate software source code with the RFC/draft text, so that when you are reading a class or method in a source code file, you can quickly know which part of the RFC/draft it implements. Please see an example here and here. Such annotations in a reference implementation will help in correlating the RFC/draft with the actual implementation.
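To make the annotation idea concrete, here is a minimal, hypothetical sketch in Python: each statement carries a comment pointing at the part of the RFC it implements, so a reader can move between the code and the specification. The function and its details are illustrative, not taken from the annotated projects.

```python
def parse_status_line(line):
    """Parse a SIP response Status-Line per RFC 3261, Section 7.2:
    Status-Line = SIP-Version SP Status-Code SP Reason-Phrase"""
    version, code, reason = line.split(' ', 2)   # ABNF in RFC 3261, Section 25
    if version != 'SIP/2.0':                     # Section 7.2: version check
        raise ValueError('unsupported SIP version: ' + version)
    return version, int(code), reason

# Example: parse_status_line('SIP/2.0 200 OK') returns ('SIP/2.0', 200, 'OK')
```

Even such a small fragment shows the benefit: a reviewer reading the version check can jump straight to Section 7.2 instead of searching the whole RFC.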

If there is wide support for this proposal, we can raise it with the SIPForum or other bodies and help bootstrap the repository of reference implementations of a few SIP-related RFCs. Then we can invite contributions from the community and from RFC/draft authors towards completing the implementations. Please post a comment to let us know what you think.

Sunday, June 12, 2011

RESTful communication over WebSocket

This article shows how to implement generic resource-oriented communication on top of a persistent bidirectional channel such as WebSocket. This is a continuation of my previous article on REST and SIP [1] and provides more concrete thoughts, because I now have an actual implementation of part of this in my web conferencing application. Other people have commented on the idea of REST on WebSocket [2]. (Using the term RESTful, which is inherently stateless, is confusing with a stateful WebSocket transport; a more appropriate title for this article would be "Resource oriented and event-based communication over WebSocket".)

Following are the basic principles of such a mechanism.
  1. Enable resource-oriented (instead of RPC-style) communication.
  2. Enable asynchronous notification when a resource (or its child resource) changes.
  3. Enable asynchronous event subscribe, publish, and notification.
  4. Enable Unix file system style access control on resources.
  5. Enable the concept of transient vs persistent resources.
Consider my favorite example, a web conferencing application. The logged-in users list is represented as the resource /login relative to the hosting domain, and the calls list as /call. If the provider supports the concept of registered subscribers, those can be at /user. For example, /user/ can be my user profile.

Now let us look at how these five principles apply in this example.

1) Enable resource-oriented (instead of RPC-style) communication.

Every resource has its own representation, e.g., in JSON or XML. For example, /login/ can be {"name": "Bob Smith", "email": "", "has-video": true, "has-audio": true}. The client-server communication to access these resources can happen over HTTP using standard RESTful requests, or over WebSocket.

Over WebSocket, the client sends a request of the form '{"method":"PUT","resource":"/login/", "type":"application/json","entity":{"name":"Bob Smith", ...}}' to log in. The server treats this the same as a RESTful PUT request received over plain HTTP.

A resource-oriented (instead of RPC-style) communication allows us to keep all the business logic in the client, which uses the server only as a data store. The standard HTTP methods allow access to such data, e.g., POST to create, PUT to update, GET to read and DELETE to delete. POST is special in that it must return the resource identifier of the newly created resource. For example, when a client does POST /call to create a new call, the server returns {"id": "conf123"} to indicate that the new resource identifier is "conf123" relative to /call and can be accessed at "/call/conf123".
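The dispatch described above can be sketched as follows. This is a minimal illustration, not the actual restserver.php logic: the in-memory store, the shape of the reply objects, and the child-id generation are all my assumptions.

```python
import json
import uuid

# In-memory store: resource path -> JSON entity (a stand-in for the database).
resources = {}

def handle_request(message):
    """Dispatch one JSON-framed request received over the WebSocket."""
    request = json.loads(message)
    method, path = request['method'], request['resource']
    if method == 'PUT':        # create or update a resource at a known path
        resources[path] = request['entity']
        return {'code': 'success'}
    if method == 'GET':        # read a resource's representation
        if path not in resources:
            return {'code': 'failed', 'reason': 'not found'}
        return {'code': 'success', 'type': 'application/json',
                'entity': resources[path]}
    if method == 'DELETE':     # delete a resource
        resources.pop(path, None)
        return {'code': 'success'}
    if method == 'POST':       # create a child; the server picks the id
        child = uuid.uuid4().hex[:8]
        resources[path + '/' + child] = request.get('entity')
        return {'code': 'success', 'id': child}
    return {'code': 'failed', 'reason': 'unknown method'}

# A client joining the calls list, as in the POST /call example above:
reply = handle_request(json.dumps(
    {'method': 'POST', 'resource': '/call', 'type': 'application/json',
     'entity': {'name': 'Bob Smith'}}))
```

The same handler can back both transports: an HTTP front-end would call it once per request, while the WebSocket front-end calls it once per received frame.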

2) Enable asynchronous notification when a resource (or its child resource) changes.

Many web applications including web conferencing need the notion of asynchronous notifications, e.g., when a user is online, or a user joins/leaves a call. Traditionally, Internet communication has used protocols such as SIP and XMPP for asynchronous notifications. With the advent of WebSocket (and the popular project), it is possible to implement a persistent client-server connection for asynchronous notifications and messages within the web browser.

In this mechanism, a generic notification architecture is applied to resources. A new method named "SUBSCRIBE" is used to subscribe to any resource. A subscriber receives notification whenever the resource or its immediate children are changed (created, updated or deleted). For example, a conference participant sends the following over WebSocket: '{"method":"SUBSCRIBE","resource":"/call/conf123"}'. Whenever the server detects that a new PUT is done for "/call/conf123/participant12" or a new POST is done for "/call/conf123" it sends a notification message to the subscriber over WebSocket: '{"notify":"UPDATE","resource":"/call/conf123","type":"application/json","entity":{ ... child resource}, "create":"participant12"}'. On the other hand, if the moderator does a PUT on "/call/conf123", then the server sends a notification as '{"notify":"PUT","resource":"/call/conf123","type":"application/json", "entity":{... parent resource}}'. In summary, the server generates the notification to both the modified resource "/call/conf123/participant12" as well as the parent resource, "/call/conf123".

The notification message contains a new "notify" attribute instead of re-using the "method" attribute, to indicate the type of notification. The values "PUT", "POST" and "DELETE" mean that the resource identified in the "resource" attribute has been modified by another client using that method; in this case the "type" and "entity" attributes represent that resource. Similarly, "UPDATE" means that a child resource has been modified, with the child resource identifier in the "create", "update" or "delete" attribute; in this case the "type" and "entity" attributes represent that child resource.
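A sketch of this server-side fan-out, under stated assumptions: the subscription table, the client ids, and the exact mapping of methods to the create/update/delete keys are illustrative (the example above uses "create" for a newly added child; here PUT maps to "update").

```python
import json

# subscriptions: resource path -> connection ids of its subscribers
subscriptions = {'/call/conf123': {'client-A'}}
sent = []                      # stand-in for the per-connection send queues

def notify_change(path, method, entity):
    """Fan out server-generated notifications after `path` is modified:
    direct subscribers get notify=<method>; subscribers of the parent get
    notify="UPDATE" with the child id in a create/update/delete attribute."""
    parent, _, child = path.rpartition('/')
    for client in subscriptions.get(path, ()):    # the resource itself
        sent.append((client, json.dumps(
            {'notify': method, 'resource': path,
             'type': 'application/json', 'entity': entity})))
    key = {'POST': 'create', 'PUT': 'update', 'DELETE': 'delete'}[method]
    for client in subscriptions.get(parent, ()):  # the parent resource
        sent.append((client, json.dumps(
            {'notify': 'UPDATE', 'resource': parent,
             'type': 'application/json', 'entity': entity, key: child})))

# A participant resource changes; the conference subscriber is notified:
notify_change('/call/conf123/participant12', 'PUT', {'mute': True})
```

Note that a single modification can produce two classes of notifications, one for subscribers of the resource and one for subscribers of its parent, which is exactly what a conference roster needs.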

The concept of notifications when a resource change is available in ActionScript programming language. For example, a markup text can use width="{size}" to bind the "width" property of a user interface component to the "size" variable. Whenever the "size" changes the "width" is updated. A property change event is dispatched to enable the notification. Similarly in our mechanism, a resource can be subscribed for to detect change in its value or the value of its children resources by the client application.

3) Enable asynchronous event subscribe, publish, and notification

The previous point lets a client receive server-generated notifications when a resource changes. Additionally, we need a generic end-to-end publish-subscribe mechanism that allows a client to send a notification to all the subscribers of a resource without modifying the resource. This allows end-to-end notifications from one client to others, via the server.

When a client subscribes to a resource, it also receives generic notifications sent by another client on that resource. A new NOTIFY method is defined. For example, if a client sends '{"method":"NOTIFY","resource":"/login/","type":"application/json","data":{"invite-to":"/call/conf123","invite-id":"6253"}}', and another client is subscribed to /login/, then it receives a notification message as '{"notify":"NOTIFY", "resource":"/login/","type":"application/json","data":{...}}'. In summary, the server just passes the "data" attribute to all the subscribers. The "notify" value of "NOTIFY" means an end-to-end notification generated because another client sent a NOTIFY method.

In a web conferencing application, most of the notifications are associated with a resource, e.g., conference membership change, presence status change, etc. Some notifications such as call invitation or cancel can be independent of a resource, and the NOTIFY mechanism can be used. For example, sending a NOTIFY to /login/ is received by all the subscribers of this resource.
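The NOTIFY relay is simpler than the resource-change case, since the server never interprets the payload. A minimal sketch, with the resource path and client ids as illustrative assumptions:

```python
import json

subscriptions = {'/login/alice': {'client-B', 'client-C'}}
sent = []                      # stand-in for the per-connection send queues

def handle_notify(message):
    """Relay a client's NOTIFY to every subscriber of the target resource.
    The "data" attribute is passed along opaquely, without modification."""
    request = json.loads(message)
    for client in subscriptions.get(request['resource'], ()):
        sent.append((client, json.dumps(
            {'notify': 'NOTIFY', 'resource': request['resource'],
             'type': request.get('type', 'application/json'),
             'data': request['data']})))

# Inviting a user to a call, as in the example above:
handle_notify(json.dumps(
    {'method': 'NOTIFY', 'resource': '/login/alice',
     'type': 'application/json',
     'data': {'invite-to': '/call/conf123', 'invite-id': '6253'}}))
```

Because the server only fans the message out, clients are free to invent their own "data" conventions, such as the invite-to/invite-id pair for call invitations.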

4) Enable Unix file system style access control on resources.

Without an authentication and access control mechanism, the resource-oriented communication protocol described earlier becomes useless. Fortunately, it is possible to design a generic access control mechanism similar to the Unix file system. Essentially, each resource is treated as both a file and a directory: all the child resources belong to the directory, whereas the resource's entity belongs to the file. The service provider can configure top-level directories with user permissions, e.g., anyone can add a child to "/user", and once added it is owned by that user. Thus if user Bob creates /user/bob, then Bob owns the subtree of this resource, and it is up to Bob to configure the permissions of its child resources. For example, he can configure /user/bob/inbox to be writable by anyone but readable only by himself, e.g., permissions "rwx-w--w-". This enables a web-based voice and video mail application.

Unlike the traditional file system model with create, read, update and write, we also need a permission bit for subscription. For example, only Bob should be allowed to subscribe to /user/bob, so that other users cannot get notifications sent to Bob. The concept of a group is a little vague but can be incorporated in this mechanism as well. Finally, a notion of an anonymous user needs to be added, so that a client which does not have an account with the service provider can also be granted permissions if needed.

In summary, the permission bits become five bits for each of four categories: self, group, others-authenticated and others-anonymous. The five bits allow create, read, update, write and subscribe. Existing authentication mechanisms such as HTTP basic/digest, cookies or OAuth-based sessions can be used to actually authenticate the client.
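One way to encode this is a 20-character mode string, five operation bits per category, checked per request. The string layout, bit letters and helper names below are my own illustration, not a format from the implementation:

```python
# Five operation bits (create, read, update, write, subscribe) for each of
# four categories (self, group, authenticated, anonymous), 20 bits total.
OPS = ('create', 'read', 'update', 'write', 'subscribe')
CATEGORIES = ('self', 'group', 'authenticated', 'anonymous')

def allowed(mode, category, op):
    """Check one operation against a 20-character mode string such as
    'cruws---w----w----w-' (any character other than '-' grants the bit)."""
    offset = CATEGORIES.index(category) * len(OPS)
    return mode[offset + OPS.index(op)] != '-'

# Bob's inbox from the example above: the owner may do everything,
# everyone else (group, authenticated, anonymous) may only write.
inbox_mode = 'cruws' + '---w-' * 3
```

With this check in place, the voice-mail inbox works as described: anyone can drop a message into /user/bob/inbox, but only Bob can read it or subscribe to it.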

5) Enable the concept of transient vs persistent resources.

In software programming, application developers typically use local and global variables to represent transient and persistent data respectively. A similar concept is needed in a generic resource-oriented communication application. So far we have seen how to read, write, update and create resources. Each resource can be transient, deleted when the client that created it disconnects, or persistent, remaining even after the client disconnects. For example, when a client POSTs to /call/conf123, it wants that resource to be transient: it serves as a conference membership resource, and the server notifies the other participants when an existing participant disconnects. On the other hand, when a client POSTs to /user/, it wants a persistent user profile which is available even after the user has disconnected.

The concept of transient and persistent resources allows the client to easily create a variety of applications without having to write custom kludges. In general, a new resource should be created as transient by default, unless the client requests a persistent one. Whenever a client's WebSocket disconnects, all the transient resources (the local variables) of that client are deleted, and appropriate notifications are sent to the subscribers as needed.
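The disconnect cleanup can be sketched like this. Transient resources remember the creating client's id and are removed when that client's connection closes; persistent resources carry no client id and survive. The paths and ids are illustrative:

```python
# resource path -> row with owning client id (None means persistent)
resources = {
    '/call/conf123/participant12': {'client_id': 'client-A',
                                    'entity': {'name': 'Bob Smith'}},
    '/user/bob': {'client_id': None, 'entity': {'name': 'Bob Smith'}},
}

def on_disconnect(client_id):
    """Delete the disconnecting client's transient resources."""
    removed = [path for path, row in resources.items()
               if row['client_id'] == client_id]
    for path in removed:
        del resources[path]
        # ...here the server would also notify subscribers of the deleted
        # resource and of its parent, as described earlier...
    return removed

gone = on_disconnect('client-A')   # the participant entry goes away;
                                   # the persistent profile stays
```

This is what gives the conference its membership semantics for free: a crashed or closed browser tab simply disappears from /call/conf123 without any application-level keep-alive logic.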


I have implemented part of this concept in my web conferencing project. The server side (called the service provider) is a generic PHP application, "restserver.php", that accepts WebSocket connections and uses a back-end MySQL database to store and manage resources and subscriptions. Each connected client is assigned a unique client-id. There are two database tables: the resource table has fields resource-id (e.g., "/login/"), parent-resource-id (e.g., "/login"), type (e.g., "application/json"), entity (the actual JSON representation string) and client-id, whereas the subscribe table has the resource-id of the target resource and the client-id of the subscriber. Subscriptions are always transient, i.e., when a client disconnects, all subscribe rows for that client-id are removed. Resources can be transient or persistent. By default any new resource is created as transient, and the client-id is stored in that row; when the client disconnects, all resources with a matching client-id are removed and appropriate notifications are generated. A client can create a persistent resource by supplying a "persistent": true attribute in the PUT or POST request, in which case the server stores an empty client-id for the new resource.
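The two tables can be sketched as follows, with SQLite standing in for MySQL; the column names follow the description above, while the sample rows and the exact SQL types are illustrative:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE resource (
        resource_id        TEXT PRIMARY KEY,  -- e.g. "/call/conf123"
        parent_resource_id TEXT,              -- e.g. "/call"
        type               TEXT,              -- e.g. "application/json"
        entity             TEXT,              -- the JSON representation
        client_id          TEXT               -- empty for persistent rows
    );
    CREATE TABLE subscribe (
        resource_id TEXT,                     -- target resource
        client_id   TEXT                      -- subscriber's connection id
    );''')

# A transient resource owned by one connection, plus another client's
# subscription to it:
db.execute('INSERT INTO resource VALUES (?,?,?,?,?)',
           ('/call/conf123', '/call', 'application/json', '{}', 'client-A'))
db.execute('INSERT INTO subscribe VALUES (?,?)',
           ('/call/conf123', 'client-B'))

# When client-A disconnects, its subscriptions and transient resources go:
db.execute('DELETE FROM subscribe WHERE client_id = ?', ('client-A',))
db.execute('DELETE FROM resource WHERE client_id = ?', ('client-A',))
```

Keeping the owning client-id in the resource row is what makes the transient/persistent distinction a single DELETE on disconnect rather than per-application cleanup code.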

The generic "restserver.php" server application can be used in a variety of web communication applications, and we have shown it to work with web conferencing and slides presentation application.