Tuesday, November 10, 2009

Protocol Jungle of Internet multimedia communication

The diagram shows several protocols for Internet multimedia communication. (Click on the diagram to see the full size picture.) In the protocol jungle, a protocol is analogous to a species, its real-world implementation or deployment is an animal of the species. Some animals or species compete with each other for survival. Some animals live with each other in harmony. Some animals do not care or interact with each other since they live in different place, i.e., application or domain. Evolution and mutation results in long lasting survival of some species whereas others become extinct. Unlike using a protocol zoo metaphor, I use a protocol jungle, because there is really a competition between protocols when big companies have invested in certain protocol unlike a closely guarded and nurtured zoo system.




The diagram shows the species and its relationship with other species, e.g., whether A uses B or whether A and B are friendly. Due to space constraint, some items are grouped together, e.g., all the audio/video codecs, and some relationships are missing, e.g., RTMP is friendly with Speex. Ideally, we need a multi-dimensional representation to show multiple aspects of the jungle and how they are related. The following text lists the protocols that serve similar or common functions, and usually are competing within that function.

FunctionProtocols
Structured data encodingXML, ASN.1, RFC822, others
Audio encodingG.711, G.723.1, G.722, G.726, G.728, G.729, MP3, Speex, Nellymoser, AMR, Silk, GIPS, etc.
Video encodingH.261, H.263, H.264, MPEG, Sorenson, Vidyo, etc.
Media transportRTP/RTCP, SRTP, ZRTP, Skype, IAX, RTMP, RTMFP
RendezvousSIP, H.323, Skype, Stratus/RTMFP
Session descriptionSDP, H.245, Jingle
Session negotiationSIP/SDP, H.245, Jingle, Skype, RTMFP
Call signaling and controlSIP, H.225/Q.931, Skype, IAX, MGCP, SCCP (Skinny), RTMFP
Streaming media controlRTSP, RTMP
Session announcementSAP
ConnectivityICE/STUN/TURN, Skype, RTMFP
Remote Procedure CallSOAP, XMLRPC, REST, RTMP
Programming callsCGI, CPL, CCXML, MSCML
Programming voice dialogVoiceXML
Instant messagingXMPP, SIMPLE, MSRP
PresenceXMPP, SIMPLE
Shared resource accessREST, XMPP, XCAP
Shared stateXMPP, RTMP, HTTP


As you can see that a SIP system typically employs one protocol for one task or a few related tasks, but integrated monolithic systems such as those based on RTMP/RTMFP, Skype or IAX tend to combine multiple functions in the single protocol. I have not listed H.32x protocols other than H.323 because those are intended for non-IP networks. Nevertheless, there are several H.32x systems, e.g., for room based video conferencing or for carrying voice among carriers.

Interworking

With multiple protocols available for the same function, interoperability or interworking among those becomes important. I have talked about SIP and XMPP interworking in the last post. I have hands-on experience with several of the interworking scenarios among protocols shown in the diagram.

H.323-H.324: One of my projects in my first job was interworking between H.323 and H.324. Since both these systems use H.245 as the main session description and negotiation, the interworking task is relatively simple. I also worked on part of H.320 system to try to build H.323-H.320 interworking, but did not complete.

SIP-H.323: One of my first project during my M.S. at Columbia University was SIP-H.323 interworking. I have written sip323 software and couple of internet drafts and papers [1] on this. My PhD thesis gives a complete interworking procedure for basic call setup and registration. The conclusion was that while basic call setup and registration are easy to interwork, the full interworking of all the supplementary services is not feasible and not even needed in many cases. Since both SIP and H.323 use RTP/RTCP for media transport and can use the same set of codecs, the signaling gateway is efficient. The company SIPquest which productized my software demonstrated 10k simultaneous calls (this article).

SIP-RTSP: These protocols serve different purposes, but it is possible to build a system that needs both these functions in a standard compliant way. The sipum software is a voice mail and answering machine that uses SIP for calls and RTSP for recording and playback of media. Since both these use RTP/RTCP for media transport and can use the same set of codecs, the software is efficient as the media path can bypass the software. Please see my papers [1] for details.

SIP-RTMP: There have been several attempts at implementing Flash based SIP systems and SIP-RTMP translator is one of the approach. Some existing projects that implement these are siprtmp, gtalk2voip, red5phone and flaphone. Since RTMP is an integrated streaming protocol which can also do control and RPC, the translator is inefficient since it needs to incorporate the media path as well.

SIP-Skype: Being a proprietary protocol, it is not easy to interwork with Skype. However, Skype itself uses SIP to allow trunking with PSTN providers, and recently there was some news about SIP-based Skype gateway for enterprise.

SIP-IAX: Although IAX is open, it is an integrated protocol that combines media and signaling in the same connection, hence suffers from the same scalability problem as other integrated protocols like RTMP. Asterix also has a SIP gateway so that it can talk to SIP-enabled devices, especially carrier equipments.

SIP-XMPP: There is a interest group that discusses this in depth. My last post gives more links about the interworking scenarios using a gateway or co-location in the client.

SIP-RTMFP: Given the P2P promise of RTMFP, a gateway between these two protocols will be able to connect the proprietary Adobe protocol with the rest of the world for a true web-based end-to-end media path. I haven't seen any system that does this.

SIP-H.320: This gateway is particularly useful for existing room based video conferencing systems that want to connect with more Internet-style SIP devices. The idea is similar to SIP-H.323 translator, and in fact a real deployment may use two gateways: SIP-H.323 and H.323-H.320 in practice.

RTMP-XMPP: Since RTMP and XMPP serve two completely different functions, there is no need to interoperate. However, people have built systems that use XMPP for messaging and signaling while using RTMP for media path. Unfortunately since Jingle extension wants to define its own end-to-end session, it becomes not so useful for exchanging RTMP server session information. In particular use XMPP custom extensions based on presence and message to rendezvous, but do session control and call management in RTMP itself.

XMPP-SIMPLE: The SIP-XMPP interest group is also looking at SIMPLE-XMPP translation. However, given the disconnect between the two protocols, it is likely that all the presence and message updates go through the gateway and hence not as efficient as one would want for presence and instant messages.

RTMP-Skype: Now this is going to be really tough because firstly Skype is still a proprietary protocol, and secondly, both these are integrated protocols hence requiring complete conversion of signaling and media. An specific example could be allowing people to access Skype from web pages, e.g., by having a simple RTMP server in the Skype application itself. This works if Skype is running on your local computer. Alternatively, you need the Flash application to connect to Skype super-nodes running on public computers using RTMP. This poses security risk and is inefficient. Why inefficient? because RTMP over TCP means that only the applications on public Internet will be able to receive the connection, and RTMP is not really good for real-time interactive communication because of its latency and buffering. However, if such gateway are incorporated in Skype, then it truly become ubiquitous to web applications.

RTMP-RTSP: These are two competing streaming protocols. Instead of having a gateway that translates between the two protocols, it might be better to build an integrated client or integrated server -- you can record using RTMP and view using Quicktime (RTSP), or you can use the same client to access real-time streams from RTMP or RTSP. Since RTMP incorporates RPC along with streaming control and media path, whether as RTSP is just streaming control, a complete translation of all the functions may not be feasible.

ASN.1-XML: There has been effort to standardize this, e.g., XER. The proposed H.325 standard by ITU-T will use XML while allowing compatibility with some of the predecessors which are in ASN.1 PER. ASN.1 and XML are just data formats and for the purpose of P2P-SIP, they are not very significant.

If you have data about the usage in real deployment for particular protocol(s), feel free to post your comment.

[1] My publication page http://kundansingh.com/#papers

Monday, November 09, 2009

SIP vs XMPP or SIP and XMPP?

(This post is unrelated to P2P, and describes the differences between the two sets of protocols SIP and XMPP. I have implemented both SIP and XMPP, as well as used several existing libraries for SIP and XMPP, so I can comment on the two sets of standards from a developer point of view as well)

History
SIP was invented to provide rendezvous for session establishment and negotiation on the Internet. XMPP (or Jabber) was invented to do structured data exchange such as synchronous or active presence and text communication among group of people. XMPP evolved from instant messaging and presence, whereas SIP evolved from Internet voice/video communication. Later, XMPP added support for session negotiation using the Jingle extension, and SIP community added extensions such as SIMPLE to support instant messaging and presence.

Technically comparing SIP and XMPP is like comparing apples and oranges because the core protocols serve different purposes: session randevous/establishment vs structured data exchange. On the other hand, because of the extensions invented in both the protocol worlds, SIMPLE and Jingle, they now have overlapping functions, and can be compared. When one compares SIP vs XMPP, actually the comparison is SIP/SIMPLE vs XMPP for IM and presence and/or SIP/SDP vs XMPP/Jingle for session negotiation. Even though the goals of the two sets of protocols are converging, there are fundamental architectural differences that I will enumerate in this article. There are other articles on SIP vs XMPP [1, 2, 3].

Differences: SIP vs XMPP
The following table lists the crucial differences between the two sets of protocols.


SIPXMPP
PurposeProvide rendezvous for session establishment and negotiation where the actual session is independent, e.g., over RTP media transport.Provide a streaming pipe for structured data exchange between group of clients with the help of server(s), e.g., for instant messaging and presence
ProtocolText-based request-response protocol similar to HTTP, where core attributes are signaled using headers, and additional data using message body, e.g., session description of capabilities.
XML-based client-server protocol to create a streaming pipe on which it sends request, response, indication or error using XML stanza between client and server, and between servers.
TransportUsually implemented in connection-less UDP as well as connection-oriented TCP transport. Also works over secure TLS transport.Works over connection-oriented TCP or TLS transport.
ConnectionA user-agent is both client and server, hence can send or receive connections, in case of TCP or TLS. This does not work well with NATs and firewalls, hence extensions are defined to use reverse connections when server wants to send message to client.The client initiates the connection to the server, which works well with NATs and firewalls. Additionally, extensions are defined such as BOSH to carry XMPP stanza over HTTP to work with very restricted firewalls


There are many other differences, e.g., the way a URI is represented, or how authentication is done, or what kinds of messages are supported. I will not go into details of those since they tend to become too specific for the kind of application and we miss the important points. From a developer's point of view 'ease of programming' is very important.

Ease of programming
Both SIP and XMPP are easy to implement. My 39 peers project has modules for both in few thousand lines of Python code. Although the basic protocol is easy to implement, a complete system such as a collaboration client with audio/video and messaging/presence support is very complex.

Because of the way these protocols have originated, they are well suited for certain kinds of applications. For example, if you want to build an audio/video communication system, it is better to start with SIP. Features such as interoperability with other VoIP phones, incorporating any-cast call distribution, or using existing VoIP provider for trunking are easy and readily available using SIP. If you want to build an instant messaging and presence client, it is better to start with XMPP. Features such as friends roster, group chat, blocking a user, storing offline messages, etc., are readily available using XMPP. Any advanced communication or collaboration system needs to include both these kinds of features.

XMPP has solved the application's problems and has defined mechanisms for several commonly used features in an instant messenger-type or shared state-type application, e.g., group chat, visiting card, avatars, etc. The emphasis is on application design, use cases, and practical solutions.

I think there are two main reasons for SIP's difficulty among developers: (1) the emphasis of SIP is on interoperability rather than application and feature design, and (2) the emphasis in SIP community is to have one protocol solve one problem, which requires implementing a plethora of protocols for a complete system. Let me explain these further.

When a new VoIP features is implemented by one phone, it must interoperate with another phone or VoIP service provider. Hence most SIP extensions focus on wire-protocol and interoperability mechanisms. Although specifications of several SIP extensions are available, there are no evaluation or open reference implementation on how they fit in the overall design. More recently efforts have been made, including my RFC 5638 (Simple SIP Usage Scenario for Applications in the Endpoints), to simplify the specifications for certain types of SIP applications -- those endpoints that want to work in web and Internet world without the legacy of the traditional telephony systems.

Secondly, SIP community tries to keep one protocol to solve one problem. Some extensions deviate from this guideline, but they are exceptions. The problem comes when this design principle involves implementing several distinct protocols just to get a complete system. For example, a SIP system incorporates other external mechanisms such as STUN, TURN, ICE, reverse-connection-reuse and rport-based symmetric request routing to solve the NAT and firewall traversal problem, and still does not guarantee media connectivity in all scenarios unless HTTPS/TCP tunnel in used. Implementing instant messaging and presence involves implementing several RFCs and drafts related to Event, PUBLISH, CPIM, PIDF, XCAP, MSRP, and still the application does not have all the features of commonly available XMPP client. In summary the SIP community has created numerous extensions for solving several problems in a way that scares away a new developer!

As you can see, both these reasons (emphasis on interoperability and one-protocol-one-problem) are ideal in theory. So what is wrong? The practice. To solve these problems, (1) IETF working groups should not proceed with a draft without an open-source and simple reference implementation, (2) IETF working groups should build reference applications combining several protocols for different kinds of applications and evaluate (a) consistency and (b) ease of programming.

Consistency indicates whether the new extension is consistent with existing guidelines, best practices, protocol format, as well as design principles. For example, if an extension incorporates a new processing in the server which could have been done in the endpoint, then it is against the principle of intelligence in the endpoints. Such extensions should be marked as such so that developers know the trade-off. There are only a few good design principles, hence creating a consistency matrix of extensions against principles should be easy.

Ease of programming is determined by three things: (1) how easy it is to implement the set of protocols, (2) how easy it is to build a real application using those protocols, and (3) how easy it is to build the real application using existing platforms and tools. The first is usually available as a software library, the second as an application and the third is re-usability. It should be easy to not only build the library but also use the library to build a usable application. Every new extension adds new things to the library, which cause more interaction in the application and hence more complexity. When a software project is started, usually the interoperability is not the highest requirement, but the re-usability, short development time and real prototype application are crucial requirements. Once the project is started on one path, it is very difficult to change the path by changing the core communication protocol. If there are reference implementations then not only they help you get started quickly but it also becomes easy to see how much additional complexity a particular SIP extension brings to the application. An important programming quote: less is better than more!

The flexibility of SIP also comes with its limitations. For example, SIP is flexible to support both UDP and TCP transport. However, UDP is treated as a second-class citizen by many programming languages or libraries even today, e.g., Tcl didn't support built-in UDP socket when it came out, and Adobe ActionScript does not have built-in UDP sockets for Flash Player even now. This prevents a developer from building a complete SIP stack as Flash application, for example. However, if you peek further, you would expect that if UDP is not supported then the platform is not suitable for real-time communication anyway. However, this does not prevent web-style developers to implement XMPP in ActionScript, and perhaps tweak it to support signaling of media sessions as well. The result is a broken or non-interoperable software application.

Reviewing the evolution of SIP vs XMPP specifications, I think XMPP has defined an architecture that allows adding new extensions easily and hence reduces the application complexity, whereas SIP extensions have focused on interoperability and wire-protocol without much needed attention to application design. While application design may seem unnecessary for protocol specification, it is very important in the short term. Consider a developer who uses some data structures for representing protocol elements. If a new extension is defined in XMPP, and it reuses the existing XML format that gets readily mapped to the data structures, it becomes very easy to incorporate this new extension in his source code. If a new extension is defined in SIP or SDP, which re-uses an existing mechanism of another protocol for which there is no real implementation available, then the developer will first have to implement that other mechanism, then integrate it with SIP or SDP. The mechanism may have its own formatting which needs to be incorporated in the data structures. Essentially the developer will have to spend more time implementing such an extension. In the end, the actual format of the message whether text-based or XML-based is not terribly difficult once you have a library for message formatting and parsing. However, if an extension uses a different format, connections, sessions, etc., that are not readily available in existing libraries and tools, complexity arises. For example, adding ICE to SIP/SDP created custom format whereas ICE in XMPP/Jingle re-used XML. Another example is how an particular endpoint is identified in XMPP vs SIP. In XMPP the URI itself is extended to include the resource, e.g., "user@domain/resource", whereas in SIP new extension such as globally routable user-agent URI (GRUU) is defined which is, well, more programming effort!

Scalability and performance
SIP is inherently a peer-to-peer protocol whereas XMPP is inherently client-server. Tasks that are easy in client-server systems such as shared state, roster storage on server, or offline messages on server, are well accomplished with XMPP. On the other hand, one of the primary goal of SIP is to keep the intelligence in the endpoint. Ideally, a SIP proxy server does not even maintain the session state for the SIP dialog. Few messages in SIP such as REGISTER and PUBLISH are intended for client-server communication. In XMPP, server is a must and all signaling communication goes through the server. There are message semantics defined for the types of messages, e.g., client-server information query, client-server-client message sending, client-server event publishing and server-client event notifications. Clearly client-server applications are limited by scalability and performance of the server. For example, an instant messaging session need not go through the SIP server saving bandwidth and processing at the server. But that means you lose the offline message storage feature at the server. In real SIP applications today, servers have become an integral part of the system and hence the scalability difference diminishes. In fact, the bulky message format of SIMPLE makes it less scalable than XMPP for presence updates that go through the server. Note also that although P2P-SIP is possible, a P2P-XMPP is not easy because XMPP is inherently client-server.

Once we know this, we understand that SIP and XMPP systems solve two different problems, are designed for two different architectures and have evolved with two different guidelines. From here, you can do two things: either try to incorporate/translate all the features of one system to the other and eventually give up, or try to design your system that uses best of both worlds.

Interworking and co-location
There have been interworking attempts to inter-operate SIP/SIMPLE and XMPP, especially the IM and presence part [draft-saintandre-sip-xmpp-*, draft-veikkolainen-sip-voip-xmpp-*]. The first reference shows how to implement a gateway to connect between SIP and XMPP networks, and the second shows how to implement a client that can support both SIP and XMPP and co-relate the two protocol messages if the user is connected to both servers by the same provider. The popular OpenSER (now OpenSIPs and Kamailio) SIP server has a Jabber module to inter-work with XMPP network. People have developed clients that can understand both SIP and XMPP. Interworking is complex, and not all features can be completely translated or used from one protocol to another, unless the protocol is changed a lot with custom hacks.

Conclusion
Industry experts predict that both SIP and XMPP will stay for a long time. Rather than arguing about the differences or trying to mend the protocols to be like each other, one could build systems that use both these protocols for what each is good at. XMPP is good at creating application level streaming/secure/client-server pipes that can be used for shared state, one-to-many message delivery and publish-subscribe-notify-type use cases. SIP is good at rendezvous of session establishment and negotiation of session parameters for a separate session establishment.

To interwork between XMPP and SIP, you could (1) use a gateway at the server to translate the basic functions, (2) learn or send SIP parameters over XMPP message from a client, or (3) use SIP to establish XMPP chat session with a client. For example, a multi-protocol client of user alice@example.net may be talking to bob@home.com over SIP, and discover that both clients support XMPP, and then add each other in XMPP roster or start an XMPP chat session. Alternatively, if they are chatting over XMPP and discover that the other supports SIP as well, then they initiate a SIP session to do multimedia call. Implementing both the protocols in the client is better than in the gateway for scalability and robustness. There are other interworking architectures possible, e.g., having two XMPP servers use SIP to communicate with each other or talk to a trunking provider, or having an integrated SIP-XMPP server that allows both SIP and XMPP users to seamlessly communicate with each other. These modes, however, are not interesting from a P2P point of view.

Thursday, October 29, 2009

Reliability and scalability in P2P-SIP

In this article I list the important definitions that contribute to reliability and scalability of distributed systems, especially P2P-SIP. Scalability is defined as the ease with which a system can handle growth of demand (load, users, requests, etc.), and reliability is defined as the ease with which a system handles failure or loss (fault tolerant, availability).

Stateless: The amount of state stored in each networked component limits the scalability (as well as robustness). A truly stateless protocol or service does not need to store any state in the system. At the application level, a session oriented protocol such as RTSP or RTMP is stateful, whereas HTTP is stateless which can be made stateful using session cookies. At the transport level, TCP is stateful whereas UDP is stateless. SIP networked components come in several flavors: stateless, transaction stateful and session stateful. A stateful component has limited scalability not because of storage requirement of the state, but due to matching of a new request against the existing states -- this requires exclusive or read-write-locked access to shared resources and hence is slow. On the other hand a stateless request has all the information that is needed for handling that request. Secondly, a stateful component has limited reliability because if the component fails the state is lost and must be re-established, e.g., using new SIP transaction or new HTTP authentication. If you must use stateful component, then some form of distributed or replicated shared state is desirable.

Partitioning: Data or service partitioning among multiple machines helps in reducing load on each machine. A naive approach of using a hash function 'H(data.key) mod N' works for distributing data based on the lookup key among N identical and robust servers. However, the hashing algorithm remaps majority of the data if N changes -- new server is added or old one fails. On the other hand, consistent hashing using a large hash space, e.g., MD5, maps both data using H(data.key) and machines using id=H(machine.name) in the same identifier space. A machine can then store the data whose keys are close to its own id. This principle is used in several structured P2P algorithms such as Chord, Pastry, CAN, and works especially well with large number of machines with significant amount of churn. Partitioning can be applied to services as well. Partitioning improves system scalability, but comes with an overhead of maintaining the correct partition when machines come and go.

Replication: Replication of data improves its reliability (also known as availability) as well as scalability to some extent. In a simple master-slave replication of data, write is done to the master which replicates the command to the slave, and read can be done from either master or slave servers. With higher failure probability, such as in P2P, you need more number of replicas. With N replica of some data, you can still get the data if N-1 machines holding the replica fail. Replication interacts with partitioning -- you want to keep access to the replica almost as fast as the original data. Typically a machine can replicate the data to its N (typically small, 2-8) neighbors and when the machine fails, the next data query automatically gets to its ex-neighbors. Note that another approach where a different hashing function, e.g., H(i+data.key), is used to store the i'th replica does not work well, because replication comes with an overhead of having the machines to keep the replicas up-to-date.

Redundancy: Replication is an example of redundancy of data. Redundancy can be applied to servers and services as well. A stateless protocol or service helps in easily deploying redundant server nodes. Data redundancy goes beyond simple replication of data object to N places. Suppose a large file is split into M chunks such that only N chunks are needed to reconstruct the original file, N<=M. Assuming that many data sources are fast, this approach not only improves reliability but also performance in terms of how quickly you can download the file. Such techniques are used in P2P file sharing applications, and can be applied to P2P-SIP as well, e.g., for storing video mails or live streaming. Redundancy improves reliability as well as scalability of the system.

Load sharing: Load sharing is one type of redundancy where the load of the system is shared among N redundant machines. Although not required, load sharing typically works in conjunction with partitioning, e.g., in two-stage SIP server farm where the first stage servers forward the requests to the second stage server clusters based on H(data.key). As with redundancy, a stateless protocol or service can be easily shared whereas a stateful system requires more work. P2P systems are inherently load shared among the participating peers. Load sharing improves system scalability.

Iterative vs recursive: Iterative request routing is one where a client sends a request to one destination, receives a response that redirects it to second destination, and so on. Recursive request routing is one where a client sends a request to one node, which sends request to another node, and so on. Iterative vs recursive is also called as redirect vs proxy. Clearly, iterative poses less load on the networked element but more on client, whereas recursive is opposite of that. If scalability of networked element is desired then iterative should be preferred. However, NATs and firewalls make iterative request processing difficult on the Internet. Secondly, the topology, bandwidth and connections among the networked elements sometimes make recursive routing faster and more efficient in practice than iterative. The decision to go iterative vs recursive affects the number of message that needs to be handled as well as the state carried in the message or stored in the networked element. It is easier to incorporate redundancy in iterative mode, since only the client needs to re-try to redundant destinations.

Keep-alive: Network protocols usually have periodic keep-alive messages to ensure connectivity or to detect failures. Stateful Transport protocol such as TCP has built-in keep-alive mechanisms that can be activated using socket API. Application protocols employ some kind of application level keep-alive, e.g., XMPP has an extension to do ping, SIP has session timers. These not only detect failure due to network but also due to server software crash. The keep-alive messages help in improving the reliability of the system by quickly detecting complete or partial failures. Keep-alives are especially important in P2P network because of the large number of network paths that can fail.

Exponential back-off: After a failure has been detected, a reconnection or resend is attempted. The time for such attempts needs to be backed-off exponentially -- if the failure happens at time 0, then send first attempt at t, say t=0.5s, and subsequent ones at 2*t, 4*t, 8*t, and so on, until it reaches a cut-off, say 5 min. After that keep attempting periodically, say, every 5 min. If the failure is transient or one time, then this mechanism quickly reconnects. On the other hand, if the failure is longer term, then it reduces the load and bandwidth for reconnection attempts. SIP uses exponential back-off when sending subsequent requests in the event of failure.

Request-response: Application protocols typically come in two flavors: send-and-forget and request-response. Most signaling and control protocols such as HTTP and SIP follow the request-response architecture. Media streaming protocols such as RTP usually don't send response for every message. A request-response protocol automatically makes the entity sending the request a client and the entity receiving the request a server. The client-server distinction may be on per-transaction basis, e.g., SIP has each endpoint act as user-agent-client and user-agent-server. Similarly, in P2P network every peer acts as both client and server, while there may be some nodes behind NAT that can act only as client. A request-response architecture is needed where reliability of the message delivery is important or when RPC semantics is desired in the protocol.

Redundant connections: Redundant connections are another form of redundancy found in distributed systems. For example, a client may be connected to multiple servers. It periodically pings the server and selects the best server to actually send the request to. If the best server fails, it can fail-over to the next best server. Redundant connections improve both reliability and performance of the system. Redundant connections are also useful with geographically distributed server farms to locate the closest server. The idea is to dynamically adapt instead of having statically configured connections.

Bi-directional master-slave: Master-slave replication is used for data reliability and to improve scalability of read-dominated applications, as mentioned before. In a bi-directional master-slave configuration, both machines act as both master and slave at the same time. Any write to any of the machine gets propagated to the other, and hence any read can be done from any of the machines. This improves scalability for write-dominated applications also, such as SIP server. The bi-directional replication can be extended to more than two machines by incorporating a circular ring topology of replica machines. The bi-directional replication comes with the over head of having to maintain replicas on all the machines, and some way to solve a race condition where two updates to two different machines result in eventual consistency. For certain types of data, such as SIP contact locations of the users, it is possible to achieve such consistency.

Vertical vs horizontal scalability: A lot of time you will hear people talking about vertical vs horizontal scalability. Vertical scalability implies that when the load increases you identify the bottleneck and add a new component (e.g., CPU, memory, disk) to your machine to improve the scalability. Horizontal scalability implies that you design the system such that when the load increases you add another machine in the network to handle the load, e.g., another server in the server farm. P2P systems are horizontally scalable. Clearly vertical scalability has a limit beyond which it may not scale, whereas server farms can scale linearly by partitioning.

Proxy and Cache: Caching improves performance and scalability of the system. DNS has epitomized the concept. Caching is also used in web and media streaming protocols. The idea is to install a cache in an intermediate networked element which uses the cache instead of sending subsequent requests to the actual destination. HTTP is designed to heavily use caching to improve performance and scalability of the servers. Caching a negative response is more challenging since the time-to-live for the cached entry is unknown. On the other hand, real-time communication protocols such as SIP and RTP have limited use for caching in the network. Nevertheless caching of data improves the performance and scalability of the SIP server, e.g., by using in-memory cache of the SIP contact locations instead of reading from the database every time. With caching comes the overhead of maintaining consistency among redundant data.

Crash vs byzantine failure: Crash and byzantine failure are two types of failure models: a crash indicates that the system has stopped working and it may be possible to detect its failure, e.g., using keep-alive; whereas a byzantine failure indicates that the system does not behave consistently as per the agreed protocol or algorithm all the time. Hence, it is very difficult to detect byzantine failures, especially if it is due to malicious intention. Byzantine failure and malicious node problem is especially important in P2P since the peers may not trust each other. Both crash and byzantine failure reduce the reliability of the system. Various mechanisms mentioned in this article help mitigate the crash failure, but do not help much with the byzantine failure.

Readings:
[1] Singh, K. and Schulzrinne, H., "Failover, load sharing and server architecture in SIP telephony", Computer Communication 30, 5 (Mar. 2007), 927-942. DOI= http://dx.doi.org/10.1016/j.comcom.2006.08.037 [Author's copy]
[2] K.Singh, "Reliable, scalable and interoperable Internet telephony", PhD Thesis, Computer Science Department, Columbia University, New York, NY 10027, June, 2006.

Friday, October 23, 2009

Security in P2P-SIP

I frequently receive questions on security in P2P-SIP, mostly from researchers looking for a new topic to explore. Security in P2P-SIP (and in P2P in general) is a challenging problem. In this article I summarize my understanding of the challenges and open problems.

The first literature on P2P-SIP mentions that P2P-SIP needs to solve the challenges of client-server Internet telephony as well as privacy, confidentiality, malicious node behavior and "free riding" problems of P2P. For example, a malicious node may not forward the call requests correctly or may log all call requests for future misuse. A later publication formally identifies the key security challenges and potential solutions. The paper on survey of DHT security techniques presents a comprehensive listing of challenges, solutions and problems. Let us classify the challenges:

User Authentication: Similar to client-server SIP, authentication is essential. A receiver need to verify that a sender posing as sip:bob@example.net is actually the owner of that identifier. If the user identity is based off some other information or identity owned by the user, e.g., email address, phone number, postal address, social-security number, credit card number, PKI, X.509 certificate. etc., then it is possible to delegate the identity to that mechanism, e.g., by sending email or phone caller ID verification. The challenge can be further divided into: whether the user owns the identity? whether the user can randomly pick his identity to anything? whether a user can be made to believe that he has (wrong) ID or password? whether a malicious user can get password from another user in the pretext of authentication so that the malicious user can later assume the other user's identity?

Node Authentication: Additionally, since a number of P2P algorithms use the node identity to locate a node or define data storage criteria, the node ID is also a candidate for spoofing. A receiver must verify that the sender owns the node ID that it is posing as. The problem can be divided into sub-problems: whether the node ID are randomly picked or self generated by (malicious) nodes or assigned securely by some authority? whether the node ID can be spoofed in the protocol messages or data storage? whether a node can be made to believe by other nodes that it has (wrong) ID? whether a malicious node can get the authentication credentials of another node and later assume other node's identity? These problems if not addressed can result in other problems such as man-in-middle or denial-of-service (DoS) attacks. The most important question is: Can authentication be done in P2P without relying on central trusted authority?

Overlay Routing: A malicious node in a P2P network can drop, alter or wrongly forward a message intentionally manipulating the correct routing algorithm to disrupt the network and hence availability. This partly depends on node ID assignment mechanism, whether a node can intentionally place itself in the topology at a particular place? Further questions to ask: what fraction of malicious nodes affect what fraction of P2P network? What is the relationship between performance (availability, routing and data storage) of P2P and the fraction of malicious nodes or users?

Overlay Maintenance: A malicious node may invite more malicious nodes or copies of itself in the P2P network. A malicious node may partition the P2P network so that one part can not reach the other. A malicious node may reject join requests from other good nodes to prevent them from joining the network. The questions: what fraction of malicious nodes can affect what fraction of P2P network availability? Can a malicious node eventually affect the whole network given enough time? Can a malicious node affect the discovery of bootstrap node by other nodes that affects the joining process of other nodes? Can a malicious node intentionally place itself in the topology at a particular place (e.g., as super peer), so that it affects more number of overlay messages?

Free riding: A P2P network works because the peers do. If a node refuses to serve as a peer, but just use the service of the other peers, how do you handle this? Can the system enforce or give incentive to a particular node to become part of the overlay? What fraction of the nodes must be part of the P2P overlay for the overlay to work?

Privacy, Confidentiality, Anonymity: Unlike the client-server telephony, in P2P-SIP the call signaling and media messages may traverse through other nodes in the system. Can other nodes know who is calling whom and hence infringe on user's privacy? Worse, can a malicious peer listen to the conversation (audio, video, text, etc.) between two other peers? Can the system allow you to make anonymous calls so that the receiver does not know who is calling? Can the system allow you to receive calls (e.g., any-cast calls to call centers) without divulging your identity to the caller?

SIP services: The client server SIP implements several new features and services, but those have limited use in P2P-SIP because of the trust model. For example, programmable services using SIP-CGI or SIP Servlet are difficult, e.g., unless the receiving peer can completely trust the calling peer's CGI. Emergency services, spam prevention and lawful interception that have been researched in client-server SIP are pretty challenging in P2P-SIP.

Cost of security: Most of the existing protocols on the Internet suffer because people don't implement or deploy enough security. For examples, the front web page of many banks do not use HTTPS/TLS but have login forms. The reason is that system and operations engineers see security as an overhead, and do not use unless really needed. P2P takes this to extreme because the (in)security of one node can affect several in the network. The questions: What is the cost of security? How much does performance suffer in terms of number of messages, overhead, delay, for a particular security mechanism?

Given these problem there are several approaches the researcher are taking in solving. However, the core of some of these problems still remain unsolved. The general approach is to define a sub-set of the P2P-SIP system which works for the given security mechanism. For example, the P2P authentication is very challenging -- hence most implementations use a central certificate authority (CA) and everyone trusts that -- similar to the web browser model which comes installed with some root CA. The other approach is to build a closed P2P network of trusted implementations and provide the service to the rest of the untrusted users, e.g., OpenDHT and Dynamo. This works similar to the server farm model, except that the server farm is built using sub-set of P2P features -- self adjusting, less configuration, distributed data storage, geographically distributed. Another approach is to build the closed and proprietary system and protocol which prevents (to some extent) others from injecting the malicious node in the system, e.g., Skype. Unfortunately, sooner or later the protocol gets reverse engineered and the security is not longer present. The research on distributed trust, reward, or credit/debit system works well for file sharing but has not be successfully proven for P2P-SIP. Finally, some researchers focus on the statistics and availability of the whole network, with the theory that a small fraction of malicious nodes do not disrupt the whole network. If there is enough incentive for the nodes to remain good, this may work well.

If you are interested, please read the article on when P2P makes sense? In particular, if (1) most of the peers do not trust each other AND (2) there is not much incentive to store the resources then P2P does not work well because the system does not evolve naturally to work. Think of it as people who do not trust each other and they do not have much incentive to help others or to store information for other people, will a person be able to get information he needs that another person has? The subset of problems I listed in the previous paragraph all try to twist the problem such that peers trust each other, i.e., (1) and the system tends to evolve naturally to work. Still more research is needed in (2) to identify and develop the incentive model for P2P-SIP use case.

Tuesday, October 13, 2009

The (in)security of Flash Player's crossdomain

This article discusses the (in)security of Flash Player's crossdomain or cross-domain-policy mechanism and why it is against P2P. Anyone who has worked with Flash Player's network (URLLoader, Socket, etc) to implement a protocol would know the problem and pain that Flash Player causes due to it (broken) crossdomain security.

The problem: A programmable content downloaded from site1 running in user's browser should need explicit permission to connect to or use content from another site2. Otherwise, a Flash application may randomly connect to or use content from any other site without user's knowledge and pose security risks.

The real problem: The real problem is in the way Flash Player tries to solve the above problem. If a Flash application is downloaded from site1, and wants to access or connect to another site2, then that site2 must give explicit permission using a web accessible http://site2/crossdomain.xml or in-line cross-domain-policy response in the TCP connection. The crossdomain.xml file lists all the other domains (such as site1) whose Flash applications are allowed to connect here, and also lists all the ports to which they can connect. There are options to give wild-card (*) for domains and ports. Thus, only if site2 trusts site1, will it allow site1 to connect.

The first problem with this approach is the trust model: Flash Player asks site2 instead of the user for permission. This means user still does not have control what other sites the Flash application connects to; if site1 used wild-card domains then any application can connect to it; and admins of site1 and site2 must co-ordinate and collaborate. In most real deployment, this means that site1 and site2 are owned by the same entity and the deployment builds a false sense of closed walled garden of client-server applications.

The second problem is that it is very easy to work around: if site1 is wants to use or connect to site2, but site2 does not trust site1, then site1 can install a connection proxy on its site1 and have the Flash application connect to site2 via this proxy on site1. So it does not really protect site2 from access from any third-party Flash application -- i.e., there is no closed walled garden for site2. What it actually means is that if site2 has a content or service then anybody can build a Flash application to access that content or service as long as he can host a proxy on the Internet. You just need _one_ person in the whole Internet with good bandwidth and an open proxy to potentially break the crossdomain trust model of _all_ the Flash applications.

The third problem is that it assumes Flash Player binary will not be reverse engineered: This is the worst sense of security in the literature. When a Flash application is downloaded from site1 by Flash Player, and wants to make connection to a another site2, it just checks the URL from where the Flash application was downloaded. If the URL contains the domain of site2, then usually the crossdomain check is not done, but if the URL does not contain site2, then it first gets crossdomain.xml file. If some person is able to hack and modify the URL variable inside Flash Player or in the transit on non-secure HTTP, then the Flash Player will effectively ignore the crossdomain check.

The fourth problem is the way it has evolved: The initial implementations of crossdomain policy was broken, with lots of ways to work around. With newer implementations of Flash Player some of these problems have been fixed. However, that means the older mechanism is deprecated and newer mechanism no longer works with older. Concrete example of the problem follows. Should an application downloaded from http://site1/p/app.swf be allowed to connect to site1:5222 (non-HTTP port), or to http://site1 or http://site1/q/second.swf? If yes, then public hosting of personal web pages effectively open up all the Flash applications of all the hosted users on that server. So why not define a meta-policy at the top level http://site1 which controls everything else? How can site2 allow only one application from site1 but not other to connect? How can site2 allow only applications that are signed by site2 to connect but not others, even if those applications are hosted on other sites? No answer!

The fifth problem is that it depends on other unsecured mechanisms: HTTP and DNS. I won't describe details on these, but it would have been nicer if the security was based on code signing or standard authentication mechanisms instead of comparing the hostname from the HTTP URL.

The sixth problem is its incompatibility with existing systems: When implementing a custom non-HTTP protocol, the Flash Player sends the first request as <policy-file-request/> on the connected socket. What if the service does not understand this request? Well use meta-policy. In that case for it to work, the meta-policy needs to be served from say, 843. What if another server is present on the machine on port 843? Or what if you want to add handling of this policy-file-request in your own service protocol? In my personal experience getting this in an XMPP server was a nightmare. What if Flash Player puts a nul character after its request? You cannot control the request or response for policy-file from ActionScript because it happens automatically in Flash Player before a socket is actually connected in the ActionScript.

The solution: In the context of open services and/or open network and/or P2P, the crossdomain is a problem rather than a solution. If site2 has hosted an open service, it should not restrict anybody to build applications to connect to site2. Thus site2 should put a crossdomain.xml with wild-card domains and ports. If site1 has a built a Flash application for an open service, the application should be allowed to connect to any service that follows the protocol as long as the user approves the connection. Thus site1 should build Flash application with correct user authentication, and then proxy the connection via site1's proxy to the site2's service instead of having it connect directly to site2. This allows site1 and site2 to operate independent of each other as long as they implement a common service protocol, e.g., XMPP or P2P-SIP.

The real solution: A correct security implementation in Flash Player should have done something like the following:
1. When the Flash application tries to connect to another new site, ask the user (similar to the camera and microphone security settings) if he wants to allow the connection. Also give an option to remember the approval if needed.
2. Allow site2 to sign a Flash application, and require that only signed Flash application with so-and-so root certificate should be allowed to connect to site2.
Beyond these any site should be free to implement its own authentication mechanism.

Friday, September 04, 2009

The Internet Video City


(For the last month and half, I have been aggressively involved in another open source project, "videocity". This article describes the salient features and novel ideas in that project.)

The goal of the Internet video city project is to provide open source software tools, both client and server, for video communication and sharing. Unlike other file sharing systems, this is targeted towards video and live video sharing in small groups. Unlike other video communication services, this project provides the tools needed to build a service.

High level description


At the high level, the video communication is abstracted out as a city. An individual can signup with his email user@domain.com and own a home with URL of the form http://server:5080/user@domain.com. This is also the location of the default guest room of that user. The user can build other rooms inside this URL, e.g., for hosting a online family gathering, he can get a room with name "Family Gathering" and the room URL of the form http://server:5080/user@domain.com/Family.Gathering. Each room can be made public or private. A public room is accessible to anyone visiting the URL of the room, whereas a private room needs explicit permission to enter.

Once you have entered a room, you see other members in the room, and can communicate with others using real-time audio, video and text chat. You can share media files such as photos and videos from your computer with others in the room. You can also share online photos and videos with others. All these shared resources are put in an active session and would disappear when the room is closed, i.e., all members have left the room.

The owner of the room can decorate his room by uploading, recording or editing the room's content. A room's content is described using an XML file containing multiple play lists. Each play list contains sequence of media files or URLs. When you enter a room, you see all the pre-configured play lists in that room. This allows the owner to, for example, create a room with his family pictures and videos in a slide show, and give out the URL to others to view the photos. A media resource in a play list can be text, image or audio/video. The image and audio/video can be uploaded from user's computer, downloaded from a web URL or recorded using user's camera in real-time. The play list can be readily edited using drag-drop, built-in text editor or various button controls.

Each signed in user also has an inbox. The inbox is a special XML file that gets loaded when a user logs in, and contains play lists that are sent by other users to this user. When you enter a room, you have an option to send a play list to the owner of the room, which turns up in the owner's inbox. You can record the play list using your camera, or create one using resources available from the web. The play list stored in the inbox is privately available only to the owner of the room.

This simple concept of play list and rooms, allows us to implement various communication scenarios. For example, real-time communication, video mails, publicly posted videos, and video web sites.

Novel idea


One of the novel concept used in the project is that of soft-card. A soft-card is a digital version of your ID card or visiting card. There are two types of cards: a Private login card is your confidential ID card that you use for login to the site, an Internet visiting card is your room's visiting card, which you give out to your friends so that they can visit your room. Usually each signed in person has a private login card, and each room owned by the person can have an Internet visiting card.

A soft-card looks like a digital image of your real ID and visiting cards. It is actually a image file in PNG format. The image has a photo, your name or your room's name, some list of key words identifying your room, and a URL of your room. Unlike a regular PNG file, a soft-card has additional meta information that is used in secure identification and access. In particular, your private login card has your RSA private key (refer to PKI) and your Internet visiting card has X.509 certificate using RSA public key signed by the server. These meta information such as keys, certificates, names, emails, keywords, etc., are stored in information chunks of the PNG file itself.

Similar to public key cryptography, these digital files can allow us to implement security, authentication, access control, privacy, confidentiality, etc. Essentially, anything you can do with PKI, you can do with these soft-cards. Additionally, these soft cards give a visual appearance of an ID card or a visiting card containing the URL which they represent. Users receive them in email on signup, and can give out visiting card to others in email. An example visiting card is shown at the top of this article. If you edit the card's file or image in any way, e.g., converting to JPEG and back, or edit using photo editors, then the card's key information will become invalid and unusable. Note that a card is valid only within the domain it is created for. Thus a card created for http://server1/room1 can not be used by http://server2/room1 even if both server1 and server2 virtual domains are hosted by the same server.

Once we have the login (private key) and visiting (public key) cards, implementing rest of the security mechanisms is straight forward. For example, resources in an inbox can be encrypted using public key of the owner, so that only a private login card can decrypt it. The public rooms are signed by owner's private key, so that anyone with the visiting card of the room can verify the signature. When sending a media resource to another user, PKI can be used to establish a secure session of communication. A room can be made private by allowing only connections from people who have valid visiting card for that room, and have the owner send out visiting card to his friends and family using an independent channel such as email. A room can be made public by uploading the visiting card to the room itself, so that anyone with the URL can first download the visiting card (i.e., public key) and use that to connect to the room. Although we haven't implemented most of the security mechanisms, we have the basic soft-card concept implemented in the project. In particular, you can create your cards, edit the layout of the card during creation, download them after creation, and use them to upload in the client to join a room or to log in. One thing to note is that within the Flash Player environment, the amount of security using PKI is limited. But since we have our own video server implementation as well, we can do some novel tricks in that regard.

Product design ideas



There are several product design ideas we implemented in the project: (1) consistency, (2) flowing and smooth interface, and (3) performance. In this section, I describe these ideas and how they are implemented.

Consistency is very important in user interface design. The look and feel of various buttons should be consistent. Common operations should be consistent with what people are used to doing. For example, most windows users see the 'close', 'maximize', 'minimize' buttons on the top-right corner. Most mac users see the bottom bar as tools or commands bar. Most instant messaging users see notifications on the bottom-right corner of their screen. We used these concepts in our UI design as well.

Flash allows us to implement nice, smooth and flowing user interface. When you go from one room to another, the view slides your window from one room to another. The sliding window component in the project nicely abstracts out the details of this container. When a help video is played, it animates to the full view, and when it is paused, it goes back to the original position. For help videos, flowing subtitles along with audio/video give a better user experience. Computer users are comfortable with drag-and-drop operations using the mouse. In our project, the play list editing, video window re-organizing, delete button, etc., use the drag-and-drop mode of operation.

Performance is important once the project grows to a significant size. In particular, a Flash Player spends lot of cycles rendering images. This is improved significantly in our project since we use only programmatic skins for all our buttons and icons. Moreover, programmatic skins scale nicely when going to full screen or different size.

There were a number of lessons we learned in this project from the product design perspective. Moreover, being responsible for both product design and product engineering helped us avoid ambiguity, which is usually seen in multiple team projects.

The big picture



Although, the project is still "work in progress" and a lot of work is remaining, I wanted to give a big picture of the project. Flash Player is a great browser plugin. However being proprietary makes it hard for others to use it in full potential. For example, until recently the video communication was restricted to only Flash media server, or file upload were not allowed from local computer to Flash Player without going through the server. Although Adobe is making significant progress in keeping the developer community engaged, (e.g., making RTMP protocol open, or making file uploads and downloads available in new Flash Player) there will always be some restriction in the Flash Player. For example, absence of H.264 encoder or good audio quality/preprocessing engine prevents us from using it efficiently in true H.264 video communication or good real-time audio communication. In any case, since the RTMP protocol is open, and since there are a number existing open source RTMP implementations, one can use back-end RTMP based servers to perform some processing.

This videocity project gives us back-end tools to intercept RTMP, integrate web communication, and expose a single server to support various requirements of video conferencing. One can ask whether this will scale? The answer is, may be, not. The reason for doing the project though is that it fits nicely in the big picture of P2P-SIP based communication framework. Flash gives a nice ubiquitous browser based front end, whereas our videocity server gives tools that can be integrated with peer-to-peer network. Thus we can gain from advantages of both worlds.

Distributing a conference in a P2P network is an already researched problem. Several solutions exist, ranging from application level multicast for large conference, to full mesh small conferences, to picking a few servers as relay bridges. Maintaining shared distributed state of the conference and collaboration is interesting to explore. The SIP community has done significant work in centralized conferencing framework, e.g., in the IETF XCON working group. The P2P-SIP working group is creating protocol for standards based peer-to-peer network maintenance and lookup for SIP service. Finally, some API or interface specification is needed for the videocity's client-server model so that others can build clients or server adaptors to integrate between XCON, P2P-SIP and videocity. In particular, we will define all the interface elements such as format of the soft-card, various RPC calls for uploading or downloading resources, sharing play lists, authenticating users, as well as communication mechanisms.

In summary, the project gives developers a starting point from where you can build video communication service, video message platform, video recording and editing system, collaboration engine, media sharing software, video blog web site, video rooms, multi-party conferencing applications, desktop clients, browser extensions, application sharing, new client applications, and so on. The client-server tools available in the project allow you to record a video or snapshot photo from your camera and store it in local file, create play lists of various heterogenous media resources, and share live and stored media with others using the system.

There is no hosted service for this software, and we don't plan to have one. This is because our goal is to go peer-to-peer, where various installations of the software will discover and communicate with each other!

Thank you for your reading time, and we love feedback!

Saturday, August 29, 2009

Beauty of open source

This article presents an analogy between software projects and beauty: open source projects have natural beauty, whereas commercial projects acquire beauty through cosmetics and makeup. [see previous article].



Beauty

Software project


Natural beauty is, well, natural, whereas cosmetics give artificial sense of beauty.

Open source software are built when some motivated developer feels like building something, where as commercial software are mostly built by engineers who are paid and forced to build.


Natural beauty is long lasting, whereas makeup wear down after few hours or days.

Open source software lives longer, whereas commercial projects tend to get out beaten by competition sooner or later.


You are born beautiful and don't have to pay for natural beauty, whereas cosmetics cost money, big money.

Open source software are mostly free, whereas you have to pay for commercial software.


It is hard for salesman to sell fruits and water to enhance your beauty, whereas it is easy for salesman to sell cream and nail-polish. Sometimes these advertisements are deceptive.

Usually you don't find people advertising their open source work much, whereas companies have dedicated sales team to sell the commercial product. Sometimes these sales pitch are deceptive.


Natural beauty is usually open and does not hide scars, whereas cosmetics are meant to hide scars, marks, etc., to give a (false) sense of beauty.

Open source software are open source code, where developers can jump right in the code and see things. Commercial software hides the source code, and instead presents documentation, power points, sales pitch, etc., to give a (false) sense of what is inside and actually hiding what is inside (source code).


Natural beauty does not require support. On the other hand if you are putting a nail polish, you will need a remover; if you are putting on make up, you will need to remove it before bed.

Open source software usually comes with no support. If you have it, you own it, and you put up with it. Commercial software usually comes with (expensive) support system. If you have it, you need to keep paying for bug fixes and upgrades. Otherwise it will harm you sooner or later.

Sunday, July 05, 2009

Programmable SIP server

I had posted an article on generic SIP API earlier. I implemented a first version in my 39 peers project over the weekend. I also used the API to implement a simple SIP proxy and registrar server. I tested the server using X-Lite clients in an intra-net. The basic SIP registration and call routing with record-route seems to work well.

There are precisely two modules in the implementation: sipapi to implement the core of the API and sipd to implement the server, with 252 and 110 lines, respectively, of Python. The first version only supports simple call routing as needed for a server, without any advanced features such as NAT traversal. It also supports fail-over and load sharing models using the two stage SIP server farm as described in my PhD thesis (although with same set of servers performing both first and second stage for different set of users).

In the implementation, the API exposes an Agent class that represents a listening endpoint. The agent dispatches various events such as "incoming" to signal an incoming message. The application, or SIP server in this case, attaches a local function to handle the "incoming" event and process the event. The processing logic is inspired by SIP Express Router (SER)'s config file. The following command creates the listening agent on the given listening IP and port using UDP transport.

from app import sipapi
agent = sipapi.Agent(sipaddr=('192.168.1.3', 5060), transports=('udp',)).start()


Once created you can attach your function (say "route") to handle incoming messages. The handler function gets an event representing the incoming message, and acts on it using the methods available on "event.action" property.

def route(event):
if len(str(event)) > 8192: return event.action.reject(513, 'Message Overflow')
if event.method == 'INVITE' and event.uri.user == 'anyone':
event.location = URI('sip:someone@somewhere.com')
return event.action.proxy()
...
agent.attach("incoming", route)
sipapi.run()


The event object available to the incoming message handler has an agent property representing the original Agent object. This allows you to access state and configuration elements on the agent. The API also defines a Location class to store contact locations from an event, and retrieve contact locations for a URI. The incoming event has several action methods such as accept, reject, proxy, redirect, challenge. If an action is not invoked, then it calls the default action method that responds with '501 Not Implemented' response to the incoming message.

The API and server can be extended to implement additional features such as NAT traversal, presence server, etc. For example, new event types can be defined to indicate presence change and allow the user to take action on these events. Alternatively, additional modules can define methods to modify SDP or SIP request to handle NAT traversal similar to how SER's nathandler module works. If you would like to work on these server extensions, do let me know.

Saturday, June 27, 2009

How does Google video chat work in gmail?

This post is just a speculation (aka guess work).

Google has a video chat function from within the Gmail web pages. This function is not available in the GTalk client yet. Google requires you to download a plugin which enables the video chat function form gmail. The video is rendered using Flash Player. In this article I present my understanding of how it works.

Flash Player exposes certain audio/video functions to the (SWF) application. But the Flash Player does not give access to the raw real-time audio/video data to the application. There are some ActionScript API classes and methods: the Camera class allows you to capture video from your camera, the Microphone class allows you to capture audio from your microphone, the NetConnection/NetStream classes allow you to stream the video from Flash Player to remote server and vice-versa, the Video class allows you to render video either captured by Camera or received on NetStream. Given these, to display the video in Flash Player the video must be either captured by Camera object or received from remote server on NetStream. Luckily, ActionScript allows you to choose which Camera to use for capture.

When the Google plugin is installed, it exposes itself as two Camera devices; actually virtual device drivers. These devices are called 'Google Camera Adaptor 0' and 'Google Camera Adaptor 1' which you can see in the Flash Player settings, when you right click on the video. One of the device is used to display local video and the other to display the remote participant video. The Google plugin also implements the full networking protocol and stack, which I think are based on the GTalk protocol. In particular, it implements XMPP with (P2P) Jingle extension, and UDP-based media transport for transporting real-time audio/video. The audio path is completely independent of the Flash Player. In the video path: the plugin captures video from the actual camera device installed on your PC, and sends it to the Flash Player via one of the virtual camera device driver. It also encodes and sends the video to the remote user. In the reverse direction, it receives video (over UDP) from the remote user, and gives it to the Flash Player via the second of the virtual camera device drivers. The SWF application running in the browser creates two Video objects, and attaches them to two Camera object, one each for the two virtual video device, instead of attaching it to your real camera device. This way, the SWF application can display both the local and remote video in the Flash application.

What this means is that for multi-party video calls, either (1) the plugin will have to expose as more video devices (is there any limit on devices?), or (2) somehow multiplex multiple videos in same video stream (which is CPU expensive), or (3) show only one active remote participant in the call (which gives bad user experience).

An open question to ask: will it be possible to use the Google's plugin to build our own Flash application and somehow use our own network application/protocol to implement video call? Hopefully Google will make the plugin API available to public some day.

Monday, June 22, 2009

Problems in RTMP

Adobe's RTMP or Real-Time Messaging Protocol was recently made available to public as an open specification as part of Adobe's Open Screen initiative. Most of the protocol has already been implemented in third-party software such as Red5, rtmpy and rtmplite much before this specification became public. In this article I take a critical look at the protocol.

There are three parts in the specification: (1) RTMP chunk stream, (2) RTMP message format and (3) RTMP command messages. At the high level, there are different types of messages such as command, data, audio and video. The last specification describes the high-level RPC (remote-procedure call) for various commands and their responses such as creating a network stream or publishing a stream. The actual formatting and parsing of individual types in a command are specified using AMF (Action Message Format) which comes in two flavors: AMF0 and AMF3. The messages that control the protocol such as setting the window size of lower layer or bandwidth for the peer, are specified in the second specification. Finally, the first specification defines the low level chunk format and separates the high level message stream from low level transport (chunk) stream.

The first (and worst) problem with RTMP is that it is overly complex in doing what it does. One reason is that it was poorly designed without extensibility or competing peer protocols in mind, and later on "fixed" itself to extend new features. As an example of complexity: the chunk stream ID field in the first specification was initially intended to be up to 63 but later extended to 65599. For ID 2 to 63, the first byte stores the value in its most significant 6 bits. For ID in the range 64-319 the second byte stores the value minus 64, whereas the first 6 bits of first byte store 0. For values between 64-65599, the second and third bytes store the value using a complicated formula whereas the first six bits of the first byte store 1. Another example is the timestamp field which is 24-bits. However, the protocol supports 32-bits timestamp such that if the value is more than 24-bits than the 24-bits are all 1's, and the actual (extended) timestamp is stored after the header. What is surprising is that a binary protocol called RTP (Real-time Transport Protocol) existed before RTMP was conceived, and had well defined and well thought-of message layout. For example, RTP has version field for extensibility, and 32-bit timestamp. Unfortunately, RTMP didn't learn from the peer protocol and suffered in the form of excessive complexity.

RTMP is designed to work only on TCP, and cannot work on UDP without several modifications. One well understood conclusion of early Internet multimedia research was that UDP is better suited than TCP for real-time media transport. While RTMP calls itself as real-time, it was designed to work solely on TCP. There is no sequence number to handle lost packets, hence it relies on the lower layer (TCP) to provide guaranteed packet delivery. Note that timestamp cannot be used to detect lost packets. The header optimization does not work if packets are delivered out-of-order. The new RTMFP does work over UDP but has its own set of problems and is not yet an open specification.

RTMP has several unnecessary elements. The chunk stream mechanism is not necessary and actually hurts the performance of real-time media transport, besides complicating the implementation. In particular, for client-server communication where typically number of connections/streams between one client-server pair is one, there is no good advantage of using chunks. It can have advantage in server-to-server communication in avoiding head-of-line blocking of one stream from another. Secondly, the initial bulky handshake of RTMP which, I believe, was intended to measure bandwidth or end-to-end latency, actually is not useful.

Media and control path should be separate. The IETF Protocols such as RTSP or SIP as well as ITU-T protocol H.323 exhibit this separation by delegating the media transport to separate RTP stream. This has several advantages because control path usually travels through application servers that are CPU and memory intensive, and have different scaling requirements than media servers which are bandwidth and disk intensive. Separating media from control path achieves scalability, robustness and distributed component architecture in the system. On the other hand, in RTMP control goes hand-in-hand with media. For example, the application server that handles shared objects and conference state, also handles media storage and transport.

RTMP has inconsistencies. First example is the use of some data types. The stream ID field appears at several places in the protocol, in different forms: 32-bit little endian, 32-bit big endian, and 64-bit floating point number. Second example is incoherency between layers: The default chunk size is 128 bytes. The default real-time audio captured from microphone is streamed to the server using Nellymoser encoded audio packets with two frames per packet. Each Nellymoser encoded frame is 64 bytes. Besides, there is a one byte header indicating the codec type. Thus each packet in the default case is 129 bytes. Thus, under default operation, a Flash Player should immediately change the chunk size from 128 to 129 to accommodate a full audio packet in a chunk (so as to avoid fragmenting it which will be inefficient). Going off by 1 byte indicates that something went wrong while designing the protocol for the default case.

When rest of the world was moving towards open standards such as RTP, Adobe embraced closed and proprietary RTMP. Adobe has been a proponent of proprietary technologies and imposing sub-optimal technologies to the developers and users. Another example is the RTMPE extension for encrypted RTMP communication. Readers are encouraged to read this article: "The major implication of this takedown notice is that Adobe has definitively told us that a fully-compliant free software Flash player is illegal. This is because RTMPE is part of Flash, circumventing RTMPE is illegal (in the US at least), and Adobe will never give a key to a free software project since they cannot hide the key. As a result, Flash cannot truly be a standard..."

Saturday, May 23, 2009

APIs for SIP applications

There are three types of SIP APIs: (1) source code API such as ones defined by JAIN SIP, libsip++ and pjsip, (2) high-level API to control the per-call behavior such as CPL, SIP-CGI, SIP Servlet, LESS, or (3) pseudo-code style API to control the behavior of the server or client such as SER or sipp config files. They all serve different purposes. The source code API is needed to create new software applications using existing libraries, the high-level API creates easy to define services, and the config file allows creating server or client behavior/scenarios.

With Python as the programming language, it is possible to create single software to expose these different types of APIs. This is because Python source code looks like human understandable pseudo-code if written with care. It makes sense to expose the APIs in my P2P-SIP 39 peers project to support such behavior. In this post, I present some of my initial thoughts on how to implement the generic API.

Firstly, the existing SIP module (rfc3261.py) already has an easy to use source code API. This is further enhanced in additional modules such as voip.py for user-agent specific functions. In particular, the source code API exposes object-oriented classes such as Transaction, Dialog, Stack, etc., to perform the functions defined in various layers in RFC 3261.

Secondly, the high-level API such as CPL and SIP-CGI can be implemented using per-user scripts in python itself (instead of XML for CPL, for example). The SIP server or client can import the per-user script based on the request-URI of the request and handle the processing. An example script to perform redirect to voice mail is shown below:

def redirect_to_voicemail(event):
event.location = URI('sip:jones@voicemail.example.com')
event.action.redirect()

def proxy_then_voicemail(event):
if event['From'].uri.host.endswith('example.com'):
event.location = URI('sip:jones@example.com')
e = event.action.proxy(timeout=10)
if e in ('busy', 'noanswer', 'failure'): redirect_to_voicemail(event)
else: transfer_to_voicemail(event)

basic.addEventListener('incoming', proxy_then_voicemail)

Because of the un-safe nature of Python, we will lose a number of features provided by CPL, but nevertheless the programmable API is easy to read and write, and nicely integrates with the existing source code. Similar extensions can be built for user agent applications similar to the LESS programming API.

Thirdly, instead of having the SIP client or server import the per-user script, the client or server itself can be written in the script. The following example shows how to translate some parts of SER config script to create a new server scenario.

from sipapi import *
import re

_debug = True # enable debug trace
open(('67.93.12.18', 5060)) # listen on this ip:port for incoming packets

def route(event):
# sanity check section
if event['Max-Forwards'] and int(event['Max-Forwards'].value) <= 0:
return event.action.reject(code=483, reason='Too many hops')
if len(str(event)) > 8192:
return event.action.reject(code=513, reason='Message overflow')
# this is used by sipsak to monitor the health of server
if event.method == 'OPTIONS':
if event['From'].uri.user == 'sipsak' and not event.uri.user:
return event.action.accept()
...
basic.addEventListener('incoming', route)

run() # the loop to process the SIP listening point

Similar scripts can be written to create client scenarios similar to how sipp creates scenarios from configuration files.

I feel a generic SIP API will make the job of Python programmers easier, instead of having to learn new APIs and implement them in custom SIP servers and clients. In any case, the article just poses some ideas, and I will be happy to mentor any student who would like to work on this project!

Sunday, May 17, 2009

Why Open Source?

In this article I present my view on Open Source Software. There are a number of articles elsewhere that talk about advantages of Open Source and why it works [1][2][3]. (I use the term Open Source to mean Free Software, instead of diving into OSS vs FS debate.)

Point 1: Most software organizations are driven more by business objectives, and less by technology.

If you look around you notice several types of software companies: most of them are commercial ones like Microsoft, there are some like Sun that are commercial but pretend to do open source work, very few like Red Hat that build business out of open source software, and finally, there are some true open source software like Linux and Apache. I think most of Sun's Open Source attempts are for business reasons to provide competition to other existing businesses.

Point 2: Software programming is an art.

It is like learning a new language or doing sculpture or painting. Initially a programmer is too involved with the syntax and semantics of the language constructs and application. Once he is proficient, the programming comes naturally. For example, a good Java developer will think in terms of high level modules and classes, but once he starts implementing them things move smoothly instead of worrying about where to put the open bracket or when to create a new method. After a while, the software development process becomes an art of making beautiful, modular and efficient software. Some people have a built-in talent for the particular art, but most people learn by practice, practice and practice.

Point 3: Good art requires personal motivation.

Doing a scientific experiment is different than painting a new picture. At the broad-level, given the set of input data, and experimental setup, one is likely to achieve the same result in an experiment. On the other hand, an art piece requires inherent motivation of the artist and his personal inspiration to do something new, something different. Sometimes the inspiration gets driven by commercial interest, in which case an artist may end up producing art work without much motivation -- e.g., in the commercial Indian film industry, a music director has to produce tens or hundreds of scores per year, affecting the quality of the art as well as causing plagiarism from western music. On the other hand, motivated musicians who record one album a year do generate quality and innovative work.

Considering the above three points, a software engineer who is paid to work on a particular technology or piece of software is less likely to be personally motivated to create that piece of software. On the other hand, an open source developer who is not paid for his work initially, starts the open source work because of his personal motivation to create that piece of software. Thus, an open source software is more likely to be of better quality compared to a commercial software with the same amount of testing. Hence commercial software requires quality assurance to make it competitive with open source. Although quality assurance can reduce software bugs, it doesn't make the 'art' as good as the open-source version. If you compare the design of the Apache web server or SIP express router with their commercial counter parts, you can understand what I mean by 'art' in this context.

When I look at an open source software, I assume it reflects ideas and vision of the innovative and inspired developers who were motivated to write that piece of software. When I look at a commercial software, I assume it is created by software engineers who got paid to write code, to write documents, to test code, and more than that it is sold by salesmen who are paid to sell those pieces of software. I don't see much personal motivation or inspiration in the loop, and I don't aspect the design or the implementation of the software to be a good piece of art. (In small companies where all the people have same vision or inspiration about the software, it is possible to create quality work. However, things change over a period of time as the company grows and new software engineers are paid to perform work on other people's innovations.)

In the context of P2P-SIP, we haven't seen much of commercial interest beyond the few initial contenders. The main reason is that P2P-SIP inherently is peer-to-peer, and against the business model of services that the modern web/phone industry is so accustomed to. In other words, the industry has not figured out a way to make money out of open P2P-SIP. On the other hand, there are a number of developers who created open source prototype applications out of personal inspiration. The next steps for the open source P2P-SIP developer community: to build a bigger community, advertise and publish their work, prove that it works better than existing solutions, and use it on a daily basis!

Tuesday, May 05, 2009

Documenting RFC implementations

Just wanted to document how I am using the existing documentation such as IETF RFCs and Internet-Drafts in my 39 Peers project. You can looks at some samples at here or here.

I wrote a htmlify.py script. The script uses the SilverCity python package to decorate the base python code. To include the documentation from RFCs, the script interprets the python source code to identify lines such as

# @implements RFC2617 (HTTP auth)

If found, it downloads the particular RFC from IETF website, removes any formatting empty lines such as near page breaks, and numbers all the pages and lines. It then stores the resulting file as a text document which you can lookup and use in your documentation.

Furthermore, the script identifies lines such as

# @implements RFC2617 P3L16-P3L25

If found, the line is replaced by a HTML DIV block with content of the documentation in RFC2617 text file from page-3 line 16 to page-3 line 25.

I find this technique pretty handy, and keeps my decorated source code with inline documentation extracted from RFCs and drafts, instead of I having to write extensive document.

Friday, February 27, 2009

Problems due to NATs and firewalls

Network Address Translators (NAT) and firewalls create problems for end-to-end connectivity on the Internet. This not only affects P2P-SIP but also client-server SIP. In this article I post some example numbers to illustrate the point.

These numbers are for example only: suppose there are 10% public Internet nodes, 30% nodes behind good (cone or address restricted) NAT, 30% nodes behind bad (symmetric) NAT and 30% nodes behind UDP blocking firewalls (F). Let's denote these as P=10%, G=30%, B=30%, F=30%. Here the public Internet nodes are typically from universities and research institutes, those behind good NAT are usually from residential DSL/Cable access, those behind bad NAT are partly from residential and partly from enterprise environment, and those behind UDP blocking firewalls are from enterprise and corporate networks. Suppose a call event between any two pair of nodes is independent of each other for the probability analysis purpose and nodes are equally likely to call any other node. Thus, percentage of calls between two public Internet nodes is (10%)^2 = 0.01 = 1%.

Now let us enumerate the NAT and firewall traversal techniques available to SIP. STUN helps with good NAT, whereas TURN relay is needed for bad NAT. ICE is used to negotiate the connectivity using STUN or TURN bindings. A TCP-based relay (or even HTTP relay) is needed for UDP blocking and very restricted firewalls. (what about TCP hole punching and other techniques?) A STUN server is light in terms of bandwidth utilization, whereas a TURN relay needs high network bandwidth and hence costs the service provider more money. Same is the case with TCP-based relay.

In a call if one participant is behind a UDP blocking firewall (F), then the call must use a TCP relay. This amounts to 1-(1-F)^2 = 51% calls going through TCP relay.

In a call if both participants are behind bad NAT, then we need a TURN relay. This amounts to B^2 = 9% of the calls.

If one participant in a call is either on public Internet or good NAT and other is on public Internet, good NAT or bad NAT, then the media can go end-to-end using STUN bindings. This amounts to 40% of the calls.

In conclusion, the VoIP provider will need to host UDP or TCP relays for 51+9=60% of the calls. This is not a good proposition.

In real world, the call events are not independent of each other: probability of a corporate user calling another corporate user within the same corporation is high. Also probability of a home user calling another home user is also high. For example, a SIP service targeted towards consumers can expect to have most of the calls among residential users. Thus, the percentage of calls that can be end-to-end is much higher than 40%. Similarly, an enterprise VoIP system can expect to have mostly internal intra-enterprise calls, which do not need to cross the enterprise firewall. Hence the percentage of calls needing the relay is not as high as 60%. Let us analyze these two use cases separately.

Suppose, for a consumer SIP service, the distribution of nodes is P=15%, G=50%, B=30%, F=5%, i.e., less number of users are from bad NAT or UDP blocking firewalls. In this scenario about 20% calls need a relay whereas 80% calls don't.

In an enterprise VoIP system, suppose 60% calls are intra-office and 40% are with outside the office network, then only those 40% calls need a relay whereas 60% calls don't. In a properly engineered enterprise VoIP system, appropriate ports are opened for UDP as well as appropriate media relays are installed in DMZ which facilitates smooth media path for inter-office communication.

While we can play with these numbers as much as we want, the fact remains that a significant percentage of calls need media relay, either UDP TURN relays or TCP relays. This puts unnecessary burden on the VoIP service provider to install and manage relays and buy network bandwidth for those relays, or simply disallow calls that require relay (in which case they may lose customers).

In a peer-to-peer system with super nodes such as Skype, these super nodes can act as media relays and hence save a lot of bandwidth and maintenance cost for the provider. There are some things to consider though: a node behind public Internet can become UDP as well as TCP relay for any call, whereas a node behind good NAT can become only UDP relay with some workaround, but not a TCP relay. This puts too much burden on nodes behind public Internet.

Let us consider the original example with P=10%, G=30%, B=30%, F=30%. In this case the 51% of calls that require TCP relay must use one of the 10% P nodes. When acting as a relay, the bandwidth requirement at the relay is twice that of when the node is in a call. Suppose each node makes N calls a day, and generally speaking needs bandwidth for N calls. However, a public Internet node not only needs bandwidth for its own N calls, but also for relaying 5xN calls of other users which amounts to total bandwidth for 11xN calls. Thus, while the super-node architecture is beneficial to the provider, it heavily punishes users on the public Internet. (My guess is that number of public nodes using VoIP are about 4-5%, which further burdens the public nodes).

A managed P2P-SIP infrastructure can be a good alternative, where corporations and universities donate hosts/bandwidth on high speed network to act as relays/super-nodes. Alternatively, one can have an incentive system to promote hosts to become relays and super-nodes.

Tuesday, February 24, 2009

RTMFP vs SIP

Adobe's RTMFP is not P2P-VoIP as exemplified by Skype. On the other hand, RTMFP is closer to client-server SIP or H.323 where signaling happens via a server and media path can be end-to-end between the endpoints. When people refer to RTMFP as P2P, it is more like 'end-to-end media' similar to client-server SIP.

Why is RTMFP important? The previous Adobe protocol RTMP is strictly client-server even for media path. This gives poor quality for real-time media communication because media packets go from client to server, that too over TCP, and then are redistributed to the other client, again on TCP. End-to-end media based VoIP systems existed before Adobe implemented RTMP. I suppose the difficulty of NAT and firewall traversal and lack of interactive video communication requirement in Flash Player resulted in RTMP. Adobe corrected this mistake in the new protocol RTMFP which allows NAT and firewall traversal (to some extent) and allows end-to-end media path without going through the server. Although, the signaling is still going via the central server.

Once we understand this difference between P2P-VoIP and RTMFP, lets enumerate the differences between an RTMFP-based and a client-server SIP-based communication system.

1. RTMFP is a closed protocol, although Adobe recently opened up the previous RTMP. On the other hand, SIP is an open standard from IETF. This means anyone can implement SIP whereas only Adobe can implement RTMFP. That also means that a bug in the RTMFP protocol or its implementation is outside the scope of public review such as for security experts.

2. RTMFP is an integrated protocol that has support for signaling, encryption, media flow (flow control and congestion control), NAT traversal. Whereas SIP is just one piece of the puzzle, that is used in conjunction with RTP/RTCP, SDP, STUN, TURN, ICE, SRTP, etc. to build a complete system. In that regard there is more scope for interoperability problems in SIP systems. The SIP interoperability test (SIPit) events have helped in solving interoperability problems among current products for over a decade. (see next point on why RTMFP alone may not be sufficient?)

3. Based on the available documentation, RTMFP works on UDP. Whereas SIP can work on UDP as well as TCP. In an RTMFP application, the client should fall back to TCP-based RTMP if for some reason UDP is blocked for the client-server communication. This also means that the client will lose some of the benefits such as encryption available in RTMFP. There are other protocols RTMPS and RTMPE to facilitate security and encryption over TCP-based RTMP.

4. Although RTMFP works on UDP, it implements additional flow control and TCP-friendly congestion control. This helps media traffic deal with network congestion and slow receivers. On the other hand most existing SIP system do not implement such mechanisms in the media path. While this looks like an advantage in RTMFP, it turns out to be a problem because of the way it is implemented. In particular, the network components are disconnected from the media source components such as camera and microphone. The rate control mechanisms are implemented in network components which internally slow down the media traffic by delaying or dropping the UDP media packets. On the other hand the encode quality settings on camera and microphone components are unaffected. This results in packet drops due to congestion and hence choppy video or audio drop-outs. A good application built on top of RTMFP is supposed to get feedback from network components and adjust the encode quality parameters (framerate, bitrate, quality) in the camera and microphone components so that the packet drops are reduced. Thus, unless the application is smart enough to deal with this, the disconnected implementation of rate control and media source causes quality problems in RTMFP.

5. Both RTMFP and SIP can use media relays to workaround NATs and firewalls. However, RTMFP does not use a super-node architecture where some clients (Flash Player instances) act as relays, whereas (P2P) SIP can use existing client nodes to act as media relays. This means that when using RTMFP, the service provider must bear all the bandwidth cost of the relays, whereas in (P2P) SIP the cost can be distributed among the users because of the peer-to-peer nature. I analyze the cost due to NAT and firewall traversal in my next post.

Sunday, February 22, 2009

Why does client-server video conference fail?

I analyze some problems in client-server communication for multi-party video conferencing.

Audio communication differs from video in two important ways: (1) usually in a conference only one person is speaking at any time whereas everyone's video is on, (2) audio codecs are usually fixed bit-rate whereas video codecs adjust bit-rate based on various parameters such as available network bandwidth and desired frame-rate.

Problem 1:
In a client server mode, because video coming from one participant needs to be distributed to all the other participants, the bandwidth and processing requirement at the server can be higher; unlike audio where usually only one person is speaking. Secondly, the downstream video bandwidth requirement at the client increases with the number of participants in a conference. In an N-party conference, each client will have usually one outbound audio stream, one inbound audio stream, one outbound video stream and N-1 inbound video streams. Note that this problem is worse for peer-to-peer (P2P) video conference, where everyone is sending video stream to everyone else: in which case there are N-1 inbound and N-1 outbound video streams at each client. For asymmetric network access (ADSL or Cable), where upstream bandwidth is lower than downstream, this causes early saturation in outbound network bandwidth. Shutting down video stream or reducing the video quality while a person is not speaking saves some bandwidth especially for speaker mode conferences.

Problem 2:
Second point of difference is that audio is usually encoded using fixed bit-rate codec whereas video bit-rate is adjusted based on several parameters such as available network bandwidth, desired quality and frame-rate. In a client-server environment most implementations use the client-to-server network quality information to decide what bit-rate to use for client's video encoding. Consider a two party client-server conference, where first client is closer to the server hence has lower latency. The first client decides to use high quality high bitrate video encoding. On the other hand the second client decides to use low quality low bitrate video encoding. This asymmetry causes the first client to receive poor quality video whereas the second client's downstream link gets congested with high bitrate video. The problem is further aggravated if in a multi-party conference there is only one participant on poor quality network. The problem is caused because we use client-server network latency metric instead of end-to-end network latency metric in deciding the video encoding bitrate.

Problem 3:
Sometimes, the conference server imposes bitrate control to limit the traffic towards a low bandwidth client. However, for efficiency reason the server doesn't re-encode the video packets. Instead, it just drops non-Intra frames if there is not enough bandwidth. This causes marginal to no improvement primarily because Intra frames are several times bigger than other frames. Secondly, it causes choppy video which further degrades the experience. The layered encoding in MPEG solves this problem.

Problem 4:
Larger video packets may not traverse end-to-end over UDP. An encoded audio packet is usually small, of the order of 10-80 bytes per 20 ms. On the other hand an intra-frame video packet size can be much larger, say 1000-10000 bytes. When media packets are sent over UDP, and the packet size is large, there is high probability of getting the packet dropped. This is because of the MTU restriction and middle-boxes (NAT and firewall) in the media path. An UDP packet of size larger than MTU (typically approx 1300-1400 bytes) gets fragmented at the IP layer such that subsequent fragments after the first one do not have the UDP header information (such as source and destination port numbers). A port inspecting NAT or firewall that doesn't handle fragmentation correctly may drop such subsequent fragments, causing loss of the whole UDP packet at the receiver end. Thus, video over UDP has to take care of additional fragmentation and reassembly, and/or discovery of path MTU in the application layer.

Problem 5:
The server may allow video over UDP as well as TCP from the clients, typically to support NAT and firewall traversal. If some clients are over TCP and others over UDP, then the server also needs to proxy packets from one to other. If the client over TCP assumes ordered packet delivery, then the server will also need to do buffering, packet re-ordering and delay adjustment, which further adds to the implementation complexity of the server. The problem is not that visible for audio beyond a glitch in sound, whereas for video the view may get completely corrupted until the next Intra frame.

Problem 6:
A slightly related problem is when the conference server does audio mixing but video forwarding. In this case, the server must perform delay adjustment, packet re-ordering, and buffering for the audio path. However, for efficiency reason it may blindly forward the video packets among the participants. Thus the synchronization information between the audio and video gets lost, and performing lip synchronization at the receiving client becomes a challenge. A correct implementation of the server should act as an RTP mixer, i.e., include the contributing source information in the mixed audio stream, and distribute RTCP information to all that participants for synchronization. (How to do this if each audio call leg is a separate RTP session?)

Some of these problems (2,3,5,6) can be solved to some extent by using peer-to-peer video conferencing.

Friday, January 16, 2009

Asynchronous Internet Programming

Programming an Internet application requires several asynchronous events such as network events, file I/O, timers and user interactions. In this article I walk through some existing design patterns for asynchronous programming.

Most programmers think synchronously, i.e., in a single control flow. For example, in a web server implementation, a socket is opened for listening, and when an incoming connection is received it accepts the connection, receives the request, and responds to the client. There are several steps in this flow that can block, e.g., waiting for incoming connection or incoming request. If a resource such as thread or process is blocking, it causes inefficiency in the overall system design. For example, if one thread is blocked serving a request from one client, another thread needs to be created to serve request from another client. Creating a thread-per-request causes too many thread creation and deletion in the system when the request-per-minute load increases.

Event handlers:
The first approach is to break your program into smaller chunks such that each chuck is non-blocking. Then use a event queue to schedule these chunks. The earlier windows programming (WinProc) falls under this category. The main loop just listens for the events on an event queue. When a user input is received such as mouse movement, click or keyboard input, a event is dispatched to the event queue. The event handlers installed in the program handle the events, and may post additional events in the queue. External events such as socket input can also be linked to this event queue. One problem with this design is that several event handlers need to share state (hence global variables) which results in complex software.

while not queue.empty():
msg = queue.remove(0)
dispatch(msg) # calls the event handler for msg

Some event-driven programming languages such as Flash ActionScript enforces this event handler design practice coupled with object-oriented design. Typically, in ActionScript, individual object has event queue and can dispatch events, instead of exposing a global event queue to the programmer. An object can listen for events from another object, e.g., a controller program can listen for click event on the button, and handle it appropriately. Thus, there can be some data hiding, information separation, and better software design using event handlers. In other languages (C++) libraries such as Poco that facilitate event driven object oriented programming. Nevertheless, one problem with this design is that logically similar source code needs to be split across different functions and methods, and sometimes different classes. For example, a timer event handler defined as a separate function from the timer initialization code.

button.addEventListener('click', clickHandler);
...
function clickHandler(event:Event):void {
...
}


Inline Closures:
Java as well as ActionScript solve this problem using inline closures. In particular, you can define event handlers inline within a method definition, thus keeping the logically similar code together in your program file. However, this is not always possible, e.g., if the same timer needs to be started from several places in the code. In this case, it does make sense to keep the timer handler as a separate method.

timer.addEventListener('timer', function(event:TimerEvent):void {
... // process timer event
}


Deferred Object:
Twisted Framework solves this problem using the Deferred Object pattern. The idea is as follows: instead of breaking the control flow source code every time a blocking operation occurs, the source code is broken when the result of the blocking operation is needed. This makes a lot of difference in programming, and makes the software cleaner. For example, earlier when a socket connection was initiated we needed to install an event handler for the success and failure result and perform the rest of the processing in those event handlers. Now, using the Deferred Object pattern, the socket.connect returns a deferred object, which is used in the same control flow as if it were a result of the connect method. The result object can be passed around elsewhere like a regular object. Only when we need to do something on completion of the connection, we wait on the result object'. This can be done either synchronously using result.wait() or asynchronously by installing a success and error callbacks. Provisions are there to allow one deferred object to wait on result of another deferred object before proceeding. An example follows:

result = socket.connect(...)
result.wait()
if result:
... # connection successful


Co-operative multitasking:
Python's multitask module allows cleaner source code by taking advantage of co-operative multitasking. The idea is to build application level threads of control which co-operate during blocking using the built-in yield method. A global scheduler runs in the main application. I find this to be the most clean design in my programming. An example follows.

data, addr = yield multitask.recvfrom(socket, ...)
yield multitask.sendto('response', addr)


One major problem with these approaches is that because of the inherent underlying single event queue software architecture, we cannot take advantage of multi-processor CPU architecture easily. Thus, we must use multi-threaded design in our software. Most of the earlier designs can be extended to multi-threaded application with some care. For example, one can run multiple global event loops listening on a single shared and locked event queue or one event-queue per thread where the dispatcher schedules it to the appropriate thread's queue. Clearly, the first one is more efficient from a queuing theory perspective. Care must be taken to lock the critical section in event handlers, or control flow that may get accessed from different threads at the same time.

Consequently, we get several multi-threaded designs: thread-per-request, thread-pool, two-stage thread-pool, etc. The performance comparison of these designs in the context of a SIP server is shown in my paper titled Failover, Load Sharing and Server Architecture in SIP Telephony, Sec 6.

Thursday, January 15, 2009

Programming languages for implementing SIP

The programming language used for the implementation can affect the software architecture. For example, Flash ActionScript is a pure event-based language with no way of implementing a blocking operation. Hence, when a connection is made on a socket object, the socket object will dispatch the success or failure event. The caller installs the appropriate event handlers to continue the processing after the connection is completed.

C
There are two main reasons for implementing SIP in C: ability to compile on several platforms and very high performance. The primary advantage of implementing a SIP stack in C is that it can be easily ported and compiled on variety of platforms especially embedded platforms. Usually a C compiler is available for a platform, whereas others such as Java interpreter or C++ compiler may not be. Secondly, because there is no overhead (e.g., in terms of run-time environment and code size), the performance is usually the best. The main problem with implementing a SIP stack in C is the development time and cost of maintenance of the software. Finding bugs and adding a feature in a C program is usually more challenging than other languages. However if the software is well designed then the problem can be alleviated to a large extent. In any case, the number of lines of code that needs to be written in C is usually much more than the other high level languages such as Java or Python.

The pjsip project presents an implementation of SIP and other related protocols. It has been used in a variety of real-world projects and has proven itself to be a good SIP implementation especially where performance matters or the resources are limited.

C++
The object oriented design allows better reusability and maintainability compared to programming in C. However, the number of lines of code is still large. If advanced C++ features such as standard template library (STL) are used then portability may become a concern for certain embedded platforms.

One of my earlier SIP implementation was in C++ (and C) at Columbia University. Another example implementation is reSIProcate, an open source SIP stack.

Java
The Java programming language is very popular among corporate world and enterprise application developers. The standards community has developed APIs that cover several aspects of SIP implementation and some of its extensions. This allows the application to be built against those APIs whereas the actual implementation of the SIP stack can be provided by several vendors. The main problem with an implementation written in Java is that it tends to be too verbose. Java on one hand gives the illusion of a very high level language, but on the other hand requires the programmer to write a lot of code even to do a small thing. Part of this is because of the way the language is defined -- in particular the exception handling and strict compiler enforced type checking. This requires the programmer to do a lot of typecasting and hence there is potential for run-time errors. Another problem with Java is that the run-time could become a memory hog.

NIST SIP Stack provides the reference implementation of the JAIN SIP API.

Tcl
The Tcl programming language is not as popular as the other high level languages such as perl, PHP or Java. However, because of the simplicity in the language construct it is very easy to learn. Unless the software is designed right, it becomes very difficult to maintain a large piece of software.

Columbia’s SIP user agent, sipc, is written in Tcl.

ActionScript
ActionScript (or ECMAScript 4), improves on the Java programming language by allowing much smaller source code size of the implementation and much shorter syntax for common operations. However, there are two major limitations: the platform is limited to Flash Player or AIR (Adobe integrated run-time) and the language is purely event-based. The limitation of Flash Player may not seem important at the beginning, but prevents certain features. For example, the current version of Flash Player doesn’t have UDP or TLS sockets that could be used for SIP. Even the functions of a TCP socket is limited in that you can only initiate connections but cannot receive incoming connections. This prevents us from implementing a complete SIP stack in ActionScript without support from Flash Player or an additional plugin. The Flash Player version for embedded devices is usually older than the current version which makes portability an issue. Because of the run-time overhead the performance is limited. The media codecs supplied by the Flash Player 9 or earlier are proprietary codecs that are usually not supported by any other implementation. This causes interoperability issues as well.

Python
The object oriented nature, the compact coding style and very small source code size of the implementation makes Python a very good choice for implementing application protocols such as SIP. There is some overhead because it is an interpreted language; however the overhead is comparable to that of Java run-time. The interpreter is now usually available for embedded platforms as well, making it more portable than other languages such as ActionScript or C++.
The biggest advantage of an implementation in Python is that the code size is drastically smaller than other languages. I have implemented the basic SIP stack in less than 2000 lines of Python source code. Compare this with the Java implementation of SIP which has more than 1000 files. The lower size means that not only the development time is smaller, but also the testing, code review and maintenance cost is much lower.

P2P API

In this article I describe several sets of APIs for P2P. For a software engineer, designing a good API is very important. A good abstraction of the underlying concept leads to a good API. For example, the data model of storing key-value pair in a P2P network is usually abstracted as a hash table. Programmers like seeing existing semantics in an API because they are already familiar with those semantics. This article uses API ideas from OpenDHT, JXTA, Adobe Flash Player, and IETF P2P-SIP draft.

A good API
  1. should take the form of use-cases
  2. is very general and concise: does one thing and does it well
  3. is self-explanatory and is similar to existing concepts, models, practices
  4. is independent of implementation details
At the high level, there are three abstractions for P2P APIs: data storage, peer connectivity and group membership. There are several special cases among these abstractions.

Data Storage
A distributed hash table (DHT) is a form of structured P2P network with data stored using hash table abstraction. In particular, it provides put, get and remove API methods. Programmers are familiar with container semantics of hash-tables. In Python, this looks like:

a = DHT()
a['key1'] = 'value1'
print a['key1']
del a['key1']

Let us apply this to a use-case of P2P-SIP user location storage. In particular, the key is the user identifier of the form 'kundan@example.net' and value is the user location of the form 'kns10@192.1.2.3:5062'. With this we see several problems in the existing container semantics of the API listed above. Firstly, a user can have several locations, in which case a call request is sent to all those locations as in SIP forking proxy behavior. The following modification to the API causes more confusion because set takes a single value whereas get returns multiple.
a['kundan@example.net'] = 'kns10@192.1.2.3:5062'
print a['henning@example.net'] # prints all contacts of henning

To solve this we can assume a 'set' semantics for a[k]
a['key1'] += 'value1'
a['key2'] += 'value2'
print a['key1'] # print a list of values
a['key1'] -= 'value1' # remove specific key1-value1
del a['key1'] # remove all values for key1

Another problem is that this API is not secure or authenticated. A secure DHT API based on public-key infrastructure can sign the stored value on set and del.
a = DHT(privatekey=...)   # supply the private key of the owner
print a['key1'] # print all values for given key
print a(owner=...)['key1'] # print values only by given owner
A third problem is that the API is not extensible to pure event-based languages such as Flash ActionScript. In ActionScript, there are no blocking operations, hence set and get much be defined asynchronously. ActionScript already has SharedObject abstraction to deal with remote shared storage of objects. This can be reused for the API as follows:
so = DistributedSharedObject.getRemote(privatekey=...)
so.setProperty('key1', 'value1') # sets the key1-value1 pair
so.retrieve('key2') # initiates get
so.addEventListener('sync', syncHandler)
so.addEventListener('propertyChange', changeHandler)
function syncHandler(event:SyncEvent):void {
... # put is completed
}
function changeHandler(event:PropertyChangeEvent):void {
if (event.property == 'key2') # get is completed
trace(so.data['key2'])
}

Another problem is that the API doesn't take into account the time-to-live of key-value pair. This can be solved by supplying a default timeout in the constructor.
d = DHT(privatekey=..., timeout=3600)   # default TTL of one hour

Note that a data storage API can be built on top of a routing and connectivity API. Since we would like to separate the abstractions (data storage and peer connectivity/routing), we will define separate sets of APIs for these.
d = DHT(net=..., .privatekey=..., timeout=...) # use the given connectivity (net) object

Peer connectivity and routing
The connectivity and routing layer deals with maintenance of P2P network. Programmers are familiar with socket abstraction and there has been effort to map P2P to socket abstraction [2].
s0 = ServerSocket(...)  # a P2P node is created
s0.bind(identity=..., credentials=....) # node joins the P2P network
The bind method is similar to the JXTA semantics, and similar to the attach method as proposed in IETF P2P-SIP work. Actual communication with a specific peer can happen over a connected socket. A connected socket is returned in the connect or accept API methods on the server-socket.
s1 = s0.connect(remote=...) # connect to the given remote identity
s2, remote = s0.accept() # receive an incoming connection from remote
The connection procedure takes care of negotiating connectivity checks using ICE (or similar algorithm) to allow NAT and firewall traversal.

Once connected, the socket can be used to send or receive any message to the peer.
s1.send(data=..., timeout=...)
data = s1.recv()
The original server-socket can be used to send or receive messages to specific peers without explicit connection.
s0.sendto(data=..., remote=...)  # send to specific peer
data, remote = s0.recvfrom(timeout=...)
Sometimes, a node needs to send a message to any available node close to a given identifier. This can be implemented using overloaded sendto method that takes either a remote address or a key identifier. The latter performs P2P routing to the given destination key identifier.
s0.sendto(data=..., key=...)
data, remote, local = s0.recvfrom(...)
if local == s0.sockname: # received for this node identifier
...
else: # received to the given key presumably close this node
key = local
A connected socket can be disconnected or a node removed from the P2P network using the close method.
s1.close()  # close the connection
s0.close() # remove the node
These abstractions allow building a distributed object location and routing APIs [1] using locality aware decentralized directory service.

Group membership
Group membership APIs are similar to the multicast and anycast socket APIs. A node can join or leave a group, and a message can be sent via multicast or anycast to one or more nodes in a group. Let us define a group identifier similar to that in JXTA. We extend the previous socket abstraction to create a new group.
s3 = s0.join(group=...)  # join a given group
The join method returns a semi-connected socket object which can be used to send or receive packets on the given group.
s3.send(data=...)        # send to every node in the group
s3.sendany(data=...) # send to at most one node in the group
data, remote, key = s3.recv() # receive a multicast or anycast data
The socket API gives an intuitive abstraction to understand the P2P concepts. The actual implementation of the group membership may be more complex than simple routing and connectivity. Closing a group socket leaves the group membership.

We have seen how a P2P API can evolve to accommodate various concepts into existing well known abstractions such as hash table and socket.

References:
1. Towards a common API for structured peer-to-peer overlays (2003) http://oceanstore.cs.berkeley.edu/publications/papers/pdf/iptps03-api.pdf
2. The Socket API in JXTA 2.0 http://java.sun.com/developer/technicalArticles/Networking/jxta2.0/

Monday, September 29, 2008

Welcome to 39 peers!

I have launched an open-source project named "39 peers". From the web site:

"The 39 Peers project aims at implementing an open-source peer-to-peer Internet telephony software using the Session Initiation Protocol (P2P-SIP) in the Python programming language. The software is still incomplete -- especially the P2P part.

Peer-to-peer systems inherently have high scalability, fault tolerance and robustness against catastrophic failures because there is no central server and the network self-organizes itself. Internet telephony can be an application of peer-to-peer architecture where the participants locate and communicate with each other without relying on expensive or managed service providers. 39 peers project is an attempt to provide a open source and free-for-all peer-to-peer network targeted towards open standards based real-time communication.

The 39 peers project is developed for student developers and researchers to experiment with new ideas. It is written in Python scripting language. It supports open protocols such as IETF SIP and RTP. It is licensed under GNU/GPL license."

Thursday, June 05, 2008

memcached

Recently I got a chance to look at memcached a distributed in-memory cache.

There are two ways of scaling databases: (1) replicating with master-slave such that reads can be done from any slave database whereas writes needs to be done in master, and (2) distributing the data among segments such that accessing a data needs to access only a particular segment instead of the whole database. The first technique doesn't work well for scaling SIP-related contact data, because number of writes (login/logout, presence updates) are significant compared to number of reads (call routing, IM delivery). Distributing the data among segments has been done before, and my thesis also describes it in the context of SIP telephony.

The memcached system uses this second technique, hence has much better performance in certain cases. It is targeted towards web applications that generate dynamic web content by accessing the database at the back-end. One could run memcached on several servers, potentially co-located with the web servers in the server farm. The dynamic web service (e.g., written in PHP) is then modified to first look in the cache, and if not found then query the actual database (and update the cache as well). The cache provides a distribute hash-table (DHT) interface.

When the data is updated in the cache, it gets stored in the appropriate memcached instance based on some hash of the data key. When a query is done, that instance is queried for the data. Each cache instance can be configured with a limit of memory size, typically 1-3GB based on the available memory in the system. The distributed nature of the cache allows you to linearly scale the cache by just adding more instances of the cache on different machines.

At first glance this looks very promising for use in a P2P-SIP application. However, there are some serious limitation. Although memcached has built-in hashing mechanism for data storage, there is no automatic replication of data (hence no fail-over), the hashing needs to be implemented by the client itself (hence no recursive queries). In particular, memcached implements only one part of a full P2P data storage application such as OpenDHT. Because of this limitation, it cannot be used effectively for P2P-SIP client implementations without adding a lot of additional software. Nevertheless, memcached can potentially be used for P2P-SIP proxy server farm.

As a new project idea, it would be interesting to try to use memcached in openser SIP proxy server to readily build a P2P SIP server farm without much overhead! If you are project student willing to work on this, I will be happy to mentor you.

Wednesday, January 02, 2008

Business out of P2P-SIP

Here are some of my thoughts on how to make (or save) money with P2P-SIP.

There are two points to note: (1) an end-user doesn't know (or care) whether he communicates using client-server or peer-to-peer as long as he gets all the features (audio, video, text, conferencing, offline message, etc), security and speed. (2) unlike service infrastructure where the provider can make money out of user base (either charging per call or per minute as in Skype-out or via advertisements as in youtube?), there is no solid business model for P2P-SIP. Since it is open protocol, anyone can develop a client and break free from the provider. This of it as a email client technology -- how many vendors make money selling email clients?

There are some alternatives to this ideal view of P2P-SIP though. For example, one could control the user base -- keep the enrollment of the user controlled by the provider and allow communication among the clients of the same provider. This requires that the implementation be proprietary (at least the security and validation part), otherwise anyone may write an alternate implementation to bypass the validation. Thus the protocol doesn't remain open and is not P2P-SIP.

Alternatively, one could build a server-farm of SIP proxies using P2P technology so that the maintenance cost of the servers is reduced. However, this still requires dedicated set of servers (electricity, bandwidth, call centers, etc), hence the cost saving is very low. Even though you may be using P2P-SIP you won't save much money.

Thirdly, one could build and sell P2P-SIP software, e.g., for enterprises, that doesn't require expensive call managers or central servers. This works well, but has one problem. People have come to expect that communication, at least PC to PC, is free. Unless the client software provides tons of new features, which cannot be easily done by another vendor, selling such a client software is also difficult.

So, how do you make money? You don't. Instead, you can save money.

You can save money that you would otherwise be spend on maintaining the server infrastructure or you can enable new businesses that were earlier not possible because of close systems. Most of the current communication mechanisms have evolved in a tightly controlled environment, e.g., Mobile phones, Skype, Yahoo, etc. Instead what we need is an open mechanism similar to emails and web. An email client is not tied to a single server. A web client (browser) is not tied to a single server or web site. A communication client should not be tied to a single provider.

Consider an open system where anyone can call anyone else without paying to a provider beyond the Internet connection fees. This opens door for a whole new set of businesses. For example a small time vendor, consultant or topic expert can lend his service in real-time for a small fee without going through the hassle of setting up agreement with a provider. Existing systems such as social network sites or content distribution networks can have additional feature that allows a user to interact with his friends, or other users on the same site, or other users watching the same content, irrespective of the attachment of the user to the provider. There is a huge potential for innovations in the communication client space, similar to how web browsers have evolved.

So, who can do this? I don't think it will come from existing businesses who have large user base -- this includes the likes of Yahoo, MSN, AIM, or Skype. The reason is that they wouldn't want to open up their users to others. I also don't think it will come from existing businesses that are targeted for selling advertisements -- the likes of Youtube, Google, etc.

The question to ask is who benefits from ubiquitous P2P-SIP? Media companies and content owners should invest in making P2P-SIP ubiquitous so that their existing system is improved or becomes more appealing to end users. Such companies can probably sponsor open source projects for P2P-SIP. If there is a better and free communication infrastructure, all sorts of new innovations are possible. For example, people would like to talk with their friends while watching a TV series online, a cricket match online or a movie online.

In conclusion, I feel that big companies that can benefit from real-time communication, but don't have an existing solution to work with, should invest in open-source P2P-SIP technologies. If you are interested in talking about it, feel free to drop me an email.

Friday, June 23, 2006

Data format and interface for external P2P mode

Last month, Henning and I submitted an internet-draft on "Data format and interface to an external peer-to-peer network for SIP location service". The text and HTML version are available. The idea is to use an interoperable XML format for accessing the DHT (which is based on OpenDHT's interface) and an XML format for storing user location data such as contact address, cryptographic keys and certificates, offline messages, presence watcher information, etc.

Structured vs unstructured P2P or Why we chose DHT for P2P-SIP?

One of the questions that people ask me is 'whether structured or unstructured P2P works better for SIP-based Internet telephony?'. My answer is that structured P2P such as distributed hash table (DHT) works better because of the following reason.

Unstructured P2P networks usually rely on caching and replication to improve the lookup performance and reliability. In the case of file-sharing application, once the file named "matrix-II.mpg" is uploaded in a P2P network such as Kazaa, the content of the file remains the same (un-mutable). Thus, for reliability the file may get replicated at multiple places. Intermediate nodes in the download path may cache the file locally to improve download performance by other downstream nodes. On the other hand, a phone's contact location or IP address uploaded under the resource named "bob@home.com" may change often (i.e., mutable), e.g., if bob changes his device or the device's IP address is changed. This makes all the randomly replicated and cached copy of the resource content useless. Thus, random caching and replication is not useful.

In unstructured P2P network, any replica of the file "matrix-II.mpg" is good enough, whereas in Internet telephony you want to reach only "bob@home.com" but not some replica of Bob. You can not replicate Bob's phone. The only thing that can be replicated is the mapping between Bob's identifier and his phone's IP address. However, to allow frequent updates and consistent view of the mapping, it is desired that the mapping is replicated in a structured manner so that both Bob can easily update the mapping and any prospective caller can easily query the up-to-date copy of this mapping. This means given the resource name, we can know where the resource should be present in the P2P network. Thus, it is structured.

Unstructured P2P is suitable for random or blind search and allows wild-card search. For example, you can search for a file name containing "matrix" keyword. On the other hand, in Internet telephony most calls are made to the known destination (phone number or destination user identifier). Hence, hash table interface that allows lookup based on a key is good enough. Distributed hash table (DHT)-style P2P algorithms such as Pastry, Chord, Bamboo provide ideal platform for such lookups.

Skype is believed to be unstructured among the super-node, but I think there may be some form of structure in how the data is stored among the super-nodes. Plus, there is no guarantee on upper bound on lookup cost which means it might be falling back to some centralized lookup if the number of hops in search exceed some limit. Since Skype is proprietary, we do not know how exactly it does lookups.

Friday, March 03, 2006

New internet-drafts

Last week, there were some new I-Ds on p2p-sip.

draft-matthews-p2psip-nats-and-overlays-00.txt
The draft analyzes various issues with NAT in the context of P2P-SIP and presents some important conclusions: (1) partial mesh of peers is more suited than ring or full mesh connections, for example; (2) connections can remain mostly static for maintenance of p2p; (3) structured p2p is more suited because of bounded connection count; (4) symmetric P2P (such as Kademlia?) is more suited because the connections across NAT are bidirectional; and (5) use of superpeers where all peers behind a common NAT can elect a small number of superpeers which handle connections across the NAT. The basic idea is to understand the NAT limitations such as number of pinholes allowed and timeout, and apply these for a deployable architecture.

draft-shim-sipping-p2p-arch-00.txt

The basic idea is to use the SIP-using-P2P architecture. There is some overlap with my 'external DHT' technical report. Some points of contentions are: the draft proposes P2P lookup first, and then fall back to DNS (RFC3263) which has a negative effect on performance since P2P lookup is slower. The format of the record can reuse the SIP Contact header instead of specifying individual terms such as TCP80, UDP5060, etc. The login server is clearly undesirable. But the draft is a good starting point. We will need more maturity in areas of security, authentication, and data format.

draft-mayrhofer-enum-domainkeys-00.txt
It proposes to use telephone numbers instead of email like user@domain format because one can validate that the number owner actually owns the telephone. The main problem is that phone numbers are too rigid and tied to physical port or line, thus we cannot have identifiers for virtual users or devices, e.g., sip:lamp@cs.columbia.edu belongs to an electric X10 controlled lamp in our lab. Secondly, phone numbers are hard to remember, thus needs address book. Thus it doesn't make much difference whether we use phone number or email as long as we know the remote's public key. The main problem to be solved is the distribution of public key in p2p. A certificate-based approach can be used instead.

draft-quittek-p2p-sip-middlebox-00.txt
It tries to list various requirements for NAT traversal in P2P-SIP. Not much new value. Ignores some of the existing NAT traversal schemes such as ICE.

In conclusion, there are still a lot of open issues in P2P-SIP, most notably in security, authentication and NAT/firewall traversal. Any solution for P2P-SIP must reuse the SIP's NAT traversal schemes such as ICE. There is a new version of best practices draft for NAT traversal in SIP.

Friday, February 24, 2006

Thunderbird as a deployment vehicle

One of the reasons for success of Skype is that they already had a deployment vehicle in the form of KaZaA file sharing network. I believe that for mass deployment of any P2P VoIP tool including open standard based P2P-SIP we need free software availability and a deployment vehicle.

Deployment vehicle is just a mechanism or tool that already exists and is widely used, e.g., tools for email, web, file sharing, or something more basic as Windows OS. For example, someone can add P2P-SIP support in popular email clients such as Microsoft Outlook Express or Mozilla Thunderbird. The latter is easier since it is open source.

Project description: the Thunderbird client can be extended to also serve as a (peer-to-peer) voice and IM client. IM buddies would appear in the Thunderbird address book, allowing to display buddies and their online status. IM transcripts might be saved and searched in the same way that messages are saved. This work can leverage a SIP project, http://www.croczilla.com/zap/, and might build on the Cockatoo project at http://cockatoo.mozdev.org.

One issue is that not many people use Thunderbird compared to Microsoft product (although I believe that will change in future). Alternatively, a web interface similar to Gmail and google-talk can be provided that can be easily accessed from any web browser. The challenge is to use P2P instead of central web server.

Email based identity in P2P-SIP

Our p2p-sip architecture (see the previous 'external DHT' post) suggests using email address as user identity for SIP. In particular, our web CGI script (as a certifying authority, CA) generates the user certificate verifying that the user owns the identifier, say bob@example.net, and send this to the email address bob@example.net.

This serves two purposes: (1) no login server: unlike Yahoo/MSN/Skype that need to connect to 'login server' on every login attempt, our p2p-sip uses the (central) CA only for first time use. The user certificate also gets stored in the P2P network, thus no need to connect to the 'login server' again. (2) any identity provider: unlike Yahoo/MSN/Skype/G-Talk that tie the identifier to a particular provider (i.e., @yahoo.com for Yahoo users), we can allow any user identifier as long as the identifier belongs to him.

To avoid use of a single identity (service) provider, one option is to use user's email address as his SIP user id. Thus, identity verification just involves making sure that his user id is a valid email address that belongs to him, and that user has a certificate that proves this.

Our web CGI script generates user certificate, where the user public key and certificate request is supplied by the user (automatically by sipc, on first time use). But to make sure that the user id is his email address, the certificate is sent in the email. This prevents you from using user id "bill@microsoft.com" because you can't get a certificate from our CA unless you own this email address. But if you have an email address say "bob@yahoo.com" or "Robert@msn.net" you can use this as your SIP identifier in p2p-sip, thus not tied to a single provider.

Sending in email is just one of the ways. Alternatively, if a group of users already have user certificates from other trusted entity such as verisign, they don't need to do email based certificates. Another possibility for future work is to also allow 'tel:' identity if the user can call from that telephone number (with caller id) to our VoiceXML service script that verifies that the user owns this telephone number and issues a new certificate (by directly storing in the DHT). This way other friends who know his phone number instead of email address, can also reach him on p2p-sip. Making an outbound call to tel: (similar to sending outbound email) for identity verification is probably not a good idea, unless user somehow pays for the call.

Identity protection (no one else can steal your identity) and verification (others can verify that you own this identity) is just one part of the P2P-SIP security. And using an email-based identity moves the problem of identity issuance and protection from P2P-SIP to your email provider.

Wednesday, February 22, 2006

Using an external DHT as a SIP location service

In Dec last year, I added support for an external DHT in my SIPpeer application, i.e., the SIP-using-P2P model. A technical report describes the details. In particular I used OpenDHT and built signing and encryption on top of it to allow secure P2P SIP with identity protection. The advantage of using a managed external P2P network is that we can safely assume that OpenDHT nodes are trusted in the sense that they perform DHT routing correctly. Thus the security problem is partly solved, and the identity protection can be done by issuing a certificate to the user if he/she has an email address.

Then in Jan and Feb this year, I wrote a Tcl based SIPpeer connector for OpenDHT, and with the help of Xiaotao Wu, added it to our SIP user agent, sipc. We are trying to see if we can make it opensource or at least freely downloadable for non academic use also. But first I will need to fix some of the usability issues. I will keep the current status posted here on time to time.

Tuesday, December 06, 2005

SIP-using-P2P vs P2P-over-SIP

I recently saw the video of IETF 64, P2P-SIP Ad hoc meeting. Some of the interesting points from the "SIP based vs DHT based approaches" are as follows: There are two components in P2P+SIP, namely Maintenance (of P2P/DHT) and Lookup (for sending SIP message). In the first SIP-based approach, both the components use SIP, whereas in DHT-based approach, both the components are done using the P2P protocol directly. The first approach just requires resolver-like library in the application, and usage of SIP is incidental. In the second, SIP is overloaded with additional functionality of maintaining the DHT/P2P network.

An intermediate approach is much better, where the DHT maintenance is done using the DHT/P2P protocol (instead of overloading SIP with that), and the lookup is done using SIP (by sending the call request to the next hop node in the P2P network). This gives rise to a P2P-SIP proxies model where the SIP proxies use P2P algorithm to locate the user and proxy the SIP request to the next hop based on the P2P lookup. This differs slightly from the first approach in that this doesn't require the caller node to lookup the user on P2P network before sending the SIP call request. The caller uses its local structures (such as finger table for Chord), locates the next hop and proxies the call request to the next hop. The intermediate approach takes the advantage of SIP-based model, as well as preseves the simplicity of DHT-based approach.

Monday, December 05, 2005

2 P2P or Not 2 P2P?

I read an interesting paper on whether P2P is appropriate for an application or not. Although the paper is based on file sharing work, it seems to suggest (see chart on page 8, and conclusions at the end) that P2P may not be appropriate for global Internet telephony kind of applications where participants do not trust each other and the resources are of low relevance (since individual phone number/user identifier in p2p-sip are not interesting to many peers; unlike popular files). The reason being, too much overhead due to distrust, and over due to non-natural p2p network evolution if the resources are not popular.

However, if the trust problem can be solved in p2p-sip, I believe that p2p-sip will be useful because it can provide serverless VoIP at different scale -- small organizations, to global Internet. The current skype model may fit as an appropriate application for p2p because the protocol is closed, which makes the different clients to trust each other to some extent (i.e., the application is not malicious).

First post: initial set of references

The recent news is that I am back from a long stay in India of about five months. I decided to start a new blog to put together my findings and readings on P2P-SIP and VoIP in general. If you would like to be added a team member for this blogger please drop me an email and I will add you!

I will start with references to my work. You can see a brief overview of our P2P-SIP design as a short paper, a more detailed architecture document or an extended overview paper. I also have an implementation running on Linux. The implementation report describes the details of the implementation so that others can also implement similar system.

There are many other papers on the p2p-sip website, including the original SoSIMPLE technical report, and internet drafts by Alan Johnston, Nimcat Networks and College of William and Mary.