Scalability is the ability of a system to handle higher load as resources are added. Performance determines the throughput of the system under a given load. In theory, scalability and performance are orthogonal: you can handle higher load either by scaling the system or by improving the performance of its individual components. In practice, scaling and performance improvement are used together to improve the overall system.
Suppose a single machine can handle a load of N. If it is possible to handle 2N load by adding another machine, or kN load by adding another k-1 machines, then the application is designed to be scalable. Alternatively, you can try to optimize your application or buy more expensive hardware so that a single machine handles 2N load. Clearly there is a limit to the performance gain on a single machine. Also, for the same overall improvement, scaling the system by adding machines is typically much cheaper than improving the performance of a single machine through optimization or more expensive hardware.
This seems to indicate that scaling should always be preferred. Unfortunately, designing your application for scalability is not trivial. As an example, Google AppEngine (GAE) is designed to be scalable, but not necessarily high-performance. A relational database such as MySQL, in contrast, can be optimized for high performance, but designing your application to scale with MySQL is a challenge. In most web applications, the database server eventually becomes the bottleneck at high load. Google's Bigtable, on the other hand, is designed to be scalable. The tradeoff is that the GAE API does not allow many relational database features such as joins, and hence requires programmers to learn a new way of storing and accessing data.
Horizontal scalability refers to adding more machines to handle the load, whereas vertical scalability (which we call high-performance) refers to adding hardware components, such as more memory or a better CPU, to existing machines to handle higher load.
High Scalability Techniques
Partitioning the data is the most common scalability technique. It allows you to distribute different partitions across different servers. Consistent hashing has been used in distributed hash tables and distributed server farms to assist partitioning and replication of data in the presence of high churn, when machines come and go frequently.
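To make the idea concrete, here is a minimal sketch of a consistent hash ring in Python. It is an illustration under stated assumptions, not the scheme of any particular system: the class name ConsistentHashRing, the use of SHA-256 as the ring hash, and the default of 100 virtual nodes per server are all choices made for this example.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an assumption)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Illustrative hash ring with virtual nodes (all names are hypothetical)."""

    def __init__(self, replicas: int = 100):
        self.replicas = replicas  # virtual nodes placed on the ring per server
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> server name

    def add_server(self, server: str) -> None:
        for i in range(self.replicas):
            point = ring_hash(f"{server}#{i}")
            if point in self._owner:
                continue  # extremely unlikely collision: keep the first owner
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove_server(self, server: str) -> None:
        # Drop only this server's points; every other point stays in place.
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

The property that matters under churn is that removing a server only moves the keys that server owned; keys owned by the remaining servers keep their assignment, which is not the case with a plain "hash modulo number of servers" scheme.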
Stateless servers are much more scalable than stateful ones, because stateful servers may need to communicate with each other or share state, which limits scalability. Web servers and SIP proxy servers are easy to make stateless, whereas conference servers, presence servers and gateways are difficult to make stateless. Many applications also require stateful processing at the server, e.g., web applications that need stateful database storage. Statelessness can be combined with partitioning to build a two-stage server farm, where the first-stage stateless servers just do load balancing and the second-stage stateful servers each work on a small data partition. Unfortunately, some applications such as presence or publish-subscribe are too complex for easy data partitioning.
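The two-stage arrangement can be sketched in a few lines of Python. This is a toy model rather than a real server farm: PartitionServer and dispatch are hypothetical names, the servers are plain in-process objects, and CRC32 modulo the partition count is simply an assumed way of picking a partition.

```python
import zlib


class PartitionServer:
    """Second-stage, stateful server: owns the state of one data partition."""

    def __init__(self, name: str):
        self.name = name
        self.store = {}  # state lives only here, never in the dispatcher

    def handle(self, key: str, value: str) -> str:
        self.store[key] = value
        return self.name


def dispatch(key: str, value: str, partitions: list) -> str:
    """First-stage, stateless dispatcher: keeps nothing between calls.

    The target is a pure function of the key, so any number of identical
    dispatchers can run in parallel without talking to each other.
    """
    index = zlib.crc32(key.encode("utf-8")) % len(partitions)
    return partitions[index].handle(key, value)
```

Because every dispatcher computes the same partition for the same key, requests for one key always reach the same stateful server, and the stateful servers never need to share state with each other.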
High Performance Techniques
The C10K problem refers to the typical web server limitation of only about ten thousand simultaneous connections due to operating system and software constraints, and the discussion of it presents several references on improving performance. The usual software performance bottlenecks are data copies, context switches, memory allocation and lock contention. Various techniques to handle these problems are summarized in .
Asynchronous and non-blocking IO are commonly used to convert blocking/synchronous methods to event-based ones. Although asynchronous and non-blocking refer to almost the same thing, there are certain crucial differences in the API. Non-blocking means that a method does not block and hence returns immediately, e.g., with an error code indicating that the operation is not complete. Typically, an additional method is available to query the state of the IO. For example, the socket API allows a non-blocking mode, and select can be used to check whether a socket is ready for reading or writing. Thus, the application program has full control over when the read is done and in which thread/stack. Asynchronous APIs, on the other hand, are more event-based: the application registers a handler for an event, and the system either calls the handler from within a system thread when that event occurs, or posts the event to the main application's handler loop.
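The difference in API style can be shown with a short Python sketch. The function and class names (nonblocking_read, Collector, async_read) are made up for this example, and asyncio is used only as one readily available event-based API, not as the mechanism the text has in mind. In the first function the application calls recv itself after select reports readiness; in the second, the event loop calls the registered handler and hands it the data.

```python
import asyncio
import select
import socket


def nonblocking_read(sock: socket.socket, timeout: float = 1.0) -> bytes:
    """Non-blocking style: the application decides when to read."""
    sock.setblocking(False)
    try:
        return sock.recv(4096)       # returns immediately if data is waiting
    except BlockingIOError:
        pass                         # "not ready yet" instead of blocking
    # Ask the OS about the socket state, then perform the read ourselves.
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock.recv(4096) if readable else b""


class Collector(asyncio.Protocol):
    """Asynchronous style: a handler that the event loop calls with the data."""

    def __init__(self, done: asyncio.Future):
        self.done = done

    def data_received(self, data: bytes) -> None:
        if not self.done.done():
            self.done.set_result(data)


async def async_read(sock: socket.socket, timeout: float = 1.0) -> bytes:
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    # Register the handler; the loop owns the socket from here on.
    transport, _ = await loop.create_connection(lambda: Collector(done), sock=sock)
    try:
        return await asyncio.wait_for(done, timeout)
    finally:
        transport.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"ping")
    print(nonblocking_read(a))
    b.sendall(b"pong")
    print(asyncio.run(async_read(a)))
    b.close()
```

Note that async_read closes the socket it was given when it finishes, which is acceptable for a demonstration but would need more care in a real server.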
A well-known topic of debate is whether event-based or thread-based designs are more suitable for high-performance servers. Theoretically, both are equivalent given non-preemptive threads and co-operative multi-tasking. In practice, however, because of the way threads are implemented and the resources they need, event-based systems have performed better on single-CPU machines. Unfortunately, it is difficult for pure event-based systems to take advantage of multi-CPU hardware.
Thread pools and process pools have been used to improve system performance and take advantage of multi-CPU hardware. Both multi-process and multi-threaded systems have been built in practice. The advantage of a multi-process implementation is that multiple processes can listen for incoming connections on the same socket, whereas a multi-threaded implementation typically has only one thread listening on a socket. The problem with a multi-process implementation is that it needs explicit inter-process communication using message passing or shared memory, whereas in a multi-threaded implementation it is easy to share state through global variables protected by mutexes and condition variables. With respect to event-based systems, there are two design patterns: the reactor pattern allows the application to register for a "ready" event and perform the read operation itself when the event is received; the proactor pattern allows the application to register for a "complete" event and receive the incoming data along with the read event.
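A minimal reactor can be sketched with Python's standard selectors module. The Reactor class and its method names are hypothetical, chosen for this illustration; the point is only that the loop reports readiness and the registered handler performs the read itself. In a proactor, by contrast, the handler would be given the already-read data.

```python
import selectors
import socket


class Reactor:
    """Illustrative reactor: dispatches "ready" events to registered handlers."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def register(self, sock: socket.socket, on_readable) -> None:
        sock.setblocking(False)
        # The handler is stored as the opaque data attached to the registration.
        self._selector.register(sock, selectors.EVENT_READ, on_readable)

    def unregister(self, sock: socket.socket) -> None:
        self._selector.unregister(sock)

    def run_once(self, timeout: float = 1.0) -> int:
        """Wait for ready events and call each handler; return the event count."""
        events = self._selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)  # handler does the actual recv()
        return len(events)

    def close(self) -> None:
        self._selector.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    reactor = Reactor()
    reactor.register(a, lambda s: print("read:", s.recv(4096)))
    b.sendall(b"hello")
    reactor.run_once()
    reactor.unregister(a)
    reactor.close()
    a.close()
    b.close()
```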
The thundering herd problem in an operating system occurs when an IO event is received and all the waiting threads are woken up, even though only one thread will handle the event and the others will go back to sleep. This wastes CPU cycles. The problem and a solution are described in .
For a high-performance server implementation, the general consensus is to always use non-blocking IO, together with a thread or process pool containing the minimum number of threads/processes. The idea is that there should be one thread/process per CPU. This paper describes a SIP server architecture that can maintain a few hundred thousand active TCP connections. For pure network IO it is possible to always use non-blocking IO on commodity hardware, whereas for disk IO it is not so easy. Hence, a thread-pool model in which worker threads wait for disk IO completion has been used with success in the past.
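The combination of an event loop for network IO with a small worker pool for disk IO can be sketched as follows. This is an assumption-laden illustration, not the architecture of the SIP server mentioned above: read_file, read_files and read_all are hypothetical helpers, and sizing the pool to os.cpu_count() simply mirrors the one-thread-per-CPU rule of thumb.

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor


def read_file(path: str) -> bytes:
    """Blocking disk read; runs in a worker thread, never on the event loop."""
    with open(path, "rb") as f:
        return f.read()


async def read_files(paths, pool: ThreadPoolExecutor) -> list:
    loop = asyncio.get_running_loop()
    # Hand each blocking read to the pool; the loop stays free for network IO.
    tasks = [loop.run_in_executor(pool, read_file, p) for p in paths]
    return await asyncio.gather(*tasks)  # results come back in input order


def read_all(paths) -> list:
    workers = os.cpu_count() or 1  # assumed sizing: one worker per CPU
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return asyncio.run(read_files(paths, pool))
```

Calling read_all with a list of file paths returns their contents as a list of bytes objects. In a real server the event loop would be handling sockets while the workers block on the disk, which is the division of labor the paragraph above describes.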