By adding a lock, we can make a data structure “thread safe”.
#include <pthread.h>

typedef struct counter_t {
    int             value;   // the count being protected
    pthread_mutex_t lock;    // guards every access to value
} counter_t;
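The operations then simply wrap every access in the lock. A minimal sketch under that assumption (the function names counter_init, counter_increment, and counter_get are illustrative):

void counter_init(counter_t *c) {
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
}

void counter_increment(counter_t *c) {
    pthread_mutex_lock(&c->lock);    // serialize all updates
    c->value++;
    pthread_mutex_unlock(&c->lock);
}

int counter_get(counter_t *c) {
    pthread_mutex_lock(&c->lock);    // serialize reads too
    int rc = c->value;
    pthread_mutex_unlock(&c->lock);
    return rc;
}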
This design simply adds a lock and acquires it around every operation. However, the locking overhead is high: performance barely improves as you add CPUs, so the counter does not scale.
Perfect scaling means that threads running on multiple CPUs each complete their work as fast as a single thread would alone, so more total work finishes in the same amount of time.
Instead of maintaining just one counter, the sloppy counter keeps one global counter plus one local counter per CPU, each protected by its own lock. A thread running on a given core increments the local counter assigned to that core. When a local counter reaches a threshold $S$ (the sloppiness), its value is flushed to the global counter and the local counter is reset to zero.
The bigger $S$ is, the more scalable the counter becomes, but the further the global counter may lag behind the actual value.
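A rough sketch of the idea, assuming a fixed number of cores (the NUMCPUS constant and the sloppy_* names are illustrative; the caller is assumed to pass in which core it runs on):

#include <pthread.h>

#define NUMCPUS 4   // assumed number of cores (illustrative)

typedef struct sloppy_counter_t {
    int             global;            // global count
    pthread_mutex_t glock;             // lock for the global count
    int             local[NUMCPUS];    // one local count per CPU
    pthread_mutex_t llock[NUMCPUS];    // one lock per local count
    int             threshold;         // sloppiness S
} sloppy_counter_t;

void sloppy_init(sloppy_counter_t *c, int threshold) {
    c->threshold = threshold;
    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);
    for (int i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

// Called by a thread running on core `cpu`: update that core's local
// counter, and flush to the global counter once the threshold S is hit.
void sloppy_update(sloppy_counter_t *c, int cpu, int amt) {
    pthread_mutex_lock(&c->llock[cpu]);
    c->local[cpu] += amt;
    if (c->local[cpu] >= c->threshold) {
        pthread_mutex_lock(&c->glock);
        c->global += c->local[cpu];          // transfer to global
        pthread_mutex_unlock(&c->glock);
        c->local[cpu] = 0;                   // reset local count
    }
    pthread_mutex_unlock(&c->llock[cpu]);
}

// Returns the global value, which may be behind the true count by up to
// roughly S per CPU, since unflushed local counts are not included.
int sloppy_get(sloppy_counter_t *c) {
    pthread_mutex_lock(&c->glock);
    int val = c->global;
    pthread_mutex_unlock(&c->glock);
    return val;
}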
Next, consider a concurrent linked list, taking insertion as the example.

Assuming malloc() is thread safe, the allocation can happen outside the critical section, so the lock only needs to wrap the actual insertion (the pointer updates).
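A sketch of such an insert, with the allocation outside the lock (the list_t/node_t layout and the single list-wide lock are assumptions of this sketch):

#include <stdlib.h>
#include <pthread.h>

typedef struct node_t {
    int            key;
    struct node_t *next;
} node_t;

typedef struct list_t {
    node_t         *head;
    pthread_mutex_t lock;   // one lock for the whole list
} list_t;

// malloc() is assumed thread safe, so allocation happens outside the
// critical section; only the head update is protected by the lock.
int list_insert(list_t *L, int key) {
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL)
        return -1;                  // fail without ever holding the lock
    new->key = key;

    pthread_mutex_lock(&L->lock);   // critical section: just the pointer swap
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
    return 0;
}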
The linked list above does not scale well, because the single lock serializes every operation. To enable more concurrency, we can use hand-over-hand locking (also called lock coupling): every node gets its own lock, and while traversing the list, a thread acquires the lock of the next node before releasing the lock of the current one.
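A sketch of a lookup using hand-over-hand locking, assuming each node carries its own lock (the hoh_* names are illustrative); at most two node locks are ever held at once:

typedef struct hoh_node_t {
    int                key;
    struct hoh_node_t *next;
    pthread_mutex_t    lock;    // one lock per node
} hoh_node_t;

// Search for `key`: grab the next node's lock before releasing
// the current node's lock, so the traversal never loses its place.
int hoh_lookup(hoh_node_t *head, int key) {
    if (head == NULL)
        return 0;
    pthread_mutex_lock(&head->lock);
    hoh_node_t *curr = head;
    while (curr != NULL) {
        if (curr->key == key) {
            pthread_mutex_unlock(&curr->lock);
            return 1;                            // found
        }
        hoh_node_t *next = curr->next;
        if (next != NULL)
            pthread_mutex_lock(&next->lock);     // acquire next first...
        pthread_mutex_unlock(&curr->lock);       // ...then release current
        curr = next;
    }
    return 0;                                    // not found
}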
In practice, the overhead of acquiring and releasing a lock at every node is prohibitive; a hybrid approach instead grabs a new lock only after every so many nodes.
A concurrent queue can use two locks, one for the head and one for the tail, which allows enqueue and dequeue operations to proceed concurrently.
By adding a dummy node, the head and tail operations are separated: an enqueuing thread only touches the tail lock, and a dequeuing thread only touches the head lock.
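A sketch of this two-lock design (in the style of the classic Michael and Scott queue); the queue_t layout and function names are illustrative:

#include <stdlib.h>
#include <pthread.h>

typedef struct qnode_t {
    int             value;
    struct qnode_t *next;
} qnode_t;

typedef struct queue_t {
    qnode_t        *head;        // dequeue side
    qnode_t        *tail;        // enqueue side
    pthread_mutex_t head_lock;
    pthread_mutex_t tail_lock;
} queue_t;

void queue_init(queue_t *q) {
    qnode_t *dummy = malloc(sizeof(qnode_t));   // dummy node splits head from tail
    dummy->next = NULL;
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void queue_enqueue(queue_t *q, int value) {
    qnode_t *n = malloc(sizeof(qnode_t));
    n->value = value;
    n->next = NULL;

    pthread_mutex_lock(&q->tail_lock);          // only the tail is touched
    q->tail->next = n;
    q->tail = n;
    pthread_mutex_unlock(&q->tail_lock);
}

// Returns 0 and fills *value on success, -1 if the queue is empty.
int queue_dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->head_lock);          // only the head is touched
    qnode_t *old_dummy = q->head;
    qnode_t *new_head = old_dummy->next;
    if (new_head == NULL) {                     // empty: only the dummy remains
        pthread_mutex_unlock(&q->head_lock);
        return -1;
    }
    *value = new_head->value;                   // new_head becomes the new dummy
    q->head = new_head;
    pthread_mutex_unlock(&q->head_lock);
    free(old_dummy);
    return 0;
}

Because the dummy node always sits between the two pointers, an enqueue and a dequeue never update the same node at the same time, so the two locks never need to be held together.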