Document Title

The Original Program

So, we've been tasked with reading a text file and calculating the number of words in the file of various lengths, how many 3 letter words, how many 4 letter words and so on. In reality, the program might be doing anything but the pattern that is important is (a) Read file sequentially (b) process each line/record as processed (c) Output something at end.

I pull out my trusty laptop and quickly code the following program

	// Open input file
	FILE* pFile = _fsopen(pName, "rt", _SH_DENYNO);
	if (pFile == NULL) return 0;

	char buffer[1024];

	// Init output Counters
	int WordLen[32] = { 0 };

	// Loop reading each line
	while (fgets(buffer, sizeof(buffer), pFile)) {
		char* tok = strtok(buffer, " \t,.");
		while (tok != NULL) {
			int n = strlen(tok);
			if (n >= 32) n = 31;

			WordLen[n]++;

			tok = strtok(NULL, " \t,.");
		}
	}

	// Cleanup
	fclose(pFile);

Give it a quick run, and what happens - I wait 57 seconds for the result. Looking at my CPU usage, I see that one CPU core is running at 100% and the other is idle.

Great, this means if I split the above program into three threads, one reading the file and two processing the records, I should complete the task in about half the time.

Jump to Section: First Step, Multithreading | Second Step, Improve Processing Loop | Third Step, Improve File Reading

Multithreaded version #1

Technical side-note: The program shown below is only to illustrate how a program can be broken into threads and threads can communicate to each other. There are (intentionally) things that are sub-optimal, but I will get to them later

Back to my trusty laptop, and I quickly hammer out the following multithreaded version of the above.

You don't need to read all the pieces, skip to the bottom and see the MainRoutine first.
This program looks a lot more complex than my version above, I haven't tried to make it clean to read, and besides if it is going to run twice as fast, who cares!
I've intentionally only allocated 8 buffers for passing records between threads, if a thread needs more it will wait, otherwise one thread could consume lots of memory if it is faster than the others
The input file is being fully cached (744Mb) in the OS in the tests below, but all measurements are equally affected

Skip Over Program and continue reading

CSemaphore BufferSem;
#define RECSIZE 1
volatile char* spFreeBuffers[8] = { NULL };
volatile char* spLoadedBuffers[8] = { NULL };
volatile unsigned long sOverallState = 0;
int OverallWordLen[32] = { 0 };


UINT FileReader(LPVOID pV)
{
	/******************************
	this thread reads a file and places the data read into a buffer, it is designed
	just to read the file from the disk as quickly as possible, so does NOTHING else
	*******************************/
	FILE* pFile = _fsopen((const char*)pV, "rt", _SH_DENYNO);
	if (pFile == NULL) {
		sOverallState |= 1;
		return 0;
	}

	int bRun = 1;

	while (bRun) {
		// Allocate a buffer to write into
		char* pBuffer = NULL;
		while (pBuffer == NULL) {
			{
				CSingleLock sLock(&BufferSem, TRUE);
				for (int k=0; k < 8; k++) {
					if (spFreeBuffers[k] != NULL) {
						pBuffer = (char*) spFreeBuffers[k];
						spFreeBuffers[k] = NULL;
						k = 8;
					}
				}
			}

			if (pBuffer==NULL) {
				Sleep(0);
			}
		}


		// Fill the buffer with some data
		*pBuffer = 0;
		char* pKick = pBuffer;
		char buffer[1024];
		int bEof = 0;
		for (int j=0; j < RECSIZE; j++) {
			if (fgets(buffer, sizeof(buffer), pFile)) {
				if (j > 0) {
					*pKick++ = ' ';
					*pKick=0;
				}

				strcat(pKick, buffer);
				pKick += strlen(buffer);
			} else {
				bEof = 1;
			}
		}

		// Place the buffer in the work queue for processing threads
		{
			CSingleLock sLock(&BufferSem, TRUE);
			for (int k=0; k < 8; k++) {
				if (spLoadedBuffers[k] == NULL) {
					spLoadedBuffers[k] = pBuffer;
					k = 8;
				}
			}
		}

		// Handle end of file conditions
		if (bEof) {
			CSingleLock sLock(&BufferSem, TRUE);
			sOverallState |= 1;
			bRun = 0;
		}
	}

	fclose(pFile);

	return 0;
}

UINT BufferProcessor(LPVOID)
{
	/******************************
	this thread processes in memory buffers
	it only has to process buffers and output results.  There can be more than one of these threads
	active
	*******************************/

	int bRun = 10;
	int WordLen[32] = { 0 };

	while (bRun>0) {
		// Allocate a buffer to process
		char* pBuffer = NULL;
		{
			{
				CSingleLock sLock(&BufferSem, TRUE);
				for (int k=0; k < 8; k++) {
					if (spLoadedBuffers[k] != NULL) {
						pBuffer = (char*) spLoadedBuffers[k];
						spLoadedBuffers[k] = NULL;
						k = 8;
					}
				}
			}
	
			if (pBuffer==NULL) {
				// Didnt get a buffer?  Back off a little bit and try again
				// Except if EOF has been reached then assume all done after 10 attempts
				if ((sOverallState&1)==1) {
					bRun--;
				}
				Sleep(1);
			} else bRun = 10;
		}

		if (pBuffer) {
			// Process the buffer, for this example we are counting word lengths, but it
			// might be anything
			char* tok = strtok(pBuffer, " \t,.");
			while (tok != NULL) {
				int n = strlen(tok);
				if (n >= 32) n = 31;

				WordLen[n]++;

				tok = strtok(NULL, " \t,.");
			}


			CSingleLock sLock(&BufferSem, TRUE);
			for (int k=0; k < 8; k++) {
				if (spFreeBuffers[k] == NULL) {
					spFreeBuffers[k] = pBuffer;
					k = 8;
				}
			}

		}
	}

	// We are finishing, copy our local counts to the master copy
	// We didnt update master copy directly, as that would have been slower if done for each word
	CSingleLock sLock(&BufferSem, TRUE);
	for (int s=0; s<32; s++) {
		OverallWordLen[s] += WordLen[s];
	}

	if ((sOverallState&2)==0) sOverallState |= 2;
	else sOverallState |= 4;

	return 0;
}


/* ************************************************
   Main Routine here
   ************************************************/
long SimpleCount::BaseLine_Thread(const char *pName, int ret)
{
	// Init Control structures
	sOverallState = 0;
	for (int k=0; k < 8; k++) {
		spFreeBuffers[k] = (char*) malloc((1024*RECSIZE)+4);
	}


	// Start File Reading Thread
	AfxBeginThread(FileReader, (void*) pName);

	// Start two processing threads
	AfxBeginThread(BufferProcessor, NULL);
	AfxBeginThread(BufferProcessor, NULL);

	// Wait for everyone to finish
	while (sOverallState != 7) Sleep(1);

	return OverallWordLen[ret];
}

Now I crank it up and wait, and wait, and wait, and I wait a total of 780 seconds!! WTF. It was still using 50% CPU, so I did all this work and only made it much worse.

The program had a built in record packing factor (#define RECSIZE N), so lets see what happens as the file reader reads more than one line into the buffer before passing it to the processing threads.

	Non Threaded	1x Record Packing	10x Record Packing	100x Record Packing	1,000x Record Packing	10,000x Record Packing	100,000x Record Packing
Seconds (Wall)	57	780	77	44	34	32	33
CPU Usage	50%	50%	75%	85%	92%	100%	100%

Lesson 1 for multithreaded programs - passing data between threads is often expensive, so pass big chunks of work (but not too big)

We need to pass big chunks of work between threads, as each CPU (or core) can typically only communicate via Main Memory; and to a CPU, main memory is slow. When you read a variable that is currently in the cache on another cpu/core, both cpus need to communicate a flush back to main memory. This delay is roughly in the order of 500 instructions (depending on lots of factors, but you get the general idea) To reduce this delay and avoid wasting 500 cycles, we task each thread with doing a larger chunk of work and only passing data infrequently

Now we are using almost 100% of CPU, so we need to reduce our CPU demand next... and then we will look into reading the file faster. Ultimate performance is the point at which one component (disk IO, CPU etc) is max'ed out and we can no longer cost effectively improve that area.

Multithreading - Step #2 - Improve Processing Loop

So, we now have a multithreaded version of the program and it runs in around 34 seconds. I'm going to use 1,000x Record Packaging for now, as there doesnt seem to be much advantage from using a lot more memory. The question now is what is using all the CPU.

There are lots of ways to solve this question, depending on the toolset you have. If I take a wild guess that it is the main processing loop, then I can simply comment that block out to see the performance if processing time takes 0mS

		if (pBuffer) {
			// Process the buffer, for this example we are counting word lengths, but it
			// might be anything
			/* SNIP....
			char* tok = strtok(pBuffer, " \t,.");
			while (tok != NULL) {
				int n = strlen(tok);
				if (n >= 32) n = 31;

				WordLen[n]++;

				tok = strtok(NULL, " \t,.");
			}
			...END SNIP */


			CSingleLock sLock(&BufferSem, TRUE);

Running the program without the processing loop runs in 17 seconds and uses 50% CPU, so it looks like we are on the right track. For my example, the processing is quite trivial, so I can replace the strtok() with pure C

	if (pBuffer) {
		// Process the buffer, for this example we are counting word lengths, but it
		// might be anything


		const char* pKick = pBuffer;
		int len = 0;
		while (*pKick != 0) {
			if ((*pKick == ' ')||(*pKick == '\t')||(*pKick == ',')||(*pKick == '.')) {
				if (len >= 32) len = 31;

				WordLen[len]++;
				len=0;
			} else {
				len++;
			}
			pKick++;
		}
	


		CSingleLock sLock(&BufferSem, TRUE);

This runs in just 20 seconds, with around 50% usage (remember 2 cores, so this is 100% of one core). So it looks like we are now using 17 seconds and 100% of a single CPU just to read the file and load the buffers, so that will be the next target to improve.

The take-away from this is not that strtok() is slow, just that there is a faster method for the narrow needs of my processing loop. You can take this too far - remember that in several years someone else will need to read and support this code, so creating complex code to save a couple of seconds now, might have the side effect of hours of effort to understand it later.

Thinking about all this, if the File Reading thread is using 100% of a single core for 17 seconds (from our quick test above), then two processing threads seems overkill especically when they barely seem to use any CPU. They are reading the same shared variables as the file reader, so presumably they are affecting it. So I did some more tests, just one processing thread, and two threads at lower priority.

	// Start two processing threads
	AfxBeginThread(BufferProcessor, NULL, THREAD_PRIORITY_BELOW_NORMAL);
	AfxBeginThread(BufferProcessor, NULL, THREAD_PRIORITY_LOWEST);

Using thread priorities is a little coarse for production use, but is fine while we are just testing. On a production server running threads at lower priority might mean they get starved of CPU, and then our wall time will probably increase

	N/A	2 Threads	1 thread	2 threads	1 thread
	No Processing	Equal priority	Equal priority	Lower priority	Lower priority
Seconds (Wall)	17	20	18	18	18
CPU Usage	50%	52%	52%	52%	52%

By moving the Processing Threads in such as way that they don't impact the FileReader thread so much, we can save 10%. There is a bit more to this than I am glossing over at this stage, this results are for an unloaded 2 core single CPU, things change around a little when we move to different platforms.

Multithreading - Step #3 - Improve File Reading Performance

Technical side-note: This step is Windows specific, it shows how to use multiple outstanding async I/O. There are equivalents for most platforms (Unix, OpenVMS, etc) so concepts still hold true even if the code isn't directly portable.

TBS