Summary and Usage of Parallel Computing

Abstract: Parallel computing can speed up our programs. From data parallelism to process parallelism, there are many parallel tools and approaches we can use. This blog introduces the Parallel Patterns Library (PPL), Open Multi-Processing (OpenMP), and the Open Computing Language (OpenCL). Other important parallel tools, such as CUDA and Hadoop, will be covered in later blogs.

In general, there are four levels of parallelism in computing: bit-level parallelism (BLP), instruction-level parallelism (ILP), data-level parallelism (DLP), and task-level parallelism (more commonly thread-level parallelism, TLP).

For data-level parallelism, there are four modes in which a computer can perform parallel calculations: SISD, MISD, SIMD, and MIMD. Their meanings are illustrated below.

(Figure: the meanings of SISD, MISD, SIMD, and MIMD.)

There are several tools for parallel computing, and the scale of parallelism ranges from multi-thread and multi-process to multi-core and multi-computer. The following is a rough classification.

MPI
MPI provides process-level parallelism. It adopts a distributed-memory model and realizes parallel execution explicitly (the programmer specifies how data is distributed); messages are passed between processes. MPI scales well, but its programming model is complicated.
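For a flavor of the MPI programming model, here is a minimal sketch (not part of the original post) in which every process reports its rank. It assumes an MPI implementation such as MPICH or Open MPI and is typically compiled with a wrapper compiler such as mpic++.

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                 // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    std::cout << "Hello from process " << rank << " of " << size << std::endl;

    MPI_Finalize();                         // shut down the MPI runtime
    return 0;
}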

Pthreads
Pthreads provides thread-level parallelism. It uses a shared-memory model and is available only on POSIX systems (Linux, Mac OS X, Solaris, HP-UX, etc.). It is a library that can be linked into C programs. Currently, the standard C++ shared-memory thread library is still under development; it may be more convenient to use that library in the future to implement C++ programs.
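As a rough illustration (a minimal sketch, not from the original post), the following creates a few POSIX threads and waits for them. It assumes a POSIX system and linking with -pthread.

#include <pthread.h>
#include <iostream>

void* worker(void* arg)
{
    long id = reinterpret_cast<long>(arg);   // thread index passed by the caller
    std::cout << "thread " << id << " running\n";
    return nullptr;
}

int main()
{
    const int kThreads = 4;
    pthread_t threads[kThreads];

    for (long i = 0; i < kThreads; ++i)
        pthread_create(&threads[i], nullptr, worker, reinterpret_cast<void*>(i));

    for (int i = 0; i < kThreads; ++i)
        pthread_join(threads[i], nullptr);   // wait for each thread to finish

    return 0;
}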

OpenMP
OpenMP provides thread-level parallelism. It uses a shared-memory model and implements parallel execution implicitly (the compiler and runtime handle data allocation), so its scalability is limited. Hence OpenMP applies only to SMP (Symmetric Multi-Processing) and DSM (Distributed Shared Memory) systems, and is not suitable for clusters.

OpenCL
OpenCL is a framework for heterogeneous platforms and can be applied to CPUs, GPUs, or other types of processors. OpenCL consists of a language (based on C99) for writing kernels (functions that run on OpenCL devices) and a set of APIs for defining and controlling the platform. It provides parallel computing based on task partitioning and data partitioning. OpenCL is similar to two other open industry standards, OpenGL and OpenAL, which are used for 3D graphics and computer audio, respectively.

CUDA
The GPU is designed to perform complex mathematical and geometric calculations. A GPU contains many streaming multiprocessors (SMs), which are roughly analogous to CPU cores. CUDA is a tool launched by Nvidia for programming the GPU. The similarities and differences between OpenCL and CUDA are:

  • Differences:
    • OpenCL is a general-purpose programming standard for heterogeneous platforms; having to account for many different devices makes it cumbersome to use.
    • CUDA is a framework created by Nvidia specifically for GPGPU programming. It is easy to use and easy to get started with.
  • Similarities:
    • Both are based on task parallelism and data parallelism.

Hadoop
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the underlying details of the distributed architecture. Hadoop makes full use of the power of clusters for high-speed computing and storage.

1 PPL

The Parallel Patterns Library (PPL) provides algorithms that concurrently perform work on collections of data. These algorithms resemble those provided by the Standard Template Library (STL).

1.1 The parallel_for Algorithm

You can convert many for loops to use parallel_for. However, the parallel_for algorithm differs from the for statement in the following ways:

  • The parallel_for algorithm does not execute the tasks in a pre-determined order.

  • The parallel_for algorithm does not support arbitrary termination conditions. The parallel_for algorithm stops when the current value of the iteration variable is one less than last.

  • The _Index_type type parameter must be an integral type. This integral type can be signed or unsigned.

  • The loop iteration must be forward. The parallel_for algorithm throws an exception of type std::invalid_argument if the _Step parameter is less than 1.

  • The exception-handling mechanism for the parallel_for algorithm differs from that of a for loop. If multiple exceptions occur simultaneously in a parallel loop body, the runtime propagates only one of the exceptions to the thread that called parallel_for. In addition, when one loop iteration throws an exception, the runtime does not immediately stop the overall loop. Instead, the loop is placed in the cancelled state and the runtime discards any tasks that have not yet started.
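The exception behaviour described in the last point can be sketched as follows (an illustrative example, not from the original post): an exception thrown in a loop iteration is caught on the thread that called parallel_for.

#include <ppl.h>
#include <iostream>
#include <stdexcept>

int main()
{
    try
    {
        concurrency::parallel_for(0, 100, [](int i) {
            if (i == 50)
                throw std::runtime_error("failure in iteration 50");   // puts the loop in the cancelled state
        });
    }
    catch (const std::exception& e)
    {
        // Only one of the exceptions thrown in the loop body is propagated here;
        // tasks that have not yet started are discarded.
        std::cout << e.what() << std::endl;
    }
    return 0;
}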

1.2 The parallel_for_each Algorithm

The parallel_for_each algorithm resembles the STL std::for_each algorithm, except that the parallel_for_each algorithm executes the tasks concurrently. Like other parallel algorithms, parallel_for_each does not execute the tasks in a specific order.

1.3 The parallel_invoke Algorithm

The concurrency::parallel_invoke algorithm executes a set of tasks in parallel. It does not return until each task finishes. This algorithm is useful when you have several independent tasks that you want to execute at the same time.

1.4 The parallel_transform Algorithm

You can use the parallel_transform algorithm to perform many data-parallel operations. For example, you can:

  • Adjust the brightness of an image, and perform other image processing operations.
  • Sum or take the dot product between two vectors, and perform other numeric calculations on vectors.
  • Perform 3-D ray tracing, where each iteration refers to one pixel that must be rendered.

Important: The iterator that you supply for the output of parallel_transform must completely overlap the input iterator or not overlap at all. The behavior of this algorithm is unspecified if the input and output iterators partially overlap.

1.5 The parallel_reduce Algorithm

The parallel_reduce algorithm is useful when you have a sequence of operations that satisfy the associative property. Here are some of the operations that you can perform with parallel_reduce:

  • Multiply sequences of matrices to produce a matrix.
  • Multiply a vector by a sequence of matrices to produce a vector.
  • Compute the length of a sequence of strings.
  • Combine a list of elements, such as strings, into one element.
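As a small sketch (assumed usage, not from the original post), summing a vector of integers with an explicit identity value and an associative binary operation looks like this:

#include <ppl.h>
#include <vector>
#include <numeric>
#include <functional>
#include <iostream>

int main()
{
    std::vector<int> v(1000);
    std::iota(v.begin(), v.end(), 1);   // 1, 2, ..., 1000

    // Addition is associative, so the runtime may combine partial sums in any grouping.
    int sum = concurrency::parallel_reduce(begin(v), end(v), 0, std::plus<int>());

    std::cout << sum << std::endl;      // 500500
    return 0;
}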

1.6 Code Example for PPL Algorithms

Here is a code example for the algorithms above.
Please add the header #include <ppl.h>.

#include <ppl.h>
#include <vector>
#include <string>
#include <random>
#include <algorithm>
#include <functional>
#include <sstream>
#include <iostream>

struct my_data
{
    int num;
    std::string note;
};

// Returns the result of adding a value to itself.
template <typename T>
T twice(const T& t)
{
    return t + t;
}

// Test the parallel_for algorithm.
void test_Parallel_for()
{
    concurrency::parallel_for(1, 6, [](int value) {
        std::wstringstream ss;
        ss << value << L' ';
        std::wcout << ss.str();
    });
}

// Test the parallel_for_each algorithm: basic usage.
void test_Parallel_for_each1()
{
    std::vector<std::string> my_str = { "hi", "world", "hello", "c", "language" };

    concurrency::parallel_for_each(begin(my_str), end(my_str), [](std::string value) {
        std::cout << value << std::endl;
    });
}

// Test the parallel_for_each algorithm: be careful with shared data captured by
// reference; the data touched by each iteration should be independent.
void test_Parallel_for_each2()
{
    std::vector<std::string> my_str = { "hi", "world", "hello", "c", "language" };
    int mark = 100;
    concurrency::parallel_for_each(begin(my_str), end(my_str), [&mark](std::string value) {
        std::cout << value << " " << mark << std::endl;
    });
}

// Test the parallel_for_each algorithm: iterate over class-type data in parallel.
void test_Parallel_for_each3()
{
    my_data data1 = { 100, "hi" };
    my_data data2 = { 200, "world" };
    my_data data3 = { 300, "language" };
    std::vector<my_data> my_str = { data1, data2, data3 };
    concurrency::parallel_for_each(begin(my_str), end(my_str), [](my_data value) {
        std::cout << value.num << " " << value.note << std::endl;
    });
}

// Test the parallel_invoke algorithm.
void test_Parallel_invoke()
{
    int n = 54;
    double d = 5.6;
    std::wstring s = L"Hello";

    concurrency::parallel_invoke(
        [&n] { n = twice(n); },
        [&d] { d = twice(d); },
        [&s] { s = twice(s); }
    );
    std::wcout << n << " " << d << " " << s << std::endl;
}

// Test the parallel_transform algorithm.
void test_Parallel_transform()
{
    std::vector<int> values(1250000);
    std::generate(begin(values), end(values), std::mt19937(42));

    std::vector<int> results(values.size());
    std::vector<int> results2(values.size());

    // Negate each element in parallel.
    concurrency::parallel_transform(begin(values), end(values), begin(results), [](int n) {
        return -n;
    });

    // Alternatively, use the std::negate class to perform the operation.
    concurrency::parallel_transform(begin(values), end(values),
        begin(results), std::negate<int>());

    // Demonstrate use of parallel_transform together with a binary function.
    // This example uses a lambda expression.
    concurrency::parallel_transform(begin(values), end(values), begin(results),
        begin(results2), [](int n, int m) {
            return n - m;
        });

    // Alternatively, use the std::multiplies class:
    concurrency::parallel_transform(begin(values), end(values), begin(results),
        begin(results2), std::multiplies<int>());
}

// Test the parallel_reduce algorithm.
void test_Parallel_reduce()
{
    std::vector<std::wstring> words;
    words.push_back(L"Hello ");
    words.push_back(L"i ");
    words.push_back(L"like ");
    words.push_back(L"c ");
    words.push_back(L"language, ");
    words.push_back(L"and ");
    words.push_back(L"parallel ");
    words.push_back(L"programming.");

    // Reduce the vector to one string in parallel.
    std::wcout << concurrency::parallel_reduce(begin(words), end(words), std::wstring()) << std::endl;
}

int main()
{
    // Uncomment one of the following test functions to run it.

    //test_Parallel_for();
    //test_Parallel_for_each1();
    //test_Parallel_for_each2();
    //test_Parallel_for_each3();
    //test_Parallel_invoke();
    //test_Parallel_transform();
    test_Parallel_reduce();

    return 0;
}

2 OpenMP

The basic format of an OpenMP compiler directive (#pragma omp) is as follows:

#pragma omp directive-name [clause[ [,] clause] ...]

Here, “[]” indicates an optional part, and each compiler directive acts on the statement that follows it (in C++, a block enclosed in “{}” is a compound statement and counts as a single statement).

The directive-name can be: parallel, for, sections, single, atomic, barrier, critical, flush, master, ordered, or threadprivate (11 in total; only the first 4 take optional clauses).

A clause modifies the directive and defines some of its parameters. The clause can be: copyin(variable-list), copyprivate(variable-list), default(shared | none), firstprivate(variable-list), if(expression), lastprivate(variable-list), nowait, num_threads(num), ordered, private(variable-list), reduction(operation: variable-list), schedule(type[,size]), or shared(variable-list) (13 in total).

For example, #pragma omp parallel means that the following statement will be executed in parallel by multiple threads, with the number of threads preset by the system (generally equal to the number of logical processors; for example, a 4-core, 8-thread i5 CPU has 8 logical processors). Optional clauses can be added to the directive. For example, #pragma omp parallel num_threads(4) still means that the following statement will be executed in parallel by multiple threads, but with the number of threads fixed at 4.
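To illustrate how clauses combine with a directive (a short sketch, not from the original post), the reduction clause gives each thread a private copy of sum and combines the copies when the loop finishes:

#include <omp.h>
#include <iostream>

int main()
{
    const int n = 1000;
    long long sum = 0;

#pragma omp parallel for num_threads(4) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += i;                       // each thread accumulates a private partial sum

    std::cout << "sum = " << sum << std::endl;   // expected: 499500
    return 0;
}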

2.1 parallel

The parallel directive indicates that the following statement will be executed in parallel by multiple threads, as described above. The statement (or block of statements) after #pragma omp parallel is called a parallel region.

You can use the num_threads clause to change the default number of threads.

2.2 for

The for directive divides the iterations of a for loop among multiple threads: the iterations performed by each thread do not repeat, and the iterations of all threads together are exactly the iterations of the C++ for loop. The C++ for loop must satisfy some restrictions so that the number of iterations can be determined before execution; for example, it should not contain break.

2.3 sections

The sections directive is used for task parallelism. It indicates that the following code block contains section blocks that will be distributed among multiple threads and executed in parallel.

2.4 critical

While one thread is executing a line of code marked critical, other threads cannot execute it at the same time (the line of code is a critical section).

2.5 Code Example for OpenMP Directives

Here is a code example for the directives above.
Please add the header #include <omp.h>.

Important: Please enable OpenMP in the project configuration: Property~C/C++~Language~OpenMP Support~Yes.

#include <omp.h>
#include <iostream>

// Test the parallel directive.
void test_openmp_parallel1()
{
#pragma omp parallel
    {
        std::cout << omp_get_thread_num();
    }
}

// Test the parallel directive with an explicit thread count.
void test_openmp_parallel2()
{
#pragma omp parallel num_threads(3)
    {
        std::cout << omp_get_thread_num();
    }
}

// Test the for directive: first form (separate parallel and for).
void test_openmp_parallel_for1()
{
    int data[1000];
#pragma omp parallel
    {
#pragma omp for
        for (int i = 0; i < 1000; ++i)
            data[i] = i;
    }
}

// Test the for directive: second form (combined parallel for).
void test_openmp_parallel_for2()
{
    int data[1000];
#pragma omp parallel for
    for (int i = 0; i < 1000; ++i)
        data[i] = i;
}

// Test the sections/section directives.
void test_openmp_parallel_section()
{
#pragma omp parallel sections
    {
#pragma omp section
        std::cout << omp_get_thread_num();
#pragma omp section
        std::cout << omp_get_thread_num();
    }
}

// Test the critical directive.
void test_openmp_parallel_critical()
{
#pragma omp parallel num_threads(4)
    {
#pragma omp critical
        std::cout << omp_get_thread_num() << omp_get_thread_num();
    }
}

int main()
{
    // Uncomment one of the following test functions to run it.

    //test_openmp_parallel1();
    //test_openmp_parallel2();
    //test_openmp_parallel_for1();
    //test_openmp_parallel_for2();
    //test_openmp_parallel_section();
    test_openmp_parallel_critical();
    return 0;
}

3 OpenCL

The main design goal of OpenCL is to provide a parallel computing platform that works across a variety of different devices. This generality comes at the cost of convenience, so OpenCL programming is quite tedious. Therefore, only one example, element-wise addition of two arrays, is provided here for demonstration.
The following is the code example for OpenCL programming.

For environment configuration:

  • Add the header #include <CL/cl.h>
  • The header file and library file can be found in the CUDA toolkit, so add the include and library directories as follows (this is my configuration; look up the path of your own toolkit):
    • Property~C/C++~Regular~additional include directory~D:\NVIDIA\CUDA\CUDAToolkit\include
    • Property~Linker~Regular~additional library directory~D:\NVIDIA\CUDA\CUDAToolkit\lib\x64
    • Property~Linker~Import~additional dependencies~OpenCL.lib

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdio>
#include <CL/cl.h>

const int ARRAY_SIZE = 1000;

// 1. Select the OpenCL platform and create a context.
cl_context CreateContext()
{
    cl_int errNum;
    cl_uint numPlatforms;
    cl_platform_id firstPlatformId;
    cl_context context = NULL;

    // Select the first of the available platforms.
    errNum = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
    if (errNum != CL_SUCCESS || numPlatforms <= 0)
    {
        std::cerr << "Failed to find any OpenCL platforms." << std::endl;
        return NULL;
    }

    // Create an OpenCL context on a GPU device.
    cl_context_properties contextProperties[] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)firstPlatformId,
        0
    };
    context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU,
                                      NULL, NULL, &errNum);

    return context;
}

// 2. Choose a device and create a command queue.
cl_command_queue CreateCommandQueue(cl_context context, cl_device_id *device)
{
    cl_int errNum;
    cl_device_id *devices;
    cl_command_queue commandQueue = NULL;
    size_t deviceBufferSize = 0;

    // Query the size of the device list.
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceBufferSize);

    if (errNum != CL_SUCCESS || deviceBufferSize <= 0)
    {
        std::cerr << "No devices available.";
        return NULL;
    }

    // Allocate buffer space for the device list.
    devices = new cl_device_id[deviceBufferSize / sizeof(cl_device_id)];
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceBufferSize, devices, NULL);

    // Pick the first of the available devices.
    commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);

    *device = devices[0];
    delete[] devices;
    return commandQueue;
}

// 3. Create and build the program object.
cl_program CreateProgram(cl_context context, cl_device_id device, const char* fileName)
{
    cl_int errNum;
    cl_program program;

    std::ifstream kernelFile(fileName, std::ios::in);
    if (!kernelFile.is_open())
    {
        std::cerr << "Failed to open file for reading: " << fileName << std::endl;
        return NULL;
    }

    std::ostringstream oss;
    oss << kernelFile.rdbuf();

    std::string srcStdStr = oss.str();
    const char *srcStr = srcStdStr.c_str();
    program = clCreateProgramWithSource(context, 1,
                                        (const char**)&srcStr,
                                        NULL, NULL);

    errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    return program;
}

// Create the memory objects used as kernel arguments.
bool CreateMemObjects(cl_context context, cl_mem memObjects[3],
                      float *a, float *b)
{
    memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(float) * ARRAY_SIZE, a, NULL);
    memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(float) * ARRAY_SIZE, b, NULL);
    memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   sizeof(float) * ARRAY_SIZE, NULL, NULL);
    return true;
}

// Release OpenCL resources.
void Cleanup(cl_context context, cl_command_queue commandQueue,
             cl_program program, cl_kernel kernel, cl_mem memObjects[3])
{
    for (int i = 0; i < 3; i++)
    {
        if (memObjects[i] != 0)
            clReleaseMemObject(memObjects[i]);
    }
    if (commandQueue != 0)
        clReleaseCommandQueue(commandQueue);

    if (kernel != 0)
        clReleaseKernel(kernel);

    if (program != 0)
        clReleaseProgram(program);

    if (context != 0)
        clReleaseContext(context);
}

int main(int argc, char** argv)
{
    cl_context context = 0;
    cl_command_queue commandQueue = 0;
    cl_program program = 0;
    cl_device_id device = 0;
    cl_kernel kernel = 0;
    cl_mem memObjects[3] = { 0, 0, 0 };
    cl_int errNum;

    // 1. Select the OpenCL platform and create a context.
    context = CreateContext();

    // 2. Choose a device and create a command queue.
    commandQueue = CreateCommandQueue(context, &device);

    // 3. Create and build the program object.
    program = CreateProgram(context, device, "Add.cl");

    // 4. Create the OpenCL kernel.
    kernel = clCreateKernel(program, "add_kernel", NULL);

    // Create the data to process.
    float result[ARRAY_SIZE];
    float a[ARRAY_SIZE];
    float b[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        a[i] = (float)i;
        b[i] = (float)(ARRAY_SIZE - i);
    }

    // Create the memory objects.
    if (!CreateMemObjects(context, memObjects, a, b))
    {
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    // 5. Set the kernel arguments and enqueue the kernel.
    errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memObjects[0]);
    errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memObjects[1]);
    errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memObjects[2]);

    size_t globalWorkSize[1] = { ARRAY_SIZE };
    size_t localWorkSize[1] = { 1 };

    errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                    globalWorkSize, localWorkSize,
                                    0, NULL, NULL);

    // 6. Read back the result and release the OpenCL resources.
    errNum = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE,
                                 0, ARRAY_SIZE * sizeof(float), result,
                                 0, NULL, NULL);

    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        std::cout << result[i] << " ";
    }
    std::cout << std::endl;
    std::cout << "Executed program successfully." << std::endl;
    getchar();
    Cleanup(context, commandQueue, program, kernel, memObjects);

    return 0;
}

The kernel file Add.cl is:

__kernel void add_kernel(__global const float *a, __global const float *b, __global float *result)
{
    int gid = get_global_id(0);
    result[gid] = a[gid] + b[gid];
}

All of the code files can be found on my GitHub.