Parallel Computing

🧪 Introduction to Parallel Computing Lab

Duration: ~1 hour
Environment: Linux (Ubuntu / WSL) and Windows


🎯 Learning Goals

By the end of this lab, students will be able to:

  • Understand that concurrent threads do not execute in a fixed order.
  • Identify computational hotspots using profiling tools.
  • Distinguish between data parallelism and task parallelism.
  • Partition work into independent tasks and parallelize them.

🧩 Part 0 — Hello Parallel World (Nondeterminism)

🧠 Concept

When two threads run concurrently, the order of execution is not deterministic — it depends on the operating system scheduler.


🧰 C Example — hello_threads.c

#include <stdio.h>
#include <pthread.h>

void* print_hello(void* arg) {
    printf("Hello ");
    return NULL;
}

void* print_world(void* arg) {
    printf("World!\n");
    return NULL;
}

int main() {
    pthread_t t1, t2;

    // Launch both threads; which one runs first is up to the OS scheduler
    pthread_create(&t1, NULL, print_hello, NULL);
    pthread_create(&t2, NULL, print_world, NULL);

    // Wait for both threads to finish before exiting
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    return 0;
}

Build & Run:

gcc -pthread -o hello_threads hello_threads.c
./hello_threads

Example Outputs:

Hello World!

or

World!
Hello 

⚠️ The order is unpredictable — threads run concurrently!


🐍 Python Example — hello_threads.py

import threading
import time, random

def say_hello():
    time.sleep(random.random() * 0.01)  # tiny random delay so either ordering can win
    print("Hello ", end="")

def say_world():
    time.sleep(random.random() * 0.01)
    print("World!")

t1 = threading.Thread(target=say_hello)
t2 = threading.Thread(target=say_world)

t1.start()
t2.start()
t1.join()
t2.join()

Run:

python hello_threads.py

Example Outputs:

Hello World!

or

World!
Hello 

✅ This demonstrates nondeterminism — thread execution order isn’t guaranteed.
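
If you ever need a fixed order, you have to impose it yourself, for example by joining the first thread before starting the second (reusing the say_hello/say_world functions above):

t1 = threading.Thread(target=say_hello)
t2 = threading.Thread(target=say_world)

t1.start()
t1.join()   # wait for "Hello " to be printed...
t2.start()  # ...before "World!" can even start
t2.join()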


🧭 Discovering Your Machine: Installing and Using lstopo (hwloc)

lstopo is part of the hwloc (Hardware Locality) toolset, used to visualize and inspect hardware topology (CPUs, cores, caches, NUMA nodes, etc.).


🐧 Linux Installation

1. Install via package manager

On Ubuntu / Debian:

sudo apt update
sudo apt install hwloc

2. Verify installation

lstopo

If a graphical display is available, lstopo opens a window showing your system's topology; otherwise (e.g. over SSH, or in WSL without a GUI) it prints a textual representation. You can force text output with lstopo --of console.
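
The exact output depends on your hardware. As a rough illustration only (a hypothetical dual-core machine with hyper-threading), the textual form looks something like:

Machine (16GB total)
  Package L#0
    NUMANode L#0 (P#0 16GB)
    L3 L#0 (8192KB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)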


3. (Optional) Generate graphical output

You can export a topology image:

lstopo topology.png

🪟 Windows Installation

1. Download the prebuilt ZIP

Go to the official hwloc page: 👉 https://www.open-mpi.org/software/hwloc/v2.12/

Choose the build matching your system:

  • hwloc-win64-build-2.12.2.zip → for 64-bit Windows
  • hwloc-win32-build-2.12.2.zip → for 32-bit Windows

2. Extract the ZIP

  1. Right-click the ZIP file → “Extract All…”
  2. Choose a destination folder, e.g. C:\hwloc-2.12.2-win64\
  3. Inside it, find the bin\ folder containing lstopo.exe.

3. (Optional) Add to PATH

To run lstopo from any terminal:

  1. Open System Properties → Advanced → Environment Variables

  2. Edit the Path variable and add:

    C:\hwloc-2.12.2-win64\bin
    
  3. Reopen your Command Prompt or PowerShell.


4. Run lstopo

Open Command Prompt and run:

lstopo

or to generate a PNG file:

lstopo topology.png

✅ Summary

| Platform | How to Install | Command |
| --- | --- | --- |
| Linux (Ubuntu/Debian) | sudo apt install hwloc | lstopo |
| Windows | Download ZIP from open-mpi.org and extract | lstopo.exe |

🧩 Part 1 — Finding Hotspots (Profiling)

🧠 Concept

A hotspot is the section of your code where most of the execution time is spent. Profiling helps identify these bottlenecks so you can target parallelization effectively.

The following sequential program calculates the square roots of integers from 1 to 10,000,000. To identify the hotspot—the parts of the code that dominate runtime—we’ll run Python’s profiler and examine its report of time spent per function.


🐍 Python Example — serial_computation.py

import time

def slow_function(n):
    total = 0
    for i in range(n):
        total += i ** 0.5  # intentionally slow; ** is the power operator, so i ** 0.5 is sqrt(i)
    return total

def main():
    start = time.time()
    result = slow_function(10_000_000)
    end = time.time()
    print("Result:", result)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Run:

python serial_computation.py

Profile with cProfile:

python -m cProfile -s tottime serial_computation.py

Output:

    Result: 21081849486.439312
    Time: 1.751760482788086 seconds

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    1.732    1.732    1.732    1.732  serial_computation.py:3(slow_function)
         2    0.000    0.000    0.000    0.000  {built-in method builtins.print}
         1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
         1    0.000    0.000    1.732    1.732  serial_computation.py:9(main)
         1    0.000    0.000    1.732    1.732  {built-in method builtins.exec}
         2    0.000    0.000    0.000    0.000  {built-in method time.time}
         1    0.000    0.000    1.732    1.732  serial_computation.py:1(<module>)

For more information about the Python profiler, see the official documentation: https://docs.python.org/3/library/profile.html
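
cProfile can also be driven from inside a program when you only want to profile a single call. A minimal sketch, assuming it runs at the bottom of serial_computation.py where slow_function is defined (the profile.out filename is arbitrary):

import cProfile
import pstats

# Profile one call and save the raw stats to a file
cProfile.run("slow_function(10_000_000)", "profile.out")

# Load the stats and print the five most expensive functions by total time
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(5)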


🧩 Part 2 — Data Parallelism with Processes (Python)

🧠 Concept

Split the dataset into chunks and perform the same task on each chunk in parallel.


🧰 Code — parallel_computation.py

import time
from multiprocessing import Pool

def slow_function(start, end):
    total = 0
    for i in range(start, end):
        total += i ** 0.5
    return total

def main():
    n = 10_000_000
    num_workers = 4
    chunk = n // num_workers  # integer division, rounded down (here n divides evenly)
    # Compute the [start, end) range handled by each worker
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(num_workers)]

    start = time.time()
    with Pool(num_workers) as p:  # create 4 worker processes (not threads)
        # starmap unpacks each (start, end) tuple and returns a list of partial sums
        results = p.starmap(slow_function, ranges)
    total = sum(results)
    end = time.time()

    print("Result:", total)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Run:

python parallel_computation.py

Output:

    Result: 21081849486.443127
    Time: 0.6766617298126221 seconds
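
A useful sanity check: 1.751 / 0.677 ≈ 2.6, so four workers gave roughly a 2.6× speedup rather than the ideal 4×. Process startup, sending results back, and the serial sum(results) step likely account for most of the gap.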

🧠 Demonstrating the Global Interpreter Lock (GIL) in Python

Let’s use the same example that we ran with multiprocessing, but this time using threads instead of processes.

🧩 Code Example — Using Threads

import time
from threading import Thread

def slow_function(start, end):
    total = 0
    for i in range(start, end):
        total += i ** 0.5
    return total

# Thread function: store this thread's partial sum in the shared results list
def worker(start, end, results, idx):
    results[idx] = slow_function(start, end)

def main():
    n = 10_000_000
    num_threads = 4
    chunk = n // num_threads
    # Split the job into equal chunks, one per thread
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(num_threads)]

    results = [0] * num_threads
    threads = []

    start = time.time()
    for i in range(num_threads):
        t = Thread(target=worker, args=(ranges[i][0], ranges[i][1], results, i))
        threads.append(t)
        t.start()

    # The main thread must wait for all workers to finish;
    # otherwise it would compute sum(results) before the
    # partial sums have been filled in.
    for t in threads:
        t.join()

    total = sum(results)
    end = time.time()

    print("Result:", total)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Output

    Result: 21081849486.443127
    Time: 1.7334659099578857 seconds

⚙️ What’s Happening

Even though we start 4 threads, the program won’t run faster than the single-threaded version. That’s because of Python’s Global Interpreter Lock (GIL).

The GIL is a mutex that protects access to Python objects, ensuring that only one thread executes Python bytecode at a time. This makes memory management simpler and prevents race conditions in CPython, but it also means:

Multiple Python threads cannot run CPU-bound code truly in parallel on multiple cores.


🧮 Comparison

| Version | Parallel type | True multi-core? | Expected performance |
| --- | --- | --- | --- |
| multiprocessing.Pool | Separate processes | ✅ Yes | ~4× faster (on 4 cores) |
| threading.Thread | Shared-memory threads | ❌ No (blocked by GIL) | ~1× (same as single thread) |

💡 When Threads Do Help

Threads in Python are still useful for I/O-bound tasks — like waiting for:

  • Network responses
  • Disk reads/writes
  • User input

In those cases, threads can make progress while others are waiting, because the GIL is temporarily released during I/O operations.
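
A quick way to see this, with time.sleep() standing in for an I/O wait (fake_io_task is a made-up name, not part of the lab code): four 1-second "waits" overlap, so the whole run takes about 1 second instead of 4.

import time
from threading import Thread

def fake_io_task(i):
    time.sleep(1)  # stands in for a network or disk wait; the GIL is released here
    print(f"task {i} done")

start = time.time()
threads = [Thread(target=fake_io_task, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Total:", time.time() - start, "seconds")  # ~1 s, not ~4 s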


🧩 Part 3 — Task Parallelism (C)

🧠 Concept

Run different tasks at the same time using multiple threads. The C program below creates two threads, each performing a different task: Thread 1 computes the sum of square roots of the integers from 0 up to 10 million, while Thread 2 computes the sum of their squares over the same range.


🧰 Code — task_parallelism.c

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <stdint.h>
#include <time.h>   // needed for clock_gettime() and struct timespec

const int64_t n = 10000000;

void* compute_square_roots(void* arg) {
    double total = 0;
    for (int64_t i = 0; i < n; i++)
        total += sqrt((double)i);
    printf("Sum of square roots: %f\n", total);  // use the result so -O2 can't delete the loop
    return NULL;
}

void* compute_powers(void* arg) {
    double total = 0;
    for (int64_t i = 0; i < n; i++)
        total += pow((double)i, 2);
    printf("Sum of squares: %f\n", total);
    return NULL;
}

int main() {
    
    pthread_t t1, t2;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, compute_square_roots, NULL);  // Thread 1
    pthread_create(&t2, NULL, compute_powers, NULL);        // Thread 2
    pthread_join(t1, NULL);                                 // Wait for thread 1 to finish
    pthread_join(t2, NULL);                                 // Wait for thread 2 to finish
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Time: %f seconds\n", elapsed);
    return 0;
}

Build & Run:

# -O2 enables optimizations; -lm links the math library (it must come after the source file)
gcc -O2 -pthread -o task_parallelism task_parallelism.c -lm
./task_parallelism

Discussion:

  • This is Task Parallelism — different computations run concurrently.
  • Each thread executes its own independent task.
  • Tasks must be independent — no shared state without synchronization (see the sketch below).
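
To see why that last rule matters, here is a minimal Python sketch (a made-up shared counter, not part of the lab code) where two threads update shared state; the standard library's threading.Lock makes the update safe:

import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:  # without this lock, updates can interleave and be lost
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 200000 with the lock; often less without it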

🧩 Summary Discussion

| Concept | Example | Key Takeaway |
| --- | --- | --- |
| Nondeterminism | Hello World threads | Thread execution order is unpredictable |
| Hotspot Detection | slow_function() | Profiling shows where time is spent |
| Data Parallelism | Splitting array chunks | Same task on different data |
| Task Parallelism | Two C threads with different work | Different tasks run simultaneously |
| Performance Bottleneck | Synchronization & serial code | Limits scalability (Amdahl's Law) |
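
That last row can be made precise. Amdahl's Law says that if a fraction p of a program's work can be parallelized across N workers, the overall speedup is:

    S(N) = 1 / ((1 - p) + p / N)

Even as N grows without bound, the speedup is capped at 1 / (1 - p): a program that is 10% serial can never run more than 10× faster, no matter how many cores you add.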

🧰 Tools Used

| Tool | Purpose | Command Example |
| --- | --- | --- |
| gcc -pthread | Compile C pthread programs | gcc -pthread hello_threads.c |
| python -m cProfile | Profile Python code | python -m cProfile script.py |
| perf / gprof (Linux) | Optional C profiling | gprof ./task_parallelism (after compiling with -pg) |

🧠 Key Takeaways

  • Parallelism ≠ determinism — output order can vary.
  • Profiling is the first step toward optimization.
  • Data parallelism and task parallelism are two key strategies.
  • Always ensure independent tasks to avoid data races.