Parallel Computing

🧪 Introduction to Parallel Computing Lab

Duration: ~1 hour
Environment: Linux (Ubuntu / WSL) and Windows


🎯 Learning Goals

By the end of this lab, students will be able to:

  • Understand that concurrent threads do not execute in a fixed order.
  • Identify computational hotspots using profiling tools.
  • Distinguish between data parallelism and task parallelism.
  • Partition work into independent tasks and parallelize them.

🧩 Part 0 — Hello Parallel World (Nondeterminism)

🧠 Concept

When two threads run concurrently, the order of execution is not deterministic — it depends on the operating system scheduler.


🧰 C Example — hello_threads.c

#include <stdio.h>
#include <pthread.h>

void* print_hello(void* arg) {
    printf("Hello ");
    return NULL;
}

void* print_world(void* arg) {
    printf("World!\n");
    return NULL;
}

int main() {
    pthread_t t1, t2;

    // Launch both threads; which one runs first is up to the OS scheduler
    pthread_create(&t1, NULL, print_hello, NULL);
    pthread_create(&t2, NULL, print_world, NULL);

    // Wait for both threads to finish before exiting
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    return 0;
}

Build & Run:

gcc -pthread -o hello_threads hello_threads.c
./hello_threads

Example Outputs:

Hello World!

or

World!
Hello 

⚠️ The order is unpredictable — threads run concurrently!


🐍 Python Example — hello_threads.py

import threading
import time, random

def say_hello():
    time.sleep(random.random() * 0.01)  # tiny random delay so either ordering can win
    print("Hello ", end="")

def say_world():
    time.sleep(random.random() * 0.01)
    print("World!")

t1 = threading.Thread(target=say_hello)
t2 = threading.Thread(target=say_world)

t1.start()
t2.start()
t1.join()
t2.join()

Run:

python hello_threads.py

Example Outputs:

Hello World!

or

World!
Hello 

✅ This demonstrates nondeterminism — thread execution order isn’t guaranteed.
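
If you ever need a fixed order, you have to impose it yourself, for example by joining the first thread before starting the second (reusing the say_hello/say_world functions above):

t1 = threading.Thread(target=say_hello)
t2 = threading.Thread(target=say_world)

t1.start()
t1.join()   # wait for "Hello " to be printed...
t2.start()  # ...before "World!" can even start
t2.join()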


🧭 Discovering Your Machine: Installing and Using lstopo (hwloc)

lstopo is part of the hwloc (Hardware Locality) toolset, used to visualize and inspect hardware topology (CPUs, cores, caches, NUMA nodes, etc.).


🐧 Linux Installation

1. Install via package manager

On Ubuntu / Debian:

sudo apt update
sudo apt install hwloc

2. Verify installation

lstopo

If a graphical display is available, lstopo opens a window showing your system's topology; otherwise (e.g. over SSH, or in WSL without a GUI) it prints a textual representation. You can force text output with lstopo --of console.
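
The exact output depends on your hardware. As a rough illustration only (a hypothetical dual-core machine with hyper-threading), the textual form looks something like:

Machine (16GB total)
  Package L#0
    NUMANode L#0 (P#0 16GB)
    L3 L#0 (8192KB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)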


3. (Optional) Generate graphical output

You can export a topology image:

lstopo topology.png

🪟 Windows Installation

1. Download the prebuilt ZIP

Go to the official hwloc page: 👉 https://www.open-mpi.org/software/hwloc/v2.12/

Choose the build matching your system:

  • hwloc-win64-build-2.12.2.zip → for 64-bit Windows
  • hwloc-win32-build-2.12.2.zip → for 32-bit Windows

2. Extract the ZIP

  1. Right-click the ZIP file → “Extract All…”
  2. Choose a destination folder, e.g. C:\hwloc-2.12.2-win64\
  3. Inside it, find the bin\ folder containing lstopo.exe.

3. (Optional) Add to PATH

To run lstopo from any terminal:

  1. Open System Properties → Advanced → Environment Variables

  2. Edit the Path variable and add:

    C:\hwloc-2.12.2-win64\bin
    
  3. Reopen your Command Prompt or PowerShell.


4. Run lstopo

Open Command Prompt and run:

lstopo

or to generate a PNG file:

lstopo topology.png

✅ Summary

| Platform | How to Install | Command |
| --- | --- | --- |
| Linux (Ubuntu/Debian) | sudo apt install hwloc | lstopo |
| Windows | Download ZIP from open-mpi.org and extract | lstopo.exe |

🧩 Part 1 — Finding Hotspots (Profiling)

🧠 Concept

A hotspot is the section of your code where most of the execution time is spent. Profiling helps identify these bottlenecks so you can target parallelization effectively.

The following sequential program calculates the square roots of integers from 1 to 10,000,000. To identify the hotspot—the parts of the code that dominate runtime—we’ll run Python’s profiler and examine its report of time spent per function.


🐍 Python Example — serial_computation.py

import time

def slow_function(n):
    total = 0
    for i in range(n):
        total += i ** 0.5  # intentionally slow; ** is the power operator, so i ** 0.5 is sqrt(i)
    return total

def main():
    start = time.time()
    result = slow_function(10_000_000)
    end = time.time()
    print("Result:", result)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Run:

python serial_computation.py

Profile with cProfile:

python -m cProfile -s tottime serial_computation.py

Output:

    Result: 21081849486.439312
    Time: 1.751760482788086 seconds

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    1.732    1.732    1.732    1.732  serial_computation.py:3(slow_function)
         2    0.000    0.000    0.000    0.000  {built-in method builtins.print}
         1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
         1    0.000    0.000    1.732    1.732  serial_computation.py:9(main)
         1    0.000    0.000    1.732    1.732  {built-in method builtins.exec}
         2    0.000    0.000    0.000    0.000  {built-in method time.time}
         1    0.000    0.000    1.732    1.732  serial_computation.py:1(<module>)

For more information about the Python profiler, see the official documentation: https://docs.python.org/3/library/profile.html
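
cProfile can also be driven from inside a program when you only want to profile a single call. A minimal sketch, assuming it runs at the bottom of serial_computation.py where slow_function is defined (the profile.out filename is arbitrary):

import cProfile
import pstats

# Profile one call and save the raw stats to a file
cProfile.run("slow_function(10_000_000)", "profile.out")

# Load the stats and print the five most expensive functions by total time
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(5)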


🧩 Part 2 — Data Parallelism with Processes (Python)

🧠 Concept

Split the dataset into chunks and perform the same task on each chunk in parallel.


🧰 Code — parallel_computation.py

import time
from multiprocessing import Pool

def slow_function(start, end):
    total = 0
    for i in range(start, end):
        total += i ** 0.5
    return total

def main():
    n = 10_000_000
    num_workers = 4
    chunk = n // num_workers  # integer division, rounded down (here n divides evenly)
    # Compute the [start, end) range handled by each worker
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(num_workers)]

    start = time.time()
    with Pool(num_workers) as p:  # create 4 worker processes (not threads)
        # starmap unpacks each (start, end) tuple and returns a list of partial sums
        results = p.starmap(slow_function, ranges)
    total = sum(results)
    end = time.time()

    print("Result:", total)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Run:

python parallel_computation.py

Output:

    Result: 21081849486.443127
    Time: 0.6766617298126221 seconds
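
A useful sanity check: 1.751 / 0.677 ≈ 2.6, so four workers gave roughly a 2.6× speedup rather than the ideal 4×. Process startup, sending results back, and the serial sum(results) step likely account for most of the gap.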

🧠 Demonstrating the Global Interpreter Lock (GIL) in Python

Let’s use the same example that we ran with multiprocessing, but this time using threads instead of processes.

🧩 Code Example — Using Threads

import time
from threading import Thread

def slow_function(start, end):
    total = 0
    for i in range(start, end):
        total += i ** 0.5
    return total

# Thread function: store this thread's partial sum in the shared results list
def worker(start, end, results, idx):
    results[idx] = slow_function(start, end)

def main():
    n = 10_000_000
    num_threads = 4
    chunk = n // num_threads
    # Split the job into equal chunks, one per thread
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(num_threads)]

    results = [0] * num_threads
    threads = []

    start = time.time()
    for i in range(num_threads):
        t = Thread(target=worker, args=(ranges[i][0], ranges[i][1], results, i))
        threads.append(t)
        t.start()

    # The main thread must wait for all workers to finish;
    # otherwise it would compute sum(results) before the
    # partial sums have been filled in.
    for t in threads:
        t.join()

    total = sum(results)
    end = time.time()

    print("Result:", total)
    print("Time:", end - start, "seconds")

if __name__ == "__main__":
    main()

Output

    Result: 21081849486.443127
    Time: 1.7334659099578857 seconds

⚙️ What’s Happening

Even though we start 4 threads, the program won’t run faster than the single-threaded version. That’s because of Python’s Global Interpreter Lock (GIL).

The GIL is a mutex that protects access to Python objects, ensuring that only one thread executes Python bytecode at a time. This makes memory management simpler and prevents race conditions in CPython, but it also means:

Multiple Python threads cannot run CPU-bound code truly in parallel on multiple cores.


🧮 Comparison

| Version | Parallel type | True multi-core? | Expected performance |
| --- | --- | --- | --- |
| multiprocessing.Pool | Separate processes | ✅ Yes | ~4× faster (on 4 cores) |
| threading.Thread | Shared-memory threads | ❌ No (blocked by GIL) | ~1× (same as single thread) |

💡 When Threads Do Help

Threads in Python are still useful for I/O-bound tasks — like waiting for:

  • Network responses
  • Disk reads/writes
  • User input

In those cases, threads can make progress while others are waiting, because the GIL is temporarily released during I/O operations.
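
A quick way to see this, with time.sleep() standing in for an I/O wait (fake_io_task is a made-up name, not part of the lab code): four 1-second "waits" overlap, so the whole run takes about 1 second instead of 4.

import time
from threading import Thread

def fake_io_task(i):
    time.sleep(1)  # stands in for a network or disk wait; the GIL is released here
    print(f"task {i} done")

start = time.time()
threads = [Thread(target=fake_io_task, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Total:", time.time() - start, "seconds")  # ~1 s, not ~4 s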


🧩 Part 3 — Task Parallelism (C)

🧠 Concept

Run different tasks at the same time using multiple threads. The C program below creates two threads, each performing a different task: Thread 1 computes the sum of square roots of the integers from 0 up to 10 million, while Thread 2 computes the sum of their squares over the same range.


🧰 Code — task_parallelism.c

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <stdint.h>
#include <time.h>   // needed for clock_gettime() and struct timespec

const int64_t n = 10000000;

void* compute_square_roots(void* arg) {
    double total = 0;
    for (int64_t i = 0; i < n; i++)
        total += sqrt((double)i);
    printf("Sum of square roots: %f\n", total);  // use the result so -O2 can't delete the loop
    return NULL;
}

void* compute_powers(void* arg) {
    double total = 0;
    for (int64_t i = 0; i < n; i++)
        total += pow((double)i, 2);
    printf("Sum of squares: %f\n", total);
    return NULL;
}

int main() {
    
    pthread_t t1, t2;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, compute_square_roots, NULL);  // Thread 1
    pthread_create(&t2, NULL, compute_powers, NULL);        // Thread 2
    pthread_join(t1, NULL);                                 // Wait for thread 1 to finish
    pthread_join(t2, NULL);                                 // Wait for thread 2 to finish
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Time: %f seconds\n", elapsed);
    return 0;
}

Build & Run:

# -O2 enables optimizations; -lm links the math library (it must come after the source file)
gcc -O2 -pthread -o task_parallelism task_parallelism.c -lm
./task_parallelism

Discussion:

  • This is Task Parallelism — different computations run concurrently.
  • Each thread executes its own independent task.
  • Tasks must be independent — no shared state without synchronization (see the sketch below).
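
To see why that last rule matters, here is a minimal Python sketch (a made-up shared counter, not part of the lab code) where two threads update shared state; the standard library's threading.Lock makes the update safe:

import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:  # without this lock, updates can interleave and be lost
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 200000 with the lock; often less without it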

🧩 Summary Discussion

| Concept | Example | Key Takeaway |
| --- | --- | --- |
| Nondeterminism | Hello World threads | Thread execution order is unpredictable |
| Hotspot Detection | slow_function() | Profiling shows where time is spent |
| Data Parallelism | Splitting array chunks | Same task on different data |
| Task Parallelism | Two C threads with different work | Different tasks run simultaneously |
| Performance Bottleneck | Synchronization & serial code | Limits scalability (Amdahl's Law) |
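
That last row can be made precise. Amdahl's Law says that if a fraction p of a program's work can be parallelized across N workers, the overall speedup is:

    S(N) = 1 / ((1 - p) + p / N)

Even as N grows without bound, the speedup is capped at 1 / (1 - p): a program that is 10% serial can never run more than 10× faster, no matter how many cores you add.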

🧰 Tools Used

| Tool | Purpose | Command Example |
| --- | --- | --- |
| gcc -pthread | Compile C pthread programs | gcc -pthread hello_threads.c |
| python -m cProfile | Profile Python code | python -m cProfile script.py |
| perf / gprof (Linux) | Optional C profiling | gprof ./task_parallelism (after compiling with -pg) |

🧠 Key Takeaways

  • Parallelism ≠ determinism — output order can vary.
  • Profiling is the first step toward optimization.
  • Data parallelism and task parallelism are two key strategies.
  • Always ensure independent tasks to avoid data races.