Taking computer science courses are already paying off in my career. Nothing too significant (yet) but I am witnessing small wins.
For example, this past summer I suffered through HPCA (high performance computing architecture), a course historically only offered in the longer semesters (e.g. fall, spring). In the course, I learned a lot of theory: barrier synchronization, processing pipelines, instruction scheduling, multi level caches, branch predicators, cache coherence protocols, and much more (many of which I have already purged from my memory). However, since most of that knowledge was primarily theory, very little of it I’ve been able to directly apply to my day to day job.
Until a couple days ago.
While at work, my co-worker had pointed out that the unlikely C function calls were sprinkled throughout our code base, used for branch prediction. And thanks to theory, I was able to describe to another co worker how the compiler (with its understanding of the underlying CPU architecture) will rearrange the assembly (or machine code) in order to avoid a branch prediction miss: if the wrong instruction is fetched, the processing pipeline must flush all the previous instructions queued in the pipeline, temporarily halting the pipeline, which ultimately reduces the throughput.
And this is just one example of how taking my masters had paid off. There have been a number of other situations (like understanding shared memory and interprocess communication) that I’ve been able to apply theory from academia to practice at work.
Thanks to the program, I’m feeling much more competent and confident working as a software developer at Amazon Web Services.
As a software programmer, I always had a vague understanding of how an operating system fetches data from memory. At an abstract level, I understood that a processor requests data from the memory controller, sending a message (with the memory address encoded) across the bus.
But I learned recently that in addition to the system’s physical memory—the same memory I used to squeeze into the motherboard when installing RAM—there are multiple hardware caches, completely abstracted from the programmer.
These caches sit between the processor and main memory, called: L1 cache and L2 cache and L3 cache. Each cache differs: in cost, in size, and in distance from the CPU. The lower the digit, the higher cost, the smaller in size, and the closer it sits to CPU. For example, if we compare L1 and L2 cache, L1 costs more, holds less data, and sits closer to the processor.
When the processor wants to retrieve data from memory, it sends a request first lands in L1 cache’s world. If L1 has that memory page cached, it immediately sends that data back to the processor, preventing the request from unnecessarily flowing towards the memory controller. This pattern of checking local cache and forwarding requests repeats until the request eventually reaches the memory controller, where data is actually stored.
The further we allow the CPU’s request to travel down the bus, we penalize the CPU, forcing it to wait, like a car at a stop sign, for longer cycles. For example, the CPU waits 4 cycles for L1 cache, 12 cycles for L2, 36 cycles for L3, and—wait for it—62 cycles when accessing main memory. Therefore, we strive to design systems that cache as much data as possible and as close to the CPU, increasing overall system performance.
We break down a cache into the following components:
As you can see from the image above, we organize our cache sets (S), lines (L), and blocks (B). One block of data represents 8 bits (1 byte) and every block of data is represented by a physical memory address. For example, the memory address 0x0000 may store 010101010 and 0x0001 may store 01110111 another. We group these blocks together into a line, which store sequential blocks. A line may store two or four or eight or sixteen bytes—it all depends on how we design the system. Finally, each line belongs to a set, a bucket that stores one or more lines. Like the number of bytes a line stores, a set can store one or two or three or forty—again, it all depends on our design.
Together, the total number of sets, number of lines, and number of bytes determine the cache’s size, calculated with the following formula: cache size = S x E x B.
In the next post, I’ll cover how a cache processes a memory address, determining whether it retrieves memory from cache or forwards the request to the next cache (or memory controller).
My plan on completing Data Structures and Algorithms in Java by December 31st, 2016, requires changing my reading strategy. Reading technical books cover to cover, like a novel, is rewarding but inefficient and impractical. How about a different approach ?
I’m applying active reading principles from How to read a book and A mind for numbers: active reading by questioning author’s intention, indentifying questions author is attempting to answer.
I skimmed the 15 chapters (although I had already read the chapters on maps) and below are my notes.
Before I skimmed each chapter, I paused and flipped to the page that maps each chapter/section to the ACM. I bookedmarked this page because it helps frame the entire study session — constant questioning of what, and why, I am reading the material.
Introduction to java and basic programming principles. Conditionals, loops — basic stuff.
Object oriented design. Introduction to design patterns such as the adapter pattern. The three criteria for good software: robustness (handling unexpected inputs), adaptiblity (adapting to changes and requirements), reusability (reusing code somewhere else).
Fundamental data structures including arrays and linked lists — singlely, doubley and circular linked lists. Covers the advantage of implementing linked lists over arrays, and vice versa. Compares shallow and deep copies: shallow method copies pointers but deep instantiates another copy of the reference.
Establishing nomenclature for comparing performacne: big-o. The question of what and why big-o notation, and its history. The three levels of performance: best, worst, and avergae. The 7 mathematical functions, described in ascending order, from best to worst:
Recursion. Impleneting a recursive search algorithm: binary search. When is recursion a bad idea? What a tail recursion is and how to eliminate it.
Additional data structures. Introducing additional abstract data types (ADT): stacks and queues. Two primary methods of implementing a stack: array and linked list. Leveraging modulus technique to maintain a circular queue. Dequeue, pronounced as “deck”, implements properties of both stack and queue: add to head or tail, and remove from head or tail.
High level concept of object oriented programming. Instead of relying on the client to initially size the array, a more client-friendly solution is offered: dynamic arrays. The iterator interface is introduced, as well as positional lists.
Generic trees and ordered trees – including binary and binary search trees, and techniques to traverse: preorder, postorder. Peek at breathe depth first — I thought this concept was reserved for graphs. The chapter ends use cases and trees can model and solve real problems.
Additional ADT: priority queue, which can be implementing with a positional list or a sorted list. The positional list offers O(1) to insert, but is O(N) for retrieving the minimum. The sorted list can, instead, offer O(1) for minimum but O(N) for inserting. Is there a data structure that offers the best of both worlds, something in between? Yes. The Heap. The heap offers log(N) for both insert and retrieving the minimum value. Additional features, such as removing an entry (not minimum) from a priority queue, require adapting the ADT.
The map ADT is introduced. A primitive data structure to implement a map, via an array, is introduced, by followed the hashtable. The chapter ends with sorted maps, skip lists and sets.
Binary search trees, a special type of tree. To combat the worst case scenario (inserting sorted items produces, effectively, a linked list) another ADT is introduced: AVL trees. Two other balancing trees are introduced: splay and (2,4) trees. Chapter ends with Red-Black trees.
Up until Chapter 12, only two elementary sorting algorithms have been introduced: insertion sort and select sort. Divide and conquer, an algorithm design pattern, is introduced with two concrete implementations: merge and quick sort — both with a runtime complexity of O(nlogn). Radix and bucket is an improvement in runtime complexity — O(N). The concept of stable sorting algorithms are introduced, and we face another type of issue: selecting the correct items.
Pretty neat topic on tackling “string matching.” Discusses a primitive method for string matching — I think we can I guess how this brute force method works — followed by advanced techniques: The Boyer-Moore algorithm and Knute-Morris-Prath Algorithms. Covers compression algorithms, including the huffman code. Covers dynamic programming, a method for matching a sequence of strings but reduces the runtime complexity, from exponential, by implementing “a few loops”.
Introduces another ADT – graphs and respective terms: edges, vertices, directed, undirected. Presents two different strategies for keeping track of neighbors: adjaceny matrix and adjancey list. Discusses (2) methods for searching for nodes: depth search first or breathe search first. Covers several methods for constructing a minimum spanning tree. Like OSPF, introduces Dijastreka’s algorithms for determining shortest paths.
Dives into characteristics of JVM: garbage collection. Introduces the PC — not personal computer — to keep track of function/procedure calls. Compares the types of memory and the hiearchy in terms of performance. How data is stored on external disks, such as B-Tree.
 – How to read a book – http://pne.people.si.umich.edu/PDF/howtoread.pdf and a similar article: https://www.farnamstreetblog.com/how-to-read-a-book/
 – ACM – https://www.acm.org/education/CS2013-final-report.pdf