Recommended Book Andrew S. Tanenbaum and Maarten Van Steen. Pearson International Edition Chapters 6, 7, 8, 11 and There is no fixed textbook or fixed set of literature that cover all the materials of the course.
Distributed systems for fun and profit
The course materials will be designed from a set of selected literature, especially during the later half of the course. Prerequisite Apply key ideas to maintain the correctness in distributed systems. Learning techniques to design and implement a distributed file system.
Learning and applying techniques to recover from faults in distributed systems. Measurable Outcomes Build models of distributed systems [LO 1]. Prototype distributed software systems [LO 1,2,3,4,5]. An early innovator in this space was Google, which by necessity of their large amounts of data had to invent a new paradigm for distributed computation — MapReduce. They published a paper on it in and the open source community later created Apache Hadoop based on it.
MapReduce can be simply defined as two steps — mapping the data and reducing it to something meaningful. Say we are Medium and we stored our enormous information in a secondary distributed database for warehousing purposes.
We want to fetch data representing the number of claps issued each day throughout April a year ago. This example is kept as short, clear and simple as possible, but imagine we are working with loads of data e. Each Map job is a separate node transforming as much data as it can. Each job traverses all of the data in the given storage node and maps it to a simple tuple of the date and the number one. Then, three intermediary steps which nobody talks about are done — Shuffle, Sort and Partition. They basically further arrange the data and delete it to the appropriate reduce job.
This is a good paradigm and surprisingly enables you to do a lot with it — you can chain multiple MapReduce jobs for example. MapReduce is somewhat legacy nowadays and brings some problems with it. Because it works in batches jobs a problem arises where if your job fails — you need to restart the whole thing. A 2-hour job failing can really slow down your whole data processing pipeline and you do not want that in the very least, especially in peak hours. Another issue is the time you wait until you receive results. In real-time analytic systems which all have big data and thus use distributed computing it is important to have your latest crunched data be as fresh as possible and certainly not from a few hours ago.
As such, other architectures have emerged that address these issues.
Namely Lambda Architecture mix of batch processing and stream processing and Kappa Architecture only stream processing. Distributed file systems can be thought of as distributed data stores. They typically go hand in hand with Distributed Computing. Wikipedia defines the difference being that distributed file systems allow files to be accessed using the same interfaces and semantics as local files, not through a custom API like the Cassandra Query Language CQL.
Google Code University
Boasting widespread adoption, it is used to store and replicate large files GB or TB in size across many machines. Its architecture consists mainly of NameNodes and DataNodes.
NameNodes are responsible for keeping metadata about the cluster, like which node contains which file blocks. DataNodes simply store files and execute commands like replicating a file, writing a new one and others. Unsurprisingly, HDFS is best used with Hadoop for computation as it provides data awareness to the computation jobs.
Said jobs then get ran on the nodes storing the data. This leverages data locality — optimizes computations and reduces the amount of traffic over the network.
go site Leveraging Blockchain technology, it boasts a completely decentralized architecture with no single owner nor point of failure. It stores file via historic versioning, similar to how Git does. It is still undergoing heavy development v0. They allow you to decouple your application logic from directly talking with your other systems. A message is broadcast from the application which potentially create it called a producer , goes into the platform and is read by potentially multiple applications which are interested in it called consumer s.
If you need to save a certain event to a few places e. Consumers can either pull information out of the brokers pull model or have the brokers push information directly into the consumers push model. RabbitMQ — Message broker which allows you finer-grained control of message trajectories via routing rules and other easily configurable settings.
Can be called a smart broker, as it has a lot of logic in it and tightly keeps track of messages that pass through it. Uses a push model for notifying the consumers. Kafka — Message broker and all out platform which is a bit lower level, as in it does not keep track of which messages have been read and does not allow for complex routing logic. This helps it achieve amazing performance. In my opinion, this is the biggest prospect in this space with active development from the open-source community and support from the Confluent team.
Kafka arguably has the most widespread use from top tech companies. I wrote a thorough introduction to this, where I go into detail about all of its goodness. Apache ActiveMQ — The oldest of the bunch, dating from Lets you quickly integrate it with existing applications and eliminates the need to handle your own infrastructure, which might be a big benefit, as systems like Kafka are notoriously tricky to set up.
If you roll up 5 Rails servers behind a single load balancer all connected to one database, could you call that a distributed application? Recall my definition from up above:. A system is distributed only if the nodes communicate with each other to coordinate their actions.
Therefore something like an application running its back-end code on a peer-to-peer network can better be classified as a distributed application. Regardless, this is all needless classification that serves no purpose but illustrate how fussy we are about grouping things together. Erlang is a functional language that has great semantics for concurrency, distribution and fault-tolerance. The Erlang Virtual Machine itself handles the distribution of an Erlang application. Its model works by having many isolated lightweight processes all with the ability to talk to each other via a built-in system of message passing.
The model is what helps it achieve great concurrency rather simply — the processes are spread across the available cores of the system running them. This swarm of virtual machines run one single application and handle machine failures via takeover another node gets scheduled to run. In fact, the distributed layer of the language was added in order to provide fault tolerance. Software running on a single machine is always at risk of having that single machine dying and taking your application offline.
Software running on many nodes allows easier hardware failure handling, provided the application was built with that in mind. BitTorrent is one of the most widely used protocol for transferring large files across the web via torrents. The main idea is to facilitate file transfer between different peers in the network without having to go through a main server. Using a BitTorrent client, you connect to multiple computers across the world to download a file.
When you open a. It helps with peer discovery, showing you the nodes in the network which have the file you want. You have the notions of two types of user, a leecher and a seeder. A leecher is the user who is downloading a file and a seeder is the user who is uploading said file.
The funny thing about peer-to-peer networks is that you, as an ordinary user, have the ability to join and contribute to the network. BitTorrent and its precursors Gnutella , Napster allow you to voluntarily host files and upload to other users who want them. The reason BitTorrent is so popular is that it was the first of its kind to provide incentives for contributing to the network. Freeriding , where a user would only download files, was an issue with the previous file sharing protocols. BitTorrent solved freeriding to an extent by making seeders upload more to those who provide the best download rates.
It works by incentivizing you to upload while downloading a file.
This causes a lack of seeders in the network who have the full file and as the protocol relies heavily on such users, solutions like private trackers came into fruition. Private trackers require you to be a member of a community often invite-only in order to participate in the distributed network. After advancements in the field, trackerless torrents were invented. This was an upgrade to the BitTorrent protocol that did not rely on centralized trackers for gathering metadata and finding peers but instead use new algorithms.
A distributed ledger can be thought of as an immutable, append-only database that is replicated, synchronized and shared across all nodes in the distributed network. Blockchain is the current underlying technology used for distributed ledgers and in fact marked their start. This latest and greatest innovation in the distributed space enabled the creation of the first ever truly distributed payment protocol — Bitcoin.
Blockchain is a distributed ledger carrying an ordered list of all transactions that ever occurred in its network. Transactions are grouped and stored in blocks.
The whole blockchain is essentially a linked-list of blocks hence the name. Said blocks are computationally expensive to create and are tightly linked to each other through cryptography. This hash requires a lot of CPU power to be produced because the only way to come up with it is through brute-force.
Miners are the nodes who try to compute the hash via bruteforce. The miners all compete with each other for who can come up with a random string called a nonce which, when combine with the contents, produces the aforementioned hash. Once somebody finds the correct nonce — he broadcasts it to the whole network. Said string is then verified by each node on its own and accepted into their chain.