In this project, we
develop a distributed system that is capable of deriving knowledge from any
dataset by making use of artificial intelligence. The system will run on an
Array of Inexpensive Machines (AIM) thereby achieving machine (or computer)
level parallelism. We also make use of homomorphic encryption in order to
ensure that even in the case of a data breach, no confidential information is
leaked to the open internet.
We aim to provide an
intuitive user interface that will allow users who do not have a strong
background in computer science to extract knowledge from data.
According to Gottlieb and Almasi, parallel computing is the type of computation
in which many calculations or the execution of processes are carried out
simultaneously. Large problems are divided into smaller ones which can be solved
at the same time. This project will exploit the concepts of parallel computing
as fully as possible by creating a high performance computing cluster.
Traditionally, the different types of parallelism are:
Bit level: Bit-level parallelism is a form of parallel computing based on
increasing processor word size. Increasing the word size reduces the number of
instructions the processor must execute in order to perform an operation on
variables whose sizes are greater than the length of the word.
Instruction level: Instruction-level parallelism (ILP) is a measure of how many
of the instructions in a computer program can be executed simultaneously.
Data level: Data parallelism is a form of parallelization across multiple
processors in which the data is distributed across the processors and each
processor applies the same operation to its share.
Task level: Task parallelism is a form of parallelization of computer code
across multiple processors in which different tasks are distributed across the
processors.
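The effect of word size in the bit-level case can be illustrated with a minimal sketch. The helper `add_in_words` below is hypothetical and is not part of the proposed system; it adds two unsigned integers one word at a time, the way a processor with a narrow word would, and counts how many word-sized additions are needed.

```python
from typing import Tuple


def add_in_words(a: int, b: int, total_bits: int, word_bits: int) -> Tuple[int, int]:
    """Add two unsigned total_bits-wide integers using word_bits-wide additions.

    Returns (sum modulo 2**total_bits, number of word-sized additions used).
    """
    mask = (1 << word_bits) - 1
    result, carry, additions = 0, 0, 0
    for shift in range(0, total_bits, word_bits):
        word_a = (a >> shift) & mask
        word_b = (b >> shift) & mask
        partial = word_a + word_b + carry      # one word-sized add with carry-in
        result |= (partial & mask) << shift    # keep the low word of the partial sum
        carry = partial >> word_bits           # carry into the next word
        additions += 1
    return result & ((1 << total_bits) - 1), additions


# A 16-bit addition needs two steps on an 8-bit word but one on a 16-bit word.
print(add_in_words(300, 500, 16, 8))   # (800, 2)
print(add_in_words(300, 500, 16, 16))  # (800, 1)
```

Doubling the word size halves the number of additions in this example, which is the saving that bit-level parallelism refers to.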
We propose a new type of parallelism called machine
(or computer) level parallelism. This type of parallelism can only be used when
we have multiple machines that are connected by using any high performance
The major advantages of using the concepts of parallel computing are:
Parallelism gives a major performance boost to applications that do not contain
data, control and branch hazards.
Applications that run in parallel consume less time than their serial
counterparts.
In computer architecture, speedup is a measure of the relative performance of
two systems processing the same problem. More technically, it is the improvement
in speed of execution of a task executed on two similar architectures with
different resources.
Amdahl's law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster
mode can be used.
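In its common formulation, Amdahl's law gives the speedup as S = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The symbols p and n and the function name below are our own notation for illustration; the sketch simply evaluates this formula.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when parallel_fraction of the work runs on `workers` machines."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 5 machines and 90% of the work parallelizable (an assumed figure),
# the speedup is bounded by 1 / 0.28, which is about 3.57 rather than 5.
print(round(amdahl_speedup(0.9, 5), 2))  # 3.57
```

Even with an unlimited number of machines, the same 90% workload could not exceed a speedup of 10, because the serial 10% always remains.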
Figure (i) shows the architecture of the project by
taking 5 machines into consideration.
The architecture is divided into three different layers:
Computation Layer: This layer contains all the machines that will be running
computation tasks. This layer can be viewed as a high performance compute
cluster (HPC).
Command and Control Center: This layer manages the other two layers. The users
will directly interact with this layer. This layer will need little to no input
after the initial setup.
Storage Layer: As the name implies, this layer is used to store data.
The aforementioned layers are connected by a Gigabit switch that facilitates a
highly reliable and fast mode of communication between the components.
The components in the layers of the architecture are:
File Store: The file store is used to store all the encrypted datasets. The
datasets will be encrypted on the client side and will never be decrypted by the
application. All computations will be performed on the encrypted dataset itself.
Image Store: The image store is used to store all the images of the highly
optimised algorithms. These images will be loaded into a container and executed
when necessary.
Database: The database is used to keep track of all the transactions that are
happening. It is used to maintain a permanent record.
Gigabit Switch: The switch ensures that file transfers between the various
components take as little time as possible.
API Server: The API server is used as the command and control center. It will
automatically manage the entire datacenter. It will require little to no user
input after it is set up initially.
Web Server: The web server provides users with the required functionality.
Native client: The native client will encrypt the dataset before it is
transmitted to the file store. It will also be responsible for decryption of the
results.
Mobile apps and wearable apps: These small components will ensure that the users
are constantly updated about the status of the algorithm.
Container: This is a secure sandbox where all the computations will take place.
Daemon: A daemon will be present on every machine and it will take care of
container management.
The end user can make use of any HTML5 browser, the
native client application, or any mobile/wearable application to draw the
machine learning pipeline that he/she wishes to test. The UI will consist of a
canvas and a toolbar. First the user will be asked to enter the metadata of the
dataset that he/she wishes to use. Once this stage is complete, the user can
drag and drop items from the toolbar onto the canvas to create the machine
learning pipeline of his/her choice. After completion, the user can hit the
“Play” button and the system will verify if the pipeline is executable. If the
pipeline is not executable, the user will be notified about the same and a list
of possible solutions will also be provided. If the pipeline is executable, the
user will be allowed to upload the complete dataset.
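How the system decides whether a pipeline is executable is not specified here. As one possible sketch, assume a linear pipeline in which every stage declares the kind of data it consumes and produces; the check then walks the stages and reports every mismatch. The names `Stage` and `verify_pipeline` and the data kinds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str  # kind of data the stage expects, e.g. "table"
    produces: str  # kind of data the stage emits, e.g. "model"


def verify_pipeline(stages: List[Stage], dataset_kind: str) -> List[str]:
    """Return a list of problems; an empty list means the pipeline is executable."""
    if not stages:
        return ["The pipeline is empty; add at least one stage."]
    problems = []
    available = dataset_kind
    for stage in stages:
        if stage.consumes != available:
            problems.append(
                f"Stage '{stage.name}' expects '{stage.consumes}' "
                f"but receives '{available}'."
            )
        available = stage.produces  # output of this stage feeds the next one
    return problems


scaler = Stage("scaler", consumes="table", produces="table")
classifier = Stage("classifier", consumes="table", produces="model")
print(verify_pipeline([scaler, classifier], "table"))  # []
print(len(verify_pipeline([classifier, scaler], "table")))  # 1
```

Each reported problem names the offending stage, which is the information a list of possible solutions could be built from.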
In the 21st century, machine learning is used to aid
the decision makers in any organization at all levels. These decision makers
include (but are not limited to) doctors who are determining whether a tumor
is malignant or benign, a CEO who is drafting a company's business strategy, a
campaign manager deciding how to effectively target the base, and so on. All
these decisions are data driven and they need to mine confidential data.
It is no secret that there is a constant increase in
the number of data leaks. Recently, 40,000 OnePlus customers were hit by a credit
card data breach. The Yahoo breach, involving confidential information of more
than 3 billion users, occurred in 2013 but was reported only in the second half
of 2016. Due to these leaks and the delays in reporting them, users have lost
confidence in the ability of corporations and conglomerates to store
In this project, we aim to remove the element of trust
from the system. We aim to do this by making use of homomorphic encryption.
Homomorphic encryption is a form of encryption that allows computation on
ciphertexts, generating an encrypted result which, when decrypted, matches the
result of the operations as if they had been performed on the plaintext. The
purpose of homomorphic encryption is to allow computation on encrypted data.
Even though homomorphic encryption takes more time
to encrypt on the user side, it ensures the privacy of the data. It supports the
concept of securing the data in storage and during computation, rather than
only securing data during transmission.
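The property described above can be demonstrated with a toy example. The text does not name a particular scheme, so the sketch below uses the Paillier cryptosystem purely as an assumed stand-in; it supports addition on ciphertexts only, and the tiny primes make it completely insecure. All function names are illustrative.

```python
import math
import secrets
from typing import Tuple


def generate_keys(p: int, q: int) -> Tuple[int, Tuple[int, int]]:
    """Return (public modulus n, private key (lam, mu)) for two primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam modulo n (Python 3.8+)
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Encrypt 0 <= message < n with generator g = n + 1 and fresh randomness."""
    if not 0 <= message < n:
        raise ValueError("message must be in the range [0, n)")
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key: Tuple[int, int], ciphertext: int) -> int:
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)


# Toy primes for illustration only; real keys use primes of 1024 bits or more.
n, private_key = generate_keys(61, 53)
c_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
print(decrypt(n, private_key, c_sum))  # 42
```

The party that calls `add_encrypted` never sees 17, 25, or 42; only the holder of the private key can read the result.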
Keeping in mind the above information, the user will
be allowed to encrypt the dataset on the client side. The dataset will be
encrypted by making use of a cryptographically strong key and sent to the
file store. After this, the user can go away from keyboard (AFK).
The distributed system will parse the pipeline to
determine and provision the number of clusters that are needed. If the required
number of clusters cannot be provisioned, the system will wait until they can
be. Once all the containers are ready, the daemon will load the necessary images
into the containers, start them, and continuously monitor each container until
it is destroyed.
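The wait-until-available behaviour can be sketched as a simple first-in, first-out queue. This is only an illustration under our own assumptions: the class `Provisioner`, the notion of a "node", and the FIFO policy are hypothetical, and the actual unit being provisioned (machine, cluster, or container) is not fixed by the sketch.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Set, Tuple


class Provisioner:
    """Grants nodes to jobs in arrival order; a job waits until enough nodes are free."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.free: Set[str] = set(nodes)
        self.waiting: Deque[Tuple[str, int]] = deque()  # (job_id, nodes needed)
        self.running: Dict[str, Set[str]] = {}

    def submit(self, job_id: str, needed: int) -> None:
        self.waiting.append((job_id, needed))
        self._schedule()

    def release(self, job_id: str) -> None:
        """Return a finished job's nodes to the pool and retry waiting jobs."""
        self.free |= self.running.pop(job_id)
        self._schedule()

    def _schedule(self) -> None:
        # Only the job at the head of the queue is considered, so jobs never overtake.
        while self.waiting and self.waiting[0][1] <= len(self.free):
            job_id, needed = self.waiting.popleft()
            allocated = set(sorted(self.free)[:needed])
            self.free -= allocated
            self.running[job_id] = allocated


pool = Provisioner(["m1", "m2", "m3", "m4", "m5"])
pool.submit("job-a", 3)
pool.submit("job-b", 3)          # only 2 nodes are free, so job-b waits
print(sorted(pool.running))      # ['job-a']
pool.release("job-a")            # job-a finishes and job-b is provisioned
print(sorted(pool.running))      # ['job-b']
```

A job that asks for more nodes than exist would wait forever in this sketch, so a real scheduler would need to reject such requests up front.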
Every algorithm is made up of two parts, namely a master and a slave.
At any point in time, for a given algorithm there will
exist only one master and more than one slave. The master and the slaves will
collaborate with each other to ensure that the pipeline is successfully
executed. Once the execution is complete, the results are stored back in the
file store and a notification will be sent to the user via SMS, email and
app push notifications. The user can now log in, download and decrypt the results.
The encryption mechanism is end to end: the dataset will be decrypted only on
the client side. End-to-end encryption (E2EE) is a system of communication where
only the communicating end users can read the messages. It prevents potential
eavesdroppers, including telecom providers, Internet providers, and even the
provider of the communication service, from being able to access the
cryptographic keys needed to decrypt the conversation. It is commonly built on
public-key (public/private key) encryption, which uses a public key to encrypt
and a private key to decrypt.
Once a message has been encrypted using the public key,
only the corresponding private key can decrypt it. This mechanism
allows users to establish secure communication links without having
to worry about the security of the message being compromised.
If the user is not satisfied with the result, he/she
can set up a loop to continuously run the same pipeline with different
parameters to achieve maximum train and test accuracy.
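Such a loop can be sketched as an exhaustive search over a parameter grid. The function `run_pipeline` below is a hypothetical stand-in for submitting the pipeline to the system and reading back a single accuracy score; the parameter names and the fake scoring function exist only to make the example runnable.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple

Params = Dict[str, Any]


def search_parameters(
    run_pipeline: Callable[[Params], float],
    grid: Dict[str, List[Any]],
) -> Tuple[Optional[Params], float]:
    """Run the pipeline once per parameter combination and keep the best score."""
    best_params: Optional[Params] = None
    best_score = float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = run_pipeline(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_run_pipeline(params: Params) -> float:
    # Placeholder score that peaks at depth=3, rate=0.1; a real run would return accuracy.
    return -((params["depth"] - 3) ** 2) - (params["rate"] - 0.1) ** 2


best, score = search_parameters(
    fake_run_pipeline, {"depth": [1, 3, 5], "rate": [0.01, 0.1]}
)
print(best)  # {'depth': 3, 'rate': 0.1}
```

The number of runs grows as the product of the grid sizes, so larger searches would benefit from running the combinations in parallel across the cluster.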