Medha Atre
Scientific Consultant, Eydle Inc.
Medha Atre holds a Ph.D. in Computer Science from Rensselaer Polytechnic Institute (Troy, NY) with a focus on graph databases. She currently works with the founding team at Eydle Inc. to develop a distributed deep learning platform that optimizes computing cost and performance at scale. Medha lends her expertise to help Eydle in their mission of making AI computing intelligent.
During her PhD, Medha single-handedly developed the 'BitMat' system from scratch for handling very large RDF graphs and queries over them. BitMat has over 1000 downloads and 400 research citations to date. Medha has 20 years of combined experience in academia and industry, including positions as a Senior Researcher at the University of Oxford, an Assistant Professor at IIT Kanpur, and a postdoctoral researcher at the University of Pennsylvania. Before starting her PhD, she spent four and a half years in industry as a software developer, and during her PhD she interned at Oracle Inc. and the IBM T. J. Watson lab.
Redis for Scaling Distributed Deep Learning
Abstract:
------------
At Eydle, we are reimagining distributed deep learning technology to optimize training speed and cost. Our platform handles fault tolerance, variable network latency, and heterogeneity of devices, leading to a 70-90% reduction in cost. By using Redis as an eventually consistent key-value store for model parameters, we have achieved 1.5x faster transaction times.
Background:
-----------------
Distributed deep learning has gained a lot of interest in the past few years due to its cost effectiveness in scaling large deep neural net training jobs. A typical distributed deep learning setup has a client-server architecture. Several clients work independently, in parallel, on smaller portions of the training job, each maintaining a local copy of the neural net's weights (parameters). The parameter server combines these individual client weights into a central copy, and may in turn run several sub-processes (threads) for vertical scaling.
Shared-memory lock-free data structures and file locking are the commonly used ways of storing the weights centrally on a parameter server, but they prevent horizontal scaling of parameter servers to shared-nothing compute nodes. A natural way to circumvent this problem is to store the central weights in a database and let multiple parameter servers access them simultaneously. Conventional databases provide ACID properties, but their strong transaction consistency can hurt the speed of updates. Prior published work on lock-free data structures for distributed deep learning has shown that training can tolerate some loss of updates without a significant impact on accuracy. This makes eventually consistent main-memory databases a good choice for horizontally scaling distributed deep learning.
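To make the client/parameter-server split concrete, here is a minimal, illustrative sketch of the pattern described above. The names (ParameterServer, local_train_step) and the simple averaging rule are assumptions for illustration only, not Eydle's implementation.

```python
# Minimal sketch of the client / parameter-server pattern (illustrative only).
import numpy as np

class ParameterServer:
    """Holds the central copy of the model weights and merges client updates."""
    def __init__(self, shape):
        self.weights = np.zeros(shape)

    def push(self, client_weights, alpha=0.5):
        # Merge a client's local weights into the central copy
        # (a simple moving average here; real systems may use other rules).
        self.weights = (1 - alpha) * self.weights + alpha * client_weights

    def pull(self):
        # Clients fetch the latest central weights before the next round.
        return self.weights.copy()

def local_train_step(weights, lr=0.01):
    # Stand-in for a client's local training: apply a fake "gradient".
    fake_gradient = np.random.randn(*weights.shape)
    return weights - lr * fake_gradient

if __name__ == "__main__":
    server = ParameterServer(shape=(4,))
    for _ in range(3):                    # training rounds
        for _client in range(2):          # two independent clients
            local = local_train_step(server.pull())
            server.push(local)            # merge into the central copy
    print(server.weights)
```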
Using Redis in distributed deep learning:
-----------------------------------------------------
In our work, we use Redis to store the central parameter-server weights, accessed through multiple parameter services. Each service processes the neural network weights received from clients and updates the copy of the weights in Redis. With Redis, the individual parameter services do not have to connect to shared memory or lock a file to make concurrent updates. Redis also relieves the parameter services from handling crashes themselves, which becomes necessary with shared-memory or file-locking solutions, and it avoids the complexity of managing shared memory in the event of a crash. Using Redis' "save" configuration parameter, we can tune the frequency of disk writes, which is not possible in traditional databases.
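The sketch below shows how a parameter service might store serialized weights in Redis along the lines described above. The key names, the serialization choice (pickle), the merge rule, and the snapshot schedule passed to "save" are illustrative assumptions, not the exact Eydle setup.

```python
# Sketch of a parameter service keeping central weights in Redis (illustrative).
import pickle
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

# Optionally tune Redis' snapshot ("save") frequency: here, persist to disk
# if at least 1000 keys changed within 60 seconds (illustrative values).
r.config_set("save", "60 1000")

def push_weights(key, weight_arrays):
    """Serialize a list of weight arrays and store them with a single SET."""
    blob = pickle.dumps(weight_arrays)
    r.set(key, blob)

def pull_weights(key):
    """Fetch and deserialize the central weights; returns None if absent."""
    blob = r.get(key)
    return pickle.loads(blob) if blob is not None else None

# Example: merge a client's weights into the central copy held in Redis.
client_weights = [np.random.randn(128, 64), np.random.randn(64)]
central = pull_weights("model:weights")
if central is None:
    merged = client_weights
else:
    merged = [(c + w) / 2.0 for c, w in zip(central, client_weights)]
push_weights("model:weights", merged)
```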
Experimental Results:
-----------------------------
We used a Redis 6.0.8 server with up to 6 simultaneous parameter services, updating TensorFlow weight objects of 21 MB each with individual "set" operations. Over a long distributed deep learning job, we typically perform upwards of 2000 such "set" operations. As we increase the number of parameter services from 1 to 3, the average time of one Redis "set" operation goes up from 0.74 seconds to 0.87 seconds; with 5 parameter services it averages 1 second. In contrast, when we used MySQL to store and update the weights with 3 simultaneous parameter services, each SQL UPDATE transaction took 1.29 seconds, higher even than Redis' performance with 5 simultaneous parameter services.
Additionally, as we increase the number of simultaneous parameter services from 1 to 5, we do not notice any deterioration in training accuracy due to lost updates. On the other hand, the total training time needed to reach the same accuracy decreases significantly with more parameter services and clients. For example, going from 1 parameter service with 3 clients to 5 parameter services with 5 clients reduces the total training time by more than 8 hours.
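For context, here is a rough sketch of how per-"set" latencies like those reported above could be measured: several processes play the role of parameter services and each times its own SET calls on a roughly 21 MB payload against a shared key. The payload construction, iteration count, and key name are illustrative assumptions.

```python
# Rough latency-measurement sketch for concurrent Redis SETs (illustrative).
import time
import pickle
import multiprocessing as mp
import numpy as np
import redis

# ~21 MB of serialized float64 data standing in for a TensorFlow weight object.
PAYLOAD = pickle.dumps(np.random.randn(21 * 1024 * 1024 // 8))

def parameter_service(service_id, n_ops, result_queue):
    r = redis.Redis(host="localhost", port=6379)
    times = []
    for _ in range(n_ops):
        start = time.perf_counter()
        r.set("weights:central", PAYLOAD)   # one "set" per weight update
        times.append(time.perf_counter() - start)
    result_queue.put((service_id, sum(times) / len(times)))

if __name__ == "__main__":
    n_services, n_ops = 3, 50
    queue = mp.Queue()
    procs = [mp.Process(target=parameter_service, args=(i, n_ops, queue))
             for i in range(n_services)]
    for p in procs:
        p.start()
    for _ in range(n_services):
        sid, avg = queue.get()
        print(f"service {sid}: average SET time {avg:.3f} s")
    for p in procs:
        p.join()
```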