Integrating Neural Network Parallel Training using Tensorflow with SLURM

Create Date	20 September, 2018
Last Updated	31 October, 2023


One of the most powerful features of TensorFlow, Google's API for machine learning, is its support for distributed computation, which allows users to distribute the training process automatically across different computing machines. Although using these features is relatively straightforward, deploying them on a typical High Performance Computing infrastructure based on queue management systems presents several issues. This report describes a complete Python package that solves these issues on the Finis Terrae II supercomputer, which uses Slurm as its queue system. To show the performance of distributed TensorFlow on Finis Terrae II, an industrial case based on experiment 707 (CyPLAM) from FORTISSIMO II was trained using the developed Python package.
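To give a feel for the kind of glue such a deployment needs, the following minimal sketch maps the tasks of a Slurm job onto TensorFlow's parameter-server and worker roles. It is not the package described in the report: the function name `build_tf_config`, the host names, the port scheme, and the choice of one parameter server are all assumptions made for illustration. The only conventions taken from the real tools are the layout of TensorFlow's `TF_CONFIG` description (a `cluster` mapping plus a `task` entry) and the fact that Slurm exposes each task's rank in the `SLURM_PROCID` environment variable.

```python
import json


def build_tf_config(hostnames, proc_id, num_ps=1, base_port=2222):
    """Return a TF_CONFIG-style dict for one task of a Slurm job.

    hostnames: one host name per Slurm task, in rank order (hypothetical
               input; how the list is obtained is left to the caller).
    proc_id:   rank of this task, e.g. int(os.environ["SLURM_PROCID"]).
    num_ps:    number of leading ranks that act as parameter servers.
    base_port: first port to use (an arbitrary choice for this sketch).
    """
    if not 0 <= proc_id < len(hostnames):
        raise ValueError("proc_id out of range")
    if not 0 < num_ps < len(hostnames):
        raise ValueError("need at least one ps task and one worker task")

    # Offset the port by rank so that several tasks can share one node.
    addresses = [
        f"{host}:{base_port + rank}" for rank, host in enumerate(hostnames)
    ]

    # The first num_ps ranks become parameter servers, the rest workers.
    cluster = {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}
    if proc_id < num_ps:
        task = {"type": "ps", "index": proc_id}
    else:
        task = {"type": "worker", "index": proc_id - num_ps}
    return {"cluster": cluster, "task": task}


if __name__ == "__main__":
    # Made-up host names and rank; a real job would read them from Slurm.
    example_hosts = ["node01", "node02", "node03"]
    print(json.dumps(build_tf_config(example_hosts, proc_id=1), indent=2))
```

Each task in the job would call the function with its own rank, so that all tasks agree on the cluster layout while each one learns its own role. The `cluster` part of the result has the shape accepted by `tf.train.ClusterSpec`.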
