Create Date 20/9/2018
Last Updated 31/10/2023

One of the most powerful features of TensorFlow, Google's API for machine learning, is its support for distributed computation, which allows users to distribute the training process automatically across different computing machines. Although using these features is relatively straightforward, deploying them on a typical High Performance Computing infrastructure based on queue management systems presents several issues. This report describes a complete Python package that solves these issues on the Finis Terrae II supercomputer, which uses Slurm as its queue system. To show the performance of distributed TensorFlow on Finis Terrae II, an industrial case based on experiment 707 (CyPLAM) from FORTISSIMO II was trained using the developed Python package.
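
One of the deployment issues with a queue system is that the node names are only known once the job starts, while distributed TensorFlow needs a cluster description listing every task's address. The following is a minimal sketch of how such a description could be derived from a Slurm node list. It is not the package described in the report: the function names (`expand_hostlist`, `build_cluster`, `tf_config_for`), the port number, and the choice of assigning the first nodes as parameter servers are all assumptions made for illustration.

```python
import json
import re


def expand_hostlist(nodelist):
    """Expand a simple Slurm-style node list such as 'c[6601-6603,6610]'.

    Handles plain names and one bracket group per name only; on a real
    cluster `scontrol show hostnames` is the more robust option.
    """
    hosts = []
    # Split on commas that are not inside a bracket group.
    for part in re.split(r",(?![^\[]*\])", nodelist):
        match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", part)
        if not match:
            hosts.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            end = end or start
            width = len(start)  # preserve zero padding, e.g. n[08-10]
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


def build_cluster(hosts, num_ps=1, port=2222):
    """Map hosts to jobs: first `num_ps` hosts are 'ps', the rest 'worker'.

    This role layout and the port are assumptions of this sketch.
    """
    if not 0 <= num_ps < len(hosts):
        raise ValueError("need at least one worker host")
    addresses = [f"{host}:{port}" for host in hosts]
    cluster = {"worker": addresses[num_ps:]}
    if num_ps:
        cluster["ps"] = addresses[:num_ps]
    return cluster


def tf_config_for(hosts, node_index, num_ps=1, port=2222):
    """Return a TF_CONFIG-style JSON string for the node at `node_index`."""
    cluster = build_cluster(hosts, num_ps, port)
    if node_index < num_ps:
        task = {"type": "ps", "index": node_index}
    else:
        task = {"type": "worker", "index": node_index - num_ps}
    return json.dumps({"cluster": cluster, "task": task})
```

Inside a Slurm job, `nodelist` would typically come from the `SLURM_JOB_NODELIST` environment variable and `node_index` from `SLURM_NODEID`. The dictionary returned by `build_cluster` has the job-name-to-address-list shape that TensorFlow cluster specifications use.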