Create Date 20 September, 2018
Last Updated 31 October, 2023

One of the most powerful features of TensorFlow, Google's API for machine learning, is its support for distributed computation, which allows users to distribute the training process automatically across multiple machines. Although this feature is relatively straightforward to use, deploying it on a typical High Performance Computing infrastructure based on queue management systems presents several issues. This report describes a complete Python package that solves these issues on the Finis Terrae II supercomputer, which uses Slurm as its queue system. To show the performance of distributed TensorFlow on Finis Terrae II, an industrial case based on experiment 707 (CyPLAM) from FORTISSIMO II was trained using the developed Python package.
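To illustrate the kind of deployment issue involved, the sketch below shows one way a job launched through a queue system could derive a TensorFlow cluster description from the hosts it was allocated. It is not the package described in the report. The helper names, the port 2222, the single parameter server, the example host names, and the assumption of one task per node are all hypothetical choices made for illustration. The only external conventions relied on are the `SLURM_PROCID` environment variable set by Slurm for each task and the JSON layout of TensorFlow's `TF_CONFIG` variable.

```python
import json
import os


def build_cluster_spec(hostnames, num_ps=1, port=2222):
    """Split an ordered host list into parameter servers and workers.

    Assumption: one TensorFlow task per host, all listening on the same port.
    """
    if not 0 < num_ps < len(hostnames):
        raise ValueError("need at least one parameter server and one worker")
    addresses = [f"{host}:{port}" for host in hostnames]
    return {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}


def task_for_rank(rank, num_ps=1):
    """Map a task rank to a (job_name, task_index) pair.

    Assumption: the lowest ranks act as parameter servers, the rest as workers.
    """
    if rank < num_ps:
        return "ps", rank
    return "worker", rank - num_ps


def tf_config(hostnames, rank, num_ps=1, port=2222):
    """Build a TF_CONFIG-style JSON string for the task with the given rank."""
    cluster = build_cluster_spec(hostnames, num_ps, port)
    job, index = task_for_rank(rank, num_ps)
    return json.dumps({"cluster": cluster, "task": {"type": job, "index": index}})


if __name__ == "__main__":
    # Hypothetical host names; on a real system they would come from the
    # node list that the queue system assigns to the job.
    hosts = ["node001", "node002", "node003"]
    # SLURM_PROCID is set by Slurm for each launched task; default to 0 here
    # so the sketch also runs outside a Slurm job.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    print(tf_config(hosts, rank))
```

In a real job, each task would compute its own role this way before starting TensorFlow, so that no host names need to be hard-coded in the training script.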