Integrating Neural Network Parallel Training using Tensorflow with SLURM

File size	1.34 MB
Created	20/9/2018
Last updated	31/10/2023


One of the most powerful features of TensorFlow, Google's API for machine learning, is its support for distributed computation, which allows users to automatically distribute the training process across several computing machines. Although these features are relatively straightforward to use, deploying them on a typical high-performance computing infrastructure based on a queue management system raises several issues. This report describes a complete Python package that solves these issues on the Finis Terrae II supercomputer, which uses Slurm as its queue system. To show the performance of distributed TensorFlow on Finis Terrae II, an industrial case based on experiment 707 (CyPLAM) from FORTISSIMO II was trained using the developed Python package.
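
One deployment issue of this kind is that a queue system only assigns hosts when the job starts, so the TensorFlow cluster description cannot be written in advance. The following is a minimal sketch of how such a description could be derived at run time; it is not taken from the report's package. It assumes one worker per node, a free port 2222 on every node, that `SLURM_NODEID` follows the order printed by `scontrol show hostnames`, and that the cluster is passed through the `TF_CONFIG` environment variable. The function names and the host names in the usage note are hypothetical.

```python
import json
import os
import subprocess


def expand_nodelist(nodelist):
    """Expand a compressed Slurm node list (e.g. "c[01-03]") via scontrol."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def build_tf_config(hostnames, task_index, port=2222):
    """Return a TF_CONFIG-style dict with one worker per host (assumption)."""
    if not 0 <= task_index < len(hostnames):
        raise ValueError("task_index out of range")
    workers = [f"{host}:{port}" for host in hostnames]
    return {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    }


def export_tf_config_from_slurm(port=2222):
    """Read Slurm's job environment and export TF_CONFIG for this process."""
    hosts = expand_nodelist(os.environ["SLURM_JOB_NODELIST"])
    # Assumption: one task per node, so the node ID doubles as the worker index.
    index = int(os.environ["SLURM_NODEID"])
    config = build_tf_config(hosts, index, port)
    os.environ["TF_CONFIG"] = json.dumps(config)
    return config
```

Inside a Slurm job, each process would call `export_tf_config_from_slurm()` before creating its TensorFlow distribution strategy. Outside a job, `build_tf_config(["node-a", "node-b"], 1)` can be called directly to inspect the resulting structure.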
