How to Configure Metrics Collector
This guide describes how the Katib metrics collector works.
Metrics Collector
In the metricsCollectorSpec section of the Experiment YAML configuration file, you can
define how Katib should collect the metrics from each Trial, such as the accuracy and loss metrics.
Your training code can record the metrics into StdOut or into arbitrary output files. Katib
collects the metrics using a sidecar container. A sidecar is a utility container that supports
the main container in the Kubernetes Pod.
To define the metrics collector for your Experiment:
Specify the collector type in the .collector.kind field. Katib’s metrics collector supports the following collector types:

- StdOut: Katib collects the metrics from the operating system’s default output location (standard output). This is the default metrics collector.

- File: Katib collects the metrics from an arbitrary file, which you specify in the .source.fileSystemPath.path field. The training container should log metrics to this file in TEXT or JSON format. If you select JSON format, metrics must be line-separated by epoch or step as follows, and the key for the timestamp must be timestamp:

    {"epoch": 0, "foo": "bar", "fizz": "buzz", "timestamp": 1638422847.28721…}
    {"epoch": 1, "foo": "bar", "fizz": "buzz", "timestamp": 1638422847.287801…}
    {"epoch": 2, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:50.000035161+09:00"…}
    {"epoch": 3, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:50.000037459+09:00"…}
    …

  Check the file metrics collector example for TEXT and JSON format. The default file path is /var/log/katib/metrics.log, and the default file format is TEXT.

- TensorFlowEvent: Katib collects the metrics from a directory path containing a tf.Event. You should specify the path in the .source.fileSystemPath.path field. Check the TFJob example. The default directory path is /var/log/katib/tfevent/.

- Custom: Specify this value if you need to use a custom way to collect metrics. You must define your custom metrics collector container in the .collector.customCollector field. Check the custom metrics collector example.

- None: Specify this value if you don’t need to use Katib’s metrics collector. For example, your training code may handle the persistent storage of its own metrics.
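For the File collector with JSON format, training code could append one JSON object per epoch to the metrics file. The following is a minimal sketch, not Katib's own code: `log_metrics` is a hypothetical helper name, the metric values are written as strings only to mirror the sample lines above, and the timestamp is written as Unix epoch seconds like the first two sample lines. The demo writes to a temporary file; in a Trial you would pass the path configured in .source.fileSystemPath.path instead.

```python
import json
import os
import tempfile
import time


def log_metrics(path, epoch, metrics):
    """Append one JSON line holding the epoch, the metrics, and a timestamp.

    `metrics` maps metric names to numbers. Values are converted to strings
    here (an assumption that mirrors the sample lines in this guide).
    """
    record = {"epoch": epoch}
    record.update({name: str(value) for name, value in metrics.items()})
    # The key for the timestamp must be "timestamp".
    record["timestamp"] = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    # Hypothetical demo path; a real Trial would use the configured file path.
    demo_path = os.path.join(tempfile.mkdtemp(), "metrics.log")
    for epoch, loss in enumerate([0.5, 0.25, 0.125]):
        log_metrics(demo_path, epoch, {"loss": loss})
    with open(demo_path) as f:
        print(f.read(), end="")
```

Opening the file in append mode keeps earlier epochs intact, so each call adds exactly one line.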
Write code in your training container to print the metrics, or save them to the file, in the format specified in the .source.filter.metricsFormat field. The default metrics format value is:

    ([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)

Each element is a regular expression with two sub-expressions. The first matched expression is taken as the metric name. The second matched expression is taken as the metric value.

For example, using the default metrics format and the StdOut metrics collector, if the name of your objective metric is loss and the additional metrics are recall and precision, your training code should print the following output:

    epoch 1: loss=3.0e-02 recall=0.5 precision=.4
    epoch 2: loss=1.3e-02 recall=0.55 precision=.5
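To check that your log lines match the default metrics format before running an Experiment, you can apply the same regular expression locally. The sketch below only illustrates how the two sub-expressions yield a metric name and value; it is not Katib's collector implementation, and `parse_metrics` is a hypothetical helper name.

```python
import re

# Default metrics format quoted in this guide.
DEFAULT_METRICS_FORMAT = r"([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"


def parse_metrics(line, metric_names):
    """Return {name: value} for the requested metrics found in one log line."""
    found = {}
    for match in re.finditer(DEFAULT_METRICS_FORMAT, line):
        # Group 1 is the metric name, group 2 is the metric value.
        name, value = match.group(1), match.group(2)
        if name not in metric_names:
            continue
        try:
            found[name] = float(value)
        except ValueError:
            # The value part of the pattern can match an empty or
            # sign-only string; skip anything that is not a number.
            continue
    return found


if __name__ == "__main__":
    names = {"loss", "recall", "precision"}
    for line in (
        "epoch 1: loss=3.0e-02 recall=0.5 precision=.4",
        "epoch 2: loss=1.3e-02 recall=0.55 precision=.5",
    ):
        print(parse_metrics(line, names))
```

The `epoch 1:` prefix is ignored because it contains no `=`, so only the `name=value` pairs are picked up.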