System monitor

note

The System Monitor add-on is available out of the box starting from Connhex Edge v2.3.0.

Overview

This add-on monitors resource usage on the edge device. It collects both system and custom metrics at configurable intervals. The collected metrics can be sent directly to the cloud or routed to other services running on the device for further processing.

Features

  • Lightweight – Optimized for devices with limited resources.
  • Built-in System Metrics – Collects standard metrics such as CPU, memory, disk, network, host, and process information.
  • Custom Probes – Supports device-specific metrics retrieved from files, HTTP endpoints, or system commands.
  • Configurable Sampling Intervals – Allows independent data collection intervals for each metric type.
  • Efficient Data Format – Uses compact data encoding to minimize network bandwidth usage.
  • Integrated – Works seamlessly with other Connhex Edge agent services for advanced data processing and analysis.

How it works

The add-on supports two types of metrics:

| Metric Type | Description |
| --- | --- |
| Built-in | Predefined metrics such as CPU, Memory (see the complete list below). |
| Custom | User-defined metrics obtained from an external source (file, HTTP endpoint, or command). |

Each metric can be configured with its own sampling interval, providing fine-grained control over the data collection frequency. Collected data is buffered and then published at a separate, independently configured publish interval.

Configuration

All monitoring settings are defined within the monitoring object.
This section controls how and when system metrics are collected and published to Connhex Cloud.

Top-Level Settings

| Field | Type | Description | Constraints |
| --- | --- | --- | --- |
| publish_interval | String (duration) | Defines how often all collected metrics are published to the cloud. Set to 0 to disable the monitoring service. | Minimum value: 1s |

Metric Configuration

Each metric entry under [[monitoring.metrics]] contains both common and type-specific fields.

Common Fields

| Field | Type | Description | Constraints |
| --- | --- | --- | --- |
| type | String | Metric type (e.g., cpu, memory, disk). | Required |
| interval | String (duration) | Sampling interval for collecting the metric. A value of 0 means the metric is collected only once, at service startup (useful for static metrics such as kernel version or hostname). | Minimum value (if not 0): 1s |
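As a sketch of the startup-only behavior described above, a static metric can be declared with a zero interval (the metric names below come from the built-in Host list later in this document):

```toml
# Collected once at service startup: suitable for values that never change.
[[monitoring.metrics]]
type = "host"
collect = ["hostname", "kernel_version"]
interval = "0"
```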

Type-Specific Fields

See the corresponding section for details on type-specific fields.

Example Configuration

[monitoring]

# Interval at which metrics are published.
# If set to `0`, the monitoring service is disabled.
# Value cannot be less than 1s.
publish_interval = "30s"

[[monitoring.metrics]]
# Metric definition.
# An empty list disables metric collection.
type = "cpu"
interval = "10s"
# Additional metric-specific fields. See examples below for details.

Built-in Metrics

A set of predefined metrics is provided out of the box to cover the most common use cases. These metrics are organized into the following categories: CPU, Memory, Disk, Network, Host, Process, Sensors.

For each category, specific metrics can be configured for collection with the following field:

| Field | Type | Description | Constraints |
| --- | --- | --- | --- |
| collect | Array of strings | List of metrics to collect for the selected metric type (built-in only). | Required for built-in metrics |

Example

[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent", "load1", "load5"]
interval = "5s"

The same metric type can be used multiple times to collect data at different sampling intervals.

[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "30s"

[[monitoring.metrics]]
type = "cpu"
collect = ["load1", "load5", "load15"]
interval = "30m"

CPU

| Metric | Description |
| --- | --- |
| user | Time spent in user mode. |
| system | Time spent in system mode. |
| idle | Idle CPU time. |
| nice | Time spent running niced processes. |
| iowait | Time waiting for I/O completion. |
| irq | Time servicing interrupts. |
| softirq | Time servicing soft interrupts. |
| steal | Time stolen by other virtual machines. |
| guest | Time running guest OS. |
| guest_nice | Time running guest niced OS. |
| usage_percent | Total CPU usage as a percentage. |
| count_logical | Logical CPU count. |
| count_physical | Physical CPU count. |
| load1, load5, load15 | System load averages over 1, 5, and 15 minutes. |

Memory

| Metric | Description |
| --- | --- |
| total, available, used, used_percent, free | Basic memory usage. |
| active, inactive, buffers, cached, shared, slab, pagetables | Detailed memory states. |
| swap_total, swap_used, swap_free | Swap memory details. |
| hugepages_total, hugepages_free, hugepage_size | HugePages memory statistics. |

Disk

| Metric | Description |
| --- | --- |
| read_count, write_count, read_bytes, write_bytes, read_time, write_time, io_time, iops_in_progress | Disk I/O statistics. |
| used, free, used_percent | Disk space usage. |

Disk Type-Specific fields

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| device | String | Name or full path of the device (for I/O statistics). | sda |
| path | String | Filesystem path to analyze for disk space usage. | / |

Examples

[[monitoring.metrics]]
type = "disk"
collect = ["read_count", "write_count"]
device = "sda" # /dev/sda is also accepted
interval = "30s"

[[monitoring.metrics]]
type = "disk"
collect = ["used_percent"]
path = "/home/user"
interval = "6h"

Network

| Metric | Description |
| --- | --- |
| bytes_sent, bytes_recv | Total bytes sent and received. |
| packets_sent, packets_recv | Total packets sent and received. |
| err_in, err_out, drop_in, drop_out | Error and packet drop statistics. |

Network Type-Specific fields

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| nic | String | Name of the network interface. | first available |

Examples

[[monitoring.metrics]]
type = "network"
collect = ["bytes_sent", "bytes_recv"]
nic = "eth0"
interval = "30s"

Host

| Metric | Description |
| --- | --- |
| uptime, boot_time | System uptime and boot time. |
| procs | Number of processes. |
| hostname, os, platform, platform_family, platform_version | Host and OS details. |
| kernel_version, kernel_arch | Kernel version and architecture. |
| virtualization_system, virtualization_role | Virtualization details, if applicable. |

Process

| Metric | Description |
| --- | --- |
| cpu_percent, memory_percent | CPU and memory usage percentages. |
| memory_info_rss, memory_info_vms | Resident and virtual memory usage. |
| num_threads, num_fds | Process resource statistics. |
| io_read_count, io_write_count, io_read_bytes, io_write_bytes | I/O operation counts and bytes. |
| create_time, status, cmdline, ppid, exe, cwd | Metadata and command-line details. |

Type-Specific settings

| Field | Type | Description |
| --- | --- | --- |
| process_names | Array of strings | List of process names to collect metrics from. |

Example

[[monitoring.metrics]]
type = "process"
collect = ["cpu_percent", "memory_percent"]
interval = "15s"
process_names = ["httpd", "sshd"]

Sensors

| Metric | Description |
| --- | --- |
| temperature, temperature_high, temperature_critical | Temperature information collected from hardware sensors. |
info

The sensor metrics are obtained from standard Linux interfaces such as hwmon or legacy thermal_zone files under /sys/class/thermal/. Your device must expose these interfaces for temperature data to be collected successfully.

note

If sensor information is not available through hwmon or thermal_zone, the metrics will not be collected. In such cases, you can implement a custom metric to provide sensor data from alternative sources.

Type-Specific settings

| Field | Type | Description |
| --- | --- | --- |
| sensor_names | Array of strings | List of sensor names to collect metrics from (as reported by the system). |

Example

[[monitoring.metrics]]
type = "sensor"
collect = ["temperature"]
interval = "1m"
sensor_names = ["cpu_thermal", "soc_thermal"]
tip

To identify available sensor names on your device, check the contents of /sys/class/hwmon/ or /sys/class/thermal/. For example, you can list hardware monitor names with:

cat /sys/class/hwmon/hwmon*/name
cat /sys/class/thermal/thermal_zone*/type

Use the reported names (e.g. cpu_thermal, soc_thermal, temp1) in the sensor_names field of your configuration.

Custom Metrics

When the built-in metrics do not provide all the required information, custom metrics can be defined to access device-specific data.

Custom metrics support retrieving arbitrary values from files, HTTP endpoints, or commands. This functionality is useful for integrating data from external systems, sensors, or custom scripts.

A custom metric is defined by setting its type to custom and specifying both a name and a probe configuration.

An optional unit field can also be specified. Any string value is accepted, but it is recommended to use a supported unit from the SenML specification, as these are recognized and properly handled by the UI.

Custom Probe Fields

The Custom Probe Fields define how a custom metric retrieves and processes data from its source. These settings control the probe’s behavior, including the type of data source, how the data is accessed, and how the result is parsed.

| Field | Description |
| --- | --- |
| source_type | One of file, http, or command. |
| source | The path, URL, or command to execute. |
| parser | Optional value parser for extracting data (regex or JSON). |
| timeout | Timeout for HTTP or command probes. |

Source Types

Source Types indicate the different input methods that a custom probe can use to collect metric data. Depending on the monitoring requirements, a probe can read values from a local file, query an HTTP(S) endpoint, or execute a system command.

| Source Type | Description | Example source |
| --- | --- | --- |
| file | Reads the metric value from a local file. | /tmp/temperature.txt |
| http | Fetches the metric value via an HTTP(S) request. | http://localhost:8080/status.json |
| command | Executes a shell command and reads its output. | cat /sys/class/thermal/thermal_zone0/temp |
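Command probes often need to normalize a raw reading before emitting it. The sketch below is illustration only (the helper name is hypothetical, not part of the add-on): it converts the millidegree values that Linux thermal_zone files report into degrees Celsius.

```shell
# Hypothetical helper (not part of the add-on): thermal_zone files report
# temperature in millidegrees Celsius, so a command probe may want to
# convert the raw reading before emitting it.
millideg_to_deg() {
  awk -v v="$1" 'BEGIN { printf "%.1f\n", v / 1000 }'
}

# A probe `source` could then be, for example:
#   millideg_to_deg "$(cat /sys/class/thermal/thermal_zone0/temp)"
millideg_to_deg 42500   # prints 42.5
```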

Parser

The parser extracts the relevant value from the source output using either a regular expression or a JSON path, depending on the data format. If no parser is configured, the raw output is used as the value.

| Parser Type | Description | Example |
| --- | --- | --- |
| regex | Uses a regular expression with one capture group. | expression = "temp: ([0-9.]+)" |
| json | Extracts the value via a JSON path. | expression = "$.temperature" |
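For intuition, the following snippet emulates what a regex parser with a single capture group does to a probe's raw output. The add-on applies the expression internally; this is illustration only, here approximated with sed.

```shell
# Emulate a `regex` parser with one capture group applied to raw probe output.
raw='temp: 48.3'
value=$(printf '%s\n' "$raw" | sed -n 's/.*temp: \([0-9.]*\).*/\1/p')
echo "$value"   # prints 48.3
```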

Examples

Custom Metric from File

[[monitoring.metrics]]
type = "custom"
name = "ambient_temperature"
unit = "degC"
interval = "15s"

[monitoring.metrics.probe]
source_type = "file"
source = "/var/data/temperature"
parser = { type = "regex", expression = "([0-9.]+)" }

Custom Metric from HTTP Endpoint

[[monitoring.metrics]]
type = "custom"
name = "service_latency"
unit = "ms"
interval = "10s"

[monitoring.metrics.probe]
source_type = "http"
source = "http://localhost:9000/metrics.json"
timeout = "3s"
parser = { type = "json", expression = "$.latency_ms" }

Custom Metric from Command Output

[[monitoring.metrics]]
type = "custom"
name = "gpu_usage"
unit = "%"
interval = "5s"

[monitoring.metrics.probe]
source_type = "command"
source = "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"
timeout = "2s"
parser = { type = "regex", expression = "([0-9]+)" }

Advanced Probe Configuration

While the built-in parsers are suitable for extracting instant values from command outputs, it is possible to leverage custom scripts for cases where more sophisticated data processing and analysis are required.

Beyond Simple Metrics

A custom probe can be configured to execute any command or script, giving you complete control over how data is collected, processed, and transformed into meaningful metrics. This provides maximum flexibility for advanced use cases, such as:

  • Calculate rolling averages, minimums, or maximums over time
  • Perform statistical analysis on historical data
  • Aggregate and combine information from multiple sources
  • Apply custom filtering, normalization, or smoothing algorithms
  • Generate derived metrics based on multiple readings
tip

If you prefer not to implement this data-processing logic yourself, a Rule Engine add-on is available. It seamlessly integrates with other Connhex Edge services, offering an easy way to define powerful, event-driven rules and transformations without custom scripting.

Storing State Between Executions

Probes are inherently stateless. However, the scripts they execute can be designed to maintain state by storing data, for example, in temporary files or other persistent storage. This approach enables time-based calculations that rely on historical context.

Example: Rolling Average Temperature

Calculate a 5-minute rolling average of temperature readings:

[[monitoring.metrics]]
type = "custom"
name = "temperature_5min_avg"
unit = "degC"
interval = "30s"
[monitoring.metrics.probe]
source_type = "command"
source = "/opt/iot-metrics/scripts/temperature_average.sh"
The script referenced by `source`:

#!/bin/bash
# temperature_average.sh - Calculate 5-minute rolling average

STATE_FILE="/tmp/temperature_readings.txt"
HISTORY_SECONDS=300 # 5 minutes
CURRENT_TIME=$(date +%s)

# Read the current temperature from the sensor
CURRENT_TEMP=$(cat /sys/class/thermal/thermal_zone0/temp | grep -oP '([0-9.]+)')

# Append current reading with timestamp
echo "${CURRENT_TIME} ${CURRENT_TEMP}" >> "${STATE_FILE}"

# Remove readings older than 5 minutes
if [ -f "${STATE_FILE}" ]; then
  CUTOFF_TIME=$((CURRENT_TIME - HISTORY_SECONDS))
  awk -v cutoff="$CUTOFF_TIME" '$1 >= cutoff' "${STATE_FILE}" > "${STATE_FILE}.tmp"
  mv "${STATE_FILE}.tmp" "${STATE_FILE}"
fi

# Calculate average of all readings in the window
if [ -f "${STATE_FILE}" ] && [ -s "${STATE_FILE}" ]; then
  AVERAGE=$(awk '{ sum += $2; count++ } END { if (count > 0) print sum/count; else print 0 }' "${STATE_FILE}")
  echo "${AVERAGE}"
else
  # If no history exists yet, return current reading
  echo "${CURRENT_TEMP}"
fi

Best Practices for Custom Scripts

Output Format

Your script should output to stdout only the numeric or string value representing the measurement result. Avoid printing debug information or status messages (use stderr if needed). Ensure the output matches the expected data type for your metric.
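A minimal sketch of a well-behaved probe script following these rules (the value is a placeholder for real measurement logic):

```shell
#!/bin/sh
# Sketch of a well-behaved probe script: diagnostics go to stderr,
# and the single stdout line carries only the measured value.
echo "probe: reading sensor" >&2   # debug output; not parsed by the agent

VALUE=42                           # placeholder for real measurement logic
printf '%s\n' "$VALUE"             # the only thing written to stdout
```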

File Locations

Use /tmp for temporary state files that don't need to persist across reboots. Use application-specific directories (e.g., /var/lib/iot-metrics) for important persistent data or, if your device and application allow it, a database such as SQLite. Ensure your scripts have appropriate file permissions and are accessible to the agent.

Performance Considerations

When enabling the monitoring add-on, special attention must be given to network data usage and system resource constraints.

Limit the Number of Collected Metrics

Each collected metric generates data that must be transmitted periodically.
Collecting too many metrics, or unnecessary submetrics, increases both:

  • Bandwidth consumption (larger data payloads per publish cycle)
  • CPU and memory load on the device (from collecting and formatting data)

Make sure to:

  • Only enable metrics that are relevant to your monitoring or diagnostic needs.
  • Avoid collecting all subfields of a metric unless strictly necessary.

Adjust Sampling and Publish Intervals Appropriately

The interval fields determine how frequently each metric is sampled and how often the agent sends data upstream. If not properly configured, these settings can generate excessive load on devices or cause high network bandwidth consumption.

When configuring the monitor:

  • Set the publish_interval to the coarsest acceptable reporting frequency. For low-bandwidth connections, consider publishing data every few minutes.
  • Adjust the interval for each metric to reflect how quickly the underlying value changes.
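Putting these guidelines together, a bandwidth-conscious configuration might look like the following sketch (the intervals are illustrative and should be tuned per deployment):

```toml
[monitoring]
# Coarse reporting frequency for low-bandwidth links.
publish_interval = "5m"

# Only the subfield actually needed, sampled at a moderate rate.
[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "1m"

# Static information, collected once at startup.
[[monitoring.metrics]]
type = "host"
collect = ["hostname", "kernel_version"]
interval = "0"
```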