# System monitor

The System Monitor add-on is available out of the box starting from Connhex Edge v2.3.0.
## Overview
This add-on monitors resource usage on the edge device. It collects both system and custom metrics at configurable intervals. The collected metrics can be sent directly to the cloud or routed to other services running on the device for further processing.
## Features
- Lightweight – Optimized for devices with limited resources.
- Built-in System Metrics – Collects standard metrics such as CPU, memory, disk, network, host, and process information.
- Custom Probes – Supports device-specific metrics retrieved from files, HTTP endpoints, or system commands.
- Configurable Sampling Intervals – Allows independent data collection intervals for each metric type.
- Efficient Data Format – Uses compact data encoding to minimize network bandwidth usage.
- Integrated – Works seamlessly with other Connhex Edge agent services for advanced data processing and analysis.
## How it works

The add-on supports two types of metrics:
| Metric Type | Description |
|---|---|
| Built-in | Predefined metrics such as CPU, Memory (see the complete list below). |
| Custom | User-defined metrics obtained from an external source (file, HTTP endpoint, or command). |
Each metric can be configured with its own sampling interval, providing fine-grained control over the data collection frequency. All collected data can be temporarily stored and published at a different publish interval.
## Configuration

All monitoring settings are defined within the `monitoring` object. This section controls how and when system metrics are collected and published to Connhex Cloud.
### Top-Level Settings

| Field | Type | Description | Constraints |
|---|---|---|---|
| `publish_interval` | String (duration) | Defines how often all collected metrics are published to the cloud. Set to `0` to disable the monitoring service. | Minimum value: `1s` |
### Metric Configuration

Each metric entry under `[[monitoring.metrics]]` contains both common and type-specific fields.
#### Common Fields

| Field | Type | Description | Constraints |
|---|---|---|---|
| `type` | String | Metric type (e.g., `cpu`, `memory`, `disk`). | Required |
| `interval` | String (duration) | Sampling interval for collecting the metric. A value of `0` means the metric is collected only once, at service startup (useful for static metrics such as kernel version or hostname). | Minimum value (if not `0`): `1s` |
#### Type-Specific Fields

See the corresponding section for details on type-specific fields.
### Example Configuration

```toml
[monitoring]
# Interval at which metrics are published.
# If set to `0`, the monitoring service is disabled.
# Value cannot be less than 1s.
publish_interval = "30s"

[[monitoring.metrics]]
# Metric definition.
# An empty list disables metric collection.
type = "cpu"
interval = "10s"
# Additional metric-specific fields. See examples below for details.
```
## Built-in Metrics
A set of predefined metrics is provided out of the box to cover the most common use cases. These metrics are organized into the following categories: CPU, Memory, Disk, Network, Host, Process, Sensors.
For each category, specific metrics can be configured for collection with the following field:
| Field | Type | Description | Constraints |
|---|---|---|---|
| `collect` | Array of strings | List of metrics to collect for the selected metric type (built-in only). | Required for built-in metrics |
### Example

```toml
[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent", "load1", "load5"]
interval = "5s"
```
The same metric type can be used multiple times to collect data at different sampling intervals.
```toml
[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "30s"

[[monitoring.metrics]]
type = "cpu"
collect = ["load1", "load5", "load15"]
interval = "30m"
```
### CPU

| Metric | Description |
|---|---|
| `user` | Time spent in user mode. |
| `system` | Time spent in system mode. |
| `idle` | Idle CPU time. |
| `nice` | Time spent running niced processes. |
| `iowait` | Time waiting for I/O completion. |
| `irq` | Time servicing interrupts. |
| `softirq` | Time servicing soft interrupts. |
| `steal` | Time stolen by other virtual machines. |
| `guest` | Time running guest OS. |
| `guest_nice` | Time running guest niced OS. |
| `usage_percent` | Total CPU usage as a percentage. |
| `count_logical` | Logical CPU count. |
| `count_physical` | Physical CPU count. |
| `load1`, `load5`, `load15` | System load averages over 1, 5, and 15 minutes. |
### Memory

| Metric | Description |
|---|---|
| `total`, `available`, `used`, `used_percent`, `free` | Basic memory usage. |
| `active`, `inactive`, `buffers`, `cached`, `shared`, `slab`, `pagetables` | Detailed memory states. |
| `swap_total`, `swap_used`, `swap_free` | Swap memory details. |
| `hugepages_total`, `hugepages_free`, `hugepage_size` | HugePages memory statistics. |
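As with the CPU examples above, the metrics to collect are selected from the table above via the `collect` field. A configuration sketch (the chosen metrics and interval are illustrative):

```toml
[[monitoring.metrics]]
type = "memory"
collect = ["used_percent", "swap_used"]
interval = "1m"
```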
### Disk

| Metric | Description |
|---|---|
| `read_count`, `write_count`, `read_bytes`, `write_bytes`, `read_time`, `write_time`, `io_time`, `iops_in_progress` | Disk I/O statistics. |
| `used`, `free`, `used_percent` | Disk space usage. |
#### Type-Specific Fields

| Field | Type | Description | Default |
|---|---|---|---|
| `device` | String | Name or full path of the device (for I/O statistics). | `sda` |
| `path` | String | Filesystem path to analyze for disk space usage. | `/` |
#### Examples

```toml
[[monitoring.metrics]]
type = "disk"
collect = ["read_count", "write_count"]
device = "sda" # /dev/sda is also accepted
interval = "30s"

[[monitoring.metrics]]
type = "disk"
collect = ["used_percent"]
path = "/home/user"
interval = "6h"
```
### Network

| Metric | Description |
|---|---|
| `bytes_sent`, `bytes_recv` | Total bytes sent and received. |
| `packets_sent`, `packets_recv` | Total packets sent and received. |
| `err_in`, `err_out`, `drop_in`, `drop_out` | Error and packet drop statistics. |
#### Type-Specific Fields

| Field | Type | Description | Default |
|---|---|---|---|
| `nic` | String | Name of the network interface. | first available |
#### Examples

```toml
[[monitoring.metrics]]
type = "network"
collect = ["bytes_sent", "bytes_recv"]
nic = "eth0"
interval = "30s"
```
### Host

| Metric | Description |
|---|---|
| `uptime`, `boot_time` | System uptime and boot time. |
| `procs` | Number of processes. |
| `hostname`, `os`, `platform`, `platform_family`, `platform_version` | Host and OS details. |
| `kernel_version`, `kernel_arch` | Kernel version and architecture. |
| `virtualization_system`, `virtualization_role` | Virtualization details, if applicable. |
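Host information is largely static, which makes it a natural fit for the `interval = "0"` behavior described under Common Fields: collect once at service startup. A configuration sketch (the metric selection is illustrative, and it assumes the type string matches the category name, `host`):

```toml
[[monitoring.metrics]]
type = "host"
collect = ["hostname", "os", "kernel_version", "kernel_arch"]
interval = "0" # static values: collected once at startup
```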
### Process

| Metric | Description |
|---|---|
| `cpu_percent`, `memory_percent` | CPU and memory usage percentages. |
| `memory_info_rss`, `memory_info_vms` | Resident and virtual memory usage. |
| `num_threads`, `num_fds` | Process resource statistics. |
| `io_read_count`, `io_write_count`, `io_read_bytes`, `io_write_bytes` | I/O operation counts and bytes. |
| `create_time`, `status`, `cmdline`, `ppid`, `exe`, `cwd` | Metadata and command-line details. |
#### Type-Specific Fields

| Field | Type | Description |
|---|---|---|
| `process_names` | Array of strings | List of process names to collect metrics from. |
#### Example

```toml
[[monitoring.metrics]]
type = "process"
collect = ["cpu_percent", "memory_percent"]
interval = "15s"
process_names = ["httpd", "sshd"]
```
### Sensors

| Metric | Description |
|---|---|
| `temperature`, `temperature_high`, `temperature_critical` | Temperature information collected from hardware sensors. |
Sensor metrics are obtained from standard Linux interfaces such as `hwmon` or legacy `thermal_zone` files under `/sys/class/thermal/`; your device must expose these interfaces for temperature data to be collected successfully. If sensor information is not available through `hwmon` or `thermal_zone`, the metrics will not be collected. In such cases, you can implement a custom metric to provide sensor data from alternative sources.
#### Type-Specific Fields

| Field | Type | Description |
|---|---|---|
| `sensor_names` | Array of strings | List of sensor names to collect metrics from (as reported by the system). |
#### Example

```toml
[[monitoring.metrics]]
type = "sensor"
collect = ["temperature"]
interval = "1m"
sensor_names = ["cpu_thermal", "soc_thermal"]
```
To identify the sensor names available on your device, check the contents of `/sys/class/hwmon/` or `/sys/class/thermal/`. For example, you can list hardware monitor names with:

```shell
cat /sys/class/hwmon/hwmon*/name
cat /sys/class/thermal/thermal_zone*/type
```

Use the reported names (e.g. `cpu_thermal`, `soc_thermal`, `temp1`) in the `sensor_names` field of your configuration.
## Custom Metrics

When the built-in metrics do not provide all the required information, custom metrics can be defined to access device-specific data. Custom metrics support retrieving arbitrary values from files, HTTP endpoints, or commands. This functionality is useful for integrating data from external systems, sensors, or custom scripts.

A custom metric is defined by setting its `type` to `custom` and specifying both a `name` and a probe configuration. An optional `unit` field can also be specified: any string value is accepted, but it is recommended to use a supported unit from the SenML specification, as these are recognized and properly handled by the UI.
### Custom Probe Fields

The custom probe fields define how a custom metric retrieves and processes data from its source. These settings control the probe's behavior, including the type of data source, how the data is accessed, and how the result is parsed.

| Field | Description |
|---|---|
| `source_type` | One of `file`, `http`, or `command`. |
| `source` | The path, URL, or command to execute. |
| `parser` | Optional value parser for extracting data (regex or JSON). |
| `timeout` | Timeout for HTTP or command probes. |
### Source Types

Source types indicate the different input methods a custom probe can use to collect metric data. Depending on the monitoring requirements, a probe can read values from a local file, query an HTTP(S) endpoint, or execute a system command.

| Source Type | Description | Example `source` |
|---|---|---|
| `file` | Reads the metric value from a local file. | `/tmp/temperature.txt` |
| `http` | Fetches the metric value via an HTTP(S) request. | `http://localhost:8080/status.json` |
| `command` | Executes a shell command and reads its output. | `cat /sys/class/thermal/thermal_zone0/temp` |
### Parser

The parser extracts the relevant value from the source output using either a regular expression or a JSON path, depending on the data format. If no parser is configured, the raw output is used as the value.

| Parser Type | Description | Example |
|---|---|---|
| `regex` | Uses a regular expression with one capture group. | `expression = "temp: ([0-9.]+)"` |
| `json` | Extracts the value via a JSON path. | `expression = "$.temperature"` |
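Before committing a `regex` expression to the configuration, it can help to preview what the single capture group would extract from a sample of the source output. A quick sketch with standard shell tools (the sample string mirrors the table's example; `sed` here merely emulates the parser's capture behavior):

```shell
# Sample of what a probe source might emit
SAMPLE="temp: 42.5"

# Emulate a one-capture-group regex parser: print only the captured value
CAPTURED=$(printf '%s\n' "$SAMPLE" | sed -n 's/.*temp: \([0-9.]*\).*/\1/p')

echo "$CAPTURED"
```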
### Examples

#### Custom Metric from File

```toml
[[monitoring.metrics]]
type = "custom"
name = "ambient_temperature"
unit = "degC"
interval = "15s"

[monitoring.metrics.probe]
source_type = "file"
source = "/var/data/temperature"
parser = { type = "regex", expression = "([0-9.]+)" }
```
#### Custom Metric from HTTP Endpoint

```toml
[[monitoring.metrics]]
type = "custom"
name = "service_latency"
unit = "ms"
interval = "10s"

[monitoring.metrics.probe]
source_type = "http"
source = "http://localhost:9000/metrics.json"
timeout = "3s"
parser = { type = "json", expression = "$.latency_ms" }
```
#### Custom Metric from Command Output

```toml
[[monitoring.metrics]]
type = "custom"
name = "gpu_usage"
unit = "%"
interval = "5s"

[monitoring.metrics.probe]
source_type = "command"
source = "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"
timeout = "2s"
parser = { type = "regex", expression = "([0-9]+)" }
```
## Advanced Probe Configuration

While the built-in parsers are suitable for extracting instant values from command outputs, it is possible to leverage custom scripts for cases where more sophisticated data processing and analysis are required.

### Beyond Simple Metrics
A custom probe can be configured to execute any command or script, giving you complete control over how data is collected, processed, and transformed into meaningful metrics. This provides maximum flexibility for advanced use cases, such as:
- Calculating rolling averages, minimums, or maximums over time
- Performing statistical analysis on historical data
- Aggregating and combining information from multiple sources
- Applying custom filtering, normalization, or smoothing algorithms
- Generating derived metrics based on multiple readings
If you prefer not to implement this data-processing logic yourself, a Rule Engine add-on is available. It seamlessly integrates with other Connhex Edge services, offering an easy way to define powerful, event-driven rules and transformations without custom scripting.
### Storing State Between Executions

Probes are inherently stateless. However, the scripts they execute can be designed to maintain state by storing data, for example, in temporary files or other persistent storage. This approach enables time-based calculations that rely on historical context.
#### Example: Rolling Average Temperature

Calculate a 5-minute rolling average of temperature readings:

```toml
[[monitoring.metrics]]
type = "custom"
name = "temperature_5min_avg"
unit = "degC"
interval = "30s"

[monitoring.metrics.probe]
source_type = "command"
source = "/opt/iot-metrics/scripts/temperature_average.sh"
```
```bash
#!/bin/bash
# temperature_average.sh - Calculate 5-minute rolling average

STATE_FILE="/tmp/temperature_readings.txt"
HISTORY_SECONDS=300 # 5 minutes
CURRENT_TIME=$(date +%s)

# Read the current temperature from the sensor
CURRENT_TEMP=$(grep -oP '([0-9.]+)' /sys/class/thermal/thermal_zone0/temp)

# Append current reading with timestamp
echo "${CURRENT_TIME} ${CURRENT_TEMP}" >> "${STATE_FILE}"

# Remove readings older than 5 minutes
CUTOFF_TIME=$((CURRENT_TIME - HISTORY_SECONDS))
awk -v cutoff="$CUTOFF_TIME" '$1 >= cutoff' "${STATE_FILE}" > "${STATE_FILE}.tmp"
mv "${STATE_FILE}.tmp" "${STATE_FILE}"

# Calculate average of all readings in the window
if [ -s "${STATE_FILE}" ]; then
  AVERAGE=$(awk '{ sum += $2; count++ } END { if (count > 0) print sum/count; else print 0 }' "${STATE_FILE}")
  echo "${AVERAGE}"
else
  # If no history exists yet, return the current reading
  echo "${CURRENT_TEMP}"
fi
```
### Best Practices for Custom Scripts

#### Output Format

Your script should write to stdout only the numeric or string value representing the measurement result. Avoid printing debug information or status messages (use stderr if needed), and ensure the output matches the expected data type for your metric.
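A minimal probe sketch following this convention, with diagnostics on stderr and only the value on stdout (the averaging logic and the file contents are illustrative):

```shell
#!/bin/sh
# Hypothetical probe: report the average of the readings in a state file.
STATE_FILE=$(mktemp)
printf '10\n20\n30\n' > "$STATE_FILE"   # stand-in for previously stored readings

# Debug output goes to stderr so it never pollutes the metric value
echo "averaging $(wc -l < "$STATE_FILE") readings" >&2

# stdout carries only the numeric result
VALUE=$(awk '{ sum += $1; n++ } END { print sum / n }' "$STATE_FILE")
echo "$VALUE"

rm -f "$STATE_FILE"
```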
File Locations
Use /tmp for temporary state files that don't need persistence across reboots.
Use application-specific directories (e.g., /var/lib/iot-metrics) for important persistent data, or if your device and application allows it, a DB like sqlite.
Ensure your scripts have appropriate file permissions and are accessible to the agent.
## Performance Considerations

When enabling the monitoring add-on, pay special attention to network data usage and system resource constraints.
### Limit the Number of Collected Metrics

Each collected metric generates data that must be transmitted periodically. Collecting too many metrics, or unnecessary submetrics, increases both:
- Bandwidth consumption (larger data payloads per publish cycle)
- CPU and memory load on the device (from collecting and formatting data)

Make sure to:
- Only enable metrics that are relevant to your monitoring or diagnostic needs.
- Avoid collecting all subfields of a metric unless strictly necessary.
### Adjust Sampling and Publish Intervals Appropriately

The interval fields determine how frequently each metric is sampled and how often the agent sends data upstream. If not properly configured, these settings can generate excessive load on devices or cause high network bandwidth consumption.

When configuring the monitor:
- Set the `publish_interval` to the coarsest acceptable reporting frequency. For low-bandwidth connections, consider publishing data every few minutes.
- Adjust the `interval` for each metric to reflect how quickly the underlying value changes.
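For instance, on a metered connection a coarse publish interval can be combined with faster sampling, so readings are buffered locally and sent in fewer, larger payloads (the values below are illustrative):

```toml
[monitoring]
publish_interval = "5m" # batch readings and publish every five minutes

[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "30s" # sample frequently between publishes
```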