# System monitor

The System Monitor add-on is available out of the box starting from Connhex Edge v2.3.0.
## Overview
This add-on monitors resource usage on the edge device. It collects both system and custom metrics at configurable intervals. The collected metrics can be sent directly to the cloud or routed to other services running on the device for further processing.
## Features
- Lightweight – Optimized for devices with limited resources.
- Built-in System Metrics – Collects standard metrics such as CPU, memory, disk, network, host, and process information.
- Custom Probes – Supports device-specific metrics retrieved from files, HTTP endpoints, or system commands.
- Configurable Sampling Intervals – Allows independent data collection intervals for each metric type.
- Efficient Data Format – Uses compact data encoding to minimize network bandwidth usage.
- Integrated – Works seamlessly with other Connhex Edge agent services for advanced data processing and analysis.
## How it works

The add-on supports two types of metrics:
| Metric Type | Description |
|---|---|
| Built-in | Predefined metrics such as CPU, Memory (see the complete list below). |
| Custom | User-defined metrics obtained from an external source (file, HTTP endpoint, or command). |
Each metric can be configured with its own sampling interval, providing fine-grained control over the data collection frequency. All collected data can be temporarily stored and published at a different publish interval.
## Configuration

All monitoring settings are defined within the `monitoring` object. This section controls how and when system metrics are collected and published to Connhex Cloud.
### Top-Level Settings

| Field | Type | Description | Constraints |
|---|---|---|---|
| `publish_interval` | String (duration) | Defines how often all collected metrics are published to the cloud. Set to `0` to disable the monitoring service. | Minimum value: `1s` |
### Metric Configuration

Each metric entry under `[[monitoring.metrics]]` contains both common and type-specific fields.
#### Common Fields

| Field | Type | Description | Constraints |
|---|---|---|---|
| `type` | String | Metric type (e.g., `cpu`, `memory`, `disk`). | Required |
| `interval` | String (duration) | Sampling interval for collecting the metric. A value of `0` means the metric is collected only once, at service startup (useful for static metrics such as kernel version or hostname). | Minimum value (if not `0`): `1s` |
#### Type-Specific Fields

See the corresponding section for details on type-specific fields.
### Example Configuration

```toml
[monitoring]
# Interval at which metrics are published.
# If set to `0`, the monitoring service is disabled.
# Value cannot be less than 1s.
publish_interval = "30s"

[[monitoring.metrics]]
# Metric definition.
# An empty list disables metric collection.
type = "cpu"
interval = "10s"
# Additional metric-specific fields. See examples below for details.
```
## Built-in Metrics
A set of predefined metrics is provided out of the box to cover the most common use cases. These metrics are organized into the following categories: CPU, Memory, Disk, Network, Host, Process, Sensors.
For each category, specific metrics can be configured for collection with the following field:
| Field | Type | Description | Constraints |
|---|---|---|---|
| `collect` | Array of strings | List of metrics to collect for the selected metric type (built-in only). | Required for built-in metrics |
### Example

```toml
[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent", "load1", "load5"]
interval = "5s"
```
The same metric type can be used multiple times to collect data at different sampling intervals.
```toml
[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "30s"

[[monitoring.metrics]]
type = "cpu"
collect = ["load1", "load5", "load15"]
interval = "30m"
```
### CPU

| Metric | Description |
|---|---|
| `user` | Time spent in user mode. |
| `system` | Time spent in system mode. |
| `idle` | Idle CPU time. |
| `nice` | Time spent running niced processes. |
| `iowait` | Time waiting for I/O completion. |
| `irq` | Time servicing interrupts. |
| `softirq` | Time servicing soft interrupts. |
| `steal` | Time stolen by other virtual machines. |
| `guest` | Time running guest OS. |
| `guest_nice` | Time running guest niced OS. |
| `usage_percent` | Total CPU usage as a percentage. |
| `count_logical` | Logical CPU count. |
| `count_physical` | Physical CPU count. |
| `load1`, `load5`, `load15` | System load averages over 1, 5, and 15 minutes. |
### Memory

| Metric | Description |
|---|---|
| `total`, `available`, `used`, `used_percent`, `free` | Basic memory usage. |
| `active`, `inactive`, `buffers`, `cached`, `shared`, `slab`, `pagetables` | Detailed memory states. |
| `swap_total`, `swap_used`, `swap_free` | Swap memory details. |
| `hugepages_total`, `hugepages_free`, `hugepage_size` | HugePages memory statistics. |
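As with the CPU examples above, the metrics to collect are selected from the table above via the `collect` field. A configuration sketch (the chosen metrics and interval are illustrative):

```toml
[[monitoring.metrics]]
type = "memory"
collect = ["used_percent", "swap_used"]
interval = "1m"
```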
### Disk

| Metric | Description |
|---|---|
| `read_count`, `write_count`, `read_bytes`, `write_bytes`, `read_time`, `write_time`, `io_time`, `iops_in_progress` | Disk I/O statistics. |
| `used`, `free`, `used_percent` | Disk space usage. |
#### Type-Specific Fields

| Field | Type | Description | Default |
|---|---|---|---|
| `device` | String | Name or full path of the device (for I/O statistics). | `sda` |
| `path` | String | Filesystem path to analyze for disk space usage. | `/` |
#### Examples

```toml
[[monitoring.metrics]]
type = "disk"
collect = ["read_count", "write_count"]
device = "sda" # /dev/sda is also accepted
interval = "30s"

[[monitoring.metrics]]
type = "disk"
collect = ["used_percent"]
path = "/home/user"
interval = "6h"
```
### Network

| Metric | Description |
|---|---|
| `bytes_sent`, `bytes_recv` | Total bytes sent and received. |
| `packets_sent`, `packets_recv` | Total packets sent and received. |
| `err_in`, `err_out`, `drop_in`, `drop_out` | Error and packet drop statistics. |
#### Type-Specific Fields

| Field | Type | Description | Default |
|---|---|---|---|
| `nic` | String | Name of the network interface. | first available |
#### Examples

```toml
[[monitoring.metrics]]
type = "network"
collect = ["bytes_sent", "bytes_recv"]
nic = "eth0"
interval = "30s"
```
### Host

| Metric | Description |
|---|---|
| `uptime`, `boot_time` | System uptime and boot time. |
| `procs` | Number of processes. |
| `hostname`, `os`, `platform`, `platform_family`, `platform_version` | Host and OS details. |
| `kernel_version`, `kernel_arch` | Kernel version and architecture. |
| `virtualization_system`, `virtualization_role` | Virtualization details, if applicable. |
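Host information is largely static, which makes it a natural fit for the `interval = "0"` behavior described under Common Fields: collect once at service startup. A configuration sketch (the metric selection is illustrative, and it assumes the type string matches the category name, `host`):

```toml
[[monitoring.metrics]]
type = "host"
collect = ["hostname", "os", "kernel_version", "kernel_arch"]
interval = "0" # static values: collected once at startup
```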
### Process

| Metric | Description |
|---|---|
| `cpu_percent`, `memory_percent` | CPU and memory usage percentages. |
| `memory_info_rss`, `memory_info_vms` | Resident and virtual memory usage. |
| `num_threads`, `num_fds` | Process resource statistics. |
| `io_read_count`, `io_write_count`, `io_read_bytes`, `io_write_bytes` | I/O operation counts and bytes. |
| `create_time`, `status`, `cmdline`, `ppid`, `exe`, `cwd` | Metadata and command-line details. |
#### Type-Specific Fields

| Field | Type | Description |
|---|---|---|
| `process_names` | Array of strings | List of process names to collect metrics from. |
#### Example

```toml
[[monitoring.metrics]]
type = "process"
collect = ["cpu_percent", "memory_percent"]
interval = "15s"
process_names = ["httpd", "sshd"]
```
### Sensors

| Metric | Description |
|---|---|
| `temperature`, `temperature_high`, `temperature_critical` | Temperature information collected from hardware sensors. |
Sensor metrics are obtained from standard Linux interfaces such as `hwmon` or legacy `thermal_zone` files under `/sys/class/thermal/`; your device must expose these interfaces for temperature data to be collected successfully. If sensor information is not available through `hwmon` or `thermal_zone`, the metrics will not be collected. In such cases, you can implement a custom metric to provide sensor data from alternative sources.
#### Type-Specific Fields

| Field | Type | Description |
|---|---|---|
| `sensor_names` | Array of strings | List of sensor names to collect metrics from (as reported by the system). |
#### Example

```toml
[[monitoring.metrics]]
type = "sensor"
collect = ["temperature"]
interval = "1m"
sensor_names = ["cpu_thermal", "soc_thermal"]
```
To identify the sensor names available on your device, check the contents of `/sys/class/hwmon/` or `/sys/class/thermal/`. For example, you can list hardware monitor names with:

```shell
cat /sys/class/hwmon/hwmon*/name
cat /sys/class/thermal/thermal_zone*/type
```

Use the reported names (e.g. `cpu_thermal`, `soc_thermal`, `temp1`) in the `sensor_names` field of your configuration.
## Custom Metrics

When the built-in metrics do not provide all the required information, custom metrics can be defined to access device-specific data. Custom metrics support retrieving arbitrary values from files, HTTP endpoints, or commands. This functionality is useful for integrating data from external systems, sensors, or custom scripts.

A custom metric is defined by setting its `type` to `custom` and specifying both a `name` and a probe configuration. An optional `unit` field can also be specified: any string value is accepted, but it is recommended to use a supported unit from the SenML specification, as these are recognized and properly handled by the UI.
### Custom Probe Fields

The custom probe fields define how a custom metric retrieves and processes data from its source. These settings control the probe's behavior, including the type of data source, how the data is accessed, and how the result is parsed.

| Field | Description |
|---|---|
| `source_type` | One of `file`, `http`, or `command`. |
| `source` | The path, URL, or command to execute. |
| `parser` | Optional value parser for extracting data (regex or JSON). |
| `timeout` | Timeout for HTTP or command probes. |
### Source Types

Source types indicate the different input methods a custom probe can use to collect metric data. Depending on the monitoring requirements, a probe can read values from a local file, query an HTTP(S) endpoint, or execute a system command.

| Source Type | Description | Example `source` |
|---|---|---|
| `file` | Reads the metric value from a local file. | `/tmp/temperature.txt` |
| `http` | Fetches the metric value via an HTTP(S) request. | `http://localhost:8080/status.json` |
| `command` | Executes a shell command and reads its output. | `cat /sys/class/thermal/thermal_zone0/temp` |
### Parser

The parser extracts the relevant value from the source output using either a regular expression or a JSON path, depending on the data format. If no parser is configured, the raw output is used as the value.

| Parser Type | Description | Example |
|---|---|---|
| `regex` | Uses a regular expression with one capture group. | `expression = "temp: ([0-9.]+)"` |
| `json` | Extracts the value via a JSON path. | `expression = "$.temperature"` |
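Before committing a `regex` expression to the configuration, it can help to preview what the single capture group would extract from a sample of the source output. A quick sketch with standard shell tools (the sample string mirrors the table's example; `sed` here merely emulates the parser's capture behavior):

```shell
# Sample of what a probe source might emit
SAMPLE="temp: 42.5"

# Emulate a one-capture-group regex parser: print only the captured value
CAPTURED=$(printf '%s\n' "$SAMPLE" | sed -n 's/.*temp: \([0-9.]*\).*/\1/p')

echo "$CAPTURED"
```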
### Examples

#### Custom Metric from File

```toml
[[monitoring.metrics]]
type = "custom"
name = "ambient_temperature"
unit = "degC"
interval = "15s"

[monitoring.metrics.probe]
source_type = "file"
source = "/var/data/temperature"
parser = { type = "regex", expression = "([0-9.]+)" }
```
#### Custom Metric from HTTP Endpoint

```toml
[[monitoring.metrics]]
type = "custom"
name = "service_latency"
unit = "ms"
interval = "10s"

[monitoring.metrics.probe]
source_type = "http"
source = "http://localhost:9000/metrics.json"
timeout = "3s"
parser = { type = "json", expression = "$.latency_ms" }
```
#### Custom Metric from Command Output

```toml
[[monitoring.metrics]]
type = "custom"
name = "gpu_usage"
unit = "%"
interval = "5s"

[monitoring.metrics.probe]
source_type = "command"
source = "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"
timeout = "2s"
parser = { type = "regex", expression = "([0-9]+)" }
```
## Advanced Probe Configuration

While the built-in parsers are suitable for extracting instant values from command outputs, it is possible to leverage custom scripts for cases where more sophisticated data processing and analysis are required.

### Beyond Simple Metrics
A custom probe can be configured to execute any command or script, giving you complete control over how data is collected, processed, and transformed into meaningful metrics. This provides maximum flexibility for advanced use cases, such as:
- Calculating rolling averages, minimums, or maximums over time
- Performing statistical analysis on historical data
- Aggregating and combining information from multiple sources
- Applying custom filtering, normalization, or smoothing algorithms
- Generating derived metrics based on multiple readings
If you prefer not to implement this data-processing logic yourself, a Rule Engine add-on is available. It seamlessly integrates with other Connhex Edge services, offering an easy way to define powerful, event-driven rules and transformations without custom scripting.
### Storing State Between Executions

Probes are inherently stateless. However, the scripts they execute can be designed to maintain state by storing data, for example, in temporary files or other persistent storage. This approach enables time-based calculations that rely on historical context.
#### Example: Rolling Average Temperature

Calculate a 5-minute rolling average of temperature readings:

```toml
[[monitoring.metrics]]
type = "custom"
name = "temperature_5min_avg"
unit = "degC"
interval = "30s"

[monitoring.metrics.probe]
source_type = "command"
source = "/opt/iot-metrics/scripts/temperature_average.sh"
```
```bash
#!/bin/bash
# temperature_average.sh - Calculate 5-minute rolling average

STATE_FILE="/tmp/temperature_readings.txt"
HISTORY_SECONDS=300 # 5 minutes
CURRENT_TIME=$(date +%s)

# Read the current temperature from the sensor
CURRENT_TEMP=$(grep -oP '([0-9.]+)' /sys/class/thermal/thermal_zone0/temp)

# Append current reading with timestamp
echo "${CURRENT_TIME} ${CURRENT_TEMP}" >> "${STATE_FILE}"

# Remove readings older than 5 minutes
CUTOFF_TIME=$((CURRENT_TIME - HISTORY_SECONDS))
awk -v cutoff="$CUTOFF_TIME" '$1 >= cutoff' "${STATE_FILE}" > "${STATE_FILE}.tmp"
mv "${STATE_FILE}.tmp" "${STATE_FILE}"

# Calculate average of all readings in the window
if [ -s "${STATE_FILE}" ]; then
  AVERAGE=$(awk '{ sum += $2; count++ } END { if (count > 0) print sum/count; else print 0 }' "${STATE_FILE}")
  echo "${AVERAGE}"
else
  # If no history exists yet, return the current reading
  echo "${CURRENT_TEMP}"
fi
```
### Best Practices for Custom Scripts

#### Output Format

Your script should write to stdout only the numeric or string value representing the measurement result. Avoid printing debug information or status messages (use stderr if needed), and ensure the output matches the expected data type for your metric.
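A minimal probe sketch following this convention, with diagnostics on stderr and only the value on stdout (the averaging logic and the file contents are illustrative):

```shell
#!/bin/sh
# Hypothetical probe: report the average of the readings in a state file.
STATE_FILE=$(mktemp)
printf '10\n20\n30\n' > "$STATE_FILE"   # stand-in for previously stored readings

# Debug output goes to stderr so it never pollutes the metric value
echo "averaging $(wc -l < "$STATE_FILE") readings" >&2

# stdout carries only the numeric result
VALUE=$(awk '{ sum += $1; n++ } END { print sum / n }' "$STATE_FILE")
echo "$VALUE"

rm -f "$STATE_FILE"
```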
File Locations
Use /tmp for temporary state files that don't need persistence across reboots.
Use application-specific directories (e.g., /var/lib/iot-metrics) for important persistent data, or if your device and application allows it, a DB like sqlite.
Ensure your scripts have appropriate file permissions and are accessible to the agent.
## Performance Considerations

When enabling the monitoring add-on, pay special attention to network data usage and system resource constraints.
### Limit the Number of Collected Metrics

Each collected metric generates data that must be transmitted periodically. Collecting too many metrics, or unnecessary submetrics, increases both:
- Bandwidth consumption (larger data payloads per publish cycle)
- CPU and memory load on the device (from collecting and formatting data)

Make sure to:
- Only enable metrics that are relevant to your monitoring or diagnostic needs.
- Avoid collecting all subfields of a metric unless strictly necessary.
### Adjust Sampling and Publish Intervals Appropriately

The interval fields determine how frequently each metric is sampled and how often the agent sends data upstream. If not properly configured, these settings can generate excessive load on devices or cause high network bandwidth consumption.

When configuring the monitor:
- Set the `publish_interval` to the coarsest acceptable reporting frequency. For low-bandwidth connections, consider publishing data every few minutes.
- Adjust the `interval` for each metric to reflect how quickly the underlying value changes.
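For instance, on a metered connection a coarse publish interval can be combined with faster sampling, so readings are buffered locally and sent in fewer, larger payloads (the values below are illustrative):

```toml
[monitoring]
publish_interval = "5m" # batch readings and publish every five minutes

[[monitoring.metrics]]
type = "cpu"
collect = ["usage_percent"]
interval = "30s" # sample frequently between publishes
```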