10,000+ hand-picked resources

Navigate the Cloud Universe

StackLens is your intelligent companion for DevOps, SecOps, ML, and AI Engineering. Search hand-picked resources across the technical ecosystem.

Monitoring Resources

41 results

repository

netdata/netdata

The fastest path to AI-powered full stack observability, even for lean teams.

Monitoring

repository

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.

Monitoring

repository

langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Monitoring

repository

prometheus/node_exporter

Exporter for machine metrics

Monitoring

repository

teslamate-org/teslamate

A self-hosted data logger for your Tesla 🚘 [main maintainer=@JakobLichterfeld]

Monitoring

repository

samber/awesome-prometheus-alerts

🚨 Collection of Prometheus alerting rules

Monitoring

repository

prometheus/blackbox_exporter

Blackbox prober exporter

Monitoring

repository

ben1234560/k8s_PaaS

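How to deploy a PaaS/DevOps platform (a complete software development and deployment platform) on K8s (Kubernetes): a tutorial and learning resource with hands-on code, architecture design, extensive comments, and illustrated walkthroughs. You will learn to deploy K8S (Kubernetes), Dashboard, Harbor, Jenkins, a local GitLab, the Apollo framework, Prometheus, Grafana, Spinnaker, and more.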

Monitoring

documentation

Elastic Stack (ELK) – Search, Observe, Protect

The Elastic Stack — Elasticsearch, Logstash, and Kibana — is the world's most popular log management platform. Collect, parse, and visualize any type of data.

Monitoring

documentation

AWS CloudWatch – Official Documentation

Amazon CloudWatch is a monitoring and management service that provides data and actionable insights for AWS resources, applications, and services. Set alarms, log metrics, and more.

Monitoring

documentation

Datadog – Cloud Monitoring as a Service

Datadog is a monitoring and security platform for cloud applications. It brings together end-to-end traces, metrics, and logs, making your stack fully observable.

Monitoring

documentation

Grafana + Prometheus – Full Stack Monitoring Tutorial

Complete guide to setting up a production-grade monitoring stack with Prometheus for metrics collection and Grafana for visualization and alerting.

Monitoring

documentation

Azure Monitor – Full Observability for Azure

Azure Monitor collects, analyzes, and acts on telemetry from your Azure and on-premises environments. Includes Application Insights, Log Analytics, and more.

Monitoring

documentation

New Relic – Full-Stack Observability Platform

New Relic provides full-stack observability for your entire software stack. Monitor APM, infrastructure, logs, browser, mobile, and synthetics from one platform.

Monitoring

documentation

Loki – Like Prometheus but for Logs

Grafana Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Designed to be cost-effective and easy to operate.

Monitoring

documentation

Google Cloud Monitoring (formerly Stackdriver)

Google Cloud's operations suite provides monitoring, logging, and diagnostics for applications running on Google Cloud and beyond.

Monitoring

documentation

Zabbix – Enterprise-Class Monitoring

Zabbix is a mature, enterprise-level platform designed to monitor networks, servers, cloud, applications, and services. Fully open source with no limits on hosts.

Monitoring

documentation

Jaeger – End-to-End Distributed Tracing

Jaeger is an open-source, end-to-end distributed tracing system, used for monitoring microservices-based distributed systems. CNCF graduated project.

Monitoring

documentation

Nagios – IT Infrastructure Monitoring

Nagios is one of the most widely used open-source monitoring solutions. Monitor hosts, services, and network devices with powerful alerting and notification capabilities.

Monitoring

documentation

VictoriaMetrics – Fast & Scalable Monitoring

VictoriaMetrics is a fast, cost-saving, and scalable monitoring solution and time series database. Drop-in replacement for Prometheus with better performance.

Monitoring

blog

The Five Agent Failure Modes Nobody Catches in Staging

Every agent failure I have ever debugged in production had the same property: it passed staging. Not...

Monitoring

blog

Why Building Custom Monitoring Dashboards for ClickHouse® Becomes Challenging at Scale

Monitoring is one of the most critical aspects of operating any production database environment. As...

Monitoring

blog

Agentic AI FinOps: Why Claude Agent Loops Cost 30x a Single Inference

A single Claude API call is predictable. An agent with tool access is not. Real numbers, real failure modes, and patterns you can copy into your own setup tod

Monitoring

blog

OOMKill Is the Next Lie: Why Memory Limits Are Hiding Your Latency Spikes

OOMKill is a reporting artifact, not a root cause. By the time the kernel logs the kill event and your alerting pipeline fires, the service already degraded

Monitoring

blog

Catching the failure is the easy part

The last post I wrote ended on a loose thread I have not been able to stop pulling at. Almost every...

Monitoring

blog

I monitored 11 public MCP servers. Latency ranged 215x (97ms to 21 seconds).

TL;DR: I built a tiny tool that speaks the MCP protocol and ran it against 11 public Model Context...

Monitoring

blog

Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System

Here is a debugging session I have watched play out at four different companies now. An agent does...

Monitoring

blog

The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model

Your agent demo took an afternoon. The reason it isn't in production nine months later has nothing to...

Monitoring

blog

How to read a PromQL query

PromQL looks dense the first time you meet it. A line like histogram_quantile(0.99, sum by (le,...

Monitoring

blog

Monitoring Video Aggregator Health with a Go Prometheus Exporter

A Go Prometheus exporter that catches what an uptime ping misses on a video aggregator: stale per-re

Monitoring

blog

Building a Video Stream Health Probe with Prometheus Exporters in Go

How we built a Go Prometheus exporter that probes HLS/DASH manifests across eight regions, catching

Monitoring

blog

How to Add a Linux Target Node to Prometheus (Step-by-Step)

Hey everyone! 👋 Monitoring your infrastructure is super important for maintaining system health. If...

Monitoring

blog

Spring Boot + Prometheus: A Practical Introduction to Application Metrics

A developer spends a lot of time building features, but very little time asking an important...

Monitoring

blog

From Load Test to Production Monitor: k6 Studio, Grafana Cloud, and Synthetic Monitoring

Part 4 of 4: From Load Test to Production Monitor — k6 Studio, Grafana Cloud, and Synthetic...

Monitoring

blog

Detecting API anomalies behind a 200 OK — with statistics, not AI

Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've...

Monitoring

blog

Full Observability on k3s: kube-prometheus-stack + Loki + Grafana OIDC

Deploy a production-grade monitoring stack on bare-metal k3s: Prometheus, Loki with Garage S3 storage, Promtail on edge nodes via Ansible, SNMP monitoring for MikroTik, and Grafana SSO via Authelia OIDC — all GitOps-managed.

Monitoring

blog

I almost burned ₹4,000 on Claude API overnight — so I built llm-cost-guard

Last month I wrote what I...

Monitoring

blog

LogQL vs PromQL: the same query in both languages

If you’ve written Prometheus queries, Grafana Loki’s LogQL looks reassuringly familiar — rate(...),...

Monitoring