Navigate the Cloud Universe
StackLens is your intelligent companion for DevOps, SecOps, ML, and AI Engineering. Search hand-picked resources across the technical ecosystem.
Monitoring Resources
netdata/netdata
The fastest path to AI-powered full stack observability, even for lean teams.
grafana/grafana
The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
langfuse/langfuse
🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
teslamate-org/teslamate
A self-hosted data logger for your Tesla 🚘 [main maintainer=@JakobLichterfeld]
ben1234560/k8s_PaaS
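How to build a PaaS/DevOps platform (a complete software development and deployment platform) on K8s (Kubernetes): a tutorial and learning resource with hands-on code, architecture design, extensive comments, and step-by-step screenshots. You will learn to deploy Kubernetes, Dashboard, Harbor, Jenkins, a local GitLab, the Apollo framework, Prometheus, Grafana, Spinnaker, and more.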
Elastic Stack (ELK) – Search, Observe, Protect
The Elastic Stack — Elasticsearch, Logstash, and Kibana — is the world's most popular log management platform. Collect, parse, and visualize any type of data.
AWS CloudWatch – Official Documentation
Amazon CloudWatch is a monitoring and management service that provides data and actionable insights for AWS resources, applications, and services. Set alarms, log metrics, and more.
Datadog – Cloud Monitoring as a Service
Datadog is a monitoring and security platform for cloud applications. It brings together end-to-end traces, metrics, and logs, making your stack fully observable.
Grafana + Prometheus – Full Stack Monitoring Tutorial
Complete guide to setting up a production-grade monitoring stack with Prometheus for metrics collection and Grafana for visualization and alerting.
Azure Monitor – Full Observability for Azure
Azure Monitor collects, analyzes, and acts on telemetry from your Azure and on-premises environments. Includes Application Insights, Log Analytics, and more.
New Relic – Full-Stack Observability Platform
New Relic provides full-stack observability for your entire software stack. Monitor APM, infrastructure, logs, browser, mobile, and synthetics from one platform.
Loki – Like Prometheus but for Logs
Grafana Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Designed to be cost-effective and easy to operate.
Google Cloud Monitoring (formerly Stackdriver)
Google Cloud's operations suite provides monitoring, logging, and diagnostics for applications running on Google Cloud and beyond.
Zabbix – Enterprise-Class Monitoring
Zabbix is a mature, enterprise-level platform designed to monitor networks, servers, cloud, applications, and services. Fully open source with no limits on hosts.
Jaeger – End-to-End Distributed Tracing
Jaeger is an open-source, end-to-end distributed tracing system, used for monitoring microservices-based distributed systems. CNCF graduated project.
Nagios – IT Infrastructure Monitoring
Nagios is one of the most widely used open-source monitoring solutions. Monitor hosts, services, and network devices with powerful alerting and notification capabilities.
VictoriaMetrics – Fast & Scalable Monitoring
VictoriaMetrics is a fast, cost-saving, and scalable monitoring solution and time series database. Drop-in replacement for Prometheus with better performance.
The Five Agent Failure Modes Nobody Catches in Staging
Every agent failure I have ever debugged in production had the same property: it passed staging. Not...
Why Building Custom Monitoring Dashboards for ClickHouse® Becomes Challenging at Scale
Monitoring is one of the most critical aspects of operating any production database environment. As...
Agentic AI FinOps: Why Claude Agent Loops Cost 30x a Single Inference
A single Claude API call is predictable. An agent with tool access is not. Real numbers, real failure modes, and patterns you can copy into your own setup tod
OOMKill Is the Next Lie: Why Memory Limits Are Hiding Your Latency Spikes
OOMKill is a reporting artifact, not a root cause. By the time the kernel logs the kill event and your alerting pipeline fires, the service already degraded
Catching the failure is the easy part
The last post I wrote ended on a loose thread I have not been able to stop pulling at. Almost every...
I monitored 11 public MCP servers. Latency ranged 215x (97ms to 21 seconds).
TL;DR: I built a tiny tool that speaks the MCP protocol and ran it against 11 public Model Context...
Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System
Here is a debugging session I have watched play out at four different companies now. An agent does...
The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model
Your agent demo took an afternoon. The reason it isn't in production nine months later has nothing to...
How to read a PromQL query
PromQL looks dense the first time you meet it. A line like histogram_quantile(0.99, sum by (le,...
Monitoring Video Aggregator Health with a Go Prometheus Exporter
A Go Prometheus exporter that catches what an uptime ping misses on a video aggregator: stale per-re
Building a Video Stream Health Probe with Prometheus Exporters in Go
How we built a Go Prometheus exporter that probes HLS/DASH manifests across eight regions, catching
How to Add a Linux Target Node to Prometheus (Step-by-Step)
Hey everyone! 👋 Monitoring your infrastructure is super important for maintaining system health. If...
Spring Boot + Prometheus: A Practical Introduction to Application Metrics
A developer spends a lot of time building features, but very little time asking an important...
From Load Test to Production Monitor: k6 Studio, Grafana Cloud, and Synthetic Monitoring
Part 4 of 4: From Load Test to Production Monitor — k6 Studio, Grafana Cloud, and Synthetic...
Detecting API anomalies behind a 200 OK — with statistics, not AI
Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've...
Full Observability on k3s: kube-prometheus-stack + Loki + Grafana OIDC
Deploy a production-grade monitoring stack on bare-metal k3s: Prometheus, Loki with Garage S3 storage, Promtail on edge nodes via Ansible, SNMP monitoring for MikroTik, and Grafana SSO via Authelia OIDC — all GitOps-managed.
I almost burned ₹4,000 on Claude API overnight — so I built llm-cost-guard
Last month I wrote what I...
LogQL vs PromQL: the same query in both languages
If you’ve written Prometheus queries, Grafana Loki’s LogQL looks reassuringly familiar — rate(...),...