10,000+ hand-picked resources

Navigate the Cloud Universe

StackLens is your intelligent companion for DevOps, SecOps, ML, and AI Engineering. Search hand-picked resources across the technical ecosystem.

Monitoring Resources

41 results
repository

netdata/netdata

The fastest path to AI-powered full stack observability, even for lean teams.

Monitoring
repository

grafana/grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.

Monitoring
repository

langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Monitoring
repository

prometheus/node_exporter

Exporter for machine metrics

Monitoring
repository

teslamate-org/teslamate

A self-hosted data logger for your Tesla 🚘 [main maintainer=@JakobLichterfeld]

Monitoring
repository

samber/awesome-prometheus-alerts

🚨 Collection of Prometheus alerting rules

Monitoring
repository

prometheus/blackbox_exporter

Blackbox prober exporter

Monitoring
repository

ben1234560/k8s_PaaS

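How to deploy a PaaS/DevOps platform (a complete software development and deployment platform) on K8s (Kubernetes): a tutorial for learning, with hands-on code, architecture design, extensive comments, and illustrated steps. You will learn to deploy K8S (Kubernetes), Dashboard, Harbor, Jenkins, local GitLab, the Apollo framework, Prometheus, Grafana, Spinnaker, and more.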

Monitoring
documentation

Elastic Stack (ELK) – Search, Observe, Protect

The Elastic Stack — Elasticsearch, Logstash, and Kibana — is the world's most popular log management platform. Collect, parse, and visualize any type of data.

Monitoring
documentation

AWS CloudWatch – Official Documentation

Amazon CloudWatch is a monitoring and management service that provides data and actionable insights for AWS resources, applications, and services. Set alarms, log metrics, and more.

Monitoring
documentation

Datadog – Cloud Monitoring as a Service

Datadog is a monitoring and security platform for cloud applications. It brings together end-to-end traces, metrics, and logs, making your stack fully observable.

Monitoring
documentation

Grafana + Prometheus – Full Stack Monitoring Tutorial

Complete guide to setting up a production-grade monitoring stack with Prometheus for metrics collection and Grafana for visualization and alerting.

Monitoring
documentation

Azure Monitor – Full Observability for Azure

Azure Monitor collects, analyzes, and acts on telemetry from your Azure and on-premises environments. Includes Application Insights, Log Analytics, and more.

Monitoring
documentation

New Relic – Full-Stack Observability Platform

New Relic provides full-stack observability for your entire software stack. Monitor APM, infrastructure, logs, browser, mobile, and synthetics from one platform.

Monitoring
documentation

Loki – Like Prometheus but for Logs

Grafana Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Designed to be cost-effective and easy to operate.

Monitoring
documentation

Google Cloud Monitoring (formerly Stackdriver)

Google Cloud's operations suite provides monitoring, logging, and diagnostics for applications running on Google Cloud and beyond.

Monitoring
documentation

Zabbix – Enterprise-Class Monitoring

Zabbix is a mature, enterprise-level platform designed to monitor networks, servers, cloud, applications, and services. Fully open source with no limits on hosts.

Monitoring
documentation

Jaeger – End-to-End Distributed Tracing

Jaeger is an open-source, end-to-end distributed tracing system, used for monitoring microservices-based distributed systems. CNCF graduated project.

Monitoring
documentation

Nagios – IT Infrastructure Monitoring

Nagios is one of the most widely used open-source monitoring solutions. Monitor hosts, services, and network devices with powerful alerting and notification capabilities.

Monitoring
documentation

VictoriaMetrics – Fast & Scalable Monitoring

VictoriaMetrics is a fast, cost-saving, and scalable monitoring solution and time series database. Drop-in replacement for Prometheus with better performance.

Monitoring
blog

The Five Agent Failure Modes Nobody Catches in Staging

Every agent failure I have ever debugged in production had the same property: it passed staging. Not...

Monitoring
blog

Why Building Custom Monitoring Dashboards for ClickHouse® Becomes Challenging at Scale

Monitoring is one of the most critical aspects of operating any production database environment. As...

Monitoring
blog

Agentic AI FinOps: Why Claude Agent Loops Cost 30× a Single Inference

A single Claude API call is predictable. An agent with tool access is not. Real numbers, real failure modes, and patterns you can copy into your own setup tod

Monitoring
blog

oomkill is the next lie why memory limits are hiding your latency spikes

OOMKill is a reporting artifact, not a root cause. By the time the kernel logs the kill event and your alerting pipeline fires, the service already degraded

Monitoring
blog

Catching the failure is the easy part

The last post I wrote ended on a loose thread I have not been able to stop pulling at. Almost every...

Monitoring
blog

I monitored 11 public MCP servers. Latency ranged 215× (97ms to 21 seconds).

TL;DR: I built a tiny tool that speaks the MCP protocol and ran it against 11 public Model Context...

Monitoring
blog

Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System

Here is a debugging session I have watched play out at four different companies now. An agent does...

Monitoring
blog

The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model

Your agent demo took an afternoon. The reason it isn't in production nine months later has nothing to...

Monitoring
blog

How to read a PromQL query

PromQL looks dense the first time you meet it. A line like histogram_quantile(0.99, sum by (le,...

Monitoring
blog

Monitoring Video Aggregator Health with a Go Prometheus Exporter

A Go Prometheus exporter that catches what an uptime ping misses on a video aggregator: stale per-re

Monitoring
blog

Building a Video Stream Health Probe with Prometheus Exporters in Go

How we built a Go Prometheus exporter that probes HLS/DASH manifests across eight regions, catching

Monitoring
blog

How to Add a Linux Target Node to Prometheus (Step-by-Step)

Hey everyone! 👋 Monitoring your infrastructure is super important for maintaining system health. If...

Monitoring
blog

Spring Boot + Prometheus: A Practical Introduction to Application Metrics

A developer spends a lot of time building features, but very little time asking an important...

Monitoring
blog

From Load Test to Production Monitor: k6 Studio, Grafana Cloud, and Synthetic Monitoring

Part 4 of 4: From Load Test to Production Monitor — k6 Studio, Grafana Cloud, and Synthetic...

Monitoring
blog

Detecting API anomalies behind a 200 OK — with statistics, not AI

Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've...

Monitoring
blog

Full Observability on k3s: kube-prometheus-stack + Loki + Grafana OIDC

Deploy a production-grade monitoring stack on bare-metal k3s: Prometheus, Loki with Garage S3 storage, Promtail on edge nodes via Ansible, SNMP monitoring for MikroTik, and Grafana SSO via Authelia OIDC — all GitOps-managed.

Monitoring
blog

I almost burned ₹4,000 on Claude API overnight — so I built llm-cost-guard

Last month I wrote what I...

Monitoring
blog

LogQL vs PromQL: the same query in both languages

If you’ve written Prometheus queries, Grafana Loki’s LogQL looks reassuringly familiar — rate(...),...

Monitoring