Mission Control & Always-On Operations

Your agent is installed, connected to channels, and integrated with memory, voice, and email. There is one problem left — the moment you close your terminal or your machine restarts, everything stops. A truly useful agent needs to run around the clock, survive crashes, and let you monitor its activity from anywhere.

Keeping Agents Running 24/7

The simplest way to keep an agent running is to never close the terminal. But that is not a real solution. Process managers handle this properly by automatically restarting your agent if it crashes, starting it on system boot, and managing logs.

Using systemd (Linux VPS)

systemd is the init system and service manager on most Linux distributions, so it is almost certainly already running on your VPS.

Create a service file for your agent:

sudo nano /etc/systemd/system/crewai-agent.service

[Unit]
Description=CrewAI Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=agent
WorkingDirectory=/home/agent/my-agent
ExecStart=/home/agent/.venv/bin/python /home/agent/my-agent/main.py
Restart=always
RestartSec=10
Environment=PYTHONUNBUFFERED=1
EnvironmentFile=/home/agent/.env

[Install]
WantedBy=multi-user.target

# Reload systemd so it picks up the new unit file
sudo systemctl daemon-reload

# Enable the service at boot and start it now
sudo systemctl enable crewai-agent
sudo systemctl start crewai-agent

# Check status
sudo systemctl status crewai-agent

# View logs
sudo journalctl -u crewai-agent -f

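The Restart=always directive tells systemd to restart the agent whenever the process exits, whether it crashed or exited cleanly; only an explicit systemctl stop leaves it stopped. RestartSec=10 makes systemd wait 10 seconds before each restart, so a persistent error does not turn into a tight restart loop.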

Using pm2 (Cross-Platform)

If you want one tool that works the same way on local machines and servers, pm2 is a popular process manager. It is built on Node.js, so it needs Node.js and npm installed, but it can supervise Python scripts as well:

# Install pm2 globally
npm install -g pm2

# Start your agent with pm2 (point --interpreter at your virtualenv's python if you use one)
pm2 start main.py --interpreter python3 --name crewai-agent

# Save the current process list so pm2 can restore it after a reboot
pm2 save

# Generate the boot-time startup script (run the command that pm2 prints)
pm2 startup

# Useful pm2 commands
pm2 status          # View all managed processes
pm2 logs            # Stream live logs
pm2 restart all     # Restart all processes
pm2 monit           # Real-time monitoring dashboard

pm2 also provides a built-in monitoring dashboard (pm2 monit) that shows CPU usage, memory consumption, and restart counts — useful for quick health checks.

Remote Access and Monitoring

When your agent runs on a VPS, you need to monitor it without physically accessing the server.

SSH Access

# Connect to your VPS
ssh agent@your-server-ip

# Check agent status
sudo systemctl status crewai-agent

# View recent logs
sudo journalctl -u crewai-agent --since "1 hour ago"

Setting Up a Mission Control Dashboard

A mission control dashboard gives you a visual overview of your agent's activity. You can build one using a simple web interface that displays:

Agent status — running, stopped, or error state
Recent actions — what has the agent done in the last hour?
Message counts — how many messages processed across channels
Error log — any failures or exceptions
Resource usage — CPU, memory, and API call counts

# Dashboard configuration
dashboard:
  enabled: true
  port: 3001
  auth:
    username: ${DASHBOARD_USER}
    password: ${DASHBOARD_PASS}
  metrics:
    - agent_status
    - messages_processed
    - api_calls_count
    - error_count
    - uptime

Protect your dashboard with authentication — it exposes operational details about your agent that should not be publicly accessible.
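
As an illustration of the authentication step, here is a minimal sketch of checking HTTP Basic credentials in Python. It assumes the dashboard is a small web app you write yourself; the function name is_authorized is hypothetical, and the expected credentials would come from the DASHBOARD_USER and DASHBOARD_PASS environment variables used in the configuration above.

```python
import base64
import hmac


def is_authorized(auth_header, expected_user, expected_pass):
    """Return True if an HTTP Basic Authorization header matches the expected credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed base64 or non-UTF-8 payload
    user, sep, password = decoded.partition(":")
    if not sep:
        return False  # no "user:password" separator present
    # compare_digest avoids leaking information through comparison timing
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    return user_ok and pass_ok
```

A request handler would call is_authorized(request_header, os.environ["DASHBOARD_USER"], os.environ["DASHBOARD_PASS"]) and answer 401 when it returns False. Basic credentials travel in an easily decoded form, so this only makes sense behind HTTPS or an SSH tunnel.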

Mobile Notifications

Monitoring from a laptop is good. Monitoring from your wrist is better. When your agent encounters errors, completes critical tasks, or needs human input, you want to know immediately.

Apple Watch Notifications via TGWatch

TGWatch is a Telegram client for Apple Watch. Since your agent already communicates through Telegram, this creates a natural notification pipeline:

Your agent sends a Telegram message about a completed task or an error
The Telegram notification appears on your phone
TGWatch mirrors it to your Apple Watch
You can read the notification and even reply with quick responses

This means your agent can tap you on the wrist when it needs attention — no need to check a dashboard or open your laptop.

For critical alerts, configure your agent to send messages to a dedicated Telegram alert channel:

# Alert configuration
alerts:
  channel: telegram
  chat_id: ${ALERT_CHAT_ID}
  triggers:
    - event: error
      message: "Agent encountered an error: {error_details}"
    - event: task_complete
      message: "Task completed: {task_summary}"
    - event: approval_needed
      message: "Human approval needed: {action_description}"
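
If your framework does not read an alert configuration like this directly, the same behavior can be approximated in a few lines of Python. The sketch below is an assumption about how you might wire it up, not a built-in feature: format_alert and send_alert are hypothetical helpers, TELEGRAM_BOT_TOKEN is an assumed environment variable name, and the request uses the Telegram Bot API sendMessage method.

```python
import json
import os
import urllib.request

# Templates mirror the triggers in the alert configuration above.
ALERT_TEMPLATES = {
    "error": "Agent encountered an error: {error_details}",
    "task_complete": "Task completed: {task_summary}",
    "approval_needed": "Human approval needed: {action_description}",
}


def format_alert(event, **fields):
    """Fill in the template for an event; raises KeyError for an unknown event or missing field."""
    return ALERT_TEMPLATES[event].format(**fields)


def send_alert(event, **fields):
    """Send the formatted alert to the alert chat and return Telegram's "ok" flag."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]  # assumed variable name
    chat_id = os.environ["ALERT_CHAT_ID"]
    payload = json.dumps({"chat_id": chat_id, "text": format_alert(event, **fields)})
    request = urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["ok"]
```

For example, send_alert("error", error_details="model request timed out") would post "Agent encountered an error: model request timed out" to the alert chat.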

Log Monitoring and Health Checks

Logs are your agent's flight recorder. When something goes wrong, logs tell you what happened and why.

Structured Logging

Configure your agent to write structured logs that are easy to search and filter:

# Logging configuration
logging:
  level: info
  format: json
  output:
    - file: /var/log/crewai/agent.log
    - stdout
  rotation:
    max_size: 50MB
    max_files: 10
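
In a Python agent, roughly the same setup can be built with the standard logging module. This is a sketch under assumptions: JsonFormatter and build_logger are hypothetical names, and the field names in each JSON entry are one possible choice rather than a required schema.

```python
import json
import logging
import sys
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(log_path, name="agent"):
    """Create a logger that writes JSON lines to a rotating file and to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    # Rotate at 50 MB and keep 10 rotated backups alongside the active file
    file_handler = RotatingFileHandler(log_path, maxBytes=50 * 1024 * 1024, backupCount=10)
    for handler in (file_handler, logging.StreamHandler(sys.stdout)):
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

Calling build_logger("/var/log/crewai/agent.log") requires that the directory exists and is writable by the user the agent runs as.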

Health Check Endpoint

Set up a simple health check that external monitoring tools can ping:

# Health check configuration
health_check:
  enabled: true
  port: 3002
  path: /health
  checks:
    - name: model_connection
      timeout: 5s
    - name: telegram_connection
      timeout: 5s
    - name: memory_access
      timeout: 3s

# Test health check
curl http://localhost:3002/health

A health check endpoint returns a simple status response. External monitoring services can ping this endpoint regularly and alert you if it stops responding.
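
To make the idea concrete, here is a minimal sketch of such an endpoint using only the Python standard library. The names run_checks, HealthHandler, and serve are hypothetical, and each check is assumed to be a function you supply that returns a truthy value when the dependency is reachable.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks(checks):
    """Run (name, func, timeout_seconds) checks in parallel and map each name to True or False."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = [(name, pool.submit(func), timeout) for name, func, timeout in checks]
    for name, future, timeout in futures:
        try:
            results[name] = bool(future.result(timeout=timeout))
        except Exception:  # the check raised, or did not finish within its timeout
            results[name] = False
    pool.shutdown(wait=False)  # do not block the response on a check that is still running
    return results


class HealthHandler(BaseHTTPRequestHandler):
    checks = []  # assigned by serve() before the server starts

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        results = run_checks(self.checks)
        healthy = all(results.values())
        status = "ok" if healthy else "degraded"
        body = json.dumps({"status": status, "checks": results}).encode("utf-8")
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(checks, host="127.0.0.1", port=3002):
    """Serve /health until the process is stopped."""
    HealthHandler.checks = checks
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Returning 503 when any check fails lets external monitors alert on the status code alone. A timed-out check keeps running in its worker thread, so each check function should also enforce its own network timeout.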

Practical Reliability Tips

Start with simplicity: Use pm2 or systemd, not a complex container orchestration system. You are running a single agent process, not a distributed system.

Monitor API costs: Set up daily cost alerts with your model provider. An agent stuck in a retry loop can burn through API credits quickly.

Implement graceful shutdown: When your agent receives a stop signal, it should finish current tasks before shutting down, not cut off mid-action.

Test failure recovery: Deliberately crash your agent and verify it restarts correctly. Check that memory is preserved, channel connections are re-established, and in-progress tasks are handled.

Keep logs rotated: Unrotated logs will eventually fill your disk. Configure log rotation from day one.
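
The graceful shutdown tip above can be sketched as follows. This assumes the agent runs a simple task loop; get_next_task and handle_task stand in for your own code, and all function names here are hypothetical.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: record the stop request instead of exiting immediately."""
    stop_requested.set()


def install_signal_handlers():
    """Register handlers for the usual stop signals; call this from the main thread."""
    signal.signal(signal.SIGTERM, request_stop)  # default stop signal for systemctl stop
    signal.signal(signal.SIGINT, request_stop)   # Ctrl+C, and the first signal pm2 sends on stop


def run_agent(get_next_task, handle_task, poll_seconds=1.0):
    """Process tasks until a stop is requested; a task already in progress always finishes."""
    completed = 0
    while not stop_requested.is_set():
        task = get_next_task()
        if task is None:
            # Idle: sleep, but wake up promptly if a stop request arrives
            stop_requested.wait(poll_seconds)
            continue
        handle_task(task)
        completed += 1
    return completed
```

Call install_signal_handlers() once at startup, then run_agent(...). Both systemd and pm2 force-kill a process that takes too long to exit after the stop signal, so long tasks should checkpoint their progress rather than rely on unlimited time.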

Key takeaway: Always-on operation requires process management, remote monitoring, and mobile notifications. Use systemd or pm2 to keep your agent running, set up a dashboard for visibility, and route critical alerts to your phone or watch. Reliability is not about preventing all failures — it is about detecting and recovering from them automatically.

Next module: Building your agent's skills and tool integrations for real-world task execution.