John

Senior Cloud Engineer & Technical Lead

AWS VPC Flow Logs with EKS: Enabling Network Visibility Using Terraform

I was troubleshooting a connectivity issue between pods in an EKS cluster and an external API last week. The pods were timing out, but I had no visibility into whether the traffic was even leaving the VPC. Without flow logs, I was essentially flying blind - making guesses about security group rules and NACLs without any data to back them up. That experience reminded me why VPC Flow Logs should be enabled from day one on any production infrastructure.

The Problem

Network troubleshooting in AWS without flow logs is painful. You’re left asking questions like:

  • Is traffic actually reaching my NAT Gateway?
  • Are my security groups blocking something I don’t expect?
  • Why are connections to this external service timing out?
  • Is there unexpected traffic hitting my VPC endpoints?

Without VPC Flow Logs, you’re stuck making educated guesses and iterating on security group rules hoping something sticks. Even worse, when security asks for an audit of network traffic patterns, you have nothing to show them.

What Are VPC Flow Logs?

VPC Flow Logs capture information about IP traffic going to and from network interfaces in your VPC. Think of them as a network tap that records metadata (never packet payloads) about connection attempts, whether they were accepted or rejected.

flowchart TB
    subgraph VPC["VPC"]
        subgraph PublicSubnet["Public Subnet"]
            NAT["NAT Gateway"]
            ALB["Application Load Balancer"]
        end
        subgraph PrivateSubnet["Private Subnet"]
            subgraph EKS["EKS Cluster"]
                Pod1["Pod 1<br/>10.0.1.15"]
                Pod2["Pod 2<br/>10.0.1.16"]
                ENI1["ENI"]
                ENI2["ENI"]
            end
        end
        VPCEndpoint["VPC Endpoint<br/>(S3, ECR, etc.)"]
    end
    subgraph FlowLogs["VPC Flow Logs"]
        Capture["Capture Layer<br/>(All ENI Traffic)"]
        CW["CloudWatch Logs"]
        S3["S3 Bucket"]
    end
    Internet["Internet"]
    Pod1 --- ENI1
    Pod2 --- ENI2
    ENI1 -.->|"Flow Record"| Capture
    ENI2 -.->|"Flow Record"| Capture
    NAT -.->|"Flow Record"| Capture
    ALB -.->|"Flow Record"| Capture
    VPCEndpoint -.->|"Flow Record"| Capture
    Capture --> CW
    Capture --> S3
    NAT --> Internet
    Internet --> ALB

What Gets Captured

Each flow log record contains:

Field        Description
srcaddr      Source IP address
dstaddr      Destination IP address
srcport      Source port
dstport      Destination port
protocol     IANA protocol number (6 = TCP, 17 = UDP)
packets      Number of packets transferred
bytes        Number of bytes transferred
action       ACCEPT or REJECT
log-status   Logging status (OK, NODATA, SKIPDATA)

Here’s what a typical flow log record looks like:

2 123456789012 eni-abc123 10.0.1.15 52.94.76.5 49321 443 6 25 5000 1609459200 1609459260 ACCEPT OK

This tells me: a pod at 10.0.1.15 sent traffic from ephemeral port 49321 to 52.94.76.5:443 over TCP, and that traffic was allowed: 25 packets (5000 bytes) over a 60-second window.
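To make the field positions concrete, here is a minimal sketch that labels the columns of a default-format record. The function name `describe_flow_record` is my own, and it assumes the 14-field default (version 2) layout: version, account ID, interface ID, source address, destination address, source port, destination port, protocol, packets, bytes, start, end, action, log status.

```shell
#!/usr/bin/env bash
# Label the fields of one default-format (version 2) VPC flow log record.
# describe_flow_record is a hypothetical helper, not part of any AWS tooling.
describe_flow_record() {
  printf '%s\n' "$1" | awk '{
    # Translate the most common IANA protocol numbers
    if ($8 == 6) proto = "TCP"
    else if ($8 == 17) proto = "UDP"
    else if ($8 == 1) proto = "ICMP"
    else proto = "proto-" $8

    # $4/$6 = source address/port, $5/$7 = destination address/port,
    # $11/$12 = start/end of the capture window (Unix seconds)
    printf "%s %s:%s -> %s:%s %s packets=%s bytes=%s duration=%ds\n",
           $13, $4, $6, $5, $7, proto, $9, $10, $12 - $11
  }'
}

describe_flow_record "2 123456789012 eni-abc123 10.0.1.15 52.94.76.5 49321 443 6 25 5000 1609459200 1609459260 ACCEPT OK"
```

If you use a custom log format, the column positions change and the field numbers in the awk program need to be adjusted to match.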

The Three Capture Levels

VPC Flow Logs can be attached at three different levels:

flowchart TB
    subgraph Levels["Flow Log Attachment Levels"]
        VPCLevel["VPC Level<br/>(Captures ALL traffic)"]
        SubnetLevel["Subnet Level<br/>(Captures subnet traffic)"]
        ENILevel["ENI Level<br/>(Captures specific interface traffic)"]
    end
    subgraph VPC["VPC: 10.0.0.0/16"]
        subgraph Subnet1["Subnet A: 10.0.1.0/24"]
            ENI1["ENI-1"]
            ENI2["ENI-2"]
        end
        subgraph Subnet2["Subnet B: 10.0.2.0/24"]
            ENI3["ENI-3"]
            ENI4["ENI-4"]
        end
    end
    VPCLevel -->|"Monitors"| VPC
    SubnetLevel -->|"Monitors"| Subnet1
    ENILevel -->|"Monitors"| ENI3
    style VPCLevel fill:#e1f5fe
    style SubnetLevel fill:#fff3e0
    style ENILevel fill:#f3e5f5

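For EKS clusters, I recommend VPC-level flow logs so that every ENI is covered: pod-to-pod communication across nodes, egress through NAT Gateways, and VPC endpoint traffic. One caveat: traffic between two pods on the same node typically never crosses an ENI, so it does not show up in flow logs.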

Benefits of VPC Flow Logs

1. Network Troubleshooting

When pods can’t connect to external services, flow logs immediately show whether traffic is being rejected:

# Find rejected traffic from a specific pod
# (the IP is quoted inside the pattern because it contains dots)
aws logs filter-log-events \
    --log-group-name /aws/vpc-flow-log/eks-cluster \
    --filter-pattern '"10.0.1.15" REJECT'

If you see REJECT entries, you know a security group or NACL is blocking the traffic. The record does not name the rule, but the source, destination, and port tell you which rules to check.
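Once you have a batch of records saved locally, a short awk pass can group the rejections so the blocked paths stand out. This is a rough sketch: `summarize_rejects` is a name I made up, the sample records are invented for illustration, and it assumes the default 14-field format.

```shell
#!/usr/bin/env bash
# Count REJECT records per "source -> destination:port".
# Reads default-format flow log records on stdin; most frequent first.
summarize_rejects() {
  awk '
    $13 == "REJECT" { count[$4 " -> " $5 ":" $7]++ }
    END { for (flow in count) print count[flow], flow }
  ' | sort -rn
}

# Invented sample records (203.0.113.10 is a documentation-only address)
summarize_rejects <<'EOF'
2 123456789012 eni-abc123 10.0.1.15 203.0.113.10 49321 443 6 1 60 1609459200 1609459260 REJECT OK
2 123456789012 eni-abc123 10.0.1.15 203.0.113.10 49322 443 6 1 60 1609459260 1609459320 REJECT OK
2 123456789012 eni-abc123 10.0.1.16 10.0.2.20 51000 5432 6 1 60 1609459200 1609459260 REJECT OK
2 123456789012 eni-abc123 10.0.1.15 52.94.76.5 49323 443 6 25 5000 1609459200 1609459260 ACCEPT OK
EOF
```

Grouping by destination port rather than source port matters here, because each retry uses a new ephemeral source port and would otherwise look like a separate flow.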

2. Security Monitoring

Flow logs reveal unusual traffic patterns that might indicate compromise:

  • Unexpected outbound connections to unknown IPs
  • Port scanning activity (many connections to different ports)
  • Data exfiltration (large outbound transfers to unusual destinations)
  • Lateral movement attempts between subnets
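
# Find accepted traffic leaving the VPC (destination outside 10.0.0.0/16).
# Flow log records contain IP addresses, not hostnames, so filter on the
# destination address rather than on a domain name.
aws logs filter-log-events \
    --log-group-name /aws/vpc-flow-log/eks-cluster \
    --filter-pattern "ACCEPT" \
    --query 'events[].message' --output text \
    | tr '\t' '\n' \
    | awk '$4 ~ /^10\.0\./ && $5 !~ /^10\.0\./'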
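Port scanning is the easiest of these patterns to spot mechanically: one source touching many distinct ports on the same destination. The sketch below counts distinct destination ports per source/destination pair. `find_port_scanners` and its default threshold of 10 are my own choices, not an AWS recommendation, and the sample data is invented.

```shell
#!/usr/bin/env bash
# Print "src dst distinct_ports" for every pair that touched at least
# <threshold> distinct destination ports. Reads default-format records on stdin.
find_port_scanners() {
  awk -v threshold="${1:-10}" '
    # Count each (src, dst, dstport) combination only once
    NF >= 14 && !seen[$4, $5, $7]++ { ports[$4 " " $5]++ }
    END {
      for (pair in ports)
        if (ports[pair] >= threshold + 0) print pair, ports[pair]
    }
  ' | sort
}

# Invented sample: 10.0.1.15 probes three ports, 10.0.1.16 reuses one
find_port_scanners 3 <<'EOF'
2 123456789012 eni-abc123 10.0.1.15 10.0.2.20 40001 22 6 1 60 1609459200 1609459260 REJECT OK
2 123456789012 eni-abc123 10.0.1.15 10.0.2.20 40002 80 6 1 60 1609459200 1609459260 REJECT OK
2 123456789012 eni-abc123 10.0.1.15 10.0.2.20 40003 443 6 1 60 1609459200 1609459260 REJECT OK
2 123456789012 eni-abc123 10.0.1.16 10.0.2.20 40004 443 6 5 900 1609459200 1609459260 ACCEPT OK
2 123456789012 eni-abc123 10.0.1.16 10.0.2.20 40005 443 6 5 900 1609459260 1609459320 ACCEPT OK
EOF
```

A real threshold needs tuning against your own baseline, since service meshes and load balancer health checks can legitimately touch many ports.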

3. Compliance and Auditing

Many compliance frameworks (SOC 2, PCI-DSS, HIPAA) require network traffic logging. Flow logs provide:

  • Complete record of all network connections
  • Evidence of security group effectiveness
  • Audit trail for forensic investigation
  • Data retention in S3 for long-term storage

4. Cost Optimization

Flow logs help identify wasted network resources:

  • NAT Gateway traffic that could use VPC endpoints
  • Cross-AZ traffic that could be optimized
  • Unused or underutilized network paths
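As a starting point for the NAT Gateway question, you can total the bytes sent to each destination outside the VPC and see which external endpoints dominate. This is only a sketch: `top_external_destinations` is a hypothetical helper, it assumes the VPC uses the 10.0.0.0/16 range from this article (matched as the string prefix `10.0.`), and the sample records are invented.

```shell
#!/usr/bin/env bash
# Sum accepted bytes per destination for flows that start inside the VPC
# and end outside it. The optional argument is the VPC address prefix.
top_external_destinations() {
  awk -v prefix="${1:-10.0.}" '
    $13 == "ACCEPT" && index($4, prefix) == 1 && index($5, prefix) != 1 {
      bytes[$5] += $10
    }
    END { for (dst in bytes) print bytes[dst], dst }
  ' | sort -rn
}

# Invented sample records
top_external_destinations <<'EOF'
2 123456789012 eni-abc123 10.0.1.15 52.94.76.5 49321 443 6 25 5000 1609459200 1609459260 ACCEPT OK
2 123456789012 eni-abc123 10.0.1.16 52.94.76.5 49400 443 6 25 5000 1609459200 1609459260 ACCEPT OK
2 123456789012 eni-abc123 10.0.1.16 203.0.113.10 49401 443 6 3 300 1609459200 1609459260 ACCEPT OK
2 123456789012 eni-abc123 10.0.1.15 10.0.2.20 49402 5432 6 90 9999 1609459200 1609459260 ACCEPT OK
EOF
```

If the top destinations turn out to belong to AWS services such as S3 or ECR, that traffic is a candidate for a VPC endpoint instead of the NAT Gateway.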

Enabling VPC Flow Logs with the EKS Terraform Module

Now for the practical part. The terraform-aws-modules/eks/aws module doesn’t directly create VPC flow logs, but the companion terraform-aws-modules/vpc/aws module does. Here’s how to configure them together.

Complete Terraform Configuration

# VPC with Flow Logs enabled
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true

  # Enable DNS support for VPC endpoints
  enable_dns_hostnames = true
  enable_dns_support   = true

  # Subnet tags required for EKS
  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }

  # VPC Flow Logs Configuration
  enable_flow_log                      = true
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
  flow_log_max_aggregation_interval    = 60

  flow_log_cloudwatch_log_group_name_prefix = "/aws/vpc-flow-log/"
  flow_log_cloudwatch_log_group_name_suffix = "eks-cluster"

  # Retain logs for 30 days (adjust based on compliance requirements)
  flow_log_cloudwatch_log_group_retention_in_days = 30

  # Optional: Enable KMS encryption for logs
  flow_log_cloudwatch_log_group_kms_key_id = aws_kms_key.flow_logs.arn

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}

# KMS key for encrypting flow logs (optional but recommended)
resource "aws_kms_key" "flow_logs" {
  description             = "KMS key for VPC Flow Logs encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow CloudWatch Logs"
        Effect = "Allow"
        Principal = {
          Service = "logs.${data.aws_region.current.name}.amazonaws.com"
        }
        Action = [
          "kms:Encrypt*",
          "kms:Decrypt*",
          "kms:ReEncrypt*",
          "kms:GenerateDataKey*",
          "kms:Describe*"
        ]
        Resource = "*"
        Condition = {
          ArnLike = {
            "kms:EncryptionContext:aws:logs:arn" = "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:*"
          }
        }
      }
    ]
  })
}

resource "aws_kms_alias" "flow_logs" {
  name          = "alias/vpc-flow-logs"
  target_key_id = aws_kms_key.flow_logs.key_id
}

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

# EKS Cluster using the VPC
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "my-eks-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Expose the cluster endpoint both publicly and privately
  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  eks_managed_node_groups = {
    default = {
      min_size     = 2
      max_size     = 10
      desired_size = 3

      instance_types = ["m5.large"]
      capacity_type  = "ON_DEMAND"
    }
  }

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}

Key Configuration Flags Explained

Flag                                              Purpose                                                Recommended Value
enable_flow_log                                   Master switch for flow logs                            true
create_flow_log_cloudwatch_log_group              Auto-create CloudWatch log group                       true
create_flow_log_cloudwatch_iam_role               Auto-create IAM role for publishing                    true
flow_log_max_aggregation_interval                 Maximum capture window per record, in seconds (60 or 600)   60 (1 minute)
flow_log_cloudwatch_log_group_retention_in_days   Log retention period                                   30 to 365 based on compliance

Alternative: Publishing to S3

For long-term storage or cost optimization, you can publish to S3 instead of CloudWatch:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  # ... other configuration ...

  enable_flow_log                   = true
  flow_log_destination_type         = "s3"
  flow_log_destination_arn          = aws_s3_bucket.flow_logs.arn
  flow_log_file_format              = "parquet" # Better for Athena queries
  flow_log_max_aggregation_interval = 600       # 10 minutes for cost savings

  # Partition logs by hour for efficient querying
  flow_log_per_hour_partition = true
}

resource "aws_s3_bucket" "flow_logs" {
  bucket = "my-vpc-flow-logs-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket_lifecycle_configuration" "flow_logs" {
  bucket = aws_s3_bucket.flow_logs.id

  rule {
    id     = "transition-to-glacier"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}

Querying Flow Logs

Once enabled, you can query flow logs to troubleshoot issues.

CloudWatch Logs Insights

# Find the 100 most recent rejected flows (set the time range to the last hour in the query window)
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100

# Find traffic from a specific pod CIDR
fields @timestamp, srcAddr, dstAddr, dstPort, action, bytes
| filter srcAddr like /10\.0\.1\./
| stats sum(bytes) as totalBytes by dstAddr, dstPort
| sort totalBytes desc

# Identify top talkers (most traffic)
fields srcAddr, dstAddr, bytes
| filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20

Using Athena for S3 Logs

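If you’re publishing to S3, create an Athena table for querying. The DDL below is for the default plain-text log format; logs written as Parquet need a table declared with STORED AS PARQUET instead: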

CREATE EXTERNAL TABLE vpc_flow_logs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string
)
PARTITIONED BY (`date` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://my-vpc-flow-logs-123456789012/AWSLogs/123456789012/vpcflowlogs/us-west-2/'
TBLPROPERTIES ("skip.header.line.count"="1");
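A partitioned table returns no rows until its partitions are registered, so each day of logs needs an ALTER TABLE ... ADD PARTITION statement. The helper below prints that statement for a given day. It is a sketch under assumptions: `partition_ddl` is a name I chose, and it assumes the default daily S3 layout of `.../vpcflowlogs/<region>/YYYY/MM/DD/` (hourly partitioning adds another path level).

```shell
#!/usr/bin/env bash
# Print the Athena statement that registers one day of flow logs.
# $1 = day as YYYY-MM-DD, $2 = S3 prefix up to and including the region, with trailing slash
partition_ddl() {
  local day="$1" base="$2" day_path
  # 2021-01-01 -> 2021/01/01, matching the date folders in the S3 key
  day_path="$(printf '%s' "$day" | sed 's|-|/|g')"
  printf "ALTER TABLE vpc_flow_logs ADD PARTITION (\`date\`='%s') LOCATION '%s%s/';\n" \
    "$day" "$base" "$day_path"
}

partition_ddl "2021-01-01" \
  "s3://my-vpc-flow-logs-123456789012/AWSLogs/123456789012/vpcflowlogs/us-west-2/"
```

You can paste the output into the Athena query editor, or loop over a date range to backfill older partitions.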

Architecture Overview

Here’s how everything fits together with EKS:

flowchart TB
    subgraph AWS["AWS Account"]
        subgraph VPC["VPC: 10.0.0.0/16"]
            subgraph PublicSubnets["Public Subnets"]
                NAT1["NAT Gateway AZ-a"]
                NAT2["NAT Gateway AZ-b"]
                ALB["Application Load Balancer"]
            end
            subgraph PrivateSubnets["Private Subnets"]
                subgraph EKS["EKS Cluster"]
                    Node1["Node Group 1"]
                    Node2["Node Group 2"]
                    Pod1["Pod"]
                    Pod2["Pod"]
                    Pod3["Pod"]
                end
            end
            VPCEndpoints["VPC Endpoints<br/>(ECR, S3, STS)"]
        end
        subgraph Observability["Observability"]
            CW["CloudWatch Logs<br/>/aws/vpc-flow-log/eks-cluster"]
            S3["S3 Bucket<br/>(Long-term storage)"]
            Athena["Athena<br/>(Ad-hoc queries)"]
        end
    end
    Internet["Internet"]
    Pod1 & Pod2 & Pod3 --> Node1 & Node2
    Node1 & Node2 -->|"Egress"| NAT1 & NAT2
    NAT1 & NAT2 --> Internet
    Internet --> ALB
    ALB --> Pod1 & Pod2 & Pod3
    Node1 & Node2 -->|"AWS API Calls"| VPCEndpoints
    VPC -.->|"All Traffic<br/>Captured"| CW
    CW -->|"Export"| S3
    S3 --> Athena
    style CW fill:#ff9800
    style S3 fill:#4caf50
    style Athena fill:#2196f3

Cost Considerations

VPC Flow Logs aren’t free, so here are strategies to manage costs:

  1. Use S3 instead of CloudWatch for long-term storage - significantly cheaper
  2. Increase aggregation interval to 600 seconds (10 minutes) if you don’t need real-time data
  3. Use Parquet format with S3 - smaller files than plain text, and faster, cheaper Athena queries
  4. Enable per-hour partitioning for efficient Athena queries
  5. Set appropriate retention - don’t keep logs longer than compliance requires
  6. Consider narrowing what you capture in high-traffic environments, for example logging only rejected traffic or only selected subnets (though this reduces visibility)

Approximate costs (us-west-2):

  • CloudWatch Logs: $0.50 per GB ingested + $0.03 per GB stored
  • S3 Standard: $0.023 per GB stored
  • S3 Glacier: $0.004 per GB stored
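To turn those rates into a rough monthly figure, multiply your daily log volume by 30 and apply each rate. The sketch below does that arithmetic. `estimate_monthly_cost` is a hypothetical helper, the rates are the approximate ones listed above, and it deliberately ignores log delivery charges, request costs, and data older than one month, so treat the output as an order-of-magnitude estimate.

```shell
#!/usr/bin/env bash
# Rough monthly cost for a given daily log volume in GB,
# using the approximate us-west-2 rates quoted in this article.
estimate_monthly_cost() {
  awk -v gb_per_day="$1" 'BEGIN {
    gb_month = gb_per_day * 30   # assume a 30-day month
    printf "CloudWatch: $%.2f ingest + $%.2f storage\n", gb_month * 0.50, gb_month * 0.03
    printf "S3 Standard: $%.2f storage\n", gb_month * 0.023
    printf "S3 Glacier: $%.2f storage\n", gb_month * 0.004
  }'
}

# Example: a VPC producing 10 GB of flow logs per day
estimate_monthly_cost 10
```

At 10 GB per day, CloudWatch ingestion alone comes to about $150 per month against under $7 for S3 Standard storage, which is why the destination choice matters more than the retention setting for busy VPCs.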

Key Learnings

  • VPC Flow Logs capture network metadata at the ENI level - They record source/destination IPs, ports, protocols, and whether traffic was accepted or rejected by security groups and NACLs
  • Enable flow logs at the VPC level for EKS - This captures traffic on every ENI, including pod-to-pod communication across nodes, NAT Gateway egress, and VPC endpoint usage
  • The terraform-aws-modules/vpc/aws module handles everything - Set enable_flow_log = true together with create_flow_log_cloudwatch_log_group and create_flow_log_cloudwatch_iam_role, and it creates the log group, IAM role, and flow log resource automatically
  • Use CloudWatch for real-time troubleshooting, S3 for long-term storage - CloudWatch Logs Insights gives you quick queries, while S3 with Athena is better for historical analysis and compliance
  • Always encrypt flow logs - Use KMS encryption for CloudWatch log groups, especially in regulated environments
  • Parquet format with hourly partitions optimizes S3 costs - The Parquet format reduces storage significantly and partitions make Athena queries faster and cheaper
  • Flow logs are essential for security incident response - When something goes wrong, having a complete record of network traffic is invaluable for forensic investigation
  • Consider cost from day one - A busy VPC can generate gigabytes of logs daily; plan your retention and storage tier strategy upfront

The biggest lesson from my troubleshooting experience: the time to enable flow logs is before you need them. Trying to debug network issues without flow logs is like debugging code without logs - possible, but unnecessarily painful.