Scalability and Reliability
How This Architecture Scales
The VPC structure you have built is inherently scale-ready. Subnets do not need to be modified to add capacity — you add more instances to existing subnets. Route tables handle traffic for any number of ENIs in the subnet. Security group rules apply to any new resource that joins the group.
The two-AZ design provides fault tolerance for subnet-level resources. If AZ-1 becomes unavailable, resources in AZ-2 continue operating. Each AZ has its own NAT Gateway, so outbound routing in AZ-2 is unaffected by an AZ-1 failure. This is what “AZ-independent failure domains” means in practice — the network layer does not create cross-AZ dependencies.
Where This Architecture Has Limits
Subnet size is fixed at creation and cannot be resized. AWS reserves five addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast), so a /24 provides 251 usable IPs. If you launch 251 instances in a /24 subnet, the 252nd will fail to obtain an IP. Plan subnet sizes based on the maximum expected instance count in that tier, not the current count. For Auto Scaling groups, choose subnet CIDRs larger than your maximum scaling target.
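A quick way to sanity-check subnet capacity is with Python's standard ipaddress module. The five-address reservation is AWS subnet behavior, not part of the module, so it is applied manually here; the CIDRs are illustrative:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast

def usable_ips(cidr: str) -> int:
    """Return the number of instance-assignable IPs in an AWS subnet."""
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print(usable_ips("10.0.1.0/24"))  # → 251
print(usable_ips("10.0.0.0/20"))  # → 4091
```

Comparing this number against an Auto Scaling group's maximum capacity before launch is cheaper than re-architecting a full subnet later.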
NAT Gateway throughput scales automatically up to 100 Gbps, so bandwidth is rarely the bottleneck. However, a single NAT Gateway per AZ is still a single point of failure within that AZ. AWS manages the underlying infrastructure, so failures are rare, but they happen. For critical workloads, some teams deploy multiple NAT Gateways per AZ and split private subnets across separate route tables to spread load (VPC route tables do not support weighted routing natively); this is beyond the scope of this lab but worth knowing.
Security groups have limits. Each security group supports a maximum of 60 inbound and 60 outbound rules by default (the quota can be raised via a Service Quotas request). Each ENI can be associated with up to 5 security groups by default. At scale, poorly managed security group proliferation becomes an operational burden. Establish a naming convention and ownership model early.
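A small audit script can flag groups approaching the per-direction rule quota before a deployment fails. This is a sketch using stand-in dicts shaped like the output of boto3's describe_security_groups; note it counts permission entries, whereas AWS's quota counts each CIDR or referenced group inside an entry as a separate rule, so treat the result as a lower bound:

```python
DEFAULT_RULE_QUOTA = 60  # per-direction default; can be raised via Service Quotas

def near_quota(groups, threshold=0.8):
    """Return (GroupId, inbound, outbound) for groups above threshold * quota."""
    flagged = []
    for g in groups:
        inbound = len(g.get("IpPermissions", []))       # entry count, lower bound on rules
        outbound = len(g.get("IpPermissionsEgress", []))
        if max(inbound, outbound) > threshold * DEFAULT_RULE_QUOTA:
            flagged.append((g["GroupId"], inbound, outbound))
    return flagged

# Stand-in data; in practice this comes from describe_security_groups()
groups = [
    {"GroupId": "sg-0abc", "IpPermissions": [{}] * 55, "IpPermissionsEgress": [{}] * 3},
    {"GroupId": "sg-0def", "IpPermissions": [{}] * 4, "IpPermissionsEgress": [{}] * 4},
]
print(near_quota(groups))  # → [('sg-0abc', 55, 3)]
```

Running a check like this in CI, keyed to your naming convention, is one way to enforce the ownership model mentioned above.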
Monitoring the Network Layer
For production, enable the following:
VPC Flow Logs: Capture metadata (not payload) for all traffic hitting ENIs in your VPC. Publish to CloudWatch Logs or S3. Flow logs are the primary debugging tool for network connectivity issues — they tell you whether a connection was accepted or rejected and by which interface. Enable them at the VPC level to capture everything.
To enable: VPC > Your VPCs > select lab-vpc > Flow Logs > Create flow log. Set filter to “All”, destination to CloudWatch Logs, and create or select an IAM role with logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents permissions.
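The same flow log can be created programmatically. This sketch builds the request for boto3's EC2 create_flow_log API (a real boto3 call); the VPC ID, role ARN, and log group name are placeholders, and the dict is only constructed here, not sent:

```python
def flow_log_params(vpc_id: str, role_arn: str, log_group: str) -> dict:
    """Build a create_flow_log request: all traffic, VPC level, to CloudWatch Logs."""
    return {
        "ResourceType": "VPC",                     # VPC level captures everything
        "ResourceIds": [vpc_id],
        "TrafficType": "ALL",                      # matches the "All" filter in the console
        "LogDestinationType": "cloud-watch-logs",
        "LogGroupName": log_group,
        "DeliverLogsPermissionRole": role_arn,     # needs the three logs:* permissions above
    }

params = flow_log_params(
    "vpc-0123456789abcdef0",                       # placeholder VPC ID
    "arn:aws:iam::111122223333:role/lab-flow-logs",  # placeholder role ARN
    "/vpc/lab-vpc/flow-logs",                      # placeholder log group
)
# To actually create it (requires AWS credentials):
#   import boto3
#   boto3.client("ec2").create_flow_log(**params)
```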
CloudWatch Metrics for NAT Gateway: AWS publishes BytesOutToDestination, ConnectionAttemptCount, ErrorPortAllocation, and others. Set an alarm on ErrorPortAllocation — this fires when the NAT Gateway runs out of ports, which indicates either an application with too many concurrent connections or a misconfigured connection pool.
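The ErrorPortAllocation alarm can likewise be expressed as a put_metric_alarm request. This is a sketch assuming a hypothetical NAT Gateway ID and alarm-naming scheme; the namespace AWS/NATGateway, the NatGatewayId dimension, and the metric name are the ones AWS publishes:

```python
def port_alloc_alarm(nat_gateway_id: str) -> dict:
    """Build a put_metric_alarm request that fires on any port-allocation error."""
    return {
        "AlarmName": f"nat-port-exhaustion-{nat_gateway_id}",  # placeholder naming scheme
        "Namespace": "AWS/NATGateway",
        "MetricName": "ErrorPortAllocation",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 300,                                 # 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any allocation error fires
        "TreatMissingData": "notBreaching",            # no data means no errors
    }

alarm = port_alloc_alarm("nat-0a1b2c3d4e5f67890")      # placeholder NAT Gateway ID
# To create it (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

A threshold of zero is deliberate: a healthy NAT Gateway should report no allocation errors at all, so any nonzero sum is worth investigating.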