S3

Introduction

  • Simple Storage Service (S3) is object-based storage (not block storage for an OS or applications to run on). A single object can range from 0 bytes to 5 TB, and there is no limit on total usage.

  • S3 has a global namespace, but you choose a region when creating a bucket.

  • Files are stored in buckets. Bucket names are globally unique.

  • A bucket can be addressed with a path-style URL that embeds the region and bucket name: https://s3-${region}.amazonaws.com/${bucket_name}

  • Objects are accessed over HTTP with verbs such as PUT, GET, and DELETE; a successful action returns status code 200. Multipart transfers are supported.

Features

  • New objects can be read immediately after upload, but updates and deletes take more time to propagate (eventual consistency)

  • Supports cross-origin resource sharing (CORS) for clients served from other domains.
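
  A minimal sketch of enabling CORS on a bucket with boto3 (the bucket name and allowed origin are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      # Allow a single web-app origin to GET and PUT objects in this bucket
      s3.put_bucket_cors(
          Bucket='my-bucket',
          CORSConfiguration={
              'CORSRules': [{
                  'AllowedOrigins': ['https://app.example.com'],
                  'AllowedMethods': ['GET', 'PUT'],
                  'AllowedHeaders': ['*'],
                  'MaxAgeSeconds': 3000,
              }]
          },
      )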

  • Composition:

    • Key (name)

      • S3 stores data by key in alphabetical order, so sequential key names can concentrate load on one partition and hurt performance. Adding random letters or numbers to names (for folders or files) helps spread objects evenly across S3 partitions (see the sketch after this list).

    • Value (file)

    • Version id

    • Metadata

    • Sub-resources

      • Access Control Lists (for privilege)

      • Torrent
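
  A minimal sketch of the randomized-prefix idea above (the 4-character hex prefix is just one possible scheme):

      import uuid

      def randomized_key(filename):
          # Prepend a short random hex prefix so keys spread evenly
          # across S3 partitions instead of sorting together
          return f"{uuid.uuid4().hex[:4]}/{filename}"

      # e.g. 'a3f9/report-2020-01-01.csv' instead of 'report-2020-01-01.csv'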

  • Storage Classes:

    • Standard

    • Intelligent-Tiering

    • Standard-IA (Infrequent Access)

    • One Zone-IA

    • Glacier

    • Glacier Deep Archive

  • Performance:

    • Baseline Performance (by prefixes in a bucket)

      • scales automatically; latency is 100 to 200 ms

      • can achieve 5,500 req/s for GET/HEAD and 3,500 req/s for PUT/COPY/POST/DELETE per prefix in a bucket.

      • No limits to the number of prefixes in a bucket.

    • Multi-part Upload

      • Recommended for files > 100 MB; mandatory for files > 5 GB (see the sketch after this list).

    • Byte-range Fetches (for downloads)

      • Parallelize GETs by requesting specific byte ranges (see the sketch after this list)

    • Transfer Acceleration (can be combined with multi-part upload)

      • Files are uploaded to a CloudFront edge location, then forwarded to the S3 bucket in the target region.

      • Over a fully-utilized 1 Gbps line, Transfer Acceleration can move up to 75 TB in roughly a Snowball's turnaround time. If a transfer would take more than a week over the Internet, or there are recurring transfer jobs with more than 25 Mbps of available bandwidth, Transfer Acceleration is a good option.
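
  A sketch of the performance techniques above with boto3 (file, bucket, and key names are hypothetical):

      import boto3
      from boto3.s3.transfer import TransferConfig

      s3 = boto3.client('s3')

      # Multi-part upload: split files over 100 MB into 25 MB parts,
      # uploaded by up to 8 threads in parallel
      config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                              multipart_chunksize=25 * 1024 * 1024,
                              max_concurrency=8)
      s3.upload_file('backup.tar', 'my-bucket', 'backup.tar', Config=config)

      # Byte-range fetch: download only the first 1 MB of the object
      resp = s3.get_object(Bucket='my-bucket', Key='backup.tar',
                           Range='bytes=0-1048575')
      first_mb = resp['Body'].read()

      # Transfer Acceleration must be enabled on the bucket first
      s3.put_bucket_accelerate_configuration(
          Bucket='my-bucket',
          AccelerateConfiguration={'Status': 'Enabled'})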

  • Glacier retrieval options:

    • Expedited (1 to 5 minutes)

    • Standard (3 to 5 hours)

    • Bulk (5 to 12 hours)

  • Glacier Deep Archive retrieval options:

    • Standard (12 hours)

    • Bulk (48 hours)
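
  A sketch of initiating a Glacier restore with boto3 and choosing a retrieval tier (bucket and key are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      # Restore an archived object for 7 days; Tier may be
      # 'Expedited', 'Standard', or 'Bulk'
      s3.restore_object(
          Bucket='archive-bucket',
          Key='logs/2019.tar.gz',
          RestoreRequest={
              'Days': 7,
              'GlacierJobParameters': {'Tier': 'Expedited'},
          },
      )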

  • S3 Select and Glacier Select

    • Retrieve subsets of an object's data using SQL, with filtering performed server-side.

    • Can filter by rows & columns (simple SQL statements)

    • Faster & cheaper
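
  A sketch of S3 Select with boto3, filtering rows of a CSV object server-side (bucket, key, and column names are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      resp = s3.select_object_content(
          Bucket='my-bucket',
          Key='users.csv',
          ExpressionType='SQL',
          Expression="SELECT s.name FROM S3Object s WHERE CAST(s.age AS INT) > 30",
          InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
          OutputSerialization={'CSV': {}},
      )

      # The response is an event stream; collect the record payloads
      for event in resp['Payload']:
          if 'Records' in event:
              print(event['Records']['Payload'].decode())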

  • Charges:

    • Storage

    • Requests

    • Storage management pricing (file tagging)

    • Data transfer pricing

    • Transfer acceleration

    • Monitoring cost (only for S3 Intelligent-Tiering, $0.0025 per 1,000 objects)

  • Cost Saving tips:

    • S3 Select / Glacier Select

    • S3 Lifecycle

    • Compress object to save space

    • S3 Requester Pays:

      • Bucket owner pays for S3 storage

      • Requester pays for the cost of request and data download

      • Enable it via a bucket policy; do not grant access through an IAM role in your account (otherwise you, the owner, still pay).
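
  A sketch of Requester Pays with boto3: the owner enables it on the bucket, and requesters must acknowledge the charges on every request (names are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      # Bucket owner: turn on Requester Pays
      s3.put_bucket_request_payment(
          Bucket='shared-datasets',
          RequestPaymentConfiguration={'Payer': 'Requester'},
      )

      # Requester: must pass RequestPayer to accept the charges
      resp = s3.get_object(Bucket='shared-datasets', Key='data.csv',
                           RequestPayer='requester')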

  • Lifecycle management

    • Transition from Standard to IA (objects must be at least 128 KB and 30 days past creation)

    • Archive to Glacier (30 days after IA if relevant)

    • Permanent deletion
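
  A sketch of the lifecycle above with boto3: transition to IA at 30 days, to Glacier at 60, then permanent deletion (bucket name, prefix, and day counts are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      s3.put_bucket_lifecycle_configuration(
          Bucket='my-bucket',
          LifecycleConfiguration={
              'Rules': [{
                  'ID': 'archive-then-delete',
                  'Filter': {'Prefix': 'logs/'},
                  'Status': 'Enabled',
                  'Transitions': [
                      {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                      {'Days': 60, 'StorageClass': 'GLACIER'},
                  ],
                  'Expiration': {'Days': 365},   # permanent deletion
              }]
          },
      )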

  • Versioning

  • S3 Object Lock & Glacier Vault Lock

    • S3 Object Lock

      • Adopt a WORM (Write once, read many) model.

      • Block an object version deletion for a specified amount of time.

    • Glacier Vault Lock

      • Adopt a WORM (Write once, read many) model.

      • Lock the policy for future edits (can no longer be changed).

      • Helpful for compliance and data retention.
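
  A sketch of placing a retention period on an object version with boto3 (Object Lock must already be enabled on the bucket; names and dates are hypothetical):

      import boto3
      from datetime import datetime, timezone

      s3 = boto3.client('s3')

      # Block deletion of this object version until the retain-until date
      s3.put_object_retention(
          Bucket='locked-bucket',
          Key='audit/2020.log',
          Retention={
              'Mode': 'COMPLIANCE',   # or 'GOVERNANCE'
              'RetainUntilDate': datetime(2026, 1, 1, tzinfo=timezone.utc),
          },
      )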

  • Encryption:

    • In transit (SSL/TLS)

    • At rest

      • Server side encryption options

        • S3 managed keys: SSE-S3

        • AWS Key Management Service, Managed Keys: SSE-KMS (similar to SSE-S3 but with some additional benefits)

        • Customer provided (managed) keys: SSE-C

      • Client side encryption
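
  A sketch of the server-side options with boto3 (bucket, keys, and KMS alias are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      # SSE-S3: S3-managed keys
      s3.put_object(Bucket='my-bucket', Key='a.txt', Body=b'secret',
                    ServerSideEncryption='AES256')

      # SSE-KMS: keys managed in AWS KMS
      s3.put_object(Bucket='my-bucket', Key='b.txt', Body=b'secret',
                    ServerSideEncryption='aws:kms',
                    SSEKMSKeyId='alias/my-app-key')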

  • Access control

    • Use bucket policies (can constrain by public IP or Elastic IP, but not private IP) and bucket ACLs

    • By default, a bucket and its contents are private

    • Pre-signed URLs

      • Can generate pre-signed urls with SDK / CLI

        • For downloads (can use CLI)

        • For uploads (must use SDK)

      • Valid for 3600 seconds by default; change the timeout with the --expires-in argument

      • Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET / PUT

      • Scenarios

        • Allow only logged-in users to download a file.

        • Allow a user to upload a file temporarily
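
  A sketch of generating pre-signed URLs with boto3 (the CLI equivalent for downloads is aws s3 presign; bucket and keys are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      # Download URL, valid for the default 3600 seconds
      get_url = s3.generate_presigned_url(
          'get_object',
          Params={'Bucket': 'my-bucket', 'Key': 'report.pdf'},
          ExpiresIn=3600,
      )

      # Temporary upload URL, valid for 5 minutes
      put_url = s3.generate_presigned_url(
          'put_object',
          Params={'Bucket': 'my-bucket', 'Key': 'uploads/photo.jpg'},
          ExpiresIn=300,
      )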

  • Logging

    • CloudTrail:

      • By default, bucket-level access is recorded.

      • Object level logging can be enabled.

  • Replication

    • Cross Region Replication

      • For buckets in different regions

      • Versioning must be enabled in both buckets.

      • Once CRR is on, subsequently uploaded or updated files are replicated automatically (existing objects are not replicated retroactively).

      • When an object is first deleted, a deletion marker is added and replicated to the other buckets. If a second delete (of a specific version) is then made in one bucket, only that bucket's copy is removed; replicas in the other buckets keep only their deletion markers. Likewise, recovering an object restores it only in that bucket; the other buckets must be handled individually.

      • To avoid regional failure of S3

        • Enable CRR to a bucket with a different name in a backup region.

        • Have applications read the bucket name from SSM Parameter Store, and swap the value for DR.

    • Same Region Replication
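
  A sketch of enabling replication with boto3 (versioning must already be on in both buckets; the role ARN and bucket names are hypothetical):

      import boto3

      s3 = boto3.client('s3')

      s3.put_bucket_replication(
          Bucket='source-bucket',
          ReplicationConfiguration={
              'Role': 'arn:aws:iam::123456789012:role/s3-replication',
              'Rules': [{
                  'Status': 'Enabled',
                  'Prefix': '',   # replicate everything
                  'Destination': {'Bucket': 'arn:aws:s3:::backup-bucket'},
              }],
          },
      )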

  • Request through public / private subnets

    • Through public subnets: an Internet Gateway is used to reach S3. Set up the bucket policy with aws:SourceIp for the public IP.

    • Through private subnets: a VPC Endpoint Gateway is used to reach S3. To restrict access to specific VPC endpoints, set up the bucket policy with either:

      • aws:SourceVpce for one or a few endpoints

      • aws:SourceVpc to encompass all possible VPC endpoints in a VPC
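
  A sketch of restricting a bucket to one VPC endpoint via a bucket policy (bucket name and endpoint ID are hypothetical):

      import json
      import boto3

      s3 = boto3.client('s3')

      # Deny all S3 actions unless the request comes through this endpoint
      policy = {
          'Version': '2012-10-17',
          'Statement': [{
              'Sid': 'DenyOutsideVpce',
              'Effect': 'Deny',
              'Principal': '*',
              'Action': 's3:*',
              'Resource': ['arn:aws:s3:::my-bucket',
                           'arn:aws:s3:::my-bucket/*'],
              'Condition': {
                  'StringNotEquals': {'aws:SourceVpce': 'vpce-0123456789abcdef0'}
              },
          }],
      }
      s3.put_bucket_policy(Bucket='my-bucket', Policy=json.dumps(policy))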

  • For a VPC to restrict access to specific buckets

    • Set up an endpoint policy that explicitly allows access to the required buckets.

  • Provide static website hosting

    • Serverless, cheap, auto-scaling, but does not support HTTPS.

    • Works with Route 53; the bucket name must be identical to the domain name

    • An index document (index.html) is mandatory; an error page is optional.
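
  A sketch of enabling static website hosting with boto3 (the bucket name matches a hypothetical domain):

      import boto3

      s3 = boto3.client('s3')

      s3.put_bucket_website(
          Bucket='www.example.com',
          WebsiteConfiguration={
              'IndexDocument': {'Suffix': 'index.html'},   # mandatory
              'ErrorDocument': {'Key': 'error.html'},      # optional
          },
      )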

  • Anti patterns

    • Lots of small files

    • POSIX file system (use EFS instead), file locks.

    • Search features, queries, rapidly changing data.

      • objects can be indexed in DynamoDB (an S3 event triggers a Lambda that inserts the data; see the sketch after this list)

    • Website with dynamic content
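
  A sketch of the DynamoDB indexing pattern above: a Lambda handler triggered by S3 event notifications (the table name and item attributes are hypothetical):

      import boto3

      table = boto3.resource('dynamodb').Table('s3-object-index')

      def handler(event, context):
          # One S3 event notification can carry several records
          for record in event['Records']:
              table.put_item(Item={
                  'key': record['s3']['object']['key'],
                  'bucket': record['s3']['bucket']['name'],
                  'size': record['s3']['object'].get('size', 0),
                  'event_time': record['eventTime'],
              })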

Service for transferring large amount of data with physical storage (bypassing internet)

  • AWS Import/Export

    • Imports/exports data sets of up to 16 TB to S3 or EBS (you ship your own storage devices).

  • Snowball (Import/Export Disk to/from S3)

    • Types:

      • Snowball

        • On-board storage of 50 TB or 80 TB

        • Bypasses internet entirely

      • Snowball Edge

        • Durable local storage

        • Local compute with AWS Lambda

        • Local compute instances

        • Use in a cluster of devices

        • Use with AWS Greengrass (IoT)

        • Transfer files through NFS with a GUI

      • Snowmobile

        • Exabyte-scale data (coming with a truck)

Scenarios

  • Syncing data from on-premise

    • Use the S3 CLI sync command (it can be run repeatedly so the final sync is short; especially useful for migrations to AWS).

  • Check for personally identifiable information (PII) with Macie
