The goal of this post is to present a common case study for building a research environment in Google Cloud Platform (GCP).
Building an environment in the cloud raises several questions we need to take into consideration: How do I access resources in the cloud? Where and how do I store data? How do I protect the infrastructure?
Let’s consider the following architecture:
- Researchers will connect to the cloud environment remotely over the internet and connect to a Linux machine with data analytics tools
- Original data sets will be stored using file storage
- Output data will be stored and processed in a MySQL database
- Due to data sensitivity, data must be protected at all times
In the following sections we will break down the research team's requirements into best practices using built-in GCP services:
Infrastructure
- For the base OS image, we will use the most up-to-date Deep Learning VM Image, which uses Debian Linux and includes the latest security patches
- After deploying the VM, we will install the latest build of our analytics tools and development interpreters (such as Python)
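As a rough sketch, the VM described above could be provisioned with the gcloud CLI. The project defaults, zone, machine type, and instance and subnet names below are illustrative assumptions, not values taken from this post:

```shell
# Create a VM from the latest Deep Learning VM image family
# (Debian-based, maintained in the deeplearning-platform-release project).
# --no-address omits a public IP, since access is via VPN only.
gcloud compute instances create research-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --subnet=dmz-subnet \
  --image-family=common-cpu \
  --image-project=deeplearning-platform-release \
  --no-address
```

After the instance is up, additional analytics tools and interpreters can be installed over SSH with the distribution's package manager (e.g. `apt-get` on Debian).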
Network connectivity
- Remote access to the cloud environment will be secured by deploying OpenVPN from the Google Cloud Marketplace and connecting with OpenVPN clients
- All resources will be located in a single Google Cloud VPC, but the Linux VM and the MySQL database will be located in separate subnets
- The Linux VM will be located in a DMZ subnet, and access to this subnet will be protected using GCP firewall rules, allowing TCP port 22 for VPN-authenticated clients only
- The database will be located in a DB subnet, and access to this subnet will be protected using GCP firewall rules, allowing access to the Cloud SQL port (TCP 3306) from the DMZ subnet only
- Further explanation about GCP firewall rules can be found here: https://cloud.google.com/vpc/docs/using-firewalls#creating_firewall_rules
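The two firewall rules above might look roughly like the following; the VPC name, network tags, and the VPN client address range are assumptions for illustration:

```shell
# Allow SSH (TCP 22) to DMZ instances, only from the VPN client range.
gcloud compute firewall-rules create allow-ssh-from-vpn \
  --network=research-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=10.8.0.0/24 \
  --target-tags=dmz

# Allow MySQL (TCP 3306) to DB-subnet instances, only from DMZ instances.
gcloud compute firewall-rules create allow-mysql-from-dmz \
  --network=research-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:3306 \
  --source-tags=dmz \
  --target-tags=db
```

Because GCP firewall rules are deny-by-default for ingress, no additional "deny" rules are needed for traffic outside these two paths.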
Database
- The MySQL database will be deployed as a managed service using Google Cloud SQL for MySQL
- The traffic between the Linux machine and the Cloud SQL database will be encrypted using TLS, as explained here: https://cloud.google.com/sql/docs/mysql/configure-ssl-instance
- Data inside the Cloud SQL database will be encrypted at rest, as explained here: https://cloud.google.com/sql/faq#encryption-manage-rest and https://cloud.google.com/security/encryption-at-rest/default-encryption/
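As a sketch of the database setup, the managed instance can be created and then patched to reject non-TLS connections; the instance name, tier, and region below are assumptions:

```shell
# Create a managed MySQL instance (encrypted at rest by default).
gcloud sql instances create research-db \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-2 \
  --region=us-central1

# Require TLS for all client connections.
gcloud sql instances patch research-db --require-ssl

# Issue a client certificate for the Linux VM to connect with.
gcloud sql ssl client-certs create research-vm-cert client-key.pem \
  --instance=research-db
```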
Storage
- Data will be stored in Google Cloud Storage, as explained here: https://cloud.google.com/storage/docs/how-to
- Access to the Google Cloud Storage will be restricted by roles from Google IAM, as explained here: https://cloud.google.com/storage/docs/access-control/iam-reference
- Data inside the Google Cloud Storage will be encrypted at rest, as explained here: https://cloud.google.com/storage/docs/encryption/customer-managed-keys
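The storage bucket, its IAM restrictions, and the customer-managed encryption key can be wired together along these lines; the bucket name, group address, and KMS key path are illustrative assumptions:

```shell
# Create a bucket for the original data sets.
gsutil mb -l us-central1 gs://research-raw-data/

# Grant read-only access to the research group via an IAM role.
gsutil iam ch group:researchers@example.com:objectViewer \
  gs://research-raw-data

# Encrypt new objects with a customer-managed Cloud KMS key.
gsutil kms encryption \
  -k projects/my-project/locations/us-central1/keyRings/research-ring/cryptoKeys/research-key \
  gs://research-raw-data
```

Note that even without the last step, Cloud Storage encrypts data at rest with Google-managed keys by default; the CMEK option simply moves key control to the project owner.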
Authentication
- SSH access to the Linux VM will be granted through a Google IAM role and an SSH key attached to the user's Google G Suite account, as explained here: https://cloud.google.com/compute/docs/instances/managing-instance-access
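A minimal sketch of this setup using OS Login, assuming an illustrative project ID and user; the IAM role binding ties SSH access to the user's Google account:

```shell
# Enable OS Login for all VMs in the project, so SSH access is
# governed by IAM rather than per-instance metadata keys.
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# Grant a researcher permission to log in to instances via SSH.
gcloud projects add-iam-policy-binding my-project \
  --member=user:researcher@example.com \
  --role=roles/compute.osLogin
```

Revoking the IAM binding then revokes SSH access centrally, without touching individual VMs.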
Auditing
- Access to all resources will be audited for further review using Stackdriver cloud audit logs, as explained here: https://cloud.google.com/logging/docs/audit/
- Alerts for suspicious activity will be raised using Google Cloud Security Command Center, as explained here: https://cloud.google.com/security-command-center/docs/
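As an example of reviewing the audit trail, admin-activity logs can be queried from the command line; the method-name filter below is an illustrative choice (here, IAM policy changes):

```shell
# Read the 10 most recent admin-activity audit log entries
# that record a change to an IAM policy.
gcloud logging read \
  'logName:"cloudaudit.googleapis.com" AND
   protoPayload.methodName:"SetIamPolicy"' \
  --limit=10 \
  --format=json
```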
Summary
In this post, I've explained how to use GCP services in order to build and maintain a secured research environment, keeping sensitive data protected while meeting the research requirements specified at the beginning of the post.
About the author
Eyal Estrin, cloud architect.