Access Google BigQuery Metadata From AWS Lambda

Requirements:

  • Retrieve metadata from Google BigQuery using an AWS Lambda function
  • Build a connection between AWS Lambda and Google BigQuery
  • Write Python code to get the metadata from Google BigQuery
  • Format the metadata schema

Solution

Get google-cloud-bigquery.zip: Method 1

  • Instead of building the layer on EC2 via a Session Manager session (Method 2), you can also build the zip file locally by running the following commands
pip install google-cloud-bigquery -t ./python/
zip -r google-cloud-bigquery.zip python
  • Add the zip file to the Lambda function as a layer from your local machine (a boto3 sketch follows)
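
One way to register the layer without the console is sketched below with boto3; this is a minimal, hedged example, and the layer name and runtime ("gcp-bigquery", python3.9) are assumptions rather than values from the original setup.

import boto3

lambda_client = boto3.client('lambda')

# publish the locally built zip as a new layer version
with open('google-cloud-bigquery.zip', 'rb') as zip_file:
    layer = lambda_client.publish_layer_version(
        LayerName='gcp-bigquery',               # assumed layer name
        Content={'ZipFile': zip_file.read()},
        CompatibleRuntimes=['python3.9']        # assumed runtime
    )

print(layer['LayerVersionArn'])                 # ARN to attach to the Lambda function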

Get google-cloud-bigquery.zip: Method 2

Create EC2 in AWS

  • Create an EC2 instance
  • Amazon Linux 2
  • The free tier is enough; instance type: t2.micro
  • Attach an IAM role that grants only “AmazonEC2RoleforSSM” so that Session Manager (in Systems Manager) can connect to this EC2 instance
  • Leave the other options at their defaults (a boto3 sketch of this step follows the list)
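
For reference, the same instance can be launched with boto3. This is a hedged sketch: the AMI ID and instance profile name are placeholders to replace with values from your own account.

import boto3

ec2 = boto3.client('ec2')

# launch one free-tier Amazon Linux 2 instance with an SSM-capable instance profile
ec2.run_instances(
    ImageId='ami-xxxxxxxxxxxxxxxxx',                          # an Amazon Linux 2 AMI in your region
    InstanceType='t2.micro',
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={'Name': '<my-ssm-instance-profile>'}  # assumed profile wrapping the SSM role
)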

Get AWS Access Key ID and Secret Access Key

  • Go to the “Identity and Access Management” (IAM) service
  • Click “Users” and find your user name
  • Click “Security credentials”
  • Create an access key and remember that you can only view the secret access key once (an optional verification sketch follows the list)
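
As an optional sanity check, you can confirm the new key works with boto3 and STS; the key values below are the same placeholders used later with aws configure.

import boto3

session = boto3.Session(
    aws_access_key_id='<my-access-key>',
    aws_secret_access_key='<my-secret-key>',
    region_name='ap-southeast-2'
)

# prints the ARN of the IAM user the key belongs to
print(session.client('sts').get_caller_identity()['Arn'])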

Start a Session in AWS

  • Go to the “AWS Systems Manager” service
  • Click the “Session Manager” tab
  • Click “Start a Session” and run the following commands in the session
# become the ec2-user
sudo su - ec2-user

# pip installation
sudo yum install python3 -y
curl -O https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
export PATH=$PATH:/usr/local/bin

# SDK installation & zip
pip install google-cloud-bigquery -t ./python/
zip -r google-cloud-bigquery.zip python
# added because protobuf is required
pip install protobuf --upgrade -t ./python/
zip -r google-cloud-bigquery.zip ./python/google/protobuf

# AWS CLI settings; enter the following at the prompts:
#   AWS Access Key ID [None]: <my-access-key>
#   AWS Secret Access Key [None]: <my-secret-key>
#   Default region name [None]: ap-southeast-2
#   Default output format [None]: json
aws configure

# save the zip to S3
aws s3 mb s3://<my-bucket>
aws s3 cp google-cloud-bigquery.zip s3://<my-bucket>

Check S3 Bucket in AWS

  • Go to the S3 bucket you just created from the Session Manager session
  • You should see that google-cloud-bigquery.zip has been uploaded (a boto3 check is sketched below)
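
The same check can be scripted with boto3; this is a small sketch, and the bucket name is the placeholder used earlier.

import boto3

s3 = boto3.client('s3')

# list the objects in the bucket and expect google-cloud-bigquery.zip among them
response = s3.list_objects_v2(Bucket='<my-bucket>')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])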

Other Steps

Create a Lambda function and register the created library as a Lambda layer in AWS

  • Go to the Lambda service
  • Click “Functions” and then “Create function”
  • Change the settings:
  • Timeout: 30 seconds
  • Create a layer from the zip in S3
  • Add the layer to the function: gcp-bigquery
  • You should see that the layer has been created and attached successfully (a scripted version of these steps follows the list)
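
The console steps above can also be scripted. The sketch below is a hedged example (function name, bucket name, and runtime are placeholders) that publishes the layer from S3, raises the timeout, and attaches the layer.

import boto3

lambda_client = boto3.client('lambda')

# publish the layer from the zip uploaded to S3
layer = lambda_client.publish_layer_version(
    LayerName='gcp-bigquery',
    Content={'S3Bucket': '<my-bucket>', 'S3Key': 'google-cloud-bigquery.zip'},
    CompatibleRuntimes=['python3.9']
)

# set the 30-second timeout and attach the layer to an existing function
lambda_client.update_function_configuration(
    FunctionName='<my-bq-function>',            # assumed function name
    Timeout=30,
    Layers=[layer['LayerVersionArn']]
)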

Get GCP certificate file in Google Cloud Platform

  • Log in to Google Cloud Platform
  • Go to the “IAM & Admin” service and then to “Service Accounts”
  • Click “Create Service Account” and make sure that you choose “BigQuery Admin” as the role
  • Once you have created the account, click the account and then “ADD KEY”
  • You can then download a JSON key file in the following format (a local sanity check of the key follows the example)
{
  "type": "xxx",
  "project_id": "xxx",
  "private_key_id": "xxx",
  "private_key": "xxx",
  "client_email": "xxx",
  "client_id": "xxx",
  "auth_uri": "xxx",
  "token_uri": "xxx",
  "auth_provider_x509_cert_url": "xxx",
  "client_x509_cert_url": "xxx"
}
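
Before packaging the key into the Lambda function, you can sanity-check it locally. This is a small sketch assuming the file has been saved as bq-manager.json (the name used in the next step).

from google.cloud import bigquery
from google.oauth2 import service_account

# build credentials straight from the downloaded key file
credentials = service_account.Credentials.from_service_account_file('bq-manager.json')
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# list the datasets the service account can see; an auth error here means the key is wrong
print([dataset.dataset_id for dataset in client.list_datasets()])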

Add functions and credential file in AWS Lambda Function

  • Copy the contents of the downloaded JSON key into a file named “bq-manager.json” in the Lambda package
  • Copy the following code into “lambda_function.py”
import json
from google.cloud import bigquery

def lambda_handler(event, context):
    # the client picks up the key file referenced by GOOGLE_APPLICATION_CREDENTIALS
    client = bigquery.Client()
    sql = """
    SELECT *
    FROM `<project name>.<dataset name>.<table name>`
    """

    # Run a Standard SQL query using the environment's default project
    results = client.query(sql).result()
    metadata_objects = []
    for row in results:
        # first column: id, second column: the metadata payload
        id_dic = {"id": list(row)[0]}
        other_dic = {"metadata": list(row)[1]}
        metadata_objects.append({**id_dic, **other_dic})

    return metadata_objects
  • Add “GOOGLE_APPLICATION_CREDENTIALS: bq-manager.json” as an environment variable of the function

Test Result

You should be able to get a JSON result that mirrors the table format in BigQuery.

By default the query returns a SQL table; you can reshape the result based on your code. One way to run the test from outside the console is sketched below.
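
This is a minimal boto3 sketch; the function name is an assumption.

import json

import boto3

lambda_client = boto3.client('lambda')

# invoke the function synchronously with an empty event and print the metadata it returns
response = lambda_client.invoke(FunctionName='<my-bq-function>', Payload=b'{}')
print(json.loads(response['Payload'].read()))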

Updated Solutions

AWS SSM - Parameter Store

  • Store the individual credential fields (for example client_email and private_key) in the SSM Parameter Store, like the key-value pairs of a Python dictionary (a boto3 sketch follows)
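
A hedged boto3 sketch of this step; the parameter names and values are placeholders. The plain String type matches the get_parameter calls in the updated Lambda code below, whereas SecureString would also require WithDecryption=True there.

import boto3

ssm = boto3.client('ssm')

# one parameter per credential field; store the private key with literal \n sequences,
# which the Lambda code replaces with real newlines
ssm.put_parameter(Name='/bq/client_email', Value='<client_email>', Type='String', Overwrite=True)
ssm.put_parameter(Name='/bq/private_key', Value='<private_key>', Type='String', Overwrite=True)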

AWS Lambda - Environment Variable

  • Set all the SSM parameter names and BigQuery identifiers as environment variables of the function (a boto3 sketch follows)
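
A matching boto3 sketch; the function name, parameter names, and BigQuery identifiers are placeholders, and the variable names are the ones read by the updated Lambda code below.

import boto3

boto3.client('lambda').update_function_configuration(
    FunctionName='<my-bq-function>',
    Environment={'Variables': {
        'SSM_BQ_CLIENT_EMAIL': '/bq/client_email',
        'SSM_BQ_PRIVATE_KEY': '/bq/private_key',
        'SSM_BQ_TOKEN_URI': 'https://oauth2.googleapis.com/token',
        'BQ_PROJECT_ID': '<project id>',
        'BQ_DATASET_NAME': '<dataset name>',
        'BQ_TABLE_NAME': '<table name>'
    }}
)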

Updated Lambda code

from google.cloud import bigquery
from google.oauth2 import service_account
import boto3
import json
import os

ssm = boto3.client('ssm')


class GCPConnectionError(Exception):
    """Raised when the BigQuery client cannot be created."""


def lambda_handler(event, context):

    # retrieve parameters from SSM and prepare the connection with BigQuery
    client_email_var = ssm.get_parameter(Name=os.getenv('SSM_BQ_CLIENT_EMAIL'))['Parameter']['Value']
    private_key_var = ssm.get_parameter(Name=os.getenv('SSM_BQ_PRIVATE_KEY'))['Parameter']['Value'].replace('\\n', '\n')
    token_uri_var = os.getenv('SSM_BQ_TOKEN_URI')
    credentials = service_account.Credentials.from_service_account_info(
        {
            "client_email": client_email_var,
            "private_key": private_key_var,
            "token_uri": token_uri_var
        }
    )

    project_id_var = os.getenv('BQ_PROJECT_ID')
    dataset_name_var = os.getenv('BQ_DATASET_NAME')
    table_name_var = os.getenv('BQ_TABLE_NAME')
    try:
        client = bigquery.Client(credentials=credentials, project=project_id_var)
    except Exception as error:
        raise GCPConnectionError("could not create the BigQuery client") from error

    sql = f"""
    SELECT *
    FROM `{project_id_var}.{dataset_name_var}.{table_name_var}`
    """

    # run a Standard SQL query against the configured project and
    # convert the returned metadata rows into JSON-serialisable dictionaries
    results = client.query(sql).result()
    metadata_objects = []
    for row in results:
        id_dic = {"id": list(row)[0]}
        other_dic = {"metadata": list(row)[1]}
        metadata_objects.append({**id_dic, **other_dic})
    return metadata_objects