AWS Redshift Spectrum lets you use Redshift without copying the data from S3. Instead, Spectrum runs directly on the data in S3. To do this, you'll need to create 'external' tables in Redshift that refer to S3 objects. These can be queried in exactly the same way as regular Redshift tables. Through this functionality, Spectrum enables you to query newly-exported Cmd data as soon as it's in your S3 bucket. This convenience comes at a performance penalty, which can sometimes be mitigated by importing the smaller, commonly joined-with tables into Redshift. Infrequent queries over large volumes of data in S3 will be less expensive than importing into Redshift. If you need to run a lot of queries on Cmd data archived in the cloud, we recommend evaluating the pricing more closely.
The following example defines an external table containing objects with the key prefix s3://my-bucket/cmd_export/cmd-v1/CMP-XXX/PRJ-YYY/. The example assumes you are using the default 'dev' database.
Note: Replace the iam_role arn, the bucket name, and the S3 object key prefixes.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'dev'
IAM_ROLE 'arn:aws:iam::9391913838:role/my-redshift-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
CREATE EXTERNAL TABLE spectrum.s3_project_YYY_commands (
id VARCHAR(48),
created VARCHAR(48),
exec_user VARCHAR(64),
command VARCHAR(65535),
command_line VARCHAR(65535),
command_type VARCHAR(32),
exec_path VARCHAR(4096),
cwd VARCHAR(4096),
pid INT,
parent_pid INT,
origin VARCHAR(32),
parent_command_base VARCHAR(255),
top_command_base VARCHAR(255),
shell_type VARCHAR(32),
command_number INT,
interactive_shell BOOL,
executing BOOL,
line_number INT,
shell_login INT,
startup_state INT,
shell_version VARCHAR(128),
parent_id VARCHAR(48),
session_id VARCHAR(48),
server_id VARCHAR(48),
project_id VARCHAR(48),
company_id VARCHAR(48))
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/cmd_export/cmd-v1/CMP-XXX/PRJ-YYY/'
table properties ('numRows'='1720000','compression_type'='gzip');
CREATE EXTERNAL TABLE spectrum.s3_project_YYY_sessions (
id VARCHAR(48),
created VARCHAR(48),
login_user VARCHAR(64),
command VARCHAR(255),
how_connected VARCHAR(32),
interactive bool,
pid INT,
ip VARCHAR(64),
tz_std VARCHAR(128),
tz_dst VARCHAR(128),
utc_offset INT,
has_dst BOOL,
server_name VARCHAR(255),
server_id VARCHAR(48),
project_id VARCHAR(48),
company_id VARCHAR(48),
cmd_agent_version VARCHAR(48)
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/cmd_export/sess-v1/CMP-XXX/PRJ-YYY/'
table properties ('numRows'='1720000','compression_type'='gzip');
Once the schema and two external tables are defined as above, queries like the following can run:
SELECT s.ip,s.how_connected,c.* FROM spectrum.s3_project_YYY_commands c join spectrum.s3_project_YYY_sessions s on c.session_id=s.id where c.exec_path='/bin/su';
-[ RECORD 1 ]-------+-------------------------------------
ip | 10.14.87.120
how_connected | sshd
id | fdfb748b-2be2-458d-be04-0ebd3e77d7e8
created | 2018-08-22T22:05:16.614363Z
exec_user | root
command | su
command_line |
command_type | EXEC
exec_path | /bin/su
cwd | /home/peterson
pid | 3223
parent_pid | 3222
origin | shell_uec
parent_command_base | sudo
top_command_base | sudo
shell_type |
command_number | 0
interactive_shell | f
executing | f
line_number | 0
shell_login | 0
startup_state | 0
shell_version |
parent_id |
session_id | 745fe88f-da03-44df-ba14-0be538ad5be7
server_id | SVR-MG
project_id | PRJ-YYY
company_id | CMP-XXX