- Mon 03 December 2018
- Data Science
- #python, #dataquest, #data-engineering, #postgres
About
This is the notebook I created to solve the first mission in the data engineering path at dataquest.io. I am also experimenting with the Pelican Jupyter notebook extension, which allows loading Jupyter notebooks directly into the blog.
Postgres Mission
The problem goes something like this:
Recently, the International Hurricane Watchgroup (IHW) has been asked to update their analysis tools. Because of the increase in public awareness of hurricanes, they are required to be more diligent with the analysis of historical hurricane data they share across the organization. They have asked you, someone with experience in databases, to help work with their team to productionize their services.
Accepting the job, their team tells you that they have been having trouble sharing data across the teams and keeping it consistent. From what they've told you, it seems that their method of sharing the data with their data analysts has been to save a CSV file on their local servers and have every data analyst pull the data down. Then, each analyst uses a local SQLite engine to store the CSV, run their queries, and send their results around.
From what they have told you, you might be thinking that this is an inefficient way of sharing data. To understand what you will be working on, they have sent you a CSV file. Their CSV file contains the following fields:
- fid - ID for the row
- year - Recorded year
- month - Recorded month
- day - Recorded day
- ad_time - Recorded time in UTC
- btid - Hurricane ID
- name - Name of the hurricane
- lat - Latitude of the recorded location
- long - Longitude of the recorded location
- wind_kts - Wind speed in knots
- pressure - Atmospheric pressure of the hurricane
- cat - Hurricane category
- basin - The basin the hurricane is located in
- shape_leng - Hurricane shape length
In this Guided Project, you will be using the locally installed version of Postgres from the previous project. This is much different from previous Guided Projects, as you will be using your own notebook and your own Python environment instead of Dataquest's. Your job is to show that you can create a database that will accomplish the following requirements:
- Database for the IHW to store their tables.
- Table that contains the fields detailed in the CSV file
- User that can update, read, and insert into a table of the data.
- Insert the data into the table.
Let's do this.
Create the postgres instance
We'll be using a Docker container with a persistent postgres database. The pgadmin instance is not needed, but we can use it to have a GUI available by default at 0.0.0.0:8080. Note that this is a toy configuration, and it should not be used to store sensitive information.
We won't go into the details of this file; we're just using Docker to get a disposable database quickly. A local postgres installation would work just as well.
You can get more information about running postgres in Docker at the Docker Store.
!cat docker-compose.yml
!docker-compose -p dq_de up -d
# check if postgres docker instance is alive
!docker ps
# retrieve the ip address of the container, seen from the host
!docker inspect dq_de_db_1 | grep "IPAddress"
Note that we could also refer to our database by the port mapping declared in the docker-compose.yml file. This means that we could reach our database at 0.0.0.0:54320.
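For example, connecting through that published port rather than the container IP would look roughly like this (a sketch; it assumes the 54320 → 5432 mapping from the compose file and the default postgres/postgres credentials used throughout this notebook):
import psycopg2
# connect through the published host port instead of the container IP
# (assumes docker-compose maps the container's 5432 to port 54320 on the host)
conn = psycopg2.connect(
    dbname="postgres", user="postgres", password="postgres",
    host="localhost", port=54320,
)
print(conn.get_dsn_parameters())
conn.close()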
Create users and roles
The mission instructions state that
With a table set up, it's now time to create a user on the Postgres database that can insert, update, and read the data but not delete. This is to make sure that someone who might get a hold of this user does not issue a destructive command. Essentially, this is like creating a "data production" user whose job it is to always write new and existing data to the table.
Furthermore, even though it wasn't according to the spec, we know that the IHW team's analysts just run read queries on the data. Also, since the analysts only know SQLite queries, they may not be well-versed in a production database. As such, it might be risky handing out a general production user for them to query their data.
From what you have learned about security and restricting well-meaning users, it might be a good idea to restrict those analysts from some commands. Those commands can be anything from adding new data to the table to changing the values. You should decide what commands should be given to the analyst user.
In short, what I get from this is
- Connected as the postgres user, create a user data_prod_user that can insert, update, and read the data but not delete.
- This is like a "data production" user whose job is to always write new and existing data to the table.
Also,
- Create a GROUP role for the IHW analyst team that can only read data from tables.
- Create a user in this group, called analyst1. Picking names is hard.
Note that users and groups are all part of the same entity in postgres, the role.
import psycopg2
conn = psycopg2.connect(
dbname="postgres", user="postgres", password="postgres", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
DATA_PROD_USER_PWD = "changeme"
DATA_VIEW_USER_PWD = "changeme"
# create the users, keeping the passwords in separate variables
# (if you need to re-run this, see this hack to emulate an IF NOT EXISTS clause:
# https://stackoverflow.com/questions/8092086/create-postgresql-role-user-if-it-doesnt-exist)
cur.execute(f"""
CREATE USER data_prod_user WITH PASSWORD '{DATA_PROD_USER_PWD}';
CREATE GROUP readonly WITH NOLOGIN;
CREATE USER analyst1 WITH PASSWORD '{DATA_VIEW_USER_PWD}' IN GROUP readonly;
""")
# create db
print("CREATE DATABASE ihw OWNER postgres")
cur.execute("CREATE DATABASE ihw OWNER postgres")
conn.commit()
conn.close()
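If you want to double-check that the three roles exist before moving on, a quick query against pg_roles does the trick (a minimal sketch, reusing the same superuser connection parameters):
import psycopg2
conn = psycopg2.connect(
    dbname="postgres", user="postgres", password="postgres", host="192.168.32.3"
)
cur = conn.cursor()
# pg_roles lists every role; rolcanlogin tells users (True) apart from groups (False)
cur.execute("""
    SELECT rolname, rolcanlogin
    FROM pg_roles
    WHERE rolname IN ('data_prod_user', 'readonly', 'analyst1')
""")
print(cur.fetchall())
conn.close()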
If you want to start over, just DROP everything.
# drop the users and the database created above.
# Use this to start all over.
import psycopg2
conn = psycopg2.connect(
dbname="postgres", user="postgres", password="postgres", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
# drop users
cur.execute("""
DROP USER data_prod_user;
DROP USER analyst1;
DROP GROUP readonly;
""")
# drop db
print("DROP DATABASE IF EXISTS ihw")
cur.execute("DROP DATABASE IF EXISTS ihw")
conn.commit()
conn.close()
Preparing for data insertion
Now with our new users, we can start creating our tables. A table is owned by default by the user that created it, and it will live in the database the connection is attached to.
The steps involved in inserting the data are
- Inspect the data: this process is specific to the problem and may vary each time. I will be using pandas.
- Infer the data types that our SQL table will use from the previous analysis: this covers questions like "would an integer suffice?", "how long should these strings be?", "can this column be normalized?". It will probably require shaping the data analyzed previously.
- Create the table: this is the translation of the previous step into a proper SQL table.
- Populate the table: fill the new table with data. I will be using postgres' COPY feature by means of psycopg2's copy_expert().
Inspection of the data
Before even knowing how our table should be created, we should inspect our data. We will be using pandas for this purpose.
# load data
import pandas as pd
df = pd.read_csv("https://dq-content.s3.amazonaws.com/251/storm_data.csv")
df.head()
df.info()
df.describe()
Fortunately, we have some guidelines from dataquest of how the table should look like:
fid
is an INT and should be aPRIMARY KEY
YEAR, MONTH, DAY
andAD_TIME
should be merged into one singleTIMESTAMP
, usingTIME ZONE
.btid
looks like anINT
- We have no way of knowing how long a
name
can be, thus we will be using aVARCHAR
lat
is of typeDECIMAL(X,Y)
. We need to figure outX
andY
long
is of typeDECIMAL(X,Y)
, We also need to figure outX
andY
wind_kts
looks like anINT
pressure
looks like anINT
cat
looks like a categorical value, kinda like aVARCHAR(X)
. We need to figure outX
basin
is aVARCHAR(X)
. Same ascat
shape_length
looks like aDECIMAL(X, Y)
The next step is figuring out some variables, like
- The values of X and Y in VARCHAR and DECIMAL
- The maximum values of those columns that look like integers. With this we could use even smaller datatypes like SMALLINT and others.
We can infer most of these values through inspection of the min and max values in df.describe(). We could then produce a table like this:
CREATE TABLE hurricanes (
fid INT PRIMARY KEY,
timestamp TIMESTAMP WITH TIME ZONE,
btid SMALLINT,
name VARCHAR,
lat DECIMAL(8,6),
long DECIMAL(9,6),
wind_kts SMALLINT,
pressure SMALLINT,
cat VARCHAR(X),
basin VARCHAR(X),
shape_length DECIMAL(8, 6)
)
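The numeric ranges can be checked the same way, straight from the dataframe. A quick sketch (upper-casing the column names first so the exact casing in the raw CSV doesn't matter):
# work on a copy with normalized column names
ranges = df.rename(columns=str.upper)
# do the integer-like columns fit in a SMALLINT (-32768 to 32767)?
print(ranges[['BTID', 'WIND_KTS', 'PRESSURE']].agg(['min', 'max']))
# how many digits do we need before the decimal point?
print(ranges[['LAT', 'LONG', 'SHAPE_LENG']].abs().max())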
We can infer the values of X in VARCHAR(X) by doing something like
df['BASIN'].str.len().max()
df['CAT'].str.len().max()
This means that these columns are at most 15 and 2 characters long, respectively. Our table will end up looking like
CREATE TABLE hurricanes (
fid INT PRIMARY KEY,
timestamp TIMESTAMP WITH TIME ZONE,
btid SMALLINT,
name VARCHAR,
lat DECIMAL(8,6),
long DECIMAL(9,6),
wind_kts SMALLINT,
pressure SMALLINT,
cat VARCHAR(2),
basin VARCHAR(15),
shape_length DECIMAL(8, 6)
)
Shaping data
From here, we need to
- Clean and check for possible missing values
- Merge year, month, day and ad_time into one column named timestamp
Possible optimizations are
- Using label encoding on the categorical columns
- Using better datatypes for specific purposes, like georeferenced values
Clean missing values
# total na numbers
df.isna().sum()
df.isnull().sum()
Okay, not much to do here 😌
Merge date columns into a single one
We'll be using pd.to_datetime from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
- We'll need to convert AD_TIME from the current format to a valid datetime
df['AD_TIME'].head()
We join everything together into one column, as a plain string. This is to make the conversion easier afterwards:
df['TIMESTAMP'] = df['YEAR'].astype(str) + \
df['MONTH'].astype(str).str.zfill(2) + \
df['DAY'].astype(str).str.zfill(2) + \
df['AD_TIME']
df['TIMESTAMP'].head()
We convert that string column to a proper timestamp object, passing a format and telling pandas that the time is in UTC:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%MZ', utc=True)
df['TIMESTAMP'].head()
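A quick sanity check that the conversion produced a timezone-aware column:
# the dtype should now be timezone-aware: datetime64[ns, UTC]
print(df['TIMESTAMP'].dtype)
print(df['TIMESTAMP'].min(), df['TIMESTAMP'].max())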
We drop the other columns associated with time:
df = df.drop(['YEAR','MONTH','DAY','AD_TIME'], axis=1)
df.head()
Further optimizations
These are valid for those columns that are clearly categorical, CAT and BASIN. We could encode these columns using label encoding, which would allow us to store integers instead of long strings.
We could create another table that stores the meanings of the codes. This is more or less what's called a one-to-one relationship, and it is discussed
- here: https://stackoverflow.com/questions/12318870/when-i-should-use-one-to-one-relationship
- and here: https://www.postgresql.org/message-id/flat/BB6A4899.9385D%25soft%40bdanube.com
The pros and cons of this approach are
Pros
- It's more efficient to store and query integer values
- We could create auxiliary tables that map the labels to their meanings, thus allowing for more atomic queries
- This would allow for good normalization practices that will help in scaling our database (and keeping it sane to maintain)
Cons
- Integers lose any semantic meaning without a conversion table
- Adding semantic meaning from an auxiliary table requires a JOIN operation, which is costly
- It requires a bit more effort
- It could happen that the column is not categorical at all (e.g. more distinct values for BASIN show up in the future)
We will not use this approach since it defeats the purpose of the exercise, and in this case it is overkill: each category neither has nor needs a full model of its own, since it would consist of a single attribute. We'll keep it simple.
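Just to illustrate what we are skipping, a label-encoded BASIN plus its lookup table could be built with pandas along these lines. This is a sketch only, we won't run it, and the basin_code / basin_lookup names are made up for the example:
# hypothetical sketch: label-encode BASIN and keep a lookup table for the codes
basin_cat = df['BASIN'].astype('category')
# the lookup table, which would become its own small SQL table
basin_lookup = pd.DataFrame({
    'basin_code': range(len(basin_cat.cat.categories)),
    'basin': basin_cat.cat.categories,
})
# the hurricanes table would then store the integer code instead of the string:
# df['BASIN_CODE'] = basin_cat.cat.codes
print(basin_lookup)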
This Stackoverflow question deals with normalization on one-to-one relationships:
If it fits within the rules of normalization, then 1:1 relationships can be normalized (by definition!) - In other words, there is nothing about 1:1 relationships that make it impossible for them to obey the normal forms.
To answer your question about the practicality of 1:1 relationships, there are times when this is a perfectly useful construct, such as when you have subtypes with distinct predicates (columns).
The reasons you would use 1:1 relationships depend on your point of view. DBAs tend to think of everything as being a performance decision. Data modelers and programmers tend to think of these decisions as being design or model oriented. In fact, there is a lot of overlap between these points of view. It depends on what your perspectives and priorities are. Here are some examples of motivations for 1:1 relationships:
You have some subset of columns that are very wide and you want to segregate them physically in your storage for performance reasons.
You have some subset of columns that are not read or updated frequently and you want to keep them apart from the frequently used columns for performance reasons.
You have some columns that are optional in general but they are mandatory when you know that the record is of a certain type.
You have some columns that logically belong together for a subtype and you want to model them to fit well with your code's object model.
You have some columns that can only apply to some subtype(s) of an entity super-type, and you want your schema to enforce the absence of this data for other subtypes.
You have some columns that belong to an entity but you need to protect these particular columns using more restrictive access rules (e.g. salary on an employee table).
So you can see, sometimes the driver is performance, sometimes it is model purity, or just a desire to take full advantage of declarative schema rules.
We could even push this further and say that the BASIN column is really a boolean, since it has at most two values. That would allow us to rewrite the column as something like IS_BASIN_EASTERN_PACIFIC, and we could say that wherever this is False, the basin is North Atlantic.
But this is never a good approach, since it encodes the information implicitly (thus hidden, and we don't want hidden stuff!), plus it will break the second a new category shows up.
df['BASIN'].astype('category').cat.codes.unique()
df['CAT'].astype('category').cat.codes.unique()
df.columns = df.columns.str.upper()
df.head()
We look back at our table, and we rearrange the columns in the dataframe to match this table
fid INT PRIMARY KEY,
timestamp TIMESTAMP WITH TIME ZONE,
btid SMALLINT,
name VARCHAR,
lat DECIMAL(8,6),
long DECIMAL(9,6),
wind_kts SMALLINT,
pressure SMALLINT,
cat VARCHAR(2),
basin VARCHAR(16),
shape_length DECIMAL(8, 6)
# we rearrange our columns and set FID as the dataframe's index
df_csv = df[['FID', 'TIMESTAMP', 'BTID', 'NAME', 'LAT', 'LONG', 'WIND_KTS', 'PRESSURE', 'CAT', 'BASIN', 'SHAPE_LENG']].copy()
df_csv = df_csv.set_index('FID', drop=True)
df_csv.head()
We'll save our amended CSV somewhere else, to be used later:
df_csv.to_csv(path_or_buf='data/cleaned_files.csv', sep=';')
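A quick peek at the file we just wrote confirms the delimiter and the column order match what the table will expect:
# print the header plus the first couple of data rows of the exported file
with open('data/cleaned_files.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())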
Creating the table with SQL
The next step is actually creating the table:
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="data_prod_user", password="changeme", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
# year, month, day, ad_time will be stored as timestamptz
print("DROP TABLE IF EXISTS hurricanes")
cur.execute("DROP TABLE IF EXISTS hurricanes")
CREATE_TABLE_SQL ="""
CREATE TABLE hurricanes (
fid INT PRIMARY KEY,
timestamp TIMESTAMP WITH TIME ZONE,
btid SMALLINT,
name VARCHAR,
lat DECIMAL(8,6),
long DECIMAL(9,6),
wind_kts SMALLINT,
pressure SMALLINT,
cat VARCHAR(2),
basin VARCHAR(16),
shape_length DECIMAL(8, 6)
)
"""
print(CREATE_TABLE_SQL)
cur.execute(
CREATE_TABLE_SQL
)
conn.commit()
print("Done.")
conn.close()
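Before loading any data, we can ask the database how it understood our column definitions. A small sketch using information_schema:
import psycopg2
conn = psycopg2.connect(
    dbname="ihw", user="data_prod_user", password="changeme", host="192.168.32.3"
)
cur = conn.cursor()
# information_schema.columns describes every column of the new table
cur.execute("""
    SELECT column_name, data_type, character_maximum_length, numeric_precision, numeric_scale
    FROM information_schema.columns
    WHERE table_name = 'hurricanes'
    ORDER BY ordinal_position
""")
for row in cur.fetchall():
    print(row)
conn.close()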
Insert the data into the database
Assuming we have the CSV available on disk, we can load it into the table using psycopg2's copy_expert():
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="data_prod_user", password="changeme", host="192.168.32.3")
conn.autocommit = True
cur = conn.cursor()
with open('data/cleaned_files.csv', 'r') as f:
cur.copy_expert("COPY hurricanes FROM STDIN WITH CSV HEADER DELIMITER ';'", f)
conn.close()
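A row count is the quickest way to check that the COPY actually loaded something (the number should match the data rows in cleaned_files.csv):
import psycopg2
conn = psycopg2.connect(
    dbname="ihw", user="data_prod_user", password="changeme", host="192.168.32.3"
)
cur = conn.cursor()
# how many rows made it into the table?
cur.execute("SELECT COUNT(*) FROM hurricanes")
print(cur.fetchone()[0])
conn.close()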
Grant specific privileges between users and tables
With our users and our table ready, we need to restrict the privileges some users have. This is so we can avoid unfortunate events like undesired DELETE or DROP instructions. Accidents happen.
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="postgres", password="postgres", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
# revoke all privileges and add back specific ones
# on user data_prod_user
cur.execute("REVOKE ALL ON hurricanes FROM data_prod_user")
cur.execute("GRANT SELECT, INSERT, UPDATE ON hurricanes TO data_prod_user")
# on group readonly
cur.execute("REVOKE ALL ON hurricanes FROM readonly")
cur.execute("GRANT SELECT ON hurricanes TO readonly")
conn.commit()
print("Done.")
conn.close()
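We can also ask Postgres directly whether each role holds the privileges we expect, using the has_table_privilege function. A small sketch, run as the superuser:
import psycopg2
conn = psycopg2.connect(
    dbname="ihw", user="postgres", password="postgres", host="192.168.32.3"
)
cur = conn.cursor()
privileges = ['SELECT', 'INSERT', 'UPDATE', 'DELETE']
for role in ['data_prod_user', 'readonly', 'analyst1']:
    # has_table_privilege also accounts for privileges inherited through group membership
    cur.execute(
        "SELECT " + ", ".join(f"has_table_privilege('{role}', 'hurricanes', '{p}')" for p in privileges)
    )
    print(role, dict(zip(privileges, cur.fetchone())))
conn.close()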
Verify permissions and roles
As postgres user
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="postgres", password="postgres", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
cur.execute('SELECT * FROM hurricanes LIMIT 1')
print(cur.fetchall())
conn.close()
As data_prod_user
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="data_prod_user", password="changeme", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
cur.execute('SELECT * FROM hurricanes LIMIT 1')
cur.fetchall()
# we should be able to insert data
cur.execute('INSERT INTO hurricanes (fid) values (1000001)')
conn.commit()
# we can read this new row from the table
cur.execute('SELECT * FROM hurricanes WHERE fid = 1000001')
cur.fetchall()
# but we should not be able to delete data
cur.execute('DELETE FROM hurricanes WHERE fid=1000000')
conn.commit()
cur.execute("UPDATE hurricanes SET name = 'fakename' WHERE fid=1000001")
conn.commit()
# we read the updated changes back
cur.execute('SELECT * FROM hurricanes WHERE fid = 1000001')
cur.fetchall()
conn.close()
Checking permissions as analyst1
import psycopg2
conn = psycopg2.connect(
dbname="ihw", user="analyst1", password="changeme", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
cur.execute('SELECT * FROM hurricanes LIMIT 1')
cur.fetchall()
# we should NOT be able to insert data
cur.execute('INSERT INTO hurricanes (fid) values (1000001)')
conn.commit()
cur.execute("UPDATE hurricanes SET name = 'fakename' WHERE fid=1000001")
conn.commit()
# but we should not be able to delete data
cur.execute('DELETE FROM hurricanes WHERE fid=1000000')
conn.commit()
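The INSERT, UPDATE and DELETE above are expected to fail for analyst1 (and the DELETE for data_prod_user). In a notebook the traceback is fine, but in a script you would catch the error explicitly; a minimal sketch (psycopg2 >= 2.8 exposes the server error classes in psycopg2.errors):
import psycopg2
from psycopg2 import errors
conn = psycopg2.connect(
    dbname="ihw", user="analyst1", password="changeme", host="192.168.32.3"
)
conn.autocommit = True
cur = conn.cursor()
try:
    cur.execute('DELETE FROM hurricanes WHERE fid=1000000')
except errors.InsufficientPrivilege as exc:
    # exactly what we want: analyst1 cannot delete rows
    print("DELETE rejected as expected:", exc)
conn.close()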
Final words
There's not much more we can say on top of what has been shown with code. We can achieve almost anything with the psycopg2 driver, which is the most used Python driver for postgres databases (their words, not mine).
Moving forward
We can add another abstraction layer on top of the driver, namely an ORM or object-relational mapper. I use SQLAlchemy in my daily tasks and it has proven useful in many ways, but that's something for another topic.