SQL- A guide to keep that programming sharp, test new ideas and improve flow, or to reduce your learning curve!
On MacOS, we cover installing an excellent Open Source IDE for SQL DBeaver Community, Open Source Database Software DuckDB, load open source data over https into our database, and end with SQL questions that have a statistical flavor on penguins. DuckDB removes the overhead of server and treats a file as a table.
Install DBeaver Community SQL IDE
Start a terminal window and run:
brew install --cask dbeaver-community
Note: Control-Shift-F auto formats the SQL.
Install DuckDB
Also on the command line, run the following after changing to the correct path that is applicable to your machine, i.e. /Users/chris/.
brew install duckd
# DuckDB does not place any directories in the $HOME directory
mkdir /Users/chris/.duckdb
# To create a persistent database
# use .exit to quit
duckdb /Users/chris/.duckdb/bokeh.duckdb
DuckDB (for a “serverless” Database)
Import data about penguins types, Adélie1, Chinstrap2, and Gentoo 3.
- Open DBeaver
- Click the plug to add a new Database connection
- Select DuckDB, note: if prompted download driver.
- Select Next
- Enter a file name or path + filename,
- This will become the name of your database file. Enter the same file path as entered in the install DuckDB section.
- Select Finish and START CODING!
Import Data
/*
Import data
source: [Environmental Data Initiative](https://portal.edirepository.org/nis/home.jsp)
--note: in duckdb multi statements, each statement needs to end in a ;
*/
--DROP TABLE IF EXISTS main.penguins;
CREATE TABLE main.penguins AS
(
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/3/002f3893385f710df69eeebe893144ff")
UNION BY NAME
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381")
UNION BY NAME
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462")
);
The penguins table should now have 344 records across 3 species.
Check for invariants
SELECT
count(*) AS total_cnts
, SUM(CASE
WHEN sex = 'FEMALE' THEN 1 ELSE 0 END) AS female_ind
, SUM(CASE
WHEN sex = 'MALE' THEN 1 ELSE 0 END) AS male_ind
FROM MAIN.PENGUINS
total_cnts | female_ind | male_ind |
---|---|---|
344 | 165 | 168 |
Looks like the number of female penguins is approximately the same number as the male penguins, which supports the assumptions that samples should be random and is an invariant metric check for this experiment. The total counts also show the table loaded with 344 rows.
Species and Flipper Sizes
DuckDB Functions I have not seen functions for kurtosis or skewness; however, these functions exist in Duck. As a reminder skewness is the measure of asymmetry of the distribution around its mean. Kurtosis is the measure of the tailedness of the distribution. Window Functions
What is the average flipper length by Species and the standard deviation of the flippers? Which Species do you think has a statistically different Flipper size than another?
SELECT
Species
, AVG("Flipper Length (mm)") AS avg_flipper_length_mm
, count(*) AS cnts
, stddev_pop("Flipper Length (mm)") AS stdev_pop
, stddev_samp("Flipper Length (mm)") AS stdev_samp
FROM
MAIN.PENGUINS
GROUP BY
Species
Species | avg_flipper_length_mm | cnts | stdev_pop | stdev_samp | skew | kurtosis |
---|---|---|---|---|---|---|
Gentoo penguin (Pygoscelis papua) | 217.2 | 124 | 6.46 | 6.48 | 0.39 | -0.58 |
Adelie Penguin (Pygoscelis adeliae) | 190.0 | 152 | 6.52 | 6.54 | 0.09 | 0.33 |
Chinstrap penguin (Pygoscelis antarctica) | 195.8 | 68 | 7.08 | 7.13 | -0.01 | 0.05 |
Rank
See what happens with a rank when there is a tie for body mass, more than 3x3 records are returned. Then see the impact of changing to row number and dense rank.
WITH cte_rank AS (
SELECT
Species
, studyName
, "Sample Number"
, "Body Mass (g)"
, rank() OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins
)
SELECT
*
FROM cte_rank
WHERE RANK <= 3
Species | studyName | Sample Number | Body Mass (g) | rank |
---|---|---|---|---|
Gentoo penguin (Pygoscelis papua) | PAL0809 | 41 | 3950 | 1 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 27 | 4100 | 2 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 17 | 4150 | 3 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 39 | 2700 | 1 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0708 | 23 | 2900 | 2 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 31 | 3200 | 3 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 59 | 2850 | 1 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 65 | 2850 | 1 |
Adelie Penguin (Pygoscelis adeliae) | PAL0910 | 117 | 2900 | 3 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 55 | 2900 | 3 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 99 | 2900 | 3 |
Row number always returns 9 records when filtered to less than or equal to 3. Depending on the table sort the row numbering for an associated sample number could be different. To have consistent results, sorting by a guaranteed column like sample number would be good practice.
WITH cte_rank AS (
SELECT
Species
, studyName
, "Sample Number"
, "Body Mass (g)"
, ROW_NUMBER () OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins
)
SELECT
*
FROM cte_rank
WHERE RANK <= 3
Species | studyName | Sample Number | Body Mass (g) | rank |
---|---|---|---|---|
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 59 | 2850 | 1 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 65 | 2850 | 2 |
Adelie Penguin (Pygoscelis adeliae) | PAL0910 | 117 | 2900 | 3 |
Gentoo penguin (Pygoscelis papua) | PAL0809 | 41 | 3950 | 1 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 27 | 4100 | 2 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 17 | 4150 | 3 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 39 | 2700 | 1 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0708 | 23 | 2900 | 2 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 31 | 3200 | 3 |
Dense rank provides continuous values, ex. no skipping 2 to get to 3, though multiple values of the same rank are allowed. Therefore, even more records are returned in ties when using dense rank.
WITH cte_rank AS (
SELECT
Species
, studyName
, "Sample Number"
, "Body Mass (g)"
, DENSE_RANK () OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins
)
SELECT
*
FROM cte_rank
WHERE RANK <= 3
Species | studyName | Sample Number | Body Mass (g) | rank |
---|---|---|---|---|
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 39 | 2700 | 1 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0708 | 23 | 2900 | 2 |
Chinstrap penguin (Pygoscelis antarctica) | PAL0809 | 31 | 3200 | 3 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 59 | 2850 | 1 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 65 | 2850 | 1 |
Adelie Penguin (Pygoscelis adeliae) | PAL0910 | 117 | 2900 | 2 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 55 | 2900 | 2 |
Adelie Penguin (Pygoscelis adeliae) | PAL0809 | 99 | 2900 | 2 |
Adelie Penguin (Pygoscelis adeliae) | PAL0910 | 105 | 2925 | 3 |
Gentoo penguin (Pygoscelis papua) | PAL0809 | 41 | 3950 | 1 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 27 | 4100 | 2 |
Gentoo penguin (Pygoscelis papua) | PAL0708 | 17 | 4150 | 3 |
Reminders
- Upper Case key words
- Use Indentation and separate columns on individual lines, one thought one line!
- In case you divide by zero, put a NULLIF in the denominator.
-
Palmer Station Antarctica LTER and K. Gorman. 2017. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 3. Environmental Data Initiative. https://doi.org/10.6073/pasta/abc50eed9138b75f54eaada0841b9b86 (Accessed 2024-04-29). ↩
-
Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguins (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 8. Environmental Data Initiative. https://doi.org/10.6073/pasta/ce9b4713bb8c065a8fcfd7f50bf30dde (Accessed 2024-04-28). ↩
-
Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female gentoo penguins (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 7. Environmental Data Initiative. https://doi.org/10.6073/pasta/9fc8f9b5a2fa28bdca96516649b6599b (Accessed 2024-04-29). ↩