SQL- A guide to keep that programming sharp, test new ideas and improve flow, or to reduce your learning curve!

On MacOS, we cover installing an excellent Open Source IDE for SQL DBeaver Community, Open Source Database Software DuckDB, load open source data over https into our database, and end with SQL questions that have a statistical flavor on penguins. DuckDB removes the overhead of server and treats a file as a table.

Install DBeaver Community SQL IDE

Start a terminal window and run:

    brew install --cask dbeaver-community

Note: Control-Shift-F auto formats the SQL.

Install DuckDB

Also on the command line, run the following after changing to the correct path that is applicable to your machine, i.e. /Users/chris/.

    brew install duckd
    # DuckDB does not place any directories in the $HOME directory
    mkdir /Users/chris/.duckdb 

    # To create a persistent database
    # use .exit to quit
    duckdb /Users/chris/.duckdb/bokeh.duckdb

DuckDB (for a “serverless” Database)

Import data about penguins types, Adélie1, Chinstrap2, and Gentoo 3.

  1. Open DBeaver
  2. Click the plug to add a new Database connection
  3. Select DuckDB, note: if prompted download driver.
  4. Select Next
  5. Enter a file name or path + filename,
    1. This will become the name of your database file. Enter the same file path as entered in the install DuckDB section.
  6. Select Finish and START CODING!

Import Data

 /*
 Import data
 source: [Environmental Data Initiative](https://portal.edirepository.org/nis/home.jsp)
 --note: in duckdb multi statements, each statement needs to end in a ;
 */
 --DROP TABLE IF EXISTS main.penguins;
CREATE TABLE main.penguins AS 
(
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/3/002f3893385f710df69eeebe893144ff")
UNION BY NAME
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381")
UNION BY NAME
SELECT * FROM READ_CSV("https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462")
);

The penguins table should now have 344 records across 3 species.

Check for invariants

SELECT
	count(*) AS total_cnts
	, SUM(CASE 
	WHEN sex = 'FEMALE' THEN 1 ELSE 0 END) AS female_ind
	, SUM(CASE 
	WHEN sex = 'MALE' THEN 1 ELSE 0 END) AS male_ind
FROM MAIN.PENGUINS 
total_cnts female_ind male_ind
344 165 168

Looks like the number of female penguins is approximately the same number as the male penguins, which supports the assumptions that samples should be random and is an invariant metric check for this experiment. The total counts also show the table loaded with 344 rows.

Species and Flipper Sizes

DuckDB Functions I have not seen functions for kurtosis or skewness; however, these functions exist in Duck. As a reminder skewness is the measure of asymmetry of the distribution around its mean. Kurtosis is the measure of the tailedness of the distribution. Window Functions

What is the average flipper length by Species and the standard deviation of the flippers? Which Species do you think has a statistically different Flipper size than another?

SELECT
Species
, AVG("Flipper Length (mm)") AS avg_flipper_length_mm
, count(*) AS cnts
, stddev_pop("Flipper Length (mm)") AS stdev_pop
, stddev_samp("Flipper Length (mm)") AS stdev_samp
FROM
MAIN.PENGUINS
GROUP BY
Species
Species avg_flipper_length_mm cnts stdev_pop stdev_samp skew kurtosis
Gentoo penguin (Pygoscelis papua) 217.2 124 6.46 6.48 0.39 -0.58
Adelie Penguin (Pygoscelis adeliae) 190.0 152 6.52 6.54 0.09 0.33
Chinstrap penguin (Pygoscelis antarctica) 195.8 68 7.08 7.13 -0.01 0.05

Rank

See what happens with a rank when there is a tie for body mass, more than 3x3 records are returned. Then see the impact of changing to row number and dense rank.

WITH cte_rank AS ( 
SELECT 
Species 
, studyName 
, "Sample Number" 
, "Body Mass (g)"
, rank() OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins 
)
SELECT 
*
FROM cte_rank 
WHERE RANK <= 3
Species studyName Sample Number Body Mass (g) rank
Gentoo penguin (Pygoscelis papua) PAL0809 41 3950 1
Gentoo penguin (Pygoscelis papua) PAL0708 27 4100 2
Gentoo penguin (Pygoscelis papua) PAL0708 17 4150 3
Chinstrap penguin (Pygoscelis antarctica) PAL0809 39 2700 1
Chinstrap penguin (Pygoscelis antarctica) PAL0708 23 2900 2
Chinstrap penguin (Pygoscelis antarctica) PAL0809 31 3200 3
Adelie Penguin (Pygoscelis adeliae) PAL0809 59 2850 1
Adelie Penguin (Pygoscelis adeliae) PAL0809 65 2850 1
Adelie Penguin (Pygoscelis adeliae) PAL0910 117 2900 3
Adelie Penguin (Pygoscelis adeliae) PAL0809 55 2900 3
Adelie Penguin (Pygoscelis adeliae) PAL0809 99 2900 3

Row number always returns 9 records when filtered to less than or equal to 3. Depending on the table sort the row numbering for an associated sample number could be different. To have consistent results, sorting by a guaranteed column like sample number would be good practice.

WITH cte_rank AS ( 
SELECT 
Species 
, studyName 
, "Sample Number" 
, "Body Mass (g)"
, ROW_NUMBER () OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins 
)
SELECT 
*
FROM cte_rank 
WHERE RANK <= 3
Species studyName Sample Number Body Mass (g) rank
Adelie Penguin (Pygoscelis adeliae) PAL0809 59 2850 1
Adelie Penguin (Pygoscelis adeliae) PAL0809 65 2850 2
Adelie Penguin (Pygoscelis adeliae) PAL0910 117 2900 3
Gentoo penguin (Pygoscelis papua) PAL0809 41 3950 1
Gentoo penguin (Pygoscelis papua) PAL0708 27 4100 2
Gentoo penguin (Pygoscelis papua) PAL0708 17 4150 3
Chinstrap penguin (Pygoscelis antarctica) PAL0809 39 2700 1
Chinstrap penguin (Pygoscelis antarctica) PAL0708 23 2900 2
Chinstrap penguin (Pygoscelis antarctica) PAL0809 31 3200 3

Dense rank provides continuous values, ex. no skipping 2 to get to 3, though multiple values of the same rank are allowed. Therefore, even more records are returned in ties when using dense rank.

WITH cte_rank AS ( 
SELECT 
Species 
, studyName 
, "Sample Number" 
, "Body Mass (g)"
, DENSE_RANK  () OVER (PARTITION BY Species ORDER BY "Body Mass (g)") AS rank
FROM main.penguins 
)
SELECT 
*
FROM cte_rank 
WHERE RANK <= 3  

Species studyName Sample Number Body Mass (g) rank
Chinstrap penguin (Pygoscelis antarctica) PAL0809 39 2700 1
Chinstrap penguin (Pygoscelis antarctica) PAL0708 23 2900 2
Chinstrap penguin (Pygoscelis antarctica) PAL0809 31 3200 3
Adelie Penguin (Pygoscelis adeliae) PAL0809 59 2850 1
Adelie Penguin (Pygoscelis adeliae) PAL0809 65 2850 1
Adelie Penguin (Pygoscelis adeliae) PAL0910 117 2900 2
Adelie Penguin (Pygoscelis adeliae) PAL0809 55 2900 2
Adelie Penguin (Pygoscelis adeliae) PAL0809 99 2900 2
Adelie Penguin (Pygoscelis adeliae) PAL0910 105 2925 3
Gentoo penguin (Pygoscelis papua) PAL0809 41 3950 1
Gentoo penguin (Pygoscelis papua) PAL0708 27 4100 2
Gentoo penguin (Pygoscelis papua) PAL0708 17 4150 3

Reminders

  1. Upper Case key words
  2. Use Indentation and separate columns on individual lines, one thought one line!
  3. In case you divide by zero, put a NULLIF in the denominator.
  1. Palmer Station Antarctica LTER and K. Gorman. 2017. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 3. Environmental Data Initiative. https://doi.org/10.6073/pasta/abc50eed9138b75f54eaada0841b9b86 (Accessed 2024-04-29). 

  2. Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguins (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 8. Environmental Data Initiative. https://doi.org/10.6073/pasta/ce9b4713bb8c065a8fcfd7f50bf30dde (Accessed 2024-04-28). 

  3. Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female gentoo penguins (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 7. Environmental Data Initiative. https://doi.org/10.6073/pasta/9fc8f9b5a2fa28bdca96516649b6599b (Accessed 2024-04-29). 

Updated: