Using R in PostgreSQL
Contents
Perhaps a Simpler Alternative
Aggregate functions in PostgreSQL: https://www.postgresql.org/docs/9.0/static/functions-aggregate.html
Aggregate Functions
Table 9-44. Aggregate Functions for Statistics
Function | Argument Type | Return Type | Description =======================+==========================================================+=================================================================+============================================================================================= corr(Y, X) | double precision | double precision | correlation coefficient covar_pop(Y, X) | double precision | double precision | population covariance covar_samp(Y, X) | double precision | double precision | sample covariance regr_avgx(Y, X) | double precision | double precision | average of the independent variable (sum(X)/N) regr_avgy(Y, X) | double precision | double precision | average of the dependent variable (sum(Y)/N) regr_count(Y, X) | double precision | bigint | number of input rows in which both expressions are nonnull regr_intercept(Y, X) | double precision | double precision | y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs regr_r2(Y, X) | double precision | double precision | square of the correlation coefficient regr_slope(Y, X) | double precision | double precision | slope of the least-squares-fit linear equation determined by the (X, Y) pairs regr_sxx(Y, X) | double precision | double precision | sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable) regr_sxy(Y, X) | double precision | double precision | sum(X*Y) - sum(X) * sum(Y)/N ("sum of products" of independent times dependent variable) regr_syy(Y, X) | double precision | double precision |sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable) stddev(expression) | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| historical alias for stddev_samp stddev_pop(expression) | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| population standard deviation of the input values stddev_samp(expression)| smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| sample standard deviation of the input values variance(expression) | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| historical alias for var_samp var_pop(expression) | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| population variance of the input values (square of the population standard deviation) var_samp(expression) | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric| sample variance of the input values (square of the sample standard deviation)
Joe Conway's Documentation on pl/r
http://joeconway.com/doc/plr-US.pdf http://www.joeconway.com/plr/doc/plr-funcs.html http://www.joeconway.com/plr/doc/plr-window-funcs.html
Sample Function Code in PostgreSQL
Creating the Sample Function
Inside a database (ex) tigertest):
CREATE OR REPLACE FUNCTION r_max (integer, integer) RETURNS integer AS ’ if (arg1 > arg2) return(arg1) else return(arg2) ’ LANGUAGE ’plr’ STRICT;
Calling the Sample Function
SELECT * FROM r_max(2,3);
returns
r_max ------- 3 (1 row)
hcllayerwinv table from tigertest database
Table "public.hcllayerwinv" Column | Type | Collation | Nullable | Default ------------------+------------------------+-----------+----------+--------- place | character varying(100) | | | statecode | character varying(2) | | | year | integer | | | layer | integer | | | lcount | numeric | | | nosingleton | bigint | | | nomultiton | bigint | | | nopair | bigint | | | nohull | bigint | | | totmultitoncount | numeric | | | avgmultitoncount | numeric | | | totpaircount | numeric | | | avgpaircount | numeric | | | totpairlength | double precision | | | avgpairlength | double precision | | | minpairlength | double precision | | | maxpairlength | double precision | | | tothullcount | numeric | | | avghullcount | numeric | | | tothullarea | double precision | | | avghullarea | double precision | | | minhullarea | double precision | | | maxhullarea | double precision | | | tothulldensity | double precision | | | avghulldensity | double precision | | | seedearlyinvf | double precision | | | inflator | numeric | | | seedearlyinvl16f | double precision | | |
Regress by
--By: place, statecode, layer, --Regress: seedearlyinvl16f ON nosingleton, totmultitoncount, totpaircount, tothullcount, avgpairlength, avghullarea, avghulldensity
Installing PL/R on the dbase server
Check the version of the dbase by doing a
SELECT version();
This gives us 9.5.10
Then as root:
apt-get install postgresql-9.5-plr
Finally enable the extension in a dbase:
create extension plr;
See also: http://www.joeconway.com/presentations/plr-PGConfNYC2014.pdf
Instructions on Use
To use R from pgAdmin III, follow the instructions in this tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This allows R functions to be used in this database from then on. You should be able to run the given examples in the tutorial after this is done. This was run on databases tigertest and template1.
Another possibly useful presentation on PL/R.
How PL/R was installed on the RDP
On 2018-01-18, PL/R was installed to allow PostgreSQL users to use R functions from SQL queries. It was installed using this tutorial. The link from the tutorial for PL/R doesn't work, I used this instead.
It required R version 3.3.0 for PL/R to work with PostgreSQL 9.5. If the environmental variables R_HOME or PATH are edited so that they point to a different version of R, this might not work, and PL/R might have to be reinstalled. If the version of PostgreSQL is updated, both the R version and PL/R version will have to match it as well.