From SAS to R - proc sort nodupkey

I'm translating a SAS script to R, but I don't know how SAS works.
I have this piece of code:
proc sort data=table
(keep= Field1 Field2 Field3 Field4 Field5)
out=table_nodup nodupkey;
by Field1 Field2 Field4;
run;
I don't know what this code does, so I don't know how to translate it to R. Any help? :)

According to this paper, I'd say it can be described with dplyr as follows:
library(dplyr)
table_nodup <- table %>%
  select(Field1, Field2, Field3, Field4, Field5) %>%  # SAS keep=
  group_by(Field1, Field2, Field4) %>%                # SAS by variables
  slice(1) %>%                                        # first row per key (nodupkey)
  ungroup()
select corresponds to SAS's keep, and nodupkey can be translated as grouping by the BY variables and taking the first occurrence in each group. A good thing is that slice returns a data frame that is already sorted by the grouping variables, so arrange is not needed.

Given the data frame table:
table <- table[, c("Field1", "Field2", "Field3", "Field4", "Field5")]  # keep specific columns
table <- table[with(table, order(Field1, Field2, Field4)), ]  # sort by the three key columns
table_nodup <- table[!duplicated(table[, c("Field1", "Field2", "Field4")]), ]  # keep the first row per key
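To see what nodupkey does on concrete data, here is a small base R sketch. The toy values and the Extra column are made up for illustration; only the field names come from the question. It assumes SAS keeps the first row of each key after sorting, which is how the answers above describe it.

```r
# Toy input; the values and the Extra column are invented for illustration
table <- data.frame(
  Field1 = c(2, 1, 1, 2, 1),
  Field2 = c("b", "a", "a", "b", "a"),
  Field3 = c(10, 20, 30, 40, 50),
  Field4 = c("x", "y", "y", "x", "z"),
  Field5 = 1:5,
  Extra  = letters[1:5],
  stringsAsFactors = FALSE
)

keys <- c("Field1", "Field2", "Field4")

# keep=: retain only the listed columns (Extra is dropped)
kept <- table[, c("Field1", "Field2", "Field3", "Field4", "Field5")]

# by: sort on the key columns; tied rows stay in their original order
sorted <- kept[order(kept$Field1, kept$Field2, kept$Field4), ]

# nodupkey: keep only the first row for each key combination
table_nodup <- sorted[!duplicated(sorted[, keys]), ]
print(table_nodup)
```

Note that rows are compared on the three key columns only, so two rows with the same key but different Field3 or Field5 still count as duplicates.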

Related

R prod() function but in SQL, how to take product of a group by

In R, taking the product of a group by can be undertaken like:
iris %>% group_by(Species) %>% summarize(new_col = prod(Petal.Length))
How can I achieve that same concept in either postgresql or dbplyr/dplyr?
Unfortunately, the SQL standard does not define an aggregate "product" function. You can, however, work around this with arithmetic.
Say that you want to compute the product of petal_length in groups of rows sharing the same species in table mytable:
select species, exp(sum(ln(petal_length))) petal_length_product
from mytable
group by species
This works as long as all values of petal_length are greater than 0.
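The identity behind that query, prod(x) = exp(sum(log(x))) for strictly positive x, can be checked locally in R before relying on it in SQL. This is only a sketch on the built-in iris data, using base R's aggregate() instead of a database connection.

```r
# Per-species product computed directly
by_prod <- aggregate(Petal.Length ~ Species, data = iris, FUN = prod)

# Same product computed as exp(sum(log(x))), mirroring the SQL workaround
by_explog <- aggregate(Petal.Length ~ Species, data = iris,
                       FUN = function(x) exp(sum(log(x))))

# The two agree up to floating point error
print(all.equal(by_prod$Petal.Length, by_explog$Petal.Length))
```

Because the result goes through floating point logarithms, compare with all.equal() rather than exact equality.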

tail() equivalent using dbplyr? (i.e. return last x rows of database table)

Suppose using dbplyr, we have something like
library(dbplyr)
sometable %>%
head()
then we see the first 6 rows.
But if we try this we see an error
sometable %>%
tail()
# Error: tail() is not supported by sql sources
which is expected behaviour of dbplyr:
Because you can’t find the last few rows without executing the whole query, you can’t use tail().
Question: how do we do the tail() equivalent in this situation?
In general, the row order of a SQL query result should never be assumed, as the DBMS may store the data in an order that is ideal for indexing or other reasons, and not in the order you want. Because of that, a common "best practice" for SQL queries is to either (a) assume the data is unordered (and perhaps that the order may change, though I've not seen this in practice); or (b) force ordering in the query.
From this, consider arranging your data in descending order and using head().
For instance, if I have a table MyTable with a numeric field MyNumber, then
library(dplyr)
library(dbplyr)
tb <- tbl(con, "MyTable")
tb %>%
arrange(MyNumber) %>%
tail() %>%
sql_render()
# Error: tail() is not supported by sql sources
tb %>%
arrange(MyNumber) %>%
head() %>%
sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber"
tb %>%
arrange(desc(MyNumber)) %>%
head() %>%
sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber" DESC
(This is demonstrated on a SQL Server connection, but the premise should work just as well for other DBMS types; they will just shift from SELECT TOP(6) ... to SELECT ... LIMIT 6 or similar.)
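The same idea can be illustrated on a local data frame, without a database. In this sketch the toy values are made up, and base R's order() stands in for arrange(). Taking head() of the descending sort and then flipping the result back gives the same rows as tail() of the ascending sort.

```r
# Toy data; MyNumber values are invented for illustration
df <- data.frame(MyNumber = c(5, 3, 9, 1, 7, 2, 8, 6, 4, 10))

# "tail" emulated as: sort descending, take the first 3, restore ascending order
last3_desc <- head(df[order(-df$MyNumber), , drop = FALSE], 3)
last3 <- last3_desc[order(last3_desc$MyNumber), , drop = FALSE]

# What tail() gives on the ascending sort, for comparison
expected <- tail(df[order(df$MyNumber), , drop = FALSE], 3)

print(last3$MyNumber)
print(expected$MyNumber)
```

On a database table the final re-sort would happen after collecting the few rows, so only the head() query runs on the server.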

Order a dataframe primarily by the count of times a value appears in a column and secondarily by a second column

I have the data frame below, which I would like to order primarily by the number of times a value appears in the first column (bigger counts first) and secondarily alphabetically (A-Z) by the second column.
Name<-c("jack","jack","bob","david","mary")
Surname<-c("arf","dfg","hjk","dfgg","bn")
n1<-data.frame(Name, Surname)
It should be something like:
n1<-n1[
order( n1[,1], n1[,2] ),
]
but I do not know how to order numerically based on count of values.
As suggested by @thelatemail, you can do this in base R using:
n1[order(-table(n1$Name)[n1$Name], n1$Surname), ]
To sort by surname first, swap the arguments to order() around.
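The one-liner above packs several steps together. Unpacked on the question's own data, it might look like the following sketch; the intermediate names counts, per_row and sorted are only for illustration.

```r
Name <- c("jack", "jack", "bob", "david", "mary")
Surname <- c("arf", "dfg", "hjk", "dfgg", "bn")
n1 <- data.frame(Name, Surname, stringsAsFactors = FALSE)

# How many times each distinct name occurs
counts <- table(n1$Name)

# Look up that count for every row, as a plain vector
per_row <- as.vector(counts[n1$Name])

# Highest count first (hence the minus sign), then Surname A-Z
sorted <- n1[order(-per_row, n1$Surname), ]
print(sorted)
```

The two jack rows come first because jack appears twice; the remaining rows all have a count of one and are ordered by Surname alone.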
Using sqldf, like the following:
library(sqldf)
n2 <- sqldf('SELECT n1.* FROM
  n1 JOIN (SELECT Name, COUNT(*) AS C FROM n1 GROUP BY Name) AS T
  ON n1.Name = T.Name
  ORDER BY C DESC, Surname')
This first groups the names to count them, then sorts by the count in descending order and by Surname alphabetically.
Using dplyr, like the following:
library(dplyr)
n1 %>%
as_tibble() %>%
count(Name) %>%
right_join(n1, by = "Name") %>% # join the count back to your original frame
arrange(desc(n), Surname) %>% # order by highest count first, then Surname
select(-n) # just return the original frame, with the sorting you asked for

How to get distinct values of a column in rquery?

Exploring the rquery package of John Mount, Win-Vector LLC, is there a way that I could get the distinct values of a column from a SQL table using the rquery package functions? (WITHOUT writing the appropriate SQL query but using the rquery functions since I need to use my code in Oracle, MSSQL and Postgres).
So I do not need:
rq_get_query(db, "SELECT DISTINCT (COL1) FROM TABLE1")
but I am looking for something similar to unique of base R.
I would use the sqldf package. It is very accessible, and I think you would benefit from it.
install.packages("sqldf")
library(sqldf)
df = sqldf("SELECT DISTINCT COL1 FROM TABLE1")
View(df)
The following returns the distinct combinations of Col1 and Col2; it can of course be any number of columns:
db_td(connection, "table") %.>%
project(., groupby = c("Col1", "Col2"), one = 0) %.>%
execute(connection, .)
The assignment of 0 to a new column is currently necessary. This is supposed to be fixed in the next update of rquery, so that it will work like this:
project(., groupby = c("Col1", "Col2"))

Using multiple columns in dplyr window functions?

Coming from SQL, I would expect to be able to do something like the following in dplyr. Is this possible?
# R
tbl %>% mutate(n = dense_rank(Name, Email))
-- SQL
SELECT Name, Email, DENSE_RANK() OVER (ORDER BY Name, Email) AS n FROM tbl
Also, is there an equivalent for PARTITION BY?
I did struggle with this problem and here is my solution:
In case you can't find any function which supports ordering by multiple variables, I suggest that you concatenate them by their priority level from left to right using paste().
Below is the code sample:
tbl %>%
mutate(n = dense_rank(paste(Name, Email))) %>%
arrange(Name, Email) %>%
view()
Moreover, I guess group_by is the equivalent for PARTITION BY in SQL.
The shortfall of this solution is that you can only order by two (or more) variables that have the same direction. If you need to order by multiple columns with different directions, say one ascending and one descending, I suggest you try this:
Calculate rank with ties based on more than one variable
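For the mixed-direction case, one possible base R sketch (not taken from the linked answer) is to sort with order() and start a new rank whenever either key changes. The toy data is made up; the per-column decreasing vector requires method = "radix".

```r
# Toy data; values are invented for illustration
df <- data.frame(
  Name  = c("b", "a", "a", "b", "a"),
  Email = c("x", "y", "x", "x", "y"),
  stringsAsFactors = FALSE
)

# Name ascending, Email descending
ord <- order(df$Name, df$Email, decreasing = c(FALSE, TRUE), method = "radix")
sorted <- df[ord, ]

# TRUE where a sorted row differs from the previous one on either key
last <- nrow(sorted)
changed <- c(TRUE,
             sorted$Name[-1] != sorted$Name[-last] |
               sorted$Email[-1] != sorted$Email[-last])

# Running count of changes is the dense rank; write it back in original row order
n <- integer(nrow(df))
n[ord] <- cumsum(changed)
df$n <- n
print(df)
```

Tied rows share a rank and no ranks are skipped, which is the behaviour of DENSE_RANK() OVER (ORDER BY Name, Email DESC) in SQL.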
