Coming from SQL, I would expect to be able to do something like the following in dplyr. Is this possible?
# R
tbl %>% mutate(n = dense_rank(Name, Email))
-- SQL
SELECT Name, Email, DENSE_RANK() OVER (ORDER BY Name, Email) AS n FROM tbl
Also, is there an equivalent for PARTITION BY?
I struggled with this problem too, and here is my solution:
If you can't find a function that supports ordering by multiple variables, I suggest concatenating them in priority order from left to right using paste().
Here is a code sample:
tbl %>%
  mutate(n = dense_rank(paste(Name, Email))) %>%
  arrange(Name, Email) %>%
  view()
Moreover, I believe group_by() is the equivalent of PARTITION BY in SQL.
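As a rough sketch of that combination, assuming a hypothetical partitioning column Dept, ranking within groups mimics the PARTITION BY clause:
library(dplyr)

# Sketch only: Dept is a made-up column standing in for the PARTITION BY key;
# dense_rank() restarts within each group, mimicking
# DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Name)
tbl %>%
  group_by(Dept) %>%
  mutate(n = dense_rank(Name)) %>%
  ungroup()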
The shortfall of this solution is that it only works when the 2 (or more) ordering variables share the same direction. If you need to order by multiple columns with different directions, say one ascending and one descending, I suggest you try this:
Calculate rank with ties based on more than one variable
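As a rough sketch of one way to handle the mixed-direction case (this assumes dplyr >= 1.1.0, which provides consecutive_id()):
library(dplyr)

# Sort by Name ascending and Email descending, then give a new id each time
# the (Name, Email) pair changes; because equal pairs are adjacent after
# arranging, this behaves like a dense rank over the mixed-direction ordering.
tbl %>%
  arrange(Name, desc(Email)) %>%
  mutate(n = consecutive_id(Name, Email))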
I am using R and the dplyr library.
I want to join a larger database with a smaller database (in terms of rows).
I use left join because I want to have a final database that has the same number of rows as the larger one.
This naturally returns NA values when the smaller database does not have a value corresponding to the joining key.
What I want to achieve is sort of copying the previous values of the smaller database into the rows where NA is returned by the left join.
In other words:
if (is.na(columnvalue[j])) {
  columnvalue[j] <- columnvalue[j - 1]
}
where columnvalue is a joined column from the smaller database and j = 2, ..., nrow(largerdataset).
A loop with that if statement should work, but it is a bit cumbersome. Is there any other smarter solution?
Thank you.
If you update with some sample data, I could provide full code for this. The general solution is to use fill() from the tidyr package, possibly with a group_by() on the key if needed. You would write it as:
library(tidyverse)

data %>%
  # group_by(key) %>%
  tidyr::fill(var1, var2, var3, .direction = "down")
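For example, a toy sketch with made-up data (big, small, key, and value are all hypothetical names used only for illustration):
library(dplyr)
library(tidyr)

# Hypothetical tables: small has no row for key == 2
big   <- tibble(key = c(1, 1, 2, 2, 3), x = 1:5)
small <- tibble(key = c(1, 3), value = c("a", "b"))

big %>%
  left_join(small, by = "key") %>%
  fill(value, .direction = "down")  # each NA takes the previous non-NA value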
I'm trying to get unique values for 2 columns. I get them when I write the commands separately, but it doesn't work for both in the same command. I do it like this:
dataname %>%
  select(column name, column name) %>%
  distinct() %>%
  summarise(name = n(), name = n())
In this way, I only get unique values for the first column. What is the problem?
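If the goal is the number of unique values in each column, one possible sketch (col1 and col2 stand in for the real column names) is to use n_distinct() instead of distinct() followed by n():
library(dplyr)

# Count distinct values per column; col1/col2 are placeholder names
dataname %>%
  summarise(n_col1 = n_distinct(col1),
            n_col2 = n_distinct(col2))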
Suppose that, using dbplyr, we have something like
library(dbplyr)

sometable %>%
  head()
then we see the first 6 rows.
But if we try this, we see an error:
sometable %>%
  tail()
# Error: tail() is not supported by sql sources
which is expected behaviour of dbplyr:
Because you can’t find the last few rows without executing the whole query, you can’t use tail().
Question: how do we do the tail() equivalent in this situation?
In general, the order of SQL query results should never be assumed, as the DBMS may store data in an order that is ideal for indexing or other reasons, not the order you want. Because of that, a common "best practice" for SQL queries is to either (a) assume the data is unordered (and perhaps that the order may change, though I've not seen this in practice); or (b) force ordering in the query.
Given this, consider arranging your data in descending order and using head().
For instance, if I have a table MyTable with a numeric field MyNumber, then
library(dplyr)
library(dbplyr)

tb <- tbl(con, "MyTable")

tb %>%
  arrange(MyNumber) %>%
  tail() %>%
  sql_render()
# Error: tail() is not supported by sql sources

tb %>%
  arrange(MyNumber) %>%
  head() %>%
  sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber"

tb %>%
  arrange(desc(MyNumber)) %>%
  head() %>%
  sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber" DESC
(This is demonstrated on a SQL Server connection, but the premise should work just as well for other DBMS types; they will just shift from SELECT TOP(6) ... to SELECT ... LIMIT 6 or similar.)
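If the goal is to see those last rows in their original ascending order, one rough workaround (a sketch, not built-in dbplyr behaviour) is to pull the descending head() into R with collect() and re-sort locally:
tb %>%
  arrange(desc(MyNumber)) %>%
  head(6) %>%          # the last 6 rows by MyNumber, fetched in reverse order
  collect() %>%        # bring just those rows into R as a local tibble
  arrange(MyNumber)    # restore ascending order locally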
When performing a grouped summary in dplyr, one would normally summarize all target variables in a single command:
# Method 1: summarize all target variables in one command
mtcars %>%
  group_by(am) %>%
  summarize(mpg = mean(mpg),
            disp = mean(disp))
However, one might prefer to perform the summarizations separately for greater flexibility & programmability (yes I am aware of across, but my impression is that its flexibility is limited). In this case, I assume that one must join the tables together at the end:
# Method 2: summarize separately and join
a <- mtcars %>%
  group_by(am) %>%
  summarize(mpg = mean(mpg))

b <- mtcars %>%
  group_by(am) %>%
  summarize(disp = mean(disp))

inner_join(a, b, by = 'am')
The join could be avoided by just appending the summary from b to a:
a$c <- b$disp
However, this would assume that the rows of a and b are in the same order. This is certainly not assured in general, as typical SQL databases do not guarantee output order. When dplyr uses such a database as a backend, it will presumably reflect whatever random order the database returned the data in.
My question is, does vanilla dplyr (i.e. no external backend) guarantee a certain ordering of rows, such that the non-join solution can be considered safe & robust? I suspect dplyr is not interested in guaranteeing row order, but have not been able to find a definitive statement one way or the other.
I have a df with 30 columns and 2000 rows.
From the df, I selected several variables by name and calculated the mean of Value over every 3 rows, grouped by the group and type variables.
But only 3 variables (group, type, res) appear in the output data.
How do I keep the selected variables in the output df? Is there anything wrong with this code?
output <- data %>%
  select(group, type, A, B, C, Value) %>%
  group_by(group = gl(n()/3, 3), type) %>%
  summarise(res = mean(Value))
Thanks in advance!
As others have pointed out, summarize only returns grouping variables and those variables specified in summarize. This is by design – summarize returns a single row for each group, so there must be a single value for each variable.
The function used in summarize must return a single value (so that's covered), while using group_by with variables ensures that these variables are the same within the group. But for the other variables, there could be several different values within the group: which would summarize choose? Instead of making a guess, it drops those variables.
There are several options to get around this, which one is best depends on your data and what you want to do with it:
Add these variables as grouping variables. This is the preferred method, but obviously it only works if the structure of the data allows it. For example, in a hypothetical dataset, if you want to group by city but want to preserve the state variable, using group_by(city, state) will divide into groups the same way as group_by(city) since city and state are linked (for example, "Boston" will always be with "MA").
Define them in summarize and choose only the first value to be the value for that group, as in #thc's answer. Note that you will lose any other values of those variables and it's not always clear which value will be kept and which will be lost.
Use mutate instead - this will keep the original number of rows rather than collapsing to 1 per group, but will ensure that you don't lose any data.
Join them into a comma- (or otherwise) separated string by adding A = paste(A, collapse = ', ') to the summarize for each variable you want to keep. This preserves the information, at the expense of making it difficult to work with in any future steps; see the sketch below.
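A rough sketch of that paste() approach applied to the question's code:
library(dplyr)

data %>%
  group_by(group = gl(n() / 3, 3), type) %>%
  summarise(res = mean(Value),
            A = paste(A, collapse = ", "),  # keep every value of A as one string
            B = paste(B, collapse = ", "),
            C = paste(C, collapse = ", "))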
You can include them in summarise instead, e.g.:
output <- data %>%
  select(group, type, A, B, C, Value) %>%
  group_by(group = gl(n()/3, 3), type) %>%
  summarise(res = mean(Value), A = A[1], B = B[1], C = C[1])
I believe this is the fastest approach under dplyr if you have a very large data.frame.