I'm trying to use summarize to get the first result for each group, but it returns the column header instead:
(get_table is a custom function that gets a data table from a Postgres db)
require(dplyr)
require(RPostgres)
tbl <- get_table(my_server, my_table) %>%
  select(column_a, column_b) %>%
  group_by(column_a) %>%
  summarize(first_b = first(column_b))
The result looks like
a first_b
1 "column_b"
2 "column_b"
3 "column_b"
If I use dplyr::collect() before summarize() I get the desired result but this really slows down performance.
Any ideas how I can summarize without using collect first?
I would like to print all unique values of a data frame column in the R console. I also want the values sorted, but I can't get that to work.
df %>% distinct(var)
This works fine, but when I try doing:
df %>% sort(distinct(var))
It gives me this error message
Error in distinct(home_street) : object 'home_street' not found
You can keep the unique values with distinct(), sort them with arrange(), then use pull() to get just the vector.
library(tidyverse)
mtcars %>%
  distinct(cyl) %>%
  arrange(cyl) %>%
  pull()
#[1] 4 6 8
Or in base R:
sort(unique(mtcars$cyl))
I'm trying to get the unique values for 2 columns. I get them when I query each column separately, but it doesn't work for both in the same command. I do it like this:
dataname %>%
  select(column name, column name) %>%
  distinct() %>%
  summarise(name=n(), name=n())
In this way, I only get unique values for the first column. What is the problem?
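One possible reading of the question is that a count of unique values is wanted for each of the two columns. A minimal sketch under that assumption, using the built-in mtcars data with its cyl and gear columns as stand-ins for the unnamed data and columns (the output names n_cyl and n_gear are made up for illustration):

```r
library(dplyr)

# mtcars, cyl and gear stand in for the asker's data and columns (assumption)
unique_counts <- mtcars %>%
  summarise(
    n_cyl  = n_distinct(cyl),   # number of unique values in the first column
    n_gear = n_distinct(gear)   # number of unique values in the second column
  )

# distinct() on both columns returns unique combinations, not per-column values
unique_pairs <- mtcars %>% distinct(cyl, gear)
```

Note that n() takes no arguments and simply counts rows, so calling it after distinct() reports the number of unique combinations rather than the unique values of either column. Giving both outputs the same name also leaves only one column in the result.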
Trying to perform the basic Summarise() function but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
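To make the difference between n() and n_distinct() concrete, here is a small sketch on made-up data. The values below are hypothetical and only stand in for the real remove_na data frame:

```r
library(dplyr)

# Toy stand-in for remove_na (hypothetical values)
remove_na <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03")
)

# Rows per ID: n() takes no arguments
rows_per_id <- remove_na %>%
  group_by(ID) %>%
  summarise(complete_cases = n())

# Distinct Date values per ID
dates_per_id <- remove_na %>%
  group_by(ID) %>%
  summarise(complete_cases = n_distinct(Date))
```

With this toy data, ID 1 has three rows but only two distinct dates, so the two summaries differ for that group.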
As a side note, you could read the files with the purrr::map family instead of lapply. Its suffixed variants (map_dfr, map_chr, and so on) let you state the return type up front. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a single data.frame built by binding the results by row. map_dfc does the same but binds by column.
I am trying to create a table using dplyr. When I try I get an error that says "Error: Can't subset with [ using an object of class quoted." Nothing is quoted in my script. Ideally I would like to create a table that is grouped by ScoutGrade, and shows the count of players for each designated ScoutGrade. I attached my script below:
PlayerSalariesProject %>%
count(PlayerSalariesProject$WAR) %>%
group_by(PlayerSalariesProject, PlayerSalariesProject$GradingScale)
Inside tidyverse functions we can refer to columns by their unquoted names instead of using PlayerSalariesProject$...:
library(dplyr)
PlayerSalariesProject %>%
  group_by(WAR) %>%
  mutate(n = n()) %>%
  group_by(GradingScale, n) %>%
  summarise(meanWAR = mean(WAR))
As summarise only returns the summarised columns along with the grouping variables, we use mutate to add the count first and then do another group_by.
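If the goal is simply one row per grade with the number of players in it, a shorter route may be enough. The sketch below assumes a grade column called GradingScale, as in the posted script. The players data frame and its values are invented for illustration and are not the asker's data:

```r
library(dplyr)

# Hypothetical stand-in for PlayerSalariesProject
players <- data.frame(
  GradingScale = c("A", "B", "A", "C", "B", "A"),
  WAR          = c(5.1, 2.3, 4.8, 0.7, 3.0, 6.2)
)

# One row per grade: number of players and their mean WAR
grade_summary <- players %>%
  group_by(GradingScale) %>%
  summarise(n_players = n(), mean_WAR = mean(WAR))

# count() is a shortcut when only the number of players is needed
grade_counts <- players %>% count(GradingScale, name = "n_players")
```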
Is it possible to select all unique values from a column of a data.frame using select function in dplyr library?
Something like "SELECT DISTINCT field1 FROM table1" in SQL notation.
Thanks!
As of dplyr 0.3 this can be easily achieved using the distinct() function.
Here is an example:
distinct_df = df %>% distinct(field1)
You can get a vector of the distinct values with:
distinct_vector = distinct_df$field1
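You can also select a subset of columns at the same time as you perform the distinct() call, which can be cleaner to look at if you examine the data frame using head/tail/glimpse. (From dplyr 0.5 on, distinct(field1) already keeps only field1 unless you pass .keep_all = TRUE, so the extra select() is only needed on older versions.)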
distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
Just to add to the other answers, if you would prefer to return a vector rather than a dataframe, you have the following options:
dplyr >= 0.7.0
Use the pull verb:
mtcars %>% distinct(cyl) %>% pull()
dplyr < 0.7.0
Enclose the dplyr pipeline in parentheses and combine it with the $ syntax:
(mtcars %>% distinct(cyl))$cyl
The dplyr select function selects specific columns from a data frame. To return the unique values in a particular column, you can instead use group_by followed by an empty summarise(). For example:
library(dplyr)
# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))
# Return the distinct values of x
dat %>%
  group_by(x) %>%
  summarise()
x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
If you want to change the column name you can add the following:
dat %>%
  group_by(x) %>%
  summarise() %>%
  select(unique.x=x)
This both selects column x from among all the columns in the data frame that dplyr returns (and of course there's only one column in this case) and changes its name to unique.x.
You can also get the unique values directly in base R with unique(dat$x).
If you have multiple variables and want all unique combinations that appear in the data, you can generalize the above code as follows:
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE),
                 y=sample(letters[1:5], 100, replace=TRUE))
dat %>%
  group_by(x,y) %>%
  summarise() %>%
  select(unique.x=x, unique.y=y)
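For comparison, the same unique combinations can also be obtained with distinct(), as suggested in the other answers. The sketch below assumes dplyr 1.0 or later (for the .groups argument) and adds arrange() so that both results come out in the same order. The names via_group_by and via_distinct are only illustrative:

```r
library(dplyr)

set.seed(5)
dat <- data.frame(x = sample(1:10, 100, replace = TRUE),
                  y = sample(letters[1:5], 100, replace = TRUE))

# Unique (x, y) combinations via an empty summarise()
via_group_by <- dat %>%
  group_by(x, y) %>%
  summarise(.groups = "drop")

# The same combinations via distinct(), sorted to match the grouped output
via_distinct <- dat %>%
  distinct(x, y) %>%
  arrange(x, y)
```

The group_by version returns the combinations sorted by the grouping variables, while distinct() keeps them in order of first appearance, hence the explicit arrange().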