Distinct function in dplyr package - r

I'm trying to get the unique values for 2 columns. I get them when I write the command for each column separately, but it doesn't work for both in the same command. I do it like this:
dataname %>%
select(column name, column name) %>%
distinct() %>%
summarise(name=n(), name=n())
In this way, I only get unique values for the first column. What is the problem?
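A possible explanation, assuming the goal is the number of unique values in each column separately: distinct() on two columns keeps the unique combinations of the pair, and n() counts rows rather than the values of one column, so every n() in the same summarise() returns the same number. A minimal sketch with n_distinct() follows; the data frame and the column names col_a and col_b are made up for illustration.

```r
library(dplyr)

# Hypothetical data; col_a and col_b stand in for the two real columns.
dataname <- data.frame(
  col_a = c(1, 1, 2, 2, 3),
  col_b = c("x", "x", "x", "y", "y")
)

# Count the unique values of each column independently.
unique_counts <- dataname %>%
  summarise(n_a = n_distinct(col_a), n_b = n_distinct(col_b))

# Unique combinations of the two columns, for comparison.
unique_pairs <- dataname %>% distinct(col_a, col_b)
```

Here unique_counts holds one count per column, while unique_pairs holds one row per distinct pair, which is what distinct() followed by n() would have counted.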

Related

how can I filter data inside mutate() using a counting function (like NROW) in R?

I have a dataframe with the columns doc_id and feats (both character vectors). I'm trying to create a new column n_rel_prn, which has the number of total occurrences of the value 'PronType=Rel' in the feats column, for each doc_id.
I can't use filter(), because it filters out all of the other data I need (i.e. where the value for feats is not 'PronType=Rel'), but otherwise it does the trick. (Here's that code snippet:)
tcorpus %>% group_by(doc_id) %>%
filter(feats=='PronType=Rel') %>%
mutate(n_rel_prn = n())
Basically, I need something that works like the following code (except that actually works--this obviously doesn't):
tcorpus %>% group_by(doc_id) %>%
mutate(n_rel_prn = NROW(feats == 'PronType=Rel'))
Is there a way I can count the number of 'PronType=Rel' observations (grouped by doc_id) and add these totals to a new column? (I'm assuming at the very least group_by %>% mutate() is the way to go.)
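You are almost there. NROW() returns the length of the logical vector, which is the group size no matter how many of its elements are TRUE. Summing the logical vector counts the TRUE values instead. Try this: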
tcorpus %>% group_by(doc_id) %>% mutate(n_rel_prn = sum(feats == 'PronType=Rel'))
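To see the difference between the two approaches, here is a small sketch on a made-up tcorpus (the rows are invented; only the column names come from the question). It computes both the NROW() version and the sum() version side by side.

```r
library(dplyr)

# Toy stand-in for the real corpus; values are invented for illustration.
tcorpus <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  feats  = c("PronType=Rel", "Number=Sing", "PronType=Rel",
             "Number=Plur", "PronType=Rel"),
  stringsAsFactors = FALSE
)

result <- tcorpus %>%
  group_by(doc_id) %>%
  mutate(
    n_rows    = NROW(feats == "PronType=Rel"),  # length of the logical vector
    n_rel_prn = sum(feats == "PronType=Rel")    # number of TRUE values
  ) %>%
  ungroup()
```

All rows are kept, n_rows is just the group size, and n_rel_prn is the per-document count. If feats can contain NA, sum(..., na.rm = TRUE) avoids an NA result for that group.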

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

I'm trying to run a basic summarise() call but keep getting the same error.
I have a large number of CSV files, each with 4 columns. I read them into R with lapply and rbind them together. Next, I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <- do.call(rbind, lapply(list.files(), read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This counts the number of rows in each ID group and is independent of the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
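As a quick check of how these three variants differ, here is a sketch on invented data (the ID and Date column names come from the question; the values do not). The name argument of count() is used only so the output column matches the summarise() version.

```r
library(dplyr)

# Invented rows standing in for the merged, NA-free data.
remove_na <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-01", "2020-01-03")
)

# Rows per ID, written out with group_by() + summarise().
rows_per_id <- remove_na %>%
  group_by(ID) %>%
  summarise(complete_cases = n())

# Same result with the count() shortcut.
rows_per_id_short <- remove_na %>% count(ID, name = "complete_cases")

# Distinct Date values per ID.
dates_per_id <- remove_na %>%
  group_by(ID) %>%
  summarise(n_dates = n_distinct(Date))
```

For ID 1 the row count and the distinct-date count differ because one date appears twice.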
Of note, you could have written your code with the purrr map functions. Unlike the *apply family, their suffix specifies the return type. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a single data frame built by binding the results by row. There is also map_dfc, which binds the results by column instead.

Using R Package dplyr

I am trying to create a table using dplyr. When I try I get an error that says "Error: Can't subset with [ using an object of class quoted." Nothing is quoted in my script. Ideally I would like to create a table that is grouped by ScoutGrade, and shows the count of players for each designated ScoutGrade. I attached my script below:
PlayerSalariesProject %>%
count(PlayerSalariesProject$WAR) %>%
group_by(PlayerSalariesProject, PlayerSalariesProject$GradingScale)
Inside tidyverse functions, we can refer to columns by their unquoted names instead of using PlayerSalariesProject$...:
library(dplyr)
PlayerSalariesProject %>%
group_by(WAR) %>%
mutate(n = n()) %>%
group_by(GradingScale, n) %>%
summarise(meanWAR = mean(WAR))
Because summarise() only returns the summarised columns along with the grouping variables, we first add the count with mutate() and then call group_by() again.
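If the goal is simply the number of players per grade, as the question describes, a shorter sketch may be enough. It assumes the grade is stored in the GradingScale column used in the script (the question text calls it ScoutGrade), and the data below is invented.

```r
library(dplyr)

# Invented data; GradingScale is assumed to be the grade column.
PlayerSalariesProject <- data.frame(
  GradingScale = c("A", "A", "B", "C", "B", "A"),
  WAR          = c(5.1, 4.3, 2.2, 0.4, 1.9, 6.0)
)

# One row per grade with the player count and the mean WAR.
grade_table <- PlayerSalariesProject %>%
  group_by(GradingScale) %>%
  summarise(players = n(), meanWAR = mean(WAR))
```

If only the counts are needed, PlayerSalariesProject %>% count(GradingScale) gives the same counts in a column named n.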

Passing variable names to the count() function in R

I am trying to use the count() function to return the levels of a column in R. I have 37 columns and I wanted to know if there is a way to pass column names other than typing them out.
I am currently using,
x1Count <- totalCount %>% group_by(Country) %>% count(X1.Environmental.Regulation) %>% drop_na()
I want to run this through a loop with the count() function taking the column names from a list like colnames(totalCount).
Is there another way to pass inputs to the count() function that will allow me to use column numbers or refer another list?
We can convert the column names from strings into symbols (with syms) and splice them into count() (with !!!). In the example below, we get the frequency count of the combinations of columns 4 and 5, grouped by 'Country':
library(tidyverse)
totalCount %>%
group_by(Country) %>%
count(!!! rlang::syms(names(.)[4:5]))
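If a separate table per column is wanted, as the question's loop suggests, the same idea can be applied one column at a time. The sketch below uses an invented totalCount with hypothetical columns X1 and X2, and it converts a single name with rlang::sym and unquotes it with !!.

```r
library(dplyr)

# Invented data; X1 and X2 are hypothetical stand-ins for the 37 columns.
totalCount <- data.frame(
  Country = c("A", "A", "B", "B"),
  X1      = c("low", "low", "high", "low"),
  X2      = c("yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Every column except the grouping column.
cols <- setdiff(names(totalCount), "Country")

# One frequency table per column, each counted within Country.
counts <- lapply(cols, function(col) {
  totalCount %>% count(Country, !!rlang::sym(col))
})
names(counts) <- cols
```

counts is then a named list, so counts$X1 holds the table for the first column. Column positions work as well, for example names(totalCount)[4:5] in place of cols.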

dplyr: summarize(first(...)) returns column name

I'm trying to use summarize to get the first result for each group, but it returns the column header instead:
(get_table is a custom function that gets a data table from a Postgres db)
require(dplyr)
require(RPostgres)
tbl <- get_table(my_server, my_table) %>%
select(column_a, column_b) %>%
group_by(column_a) %>%
summarize(first_b = first(column_b))
The result looks like
a first_b
1 "column_b"
2 "column_b"
3 "column_b"
If I use dplyr::collect() before summarize() I get the desired result but this really slows down performance.
Any ideas how I can summarize without using collect first?
