Why won't the group_by() function in R work properly? - r

I have a large data frame and I am trying to group by the 8 categories in one column and then find the sum of their weight (kg) using the group_by() and summarise() functions from the dplyr package.
However, in the 'totals' variable created, the sums for some of the categories come out as NA and I'm not sure why, as they should be numerical values. I can't see anything unusual about the data frame.
code:
totals <- db %>% group_by(category) %>% summarise(kilos = sum(weight))

sum() returns NA whenever any of the values it is given is NA, so a single missing weight makes the whole category total NA. Set the na.rm argument to TRUE and the NA values are ignored. The code below should work:
totals <- db %>%
  group_by(category) %>%
  summarise(kilos = sum(weight, na.rm = TRUE))
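To see the effect of na.rm on a small case, here is a minimal sketch. The data frame below is made up for illustration (it is not the asker's data); only the column names category and weight are taken from the question.

```r
library(dplyr)

# Hypothetical toy data: category "a" contains one missing weight
db <- data.frame(
  category = c("a", "a", "b", "b"),
  weight   = c(1.5, NA, 2, 3)
)

# Without na.rm, any NA in a group makes that group's sum NA
with_na <- db %>%
  group_by(category) %>%
  summarise(kilos = sum(weight))

# With na.rm = TRUE the missing values are dropped before summing;
# n_missing reports how many values were ignored in each group
without_na <- db %>%
  group_by(category) %>%
  summarise(kilos = sum(weight, na.rm = TRUE),
            n_missing = sum(is.na(weight)))

print(with_na)
print(without_na)
```

Counting the missing values alongside the sum is a design choice worth considering: it shows which category totals were computed from incomplete data.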

Related

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

Trying to use the basic summarise() function but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
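The difference between n(), count() and n_distinct() can be checked on a small example. This is only a sketch: the obs data frame is invented, and only the column names ID and Date come from the question.

```r
library(dplyr)

# Hypothetical data: ID 1 has three rows but only two distinct dates
obs <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-01", "2020-01-03")
)

# n() takes no arguments and counts rows per group;
# n_distinct(Date) counts unique Date values per group
per_id <- obs %>%
  group_by(ID) %>%
  summarise(rows = n(), distinct_dates = n_distinct(Date))

# count(ID) is shorthand for group_by(ID) %>% summarise(n = n())
row_counts <- obs %>% count(ID)

print(per_id)
print(row_counts)
```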
As a side note, you could have read the files with purrr's map functions, which are more predictable than the *apply family because the suffix specifies the return type. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr binds the results by row into a single data.frame; map_dfc would bind them by column instead.
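A self-contained way to try this out is to write a couple of CSV files to a temporary directory first. The file names and contents below are assumptions made up for the sketch; note that map_dfr() needs dplyr to be installed because it binds rows with it.

```r
library(purrr)

# Create a temporary folder with two small, hypothetical CSV files
tmp <- tempfile("csvdemo")
dir.create(tmp)
write.csv(data.frame(ID = 1, Date = "2020-01-01", value = 3),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, Date = "2020-01-02", value = NA),
          file.path(tmp, "b.csv"), row.names = FALSE)

# full.names = TRUE so read.csv gets complete paths, not bare file names
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read every file and bind the results by row into one data frame
merged <- map_dfr(files, read.csv)

# Drop rows that contain any NA
complete <- na.omit(merged)

print(merged)
print(complete)
```

Passing full.names = TRUE matters whenever the files are not in the working directory; the original list.files() call only works if they are.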

2 Numeric Values In A Dataframe Field In R

I have a dataset in R with a little under 100 columns.
Some of the columns have numeric values written as expressions, such as 87+3 as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
rowwise() %>%
mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at
library(dplyr)
dataframe %>%
rowwise() %>%
mutate_at(1:60, list(new_value = ~eval(parse(text = .))))
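In dplyr 1.0.0 and later the same idea can also be written with across(). The following is a sketch under assumptions: the toy data frame and the helper name eval_text are hypothetical, and the helper evaluates each string itself, so rowwise() is not needed. As with the original, eval(parse()) runs whatever code the strings contain, so it should only be used on trusted data.

```r
library(dplyr)

# Hypothetical data: numbers stored as text, some written as sums
dataframe <- data.frame(
  a = c("87+3", "90"),
  b = c("1+1", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate each string and return a numeric vector
eval_text <- function(x) {
  vapply(x, function(s) eval(parse(text = s)), numeric(1), USE.NAMES = FALSE)
}

# Apply the helper to several columns at once, overwriting them in place
new_dataframe <- dataframe %>%
  mutate(across(c(a, b), eval_text))

print(new_dataframe)
```

With 60 columns, the selection inside across() could be a range such as 1:60 or a tidyselect helper instead of listing names.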

Store specific index value of output in R

I'm basically looking for the equivalent of the following python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count,df$Categorical,sum)
This accomplishes the same thing as the first code but I'd like to know how to store an index value to a variable in R (new to R) .
Based on the by() code, it seems like we can use the following (assuming that 'count' is a column of 1s)
library(dplyr)
out <- df %>%
group_by(Categorical) %>%
summarise(Sum = sum(count))
If the 'count' column has other values as well, note that the Python code is taking the frequency count of each 'Categorical' group rather than a sum. So, a similar option would be
out <- df %>%
count(Categorical) %>%
slice(1) %>%
pull(n)
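One detail to keep in mind: pandas' count() counts non-missing values of the selected column, whereas dplyr's count() counts rows. The sketch below, on invented data, stores both versions of the first group's value in a variable so the difference is visible.

```r
library(dplyr)

# Hypothetical data: group "a" has two rows, one with a missing count
df <- data.frame(
  Categorical = c("a", "b", "a", "c"),
  count       = c(1, 1, NA, 1)
)

# Per-group number of non-missing 'count' values and number of rows
per_group <- df %>%
  group_by(Categorical) %>%
  summarise(non_missing = sum(!is.na(count)), rows = n())

# Store the first group's value in a variable, like [0] in Python
first_non_missing <- per_group$non_missing[1]

# Row-count version from the answer above
first_rows <- df %>%
  count(Categorical) %>%
  slice(1) %>%
  pull(n)

print(first_non_missing)
print(first_rows)
```

If the column has no missing values, the two results are the same.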

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))
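The formula approach can be tried without installing ggmosaic. The following sketch uses a small made-up data frame (toy) in place of the "happy" dataset; the column names are illustrative only.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the "happy" data: two factors and one numeric
toy <- data.frame(
  marital = factor(c("married", "single", "married", "single")),
  sex     = factor(c("f", "m", "m", "m")),
  age     = c(30, 41, 25, 58)
)

# One table of rounded proportions per factor column;
# .x is the column that map() passes to the formula
tabs <- toy %>%
  select_if(is.factor) %>%
  map(~ round(prop.table(table(.x)), 2))

print(tabs)
```

The original attempt failed because map() was handed the result of calling round() on the table function itself, rather than a function to apply to each column.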

Error dplyr summarise

I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code perfectly works. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" is a vector with 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
PD: for the sake of understanding why I need this, I have another data frame with multiple columns with complex names that need to be "summarised", that's why I can't put one by one on the summarise command. What I want is to put a vector there to calculate the probs of each column grouped by patients.
It appears you want summarise_each
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), vars= matches('prob'))
Using data.table you could do
library(data.table)
setDT(vector)[, lapply(.SD, mean), by = patient, .SDcols = 'prob']
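The attempt with mean(vector[,3]) returns five identical values because vector[,3] refers to the whole, ungrouped column, so every group gets the overall mean. In dplyr 1.0.0 and later, one way to summarise columns whose names are held in a character vector is across() with all_of(). This is a sketch: the data frame is named dat here (rather than vector) and cols is a hypothetical variable holding the column names.

```r
library(dplyr)

set.seed(1L)
dat <- data.frame(
  patient  = rep(1:5, each = 2),
  medicine = rep(1:3, length.out = 10),
  prob     = runif(10)
)

# Hypothetical vector of column names to summarise; could hold many names
cols <- "prob"

# across() applies mean() to each selected column within each group
out <- dat %>%
  group_by(patient) %>%
  summarise(across(all_of(cols), mean))

print(out)
```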
