Using the tidyverse package, I can easily aggregate a single variable. However, I wish to create a function which will allow me to aggregate multiple variables simultaneously.
I understand I have to convert the dataframe containing multiple variables to a list and then lapply an aggregating function across this list. However, I am unable to create this function.
The following is a reprex of what I am trying to do:
# Load package
library(dplyr)
# Load dataset
dat <- data.frame(Titanic)
# Select variables
dat <- dat[, c('Class', 'Sex', 'Age','Survived')]
# Aggregate a single variable
dat %>% group_by(Class) %>% summarise(n=n())
# Desired outcome: Aggregate all variables simultaneously using a function
dat_ls <- as.list(dat) ## Create a list with all the variables
dat_agg <- lapply(dat_ls, function(???)) ## Apply aggregating function to each element in the list
With the list, we can apply table to each element to get the counts per level:
lapply(dat_ls, table)
Another option is to reshape to 'long' format and then use count
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(everything()) %>%
  count(name, value)
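To wrap the single-variable aggregation from the question into a reusable function, something along these lines may work. This is a minimal sketch, not the only way to do it: `count_each` is a hypothetical helper name, and it assumes dplyr 1.0 or later for `across()` and `all_of()`.

```r
library(dplyr)

dat <- data.frame(Titanic)[, c("Class", "Sex", "Age", "Survived")]

# Hypothetical helper: count the rows for each level of every column
count_each <- function(df) {
  vars <- names(df)
  out <- lapply(vars, function(v) {
    df %>%
      group_by(across(all_of(v))) %>%  # group by the column named in the string v
      summarise(n = n())
  })
  setNames(out, vars)                  # name each result after its variable
}

dat_agg <- count_each(dat)
dat_agg$Class
```

Each element of `dat_agg` is a tibble with one row per level, so the result has the same shape as the single-variable call in the question.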
I would like to create a new column that extracts the hour from a timestamp as a numeric data type. If I had one data frame or tibble, I would do it as follows:
calories_hourly$activity_hour_num <- calories_hourly$activity_hour %>% mdy_hms() %>% format(format = ('%H')) %>% as.numeric()
However, I have a list of 18 tibbles called "fitbit_data", and I would like to perform the operation above on tibbles 6-16. In each of these tibbles, the timestamp is in the second column. The beginning of a failed attempt is below:
fitbit_data[6:16] <- fitbit_data[6:16] %>% mutate(activity_hour_num=map(.x=fitbit_data[6:16], .f=~mdy(.x[2])))
Can you please help me code a tidy solution for this R task?
Thank you so much!
You can use map to apply the conversion to each tibble in the subset:
library(purrr)
library(lubridate)
library(dplyr)
k <- 6:16
fitbit_data[k] <- map(fitbit_data[k], ~{.x[[2]] <- lubridate::mdy(.x[[2]]);.x})
Building on your single-tibble code, you can add the new column like this:
fitbit_data[k] <- map(fitbit_data[k], ~.x %>%
  mutate(activity_hour_num = mdy_hms(activity_hour) %>%
    format('%H') %>% as.numeric()))
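If the timestamp column has a different name in each tibble, it may be easier to refer to it by position. The sketch below assumes the timestamp is always the second column and uses a small made-up list (`fitbit_demo`) in place of the real data; `add_hour_num` is a hypothetical helper name.

```r
library(purrr)
library(lubridate)
library(tibble)

# Made-up stand-in for fitbit_data: the second column holds the timestamp
fitbit_demo <- list(
  tibble(id = 1:2,
         activity_hour = c("4/12/2016 13:00:00", "4/12/2016 07:00:00"),
         calories = c(81, 61)),
  tibble(id = 1:2,
         activity_hour = c("4/13/2016 00:00:00", "4/13/2016 23:00:00"),
         steps = c(10, 20))
)

# Hypothetical helper: parse column 2 and store the hour as a numeric column
add_hour_num <- function(df) {
  df$activity_hour_num <- as.numeric(hour(mdy_hms(df[[2]])))
  df
}

k <- 1:2  # would be 6:16 for the list described in the question
fitbit_demo[k] <- map(fitbit_demo[k], add_hour_num)
```

Here `hour()` from lubridate replaces the `format('%H') %>% as.numeric()` step; both should give the same hour values.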
I am trying to use the basic summarise() function but keep getting the same error.
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent of the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
Of note, you could have written your code with purrr::map and its variants, which are more predictable than the *apply family because the suffix specifies the return type. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr binds the results by row into a single data.frame; map_dfc does the same but binds by column.
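The whole count can be checked on a small example. This is a sketch with made-up data standing in for the merged csv files; only the column names `ID` and `Date` come from the question, and `value` is an assumed third column.

```r
library(dplyr)

# Made-up stand-in for the merged csv data
merged <- data.frame(
  ID    = c(1, 1, 2, 2, 2),
  Date  = c("2020-01-01", NA, "2020-01-01", "2020-01-02", "2020-01-03"),
  value = c(1.2, 3.4, NA, 5.6, 7.8)
)

# Keep only rows without any NA, then count the remaining rows per ID
complete_counts <- merged %>%
  filter(complete.cases(.)) %>%
  count(ID, name = "complete_cases")

complete_counts
```

In this example ID 1 has one complete row and ID 2 has two.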
I've tried to find the answer to this probably-obvious question. I have multiple predictor variables that I need to loop through in order to get a summary of another column for each predictor. The data frame will change with every iteration, so I need code that works for multiple different data frames. Here are the places I've looked so far:
R- producing a summary calculation for each column that is dependent on aggregations at a factor level
Multiple data frame handling
Using the mtcars package, this is what I've tried:
#get mtcars data from graphics package
install.packages("graphics")
library(graphics)
data <- mtcars
#loop through names
variable <- list(colnames(data))
for(i in variable){
data1 <- data %>%
group_by(i)
summarise('number' = mean(mpg))
}
However, I get the following error:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `i` is unknown
Not sure where to go next.
There are several issues in the code:
1) 'variable' is unnecessarily created as a list
2) as a result of 1, the loop runs once over the whole vector of column names instead of once per name
3) group_by expects unquoted column names, so for string inputs use group_by_at instead
4) the pipe (%>%) is missing between group_by and summarise
5) the output should be stored in a list; otherwise each iteration overwrites the same object 'data1'
The code below makes these corrections:
variable <- colnames(data) #is a `vector` now
data1 <- list() # initialize as a `list`
for(i in variable){
  data1[[i]] <- data %>%
    group_by_at(i) %>% #changed to `group_by_at`
    summarise(number = mean(mpg))
}
Or this can be done in tidyverse syntax with map, which returns the output as a list of tibbles and avoids initializing the list and assigning to it in a loop:
purrr::map(variable, ~ data %>%
  group_by_at(.x) %>%
  summarise(number = mean(mpg)))
If we need to bind the list elements, use bind_rows. Note that because the grouping column has a different name in each element, this creates one column per variable and fills the rest with NA:
purrr::map(variable, ~ data %>%
  group_by_at(.x) %>%
  summarise(number = mean(mpg))) %>%
  set_names(variable) %>%
  bind_rows(., .id = 'variable')
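In dplyr 1.0 and later, the scoped verbs such as group_by_at are superseded by across(). A sketch of the same idea in that style is below; the two-column subset in `variable` is only there to keep the example short, and `means_by_var` is a hypothetical name.

```r
library(dplyr)
library(purrr)

data <- mtcars
variable <- c("cyl", "gear")  # subset of colnames(data), for brevity

# set_names() names the vector by its own values, so the result list is named
means_by_var <- map(set_names(variable), ~ data %>%
  group_by(across(all_of(.x))) %>%   # group by the column named in the string .x
  summarise(number = mean(mpg)))

means_by_var$cyl
```

Naming the input vector up front means the output list can be indexed by variable name without a separate set_names step afterwards.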
I need your help to simplify the following code.
I need to name the columns of a matrix and convert each of them to a factor.
How can I do that for 100 columns without doing it one by one?
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
train.data <- data.frame(x1=factor(z[,1]), x2=factor(z[,2]), ...., x100=factor(z[,100]))
Here's one option
setNames(data.frame(lapply(split(z, col(z)), factor)), paste0("x", 1:p))
or use magrittr piping syntax
library(magrittr)
split(z, col(z)) %>%
lapply(factor) %>%
data.frame %>%
setNames(paste0("x", 1:p))
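For a version that can be run as is, here is a small sketch that goes through as.data.frame instead of split. The values of `n` and `p` are made up for illustration; the question does not say what they are.

```r
set.seed(1)
n <- 5   # assumed number of rows
p <- 4   # assumed number of columns (100 in the question)

z <- matrix(sample(seq(3), n * p, replace = TRUE), nrow = n)

# Convert the matrix to a data frame and name the columns x1..xp
train.data <- as.data.frame(z)
names(train.data) <- paste0("x", seq_len(p))

# Convert every column to a factor; the empty [] keeps the data frame structure
train.data[] <- lapply(train.data, factor)

str(train.data)
```

Assigning into `train.data[]` replaces the columns in place, so the column names set on the previous line are preserved.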
I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped data, split into one table per group, to my own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
Calling it as in the question,
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data frame at once. To have dplyr call the function once per group and print one table per group, use do:
df %>%
group_by(b) %>%
do(printFunction(.))
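In more recent dplyr versions, do() is superseded, and group_modify() is one way to pass each group to your own function as a separate data frame. The sketch below applies that to the ordering function from the question, on made-up data; `demo` is a hypothetical data frame, and `f` is rewritten slightly so that it does not depend on cbind.

```r
library(dplyr)

# Made-up data with the column names used in the question
demo <- data.frame(
  ID_variable    = c("a", "a", "b", "b", "b"),
  order_variable = c(2, 1, 3, 1, 2)
)

# Own function: receives the rows of ONE group, without the grouping column
f <- function(x) {
  x <- x[order(x$order_variable), , drop = FALSE]
  x$Experience <- seq_len(nrow(x)) - 1   # counter starting at 0 within the group
  x
}

result <- demo %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x)) %>%
  ungroup()

result
```

Because `f` is called once per group, the counter restarts at 0 for each ID, which matches what the plyr version does.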