I am trying to find the quickest and most effective way to produce a table using a for loop (or map in purrr) in R.
I have 15,881 values which I am trying to loop over. For this example, assume the values are the integers 1 to 15,881, stored in this variable:
values <- 1:15881
I am then trying to filter an existing dataframe where a column matches a value, and then perform some data cleaning. The output of this is a single dataframe. For clarity, my process is the following:
Assume in this situation that I have chosen a single value from the values object, e.g. value <- values[1].
So then, for a single value, I have the following:
df <- df_to_filter %>%
  filter(code == value) %>%
  group_by(code, country) %>%
  group_split() %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
The above code works perfectly fine when I run it for a single value: the output is the desired dataframe, and the process takes around 0.7 seconds per value.
However, I am trying to append the output for each and every value found in values to an empty dataframe.
So far I have tried the following:
For Loop approach
# empty dataframe to append values to
empty_df <- tibble()
for (value in values){
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  empty_df <- bind_rows(empty_df, df)
}
However, the above is extremely slow. A quick calculation suggests it would take around 185 minutes ((0.7 seconds per table x 15,881) / 60 seconds per minute = around 185.3 minutes), which is a huge amount of time to produce just one dataframe.
Is there a quicker alternative to the for loop above? I can't think of any way to improve the fundamentals of the code: it does the job well, and 0.7 seconds to produce a single table seems fast to me, but 15,881 tables is obviously going to take a long time.
I tried using the purrr package along with data.table but the furthest I got was this:
combine_dfs <- function(value){
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  df <- data.table(df)
  rbindlist(list(df, empty_df))
}
Then I run it with map_df:
map_df(values, ~combine_dfs(.))
However, even the above is extremely slow and seems to take around the same time!
Any help is appreciated!
Row-binding dataframes in a loop is inefficient regardless of which library you use.
You have not provided any example data, but I think this should produce the same result for your case:
library(dplyr)
df_to_filter %>%
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country)) -> result

result
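If you ever do need an explicit loop, the usual fix is to collect each iteration's result in a pre-allocated list and bind once at the end, rather than growing a dataframe with bind_rows inside the loop. A minimal sketch of that pattern (note it still pays the per-value filter cost, so the single group_split above remains the faster route):

results <- vector("list", length(values))
for (i in seq_along(values)) {
  results[[i]] <- df_to_filter %>%
    filter(code == values[i]) %>%
    group_split(code, country) %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
}
out <- bind_rows(results)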
You really need to provide a reproducible example first; otherwise we can't provide a complete solution and have nothing to compare the result against.
library(data.table)
setDT(df_to_filter)[code %in% values] %>%
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
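Alternatively, assuming some_other_function can accept a per-group subset and returns a data frame that does not itself contain the grouping columns, you could stay entirely in data.table and let by= do the splitting. A sketch under those assumptions:

library(data.table)

setDT(df_to_filter)
result <- df_to_filter[
  code %in% values,
  some_other_function(.SD),  # .SD is the per-group subset, excluding the by columns
  by = .(code, country)
][!is.na(country)]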
Related
I have a calculation that I have to perform for 23 people (they have varying numbers of rows allocated to each person, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction time in more detail.
I could just do this by hand, but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process, even just a little. I have tried to understand the examples, but I'm afraid I don't have the skill. So by hand I would do it like I have below, where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished, and used that to create a variable (testDuration) from which to derive the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and will return the corresponding output:
myfunction <- function(subjectNumber){
  Subj <- rtTrialsYA_s1 %>%
    select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
    filter(Subject == subjectNumber) %>%
    summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
    mutate(testDuration = testEnd - testStart) %>%
    mutate(timeBin = testDuration / 5)
  return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
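That said, here is a sketch of the looping step, assuming (from your description) that subjects are numbered 1 to 23:

library(purrr)

# assumed subject numbering, per the question
all_subjects <- map_dfr(1:23, myfunction)

# or derive the IDs from the data itself
all_subjects <- map_dfr(unique(rtTrialsYA_s1$Subject), myfunction)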
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
  group_by(Subject) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime),
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5,
    .groups = "drop"
  )
There is no need for separate mutate calls in your code, by the way; you can do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you define, there is no real need for the preceding select statement, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
  pivot_longer(-Subject)
The above takes all columns except Subject and pivots them into two columns: one containing the former column name and one containing the former value.
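If you prefer explicit names over the defaults (name/value), pivot_longer accepts names_to and values_to; for example, with hypothetical labels:

df_long <- df %>%
  pivot_longer(-Subject, names_to = "measure", values_to = "value")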
I have some table data that has been scattered across around 1,000 variables in a dataset. Most are split across 2 variables, and I can piece the data together using coalesce; however, this is pretty inefficient for some variables that are instead spread across more than 10. Is there a better/more efficient way?
The syntax I have written so far is:
scattered_data <- df %>%
  select(id, contains("MASS9A_E2")) %>%
  # this brings in all the variables for this one question that start with this string
  mutate(speciality = coalesce(
    MASS9A_E2_C4_1, MASS9A_E2_C4_2, MASS9A_E2_C4_3, MASS9A_E2_C4_4,
    MASS9A_E2_C4_5, MASS9A_E2_C4_6, MASS9A_E2_C4_7, MASS9A_E2_C4_8,
    MASS9A_E2_C4_9, MASS9A_E2_C5_1, MASS9A_E2_C5_2, MASS9A_E2_C5_3,
    MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7,
    MASS9A_E2_C5_8, MASS9A_E2_C5_9
  ))
I have this for 28 MASS questions, so I would really love to be able to collapse these down a bit more quickly.
You can use do.call() to take all columns except id as input of coalesce().
library(dplyr)
df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = do.call(coalesce, select(., -id)))  # "." is the piped subset, not the full df
In addition, you can call coalesce() iteratively with Reduce():

df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = Reduce(coalesce, select(., -id)))
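To collapse all 28 questions without writing each coalesce() out by hand, one option is to map over the question stems and reduce each block of columns. A sketch, where stems is a hypothetical stand-in for your 28 prefixes:

library(dplyr)
library(purrr)

# hypothetical stand-ins for your 28 question prefixes
stems <- c("MASS9A_E2", "MASS9B_E2")

collapsed <- stems %>%
  set_names() %>%
  map(~ Reduce(coalesce, select(df, contains(.x)))) %>%  # one coalesced vector per stem
  as_tibble() %>%
  bind_cols(select(df, id), .)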
Trying to perform a basic summarise() but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occurred in group 1: ID = 1.
Code:
library(dplyr)
merged <- do.call(rbind, lapply(list.files(), read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged), ]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent of the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
Of note, you could have written your code with purrr::map and friends, which improve on the *apply family by letting you specify the return type with the function suffix. That could look like this:
library(purrr)
remove_na <- map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a single data.frame by binding rows; you could instead use map_dfc, which returns a data.frame by binding columns.
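Putting the pieces together, the whole import-and-count step could then look like this (using na.omit(), as above, to keep only complete cases):

library(dplyr)
library(purrr)

new_data <- list.files() %>%
  map_dfr(read.csv) %>%                # read and row-bind every file
  na.omit() %>%                        # drop rows with any NA
  count(ID, name = "complete_cases")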
I'm basically looking for the equivalent of the following Python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count, df$Categorical, sum)
This accomplishes the same thing as the Python code, but I'd like to know how to store an indexed value in a variable in R (I'm new to R).
Based on the by code, it seems like we can use the following (assuming that 'count' is a column of 1s):
library(dplyr)
out <- df %>%
  group_by(Categorical) %>%
  summarise(Sum = sum(count))
If the 'count' column has other values as well, note that the Python code takes the frequency count of the 'Categorical' column, so a similar option would be:
out <- df %>%
  count(Categorical) %>%
  slice(1) %>%
  pull(n)
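For completeness, the by() result in your original code is list-like, so in base R you can also index it directly:

res <- by(df$count, df$Categorical, sum)
first_val <- res[[1]]  # value for the first level of Categorical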
Upfront apology if this has been asked; I have been searching all day and have not found an answer I can apply to my problem.
I am trying to solve this issue using dplyr (and co.) because my previous method (for loops) was too inefficient. I have a dataset of event times, at sites, that are in groups. I want to summarize the number (and proportion) of events that occur in a moving window along a sequence.
# Example data
set.seed(1)
sites <- rep(letters[1:10], 10)
groups <- c('red', 'blue', 'green', 'yellow')
times <- round(runif(length(sites), 1, 100))
timePeriod <- seq(1, 100)

# Example dataframe
df <- data.frame(
  site = sites,
  group = rep(groups, length(sites) / length(groups)),
  time = times
)
This is my attempt to summarize the number of sites from each group that contain a time (event) within a given moving window of time.
The goal is to move through each element of the vector timePeriod and summarize how many events in each group occurred at timePeriod[i] +/- half-window. Ultimately storing them in, e.g., a dataframe with a column for each group, and a row for each time step, is ideal.
df %>%
  filter(time > timePeriod[i] - 25 & time < timePeriod[i] + 25) %>%
  group_by(group) %>%
  summarise(count = n())
How can I do this without looping through my time sequence and storing the summary table for each group individually? Thanks!
Combining lapply and dplyr, you can do the following, which is close to what you already have:
lapply(timePeriod, function(i){
  df %>%
    filter(time > (i - 25) & time < (i + 25)) %>%
    group_by(group) %>%
    summarise(count = n()) %>%
    mutate(step = i)
}) %>%
  bind_rows()
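If you assign the bound output (say, result <- lapply(...) %>% bind_rows()), the wide layout you describe, one column per group and one row per time step, is a pivot_wider away. A sketch, where result is that assigned output:

library(tidyr)

result_wide <- result %>%
  pivot_wider(names_from = group, values_from = count, values_fill = 0)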