How can I speed up a function combining rbind and lapply? - r

I have a large data frame (100K rows, 19 columns). For each month, I need to count the number of cases that contain each possible combination of 5 items.
The following code works for a small dataset but with my complete dataset it takes way too long. From my searching I suspect that pre-allocating a dataframe is the key, but I cannot figure out how to do that.
library(dplyr)
Case<-c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9)
Month<- c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec","Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec")
Fruits<-c("Apple","Orange","Grape","Grape","Orange","Apple","Apple","Orange","Grape","Apple","Orange","Grape","Grape","Apple","Orange","Grape","Apple","Orange","Grape")
df<-data.frame(Case,Month,Fruits)
Patterns <- with(df, do.call(rbind, lapply(unique(Case), function(x){
y <- subset(df, Case == x )
Date<-as.character(y$Month[1])
Fruits <- paste(unique(y$Fruits[order(y$Fruits)]), collapse = ' / ')
as.data.frame(unique (cbind(Case = y$Case, Date, Fruits)))
})))
Total<-Patterns %>%
group_by(Date,Fruits) %>%
tally()
The results I get are acceptable but the process takes too long and with a large dataset I run out of memory.

On large datasets, data.table is typically a lot quicker than dplyr for this kind of grouped aggregation:
library(data.table)
setDT(df)[, lapply(.SD, toString), by = c("Case","Month")][,.N, by = c("Fruits","Month")]

We could do all of it in one pipeline using dplyr. First we group_by Case and Month and paste all Fruits in each group together; then we group by Month and Fruits and count the rows in each group with tally.
library(dplyr)
df %>%
group_by(Case, Month) %>%
summarise(Fruits = paste(Fruits, collapse = "/")) %>%
group_by(Month, Fruits) %>%
tally()
# OR count()
# Month Fruits n
# <fct> <chr> <int>
#1 Apr Grape 1
#2 Dec Apple/Orange/Grape 3
#3 Jan Apple/Orange/Grape 1
#4 Mar Grape/Orange 1
#5 Nov Orange/Grape 1
#6 Sep Apple 2
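One caveat worth noting: the question's original code sorts and de-duplicates the fruits within each case, whereas the answers above paste them in row order, so "Grape/Orange" and "Orange/Grape" would be counted as different combinations if they fell in the same month. Below is a minimal sketch of an order-insensitive variant, assuming dplyr 1.0.0 or later (for the .groups argument) and using the sample data from the question:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Case   = c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9),
  Month  = c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec",
             "Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec"),
  Fruits = c("Apple","Orange","Grape","Grape","Orange","Apple","Apple",
             "Orange","Grape","Apple","Orange","Grape","Grape","Apple",
             "Orange","Grape","Apple","Orange","Grape")
)

Total <- df %>%
  group_by(Case, Month) %>%
  # sort() + unique() makes the pasted label independent of row order
  summarise(Fruits = paste(sort(unique(Fruits)), collapse = " / "),
            .groups = "drop") %>%
  # count() is shorthand for group_by() followed by tally()
  count(Month, Fruits)

Total
```

With this sample, cases 2 (Mar) and 5 (Nov) both get the label "Grape / Orange", as in the question's own Patterns table.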

Related

R: unique column values, combine rows of second column

From a data frame I need a list of all unique values of one column. For possible later check we need to keep information from a second column, though for simplicity combined.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ","))
or
df %>%
distinct(id) %>%
summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
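Your second approach doesn't work because you call distinct before you aggregate the data: distinct(id) keeps only the id column, so source no longer exists when summarise runs. Adding .keep_all = TRUE would keep the other column, but only its first value per id.
Your first approach already returns one row per id after group_by and summarise, so the trailing distinct in the code above is a safeguard rather than a requirement.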
aggregate(source ~ id, df, toString)

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to get a structure similar to the expected one:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain,.drop = T) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1

Gather twice in same data frame

I have a dataframe where I want to do two separate gathers
library(tidyverse)
id <- c("A","B","C","D","E")
test_1_baseline <- c(1,2,4,5,6)
test_2_baseline <- c(21000, 23400, 26800,29000,30000)
test_1_followup <- c(0,4,2,3,1)
test_2_followup <- c(10000,12000,13000,15000,21000)
layout_1 <-data.frame(id,test_1_baseline,test_1_followup,test_2_baseline,test_2_followup)
This is the current layout.
Each person is 1 line.
The result of Test 1 at baseline is one variable
The result of Test 2 at baseline is a second variable
The same applies to Test 1/2 follow-up results
I would like the data to be tidier. One column for timepoint, one for result of test A, one for result of test B.
id2 <- c("A","B","C","D","E","A","B","C","D","E")
time <- c(rep("baseline",5),rep("followup",5))
test_1_result <- c(1,2,4,5,6,0,4,2,3,1)
test_2_result <- c(21000, 23400, 26800,29000,30000,10000,12000,13000,15000,21000)
layout_2 <- data.frame(id2, time,test_1_result,test_2_result)
I'm currently using what seems to me an odd process, where I first gather the test 1 data
test_1 <- select(layout_1,id,test_1_baseline,test_1_followup) %>%
gather("Timepoint","test_1",c(test_1_baseline,test_1_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_followup", "followup"))
Then I do same for test 2 and join them
test_2 <- select(layout_1,id,test_2_baseline,test_2_followup) %>%
gather("Timepoint","test_2",c(test_2_baseline,test_2_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_followup", "followup"))
test_combined <- full_join(test_1,test_2)
I tried doing the first Gather and then the second on the same dataframe but then you end up with duplicates; i.e you end up with
ID 1 Test_1 Baseline Test_2 Baseline
ID 1 Test_1 Baseline Test_2 Followup
ID 1 Test_1 Followup Test_2 Baseline
ID 1 Test_1 Followup Test_2 Followup
== 4 rows where there should only be 2
I feel there must be a cleaner tidyverse way to do this.
Guidance welcomed
One option with data.table using melt which can take multiple measure patterns
library(data.table)
nm1 <- unique(sub(".*_", "", names(layout_1)[-1]))
melt(setDT(layout_1), measure = patterns("test_1", "test_2"),
value.name = c('test_1_result', 'test_2_result'),
variable.name = 'time')[, time := nm1[time]][]
You could gather all columns except id, then use separate to split into result and time.
Note that this code assumes that the result name is always 6 characters (test_1, test_2), and separates based on that assumption. You'll need to devise a different separate if that is not the case.
library(tidyr)
library(dplyr)
layout_1 %>%
gather(Var, Val, -id) %>%
separate(Var, into = c("result", "time"), sep = 6) %>%
spread(result, Val) %>%
mutate(time = gsub("_", "", time))
Result:
id time test_1 test_2
1 A baseline 1 21000
2 A followup 0 10000
3 B baseline 2 23400
4 B followup 4 12000
5 C baseline 4 26800
6 C followup 2 13000
7 D baseline 5 29000
8 D followup 3 15000
9 E baseline 6 30000
10 E followup 1 21000
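For readers on tidyr 1.0.0 or later, the gather/separate/spread sequence can probably be collapsed into a single pivot_longer call. This is a sketch rather than part of the original answers; it assumes every measurement column is named test_<number>_<timepoint>, and the name layout_long is only illustrative:

```r
library(tidyr)

# Wide layout from the question
layout_1 <- data.frame(
  id = c("A", "B", "C", "D", "E"),
  test_1_baseline = c(1, 2, 4, 5, 6),
  test_1_followup = c(0, 4, 2, 3, 1),
  test_2_baseline = c(21000, 23400, 26800, 29000, 30000),
  test_2_followup = c(10000, 12000, 13000, 15000, 21000)
)

layout_long <- pivot_longer(
  layout_1,
  cols = -id,
  # ".value" turns the first captured group (test_1, test_2) into output
  # columns; the second captured group becomes the "time" column
  names_to = c(".value", "time"),
  names_pattern = "(test_\\d+)_(.*)"
)

layout_long
```

Unlike sep = 6, the regular expression does not depend on the test name having a fixed width, so it should also cope with names such as test_10_baseline.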

Return column names based on condition

I have a dataset with 18 columns, from which I need to return the names of the column(s) with the highest value for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
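The question's edit mentions that some columns contain NAs. In that case x == max(x) evaluates to NA, so the one-liner above would not return the expected names. Below is a minimal NA-tolerant sketch; the sample values and the value_cols vector are made up for illustration:

```r
# Hypothetical sample data with missing values
Df <- data.frame(a = c(4, 3, NA), b = c(4, 5, 6), c = c(3, NA, 5))

# Restrict the comparison to the numeric columns of interest
value_cols <- c("a", "b", "c")

Df$maxcol <- apply(Df[value_cols], 1, function(x) {
  # na.rm = TRUE ignores NAs when finding the maximum;
  # which() drops the NA comparisons instead of propagating them
  paste0(value_cols[which(x == max(x, na.rm = TRUE))], collapse = "")
})

Df
```

A row consisting only of NAs would trigger a warning from max() and end up with an empty string, so such rows may need separate handling.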
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
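We first convert the data frame from wide to long using gather, then for each row (group_by(row)) we paste together the names of the columns that hold the maximum value, and finally spread the long data frame back to wide. Note that the gather call above names its columns the other way round from the usual convention: values holds the original column names and key holds the cell values.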
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset that looks like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
Because the two codes should be merged only when both occur in the same county, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
