R dplyr chaining group_by into a function

I have a dataframe that looks something like this:
  time id trialNum trialType        accX     gravX
1    1  6        7       low -0.38876217 10.185266
2    2  1        6       low  0.68254705 10.741545
3    3  3       15      high -0.21906854  9.466929
4    4  2       15      none -0.03370001  9.490829
5    5  4        1      high  0.16511542 10.986796
6    6  9        2      none -0.10441621  9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
                     id = sample(1:10, size = 50, replace = T),
                     trialNum = sample(1:15, size = 50, replace = T),
                     trialType = sample(c("none", "low", "high"),
                                        size = 50, replace = T),
                     accX = sin(seq(1, 50, 1)),
                     gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning the mean and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, df$accX - df$gravX)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- df$time[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do is apply this function to each combination of id, trialNum, and trialType using group_by:
tempTrial <-
  group_by(testDF, id, trialNum, trialType) %>%
  summarise(meanTime = calcStepTime(.)[1],
            varianceTime = calcStepTime(.)[2])
The problem is that in the output dataframe (tempTrial) every row of meanTime and varianceTime is identical.
In this toy dataset the columns sometimes all show NA (this doesn't happen in my actual dataset).
Am I doing something incorrectly that causes each row to be identical for the 2 columns? It should be taking each combination of id, trialNum, and trialType, and calculating peak times for each of those separately. However, it seems it's only storing a single value for each combination?

The chain is working in the sense that . refers to the entire grouped data frame group_by(testDF, id, trialNum, trialType), not to each group's subset. Since your function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame, repeated for every group).
So your problem here is the incorrect use of summarise. Latrunculia's answer shows that the proper way to use summarise as you expect is to apply the function to columns of your data frame, in which case the function is applied group-wise to those columns.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>%
  do(meanTime = calcStepTime(.)[1],
     varianceTime = calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
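One caveat (my addition, not part of the original answer): named arguments to do() store each result as a list-column. If you want plain numeric columns, a minimal sketch is to let do() return a one-row data frame per group:
tempTrial <- testDF %>%
  group_by(id, trialNum, trialType) %>%
  do({
    res <- calcStepTime(.)    # inside do(), . is the current group's subset
    data.frame(meanTime = res[1], varianceTime = res[2])
  })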

What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
calcStepTime <- function(var1, var2, var3){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, var1 - var2)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- var3[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(c(meanStepTime, varianceStepTime))
}
testDF %>% group_by(id, trialNum, trialType) %>%
  summarise(meanTime = calcStepTime(accX, gravX, time)[1],
            varianceTime = calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the ungrouped testDF data frame into it. It breaks for the grouped data frame, but I can't tell whether that's because the function is not defined for the subsets or whether it's a problem within the function. Let me know if it works on the full data.
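My guess (not verified against the original data) is that filtfilt() or findPeaks() fails outright when a group has only a handful of rows. A hypothetical defensive wrapper that falls back to NAs instead of erroring:
calcStepTimeSafe <- function(var1, var2, var3) {
  # return c(NA, NA) if the filter or peak finding errors on a short group
  tryCatch(calcStepTime(var1, var2, var3),
           error = function(e) c(NA_real_, NA_real_))
}
Swapping calcStepTimeSafe into the summarise() call above should at least reveal which groups are the problem.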

As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50-observation dataset. This occurs when either no peak or only a single peak is found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
                     id = sample(1:2, size = 200, replace = T),
                     trialNum = sample(1:1, size = 200, replace = T),
                     trialType = sample(c("low"), size = 200, replace = T),
                     accX = sin(seq(1, 200, 1)),
                     gravX = 0.1)
If you change the return type of your function to a data_frame (tibble), like so:
calcStepTime <- function(df){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, df$accX - df$gravX)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- df$time[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(data_frame("meanStepTime" = meanStepTime,
                    "varianceStepTime" = varianceStepTime))
}
Then you can take advantage of purrr::by_slice() for a fairly elegant solution:
library(purrr)
testDF %>%
  group_by(id, trialNum, trialType) %>%
  by_slice(calcStepTime, .collate = "cols")
I got this from my test sample:
# A tibble: 2 x 5
     id trialNum trialType meanStepTime1 varianceStepTime1
  <int>    <int>    <fctr>         <dbl>             <dbl>
1     1        1       low         42.75          802.2500
2     2        1       low         39.75          616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.
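(As far as I can tell, the "1" is by_slice() suffixing the output columns with the result row index.) Note also that by_slice() was removed from later versions of purrr; a rough modern equivalent, a sketch assuming dplyr >= 0.8 and the data_frame-returning calcStepTime above, is group_modify():
library(dplyr)
testDF %>%
  group_by(id, trialNum, trialType) %>%
  group_modify(~ calcStepTime(.x))  # .x is the group's data without the grouping columns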

Related

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, T),
  Week = sample(1:10, 20, T),
  Region = sample(1:15, 20, T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group by, count, and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i subsets the rows (like a SQL "where"); it is empty in this case, so all data is used in its original order. The j says what is to be done: here, counting the number of occurrences using .N, which is then assigned to the new variable count using the assign operator :=. The by takes a list of variables by which the j operation is grouped.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
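Note that count() on grouped data collapses df1 to one row per combination. To mirror the data.table result above, which keeps every row and adds the count as a column, the dplyr analogue is add_count() (a sketch, assuming dplyr >= 0.8 for the name argument):
library(dplyr)
df1 <- df1 %>% add_count(NHS_Trust, Week, Region, name = "count")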
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
  for (j in 1:length(unique(df2$NHS_Trust))){
    for (k in 1:length(unique(df2$Week))){
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if(!curr_combo %in% df2$combo){
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        #cat(curdat)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
#      Region NHS_Trust Week Jobs  combo
# 4495     15         1    4    0 15_1_4
# 4496     15         1    5    0 15_1_5
# 4497     15         1    8    0 15_1_8
# 4498     15         1    3    0 15_1_3
# 4499     15         1    6    0 15_1_6
# 4500     15         1    9    0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am quite sure the people here can come up with something more elegant than this.
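In that spirit, a more compact base R sketch of the same missing-combination logic (my addition, untested beyond the example data): build the full Region x NHS_Trust x Week grid, merge, and fill the gaps with 0.
full_grid <- expand.grid(Region = unique(df2$Region),
                         NHS_Trust = unique(df2$NHS_Trust),
                         Week = unique(df2$Week))
df2_full <- merge(full_grid,
                  as.data.frame(df2)[, c("Region", "NHS_Trust", "Week", "Jobs")],
                  all.x = TRUE)
df2_full$Jobs[is.na(df2_full$Jobs)] <- 0  # missing combinations get 0 jobs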

How to divide a dataset into test and train with respect to a group

I want to divide my data set into train and test data, but I have one column that defines a group. All members of a group must be in either train or test. For example, if the group column is like this:
group
1
1
1
1
1
2
2
2
3
3
if one of the rows of the first group is in the train set, the first 5 rows must all be in there, and so on.
A solution using dplyr. dat_train and dat_test are the final results. I assume a case with 10000 groups in the training dataset and 5000 groups in the testing dataset.
library(dplyr)
# Set seed for reproducibility
set.seed(12345)
# Create an example data frame with group and data
dat <- tibble(group = rep(1:15000, each = 5),
              data = rnorm(75000))
# Step 1: Create a lookup table of the group numbers
g <- dat %>% distinct(group)
# Step 2: Use sample_n to sample groups for the training set
g_train <- g %>% sample_n(size = 10000)
# Step 3: Use semi_join and anti_join to split dat into train and test
dat_train <- dat %>% semi_join(g_train, by = "group")
dat_test <- dat %>% anti_join(g_train, by = "group")
Let's assume you have a total of 20 groups and you want 8 groups in the training set and the remaining 12 in your test set.
First, let's generate some data to play with:
dat <- data.frame(group=factor(rep(1:20, each=5)), value=rnorm(100))
As you want to sample by group rather than observation, now draw a random sample of size 8 from groups for your training set and put the rest into the test set.
train.groups <- sample(levels(dat$group), 8)
dat.train <- dat[dat$group %in% train.groups, ]
dat.test <- dat[!(dat$group %in% train.groups), ]
You could use dplyr (part of the tidyverse) to solve this.
Assuming your dataset's name is df1.
Here is an example:
library(dplyr)
library(tidyverse)
training_data <- df1 %>% filter(group == 1)
testing_data <- df1 %>% filter(group == 2)
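Note that filtering on fixed group values assigns specific groups to each set rather than splitting at random. For a random split by group, a sketch combining this with the sampling idea from the first answer (assumes dplyr >= 1.0 for slice_sample; the 0.8 train fraction is arbitrary):
train_groups <- df1 %>% distinct(group) %>% slice_sample(prop = 0.8)
training_data <- df1 %>% semi_join(train_groups, by = "group")
testing_data  <- df1 %>% anti_join(train_groups, by = "group")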

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing: the rowSums of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the "rowSum" variable (I call it "s1").
Problem:
Each setting has a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables, it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (this works) and
(2) compute rowSums over the variables of the setting (this does not work; why?).
Because it is a function for a specially designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = rowSums(!!! subvars, na.rm = TRUE))
  return(dfplot)
}
We can change the quosures to strings with as_name and subset the dataset with [ for the rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  v1 <- map_chr(subvars, as_name)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = rowSums(.[v1], na.rm = TRUE))
  return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
#   day  N       col1      col2          s1
# 1   1 20 -0.5458808 0.4703824 -0.07549832
# 2   2 20  0.5365853 0.3756872  0.91227249
# 3   3 20  0.4196231 0.2725374  0.69216051
Or another option would be to select the quosures and then do the rowSums:
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
                    rowSums(na.rm = TRUE))
  return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
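For readers on more recent dplyr (>= 1.0), a sketch of the same function using across(), which avoids the explicit rlang quosures (same assumed columns day and N):
mkr.sumsetting <- function(..., dataset){
  dataset %>%
    dplyr::select(day, N, ...) %>%
    # across(c(...)) re-selects the passed variables for the row sums
    dplyr::mutate(s1 = rowSums(dplyr::across(c(...)), na.rm = TRUE))
}
mkr.sumsetting(col1, col2, dataset = df1)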
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
                  col2 = runif(20))

Add fake data to a data frame based on variable condition

Good afternoon,
I have to add dummy data to a dataframe whenever a specific variable is absent from several given intervals.
require(plyr)
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
for(x in df$length){
  if (x <= 0 | x > 1e+07) {
    df <- rbind.fill(df, data.frame(length = c(5000000), grade = c(1000)))
  }
}
This works fine, but I am having trouble checking whether x is absent from each “length” interval from 0 to 1e+08, with a step of 1e+07, and adding “1000” to “grade” when that is the case. I tried a lot of things, and in the end my data frame is only 1 row larger.
After that, I will create subgroups based on these intervals and I need a value for each subgroup.
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
In the end, the objective is to still get an empty space on a boxplot for each condition where there is no data, as the “1000“ I added is way above the limit threshold.
The next step will be to do the same but for each “company” variable.
I hope I am clear, sorry for my English.
Thanks
You can do it using dplyr and tidyr.
First, cut your df$length:
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
Now we can use dplyr to left_join on all the levels of length; we then complete the length/company combinations, filter out any NA companies, and change the NA grades to 1000:
library(dplyr)
library(tidyr)
df %>% left_join(data.frame(length = levels(df$length)), .) %>%
  complete(length, company) %>%
  filter(!is.na(company)) %>%
  mutate(grade = ifelse(is.na(grade), 1000, grade))
Source: local data frame [10 x 3]

          length company grade
          (fctr)  (fctr) (dbl)
1      (0,1e+07]    TEST  1000
2  (1e+07,2e+07]    TEST  1000
3  (2e+07,3e+07]    TEST  1000
4  (3e+07,4e+07]    TEST  1000
5  (4e+07,5e+07]    TEST  1000
6  (5e+07,6e+07]    TEST  1000
7  (6e+07,7e+07]    TEST  1000
8  (7e+07,8e+07]    TEST  1000
9  (8e+07,9e+07]    TEST  1000
10 (9e+07,1e+08]    TEST  1000

Copying data between groups in a grouped df

I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
                                 c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
  group_by(group) %>%
  mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group

  group       var   total
1     1 1.3697267 4.74936
2     1 1.5263502 4.74936
3     1 0.4065596 4.74936
4     1 1.4467237 4.74936
5     2        NA 0.00000
6     2        NA 0.00000
7     2        NA 0.00000
8     2        NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
  if( all(x$total == 0) ) { x$var <- df[df$group == 1, 2] }
  x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a data.frame. Second, your function CheckDay(x) mixes the local variable x (to which you pass df) with the global variable df itself; it's better to keep everything inside the function local. Finally, your call to do is missing the (.) part. Try this, this should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
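A minimal sketch of those steps in base R (my addition; it assumes equal-sized, contiguous groups, as in the synthetic example above):
group_size <- 4                                   # rows per group in the example
df.shift <- df
df.shift$var <- c(rep(NA, group_size), head(df$var, -group_size))
idx <- which(df$total == 0)                       # rows in groups whose sum is zero
df$var[idx] <- df.shift$var[idx]                  # copy values from the previous group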
