I would like to apply a certain function (namely AddLags from below) to groups of a dataframe. To achieve this, I am trying to use two consecutive map_dfr (piping one to another), so as to apply the respective filters. For the last step, I am applying the custom function (mentioned earlier) - using map_dfr (to capture the newly calculated output data in a new object).
The code I have so far is as follows:
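library(tidyverse)
library(lubridate) # for today()
library(magrittr) # for %<>% in the loop version
# dummy dataset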
df <- data.frame(
date = seq(today(),length.out=12,by='month'),
dim1 = c('a','a','a','b','b','b','c','c','c','d','d','d'),
dim2 = c(1,1,1,1,1,1,2,2,2,2,2,2),
value = 1:12
)
# function to apply
AddLags <- function(df,lags_vector,target_col,date_col){
temp_lags <- map_dfc(lags_vector,
~ df %>%
arrange({{date_col}}) %>%
transmute(
across(contains(target_col), lag, .x, .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
)
)
return(temp_lags)
}
# prepare for map_dfr approach
lags_features <- c(1,2)
dims1 <- df %>% pull(dim1) %>% unique %>% sort
dims2 <- df %>% pull(dim2) %>% unique %>% sort
# what I am struggling with
map_dfr(dims1,
~ df %>%
filter(dim1==.x) %>%
map_dfr(dims2,
~ . %>%
filter(dim2==.x) %>%
AddLags(lags_features,variable,date)
)
)
# what the loop version would look like
gather_results <- data.frame()
for(d1 in dims1){
for(d2 in dims2){
tempdata <- df %>% filter(dim1==d1,dim2==d2) %>% arrange(date)
temp <- AddLags(tempdata,lags_features,'value',date)
gather_results %<>% bind_rows(temp)
}
}
In essence, I am traversing through the different groups (through filtering) and applying the custom function respectively, while trying to use map_dfr to consolidate the newly calculated results.
I would like to know how to achieve the above (assuming that is feasible) and what am I missing since for the time being all I get back is an empty dataframe.
BONUS QUESTION:
As I am writing this, I realize that there has to be a better way of doing this than looping, for instance using a group_by. However, given the nature of the problem and the fact that the function outputs new data, I am not sure what this would look like (assuming it is feasible to begin with). So, any kind of suggestion/alternative/best practice would be much appreciated.
DISCLAIMER:
I am a big noob when it comes to purrr functionality and not much of an experienced dplyr user either, so kindly forgive my ignorance.
Is this the expected output?
library(tidyverse)
library(lubridate)
group_split(df, dim1, dim2) %>%
map_dfr(~ .x %>% AddLags(1:2, "value", date))
#> # A tibble: 12 × 2
#> value_lag_01 value_lag_02
#> <int> <int>
#> 1 NA NA
#> 2 1 NA
#> 3 2 1
#> 4 NA NA
#> 5 4 NA
#> 6 5 4
#> 7 NA NA
#> 8 7 NA
#> 9 8 7
#> 10 NA NA
#> 11 10 NA
#> 12 11 10
Data:
# dummy dataset
df <- data.frame(
date = seq(today(), length.out = 12, by = "month"),
dim1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
dim2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
value = 1:12
)
# function to apply
AddLags <- function(df, lags_vector, target_col, date_col) {
temp_lags <- map_dfc(
lags_vector,
~ df %>%
arrange({{ date_col }}) %>%
transmute(
across(contains(target_col), lag, .x, .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
)
)
return(temp_lags)
}
Created on 2022-01-13 by the reprex package (v2.0.1)
As @Limey suggested, a possible way would be to use the group_map function:
results_df <- data.frame()
results_df <-
bind_rows(
df %>%
group_by(dim1,dim2) %>%
group_map(~AddLags(.,c(1,2),'value',date))
)
And the expected results would be :
value_lag_01 value_lag_02
<int> <int>
1 NA NA
2 1 NA
3 2 1
4 NA NA
5 4 NA
6 5 4
7 NA NA
8 7 NA
9 8 7
10 NA NA
11 10 NA
12 11 10
However, I personally would go with @jpdugo17's approach.
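Regarding the bonus question, a grouped mutate() may be enough when the lags only need to sit next to the original rows. The following is a minimal sketch and not the approach from either answer above. The fixed start date and the hard-coded column names value_lag_01 and value_lag_02 are assumptions, chosen so the result mirrors the output shown above.

```r
library(dplyr)

# Same shape as the dummy dataset, but with a fixed start date (assumption)
df <- data.frame(
  date  = seq(as.Date("2022-01-01"), length.out = 12, by = "month"),
  dim1  = rep(c("a", "b", "c", "d"), each = 3),
  dim2  = rep(c(1, 2), each = 6),
  value = 1:12
)

lagged <- df %>%
  group_by(dim1, dim2) %>%
  # sort by date inside each (dim1, dim2) group
  arrange(date, .by_group = TRUE) %>%
  # dplyr::lag() is applied per group, so lags never cross group borders
  mutate(
    value_lag_01 = dplyr::lag(value, 1),
    value_lag_02 = dplyr::lag(value, 2)
  ) %>%
  ungroup()

lagged
```

Unlike the map_dfr and group_map versions, this keeps date, dim1 and dim2 in the result, which makes it easier to see which group each lag belongs to.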
Consider the following tibble:
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA))
Suppose I want to make a row-wise sum of "x" and "y", creating variable "z", like this:
data %>%
rowwise() %>%
mutate(z = sum(c(x,y), na.rm = T))
This works fine for what I want, but the problem is that my true dataset has many variables and I did not want to check before what variables I have and what I do not have. So, suppose I may have variables that do not exist among the elements of the sum:
data %>%
rowwise() %>%
mutate(k = sum(c(x,y,w), na.rm = T))
In this case, it will not run, because column "w" does not exist.
How can I make it run anyway, ignoring the non-existence of "w" and summing over "x" and "y"?
PS: I prefer to do it without filtering the dataset before running the sum. I would like to somehow make the sum happen in any case, whether variables exist or not.
If I understood your problem correctly, this would be a solution (a slight modification of @Duck's comment):
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA),
a = c(rnorm(1,1,n = 10)*1000,NA,NA))
wishlist <- c("x","y","w")
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Sum=sum(c_across(colnames(data)[colnames(data) %in% wishlist]),na.rm=T))
x y a Sum
<dbl> <dbl> <dbl> <dbl>
1 3496. 439. -47.7 3935.
2 6046. 460. 2419. 6506.
3 6364. 672. 1030. 7036.
4 1068. 1282. 2811. 2350.
5 2455. 990. 689. 3445.
6 6477. -612. -1509. 5865.
7 7623. 1554. 2828. 9177.
8 5120. 482. -765. 5602.
9 1547. 1328. 817. 2875.
10 5602. -1019. 695. 4582.
11 NA NA NA 0
12 1000 NA NA 1000
Try this:
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA))
# intersect() keeps only the wished-for columns that actually exist, selected by name
data$k <- rowSums(as.data.frame(data[,intersect(c("x","y","w"),names(data))]),na.rm=TRUE)
Output:
# A tibble: 12 x 3
x y k
<dbl> <dbl> <dbl>
1 3121. 934. 4055.
2 6523. 1477. 8000.
3 5538. 863. 6401.
4 3099. 1344. 4443.
5 4241. 284. 4525.
6 3251. -448. 2803.
7 4786. -291. 4495.
8 4378. 910. 5288.
9 5342. 653. 5996.
10 4772. 1818. 6590.
11 NA NA 0
12 1000 NA 1000
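The same idea can also be written with the tidyselect helper any_of(), which silently skips names that are not present. This is a minimal sketch with a small deterministic tibble standing in for the random data above (an assumption made so the result is easy to check by hand).

```r
library(dplyr)

# Small deterministic stand-in for the random tibble used above (assumption)
data <- tibble(x = c(1, 2, NA), y = c(10, NA, NA))

# Columns we would like to sum; "w" does not exist in data
wishlist <- c("x", "y", "w")

# any_of() keeps only the names that are present and ignores the rest,
# so the missing column "w" does not raise an error
data$k <- rowSums(select(data, any_of(wishlist)), na.rm = TRUE)

data
```

Here k comes out as 11, 2 and 0: the last row has only NA values, so with na.rm = TRUE its sum is 0, as in the outputs above.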
I have a situation where my data is distributed between two dataframes. I need to subset the data from one of the dataframes first, and then conduct a t-test between this subset and the (entire) data from the other dataframe.
I attempted to use %>% and group_by() to select the data I want, and then I tried to invoke the t-test as shown below.
library(dplyr)
a <- c("AA","AA","AA","AB","AB","AB")
b <- c(1,2,3,1,2,3)
c <- c(12,34,56,78,90,12)
cols1 <- c("SampID", "Reps", "Vals")
df1 <- data.frame(a,b,c)
colnames(df1) <- cols1
df1
SampID Reps Vals
1 AA 1 12
2 AA 2 34
3 AA 3 56
4 AB 1 78
5 AB 2 90
6 AB 3 12
e <- c(1,2,3,4,5,6,7,8,9)
f <- c(11,22,33,44,55,66,77,88,99)
cols2 <- c("CtrlReps","CtrlVals")
df2 <- data.frame(e,f)
colnames(df2) <- cols2
df2
CtrlReps CtrlVals
1 1 11
2 2 22
3 3 33
4 4 44
5 5 55
6 6 66
7 7 77
8 8 88
9 9 99
df1 %>%
group_by(SampID) %>%
t.test(Vals, df2$CtrlVals, var.equal = FALSE)
This, however, returns an error:
Error in match.arg(alternative) :
'arg' must be NULL or a character vector
I also tried using do but that returns an error as well:
outputs <- df1 %>%
group_by(SampID) %>%
do(tpvals = t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE)) %>%
summarise(SampID, pvals = tpvals$p.value)
Error in t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE) :
object 'Vals' not found
I am new to R and have exhausted my Google-Fu, so I have no idea what is happening. To the best of my knowledge these two errors are unrelated, but I think resolving either one would give me a way out of the situation. I just don't know how. I am also sure that resolving this problem would immediately land me in the next problem (the one this post actually addresses).
Your inputs/guidance/help would be much appreciated!
Your attempt with do was close. It can be fixed like this:
outputs <- df1 %>%
group_by(SampID) %>%
do(tpvals = t.test(.$Vals, df2$CtrlVals,
paired = FALSE, var.equal = FALSE)) %>%
summarise(SampID, pvals = tpvals$p.value)
You need .$Vals to get at the Vals column within do, because do does not evaluate column names the way mutate does. The data argument for t.test also isn't useful here: the two variables are not in the same dataframe, so you can't put them both in a formula.
Result:
> outputs
# A tibble: 2 x 2
SampID pvals
<fct> <dbl>
1 AA 0.253
2 AB 0.862
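In recent dplyr versions do() is marked as superseded, and the same per-group test can be written with summarise(), where bare column names refer to the current group. This is a minimal sketch that rebuilds df1 and df2 from the question; only the construction of the data frames is condensed.

```r
library(dplyr)

# Same data as in the question, built in one step each
df1 <- data.frame(
  SampID = rep(c("AA", "AB"), each = 3),
  Reps   = rep(1:3, times = 2),
  Vals   = c(12, 34, 56, 78, 90, 12)
)
df2 <- data.frame(
  CtrlReps = 1:9,
  CtrlVals = seq(11, 99, by = 11)
)

outputs <- df1 %>%
  group_by(SampID) %>%
  # Vals is the current group's column; df2$CtrlVals is the full control vector
  summarise(pvals = t.test(Vals, df2$CtrlVals, var.equal = FALSE)$p.value)

outputs
```

Each group is compared against all of df2$CtrlVals, so the result has one row and one p-value per SampID.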
I have a question that I find kind of hard to explain with a MRE and in an easy
way to answer, mostly because I don't fully understand where the problem lies
myself. So that's my sorry for being vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe way. Right now I do it with a
really ugly workaround where I add empty (NA) newly created columns to the
sample tibble, then do it with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# whether the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
select(-r45) %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
spread(isref, r44) %>%
group_by(timestamp) %>%
mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
data %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
gather(r, value, r44:r45) %>%
unite(ru, r, isref, sep = "_") %>%
spread(ru, value) %>%
group_by(timestamp) %>%
mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
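In current tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(). As a hedged sketch, the r44-only version of the reshaping above could be written as follows. The data frame is the question's fake data without the random r45 column (dropped here so the result is reproducible).

```r
library(dplyr)
library(tidyr)

# Fake data from the question, r44 only
data <- data.frame(
  timestamp = rep(c("a", "b", "c"), each = 7),
  cycle     = rep(c(rep(1:3, 2), 4), 3),
  isref     = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  r44       = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)
)

wide <- data %>%
  mutate(isref = if_else(isref, "REF", "SAMP")) %>%
  # one row per (timestamp, cycle), with REF and SAMP intensities side by side
  pivot_wider(names_from = isref, values_from = r44) %>%
  group_by(timestamp) %>%
  # per timestamp: the cycle at which the reference r44 matches each sample r44
  mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
  ungroup()

wide
```

Sample intensities outside the range of the reference intensities get NA, because approx() does not extrapolate by default.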
Greetings: I am new to dplyr and having some challenges formatting my output. Here is a code snippet that produces some reproducible data, using melt to get it into the shape I need.
set.seed(1234)
library(reshape2)
library(dplyr)
val <- c(0:1)
a <- sample(val, 99, replace=T)
b <- sample(val, 99, replace=T)
c <- sample(val, 99, replace=T)
d <- sample(val, 99, replace=T)
dat <- data.frame(a,b,c,d)
melt.dat <- melt(dat)
Now, I can perform the desired summary:
SummaryTable <- melt.dat %>%
group_by(variable) %>%
summarise_each(funs(sum, sum/n()))
Here is my output:
variable sum *
1 a 50 50.50505
2 b 58 58.58586
3 c 46 46.46465
4 d 46 46.46465
My ideal output would be something as follows. I am unable to figure out how to specify my column names in the summarise_each or melt functions, set the decimal place and suppress the row numbers. I've spent a long time getting this far, and just can't seem to get the rest figured out!
Letter Count Percent
a 50 50.5
b 58 58.6
c 46 46.5
d 46 46.5
Not sure whether it's possible within dplyr to suppress rownames (numbering), but here's how you could get the names and formatting right:
options(digits = 3)
melt.dat %>%
group_by(Letter = variable) %>%
summarise_each(funs(Count = sum(.), Percent = sum(.)/n()*100), -variable)
#Source: local data frame [4 x 3]
#
# Letter Count Percent
#1 a 45 45.5
#2 b 51 51.5
#3 c 52 52.5
#4 d 48 48.5
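summarise_each() and funs() are deprecated in current dplyr, so here is a hedged sketch of the same summary with plain summarise(). It uses tidyr::pivot_longer() in place of melt(), which is a swapped-in technique rather than the original answer's method. Converting to a base data frame and printing with row.names = FALSE is one way to suppress the row numbers.

```r
library(dplyr)
library(tidyr)

set.seed(1234)
dat <- data.frame(
  a = sample(0:1, 99, replace = TRUE),
  b = sample(0:1, 99, replace = TRUE),
  c = sample(0:1, 99, replace = TRUE),
  d = sample(0:1, 99, replace = TRUE)
)

summary_table <- dat %>%
  # long format: one row per (Letter, value) pair
  pivot_longer(everything(), names_to = "Letter", values_to = "value") %>%
  group_by(Letter) %>%
  # mean of a 0/1 column is the share of ones; round to one decimal place
  summarise(Count = sum(value), Percent = round(100 * mean(value), 1))

# print without row numbers
print(as.data.frame(summary_table), row.names = FALSE)
```

The exact counts depend on the R version's random number generator, which is also why the numbers in the question and the answer above differ.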