Consider the following tibble:
library(tidyverse)
data <- tibble(x = c(rnorm(5, 2, n = 10) * 1000, NA, 1000),
               y = c(rnorm(1, 1, n = 10) * 1000, NA, NA))
Suppose I want to make a row-wise sum of "x" and "y", creating variable "z", like this:
data %>%
  rowwise() %>%
  mutate(z = sum(c(x, y), na.rm = TRUE))
This works fine, but my real dataset has many variables, and I do not want to check in advance which of them exist. So suppose some of the variables in the sum do not exist in the data:
data %>%
  rowwise() %>%
  mutate(k = sum(c(x, y, w), na.rm = TRUE))
In this case, it will not run, because column "w" does not exist.
How can I make it run anyway, ignoring the non-existence of "w" and summing over "x" and "y"?
PS: I prefer to do it without filtering the dataset before running the sum. I would like to somehow make the sum happen in any case, whether variables exist or not.
If I understood your problem correctly, this would be a solution (a slight modification of @Duck's comment):
library(tidyverse)
data <- tibble(x = c(rnorm(5, 2, n = 10) * 1000, NA, 1000),
               y = c(rnorm(1, 1, n = 10) * 1000, NA, NA),
               a = c(rnorm(1, 1, n = 10) * 1000, NA, NA))
wishlist <- c("x","y","w")
data %>%
  dplyr::rowwise() %>%
  dplyr::mutate(Sum = sum(c_across(colnames(data)[colnames(data) %in% wishlist]), na.rm = TRUE))
x y a Sum
<dbl> <dbl> <dbl> <dbl>
1 3496. 439. -47.7 3935.
2 6046. 460. 2419. 6506.
3 6364. 672. 1030. 7036.
4 1068. 1282. 2811. 2350.
5 2455. 990. 689. 3445.
6 6477. -612. -1509. 5865.
7 7623. 1554. 2828. 9177.
8 5120. 482. -765. 5602.
9 1547. 1328. 817. 2875.
10 5602. -1019. 695. 4582.
11 NA NA NA 0
12 1000 NA NA 1000
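A hedged variant of the same idea: tidyselect's any_of() selects only the wishlist names that actually exist in the data, which avoids the manual %in% lookup.
data %>%
  dplyr::rowwise() %>%
  # any_of() silently drops wishlist entries that are not columns of data
  dplyr::mutate(Sum = sum(c_across(any_of(wishlist)), na.rm = TRUE))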
Try this:
library(tidyverse)
data <- tibble(x = c(rnorm(5, 2, n = 10) * 1000, NA, 1000),
               y = c(rnorm(1, 1, n = 10) * 1000, NA, NA))
# intersect() keeps only the wishlist columns that actually exist in the data
data$k <- rowSums(as.data.frame(data[, intersect(c("x", "y", "w"), names(data))]), na.rm = TRUE)
Output:
# A tibble: 12 x 3
x y k
<dbl> <dbl> <dbl>
1 3121. 934. 4055.
2 6523. 1477. 8000.
3 5538. 863. 6401.
4 3099. 1344. 4443.
5 4241. 284. 4525.
6 3251. -448. 2803.
7 4786. -291. 4495.
8 4378. 910. 5288.
9 5342. 653. 5996.
10 4772. 1818. 6590.
11 NA NA 0
12 1000 NA 1000
I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
                  Digits_Upper = 6:10,
                  random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2,
                 y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat, similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I am hoping to apply a more involved expression to each of these columns, using the values in its corresponding row of cb. The solution should be applicable to a large dataset. I have created this for-loop so far.
dat.loop <- dat
for (i in seq_len(nrow(cb))) {
  # create new columns from the Digits column of `cb`
  dat.loop[paste("disp", cb$Digits[i], sep = ".")] <-
    # some operation using every value in a column in `dat` with its corresponding row in `cb`
    (dat.loop[, cb$Digits[i]] - cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the columns appended to dat in dat.loop using a similar for-loop, and then perform yet another operation on those values. My dataset is very large, so I imagine chained for-loops will become cumbersome. I am wondering:
Would another method, such as data.table or the tidyverse, improve efficiency?
How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code that operates on columns in dat with their corresponding rows in cb. Ideally, I would split my for-loop into functions that, for example, avoid indexing into cb for the same values over and over again, or avoid appending unnecessary data to my data frame, but I'm not really sure how to do this.
Any help is appreciated!
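For illustration, a hedged base-R sketch of the same computation, assuming dat and cb as defined above: Map() walks the rows of cb in a single pass, so cb is not re-indexed on every iteration. dat.loop2 is a hypothetical name used only for comparison.
# one pass over cb's rows: each new column is (dat[[Digits]] - y) * x
new_cols <- Map(function(col, x, y) (dat[[col]] - y) * x,
                cb$Digits, cb$x, cb$y)
names(new_cols) <- paste("disp", cb$Digits, sep = ".")
dat.loop2 <- cbind(dat, new_cols)  # should match dat.loop above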
EDIT:
I've modified the code @Desmond provided to make it more generic, since dat and cb will come from user-input files and dat can have a varying number of columns/column names to operate on (the columns in dat will always start with "Digits_" and will be specified in the "Digits" column of cb).
library(tidytable)
results <- dat %>%
  crossing.(cb) %>%
  mutate_rowwise.(disp = (get(Digits) - y) * x) %>%
  pivot_wider.(names_from = Digits,
               values_from = disp,
               names_prefix = "disp_")
results2 <- results %>%
  fill.(starts_with("disp"), .direction = "downup", .by = 'random') %>%
  select.(-c(x, y)) %>%
  distinct.()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Here's a tidyverse solution:
- crossing() generates combinations from both datasets
- case_when() applies your logic
- pivot_wider(), filter(), and bind_cols() clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), and so on.
library(tidyverse)
dat <- data.frame(
  Digits_Lower = 1:5,
  Digits_Upper = 6:10,
  random = 20:24
)
cb <- data.frame(
  Digits = c("Digits_Lower", "Digits_Upper"),
  x = 1:2,
  y = 3:4
)
results <- dat |>
  crossing(cb) |>
  mutate(disp = case_when(
    Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
    Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
  )) |>
  pivot_wider(names_from = Digits,
              values_from = disp,
              names_prefix = "disp_")
results |>
  filter(!is.na(disp_Digits_Lower)) |>
  select(-c(x, y, disp_Digits_Upper)) |>
  bind_cols(results |>
              filter(!is.na(disp_Digits_Upper)) |>
              select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)
I cannot work this one out.
I have an incomplete dataset (many rows and variables) with one factor that specifies whether all the other variables are pre- or post- something. I need to get summary statistics for all variables pre- and post-, only including rows where the pre- AND post- values are not NA.
I am trying to find a way to replace existing values with NA if the set is incomplete, separately for each variable.
The following is a simple example of what I am trying to achieve:
df <- data.frame(
  id = c(1, 1, 2, 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20),
  var3change = c(5, 10, 15, 20),
  var4change = c(NA, 2, 3, 8)
)
which leads to:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 2
3 2 1 NA 15 3
4 2 2 20 20 8
My desired output would be:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 NA
3 2 1 NA 15 3
4 2 2 NA 20 8
I have many more variables to deal with, and the set is incomplete in a different way for each variable. I have the feeling this can be achieved with smart use of existing functions from the plyr/tidyr packages, but I cannot find an elegant way to apply those concepts to my problem.
Any help would be appreciated.
You can group by id and, if any value in a group is NA, replace all of that group's values with NA. To apply a function to multiple columns, we use across().
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(across(starts_with('var'), ~ if (any(is.na(.))) NA else .))
# for dplyr < 1.0.0 we can use `mutate_at`
# mutate_at(vars(starts_with('var')), ~ if (any(is.na(.))) NA else .)
# id myfactor var2change var3change var4change
# <dbl> <fct> <dbl> <dbl> <dbl>
#1 1 1 10 5 NA
#2 1 2 10 10 NA
#3 2 1 NA 15 3
#4 2 2 NA 20 8
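For comparison, a hedged base-R equivalent of the same rule, assuming the df above: ave() applies the all-or-NA logic within each id group, one column at a time.
# blank out a group's values when any of them is NA
vars <- grep("^var", names(df), value = TRUE)
df[vars] <- lapply(df[vars], function(x)
  ave(x, df$id, FUN = function(v) if (anyNA(v)) v * NA else v))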
It would help to have a grouping variable (group) as well as your time variable (myfactor). Then you can do some finagling to create the variables you want with dplyr.
library(dplyr)
df <- data.frame(
  group = rep(c(1, 2), each = 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20)
)
df %>%
  group_by(group) %>%
  mutate(var3change = all(!is.na(var2change)),
         var4change = if_else(var3change, var2change, as.numeric(NA)))
I'm assuming that the dataset you have is ordered, so each pair of observations is grouped by their row index.
By default, the mean() function will return an NA if any of the inputs to it are NA. This is therefore a neat way of getting an NA by group, using dplyr.
library(dplyr)
df <- data.frame(
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20)
)
# 1. Create an ID variable that groups the rows in consecutive pairs
df$id <- floor((seq_len(nrow(df)) - 1) / 2)
# 2. Flag each group: mean() returns NA if any value in the group is NA
df <- df %>%
  group_by(id) %>%
  mutate(var_changed = mean(var2change))
If you have an explicit ID variable in your data, you can replace the first part of this solution.
EDIT: Doing this for multiple variables (based on the change to the question):
df <- data.frame(
  id = c(1, 1, 2, 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20),
  var3change = c(5, 10, 15, 20),
  var4change = c(NA, 2, 3, 8)
)
for (col in 2:4) {
  col <- paste0("var", col, "change")
  df <- df %>%
    group_by(id) %>%
    mutate(new_col = mean(get(col)))
  # mean() is NA whenever the group contains an NA, so blank those rows out
  df[["new_col"]] <- ifelse(is.na(df[["new_col"]]), NA, df[[col]])
  df[col] <- NULL
  names(df)[names(df) == "new_col"] <- col
}
If speed is an issue, you could speed this up by moving the group_by() outside the loop.
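A hedged sketch of that suggestion, assuming dplyr >= 1.0: group once and handle every var* column in a single mutate(across()), still relying on mean() returning NA for an incomplete group.
df %>%
  group_by(id) %>%
  # a length-1 NA is recycled across the whole group; otherwise keep the column
  mutate(across(starts_with("var"), ~ if (is.na(mean(.x))) NA_real_ else .x)) %>%
  ungroup()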
There are various questions on Stack Overflow regarding this, but I have been unable to find a solution to my question, which follows.
Suppose I have a data frame (or tibble) df with two columns, say X1 and X2. I have a function, say f, which takes inputs X1 and X2 and outputs a vector, say [V1, V2].
Now, if the output were a singleton, then I would be able to write
df %>% mutate(V = f(X1,X2))
to add a column labelled V to my df, and the entry would be f(X1,X2). However, I want to add two columns, V1 and V2. I do not know how to do this.
Of course, I could do something like
df %>% mutate(V1 = f(X1,X2)[1], V2 = f(X1,X2)[2]),
but this (I assume) involves calling the function f twice; I have a large data set, and would rather not call it twice.
Alternatively, I could do
df %>% mutate(V_list = as.list(f(X1,X2)), V1 = V_list[[1]], V2 = V_list[[2]]) %>% select(-V_list),
but this seems like a rather clunky way, and I'd rather not.
Further, I would eventually like to apply this to a grouped tibble, in which case the naive way of writing this would duplicate V_list for each entry in the group. As such, ideally any answer would be 'vectorisable', in the following sense.
Suppose I have done df %>% group_by(var1) and have a function f which takes a data frame with two columns as its input -- this should be thought of as 'a vector of pairs' -- and then outputs a new data frame with two columns.
Here is some code to set up the example.
library(dplyr)
df = tibble(var1 = c(1,1,2,2), X1 = c(1,2,3,4), X2 = c(5,6,7,8))
f = function(sub_df, var) { return( data.frame(x1 = (sub_df$X1 + sub_df$X2)^var, x2 = (sub_df$X1 - sub_df$X2)^var) ) }
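For that grouped case, a minimal hedged sketch using group_modify() (dplyr >= 0.8.1), assuming the f defined just above: each group's rows are handed to f as a data frame and the resulting columns are bound back on.
df %>%
  group_by(var1) %>%
  # .x is the group's rows without var1; f returns two new columns for them
  group_modify(~ bind_cols(.x, f(.x, var = 2)))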
If your function outputs a data.frame, it will be auto-spliced into new columns by mutate():
library(dplyr, warn.conflicts = FALSE)
df = tibble(var1 = c(1,1,2,2), X1 = c(1,2,3,4), X2 = c(5,6,7,8))
f = function(x1,x2) tibble(a = x1 + x2, b = x1 - x2)
df %>%
  mutate(f(X1, X2))
#> # A tibble: 4 × 5
#> var1 X1 X2 a b
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 5 6 -4
#> 2 1 2 6 8 -4
#> 3 2 3 7 10 -4
#> 4 2 4 8 12 -4
Created on 2021-09-16 by the reprex package (v2.0.1)
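A hedged side note: if you name the result instead, mutate() creates a packed data-frame column, which tidyr::unpack() then splits back into ordinary columns.
df %>%
  mutate(V = f(X1, X2)) %>%  # V becomes a data-frame column holding a and b
  tidyr::unpack(V)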
Or, if your function outputs a vector, you can use purrr::map2() with tidyr::unnest_wider().
Modify the function so its output is named:
f = function(x1,x2) c(a = x1 + x2, b = x1 - x2)
Then create a new column that is a list containing a vector for each row, and apply unnest_wider() to this column to split the vector elements into their own columns.
df %>%
  mutate(new = map2(X1, X2, f)) %>%
  unnest_wider(new)
# # A tibble: 4 x 5
# var1 X1 X2 a b
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 5 6 -4
# 2 1 2 6 8 -4
# 3 2 3 7 10 -4
# 4 2 4 8 12 -4
This may not be an ideal solution, but I have faced this situation, and this is what I usually do: return a delimiter-separated string from the function and then separate the column on that delimiter.
f = function(x1, x2) { return( toString(c(x1 + x2, x1 - x2)) ) }
library(tidyverse)
df %>%
  mutate(new = map2_chr(X1, X2, f)) %>%
  separate(new, c("col1", "col2"), sep = ",", convert = TRUE)
# A tibble: 2 x 4
# X1 X2 col1 col2
# <dbl> <dbl> <int> <int>
#1 1 3 4 -2
#2 2 4 6 -2
I have a question that I find kind of hard to explain with an MRE and hard to answer, mostly because I don't fully understand where the problem lies myself. So that is my apology in advance for being vague.
I have a tibble with many sample and reference measurements, and I want to do some linear interpolation for each sample. I currently do this by taking out all the reference measurements, rescaling them to the sample measurements using approx, and then patching them back in. But because I take them out first, I cannot do it nicely in a grouped dplyr pipeline. Right now I use a really ugly workaround where I add empty (NA) columns to the sample tibble and then fill them with a for-loop.
So my question really is: how can I implement the approx part within groups inside the pipe, so that I can do everything within groups? I've experimented with dplyr::do() and read the vignette on "programming with dplyr", but searching mostly turns up broom::augment and lm material that I think operates differently (e.g. see Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on IRC recommended a conditional mutate with case_when(), but I don't fully understand where and how that fits in this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
  # in reality a dttm with the measurement time
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  # measurement cycle, normally 40 for sample, 41 for reference
  cycle = rep(c(rep(1:3, 2), 4), 3),
  # whether the measurement is a reference or a sample
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  # measurement intensity for mass 44
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
  # measurement intensity for mass 45, normally also masses up to mass 49
  mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
  ggplot(aes(x = cycle, y = r44, colour = isref)) +
  geom_line() +
  geom_line(aes(y = r45), linetype = 2) +
  geom_point() +
  geom_point(aes(y = r45), shape = 1) +
  facet_grid(~ timestamp)
# what I would like to do
data %>%
  group_by(timestamp) %>%
  do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
                           y = data %>% filter(isref) %>% pull(cycle),
                           xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
  unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
  # split the data into sample gas and reference gas
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)
  # calculate the "target cycle", the points at which the reference intensity
  # 44 matches the sample intensity 44 with linear interpolation
  target_cycle <- approx(x = ref$r44, y = ref$cycle, xout = smp$r44)
  # append the target cycle to the sample gas
  smp <- smp %>%
    group_by(timestamp) %>%
    mutate(target = target_cycle$y)
  # linearly interpolate each reference gas to the target cycle
  ref <- ref %>%
    group_by(timestamp) %>%
    # this is needed because the reference has one more cycle
    mutate(target = c(target_cycle$y, NA)) %>%
    # filter out all the failed ones (no interpolation possible)
    filter(!is.na(target)) %>%
    # calculate interpolated value based on r44 interpolation (i.e., don't
    # actually interpolate this value but shift it based on the 44
    # interpolation)
    mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
           r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
    select(timestamp, target, r44:r45)
  # add the new reference gas intensities to the correct sample gases by the target cycle
  left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because "target" must now be length 3 (the group size) or 1, not 9,
# I have to create this ugly for-loop,
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
  # keep only the sample gases (since we convert ref to sample)
  filter(!isref) %>%
  # add empty new columns
  mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
  mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
  select(-r45) %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  spread(isref, r44) %>%
  group_by(timestamp) %>%
  mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
  ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
df %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  gather(r, value, r44:r45) %>%
  unite(ru, r, isref, sep = "_") %>%
  spread(ru, value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
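As a hedged aside: gather()/unite()/spread() are superseded in tidyr >= 1.0, and the same reshape with the pivot functions might look like this (same df assumed):
df %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  pivot_longer(r44:r45, names_to = "r") %>%
  # column names like r44_REF / r44_SAMP are built from both names_from columns
  pivot_wider(names_from = c(r, isref), values_from = value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup()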
I am pretty new to R, so this question may be a bit naive.
I have got a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins, inside a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = RT), because I want this to be flexible.
I've tried several ways of referring to a column name via this variable, but I cannot get it to work. Amongst others, I have tried get(), eval(), and [[]].
Simplified example code:
Subject <- c(rep(1, 100), rep(2, 100))
RT <- runif(200, 300, 800)
data_st <- tibble(Subject, RT)
bin2use <- 'RT'
nbin <- 5
binned_data <- data_st %>%
  group_by(Subject) %>%
  mutate(
    Bin = cut_number(get(bin2use), nbin, label = FALSE)
  )
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin) {
  call <- lazyeval::interp(~ cut_number(a, b, label = FALSE),
                           a = as.name(colName), b = bin)
  data_st %>%
    group_by(Subject) %>%
    mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
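With current dplyr (>= 1.0), the lazyeval machinery is no longer needed; a hedged modern sketch, passing the column name string through the .data pronoun (cut_number() still comes from ggplot2):
binned_data <- data_st %>%
  group_by(Subject) %>%
  mutate(Bin = cut_number(.data[[bin2use]], nbin, label = FALSE))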