Dplyr mutate many columns with each new column conditional on two columns - r

My data frame is like the simple one below with many more columns and rows.
My goal is to add a new column for every "model" based on some ifelse() output using the matching pval and value_IC column.
Here, models are linear, beta and emax.
The closest problem I have found so far was here
https://community.rstudio.com/t/how-to-mutate-at-mutate-if-multiple-columns-using-condition-on-other-column-outside-vars-dplyr/17506/2
where there is always the same "second" column used.
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
pval.beta.orig = c(0.06, 0.02, 0.01),
pval.emax.orig = c(0.03, 0.01, 0.07),
value_IC.linear.orig = c(-5, NA, -4),
value_IC.beta.orig = c(NA, NA, -10),
value_IC.emax.orig = c(NA, -11, NA))
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig
1 0.01 0.06 0.03 -5 NA NA
2 0.06 0.02 0.01 NA NA -11
3 0.02 0.01 0.07 -4 -10 NA
If I only wanted it for one model, let's say beta, I would do this:
library(dplyr)
data_new <- data %>% mutate(conv.beta.orig = case_when(
pval.beta.orig > 0.025~ NA,
pval.beta.orig <= 0.025 & !(is.na(value_IC.beta.orig)) ~ TRUE,
pval.beta.orig <= 0.025 & is.na(value_IC.beta.orig) ~ FALSE))
data_new
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig conv.beta.orig
1 0.01 0.06 0.03 -5 NA NA NA
2 0.06 0.02 0.01 NA NA -11 FALSE
3 0.02 0.01 0.07 -4 -10 NA TRUE
to get the conv.beta.orig column. The column name does not have to be exactly in this format.
My problem now is to do so with all models I have each using the pval.MODEL.orig and value_IC.MODEL.orig column as above.
Thank you very much for your help!
This is the first question I have ever posted, so let me know if I should reformulate something, missed something, or overlooked an existing question on this problem.
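One way to generalise the single-model case to every model is to loop over the model names and build the column names with paste0(). The following is a minimal base R sketch rather than a dplyr solution. The names pval_cols, models, pval and ic are made up for illustration, and it assumes that every pval.MODEL.orig column has a matching value_IC.MODEL.orig column and that the p-values contain no NA.

```r
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
                   pval.beta.orig = c(0.06, 0.02, 0.01),
                   pval.emax.orig = c(0.03, 0.01, 0.07),
                   value_IC.linear.orig = c(-5, NA, -4),
                   value_IC.beta.orig = c(NA, NA, -10),
                   value_IC.emax.orig = c(NA, -11, NA))

# Extract the model names ("linear", "beta", "emax") from the pval columns
pval_cols <- grep("^pval\\.", names(data), value = TRUE)
models <- sub("^pval\\.(.*)\\.orig$", "\\1", pval_cols)

data_new <- data
for (m in models) {
  pval <- data[[paste0("pval.", m, ".orig")]]
  ic   <- data[[paste0("value_IC.", m, ".orig")]]
  # NA when pval > 0.025; otherwise TRUE if value_IC is present, FALSE if it is NA
  data_new[[paste0("conv.", m, ".orig")]] <- ifelse(pval > 0.025, NA, !is.na(ic))
}
data_new
```

For the beta model this reproduces the conv.beta.orig column shown above (NA, FALSE, TRUE), and it adds the analogous columns for linear and emax.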

Related

Replace matching column values in multiple dataframes with NA in R

I have a dataframe which is like the following:
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
target = rep(c("1", "2", "3"), times = 2),
eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 0.75 0.09
I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:
outliers1 <- data.frame(participant = c("01", "02"),
target = c("1", "3"),
eucDist = c(0.06, 0.75),
eucDist2 = c(0.09, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 02 3 0.75 0.09
outliers2 <- data.frame(participant = "01",
target = "1",
eucDist = 0.06,
eucDist2 = 0.09)
participant target eucDist eucDist2
1 01 1 0.06 0.09
The rows shown in outliers1 indicate outliers in the eucDist column in dat, and the row shown in outliers2 indicates an outlier in the eucDist2 column.
I would like to replace the outlier values in the eucDist and eucDist2 columns of dat with 'NA'. I do not want to remove whole rows because in many cases either the eucDist or eucDist2 values are usable, and removing the entire row would remove both variables.
Here is what I would like:
participant target eucDist eucDist2
1 01 1 NA NA
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 NA 0.09
I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:
library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
replace_with_na_all(dat$eucDist)
}
As you have the data frames in this format, you can set the values of the required data to NA in the new data frames, and then update the rows of the original data frames with these values using dplyr::rows_update(). This assumes you have at least dplyr v1.0.0.
library(dplyr)
outliers1$eucDist <- NA
outliers2$eucDist2 <- NA
dat |>
rows_update(outliers1, by = c("participant", "target")) |>
rows_update(select(outliers2, -eucDist), by = c("participant", "target"))
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09
I would recommend using the case_when function from the dplyr package. Note that this matches on the values themselves rather than on participant and target, so any row that happens to share a value with an outlier is also set to NA (here 0.09 occurs in eucDist2 for both row 1 and row 6).
library(dplyr)
dat_final <- dat %>%
  mutate(eucDist = case_when(
    eucDist %in% outliers1$eucDist ~ NA_real_,
    TRUE ~ eucDist
  ),
  eucDist2 = case_when(
    eucDist2 %in% outliers2$eucDist2 ~ NA_real_,
    TRUE ~ eucDist2
  ))
> str(dat_final)
'data.frame': 6 obs. of 4 variables:
$ participant: chr "01" "01" "01" "02" ...
$ target : chr "1" "2" "3" "1" ...
$ eucDist : num NA 0.16 0.89 0.1 0.11 NA
$ eucDist2 : num NA 0.04 0.03 0.05 1.45 NA
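Matching on the measured values alone can blank rows that merely share a value with an outlier. A minimal base R sketch that matches on participant and target instead is shown below. The helper row_key is hypothetical, and the outlier frames are trimmed to their two key columns for brevity; it assumes the combination of participant and target identifies a row uniquely.

```r
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
                  target = rep(c("1", "2", "3"), times = 2),
                  eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
                  eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
outliers1 <- data.frame(participant = c("01", "02"), target = c("1", "3"))
outliers2 <- data.frame(participant = "01", target = "1")

# Hypothetical helper: build one key string per row from the identifying columns
row_key <- function(d) paste(d$participant, d$target, sep = "_")

# Blank only the cells whose row key appears in the matching outlier frame
dat$eucDist[row_key(dat) %in% row_key(outliers1)] <- NA
dat$eucDist2[row_key(dat) %in% row_key(outliers2)] <- NA
dat
```

With the sample data this sets eucDist to NA in rows 1 and 6 and eucDist2 to NA in row 1 only, which is the result asked for in the question.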
rbind the two outlier frames, adding the column number (j) of the respective eucDist column in dat, then run a short for loop.
outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))
for (i in seq_len(nrow(outliers))) {
  dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
dat
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09

Remove leading zeros in numbers *within a data frame*

Edit: For anyone coming later: THIS IS NOT A DUPLICATE, since it explicitly concerns work on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors. But none of the descriptions I found seem applicable to data frames.
The same goes for the f_num function in the numform package: it handles "[a] vector of numbers (or string equivalents)", but does not seem to remove unwanted leading zeros across a data frame.
I am relatively new to R but understand that I could develop some (in my mind) complex code to drop leading zeros by subsetting vectors from a data frame and then combining those vectors into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret your question as asking how to convert each numeric cell of the data.frame into a "pretty-printed" string. That is possible with string substitution and a simple regular expression. (A good question, by the way: I do not know of any way to configure numeric output to suppress leading zeros without converting the numbers to strings.)
df2 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want to get a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly) you could use sprintf or format:
df3 <- data.frame(lapply(df, function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))), stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: This solution is not robust against a different decimal point character (other locales): it always expects a period as the decimal point.
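The two nested gsub calls can also be folded into one substitution with an optional minus sign. The sketch below is one possible way to wrap that in a reusable helper and apply it to the numeric columns only; drop_leading_zero, digits, is_num and df_fmt are illustrative names, not part of any package, and the small data frame is a cut-down version of the one in the question.

```r
df <- data.frame(est = c(0.05, -0.16, 0), low2.5 = c(0.01, -0.2, -0.03))

# Format with a fixed number of decimals, then drop the zero before the point.
# "^(-?)0\\." captures an optional minus sign so it can be put back via \\1.
drop_leading_zero <- function(x, digits = 2) {
  s <- sprintf(paste0("%.", digits, "f"), x)
  sub("^(-?)0\\.", "\\1.", s)
}

# Convert only the numeric columns; other columns are left untouched
is_num <- vapply(df, is.numeric, logical(1))
df_fmt <- df
df_fmt[is_num] <- lapply(df[is_num], drop_leading_zero)
df_fmt
```

As in the answer above, the result columns are character strings, so this is best done as a last step before printing or exporting.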

Return value in column 1 when value in column 2 exceeds 2 for 1st time

I have a dataframe called "new_dat" containing the time (days) in column t, and temperature data (and occasionally NA) in columns A - C (please see the example in the code below):
> new_dat
t A B C
1 0.00 0.82 0.88 0.46
2 0.01 0.87 0.94 0.52
3 0.02 NA NA NA
4 0.03 0.95 1.03 0.62
5 0.04 0.98 1.06 0.67
6 0.05 1.01 1.09 0.71
7 0.06 2.00 1.13 2.00
8 0.07 1.06 1.16 0.78
9 0.08 1.07 1.18 0.81
10 0.09 1.09 1.20 0.84
11 0.10 1.10 1.21 0.86
12 0.11 2.00 1.22 0.87
Here is a dput() of the dataframe:
structure(list(t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07,
0.08, 0.09, 0.1, 0.11), A = c(0.82, 0.870000000000001, NA,
0.949999999999999,
0.979999999999997, 1.01, 2, 1.06, 1.07, 1.09, 1.1, 2), B =
c(0.879999999999999,
0.940000000000001, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.2,
1.21, 1.22), C = c(0.460000000000001, 0.520000000000003, NA,
0.619999999999997, 0.669999999999998, 0.709999999999997, 2,
0.780000000000001,
0.809999999999999, 0.84, 0.859999999999999, 0.87)), .Names = c("t",
"A", "B", "C"), row.names = c(NA, 12L), class = "data.frame")
As output, I want a vector (list?) of the values of column t where the temperature reading from columns A-C >= 2 for the first time (and only the first time), or - if the temperature is never >= 2 - return the last time reading in column t (0.11 in my example). So 'A' would return the value 0.06 (and not 0.11), 'B' would have the value 0.11 and 'C' 0.06. I intended to use the vector generated to create a new dataframe something like this:
A B C
0.06 0.11 0.06
I'm inexperienced with R (and code in general), so, despite reading that looping can be inefficient (but not really understanding how to accomplish what I want without it), I tried to solve this by looping first by column and then by row as follows:
#create blank vector to add my results to
aer <- c()
#loop by column, then by row, adding values according to the if statement
for (c in 2:ncol(new_dat)){
c <- c
for (r in 1:nrow(new_dat)){
r <- r
if ((!is.na(new_dat[r,c] )) & (new_dat[r,c] >= 2)){
aer <- c(aer, new_dat$t[r])
}
}
}
This returns my vector, aer, as:
> aer
[1] 0.06 0.11 0.06
So it's returning both instances where 'A' is 2, and the one from column 'C'.
I don't know how to instruct the loop to stop and move to the next column after finding one instance where my 'if' statement is true. I also tried adding an 'else' to cover the situation where the temperature doesn't exceed 2:
else {
aer <- c(aer, new_dat$t[nrow(new_dat)])
But this did not work.
I would appreciate any help in completing the code, or suggestions for a better solution.
library(tidyverse)
new_dat %>%
gather(col, temp, -t) %>% # reshape data
na.omit() %>% # remove rows with NAs
group_by(col) %>% # for each column value
summarise(v = ifelse(is.na(first(t[temp >= 2])), last(t), first(t[temp >= 2]))) %>% # return the last t value if there are no temp >=2 otherwise return the first t with temp >= 2
spread(col, v) # reshape again
# # A tibble: 1 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.06 0.11 0.06
This solution will create the dataframe for you automatically, instead of returning a vector for you to create the dataframe yourself.
Here is a two-step solution.
First get an index vector of the values you want, then use that index vector to subset the dataframe.
inx <- sapply(new_dat[-1], function(x) {
w <- which(x >= 2)
if(length(w)) min(w) else NROW(x)
})
new_dat[inx, 1]
#[1] 0.06 0.11 0.06
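For completeness, the loop from the question can also be repaired directly. The sketch below is one possible fix, not the approach of either answer: it starts each column with the last time reading as a fallback and uses break to leave the inner loop at the first reading >= 2. The variable names col, hit and v are illustrative, and the data are the rounded values from the printed table.

```r
new_dat <- data.frame(
  t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11),
  A = c(0.82, 0.87, NA, 0.95, 0.98, 1.01, 2.00, 1.06, 1.07, 1.09, 1.10, 2.00),
  B = c(0.88, 0.94, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.20, 1.21, 1.22),
  C = c(0.46, 0.52, NA, 0.62, 0.67, 0.71, 2.00, 0.78, 0.81, 0.84, 0.86, 0.87)
)

aer <- numeric(0)
for (col in names(new_dat)[-1]) {
  # Fallback: the last time reading, used if the column never reaches 2
  hit <- new_dat$t[nrow(new_dat)]
  for (r in seq_len(nrow(new_dat))) {
    v <- new_dat[r, col]
    if (!is.na(v) && v >= 2) {
      hit <- new_dat$t[r]
      break  # stop at the first match and move on to the next column
    }
  }
  aer[col] <- hit
}
aer
```

Using && instead of & in the condition means the comparison is skipped when the value is NA, so the if statement never receives NA.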

How to include calculations in apply or rowsum?

I need to include some operations before summing the rows in my data frame. Here is an example:
df1 <- data.frame(
AC1Q = c(0.53, 0.57, 0.60, 0.51),
AC4Q = c(0.15, 0.12, 0.09,0.19),
AC2Q = c(0.09, 0.05, 0.07, 0.05),
AC3Q = c(0.23, 0.26, 0.23, 0.26)
)
df1
# AC1Q AC4Q AC2Q AC3Q
# 1 0.53 0.15 0.09 0.23
# 2 0.57 0.12 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
I want to get the row sums based on (sin(2*pi*(AC1Q-0.25)) + sin(2*pi*(-AC4Q+0.25)) - sin(2*pi*(AC2Q+0.25)) - sin(2*pi*(AC3Q-0.25)))/4. The result should be:
# 1 0.20
# 2 0.15
# 3 0.21
# 4 0.10
I am learning apply and tried apply(df1, 1, function(x) (sin(2*pi*(df1$AC1Q-0.25)) + sin(2*pi*(-df1$AC4Q+0.25)) - sin(2*pi*(-df1$AC2Q+0.25)) - sin(2*pi*(df1$AC3Q-0.25)))/4), but the result is wrong. I am not sure what I did wrong. I know I can always do the calculation for each column first, combine them into a data frame, and use rowsum. But is there a more efficient way to do it?
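Inside the function, x is a single row, so index x instead of df1, and keep the whole argument inside sin():
apply(df1, 1, function(x) (sin(2*pi*(x["AC1Q"]-0.25)) +
  sin(2*pi*(-x["AC4Q"]+0.25)) -
  sin(2*pi*(-x["AC2Q"]+0.25)) -
  sin(2*pi*(x["AC3Q"]-0.25)))/4)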
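Because sin() and the arithmetic operators are vectorised in R, the row-wise calculation does not need apply at all. A minimal sketch, assuming the formula from the question and using with() to refer to the columns by name (the result name score is made up for illustration):

```r
df1 <- data.frame(
  AC1Q = c(0.53, 0.57, 0.60, 0.51),
  AC4Q = c(0.15, 0.12, 0.09, 0.19),
  AC2Q = c(0.09, 0.05, 0.07, 0.05),
  AC3Q = c(0.23, 0.26, 0.23, 0.26)
)

# Each column is a vector, so the whole expression is evaluated once
# and yields one value per row
score <- with(df1, (sin(2 * pi * (AC1Q - 0.25)) +
                    sin(2 * pi * (-AC4Q + 0.25)) -
                    sin(2 * pi * (-AC2Q + 0.25)) -
                    sin(2 * pi * (AC3Q - 0.25))) / 4)
round(score, 2)
```

This returns one number per row of df1 and avoids the per-row function call that apply makes.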

Get the sum of a specific number of following rows in R

I have to solve this specific problem in R. I have a large list, containing columns and rows in this format:
Day_and_Time Rain1_mm/min Rain2_mm/min
01.12.10 18:01 0 0
.............. .... ...
02.12.10 01:00 0.03 0
02.12.10 01:01 0.03 0
02.12.10 01:02 0.01 0
02.12.10 01:03 0.05 0
02.12.10 01:04 0.03 0.1
02.12.10 01:05 0.04 0
.............. .... ...
02.12.10 18:00 0 0
What I want to do is to write a function that sums up six following rows and return the result as a new row. This means that at the end I have a new list - looking like this for example:
Day_and_Time Rain1_mm/5min Rain2_mm/5min
.............. .... ...
02.12.10 01:05 0.19 0.1
02.12.10 01:10 .... ...
.............. .... ...
Is it possible to do this? The goal is to transform the unit [mm/min] from the first and second column to [mm/5min].
Thank you very much!
Assuming that you read the data in your .csv file as a data frame df, one approach to your problem is to use rollapply from the zoo package to give you a rolling sum:
library(zoo)
ind_keep <- seq(1,floor(nrow(df)/5)*5, by=5) ## 1.
out <- sapply(df[,-1], function(x) rollapply(x,6,sum)) ## 2.
out <- data.frame(df[ind_keep+5,1],out[ind_keep,]) ## 3.
colnames(out) <- c("Day_and_time","Rain1_mm/5min","Rain2_mm/5min") ## 4.
Notes:
1. Define the indices corresponding to every 5 minutes where we want to keep the rolling sum over the next 5 minutes.
2. Apply a rolling sum function to each column.
   - sapply runs over all columns of df except the first. Note that the column indices specified in df[,-1] can be adjusted so that you process only certain columns.
   - The function applied is rollapply from the zoo package. The additional arguments are the width of the window, 6 (six one-minute readings spanning 5 minutes), and the sum function, so that this performs a rolling sum.
   At this point, out contains the rolling sums (over 5 minutes) at each minute, but we only want those every 5 minutes. Therefore,
3. Combine the Day_and_Time column from the original df with out, keeping only the rows at every 5 minutes. Note that we keep the last Day_and_Time in each window.
4. Rename the columns.
Using MikeyMike's data, which is
Day_and_Time rain1 rain2
1 2010-02-12 01:00:00 0.03 0.00
2 2010-02-12 01:01:00 0.03 0.00
3 2010-02-12 01:02:00 0.01 0.00
4 2010-02-12 01:03:00 0.05 0.00
5 2010-02-12 01:04:00 0.03 0.10
6 2010-02-12 01:05:00 0.04 0.00
7 2010-02-12 01:06:00 0.02 0.10
8 2010-02-12 01:07:00 0.10 0.10
9 2010-02-12 01:08:00 0.30 0.00
10 2010-02-12 01:09:00 0.01 0.00
11 2010-02-12 01:10:00 0.00 0.01
this gives:
print(out)
## Day_and_time Rain1_mm/5min Rain2_mm/5min
##1 2010-02-12 01:05:00 0.19 0.10
##2 2010-02-12 01:10:00 0.47 0.21
Note the difference in the result: this approach assumes you want overlapping windows, since you specified that you want to sum the six numbers in the closed interval [i, i+5] at each 5-minute mark.
To extend the above to a window in the closed interval [i, i+nMin] at each nMin mark:
library(zoo)
nMin <- 10 ## for example 10 minutes
ind_keep <- seq(1, floor(nrow(df)/nMin)*nMin, by=nMin)
out <- sapply(df[,-1], function(x) rollapply(x, nMin+1, sum))
out <- data.frame(df[ind_keep+nMin, 1],out[ind_keep,])
colnames(out) <- c("Day_and_time",paste0("Rain1_mm/",nMin,"min"),paste0("Rain2_mm/",nMin,"min"))
For this to work, the data must have at least 2 * nMin + 1 rows
Hope this helps.
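As a dependency-free cross-check of the rolling sum, the same windows can be built in base R with embed(), which lays out each run of consecutive values as one matrix row. This is only a sketch for verifying the numbers, using the rain1 values from the sample data; width and roll are illustrative names.

```r
rain1 <- c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04, 0.02, 0.10, 0.30, 0.01, 0.00)

width <- 6  # six one-minute readings, i.e. the closed interval [i, i+5]

# embed() returns one row per window; rowSums() then gives the rolling sum
roll <- rowSums(embed(rain1, width))

# Keep the windows starting at rows 1, 6, ... as in the zoo version
ind_keep <- seq(1, floor(length(rain1) / 5) * 5, by = 5)
roll[ind_keep]
# [1] 0.19 0.47
```

The two values agree with the Rain1_mm/5min column printed above.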
Assuming you want non-overlapping groups (minutes 0 - 5, 6 - 11, etc., since the code buckets the timestamps into 360-second blocks), this should give you what you're looking for:
library(data.table)
setDT(df)[,.(day_time = max(Day_and_Time),
rain1_sum=sum(rain1),
rain2_sum=sum(rain2)),
by=.(floor(as.numeric(Day_and_Time)/360))]
floor day_time rain1_sum rain2_sum
1: 3516540 2010-02-12 01:05:00 0.19 0.10
2: 3516541 2010-02-12 01:10:00 0.43 0.21
Data:
# The .internal.selfref pointer from the original dput() is dropped because it
# cannot be parsed; setDT(df) above turns the data frame into a data.table.
df <- structure(list(Day_and_Time = structure(c(1265954400, 1265954460,
1265954520, 1265954580, 1265954640, 1265954700, 1265954760, 1265954820,
1265954880, 1265954940, 1265955000), class = c("POSIXct", "POSIXt"
), tzone = ""), rain1 = c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04,
0.02, 0.1, 0.3, 0.01, 0), rain2 = c(0, 0, 0, 0, 0.1, 0, 0.1,
0.1, 0, 0, 0.01)), .Names = c("Day_and_Time", "rain1", "rain2"
), row.names = c(NA, -11L), class = "data.frame")
