Operations within rows in R

Consider the following dataset, where:
Variables 1-7 (var1-7) are linear measurements taken from five lizards (indvA-E);
Variable 8 (var8) is the number of variables, for each lizard, that contain values that are not equal to NA;
Variable 9 (var9) is the sum of variables 1-7;
data <- data.frame(var1 = c(0.13,0.08,0.05,0.11,0.09),
var2 = c(0.17,0.09,0.07,0.15,0.13),
var3 = c(0.19,0.11,0.19,0.17,0.14),
var4 = c(NA,0.11,0.31,0.38,0.17),
var5 = c(NA,NA,0.39,0.41,0.19),
var6 = c(NA,NA,0.40,0.75,NA),
var7 = c(NA,NA,0.45,0.79,NA))
row.names(data) <- c("indv.A","indv.B","indv.C","indv.D","indv.E")
data[,"var8"] <- rowSums(!is.na(data))
data[,"var9"] <- rowSums(data[,1:7], na.rm = TRUE)
data
# var1 var2 var3 var4 var5 var6 var7 var8 var9
# indv.A 0.13 0.17 0.19 NA NA NA NA 3 0.49
# indv.B 0.08 0.09 0.11 0.11 NA NA NA 4 0.39
# indv.C 0.05 0.07 0.19 0.31 0.39 0.40 0.45 7 1.86
# indv.D 0.11 0.15 0.17 0.38 0.41 0.75 0.79 7 2.76
# indv.E 0.09 0.13 0.14 0.17 0.19 NA NA 5 0.72
I'd like to create a new variable, named var10, that can be described as either "var8 divided by (var9 minus the last non-NA value of variables 1-7)" or "var8 divided by the sum of all but the last non-NA value of variables 1-7".
For the above dataset, this new variable will contain:
# var1-9 var10
# indv.A [...] 10.00
# indv.B [...] 14.29
# indv.C [...] 4.96
# indv.D [...] 3.55
# indv.E [...] 9.43
I just don't know how to write in R the formula to obtain this variable. Any help will be greatly appreciated.

1) If we need the last non-NA value from var1 to var7, we can do
v1 <- data[cbind(seq_len(nrow(data)), max.col(!is.na(data[1:7]), "last"))]
data$var10 <- data$var8/v1
2) For the second case, which sums every non-NA value except the last one:
data$var10 <- data$var8/
apply(data[1:7], 1, \(x) sum(head(x[!is.na(x)], -1)))
> data$var10
[1] 10.000000 14.285714 4.964539 3.553299 9.433962
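As a cross-check, here is a minimal base R sketch, assuming the intended formula is var8 / (var9 - last non-NA value). It rebuilds the example data and computes var10 both ways. The names vars, last_col, last_val, var10_a and var10_b are illustrative only and do not appear in the original post.

```r
# Example data from the question (var1-var7 only).
data <- data.frame(var1 = c(0.13, 0.08, 0.05, 0.11, 0.09),
                   var2 = c(0.17, 0.09, 0.07, 0.15, 0.13),
                   var3 = c(0.19, 0.11, 0.19, 0.17, 0.14),
                   var4 = c(NA, 0.11, 0.31, 0.38, 0.17),
                   var5 = c(NA, NA, 0.39, 0.41, 0.19),
                   var6 = c(NA, NA, 0.40, 0.75, NA),
                   var7 = c(NA, NA, 0.45, 0.79, NA))
vars <- as.matrix(data)

var8 <- rowSums(!is.na(vars))        # count of non-NA values per row
var9 <- rowSums(vars, na.rm = TRUE)  # row sum, ignoring NA

# Column index of the last non-NA value in each row, then the value itself.
last_col <- max.col(!is.na(vars), ties.method = "last")
last_val <- vars[cbind(seq_len(nrow(vars)), last_col)]

# Reading 1: var8 / (var9 - last non-NA value).
var10_a <- var8 / (var9 - last_val)

# Reading 2: var8 / sum of all but the last non-NA value.
var10_b <- var8 / apply(vars, 1, function(x) sum(head(x[!is.na(x)], n = -1)))

round(var10_a, 2)                    # 10.00 14.29 4.96 3.55 9.43, as requested
isTRUE(all.equal(var10_a, var10_b))  # both readings agree
```

Both readings describe the same quantity, so either form can be used; the max.col() version avoids a row-wise apply() call.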

Related

Replace matching column values in multiple dataframes with NA in R

I have a dataframe like the following:
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
target = rep(c("1", "2", "3"), times = 2),
eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 0.75 0.09
I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:
outliers1 <- data.frame(participant = c("01", "02"),
target = c("1", "3"),
eucDist = c(0.06, 0.75),
eucDist2 = c(0.09, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 02 3 0.75 0.09
outliers2 <- data.frame(participant = "01",
target = "1",
eucDist = 0.06,
eucDist2 = 0.09)
participant target eucDist eucDist2
1 01 1 0.06 0.09
The rows shown in outliers1 indicate outliers in the eucDist column of dat, and the row shown in outliers2 indicates an outlier in the eucDist2 column.
I would like to replace the outlier values in the eucDist and eucDist2 columns of dat with NA. I do not want to remove whole rows, because in many cases either the eucDist or the eucDist2 value is still usable, and removing the entire row would discard both variables.
Here is what I would like:
participant target eucDist eucDist2
1 01 1 NA NA
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 NA 0.09
I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:
library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
replace_with_na_all(dat$eucDist)
}
Since the outliers are already stored as data frames, you can set the relevant column to NA in each outlier data frame, and then update the matching rows of dat with these values using dplyr::rows_update(). This requires dplyr v1.0.0 or later.
library(dplyr)
outliers1$eucDist <- NA
outliers2$eucDist2 <- NA
dat |>
rows_update(outliers1, by = c("participant", "target")) |>
rows_update(select(outliers2, -eucDist), by = c("participant", "target"))
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09
I would recommend using the case_when() function from the dplyr package. Note that this matches on values rather than on participant and target, so every row that shares an outlier's value is set to NA: with the data above, row 6 of eucDist2 also equals 0.09 and is blanked along with row 1.
library(dplyr)
dat_final <- dat %>%
  mutate(eucDist = case_when(
    eucDist %in% outliers1$eucDist ~ NA_real_,
    TRUE ~ eucDist
  ),
  eucDist2 = case_when(
    eucDist2 %in% outliers2$eucDist2 ~ NA_real_,
    TRUE ~ eucDist2
  ))
> str(dat_final)
'data.frame': 6 obs. of 4 variables:
$ participant: chr "01" "01" "01" "02" ...
$ target : chr "1" "2" "3" "1" ...
$ eucDist : num NA 0.16 0.89 0.1 0.11 NA
$ eucDist2 : num NA 0.04 0.03 0.05 1.45 NA
Use rbind to combine the two outlier frames, adding a column j that holds the column number in dat of the respective variable (3 for eucDist, 4 for eucDist2). Then run a short for loop.
outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))
for (i in seq_len(nrow(outliers))) {
dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09
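The same key-based matching can be sketched in base R without a loop, assuming participant and target together identify a row. The helper key() is hypothetical and introduced only for illustration.

```r
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
                  target = rep(c("1", "2", "3"), times = 2),
                  eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
                  eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
outliers1 <- data.frame(participant = c("01", "02"),
                        target = c("1", "3"))
outliers2 <- data.frame(participant = "01",
                        target = "1")

# Hypothetical helper: one key per row, built from the identifying columns.
key <- function(d) paste(d$participant, d$target, sep = "_")

# Blank only the cells whose (participant, target) key is listed as an outlier.
dat$eucDist[key(dat) %in% key(outliers1)] <- NA
dat$eucDist2[key(dat) %in% key(outliers2)] <- NA
dat
```

Matching on the key rather than on the measured values leaves row 6 of eucDist2 intact, even though it has the same value (0.09) as the outlier in row 1.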

Apply conditional function in every row of a data frame

I'm new to R and I'm struggling with this df, which looks like this:
Date Group Factor1 Factor2 Spread
2019-04-01 a 1.01 1.011 0.01
2019-04-02 a 1.02 1.012 0.02
2019-04-03 a 1.03 1.013 0.03
2019-04-01 b 1.005 1.004 0.01
2019-04-02 b 1.0051 1.0041 0.02
2019-04-03 b 1.0052 1.0042 0.03
I would like to check the group in each row: if the Group is "a", compute Factor1 / Factor1 (lagged by one day) * Factor2 + Spread, and if the Group is not "a", do not add the Spread.
Since you are conditioning on the group, this is a good use case for by (base R), dplyr::group_by, or data.table's x[,,by=].
The equation is effectively the same in all three. It relies on the fact that (Group[1] == "a") is coerced from logical to numeric when multiplied by a number: FALSE becomes 0, which effectively disables adding Spread.
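The coercion can be seen in isolation with a small sketch. The data frame x below is a reconstruction of the question's table and is an assumption: the answer never shows how x was built, and the column names Factor1 and Factor2 (without spaces) are taken from the answer's code. The name spread_a is illustrative.

```r
# Assumed reconstruction of the question's data.
x <- data.frame(
  Date = rep(c("2019-04-01", "2019-04-02", "2019-04-03"), times = 2),
  Group = rep(c("a", "b"), each = 3),
  Factor1 = c(1.01, 1.02, 1.03, 1.005, 1.0051, 1.0052),
  Factor2 = c(1.011, 1.012, 1.013, 1.004, 1.0041, 1.0042),
  Spread = rep(c(0.01, 0.02, 0.03), times = 2),
  stringsAsFactors = FALSE
)

# TRUE is coerced to 1 and FALSE to 0 when multiplied by a number,
# so Spread survives for group "a" and becomes 0 elsewhere.
spread_a <- (x$Group == "a") * x$Spread
spread_a   # 0.01 0.02 0.03 0.00 0.00 0.00
```

Inside a grouped computation every row has the same Group, which is why the answers test only Group[1]: the single TRUE or FALSE is recycled across the group's rows.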
Base
I use within here to make the internals a little more readable, but this is not a requirement (without it, you'd need to prefix each of the variable names with y$).
The lagging can be done using dplyr::lag (even if you don't use the rest of the package for this) or many other techniques. I don't find stats::lag to be the most intuitive in applications like this, but I'm sure somebody will suggest a way to incorporate it into an answer. The use of c(NA, ...) ensures that we don't bring in a different group's data or impute data we don't have, since we have no value to bring in on the first row of a group. Finally, head(..., n = 1) returns the first element of a vector/list, while head(..., n = -1) (negative) returns all but the last.
newx <- by(x, x$Group, function(y) {
within(y, {
NewVal = Factor2 * Factor1 / c(NA, head(Factor1, n=-1)) + (Group[1] == "a") * Spread
})
})
newx
# x$Group: a
# Date Group Factor1 Factor2 Spread NewVal
# 1 2019-04-01 a 1.01 1.011 0.01 NA
# 2 2019-04-02 a 1.02 1.012 0.02 1.042020
# 3 2019-04-03 a 1.03 1.013 0.03 1.052931
# -------------------------------------------------------
# x$Group: b
# Date Group Factor1 Factor2 Spread NewVal
# 4 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5 2019-04-02 b 1.0051 1.0041 0.02 1.0042
# 6 2019-04-03 b 1.0052 1.0042 0.03 1.0043
This is really just a list with some fancy by-specific formatting, so you can treat it as such and combine the pieces in an efficient base-R way:
do.call("rbind.data.frame", c(newx, stringsAsFactors = FALSE))
# Date Group Factor1 Factor2 Spread NewVal
# a.1 2019-04-01 a 1.0100 1.0110 0.01 NA
# a.2 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# a.3 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# b.4 2019-04-01 b 1.0050 1.0040 0.01 NA
# b.5 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# b.6 2019-04-03 b 1.0052 1.0042 0.03 1.004300
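The c(NA, head(..., n = -1)) lag used in the base solution can be sketched on its own to show why it has to run per group. The helper lag1() and the vectors factor1 and group are illustrative names, and ave() is used here only as one possible base R way to apply a function within groups.

```r
factor1 <- c(1.01, 1.02, 1.03, 1.005, 1.0051, 1.0052)
group   <- c("a", "a", "a", "b", "b", "b")

# Hypothetical helper: shift a vector down by one, padding with NA.
lag1 <- function(v) c(NA, head(v, n = -1))

# Applied to the whole column, the first "b" row receives 1.03 from group "a".
ungrouped <- lag1(factor1)

# Applied within each group, the first row of every group is NA instead.
lagged <- ave(factor1, group, FUN = lag1)
lagged
```

The grouped result has NA in positions 1 and 4, matching the NA rows in the NewVal column above.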
dplyr
Many find the tidyverse line of packages to read intuitively.
library(dplyr)
x %>%
group_by(Group) %>%
mutate(NewVal = Factor2 * Factor1 / lag(Factor1) + (Group[1] == "a") * Spread) %>%
ungroup()
# # A tibble: 6 x 6
# Date Group Factor1 Factor2 Spread NewVal
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-04-01 a 1.01 1.01 0.01 NA
# 2 2019-04-02 a 1.02 1.01 0.02 1.04
# 3 2019-04-03 a 1.03 1.01 0.03 1.05
# 4 2019-04-01 b 1.00 1.00 0.01 NA
# 5 2019-04-02 b 1.01 1.00 0.02 1.00
# 6 2019-04-03 b 1.01 1.00 0.03 1.00
data.table
On a different note, many find data.table better because of efficiencies gained from in-place modification (most of R's operations are copy-on-write, meaning some operations re-copy the object or a portion of it with each change).
library(data.table)
X <- as.data.table(x)
X[, NewVal := Factor2 * Factor1 / shift(Factor1) + (Group[1] == "a") * Spread, by = "Group"]
X
# Date Group Factor1 Factor2 Spread NewVal
# 1: 2019-04-01 a 1.0100 1.0110 0.01 NA
# 2: 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# 3: 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# 4: 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5: 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# 6: 2019-04-03 b 1.0052 1.0042 0.03 1.004300
The "in-place" part is evident on the second line here, where it appears as if the [ operation should just return a subset or something of the data ... but in this case using := causes the columns to be created (or changed) in-place.

How to find and replace min value in dataframe with text in r

I have a dataframe with 20 columns, and I would like to identify the minimum value in each column and replace it with text such as "min". I'd appreciate any help.
sample data :
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
You can apply a function to each column that replaces the minimum value with a string. This returns a character matrix, which can be converted into a data frame if desired. As IceCreamToucan pointed out, every column will be of type character, since all elements of a vector must have the same type:
apply(df, 2, function(x) {
x[x == min(x)] <- 'min'
return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but know that this converts all your columns to character, since vectors must have elements which all have the same type.
library(dplyr)
df %>%
mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
apply(df, MARGIN = 2, FUN = function(x) {x[which.min(x)] <- 'min'; return(x)})
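The answers above differ in how they treat ties: x == min(x) replaces every occurrence of the minimum, while which.min(x) replaces only the first. A small sketch on a made-up vector (not from the question) shows the difference; the names all_ties and first_only are illustrative.

```r
x <- c(0.5, 0.2, 0.9, 0.2)   # made-up vector with a tied minimum

# Every element equal to the minimum is replaced.
all_ties <- replace(x, x == min(x), "min")

# Only the first minimum is replaced.
first_only <- replace(x, which.min(x), "min")

all_ties     # "0.5" "min" "0.9" "min"
first_only   # "0.5" "min" "0.9" "0.2"
```

In both cases the result is a character vector, so it is usually safer to keep the numeric data and build the labelled version as a separate copy for display.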

Find correlation between columns whose names are specified as values in another dataframe

I have two dataframes, one is a list of pairs of individuals, similar to below (but with about 150 pairs):
ID_1 ID_2
X14567 X26789
X12637 X34560
X67495 X59023
The other dataframe has one column per individual, with the numerical values for that individual underneath. In total there are about 300 columns and 300 rows. For example:
X14567 X12637 X26789 X67495 X34560 X59023
0.41 0.29 0.70 0.83 0.41 0.30
0.59 0.44 0.20 0.94 0.03 0.97
0.48 0.91 0.78 0.92 0.40 0.09
0.07 0.21 0.42 0.14 0.96 0.96
0.33 0.13 0.53 0.04 0.52 0.49
0.94 0.28 0.37 0.26 0.11 0.09
I want to find the correlation of these values between each pair of individuals, to end up with something like:
ID_1 ID_2 Correlation
X14567 X26789 -0.25
X12637 X34560 -0.25
X67495 X59023 -0.11
Is there a way that I can pull the values from the first dataframe to specify the name of the two columns that I need to find correlations between in such a way that can be easily repeated for each row of the first dataframe?
Many thanks for your help
If x and y are your two data.frames and the column names are set appropriately, you can use apply.
apply(x, 1, function(row) cor(y[row[1]], y[row[2]]))
From there just add the values to your x data.frame:
x$cor <- apply(x, 1, function(row) cor(y[row[1]], y[row[2]]))
V1 V2 cor
2 X14567 X26789 -0.2515737
3 X12637 X34560 -0.2563294
4 X67495 X59023 -0.1092830
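A variant of the same idea can be sketched with mapply(), which walks the two ID columns in parallel and avoids converting each row to a character vector. The data frame names pairs and values are assumptions; the numbers are the sample data from the question.

```r
pairs <- data.frame(ID_1 = c("X14567", "X12637", "X67495"),
                    ID_2 = c("X26789", "X34560", "X59023"),
                    stringsAsFactors = FALSE)
values <- data.frame(
  X14567 = c(0.41, 0.59, 0.48, 0.07, 0.33, 0.94),
  X12637 = c(0.29, 0.44, 0.91, 0.21, 0.13, 0.28),
  X26789 = c(0.70, 0.20, 0.78, 0.42, 0.53, 0.37),
  X67495 = c(0.83, 0.94, 0.92, 0.14, 0.04, 0.26),
  X34560 = c(0.41, 0.03, 0.40, 0.96, 0.52, 0.11),
  X59023 = c(0.30, 0.97, 0.09, 0.96, 0.49, 0.09)
)

# For each pair of column names, correlate the two named columns.
pairs$Correlation <- mapply(function(a, b) cor(values[[a]], values[[b]]),
                            pairs$ID_1, pairs$ID_2, USE.NAMES = FALSE)
pairs
```

Using values[[a]] extracts a plain numeric vector, so cor() returns a single number for each pair rather than a 1 x 1 matrix.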
If you just want the correlations between all columns in your second data frame, you can do:
library(reshape2)
df.corr = melt(cor(df))
To remove the rows that give the correlation of each column with itself:
df.corr = subset(df.corr, Var1 != Var2)
Example using built-in mtcars data frame:
mtcars.corr = melt(cor(mtcars))
Var1 Var2 value
1 mpg mpg 1.00000000
2 cyl mpg -0.85216196
3 disp mpg -0.84755138
...
119 am carb 0.05753435
120 gear carb 0.27407284
121 carb carb 1.00000000

apply a function on columns with specific names

I am new in R.
I have hundreds of data frames like this
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or the full names.
I know how to do it df by df, for the one in the example:
A <- function(x) log(x)
df_log<-data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns is different in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))  # positions of the Ratio_ columns
  x[i1] <- lapply(x[i1], A)        # A is the function defined in the question
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
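A self-contained sketch of the list approach, using two made-up data frames with different numbers of Ratio_ columns. The names df1, df2, log_ratios and res are illustrative, and startsWith() is used here in place of grep() as an equivalent way to find the columns.

```r
# Two made-up data frames; only the Ratio_ prefix is shared.
df1 <- data.frame(ID = c("AA", "AB"),
                  Ratio_A = c(0.09, 0.04),
                  Ratio_B = c(0.67, 0.85))
df2 <- data.frame(ID = c("AC", "AD"),
                  Ratio_C = c(0.54, 0.25),
                  Ratio_D = c(0.14, 0.97),
                  Ratio_E = c(1, 2))
lst <- list(df1 = df1, df2 = df2)

# Hypothetical helper: apply log() to every column whose name starts with "Ratio_".
log_ratios <- function(d) {
  is_ratio <- startsWith(names(d), "Ratio_")
  d[is_ratio] <- lapply(d[is_ratio], log)
  d
}

res <- lapply(lst, log_ratios)
res$df2
```

Because the columns are selected by name prefix inside the helper, the same call works no matter how many Ratio_ columns each data frame has, and the non-Ratio_ columns are returned unchanged.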
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
# mutate_each() and funs() are deprecated; across() needs dplyr >= 1.0.0
df_transformed <- mutate(df, across(starts_with("Ratio_"), log))
df_transformed
# > df_transformed
# ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004
