My data is structured as follows:
set.seed(20)
RawData <- data.frame(Trial = c(rep(1, 10), rep(2, 10)),
                      X_Velocity = runif(20, 1, 3),
                      Y_Velocity = runif(20, 4, 6))
I now wish to calculate an average for X_Velocity and Y_Velocity across every two rows, for each Trial. My anticipated output for the first four rows would be:
X_Velocity_AVG Y_Velocity_AVG
NA NA
2.6460545 4.522224
NA NA
1.8081265 4.5175165
How do I accomplish this?
You could do this using a function f that computes the average of every two elements:
f <- function(a) tapply(a, rep(1:(length(a)/2), each = 2), FUN = mean)
res <- data.frame(X_Velocity_AVG = rep(NA, nrow(RawData)),
                  Y_Velocity_AVG = rep(NA, nrow(RawData)))
res$X_Velocity_AVG[c(FALSE, TRUE)] <- f(RawData$X_Velocity)
res$Y_Velocity_AVG[c(FALSE, TRUE)] <- f(RawData$Y_Velocity)
res
# X_Velocity_AVG Y_Velocity_AVG
# 1 NA NA
# 2 2.646055 4.522224
# 3 NA NA
# 4 1.808127 4.517517
# 5 NA NA
# 6 2.943262 4.334551
# 7 NA NA
# 8 1.162082 5.899396
# 9 NA NA
# 10 1.697668 4.739195
# 11 NA NA
# 12 2.473324 4.778723
# 13 NA NA
# 14 1.744730 5.020097
# 15 NA NA
# 16 1.644518 4.986245
# 17 NA NA
# 18 1.431219 5.375815
# 19 NA NA
# 20 2.108719 4.909284
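If you also want the Trial grouping to be explicit, here is a minimal dplyr sketch (assuming dplyr is acceptable and that every Trial has an even number of rows; both are assumptions on my part). It returns one row per pair of rows rather than interleaving NAs:
library(dplyr)
RawData %>%
  mutate(Pair = ceiling(row_number() / 2)) %>%   # label consecutive pairs of rows
  group_by(Trial, Pair) %>%
  summarise(X_Velocity_AVG = mean(X_Velocity),
            Y_Velocity_AVG = mean(Y_Velocity),
            .groups = "drop")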
Related
I have Length and Weight values in a data frame. However, some of them are missing. The data frame looks like:
df <- data.frame(
L = c(13,15,19,NA,NA,32,35,NA,NA,18,15),
W = c(NA,NA,50, NA,NA,NA,80,NA,NA,30,NA)
)
I need a function that works when Length is not NA and Weight is NA: it should calculate the weight from the length, and otherwise do nothing.
lwr <- function(data, length, weight, a, b) {
  if(!is.na(data$length) && is.na(data$weight)) {
    data$weight = 10^(log(a) + b*log(data$length))
  } else {
    data$weight
  }
}
I call it like this:
lwr(data=df, length = L, weight = W, a=0.003, b=3.2)
but it does not work.
Any help would be appreciated. Thank you very much for your time.
You could probably do this more simply.
f <- \(x, a, b) 10^(log(a) + b*log(x))
naw <- is.na(df$W)
df$W[naw] <- f(df$L[naw], .003, 3.2)
# L W
# 1 13 250.4350
# 2 15 718.8159
# 3 19 50.0000
# 4 NA NA
# 5 NA NA
# 6 32 191078.5331
# 7 35 80.0000
# 8 NA NA
# 9 NA NA
# 10 18 30.0000
# 11 15 718.8159
You should use the vectorized ifelse() instead of if () ... else ....
lwr <- function(length, weight, a, b) {
  ifelse(is.na(weight), 10^(log(a) + b*log(length)), weight)
}
df |>
transform(W2 = lwr(L, W, a=0.003, b=3.2))
# equivalent:
# df$W2 <- lwr(df$L, df$W, a=0.003, b=3.2)
# L W W2
# 1 13 NA 250.4350
# 2 15 NA 718.8159
# 3 19 50 50.0000
# 4 NA NA NA
# 5 NA NA NA
# 6 32 NA 191078.5331
# 7 35 80 80.0000
# 8 NA NA NA
# 9 NA NA NA
# 10 18 30 30.0000
# 11 15 NA 718.8159
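For what it's worth, dplyr::coalesce() expresses the same "keep the observed weight, otherwise compute it" logic in one call; this is just a sketch assuming dplyr is available, reusing the formula and constants from above:
library(dplyr)
df |>
  mutate(W2 = coalesce(W, 10^(log(0.003) + 3.2 * log(L))))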
I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D are the column names and the abundances are the row values.
I would like to populate my larger matrix, which has all possible letters (A:Z) as column names and all possible samples (sample1:sampleN) as rows, by matching on the column names.
For example:
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how could I ensure that the correct columns are populated by the right abundance values (ex. column "A" is only filled by values corresponding to abundances of "A" in different samples)? Thanks!
Starting data, changed a little to highlight the differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nr=nrow(m1), nc=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nr=nrow(m2), nc=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; plain rbind() matches columns by position, so m1's columns also need to be in m2's order:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2
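With many matrices, the two padding steps can be wrapped in a small helper; pad_cols below is a name I made up, and this is only a sketch of the same idea:
pad_cols <- function(m, all_cols) {
  newcols <- setdiff(all_cols, colnames(m))
  m <- cbind(m, matrix(NA, nrow = nrow(m), ncol = length(newcols),
                       dimnames = list(NULL, newcols)))
  m[, all_cols, drop = FALSE]   # pad the missing columns with NA, then reorder
}
all_cols <- union(colnames(m2), colnames(m1))
rbind(pad_cols(m2, all_cols), pad_cols(m1, all_cols))   # same result as above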
You should be able to take advantage of matrix indexing, like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))
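For illustration, running the assignment with these two objects fills exactly the named positions; the single row name is recycled across the four column names inside the index matrix:
big[cbind(rownames(abun), colnames(abun))] <- abun
big["sample1", 1:6]
#  A  B  C  D  E  F
#  1  3  4  2 NA NA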
Another solution, using reduce() from the purrr package and union_all() from the dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance data frames:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7
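As a side note (a sketch, assuming plain data frames are fine), dplyr's bind_rows() performs the column union and NA filling in a single step:
A <- bind_rows(df1, df2, df3)   # unions the columns, fills the gaps with NA
A <- A[, sort(colnames(A))]
rownames(A) <- sample_names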
I have 15 datasets. The 1st column is "subject" and is identical in all sets. The number of the remaining columns differs between datasets. I need to combine all of this data in a single data frame. I found the command Reduce, but I am just starting with R and I couldn't tell whether this is what I need and, if so, what the syntax is. Thanks!
I suggest including a reproducible example in the future so that others can see the format of data you're working with and what you're trying to do.
Here is some randomly generated example data, each with the "Subject" column:
list_of_dfs <- list(
df1 = data.frame(Subject = 1:4, a = rnorm(4), b = rnorm(4)),
df2 = data.frame(Subject = 5:8, c = rnorm(4), d = rnorm(4), e = rnorm(4)),
df3 = data.frame(Subject = 7:10, f = rnorm(4)),
df4 = data.frame(Subject = 2:5, g = rnorm(4), h = rnorm(4))
)
Reduce with merge is a good choice:
combined_df <- Reduce(
function(x, y) { merge(x, y, by = "Subject", all = TRUE) },
list_of_dfs
)
And the output:
> combined_df
Subject a b c d e f g h
1 1 1.1106594 1.2530046 NA NA NA NA NA NA
2 2 -1.0275630 0.6437101 NA NA NA NA -1.9393347 -0.4361952
3 3 0.1558639 1.2792212 NA NA NA NA -0.8861966 1.0137530
4 4 0.4283585 -0.1045530 NA NA NA NA 1.8924896 -0.3788198
5 5 NA NA 0.08261190 0.77058804 -1.165042 NA 0.7950784 -1.3467386
6 6 NA NA 2.51214598 0.62024328 1.496520 NA NA NA
7 7 NA NA 0.01581309 -0.04777196 -1.327884 1.5111734 NA NA
8 8 NA NA 0.80448136 -0.33347573 -2.290428 -0.3863564 NA NA
9 9 NA NA NA NA NA -1.2371795 NA NA
10 10 NA NA NA NA NA 1.6819063 NA NA
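If you prefer the tidyverse, an equivalent sketch (assuming purrr and dplyr are installed) replaces Reduce()/merge() with reduce()/full_join():
library(purrr)
library(dplyr)
combined_df <- reduce(list_of_dfs, full_join, by = "Subject")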
I have a large data.frame with 'staggered' data and would like to align it. What I mean is that I would like to remove the leading (top) NAs from every column so that each column's values start in the first row (example code for the data is given below).
I know about the na.trim function from the zoo package, but it didn't work on either the initial data.frame or its transpose. With the transposed data frame t.df, I used
t.df <- na.trim(t.df, sides = 'left')
This only returned an empty data.frame, and wouldn't work the way I wanted anyway since it would create vectors of different lengths. Can anyone point me to a package or function that might be more helpful?
Here is the code for my example used above:
# example of what I have
var1 <- c(1,2,3,4,5,6,7,8,9,10)
var2 <- c(6,2,4,7,3,NA,NA,NA,NA,NA)
var3 <- c(NA,NA,8,6,3,7,NA,NA,NA,NA)
var4 <- c(NA,NA,NA,NA,5,NA,2,6,2,9)
df <- data.frame(var1, var2, var3, var4)
# transpose and (unsuccessful) attempt to remove leading NAs
t.df <- t(df)
t.df <- na.trim(t.df, sides = 'left')
We can loop over the columns with lapply() and apply na.trim(). Then, pad NAs at the end of each list element by setting its length to the maximum length among the list elements.
library(zoo)
lst <- lapply(df, na.trim)
df[] <- lapply(lst, `length<-`, max(lengths(lst)))
df
# var1 var2 var3 var4
#1 1 6 8 5
#2 2 2 6 NA
#3 3 4 3 2
#4 4 7 7 6
#5 5 3 NA 2
#6 6 NA NA 9
#7 7 NA NA NA
#8 8 NA NA NA
#9 9 NA NA NA
#10 10 NA NA NA
Or, as @G.Grothendieck mentioned in the comments:
replace(df, TRUE, do.call("merge", lapply(lst, zoo)))
You can do this with base functions:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$value[1]) return(x)
  x[c(((r$length[1]+1):length(x)), 1:r$length[1])]
}
df[,] <- lapply(df, my.na.trim)
df
# var1 var2 var3 var4
# 1 1 6 8 5
# 2 2 2 6 NA
# 3 3 4 3 2
# 4 4 7 7 6
# 5 5 3 NA 2
# 6 6 NA NA 9
# 7 7 NA NA NA
# 8 8 NA NA NA
# 9 9 NA NA NA
# 10 10 NA NA NA
An alternative way to code the function:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$value[1]) return(x)
  r1 <- r$length[1]
  c(tail(x, -r1), head(x, r1))
}
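To see what the function does on a single column, here is a quick check with var3 from the example; it rotates the leading NAs to the end, which amounts to trimming plus padding because every column keeps its original length:
my.na.trim(var3)
# [1]  8  6  3  7 NA NA NA NA NA NA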
We can use the cbind.na() function from the qpcR package and combine it with the na.trim() function from the zoo package:
do.call(qpcR:::cbind.na, lapply(df, zoo::na.trim))
# var1 var2 var3 var4
# [1,] 1 6 8 5
# [2,] 2 2 6 NA
# [3,] 3 4 3 2
# [4,] 4 7 7 6
# [5,] 5 3 NA 2
# [6,] 6 NA NA 9
# [7,] 7 NA NA NA
# [8,] 8 NA NA NA
# [9,] 9 NA NA NA
#[10,] 10 NA NA NA
If speed is a concern, you can use this data.table solution.
library(data.table)
dt_foo <- function(dt) {
  shift_v <- sapply(dt, function(col) min(which(!is.na(col))) - 1)
  shift_expr <- parse(text = paste0("list(", paste("shift(", names(shift_v), ", n = ", shift_v, ", type = 'lead')", collapse = ", "), ")"))
  dt[, (names(shift_v)) := eval(shift_expr)]
  dt[]
}
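As a quick sanity check of my own (assuming df still holds the original staggered data from the question), the function reproduces the earlier results; note that it modifies the data.table by reference:
dt <- as.data.table(df)
dt_foo(dt)   # same values as the zoo/base results above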
Some benchmarking follows.
library(zoo)
library(microbenchmark)
set.seed(1)
DT <- as.data.table(matrix(sample(c(0:9L, NA), 1e8, T, prob = c(rep(.01, 10), .9)), ncol = 1000))
zoo_foo <- function(df) {
  lst <- lapply(df, na.trim)
  df[] <- lapply(lst, `length<-`, max(lengths(lst)))
  df
}
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$value[1]) return(x)
  x[c(((r$length[1]+1):length(x)), 1:r$length[1])]
}
microbenchmark(dt_foo(copy(DT)), zoo_foo(DT),
as.data.frame(lapply(DT, my.na.trim)), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dt_foo(copy(DT)) 1.468749 1.618289 1.690293 1.699926 1.725534 1.893018 10 a
zoo_foo(DT) 6.493227 6.516247 6.834768 6.779045 7.190705 7.319058 10 c
as.data.frame(lapply(DT, my.na.trim)) 4.988514 5.013340 5.384399 5.385273 5.508889 6.517748 10 b
I'm trying to tabulate/map the counts of two factor vectors (b1 & b2) into a bigger data frame. Summaries of the vectors are below:
> summary(b1)
(4,6] (6,8] NA's
16 3 1
> summary(b2)
(4,6] (6,8] NA's
9 0 11
I would like to map the above counts into a bigger dataframe:
Intervals b1 b2
1 (-Inf,0] NA NA
2 (0,2] NA NA
3 (2,4] NA NA
4 (4,6] NA NA
5 (6,8] NA NA
6 (8,10] NA NA
7 (10,12] NA NA
8 (12, Inf] NA NA
My question: is there a vectorized or more direct way to do the above without resorting to a 'for' loop + if-else condition checking? It seems like something easily done, but I have been having a mental block and haven't been successful in finding relevant help online. Any help or hint is appreciated. Thanks in advance!
My sample code is attached:
NoOfElement <- 20
MyBreaks <- c(seq(4, 8, by=2))
MyBigBreaks <- c(-Inf, seq(0,12, by=2), Inf)
set.seed(1)
a1 <- rnorm(NoOfElement, 5); a2 <- rnorm(NoOfElement, 4)
b1 <- cut(a1, MyBreaks); b2 <- cut(a2, MyBreaks)
c <- seq(-10, 10)
d <- cut(c, MyBigBreaks)
e <- data.frame( Intervals=levels(d), b1=NA, b2=NA )
The table() function does the tabulation you need. It returns a named vector of counts, and you can compare the names against the column e$Intervals to assign the correct values.
This relies on the fact that the order of the factor levels is the same in e$Intervals, b1, and b2, which holds because they all come from cut().
e$b1[e$Intervals %in% names(table(b1))] <- table(b1)
e$b2[e$Intervals %in% names(table(b2))] <- table(b2)
e
## Intervals b1 b2
## 1 (-Inf,0] NA NA
## 2 (0,2] NA NA
## 3 (2,4] NA NA
## 4 (4,6] 16 9
## 5 (6,8] 3 0
## 6 (8,10] NA NA
## 7 (10,12] NA NA
## 8 (12, Inf] NA NA
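A hedged alternative sketch that does not depend on the factor levels being in the same order is to index the counts with match(); intervals with no counts become NA automatically:
tb1 <- table(b1)
tb2 <- table(b2)
e$b1 <- as.vector(tb1[match(e$Intervals, names(tb1))])
e$b2 <- as.vector(tb2[match(e$Intervals, names(tb2))])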