Given a data frame like this:
A <- c(1,2,3,4,NA,6,7,8,9,10,11,12,13,14,15)
B <- c(NA,NA,NA,20,NA,NA,NA,15,NA,NA,NA,NA,11,NA,9)
DF <- data.frame(A, B)
I would like to calculate the mean for a range of values in column A, based on the value in column B. Specifically, every time there is a non-NA value in column B, I would like to calculate the mean of the range of rows 2 above and 2 below in column A.
For example, the first non-NA value in column B is 20, so I would like the mean of the two rows above (2, 3), the row itself (4), and the two rows below (NA, 6), ignoring the NA. So:
mean(c(2,3,4,NA,6), na.rm = TRUE)
Similarly, the next non-NA value in column B is 15, which would be
mean(c(6,7,8,9,10))
So, the end result for the entire data frame would be a new column C
DF$C <- c(NA,NA,NA,3.75,NA,NA,NA,8,NA,NA,NA,NA,13,NA,14)
You could try the following.
nona <- !is.na(DF$B)
DF$C <- replace(
DF$B,
nona,
vapply(which(nona), function(i) {
ii <- (i-2):(i+2)
mean(DF$A[ii[ii > 0]], na.rm = TRUE)
}, 1)
)
Here we find the non-NA values in column B and use their positions to build the indices of the column A values to average, taking care to drop any zero or negative subscripts that would occur if the first one or two values of column B were not NA. Subscripts past the last row simply return NA, which na.rm = TRUE discards. The above code gives the following result for DF.
A B C
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 20 3.75
5 NA NA NA
6 6 NA NA
7 7 NA NA
8 8 15 8.00
9 9 NA NA
10 10 NA NA
11 11 NA NA
12 12 NA NA
13 13 11 13.00
14 14 NA NA
15 15 9 14.00
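As a rough sketch, the same logic can be wrapped in a reusable helper. The function name window_mean and its k argument are hypothetical (they do not appear in the answer above); this version clamps the window to the valid row range instead of relying on out-of-range subscripts returning NA.

```r
# Hypothetical helper: mean of x over rows (i - k):(i + k) wherever flag is not NA.
window_mean <- function(x, flag, k = 2) {
  out <- rep(NA_real_, length(x))
  for (i in which(!is.na(flag))) {
    lo <- max(1, i - k)            # clamp at the first row
    hi <- min(length(x), i + k)    # clamp at the last row
    out[i] <- mean(x[lo:hi], na.rm = TRUE)
  }
  out
}

A <- c(1,2,3,4,NA,6,7,8,9,10,11,12,13,14,15)
B <- c(NA,NA,NA,20,NA,NA,NA,15,NA,NA,NA,NA,11,NA,9)
C <- window_mean(A, B)
C[c(4, 8, 13, 15)]
# [1]  3.75  8.00 13.00 14.00
```

Changing k widens or narrows the window without touching the rest of the code.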
Here is an approach with the zoo package:
library(zoo)
width <- 5 # the observation ± 2
DF$C <- rollapply(DF$A, width, mean, na.rm = TRUE, partial = TRUE)
# when DF$B is NA, assign NA to corresponding DF$C
DF$C[is.na(DF$B)] <- NA
partial = TRUE allows calculating the mean with a partial window at the leading and trailing parts of the DF$A vector where the whole window can't be accommodated (i.e. the first 2 and last 2 values of DF$A where a window of size 5 is not possible).
Probably simple but tricky question especially for larger data sets. Given two dataframes (df1,df2) of equal dimensions as below:
head(df1)
a b c
1 0.8569720 0.45839112 NA
2 0.7789126 0.36591578 NA
3 0.6901663 0.88095485 NA
4 0.7705756 0.54775807 NA
5 0.1743111 0.89087819 NA
6 0.5812786 0.04361905 NA
and
head(df2)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 1
3 0.08982958 0.4453491 2
4 0.75196925 0.6745908 3
5 0.73216793 0.6418483 4
6 0.73640209 0.7448011 5
How can one find all columns of df1 that are entirely NA (all(is.na(...)); in this case c), then go to df2 and set all values in the matching column (c) to NA?
Desired output
head(df3)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 NA
3 0.08982958 0.4453491 NA
4 0.75196925 0.6745908 NA
5 0.73216793 0.6418483 NA
6 0.73640209 0.7448011 NA
My actual dataframes have more than 140000 columns.
We can use colSums on the negated logical matrix (!is.na(df1)) to count the non-NA elements in each column, then negate (!) that vector so that columns with 0 non-NA elements become TRUE and all others FALSE. We use this to subset the columns of 'df2' and assign NA to them.
df2[!colSums(!is.na(df1))] <- NA
df2
# a b c
#1 0.21210312 0.7670091 NA
#2 0.19767464 0.3050934 NA
#3 0.08982958 0.4453491 NA
#4 0.75196925 0.6745908 NA
#5 0.73216793 0.6418483 NA
#6 0.73640209 0.7448011 NA
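To see what each step of that one-liner produces, here is a small sketch on toy data (the values are made up for illustration and are not the data frames from the question):

```r
# Toy data (assumed for illustration): column c of df1 is entirely NA.
df1 <- data.frame(a = c(0.8, NA), b = c(0.4, 0.3), c = c(NA, NA))
df2 <- data.frame(a = c(0.2, 0.1), b = c(0.7, 0.3), c = c(NA, 1))

n_ok <- colSums(!is.na(df1))   # non-NA count per column: a = 1, b = 2, c = 0
all_na <- !n_ok                # 0 is treated as FALSE, so !0 is TRUE
all_na
#     a     b     c 
# FALSE FALSE  TRUE 

df2[all_na] <- NA              # blank out the matching columns of df2
```

Column a of df1 contains one NA but is not entirely NA, so the matching column of df2 is left alone.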
Or another option is to loop over the columns and check whether all the elements are NA to create a logical vector for subsetting the columns of 'df2' and assigning it to NA
df2[sapply(df1, function(x) all(is.na(x)))] <- NA
If these are big datasets, another option would be set from data.table (should be more efficient as this does the assignment in place)
library(data.table)
setDT(df2)
j1 <- which(sapply(df1, function(x) all(is.na(x))))
for(j in j1){
set(df2, i = NULL, j = j, value = NA)
}
I'd just like to understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple dataframe:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot use simply + because of the NA:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums, the rows where every value is NA are calculated as 0, while rows with only one NA work fine:
df$d <- rowSums(df, na.rm=T)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
am I missing something?
Thanks to all
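One option with rowSums is to take the row sums with na.rm=TRUE and multiply them by NA^!rowSums(!is.na(df)). Here rowSums(!is.na(df)) counts the non-NA values in each row, and negating it (!) gives TRUE only for rows that are entirely NA. Since NA^TRUE is NA and NA^FALSE is 1, the all-NA rows become NA while every other row is left unchanged.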
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10
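The NA^ step is the least obvious part, so here is a small sketch of it in isolation. It uses a two-column matrix m (an assumption for illustration, holding only the original a and b columns), which is why the last sum is 5 rather than 10.

```r
NA^c(TRUE, FALSE)   # TRUE is 1, FALSE is 0; x^0 is 1 even when x is NA
# [1] NA  1

m <- cbind(a = c(NA, NA, 3), b = c(2, NA, 2))
all_na <- !rowSums(!is.na(m))                  # TRUE only for row 2
res <- rowSums(m, na.rm = TRUE) * NA^all_na    # 2 * 1, 0 * NA, 5 * 1
res
# [1]  2 NA  5
```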
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the second row is effectively numeric(0), because every value in it is removed. Its sum is therefore 0.
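The empty-sum behaviour can be checked directly on a plain vector (a minimal sketch; x stands in for an all-NA row):

```r
x <- c(NA_real_, NA_real_)
x[!is.na(x)]                 # nothing left after dropping the NAs
# numeric(0)
s <- sum(x, na.rm = TRUE)    # the empty sum is defined as 0
s
# [1] 0
```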
If you want to retain NA for all NA cases, it would be a two-stage work. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
if (is.data.frame(x)) x <- as.matrix(x)
z <- base::rowSums(x, na.rm = TRUE)
z[!base::rowSums(!is.na(x))] <- NA
z
}
my_rowSums(df)
# [1] 2 NA 10
This is particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a data frame and, if so, converts it to a matrix. That type conversion is in fact more costly than the actual row sum computation. Note that we call base::rowSums two times, so to reduce the type conversion overhead we make sure x is a matrix beforehand.
For @akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10
I'm trying to calculate the mode for numeric columns. The columns which are not numeric, should have a "NA" as a placeholder in the vector. I would also need percentages according to a target. Some example data:
c1= c("A", "B", "C", "C", "B", "C", "C")
c2= factor(c(1, 1, 2, 2,1,2,1), labels = c("Y","N"))
d= as.Date(c("2015-02-01", "2015-02-03","2015-02-01","2015-02-05", "2015-02-03","2015-02-01", "2015-02-03"), format="%Y-%m-%d")
x= c(1,1,2,3,1,2,4)
y= c(1,2,2,6,2,3,1)
t= c(1,0,1,1,0,0,1)
df=data.frame(c1, c2, d, x, y,t)
df
c1 c2 d x y t
1 A Y 2015-02-01 1 1 1
2 B Y 2015-02-03 1 2 0
3 C N 2015-02-01 2 2 1
4 C N 2015-02-05 3 6 1
5 B Y 2015-02-03 1 2 0
6 C N 2015-02-01 2 3 0
7 C Y 2015-02-03 4 1 1
I would need the mode for each numeric column:
mode=as.numeric(c("NA","NA", "NA", 1,2,1))
mode
[1] NA NA NA 1 2 1
and a vector of percentages of rows with t==1, when value in column == mode
[1] NA NA NA 0.33 0.33
and a vector of percentages of rows with t==1, when value in column != mode
[1] NA NA NA 0.75 0.75
How could I calculate such vectors?
The best I have found for mode is:
library(plyr)
mode_fun <- function(x) {
mode0 <- names(which.max(table(x)))
if(is.numeric(x)) return(as.numeric(mode0))
mode0
}
kdf_mode=apply(kdf,2, numcolwise(mode_fun))
But it gives an error if there are any non numeric columns.
We can use sapply to loop over the columns of 'df', apply the mode_fun to get the output vector ('v1'). We use an if/else condition to return NA for non-numeric columns.
v1 <- unname(sapply(df, function(x) if(!is.numeric(x)) NA else mode_fun(x)))
v1
#[1] NA NA NA 1 2 1
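One thing to keep in mind with mode_fun: table sorts the values and which.max returns the first maximum, so when several values are tied the smallest one is reported. A quick check on made-up vectors (mode_fun is repeated here so the snippet runs on its own):

```r
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

m_num <- mode_fun(c(3, 1, 3, 1, 2))   # 1 and 3 both occur twice
m_num
# [1] 1
m_chr <- mode_fun(c("b", "a", "b"))
m_chr
# [1] "b"
```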
For the second case (I guess we don't need the 6th column, i.e. 't'), we loop through the columns of 'df' with sapply and use the if/else condition. In the else branch, we check which column values equal the mode (mode_fun(x)==x) and store that logical index as 'v1'. We use & to keep only the positions that equal the mode and also have t==1, take the sum, and divide by sum(v1).
unname(sapply(df[-6], function(x) if(!is.numeric(x)) {
NA
} else {
v1 <- mode_fun(x)==x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.3333333 0.3333333
For the third, we change the condition to get the logical index where the column is not equal to the mode. Do the same as in the previous case.
unname(sapply(df[-6], function(x) if(!is.numeric(x)){
NA
} else {
v1 <- mode_fun(x)!=x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.75 0.75
After we calculate 'v1', this can be also done without looping with sapply. We create a logical index where the column class is 'numeric' and the column names is not 't' ('indx').
indx <- sapply(df, is.numeric) & names(df)!='t'
We subset 'df' and 'v1' based on 'indx' (df[indx], v1[indx]) and make the lengths match by replicating the vector with col, which gives the numeric column index of every cell in df[indx]. Then we check whether the subset dataset is equal to that vector, which gives a logical matrix.
indx1 <- df[indx]==v1[indx][col(df[indx])]
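The col step can be seen more easily on a tiny example. This is only a sketch: d and v are made-up stand-ins for df[indx] and v1[indx].

```r
d <- data.frame(x = c(1, 1, 2), y = c(2, 6, 2))   # toy data, assumed
v <- c(1, 2)                                      # one reference value per column

col(d)          # column index of every cell
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    2
# [3,]    1    2

v[col(d)]       # v stretched to one value per cell, column by column
# [1] 1 1 1 2 2 2

hits <- colSums(d == v[col(d)])   # matches per column
hits
# x y 
# 2 2 
```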
As in the previous code, we use & to check whether the TRUE values in 'indx1' also correspond to t==1. We take the colSums, divide by the colSums of 'indx1', and concatenate (c) with the NA elements of 'v1'.
unname(c(v1[is.na(v1)], colSums(indx1& t==1)/colSums(indx1)))
#[1] NA NA NA 0.3333333 0.3333333
Similarly, we can create 'indx2' by changing the condition and then do colSums as before
indx2 <- df[indx]!=v1[indx][col(df[indx])]
unname(c(v1[is.na(v1)], colSums(indx2& t==1)/colSums(indx2)))
#[1] NA NA NA 0.75 0.75
Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4th element of column Cum.sums is the sum of 1 and 2, the 5th element is the sum of 2 and 3, and so on.
This means I would like to build a sum from the first column and save it in the second column. However, I don't want the normal cumulative sum but the sum of the element 3 rows above the current row plus the element 2 rows above the current row.
I already tried playing around with the sum and cumsum functions, but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
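For readers who have not met embed before, this short sketch shows what it returns on a small vector and why rowSums of the result gives the sums of adjacent pairs:

```r
x <- 1:5
e <- embed(x, 2)         # each row holds (x[i], x[i - 1])
e
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4
pair_sums <- rowSums(e)  # sums of adjacent pairs
pair_sums
# [1] 3 5 7 9
```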
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
# within() returns the modified data frame; with() alone would discard the assignment
df <- within(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, uses matrix multiplication, which makes it very inefficient for large data. So it's not very practical, though it is a nice exercise:
len <- nrow(df)
sr <- 2 # number of rows to sum
lag <- 3
mat <- matrix(
head(c(
rep(0, lag * len),
rep(rep(1:0, c(sr, len - sr + 1)), len)
), len * len),
nrow = len, byrow = TRUE
)
mat %*% df$Nums
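If the generality of sr and lag is what matters, the same window can be sketched without building a len-by-len matrix. The helper name lagged_sum is hypothetical, and it leaves the first lag entries as NA rather than 0:

```r
# Hypothetical helper: element i is sum(x[(i - lag):(i - lag + sr - 1)]).
lagged_sum <- function(x, sr = 2, lag = 3) {
  n <- length(x)
  stopifnot(sr >= 1, lag >= sr - 1, n > lag)
  s <- rowSums(embed(x, sr))             # sums of sr consecutive elements
  c(rep(NA_real_, lag), s[seq_len(n - lag)])
}

out <- lagged_sum(1:10)
out
#  [1] NA NA NA  3  5  7  9 11 13 15
```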
I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last non-NA value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
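One caveat: if a row is entirely NA, tail returns a zero-length vector and apply then returns a list instead of a vector. A possible NA-safe variant is sketched below; the name last_value and the toy matrix m are assumptions for illustration.

```r
# Hypothetical NA-safe variant: always returns exactly one value per row.
last_value <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok) == 0) NA else ok[length(ok)]
}

m <- rbind(c(1, 2, NA), c(NA, NA, NA), c(4, NA, 6))   # toy rows, assumed
res <- apply(m, 1, last_value)
res
# [1]  2 NA  6
```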
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row. For a row that is entirely NA, every position ties, so "last" picks the last column and the result for that row is NA.
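A small sketch of the intermediate index on a toy matrix (m is made up for illustration and includes an all-NA row):

```r
m <- rbind(c(1, 2, NA), c(NA, NA, NA), c(4, NA, 6))   # toy rows, assumed
idx <- max.col(!is.na(m), ties.method = "last")       # column of the last TRUE per row
idx
# [1] 2 3 3
vals <- m[cbind(seq_len(nrow(m)), idx)]               # pick one cell per row
vals
# [1]  2 NA  6
```

The all-NA row still yields NA, because whichever column is picked holds NA.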
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1