Replace the rows in dataframe with condition - r

In relation to the question "Dynamically replace row in dataframe with vector":
I have a data.frame, for example:
d <- read.table(text=' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header=TRUE)
Now I want to take one row, for example the first one (1 a), and:
Get the min and max value from that row. In this case min=2 and max=9 (note that some values in between are missing: there is no 5, 7, or 8 in that row).
Replace that row with the full sequence, including the missing values. The row will then be longer than all the others, as it runs from 2 to 9 (2,3,4,5,6,7,8,9), so the whole data.frame should be extended automatically with NA columns for the rows that are not as long as the one I replaced.
The following code achieves this:
row.to.change <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add > 0) {
d <- cbind(d, replicate(num.add, rep(NA, nrow(d))))
} else if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
Finally, it writes the new row into the data.frame and resets the column names of the extended data.frame to the default ones:
d[row.to.change,c(-1, -2)] <- new.row
colnames(d) <- paste0("V", seq_len(ncol(d)))
This works for the single row specified in row.to.change, but how can I make it work for all rows that have a 'b' in the second column? Something like "do this where d$V2 == 'b'", for a data.frame that is 5000 rows long.

You have already solved it. Just wrap your code in a function and apply it to each row of your data.
rtc=function(row.to.change){# <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
new.row
}
#d2=d
newr=lapply(1:nrow(d),rtc) # for the whole data
# for a specific condition, such as rows with "b" in V2, use this instead:
# newr=lapply(1:nrow(d),function(z)if(d$V2[z]=="b")rtc(z) else as.numeric(d[z,c(-1, -2)]))
mxl=max(sapply(newr,length))
newr=lapply(newr,function(z)if(length(z)<mxl)c(z,rep(NA,mxl-length(z))) else z)
if (ncol(d)-2 < mxl) {
d <- cbind(d, replicate(mxl-ncol(d)+2, rep(NA, nrow(d))))
}
d[,c(-1, -2)] <- do.call(rbind,newr)
colnames(d) <- paste0("V", seq_len(ncol(d)))
d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 a 2 3 4 5 6 7 8 9 NA
2 1 b 2 3 4 5 NA NA NA NA NA
3 1 c 1 2 3 4 5 6 7 8 NA
4 1 d 1 2 3 4 5 6 7 8 9
5 2 a 1 2 3 4 5 NA NA NA NA
6 2 b 1 2 3 4 5 6 7 NA NA
7 2 c 1 2 3 4 5 6 7 8 NA
8 2 d 2 3 4 5 6 7 8 9 NA
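The conditional variant can also be written so that the helper does not depend on the global d. The following is only a sketch: the name fill_rows_where and its arguments are made up for illustration, and it assumes that the first n_keys columns are keys and that all remaining columns are numeric.

```r
d <- read.table(text = "V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9", header = TRUE)

# Hypothetical helper: for the rows flagged TRUE in `rows`, replace the value
# columns with the full sequence from the row minimum to the row maximum.
fill_rows_where <- function(d, rows, n_keys = 2) {
  vals <- as.matrix(d[, -seq_len(n_keys), drop = FALSE])
  new_rows <- lapply(seq_len(nrow(vals)), function(i) {
    x <- unname(vals[i, ])
    if (isTRUE(rows[i]) && any(!is.na(x))) {
      seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE))
    } else {
      x  # rows that do not match are kept as they are
    }
  })
  # Pad every row with NA up to the length of the longest one
  width <- max(lengths(new_rows))
  padded <- do.call(rbind, lapply(new_rows, function(x) {
    c(x, rep(NA, width - length(x)))
  }))
  out <- cbind(d[, seq_len(n_keys), drop = FALSE], as.data.frame(padded))
  colnames(out) <- paste0("V", seq_len(ncol(out)))
  out
}

res <- fill_rows_where(d, d$V2 == "b")
```

With the example data only rows 2 and 6 are expanded, and the result has 9 columns because the longest new row (1 to 7) has 7 values.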

Related

How to remove rows with NAs from two dataframes based on NAs from one?

I am trying to remove from df2 the rows that have an NA in df1.
e.g.
df1
A
1 1
2 NA
3 7
4 NA
df2
A B C D
1 2 4 7 10
2 3 6 1 3
3 9 5 1 3
4 4 9 2 5
Intended outcome:
df1
A
1 1
3 7
df2
A B C D
1 2 4 7 10
3 9 5 1 3
I have already tried things along the lines of...
newdf <- df2[-which(rowSums(is.na(df1))),]
and
noNA <- function(x) { x[!rowSums(!is.na(df1)) == 1]}
NMR_6mos_noNA <- as.data.frame(lapply(df2, noNA))
or
noNA <- function(x) { x[,!is.na(df1)]}
newdf3 <- as.data.frame(lapply(df2, noNA))
We can use is.na to create a logical index and use it to subset the rows of both 'df1' and 'df2':
i1 <- !is.na(df1$A)
df1[i1, , drop = FALSE]
# A
#1 1
#3 7
df2[i1,]
# A B C D
#1 2 4 7 10
#3 9 5 1 3
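As a self-contained sketch of the same idea, the example below rebuilds the two data frames from the question and uses complete.cases(), which also covers the case where df1 has more than one column. The names keep, df1_clean and df2_clean are chosen here for illustration. A logical index is generally safer than -which(...): if there are no NAs at all, which() returns an empty vector and a negative empty index selects zero rows.

```r
df1 <- data.frame(A = c(1, NA, 7, NA))
df2 <- data.frame(A = c(2, 3, 9, 4), B = c(4, 6, 5, 9),
                  C = c(7, 1, 1, 2), D = c(10, 3, 3, 5))

# TRUE for the rows of df1 that contain no NA in any column
keep <- complete.cases(df1)

# Apply the same row filter to both data frames
df1_clean <- df1[keep, , drop = FALSE]
df2_clean <- df2[keep, , drop = FALSE]
```

Both results keep the original row names 1 and 3, so the rows can still be traced back to the input.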

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
id = c(1:10),
trials = sample(1:3, 10, replace = T),
t1 = c(sample(c(1:9, NA), 10)),
t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 hold the scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
x[which(is.na(x[1:(num.sun + 2)]))] <- 0
return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
id = c(1:5),
trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
t1 = c(NA, 7, NA, 6, NA),
t2 = c(NA, NA, 3, 7, 12),
t3 = c(NA, NA, NA, 4, NA)
)
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
w <- which(is.na(df), arr.ind = TRUE)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
df[i] <- 0
return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 & 10) have more values than trials. You could tackle that problem by rewriting the above function to:
replace.with.NA.or.0 <- function(df) {
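w <- which(is.na(df), arr.ind = TRUE)
df[w] <- 0
# same index matrix as before: row number and last column allowed to hold a 0
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5,-1))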
ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
df[ina] <- NA
return(df)
}
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t',x['trials'])], which overcomes the problem the other two solutions have with row 6:
replace0 <- function(x){
#browser()
x_na <- x[paste0('t',x['trials'])]
if(is.na(x_na)){x[paste0('t',x['trials'])] <- 0}
return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic, since they have more non-NA values than trials.
Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
nest(matches("^t\\d")) %>%
mutate(data = map2(data,trials,~mutate_all(.,replace_na,0) %>% select(.,1:.y))) %>%
unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more common gather strategy, this would be:
df %>%
gather(k,v,matches("^t\\d")) %>%
arrange(id) %>%
group_by(id) %>%
slice(1:first(trials)) %>%
mutate_at("v",~replace(.,is.na(.),0)) %>%
spread(k,v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
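The warning in the question arises because apply() hands the whole df$trials vector to every call, so 1:(num.sun + 2) only sees its first element. One way to avoid per-row calls altogether is to compare the column position of each score with that row's trial count. This is a base R sketch using the corrected dataset from the question; the names score_cols, scores and in_scope are introduced here for illustration.

```r
df <- data.frame(
  id = 1:5,
  trials = c(1, 1, 2, 3, 3),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)

score_cols <- c("t1", "t2", "t3")
scores <- as.matrix(df[score_cols])

# col(scores) holds the trial number of every cell; df$trials is recycled
# down the columns, so each cell is compared with its own row's trial count.
in_scope <- col(scores) <= df$trials

# Only NAs inside the allowed trials become 0; the others stay NA
scores[is.na(scores) & in_scope] <- 0
df[score_cols] <- scores
```

This relies on the score columns being in trial order (t1, t2, t3); if they are not, the comparison with col() would need a different mapping.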

Create another data set if elements are the same

I have two data sets, A and B (shown below), and want to create a third data set, C, based on this condition: if the elements of A and B at the same position are the same (matched), that element goes into C; if they do not match, the element should be NA (missing).
A
2 5 9 3
5 3 2 1
2 1 1 3
B
2 7 9 3
5 3 6 1
2 2 2 3
expected C should look like
2 NA 9 3
5 3 NA 1
2 NA NA 3
Both data sets have the same dimensions. Any suggestions, please?
`is.na<-`(A,!A==B)
V1 V2 V3 V4
1 2 NA 9 3
2 5 3 NA 1
3 2 NA NA 3
This should work for both data frames and matrices.
If A and B are data frames:
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# 1 2 NA 9 3
# 2 5 3 NA 1
# 3 2 NA NA 3
If A and B are matrices:
A <- as.matrix(A)
B <- as.matrix(B)
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# [1,] 2 NA 9 3
# [2,] 5 3 NA 1
# [3,] 2 NA NA 3
DATA
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3",
header = FALSE)
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3",
header = FALSE)
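If the inputs might already contain NAs, it can help to make the handling of those cells explicit, because a comparison involving NA is itself NA. The sketch below wraps the same replacement in a small function; keep_matching is a hypothetical name, and treating an NA comparison as "no match" is an assumption made here, not something stated in the question.

```r
A <- as.matrix(read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3"))
B <- as.matrix(read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3"))

# Hypothetical helper: keep the cells that agree, blank out the rest.
keep_matching <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  same <- a == b
  same[is.na(same)] <- FALSE  # a comparison involving NA counts as "no match"
  a[!same] <- NA
  a
}

C <- keep_matching(A, B)
```

The dimension check fails early when the two inputs do not line up, instead of silently recycling values.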

sorting a list of data frame on a condition

I have a list of data frames with different numbers of columns.
Say Y is a list of 3 data frames containing 4, 10 and 5 columns respectively.
I want to sort each data frame in the list according to a given column order (which column is sorted first, and so on). For that I have another list:
i1 = list(c(0),c(4,5,2,3),c(3))
i2 = c(0,4,1)
In the first data frame I don't want to sort anything; for the second and third data frames I want to follow the order given in i1 and i2.
I have tried writing this loop, which works for one data frame but not for a list:
for (i in 1:length(i1)) {
if (i2[i] < 1) {
sorted[[i]]=y[[i]]
} else {
for(j in i1[[i]]){
sorted[[i]] <- y[[i]][order(y[[i]][j]),]
}}}
We can do this with Map:
Map(function(x,y, z) if(z < 1) x else x[do.call(order, x[y]),], Y, i1, i2)
#[[1]]
# V1 V2 V3 V4
#1 3 10 7 10
#2 3 3 4 2
#3 8 8 7 1
#4 6 9 7 6
#5 7 3 4 2
#[[2]]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 1 7 4 3 5 1 5 10 5 4
#3 8 6 4 7 3 4 5 3 3 10
#4 2 7 2 7 3 3 8 2 2 8
#2 6 1 3 8 4 4 9 5 3 10
#5 3 1 10 10 1 4 6 2 8 5
#[[3]]
# V1 V2 V3 V4 V5
#2 3 6 2 3 8
#4 10 1 3 4 2
#5 7 8 4 9 5
#1 2 4 6 4 4
#3 1 9 6 9 10
data
set.seed(24)
Y <- list(as.data.frame(matrix(sample(1:10, 4*5, replace=TRUE), 5, 4)),
as.data.frame(matrix(sample(1:10, 10*5, replace=TRUE), 5, 10)),
as.data.frame(matrix(sample(1:10, 5*5, replace=TRUE), 5, 5)))
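The per-data-frame step can also be pulled out into a named function, which makes it easier to test on one data frame before mapping it over the list. This is a sketch on small made-up data (not the Y above): sort_by_cols is a hypothetical helper, and it treats a 0 in i1 as "do not sort", so the separate i2 vector is not needed. The columns are passed to order() unnamed so that a column name cannot be mistaken for one of order()'s own arguments.

```r
# Hypothetical helper: sort one data frame by the given column positions,
# or return it unchanged when no positive positions are given.
sort_by_cols <- function(df, cols) {
  cols <- cols[cols > 0]
  if (length(cols) == 0) return(df)
  df[do.call(order, unname(as.list(df[cols]))), , drop = FALSE]
}

# Small illustrative data, invented for this example
Y <- list(
  data.frame(a = c(3, 1, 2), b = c("x", "y", "z")),
  data.frame(a = c(2, 1, 2, 1), b = c(9, 8, 7, 6)),
  data.frame(a = c(5, 4), b = c(1, 2), c = c(0, 0))
)
i1 <- list(0, c(1, 2), 2)

sorted <- Map(sort_by_cols, Y, i1)
```

Here the first data frame is returned untouched, the second is sorted by column 1 and then column 2, and the third by column 2 only.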

padding missing rows between two data frames in R

I have two large data frames, A (N1 by 6) and B (N2 by 2). The first two columns of A are the keys for matching against B, and the keys in A are a subset of those in B.
What I want to do is pad A with the keys that are in B but not in A, and fill the other 4 columns with NA, reserved for missing value imputation later.
A
1 2 3 4 5 6
1 3 4 5 6 7
B
1 2
1 3
1 4
My new A
1 2 3 4 5 6
1 3 4 5 6 7
1 4 NA NA NA NA
I came up with something like this:
rowDiff <- setdiff(A[,1:2],B)
pad <- cbind(rowDiff, matrix(rep("NA",4*nrow(rowDiff)),ncol=4))
A <- rowbind(A,pad)
Is there a more efficient solution? Thanks
Would this work?
merge(B, A, all.x=TRUE)
It tests OK:
> A <- read.table(text="1 2 3 4 5 6
+ 1 3 4 5 6 7")
>
> B <- read.table(text="1 2
+ 1 3
+ 1 4")
> merge(B, A, all.x=TRUE)
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 3 4 5 6 7
3 1 4 NA NA NA NA
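When the key columns are not the only columns the two data frames share, it may be safer to name them explicitly with the by argument. The sketch below repeats the test with by set; key_cols, padded and n_padded are names chosen here. Note that merge() fills the unmatched rows with real NA values (not the string "NA") and, by default, returns the rows sorted by the key columns.

```r
A <- read.table(text = "1 2 3 4 5 6
1 3 4 5 6 7")
B <- read.table(text = "1 2
1 3
1 4")

# Name the key columns explicitly instead of relying on shared column names
key_cols <- c("V1", "V2")
padded <- merge(B, A, by = key_cols, all.x = TRUE)

# Count the rows of B that had no match in A and were filled with NA
n_padded <- sum(!complete.cases(padded))
```

For the example data, one row (key 1, 4) is added, which matches the expected "new A" in the question.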
