Create another data set if elements are the same - r

I have two data sets, A and B (shown below), and want to create a third data set called C based on this condition: if the elements of A and B at the same position are the same (matched), that value should go into C; if they do not match, that element should be NA (missing).
A
2 5 9 3
5 3 2 1
2 1 1 3
B
2 7 9 3
5 3 6 1
2 2 2 3
The expected C should look like:
2 NA 9 3
5 3 NA 1
2 NA NA 3
Both data sets have the same dimensions. Any suggestions, please?

`is.na<-`(A,!A==B)
V1 V2 V3 V4
1 2 NA 9 3
2 5 3 NA 1
3 2 NA NA 3

This should work for both data frames and matrices.
If A and B are data frames:
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# 1 2 NA 9 3
# 2 5 3 NA 1
# 3 2 NA NA 3
If A and B are matrices:
A <- as.matrix(A)
B <- as.matrix(B)
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# [1,] 2 NA 9 3
# [2,] 5 3 NA 1
# [3,] 2 NA NA 3
DATA
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3",
header = FALSE)
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3",
header = FALSE)
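The answers above rely on A and B having identical dimensions, as the question states. A minimal sketch that makes this assumption explicit before masking is shown below; the name `mismatch` is only illustrative and does not appear in the original answers.

```r
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3", header = FALSE)
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3", header = FALSE)

# Guard: an element-wise comparison only makes sense for equal dimensions.
stopifnot(identical(dim(A), dim(B)))

# Logical matrix marking the cells where A and B disagree.
mismatch <- A != B

# Copy A and blank out the mismatching cells.
C <- A
C[mismatch] <- NA
C
```

Keeping the comparison in its own variable also makes it easy to inspect how many cells differ, for example with sum(mismatch).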

Related

Find the index of columns containing more than 5 NA values

I want to subset a dataframe and extract only the columns that contain 5 or more NA values.
data.frame(A = rep(1, 10), B = c(rep(2,5), rep(3,5)), D = rep(5, 10), E = c(rep(1,2), rep(NA,6), rep(6,2)), F = c(rep(NA,2), rep(2,8)))
A B D E F
1 1 2 5 1 NA
2 1 2 5 1 NA
3 1 2 5 NA 2
4 1 2 5 NA 2
5 1 2 5 NA 2
6 1 3 5 NA 2
7 1 3 5 NA 2
8 1 3 5 NA 2
9 1 3 5 6 2
10 1 3 5 6 2
So in this example I want to have the index of the column "E".
My original dataset has about 3000 columns, so speed is more or less important.
I have been trying to do this with sum(is.na) and filter_if(any_vars) but all to no avail..
Using colSums with is.na:
names(df)[colSums(is.na(df))>5]
[1] "E"
We can use colSums on logical matrix (is.na(df1)), get the index with which and extract the names
names(which(colSums(is.na(df1)) >= 5))
#[1] "E"
which(unlist(lapply(df, function(x) sum(is.na(x)) > 5)))
E
4
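The answers use slightly different thresholds (> 5 and >= 5), and the question itself says both "more than 5" and "5 or more". A small sketch, assuming the example data is stored in `df` and that the intended condition is "5 or more", shows how to get the counts, the column index, and the subset in one go; `na_counts` and `idx` are illustrative names.

```r
df <- data.frame(A = rep(1, 10), B = c(rep(2, 5), rep(3, 5)), D = rep(5, 10),
                 E = c(rep(1, 2), rep(NA, 6), rep(6, 2)),
                 F = c(rep(NA, 2), rep(2, 8)))

# Number of NA values per column, computed on the logical matrix is.na(df).
na_counts <- colSums(is.na(df))

# Named integer vector: names are column names, values are column positions.
# Swap >= for > if "strictly more than 5" is what is needed.
idx <- which(na_counts >= 5)

# Keep only the qualifying columns; drop = FALSE preserves the data frame.
df_sub <- df[, idx, drop = FALSE]
```

With this data both thresholds select the same column, because E has 6 NA values and F has only 2.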

Replace all values in a data.table given a condition

How would you replace all values in a data.table given a condition?
For example
ppp <- data.table(A=1:6,B=6:1,C=1:6,D=3:8)
A B C D
1 6 1 3
2 5 2 4
3 4 3 5
4 3 4 6
5 2 5 7
6 1 6 8
I want to replace all "6" by NA
A B C D
1 NA 1 3
2 5 2 4
3 4 3 5
4 3 4 NA
5 2 5 7
NA 1 NA 8
I've tried something like
ppp[,ifelse(.SD==6,NA,.SD)]
but it doesn't work, it produces a much wider table.
A native data.table way to do this would be:
for(col in names(ppp)) set(ppp, i=which(ppp[[col]]==6), j=col, value=NA)
# Test
> ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8
This approach - while perhaps more verbose - is nevertheless going to be significantly faster than ppp[ppp == 6] <- NA, because it avoids the copying of all columns.
Even easier:
ppp[ppp == 6] <- NA
ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8
Importantly, this doesn't change its class:
is.data.table(ppp)
[1] TRUE
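The `.SD` attempt from the question can also be made to work by applying the replacement column by column instead of to `.SD` as a whole. This is only a sketch, assuming the data.table package is installed; unlike the set() loop above, it returns a new table rather than modifying `ppp` by reference. The name `res` is illustrative.

```r
library(data.table)

ppp <- data.table(A = 1:6, B = 6:1, C = 1:6, D = 3:8)

# lapply over .SD visits each column as a plain vector;
# replace() swaps the matching entries for NA and keeps the column type.
res <- ppp[, lapply(.SD, function(x) replace(x, x == 6, NA))]
res
```

The result has the same shape as `ppp`, because lapply() returns one vector per column.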

Complex restructuring of R dataframe

I have a data frame like this:
participant  v1  v2  v3  v4  v5  v6
1             4   2       9   7   2
2                 6   8           1
3                     5   4       5
4             1   1       2   3
Every two consecutive variables (v1 and v2, v3 and v4, v5 and v6) belong to each other (this is what I call "count" later).
I am desperately looking for a way to get the following:
participant  count  v(odd numbers)  v(even numbers)
1            1      4               2
             2                      9
             3      7               2
2            1                      6
             2      8
             3                      1
3            1
             2      5               4
             3                      5
4            1      1               1
             2                      2
             3      3
As this is my first question on stackoverflow ever, I hope you understand my request. I searched a lot for similar problems (and solutions to them) but found nothing. I would very much appreciate your support.
We can use melt
library(data.table)
melt(setDT(d1), measure = list(paste0("v", seq(1, 6, by= 2)),
paste0("v", seq(2,6, by = 2))))[order(participant)]
# participant variable value1 value2
# 1: 1 1 4 2
# 2: 1 2 NA 9
# 3: 1 3 7 2
# 4: 2 1 NA 6
# 5: 2 2 8 NA
# 6: 2 3 NA 1
# 7: 3 1 NA NA
# 8: 3 2 5 4
# 9: 3 3 NA 5
#10: 4 1 1 1
#11: 4 2 NA 2
#12: 4 3 3 NA
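The answer assumes the data is already stored in `d1`. The sketch below reconstructs `d1` from the values visible in the melt output (so the exact cell values are an inference, not the asker's original object) and builds the same long layout with base R only. It is an alternative illustration, not the answer's method; `odd`, `even`, and `long` are illustrative names.

```r
# Example data, reconstructed from the melt output shown above.
d1 <- data.frame(participant = 1:4,
                 v1 = c(4, NA, NA, 1),
                 v2 = c(2, 6, NA, 1),
                 v3 = c(NA, 8, 5, NA),
                 v4 = c(9, NA, 4, 2),
                 v5 = c(7, NA, NA, 3),
                 v6 = c(2, 1, 5, NA))

odd  <- paste0("v", seq(1, 6, by = 2))  # v1, v3, v5
even <- paste0("v", seq(2, 6, by = 2))  # v2, v4, v6

# Stack the odd and even columns; unlist() concatenates them column by column,
# so participant repeats as a block and count changes once per block.
long <- data.frame(
  participant = rep(d1$participant, times = length(odd)),
  count       = rep(seq_along(odd), each = nrow(d1)),
  odd         = unlist(d1[odd], use.names = FALSE),
  even        = unlist(d1[even], use.names = FALSE)
)

# Sort so that each participant's three pairs appear together.
long <- long[order(long$participant, long$count), ]
rownames(long) <- NULL
long
```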

Replace the rows in dataframe with condition

Hi, this is related to the question "Dynamically replace row in dataframe with vector".
I have a data.frame for example:
d <- read.table(text=' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header=TRUE)
Now I want to take one row, for example the first one (1a) and:
Get the min and max value from that row. In this case min=2 and max=9 (note there are missing values in between for example there is no 5, 7, or 8 in that row).
Now I want to replace that row with the full sequence, including the missing values, and extend it (the row will be longer than all others, as it will run from 2 to 9: 2,3,4,5,6,7,8,9). The whole data.frame should then be automatically extended by NA columns for the other rows that are not as long as the one I replaced.
Now the following code does achieve this:
row.to.change <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add > 0) {
d <- cbind(d, replicate(num.add, rep(NA, nrow(d))))
} else if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
and finally renames the extended data.frame headers as the default ones:
d[row.to.change,c(-1, -2)] <- new.row
colnames(d) <- paste0("V", seq_len(ncol(d)))
Now: this works for the row that I specify in row.to.change, but how can I make it work for, say, all rows that have a 'b' in the second column? Something like "do this where d$V2 == 'b'", in case the data.frame is 5000 rows long.
You have already solved it. Just wrap your code in a function and then apply it to each row of your data.
rtc=function(row.to.change){# <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
new.row
}
#d2=d
newr=lapply(1:nrow(d),rtc) # for the whole data
# for specific condition, like lines with "b" in V2 change to:
# newr=lapply(1:nrow(d),function(z)if(d$V2[z]=="b")rtc(z) else as.numeric(d[z,c(-1, -2)]))
mxl=max(sapply(newr,length))
newr=lapply(newr,function(z)if(length(z)<mxl)c(z,rep(NA,mxl-length(z))) else z)
if (ncol(d)-2 < mxl) {
d <- cbind(d, replicate(mxl-ncol(d)+2, rep(NA, nrow(d))))
}
d[,c(-1, -2)] <- do.call(rbind,newr)
colnames(d) <- paste0("V", seq_len(ncol(d)))
d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 a 2 3 4 5 6 7 8 9 NA
2 1 b 2 3 4 5 NA NA NA NA NA
3 1 c 1 2 3 4 5 6 7 8 NA
4 1 d 1 2 3 4 5 6 7 8 9
5 2 a 1 2 3 4 5 NA NA NA NA
6 2 b 1 2 3 4 5 6 7 NA NA
7 2 c 1 2 3 4 5 6 7 8 NA
8 2 d 2 3 4 5 6 7 8 9 NA
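The padding step in the middle of this answer (bringing all new rows to the length of the longest one) can be looked at in isolation. A minimal sketch, using a made-up list `rows` in place of `newr`, relies on the fact that assigning a larger length to a vector fills the new positions with NA:

```r
# Hypothetical input: sequences of different lengths, as rtc() would return.
rows <- list(2:9, 2:5, 1:8)

# Length of the longest sequence.
mxl <- max(lengths(rows))

# Extending a vector with `length<-` pads it with NA on the right.
padded <- lapply(rows, function(z) {
  length(z) <- mxl
  z
})

# All vectors now have equal length, so they can be stacked into a matrix.
m <- do.call(rbind, padded)
m
```

This does the same job as the c(z, rep(NA, mxl - length(z))) construction in the answer, without needing the length check.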

padding missing rows between two data frames in R

I have two large data frames, A (N1 by 6) and B (N2 by 2). The first two columns of A are the keys for matching against B, and the keys in A are a subset of those in B.
What I want to do is pad A with the keys that are in B but not in A, and fill the other 4 columns with NA, reserved for missing value imputation later.
A
1 2 3 4 5 6
1 3 4 5 6 7
B
1 2
1 3
1 4
My new A
1 2 3 4 5 6
1 3 4 5 6 7
1 4 NA NA NA NA
I came up with something like this:
rowDiff <- setdiff(A[,1:2],B)
pad <- cbind(rowDiff, matrix(rep("NA",4*nrow(rowDiff)),ncol=4))
A <- rowbind(A,pad)
Is there a more efficient solution? Thanks.
Would this work?
merge(B, A, all.x=TRUE)
It tests OK:
> A <- read.table(text="1 2 3 4 5 6
+ 1 3 4 5 6 7")
>
> B <- read.table(text="1 2
+ 1 3
+ 1 4")
> merge(B, A, all.x=TRUE)
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 3 4 5 6 7
3 1 4 NA NA NA NA
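The call above lets merge() pick the key columns from the names the two data frames share (V1 and V2 here). A sketch with the keys spelled out via `by`, which may be safer when real column names overlap by accident, looks like this; `padded` is an illustrative name.

```r
A <- read.table(text = "1 2 3 4 5 6
1 3 4 5 6 7")
B <- read.table(text = "1 2
1 3
1 4")

# Left join: keep every key pair from B, fill A's columns with NA
# where A has no matching row.
padded <- merge(B, A, by = c("V1", "V2"), all.x = TRUE)
padded
```

Note that merge() sorts the result by the key columns by default, so the original row order of A is not necessarily preserved.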
