Replace a complex, conditional for loop with apply in R - r

I'm relatively new to R and I'm hoping to replace my messy loop with something more elegant and faster (apply?). Basically, I want to populate a new matrix based on whether values in the same position in other matrices match one another. Let me illustrate:
>df1
V1 V2 V3
1 A G A
2 T T T
3 C A A
4 G C G
>df2
V1
1 A
2 T
3 C
4 G
>df3
V1 V2 V3
1 .25 .99 .41
2 .21 .25 .75
3 .35 .65 .55
4 .75 .21 .11
>newdf <- data.frame(matrix(ncol= ncol(df3), nrow = nrow(df3)))
Note that df1 and df3 will always have the same dimensions as one another, and df2 will always have the same nrow.
If positions Match: If df1[i,j] == df2[i], then I want newdf[i,j] = df3[i,j]
If positions don't match: If df1[i,j] != df2[i], then I want newdf[i,j] = 1-df3[i,j]
For instance df1[1,2] = 'G' and df2[1] = 'A', so I want newdf[1,2] = (1- df3[1,2])
I wrote a very gross for loop to perform this successfully:
df1<- as.matrix(df1)
df2<- as.matrix(df2)
df3<- as.matrix(df3)
newdf <- data.frame(matrix(ncol= ncol(df3), nrow = nrow(df3)))
for (i in (1:nrow(df1))){
for (j in (1:ncol(df1))){
if (df1[i,j] == df2[i]) {
newdf[i,j] = df3[i,j] }
else {
newdf[i,j] = 1- df3[i,j] }
}
}
Which gives me the desired results:
>newdf
X1 X2 X3
1 0.25 0.01 0.41
2 0.21 0.25 0.75
3 0.35 0.35 0.45
4 0.75 0.79 0.11
This is a very slow and messy process when I have lots of data. Are there any suggestions for other ways to solve this, perhaps using the apply family? Thanks and sorry for the nasty code.

You can use an apply to create an index of those values that don't match, then simply subtract them from one
idx <- (!apply(df1, 2, function(x) x == df2))
## alternatively, you can use x != df2 too
## idx <- (apply(df1, 2, function(x) x != df2))
df3[idx] <- 1 - df3[idx]
df3
# V1 V2 V3
# 1 0.25 0.01 0.41
# 2 0.21 0.25 0.75
# 3 0.35 0.35 0.45
# 4 0.75 0.79 0.11
Explanation
The apply call gives a matrix of TRUE/FALSE based on whether each column of df1 matches df2
V1 V2 V3
[1,] TRUE FALSE TRUE
[2,] TRUE TRUE TRUE
[3,] TRUE FALSE FALSE
[4,] TRUE FALSE TRUE
So taking the negation of this using ! gives the opposite values.
!apply(df1, 2, function(x) x == df2)
V1 V2 V3
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE TRUE
[4,] FALSE TRUE FALSE
which then tells us which values of df3 we need to change (shown here after the update)
df3[idx]
[1] 0.01 0.35 0.79 0.45
An alternative is to make df2 the same size as df1
## repeat the single column of df2 once per column of df1
df2 <- df2[, rep(1, ncol(df1))]
df1 != df2
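For what it's worth, the whole thing can also be done without apply. This is only a sketch, assuming df1, df2 and df3 are the matrices from the question (rebuilt by hand here): comparing a matrix with a vector recycles the vector down each column, so df2[, 1] lines up with the rows of df1, and ifelse then picks df3 or 1 - df3 element by element. The name mismatch is just an illustrative choice.

```r
# Rebuild the example matrices from the question (filled column by column)
df1 <- matrix(c("A", "T", "C", "G",
                "G", "T", "A", "C",
                "A", "T", "A", "G"), nrow = 4)
df2 <- matrix(c("A", "T", "C", "G"), ncol = 1)
df3 <- matrix(c(.25, .21, .35, .75,
                .99, .25, .65, .21,
                .41, .75, .55, .11), nrow = 4)

# df2[, 1] is a plain vector of length nrow(df1); it is recycled down
# every column of df1, giving a logical matrix with the shape of df1
mismatch <- df1 != df2[, 1]

# Keep df3 where the letters match, use 1 - df3 where they do not
newdf <- ifelse(mismatch, 1 - df3, df3)
newdf
```

Unlike the df3[idx] <- 1 - df3[idx] version, this leaves df3 untouched and returns a new matrix.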

Related

Find values in data frame 2 which is found in data frame 1, within a certain range

I want to find which values in df2 are also present in df1, within a certain range. A value is the pair of a and b in a row (a and b can't be split up). For example, can I find 9,1 (row 1 of df1) in df2? It doesn't have to be at the same position. Also, we can allow a difference of, for example, 1 for "a" and 1 for "b". For example, I want to find all values 9±1, 1±1 in df2. "a" and "b" always go together; each row sticks together. Does anyone have a suggestion for how to code this? Many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
> df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
Not correct, but I'm imagining this can be done with a for loop somehow!
for(i in df1[,1]) {
for(j in df1[,2]){
s<- c(s,(df1[i,1] & df1[j,2]== df2[,1] & df2[,2]))# how to add certain allowed diff levels?
}
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I loooove data.table, which is (imo) the weapon of choice for these kind of problems..
library( data.table )
#make df1 and df2 a data.table
setDT(df1, key = "feature"); setDT(df2)
#now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[ df1, c( "match", "feature_df2") := {
val = df2[ a %between% c( i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1 ), ]
unique_val = sort( unique( val$feature ) )
num_val = length( unique_val )
list( num_val, paste0( unique_val, collapse = ";" ) )
}, by = .EACHI ][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
One way to go about this in base R is to split each data.frame into a list of rows (l1 and l2, built in the Data section below), calculate the absolute difference between every pair of row vectors, and then check whether that difference is larger than a given value.
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
lapply(l2, function(y){
abs(x - y)
})
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
do.call(rbind, x)
})
# Finally see if a pair of numbers is outside our threshold (TRUE) or not (FALSE)
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1<- lapply(1:nrow(df1), function(x){
df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
df2[x, ]
})
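A vectorised base R sketch that produces the table the question asked for, in case it is useful. It assumes the df1 and df2 values printed in the question (typed in by hand here, because sample() can give different numbers across R versions) and a tolerance of 1 for both a and b. The names tol_a, tol_b, within_tol and result are made up for illustration. outer() builds a 5 x 5 logical matrix per column, and the two are combined with & so that a and b have to match in the same row of df2.

```r
# Data as printed in the question
df1 <- data.frame(feature = LETTERS[1:5],
                  a = c(9, 4, 7, 1, 2),
                  b = c(1, 4, 1, 2, 5))
df2 <- data.frame(feature = LETTERS[1:5],
                  a = c(5, 6, 9, 1, 10),
                  b = c(1, 4, 5, 1, 2))

tol_a <- 1  # allowed difference for "a" (assumed)
tol_b <- 1  # allowed difference for "b" (assumed)

# within_tol[i, j] is TRUE when row j of df2 is close enough to row i of df1
within_tol <- outer(df1$a, df2$a, function(x, y) abs(x - y) <= tol_a) &
              outer(df1$b, df2$b, function(x, y) abs(x - y) <= tol_b)

result <- data.frame(
  feature_df1 = df1$feature,
  match       = rowSums(within_tol),  # number of matching df2 rows
  feature_df2 = apply(within_tol, 1, function(hit)
    paste(df2$feature[hit], collapse = ";"))
)
result
```

With this data only A (matched by E) and D (matched by D) get a hit, which agrees with the wanted output. For large inputs the n x m matrices from outer() can get big, so the data.table approach may scale better.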

Add a column that shows whether the two previous columns include 0 or not

I have a data.frame called dat. I want to add a new column to it called dif. In each row, if the range from lower to upper includes 0 (e.g., -0.41 to 0.1 in the 1st row), I want the dif value to be FALSE; otherwise (e.g., 0.10 to 0.2 in the 2nd row), TRUE.
Is this possible to do in R for any similar data.frame? (The following is a toy example; a functional answer is appreciated.)
dat <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3*.1, row.names = paste("a", 1:3)) # add a column called `dif`
desired_output <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3*.1, dif = c(F,T,F), row.names = paste("a", 1:3))
You can use dplyr::between:
library(dplyr)
dat %>%
rowwise() %>%
mutate(dif = !between(0, lower, upper))
Output
# A tibble: 3 x 3
# Rowwise:
lower upper dif
<dbl> <dbl> <lgl>
1 -0.41 0.1 FALSE
2 0.1 0.2 TRUE
3 -0.2 0.3 FALSE
You can use :
transform(dat, dif = lower > 0 | upper < 0)
# lower upper dif
#1 -0.41 0.1 FALSE
#2 0.10 0.2 TRUE
#3 -0.20 0.3 FALSE
We can use mutate from dplyr
library(dplyr)
mutate(dat, dif = lower > 0 | upper < 0)
Or an option in base R
Reduce(`|`, Map(function(x, y) match.fun(y)(x, 0), dat, c(">", "<")))
#[1] FALSE TRUE FALSE
Here is another base R option using do.call with *
dat$dif <- do.call("*", dat) > 0
such that
> dat
lower upper dif
a 1 -0.41 0.1 FALSE
a 2 0.10 0.2 TRUE
a 3 -0.20 0.3 FALSE
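Since the question asks for something functional, the same comparison can be wrapped in a small helper. This is just a sketch: add_dif and its lower/upper arguments are hypothetical names, and it assumes the two columns are numeric with lower <= upper.

```r
# Hypothetical helper: adds a logical column `dif` that is TRUE when the
# interval [lower, upper] does NOT contain 0
add_dif <- function(data, lower = "lower", upper = "upper") {
  data$dif <- data[[lower]] > 0 | data[[upper]] < 0
  data
}

dat <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3 * .1,
                  row.names = paste("a", 1:3))
out <- add_dif(dat)
out
```

Passing the column names as strings lets the same function be reused on data frames whose bounds are named differently, e.g. add_dif(other, "ci.lb", "ci.ub") for a hypothetical data frame other.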

Reordering rows and columns in R

I know this has been answered before, but,
given a correlation matrix which looks like this:
V A B C D
A 1 0.3 0.1 0.4
B 0.2 1 0.4 0.3
C 0.1 0 1 0.9
D 0.3 0.3 0.1 1
which can be loaded in R as follows:
corr.matrix <- read.table("path/to/file", sep = '\t', header = T)
rownames(corr.matrix) <- corr.matrix$V
corr.matrix <- corr.matrix[, 2:ncol(corr.matrix)]
Based on 2 other files that dictate which of the rows and columns are to be plotted (because some are of no interest to me), I want to rearrange the rows and columns into the order that the 2 separate files dictate.
For example:
cols_order.txt
C
D
E
B
A
...
rows.txt
D
E
Z
B
T
A
...
I read those other 2 files like this:
rows.order <- read.table("rows_order.txt", sep = '\n', header=F)
colnames(rows.order) <- "Variant"
cols.order <- read.table("cols_order.txt", sep = '\n', header=F)
colnames(cols.order) <- "Variant"
And after this step I do this:
corr.matrix <- corr.matrix[rows.order$Variant, cols.order$Variant]
The values that I don't want to be plotted are successfully removed, but the order gets scrambled. How can I fix this?
The .order datasets are read correctly (I checked 3 times).
Here is a potential solution to your question. I tried to re-create a small-sized data.frame based on your question. The key here is the match function as well as some basic subsetting/filtering techniques in R:
## Re-create your example:
V <- data.frame(
A = c(1 , 0.3, 0.1 , 0.4),
B = c(0.2, 1 , 0.4 , 0.3),
C = c(0.1, 0 , 1 , 0.9),
D = c(0.3, 0.3, 0.1 , 1)
) #matrix() also ok
rownames(V) <- LETTERS[1:4]
## Reorder using `match` function
## Needs to be in data.frame form
## So use as.data.frame() if needed
## Here, I don't have the text file
## So if you want to load in txt files specifying rows columns
## Use `read.csv` or `read.table` to load
## And then store the relevant info into a vector as you did
col_order <- c("C","D","E","B","A")
col_order_filtered <- col_order[which(col_order %in% colnames(V))]
rows <- c("D","E","Z","B","T","A")
## Filter rows IDs, since not all are present in your data
row_filtered <- rows[rows %in% rownames(V)]
## Look up where each wanted name sits in V
## (wanted names go first in match(), V's names second)
V1 <- V[match(row_filtered, rownames(V)), match(col_order_filtered, colnames(V))]
V1
## C D B A
## D 0.9 1.0 0.3 0.4
## B 0.0 0.3 1.0 0.3
## A 0.1 0.3 0.2 1.0
Alternatively, if you are comfortable with dplyr package and the syntax, you can use it and often it is handy:
## Continued from previous code
library(dplyr)
V2 <- V %>%
select(C, D, B, A, everything()) %>%
slice(match(row_filtered, rownames(V)))
rownames(V2) <- row_filtered
V2
## C D B A
## D 0.9 1.0 0.3 0.4
## B 0.0 0.3 1.0 0.3
## A 0.1 0.3 0.2 1.0
Hope that helps.
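One possible cause of the scrambled order in the question, offered as an assumption because the question does not show str(rows.order): if the Variant column was read in as a factor (the default for read.table before R 4.0), then indexing with it uses the factor's integer codes rather than its labels. The sketch below reproduces that with a hand-built factor and shows that converting to character first gives the requested order. The names rows_order, wanted and reordered are illustrative.

```r
# Small stand-in for the correlation matrix
V <- data.frame(A = c(1, 0.3, 0.1, 0.4), B = c(0.2, 1, 0.4, 0.3),
                C = c(0.1, 0, 1, 0.9),   D = c(0.3, 0.3, 0.1, 1),
                row.names = LETTERS[1:4])

# Assumption: the order file ended up as a factor column
rows_order <- data.frame(Variant = factor(c("D", "E", "Z", "B", "T", "A")))

# The integer codes follow the sorted levels (A B D E T Z), not the file
# order, which is what `[` would use when given the factor directly
codes <- as.integer(rows_order$Variant)

# Convert to character and drop names that are not in V before indexing
wanted <- as.character(rows_order$Variant)
wanted <- wanted[wanted %in% rownames(V)]
reordered <- V[wanted, ]
reordered
```

If str(rows.order) already shows a character column, this is not the cause and the match-based approach above is the place to look.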

Creating new columns based on a pattern

I have a large dataset which has a pattern similar to the dataPattern below. I need help with the code to create the desiredresult dataset
library(data.table)
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
(dataPattern <- data.table(V1, V2))
(desiredresult <- data.table(V1, V2, c(rep(c(0.24, 0.25), times = 4),
rep(c(0.34, 0.35), times = 4)),
c(rep(c(0.95, 1.05), times = 4),
rep(c(1.95, 2.05), times = 4))))
I need help to create column V3 in the desiredresult. The pattern is as follows:
if V1 == "a" then V3 = V2
if V1 != "a" we repeat the previous corresponding set of V2 values until a new value of a is reached; then the new values of V2 are placed in V3, etc. The above repeats for all new values of a.
I also need your help with the code to create Column V4 in the desiredresult which is similar to column V3 except it checks if V1 == "f" and places the values of f from V2 into V4 and repeats it if V1 != "f"
I have tried:
rle(dataPattern$V1 == "a" )
# Run Length Encoding
# lengths: int [1:4] 2 6 2 6
# values : logi [1:4] TRUE FALSE TRUE FALSE
The sequence where V1 != "a" or V1 != "f" appears to be equal to the number of FALSE minus the number of TRUE. This is how many times each a sequence needs to be repeated until a new a is reached
Many Thanks
OK, here is a better way, I think, to get the values of V2 into the column depending on V1=='a'.
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
dataPattern <- data.frame(V1, V2)
dataPattern$V3 <- ifelse(dataPattern$V1 == "a", dataPattern$V2, NA)
dataPattern$V4 <- ifelse(dataPattern$V1 == "f", dataPattern$V2, NA)
for (i in 1:nrow(dataPattern)){
if (dataPattern$V1[i] == "a"){
tmpa <- dataPattern$V3[i]
}
if (is.na(dataPattern$V3[i])){
dataPattern$V3[i] <- tmpa
}
if (dataPattern$V1[nrow(dataPattern)-(i-1)] == "f"){
tmpf <- dataPattern$V4[nrow(dataPattern)-(i-1)]
}
if (is.na(dataPattern$V4[nrow(dataPattern)-(i-1)])){
dataPattern$V4[nrow(dataPattern)-(i-1)] <- tmpf
}
}
Output, which I think is more correct, according to your stated rules, than desiredresult:
> dataPattern
V1 V2 V3 V4
1 a 0.24 0.24 0.95
2 a 0.25 0.25 0.95
3 b 2.00 0.25 0.95
4 c 3.00 0.25 0.95
5 d 4.00 0.25 0.95
6 e 5.00 0.25 0.95
7 f 0.95 0.25 0.95
8 f 1.05 0.25 1.05
9 a 0.34 0.34 1.95
10 a 0.35 0.35 1.95
11 b 2.00 0.35 1.95
12 c 3.00 0.35 1.95
13 d 4.00 0.35 1.95
14 e 5.00 0.35 1.95
15 f 1.95 0.35 1.95
16 f 2.05 0.35 2.05
This seems to work:
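# rep(..., length.out = .N) recycles the two values explicitly across each group
dataPattern[, `:=`(
V3 = rep(head(V2,2), length.out = .N),
V4 = rep(tail(V2,2), length.out = .N)
), by=cumsum( V1 == "a" & shift(V1,type="lead") == "a" )]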
The result passes the all.equal(dataPattern, desiredresult) check. Depending on what your actual use-case looks like, you might need to put something different inside the cumsum.
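The same idea can be sketched in base R with ave(), for anyone not using data.table. This assumes, as in the example data, that every block starts with a run of "a" rows and ends with a pair of "f" rows. The names is_a and block are illustrative.

```r
V1 <- rep(c(rep("a", times = 2), letters[2:5],
            rep("f", times = 2)), times = 2)
V2 <- c(c(0.24, 0.25), 2:5, c(0.95, 1.05),
        c(0.34, 0.35), 2:5, c(1.95, 2.05))
dataPattern <- data.frame(V1, V2)

# A new block starts at an "a" row whose previous row is not "a"
is_a  <- dataPattern$V1 == "a"
block <- cumsum(is_a & !c(FALSE, head(is_a, -1)))

# Within each block, repeat the first two V2 values (the "a" rows) for V3
# and the last two V2 values (the "f" rows) for V4
dataPattern$V3 <- ave(dataPattern$V2, block,
                      FUN = function(v) rep(head(v, 2), length.out = length(v)))
dataPattern$V4 <- ave(dataPattern$V2, block,
                      FUN = function(v) rep(tail(v, 2), length.out = length(v)))
dataPattern
```

As with the cumsum in the data.table version, the definition of block is the part most likely to need adjusting for real data.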

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but it shows nothing.
The data has a lot of rows like this. How do I delete a row when z has nothing in it?
I mean deleting rows such as the second row.
output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would decode that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably only the first two columns), and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
Some example data:
df = data.frame(x = runif(100),
y = runif(100),
z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a vector of logicals (TRUE, FALSE):
df[df$z != "",]
And as @AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
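Because the question mentions that z is a factor, here is a small sketch that also covers that case. It assumes that "nothing" means an empty or whitespace-only string: the factor is converted to character, trimmed, and tested with nzchar(). The names dat and cleaned are illustrative.

```r
# Two-row example shaped like the question, with z stored as a factor
dat <- data.frame(x = c(1, 1), y = c(2, 2), z = factor(c("a", "")))

# Keep only rows whose z has at least one non-whitespace character
keep    <- nzchar(trimws(as.character(dat$z)))
cleaned <- dat[keep, ]

# Drop the now unused "" level from the factor
cleaned$z <- droplevels(cleaned$z)
cleaned
```

If the blanks are really NA rather than empty strings, filtering on !is.na(dat$z) would be needed as well.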
