how to deal with missing value in if else statement? - r

I have a dataframe, mydata, constructed as follows:
col1<-c(8.20e+07, 1.75e+08, NA, 4.80e+07,
3.40e+07, NA, 5.60e+07, 3.00e+06 )
col2<-c(1960,1960,1965,1986,1960
,1969,1960,1993)
col3<-c ( NA,2.190,NA,NA, 5.000, NA,
1.700,4.220)
mydata<-data.frame(col1,col2,col3)
mydata
# col1 col2 col3
# 1 8.20e+07 1960 NA
# 2 1.75e+08 1960 2.19
# 3 NA 1965 NA
# 4 4.80e+07 1986 NA
# 5 3.40e+07 1960 5.00
# 6 NA 1969 NA
# 7 5.60e+07 1960 1.70
# 8 3.00e+06 1993 4.22
I want to create a col4 that takes the values "a", "b" and "c":
if col1 is smaller than 4.00e+07, then col4 should be "a"; if col1 is not less than 4.00e+07, then col4 should be "b"; otherwise (col1 is NA) col4 should be "c".
Here is my code:
col4 <-ifelse(col1<4.00e+07, "a",
ifelse(col1 >=4.00e+07, "b",
ifelse(is.na(col1 =4.00e+07), "b", "c" )))
but this evaluates to:
# [1] "b" "b" NA "b" "a" NA "b" "a"
It doesn't change the NA value in col1 as "c".
The outcome should be:
# [1] "b" "b" "c" "b" "a" "c" "b" "a"
What is the problem in my code? Any suggestion would be appreciated!

You have to check is.na first, because NA < 4.00e+07 results in NA. If the first argument of ifelse() is NA, the result will be NA as well:
ifelse(c(NA, TRUE, FALSE), "T", "F")
## [1] NA "T" "F"
As you can see, for the first vector element the result is indeed NA. Even if the other arguments of ifelse() have special code that would take care of this situation, it won't help because that code is never taken into account.
For your example, checking for NA first gives you the desired result:
col4 <- ifelse(is.na(col1), "c",
ifelse(col1 < 4.00e+07, "a","b"))
col4
## [1] "b" "b" "c" "b" "a" "c" "b" "a"
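As an alternative sketch (not part of the answer above), the same result can be reached without nesting ifelse() by starting from the NA label and overwriting the non-missing cases. The use of which() here is a choice made for this example: it drops the positions where the comparison is NA, so those keep the default "c".

```r
col1 <- c(8.20e+07, 1.75e+08, NA, 4.80e+07, 3.40e+07, NA, 5.60e+07, 3.00e+06)

# Start with the label for missing values, then overwrite the non-missing cases.
col4 <- rep("c", length(col1))
col4[which(col1 <  4.00e+07)] <- "a"   # which() ignores the NA comparisons
col4[which(col1 >= 4.00e+07)] <- "b"
col4
## [1] "b" "b" "c" "b" "a" "c" "b" "a"
```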

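This can also be done with cut. Setting right=FALSE closes the intervals on the left, so a value of exactly 4.00e+07 is labelled "b", matching the col1 >= 4.00e+07 condition:
v1 <- with(mydata, as.character(cut(col1,
breaks=c(-Inf, 4.00e+07, Inf), labels=c("a", "b"), right=FALSE)))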
v1[is.na(v1)] <- "c"
v1
#[1] "b" "b" "c" "b" "a" "c" "b" "a"
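One detail worth checking with cut() is which end of each interval is closed, because it decides where a value exactly equal to the break point lands. The short sketch below uses three made-up test values (x and brks are names introduced only for this illustration) to compare the default right = TRUE with right = FALSE.

```r
x    <- c(3.9e+07, 4.0e+07, 4.1e+07)   # below, exactly on, and above the break
brks <- c(-Inf, 4.00e+07, Inf)

# Default: intervals are (lower, upper], so 4.0e+07 falls in the first bin.
right_closed <- as.character(cut(x, breaks = brks, labels = c("a", "b")))
right_closed
## [1] "a" "a" "b"

# right = FALSE: intervals are [lower, upper), so 4.0e+07 falls in the second bin.
left_closed <- as.character(cut(x, breaks = brks, labels = c("a", "b"), right = FALSE))
left_closed
## [1] "a" "b" "b"
```

For the data in the question no value sits exactly on the break, so both settings give the same labels there.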

Related

Fill missing values in a matrix with values from different rows in that matrix in R

I have a matrix with nucleotide sequences (containing NAs) in rows as shown here:
samp1 <- c("a","c","a",NA,"t","c")
samp2 <- c("a","c","t","t",NA,"a")
samp3 <- c("a","g","g","c","a","c")
samp4 <- c("a","g",NA,"g","g", NA)
samp5 <- c(NA, "g","g","g","t","g")
n.mat <- rbind(samp1,samp2,samp3,samp4,samp5)
n.mat
[,1] [,2] [,3] [,4] [,5] [,6]
samp1 "a" "c" "a" NA "t" "c"
samp2 "a" "c" "t" "t" NA "a"
samp3 "a" "g" "g" "c" "a" "a"
samp4 "a" "g" NA "g" "g" NA
samp5 NA "g" "g" "g" "t" "g"
I also have a data frame with two columns containing the sequence names:
df <- data.frame(
X1 = c("samp1", "samp2", "samp3", "samp4", "samp5"),
X2 = c("samp2", "samp5", "samp1", "samp3", "samp2"))
df
X1 X2
1 samp1 samp2
2 samp2 samp5
3 samp3 samp1
4 samp4 samp3
5 samp5 samp2
I would like to fill the gaps of a row in the matrix with nucleotides/values from another row in the matrix indicated by the df$X2 column in the data frame.
So for example: samp1 in the matrix has an NA in its row in the fourth column. So I would like to take the string in the same column from samp2 (indicated by the data frame in column X2). For samp2 I would like to fill the NA by taking the string from samp5 (indicated by the data frame in column X2).
If there is no NA in the row as it is in samp3, then do nothing.
If there are two NAs in a row (as in samp4), then I would like to take the strings for both columns from samp3.
I have tried following code:
replace.na <- function(n.mat,val) {
i <- is.na(n.mat)
j <- which(i)
k <- which(!i)
n.mat[j[j > k[length(k)]]] <- val
n.mat
}
n.mat[,-1] <- t(apply(matrix[,-1],1,replace.na))
But I am not quite sure how to include the df table to replace the NAs.
Here's some very compact code that I will explain (and it assumes either using R v 4.x or creating the 'df' dataframe with stringsAsFactors=FALSE):
n.mat[ is.na(n.mat) ] <- n.mat[df[['X2']],][ is.na(n.mat)]
n.mat
#------
[,1] [,2] [,3] [,4] [,5] [,6]
samp1 "a" "c" "a" "t" "t" "c"
samp2 "a" "c" "t" "t" "t" "a"
samp3 "a" "g" "g" "c" "a" "a"
samp4 "a" "g" "g" "g" "g" "a"
samp5 "a" "g" "g" "g" "t" "g"
is.na(n.mat) returns a logical matrix of the same dimensions as n.mat. It is used as an index on both sides of the assignment, but on the right side it is picking from a matrix whose rows have been rearranged into the order of the "replacement rows" that you specified in df. If the 'X1' column had not been in the same order as the rows of the target matrix, you would have needed to reorder that column via an order call, but it wasn't needed here.
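To make the intermediate steps visible, here is a small sketch on made-up data (m, map, donor, src and gaps are hypothetical names, not objects from the question). The lookup table is deliberately out of order, and match() is used to line up each matrix row with its donor; that is a substitute for the order call mentioned above, chosen for this example.

```r
# Toy matrix with one gap per row.
m <- rbind(s1 = c("a", NA,  "c"),
           s2 = c("g", "t", NA),
           s3 = c(NA,  "c", "c"))

# Lookup table whose X1 column is NOT in the same order as the rows of m.
map <- data.frame(X1 = c("s3", "s1", "s2"),
                  X2 = c("s1", "s2", "s3"),
                  stringsAsFactors = FALSE)

# Donor row name for each row of m, in the row order of m.
donor <- map$X2[match(rownames(m), map$X1)]

# Matrix of the same shape as m, holding the donor rows.
src  <- m[donor, , drop = FALSE]

# Logical matrix marking the gaps; used as the index on both sides.
gaps <- is.na(m)
m[gaps] <- src[gaps]
m
```

Because gaps and src have the same dimensions as m, each NA cell is replaced by the cell at the same position in its donor row.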
df <- read.table(text= 'X1 X2
1 samp1 samp2
2 samp2 samp5
3 samp3 samp1
4 samp4 samp3
5 samp5 samp2', header=TRUE, stringsAsFactors=FALSE)
Note the stringsAsFactors=FALSE. I think my failing to use that (since I'm still on R 3.6) meant that I had factors in the X2 column.
The other way of doing this is to create a two-column index of NA positions with the arr.ind parameter set to TRUE:
pos <- which(is.na(n.mat),arr.ind=TRUE)
> pos
row col
samp5 5 1
samp4 4 3
samp1 1 4
samp2 2 5
samp4 4 6
Then you can index with that 2 column matrix:
n.mat[pos] <- n.mat[ df[['X2']] ,][pos]
> n.mat
[,1] [,2] [,3] [,4] [,5] [,6]
samp1 "a" "c" "a" "t" "t" "c"
samp2 "a" "c" "t" "t" "t" "a"
samp3 "a" "g" "g" "c" "a" "a"
samp4 "a" "g" "g" "g" "g" "a"
samp5 "a" "g" "g" "g" "t" "g"
R's matrix indexing can produce some very compact solutions to problems like this. You should consider reading the ?'[' help page for more details and examples. The time you put into that effort will be repaid many times over if you continue using R. I'm sure I've read through it 10 or 20 times by now.
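As a minimal illustration of indexing with a two-column matrix, the sketch below uses a small numeric matrix invented for this purpose (m and pos are local to the example). Each row of pos is read as one (row, column) pair.

```r
# Filled column by column: NAs end up at (2, 1) and (2, 3).
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2)

# One row per NA cell, with columns named "row" and "col".
pos <- which(is.na(m), arr.ind = TRUE)
pos

# Indexing with the two-column matrix touches exactly those cells.
m[pos] <- 0
m
```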
Reprex:
n.mat <- matrix( scan(text = 'samp1 "a" "c" "a" "NA" "t" "c"
samp2 "a" "c" "t" "t" "NA" "a"
samp3 "a" "g" "g" "c" "a" "a"
samp4 "a" "g" "NA" "g" "g" "NA"
samp5 "NA" "g" "g" "g" "t" "g" ', what=""), nrow=5, byrow=TRUE)
n.mat <- matrix(n.mat[ ,-1], nrow=5, dimnames=list(n.mat[,1], NULL))
df <- read.table(text= 'X1 X2
1 samp1 samp2
2 samp2 samp5
3 samp3 samp1
4 samp4 samp3
5 samp5 samp2', header=TRUE, stringsAsFactors=FALSE)
You could try something like this below. Note that the output is a bit different, but I copied/pasted your code for creating your matrix (see samp3 column 6).
t(sapply(rownames(n.mat), function(x) {
na_cols <- is.na(n.mat[x, ])
n.mat[x, na_cols] <- n.mat[df[df$X1 == x, "X2"], na_cols]
n.mat[x, ]
}))
Output
[,1] [,2] [,3] [,4] [,5] [,6]
samp1 "a" "c" "a" "t" "t" "c"
samp2 "a" "c" "t" "t" "t" "a"
samp3 "a" "g" "g" "c" "a" "c"
samp4 "a" "g" "g" "g" "g" "c"
samp5 "a" "g" "g" "g" "t" "g"
Data
n.mat <- structure(c("a", "a", "a", "a", NA, "c", "c", "g", "g", "g",
"a", "t", "g", NA, "g", NA, "t", "c", "g", "g", "t", NA, "a",
"g", "t", "c", "a", "c", NA, "g"), .Dim = 5:6, .Dimnames = list(
c("samp1", "samp2", "samp3", "samp4", "samp5"), NULL))

Replacing matrix cells for corresponding row names

I'm working with a matrix that looks like this input.
I'm trying to replace numbers on column 2 by their corresponding row name. I.e. all 1s would be replaced by row.name(matrix). Thus, I'd have the following output.
The actual matrix is too large for loop application... I'm sorry I'm using images since I found it easier to represent this on excel. I'm also sorry about being quite new at R...
Vectorized approach (should be the fastest you can get):
mat <- matrix(c(letters[1:11], 1,1,1,2,2,3,3,3,4,4,4), ncol = 2)
colnames(mat) <- c("A", "B")
rownames(mat) <- 1:11
> mat
A B
1 "a" "1"
2 "b" "1"
3 "c" "1"
4 "d" "2"
5 "e" "2"
6 "f" "3"
7 "g" "3"
8 "h" "3"
9 "i" "4"
10 "j" "4"
11 "k" "4"
mat[, "B"] <- mat[as.numeric(mat[, "B"]), "A"]
> mat
A B
1 "a" "a"
2 "b" "a"
3 "c" "a"
4 "d" "b"
5 "e" "b"
6 "f" "c"
7 "g" "c"
8 "h" "c"
9 "i" "d"
10 "j" "d"
11 "k" "d"
Or you could use sapply:
mat[, "B"] <- sapply(mat[, "B"], function(x) mat[as.numeric(x), "A"])
Edit: I've put the vectorized solution at the top, as this is clearly the faster (or even fastest?) approach.
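The vectorized line above relies on the values in column B being usable as row numbers. If the keys were row names that are not simply 1, 2, 3, ..., a similar lookup could go through match(), as in this sketch. The matrix and its row names are made up for illustration, and idx is a name introduced here.

```r
mat <- matrix(c("a", "b", "c", "d", "e",
                "x", "x", "y", "y", "z"),
              ncol = 2,
              dimnames = list(c("x", "y", "z", "w", "v"), c("A", "B")))

# Position of each key from column B among the row names (NA if a key is absent).
idx <- match(mat[, "B"], rownames(mat))

# Replace the keys with the value of column A in the matched row.
mat[, "B"] <- mat[idx, "A"]
mat[, "B"]
```

A key with no matching row name would come back as NA rather than raising an error, which may or may not be what you want.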

Unique characters from a column of concatenated strings

I have a data.frame with a string column 'city' which consists of concatenated letters separated by ;
dt = data.frame(id = letters[1:6],
city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
dt
# id city
# 1 a A;B
# 2 b B;D
# 3 c A;D;G
# 4 d A;C
# 5 e F;G
# 6 f C;D
I hope to get the unique individual letters from the 'city' column:
city = c("A","B","C","D","F","G")
How to do this?
A cleaner solution would be:
dt= data.frame(id=letters[1:6],city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
city=strsplit(as.character(dt$city), ";")
city=sort(unique(unlist(city)))
[1] "A" "B" "C" "D" "F" "G"
The data:
dt= data.frame(id=letters[1:6],city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
> dt
id city
1 a A;B
2 b B;D
3 c A;D;G
4 d A;C
5 e F;G
6 f C;D
Split the column city, using as.character to convert to strings:
city <- unlist(strsplit(as.character(dt$city), ";", fixed = TRUE))
> city
[1] "A" "B" "B" "D" "A" "D" "G" "A" "C" "F" "G" "C" "D"
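The as.character() call matters when the column is a factor, which is what data.frame() produces for strings by default before R 4.0. The sketch below builds a small factor by hand (city_f, res and cities are names used only here) to show that strsplit() rejects a factor and works once it is converted.

```r
city_f <- factor(c("A;B", "B;D"))

# strsplit() only accepts character input; on a factor it signals an error.
res <- tryCatch(strsplit(city_f, ";", fixed = TRUE),
                error = function(e) "error")
res
## [1] "error"

# After conversion the split works as expected.
cities <- sort(unique(unlist(strsplit(as.character(city_f), ";", fixed = TRUE))))
cities
## [1] "A" "B" "D"
```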
Now use unique and order to get the output:
city <- unique(city)
> city
[1] "A" "B" "D" "G" "C" "F"
city <- city[order(city)]
> city
[1] "A" "B" "C" "D" "F" "G"
> dput(city)
c("A", "B", "C", "D", "F", "G")
Edit: Updated with OP's new data.
Edit2: Updated to omit the sapply, as apparently strsplit is vectorized. Thanks @Cris!

print objects in common in two character objects

I have two character objects. I need to see how many elements they have in common and then print them. I have no problem seeing how many they have in common, but I can't seem to figure out the code to print them. Here's a simple example:
LETTERS
list <- c("A", "H", "J", "K")
length(na.exclude(pmatch(LETTERS[1:20],list[1:3])))
print(pmatch(LETTERS[1:20],list[1:3]))
This prints:
LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
list <- c("A", "H", "J", "K")
length(na.exclude(pmatch(LETTERS[1:20],list[1:3])))
[1] 3
print(pmatch(LETTERS[1:20],list[1:3]))
[1] 1 NA NA NA NA NA NA 2 NA 3 NA NA NA NA NA NA NA NA NA NA
So I know that there are 3 in common and I know their positions but how do I make it print "A" "H" "J"?
Try using %in%
> LETTERS[LETTERS %in% list]
[1] "A" "H" "J" "K"
For your example:
myletters<-LETTERS[1:20]
> myletters[myletters %in% list[1:3]]
[1] "A" "H" "J"
Alternative: using pmatch as suggested by you
pmatch(list[1:3],myletters) # gives the indices
[1] 1 8 10
myletters[pmatch(list[1:3],myletters)] # get the letters
[1] "A" "H" "J"
If you want only the final result as a set (duplicates removed), use this:
intersect(LETTERS, c("A", "H", "J"))
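The difference between %in% and intersect() shows up when the input contains repeats. A short sketch with a made-up vector x (kept and common are names introduced here):

```r
x <- c("A", "H", "A", "Z")

# %in% keeps every matching element, including repeats, in the original order.
kept <- x[x %in% LETTERS[1:20]]
kept
## [1] "A" "H" "A"

# intersect() treats both inputs as sets, so each common value appears once.
common <- intersect(x, LETTERS[1:20])
common
## [1] "A" "H"
```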
If you want to use partial matching, you must observe that pmatch does not allow more than one element in the first input matching the same one in the second. Notice the difference:
mylist <- c("B","A","B","2")
> pmatch(mylist, LETTERS)
[1] 2 1 NA NA
> Vectorize(pmatch, "x")(mylist, LETTERS)
B A B 2
2 1 2 NA
Now, if you want to print the elements of mylist that match (partially) with the elements of, say, LETTERS, keeping the order and duplicates, you can use this:
> mylist[!is.na(Vectorize(pmatch, "x")(mylist, LETTERS))]
[1] "B" "A" "B"

How to extract unique levels from 2 columns in a data frame in r

I have the data.frame
df<-data.frame("Site.1" = c("A", "B", "C"),
"Site.2" = c("D", "B", "B"),
"Tsim" = c(2, 4, 7),
"Jaccard" = c(5, 7, 1))
# Site.1 Site.2 Tsim Jaccard
# 1 A D 2 5
# 2 B B 4 7
# 3 C B 7 1
I can get the unique levels for each column using
top.x<-unique(df[1:2,c("Site.1")])
top.x
# [1] A B
# Levels: A B C
top.y<-unique(df[1:2,c("Site.2")])
top.y
# [1] D B
# Levels: B D
How do I get the unique levels for both columns and turn them into a vector i.e:
v <- c("A", "B", "D")
v
# [1] "A" "B" "D"
top.xy <- unique(unlist(df[1:2, c("Site.1", "Site.2")]))
top.xy
[1] A B D
Levels: A B C D
Try union:
union(top.x, top.y)
# [1] "A" "B" "D"
union(unique(df[1:2, c("Site.1")]),
unique(df[1:2, c("Site.2")]))
# [1] "A" "B" "D"
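If you want a plain character vector regardless of whether the columns are factors (the default before R 4.0) or already character, one possible sketch is to convert explicitly before combining. The names rows and v are introduced here for illustration.

```r
df <- data.frame(Site.1  = c("A", "B", "C"),
                 Site.2  = c("D", "B", "B"),
                 Tsim    = c(2, 4, 7),
                 Jaccard = c(5, 7, 1))

rows <- 1:2   # the rows of interest, as in the question

# as.character() is a no-op on character columns and converts factor columns.
v <- unique(c(as.character(df$Site.1[rows]),
              as.character(df$Site.2[rows])))
v
## [1] "A" "B" "D"
```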
You can get the unique levels for the first two columns:
de<- apply(df[,1:2],2,unique)
de
# $Site.1
# [1] "A" "B" "C"
# $Site.2
# [1] "D" "B"
Then you can take the symmetric difference of the two sets:
union(setdiff(de$Site.1,de$Site.2), setdiff(de$Site.2,de$Site.1))
# [1] "A" "C" "D"
If you're interested in just the first two rows (as in your example):
de<- apply(df[1:2,1:2],2,unique)
de
# Site.1 Site.2
# [1,] "A" "D"
# [2,] "B" "B"
union(de[,1],de[,2])
# [1] "A" "B" "D"
