R split character vector using strsplit? - r

As a newcomer to R, how should I correctly handle a variable whose elements can hold multiple values, like this:
x = c("1","1","1/2","2","2/3","1/3")
As you can see, the value 3 only appears in conjunction with others.
To work with x further, the best approach would be to obtain 3 vectors like:
X[1] = c(1,1,1,NA,NA,1)
because "1" appears in the 1st, 2nd, 3rd and 6th places,
and likewise for X[2] and X[3].
All the information seems to be preserved this way. Am I wrong?
I have already tried strsplit, but it does not produce NA values for the positions where a value is absent from my vector.

An alternative is to use cSplit_e from my "splitstackshape" package.
x = c("1","1","1/2","2","2/3","1/3")
library(splitstackshape)
cSplit_e(data.frame(x), "x", "/")
#     x x_1 x_2 x_3
# 1   1   1  NA  NA
# 2   1   1  NA  NA
# 3 1/2   1   1  NA
# 4   2  NA   1  NA
# 5 2/3  NA   1   1
# 6 1/3   1  NA   1
(Note that the results here are transposed in comparison to the results in the accepted answer.)

This seems to work:
x = c("1","1","1/2","2","2/3","1/3")
#Split on your character. This may not be inclusive of all characters that
#need to be split on.
xsplit <- strsplit(x, "\\/")
#Find the unique items
xunique <- unique(unlist(xsplit))
#Iterate over each xsplit for all unique values
out <- sapply(xsplit, function(z)
  sapply(xunique, function(zz) zz %in% z)
)
#convert FALSE to NA
out[out == FALSE] <- NA
#Results in
> out
  [,1] [,2] [,3] [,4] [,5] [,6]
1 TRUE TRUE TRUE   NA   NA TRUE
2   NA   NA TRUE TRUE TRUE   NA
3   NA   NA   NA   NA TRUE TRUE
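If numeric 1/NA vectors are preferred, as in the question, the same idea can be wrapped in a small helper. This is only a sketch: the name split_indicator and its sep argument are made up for illustration, and it assumes the separator is a literal string rather than a regular expression.

```r
x <- c("1", "1", "1/2", "2", "2/3", "1/3")

# Hypothetical helper: one column per distinct value, 1 where present, NA otherwise
split_indicator <- function(x, sep = "/") {
  parts <- strsplit(x, sep, fixed = TRUE)   # split each element on the literal separator
  vals <- sort(unique(unlist(parts)))       # distinct values across all elements
  vapply(vals, function(v) {
    hit <- vapply(parts, function(p) v %in% p, logical(1))
    ifelse(hit, 1, NA_real_)
  }, numeric(length(x)))
}

X <- split_indicator(x)
X[, "1"]
# [1]  1  1  1 NA NA  1
```

Each column of X then plays the role of X[1], X[2] and X[3] from the question.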

Related

In R, use if loop with agrep to assign value

The pattern list looks like:
pattern <- c('aaa','bbb','ccc','ddd')
The column X in df looks like:
df$X <- c('aaa-053','aaa-001','aab','bbb')
What I am trying to do: use agrep to find, for each value of df$X, the matching name in pattern, then assign a value to an existing column 'column2' based on the result. For example, if 'aaa-053' matches 'aaa', then 'aaa' should be the value in 'column2'; if nothing matches, that row should be NA. My attempt:
for (i in 1:length(pattern)) {
match <- agrep(pattern, df$X, ignore.case=TRUE, max=0)
if agrep = TRUE {
df$column2 <- pattern
} else {df$column2 <- na
}
}
The desired column2 in df looks like:
'aaa','aaa',NA,'bbb'
agrep by itself isn't going to give you much to determine which pattern to use when several match. For instance,
agrep(pattern[1], df$X)
# [1] 1 2 3
which makes sense for the first two, but the third is not among your expected values. Similarly, it's feasible that several patterns match a given string.
Here's an alternative:
D <- adist(pattern, df$X, fixed = FALSE)
D
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    1    3
# [2,]    3    3    2    0
# [3,]    3    3    3    3
# [4,]    3    3    3    3
D[D > 0] <- NA
D
#      [,1] [,2] [,3] [,4]
# [1,]    0    0   NA   NA
# [2,]   NA   NA   NA    0
# [3,]   NA   NA   NA   NA
# [4,]   NA   NA   NA   NA
apply(D, 2, function(z) which.min(z)[1])
# [1] 1 1 NA 2
pattern[apply(D, 2, function(z) which.min(z)[1])]
# [1] "aaa" "aaa" NA "bbb"
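When only exact hits (distance 0) count, as after D[D > 0] <- NA above, a plain substring test gives the same assignment on this example without approximate matching. The following is a minimal sketch that swaps adist for grepl with fixed = TRUE; it assumes df is a data frame with a character column X and that the first matching pattern should win.

```r
pattern <- c("aaa", "bbb", "ccc", "ddd")
df <- data.frame(X = c("aaa-053", "aaa-001", "aab", "bbb"),
                 stringsAsFactors = FALSE)

# hits[i, j] is TRUE when pattern[j] occurs literally inside df$X[i]
# (tolower() on both sides stands in for ignore.case = TRUE)
hits <- vapply(pattern,
               function(p) grepl(tolower(p), tolower(df$X), fixed = TRUE),
               logical(nrow(df)))

# Position of the first matching pattern per row, NA when none matches
first_hit <- apply(hits, 1, function(h) match(TRUE, h))

df$column2 <- pattern[first_hit]
df$column2
# [1] "aaa" "aaa" NA    "bbb"
```

If several patterns can legitimately match one string, the "first wins" rule here is a design choice that may need replacing with something more specific.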

returning matrix column indices matching value(s) in R

I'm looking for a fast way to return the indices of columns of a matrix that match values provided in a vector (ideally of length 1 or the same as the number of rows in the matrix)
for instance:
mat <- matrix(1:100,10)
values <- c(11,2,23,12,35,6,97,3,9,10)
the desired function, which I call rowMatches() would return:
rowMatches(mat, values)
[1] 2 1 3 NA 4 1 10 NA 1 1
Indeed, value 11 is first found at the 2nd column of the first row, value 2 appears at the 1st column of the 2nd row, value 23 is at the 3rd column of the 3rd row, value 12 is not in the 4th row... and so on.
Since I haven't found any solution in package matrixStats, I came up with this function:
rowMatches <- function(mat,values) {
  res <- integer(nrow(mat))
  matches <- mat == values
  for (col in ncol(mat):1) {
    res[matches[,col]] <- col
  }
  res[res==0] <- NA
  res
}
For my intended use, there will be millions of rows and few columns. So splitting the matrix into rows (in a list called, say, rows) and calling Map(match, as.list(values), rows) would be way too slow.
But I'm not satisfied by my function because there is a loop, which may be slow if there are many columns. It should be possible to use apply() on columns, but it won't make it faster.
Any ideas?
res <- arrayInd(match(values, mat), .dim = dim(mat))
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
#       [,1] [,2]
#  [1,]    1    2
#  [2,]    2    1
#  [3,]    3    3
#  [4,]    2   NA
#  [5,]    5    4
#  [6,]    6    1
#  [7,]    7   10
#  [8,]    3   NA
#  [9,]    9    1
# [10,]   10    1
Roland's answer is good, but I'll post an alternative solution:
res <- which(mat==values, arr.ind = TRUE)
res <- res[match(seq_len(nrow(mat)), res[,1]), 2]
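For reference, the which()/match() approach can be wrapped into the rowMatches() function the question asks for and checked against the expected output. This is a sketch under the question's own assumption that values has length 1 or nrow(mat), so that mat == values recycles values down each column.

```r
mat <- matrix(1:100, 10)
values <- c(11, 2, 23, 12, 35, 6, 97, 3, 9, 10)

rowMatches <- function(mat, values) {
  # All (row, col) positions where the row's target value occurs,
  # listed in column-major order, so the lowest column comes first per row
  hits <- which(mat == values, arr.ind = TRUE)
  # For each row, take the first listed hit; rows without a hit give NA
  unname(hits[match(seq_len(nrow(mat)), hits[, "row"]), "col"])
}

rowMatches(mat, values)
# [1]  2  1  3 NA  4  1 10 NA  1  1
```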

Need to vectorize function that using loop (replace NA rows with values from vector)

How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain only NAs with the value from the k-th column of the previous row
na.replace <- function(x, k) {
  for (i in 2:nrow(x)) {
    if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This is the input data and the value returned by the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
      [,1] [,2]
 [1,]   NA   NA
 [2,]   NA   NA
 [3,]    1    2
 [4,]    2    3
 [5,]   NA   NA
 [6,]   NA   NA
 [7,]   NA   NA
 [8,]    6    7
 [9,]    7    8
[10,]    8    9
na.replace(m, 2)
      [,1] [,2]
 [1,]   NA   NA
 [2,]   NA   NA
 [3,]    1    2
 [4,]    2    3
 [5,]    3    3
 [6,]    3    3
 [7,]    3    3
 [8,]    6    7
 [9,]    7    8
[10,]    8    9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
      [,1] [,2]
 [1,]   NA   NA
 [2,]   NA   NA
 [3,]    1    2
 [4,]    2    3
 [5,]    3    3
 [6,]    3    3
 [7,]    3    3
 [8,]    6    7
 [9,]    7    8
[10,]    8    9
REVISED Some revisions to improve speed as in comments below. Also added alternatives in discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because in the previous pass through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
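The two steps described above (fill the k-th column down, then overwrite the all-NA rows) can be sketched in base R without a loop. The helper name fill_down is made up for illustration, and the sketch assumes, as in the question's example, that rows are either complete or entirely NA.

```r
# Hypothetical helper: carry the last non-NA value forward
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))        # how many non-NA values seen so far (0 before the first)
  c(NA, v[!is.na(v)])[idx + 1]    # look up the most recent one, NA if none yet
}

fill_down(c(NA, 1, NA, NA, 3, NA, 4, NA, NA))
# [1] NA  1  1  1  3  3  4  4  4

m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
k <- 2
filled <- fill_down(m[, k])                 # step 1: fill column k down
all_na <- rowSums(is.na(m)) == ncol(m)      # step 2: find the all-NA rows
m[all_na, ] <- filled[all_na]               # the same value goes into every column
```

After this, m matches the output of na.replace(m, 2) shown in the question.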
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
#     V1 V2
#  1: NA NA
#  2: NA NA
#  3:  1  2
#  4:  2  3
#  5:  3  3
#  6:  3  3
#  7:  3  3
#  8:  6  7
#  9:  7  8
# 10:  8  9
Here is another vectorized approach that uses only base R:
na.replace <- function(x, k) {
  is.all.na <- rowSums(is.na(x)) == ncol(x)
  ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
  ref.idx[ref.idx == 0] <- NA
  x[is.all.na, ] <- x[ref.idx[is.all.na], k]
  x
}
And for a fair comparison with @Eldar's solution, replace the is.all.na line with is.all.na <- is.na(x[, k]).
I finally implemented my own vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works like na.locf but much faster, and accepts only 1D structures
na.lomf <- function(object, na.rm = FALSE) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
  v <- x[, k]
  i <- which(is.na(v))
  r <- na.lomf(v)
  x[i, ] <- r[i]
  x
}
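Here's a workaround with the na.locf function from zoo. Note that it carries the whole previous non-NA row forward rather than the value from column k, so rows 5 to 7 become 2 3 instead of 3 3:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=FALSE),]
      [,1] [,2]
 [1,]   NA   NA
 [2,]   NA   NA
 [3,]    1    2
 [4,]    2    3
 [5,]    2    3
 [6,]    2    3
 [7,]    2    3
 [8,]    6    7
 [9,]    7    8
[10,]    8    9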

use function in every two columns in R

Now I have a data set that looks like this:
> data
             a        b         c         d
[1,] 0.5943590 2.195610 0.5332164 1.3004142
[2,] 0.7635876 1.917823 0.9714945 1.3251010
[3,] 0.9942722 2.350122 1.2048159 1.1675700
[4,] 0.3736785 1.876318 0.9109197 0.8520509
And then I want to use a function for every two columns, for example,
F2<- function(x,y) (sum((x - y) ^ 2)) #define function
F2(data$a, data$b) #use function for first two columns
F2(data$a, data$c) #use function for first and third columns
F2(data$b, data$c) #use function for second and third columns
..................
How can I use the apply family to do this? Any help is greatly appreciated.
That's a job for combn:
#some data
set.seed(42)
m <- matrix(rnorm(16),4)
F2<- function(x,y) (sum((x - y) ^ 2))
res <- matrix(NA, ncol(m), ncol(m))
res[lower.tri(res)] <- combn(ncol(m), 2,
                             FUN=function(ind) F2(m[,ind[1]], m[,ind[2]]))
print(res)
#          [,1]     [,2]     [,3] [,4]
# [1,]       NA       NA       NA   NA
# [2,] 2.992875       NA       NA   NA
# [3,] 4.293073 8.320698       NA   NA
# [4,] 7.944818 6.484424 16.44946   NA
#for nicer printing
as.dist(res)
#          1        2         3
# 2 2.992875
# 3 4.293073 8.320698
# 4 7.944818 6.484424 16.449463
And of course, for this specific function you would be better off using dist, which is optimized for this kind of distance calculation:
dist(t(m))^2
#          1        2         3
# 2 2.992875
# 3 4.293073 8.320698
# 4 7.944818 6.484424 16.449463
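If the columns are named, as in the question, combn can also run over the column names so that each result is labelled with its pair. A small sketch on a made-up 2x3 matrix (the data and the "a-b" style labels are purely illustrative):

```r
F2 <- function(x, y) sum((x - y)^2)

# Hypothetical toy data with named columns
m <- cbind(a = c(1, 2), b = c(3, 5), c = c(0, 0))

pairs <- combn(colnames(m), 2)    # 2 x 3 character matrix: one column per pair
res <- apply(pairs, 2, function(p) F2(m[, p[1]], m[, p[2]]))
names(res) <- apply(pairs, 2, paste, collapse = "-")
res
# a-b a-c b-c
#  13   5  34

# Same numbers as the squared Euclidean distances between columns
all.equal(unname(res), as.vector(dist(t(m))^2))
# [1] TRUE
```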

Questions about missing data

In a matrix, some missing data is recorded as NA.
How can I delete the rows that contain NA?
Can I use na.rm?
na.omit() will take matrices (and data frames) and return only those rows with no NA values whatsoever - it takes complete.cases() one step further by deleting the FALSE rows for you.
> x <- data.frame(c(1,2,3), c(4, NA, 6))
> x
  c.1..2..3. c.4..NA..6.
1          1           4
2          2          NA
3          3           6
> na.omit(x)
  c.1..2..3. c.4..NA..6.
1          1           4
3          3           6
I think na.rm usually only works within functions, say for the mean function. I would go with complete.cases: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.htm
Let's say you have the following 3x3 matrix:
x <- matrix(c(1:8, NA), 3, 3)
> x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6   NA
Then you can get the complete cases of this matrix with
y <- x[complete.cases(x),]
> y
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
The complete.cases function returns a logical vector that says whether or not each case (row) is complete:
> complete.cases(x)
[1] TRUE TRUE FALSE
You then use that vector to index the rows of matrix x, and add the "," to say that you want all columns.
If you want to remove rows that contain NA's you can use apply() to apply a quick function to check each row. E.g., if your matrix is x,
goodIdx <- apply(x, 1, function(r) !any(is.na(r)))
newX <- x[goodIdx,]
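The three approaches mentioned in these answers can be compared side by side on the same matrix. This is a short sketch; the drop = FALSE argument is an addition not used in the answers above, included because row indexing otherwise collapses a single remaining row into a plain vector.

```r
x <- matrix(c(1:8, NA), 3, 3)

by_cc    <- x[complete.cases(x), , drop = FALSE]                     # logical row index
by_apply <- x[apply(x, 1, function(r) !any(is.na(r))), , drop = FALSE]
by_omit  <- na.omit(x)                                               # also records dropped rows as an attribute

identical(by_cc, by_apply)
# [1] TRUE
all.equal(by_cc, by_omit, check.attributes = FALSE)
# [1] TRUE

# na.rm, by contrast, is an argument of summary functions such as mean()
mean(c(1, NA, 3), na.rm = TRUE)
# [1] 2

# Without drop = FALSE a single selected row is no longer a matrix
is.matrix(x[1, ])
# [1] FALSE
is.matrix(x[1, , drop = FALSE])
# [1] TRUE
```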
