Interpreting this error message in R - r

I have the following matrix
mat<-read.csv("mat.csv")
sel<-c(135, 211)
I would like to select the rows in 'mat' that correspond to 'sel'
I do it in the following way:
subset(mat, mat$V2==c(sel))
and I get the following warning:
Warning message:
In l[, 2] == c(135, 211) :
longer object length is not a multiple of shorter object length
And also it only selects one of the two.

Try this (credits go to Roland)
mat[mat$V2 %in% sel,]
X V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
11 11 1 135 2 7 100 2 0 0 0 0
15 15 1 211 5 7 100 2 0 0 0 0
From ?'%in%' you can read:
%in% is a more intuitive interface as a binary operator, which returns
a logical vector indicating if there is a match or not for its left operand.
A logical vector that marks the matches can be used directly for indexing. Here mat$V2 %in% sel returns TRUE for every element of mat$V2 that appears in sel. Used in mat[row, col] as mat[mat$V2 %in% sel,], it means: give all the columns for those rows that meet the condition mat$V2 %in% sel. By contrast, == compares element by element and recycles the shorter vector, which is why it produced the warning and matched only one of the two rows.
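To see the difference between == and %in% concretely, here is a small sketch. The data frame is a made-up stand-in for mat.csv (its values are assumptions, not the original file), and eq and inn are names introduced for illustration.

```r
# Hypothetical stand-in for the contents of mat.csv
mat <- data.frame(V1 = 1, V2 = c(135, 99, 211), V3 = 1:3)
sel <- c(135, 211)

# == compares element by element and recycles sel to 135, 211, 135,
# warning that the lengths do not divide evenly (suppressed here)
eq <- suppressWarnings(mat$V2 == sel)

# %in% asks, for each element of mat$V2, whether it occurs anywhere in sel
inn <- mat$V2 %in% sel

mat[eq, ]   # only the row with V2 == 135
mat[inn, ]  # both the 135 and the 211 rows
```

With == the 211 row is missed because it happens to be compared against the recycled 135.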

Related

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed to:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the differences. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
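For a wide data frame (the question mentions around 150 columns), the column names do not need to be typed out. A minimal sketch, assuming every column except id holds a cumulative value and the columns are already in order; cols is a name introduced here for illustration:

```r
x <- data.frame(id = c(1, 2), v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))

# All value columns, in their existing order
cols <- setdiff(names(x), "id")

# Subtract from each column (except the first) its left-hand neighbour;
# the right-hand side is evaluated before anything is overwritten
x[cols[-1]] <- x[cols[-1]] - x[cols[-length(cols)]]
x
```

The first value column is left as it is, which matches the desired output in the question.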

Output row index of first element of every column in a matrix to satisfy a logical condition

I have a large matrix (8,000 x 8,000) containing numerical data. I would like an output matrix with a single row containing the row index of the first element in each column to satisfy a logical operator. Note that not all the columns will have an element satisfying the condition.
Example input:
Column
Row 1 2 3 4
1 34.349 23.642 64.321 12.320
2 74.734 11.755 29.424 55.432
3 31.345 99.328 64.236 45.453
4 22.436 84.345 45.323 21.008
5 7.323 101.324 45.254 32.233
6 119.345 23.324 72.474 53.543
Logical operator: x > 70 gives an example output of:
Column
Row 1 2 3 4
1 2 3 6 NA
I'm new to R and struggled to get this output using the standard match and which functions.
Since it is a matrix, we can use apply with MARGIN = 2 (column-wise). Here we check whether the column has at least one value greater than 70 and return the index of the first such value, or else return NA.
apply(mat > 70, 2, function(x) if (any(x)) which.max(x) else NA)
#V1 V2 V3 V4
# 2 3 6 NA
Ideally apply(mat > 70, 2, which.max) would have given you what you need, but for a column with no element greater than 70 it returns 1 instead of NA; hence the check with if and any.
This also works with a data frame.
If a column has no elements greater than 70 but does contain NA values, any(x) is NA and the if statement raises an error.
mat[1, 4] <- NA
apply(mat > 70, 2, function(x) if (any(x)) which.max(x) else NA)
Error in if (any(x)) which.max(x) else NA :
missing value where TRUE/FALSE needed
In that case, we can use the na.rm argument of any to avoid the error.
apply(mat > 70, 2, function(x) if (any(x, na.rm = TRUE)) which.max(x) else NA)
#V1 V2 V3 V4
# 2 3 6 NA
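Another option, offered as a sketch rather than a replacement for the answer above, is which(x)[1]. which drops NA values, and indexing an empty result with [1] gives NA, so no explicit check is needed. The matrix is typed in from the example in the question; first_hit is a name introduced here.

```r
# Example data from the question, filled column by column
mat <- matrix(c(34.349, 74.734, 31.345, 22.436,   7.323, 119.345,
                23.642, 11.755, 99.328, 84.345, 101.324,  23.324,
                64.321, 29.424, 64.236, 45.323,  45.254,  72.474,
                12.320, 55.432, 45.453, 21.008,  32.233,  53.543),
              nrow = 6)
mat[1, 4] <- NA   # same NA as in the failing example above

# which() returns the indices of TRUE values and ignores NA;
# [1] picks the first index, or NA when there are none
first_hit <- apply(mat > 70, 2, function(x) which(x)[1])
first_hit   # expected: 2 3 6 NA
```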

taking the sum of a TRUE/FALSE vector in r

I am analyzing SNP data for a fungus, and I am trying to impute the missing data by changing the Ns to the genotype of the more frequent allele (see below).
newdata is a matrix of my SNPs (rows) and fungal isolates (columns). The genotypes for each SNP are in the 0, 1, and N format, and that is why I am trying to impute the missing genotypes.
newdata_imputed=newdata
for (k in 1:nrow(newdata)){
u=newdata[k,]
x<-sum(u==0)
y<-sum(u==1)
all_freq=y/(x+y)
if (all_freq<0.5){
newdata_imputed[k,]=gsub("N",0,u)
} else{newdata_imputed[k,]=gsub("N",1,u)}
print(k)
}
However, I keep getting this error:
[1] 295
[1] 296
Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
It is obvious that the code runs but stops after encountering a problem. Please, can someone tell me what I am doing wrong? I am a newbie to R, and any advice would be greatly appreciated.
@akrun, the reason why I used a for loop is that it is nested in another for loop. So, after using your code:
newdata=as.data.frame(newdata)
u=newdata
all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
indx <- all_freq < 0.5
indx1 <- indx & !is.na(indx)
indx2 <- !indx & !is.na(indx)
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
newdata[] <- lapply(newdata, as.numeric)
I got weird values
newdata[1:10,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 1 1 1 1 1 1 1 1 1 1
Where is the "3" coming from? I should only have 0 or 1.
We could do this using rowSums. As @bergant and @MatthewLundberg mentioned in the comments, if there are rows with no 0 or 1 elements, the calculation gives NaN (0/0). One way to handle this is to extend the logical condition with !is.na, so that only elements that are not NA are selected.
#using `rowSums` to create the all_freq vector
all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
#Create a logical index based on elements that are less than 0.5
indx <- all_freq < 0.5
#The NA elements can be changed to FALSE by adding another condition
indx1 <- indx & !is.na(indx)
#similarly for elements that are >= 0.5
indx2 <- !indx & !is.na(indx)
Now, we subset the rows of the 'newdata' with 'indx1', loop through the columns (lapply) and use gsub with pattern and replacement arguments and assign the output back to the subset of 'newdata'.
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is greater than or equal to 0.5
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
The gsub output columns are character class, which can be converted back to numeric (if needed).
newdata[] <- lapply(newdata, as.numeric)
data
set.seed(24)
newdata <- as.data.frame(matrix(sample(c(0:1, "N"), 10*4, replace=TRUE),
ncol=4), stringsAsFactors=FALSE)
newdata[7,] <- 2
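A likely source of the unexpected 3 in the question, assuming the columns of newdata were factors (the default for as.data.frame on character data before R 4.0.0): as.numeric on a factor returns the internal level codes, not the labels. A small sketch with a made-up factor f:

```r
# A factor with the three genotype labels as levels
f <- factor(c("0", "1", "N"), levels = c("0", "1", "N"))

# as.numeric() on a factor gives the level codes, so "N" shows up as 3
codes <- as.numeric(f)                                      # 1 2 3

# Going through character first recovers the labels; "N" becomes NA
# (the coercion warning is suppressed here)
labels <- suppressWarnings(as.numeric(as.character(f)))     # 0 1 NA
```

The example data above sets stringsAsFactors = FALSE, which avoids this.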

R independent columns in matrix

I am trying to find independent columns to solve the system of linear equations. Here my simplified example:
> mat = matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1), nrow=4, ncol=6, dimnames=list(c("A", "B", "C", "D"), paste("v", 1:6, sep="")))
> mat
v1 v2 v3 v4 v5 v6
A 1 -1 0 0 -1 0
B 0 1 -1 0 0 0
C 0 0 1 -1 0 1
D 0 0 0 0 1 -1
The matrix is full rank:
qr(mat)$rank
gives me 4, and since there are 6 columns, there should be 6-4=2 independent columns from which I can calculate the others.
I know that columns v4 and v6 are independent... My first question is, how can I find these columns (maybe with qr(mat)$pivot)?
By rearranging the linear equations on paper, I see that
[v1, v2, v3, v4, v5, v6] = [v4, v4-v6, v4-v6, v4, v6, v6]
and thus I can find from arbitrary values for v4 and v6 a vector that lies in the null space by multiplying v4 and v6 with the vectors below:
v4 * [1,1,1,1,0,0] + v6 * [0,-1,-1,0,1,1]
My second question is: How do I find these vectors, meaning how do I solve the matrix for v4 and v6?
For example
qr.solve(mat, cbind(c(0,0,0,0), c(0,0,0,0)))
gives me two vectors of length 6 with only zeros.
Any help is appreciated, many thanks in advance!
-H-
Use the pivot information to find a set of independent columns:
q <- qr(mat)
mmat <- mat[,q$pivot[seq(q$rank)]]
mmat
## v1 v2 v3 v5
## A 1 -1 0 -1
## B 0 1 -1 0
## C 0 0 1 0
## D 0 0 0 1
qr(mmat)$rank
## [1] 4
Why does this work? The meaning of pivot is given in QR.Auxiliaries {base} brought up with ?qr.Q. In particular:
qr.R returns R. This may be pivoted, e.g., if a <- qr(x) then x[, a$pivot] = QR.
The number of rows of R is either nrow(X) or ncol(X) (and may depend on whether
complete is TRUE or FALSE).
Pivoting is done for numerical stability: R's default LINPACK routine moves columns that are numerically dependent on earlier columns to the right-hand end. This means that any such columns sit beyond position q$rank in q$pivot, so the first q$rank entries index a set of independent columns (in the current example, Q is a 4x4 orthogonal matrix).
The final lines in the QR.Auxiliaries {base} show this relationship:
pivI <- sort.list(a$pivot) # the inverse permutation
stopifnot(
all.equal(x[, a$pivot], qr.Q(a) %*% qr.R(a)), # TRUE
all.equal(x , qr.Q(a) %*% qr.R(a)[, pivI])) # TRUE too!
If you start with v4 and v6, then you need two more columns with non-zero values in rows 1 and 2, so you need to pick v1 and either v2 or v3. The following are all basis choices with maximal rank:
> qr(mat[, c(1,2,4,6)])$rank
[1] 4
> qr(mat[, c(1,2,3,5)])$rank
[1] 4
> qr(mat[, c(1,3,4,6)])$rank
[1] 4
It is simply not the case that "independent columns" are uniquely determined. There may be sets of columns that are necessarily dependent, e.g., ones which are scalar multiples of each other, but that is not the case here.
On the other hand this will be rank deficient:
> qr(mat[, c(1,2,3,4)])$rank
[1] 3
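The second part of the question (finding the null-space vectors for v4 and v6) can be sketched with solve. This assumes the remaining columns v1, v2, v3, v5 form an invertible 4x4 block, which the pivot result above supports; free_cols, basis_cols, coefs and N are names introduced here for illustration.

```r
mat <- matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1),
              nrow = 4, ncol = 6,
              dimnames = list(c("A", "B", "C", "D"), paste0("v", 1:6)))

free_cols  <- c("v4", "v6")                       # chosen freely
basis_cols <- setdiff(colnames(mat), free_cols)   # v1, v2, v3, v5

# For each free column f, solve mat[, basis_cols] %*% b = -mat[, f]
coefs <- solve(mat[, basis_cols], -mat[, free_cols])

# Assemble the null-space vectors: solved entries in the basis positions,
# 1 in the vector's own free position, 0 elsewhere
N <- matrix(0, ncol(mat), length(free_cols),
            dimnames = list(colnames(mat), free_cols))
N[basis_cols, ] <- coefs
for (f in free_cols) N[f, f] <- 1

N
mat %*% N   # all zeros, up to rounding
```

The two columns of N should match the vectors [1,1,1,1,0,0] and [0,-1,-1,0,1,1] from the question, so any v4 * N[, "v4"] + v6 * N[, "v6"] lies in the null space.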

Loop over rows of dataframe applying function with if-statement

I'm new to R and I'm trying to sum 2 columns of a given dataframe, if both the elements to be summed satisfy a given condition. To make things clear, what I want to do is:
> t.d<-as.data.frame(matrix(1:9,ncol=3))
> t.d
V1 V2 V3
1 4 7
2 5 8
3 6 9
> t.d$V4<-rep(0,nrow(t.d))
> for (i in 1:nrow(t.d)){
+ if (t.d$V1[i]>1 && t.d$V3[i]<9){
+ t.d$V4[i]<-t.d$V1[i]+t.d$V3[i]}
+ }
> t.d
V1 V2 V3 V4
1 4 7 0
2 5 8 10
3 6 9 0
I need efficient code, as my real dataframe has about 150000 rows and 200 columns. This gives an error:
t.d$V4<-t.d$V1[t.d$V1>1]+ t.d$V3[t.d$V3>9]
Is "apply" an option? I tried this:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d$V4<-rep(0,nrow(t.d))
my.fun<-function(x,y){
if(x>1 && y<9){
x+y}
}
t.d$V4<-apply(X=t.d,MAR=1,FUN=my.fun,x=t.d$V1,y=t.d$V3)
but it gives an error as well.
Thanks very much for your help.
This operation doesn't require loops, apply statements or if statements. Vectorised operations and subsetting are all you need:
t.d <- within(t.d, V4 <- V1 + V3)
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0
t.d
V1 V2 V3 V4
1 1 4 7 0
2 2 5 8 10
3 3 6 9 0
Why does this work?
In the first step I create a new column that is the straight sum of columns V1 and V3. I use within as a convenient way of referring to the columns of t.d without having to write t.d$ all the time.
In the second step I subset all of the rows that don't fulfill your conditions and set V4 for these to 0.
ifelse is your friend here:
t.d$V4<-ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0)
I'll chip in and provide yet another version. Since you want zero if the condition doesn't match, and TRUE/FALSE are glorified versions of 1/0, simply multiplying by the condition also works:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d <- within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9))
...and it happens to be faster than the other solutions ;-)
t.d <- data.frame(V1=runif(2e7, 1, 2), V2=1:2e7, V3=runif(2e7, 5, 10))
system.time( within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9)) ) # 3.06 seconds
system.time( ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0) ) # 5.08 seconds
system.time( { t.d <- within(t.d, V4 <- V1 + V3);
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0 } ) # 4.50 seconds
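As a quick sanity check on the small example from the question, the three approaches can be compared directly. This is only a sketch; cond, a, b and d are names introduced here for illustration.

```r
t.d <- as.data.frame(matrix(1:9, ncol = 3))

# Elementwise &, not &&: the condition is evaluated for every row at once
cond <- t.d$V1 > 1 & t.d$V3 < 9

a <- ifelse(cond, t.d$V1 + t.d$V3, 0)   # ifelse version
b <- (t.d$V1 + t.d$V3) * cond           # multiply by the condition
d <- t.d$V1 + t.d$V3                    # sum first, then zero out the rest
d[!cond] <- 0

# All three give the same V4 column
stopifnot(all(a == b), all(b == d))
a   # expected: 0 10 0
```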

Resources