Calculating weight ratios in the presence of empty cells - r

I have a sample which needs to be weighted in order to represent the population.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
V1 V2 V3 V4
1: 1 0 2 2
2: 3 4 3 0
population <- fread("
10,20,20,10
30,40,20,10
")
This weight would simply be:
weights <- population/sample
V1 V2 V3 V4
1: 10 Inf 10.000000 5
2: 10 10 6.666667 Inf
However, because V2 in row 1 of the sample has no observations, it receives an infinite weight. (Note that V4 in row 2 also receives an Inf, but this is easier to solve, because the weight is irrelevant, as there are no observations in either the sample or the population.)
A solution to the problem would be to count V1 and V2 together in the sample and the population.
EDIT:
After some thought I realised that, for the weights to be correct, only the population values have to be adapted. If V1 and V2 in row 1 of population are added together in V1 of population, this will already lead to the correct weight for the sample observation of V1 in row 1. The value of V2 becomes irrelevant because there is no observation in the sample to receive that weight.
End of EDIT
The observation would then get a weight of:
(population[1,1]+population[1,2])/(sample[1,1]+sample[1,2])
(10+20)/(1+0)=30
In my actual data, however, there are many more rows, with a 0 here and there in the sample. I am trying to figure out if there is a way to write my code so that I do not have to do this manually.
Desired outcome (notice that the weight of V1 row 1 is now 30):
weights
V1 V2 V3 V4
1: 30 0 10.000000 5
2: 10 10 6.666667 0
Attempt
I was thinking of doing something like:
for (i in seq_along(ncol(sample))) {
lapply(population, (ifelse(sample[i]==0), population[i]<-population[i+1], population[i])
}
The idea is that the population value of the cell to the right is added when the value in the sample is zero. However, I am having trouble getting the syntax right, and even if it worked, it would not solve the case where V4 is 0.

Here is a rather verbose solution. If more columns had to be aggregated when the sample contains zeros, a more flexible approach would be needed, but this seems sufficient for your example.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
population <- fread("
10,20,20,10
30,40,20,10
")
# aggregate Values if sample is zero
population[sample$V1 == 0, `:=`(V1 = 0,
V2 = V1 + V2)]
population[sample$V2 == 0, `:=`(V1 = V1 + V2,
V2 = 0)]
weights <- population/sample
# Fix NaNs
weights[is.na(weights), ] <- 0
weights
#> V1 V2 V3 V4
#> 1: 30 0 10.000000 5
#> 2: 10 10 6.666667 Inf
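The desired outcome in the question has 0 rather than Inf for V4 in row 2, whereas the output above still shows Inf. Below is a minimal sketch of one way to get there. It assumes plain base R matrices instead of data.table, assumes that only V1 and V2 are ever pooled (as in the answer), and uses the made-up names smp, pop and w purely for illustration:

```r
# Counts from the question, stored as plain matrices (base R only).
smp <- matrix(c(1, 0, 2, 2,
                3, 4, 3, 0), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, paste0("V", 1:4)))
pop <- matrix(c(10, 20, 20, 10,
                30, 40, 20, 10), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, paste0("V", 1:4)))

# Pool V1 and V2 of the population in rows where exactly one of them
# is empty in the sample; the pooled total goes to the non-empty cell.
pooled   <- pop[, "V1"] + pop[, "V2"]
v1_empty <- smp[, "V1"] == 0
v2_empty <- smp[, "V2"] == 0
pop[v2_empty & !v1_empty, "V1"] <- pooled[v2_empty & !v1_empty]
pop[v1_empty & !v2_empty, "V2"] <- pooled[v1_empty & !v2_empty]

# Divide, then set every non-finite weight (NaN, Inf, -Inf) to 0,
# since no sample observation can receive it.
w <- pop / smp
w[!is.finite(w)] <- 0
w
```

Using !is.finite() instead of is.na() is the design choice here: is.na() catches the NaN from 0/0 but not the Inf from dividing a positive population count by an empty sample cell.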

Related

Subtract a row of values across many rows

I have a dataset that simplifies to something like this, let's call that dataset B
V1 V2 V3 V4
sample1 1 2 3
sample2 4 5 6
sample3 7 8 9
then I have another separate row (on its own) called blank,
it would look something like this.
V1 V2 V3 V4
blank 0.5 1.0 1.5
I would like to subtract blank from all the rows of B.
So far I've tried:
B[,2:ncol(B)] <- lapply(B[,2:ncol(B)], function(x) x - blank[,2:ncol(blank)])
B[,2:ncol(B)] <- sweep(B[,2:ncol(B)], 1, blank[,2:ncol(blank)])
B[,2:ncol(B)] <- B[,2:ncol(B)] - blank[,2:ncol(blank)]
B[,2:ncol(B)] <- for(i in 1:nrow(B)){B[ i ,2:ncol(B)] - blank[,2:ncol(B)]}
None of these work. The first one tells me that "replacement element 1 is a matrix/data of 1 row, need 3". The second one tells me "STATS is longer than the extent of 'dim(x)[MARGIN]'", and changing the margin to 2 does not solve the problem. The third one says "‘-’ only defined for equally-sized data frames". The fourth one returns a blank matrix.
I've looked through the forum to the best of my ability, but they are limited to applying only one value across the entire dataset, I would like to subtract a whole row of values across the rest of the rows in a dataset.
The end result should look like this (no rounding required).
V1 V2 V3 V4
sample1 0.5 1.0 1.5
sample2 3.5 4.0 4.5
sample3 6.5 7.0 7.5
You can subtract the single row from all rows of the other data frame by repeating it as many times as there are rows in that data frame and then subtracting the two, as below.
df1 <- t(data.frame(c(1,2,3), c(4,5,6), c(7,8,9)))
df2 <- data.frame(.5, 1, 1.5)
df1[,] - df2[rep(1, nrow(df1)),] # repeat the single row of df2 once per row of df1
We can use sweep :
B[-1] <- sweep(B[-1], 2, unlist(blank[-1]), `-`)
B
# V1 V2 V3 V4
#1 sample1 0.5 1 1.5
#2 sample2 3.5 4 4.5
#3 sample3 6.5 7 7.5
Or using transpose
B[-1] <- t(t(B[-1]) - unlist(blank[-1]))
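For reference, here is a self-contained sketch that rebuilds the data from the question and runs both variants side by side. The objects by_sweep and by_transpose are names invented for this illustration. The key points are that MARGIN = 2 matches the statistics to columns, and that unlist() turns the one-row data frame into a plain numeric vector:

```r
# Rebuild the example data from the question.
B <- data.frame(V1 = c("sample1", "sample2", "sample3"),
                V2 = c(1, 4, 7), V3 = c(2, 5, 8), V4 = c(3, 6, 9))
blank <- data.frame(V1 = "blank", V2 = 0.5, V3 = 1.0, V4 = 1.5)

# One blank value per numeric column, as a plain vector.
blank_values <- unlist(blank[-1])

# Variant 1: sweep the vector out across columns (MARGIN = 2).
by_sweep <- sweep(B[-1], 2, blank_values, `-`)

# Variant 2: transpose so that recycling runs along the original columns,
# subtract, and transpose back.
by_transpose <- t(t(B[-1]) - blank_values)

by_sweep
```

Both variants leave the label column V1 untouched because it is excluded with [-1] before the arithmetic.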

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the resulting values by taking the difference. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1), v3 = c(9,4))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
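With around 150 columns, the same result can also be obtained without apply or a loop by subtracting two shifted copies of the value columns. This is only a sketch, assuming that all columns except id are numeric cumulative sums; vals is a helper name introduced here:

```r
# Example data from the question.
x <- data.frame(id = c(1, 2), v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))

# Work on the numeric part as a matrix.
vals <- as.matrix(x[, -1])

# Keep the first column as is; for the rest, subtract the column to the left:
# vals[, -1] drops the first column, vals[, -ncol(vals)] drops the last one.
x[, -1] <- cbind(vals[, 1], vals[, -1] - vals[, -ncol(vals)])
x
```

The subtraction is done once on whole matrices, so the cost does not grow with a per-row or per-column function call.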

-Inf content in a Position Weight Matrix

What does it mean when I have -Inf content in some positions of a Position Weight Matrix?
I am using the seqLogo package. For plotting the seqLogo:
library(seqLogo)
seqLogo(weight_matrix, ic.scale=TRUE, xaxis=TRUE, yaxis=TRUE, xfontsize=15, yfontsize=15)
and I have:
Error in seqLogo(weight_matrix, ic.scale = TRUE, xaxis = TRUE, yaxis =
TRUE, : Columns of PWM must add up to 1.0
The error states the problem: every column must sum to 1, because each column is a set of probabilities that must total 1. See the example:
Below works fine, using example m matrix from seqLogo package:
library(seqLogo)
# get example matrix
mFile <- system.file("Exfiles/pwm1", package="seqLogo")
m <- read.table(mFile)
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 1 1 1 1 1 1 1
# plot, all great!
seqLogo(m)
Now, let's change one of the values, so that the column sum is more than 1. This will give us an error.
m[1, 1] <- 1
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 2 1 1 1 1 1 1 1
seqLogo(m)
# Error in seqLogo(m) : Columns of PWM must add up to 1.0
Another reason could be that the matrix values are already log-transformed (a -Inf entry is then the log of a zero probability). If they are, convert them back to probabilities using:
plotMatrix <- 2 ^ weight_matrix * 0.25
then plot:
seqLogo(plotMatrix)
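To see where -Inf comes from, here is a small base R sketch that does not need seqLogo. It assumes the weights are log2 odds against a uniform background of 0.25, which is what the conversion above implies; the matrix prob and the names log_odds and back are made up for illustration:

```r
# A toy probability matrix: rows are A, C, G, T; each column sums to 1.
prob <- matrix(c(0.0, 0.8, 0.2, 0.0,
                 0.3, 0.3, 0.4, 0.0), nrow = 4,
               dimnames = list(c("A", "C", "G", "T"), NULL))

# Log2 odds against a uniform background of 0.25.
# A probability of 0 becomes log2(0) = -Inf.
log_odds <- log2(prob / 0.25)

# Converting back: 2^(-Inf) is 0, so the zeros are recovered.
back <- 2^log_odds * 0.25

colSums(back)
```

A -Inf entry therefore just marks a base that never occurs at that position. If the column sums still differ from 1 after converting back, the original weights were probably computed with a different background or with pseudocounts.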

R independent columns in matrix

I am trying to find independent columns to solve the system of linear equations. Here my simplified example:
> mat = matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1), nrow=4, ncol=6, dimnames=list(c("A", "B", "C", "D"), paste("v", 1:6, sep="")))
> mat
v1 v2 v3 v4 v5 v6
A 1 -1 0 0 -1 0
B 0 1 -1 0 0 0
C 0 0 1 -1 0 1
D 0 0 0 0 1 -1
The matrix is full rank:
qr(mat)$rank
gives me 4, and since there are 6 columns, there should be 6-4=2 independent columns from which I can calculate the others.
I know that columns v4 and v6 are independent... My first question is, how can I find these columns (maybe with qr(mat)$pivot)?
By rearranging the linear equations on paper, I see that
[v1, v2, v3, v4, v5, v6] = [v4, v4-v6, v4-v6, v4, v6, v6]
and thus I can find from arbitrary values for v4 and v6 a vector that lies in the null space by multiplying v4 and v6 with the vectors below:
v4 * [1,1,1,1,0,0] + v6 * [0,-1,-1,0,1,1]
My second question is: How do I find these vectors, meaning how do I solve the matrix for v4 and v6?
For example
qr.solve(mat, cbind(c(0,0,0,0), c(0,0,0,0)))
gives me two vectors of length 6 with only zeros.
Any help is appreciated, many thanks in advance!
-H-
Use the pivot information to find a set of independent columns:
q <- qr(mat)
mmat <- mat[,q$pivot[seq(q$rank)]]
mmat
## v1 v2 v3 v5
## A 1 -1 0 -1
## B 0 1 -1 0
## C 0 0 1 0
## D 0 0 0 1
qr(mmat)$rank
## [1] 4
Why does this work? The meaning of pivot is given in QR.Auxiliaries {base} brought up with ?qr.Q. In particular:
qr.R returns R. This may be pivoted, e.g., if a <- qr(x) then x[, a$pivot] = QR.
The number of rows of R is either nrow(X) or ncol(X) (and may depend on whether
complete is TRUE or FALSE).
With R's default LINPACK routine, pivoting is limited: columns found to be (numerically) linearly dependent on the columns before them are moved to the end of q$pivot, beyond position q$rank. The first q$rank entries of q$pivot therefore index a set of linearly independent columns, here v1, v2, v3 and v5, with the remaining columns v4 and v6 placed last.
The final lines in the QR.Auxiliaries {base} show this relationship:
pivI <- sort.list(a$pivot) # the inverse permutation
stopifnot(
all.equal(x[, a$pivot], qr.Q(a) %*% qr.R(a)), # TRUE
all.equal(x , qr.Q(a) %*% qr.R(a)[, pivI])) # TRUE too!
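The pivot can also be used for the second part of the question, finding vectors that span the null space. The following is a sketch under the assumption that the columns after position q$rank in q$pivot are treated as free variables; basic, free and N are names introduced here, not part of any API:

```r
mat <- matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1),
              nrow = 4, ncol = 6,
              dimnames = list(c("A", "B", "C", "D"), paste0("v", 1:6)))

q     <- qr(mat)
basic <- q$pivot[seq_len(q$rank)]    # independent columns
free  <- q$pivot[-seq_len(q$rank)]   # columns treated as free variables

# One null-space vector per free variable: set that free variable to 1,
# the other free variables to 0, and solve for the basic variables.
N <- matrix(0, nrow = ncol(mat), ncol = length(free),
            dimnames = list(colnames(mat), colnames(mat)[free]))
N[free, ]  <- diag(length(free))
N[basic, ] <- -solve(mat[, basic, drop = FALSE], mat[, free, drop = FALSE])

N
max(abs(mat %*% N))   # should be (numerically) zero
```

If the free columns are v4 and v6, as the pivot above suggests, the two columns of N should correspond to the vectors [1,1,1,1,0,0] and [0,-1,-1,0,1,1] from the question. Any vector in the null space is then N %*% c(a, b) for arbitrary a and b.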
If you start with v4 and v6 then you need 2 more columns with non-zero values in rows 1 and 2, so you need to pick v1 and either v2 or v3. These are all possible basis choices that will have maximal rank.
> qr(mat[, c(1,2,4,6)])$rank
[1] 4
> qr(mat[, c(1,2,3,5)])$rank
[1] 4
> qr(mat[, c(1,3,4,6)])$rank
[1] 4
It is simply not the case that "independent columns" are uniquely determined. There may be sets of columns that are necessarily dependent, e.g., ones which are scalar multiples of each other, but that is not the case here.
On the other hand this will be rank deficient:
> qr(mat[, c(1,2,3,4)])$rank
[1] 3

Loop over rows of dataframe applying function with if-statement

I'm new to R and I'm trying to sum 2 columns of a given dataframe, if both the elements to be summed satisfy a given condition. To make things clear, what I want to do is:
> t.d<-as.data.frame(matrix(1:9,ncol=3))
> t.d
V1 V2 V3
1 4 7
2 5 8
3 6 9
> t.d$V4<-rep(0,nrow(t.d))
> for (i in 1:nrow(t.d)){
+ if (t.d$V1[i]>1 && t.d$V3[i]<9){
+ t.d$V4[i]<-t.d$V1[i]+t.d$V3[i]}
+ }
> t.d
V1 V2 V3 V4
1 4 7 0
2 5 8 10
3 6 9 0
I need efficient code, as my real dataframe has about 150000 rows and 200 columns. This gives an error:
t.d$V4<-t.d$V1[t.d$V1>1]+ t.d$V3[t.d$V3>9]
Is "apply" an option? I tried this:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d$V4<-rep(0,nrow(t.d))
my.fun<-function(x,y){
if(x>1 && y<9){
x+y}
}
t.d$V4<-apply(X=t.d,MAR=1,FUN=my.fun,x=t.d$V1,y=t.d$V3)
but it gives an error as well.
Thanks very much for your help.
This operation doesn't require loops, apply statements or if statements. Vectorised operations and subsetting are all you need:
t.d <- within(t.d, V4 <- V1 + V3)
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0
t.d
V1 V2 V3 V4
1 1 4 7 0
2 2 5 8 10
3 3 6 9 0
Why does this work?
In the first step I create a new column that is the straight sum of columns V1 and V3. I use within as a convenient way of referring to the columns of t.d without having to write t.d$ all the time.
In the second step I subset all of the rows that don't fulfill your conditions and set V4 for these to 0.
ifelse is your friend here:
t.d$V4<-ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0)
I'll chip in and provide yet another version. Since you want zero if the condition doesn't match, and TRUE/FALSE are glorified versions of 1/0, simply multiplying by the condition also works:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d <- within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9))
...and it happens to be faster than the other solutions ;-)
t.d <- data.frame(V1=runif(2e7, 1, 2), V2=1:2e7, V3=runif(2e7, 5, 10))
system.time( within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9)) ) # 3.06 seconds
system.time( ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0) ) # 5.08 seconds
system.time( { t.d <- within(t.d, V4 <- V1 + V3);
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0 } ) # 4.50 seconds
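As a quick sanity check, the three approaches can be run on the small example and compared. This sketch uses the helper names keep, by_subset, by_ifelse and by_product, which are made up here. All three rely on the elementwise operator & rather than &&, which is meant for single values only:

```r
t.d <- as.data.frame(matrix(1:9, ncol = 3))

# Elementwise condition: one TRUE/FALSE per row.
keep <- t.d$V1 > 1 & t.d$V3 < 9

# 1. Sum everything, then zero out the rows that fail the condition.
by_subset <- t.d$V1 + t.d$V3
by_subset[!keep] <- 0

# 2. ifelse picks the sum or 0 row by row.
by_ifelse <- ifelse(keep, t.d$V1 + t.d$V3, 0)

# 3. Multiplying by the logical vector treats TRUE as 1 and FALSE as 0.
by_product <- (t.d$V1 + t.d$V3) * keep

cbind(by_subset, by_ifelse, by_product)
```

One caveat with the multiplication variant: if the sum is NA or infinite in a row where the condition is FALSE, the product is NA or NaN rather than 0, so the other two variants are safer for data with missing values.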
