Sum if greater than in R

I have a dataframe (obs) with 145 rows and more than 1000 columns plus a numeric vector with 145 values (thr).
I would like to derive another vector (sumifs) with 145 elements where each element is the sum of the values of obs[n,] >= thr[n].
I thought I could run a for loop where a single row sum is calculated more or less like:
sumifs[n] <- if(obs[n,]>=thr[n],sum(obs[n,]))
but I didn't manage to make it work for the single row either.
I looked at other questions where aggregate or the plyr package was suggested, but I didn't find anything that fits.
A simplified example with only 15 rows and 3 columns follows:
r1 <- rep(1:5,3)
r2 <- rep(3:7,3)
r3 <- rep(2:6,3)
obs <- data.frame(r1,r2,r3)
thr <- c(2,2,3,3,4,4,5,5,2,2,3,3,4,4,5)
obs
r1 r2 r3
1 1 3 2
2 2 4 3
3 3 5 4
4 4 6 5
5 5 7 6
6 1 3 2
7 2 4 3
8 3 5 4
9 4 6 5
10 5 7 6
11 1 3 2
12 2 4 3
13 3 5 4
14 4 6 5
15 5 7 6
therefore, sumifs should be:
sumifs
5
9
12
15
18
0
0
0
15
18
3
7
9
15
18

#your data
DF <- as.data.frame(matrix(1:6, ncol = 2))
#turn into matrix
m <- as.matrix(DF)
#your threshold
thr <- c(3, 1, 7)
#compare
m >= thr
# V1 V2
#[1,] FALSE TRUE
#[2,] TRUE TRUE
#[3,] FALSE FALSE
#logical values get turned into 0/1 during arithmetic
#thus we can just multiply the matrix with the comparison
m * (m >= thr)
# V1 V2
#[1,] 0 4
#[2,] 2 5
#[3,] 0 0
#and calculate the row sums
rowSums(m * (m >= thr))
#[1] 4 7 0
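Applied to the example from the question, the same idea might look like the sketch below. It assumes the columns are named r1, r2 and r3 as in the printed data, and it adds an explicit loop (not part of the answer above) only to cross-check the vectorised result. The comparison lines up thr[n] with row n because R recycles the vector down the columns, so thr must have exactly one element per row.

```r
# Data from the question (column names r1..r3 as in the printed output)
obs <- data.frame(r1 = rep(1:5, 3), r2 = rep(3:7, 3), r3 = rep(2:6, 3))
thr <- c(2, 2, 3, 3, 4, 4, 5, 5, 2, 2, 3, 3, 4, 4, 5)

m <- as.matrix(obs)

# Recycling runs down the columns, so thr[n] meets row n only
# when thr has one element per row
stopifnot(length(thr) == nrow(m))

# Zero out values below the row threshold, then sum each row
sumifs <- rowSums(m * (m >= thr))

# Equivalent explicit loop, closer to the attempt in the question
sumifs_loop <- numeric(nrow(m))
for (n in seq_len(nrow(m))) {
  row_values <- m[n, ]
  sumifs_loop[n] <- sum(row_values[row_values >= thr[n]])
}

print(sumifs)
```

Note that with >=, row 8 (3, 5, 4 against a threshold of 5) sums to 5 because 5 >= 5 holds, whereas the expected output listed in the question shows 0 for that row.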

Related

Extract positions in a data frame based on a vector

In a dataset I want to know where the missing values are, so I use which(is.na(df)). I then impute the dataset, and afterwards I want to extract the values at the imputed positions, but I don't know how to extract them. Does anyone have suggestions? Thanks!
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,67,8,9,0,6,7,9)
B <- c(5,6,31,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,89,3,2,9,NA,12,69,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 31 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 69
10 b 9 4 6 8
pos_na <- which(is.na(df))
pos_na
[1] 13 27 34 35 47
# after imputation
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,4,67,8,9,0,6,7,9)
B <- c(5,6,31,9,8,1,65,9,7,4)
C <- c(2,3,5,8,2,2,7,6,4,6)
D <- c(6,5,89,3,2,9,6,12,69,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 4 31 5 89
4 a 67 9 8 3
5 a 8 8 2 2
6 b 9 1 2 9
7 b 0 65 7 6
8 b 6 9 6 12
9 b 7 7 4 69
10 b 9 4 6 8
Wanted output: 4, 65, 8, 2, 6
To store positions of NA use which with arr.ind = TRUE which gives row and column numbers.
pos_na <- which(is.na(df), arr.ind = TRUE)
pos_na
# row col
#[1,] 3 2
#[2,] 7 3
#[3,] 4 4
#[4,] 5 4
#[5,] 7 5
So that after imputation you can extract the values directly.
as.numeric(df[pos_na])
[1] 4 65 8 2 6
Instead of wrapping with which, we can keep it as a logical matrix
i1 <- is.na(df[-1])
Then, after the imputation, just use the i1
df[-1][i1]
#[1] 4 65 8 2 6
Note: the -1 column index drops the first column (id), which is a factor rather than numeric.
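Put together, the workflow could be sketched as follows. The small data frame and the column-mean imputation are hypothetical stand-ins chosen for illustration; any imputation method that keeps the row and column layout should work the same way.

```r
# Hypothetical toy data with one non-numeric id column
df <- data.frame(id = factor(c("a", "a", "b")),
                 A = c(1, NA, 3),
                 B = c(NA, 5, 6))

# Record where the NAs are BEFORE imputing (numeric columns only)
na_mask <- is.na(df[-1])

# Stand-in imputation: replace each NA with its column mean
imputed <- df
for (col in names(imputed)[-1]) {
  x <- imputed[[col]]
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  imputed[[col]] <- x
}

# Pull out exactly the values that were filled in (column-major order)
imputed_values <- imputed[-1][na_mask]
print(imputed_values)
```

The mask has to be computed on the original data, since after imputation is.na() no longer finds anything.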

Compare 2 values of the same row of a matrix with the row and column index of another matrix in R

I have a matrix1 with 11217 rows and 2 columns, and a second matrix2 with 10 rows and 10 columns. I want to compare the two values in each row of matrix1 with the row and column indices of matrix2; when they match, the value at that index of matrix2 (currently 0) should be increased by 1.
c1 <- x[2:11218] #these values go from 1 to 10
#second column from index 3 to N
c2 <- x[3:11219] #these values also go from 1 to 10
#matrix with column c1 and c2
m1 <- as.matrix(cbind(c1 = c1, c2 = c2))
#empty matrix which will count the frequencies
m2 <- matrix(0, nrow = 10, ncol = 10)
#change row and column names of m2 to the numbers of 1 to 10
dimnames(m2) <-list(c(1:10), c(1:10))
#go through every row of the matrix m1 and look which rotation appears, add 1 to m2 if the rotation
#equals the corresponding index
r <- c(1:10)
c <- c(1:10)
for (i in 1:nrow(m1)) {
if(m1[i,1] == r & m1[i,2] == c)
m2[r,c]+1
}
No frequencies were calculated, and I don't understand why.
It appears that you are trying to replicate the behavior of table. I'd recommend just using it instead.
Simpler data (it appears you did not include variable x):
m1 <-
matrix(round(runif(20, 1,10))
, ncol = 2)
Then, use table. Here, I am setting the values of each column to be a factor to ensure that the right columns are generated:
table(factor(m1[,1], 1:10)
, factor(m1[,2], 1:10))
gives:
1 2 3 4 5 6 7 8 9 10
1 3 4 0 4 2 0 5 3 2 0
2 3 7 9 7 4 5 3 4 5 2
3 4 6 3 10 8 9 4 2 7 3
4 5 2 14 3 7 13 8 11 3 3
5 2 13 2 5 8 5 7 7 8 6
6 1 10 7 4 5 6 8 5 8 5
7 3 3 6 5 4 5 4 8 7 7
8 5 5 8 7 6 10 5 4 3 4
9 2 5 8 4 7 4 4 6 4 2
10 3 1 2 3 3 5 3 5 1 0
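As for why the loop in the question counts nothing: m2[r,c]+1 computes a value but never assigns it back, and the if condition compares a single number against the whole vectors r and c. A minimal sketch of a corrected loop, checked against table, is shown below; the small m1 is made-up data, since x was not included in the question.

```r
# Made-up pairs standing in for the question's m1 (values between 1 and 10)
m1 <- cbind(c1 = c(1, 2, 2, 10, 2),
            c2 = c(3, 4, 4, 10, 4))

# Empty 10 x 10 count matrix
m2 <- matrix(0, nrow = 10, ncol = 10,
             dimnames = list(as.character(1:10), as.character(1:10)))

# Use each row of m1 directly as a (row, column) index and assign the result
for (i in seq_len(nrow(m1))) {
  row_idx <- m1[i, 1]
  col_idx <- m1[i, 2]
  m2[row_idx, col_idx] <- m2[row_idx, col_idx] + 1
}

# The same counts via table(); the factor levels force all 10 rows and columns
tab <- table(factor(m1[, 1], levels = 1:10),
             factor(m1[, 2], levels = 1:10))

print(all(as.vector(m2) == as.vector(tab)))
```

For the 11217 rows in the question, table avoids the explicit loop and is likely to be the more convenient choice.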

Reduce columns of a matrix by a function in R

I have a matrix sort of like:
data <- round(runif(30)*10)
dimnames <- list(c("1","2","3","4","5"),c("1","2","3","2","3","2"))
values <- matrix(data, ncol=6, dimnames=dimnames)
# 1 2 3 2 3 2
# 1 5 4 9 6 7 8
# 2 6 9 9 1 2 5
# 3 1 2 5 3 10 1
# 4 6 5 1 8 6 4
# 5 6 4 5 9 4 4
Some of the column names are the same. I want to essentially reduce the columns in this matrix by taking the min of all values in the same row where the columns have the same name. For this particular matrix, the result would look like this:
# 1 2 3
# 1 5 4 7
# 2 6 1 2
# 3 1 1 5
# 4 6 4 1
# 5 6 4 4
The actual data set I'm using here has around 50,000 columns and 4,500 rows. None of the values are missing and the result will have around 40,000 columns. The way I tried to solve this was by melting the data then using group_by from dplyr before reshaping back to a matrix. The problem is that it takes forever to generate the data frame from the melt and I'd like to be able to iterate faster.
We can use rowMins from library(matrixStats)
library(matrixStats)
res <- vapply(split(1:ncol(values), colnames(values)),
function(i) rowMins(values[,i,drop=FALSE]), rep(0, nrow(values)))
res
# 1 2 3
#[1,] 5 4 7
#[2,] 6 1 2
#[3,] 1 1 5
#[4,] 6 4 1
#[5,] 6 4 4
row.names(res) <- row.names(values)
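If installing matrixStats is not an option, the same grouping can be sketched in base R by swapping rowMins for apply(..., 1, min). This is an alternative to the answer above, not its method, and it is probably slower on a matrix with 50,000 columns. The matrix below reproduces the values printed in the question.

```r
# The matrix printed in the question, entered column by column
values <- matrix(c(5, 6, 1, 6, 6,
                   4, 9, 2, 5, 4,
                   9, 9, 5, 1, 5,
                   6, 1, 3, 8, 9,
                   7, 2, 10, 6, 4,
                   8, 5, 1, 4, 4),
                 ncol = 6,
                 dimnames = list(as.character(1:5),
                                 c("1", "2", "3", "2", "3", "2")))

# Column positions grouped by column name
groups <- split(seq_len(ncol(values)), colnames(values))

# Row-wise minimum within each group of same-named columns
res_base <- vapply(groups,
                   function(idx) apply(values[, idx, drop = FALSE], 1, min),
                   numeric(nrow(values)))
rownames(res_base) <- rownames(values)
print(res_base)
```

drop = FALSE matters here: without it, a group with a single column would collapse to a vector and apply would fail.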

Combinations of Variables in R

I'm trying to create a fake data frame to examine the effects from a multinomial logit model in R. I have code that does precisely what I want to do, which is to create a row representing every combination of levels of different variables.
var1 <- seq(1,10,1)
var2 <- seq(1,20,5)
FakeData <- as.data.frame(matrix(NA, nrow=length(var1) * length(var2),
ncol=2))
row <- 1
for(i in 1:length(var1)){
for(j in 1:length(var2)){
FakeData[row, 1] <- var1[i]
FakeData[row, 2] <- var2[j]
row <- row + 1
}
}
> head(FakeData)
V1 V2
1 1 1
2 1 6
3 1 11
4 1 16
5 2 1
6 2 6
My problem is that this code is very inefficient when applied to my problem with four variables of around ten levels each. Any tips on functions that might make it quicker?
You may be looking for expand.grid ?
R> expand.grid(var1, var2)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 1 6
12 2 6
13 3 6
14 4 6
15 5 6
16 6 6
17 7 6
18 8 6
19 9 6
20 10 6
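Note that expand.grid varies its first argument fastest, so the row order differs from the nested loop in the question, where var2 changes fastest. If the original order matters, one possible sketch is to pass the variables in reverse and then reorder the columns; the names V1 and V2 are chosen here only to mirror the question's output.

```r
var1 <- seq(1, 10, 1)
var2 <- seq(1, 20, 5)

# var2 is listed first so that it varies fastest, as in the nested loop;
# the columns are then put back in V1, V2 order
FakeData <- expand.grid(V2 = var2, V1 = var1)[, c("V1", "V2")]

print(head(FakeData))
print(nrow(FakeData))
```

With four variables the call extends the same way, listing the fastest-changing variable first.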

R: create a data frame out of a rolling window

Let's say I have a data frame with the following structure:
DF <- data.frame(x = 0:4, y = 5:9)
> DF
x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9
What is the most efficient way to turn 'DF' into a data frame with the following structure:
w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9
Here w is a window of length 2 rolling through the data frame 'DF'. The length of the window should be arbitrary, i.e. a length of 3 yields
w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 7
3 3 8
3 4 9
I am a bit stumped by this problem, because the data frame can also contain an arbitrary number of columns, i.e. w,x,y,z etc.
/edit 2: I've realized edit 1 is a bit unreasonable, as xts doesn't seem to deal with multiple observations per data point
My approach would be to use the embed function. The first thing to do is to create a rolling sequence of indices into a vector. Take a data-frame:
df <- data.frame(x = 0:4, y = 5:9)
nr <- nrow(df)
w <- 3 # window size
i <- 1:nr # indices of the rows
iw <- embed(i,w)[, w:1] # matrix of rolling-window indices of length w
> iw
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 3 4
[3,] 3 4 5
wnum <- rep(1:nrow(iw),each=w) # window number
inds <- i[c(t(iw))] # the indices flattened, to use below
dfw <- sapply(df, '[', inds)
dfw <- transform(data.frame(dfw), w = wnum)
> dfw
x y w
1 0 5 1
2 1 6 1
3 2 7 1
4 1 6 2
5 2 7 2
6 3 8 2
7 2 7 3
8 3 8 3
9 4 9 3
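The steps above can be wrapped into a small helper so that the window length and the number of columns are both arbitrary. The function name rolling_windows is hypothetical, and indexing whole rows with df[inds, ] is used here in place of the sapply call so that column types are preserved; treat it as a sketch rather than a tuned implementation.

```r
# Hypothetical helper: stack all length-w rolling windows of df's rows
rolling_windows <- function(df, w) {
  stopifnot(w >= 1, w <= nrow(df))
  # Each row of iw holds the row indices of one window
  iw <- embed(seq_len(nrow(df)), w)[, w:1, drop = FALSE]
  # Flatten window by window
  inds <- c(t(iw))
  out <- df[inds, , drop = FALSE]
  # Window number goes first, as in the question's desired layout
  out <- cbind(w = rep(seq_len(nrow(iw)), each = w), out)
  rownames(out) <- NULL
  out
}

DF <- data.frame(x = 0:4, y = 5:9)
res2 <- rolling_windows(DF, 2)
res3 <- rolling_windows(DF, 3)
print(res2)
```

Because the helper indexes rows rather than individual columns, extra columns such as z are carried along without any change to the code.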
