Select rows of a matrix that meet a condition - r

In R with a matrix:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
I want to extract the submatrix whose rows have column three = 11. That is:
one two three four
[1,] 1 6 11 16
[3,] 3 8 11 18
[4,] 4 9 11 19
I want to do this without looping. I am new to R so this is probably very obvious but the
documentation is often somewhat terse.

This is easier to do if you convert your matrix to a data frame using as.data.frame(). In that case the previous answers (using subset with a bare column name, or m$three) will work; on a matrix they will not, because $ and bare column names only work for data frames.
To perform the operation on a matrix, you can define a column by name:
m[m[, "three"] == 11,]
Or by number:
m[m[,3] == 11,]
Note that if only one row matches, the result is an integer vector, not a matrix.
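To keep the result a matrix even when a single row matches, you can pass drop = FALSE. A minimal sketch, assuming the matrix from the question is rebuilt by hand as m:

```r
# Rebuild the question's matrix (filled column by column).
m <- matrix(c(1:10, 11, 12, 11, 11, 15, 16:20), ncol = 4,
            dimnames = list(NULL, c("one", "two", "three", "four")))

# Three rows match, so the result is a 3 x 4 matrix.
sub <- m[m[, "three"] == 11, , drop = FALSE]

# Only one row matches; without drop = FALSE this collapses to a vector.
dropped <- m[m[, "three"] == 15, ]
one_row <- m[m[, "three"] == 15, , drop = FALSE]

is.matrix(dropped)  # FALSE
is.matrix(one_row)  # TRUE
```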

A simple approach is to use the dplyr package.
Assuming the data is in a data frame called data (filter() does not accept a matrix, so convert with as.data.frame() first):
library(dplyr)
result <- filter(data, three == 11)

m <- matrix(1:20, ncol = 4)
colnames(m) <- letters[1:4]
The following command will select the first row of the matrix above.
subset(m, m[,4] == 16)
And this will select the last three.
subset(m, m[,4] > 17)
The result will be a matrix in both cases.
If you want to use column names to select rows then you would be best off converting it to a dataframe with
mf <- data.frame(m)
Then you can select with (the value 16 is in the fourth column, d)
mf[ mf$d == 16, ]
Or, you could use the subset command.
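For completeness, a small sketch of the subset form on the data frame, using the same m as above. With a data frame, subset() lets you refer to the column by its bare name:

```r
m <- matrix(1:20, ncol = 4)
colnames(m) <- letters[1:4]
mf <- data.frame(m)

# Column d holds 16:20, so these mirror the two matrix examples above.
first <- subset(mf, d == 16)
last_three <- subset(mf, d > 17)
```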

Subset is a very slow function, and I personally find it useless.
I assume you have a data.frame or matrix called Mat with A, B, C as column names. Then all you need to do is the following.
In the case of one condition on one column, let's say column A:
Mat[which(Mat[,'A'] == 10), ]
In the case of multiple conditions on different columns, you can build up an auxiliary index variable step by step. Suppose the conditions are A = 10, B = 5, and C > 2; then we have:
aux = which(Mat[,'A'] == 10)
aux = aux[which(Mat[aux,'B'] == 5)]
aux = aux[which(Mat[aux,'C'] > 2)]
Mat[aux, ]
Timing both with system.time, I found the which method to be about 10x faster than the subset method.
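The stepwise index can also be written as a single logical condition combined with &. This is only a sketch on made-up data (the values in Mat are hypothetical), not a benchmark:

```r
# Hypothetical example data with the column names used above.
Mat <- cbind(A = c(10, 10, 3, 10),
             B = c(5, 5, 5, 1),
             C = c(3, 1, 4, 7))

# One logical vector per condition, combined element-wise.
keep <- Mat[, "A"] == 10 & Mat[, "B"] == 5 & Mat[, "C"] > 2

# which() drops any NA in keep; drop = FALSE keeps a one-row matrix.
res <- Mat[which(keep), , drop = FALSE]
```

Wrapping the condition in which() matters when the columns can contain NA: a bare logical index with NA in it returns rows of NA, while which() silently skips them.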

If your data is a data frame called m (this does not work on a matrix, because $ is not defined for matrices), just use:
R> m[m$three == 11, ]

If the dataset is called data, then all the rows where the value of column 'pm2.5' is greater than 300 can be retrieved with:
data[data[, 'pm2.5'] > 300, ]

Related

how to create a row that is calculated from another row automatically like how we do it in excel?

Does anyone know how to have a row in R that is calculated from another row automatically? For example,
let's say in Excel I want to make a row C, which is made up of (B2/B1):
C1 = B2/B1
C2 = B3/B2
...
Cn = Bn+1/Bn
In Excel, we only need to do one calculation and then drag it down. How do we do it in R?
In R you work with columns as vectors so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:
df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]
In R you usually would name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ] and so on.
EDIT 1:
There are multiple ways - and certainly some more elegant ways to get it done - but for understanding I kept it in simple base R:
Example dataset for demonstration:
df <- data.frame("B1" = c(1, 2, 3),
"B2" = c(2, 4, 6),
"B3" = c(4, 8, 12))
Column calculation:
# Parentheses matter: 1:ncol(df)-1 would evaluate to 0:(ncol(df)-1)
for (i in 1:(ncol(df) - 1)) {
  col_name <- paste0("C", i)
  df[col_name] <- df[, i+1]/df[, i]
}
Output:
B1 B2 B3 C1 C2
1 1 2 4 2 2
2 2 4 8 2 2
3 3 6 12 2 2
So you iterate through the available columns B1/B2/B3. Dynamically create a column name in every iteration, based on the number of the current iteration, and then calculate the respective column contents.
EDIT 2:
Rowwise, as you actually meant it apparently, works similarly:
a <- c(10,15,20, 1)
df <- data.frame(a)
for (i in 1:nrow(df)) {
df$b[i] <- df$a[i+1]/df$a[i]
}
Output:
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 NA
You can do this just using vectors, without a for loop.
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 0.000000
Explanation:
In the example data, df$a is the vector 10 15 20 1.
df$a[-1] is the same vector with its first element removed, 15 20 1.
Then use c() to append a new element to the end, so that the vector has the same length as before:
c(df$a[-1],0) which is 15 20 1 0
What we want for column b is this vector divided by the original df$a.
So:
df$b <- c(df$a[-1], 0) / df$a
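If you would rather have NA than 0 in the last row (matching the output of the loop version), one possible variant divides the shifted vector by the vector without its last element and pads the result with NA. This is a sketch of the same idea, not a different method:

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# df$a[-1] drops the first element, df$a[-nrow(df)] drops the last,
# so the division pairs each value with its successor.
df$b <- c(df$a[-1] / df$a[-nrow(df)], NA)
```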

how to get all rows with max value of a variable [duplicate]

I have a matrix containing two columns and many rows. The first column is named idCombinaison and the second is named accuarcy. The accuarcy column holds float values.
Now I want to get all rows where the value of accuarcy equals the maximum. In some cases (as depicted in the picture), several rows have accuarcy equal to the maximum, so I want to get all of these rows.
I tried this:
maxAccuracy <- subset(accuarcyMatrix, accuarcyMatrix['accuarcy'] == max(accuarcyMatrix['accuarcy']))
But this returns an empty vector. Any ideas, please?
A reproducible data simulating your matrix:
set.seed(123)
x <- matrix(sample(1:9, 30, T), 10, 3)
row.names(x) <- 1:10
colnames(x) <- LETTERS[1:3]
# A B C
# 1 3 9 9
# 2 8 5 7
# 3 4 7 6
# ...
For matrix objects, you need to use the two-index form, such as data[a, b], to extract elements. Taking the above data as an example, x["C"] will return NA and x[, "C"] will return all elements in column C. Therefore, the following two commands generate different outputs.
subset(x, x["C"] == max(x["C"]))
# A B C (Empty)
subset(x, x[, "C"] == max(x[, "C"]))
# A B C
# 1 3 9 9
# 4 8 6 9
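The same thing can be done with plain bracket indexing instead of subset(). Here is a small sketch on a hand-made matrix (the values are made up so that the maximum of C is tied):

```r
# Hypothetical 3 x 3 matrix; column C has its maximum (9) in rows 1 and 3.
x <- matrix(c(3, 8, 4,
              9, 5, 7,
              9, 7, 9), nrow = 3,
            dimnames = list(NULL, c("A", "B", "C")))

single_index <- x["C"]    # NA: a matrix has no element named "C"
column_c <- x[, "C"]      # the whole column C

# Keep every row whose C equals the column maximum.
best <- x[x[, "C"] == max(x[, "C"]), , drop = FALSE]
```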
Maybe something like this?
library(dplyr)
accuarcyMatrix %>%
filter_at(vars(accuarcy),
any_vars(.==max(.))
)
Base R solution, assuming accuarcyMatrix is a data frame (although this is very likely a duplicate):
accuarcyMatrix[which(accuarcyMatrix$accuarcy == max(accuarcyMatrix$accuarcy)), ]
I'm guessing you will want to change "accuarcy" to "accuracy".

Call apply-like function on two rows to match

I have a dataframe with multiple rows. I want to call a function on any two rows. For example, let's say I have this data and this myFunc, which accepts two args:
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
df
q1 q2 q3 q4 q5
1 1 5 5 5 2
2 2 5 2 5 3
3 5 5 5 5 1
myFunc<-function(a,b) sum((df[a,]==df[b,] & df[a,]==5)*1)
I want to apply myFunc to rows 1 and 2, i.e. myFunc(1,2), and I expect 2: myFunc computes how many 5s the two rows have in common in the same column.
Since I have thousands of rows and want to match all pairs, I want to do this without writing a for loop, maybe with do.call or the apply family of functions.
I tried this:
a=c(1,2) # match the row 1 and 2
b=c(2,3) # match the row 2 and 3
my_list=list(a,b)
do.call("myFunc", my_list)
But I got 4 instead of 2 and 2. Any ideas?
The question recently changed. My understanding of it is that the input should be a list of pairs of row numbers and the output should be the same length as that list such that each component of the output is the number of columns with both entries equal to 5 in both rows defined by the corresponding pair. Thus for df shown in the question the list L shown below would correspond to c(myFunc(1, 2), myFunc(2, 3)) where myFunc is as defined in the question.
L <- list(1:2, 2:3)
myFunc2 <- function(x) myFunc(x[1], x[2])
sapply(L, myFunc2)
## [1] 2 2
Note that *1 in myFunc is unnecessary since sum will coerce a logical argument to numeric.
An alternative might be to specify the first row numbers as a vector and the second row numbers as another vector. In terms of L that would be a <- sapply(L, "[", 1); b <- sapply(L, "[", 2). Then use mapply.
a <- c(1, 2) # L[[1]][1], L[[2]][1]
b <- c(2, 3) # L[[1]][2], L[[2]][2]
mapply(myFunc, a, b)
## [1] 2 2
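Since the question mentions matching all pairs, one way to generate the pairs is combn(). The sketch below is an assumption about what "all pairs" means (every unordered pair of distinct rows); it also converts df to a matrix first, which is my own change rather than part of the question's myFunc:

```r
df <- data.frame(q1 = c(1, 2, 5), q2 = c(5, 5, 5), q3 = c(5, 2, 5),
                 q4 = c(5, 5, 5), q5 = c(2, 3, 1))
mat <- as.matrix(df)  # row extraction from a matrix gives plain vectors

# Count the columns where rows a and b are both equal to 5.
count_shared_5 <- function(a, b) sum(mat[a, ] == mat[b, ] & mat[a, ] == 5)

# Each column of pairs is one pair of row numbers: (1,2), (1,3), (2,3).
pairs <- combn(nrow(df), 2)
counts <- mapply(count_shared_5, pairs[1, ], pairs[2, ])
```

Keep in mind that the number of pairs grows quadratically, so with thousands of rows this produces millions of calls.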
Try passing the rows instead of the row index
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
myFunc<-function(a,b) sum((a==b & a==5)*1)
myFunc(df[1,],df[2,])
This worked for me (returned 2)

Merge two columns into one, delete colnames

I have a table like:
a
n_msi2010 n_msi2011
1 -0.122876 1.818750
2 1.328930 0.931426
3 -0.111653 4.400060
4 1.222900 4.500450
5 3.604160 6.110930
I would like to merge these two columns into one column to obtain (I don't want to keep column names):
a
n_msi2010
1 -0.122876
2 1.328930
3 -0.111653
4 1.222900
5 3.604160
6 1.818750
7 0.931426
8 4.400060
9 4.500450
10 6.110930
When I am using prefabricated data like
x <- cbind(c(1, 2, 3), c(4, 5, 6))
colnames(x)<-c("a","b")
c(t(x))
# 1 4 2 5 3 6
c((x))
# 1 2 3 4 5 6
the column merging works fine. Only in the "a" example it doesn't work, and it creates 2 separate vectors. I don't really understand why. Any help? Thanks
It seems like your question is about column versus row order vector creation from a data.frame.
Using t() on a data.frame converts the data.frame to a matrix, and using c() on the matrix removes its dimensions.
With that knowledge, you can try:
# create a vector of values, column by column
c(as.matrix(a)) # you are missing the `as.matrix` in your current approach
# create a vector of values, row by row
c(t(a)) # you already know this works
Other approaches to get the "column by column" result would be:
unlist(a, use.names = FALSE)
stack(a)[, "values"] # add `drop = FALSE` if you want to retain a data.frame
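To see why the original attempt gave two separate vectors: calling c() directly on a data frame returns a list with one element per column, not a single vector. A short sketch using the first two rows of the question's table:

```r
a <- data.frame(n_msi2010 = c(-0.122876, 1.328930),
                n_msi2011 = c(1.818750, 0.931426))

as_list <- c(a)                        # a list of two numeric vectors
by_col <- unlist(a, use.names = FALSE) # column by column
by_row <- c(t(a))                      # row by row
via_matrix <- c(as.matrix(a))          # same as by_col
```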
Not an elegant way, but it can combine two or more columns into one.
n_msi2010 <- 1:5
n_msi2011 <- 6:10
a <- data.frame(n_msi2010, n_msi2011)
out <- vector()  # named out to avoid masking the vector() function
for (i in seq_len(ncol(a))) {
  out <- append(out, as.vector(a[, i]))
}
out
You may then do
as.matrix(out) or data.frame(out)

Remove the rows of data frame whose cells match a given vector

I have a big data frame with various numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove the rows with cells that match the values of this vector. I'd like to have this as a function because I have to run it on multiple data frames of variable rows and columns, and I would like to avoid for loops.
For example:
ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data.frame")
Remove all rows that have cells containing the values 8, 9, 10.
I guess I could use ff[ !ff[,1] %in% c(8, 9, 10), ] or subset(ff, !ff[,1] %in% c(8,9,10) )
but in order to remove all the values from the dataset I have to parse each column (probably with a for loop, something I wish to avoid).
Is there any other (cleaner) way?
Thanks a lot
Apply your test to each row:
keeps <- apply(ff, 1, function(x) !any(x %in% 8:10))
This gives a logical vector. Then subset with it:
ff[keeps,]
j.1 j.2 j.3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
11 11 12 13
12 12 13 14
13 13 14 15
I suppose the apply strategy may turn out to be the most economical but one could also do either of these:
ff[ !rowSums( sapply( ff, function(x) x %in% 8:10) ) , ]
ff[ !Reduce("+", lapply( ff, function(x) x %in% 8:10) ) , ]
Both add up the logical vectors (the sum is nonzero exactly when any() would be TRUE for that row) and then negate. I suspect the first one would be faster.
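Since the question asks for a function, the column-wise approach could be wrapped up roughly like this. The name drop_rows_with is made up for illustration, and using "|" in Reduce instead of "+" is my own variation on the line above:

```r
# Hypothetical helper: remove every row of d that contains any of `values`.
drop_rows_with <- function(d, values) {
  # One logical vector per column, OR-ed together across columns.
  hit <- Reduce("|", lapply(d, function(col) col %in% values))
  d[!hit, , drop = FALSE]
}

ff <- data.frame(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15)
res <- drop_rows_with(ff, 8:10)
```

Using drop = FALSE keeps the result a data frame even if it is applied to a data frame with a single column.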

Resources