R logical indexing by equality on multiple columns - r

Say I have the following dataframe, df:
A B C
1 4 25 a
2 3 79 b
3 4 25 c
4 6 17 d
5 4 21 e
6 5 25 f
How to I index the lines where elements in certain columns match a vector, e.g. df$A == 4 & df$B == 25?
I would have thought something like: df[df[,c("A", "B")] == c(4, 25),] would have worked, but this doesn't give me the right answer (it returns no lines).
I would like a method that uses a vector of column names to match on, and a vector of values to match.

Try this:
df[colSums(t(df[,c("A", "B")]) == c(4, 25))==2,]
x A B C
1 1 4 25 a
3 3 4 25 c
This works recycling the vector c(4, 25).

You could also use a simple merge, which would have analogies for data.table and dplyr and in sql as well:
merge(df, setNames(list(4,25),c("A","B")))
# A B C
#1 4 25 a
#2 4 25 c

Related

Conditional Subset, Manipulate and Replace

Following on from a previous question here I extracted the following data.frame
DF <- data.frame(A =c("One","Two","Three","Four","Five"),
B=c(1,1,2,2,3),
D=c(10,2,3,-5,5))
subset(DF, B %in% c(1,3))
A B D
1 One 1 10
2 Two 1 2
5 Five 3 5
but now I want to (for example) multiply the numbers by (say) five and replace them in the original data.frame
The following code
subset(DF, B %in% c(1,3))[,2:3] * 5
B D
1 5 50
2 5 10
5 15 25
gives me the numbers I want but how to I get them back to
A B D
1 One 5 50
2 Two 5 10
3 Three 2 3
4 Four 2 -5
5 Five 15 25
The answer is staring me in the face (ie the index numbers ... but how do I get to them)?
You can do
DF[DF$B %in% c(1, 3), 2:3] <- DF[DF$B %in% c(1, 3), 2:3] * 5
DF
# A B D
#1 One 5 50
#2 Two 5 10
#3 Three 2 3
#4 Four 2 -5
#5 Five 15 25

Mean of Multiple Columns in R

I am trying take the mean of a list of columns in R and am running into a issue. Let's say I have:
A B C D
1 2 3 4
5 6 7 8
9 10 11 12
What I am trying to do is take the mean of columns c(A,C) and save it as a value say (E) as well as the mean of columns c(B,D) and have it save as a different value say F. Is that possible?
E F
2 3
6 7
10 11
Check out dplyr:
library(dplyr)
df <- df %>% mutate(E=(A+C)/2, F=(B+D)/2)
df
A B C D E F
1 1 2 3 4 2 3
2 5 6 7 8 6 7
3 9 10 11 12 10 11
We can subset the dataset with columns 1 & 2, another one with 3 & 4, add them together, divide by 2, and change the column names with setNames
setNames((df1[1:2] + df1[3:4])/2, c("E", "F"))
# E F
#1 2 3
#2 6 7
#3 10 11
Or another option is rowMeans by keeping it in a list using the recycling logical vector, loop through the list (using sapply) and get the rowMeans
i1 <- c(TRUE, FALSE)
sapply(list(df1[i1], df1[!i1]), rowMeans)
Or another option is unlist the dataset, convert it to array and use apply to get the mean
apply(array(unlist(df1), c(3, 2, 2)), c(1,2), mean)

R - Match column by other columns in same data frame

In the following dataframe I want to create a new column called D2 that matches the corresponding A, B, or C column. For example, if D == A, I want D2 == A2.
A A2 B B2 C C2 D
1 10 2 90 3 9 1
1 11 2 99 3 15 1
1 42 2 2 3 9 2
1 5 2 54 3 235 2
1 13 2 20 3 10 3
1 6 2 1 3 4 3
This is what I want the new data frame to look like:
A A2 B B2 C C2 D D2
1 10 2 90 3 9 1 10
1 11 2 99 3 15 1 11
1 42 2 2 3 9 2 2
1 5 2 54 3 235 2 54
1 13 2 20 3 10 3 10
1 6 2 1 3 4 3 4
I have succeeded in doing this with ifelse statements using dplyr, but because I am doing this with many columns, it gets tedious after a while. I was wondering if there is a more clever way to accomplish the same task.
library(dplyr)
newdata <- olddata %>% mutate(D2=ifelse(D==A,A2,ifelse(D==B,B2,C2)))
We can do this efficiently with max.col from base R. Subset the 'olddata' with only 'A', 'B', 'C' columns ('d1'), check whether it is equal to 'D' (after replicating the 'D' to match the lengths), use max.col to find the index of the maximum element (i.e TRUE in this case, assuming that there will be a single TRUE value per rows), multiply by 2 as the 'A1', 'B2', 'C2' columns are alternating after 'A', 'B', 'C', cbind with the row sequence to create the row/column index and extract the elements based on that to create the 'D2' column.
d1 <- olddata[c("A", "B", "C")]
olddata$D2 <- olddata[cbind(1:nrow(d1), max.col(d1 == rep(olddata["D"],
ncol(d1)), "first")*2)]
olddata$D2
#[1] 10 11 2 54 10 4
A slightly different approach would be to compare the columns separately in a loop using lapply (should be efficient if the dataset is very big as converting to a big logical matrix can cost the memory) and based on that we subset the corresponding columns of A2, B2, C2 with mapply
i1 <- grep("^[^D]", names(olddata)) #create an index for columns that are not D
i2 <- seq(1, ncol(olddata[i1]), by = 2)#for subsetting A, B, C
i3 <- seq(2, ncol(olddata[i1]), by = 2)# for subsetting A2, B2, C2
olddata$D2 <- c(mapply(`[`, olddata[i3], lapply(olddata[i2], `==`, olddata$D)))
olddata$D2
[1] 10 11 2 54 10 4

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match header of matrix A and 1st column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*In advance, if the matrix B is like below,
> B
col1 col2 col3 col4
1 a c e f
How to make the matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat — you can use B$col1 instead of B[, 1], and to select by row, you need to transform the result to a vector, because the result of selecting a row in a data.frame is again a data.frame, i.e. you need to do unlist(B[1, ]).
You can use a subset:
cbind(A$ID, A[names(A) %in% B$col1])

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously where the NAs are in column C, I get an NA in variable D. I don't want to exclude all the NA rows, rather just ignore the NA values in this column (and then the value in D would simply be the multiplication of A and B, or where C was available, A*B*C.
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wodnering what the simplist way of doing this is?
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As #James said, prod and apply will work, but you don't need to waste memory storing it in a separate variable, or even cbinding it
Df.1$D = apply(Df.1, 1, prod, na.rm=T)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36

Resources