Say I have a dataframe:
df <- data.frame(rbind(c(10,1,5,4), c(6,0,3,10), c(7,1,10,10)))
colnames(df) <- c("a", "b", "c", "d")
df
a b c d
10 1 5 4
6 0 3 10
7 1 10 10
And a vector of numbers (which correspond to the four column names a,b,c,d)
threshold <- c(7,1,5,8)
I need to compare each row in the data frame to the vector. When the value in the data frame meets or exceeds that in the vector, I need to return the column name. The output would be:
a b c d cols
10 1 5 4 a,b,c #10>7, 1>=1, 5>=5
6 0 3 10 d #10>8
7 1 10 10 a,b,c,d #7>=7, 1>=1, 10>=5, 10>=8
The column cols can be a string that simply lists the columns where the value is exceeded.
Is there any clever way to do this? I'm migrating an old Excel function and I can write a loop or something, but I thought there almost had to be a better way.
You do not need which, and the desired output is a comma-separated string. Here df has an extra leading id column, hence df[-1]:
df$cols <- apply(df[-1], 1, function(x) toString(names(df)[-1][x >= threshold]))
df
id a b c d cols
1 aa 10 1 5 4 a, b, c
2 bb 6 0 3 10 d
3 cc 7 1 10 10 a, b, c, d
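For the data frame exactly as posted in the question (four numeric columns and no id column), the same idea can be sketched without dropping the first column. This is a minimal sketch that assumes threshold is in the same order as the columns:

```r
# Question's data: four numeric columns, no id column (so no df[-1] needed).
df <- data.frame(a = c(10, 6, 7), b = c(1, 0, 1),
                 c = c(5, 3, 10), d = c(4, 10, 10))
threshold <- c(7, 1, 5, 8)

# For each row, keep the names of the columns that meet or exceed the
# threshold and collapse them into one comma-separated string.
cols <- apply(df, 1, function(x) toString(names(df)[x >= threshold]))
df$cols <- cols
df$cols
# [1] "a, b, c"    "d"          "a, b, c, d"
```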
We can also try
i1 <- which(df >=threshold[col(df)], arr.ind=TRUE)
df$cols <- unname(tapply(names(df)[i1[,2]], i1[,1], toString))
df$cols
#[1] "a, b, c" "d" "a, b, c, d"
You can try this:
df$cols <- apply(df[, 2:5], 1, function(x) names(df[, 2:5])[which(x >= threshold)])
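Note that without toString this kind of call gives back a list here, because the rows match different numbers of columns (if every row matched the same number, apply would return a matrix instead). A small sketch on the question's data (no id column, so names(df) is used directly) showing how to collapse that list into strings:

```r
df <- data.frame(a = c(10, 6, 7), b = c(1, 0, 1),
                 c = c(5, 3, 10), d = c(4, 10, 10))
threshold <- c(7, 1, 5, 8)

# Each row yields a character vector of matching column names.
hits <- apply(df, 1, function(x) names(df)[which(x >= threshold)])
is.list(hits)
# [1] TRUE

# Collapse each element into a single comma-separated string.
df$cols <- vapply(hits, toString, character(1))
```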
I am trying to use a vector of logical expressions to subset a data frame. I have a data frame I want to subset based on several columns, where I want to exclude "B" each time. First I want to define a vector of logical expressions based on the data frame's column names.
set.seed(42)
n <- 24
dataframe <- data.frame(column1=as.character(factor(paste("obs",1:n))),
rand1=rep(LETTERS[1:4], n/4),
rand2=rep(LETTERS[1:6], n/6),
rand3=rep(LETTERS[1:3], n/3),
x=rnorm(n))
columns <- colnames(dataframe)[2:4]
criteria <- quote(rep(paste0(columns[1:3], " != ", quote("B")), length(columns)))
What I want to achieve is a vector criteria containing
rand1 != "B" rand2 != "B" rand3 != "B" so I can use it to subset data frame based on columns like
dfs1 <- subset(dataframe, criteria[1])
dfs2 <- subset(dataframe, criteria[2])
dfs3 <- subset(dataframe, criteria[3])
I might be misunderstanding your question, but it seems like you want a collection of data.frames where each one excludes rows where a given column = 'B'.
Assuming this is what you want:
cols <- c('rand1', 'rand2', 'rand3')
result <- lapply(dataframe[, cols], function(x) dataframe[x!='B',])
will create a list of data.frames, each of which has the result of excluding rows where the indicated column == 'B'.
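To see what that list looks like, here is a quick sketch that rebuilds the question's data frame and inspects the result. The names row_counts and dfs1 are just for illustration:

```r
n <- 24
dataframe <- data.frame(
  column1 = paste("obs", 1:n),
  rand1 = rep(LETTERS[1:4], n / 4),
  rand2 = rep(LETTERS[1:6], n / 6),
  rand3 = rep(LETTERS[1:3], n / 3),
  x = rnorm(n)
)

cols <- c("rand1", "rand2", "rand3")
result <- lapply(dataframe[, cols], function(x) dataframe[x != "B", ])

# The list is named after the columns, so each subset can be pulled by name.
names(result)
row_counts <- sapply(result, nrow)  # rand1: 18, rand2: 20, rand3: 16
dfs1 <- result[["rand1"]]           # rows where rand1 is not "B"
```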
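Based on "Using tidy eval for multiple, arbitrary filter conditions":
library(dplyr)  # filter(), quo(), %>%
library(purrr)  # map2()
filter_fun <- function(df, cols, conds){
# build one quosure per column/value pair: <col> != <value>
fp <- map2(cols, conds, function(x, y) quo((!!(as.name(x))) != !!y))
filter(df, !!!fp)
}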
filter_col <- columns[1:3] %>% as.list()
cond_list <- rep(list("B"), length(columns[1:3]))
filter_fun(dataframe, cols = filter_col,
conds = cond_list)
column1 rand1 rand2 rand3 x
1 obs 1 A A A 1.3709584
2 obs 3 C C C 0.3631284
3 obs 4 D D A 0.6328626
4 obs 7 C A A 1.5115220
5 obs 9 A C C 2.0184237
6 obs 12 D F C 2.2866454
7 obs 13 A A A -1.3888607
8 obs 15 C C C -0.1333213
9 obs 16 D D A 0.6359504
10 obs 19 C A A -2.4404669
11 obs 21 A C C -0.3066386
12 obs 24 D F C 1.2146747
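If you only need this combined filter (rows where none of the chosen columns is "B"), a base R sketch gives the same rows without tidy eval. This is a different technique from the one above, and it assumes a single value to exclude across all columns:

```r
set.seed(42)
n <- 24
dataframe <- data.frame(
  column1 = paste("obs", 1:n),
  rand1 = rep(LETTERS[1:4], n / 4),
  rand2 = rep(LETTERS[1:6], n / 6),
  rand3 = rep(LETTERS[1:3], n / 3),
  x = rnorm(n)
)
columns <- c("rand1", "rand2", "rand3")

# Logical matrix of "is B" per cell; keep rows with no TRUE at all.
keep <- rowSums(dataframe[columns] == "B") == 0
filtered <- dataframe[keep, ]
nrow(filtered)
# [1] 12
```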
If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the indices of the rows that contain any of the elements in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
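Sorting is just a matter of wrapping the call in sort(). The sketch below also shows an alternative based on %in% (not part of the answer above) that checks every column and returns the indices already in row order:

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])
x <- c("a", "k", "n")

# The approach above, sorted at the end.
hits <- Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
sort(hits)
# [1] 1 2 5

# Alternative: test each column with %in%, then OR the columns together.
rows <- which(Reduce(`|`, lapply(df, function(col) col %in% x)))
rows
# [1] 1 2 5
```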
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame, and happens to capture the search strings as element names (you can get rid of those via unname()):
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
Original second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
This question already has an answer here:
Select equivalent rows [A-B & B-A] [duplicate]
This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B
If I now remove duplicates, I get the following data frame:
df[duplicated(df),]
a b
3 A B
6 B A
8 C B
However, I would also like to remove the row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?
Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.
Thanks!
Extending Ari's answer, to specify columns to check if other columns are also there:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
a b c d
1 A A 10 B
2 A B 8 S
3 A B 7 J
4 B C 3 Q
5 B A 2 I
6 B A 6 U
7 C B 4 L
8 C B 5 V
cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
newdf[i, ] = sort(df[i,cols])
}
df[!duplicated(newdf),]
a b c d
1 A A 10 B
2 A B 8 S
4 B C 3 Q
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(df[i, ])
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data of the order of 1M rows.
Stealing from the other linked answer using data.table, you could try something like:
df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]
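If the data frame has other columns and you want to name the two columns to compare, the same pmin/pmax idea can be sketched as follows. This assumes character (not factor) columns; the extra column c and the names cols, key and deduped are made up for illustration:

```r
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 stringsAsFactors = FALSE)
df$c <- 1:8  # hypothetical extra column that should not affect the comparison

cols <- c("a", "b")
# Order-independent key: the smaller value first, the larger second.
key <- data.frame(lo = pmin(df[[cols[1]]], df[[cols[2]]]),
                  hi = pmax(df[[cols[1]]], df[[cols[2]]]))
deduped <- df[!duplicated(key), ]
rownames(deduped)
# [1] "1" "2" "4"
```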
A comparison benchmark with a larger dataset (df2):
df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]
system.time(
df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user system elapsed
# 0.07 0.00 0.06
system.time({
for (i in 1:nrow(df2))
{
df2[i, ] = sort(df2[i, ])
}
df2[!duplicated(df2),]
}
)
# user system elapsed
# 42.07 0.02 42.09
Using apply will be a better option than loops.
newDf <- data.frame(t(apply(df,1,sort)))
All you need to do now is remove duplicates.
newDf <- newDf[!duplicated(newDf),]
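One thing to watch: newDf only contains the sorted columns. If the original data frame has other columns you want to keep, a sketch along these lines uses the sorted rows only as a key (the column c and the names cols, keys and deduped are assumptions for the example):

```r
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 stringsAsFactors = FALSE)
df$c <- 1:8  # hypothetical extra column to carry along

cols <- c("a", "b")
# Sort the two values within each row; t() puts rows back as rows.
keys <- t(apply(df[cols], 1, sort))
# duplicated() on a matrix compares whole rows.
deduped <- df[!duplicated(keys), ]
rownames(deduped)
# [1] "1" "2" "4"
```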
Ok, I have a matrix of values with certain identifiers, such as:
A 2
B 3
C 4
D 5
E 6
F 7
G 8
I would like to pull out a subset of these values (using R) based on a list of the identifiers ("B", "D", "E") for example, so I would get the following output:
B 3
D 5
E 6
I'm sure there's an easy way to do this (some sort of apply?) but I can't seem to figure it out. Any ideas? Thanks!
If the letters are the row names, then you can just use this:
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
m[c("B","D","E"),]
# B D E
# 3 5 6
Note that there is a subtle but very important difference between: m[c("B","D","E"),] and m[rownames(m) %in% c("B","D","E"),]. Both return the same rows, but not necessarily in the same order.
The former uses the character vector c("B","D","E") as an index into m. As a result, the rows will be returned in the order of the character vector. For instance:
# result depends on order in c(...)
m[c("B","D","E"),]
# B D E
# 3 5 6
m[c("E","D","B"),]
# E D B
# 6 5 3
The second method, using %in%, creates a logical vector with length = nrow(m). Each element is TRUE if the corresponding row name is present in c("B","D","E"), and FALSE otherwise. Indexing with a logical vector returns rows in the original order:
# result does NOT depend on order in c(...)
m[rownames(m) %in% c("B","D","E"),]
# B D E
# 3 5 6
m[rownames(m) %in% c("E","D","B"),]
# B D E
# 3 5 6
This is probably more than you wanted to know...
Your matrix:
> m <- matrix(2:8, dimnames = list(LETTERS[1:7]))
You can use %in% to filter out the desired rows. If the original matrix only has a single column, using drop = FALSE will keep the matrix structure. Otherwise it will be converted to a named vector.
> m[rownames(m) %in% c("B", "D", "E"), , drop = FALSE]
# [,1]
# B 3
# D 5
# E 6
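drop = FALSE also works with character indexing, if you want the rows in the order you asked for. Indexing by a row name that does not exist throws a "subscript out of bounds" error, so the sketch below filters the requested ids first; the id "Z" is a made-up missing identifier:

```r
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))

ids <- c("E", "Z", "B")  # "Z" is a hypothetical id that is not in m
# Keep only ids that exist, preserving the requested order.
present <- intersect(ids, rownames(m))
sub <- m[present, , drop = FALSE]
sub
#   [,1]
# E    6
# B    3
```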
I have two data.frames:
DF1
Col1 Col2 ...... ...... Col2000
A H
c d
d e
n b
e A
b n
H c
DF2
A
b
c
d
e
n
H
I need simply to match the only one column in DF2 with each column in DF1. I need to match them because I need to know exactly the ranking of the match. Anyway I tried to write a function but since I'm not an R expert something goes wrong in my code:
lapply(DF1, function(x) match(DF1[,i], DF2[,1]))
Your anonymous function takes x as its argument but never uses it; it refers to DF1[,i] instead, and i is not defined. Use x:
lapply(DF1, function(x) match(x, DF2[,1]))
This does what you're trying to do. Take:
DF1 <- data.frame(
Col1 = c('A','c','d','n','e','b','H'),
Col2 = c('H','d','e','b','A','n','c')
)
DF2 <- data.frame(c('A','b','c','d','e','n','H'))
Then:
> lapply(DF1, function(x) match(x, DF2[,1]))
$Col1
[1] 1 3 4 6 5 2 7
$Col2
[1] 7 4 5 2 1 6 3
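With 2000 columns a matrix may be easier to work with than a list. As a rough sketch, sapply gives one column of positions per column of DF1; the names ref and ranks are chosen here for illustration:

```r
DF1 <- data.frame(
  Col1 = c("A", "c", "d", "n", "e", "b", "H"),
  Col2 = c("H", "d", "e", "b", "A", "n", "c")
)
DF2 <- data.frame(ref = c("A", "b", "c", "d", "e", "n", "H"))

# Each entry is the position in DF2 of the corresponding DF1 value
# (NA would mean the value does not occur in DF2).
ranks <- sapply(DF1, match, table = DF2[[1]])
ranks[, "Col1"]
# [1] 1 3 4 6 5 2 7
```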