Filtering multiple Categorical Data [duplicate] - r

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 6 years ago.
I have a data frame df with an ID column eg A,B,etc. I also have a vector containing certain IDs:
L <- c("A", "B", "E")
How can I filter the data frame to get only the IDs present in the vector? Individually, I would use
subset(df, ID == "A")
but how do I filter on a whole vector?

You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.

I reckon you need to use 'match'. It matches the values in one vector to the values in another vector, and gives NA where there's no match. So then you subset based on !is.na of the match.
See ?match and you can probably work it out for yourself, in which case you'll learn more than from the exact answer someone will do shortly which will just encourage you to cut n paste :)

Related

Defining what species I want to include in a scatterplot [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 6 years ago.
I have a data frame df with an ID column eg A,B,etc. I also have a vector containing certain IDs:
L <- c("A", "B", "E")
How can I filter the data frame to get only the IDs present in the vector? Individually, I would use
subset(df, ID == "A")
but how do I filter on a whole vector?
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
I reckon you need to use 'match'. It matches the values in one vector to the values in another vector, and gives NA where there's no match. So then you subset based on !is.na of the match.
See ?match and you can probably work it out for yourself, in which case you'll learn more than from the exact answer someone will do shortly which will just encourage you to cut n paste :)

View all rows where there is a duplicate in one of the columns in R [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 6 years ago.
I have a data frame df with an ID column eg A,B,etc. I also have a vector containing certain IDs:
L <- c("A", "B", "E")
How can I filter the data frame to get only the IDs present in the vector? Individually, I would use
subset(df, ID == "A")
but how do I filter on a whole vector?
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
I reckon you need to use 'match'. It matches the values in one vector to the values in another vector, and gives NA where there's no match. So then you subset based on !is.na of the match.
See ?match and you can probably work it out for yourself, in which case you'll learn more than from the exact answer someone will do shortly which will just encourage you to cut n paste :)

how to get all rows with max value of a variable [duplicate]

This question already has answers here:
Extracting indices for data frame rows that have MAX value for named field
(3 answers)
Closed 4 years ago.
I have matrix containing two columns and many rows. The first column name is idCombinaison and the second column name is accuarcy. The accuarcy has a float values.
Now I want to get all rows which the value of accuarcy == max value. In some cases (like depicted in the picture), I can have many rows which the value of accuarcy equals to max, so I want to get all these rows!
I tried this:
maxAccuracy <- subset(accuarcyMatrix, accuarcyMatrix['accuarcy'] == max(accuarcyMatrix['accuarcy']))
But this return an empty vector. Any ideas please?
A reproducible data simulating your matrix:
set.seed(123)
x <- matrix(sample(1:9, 30, T), 10, 3)
row.names(x) <- 1:10
colnames(x) <- LETTERS[1:3]
# A B C
# 1 3 9 9
# 2 8 5 7
# 3 4 7 6
# ...
In matrix objects, you need to use a binary way to extract element such as data[a, b]. Take the above data for example, x["C"] will return NA and x[, "C"] will return all elements in column C. Therefore, the following two codes are going to generate different outputs.
subset(x, x["C"] == max(x["C"]))
# A B C (Empty)
subset(x, x[, "C"] == max(x[, "C"]))
# A B C
# 1 3 9 9
# 4 8 6 9
Maybe something like this?
library(dplyr)
accuarcyMatrix %>%
filter_at(vars(accuarcy),
any_vars(.==max(.))
)
Base R solution (although this is very likely a duplicate):
accuarcyMatrix[ which(accuarcyMatrix$accuarcy == max(accuarcyMatrix$accuarcy) , ]
I'm guessing you will want to change "accuarcy" to "accuracy"

filtering data frames columns based on vector in a loop [duplicate]

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.
Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:
data.frame(df$A,df$B,df$E)
Is there a more compact way of doing this?
You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.
# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]
Note there's no comma (i.e. it's not df[,c("A","B","C")]). That's because df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.
str(df["A"])
## 'data.frame': 1 obs. of 1 variable:
## $ A: int 1
str(df[,"A"]) # vector
## int 1
Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).
# subset (original solution--not recommended)
df[,c("A","B","E")] # returns a data.frame
df[,"A"] # returns a vector
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
This is the role of the subset() function:
> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> subset(dat, select=c("A", "B"))
A B
1 1 3
2 2 4
There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or
df[,c(1,2,5)]
as in
> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> df
A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
A B E
1 1 3 8
2 2 4 8
For some reason only
df[, (names(df) %in% c("A","B","E"))]
worked for me. All of the above syntaxes yielded "undefined columns selected".
Where df1 is your original data frame:
df2 <- subset(df1, select = c(1, 2, 5))
You can also use the sqldf package which performs selects on R data frames as :
df1 <- sqldf("select A, B, E from df")
This gives as the output a data frame df1 with columns: A, B ,E.
You can use with :
with(df, data.frame(A, B, E))
df<- dplyr::select ( df,A,B,C)
Also, you can assign a different name to the newly created data
data<- dplyr::select ( df,A,B,C)
[ and subset are not substitutable:
[ does return a vector if only one column is selected.
df = data.frame(a="a",b="b")
identical(
df[,c("a")],
subset(df,select="a")
)
identical(
df[,c("a","b")],
subset(df,select=c("a","b"))
)

Filtering a data frame on a vector [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 6 years ago.
I have a data frame df with an ID column eg A,B,etc. I also have a vector containing certain IDs:
L <- c("A", "B", "E")
How can I filter the data frame to get only the IDs present in the vector? Individually, I would use
subset(df, ID == "A")
but how do I filter on a whole vector?
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
I reckon you need to use 'match'. It matches the values in one vector to the values in another vector, and gives NA where there's no match. So then you subset based on !is.na of the match.
See ?match and you can probably work it out for yourself, in which case you'll learn more than from the exact answer someone will do shortly which will just encourage you to cut n paste :)

Resources