I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.
Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:
data.frame(df$A,df$B,df$E)
Is there a more compact way of doing this?
You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.
# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]
Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because the comma form drops a single column to a vector: df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.
str(df["A"])
## 'data.frame': 1 obs. of 1 variable:
## $ A: int 1
str(df[,"A"]) # vector
## int 1
Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).
# subset (original solution--not recommended)
df[,c("A","B","E")] # returns a data.frame
df[,"A"] # returns a vector
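If you do prefer the comma form, passing drop = FALSE should keep the result a data frame even when only one column is selected. A minimal sketch using the same example data as above (the names one_vec and one_col are only illustrative):

```r
# same example data as above
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# with the comma, a single column is dropped to a vector by default
one_vec <- df[, "A"]

# drop = FALSE keeps the data.frame structure
one_col <- df[, "A", drop = FALSE]

is.data.frame(one_vec)  # FALSE
is.data.frame(one_col)  # TRUE
```

This only matters for single-column selections; with two or more columns both forms return a data frame.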
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
This is the role of the subset() function:
> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> subset(dat, select=c("A", "B"))
A B
1 1 3
2 2 4
There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or
df[,c(1,2,5)]
as in
> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> df
A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
A B E
1 1 3 8
2 2 4 8
For some reason only
df[, (names(df) %in% c("A","B","E"))]
worked for me. All of the above syntaxes yielded "undefined columns selected".
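A likely explanation, offered as a guess since the original data is not shown: "undefined columns selected" is raised when at least one requested name is not a column of the data frame (a typo, different case, or stray whitespace). The %in% form builds a logical mask over the names that do exist, so missing names are silently ignored. A small sketch with made-up data, where wanted deliberately contains a name ("Z") that is not in df:

```r
df <- data.frame(A = 1:2, B = 3:4, E = 5:6)
wanted <- c("A", "B", "Z")   # "Z" is deliberately not a column of df

# indexing by name fails as soon as one name is missing;
# tryCatch() captures the error message instead of stopping
res <- tryCatch(df[, wanted], error = function(e) conditionMessage(e))
res

# %in% keeps only the names that exist, so no error is raised
kept <- df[, names(df) %in% wanted]
names(kept)

# list the requested names that are missing from df
setdiff(wanted, names(df))
```

Checking setdiff(wanted, names(df)) first is usually the quickest way to find the offending name.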
Where df1 is your original data frame:
df2 <- subset(df1, select = c(1, 2, 5))
You can also use the sqldf package, which performs SQL selects on R data frames:
library(sqldf)
df1 <- sqldf("select A, B, E from df")
This returns a data frame df1 with columns A, B, and E.
You can use with():
with(df, data.frame(A, B, E))
df <- dplyr::select(df, A, B, E)
Also, you can assign a different name to the newly created data frame:
data <- dplyr::select(df, A, B, E)
[ and subset() are not interchangeable:
with the comma form, [ returns a vector if only one column is selected, whereas subset() returns a data frame by default.
df = data.frame(a="a",b="b")
identical(
df[,c("a")],
subset(df,select="a")
)
identical(
df[,c("a","b")],
subset(df,select=c("a","b"))
)
Edit
This question was flagged as a duplicate of "How to group a vector into a list of vectors?", and the answer split(df$b, df$id) was suggested. Although I was initially happy with that solution, I realized that those answers do not fully address my question. In the question below, I would like to obtain a list in which the vector elements are labelled with the values of a third column (df$a in my example). This matters because otherwise the result depends on the order of df$b. Obviously I could arrange by df$a and then call split(), but maybe there is another way of doing it.
My sample df:
df <- tibble::tibble(id = paste0('id', rep(1:2, each = 5)), a = rep(letters[1:5], 2), b = c(1:5, 5:1)) # data_frame() is deprecated in favour of tibble()
The data frame should be grouped by ID (df$id). For each group, I would like to create a vector that contains the values of df$b, and collect these vectors in a list. My approach:
library(tidyr)
spread_df <- df %>% spread(id, b) # makes a new column for each id
list_group_elements <- list() # initialize the list before filling it in the loop
# loop over the columns of spread_df
for (i in seq_along(spread_df)) {
list_group_elements[i] <- list(spread_df[[i]])
# I want each vector to be identified by the identifier in column df$a,
# which is the first list element, therefore:
names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in :
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I suspect that my approach is not ideal, and if someone has an idea for improving it (e.g., with dplyr?), that would be highly appreciated. Why do I want this? I wrote a function that takes vectors as arguments, and I would like to run it over certain columns of data frames, passing only the grouped values as arguments rather than the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
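Because the names travel with the values, a function applied to each group can look elements up by name instead of by position, so the row order of df no longer matters. A minimal sketch under assumptions: my_fun is a hypothetical stand-in for the asker's function, the sample data is rebuilt with base R data.frame(), and the rows are shuffled on purpose to show the point.

```r
# sample data from the question, built with base R and deliberately shuffled
df <- data.frame(id = paste0("id", rep(1:2, each = 5)),
                 a  = rep(letters[1:5], 2),
                 b  = c(1:5, 5:1))
df <- df[c(10, 3, 7, 1, 5, 2, 9, 4, 6, 8), ]

# one named vector per group
groups <- split(setNames(df$b, df$a), df$id)

# my_fun is a hypothetical stand-in for a function that takes a vector;
# it addresses elements by name, so their order is irrelevant
my_fun <- function(v) v[["a"]] - v[["e"]]

results <- lapply(groups, my_fun)
results
```

If the function needs the values in a fixed order, sorting each vector by name inside the function (v[order(names(v))]) is one option.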
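One way is to loop over the group identifiers (unique() works whether df$id is a factor or a character vector, whereas levels() returns NULL for a character column):
lapply(unique(df$id), function(L) df$b[df$id == L])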
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by(), an object-oriented wrapper for tapply() that is designed to split a data frame by one or more factors:
by(df, df$id, FUN=function(i) i$b)
My data frame contains the output of a survey with a select multiple question type. Some cells have multiple values.
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
a b
1 1 1
2 2 1, 2
3 3 1, 2, 3
I would like to flatten out the list to obtain the following output:
df
a b
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This should be easy, but somehow I can't find the right search terms. Thanks.
You can just use unnest from "tidyr":
library(tidyr)
unnest(df, b)
# a b
# 1 1 1
# 2 2 1
# 3 2 2
# 4 3 1
# 5 3 2
# 6 3 3
Using base R, one option is stack after naming the list elements of 'b' column with that of the elements of 'a'. We can use setNames to change the names.
stack(setNames(df$b, df$a))
Or another option would be to use unstack to automatically name the list element of 'b' with 'a' elements and then do the stack to get a data.frame output.
stack(unstack(df, b~a))
Or we can use a convenient function listCol_l from splitstackshape to convert the list to data.frame.
library(splitstackshape)
listCol_l(df, 'b')
Here's one way, with data.table:
require(data.table)
data.table(df)[,as.integer(unlist(b)),by=a]
If b is stored consistently, as.integer can be skipped. You can check with
unique(sapply(df$b,class))
# [1] "numeric" "integer"
Here's another base solution, far less elegant than any other solution posted thus far. Posting for the sake of completeness, though personally I would recommend akrun's base solution.
with(df, cbind(a = rep(a, sapply(b, length)), b = do.call(c, b)))
This constructs the first column as the elements of a, where each is repeated to match the length of the corresponding list item from b. The second column is b "flattened" using do.call() with c().
As Ananda Mahto pointed out in a comment, sapply(b, length) can be replaced with lengths(b) in R 3.2.0 and later.
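Combining the two observations above, a variant that returns a data frame rather than the matrix produced by cbind() might look like the sketch below. It assumes R 3.2.0 or later for lengths(); the name flat is only illustrative.

```r
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# repeat each value of a once per element of the matching list entry,
# then flatten the list column itself
flat <- data.frame(a = rep(df$a, lengths(df$b)),
                   b = unlist(df$b))
flat
```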
A base R approach might also be to create a new data.frame for each row and rbind it afterwards:
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
df <- lapply(seq_along(df$a), function(x){data.frame(a = df$a[[x]], b = df$b[[x]])})
df <- do.call("rbind", df)
df
I have a data frame df with an ID column (e.g. A, B, etc.). I also have a vector containing certain IDs:
L <- c("A", "B", "E")
How can I filter the data frame to get only the IDs present in the vector? Individually, I would use
subset(df, ID == "A")
but how do I filter on a whole vector?
You can use the %in% operator:
> df <- data.frame(id=c(LETTERS, LETTERS), x=1:52)
> L <- c("A","B","E")
> subset(df, id %in% L)
id x
1 A 1
2 B 2
5 E 5
27 A 27
28 B 28
31 E 31
If your IDs are unique, you can use match():
> df <- data.frame(id=c(LETTERS), x=1:26)
> df[match(L, df$id), ]
id x
1 A 1
2 B 2
5 E 5
or make them the rownames of your dataframe and extract by row:
> rownames(df) <- df$id
> df[L, ]
id x
A A 1
B B 2
E E 5
Finally, for more advanced users, and if speed is a concern, I'd recommend looking into the data.table package.
I reckon you need to use 'match'. It matches the values in one vector to the values in another vector, and gives NA where there's no match. So then you subset based on !is.na of the match.
See ?match and you can probably work it out for yourself, in which case you'll learn more than from the exact answer someone will do shortly which will just encourage you to cut n paste :)
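For readers who want to check their own attempt, here is one way the hint above could be read, as a sketch rather than the definitive answer. It reuses the example data from the %in% answer; the names pos and filtered are only illustrative.

```r
df <- data.frame(id = c(LETTERS, LETTERS), x = 1:52)
L <- c("A", "B", "E")

# match() gives the position of each id in L, or NA when the id is not in L
pos <- match(df$id, L)

# keep the rows whose id was found
filtered <- df[!is.na(pos), ]
filtered
```

This selects the same rows as subset(df, id %in% L), since %in% is itself defined in terms of match().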