Flatten list column in data frame with ID column - r

My data frame contains the output of a survey with a select multiple question type. Some cells have multiple values.
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
a b
1 1 1
2 2 1, 2
3 3 1, 2, 3
I would like to flatten out the list to obtain the following output:
df
a b
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This should be easy, but somehow I can't find the right search terms. Thanks.

You can just use unnest from "tidyr":
library(tidyr)
unnest(df, b)
# a b
# 1 1 1
# 2 2 1
# 3 2 2
# 4 3 1
# 5 3 2
# 6 3 3

Using base R, one option is stack(), after naming the list elements of column 'b' with the corresponding values of 'a'. We can use setNames to set the names. Note that the result has columns named values and ind rather than b and a.
stack(setNames(df$b, df$a))
Or another option would be to use unstack to automatically name the list element of 'b' with 'a' elements and then do the stack to get a data.frame output.
stack(unstack(df, b~a))
Or we can use a convenient function listCol_l from splitstackshape to convert the list to data.frame.
library(splitstackshape)
listCol_l(df, 'b')
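Because stack() returns columns called values and ind (with ind as a factor), a small follow-up step is needed to get back to the a/b layout from the question. A minimal sketch, assuming the IDs in a are integers; the names long and out are only illustrative:

```r
# Example data from the question
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# stack() gives a data frame with columns 'values' and 'ind'
long <- stack(setNames(df$b, df$a))

# Rebuild the original column names; 'ind' is a factor, so convert
# via character to recover the integer IDs
out <- data.frame(a = as.integer(as.character(long$ind)),
                  b = long$values)
out
```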

Here's one way, with data.table:
require(data.table)
data.table(df)[,as.integer(unlist(b)),by=a]
If b is stored consistently, as.integer can be skipped. You can check with
unique(sapply(df$b,class))
# [1] "numeric" "integer"

Here's another base solution, far less elegant than any other solution posted thus far. Posting for the sake of completeness, though personally I would recommend akrun's base solution.
with(df, cbind(a = rep(a, sapply(b, length)), b = do.call(c, b)))
This constructs the first column as the elements of a, where each is repeated to match the length of the corresponding list item from b. The second column is b "flattened" using do.call() with c().
As Ananda Mahto pointed out in a comment, sapply(b, length) can be replaced with lengths(b), which is available from R 3.2.0 onwards.
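The same idea can be written so that it returns a data frame rather than a matrix. This is only a sketch of the rep()/lengths() technique described above, with unlist() swapped in for do.call(c, b); the name flat is illustrative:

```r
# Example data from the question
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# Repeat each ID once per element of its list cell, then flatten the cells
flat <- data.frame(a = rep(df$a, lengths(df$b)),
                   b = unlist(df$b))
flat
```

Unlike cbind(), data.frame() keeps each column's own type, which matters if a is not numeric.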

A base R approach might also be to create a new data.frame for each row and rbind it afterwards:
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
df <- lapply(seq_along(df$a), function(x){data.frame(a = df$a[[x]], b = df$b[[x]])})
df <- do.call("rbind", df)
df

Related

Using named vectors instead of positions in square brackets [duplicate]

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.
Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:
data.frame(df$A,df$B,df$E)
Is there a more compact way of doing this?
You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.
# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]
Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.
str(df["A"])
## 'data.frame': 1 obs. of 1 variable:
## $ A: int 1
str(df[,"A"]) # vector
## int 1
Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).
# subset (original solution--not recommended)
df[,c("A","B","E")] # returns a data.frame
df[,"A"] # returns a vector
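If you do want to keep the comma form, base R's drop = FALSE argument prevents the single-column result from collapsing to a vector. A short sketch comparing the three variants on the example data above (the variable names are illustrative):

```r
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

as_df   <- df["A"]                   # list-style indexing: data frame
as_vec  <- df[, "A"]                 # matrix-style indexing: vector
kept_df <- df[, "A", drop = FALSE]   # matrix-style, but stays a data frame

class(as_df)
class(as_vec)
class(kept_df)
```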
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
This is the role of the subset() function:
> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> subset(dat, select=c("A", "B"))
A B
1 1 3
2 2 4
There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or
df[,c(1,2,5)]
as in
> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> df
A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
A B E
1 1 3 8
2 2 4 8
For some reason only
df[, (names(df) %in% c("A","B","E"))]
worked for me. All of the above syntaxes yielded "undefined columns selected".
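The answer does not say why the other forms failed, but one plausible cause (an assumption, not something confirmed above) is that one of the requested names was not actually a column of the data frame. Indexing by name is strict about that, while the %in% form silently ignores missing names. A sketch with a made-up missing column "Z":

```r
df <- data.frame(A = 1:2, B = 3:4, E = 5:6)
wanted <- c("A", "B", "Z")   # "Z" is hypothetical and not in df

# Strict: indexing by a name that does not exist raises an error,
# captured here as a message string
strict <- tryCatch(df[wanted], error = function(e) conditionMessage(e))

# Lenient: the logical mask only keeps names that are present
lenient <- df[names(df) %in% wanted]

strict
lenient
```

Running setdiff(wanted, names(df)) is a quick way to see which names are missing.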
Where df1 is your original data frame:
df2 <- subset(df1, select = c(1, 2, 5))
You can also use the sqldf package, which runs SQL SELECT statements on R data frames:
library(sqldf)
df1 <- sqldf("select A, B, E from df")
This returns a data frame df1 with columns A, B, and E.
You can use with():
with(df, data.frame(A, B, E))
With dplyr, you can overwrite the data frame with the selected columns:
df <- dplyr::select(df, A, B, C)
Also, you can assign a different name to the newly created data:
data <- dplyr::select(df, A, B, C)
[ and subset() are not interchangeable:
with the comma form, [ returns a vector if only one column is selected, whereas subset() still returns a data frame.
df = data.frame(a="a",b="b")
identical(
df[,c("a")],
subset(df,select="a")
)
identical(
df[,c("a","b")],
subset(df,select=c("a","b"))
)

Split column into vectors by group R - independent of column order

Edit
This question seems to be a duplicate of How to group a vector into a list of vectors?, where the answer split(df$b, df$id) was suggested. I was happy with that solution at first, but then realized that the given answers do not fully address my question. In the question below, I would like to obtain a list in which each vector element is identified by the value of a third column (df$a in my example). This is important because otherwise the result depends on the order of df$b. Obviously I could arrange by df$a and then call split(), but maybe there is another way of doing it.
My sample df:
df <- data_frame(id = paste0('id',rep(1:2, each = 5)), a = rep(letters[1:5],2),b=c(1:5,5:1))
Df should be grouped by ID (in df$id). I would like to create a list of vectors for each group (id) element that contains the values of df$b. My approach
require(dplyr) # for %>%
require(tidyr)
spread_df <- df %>% spread(id, b) # makes a new column for each id
list_group_elements <- list() # must exist before it is filled in the loop
# loop over spread_df
for (i in seq_along(spread_df)) {
  list_group_elements[i] <- list(spread_df[[i]])
  # I want each vector to be identified by the identifier of column df$a
  # therefore:
  names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in :
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I have the peculiar impression that my approach is somewhat not ideal and if someone has an idea to improve this, (e.g., with dplyr?) this would be highly appreciated. Why do I want this: I made a function that uses vectors as arguments and I would like to run this function over certain columns from dataframes - but only using the grouped values as arguments and not the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
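Since each value now carries its name from df$a, the row order of the input no longer matters: the vectors can be put into a fixed order by name after splitting. A minimal sketch, using a plain data.frame and a deliberately shuffled copy (the names shuffled and groups are illustrative):

```r
df <- data.frame(id = paste0("id", rep(1:2, each = 5)),
                 a  = rep(letters[1:5], 2),
                 b  = c(1:5, 5:1),
                 stringsAsFactors = FALSE)

# Shuffle the rows to show that the input order is irrelevant
shuffled <- df[c(10, 3, 7, 1, 5, 2, 9, 4, 6, 8), ]

# Name each value by 'a', split by 'id', then order each vector by name
groups <- split(setNames(shuffled$b, shuffled$a), shuffled$id)
groups <- lapply(groups, function(v) v[order(names(v))])
groups
```

A function that expects a vector can then be applied per group with lapply(groups, your_function), where your_function stands in for the function mentioned in the question.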
One way is
lapply(unique(df$id), function(L) df$b[df$id == L]) # unique() also works when id is a character column, where levels() would return NULL
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by(), an object-oriented wrapper for tapply() designed to split a data frame by one or more factors:
by(df, df$id, FUN=function(i) i$b)

How to recode a set of variables in a dataframe in R

I have a dataframe with different variables containing values from 1 to 5. I want to recode some variables in the way that 5 becomes 1 and vice versa (x=6-x).
I want to define a list of variables, that will be recoded like this in my dataframe.
Here is my approach using lapply. I haven't really understood it yet.
#generate example-dataset
var1<-sample(1:5,100,rep=TRUE)
var2<-sample(1:5,100,rep=TRUE)
var3<-sample(1:5,100,rep=TRUE)
dat<-as.data.frame(cbind(var1,var2,var3))
recode.list<-c("var1","var3")
recode.function<- function(x){
x=6-x
}
lapply(recode.list,recode.function,data=dat)
There's no need for an external function or for a package for this. Just use an anonymous function in lapply, like this (using dat, the data frame from the question):
dat[recode.list] <- lapply(dat[recode.list], function(x) 6 - x)
Using [] lets us replace just those columns directly in the original dataset. This is needed because lapply on its own returns a named list rather than a data frame.
As noted in the comments, you can actually even skip lapply:
dat[recode.list] <- 6 - dat[recode.list]
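The attempt in the question fails because lapply(recode.list, ...) iterates over the column names themselves (the strings "var1" and "var3"), not over the columns. The following sketch shows the column-wise version end to end on a smaller example and checks the result; the seed, the sample size of 10, and the name recoded are assumptions made for illustration:

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
dat <- data.frame(var1 = sample(1:5, 10, replace = TRUE),
                  var2 = sample(1:5, 10, replace = TRUE),
                  var3 = sample(1:5, 10, replace = TRUE))
recode.list <- c("var1", "var3")

# Work on a copy so the original values stay available for comparison
recoded <- dat
recoded[recode.list] <- lapply(dat[recode.list], function(x) 6 - x)

# Original and recoded values of a reversed column always sum to 6
all(dat$var1 + recoded$var1 == 6)
```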
You can use mapvalues from plyr.
require(plyr)
# if you just want to swap 1 and 5 and leave the other values alone
dat[, recode.list] <- sapply(dat[, recode.list], mapvalues, c(1, 5), c(5, 1))
# if you want to reverse all values (equivalent to x = 6 - x, so mapvalues is not strictly needed here)
dat[, recode.list] <- sapply(dat[, recode.list], mapvalues, 1:5, 5:1)
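Here's an option to do this with dplyr:
recode.function <- function(x) {
  6 - x
}
recode.list <- c("var1", "var3")
require(dplyr)
# dplyr >= 1.0; the original answer used the now-deprecated
# dat %>% mutate_each_(funs(recode.function), recode.list)
dat %>% mutate(across(all_of(recode.list), recode.function))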
# var1 var2 var3
#1 2 2 4
#2 3 3 3
#3 3 5 2
#4 3 3 2
#5 4 3 3
#6 5 4 1
#...
