I have just started working with lists and the lapply function, and I'm having some difficulty. I have a list of several data frames and would like to keep only the data frames that satisfy a specific condition, saving them as a separate list. For instance:
l <- list(data.frame(PPID=1:5, gender=c(rep("male", times=5))),
data.frame(PPID=1:5, gender=c("male", "female", "male", "male", "female")),
data.frame(PPID=1:3, gender=c("male", "female", "male")))
print(l)
What I want to do is keep only the data frames that contain both genders (male and female) and save them as another list. So my result should be another list containing only the second and third data frames in l.
Things that I tried include:
ll <- subset(l, lapply(1:length(l), function(i) {
length(levels(l[[i]]$gender)) == 2
}))
ll <- subset(l, lapply(1:length(l), function(i) {
l[[i]]$gender == "male" | l[[i]]$gender == "female"
}))
But this returned a list of length 0.
Any help would be greatly appreciated!!
If you're willing to switch to purrr, you can simply use keep():
> library(purrr)
> keep(l, ~ length(unique(.x$gender)) > 1)
[[1]]
PPID gender
1 1 male
2 2 female
3 3 male
4 4 male
5 5 female
[[2]]
PPID gender
1 1 male
2 2 female
3 3 male
This works in base R:
lapply(l, function(x) if (length(unique(x$gender)) == 2) x)
#[[1]]
#NULL
#
#[[2]]
# PPID gender
#1 1 male
#2 2 female
#3 3 male
#4 4 male
#5 5 female
#
#[[3]]
# PPID gender
#1 1 male
#2 2 female
#3 3 male
If you don't want to keep the NULL entries, you can do
l2 <- lapply(l, function(x) if (length(unique(x$gender)) == 2) x)
Filter(Negate(is.null), l2);
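The two steps above (lapply, then dropping the NULL entries) can also be collapsed into a single Filter() call. The following is a minimal sketch, not the only way to do it; the helper name has_both is an assumption introduced here for illustration. It tests for the two values directly with %in%, so it should behave the same whether gender is stored as a factor or as a character column.

```r
# Example data from the question
l <- list(
  data.frame(PPID = 1:5, gender = rep("male", times = 5)),
  data.frame(PPID = 1:5, gender = c("male", "female", "male", "male", "female")),
  data.frame(PPID = 1:3, gender = c("male", "female", "male"))
)

# has_both is a hypothetical helper: TRUE when both values occur in the column
has_both <- function(df) all(c("male", "female") %in% df$gender)

# Filter() keeps only the list elements for which the predicate is TRUE
ll <- Filter(has_both, l)
length(ll)
```

Because the predicate returns a single TRUE or FALSE per data frame, no NULL placeholders are created and no second clean-up step is needed.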
One of the issues with your code is that while gender is a factor, it doesn't have the same levels in all list elements. You can check:
str(l);
#List of 3
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ PPID : int [1:5] 1 2 3 4 5
# ..$ gender: Factor w/ 1 level "male": 1 1 1 1 1
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ PPID : int [1:5] 1 2 3 4 5
# ..$ gender: Factor w/ 2 levels "female","male": 2 1 2 2 1
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ PPID : int [1:3] 1 2 3
# ..$ gender: Factor w/ 2 levels "female","male": 2 1 2
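To make the level-counting idea from the question work, the condition has to be a plain logical vector (one value per data frame) rather than a list. Here is a small sketch under one stated assumption: stringsAsFactors = TRUE is passed explicitly, because from R 4.0.0 onwards data.frame() keeps strings as character and levels() would then return NULL.

```r
# Same data as in the question, with factors requested explicitly
l <- list(
  data.frame(PPID = 1:5, gender = rep("male", 5), stringsAsFactors = TRUE),
  data.frame(PPID = 1:5, gender = c("male", "female", "male", "male", "female"),
             stringsAsFactors = TRUE),
  data.frame(PPID = 1:3, gender = c("male", "female", "male"),
             stringsAsFactors = TRUE)
)

# vapply returns an integer vector: the number of factor levels per data frame
n_levels <- vapply(l, function(df) nlevels(df$gender), integer(1))

# Index the list with a logical vector built from those counts
ll <- l[n_levels == 2]
length(ll)
```

Note that nlevels() counts declared levels, including unused ones, so on subsetted data it may be safer to count unique values instead.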
I have the following list of dataframes structure:
str(mylist)
List of 2
$ L1 :'data.frame': 12471 obs. of 3 variables:
...$ colA : Date[1:12471], format: "2006-10-10" "2010-06-21" ...
...$ colB : int [1:12471], 62 42 55 12 78 ...
...$ colC : Factor w/ 3 levels "type1","type2","type3",..: 1 2 3 2 2 ...
I would like to replace type1 or type2 with a new factor type4.
I have tried:
mylist <- lapply(mylist, transform, colC =
replace(colC, colC == 'type1','type4'))
Warning message:
1: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
I do not want to read in my initial data with stringsAsFactors = FALSE, but I have tried adding type4 as a level in my initial dataset (before splitting into a list of data frames) using:
levels(mydf$colC) <- c(levels(mydf$colC), "type4")
but I still get the same warning when trying to replace.
How do I tell replace that type4 is to be treated as a factor level?
You can redefine the factor with an explicit levels argument that includes the new level.
For example,
status <- factor(status, order=TRUE, levels=c("1", "3", "2",...))
Here c("1", "3", "2",...) stands for your own set of levels, which should include type4.
As you state, the crucial thing is to add the new factor level.
## Test data:
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)
Your data has three factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
Now add the fourth factor level, then your replace command should work:
## Change levels:
for (ii in seq_along(mylist)) levels(mylist[[ii]]$colC) <-
c(levels(mylist[[ii]]$colC), "type4")
## Replace level:
mylist <- lapply(mylist, transform, colC = replace(colC,
colC == 'type1','type4'))
The new data has four factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
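An alternative that avoids adding a level first is to rename the levels themselves. The sketch below assumes the goal stated in the question (both type1 and type2 become type4); merge_levels is a hypothetical helper name. Assigning the same label to several levels merges them into one level.

```r
# Test data as in the answer above
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)

# merge_levels is a hypothetical helper: relabel type1 and type2 as type4
merge_levels <- function(df) {
  lv <- levels(df$colC)
  lv[lv %in% c("type1", "type2")] <- "type4"
  levels(df$colC) <- lv   # duplicated labels are merged into a single level
  df
}

mylist <- lapply(mylist, merge_levels)
levels(mylist[[1]]$colC)
```

With this approach the old levels disappear entirely, whereas replace() leaves type1 behind as an unused level.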
I have a list of dataframes, each one with several columns. An example of my data could be:
Ind_ID<-rep(1:15)
Mun<-sample(15)
T_i<-paste0("D",rep(1:5))
data<-cbind(Ind_ID,Mun,T_i)
data<-data.frame(data)
mylist<-split(data,data$T_i)
str(mylist)
List of 5
$ D1:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 1 12 3
..$ Mun : Factor w/ 15 levels "1","10","11",..: 3 10 7
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 1 1 1
$ D2:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 8 13 4
..$ Mun : Factor w/ 15 levels "1","10","11",..: 14 11 5
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 2 2 2
...
$ D5:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 11 2 7
..$ Mun : Factor w/ 15 levels "1","10","11",..: 4 12 2
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 5 5 5
I want to add a new column with the same name as the data frame. My expected output is:
$D1
Ind_ID Mun T_i D1
1 1 11 D1 NA
6 6 4 D1 NA
11 11 15 D1 NA
$D2
Ind_ID Mun T_i D2
2 2 8 D2 NA
7 7 5 D2 NA
12 12 13 D2 NA
....
$D5
Ind_ID Mun T_i D5
5 5 12 D5 NA
10 10 6 D5 NA
15 15 10 D5 NA
My failed attempts include:
nam<-as.list(names(mylist))
fun01 <- function(x,y){cbind(x, y = rep(1, nrow(x)))}
a1<-lapply(mylist, fun01,nam)
str(a1) # This generates a new column with the name "y" in all cases
fun02 <- function(x,y){x= cbind(x, a = rep(1, nrow(x)));names(x)[4] <- y}
a2<-lapply(mylist, fun02,nam)
str(a2) # It changes the data frames
Any help? Thanks in advance
You can loop through all the dataframes with a lapply call and create your new column with something like this:
newlist = lapply(seq_along(mylist), function(i){
# Get the dataframe and the name
tmp_df = mylist[[i]]
tmp_name = names(mylist)[i]
# Create a new column with all NAs
tmp_df[,ncol(tmp_df) + 1] = NA
# Rename the newly created column
colnames(tmp_df)[ncol(tmp_df)] = tmp_name
# Return the df
return(tmp_df)
})
# lapply over indices drops the list names, so restore them
names(newlist) = names(mylist)
Option 1: You could use Map(). First we can write a little function for the iteration.
f <- function(df, nm) cbind(df, setNames(data.frame(NA), nm))
Map(f, mylist, names(mylist))
Option 2: You could live dangerously and do
Map("[<-", mylist, names(mylist), value = NA)
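For a quick check of the Map() idea on reproducible data, here is a minimal sketch. The Mun values are fixed (15:1) instead of sample(15) so the result is deterministic, and add_named_col is a hypothetical helper name; both are assumptions made for illustration.

```r
# Deterministic stand-in for the question's data
data <- data.frame(Ind_ID = 1:15,
                   Mun = 15:1,
                   T_i = paste0("D", rep(1:5, times = 3)))
mylist <- split(data, data$T_i)

# add_named_col is a hypothetical helper: add an all-NA column called nm
add_named_col <- function(df, nm) {
  df[[nm]] <- NA
  df
}

# Map() pairs each data frame with its own name and keeps the list names
out <- Map(add_named_col, mylist, names(mylist))
names(out$D1)
```

Since Map() takes its result names from the first list argument, the output list is still indexed by D1 to D5.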
Here is an example that was taken from a fellow SO member.
# load dplyr for filter() and %>%
library(dplyr)
# data
f <- c("a","a","a","b","b","c")
s <- c("fall","spring","other", "fall", "other", "other")
v <- c(3,5,1,4,5,2)
(dat0 <- data.frame(f, s, v))
# f s v
#1 a fall 3
#2 a spring 5
#3 a other 1
#4 b fall 4
#5 b other 5
#6 c other 2
(sp.tmp <- filter(dat0, s == "spring"))
# f s v
#1 a spring 5
(str(sp.tmp))
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 3 levels "a","b","c": 1
# $ s: Factor w/ 3 levels "fall","other",..: 3
# $ v: num 5
The data frame returned by filter() has retained all the levels from the original data frame.
What would be the recommended way to drop the unused levels, i.e. "fall" and "other", within the dplyr framework?
You could do something like:
dat1 <- dat0 %>%
filter(s == "spring") %>%
droplevels()
Then
str(dat1)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
You could use droplevels
sp.tmp <- droplevels(sp.tmp)
str(sp.tmp)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
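The same behaviour can be checked without dplyr. This is a small base R sketch, assuming stringsAsFactors = TRUE is set explicitly (from R 4.0.0 onwards data.frame() would otherwise keep f and s as character columns, and there would be no levels to drop).

```r
# Data from the question, with factor columns requested explicitly
dat0 <- data.frame(f = c("a", "a", "a", "b", "b", "c"),
                   s = c("fall", "spring", "other", "fall", "other", "other"),
                   v = c(3, 5, 1, 4, 5, 2),
                   stringsAsFactors = TRUE)

# Subsetting keeps every original level, used or not
sp <- dat0[dat0$s == "spring", ]
n_before <- nlevels(sp$s)

# droplevels() on a data frame drops unused levels from all factor columns
sp <- droplevels(sp)
n_after <- nlevels(sp$s)
c(before = n_before, after = n_after)
```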
I am wondering how I can convert a list with both numeric and string variables to a dataframe:
For example:
aa<-c("a","b","b","b","d")
bb<-c("Yes","No","No","Yes","Yes")
cc<-c(1,2,4,4,3)
x<-list(aa=aa,bb=bb,cc=cc)
How can I convert x to a dataframe, such that when I call x I get:
aa bb cc
1 a Yes 1
2 b No 2
3 b No 4
4 b Yes 4
5 d Yes 3
Thanks!
With large data in a list, it's recommended to change the data "in place".
For reference, see Simon Urbanek's answer here: Quickly reading very large tables as dataframes in R
attr(x, "row.names") <- .set_row_names(unique(lengths(x)))
class(x) <- "data.frame"
x
# aa bb cc
# 1 a Yes 1
# 2 b No 2
# 3 b No 4
# 4 b Yes 4
# 5 d Yes 3
This has two major advantages over as.data.frame. One is that it avoids copying the object x. The other is that it keeps the column classes in line with the list classes (see the following). With as.data.frame, character columns are converted to factors by default in R versions before 4.0.0.
sapply(x, class)
# aa bb cc
# "character" "character" "numeric"
A data.table would also have a class of data.frame, so you can make use of the efficient setDT function:
x <- list(aa = aa, bb = bb, cc = cc)
library(data.table)
setDT(x)
is.data.frame(x)
# [1] TRUE
str(x)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# $ aa: chr "a" "b" "b" "b" ...
# $ bb: chr "Yes" "No" "No" "Yes" ...
# $ cc: num 1 2 4 4 3
# - attr(*, ".internal.selfref")=<externalptr>
You can do x<-as.data.frame(x).
Updated based on @Frank's good comment below: if you want to avoid converting characters to factors, do
x<-as.data.frame(x, stringsAsFactors = FALSE)
> x = data.frame(aa,bb,cc)
> x
aa bb cc
1 a Yes 1
2 b No 2
3 b No 4
4 b Yes 4
5 d Yes 3
>
> str(x)
'data.frame': 5 obs. of 3 variables:
$ aa: Factor w/ 3 levels "a","b","d": 1 2 2 2 3
$ bb: Factor w/ 2 levels "No","Yes": 2 1 1 2 2
$ cc: num 1 2 4 4 3
>
or:
> x = data.frame(aa,bb,cc, stringsAsFactors=FALSE)
> x
aa bb cc
1 a Yes 1
2 b No 2
3 b No 4
4 b Yes 4
5 d Yes 3
> str(x)
'data.frame': 5 obs. of 3 variables:
$ aa: chr "a" "b" "b" "b" ...
$ bb: chr "Yes" "No" "No" "Yes" ...
$ cc: num 1 2 4 4 3
>
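As a compact check of the answers above, the sketch below converts the list from the question and inspects the column classes. Passing stringsAsFactors = FALSE explicitly is a choice made here so that the result should be the same on R versions before and after 4.0.0, where the default changed; the name df is introduced for illustration.

```r
# List from the question
x <- list(aa = c("a", "b", "b", "b", "d"),
          bb = c("Yes", "No", "No", "Yes", "Yes"),
          cc = c(1, 2, 4, 4, 3))

# Convert to a data frame, keeping strings as character columns
df <- as.data.frame(x, stringsAsFactors = FALSE)

# One class per column
sapply(df, class)
```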
Users,
I have this data frame:
A<- c(10,2,4,5,3,5,98,65,36,65,6,100,70,54,25,23,22,30,15,23)
B<- c(1,0.1,0.5,0.8,0.2,0.9,3,1.2,5.6,3.5,15.9,10.2,5,5.1,7.1,5,6,10,4,8)
C<- c("a","a","a","a","a","a","b","b","b","b","c","c","c","c","d","d","d","d","d","d")
mydf<- data.frame(A,B,C)
and I did a subset keeping only the level "a".
subset<- subset(mydf, mydf$C=="a")
But when I make a plot, the graph also shows the levels that were removed.
plot(B~ C, data=subset)
How can I plot the subsetted data frame avoiding deleted levels?
Thank you!
Use droplevels:
subset$C <- droplevels(subset$C)
plot(B~ C, data=subset)
By the way, subset is not a good name for a data frame, since it is also the name of a base R function.
str(subset)
#'data.frame': 6 obs. of 3 variables:
# $ A: num 10 2 4 5 3 5
# $ B: num 1 0.1 0.5 0.8 0.2 0.9
# $ C: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 1 1
Remove the unused factor levels by means of factor:
subset$C <- factor(subset$C)
str(subset)
#'data.frame': 6 obs. of 3 variables:
# $ A: num 10 2 4 5 3 5
# $ B: num 1 0.1 0.5 0.8 0.2 0.9
#$ C: Factor w/ 1 level "a": 1 1 1 1 1 1
Just do:
plot(B~ droplevels(C), data=subset)
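Putting the suggestions together, here is a minimal sketch that subsets and drops the unused levels in one step. The name mydf_a is an assumption chosen to avoid reusing subset as a variable name, and stringsAsFactors = TRUE is set explicitly so that C is a factor on any R version. The plot call is left as a comment so the snippet runs without a graphics device.

```r
# Data from the question
A <- c(10, 2, 4, 5, 3, 5, 98, 65, 36, 65, 6, 100, 70, 54, 25, 23, 22, 30, 15, 23)
B <- c(1, 0.1, 0.5, 0.8, 0.2, 0.9, 3, 1.2, 5.6, 3.5, 15.9, 10.2, 5, 5.1, 7.1, 5, 6, 10, 4, 8)
C <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b",
       "c", "c", "c", "c", "d", "d", "d", "d", "d", "d")
mydf <- data.frame(A, B, C, stringsAsFactors = TRUE)

# Subset to level "a" and drop the now-unused levels b, c and d
mydf_a <- droplevels(subset(mydf, C == "a"))
levels(mydf_a$C)

# plot(B ~ C, data = mydf_a)  # would now show a single group
```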