Conditionally creating a new variable - r

I have the following data.frame. I need to create a sixth variable (SAT_NEWS) as follows: If in three of the four variables ($medwell_.) the respondent has answered "Very well" OR "Somewhat well", the value of the new variable is SAT, otherwise it is NON_SAT.
'data.frame': 41953 obs. of 5 variables:
$ trust_gov : Factor w/ 6 levels "A lot","Somewhat",..: 1 2 2 2 1 2 4 2 2 2 ...
$ medwell_accuracy: Factor w/ 7 levels "Very well","Somewhat well",..: 2 4 2 3 4 2 1 1 1 1 ...
$ medwell_leaders : Factor w/ 7 levels "Very well","Somewhat well",..: 2 3 2 4 4 3 1 2 1 1 ...
$ medwell_unbiased: Factor w/ 7 levels "Very well","Somewhat well",..: 4 4 2 4 3 2 1 2 1 3 ...
$ medwell_coverage: Factor w/ 7 levels "Very well","Somewhat well",..: 2 4 1 3 3 2 1 1 2 3 ...
- attr(*, "variable.labels")= Named chr "ID. Respondent ID" "Survey" "Country" "QSPLIT. Split form A or B" ...
..- attr(*, "names")= chr "ID" "survey" "Country" "qsplit" ...
- attr(*, "codepage")= int 65001
Can you help me?

Unfortunately there is no %in% method for data frames, so some extra work is needed. With base R we may use
nm <- grep("medwell_", names(df))
num <- colSums(apply(df[, nm], 1, `%in%`, c("Very well", "Somewhat well")))
df$new <- ifelse(num == 3, "SAT", "NON_SAT")
while with dplyr (plus purrr, which provides map2_dfr) we have
df %>%
mutate(
new = ifelse(
select(., contains("medwell_")) %>%
map2_dfr(list(c("Very well", "Somewhat well")), `%in%`) %>%
rowSums() == 3, "SAT", "NON_SAT"
)
)
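If "three of the four" is meant as "at least three", a base R variant along the following lines may be closer to the intent. This is only a sketch: the toy data frame and the two negative answer labels are made up for illustration, and the >= 3 threshold is an assumption about what the question wants.

```r
# Toy data; the "Not too well" / "Not well at all" labels are hypothetical
df <- data.frame(
  trust_gov        = c("A lot", "Somewhat", "Somewhat"),
  medwell_accuracy = c("Very well", "Not too well", "Very well"),
  medwell_leaders  = c("Somewhat well", "Not well at all", "Very well"),
  medwell_unbiased = c("Very well", "Somewhat well", "Very well"),
  medwell_coverage = c("Not too well", "Very well", "Very well"),
  stringsAsFactors = TRUE
)

good <- c("Very well", "Somewhat well")
cols <- grep("^medwell_", names(df), value = TRUE)

# Per-row count of positive answers: add up one logical vector per column
n_good <- Reduce(`+`, lapply(df[cols], function(x) x %in% good))

# Assumption: "three of the four" is read as "at least three"
df$SAT_NEWS <- ifelse(n_good >= 3, "SAT", "NON_SAT")
df$SAT_NEWS
```

With num == 3 instead, a respondent who answered positively on all four items would end up as NON_SAT, so it is worth deciding which reading is wanted.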

Related

How to perform as.factor function?

I have multiple data frames (namely Accident, Vehicles and Casualties) which are to be merged into a single data frame called Accidents. How do I find the factors of the combined data frame, i.e. the factors of Accidents?
$ accident_severity : chr "Serious" "Slight" "Slight" "Slight" ...
$ number_of_vehicles : int 1 1 2 2 1 1 2 2 2 2 ...
$ number_of_casualties : int 1 1 1 1 1 1 1 1 1 1 ...
$ date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005" ...
$ day_of_week : chr "Tuesday" "Wednesday" "Thursday" "Thursday" ...
$ time : chr "17:42" "17:36" "00:15" "00:15" ...
You can convert the columns of your choice from character to factor using the lapply function. The code below converts the accident_severity and day_of_week columns:
df <- data.frame(accident_severity= c("Serious", "Slight", "Slight", "Slight"),
number_of_vehicles = c(1, 1, 2, 2),
number_of_casualties = c(1, 1, 1, 1),
date = c("04/01/2005", "05/01/2005", "06/01/2005", "06/01/2005"),
day_of_week = c("Tuesday", "Wednesday", "Thursday", "Thursday"),
time = c("17:42", "17:36", "00:15", "00:15"),
stringsAsFactors = FALSE)
str(df)
# 'data.frame': 4 obs. of 6 variables:
# $ accident_severity : chr "Serious" "Slight" "Slight" "Slight"
# $ number_of_vehicles : num 1 1 2 2
# $ number_of_casualties: num 1 1 1 1
# $ date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005"
# $ day_of_week : chr "Tuesday" "Wednesday" "Thursday" "Thursday"
# $ time : chr "17:42" "17:36" "00:15" "00:15"
df[c("accident_severity", "day_of_week")] <- lapply(df[c("accident_severity", "day_of_week")], factor)
str(df)
# 'data.frame': 4 obs. of 6 variables:
# $ accident_severity : Factor w/ 2 levels "Serious","Slight": 1 2 2 2
# $ number_of_vehicles : num 1 1 2 2
# $ number_of_casualties: num 1 1 1 1
# $ date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005"
# $ day_of_week : Factor w/ 3 levels "Thursday","Tuesday",..: 2 3 1 1
# $ time : chr "17:42" "17:36" "00:15" "00:15"
To find the names of the columns that are factors, you can use the is.factor function:
names(df)[unlist(lapply(df, is.factor))]
# [1] "accident_severity" "day_of_week"
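If every character column should become a factor, rather than a hand-picked subset, a variation on the same idea could look like the following. It is a sketch on a cut-down version of the example data; whether columns such as dates or times should really be factors is a separate decision.

```r
# Cut-down version of the example data (three of the six columns)
df <- data.frame(accident_severity = c("Serious", "Slight", "Slight", "Slight"),
                 number_of_vehicles = c(1, 1, 2, 2),
                 day_of_week = c("Tuesday", "Wednesday", "Thursday", "Thursday"),
                 stringsAsFactors = FALSE)

# Flag the character columns, then convert only those
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Names of the columns that are now factors
factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
factor_cols
```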

Subsetting certain dataframes that satisfy a condition from a list

I have just started to work with lists and lapply function and I'm experiencing some difficulty. I have a list of multiple dataframes and would like to subset dataframes that satisfy a specific condition and save it as a separate list. For instance,
l <- list(data.frame(PPID=1:5, gender=c(rep("male", times=5))),
data.frame(PPID=1:5, gender=c("male", "female", "male", "male", "female")),
data.frame(PPID=1:3, gender=c("male", "female", "male")))
print(l)
What I want to do is keep only the data frames that contain both genders (male and female) and save them as another list. So my outcome should be another list that contains only the second and third data frames in l.
Things that I tried include:
ll <- subset(l, lapply(1:length(l), function(i) {
length(levels(l[[i]]$gender)) == 2
}))
ll <- subset(l, lapply(1:length(l), function(i) {
l[[i]]$gender == "male" | l[[i]]$gender == "female"
}))
But this returned a list of length 0.
Any help would be greatly appreciated!!
If you're willing to switch to purrr, you can simply use keep():
> library(purrr)
> keep(l, ~ length(unique(.x$gender)) > 1)
[[1]]
PPID gender
1 1 male
2 2 female
3 3 male
4 4 male
5 5 female
[[2]]
PPID gender
1 1 male
2 2 female
3 3 male
This works in base R:
lapply(l, function(x) if (length(unique(x$gender)) == 2) x)
#[[1]]
#NULL
#
#[[2]]
# PPID gender
#1 1 male
#2 2 female
#3 3 male
#4 4 male
#5 5 female
#
#[[3]]
# PPID gender
#1 1 male
#2 2 female
#3 3 male
If you don't want to keep the NULL entries, you can do
l2 <- lapply(l, function(x) if (length(unique(x$gender)) == 2) x)
Filter(Negate(is.null), l2);
One of the issues with your code is that, while gender is a factor (as it is in R versions before 4.0.0, where data.frame() converts strings to factors by default), it doesn't have the same levels in all list elements. You can check:
str(l);
#List of 3
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ PPID : int [1:5] 1 2 3 4 5
# ..$ gender: Factor w/ 1 level "male": 1 1 1 1 1
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ PPID : int [1:5] 1 2 3 4 5
# ..$ gender: Factor w/ 2 levels "female","male": 2 1 2 2 1
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ PPID : int [1:3] 1 2 3
# ..$ gender: Factor w/ 2 levels "female","male": 2 1 2
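For completeness, the filtering can also be done in one step in base R with Filter(), which avoids the NULL placeholders. The sketch below assumes that "has both genders" means exactly two distinct values in gender. Because it relies on unique() rather than levels(), it should behave the same whether gender is stored as a factor or as character.

```r
l <- list(data.frame(PPID = 1:5, gender = rep("male", times = 5)),
          data.frame(PPID = 1:5, gender = c("male", "female", "male", "male", "female")),
          data.frame(PPID = 1:3, gender = c("male", "female", "male")))

# Keep only the data frames whose gender column has two distinct values
ll <- Filter(function(d) length(unique(d$gender)) == 2, l)
length(ll)
```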

cannot replace factor variable within a list r

I have the following list of dataframes structure:
str(mylist)
List of 2
$ L1 :'data.frame': 12471 obs. of 3 variables:
...$ colA : Date[1:12471], format: "2006-10-10" "2010-06-21" ...
...$ colB : int [1:12471], 62 42 55 12 78 ...
...$ colC : Factor w/ 3 levels "type1","type2","type3",..: 1 2 3 2 2 ...
I would like to replace type1 or type2 with a new factor level, type4.
I have tried:
mylist <- lapply(mylist, transform, colC =
replace(colC, colC == 'type1','type4'))
Warning message:
1: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
I do not want to read in my initial data with stringsAsFactors = FALSE, but I have tried adding type4 as a level in my initial dataset (before splitting into a list of dataframes) using:
levels(mydf$colC) <- c(levels(mydf$colC), "type4")
but I still get the same warning when trying to replace.
How do I tell replace that type4 is to be treated as a factor level?
You can try rebuilding the factor with an explicit levels argument, such as
status <- factor(status, ordered=TRUE, levels=c("1", "3", "2",...))
Here c("1", "3", "2",...) stands for your own set of levels, which would include type4.
As you state, the crucial thing is to add the new factor level.
## Test data:
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)
Your data has three factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
Now add the fourth factor level, then your replace command should work:
## Change levels:
for (ii in seq(along = mylist)) levels(mylist[[ii]]$colC) <-
c(levels(mylist[[ii]]$colC), "type4")
## Replace level:
mylist <- lapply(mylist, transform, colC = replace(colC,
colC == 'type1','type4'))
The new data has four factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
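An alternative that avoids replace() altogether is to rename the levels themselves. Assigning the same name to several levels merges them, so no new level has to be added first. A minimal sketch on the same test data, assuming both type1 and type2 should become type4:

```r
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)

mylist <- lapply(mylist, function(d) {
  lv <- levels(d$colC)
  # Rename type1 and type2 to type4; duplicated names merge the levels
  lv[lv %in% c("type1", "type2")] <- "type4"
  levels(d$colC) <- lv
  d
})

levels(mylist[[1]]$colC)
```

Unlike the replace() approach, this leaves no unused type1 level behind.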

How to put different size vectors in data.table column

I have implemented a simple group-by operation with the stats::aggregate function. It collects the elements of each group in a vector. I would like to make it faster using the data.table package, but I'm not able to reproduce the desired behaviour with data.table.
Sample dataset:
df <- data.frame(group = c("a","a","a","b","b","b","b","c","c"), val = c("A","B","C","A","B","C","D","A","B"))
Output to reproduce with data.table:
by_group_aggregate <- aggregate(x = df$val, by = list(df$group), FUN = c)
What I've tried:
data_t <- data.table(df)
# working, but not what I want
by_group_datatable <- data_t[,j = paste(val,collapse=","), by = group]
# no grouping done when using c or as.vector
by_group_datatable <- data_t[,j = c(val), by = group]
by_group_datatable <- data_t[,j = as.vector(val), by = group]
# grouping leads to error when using as.list
by_group_datatable <- data_t[,j = as.list(val), by = group]
Is it possible to have vectors of different size in a data.table column? If yes, how do I achieve it?
Here's one way:
data_t[, list(list(val)), by = group]
# group V1
#1: a A,B,C
#2: b A,B,C,D
#3: c A,B
The outer list() is how data.table builds the result columns in j. The inner list(val) wraps each group's val vector in a list, so the resulting column V1 is a list column that holds one vector per group, and those vectors can have different lengths.
To check the structure:
str(data_t[, list(list(val)), by = group])
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ V1 :List of 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
# - attr(*, ".internal.selfref")=<externalptr>
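To illustrate how such a list column can be used afterwards, here is a small sketch. It assumes the data.table package is installed, builds the table with character columns rather than factors, and the column name vals is chosen only for this example.

```r
library(data.table)

dt <- data.table(group = c("a", "a", "a", "b", "b", "b", "b", "c", "c"),
                 val   = c("A", "B", "C", "A", "B", "C", "D", "A", "B"))

# .() is data.table's alias for list(); naming the column avoids the default V1
res <- dt[, .(vals = list(val)), by = group]

# Number of elements stored per group
n_per_group <- lengths(res$vals)

# Extract the vector stored for group "b" (second row)
b_vals <- res$vals[[2]]
```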
Using dplyr, you could do the following:
library(dplyr)
df %>% group_by(group) %>% summarise(val = list(val))
#Source: local data frame [3 x 2]
#
# group val
# (fctr) (chr)
#1 a <S3:factor>
#2 b <S3:factor>
#3 c <S3:factor>
Check the structure:
df %>% group_by(group) %>% summarise(val = list(val)) %>% str
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ val :List of 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
Here is another option with dplyr/tidyr:
library(dplyr)
library(tidyr)
res <- df %>%
nest(-group)
str(res)
#'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ data :List of 3
# ..$ :'data.frame': 3 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ :'data.frame': 4 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ :'data.frame': 2 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2

How to drop unused levels after filtering by factor? [duplicate]

Here is an example that was taken from a fellow SO member.
# load dplyr for filter()
library(dplyr)
# data
f <- c("a","a","a","b","b","c")
s <- c("fall","spring","other", "fall", "other", "other")
v <- c(3,5,1,4,5,2)
(dat0 <- data.frame(f, s, v))
# f s v
#1 a fall 3
#2 a spring 5
#3 a other 1
#4 b fall 4
#5 b other 5
#6 c other 2
(sp.tmp <- filter(dat0, s == "spring"))
# f s v
#1 a spring 5
(str(sp.tmp))
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 3 levels "a","b","c": 1
# $ s: Factor w/ 3 levels "fall","other",..: 3
# $ v: num 5
The df resulting from filter() has retained all the levels from the original df.
What would be the recommended way to drop the unused level(s), i.e. "fall" and "other", within the dplyr framework?
You could do something like:
dat1 <- dat0 %>%
filter(s == "spring") %>%
droplevels()
Then
str(dat1)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
You could use droplevels:
sp.tmp <- droplevels(sp.tmp)
str(sp.tmp)
#'data.frame': 1 obs. of 3 variables:
# $ f: Factor w/ 1 level "a": 1
# $ s: Factor w/ 1 level "spring": 1
# $ v: num 5
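droplevels() also works on a single factor, which may help when only some columns should lose their unused levels. A small base R sketch on the same data, with stringsAsFactors = TRUE set explicitly so that f and s are factors regardless of the R version:

```r
dat0 <- data.frame(f = c("a", "a", "a", "b", "b", "c"),
                   s = c("fall", "spring", "other", "fall", "other", "other"),
                   v = c(3, 5, 1, 4, 5, 2),
                   stringsAsFactors = TRUE)

sp <- dat0[dat0$s == "spring", ]
levels_before <- nlevels(sp$s)  # subsetting keeps all original levels

# Drop unused levels in column s only; column f is left untouched
sp$s <- droplevels(sp$s)
levels_after <- nlevels(sp$s)
```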
