I have multiple data frames (Accident, Vehicles and Casualties) that are to be merged into a single data frame, Accidents. How do I find which columns of the combined data frame are factors?
$ accident_severity : char "Serious" "Slight" "Slight" "Slight" ...
$ number_of_vehicles : int 1 1 2 2 1 1 2 2 2 2 ...
$ number_of_casualties : int 1 1 1 1 1 1 1 1 1 1 ...
$ date : char "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005" ...
$ day_of_week : char "Tuesday" "Wednesday" "Thursday" "Thursday" ...
$ time : char "17:42" "17:36" "00:15" "00:15" ...
You can convert selected columns from character to factor using the lapply function. The code below converts the columns accident_severity and day_of_week:
df <- data.frame(accident_severity= c("Serious", "Slight", "Slight", "Slight"),
number_of_vehicles = c(1, 1, 2, 2),
number_of_casualties = c(1, 1, 1, 1),
date = c("04/01/2005", "05/01/2005", "06/01/2005", "06/01/2005"),
day_of_week = c("Tuesday", "Wednesday", "Thursday", "Thursday"),
time = c("17:42", "17:36", "00:15", "00:15"),
stringsAsFactors = FALSE)
str(df)
# 'data.frame': 4 obs. of 6 variables:
# $ accident_severity : chr "Serious" "Slight" "Slight" "Slight"
# $ number_of_vehicles : num 1 1 2 2
# $ number_of_casualties: num 1 1 1 1
# $ date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005"
# $ day_of_week : chr "Tuesday" "Wednesday" "Thursday" "Thursday"
# $ time : chr "17:42" "17:36" "00:15" "00:15"
df[c("accident_severity", "day_of_week")] <- lapply(df[c("accident_severity", "day_of_week")], factor)
str(df)
# 'data.frame': 4 obs. of 6 variables:
# $ accident_severity : Factor w/ 2 levels "Serious","Slight": 1 2 2 2
# $ number_of_vehicles : num 1 1 2 2
# $ number_of_casualties: num 1 1 1 1
# $ date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005"
# $ day_of_week : Factor w/ 3 levels "Thursday","Tuesday",..: 2 3 1 1
# $ time : chr "17:42" "17:36" "00:15" "00:15"
To find which columns are factors, apply the is.factor function to each column and keep the matching names:
names(df)[unlist(lapply(df, is.factor))]
# [1] "accident_severity" "day_of_week"
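The question also mentions merging the three tables first. A minimal sketch of that step, assuming the tables share a key column: the name accident_index and all toy values below are hypothetical and not taken from the original data.

```r
# Toy stand-ins for the three tables; `accident_index` is an assumed key column.
Accident <- data.frame(accident_index = c("A1", "A2"),
                       accident_severity = factor(c("Serious", "Slight")),
                       stringsAsFactors = FALSE)
Vehicles <- data.frame(accident_index = c("A1", "A2", "A2"),
                       vehicle_type = factor(c("Car", "Car", "Van")),
                       stringsAsFactors = FALSE)
Casualties <- data.frame(accident_index = c("A1", "A2"),
                         casualty_age = c(34L, 51L),
                         stringsAsFactors = FALSE)

# Merge the tables one after another on the shared key.
Accidents <- Reduce(function(x, y) merge(x, y, by = "accident_index"),
                    list(Accident, Vehicles, Casualties))

# Names of the factor columns in the merged data frame, and their levels.
factor_cols <- names(Filter(is.factor, Accidents))
factor_levels <- lapply(Accidents[factor_cols], levels)
factor_cols
```

Filter(is.factor, Accidents) keeps only the factor columns, so the same idiom works regardless of how many tables were merged.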
Related
I have a data set named "Second".
It has the columns "q1", "q2", ..., "q40" and "q40_n1", "q40_n2", "q40_n3", ..., "q40_n20".
Some of them are character vectors and some are integer vectors.
My question is: how can I convert several integer columns to factor at once?
q30:q35 to "factor" ------- (q(30+n))
q40_n1:q40_n4 to "factor" ---------(q40_n#)
q18:q23 to "factor"
With the dplyr package:
mutate_at(Second, vars(q30:q35, q40_n1:q40_n4, q18:q23), factor)
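For comparison, here is a base R sketch of the same idea that does not need dplyr. The toy version of Second below is an assumption (the real data has many more columns); only the column-naming pattern is taken from the question. Note also that recent dplyr releases mark mutate_at() as superseded in favour of across().

```r
# Toy stand-in for `Second`; values are made up for illustration.
Second <- data.frame(q1 = c("x", "y", "z"),
                     q2 = c(1L, 2L, 1L),
                     q3 = c(3L, 3L, 4L),
                     q40_n1 = c(0L, 1L, 0L),
                     q40_n2 = c(5L, 6L, 7L),
                     stringsAsFactors = FALSE)

# Build the target column names from their numeric suffixes.
to_factor <- c(paste0("q", 2:3), paste0("q40_n", 1:2))

# Convert all selected columns in a single assignment.
Second[to_factor] <- lapply(Second[to_factor], factor)

sapply(Second, class)
```

For the ranges in the question, the name vector would be built the same way, for example paste0("q", 30:35).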
You can control all columns on read-in using the colClasses= argument:
str(read.csv(text="a,b\na,1"))
# 'data.frame': 1 obs. of 2 variables:
# $ a: Factor w/ 1 level "a": 1
# $ b: int 1
str(read.csv(text="a,b\na,1", colClasses="factor"))
# 'data.frame': 1 obs. of 2 variables:
# $ a: Factor w/ 1 level "a": 1
# $ b: Factor w/ 1 level "1": 1
str(read.csv(text="a,b\na,1", colClasses="character"))
# 'data.frame': 1 obs. of 2 variables:
# $ a: chr "a"
# $ b: chr "1"
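colClasses also accepts a named vector, which lets you set the class per column instead of for all columns at once. A small sketch using the same toy CSV text as above:

```r
# Column names `a` and `b` come from the toy CSV text.
# Named entries are matched against the column names.
dat <- read.csv(text = "a,b\na,1",
                colClasses = c(a = "character", b = "factor"))
str(dat)
```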
Or you can factorize it later:
dat <- read.csv(text="a,b\na,11")
str(dat)
# 'data.frame': 1 obs. of 2 variables:
# $ a: Factor w/ 1 level "a": 1
# $ b: int 11
dat$b <- factor(dat$b)
str(dat)
# 'data.frame': 1 obs. of 2 variables:
# $ a: Factor w/ 1 level "a": 1
# $ b: Factor w/ 1 level "11": 1
### or all columns, without regard to original class
dat <- read.csv(text="a,b\na,11")
dat[] <- lapply(dat, factor)
str(dat)
# 'data.frame': 1 obs. of 2 variables:
# $ a: Factor w/ 1 level "a": 1
# $ b: Factor w/ 1 level "11": 1
A numeric or integer vector like:
x <- c(1, 2, 3)
> str(x)
num [1:3] 1 2 3
can be converted to a factor vector:
x <- as.factor(x)
> x
[1] 1 2 3
Levels: 1 2 3
> str(x)
Factor w/ 3 levels "1","2","3": 1 2 3
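One thing to keep in mind when going the other way: a factor stores integer codes plus labels, so converting it straight back to a number returns the codes, not the original values. A short sketch with made-up values:

```r
f <- factor(c(10, 20, 10))

# Direct conversion returns the internal level codes.
codes <- as.integer(f)

# Going through the labels recovers the original numbers.
values <- as.numeric(as.character(f))

# Equivalent: convert the levels once, then index by the factor.
values2 <- as.numeric(levels(f))[f]
```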
I would like to loop through columns in a data set and use the name of the column to aggregate the data set. However, I am getting an error when I try to feed through the column name into the aggregate function:
"Error in model.frame.default(formula = cbind(SurveyID) ~ Panel + Category + :
variable lengths differ (found for 'i')"
Once I can store this in a temp file, I will append the temp file to a permanent dataset; however, I can't get past this part. Any help would be much appreciated!
#example of my data:
df <- data.frame("SurveyID" = c('A','B','C','D'), "Panel" = c('E','E','S','S'), "Category" = c(1,1,2,3),
"ENG" = c(3,3,1,2), "PAR" = c(3,1,1,2), "REL" = c(3,1,1,2), "CLC"= c(3,1,1,2))
#for loop to get column name to include as part of the aggregate function
for (i in colnames(df[4:7])) {
print (i)
temp <- data.frame(setNames(aggregate(cbind(SurveyID) ~ Panel + Category + i, data = df, FUN = length), c("Panel","GENDER", "Favlev", "Cnt")))
}
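You are making one newbie mistake and one more sophisticated mistake:
Newbie mistake: failing to index successive items on assignment, so each iteration overwrites the result of the previous one.
Less obvious mistake: improper construction of the formula object. Inside a formula, i is looked up as a variable named i (a length-one string), not as the column it names, which is why the variable lengths differ. Build the formula from a string with as.formula: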
temp=list() # need empty list with a name
for (i in colnames(df[4:7])) {
print (i); form <- as.formula( paste( "SurveyID ~ Panel + Category +", i) )
temp[[i]] <- data.frame(setNames(aggregate(form, data = df, FUN = length), c("Panel","GENDER", "Favlev", "Cnt")))
}
#Output
[1] "ENG"
[1] "PAR"
[1] "REL"
[1] "CLC"
str(temp)
#----------------
List of 4
$ ENG:'data.frame': 3 obs. of 4 variables:
..$ Panel : Factor w/ 2 levels "E","S": 2 2 1
..$ GENDER: num [1:3] 2 3 1
..$ Favlev: num [1:3] 1 2 3
..$ Cnt : int [1:3] 1 1 2
$ PAR:'data.frame': 4 obs. of 4 variables:
..$ Panel : Factor w/ 2 levels "E","S": 1 2 2 1
..$ GENDER: num [1:4] 1 2 3 1
..$ Favlev: num [1:4] 1 1 2 3
..$ Cnt : int [1:4] 1 1 1 1
$ REL:'data.frame': 4 obs. of 4 variables:
..$ Panel : Factor w/ 2 levels "E","S": 1 2 2 1
..$ GENDER: num [1:4] 1 2 3 1
..$ Favlev: num [1:4] 1 1 2 3
..$ Cnt : int [1:4] 1 1 1 1
$ CLC:'data.frame': 4 obs. of 4 variables:
..$ Panel : Factor w/ 2 levels "E","S": 1 2 2 1
..$ GENDER: num [1:4] 1 2 3 1
..$ Favlev: num [1:4] 1 1 2 3
..$ Cnt : int [1:4] 1 1 1 1
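As a variation on the loop above, the formula can also be built with reformulate() from the stats package, and the per-column results can be stacked into one data frame. This is only a sketch: the extra Item column and the result names are my own choices, not part of the original answer.

```r
df <- data.frame(SurveyID = c("A", "B", "C", "D"),
                 Panel = c("E", "E", "S", "S"),
                 Category = c(1, 1, 2, 3),
                 ENG = c(3, 3, 1, 2), PAR = c(3, 1, 1, 2),
                 REL = c(3, 1, 1, 2), CLC = c(3, 1, 1, 2))

counts <- lapply(names(df)[4:7], function(v) {
  # Builds SurveyID ~ Panel + Category + <v> without pasting strings.
  form <- reformulate(c("Panel", "Category", v), response = "SurveyID")
  out <- aggregate(form, data = df, FUN = length)
  names(out) <- c("Panel", "Category", "Favlev", "Cnt")
  out$Item <- v  # remember which column was aggregated
  out
})

# Stack the four results into a single data frame.
all_counts <- do.call(rbind, counts)
```

Because every piece has the same column names, do.call(rbind, ...) can append them directly, which covers the "add to a permanent dataset" step mentioned in the question.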
I have the following data.frame. I need to create a sixth variable (SAT_NEWS) as follows: If in three of the four variables ($medwell_.) the respondent has answered "Very well" OR "Somewhat well", the value of the new variable is SAT, otherwise it is NON_SAT.
'data.frame': 41953 obs. of 5 variables:
$ trust_gov : Factor w/ 6 levels "A lot","Somewhat",..: 1 2 2 2 1 2 4 2 2 2 ...
$ medwell_accuracy: Factor w/ 7 levels "Very well","Somewhat well",..: 2 4 2 3 4 2 1 1 1 1 ...
$ medwell_leaders : Factor w/ 7 levels "Very well","Somewhat well",..: 2 3 2 4 4 3 1 2 1 1 ...
$ medwell_unbiased: Factor w/ 7 levels "Very well","Somewhat well",..: 4 4 2 4 3 2 1 2 1 3 ...
$ medwell_coverage: Factor w/ 7 levels "Very well","Somewhat well",..: 2 4 1 3 3 2 1 1 2 3 ...
- attr(*, "variable.labels")= Named chr "ID. Respondent ID" "Survey" "Country" "QSPLIT. Split form A or B" ...
..- attr(*, "names")= chr "ID" "survey" "Country" "qsplit" ...
- attr(*, "codepage")= int 65001
Can you help me?
Unfortunately there is no %in% method for data frames, so some extra work is needed. With base R we may use
nm <- grep("medwell_", names(df))
num <- colSums(apply(df[, nm], 1, `%in%`, c("Very well", "Somewhat well")))
df$new <- ifelse(num == 3, "SAT", "NON_SAT")
while with dplyr (plus purrr, which provides map2_dfr) we have
df %>%
mutate(
new = ifelse(
select(., contains("medwell_")) %>%
map2_dfr(list(c("Very well", "Somewhat well")), `%in%`) %>%
rowSums() == 3, "SAT", "NON_SAT"
)
)
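Both versions above test for exactly three matching answers. If "three of the four" is meant as "at least three", the comparison would be >= 3 instead. A base R sketch on toy data, where the answer "Not too well" and all values are assumptions made for illustration:

```r
df <- data.frame(
  medwell_accuracy = c("Very well", "Not too well", "Somewhat well"),
  medwell_leaders  = c("Somewhat well", "Not too well", "Very well"),
  medwell_unbiased = c("Not too well", "Very well", "Very well"),
  medwell_coverage = c("Very well", "Somewhat well", "Very well"),
  stringsAsFactors = TRUE
)

good <- c("Very well", "Somewhat well")
nm <- grep("^medwell_", names(df))

# One logical column per variable, then count the TRUEs in each row.
hits <- rowSums(sapply(df[nm], function(col) col %in% good))

# Assumption: "three of the four" is read as "at least three".
df$SAT_NEWS <- ifelse(hits >= 3, "SAT", "NON_SAT")
```

The third toy respondent has four positive answers, so the choice between == 3 and >= 3 changes the result for that row.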
Here is a list of lists x, generated as follows:
list1 <- list(NULL, as.integer(0))
list2 <- list(NULL, as.integer(1))
list3 <- list(1:5, 0:4)
x <- list(a=list1, b=list2, c=list3)
x has the following structure:
str(x)
List of 3
$ a:List of 2
..$ : NULL
..$ : int 0
$ b:List of 2
..$ : NULL
..$ : int 1
$ c:List of 2
..$ : int [1:5] 1 2 3 4 5
..$ : int [1:5] 0 1 2 3 4
I'm trying to convert it to a data frame. I first used
xc <- data.frame(lapply(x, as.numeric))
and got the following error:
Error in lapply(x, as.numeric) :
(list) object cannot be coerced to type 'double'
It only works with as.character as the function argument.
My goal is to reach the dataframe with the following structure:
str(xc)
'data.frame': 2 obs. of 3 variables:
$ a: int NA 0 ...
$ b: int NA 1 ...
$ c: int [1:5] 1 2 3 4 5 int [1:5] 0 1 2 3 4
I think the columns of the resulting data frame must be lists (the column type that can hold vectors of different lengths as well as NULL values).
Using the dplyr or data.table package is probably the easiest way.
You can then convert the result back to a base data.frame with as.data.frame:
library(data.table)
xc <- as.data.table(x)
or
library(dplyr)
xc <- as_tibble(x) # as_data_frame() is deprecated in recent versions
After converting to base data.frame, the result is the same:
as.data.frame(xc)
#> a b c
#> 1 NULL NULL 1, 2, 3, 4, 5
#> 2 0 1 0, 1, 2, 3, 4
The columns are lists:
str(as.data.frame(xc))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a:List of 2
#> ..$ : NULL
#> ..$ : int 0
#> $ b:List of 2
#> ..$ : NULL
#> ..$ : int 1
#> $ c:List of 2
#> ..$ : int 1 2 3 4 5
#> ..$ : int 0 1 2 3 4
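List columns can also be created in base R alone by wrapping each inner list in I(), which tells data.frame() to keep it as is. The sketch below additionally flattens column a to an integer vector with NA in place of NULL, which is closer to the structure the question asks for; the helper null_to_na is a hypothetical name introduced here.

```r
x <- list(a = list(NULL, 0L), b = list(NULL, 1L), c = list(1:5, 0:4))

# I() protects each inner list so it becomes a single list column.
xc <- do.call(data.frame, lapply(x, I))

# Hypothetical helper: turn a list of NULL / length-one integers
# into a plain integer vector, using NA for NULL.
null_to_na <- function(col) {
  vapply(col, function(e) if (is.null(e)) NA_integer_ else e, integer(1))
}
xc$a <- null_to_na(x$a)
```

Column c has to stay a list column, because its elements have length five.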
I want to convert variables into factors using apply():
a <- data.frame(x1 = rnorm(100),
x2 = sample(c("a","b"), 100, replace = T),
x3 = factor(c(rep("a",50) , rep("b",50))))
a2 <- apply(a, 2,as.factor)
apply(a2, 2,class)
results in:
x1 x2 x3
"character" "character" "character"
I don't understand why this results in character vectors instead of factor vectors.
apply converts your data.frame to a character matrix. Use lapply:
lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
In the second command, apply again coerces the result to a character matrix. Using lapply instead:
a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
For a quick look at the column classes you can simply use str:
str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
Additional explanation according to comments:
Why does the lapply work while apply doesn't?
The first thing apply does is convert its argument to a matrix, so apply(a, ...) is equivalent to apply(as.matrix(a), ...). As you can see, str(as.matrix(a)) gives you:
chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"
There are no more factors, so class returns "character" for all columns.
lapply works on columns, so it gives you what you want (it does something like class(a$column_name) for each column).
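The coercion is easy to check on a two-row data frame. This is a small sketch with made-up values; the column names mirror the ones used above.

```r
a <- data.frame(x1 = c(1.5, 2.5), x3 = factor(c("a", "b")))

# apply() works on as.matrix(a); mixing numbers and factors
# forces the whole matrix to character.
m <- as.matrix(a)
via_apply <- apply(a, 2, class)

# Column-wise functions see each column with its real class.
via_vapply <- vapply(a, class, character(1))
```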
The help page for apply explains why combining apply and as.factor doesn't work:
In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.
The help page for sapply explains why sapply with as.factor doesn't work either:
Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.
You never get a matrix of factors or a data.frame.
How to convert the output to a data.frame?
Simply use as.data.frame, as you wrote in the comment:
a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
'data.frame': 100 obs. of 3 variables:
$ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
$ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
$ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
If you want to replace only selected character columns with factors, there is a trick:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: chr "a" "b" "c" "d" ...
$ x2: chr "A" "B" "C" "D" ...
$ x3: chr "A" "B" "C" "D" ...
columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: chr "A" "B" "C" "D" ...
You can use the same approach to replace all columns:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
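A common middle ground is to convert only the columns that are currently character and leave everything else alone. A sketch, where the integer column n is a hypothetical addition to show that non-character columns are untouched:

```r
a3 <- data.frame(x1 = letters, n = 1:26, x3 = LETTERS,
                 stringsAsFactors = FALSE)

# Logical index of the character columns.
is_chr <- vapply(a3, is.character, logical(1))

# Replace just those columns with factors.
a3[is_chr] <- lapply(a3[is_chr], factor)

str(a3)
```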