Error in setting up and cleaning a dataframe R

Error in setting up and cleaning a dataframe R - r

I am attempting to generate out of sample predictions and am getting this message after running the following code Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied.
I checked the str to verify that the two variables I am using are both numeric and they appear to be. I did a bunch of hunting around on here and think this might be somewhat related, but I haven't been able to get the suggestions to work.
Here is the code I have so far.
library(foreign)
library(plyr)
library(rvest)
library(stringi)
library(purrr)
library(XLConnect)
library(splitstackshape)
library(tidyr)
library(dplyr)
donner_raw <- read.csv("donner.txt", sep="\t", header = FALSE)
colnames(donner_raw) <- c("age_gen", "survive")
donner_raw <- separate(donner_raw, age_gen, into = c("age", "gender"), "(?<=\\d)(?=[A-Za-z])")
logit <- glm(survive ~ age + dummygen,family=binomial(link='logit'),data=donner_raw)
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
I'm not sure where in the process of what I've done I messed up. Here is a reproducible part of the data.
dput(droplevels(head(donner_raw)))
structure(list(age = structure(c(6L, 4L, 5L, 3L, 2L, 1L), .Label = c("13", "3", "4", "45", "6", "60"), class = "factor"), gender = c("M", "F", "F", "F", "F", "F"), dummygen = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("age", "gender", "survive", "dummygen"), row.names = c(NA, 6L), class = "data.frame")

Let's simply read and think about the error message:
Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied
This error occurs after the line:
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
(Presumably, at least, because you're question isn't very clear about this and provides lots of code that doesn't seem related.)
So R is telling you that when the model was fit the variable dummygen was numeric, but now you've given it a factor.
So let's look:
str(newlogit)
'data.frame': 20 obs. of 2 variables:
$ age : num 1 1.26 1.53 1.79 2.05 ...
$ dummygen: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
Yep!
So your problem was that you inexplicably created the data frame newlogit by specifying:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
which clearly specifies that the variable dummygen is not going to be numeric. Just convert it back, or remove the quotes in the first place. For example:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen= 0)
or
newlogit$dummygen <- as.numeric(newlogit$dummygen)

Related

remove double quotes from factors in a dataframe

I got a dataframe to work on where I have a bunch of variables as factors in quotation marks like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 Levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function but that didn´t work. Probably because I don´t know what to insert as pattern? Would be great if somebody can solve this puzzle and maybe explain to me if the "\"\"x1\"\"" is the solution to this?
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!

vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
See how I would use ' as wrapper, when I want to use " inside a string.
Another problem it didn't work for you was probably because you didn't use the levels attribute but rather the factor variable itself.
Factor variables are internally stored as 1, 2, 3,... numbers.
As you now have provided data, you can use: (df1 being your data with the factor columns)
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})

Converting factors into numeric format with signs in R

Let, I have such a dataframe(df) where each elements are factors:
df
---
+100.5
+120.2
-30.0
+75.0
-600.3
How can I convert df into a numric df using R? I ill be very glad for any help. Thanks a lot.

The conversion from factors to numerical values is sometimes complicated, and I think that it is usually necessary to convert the factors first into characters, and then into numerical values.
This should work:
df_n <- as.data.frame(as.numeric(as.character(df[,1])))
colnames(df_n) <- "df_n"
head(df_n)
# df_n
#1 100.5
#2 120.2
#3 -30.0
#4 75.0
#5 -600.3
class(df_n[,1])
#[1] "numeric"
data
df <- structure(list(df = structure(c(4L, 5L, 2L, 3L, 1L),
.Label = c("-600.3", "-30", "75", "100.5", "120.2"),
class = "factor")), .Names = "df",
row.names = c(NA, -5L), class = "data.frame")
Hope this helps.

How to import a character - date field as POSIXlt class while importing the file itself in R? [duplicate]

Question:
Is there a way to specify the Date format when using the colClasses argument in read.table/read.csv?
(I realise I can convert after importing, but with many date columns like this, it would be easier to do it in the import step)
Example:
I have a .csv with date columns in the format %d/%m/%Y.
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
This gets the conversion wrong. For example, 15/07/2008 becomes 0015-07-20.
Reproducible code:
data <-
structure(list(func_loc = structure(c(1L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 5L), .Label = c("3076WAG0003", "3076WAG0004", "3076WAG0007",
"3076WAG0009", "3076WAG0010"), class = "factor"), order_type = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L), .Label = c("PM01", "PM02",
"PM03"), class = "factor"), actual_finish = structure(c(4L, 6L,
1L, 2L, 3L, 7L, 1L, 8L, 1L, 5L), .Label = c("", "11/03/2008",
"14/08/2008", "15/07/2008", "17/03/2008", "19/01/2009", "22/09/2008",
"6/09/2007"), class = "factor")), .Names = c("func_loc", "order_type",
"actual_finish"), row.names = c(NA, 10L), class = "data.frame")
write.csv(data,"data.csv", row.names = F)
dataImport <- read.csv("data.csv")
str(dataImport)
dataImport
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
dataImport
And here's what the output looks like:

You can write your own function that accepts a string and converts it to a Date using the format you want, then use the setAs to set it as an as method. Then you can use your function as part of the colClasses.
Try:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
tmp <- c("1, 15/08/2008", "2, 23/05/2010")
con <- textConnection(tmp)
tmp2 <- read.csv(con, colClasses=c('numeric','myDate'), header=FALSE)
str(tmp2)
Then modify if needed to work for your data.
Edit ---
You might want to run setClass('myDate') first to avoid the warning (you can ignore the warning, but it can get annoying if you do this a lot and this is a simple call that gets rid of it).

If there is only 1 date format you want to change, you could use the Defaults package to change the default format within as.Date.character
library(Defaults)
setDefaults('as.Date.character', format = '%d/%M/%Y')
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
## 'data.frame': 10 obs. of 3 variables:
## $ func_loc : Factor w/ 5 levels "3076WAG0003",..: 1 2 3 3 3 3 3 4 4 5
## $ order_type : Factor w/ 3 levels "PM01","PM02",..: 3 3 1 1 1 1 2 2 3 1
## $ actual_finish: Date, format: "2008-10-15" "2009-10-19" NA "2008-10-11" ...
I think #Greg Snow's answer is far better, as it does not change the default behaviour of an often used function.

In case you need time also:
setClass('yyyymmdd-hhmmss')
setAs("character","yyyymmdd-hhmmss", function(from) as.POSIXct(from, format="%Y%m%d-%H%M%S"))
d <- read.table(colClasses="yyyymmdd-hhmmss", text="20150711-130153")
str(d)
## 'data.frame': 1 obs. of 1 variable:
## $ V1: POSIXct, format: "2015-07-11 13:01:53"

A long time ago, in the meantime the problem has been solved by Hadley Wickham. So nowadays the solution is reduced to a oneliner:
library(readr)
data <- read_csv("data.csv",
col_types = cols(actual_finish = col_datetime(format = "%d/%m/%Y")))
Maybe we want even to get rid of unnecessary stuff:
data <- as.data.frame(data)

R: read.csv parse dates [duplicate]

Question:
Is there a way to specify the Date format when using the colClasses argument in read.table/read.csv?
(I realise I can convert after importing, but with many date columns like this, it would be easier to do it in the import step)
Example:
I have a .csv with date columns in the format %d/%m/%Y.
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
This gets the conversion wrong. For example, 15/07/2008 becomes 0015-07-20.
Reproducible code:
data <-
structure(list(func_loc = structure(c(1L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 5L), .Label = c("3076WAG0003", "3076WAG0004", "3076WAG0007",
"3076WAG0009", "3076WAG0010"), class = "factor"), order_type = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L), .Label = c("PM01", "PM02",
"PM03"), class = "factor"), actual_finish = structure(c(4L, 6L,
1L, 2L, 3L, 7L, 1L, 8L, 1L, 5L), .Label = c("", "11/03/2008",
"14/08/2008", "15/07/2008", "17/03/2008", "19/01/2009", "22/09/2008",
"6/09/2007"), class = "factor")), .Names = c("func_loc", "order_type",
"actual_finish"), row.names = c(NA, 10L), class = "data.frame")
write.csv(data,"data.csv", row.names = F)
dataImport <- read.csv("data.csv")
str(dataImport)
dataImport
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
dataImport
And here's what the output looks like:

You can write your own function that accepts a string and converts it to a Date using the format you want, then use the setAs to set it as an as method. Then you can use your function as part of the colClasses.
Try:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
tmp <- c("1, 15/08/2008", "2, 23/05/2010")
con <- textConnection(tmp)
tmp2 <- read.csv(con, colClasses=c('numeric','myDate'), header=FALSE)
str(tmp2)
Then modify if needed to work for your data.
Edit ---
You might want to run setClass('myDate') first to avoid the warning (you can ignore the warning, but it can get annoying if you do this a lot and this is a simple call that gets rid of it).

If there is only 1 date format you want to change, you could use the Defaults package to change the default format within as.Date.character
library(Defaults)
setDefaults('as.Date.character', format = '%d/%M/%Y')
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
## 'data.frame': 10 obs. of 3 variables:
## $ func_loc : Factor w/ 5 levels "3076WAG0003",..: 1 2 3 3 3 3 3 4 4 5
## $ order_type : Factor w/ 3 levels "PM01","PM02",..: 3 3 1 1 1 1 2 2 3 1
## $ actual_finish: Date, format: "2008-10-15" "2009-10-19" NA "2008-10-11" ...
I think #Greg Snow's answer is far better, as it does not change the default behaviour of an often used function.

In case you need time also:
setClass('yyyymmdd-hhmmss')
setAs("character","yyyymmdd-hhmmss", function(from) as.POSIXct(from, format="%Y%m%d-%H%M%S"))
d <- read.table(colClasses="yyyymmdd-hhmmss", text="20150711-130153")
str(d)
## 'data.frame': 1 obs. of 1 variable:
## $ V1: POSIXct, format: "2015-07-11 13:01:53"

A long time ago, in the meantime the problem has been solved by Hadley Wickham. So nowadays the solution is reduced to a oneliner:
library(readr)
data <- read_csv("data.csv",
col_types = cols(actual_finish = col_datetime(format = "%d/%m/%Y")))
Maybe we want even to get rid of unnecessary stuff:
data <- as.data.frame(data)

Specify custom Date format for colClasses argument in read.table/read.csv

Question:
Is there a way to specify the Date format when using the colClasses argument in read.table/read.csv?
(I realise I can convert after importing, but with many date columns like this, it would be easier to do it in the import step)
Example:
I have a .csv with date columns in the format %d/%m/%Y.
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
This gets the conversion wrong. For example, 15/07/2008 becomes 0015-07-20.
Reproducible code:
data <-
structure(list(func_loc = structure(c(1L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 5L), .Label = c("3076WAG0003", "3076WAG0004", "3076WAG0007",
"3076WAG0009", "3076WAG0010"), class = "factor"), order_type = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L), .Label = c("PM01", "PM02",
"PM03"), class = "factor"), actual_finish = structure(c(4L, 6L,
1L, 2L, 3L, 7L, 1L, 8L, 1L, 5L), .Label = c("", "11/03/2008",
"14/08/2008", "15/07/2008", "17/03/2008", "19/01/2009", "22/09/2008",
"6/09/2007"), class = "factor")), .Names = c("func_loc", "order_type",
"actual_finish"), row.names = c(NA, 10L), class = "data.frame")
write.csv(data,"data.csv", row.names = F)
dataImport <- read.csv("data.csv")
str(dataImport)
dataImport
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
dataImport
And here's what the output looks like:

You can write your own function that accepts a string and converts it to a Date using the format you want, then use the setAs to set it as an as method. Then you can use your function as part of the colClasses.
Try:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
tmp <- c("1, 15/08/2008", "2, 23/05/2010")
con <- textConnection(tmp)
tmp2 <- read.csv(con, colClasses=c('numeric','myDate'), header=FALSE)
str(tmp2)
Then modify if needed to work for your data.
Edit ---
You might want to run setClass('myDate') first to avoid the warning (you can ignore the warning, but it can get annoying if you do this a lot and this is a simple call that gets rid of it).

If there is only 1 date format you want to change, you could use the Defaults package to change the default format within as.Date.character
library(Defaults)
setDefaults('as.Date.character', format = '%d/%M/%Y')
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
## 'data.frame': 10 obs. of 3 variables:
## $ func_loc : Factor w/ 5 levels "3076WAG0003",..: 1 2 3 3 3 3 3 4 4 5
## $ order_type : Factor w/ 3 levels "PM01","PM02",..: 3 3 1 1 1 1 2 2 3 1
## $ actual_finish: Date, format: "2008-10-15" "2009-10-19" NA "2008-10-11" ...
I think #Greg Snow's answer is far better, as it does not change the default behaviour of an often used function.

In case you need time also:
setClass('yyyymmdd-hhmmss')
setAs("character","yyyymmdd-hhmmss", function(from) as.POSIXct(from, format="%Y%m%d-%H%M%S"))
d <- read.table(colClasses="yyyymmdd-hhmmss", text="20150711-130153")
str(d)
## 'data.frame': 1 obs. of 1 variable:
## $ V1: POSIXct, format: "2015-07-11 13:01:53"

A long time ago, in the meantime the problem has been solved by Hadley Wickham. So nowadays the solution is reduced to a oneliner:
library(readr)
data <- read_csv("data.csv",
col_types = cols(actual_finish = col_datetime(format = "%d/%m/%Y")))
Maybe we want even to get rid of unnecessary stuff:
data <- as.data.frame(data)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Error in setting up and cleaning a dataframe R - r

Related

remove double quotes from factors in a dataframe

Converting factors into numeric format with signs in R

How to import a character - date field as POSIXlt class while importing the file itself in R? [duplicate]

R: read.csv parse dates [duplicate]

Specify custom Date format for colClasses argument in read.table/read.csv

Categories

Resources