R: read.csv parse dates [duplicate]

Question:
Is there a way to specify the Date format when using the colClasses argument in read.table/read.csv?
(I realise I can convert after importing, but with many date columns like this, it would be easier to do it in the import step)
Example:
I have a .csv with date columns in the format %d/%m/%Y.
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
This gets the conversion wrong. For example, 15/07/2008 becomes 0015-07-20.
Reproducible code:
data <-
structure(list(func_loc = structure(c(1L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 5L), .Label = c("3076WAG0003", "3076WAG0004", "3076WAG0007",
"3076WAG0009", "3076WAG0010"), class = "factor"), order_type = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L), .Label = c("PM01", "PM02",
"PM03"), class = "factor"), actual_finish = structure(c(4L, 6L,
1L, 2L, 3L, 7L, 1L, 8L, 1L, 5L), .Label = c("", "11/03/2008",
"14/08/2008", "15/07/2008", "17/03/2008", "19/01/2009", "22/09/2008",
"6/09/2007"), class = "factor")), .Names = c("func_loc", "order_type",
"actual_finish"), row.names = c(NA, 10L), class = "data.frame")
write.csv(data, "data.csv", row.names = FALSE)
dataImport <- read.csv("data.csv")
str(dataImport)
dataImport
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
dataImport

You can write your own function that accepts a string and converts it to a Date using the format you want, then use setAs to register it as an as() method. You can then use the new class name in colClasses.
Try:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
tmp <- c("1, 15/08/2008", "2, 23/05/2010")
con <- textConnection(tmp)
tmp2 <- read.csv(con, colClasses=c('numeric','myDate'), header=FALSE)
str(tmp2)
Then modify if needed to work for your data.
Edit ---
You might want to run setClass('myDate') first to avoid the warning (the warning is harmless, but it gets annoying if you do this often, and this one simple call gets rid of it).
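Putting the two steps together on data shaped like the question's file, a minimal sketch might look like this. The inline csv_text sample is made up for illustration (three rows in the question's layout, the last with an empty date), and text= is used only to avoid needing a file on disk.

```r
library(methods)  # setClass() and setAs() live here

# Without a format, as.Date() guesses and misreads day/month/year input
# (the question reports 0015-07-20 for this value).
as.Date("15/07/2008")
as.Date("15/07/2008", format = "%d/%m/%Y")

# Register a placeholder class, then a character -> myDate coercion.
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%d/%m/%Y"))

# Illustrative sample in the question's layout; the last row has an empty date.
csv_text <- paste(
  "func_loc,order_type,actual_finish",
  "3076WAG0003,PM03,15/07/2008",
  "3076WAG0004,PM03,19/01/2009",
  "3076WAG0007,PM01,",
  sep = "\n"
)
dataImport <- read.csv(text = csv_text,
                       colClasses = c("factor", "factor", "myDate"))
str(dataImport)
```

The empty field is read as an empty string, which as.Date() turns into NA, so missing dates should not need special handling.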

If there is only one date format you want to change, you could use the Defaults package to change the default format within as.Date.character:
library(Defaults)
setDefaults('as.Date.character', format = '%d/%m/%Y')  # %m is the month; %M would be minutes
dataImport <- read.csv("data.csv", colClasses = c("factor","factor","Date"))
str(dataImport)
## 'data.frame': 10 obs. of 3 variables:
## $ func_loc : Factor w/ 5 levels "3076WAG0003",..: 1 2 3 3 3 3 3 4 4 5
## $ order_type : Factor w/ 3 levels "PM01","PM02",..: 3 3 1 1 1 1 2 2 3 1
## $ actual_finish: Date, format: "2008-07-15" "2009-01-19" NA "2008-03-11" ...
I think @Greg Snow's answer is far better, as it does not change the default behaviour of an often-used function.

If you also need the time:
setClass('yyyymmdd-hhmmss')
setAs("character","yyyymmdd-hhmmss", function(from) as.POSIXct(from, format="%Y%m%d-%H%M%S"))
d <- read.table(colClasses="yyyymmdd-hhmmss", text="20150711-130153")
str(d)
## 'data.frame': 1 obs. of 1 variable:
## $ V1: POSIXct, format: "2015-07-11 13:01:53"

In the meantime, this problem has been solved by Hadley Wickham's readr package, so nowadays the solution reduces to a one-liner:
library(readr)
# col_date() yields a Date column; col_datetime() would yield POSIXct instead
data <- read_csv("data.csv",
                 col_types = cols(actual_finish = col_date(format = "%d/%m/%Y")))
If you also want to drop the tibble classes and get back a plain data frame:
data <- as.data.frame(data)

Related

How to concatenate rows based on group as quickly as possible

I have a dataframe as follows
ClientVisitGUID LineNum TextCol
1 1 This was a great
1 2 report I did
2 3 was performed today
2 1 Another great report
2 2 for this person
3 2 good stuff
3 1 I really write very
3 3 when I put my
3 4 mind to it
I'd like to concatenate the rows by ClientVisitGUID, ordered by line number, so I can get the following output:
ClientVisitGUID TextCol
1 This was a great report I did
2 Another great report for this person was performed today
3 I really write very good stuff when I put my mind to it
I tried dplyr, but it takes a long time and can't cope with the thousands of rows I have:
resultset2<-resultset %>%
group_by(ClientVisitGUID) %>%
arrange(LineNum) %>%
summarize_all(paste, collapse=",")
Is there a faster way? I'm not really familiar with data.table, but is it fast?
A second data.table option, also using stringi for its performance
library(data.table)
library(stringi)
setDT(df)
setkey(df, ClientVisitGUID, LineNum)
df1 <- df[, .(new = stri_c(TextCol, collapse = " ")), by = ClientVisitGUID]
Result
df1
# ClientVisitGUID new
#1: 1 This was a great report I did
#2: 2 Another great report for this person was performed today
#3: 3 I really write very good stuff when I put my mind to it
Data (thanks to @ThomasIsCoding)
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))
A base R option is to use aggregate:
result <- aggregate(TextCol~ClientVisitGUID,
df[order(df$ClientVisitGUID,df$LineNum),],
paste0,
collapse = " ")
which gives
> result
ClientVisitGUID TextCol
1 1 This was a great report I did
2 2 Another great report for this person was performed today
3 3 I really write very good stuff when I put my mind to it
Data
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))
If you want speed, data.table is indeed a great candidate:
library(data.table)
setDT(resultset)
data.table::setkeyv(resultset, "ClientVisitGUID")
resultset <- resultset[order(ClientVisitGUID, LineNum)]
# paste every remaining column within each group (wrapping lapply() in .() would create a list column instead)
resultset[, lapply(.SD, paste, collapse = ","), by = "ClientVisitGUID"]
Setting the key takes some time up front, but operations afterwards are faster: setting a key reorders the rows so that rows belonging to the same group sit in contiguous memory.
Example
data = data.table("a" = c("aaa","ffff","ttt"), "b" = c(1,1,2))
data[, lapply(.SD, paste, collapse = ","), by = "b"]
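For the question's actual data, a minimal sketch could collapse only TextCol and use a space as the separator, matching the desired output shown in the question. The use of setorder() and the name result are choices made for this illustration, not part of the answer above.

```r
library(data.table)

# The question's sample data, rebuilt as a data.table.
df <- data.table(
  ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L),
  TextCol = c("This was a great", "report I did", "was performed today",
              "Another great report", "for this person", "good stuff",
              "I really write very", "when I put my", "mind to it")
)

# Sort by group and line number in place, then collapse TextCol per group.
setorder(df, ClientVisitGUID, LineNum)
result <- df[, .(TextCol = paste(TextCol, collapse = " ")), by = ClientVisitGUID]
print(result)
```

Naming the column explicitly keeps LineNum out of the result, whereas pasting over .SD would also concatenate the line numbers.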

remove double quotes from factors in a dataframe

I got a dataframe to work on where I have a bunch of variables as factors in quotation marks like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 Levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function, but that didn't work, probably because I don't know what to use as the pattern. It would be great if somebody could solve this puzzle and maybe explain whether "\"\"x1\"\"" is the pattern I need.
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!
vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
Note that I use ' as the string delimiter when I want to use " inside the string.
Another reason it didn't work for you is probably that you applied gsub() to the factor variable itself rather than to its levels attribute.
Factor variables are stored internally as integer codes (1, 2, 3, ...), with the labels kept in the levels attribute.
As you now have provided data, you can use: (df1 being your data with the factor columns)
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})
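To make the difference concrete, here is a small sketch on a plain data frame shaped like the asker's example (the tibble classes are dropped, and fixed = TRUE with a bare " pattern is a simplification chosen for this illustration). Calling gsub() on the factor itself returns a character vector, while editing the levels keeps the column a factor.

```r
# Data shaped like the asker's example, as a plain data frame of factors.
df1 <- data.frame(
  Sent = factor(rep('""Yes""', 5), levels = c('""Opted out""', '""Yes""')),
  Responded = factor(rep('""No""', 5),
                     levels = c('""Complete""', '""No""', '""Partial""'))
)

# gsub() on the factor itself coerces it to character: the factor is lost.
direct <- gsub('"', "", df1$Sent, fixed = TRUE)
class(direct)

# Editing the levels keeps the factor and its integer codes intact.
cleaned <- df1
cleaned[] <- lapply(cleaned, function(vec) {
  levels(vec) <- gsub('"', "", levels(vec), fixed = TRUE)
  vec
})
str(cleaned)
levels(cleaned$Responded)
```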

Error in setting up and cleaning a dataframe R

I am attempting to generate out-of-sample predictions and am getting this message after running the code below: Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied.
I checked the str to verify that the two variables I am using are both numeric and they appear to be. I did a bunch of hunting around on here and think this might be somewhat related, but I haven't been able to get the suggestions to work.
Here is the code I have so far.
library(foreign)
library(plyr)
library(rvest)
library(stringi)
library(purrr)
library(XLConnect)
library(splitstackshape)
library(tidyr)
library(dplyr)
donner_raw <- read.csv("donner.txt", sep="\t", header = FALSE)
colnames(donner_raw) <- c("age_gen", "survive")
donner_raw <- separate(donner_raw, age_gen, into = c("age", "gender"), "(?<=\\d)(?=[A-Za-z])")
logit <- glm(survive ~ age + dummygen,family=binomial(link='logit'),data=donner_raw)
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
I'm not sure where in the process of what I've done I messed up. Here is a reproducible part of the data.
dput(droplevels(head(donner_raw)))
structure(list(age = structure(c(6L, 4L, 5L, 3L, 2L, 1L), .Label = c("13", "3", "4", "45", "6", "60"), class = "factor"), gender = c("M", "F", "F", "F", "F", "F"), dummygen = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("age", "gender", "survive", "dummygen"), row.names = c(NA, 6L), class = "data.frame")
Let's simply read and think about the error message:
Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied
This error occurs after the line:
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
(Presumably, at least, because your question isn't very clear about this and provides lots of code that doesn't seem related.)
So R is telling you that when the model was fit the variable dummygen was numeric, but now you've given it a factor.
So let's look:
str(newlogit)
'data.frame': 20 obs. of 2 variables:
$ age : num 1 1.26 1.53 1.79 2.05 ...
$ dummygen: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
Yep!
So your problem was that you inexplicably created the data frame newlogit by specifying:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
which clearly specifies that the variable dummygen is not going to be numeric. Just convert it back, or remove the quotes in the first place. For example:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen= 0)
or
newlogit$dummygen <- as.numeric(as.character(newlogit$dummygen))  # as.numeric() alone would return the factor's integer code (1), not 0
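One pitfall when converting a factor back to numbers: as.numeric() applied directly to a factor returns its internal integer codes, not the numbers written in its labels. A small sketch (the vector f is made up for illustration):

```r
f <- factor(c("0", "1", "0"))
as.numeric(f)                # integer codes: 1 2 1
as.numeric(as.character(f))  # the labels as numbers: 0 1 0

# Applied to a prediction data frame like the one in the question:
newlogit <- data.frame(age = seq(1, 6, length.out = 20),
                       dummygen = factor("0"))
newlogit$dummygen <- as.numeric(as.character(newlogit$dummygen))
str(newlogit)
```

Going through as.character() also works when the column is already a character vector, which is what data.frame() produces for strings by default in R 4.0 and later.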
