How to handle blank items when converting dates in R

I have a csv download of data from a Management Information system. There are some variables which are dates and are written in the csv as strings of the format "2012/11/16 00:00:00".
After reading in the csv file, I convert the date variables into a date using the function as.Date(). This works fine for all variables that do not contain any blank items.
For those which do contain blank items I get the following error message:
"character string is not in a standard unambiguous format"
How can I get R to replace blank items with something like "0000/00/00 00:00:00" so that the as.Date() function does not break? Are there other approaches you might recommend?

If they're strings, does something as simple as
mystr <- c("2012/11/16 00:00:00"," ","")
mystr[grepl("^ *$",mystr)] <- NA
as.Date(mystr)
work? The regular expression "^ *$" matches strings that consist of the start of the string (^), zero or more spaces ( *), and then the end of the string ($). More generally, I think you could use "^[[:space:]]*$" to capture other kinds of whitespace (tabs etc.).
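As a minimal sketch of the same idea with the broader whitespace pattern: the helper name blank_to_date is hypothetical, and passing an explicit format to as.Date() is an assumption added here so that unparseable entries come back as NA instead of stopping with an error.

```r
# Hypothetical helper: mark whitespace-only strings as missing, then parse.
blank_to_date <- function(x, format = "%Y/%m/%d") {
  # "^[[:space:]]*$" matches empty strings and strings made only of
  # spaces, tabs, or other whitespace.
  x[grepl("^[[:space:]]*$", x)] <- NA_character_
  # With an explicit format, the trailing " 00:00:00" is ignored.
  as.Date(x, format = format)
}

mystr <- c("2012/11/16 00:00:00", " ", "", "\t")
dates <- blank_to_date(mystr)
print(dates)
```

The blank entries stay in place as NA, so the result has the same length as the input and can be assigned straight back into a data frame column.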

Even better, have the NAs correctly inserted when you read in the CSV:
read.csv(..., na.strings='')
or to specify a vector of all the values which should be read as NA...
read.csv(..., na.strings=c('',' ',' '))
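A small end-to-end sketch of the na.strings approach, using an inline CSV in place of the real download. The column names id and date are made up for illustration, and the explicit format passed to as.Date() is an extra safeguard assumed here, not part of the answer above.

```r
# Hypothetical inline CSV standing in for the real download.
csv_text <- "id,date\n1,2012/11/16 00:00:00\n2,\n3, \n"

# Empty and single-space fields are read as NA.
df <- read.csv(text = csv_text,
               na.strings = c("", " "),
               stringsAsFactors = FALSE)

# NA entries stay NA after conversion; the time part is ignored.
df$date <- as.Date(df$date, format = "%Y/%m/%d")
print(df)
```

Handling the blanks at read time keeps the cleanup in one place, which helps when several date columns in the same file have missing values.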

Related

read.csv Fails Due to Rows With A Trailing Comma

I am reading from an API into a CSV file.
I then use R to perform calculations on that data. I am using read.csv to read the data into R.
In a few cases, the last column of a row has a blank value so the row ends in a comma.
This causes read.csv to fail.
Short of writing a script to fix the file, is there any way to read the CSV with a row or rows ending with a trailing comma?
I see what I did wrong. Some of my CSV fields are enclosed in double quotes, however I failed to define a quote character in my read.csv statement.
Here is my corrected statement:
MyData <<- read.csv(file="myfile.csv", header=TRUE, stringsAsFactors=FALSE, sep=",", quote="\"")
Note that the quote parameter is escaped with a backslash.
Thanks to all.
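For illustration, here is a sketch with made-up data (the columns name, city, and note are hypothetical) that combines double-quoted fields containing commas with a row that ends in a trailing comma. It suggests that the trailing comma by itself is read as an empty last field once the quote character is set.

```r
# Hypothetical data: quoted fields with embedded commas, and a last row
# whose final field is blank (the line ends in a comma).
csv_lines <- c(
  'name,city,note',
  '"Smith, J",Boston,ok',
  '"Lee, K",Austin,'
)

my_data <- read.csv(text = csv_lines, header = TRUE,
                    stringsAsFactors = FALSE, sep = ",", quote = "\"")
print(my_data)
```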

How to convert special characters into unicode in R?

When doing some textual data cleaning in R, I found some special characters. To get rid of them, I need to know their Unicode code points; for example, € is \u20AC. Is there a function that takes a string containing the special character as input and lets me "see" its Unicode code points?
Referring to Cath's comment, iconv can do the job:
iconv("é", toRaw = TRUE)
Then, you may want to unlist and paste with \u00.
special_char <- "%"
Unicode::as.u_char(utf8ToInt(special_char))
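If installing the Unicode package is not an option, a base-R sketch along the same lines might look like this. The helper name code_points is hypothetical; it relies only on utf8ToInt() and sprintf().

```r
# Hypothetical helper: one "U+XXXX" label per character of a single string.
code_points <- function(s) {
  # utf8ToInt() returns the integer code point of each character.
  sprintf("U+%04X", utf8ToInt(s))
}

print(code_points("\u20ac"))   # the euro sign, written as an escape
print(code_points("a\u20ac"))  # one label per character
```

Note that utf8ToInt() expects a single string, so a vector of strings would need lapply() or similar.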

Subsetting different length strings by spaces in R

In R, I currently have a long vector of dates and times saved as strings. Depending on the date, a string can be 16, 17, or 18 characters long, so I cannot just take the first 8 or 10 characters, since that would not work for every date. But since there is a space between the date and time values, how can I subset each string so that I only get the characters before the space?
Just to show what the strings look like now, here are a couple of examples:
"4/18/1950 0:00:00"
"6/8/1951 0:00:00"
"11/15/1951 0:00:00"
I'm not sure if you are familiar with regular expressions, if not you should learn as they are extremely useful:
tutorial
As akrun pointed out you can use the "sub" command to remove the space and everything after it like this:
sub(" .*","",stringVar)
First argument is the regular expression code which matches the space and everything that follows.
Second argument is what you want to replace the match with, in this case nothing
Third argument is the input string
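Alternatively, you can split each string at the space with "strsplit" and keep the first piece. Note that strsplit returns a list, so you need to index into each element rather than into the list itself:
sapply(strsplit(stringVar," "), `[`, 1)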
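Both approaches can be checked against the example strings from the question. This is a sketch: the variable names are made up, and the final as.Date() step with the "%m/%d/%Y" format is an assumption about what the date part is needed for.

```r
stringVar <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00", "11/15/1951 0:00:00")

# Approach 1: delete the first space and everything after it.
by_sub <- sub(" .*", "", stringVar)

# Approach 2: split at the space and keep the first piece of each element.
by_split <- vapply(strsplit(stringVar, " ", fixed = TRUE),
                   function(parts) parts[1], character(1))

print(by_sub)
print(identical(by_sub, by_split))

# Optional: convert the extracted month/day/year text to Date objects.
dates <- as.Date(by_sub, format = "%m/%d/%Y")
print(dates)
```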

What does the "More Columns than Column Names" error mean?

I'm trying to read in a .csv file from the IRS and it doesn't appear to be formatted in any weird way.
I'm using the read.table() function, which I have used several times in the past but it isn't working this time; instead, I get this error:
data_0910<-read.table("/Users/blahblahblah/countyinflow0910.csv",header=T,stringsAsFactors=FALSE,colClasses="character")
Error in read.table("/Users/blahblahblah/countyinflow0910.csv", :
more columns than column names
Why is it doing this?
For reference, the .csv files can be found at:
http://www.irs.gov/uac/SOI-Tax-Stats-County-to-County-Migration-Data-Files
(The ones I need are under the county to county migration .csv section - either inflow or outflow.)
It uses commas as separators. So you can either set sep="," or just use read.csv:
x <- read.csv(file="http://www.irs.gov/file_source/pub/irs-soi/countyinflow1011.csv")
dim(x)
## [1] 113593 9
The error is caused by spaces in some of the values, and unmatched quotes. There are no spaces in the header, so read.table thinks that there is one column. Then it thinks it sees multiple columns in some of the rows. For example, the first two lines (header and first row):
State_Code_Dest,County_Code_Dest,State_Code_Origin,County_Code_Origin,State_Abbrv,County_Name,Return_Num,Exmpt_Num,Aggr_AGI
00,000,96,000,US,Total Mig - US & For,6973489,12948316,303495582
And unmatched quotes, for example on line 1336 (row 1335) which will confuse read.table with the default quote argument (but not read.csv):
01,089,24,033,MD,Prince George's County,13,30,1040
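To make this concrete, here is a cut-down sketch with three made-up columns in place of the real nine. It calls read.table with the separator set to a comma and the quote set to the double quote only, which are the settings read.csv uses, so that values containing spaces or an apostrophe are read as single fields.

```r
# Shortened, hypothetical version of the file: a value with spaces and
# a value with an unmatched apostrophe.
csv_lines <- c(
  "State_Code,County_Name,Return_Num",
  "00,Total Mig - US & For,6973489",
  "01,Prince George's County,13"
)

x <- read.table(text = csv_lines, header = TRUE,
                sep = ",",      # split on commas, not on whitespace
                quote = "\"",   # do not treat ' as a quote character
                colClasses = "character", stringsAsFactors = FALSE)
print(dim(x))
print(x$County_Name)
```

Reading everything as character, as in the question, also keeps the leading zeros in the code columns.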
You may have strange characters in your header: #, %, -- or ,
For the Germans:
You have to change your decimal commas into full stops in your csv file (in Excel: File -> Options -> Advanced -> "Decimal separator"); then the error is solved.
Depending on the data (e.g. a tsv extension), it may use tabs as separators, so you may try sep = '\t' with read.csv.
This error can get thrown if your data frame has sf geometry columns.

Replacing all occurrences of a pattern in a string

I am used to running R with numbers and matrices, but when it comes to working with strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
>my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the last two characters (\").
The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
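A short sketch of that suggestion applied to the value from the question. The fixed = TRUE argument and the follow-up conversion to a date-time (with the time zone assumed to be UTC) are additions for illustration.

```r
my.time.char <- "\"2011-10-05 15:55:00\""

# gsub() replaces every match, so both embedded quotes are removed.
cleaned <- gsub("\"", "", my.time.char, fixed = TRUE)
print(cleaned)

# Optional: parse the cleaned string; the time zone is an assumption.
parsed <- as.POSIXct(cleaned, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(parsed)
```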
