Technique for finding bad data in read.csv in R

I am reading in a file of data that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file has around 20000 records. I use the following R code to read it in:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
And the reason I have quote="" is that without it the import stops prematurely and I end up with a total of 9569 observations. While I don't understand exactly why quote="" overcomes this problem, it seems to do so.
Except that it introduces other problems that I have to 'fix'. The first one I noticed is that the dates end up as strings that include the quotes, and they refuse to convert to actual dates when I use as.Date() on them.
Now I could fix the strings and hack my way through. But better to know more about what I am doing. Can someone explain:
Why does quote="" fix the 'bad data'?
What is a best-practice technique to figure out what is causing the read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).
Here are the lines 'near' the 'problem'. I don't see the damage, do you?
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
It's trickier than that. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. Therefore I surmise that something happened during the import that caused a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records are missing. When I have quote="" in there I get 15914.
So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
* UPDATE 2 *
@Dwin, I had to remove na.strings="\N" because the count.fields function doesn't permit it. With that, I get this output, which looks interesting but I don't understand it.
    3     4    22    23    24
    1    83 15466   178     4
Your second command produces a lot of data (and stops when max.print is reached). But the first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
I don't understand this if the output is supposed to show how many fields there are in each input record. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I'm getting closer, but I'm still confused!

The count.fields function can be very useful in identifying where to look for malformed data.
This gives a tabulation of fields per line while ignoring quoting (possibly a problem if there are embedded commas):
table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )
This gives a tabulation ignoring both quotes and "#" (octothorpe) as a comment character:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )
After seeing what you report for the first tabulation (most of which were 22 fields, as desired), you can get a list of the line positions with non-22 values (using the comma and no-quote settings):
which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
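Putting those pieces together, a short sketch (assuming the same file path and the expected 22 fields per line) that pulls the offending lines out for visual inspection:
# Count comma-separated fields per line, ignoring quotes, then print the
# raw text of any line that does not have the expected 22 fields.
# Note: count.fields skips blank lines by default, so this assumes the file has none.
n_fields  <- count.fields("~/Desktop/dbdump/users.txt", quote = "", sep = ",")
bad_lines <- which(n_fields != 22)
readLines("~/Desktop/dbdump/users.txt")[bad_lines]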

One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this be a problem also for other lines you have?
If I add the "missing" quote after John Smith, it reads fine.
I saved this data to data.txt:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
And this is the code. Both fread and read.csv work fine.
require(data.table)
dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1
dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2
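To check whether other lines share the same kind of damage, one possible check (a sketch, not part of the original answer) is to count the double quotes on each line of the real file; a well-formed line should have an even number:
# Count double-quote characters per line; an odd count suggests a missing
# opening or closing quote, as in the John Smith line.
lines    <- readLines("~/Desktop/dbdump/users.txt")
n_quotes <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))
which(n_quotes %% 2 != 0)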

Related

read.xlsx file with one column consisting of "numbers as text"

I have an Excel file that contains numeric variables, but the first column (the index column) uses custom formatting: those are numbers that should be presented as text (or something like text), always with a fixed number of digits, some of which are leading zeroes. In my example table from Excel, bad_col1 uses this custom format while the rest of the columns are formatted as numbers or general.
When I try to import my data using the read.xlsx function from either the openxlsx or the xlsx package, it produces something like this:
read.xlsx(file_dir,sheet=1)#for openxlsx
bad_col1 col2 col3
1 5 11 974
2 230 15 719
3 10250 6 944
4 2340 7 401
So as you can see, the zeroes are gone. Is there any way to read the 1st column as "text" and the others as numeric? I cannot convert it to text afterwards, because the leading zeroes are already gone. I can think of a workaround, but it would be more feasible for my project to have the column converted while importing.
Thank you in advance.
You can use a vector to specify the desired column types, with the readxl library:
library(readxl)
filter <- c('text','numeric','numeric')
the_file <- read_xlsx("sample.xlsx", col_types = filter)
Even better, you can skip columns by putting 'skip' in the desired position of your vector, which helps when you have many columns.
Regards
With this https://readxl.tidyverse.org/reference/read_excel.html you can use the col_types parameter so that the first column is read as character.
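For completeness, a minimal read_excel sketch (the file name sample.xlsx is just a placeholder) that forces the first column to text and lets readxl guess the remaining types:
library(readxl)
# "text" preserves the leading zeroes in bad_col1; "guess" lets readxl
# infer the types of col2 and col3.
dat <- read_excel("sample.xlsx", col_types = c("text", "guess", "guess"))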

R bad row data not shown when read to data.table, but written to file

Sample input: a tab-delimited text file. Note that there is bad data in this source file; the closing " for line 3 appears two lines down. So there is one completely blank line, followed by a line containing just the double-quote character, and then good data continues on the next line.
id ca cb cc cd
1 hi bye hey nope
2 ab cd ef "quoted text here"
3 gh ij kl "quoted text but end quote is 2 lines down
"
4 mn op qr lalalala
When I read this into R (I tried read.csv and fread, with and without blank.lines.skip = TRUE for fread), I get the following data table:
id ca cb cc cd
1 1 hi bye hey nope
2 2 ab cd ef quoted text here
3 3 gh ij kl quoted text but end quote is 2 lines down
4 4 mn op qr lalalala
The data table does not show the 'bad' lines. OK, good! However, when I go to write out this data table (I tried both write.table and fwrite), those two bad lines of nothing, plus the double quote, are written out just as they appear in the input file!
I've tried doing:
dt[complete.cases(dt), ]
dt[!apply(dt == "", 1, all), ]
to clear out empty data before writing, but they do nothing; the data table still shows only those 4 entries. Where is R keeping this 'missing' data? How can I clear it out?
I hope this is a one-off bad output from the source (good ol' US Govt!). I think they saved this from an xls file that had bad formatting in a column, which caused the mistake in the text file, but they obviously did not check the output.
After sitting back and thinking through the reading functions: because that column (cd) data is quoted, there are actually two newline characters at the end of the string, which are not shown when the data table element is printed. So writing out that element writes those two line breaks as well.
All I needed to do was:
dt$cd <- gsub("[\r\n]", "", dt$cd)
and that fixed it, the output written to file now has correct rows of data.
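For a more general cleanup (a sketch assuming the table is called dt, as above, and that other character columns might also carry embedded line breaks), you could strip carriage returns and newlines from every character column before writing out:
library(data.table)
# Find the character columns, remove any embedded \r or \n from each of
# them by reference, then write the cleaned table out.
char_cols <- names(dt)[sapply(dt, is.character)]
dt[, (char_cols) := lapply(.SD, function(x) gsub("[\r\n]", "", x)),
   .SDcols = char_cols]
fwrite(dt, "cleaned.txt", sep = "\t")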
I wish I could remove my question...but maybe someday someone will come across the same "issue". I should have stepped back and thought about it before posting the question.

Import fixed width data file with no line separator

I have fixed-width data files (.dbf) that don't have line separators. Here is what two lines of such a data file look like:
20141101 77h 3.210 0 3 20141102 76h 3.090 0 3
The widths of one line are c(8,4,7,41): date (8), some time measure (4), the data point (7), and some other columns that I can summarize in one "rest" column (41). After one line there is no separator, and the next line is just appended to the first, so all time steps are basically written consecutively in one massive line. The file contains only numbers, characters and white space.
With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line due to the lack of a line separator.
Is there an argument to tell read.fwf() when to start reading the new line when there is no line separator? Or should i use a different read command?
Thanks in advance.
Maybe not the best idea but this should work:
content <- scan('filepath', 'character', sep='~') # Warning: choose a sep that does not appear in the data so the whole file is read as one entry.
# Split content into lines:
lines <- regmatches(content,gregexpr('.{60}',content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)
The idea is to read the whole file, get each occurrence of 60 characters into a single entry, write this to a tempfile, and read the data from this tempfile before deleting it.
Another approach is possible with regexes and the stringr package (still using the content resulting from the scan above):
library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)
which gives:
V1 V2 V3 V4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
str_match_all returns a list, here with one element because there is only one line of input, so we extract it with [[1]].
The result is a matrix with 5 columns, the first being the full match and the others the capture groups, so we subset the matrix on columns 2 to 5 to keep only the 4 columns we need and wrap it in data.frame to get a data.frame at the end.
You can then name the columns with colnames(d) <- c('date','time','data_point','rest').
If you wish to clean up the white space you can wrap the str_match_all result in trimws (thanks to @jaap for the reminder about this function) like this:
td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)
Output:
X1 X2 X3 X4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):
txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
                              function(x) substr(txt, x, x+59)))
library(dplyr)
library(tidyr)
dfx %>%
  separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>%
  mutate_all(trimws)
which gives:
VA VB VC VD VE
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
To get different column names, just replace c(paste0("V",LETTERS[1:5])) with a vector of the column names you want.
If you want to transform the columns into the correct classes instead of into character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all.
In addition to the other answers, some general info about dbf files:
Unless this is a one-time read of a static file, it would be best to check the file/field structure first, in case it changes over time. See here for the internal structure of a dbf file.
But maybe even more important:
Each record in a dbf file is preceded by one byte for the delete flag. If this is a space, the record is not deleted, if it's an asterisk * the record is marked for deletion (records are not removed from a dbf file until the file is packed), and you probably want to skip those records. The first part of the data could also be overwritten with "DELETED" for example.
So, in your record c(8,4,7,41), the last byte of the rest column (41) is actually the delete flag of the record that follows it - and the last record in the file will only have 40 bytes for that field (but if you're lucky, the file has an EOF marker (0x1a), so maybe you didn't have a problem with the size there).
Thus, your record layout should actually be c(1,8,4,7,40), where the 1 is the delete flag, and reading should start one byte sooner.
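A sketch of reading with that layout, assuming the lines have already been reassembled (e.g. into the tempfile x from the first answer) and that each record really is 60 bytes including the flag:
# widths c(1, 8, 4, 7, 40): delete flag, date, time, data point, rest.
dat <- read.fwf(x, widths = c(1, 8, 4, 7, 40),
                col.names = c("del_flag", "date", "time", "value", "rest"),
                stringsAsFactors = FALSE)
dat <- dat[trimws(dat$del_flag) != "*", ]   # drop records marked for deletion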

Skip comment line in csv file using R

I have a csv file which looks like this-
#this is a dataset
#this contains rows and columns
ID value1 value2 value3
AA 5 6 5
BB 8 2 9
CC 3 5 2
I want to read the csv file excluding those comment lines. It is possible to tell the reader that lines starting with '#' should be skipped. But the problem here is that there is an empty line after the comments, and my various csv files can have different numbers of comment lines. The main header, however, will always start with "ID", and that is where I want to start reading.
Is it possible to somehow specify that reading should start at the line beginning with ID? If yes, please give an example.
Thanks in advance!!
Use the comment.char option:
read.delim('filename', comment.char = '#')
Empty lines will be skipped automatically by default (blank.lines.skip = TRUE). You can also specify a fixed number of lines to skip via skip = number. However, it’s not possible to specify that it should start reading at a given line starting with 'ID' (but like I’ve said it’s not necessary here).
For those looking for a tidyverse approach, this will do the job, similar to @Konrad Rudolph's answer:
readr::read_delim('filename', comment = '#')
If you know in advance the number of lines before the header, you can use the skip option (here 3 lines):
read.table("myfile.csv",skip=3, header=T)

Read.CSV not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
header=T, stringsAsFactors=F, sep=",", row.names=NULL)
Here is my problem. When I open the csv data in Excel, the data look as expected. When I read the data into R, however, the first column is actually named row.names. R is reading in one extra column of data, but I can't figure out where the "error" occurs that causes row.names to become a column. Simply put, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
# 451 452
# 1 6852
That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
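To confirm the trailing-comma pattern yourself, a quick check (not part of the original diagnosis) is to look at the end of the first few raw lines:
# Print the last few characters of the first five lines; all but the
# header should end in ",,".
raw <- readLines("sfa0910.csv", n = 5)
substring(raw, nchar(raw) - 5)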
I have a possible fix based on mnel's comments:
dat<-readLines(paste("sfa", '0910', ".csv", sep=""))
ncommas<-sapply(seq_along(dat),function(x){sum(attributes(gregexpr(',',dat[x])[[1]])$match.length)})
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores.
for(i in seq_along(dat)[-1]){
  dat[i] <- gsub('(.*),', '\\1', dat[i])
}
write(dat,'temp.csv')
tmp<-read.table('temp.csv',header=T, stringsAsFactors=F, sep=",")
> tmp[1:5,1:7]
UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively
dat<-readLines(paste("sfa", '0910', ".csv", sep=""),n=1)
dum.names<-unlist(strsplit(dat,','))
tmp <- read.table(paste("sfa", '0910', ".csv", sep=""),
                  header=F, stringsAsFactors=F, col.names=c(dum.names,'XXXX'), sep=",", skip=1)
tmp1<-tmp[,-dim(tmp)[2]]
I know you've found an answer, but since your answer helped me figure this out, I'll share it:
If you read into R a file with a different number of columns in different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it will be read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it is read in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
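One way to guard against that behaviour (a sketch, assuming a ragged file called ragged.csv) is to work out the maximum number of fields first and pass explicit column names, so every short row is padded with NA no matter where the widest row sits:
# Determine the widest row, then supply that many column names so read.csv
# pads every shorter row with NA instead of recycling values.
max_fields <- max(count.fields("ragged.csv", sep = ","))
dat <- read.csv("ragged.csv", header = FALSE, fill = TRUE,
                col.names = paste0("V", seq_len(max_fields)))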
Just hope it may help someone!
If you are using local data, also make sure it's in the right place. To be sure, put it in your working directory, for instance, and set that via
setwd("C:/[User]/[MyFolder]")
directly in your R console.
