Skip comment line in csv file using R - r

I have a csv file which looks like this-
#this is a dataset
#this contains rows and columns
ID value1 value2 value3
AA 5 6 5
BB 8 2 9
CC 3 5 2
I want read the csv file excluding those comment lines. It is possible to read mentioning that when it is '#' skip those line.But here the problem is there is an empty line after comments and also for my different csv file it can be various numbers of comment lines.But the main header will always start with "ID" from where i want to read the csv.
It is possible to specify somehow that when it is ID read from there? if yes then please give an example.
Thanks in advance!!

Use the comment.char option:
read.delim('filename', comment.char = '#')
Empty lines will be skipped automatically by default (blank.lines.skip = TRUE). You can also specify a fixed number of lines to skip via skip = number. However, it’s not possible to specify that it should start reading at a given line starting with 'ID' (but like I’ve said it’s not necessary here).

For those looking for a tidyverse approach, this will make the job, similarly as in #Konrad Rudolph's answer:
readr::read_delim('filename', comment = '#')

If you know in advance the number of line beofre headers, you can use skip option (here 3 lines):
read.table("myfile.csv",skip=3, header=T)

Related

R bad row data not shown when read to data.table, but written to file

Sample input tab-delimited text file, note there is bad data from this source file, the enclosing " at end of line 3 is two lines down. So there is 1 complete blank line, followed by a line with just the double-quote character, then continued good data on the next line.
id ca cb cc cd
1 hi bye hey nope
2 ab cd ef "quoted text here"
3 gh ij kl "quoted text but end quote is 2 lines down
"
4 mn op qr lalalala
when I read this into R, tried using read.csv and fread, with/without 'blank.lines.skip = T' for fread, I get the following data table:
id ca cb cc cd
1 1 hi bye hey nope
2 2 ab cd ef quoted text here
3 3 gh ij kl quoted text but end quote is 2 lines down
4 4 mn op qr lalalala
The data table does not show the 'bad' lines. OK, good! However, when I go to write out this data table, tried both write.table and fwrite, those 2 bad lines of /nothing/, and the double-quote, are written out just like they show in the input file!
I've tried doing:
dt[complete.cases(dt),],
dt[!apply(dt == "", 1, all),]
to clear out empty data before writing out, but it does nothing. The data table still only shows those 4 entries. Where is R keeping this 'missing' data? How can I clear out that bad data?
I hope this is a 'one-off' bad output from the source (good ol' US Govt!), but I think they saved this from an xls file, which had bad formatting in a column, causing the text file to contain this mistake, but they obviously did not check the output.
After sitting back and thinking through the reading functions, because that column (cd) data is quoted, there's actually two newline characters at the end of the string, which is not shown in the data table element! So writing out that element will result in writing those two line breaks.
All I needed to do was:
dt$cd <- gsub("[\r\n","",dt$cd)
and that fixed it, the output written to file now has correct rows of data.
I wish I could remove my question...but maybe someday someone will come across the same "issue". I should have stepped back and thought about it before posting the question.

Read csv but skip escaped commas in strings

I have a csv file like this:
id,name,value
1,peter,5
2,peter\,paul,3
How can I read this file and tell R that "\," does not indicate a new column, only ",".
I have to add that file has 400mb.
Thanks
You can use readLines() to read the file into memory and then pre-process it. If you're willing to convert the non-separate commas into something else, you can do something like:
> read.csv(text = gsub("\\\\,", "-", readLines("dat.csv")))
id name value
1 1 peter 5
2 2 peter-paul 3
Another option is to utilize the fact that the fread function from data.table can perform system commands as its first argument. Then you can do something like a sed operation on the file before reading it in (which may or may not be faster):
> data.table::fread("sed -e 's/\\\\\\,/-/g' dat.csv")
id name value
1: 1 peter 5
2: 2 peter-paul 3
You can always then use gsub() to convert the temporary - separator back into a comma.

Import fixed width data file with no line separator

I have fixed width data files (.dbf) that don't have line separators. Here is what two lines of that datafile looks like:
20141101 77h 3.210 0 3 20141102 76h 3.090 0 3
The widths of one line is c(8,4,7,41) for date (8), some time measure (4), the data point (7), and some other columns that i can summarize in one "rest" column (41). After one line there is no separator and the next line is just appended to the first line. All time steps are basically written consecutively in one massive line. There is exclusively numbers, characters and white space in this file.
With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line due to lack of line separator.
Is there an argument to tell read.fwf() when to start reading the new line when there is no line separator? Or should i use a different read command?
Thanks in advance.
Maybe not the best idea but this should work:
content <- scan('filepath','character',sep='~') # Warning choose a sep not appearing in datas to get the whole file.
# Split content in lines:
lines <- regmatches(content,gregexpr('.{60}',content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)
The idea is to read the whole file, get each occurence of 60 chars into a single entry, write this to a tempfile, and read the data from this tempfile before deleting the temporary file.
Another approach is doable with regexes and package stringr (still with content resulting from scan above):
library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)
which gives:
V1 V2 V3 V4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
str_match_all return a list, here with 1 element because there's only one line as input, so we remove it with [[1]].
Now the return is 5 columns, the first one being the full match, others being the capture groups so we subset the matrix on columns 2 to 5 to get only the 4 columns we need and wrap it in as.data.frame to get a data.frame at end.
you can then name the columns with colnames(d) <- c('date','time','data_point','rest')
If you wish to clean up the white spaces you can wrap the str_extract_all result in trimws (thanks to #jaap for the remind of this function) like this:
td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)
Output:
X1 X2 X3 X4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):
txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
function(x) substr(txt, x, x+59)))
library(dplyr)
library(tidyr)
dfx %>%
separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>%
mutate_all(trimws)
which gives:
VA VB VC VD VE
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
To get different column names , just replace c(paste0("V",LETTERS[1:5]) with a vector of columnnames you want.
If you want to transform the columns into the correct classes instead of into character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all.
In addition to the other answers, some general info about dbf files:
Unless this is a one time read of a static file, it would be best to check the file/fields structure first in case that changes over time. See here for the internal structure of a dbf file.
But maybe even more important:
Each record in a dbf file is preceded by one byte for the delete flag. If this is a space, the record is not deleted, if it's an asterisk * the record is marked for deletion (records are not removed from a dbf file until the file is packed), and you probably want to skip those records. The first part of the data could also be overwritten with "DELETED" for example.
So, in your record c(8,4,7,41), the last byte of the rest column (41) is actually the delete flag of the record that follows it - and the last record in the file will only have 40 bytes for that field (but if you're lucky, the file has an EOF marker (0x1a), so maybe you didn't have a problem with the size there).
Thus, your record should actually be: c(1,8,4,7,40), where the 1 is the delete flag, and starting one byte sooner.

Technique for finding bad data in read.csv in R

I am reading in a file of data that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file as around 20000 records. I use the following R code to read it in:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
And the reason I have quote="" is because without it the import stops prematurely. I end up with a total of 9569 observations. Why I don't understand why exactly the quote="" overcomes this problem, it seems to do so.
Except that it introduces other problems that I have to 'fix'. The first one I saw is that the dates end up being strings which include the quotes, which don't want to convert to actual dates when I use to.Date() on them.
Now I could fix the strings and hack my way through. But better to know more about what I am doing. Can someone explain:
Why does the quote="" fix the 'bad data'
What is a best-practice technique to figure out what is causing the read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).
Here are the lines 'near' the 'problem'. I don't see the damage do you?
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
It's more tricky. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. Therefore I surmise that something happened during the import to cause a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records. When I have the quote="" in there I get 15914.
So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
* UPDATE 2 *
#Dwin, I had to remove na.strings="\N" because the count.fields function doesn't permit it. With that, I get this output which looks interesting but I don't understand it.
3 4 22 23 24
1 83 15466 178 4
Your second command produces a lots of data (and stops when max.print is reached.) But the first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
Which I don't understand if the output is supposed to show how many fields there are in each record of input. Clearly the first lines all have more than 2,4,2 etc fields... Feel like I am getting closer, but still confused!
The count.fields function can be very useful in identifying where to look for malformed data.
This gives a tabulation of fields per line ignores quoting, possibly a problem if there are embedded commas:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )
This give a tabulation ignoring both quotes and "#"(octothorpe) as a comment character:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )
Atfer seeing what you report for the first tabulation..... most of which were as desired ... You can get a list of the line positions with non-22 values (using the comma and non-quote settings):
which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this be a problem also for other lines you have?
If I add the "missing" quote after John Smith, it reads fine.
I saved this data to data.txt:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
And this is a code. Both fread and read.csv works fine.
require(data.table)
dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1
dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence number. For that purposes I am using the snippets package.
As it can be seen at the bottom of the link, first I have to create a vector (is that right that words is a vector?) like bellow.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file like the txt file created. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, the subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words. file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserved all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle as you can see in both Gavin's answer here and my answer on rhelp.
Initial answer:
Suggest you use as.data.frame to coerce to a dataframe an then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4

Resources