Removing punctuation before tabulating data - r

I'm having issues with pulling data from the clipboard that happens to have lots of punctuation (quotes, commas, etc.) in it. I'm attempting to pull the entirety of Jane Austen's Pride and Prejudice, as a plain text document, via copying to the clipboard into a variable in R for analysis.
If I do a
book <- read.table("clipboard", sep="\n")
I get an "EOF within quoted string" error. If I put the option to not have strings as factors:
book <- read.table("clipboard", sep="\n", stringsAsFactors=F)
I get the same error. The quoting also affects the table by merging multiple paragraphs together wherever quotation marks are present. If I open the book in a text editor and remove the double quotes and single quotes, then try either read.table call, the result is perfect.
Is there a way to remove punctuation prior to (or during?) the read.table phase? Would I dump the clipboard data into some kind of big vector then read.table off that vector?

You need to disable quoting.
This works for me:
book <- read.table("http://www.gutenberg.org/cache/epub/1342/pg1342.txt",
                   sep = "\n", quote = "", stringsAsFactors = FALSE)

The read.table function is intended to read in data in a rectangular structure and put it into a data frame. I don't expect that the text of a book would fit that pattern in general. I would suggest reading the data with the scan or readLines function in place of read.table. Read the documentation for those functions on how to deal with quotes and separators.
If you still want to remove punctuation, then look at ?gsub, if you also want to convert all the characters to upper or lower case see ?chartr.
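For example, a minimal sketch of that approach, reading the same Project Gutenberg file as in the answer above and then stripping punctuation before analysis (the exact cleaning steps are up to you):
book <- readLines("http://www.gutenberg.org/cache/epub/1342/pg1342.txt")
# Remove punctuation and fold to lower case (see ?gsub and ?tolower)
book <- gsub("[[:punct:]]", "", book)
book <- tolower(book)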

Related

read_csv does not separate on commas and does not capture separate rows

I am trying to parse a text log file like the one shown below. I can use the default read.csv to parse this file:
test <- read.csv("test.txt", header=FALSE)
It separates all the comma-delimited parts; the result is not perfectly arranged in a data frame, but further manipulation can improve it.
However, I cannot seem to do the same using the readr package:
test <- read_csv("test.txt", col_names=FALSE)
All observations end up in one row, with no separation at the commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to @Dave2e for pointing out that this file is in JSON format; I found that it can be parsed using ndjson::stream_in.
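A minimal sketch of that approach, assuming the log shown above is saved as newline-delimited JSON in "test.txt" (the exact column layout of the result depends on how the package flattens the nested fields):
library(ndjson)
# Each JSON record becomes one row; nested fields such as the "dist" array
# are flattened into separate columns.
test <- ndjson::stream_in("test.txt")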

Exporting large number to csv from R

I came across a strange problem when trying to export an R dataframe to a csv file.
The dataframe contains some big numbers, but when they are written to the csv file, they "lose" the decimal part.
But not in the way one would expect: the values are not rounded, the decimal separator simply disappears.
Say 3224571816.5649 is the correct value in R. When written to csv, it becomes 32245718165649.
I am using the write.csv2 function to write the csv. The separators are correct, as it works normally for smaller values. Is the problem occurring because the number (with decimals) is bigger than 32bit?
And more importantly, how can I solve this, since I have a whole dataframe with values as big as (or bigger than) this? Also, it has to be written to a csv.
write.csv2 is intended for a different csv standard (Western European styling, which, based on your use of "." as the decimal indicator, I am guessing you are not looking for). write.csv2 uses a comma as the decimal indicator and a semicolon as the field delimiter, so if you try to read the result back in as a comma-separated file, it will look strange indeed.
I suggest you use write.csv (or even better, write.table) to output your file. write.csv assumes a comma for the field separator and a period for the decimal marker.
Both write.csv and write.csv2 are just wrappers for write.table, which is the underlying method. In general, I recommend using write.table because it does not assume your locale and you can explicitly pass it sep = ",", dec = ".", etc. This not only makes it clear what you are using, it also makes your code a lot more readable.
For more, check the rdocumentation.org page for write.table: https://www.rdocumentation.org/packages/utils/versions/3.5.3/topics/write.table
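A minimal sketch of such an explicit write.table call (the data frame here is illustrative):
df <- data.frame(id = 1, value = 3224571816.5649)
# Explicit field separator and decimal marker, independent of locale settings
write.table(df, "out.csv", sep = ",", dec = ".", row.names = FALSE)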

Reading a CSV file and tokenizing it

I am a newbie in R. I have been trying to read a CSV file like this.
tweets <- read.csv("tweets.csv")
and I need to be able to remove all of the punctuation, convert to lower case, and remove numbers, stop words, and whitespace from the data frame 'tweets' without having to convert it into a corpus or something. Nothing fancy, just straight removal. Is there any library/function that could help solve this issue?
The reading part is what you have already defined:
tweets <- read.csv("tweets.csv")
For dealing with punctuation, whitespace and the like, the alternative to a corpus is regular expressions, but that approach has more limited application because it is not generic at all.
That is why a corpus is usually preferred: the same cleaning steps become easier to apply to different sources.
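That said, a minimal base-R sketch of the regular-expression route (it assumes the data frame has a text column named "text"; adjust to your actual column name):
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
clean <- tolower(tweets$text)                # lower case
clean <- gsub("[[:punct:]]", " ", clean)     # remove punctuation
clean <- gsub("[[:digit:]]", " ", clean)     # remove numbers
# Small illustrative stop-word list; extend it or take one from a package
stops <- c("the", "a", "an", "and", "of", "to", "in", "is")
clean <- gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"), " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))    # collapse whitespace
tweets$text <- clean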

Deal with escaped commas in CSV file?

I'm reading in a file in R using fread as such
test.set = fread("file.csv", header=FALSE, fill=TRUE, blank.lines.skip=TRUE)
Where my csv consists of 6 columns. An example of a row in this file is
"2014-07-03 11:25:56","61073a09d113d3d3a2af6474c92e7d1e2f7e2855","Securenet Systems Radio Playlist Update","Your Love","Fred Hammond & Radical for Christ","50fcfb08424fe1e2c653a87a64ee92d7"
However, certain rows are formatted in a particular way when there is a comma inside one of the cells. For instance,
"2014-07-03 11:25:59","37780f2e40f3af8752e0d66d50c9363279c55be6","Spotify","\"Hello\", He Lied","Red Box","b226ff30a0b83006e5e06582fbb0afd3"
produces an error of the sort
Expecting 6 cols, but line 5395818 contains text after processing all
cols. Try again with fill=TRUE. Another reason could be that fread's
logic in distinguishing one or more fields having embedded sep=','
and/or (unescaped) '\n' characters within unbalanced unescaped quotes
has failed. If quote='' doesn't help, please file an issue to figure
out if the logic could be improved.
As you can see, the value that is causing the error is "\"Hello\", He Lied", which I want to be read by fread as "Hello, He Lied". I'm not sure how to account for this, though - I've tried using fill=TRUE and quote="" as suggested, but the error still keeps coming up. It's probably just a matter of finding the right parameter(s) for fread; anyone know what those might be?
In read.table() from base R this issue is solvable, using the approach from "Import data into R with an unknown number of columns?".
In fread from data.table this is not currently possible.
An issue has been logged for it: https://github.com/Rdatatable/data.table/issues/2669
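A possible workaround (a sketch, not taken from the linked answer): rewrite the backslash-escaped quotes as the doubled quotes that standard CSV parsers expect, then hand the cleaned text to fread. The embedded quotes then survive as literal characters inside the field.
library(data.table)
lines <- readLines("file.csv")
# Turn \" into "" so the quoted fields are balanced again
lines <- gsub('\\"', '""', lines, fixed = TRUE)
test.set <- fread(paste(lines, collapse = "\n"), header = FALSE)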

read.csv in R doesn't import all rows from csv file

I have a comma-separated dataset of around 10,000 rows. When doing read.csv, R created a dataframe with fewer rows than the original file. It excluded/rejected 200 rows.
When I open the csv file in Excel, the file looks okay. The file is well formatted for line delimiters and also field delimiters (as per parsing done by Excel).
I have identified the row numbers in my file which are getting rejected but I can't identify the cause by glancing over them.
Is there any way to look at logs or something which includes reason why R rejected these records?
The OP indicates that the problem is caused by quotes in the CSV file.
When the records in the CSV file are not quoted but a few of them contain quote characters, the file can be read using the quote="" option in read.csv. This disables quote handling entirely.
data <- read.csv(filename, quote="")
Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and will cause problems if your fields contain commas.
lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))
A slightly safer solution, which only rewrites quotes that are not directly preceded or followed by a comma, doubling them so that read.csv treats them as escaped literal quotes:
lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)
data <- read.csv(textConnection(lines))
I had the same issue, where the difference between the number of rows present in the csv file and the number of rows read by the read.csv() command was significant. I used fread() from the data.table package in place of read.csv and it solved the problem.
The rejected records were due to the presence of double quotes in the csv file. I removed the double quotes in Notepad++ before reading the file into R. If you can suggest a better way to remove the double quotes in R (before reading the file), please leave a comment below.
Pointed out by Jan van der Laan. He deserves the credit.
For your last question, you want to remove doubled double quotes (that is, "") before reading the csv file into R. This is probably best done as a file preprocessing step, using a one-line shell "sed" command (a topic covered in the Unix & Linux forum):
sed -i 's/""/"/g' test.csv
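If you would rather stay in R, a sketch of the same substitution (collapse the doubled quotes before parsing):
lines <- readLines("test.csv")
lines <- gsub('""', '"', lines, fixed = TRUE)
data <- read.csv(textConnection(lines))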
