Reading a CSV file and tokenizing it - r

I am a newbie in R. I have been trying to read a CSV file like this.
tweets <- read.csv("tweets.csv")
and I need to remove all punctuation, convert to lower case, and remove numbers, stop words, and extra whitespace from the data frame 'tweets' without having to convert it into a corpus or anything like that. Nothing fancy, just straightforward removal. Is there any library/function that could help with this?

The reading part is what you already have:
tweets <- read.csv("tweets.csv")
For dealing with punctuation, whitespace, and the rest, the main alternative to building a corpus is regular expressions, but that approach has limited application because it is not generic at all.
That is why a corpus is usually preferred: the same cleaning steps become easier to apply to different sources.
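That said, a minimal sketch of the direct, no-corpus route looks roughly like this, assuming the tweet text sits in a column named text (an assumption about your file); the tm helper functions used below also accept plain character vectors:
library(tm)  # removePunctuation, removeNumbers, removeWords, stripWhitespace
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
txt <- tolower(tweets$text)                    # convert to lower case
txt <- removePunctuation(txt)                  # drop punctuation
txt <- removeNumbers(txt)                      # drop digits
txt <- removeWords(txt, stopwords("english"))  # drop English stop words
txt <- stripWhitespace(txt)                    # collapse repeated whitespace
tweets$text <- trimws(txt)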

Related

Is there a way to count specific words from a corpus of PDFs in R?

How can I count the number of specific words in a corpus of PDFs?
I tried using text_count but I honestly didn't understand what it returned.
First you would want to OCR the PDFs if necessary, then convert them to raw text. pdftools can help with both of those steps, but I am not sure that it can handle multiple columns.
https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
Here is another post:
Use R to convert PDF files to text files for text mining
As above, you could use xpdf (installed via homebrew) to convert the pdfs, as I believe it has some more functionality as far as multiple columns/text alignment goes.
After you have raw text, you can use a package like tm to obtain word counts in a corpus. Let me know if this works or if you have further questions.
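A rough sketch of that pipeline, using pdftools for the text extraction and a plain tabulation at the end (the file name and the words being counted are made up for illustration):
library(pdftools)
pages <- pdf_text("document.pdf")                           # hypothetical file; one string per page
words <- tolower(unlist(strsplit(pages, "[^[:alpha:]]+")))  # crude tokenisation on non-letters
target <- c("growth", "inflation")                          # hypothetical words of interest
table(factor(words[words %in% target], levels = target))    # count per target word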

Problem with encoding of character strings when loading json files to RStudio under Windows 10

I am trying to extract Tweets from json files and save them as RData under Windows 10 and using RStudio version 1.2.5033 and streamR. However, Windows (and RStudio and streamR subsequently) assumes that the input is Windows-1252 although it is UTF-8 which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace each of these wrongly interpreted special characters with the correct one, for example using the table I found here. In my case, however, the corpus is very large, which makes this a suboptimal solution in the long run. Additionally, I would have no way to check whether it actually replaced all special characters correctly.
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent line because it does not recognise the line break which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the json files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears using simple R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that allows extracting Tweets from json files in R. tweet2r deletes certain special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.
Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Some further information for those who are used to the luxury of streamR handling tweets and might encounter a similar issue in the future: there are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets. stream_in does. Hence, the data frame has more rows when using stream_in but contains the same Tweets.
stream_in creates fewer variables, since some of the columns in its data frame are themselves data frames. This might lead to issues if you use the stream_in data frame without transforming it further; parseTweets does that transformation for you.
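If those nested data-frame columns get in the way, one possible workaround (a sketch, not something from the original answer) is jsonlite::flatten, which expands them into ordinary columns:
df <- jsonlite::stream_in(file("test.json"))
flat <- jsonlite::flatten(df)   # turn nested data-frame columns into regular columns
str(flat, max.level = 1)        # inspect the resulting flat structure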

read_csv does not separate at commas and does not capture separate rows

I am trying to parse a text log file like the one shown below. I can use the default read.csv to parse this file.
test <- read.csv("test.txt", header=FALSE)
It separates all the comma-delimited parts; the result is not laid out perfectly in a data frame, but it can be improved with further manipulation.
However, I cannot seem to do the same with the readr package:
test <- read_csv("test.txt", col_names=FALSE)
All observations end up in a single row, with no separation at the commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to #Dave2e for pointing out that this file is in JSON format, I found a way to parse it using ndjson::stream_in.
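For reference, a minimal sketch of that approach (file name as in the question; jsonlite also offers a stream_in for the same newline-delimited JSON):
library(ndjson)
test <- ndjson::stream_in("test.txt")            # each log line becomes one flattened row
# or, equivalently, with jsonlite:
# test <- jsonlite::stream_in(file("test.txt"))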

Exporting large number to csv from R

I came across a strange problem when trying to export an R dataframe to a csv file.
The dataframe contains some big numbers, but when they are written to the csv file, they "lose" the decimal part and are instead written without it.
But not in the way one would expect; instead, it looks like this:
Say 3224571816.5649 is the correct value in R. When written to csv, it becomes 32245718165649.
I am using the write.csv2 function to write the csv. The separators are correct, as it works normally for smaller values. Is the problem occurring because the number (with decimals) is bigger than 32bit?
And more importantly, how can I solve this, given that I have a whole dataframe with values as big as (or bigger than) this? Also, it has to be written to a csv.
write.csv2 is intended for a different csv convention (the Western European style, which, judging by your use of "." as a decimal indicator, is not what you are looking for). write.csv2 uses a comma as the decimal indicator and a semicolon as the field delimiter, so if you try to read the result back in as a comma-separated file, it will look strange indeed.
I suggest you use write.csv (or even better, write.table) to output your file. write.csv assumes a comma separator and period for decimal marker.
Both write.csv and write.csv2 are just wrappers for write.table, which is the underlying method. In general, I recommend using write.table because it does not assume your locale and you can explicitly pass it sep = ",", dec = ".", etc. This not only lets you know for sure what you are using, but it also makes your code a lot more readable.
For more, check the rdocumentation.org page for write.table: https://www.rdocumentation.org/packages/utils/versions/3.5.3/topics/write.table
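A minimal sketch of such an explicit call (the data frame and file names are illustrative):
write.table(df, "output.csv",
            sep = ",",          # comma as the field delimiter
            dec = ".",          # period as the decimal marker
            row.names = FALSE)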

Removing punctuation before tabulating data

I'm having issues with pulling data from the clipboard when it happens to have lots of punctuation (quotes, commas, etc.) in it. I'm attempting to pull in the entirety of Jane Austen's Pride and Prejudice as a plain text document by copying it to the clipboard and reading it into a variable in R for analysis.
If I do a
book <- read.table("clipboard", sep="\n")
I get an "EOF within quoted string" error. If I put the option to not have strings as factors:
book <- read.table("clipboard", sep="\n", stringsAsFactors=F)
I get the same error. This affects the table by putting multiple paragraphs together where quotations are present. If I open the book in a text editor and remove the double quotes and single quotes, then try either read.table option, the result is perfect.
Is there a way to remove punctuation prior to (or during?) the read.table phase? Would I dump the clipboard data into some kind of big vector then read.table off that vector?
You need to disable quoting. This works for me:
book <-read.table("http://www.gutenberg.org/cache/epub/1342/pg1342.txt",
sep="\n",quote="",stringsAsFactors=FALSE)
The read.table function is intended to read in data in a rectangular structure and put it into a data frame. I don't expect that the text of a book would fit that pattern in general. I would suggest reading the data with the scan or readLines function in place of read.table. Read the documentation for those functions on how to deal with quotes and separators.
If you still want to remove punctuation, then look at ?gsub; if you also want to convert all the characters to upper or lower case, see ?chartr.
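A rough sketch of that readLines/gsub route, reusing the Gutenberg URL from the answer above:
book <- readLines("http://www.gutenberg.org/cache/epub/1342/pg1342.txt")
book <- gsub("[[:punct:]]", "", book)   # strip punctuation after reading
book <- tolower(book)                   # convert everything to lower case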

Resources