Trying to read tweets in R stored in a csv file

I have some tweets stored in a csv file on my local computer. There are 1248 rows. When I try to read these tweets into R using the read.csv function, I get 1816 rows. This happens because some tweets contain commas, so read.csv splits a single tweet across multiple fields and rows wherever those commas occur. What separator should I define to read the file correctly?
Thanks

Use read.table or read.delim instead of read.csv and use the quote parameter. There's a thread that covers all the details:
read.table with comma separated values and also commas inside each element
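For example, a minimal sketch along those lines (tweets.csv is a stand-in for your file), assuming the tweet text is wrapped in double quotes:
# quote = "\"" tells R to treat commas inside quoted fields as data, not separators
tweets <- read.table("tweets.csv", sep = ",", quote = "\"", header = TRUE, stringsAsFactors = FALSE)
nrow(tweets)  # should report 1248 if every tweet field is properly quoted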

Convert the csv file to xlsx and use the following code:
library(readxl)
dataset <- read_excel('C:/Study/..._Sample1.xlsx')

Related

read_csv does not separate at commas and does not capture separate rows

I am trying to parse a text log file like the one shown below. I can use the default read.csv to parse this file:
test <- read.csv("test.txt", header=FALSE)
It separates the fields at the commas; the result is not a perfect data frame, but further manipulation can improve it.
However, I cannot seem to do the same using the readr package:
test <- read_csv("test.txt", header=FALSE)
All observations end up in a single row, with no separation at the commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to @Dave2e for pointing out that this file is in JSON format; I found that it can be parsed using ndjson::stream_in.
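A minimal sketch of that approach, assuming the log lines above are saved as test.txt:
library(ndjson)
test <- ndjson::stream_in("test.txt")  # one row per JSON line, nested fields flattened into columns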

Reading in file gives empty rows and columns

Given this CSV file:
How can I read the file so that the extra commas that are not part of the data are excluded?
The file itself seems to be fine. Have you tried the appropriate arguments in your import function?
Would you like to try read_delim() from the readr package?
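For example, a hedged sketch, with data.csv standing in for the file shown above:
library(readr)
dat <- read_delim("data.csv", delim = ",", trim_ws = TRUE)  # trim_ws drops stray whitespace around fields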

Does what='char' work for the read.csv function?

The basic format for the scan function in R to read a file of characters looks like this:
a <- scan(file.choose(), what='char', sep=',')
I have a csv file with names as a separate column. Can I use what='char' in read.csv? If yes, how do I use it? If not, how do I read the names column?
There is an entire R manual on importing and exporting data
https://cran.r-project.org/doc/manuals/r-release/R-data.html
read.table (or more specifically read.csv, which is read.table with the default separator set to a comma) is the function you are looking for.
a <- read.csv(yourfile)
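As a hedged sketch, assuming a hypothetical people.csv with a name column:
a <- read.csv("people.csv", stringsAsFactors = FALSE)  # keep text as character, not factor
name_col <- a$name  # the names column as a character vector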

How do I read a .dxl file in R using read.csv?

I tried opening the file in Excel and it displays in the proper format. Now how do I read it in R? I tried the read.csv function, but it puts all the columns together without any separator.
You cannot load it directly into a data frame; first you have to load it as XML, and then you can process it further.
Try following
require(XML)
data <- xmlParse('sample.dxl')   # parse the DXL file as an XML document
xml_data <- xmlToList(data)      # convert the XML tree into a nested list
You can then process this list further to build your data frame.
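For instance, a rough sketch, assuming each top-level element of xml_data is a flat record with the same field names (the real structure depends on your DXL file):
df <- do.call(rbind, lapply(xml_data, as.data.frame))  # one row per record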

read.csv in R doesn't import all rows from csv file

I have a comma separated dataset of around 10,000 rows. When doing read.csv, R created a data frame with fewer rows than the original file. It excluded/rejected 200 rows.
When I open the csv file in Excel, the file looks okay. It is well formatted with respect to both line delimiters and field delimiters (according to Excel's parsing).
I have identified the row numbers in my file that are getting rejected, but I can't identify the cause by glancing over them.
Is there any way to look at logs or something which includes reason why R rejected these records?
The OP indicates that the problem is caused by quotes in the CSV file.
When the records in the CSV file are not quoted, but a few of them contain quote characters, the file can be opened with the quote="" option in read.csv, which disables quote handling:
data <- read.csv(filename, quote="")
Another solution is to remove all quotes from the file, but this also modifies the data (your strings will no longer contain any quotes) and will cause problems if your fields contain commas:
lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)  # strip every double-quote character
data <- read.csv(textConnection(lines))
A slightly safer solution escapes the embedded quotes by doubling them (the CSV escape convention); the pattern only matches quotes that are not directly next to a comma, so field-delimiting quotes are left alone:
lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)  # a quote inside a field becomes ""
data <- read.csv(textConnection(lines))
I had the same issue: the difference between the number of rows in the csv file and the number of rows read by read.csv() was significant. I used fread() from the data.table package in place of read.csv and it solved the problem.
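A minimal sketch of that route, with data.csv standing in for the file:
library(data.table)
dt <- fread("data.csv")  # fread's parser is more forgiving of stray quotes than read.csv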
The rejected records were due to the presence of double quotes in the csv file. I removed the double quotes in Notepad++ before reading the file into R. If you can suggest a better way to remove the double quotes in R (before reading the file), please leave a comment below.
Pointed out by Jan van der Laan. He deserves the credit.
In your last question you asked how to remove the doubled quotes (that is, "") before reading the csv file into R. This is probably best done as a file preprocessing step, using a one-line sed command from the shell (covered in the Unix & Linux forum):
sed -i 's/""/"/g' test.csv   # collapse each "" to a single " in place
