Problems with reading a txt file (EOF within quoted string) - r

I am trying to use read.table() to import this TXT file into R (it contains information about meteorological stations provided by the WMO):
However, when I try to use
tmp <- read.table(file=...,sep=";",header=FALSE)
I get the warning
EOF within quoted string
and only 3514 of the 6702 lines appear in 'tmp'. From a quick look at the text file, I couldn't find any obviously problematic characters.
As suggested in other threads, I also tried quote="". The EOF warning disappeared, but still only 3514 lines are imported.
Any advice on how I can get read.table() to work for this particular txt file?

It looks like your data actually has 11548 rows. This works:
read.table(url('http://weather.noaa.gov/data/nsd_bbsss.txt'),
sep=';', quote=NULL, comment='', header=FALSE)
Edit: updated according to @MrFlick's comments below.

The problem is the line endings. R does not recognize "^M" (a carriage return), so to load the file you only need to specify the encoding, like this:
data <- read.table("nsd_bbsss.txt", sep=";", header=FALSE, encoding="latin1", quote="", comment.char="", colClasses=rep("character",14))
But line 8638 has more than 14 columns, unlike the other lines, which may lead to an error message.
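A minimal sketch of how such odd lines can be located before importing, assuming a small made-up sample file rather than the real nsd_bbsss.txt. count.fields() reports the number of fields on each line; quoting and comment handling are switched off so that apostrophes and # characters are treated as ordinary text. The station rows and the variable names are hypothetical.

```r
# Hypothetical sample standing in for the real station list.
lines <- c("01001;Jan Mayen;70.93",
           "01002;Baie d'Hudson;80.05",
           "01003;Hornsund;77.00;extra")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Count the fields on every line, with quotes and comments disabled.
n_fields <- count.fields(path, sep = ";", quote = "", comment.char = "")

# Lines whose field count differs from the first line are suspects.
odd_lines <- which(n_fields != n_fields[1])

# fill = TRUE pads short rows with empty strings so the import does not stop.
stations <- read.table(path, sep = ";", quote = "", comment.char = "",
                       header = FALSE, fill = TRUE,
                       colClasses = "character")
```

Here odd_lines points at the third line, which can then be inspected with readLines(path)[odd_lines].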

Related

Accessing a CSV file hosted on Github with R

I'd like to access data from GitHub repositories directly from R.
When importing data, I get this error
cols(<!DOCTYPE html> = col_character()) 60 parsing failures.
How can I fix that? My code:
data <- read_csv(curl("https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv"))
The key, as @karthik commented, is to change the URL by replacing https://github.com/ with https://raw.githubusercontent.com/ and dropping the blob/ part,
i.e. changing:
https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv
to:
https://raw.githubusercontent.com/datatto/AU25-de-Mayo/master/AU_F_Properati_v2.csv
(carefully compare the URLs and you'll spot the differences)
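The URL rewrite can also be done programmatically. The following is a small sketch, not part of the original answer; to_raw_url is a hypothetical helper name, and it assumes the input is a standard https://github.com/<user>/<repo>/blob/<branch>/<path> link.

```r
# Hypothetical helper: turn a GitHub "blob" page URL into a raw-content URL.
to_raw_url <- function(blob_url) {
  # Swap the host for the raw-content host.
  raw <- sub("^https://github\\.com/", "https://raw.githubusercontent.com/",
             blob_url)
  # Drop the first "/blob/" path component.
  sub("/blob/", "/", raw, fixed = TRUE)
}

raw_url <- to_raw_url(
  "https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv"
)
```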
Besides that, it seems your .csv file is formatted using ";" as the field separator and "," as the decimal separator; this is common with data in languages such as Spanish, where the comma is reserved as the decimal separator.
To properly parse the file, simply use read.csv2() or read_csv2() i.e.:
library(tidyverse)
mydata <- read_csv2("https://raw.githubusercontent.com/datatto/AU25-de-Mayo/master/AU_F_Properati_v2.csv")
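To illustrate the separator and decimal handling without touching the network, here is a short sketch using base R's read.csv2() on an inline string. The column names and values are made up for the example.

```r
# Hypothetical two-row sample: ";" separates fields, "," marks decimals.
sample_text <- "precio;superficie\n1500,5;42,25\n980,0;30,5"

# read.csv2() defaults to sep = ";" and dec = ",".
props <- read.csv2(text = sample_text)

# Both columns are parsed as numeric rather than character.
str(props)
```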
I had the same problem and found a short answer here: One way is to use this line of code in R:
readr::read_csv("https://raw.github.com/user/repository/branch/file.name")

Different number of lines when loading a file into R

I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel it doesn't show any errors either.
EDIT
The problem was that R read a line like L + H* as three separate lines: L, + and H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines there are in your file, and then use read.csv() to import it again and check that you get the expected result. Sometimes a file is parsed differently because of an extra quote, an extra line break, or something similar.
Possible import steps:
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type.
2. Choose an import method (type read and press Tab to see the available import functions; also check out readr).
3. Customize the arguments, for example whether there is a header, or whether to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and decide whether you need to repeat steps 2, 3 and 4.
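The steps above can be sketched as follows. This is only an illustration on a made-up comma-separated file; the file contents and the names raw_lines and dat are assumptions, and head()/tail() are used instead of View() so the code also runs outside an interactive session.

```r
# Hypothetical sample file standing in for the real data.
path <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,L*H", "2,no", "3,H*L"), path)

# Step 1: look at the raw text to identify the delimiter and the header.
raw_lines <- readLines(path)
head(raw_lines, 3)

# Steps 2 and 3: pick an import function and set its arguments.
dat <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)

# Step 4: inspect the result and compare row counts (minus the header line).
head(dat)
tail(dat)
counts_match <- (length(raw_lines) - 1) == nrow(dat)
```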
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")
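A reproducible variant of the same idea, assuming a small made-up file in place of the one picked with file.choose(). Quoting and comment handling are also disabled here as an extra precaution; that goes slightly beyond the call above.

```r
# Hypothetical sample with one label that contains spaces.
path <- tempfile(fileext = ".txt")
writeLines(c("L*H", "no", "H*L", "L + H*", "no"), path)

# With sep = "\n" each physical line becomes exactly one value.
labels <- read.table(path, sep = "\n", header = FALSE,
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)

# The row count should now match the number of lines in the file.
nrow(labels) == length(readLines(path))
```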

Can't read vcf-derived file in R: "no lines available in input"

I have a vcf file and I want to extract the header (which is the only line that has the pattern '#CHROM' in it). So, in the Mac terminal, I typed the following:
grep '#CHROM' file.vcf > headerlinevcf.txt
I see it on the terminal, or in vi, and I found what I expected. I can see all the columns of that line (= the text in the header). I see it like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 ...
Now, I try to read it into R as a vector, because I have to do some other stuff to it, and the following error comes up:
> headerline<-as.vector(read.table('headerlinevcf.txt'))
Error in read.table("headerlinevcf.txt") : no lines available in input
I tried to read.delim using tab and then a space, but it didn't work. I also tried:
headerline <- read.table('headerlinevcf.txt')
This also gives the same error.
I also tried readLines command, and it gives me this:
headerline <- readLines('headerlinevcf.txt')
> headerline[1]
[1]
"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tH
G00099\tHG00100\tHG00101\tHG00102\tHG00103\tHG00105\tHG00106\tHG00107\tHG00
... <truncated>
It seems that the VCF file (and thus this VCF-derived file) is delimited in some strange way.
A friend tried in Python to read it, change the '\t' into spaces, and open that new file with R, but the same error came out once again.
I don't know this format well enough to find the error easily, so I've been struggling with this for the past couple of days. Please, if someone knows what's happening, lend me a hand! Thanks in advance.
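No answer is recorded for this question here. One likely explanation, offered as an assumption rather than a confirmed diagnosis: read.table() treats # as a comment character by default (comment.char = "#"), so a file whose only line starts with #CHROM looks empty to it. The sketch below uses a shortened, made-up copy of the header line and shows two ways around that.

```r
# Shortened stand-in for the real header line (tab-delimited).
header_text <- paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                       "INFO", "FORMAT", "HG00096", "HG00097"),
                     collapse = "\t")
path <- tempfile(fileext = ".txt")
writeLines(header_text, path)

# Default settings: the whole line is skipped as a comment, so this errors.
default_result <- tryCatch(read.table(path),
                           error = function(e) conditionMessage(e))

# Option 1: disable comment handling and split on tabs.
headerline <- unlist(read.table(path, sep = "\t", comment.char = "",
                                stringsAsFactors = FALSE),
                     use.names = FALSE)

# Option 2: skip read.table() and split the raw line directly.
headerline2 <- strsplit(readLines(path), "\t", fixed = TRUE)[[1]]
```

Both options give a character vector with one element per column name.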

CSV separated by ';' have semicolons in some of their attributes and can't parse correctly

I've been downloading Tweets in the form of a .csv file with the following schema:
username;date;retweets;favorites;text;geo;mentions;hashtags;permalink
The problem is that some tweets have semi-colons in their text attribute, for example, "I love you babe ;)"
When I try to import this csv into R, I get some records with the wrong schema, as you can see here:
I think this format error is caused by the csv parser finding a ; in the text section and splitting the table there, if you understand what I mean.
I've already tried matching with the regex: (;".*)(;)(.*";)
and replacing it with ($1)($3) until no more matches are found, but the csv parsing error persists.
Any ideas on how to clean this csv file? Or why the csv parser is misbehaving?
Thanks for reading
Edit1:
I think that there is no problem in the structure other than a badly chosen separator (';'). Look at this example record:
Juan_Levas;2015-09-14 19:59;0;2;"Me sonrieron sus ojos; y me tembló hasta el alma.";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528
This is a well-formatted record, but I think that the semicolon in the text section (marked between "") forces the parser to divide the text section into 2 columns, in this case: "Me sonrieron sus ojos and y me tembló hasta el alma.";.
Is this possible?
Also, I'm using read.csv("data.csv", sep=';') to parse the csv into a data frame.
Edit2:
How to reproduce the error:
Get the csv from here [~2 MB]: Download csv
Do df <- read.csv('twit_data.csv', sep=';')
Explore the resulting data frame (you can sort it by Date, Retweets or Favorites and you will see the inconsistencies in the parsing)
Your CSV file is not properly formatted: the problem is not the separator occurring in character fields, it's rather the fact that the " are not escaped.
The best thing to do would be to generate a new file in a proper format (typically one that follows RFC 4180).
If it's not possible, your best option is to use a "smart" tool like readr:
library(readr)
df <- read_csv2('twit_data.csv')
It does quite a good job on your file. (I can't see any obvious parsing error in the resulting data frame)
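To see that a separator inside a properly quoted field is not a problem in itself, here is a small sketch with a made-up record (some_user is a hypothetical username and the schema is shortened). The semicolon inside the quoted text stays in one column.

```r
# Hypothetical record: the text field is quoted and contains a semicolon.
csv_text <- paste(
  "username;date;retweets;favorites;text",
  "some_user;2015-09-14 19:59;0;2;\"I love you babe ;)\"",
  sep = "\n"
)

# read.csv() honours double quotes, so the inner ";" does not split the field.
tweets <- read.csv(text = csv_text, sep = ";", stringsAsFactors = FALSE)
ncol(tweets)
tweets$text
```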

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but then when I do names(initData) <- initData[2,] then the names have spaces and illegal characters and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
row.names=NULL,
header=T,
stringsAsFactors = F)
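Another option, shown here as a sketch on the sample data from the question, is the skip argument of read.csv(), which drops the given number of lines before parsing starts. The temporary file is only there to make the example reproducible.

```r
# Recreate the sample export, including the leading "sep=," line.
lines <- c(
  "sep=,",
  "Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client",
  "FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1",
  "FakeInitiative2,40 %,161.50,400,Platform,FakeClient2"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# skip = 1 ignores the "sep=," line; the next line is used as the header.
initData <- read.csv(path, skip = 1, header = TRUE,
                     stringsAsFactors = FALSE)

# Column names are made syntactically valid automatically (check.names = TRUE).
names(initData)
```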
Your file could be in UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file.
