Accessing a CSV file hosted on GitHub with R

I'd like to access data from GitHub repositories directly from R.
When importing the data, I get this error:
cols(<!DOCTYPE html> = col_character()) 60 parsing failures.
How can I fix that? My code:
library(readr)
library(curl)
data <- read_csv(curl("https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv"))

The key, as @karthik commented, is to change the URL by replacing https://github.com/ with https://raw.githubusercontent.com/ and dropping the blob/ part.
i.e. changing:
https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv
to:
https://raw.githubusercontent.com/datatto/AU25-de-Mayo/master/AU_F_Properati_v2.csv
(carefully compare the URLs and you'll spot the differences)
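If you do this often, the rewrite is easy to script; here is a minimal sketch (the helper name github_blob_to_raw is my own, not part of any package):
github_blob_to_raw <- function(blob_url) {
  # swap the host, then drop the "blob/" path segment
  raw_url <- sub("^https://github\\.com/", "https://raw.githubusercontent.com/", blob_url)
  sub("/blob/", "/", raw_url, fixed = TRUE)
}
github_blob_to_raw("https://github.com/datatto/AU25-de-Mayo/blob/master/AU_F_Properati_v2.csv")
# [1] "https://raw.githubusercontent.com/datatto/AU25-de-Mayo/master/AU_F_Properati_v2.csv"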
Besides that, it seems your .csv file is formatted using ";" as the field separator and "," as the decimal separator; this is common with data in languages such as Spanish, where the comma is reserved as the decimal separator.
To parse the file properly, simply use read.csv2() or read_csv2(), e.g.:
library(tidyverse)
mydata <- read_csv2("https://raw.githubusercontent.com/datatto/AU25-de-Mayo/master/AU_F_Properati_v2.csv")

I had the same problem and found a short answer here: One way is to use this line of code in R:
readr::read_csv("https://raw.github.com/user/repository/branch/file.name")

Related

read.csv: read a CSV file where cells are wrapped with ="{value}"

I export my CSV file with Python, and numbers are wrapped as ="10000000000" in the cells, for example:
name,price
"something expensive",="10000000000",
In order to display the numbers correctly, I prefer to wrap big numbers or strings of digits (such as order IDs) in this format, so someone could open the file directly without reformatting the column.
It displays correctly in Excel or Numbers, but when I import it into R using read.csv, the cell values show up as =10000000000.
Is there any solution to this?
Thank you
How about:
yourcsv <- read.csv("yourcsv.csv", stringsAsFactors = FALSE)
# strip the leading "=" (and any leftover quotes) from the price column, then convert to numeric
yourcsv$price <- as.numeric(gsub('[="]', "", yourcsv$price))
Also, in my experience read_csv() from the readr package (part of the tidyverse) reads data in much faster than read.csv(), and I think it also has more logic built in for non-ideal cases, so it may be worth trying.
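If you go the readr route, a rough sketch for stripping the wrapper might look like this (the file and column names are taken from the example above):
library(readr)
yourcsv <- read_csv("yourcsv.csv")
# parse_number() ignores the leading = and any stray quotes and returns a numeric vector
yourcsv$price <- parse_number(as.character(yourcsv$price))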

Headers changing when reading data from csv or tsv in R

I'm trying to read a data file into R, but every time I do, R changes the headers. I can't see any way to control this in the documentation for the read functions.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in Excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv, possibly like this:
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names exactly as they appear in the file (e.g. "1985" rather than "X1985"). Note that the byte order mark will then show up unmangled at the start of the first column name unless you also tell R about it, e.g. with fileEncoding = "UTF-8-BOM".
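Putting the two fixes together, a minimal sketch (file name assumed, and assuming the BOM is the only oddity in the header row):
my_data <- read.csv("my_data.csv", check.names = FALSE, fileEncoding = "UTF-8-BOM")
colnames(my_data)
# [1] "cod" "name_mun" "age_class" "1985" "1985M" "1985F" "1986" "1986M" "1986F"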

CSV separated by ';' has semicolons in some of its attributes and can't be parsed correctly

I've been downloading Tweets in the form of a .csv file with the following schema:
username;date;retweets;favorites;text;geo;mentions;hashtags;permalink
The problem is that some tweets have semi-colons in their text attribute, for example, "I love you babe ;)"
When I try to import this CSV into R, some of the records come out with the wrong schema.
I think this formatting error happens because the CSV parser finds a ; inside the text field and splits the row there, if you see what I mean.
I've already tried matching with the regex: (;".*)(;)(.*";)
and replacing it with ($1)($3) until no more matches are found, but the error persists when parsing the CSV.
Any ideas on how to clean this CSV file? Or why the CSV parser is misbehaving?
Thanks for reading
Edit1:
I think there is no problem with the structure other than a badly chosen separator (';'); look at this example record:
Juan_Levas;2015-09-14 19:59;0;2;"Me sonrieron sus ojos; y me tembló hasta el alma.";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528
This is a well-formatted record, but I think the semicolon in the text section (marked between "") forces the parser to split the text into 2 columns, in this case "Me sonrieron sus ojos" and "y me tembló hasta el alma.".
Is this possible?
Also, I'm using read.csv("data.csv", sep=';') to parse the CSV into a data frame.
Edit2:
How to reproduce the error:
Get the csv from here [~2 MB]: Download csv
Do df <- read.csv('twit_data.csv', sep=';')
Explore the resulting data frame (you can sort it by Date, Retweets or Favorites and you will see the inconsistencies in the parsing)
Your CSV file is not properly formatted: the problem is not the separator occurring in character fields, it's rather that the embedded " characters are not escaped.
The best thing to do would be to generate a new file with a proper format (typically: using RFC 4180).
If it's not possible, your best option is to use a "smart" tool like readr:
library(readr)
df <- read_csv2('twit_data.csv')
It does quite a good job on your file. (I can't see any obvious parsing error in the resulting data frame)
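If you want to double-check for silent failures, readr keeps a record of anything it could not parse as expected; problems() is part of readr:
library(readr)
df <- read_csv2('twit_data.csv')
problems(df)   # a tibble listing any rows/columns readr could not parse cleanly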

Problems with reading a txt file (EOF within quoted string)

I am trying to use read.table() to import this TXT file into R (it contains information about meteorological stations provided by the WMO):
However, when I try to use
tmp <- read.table(file=...,sep=";",header=FALSE)
I get this
EOF within quoted string
warning, and only 3514 of the 6702 lines appear in tmp. From a quick look at the text file, I couldn't find any seemingly problematic characters.
As suggested in other threads, I also tried quote="". The EOF warning disappeared, but still only 3514 lines are imported.
Any advice on how I can get read.table() to work for this particular txt file?
It looks like your data actually has 11548 rows. This works:
read.table(url('http://weather.noaa.gov/data/nsd_bbsss.txt'),
           sep = ';', quote = NULL, comment.char = '', header = FALSE)
Edit: updated according to @MrFlick's comments below.
The problem is the line endings: R does not recognize the "^M" (carriage return) characters. To load the file, you only need to specify the encoding, like this:
data <- read.table("nsd_bbsss.txt", sep = ";", header = FALSE, encoding = "latin1",
                   quote = "", comment.char = "", colClasses = rep("character", 14))
But line 8638 has more than 14 columns, which is different from the other lines and may lead to an error message.
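To locate lines like that one before reading, count.fields() from base R reports the number of fields per line; a quick sketch (same file and separator as above):
n_fields <- count.fields("nsd_bbsss.txt", sep = ";", quote = "", comment.char = "")
which(n_fields != 14)   # line numbers whose field count differs from the expected 14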

R how to read a .csv file with different separators

ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140, ok thats it you win.
2,0,Sentiment140, i think mi bf is cheating on me!!! T_T
3,0,Sentiment140," I'm completely useless rt now. Funny, all I can do is twitter. "
How would you read a csv file like this into R?
Read a CSV with read.csv(). You can specify sep to be whatever you need it to be, but as noted below, "," is the default value for the separator, and fields quoted with " (like the long text in your third row) keep their internal commas.
R: Data Input
For example, to read a CSV file with a comma separator into a data frame, manually choosing the file:
df <- read.csv(file.choose())
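As a quick sanity check, you can paste the sample above into read.csv() via its text argument and confirm that the quoted field keeps its internal commas:
sample_csv <- 'ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140, ok thats it you win.
2,0,Sentiment140, i think mi bf is cheating on me!!! T_T
3,0,Sentiment140," I\'m completely useless rt now. Funny, all I can do is twitter. "'
df <- read.csv(text = sample_csv)
df$SentimentText[3]   # still one field, commas and all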
