loading a raw file from GitHub in R

I am trying to load a codebook from GitHub in RStudio. The url is [here][1]. It is an .md file, but I want to load its raw version (the file view has a "Raw" tab at the top right). I tried the code below, but it does not work. Could anyone tell me how to do that? Thanks a lot!
cddf <- url("https://github.com/HimesGroup/BMIN503/blob/master/DataFiles/NHANES_2007to2008_DataDictionary.md")
cd <- read.table(cddf)
Update:
When I changed the code to:
codebook <- read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md", skip = 4, sep = "|", header = TRUE)
R successfully read most of the file, but sep = "|" did not split the rows correctly for two variables, INDHHIN2 and MCQ010. Can anyone help figure out why? Thanks!

There are two issues here.
First, the raw file is available at https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md. However, read.table is not going to be able to read that file without some help: read.table expects a delimited text file, and this is a table marked up in Markdown. This comes close:
read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md",
           skip = 4, sep = "|", header = TRUE)
but it will still need some cleanup: the "|" borders add junk first and last columns, and the first data row is the Markdown separator line, which should be deleted.
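That cleanup can be sketched on a small in-line Markdown table (the variable names below are made up for illustration):

```r
# A toy Markdown table standing in for the codebook
md <- c("| Variable | Label |",
        "| --- | --- |",
        "| RIDAGEYR | Age in years |",
        "| RIAGENDR | Gender |")

tab <- read.table(textConnection(md), sep = "|", header = TRUE,
                  stringsAsFactors = FALSE)
tab <- tab[-1, 2:(ncol(tab) - 1)]  # drop the "---" separator row and the empty border columns
tab[] <- lapply(tab, trimws)       # strip the padding spaces around each cell
```

The same three cleanup lines should apply to the data frame read from the raw GitHub url.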


R imports not the whole data from csv. How to fix it?

I faced an issue importing data from a csv file into R.
Some basic information on the file. There are 1941 rows and 78 columns.
When I import the data using the following command
data = read.csv("data.csv", header = TRUE, sep = ";")
I get only 824 rows.
But when I convert the file into the xlsx format and then import the xlsx file using this command
data = read_excel("data.xlsx")
everything is ok.
I cannot fix the problem because I don't know where it is.
Can you help me please?
P.S.
Unfortunately I cannot share the file with you, as it is top secret.
The solution to the problem is to add the parameter quote = "" to the call, like this:
data = read.csv("data.csv", header = TRUE, sep = ";", quote = "")
That's it.
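To see why quote = "" matters, here is a minimal reproduction (the data is made up): a quote character at the start of a field makes read.csv treat everything up to the next quote, including line breaks, as one field, so whole rows silently disappear.

```r
txt <- c("id;comment",
         "1;ok",
         "2;\"a field starting with a stray quote",
         "3;this row gets swallowed into the one above",
         "4;until a closing \" turns up here")

bad  <- read.csv(textConnection(txt), sep = ";")              # default quoting
good <- read.csv(textConnection(txt), sep = ";", quote = "")  # quoting disabled

nrow(bad)   # 2 -- rows 3 and 4 were absorbed into row 2
nrow(good)  # 4 -- every line becomes its own row
```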
Post the error/warning message, if any.
When you open your data, check whether columns contain problematic characters such as tabs, commas, or newlines.
I would suggest reading the file line by line as text to check for the issue.
Without seeing what in the data is causing the problem, I doubt anyone can give you a solution.
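That line-by-line check is easy to run with count.fields(), shown here on made-up lines; for the real file you would pass "data.csv" with sep = ";" (and quote = "" to see the raw field count of every line):

```r
lines <- c("a;b;c",   # header: 3 fields
           "1;2;3",
           "4;5",     # a broken line with only 2 fields
           "6;7;8")

fields <- count.fields(textConnection(lines), sep = ";")
which(fields != 3)  # 3 -- the offending line number
```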

reading US census products into R

I am trying to import the SIPP 2014 panel data into R but am having some trouble.
It can be found here:
https://www.census.gov/programs-surveys/sipp/data/2014-panel/wave-1.html
Normally, this would be a pretty simple process and I could just use
data = read.csv("pu2014w1.dat")
The issue stems from the size of the dataset and the fact that I know neither the delimiter nor how the column headers are laid out. Sadly, I cannot find documentation for importing this file into R.
Any help would be greatly appreciated.
It seems that the file https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.dat.gz, after unzipping, is a fixed-width format text file. So, to read it, we can use:
library(readr)
library(dplyr)   # for %>% and mutate(), used below

# The SAS input file lists the start and end position of every field
foo <- read_delim("https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.sas",
                  delim = " ", col_names = FALSE, skip = 6)
bar <- fwf_positions(start = foo$X4, end = foo$X6)
bar2 <- bar[-c(5223:5231), ]    # drop rows that are not field definitions
bar3 <- bar2 %>% mutate(width = end - begin)
foobar <- fwf_widths(bar3$width)
read_fwf("pu2014w1.dat.gz", col_positions = foobar)
Note that when reading a fixed-width text file, we need to specify the positions of the fields. I do this by manipulating the contents of the SAS input file, which describes the field positions (for use with SAS). Also, I had to download the gzipped file before I could read it successfully; typically one can read directly from the url, and I am not sure why that didn't work here.
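The fixed-width mechanics themselves can be sketched with a made-up two-record file, using base R's read.fwf() (readr's read_fwf() with fwf_widths() works the same way):

```r
# Toy fixed-width data: a 4-character id followed by a 2-character age
tf <- tempfile(fileext = ".dat")
writeLines(c("A00123",
             "B00245"), tf)

d <- read.fwf(tf, widths = c(4, 2), col.names = c("id", "age"))
d$age  # 23 45
```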

fread issue with archive package unzip file in R

I am having issues using fread after unzipping a file with the archive package in R. The data I am using can be downloaded from https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data
The code is as follows:
library(dplyr)
library(devtools)
library(archive)
library(data.table)
setwd("C:/jc/2017/13.Lafavorita")
hol<-archive("./holidays_events.csv.7z")
holcsv<-fread(hol$path, header = T, sep = ",")
This code gives the error message:
File 'holidays_events.csv' does not exist. Include one or more spaces to consider the input a system command.
Yet if I try:
holcsv1<-read.csv(archive_read(hol),header = T,sep = ",")
It works perfectly. I need to use fread because the other files I need to open are too big for read.csv. I am puzzled because my code was working fine a few days ago. I could unzip the files manually, but that is not the point. I have tried to solve this problem for hours, but I cannot find anything useful in the documentation. I found this: https://github.com/yihui/knitr/blob/master/man/knit.Rd#L104-L107 , but I cannot understand it.
Turns out the answer is rather simple, but I found it by luck. After calling archive(), you need to pass the result to archive_extract(). So in my case, I should add the following to the code:
hol1 <- archive_extract(hol)
Then change the last line to:
holcsv <- fread(hol1$path, header = TRUE, sep = ",")

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I had reversed the arguments of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() made it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive to disk (create the ./data/ directory first if needed)
dir.create("./data", showWarnings = FALSE)
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still some parsing problems, but I am not sure they should be dealt with in this question.

Saving text from webpage for word cloud in R

I'm trying to practice making word clouds in R. I've seen the process nicely explained on sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube, so I thought I'd pick some random long document to practice on myself.
I chose the script for Good Will Hunting, available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did was copy it into Notepad++ and start removing blank lines, names, etc., to clean up the data before saving. Saving as a .csv file doesn't seem to be an option, so I saved it as a .txt file, and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is: given an html document, what is the best way to save it so it can be read in and used for something like this? I know that with the rvest package you can read in webpages. The word cloud tutorials have used .csv files, so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route:
library(rvest)
test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!
Here's one way:
library(rvest)
library(wordcloud)
test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5,"Spectral"))
Which gives a word cloud of the script's most frequent words.
Here is a simple example:
library(wordcloud)
text <- scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table <- as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)
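The counting step that feeds wordcloud() is just base R (the sample sentence here is made up):

```r
text  <- "the quick brown fox jumps over the lazy dog and the end"
words <- strsplit(tolower(text), "\\s+")[[1]]  # split on whitespace
freq  <- sort(table(words), decreasing = TRUE) # word frequency table
freq[["the"]]  # 3 -- "the" is the most frequent word
```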
