Download an xls file from a URL into a dataframe (RCurl)? - r

I'm trying to download the following url into an R dataframe:
http://www.fantasypros.com/nfl/rankings/qb.php/?export=xls
(It's the 'Export' link on the public page: http://www.fantasypros.com/nfl/rankings/qb.php/)
However, I'm not sure how to parse the data. I'm also looking to automate this and run it weekly, so any thoughts on how to build this into a weekly-access workflow would be greatly appreciated! I have been Google searching and scouring Stack Overflow for a couple of hours now to no avail... :-)
Thank you,
Justin
Attempted Code:
library(RCurl)
getURL("http://www.fantasypros.com/nfl/rankings/qb.php?export=xls")
This just gives me a string that starts like:
[1] "FantasyPros.com \t \nWeek 8 - QB Rankings \t \nExpert Consensus Rankings (ECR) \t \n\n Rank \t Player Name \tTeam \t Matchup \tBest Rank \t Worst Rank \t Ave Rank \t Std Dev \t\n1\tPeyton Manning\tDEN\t vs. WAS\t1\t5\t1.2105263157895\t0.58877509625419\t\t\n2\tDrew Brees\tNO\t vs. BUF\t1\t7\t2.6287878787879\t1.0899353819483\t\t\n3\tA...

Welcome to R. It sounds like you love to do your analysis in Excel. That's completely fine, but given that you are asking to crawl data from the web AND are asking about R, I think it's safe to assume that you will start to find that programming your analyses is the way to go.
That said, what you really want to do is crawl the web. There are tons of examples of how to do this with R, right here on SO. Look for things like "web scraping", "crawling", and "screen scraping".
Ok, dialogue aside. Don't worry about grabbing the data in XL format. You can parse the data directly with R. Most websites use a consistent naming convention, so using a for loop and building the URLs for your datasets will be easy.
Below is an example of parsing your page directly with R into a data.frame, which acts very similarly to tabular data in XL.
## load the packages you will need
# install.packages("XML")
library(XML)
## Define the URL -- you could dynamically build this
URL = "http://www.fantasypros.com/nfl/rankings/qb.php"
## Read the tables from the page into R
tables = readHTMLTable(URL)
## how many do we have?
length(tables)
## look at the first one
tables[[1]]
## that's not it
## let's look at the 2nd table
tables[[2]]
## bring it into a dataframe -- note the double brackets, which extract
## the table itself rather than a one-element list
df = as.data.frame(tables[[2]])
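To address the weekly-automation part of the question, here is a minimal sketch that builds the URLs in a loop. Note that the rb/wr/te page slugs are assumptions extrapolated from the qb.php URL, and the table index of 2 is taken from the inspection above:
## loop over position pages, pulling the 2nd table from each
positions <- c("qb", "rb", "wr", "te")
rankings <- lapply(positions, function(pos) {
  url <- sprintf("http://www.fantasypros.com/nfl/rankings/%s.php", pos)
  readHTMLTable(url)[[2]]  # verify the table index for each page
})
names(rankings) <- positions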
If you are using R for the first time, you can install external packages pretty easily with the command install.packages("PackageNameHere"). However, if you are serious about learning R, I would look into using the RStudio IDE. It really flattened the learning curve for me on a ton of levels.

You can probably just use download.file and read.xls from the gdata library. I don't think you can skip lines when reading in .xls files, but you can supply a pattern argument so that it skips down to the row containing that pattern and reads from there.
library(gdata)
download.file("http://www.fantasypros.com/nfl/rankings/qb.php?export=xls", destfile="file.xls")
ffdata <- read.xls("file.xls", header=TRUE, pattern="Rank")
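If you schedule a script around this (with cron or the Windows Task Scheduler, say), stamping each download with the date keeps a history of weekly files; a small sketch:
## date-stamped destination file for a weekly run
dest <- sprintf("qb_rankings_%s.xls", Sys.Date())
download.file("http://www.fantasypros.com/nfl/rankings/qb.php?export=xls",
              destfile=dest)
ffdata <- read.xls(dest, header=TRUE, pattern="Rank")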

Related

Convert data dictionary from Word to Excel with R

I got a data dictionary from a data provider; it contains hundreds of variables across different Word files and looks like this:
In order to add this dictionary to my current dataset, I need to convert it to a certain format in Excel. For example, for the first variable, "intarm_actual", I would like to create columns in a spreadsheet: the "variable" column holds the words at the top left, the "label" column stores the content of "label" (for this variable it is NA, but for the second it should be "tpe_lab"), the "type" column stores the words "string (str2)", the "value" column stores "4", the "missing" column stores "46/102", and the "tabulation" column stores "46 "", 14 "RO", 14 "RV", 14 "TO", 14 "TV"". Ideally, it should look like this:
Could anyone who happens to have done this before provide some suggestions? (I would appreciate any pointer: which package I should use, related posts or articles I should read, similar code I can learn from...) Can the R package "labelled" handle this type of task? Thanks a lot~~!!
Update:
I used the qdapTools package to import one of the docx files; it looks like this:
How can I retrieve the desired words and assign them to the right place in my spreadsheet? Thanks~~!
Update 2:
The issue has been solved another way.
In case someone encounters a similar situation: 1) this type of codebook file is generated by Stata; 2) instead of parsing this complex text file, an alternative solution is to use the R package "codebook" to generate a new .csv codebook that contains the same information and more.
Assuming that you indeed have zero clue, I would recommend getting started with regular expressions in R. I often use the R package stringr to work with regular expressions, and you can find the respective cheat sheet here. Regular expressions will allow you to, e.g., select the word following a ":", as in the sketch below.
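A minimal sketch with stringr (the line content here is an invented example, not your actual file):
library(stringr)
## str_match() returns a matrix whose second column is the capture
## group, i.e. the word following "label:"
line <- "label: tpe_lab"
str_match(line, "label:\\s*(\\S+)")[, 2]
#> [1] "tpe_lab"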
I have never worked with Word documents in R, but I guess that there are packages out there that allow you to read Word documents into R. Just Google them. :) I am sure they also have good instructions on how to use them.
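One such option is the officer package; a minimal sketch, with a hypothetical file name and untested against your documents:
library(officer)
## read_docx() loads the document; docx_summary() flattens it into a
## data frame with one row per paragraph or table cell
doc <- read_docx("intarm_codebook.docx")  # hypothetical file name
content <- docx_summary(doc)
head(content$text)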
Another issue you might encounter is encoding. If the text does not read in correctly, e.g. you see strange character combinations, encoding is most likely the source of the problem.
Once you have looked at these things and started working on your own code, you will be able to ask more precise questions.

getting R to recognize a folder full of Word files for koRpus analysis

I need some help with loading text-file data into R for analysis with packages like koRpus.
The problem I am facing is getting R to recognize a folder full of Word files (about 4,000) as data on which I can then have koRpus perform analyses like Coleman-Liau indexing. If at all possible, I prefer to make this work with Word files. The key problem is getting R to recognize the text (Word) files in bulk (that is, all at the same time) so that koRpus can do its thing with those files.
My attempts to make this work have all been in vain, but I know that packages like koRpus would be limited in usefulness if there were no way to get the package to do its work on a large collection of files all at once.
I hope this problem will make sense to someone, and that there is a tenable solution to it.
Thanks,
Gordon
Looks like the readtext package should be able to help you out.
library(readtext)
Just specify the folder in the readtext() call. Like so:
doc_df <- readtext("doc_files/")
I am not familiar with the koRpus package, but the text column in the created dataframe should contain what is needed for whatever function you want to use next.
doc_df$text
#> [1] "Test1: a little bit of text" "Test2: no further text"
#> [3] "Test3: lorem ipsum bla bla"
In response to your comments:
It looks like your folder has several kinds of files in it and you are trying to filter them so that only docx files are processed. The readtext command seems to support that kind of filtering, but the documentation says that this depends on the OS. My suggestion is to instead filter the files in the folder with R's dir() command before calling readtext():
## anchor the pattern to the extension so that only .docx files match
a <- dir("doc_files/", pattern = "\\.docx$", full.names = TRUE)
doc_df <- readtext(a)
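I am not familiar with koRpus either, but for completeness, here is a hedged sketch of what the hand-off might look like, based on koRpus's documented tokenize() and coleman.liau() functions (verify the details against the koRpus manual):
library(koRpus)
library(koRpus.lang.en)  # English language support lives in a separate package
## format = "obj" tells tokenize() that the input is a character
## vector rather than a file path
tok <- tokenize(doc_df$text[1], format = "obj", lang = "en")
coleman.liau(tok)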

How to view workspace in R

I'm trying to learn R (via this video) and immediately ran into problems. As directed, I created a dataset in Excel with column A being numbers 1 through 10 and column B being random integers. Saved as .xlsx and .csv.
Next I tried to read the data in R with
> data1 <- read.table(file.choose(), header=TRUE, sep="\t")
and that's as far as I got. There's no Workspace like in the video, or an option anywhere to view it. There are many windows in the video, but I only have "R Console".
So, how do I get the workspace?
You may be looking for "RStudio." It's a user-friendly shell that sits on top of R... It shows you your current workspace, etc.
http://www.rstudio.com/
Also, you want to use sep="," not sep="\t" if you have a CSV. \t is tab-delimited...
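Or simply use read.csv, which defaults to a comma separator:
data1 <- read.csv(file.choose(), header=TRUE)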
I think you are using the R basic program (not basic in core functionality, just basic in terms of user interface features) that you probably downloaded from http://www.r-project.org/
The video you are watching is using a more productive user interface called RStudio. You can download it for free from http://www.rstudio.com/. It works the same for all your purposes.

Using R, import data from web

I have just started using R, so this may be a very dumb question. I am trying to import the data using:
emdata=read.csv(file="http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV",header=TRUE)
My problem is that it reads the csv file into a single column (by the way, I'm only using the lottery data because it is publicly available to download; it's an exercise to understand what I can and can't do in R) instead of splitting it into however many columns of data there are. Would someone mind helping out, please, even though this is trivial?
Hm, that's kind of obnoxious for a page purporting to be in csv format. You can skip the first 5 lines, which will cause R to read (most of) the rest of the file correctly.
emdata=read.csv(file=...., header=TRUE, skip=5)
I got the number of lines to skip by looking at the source. You'll still have to remove the cruft in the middle and end, and then clean up the columns (they'll all be factors because of the embedded text).
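For example, a column that came in as a factor can be coerced back to numeric via character (coercing directly would return the underlying factor codes); "N1" is a hypothetical column name here:
## convert factor -> character -> numeric to recover the values
emdata$N1 <- as.numeric(as.character(emdata$N1))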
It would be much easier to save the page to your hard disk, edit it to remove all the useless bits, then import it.
... to answer your REAL question, yes, you can import data directly from the web. In general, wherever you would read a file, you can substitute a fully qualified URL -- R is smart enough to do the Right Thing[tm]. This specific URL just happens to be particularly messy.
You could read text from the given url, filter out the obnoxious lines and then read the result as CSV like so:
lines <- readLines(url("http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"))
read.csv(text=lines[grep("([^,]*,){5,}", lines)])
The above regular expression matches any lines containing at least five commas.

How to write multiple tables, dataframes, regression results, etc. to one Excel file?

I am looking for an easy way to get objects into MS Excel.
(I am using the preinstalled "Puromycin"-dataset for the examples)
I would like to place the contents of these objects to a single excel file:
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
By "contents" i mean what is shown in the console when i press enter. I dont know what to call it.
I tried to do this:
sink("datafilewhichexcelhopefullyunderstands.csv")
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
sink()
This gives me a file with the CSV extension; however, when I open the file in Notepad, there is no comma separation. That means I can't get Excel to open it properly. By properly, I mean that each number is in its own cell.
Others have suggested this for a similar problem
https://stackoverflow.com/a/13007555/1831980
But as a novice i feel that the solution is too complex, and I am hoping for a simpler method.
What I am doing now is this:
write.table(Puromycin, file="clipboard" , sep=";" , row.names=FALSE )
write.table(summary(Puromycin$conc), file="clipboard" , sep=";" , row.names=FALSE )
... etc...
But this requires a lot of copying and pasting, which I hope to eliminate.
Any help would be appreciated.
write.table and its friends are intended to write out columns of data separated by whatever separator is specified. Your output mixes several data types, because summary and lm print structured summaries rather than plain tables.
For writing the data values out, you can use write.csv on a data frame and then open with Excel. For example, Puromycin is already a data frame (which you can see with str(Puromycin)) so you can just write it out directly:
write.csv(file = "some file.csv", x = Puromycin)
Which will go into the current working directory (which can be determined with getwd()).
To write out/save the results of the regression model is a bit more of a challenge. You could definitely use sink as you did, but specify an extension of .txt on your file so a text editor can open it. There are fancier methods (Sweave, knitr) which you might want to look into in the long run, as they can write really nice reports automatically.
In the meantime, get to know str(any R object) as it will be your friend. You can see all the objects in your workspace with ls().
This will only be helpful if you are prepared to use Excel's Data/Text to Columns functions:
## list() keeps each object intact (c() would split the data frame into
## its columns); invisible() stops sapply's own return value from being
## captured as well
capture.output( invisible( sapply( list(Puromycin,
                 summary(Puromycin$rate),
                 summary(Puromycin$conc),
                 table(Puromycin$state),
                 lm( conc ~ rate , data=Puromycin) ), FUN=print) ),
  file="datafilewhichexcelhopefullyunderstands.csv", append=TRUE)
The problem being that Excel will not read the whitespace as a cell separator unless you specifically tell it to. You can (and I have often done so) use the fixed-field input features offered by the Text to Columns dialog interface.
Your simplest option may be to use the RExcel tool, which transfers information between R and Excel. However, it is not free software.
The XLConnect package is another option; it can be used to write information directly to an Excel file, as sketched below.
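A minimal sketch of the XLConnect approach, writing each object to its own sheet of one workbook (the file and sheet names are just for illustration):
library(XLConnect)
wb <- loadWorkbook("puromycin_results.xlsx", create = TRUE)  # hypothetical file name
createSheet(wb, name = "data")
writeWorksheet(wb, Puromycin, sheet = "data")
createSheet(wb, name = "state_counts")
writeWorksheet(wb, as.data.frame(table(Puromycin$state)), sheet = "state_counts")
saveWorkbook(wb)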
The tricky part is the lm call. lm does not return a simple vector, matrix, or data frame (all of which are easy to convert to csv or send directly) and there is not a clear way to convert the various parts of a list to cells in a spreadsheet. What would be better is to use extractor functions to pull the important parts from the return of lm or the summary of the lm object and send those to Excel using the other tools.
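For instance, the coefficient table of the fit is a plain matrix that write.csv handles directly:
fit <- lm(conc ~ rate, data = Puromycin)
write.csv(coef(summary(fit)), file = "lm_coefficients.csv")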
If you can tell us more about why you want the numbers in Excel and what you plan to do with them afterwards, then we may be able to offer better help (you may be able to skip Excel completely).
If the main goal is to share output with others then you should really look at the knitr package (or other related packages). This will not create Excel files, but can be used (along with the pandoc program and possibly other tools) to create a report file in a format that is easy to share with others not familiar with R. You could put everything into a .pdf file or a .docx file (the latter is read by MS Word and would have tables which can be edited using Word). There is not a simple way to get edits back into R, but with track changes you can easily see what changes have been made and hand-edit your R script/template accordingly.
