Using R, import data from web - r

I have just started using R, so this may be a very dumb question. I am trying to import the data using:
emdata=read.csv(file="http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV",header=TRUE)
My problem is that it reads the csv file into a single column ( by the way, the lottery data is simply because it is publicly available to download - using as an exercise to understand what I can and can't do in R), instead of formatting it into however many columns of data there are. Would someone mind helping out, please, even though this is trivial

Hm, that's kind of obnoxious for a page purporting to be in csv format. You can skip the first 5 lines, which will cause R to read (most of) the rest of the file correctly.
emdata=read.csv(file=...., header=TRUE, skip=5)
I got the number of lines to skip by looking at the source. You'll still have to remove the cruft in the middle and end, and then clean up the columns (they'll all be factors because of the embedded text).
It would be much easier to save the page to your hard disk, edit it to remove all the useless bits, then import it.
... to answer your REAL question, yes, you can import data directly from the web. In general, wherever you would read a file, you can substitute a fully qualified URL -- R is smart enough to do the Right Thing[tm]. This specific URL just happens to be particularly messy.

You could read text from the given url, filter out the obnoxious lines and then read the result as CSV like so:
lines <- readLines(url("http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"))
read.csv(text=lines[grep("([^,]*,){5,}", lines)])
The above regular expression matches any lines containing at least five commas.

Related

How do I get EXCEL to interpret character variable without scientific notation in R using fwrite?

I have a relatively simple issue when writing out in R with fwrite from the data.table package I am getting a character vector interpreted as scientific notation by Excel. You can run the following code to create the data issue:
#create example
samp = data.table(id = c("7E39", "7G32","5D99999"))
fwrite(samp,"test.csv",row.names = F)
When you read this back into R you get values back no problem if you have scinote disable. My less code capable colleagues work with the csv directly in excel and they see this:
They can attempt to change the variable to text but excel then interprets all the zeros. I want them to see the original "7E39" from the data table created. Any ideas how to avoid this issue?
PS: I'm working with millions of rows so write.csv is not really an option
EDIT:
One workaround I've found is to just create a mock variable with quotes:
samp = data.table(id = c("7E39", "7G32","5D99999"))[,id2:=shQuote(id)]
I prefer a tidyr solution (pun intended), as I hate unnecessary columns
EDIT2:
Following R2Evan's solution I adapted it to data table with the following (factoring another numerical column, to see if any changes occured):
#create example
samp = data.table(id = c("7E39", "7G32","5D99999"))[,second_var:=c(1,2,3)]
fwrite(samp[,id:=sprintf("=%s", shQuote(id))],
"foo.csv", row.names=FALSE)
It's a kludge, and dang-it for Excel to force this (I've dealt with it before).
write.csv(data.frame(id=sprintf("=%s", shQuote(c("7E39", "7G32","5D99999")))),
"foo.csv", row.names=FALSE)
This is forcing Excel to consider that column a formula, and interpret it as such. You'll see that in Excel, it is a literal formula that assigns a static string.
This is obviously not portable and prone to all sorts of problems, but that is Excel's way in this regard.
(BTW: I used write.csv here, but frankly it doesn't matter which function you use, as long as it passes the string through.)
Another option, but one that your consumers will need to do, not you.
If you export the file "as is", meaning the cell content is just "7E39", then an auto-import within Excel will always try to be smart about that cell's content. However, you can manually import the data.
Using Excel 2016 (32bit, on win10_64bit, if it matters):
Open Excel (first), have an (optionally empty) worksheet already open
On the ribbon: Data > Get External Data > From Text
Navigate to the appropriate file (CSV)
Select "Delimited" (file type), click Next, select "Comma" (and optionally deselect any others that may default to selected), Next
Click on the specific column(s) and set the "Default data format" to "Text" (this will need to be done for any/all columns where this is a problem). Multiple columns can be Shift-selected (for a range of columns), but not Ctrl-selected. Finish.
Choose the top-left cell to import/paste the data (or a new worksheet)
Select Properties..., and deselect "Save query definition". Without this step, the data is considered a query into an external data source, which may not be a problem but makes some things a little annoying. (For example, try to highlight all data and delete it ... Excel really wants to make sure you know what you're doing there.)
This method provides a portable solution. It "punishes" the Excel users, but anybody/anything else will still be able to consume the files directly without change. The biggest disadvantage with this method is that you won't know if somebody loads it incorrectly unless/until they get odd results when the try to use the data and some fields are silently converted.

"arules" library's "read.transaction()" reads in CSV files with an additional, blank column for every transaction

When you attempt to read CSV files that aren't the default groceries.csv, every transaction has an additional entry in it — a blank space — which will mess up all of the calculations for analysis (and even crash R if your CSV file is big enough). I've tried to insert NA's into all of the blank cells in my CSV file, but I cannot find a way to remove all of them within the read.transactions() command (remove duplicates leaves a single NA). I haven't found a trustworthy way to fix this in any of the other questions on stackoverflow, nor anywhere else on the internet.
Example entry:
> inspect(trans[1:5])
items
1 {,
FACEBOOK.COM,
Google,
Google Web Search}
It is hard to say. I assume you read the data with read.transactions(). Does your CSV file have leading white spaces in some/all lines? You could try to use the cols parameter in read.transactions() to fix the problem.
An example with data and the code to replicate the problem would help.

Excel data organized in multiple nested rows, can R read it?

Please see the picture. I've started using R, and know how/that it can read files from Excel, but can it read something formatted like this?
http://www.flickr.com/photos/68814612#N05/8632809494/
(my apologies, upload was not working for me)
Elaborating on some of what's in the comments:
If you load the file into Excel, you can save it as a fixed-width or comma-delimited text file. Either should be easy to read into R.
The following may be obvious to you already.
(First, a question: Are you sure that you can't get the data in a format that has one set of data per line? Is it possible that the file you're getting was generated from a different file format that is more conducive to loading the data into R?)
Whether you should start rearranging the data in R or instead manipulate the raw text depends on what comes naturally to you (or to people you have around who can help). For me, personally, I would rearrange the text file outside of R before loading it into R. That's what's easiest for me. Perl is a great language for this purpose, but you could also do it with Unix shell scripts if that's accessible to you, or using a powerful editor such as Vim or Emacs. If you have no preference, I'd suggest Perl. If you have any significant programming experience, you'll be able to learn what you need. On the other hand, you're already loading it into R, so maybe it would be better to process the data there.
For example, you could execute a loop that goes the text file line by line and does something like this:
while (still have lines to read) {
read first header line into an vector if this is the first time through the loop
otherwise, read it and throw it away
read data line 1 into an vector
read second header line into vector if this is the first time
otherwise, read it and throw it away
read data line 2 into an vector
read third header line into vector if this is the first time
otherwise, read it and throw it away
read data line 3 into an vector
if this is first time through, concatenate the header vectors; store as next row
in something (a file, a matrix, a dataframe, etc.)
concatenate the data vectors you've been saving, and store as next row in same thing
}
write out the whole 2D data structure
Or if the headers will never change, then you could just embed them literally into the script before the loop, and throw them out no matter what. That will make the code cleaner. Or read the first few lines of the file separately to get the headers, and then have a separate script to read the data and add it to the file with the headers in it. (The headers will probably be useful in R, so I would suggest preserving them at the top of the text file.)

Converting .pdf files to excel (.xls)

A friend of mine doing an internship asked me 2 hours ago if I could help him avoid to do manually 462 pdf file to .xls using free online soft.
I thought of a shell script using unoconv, but I didn't find out how to use it properly, and I am not sure if unoconv can solve this problem since it mainly converts file to pdf, not the reverse thing.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there's a fair few of them (462).
It's worth pursuing, if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output across a sample of the PDF's that you can reliably parse into a table structure.
There's plenty of tools around that target either direct or OCR based text extraction, just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name it does work on input PDFs. Downside is that it can be a bit flakey and works on some PDF's but not others.
If you get this far, you'd then most likely then need to write a shell-script or program to convert that to a CSV. You can either open this directly via a spread-sheet or look for tools to convert this into XLS.
PS If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to created the PDFs It will save a lot of time and effort and lead to a way more accurate result.
Update An alternative to pstotext is renderpdf.pl command which is included in the Perl CAM::PDF module. More robust, but just reports text (x,y) position, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried and it works very well.

Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding in to this project. So a direct solution on .dta files is preferred.
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code, only capable of taking in the whole file. I believe that in general binary files are hard to read line by line, and .dta is a binary format. Also the native binary format of R doesn't allow to select a number of lines from the dataset while reading in.
In my humble opinion, you can better just create a set of test files from within Stata ( eg the Stata code sample 1000, count will give you a sample of 1000 observations from the loaded dataset), and work with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it different for every project, but something like this:
data creation .do file
blah blah blah
save using data/myfile.dta
save if uniform()<.05 using test_data/myfile.dta // or bsample, then save for panel data
analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt

Resources