Import XLS, readxl / gdata bring in DF with NA - r

I am trying to bring this .xls file into R: https://www.reit.com/sites/default/files/returns/MonthlyHistoricalReturns.xls
I've tried to bring it in directly from the url on a Windows machine. I've already run across the https versus http issues as well as the perl issue for Windows. To get around this, I've tried to run on ubuntu as well as downloading the file first.
My latest two attempts with readxl and gdata both produce a data frame, though neither one has any data in it. There are no error messages.
NAREIT <- readxl::read_xls("~/Downloads/MonthlyHistoricalReturns.xls")
This produces 38 observations of one variable, all NA.
NAREIT <- gdata::read.xls("~/Downloads/MonthlyHistoricalReturns.xls")
And this one produces 0 observations of 1 variable; "No data available in table" is the text written inside the only cell.
The file is admittedly ugly, with multiple unneeded header rows, merged cells, frozen views, etc. I've tried specifying ranges, columns, rows, rows to skip, column names, and so on: everything I could think of from the readxl and gdata documentation.
I can just cut the range I need, save it as CSV, and work with that. But, as I am likely to have to come back to this file regularly, I am looking for the 'right' way to open it. Any thoughts are much appreciated.

It looks like there are several header rows, so you need to decide what you would like to use as the header, or consult one of the Stack Overflow posts that show how to deal with two-line headers.
Anyway, I can import it like this and it seems to be just fine:
library(readxl)
MonthlyHistoricalReturns <- read_excel("MonthlyHistoricalReturns.xls", sheet = "Index Data", skip = 7)
I skipped the first 7 rows so that your header starts there (row 8 supplies the column names).
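If the workbook changes in future downloads, it can also help to inspect it before committing to a skip value; a small sketch along those lines, reusing the path from the question (the n_max peek is just to eyeball where the real header row starts):
library(readxl)

path <- "~/Downloads/MonthlyHistoricalReturns.xls"

excel_sheets(path)   # confirm the sheet is still called "Index Data"

# peek at the first rows, without column names, to see where the real header sits
peek <- read_excel(path, sheet = "Index Data", n_max = 10, col_names = FALSE)

NAREIT <- read_excel(path, sheet = "Index Data", skip = 7)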

Related

When I upload my excel file to R, the column titles are in the rows and the data seems all jumbled. How do I fix this?

Hi, literally day-one new coder here.
In the Excel sheet my data looks organized, but when I load the file into R it isn't read properly: the column headers end up in the rows and the data seems scrambled.
So far I have tried:
library(readxl)
dataset <-read_excel("pathname")
View(dataset)
Also tried:
dataset <-read_excel("pathname", sheet=1, colNames=TRUE)
Also tried to use the package openxlsx
but nothing is giving me the correct, organized data set.
I tried formatting my Excel to a CSV file, and the CSV file looks exactly like the data that shows up on R (both are messed up).
How should I approach this problem?
I deal with importing .xlsx files into R frequently. It can be challenging due to the flexibility of the Excel platform. I generally use readxl::read_xlsx() to fetch data from .xlsx files. My suggestions:
First, specify exactly the data you want to import with the range argument.
A cell range to read from, as described in cell-specification. Includes typical Excel
ranges like "B3:D87", possibly including the sheet name like "Budget!B2:G14"
Second, if there are merged cells or other formatting challenges in the column headers, I resort to setting col_names = FALSE and supplying clean names after import with names(df) <- c("first_col", "second_col") (see the sketch below).
Third, if there are merged cells elsewhere in the spreadsheet, I generally resort to "fixing" them in Excel (not ideal, but easier for my use case); however, others may have suggestions for a programmatic fix.
It may be helpful to provide a screenshot of your spreadsheet.
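A minimal sketch combining the first two suggestions; the file name, range, and column names below are hypothetical placeholders to adapt to the real spreadsheet:
library(readxl)

df <- read_excel(
  "my_workbook.xlsx",            # hypothetical file
  range     = "Sheet1!B3:D87",   # read only the block that actually holds the data
  col_names = FALSE              # ignore the merged-cell headers entirely
)

# supply clean names yourself after import
names(df) <- c("first_col", "second_col", "third_col")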

Trying to merge columns from multiple csv's, but the merged dataframe is coming up NULL

My problem seems to be two-fold. I am using code that has worked before. I re-ran my scripts and got similar outputs, but saved to a new location. I have changed all of my setwd lines accordingly. But, there may be an error with either setwd or the do.call function.
In R, I want to merge 25 CSVs that are located in a folder, keeping only certain columns.
My path is
/Documents/CODE/merge_file/2sp
So, I do:
setwd("/Documents/CODE")
but then I get an error saying it cannot change the working directory (this usually works fine). So then I set the working directory manually via the Session menu in RStudio.
The next script seems to run fine:
myMergedData2 <-
  do.call(rbind,
          lapply(list.files(path = "/Documents/CODE/merge_file/2sp"),
                 read.csv))
myMergedData2 ends up in the global environment, but it says it is NULL (empty), though the console makes it look like everything is ok.
I would then like to save just these columns of information but I can't even get to this point.
myMergedData2 <- myMergedData2[c(2:5), c(10:12)]
And then add this
myMergedData2 <- myMergedData2 %>%
  mutate(richness = 2) %>%
  select(richness, everything())
And then I would like to save
setwd("/Documents/CODE/merge_file/allsp")
write.csv(myMergedData2, "/Documents/CODE/merge_file/allsp/2sp.csv")
I am trying to merge these data so I can use ggplot2 to show how my response variables (columns 2-5) vary according to my independent variables (columns 10-12). I have 25 different parameter sets with 50 observations in each CSV.
OK, so the issue was that my Dropbox didn't have enough space, and I weirdly don't have permission to do what I was trying on my university's H drive. Bizarre, but an easy fix: increasing the Dropbox space allowed all of the CSVs to sync completely.
Sometimes the issue is minor!
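For reference, a sketch of the merge itself once the syncing/permission problem is out of the way; full.names = TRUE makes list.files return full paths, so read.csv no longer depends on the working directory. Note that the question's subsetting [c(2:5), c(10:12)] picks rows 2-5 and columns 10-12; if the intent was columns 2-5 and 10-12 as described, column-only indexing like the line below may be what was meant:
library(dplyr)

files <- list.files(path = "/Documents/CODE/merge_file/2sp",
                    pattern = "\\.csv$", full.names = TRUE)

myMergedData2 <- do.call(rbind, lapply(files, read.csv))

# keep the response columns (2-5) and the independent variables (10-12), then tag the set
myMergedData2 <- myMergedData2[, c(2:5, 10:12)] %>%
  mutate(richness = 2) %>%
  select(richness, everything())

write.csv(myMergedData2, "/Documents/CODE/merge_file/allsp/2sp.csv", row.names = FALSE)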

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2 GB .csv file with c. 3 million rows and 67 columns. I can fread() it all fine with data.table v.1.10.4-3, but v.1.11.0+ terminates at a row somewhere in the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on GitHub, but obviously I can't upload the 2 GB data file anywhere.
I need a way of extracting maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading the .csv file in?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in an ancient piece of software (SAP Business Objects) that was used to create the CSV. It is unsurprising that it causes an issue. However, it is surprising that data.table v.1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v.1.11.0+ could not. Could it have something to do with encoding or hidden technical characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line has 22 fields but 67 were expected. Could the warning be better by stating why 67 fields are expected at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point)? It gives you a hint of what to try (fill=TRUE), which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3, but I suspect it more likely stopped early there too, just without a warning. If so, not warning was a bug, and it is now fixed.
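A sketch of what that warning suggests; with fill = TRUE, fread pads the 22-field line out to 67 columns with NA instead of stopping (the file name and na.strings are the ones shown in the warning message above, which truncates the na.strings vector):
library(data.table)

DT <- fread("Data/FP17s with TCD in March 2018.csv",
            na.strings = c("#EMPTY"),   # the warning shows more values, truncated there
            fill = TRUE)                # pad short lines such as 138986 with NA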
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrows:
You mentioned that you have no problem reading the file with v.1.10.4-3, right? So use that version to skip most of the .csv and set nrows to the number of rows you want. Once you have that data.table, you can write that portion of the file out and you have a portable reproducible example.
For example:
DT <- fread("my_file.csv", skip = 138981, nrows = 10)
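To turn that slice into something shareable, one option (just a sketch, assuming the file still reads fully under v.1.10.4-3 and using a hypothetical file name) is to subset the rows around the known line number and write them back out with fwrite:
library(data.table)

full  <- fread("my_file.csv")          # full read with the old data.table version
repro <- full[138981:138990]           # ~10 rows around the problematic line 138986
fwrite(repro, "repro_example.csv")     # small, portable reproducible example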

Fread unusual line ending causing error

I am attempting to download a large database of NYC taxi data, publicly available at the NYC TLC website.
library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)
Executing the above code successfully downloads the data (which takes a few minutes), but it then fails to parse due to an internal error. I have tried removing header = T as well.
Is there a workaround to deal with the "unusual line endings" in fread?
Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.
It seems the issue might be caused by the presence of a blank line between the header and the data in the original .csv file. Deleting that line from the .csv using Notepad++ seemed to fix it for me.
Sometimes other options like read.csv/read.table behave differently, so you can always try those. (Maybe the source code tells you why; I haven't looked into that.)
Another option is to use readLines() to read in such a file. As far as I know no parsing or formatting is done there, so it is the most basic way to read a file.
Lastly, a quick fix: use the 'skip = ...' option in fread, or control where reading stops with 'nrows = ...'.
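A sketch of those two options on the downloaded file (the local file name here is assumed); readLines shows the raw lines with no parsing, and skip lets fread start below the blank line described above:
library(data.table)

# look at the raw first lines with no parsing at all
head(readLines("yellow_tripdata_2014-02.csv", n = 5))

# take the column names from the header line, then skip the header and the blank line
cols  <- strsplit(readLines("yellow_tripdata_2014-02.csv", n = 1), ",")[[1]]
feb14 <- fread("yellow_tripdata_2014-02.csv",
               skip = 2, header = FALSE, col.names = cols, nrows = 100000)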
There is something fishy with fread. data.table is the faster, more performance-oriented option for reading large files, but in this case the behavior is not optimal. You may want to raise this issue on GitHub.
I am able to reproduce the issue on the downloaded file even with nrows = 5 or nrows = 1, but only if I stick to the original file. If I copy-paste the first few rows and try those, the issue is gone. The issue also goes away if I read directly from the web with a small nrows. This is not even an encoding issue, hence my recommendation to raise an issue.
I tried reading 100,000 rows of the file using read.csv without an issue and in under 6 seconds.
feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)
header = T is a redundant argument here: fread detects the header automatically, and header = TRUE is already the default for read.csv, so it should not make a difference for either.

importing txt into R

I am trying to read an ftp file from the internet ("ftp://ftp.cmegroup.com/pub/settle/stlint") into R, using the following command:
aaa <- read.table("ftp://ftp.cmegroup.com/pub/settle/stlint", sep = "\t", skip = 2, header = FALSE)
The result shows the 8th, 51st, 65th, 71st, 72nd, 73rd, 74th, etc. rows of the resulting dataset with extra rows appended onto their ends. Basically, instead of returning
{row8}
{row9}
etc
{row162}
{row163}
It returns (adding in the quotes around the \n)
{row8'\n'row9'\n'etc...etc...'\n'row162}
{row163}
If it seems like I'm picking arbitrary numbers, run the code above and take a look at the actual FTP file on the internet (as of mid-day Feb 18) and you'll see I'm not: it really is appending 155-odd rows onto the end of the 8th row. So what I'm looking for is simply a way to read the file in without the random appending of rows. Thanks, and apologies in advance; I'm new to R and was not able to find a fix after a while of searching.
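One way to see exactly what read.table is being handed, as a first diagnostic step (just a sketch; readLines does no parsing, so the embedded line endings become visible):
# raw, unparsed lines from the FTP file
raw <- readLines("ftp://ftp.cmegroup.com/pub/settle/stlint")

length(raw)     # how many physical lines the file really contains
raw[8:10]       # inspect the rows where the concatenation shows up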