Importing an Excel file saved as a web page into R

I would like to open an Excel file saved as a web page using R, but I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!

I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
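One way to check what a file really is, regardless of its extension, is to look at its first few bytes. The following is a minimal sketch in base R, not a complete detector: the function name sniff_format() and its return labels are made up for illustration, and it only recognises the zip signature used by .xlsx, the OLE2 signature used by old-style .xls, and a few HTML markers.

```r
# Guess a file's real format from its leading bytes, ignoring the extension.
# sniff_format() is a hypothetical helper, not part of any package.
sniff_format <- function(path) {
  bytes <- readBin(path, what = "raw", n = 8)

  # .xlsx files are zip archives and start with "PK\003\004".
  zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
  if (length(bytes) >= 4 && identical(bytes[1:4], zip_magic)) {
    return("xlsx-zip")
  }

  # Old binary .xls files are OLE2 compound documents.
  ole_magic <- as.raw(c(0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1))
  if (length(bytes) >= 8 && identical(bytes, ole_magic)) {
    return("xls-ole2")
  }

  # Otherwise look for HTML markers in the first few lines of text.
  head_text <- paste(readLines(path, n = 5, warn = FALSE), collapse = " ")
  if (grepl("<html|<table|<!doctype", head_text,
            ignore.case = TRUE, useBytes = TRUE)) {
    return("html")
  }

  "unknown"
}

# Demo: an HTML table saved with an .xls extension (stand-in file).
demo <- tempfile(fileext = ".xls")
writeLines(c("<html><body><table>",
             "<tr><td>1</td></tr>",
             "</table></body></html>"), demo)
sniff_format(demo)
```

If this reports HTML for a file named .xls, the Excel-specific readers are likely to fail and an HTML parser is the better tool.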
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable("path/to/your_file.xls", stringsAsFactors = FALSE)  # placeholder path: use your downloaded file
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[[2]]  # double brackets return the data frame itself; dTemp[2] would return a one-element list
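To make the exploration in steps 2 and 3 concrete, here is a small sketch using a hand-built stand-in for the list that readHTMLTable() returns. The element names and values are invented; a real file will have its own structure, and columns often come back as character, so a type conversion is shown as well.

```r
# Stand-in for the list returned by readHTMLTable() (hypothetical data).
dTemp <- list(
  layout  = data.frame(V1 = "Report title", stringsAsFactors = FALSE),
  results = data.frame(id = c("1", "2"),
                       value = c("3.5", "4.1"),
                       stringsAsFactors = FALSE)
)

# Overview of the list: one line per element.
str(dTemp, max.level = 1)

# Row counts help to spot which element holds the real data.
sapply(dTemp, nrow)

# Single brackets keep the list wrapper; double brackets extract the data frame.
as_list <- dTemp[2]
df <- dTemp[[2]]
class(as_list)
class(df)

# Convert text columns to numbers where appropriate.
df$value <- as.numeric(df$value)
```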
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.

I seriously doubt the Excel file is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files, so download.file() needs an extra setting, mode = "wb", to tell R to treat the file as binary.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
Or use the downloader package and try something like this:
library(downloader)
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")
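A quick way to convince yourself that a binary download arrived intact is to compare the bytes of the source and the copy. This is only a sketch: it uses a local file:// URL and a tiny made-up binary file as a stand-in for the real server, so it runs without network access.

```r
# Create a small binary stand-in file (hypothetical content).
src <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x0d, 0x0a, 0x00, 0xff)), src)

# Build a file:// URL for it; Windows paths need an extra leading slash.
src_path <- normalizePath(src, winslash = "/")
if (!startsWith(src_path, "/")) src_path <- paste0("/", src_path)
src_url <- paste0("file://", src_path)

# Fetch it in binary mode, as one would for a real .xlsx.
dest <- tempfile(fileext = ".bin")
download.file(url = src_url, destfile = dest, mode = "wb", quiet = TRUE)

# The copy should be byte-for-byte identical to the source.
same_bytes <- identical(readBin(src, "raw", n = 100),
                        readBin(dest, "raw", n = 100))
same_bytes
```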

Related

How to open mht file via R?

I ran into a problem opening mht files in R. One approach is to open the file in Excel first, save it as .xlsx, and then read that into R.
But this does not meet my requirements, since I need a program that works automatically (no manual work).
Unfortunately, I could not find out how to do this on the Internet.
Can someone suggest a way to open an mht file containing data in R?

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI but I recently came across arrow and would like to explore it more.
The problem is that the flat files I typically work with are sufficiently large that they cannot be read into R without help. So, I would ideally prefer a way to make the conversion without actually needing to read the dataset into R in the first place.
Any help you can provide would be much appreciated!
arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
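The move/symlink workaround can be done from R itself. Below is a minimal base-R sketch; the directory names and file contents are stand-ins created in a temporary location, and the final open_dataset() call is left commented out because it needs the arrow package.

```r
# Stand-in for a CSV that sits next to unrelated files (hypothetical data).
src_dir <- tempfile("mixed_")
dir.create(src_dir)
csv_file <- file.path(src_dir, "obs.csv")
writeLines(c("checklist_id,species_code", "1,amerob"), csv_file)
writeLines("not data", file.path(src_dir, "README.txt"))

# Stage only the data file in a fresh directory.
staging_dir <- tempfile("obs_only_")
dir.create(staging_dir)
target <- file.path(staging_dir, basename(csv_file))

# Prefer a symlink (cheap for large files); fall back to a copy if that fails.
linked <- suppressWarnings(file.symlink(csv_file, target))
if (!linked) {
  file.copy(csv_file, target)
}

list.files(staging_dir)

# Not run here: point arrow at the staging directory.
# ds <- arrow::open_dataset(staging_dir, format = "csv")
```

For very large files the symlink is preferable, since the fallback copy duplicates the data on disk.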
You can do it this way:
library(arrow)
library(dplyr)

csv_file <- "obs.csv"
dest <- "obs_parquet/"

# Column types for the CSV
sch <- arrow::schema(checklist_id = float32(),
                     species_code = string())

# Open the CSV lazily; skip the header row because a schema is supplied
csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

# Stream the data out as parquet files without loading it all into memory
write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
In my case (56GB csv file), I had a really weird situation with the resulting parquet tables, so double check your parquet tables to spot any funky new rows that didn't exist in the original csv. I filed a bug report about it:
https://issues.apache.org/jira/browse/ARROW-17432
If you also experience the same issue, use the Python Arrow library to convert the csv into parquet and then load it into R. The code is also in the Jira ticket.

Import information from .doc files into R

I've got a folder full of .doc files and I want to read them all into R to create a data frame with the filename as one column and the content as another column (which would include all content from the .doc file).
Is this even possible? If so, could you provide me with an overview of how to go about doing this?
I tried starting out by converting all the files to .txt format using readtext() using the following code:
DATA_DIR <- system.file("C:/Users/MyFiles/Desktop")
readtext(paste0(DATA_DIR, "/files/*.doc"))
I also tried:
setwd("C:/Users/My Files/Desktop")
I couldn't get either to work (output from R was Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.) but I'm not sure if this is necessary for what I want to do.
Sorry that this is quite vague; I guess I want to know first and foremost if what I want to do can be done. Many thanks!
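One thing that may explain the empty file name in the error: system.file() is meant for locating files inside installed packages and returns an empty string when nothing matches, so DATA_DIR above is probably "". The sketch below shows that, then builds the list of .doc paths with list.files() instead. The folder and file names are stand-ins created in a temporary directory, and the content column is left empty because filling it needs a .doc reader.

```r
# system.file() searches inside installed packages, so an ordinary path yields "".
DATA_DIR <- system.file("C:/Users/MyFiles/Desktop")
DATA_DIR

# Hypothetical stand-in folder with two empty .doc files and one other file.
doc_dir <- file.path(tempfile("desktop_"), "files")
dir.create(doc_dir, recursive = TRUE)
file.create(file.path(doc_dir, c("a.doc", "b.doc", "notes.txt")))

# Collect the .doc paths directly; no system.file() needed.
doc_paths <- list.files(doc_dir, pattern = "\\.doc$", full.names = TRUE)

# Skeleton of the desired data frame: one row per file, content to be filled in.
docs <- data.frame(filename = basename(doc_paths),
                   content = NA_character_,
                   stringsAsFactors = FALSE)
docs
```

The content column could then be filled by passing doc_paths to a reader such as readtext(), assuming it can handle the .doc files in question.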

R Copying to and Reading from csv Files

When I go to save Excel data that I've pasted into a .csv file, I get a formatting issue and often the saved file has all the numbers in each row as one long string.
My read statement is
resids <- read.csv("C:\\Projects\\residuals_Parts3.csv", header=TRUE)
Any ideas on how to fix this?
The warning you are getting is fairly standard in Excel: any formatting you've added to the file (e.g. widening columns) will be lost if you don't save the file as an Excel file, and the warning is supposed to remind you of this. Personally, the extra click or two annoys me too.
If you would like to avoid converting Excel files to CSV before bringing them into R, try the openxlsx package. It's saved me from a lot of that monkey business.
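As for each row arriving as one long string: one possible cause, and this is an assumption since the file itself is not shown, is that the pasted data is separated by tabs (or semicolons) rather than commas, so read.csv() sees a single column. A small sketch with made-up data:

```r
# Stand-in file: tab-separated values saved with a .csv extension (hypothetical).
tmp <- tempfile(fileext = ".csv")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), tmp)

# With the default comma separator, everything lands in one column.
wrong <- read.csv(tmp, header = TRUE)
ncol(wrong)

# Naming the actual separator splits the columns as intended.
right <- read.csv(tmp, header = TRUE, sep = "\t")
ncol(right)
str(right)
```

Opening the file in a plain text editor shows which separator is actually in use.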

Opening .R files (EDIT: gzip files saved with an .R extension). Is there any other way to access them?

I have a number of R files with an .R extension. I've tried various ways to see what is inside these files, including Xcode, vim, etc.
What I find is utterly indecipherable. For example, it looks like this Lçæ§o‡dµ’Ò6ÇìùëfiFŒÀ±y2Â8á∫˝É, but pages of it.
Is it safe to say that these files are fundamentally corrupt? Or should I be using R to open these files to see what's actually in them?
EDIT: I've never worked with a file like this. After using load() in R, how would I read the data? I have used
> data <- load("~/filename.RData")
> data
The output is [1] "filename".
EDIT2: It appears these are gzip files saved with an .R extension. I can use load() to read the data into R. Is there any other way I can access these data files?
"filename" is now loaded and it is stored in an object of the same name. You should be able to see what it is inside by running:
filename
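More generally, load() restores the saved objects and returns only their names, which is why data printed "filename". A minimal sketch, using a made-up object saved to a temporary file, of loading into a separate environment and fetching the object by the returned name:

```r
# Create and save a stand-in object (hypothetical data).
filename <- data.frame(x = 1:3)
path <- tempfile(fileext = ".RData")
save(filename, file = path)
rm(filename)

# Load into its own environment so nothing in the workspace is overwritten.
env <- new.env()
loaded_names <- load(path, envir = env)
loaded_names

# Fetch the restored object by name.
obj <- get(loaded_names[1], envir = env)
str(obj)
```

Loading into a dedicated environment also makes it easy to see everything the file contained with ls(env).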
