How to download a URI data:text file using R?

I'm trying to download what I believe is called a URI data:text file using R. The "URL" is of this form:
URL <- "data:text/csv;charset=utf8,Supply%2001%2F02%2F2020%2C0%3A00%2C0%3A05%2C0%3A10%2C0%3A15%2C0..."
followed by hundreds more characters. When I type the "URL" into a web browser and press enter, the desired CSV file downloads without a problem. When I try to use something like download.file or curl_download on the above string, however, I get an error like this:
Error in curl_download(URL, destfile = "test1234.csv") :
Port number ended with 't'
Any insights on how I can download a csv data: file like this using R? Thanks!
If it's any use, the file I'm trying to download is pasted below. I had to first save the string as a .txt file and then import that .txt file with read_file in order to store it as a string in R.

The data URL isn't really a normal URL: it contains all of the data inside the string itself rather than pointing to data at another location. It is made up of two parts, a "header" and then the data itself. The header consists of "data:text/csv;charset=utf8," and the data follows, but it has been URL (percent) encoded. You can read the data by removing the header, decoding the values, and then reading the text as a CSV file with read.csv. For example:
# Strip the "data:text/csv;charset=utf8," header, percent-decode the rest,
# and parse the decoded text as CSV
read.csv(text = URLdecode(gsub("^data:text/csv;charset=utf8,", "", URL)),
         check.names = FALSE)
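As a quick self-contained illustration, here is a sketch with a short, made-up data: URI (the URL below is invented purely for demonstration):
# A tiny, made-up data: URI used only for illustration
URL <- "data:text/csv;charset=utf8,id%2Cvalue%0A1%2C10%0A2%2C20"
# Remove the header, percent-decode, and parse as CSV
csv_text <- URLdecode(sub("^data:text/csv;charset=utf8,", "", URL))
df <- read.csv(text = csv_text, check.names = FALSE)
df
#   id value
# 1  1    10
# 2  2    20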

Related

Missing delimiter error when importing html text

I'm playing with Azure Machine Learning using the Designer and am getting a "Delimiter not found" error when importing my data.
I originally started with a few hundred HTML files stored as Azure blobs. Each file would be considered a single row of text; however, I had no luck importing these files for further text analytics.
I created a Data Factory job that imported each file, stripped all the tabs, quotes and CR/LF characters from the text, added a column for the file name, and stored it all as a combined tab-delimited file. In Notepad++ I can confirm that the format is FileName tab HtmlText. This is the file I'm trying to import into ML, and I get the missing delimiter message as I'm trying to define the import module.
Here is the error when I try and create a dataset:
{
"message": "'Delimiter' is not specified or invalid."
}
Question 1: Is there a better way to do text analytics on a large collection of html files?
Question 2: Is there a format I need to use in my combined .tsv file that works?
Question 3: Is there maybe a maximum length for the string column? My HTML can be tens of thousands of characters long.
You're right that it might be line length, but my guess is that there are still some special characters (e.g. anything starting with \) that aren't properly escaped or removed. How did you scrape and strip the text data? Have you tried using BeautifulSoup?
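For reference, a rough sketch of what that cleaning step could look like in R with rvest (an alternative to the BeautifulSoup suggestion above; html_dir, FileName and HtmlText are placeholder names, not from the original post):
library(rvest)

# Placeholder folder containing the scraped .html files
html_files <- list.files("html_dir", pattern = "\\.html?$", full.names = TRUE)

clean_one <- function(path) {
  text <- html_text2(read_html(path))     # drop tags, keep visible text
  text <- gsub("[\t\r\n\"]", " ", text)   # strip tabs, quotes and CR/LF
  data.frame(FileName = basename(path), HtmlText = text)
}

# One row per file: FileName <tab> HtmlText
combined <- do.call(rbind, lapply(html_files, clean_one))
write.table(combined, "combined.tsv", sep = "\t",
            row.names = FALSE, quote = FALSE)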

Is there a way to avoid creating a temporary file in R?

I have a database in which VCF files have been inserted as a blob variable. I am able to retrieve it without issue. However, I then need to pass it to some various functions (VariantAnnotation, etc.) that expect a VCF file name. Is there a way to "fake" a file object to pass to these functions if I already have all the data in a character string?
I'm currently writing it out to a file so I can pass it on:
# x contains the entire VCF file as a character string
temp_filename = tempfile(fileext = ".vcf")  # temporary .vcf path
writeChar(x, temp_filename)                 # write the string out to disk
testVcf = readVcf(temp_filename)            # read it back with VariantAnnotation
unlink(temp_filename)                       # remove the temporary file
This works ok, but I would like to avoid the unnecessary file I/O if possible.

Import information from .doc files into R

I've got a folder full of .doc files and I want to read them all into R to create a data frame with the file name as one column and the content as another column (which would include all of the content from the .doc file).
Is this even possible? If so, could you provide me with an overview of how to go about doing this?
I tried starting out by converting all the files to .txt format using readtext() using the following code:
DATA_DIR <- system.file("C:/Users/MyFiles/Desktop")
readtext(paste0(DATA_DIR, "/files/*.doc"))
I also tried:
setwd("C:/Users/My Files/Desktop")
I couldn't get either to work (the output from R was Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.), but I'm not sure if this is necessary for what I want to do.
Sorry that this is quite vague; I guess I want to know first and foremost if what I want to do can be done. Many thanks!
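In case it helps, a minimal sketch of the readtext() approach, assuming the .doc files live under C:/Users/MyFiles/Desktop/files and that readtext can handle the .doc format on your system (the error above happens because system.file() only looks inside installed R packages and returns "" for other paths, so a plain path is used here instead):
library(readtext)

# Use the folder path directly; system.file() is only for files shipped
# inside installed R packages and returns "" otherwise.
DATA_DIR <- "C:/Users/MyFiles/Desktop"

# readtext() returns a data frame with a doc_id column (the file name)
# and a text column (the file's content)
docs <- readtext(paste0(DATA_DIR, "/files/*.doc"))
head(docs)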

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than with a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are stricter and, since they don't see the file type they were expecting, they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[[2]]  # double brackets extract the data frame itself rather than a sub-list
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
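Putting those steps together, a minimal sketch might look like this (the file names are placeholders, the list index will depend on your particular file, and WriteXLS needs Perl installed):
library(XML)
library(WriteXLS)

# Step 1: download the file locally first, e.g.
# download.file(myurl, "localcopy.xls", mode = "wb")

# Step 2: parse the HTML tables stored inside the .xls file
dTemp <- readHTMLTable("localcopy.xls", stringsAsFactors = FALSE)

# Step 3: pick out the element that actually holds the data
df <- dTemp[[2]]

# Write the result back out as an old-style .xls workbook
WriteXLS("df", ExcelFileName = "localcopy_clean.xls")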
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. This needs an added setting to tell R that it is a binary file so that it is handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
Or, to use the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")
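Once a local binary copy exists, it could then be read in with, say, readxl (assuming the file really is a valid .xlsx; if it is actually an HTML table saved with an .xls extension, the readHTMLTable() approach above applies instead):
library(readxl)

# Read the locally downloaded workbook into a tibble
dat <- read_excel("localcopy.xlsx")
str(dat)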

Converting df to csv format, without creating a file

I am creating a process to convert API data into a DataFrame.
My problem is:
The data only appears correct after exporting it to a csv file using df.to_csv("df.csv", sep=','). If I don't do that, the first column appears as one big list of data.
Is there a way to convert to csv format without creating an external file?
From the documentation of DataFrame.to_csv:
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a
string.
So simply doing:
csv_string = df.to_csv(None, sep=",")
Gives you a string containing a csv representation of your dataframe without creating an external file.
