Missing delimiter error when importing html text - azure-machine-learning-studio

I'm experimenting with Azure Machine Learning using the Designer and am getting a "Delimiter not found" error when importing my data.
I originally started with a few hundred HTML files stored as Azure blobs. Each file would be treated as a single row of text; however, I had no luck importing these files for further text analytics.
I created a Data Factory job that imported each file, stripped all the tabs, quotes, and CR/LF characters from the text, added a column for the file name, and stored it all as a combined tab-delimited file. In Notepad++ I can confirm that the format is FileName tab HtmlText. This is the file I'm trying to import into ML, and I get the missing delimiter message while defining the import module.
Here is the error when I try to create a dataset:
{
"message": "'Delimiter' is not specified or invalid."
}
Question 1: Is there a better way to do text analytics on a large collection of HTML files?
Question 2: Is there a format I need to use in my combined .tsv file that works?
Question 3: Is there maybe a max length to the string column? My HTML can be tens of thousands of characters long.

You're right that it might be line length, but my guess is that there are still some special characters (i.e. anything starting with \) that aren't properly escaped or removed. How did you scrape and strip the text data? Have you tried using BeautifulSoup?
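As a rough illustration of the stripping step discussed above, here is a minimal sketch that reduces each HTML document to plain text and writes FileName/Text rows as tab-separated values. It uses the standard library's html.parser in place of BeautifulSoup, and the names html_to_text and rows_to_tsv are hypothetical helpers, not part of any Azure tooling. Whether the resulting file avoids the Designer error is an assumption that would need to be checked against the real data.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string on a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # split() with no arguments breaks on any whitespace run, so tabs and
    # CR/LF characters are collapsed into single spaces here.
    return " ".join(" ".join(parser.parts).split())


def rows_to_tsv(rows):
    """rows is an iterable of (file_name, html) pairs; returns TSV text."""
    buf = io.StringIO()
    # The csv module quotes a field only if it contains the delimiter,
    # a quote character, or a line break.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["FileName", "Text"])
    for name, html in rows:
        writer.writerow([name, html_to_text(html)])
    return buf.getvalue()
```

Because tabs and line breaks are removed from the text column before writing, each document ends up on exactly one line with exactly one tab, which makes it easy to spot a stray delimiter by counting tabs per line.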

Related

Export into text in R, problem with html special characters

I am trying to export part of a data frame in R to a .txt file using rio::export.
The exported part is a text string that contains characters such as >, ≥ and ". Some of these show up in the text file as HTML entities (for example &lt;, &quot; or &gt;), which is not what I need; I want them written to the text file as the literal greater-than, less-than, etc. characters.
When I looked these up online, I found them described as "HTML special characters", but I could not find a solution. I believe it has something to do with the encoding. Is there a way to fix this by adding arguments to rio::export, or do you suggest using another package?
I am using the code:
rio::export(df, "df.txt")
I also tried the following, from the question "How to save R output data to a text file", but it still gives the same problem:
cat(df$text, file = "myfile.txt")
where df$text is the column that has the items I want to export.
thank you

Import data csv with particular quotes in R

I have a csv like this:
"Data,""Ultimo"",""Apertura"",""Massimo"",""Minimo"",""Var. %"""
"28.12.2018,""86,66"",""86,66"",""86,93"",""86,32"",""0,07%"""
How can I import this correctly?
I tried with read.csv("IT000509408=MI Panoramica.csv", header=T,sep=",", quote="\"") but it doesn't work.
Each row in your file is encoded as a single csv field.
So instead of:
123,"value"
you have:
"123,""value"""
To fix this you can read the file as csv (which will give you one field per row without the extra quotes), and then write the full value of that field to a new file as plain text (without using a csv writer).
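The two-pass idea in this answer can be sketched as follows. This is a Python illustration rather than R, and unwrap_rows is a hypothetical helper name; it parses the unwrapped lines directly instead of writing them to an intermediate file, which is a small variation on the answer's suggestion.

```python
import csv
import io


def unwrap_rows(text):
    """Parse CSV text in which every row was wrapped as one quoted field."""
    # First pass: each physical row parses to a single field, and the
    # doubled quotes ("") are turned back into single quotes.
    outer = csv.reader(io.StringIO(text))
    inner_lines = [row[0] for row in outer if row]
    # Second pass: the unwrapped lines are ordinary CSV rows.
    return list(csv.reader(inner_lines))


# The two sample rows from the question.
sample = (
    '"Data,""Ultimo"",""Apertura"",""Massimo"",""Minimo"",""Var. %"""\n'
    '"28.12.2018,""86,66"",""86,66"",""86,93"",""86,32"",""0,07%"""\n'
)
header, first_row = unwrap_rows(sample)
```

The values still use a decimal comma (for example 86,66), so converting them to numbers remains a separate step.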

R read a csv file without carriage return

I'm trying to read a .csv downloaded on Ishares' website containing all holdings of an ETF (and additional info).
Here is the csv.
My issue is that I can't get a neat dataframe containing Tickers, company name, asset class, etc. using the read.csv function.
When I opened the csv file in Notepad I saw that all rows are on the same line and no special character is used to separate rows from each other.
Do you have any clues on how I should deal with this file?
That csv file uses the "LF" line break convention found on Linux and other Unixes - Windows usually expects text files to use the "CRLF" convention, which is why the line breaks are not showing in notepad. Another issue with that file is that before the csv content there are 10 header lines. Does this work in R?
read.csv(<fname>, skip=10)
If the line breaks are still an issue, there must be a Windows tool to convert from LF linebreaks to CRLF. On Linux it's "unix2dos" - I'm not sure what it would be on Windows.
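If no conversion tool is at hand, the LF to CRLF conversion is simple enough to do in a few lines. The following is a minimal Python sketch, not a replacement for unix2dos; lf_to_crlf and convert_file are hypothetical helper names.

```python
def lf_to_crlf(data):
    """Convert LF line breaks in a bytes object to CRLF."""
    # Normalise existing CRLF pairs to LF first so they are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path, dst_path):
    """Read src_path as raw bytes and write a CRLF version to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(lf_to_crlf(data))
```

Working on bytes avoids having to guess the file's text encoding, since only the line-break bytes are touched.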

locate invalid character causing error in R xmlToDataFrame()

For background I am very new to R, and have almost no experience with XML files.
I wrote a webscraper using the RSelenium package that downloads XML files for multiple states and years from this website, and then wrote code that reads in each file and appends it to one file and exports a CSV. My webscraper successfully downloads all of the files I need, and the next segment of code is able to successfully read all but two of the downloaded xml files.
The first file that I am unable to read into an R dataframe can be retrieved by selecting the following options on this page: http://www.slforms.universalservice.org/DRT/Default.aspx
Year=2013
State=PA
Click radio button for "XML Feed"
Click checkbox for "select data points"
Click checkbox for "select all data points"
Click "build data file"
I try to read the resulting XML file into R using xmlToDataFrame:
install.packages("XML")
require("XML")
data_table<-xmlToDataFrame("/users/datafile.xml")
When I do, I get an error:
xmlParseCharRef: invalid xmlChar value 19
Error: 1: xmlParseCharRef: invalid xmlChar value 19
The other examples I've seen of invalid character errors using xmlToDataFrame usually give two coordinates for the problematic character, but since only the value "19" is given, I'm not sure how to locate the problematic character.
Once I do find the invalid character, would there be a way to alter the text of the xml file directly to escape the invalid character, so that xmlToDataFrame will be able to read in the altered file?
It's a bad encoding on this line of XML:
31 to $26,604.98 to remove: the ineligible entity MASTERY CHARTER SCHOOLS 
but the document seems to have other encoding issues as well.
The TSV works fine, so you might think about using that instead.
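To locate such a character yourself, one option is to scan the file for numeric character references whose value falls outside the set of characters XML 1.0 allows. The sketch below is in Python rather than R and assumes the problem is a numeric reference such as &#19; (which is what an xmlParseCharRef error points to); the function names are hypothetical.

```python
import re

# Matches decimal (&#19;) and hexadecimal (&#x13;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _code_point(match):
    return int(match.group(1), 16) if match.group(1) else int(match.group(2))


def is_valid_xml_char(cp):
    """True if cp is a character permitted by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_char_refs(text):
    """Return (line, column, reference, value) for each invalid reference."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for m in _CHAR_REF.finditer(line):
            cp = _code_point(m)
            if not is_valid_xml_char(cp):
                hits.append((lineno, m.start() + 1, m.group(0), cp))
    return hits


def strip_invalid_char_refs(text):
    """Remove invalid references and leave valid ones untouched."""
    return _CHAR_REF.sub(
        lambda m: m.group(0) if is_valid_xml_char(_code_point(m)) else "", text
    )
```

Writing the output of strip_invalid_char_refs to a new file gives a copy that no longer contains those references. Any other encoding problems in the document would still need separate handling.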

Importing the contents of a word document into R

I am new to R and have been working as follows for a while. I have the code written in a Word document, then I copy and paste the code from the document into R to run it, which works fine. However, when the code is long (a hundred pages), it takes a significant amount of time before R starts running it. This does not seem like a very effective way of working, and I am sure there are other ways to run R code.
One alternative that comes to mind is to import the contents of the Word document into R, which I am unsure how to do. I have tried read.table, but it does not work. I have looked online for how to import data, but most explanations cover data tables or internet files in the form of data tables and similar. I have tried saving the document as CSV, but Word does not offer CSV. I have also tried Rich Text Format and the XML package, but again the package instructions are for importing tables and the like. I am wondering if there is an effective way for R to import a Word document as it appears in Word.
Thank you
It's hard to say what the easiest solution would be, without examining the word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As, and use 'plain text' under 'Save as type'.
Then edit the filename extension to .R from .txt, download a proper text editor (I can recommend RStudio for R), and open your code in it. Then you will be able to run the code from inside the editor without using copy / paste.
No, read.table won't do it.
Microsoft Word has its own format, which includes a lot of meta data over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this R module:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
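To give a sense of what "understanding the Word format" involves, here is a minimal Python sketch that pulls paragraph text out of a .docx file using only the standard library. It assumes the newer .docx format (a zip archive whose main text lives in word/document.xml) and does not handle the older binary .doc format. docx_paragraphs is a hypothetical helper name, and a dedicated library would be the more robust choice for real documents.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

Joining the returned list with newlines and saving it with a .R extension would give roughly the same result as the Save As plain text route, though Word's typographic substitutions (such as curly quotes) would still have to be fixed before the code runs.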
