locate invalid character causing error in R xmlToDataFrame() - r

For background, I am very new to R and have almost no experience with XML files.
I wrote a web scraper using the RSelenium package that downloads XML files for multiple states and years from this website, and then wrote code that reads each file, appends it to one combined file, and exports a CSV. The scraper successfully downloads all of the files I need, and the next segment of code reads all but two of the downloaded XML files.
The first file that I am unable to read into an R dataframe can be retrieved by selecting the following options on this page: http://www.slforms.universalservice.org/DRT/Default.aspx
Year=2013
State=PA
Click radio button for "XML Feed"
Click checkbox for "select data points"
Click checkbox for "select all data points"
Click "build data file"
I try to read the resulting XML file into R using xmlToDataFrame:
install.packages("XML")
require("XML")
data_table <- xmlToDataFrame("/users/datafile.xml")
When I do, I get an error:
xmlParseCharRef: invalid xmlChar value 19
Error: 1: xmlParseCharRef: invalid xmlChar value 19
The other examples I've seen of invalid character errors using xmlToDataFrame usually give two coordinates for the problematic character, but since only the value "19" is given, I'm not sure how to locate the problematic character.
Once I do find the invalid character, would there be a way to alter the text of the xml file directly to escape the invalid character, so that xmlToDataFrame will be able to read in the altered file?

It's a bad encoding on this line of XML:
31 to $26,604.98 to remove: the ineligible entity MASTERY CHARTER SCHOOLS 
but the document seems to have other encoding issues as well.
The TSV works fine, so you might think about using that instead.
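If you want to repair the XML itself, here is a minimal sketch of one way to do it (not a definitive fix). It assumes the problem byte appears as a numeric character reference, which is what xmlParseCharRef suggests: decimal value 19 corresponds to &#19; or &#x13;, a control character that XML 1.0 forbids. Read the file as plain text, locate and drop the reference, then parse the cleaned text:
library(XML)

path <- "/users/datafile.xml"
raw <- readLines(path, warn = FALSE)

# Show the offending line(s) so you can inspect the surrounding record
grep("&#(0*19|x0*13);", raw)

# Strip the invalid reference(s) and re-parse from the cleaned text
cleaned <- gsub("&#(0*19|x0*13);", "", raw)
doc <- xmlParse(paste(cleaned, collapse = "\n"), asText = TRUE)
data_table <- xmlToDataFrame(doc)
Removing the reference drops that one character from the field; if the document has other encoding problems, as noted above, the TSV route may still be simpler.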

Related

Extract hyperlink from xlsx file - rstudio

I have an xlsx file that has a column whose cells all show the value "Link". Every cell has a hyperlink to a website attached to it. When I read the file with readxl::read_xlsx, the "Link" column becomes a plain character vector and loses the hyperlink information. What I need is to read the file and get a column with the URL that is hyperlinked in each row.
I've tried what Sean Yang posted here, but I'm having some trouble; when running the function it returns:
Error: XML content does not seem to be XML: 'rels'.
I don't understand what the problem is. I also can't comment on his answer because I don't have enough points, so I'm sorry if this post is redundant.
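In case it helps, here is a rough sketch of the underlying idea (the file name and sheet are assumptions, not from the original post): an .xlsx file is a zip archive, and the hyperlink targets for a worksheet live in its relationships file, so you can unzip the workbook and read them directly.
library(XML)

path <- "workbook.xlsx"   # hypothetical file name
tmp <- tempdir()
unzip(path, exdir = tmp)

# Hyperlink targets for the first sheet are stored in this relationships file
rels <- xmlParse(file.path(tmp, "xl", "worksheets", "_rels", "sheet1.xml.rels"))
rel_nodes <- getNodeSet(rels, "//*[local-name()='Relationship']")

urls <- data.frame(
  id = sapply(rel_nodes, xmlGetAttr, "Id"),
  target = sapply(rel_nodes, xmlGetAttr, "Target"),
  stringsAsFactors = FALSE
)

# Keep only the external links (the file also lists internal relationships)
urls[grepl("^http", urls$target), ]
You would still need to match each relationship id back to its cell, which the worksheet XML records in its hyperlink elements.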

BlueSky Statistics - Character Encoding Problem

I am loading a data set whose characters were encoded in ISO 8859-9 ("Latin 5") using Windows 10 (Microsoft has assigned code page 28599, a.k.a. Windows-28599, to ISO-8859-9 in Windows).
The data set is originally in Excel.
Whenever I run an analysis, or any operation with a variable name containing a character specific to this code page (ISO 8859-9), I get an error like:
Error: undefined columns selected
BSkyFreqResults <- BSkyFrequency(vars = c("MesleÄŸi"), data = Turnudep_raw_data_5)
Error: object 'BSkyFreqResults' not found
BSkyFormat(BSkyFreqResults)
The characters ÄŸ within "MesleÄŸi" are originally a single Turkish character, ğ (g with a breve).
Variable names that contain only letters from the US code page work normally in BlueSky operations.
If I save the file from Excel with the "Web Page, UTF-8" option to convert the data to UTF-8, this does not work either. If I export it to a CSV file, it does not work as-is or when saved as UTF-8.
How can I load this data into BlueSky so that it works?
This same data set works in Rstudio:
> Sys.getlocale('LC_CTYPE')
[1] "Turkish_Turkey.1254"
And also in SPSS:
Language is set to Unicode
(screenshot of the SPSS language settings)
It also works in Jamovi
I also get an error when I start BlueSky, that may be relevant to this problem:
Python-CFFI error
From cffi callback <function _consolewrite_ex at 0x000002A36B441F78>:
Traceback (most recent call last):
File "rpy2\rinterface_lib\callbacks.py", line 132, in _consolewrite_ex
File "rpy2\rinterface_lib\conversion.py", line 133, in _cchar_to_str_with_maxlen
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 15: invalid start byte
Since then I re-downloaded and re-installed BlueSky, but I still get this Python-CFFI error every time I start the software.
I want to work with BlueSky and will appreciate any help in resolving this problem.
Thanks in advance
Here is a link for reproducing the problem.
The zip file contains a data source of 2 cases both in Excel and BlueSky format, a BlueSky Markdown file to show how the error is produced and an RMarkdown file for redundancy (probably useless).
UPDATE: The Python error (Python-CFFI error) appears to be related to the Region settings in Windows.
If the region is USA (Turnudep_reprex_Windows_Region_USA-Settings.jpg) , the python error does NOT appear.
If the region is Turkey (Turnudep_reprex_Windows_Region_Turkey-Settings.jpg) the python error DOES appear.
Unfortunately, setting the region and language to USA eliminates the Python error message but not the other problem. All operations with the Turkish variable names still end in an error.
This may be a problem only the BlueSky developers may solve ...
Any help or suggestion will be greatly appreciated.
UPDATE FOR VERSION 10.2: The Python error (Python-CFFI error) is eliminated in this version. All the other problems persist. I also notice that I cannot change variable names that contain characters outside the US code page. That is, if a variable name is something like "HastaNo", I can run analyses with that variable and rename it in the editor. If the variable name is something like "Mesleği", I can neither run analyses with that variable NOR rename it in the editor to "Meslegi" or anything else that would make it usable in analysis.
UPDATE FOR VERSION: BlueSky Statistics Version 10.2.1, R package version 8.70
No change from version 10.2: variable names that contain a character outside of ASCII cause an error and cannot be changed in BlueSky Statistics.
For version 10, according to user manual chapter 15.1.3, you can adjust the encoding setting.
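If the CSV route mentioned in the question is still an option, a minimal R sketch (file names hypothetical) for declaring the Latin-5 encoding on read and re-exporting as UTF-8, which other tools may be happier to ingest:
# Declare the Turkish / Latin-5 encoding when reading the CSV export
df <- read.csv("Turnudep_raw_data_5.csv",
               fileEncoding = "ISO-8859-9",
               stringsAsFactors = FALSE)

names(df)   # should show "Mesleği" with its characters intact

# Write back out as UTF-8 for tools that expect it
write.csv(df, "Turnudep_raw_data_5_utf8.csv",
          fileEncoding = "UTF-8", row.names = FALSE)
Whether BlueSky itself then accepts the non-ASCII variable names is a separate issue that the encoding setting above is meant to address.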

Missing delimiter error when importing html text

Playing with Azure Machine Learning using the Designer and am getting a "Delimiter not found" error when importing my data.
I originally started with a few hundred html files stored as Azure blobs. Each file would be considered a single row of text; however, I had no luck importing these files for further text analytics.
I created a Data Factory job that imported each file, stripped all the tabs, quotes, and cr/lf from the text, added a column for the file name, and stored it all as one combined tab-delimited file. In Notepad++ I can confirm that the format is FileName tab HtmlText. This is the file I'm trying to import into Azure ML, and I get the missing-delimiter message while defining the import module.
Here is the error when I try and create a dataset:
{
"message": "'Delimiter' is not specified or invalid."
}
Question 1: Is there a better way to do text analytics on a large collection of html files?
Question 2: Is there a format I need to use in my combined .tsv file that works?
Question 3: Is there maybe a max length for the string column? My HTML can be tens of thousands of characters long.
You're right that it might be line length, but my guess is that there are still some special characters (e.g. anything starting with \) that aren't properly escaped or removed. How did you scrape and strip the text data? Have you tried using Beautiful Soup?
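For question 2, a hedged sketch in R (directory and output paths hypothetical) of building the combined file so the importer sees exactly two columns: strip the markup, then remove every tab, quote and line break from the text before writing one tab-delimited row per file.
library(xml2)

files <- list.files("html_input", pattern = "\\.html$", full.names = TRUE)

rows <- vapply(files, function(f) {
  txt <- xml_text(read_html(f))          # drop the tags, keep the visible text
  txt <- gsub("[\t\r\n\"]", " ", txt)    # remove characters that break the delimiter
  paste(basename(f), txt, sep = "\t")
}, character(1))

writeLines(rows, "combined.tsv")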

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as a web page using R, and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (Excel web-page format) and the extension (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than with a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are stricter: they don't see the file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df <- dTemp[[2]]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
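Putting the steps together, a hedged end-to-end sketch (the URL and the list index are placeholders you would adapt):
library(XML)
library(WriteXLS)

# 1) fetch a local copy; mode = "wb" matters on Windows
download.file("http://example.com/report.xls", destfile = "report.xls", mode = "wb")

# 2) parse the HTML tables hiding behind the .xls extension
tables <- readHTMLTable("report.xls", stringsAsFactors = FALSE)

# 3) pick the element that actually holds the data, then write a real xls file
df <- tables[[2]]
WriteXLS(df, "report_clean.xls")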
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. This needs an added setting to warn R that it is a binary file so it is handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
Or, to use the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")

Retrieving GWAS information with R

I am trying to get specific disease-related information from the GWAS catalog. This can be done directly from the website via a spreadsheet download. But I was wondering if I could possibly do it programmatically in R. Any suggestions will be greatly appreciated.
Thanks.
Avoks
Check out the function download.file() and the RCurl package (http://cran.r-project.org/web/packages/RCurl/index.html); this should do what you are looking for.
You will have to download .tsv file(s) first and manually edit them.
This is because GWAS Catalog files contain HTML character references, like &#x000A7 in "Behçet's disease" (encoding that special fourth letter). The # in these references is interpreted by R's read.table as the start of a comment, so the rest of the line is dropped and you get an error message, e.g.:
line 2028 did not have 34 elements
So you download the file first, open it in a plain text editor, automatically replace every # with an empty string, and only then load it into R with:
read.table("gwas_catalog_v1.0-associations_e91_r2018-02-21.tsv",sep="\t",h=T,stringsAsFactors = F,quote="")
