Extract hyperlink from xlsx file - rstudio - r

I have an xlsx file with a column whose cells all display the value "Link". Each cell has a hyperlink to a website attached. When I read the file with readxl::read_xlsx, the "Link" column becomes a character vector and loses the hyperlink information. What I need is to read the file and get a column containing the URL behind the hyperlink in each row.
I've tried what Sean Yang posted here, but I'm having some trouble: when I run the function, it returns:
Error: XML content does not seem to be XML: 'rels'.
I don't understand what the problem is. I also can't comment on his answer because I don't have enough points, so I'm sorry if this post is redundant.
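For readers who want to see where the URLs live, here is a minimal sketch, written in Python rather than R and using only the standard library. An .xlsx file is a zip archive: each worksheet part lists its hyperlinks by relationship id, and a sibling .rels part maps those ids to URLs. The function name extract_hyperlinks and the default sheet part xl/worksheets/sheet1.xml are assumptions for illustration; a given workbook may name its sheets differently. The error quoted above is the kind of message a parser gives when it is handed a string that is neither an existing file nor XML text, so one plausible cause (not verified against the poster's file) is that the path to the .rels part did not resolve.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the parts inside an .xlsx archive.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(xlsx_path, sheet_part="xl/worksheets/sheet1.xml"):
    """Return {cell reference: URL} for the external hyperlinks on one sheet.

    sheet_part is an assumption: the first sheet is often stored under this
    name, but a workbook may use a different one.
    """
    rels_part = posixpath.join(
        posixpath.dirname(sheet_part),
        "_rels",
        posixpath.basename(sheet_part) + ".rels",
    )
    with zipfile.ZipFile(xlsx_path) as archive:
        names = set(archive.namelist())
        # A sheet without hyperlinks usually has no .rels part at all.
        if sheet_part not in names or rels_part not in names:
            return {}
        sheet = ET.fromstring(archive.read(sheet_part))
        rels = ET.fromstring(archive.read(rels_part))

    # Relationship id -> target URL.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
    }

    links = {}
    for link in sheet.iter(f"{{{MAIN_NS}}}hyperlink"):
        rel_id = link.get(f"{{{DOC_REL_NS}}}id")
        # Links to places inside the workbook carry no relationship id.
        if rel_id in targets:
            links[link.get("ref")] = targets[rel_id]
    return links
```

Calling extract_hyperlinks("file.xlsx") would return a dictionary such as {"A2": "https://..."}, which can then be joined to the data read by readxl using the cell references.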

Related

RStudio: Save data from Viewer

Due to a stupid mistake and a defective USB stick, I lost a bunch of data and I am now trying to recover it.
Some of the data is still displayed in the Viewer tabs when I open RStudio. However, I can only save R scripts and R Markdown files out of the Viewer. The displayed data frames are complete, and I can sort and filter them in the Viewer, but I cannot find a "save" option. Is there a way to save this displayed data as Rdata, csv, or something similar?
I would suggest three different approaches, but none of them is guaranteed to work. I have sorted them by how likely I expect each one to succeed.
1) You can copy the whole data frame from the viewer and paste it into external spreadsheet software to obtain a .csv file, e.g. through the "convert text to columns" button in MS Excel.
2) You can copy and paste the character string into an object that is passed to the text argument of read.table or to dput(). Check out the "Copy your data" section of this famous SO question
3) Finally, you can use Google Chrome's "Inspect Element" function to inspect the HTML code of the object in the viewer. Once you find the table, you can copy and paste it and scrape it with an HTML parser, e.g. using the rvest package. Good luck!
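As a rough illustration of the third approach, the sketch below turns a copied HTML table into CSV text. It uses Python's standard html.parser module instead of rvest, and it assumes the viewer's markup is an ordinary table built from tr, th and td elements, which has not been checked against RStudio's actual output. The names TableParser and html_table_to_csv are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html_text):
    """Return the table cells found in html_text as CSV text."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The returned text can be written to a .csv file and then loaded in R with read.csv.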
Thanks everybody, there is a way to access the data as Rdata files, which was kindly explained to me here.
I used the second method and located the files in %localappdata%\RStudio-Desktop\viewer-cache.

locate invalid character causing error in R xmlToDataFrame()

For background I am very new to R, and have almost no experience with XML files.
I wrote a webscraper using the RSelenium package that downloads XML files for multiple states and years from this website, and then wrote code that reads in each file and appends it to one file and exports a CSV. My webscraper successfully downloads all of the files I need, and the next segment of code is able to successfully read all but two of the downloaded xml files.
The first file that I am unable to read into an R dataframe can be retrieved by selecting the following options on this page: http://www.slforms.universalservice.org/DRT/Default.aspx
Year=2013
State=PA
Click radio button for "XML Feed"
Click checkbox for "select data points"
Click checkbox for "select all data points"
Click "build data file"
I try to read the resulting XML file into R using xmlToDataFrame:
install.packages("XML")
library(XML)
data_table <- xmlToDataFrame("/users/datafile.xml")
When I do, I get an error:
xmlParseCharRef: invalid xmlChar value 19
Error: 1: xmlParseCharRef: invalid xmlChar value 19
The other examples I've seen of invalid character errors using xmlToDataFrame usually give two coordinates for the problematic character, but since only the value "19" is given, I'm not sure how to locate the problematic character.
Once I do find the invalid character, would there be a way to alter the text of the xml file directly to escape the invalid character, so that xmlToDataFrame will be able to read in the altered file?
It's a bad encoding on this line of XML:
31 to $26,604.98 to remove: the ineligible entity MASTERY CHARTER SCHOOLS 
but the document seems to have other encoding issues as well.
The TSV works fine, so you might think about using that instead.
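To answer the "how do I locate it" part more generally, here is a hedged sketch in Python (standard library only). It assumes the reported value 19 is a decimal code point, so the offending text would be a numeric character reference such as &#19; or &#x13;, which XML 1.0 does not allow. The function names are hypothetical. The sketch reports the line and column of each invalid reference and can strip them so that the file parses. It does not address the other encoding issues mentioned above.

```python
import re

# Numeric character references such as &#19; (decimal) or &#x13; (hex).
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def _code_point(match):
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


def find_invalid_char_refs(text):
    """Return (line, column, reference) for each invalid reference, 1-based."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in CHAR_REF.finditer(line):
            if not is_valid_xml_char(_code_point(match)):
                hits.append((line_no, match.start() + 1, match.group(0)))
    return hits


def strip_invalid_char_refs(text):
    """Drop invalid references and leave valid ones untouched."""
    return CHAR_REF.sub(
        lambda m: m.group(0) if is_valid_xml_char(_code_point(m)) else "",
        text,
    )
```

Reading the downloaded file as text, passing it through strip_invalid_char_refs, and writing the result to a new file should give xmlToDataFrame something it can read, provided invalid references are the only problem.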

loading data and replacement in R

Hi, sorry, this is my first post here; my apologies if I made a mistake.
I'm fairly new to R and I was given an assignment where I am loading a CSV file into R. When I read.csv the whole file, I get a ton of blank spots where values should be. The only info printed out is the N/A in the cells, which is actually what I am trying to replace.
So I took a small sample of the file, only the first couple of rows, and the info came up correctly with my read.csv command. My question is: is the .csv too large for the original data in my main .csv file to be displayed?
Also, how would I go about replacing all the N/A and NA values in the file with blank cells ("")?
Sorry if I described my scenario poorly.
First, make sure that all of your data in the csv file is in GENERAL format.
There should be a title for each of the columns too.
If you have an empty cell in your csv file, enter a 0 in it.
Also make sure that you CLEAR ALL the cells around the data, just in case there is anything odd in them.
Hope that helps. If not, you could send me your file at sgreenaway#vmware.com and I will check it out for you :)
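For the second part of the question (replacing N/A and NA with blank cells), here is a minimal sketch in Python using only the standard csv module. The set of strings treated as missing and the function name blank_out_missing are assumptions; adjust them to the actual file.

```python
import csv
import io

# Strings treated as missing (assumption: only these two spellings occur).
MISSING = {"NA", "N/A"}


def blank_out_missing(csv_text):
    """Return the CSV text with every cell equal to NA or N/A emptied."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Compare the whole cell, so values such as "NAME" are left alone.
        writer.writerow(["" if cell.strip() in MISSING else cell for cell in row])
    return out.getvalue()
```

Only cells that consist entirely of NA or N/A are emptied; text that merely contains those letters is kept.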

CSV column formatting while spooling a csv file in sqlplus

How do I extract a number-formatted column when I spool to a csv file from Unix and the column is varchar in the database?
The number format in the CSV is 5.05291E+12
It should actually be 5052909272618
This is not a Unix or ksh problem but an Excel problem. Assuming Excel is the default application for .csv files, when you double-click the file to open it, Excel makes the column type "General" by default, and you get scientific notation instead of the expected text value. If your numeric data has leading zeroes, they will be removed as well.
One way to fix this is to start Excel (this example uses the 2010 version), then go to Data/get external data/from text and follow the wizard, making sure to set the "column data format" to "text" for the column that contains your large number (click on the column in the "data preview" section first). Leading zeroes will be preserved as well.
You could write a VBA macro that opens the file with all columns as text (a little searching will turn up some examples), but there seems to be no setting that tells Excel to treat columns as "text" by default.
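The point that the spooled file is fine and only the display changes can be illustrated with a few lines of Python. This is only an analogy for what a "General" cell does, not Excel's actual code, and the five-decimal format is an assumption chosen to match the value shown in the question.

```python
value = "5052909272618"         # the varchar text as spooled to the file
as_number = float(value)        # treating the text as a number
displayed = f"{as_number:.5E}"  # scientific notation, as in the question
restored = f"{as_number:.0f}"   # the digits are still there

# Leading zeroes are lost as soon as the text is treated as a number.
zero_padded = f"{float('007'):.0f}"

print(displayed, restored, zero_padded)
```

Keeping the column as text, as the import wizard does, avoids both the notation change and the loss of leading zeroes.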
I needed to develop a report and faced the same issue. I found a workaround, for example:
Your table name --> EMPLOYEE
It contains 3 columns: colA, colB, colC. The problem is with colB. After connecting to sqlplus, you can save the data in the spool file as:
select colA || ',''' || colB || ''',' || colC from employee
Sample result:
3603342979,'8938190128981938262',142796283
When Excel opens the file, it displays the result as is.
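The same quoting can be sketched in Python to make the string concatenation easier to follow. The helper names spool_line and unquote are hypothetical; unquote shows how a program reading the file back could remove the added single quotes.

```python
def spool_line(col_a, col_b, col_c):
    """Build one output line: colB wrapped in single quotes, comma separated."""
    return f"{col_a},'{col_b}',{col_c}"


def unquote(field):
    """Remove the wrapping single quotes that were added for Excel's benefit."""
    if len(field) >= 2 and field[0] == "'" and field[-1] == "'":
        return field[1:-1]
    return field


line = spool_line("3603342979", "8938190128981938262", "142796283")
print(line)
```

The trade-off is that the quotes become part of the data, so anything other than Excel that consumes the file has to strip them again.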

HTML data in Excel cell. Excel changing values, causing invalid import into database

I am currently trying to import an Excel spreadsheet that contains product data for products on my website.
Of the various fields available for product import, one is the product description. In this cell, I paste the HTML that I want displayed for that product.
However, it seems that Excel sometimes decides to format this data on its own, and after importing, the HTML is all screwed up. It's strange, because this does not happen with every product.
I've tried formatting the cells as "text", but that seems to have no effect. When I paste the values from the cell into Notepad, I can see that every double-quote symbol (") has been doubled. For what reason, I have no idea.
The data looks correct when viewed in the cell, but it does not import properly, and the data is visibly changed when copied and pasted out of Excel.
Any ideas on how to remedy this? How can I get Excel to just ignore the data?
Try putting the HTML code in a single line with no breaks. For example:
<html><body><p><div id="PDG-Team-Image">Test</body></html>
I ran a quick test using the above simple HTML snippet. In Column A, I put the code with line breaks. In column B, I put the code with no breaks. I then copied the cell A1 and pasted it into Notepad, and then copied B1 and pasted it into Notepad. You'll see the first paste (A1) has the double quotes. The second paste (B1) has no double quotes.
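If the descriptions are prepared by a script before being pasted into the spreadsheet, the "single line with no breaks" advice can be applied automatically. A minimal sketch in Python, with the hypothetical helper name to_single_line; it assumes the HTML contains no whitespace-sensitive content such as pre blocks, because line breaks are replaced by single spaces.

```python
def to_single_line(html_text):
    """Join the non-empty lines of an HTML fragment with single spaces."""
    pieces = (line.strip() for line in html_text.splitlines())
    return " ".join(piece for piece in pieces if piece)


snippet = '<p>\n  <div id="PDG-Team-Image">Test</div>\n</p>\n'
print(to_single_line(snippet))
```

Joining with a space rather than nothing keeps words on adjacent lines from running together.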
