Can R download text from inside this sort of web app? - r

I am trying to download text (company profiles) from this website:
http://www.evca.eu/about-evca/members/member-search/#lsearch
In the past I had good success with similar tasks using, for example, the XML package, but that won't work here because the data I am trying to grab sits inside some sort of dynamic element, and the individual entries in the list don't have their own URLs.
Unfortunately I don't know much about web design, so I am not really sure how to approach this. Any suggestions? It would really suck to do this manually. Thanks

First download Fiddler Web Debugger or a similar tool. It sits between your browser and the web server, so you can see what is going on, including dynamic (AJAX) communication.
Run it, go to the website you are trying to understand, and perform the actions you want to automate.
For example, if you open http://www.evca.eu/about-evca/members/member-search/#lsearch, enter "a" in the search box and then choose "All" (to get all results), Fiddler will show that the browser opens the URL http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999 and sends "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest".
You can do the same with R: send the request, parse the result, and extract some text or links to other pages.
The R code below does the same as described above:
library(httr)
library(stringr)
# send the same form data that the browser sends
r <- POST("http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999",
          body = "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest")
stop_for_status(r)
txt <- content(r, "text")
# each company name follows a "Full company details" link and ends with </h2>
matches <- str_match_all(txt, "Full company details.*?</h2>")
# remove some rubbish from the matches (link text, whitespace characters, HTML tags)
companies <- gsub("(Full company details)|\t|\n|\r|<[^>]+>", '', matches[[1]])
# remove leading spaces
companies <- gsub("^[ ]+", '', companies)
Result:
> length(companies)
[1] 1148
> head(companies)
[,1]
[1,] "350 Investment Partners"
[2,] "350 Investment Partners LLP"
[3,] "360° Capital Management SA"
[4,] "360° Capital Partners France - Advisory Company"
[5,] "360° Capital Partners Italia - Advisory Company"
[6,] "3i Deutschland Gesellschaft für Industriebeteiligungen mbH"
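To see what the two cleanup steps do without contacting the server, here is a minimal offline sketch in base R. The HTML fragment is made up for illustration (the real markup of the response may differ), and the `(?s)` flag is an addition so that `.` can also match line breaks.

```r
# Hypothetical fragment imitating the response; the real markup may differ
txt <- paste0(
  "<a>Full company details</a>\r\n\t<h2> 350 Investment Partners </h2>",
  "<a>Full company details</a>\r\n\t<h2>3i Deutschland</h2>"
)

# Lazily match from each "Full company details" up to the next </h2>
m <- regmatches(
  txt,
  gregexpr("(?s)Full company details.*?</h2>", txt, perl = TRUE)
)[[1]]

# Drop the link text, tabs, line breaks and HTML tags
companies <- gsub("(Full company details)|\t|\n|\r|<[^>]+>", "", m)

# Strip leading and trailing whitespace
companies <- trimws(companies)

print(companies)
```

Using `regmatches()` returns a plain character vector rather than the one-column matrix that `str_match_all()` produces, which may be easier to work with afterwards.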

Related

Converting a dataframe which contains list into a csv with r

I am new to R and I am having trouble converting my data frame (named dffinal), which contains list columns, into a CSV file.
I tried the following code, which produced a CSV that is not usable:
dput(dffinal, file="out.txt")
new <- source("out.txt")
write.csv2(dffinal,"C:/Users\\final.csv", row.names = FALSE)
I tried every option I could find, but nothing worked. Here is a sample of my data frame:
dput(head(dffinal[1:2]))
structure(list(V1 = list("I heard about your products and I would like to give it a try but I'm not sure which product is better for my dry skin, Almond products or Shea Butter products? Thank you",
"Hi,\n\nCan you please tell me the difference between the shea shower oil limited edition and the other shower gels? I got a sample of one in a kit that had a purple label on it. (Please see attached photo.) I love it!\nBut, what makes it limited edition, the smell or what? It is out of stock and I was wondering if it is going to be restocked or not?\n\nAlso, what makes it different from the almond one?\n\nThank you for your help.",
"Hello, Have you discontinued Eau de toilette", "I both an eGift card for my sister and she hasn't received anything via her email\n\nPlease advise \n\nThank you \n\n cann",
"I do not get Coco Pillow Mist. yet. When are you going to deliver it? I need it before January 3rd.",
"Hello,\nI wish to follow up on an email I just received from Lol, notifying\nme that I've \"successfully canceled my subscription of bun Complete.\"\nHowever, I didn't request a cancelation and was expecting my next scheduled\nfulfillment later this month. Could you please advise and help? I'd\nappreciate it if you could reinstate my subscription.\n"),
V2 = list("How long can I keep a product before opening it? shea butter original hand cream large size 5oz, i like to buy a lot during sales promotions, is this alright or should i only buy what i'll use immediately, are these natural organic products that will still have a long stable shelf life? thank you",
"Hi,\nI recently checked to see if my order had been delivered, and I only received my gift box and free sample. Can you please send the advent calendar? Does not seem to have been included in the shipping. Thank you",
"Is the gade fragrance still available?", "I previously contacted you because I purchased your raspberry lip scrub. When I opened the scrub, 25% of the product was missing. Your customer service department agreed to send me a replacement, but I never received the replacement rasberry lip scrub. Could you please tell me when I will receive the replacement product? Thanks, me",
"To whom it may concern:\n\nI have 3 items in my order: 1 Shea Butter Intensive Hand Balm and 2 S‚r‚nit‚ Relaxing Pillow Mist. I have just received the hand balm this morning. I was wondering when I would receive the two bottles of pillow mist.\n\nThanks and regards,\n\nMe",
"I have not received 2X Body Scalp Essence or any shipment information regarding these items. Please let me know if and when you will be shipping these items, otherwise please credit my card. Thanks")), row.names = c(NA,
6L), class = "data.frame")
We can do this in tidyverse
library(dplyr)
library(readr)
dffinal %>%
mutate(across(everything(), unlist)) %>%
write_csv('result.csv')
If every list element has length 1 in all rows, as in the example you shared, unlist will work -
dffinal[] <- lapply(dffinal, unlist)
If some list elements have length greater than 1, collapse each one into a single string instead -
dffinal[] <- lapply(dffinal, sapply, toString)
Then write the data with write.csv -
write.csv(dffinal, 'result.csv', row.names = FALSE)
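As a small self-contained illustration of the `toString` approach, the sketch below builds a made-up data frame (not the asker's data) in which one list element has length 2, flattens only the list columns, and round-trips the result through a temporary CSV file.

```r
# Made-up data frame with two list columns; one element has length 2
df <- data.frame(id = 1:2)
df$V1 <- list("a", c("b", "c"))
df$V2 <- list("x", "y")

# Find the list columns and collapse every element into one string
list_cols <- vapply(df, is.list, logical(1))
df[list_cols] <- lapply(
  df[list_cols],
  function(col) vapply(col, toString, character(1))
)

# Write to a temporary file and read it back to check the result
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)
back <- read.csv(out, stringsAsFactors = FALSE)
print(back)
```

Restricting the conversion to list columns leaves ordinary columns such as `id` with their original type.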

Why is the Same Google Search API Producing Different Results in R vs Browser

Submitting the exact same Google Search API query in the browser (Chrome) and in R returns a different number of results. What could be the reason for this? The only obvious difference is that I submit the query in the browser from my computer (UK based), while the R results come from a GCE VM based in NL. Can this be the cause even though I have specified the country for the search in the query string?
# Pasted in the browser address bar
https://www.googleapis.com/customsearch/v1?q=%22KALLIGIANNIS%22%20Rethymno&num=10&lr=lang_en&cx=SSS&gl=gr&cr=countryGR&dateRestrict=date:r:20150831:20170831&key=XXX&alt=json
# Get request in R
httr::GET('https://www.googleapis.com/customsearch/v1?q=%22KALLIGIANNIS%22%20Rethymno&num=10&lr=lang_en&cx=SSS&gl=gr&cr=countryGR&dateRestrict=date:r:20150831:20170831&key=XXX&alt=json')
The results in the browser show:
"searchInformation": {
"searchTime": 0.133114,
"formattedSearchTime": "0.13",
"totalResults": "109",
"formattedTotalResults": "109"
The results in R
oneresult <- GET('https://www.googleapis.com/customsearch/v1?q=%22KALLIGIANNIS%22%20Rethymno&num=10&lr=lang_en&cx=SSS&gl=gr&cr=countryGR&dateRestrict=date:r:20150831:20170831&key=XXX&alt=json')
content(oneresult)[[5]]
$searchTime
[1] 0.584238
$formattedSearchTime
[1] "0.58"
$totalResults
[1] "59"
$formattedTotalResults
[1] "59"
The Google search algorithm is a black box. It yields different results depending on geolocation and on additional parameters, not all of which are known.
For example, using the browser (not via googleapis) in regular mode versus incognito mode might also yield different results.
My guess is that your hypothesis is right: the difference is caused by the location the search originates from.

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract entire article content via NewsAPI. For now I have done the following :
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame with information on up to 100 articles. However, these do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is always restricted to 260 characters. However, test$url holds the link to the source article, which you can use to scrape the entire content. Since the articles are aggregated from various sources, though, I don't think there is a single automated way to do this.
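For a single source, the scraping step could look roughly like the sketch below. It assumes the rvest package is installed and that the article body sits in `<p>` elements, which will not hold for every site; the inline HTML string stands in for a page that would otherwise be fetched with `read_html(test$url[1])`, and `extract_paragraphs` is a hypothetical helper name.

```r
library(rvest)

# Hypothetical helper: collect the text of all <p> elements in a page
extract_paragraphs <- function(page) {
  paragraphs <- html_text(html_nodes(page, "p"))
  paste(paragraphs, collapse = "\n\n")
}

# Stand-in for a downloaded article; a real page would come from read_html(url)
page <- read_html(
  "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

article_text <- extract_paragraphs(page)
cat(article_text, "\n")
```

In practice each news site is likely to need its own CSS selector, which is why no single automated approach covers every source.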

Web Scraping with R: Truncation Issue [duplicate]

This question already has an answer here:
avoid string printed to console getting truncated (in RStudio)
(1 answer)
Closed 6 years ago.
As a beginner, I am currently working with web scraping with R, using the 'rvest' package. My goal is to get the lyrics of any song from 'www.musixmatch.com'. This is my attempt:
library(rvest)
url <- "https://www.musixmatch.com/lyrics/Red-Hot-Chili-Peppers/Can-t-Stop"
musixmatch <- read_html(url)
lyrics <- musixmatch%>%html_nodes(".mxm-lyrics__content")%>%html_text()
This code creates a character vector 'lyrics' with 2 elements, containing the lyrics:
[1] "Can't stop addicted to the shindig\nChop top he says I'm gonna win big\nChoose not a life of imitation"
[2] "Distant cousin to the reservation\n\nDefunkt the pistol that you pay for\nThis punk the feeling that you stay for\nIn time I want to be your best friend\nEastside love is living on the Westend\n\nKnock out but boy you better come to\nDon't die you know the truth is some do\nGo write your message on the pavement\nBurn so bright I wonder what the wave meant\n\nWhite heat is screaming in the jungle\nComplete the motion if you stumble\nGo ask the dust for any answers\nCome back strong with 50 belly dancers\n\nThe world I love\nThe tears I drop\nTo be part of\nThe wave can't stop\nEver wonder if it's all for you\nThe world I love\nThe trains I hop\nTo be part of\nThe wave can't stop\n\nCome and tell me when it's time to\n\nSweetheart is bleeding in the snow cone\nSo smart she's leading me to ozone\nMusic the great communicator\nUse two sticks to make it in the nature\nI'll get you into penetration\nThe gender of a generation\nThe birth of every other nation\nWorth your weight the gold ... <truncated>
The problem is that the 2nd row gets truncated at some point. From what I know about rvest, there is no parameter to adjust truncation. Also, I could not find anything on the internet about this issue. Does anybody know how to adjust/ disable truncation for this feature? Thanks a lot in advance!
Best regards,
Jan
I think it's better to copy and paste the lyrics into Notepad or WordPad and save them as a .txt file.
Then use the readLines function. It prints a warning message, but I was able to get the entire lyrics into an 84x1 character vector, which you can clean or process however you please.
words <- readLines("redhot.txt")
> head(words)
[1] "Can't stop addicted to the shindig"
[2] "Chop top he says I'm gonna win big"
[3] "Choose not a life of imitation"
[4] "Distant cousin to the reservation"
[5] "Defunkt the pistol that you pay for"
[6] "This punk the feeling that you stay for"
No truncation problem here.
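It may also be worth checking whether the scraped string itself is complete. As the linked duplicate suggests, the "<truncated>" marker can come from how the console prints long strings rather than from rvest. The sketch below uses a made-up long string in place of `lyrics[2]` and inspects it without relying on the default print method.

```r
# Made-up long string standing in for lyrics[2]
long_text <- paste(rep("The wave can't stop", 200), collapse = "\n")

# The full length is stored even if the console abbreviates the printout
n_chars <- nchar(long_text)
print(n_chars)

# Split on line breaks to get one element per lyric line
lyric_lines <- strsplit(long_text, "\n", fixed = TRUE)[[1]]
print(length(lyric_lines))

# writeLines() shows the raw text line by line
writeLines(head(lyric_lines, 3))
```

If `nchar()` reports the expected length, the data is intact and only the display was shortened.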

How to control the echo width using Sweave

I have a problem with the width of the echoed output in Sweave. I have a list containing a large amount of text, and the echoed output from R runs off the page in the PDF. I have tried using
<<>>=
options(width=40)
@
but this has not changed anything.
An example: Set up the list (not showing in latex).
<<echo=FALSE>>=
my_list <- list(example="Site location was fixed using a Silvia Navigator handheld GPS in October 2003. Point of reference used was the station Bench Mark. If the bench mark location was remote from the site then the point of reference used was changed to the 0-1 metre gauge. Bench Mark location was then recorded as a separate entry in the Site History section [but not used as the site location].\r\nFor a Station location map and all digital photograph's of the station, river reach, and site details see H:\\hyd\\dat\\doc. For non digital photo's taken prior to October 2003 please see the relevant station file at Tumut office.")
@
And show the entry of the list.
<<>>=
my_list
@
Is there any way to get this to work without having to break up the list with cat statements?
You can use capture.output() to capture the printed representation of the list and then use writeLines() and strwrap() to display this output, nicely wrapped. As capture.output() returns a vector of strings containing the printed representation of the object, we can cat each of them to the screen/page but wrapped using strwrap(). The benefit of this approach is that the result looks like it was printed by R. Here's the solution:
writeLines(strwrap(capture.output(my_list)))
which produces:
$example
[1] "Site location was fixed using a Silvia Navigator
handheld GPS in October 2003. Point of reference used
was the station Bench Mark. If the bench mark location
was remote from the site then the point of reference used
was changed to the 0-1 metre gauge. Bench Mark location
was then recorded as a separate entry in the Site History
section [but not used as the site location].\r\nFor a
Station location map and all digital photograph's of the
station, river reach, and site details see
H:\\hyd\\dat\\doc. For non digital photo's taken prior
to October 2003 please see the relevant station file at
Tumut office."
From a 2010 posting to rhelp by Mark Schwartz:
cat(paste(strwrap(x, width = 70), collapse = "\\\\\n"), "\n")
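The wrapping width can also be set explicitly through the `width` argument of `strwrap()`. The sketch below uses a shortened, made-up stand-in for `my_list` and a width of 60, chosen arbitrarily, and checks that no wrapped line reaches that width.

```r
# Shortened stand-in for the list from the question
my_list <- list(
  example = paste(
    rep("Bench Mark location was recorded as a separate entry.", 5),
    collapse = " "
  )
)

# Capture the printed form, then wrap each captured line to under 60 characters
printed <- capture.output(my_list)
wrapped <- strwrap(printed, width = 60)

writeLines(wrapped)

# Longest line after wrapping
print(max(nchar(wrapped)))
```

Inside a Sweave chunk the same `writeLines(strwrap(...))` call can replace the bare `my_list` line.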
