I'm using twitteR to get the followers for a few handles. When fetching a single user, this code works:
library(twitteR)
library(dplyr)

test <- getUser("BarackObama")
test_friends <- test$getFriends(10) %>%   # returns a list of twitteR user objects
  twListToDF() %>%                        # flatten the list into a data frame
  tibble::rownames_to_column() %>%
  mutate(id = rowname) %>%
  select(name, everything())
However, I'm not sure of the cleanest way to iterate over a list of handles. The main obstacle is that I don't know how to pipe/vectorize over the getFriends() method (as opposed to an ordinary getFriends() function). On top of that, the object returned by getFriends() is not a data frame; it has to be flattened with twListToDF() before the results can be combined with rbind().
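One idea I've toyed with for the method-vs-function issue is wrapping the call in an ordinary function that could then be mapped or looped over; get_friends_df is just a name I made up, and it assumes the twitteR and dplyr libraries loaded above:
# hypothetical wrapper: turns the reference-class method into a plain function
get_friends_df <- function(handle, n = 10) {
  getUser(handle)$getFriends(n) %>%
    twListToDF()
}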
For looping, this is as far as I got:
handles <- c("BarackObama", "ThePresObama")
friends <- list()
for (i in seq_along(handles)) {
  user <- getUser(handles[i])
  friends[[i]] <- user$getFriends() %>%
    twListToDF()
}
friends <- bind_rows(friends)
With a bit more tinkering this could probably be made to work, but I'm not sure it's the best approach.
Alternatively, rtweet seems to offer a more elegant solution that might accomplish your goal. get_friends() extracts the friends of the specified users into a data frame, lookup_users() then looks up the profile data for those friends, and left_join() binds that result back to the friends data frame so you can distinguish which friends correspond to which users.
library(rtweet)
library(dplyr)

handles <- c("BarackObama", "ThePresObama")
handles.friends <- get_friends(handles)                  # one row per (user, friend) pair
handles.data <- lookup_users(handles.friends$user_id) %>%
  left_join(handles.friends)                             # joined on the shared user_id column
The pmap_* functions from purrr might also help implement a solution with the twitteR library (they have generally served me well for mapping over non-vectorized functions), but unfortunately I'm unable to get twitteR authentication working.
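For what it's worth, here is a sketch of what I have in mind, using map_dfr rather than pmap_* since there is only one input vector. It is untested: it assumes setup_twitter_oauth() has already succeeded, and the handle column is just a label I add to keep track of the source account.
library(twitteR)
library(purrr)
library(dplyr)

# assumes setup_twitter_oauth() has already been run successfully
handles <- c("BarackObama", "ThePresObama")

all_friends <- map_dfr(handles, function(h) {
  getUser(h)$getFriends() %>%
    twListToDF() %>%
    mutate(handle = h)   # record which handle each row of friends belongs to
})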
I am trying to get some text from a webpage. To simplify my question, let me use @Ronak Shah's Stack Overflow account as an example and extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
  html_nodes("div") %>%
  html_nodes("div") %>%
  html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then I manually went through the items in the data frame until I found the correct value, so$so[69]. My question is how to quickly find that specific target value. In my real case it is a little more complicated to do manually, as there are multiple items with the same value and I need to identify the correct one by its order. Thanks.
You need to find a specific tag and its respective class closer to your target. You can find that using SelectorGadget.
library(rvest)

read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
  html_nodes("div.grid--cell.fs-title") %>%
  html_text()
#[1] "254,328"
As far as scraping Stack Overflow is concerned, it has an API for getting information about users/questions/answers. In R, there is a wrapper package around it called stackr (not on CRAN) which makes this very easy.
library(stackr)

data <- stack_users(3962914)
data$reputation
#[1] 254328
data has a lot of other information about the user as well.
3962914 is the id of the user you are interested in, which can be found in their profile link (https://stackoverflow.com/users/3962914/ronak-shah).
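Since stackr is not on CRAN, it has to be installed from GitHub first; assuming the package still lives in the dgrtwo/stackr repository, that would look like:
# install.packages("remotes")   # if remotes is not already installed
remotes::install_github("dgrtwo/stackr")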
This is my first post and I'm a beginner with R, so please bear with me if I should have found the answer to my question elsewhere.
I'm trying to cobble together a table with data pulled from several CME pages (https://www.cmegroup.com/trading/energy/crude-oil/western-canadian-select-wcs-crude-oil-futures.html is one).
I've tried using rvest but get a blank table.
I think this is because JavaScript is being used to populate the table in real time. I've fumbled my way around this site looking for similar problems and haven't quite figured out how best to pull this data. Any help is much appreciated.
library(rvest)
library(dplyr)

WCS_page <- "https://www.cmegroup.com/trading/energy/crude-oil/canadian-heavy-crude-oil-net-energy-index-futures_quotes_globex.html"
WCS_diff <- read_html(WCS_page)

month <- WCS_diff %>%
  rvest::html_nodes('th') %>%
  xml2::xml_find_all("//scope[contains(@col, 'Month')]") %>%
  rvest::html_text()

price <- WCS_diff %>%
  rvest::html_nodes('tr') %>%
  xml2::xml_find_all("//td[contains(@class, 'quotesFuturesProductTable1_CLK0_last')]") %>%
  rvest::html_text()

WTI_df <- data.frame(month, price)

knitr::kable(WTI_df %>% head(10))
Yes, the page is using JS to load the data.
The easy way to check is to view the page source and then search for some of the text you saw in the table. For example, the word "May" never shows up in the raw HTML, so it must have been loaded later.
The next step is to use something like Chrome DevTools to inspect the network requests that were made. In this case there is a clear winner, and your structured data is coming down from here:
https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G
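That endpoint serves JSON, so as a rough sketch (assuming it is still live and doesn't require extra headers or cookies) it can be read directly without any HTML parsing:
library(jsonlite)

# the payload structure may change over time, so inspect it first
quotes <- fromJSON("https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G")
str(quotes, max.level = 2)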
I am using sparklyr for a project and have come to understand that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):
data_frame <- sdf_persist(data_frame)
Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some. However, I cannot seem to find the function to do this in sparklyr. Note that I have tried:
dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data_frame)
sdf_unpersist(data_frame)
But none of those work.
Also, I am trying to avoid using tbl_cache (in which case it seems that db_drop_table does work), as sdf_persist seems to offer more flexibility over the storage level. It might be that I am missing the big picture of how to use persistence here, in which case I'll be happy to learn more.
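For context, the storage-level flexibility I mean is sdf_persist's storage.level argument, which (if I'm reading the documentation correctly) lets me choose the persistence mode explicitly:
# persist explicitly to memory and disk instead of relying on tbl_cache()
data_frame <- sdf_persist(data_frame, storage.level = "MEMORY_AND_DISK")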
If you don't care about granularity then the simplest solution is to invoke Catalog.clearCache:
spark_session(sc) %>% invoke("catalog") %>% invoke("clearCache")
Uncaching a specific object is much less straightforward due to sparklyr's indirection. If you check the object returned by sdf_persist, you'll see that the persisted table is not exposed directly:
df <- copy_to(sc, iris, memory = FALSE, overwrite = TRUE) %>% sdf_persist()

spark_dataframe(df) %>%
  invoke("storageLevel") %>%
  invoke("equals", invoke_static(sc, "org.apache.spark.storage.StorageLevel", "NONE"))
#[1] TRUE
That's because you don't get the registered table directly, but rather the result of a subquery like SELECT * FROM ....
This means you cannot simply call unpersist:
spark_dataframe(df) %>% invoke("unpersist")
as you would in one of the official APIs.
Instead, you can try to retrieve the name of the source table, for example like this:
src_name <- as.character(df$ops$x)
and then invoke Catalog.uncacheTable:
spark_session(sc) %>% invoke("catalog") %>% invoke("uncacheTable", src_name)
That is likely not the most robust solution, so please use with caution.
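Putting the two steps together, a small helper could look roughly like this. unpersist_tbl is a made-up name, and the df$ops$x lookup depends on sparklyr internals that may change between versions:
# hypothetical helper: uncache the table backing a sparklyr tbl
unpersist_tbl <- function(df, sc) {
  src_name <- as.character(df$ops$x)     # name of the underlying registered table
  spark_session(sc) %>%
    invoke("catalog") %>%
    invoke("uncacheTable", src_name)
  invisible(df)
}

unpersist_tbl(df, sc)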
I'm attempting to scrape realtor.com for a school project. I have a working solution that uses a combination of the rvest and httr packages, but I want to migrate it to the RCurl package, specifically the getURLAsynchronous() function. I know my scraper would run much faster if it could download multiple URLs at once.
Here's what I have so far:
library(RCurl)
library(rvest)
library(httr)

urls <- c("http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-1?pgsz=50",
          "http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-2?pgsz=50")

prop.info <- vector("list", length = 0)

for (j in 1:length(urls)) {
  prop.info <- c(prop.info, urls[[j]] %>%       # appends each page's results to the list
    GET(add_headers("user-agent" = "r")) %>%
    read_html() %>%                             # parses the html response
    html_nodes(".srp-item-body") %>%            # grabs the appropriate html elements
    html_text())                                # converts them to a character vector
}
This gets me output that I can readily work with. I'm fetching each webpage with GET(), reading the HTML from the response, finding the html nodes, and converting them to text. The trouble starts when I attempt to implement something similar using RCurl.
Here is what I have for that using the same URLs:
getURLAsynchronous(urls) %>%
  read_html() %>%
  html_node(".srp-item-details") %>%
  html_text()
When I call getURLAsynchronous() on the urls vector, not all of the information is downloaded. I'm honestly not sure exactly what is being scraped, but I know it's considerably different from what my current solution returns.
Any ideas what I'm doing wrong? Or maybe an explanation on how getURLAsynchronous() should be working?
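For reference, here is my best guess at how the asynchronous results are meant to be handled, assuming getURLAsynchronous() returns one raw HTML string per URL that still has to be parsed separately (the useragent option is my attempt to mirror the add_headers() call above):
# parse each downloaded page source individually
pages <- getURLAsynchronous(urls, .opts = list(useragent = "r"))
prop.info <- unlist(lapply(pages, function(p) {
  read_html(p) %>%
    html_nodes(".srp-item-body") %>%
    html_text()
}))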
I've got a dataframe called base_table with a lot of 311 data and URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc by applying a series of rvest functions to each URL.
base_table$case_desc <-
  read_html(base_table$case_url) %>%
  html_nodes("rc_descrlong") %>%
  html_text()
But this doesn't work, for reasons that are probably obvious but that I can't articulate right now. I've tried playing around with functions, but can't seem to nail the right format.
Any help would be awesome! Thank you!
It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
read_html(x) %>%
html_nodes("rc_descrlong") %>%
html_text())
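If a page can contain more than one rc_descrlong node, sapply() may return a list rather than a character vector. A slightly more defensive sketch, collapsing the matches per page and using vapply() to guarantee a character result, could look like this:
library(rvest)

base_table$case_desc <- vapply(base_table$case_url, function(x) {
  read_html(x) %>%
    html_nodes("rc_descrlong") %>%
    html_text() %>%
    paste(collapse = " ")      # collapse multiple matches into a single string
}, character(1))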