I am trying to use search_fullarchive from the rtweet package on a sandbox PREMIUM account with these exact search operators: park OR parks, lang:en and point_radius:[51.5074 0.1278 25mi]. I have tried the following:
test2 <- search_fullarchive(q = "park OR parks lang:en point_radius:[51.5074 0.1278 25mi]", n = 100, fromDate = "202003150000", toDate = "202003172359", env = "research", parse = TRUE, token = ActiveTravel_token)
The returned test2 object is a tbl_df filtered only by park OR parks. I've checked the documentation, and as a sandbox PREMIUM user I should be able to filter by lang: and point_radius: as well.
Could someone please help me get the filtering to also match the other two operators, lang:en and point_radius:[51.5074 0.1278 25mi]?
Thanks in advance!
Best wishes,
Irena
This should be as simple as wrapping the text in parentheses, with the whitespace acting as a logical AND for the other fields.
q = "(park OR parks) lang:en point_radius:[51.5074 0.1278 25mi]"
However, I've just tried this search and, at the moment, it returns zero Tweets within that point radius over that date range. When I substituted another point radius (the Boulder, CO example from the Twitter API documentation, point_radius:[-105.27346517 40.01924738 10.0mi]), it successfully brought back Tweets that matched the search parameters. Note that the Boulder example lists longitude first and latitude second, so for London the coordinates would probably need to be written as [-0.1278 51.5074 25mi].
As for finding very few tweets: the point_radius operator only returns tweets that were geotagged manually by the user at the time of the tweet, and then only within a small area with a maximum radius of 25 miles. Only a small fraction of tweets are geotagged. You will probably have more luck with the place: operator. It will also return tweets by people who have the "place" you search for set in their profile.
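The advice above can be sketched as a small query builder whose result would be passed as q to search_fullarchive. This is only a sketch: build_query is a hypothetical helper, the London coordinates are approximate, the longitude-first order is inferred from the Boulder example, and the place: value is purely illustrative.

```r
# Hypothetical helper: wrap the OR group in parentheses and join the
# remaining operators with spaces, which act as a logical AND.
build_query <- function(terms, ...) {
  or_group <- paste0("(", paste(terms, collapse = " OR "), ")")
  paste(c(or_group, ...), collapse = " ")
}

# Assumed coordinates for central London, longitude first.
lon <- -0.1278
lat <- 51.5074

q_radius <- build_query(
  c("park", "parks"),
  "lang:en",
  sprintf("point_radius:[%s %s 25mi]", lon, lat)
)
q_radius

# Alternative using the place: operator (the value is illustrative only).
q_place <- build_query(c("park", "parks"), "lang:en", "place:London")
q_place
```

Building the string separately makes it easy to print and inspect the exact query before spending one of the limited sandbox requests on it.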
I need to get tweets that contain at least one of the following hashtags: #EUwahl #Euwahlen #Europawahl #Europawahlen. That is, each tweet must contain at least one of those hashtags, but it may contain more of them. Furthermore, each of these tweets must also mention one of seven specific users (e.g. @AfD).
So far I only know how to search Twitter for a single hashtag or for several hashtags together. In other words, I am familiar with the AND operator (using a + between the hashtags) but not with the operator for OR.
This is an example of the code I have used so far to search Twitter:
euelection <- searchTwitter("#EUwahl", n=1000, since = "2019-05-01",until = "2019-05-26")
I can install twitteR, but it requires an authentication key that is not very easy for me to get.
The principle is to join the search terms with OR, separated by spaces. Here is an example with rtweet:
library(rtweet)
# your tags
TAGS = c("#EUwahl","#Europawahl")
# make the search term
SEARCH = paste(TAGS,collapse=" OR ")
# do the search
# you can also use twitteR
test <- search_tweets(SEARCH, n=100)
# your found tweet text
head(test$text)
## check which tweet contains which tag
tab = sapply(TAGS, function(i) as.numeric(grepl(i, test$text, ignore.case = TRUE)))
# all of them contain either #EUwahl or #Europawahl
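The question also asks that each tweet mention one of several accounts. A minimal sketch of that combination follows, assuming the search endpoint honours parentheses for grouping and treats the space between groups as AND. The USERS vector holds placeholder handles, and the sample texts are made up so the logic can be checked without an API call.

```r
# Hashtags from the question; at least one must appear.
TAGS <- c("#EUwahl", "#Euwahlen", "#Europawahl", "#Europawahlen")
# Hypothetical account list; replace with the seven real handles.
USERS <- c("@AfD", "@example_party")

# Parenthesise each OR group; the space between groups acts as AND.
or_group <- function(x) paste0("(", paste(x, collapse = " OR "), ")")
SEARCH <- paste(or_group(TAGS), or_group(USERS))
SEARCH

# Local check of the same logic on made-up tweet texts (no API call).
texts <- c("#EUwahl heute mit @AfD", "#Europawahl ohne Nennung", "@AfD allein")
has_any <- function(patterns, x) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
}
keep <- has_any(TAGS, texts) & has_any(USERS, texts)
texts[keep]
```

SEARCH can then be passed to search_tweets in place of the single OR expression used above.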
I'm trying to scrape information from https://www.kff.org/interactive/subsidy-calculator. For instance, enter state=California, zip=94704, income=20000, no coverage, 1 person, 1 adult, no children, age=21, no tobacco.
We get the following:
https://www.kff.org/interactive/subsidy-calculator/#state=ca&zip=94704&income-type=dollars&income=20000&employer-coverage=0&people=1&alternate-plan-family=individual&adult-count=1&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=0
I would like to get the numbers for "estimated financial help" and "your cost for a silver plan" (they are in bold blue in the grey "Results" box; for some reason I can't upload the screenshot). When I use the XPath for those numbers, I get back an empty string. This does not happen when I retrieve other text that is not in the grey box. I wonder what could be wrong. I have attached my code below. Please forgive me if this is a stupid question, since I'm very new to web scraping. Thank you!
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
state = tolower('CA')
zip = 94704
income = 20000
people = 1
adult = 1
children = 0
url = paste0("https://www.kff.org/interactive/subsidy-calculator/#state=", state, "&zip=", zip, "&income-type=dollars&income=", income, "&employer-coverage=0&people=", people, "&alternate-plan-family=individual&adult-count=", adult, "&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=", children)
# This returns an empty string
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span') %>%
  html_text()
# This returns "Number of children (20 and younger) enrolling in Marketplace coverage", a line that's not in the grey box.
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/p') %>%
  html_text()
The values are generated by scripts that run on the page. read_html only fetches the static HTML and does not execute those scripts, which is why you get an empty result. You are likely better off using a method that allows scripts to run, such as RSelenium.
The form you complete (#subsidy-form) feeds values into a template in a script tag (#results-template). The associated calculations are in this script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/calculator.js?ver=1.7.7 where you will find the logic and the pre-set values, such as poverty lines per year.
The simplest quick view is probably to inspect the JavaScript variables when the new SubsidyCalculator object is created to process the form, i.e. the JS starting with var sc = new SubsidyCalculator. You could 'reverse engineer' those variables from your own inputs plus the values returned by the JSON below. I think, but haven't confirmed, that this JSON feeds the 6 variables that begin with kff_sc into the calculator according to zipcode, e.g. silver: kff_sc.silver. The default values given at the top of the script give you an idea of the ballpark figures.
Figures in relation to zipcode are retrieved from this: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/94.json where the last two numbers before .json are the first two numbers of zipcode. You can determine this from the input validation script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/shared.js?ver=1.7.7
var bucket = $( this ).val().substring( 0, 2 );
if ( kff_sc.buckets[bucket] ) return;
$.ajax( '/wp-content/themes/vip/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json',
The first two digits determine the bucket.
All in all, you could likely implement your own calculator, but you would be re-inventing the wheel. It seems easier to just automate the browser and then extract the resulting values.
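For anyone who does want to look at the per-zipcode figures, here is a minimal sketch of the bucket rule described above. zip_bucket_url is a hypothetical helper, the URL pattern is copied from the 94.json example, and there is no guarantee the file is still served at that location.

```r
# Hypothetical helper: build the per-bucket JSON URL from a zipcode,
# mirroring substring(0, 2) in the validation script quoted above.
zip_bucket_url <- function(zip) {
  zip <- sprintf("%05d", as.integer(zip))   # keep leading zeros, e.g. 02139
  bucket <- substr(zip, 1, 2)               # first two digits
  paste0(
    "https://www.kff.org/wp-content/themes/kaiser-foundation-2016/",
    "interactives/subsidy-calculator/2019/json/zips/", bucket, ".json"
  )
}

zip_bucket_url(94704)
```

The resulting URL could then be read with a JSON parser such as jsonlite::fromJSON(), assuming the endpoint still responds.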
I need to retrieve data from Google Analytics using R.
I wrote the following code with googleAnalyticsR:
df <- google_analytics(viewId = my_id,
date_range=c(start,end),
metrics = c("pageViews"),
dimensions = "pagePath",
anti_sample = TRUE,
filtersExpression ="ga:pagePath==RisultatoRicerca?nomeCasa",
max=100000)
I need to set the filtersExpression parameter correctly.
I'd like to get data for every pagePath that contains RisultatoRicerca?nomeCasa. This code returns a data frame with 0 rows, which I know is impossible (the data comes from an e-commerce site with more than ten thousand interactions per day). So I've begun to think that my filtersExpression is incorrect.
Thanks in advance
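I managed to solve the problem by changing the operator in filtersExpression:
filtersExpression = "ga:pagePath=@RisultatoRicerca?nomeCasa"
This filter works on the pagePath dimension and keeps every path that contains RisultatoRicerca?nomeCasa, whereas == only matches a path that is exactly equal to that string.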
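The difference between the two operators can be sketched without calling the API. ga_filter is a hypothetical helper that only assembles the filter string, and the sample paths are made up; base R comparisons stand in for what the exact-match and substring operators do on the server.

```r
# Hypothetical helper: assemble a filter string for a dimension.
ga_filter <- function(dimension, operator, value) {
  stopifnot(operator %in% c("==", "!=", "=@", "!@", "=~", "!~"))
  paste0("ga:", dimension, operator, value)
}

exact    <- ga_filter("pagePath", "==", "RisultatoRicerca?nomeCasa")  # whole path must match
contains <- ga_filter("pagePath", "=@", "RisultatoRicerca?nomeCasa")  # substring match
contains

# The same distinction illustrated locally on made-up paths.
paths <- c("/shop/RisultatoRicerca?nomeCasa=abc", "/home")
is_exact     <- paths == "RisultatoRicerca?nomeCasa"
is_contained <- grepl("RisultatoRicerca?nomeCasa", paths, fixed = TRUE)
is_exact
is_contained
```

With real page paths, which usually carry a leading directory and a query value, an exact match is rarely satisfied, which is consistent with the 0-row result in the question.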
donald_tweets <- searchTwitter("Donald + Trump Republicans exclude:retweets",
n=50, lang = "en", since = "2016-03-16", until = "2016-03-17")
donald_tweets
But this gives me an error:
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit
50 tweets were requested but the API can only return 0
I have read somewhere that the problem lies with since and until, which can only search back a few days, and it is now 2018, not 2016. What can I do about this? Please help! This is for a project in R.
The Twitter Search Documentation contains two useful pieces of information.
About until, it says:
Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
It also shows that there is no since parameter. It is since_id:
Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occured since the since_id, the since_id will be forced to the oldest ID available.
So there are two errors in your code. You cannot search for anything older than a week. If you want to use a "since" parameter, you have to give it an ID, not a date.
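A quick way to catch the first error before calling the API is to check the requested date against the window. The following is a minimal sketch: in_search_window is a hypothetical helper, the 7-day figure is taken from the documentation quoted above, and the value used for today is only an example.

```r
# Hypothetical helper: is a date still inside the standard search window?
in_search_window <- function(date, today = Sys.Date(), window_days = 7) {
  date <- as.Date(date)
  age <- as.numeric(today - date)   # age of the requested date in days
  age >= 0 & age <= window_days
}

# Example values only: a 2016 date checked from a day in 2018.
too_old <- in_search_window("2016-03-16", today = as.Date("2018-05-01"))
recent  <- in_search_window("2018-04-28", today = as.Date("2018-05-01"))
too_old
recent
```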
For this query using R's twitteR::searchTwitter:
search_t <- searchTwitter("#netanyahu", n = 1000, since = '2015-09-13')
df <- do.call("rbind", lapply(search_t, as.data.frame))
View(df[, c('text', 'created', 'favoriteCount', 'retweetCount', 'favorited', 'retweeted', 'isRetweet')])
… I get the following results:
What does the favorited column mean? It obviously doesn't indicate whether the tweet has been favorited, since this one has been favorited 6 times. I also went on Twitter, favorited that particular tweet and then re-ran the query. It still shows FALSE.
From the Twitter API overview of Tweet objects:
Favorited: Nullable. Perspectival. Indicates whether this Tweet has been favorited by the authenticating user.
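In practice this means the count column, not the flag, tells you whether anyone has favorited a tweet. Here is a minimal sketch on a made-up data frame shaped like the columns selected in the question; the rows are illustrative, not real API output.

```r
# Made-up rows with the column names used in the question.
df <- data.frame(
  text          = c("tweet A", "tweet B"),
  favoriteCount = c(6, 0),          # favorites from all users
  favorited     = c(FALSE, FALSE),  # favorited by the authenticating user?
  stringsAsFactors = FALSE
)

# Tweets that anyone has favorited: use the count, not the flag.
liked_by_anyone <- df$favoriteCount > 0
df$text[liked_by_anyone]
```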