I'm dabbling in some basic sentiment analysis using the twitteR library and its searchTwitter function. Say I'm searching for tweets specific to "Samsung". I can retrieve the tweets with the command below:
samsung_t = searchTwitter("#samsung", n=1500, lang="en",cainfo="cacert.pem")
I know this will return all the tweets containing the hashtag #samsung. However, if I want to search for tweets containing "samsung", I give the same command without the "#":
samsung_t = searchTwitter("samsung", n=1500, lang="en",cainfo="cacert.pem")
This, however, returns all the tweets containing the term "samsung" anywhere, including in the handle. For example, it will return the tweet "#I_Love_Samsung: I like R programming", which is completely irrelevant to my criteria. If I want to do sentiment analysis on, say, "Samsung phones", I'm afraid data like this can skew the results.
Is there a way I can force searchTwitter to look only in the tweet text and not in the handle?
Thanks a lot in advance.
Looking at the search API documentation and the listing of available search operators, I don't think the Twitter search API offers this specific search capability (which seems kind of strange, frankly). I think your best bet is probably to run your search with the tools available to you and filter out, from the results you get back, the tweets that don't match your criteria.
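For example, a rough post-filtering sketch (assuming the samsung_t result from your second call; twListToDF is the twitteR helper that flattens the result, and the mention/hashtag-stripping regex is only illustrative):
library(twitteR)
# Flatten the list of status objects into a data frame
samsung_df <- twListToDF(samsung_t)
# Strip @mentions and #hashtags, then keep only tweets whose remaining
# text still contains "samsung"
stripped <- gsub("[@#]\\S+", "", samsung_df$text)
samsung_df <- samsung_df[grepl("samsung", stripped, ignore.case = TRUE), ]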
I have a list of local election candidates and I would like to find out
(i) whether these individuals have a Twitter account, and
(ii) if so, what their screen names/usernames are.
search_users seemed to be the best option but it does not do a good job. Here is an example:
y1 <- search_users(q="suleyman kilinc", n=5, parse=TRUE)
This gives me a list of 5 users, and none of them is the one I am looking for. This is often the case. But when I do the same search on Google with the keywords "suleyman+kilinc+twitter", the first result Google offers is exactly what I need. This is true for about 95% of the names I searched manually. Is there a good way to automate the name-to-username search in R, or a better option than the search_users function?
Any help is appreciated.
It is a very interesting question. The q parameter accepts a string, as indicated above. When you pass a phrase with a space as the value of q, you are instructing the function to search for "suleyman" & "kilinc"; that is, "suleyman kilinc" is the same as "suleyman AND kilinc". In this case the Twitter REST API will return any user matching both "suleyman" and "kilinc", regardless of the order.
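If you want to automate the name-to-username step, one rough sketch (assuming the data frame returned by search_users with parse = TRUE has name and screen_name columns, as in rtweet; the candidate vector is just a placeholder) is to pull a handful of hits per candidate and keep only those whose display name actually contains the queried name:
library(rtweet)
candidates <- c("suleyman kilinc")  # placeholder list of candidate names
results <- lapply(candidates, function(nm) {
  u <- search_users(q = nm, n = 20, parse = TRUE)
  # keep only hits whose display name contains the queried name
  u[grepl(nm, u$name, ignore.case = TRUE), c("name", "screen_name")]
})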
For context, I asked a question earlier today about matching company names (with various spelling variations) against a big list of different company names using the stringdist function from the stringdist package, in order to identify the companies in that big list. This is the question I asked.
Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.
I use RStudio, and I've noticed that its internal search function is much more effective:
As you can see in the picture, simply searching for the company name in the top-right corner gives me the output I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".
However, in my previous attempt with the stringdist function (see the link to my previous question) I would get results like "LAMINEX", which are not at all relevant but appear before the more useful matches:
So it seems the approach RStudio uses is much more effective in my case; however, I'm not sure whether it's possible to replicate this behaviour in code rather than manually searching for each company.
Assuming I have a data frame that looks like this:
Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))
What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like Rstudio does in the first image?
From your description of which results are good or bad, it sounds like you want exact matches of a substring rather than strings that are merely close on those distance measures. In that case you can imitate RStudio's search function with grepl:
library(tidyverse)
# Toy data: names built from three prefixes plus a random number
demo.df <- data.frame(name = paste(rep(c("abc", "jkl", "xyz"), each = 4), sample(1:100, 4*3)), limbs = 1:4*3)
# Keep rows whose name contains either "abc" or "xyz"
demo.df %>% filter(grepl('abc|xyz', name))
where the pipe in the grepl pattern string means 'or', letting you search for multiple companies at the same time. So, to search for the names from your example data frame, the pattern string would be paste0(Company_list$Companies, collapse = "|"). Is this what you're after?
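Applied to your Company_list against a hypothetical big list (big_list and its name column are placeholders here), that could look like this:
library(tidyverse)
Company_list <- data.frame(Companies = c('AMMINEX', 'Microsoft', 'Apple'))
big_list <- data.frame(name = c("AMMINEX EMISSIONS TECHNOLOGY", "AMMINEX AS",
                                "LAMINEX", "MICROSOFT IRELAND", "APPLE INC"))
# One pattern that matches any of the company names
pattern <- paste0(Company_list$Companies, collapse = "|")
# Keep rows of the big list containing any company name as a substring
big_list %>% filter(grepl(pattern, name, ignore.case = TRUE))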
I'm new to R and was wondering whether it is possible, using R, to get a list of users who tweet with the word "cats", for example, and then go through their timelines to see whether they also tweeted the word "dogs".
Using the twitteR package, I have managed to get a list of usernames and their tweets and put them into a data frame. I just don't know how to go about the rest, or whether it is even possible.
Any help at all would be greatly appreciated!
John, I am not sure I understand correctly what you are trying to achieve, but I am assuming that the data frame also contains a timestamp for each tweet. If that is the case, you can group by user and arrange in ascending order by timestamp. You could then use grepl() to look for 'dogs' or any other word you are searching for.
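A rough sketch of that idea with dplyr (assuming your data frame is called tweets_df and has screenName, text and created columns, as produced by twListToDF; adjust the names to match your data):
library(dplyr)
# Users who tweeted the word "cats"
cat_users <- tweets_df %>%
  filter(grepl("cats", text, ignore.case = TRUE)) %>%
  distinct(screenName)
# Their tweets in time order, flagging any that mention "dogs"
dog_tweets <- tweets_df %>%
  filter(screenName %in% cat_users$screenName) %>%
  group_by(screenName) %>%
  arrange(created, .by_group = TRUE) %>%
  mutate(mentions_dogs = grepl("dogs", text, ignore.case = TRUE))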
With the package twitteR, it is possible to search tweets as follows:
tweets <- searchTwitter("term", n = 100, lang = "en", resultType = "recent",
                        since = "2016-06-10", until = "2016-06-26")
With resultType="recent" we can get a large number of tweets, but they are ordered by creation time, so we start with a lot of tweets from around 2016-06-25 23:59:59.
I wanted to search for popular tweets first, so I used resultType="popular":
tweets <- searchTwitter("term", n = 100, lang = "en", resultType = "popular",
                        since = "2016-06-10", until = "2016-06-26")
But then I got this warning:
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
100 tweets were requested but the API can only return 93
I understand that Twitter limits the requests, but since it can return 100 tweets ordered by creation time, I hoped I could get the same number of tweets ordered by popularity. Apparently that is not the case. Or maybe I didn't use the function in the right way.
So I would like to find a way to search tweets efficiently:
How can I get more popular tweets in a day?
How can I specify an hour for the search, for example 10am, so that the tweets are not all from around 2016-06-25 23:59:59, which could bias the results?
Do we have to pay in order to get more tweets and more information? For example, I noticed that my tweets are never geocoded.
Usually I save them in a data.frame and then work with the number of retweets, etc. I don't think you can do it directly. Hope it helps.
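For example, a minimal sketch (assuming the tweets object from your searchTwitter call; twListToDF and the retweetCount/favoriteCount columns come from twitteR):
library(twitteR)
tweets_df <- twListToDF(tweets)
# Rank the returned tweets by retweet count as a rough proxy for popularity
tweets_df <- tweets_df[order(-tweets_df$retweetCount), ]
head(tweets_df[, c("text", "retweetCount", "favoriteCount", "created")])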
I don't believe Twitter will return the most popular tweets in order. Either the most recent or the popular tweets (however Twitter determines that) are returned. Since Twitter only returned 93 tweets, I'd suggest broadening your search terms and then looking at the number of favorites, retweets, replies, etc. for each tweet.
I am using the query() function of the seqinr package to download myoglobin DNA sequences from GenBank. E.g.:
query("myoglobins","K=myoglobin AND SP=Turdus merula")
Unfortunately, for a lot of the species I'm looking for I don't get any sequence at all (or, as for this species, only a very short one), even though I find sequences when I search manually on the website. This is because the query searches for "myoglobin" only in the keywords, which are often empty. The protein type is often specified only in the name ("definition" on GenBank), but I have no idea how to search for that.
The help page on query() doesn't seem to offer any option for this in the details, a "generic search" without any "K=" doesn't work, and I haven't found anything via googling.
I'd be happy about any links, explanations and help. Thank you! :)
There is a complete manual for the seqinr package that describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS is blank, so they don't come up when searching with the K= option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.
This would pull out the annotation for the first gene:
choosebank("emblTP")
query("ACexample", "sp=Turdus merula")
getName(ACexample$req[[1]])
annotations <- getAnnot(ACexample$req[[1]])
cat(annotations, sep = "\n")
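Extending that, a rough sketch of the matching step described above (only illustrative; it assumes the word "myoglobin" actually appears in the annotation lines):
# Keep accession names whose annotation mentions "myoglobin"
hits <- sapply(ACexample$req, function(s) {
  if (any(grepl("myoglobin", getAnnot(s), ignore.case = TRUE))) getName(s) else NA
})
hits <- hits[!is.na(hits)]
hits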
I think that this would be a pretty time consuming way to tackle the problem but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.