I'm new to R and was wondering if it is possible, using R, to get a list of users who tweeted using the word "cats", for example, and then go through their timelines to see whether they also tweeted using the word "dogs".
I have managed, using the twitteR package, to get a list of user names and their tweets and put them into a data frame. I just don't know how to go about doing the rest, or whether it is even possible.
Any help at all would be greatly appreciated!
John, I am not sure I understand correctly what you are trying to achieve, but I am assuming that the data frame also contains a timestamp for each tweet. If that is the case, you can group by user and arrange in ascending order by timestamp. Thereafter you could use grepl() to test for 'dogs' or any other word you are searching for.
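A minimal base-R sketch of that idea. The data frame and its column names (user, created, text) are assumptions standing in for whatever twListToDF() produced:

```r
# Hypothetical data frame of tweets; column names are assumptions
tweets <- data.frame(
  user    = c("ann", "ann", "bob", "cat_fan", "cat_fan"),
  created = as.POSIXct("2020-01-01") + 1:5,
  text    = c("I love cats", "dogs are great too",
              "cats everywhere", "just cats", "no pets here"),
  stringsAsFactors = FALSE
)

# Arrange by user and timestamp, as suggested above
tweets <- tweets[order(tweets$user, tweets$created), ]

# Users who mentioned 'cats' at least once
cat_users <- unique(tweets$user[grepl("cats", tweets$text)])

# Of those, which ones also mentioned 'dogs' somewhere in their timeline?
dog_users <- unique(tweets$user[grepl("dogs", tweets$text)])
both <- intersect(cat_users, dog_users)
both
```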
For context, I asked a question earlier today about matching company names, with various spelling variations, against a big list of different company names using the stringdist() function from the stringdist package, in order to identify the companies in that big list. This is the question I asked.
Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.
I use RStudio, and I've noticed that its internal search function is much more effective:
As you can see in the picture, simply searching for the company name in the top-right corner gives me the output I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".
However, in my previous attempt with the stringdist function (see the link to my previous question), I would get results like "LAMINEX", which are not at all relevant but would appear before the more useful matches:
So it seems like the algorithm RStudio uses is much more effective in my case; however, I'm not quite sure whether it's possible to replicate this algorithm in code form, instead of having to manually search for each company.
Assuming I have a data frame that looks like this:
Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))
What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like Rstudio does in the first image?
From your description of which results are good or bad, it sounds like you want exact matches of a substring, rather than strings that are close on those distance measures. In that case you can imitate RStudio's search function with grepl:
library(tidyverse)
demo.df <- data.frame(name = paste(rep(c("abc", "jkl", "xyz"), each = 4), sample(1:100, 4*3)), limbs = 1:4*3)
demo.df %>% filter(grepl('abc|xyz', name))
The pipe (|) in the grepl pattern string means 'or', letting you search for multiple companies at the same time. So, to search for the names from the example data frame, the pattern string would be paste0(Company_list$Companies, collapse = "|"). Is this what you're after?
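Applied to the Company_list from the question, the combined pattern works with plain base-R subsetting too. The big_list here is a made-up stand-in for the large list of names:

```r
Company_list <- data.frame(Companies = c("AMMINEX", "Microsoft", "Apple"),
                           stringsAsFactors = FALSE)

# Made-up stand-in for the big list of company names
big_list <- c("AMMINEX EMISSIONS TECHNOLOGY", "AMMINEX AS", "LAMINEX",
              "MICROSOFT IRELAND", "APPLE INC", "ORANGE SA")

# Combine all the company names into one 'or' pattern
pattern <- paste0(Company_list$Companies, collapse = "|")

# Case-insensitive exact-substring matching, like RStudio's search box;
# note "LAMINEX" is not matched, since it lacks the double M of "AMMINEX"
matches <- big_list[grepl(pattern, big_list, ignore.case = TRUE)]
matches
```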
Sometimes I need to get some data from the web and organize it into a data frame, and I waste a lot of time doing it manually. I've been trying to figure out how to optimize this process, and I've tried some R scraping approaches but couldn't get them to work. I thought there might be an easier way to do this; can anyone help me out?
Fictional exercise:
Here's a webpage with countries listed by continents: https://simple.wikipedia.org/wiki/List_of_countries_by_continents
Each country name is also a link that leads to another webpage (specific of each country, e.g. https://simple.wikipedia.org/wiki/Angola).
I would like, as a final result, a data frame with one observation (row) per country listed and 4 variables (columns): ID = country name, Continent = the continent it belongs to, Language = official language (from the country's specific webpage) and Population = the most recent population count (also from the country's specific webpage).
Which steps should I follow in R in order to be able to reach to the final data frame?
This will probably get you most of the way. You'll want to play around with the different nodes and probably do some string manipulation (clean up) after you download what you need.
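A rough sketch of the node-based approach with rvest. The selectors here are assumptions and will need adjusting against the real pages; a tiny inline HTML snippet stands in for the Wikipedia list page so the extraction step is visible without a network connection:

```r
library(rvest)  # re-exports xml2::read_html

# In practice:
# page <- read_html("https://simple.wikipedia.org/wiki/List_of_countries_by_continents")
# For illustration, parse a stand-in snippet instead:
page <- read_html('
  <ul>
    <li><a href="/wiki/Angola">Angola</a></li>
    <li><a href="/wiki/Brazil">Brazil</a></li>
  </ul>')

# Select the country links; the "li a" selector is an assumption
links <- html_nodes(page, "li a")
countries <- data.frame(
  ID  = html_text(links),
  url = paste0("https://simple.wikipedia.org", html_attr(links, "href")),
  stringsAsFactors = FALSE
)

# Next steps (not run here): loop over countries$url with read_html(),
# pull Language and Population out of each country's infobox with
# html_nodes()/html_text(), and bind the results onto this data frame.
countries
```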
I'm very new to R. I'm using it mostly for marketing purposes, so the twitteR package is very useful.
What I'm trying to do is find the frequency of @mentions and #hashtags within my data after I've retrieved it all through the searchTwitter command.
I'm not sure what type of object searchTwitter returns, or whether I need to convert it into a data.frame, another type of vector, a corpus, or something else.
How do I break the data down into the total number of mentions/hashtags, and the frequency of each @mention or #hashtag?
This will give me a good idea of who the key influencers and key hashtags in a specific market are, and how valuable those influencers/hashtags are.
Please help.
Thanks
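A minimal base-R sketch of the counting step, assuming the statuses from searchTwitter have already been flattened into a character vector of tweet texts (e.g. via twListToDF()$text); the sample texts below are made up:

```r
# Hypothetical tweet texts; in practice, twListToDF(tweets)$text
texts <- c("@alice loving #rstats and #data",
           "#rstats tips from @alice and @bob",
           "just a plain tweet")

# Pull out every @mention and #hashtag with a regex
mentions <- unlist(regmatches(texts, gregexpr("@\\w+", texts)))
hashtags <- unlist(regmatches(texts, gregexpr("#\\w+", texts)))

# Frequency tables, sorted so the top influencers/hashtags come first
sort(table(mentions), decreasing = TRUE)
sort(table(hashtags), decreasing = TRUE)

length(mentions)  # total number of mentions
length(hashtags)  # total number of hashtags
```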
For a project at work, I need to generate a table from a list of proposal IDs and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function: the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, match() ignores the additional rows and returns only the first match, when I need all of them. I haven't been able to find anything in the documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab-delimited files
Awards <- read.delim("O:/testing.txt", as.is = TRUE)
Proposals <- read.delim("O:/test.txt", as.is = TRUE)
#match IDs from both spreadsheets
Proposals$TotalAwarded <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
write.table(Proposals, "O:/tested.txt", quote = FALSE, row.names = FALSE, sep = "\t")
This does exactly what I want, except that only the first match is returned for each ID.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See the help on merge: ?merge
merge(Proposals, Awards, by = "IDs", all.y = TRUE)
But I cannot believe this hasn't been asked on SO before.
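A small worked example of the duplicating behaviour, with made-up data (the join column is named IDs, as in the question's code):

```r
Proposals <- data.frame(IDs = c(1, 2, 3),
                        Title = c("A", "B", "C"),
                        stringsAsFactors = FALSE)
Awards <- data.frame(IDs = c(1, 1, 3),
                     TotalAwarded = c(100, 250, 75),
                     stringsAsFactors = FALSE)

# Unlike match(), merge() keeps every matching Awards row:
# proposal 1 appears twice, once per award.
# all.x = TRUE also keeps proposals with no award (TotalAwarded = NA);
# use all.y = TRUE instead to keep every award row.
merged <- merge(Proposals, Awards, by = "IDs", all.x = TRUE)
merged
```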
I'm trying to dabble in some basic sentiment analysis using the twitteR library and the searchTwitter function. Say I'm searching for tweets specific to "Samsung". I can retrieve the tweets with the command below:
samsung_t = searchTwitter("#samsung", n=1500, lang="en",cainfo="cacert.pem")
This, I know, will return all the tweets containing the hashtag #samsung. However, if I want to search for tweets containing "samsung" in them, I give the same command but without the "#":
samsung_t = searchTwitter("samsung", n=1500, lang="en",cainfo="cacert.pem")
This, however, will return all the tweets containing the term "samsung" anywhere in them, including in the handle. For example, it will return the tweet "@I_Love_Samsung: I like R programming", which is completely irrelevant to my criteria. If I wanted to do sentiment analysis on, say, "Samsung phones", I'm afraid that data like this could skew the results.
Is there a way I can force searchTwitter to only look in the "Tweet" but not the "Handle"?
Thanks a lot in advance.
Looking at the search API documentation and the listing of available search operators, I don't think the twitter search API offers this specific search capability (which seems kind of strange, frankly). I think your best bet is probably to run your search with the tools available to you and filter out the tweets that don't match your criteria from the results you get back.
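One way to do that post-filtering in base R: strip the @handles out of each tweet's text, then keep only the tweets that still contain the term. The sample texts here are made up:

```r
texts <- c("@I_Love_Samsung: I like R programming",   # term only in the handle
           "My new Samsung phone is great",
           "@bob Samsung just announced a new model")

# Remove @handles, then test for the term in what's left
stripped <- gsub("@\\w+", "", texts)
keep <- grepl("samsung", stripped, ignore.case = TRUE)
texts[keep]
```

The first tweet is dropped because, once "@I_Love_Samsung" is removed, nothing containing "samsung" remains.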