Extracting hashtags from tweets - R

I am trying to perform sentiment analysis and am facing a small problem. I am using a dictionary that has hashtags and some other junk values (shown below), along with the associated weight of each hashtag. I want to extract only the hashtags and their corresponding weights into a new data frame. Is there an easy way to do it?
I have tried using regmatches, but somehow it gives output in list format and messes things up.
Input:
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
3 superb 7.199
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
Output:
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
3 #perfection 7.099
4 #terrific 6.922
5 #magnificent 6.672

To select only the entries that are hashtags you can use the simple regex ^# (meaning "anything that starts with a #"):
> input[grepl("^#",input[,1]),]
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
Otherwise, starting from your original tweets, the regex #[[:alnum:]]+ (meaning: "a #, followed by one or more alphanumeric characters") should help you grab the hashtags:
> tweets <- c("New R job: Statistical and Methodological Consultant at the Center for Open Science http://www.r-users.com/jobs/statistical-methodological-consultant-center-open-science/ … #rstats #jobs","New R job: Research Engineer/Applied Researcher at eBay http://www.r-users.com/jobs/research-engineerapplied-researcher-ebay/ … #rstats #jobs")
> match <- regmatches(tweets,gregexpr("#[[:alnum:]]+",tweets))
> match
[[1]]
[1] "#rstats" "#jobs"
[[2]]
[1] "#rstats" "#jobs"
> unlist(match)
[1] "#rstats" "#jobs" "#rstats" "#jobs"
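If the list output from regmatches() is the sticking point, unlist() followed by table() flattens it into a data frame of hashtag counts. A small sketch on made-up toy tweets (not the asker's data):

```r
tweets <- c("New #rstats job posted #jobs", "Another #rstats opening #jobs")

# gregexpr() returns one match set per tweet; unlist() flattens the list
tags <- unlist(regmatches(tweets, gregexpr("#[[:alnum:]]+", tweets)))

# table() counts occurrences; as.data.frame() turns that into a data frame
counts <- as.data.frame(table(tags))
# counts has two rows: "#jobs" and "#rstats", each with Freq 2
```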

This code should work and will give you the desired output as a data frame (note that V2 is created as numeric, so the weights stay usable for arithmetic):
Input <- data.frame(V1 = c("#fabulous", "#excellent", "superb", "#perfection", "#terrific", "#magnificent"),
                    V2 = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672))
extractHashtags <- Input[substr(Input$V1, 1, 1) == "#", ]
View(extractHashtags)

Related

Store API outcomes into a new column in R

I have a data frame of names (df) as follows.
ID name
1 Xiaoao
2 Yukata
3 Kim
4 ...
Example API output looks like this:
European-SouthSlavs,0.2244 Muslim-Pakistanis-Bangladesh,0.0000 European-Italian-Italy,0.0061 ...
I would like to add new columns using an API that returns nationality scores for up to 39 nationalities, listing the top 3 scores per name. My desired outcome is as follows.
ID name score nat
1 Xiaoao 0.7361 Chinese
1 Xiaoao 0.1721 Korean
1 Xiaoao 0.0721 Japanese
2 Yukata 0.8121 Japanese
2 Yukata 0.0811 Chinese
2 Yukata 0.0122 Korean
3 Kim 0.6532 Korean
3 Kim 0.2182 Chinese
3 Kim 0.0981 Japanese
4 ... ... ...
Below is some of my scratch work. But I failed to get the desired outcome due to a number of errors.
df_result <- purrr::map_dfr(df$name, function(name) {
  result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
                       "API TOKEN", "/", URLencode(df$name)))
  if (http_error(result)) {
    NULL
  } else {
    nat <- content(result, "text")
    nat <- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl = T)[[1]], ","))
    # first three nationalities
    top_nat <- nat[order(as.numeric(nat[, 2]), decreasing = T)[1:3], ]
    c(df$name, as.vector(t(top_nat)))
  }
})
First, the top scores in the results were based on the entire data rather than computed per name.
Second, I faced an error saying "Error in dplyr::bind_rows(): ! Argument 1 must have names."
Any comments on my code would be appreciated!
Thank you in advance.
Each iteration of map_dfr should return a data frame, whose rows are then bound together:
library(tidyverse)
library(httr)
df <- data.frame(name = c("Xiaoao", "Yukata", "Kim"))
map_dfr(df$name, function(name) {
  data.frame(name = df$name, score = sample(1:10, 1))
})
Instead of concatenating name with top_nat at the end of your function, you should be making it a data.frame!
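Putting that advice together, here is a minimal sketch of the fixed per-name step. The API call is stubbed out (parse_nat, top3_for, and the response text are assumptions for illustration, not the real name-prism client); note also that the original loop sends URLencode(df$name) rather than URLencode(name), which is why every iteration queried the whole column:

```r
library(purrr)

# Hypothetical parser for the "label,score" lines shown in the question.
parse_nat <- function(txt) {
  parts <- do.call(rbind, strsplit(strsplit(txt, "\n")[[1]], ","))
  data.frame(nat = parts[, 1], score = as.numeric(parts[, 2]),
             stringsAsFactors = FALSE)
}

top3_for <- function(name, txt) {
  scores <- parse_nat(txt)
  scores <- scores[order(scores$score, decreasing = TRUE)[1:3], ]
  # Return a data frame (not a bare vector) so map_dfr can bind rows.
  data.frame(name = name, score = scores$score, nat = scores$nat,
             stringsAsFactors = FALSE)
}

# One stubbed response per name; in the real code this would come from GET(...).
responses <- list(Kim = "Korean,0.6532\nChinese,0.2182\nJapanese,0.0981\nOther,0.0100")
res <- map_dfr(names(responses), function(nm) top3_for(nm, responses[[nm]]))
# res has three rows for "Kim", ordered by descending score
```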

Read txt file selectively in R

I'm looking for an easy fix to read a txt file that looks like this when opened in Excel:
IDmaster By_uspto App_date Grant_date Applicant Cited
2 1 19671106 19700707 Motorola Inc 1052446
2 1 19740909 19751028 Gen Motors Corp 1062884
2 1 19800331 19820817 Amp Incorporated 1082369
2 1 19910515 19940719 Dell Usa L.P. 389546
2 1 19940210 19950912 Schueman Transfer Inc. 1164239
2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
EDIT: opening the txt file in Notepad shows this (with commas); the last two rows exhibit the problem.
IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336
The problem is that some of the Applicant names contain commas so that they are read as if they belong in a different column, which they actually don't.
Is there a simple way to
a) "teach" R to keep string variables together, regardless of the commas in between, or
b) read in the first 4 columns, and then add an extra column for everything after the last comma?
Given the length of the data I can't open it entirely in Excel, which would otherwise be a simple alternative.
If your example is written in a "Test.csv" file, try with:
read.csv(text = gsub(", ", " ", paste0(readLines("Test.csv"), collapse = "\n")),
         quote = "'",
         stringsAsFactors = FALSE)
It returns:
# IDmaster By_uspto App_date Grant_date Applicant Cited
# 1 2 1 19671106 19700707 Motorola Inc 1052446
# 2 2 1 19740909 19751028 Gen Motors Corp 1062884
# 3 2 1 19800331 19820817 Amp Incorporated 1082369
# 4 2 1 19910515 19940719 Dell Usa L.P. 389546
# 5 2 1 19940210 19950912 Schueman Transfer Inc. 1164239
# 6 2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
This provides a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.
Step 1: open the .txt file in Notepad and append five extra column names, V1 through V5, to the header (to be sure to capture names with multiple commas).
bc <- read.table("data.txt", header = T, na.strings = T, fill = T, sep = ",", stringsAsFactors = F)
library(data.table)
sapply(bc, class)
unique(bc$V5) # only NA so can be deleted
setDT(bc)
bc <- bc[,1:10, with = F]
bc$Cited <- as.numeric(bc$Cited)
bc$Cited[is.na(bc$Cited)] <- 0
bc$V1 <- as.numeric(bc$V1)
bc$V2 <- as.numeric(bc$V2)
bc$V3 <- as.numeric(bc$V3)
bc$V4 <- as.numeric(bc$V4)
bc$V1[is.na(bc$V1)] <- 0
bc$V2[is.na(bc$V2)] <- 0
bc$V3[is.na(bc$V3)] <- 0
bc$V4[is.na(bc$V4)] <- 0
head(bc, 10)
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)
It's a silly patch but it does the trick in this particular context
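The question's option (b) can also be sketched with a capturing regex: match the first four comma-free fields, let a greedy group swallow the applicant name (embedded commas and all), and anchor the cited count after the last comma. This is a sketch assuming exactly the six-column layout shown above:

```r
line <- "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"

# Four comma-free fields, then a greedy group, then whatever follows the
# last comma. regexec() returns the full match plus each capture group.
m <- regmatches(line,
                regexec("^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$", line))
fields <- m[[1]][-1]  # drop the full match; six fields remain

fields[5]  # the applicant, commas intact: "Schueman Transfer, Inc."
fields[6]  # the cited count: "1164239"
```

Applied line by line over readLines() output (skipping the header), this reconstructs all six columns without losing the commas inside the applicant names.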

How can I convert gene names (hgnc_symbol) to Ensembl IDs in R? "bioconductor-biomaRt"

I have a list of genes as rownames of my eset and I want to convert them to Ensembl gene IDs.
I used getGene in the biomaRt package, but it returned the same name twice for some genes!
Here is a small example of my code:
library (biomaRt)
rownames(eset)
[1] "EPC1" "MYO3A" "PARD3" "ATRNL1" "GDF2" "IL10RA" "GAD2" "CCDC6"
getGene(rownames(eset),type='hgnc_symbol',mart)[c(1,9)]
# [1] is the hgnc_symbol to recheck the matched data
# [9] is the ensemble_gene_id
hgnc_symbol ensembl_gene_id
1 ATRNL1 ENSG00000107518
2 CCDC6 ENSG00000108091
3 EPC1 ENSG00000120616
4 GAD2 ENSG00000136750
5 GDF2 ENSG00000263761
6 IL10RA ENSG00000110324
7 IL10RA LRG_151
8 MYO3A ENSG00000095777
9 PARD3 ENSG00000148498
As you can see, there are two entries for "IL10RA" in the hgnc_symbol column, but I only had one "IL10RA" in rownames(eset); this causes a problem at the end when I want to add the Ensembl IDs to fData(eset)!
How can I solve this problem, to get a result like this:
hgnc_symbol ensembl_gene_id
1 ATRNL1 ENSG00000107518
2 CCDC6 ENSG00000108091
3 EPC1 ENSG00000120616
4 GAD2 ENSG00000136750
5 GDF2 ENSG00000263761
6 IL10RA ENSG00000110324
7 MYO3A ENSG00000095777
8 PARD3 ENSG00000148498
Thanks in advance,
I've found the solution using !duplicated on the result.
Something like this:
g_All <- getGene(id = rownames(eset), type = 'hgnc_symbol', mart = mart)
g_All <- g_All[!duplicated(g_All[, 1]), ]
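On a toy data frame mirroring the duplicated IL10RA rows above (made-up here, not a biomaRt call), the !duplicated() step keeps the first row per symbol:

```r
g_all <- data.frame(hgnc_symbol = c("IL10RA", "IL10RA", "MYO3A"),
                    ensembl_gene_id = c("ENSG00000110324", "LRG_151",
                                        "ENSG00000095777"),
                    stringsAsFactors = FALSE)

# duplicated() flags the second IL10RA row; negating it keeps the first
deduped <- g_all[!duplicated(g_all$hgnc_symbol), ]
# deduped$ensembl_gene_id is c("ENSG00000110324", "ENSG00000095777")
```

Since the extra entry in this example is an LRG_ accession, filtering with g_all[grepl("^ENSG", g_all$ensembl_gene_id), ] would give the same result; which filter is safer depends on why biomaRt returned the duplicate in the first place.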

Why does subset cut the decimal part?

Hi, this is a sample of a data.frame / list with two columns, X and Y. My problem is that when I call subset it cuts the decimal part. Can you help me figure out why?
(row.names | X | Y)
> var
...
9150 4246838.57 5785639.07
9152 4462019.15 5756344.11
9153 4671745.07 5791092.53
9154 4825699.93 5767058.37
9155 4935126.99 5839357.55
> typeof(var)
[1] "list"
> var = subset(var, Y>10980116 & X>3217133)
...
6569 15163607 11323070
6572 15102381 11079465
6573 16462260 11272569
6577 19028175 11095784
It's the same when I use:
> var = var[var$Y>10980116 & var$X>3217133,]
Thank you for your help.
This is not a subsetting issue, it's a formatting/presentation issue. You're in the first circle of Burns's R Inferno ("[i]f you are using R and you think you’re in hell, this is a map for you"):
another aspect of virtuous pagan beliefs—what is printed is all
that there is
If we just print this bit of the data frame exactly as entered, we "lose" digits.
> df <- read.table(text="
4246838.57 5785639.07
4462019.15 5756344.11
4671745.07 5791092.53
4825699.93 5767058.37
4935126.99 5839357.55",
header=FALSE)
> df
## V1 V2
## 1 4246839 5785639
## 2 4462019 5756344
## 3 4671745 5791093
## 4 4825700 5767058
## 5 4935127 5839358
Tell R you want to see more precision:
> print(df,digits=10)
## V1 V2
## 1 4246838.57 5785639.07
## 2 4462019.15 5756344.11
## 3 4671745.07 5791092.53
## 4 4825699.93 5767058.37
## 5 4935126.99 5839357.55
Or you can set options(digits=10) (the default is 7).
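If you only need full precision when displaying particular values, sprintf() formats them without touching the global digits option (a small sketch):

```r
x <- c(4246838.57, 5785639.07)

# "%.2f" renders each value with exactly two decimal places, as strings
sprintf("%.2f", x)
# "4246838.57" "5785639.07"
```

The underlying numeric values are unchanged in all of these cases; only what gets printed differs.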

Converting scraped R data using readHTMLTable()

I'm trying to scrape the website http://www.hockeyfights.com/fightlog/ but am having a hard time putting the results into a nice data frame. So far I have this:
> asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
> asdf.asdf <- readHTMLTable(asdf)
Then I get this giant list. How do I convert it into a two-column data frame containing only the player names (who were in a fight), with n rows (the number of fights)?
Thanks for your help in advance.
Is this the output you're after?
require(RCurl); require(XML)
asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
asdf.asdf <- readHTMLTable(asdf)
First, make a table of each player and the count of fights they've been in...
# get variable with player names
one <- as.character(na.omit(asdf.asdf[[1]]$V3))
# get counts of how many times each name appears
two <- data.frame(table(one))
# remove non-name data
three <- two[two$one != 'Away / Home Player',]
# check
head(three)
one Freq
1 Aaron Volpatti 1
3 Brandon Bollig 1
4 Brian Boyle 1
5 Brian McGrattan 1
6 Chris Neil 2
7 Colin Greening 1
Second, make a table of who is in each fight...
# make data frame of pairs by subsetting the vector of names
four <- data.frame(away = one[seq(2, length(one), 3)],
                   home = one[seq(3, length(one), 3)])
# check
head(four)
away home
1 Brian Boyle Zdeno Chara
2 Tom Sestito Chris Neil
3 Dale Weise Mark Borowiecki
4 Brandon Bollig Brian McGrattan
5 Scott Hartnell Eric Brewer
6 Colin Greening Aaron Volpatti
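The seq() indexing above assumes the scraped column repeats in blocks of three: a header cell, then the away player, then the home player. On a toy vector with that layout (toy names, not scraped data), the pairing works like this:

```r
one <- c("Away / Home Player", "Brian Boyle", "Zdeno Chara",
         "Away / Home Player", "Tom Sestito", "Chris Neil")

# positions 2, 5, 8, ... are away players; 3, 6, 9, ... are home players
four <- data.frame(away = one[seq(2, length(one), 3)],
                   home = one[seq(3, length(one), 3)],
                   stringsAsFactors = FALSE)
# four$away is c("Brian Boyle", "Tom Sestito")
# four$home is c("Zdeno Chara", "Chris Neil")
```

If the site's table layout changes (extra header rows, different block size), the 2/3 offsets and the step of 3 would need to be adjusted accordingly.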
