I am working on a project using tweets that contain emojis and emoticons. My goal is to get a combined sentiment score for each tweet (text + emoticons); the emoticons are often the most meaningful part of the data, so they cannot be neglected. I converted the encoding of the emojis and emoticons via iconv, but I am only getting a sentiment score for the text, not the emojis. I am using VADER in this process; if there is another sentiment library/lexicon that also gives sentiment scores for the emojis, that would be very helpful and highly appreciated.
Tweets:
dput(df_emoji$Description)
c("DoorDash or Uber method asap<f0><9f><98><ad> cause I be starving<f0><9f><98><ad><f0><9f><98><ad>",
"such a real ahh niqq cuz I be having myself weak asl<f0><9f><98><82>",
"shii made me laugh so fuccin hard bro<f0><9f><98><82><f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>",
"Hart and Will Ferrell made a Gem in Get hard fr<f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>",
"#NigerianAmazon Chill<f0><9f><a4><a3><f0><9f><98><ad>", "so bomedy <f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>",
"is that ass Gotdam<f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>",
"wild<f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>",
"them late night DoorDash<e2><80><99>s be goin crazy<f0><9f><a4><a3>",
"of the week<f0><9f><98><82><f0><9f><98><82><f0><9f><98><82><f0><9f><98><82>"
)
Code:
library(tidyr)   # separate()
library(vader)   # vader_df()

# Convert emojis/emoticons to their byte representation, e.g. <f0><9f><98><ad>
emoji_senti <- data.frame(text = iconv(data_sample$text, "latin1", "ASCII", "byte"),
                          stringsAsFactors = FALSE)
# Split each tweet into a leading token ("Bytes") and the rest ("Description")
column1 <- separate(emoji_senti, text, into = c("Bytes", "Description"), sep = "\\ ")
column2 <- separate(emoji_senti, text, into = c("Bytes", "Description"), sep = "^[^\\s]*\\s")
df_emoji <- data.frame(Bytes = column1$Bytes, Description = column2$Description)

allvals_emoji <- NULL
for (i in 1:length(df_emoji$Description)) {
  outs <- vader_df(df_emoji$Description[i])
  allvals_emoji <- rbind(allvals_emoji, outs)
}
allvals_emoji
Note that for the first tweet, word_scores contains only the scores for its nine English words; the converted Unicode bytes for the emojis receive no score at all.
# word_scores compound pos neu neg but_count
# 1 {0, 0, 0, 0, 0, 0, 0, 0, 0} 0.000 0.000 1.000 0.000 0
# 2 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.9, 0, 0} -0.440 0.000 0.805 0.195 0
# 3 {0, 0, 0, 2.6, 0, 0, -0.67835, 0, 0} 0.444 0.293 0.570 0.137 0
# 4 {0, 0, 0, 0, 0, 0, 0, 0, 0, -0.4, 0} -0.103 0.000 0.877 0.123 0
# 5 {0, 0} 0.000 0.000 1.000 0.000 0
# 6 {0, 0, 0, 0} 0.000 0.000 1.000 0.000 0
# 7 {0, 0, -2.5, 0, 0} -0.542 0.000 0.533 0.467 0
# 8 {0, 0} 0.000 0.000 1.000 0.000 0
# 9 {0, 0, 0, 0, 0, 0, 0} 0.000 0.000 1.000 0.000 0
# 10 {0, 0, 0, 0} 0.000 0.000 1.000 0.000 0
Check this discussion: VaderSentiment: unable to update emoji sentiment score
"Vader transforms emojis to their word representation prior to extracting sentiment"
From what I tested, emoji values are hidden but are part of the score and can influence it. If you need the score for a specific emoji, check library(lexicon): data.frame(hash_emojis_identifier) contains identifiers for emojis and matches them to a lexicon format, and data.frame(hash_sentiment_emojis) gives each emoji's sentiment value. However, using libraries such as vader and lexicon alone, it is not possible to determine the impact of a series of emojis on the total message score, because vader does not expose how it calculates their cumulative contribution.
You can evaluate the impact of the emoji though by doing a simple difference between the total score value of the message with emojis and the score without it:
allvals <- NULL
for (i in 1:length(data_sample)) {
  outs <- vader_df(data_sample[i])
  allvals <- rbind(allvals, outs)
}

allvalswithout <- NULL
for (i in 1:length(data_samplewithout)) {
  outs <- vader_df(data_samplewithout[i])
  allvalswithout <- rbind(allvalswithout, outs)
}

emojiscore <- allvals$compound - allvalswithout$compound
Then:
allvals <- cbind(allvals,emojiscore)
For large datasets it would be ideal to automate the process of removing emojis from the texts. Here I removed them manually, just to illustrate this kind of approach to the problem.
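Since iconv(..., "ASCII", "byte") renders every emoji as byte tokens like <f0><9f><98><ad>, one way to automate the removal (a sketch; the helper name is mine, and it assumes all non-ASCII content was converted this way) is a regex over those tokens:

```r
# Strip the <xx> byte tokens produced by iconv(..., "ASCII", "byte")
strip_emoji_bytes <- function(x) gsub("<[0-9a-f]{2}>", "", x)

strip_emoji_bytes("wild<f0><9f><98><82><f0><9f><98><82>")
# "wild"
```

Note that this also strips non-emoji byte tokens, such as the <e2><80><99> apostrophe appearing in the data above, so the "without emoji" text may lose some punctuation as well.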
I estimated a VECM and would like to run 4 separate tests of weak exogeneity, one for each variable.
library(urca)
library(vars)
data(Canada)
e prod rw U
1980 Q1 929.6105 405.3665 386.1361 7.53
1980 Q2 929.8040 404.6398 388.1358 7.70
1980 Q3 930.3184 403.8149 390.5401 7.47
1980 Q4 931.4277 404.2158 393.9638 7.27
1981 Q1 932.6620 405.0467 396.7647 7.37
1981 Q2 933.5509 404.4167 400.0217 7.13
...
jt = ca.jo(Canada, type = "trace", ecdet = "const", K = 2, spec = "transitory")
t = cajorls(jt, r = 1)
t$rlm$coefficients
e.d prod.d rw.d U.d
ect1 -0.005972228 0.004658649 -0.10607044 -0.02190508
e.dl1 0.812608320 -0.063226620 -0.36178542 -0.60482042
prod.dl1 0.208945048 0.275454380 -0.08418285 -0.09031236
rw.dl1 -0.045040603 0.094392696 -0.05462048 -0.01443323
U.dl1 0.218358784 -0.538972799 0.24391761 -0.16978208
t$beta
ect1
e.l1 1.00000000
prod.l1 0.08536852
rw.l1 -0.14261822
U.l1 4.28476955
constant -967.81673980
I guess that my equations are the ones shown in the picture above, and I would like to test whether alpha_e, alpha_prod, alpha_rw and alpha_U (marked in red in the picture) are zero, imposing the necessary restrictions on my model. So my question is: how can I do it?
I guess that my estimated alphas are:
e.d prod.d rw.d U.d
ect1 -0.005972228 0.004658649 -0.10607044 -0.02190508
I guess that I should use alrtest function from urca library:
alrtest(z = jt, A = A1, r = 1)
and probably my A matrix for alpha_e should be like this:
A1 = matrix(c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
nrow = 4, ncol = 3, byrow = TRUE)
The results of the test:
jt1 = alrtest(z = jt, A = A1, r = 1)
summary(jt1)
The value of the likelihood ratio test statistic:
0.48 distributed as chi square with 1 df.
The p-value of the test statistic is: 0.49
Eigenvectors, normalised to first column
of the restricted VAR:
[,1]
RK.e.l1 1.0000
RK.prod.l1 0.1352
RK.rw.l1 -0.1937
RK.U.l1 3.9760
RK.constant -960.2126
Weights W of the restricted VAR:
[,1]
[1,] 0.0000
[2,] 0.0084
[3,] -0.1342
[4,] -0.0315
Which I guess means that I can't reject the hypothesis of weak exogeneity of e (alpha_e = 0). And my new alphas here are: 0.0000, 0.0084, -0.1342, -0.0315.
Now the question is how can I impose this restriction on my VECM model?
If I do:
t1 = cajorls(jt1, r = 1)
t1$rlm$coefficients
e.d prod.d rw.d U.d
ect1 -0.005754775 0.007717881 -0.13282970 -0.02848404
e.dl1 0.830418381 -0.049601229 -0.30644063 -0.60236338
prod.dl1 0.207857861 0.272499006 -0.06742147 -0.08561076
rw.dl1 -0.037677197 0.102991919 -0.05986655 -0.02019326
U.dl1 0.231855899 -0.530897862 0.30720652 -0.16277775
t1$beta
ect1
e.l1 1.0000000
prod.l1 0.1351633
rw.l1 -0.1936612
U.l1 3.9759842
constant -960.2126150
the new model doesn't have 0.0000, 0.0084, -0.1342, -0.0315 for the alphas; it has -0.005754775, 0.007717881, -0.13282970, -0.02848404 instead.
How can I get the re-estimated model with alpha_e = 0? I want it for predictions (vecm -> vec2var -> predict, but vec2var doesn't accept jt1 directly). And in general, are the calculations I made correct?
Just for illustration, in EViews imposing restriction on alpha looks like this (not for this example):
If you have 1 cointegrating relationship (r = 1), as in t = cajorls(jt, r = 1), the loading matrix alpha itself is 4 x 1. The restriction matrix A passed to alrtest is a different object, though: it parameterises the restriction alpha = A %*% psi and has dimension p x m, where m is the number of unrestricted rows of alpha. With 4 variables and one row forced to zero, m = 3, so your 4 x 3 matrix
A1 = matrix(c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
nrow = 4, ncol = 3, byrow = TRUE)
is the right shape for testing weak exogeneity of e: its zero first row pins alpha_e to 0, and the resulting LR statistic has (p - m) * r = 1 degree of freedom, which matches your output.
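For the stated goal of four separate weak-exogeneity tests, one loop sketch (my assumption, based on the asker's own working call, is that the restriction matrix for zeroing row i of alpha is the 4 x 4 identity with column i dropped):

```r
library(urca)
data(Canada)

jt <- ca.jo(Canada, type = "trace", ecdet = "const", K = 2, spec = "transitory")

# Test weak exogeneity of variable i by forcing row i of alpha to zero:
# A is the 4 x 3 matrix formed by deleting column i of the identity.
for (i in 1:4) {
  A <- diag(4)[, -i]
  print(summary(alrtest(z = jt, A = A, r = 1)))
}
```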
I am currently using the Monte Carlo method here.
Although the code (with some minor adaptations) worked with my 2x2 and 3x3 matrices, I keep getting the following error for my 4x4 matrix:
Error in matrix(c(0.0461705, 0, 0, 0, 0, 0.0028639, 0, 0, 0, 0,
0.0740766, : 'dimnames' must be a list
What am I doing wrong and how do I address this error message?
################################################
# This code can be edited in this window and #
# submitted to Rweb, or for faster performance #
# and a nicer looking histogram, submit #
# directly to R. #
################################################
require(MASS)
a=1.1727132
b=0.2171818
c=1.3666784
d=0.1850852
rep=20000
conf=95
pest=c(a,b,c,d)
acov <- matrix(c(
0.0461705, 0, 0, 0,
0, 0.0028639, 0, 0,
0, 0, 0.0740766, 0,
0, 0, 0, 0.0013694
),4,4,4,4)
mcmc <- mvrnorm(rep,pest,acov,empirical=FALSE)
abcd <- mcmc[,1]*mcmc[,2]*mcmc[,3]*mcmc[,4]
low=(1-conf/100)/2
upp=((1-conf/100)/2)+(conf/100)
LL=quantile(abcd,low)
UL=quantile(abcd,upp)
LL4=format(LL,digits=4)
UL4=format(UL,digits=4)
################################################
# The number of columns in the histogram can #
# be changed by replacing 'FD' below with #
# an integer value. #
################################################
hist(abcd,breaks='FD',col='skyblue',xlab=paste(conf,'% Confidence Interval ','LL',LL4,' UL',UL4),
main='Distribution of Indirect Effect')
Thank you!
As @Remko said, please specify the arguments correctly. The matrix can be created as:
acov <- matrix(c(
0.0461705, 0, 0, 0,
0, 0.0028639, 0, 0,
0, 0, 0.0740766, 0,
0, 0, 0, 0.0013694
),nrow = 4, ncol = 4, byrow = T,dimnames = list(c("r","o","w","s"),c("c","o","l","s")))
You can set byrow = FALSE if you want the data arranged column-wise. The lengths of the rownames and colnames vectors must match the number of rows and the number of columns, respectively.
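For what it's worth, the cause of the original error is argument position: matrix()'s arguments are (data, nrow, ncol, byrow, dimnames), so in matrix(c(...), 4, 4, 4, 4) the final 4 lands in dimnames, which must be a list. Naming the arguments (dimnames are optional) avoids the problem:

```r
# matrix(data, nrow, ncol, byrow, dimnames): the 5th positional argument
# is 'dimnames', hence "'dimnames' must be a list" when it receives a
# bare 4. Named arguments sidestep this entirely:
acov <- matrix(c(
  0.0461705, 0,         0,         0,
  0,         0.0028639, 0,         0,
  0,         0,         0.0740766, 0,
  0,         0,         0,         0.0013694
), nrow = 4, ncol = 4)
```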
I have developed a linear programming model in R, and I would like to know the command to fix a variable at a given value. Here is my code and the results:
install.packages("lpSolveAPI")
library(lpSolveAPI)
#want to solve for 6 variables, these correspond to the number of bins
lprec <- make.lp(0, 6)
lp.control(lprec, sense="max")
#MODEL 1
set.objfn(lprec, c(13.8, 70.52,122.31,174.73,223.49,260.65))
add.constraint(lprec, c(13.8, 70.52, 122.31, 174.73, 223.49, 260.65), "=", 204600)
add.constraint(lprec, c(1,1,1,1,1,1), "=", 5000)
Here are the results:
> solve(lprec)
[1] 0
> get.objective(lprec)
[1] 204600
> get.variables(lprec)
[1] 2609.309 2390.691 0.000 0.000 0.000 0.000
I would like to fix the first result (2609) at 3200 and the last result at 48, and then optimize over the other variables. Any help would be much appreciated.
What you want is constrained optimization, so you should add further constraints reflecting those requirements. I am not too familiar with lpSolveAPI, so the exact code may differ, but you need something like:
add.constraint(lprec, c(1, 0, 0, 0, 0, 0), "=", 3200)
add.constraint(lprec, c(0, 0, 0, 0, 0, 1), "=", 48)
Along with your existing constraints.
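If the goal is to fix variables rather than add constraint rows, lpSolveAPI also offers set.bounds(); a sketch (using the model from the question) that pins x1 = 3200 and x6 = 48 by giving them equal lower and upper bounds:

```r
library(lpSolveAPI)

lprec <- make.lp(0, 6)
lp.control(lprec, sense = "max")
set.objfn(lprec, c(13.8, 70.52, 122.31, 174.73, 223.49, 260.65))
add.constraint(lprec, c(13.8, 70.52, 122.31, 174.73, 223.49, 260.65), "=", 204600)
add.constraint(lprec, c(1, 1, 1, 1, 1, 1), "=", 5000)

# Fix x1 = 3200 and x6 = 48 by collapsing their bounds
set.bounds(lprec, lower = c(3200, 48), upper = c(3200, 48), columns = c(1, 6))

solve(lprec)           # 0 indicates an optimal solution was found
get.variables(lprec)   # x1 and x6 now held at 3200 and 48
```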
I applied two unsupervised algorithms to the same data and would like to make a confusion matrix out of the results. How can I achieve this in R?
An example with R codes like following:
xx.1 <- c(41, 0, 4, 0, 0, 0, 0, 0, 0, 7, 0, 11, 8, 0, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 4)
xx.2 <- matrix(xx.1, nrow = 5)
rownames(xx.2) <- paste("Algo1", 1:5, sep = "_")
colnames(xx.2) <- paste("Algo2", 1:5, sep = "_")
xx.2
xx.2 holds the predictions of the two algorithms; each number shows how many observations were classified as Algo1_X by one algorithm and Algo2_X by the other:
Algo2_1 Algo2_2 Algo2_3 Algo2_4 Algo2_5
Algo1_1 41 0 0 0 0
Algo1_2 0 0 11 0 1
Algo1_3 4 0 8 0 1
Algo1_4 0 0 0 3 0
Algo1_5 0 7 0 0 4
The problem is how to rearrange the matrix into a confusion matrix, using the results of Algo1 as the reference. There are two sub-problems:
Determine the correspondence between the two algorithms; my thinking is that the most similar classifications should be paired.
Rearrange the matrix so that the diagonal holds the largest intersection values.
Here, Algo2_1 and Algo1_1 have the largest intersection value, so they are a pair; then Algo1_2 and Algo2_3 should be a pair, since they have the second-largest value among the remaining rows/columns, so Algo2_3 should be moved to the second column.
How could I do it easily in R? or there are packages available for this purpose?
Thanks!
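The greedy pairing described in the question can be sketched in base R (the helper name match_cols and the first-maximum tie-breaking rule are my assumptions):

```r
xx.1 <- c(41, 0, 4, 0, 0, 0, 0, 0, 0, 7, 0, 11, 8, 0, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 4)
xx.2 <- matrix(xx.1, nrow = 5)

# Reorder the columns of a square cross-tabulation so that at each step
# the largest remaining cell is placed on the diagonal (greedy pairing).
match_cols <- function(m) {
  ord <- integer(nrow(m))
  used_r <- used_c <- rep(FALSE, nrow(m))
  for (k in seq_len(nrow(m))) {
    mm <- m
    mm[used_r, ] <- -Inf        # mask rows already paired
    mm[, used_c] <- -Inf        # mask columns already paired
    idx <- which(mm == max(mm), arr.ind = TRUE)[1, ]
    ord[idx["row"]] <- idx["col"]
    used_r[idx["row"]] <- TRUE
    used_c[idx["col"]] <- TRUE
  }
  m[, ord]
}

match_cols(xx.2)  # Algo2 columns reordered to enlarge the diagonal
```

Greedy pairing is not guaranteed to maximise the diagonal sum overall; for a provably optimal assignment, the clue package's solve_LSAP() (Hungarian algorithm) is the usual tool.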
I'm using PMML to transfer my models (developed in R) between different platforms. One issue I often face is that, given input data, I need to do a lot of pre-processing. Most of the time this is straightforward in PMML, but I cannot figure out how to express a Koyck lag transformation. The first few lines of the input data set look like this:
Y Z S Xa Xb Xc
1 11.37738 1 0.8414710 0.0 0.0 581102.6
2 21.29848 2 0.9092974 700254.1 0.0 35695.1
3 14.30348 3 0.1411200 0.0 384556.3 0.0
4 18.07305 4 0.0000000 413643.2 0.0 0.0
5 29.02756 5 0.0000000 604453.3 0.0 350888.2
6 20.73336 6 0.0000000 0.0 0.0 168961.2
and is generated by:
df<-structure(list(Y = c(11.3773828021943, 21.2984762226498, 14.3034834956969,
18.0730530464578, 29.0275566937015, 20.7333617643781, 30.9707039948106,
30.2428379202751, 22.1677291047936, 19.7450403054104, 18.4642890388219,
28.4145184014117, 28.5224574661743, 40.5073319897728, 40.8853498146471,
20.7173713186907, 35.8080372291603, 37.6213598048788, 38.3123458040493,
25.143519382411),
Z = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
S = c(0.841470984807897, 0.909297426825682, 0.141120008059867,
0, 0, 0, 0.656986598718789, 0.989358246623382,
0.412118485241757, 0, 0, 0, 0.420167036826641, 0.99060735569487,
0.650287840157117, 0, 0, 0, 0.149877209662952, 0.912945250727628),
Xa = c(0, 700254.133201206, 0, 413643.212229974, 604453.339408554,
0, 623209.174415675, 1042574.05046884, 0, 0, 397257.053501325,
441408.09060313, 0, 0, 597980.888163467, 0, 121672.230528635,
199542.274825303, 447951.083632432, 84751.5842557032),
Xb = c(0, 0, 384556.309344495, 0, 0, 0, 0, 0, 0, 0, 0,
179488.805498654, 31956.7161910341, 785611.676606721,
65452.7295721654, 0, 231214.563631705, 0, 0,
176249.685091327),
Xc = c(581102.615208462, 35695.0974169599, 0, 0, 350888.245086195,
168961.239749307, 458076.400377529, 218707.589596171,
0, 506676.223324812, 0, 25613.8139087091, 429615.016105429,
410675.885159107, 0, 229898.803944166, 2727.64268459058,
711726.797796325, 354985.810664457, 0)),
.Names = c("Y", "Z", "S", "Xa", "Xb", "Xc"),
row.names = c(NA, -20L),
class = "data.frame")
I want to create a new variable M using koyck lags of the variables Xa, Xb and Xc like this:
# Koyck lag: y[1] = x[1] + ia*d, then y[i] = x[i] + d * y[i-1]
lagIt <- function(x, d, ia = mean(x)) {
  y <- x
  y[1] <- y[1] + ia * d
  for (i in 2:length(x)) y[i] <- y[i] + y[i - 1] * d
  y
}
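As a side note, the lagIt recursion is a first-order recursive filter, so (if I'm reading ?stats::filter right) it can be written without the explicit loop; this doesn't change the PMML situation, since the computation remains inherently sequential across records:

```r
# Equivalent to lagIt: y[1] = x[1] + d*ia, y[i] = x[i] + d*y[i-1],
# expressed as a recursive filter with y[0] = ia supplied via init.
lagIt2 <- function(x, d, ia = mean(x)) {
  as.numeric(stats::filter(x, filter = d, method = "recursive", init = ia))
}

lagIt2(c(1, 2, 3), d = 0.5, ia = 0)
# 1.00 2.50 4.25
```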
df2<-transform(df, M=(lagIt(tanh(Xa/300000), 0.5) +
lagIt(tanh(Xb/100000), 0.7) + lagIt(tanh(Xc/400000), 0.3)))
> head(df2)
# Y Z S Xa Xb Xc M
# 1 11.37738 1 0.8414710 0.0 0.0 581102.6 1.460318
# 2 21.29848 2 0.9092974 700254.1 0.0 35695.1 1.637388
# 3 14.30348 3 0.1411200 0.0 384556.3 0.0 1.767136
# 4 18.07305 4 0.0000000 413643.2 0.0 0.0 1.960151
# 5 29.02756 5 0.0000000 604453.3 0.0 350888.2 2.796750
# 6 20.73336 6 0.0000000 0.0 0.0 168961.2 1.761774
and finally build a model:
fit<-lm(Y~Z+S+M, data=df2)
Using the pmml library in R I can get the PMML XML output like this.
library(pmml)
pmml(fit)
However, I want to include a section of where the creation of the variable M takes place. How can I write that section conforming to PMML? Again the input data is the df data.frame and I want all pre-processing of data to be defined in PMML.
PMML operates on single-valued data records, but you're trying to use vector-valued data records. Most certainly, you cannot do (for-)loops in PMML.
Depending on your deployment platform, you might be able to use extension functions. Basically, this involves 1) programming Koyck lag transformation, 2) turning it into a standalone extension library and 3) making the PMML engine aware of this extension library. This extension function can be called by name just like all other built-in and user-defined functions.
The above should be doable using the JPMML library.