sentiment analysis with different number of documents - r

I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, I will identify all the relevant news articles within a day, feed them into the polarity() function, and obtain the average polarity score of all the articles (more precisely, the average over all the sentences from all the articles) within that day.
The problem is, for some days there are many more articles than others, and I think this might mask some of the information if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. And sure enough, some of the more extreme polarity scores I obtained came from days where there were only a few relevant articles.
Is there any way I can take the different number of articles each day into consideration?
library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)

I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.
One approach would be to use the number of words, as in the following example (I like the first approach more here):
poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))
output <- scores(poldat2)
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weight/max(weight)
weight2 <- output[["total.words"]]/max(output[["total.words"]])
output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
output[, -c(5:6)]
## sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1 f.cap.FALSE 158 1641 0.083 0.143583793 0.082504197
## 2 f.cap.TRUE 24 206 0.044 0.060969157 0.005564434
## 3 f.mont.TRUE 4 29 0.079 0.060996614 0.001397106
## 4 m.cap.FALSE 73 651 0.031 0.049163984 0.012191207
## 5 m.cap.TRUE 17 160 -0.176 -0.231357933 -0.017135804
## 6 m.escal.FALSE 9 170 -0.164 -0.218126656 -0.016977931
## 7 m.escal.TRUE 27 590 -0.067 -0.106080866 -0.024092720
## 8 m.mont.FALSE 70 868 -0.047 -0.078139272 -0.025099276
## 9 m.mont.TRUE 114 1175 -0.002 -0.003389105 -0.001433481
## 10 m.none.FALSE 7 71 0.066 0.072409049 0.002862997
## 11 none.none.FALSE 5 16 -0.300 -0.147087026 -0.002925046
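Translating that idea to your daily-article setting, a minimal sketch (not from qdap itself; the daily data frame and its columns are made up for illustration) is to weight each day's average polarity by a log of the article count, so that a 0.1 from 30 articles counts for more than a 0.1 from 3:
# Hypothetical daily summary: one row per day, with the average polarity
# and the number of articles that produced it
daily <- data.frame(date = as.Date("2016-01-01") + 0:2,
                    ave.polarity = c(0.10, 0.10, -0.25),
                    n.articles   = c(30, 3, 5))
# Log-based weight, rescaled so the best-covered day gets weight 1
w <- log(1 + daily$n.articles)
daily$weighted.polarity <- daily$ave.polarity * (w / max(w))
daily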

Issue in creating contingency table in R

I am using the ISLR package for my statistics practice, with the OJ dataset. I am trying to create a contingency table for the Purchase column and the special-price columns (SpecialCH, SpecialMM) for each of the populations.
I am trying to find the likelihood of CH being sold if there is a special price.
Here is my code so far.
library(ISLR)
CH <- table(OJ[OJ$Purchase == 'CH', "SpecialCH"])
MM <- table(OJ[OJ$Purchase == 'MM', "SpecialMM"])
table(MM, CH)
The output that I get is a bit weird.
     CH
MM    121 532
  101   1   0
  316   0   1
I am trying to find the odds ratio and eventually apply McNemar's test. But I am unable to generate the contingency table. I can do it by hand but need to do it in R.
You are trying to work with 3 variables, but a contingency table only uses 2. I recommend using xtabs since the formula method saves some typing and it does a better job of labeling the table:
xtabs(~SpecialMM+SpecialCH, OJ) # Only 4 weeks are both on special
# SpecialCH
# SpecialMM 0 1
# 0 743 154
# 1 169 4
xtabs(~Purchase+SpecialCH, OJ) # When CH is on special ca 75% CH
# SpecialCH
# Purchase 0 1
# CH 532 121
# MM 380 37
xtabs(~Purchase+SpecialMM, OJ) # When MM is on special ca 58% MM
# SpecialMM
# Purchase 0 1
# CH 581 72
# MM 316 101
The first table asks the question: are specials for one brand associated with specials for the other brand? There are 1070 purchases of OJ represented. CH was on special 158 times and MM was on special 173 times, but both brands were on special at the same time only 4 times. This table suggests that MM and CH are rarely on special at the same time. You could use a chi-square test or another test to see whether that is a significant deviation from random assignment of specials.
The second and third tables look at purchases of OJ to see if one brand is more likely to be purchased relative to the other when it is on sale. Notice that most OJ purchases occur when neither is on sale, but it could be that sales boost purchases of the brand on sale. Again, statistical tests would tell you whether this is likely to be more than random chance.
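If you then want a formal test, something along these lines should work (a sketch using base R; fisher.test also reports the odds ratio you mentioned, for a 2x2 table):
special_tab <- xtabs(~ SpecialMM + SpecialCH, OJ)
chisq.test(special_tab)                          # are the two specials associated?
fisher.test(xtabs(~ Purchase + SpecialCH, OJ))   # odds ratio for Purchase vs SpecialCH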

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-section of the population of any kind of wild fauna), so essentially dropping variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass = c(1, 2, 3, 4, 5, 6)
Sample = c(100, 99, 87, 46, 32, 19)
for (i in 1:6) {
  PropSurv = c(Sample/100)
}
LifeTab1 = data.frame(cbind(AgeClass, Sample, PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on), so that it looks like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I used a for() loop to get the proportion that survived in each age group. I was wondering if there is another way to do it, or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x that contains numbers, you can calculate the successive differences using the diff function.
In your case it would be:
LifeTab1$DeathInt <- c(-diff(Sample), NA)
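On your side note, the for() loop isn't needed because R is vectorised; a minimal sketch (dividing by the initial cohort size rather than hard-coding 100):
# Proportion surviving relative to the first age class, no loop required
LifeTab1$PropSurv <- LifeTab1$Sample / LifeTab1$Sample[1]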

Word frequency over time by user in R

I'm aiming to make a bump chart of word frequency over time. I have about 36000 individual entries of a user's comment and an associated date. I have a 25 user sample available here: http://pastebin.com/kKfby5kf
I'm trying to get the most frequent words (maybe top 10?) on a given date. I feel like my methodology is close, but not quite right:
library("tm")
frequencylist <- list(0)
for(i in unique(sampledf[,2])){
subset <- subset(sampledf, sampledf[,2]==i)
comments <- as.vector(subset[,1])
verbatims <- Corpus(VectorSource(comments))
verbatims <- tm_map(verbatims, stripWhitespace)
verbatims <- tm_map(verbatims, content_transformer(tolower))
verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
verbatims <- tm_map(verbatims, removePunctuation)
stopwords2 <- c("game")
verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
dtm <- DocumentTermMatrix(verbatims2)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
frequencydf <- data.frame(frequency)
frequencydf$comments <- row.names(frequencydf)
frequencydf$date <- i
frequencylist[[i]] <- frequencydf
}
An explanation of my madness: the pastebin example goes into sampledf. For each unique date in the sample, I'm trying to get a word frequency. I'm then attempting to store that tabulated word frequency in a list (might not be the best approach, though). First, I subset by date, then strip whitespace, common English words, punctuation, and lowercase it all. I then do another pass of word removal for "game" since it's not too interesting but very common. To get the word frequency, I then pass it into a document term matrix and do a simple colSums(). Then I append the date for that table and try to store it in a list.
I'm not sure if my strategy is valid to begin with. Is there a simpler, better approach to this problem?
The commenters are correct in that there are better ways to set up a reproducible example. In addition, your question could be more specific about what you are trying to accomplish as an output. (I could not get your code to execute without error.)
However: You asked for a simpler, better approach. Here is what I think is both. It uses the quanteda text package and exploits the groups feature when creating the document-feature matrix. Then it performs some rankings on the "dfm" to get what you need in terms of daily term rankings.
Note that this is based on my having loaded your linked data using read.delim("sampledf.tsv", stringsAsFactors = FALSE).
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%M/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
## â great nice play go will can get ever first
## 2013-10-02 1 18 19 20 21 22 23 24 25 26
## 2013-10-04 3 1 2 4 5 6 7 8 9 10
## 2013-10-05 3 9 28 29 1 2 4 5 6 7
## 2013-10-06 7 4 8 10 11 30 31 32 33 34
## 2013-10-07 5 1 2 3 4 6 7 8 9 10
## 2013-10-09 12 42 43 1 2 3 4 5 6 7
## 2013-10-13 1 14 6 9 10 13 44 45 46 47
## 2013-10-16 2 3 84 85 1 4 5 6 7 8
## 2013-10-18 15 1 2 3 4 5 6 7 8 9
## 2013-10-19 3 86 1 2 4 5 6 7 8 9
## 2013-10-22 2 87 88 89 90 91 92 93 94 95
## 2013-10-23 13 98 99 100 101 102 103 104 105 106
## 2013-10-25 4 6 5 12 16 109 110 111 112 113
## 2013-10-27 8 4 6 15 17 124 125 126 127 128
## 2013-10-30 11 1 2 3 4 5 6 7 8 9
## 2014-10-01 7 16 139 1 2 3 4 5 6 8
## 2014-10-02 140 1 2 3 4 5 6 7 8 9
## 2014-10-03 141 142 143 1 2 3 4 5 6 7
## 2014-10-05 144 145 146 147 148 1 2 3 4 5
## 2014-10-06 17 149 150 1 2 3 4 5 6 7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
  todaysTopFeatures <- names(featureRanksByDate)
  names(todaysTopFeatures) <- x
  todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
## 2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1 â great go triple great play â go great nice year year
## 2 win nice will niple nice go created â nice play â give
## 3 year â â backflip play will wasnt great play â give good
## 4 give play can great go can money will go go good hard
## 5 good go get scope â get prizes can will will hard time
## 6 hard will ever ball will ever nice get can can time triple
## 7 time can first â can first piece ever get get triple niple
## 8 triple get fun nice get fun dead first ever ever niple backflip
## 9 niple ever great testical ever win play fun first first backflip scope
## 10 backflip first win play first year go win fun fun scope ball
## 2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1 scope scope great play great play will play
## 2 ball ball nice go nice go can go
## 3 testical testical play will play will get will
## 4 â great go can go can ever can
## 5 nice shot will get will get first get
## 6 great nice can ever can ever fun ever
## 7 shot head get â get first win first
## 8 head â ever first ever fun year fun
## 9 dancing dancing first fun first win give win
## 10 cow cow fun win fun year good year
BTW interesting spellings of niple and testical.
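To get from the daily ranks to an actual bump chart, one option is to reshape featureRanksByDate into long format and plot rank against date per word. This is only a sketch (it assumes ggplot2 and reshape2, and that the row names are parseable dates), not part of the quanteda workflow above:
library(reshape2)
library(ggplot2)
# long format: one row per (date, word, rank)
featureRanksByDate$date <- as.Date(rownames(featureRanksByDate))
ranksLong <- melt(featureRanksByDate, id.vars = "date",
                  variable.name = "word", value.name = "rank")
topRanks <- subset(ranksLong, rank <= 10)      # keep only each day's top 10
ggplot(topRanks, aes(x = date, y = rank, colour = word, group = word)) +
  geom_line() + geom_point() +
  scale_y_reverse(breaks = 1:10)               # rank 1 at the top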

R - Rank and Group [closed]

This is going to be a long shot but I'll try anyway. I want to build a centile (100 groups) or decile (10 groups) ranking based on the data.frame available.
In this example, I have a data frame with 891 records containing the following variables:
Unique_ID (numerical): unique member number
xbeta (numerical): given credit score, which allows ranking to be performed
Good (numerical): binary flag (0 or 1), an indicator of whether the member is delinquent
Bad (numerical): binary flag (0 or 1), the inverse of Good
I need your help to build the table below. By changing the number of groups, I'd be able to split it by either 10 or 100 using xbeta. With the top row being the total (identifiable via _TYPE_), I'd like to produce the following table:
r_xbeta is just the row number based on the # of groups
_TYPE_ identifies total vs. group rank
n = total count
count of the Good | Bad flags within the rank
xbeta stats: min | max | mean | median
GB_ODDS = GOOD / BAD for the rank
LN_GB_ODDS = log(GB_ODDS)
the rest should be self explanatory
Your help is much appreciated.
Jim learning R
r_xbeta _TYPE_ n GOOD BAD xbeta_min xbeta_max xbeta_mean xbeta_MEDIAN GB_ODDS LN_GB_ODDS Cummu_Good Cummu_Bad Cummu_Good_pct Cummu_Bad_pct
. 0 891 342 549 -4.42 3.63 -0.7 -1.09 0.62295 -0.47329 342 549 100% 100%
0 1 89 4 85 -4.42 -2.7 -3.6 -3.57 0.04706 -3.05636 4 85 1.20% 15%
1 1 89 12 77 -2.69 -2.37 -2.55 -2.54 0.15584 -1.8589 16 162 4.70% 30%
2 1 87 12 75 -2.35 -1.95 -2.16 -2.2 0.16 -1.83258 28 237 8.20% 43%
3 1 93 14 79 -1.95 -1.54 -1.75 -1.79 0.17722 -1.73039 42 316 12% 58%
4 1 88 10 78 -1.53 -1.09 -1.33 -1.33 0.12821 -2.05412 52 394 15% 72%
5 1 89 27 62 -1.03 -0.25 -0.67 -0.69 0.43548 -0.8313 79 456 23% 83%
6 1 89 44 45 -0.24 0.33 0.05 0.03 0.97778 -0.02247 123 501 36% 91%
7 1 89 54 35 0.37 1.07 0.66 0.63 1.54286 0.43364 177 536 52% 98%
8 1 88 77 11 1.08 2.15 1.56 1.5 7 1.94591 254 547 74% 100%
9 1 90 88 2 2.18 3.63 2.77 2.76 44 3.78419 342 549 100% 100%
A reproducible example would be great, i.e. something we can copy-paste to our terminal that demonstrates your problem. For example, here is the dataframe I'll work with:
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta = rnorm(891, sd = 10),
                           Good = round(runif(891) < 0.5),
                           Bad = round(runif(891) < 0.5))
head(my_dataframe)
# Unique_ID xbeta Good Bad
# 1 1 -6.264538 1 0
# 2 2 1.836433 1 0
# 3 3 -8.356286 0 1
# 4 4 15.952808 1 1
# 5 5 3.295078 1 0
# 6 6 -8.204684 1 1
(The particular numbers don't matter to your question, which is why I made up random ones.)
The idea is to:
work out which quantile each row belongs to: see ?quantile. You can specify which quantiles you want (I've shown deciles)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# -30.0804860 -13.3880074 -8.7326454 -5.1121923 -3.0097613 -0.4493361 2.3680366 5.3732613 8.7867326 13.2425863 38.1027668
This gives the quantile cutoffs; if you use cut on these you can add a variable that says which quantile each row is in (?cut):
my_dataframe$quantile <- cut(my_dataframe$xbeta,
                             quantile(my_dataframe$xbeta, seq(0, 1, by = .1)))
Have a look at head(my_dataframe) to see what this did. The quantile column is a factor.
split up your dataframe by quantile, and calculate the stats for each. You can use the plyr, dplyr or data.table packages for this; I recommend one of the first two as you are new to R. If you need to do massive merges and calculations on huge tables efficiently (thousands of rows) use data.table, but the learning curve is much steeper. I will show you plyr purely because it's the one I find easiest. dplyr is very similar, but just has a different syntax.
# The idea: `ddply(my_dataframe, .(quantile), FUNCTION)` applies FUNCTION
# to each subset of `my_dataframe`, where we split it up into unique
# `quantile`s.
# For us, `FUNCTION` is `summarize`, which calculates summary stats
# on each subset of the dataframe.
# The arguments after `summarize` are the new summary columns we
# wish to calculate.
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
               n = length(Unique_ID), GOOD = sum(Good), BAD = sum(Bad),
               xbeta_min = min(xbeta), xbeta_max = max(xbeta),
               GB_ODDS = GOOD/BAD) # you can calculate the rest yourself,
                                   # "the rest should be self explanatory".
> head(output, 3)
quantile n GOOD BAD xbeta_min xbeta_max GB_ODDS
1 (-30.1,-13.4] 89 41 39 -29.397737 -13.388007 1.0512821
2 (-13.4,-8.73] 89 49 45 -13.353714 -8.732645 1.0888889
3 (-8.73,-5.11] 89 46 48 -8.667335 -5.112192 0.9583333
Calculate the other columns. See e.g. ?cumsum for cumulative sums, e.g. output$cummu_good <- cumsum(output$GOOD).
Add the 'total' row. You should be able to do this. You can add an extra row to output using rbind.
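For the total row, something like this should work (a sketch; the column names match the ddply output above, and quantile is converted to character so a "TOTAL" label can be added):
output$quantile <- as.character(output$quantile)   # so we can add a non-factor label
total_row <- data.frame(quantile  = "TOTAL",
                        n         = sum(output$n),
                        GOOD      = sum(output$GOOD),
                        BAD       = sum(output$BAD),
                        xbeta_min = min(my_dataframe$xbeta),
                        xbeta_max = max(my_dataframe$xbeta),
                        GB_ODDS   = sum(output$GOOD) / sum(output$BAD),
                        stringsAsFactors = FALSE)
output <- rbind(total_row, output)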
Here is the final version of my script with math coffee's guidance. I had to use .bincode instead of the suggested cut due to a "'breaks' are not unique" error.
Thanks everyone.
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta = rnorm(891, sd = 10),
                           Good = round(runif(891) < 0.5),
                           Bad = round(runif(891) < 0.5))
head(my_dataframe)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
my_dataframe$quantile = .bincode(my_dataframe$xbeta, quantile(my_dataframe$xbeta, seq(0, 1, by = .1)))
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
               n = length(Unique_ID), GOOD = sum(Good), BAD = sum(Bad),
               xbeta_min = min(xbeta), xbeta_max = max(xbeta), xbeta_median = median(xbeta), xbeta_mean = mean(xbeta),
               GB_ODDS = GOOD/BAD, LN_GB_ODDS = log(GOOD/BAD))
output$cummu_good = cumsum(output$GOOD)
output$cummu_bad = cumsum(output$BAD)
output$cummu_n = cumsum(output$n)
output$sum_good = sum(output$GOOD)
output$sum_bad = sum(output$BAD)
output$cummu_good_pct = cumsum(output$GOOD/output$sum_good)
output$cummu_bad_pct = cumsum(output$BAD/output$sum_bad)
output[["sum_good"]]=NULL
output[["sum_bad"]]=NULL
output

Aggregate columns in data.table for descriptive statistics

I am looking at a student data set at the individual student level.
What I want to do is some descriptive analysis at the faculty/degree level.
That is, some students are doing double degrees (e.g. a Bachelor of IT and a Bachelor of Science), so one student can generate two degree records.
My data looks something like the below. The Faculty assignments (whether FAC1 or FAC2) are arbitrary.
studid FAC1 FAC2 SUCCESS SEX AVE_MARK
1 IT ARTS 0 Male 65
2 SCIENCE 1 Male 35
3 LAW 0 Male 98
4 IT SCIENCE 0 Female 55
5 COMMERCE IT 0 Female 20
6 COMMERCE IT 1 Male 80
This was generated with
library(data.table)
students <- data.table(studid = c(1:6),
                       FAC1 = c("IT", "SCIENCE", "LAW", "IT", "COMMERCE", "COMMERCE"),
                       FAC2 = c("ARTS", "", "", "SCIENCE", "IT", "IT"),
                       SUCCESS = c(0, 1, 0, 0, 0, 1),
                       SEX = c("Male", "Male", "Male", "Female", "Female", "Male"),
                       AVE_MARK = c(65, 35, 98, 55, 20, 80))
How would I go about producing something like this (made-up figures below), creating a FACULTY column that incorporates both the FAC1 and FAC2 columns? I have been trying to use lapply across FAC1 and FAC2 but keep hitting dead ends (e.g. students[, lapply(.SD, mean), by=agg.by, .SDcols=c('FAC1', 'FAC2')]).
FACULTY MEAN_SUCCESS AVE_MARK
IT 0.65 65
SCIENCE 1 50
LAW 0.76 50
ARTS 0.55 50
COMMERCE 0.40 10
Any assistance would be greatly appreciated.
This seems like what you are looking for.
library(reshape2)
DT <- melt(students,measure.vars=c("FAC1","FAC2"),value.name="FACULTY")[nchar(FACULTY)>0]
DT[,list(mean_success=mean(SUCCESS),ave_mark=mean(AVE_MARK)),by=FACULTY]
# FACULTY mean_success ave_mark
# 1: IT 0.25 55
# 2: SCIENCE 0.50 45
# 3: LAW 0.00 98
# 4: COMMERCE 0.50 50
# 5: ARTS 0.00 65
So this uses the melt(...) function from the reshape2 package to collapse the two faculty columns, replicating all the other columns. Unfortunately, this results in some rows with a blank FACULTY, so we have to get rid of those using [nchar(FACULTY)>0]. Then it's simple to aggregate based on the (new) FACULTY column.
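As a side note (an assumption about your setup, not part of the original answer): in more recent versions of data.table, melt is built in, so reshape2 isn't required and you can filter out the blank FACULTY directly:
library(data.table)
# data.table provides its own melt method, so reshape2 is not needed here
DT <- melt(students, measure.vars = c("FAC1", "FAC2"),
           value.name = "FACULTY")[FACULTY != ""]
DT[, .(mean_success = mean(SUCCESS), ave_mark = mean(AVE_MARK)), by = FACULTY]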
