code issue with developing a sentiment analysis scoring model - r

I am trying to do some sentiment analysis on Twitter data. I have a dictionary (afinn_list) that looks something like this:
good 5
bad -5
awesome 6
I have been able to generate a vector that contains the position of each matched word in the dictionary. Now I want to generate a score variable that contains the corresponding score for each of these matches. I am having a hard time coming up with the for-loop logic.
class(afinn_list)
[1] "data.frame"
vPosMatches <- match(words, afinn_list$word)
vPosMatches
[1] NA NA NA NA 1104 NA NA NA NA NA NA NA NA NA NA NA NA 1836 NA
I am sorry if the question is too naive. I am just trying to learn sentiment analysis using R.

Sentiment analysis is a complex task. Assuming you have cleaned up your Twitter data and stored it as one word per cell, I guess what you are lacking now is a way to score the cleaned-up words against your scoring "dictionary", afinn_list.
Assuming that your afinn_list looks like this
dictionary <- data.frame(grade=c('bad','not good', 'ok', 'good','very good'), score=1:5)
# grade score
1 bad 1
2 not good 2
3 ok 3
4 good 4
5 very good 5
and your mock_data (the cleaned-up data from Twitter) is
mock_data<-data.frame(data=rep(x=c('good','bad','rubbish','hello','very good'),10))
# data
1 good
2 bad
3 rubbish
4 hello
5 very good
6 good
You can then merge the two data frames. In SQL terms this is a left outer join. In R it is implemented by the function merge, given the column(s) to join on and all.x=TRUE. Because the key column is called data in mock_data and grade in dictionary, they are passed as by.x and by.y.
Hence your code will look like this
merge(mock_data, dictionary, by.x='data', by.y='grade', all.x=TRUE)
I hope this answers your question.
Cheers
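Since match() already returns the row positions (vPosMatches in the question), the scores can also be looked up by plain indexing, with no loop and no merge. This is only a minimal sketch: the three-word dictionary and the words vector are made up, and the column name score is an assumption (the question only shows afinn_list$word).

```r
# Hypothetical miniature dictionary; the column name `score` is an assumption
afinn_list <- data.frame(word  = c("good", "bad", "awesome"),
                         score = c(5, -5, 6),
                         stringsAsFactors = FALSE)

# Hypothetical cleaned-up tweet, one word per element
words <- c("this", "movie", "was", "awesome", "not", "bad")

# Position of each word in the dictionary (NA when the word is not listed)
vPosMatches <- match(words, afinn_list$word)

# Indexing with NA yields NA, so unmatched words get an NA score
vScores <- afinn_list$score[vPosMatches]

# Overall score of the tweet, ignoring unmatched words
total_score <- sum(vScores, na.rm = TRUE)
```

Unlike merge, this keeps the scores in the same order as words.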

Related

Sum-product in R for specific conditions

I'm looking to do a SUMPRODUCT in R as we do in Excel.
It's a little challenging because I have to apply some logical conditions at the same time.
The Excel formula looks like this
SUMPRODUCT(--(ID=A2),--(INDIRECT(A1)<>"-"),INDIRECT(B1),C1)
Here ID, A1 and B1 are named ranges on another sheet of the same workbook.
ID $ Quantity
1 23 34
2 4 55
3 NA 6
4 6 45
5 7 NA
6 8 NA
I want logical operators because some values are NA and I don't want to take them into consideration. I want this process to be automated without much manual work.
I've done this up to some extent using dplyr, but it's not giving satisfactory results.
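One way the conditional sum-product could be sketched in base R is shown below. It rests on assumptions: the table is read as three columns (ID, $, Quantity), the $ column is renamed price here, and sumproduct_if is a hypothetical helper name, not an existing function.

```r
# Data from the question; `price` stands in for the `$` column (assumed naming)
dat <- data.frame(ID       = 1:6,
                  price    = c(23, 4, NA, 6, 7, 8),
                  Quantity = c(34, 55, 6, 45, NA, NA))

# Hypothetical helper: sum of x * y over rows where `cond` holds
# and neither x nor y is NA (mirrors the --(...) terms in SUMPRODUCT)
sumproduct_if <- function(x, y, cond = rep(TRUE, length(x))) {
  keep <- cond & !is.na(x) & !is.na(y)
  sum(x[keep] * y[keep])
}

# All rows with complete data
total_all <- sumproduct_if(dat$price, dat$Quantity)

# Only rows for one ID, like --(ID=A2) in the Excel formula
total_id2 <- sumproduct_if(dat$price, dat$Quantity, dat$ID == 2)
```

The rows with an NA in either column are dropped before multiplying, so they do not turn the whole sum into NA.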

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not to my current knowledge of R, which is growing but still in its infancy). Is there another command which I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
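For a self-contained illustration of split(), here is a small sketch. The reefdata below is a cut-down, hand-typed stand-in for the real data (only three of the columns, values partly made up), so treat it as hypothetical.

```r
# Hypothetical, reduced version of the dive data
reefdata <- data.frame(
  site    = c("alice", "alice", "steps", "steps", "andrea1"),
  weather = c("rain", "over", "clear", NA, "clear"),
  rate    = c(9, 9, 9, 9, 5),
  stringsAsFactors = FALSE
)

# One data frame per site, stored in a named list
splitdat <- split(reefdata, reefdata$site)

# Access a single site by name
alice <- splitdat[["alice"]]

# Apply something to every site at once, e.g. count rows per site
rows_per_site <- sapply(splitdat, nrow)
```

Keeping the subsets in one list means that any per-site calculation can be done with lapply or sapply instead of handling 100 separate objects.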
I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names (as.character() also copes with a factor column)
sites <- as.character(unique(reefdata$site))
# create empty list to take dive data per site
dives <- list()
# collect data per site into the list
for (i in seq_along(sites)) {
  # subset
  dive <- reefdata[reefdata$site == sites[i], ]
  # add resulting data.frame to the list
  dives[[i]] <- dive
  # name the list element
  names(dives)[i] <- sites[i]
}

Replacing NAs with correlated values in rows

Hey all, I have a data frame with 5 samples: A, B, C, D, E. What I want to do is first find a miRNA that is highly correlated overall with the miRNA that has the missing value, and then fill in a value derived from that miRNA. For example:
miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10
==> replace the missing value with 4, derived from the second miRNA.
This is what I want to do for my data frame in R.
Any help would be really appreciated :)
                                          A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p         NA 13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p   15.70536 52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p       NA 21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c     9.58696 44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p 18.06865 28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p         NA 10.77471  8.039662        NA       NA
Have you had a look at this answer (especially Akrun's shortcut from zoo)? I appreciate it's not quite what you want, but it might give some leads. It fills an NA with the mean of its neighbours in the row, so for 1 2 3 NA 5 it would suggest 4 (the average of 3 and 5).
Replacing NA's in R numeric vectors with values calculated from neighbours
Trying to find a correlation between pairs with just 4 data points, as one is missing, is a challenge.
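For the toy example in the question (1 2 3 NA 5 against 2 4 6 8 10), the idea of filling from the most correlated row could be sketched roughly as follows. This is an assumption-laden sketch, not an established method: fill_from_best_partner is a hypothetical helper, it picks the partner by absolute Pearson correlation on the samples both rows share, and it fills by simple linear regression. With so few samples per row the correlations are very unreliable, as noted above.

```r
# Toy matrix from the question: rows are miRNAs, columns are samples
mirna <- rbind("miRNA-1" = c(1, 2, 3, NA, 5),
               "miRNA-2" = c(2, 4, 6, 8, 10))
colnames(mirna) <- c("A", "B", "C", "D", "E")

# Hypothetical helper: fill NAs in row `target` from its most correlated row
fill_from_best_partner <- function(m, target) {
  # Correlation of the target row with every row, using shared non-NA samples
  cors <- apply(m, 1, function(r) {
    cor(m[target, ], r, use = "pairwise.complete.obs")
  })
  cors[target] <- NA                       # never pick the row itself
  partner <- names(which.max(abs(cors)))   # strongest absolute correlation

  # Regress target on partner; lm() drops the incomplete samples itself
  fit <- lm(y ~ x, data = data.frame(y = m[target, ], x = m[partner, ]))

  # Predict only where the target is missing and the partner is observed
  missing <- is.na(m[target, ]) & !is.na(m[partner, ])
  m[target, missing] <- predict(fit,
                                newdata = data.frame(x = m[partner, missing]))
  m
}

filled <- fill_from_best_partner(mirna, "miRNA-1")
```

Here the fitted line is miRNA-1 = 0.5 * miRNA-2, so the missing sample D is filled with a value of 4, matching the example in the question.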

merge rows by subject number

What I am trying to do is merge my dataframe by rows. For instance, let's say my data.frame is called data and has 4 columns: Subject contains 5s and 6s, Phase contains Post-Lure and Pre-Lure, Type contains Visual and Auditory, and Memory contains a list of scores. Ex:
Subject Phase Type Memory
1 5 Post-Lure Visual 0.80000000
2 5 Post-Lure Auditory 0.70666667
3 5 Pre-Lure Visual 0.40000000
4 5 Pre-Lure Auditory 0.61333333
5 6 Post-Lure Visual 0.80000000
6 6 Post-Lure Auditory 0.54666667
As you can see from the data above, the subject is repeated (subject 5 is the same person, but the phase and/or type are now different). Thus, I am looking for code that will put all of the data for each subject on the same row. Hence, the memory scores, and the different types and phases each subject was exposed to, will just become additional columns on the same row. I feel aggregate may do the trick, but is it possible to use it without applying a function to each of the numbers? Any help would be greatly appreciated. Thank you.
As mentioned in the comment, you need to add an "indicator" variable of some sort (for example, how many "times" there are for each subject).
That can be done with ave and seq_along:
mydf$time <- with(mydf, ave(Subject, Subject, FUN=seq_along))
Next, you can use reshape() to go from "long" to "wide".
reshape(mydf, direction = "wide",
idvar="Subject", timevar="time")
# Subject Phase.1 Type.1 Memory.1 Phase.2 Type.2 Memory.2
# 1 5 Post-Lure Visual 0.8 Post-Lure Auditory 0.7066667
# 5 6 Post-Lure Visual 0.8 Post-Lure Auditory 0.5466667
# Phase.3 Type.3 Memory.3 Phase.4 Type.4 Memory.4
# 1 Pre-Lure Visual 0.4 Pre-Lure Auditory 0.6133333
# 5 <NA> <NA> NA <NA> <NA> NA
If you wanted to use the "reshape2" or "tidyr" packages, you would first have to get the data into a "long" form using melt or gather, but note that in the process, your variable types would be converted since a single column would be containing several data types.
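To make the ave() plus reshape() steps above reproducible, here is a self-contained sketch. The data frame mydf is typed in by hand from the six rows shown in the question; everything else is base R.

```r
# The six rows from the question, entered by hand
mydf <- data.frame(
  Subject = c(5, 5, 5, 5, 6, 6),
  Phase   = c("Post-Lure", "Post-Lure", "Pre-Lure", "Pre-Lure",
              "Post-Lure", "Post-Lure"),
  Type    = c("Visual", "Auditory", "Visual", "Auditory",
              "Visual", "Auditory"),
  Memory  = c(0.80000000, 0.70666667, 0.40000000, 0.61333333,
              0.80000000, 0.54666667),
  stringsAsFactors = FALSE
)

# Running counter within each subject: 1, 2, 3, 4 for subject 5; 1, 2 for subject 6
mydf$time <- with(mydf, ave(Subject, Subject, FUN = seq_along))

# One row per subject; Phase, Type and Memory get a .1, .2, ... suffix per time
wide <- reshape(mydf, direction = "wide", idvar = "Subject", timevar = "time")
```

Subject 6 has only two rows, so its .3 and .4 columns are filled with NA.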
Do you just want to reshape your data? The question isn't clear. Let's call your dataframe df. Then
library(reshape2)
dcast(df, Subject ~ Phase + Type, value.var = "Memory")
will produce
Subject Post-Lure_Auditory Post-Lure_Visual Pre-Lure_Auditory Pre-Lure_Visual
1 5 0.7066667 0.8 0.6133333 0.4
2 6 0.5466667 0.8 NA NA

Populate a column with forecasts of panel data using data.table in R

I have panel data with "entity" and "year". I have a column "x" with values that I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I store the value obtained from a forecast based on the previous 5 years. If fewer than 5 previous values are available, xp=NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1),h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
library(data.table)
library(zoo)       # rollapply
library(forecast)  # auto.arima, forecast
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[,xp:=rollapply(.SD$x,5,timeseries,align="right",fill=NA),by="entity"]
with:
timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit,h=1))[1,1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply combined with specificities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
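One plausible cause, which has not been verified against the real dataset, is an entity with fewer than 5 rows, so that the window is wider than the group. A way to sidestep that is to do the rolling step by hand with an explicit length check. The sketch below makes several assumptions: roll_forecast is a hypothetical helper, mean() is only a stand-in for the timeseries() function above so that the example runs without the forecast package, and the grouping is done with base ave() on made-up data. It also uses only the 5 years before the current one, as in the xp column of the extract.

```r
# Hypothetical helper: for position i, apply f to the `width` values before i;
# positions without `width` earlier values stay NA
roll_forecast <- function(x, width, f) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > width) {
    for (i in (width + 1):n) {
      out[i] <- f(x[(i - width):(i - 1)])
    }
  }
  out
}

# Made-up panel: entity 1 has 7 years, entity 3 has only 3 (too short)
mydata <- data.frame(
  entity = c(rep(1, 7), rep(3, 3)),
  year   = c(1980:1986, 2000:2002),
  x      = c(21, 23, 32, 36, 38, 45, 50, 5, 6, 7)
)
mydata <- mydata[order(mydata$entity, mydata$year), ]

# mean() is a placeholder for the real forecast function (e.g. timeseries)
mydata$xp <- ave(mydata$x, mydata$entity,
                 FUN = function(v) roll_forecast(v, 5, mean))
```

With data.table the same helper could be used as DT[, xp := roll_forecast(x, 5, timeseries), by = entity], where short entities simply get NA instead of raising an error.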

Resources