I suspect this is a fairly complex problem that may not have a simple solution and might require machine learning or other advanced techniques to solve.
First, to explain the issue at hand: say we have a runner who takes part in a number of outdoor races where the elements (i.e. wind) affect the athlete's speed. If we know the runner's baseline speed, it is easy to determine the percentage effect that the elements have had in each race, for example:
Name Baseline Race1 Race2 Race3
1 Runner 100 102 98 106
The contributing element_factors for Race1, Race2 and Race3 are:
[1] 1.02 0.98 1.06
In this example we can see that the runner in Race 1 has had a tail wind which has increased his baseline speed by 2%, etc.
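In R this is just element-wise division of the race results by the baseline:
baseline <- 100
races <- c(Race1 = 102, Race2 = 98, Race3 = 106)
races / baseline
# Race1 Race2 Race3
#  1.02  0.98  1.06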
However, in the real world we don't necessarily know a runner's baseline speed, because all we have to go on are their race results, and we don't necessarily know how the elements have affected that baseline.
Take, for example, the race results in the following data frame:
df<-data.frame(Name = c("Runner 1","Runner 2","Runner 3","Runner 4","Runner 5"),
               Baseline = c("unknown","unknown","unknown","unknown","unknown"),
               Race1 = c(101, NA, 80.8, 111.1, 95.95),
               Race2 = c(102, 91.8, NA, 112.2, NA),
               Race3 = c(95, 85.5, 76, NA, 90.25),
               Race4 = c(NA, 95.4, 74.8, 116.6, 100.7))
Name Baseline Race1 Race2 Race3 Race4
1 Runner 1 unknown 101 102 95 NA
2 Runner 2 unknown NA 91.8 85.5 95.4
3 Runner 3 unknown 80.8 NA 76 74.8
4 Runner 4 unknown 111.1 112.2 NA 116.6
5 Runner 5 unknown 95.95 NA 90.25 100.7
What I want to be able to do is estimate (approximately) from this data frame each runner's baseline speed and the factor for each race. The solutions in this case would be:
Baseline<-c(100,90,80,110,95)
[1] 100  90  80 110  95
element_factors<-c(1.01,1.02,0.95,1.06)
[1] 1.01 1.02 0.95 1.06
Setting each runner's baseline to their average race speed is overly simplistic, since some runners only race in events with a tail wind, so their true baseline will be lower than all of their race results.
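One possible sketch of an approach (my suggestion, not something from the post): on the log scale the multiplicative model result = baseline * factor becomes additive, so a two-way linear model can estimate both sets of effects from the observed cells only, with NAs simply dropped. Note that the decomposition is only identified up to a constant (scaling all baselines up and all factors down by the same amount fits equally well), so one race factor ends up acting as the reference.
races <- c("Race1", "Race2", "Race3", "Race4")
long <- data.frame(Name = rep(df$Name, times = length(races)),
                   Race = rep(races, each = nrow(df)),
                   Result = as.numeric(unlist(df[races])))
fit <- lm(log(Result) ~ 0 + Name + Race, data = long)
# The Name coefficients absorb Race1's factor, and the Race coefficients are
# relative to Race1; exponentiating recovers the multiplicative scale.
exp(coef(fit))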
I would like to be able to acquire overall estimates from GAMs of the type provided by emmeans, in order to plot these fitted values and their confidence intervals, and then do some subsequent analysis.
I am working on a similar dataset to the one described here: https://rpubs.com/bbolker/ratgrowthcurves . I see that at the end of the document the author notes that the best way to obtain overall estimates from the model is not resolved, but that one option might be emmeans. So I am posting an example here to see whether people think this approach is correct, or whether they could suggest a better package and method.
I will use the 'Orange' dataset as an example, but to make it fit my question let's first add a 'variety' factor:
data(Orange)
temp <- character(35)
temp[c(8:14, 22:28)] <- "Tall"
temp[c(1:7, 15:21, 29:35)] <- "Short"
Orange$variety <- as.factor(temp)
head(Orange)
# Tree age circumference variety
#1 1 118 30 Short
#2 1 484 58 Short
#3 1 664 87 Short
#4 1 1004 115 Short
#5 1 1231 120 Short
#6 1 1372 142 Short
Create a GAM with 'variety' as a factor and 'tree' as a random effect:
library(mgcv)
ex.mod<-gam(circumference~s(age,k=7,by=variety)+s(Tree,bs="re")+variety,method="REML",data=Orange)
The ggeffects package seems to provide nice functionality for a plot via emmeans:
library(ggeffects)
library(emmeans)
plotme<-ggemmeans(ex.mod,terms=c("age","variety"))
plot(plotme)
Next I can extract the estimated marginal means themselves, for example over a range of ages:
emmeans(ex.mod,specs=c("age","variety"),at=list(age=seq(from=300,to=1500,by=500)))
# age variety emmean SE df lower.CL upper.CL
# 300 Short 44.8 4.37 27.1 35.9 53.8
# 800 Short 90.4 3.27 27.1 83.7 97.1
# 1300 Short 136.0 3.68 27.1 128.5 143.6
# 300 Tall 51.5 6.24 27.1 38.7 64.3
# 800 Tall 127.3 6.09 27.1 114.8 139.8
# 1300 Tall 189.8 5.56 27.1 178.4 201.2
#Results are averaged over the levels of: Tree
#Confidence level used: 0.95
My questions are:
1) If I am interested in using this GAM to, for example, compare the estimated mean difference between orange trees of variety 'Tall' and variety 'Short' at 800 days of age, is it appropriate to base this pairwise comparison on the emmeans? (See the sketch after this list.)
2) If several studies have been done on Orange tree growth in different places, and I am interested in meta-analysing the mean difference in circumference between 'Tall' and 'Short' variety trees at certain ages, is it appropriate to use mean differences and variance in the emmeans for meta-analysis?
(emmeans provides the SE, I think this would need to be converted to standard deviation...)
3) ... or does someone have a better suggestion for either of the above?
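For what it's worth, here is a minimal sketch of the kind of pairwise comparison question 1 asks about (my own suggestion, not a confirmed answer): restrict the reference grid to age = 800 and take the pairwise contrast between varieties.
emm800 <- emmeans(ex.mod, specs = "variety", at = list(age = 800))
pairs(emm800)  # estimated mean difference between 'Short' and 'Tall' at age 800, with SE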
I am new to coding, as well as to posting on a forum, but I will do my best to explain the problem and give enough background so that you're able to help me work through it. I have done a lot of searching for solutions to similar errors, but the code that produces them and the format of the data behind them are all very different from mine.
I am working with biological data covering various growth categories, but all I am interested in is length (SCL in my code) and age (Age in my code). I have many length and age estimates for each individual through time, and I am fitting a linear nlme model to the juvenile ages and a von Bertalanffy curve to the mature ages. My juvenile model works just fine, and from it I extracted h (the slope of the line) and t (the x-intercept). I now need to use those parameters, as well as T (known age at maturity), to fit the mature stage. The mature model will estimate K (my only unknown). I have included a subset of my data for one individual (ID50): the mature years only, with the h and t from its juvenile fit appended in the rightmost columns.
Subset of my data:
This didn't format very well but I'm not sure how else to display it
Grouped Data: SCL ~ Age | ID
ID SCL Age Sex Location MeanSCL Growth Year Status T h t
50 86.8 27.75 Female VA 86.8 0.2 1994 Mature 27.75 1.807394 -19.83368
50 86.9 28.75 Female VA 87.1 0.4 1995 Mature 27.75 1.807394 -19.83368
50 87.3 29.75 Female VA 87.5 0.5 1996 Mature 27.75 1.807394 -19.83368
50 87.8 30.75 Female VA 88 0.4 1997 Mature 27.75 1.807394 -19.83368
50 88.1 31.75 Female VA 88.1 0 1998 Mature 27.75 1.807394 -19.83368
50 88.1 32.75 Female VA 88.2 0 1999 Mature 27.75 1.807394 -19.83368
50 88.2 33.75 Female VA 88.3 0.2 2000 Mature 27.75 1.807394 -19.83368
50 88.4 34.75 Female VA 88.4 0.1 2001 Mature 27.75 1.807394 -19.83368
50 88.4 35.75 Female VA 88.4 0 2002 Mature 27.75 1.807394 -19.83368
50 88.5 36.75 Female VA 88.5 0 2003 Mature 27.75 1.807394 -19.83368
This is the growth function:
vbBiphasic <- function(Age, h, T, t, K) {
  # Biphasic von Bertalanffy growth: predicted length (SCL) at a given Age,
  # using juvenile slope h, age at maturity T, x-intercept t and growth rate K.
  (h / (exp(K) - 1)) * (1 - exp(K * ((T + log(1 - (exp(K) - 1) * (T - t)) / K) - Age)))
}
This is the original growth model that SHOULD have fit:
ID50 refers to my subsetted dataset with only individual 50
VB_mat <- nlme(SCL~vbBiphasic(Age,h,T,t,K),
data = ID50,
fixed = list(K~1),
random = K~1,
start = list(fixed=c(K~.01))
)
However this model produces the error:
Error in pars[, nm] : incorrect number of dimensions
Which tells me (I think) that it's trying to estimate a different number of parameters than I have. Originally I was fitting it to all mature individuals (but for the sake of simplification I'm now trying to fit just one). Here are all the variations of the model code I have tried; ALL of them produced the same error:
inputting averaged values of (Age, h, T, t, K) for the whole population instead of the variables
using a subset of 5 individuals, with both (Age, h, T, t, K) and the averaged values of those variables for those individuals
using 5 different individuals separately, with both (Age, h, T, t, K) and their actual values for those variables (all run individually, i.e. 10 different pieces of code, just in case some worked and others didn't... but none did)
telling the model to estimate all parameters, not just K
eliminating all parameters except K
turning all values into vectors (that's what one forum post with a similar error suggested)
Most of these were in an effort to change the number of parameters that R thought it needed to estimate, however none have worked for me.
I'm no expert on nlme and often have similar problems when fitting models, especially when you cannot use nlsList to get started. My guess is that you have 4 parameters in your function (h, T, t, K), but you are only estimating one of them, both as a fixed effect and with a random effect. I believe this constrains the other parameters to zero, which in effect eliminates them from the model (even though they still appear in it). Usually you include all the parameters as fixed effects, and then decide how many of them should also have a random effect. So I would include all 4 in the fixed argument and the start argument. Since you have 4 parameters, each one has to be fixed, random, or both; otherwise, how can it be in the model?
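A minimal sketch of that suggestion (untested, and an assumption on my part, using the ID50 data from the question): declare all four parameters as fixed effects and keep the random effect on K only. Note that h, T and t then become estimated parameters rather than the known per-individual values stored in the data columns of the same names; the starting values below are simply taken from the data shown above.
VB_mat <- nlme(SCL ~ vbBiphasic(Age, h, T, t, K),
               data = ID50,
               fixed = h + T + t + K ~ 1,
               random = K ~ 1 | ID,
               start = c(h = 1.81, T = 27.75, t = -19.83, K = 0.01))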
I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-sectional sample of a population of some kind of wild fauna), so I am essentially leaving out variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass = c(1, 2, 3, 4, 5, 6)
SampleSize = c(100, 99, 87, 46, 32, 19)
for(i in 1:6){
  PropSurv = c(SampleSize/100)
}
LifeTab1 = data.frame(cbind(AgeClass, SampleSize, PropSurv))
Which gave me this:
  AgeClass SampleSize PropSurv
1        1        100     1.00
2        2         99     0.99
3        3         87     0.87
4        4         46     0.46
5        5         32     0.32
6        6         19     0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on), so that it looks like this:
  AgeClass SampleSize PropSurv DeathInt
1        1        100     1.00        1
2        2         99     0.99       12
3        3         87     0.87       41
4        4         46     0.46       14
5        5         32     0.32       13
6        6         19     0.19       NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a numeric vector, you can calculate the differences between consecutive elements with the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(SampleSize), NA)
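To illustrate with the numbers from the question:
SampleSize <- c(100, 99, 87, 46, 32, 19)
diff(SampleSize)
# [1]  -1 -12 -41 -14 -13
c(-diff(SampleSize), NA)
# [1]  1 12 41 14 13 NA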
This is going to be a long shot, but I'll try anyway. I want to build a centile (100 groups) or decile (10 groups) ranking based on the data.frame available.
In this example I have a data.frame with 891 records, containing the following variables:
Unique_ID (numerical), i.e. unique member number
xbeta (numerical), a given credit score (which allows ranking to be performed)
Good (numerical), a binary flag (0 or 1) indicating whether the member is delinquent
Bad (numerical), a binary flag (0 or 1), the inverse of Good
I need your help to build the equivalent of the table below. By changing the number of groups, I'd be able to split the data into either 10 or 100 groups using xbeta. With the top row being the total (identifiable via TYPE), I'd like to produce the following table (see the table below for more details):
r_xbeta is just row number based on the # of groups.
TYPE to identify total or group rank
n = Total Count
count of Good | Bad flag within the rank
xbeta stats, min | max | mean | median
GB_Odds = GOOD / BAD for the rank
LN_GB_ODDs = Log(GB_Odds)
the rest should be self-explanatory
Your help is much appreciated.
Jim learning R
r_xbeta _TYPE_ n GOOD BAD xbeta_min xbeta_max xbeta_mean xbeta_MEDIAN GB_ODDS LN_GB_ODDS Cummu_Good Cummu_Bad Cummu_Good_pct Cummu_Bad_pct
. 0 891 342 549 -4.42 3.63 -0.7 -1.09 0.62295 -0.47329 342 549 100% 100%
0 1 89 4 85 -4.42 -2.7 -3.6 -3.57 0.04706 -3.05636 4 85 1.20% 15%
1 1 89 12 77 -2.69 -2.37 -2.55 -2.54 0.15584 -1.8589 16 162 4.70% 30%
2 1 87 12 75 -2.35 -1.95 -2.16 -2.2 0.16 -1.83258 28 237 8.20% 43%
3 1 93 14 79 -1.95 -1.54 -1.75 -1.79 0.17722 -1.73039 42 316 12% 58%
4 1 88 10 78 -1.53 -1.09 -1.33 -1.33 0.12821 -2.05412 52 394 15% 72%
5 1 89 27 62 -1.03 -0.25 -0.67 -0.69 0.43548 -0.8313 79 456 23% 83%
6 1 89 44 45 -0.24 0.33 0.05 0.03 0.97778 -0.02247 123 501 36% 91%
7 1 89 54 35 0.37 1.07 0.66 0.63 1.54286 0.43364 177 536 52% 98%
8 1 88 77 11 1.08 2.15 1.56 1.5 7 1.94591 254 547 74% 100%
9 1 90 88 2 2.18 3.63 2.77 2.76 44 3.78419 342 549 100% 100%
A reproducible example would be great, i.e. something we can copy-paste to our terminal that demonstrates your problem. For example, here is the dataframe I'll work with:
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
xbeta=rnorm(891, sd=10),
Good=round(runif(891) < 0.5),
Bad=round(runif(891) < 0.5))
head(my_dataframe)
# Unique_ID xbeta Good Bad
# 1 1 -6.264538 1 0
# 2 2 1.836433 1 0
# 3 3 -8.356286 0 1
# 4 4 15.952808 1 1
# 5 5 3.295078 1 0
# 6 6 -8.204684 1 1
(The particular numbers don't matter to your question which is why I made up random ones).
The idea is to:
work out which quantile each row belongs to: see ?quantile. You can specify which quantiles you want (I've shown deciles)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# -30.0804860 -13.3880074 -8.7326454 -5.1121923 -3.0097613 -0.4493361 2.3680366 5.3732613 8.7867326 13.2425863 38.1027668
This gives the quantile cutoffs; if you use cut on these you can add a variable that says which quantile each row is in (?cut):
my_dataframe$quantile <- cut(my_dataframe$xbeta,
quantile(my_dataframe$xbeta, seq(0, 1, by=.1)))
Have a look at head(my_dataframe) to see what this did. The quantile column is a factor.
split up your dataframe by quantile, and calculate the stats for each. You can use the plyr, dplyr or data.table packages for this; I recommend one of the first two as you are new to R. If you need to do massive merges and calculations on huge tables efficiently (thousands of rows) use data.table, but the learning curve is much steeper. I will show you plyr purely because it's the one I find easiest. dplyr is very similar, but just has a different syntax.
# The idea: `ddply(my_dataframe, .(quantile), FUNCTION)` applies FUNCTION
# to each subset of `my_dataframe`, where we split it up into unique
# `quantile`s.
# For us, `FUNCTION` is `summarize`, which calculates summary stats
# on each subset of the dataframe.
# The arguments after `summarize` are the new summary columns we
# wish to calculate.
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
xbeta_min=min(xbeta), xbeta_max=max(xbeta),
GB_ODDS=GOOD/BAD) # you can calculate the rest yourself,
# "the rest should be self explanatory".
> head(output, 3)
quantile n GOOD BAD xbeta_min xbeta_max GB_ODDS
1 (-30.1,-13.4] 89 41 39 -29.397737 -13.388007 1.0512821
2 (-13.4,-8.73] 89 49 45 -13.353714 -8.732645 1.0888889
3 (-8.73,-5.11] 89 46 48 -8.667335 -5.112192 0.9583333
Calculate the other columns. See e.g. ?cumsum for cumulative sums, e.g. output$cummu_good <- cumsum(output$GOOD).
Add the 'total' row. You should be able to do this. You can add an extra row to output using rbind.
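For that last step, a possible sketch of the 'total' row (assuming the output data frame built above; the remaining columns follow the same pattern):
output$quantile <- as.character(output$quantile)  # so a "Total" label can be added
total_row <- data.frame(quantile = "Total",
                        n = sum(output$n), GOOD = sum(output$GOOD), BAD = sum(output$BAD),
                        xbeta_min = min(my_dataframe$xbeta), xbeta_max = max(my_dataframe$xbeta),
                        GB_ODDS = sum(output$GOOD) / sum(output$BAD))
output <- rbind(total_row, output)  # total on top, as in the desired table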
Here is the final version of my script, with math coffee's guidance. I had to use .bincode instead of the suggested cut due to a "'breaks' are not unique" error.
Thanks everyone.
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
xbeta=rnorm(891, sd=10),
Good=round(runif(891) < 0.5),
Bad=round(runif(891) < 0.5))
head(my_dataframe)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
my_dataframe$quantile = .bincode(my_dataframe$xbeta,quantile(my_dataframe$xbeta,seq(0,1,by=.1)))
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
xbeta_min=min(xbeta), xbeta_max=max(xbeta), xbeta_median=median(xbeta), xbeta_mean=mean(xbeta),
GB_ODDS=GOOD/BAD, LN_GB_ODDS = log(GOOD/BAD))
output$cummu_good = cumsum(output$GOOD)
output$cummu_bad = cumsum(output$BAD)
output$cummu_n = cumsum(output$n)
output$sum_good = sum(output$GOOD)
output$sum_bad = sum(output$BAD)
output$cummu_good_pct = cumsum(output$GOOD/output$sum_good)
output$cummu_bad_pct = cumsum(output$BAD/output$sum_bad)
output[["sum_good"]]=NULL
output[["sum_bad"]]=NULL
output
I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, basically I will identify all the relevant news articles within a day, feed them into the polarity() function and obtain the average polarity scores of all the articles (more precisely, the average of all the sentence from all the articles) within that day.
The problem is that on some days there are many more articles than on others, and I think this might mask some of the information if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. Sure enough, some of the more extreme polarity scores I obtained came from days with only a few relevant articles.
Is there any way I can take the different number of articles each day into consideration?
library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)
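A simple illustration of the weighting idea in the question (my own sketch with hypothetical numbers, not part of the original post): weight each day's average polarity by its article count when averaging across days.
daily <- data.frame(day = c("day1", "day2"),
                    avg_polarity = c(0.1, 0.6),
                    n_articles = c(30, 3))
weighted.mean(daily$avg_polarity, w = daily$n_articles)  # ~0.145, vs. an unweighted mean of 0.35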
I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.
One approach would be to use the number of words, as in the following example (I like the first approach more here):
poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))
output <- scores(poldat2)
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weight/max(weight)
weight2 <- output[["total.words"]]/max(output[["total.words"]])
output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
output[, -c(5:6)]
## sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1 f.cap.FALSE 158 1641 0.083 0.143583793 0.082504197
## 2 f.cap.TRUE 24 206 0.044 0.060969157 0.005564434
## 3 f.mont.TRUE 4 29 0.079 0.060996614 0.001397106
## 4 m.cap.FALSE 73 651 0.031 0.049163984 0.012191207
## 5 m.cap.TRUE 17 160 -0.176 -0.231357933 -0.017135804
## 6 m.escal.FALSE 9 170 -0.164 -0.218126656 -0.016977931
## 7 m.escal.TRUE 27 590 -0.067 -0.106080866 -0.024092720
## 8 m.mont.FALSE 70 868 -0.047 -0.078139272 -0.025099276
## 9 m.mont.TRUE 114 1175 -0.002 -0.003389105 -0.001433481
## 10 m.none.FALSE 7 71 0.066 0.072409049 0.002862997
## 11 none.none.FALSE 5 16 -0.300 -0.147087026 -0.002925046