I work with animal trials in which I try to get information about movement for several groups of animals (normally 4 groups of 12 individuals, but not always the same).
My final data frame per trial looks like this:
> dput(aa)
structure(list(Tiempo = c(618.4, 618.6, 618.8, 619, 619.2, 619.4,
619.6, 619.8, 620, 620.2, 620.4), UT1 = c(0, 0, 15, 19, 26, 27,
29, 37, 42, 44, 45), UT2 = c(0, 0, 0, 0, 0, 1, 18, 19, 21, 21,
21), UT3 = c(0, 2, 3, 3, 3, 3, 16, 19, 20, 20, 20), UT4 = c(0,
0, 0, 0, 0, 0, 5, 17, 29, 34, 39), UT5 = c(0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1), UT6 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), UT7 = c(0,
0, 1, 2, 2, 3, 4, 6, 7, 7, 8), UT8 = c(0, 19, 20, 23, 24, 25,
33, 80, 119, 122, 130), UT9 = c(0, 1, 1, 1, 1, 3, 6, 9, 19, 19,
19), UT10 = c(0, 0, 0, 0, 0, 1, 2, 3, 10, 12, 14), TR1 = c(0,
0, 0, 0, 0, 0, 0, 1, 2, 2, 2), TR2 = c(0, 0, 0, 0, 0, 0, 2, 19,
32, 37, 43), TR3 = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), TR4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), TR5 = c(0, 0, 0, 0, 0, 0, 13,
18, 20, 22, 26), TR6 = c(0, 2, 11, 20, 25, 29, 37, 40, 41, 42,
43), TR7 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), TR8 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), TR9 = c(0, 0, 4, 9, 16, 19, 23, 27,
31, 33, 34), TR10 = c(0, 1, 9, 25, 32, 41, 49, 49, 51, 57, 60
), UT1.1 = c(0, 10, 15, 17, 23, 31, 37, 48, 53, 57, 58), UT2.1 = c(0,
1, 1, 1, 1, 2, 2, 4, 4, 4, 4), UT3.1 = c(0, 2, 11, 14, 20, 22,
24, 25, 26, 26, 26), UT4.1 = c(0, 0, 0, 0, 0, 0, 0, 11, 13, 13,
14), UT5.1 = c(0, 3, 5, 7, 18, 19, 19, 27, 37, 39, 42), UT6.1 = c(0,
0, 0, 0, 0, 0, 2, 2, 3, 4, 4), UT7.1 = c(0, 0, 2, 8, 9, 9, 12,
16, 18, 18, 18), UT8.1 = c(0, 0, 1, 8, 13, 15, 44, 68, 80, 89,
94), UT9.1 = c(0, 1, 1, 1, 1, 2, 3, 5, 9, 10, 10), UT10.1 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), UT11 = c(0, 12, 17, 17, 18, 34,
74, 116, 131, 145, 170), UT12 = c(0, 1, 2, 3, 3, 3, 5, 14, 21,
22, 24), TR1.1 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1), TR2.1 = c(0,
0, 0, 11, 16, 19, 40, 94, 121, 134, 145), TR3.1 = c(0, 0, 0,
2, 3, 5, 6, 6, 6, 7, 7), TR4.1 = c(0, 0, 0, 1, 1, 1, 1, 1, 4,
4, 5), TR5.1 = c(0, 24, 27, 28, 29, 37, 86, 151, 212, 258, 288
), TR6.1 = c(0, 0, 1, 1, 1, 2, 5, 9, 12, 12, 13), TR7.1 = c(0,
4, 7, 28, 47, 70, 108, 125, 127, 127, 127), TR8.1 = c(0, 1, 2,
2, 2, 2, 3, 3, 4, 4, 4), TR9.1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), TR10.1 = c(0, 1, 1, 1, 1, 1, 13, 40, 41, 45, 49), TR11 = c(0,
0, 0, 1, 4, 8, 10, 11, 17, 23, 25), TR12 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), .Names = c("Tiempo", "UT1", "UT2", "UT3", "UT4",
"UT5", "UT6", "UT7", "UT8", "UT9", "UT10", "TR1", "TR2", "TR3",
"TR4", "TR5", "TR6", "TR7", "TR8", "TR9", "TR10", "UT1.1", "UT2.1",
"UT3.1", "UT4.1", "UT5.1", "UT6.1", "UT7.1", "UT8.1", "UT9.1",
"UT10.1", "UT11", "UT12", "TR1.1", "TR2.1", "TR3.1", "TR4.1",
"TR5.1", "TR6.1", "TR7.1", "TR8.1", "TR9.1", "TR10.1", "TR11",
"TR12"), row.names = c(NA, -11L), class = "data.frame")
My goal is to fit a linear model (lm) to the individual represented in each column, using the Tiempo variable as x, so I do it like this:
fit <- apply(aa, 2, function(x) lm(x ~ aa$Tiempo))
It works perfectly, but the problem is that all the valuable (and useless) information gets stored in that lm object and I can't extract the data in an efficient way. My lm object looks like this:
summary(fit)
Length Class Mode
Tiempo 12 lm list
UT1 12 lm list
UT2 12 lm list
UT3 12 lm list
UT4 12 lm list
UT5 12 lm list
UT6 12 lm list
UT7 12 lm list
UT8 12 lm list
UT9 12 lm list
UT10 12 lm list
TR1 12 lm list
TR2 12 lm list
TR3 12 lm list
TR4 12 lm list
TR5 12 lm list
TR6 12 lm list
TR7 12 lm list
TR8 12 lm list
TR9 12 lm list
TR10 12 lm list
UT1.1 12 lm list
UT2.1 12 lm list
UT3.1 12 lm list
UT4.1 12 lm list
UT5.1 12 lm list
UT6.1 12 lm list
UT7.1 12 lm list
UT8.1 12 lm list
UT9.1 12 lm list
UT10.1 12 lm list
UT11 12 lm list
UT12 12 lm list
TR1.1 12 lm list
TR2.1 12 lm list
TR3.1 12 lm list
TR4.1 12 lm list
TR5.1 12 lm list
TR6.1 12 lm list
TR7.1 12 lm list
TR8.1 12 lm list
TR9.1 12 lm list
TR10.1 12 lm list
TR11 12 lm list
TR12 12 lm list
And each animal looks like this
summary(fit$UT1)
Call:
lm(formula = x ~ aa$Tiempo)
Residuals:
Min 1Q Median 3Q Max
-6.873 -1.845 1.182 2.314 4.918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14642.700 1104.825 -13.25 3.29e-07 ***
aa$Tiempo 23.682 1.784 13.28 3.24e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.742 on 9 degrees of freedom
Multiple R-squared: 0.9514, Adjusted R-squared: 0.946
F-statistic: 176.3 on 1 and 9 DF, p-value: 3.24e-07
I would like to get the summary information organised in a data frame covering all animals (or at least the coefficients and the R-squared values) in order to keep doing some statistical analysis. Having that information could help me write a function to evaluate whether the R-squared is lower than a fixed value, so that I can check that fit (or discard the animal if it really isn't performing well). Besides, I need to find a way to make it reproducible, because at the moment I'm using
FIT <- data.frame(UT1 = fit$UT1$coefficients,
                  UT2 = fit$UT2$coefficients,
                  UT3 = fit$UT3$coefficients, ...)
This approach doesn't even achieve what I'm trying to do, and it's really fragile.
I've searched a little and found the coef function, but
coef(fit)
NULL
With your fit list, you can extract the coefficients and r-squared values with
fit <- apply(aa, 2, function(x) lm(x ~ aa$Tiempo))
mysummary <- t(sapply(fit, function(x) {
  ss <- summary(x)
  c(coef(x), r.square = ss$r.squared, adj.r.squared = ss$adj.r.squared)
}))
We use sapply to go over the list you created and extract the coefficients from the model and the r-squared values from the summary. The output is
> mysummary
(Intercept) aa$Tiempo r.square adj.r.squared
Tiempo 0.0000 1.0000000 1.0000000 1.0000000
UT1 -14642.7000 23.6818182 0.9514231 0.9460256
UT2 -8662.4182 14.0000000 0.7973105 0.7747894
UT3 -7535.5091 12.1818182 0.8404400 0.8227111
...
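With mysummary in hand, the R-squared check you describe becomes a one-liner. A minimal sketch, assuming 0.9 as the (arbitrary) cutoff you would choose yourself:
# Drop the Tiempo-on-Tiempo row, then flag animals whose fit is poor
res <- as.data.frame(mysummary[rownames(mysummary) != "Tiempo", ])
poor_fits <- rownames(res)[res$r.square < 0.9]
poor_fits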
I am trying to implement a bincount operation in OpenCL which allocates an output buffer and uses the indices from x to accumulate weights at the same index (assume that num_bins == max(x) + 1). This is equivalent to the following Python code:
out = np.zeros(num_bins)
for i in range(len(x)):
    out[x[i]] += weight[i]
return out
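For reference, NumPy already provides this operation directly, which is useful for verifying the OpenCL results on the CPU:
# CPU reference: accumulates weight[i] into bin x[i]
out_ref = np.bincount(x, weights=weight, minlength=num_bins)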
What I have is the following:
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
    int gid = get_global_id(0);
    res_g[x_g[gid]] += weight_g[gid];
}
""").build()
# test
x = np.arange(5, dtype=np.int32).repeat(2) # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
x_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=x)
weight = np.arange(10, dtype=np.int32) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
weight_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=weight)
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 5)
prg.bincount(queue, [10], None, res_g, x_g, weight_g)
# transfer back to cpu
res_np = np.empty(5).astype(np.int32)
cl.enqueue_copy(queue, res_np, res_g)
Output in res_np:
array([1, 3, 5, 7, 9], dtype=int32)
Expected output:
array([1, 5, 9, 13, 17], dtype=int32)
How do I accumulate the elements that are indexed more than once?
EDIT
The above is a contrived example, in my real-world application x will be indices from a sliding window algorithm:
x = np.array([ 0, 1, 2, 4, 5, 6, 8, 9, 10, 1, 2, 3, 5, 6, 7, 9, 10,
11, 4, 5, 6, 8, 9, 10, 12, 13, 14, 5, 6, 7, 9, 10, 11, 13,
14, 15, 8, 9, 10, 12, 13, 14, 16, 17, 18, 9, 10, 11, 13, 14, 15,
17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 21, 22, 23, 25, 26,
27, 29, 30, 31, 24, 25, 26, 28, 29, 30, 32, 33, 34, 25, 26, 27, 29,
30, 31, 33, 34, 35, 28, 29, 30, 32, 33, 34, 36, 37, 38, 29, 30, 31,
33, 34, 35, 37, 38, 39], dtype=np.int32)
weight = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0], dtype=np.int32)
There is a pattern which becomes more apparent when reshaping x to (2,3,2,3,3). But I am having a hard time figuring out how the approach given by @doqtor can be used here, and especially whether it is easy enough to generalize.
The expected output is:
array([1, 1, 0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 1,
0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0], dtype=int32)
The problem is that the OpenCL buffer to which the weights are accumulated is not initialized (zeroed). Fixing that:
res_np = np.zeros(5).astype(np.int32)
res_g = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=res_np)
prg.bincount(queue, [10], None, res_g, x_g, weight_g)
# transfer back to cpu
cl.enqueue_copy(queue, res_np, res_g)
This returns the correct results: [ 1 5 9 13 17]
====== Update ==========
As @Kevin noticed, there is a race condition here too. If the indices follow any pattern, it can be addressed without synchronization, for example by processing every 2 elements in 1 work item:
__kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
    int gid = get_global_id(0);
    for(int x = gid*2; x < gid*2+2; ++x)
        res_g[x_g[x]] += weight_g[x];
}
Then schedule 5 work items:
prg.bincount(queue, [5], None, res_g, x_g, weight_g)
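If the indices follow no exploitable pattern, the usual alternative is an atomic update. Below is a sketch of such a kernel (bincount_atomic is a hypothetical name) using OpenCL's built-in atomic_add for 32-bit integers in global memory; it trades some performance for correctness:
__kernel void bincount_atomic(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
    int gid = get_global_id(0);
    // atomic_add serializes conflicting updates to the same bin,
    // so no increments are lost when several work items hit one index
    atomic_add(&res_g[x_g[gid]], weight_g[gid]);
}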
I have a data frame of counts. I would like to calculate weighted proportions, plot the proportions, and also plot standard error bars for these weighted proportions.
Sample of my data frame:
head(df[1:4,])
badge year total b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10
1 15 2014 14 3 2 1 1 1 1 1 1 1 1
2 15 2015 157 13 12 11 8 6 6 6 5 5 5
3 15 2016 15 5 3 1 1 1 1 1 1 1 0
4 2581 2014 13 1 1 1 1 1 1 1 1 1 1
The data contain counts of 911 calls officers respond to in ten different police beats (b_1, b_2,...) in a given year. So officer 15 responds to 14 calls total in 2014, 3 of which were in beat 1, 2 in beat 2, and so on.
Essentially, what I want is to get the overall proportion of calls that occur within each beat. But I want these proportions to be weighted by the total number of calls.
So far, I've been able to calculate this by just adding up the values within each b_ column and the total column, and calculating proportions. I have plotted these in a simple bar plot, but I haven't been able to figure out how to calculate standard errors that are weighted by total.
I have no preference for how the data are plotted. I'm mainly interested in getting the right standard errors.
Here is the code I have so far:
sums_by_beat <- apply(df[, grep('b_', colnames(df))], 2, sum)
props_by_beat <- sums_by_beat / sum(df$total)
# Bar plot of proportions by beat
barplot(props_by_beat, main='Distribution of Calls by Beat',
xlab="Nth Most Common Division", ylim=c(0,1),
names.arg=1:length(props_by_beat), ylab="Percent of Total Calls")
And a 30-row sample of my data:
df <- structure(list(badge = c(15, 15, 15, 2581, 2581, 2745, 2745,
3162, 3162, 3162, 3396, 3650, 3650, 3688, 3688, 3688, 3698, 3698,
3698, 3717, 3717, 3717, 3740, 3740, 3740, 3813, 3873, 3907, 3930,
4007), year = c(2014, 2015, 2016, 2014, 2015, 2015, 2016, 2014,
2015, 2016, 2016, 2014, 2015, 2014, 2015, 2016, 2014, 2015, 2016,
2014, 2015, 2016, 2014, 2015, 2016, 2016, 2015, 2014, 2014, 2014
), total = c(14, 157, 15, 13, 29, 1, 1, 754, 1172, 1039, 14,
1, 2, 34, 57, 146, 3, 7, 28, 593, 1036, 1303, 461, 952, 1370,
1, 4, 41, 5, 451), b_1 = c(3, 13, 5, 1, 3, 1, 1, 33, 84, 83,
2, 1, 2, 5, 10, 14, 2, 7, 7, 39, 72, 75, 42, 69, 81, 1, 1, 7,
1, 36), b_2 = c(2, 12, 3, 1, 2, 0, 0, 33, 61, 52, 2, 0, 0, 3,
6, 8, 1, 0, 2, 37, 65, 70, 29, 65, 75, 0, 1, 5, 1, 23), b_3 = c(1,
11, 1, 1, 2, 0, 0, 32, 57, 45, 2, 0, 0, 3, 5, 8, 0, 0, 2, 34,
62, 67, 28, 50, 73, 0, 1, 3, 1, 22), b_4 = c(1, 8, 1, 1, 2, 0,
0, 31, 44, 39, 2, 0, 0, 3, 3, 7, 0, 0, 2, 34, 61, 67, 26, 42,
72, 0, 1, 3, 1, 21), b_5 = c(1, 6, 1, 1, 1, 0, 0, 30, 42, 37,
1, 0, 0, 3, 3, 7, 0, 0, 1, 33, 53, 61, 23, 42, 67, 0, 0, 2, 1,
21), b_6 = c(1, 6, 1, 1, 1, 0, 0, 30, 40, 36, 1, 0, 0, 2, 2,
6, 0, 0, 1, 32, 53, 61, 22, 41, 63, 0, 0, 2, 0, 21), b_7 = c(1,
6, 1, 1, 1, 0, 0, 26, 39, 35, 1, 0, 0, 2, 2, 6, 0, 0, 1, 30,
47, 58, 22, 39, 62, 0, 0, 2, 0, 21), b_8 = c(1, 5, 1, 1, 1, 0,
0, 26, 39, 33, 1, 0, 0, 2, 2, 6, 0, 0, 1, 30, 47, 58, 21, 38,
59, 0, 0, 2, 0, 19), b_9 = c(1, 5, 1, 1, 1, 0, 0, 24, 34, 33,
1, 0, 0, 2, 2, 5, 0, 0, 1, 30, 43, 57, 20, 37, 57, 0, 0, 2, 0,
15), b_10 = c(1, 5, 0, 1, 1, 0, 0, 23, 34, 32, 1, 0, 0, 1, 2,
5, 0, 0, 1, 27, 40, 56, 18, 36, 55, 0, 0, 2, 0, 14)), row.names = c(NA,
30L), class = "data.frame")
There isn't (as far as I know) a built-in R function to calculate the standard error of a weighted mean, but it is fairly straightforward to calculate - with some assumptions that are probably valid in the case you describe.
See, for instance:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Standard_error
Standard error of the weighted mean
If the elements used to calculate the weighted mean are samples from populations that all have the same variance v, then the variance of the weighted sample mean is estimated as:
var_m = v * sum( wnorm^2 ) # wnorm = weights normalized to sum to 1
And the standard error of the weighted mean is equal to the square root of the variance.
sem = sqrt( var_m )
So, we need to calculate the sample variance from the weighted data.
Weighted variance
The weighted population variance (or biased sample variance) is calculated as:
pop_v = sum( w * (x-mean)^2 ) / sum( w )
However, if (as in the case you describe) we are working with samples taken from the population, rather than with the population itself, we need to make an adjustment to obtain an unbiased sample variance.
If the weights represent the frequencies of observations underlying each of the elements used to calculate the weighted mean & variance, then the adjustment is:
v = pop_v * sum( w ) / ( sum( w ) -1 )
However, this is not the case here, as the weights are the total frequencies of 911 calls for each policeman, not the calls for each beat. So in this case the weights correspond to the reliabilities of each element, and the adjustment is:
v = pop_v * sum( w )^2 / ( sum( w )^2 - sum( w^2) )
weighted.var and weighted.sem functions
Putting all this together, we can define weighted.var and weighted.sem functions, similar to the base R weighted.mean function (note that several R packages, for instance "Hmisc", already include more-versatile functions to calculate the weighted variance):
weighted.var <- function(x, w, type = "reliability") {
  m <- weighted.mean(x, w)
  if (type == "frequency") {
    sum(w * (x - m)^2) / (sum(w) - 1)
  } else {
    sum(w * (x - m)^2) * sum(w) / (sum(w)^2 - sum(w^2))
  }
}
weighted.sem <- function(x, w, ...) sqrt(weighted.var(x, w, ...) * sum(w^2) / sum(w)^2)
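A quick sanity check: with equal weights, weighted.sem should reduce to the usual sd(x)/sqrt(n). For example:
x <- c(2, 4, 6, 8)
w <- rep(1, 4)
weighted.sem(x, w)        # 1.290994
sd(x) / sqrt(length(x))   # 1.290994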
Applied to the 911 call data in the question
In the case of the question, the elements from which we want to calculate the weighted mean and weighted sem correspond to the proportions of calls in each beat, for each policeman.
So (finally...):
props <- t(apply(df, 1, function(row) row[-(1:3)] / row[3]))
wmean_props <- apply(props, 2, function(col) weighted.mean(col, w = df[, 3]))
wsem_props <- apply(props, 2, function(col) weighted.sem(col, w = df[, 3]))
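To visualize them, one option is the barplot/arrows idiom (a sketch; the ±1 SEM whiskers are just one choice of display):
b <- barplot(wmean_props, ylim = c(0, 1),
             main = 'Distribution of Calls by Beat',
             ylab = 'Proportion of Total Calls')
arrows(b, wmean_props - wsem_props, b, wmean_props + wsem_props,
       length = .02, angle = 90, code = 3)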
Aren't your "proportions" actually the mean of the weighted (by total) observations? Then we could simply calculate the weighted colMeans accordingly.
df2 <- df[, grep('b_', colnames(df))]
means.w <- colMeans(df2 / df$total)
For the error bars we could use the quantiles at 1 - alpha/2, i.e. for alpha == .05 we use c(.025, .975); analytical SD-based bars would extend into negative values.
q.w <- t(apply(df2 / df$total, 2, quantile, c(.025, .975)))
Now, we store the x-positions that barplot returns invisibly,
# Bar plot of proportions by beat
b <- barplot(means.w, main='Distribution of Calls by Beat',
xlab="Nth Most Common Division", ylim=c(0,1),
names.arg=1:length(means.w), ylab="Percent of Total Calls")
and construct the error bars with arrows.
arrows(b, q.w[,1], b, q.w[,2], length=.02, angle=90, code=3)
I am replicating a logit model example from the econometrics book by Gujarati and Porter (Spanish edition). I have no problem with the model estimation, but I can't replicate the marginal effects. The book shows its regression results as an image (not reproduced here).
In the following lines I show my regression results:
> dat <- data.frame(
+ debito = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
+ 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
+ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1),
+ saldo = c(1756, 748, 1501, 1831, 1622, 1886, 740, 1593, 1169, 2125,
+ 1554, 1474, 1913, 1218, 1006, 2215, 137, 167, 343, 2557,
+ 2276, 1494, 2144, 1995, 1053, 1526, 1120, 1838, 1746, 1616,
+ 1958, 634, 580, 1320, 1675, 789, 1735, 1784, 1326, 2051, 1044,
+ 1885, 1790, 765, 1645, 32, 1266, 890, 2204, 2409, 1338, 2076,
+ 1708, 2138, 2375, 1455, 1487, 1125, 1989, 2156),
+ cajero = c(13, 9, 10, 10, 14, 17, 6, 10, 6, 18, 12, 12, 6, 10, 12, 20,
+ 7, 5, 7, 20, 15, 11, 17, 10, 8, 8, 8, 7, 11, 10, 6, 2, 4, 4,
+ 6, 8, 12, 11, 16, 14, 7, 10, 11, 4, 6, 2, 11, 7, 14, 16, 14,
+ 12, 13, 18, 12, 9, 8, 6, 12, 14),
+ interes = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
+ 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0)
+ )
> mod <- glm(debito ~ saldo + cajero + interes,data = dat,
+ family = "binomial")
> summary(mod)
Call:
glm(formula = debito ~ saldo + cajero + interes, family = "binomial",
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6619 -0.9712 -0.6629 1.1440 1.7076
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5749004 0.7857866 -0.732 0.4644
saldo 0.0012481 0.0006973 1.790 0.0735 .
cajero -0.1202251 0.0939842 -1.279 0.2008
interes -1.3520856 0.6809872 -1.985 0.0471 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 82.108 on 59 degrees of freedom
Residual deviance: 75.500 on 56 degrees of freedom
AIC: 83.5
Number of Fisher Scoring iterations: 4
As you can see, I obtained the same results. My problem comes when I try to replicate the marginal effects, which the book also reports as an image (again not reproduced here).
I tried to replicate those results with the following method:
> coef(mod) * mean(dlogis(predict(mod, type = "link")))
(Intercept) saldo cajero interes
-0.1262098468 0.0002740024 -0.0263934252 -0.2968279711
As you can see, I failed. I know the results in the book were obtained with Stata, so I don't know the details of the calculation, or whether it is possible to replicate it in R. Is there a way?
Use the package "margins" for marginal effects.
install.packages("margins")
library("margins")
mod <- glm(debito ~ saldo + cajero + interes, data = dat, family = "binomial")
(m <- margins(mod))
Here are the outputs:
Average marginal effects
glm(formula = debito ~ saldo + cajero + interes, family = "binomial", data = dat)
saldo cajero interes
0.000274 -0.02639 -0.2968
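Note that margins reports average marginal effects (AME), which match your own calculation above. If the book instead evaluates the marginal effects at the means of the regressors (a common textbook convention, and only an assumption about what Gujarati and Porter do), a sketch of that calculation is:
# Marginal effects at the means of the regressors (MEM)
xbar <- c(1, mean(dat$saldo), mean(dat$cajero), mean(dat$interes))
mem <- coef(mod) * dlogis(sum(coef(mod) * xbar))
mem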
Are your column names correct?
Consider dput:
structure(list(REAÇÃO = structure(c(0, 1, 0, 0, 1, 0, 1, 1,
0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
1, 0, 1, 1, 0, 1, 1), format.spss = "F11.0"), IDADE = structure(c(22,
38, 36, 58, 37, 31, 32, 54, 60, 34, 45, 27, 30, 20, 30, 30, 22,
26, 19, 18, 22, 23, 24, 50, 20, 47, 34, 31, 43, 35, 23, 34, 51,
63, 22, 29), format.spss = "F11.0"), ESCOLARIDADE = structure(c(6,
12, 12, 8, 12, 12, 10, 12, 8, 12, 12, 12, 8, 4, 8, 8, 12, 8,
9, 4, 12, 6, 12, 12, 12, 12, 12, 12, 12, 8, 8, 12, 16, 12, 12,
12), format.spss = "F11.0"), SEXO = structure(c(1, 1, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 1), format.spss = "F11.0")), .Names = c("REAÇÃO",
"IDADE", "ESCOLARIDADE", "SEXO"), row.names = c(NA, -36L), class = "data.frame")
where REAÇÃO is the dependent variable in the model.
Constant: -4.438.
How can I obtain this value using a simple function in R?
To obtain the constant term of a discriminant analysis in R (with library MASS):
groupmean <- model$prior %*% model$means
constant <- groupmean %*% model$scaling
constant
where model is the fitted lda object:
model <- lda(y ~ x1 + x2 + xn, data = mydata)
model
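Applied to the posted data, a minimal sketch (assuming the dput structure above has been assigned to df; note that the sign of the constant depends on how lda orients the discriminant axis, so it may come out as the negative of the SPSS value):
library(MASS)
model <- lda(REAÇÃO ~ IDADE + ESCOLARIDADE + SEXO, data = df)
groupmean <- model$prior %*% model$means
constant <- groupmean %*% model$scaling
constant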
I would like to overlay two different survival curves on the same plot, for example OS and PFS (dummy results below).
N pt.  OS  OS_Time_(years)  PFS  PFS_Time_(years)
1      1   12               0    12
2      0   10               1    8
3      0   14               0    14
4      0   10               0    10
5      1   11               1    8
6      1   16               1    6
7      0   11               1    4
8      0   12               1    10
9      1   9                0    9
10     1   10               1    9
First, I import my dataset:
library(readxl)
testR <- read_excel("~/test.xlsx")
View(testR)
Then, I created survfit objects for both OS and PFS:
library(survival)
OS <- survfit(Surv(OS_t, OS) ~ 1, data = testR)
PFS <- survfit(Surv(PFS_t, PFS) ~ 1, data = testR)
And finally, I can plot each one with:
plot(OS)
plot(PFS)
for example (or with ggplot2...).
Here is my question: if I want to overlay the two on the same graph, how can I do it?
I tried multiplot or
ggplot(testR, aes(x)) + # basic graphical object
geom_line(aes(y=y1), colour="red") + # first layer
geom_line(aes(y=y2), colour="green") # second layer
But it didn't work (though I'm not sure I used it correctly).
Can someone help me, please?
Thanks a lot.
Here is the code for my data sample:
test <- structure(list(ID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 8, 9),
Sex = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Tabac = c(2, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 2, 0, 1, 1, 1),
Bmi = c(20, 37, 37, 25, 28, 38, 16, 27, 26, 28, 15, 36, 20, 17, 28, 37, 27, 26, 18),
Age = c(75, 56, 45, 65, 76, 34, 87, 43, 67, 90, 56, 37, 84, 45, 80, 87, 90, 65, 23),
LN = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
OS_times = c(2, 4, 4, 2, 3, 5, 5, 3, 2, 2, 4, 1, 3, 2, 4, 3, 4, 3, 2),
OS = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
PFS_time = c(1, 2, 1, 1, 3, 4, 3, 1, 2, 2, 4, 1, 2, 2, 2, 3, 4, 3, 2),
PFS = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)),
.Names = c("ID", "Sex", "Tabac", "Bmi", "Age", "LN", "OS_times", "OS", "PFS_time", "PFS"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -19L))
You may use the ggsurv function from the GGally package in the following way: combine both groups of variables in one data frame and add a "type" column; later, in the call to the plot, you refer to that type.
I used your data structure and named it "test". Afterwards, I transformed it into a data frame named "testdf".
library(GGally)
library(survival)  # for Surv() and survfit()
testdf <- data.frame(test)
OS_PFS1 <- data.frame(life = testdf$OS, life_times = testdf$OS_times, type= "OS")
OS_PFS2 <- data.frame(life = testdf$PFS, life_times = testdf$PFS_time, type= "PFS")
OS_PFS <- rbind(OS_PFS1, OS_PFS2)
sf.OS_PFS <- survfit(Surv(life_times, life) ~ type, data = OS_PFS)
ggsurv(sf.OS_PFS)
If you want the confidence intervals shown:
ggsurv(sf.OS_PFS, CI = TRUE)
Please let me know whether this is what you want.
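As an aside, base R graphics can also overlay the two fits directly, since the survival package provides plot() and lines() methods for survfit objects. A sketch (colors and legend placement are arbitrary choices):
plot(OS, conf.int = FALSE, col = 'red', xlab = 'Years', ylab = 'Survival probability')
lines(PFS, conf.int = FALSE, col = 'blue')
legend('bottomleft', legend = c('OS', 'PFS'), col = c('red', 'blue'), lty = 1)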