Using tapply on two columns instead of one - r

I would like to calculate the Gini coefficient of several plots in R using the gini() function from the reldist package.
I have a data frame from which I need to use two columns as input to the gini function.
> head(merged[, c(1, 17, 29)])
  idp c13     w
1  19 126 14.14
2  19 146 14.14
3  19  76 39.29
4  19  74 39.29
5  19  86 39.29
6  19  93 39.29
The gini() function takes the values to be evaluated as its first argument (c13 here) and the corresponding weights as its second argument (w here).
So I need to use the columns c13 and w like this:
gini(merged$c13,merged$w)
[1] 0.2959369
The thing is, I want to do this for each plot (idp). I have about 4,000 different values of idp, each with dozens of values in the other two columns.
I thought I could do this using the function tapply(), but I can't pass two columns to it.
tapply(list(merged$c13,merged$w), merged$idp, gini)
As you know this does not work.
So what I would love to get as a result is a data frame like this:
idp Gini
1 19 0.12
2 21 0.45
3 35 0.65
4 65 0.23
Do you have any idea how to do this? Maybe with the plyr package?
Thank you for your help!

You can use the ddply() function from the plyr package to calculate the coefficient for each level (in the example data frame, some idp values were changed to 21).
library(plyr)
library(reldist)
ddply(merged,.(idp),summarize, Gini=gini(c13,w))
idp Gini
1 19 0.15307402
2 21 0.05006588
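If you would rather stay in base R, a rough equivalent (just a sketch using the same merged data frame; the object name gini_by_plot is illustrative) is to split the rows by idp and apply gini() to each piece:
library(reldist)
# Split the rows by plot id and compute the weighted Gini coefficient per group
gini_by_plot <- sapply(split(merged, merged$idp),
                       function(d) gini(d$c13, weights = d$w))
data.frame(idp = names(gini_by_plot), Gini = unname(gini_by_plot))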

Related

missing value imputation for unevenly spaced univariate time series using R

I have the following dataset:
timestamp value
1 90
3 78
6 87
8 NA
12 98
15 100
18 NA
24 88
27 101
As you can see, the gaps between consecutive timestamps are not equi-spaced. Is there a way to impute values to replace the NAs using a timestamp-dependent method?
All packages I found are only suitable for equi-spaced time series...
Thanks!
The zoo R package can be used to handle irregularly spaced / unevenly spaced time series.
First you have to create a zoo time series object. You can either specify plain numeric indices or use POSIXct timestamps.
Afterwards you can apply an imputation method to this object. zoo's imputation methods are limited, but they also work on irregularly spaced time series. You can use linear interpolation (na.approx) or spline interpolation (na.spline), both of which take the uneven time stamps into account.
library(zoo)

# First create an unevenly spaced zoo time series object:
# first argument is the vector of values, second the (irregular) time indices
zoo_ts <- zoo(c(90, 78, 87, NA, 98, 100, NA, 88, 101),
              c(1, 3, 6, 8, 12, 15, 18, 24, 27))

# Perform the imputation by linear interpolation
na.approx(zoo_ts)
Your zoo object looks like this:
> 1 3 6 8 12 15 18 24 27
> 90 78 87 NA 98 100 NA 88 101
Your imputed series like this afterwards:
> 1 3 6 8 12 15 18 24 27
> 90.00000 78.00000 87.00000 90.66667 98.00000 100.00000 96.00000 88.00000 101.00000
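Spline interpolation, also mentioned above, works the same way on the zoo object (a quick sketch):
# Spline interpolation, again respecting the irregular time index
na.spline(zoo_ts)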
When you have time stamps and the series is only slightly off (e.g. by a few seconds) at each time stamp, you could also try to transform the series into a regular time series by mapping your values onto the correct regular intervals (only reasonable if the differences are small). By doing this you could also use additional imputation methods, e.g. from the imputeTS package (which only works for regularly spaced data).
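Here is a rough sketch of that idea (the 3-unit grid and the na_interpolation() call are illustrative assumptions, not part of the original answer):
library(zoo)
library(imputeTS)

# Snap each timestamp to the nearest multiple of 3 (assumes no two observations collide)
snapped <- zoo(coredata(zoo_ts), round(index(zoo_ts) / 3) * 3)

# Add the grid points that have no observation at all, so the index becomes regular
regular <- merge(snapped, zoo(, seq(0, 27, by = 3)), all = TRUE)

# Any method for regularly spaced series can now be used, e.g. from imputeTS
na_interpolation(coredata(regular))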

LOCF and NOCB methods for missing data: how to plot data?

I'm working on the following dataset and its missing data:
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
I would like to fill in the missing data via the Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) methods, produce a graphic representation plotting the individual profiles over age by sex and highlighting the imputed values, and compute the means and standard errors at each age by sex. Can you suggest how to set the arguments of the plot() function properly?
Does anyone have a clue about this?
Below is some code drawn from another dataset as an example, in case it turns out to be useful.
library(mice)   # provides mdc(), the colours used to distinguish observed and imputed data

par(mfrow = c(1, 1))
Oz <- airquality$Ozone   # airquality ships with base R

# Last observation carried forward: replace each NA with the last non-NA value
locf <- function(x) {
  a <- x[1]
  for (i in 2:length(x)) {
    if (is.na(x[i])) x[i] <- a
    else a <- x[i]
  }
  return(x)
}

Ozi <- locf(Oz)
colvec <- ifelse(is.na(Oz), mdc(2), mdc(1))   # imputed points in a different colour

### Figure
plot(Ozi[1:80], col = colvec, type = "l", xlab = "Day number", ylab = "Ozone (ppb)")
points(Ozi[1:80], col = colvec, pch = 20, cex = 1)
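The question also asks about NOCB; one simple way (a sketch, not part of the original example code) is to reuse the same helper on the reversed vector:
# Next observation carried backward: run LOCF on the reversed series, then reverse back
nocb <- function(x) rev(locf(rev(x)))
Ozb <- nocb(Oz)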
Next Observation Carried Backward / Last Observation Carried Forward is probably a very bad choice for your data.
These algorithms are usually used for time series data, where carrying the last observation forward can be a reasonable idea: if you think of 10-minute temperature measurements, the current outdoor temperature is quite likely to be similar to the temperature 10 minutes ago.
For cross-sectional data (it seems you are looking at persons), the previous person is usually no more similar to the actual person than any other random person.
Take a look at the mice R package for your cross-sectional dataset.
It offers way better algorithms for your case than locf/nocb.
Here is an overview of the functions it offers: https://amices.org/mice/reference/index.html
Usually when using mice you create multiple possible imputations (it is worth reading about the technique of multiple imputation), but you can also produce just a single completed dataset with the package.
It also includes different plots to assess the imputations. The following functions are available for visualizing your imputations (a short usage sketch follows the list):
bwplot() (Box-and-whisker plot of observed and imputed data)
densityplot() (Density plot of observed and imputed data)
stripplot() (Stripplot of observed and imputed data)
xyplot() (Scatterplot of observed and imputed data)
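Here is a minimal sketch of how this could look for the dataset in the question (assuming the tibble is called dat; the choice of columns, m = 5, "pmm", and the seed are illustrative, not requirements of mice):
library(mice)

# Impute the four measurement columns with predictive mean matching (5 imputations)
imp <- mice(dat[, c("d8", "d10", "d12", "d14")], m = 5, method = "pmm", seed = 123)

# Inspect the imputations
densityplot(imp)   # observed vs. imputed distributions
stripplot(imp)     # individual observed and imputed points

# Extract a single completed dataset if one filled-in data frame is needed
completed <- complete(imp, 1)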
Hope this helps a little. My advice would be to take a look at this package and then start a new approach with that knowledge.

How to write the Kolmogorov-Smirnov in R

Long story short, I want to manually write the code for the Kolmogorov-Smirnov one-sample statistic instead of using ks.test() in R. From what I understand, the K-S test is basically a ratio between a numerator and a denominator. I am interested in writing out the numerator, and from what I understand it is the maximal absolute difference between a sample of observations and the theoretical assumption. Let's use the below case as an example:
Data Expected
1 0.01052632 0.008864266
2 0.02105263 0.010969529
13 0.05263158 0.018282548
20 0.06315789 0.031689751
22 0.09473684 0.046315789
24 0.26315789 0.210526316
26 0.27368421 0.220387812
27 0.29473684 0.236232687
28 0.30526316 0.252520776
3 0.42105263 0.365650970
4 0.42105263 0.372299169
5 0.45263158 0.398781163
6 0.49473684 0.452853186
7 0.50526316 0.460277008
8 0.73684211 0.656842105
9 0.74736842 0.665484765
10 0.75789474 0.691523546
11 0.77894737 0.718005540
12 0.80000000 0.735955679
14 0.84210526 0.791135734
15 0.86315789 0.809972299
16 0.88421053 0.838559557
17 0.89473684 0.857950139
18 0.96842105 0.958337950
19 0.97894737 0.968642659
21 0.97894737 0.979058172
23 0.98947368 0.989473684
25 1.00000000 1.000000000
Here, I want to obtain the maximal absolute difference (Data - Expected).
Anyone have an idea? I can rephrase this question, if necessary. Thanks!
I used the following expression to obtain the answer:
> A <- with(df, max(abs(Data-Expected)))
> A
0.082
Basically, this expression computes the differences between the two columns, takes their absolute values, and returns the largest of those absolute differences.
Credit to Josh O'Brien.
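If you want to reuse the computation, it can be wrapped in a small helper (the names ks_numerator and df are illustrative, not from the original answer):
# Largest absolute difference between the empirical and theoretical values
ks_numerator <- function(observed, expected) max(abs(observed - expected))
A <- ks_numerator(df$Data, df$Expected)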

How do I create a random contingency table in R?

I would like to create random two-way contingency tables, given fixed row and column marginals. Supposing I have a table like this:
A C G T
A 79 6 13 53
C 16 7 6 17
G 9 3 1 6
T 58 28 18 114
with given row marginals:
A C G T
151 46 19 218
and column marginals:
A C G T
162 44 38 190
I'd like to create a random contingency table, for example:
A C G T
A 49 16 10 76
C 23 2 6 15
G 11 0 1 7
T 79 26 21 92
which preserves those marginals.
Since n is not too large in this case, I tried to approach this by "untabling" the marginal vectors, i.e. by converting the marginals into vectors of the form
A A A ...C C C ... G G G ... T T T
and then permuting and tabling them.
My current method for "untabling" the marginals is highly unnatural and inefficient, and I was curious to know if there's a better way. Certain built-in functions must create random contingency tables, for instance chisq.test when simulate.p.value=TRUE. Is random contingency table construction also built in?
Thanks in advance for any suggestions.
I'm not entirely sure what you mean by 'untabling', and since you didn't actually specify the method you're currently using, I can't be sure that this isn't what you're currently doing.
But given marginals of (162, 44, 38, 190) you can 'recreate' the vector just by doing this:
rep(c('A','C','G','T'),times = c(162, 44, 38, 190))
which you can then permute as needed.
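For example, a random permutation of that vector (just an illustration) is:
sample(rep(c('A', 'C', 'G', 'T'), times = c(162, 44, 38, 190)))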
I'm sorry, but @joran's answer is not correct. His code correctly simulates tables with the correct column totals, but the OP requested a simulation that respects both row and column totals. The solution was given by W. M. Patefield (1981), "Algorithm AS 159: An efficient method of generating random r x c tables with given row and column totals", Applied Statistics, 30, 91-97.
Patefield's algorithm is implemented in the base R function r2dtable().
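For example, using the marginals from the question (a quick sketch; the seed is arbitrary):
# One random 4 x 4 table with the requested row and column totals
set.seed(42)
r2dtable(1, c(151, 46, 19, 218), c(162, 44, 38, 190))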
