I want to create a column in R that is simply the average of all previous values of another column. For Example:
D
   X
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
10 10
I would like D$Y to be the prior average of D$X; that is, D$Y should be the average of all previous observations of D$X. I know how to do this with a for loop over every row, but is there a more efficient way?
I have a large dataset and hardware that is not up to that task!
You can generate cumulative means of a vector like this:
set.seed(123)
x <- sample(20)
x
## [1]  6 15  8 16 17  1 18 12  7 20 10  5 11  9 19 13 14  4  3  2
xmeans <- cumsum(x) / 1:length(x)
xmeans
## [1] 6.000000 10.500000 9.666667 11.250000 12.400000 10.500000 11.571429
## [8] 11.625000 11.111111 12.000000 11.818182 11.250000 11.230769 11.071429
## [15] 11.600000 11.687500 11.823529 11.388889 10.947368 10.500000
So D$Y <- cumsum(D$X) / 1:nrow(D) should work.
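Note that this cumulative mean includes the current row in each average. If "previous observations" should strictly exclude the current row, a minimal variant is to shift the result down by one position (the first row then has no prior observations, so it gets NA):
cm <- cumsum(D$X) / 1:nrow(D)
# Row i receives the mean of rows 1..(i-1); row 1 has no prior values
D$Y <- c(NA, cm[-nrow(D)])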
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable, and I would like to see how well this grouping went.
To do that, I want to extract a sample of n pairs of cases that are grouped together by variable y,
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (pairs falsely grouped or falsely not grouped). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you would like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3
# Set a fixed RNG seed for reproducibility
set.seed(2017)
# Grouped samples: split on y, keep groups with more than one row,
# and draw up to n rows from each group
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
    function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]))
df.grouped
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples: draw the same number of rows at random from the full data frame
df.ungrouped <- df[sample(nrow(df), nrow(df.grouped)), ]
df.ungrouped
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: we split df on y, then draw min(n, nrow(x)) rows from every subset x containing more than one row; rbinding the pieces gives the grouped sample df.grouped. We then draw nrow(df.grouped) rows from df to produce the ungrouped sample df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
I have a data frame of GPS locations with a column of seconds. How can I create a new column based on time gaps? For example, for this data frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a gap of 3 or more seconds between locations, and create a new column called 'bouts' that gives a running tally of the number of sections, giving a data frame that looks like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values are coerced to the numeric values 0/1 automatically, and that the output of diff is always one element shorter than its input; prepending the 1 restores the length and starts the tally at 1.
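To see why this works, here are the intermediate steps for the example data:
diff(df$secs)  # gaps between consecutive fixes
## [1] 1 1 1 1 1 1 3 1 1 1 1 6 1 1 1 1 4 1 2
diff(df$secs) >= 3  # TRUE wherever a new bout starts
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [13] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
cumsum(c(1, diff(df$secs) >= 3))  # running bout id
## [1] 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4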
I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into some function that generates a value x which is then put into the output data frame.
cluster year lambda v e x
1 1 1 -0.12160997 -0.31105287 -0.253391178 15
2 1 2 -0.12160997 -1.06313732 -0.300349972 10
3 1 3 -0.12160997 -0.06704185 0.754397069 40
4 2 1 -0.07378295 -0.31105287 -1.331764904 4
5 2 2 -0.07378295 -1.06313732 0.279413039 19
6 2 3 -0.07378295 -0.06704185 -0.004581941 23
7 3 1 -0.02809310 -0.31105287 0.239647063 28
8 3 2 -0.02809310 -1.06313732 1.284568047 38
9 3 3 -0.02809310 -0.06704185 -0.294881283 18
10 4 1 0.33479251 -0.31105287 -0.480496125 15
11 4 2 0.33479251 -1.06313732 -0.380251626 12
12 4 3 0.33479251 -0.06704185 -0.078851036 34
13 5 1 0.27953088 -0.31105287 1.435456851 100
14 5 2 0.27953088 -1.06313732 -0.795435607 0
15 5 3 0.27953088 -0.06704185 -0.166848530 0
16 6 1 0.29409366 -0.31105287 0.126647655 44
17 6 2 0.29409366 -1.06313732 0.162961658 18
18 6 3 0.29409366 -0.06704185 -0.812316265 13
To aggregate, I then add up the x values for cluster 1 across all three years with seroconv.cluster1 <- sum(data.all[1:3, 6]) and repeat for each cluster.
Every time I change the number of clusters, I currently have to change these additions by hand. I would like to be able to set, say, n.vec <- seq(6, 12, by = 2), feed n.vec into the functions that generate x, and have R add up the x values for each cluster automatically as the number of clusters changes: first 6 clusters with their per-cluster sums, then 8, and so on.
It seems you are asking for an easy way to split your data up, apply a function (sum in this case), and then combine it all back together. Split-apply-combine is a common data strategy, and R has several implementations, the most popular being ave in base R, the dplyr package, and the data.table package.
Here's an example for your data using dplyr:
library(dplyr)
df %>% group_by(cluster) %>% summarise(x = sum(x))
(Grouping by both cluster and year would leave every row in a group of its own here, since each cluster-year combination occurs exactly once, so the sums have to be taken per cluster only.)
To get the sum of x for each cluster as a vector, you can use tapply:
tapply(df$x, df$cluster, sum)
# 1 2 3 4 5 6
# 65 46 84 61 100 75
If you instead wanted to output as a data frame, you could use aggregate:
aggregate(x ~ cluster, data = df, FUN = sum)
# cluster x
# 1 1 65
# 2 2 46
# 3 3 84
# 4 4 61
# 5 5 100
# 6 6 75
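All three approaches adapt to however many clusters are present, so nothing needs to be edited by hand when the cluster count changes. A minimal sketch of the loop you describe, where simulate_clusters() is a hypothetical stand-in for whatever code generates your x values for a given number of clusters:
n.vec <- seq(6, 12, by = 2)
sums.by.n <- lapply(n.vec, function(k) {
  d <- simulate_clusters(k)    # hypothetical: returns a data frame with cluster and x columns
  tapply(d$x, d$cluster, sum)  # one sum per cluster, however many there are
})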
I am trying to create a table of random entries from a central hypergeometric distribution where the column and row totals are fixed.
However, I can get the column sums to be fixed and equal, but not the row sums. I have read other answers, but none seem to talk specifically about how to do it; my R knowledge is pretty basic and I could do with some help or a pointer in the right direction.
To get the values from a central hypergeometric distribution I am using the BiasedUrn package.
For example:
N <- 50
rand <- 10
K <- 5  # number of row categories (5 rows in the output below)
n1 <- 25
odds0 <- rep(1, K)
m0 <- rep(N/K, K)
library(BiasedUrn)
i <- as.table(rMFNCHypergeo(nran = rand, n = n1, m = m0, odds = odds0))
addmargins(i)
A B C D E F G H I J Sum
A 5 3 5 7 5 5 6 6 5 5 52
B 8 7 4 5 5 6 3 4 5 4 51
C 3 6 4 4 4 5 6 8 5 4 49
D 4 4 6 3 6 4 5 3 3 5 43
E 5 5 6 6 5 5 5 4 7 7 55
Sum 25 25 25 25 25 25 25 25 25 25 250
Where I'm looking to keep all the column sums equal to 25, and all the row sums equal to another number which I can choose such as 50.
Are you looking for the r2dtable function from base R?
set.seed(101)
tt <- r2dtable(n = 1, r = rep(50, 3), c = rep(25, 6))
addmargins(as.table(tt[[1]]))
## A B C D E F Sum
## A 7 9 7 11 9 7 50
## B 10 7 10 6 7 10 50
## C 8 9 8 8 9 8 50
## Sum 25 25 25 25 25 25 150
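r2dtable draws tables conditional on both margins (Patefield's algorithm), which corresponds to the central case, i.e. all odds equal to 1. For the dimensions in your example (10 columns summing to 25, 5 rows summing to 50, and rand = 10 draws as in your snippet), the call would be:
tt <- r2dtable(n = rand, r = rep(50, 5), c = rep(25, 10))
addmargins(as.table(tt[[1]]))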
I would like to import the data into R as intervals, then count all the numbers falling within these intervals and draw a histogram from these counts.
Example:
start end freq
1 8 3
5 10 2
7 11 5
.
.
.
Result:
number freq
1 3
2 3
3 3
4 3
5 5
6 5
7 10
8 10
9 7
10 7
11 5
Any suggestions? Thank you very much!
Assuming your data is in df, you can create a data set that has each number in the range repeated freq times. Once you have that, it's trivial to use R's summarizing functions. This is a little roundabout, but a lot easier than explicitly computing the sum of the overlaps (though that isn't hard either; see the sketch at the end).
# For each row, repeat the numbers start:end freq times, then flatten into one vector
dat <- unlist(apply(df, 1, function(x) rep(x[[1]]:x[[2]], x[[3]])))
# Histogram with one bin per integer
hist(dat, breaks = 0:max(df$end))
You can also use table(dat) to get the counts directly:
dat
1 2 3 4 5 6 7 8 9 10 11
3 3 3 3 5 5 10 10 7 7 5
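For completeness, here is the direct computation mentioned above, without expanding the data (a minimal sketch, assuming df holds the start/end/freq columns from the question):
# For each number in the overall range, sum freq over the intervals covering it
number <- min(df$start):max(df$end)
freq <- sapply(number, function(i) sum(df$freq[df$start <= i & df$end >= i]))
data.frame(number, freq)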