Create vector of the cumulative counts per level - r

I have lots of little vectors like this:
id<-c(2,2,5,2,1,9,4,4,3,9,5,5)
and I want to create another vector of the same length that has the cumulative number of occurrences of each id thus far. For the above example data this would look like:
> id.count
[1] 1 2 1 3 1 1 1 2 1 2 2 3
I cannot find a function that does this easily, maybe because I do not really know how to articulate what I actually want (hence the slightly awkward question title). Any suggestions?

Here is another way:
ave(id,id,FUN=seq_along)
Gives:
[1] 1 2 1 3 1 1 1 2 1 2 2 3

> sapply(1:length(id), function(i) sum(id[1:i] == id[i]))
[1] 1 2 1 3 1 1 1 2 1 2 2 3
Or if you have NAs in id you should use id[1:i] %in% id[i] instead.
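The NA caveat can be checked directly. The following is a small sketch, not part of the original answers; the vector `id_na` and the helper names `count_eq` and `count_in` are made up for illustration.

```r
# Hypothetical example vector containing missing values
id_na <- c(2, NA, 2, NA, 5)

# Running count using ==: any comparison involving NA yields NA,
# and sum() then returns NA for that position
count_eq <- function(id) {
  sapply(seq_along(id), function(i) sum(id[seq_len(i)] == id[i]))
}

# Running count using %in%: NA is matched like any other value
count_in <- function(id) {
  sapply(seq_along(id), function(i) sum(id[seq_len(i)] %in% id[i]))
}

res_eq <- count_eq(id_na)
res_in <- count_in(id_na)
print(res_eq)  # only the first element is not NA
print(res_in)  # 1 1 2 2 1
```

With `%in%`, the NAs are counted as a level of their own, which may or may not be what you want.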

Related

Reorder (collate) vector elements automatically

It's an easy one, but I can't find a simple solution for my problem. I have several vectors that look like this one: rep(1:3, each = 3), and I want to convert them to look like rep(1:3, times = 3).
So each element is repeated multiple times, as in c(1,1,1,2,2,2,3,3,3), and I want to reorder them to c(1,2,3,1,2,3,1,2,3). How can I achieve that?
You can use a matrix transpose:
as.vector(t(matrix(x, nrow = 3)))
# [1] 1 2 3 1 2 3 1 2 3
v1 <- c(1,1,1,2,2,2,3,3,3)
o1 <- rle(v1)
rep(o1$values, min(o1$lengths))
[1] 1 2 3 1 2 3 1 2 3
This works for any number of distinct numbers or strings, but expects each value to occur the same number of times. It offers only limited flexibility when some values occur more often than others.
Consider:
v2 <- c(1,1,1,2,2,2,3,3,3,3)
o2 <- rle(v2)
rep(o2$values, min(o2$lengths))
[1] 1 2 3 1 2 3 1 2 3
rep(o2$values, max(o2$lengths))
[1] 1 2 3 1 2 3 1 2 3 1 2 3
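If you would rather keep every element when the counts are unequal, one possible alternative (a sketch, not taken from the answers above) is to number each element within its own value and then order by that number. The names `occ` and `collated` are only illustrative.

```r
v2 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Occurrence number of each element within its own value:
# 1 2 3 1 2 3 1 2 3 4
occ <- ave(seq_along(v2), v2, FUN = seq_along)

# order() leaves ties in their original order, so all first occurrences
# come first, then all second occurrences, and so on
collated <- v2[order(occ)]
print(collated)  # 1 2 3 1 2 3 1 2 3 3
```

Nothing is dropped or padded here: the surplus 3 simply ends up at the end.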

Generating Sequences with Irregular Patterns in R

I'm attempting to generate a column that shows persistence throughout a field. The field is sequential and numeric, but not conventionally increasing. Essentially, it goes up by 7 (when it ends in 2) and then 3 (when it ends in 9) by each ID. It's possible for an ID to miss one or more of the sequence, but then return to the same pattern. The data looks like this:
ID Col
1 0769
1 0772
1 0779
1 0782
1 0799
1 0802
1 0812
2 0769
2 0772
2 0779
3 0782
3 0799
3 0802
3 0812
What I'm trying to do is generate this:
ID Col Persistence
1 0769 1
1 0772 1
1 0779 1
1 0782 1
1 0799 2
1 0802 2
1 0812 3
2 0769 1
2 0772 1
2 0779 1
3 0782 1
3 0799 2
3 0802 2
3 0812 3
If you just want to make sure the jump is either 3 or 7, you can write a helper function that increments whenever a jump of a different size occurs:
jumpchange <- function(x) c(0, cumsum(!diff(x) %in% c(3, 7))) + 1
Then you can apply this to each group, most easily with dplyr:
library(dplyr)
dd %>% group_by(ID) %>%
  mutate(persistence = jumpchange(Col))
Or you can use transform/ave with just base R:
transform(dd, persistence = ave(Col, ID, FUN = jumpchange))
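To see what the helper does, here is a self-contained sketch using the data from the question. It assumes Col is stored as a numeric column (so the leading zeros shown in the question are not kept); `x` and `dd` as built here are just for illustration.

```r
# Data from the question, with Col as numeric (an assumption)
dd <- data.frame(
  ID  = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Col = c(769, 772, 779, 782, 799, 802, 812,
          769, 772, 779, 782, 799, 802, 812)
)

jumpchange <- function(x) c(0, cumsum(!diff(x) %in% c(3, 7))) + 1

# Step by step for ID 1
x <- dd$Col[dd$ID == 1]
print(diff(x))                 # 3 7 3 17 3 10
print(!diff(x) %in% c(3, 7))   # FALSE FALSE FALSE TRUE FALSE TRUE

# Base R version applied per ID
dd$persistence <- ave(dd$Col, dd$ID, FUN = jumpchange)
print(dd)
```

Each TRUE marks a break in the pattern, and cumsum turns those breaks into a running group number.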

Expand.grid with unknown number of columns

I have the following data frame:
map_value LDGroup ComboNum
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 3 1
1 3 2
I want to find all combinations, selecting one from each LD group. expand.grid seems to work for this:
expand.grid(df[df$LDGroup==1,3],df[df$LDGroup==2,3],df[df$LDGroup==3,3])
My problem is that I have about 500 map_values I need to do this for and I do not know what number of LDGroups will exist for each map_value. Is there a way to dynamically provide the function arguments?
We can split the third column (ComboNum) by LDGroup and pass the resulting list to expand.grid:
out <- expand.grid(split(df$ComboNum, df$LDGroup))
names(out) <- paste0("Var", names(out))
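To cover many map_values at once, the same idea can be wrapped in a function and applied per map_value. This is only a sketch: the helper name `combos_for` and the rows for map_value 2 are invented for illustration, and LDGroup is assumed to be numeric (with a factor, unused levels would produce empty groups).

```r
df <- data.frame(
  map_value = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  LDGroup   = c(1, 1, 1, 2, 2, 3, 3, 1, 1, 2),
  ComboNum  = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 1)
)

# All combinations for one map_value, one ComboNum per LDGroup
combos_for <- function(d) {
  out <- expand.grid(split(d$ComboNum, d$LDGroup))
  names(out) <- paste0("Var", names(out))
  out
}

# One data frame of combinations per map_value
all_combos <- lapply(split(df, df$map_value), combos_for)
print(sapply(all_combos, nrow))
```

The result is a list with one element per map_value, each with as many columns as that map_value has LD groups.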

Using factors in R programming

If I have the code:
x <- c(rnorm(10),runif(10), rnorm(10,1))
f <- gl(3,10)
f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x,f,mean)
1 2 3
0.07368817 0.42992416 0.64212383
How are the 1,2,3's decided? I am assuming they are levels of something.
Furthermore, why is f used as the second argument? I don't see why it is an index, and how does it know when to stop running through the index?
I tried looking up the function definition, but to no avail.
If you are asking about how tapply works (rather than gl), consider another, simpler example:
> x1 <- c(1,1,2,2,3,3)
> tapply(x1, x1, mean)
1 2 3
1 2 3
> f2 <- c(2,2,2,2,3,3)
> tapply(x1, f2, mean)
2 3
1.5 3.0
In the first case, tapply groups the first two items (both labelled 1) and finds their mean, giving 1 for group 1; the next two items (2 and 2) have mean 2, and so on.
In the second case, the first four items are labelled 2 and have mean (1+1+2+2)/4 = 1.5, and the last two are labelled 3 and have mean (3+3)/2 = 3.
In effect, the "index" labels the data, and the requested function is applied to each "group".
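One way to picture this grouping (a sketch for illustration, not a description of how tapply is implemented internally) is to split the data by the labels and apply the function to each piece. The names `groups`, `by_split` and `by_tapply` are made up here.

```r
x1 <- c(1, 1, 2, 2, 3, 3)
f2 <- c(2, 2, 2, 2, 3, 3)

# split() collects the values of x1 under each label in f2
groups <- split(x1, f2)
print(groups)   # group "2": 1 1 2 2, group "3": 3 3

# Applying mean to each group gives the same numbers as tapply
by_split  <- sapply(groups, mean)
by_tapply <- tapply(x1, f2, mean)
print(by_split)
print(by_tapply)
```

The labels only need to be as long as the data; there is no separate stopping condition, because every element of x1 is assigned to exactly one group.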

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to the repeated value of the variable s, while also taking into account the id associated with the row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried using both duplicated() and which() to feed subset() afterwards, but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's run on from one id into the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
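Note that duplicated(dat) compares whole rows, so it works here only because id and s are the only columns. If your real data has more columns, a variant along these lines may be closer to what you need; the extra column `val` is hypothetical and only there to make every row unique.

```r
dat <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  s  = c(2, 2, 1, 3, 3, 3, 3, 3, 2, 2)
)

# Hypothetical extra column: with it, duplicated(dat) is FALSE everywhere
dat$val <- seq_len(nrow(dat))

# Restrict the duplicate check to the key columns id and s.
# Keep every row with s != 3, plus the first (id, s) occurrence otherwise.
keep <- dat$s != 3 | !duplicated(dat[c("id", "s")])
result <- dat[keep, ]
print(result)
```

The condition `s != 3 | s == 3 & !dup` simplifies to `s != 3 | !dup`, which is what the sketch uses.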
