R - How to create a variable based on another variable

I have:
v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
and I want to create v2, which numbers each set of 3 elements within every run of equal values in v1:
v2 <- c(1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
Explanation:
For the first three times a number is repeated the value corresponding to that number is a 1, for the second three times it's a 2, and so on.

v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
Use rle to find the run lengths:
l <- rle(v1)$lengths
#[1] 3 3 9 6
Create a sequence 1:n for each run length n:
s <- sequence(l)
#[1] 1 2 3 1 2 3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6
Use integer division:
(s - 1) %/% 3 + 1
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2
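The three steps above can be combined into one small helper. This is only a sketch: the function name block_in_run and its size argument are made up here for illustration and are not part of the original answer.

```r
# Hypothetical helper: number consecutive blocks of `size` elements
# within each run of equal values in `v`.
block_in_run <- function(v, size = 3) {
  run_lengths <- rle(v)$lengths         # length of each run of equal values
  pos_in_run  <- sequence(run_lengths)  # position 1..n inside each run
  (pos_in_run - 1) %/% size + 1         # block index inside the run
}

v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
v2 <- block_in_run(v1)
v2
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2
```

If a run length is not a multiple of size, the last block of that run is simply shorter.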

Related

Create sequence based on a condition

How can I conditionally increment a counter when the previous value is greater than the current value? Say I have a column x in my data frame and I want a column y which starts at 1 and increments whenever the previous value is greater than the current one.
x y
1 1
2 1
3 1
4 1
5 1
6 1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
1 3
2 3
5 3
As @A5C1D2H2I1M1N2O1R2T1 mentioned, you can use cumsum with diff to generate y.
cumsum(diff(x) < 0) + 1
#[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3
diff returns one element fewer than x, so prepend 1 to get a y with the same length as x.
c(1, cumsum(diff(x) < 0) + 1)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3
data
x <- c(1:6, 1:8, 1, 2, 5)
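Since the question asks for a data frame column, here is a minimal sketch of the same idea applied to one. The data frame name df is an assumption; the question does not name it.

```r
# Assumed data frame holding the x column from the question
df <- data.frame(x = c(1:6, 1:8, 1, 2, 5))

# A new group starts wherever x drops below the previous value
df$y <- c(1, cumsum(diff(df$x) < 0) + 1)

# Equivalent form: mark the first row as a group start, then count starts
y_alt <- cumsum(c(TRUE, diff(df$x) < 0))

df$y
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3
```

Both forms give the same grouping; the second avoids the separate prepend step.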

rep and/or seq function to create continuously reducing vector?

Suppose I have a vector from 1 to 5,
a<-c(1:5)
What I need to do is repeat the vector, dropping the last element on each repetition. That is, the final outcome should look like
1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
We can reverse the vector and apply sequence
sequence(rev(a))
#[1] 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
Or another option is toeplitz
m1 <- toeplitz(a)
m1[lower.tri(m1, diag=TRUE)]
#[1] 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
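sequence(rev(a)) works here because the values of a are 1:5, the same as their positions. If the vector held arbitrary values (an assumption that goes beyond the original question), one possible sketch is to build the index pattern first and then subset:

```r
# Example vector with values that differ from their positions (assumed data)
a <- c(10, 20, 30, 40)

# Index pattern 1..n, 1..(n-1), ..., 1
idx <- sequence(rev(seq_along(a)))
idx
#[1] 1 2 3 4 1 2 3 1 2 1

res <- a[idx]
res
#[1] 10 20 30 40 10 20 30 10 20 10
```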

Select rows of data frame based on a vector with duplicated values

What I want can be described as follows: given a data frame that contains all the case-control pairs, resample whole pairs. In the following example, y is the id of the case-control pair, and there are 3 pairs in my data set. I'm resampling with respect to the values of y (both members of a pair are selected, or neither is).
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
select_y
> select_y
[1] 1 3 3
Now, I have computed a vector containing the pairs I want to resample, which is select_y above. It means that case-control pair number 1 will be in my new sample, and pair number 3 will also be in it, but twice, since 3 appears two times in select_y. The desired output is:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find an efficient way other than writing a for loop...
Solution:
Based on @HubertL's answer, with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
sub_sample_df
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
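The split-based method above is shown without its result. As a sketch, it can be wrapped in a small function and checked against the desired output; the name resample_pairs and the id_col argument are hypothetical and only used for illustration.

```r
# Hypothetical helper: pick all rows of each requested pair id,
# repeating a pair as often as its id appears in `ids`.
resample_pairs <- function(df, ids, id_col = "y") {
  rows_by_id <- split(seq_len(nrow(df)), df[[id_col]])  # row numbers per pair id
  df[unlist(rows_by_id[as.character(ids)], use.names = FALSE), ]
}

sample_df <- data.frame(x = 1:6, y = c(1,1,2,2,3,3))
res <- resample_pairs(sample_df, c(1, 3, 3))
res$x
#[1] 1 2 5 6 5 6
```

The rows come out in the order of the requested ids, so pair 3 appears as two consecutive copies.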
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
        lapply(1:max(sel_y$Freq),
               function(freq) sample_df[sample_df$y %in%
                                          sel_y[sel_y$Freq >= freq, "select_y"], ]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

columnwise sum matching values to another column

Seems, I am missing some link here.
I have a data frame
df<-data.frame(w=sample(1:3,10, replace=T), x=sample(1:3,10, replace=T), y=sample(1:3,10, replace=T), z=sample(1:3,10, replace=T))
> df
w x y z
1 3 1 1 3
2 2 1 1 3
3 1 3 2 2
4 3 1 3 1
5 2 2 1 1
6 1 2 2 3
7 1 2 2 2
8 2 2 2 3
9 1 3 3 3
10 2 2 1 1
For each column, I want to count the rows whose value matches the 1st column.
sum(df$w==df$x)
[1] 3
sum(df$w==df$y)
[1] 2
sum(df$w==df$z)
[1] 1
I know that with apply I can do row-wise or column-wise operations.
apply(df,2,length)
w x y z
10 10 10 10
How do I combine these two functions?
Try colSums
colSums(df[-1] == df[, 1])
# x y z
# 3 2 1
Or, if you prefer the *apply family, you could try
vapply(df[-1], function(x) sum(x == df[, 1]), double(1))
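The question builds df with sample() and no seed, so the printed data cannot be regenerated. As a sketch for checking the answers, the printed values can be typed in directly:

```r
# The data frame exactly as printed in the question
df <- data.frame(
  w = c(3, 2, 1, 3, 2, 1, 1, 2, 1, 2),
  x = c(1, 1, 3, 1, 2, 2, 2, 2, 3, 2),
  y = c(1, 1, 2, 3, 1, 2, 2, 2, 3, 1),
  z = c(3, 3, 2, 1, 1, 3, 2, 3, 3, 1)
)

# Compare every remaining column with the first one, then count the TRUEs
matches <- colSums(df[-1] == df[, 1])
matches
# x y z
# 3 2 1

# Same counts via vapply
matches_vapply <- vapply(df[-1], function(col) sum(col == df[, 1]), double(1))
```

The comparison df[-1] == df[, 1] yields a logical matrix, which is why colSums can count the matches per column in a single call.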

In R: How to create a vector of lagged differences but keep the original value for negative differences without using loops

I have a vector in R of the form:
> a <- c(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
> a
[1] 1 3 5 7 9 11 1 3 5 7 9 11 1 3 5 7 9 11
I can take the lagged differences like this:
b <- diff(a)
> b
[1] 2 2 2 2 2 -10 2 2 2 2 2 -10 2 2 2 2 2
But I would like the negative differences to be replaced by the original values in the vector a. Or, in this case the -10's to be replaced by the 1's.
Is there a way to do this without looping through the vectors?
Thanks
One possible way:
indices<-which(b<0)
b[indices]<-a[indices+1]
One approach using replacement:
d <- diff(a)
d_neg <- d < 0
d[d_neg] <- a[-1][d_neg]
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
One approach using ifelse:
d <- diff(a)
ifelse(d < 0, a[-1], d)
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
One approach using mathematics and pmax:
d <- diff(a)
(d < 0) * a[-1] + pmax(d, 0)
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
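The approaches above all index into a shifted by one position, so they should agree. A short sketch that checks this on the vector from the question (the variable names by_index, by_ifelse and by_pmax are only for illustration):

```r
a <- c(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
d <- diff(a)

# Index-based replacement: d[i] compares a[i + 1] with a[i]
by_index <- d
neg <- which(d < 0)
by_index[neg] <- a[neg + 1]

# ifelse and arithmetic variants
by_ifelse <- ifelse(d < 0, a[-1], d)
by_pmax   <- (d < 0) * a[-1] + pmax(d, 0)

all_same <- all(by_index == by_ifelse) && all(by_index == by_pmax)
all_same
#[1] TRUE
```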
