How to use lag in R as in Stata

I am trying to recreate a Stata code snippet in R and I have hit a snag.
In Stata, the lag function gives this result when applied:
A B
1 2
1 2
1 2
1 2
replace A=B if A==A[_n-1]
A B
1 2
2 2
1 2
2 2
If I try to replicate in R I get the following:
temp <- data.frame("A" = rep(1,4), "B" = rep(2,4))
temp
A B
1 2
1 2
1 2
1 2
temp <- temp %>% mutate(A = ifelse(A==lag(A,1),B,A))
temp
A B
2 2
2 2
2 2
2 2
I need it to be the same as in Stata.

lag cannot be used here because it works on the original values in A, whereas at each step the question needs the most recently updated value of the previous row.
Define an Update function and apply it using accumulate2 in the purrr package. accumulate2 returns a list, so unlist the result.
library(purrr)
Update <- function(prev, A, B) if (A == prev) B else A
transform(temp, A = unlist(accumulate2(A, B[-1], Update)))
giving:
A B
1 1 2
2 2 2
3 1 2
4 2 2
Another way to write this uses fn$ in gsubfn which causes formula arguments to be interpreted as functions. The function that it builds uses the free variables in the formula as the arguments in the order encountered.
library(gsubfn)
library(purrr)
transform(temp, A = unlist(fn$accumulate2(A, B[-1], ~ if (prev == A) B else A)))
Also note the comments below this answer for another variation.
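For readers who prefer to stay in base R, the same running update can be sketched with Reduce. This is not part of the answer above; it is a minimal variation that assumes A and B are plain vectors holding the question's data, and step and new_A are names chosen here for illustration.

```r
# Same data as in the question
A <- rep(1, 4)
B <- rep(2, 4)

# step() receives the previously *updated* value (prev) and a row index i
step <- function(prev, i) if (A[i] == prev) B[i] else A[i]

# accumulate = TRUE keeps every intermediate result; init seeds the first row
new_A <- Reduce(step, seq_along(A)[-1], accumulate = TRUE, init = A[1])
new_A
# [1] 1 2 1 2
```

Because the accumulator carries the updated value forward, row 3 is compared against the 2 written to row 2 rather than against the original 1.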

Looks like we need to update after each run
for(i in 2:nrow(temp)) temp$A[i] <- if(temp$A[i] == temp$A[i-1])
temp$B[i] else temp$A[i]
temp
# A B
#1 1 2
#2 2 2
#3 1 2
#4 2 2
Or as @G.Grothendieck mentioned in the comments, it can be made more compact with
for(i in 2:nrow(temp)) if (temp$A[i] == temp$A[i-1]) temp$A[i] <- temp$B[i]

Here's a function that will do it:
lagger <- function(x,y){
current = x[1]
out = x
for(i in 2:length(x)){
if(x[i] == current){
out[i] = y[i]
}
current = out[i]
}
out
}
lagger(temp$A, temp$B)
[1] 1 2 1 2

Related

Writing a function in R

I am doing an exercise to practice writing functions.
I'm trying to figure out the general code before writing the function that reproduces the output from the table function. So far, I have the following:
set.seed(111)
vec <- as.integer(runif(10, 5, 20))
x <- sort(unique(vec))
for (i in x) {
c <- length(x[i] == vec[i])
print(c)
}
But this gives me the following output:
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
I don't think I'm subsetting correctly in my loop. I've been watching videos, but I'm not quite sure where I'm going wrong. Would appreciate any insight!
Thanks!
We can sum the logical vector and concatenate the result to count
count <- c()
for(number in x) count <- c(count, sum(vec == number))
count
#[1] 3 1 4 1 5 4 3 2 7
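In the OP's for loop, i runs over the values of 'x' and not over the sequence of its positions, so x[i] and vec[i] do not pick the intended elements. In addition, length() of a single comparison is always 1, which is why every iteration prints 1.
Counting the matching elements with
for(number in x) count <- c(count, length(vec[vec == number]))
works as well.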
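The difference between looping over values and looping over positions can be sketched as follows. The input vector here is a small hypothetical one (not the OP's random sample) so that the counts are easy to check by hand; counts is a name chosen for illustration.

```r
# Hypothetical input, chosen so the counts are easy to verify by eye
vec <- c(7L, 5L, 7L, 9L, 5L, 7L)
x <- sort(unique(vec))            # 5 7 9

# Loop over positions of x and preallocate the result
counts <- integer(length(x))
for (i in seq_along(x)) {
  counts[i] <- sum(vec == x[i])   # TRUE counts as 1, FALSE as 0
}
names(counts) <- x
counts
# 5 7 9
# 2 3 1
```

Preallocating with integer(length(x)) also avoids growing the vector with c() on every iteration.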
You can try sapply + setNames to achieve the same result as table, i.e.,
count <- sapply(x, function(k) setNames(sum(k==vec),k))
or
count <- sapply(x, function(k) setNames(length(na.omit(match(vec,k))),k))
such that
> count
1 2 3 4 5 6 7 8 9
3 1 4 1 5 4 3 2 7
Here is a solution without using unique and with one pass through the vector (if only R was fast with for loops!):
count = list()
for (i in vec) {
val = as.character(i)
if (is.null(count[[val]]))
count[[val]] = 1
else
count[[val]] = count[[val]] + 1
}
unlist(count)

How to skip iterations in a for loop using the next function in R [duplicate]

len1 <- sample(1:2,100,replace=TRUE)
df <- data.frame(col1= c(1:200),col2= c(1:200))
for (i in 1:length(len1)) {
if (len1[i]==1) {
df$col1[i] <- len1[i] }
else if (len1[i]==2) {
df$col1[i] <- len1[i]
df$col1[i+1] <- 2
next
}
}
Every time a "2" occurs in the len1 vector, I would like to add another 2 in the following row and skip the next iteration (i+1). Basically, I want to jump ahead to (i+1) every time a "2" occurs in len1.
The desired final table will be longer than the len1 sample, it should be equal to sum(len1).
I would like it to look something like this: where every 2 is followed by an additional 2.
> df
col1 col2
1 2 1
2 2 2
3 1 3
4 2 4
5 2 5
6 1 6
Any suggestions? Thanks!
This question is difficult to understand. One important piece of information is disclosed in a comment by the OP:
The desired final table will be longer than the len1 sample, it should be equal to sum(len1)
If I understand correctly, the OP wants to copy a 1 to the output vector if the input is 1, and to copy two consecutive 2s to the output vector if the input is 2.
If my understanding is correct, then this is what
rep(len1, times = len1)
does.
So, with a proper reproducible example
n_row <- 10L
set.seed(2L)
len1 <- sample(1:2, n_row , replace = TRUE)
len1
[1] 1 2 2 1 2 2 1 2 1 2
rep(len1, times = len1)
returns
[1] 1 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2
Of course, sum(len1) == length(rep(len1, times = len1)) is TRUE.
Thus, the data.frame can be created by
data.frame(col1 = rep(len1, times = len1), col2 = seq_len(sum(len1)))
Fixing the for loop (not recommended)
If my understanding of the OP's intentions is correct, the OP's for loop can be fixed by introducing a separate counter j for the output vector:
df <- data.frame(col1 = seq_len(sum(len1)), col2 = seq_len(sum(len1)))
j <- 1L
for (i in 1:length(len1)) {
if (len1[i] == 1L) {
df$col1[j] <- len1[i]
j <- j + 1L
}
else if (len1[i] == 2L) {
df$col1[j] <- len1[i]
df$col1[j + 1L] <- 2L
j <- j + 2L
}
}
df
col1 col2
1 1 1
2 2 2
3 2 3
4 2 4
5 2 5
6 1 6
7 2 7
8 2 8
9 2 9
10 2 10
11 1 11
12 2 12
13 2 13
14 1 14
15 2 15
16 2 16
Use a skip flag that is reset to FALSE as soon as it has been consumed.
skip <- FALSE
for (i in 1:length(len1)) {
if (skip) {
skip <- FALSE
next
}
if (len1[i]==1) {
df$col1[i] <- len1[i]
} else if (len1[i]==2) {
df$col1[i] <- len1[i]
df$col1[i+1] <- 2
skip <- TRUE
}
}
Put differently, you want col1 to be 2 if the previous len1 is 2, and len1 otherwise.
We can create a lagged vector with c(1, len1[-length(len1)]) to determine whether a row should take a 2 or its own len1 as its value. A 1 is prepended because the first row has no predecessor and should take the value of len1.
You can then get the desired result with ifelse:
df$col1 <- ifelse(
c(1, len1[-length(len1)]) == 2,
2, # if prev is 2, value is 2
len1 #value of len1 otherwise
)
As pointed out in the comments, since the sample function is involved, the code gives a different result every time, so there is no point in showing the result here.
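With a fixed input instead of sample, the lagged comparison can be traced by hand. The vector below is a hypothetical example, and prev and col1 are standalone names used only for this sketch.

```r
# Hypothetical fixed input in place of sample()
len1 <- c(1, 2, 1, 1, 2, 2)

# Shift len1 one position to the right; the first row gets a dummy 1
prev <- c(1, len1[-length(len1)])
prev
# [1] 1 1 2 1 1 2

# A row becomes 2 when the previous len1 was 2, otherwise it keeps len1
col1 <- ifelse(prev == 2, 2, len1)
col1
# [1] 1 2 2 1 2 2
```

Note that the result has the same length as len1, so this reading of the question does not lengthen the table to sum(len1).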
Ok I think what you want is
i <- 1
while (i <= length(len1)) {
if (len1[i]==1) {
df$col1[i] <- len1[i] }
else if (len1[i]==2) {
df$col1[i] <- len1[i]
df$col1[i+1] <- 2
i = i + 1
}
i = i + 1
}
In your question you state that you want the next iteration to be skipped, but next only ends the current iteration and moves on to the following one. A while loop gives you more control over the increment: in this case we add 1 or 2 to the counter based on the value of len1[i], which means we can effectively skip ahead.

R - Where element doesn't equal next element in vector

With a vector, say:
a <- c(1,1,1, 2,2,2,2, 1,1, 3,3,3,3)
I need to know
at which index values the element values changed
the element value before/after change
Using the above example, the output would look something like:
i before after
3 1 2
7 2 1
9 1 3
I suppose I can convert the values into a data.frame and lag/shift the column but am wondering if there is a better approach. I am also trying to avoid looping over the vector, if possible.
I also don't need the results in data.frame / spreadsheet format; feel free to propose a different output format.
Here is an option
idx <- which(a != c(a[-1], NA))
data.frame(
i = idx,
before = a[idx],
after = a[idx + 1])
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
You can roll everything into a function
f <- function(x) {
idx <- which(x != c(x[-1], NA))
data.frame(
i = idx,
before = x[idx],
after = x[idx + 1])
}
f(a)
giving the same output as above.
with(rle(a), data.frame(i = head(cumsum(lengths), -1),
before = head(values, -1),
after = values[-1]))
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
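For a numeric vector, a third variation is to look for non-zero first differences with diff. This is only a sketch alongside the two answers above, not a replacement for them; idx and res are names chosen here, and the approach assumes the vector contains no NA values.

```r
a <- c(1,1,1, 2,2,2,2, 1,1, 3,3,3,3)

# diff(a)[k] is a[k + 1] - a[k], so a non-zero entry marks a change after position k
idx <- which(diff(a) != 0)

res <- data.frame(i = idx, before = a[idx], after = a[idx + 1])
res
#   i before after
# 1 3      1     2
# 2 7      2     1
# 3 9      1     3
```

Unlike the comparison with c(a[-1], NA), this only works for numeric input; for character or factor vectors the rle version is the safer choice.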

R ddply, applying if and ifelse functions

I'm trying to apply a function to a dataframe using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf<- data.frame(c(12,34,9,3,22,55),c(1,2,1,1,2,2)
, c(0,1,2,1,1,2))
colnames(mydf)[1] <- 'n'
colnames(mydf)[2] <- 'x'
colnames(mydf)[3] <- 'x1'
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame. Specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version. (If you must, but the options I displayed above are better.) So we might re-write your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've created some confusion by using x as both the argument to k and the name of one of your columns.
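The split-apply-combine behaviour described above can be imitated in base R, which may make it easier to see what the function receives. This is only a rough sketch of the idea, not how plyr is implemented; add_z, piece, pieces and result are names chosen here, and the argument is called piece to avoid the clash with the column x.

```r
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))

# The function receives one sub-data-frame at a time and returns a modified copy
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)
  piece
}

pieces <- split(mydf, mydf$x)                   # one data frame per value of x
result <- do.call(rbind, lapply(pieces, add_z)) # apply, then stack the pieces
result$z
# [1]  0  0  0 34 22 55
```

Each element of pieces has only the rows of one group, which is why assigning a full-length column of mydf inside the function fails with "replacement has 3 rows, data has 6".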

R: How can I sum across variables, within cases, while counting NA as zero

Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=c(2,2,2,2,NA),
c=c(NA,2,3,4,5))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
a b c count
1 1 2 NA 2
2 2 2 2 1
3 3 2 3 2
4 4 2 4 1
5 5 NA 5 0
but this general approach should work. For instance, your second example would be something like this:
df$count <- apply(df,1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df,1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0
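A further variation, sketched here under the assumption that df holds only the three original columns (no count column yet), compares the data frame directly and lets rowSums drop the NA comparisons. count_eq2, count_target and the named target vector are illustrative names, not part of the answers above.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(2, 2, 2, 2, NA),
                 c = c(NA, 2, 3, 4, 5))

# Single target value: df == 2 is a logical matrix, NA where the data is NA
count_eq2 <- rowSums(df == 2, na.rm = TRUE)

# One target per column: compare column by column, then sum across each row
target <- c(a = 1, b = 2, c = 3)
count_target <- rowSums(mapply(`==`, df, target), na.rm = TRUE)

unname(count_eq2)
# [1] 1 3 1 1 0
count_target
# [1] 2 1 2 1 0
```

With na.rm = TRUE an NA comparison contributes nothing to the row total, which is the same as counting it as zero.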
