recursive replacement in R

I am trying to clean some data and would like to replace zeros with the value from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
Which I was hoping for.
Is there a nice way of doing this without looping?

The operation is called "Last Observation Carried Forward" and is usually used to fill data gaps. It's a common operation for time series and is therefore implemented in the zoo package:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
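For reference, the one-shot assignment in the question fails because the right-hand side temp[which(temp==0)-1] is read before anything is replaced, so the second zero just copies the first zero. If adding a package is not an option, the same fill can be sketched in base R. This is only an illustration, assuming the vector contains no NA values; the names idx, keep and filled are made up for the example.

```r
temp <- c(1, 2, 4, 5, 0, 0, 6, 7)

# For each position, the index of the most recent non-zero element
# (0 where no non-zero element has been seen yet)
idx <- cummax(seq_along(temp) * (temp != 0))

filled <- temp
keep <- idx > 0                    # leading zeros have nothing to copy from
filled[keep] <- temp[idx[keep]]
filled
# [1] 1 2 4 5 5 5 6 7
```

Leading zeros are left untouched here, since there is no earlier observation to carry forward.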

You could use essentially the same logic, except that you'll want to apply it to the values vector returned by rle:
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
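One caveat with the rle approach: if the vector starts with a zero, which(o$values == 0) - 1 contains index 0, which R silently drops, so the replacement is shorter than the target and values get recycled into the wrong places. A guarded variant might look like the sketch below; locf_rle is a hypothetical helper name, not part of any package.

```r
locf_rle <- function(x) {
  o <- rle(x)
  z <- which(o$values == 0)
  z <- z[z > 1]                    # a leading run of zeros has no predecessor
  # rle merges consecutive zeros into one run, so the run before it is non-zero
  o$values[z] <- o$values[z - 1]
  inverse.rle(o)
}

locf_rle(c(1, 2, 4, 5, 0, 0, 6, 0))
# [1] 1 2 4 5 5 5 6 6
locf_rle(c(0, 1, 0, 2, 0))
# [1] 0 1 1 2 2
```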

Related

Using Strings to Identify Sequence of Column Names in R

I am currently trying to use pre-defined strings in order to identify multiple column names in R.
To be more explicit, I am using the ave function to create identification variables for subgroups of a dataframe. The twist is that I want the identification variables to be flexible, in such a manner that I would just pass it as a generic string.
A sample code would be:
ids = with(df,ave(rep(1,nrow(df)),subcolumn1,subcolumn2,subcolumn3,FUN=seq_along))
I would like to run this code in the following fashion (code below does not work as expected):
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),subColumnsString ,FUN=seq_along))
I tried something with eval, but still did not work:
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),eval(parse(text=subColumnsString)),FUN=seq_along))
Any ideas?
Thanks.
EDIT: Working code example of what I want:
df = mtcars
id_names = c("vs","am")
idDF_correct = transform(df,idItem = as.numeric(interaction(vs,am)))
idDF_wrong = cbind(df,ave(rep(1,nrow(df)),df[id_names],FUN=seq_along))
Note how in idDF_correct, the unique combinations are correctly mapped into unique values of idItem. In idDF_wrong this is not the case.
I think this achieves what you requested. Here I use the mtcars dataset that ships with R:
subColumnsString <- c("cyl","gear")
ids = with(mtcars, ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along))
Just index your data.frame by the sub-columns; this returns a list (a data frame), which works naturally with ave.
EDIT
ids = ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along)
You can omit the with and just call plain ol' ave, as G. Grothendieck stated. You should also use their answer, as it is much more general.
This defines a function whose arguments are:
data, the input data frame
by, a character vector of column names in data
fun, a function to use in ave
Code--
Ave <- function(data, by, fun = seq_along) {
  do.call(function(...) ave(rep(1, nrow(data)), ..., FUN = fun), data[by])
}
# test
Ave(CO2, c("Plant", "Treatment"), seq_along)
giving:
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[39] 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6
[77] 7 1 2 3 4 5 6 7
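Note that the answers above number the rows within each group, whereas the EDIT in the question asks for one ID per unique combination of the grouping columns. A minimal sketch of that variant, assuming interaction() is acceptable here: it accepts a list of factors, so the data-frame subset selected by the character vector can be passed directly.

```r
df <- mtcars
id_names <- c("vs", "am")

# interaction() takes a list of factors, so df[id_names] can be passed as is
df$idItem <- as.numeric(interaction(df[id_names]))

# Same result as spelling the columns out by hand
identical(df$idItem, as.numeric(interaction(df$vs, df$am)))
# [1] TRUE
```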

Arguments for Subset within a function in R colon v. greater or equal to

Suppose I have the following data.
x<- c(1,2, 3,4,5,1,3,8,2)
y<- c(4,2, 5,6,7,6,7,8,9)
data<-cbind(x,y)
x y
1 1 4
2 2 2
3 3 5
4 4 6
5 5 7
6 1 6
7 3 7
8 8 8
9 2 9
Now, if I subset this data to select only the observations with "x" between 1 and 3 I can do:
s1<- subset(data, x>=1 & x<=3)
and obtain my desired output:
x y
1 1 4
2 2 2
3 3 5
4 1 6
5 3 7
6 2 9
However, if I subset using the colon operator I obtained a different result:
s2<- subset(data, x==1:3)
x y
1 1 4
2 2 2
3 3 5
This time it only includes the first observation in which "x" was 1,2, or 3. Why?
I would like to use the ":" operator because I am writing a function so the user would input a range of values from which she wants to see an average calculated over the "y" variable. I would prefer if they can use ":" operator to pass this argument to the subset function inside my function but I don't know why subsetting with ":" gives me different results.
I'd appreciate any suggestions in this regard.
You can use %in% instead of ==
subset(data, x %in% 1:3)
In general, %in% is the right tool when comparing two vectors of unequal length. With ==, the shorter vector is recycled and compared element by element, which is why only the first three rows matched. Recycling can occasionally be exploited (for example, when one vector is exactly twice as long as the other), but it can also fail. Some examples with descriptions are here.
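To make the difference concrete, here is a small sketch using the x and y vectors from the question. The wrapper mean_y_in_range is a hypothetical name for the kind of function the question describes, and it assumes the data is held in a data frame rather than the cbind matrix.

```r
x <- c(1, 2, 3, 4, 5, 1, 3, 8, 2)
y <- c(4, 2, 5, 6, 7, 6, 7, 8, 9)
dat <- data.frame(x, y)

# == recycles 1:3 to 1,2,3,1,2,3,1,2,3 and compares position by position
x == 1:3
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

# %in% tests every element of x for membership in the set 1:3
x %in% 1:3
# [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE

# Hypothetical wrapper: the user passes the range with the colon operator
mean_y_in_range <- function(data, rng) mean(data$y[data$x %in% rng])
mean_y_in_range(dat, 1:3)
# [1] 5.5
```

Keep in mind that 1:3 only matches the exact values 1, 2 and 3; if x can hold non-integer values, the >= and <= comparison from the question is the safer test.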

input sequential numbers without specific end in a data frame's column in r

I would like to add a new column containing a sequence of numbers to a data frame. The sequence should restart based on the value in another column (i.e., it counts up from 1 until that value changes to another value).
My problem is how to define the end point of each sequence in R.
A part of my data frame with the column "V2" which I intend to add:
V1 V2(new added column with sequential numbers)
12 1
12 2
12 3
12 4
12 5
13 1
13 2
13 3
13 4
13 5
13 6
14 1
14 2
14 3
14 4
I tried to use the following code, which was not working!
count <- table(df$V1)
c <- as.integer(names(count)[df$V1==12])
repeat{
df$V2<- seq(1,c, by=1)
if(df$V1!=12){
break
}
}
It sounds like you might be looking for rle since you're interested in any time the "V1" variable changes.
Try the following:
> sequence(rle(df$V1)$lengths)
[1] 1 2 3 4 5 1 2 3 4 5 6 1 2 3 4
rle is a very good solution but you could also have used ave:
tab$V2 <- ave(tab$V1, tab$V1, FUN=seq_along)
hth
Well Ananda beats my effort:
vec = numeric(0)
for(i in unique(df$V1)){
  n = length(df$V1[df$V1 == i])
  vec = c(vec, 1:n)
}
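The three approaches agree on the data shown in the question, where each V1 value forms one contiguous block. They differ once a value reappears later: rle restarts the count at every change, while ave (and the loop over unique values, which additionally assumes sorted data) counts per distinct value. A small sketch with a made-up vector v:

```r
v <- c(12, 12, 13, 12)

# Restarts whenever the value changes
sequence(rle(v)$lengths)
# [1] 1 2 1 1

# Counts occurrences per distinct value, wherever they appear
ave(v, v, FUN = seq_along)
# [1] 1 2 1 3
```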

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result has to live in two columns, since an ordinary data frame column holds one simple value per row (unless you want to resort to list columns or complex numbers). head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition and c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
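Since the question also asks about summing or averaging the offset group, the embed() matrix can be reduced row by row. This is a sketch only; the column names sum_prev2 and mean_prev2 are made up for the example.

```r
df <- data.frame(a = 1:10)

# Each row of w holds the two preceding values: a[i-2], a[i-1]
w <- embed(c(NA, NA, df$a), 3)[, c(3, 2)]

df$sum_prev2  <- rowSums(w)                  # NA where the window is incomplete
df$mean_prev2 <- rowMeans(w, na.rm = TRUE)   # uses whatever values are available

df[4, ]
#   a sum_prev2 mean_prev2
# 4 4         5        2.5
```

Whether an incomplete window should yield NA or a partial result is a design choice; rowSums and rowMeans both take na.rm, so either behaviour is available.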
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Storing an output in the same data.frame when row size of output different

Sometimes I want to apply a function (e.g. a difference calculation) to a dataset and store the results directly in the data frame:
df <- data.frame(a$C, diff(a$C))
But I cannot do that because the number of rows is different.
Is there some syntax that will allow me to do that, perhaps having NA where the function (diff()) gives no result?
There isn't a general solution to this without making vast assumptions about the whole panoply of functions one may wish to use.
For the example you show, we can easily work out that the first value from diff() would be an NA if it returned it:
set.seed(5)
d <- rpois(10, 5)
> d
[1] 3 6 8 4 2 6 5 7 9 2
> diff(d)
[1] 3 2 -4 -2 4 -1 2 2 -7
So if you are using diff() then you can always just do:
> dd <- data.frame(d, Diff = c(NA, diff(d)))
> dd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
But now consider what you would do with any other function that you might wish to use that doesn't always return NA in the correct place.
For this example, we can use the zoo package, whose diff() method has an na.pad argument:
require(zoo)
d2 <- as.zoo(d)
ddd <- data.frame(d, Diff = diff(d2, na.pad = TRUE))
> ddd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
If you are using a modelling function with a formula interface (e.g. lm()) and that function has an na.action argument, then you can set na.action = na.exclude in the function call and extractor functions such as fitted(), resid() etc will add back in to their output NA in the correct places so that the output is of the same length as the data passed to the modelling function.
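A small sketch of the na.exclude behaviour, using a made-up data frame d with one missing response:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))

# na.omit drops the incomplete row, so the residuals are shorter than the data
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
length(resid(fit_omit))
# [1] 4

# na.exclude also drops the row for fitting, but the extractors pad with NA
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
length(resid(fit_excl))
# [1] 5

d$res <- resid(fit_excl)   # fits back into the data frame; row 2 is NA
```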
If you have other, more specific cases you want to explore, please edit your question. In specific cases there will usually be a simple answer. In the general case the answer is no, it is not possible to do what you ask.
The standard method is, as you say, to create a vector that is extended at one end or the other with an NA:
dfrm$diffvec <- c(NA, diff(firstvec) )
