Adding information to a column conditionally comparing strings - r

I have a data frame called "grass". One of the columns in this data frame is "line", which can be: high, low, f1, f2, bl or bh.
I created a new column and want to fill it in as the following code shows.
The problem is that I get "1" for all rows, not just for "high".
#add new column
grass["genome.inherited"] <- NA
#adding information to genome.inherited
#1 for the high-tolerance parent genotype (high)
#0 for the low-tolerance parent genotype (low)
#0.5 for the F1 and F2 hybrids (f1) (f2)
#0.25 for the backcross to the low tolerance population (bl)
#0.75 for the backcross to the high tolerance population (bh)
#how I tried to solve the problem
grass$genome.inherited <- if(grass$line == 'high'){
1
} else if(grass$line == 'low'){
0
} else if(grass$line == 'bl'){
0.25
} else if(grass$line == 'bh'){
0.75
} else {
0.5
}
As suggested here is the output for head(grass)
line cube.root.height genome.inherited
high 4.13 1
high 5.36 1
high 4.37 1
high 5.08 1
high 4.85 1
high 5.59 1
Thank you!

How about using the match function? It gives a number that indicates the position of a value in a character vector, and it has a "nomatch" argument for values that are not found.
grass$genome.inherited <- c(1, 0, 0.25, 0.75, 0.5)[
match( grass$line, c( 'high', 'low','bl','bh'), nomatch=5) ]
Example from console with other values of line to test:
grass <- read.table(text="line cube.root.height genome.inherited
high 4.13 1
high 5.36 2
low 4.37 1
high 5.08 1
junk 4.85 1
high 5.59 1
", head=T)
grass$genome.inherited <- c(1, 0, 0.25, 0.75, 0.5)[
match( grass$line, c( 'high', 'low','bl','bh'), nomatch=5) ]
grass
#----
line cube.root.height genome.inherited
1 high 4.13 1.0
2 high 5.36 1.0
3 low 4.37 0.0
4 high 5.08 1.0
5 junk 4.85 0.5
6 high 5.59 1.0
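To make the two steps of the match approach visible, here is a minimal sketch with a short, made-up vector (the name `line` and its values are only for illustration). It first shows the positions that match returns, then uses them to index the vector of values.

```r
# hypothetical input; 'junk' stands in for any unexpected value
line <- c("high", "low", "junk", "bh")

# position of each element in the lookup vector; unmatched values get 5
idx <- match(line, c("high", "low", "bl", "bh"), nomatch = 5)
idx

# use the positions to pick from the values; the 5th value is the default
res <- c(1, 0, 0.25, 0.75, 0.5)[idx]
res
```

Because the default sits in position 5 of the value vector, any unmatched line falls through to 0.5 without a separate step.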

You don't have to create the new column with NA first. Here is code that does it in one step:
grass$genome_inherited_values <- ifelse(grass$line == 'high', 1,
ifelse(grass$line == 'low', 0,
ifelse(grass$line == 'bl', 0.25,
ifelse(grass$line == 'bh', 0.75, 0.5))))

I agree (with 42-) that nested ifelse statements are not preferred. @42-'s solution using match is (imo) far better than nested ifelse calls.
An alternative is to merge them.
Data:
grass <- read.table(text="line cube.root.height
high 4.13
high 5.36
low 4.37
high 5.08
junk 4.85
high 5.59
", head=TRUE, stringsAsFactors=FALSE)
The table of values to merge in:
genome <- data.frame(
line=c("high","low","bl","bh"),
genome.inherited=c(1, 0, 0.25, 0.75),
stringsAsFactors=FALSE)
The merge:
grass2 <- merge(grass, genome, by="line", all.x=TRUE)
If you look at the data, you'll see an NA, because "junk" (an unknown value) is not present in the genome table and therefore assigned as NA. We can fix this with an easy step:
grass2$genome.inherited[is.na(grass2$genome.inherited)] <- 0.5
grass2
# line cube.root.height genome.inherited
# 1 high 4.13 1.0
# 2 high 5.36 1.0
# 3 high 5.08 1.0
# 4 high 5.59 1.0
# 5 junk 4.85 0.5
# 6 low 4.37 0.0
@42-'s answer has the advantage of providing a default (nomatch) value in the initial call.
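Note that merge sorts the result by the key, so the rows of grass2 come back in a different order than in grass. If the original order matters, one possible variation (a sketch, not part of the answer above) is to look the values up in the genome table with match instead of merging. The small data frames here are shortened stand-ins for the ones above.

```r
grass <- data.frame(line = c("high", "low", "junk", "high"),
                    cube.root.height = c(4.13, 4.37, 4.85, 5.08),
                    stringsAsFactors = FALSE)
genome <- data.frame(line = c("high", "low", "bl", "bh"),
                     genome.inherited = c(1, 0, 0.25, 0.75),
                     stringsAsFactors = FALSE)

# look up each row's value in the table; row order of 'grass' is untouched
grass$genome.inherited <- genome$genome.inherited[match(grass$line, genome$line)]

# lines missing from the table come back as NA; give them the default
grass$genome.inherited[is.na(grass$genome.inherited)] <- 0.5
grass
```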

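Your if conditions have length > 1, and if() is not vectorized. In R versions before 4.2.0 only the first element of the condition is used (with a warning), which is why you are getting all 1s; from R 4.2.0 on, a condition of length > 1 is an error.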
Here's a different (simpler than nested ifelse) approach for the same thing -
vals <- c(high = 1, low = 0, f1 = 0.5, f2 = 0.5, bl = 0.25, bh = 0.75)
grass$genome.inherited <- vals[as.character(grass$line)]
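As a small self-contained illustration of the named-vector lookup, the sketch below uses a made-up `grass` with one unexpected value ("junk", an assumption for the example). It also wraps the lookup in unname() so the new column does not carry the lookup names along.

```r
# hypothetical sample data
grass <- data.frame(line = c("high", "low", "f1", "bl", "bh", "junk"),
                    stringsAsFactors = FALSE)

vals <- c(high = 1, low = 0, f1 = 0.5, f2 = 0.5, bl = 0.25, bh = 0.75)

# index the named vector by line; unname() drops the names from the result
grass$genome.inherited <- unname(vals[as.character(grass$line)])

# values that are not a name in 'vals' come back as NA
grass$genome.inherited
```

Unlike the match approach with nomatch, unknown lines end up as NA here, which can be useful for spotting typos in the data.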

Related

Using rle function with condition on a column in r

My dataset has 523 rows and 93 columns and it looks like this:
data <- structure(list(`2018-06-21` = c(0.6959635416667, 0.22265625,
0.50341796875, 0.982942708333301, -0.173828125, -1.229259672619
), `2018-06-22` = c(0.6184895833333, 0.16796875, 0.4978841145833,
0.0636718750000007, 0.5338541666667, -1.3009207589286), `2018-06-23` = c(1.6165364583333,
-0.375, 0.570800781250002, 1.603515625, 0.5657552083333, -0.9677734375
), `2018-06-24` = c(1.3776041666667, -0.03125, 0.7815755208333,
1.5376302083333, 0.5188802083333, -0.552966889880999), `2018-06-25` = c(1.7903645833333,
0.03125, 0.724609375, 1.390625, 0.4928385416667, -0.723074776785701
)), row.names = c(NA, 6L), class = "data.frame")
Each row is a city, and each column is a day of the year.
After calculating the row average in this way
data$mn <- apply(data, 1, mean)
I want to create another column, data$duration, that indicates the average length of a period of consecutive days where the values are greater than data$mn.
I tried with this code:
data$duration <- apply(data[-6], 1, function(x) with(rle(x > data$mean), mean(lengths[values])))
But it does not seem to work. In particular, it appears that rle(x > data$mean) fails to recognize the end of a row.
What are your suggestions?
Many thanks
EDIT
Reference dataframe has been changed into a [6x5]
The main challenge you're facing in your code is getting apply (which focuses on one row at a time) to look at the right values of the mean. We can avoid this entirely by keeping the mean out of the data frame, and doing the comparison data > mean to the whole data frame at once. The new columns can be added at the end:
mn = rowMeans(data)
dur = apply(data > mn, 1, function(x) with(rle(x), mean(lengths[values])))
dur
# 1 2 3 4 5 6
# 3.0 1.5 2.0 3.0 4.0 2.0
data = cbind(data, mean = mn, duration = dur)
print(data, digits = 2)
# 2018-06-21 2018-06-22 2018-06-23 2018-06-24 2018-06-25 mean duration
# 1 0.70 0.618 1.62 1.378 1.790 1.2198 3.0
# 2 0.22 0.168 -0.38 -0.031 0.031 0.0031 1.5
# 3 0.50 0.498 0.57 0.782 0.725 0.6157 2.0
# 4 0.98 0.064 1.60 1.538 1.391 1.1157 3.0
# 5 -0.17 0.534 0.57 0.519 0.493 0.3875 4.0
# 6 -1.23 -1.301 -0.97 -0.553 -0.723 -0.9548 2.0
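To see what rle contributes inside that apply call, here is a short sketch on a single row (the values are row 2 of the sample data; the variable names are only for illustration).

```r
# row 2 of the sample data
x <- c(0.22265625, 0.16796875, -0.375, -0.03125, 0.03125)

above <- x > mean(x)   # TRUE TRUE FALSE FALSE TRUE
r <- rle(above)

r$lengths              # length of each run
r$values               # whether the run is above (TRUE) or below (FALSE) the mean

# keep only the runs that are above the mean and average their lengths
dur <- mean(r$lengths[r$values])
dur
```

The two runs above the mean have lengths 2 and 1, which gives the 1.5 shown for row 2 in the output above.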

Manipulate list object into data frame

library(survey)
I have data such as this. I am using the survey package to produce the MEAN, SE and FREQ of each variable in the vector named vars. I am new to manipulating lists in R and would really appreciate help!
df <- data.frame(
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
vars <- c("weight","married","pens")
design <- svydesign(ids=~1, data=df, weights=~weight)
myfun <- function(x){
means <- svymean(as.formula(paste0('~(', x, ')')), design, na.rm = T)
table <- svytable(as.formula(paste0('~(', x, ')')), design)
results <- list(svymean = means, svytable = table)
return(results)
}
lapply(vars, myfun)
The output looks like this:
[[1]]
[[1]]$svymean
mean SE
weight 0.79791 0.1177
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
[[2]]
[[2]]$svymean
mean SE
married 0.91085 0.0717
[[2]]$svytable
married
0 1
0.46 4.70
[[3]]
[[3]]$svymean
mean SE
pens 0.46272 0.2255
[[3]]$svytable
pens
0 1
2.45 2.11
I want to extract/manipulate this list above to create a dataframe that looks more like this:
question mean SE sum_svytable
weight 0.797 0.1177 5.16
married 0.910 0.071 5.16
As you can see, the sum_svytable is the sum of the frequencies produced in the $svytable generated list for each variable. Even though this number is the same for each variable (5.16 for all) in my example, it is not the same in my dataset.
sum_svytable was derived like this:
output of myfun function for weight:
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
I simply summed the frequencies for each response:
sum_svytable(for weight) = 0.46 +0.55+ 0.60+ 0.66+ 0.67+ 1.10+ 1.12
I don't mind how this result is arrived at, I just need it to be in a df!
Is this possible?
One option is to loop over the list of output from 'myfun', extract the 'svymean' component of each element, convert it to a data.frame, add the column of sums from the 'svytable' element, rbind the list elements, and create the 'question' column from the row names:
out <- lapply(vars, myfun)
lst1 <- lapply(out, function(x)
cbind(setNames(as.data.frame(x$svymean), c("mean", "SE")),
sum_svytable = sum(x$svytable)))
out1 <- do.call(rbind, lst1)
out1$question <- row.names(out1)
row.names(out1) <- NULL
out1[c('question', 'mean', 'SE', 'sum_svytable')]
# question mean SE sum_svytable
#1 weight 0.7979070 0.1177470 5.16
#2 married 0.9108527 0.0716663 5.16
#3 pens 0.4627193 0.2254907 4.56
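The same lapply / do.call(rbind, ...) pattern can be tried without the survey package. The sketch below uses a hand-made list as a stand-in for the output of lapply(vars, myfun); the element names `stats` and `freq` and the numbers in them are assumptions for illustration only.

```r
# hypothetical stand-in for the list returned by lapply(vars, myfun)
out <- list(
  list(stats = data.frame(mean = 0.8, SE = 0.1, row.names = "weight"),
       freq  = c(0.46, 0.55, 0.60)),
  list(stats = data.frame(mean = 0.9, SE = 0.07, row.names = "married"),
       freq  = c(0.46, 4.70))
)

# one single-row data frame per list element, with the summed frequencies added
rows <- lapply(out, function(x) cbind(x$stats, sum_freq = sum(x$freq)))

# stack the rows, then move the row names into a regular column
res <- do.call(rbind, rows)
res$question <- row.names(res)
row.names(res) <- NULL
res[c("question", "mean", "SE", "sum_freq")]
```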

Assigning a value to each range of consecutive numbers with same sign in R

I'm trying to create a data frame where a column exists that holds values representing the length of runs of positive and negative numbers, like so:
Time V Length
0.5 -2 1.5
1.0 -1 1.5
1.5 0 0.0
2.0 2 1.0
2.5 0 0.0
3.0 1 1.75
3.5 2 1.75
4.0 1 1.75
4.5 -1 0.75
5.0 -3 0.75
The Length column sums the length of time that the value has been positive or negative. Zeros are given a 0 since they are an inflection point. If there is no zero separating the sign change, the values are averaged on either side of the inflection.
I am trying to approximate the amount of time that these values are spending either positive or negative. I've tried this with a for loop with varying degrees of success, but I would like to avoid looping because I am working with extremely large data sets.
I've spent some time looking at sign and diff as they are used in this question about sign changes. I've also looked at this question that uses transform and aggregate to sum consecutive duplicate values. I feel like I could use this in combination with sign and/or diff, but I'm not sure how to retroactively assign these sums to the ranges that created them or how to deal with spots where I'm taking the average across the inflection.
Any suggestions would be appreciated. Here is the sample dataset:
dat <- data.frame(Time = seq(0.5, 5, 0.5), V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
First find the indices of "Time" that need to be interpolated: consecutive "V" values which lack a zero between positive and negative values; these have an abs(diff(sign(V))) equal to two.
id <- which(abs(c(0, diff(sign(dat$V)))) == 2)
Add rows with average "Time" between relevant indices and corresponding "V" values of zero to the original data. Also add rows of "V" = 0 at "Time" = 0 and at last time step (according to the assumptions mentioned by @Gregor). Order by "Time".
d2 <- rbind(dat,
data.frame(Time = (dat$Time[id] + dat$Time[id - 1])/2, V = 0),
data.frame(Time = c(0, max(dat$Time)), V = c(0, 0))
)
d2 <- d2[order(d2$Time), ]
Calculate time differences between time steps which are zero and replicate them using "zero-group indices".
d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]
Add values to original data:
merge(dat, d2)
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 1.00
# 4 2.0 2 1.00
# 5 2.5 0 1.75
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Set "Length" to 0 where V == 0.
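Putting the steps together, including the final one that is only described in words, might look like the following sketch. The name `res` and the explicit ordering by Time are additions for the example.

```r
dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))

# sign changes without a zero in between
id <- which(abs(c(0, diff(sign(dat$V)))) == 2)

# add interpolated zeros plus zeros at the start and end, then sort by time
d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1]) / 2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0)))
d2 <- d2[order(d2$Time), ]

# time between consecutive zeros, repeated for every row of that stretch
d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]

# bring the lengths back to the original rows
res <- merge(dat, d2)
res <- res[order(res$Time), ]

# final step: observed zeros get a Length of 0
res$Length[res$V == 0] <- 0
res
```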
This works, at least for your test case. And it should be pretty efficient. It makes some assumptions, I'll try to point out the big ones.
First we extract the vectors and stick 0s on the beginning. We also set the last V to 0. The calculation will be based on time differences between 0s, so we need to start and end with 0s. Your example seems to tacitly assume V = 0 at Time = 0, hence the initial 0, and it stops abruptly at the maximum time, so we set V = 0 there as well:
Time = c(0, dat$Time)
V = c(0, dat$V)
V[length(V)] = 0
To fill in the skipped 0s, we use approx to do linear approximation on sign(V). It also assumes that your sampling frequency is regular, so we can get away with doubling the frequency to get all the missing 0s.
ap = approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25))
The values we want to fill in are the durations between the 0s, both observed and approximated. In the correct order, these are:
dur = diff(ap$x[ap$y == 0])
Lastly, we need the indices of the original data to fill in the durations. This is the hackiest part of this answer, but it seems to work. Maybe someone will suggest a nice simplification.
# first use rleid to get the sign groupings
group = data.table::rleid(sign(dat$V))
# then we need to set the groups corresponding to 0 values to 0
# and reduce any group numbers following 0s correspondingly
# lastly we add 1 to everything so that we can stick 0 at the
# front of our durations and assign those to the 0 V values
ind = (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1
# fill it in
dat$Length = c(0, dur)[ind]
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
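The index arithmetic is easier to follow with the intermediate vectors printed. The sketch below repeats that step on the sample V; to avoid the data.table dependency it builds the run ids from base rle, which is a substitution made for this example and not what the answer above uses.

```r
V <- c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3)

# base-R stand-in for data.table::rleid: one id per run of equal signs
r <- rle(sign(V))
group <- rep(seq_along(r$lengths), r$lengths)
group

# subtract the number of zeros seen so far, zero out the zero rows, shift by 1
ind <- (group - cumsum(V == 0)) * (V != 0) + 1
ind

# durations between zeros for this example, with 0 in front for the zero rows
dur <- c(1.5, 1, 1.75, 0.75)
Length <- c(0, dur)[ind]
Length
```

Every zero row ends up with index 1 and therefore picks the leading 0, while the k-th non-zero stretch gets index k + 1 and picks the k-th duration.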
It took me longer than I care to admit, but here is my solution.
Because you said you wanted to use it on large datasets (thus speed matters) I use Rcpp to write a loop that does all the checking. For speed comparisons I also create another sample dataset with 500,000 data points and check the speed (I tried to compare to the other solutions but couldn't translate them to data.table (without that it would be an unfair comparison...)). If supplied, I will gladly update the speed comparisons!
Part 1: My solution
My solution looks like this:
(in length_time.cpp)
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector length_time(NumericVector time, NumericVector v) {
double start = 0;
double time_i, v_i;
bool last_positive = v[0] > 0;
bool last_negative = v[0] < 0;
int length_i = time.length();
NumericVector ret_vec(length_i);
for (int i = 0; i < length_i; ++i) {
time_i = time[i];
v_i = v[i];
if (v_i == 0) { // inflection point
if (i > 0) { // if this is not the beginning, then a regime has ended!
ret_vec[i - 1] = time_i - start;
start = time_i;
}
} else if ((v_i > 0 && last_negative) || (v_i < 0 && last_positive)) {
ret_vec[i - 1] = (time_i + time[i - 1]) / 2 - start;
start = (time_i + time[i - 1]) / 2;
}
last_positive = v_i > 0;
last_negative = v_i < 0;
}
ret_vec[length_i - 1] = time[length_i - 1] - start;
// ret_vec now only has the values for the last observation
// do something like a reverse na_locf...
double tmp_val = ret_vec[length_i - 1];
for (int i = length_i - 1; i >= 0; --i) {
if (v[i] == 0) {
ret_vec[i] = 0;
} else if (ret_vec[i] == 0){
ret_vec[i] = tmp_val;
} else {
tmp_val = ret_vec[i];
}
}
return ret_vec;
}
and then in an R-file (i.e., length_time.R):
library(Rcpp)
# setwd("...") #to find the .cpp-file
sourceCpp("length_time.cpp")
dat$Length <- length_time(dat$Time, dat$V)
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Which seems to work on the sample dataset.
Part 2: Testing for Speed
library(data.table)
library(microbenchmark)
n <- 10000
set.seed(1235278)
dt <- data.table(time = seq(from = 0.5, by = 0.5, length.out = n),
v = cumsum(round(rnorm(n, sd = 1))))
dt[, chg := v >= 0 & shift(v, 1, fill = 0) <= 0]
plot(dt$time, dt$v, type = "l")
abline(h = 0)
for (i in dt[chg == T, time]) abline(v = i, lty = 2, col = "red")
Which results in a dataset with 985 observations (crossings).
Testing the speed with microbenchmark results in
microbenchmark(dt[, length := length_time(time, v)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt[, `:=`(length, length_time(time, v))] 2.625714 2.7184 3.054021 2.817353 3.077489 5.235689 100
Resulting in about 3 milliseconds for calculating with 500,000 observations.
Does that help you?
Here is my attempt done completely in base R.
Joseph <- function(df) {
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
v <- df$V
t <- df$Time
sv <- sign(v)
nR <- length(v)
v0 <- which(v==0)
id <- which(abs(c(0, diff(sv))) > 1) ## This line and (t[id] + t[id - 1L])/2 from @Henrik
myZeros <- sort(c(v0*t[1L], (t[id] + t[id - 1L])/2))
lenVals <- diff(c(0,myZeros,t[nR])) ## Actual values that
## will populate the Length column
## remove values that result from repeating zeros from the df$V column
lenVals <- lenVals[lenVals != t[1L] | c(!is.wholenumber(myZeros/t[1L]),F)]
## Below we need to determine how long to replicate
## each of the lenVals above, so we need to find
## the starting place and length of each run...
## rle is a great candidate for both of these
m <- rle(sv)
ml <- m$lengths
cm <- cumsum(ml)
zm <- m$values != 0 ## non-zero values i.e. we won't populate anything here
rl <- m$lengths[zm] ## non-zero run-lengths
st <- cm[zm] - rl + 1L ## starting index
out <- vector(mode='numeric', length = nR)
for (i in 1:length(st)) {out[st[i]:(st[i]+rl[i]-1L)] <- lenVals[i]}
df$Length <- out
df
}
Here is the output of the given example:
Joseph(dat)
Time V Length
1 0.5 -2 1.50
2 1.0 -1 1.50
3 1.5 0 0.00
4 2.0 2 1.00
5 2.5 0 0.00
6 3.0 1 1.75
7 3.5 2 1.75
8 4.0 1 1.75
9 4.5 -1 0.75
10 5.0 -3 0.75
Here is a larger example:
set.seed(142)
datBig <- data.frame(Time=seq(0.5,50000,0.5), V=sample(-3:3, 10^5, replace=TRUE))
library(compiler)
library(data.table)
library(microbenchmark)
c.Joseph <- cmpfun(Joseph)
c.Henrik <- cmpfun(Henrik)
c.Gregor <- cmpfun(Gregor)
microbenchmark(c.Joseph(datBig), c.Gregor(datBig), c.Henrik(datBig), David(datBig), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
David(datBig) 2.20602 2.617742 4.35927 2.788686 3.13630 114.0674 10 a
c.Joseph(datBig) 61.91015 62.62090 95.44083 64.43548 93.20945 225.4576 10 b
c.Gregor(datBig) 59.25738 63.32861 126.29857 72.65927 214.35961 229.5022 10 b
c.Henrik(datBig) 1511.82449 1678.65330 1727.14751 1730.24842 1816.42601 1871.4476 10 c
As @Gregor pointed out, the goal is to find the x-distance between each occurrence of zero. This can be seen visually by plotting (again, as pointed out by @Gregor (many kudos btw)). For example, if we plot the first 20 values of datBig, we obtain:
From this, we can see that the x-distances such that the graph is either positive or negative (i.e. not zero (this happens when there are repeats of zeros)) are approximately:
2.0, 1.25, 0.5, 0.75, 2.0, 1.0, 0.75, 0.5
t1 <- c.Joseph(datBig)
t2 <- c.Gregor(datBig)
t3 <- c.Henrik(datBig)
t4 <- David(datBig)
## Correct values according to the plot above
## 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## all correct
t1$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## mostly correct (incorrect at positions 11, 12, 13, 17 and 20)
t2$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 0.75 0.75 0.75 0.00 0.00 0.00 0.50 0.00 0.75 0.25
## least correct (incorrect at positions 4, 5, 7, 8, 10-14 and 17-20)
t3$Length[1:20]
[1] 2.00 2.00 2.00 0.50 1.00 1.25 0.75 1.25 0.00 1.75 1.75 0.00 1.50 1.50 0.00 0.00 1.25 1.25 1.25 1.25
## all correct
t4$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
# agreement with David's solution
all.equal(t4$Length, t1$Length)
[1] TRUE
Well, it seems the Rcpp solution provided by David is not only accurate but blazing fast.

Show Kruskal-Wallis test ranks

I performed a Kruskal-Wallis test on multi-treatment data in which I compared five different methods.
A friend showed me the calculation in SPSS, and the results included the mean rank of each method.
In R, I only get the chi-squared value, the degrees of freedom and the p-value when applying kruskal.test to my data set. Those values are equal to the ones in SPSS, but I do not get any ranks.
How can I print out the ranks of the computation?
My code looks like this:
comparison <- kruskal.test(all,V3,p.adj="bon",group=FALSE, main="over")
If I print comparison I get the following:
Kruskal-Wallis rank sum test
data: all
Kruskal-Wallis chi-squared = 131.4412, df = 4, p-value < 2.2e-16
But I would like to get something like this additional output from spss:
Type H Middle Rank
1,00 57 121.11
2,00 57 148.32
3,00 57 217.49
4,00 57 53.75
5,00 57 174.33
total 285
How do I get this done in r?
Unfortunately, you have to compute the table you want yourself. Luckily, I have made a function for you:
#create some random data
ozone <- airquality$Ozone
names(ozone) <- airquality$Month
spssOutput <- function(vector) {
# This function takes your data as one long
# vector and ranks it. After that it computes
# the mean rank of each group. The groups
# need to be given as names to the vector.
# the function returns a data frame with
# the results in SPSS style.
ma <- matrix(, ncol=3, nrow= 0)
r <- rank(vector, na.last = NA)
to <- 0
for(n in unique(names(r))){
# compute the rank mean for group n
g <- r[names(r) == n]
gt <- length(g)
rm <- sum(g)/gt
to <- to + gt
ma <- rbind(ma, c(n, gt, rm))
}
colnames(ma) <- c("Type","H","Middle Rank")
ma <- rbind(ma, c("total", to, ""))
as.data.frame(ma)
}
# calculate everything
out <- spssOutput(ozone)
print(out, row.names= FALSE)
kruskal.test(Ozone ~ Month, data = airquality)
This gives you the following output:
Type H Middle Rank
5 26 36.6923076923077
6 9 48.7222222222222
7 26 77.9038461538462
8 26 75.2307692307692
9 29 48.6896551724138
total 116
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 6.901e-06
You haven't shared your data so you have to figure out yourself how this would work for your data set.
I had an assignment where I had to do this. Make a data frame where one column is the combined values you're ranking, one column is the categories each value belongs to, and the final column is the ranking of each value. The function rank() is the one you need for the actual ranking. The code looks like this:
low <- c(0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86)
medium <- c(0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86)
high <- c(0.65, 0.73, 0.74, 0.76, 0.81,0.82, 0.85, 0.86, 0.88, 0.90)
data.value <- c(low, medium, high)
data.category <- c(rep("low", length(low)), rep("medium", length(medium)), rep("high", length(high)) )
data.rank <- rank(data.value)
data <- data.frame(data.value, data.category, data.rank)
data
data.value data.category data.rank
1 0.56 low 1.0
2 0.57 low 2.0
3 0.58 low 3.0
4 0.62 low 4.0
5 0.64 low 5.0
6 0.65 low 6.5
7 0.67 low 8.0
8 0.68 low 9.0
9 0.74 low 13.0
10 0.78 low 18.5
11 0.85 low 26.5
12 0.86 low 29.0
13 0.70 medium 10.0
14 0.74 medium 13.0
15 0.75 medium 15.0
16 0.76 medium 16.5
17 0.78 medium 18.5
18 0.79 medium 20.0
19 0.80 medium 21.0
20 0.82 medium 23.5
21 0.83 medium 25.0
22 0.86 medium 29.0
23 0.65 high 6.5
24 0.73 high 11.0
25 0.74 high 13.0
26 0.76 high 16.5
27 0.81 high 22.0
28 0.82 high 23.5
29 0.85 high 26.5
30 0.86 high 29.0
31 0.88 high 31.0
32 0.90 high 32.0
This will give you a table that looks like this.
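The table above holds one rank per value; the SPSS-style summary in the question additionally needs the count and the mean rank per group. One way this could be done in base R is with tapply, as in the sketch below (the names `value`, `category`, `n` and `mean_rank` are chosen for the example).

```r
low    <- c(0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86)
medium <- c(0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86)
high   <- c(0.65, 0.73, 0.74, 0.76, 0.81, 0.82, 0.85, 0.86, 0.88, 0.90)

value    <- c(low, medium, high)
category <- rep(c("low", "medium", "high"),
                times = c(length(low), length(medium), length(high)))

# rank all values together; ties get the average rank
r <- rank(value)

# group size and mean rank per category
n         <- tapply(r, category, length)
mean_rank <- tapply(r, category, mean)

data.frame(Type = names(mean_rank),
           N = as.vector(n),
           MeanRank = as.vector(mean_rank))
```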

R: Update with value of previous row (subject to condition)

I would like to update the value in a table with values of the previous row, within groups, (and probably stop the updates on a given condition)
Here is an example:
library(data.table)
set.seed(12345)
field <- data.table(time=1:3, player = letters[1:2], prospects = round(rnorm(6),2))
setkey(field, player, time)
field[time == 1, energy := round(rnorm(2),2)] #initial level - this is what I want to propagate down the table
#let 'prospects < 0.27' be the condition that stops the process, and sets 'energy = 0'
#player defines the groups within which the updates are made
Here is the table I have.
> field
time player prospects energy
1: 1 a 0.81 -0.32
2: 2 a 0.25 NA
3: 3 a 2.05 NA
4: 1 b 1.63 -1.66
5: 2 b 2.20 NA
6: 3 b 0.49 NA
Here is the table I want.
> field
time player prospects energy
1: 1 a 0.81 -0.32
2: 2 a 0.25 0
3: 3 a 2.05 0
4: 1 b 1.63 -1.66
5: 2 b 2.20 -1.66
6: 3 b 0.49 -1.66
Thanks in advance
Probably there are better ways, but this is what came to my mind. This makes use of roll=TRUE argument. The idea is to first set energy=0.0 where prospects < 0.27:
field[prospects < 0.27, energy := 0.0]
Then, if we remove the NA values from field, we can use roll=TRUE by doing a join with all combinations as follows:
field[!is.na(energy)][CJ(c("a", "b"), 1:3), roll=TRUE][, prospects := field$prospects][]
# player time prospects energy
# 1: a 1 0.81 0.63
# 2: a 2 0.25 0.00
# 3: a 3 2.05 0.00
# 4: b 1 1.63 -0.28
# 5: b 2 2.20 -0.28
# 6: b 3 0.49 -0.28
We have to reset prospects because the roll changes it too. You could do it better, but you get the idea.
A variation, so that the roll is performed only on energy column:
field[!is.na(energy)][CJ(c("a", "b"), 1:3), list(energy),
roll=TRUE][, prospects := field$prospects][]
Or it may be simpler to use na.locf from package zoo :
field[time == 1, energy := round(rnorm(2),2)]
field[prospects < 0.27, energy := 0.0]
require(zoo)
field[, energy := na.locf(energy, na.rm=FALSE)]
which works if the first row of each group is guaranteed to be non-NA, which it is here by construction. But if not, you can run na.locf by group, too :
field[, energy := na.locf(energy, na.rm=FALSE), by=player]
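For readers without zoo or data.table at hand, the same carry-forward idea can be sketched in base R. The helper `locf` below is a hypothetical stand-in written for this example, not the zoo function, and the data frame copies the values from the question's table.

```r
field <- data.frame(time = rep(1:3, 2),
                    player = rep(c("a", "b"), each = 3),
                    prospects = c(0.81, 0.25, 2.05, 1.63, 2.20, 0.49),
                    energy = c(-0.32, NA, NA, -1.66, NA, NA))

# the stopping condition sets energy to 0
field$energy[field$prospects < 0.27] <- 0

# hypothetical helper: carry the last non-NA value forward
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs stay NA
}

# apply the helper within each player
field$energy <- ave(field$energy, field$player, FUN = locf)
field
```

Because the 0 written by the stopping condition is itself carried forward, energy stays at 0 for the rest of that player's rows, as in the desired table.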
Something like this, using ddply from the plyr package?
library(plyr)
ddply(field, 'player', function(x) {
baseline <- x[x$time == 1, 'energy']
x$energy <- baseline
ind <- which(x$prospects < 0.27)
if (length(ind)) {
x[min(ind):nrow(x), 'energy'] <- 0
}
x
})
