Replace 0 when first observation for a level factor R

Replace 0 when first observation for a level factor R - r

I have this sample:
data <- structure(list(mmsi = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
tr = c(1, 1, 1, 0, 2, 2, 0, 4, 4, 0, 5, 5)), .Names = c("mmsi",
"tr"), row.names = c(NA, -12L), class = "data.frame")
I want to replace each 0 in the column tr with the previous value of tr, for each mmsi.
This function works well on the sample:
for ( i in levels(data$mmsi) ) {
data$test <- na.locf(with(data, { is.na(tr) <- tr == 0; tr }), fromLast = FALSE)}
But when I play with a bigger sample, one issue apears: if the first value is 0, then I have an error (because it can not find the previous value...).
For example if I edit the small sample with
data <- structure(list(mmsi = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
tr = c(0, 1, 1, 0, 2, 2, 0, 4, 4, 0, 5, 5)), .Names = c("mmsi",
"tr"), row.names = c(NA, -12L), class = "data.frame")
The column tr begins here with 0 instead of 1 in the previous sample. If I apply the same function for ( i in levels(data$mmsi) ) {
data$test <- na.locf(with(data, { is.na(tr) <- tr == 0; tr }), fromLast = FALSE)} then I have of course the error
Error in `$<-.data.frame`(`*tmp*`, "test", value = c(1, 1, 1, 2, 2, 2, :
replacement has 11 rows, data has 12
--> the function could not replace the value I changes (the first value in the column tr)
I guess I need in my function one more row to edit first the 0 when they occur as a first level in tr. The new row should to replace the 0 with the following non-zero value. Then the rest of the function is fine.
The output I am looking for this new column is:
data$test
[1] 1 1 1 1 2 2 2 4 4 4 5 5
Any idea how to get this?

We can do this with one of the group by functions. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'mmsi', apply the na.locf (from zoo) after replacing the '0' values to 'NA' and with the option na.rm = FALSE, then we do a second na.locf with fromLast = TRUE to replace the starting 0 (aka NA) to the next value.
library(data.table)
library(zoo)
setDT(data)[, test := na.locf(na.locf(replace(tr, tr==0, NA),
na.rm=FALSE), fromLast=TRUE), by = mmsi]
data
# mmsi tr test
# 1: a 0 1
# 2: a 1 1
# 3: a 1 1
# 4: a 0 1
# 5: a 2 2
# 6: a 2 2
# 7: a 0 2
# 8: b 4 4
# 9: b 4 4
#10: b 0 4
#11: b 5 5
#12: b 5 5
We could also do this without using the na.locf
setDT(data)[, test := pmax(pmax(tr, shift((NA^!tr) * tr), na.rm = TRUE),1), mmsi]

Related

How to do a binary if statement in R?

I have a data set with a column of letters, followed by another column of ones and zeroes. I want to total the amount of "ones" for each letter, but am unsure how to do so in an effective way.
I appreciate the help.

We can group by the first column ('col1') and then get the sum of 'col2'
library(dplyr)
df1 %>%
group_by(col1) %>%
summarise(Total = sum(col2))
Or in data.table
library(data.table)
setDT(df1)[, .(Total = sum(col2)), col1]
Or with base R
rowsum(df1$col2, df1$col1)

Here are some other base R solutions
> tapply(df$col2, df$col1, sum)
a b c
1 1 2
> xtabs(col2~col1,df)
col1
a b c
1 1 2
Dummy Data
df <- structure(list(col1 = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 3L,
2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"), col2 = c(0,
0, 0, 0, 1, 1, 1, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-10L))
> df
col1 col2
1 a 0
2 c 0
3 a 0
4 b 0
5 a 1
6 c 1
7 c 1
8 b 1
9 b 0
10 c 0

Sample n consecutive dates from a random starting date for each index in a data frame

Consider the following DataFrame:
DF = structure(list(c_number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L), date = c("2001-01-06", "2001-01-07", "2001-01-08",
"2001-01-09", "2001-01-10", "2001-01-11", "2001-01-12", "2001-01-13",
"2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", "2001-01-18",
"2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", "2001-01-23",
"2001-01-24", "2001-01-25", "2001-01-26", "2001-01-11", "2001-01-12",
"2001-01-13", "2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17",
"2001-01-18", "2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22",
"2001-01-23", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27",
"2001-01-28", "2001-01-12", "2001-01-13", "2001-01-14", "2001-01-15",
"2001-01-16", "2001-01-17", "2001-01-18", "2001-01-19", "2001-01-20",
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25",
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30",
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25",
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30",
"2001-01-31", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27",
"2001-01-28", "2001-01-29", "2001-01-30", "2001-01-31", "2001-02-01"
), value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("c_number",
"date", "value"), row.names = c(NA, -78L), class = "data.frame")
I have sales data for 5 customer on consecutive dates; For customer 1, I have sales data on 21 consecutive dates....for customer # 5, I have sales data on 9 consecutive dates...:
> table(DF[, 1])
1 2 3 4 5
21 18 19 11 9
For each customer I want to sample a sub DF of 15 consecutive days (If I have at least 15 consecutive dates for that customer) or all dates for that customer (if I don't have 15 consecutive dates for that customer).
The key part is that in case 1 (If I have at least 15 consecutive dates for that customer) those 15 consecutive days should have a random start date (e.g. not always be the first or last 15 dates for an customer) to avoid introducing a bias in the analysis.
In plain R I would do:
library(dplyr)
slow_function <- function(i, DF, length_out = 15){
sub_DF = DF[DF$c_number == i, ]
if(nrow(sub_DF) <= length_out){
out_DF = sub_DF
} else {
random_start = sample.int(nrow(sub_DF) - length_out, 1)
out_DF = sub_DF[random_start:(random_start + length_out - 1), ]
}
}
a_out = lapply(1:nrow(a_1), slow_function, DF = DF, length_out = 15)
a_out = dplyr::bind_rows(a_out)
table(a_out[, 1])
1 2 3 4 5
15 15 15 11 9
But my data is much larger and the operation above unbearably slow. Is there a fast way to obtain the same result in data.table/dplyr?
Edit: code to generate the data.
num_customer = 10
m = 2 * num_customer
a_0 = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
a_1 = matrix(sort(sample(as.character(a_0), m)), nc = 2)
a_2 = list()
for(i in 1:nrow(a_1)){
a_3 = seq(as.Date(a_1[i, 1]), as.Date(a_1[i, 2]), by = "day")
a_4 = data.frame(i, as.character(a_3), round(runif(length(a_3), 1)))
colnames(a_4) = c("c_number", "date", "value")
a_2[[i]] = a_4
}
DF = dplyr::bind_rows(a_2)
dim(DF)
table(DF[, 1])
dput(DF)
Edit2:
on a 100k customer DF, Christoph Wolk's solution is the fastest.
Next is G. Grothendieck's (about 4 times more time), next is
Nathan Werth's (another factor of 2 slower than G. Grothendieck's).
The other solutions are noticeably slower. Still, all proposals are faster than my tentative 'slow_function' so thanks to everyone!

Try this:
sample15consecutive <- function(DF) {
runs <- rle(DF$c_number)$lengths
start <- ifelse(runs > 15, sapply(pmax(runs-15, 1), sample.int, size=1), 1)
end <- ifelse(runs >= 15, 15, runs)
previous <- cumsum(c(0, head(runs, -1)))
DF[unlist(mapply(seq, previous + start, previous + start + end - 1), length),]
}
It's about 4 times faster according to microbenchmark. The c_numbers and dates have to be sorted.

A way to speed up in base R might be to just work with indices rather than the whole data.frame before subsetting.
output = DF[unlist(lapply(
split(1:NROW(DF), DF$c_number), #Split indices along rows of DF
function(x){
if(length(x) < 15){ #Grab all indices if there are less than 15
x
} else{
#Grab an index randomly such that there will be 14 more left after it
x[sample(0:(length(x) - 15), 1) + sequence(15)]
}
})),
]
sapply(split(output, output$c_number), NROW)
# 1 2 3 4 5
#15 15 15 11 9

samp generates a vector of 1 (in sample) and 0 (out of sample) and we subset by that. I haven't benchmarked it but it does not break up DF into sub-dataframes but only splits the c_number vector and then does a single subset on the original DF.
samp <- function(x) {
n <- length(x)
replace(0*x, seq(sample(max(n - 15, 1), 1), length = min(n, 15)), 1)
}
s <- subset(DF, ave(c_number, c_number, FUN = samp) == 1)

Try this:
library(data.table)
setDT(DF)
DF[
,
{
if (.N <= 15) {
# 15 or fewer rows? Grab them all.
.SD
} else {
# Grab a random starting row not too close to the end
random_start <- sample(seq_len(.N - 14), size = 1)
.SD[random_start + 0:14]
}
},
by = c_number
]

This is pretty straightforward with the tidyverse packages (specifically, dplyr and tidyr).
library(tidyverse)
df.sample <- arrange(DF, date) %>%
group_by(c_number) %>%
do(head(., 15))
Output (first 30 rows / 2 employees):
# A tibble: 65 x 3
c_number date value
<int> <chr> <dbl>
1 1 2001-01-06 1
2 1 2001-01-07 1
3 1 2001-01-08 1
4 1 2001-01-09 1
5 1 2001-01-10 1
6 1 2001-01-11 1
7 1 2001-01-12 1
8 1 2001-01-13 1
9 1 2001-01-14 1
10 1 2001-01-15 1
11 1 2001-01-16 1
12 1 2001-01-17 1
13 1 2001-01-18 1
14 1 2001-01-19 1
15 1 2001-01-20 1
16 2 2001-01-11 1
17 2 2001-01-12 1
18 2 2001-01-13 1
19 2 2001-01-14 1
20 2 2001-01-15 1
21 2 2001-01-16 1
22 2 2001-01-17 1
23 2 2001-01-18 1
24 2 2001-01-19 1
25 2 2001-01-20 1
26 2 2001-01-21 1
27 2 2001-01-22 1
28 2 2001-01-23 1
29 2 2001-01-24 1
30 2 2001-01-25 1
# ... with 35 more rows
Edit: the following selects a random start date for each employee and then selects up to 15 consecutive days after the randomly chosen point:
df.sample <- arrange(DF, date) %>%
group_by(c_number) %>%
mutate(date = as.Date(date), start = sample(date, 1)) %>%
filter(date >= start & date <= (start + 14))

How can I reference a list based on a variable within a data.frame?

I have a simple table with emp_id and job_code. I would like to return the correct payout based on the job_code
I've managed this with nested ifelse's but what if I have more job_code's?
library(dplyr)
set.seed(1)
emp_id <- round(rnorm(100, 500000, 10000))
job_code <- sample(c('a', 'b', 'c'), 100, replace = TRUE)
result <- sample(c(1,2,3,4), 100, replace = TRUE)
df <- data.frame(emp_id = emp_id, job_code = job_code, result = result)
job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)
# Works but sucky
df %>% mutate(payout = ifelse(job_code == 'a', job_a[result],
ifelse(job_code == 'b', job_b[result],
job_c[result])))
and dput if you prefer:
structure(list(emp_id = c(493735, 501836, 491644, 515953, 503295,
491795, 504874, 507383, 505758, 496946, 515118, 503898, 493788,
477853, 511249, 499551, 499838, 509438, 508212, 505939, 509190,
507821, 500746, 480106, 506198, 499439, 498442, 485292, 495218,
504179, 513587, 498972, 503877, 499462, 486229, 495850, 496057,
499407, 511000, 507632, 498355, 497466, 506970, 505567, 493112,
492925, 503646, 507685, 498877, 508811, 503981, 493880, 503411,
488706, 514330, 519804, 496328, 489559, 505697, 498649, 524016,
499608, 506897, 500280, 492567, 501888, 481950, 514656, 501533,
521726, 504755, 492901, 506107, 490659, 487464, 502914, 495567,
500011, 500743, 494105, 494313, 498648, 511781, 484764, 505939,
503330, 510631, 496958, 503700, 502671, 494575, 512079, 511604,
507002, 515868, 505585, 487234, 494267, 487754, 495266), job_code = structure(c(1L,
1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 2L,
3L, 3L, 2L, 1L, 1L, 1L, 2L, 3L, 2L, 1L, 1L, 2L, 3L, 2L, 1L, 2L,
2L, 2L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L,
3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 3L, 3L,
1L, 1L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L,
2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 3L, 1L,
1L, 1L, 3L), .Label = c("a", "b", "c"), class = "factor"), result = c(3,
1, 2, 2, 2, 4, 1, 4, 1, 2, 1, 1, 4, 3, 2, 2, 1, 2, 4, 3, 3, 2,
2, 4, 4, 4, 4, 4, 2, 4, 4, 2, 2, 4, 1, 2, 2, 1, 3, 4, 4, 1, 3,
2, 3, 2, 2, 1, 2, 3, 2, 1, 4, 2, 4, 2, 4, 1, 4, 2, 1, 2, 4, 2,
3, 4, 1, 3, 3, 2, 2, 3, 4, 1, 1, 2, 2, 4, 1, 2, 2, 3, 3, 4, 1,
1, 4, 4, 1, 4, 1, 1, 4, 3, 1, 2, 3, 2, 2, 1)), .Names = c("emp_id",
"job_code", "result"), row.names = c(NA, -100L), class = "data.frame")
What I'd like to do ideally is have the payouts within a data.frame but not sure how to reference it properly:
job_payouts <- data.frame(a = job_a, b = job_b, c = job_c)
# Won't work...
df %>% mutate(payout = job_payouts$job_code[result])

This can be achieved through the super cool method of matrix indexing in base R which is extremely fast and efficient.
# build jobs payout lookup matrix, by hand (see edit below for an extension)
jobs <- rbind(job_a, job_b, job_c)
# add row names to the matrix for convenient reference
rownames(jobs) <- levels(df$job_code)
# get payout using matrix indexing
df$payout <- jobs[cbind(df$job_code, df$result)]
This returns
# print out first 6 observations
head(df)
emp_id job_code result payout
1 493735 a 3 1000
2 501836 a 1 0
3 491644 b 2 200
4 515953 a 2 500
5 503295 a 2 500
6 491795 b 4 750
# print out jobs matrix for comparison
jobs
[,1] [,2] [,3] [,4]
a 0 500 1000 5000
b 0 200 500 750
c 0 250 750 1000
There are a couple of details that worth mentioning.
The data.frame function converts the job_code character vector, so that df$job_code is a factor variable where labels are associated with the natural numbers 1, 2, 3, ... By default, levels of the factor are ordered alphabetically by label so, in this example, the label "a" corresponds to 1, "b" to 2, and "c" to 3. You can use the levels function to find the order of the factor variable and construct the jobs matrix following that template.
The jobs matrix is used as a lookup table. It is constructed so that these integers refer to row numbers of the jobs matrix. Then, the columns can be subset as you do with the original payout vectors.
cbind(df$job_code, df$result) forms a 2 by nrow(df) (100) matrix which is used to look up the nrow(df) payoff values for each employee from the jobs matrix using matrix indexing. The R intro manual has a nice intro section on matrix indexing and additional details can be found in help("[").
Edit: Automating the construction of the lookup matrix
In the comments to this answer, the OP expresses concern that building the lookup matrix (which I called "jobs"), by hand would be tedious and prone to error. To address these valid concerns, we can use a somewhat obscure argument to the mget function, "ifnotfound." This argument allows us to control the output of elements of the list that mget returns when they are present in the vector of names, but not present in the environment.
In the comments, I suggested using NA to fill in missing levels in the comment below. We can extend this by using NA as the input for "ifnotfound."
Suppose df$job_code is a factor that has levels "a", "aa", "b", and "c" in that order. Then we build the look up matrix as follows:
# build vector for example, the actual code, using levels(), follows as a comment
job_codes <- c("a", "aa", "b", "c") # job_codes <- levels(df$jobcodes)
# get ordered list of payouts, with NA for missing payouts
payoutList <- mget(paste0("job_", job_codes), ifnotfound=NA)
which returns a named list.
payoutList
$job_a
[1] 0 500 1000 5000
$job_aa
[1] NA
$job_b
[1] 0 200 500 750
$job_c
[1] 0 250 750 1000
Note that payoutList$job_aa is a single NA. Now, build the matrix from this list.
# build lookup matrix using do.call() and rbind()
jobs.lookupMat <- do.call(rbind, payoutList)
jobs.lookupMat
[,1] [,2] [,3] [,4]
job_a 0 500 1000 5000
job_aa NA NA NA NA
job_b 0 200 500 750
job_c 0 250 750 1000
The rows of the matrix are properly ordered according to the levels of the factor df$job_code, conveniently named, and NAs fill in rows wherever there is no payout.

Using tools from tidyverse:
library(dplyr)
library(stringr)
library(tidyr)
# your data
set.seed(1)
emp_id <- round(rnorm(100, 500000, 10000))
job_code <- sample(c('a', 'b', 'c'), 100, replace = TRUE)
result <- sample(c(1,2,3,4), 100, replace = TRUE)
# construct a data frame
df <-
data.frame(emp_id = emp_id,
job_code = job_code,
result = result,
stringsAsFactors = FALSE)
# your jobs
job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)
# construct a data frame
my_job <-
data.frame(job_a, job_b, job_c) %>%
gather(job, value) %>%
group_by(job) %>%
mutate(result = 1:n(),
job_code = str_replace(job, "job_", "")) %>%
ungroup %>%
select(-job)
# join df and my_job into my_results table
my_results <-
left_join(df, my_job)
Results:
my_results %>% tbl_df
Source: local data frame [100 x 4]
emp_id job_code result value
(dbl) (chr) (dbl) (dbl)
1 493735 a 3 1000
2 501836 a 1 0
3 491644 b 2 200
4 515953 a 2 500
5 503295 a 2 500
6 491795 b 4 750
7 504874 b 1 0
8 507383 a 4 5000
9 505758 a 1 0
10 496946 c 2 250
.. ... ... ... ...

Without changing your data structure, you can do this by defining a function:
job_search <- function(code){
var_name <- paste0("job_",code)
if (exists(var_name)){
return(get(var_name))
}else{
return(NA)
}
}
library(data.table)
setDT(df)
df[, payout := job_search(job_code)[result], by = .(emp_id)]
df
emp_id job_code result payout
1: 493735 a 3 1000
2: 501836 a 1 0
3: 491644 b 2 200
4: 515953 a 2 500
5: 503295 a 2 500
6: 491795 b 4 750
7: 504874 b 1 0
8: 507383 a 4 5000
9: 505758 a 1 0
10: 496946 c 2 250
11: 515118 c 1 0
12: 503898 a 1 0
...
However, this is a fairly unstable way to keep your data, and the paste + get syntax is convoluted.
A better way to store your data would be in a lookup table:
library(data.table)
job_a <- data.frame(payout = c(0, 500, 1000, 5000))
job_b <- data.frame(payout = c(0, 200, 500, 750))
job_c <- data.frame(payout = c(0, 250, 750, 1000))
job_lookup <- rbindlist( #this is a data.table
l = list(a = job_a,b = job_b,c = job_c),
idcol = TRUE
)
# create your result index
job_lookup[, result := 1:.N, by = .id]
job_lookup
.id payout result
1: a 0 1
2: a 500 2
3: a 1000 3
4: a 5000 4
5: b 0 1
6: b 200 2
7: b 500 3
8: b 750 4
9: c 0 1
10: c 250 2
11: c 750 3
12: c 1000 4
# merge to your initial data.frame
merge(df, job_lookup, by.x = c("job_code","result"), by.y = c(".id","result"), all.x = TRUE)
job_code result emp_id payout
1 a 1 505758 0
2 a 1 501836 0
3 a 1 503898 0
4 a 1 494575 0
5 a 1 487464 0
6 a 1 503700 0
7 a 1 505939 0
8 a 1 503330 0
9 a 1 512079 0
10 a 1 481950 0
11 a 1 507685 0
12 a 1 490659 0
...

Ordered values - select first instance of lowest value, then first instance of next lowest subsequent value and so on

I have a dataframe with with many distinct UniqueIDs, and which are also ordered by dates. Each UniqueID is sorted from oldest date to newest date. We also have a column called steps which is ordered from 1 to 4.
The goal is for each UniqueID is to find the oldest instance of the first Step, then the oldest instance of the second step etc. Some steps may be missing, for instance step 3 is missing for UniqueID = "B". In this case we skip over Step 3 and move on to step 4.
Here is the original dataframe.
UniqueID Date Step
1 A 2015-07-03 2
2 A 2015-07-07 3
3 A 2015-07-09 1
4 A 2015-07-14 4
5 A 2015-07-17 1
6 A 2015-07-20 2
7 A 2015-07-23 2
8 A 2015-07-24 3
9 A 2015-07-29 3
10 B 2015-06-01 3
11 B 2015-06-15 2
12 B 2015-06-22 1
13 B 2015-06-29 4
14 B 2015-07-13 2
15 B 2015-06-22 2
16 B 2015-07-08 2
17 B 2015-07-27 4
The valid entries we want to select are observations 3, 6, 8, 12, 14, 17. Creating this dataframe:
UniqueID Date Step
3 A 2015-07-09 1
6 A 2015-07-20 2
8 A 2015-07-24 3
12 B 2015-06-22 1
14 B 2015-07-13 2
17 B 2015-07-27 4
I have the logic and some pseudo code but can't put it together. So in the example data frame for UniqueID = "A" we would first group the dataframe:
group_by(UniqueID)
The find the lowest value for UniqueID = "A" and assign to a variable.
v <- min(Step)returns 1
Then take the index for this step
i <- which.min(Step) returns 3
We then want to find the min step that is greater than the first step, and only search the elements that occur after the first step. So now we are only searching for values of Step which are > 1, and only from the position of the first value we found onward, in this case from observation 3. We want to keep repeating this for all observations of each UniqueID until we either reach the last observation, or can no longer find an observation that is greater than the last observation in the remaining elements.
Here is the dput for creating the example dataframe:
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Date = structure(c(16619, 16623, 16625,
16630, 16633, 16636, 16639, 16640, 16645, 16587, 16601, 16608,
16615, 16629, 16608, 16624, 16643), class = "Date"), Step = c(2,
3, 1, 4, 1, 2, 2, 3, 3, 3, 2, 1, 4, 2, 2, 2, 4)), .Names = c("UniqueID",
"Date", "Step"), row.names = c(NA, -17L), class = "data.frame")
Alternative dput which crashes using jeremycg's method.
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L, 9L, 9L, 10L, 11L), .Label = c("A","B",
"C","D","E","F","G","H","I","J","K"),
class = "factor"), Date = c("3/08/2015",
"21/07/2015", "7/07/2015", "7/07/2015", "29/07/2015", "29/07/2015",
"29/06/2015", "13/07/2015", "9/07/2015", "29/07/2015", "24/07/2015",
"2/07/2015", "16/07/2015", "18/06/2015", "8/07/2015", "29/07/2015",
"12/06/2015", "27/07/2015"), Step = c(1, 1, 4, 4, 4, 3,
5, 5, 1, 4, 1, 2, 2, 2, 3, 3, 2, 2)), .Names = c("UniqueID",
"Date", "Step"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -18L))
Edit: dput of UniqueID that continues to crash even using updated code from jeremycg :
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("A" ), class = "factor"), Date = structure(c(16619, 16623, 16625, 16630, 16633, 16636), class = "Date"), Step = c(1, 5, 5, 1, 1, 1)), .Names = c("UniqueID", "Date", "Step"), row.names = c(NA, -6L), class = "data.frame")

Pretty inefficient, but working.
First define a function:
myseq <- function(df){
if(which.min(df$Step) == nrow(df)){
return(list(df[nrow(df),]))
}
store <- vector(mode = "list", length = nrow(df))
i=1
while(any(!is.na(df$Step))){
store[[i]] <- df[which.min(df$Step),]
df <- df[which.min(df$Step) : nrow(df), ]
df$Step[df$Step == min(df$Step)] <- NA
i = i+1
}
store
}
Then wrap it on the dataframe using dplyr:
library(dplyr)
dta %>% group_by(UniqueID) %>%
do(do.call(rbind, myseq(.)))
Source: local data frame [6 x 3]
Groups: UniqueID
UniqueID Date Step
1 A 2015-07-09 1
2 A 2015-07-20 2
3 A 2015-07-24 3
4 B 2015-06-22 1
5 B 2015-07-13 2
6 B 2015-07-27 4

How to efficiently add an order variable to a big data frame

I have a data frame similar to the following but with 55.000 observations and about 50.000 groups:
d <- structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), b = c(1, 1, 2, 1, 2, 1, 2)), .Names = c("a",
"b"), row.names = c(1L, 3L, 2L, 4L, 5L, 6L, 7L), class = "data.frame")
As in this data frame each group is again ordered depending on variable "b". I now like to split the data frame according to the grouping variable "a" and add a vector that indicates the ordering number of each element of each sub-data-frame. So the result should look like this:
structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), b = c(1, 2, 1, 1, 2, 1, 2), order = c(1,
2, 3, 1, 2, 1, 2)), .Names = c("a", "b", "order"), row.names = c("1",
"2", "3", "4", "5", "6", "7"), class = "data.frame")
I was able to get this result with the split() function an my own gmark() function on the test data frame like this (gmark() assumes the input is already sorted):
gmark <- function(input){
x = 0
result = vector()
for(i in input){
x <- x+1
result <- append(result, x)
}
result
}
x <- split(d, d$a)
x <- lapply(x, function(x){cbind(x, order = gmark(x$b))})
x <- unsplit(x, a)
However, once I apply this to the bigger data frame split() gets very slow and does not return. Is there a way to get this result on a bigger data frame more efficiently?

Here's a solution with data.table package. This'll be much faster.
require(data.table)
DT <- as.data.table(DF)
DT[, order := 1:.N, by=a]
> DT
a b order
1: A 1 1
2: A 2 2
3: A 1 3
4: B 1 1
5: B 2 2
6: C 1 1
7: C 2 2
:= is a data.table operator, that adds columns by reference (meaning no copy of your data is made). And .N is a special variable that contains the length of each group (here, it'll hold 3,2,2 corresponding to A,B,C).

Your code is not completely runnable. Try this (the second-to-last line makes the difference):
d <- data.frame(
a = sample(LETTERS[1:5],5e4,replace=TRUE),
b = sample(letters[1:10],5e4,replace=TRUE))
x <- split(d,d$a)
y <- lapply(x, function(x){cbind(x, order = 1:nrow(x))})
z <- unsplit(y,d$a)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replace 0 when first observation for a level factor R - r

Related

How to do a binary if statement in R?

Sample n consecutive dates from a random starting date for each index in a data frame

How can I reference a list based on a variable within a data.frame?

Ordered values - select first instance of lowest value, then first instance of next lowest subsequent value and so on

How to efficiently add an order variable to a big data frame

Categories

Resources