I have a large data set, 150k rows, ~11 MB in size. Each row contains an hourly measure of profit, which can be positive, negative, or zero. I am trying to calculate a new variable equal to the profit of each positive "block." Hopefully this is self-explanatory in the data set below.
"Profit" is the input variable. I can get the next two columns but can't solve for "profit_block". Any help would be much appreciated!
dat <- data.frame(profit = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20),
indic_pos = c( 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
cum_profit = c(20, 30, 35, 45, 0, 0, 0, 500, 527, 0),
profit_block = c(45, 45, 45, 45, 0, 0, 0, 527, 527, 0))
profit indic_pos cum_profit profit_block
1 20 1 20 45
2 10 1 30 45
3 5 1 35 45
4 10 1 45 45
5 -20 0 0 0
6 -100 0 0 0
7 -40 0 0 0
8 500 1 500 527
9 27 1 527 527
10 -20 0 0 0
I've found the following post below very helpful, but I can't quite conform it to my need here. Thanks again.
Related URL: Assigning a value to each range of consecutive numbers with same sign in R
We can use rleid() from data.table to create a grouping based on the sign of the column (adjacent elements with the same sign form a single group) and then take the max of 'cum_profit':
library(dplyr)
library(data.table) # for rleid()
dat %>%
  group_by(grp = rleid(sign(profit))) %>%
  mutate(profit_block2 = max(cum_profit)) %>%
  ungroup() %>%
  select(-grp)
-output
# A tibble: 10 x 5
# profit indic_pos cum_profit profit_block profit_block2
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20 1 20 45 45
# 2 10 1 30 45 45
# 3 5 1 35 45 45
# 4 10 1 45 45 45
# 5 -20 0 0 0 0
# 6 -100 0 0 0 0
# 7 -40 0 0 0 0
# 8 500 1 500 527 527
# 9 27 1 527 527 527
#10 -20 0 0 0 0
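For completeness, a base R sketch of the same grouping idea, using cumsum() over sign changes in place of rleid() (profit_block2 is my name for the result column):

```r
dat <- data.frame(profit     = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20),
                  cum_profit = c(20, 30, 35, 45, 0, 0, 0, 500, 527, 0))
# group id increments whenever the sign of profit flips
grp <- cumsum(c(TRUE, diff(sign(dat$profit)) != 0))
# per-group max of the running profit
dat$profit_block2 <- ave(dat$cum_profit, grp, FUN = max)
```

This avoids the data.table dependency if you only need the one column.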
I want to model whether I can respond to an event. This depends on when I last responded to an event. I need to take these conditions into account:
Events can occur throughout the day, but I can only respond between 7 am and 11pm.
Time between events can vary, but time between responses must be at least 90 min.
In other words, you can only respond to a new event if your last response was at least 90 min ago.
Importantly, I don't want a 1 whenever the time between events is >90; I only want a 1 when the time between an event and the last response is >90.
structure(list(event_day = c(0L, 0L, 0L, 0L, 0L, 0L), event_hr = c(1,
8, 9, 9, 10, 12), event_minute = c(41L, 25L, 22L, 41L, 26L, 1L
), onset_time = c(101, 505, 562, 581, 626, 721)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Onset_time is the time since start of the model in minutes. We would like to have a “respond_col” with 1 when we can respond and 0 when we cannot respond. For these 6 rows, the respond_col is supposed to result in 0,1,0,0,1,1.
This sums up what I want to do but I don’t know how to code this:
If difference in onset_time>90 since last 1 in respond_col, print 1 in respond_col, else print 0 in respond_col.
Hope you can help me!
This requires a few data modifications and a for loop, plus two libraries: hms and tidyverse. I added rows to your data frame to test some of the conditions you mentioned.
library(hms)
library(tidyverse)
dat <- read.table(header = T, text = "
event_day event_hr event_minute onset_time
1 0 1 41 101
2 0 8 25 505
3 0 9 22 562
4 0 9 41 581
5 0 10 26 626
6 0 12 1 721")
# add rows for testing
dat <- do.call("rbind",
list(dat, c(0, 12, 59, 721 + 58),
c(0, 14, 20, 721 + 58 + 21 + 60),
c(0, 23, 5, 860 + 45 + 8 * 60),
c(1, 7, 5, 860 + 45 + 16 * 60))) %>% as.data.frame()
# event_day event_hr event_minute onset_time
# 1 0 1 41 101
# 2 0 8 25 505
# 3 0 9 22 562
# 4 0 9 41 581
# 5 0 10 26 626
# 6 0 12 1 721
# 7 0 12 59 779
# 8 0 14 20 860
# 9 0 23 5 1385
# 10 1 7 5 1865
The next step requires a vector storing the time-of-day thresholds (7 am and 11 pm) and the following additions to dat: a column with the time differences, a flag indicating whether the time meets the 7 am to 11 pm criterion, and two columns initialized to 0 (accumulated time and response) that get filled in by the for loop. The function hms() is from the hms library.
these <- c(hms(0, 0, 7), hms(0, 0, 23)) # day constraints
(dat1 <- dat %>% mutate(
time = c(0, diff(onset_time)), # 0 for first row, then the rest
time_avail = between(hms(hours = event_hr, minutes = event_minute),
these[1], these[2]),
# accumulated time since last reset; whether response is warranted (conditions met)
accum_time = 0, response = 0))
# event_day event_hr event_minute onset_time time time_avail accum_time response
# 1 0 1 41 101 0 FALSE 0 0
# 2 0 8 25 505 404 TRUE 0 0
# 3 0 9 22 562 57 TRUE 0 0
# 4 0 9 41 581 19 TRUE 0 0
# 5 0 10 26 626 45 TRUE 0 0
# 6 0 12 1 721 95 TRUE 0 0
# 7 0 12 59 779 58 TRUE 0 0
# 8 0 14 20 860 81 TRUE 0 0
# 9 0 23 5 1385 525 FALSE 0 0
# 10 1 7 5 1865 480 TRUE 0 0
For the for statement, I'm using a boolean flag: reset for when the cumulative time resets.
reset <- FALSE # flag: TRUE right after a response, so accumulated time restarts
for(j in 1:nrow(dat1)) {
  if(j == 1 | reset) { # first row, or the row right after a response
    dat1$accum_time[j] <- dat1$time[j]
    reset <- FALSE
  } else { # any other row keeps accumulating
    dat1$accum_time[j] <- dat1$accum_time[j - 1] + dat1$time[j]
  }
  # determine whether to trigger the reset
  if(dat1$accum_time[j] > 90 & dat1$time_avail[j]) {
    dat1$response[j] <- 1
    reset <- TRUE
  }
}
dat1
# event_day event_hr event_minute onset_time time time_avail accum_time response
# 1 0 1 41 101 0 FALSE 0 0
# 2 0 8 25 505 404 TRUE 404 1
# 3 0 9 22 562 57 TRUE 57 0
# 4 0 9 41 581 19 TRUE 76 0
# 5 0 10 26 626 45 TRUE 121 1
# 6 0 12 1 721 95 TRUE 95 1
# 7 0 12 59 779 58 TRUE 58 0
# 8 0 14 20 860 81 TRUE 139 1
# 9 0 23 5 1385 525 FALSE 525 0
# 10 1 7 5 1865 480 TRUE 1005 1
Let me know if you have any questions.
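For reference, the same running reset can also be written without index bookkeeping, using Reduce() with accumulate = TRUE. A minimal sketch using just the time and time_avail columns computed above (step, init, and states are names I'm introducing here):

```r
# time and time_avail as computed for dat1 above
dat1 <- data.frame(
  time       = c(0, 404, 57, 19, 45, 95, 58, 81, 525, 480),
  time_avail = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE))
step <- function(state, i) {
  # restart the accumulator right after a response, otherwise keep adding
  accum <- if (state$fired) dat1$time[i] else state$accum + dat1$time[i]
  list(accum = accum, fired = accum > 90 && dat1$time_avail[i])
}
init <- list(accum = dat1$time[1],
             fired = dat1$time[1] > 90 && dat1$time_avail[1])
states <- Reduce(step, seq_len(nrow(dat1))[-1], init, accumulate = TRUE)
dat1$response <- as.integer(vapply(states, function(s) s$fired, logical(1)))
```

The state carried through Reduce() is a two-element list (accumulated minutes and whether the previous row fired), which mirrors what the explicit loop does.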
I have a very wide dataset with multiple psychometric scales and I would like to remove rows if any of a handful of columns contains zero (i.e., a missing response).
I know how to do it when the data frame is small, but my method is not scalable. For example,
dftry <- data.frame(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37), z=c(12, 0, 33, 22, 23))
x y z
1 1 0 12
2 2 10 0
3 5 5 33
4 3 3 22
5 0 37 23
# Remove row if it has 0 in y or z columns
# (inside filter(), separating conditions with a comma behaves the same as &)
dftry %>% filter(y > 0 & z > 0)
x y z
1 5 5 33
2 3 3 22
3 0 37 23
In my actual data, I want to remove rows if there are zeroes in any of these columns:
# this is the most succinct way of selecting the columns in question
select(c(1:42, contains("BMIS"), "hamD", "GAD"))
You can use rowSums():
cols <- c('y', 'z')
dftry[rowSums(dftry[cols] == 0, na.rm = TRUE) == 0, ]
# x y z
#1 5 5 33
#2 3 3 22
#3 0 37 23
We can integrate this into dplyr for your real use-case.
library(dplyr)
dftry %>%
filter(rowSums(select(.,
c(1:42, contains("BMIS"), "hamD", "GAD")) == 0, na.rm = TRUE) == 0)
Does this work using dplyr?
> library(dplyr)
> dftry
x y z a b c BMIS_1 BMIS_3 hamD GAD m n
1 1 0 12 1 0 12 1 0 12 12 12 12
2 2 10 0 2 10 0 2 10 0 0 0 0
3 5 5 33 5 5 33 5 5 33 33 33 33
4 3 3 22 3 3 22 3 3 22 22 22 22
5 0 37 23 0 37 23 0 37 23 23 23 23
> dftry %>% select(c(1:3,contains('BMIS'), hamD, GAD)) %>% filter_all(all_vars(. != 0))
x y z BMIS_1 BMIS_3 hamD GAD
1 5 5 33 5 5 33 33
2 3 3 22 3 3 22 22
Data used:
> dput(dftry)
structure(list(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37),
z = c(12, 0, 33, 22, 23), a = c(1, 2, 5, 3, 0), b = c(0,
10, 5, 3, 37), c = c(12, 0, 33, 22, 23), BMIS_1 = c(1, 2,
5, 3, 0), BMIS_3 = c(0, 10, 5, 3, 37), hamD = c(12, 0, 33,
22, 23), GAD = c(12, 0, 33, 22, 23), m = c(12, 0, 33, 22,
23), n = c(12, 0, 33, 22, 23)), class = "data.frame", row.names = c(NA,
-5L))
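Note that filter_all()/all_vars() are superseded in current dplyr; an equivalent sketch with if_all() (available in dplyr >= 1.0.4), shown here on the small dftry for brevity:

```r
library(dplyr)
dftry <- data.frame(x = c(1, 2, 5, 3, 0),
                    y = c(0, 10, 5, 3, 37),
                    z = c(12, 0, 33, 22, 23))
# keep rows where every selected column is non-zero
res <- dftry %>% filter(if_all(c(y, z), ~ .x != 0))
```

The same selection helpers (contains("BMIS"), column positions, etc.) work inside if_all() for the real-sized data.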
I have the following data frame in R. For this experiment I was testing the survival of cells at several times with 2 treatments, and 2 replicates for each treatment. I want to calculate the percentage of cells alive at each time for each treatment/replicate.
For example, for Treat 1 Rep 1 it would be 500/500, 470/500, 100/500, 20/500, for Treat 2 Rep 1 it would be 430/430, 420/430, 300/430, 100/430
Thanks!
x <- data.frame("treatment"= c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
"rep"=c(1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2),
"Time" = c(0, 30, 60, 180, 0, 30, 60, 180, 0, 30, 60, 180,0, 30, 60, 180 ),
"cells_alive" = c(500, 470, 100, 20, 476, 310, 99, 2, 430, 420, 300, 100, 489, 451, 289, 4))
We can group by 'treatment' and 'rep', then calculate the proportion 'prop' by dividing 'cells_alive' by the value of 'cells_alive' where 'Time' is 0.
library(dplyr)
x1 <- x %>%
group_by(treatment, rep) %>%
mutate(prop = cells_alive/cells_alive[Time == 0])
-output
x1
# A tibble: 16 x 5
# Groups: treatment, rep [4]
# treatment rep Time cells_alive prop
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 500 1
# 2 1 1 30 470 0.94
# 3 1 1 60 100 0.2
# 4 1 1 180 20 0.04
# 5 1 2 0 476 1
# 6 1 2 30 310 0.651
# 7 1 2 60 99 0.208
# 8 1 2 180 2 0.00420
# 9 2 1 0 430 1
#10 2 1 30 420 0.977
#11 2 1 60 300 0.698
#12 2 1 180 100 0.233
#13 2 2 0 489 1
#14 2 2 30 451 0.922
#15 2 2 60 289 0.591
#16 2 2 180 4 0.00818
Or with match():
x %>%
group_by(treatment, rep) %>%
mutate(prop = cells_alive/cells_alive[match(0, Time)])
Or, if 'Time' is already ordered within each group:
x %>%
group_by(treatment, rep) %>%
mutate(prop = cells_alive/first(cells_alive))
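A base R sketch of the same computation with ave(), assuming rows are ordered with Time == 0 first within each treatment/rep group:

```r
x <- data.frame(treatment   = rep(1:2, each = 8),
                rep         = rep(rep(1:2, each = 4), 2),
                Time        = rep(c(0, 30, 60, 180), 4),
                cells_alive = c(500, 470, 100, 20, 476, 310, 99, 2,
                                430, 420, 300, 100, 489, 451, 289, 4))
# divide each group's counts by its first (Time == 0) value
x$prop <- ave(x$cells_alive, x$treatment, x$rep, FUN = function(v) v / v[1])
```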
[ First Stack question please be kind :) ]
I'm creating multiple new columns in a data frame based on multiple conditional statements of existing columns - all essentially new combinations of columns.
For example, if there are 4 columns (a:d), I need new columns of all combinations (abcd, abc, abd, etc) and a 0/1 coding based on threshold data in a:d.
Toy data example included along with the desired outcome. However, it needs to be scalable: there are 4 base columns, and I need all combinations of 2, 3 and 4 of them (abc, abd, ..., ab, ac, ad, ...; total n = 11), not just the 3-column ones.
[Background for context: this is actually flow cytometry data from multipotent stem cells that can grow into colonies of all lineage cell types (multipotent, or abcd) or progressively more restricted populations (only abc, or abd, ab, ac, etc).]
# Toy data set
library(tibble)
set.seed(123)
df <- tibble(a = sample(10:50, 10),
             b = sample(10:50, 10),
             c = sample(10:50, 10),
             d = sample(10:50, 10))
Current code produces the desired result however, this needs 11 lines of repetitive code which is error prone and I hope has a more elegant solution:
df %>%
mutate(
abcd = if_else(a > 30 & b > 20 & c > 30 & d > 30, 1, 0),
abc = if_else(a > 30 & b > 20 & c > 30 & d <= 30, 1, 0),
abd = if_else(a > 30 & b > 20 & c <= 30 & d > 30, 1, 0),
acd = if_else(a > 30 & b <= 20 & c > 30 & d > 30, 1, 0),
bcd = if_else(a <= 30 & b > 20 & c > 30 & d > 30, 1, 0))
From what I understand, for each row you just need to find which columns meet the criteria defined in your if_else() conditions. This vectorized solution adds a column to your df which contains the matching combination; it is probably also faster than multiple if_else() calls. The new column can then be used for ordering or grouping.
# define the threshold levels for all columns
threshold = c(a=30, b=20, c=30, d=30)
# get names of columns meeting the threshold and paste names
df$combn <- apply(df, 1, function(x) {
paste(names(x)[x > threshold], collapse = "")
})
> df
# A tibble: 10 x 5
a b c d combn
<int> <int> <int> <int> <chr>
1 21 49 46 49 bcd
2 41 28 37 46 abcd
3 25 36 34 36 bcd
4 43 31 47 40 abcd
5 44 13 48 10 ac
6 11 42 35 27 bc
7 28 18 29 48 d
8 40 11 30 17 a
9 46 20 19 20 a
10 24 40 14 43 bd
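If the 0/1 indicator columns are still required, the combn column can be expanded programmatically rather than with 11 hand-written lines. A sketch (targets is a name I'm introducing; it enumerates all 2-, 3- and 4-column combinations, 6 + 4 + 1 = 11):

```r
library(tibble)
set.seed(123)
df <- tibble(a = sample(10:50, 10), b = sample(10:50, 10),
             c = sample(10:50, 10), d = sample(10:50, 10))
# define the threshold levels for all columns
threshold <- c(a = 30, b = 20, c = 30, d = 30)
# label each row with the columns exceeding their threshold
df$combn <- apply(df[names(threshold)], 1, function(x) {
  paste(names(x)[x > threshold], collapse = "")
})
# all combinations of 2, 3 and 4 column names
targets <- unlist(lapply(2:4, function(k)
  combn(names(threshold), k, paste, collapse = "")))
# one 0/1 indicator column per combination
for (t in targets) df[[t]] <- as.integer(df$combn == t)
```

Rows whose combination is a single column (or none) simply get 0 in every indicator column.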
If I get that correctly, you want to categorize each row into exactly one class, so getting the category name as concatenation of threshold tests should be enough. Then you can get 0/1 columns using spread():
library(dplyr)
library(tidyr)
df %>%
  mutate(
    a_ = if_else(a > 30, 'a', 'x'),
    b_ = if_else(b > 20, 'b', 'x'),
    c_ = if_else(c > 30, 'c', 'x'),
    d_ = if_else(d > 30, 'd', 'x'),
    all_ = paste0(a_, b_, c_, d_),
    one_ = 1) %>%
  spread(all_, one_, fill = 0) %>%
  select(-ends_with("_"))
Gives
# A tibble: 10 x 11
a b c d abcd axcx axxx xbcd xbcx xbxd xxxd
<int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 42 35 27 0 0 0 0 1 0 0
2 21 49 46 49 0 0 0 1 0 0 0
3 24 40 14 43 0 0 0 0 0 1 0
4 25 36 34 36 0 0 0 1 0 0 0
5 28 18 29 48 0 0 0 0 0 0 1
6 40 11 30 17 0 0 1 0 0 0 0
7 41 28 37 46 1 0 0 0 0 0 0
8 43 31 47 40 1 0 0 0 0 0 0
9 44 13 48 10 0 1 0 0 0 0 0
10 46 20 19 20 0 0 1 0 0 0 0
(You can use '' instead of 'x', but then spread() will overwrite some of your original columns.)
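spread() is superseded in current tidyr; the same reshaping can be sketched with pivot_wider(), where values_fill plays the role of fill:

```r
library(dplyr)
library(tidyr)
set.seed(123)
df <- tibble(a = sample(10:50, 10), b = sample(10:50, 10),
             c = sample(10:50, 10), d = sample(10:50, 10))
res <- df %>%
  mutate(a_ = if_else(a > 30, "a", "x"),
         b_ = if_else(b > 20, "b", "x"),
         c_ = if_else(c > 30, "c", "x"),
         d_ = if_else(d > 30, "d", "x"),
         # category label per row, then a constant to pivot into 0/1 cells
         all_ = paste0(a_, b_, c_, d_),
         one_ = 1) %>%
  pivot_wider(names_from = all_, values_from = one_, values_fill = 0) %>%
  select(-ends_with("_"))
```

Each row ends up with exactly one 1 across the generated category columns.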
I have seen other questions similar to this, but they do not answer my question. I want to expand my dataset because I need to create a time-varying variable for survival analysis, and I would like to use the survSplit command (survival package), but my data is already partially in long format. Example data:
data1<-structure(list(id = c(1, 1, 1, 1, 5, 5, 5, 5, 5, 7, 7, 7, 7,
7, 7), start = c(0, 183, 210, 241, 0, 183, 187, 212, 244, 0,
118, 139, 188, 212, 237), no_days = c(NA, 28L, 28L, 28L, NA,
7L, 28L, 28L, 28L, NA, 28L, 28L, 28L, 28L, 28L), stop = c(NA,
211, 238, 269, NA, 190, 215, 240, 272, NA, 146, 167, 216, 240,
265), drug = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
dead = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)), .Names = c("id",
"start", "no_days", "stop", "drug", "dead"), row.names = c(NA,
15L), class = "data.frame")
> head(data1,15)
id start no_days stop drug dead
1 1 0 NA NA 0 0
2 1 183 28 211 1 0
3 1 210 28 238 1 0
4 1 241 28 269 1 1
5 5 0 NA NA 0 0
6 5 183 7 190 1 0
7 5 187 28 215 1 0
8 5 212 28 240 1 0
9 5 244 28 272 1 1
10 7 0 NA NA 0 0
11 7 118 28 146 1 0
12 7 139 28 167 1 0
13 7 188 28 216 1 0
14 7 212 28 240 1 0
15 7 237 28 265 1 1
Start is the day the drug was prescribed, no_days is how long the prescription was for, drug indicates whether a person was on the drug for the given time period (this is the variable I need to make time-varying), dead indicates when a person died. At the moment the dataset only contains times an individual was on the drug so the final dataset I want should look like this:
head(data1,18)
id start no_days stop drug dead
1 1 0 NA 182 0 0
2 1 183 28 211 1 0
3 1 210 28 238 1 0
4 1 239 NA 240 0 0
5 1 241 28 269 1 1
6 5 0 NA 182 0 0
7 5 183 7 190 1 0
8 5 187 28 215 1 0
9 5 212 28 240 1 0
10 5 241 NA 243 0 0
11 5 244 28 272 1 1
12 7 0 NA 117 0 0
13 7 118 28 146 1 0
14 7 139 28 167 1 0
15 7 168 NA 187 0 0
16 7 188 28 216 1 0
17 7 212 28 240 1 0
18 7 237 28 265 1 1
Maybe this is a standard data-manipulation problem where I need to add rows based on certain criteria, but since this is survival data and survSplit was designed for it (albeit starting from a slightly different data structure), I was wondering: is there an easy way to use survSplit to solve my problem? If not, does anyone have a simple suggestion for expanding the data frame?
My ultimate step is to fit a cox model something like:
coxph(Surv(start, stop, dead) ~ covariates + drug + cluster(id), data = data1)
Thanks for any suggestions.
Consider the following data wrangling with base R, where essentially you merge the dataframe with itself shifted by one row to align each record with the next, and then transform for the start and stop calculations.
Note: merge will raise a warning (not an error) about the duplicated nextidcnt column. Either ignore it or create a second copy of data1 for the merge, using id and idcnt (shifted by one in the new data frame) as join keys.
# OBTAIN GROUP COUNT (FOR MERGE IDs)
data1$idcnt <- sapply(1:nrow(data1), function(i) sum(data1[1:i, c("id")] == data1$id[i]))
data1$nextidcnt <- data1$idcnt + 1
# MERGE
dfm <- merge(data1, data1, by.x=c("id", "nextidcnt"), by.y=c("id", "idcnt"))
# CALCULATE NEW COLUMNS
dfm <- transform(dfm,
                 start = ifelse(is.na(stop.x), start.x, stop.x + 1),
                 no_days = NA, # gap rows have no prescription length
                 stop = start.y - 1,
                 drug = 0,
                 dead = dead.x)
# ROW BIND ORIGINAL SUBSET WITH NEW ROWS
finaldf <- rbind(data1[data1$start != 0, c(1:6)],
                 dfm[dfm$start < dfm$stop,
                     c("id", "start", "no_days", "stop", "drug", "dead")])
finaldf <- finaldf[with(finaldf, order(id, start, stop)),] # ORDER BY ID, START, STOP
rownames(finaldf) <- NULL # RESET ROW NAMES
Output
finaldf
#    id start no_days stop drug dead
# 1   1     0      NA  182    0    0
# 2   1   183      28  211    1    0
# 3   1   210      28  238    1    0
# 4   1   239      NA  240    0    0
# 5   1   241      28  269    1    1
# 6   5     0      NA  182    0    0
# 7   5   183       7  190    1    0
# 8   5   187      28  215    1    0
# 9   5   212      28  240    1    0
# 10  5   241      NA  243    0    0
# 11  5   244      28  272    1    1
# 12  7     0      NA  117    0    0
# 13  7   118      28  146    1    0
# 14  7   139      28  167    1    0
# 15  7   168      NA  187    0    0
# 16  7   188      28  216    1    0
# 17  7   212      28  240    1    0
# 18  7   237      28  265    1    1
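An equivalent sketch with dplyr, building the drug-free gap rows explicitly with lead() and binding them back on (gaps is a name I'm introducing; a gap row is needed whenever the next prescription starts after stop + 1):

```r
library(dplyr)
data1 <- data.frame(
  id      = rep(c(1, 5, 7), c(4, 5, 6)),
  start   = c(0, 183, 210, 241, 0, 183, 187, 212, 244,
              0, 118, 139, 188, 212, 237),
  no_days = c(NA, 28, 28, 28, NA, 7, 28, 28, 28, NA, 28, 28, 28, 28, 28),
  stop    = c(NA, 211, 238, 269, NA, 190, 215, 240, 272,
              NA, 146, 167, 216, 240, 265),
  drug    = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
  dead    = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))
gaps <- data1 %>%
  group_by(id) %>%
  mutate(next_start = lead(start)) %>%
  ungroup() %>%
  # a real gap exists only when the next prescription is not contiguous
  filter(!is.na(next_start), coalesce(stop, start) + 1 < next_start) %>%
  transmute(id, start = ifelse(is.na(stop), start, stop + 1),
            no_days = NA, stop = next_start - 1, drug = 0, dead = 0)
finaldf <- data1 %>%
  filter(start != 0) %>% # drop the placeholder first rows, now covered by gaps
  bind_rows(gaps) %>%
  arrange(id, start)
```

Overlapping prescriptions (e.g. a refill before the previous one runs out) correctly produce no gap row, since stop + 1 is not less than the next start.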