How to determine the first row after which a variable continuously decreases? - r

Problem
Given the following dataset, I want to find the first row after which vn continuously decreases. I know which() can be used, but I don't know how to ensure the decrease is continuous. Please guide me.
Data
df <- structure(list(Time = c(152.216666666667, 152.233333333333, 152.25,
152.266666666667, 152.283333333333, 152.3, 152.316666666667,
152.333333333333, 152.35, 152.366666666667, 152.383333333333),
vn = c(22.8733019569441, 22.8485877814354, 22.8539833863057,
22.8293883815954, 22.8347839864658, 22.8101348273251, 22.8047392224548,
22.7798917511031, 22.7744961462328, 22.7496737884944, 22.7442781836241
), diff_vn = c(0.00539560487035118, -0.024714175508727, 0.00539560487035118,
-0.0245950047103243, 0.00539560487035118, -0.0246491591406404,
-0.00539560487035118, -0.0248474713516487, -0.00539560487035118,
-0.0248223577383548, -0.00539560487035118), sign_diff_vn = c(1,
-1, 1, -1, 1, -1, -1, -1, -1, -1, -1)), row.names = c(NA,
-11L), class = "data.frame")

Try with diff
with(df, which(c(diff(sign_diff_vn) == 0, FALSE))[1])
[1] 6
Or it may also be done with cumsum:
v1 <- cumsum(df$sign_diff_vn >=0)
match(max(v1), v1) + 1
[1] 6
Or another option is rleid from data.table:
library(dplyr)
library(data.table)
df %>%
mutate(rn = row_number(), grp = rleid(sign_diff_vn)) %>%
filter(grp == max(grp) & sign_diff_vn < 0) %>%
pull(rn) %>%
first
[1] 6
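For the same task, a minimal base-R sketch (an addition of mine, not one of the answers above) is to run rle() on the existing sign_diff_vn column and take the start of its last run, assuming that last run is the block of -1s marking the uninterrupted decrease:
r <- rle(df$sign_diff_vn)
# start of the final run, i.e. the first row of the uninterrupted decrease
sum(r$lengths[-length(r$lengths)]) + 1
[1] 6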

Related

highlight cell if previous column meets specific condition R

I have a dataframe
library(flextable)
df = structure(list(col1 = c(1, NA, 1, 1, 1), col2 = c(NA, 1, NA,
1, 1), col3 = c(1, 1, NA, 1, NA), col4 = c(1, 1, 1, 1, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>% flextable()
I want to return the last 3 columns highlighted based on the following logic:
red if it is blank
green if and only if the preceding column was blank.
Based on this, I am trying to create a color matrix to identify the green highlights, but have hit a brick wall.
To identify the red matrix, I used the following code ifelse(is.na(df),"red","").
What would be the best method to identify the green labels?
Not the prettiest, but it works:
df=data.frame(col1 = c(1,NA,1,1,1,1),
col2 = c(NA,1,NA,1,1,1),
col3 = c(1,1,NA,1, NA,1),
col4 = c(1,1,1,1,NA,1))
df %>% flextable()
red = ifelse(is.na(df), 1, 0)   # 1 where the cell is blank
green = data.frame()
for (n in 1:(ncol(red) - 1)) {
  print(n)
  # green when the current column has a value but the preceding column is blank
  r = ifelse(red[, n] == 1 & red[, n + 1] == 0, 1, 0)
  green = rbind(green, r)
}
green = t(green)
colnames(green) = paste0("col", 2:4)
green
red[, 2:4]
ft = df[,2:4] %>%
flextable() %>%
bg(i = ~ is.na(col2), j = 1,bg='red') %>%
bg(i = ~ is.na(col3), j = 2,bg='red') %>%
bg(i = ~ is.na(col4), j = 3,bg='red') %>%
bg(i = ~ green[,1]==1,j = 1, bg='green') %>%
bg(i = ~ green[,2]==1,j = 2, bg='green') %>%
bg(i = ~ green[,3]==1,j = 3, bg='green')
ft
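Not part of the answer above, but for comparison, a more vectorized sketch of the same logic (assuming the same df as in the answer): a cell in col2:col4 is green when it has a value and the cell to its left is blank.
red_mat <- is.na(df)
green_mat <- !red_mat[, -1] & red_mat[, -ncol(df)]
green_mat   # TRUE where col2:col4 should be highlighted green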

gather 3 different detections of three different variables

I have a dataframe of 96074 obs. of 31 variables.
The first two variables are id and the date, then I have 9 columns with measurements (three different KPIs with three different time properties), then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform the data from wide to long, in order to make graphs with ggplot using facets, for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor; time of class Date; and the values of class numeric (I should add that I don't have negative measures in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
# gathering all of the variables into a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
# creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each id/date pair, then do some manipulation with separate/unite to get the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  gather(key, value, -id, -time) %>%
  mutate(type = str_extract(key, '[a-z]+_[a-z]'),
         age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
  select(-key) %>%
  spread(type, value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
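For what it's worth, a newer-tidyr sketch of the same reshaping (my addition, not part of the answer), using pivot_longer() with names_pattern to split the KPI name and the age in one step, then pivot_wider() to get one column per KPI:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    cols = -c(id, time),
    names_to = c("type", "age"),
    names_pattern = "(sum_d|sum_i|per_i_d)_(.*)",
    values_to = "value"
  ) %>%
  pivot_wider(names_from = type, values_from = value)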

Group by, convert to factor and extract levels as integers issue R?

I have 2 dataframes with the same columns (vars) and 2 different user ids:
df1:
structure(list(user_id = c(1, 1, 1, 1, 1, 1), obs_id = c("717b1913-0c0f-4963-8bc9-81a06a3bb1c0",
"717b1913-0c0f-4963-8bc9-81a06a3bb1c0", "717b1913-0c0f-4963-8bc9-81a06a3bb1c0",
"717b1913-0c0f-4963-8bc9-81a06a3bb1c0", "717b1913-0c0f-4963-8bc9-81a06a3bb1c0",
"717b1913-0c0f-4963-8bc9-81a06a3bb1c0"), timestamp = c(337837075445301,
337837075445301, 337837077455301, 337837077455301, 337837079457301,
337837079457301), acc_x = c(0.5363176, 0.5363176, 0.5243462,
0.5243462, 0.5243462, 0.5243462), acc_y = c(6.4693303, 6.4693303,
6.4693303, 6.4693303, 6.4693303, 6.4693303), acc_z = c(6.8093176,
6.8093176, 6.821289, 6.821289, 6.821289, 6.821289)), .Names = c("user_id",
"obs_id", "timestamp", "acc_x", "acc_y", "acc_z"), row.names = c(NA,
6L), class = "data.frame")
and df2:
structure(list(user_id = c(2, 2, 2, 2, 2, 2), obs_id = c("8027eac3-8839-498e-98b9-3b46da98d1f4",
"8027eac3-8839-498e-98b9-3b46da98d1f4", "8027eac3-8839-498e-98b9-3b46da98d1f4",
"8027eac3-8839-498e-98b9-3b46da98d1f4", "8027eac3-8839-498e-98b9-3b46da98d1f4",
"8027eac3-8839-498e-98b9-3b46da98d1f4"), timestamp = c(336965414272993,
336965414272993, 336965414272993, 336965416627384, 336965418627300,
336965420627376), acc_x = c(-1, -1, -1, 0.81644773, 0.80208206,
0.8140534), acc_y = c(-1, -1, -1, 6.648901, 6.646507, 6.651295
), acc_z = c(-1, -1, -1, 7.2618356, 7.257047, 7.233104)), .Names = c("user_id",
"obs_id", "timestamp", "acc_x", "acc_y", "acc_z"), row.names = c(NA,
6L), class = "data.frame")
Now I want to bind them, group by user_id, turn obs_id into a factor, and extract its levels as an integer column:
bind_rows(df1,df2) %>%
group_by(user_id) %>%
mutate(obs_id = as_factor(obs_id),
replicate = as.numeric(levels(obs_id)))
returns an error:
Error in mutate_impl(.data, dots) : Column replicate must be length
6 (the group size) or one, not 0
Please advise what I am doing wrong here.
I want the obs_id column to be turned into a factor, then take the levels and "encode" them as integers instead of the long strings you can see in obs_id.
After binding the datasets, convert 'obs_id' to factor and then do the group_by, because converting to factor within group_by causes a conflict as the levels can differ between groups. An easier option is to match 'obs_id' against the unique elements of 'obs_id':
bind_rows(df1, df2) %>%
group_by(user_id) %>%
mutate(Rep = match(obs_id, unique(obs_id)))
The issue is with storing a factor column whose levels differ across each 'user_id'. If the objective is just to get the 'Rep' column, we don't need an intermediate factor column:
bind_rows(df1, df2) %>%
group_by(user_id) %>%
mutate(Rep = as.integer(factor(obs_id)))
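As a quick check (my addition), with this sample data each user_id has a single obs_id, so both approaches give Rep = 1 for every row; with several obs_id values per user, match() numbers them in order of appearance while as.integer(factor()) numbers them by the sorted order of the levels.
bind_rows(df1, df2) %>%
  group_by(user_id) %>%
  mutate(Rep = match(obs_id, unique(obs_id))) %>%
  distinct(user_id, obs_id, Rep)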

Create a new variable using dplyr where, based on whether one variable has a specific value AND the previous or next value has a different value in R

I have data which looks like this
df <- data.frame(
ID = c(rep("A12345",5), rep("A23456",10), rep("A34567",5), "A45678", "A67891", rep("A78910",8), "A91011",
rep("A10111",4), rep("A11121",3), "A12131", "A16731"),
medication = c(rep("colchicine",5), rep("febuxosat",9), "hosps", rep("colchicine",5), "hosps", "colchicine",
rep("allopurinol",8), "allopurinol",
rep("colchicine",3), "hosps", rep("colchicine",3), "colchicine", "allopurinol"),
Date = c("2004-12-08", "2005-01-28", "2005-07-15", "2005-08-23", "2005-11-30", "2007-02-01", "2007-07-20", "2014-06-03",
"2008-04-17",
"2008-12-19", "2009-09-09", "2010-02-24", "2010-11-01", "2010-12-03", "2011-08-10", "2012-11-05", "2012-12-17",
"2012-12-19", "2013-10-03", "2013-12-11", "2014-03-26", "2015-11-12", "2014-08-07", "2008-01-31", "2008-02-21",
"2008-09-19", "2008-11-06", "2009-01-06", "2009-01-14", "2009-03-25", "2009-03-27", "2009-06-18", "2009-08-18",
"2009-09-08", "2009-11-13", "2010-01-21", "2010-04-19", "2010-07-07", "2010-08-06", "2010-08-19")
)
I then want to create a new year variable based on the date, group everyone by year and unique ID, and compute a variable that measures how many times each ID received medication in that year.
df <- df %>%
mutate(year = as.numeric(substr(Date, 1,4))) %>%
group_by(ID) %>%
mutate(meds_count = ifelse(medication %in% c("colchicine", "allopurinol", "febuxosat"), 1, 0)) %>%
unite(ID_year, ID, year, sep = "_", remove = FALSE) %>%
group_by(ID_year) %>%
mutate(meds_sum = sum(meds_count)) %>%
distinct(ID_year, .keep_all = TRUE)
Then I create a new variable 'gout', which is 1 if meds_sum is greater than or equal to 4, and 0 otherwise.
df <- df %>%
mutate(gout = ifelse(meds_sum >= 4, 1, 0))
Then I want to create a new variable, 'gout2', which is 1 if meds_sum is greater than or equal to 4 and meds_sum is non-zero in the year before or after. This is what I try for this last step, but lead() and lag() are creating NA values in this code.
df <- df %>%
mutate(gout2 = ifelse((meds_sum >= 4 & ((lead(meds_sum) >= 1 | lag(meds_sum)) >= 1)), 1, 0))
Can anyone tell me what I'm doing wrong?
This is what I would like the output to look like:
df$gout2 <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0)
Use this code for the final step; you need group_by() on the "ID" variable to produce the desired effect.
df <- df %>%
  group_by(ID) %>%
  mutate(gout2 = ifelse(meds_sum >= 4 & (lead(meds_sum) >= 1 | lag(meds_sum) >= 1), 1, 0))
Hope this helps! #Laura
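If the remaining issue is the NA that lead() and lag() produce at the first and last row of each ID, one variant (a sketch of mine, not the accepted code) is to give them a default of 0 so the comparison never returns NA:
df <- df %>%
  group_by(ID) %>%
  mutate(gout2 = ifelse(meds_sum >= 4 &
                          (lead(meds_sum, default = 0) >= 1 |
                             lag(meds_sum, default = 0) >= 1),
                        1, 0)) %>%
  ungroup()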

tally() and n() in same chain

I'm trying to calculate the probability of a certain result (e.g. the value precip >= 3) but don't know how to combine tally() and n() in the same chain.
This works, but I'd like to not depend on numsim:
numsim=2
simdF %>%
group_by(iter) %>%
tally( precip >= 3 ) %>%
mutate(
prob=n/numsim
)
Why not:
simdF %>%
group_by(iter) %>%
summarise(
freq=tally( precip >= 3 ),
prob=freq/n()
)
And on that note, how can I make 3 an argument to a function that contains this block?
Thanks!
Sample data:
simdF = structure(list(nsim = c(1, 2, 1, 2, 1, 2), iter = c(5, 5, 10, 10, 30, 30),
  locE = c(-1, -2, -2, -1, 0, 4), locN = c(-1, 4, -2, -3, 0, 2),
  precip = c(1.4142135623731, 4.47213595499958, 2.82842712474619, 3.16227766016838, 0, 4.47213595499958)),
  .Names = c("nsim", "iter", "locE", "locN", "precip"),
  class = c("tbl_df", "data.frame"), row.names = c(NA, -6L))
Looking at the documentation for ?tally:
tally is a convenient wrapper for summarise that will either call n or sum(n) depending...
tally() calls summarise(), so it doesn't make sense to put it inside summarise(). Just go directly to the n() or sum() that tally() would use. In this case, since you have a condition, use sum():
simdF %>%
group_by(iter) %>%
summarise(
freq = sum(precip >= 3),
prob = freq/n()
)
As to
how can I make 3 be an argument to a function that contains this block
The same way you'd make anything an argument:
your_function = function(data, precip_lower_bound = 3) {
  data %>%
    group_by(iter) %>%
    summarise(
      freq = sum(precip >= precip_lower_bound),
      prob = freq / n()
    )
}
your_function(data = simdF, precip_lower_bound = 3)
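With the sample data above, each iter group has exactly one of its two rows with precip >= 3, so the call returns one row per iter with freq = 1 and prob = 0.5 (an illustration of mine, abbreviated from the tibble print):
  iter freq prob
1    5    1  0.5
2   10    1  0.5
3   30    1  0.5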
