I would like to assign a number to the days (events) where val > 0.36. I want to identify with a unique number any event that meets that rule (val > 0.36), preferably using the tidyverse.
library(lubridate)
library(tibble)
# create a df
date <- as_date(ymd("2020-11-01"):ymd("2020-11-25"))
val <- rnorm(25)
data <- tibble(date, val)
Could anyone help me?
Thanks
There are several ways to choose unique numbers to identify the events for which the threshold value is exceeded. Also, the tidyverse might not be necessary; here is a base R solution. It uses the index of each value in the val vector that exceeds the threshold as the unique ID. All values at or below the threshold of 0.36 are coded as 0.
# (a) initialize the 0s that later identify the non-events
data$flag <- rep(0, nrow(data))
# (b) flag each value exceeding the threshold with its row index
data$flag[which(data$val > 0.36)] <- which(data$val > 0.36)
Output (randomly generated, so your values will differ):
> data
# A tibble: 25 x 3
date val flag
<date> <dbl> <dbl>
1 2020-11-01 0.0231 0
2 2020-11-02 -0.413 0
3 2020-11-03 0.240 0
4 2020-11-04 -0.465 0
5 2020-11-05 -0.929 0
6 2020-11-06 -0.409 0
7 2020-11-07 0.598 7
8 2020-11-08 0.970 8
9 2020-11-09 1.25 9
10 2020-11-10 0.244 0
# ... with 15 more rows
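Since the question asked for a tidyverse approach, here is a minimal dplyr sketch of the same idea (assuming the data tibble built above):
library(dplyr)
# flag each row exceeding the threshold with its row index, 0 otherwise
data <- data %>%
  mutate(flag = if_else(val > 0.36, row_number(), 0L))
If you prefer consecutive event numbers (1, 2, 3, ...) over row indices, cumsum(val > 0.36) * (val > 0.36) is a common replacement for the if_else() call.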
I have this tibble
host_id district availability_365
<dbl> <chr> <dbl>
1 8573 Fatih 280
2 3725 Maltepe 365
3 1428 Fatih 355
4 6284 Fatih 164
5 3518 Esenyurt 0
6 8427 Esenyurt 153
7 4218 Fatih 0
8 5342 Kartal 134
9 4297 Pendik 0
10 9340 Maltepe 243
# … with 51,342 more rows
I want to find out, per district, the proportion of hosts for which all of their rooms have availability_365 == 0. As you can see there are 51,352 rows, but not every row is a different host: there are exactly 37,572 different host_ids.
I know that I can use group_by(district) to split the data into the five districts, but I am not quite sure how to work out what percentage of the hosts only have rooms with no availability. Can anybody help me out here?
Use the summarise() function along with group_by() in dplyr. Note that this gives the share of rows (listings) with zero availability in each district:
library(dplyr)
df %>%
  group_by(district) %>%
  summarise(Zero_Availability = sum(availability_365 == 0) / n())
# A tibble: 5 x 2
district Zero_Availability
<chr> <dbl>
1 Esenyurt 0.5
2 Fatih 0.25
3 Kartal 0
4 Maltepe 0
5 Pendik 1
It's difficult to make sure my answer is working without actually having the data, but if you're open to using data.table, the following should work
library(data.table)
setDT(data)
# step 1: one row per host, flagging hosts whose listings are all unavailable;
# step 2: per-district proportion of such hosts
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][
  , .(prop_no_avail = sum(no_avail) / .N), .(district)]
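For completeness, a dplyr sketch of the same two-step logic, so hosts (not rows) are counted:
library(dplyr)
data %>%
  group_by(district, host_id) %>%
  summarise(no_avail = all(availability_365 == 0), .groups = "drop") %>%
  group_by(district) %>%
  summarise(prop_no_avail = mean(no_avail))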
I have longitudinal panel data of 1000 individuals measured at two time points. Using the MICE package I have imputed values for those variables with missing data. The imputation itself works fine, generating the required 17 imputed data frames. One of the imputed variables is fitness. I would like to create a new variable of fitness scaled, scale(fitness). My understanding is that I should impute first, and then create the new variable with the imputed data. How do I access each of the 17 imputed datasets and generate a scaled fitness variable in each?
My original data frame looks like (some variables missing):
id age school sex andersen ldl_c_trad pre_post
<dbl> <dbl> <fct> <fct> <int> <dbl> <fct>
1 2 10.7 1 1 951 2.31 1
2 2 11.3 1 1 877 2.20 2
3 3 11.3 1 1 736 2.88 1
4 3 11.9 1 1 668 3.36 2
5 4 10.1 1 0 872 3.31 1
6 4 10.7 1 0 905 2.95 2
7 5 10.5 1 1 925 2.02 1
8 5 11.0 1 1 860 1.92 2
9 8 10.7 1 1 767 3.41 1
10 8 11.2 1 1 709 3.32 2
My imputation code is:
imputed <- mice(imp_vars, method = meth, predictorMatrix = predM, m = 17)
imp_vars are the variables selected for imputation.
I have pre-specified both the method and predictor matrix.
Also, my assumption is that the scaling should be performed separately for each time point, as fitness is likely to have improved over time. Is it possible to perform the scaling filtered by pre_post for each imputed dataset?
Many thanks.
To access each of the imputations, where x is a value from 1 to 17:
data <- complete(imputed, x)
or if you want access to the fitness variable
complete(imputed, x)$fitness
If you want to filter observations according to a value of another variable in the dataframe, you could use
data[which(data$pre_post==1), "fitness"]
This should return the fitness observations for which pre_post == 1. From there it is simply a matter of scaling these observations for each level of pre_post, assigning them to a new variable fitness_scaled, and then repeating for each imputation from 1 to 17.
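A minimal sketch of that loop, assuming imputed is the mids object from the question and dplyr is available:
library(mice)
library(dplyr)
# loop over the 17 imputations; within each completed data set,
# scale fitness separately for each pre_post level
completed <- lapply(1:17, function(x) {
  complete(imputed, x) %>%
    group_by(pre_post) %>%
    mutate(fitness_scaled = as.numeric(scale(fitness))) %>%
    ungroup()
})
completed is then a list of 17 data frames, one per imputation.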
I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this but more lab-values variables:
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                 time = c(0, 3, 7, 35, 0, 7, 14, 28, 42),
                 labvalue1 = c(4.04, NA, 2.93, NA, NA, 3.78, 3.66, NA, 2.54),
                 labvalue2 = c(NA, 63.8, 62.8, 61.2, 78.1, NA, 77.6, 75.3, NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (unique id), the decrease or slope per day between the first and the last measurement, so that I can compare the slopes between patients. Time is in days. So eventually I want a new variable, e.g. diff_labvalues, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last minus first); and likewise for labvalue2, and so forth.
So far I have used dplyr and created the first1() and last1() functions below, because first() and last() did not work with the NA values.
Thereafter, I grouped by id and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find how to extract the corresponding time values (the delta-time) that I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
  first(na.omit(x))
}
last1 <- function(x) {
  last(na.omit(x))
}
df2 <- df %>%
  group_by(id) %>%
  mutate_all(funs(diff = (last1(.) - first1(.)) /                  # it works until here
    (time[position of last1(.)] - time[position of first1(.)])))   # something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1)) /
              (last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1),
          .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])), id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
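To cover every lab column at once, as the mutate_all attempt in the question intended, here is a hedged dplyr sketch using across(); the helper slope() and the starts_with("labvalue") selection are assumptions based on the example data:
library(dplyr)
# slope between the first and last non-NA value of a lab column,
# using the time points at which those values were measured
slope <- function(val, time) {
  ok <- !is.na(val)
  (last(val[ok]) - first(val[ok])) / (last(time[ok]) - first(time[ok]))
}
df %>%
  group_by(id) %>%
  summarise(across(starts_with("labvalue"), ~ slope(.x, time),
                   .names = "{.col}_diff"))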
I have the following data frame:
sleep health count prop
1 7 Good 100 NA
2 7 Normal 75 NA
3 7 Bad 25 NA
4 8 Good 125 NA
5 8 Normal 75 NA
6 8 Bad 25 NA
I want to fill the prop column with each count's proportion of the total within its sleep group. For instance, prop for the first 3 rows should be 0.5, 0.375, and 0.125, and for the last 3 rows approximately 0.556, 0.333, and 0.111, respectively.
This can be done manually by first splitting the data frame by sleep and then using prop.table(count) on each piece, but since there are numerous sleep groups I can't find a succinct way to do this. Any thoughts?
In R, we can do this by dividing by the sum of 'count' after grouping by 'sleep'
library(dplyr)
df1 %>%
  group_by(sleep) %>%
  mutate(prop = round(count / sum(count), 3))
# sleep health count prop
# <int> <chr> <int> <dbl>
#1 7 Good 100 0.500
#2 7 Normal 75 0.375
#3 7 Bad 25 0.125
#4 8 Good 125 0.556
#5 8 Normal 75 0.333
#6 8 Bad 25 0.111
Or using base R
df1$prop <- with(df1, ave(count, sleep, FUN=prop.table))
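As a quick sanity check, prop.table() simply divides a vector by its sum, so the first sleep group comes out as expected:
prop.table(c(100, 75, 25))
#[1] 0.500 0.375 0.125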
I want to determine the length of the snow season in the following data frame:
DATE SNOW
1998-11-01 0
1998-11-02 0
1998-11-03 0.9
1998-11-04 1
1998-11-05 0
1998-11-06 1
1998-11-07 0.6
1998-11-08 1
1998-11-09 2
1998-11-10 2
1998-11-11 2.5
1998-11-12 3
1998-11-13 6.5
1999-01-01 15
1999-01-02 15
1999-01-03 19
1999-01-04 18
1999-01-05 17
1999-01-06 17
1999-01-07 17
1999-01-08 17
1999-01-09 16
1999-03-01 6
1999-03-02 5
1999-03-03 5
1999-03-04 5
1999-03-05 5
1999-03-06 2
1999-03-07 2
1999-03-08 1.6
1999-03-09 1.2
1999-03-10 1
1999-03-11 0.6
1999-03-12 0
1999-03-13 1
Snow season is defined by a snow depth (SNOW) of more than 1 cm for at least 10 consecutive days (so if there is snow one day in November but after it melts and depth is < 1 cm we consider the season not started).
My idea would be to determine:
1) the date of snowpack establishment (in my example 1998-11-08)
2) the date of "disappearing" (here 1999-03-11)
3) calculate the length of the period (number of days between 1998-11-08 and 1999-03-11)
For the 3rd step I can easily get the number of days between two dates.
But how to define the dates with conditions?
This is one way:
# copy data from clipboard (readClipboard() is Windows-only)
d <- read.table(text=readClipboard(), header=TRUE)
# coerce DATE to Date type, add event grouping variable that numbers the groups
# sequentially and has NA for values not in events.
d <- transform(d, DATE=as.Date(DATE),
               event=with(rle(d$SNOW >= 1),
                          rep(replace(ave(values, values, FUN=seq), !values, NA), lengths)))
# aggregate event lengths in days
event.days <- aggregate(DATE ~ event, data=d, function(x) as.numeric(max(x) - min(x), units='days'))
# get those events greater than 10 days
subset(event.days, DATE > 10)
# event DATE
# 3 3 122
You can also use the event grouping variable to find the start dates:
starts <- aggregate(DATE ~ event, data=d, FUN=head, 1)
#   event       DATE
# 1     1 1998-11-04
# 2     2 1998-11-06
# 3     3 1998-11-08
# 4     4 1999-03-13
And then merge this with event.days:
merge(event.days, starts, by='event')
# event DATE.x DATE.y
# 1 1 0 1998-11-04
# 2 2 0 1998-11-06
# 3 3 122 1998-11-08
# 4 4 0 1999-03-13
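The end dates can be pulled out the same way with FUN=tail; note this gives the last day the snowpack is present, so the question's "disappearing" date (1999-03-11) is the day after:
ends <- aggregate(DATE ~ event, data=d, FUN=tail, 1)
merge(starts, ends, by='event', suffixes=c('.start', '.end'))
#   event DATE.start   DATE.end
# 1     1 1998-11-04 1998-11-04
# 2     2 1998-11-06 1998-11-06
# 3     3 1998-11-08 1999-03-10
# 4     4 1999-03-13 1999-03-13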