Here is a reproducible example of the situation I need help with. I have a database (db1) in which weekly ratings of behavioral outcomes are recorded. The variable "Week" corresponds to the number of the week from the beginning of the year (e.g., Week = 1 indicates the week between January 1st and 7th, and so on) and the variable "Score" to the value obtained by the subject on the criterion measure. In the real data set I have several participants and a different number of ratings for each subject; in this example, however, there is only one subject to keep things simple.
library(magrittr)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1
# Week Score
#1 14 34
#2 18 21
#3 19 45
#4 20 32
#5 21 56
#6 23 45
#7 24 23
#8 25 48
What I need to do is identify the longest run of ratings recorded in consecutive weeks in the database. In the example, that number is 4, because the ratings were consecutive from week 18 to week 21. Here I added a column for demonstration, but it might not be necessary for the solution.
x2 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y2 <- c(34, 21, 45, 32, 56, 45, 23, 48)
z2 <- c(1, 1, 2, 3, 4, 1, 2, 3)
db2 <- cbind(x2, y2, z2) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db2
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 23 45 1
#7 24 23 2
#8 25 48 3
Finally, because every subject has to have a total of five consecutive ratings, I need to add a row with a missing datum wherever the longest run of consecutive weeks is below five (so that I can impute the missing data later on). However, there might be ratings before and after the sequence. In that case, I want to add the row on whichever side of the longest run is closer to another existing rating. In the example, that means the row with the missing datum is added after week 21, because there are 4 missing weeks between week 14 and week 18 but only 1 between week 21 and week 23.
x3 <- c(14, 18, 19, 20, 21, 22, 23, 24, 25)
y3 <- c(34, 21, 45, 32, 56, NA, 45, 23, 48)
z3 <- c(1, 1, 2, 3, 4, 5, 1, 2, 3)
db3 <- cbind(x3, y3, z3) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db3
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 22 NA 5
#7 23 45 1
#8 24 23 2
#9 25 48 3
For your information, this is not going to be part of the main statistical analyses, but rather one of several ways I will use to test the sensitivity of my model. So do not worry about whether it makes sense from a methodological point of view. In addition, if possible, a tidyverse solution would be greatly appreciated.
Thanks so much to anyone who will take the time.
The code is relatively simpler if you want to handle just the longest run and, if there is more than one, just the first of them:
library(dplyr)
library(purrr)   # accumulate()
library(tidyr)   # complete()

db1 %>%
  mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) .x + 1 else 1),
         dummy = max(consecutive) == consecutive & max(consecutive) < 5) %>%
  group_by(grp = cumsum(consecutive == 1)) %>%
  filter(sum(dummy) > 0) %>%   # keep only the group(s) containing the max consecutive run
  ungroup() %>%
  select(-dummy) %>%
  filter(grp == min(grp)) %>%  # keep the first such group, if there is more than one
  complete(consecutive = 1:5) %>%
  select(-grp) %>%
  mutate(Week = first(Week) + consecutive - 1)
# A tibble: 5 x 3
consecutive Week Score
<dbl> <dbl> <dbl>
1 1 18 21
2 2 19 45
3 3 20 32
4 4 21 56
5 5 22 NA
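To see how the accumulate() step builds the run counter, here is a minimal sketch on the weeks alone: a gap of exactly 1 extends the run, anything else resets it to 1.

library(purrr)

weeks <- c(14, 18, 19, 20, 21, 23, 24, 25)
accumulate(diff(weeks), .init = 1, ~ if (.y == 1) .x + 1 else 1)
#> [1] 1 1 2 3 4 1 2 3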
OLD ANSWER: Another tidyverse strategy (this can be modified to suit your additional column requirements, which you have not given in the sample):
library(tidyverse)
db1
#> Week Score
#> 1 14 34
#> 2 18 21
#> 3 19 45
#> 4 20 32
#> 5 21 56
#> 6 23 45
#> 7 24 23
#> 8 25 48
library(data.table)   # for rleid()

db1 %>%
  mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) .x + 1 else 1),
         dummy = max(consecutive) == consecutive & max(consecutive) < 5,
         dummy2 = rleid(dummy)) %>%
  group_split(dummy2, .keep = FALSE) %>%
  map_if(~ .x$dummy[[1]],
         ~ .x %>%
           complete(consecutive = seq(max(consecutive), 5, 1), fill = list(Week = 1)) %>%
           mutate(Week = cumsum(Week))) %>%
  map_dfr(~ .x %>% select(-dummy))
#> # A tibble: 9 x 3
#> Week Score consecutive
#> <dbl> <dbl> <dbl>
#> 1 14 34 1
#> 2 18 21 1
#> 3 19 45 2
#> 4 20 32 3
#> 5 21 56 4
#> 6 22 NA 5
#> 7 23 45 1
#> 8 24 23 2
#> 9 25 48 3
Created on 2021-06-10 by the reprex package (v2.0.0)
If I understand correctly:
library(data.table)
library(tidyverse)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1 %>%
  mutate(grp = cumsum(c(0, diff(Week)) > 1)) %>%
  group_by(grp) %>%
  mutate(n_grp = n()) %>%
  ungroup() %>%
  filter(n_grp == max(n_grp, na.rm = TRUE)) %>%
  complete(grp,
           n_grp,
           nesting(Week = seq(from = first(Week), length.out = 5))) %>%
  select(-c(grp, n_grp)) %>%
  rows_upsert(db1, by = c("Week", "Score"))
#> # A tibble: 9 x 2
#> Week Score
#> <dbl> <dbl>
#> 1 18 21
#> 2 19 45
#> 3 20 32
#> 4 21 56
#> 5 22 NA
#> 6 14 34
#> 7 23 45
#> 8 24 23
#> 9 25 48
Created on 2021-06-10 by the reprex package (v2.0.0)
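The grouping trick here is cumsum(c(0, diff(Week)) > 1): any gap larger than one week starts a new group. A minimal sketch of just that step:

weeks <- c(14, 18, 19, 20, 21, 23, 24, 25)
cumsum(c(0, diff(weeks)) > 1)
#> [1] 0 1 1 1 1 2 2 2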
You can also use the following solution. Midway through, before we use add_row to add your additional rows, I filter the whole data set to keep only the groups with the maximum number of observations, i.e. the longest runs of consecutive Weeks. Because we split by the grouping variable, we may end up with two or more groups of equally long consecutive Weeks, in which case you can choose whichever you like based on your preference:
library(dplyr)
library(purrr)
library(tibble)
db1 %>%
  mutate(Consecutive = +(Week - lag(Week, default = first(Week)) == 1),
         grp = cumsum(Consecutive == 0)) %>%
  group_by(grp) %>%
  mutate(Consecutive = row_number()) %>%
  add_count() %>%
  ungroup() -> db2   # we create our grouping variable `grp` here

db2 %>%
  filter(n == max(n)) %>%
  group_split(grp) %>%
  map_dfr(~ add_row(.x,
                    Week = .x$Week[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    Consecutive = .x$Consecutive[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    grp = .x$grp[1])) %>%
  bind_rows(db2 %>% filter(n != max(n))) %>%
  select(-c(grp, n)) %>%
  arrange(Week)
# A tibble: 9 x 3
Week Score Consecutive
<dbl> <dbl> <dbl>
1 14 34 1
2 18 21 1
3 19 45 2
4 20 32 3
5 21 56 4
6 22 NA 5
7 23 45 1
8 24 23 2
9 25 48 3
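For comparison, base R's rle() finds the length of the longest run without any tidyverse machinery; a minimal sketch on db1 (not part of the answers above):

runs <- rle(cumsum(c(TRUE, diff(db1$Week) != 1)))
runs$lengths       # length of each run of consecutive weeks
#> [1] 1 4 3
max(runs$lengths)
#> [1] 4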
I want to discretize a column that contains a continuous variable.
The data looks like:
c(0,25,77,423,6,8,3,65,32,22,10,0,8,0,15,0,10,1,2,4,5,5,6)
I want to turn the numbers into categories by discretizing, but zeros should represent a separate category. Discretizing directly can sometimes lump other numbers in with zero.
I thought that if I kept the zeros out and then discretized, I would get what I want, but within a data frame column I can't do that because the indexes no longer line up.
Here is an example dput() output:
structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
For example, if I use 2 breaks, the categories should be zero plus the other 3 discretized ones, 4 categories in total. Even better would be a function that discretizes a column and can be used directly inside dplyr::mutate().
Thanks in advance.
If I understood it correctly, your goal is to keep "0" as a separate category when discretizing. Here's a solution using arules::discretize to make a new function that can accomplish this:
library(arules)
#> Loading required package: Matrix
#>
#> Attaching package: 'arules'
#> The following objects are masked from 'package:base':
#>
#> abbreviate, write
library(tidyverse)
df <- structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
discretize_keep <- function(vec, keep, ...) {
  vec2 <- vec
  vec2[vec2 == keep] <- NA                # temporarily mask the value to keep
  dsc <- arules::discretize(vec2, ...)    # discretize the remaining values
  fct_explicit_na(dsc, na_level = str_glue("[{keep}]"))  # restore it as its own level
}
df %>%
  mutate(discrete_column = discretize_keep(dummy_column, keep = 0, breaks = 3))
#> dummy_column discrete_column
#> 1 0 [0]
#> 2 25 [15,423]
#> 3 77 [15,423]
#> 4 423 [15,423]
#> 5 6 [6,15)
#> 6 8 [6,15)
#> 7 3 [1,6)
#> 8 65 [15,423]
#> 9 32 [15,423]
#> 10 22 [15,423]
#> 11 10 [6,15)
#> 12 0 [0]
#> 13 8 [6,15)
#> 14 0 [0]
#> 15 15 [15,423]
#> 16 0 [0]
#> 17 10 [6,15)
#> 18 1 [1,6)
#> 19 2 [1,6)
#> 20 4 [1,6)
#> 21 5 [1,6)
#> 22 5 [1,6)
#> 23 6 [6,15)
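Because the extra arguments are passed through via ..., the same helper should work with the other binning methods of arules::discretize() as well; for example, equal-frequency bins (a sketch, not shown in the original answer):

df %>%
  mutate(discrete_column = discretize_keep(dummy_column, keep = 0,
                                           method = "frequency", breaks = 3))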
If you have predefined breaks such as c(20, 50) below, you can use cut to discretize dummy_column, e.g.,
breaks <- c(20, 50)
df %>%
  mutate(discrete = cut(dummy_column, c(-1, 0, breaks, max(dummy_column))))
which gives
dummy_column discrete
1 0 (-1,0]
2 25 (20,50]
3 77 (50,423]
4 423 (50,423]
5 6 (0,20]
6 8 (0,20]
7 3 (0,20]
8 65 (50,423]
9 32 (20,50]
10 22 (20,50]
11 10 (0,20]
12 0 (-1,0]
13 8 (0,20]
14 0 (-1,0]
15 15 (0,20]
16 0 (-1,0]
17 10 (0,20]
18 1 (0,20]
19 2 (0,20]
20 4 (0,20]
21 5 (0,20]
22 5 (0,20]
23 6 (0,20]
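If you prefer explicit level names, cut also accepts a labels argument (one label per interval); a sketch assuming the same breaks, with "[0]" marking the zero category:

df %>%
  mutate(discrete = cut(dummy_column,
                        breaks = c(-1, 0, breaks, max(dummy_column)),
                        labels = c("[0]", "(0,20]", "(20,50]", "(50,max]")))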
I have a bit of code that I used in an Excel spreadsheet, using MIN and MAX, that I'm trying to transfer over to R.
I have two columns, "mini" and "maxi", which represent a range of possible values. The third column I'm trying to populate is the proportion of that range that falls between 5 and 19. Looking at the first row in the example, if "mini" is 10 and "maxi" is 15, the value of the 5-19 column should be 1, since the range falls completely in that span. In row 9, the "mini" is 1 and the "maxi" is 3, meaning the range falls completely outside of 5-19, and the value should therefore be 0. Row 3, however, straddles this range: only 25% falls within 5-19, so the output value should be 0.25.
Edit: I have updated R and, although several solutions worked before, I am now getting the error:
Error in mutate_impl(.data, dots, caller_env()) :
attempt to bind a variable to R_UnboundValue
Here's an example of how the DF looks:
ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6
The code that I used previously was something like this:
=MAX((MIN(maxi,19)-MAX(mini,5)+1),0)/(maxi-mini+1)
I was initially trying to use something like
percentoutput <- mutate(DF, output = MAX((MIN(maxi,19) - MAX(mini,5) + 1),0)/(maxi-mini + 1))
This resulted in the output column being full of NAs.
I suspect this may be a situation where I need an apply function, but I'm not sure how to go about setting one up. Any guidance is appreciated!
Here is an example DF:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), min = c(10,
17, 2, 40, 40, 21, 21, 17, 1, 4), max = c(15, 20, 5, 59, 59,
39, 39, 20, 3, 6)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(ID = structure(list(), class = c("collector_double",
"collector")), mini = structure(list(), class = c("collector_double",
"collector")), maxi = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
We can calculate the proportion of values in the min-to-max range that fall within 5:19 using rowwise:
library(dplyr)
df %>% rowwise() %>% mutate(ratio = mean(min:max %in% 5:19))
# ID min max ratio
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 15 1
# 2 2 17 20 0.75
# 3 3 2 5 0.25
# 4 4 40 59 0
# 5 5 40 59 0
# 6 6 21 39 0
# 7 7 21 39 0
# 8 8 17 20 0.75
# 9 9 1 3 0
#10 10 4 6 0.667
And similarly in base R using apply:
df$ratio <- apply(df[-1], 1, function(x) mean(x[1]:x[2] %in% 5:19))
Here is a vectorized version using data.table:
DT[, portion := {
mn <- pmax(mini, lb)
mx <- pmin(maxi, ub)
fifelse(mn <= mx, (mx - mn + 1L) / (maxi - mini + 1L), 0)
}]
Or equivalently in base R:
DF$mn <- pmax(DF$mini, lb)
DF$mx <- pmin(DF$maxi, ub)
DF$portion <- ifelse(DF$mn <= DF$mx, (DF$mx - DF$mn + 1L) / (DF$maxi - DF$mini + 1L), 0)
output:
ID mini maxi portion
1: 1 10 15 1.0000000
2: 2 17 20 0.7500000
3: 3 2 5 0.2500000
4: 4 40 59 0.0000000
5: 5 40 59 0.0000000
6: 6 21 39 0.0000000
7: 7 21 39 0.0000000
8: 8 17 20 0.7500000
9: 9 1 3 0.0000000
10: 10 4 6 0.6666667
data:
library(data.table)
DT <- fread("ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6")
lb <- 5L
ub <- 19L
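Note that pmax()/pmin() are the elementwise analogues of the MAX()/MIN() calls in the OP's Excel formula. The fifelse/ifelse can even be dropped by clamping the overlap at zero; a minimal base R sketch using the DT, lb, and ub defined above:

with(DT, pmax(0, pmin(maxi, ub) - pmax(mini, lb) + 1) / (maxi - mini + 1))
#> [1] 1.0000000 0.7500000 0.2500000 0.0000000 0.0000000 0.0000000 0.0000000 0.7500000
#> [9] 0.0000000 0.6666667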
We can use map2
library(dplyr)
library(purrr)
df %>%
  mutate(ratio = map2_dbl(min, max, ~ mean(.x:.y %in% 5:19)))
In my real data, I have multiple outliers for multiple variables. My data looks something like the example below, but the numbers are completely random.
I would like to pull in all data points that are more than 2 SD from the mean, using winsorization.
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
So far I have identified my outliers for the variables of test1 and test2 for each group using the following code:
outlier <- function(x, SD = 2){
  mu <- mean(x)
  sigma <- sd(x)
  out <- x < mu - SD*sigma | x > mu + SD*sigma  # TRUE when more than SD sigmas from the mean
  out
}
# identify the outliers for each variable by each group
with(df, ave(test1, group, FUN = outlier))
with(df, ave(test2, group, FUN = outlier))
# add these new-found outliers to the data set
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
I am aware of the winsorize function in the robustHD package, but am not sure:
1) how to tailor the command to a 90% winsorization (2 SD), 2) how to ensure the winsorization accounts for the 2 different groups, and 3) how to include multiple variables in that winsorization.
Additionally (but not necessary): is there a way to see which numbers the winsorize function changed, and what they were changed to?
First, make clear how you want to winsorize your data. You have several options:
1) Use the mean ± 2·sd limits as extreme values and replace all values outside by those limits.
2) Use the observed values next to the mean ± 2·sd limits.
3) Use the 90% quantile (i.e. the 5% and 95% quantiles).
With options 1 and 3 you may introduce values into your winsorized variable that were never observed; with option 2 you will only have observed values. Note also that the (5%, 95%) quantiles will not necessarily lie near the 2·sd limits unless you have reasonably well-behaved, normally distributed data.
For the winsorization process you can use DescTools::Winsorize(), which accepts both probabilities and values for the limits.
Implementation of option 1):
library(DescTools)

x <- rnorm(100)
w1 <- Winsorize(x,
                minval = mean(x) - 2*sd(x),
                maxval = mean(x) + 2*sd(x))
For option 2) you could use something like:
w2 <- Winsorize(x,
                minval = max(Coalesce(x[x <= mean(x) - 2*sd(x)], mean(x) - 2*sd(x))),
                maxval = min(Coalesce(x[x >= mean(x) + 2*sd(x)], mean(x) + 2*sd(x))))
The Coalesce() calls provide fallback values for cases where there are no values outside the limits: Coalesce() returns the first non-empty value, so Winsorize() will always get a valid limit.
Option 3) is the default for the function
w3 <- Winsorize(x, probs=c(0.05, 0.95))
For a groupwise apply, wrap the winsorization in a function (here for option 1):
df$w1 <- unsplit(
  tapply(df$test1, df$group,
         function(x) Winsorize(x,
                               minval = mean(x) - 2*sd(x),
                               maxval = mean(x) + 2*sd(x))),
  f = df$group)
The replaced values can be found with
cbind(x, w1)[x!=w1,]
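For points 2) and 3) of the question, the same idea extends to a grouped dplyr pipeline over several columns at once. A sketch of option 1, assuming the df with group, test1, and test2 from the question (the _w suffix is just an illustrative name):

library(dplyr)
library(DescTools)

df %>%
  group_by(group) %>%
  mutate(across(c(test1, test2),
                ~ Winsorize(.x,
                            minval = mean(.x) - 2*sd(.x),
                            maxval = mean(.x) + 2*sd(.x)),
                .names = "{.col}_w")) %>%
  ungroup()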
Here's a start - hopefully someone has a better solution for you.
library(tidyverse)
df <- tibble::tribble(
~id, ~group, ~test1, ~test2,
1, 0, 57, 82,
2, 0, 77, 80,
3, 0, 67, 90,
4, 0, 15, 70,
5, 0, 58, 72,
6, 1, 18, 44,
7, 1, 44, 44,
8, 1, 18, 46,
9, 1, 20, 44,
10, 1, 14, 38
)
df
#> # A tibble: 10 x 4
#> id group test1 test2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82
#> 2 2 0 77 80
#> 3 3 0 67 90
#> 4 4 0 15 70
#> 5 5 0 58 72
#> 6 6 1 18 44
#> 7 7 1 44 44
#> 8 8 1 18 46
#> 9 9 1 20 44
#> 10 10 1 14 38
library(DescTools)
df %>%
  group_by(group) %>%
  mutate(
    test2_winsorized = DescTools::Winsorize(
      test2,
      maxval = quantile(df$test2, 0.90),  # note: df$test2 is the full column, so the limits are not per group
      minval = quantile(df$test2, 0.10)
    ),
    test1_winsorized = DescTools::Winsorize(
      test1,
      maxval = quantile(df$test1, 0.90),
      minval = quantile(df$test1, 0.10)
    )
  )
#> # A tibble: 10 x 6
#> # Groups: group [2]
#> id group test1 test2 test2_winsorized test1_winsorized
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82 82 57
#> 2 2 0 77 80 80 68
#> 3 3 0 67 90 82.8 67
#> 4 4 0 15 70 70 15
#> 5 5 0 58 72 72 58
#> 6 6 1 18 44 44 18
#> 7 7 1 44 44 44 44
#> 8 8 1 18 46 46 18
#> 9 9 1 20 44 44 20
#> 10 10 1 14 38 43.4 14.9
Created on 2019-06-06 by the reprex package (v0.2.1)
I would like to count the number of observations within each group, subject to conditions, in R.
For example, I would like to count how many observations there are for ID "A" in every 10-day window.
ID (A,A,A,A,A,A,A,A)
Day (7,14,17,25,35,37,42,57)
X (9,20,14,24,23,30,20,40)
Desired output:
(In the first 10 days, we have one observation for ID "A". Days: 7.
In the next 10 days, we have two observations for ID "A". Days: 14, 17.)
ID (A,A,A,A,A,A,A,A)
Day_10 (1,2,3,4,5,6)
Count_10 (1,2,1,2,1,1)
Also, it would be great if I could count the number of observations around certain values: for a given X value, I would like to know how many observations fall within [X-10, X+10] for ID "A".
The desired output would be as follows:
ID (A,A,A,A,A,A,A,A)
X (9,20,14,24,23,30,40,50)
Count_X10 (3,3,3,3,3,3,2,1)
Count_X10: for a given X (= 9) there are three observations for ID "A" within [-1, 19].
Here are the data loaded as a data.frame to keep the observations connected. Note that I added a second group to show how to handle that:
df <-
data.frame(
ID = rep(c("A","B"), each = 8)
, Day = c(7,14,17,25,35,37,42,57)
, X = c(9,20,14,24,23,30,20,40)
)
Then I used dplyr to pass the data through a series of steps. First I split by the ID column, then used lapply to run a function on each of those ID groups, calculating the two columns of interest and returning the whole data.frame. Finally, I stitch the rows back together with bind_rows:
library(dplyr)  # for between() and bind_rows()

df %>%
  split(.$ID) %>%
  lapply(function(x){
    x$nextTen <- sapply(x$Day, function(thisDay){
      sum(between(x$Day, thisDay, thisDay + 10))
    })
    x$plusMinusTen <- sapply(x$Day, function(thisDay){
      sum(between(x$Day, thisDay - 10, thisDay + 10))
    })
    return(x)
  }) %>%
  bind_rows()
The result is
ID Day X nextTen plusMinusTen
1 A 7 9 3 3
2 A 14 20 2 3
3 A 17 14 2 4
4 A 25 24 2 3
5 A 35 23 3 4
6 A 37 30 2 3
7 A 42 20 1 3
8 A 57 40 1 1
9 B 7 9 3 3
10 B 14 20 2 3
11 B 17 14 2 4
12 B 25 24 2 3
13 B 35 23 3 4
14 B 37 30 2 3
15 B 42 20 1 3
16 B 57 40 1 1
But any condition you are interested in could be added to that lapply step.
Your sample data:
df <- data.frame(
  ID = rep('A', 8),
  Day = c(7, 14, 17, 25, 35, 37, 42, 57),
  X = c(9, 20, 14, 24, 23, 30, 40, 50),
  stringsAsFactors = FALSE)
Note: You give two different values for vector X. I suppose it is c(9, 20, 14, 24, 23, 30, 40, 50), and not c(9, 20, 14, 24, 23, 30, 20, 40).
First calculation:
library(dplyr)
output1 <- df %>%
  mutate(Day_10 = ceiling(Day / 10)) %>%
  group_by(ID, Day_10) %>%
  summarise(Count_10 = n())
The mutate step creates the 10-day bins by taking the ceiling of Day/10. Then we group by ID and Day_10, and we count the number of observations within each group.
> output1
ID Day_10 Count_10
<chr> <dbl> <int>
1 A 1 1
2 A 2 2
3 A 3 1
4 A 4 2
5 A 5 1
6 A 6 1
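To see the binning step in isolation, here is what ceiling(Day/10) produces for the sample days:

ceiling(c(7, 14, 17, 25, 35, 37, 42, 57) / 10)
#> [1] 1 2 2 3 4 4 5 6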
Second calculation:
output2 <- df %>%
  group_by(ID) %>%
  mutate(Count_X10 = sapply(X, function(x){ sum(Day >= x - 10 & Day <= x + 10) })) %>%
  select(-Day)
We group by ID, and for each X we count the number of days with this ID that are between X-10 and X+10.
> output2
ID X Count_X10
<chr> <dbl> <int>
1 A 9 3
2 A 20 3
3 A 14 3
4 A 24 3
5 A 23 3
6 A 30 3
7 A 40 3
8 A 50 2
Note: I suppose there's a mistake in the desired output you give, because for instance, when X = 50, there are 2 observations within [40, 60] with ID "A": days 42 and 57.
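As a stylistic alternative, dplyr::between() expresses the same window test inside the sapply; a sketch producing the same Count_X10 column:

output2 <- df %>%
  group_by(ID) %>%
  mutate(Count_X10 = sapply(X, function(x) sum(between(Day, x - 10, x + 10)))) %>%
  select(-Day)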