Calculating the median by group in long format - R

I have formatted my data in long format:
df1 <- read.table(text = " ID Temp location
1 12 4
1 18 3
1 17 5
1 10 1
1 19 1
1 15 4
1 16 5
1 10 3
1 11 5
1 15 1
2 20 3
2 10 3
2 17 1
2 13 5
2 12 1
2 14 4
2 20 5
2 13 1
2 13 3
2 10 3
3 12 4
3 18 3
3 18 3
3 15 1
3 17 1
3 15 4
3 10 1
3 11 3
3 13 1
3 14", header = TRUE)
I want to calculate the median (rounded up) of Temp for each of the three groups (ID) at location = 1. In other words: for ID 1, the values 10, 19, and 15 give a median of 15; for ID 2, the values 17, 12, and 13 give a median of 13.5, rounded up to 14; and so on.
So I need to get this data:
AM1 15
AM2 14
AM3 14
Thanks for your help and sorry I was unable to show my effort.

You could also use data.table:
library(data.table)
setDT(df1)[location == 1,
           .(Median = base::round(median(as.numeric(Temp)))),
           by = .(ID = paste0("AM", ID))]
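Note that round() rounds halves to even; since the question asks to always round up, ceiling() is the safer choice. A sketch under that reading:
setDT(df1)[location == 1,
           .(Median = ceiling(median(as.numeric(Temp)))),
           by = .(ID = paste0("AM", ID))]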

One option is to filter first, then do a group-by and take the median:
library(dplyr)
library(stringr)
df1 %>%
  filter(location == 1) %>%
  group_by(ID = str_c("AM", ID)) %>%
  summarise(Median = median(Temp))
# A tibble: 3 x 2
# ID Median
# <chr> <int>
#1 AM1 15
#2 AM2 13
#3 AM3 14
It can also be made more compact, though less efficient:
df1 %>%
  group_by(ID) %>%
  summarise(Median = median(Temp[location == 1]))
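For comparison, a minimal base R sketch of the same filter-then-summarise idea (assuming df1 as read in above):
# base R: keep location 1 only, then take the median of Temp within each ID
res <- aggregate(Temp ~ ID, data = subset(df1, location == 1), FUN = median)
res$ID <- paste0("AM", res$ID)
res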

Related

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to long format using one variable that indicates the day of an update.
I have the following variables:
the baseline temperature variable "temp_b";
the time-varying temperature variable "temp_v"; and
the number of days "n_days", i.e. the day on which the varying variable is updated.
I want to create the long format using a carry-forward approach and a maximum follow-up time of 5 days.
Example of the data:
df <- structure(list(id = 1:3, temp_b = c(20L, 7L, 7L), temp_v = c(30L, 10L, NA),
                     n_days = c(2L, 4L, NA)), class = "data.frame", row.names = c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
                          days_cont = c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
                          long_format = c(20,30,30,30,30, 7,7,7,10,10, 7,7,7,7,7)),
                     class = "data.frame", row.names = c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
  tidyr::uncount(5) %>%
  group_by(id) %>%
  transmute(days_cont = 1:n(),
            temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
  ungroup()
# A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
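The same carry-forward logic can also be written without uncount(), as a base R sketch using a cross join (assuming the df defined above):
# cross join: every id gets one row per follow-up day
long <- merge(df, data.frame(days_cont = 1:5))
# before the update day keep the baseline value; ids with NA n_days never update
long$long_format <- ifelse(long$days_cont < long$n_days | is.na(long$n_days),
                           long$temp_b, long$temp_v)
long[order(long$id, long$days_cont), c("id", "days_cont", "long_format")]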
Here's a possibility using tidyverse functions. First pivot_longer() the data and drop values that will not appear in the final data frame (i.e. rows where temp_v is NA); then group_by() id and mutate n_days to the number of rows each value will occupy in the final data frame; finally uncount() the data frame. (NA n_days is first replaced with 6, so that n_days - 1 = 5 rows of temp_b are kept for ids that never update.)
library(tidyverse)
df %>%
  replace_na(list(n_days = 6)) %>%
  pivot_longer(-c(id, n_days)) %>%
  filter(!is.na(value)) %>%
  group_by(id) %>%
  mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
                            name == "temp_v" ~ 5 - (n_days - 1))) %>%
  uncount(n_days) %>%
  mutate(days_cont = row_number()) %>%
  select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7

New variable from grouped calculation in R

I have a dataset:
library(dplyr)
my_df <- data.frame(day = c(1,1,1,2,2,2,3,3,3), age = c(18, 18, 18, 25, 18, 35, 76, 76, 15))
my_df
# day age
# 1 1 18
# 2 1 18
# 3 1 18
# 4 2 25
# 5 2 18
# 6 2 35
# 7 3 76
# 8 3 76
# 9 3 15
For each row, I want to know the frequency and percentage of age for a given value of day. For example, I can calculate this with a dplyr chain:
my_df %>%
  group_by(day, age) %>%
  summarize(n = n()) %>%
  group_by(day) %>%
  mutate(pct = n/sum(n))
# day age n pct
# 1 1 18 3 1
# 2 2 18 1 0.333
# 3 2 25 1 0.333
# 4 2 35 1 0.333
# 5 3 15 1 0.333
# 6 3 76 2 0.667
How can I add the values of n back onto my original df? Desired output:
# day age n
# 1 1 18 3
# 2 1 18 3
# 3 1 18 3
# 4 2 25 1
# 5 2 18 1
# 6 2 35 1
# 7 3 76 2
# 8 3 76 2
# 9 3 15 1
For your desired output we could use add_count():
library(dplyr)
my_df %>%
  add_count(day, age)
day age n
1 1 18 3
2 1 18 3
3 1 18 3
4 2 25 1
5 2 18 1
6 2 35 1
7 3 76 2
8 3 76 2
9 3 15 1
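add_count(day, age) is essentially shorthand for a grouped mutate followed by ungroup(); a sketch of the equivalent long form:
my_df %>%
  group_by(day, age) %>%
  mutate(n = n()) %>%  # count rows per (day, age) without collapsing them
  ungroup()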
I would store this as a variable, as such:
my_helper_df <- my_df %>%
  group_by(day, age) %>%
  summarize(n = n()) %>%
  group_by(day) %>%
  mutate(pct = n/sum(n))
Then left_join it to the original df, like so:
final_df <- dplyr::left_join(my_df, my_helper_df, by = c("day", "age"))
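For reference, a base R sketch of the same join (note that merge() may reorder rows, unlike left_join()):
final_df <- merge(my_df, my_helper_df, by = c("day", "age"), all.x = TRUE)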

Randomize two sets of numbers in R, not repeating values between groups

I have this file:
ID
1
1
1
3
3
3
7
7
7
And I need to assign two sets randomly, (1,2,3) and (5,15,25).
To do this I used:
set.seed(1109201)
df %>%
  group_by(ID) %>%
  dplyr::mutate(set1 = sample(c(1, 2, 3), size = n(), replace = FALSE),
                set2 = sample(c(5, 15, 25), size = n(), replace = FALSE))
and I obtained this:
ID set1 set2
1 1 15
3 1 25
7 1 25
1 2 5
3 2 15
7 2 5
1 3 25
3 3 5
7 3 15
but I need set2 values that do not repeat within set1 or within ID, like this:
ID set1 set2
1 1 15
3 1 25
7 1 5
1 2 5
3 2 15
7 2 25
1 3 25
3 3 5
7 3 15
set2 values cannot repeat within an ID or within a set1 value.
Any suggestions for controlling these two sets?
Change your dplyr code to the following. Adding a second group_by() step makes the second sampling occur only within each set1 group:
set.seed(1109201)
df %>%
  group_by(ID) %>%
  dplyr::mutate(set1 = sample(c(1, 2, 3), size = n(), replace = FALSE)) %>%
  group_by(set1) %>%
  mutate(set2 = sample(c(5, 15, 25), size = n(), replace = FALSE)) %>%
  ungroup()
# A tibble: 8 x 3
ID set1 set2
<dbl> <dbl> <dbl>
1 1 2 15
2 1 3 5
3 1 1 25
4 3 3 15
5 3 2 5
6 3 1 5
7 7 2 25
8 7 3 25
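Note that grouping by set1 only prevents repeats within each set1 group; in the output above set2 still repeats within an ID (e.g. ID 7 gets 25 twice). If set2 must be unique within both ID and set1, the assignment is a 3 x 3 Latin square. A minimal base R sketch under that reading, assuming the three IDs from the question:
set.seed(1109201)
ids  <- c(1, 3, 7)
vals <- sample(c(5, 15, 25))  # random ordering of the set2 values
out  <- expand.grid(set1 = 1:3, ID = ids)
# cyclic (Latin-square) assignment: each value appears exactly once
# per ID and exactly once per set1
out$set2 <- vals[(out$set1 + match(out$ID, ids)) %% 3 + 1]
out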

Calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by group that I need to convert back to raw values. The lag function works pretty well here, but for the first number of each sequence I get back either NA or the difference relative to the previous group.
How can I get the first number of each group instead of NA or the between-group difference?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
                 hour = rep(1:5, 3),
                 value = sample(1:15))
First I calculate the cumulative values, then convert them back to raw values; i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) just replaces the first (NA) value with the correct value, but does not work for the first number of each group:
df %>%
  group_by(id) %>%
  dplyr::mutate(cumsum = cumsum(value)) %>%
  mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1]))  # skip the first value in the lag vector
Which results in:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here a new group starts; the number should be 12, instead it is -32
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here it should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in the real data I don't have the value column, just the cumsum column.)
Try:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(
    cumsum = cumsum(value),
    valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
  )
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
While the accepted answer works, it is more complicated than it needs to be. If you look at the lag() function you will see that it has more arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can use default, setting it to 0, to get the desired output:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(cumsum = cumsum(value),
         rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
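The same idea works with data.table, whose shift() also takes a fill value; a minimal sketch assuming the df from the question:
library(data.table)
setDT(df)[, cumsum := cumsum(value), by = id]
df[, valBack := cumsum - shift(cumsum, fill = 0), by = id]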

Sum by group but keep the same value for each row in R

I have a data frame and I want to create new variables containing the sum for each ID and Group. If I aggregate normally, the dimensions of the data shrink; in my case I need to keep each row and repeat the group sum.
ID <- c(rep(1, 3), rep(3, 5), rep(4, 4))
Group <- c(1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2)
x <- 1:12
y <- 12:23
df <- data.frame(ID, Group, x, y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The desired output has two more variables, sumx and sumy, grouped by (ID, Group):
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any ideas?
As short as:
df$sumx <- with(df, ave(x, ID, Group, FUN = sum))
df$sumy <- with(df, ave(y, ID, Group, FUN = sum))
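ave() applies FUN within each combination of the grouping variables and returns a vector as long as the input, so every row keeps its group total.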
We can use dplyr. across() applies a function to several columns at once (it supersedes the older mutate_each()/funs() idiom):
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  mutate(across(c(x, y), sum, .names = "sum{.col}"))
If there are only two columns to sum, then:
df %>%
  group_by(ID, Group) %>%
  mutate(sumx = sum(x), sumy = sum(y))
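The same grouped totals can also be sketched with data.table (assuming the df above):
library(data.table)
setDT(df)[, c("sumx", "sumy") := .(sum(x), sum(y)), by = .(ID, Group)]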
You can use the code below if it is a single column; if you have more than one column, add further mutate() terms accordingly:
library(dplyr)
data13 <- data12 %>%
  group_by(Category) %>%
  mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
