Normalize all rows with first element within group - r

Is there an elegant method to normalize a column with a group-specific norm with dplyr?
Example:
I have a data frame:
df = data.frame(year=c(1:2, 1:2),
group=c("a", "a", "b", "b"),
val=c(100, 200, 300, 900))
i.e:
year group val
1 1 a 100
2 2 a 200
3 1 b 300
4 2 b 900
I want to normalize val by the value in year=1 of the given group. Desired output:
year group val val_norm
1 1 a 100 1
2 2 a 200 2
3 1 b 300 1
4 2 b 900 3
e.g. in row 4 the norm = 300 (year==1 & group=="b") hence val_norm = 900/300 = 3.
I can achieve this by extracting an ancillary data frame with just the norms and then doing a left join on the original data frame.
What is a more elegant way to achieve this without creating a temporary data frame?

We can group by 'group', then divide 'val' by the 'val' where 'year' is 1 (year==1). The [1L] keeps only the first match, in case a group has more than one row with 'year' equal to 1.
library(dplyr)
df %>%
group_by(group) %>%
mutate(val_norm = val/val[year==1][1L])
# year group val val_norm
# <int> <fctr> <dbl> <dbl>
#1 1 a 100 1
#2 2 a 200 2
#3 1 b 300 1
#4 2 b 900 3
If efficiency matters as well as elegance, data.table offers an equally compact version (note that setDT converts df by reference):
library(data.table)
setDT(df)[, val_norm := val/val[year==1][1L] , by = group]
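For comparison, here is a minimal base R sketch of the same idea, in case neither package is available. It rebuilds the toy data from the question; the helper name norm_by_group is only illustrative, and it still computes a small lookup vector (not a temporary data frame).

```r
# Same toy data as in the question
df <- data.frame(year = c(1:2, 1:2),
                 group = c("a", "a", "b", "b"),
                 val = c(100, 200, 300, 900))

# Per-group norm: the value where year == 1 (first match if there are several)
norm_by_group <- sapply(split(df, df$group),
                        function(g) g$val[g$year == 1][1])

# Divide each row by the norm of its own group
df$val_norm <- df$val / unname(norm_by_group[as.character(df$group)])
df
```

If a group has no row with year == 1, its norm is NA, so val_norm is NA for that group.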

Related

Use replicate to create new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual (1 for the first measurement, 2 for the second, and so on). Since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R we could build the counter from the run lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
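As a variation on the base R one-liner above: rle() counts runs, so it relies on all rows of an ID being adjacent. A sketch with ave() groups by value instead and does not depend on row order. The small data frame below is made up for illustration; it is not the simulated data.

```r
# Small hand-made example (hypothetical values, IDs deliberately interleaved)
dat <- data.frame(ID = c(1, 2, 1, 1, 2, 3),
                  observations = c(1, 1, 3, 9, 5, 1))

# Number the rows within each ID, regardless of where they sit in the data frame
dat$times <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)
dat
```

On the simulated data, where IDs are already contiguous, this should give the same counter as the rle() version.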

Sample n random rows per group in a dataframe with dplyr when some observations have less than n rows

I have a data frame with two categorical variables.
samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
samples groups
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
For each sample-group combination, I would like to downsample the data frame (randomly, this is important) to a maximum of X rows, and keep all rows of any combination that appears fewer than X times. In the example here X=2. Is there an easy way to do this? The issue I have is that combination (A,2), row 4, appears only once, so dplyr's sample_n would not work.
desired output
samples groups
1 A 1
2 A 1
3 A 2
4 B 1
5 B 1
For each group, you can sample the smaller of x and the number of rows in that group:
library(dplyr)
x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))
# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1
However, note that sample_n() has been superseded by slice_sample(), and n() doesn't work inside slice_sample(). There is an open issue for it.
As @tmfmnk mentioned, though, we don't need to call n() here, because slice_sample() returns the whole group when it has fewer than n rows. Try:
df %>% group_by(samples, groups) %>% slice_sample(n = x)
One option with data.table:
library(data.table); X <- 2  # X = maximum number of rows to keep per group
setDT(df)[df[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1]
samples groups
1: A 1
2: A 1
3: A 2
4: B 1
5: B 1
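The same "sample at most X rows per group" logic can also be sketched in base R. This is only an illustration: max_rows is a hypothetical name standing in for X, and the seed is there just to make the run reproducible.

```r
samples <- c("A", "A", "A", "A", "B", "B")
groups <- c(1, 1, 1, 2, 1, 1)
df <- data.frame(samples, groups)
max_rows <- 2  # hypothetical name for X, the cap per group

set.seed(42)
# Split the row indices by (samples, groups), dropping empty combinations
row_sets <- split(seq_len(nrow(df)), list(df$samples, df$groups), drop = TRUE)

# From each set keep min(size, max_rows) randomly chosen indices;
# sample.int() on the length avoids sample()'s special case for length-1 input
keep <- unlist(lapply(row_sets, function(i) {
  i[sample.int(length(i), min(length(i), max_rows))]
}))

result <- df[sort(unname(keep)), ]
result
```

Which two of the three (A,1) rows survive depends on the seed; the group sizes do not.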

Widening a data frame by copying values from a conditionally-identified row into new columns

I have a data set for a meta-analysis that contains pre-test data in a set of columns, post-test data in another set of columns, and one column for condition (i.e., treatment [Condition == 1] versus control [Condition == 0]). I need to widen this data set such that I create a new set of columns for control observations' pre-test data and post-test data which is placed alongside that of the original treatment data. These data are grouped by ID. This means that I need to conditionally copy only observations that are "control" into a set of columns alongside the "treatment" observations, but within each ID group.
I know that's an obnoxious way to describe it, so here's an example of the data set I have:
data_before.df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
Condition = c(0,1,2,0,1,2,0,1,2),
Pre_M = c(1,2,3,4,5,6,7,8,9),
Post_M = c(90,80,70,60,50,40,30,20,10))
data_before.df
And here's what I need to get to:
data_after.df <- data.frame(ID = c(1,1,2,2,3,3),
Condition = c(1,2,1,2,1,2),
Pre_M = c(2,3,5,6,8,9),
Post_M = c(80,70,50,40,20,10),
Control_Pre_M = c(1,1,4,4,7,7),
Control_Post_M = c(90,90,60,60,30,30))
data_after.df
Here is one option with dplyr. After grouping by 'ID', create two new columns with 'Control' in their names by looping over the columns that end with 'M' and taking the value where 'Condition' is 0. Then ungroup, filter out the rows where 'Condition' is 0, and rename the new columns so that 'Control' becomes a prefix.
library(dplyr)
library(stringr)
data_before.df %>%
group_by(ID) %>%
mutate_at(vars(ends_with('M')), list(Control = ~.[Condition == 0])) %>%
ungroup %>%
filter(Condition != 0) %>%
rename_at(vars(ends_with('Control')), ~
str_replace(., '(.*)_Control', 'Control_\\1'))
# A tibble: 6 x 6
# ID Condition Pre_M Post_M Control_Pre_M Control_Post_M
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 2 80 1 90
#2 1 2 3 70 1 90
#3 2 1 5 50 4 60
#4 2 2 6 40 4 60
#5 3 1 8 20 7 30
#6 3 2 9 10 7 30
Or an option with merge from base R
merge(subset(data_before.df, Condition != 0),
subset(data_before.df, Condition == 0,
select = c("ID", "Pre_M", "Post_M")), by = 'ID')
Or a join with data.table
library(data.table)
setDT(data_before.df)[Condition != 0][data_before.df[Condition == 0,
.(ID, Control_Pre_M = Pre_M, Control_Post_M = Post_M)], on = .(ID)]
# ID Condition Pre_M Post_M Control_Pre_M Control_Post_M
#1: 1 1 2 80 1 90
#2: 1 2 3 70 1 90
#3: 2 1 5 50 4 60
#4: 2 2 6 40 4 60
#5: 3 1 8 20 7 30
#6: 3 2 9 10 7 30
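One caveat about the merge() option above: with its default suffixes, the duplicated columns come back as Pre_M.x, Pre_M.y and so on rather than with a Control_ prefix. A minimal sketch that renames the control columns before merging (the intermediate name control is just for illustration):

```r
data_before.df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                             Condition = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
                             Pre_M = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                             Post_M = c(90, 80, 70, 60, 50, 40, 30, 20, 10))

# Control rows only, with the measurement columns prefixed by "Control_"
control <- subset(data_before.df, Condition == 0, select = c(ID, Pre_M, Post_M))
names(control)[-1] <- paste0("Control_", names(control)[-1])

# Attach each ID's control values to its non-control rows
data_after.df <- merge(subset(data_before.df, Condition != 0), control, by = "ID")
data_after.df <- data_after.df[order(data_after.df$ID, data_after.df$Condition), ]
data_after.df
```

This assumes exactly one control row per ID; with several, merge() would repeat each treatment row once per control row.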

How to subset data based on combination of criteria in R

I have several million rows of data and I need to create a subset. I have had no success despite trying hard and searching all over the web. The question is:
How to create a subset including only the smallest values of value for all ID & item combinations?
The data structure looks like this:
> df = data.frame(ID = c(1,1,1,1,2,2,2,2),
item = c('A','A','B','B','A','A','B','B'),
value = c(10,5,3,2,7,8,9,10))
> df
ID item value
1 1 A 10
2 1 A 5
3 1 B 3
4 1 B 2
5 2 A 7
6 2 A 8
7 2 B 9
8 2 B 10
The result should look like this:
ID item value
1 A 5
1 B 2
2 A 7
2 B 9
Any hints greatly appreciated. Thank you!
We can use aggregate from base R with grouping variables 'ID' and 'item' to get the min of 'value'
aggregate(value~., df, min)
# ID item value
#1 1 A 5
#2 2 A 7
#3 1 B 2
#4 2 B 9
Or using dplyr
library(dplyr)
df %>%
group_by(ID, item) %>%
summarise(value = min(value))
Or with data.table
library(data.table)
setDT(df)[, .(value = min(value)) , .(ID, item)]
Or another option would be to order and get the first row after grouping
setDT(df)[order(value), head(.SD, 1), .(ID, item)]
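The "order, then keep the first row per group" idea also works in plain base R. A small sketch on the example data; the names sorted and smallest are only illustrative:

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 2),
                 item = c("A", "A", "B", "B", "A", "A", "B", "B"),
                 value = c(10, 5, 3, 2, 7, 8, 9, 10))

# Sort so that the smallest value comes first within each ID/item combination
sorted <- df[order(df$ID, df$item, df$value), ]

# duplicated() flags every row after the first of each combination; drop those
smallest <- sorted[!duplicated(sorted[c("ID", "item")]), ]
smallest
```

Unlike the aggregate() and summarise() versions, this keeps the complete row, which can be handy if the real data has further columns.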

Add column in r with the previous values in another column

I have a data frame with 2 columns: Patient_id and time (when the patient visited the doctor).
I would like to add a new column "timestart" that is 0 in the first row of each Patient_id and, in the remaining rows for that ID, holds the previous value of the time column.
I thought of doing this with a for loop, but I am a new R user and I don't know how.
Thanks in advance.
We can group by 'Patient_id' and create the new column with the lag of 'time'
library(dplyr)
df1 %>%
group_by(Patient_id) %>%
mutate(timestart = lag(time, default = 0))
# Patient_id time timestart
# <int> <int> <int>
#1 1 1 0
#2 1 2 1
#3 1 3 2
#4 2 1 0
#5 2 2 1
#6 2 3 2
data
df1 <- data.frame(Patient_id = rep(1:2, each = 3), time = 1:3)
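If dplyr is not an option, the grouped lag can be sketched in base R with ave(): within each Patient_id, shift time down by one position and put 0 in front.

```r
df1 <- data.frame(Patient_id = rep(1:2, each = 3), time = 1:3)

# For each patient: 0 first, then all but the last value of time
df1$timestart <- ave(df1$time, df1$Patient_id,
                     FUN = function(x) c(0, head(x, -1)))
df1
```

As with lag(), this assumes the rows are already in visit order within each patient.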
