Calculate the slope for each individual in R

I have longitudinal data:
ID<-c(1,1,1,2,2,2,2,3,3,4,4,4)
time<-c(0,12,36,0,7,23,68,0,23,0,32,45)
Age<-rnorm(12,45,9)
Sexe<-c("F","F","F","M","M","M","M","M","M","F","F","F")
biology1<-rnorm(12,12,3)
biology2<-rnorm(12,100,20)
biology3<-rnorm(12,45,9)
biology4<-rnorm(12,20,2)
Death<-c(0,0,1,0,0,0,0,0,0,0,0,1)
data<-data.frame(ID,time,Age,Sexe,biology1,biology2,biology3,biology4,Death)
I would like to calculate the slope (from the beginning to the end of the follow-up) of each numerical variable (biology1, biology2, biology3, biology4) for each individual. Mainly, I need a function that calculates the slope for each variable without retyping a new line of code for each variable. I have no idea how to do it.

Here's an approach with dplyr. Here are the tricks:
Use group_by to group the data for each individual.
Use summarise to perform an action for each group.
Use across to do so for multiple columns.
Use starts_with to select the appropriate columns.
Use list(slope = ...) to name the output columns.
Use $coef to extract the coefficients and [2] to get the slope rather than the intercept.
library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(across(starts_with("biology"),
                   list(slope = ~ lm(. ~ time)$coef[2])))
# A tibble: 4 x 5
     ID biology1_slope biology2_slope biology3_slope biology4_slope
  <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
1     1        -0.0459         -1.61         -0.204       -0.00106
2     2         0.131          -0.553         0.0783      -0.0340
3     3        -0.0462         -0.427        -0.402       -0.191
4     4        -0.0524         -1.10          0.379       -0.0736

Here is a data.table option with lm. Note that time is available by name inside j even though it is not in .SDcols, so the formula can use it directly:
library(data.table)
dt <- as.data.table(data)
dt[,
  lapply(.SD, function(x) coef(lm(x ~ time))["time"]),
  by = ID,
  .SDcols = patterns("^biology")
]
which gives results like (the values differ from the dplyr output above because rnorm was not seeded):
   ID    biology1    biology2      biology3     biology4
1:  1  0.07223152  0.07187708 -2.960618e-02  0.022861337
2:  2 -0.05728224  0.13207814 -2.349493e-01 -0.018541982
3:  3 -0.03925044 -0.63219541 -3.166489e-05 -0.009484951
4:  4 -0.01801599  0.04758699 -6.547484e-03 -0.004253647
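For completeness, the same per-ID slopes can also be computed in base R without any packages. This is a minimal sketch, assuming the data frame built above; split() cuts the data by ID and one lm() is fit per biology column:
biology_cols <- grep("^biology", names(data), value = TRUE)
slopes <- do.call(rbind, lapply(split(data, data$ID), function(d) {
  # one regression per biology column, against this individual's times
  sapply(d[biology_cols], function(y) coef(lm(y ~ d$time))[2])
}))
slopes  # one row per ID, one column per biology variable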

Related

R: For a list of dfs, filter for a value in Column1, to extract mean and SD of another Column2 (only of rows with the filtered value in Column1)

I have a big dataset ('links_cl'; each participant of a study has several hundred rows), which I need to subset into dfs, one for each participant.
For those 42 dfs, I then need to do the same operation again and again. After spending half a day trying to write my own function and trying to find a solution online, I now have to ask here.
So, I am looking for a way to subset the huge dataset several times and have one data frame in my environment for every participant, without using the same code 42 times. What I did so far 'by hand' is:
Subj01 <- subset(links_cl, Subj == 01, select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE"))
filter for Column 'Condition' (either == 1,2,3 or 4), and describe/get the mean and sd of 'RT_LINK', which I so far also did 'manually'.
Subj01 %>% filter(Condition == 01) %>% describe(Subj01$RT_LINK)
But here I just get the description of the whole df of Subj01, so I would have to find 4x41 means by hand. It would be great to just have an output with the means and SDs of every participant, but I have no idea where to start and how to tell R to do this.
I tried this, but it won't work:
subsetsubj <- function(x, y) {
  Subj_x <- links_cl %>%
    subset(links_cl,
           Subj == x,
           select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE")) %>%
    filter(Condition == y) %>%
    describe(Subj_x$RT_LINK)
}
I also tried putting all the dfs into a list and working with that, but it led nowhere.
If there is a solution without the subsetting, that would also work; it just seemed a logical step to me. Any idea, any help how to solve it?
You don't really need to split the dataset up into one dataframe for each patient. I would recommend a standard group_by()/summarize() approach, like this:
links_cl %>%
  group_by(Subj, Condition) %>%
  summarize(mean_val = mean(RT_LINK),
            sd_val = sd(RT_LINK))
Output:
    Subj Condition mean_val sd_val
   <int>     <int>    <dbl>  <dbl>
 1     1         1   0.0375  0.873
 2     1         2   0.103   1.05
 3     1         3   0.184   0.764
 4     1         4   0.0375  0.988
 5     2         1  -0.0229  0.962
 6     2         2  -0.156   0.820
 7     2         3  -0.175   0.999
 8     2         4  -0.0763  1.12
 9     3         1   0.272   1.02
10     3         2   0.0172  0.835
# … with 158 more rows
Input:
set.seed(123)
links_cl <- data.frame(
  Subj = rep(1:42, each = 100),
  Condition = rep(1:4, times = 4200/4),
  RT_LINK = rnorm(4200)
)
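As an aside: if you ever do want one data frame per participant, you do not need 42 subset() calls; split() returns them all as a named list in one step. A sketch using the simulated links_cl above:
# a named list with one data frame per participant
subj_list <- split(links_cl, links_cl$Subj)
head(subj_list[["1"]])  # the rows for participant 1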

Calculate/Summarize new variable of 2 rows depending on second column gives NA

I am hoping for help from the swarm intelligence!
In preparing my dataframe of laboratory values across different time points, I encounter the following issue.
My dataframe is in long format, consisting of six variables: the subject ID (sid), timepoint (zeitpunkt), intervention group of the clinical trial (gruppe), the laboratory parameter (parameter), the actual value of the parameter (messwert), and the unit of the value (einheit).
# A tibble: 6 x 6
    sid zeitpunkt gruppe parameter                       messwert einheit
  <dbl> <chr>     <chr>  <chr>                              <dbl> <chr>
1     1 t0        DGE    ACPA(citrull. Prot.-Ak) EIA/Se   1000    U/ml
2     1 t3        DGE    ACPA(citrull. Prot.-Ak) EIA/Se   1000    U/ml
3     1 t2        DGE    Apolipoprot. A1 HP                  1.31 g/l
4     1 t2        DGE    Apolipoprotein B                    0.61 g/l
5     1 t0        DGE    aPTT Pathromtin SL                 34.3   sek.
6     1 t1        DGE    aPTT Pathromtin SL                 37.3   sek.
I would now like to compare the change in value of each parameter, in percent, from baseline (t0) through the last timepoint (t3) in each individual, so that I can then compare the means of the two interventions. Problematic: some values are missing, and some may exist twice with slightly differing values where the laboratory measured twice or samples were missing. Also, some parameters were only assessed at baseline.
I tried to calculate the change in percent by summarizing this new variable into a new dataframe:
labor_change <- labor_long %>%
  group_by(sid, gruppe, parameter, zeitpunkt) %>%
  arrange(sid, parameter, zeitpunkt) %>%
  summarize(messwert = mean(messwert)) %>%
  ungroup() %>%
  group_by(sid, gruppe, parameter) %>%
  summarize(changet3t0 =
    (messwert[zeitpunkt == "t3"] - messwert[zeitpunkt = "t0"] / messwert[zeitpunkt == "t0"]) * 100)
My problem now: as soon as I use values from two different timepoints (i.e. two different rows, differing in the second variable "zeitpunkt") in the calculation, R returns the desired dataframe, but filled with NA only:
# Groups: sid, gruppe, parameter [6]
sid gruppe parameter changet3t0
<dbl> <chr> <chr> <dbl>
1 1 DGE aPTT Pathromtin SL NA
2 1 DGE Basophile % NA
3 1 DGE Basophile absolut NA
4 1 DGE Calcium NA
5 1 DGE Creatinkinase (CK) HP NA
6 1 DGE CRP HP NA
As soon as I eliminate one timepoint out of the calculation, R gives me the desired calculated value. Any idea how I can fix this?
As mentioned by @Martin Gal, in one place you are using = instead of ==, and you are currently computing (x - y/x) when what you want is (x - y)/x, which is different.
I would also suggest using match instead of == for the comparison, since match returns an NA if there is no "t3" or "t0" in the data.
library(dplyr)
labor_change <- labor_long %>%
  group_by(sid, gruppe, parameter, zeitpunkt) %>%
  arrange(sid, parameter, zeitpunkt) %>%
  summarize(messwert = mean(messwert)) %>%
  group_by(sid, gruppe, parameter) %>%
  summarize(changet3t0 = (messwert[match("t3", zeitpunkt)] - messwert[match("t0", zeitpunkt)]) /
              messwert[match("t0", zeitpunkt)] * 100,
            .groups = 'drop')

R - adding values for one column based on a function using another column

I have a dataset that looks like this:
head(dataset)
   Distance Lag time Kurtosis
   7.406100       10
 144.170000        1
  77.318000        1
  81.154000        1
   4.249167        6
I want to add values to the Kurtosis column. To calculate kurtosis, I need to group the Distances by Lag time (i.e., all distances for Lag time 1 will give me one value for kurtosis, etc.).
To get kurtosis I usually use the package "psych" and function describe()
Is there a kind of loop I could add to do this?
Since describe produces a dataframe as output, and what you want is just one column of it (also named kurtosis), you'll need to subset the describe output:
library(dplyr)
library(psych)
df %>%
  group_by(Lag_Time) %>%
  mutate(Kurtosis = describe(Distance)[1, "kurtosis"])
  Distance Lag_Time Kurtosis
     <dbl>    <dbl>    <dbl>
1     7.41       10    NA
2   144.          1    -2.33
3    77.3         1    -2.33
4    81.2         1    -2.33
5     4.25        6    NA
You should be able to do this using dplyr as well. Note the backticks around the column name: group_by('Lag time') with quotes would group by a constant string rather than by the column. describe also needs psych loaded:
library(dplyr)
library(magrittr)
library(psych)
dataset <- dataset %>%
  dplyr::group_by(`Lag time`) %>%
  dplyr::mutate(Kurtosis = describe(Distance)$kurtosis)
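If a single row per lag time is enough (rather than repeating the value on every row, as mutate() does), summarise() is the more direct verb. A sketch under the same column names as the first answer:
library(dplyr)
library(psych)
df %>%
  group_by(Lag_Time) %>%
  summarise(Kurtosis = describe(Distance)[1, "kurtosis"])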

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id = c(1,1,1,1,2,2,2,2,2),
                 time = c(0,3,7,35,0,7,14,28,42),
                 labvalue1 = c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
                 labvalue2 = c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
  id time labvalue1 labvalue2
1  1    0      4.04        NA
2  1    3        NA      63.8
3  1    7      2.93      62.8
4  1   35        NA      61.2
5  2    0        NA      78.1
6  2    7      3.78        NA
7  2   14      3.66      77.6
8  2   28        NA      75.3
9  2   42      2.54        NA
I want to calculate, for each patient (with unique id), the decrease or slope per day between the first and the last measurement, in order to compare the slopes between patients. Time is in days. So eventually I want a new variable, e.g. diff_labvalues, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0), and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last minus first); etc. for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Thereafter, I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find HOW to extract the corresponding time values (the delta-time), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
  first(na.omit(x))
}
last1 <- function(x) {
  last(na.omit(x))
}
df2 = df %>%
  group_by(id) %>%
  mutate_all(funs(diff = (last1(.) - first1(.)) /                                  #it works until here
                    (time[position of last1(.)] - time[position of first1(.)])))   #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try the following (here labs stands in for a single lab-value column in long format, e.g. labvalue1):
df %>%
  group_by(id) %>%
  filter(!is.na(labs)) %>%
  summarise(diff_labs = (last(labs) - first(labs)) / (last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and, checking by hand:
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labs), .(diff_labs = (labs[.N] - labs[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
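Both answers handle a single labs column. If all lab-value columns of the original wide df need the same treatment at once (which is what the mutate_all attempt was after), across() can apply the first/last logic per column while reading time from the same group. A sketch assuming the df defined in the question:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(starts_with("labvalue"), ~ {
    ok <- !is.na(.x)                      # keep this column's observed rows
    (last(.x[ok]) - first(.x[ok])) /      # last minus first observed value
      (last(time[ok]) - first(time[ok]))  # over the matching span of days
  }, .names = "{.col}_diff"))
For labvalue1 this reproduces the -0.1586 and -0.0354 values verified above.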

Aggregating by indices and reweighting in R

I have a ton of Price data that is indexed by State, Date, and UPC (Product Code). I want to aggregate away UPC, and combine the Prices by taking a weighted average. I will try to explain it, but you may just want to read the code below.
Each observation in the dataset is: UPC, date, state, price, and weight. I would like to aggregate away the UPC index in this way:
Take all data points with the same date and state, multiply their prices by their weights, and sum them up. This obviously creates a weighted average, which I call the priceIndex. However, for some date & state combos the weights do not add up to 1. Therefore, I want to create two additional columns: one for the total of the weights for each date & state combo, and one for a reweighted average. I.e., if the original two weights were .5 and .3, change them to .5/(.5+.3) = .625 and .3/(.5+.3) = .375, and then recalculate the weighted average into another price index.
This is what I mean:
upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601)
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2)
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45)
# This is what I have:
data <- data.frame(upc,date,state,price,weight)
data
# These are a few of the weighted calculations:
# .3*26+85*.5+24*.2 = 55.1
# 28*.2+84*.5+19*.15 = 50.45
# 27*.6+98*.3 = 45.6
# Etc. etc.
# Here is the reweighted calculation for date=200602 & state==1:
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 59.35294
# Or, equivalently:
# (28*.2+84*.5+19*.15)/.85 = 59.35294
# This is what I want:
date=c(200601,200602,200603,200601,200602,200603)
state=c(1,1,1,2,2,2)
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575)
totalWeight=c(1,.85,.9,1,1,.85)
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412)
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex)
index
Also, not that it should matter, but there are about 35 states, 150 UPCs, and 84 dates in the dataset -- so there are a lot of observations.
Thanks a lot in advance.
We can use a group-by summarise operation. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)); grouped by 'date' and 'state', we compute the sum of the product of 'price' and 'weight' and the sum of 'weight' as temporary variables, and then create the three output variables in the list based on those.
library(data.table)
setDT(data)[, {tmp1 = sum(price*weight)
               tmp2 = sum(weight)
               list(priceIndex = tmp1, totalWeight = tmp2,
                    reweightedIndex = tmp1/tmp2)}, .(date, state)]
# date state priceIndex totalWeight reweightedIndex
#1: 200601 1 55.100 1.00 55.10000
#2: 200602 1 50.450 0.85 59.35294
#3: 200603 1 45.600 0.90 50.66667
#4: 200603 2 46.575 0.85 54.79412
#5: 200601 2 42.500 1.00 42.50000
#6: 200602 2 51.000 1.00 51.00000
Or using dplyr, we can use summarise to create the 3 columns after doing grouping by 'date' and 'state'.
library(dplyr)
data %>%
  group_by(date, state) %>%
  summarise(priceIndex = sum(price*weight),
            totalWeight = sum(weight),
            reweightedIndex = priceIndex/totalWeight)
# date state priceIndex totalWeight reweightedIndex
# (dbl) (dbl) (dbl) (dbl) (dbl)
#1 200601 1 55.100 1.00 55.10000
#2 200601 2 42.500 1.00 42.50000
#3 200602 1 50.450 0.85 59.35294
#4 200602 2 51.000 1.00 51.00000
#5 200603 1 45.600 0.90 50.66667
#6 200603 2 46.575 0.85 54.79412
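For reference, the same three columns can also be built in base R by splitting on the date/state pairs. A sketch assuming data is still the plain data.frame created in the question:
# weighted sums per date/state combo, then the reweighted index
pieces <- lapply(split(data, list(data$date, data$state), drop = TRUE), function(d) {
  data.frame(date = d$date[1], state = d$state[1],
             priceIndex = sum(d$price * d$weight),
             totalWeight = sum(d$weight))
})
agg <- do.call(rbind, pieces)
agg$reweightedIndex <- agg$priceIndex / agg$totalWeight
agg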

Resources