Convert long to wide dataset using data.table::dcast or tidyr - r

Given the following data in long format, I would like to reshape it to wide format. The solution should work for an arbitrary number of timepoints.
dat <- structure(list(srdr_id = c("172507", "172507", "172507", "172507",
"172619", "172619", "172619", "172619"), arm = c("CBT_Educ",
"CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"),
timepoint = c(0, 0, 3, 3, 0, 0, 3, 3), n = c(102, 103, 100,
101, 58, 61, 45, 53), mean = c(37.69, 40.23, 34.53, 31.8,
4.6, 4.3, 4.4, 4.1), sd = c(16.06, 14.23, 19.78, 19.67, 2.2,
2.2, 2.3, 2.5)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-8L))
Long dataset:
srdr_id arm timepoint n mean sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 172507 CBT_Educ 0 102 37.7 16.1
2 172507 CBT_MI 0 103 40.2 14.2
3 172507 CBT_Educ 3 100 34.5 19.8
4 172507 CBT_MI 3 101 31.8 19.7
5 172619 MI 0 58 4.6 2.2
6 172619 Educ 0 61 4.3 2.2
7 172619 MI 3 45 4.4 2.3
8 172619 Educ 3 53 4.1 2.5
I would like to create a wide dataset, such that within each srdr_id and arm the three variables (n, mean and sd) appear in the same row.
srdr_id arm n.0 mean.0 sd.0 n.3 mean.3 sd.3
1 172507 CBT_Educ 102 37.7 16.1 100 34.5 19.8
2 172507 CBT_MI 103 40.2 14.2 101 31.8 19.7
5 172619 MI 58 4.6 2.2 45 4.4 2.3
6 172619 Educ 61 4.3 2.2 53 4.1 2.5
The following call:
data.table::dcast(data = dat, srdr_id + arm, value.var = c(n_analyzed, mean, sd))
failed with:
Error in is.formula(formula) : object 'srdr_id' not found

A common workflow for this type of situation is gathering all the metrics, renaming them, and then spreading again. See below:
tidyverse:
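library(dplyr)
library(tidyr)
dat %>%
  gather("measure", "val", n, mean, sd) %>%              # stack n, mean and sd into one column
  mutate(measure = paste0(measure, ".", timepoint)) %>%  # e.g. "n.0", "mean.3"
  select(-timepoint) %>%
  spread(measure, val)                                   # one column per measure/timepoint pair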
# A tibble: 4 x 8
srdr_id arm mean.0 mean.3 n.0 n.3 sd.0 sd.3
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 172507 CBT_Educ 37.7 34.5 102 100 16.1 19.8
2 172507 CBT_MI 40.2 31.8 103 101 14.2 19.7
3 172619 Educ 4.3 4.1 61 53 2.2 2.5
4 172619 MI 4.6 4.4 58 45 2.2 2.3
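gather() and spread() still work, but they are superseded in current tidyr. As a hedged sketch, assuming tidyr 1.2.0 or later (needed for the names_vary argument), pivot_wider() can do the same reshape in one step for any number of timepoints. The data frame below is a plain rebuild of dat from the question.

```r
library(tidyr)

# Rebuild the example data from the question as a plain data frame
dat <- data.frame(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5)
)

wide <- pivot_wider(
  dat,
  id_cols     = c(srdr_id, arm),   # one output row per srdr_id/arm pair
  names_from  = timepoint,         # each timepoint becomes a column suffix
  values_from = c(n, mean, sd),    # the three measures to spread
  names_sep   = ".",               # gives names such as n.0 and mean.3
  names_vary  = "slowest"          # keep n, mean, sd together per timepoint
)
wide
```

Dropping names_vary (or using an older tidyr) should give the same columns, only grouped by measure rather than by timepoint.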
data.table:
library(data.table)
library(magrittr)  # provides %>%, used below to pipe into dcast()
dt <- as.data.table(dat)
melt(dt, id.vars = c("srdr_id", "arm", "timepoint"))[
  , `:=`(variable = paste0(variable, ".", timepoint), timepoint = NULL)
] %>%
  dcast(srdr_id + arm ~ variable, value.var = "value")
srdr_id arm mean.0 mean.3 n.0 n.3 sd.0 sd.3
1: 172507 CBT_Educ 37.69 34.53 102 100 16.06 19.78
2: 172507 CBT_MI 40.23 31.80 103 101 14.23 19.67
3: 172619 Educ 4.30 4.10 61 53 2.20 2.50
4: 172619 MI 4.60 4.40 58 45 2.20 2.30
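The call in the question most likely failed because the formula has no `~` (so srdr_id + arm is evaluated as an ordinary expression) and because value.var was given unquoted names, one of which (n_analyzed) is not a column of dat. As a sketch, assuming a reasonably recent data.table, dcast() accepts several value.var columns directly, which avoids the melt step. The data is rebuilt here so the example runs on its own.

```r
library(data.table)

# Rebuild the example data from the question
dt <- data.table(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5)
)

# Left of ~ : columns that identify a row; right of ~ : column whose values
# become part of the new column names. value.var takes quoted column names.
wide <- dcast(
  dt,
  srdr_id + arm ~ timepoint,
  value.var = c("n", "mean", "sd"),
  sep = "."
)
wide
```

Because the timepoint values come from the data, the same call should cover any number of timepoints.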

One alternative (probably not the most elegant) is to use group_by() and summarise() from the dplyr package.
Here you don't need to calculate anything (all values are already in your initial dataset), so you can use functions like first() and last() to specify which values you want. Note that this relies on the rows being ordered by timepoint within each group, and it only covers two timepoints.
dat %>%
group_by(srdr_id, arm) %>%
summarise(
n0 = first(n), mean0 = first(mean), sd0 = first(sd),
n3 = last(n), mean3 = last(mean), sd3 = last(sd)
)
# srdr_id arm n0 mean0 sd0 n3 mean3 sd3
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 172507 CBT_Educ 102 37.7 16.1 100 34.5 19.8
# 2 172507 CBT_MI 103 40.2 14.2 101 31.8 19.7
# 3 172619 Educ 61 4.3 2.2 53 4.1 2.5
# 4 172619 MI 58 4.6 2.2 45 4.4 2.3

Related

Add or subtract a number based on the combination of columns/conditions

I would like to adjust hdl, ldl, and trig by adding or subtracting a constant based on the combination of the dummy variable columns. The constants are:
med   hdl   ldl   trig
med1  -3.5  34.5  20.1
med2  -6.3  24.7  0
med3  -5    42.3  12
med4  0     23    22
med5  -2.1  12    22.2
Specifically, the rules are:
When a person is taking only one medication, add/subtract the corresponding constant. For example, for ID #12, who is taking only med5:
Adj_hdl = original hdl - 2.1
Adj_ldl = original ldl + 12
Adj_trig = original trig + 22.2
When a person is taking more than one medication, the adjusted hdl/ldl/trig is obtained by adding/subtracting the constant with the largest absolute value among the medications taken.
For example, ID #1 is taking med2, med3, and med5, so his/her adjusted lipids would be:
Adj_hdl = original hdl - 6.3 (since among med2, med3, and med5, med2 has the hdl constant with the largest absolute value)
Adj_ldl = original ldl + 42.3
Adj_trig = original trig + 22.2
As such, the end product would be a data set with additional 3 columns, adj_hdl, adj_ldl, and adj_trig for each of the IDs.
Mock data set:
set.seed(100)
id = 1:100
hdl = rnorm(100, mean = 50, sd = 3)
ldl = rnorm(100, mean = 120, sd = 10)
trig = rnorm(100, mean = 150, sd = 12)
med1 = rbinom(100, size = 1, prob = 0.4)
med2 = rbinom(100, size = 1, prob = 0.6)
med3 = rbinom(100, size = 1, prob = 0.55)
med4 = rbinom(100, size = 1, prob = 0.45)
med5 = rbinom(100, size = 1, prob = 0.72)
data = cbind(id, hdl, ldl, trig, med1, med2, med3, med4, med5)
It's messy but I believe it works.
library(dplyr)
d <- as.data.frame(data)
Here df is the comparison data frame holding the constants:
med hdl ldl trig
1 med1 -3.5 34.5 20.1
2 med2 -6.3 24.7 0.0
3 med3 -5.0 42.3 12.0
4 med4 0.0 23.0 22.0
5 med5 -2.1 12.0 22.2
d |>
rowwise() |>
mutate(across(ldl:trig, ~ .x + max(df[[cur_column()]][df$med %in% (colnames(cur_data()[5:9])[which(cur_data()[5:9] == 1)])]),
.names = "adj_{.col}")) |>
mutate(adj_hdl = hdl - max(abs(df$hdl[df$med %in% (colnames(cur_data()[5:9])[which(cur_data()[5:9] == 1)])])))
Revised (added ifelse instead of second mutate)
d |>
rowwise() |>
  mutate(across(hdl:trig, ~ {
    adj <- max(abs(df[[cur_column()]][df$med %in% (colnames(cur_data()[5:9])[which(cur_data()[5:9] == 1)])]))
    ifelse(cur_column() == "hdl", .x - adj, .x + adj) }, .names = "adj_{.col}"))
# A tibble: 100 × 12
# Rowwise:
id hdl ldl trig med1 med2 med3 med4 med5 adj_ldl adj_trig adj_hdl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 48.5 117. 150. 0 1 1 0 1 159. 173. 42.2
2 2 50.4 134. 146. 0 0 1 0 1 176. 168. 45.4
3 3 49.8 115. 160. 0 0 1 0 1 158. 182. 44.8
4 4 52.7 128. 156. 0 1 1 1 1 171. 178. 46.4
5 5 50.4 105. 162. 0 1 0 1 1 130. 184. 44.1
6 6 51.0 116. 138. 1 1 1 1 1 158. 160. 44.7
7 7 48.3 112. 143. 0 1 1 0 1 155. 165. 42.0
8 8 52.1 116. 138. 0 0 0 1 1 139. 160. 50.0
9 9 47.5 132. 114. 1 0 1 0 1 175. 136. 42.5
10 10 48.9 119. 154. 0 1 0 0 0 144. 154. 42.6
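The rule described in the question (apply the constant with the largest absolute value among the medications taken) can also be sketched without rowwise(), using only base R. This is an illustration under assumptions, not the answer's method: adjust_lipid() is a hypothetical helper, and the two people below are made-up rows mirroring the ID #1 and ID #12 examples. A person taking no medication gets no adjustment.

```r
# Constants from the question, one row per medication
consts <- data.frame(
  med  = paste0("med", 1:5),
  hdl  = c(-3.5, -6.3, -5.0, 0.0, -2.1),
  ldl  = c(34.5, 24.7, 42.3, 23.0, 12.0),
  trig = c(20.1, 0.0, 12.0, 22.0, 22.2)
)

# adjust_lipid() is a hypothetical helper, not part of any package.
# values: numeric vector of measurements, one per person
# taking: 0/1 matrix, one row per person and one column per medication
# const:  signed constants, in the same order as the columns of `taking`
adjust_lipid <- function(values, taking, const) {
  applied <- sweep(taking, 2, const, `*`)               # constant if taken, else 0
  pick <- max.col(abs(applied), ties.method = "first")  # column with largest |constant|
  values + applied[cbind(seq_len(nrow(applied)), pick)] # add the signed constant
}

# Two illustrative people: one on med2, med3 and med5; one on med5 only
people <- data.frame(
  id = c(1, 12), hdl = c(50, 50), ldl = c(120, 120), trig = c(150, 150),
  med1 = c(0, 0), med2 = c(1, 0), med3 = c(1, 0), med4 = c(0, 0), med5 = c(1, 1)
)

taking <- as.matrix(people[consts$med])
for (lipid in c("hdl", "ldl", "trig")) {
  people[[paste0("adj_", lipid)]] <- adjust_lipid(people[[lipid]], taking, consts[[lipid]])
}
people
```

Adding the signed constant, rather than subtracting an absolute value for hdl only, keeps a single code path for all three lipids.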

Create end_time variable on row A from start_time variable on row B

I would like to create an end_time variable for each participant based on their start_times and time_end_of_experiment (minus, say, 10 ms), but I am unsure how to do this.
Here's a minimal working example
df <- data.frame("subject_nr" = c("1", "1", "1", "2", "2"),
"start_time" = c(50, 52, 55, 53, 54.5),
"time_end_of_experiment" = c(60, 60, 60, 55.5, 55.5))
subject_nr start_time time_end_of_experiment
1 1 50.0 60.0
2 1 52.0 60.0
3 1 55.0 60.0
4 2 53.0 55.5
5 2 54.5 55.5
Here's what the final product should look like
subject_nr start_time end_time time_end_of_experiment
1 1 50.0 51.9 60
2 1 52.0 54.9 60
3 1 55.0 59.9 60
4 2 53.0 54.4 55.5
5 2 54.5 55.4 55.5
Here is how we could do it:
First, use lead() to subtract 0.1 from the next start_time within each subject.
Then, for the last row in each group, use an ifelse() statement to subtract 0.1 from time_end_of_experiment instead:
library(dplyr)
df %>%
group_by(subject_nr) %>%
mutate(end_time = lead(start_time, default = last(start_time))-.1, .before=3) %>%
mutate(end_time = ifelse(start_time == last(start_time), time_end_of_experiment-.1, end_time))
subject_nr start_time end_time time_end_of_experiment
<chr> <dbl> <dbl> <dbl>
1 1 50 51.9 60
2 1 52 54.9 60
3 1 55 59.9 60
4 2 53 54.4 55.5
5 2 54.5 55.4 55.5
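The two mutate() steps can probably be folded into one by giving lead() the experiment end time as its default, so the last row of each subject falls back to time_end_of_experiment. A minimal sketch, assuming time_end_of_experiment is constant within each subject, as it is in the example data:

```r
library(dplyr)

df <- data.frame(
  subject_nr = c("1", "1", "1", "2", "2"),
  start_time = c(50, 52, 55, 53, 54.5),
  time_end_of_experiment = c(60, 60, 60, 55.5, 55.5)
)

out <- df %>%
  group_by(subject_nr) %>%
  mutate(
    # next start_time within the subject; the last row has no next value,
    # so it uses the subject's experiment end time instead
    end_time = lead(start_time, default = last(time_end_of_experiment)) - 0.1,
    .after = start_time
  ) %>%
  ungroup()
out
```

This also avoids comparing start_time with last(start_time), which could misfire if a start time were repeated within a subject.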

Looping linear regression output in a data frame in r

I have the dataset below and want to fit a linear regression for each combination of country and Area, then cbind the predicted values to the dataset as three additional columns.
I have done it for one country and one area, but I want to do it for each country and area and put the predicted, upper and lower limit values back in the data set with cbind:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
data_1 <- data[(data$country=="US" & data$Area=="G"),]
model <- lm(amount ~ week, data = data_1)
pre <- predict(model,newdata = data_1,interval = "prediction",level = 0.95)
pre
How can I loop this for other combination of country and Area?
...and a Base R solution:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
splitVar <- paste0(data$country,"-",data$Area)
dfList <- split(data,splitVar)
result <- do.call(rbind,lapply(dfList,function(x){
model <- lm(amount ~ week, data = x)
cbind(x,predict(model,newdata = x,interval = "prediction",level = 0.95))
}))
result
...the results:
country Area week amount fit lwr upr
UK-A.11 UK A 1 45 36.6 -6.0463638 79.24636
UK-A.12 UK A 2 34 37.1 -1.3409128 75.54091
UK-A.13 UK A 3 23 37.6 0.6671656 74.53283
UK-A.14 UK A 4 43 38.1 -0.3409128 76.54091
UK-A.15 UK A 5 43 38.6 -4.0463638 81.24636
US-G.1 US G 1 12 20.8 -27.6791493 69.27915
US-G.2 US G 2 23 21.7 -21.9985147 65.39851
US-G.3 US G 3 34 22.6 -19.3841749 64.58417
US-G.4 US G 4 32 23.5 -20.1985147 67.19851
US-G.5 US G 5 12 24.4 -24.0791493 72.87915
US-I.6 US I 1 12 20.8 -33.8985900 75.49859
US-I.7 US I 2 34 30.5 -18.8046427 79.80464
US-I.8 US I 3 45 40.2 -7.1703685 87.57037
US-I.9 US I 4 65 49.9 0.5953573 99.20464
US-I.10 US I 5 45 59.6 4.9014100 114.29859
We can also use the augment() function from the broom package to get the desired information:
library(dplyr)  # group_by(), mutate(), select()
library(tidyr)  # nest(), unnest()
library(purrr)
library(broom)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(models = map(data, ~ lm(amount ~ week, data = .)),
aug = map(models, ~ augment(.x, interval = "prediction"))) %>%
unnest(aug) %>%
select(country, Area, amount, week, .fitted, .lower, .upper)
# A tibble: 15 x 7
# Groups: country, Area [3]
country Area amount week .fitted .lower .upper
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US G 12 1 20.8 -27.7 69.3
2 US G 23 2 21.7 -22.0 65.4
3 US G 34 3 22.6 -19.4 64.6
4 US G 32 4 23.5 -20.2 67.2
5 US G 12 5 24.4 -24.1 72.9
6 US I 12 1 20.8 -33.9 75.5
7 US I 34 2 30.5 -18.8 79.8
8 US I 45 3 40.2 -7.17 87.6
9 US I 65 4 49.9 0.595 99.2
10 US I 45 5 59.6 4.90 114.
11 UK A 45 1 36.6 -6.05 79.2
12 UK A 34 2 37.1 -1.34 75.5
13 UK A 23 3 37.6 0.667 74.5
14 UK A 43 4 38.1 -0.341 76.5
15 UK A 43 5 38.6 -4.05 81.2
Here is a tidyverse way to do this for every combination of country and Area.
library(tidyverse)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(model = map(data, ~ lm(amount ~ week, data = .x)),
result = map2(model, data, ~data.frame(predict(.x, newdata = .y,
interval = "prediction",level = 0.95)))) %>%
ungroup %>%
select(-model) %>%
unnest(c(data, result))
# country Area week amount fit lwr upr
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 US G 1 12 20.8 -27.7 69.3
# 2 US G 2 23 21.7 -22.0 65.4
# 3 US G 3 34 22.6 -19.4 64.6
# 4 US G 4 32 23.5 -20.2 67.2
# 5 US G 5 12 24.4 -24.1 72.9
# 6 US I 1 12 20.8 -33.9 75.5
# 7 US I 2 34 30.5 -18.8 79.8
# 8 US I 3 45 40.2 -7.17 87.6
# 9 US I 4 65 49.9 0.595 99.2
#10 US I 5 45 59.6 4.90 114.
#11 UK A 1 45 36.6 -6.05 79.2
#12 UK A 2 34 37.1 -1.34 75.5
#13 UK A 3 23 37.6 0.667 74.5
#14 UK A 4 43 38.1 -0.341 76.5
#15 UK A 5 43 38.6 -4.05 81.2
And one more:
library(tidyverse)
data %>%
mutate(CountryArea=paste0(country,Area) %>% factor %>% fct_inorder) %>%
split(.$CountryArea) %>%
map(~lm(amount~week, data=.)) %>%
map(predict, interval = "prediction",level = 0.95) %>%
reduce(rbind) %>%
cbind(data, .)
country Area week amount fit lwr upr
1 US G 1 12 20.8 -27.6791493 69.27915
2 US G 2 23 21.7 -21.9985147 65.39851
3 US G 3 34 22.6 -19.3841749 64.58417
4 US G 4 32 23.5 -20.1985147 67.19851
5 US G 5 12 24.4 -24.0791493 72.87915
6 US I 1 12 20.8 -33.8985900 75.49859
7 US I 2 34 30.5 -18.8046427 79.80464
8 US I 3 45 40.2 -7.1703685 87.57037
9 US I 4 65 49.9 0.5953573 99.20464
10 US I 5 45 59.6 4.9014100 114.29859
11 UK A 1 45 36.6 -6.0463638 79.24636
12 UK A 2 34 37.1 -1.3409128 75.54091
13 UK A 3 23 37.6 0.6671656 74.53283
14 UK A 4 43 38.1 -0.3409128 76.54091
15 UK A 5 43 38.6 -4.0463638 81.24636

Summarizing using function requiring multiple parameters in R

I'm trying to get the area under the curve of some data for each run of a set of simulation runs. My data is of the form:
run year data1 data2 data3
--- ---- ----- ----- -----
1 2001 2.3 45.6 30.2
1 2002 2.4 35.4 23.4
1 2003 2.6 45.6 23.6
2 2001 2.3 45.6 30.2
2 2002 2.4 35.4 23.4
2 2003 2.6 45.6 23.6
3 2001 ... and so on
So, I'd like to get the area under the curve for each data trace for run 1, run 2, ... where the x axis is always the year column and the y axis is each data column. So, as output I want something like:
run Data1_auc Data2_auc Data3_auc
--- --------- --------- ---------
1 4.5 6.7 27.5
2 3.4 6.8 35.4
3 4.5 7.8 45.6
(These are not the actual areas for the data above.)
I want to use the pracma package 'trapz' function to compute the area which takes x and y values: trapz(x, y) where x=year in my case and y=Data column.
I've tried
dataCols <- colnames(myData %>% select(-c("run","year"))
myData <- group_by(run) %>% summarize_at(vars(dataCols), list(auc = trapz(year,.)))
but I can't get it to work without an error. I've tried different variations on this, but can't seem to get it right.
Is this possible? If so, how do I do it?
library(dplyr)
library(pracma)
set.seed(1)
df <- tibble(
run = rep(1:3, each = 3),
year = rep(2001:2003, 3),
data1 = runif(9, 2, 3),
data2 = runif(9, 30, 50),
data3 = runif(9, 20, 40)
)
df
#> # A tibble: 9 x 5
#> run year data1 data2 data3
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2001 2.27 31.2 27.6
#> 2 1 2002 2.37 34.1 35.5
#> 3 1 2003 2.57 33.5 38.7
#> 4 2 2001 2.91 43.7 24.2
#> 5 2 2002 2.20 37.7 33.0
#> 6 2 2003 2.90 45.4 22.5
#> 7 3 2001 2.94 40.0 25.3
#> 8 3 2002 2.66 44.4 27.7
#> 9 3 2003 2.63 49.8 20.3
df %>%
group_by(run) %>%
summarise_at(vars(starts_with("data")), list(auc = ~trapz(year, .)))
#> # A tibble: 3 x 4
#> run data1_auc data2_auc data3_auc
#> <int> <dbl> <dbl> <dbl>
#> 1 1 4.79 66.5 68.7
#> 2 2 5.10 82.3 56.4
#> 3 3 5.45 89.2 50.5
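The attempt in the question most likely fails for two reasons: group_by(run) is not given the data, and list(auc = trapz(year, .)) calls trapz() immediately instead of passing a function (the answer fixes this with the ~ prefix). With dplyr 1.0.0 or later, the same idea can be written with across(), since summarise_at() is superseded. In the sketch below, trap_area() is a hypothetical base R stand-in for pracma::trapz(x, y) so the example has no extra dependency; the data is the first two runs from the question.

```r
library(dplyr)

# trap_area() is a hypothetical stand-in for pracma::trapz(x, y):
# the trapezoidal rule, summing width * average height over each interval
trap_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# First two runs of the sample data from the question
myData <- data.frame(
  run   = rep(1:2, each = 3),
  year  = rep(2001:2003, 2),
  data1 = rep(c(2.3, 2.4, 2.6), 2),
  data2 = rep(c(45.6, 35.4, 45.6), 2),
  data3 = rep(c(30.2, 23.4, 23.6), 2)
)

auc <- myData %>%
  group_by(run) %>%
  summarise(
    # the lambda sees `year` for the current group and .x for each data column
    across(starts_with("data"), ~ trap_area(year, .x), .names = "{.col}_auc")
  )
auc
```

Replacing trap_area with pracma::trapz inside the lambda should give the same areas.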

Adding intermediate observations to a data frame (manual interpolating)

I've got a data frame like below with vector coordinates:
df <- structure(list(x0 = c(22.6, 38.5, 73.7), y0 = c(62.9, 56.6, 27.7
), x1 = c(45.8, 49.3, 80.8), y1 = c(69.9, 21.9, 14)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 3 x 4
x0 y0 x1 y1
<dbl> <dbl> <dbl> <dbl>
1 22.6 62.9 45.8 69.9
2 38.5 56.6 49.3 21.9
3 73.7 27.7 80.8 14
For visualisation purposes I need to manually interpolate points, i.e. add an intermediate row between each two rows of df, where the starting coordinates x0, y0 are the ending coordinates of original, previous row, while ending coordinates x1, y1 are the starting coordinates of original, next row. I also need to preserve information if an observation is from original dataset or it is manually added. So the expected output would be:
# A tibble: 5 x 5
x y pass_end_x pass_end_y source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 45.8 69.9 38.5 56.6 added
3 38.5 56.6 49.3 21.9 original
4 49.3 21.9 73.7 27.7 added
5 73.7 27.7 80.8 14 original
How can I do that in an efficient and elegant way (preferably in tidyverse)?
To do this, all I'm going to do is swap the column names of the start and end points, and then use lead to get the next value of x1 and y1. Then we just add the source tag and bind_rows.
library(tidyverse)
df2 <- df
names(df2) <- names(df2)[c(3,4,1,2)] # swap names
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1),source = "added")
df <- df %>% mutate(source = "original") %>% bind_rows(., df2)
Resulting in:
# A tibble: 6 x 5
x0 y0 x1 y1 source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 38.5 56.6 49.3 21.9 original
3 73.7 27.7 80.8 14 original
4 45.8 69.9 38.5 56.6 added
5 49.3 21.9 73.7 27.7 added
6 80.8 14 NA NA added
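If you need the rows in order, replace the last two lines of the code above with the following, which adds an ID column to sort by: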
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1),source = "added", ID = seq(1,n()*2, by =2)+1)
df <- df %>% mutate(source = "original", ID = seq(1,n()*2, by =2)) %>% bind_rows(., df2) %>% arrange(ID)
# A tibble: 6 x 6
x0 y0 x1 y1 source ID
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 22.6 62.9 45.8 69.9 original 1
2 45.8 69.9 38.5 56.6 added 2
3 38.5 56.6 49.3 21.9 original 3
4 49.3 21.9 73.7 27.7 added 4
5 73.7 27.7 80.8 14 original 5
6 80.8 14 NA NA added 6
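The output above still carries a trailing "added" row with NA end coordinates, and it keeps the x0/y0/x1/y1 names rather than the ones in the expected output. A hedged variant of the same lead() idea that drops that row and uses the requested column names might look like this (the helper column ord is an assumption introduced only for sorting):

```r
library(dplyr)

df <- tibble(
  x0 = c(22.6, 38.5, 73.7), y0 = c(62.9, 56.6, 27.7),
  x1 = c(45.8, 49.3, 80.8), y1 = c(69.9, 21.9, 14)
)

# Original segments sit at odd positions 1, 3, 5, ...
original <- df %>%
  transmute(x = x0, y = y0, pass_end_x = x1, pass_end_y = y1,
            source = "original", ord = 2 * row_number() - 1)

# Added segments run from the end of one row to the start of the next row
added <- df %>%
  transmute(x = x1, y = y1, pass_end_x = lead(x0), pass_end_y = lead(y0),
            source = "added", ord = 2 * row_number()) %>%
  filter(!is.na(pass_end_x))  # the last row has no following segment

result <- bind_rows(original, added) %>%
  arrange(ord) %>%
  select(-ord)
result
```

For n original rows this yields 2n - 1 rows, alternating "original" and "added".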
