Generate all possible subsets of a given set and do some calculations - r

I have a data frame that looks like this
subj trial factor rt
1 1 Early 324
1 2 Early 405
1 3 Early 293
1 4 Early 738
1 5 Late 310
1 6 Late 389
1 7 Late 350
1 8 Late 782
1 9 Late 513
1 10 Late 401
2 1 Early 420
2 2 Early 230
2 3 Early 309
2 4 Late 456
2 5 Late 241
2 6 Late 400
2 7 Late 189
2 8 Late 329
2 9 Late 519
2 10 Late 230
3 1 Early 299
3 2 Early 499
3 3 Late 403
3 4 Late 389
3 5 Late 356
3 6 Late 365
3 7 Late 234
3 8 Late 345
3 9 Late 300
3 10 Late 402
As you can see, the number of trials differs between the two conditions.
What I want to do is, for each participant, calculate the number of trials per condition (for participant 1 it would be Early = 4 and Late = 6, for participant 2 Early = 3 and Late = 7, and for participant 3 Early = 2 and Late = 8).
The number of Early trials determines the size of the subsets I want to generate. So, for participant 1, I want to generate all possible combinations of 4 trials out of the 6 trials in the Late condition and calculate a mean for each combination. I don't know if I'm explaining it correctly.
So it would go something like this: since participant 1 has 4 trials in the Early condition, I will calculate a mean rt score for those 4 trials. But for the Late condition, I want to generate all possible combinations of trials (5 6 7 8, 5 6 7 9, 5 6 7 10, 5 6 8 9, and so on), calculate the mean rt score for each combination, and then a general mean for the Late condition.
I don't know how to go about doing this. I know the expand.grid() function can help with the combination part, but I don't know how to make the size of the combinations depend on the number of Early trials, since this varies per participant.
I don't know if I was clear enough, but I hope someone can shed some light on it.
Thanks, guys!

Here is a base R solution. You can define a custom function combavg to calculate the mean of each combination:
combavg <- function(x) {
  r <- data.frame(t(combn(which(x$factor == "Late"), sum(x$factor == "Early"),
                          function(v) c(v, mean(x$rt[v])))))
  names(r)[ncol(r)] <- "rt.avg"
  r
}
and then use the following line to get the result
res <- Map(combavg, split(df, df$subj))
such that
> res
$`1`
X1 X2 X3 X4 rt.avg
1 5 6 7 8 457.75
2 5 6 7 9 390.50
3 5 6 7 10 362.50
4 5 6 8 9 498.50
5 5 6 8 10 470.50
6 5 6 9 10 403.25
7 5 7 8 9 488.75
8 5 7 8 10 460.75
9 5 7 9 10 393.50
10 5 8 9 10 501.50
11 6 7 8 9 508.50
12 6 7 8 10 480.50
13 6 7 9 10 413.25
14 6 8 9 10 521.25
15 7 8 9 10 511.50
$`2`
X1 X2 X3 rt.avg
1 4 5 6 365.6667
2 4 5 7 295.3333
3 4 5 8 342.0000
4 4 5 9 405.3333
5 4 5 10 309.0000
6 4 6 7 348.3333
7 4 6 8 395.0000
8 4 6 9 458.3333
9 4 6 10 362.0000
10 4 7 8 324.6667
11 4 7 9 388.0000
12 4 7 10 291.6667
13 4 8 9 434.6667
14 4 8 10 338.3333
15 4 9 10 401.6667
16 5 6 7 276.6667
17 5 6 8 323.3333
18 5 6 9 386.6667
19 5 6 10 290.3333
20 5 7 8 253.0000
21 5 7 9 316.3333
22 5 7 10 220.0000
23 5 8 9 363.0000
24 5 8 10 266.6667
25 5 9 10 330.0000
26 6 7 8 306.0000
27 6 7 9 369.3333
28 6 7 10 273.0000
29 6 8 9 416.0000
30 6 8 10 319.6667
31 6 9 10 383.0000
32 7 8 9 345.6667
33 7 8 10 249.3333
34 7 9 10 312.6667
35 8 9 10 359.3333
$`3`
X1 X2 rt.avg
1 3 4 396.0
2 3 5 379.5
3 3 6 384.0
4 3 7 318.5
5 3 8 374.0
6 3 9 351.5
7 3 10 402.5
8 4 5 372.5
9 4 6 377.0
10 4 7 311.5
11 4 8 367.0
12 4 9 344.5
13 4 10 395.5
14 5 6 360.5
15 5 7 295.0
16 5 8 350.5
17 5 9 328.0
18 5 10 379.0
19 6 7 299.5
20 6 8 355.0
21 6 9 332.5
22 6 10 383.5
23 7 8 289.5
24 7 9 267.0
25 7 10 318.0
26 8 9 322.5
27 8 10 373.5
28 9 10 351.0
DATA
df <- structure(list(subj = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), trial = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L), factor = c("Early", "Early", "Early",
"Early", "Late", "Late", "Late", "Late", "Late", "Late", "Early",
"Early", "Early", "Late", "Late", "Late", "Late", "Late", "Late",
"Late", "Early", "Early", "Late", "Late", "Late", "Late", "Late",
"Late", "Late", "Late"), rt = c(324L, 405L, 293L, 738L, 310L,
389L, 350L, 782L, 513L, 401L, 420L, 230L, 309L, 456L, 241L, 400L,
189L, 329L, 519L, 230L, 299L, 499L, 403L, 389L, 356L, 365L, 234L,
345L, 300L, 402L)), class = "data.frame", row.names = c(NA, -30L
))
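As a quick sanity check on the first step (the per-participant trial counts), a contingency table gives them directly; a minimal sketch on toy data in the same shape as the question:

```r
# Count trials per condition for each subject with a contingency table
df <- data.frame(
  subj   = c(1, 1, 1, 2, 2),
  factor = c("Early", "Late", "Late", "Early", "Early")
)
tab <- table(df$subj, df$factor)
tab
#     Early Late
#   1     1    2
#   2     2    0
```

Running table(df$subj, df$factor) on the question's own df gives the real counts per condition.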

The following code splits the data set by subj and applies a function to each subset with lapply. This function fun uses combn to determine the combinations of row indices where factor == "Late" and computes the mean of the rt values at those indices.
fun <- function(DF){
  n <- sum(DF[["factor"]] == "Early")
  late <- which(DF[["factor"]] == "Late")
  cmb <- combn(late, n)
  apply(cmb, 2, function(i) mean(DF[i, "rt"]))
}
sp <- split(df, df$subj)
lapply(sp, fun)
#$`1`
# [1] 457.75 390.50 362.50 498.50 470.50 403.25 488.75
# [8] 460.75 393.50 501.50 508.50 480.50 413.25 521.25
#[15] 511.50
#
#$`2`
# [1] 365.6667 295.3333 342.0000 405.3333 309.0000 348.3333
# [7] 395.0000 458.3333 362.0000 324.6667 388.0000 291.6667
#[13] 434.6667 338.3333 401.6667 276.6667 323.3333 386.6667
#[19] 290.3333 253.0000 316.3333 220.0000 363.0000 266.6667
#[25] 330.0000 306.0000 369.3333 273.0000 416.0000 319.6667
#[31] 383.0000 345.6667 249.3333 312.6667 359.3333
#
#$`3`
# [1] 396.0 379.5 384.0 318.5 374.0 351.5 402.5 372.5 377.0
#[10] 311.5 367.0 344.5 395.5 360.5 295.0 350.5 328.0 379.0
#[19] 299.5 355.0 332.5 383.5 289.5 267.0 318.0 322.5 373.5
#[28] 351.0

Related

Binned physiological time series data in R: calculate duration spent in each bin

I have a dataset containing changes in mean arterial blood pressure (MAP) over time from multiple participants. Here is an example dataframe:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Time = structure(1:14, .Label = c("11:02:00",
"11:03:00", "11:04:00", "11:05:00", "11:06:00", "11:07:00", "11:08:00",
"13:30:00", "13:31:00", "13:32:00", "13:33:00", "13:34:00", "13:35:00",
"13:36:00"), class = "factor"), MAP = c(90.27999878, 84.25, 74.81999969,
80.87000275, 99.38999939, 81.51000214, 71.51000214, 90.08999634,
88.75, 84.72000122, 83.86000061, 94.18000031, 98.54000092, 51
)), class = "data.frame", row.names = c(NA, -14L))
I have binned the data into groups: e.g. MAP 40-60, 60-80, 80-100 and added a unique flag (1, 2 or 3) in an additional column map_bin. This is my code so far:
library(dplyr)
#Mean Arterial Pressure
#Bin 1=40-60; Bin 2=60-80; Bin 3=80-100
map_bin=c("1","2","3")
output <- as_tibble(df) %>%
  mutate(map_bin = case_when(
    MAP >= 40 & MAP < 60 ~ map_bin[1],
    MAP >= 60 & MAP < 80 ~ map_bin[2],
    MAP >= 80 & MAP < 100 ~ map_bin[3]
  ))
For each ID I wish to calculate, in an additional column, the total time MAP is in each bin. I expect the following output:
ID Time     MAP         map_bin map_bin_dur
1  11:02:00 90.27999878 3       5
1  11:03:00 84.25       3       5
1  11:04:00 74.81999969 2       2
1  11:05:00 80.87000275 3       5
1  11:06:00 99.38999939 3       5
1  11:07:00 81.51000214 3       5
1  11:08:00 71.51000214 2       2
2  13:30:00 90.08999634 3       6
2  13:31:00 88.75       3       6
2  13:32:00 84.72000122 3       6
2  13:33:00 83.86000061 3       6
2  13:34:00 94.18000031 3       6
2  13:35:00 98.54000092 3       6
2  13:36:00 51          1       1
Where map_bin_dur is the time in minutes that MAP for each individual resided in each bin. e.g. ID 1 had a MAP in Bin 3 for 5 minutes in total.
If the Time column always advances in 1-minute steps, you can use add_count:
library(dplyr)
output <- output %>% add_count(ID, map_bin, name = 'map_bin_dur')
output
# ID Time MAP map_bin map_bin_dur
# <int> <fct> <dbl> <chr> <int>
# 1 1 11:02:00 90.3 3 5
# 2 1 11:03:00 84.2 3 5
# 3 1 11:04:00 74.8 2 2
# 4 1 11:05:00 80.9 3 5
# 5 1 11:06:00 99.4 3 5
# 6 1 11:07:00 81.5 3 5
# 7 1 11:08:00 71.5 2 2
# 8 2 13:30:00 90.1 3 6
# 9 2 13:31:00 88.8 3 6
#10 2 13:32:00 84.7 3 6
#11 2 13:33:00 83.9 3 6
#12 2 13:34:00 94.2 3 6
#13 2 13:35:00 98.5 3 6
#14 2 13:36:00 51 1 1
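The add_count trick assumes exactly one row per minute. If the readings can be irregularly spaced, a base-R sketch (assuming each reading covers the interval up to the next one, and 1 minute for the last reading of each ID) sums actual time differences instead of counting rows:

```r
# Toy data with a 3-minute gap between the 2nd and 3rd readings
df <- data.frame(
  ID      = c(1, 1, 1),
  Time    = as.POSIXct(c("11:02:00", "11:03:00", "11:06:00"), format = "%H:%M:%S"),
  map_bin = c("3", "3", "2")
)

# Minutes from each reading to the next; assume 1 min for the last one
df$step <- ave(as.numeric(df$Time), df$ID,
               FUN = function(t) c(diff(t) / 60, 1))
# Total minutes per ID/bin, repeated on every matching row
df$map_bin_dur <- ave(df$step, df$ID, df$map_bin, FUN = sum)
df$map_bin_dur   # 4 4 1
```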

How can I identify the first row with value lower than the first row in different column in groups in R?

I have a data set that looks like this:
unique score value day
1 2 52 33.75 1
2 2 39 36.25 2
3 3 47 41.25 1
4 3 26 41.00 2
5 3 17 32.25 3
6 3 22 28.00 4
7 3 11 19.00 5
8 3 9 14.75 6
9 3 20 15.50 7
10 4 32 18.00 1
11 4 20 20.25 2
12 5 32 26.00 1
13 5 31 28.75 2
14 5 25 27.00 3
15 5 27 28.75 4
16 6 44 31.75 1
17 6 25 30.25 2
18 6 31 31.75 3
19 6 37 34.25 4
20 6 28 30.25 5
I would like to identify the first row in each group (unique) where the score is lower than the value on day 1.
I have tried this:
result <- df %>%
  group_by(unique.id) %>%
  filter(dailyMyoActivity < globaltma[globalflareday == 1])
But it doesn't seem to do exactly what I want it to do.
Is there a way of doing this?
If I understood your rationale correctly, and if your dataset is already ordered by day, this dplyr solution may come in handy
library(dplyr)
df %>%
  group_by(unique) %>%
  filter(score < value[day == 1]) %>%
  slice(1)
Output
# A tibble: 3 x 4
# Groups: unique [3]
# unique score value day
# <int> <int> <dbl> <int>
# 1 3 26 41 2
# 2 5 25 27 3
# 3 6 25 30.2 2
This could help:
library(dplyr)
df %>%
  group_by(unique) %>%
  mutate(Index = ifelse(score < value[day == 1], 1, 0))
# A tibble: 20 x 5
# Groups: unique [5]
   unique score value   day Index
    <int> <int> <dbl> <int> <dbl>
 1      2    52  33.8     1     0
 2      2    39  36.2     2     0
 3      3    47  41.2     1     0
 4      3    26  41       2     1
 5      3    17  32.2     3     1
 6      3    22  28       4     1
 7      3    11  19       5     1
 8      3     9  14.8     6     1
 9      3    20  15.5     7     1
10      4    32  18       1     0
11      4    20  20.2     2     0
12      5    32  26       1     0
13      5    31  28.8     2     0
14      5    25  27       3     1
15      5    27  28.8     4     0
16      6    44  31.8     1     0
17      6    25  30.2     2     1
18      6    31  31.8     3     1
19      6    37  34.2     4     0
20      6    28  30.2     5     1
Then filter on Index == 1 and keep the first such row per group.
We could also use slice
library(dplyr)
df1 %>%
  group_by(unique) %>%
  slice(which(score < value[day == 1])[1])
# A tibble: 3 x 4
# Groups: unique [3]
# unique score value day
# <int> <int> <dbl> <int>
#1 3 26 41 2
#2 5 25 27 3
#3 6 25 30.2 2
data
df1 <- structure(list(unique = c(2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L), score = c(52L, 39L,
47L, 26L, 17L, 22L, 11L, 9L, 20L, 32L, 20L, 32L, 31L, 25L, 27L,
44L, 25L, 31L, 37L, 28L), value = c(33.75, 36.25, 41.25, 41,
32.25, 28, 19, 14.75, 15.5, 18, 20.25, 26, 28.75, 27, 28.75,
31.75, 30.25, 31.75, 34.25, 30.25), day = c(1L, 2L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
Since you asked to identify the first row that fulfils the criterion score < value, a column with the original row number has been added:
result <- df %>%
  mutate(row_nr = row_number()) %>%
  group_by(unique) %>%
  filter(score < value) %>%
  slice(1)
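For completeness, the same "first row below the day-1 value" idea in base R, sketched on a two-group toy data set (first_below is a hypothetical helper name):

```r
df <- data.frame(
  unique = c(3, 3, 3, 5, 5),
  score  = c(47, 26, 17, 32, 25),
  value  = c(41.25, 41, 32.25, 26, 27),
  day    = c(1, 2, 3, 1, 2)
)

first_below <- function(g) {
  hit <- which(g$score < g$value[g$day == 1])[1]  # first qualifying row, NA if none
  g[hit, ]
}

res <- do.call(rbind, lapply(split(df, df$unique), first_below))
res$score   # 26 25
```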

(R) Apply running row sums in a table object with 2 variables

The following is a replicated sample of data which records the duration of 300 absences. month is the first month of the absence and length is the number of concurrent months the absence lasted.
df <- data.frame("month" = sample(c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"),300, replace = TRUE),
"length" = sample.int(6, size = 300, replace = TRUE))
df$month <- factor(df$month, levels(df$month)[c(5,4,8,1,9,7,6,2,12,11,10,3)])
Using table(df$length) you can see how many separate absences lasted for exactly each value of length.
1 2 3 4 5 6
55 45 42 56 51 51
But because length is incremental, if I wanted to show the total number of absences that reached (but not necessarily lasted) a certain number of months, I could use rev(cumsum(rev(table(df$length)))) which gives:
1 2 3 4 5 6
300 245 200 158 102 51
I am interested in seeing this cumulative view by month, but rev(cumsum(rev(table(df$month, df$length)))) returns a vector, not a table.
The result I would like is to take this
table(df$month, df$length)
1 2 3 4 5 6
jan 5 5 4 5 3 2
feb 5 7 2 7 9 3
mar 5 3 2 2 9 4
apr 6 7 4 4 3 11
may 5 5 3 5 5 2
jun 4 4 2 7 4 5
jul 4 3 5 5 1 4
aug 4 0 5 3 6 7
sep 4 5 4 4 3 3
oct 4 2 1 6 5 4
nov 5 2 3 5 2 2
dec 4 2 7 3 1 4
and turn it into this, where the reverse cumulative count of length is calculated for each month.
1 2 3 4 5 6
jan 24 19 14 10 5 2
feb 33 28 21 19 12 3
mar 25 20 17 15 13 4
apr 35 29 22 18 14 11
may 25 20 15 12 7 2
jun 26 22 18 16 9 5
jul 22 18 15 10 5 4
aug 25 21 21 16 13 7
sep 23 19 14 10 6 3
oct 22 18 16 15 9 4
nov 19 14 12 9 4 2
dec 21 17 15 8 5 4
Is there a way to do this using table()? If not, I am open to any solution. Thanks in advance.
We can use rowCumsums on the columns taken in reverse order (a reversed : sequence as the column index) and then reverse the column index again
library(matrixStats)
tbl <- table(df$month, df$length)
tbl[] <- rowCumsums(tbl[, ncol(tbl):1])[, ncol(tbl):1]
tbl
#
# 1 2 3 4 5 6
# jan 24 19 14 10 5 2
# feb 33 28 21 19 12 3
# mar 25 20 17 15 13 4
# apr 35 29 22 18 14 11
# may 25 20 15 12 7 2
# jun 26 22 18 16 9 5
# jul 22 18 15 10 5 4
# aug 25 21 21 16 13 7
# sep 23 19 14 10 6 3
# oct 22 18 16 15 9 4
# nov 19 14 12 9 4 2
# dec 21 17 15 8 5 4
Or in base R, it would be cumsum with apply
tbl[] <- t(apply(tbl[, ncol(tbl):1], 1, cumsum))[, ncol(tbl):1]
data
tbl <- structure(c(5L, 5L, 5L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 5L, 4L, 5L,
7L, 3L, 7L, 5L, 4L, 3L, 0L, 5L, 2L, 2L, 2L, 4L, 2L, 2L, 4L, 3L,
2L, 5L, 5L, 4L, 1L, 3L, 7L, 5L, 7L, 2L, 4L, 5L, 7L, 5L, 3L, 4L,
6L, 5L, 3L, 3L, 9L, 9L, 3L, 5L, 4L, 1L, 6L, 3L, 5L, 2L, 1L, 2L,
3L, 4L, 11L, 2L, 5L, 4L, 7L, 3L, 4L, 2L, 4L), .Dim = c(12L, 6L
), .Dimnames = structure(list(c("jan", "feb", "mar", "apr", "may",
"jun", "jul", "aug", "sep", "oct", "nov", "dec"), c("1", "2",
"3", "4", "5", "6")), .Names = c("", "")), class = "table")
If you create a data frame rather than a table-class object, you can use Reduce with + as the function and accumulate = TRUE to get a cumulative sum. Before casting the "table" (in quotes, since its class is not "table") I made a factor version of the month column so the months stay in calendar order.
df$month_fac <- with(df, factor(month, levels = unique(month)))
tbl <- data.table::dcast(df, month_fac ~ length)
tbl[ncol(tbl):2] <- Reduce('+', rev(tbl[-1]), accumulate = TRUE)
The output is the tbl object, but I didn't bother showing it because you didn't set a seed so the (random) values will be different from the output shown in the question.
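To see what the Reduce call does, here is a minimal sketch on a 3-column toy frame: reversing the column list and accumulating + turns each column into the sum of itself and all columns to its right.

```r
m <- data.frame(a = c(1, 2), b = c(3, 4), d = c(5, 6))
# Reverse the columns, accumulate sums, then reverse back
m[] <- rev(Reduce(`+`, rev(m), accumulate = TRUE))
m   # a = 9 12, b = 8 10, d = 5 6
```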

Moving average and moving slope in R

I am looking to separately calculate a 7-day moving average and 7-day moving slope of 'oldvar'.
My sincere apologies that I didn't add the details below in my original post. These are repeated observations for each id which can go from a minimum of 3 observations per id to 100 observations per id. The start day can be different for different IDs, and to make things complicated, the days are not equally spaced, so some IDs have missing days.
Here is the data structure. Please note that 'average' is the variable that I am trying to create as moving 7-day average for each ID:
id day outcome average
1 1 15 100 NA
2 1 16 110 NA
3 1 17 190 NA
4 1 18 130 NA
5 1 19 140 NA
6 1 20 150 NA
7 1 21 160 140
8 1 22 100 140
9 1 23 180 150
10 1 24 120 140
12 2 16 90 NA
13 2 17 110 NA
14 2 18 120 NA
12 2 20 130 NA
15 3 16 110 NA
16 3 18 200 NA
17 3 19 180 NA
18 3 21 170 NA
19 3 22 180 168
20 3 24 210 188
21 3 25 160 180
22 3 27 200 184
Also, I would appreciate advice on how to calculate a moving 7-day slope from the same data.
Thank you and again many apologies for being unclear the first time around.
The real challenge is to create a data.frame that fills in the missing days. One solution uses the zoo library; its rollapply function provides a way to assign NA to the initial rows.
Using data from OP as is, the solution could be:
library(zoo)
library(dplyr)
# Data from OP
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
day = c(15L,16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 16L, 17L, 18L, 20L,
16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
outcome = c(100L, 110L,190L, 130L, 140L, 150L, 160L, 100L, 180L, 120L, 90L, 110L, 120L,
130L, 110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)),
.Names = c("id", "day", "outcome"), row.names = c(NA, -22L), class = "data.frame")
# Complete the missing days for each id
df_complete <- merge(
  expand.grid(id = unique(df$id), day = min(df$day):max(df$day)),
  df, all = TRUE)
# Valid range of day for each id group
df_id_wise_range <- df %>% group_by(id) %>%
  summarise(min_day = min(day), max_day = max(day)) %>% as.data.frame()
# id min_day max_day
# 1 1 15 24
# 2 2 16 20
# 3 3 16 27
# Join original df and df_complete and then use df_id_wise_range to
# filter it for valid range of day for each group
df_final <- df_complete %>%
  left_join(df, by = c("id", "day")) %>%
  select(-outcome.y) %>%
  inner_join(df_id_wise_range, by = "id") %>%
  filter(day >= min_day & day <= max_day) %>%
  mutate(outcome = outcome.x) %>%
  select(id, day, outcome) %>%
  as.data.frame()
# Now apply mean to get the average
df_average <- df_final %>% group_by(id) %>%
  mutate(average = rollapply(outcome, 7, mean, na.rm = TRUE, by = 1,
                             fill = NA, align = "right", partial = 7)) %>%
  as.data.frame()
df_average
# The result
# id day outcome average
#1 1 15 100 NA
#2 1 16 110 NA
#3 1 17 190 NA
#4 1 18 130 NA
#5 1 19 140 NA
#6 1 20 150 NA
#7 1 21 160 140.0
#8 1 22 100 140.0
#9 1 23 180 150.0
#10 1 24 120 140.0
#11 2 16 90 NA
#12 2 17 110 NA
#13 2 18 120 NA
#....
#....
#19 3 19 180 NA
#20 3 20 NA NA
#21 3 21 170 NA
#22 3 22 180 168.0
#23 3 23 NA 182.5
#24 3 24 210 188.0
#25 3 25 160 180.0
#26 3 26 NA 180.0
#27 3 27 200 184.0
The steps to calculate a moving slope are: first create a function that returns the slope, then use that function inside rollapplyr.
# Function to calculate the slope
slop_e <- function(z) coef(lm(b ~ a, as.data.frame(z)))[[2]]
# Apply function
z2$slope <- rollapplyr(zoo(z2), 7, slop_e, by.column = FALSE, fill = NA, align = "right")
z2
a b mean_a slope
1 1 21 NA NA
2 2 22 NA NA
3 3 23 NA NA
4 4 24 NA NA
5 5 25 NA NA
6 6 26 NA NA
7 7 27 4 1
8 8 28 5 1
9 9 29 6 1
10 10 30 7 1
11 11 31 8 1
12 12 32 9 1
13 13 33 10 1
14 14 34 11 1
15 15 35 12 1
16 16 36 13 1
17 17 37 14 1
18 18 38 15 1
19 19 39 16 1
20 20 40 17 1
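A side note on the slope step: for a simple regression, the slope of b on a equals cov(a, b) / var(a), so a sketch like the following (using the same a/b column names as slop_e above) avoids fitting lm in every window and should drop into the rollapplyr call unchanged:

```r
# Slope of a simple linear regression without lm():
# slope = cov(x, y) / var(x)
slope_fast <- function(z) {
  z <- as.data.frame(z)
  cov(z$a, z$b) / var(z$a)
}

z <- data.frame(a = 1:7, b = 21:27)
slope_fast(z)   # 1, since b rises by 1 per unit of a
```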

Merge two tables in R; column names differ with A and B options

I have two datasets that I'm trying to merge together. The first one contains information for every test subject with a unique ID (in rows). The second contains measurements for every test subject (in columns); however, each subject was measured twice, so the column names read "IDa" and "IDb". I'd like to merge these two tables on the unique ID, regardless of whether it is measurement A or B.
Here's a small sample of the 2 datasets, and a table of the intended output. Any help would be appreciated!
UniqueID Site State Age Height
Tree001 FK OR 23 70
Tree002 FK OR 45 53
Tree003 NM OR 35 84
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1996 4 2
1997 7 8 7 3
1998 3 2 9 4 7
1999 11 9 2 12 3 13
2010 8 8 4 6 11 4
2011 10 5 6 3 8 9
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
Site FK FK FK FK NM NM
State OR OR OR OR OR OR
Age 23 23 45 45 35 35
Height 70 70 53 53 84 84
1996 4 2
1997 7 8 7 3
1998 3 2 9 4 7
1999 11 9 2 12 3 13
2010 8 8 4 6 11 4
2011 10 5 6 3 8 9
Here is one approach.
df1 <- structure(list(UniqueID = structure(1:3, .Label = c("Tree001",
"Tree002", "Tree003"), class = "factor"), Site = structure(c(1L,
1L, 2L), .Label = c("FK", "NM"), class = "factor"), State = structure(c(1L,
1L, 1L), .Label = "OR", class = "factor"), Age = c(23L, 45L,
35L), Height = c(70L, 53L, 84L)), .Names = c("UniqueID", "Site",
"State", "Age", "Height"), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(UniqueID = c(1996L, 1997L, 1998L, 1999L, 2010L,
2011L), Tree001A = c(4L, 7L, 3L, 11L, 8L, 10L), Tree001B = c(NA,
8L, 2L, 9L, 8L, 5L), Tree002A = c(2L, 7L, 9L, 2L, 4L, 6L), Tree002B = c(NA,
NA, 4L, 12L, 6L, 3L), Tree003A = c(NA, 3L, 7L, 3L, 11L, 8L),
Tree003B = c(NA, NA, NA, 13L, 4L, 9L)), .Names = c("UniqueID",
"Tree001A", "Tree001B", "Tree002A", "Tree002B", "Tree003A", "Tree003B"
), class = "data.frame", row.names = c(NA, -6L))
> df1
UniqueID Site State Age Height
1 Tree001 FK OR 23 70
2 Tree002 FK OR 45 53
3 Tree003 NM OR 35 84
> df2
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1 1996 4 <NA> 2 <NA> <NA> <NA>
2 1997 7 8 7 <NA> 3 <NA>
3 1998 3 2 9 4 7 <NA>
4 1999 11 9 2 12 3 13
5 2010 8 8 4 6 11 4
6 2011 10 5 6 3 8 9
# Use transpose function to change df1
df3 <- as.data.frame(t(df1[,-1]))
colnames(df3) <- df1[,1]
# Change rownames to UniqueID
df3$UniqueID <- rownames(df3)
# Rownames to numeric
rownames(df3) <- c(1:4)
# Modify dataframe so that you have two columns for each subject
df3 <- df3[,c(4,1,1,2,2,3,3)]
colnames(df3) <- c("UniqueID", "Tree001A", "Tree001B", "Tree002A",
"Tree002B", "Tree003A", "Tree003B")
# Convert the columns of df2 to factor
df2 <- data.frame(lapply(df2, as.factor))
# Now combine two data frames
new <- rbind(df3,df2)
> new
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1 Site FK FK FK FK NM NM
2 State OR OR OR OR OR OR
3 Age 23 23 45 45 35 35
4 Height 70 70 53 53 84 84
5 1996 4 <NA> 2 <NA> <NA> <NA>
6 1997 7 8 7 <NA> 3 <NA>
7 1998 3 2 9 4 7 <NA>
8 1999 11 9 2 12 3 13
9 2010 8 8 4 6 11 4
10 2011 10 5 6 3 8 9
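An alternative sketch for building the header rows programmatically (assuming every measurement column name is a tree ID plus a trailing A or B): strip the suffix with a regex, look the base IDs up in df1, and transpose the matching attribute rows. Shown on a cut-down version of the data:

```r
df1 <- data.frame(UniqueID = c("Tree001", "Tree002"),
                  Site = c("FK", "FK"), Age = c(23, 45))
df2 <- data.frame(UniqueID = c(1996, 1997),
                  Tree001A = c(4, 7), Tree001B = c(NA, 8),
                  Tree002A = c(2, 7))

# Strip the trailing A/B to recover the subject each column belongs to
base_id <- sub("[AB]$", "", names(df2)[-1])     # "Tree001" "Tree001" "Tree002"
# One attribute row per measurement column, transposed into header rows
attrs  <- t(df1[match(base_id, df1$UniqueID), -1])
header <- data.frame(UniqueID = rownames(attrs), attrs, check.names = FALSE)
names(header)[-1] <- names(df2)[-1]
rbind(header, df2)
```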
