adjusting value of column if row is a duplicate - iteratively in R

Say I have this dataset:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df
time val
1 100 1
2 101 3
3 101 1
4 101 2
5 102 3
6 102 1
7 103 2
8 105 3
9 109 1
10 109 2
11 109 1
We can identify duplicate times in the 'time' column like this:
df[duplicated(df$time),]
What I want to do is to adjust the value of time (add 0.1) if it's duplicate. I could do this like this:
df$time <- ifelse(duplicated(df$time),df$time+.1,df$time)
time val
1 100.0 1
2 101.0 3
3 101.1 1
4 101.1 2
5 102.0 3
6 102.1 1
7 103.0 2
8 105.0 3
9 109.0 1
10 109.1 2
11 109.1 1
The issue here is that we still have duplicate values, e.g. rows 3 and 4 (that they differ in the column 'val' is irrelevant). Rows 10 and 11 have the same problem. Rows 5 and 6 are fine.
Is there a way of doing this iteratively, i.e. adding 0.1 to the first duplicate, 0.2 to the second duplicate (of the same time value), etc.? This way row 4 would become 101.2, and row 11 would become 109.2. The number of duplicates per value is unknown but will never equal 10 (usually maximum 4).

As in the top answer for the related question linked by @Henrik, this uses data.table::rowid:
library(data.table)
setDT(df)
df[, time := time + 0.1*(rowid(time) - 1)]
# time val
# 1: 100.0 1
# 2: 101.0 3
# 3: 101.1 1
# 4: 101.2 2
# 5: 102.0 3
# 6: 102.1 1
# 7: 103.0 2
# 8: 105.0 3
# 9: 109.0 1
# 10: 109.1 2
# 11: 109.2 1
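For clarity on what rowid contributes here: it returns the running count of each value in order of appearance, so 0.1*(rowid(time) - 1) adds nothing to a first occurrence and 0.1 more for each subsequent repeat. A quick illustrative check on the original time vector:
rowid(c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109))
# [1] 1 1 2 3 1 2 1 1 1 2 3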

Here's a one-line solution using base R:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df$new_time <- df$time + duplicated(df$time)*0.1*(ave(seq_len(nrow(df)), df$time, FUN = seq_along) - 1)
df
# time val new_time
# 1 100 1 100.0
# 2 101 3 101.0
# 3 101 1 101.1
# 4 101 2 101.2
# 5 102 3 102.0
# 6 102 1 102.1
# 7 103 2 103.0
# 8 105 3 105.0
# 9 109 1 109.0
# 10 109 2 109.1
# 11 109 1 109.2
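To see what the ave term contributes: it builds the same within-group running count that rowid produced above, shown here on the sample data just to make the construction visible:
ave(seq_len(nrow(df)), df$time, FUN = seq_along)
# [1] 1 1 2 3 1 2 1 1 1 2 3
Since that counter is already 1 for first occurrences (so the subtraction yields 0), the duplicated(df$time) factor is strictly redundant, but it makes the intent explicit.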

With dplyr:
library(dplyr)
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (0:(n() - 1)) * 0.1) %>%
  ungroup() %>%
  select(-time1)
or with row_number() (suggested by Henrik):
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (row_number() - 1) * 0.1) %>%
  ungroup() %>%
  select(-time1)
Output:
time val
1 100.0 1
2 101.0 3
3 101.1 1
4 101.2 2
5 102.0 3
6 102.1 1
7 103.0 2
8 105.0 3
9 109.0 1
10 109.1 2
11 109.2 1

Calculate mean of all groups except the current group

I have a data frame with two grouping variables, 'mkt' and 'mdl', and some values 'pr':
df <- data.frame(mkt = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                 mdl = c('a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b'),
                 pr = c(120, 120, 110, 110, 145, 130, 145, 130, 145))
df
mkt mdl pr
1 1 a 120
2 1 a 120
3 1 b 110
4 1 b 110
5 2 b 145
6 2 a 130
7 2 b 145
8 2 a 130
9 2 b 145
Within each 'mkt', the mean 'pr' for each 'mdl' should be calculated as the mean of 'pr' of all other 'mdl' in the same 'mkt', except the current 'mdl'.
For example, for the group defined by mkt == 1 and mdl == a, 'avgother' is calculated as the average of 'pr' for mkt == 1 (same 'mkt') and mdl == b (all 'mdl' other than the current group a).
Desired result:
# mkt mdl pr avgother
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
First get the average 'pr' for each combination of mkt and mdl; then, within each mkt, exclude the current group's value and take the average of the remaining values.
library(dplyr)
library(purrr)
df %>%
  group_by(mkt, mdl) %>%
  summarise(avgother = mean(pr)) %>%
  mutate(avgother = map_dbl(row_number(), ~ mean(avgother[-.x]))) %>%
  ungroup %>%
  inner_join(df, by = c('mkt', 'mdl'))
# mkt mdl avgother pr
# <dbl> <chr> <dbl> <dbl>
#1 1 a 110 120
#2 1 a 110 120
#3 1 b 120 110
#4 1 b 120 110
#5 2 a 145 130
#6 2 a 145 130
#7 2 b 130 145
#8 2 b 130 145
#9 2 b 130 145
Using data.table, calculate sum and length by 'mkt'. Then, within each mkt-mdl group, calculate mean as (mkt sum - group sum) / (mkt length - group length)
library(data.table)
setDT(df)[ , `:=`(s = sum(pr), n = .N), by = mkt]
df[ , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
df[ , `:=`(s = NULL, n = NULL)]
# mkt mdl pr avgother
# 1: 1 a 120 110
# 2: 1 a 120 110
# 3: 1 b 110 120
# 4: 1 b 110 120
# 5: 2 b 145 130
# 6: 2 a 130 145
# 7: 2 b 145 130
# 8: 2 a 130 145
# 9: 2 b 145 130
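As a worked check of the formula for mkt 1: s = 120 + 120 + 110 + 110 = 460 and n = 4; within the mdl == 'a' group, sum(pr) = 240 and .N = 2, so avgother = (460 - 240) / (4 - 2) = 110, matching the output above.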
Consider base R with multiple ave calls, grouping at different levels and using the decomposed version of mean (sum / count):
df <- within(df, {
  avgoth <- (ave(pr, mkt, FUN = sum) - ave(pr, mkt, mdl, FUN = sum)) /
    (ave(pr, mkt, FUN = length) - ave(pr, mkt, mdl, FUN = length))
})
df
# mkt mdl pr avgoth
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
For the sake of completeness, here is another data.table approach which uses grouping by each i, i.e., join and aggregate simultaneously.
For demonstration, an enhanced sample dataset is used which has a third market with 3 products:
df <- data.frame(mkt = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
                 mdl = c('a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b', letters[1:3]),
                 pr = c(120, 120, 110, 110, 145, 130, 145, 130, 145, 1:3))
library(data.table)
mdt <- setDT(df)[, .(mdl, s = sum(pr), .N), by = .(mkt)]
df[mdt, on = .(mkt, mdl), avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
The temporary table mdt contains the sum and count of prices within each mkt, replicated for each product mdl within the market:
mdt
mkt mdl s N
1: 1 a 460 4
2: 1 a 460 4
3: 1 b 460 4
4: 1 b 460 4
5: 2 b 695 5
6: 2 a 695 5
7: 2 b 695 5
8: 2 a 695 5
9: 2 b 695 5
10: 3 a 6 3
11: 3 b 6 3
12: 3 c 6 3
Having mkt and mdl in mdt allows for grouping by each i (by = .EACHI).
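In case by = .EACHI is unfamiliar: it makes j run once for each row of i, so the join and the aggregation happen in a single step. A tiny standalone illustration (toy data, not the question's):
dt <- data.table(g = c(1, 1, 2), v = 1:3)
dt[data.table(g = c(1, 2)), sum(v), on = "g", by = .EACHI]
#    g V1
# 1: 1  3
# 2: 2  3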
Here is an approach which computes avgother directly by subsetting the pr values which do not belong to the current value of mdl before computing the averages.
This is quite different from the other answers posted so far, which justifies posting it as a separate answer, IMHO.
# enhanced sample dataset covering more corner cases
df <- data.frame(mkt = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4),
                 mdl = c('a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b', letters[1:3], 'd'),
                 pr = c(120, 120, 110, 110, 145, 130, 145, 130, 145, 1:3, 9))
library(data.table)
setDT(df)[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
13: 4 d 9 NaN
Difference between approaches
The other answers share more or less the same approach (although implemented in different manners):
- compute sums and counts of pr for each mkt
- compute sums and counts of pr for each mkt and mdl
- subtract the mkt/mdl sums and counts from the mkt sums and counts
- compute avgother
This approach:
- groups by mkt
- loops through mdl within each mkt
- subsets pr, keeping only the values which do not belong to the current value of mdl
- computes mean() directly on that subset
Caveat concerning performance: although the code is essentially a one-liner, that does not imply it is the fastest.

Using pivot_longer on multiple variables of a horse racing dataframe in R

Hi, and thanks in advance for any assistance the group can give.
I have a dataset which gives the performance ratings for 7 race horses over their last 3 races. The performance ratings are DaH1, DaH2 and DaH3, where DaH1 is the performance rating for the last race, etc.
I also have data for the distances over which the races were run, where the distances are Dist1, Dist2 and Dist3 and they correspond to the performance ratings, i.e. Horse 2 has a performance rating of 124 for DaH1, with a race distance, Dist1, of 12.
The dataset is:
library(tibble)
horse_data <- tibble(
  DaH1 = c(0, 124, 121, 123, 0, NA, 110),
  DaH2 = c(124, 117, 125, 120, 125, 0, NA),
  DaH3 = c(121, 119, 123, 119, NA, 0, 123),
  Dist1 = c(10, 12, 10.3, 11, 11.5, 14, 10),
  Dist2 = c(10, 10.1, 12, 8, 9.5, 10.25, 8.75),
  Dist3 = c(11.5, 12.5, 9.8, 10, 10, 15, 10),
  horse = c(1, 2, 3, 4, 5, 6, 7)
)
I am trying to use pivot_longer to convert the data to a tidier dataset for performing calculations that depend on race distances.
So far I have used this code:
tidyData <- horse_data %>%
  pivot_longer(
    cols = c(DaH1, DaH2, DaH3),
    names_to = "RaceIdx",
    names_prefix = "DaH",
    values_to = "Rating"
  )
To achieve:
> tidyData
# A tibble: 21 x 6
Dist1 Dist2 Dist3 horse RaceIdx Rating
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 10 10 11.5 1 1 0
2 10 10 11.5 1 2 124
3 10 10 11.5 1 3 121
4 12 10.1 12.5 2 1 124
5 12 10.1 12.5 2 2 117
6 12 10.1 12.5 2 3 119
7 10.3 12 9.8 3 1 121
8 10.3 12 9.8 3 2 125
9 10.3 12 9.8 3 3 123
10 11 8 10 4 1 123
# ... with 11 more rows
Where RaceIdx is the race number.
This has achieved the desired result for the 'Rating' column, but I also need to convert Dist1, Dist2 and Dist3 into a separate column 'Distance' that matches up each horse's corresponding DaH rating with its Dist.
To illustrate, I am trying to end up with a dataset as follows:
Distance horse RaceIdx Rating
<dbl> <dbl> <chr> <dbl>
1 10 1 1 0
2 10 1 2 124
3 11 1 3 121
4 12 2 1 124
5 10.1 2 2 117
6 12.5 2 3 119
7 10.3 3 1 121
8 12 3 2 125
9 9.8 3 3 123
10 11 4 1 123
# ... with 11 more rows
I then need to filter the ratings by distance, so that I can produce average ratings for each horse where the race distance is between 10 and 11.
Many thanks in advance.
We can specify names_sep with a regex lookaround:
library(dplyr)
library(tidyr)
horse_data %>%
  pivot_longer(cols = -c(horse), names_to = c('.value', 'RaceIdx'),
               names_sep = "(?<=[A-Za-z])(?=[0-9])") %>%
  rename(Distance = Dist, Rating = DaH)
# A tibble: 21 x 4
# horse RaceIdx Rating Distance
# <dbl> <chr> <dbl> <dbl>
# 1 1 1 0 10
# 2 1 2 124 10
# 3 1 3 121 11.5
# 4 2 1 124 12
# 5 2 2 117 10.1
# 6 2 3 119 12.5
# 7 3 1 121 10.3
# 8 3 2 125 12
# 9 3 3 123 9.8
#10 4 1 123 11
# … with 11 more rows
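To make the names_sep visible: the lookaround regex matches the zero-width position between a letter and a digit, so each column name splits cleanly into a stem and an index. A quick base R check:
strsplit(c("DaH1", "Dist3"), "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)
# [[1]]
# [1] "DaH" "1"
#
# [[2]]
# [1] "Dist" "3"
And for the follow-up goal stated in the question (average rating per horse for distances between 10 and 11), a minimal sketch on the reshaped data might be (treating NA ratings as missing; whether the 0 ratings should also be dropped is an assumption to adjust):
horse_data %>%
  pivot_longer(cols = -c(horse), names_to = c('.value', 'RaceIdx'),
               names_sep = "(?<=[A-Za-z])(?=[0-9])") %>%
  rename(Distance = Dist, Rating = DaH) %>%
  filter(between(Distance, 10, 11), !is.na(Rating)) %>%
  group_by(horse) %>%
  summarise(avg_rating = mean(Rating))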

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition, grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" argument.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the whole dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,time=8)
library(plyr)
ddply(.data = df, .variables = c("group","trial"), .fun = function(x) {
  i <- which(df$x > 30 & df$y > 30)[1:2]
  if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that over for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr github page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, by] I am selecting rows which match your criteria in i, grouping by the given variables in by, and selecting the first two rows (head) of each group-specific subset .SD of the data. All of this inside the brackets is data.table-specific, and only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
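Regarding the request for rows of NA when a second occurrence is missing: head(.SD, 2) returns only the rows that exist, but integer row indexing pads out-of-range subscripts with NA rows, so a sketch that keeps an NA row for single-match groups would be:
df[x > 30 & y > 30, .SD[1:2], by = .(group, trial)]
(Every group in this particular data has at least two matches, so here the two versions coincide.)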
To try a solution that is closer to what you've tried so far, we can do the following:
ddply(.data = df, .variables = c("group","trial"), .fun = function(df_temp) {
  i <- which(df_temp$x > 30 & df_temp$y > 30)[1:2]
  df_temp[i, ]
})
Some explanation
One problem with the code that you provided is that you used df inside of ddply. You defined .fun = function(x), but then looked for cases of x > 30 & y > 30 in df rather than in x. Further, your code subsets x with i, but i was defined from df. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]: if only one row meets your condition you will get a row with NAs anyway, because which(df_temp$x > 30 & df_temp$y > 30)[1:2] returns NA as the second index.
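A toy illustration of that NA padding (hypothetical data with a single matching row):
d <- data.frame(x = c(40, 10), y = c(40, 10))
d[which(d$x > 30 & d$y > 30)[1:2], ]
#     x  y
# 1  40 40
# NA NA NA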
Using dplyr, you can also do:
df %>%
  group_by(group, trial) %>%
  slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8

How to calculate the cumulative data difference with preceding data by group?

The reduced raw data is as follows:
Data group
2016/1/10 1
2016/2/4 1
2016/3/25 1
2016/4/13 1
2016/5/5 1
2016/7/1 2
2016/8/1 2
2016/10/1 2
2016/12/1 2
2016/12/31 2
The final data I want to get is like:
Data group cum_diff_preceding
2016/1/10 1 0
2016/2/4 1 25
2016/3/25 1 125
2016/4/13 1 182
2016/5/5 1 270
2016/7/1 2 0
2016/8/1 2 31
2016/10/1 2 153
2016/12/1 2 336
2016/12/31 2 456
The calculation method is as follows:
for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, cum_diff_preceding is (2016/2/4-2016/1/10)
for row 2016/3/25, cum_diff_preceding is (2016/3/25-2016/1/10)+(2016/3/25-2016/2/4)
for row 2016/4/13, cum_diff_preceding is (2016/4/13-2016/1/10)+(2016/4/13-2016/2/4)+(2016/4/13-2016/3/25)
for row 2016/5/5, cum_diff_preceding is (2016/5/5-2016/1/10)+(2016/5/5-2016/2/4)+(2016/5/5-2016/3/25)+(2016/5/5-2016/4/13)
for row 2016/7/1, cum_diff_preceding is 0
for row 2016/8/1, cum_diff_preceding is (2016/8/1-2016/7/1)
for row 2016/10/1, cum_diff_preceding is (2016/10/1-2016/7/1)+(2016/10/1-2016/8/1)
for row 2016/12/1, cum_diff_preceding is (2016/12/1-2016/7/1)+(2016/12/1-2016/8/1)+(2016/12/1-2016/10/1)
for row 2016/12/31, cum_diff_preceding is (2016/12/31-2016/7/1)+(2016/12/31-2016/8/1)+(2016/12/31-2016/10/1)+(2016/12/31-2016/12/1)
My attempt is as follows:
as.Date(df$Data, "%Y-%m-%d")
fun_forcast <- function(df){for(i in 2:nrow(df)){df$cum_diff_preceeding[i] <- sum(df$data[i] - df$data[1:(i-1)])}}
ddply(df, .(group), transform, cum_diff_preceding <- fun_forcast)
but it does not work.
Or when I change my code to
fun_forcast <- function(df)(df$cum_diff_preceding <- sapply(1:NROW(df), function(i) sum(df$data[i] - df$data[1:(i-1)])))
ddply(df, .(group), fun_forcast)
it works, but the result format is:
> ddply(df,.(group),fun_forcast)
group V1 V2 V3 V4 V5
1 1 0 25 125 182 270
2 2 0 31 153 336 380
I don't know how to put the results back into cum_diff_preceding in the original data.frame.
We can do this with ave from base R:
df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i-1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1] 0 25 125 182 270 0 31 153 336 456
Or use dplyr:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
#10 2016-12-31 2 456
By converting the dates to numeric, and generalizing the formula:
df %>%
  group_by(group) %>%
  mutate(numdata = as.numeric(Data),
         cum_diff_preceding = (1:n()) * numdata - cumsum(numdata)) %>%
  select(-numdata)
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
# 10 2016-12-31 2 456
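The generalization works because of a simple identity: for the i-th value, the sum of (x[i] - x[j]) over all j up to i equals i*x[i] - cumsum(x)[i], so no explicit loop over preceding rows is needed. A quick sanity check against fun_forcast from above:
x <- c(3, 7, 20)
fun_forcast(x)
# [1]  0  4 30
(1:3) * x - cumsum(x)
# [1]  0  4 30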

Replicate row value following a factor

Given the following data frame:
df <- data.frame(patientID = rep(c(1:4), 3),
                 condition = c(rep("A", 4), rep("B", 4), rep("C", 4)),
                 weight = round(rnorm(12, 70, 7), 1),
                 height = round(c(rnorm(4, 170, 10), rep(0, 8)), 1))
> head(df)
patientID condition weight height
1 1 A 71.43 168.5
2 2 A 59.89 177.3
3 3 A 72.15 163.4
4 4 A 70.14 166.1
5 1 B 66.21 0.0
6 2 B 66.62 0.0
How can I copy the height for each patient from condition A into the other two conditions? I tried using for loops, data.table and dplyr without success.
How can I achieve this using either methods?
If your data is as it looks - sorted by condition, patientID, and the patients per condition are identical, then you can just make use of recycling as follows:
require(data.table)
setDT(df)[, height := height[condition == "A"]]
But I understand that's a lot of ifs there.
So, without assuming anything about the data, with the one exception that (condition, patientID) pairs are unique, you can do:
require(data.table)
setDT(df)[, height := height[condition == "A"], by=patientID]
Once again, this makes use of recycling, but within each group - as it doesn't assume the data is ordered.
Both of the above methods on the sample data give:
# patientID condition weight height
# 1: 1 A 73.3 169.5
# 2: 2 A 76.3 173.4
# 3: 3 A 63.6 145.5
# 4: 4 A 56.2 164.7
# 5: 1 B 67.7 169.5
# 6: 2 B 77.3 173.4
# 7: 3 B 76.8 145.5
# 8: 4 B 70.9 164.7
# 9: 1 C 76.6 169.5
# 10: 2 C 73.0 173.4
# 11: 3 C 66.7 145.5
# 12: 4 C 71.6 164.7
The same idea can be translated to dplyr as well, which I'll leave it to you to try. Hint: it just requires group_by and mutate.
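For reference, that dplyr translation might look like this, a sketch under the same assumption that each patient has exactly one condition "A" row:
library(dplyr)
df %>%
  group_by(patientID) %>%
  mutate(height = height[condition == "A"]) %>%
  ungroup()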
No need for the fancy stuff here. Just use the $ operator and [ subsetting. (Note this relies on the condition A rows coming first, with patientID equal to the row index for those rows.)
> df$height <- df$height[df$patientID]
> df
patientID condition weight height
1 1 A 67.4 175.1
2 2 A 66.8 179.0
3 3 A 49.7 159.7
4 4 A 64.5 165.3
5 1 B 66.0 175.1
6 2 B 70.8 179.0
7 3 B 58.7 159.7
8 4 B 74.3 165.3
9 1 C 70.9 175.1
10 2 C 75.6 179.0
11 3 C 61.3 159.7
12 4 C 74.5 165.3
This should do the trick. It assumes that the first level of the condition factor is always the one with the true data.
idx <- tapply(rownames(df), list(df$patientID, df$condition), identity)
idx<-na.omit(cbind(as.vector(idx[,-1]),as.vector(idx[,1])))
df[as.vector(idx[,1]),"height"] <- df[as.vector(idx[,2]), "height"]
And from @Arun's suggestion:
df$height <- with(df, ave(ifelse(condition == "A", height, -1),
                          factor(patientID), FUN = max))
where you can be explicit about the condition level to pull values from. (Taking the max works because the -1 sentinel is below any real height.)
