Replicate row value following a factor - r

Given the following data frame:
df <- data.frame(patientID = rep(c(1:4), 3),
                 condition = c(rep("A", 4), rep("B", 4), rep("C", 4)),
                 weight = round(rnorm(12, 70, 7), 1),
                 height = round(c(rnorm(4, 170, 10), rep(0, 8)), 1))
> head(df)
  patientID condition weight height
1         1         A  71.43  168.5
2         2         A  59.89  177.3
3         3         A  72.15  163.4
4         4         A  70.14  166.1
5         1         B  66.21    0.0
6         2         B  66.62    0.0
How can I copy the height for each patient from condition A into the other two conditions? I tried for loops, data.table, and dplyr without success. How can I achieve this with any of these methods?

If your data is as it looks (sorted by condition and patientID, with the same patients in every condition), then you can just make use of recycling as follows:
require(data.table)
setDT(df)[, height := height[condition == "A"]]
But I understand that's a lot of ifs. So, assuming nothing about the data except that (condition, patientID) pairs are unique, you can do:
require(data.table)
setDT(df)[, height := height[condition == "A"], by=patientID]
Once again this makes use of recycling, but within each group, so it doesn't assume the data is ordered.
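To see the recycling rule at work on its own (the values here are illustrative): assigning a length-4 vector into 12 slots repeats it three times.
x <- numeric(12)
x[] <- c(169.5, 173.4, 145.5, 164.7)  # recycled to length 12
x
# [1] 169.5 173.4 145.5 164.7 169.5 173.4 145.5 164.7 169.5 173.4 145.5 164.7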
Both of the above methods on the sample data give:
#     patientID condition weight height
#  1:         1         A   73.3  169.5
#  2:         2         A   76.3  173.4
#  3:         3         A   63.6  145.5
#  4:         4         A   56.2  164.7
#  5:         1         B   67.7  169.5
#  6:         2         B   77.3  173.4
#  7:         3         B   76.8  145.5
#  8:         4         B   70.9  164.7
#  9:         1         C   76.6  169.5
# 10:         2         C   73.0  173.4
# 11:         3         C   66.7  145.5
# 12:         4         C   71.6  164.7
The same idea can be translated to dplyr as well, which I'll leave to you to try. Hint: it just requires group_by and mutate; a minimal sketch follows.
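For reference, one possible sketch of that dplyr translation (assuming the dplyr package is available; it uses the same per-group recycling idea as above):
library(dplyr)
df %>%
  group_by(patientID) %>%
  mutate(height = height[condition == "A"]) %>%
  ungroup()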

No need for the fancy stuff here. Just use the $ operator and [ subsetting. (Note this relies on the patient IDs doubling as the row numbers of the condition-A rows, as in the sample data.)
> df$height <- df$height[df$patientID]
> df
   patientID condition weight height
1          1         A   67.4  175.1
2          2         A   66.8  179.0
3          3         A   49.7  159.7
4          4         A   64.5  165.3
5          1         B   66.0  175.1
6          2         B   70.8  179.0
7          3         B   58.7  159.7
8          4         B   74.3  165.3
9          1         C   70.9  175.1
10         2         C   75.6  179.0
11         3         C   61.3  159.7
12         4         C   74.5  165.3

This should do the trick. It assumes that the first level of the condition factor is always the one with the true data.
idx <- tapply(rownames(df), list(df$patientID, df$condition), identity)
idx <- na.omit(cbind(as.vector(idx[, -1]), as.vector(idx[, 1])))
df[as.vector(idx[, 1]), "height"] <- df[as.vector(idx[, 2]), "height"]
And from @Arun's suggestion:
df$height <- with(df, ave(ifelse(condition == "A", height, -1),
                          factor(patientID), FUN = max))
where you can be explicit about the condition level to pull values from, as the variation below shows.
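A hypothetical variation pulling from condition "B" instead (the -1 sentinel assumes heights are positive, so max picks out the real height within each patient):
df$height <- with(df, ave(ifelse(condition == "B", height, -1),
                          factor(patientID), FUN = max))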

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame(matrix(nrow = 200, ncol = 5))
colnames(df) <- c("group", "trial", "x", "y", "hour")
df$group <- rep(c("A", "B", "C", "D"), each = 50)
df$trial <- rep(c(rep(1, times = 25), rep(2, times = 25)), times = 4)
df[, 3:4] <- runif(400, 0, 50)
df$hour <- rep(1:25, times = 8)
library(plyr)
ddply(.data = df, .variables = c("group", "trial"), .fun = function(x) {
  i <- which(df$x > 30 & df$y > 30)[1:2]
  if (!is.na(i)) x[i, ]
})
Expected results:
   group trial          x         y hour
13     A     1 34.3511423 38.161134   13
15     A     1 38.4920710 40.931734   15
36     A     2 33.4233369 34.481392   11
37     A     2 39.7119930 34.470671   12
52     B     1 43.0604738 46.645491    2
65     B     1 32.5435234 35.123126   15
But instead, my code finds the row positions of the first two matches in the whole dataset and repeats those positions for every group/trial:
  group trial         x         y hour
1     A     1 34.351142 38.161134   13
2     A     1 38.492071 40.931734   15
3     A     2  5.397181 27.745031   13
4     A     2 20.563721 22.636003   15
5     B     1 22.953286 13.898301   13
6     B     1 32.543523 35.123126   15
I would also like rows of NA if a second occurrence isn't present in a group/trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups:   group, trial [8]
   group trial     x     y  hour
   <chr> <dbl> <dbl> <dbl> <int>
 1 A         1  33.5  46.3     4
 2 A         1  32.6  42.7    11
 3 A         2  35.9  43.6     4
 4 A         2  30.5  42.7    14
 5 B         1  33.0  38.1     2
 6 B         1  40.5  30.4     7
 7 B         2  48.6  33.2     2
 8 B         2  34.1  30.9     4
 9 C         1  33.0  45.1     1
10 C         1  30.3  36.7    17
11 C         2  44.8  33.9     1
12 C         2  41.5  35.6     6
13 D         1  44.2  34.3    12
14 D         1  39.1  40.0    23
15 D         2  39.4  47.5     4
16 D         2  42.1  40.1    10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, by] I am selecting rows which match your criteria in i, grouping by the given variables in by, and selecting the first two rows (head) of each group-specific subset of the data .SD. All of this inside the brackets is data.table-specific, and it only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
#     group trial        x        y hour
#  1:     A     1 34.35114 38.16113   13
#  2:     A     1 38.49207 40.93173   15
#  3:     A     2 33.42334 34.48139   11
#  4:     A     2 39.71199 34.47067   12
#  5:     B     1 43.06047 46.64549    2
#  6:     B     1 32.54352 35.12313   15
#  7:     B     2 48.03090 38.53685    5
#  8:     B     2 32.11441 49.07817   18
#  9:     C     1 32.73620 33.68561    1
# 10:     C     1 32.00505 31.23571   20
# 11:     C     2 32.13977 40.60658    9
# 12:     C     2 34.13940 49.47499   16
# 13:     D     1 36.18630 34.94123   19
# 14:     D     1 42.80658 46.42416   23
# 15:     D     2 37.05393 43.24038    3
# 16:     D     2 44.32255 32.80812    8
To get a solution that is closer to what you've tried so far, we can do the following:
ddply(.data = df, .variables = c("group", "trial"), .fun = function(df_temp) {
  i <- which(df_temp$x > 30 & df_temp$y > 30)[1:2]
  df_temp[i, ]
})
Some explanation:
One problem with the code you provided is that you used df inside of ddply. You defined .fun = function(x), but then looked for cases of x > 30 & y > 30 in df rather than in x. Further, your code subsets x with i, but i was computed from df. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]: if only one row meets your condition, you will get a row of NAs anyway, because which(df_temp$x > 30 & df_temp$y > 30)[1:2] returns NA as the second index.
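A quick illustration of that last point, using the built-in mtcars data: indexing a data frame with an NA row index returns a row of NAs.
mtcars[c(1, NA), 1:3]
#           mpg cyl disp
# Mazda RX4  21   6  160
# NA         NA  NA   NA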
Using dplyr, you can also do:
df %>%
  group_by(group, trial) %>%
  slice(which(x > 30 & y > 30)[1:2])
  group trial     x     y  hour
  <chr> <dbl> <dbl> <dbl> <int>
1 A         1  34.4  38.2    13
2 A         1  38.5  40.9    15
3 A         2  33.4  34.5    11
4 A         2  39.7  34.5    12
5 B         1  43.1  46.6     2
6 B         1  32.5  35.1    15
7 B         2  48.0  38.5     5
8 B         2  32.1  49.1    18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
  function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
#    group trial      x      y hour
# 1      A     1 34.351 38.161   13
# 2      A     1 38.492 40.932   15
# 3      B     1 43.060 46.645    2
# 4      B     1 32.544 35.123   15
# 5      C     1 32.736 33.686    1
# 6      C     1 32.005 31.236   20
# 7      D     1 36.186 34.941   19
# 8      D     1 42.807 46.424   23
# 9      A     2 33.423 34.481   11
# 10     A     2 39.712 34.471   12
# 11     B     2 48.031 38.537    5
# 12     B     2 32.114 49.078   18
# 13     C     2 32.140 40.607    9
# 14     C     2 34.139 49.475   16
# 15     D     2 37.054 43.240    3
# 16     D     2 44.323 32.808    8

adjusting value of column based on duplicate rows - iteratively in R

Say I have this dataset:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df
   time val
1   100   1
2   101   3
3   101   1
4   101   2
5   102   3
6   102   1
7   103   2
8   105   3
9   109   1
10  109   2
11  109   1
We can identify duplicate times in the 'time' column like this:
df[duplicated(df$time), ]
What I want to do is to adjust the value of time (add 0.1) if it's duplicate. I could do this like this:
df$time <- ifelse(duplicated(df$time), df$time + 0.1, df$time)
    time val
1  100.0   1
2  101.0   3
3  101.1   1
4  101.1   2
5  102.0   3
6  102.1   1
7  103.0   2
8  105.0   3
9  109.0   1
10 109.1   2
11 109.1   1
The issue here is that we still have duplicate values, e.g. rows 3 and 4 (that they differ in the column 'val' is irrelevant). Rows 10 and 11 have the same problem. Rows 5 and 6 are fine.
Is there a way of doing this iteratively, i.e. adding 0.1 to the first duplicate, 0.2 to the second duplicate (of the same time value), and so on? This way row 4 would become 101.2 and row 11 would become 109.2. The number of duplicates per value is unknown but will never reach 10 (usually a maximum of 4).
As in the top answer to the related question linked by @Henrik, this uses data.table::rowid:
library(data.table)
setDT(df)
# rowid(time) numbers the occurrences of each time value 1, 2, 3, ...
df[, time := time + 0.1 * (rowid(time) - 1)]
#      time val
#  1: 100.0   1
#  2: 101.0   3
#  3: 101.1   1
#  4: 101.2   2
#  5: 102.0   3
#  6: 102.1   1
#  7: 103.0   2
#  8: 105.0   3
#  9: 109.0   1
# 10: 109.1   2
# 11: 109.2   1
Here's a one-line solution using base R:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df$new_time <- df$time + duplicated(df$time) * 0.1 *
  (ave(seq_len(nrow(df)), df$time, FUN = seq_along) - 1)
df
#    time val new_time
# 1   100   1    100.0
# 2   101   3    101.0
# 3   101   1    101.1
# 4   101   2    101.2
# 5   102   3    102.0
# 6   102   1    102.1
# 7   103   2    103.0
# 8   105   3    105.0
# 9   109   1    109.0
# 10  109   2    109.1
# 11  109   1    109.2
With dplyr:
library(dplyr)
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (0:(n() - 1)) * 0.1) %>%
  ungroup() %>%
  select(-time1)
or with row_number() (suggested by Henrik):
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (row_number() - 1) * 0.1) %>%
  ungroup() %>%
  select(-time1)
Output:
    time val
1  100.0   1
2  101.0   3
3  101.1   1
4  101.2   2
5  102.0   3
6  102.1   1
7  103.0   2
8  105.0   3
9  109.0   1
10 109.1   2
11 109.2   1

How can I create random additional rows and append them to an existing data frame?

Suppose I have the following df:
head(df1)
  international_plan voice_mail_plan number_vmail_messages
1                 no             yes                    25
2                 no             yes                    26
3                 no              no                     0
4                yes              no                     0
5                yes              no                     0
6                yes              no                     0
  total_day_minutes total_day_calls total_day_charge total_eve_minutes
1             265.1             110            45.07             197.4
2             161.6             123            27.47             195.5
3             243.4             114            41.38             121.2
4             299.4              71            50.90              61.9
5             166.7             113            28.34             148.3
6             223.4              98            37.98             220.6
  total_eve_calls total_eve_charge total_night_minutes total_night_calls
1              99            16.78               244.7                91
2             103            16.62               254.4               103
3             110            10.30               162.6               104
4              88             5.26               196.9                89
5             122            12.61               186.9               121
6             101            18.75               203.9               118
  total_night_charge total_intl_minutes total_intl_calls total_intl_charge
1              11.01               10.0                3              2.70
2              11.45               13.7                3              3.70
3               7.32               12.2                5              3.29
4               8.86                6.6                7              1.78
5               8.41               10.1                3              2.73
6               9.18                6.3                6              1.70
  number_customer_service_calls churn
1                             1    no
2                             1    no
3                             0    no
4                             2    no
5                             3    no
6                             0    no
I am looking to try the rsparkling + h2o framework on "largish" data to improve my understanding of how to tackle biggish data on a local machine.
Instead of downloading large data from the net, what if I could scale up my existing small data, so that I don't waste time on preprocessing and can concentrate on ML modeling at scale?
What I am looking for is to randomly add rows based only on the existing data (keeping the same columns): say, drawing from a normal distribution for numeric columns and preserving the proportion of levels for categorical columns, so that the dimensions grow from the initial 3333 x 17 to, say, 1000000 x 17, using R. This is for testing purposes only.
Help will be greatly appreciated.
Expected df:
        international_plan voice_mail_plan number_vmail_messages
1                       no             yes                    25
2                       no             yes                    26
3                       no              no                     0
4                      yes              no                     0
5                      yes              no                     0
6                      yes              no                     0
-
1000000                 no             yes                    20
        total_day_minutes total_day_calls total_day_charge total_eve_minutes
1                   265.1             110            45.07             197.4
2                   161.6             123            27.47             195.5
3                   243.4             114            41.38             121.2
4                   299.4              71            50.90              61.9
5                   166.7             113            28.34             148.3
6                   223.4              98            37.98             220.6
        total_eve_calls total_eve_charge total_night_minutes total_night_calls
1                    99            16.78               244.7                91
2                   103            16.62               254.4               103
3                   110            10.30               162.6               104
4                    88             5.26               196.9                89
5                   122            12.61               186.9               121
6                   101            18.75               203.9               118
-
1000000              50            20.22              189.23              100
        total_night_charge total_intl_minutes total_intl_calls total_intl_charge
1                    11.01               10.0                3              2.70
2                    11.45               13.7                3              3.70
3                     7.32               12.2                5              3.29
4                     8.86                6.6                7              1.78
5                     8.41               10.1                3              2.73
6                     9.18                6.3                6              1.70
-
1000000              10.23               7.33                8              2.52
        number_customer_service_calls churn
1                                   1    no
2                                   1    no
3                                   0    no
4                                   2    no
5                                   3    no
6                                   0    no
-
1000000                             2   yes
A quick function built from simple if statements will get you the random values, which you can afterwards put together with cbind.data.frame and merge into your data.
Example data:
set.seed(1)
df <- data.frame(a = factor(c(1, 2, 1, 2, 1), 1:2, labels = c("yes", "no")),
                 b = 1:5,
                 c = rnorm(5))
    a b          c
1 yes 1 -0.6264538
2  no 2  0.1836433
3 yes 3 -0.8356286
4  no 4  1.5952808
5 yes 5  0.3295078
The function checks the data type of each column and returns n randomly generated values using the distribution of the variable:
FUN1 <- function(x, n = 1, seed = 1){
  set.seed(seed)
  # character: sample observed values with their empirical frequencies
  if (is.character(x)) {
    y <- sample(sort(unique(x)), n, replace = TRUE, prob = table(x))
  }
  # factor: sample levels with their empirical frequencies
  if (is.factor(x)) {
    y <- sample(levels(x), n, replace = TRUE, prob = table(x))
  }
  # integer: draw from a normal with the empirical mean and sd, then round
  if (is.integer(x)) {
    y <- round(rnorm(n, mean(x), sd(x)))
  }
  # double: draw from a normal with the empirical mean and sd
  if (!is.integer(x) & is.numeric(x)) {
    y <- rnorm(n, mean(x), sd(x))
  }
  return(y)
}
Loop it over the empirical data with lapply:
newvalues <- lapply(df, FUN1, n = 10)
$a
[1] "yes" "yes" "yes" "no" "yes" "no" "no" "no" "no" "yes"
$b
[1] 2 3 2 6 4 2 4 4 4 3
$c
[1] -0.4727769 0.3057584 -0.6738021 1.6623976 0.4459399 -0.6592326 0.5977084 0.8388290 0.6826185 -0.1642204
Now cbind.data.frame them with do.call:
df1 <- do.call("cbind.data.frame", newvalues)
> df1
     a b          c
1  yes 2 -0.4727769
2  yes 3  0.3057584
3  yes 2 -0.6738021
4   no 6  1.6623976
5  yes 4  0.4459399
6   no 2 -0.6592326
7   no 4  0.5977084
8   no 4  0.8388290
9   no 4  0.6826185
10 yes 3 -0.1642204
and merge them:
df2 <- merge(df, df1, all = TRUE)
     a b          c
1  yes 1 -0.6264538
2  yes 2 -0.6738021
3  yes 2 -0.4727769
4  yes 3 -0.8356286
5  yes 3 -0.1642204
6  yes 3  0.3057584
7  yes 4  0.4459399
8  yes 5  0.3295078
9   no 2 -0.6592326
10  no 2  0.1836433
11  no 4  0.5977084
12  no 4  0.6826185
13  no 4  0.8388290
14  no 4  1.5952808
15  no 6  1.6623976
The process is rather quick, with the exception of the merge. With really big data, the merge may take some time. A quick test with 10 million new rows of three variables took a fraction of a second for the generation and cbind, but about one minute for the merge. Considering that the biggest part of your data would be randomly generated anyway, you could just use the generated dataset on its own, skipping the merge altogether, as sketched below.
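A minimal sketch of that shortcut, reusing FUN1 from above (the 1e6 row count is illustrative):
# generate the rows directly and keep them as the dataset, skipping the merge
df_big <- do.call("cbind.data.frame", lapply(df, FUN1, n = 1e6))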
One fast and easy way to preserve proportions would be to bootstrap (sample with replacement) from your column vectors/features.
# lapply keeps the column types; apply() would first coerce the
# data frame to a character matrix
new_df <- as.data.frame(lapply(df, function(x) sample(x, 1e6, replace = TRUE)))
If you want to simulate from the empirical distribution of numerical features, you might need to write a custom function, e.g. along these lines.
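One possible custom function, sketched here as my own illustration (not part of the original answer): a smoothed bootstrap that resamples observed values and adds kernel noise, with stats::bw.nrd0 supplying a default bandwidth.
simulate_col <- function(x, n) {
  if (is.numeric(x)) {
    # smoothed bootstrap: resample, then jitter by a kernel bandwidth
    sample(x, n, replace = TRUE) + rnorm(n, 0, bw.nrd0(x))
  } else {
    # categorical: a plain bootstrap already preserves level proportions
    sample(x, n, replace = TRUE)
  }
}
new_df <- as.data.frame(lapply(df, simulate_col, n = 1e6))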

R: How to create a new column for 90th quantile based off previous rows in a data frame

data.frame(c = c(1, 7, 11, 4, 5, 5))
   c
1  1
2  7
3 11
4  4
5  5
6  5
Desired data frame:
   c c.90th
1  1     NA
2  7    1.0
3 11    6.4
4  4   10.2
5  5    9.8
6  5    9.4
For the first row, I want it to look at the previous rows (none) and get the 90th quantile: NA.
For the second row, I want it to look at the previous row, 1, and get the 90th quantile: 1.
For the third row, I want it to look at the previous rows, 1 and 7, and get the 90th quantile: 6.4.
And so on.
A solution using data.table that also works by groups:
library(data.table)
dt <- data.table(c = c(1, 7, 11, 4, 5, 5),
                 group = c(1, 1, 1, 2, 2, 2))
cumquantile <- function(y, prob) {
  # for each position, take the quantile of all previous values;
  # y[0:0] is empty, so the first position in each group yields NA
  sapply(seq_along(y), function(x) quantile(y[0:(x - 1)], prob))
}
dt[, c90 := cumquantile(c, 0.9)]
dt[, c90_by_group := cumquantile(c, 0.9), by = group]
> dt
    c group  c90 c90_by_group
1:  1     1   NA           NA
2:  7     1  1.0          1.0
3: 11     1  6.4          6.4
4:  4     2 10.2           NA
5:  5     2  9.8          4.0
6:  5     2  9.4          4.9
Try:
dff <- data.frame(c = c(1, 7, 11, 4, 5, 5))
dff$c.90th <- sapply(1:nrow(dff), function(x)
  quantile(dff$c[0:(x - 1)], 0.9, names = FALSE))
Output:
   c c.90th
1  1     NA
2  7    1.0
3 11    6.4
4  4   10.2
5  5    9.8
6  5    9.4

Assigning rank of values within groups with NAs

I have the following data frame (df), of which this is just a sample:
group value
    1  12.1
    1  10.3
    1    NA
    1  11.0
    1  13.5
    2  11.7
    2    NA
    2  10.4
    2   9.7
Namely,
df <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 value = c(12.1, 10.3, NA, 11.0, 13.5, 11.7, NA, 10.4, 9.7))
Desired output is:
group value order
    1  12.1     3
    1  10.3     1
    1    NA    NA
    1  11.0     2
    1  13.5     4
    2  11.7     3
    2    NA    NA
    2  10.4     2
    2   9.7     1
Namely, I want to find the rank of the "value"s, starting from the smallest, within each "group". How can I do that with R? I will be very glad for any help. Thanks a lot.
We could use ave from base R to create the rank column ("order1") of "value" by "group". If we need NAs wherever "value" is NA, we can assign them afterwards (df$order1[is.na(df$value)] <- NA):
df$order1 <- with(df, ave(value, group, FUN=rank))
df$order1[is.na(df$value)] <- NA
Or using data.table (here NA^0 is 1 and NA^1 is NA, so the multiplier keeps the rank for non-missing values and propagates NA otherwise):
library(data.table)
setDT(df)[, order1 := rank(value) * NA^(is.na(value)), by = group][]
#   group value order1
#1:     1  12.1      3
#2:     1  10.3      1
#3:     1    NA     NA
#4:     1  11.0      2
#5:     1  13.5      4
#6:     2  11.7      3
#7:     2    NA     NA
#8:     2  10.4      2
#9:     2   9.7      1
You can use the rank() function applied to one group at a time to get your desired result. My solution is to write a small helper function and call it in a for loop. I'm sure there are more elegant means using various R libraries, but here is a solution using only base R.
df <- read.table('~/Desktop/stack_overflow28283818.csv', sep = ',', header = T)
# helper function
rankByGroup <- function(df = NULL, grp = 1)
{
  rank(df[df$group == grp, 'value'])
}
# Remove NAs
df.na <- df[is.na(df$value), ]
df.0 <- df[!is.na(df$value), ]
# For loop over groups to list the ranks
for (grp in unique(df.0$group))
{
  df.0[df.0$group == grp, 'order'] <- rankByGroup(df.0, grp)
  print(grp)
}
# Append NAs
df.na$order <- NA
df.out <- rbind(df.0, df.na)
# re-sort for the ordering given in the OP (probably not really required)
df.out <- df.out[order(as.numeric(rownames(df.out))), ]
This gives exactly the output desired, although I suspect that maintaining the position of the NAs in the data may not be necessary for your application.
> df.out
  group value order
1     1  12.1     3
2     1  10.3     1
3     1    NA    NA
4     1  11.0     2
5     1  13.5     4
6     2  11.7     3
7     2    NA    NA
8     2  10.4     2
9     2   9.7     1
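For completeness, a minimal dplyr sketch of the same ranking (my addition, not from the original answers; dplyr's min_rank() returns NA for missing values, so no extra NA handling is needed):
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(order = min_rank(value)) %>%
  ungroup()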
