I have a dataset that looks like this:
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
I want to calculate fat free mass (FFM) and attach the value in a new column in the dataset for each individual. These are the functions to calculate FFM for males and females:
for males (SEX=1):
FFMCalMale <- function (WEIGHT, BMI) {
FFM = 9270*WEIGHT/(6680+216*BMI)
}
and for females (SEX=2):
FFMCalFemale <- function(WEIGHT, BMI) {
FFM = 9270*WEIGHT/(8780+244*BMI)
}
I want to modify this function so that it checks SEX (1 is male, 2 is female), calculates FFM accordingly, and applies the calculation to each individual. Could you please help?
Thanks in advance!
You could use ifelse:
data$FFM <- ifelse(data$SEX==1,
FFMCalMale(data$WEIGHT, data$BMI),
FFMCalFemale(data$WEIGHT, data$BMI))
A data.table approach:
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
')
library(data.table) ## load data.table
setDT(mydata) ## convert the data to datatable
FFMCalMale <- function (WEIGHT, BMI) {
FFM = 9270*WEIGHT/(6680+216*BMI)
}
FFMCalFemale <- function(WEIGHT, BMI) {
FFM = 9270*WEIGHT/(8780+244*BMI)
}
setkey(mydata, SEX)
mydata[, FFM := ifelse(SEX == 1,
FFMCalMale(WEIGHT, BMI),
FFMCalFemale(WEIGHT, BMI))][]
# ID SEX WEIGHT BMI FFM
# 1: 2 1 70 30 49.30851
# 2: 2 1 70 30 49.30851
# 3: 2 1 70 30 49.30851
# 4: 2 1 70 30 49.30851
# 5: 4 1 85 20 71.63182
# 6: 4 1 85 20 71.63182
# 7: 1 2 65 25 40.49395
# 8: 1 2 65 25 40.49395
# 9: 1 2 65 25 40.49395
# 10: 3 2 50 18 35.18828
# 11: 3 2 50 18 35.18828
Here are two ways, one just taking the dataframe (assuming it contains columns with the names SEX, WEIGHT, and BMI):
dffunc <- function(dataframe) {
ifelse(dataframe$SEX == 1,
9270 * dataframe$WEIGHT / (6680 + 216 * dataframe$BMI),
9270 * dataframe$WEIGHT / (8780 + 244 * dataframe$BMI))
}
or as you originally formatted it, but adding the SEX parameter:
func <- function(WEIGHT, BMI, SEX) {
ifelse(SEX == 1,
9270 * WEIGHT / (6680 + 216 * BMI),
9270 * WEIGHT / (8780 + 244 * BMI))
}
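As a minimal sketch of the same idea, the function below uses the two formulas from the question and returns NA for any SEX code other than 1 or 2. The name ffm_by_sex and the NA fallback are assumptions for illustration, not part of the original question.

```r
# Sketch: one vectorised function covering both sexes.
# Coefficients are those given in the question; ffm_by_sex is a hypothetical name.
ffm_by_sex <- function(weight, bmi, sex) {
  ifelse(sex == 1,
         9270 * weight / (6680 + 216 * bmi),         # males (SEX = 1)
         ifelse(sex == 2,
                9270 * weight / (8780 + 244 * bmi),  # females (SEX = 2)
                NA_real_))                           # any other code
}

# One row per individual from the question's data
data <- data.frame(ID = c(1, 2, 3, 4), SEX = c(2, 1, 2, 1),
                   WEIGHT = c(65, 70, 50, 85), BMI = c(25, 30, 18, 20))
data$FFM <- ffm_by_sex(data$WEIGHT, data$BMI, data$SEX)
data
```

Because ifelse is vectorised, the same call works unchanged on the full dataset with repeated rows per ID.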
I am trying to simulate the following "game":
There is a population of 100 units
You randomly sample 10 of these units, record the id's of the units you saw, and then put them back into the population
You then take a second sample, record the id's of the units you saw in this second sample along with the first sample, and then put the second sample back into the population
Repeat this many times
I wrote the following code in R that performs the above procedure:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
The result looks something like this:
> summary_file
iteration_i Count Cumulative unobserved
1 1 10 10 90
2 2 8 18 82
3 3 9 27 73
4 4 8 35 65
5 5 6 41 59
6 6 5 46 54
7 7 7 53 47
8 8 7 60 40
9 9 4 64 36
10 10 3 67 33
11 11 4 71 29
12 12 4 75 25
13 13 1 76 24
14 14 4 80 20
15 15 1 81 19
16 16 2 83 17
17 17 2 85 15
18 18 1 86 14
19 20 1 87 13
20 22 1 88 12
21 23 2 90 10
22 24 1 91 9
23 25 1 92 8
24 27 2 94 6
25 28 1 95 5
26 30 1 96 4
27 35 1 97 3
28 37 1 98 2
29 44 1 99 1
30 46 1 100 0
I would now like to repeat this "game" many times.
I would like to keep the "summary_file" for each "game" (e.g. summary_file_1, summary_file_2, summary_file_3, etc.)
I would then like to create a "total" summary file that shows the number of iterations that were required in each game to observe all units.
This total_summary_file would look something like this:
game_id iterations_required
1 game_1 47
2 game_2 45
3 game_3 44
4 game_4 42
5 game_5 42
Currently, I am just copy/pasting my earlier code several times and storing the results, then I append everything at the end and calculate the summary statistics - but I am trying to find a way to "loop the loop" and do everything at once. I do not know if it is possible to introduce a command like "results_df_i <- do.call(rbind.data.frame, results_i)" into the loop and efficiently create everything at the same time instead of manually copy/pasting the earlier loop.
You're making this a lot less efficient than it could be. To get, say, 100 repeated samples of 10 from the set 1:100 (with replacement), we can do replicate(100, sample(100, 10, TRUE)).
We can then coerce this into a vector and count the number of unique values every 10 entries along the vector until we get to 100. This gives us the number of iterations required to exhaust the samples.
If we put this inside an sapply, we don't even need an explicit loop, which means we can create the results data frame in a single call:
set.seed(1)
n_games <- 10
results <- data.frame(game_id = paste("game", seq(n_games), sep = "_"),
iterations_required = sapply(seq(n_games), function(x) {
samp <- c(replicate(100, sample(100, 10, TRUE)))
# count the iterations that still have unseen units, plus one for the iteration that completes the set
sum(sapply(1:100 * 10, function(n) length(unique(samp[1:n]))) < 100) + 1
}))
results
#> game_id iterations_required
#> 1 game_1 60
#> 2 game_2 45
#> 3 game_3 55
#> 4 game_4 60
#> 5 game_5 58
#> 6 game_6 59
#> 7 game_7 97
#> 8 game_8 61
#> 9 game_9 72
#> 10 game_10 34
Created on 2022-06-11 by the reprex package (v2.0.1)
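One difference from the question is worth noting: sample(100, 10, TRUE) can pick the same unit twice within a single draw, whereas the question's sample(nrow(my_data), 10) cannot. A minimal sketch that keeps each draw free of repeats and simply loops until every unit has been seen might look like this (play_game and its arguments are hypothetical names):

```r
# Hypothetical helper: play one game and return the number of draws
# needed to see every unit at least once.
play_game <- function(n_units = 100, draw_size = 10) {
  seen <- logical(n_units)
  draws <- 0
  while (!all(seen)) {
    seen[sample(n_units, draw_size)] <- TRUE  # no repeats within a draw
    draws <- draws + 1
  }
  draws
}

set.seed(1)
n_games <- 10
results <- data.frame(game_id = paste("game", seq_len(n_games), sep = "_"),
                      iterations_required = replicate(n_games, play_game()))
results
```

The while loop also removes the fixed cap of 100 draws, so a game that happens to need more than 100 iterations is still counted correctly.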
There are lots of ways to get your desired outcome; wrapping your loop in a function and running the function multiple times is another potential solution:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
sample_func <- function(output_file_name) {
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
output_file_name <- summary_file
return(output_file_name)
}
list_of_names <- paste0("game_", 1:10)
output <- lapply(list_of_names, sample_func)
names(output) <- list_of_names
head(output, n = 2)
#> $game_1
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 9 19 81
#> 3 3 8 27 73
#> 4 4 8 35 65
#> 5 5 6 41 59
#> 6 6 6 47 53
#> 7 7 6 53 47
#> 8 8 4 57 43
#> 9 9 4 61 39
#> 10 10 3 64 36
#> 11 11 2 66 34
#> 12 12 1 67 33
#> 13 13 4 71 29
#> 14 14 1 72 28
#> 15 15 2 74 26
#> 16 16 2 76 24
#> 17 17 3 79 21
#> 18 18 4 83 17
#> 19 19 2 85 15
#> 20 20 2 87 13
#> 21 21 1 88 12
#> 22 24 1 89 11
#> 23 25 2 91 9
#> 24 26 1 92 8
#> 25 27 1 93 7
#> 26 30 1 94 6
#> 27 31 1 95 5
#> 28 33 1 96 4
#> 29 34 1 97 3
#> 30 36 1 98 2
#> 31 41 1 99 1
#> 32 66 1 100 0
#>
#> $game_2
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 10 20 80
#> 3 3 7 27 73
#> 4 4 7 34 66
#> 5 5 8 42 58
#> 6 6 5 47 53
#> 7 7 7 54 46
#> 8 8 5 59 41
#> 9 9 1 60 40
#> 10 10 7 67 33
#> 11 11 3 70 30
#> 12 12 3 73 27
#> 13 13 1 74 26
#> 14 14 3 77 23
#> 15 15 4 81 19
#> 16 16 3 84 16
#> 17 17 2 86 14
#> 18 18 1 87 13
#> 19 19 2 89 11
#> 20 20 1 90 10
#> 21 21 2 92 8
#> 22 22 1 93 7
#> 23 25 2 95 5
#> 24 27 1 96 4
#> 25 29 2 98 2
#> 26 30 1 99 1
#> 27 41 1 100 0
The output is a list, so you can use purrr to apply functions to each element (e.g. https://purrr.tidyverse.org/reference/lmap.html) or use the bind_rows() function to create a single dataframe for further use, e.g.
df2 <- bind_rows(output, .id = "game") %>%
group_by("Game" = factor(game, levels = list_of_names)) %>%
summarise(rows_in_output = n(),
number_of_iterations = max(iteration_i))
df2
#> # A tibble: 10 × 3
#> Game rows_in_output number_of_iterations
#> <fct> <int> <int>
#> 1 game_1 32 66
#> 2 game_2 27 41
#> 3 game_3 27 48
#> 4 game_4 32 50
#> 5 game_5 27 35
#> 6 game_6 27 71
#> 7 game_7 28 68
#> 8 game_8 27 48
#> 9 game_9 29 43
#> 10 game_10 29 66
Created on 2022-06-17 by the reprex package (v2.0.1)
Or you can use list2env() to have each individual dataframe in your environment, e.g.
list2env(output, envir = .GlobalEnv)
ls()
#> [1] "df2" "game_1" "game_10" "game_2" "game_3" "game_4"
#> [7] "game_5" "game_6" "game_7" "game_8" "game_9" "list_of_names"
#> [13] "my_data" "output" "response" "results" "results2" "sample_func"
#> [19] "var_1" "var_2" "var_3"
This seems easily solvable using recursion:
fun <- function(x, i=1, size = 10){
a <- setdiff(x,sample(100, size, TRUE)) # Remove the seen from x
if(length(a)) Recall(a, i+1, size) else i # if we have unobserved, call fun again
}
Now we can have as many games as we want:
data.frame(game = paste0('game',seq(10)), results = replicate(10, fun(1:100)))
game results
1 game1 62
2 game2 40
3 game3 51
4 game4 50
5 game5 34
6 game6 83
7 game7 38
8 game8 40
9 game9 53
10 game10 41
You could also do
hist(replicate(1000, fun(1:100)), breaks = 30)
Edit:
Note that this can be edited to take any size and vector, e.g.:
fun <- function(x, size = 10, y=x, i=1){
a <- setdiff(x,sample(y, size, TRUE))
cat('i', i, '\t a: ',a, '\n')
if(length(a)>0) Recall(a, size, y, i+1) else i
}
set.seed(117);fun(1:10, 1)
i 1 a: 1 2 4 5 6 7 8 9 10 # 3 removed
i 2 a: 1 2 4 5 7 8 9 10 # 6 removed
i 3 a: 1 2 4 5 7 8 9 # 10 removed
i 4 a: 1 2 4 5 8 9 # 7 removed
i 5 a: 1 2 5 8 9 # 4 removed
i 6 a: 1 2 5 8 9 # Nothing removed
i 7 a: 1 5 8 9 # 2 removed
i 8 a: 1 5 8 # 9 removed
i 9 a: 1 5 # Nothing removed
i 10 a: 1 5 # Nothing removed
i 11 a: 5 # 1 removed
i 12 a: # 5 removed
Using Markov chains, we can produce the cumulative distribution function for the number of iterations required for a game (up to machine precision). The resulting CDF can be sampled directly using findInterval.
We can simplify things slightly by starting with the second iteration, since the first iteration will always result in 90 unseen units.
First, set up a matrix for all possible transitions:
m <- matrix(c(rep(90:1, each = 11), sequence(rep(11,90), 90:1, -1)), ncol = 2, dimnames = list(NULL, c("from", "to")))
m <- m[m[,2] >= 0L,]
Then create a transition matrix with row 1 representing the state where all units have been seen and row 91 representing the state where 10 units have been seen:
mTrans <- matrix(0, 91, 91)
The number of previously unseen units selected follows the hypergeometric distribution.
mTrans[m + 1L] <- dhyper(m[,1] - m[,2], m[,1], 100L - m[,1], 10L)
Row 1 represents an absorbing state since all units have been seen.
mTrans[1, 1] <- 1
mTrans contains the probabilities of each state after the second iteration.
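As a sanity check on this construction, the sketch below rebuilds the same transition matrix with explicit loops (the names P, Q and t_expected are hypothetical), confirms that every row sums to 1, and uses the standard absorbing-chain identity to get the expected number of iterations. It assumes, as above, that state k means k units are still unseen.

```r
n_units <- 100L
draw    <- 10L

# Row k + 1 is the state with k units still unseen.
P <- matrix(0, 91, 91)
for (from in 1:90) {
  for (new in 0:min(draw, from)) {
    # probability of drawing `new` previously unseen units out of `from`
    P[from + 1L, from - new + 1L] <- dhyper(new, from, n_units - from, draw)
  }
}
P[1, 1] <- 1  # absorbing state: nothing left to see

# Every row must be a probability distribution.
stopifnot(isTRUE(all.equal(rowSums(P), rep(1, 91))))

# Expected number of further draws from each transient state:
# solve (I - Q) t = 1, where Q is P restricted to the transient states.
Q <- P[-1, -1]
t_expected <- solve(diag(90) - Q, rep(1, 90))
expected_iterations <- 1 + t_expected[90]  # +1 for the first draw
expected_iterations
```

The expected value gives a quick cross-check against the mean of any simulated sample of games.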
Initialize a while loop and calculate the CDF.
mm <- mTrans %*% mTrans
maxIter <- 1000L
p <- numeric(maxIter)
iter <- 3L
while (p[iter] < 1) {
if ((iter <- iter + 1L) > maxIter) {
p <- c(p, numeric(maxIter))
maxIter <- maxIter*2L
}
mm <- mm %*% mTrans
p[iter] <- mm[91, 1]
}
p <- p[1:iter]
iter
#> [1] 345
In double precision the CDF reaches 1 at iteration 345, so games longer than that cannot be distinguished. Plot the CDF:
plot(p, xlab = "iterations", ylab = "cumulative probability")
Using findInterval we can quickly generate a large number of random samples of the iterations required.
ngames <- 1e6L # one million games
# findInterval returns the last index i with p[i] <= u, so the game ends at iteration i + 1
results <- data.frame(game_id = 1:ngames, iterations_required = findInterval(runif(ngames), p) + 1L)
head(results)
#> game_id iterations_required
#> 1 1 73
#> 2 2 69
#> 3 3 40
#> 4 4 41
#> 5 5 44
#> 6 6 43
Get a histogram of the sample number of iterations required.
hist(results$iterations_required)
OP here! I think I was able to find an answer to my own question:
library(dplyr)
var_1 <- rnorm(100, 10, 10)
var_2 <- rnorm(100, 1, 10)
var_3 <- rnorm(100, 5, 10)
response <- rnorm(100, 1, 1)
my_data <- data.frame(var_1, var_2, var_3, response)
my_data$id <- 1:100
simulate <- function() {
results <- list()
results2 <- list()
for (i in 1:100) {
iteration_i <- i
sample_i <- my_data[sample(nrow(my_data), 10), ]
results_tmp <- data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file <- data.frame(test_1 %>%
group_by(iteration_i) %>%
summarise(Count=n()))
cumulative <- cumsum(summary_file$Count)
summary_file$Cumulative <- cumulative
summary_file$unobserved <- 100 - cumulative
return(summary_file)
}
# now, loop 10 times!
results <- list()
for (i in 1:10) {
game_i <- i
s_i <- simulate()
results_tmp <- data.frame(game_i, s_i)
results[[i]] <- results_tmp
}
final_file <- do.call(rbind.data.frame, results)
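To get the total summary the question asks for, one option is to take the largest iteration_i per game. The sketch below uses a small made-up stand-in for final_file (the values are illustrative only) and assumes every game reached unobserved == 0 within its 100 iterations; otherwise the maximum is only the last iteration in which something new was seen.

```r
# Toy stand-in for final_file: two games, with the iterations at which
# new units were first observed (values are made up for illustration).
final_file <- data.frame(
  game_i      = c(1, 1, 1, 2, 2),
  iteration_i = c(1, 2, 47, 1, 45),
  Count       = c(10, 9, 1, 10, 1)
)

# The last iteration in which a new unit appeared is the number of
# iterations that game needed.
total_summary_file <- aggregate(iteration_i ~ game_i, data = final_file, FUN = max)
names(total_summary_file) <- c("game_id", "iterations_required")
total_summary_file$game_id <- paste0("game_", total_summary_file$game_id)
total_summary_file
```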
Thanks for your help everyone!
I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, the data show what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
tibble(unique_person_id = sample(x = c(1:100),
size = sample_size,
replace = TRUE),
tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
size = sample_size,
replace = TRUE)
)
I then want to write some code that randomly assigns people to one of two groups, treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so that I can assign each person to be either treated or not; a person should not be both treated and not treated. However, tv_show appears many times for each person. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
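For comparison, here is a base R sketch of the same person-level assignment that does not hard-code the number of people. The data are regenerated with data.frame instead of tibble so the block runs on its own; the names persons and groups_per_person are hypothetical.

```r
set.seed(853)
sample_size <- 10000
viewing_data <- data.frame(
  unique_person_id = sample(1:100, size = sample_size, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = sample_size, replace = TRUE)
)

# One row per person, one random assignment per person.
persons <- data.frame(unique_person_id = unique(viewing_data$unique_person_id))
persons$treated <- sample(c("yes", "no"), size = nrow(persons), replace = TRUE)

# Copy each person's assignment onto every one of their viewing rows.
viewing_data$treated <- persons$treated[
  match(viewing_data$unique_person_id, persons$unique_person_id)
]

# Check: no person appears in both groups.
groups_per_person <- tapply(viewing_data$treated, viewing_data$unique_person_id,
                            function(g) length(unique(g)))
all(groups_per_person == 1)
```

Sampling once per person and then looking the result up with match() is what guarantees that all of a person's rows share one group.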
You can do the following, which groups your observations by person id and assigns a single "treated"/"control" label per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results, all of the ids have only 1 group of treated / control:
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
I have a categorical variable B with 3 levels (1, 2, 3) and another variable A with some values. Sample data is as follows:
A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3
. .
. .
. .
. .
For every level of B (say, 1), I need to calculate (A - min(A)) / (max(A) - min(A)), and then do the same for the other levels (2 and 3).
Solution using dplyr:
set.seed(1)
df=data.frame(A=round(rnorm(21,50,10)),B=rep(1:3,each=7))
library(dplyr)
df %>% group_by(B) %>% mutate(C= (A-min(A))/(max(A)-min(A)))
The output looks like this:
# A tibble: 21 x 3
# Groups: B [3]
A B C
<dbl> <int> <dbl>
1 44 1 0.0833
2 52 1 0.417
3 42 1 0
4 66 1 1
5 53 1 0.458
6 42 1 0
7 55 1 0.542
8 57 2 0.784
9 56 2 0.757
10 47 2 0.514
# ... with 11 more rows
You could use the tapply function:
x = read.table(text="A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3", header = TRUE)
y = tapply(x$A, x$B, function(z) (z - min(z)) / (max(z) - min(z)))
# Or using the scale() function
#y = tapply(x$A, x$B, function(z) scale(z, min(z), max(z) - min(z)))
cbind(x, unlist(y))
Not exactly sure how you want the output, but this should be a decent starting point.
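Note that cbind(x, unlist(y)) lines up with the rows only because the data are already sorted by B. A base R sketch that keeps the original row order regardless of sorting uses ave(); the helper name min_max and the column name scaled are hypothetical.

```r
x <- data.frame(A = c(22, 23, 12, 34, 43, 47, 49, 65, 68, 70, 75, 82, 120),
                B = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3))

# Hypothetical helper: min-max scale a vector to [0, 1].
# A group whose values are all equal would give 0/0 (NaN).
min_max <- function(z) (z - min(z)) / (max(z) - min(z))

# ave() applies the function within each level of B and returns the
# results in the original row order.
x$scaled <- ave(x$A, x$B, FUN = min_max)
x
```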
I basically want to do the opposite of ddply(df, columns.to.preserve, numcolwise(FUNCTION)).
Suppose I have
d <- data.frame(
count=c(2,1,3),
summed.value=c(50,20,30),
averaged.value=c(35,80,20)
)
count summed.value averaged.value
1 2 50 35
2 1 20 80
3 3 30 20
I want to do a row expansion of this data.frame based on the count column while specifying what kind of operation I want to apply to the other columns.
Here is the kind of result I'm looking for:
> d2
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20
Are there any built-in functions within dplyr or other packages that do this kind of operation?
Edit: This is different from the De-aggregate / reverse-summarise / expand a dataset in R question because I want to go further and actually apply different functions to columns within the table I wish to expand. There are also more useful answers on this post.
Using dplyr and tidyr, you can do a rowwise transformation of summed.value that produces a list for each cell; unnesting that column then gives you what you need:
library(dplyr); library(tidyr)
d %>% rowwise() %>% summarise(summed.value = list(rep(summed.value/count, count)),
averaged.value = averaged.value, count = 1) %>% unnest()
# Source: local data frame [6 x 3]
# averaged.value count summed.value
# <dbl> <dbl> <dbl>
# 1 35 1 25
# 2 35 1 25
# 3 80 1 20
# 4 20 1 10
# 5 20 1 10
# 6 20 1 10
Another way is to use data.table, where you can specify the row number as group variable, and the data table will automatically expand it:
library(data.table)
setDT(d)
d[, .(summed.value = rep(summed.value/count, count), averaged.value, count = 1),
.(1:nrow(d))][, nrow := NULL][]
# summed.value averaged.value count
#1: 25 35 1
#2: 25 35 1
#3: 20 80 1
#4: 10 20 1
#5: 10 20 1
#6: 10 20 1
There is a function untable in package reshape for getting the inverse of a table. Then divide the variables that need dividing by count via mutate_at (or mutate_each). mutate_at was introduced in dplyr_0.5.0.
First the untable:
library(reshape)
untable(d, num = d$count)
count summed.value averaged.value
1 2 50 35
1.1 2 50 35
2 1 20 80
3 3 30 20
3.1 3 30 20
3.2 3 30 20
Then the mutate_at for dividing summed.value and count by count:
library(dplyr)
untable(d, num = d$count) %>%
mutate_at(vars(summed.value, count), funs(./count))
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20
Here's a simple and fully vectorized base R approach:
transform(d[rep(1:nrow(d), d$count), ],
count = 1,
summed.value = summed.value/count)
# count summed.value averaged.value
# 1 1 25 35
# 1.1 1 25 35
# 2 1 20 80
# 3 1 10 20
# 3.1 1 10 20
# 3.2 1 10 20
Or similarly, using data.table
library(data.table)
res <- setDT(d)[rep(1:.N, count)][, `:=`(count = 1, summed.value = summed.value / count)]
res
# count summed.value averaged.value
# 1: 1 25 35
# 2: 1 25 35
# 3: 1 20 80
# 4: 1 10 20
# 5: 1 10 20
# 6: 1 10 20
A base R solution: It tries to replicate each row by the value of the count column and then divide count and summed.value columns by count.
mytext <- 'count,summed.value,averaged.value
2,50,35
1,20,80
3,30,20'
mydf <- read.table(text=mytext,header=T,sep = ",")
mydf <- do.call(rbind,apply(mydf, 1, function(x) {
tempdf <- t(replicate(x[1],x,simplify = T))
tempdf[,1] <- tempdf[,1]/x[1]
tempdf[,2] <- tempdf[,2]/x[1]
return(data.frame(tempdf))
}))
count summed.value averaged.value
1 25 35
1 25 35
1 20 80
1 10 20
1 10 20
1 10 20
I have a data.frame that I am using to set parameters for simulations.
states_grid <- expand.grid(years = c(1:47), start_pct = c(0:99), sim_num = c(1:50))
The above code creates all the states that I would like to simulate. My issue becomes creating a data.frame to hold the outputs. What I would like to do is to create a larger data frame in which we add in an ob_num variable. The ob_num variable will run from 1 to the number of years indicated in column 1.
For example:
years start_pct sim_num ob_num
1: 2 99 1 1
2: 2 99 1 2
3: 3 99 1 1
4: 3 99 1 2
5: 3 99 1 3
6: 4 99 1 1
7: 4 99 1 2
8: 4 99 1 3
9: 4 99 1 4
However I can't think of an efficient way to create this data frame.
Thoughts?
Edit: I tried the below suggestion but that didn't seem to do it.
The below code returns a data.table of the same size (235,000) rows.
states_grid <- expand.grid(years = c(1:(year_max - year_min + 1)),
start_pct = c(0:99),
sim_num = c(1:50))
states_grid <- data.table(states_grid)
setDT(states_grid)[, ob_num := 1:.N, by = years][]
I also tried:
states_grid <- setDT(states_grid)[, ob_num := 1:.N, by = years][]
Both methods return 235K rows.
CJ(years = c(1:47), start_pct = c(0:99), sim_num = c(1:50))[,
.(ob_num = seq_len(years)), by = .(years, start_pct, sim_num)]
# years start_pct sim_num ob_num
# 1: 1 0 1 1
# 2: 1 0 2 1
# 3: 1 0 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# ---
#5639996: 47 99 50 43
#5639997: 47 99 50 44
#5639998: 47 99 50 45
#5639999: 47 99 50 46
#5640000: 47 99 50 47
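The := attempts in the question add a column without changing the number of rows, which is why they still return 235,000 rows; the expansion needs each row repeated years times. A base R sketch of that expansion, shown on a deliberately small grid so the result is easy to inspect (the name expanded is hypothetical):

```r
# Small grid for illustration; the grid in the question has 235,000 rows.
states_grid <- expand.grid(years = 1:3, start_pct = 0:1, sim_num = 1:2)

# Repeat each row `years` times, then number the copies 1..years.
expanded <- states_grid[rep(seq_len(nrow(states_grid)), states_grid$years), ]
expanded$ob_num <- sequence(states_grid$years)
rownames(expanded) <- NULL

nrow(expanded)  # equals sum(states_grid$years)
head(expanded)
```

On the full grid the same two lines give sum(states_grid$years) rows, which matches the 5,640,000 rows in the data.table output above.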