generate a new dataset based on individual information - r

I have a dataset like this:
df <- data.frame(ID=1:10, baseline = c(1.8,2.4,3.2,2.3,2.1,2.2,3,2.8,2,2.9))
I want to create a new column called "response", computed from the following equation:
individual response=individual baseline+0.5*sin(2*3.14*(t-7.5)/24)
In this equation, t is taken from this vector:
t=rep(seq(0,24,by=0.1))
So for each ID, 241 responses should be generated. How can I generate the new dataset containing ID, baseline, time, and response?

Another approach:
t <- rep(seq(0, 24, by = 0.1), each = nrow(df))
vals <- 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
new_df <- cbind(df, t, response = df$baseline + vals)

Try
library(reshape2)
res <- melt(apply(df[,2, drop=FALSE], 1,
function(x) x+0.5*sin(2*3.14*(t-7.5)/24)))
indx <- rep(1:nrow(df), each=241)
df1 <- cbind(df[indx,], time= rep(t, nrow(df)), response=res[,3])
row.names(df1) <- NULL
dim(df1)
#[1] 2410 4
head(df1,3)
# ID baseline time response
#1 1 1.8 0.0 1.337870
#2 1 1.8 0.1 1.333034
#3 1 1.8 0.2 1.328518
Or
t <- seq(0,24, by=0.1)
indx <- rep(1:nrow(df), each=length(t))
df2 <- within(df[indx,], {response<-baseline+0.5*sin(2*3.14*(t-7.5)/24)
time <- t})
row.names(df2) <- NULL
all.equal(df1, df2)
#[1] TRUE
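As a further base R variant of the answers above, here is a minimal sketch that builds every ID/time combination with a cross join and then applies the formula from the question. It relies on merge() returning the Cartesian product when by = NULL. The names time and long are illustrative only, and 3.14 is kept as in the question (pi would be more precise).

```r
df <- data.frame(ID = 1:10,
                 baseline = c(1.8, 2.4, 3.2, 2.3, 2.1, 2.2, 3, 2.8, 2, 2.9))
time <- seq(0, 24, by = 0.1)  # 241 time points

# Cross join: every row of df paired with every time point
long <- merge(df, data.frame(time = time), by = NULL)

# Apply the equation from the question row by row (vectorized)
long$response <- long$baseline + 0.5 * sin(2 * 3.14 * (long$time - 7.5) / 24)

# Sort by ID, then time, so each ID's 241 rows are contiguous
long <- long[order(long$ID, long$time), ]
row.names(long) <- NULL
```

The result has one row per ID and time point, with the columns ID, baseline, time, and response.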

Related

Replace integers in a data frame column with other integers in R?

I want to replace a vector in a dataframe that contains only 4 numbers to specific numbers as shown below
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 = 10; 2 = 200, 3 = 458, 4 = -0.1
You could use recode from dplyr. Note that the old values are written as quoted names. The new values are doubles, like the original column (c(1, 2, 3, 4) creates a double vector, not an integer one):
library(tidyverse)
df %>%
mutate(tt = recode(tt, '1'= 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and provide for a shorter example we use the input in the Note at the end. Here are several alternatives. nos defined in (1) is used in some of the others too. No packages are used.
1) indexing To get the result since the input is 1 to 4 we can use indexing. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
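One more base R option, sketched here under the assumption that the mapping is easiest to read as a named lookup vector: the names hold the old values and the elements hold the new ones. The second half shows the match() idea from (1a) with old values that are not 1:4. The names lookup, old_vals, new_vals, and x are made up for illustration.

```r
# Short example input, as in the Note above
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))

# Named lookup vector: names are the old values, elements are the new ones
lookup <- c("1" = 10, "2" = 200, "3" = 458, "4" = -0.1)

# Index by name; unname() drops the names that indexing carries over
df$tt_new <- unname(lookup[as.character(df$tt)])

# Same idea with match(), for old values that are not 1:4
# (old_vals, new_vals and x are made-up illustration data)
old_vals <- c(5, 7, 9)
new_vals <- c(0.5, 70, -9)
x <- c(7, 5, 9, 7)
recoded <- new_vals[match(x, old_vals)]
```

Values that do not appear in the lookup come back as NA in both variants, which makes unmapped inputs easy to spot.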

Thousand separator to numeric columns in R

I am trying to format numbers as shown (adding a thousand separator). The function works fine, but after formatting, the numeric columns no longer sort numerically because they now contain characters.
df <- data.frame(x = c(12345,35666,345,5646575))
format_numbers <- function (df, column_name){
df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5, paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df,"x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers sort in descending/ascending order after formatting?
Note: this data frame df is to be incorporated in a DT table.
The problem is the formatting part. If you do it correctly, i.e. while keeping your data numeric, then everything else will fall into place. Here I will demonstrate using an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))
format.my_numbers <- function(x,..., d = 1, L = c('', 'K', 'M', 'B', 'T')){
ifelse(abs(x) >= 1000, Recall(x/1000, d = d + 1),
sprintf('%.1f%s', x, L[d]))
}
print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)
'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x, as it is numeric.
For example, cbinding it with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
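If an S3 class is more than you need, a simpler sketch of the same principle is to keep the numeric column for sorting and add a separate character column for display. The helper abbreviate_number and the column x_label are hypothetical names introduced for this illustration. The suffix is chosen from the order of magnitude, so values just below a boundary (e.g. 999999) may display as "1000.0K".

```r
# Hypothetical helper: format numbers with K/M/B/T suffixes, one decimal place
abbreviate_number <- function(x, digits = 1) {
  suffixes <- c("", "K", "M", "B", "T")
  # Power of 1000 for each value: 0 below 1e3, 1 for thousands, 2 for millions...
  idx <- pmin(floor(log10(pmax(abs(x), 1)) / 3), length(suffixes) - 1)
  sprintf(paste0("%.", digits, "f%s"), x / 1000^idx, suffixes[idx + 1])
}

df <- data.frame(x = c(12345, 35666, 345, 5646575))
df$x_label <- abbreviate_number(df$x)  # character column, for display only

# Sort on the untouched numeric column, largest first
sorted <- df[order(df$x, decreasing = TRUE), ]
```

Because x itself is never converted to character, ordering and arithmetic keep working, and x_label simply follows its row.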

Efficient Montecarlo simulation over a grid in R

I am running a Monte Carlo simulation of a multinomial logit. I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, changing both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but I ended up with a nested for-loop over a grid of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am quite worried about the efficiency of this last bit of code, with its double for-loop running over the combinations of the possible values. There is probably a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I end up with multiple objects, with informative names, that contain the models from the combinations of the grid search. It would be great if I could collect all of them in a single list, but with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can replace the nested loops with the map2 function of the purrr package, which iterates over two vectors in parallel:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
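result <- map2(v1, v2, function(v1, v2) {
  # mlogit_sim_data() reads n.indiv and n.choices from the global environment,
  # so set them there before each batch of replications
  n.indiv <<- v1
  n.choices <<- v2
  lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will be a named list of your function outputs, one element per combination.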
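A base R sketch of the same grid iteration, assuming the simulation function is rewritten to take n.indiv and n.choices as arguments instead of reading them from the global environment. Here sim_one is a hypothetical stand-in that only draws random numbers; it is not the mlogit model from the question. expand.grid() generates all pairs, and Map() walks over them in parallel.

```r
# Hypothetical stand-in for mlogit_sim_data(): settings are passed as arguments
sim_one <- function(n.indiv, n.choices) {
  mean(rnorm(n.indiv * n.choices))
}

# All combinations of the two settings (the first column varies fastest)
grid <- expand.grid(n.choices = c(1, 2, 4, 8, 10),
                    n.indiv = c(1, 15, 100, 500))
nreps <- 10
set.seed(777)

# One list element per combination, each holding nreps replications
results <- Map(function(ni, nc) replicate(nreps, sim_one(ni, nc), simplify = FALSE),
               grid$n.indiv, grid$n.choices)
names(results) <- paste0("m_ind_", grid$n.indiv, "_choices_", grid$n.choices)
```

Everything then lives in one named list, so results[["m_ind_15_choices_4"]] retrieves the replications for that combination without any assign() calls.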

stuck on for loop with variable depth

I have a data frame called df2 that has 1501 data points
Depth <- seq(0, 1500, by = 1)
Temp <- rev(seq(1, 10, by = 0.006))
D0 <- 0
Dend <- 1000
r <- 2
days <- 100
D <- rep(NA, days+1)
D <- D0
Temp <- T0
for (time in seq_len(steps)){
if (tail(D,1) >= Dend) break
D[time + 1] <- r + D[time]
Temp[time] <- Temp[time]
}
I can't seem to couple Temp with D. Using this line of code (Temp[time] <- Temp[time]), I get Temp every m for 1500.
One approach is to simplify things a bit with seq, using by= and length.out=.
Then we can use merge to join the results back to df2. It needs to be a data.frame with names to merge onto, so I changed your cbind to data.frame.
Depth <- seq(0, 1500, by = 1)
Temp <- rev(seq(1, 10, by = 0.006))
df2 <- data.frame(Depth, Temp)
D0 <- 0
days <- 107
r <- 40
Result <- data.frame(Day = 0:days,
Depth =seq(from = D0, by= r ,length.out = days + 1))
Result <- merge(Result,df2,all.x=TRUE)
Result
# Depth Day Temp
#1 0 0 10.00
#2 40 1 9.76
#3 80 2 9.52
#4 120 3 9.28
#5 160 4 9.04
#...
By using all.x=TRUE we will get NA when there is no value in df2 for that Depth.
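The same lookup can be sketched with match() instead of merge(), which may make the NA behavior easier to see. The step size of 400 and the name path are made up for this illustration so that the last depth falls outside the 0 to 1500 range covered by df2.

```r
Depth <- seq(0, 1500, by = 1)
Temp <- rev(seq(1, 10, by = 0.006))
df2 <- data.frame(Depth, Temp)

# Illustrative trajectory: 400 depth units per day for 5 days (0, 400, ..., 1600)
path <- data.frame(Day = 0:4,
                   Depth = seq(from = 0, by = 400, length.out = 5))

# match() returns the row of df2 for each depth, or NA when the depth is absent
path$Temp <- df2$Temp[match(path$Depth, df2$Depth)]
```

The final row (depth 1600) gets NA for Temp, just as merge(..., all.x = TRUE) would produce for a depth missing from df2.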

how to combine the results of apply iteratively

I want to multiply each list element in, say, l1 by b1's col1 and store the result in a separate column. Basically, this is what I want to do:
res = 0
for item in a
for col_item in b
res = res + item * col_item
E.g.
l1 = list(c('17-Nov-14', 10), c('17-Apr-15', 20))
b1 = data.frame(col1 = c(10, 20), res=c(0))
result = data.frame(col1= c(10, 20), res = c(2*10+4*10+3*10, 2*20+4*20+3*20))
I have a working code but can be improved.
test <- function(param, df) {
df$res <- as.integer(param[2]) * df$col1
df
}
t <- lapply(l1, test, b1)
result <- cbind(t[[1]]$col1, t[[1]]$res + t[[2]]$res)
We can simplify the computation with a little algebra. If we factor out the element of b1$col1, then we can precompute the sum of the list and perform a vectorized multiplication against it:
b1$res <- sum(unlist(l1))*b1$col1;
b1;
## col1 res
## 1 10 90
## 2 20 180
For your new problem definition, we need to extract the required element out of each list component vector:
b1$res <- sum(as.integer(sapply(l1,`[`,2L)))*b1$col1;
b1;
## col1 res
## 1 10 300
## 2 20 600
If you are looking for a method to reduce your list after lapply, you can use the Reduce function:
Reduce(function(df1, df2) data.frame(col = df1[1], res = df1[2] + df2[2]), myList)
# col1 res
# 1 10 90
# 2 20 180
Suppose myList <- lapply(...).
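To make the Reduce step concrete, here is a minimal end-to-end sketch using the two-element l1 and the test function from the question. The accumulator function and the names acc, nxt, and combined are illustrative choices, not part of the original answers.

```r
l1 <- list(c("17-Nov-14", 10), c("17-Apr-15", 20))
b1 <- data.frame(col1 = c(10, 20))

# From the question: multiply col1 by the second element of one list entry
test <- function(param, df) {
  df$res <- as.integer(param[2]) * df$col1
  df
}

myList <- lapply(l1, test, b1)  # one data frame per list entry

# Fold the list pairwise: keep col1, add up the res columns
combined <- Reduce(function(acc, nxt) {
  acc$res <- acc$res + nxt$res
  acc
}, myList)
```

This works for a list of any length, so there is no need to write out t[[1]]$res + t[[2]]$res + ... by hand.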
