Stuck on a for loop with variable depth in R

I have a data frame called df2 that has 1501 data points
Depth <- seq(0, 1500, by = 1)
Temp <- rev(seq(1, 10, by = 0.006))
D0 <- 0
Dend <- 1000
r <- 2
days <- 100
D <- rep(NA, days+1)
D <- D0
Temp <- T0
for (time in seq_len(steps)) {
  if (tail(D, 1) >= Dend) break
  D[time + 1] <- r + D[time]
  Temp[time] <- Temp[time]
}
I can't seem to couple Temp with D: the line Temp[time] <- Temp[time] is a no-op, so I still get a Temp value for every metre down to 1500 instead of the temperature at each time step's depth.

One approach is to simplify things a bit with seq() using by= and length.out=.
Then we can use merge to join the results back to df2. It needs to be a data.frame with names to merge onto, so I changed your cbind to data.frame.
Depth <- seq(0, 1500, by = 1)
Temp <- rev(seq(1, 10, by = 0.006))
df2 <- data.frame(Depth, Temp)
D0 <- 0
days <- 107
r <- 40
Result <- data.frame(Day = 0:days,
                     Depth = seq(from = D0, by = r, length.out = days + 1))
Result <- merge(Result,df2,all.x=TRUE)
Result
#   Depth Day  Temp
# 1     0   0 10.00
# 2    40   1  9.76
# 3    80   2  9.52
# 4   120   3  9.28
# 5   160   4  9.04
# ...
By using all.x=TRUE we will get NA when there is no value in df2 for that Depth.
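If you still want the run to stop at Dend, as the break in the original loop did, one option (a minimal sketch) is to trim the merged result afterwards:
Dend <- 1000
Result <- Result[Result$Depth <= Dend, ]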

Related

Nested for loops using row lags and multipliers defined in loop

The end goal is to take a data frame and create new columns based on multiplication and addition of prior rows: if my multipliers are 0.1, 0.2, and 0.3, each new value is z + [lag(new z) * multiplier], i.e. the previous transformed value feeds back in. I want to take column Z and transform it 3 times as such (leaving the first row unchanged):
z <- 1:4*10
df <- data.frame(z)
 Z   Z_0.1   Z_0.2   Z_0.3
10      10      10      10
20      21      22      23
30    32.1    34.4    36.9
40   43.21   46.88   51.07
I have been able to get the correct values by manually feeding in the rate and overwriting the existing column:
for (i in 1:nrow(df)) {
  if (i == 1)
    df[i, 1] <- df[i, 1]
  else
    df[i, 1] <- df[i, 1] + (df[i - 1, 1] * 0.1)
}
Separately, I can also create column placeholders for the new values:
for (i in seq(0.1, 0.3, by = 0.1)) {
  cola <- paste('col', i, sep = "_")
  df[[cola]] <- 0
}
However, I cannot seem to combine these loops and get the outcome in the above sample table. I have tried this:
for (i in 1:nrow(df)) {
  for (j in seq(0.1, 0.3, by = 0.1)) {
    cola <- paste('col', j, sep = "_")
    df[[cola]] <- 0
    if (i == 1)
      df[[cola]] <- df[i, 1]
    else
      df[[cola]] <- df[i, 1] + (df[i - 1, 1] * j)
  }
}
But it fills every row of each new column with the same value:
 Z   Z_0.1   Z_0.2   Z_0.3
10   77.02   81.85   86.68
20   77.02   81.85   86.68
30   77.02   81.85   86.68
40   77.02   81.85   86.68
Appreciate any suggestions. I'm not married to for loops if anyone has an alternative suggestion.
Like this maybe?
Z <- 1:4 * 10
y <- seq(0.1, 0.3, by = 0.1)
df <- data.frame(Z)
for (i in 1:(length(Z) - 1) + 1) {  # i.e. rows 2:length(Z)
  for (j in seq_along(y)) {
    df[1, paste0('Z_', y[j])] <- Z[1]
    df[i, paste0('Z_', y[j])] <- Z[i] + (df[i - 1, paste0('Z_', y[j])] * y[j])
  }
}
df
#> Z Z_0.1 Z_0.2 Z_0.3
#> 1 10 10.00 10.00 10.00
#> 2 20 21.00 22.00 23.00
#> 3 30 32.10 34.40 36.90
#> 4 40 43.21 46.88 51.07
Created on 2022-09-09 by the reprex package (v2.0.1)
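A loop-free alternative, as a sketch: the recursion new[i] = Z[i] + rate * new[i - 1] is a cumulative scan, so base R's Reduce() with accumulate = TRUE can build each column in one call.
Z <- 1:4 * 10
rates <- seq(0.1, 0.3, by = 0.1)
df <- data.frame(Z)
for (r in rates) {
  # Reduce() starts from Z[1] and feeds each result back in as `prev`
  df[[paste0("Z_", r)]] <- Reduce(function(prev, z) z + r * prev, Z,
                                  accumulate = TRUE)
}
df  # same values as the table above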

Replace integers in a data frame column with other integers in R?

I want to replace the values of a column in a data frame that contains only 4 distinct numbers with other specific numbers, as shown below.
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character names. Because one of the replacements (-0.1) is a double, the resulting column is double:
library(tidyverse)
df %>%
  mutate(tt = recode(tt, '1' = 10, '2' = 200, '3' = 458, '4' = -0.1))
     tt
1  10.0
2  10.0
3 200.0
4 200.0
5 458.0
6 458.0
7  -0.1
8  -0.1
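In recent dplyr (1.1.0 and later), recode() is superseded by case_match(), which expresses the same mapping; a sketch assuming dplyr >= 1.1.0:
df %>%
  mutate(tt = case_match(tt, 1 ~ 10, 2 ~ 200, 3 ~ 458, 4 ~ -0.1))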
To correct the error in the code in the question and to provide a shorter example, we use the input in the Note at the end. Here are several alternatives; nos, defined in (1), is used in some of the others too. No packages are used.
1) indexing Since the original values of tt are exactly the integers 1 to 4, we can use them directly as an index into nos. This is probably the simplest solution.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
                   200 * (tt == 2) +
                   458 * (tt == 3) +
                   -0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
                   factor(levels = 1:4, labels = nos) |>
                   format() |>
                   as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))

Efficient Monte Carlo simulation over a grid in R

I am running a Monte Carlo simulation of a multinomial logit, so I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, in particular varying both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but at some point I ended up with a nested for-loop structure over a grid of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am worried about the efficiency of this double for-loop running over all the combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I end up with multiple objects containing the models from the grid-search combinations, with informative names; it would be great to collapse them all into a single list, but with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
# function that generates the data and estimates the model.
mlogit_sim_data <- function(...) {
  # generating number of (n.alter) X (n.choices)
  df <- data.frame(id = rep(seq(1, n.choices), n.alter))
  # id per individual
  df <- df %>%
    group_by(id) %>%
    mutate(altern = sequence(n())) %>%
    arrange(id)
  # Repeated scheme for each individual + id_ind
  df <- cbind(df[rep(1:nrow(df), n.indiv), ],
              id_ind = rep(1:n.indiv, each = nrow(df)))
  ## creating attributes
  df <- df %>%
    mutate(
      x1 = rlnorm(n.indiv * n.alter),
      x2 = rlnorm(n.indiv * n.alter)
    ) %>%
    group_by(altern) %>%
    mutate(id_choice = sequence(n())) %>%
    group_by(id_ind) %>%
    mutate(
      z1 = rpois(1, lambda = 25),
      z2 = rlnorm(1, meanlog = 5, sdlog = 0.5),
      z3 = ifelse(runif(1, min = 0, max = 1) > 0.5, 1, 0)
    )
  # Observed utility
  df$V1 <- with(df, b1 * x1 + b2 * x2)
  #### Generate Response Variable ####
  fn_choice_generator <- function(V) {
    U <- V + rgumbel(length(V), 0, 1)
    1L * (U == max(U))
  }
  # Using fn_choice_generator to generate 'choice' columns
  df <- df %>%
    group_by(id_choice) %>%
    mutate(across(starts_with("V"),
                  fn_choice_generator, .names = "choice_{.col}")) %>%  # generating choice(s)
    select(-starts_with("V")) %>%  ## drop V variables
    select(-c(id, id_ind))
  tryCatch(
    {
      model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 | 1,
                             data = df,
                             idx = c("id_choice", "altern"))
      return(model_result)
    },
    error = function(e) {
      return(NA)
    }
  )
}
2) Grid search over possible combinations of the data
# List with the values that vary in the simulation
# number of individuals
n.indiv_list <- c(1, 15, 100, 500)
# number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remain constant across simulations
# set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
# iteration over different values in the simulation
for (i in n.indiv_list) {
  for (j in n.choices_list) {
    n.indiv <- i
    n.choices <- j
    assign(paste0("m_ind_", i, "_choices_", j),
           lapply(X = 1:nreps, FUN = mlogit_sim_data))
  }
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(ni, nc) {
  # mlogit_sim_data() reads n.indiv and n.choices from the global
  # environment, so set them before running the nreps replications;
  # assign() inside the anonymous function would only create a local
  # object, so instead we return the list and name the elements afterwards
  n.indiv <<- ni
  n.choices <<- nc
  lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will be a named list of your function outputs, with the informative names you wanted.
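If you prefer to stay in base R, expand.grid() can build the same parameter grid and Map() can iterate over its rows; a sketch under the same assumptions about the globals:
grid <- expand.grid(n.indiv = n.indiv_list, n.choices = n.choices_list)
result <- Map(function(ni, nc) {
  n.indiv <<- ni    # the simulation function reads these globals
  n.choices <<- nc
  lapply(1:nreps, mlogit_sim_data)
}, grid$n.indiv, grid$n.choices)
names(result) <- paste0("m_ind_", grid$n.indiv, "_choices_", grid$n.choices)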

Alternative nested for loops for counting value occurrence in R dataframe

I am working on a large dataset, and I want to count how many times each pair of values in two columns occurs together. Here is an example of the dataset:
id = rep(replicate(4, paste(sample(LETTERS, 3, replace=F), collapse="")), 12500)
names = rep(replicate(3125, paste(sample(letters, 5, replace=T), collapse="")), 16)
times = sample(c(3,6,24), 50000, replace = T)
df = data.frame(id=id, names=names, times=times)
count <- list()
ids <- as.vector(unique(df$id))
nms <- as.vector(unique(df$names))
for (i in 1:length(ids)) {
  vec <- c()
  for (j in 1:length(nms)) {
    vec[j] <- nrow(df[df$id == ids[i] & df$names == nms[j], ])
  }
  count[[i]] <- vec
}
My real data have about 50000 x 10 dimensions, and the id and names values are scattered randomly. Can anyone suggest a better way to handle this? My approach works but is too slow. dplyr or plyr methods?
Thanks,
EDIT:
short version of my dataframe:
id = rep(replicate(3, paste(sample(LETTERS, 3, replace=F), collapse="")), 5)
names = rep(replicate(3, paste(sample(letters, 5, replace=T), collapse="")), 5)
times = sample(c(3,6,24), 15, replace = T)
df = data.frame(id=id, names=names, times=times)
df
id names times
1 DEW xxsre 24
2 QHY xkbhr 24
3 DQE tuyfk 6
4 DEW xxsre 24
5 QHY xkbhr 24
6 DQE tuyfk 3
7 DEW xxsre 3
8 QHY xkbhr 24
9 DQE tuyfk 3
10 DEW xxsre 24
11 QHY xkbhr 24
12 DQE tuyfk 3
13 DEW xxsre 24
14 QHY xkbhr 3
15 DQE tuyfk 3
output:
> count
[[1]]
[1] 5 0 0
[[2]]
[1] 0 5 0
[[3]]
[1] 0 0 5
Each list item corresponds to one id, and each vector vec holds the counts per names value; in other words, they follow the order of as.vector(unique(df$id)) and as.vector(unique(df$names)) respectively.
You can use data.table, which is likely the fastest solution:
library(data.table)
# convert your dataset into a data.table
setDT(df)
output <- df[, .N, by = .(id, names)]
head(output)
> id names N
> 1: FYG vlrcd 4
> 2: FAL mjhhs 4
> 3: BZU rfnvc 4
> 4: HJA zhssf 4
> 5: FYG pxtne 4
> 6: FAL qgeqr 4
If you want the output to be a list, you can convert the output in different ways:
L1 <- as.list(as.data.frame(t(output))) # or
L2 <- split(output, list(output$id, output$names)) # or
L3 <- split(output, seq(nrow(output)))
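For completeness, base R's table() produces the same counts as an id-by-names contingency matrix in one call; a minimal sketch reusing the ids and nms vectors from the question:
counts <- table(df$id, df$names)
# rows are ids, columns are names; to mimic the question's list of vectors:
count_list <- lapply(ids, function(i) as.integer(counts[i, nms]))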
Does this do what you want?
library(dplyr)
count <- df %>%
  group_by(id, names) %>%
  summarise(n = n())
count
Without using plyr or dplyr you can reduce computing time by about 25%.
To keep the computing time reasonable, I subset the first 1000 rows of your data.
library(microbenchmark)
id = rep(replicate(4, paste(sample(LETTERS, 3, replace=F), collapse="")), 12500)
names = rep(replicate(3125, paste(sample(letters, 5, replace=T), collapse="")), 16)
times = sample(c(3,6,24), 50000, replace = T)
df = data.frame(id=id, names=names, times=times)
df = df[1:1000,]
ids <- as.vector(unique(df$id))
nms <- as.vector(unique(df$names))
Then I define 3 functions: default, summation, and summation + preallocation.
default <- function(ids, nms, df) {
  count <- list()
  for (i in 1:length(ids)) {
    vec <- c()
    for (j in 1:length(nms)) {
      vec[j] <- nrow(df[df$id == ids[i] & df$names == nms[j], ])
    }
    count[[i]] <- vec
  }
  count
}
summation <- function(ids, nms, df) {
  count <- list()
  for (i in 1:length(ids)) {
    vec <- c()
    for (j in 1:length(nms)) {
      vec[j] <- sum(df$id == ids[i] & df$names == nms[j])
    }
    count[[i]] <- vec
  }
  count
}
summation_and_preallocation <- function(ids, nms, df) {
  count <- list()
  for (i in 1:length(ids)) {
    vec <- integer(length = length(nms))
    for (j in 1:length(nms)) {
      vec[j] <- sum(df$id == ids[i] & df$names == nms[j])
    }
    count[[i]] <- vec
  }
  count
}
Tests with microbenchmark show:
m <- microbenchmark(default(ids, nms, df),
                    summation(ids, nms, df),
                    summation_and_preallocation(ids, nms, df),
                    times = 10)
Unit: milliseconds
                                       expr      min        lq      mean    median        uq       max neval
                      default(ids, nms, df) 994.5040 1012.1560 1040.7012 1042.5689 1072.4689 1074.8893    10
                    summation(ids, nms, df) 735.0831  740.6620  741.2254  742.1361  742.9321  743.7806    10
 summation_and_preallocation(ids, nms, df) 729.1192  733.0536  753.8661  736.8319  791.5001  804.2335    10
How does it compare with the dplyr solution from @Adrian?
dplyr_count(ids, nms, df) 3.154741 3.206819 49.06034 3.275624 3.701375 457.943 10
So about 200 times faster for dplyr!
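dplyr_count is not defined in the excerpt; presumably it wraps @Adrian's solution with the same signature as the other benchmarked functions, something like this sketch:
dplyr_count <- function(ids, nms, df) {
  # ids and nms are unused; kept only to match the benchmark call
  df %>%
    group_by(id, names) %>%
    summarise(n = n(), .groups = "drop")
}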

Generate a new dataset based on individual information

I have a dataset like this:
df <- data.frame(ID=1:10, baseline = c(1.8,2.4,3.2,2.3,2.1,2.2,3,2.8,2,2.9))
I want to create a new column called "response"; it should be computed from the following equation, where each individual uses their own baseline:
response = baseline + 0.5*sin(2*3.14*(t-7.5)/24)
In this equation, t is generated from this vector:
t=rep(seq(0,24,by=0.1))
so for each ID, there should be 241 responses generated. How could I generate the new dataset containing ID, baseline, time, and response?
Another approach:
t <- rep(seq(0, 24, by = 0.1), each = nrow(df))
vals <- 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
new_df <- cbind(df, t, response = df$baseline + vals)
Try
library(reshape2)
res <- melt(apply(df[, 2, drop = FALSE], 1,
                  function(x) x + 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)))
indx <- rep(1:nrow(df), each = 241)
df1 <- cbind(df[indx, ], time = rep(t, nrow(df)), response = res[, 3])
row.names(df1) <- NULL
dim(df1)
# [1] 2410    4
head(df1, 3)
#   ID baseline time response
# 1  1      1.8  0.0 1.337870
# 2  1      1.8  0.1 1.333034
# 3  1      1.8  0.2 1.328518
Or
t <- seq(0, 24, by = 0.1)
indx <- rep(1:nrow(df), each = length(t))
df2 <- within(df[indx, ], {
  response <- baseline + 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
  time <- t
})
row.names(df2) <- NULL
all.equal(df1, df2)
# [1] TRUE
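Another compact base-R sketch: expand.grid() builds every (ID, time) pair, and merge() attaches each ID's baseline before computing the response.
grid <- expand.grid(ID = df$ID, time = seq(0, 24, by = 0.1))
df3 <- merge(grid, df, by = "ID")
df3$response <- df3$baseline + 0.5 * sin(2 * 3.14 * (df3$time - 7.5) / 24)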
