Hello everyone, I have two data frames and am trying to do bootstrapping with Script1 below. In Script1 I take the number of rows from data frames one and two. Instead of taking the row count from the entire data frame, I want to split each column into its own data frame, remove the zero values, then take the row count and do the bootstrapping. In Script2 I create the individual data frames in a for loop, but as I am new to R I am a bit confused about how to add the Script1 logic to it efficiently.
Please advise. Below I provide Script1, which runs, and Script2, where I am trying to subset each column into an individual data frame.
Script1
set.seed(2)
m1 <- matrix(sample(c(0, 1:10), 100, replace = TRUE), 10)
m2 <- matrix(sample(c(0, 1:5), 50, replace = TRUE), 5)
m1 <- as.data.frame(m1)
m2 <- as.data.frame(m2)
nboot <- 1e3
n_m1 <- nrow(m1); n_m2 <- nrow(m2)
temp<- c()
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
value <- colSums(m2)/colSums(m1[boot,])
temp <- rbind(temp, value)
}
boot_data<- apply(temp, 2, median)
Script2
for (i in colnames(m1)){
m1_subset=(m1[m1[[i]] > 0, ])
m1_subset=m1_subset[i]
m2_subset=m2[m2[[i]] >0, ]
m2_subset=m2_subset[i]
num_m1 <- nrow(m1_subset); n_m2 <- nrow(m2_subset)# after this wanted add above script changing input
}
If I understand correctly, you want to do the sampling and calculation on each column individually, after removing the 0 values. I modified your code to work on a single vector instead of a data frame (i.e., using length() instead of nrow() and sum() instead of colSums()). I also suggest creating the empty matrix for your results ahead of time and filling it in; it will be faster.
temp <- matrix(nrow = nboot, ncol = ncol(m1))
for (i in seq_along(m1)){
m1_subset = m1[m1[,i] > 0, i]
m2_subset = m2[m2[,i] > 0, i]
n_m1 <- length(m1_subset); n_m2 <- length(m2_subset)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
temp[j, i] <- sum(m2_subset)/sum(m1_subset[boot])
}
}
boot_data <- apply(temp, 2, median)
boot_data <- setNames(data.frame(t(boot_data)), names(m1))
boot_data
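The inner loop above can also be wrapped in a small function and applied column by column. This is only a sketch, not a replacement for the code above: boot_median is a hypothetical helper name, and returning NA for a column that is empty after removing zeros is my own assumption about what you would want in that case.

```r
set.seed(2)
m1 <- as.data.frame(matrix(sample(c(0, 1:10), 100, replace = TRUE), 10))
m2 <- as.data.frame(matrix(sample(c(0, 1:5), 50, replace = TRUE), 5))
nboot <- 1e3

# Hypothetical helper: bootstrap median of sum(x2) / sum(resampled x1)
# for one pair of columns, after dropping the zero values.
boot_median <- function(x1, x2, nboot) {
  x1 <- x1[x1 > 0]
  x2 <- x2[x2 > 0]
  # Assumption: a column with no non-zero values gives NA instead of an error.
  if (length(x1) == 0 || length(x2) == 0) return(NA_real_)
  ratios <- replicate(nboot, {
    idx <- sample.int(length(x1), size = length(x2), replace = TRUE)
    sum(x2) / sum(x1[idx])
  })
  median(ratios)
}

# A data frame is a list of columns, so mapply() pairs column i of m1
# with column i of m2.
boot_data2 <- mapply(boot_median, m1, m2, MoreArgs = list(nboot = nboot))
boot_data2
```

The result is a named numeric vector with one median per column of m1.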
Related
I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-col. data frame and copies it next to each of the other rows, to form a 6-col. data frame (with all possible combinations). This works fine, but with a medium sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
x = c(10,20,30,40),
y = c(15,25,35,45),
ID =c(1,2,3,4)
)
compar <- data.frame(matrix(nrow=0, ncol=6)) # to contain all combinations
names(compar) <- c("x","y", "ID", "x","y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N)
{
for (j in 1:N)
{
Temp1 <- Sdat[i,] # data from 1st point
Temp2 <- Sdat[j,] # data from 2nd point
C <- cbind(Temp1, Temp2)
compar <- rbind(C,compar)
}
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
y = rep(Sdat$y, each = N),
id = rep(Sdat$ID, each = N), # use the ID column; n was undefined
x1 = rep(Sdat$x, N),
y1 = rep(Sdat$y, N),
id_1 = rep(Sdat$ID, N))
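Another option in base R is merge() with by = NULL, which returns the Cartesian product of the two data frames. A minimal sketch on the sample data from the question; compar2 and the "_1"/"_2" suffixes are just names chosen for this example:

```r
Sdat <- data.frame(
  x  = c(10, 20, 30, 40),
  y  = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)

# by = NULL asks merge() for every combination of rows; the suffixes
# keep the two copies of x, y and ID apart.
compar2 <- merge(Sdat, Sdat, by = NULL, suffixes = c("_1", "_2"))

nrow(compar2)   # one row per pair of points
names(compar2)
```

Unlike cbind() on two subsets of Sdat, this gives unique column names, which makes later steps such as compar2$x_1 - compar2$x_2 unambiguous.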
I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_customers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables I'm using contain one column for ages and another for the corresponding survival probabilities.
Output:
I want to create a separate data frame with the same size as the terms table. The function will take the data from the appropriate mortality table (depending on the gender) and then apply psup2, taking the ages from table_customers and the terms from the terms matrix.
Up to now I managed to create a very inefficient way to do this...but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
P1 = 1
for (i in 1:term) {
P <- x[age + i, 2]
P1 <- P1*P
}
return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V3 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
for (j in 1:3) {
prob <- ifelse(table_customers[j, 3] == 1,
psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])),
psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])))
output[j, i] <- prob
}
}
Your psup2 function can be simplified to:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually, we won't use it; we'll use the formula directly.
We'll put your male and female df next to each other, so we can use the value of the gender column to choose one or another.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df,1,function(x){sapply(1:3,function(i){prod(mf[x["age"]+(1:x[3+i]),x["gender"]+1])})}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame
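If the one-liner is hard to read, the same calculation can be spelled out with named pieces. This is a sketch under the question's assumptions (gender 1 = male, 2 = female, and row age + 1 of each mortality table holds the probability for that age); tables, surv and output2 are names invented for the example.

```r
male   <- data.frame(age = 0:5, probability = c(0.9981, 0.9979, 0.9978, 0.994, 0.992, 0.99))
female <- data.frame(age = 0:5, probability = c(0.9983, 0.998, 0.9979, 0.997, 0.9964, 0.995))
terms  <- data.frame(V1 = c(1, 2, 3), V2 = c(1, 3, 4), V3 = c(2, 3, 4))
table_customers <- data.frame(id = 1:3, age = c(0, 0, 0), gender = c(1, 2, 1))

# Position in the list matches the gender code (assumed: 1 = male, 2 = female).
tables <- list(male$probability, female$probability)

# Survival probability for one customer and one term.
# seq_len() is used instead of 1:term so that a term of 0 gives prod() of nothing, i.e. 1.
surv <- function(age, gender, term) {
  prod(tables[[gender]][age + seq_len(term)])
}

# Outer sapply() walks over the columns of terms,
# inner mapply() walks over the customers.
output2 <- sapply(terms, function(term_col) {
  mapply(surv, table_customers$age, table_customers$gender, term_col)
})
output2
```

Rows correspond to customers and columns to the columns of terms, the same layout as the output of the loop in the question.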
I am trying to generate a matrix with the following data. Is there any way to create empty rows to make the matrices the same size?
#Generating original data
n <- c(12,24)
mu <- c(6.573,6.5)
sigma <- sqrt(0.25)
Diseased.Data <- round(rnorm(n[1],mu[1],sigma),4)
Healthy.Data <- round(rnorm(n[2],mu[2],sigma),4)
g <- c(2,3,4)
cstar.pool <- (mu[1]+mu[2])/2
#generating pooled data
for(i in 1:3){
assign(paste("pool.dis.data",i,sep = ""),replicate(n[1]/g[i],mean(sample(Diseased.Data,g[i]))))
assign(paste("pool.hel.data",i,sep = ""),replicate(n[2]/g[i],mean(sample(Healthy.Data,g[i]))))
}
#generating the pooled diseased data matrix
dis.mat1<- matrix(data = pool.dis.data1,length(pool.dis.data1),1)
dis.mat2 <- matrix(data = pool.dis.data2,length(pool.dis.data2),1)
dis.mat3 <- matrix(data = pool.dis.data3,length(pool.dis.data3),1)
dis.mat2 <- rbind(dis.mat2,NA)
dis.mat2 <- rbind(dis.mat2,NA)
dis.mat3 <- rbind(dis.mat3,NA)
dis.mat3 <- rbind(dis.mat3,NA)
dis.mat3 <- rbind(dis.mat3,NA)
dis.matrix <- matrix(NA, max(length(pool.dis.data1),length(pool.dis.data2),length(pool.dis.data3)),3)
dis.matrix[,1] <- cbind(dis.mat1)
dis.matrix[,2] <- cbind(dis.mat2)
dis.matrix[,3] <- cbind(dis.mat3)
I'd say your best bet is to start out with an empty matrix of the size you need. You can tell matrix to specify the dimensions on creation like so:
new <- matrix( data = NA, nrow = 10, ncol = 20 )
So you just need to create a value for each dimension, based on your input data:
num.rows <- max( length(n), length(mu), ... )
num.columns <- [ I'd just enter a numeric value here ]
new <- matrix( data = NA, nrow = num.rows, ncol = num.columns )
Then you can fill the columns as needed, making sure to leave any excess empty. For example:
new[(1:length(n)),3] <- n
The "1:length(n)" part there will tell R to stop filling the column once the values you've given it have been entered. Otherwise R will continue filling, and you'll get repeated values, which I'm guessing you don't want.
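If the pooled vectors are kept in a list rather than in separate variables, the padding can be done in one step. A minimal sketch, assuming stand-in vectors of lengths 6, 4 and 3 (the lengths that 12/2, 12/3 and 12/4 would give in the question); pools and pad_to are names made up for the example, and the values are invented.

```r
# Stand-ins for pool.dis.data1, pool.dis.data2, pool.dis.data3
pools <- list(c(6.4, 6.7, 6.5, 6.6, 6.3, 6.8),
              c(6.5, 6.6, 6.4, 6.7),
              c(6.6, 6.5, 6.4))

max_len <- max(lengths(pools))

# Assigning a longer length to a vector pads it with NA.
pad_to <- function(v, n) {
  length(v) <- n
  v
}

# Each padded vector becomes one column of the matrix.
dis.matrix <- sapply(pools, pad_to, n = max_len)
dis.matrix
```

This avoids the repeated rbind(..., NA) calls and keeps working if the number of pools or their lengths change.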
My problem is similar to the question "the problem of R-input Format".
I tried the code from that question and revised parts of it to suit my data, which has 4 columns.
I want my data to be turned into a data frame with 4 variable vectors. The code I revised is:
formatMhsmm <- function(data){
nb.sequences = nrow(data)
nb.variables = ncol(data)
data_df <- data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
# iterate over these in loops
rows <- 1: nb.sequences
# build vector with id value
id = numeric(length = nb.sequences)
for( i in rows)
{
id[i] = data_df[i,2]
}
# build vector with time value
time = numeric (length = nb.sequences)
for( i in rows)
{
time[i] = data_df[i,3]
}
# build vector with observation values
sequences = numeric(length = nb.sequences)
for(i in rows)
{
sequences[i] = data_df[i, 4]
}
data.df = data.frame(id,time,sequences)
# creation of hsmm data object need for training
N <- as.numeric(table(data.df$id))
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
return(train)
}
library(mhsmm)
dataset <- read.csv("location.csv", header = TRUE)
train <- formatMhsmm(dataset)
print(train)
The output observations are not the data from the 4th column; instead I get a list like (4, 8, 12, ..., 396, 1, 1, ..., 56, 192, ..., 6550, 68, NA, NA, ...). It seems to have picked up 1/4 of the data from each column. Why does this happen?
Thank you very much!!!!
Why don't you simply count your observations by id and create the hsmm.data object directly? Supposing your data frame is called "data", we have:
N <- as.numeric(table(data$id))
train <- list(x=data$location, N = N)
class(train) <- "hsmm.data"
Extracted from http://www.jstatsoft.org/v39/i04/paper
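A likely explanation for the scrambled output in the question (an assumption, since the actual location.csv is not shown): unlist() on a data frame concatenates the columns one after another, so refilling the result with byrow = TRUE spreads each original column across the rows of the new matrix. A small sketch with a made-up 3-row data frame:

```r
# Made-up data with the same 4-column layout the question describes
df <- data.frame(a    = c(1, 2, 3),
                 id   = c(1, 1, 2),
                 time = c(1, 2, 1),
                 obs  = c(10, 20, 30))

# unlist() gives a1 a2 a3 id1 id2 id3 time1 ... obs3 (column after column)
flat <- unlist(df)

scrambled <- matrix(flat, ncol = 4, byrow = TRUE)  # what formatMhsmm() does
intact    <- matrix(flat, ncol = 4)                # column-wise fill keeps columns together

scrambled[, 4]  # every 4th element of flat, not the obs column
intact[, 4]     # the obs column
```

With the column-wise fill (or by indexing the data frame directly, as in the answer above) the 4th column comes back unchanged.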
I have read a series of 332 files as shown below, storing the data from each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332){
data[[i]] = read.csv(files[[i]])
}
The data has 3 columns named id, city, town. Now I need to calculate the mean of all values under city for the id values 1:10, for which I wrote the code below:
for(j in 1:10){
req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it gives me a wrong value, and when I call it inside a function it passes null values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and you get the wrong answer because you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your data / Q
req.data[[j]] <- data[[j]]$city ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your question
vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
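Note that the pooled mean and the mean of the per-file means only agree when every file contributes the same number of non-missing values. A small sketch with made-up numbers (the list below is invented for illustration, not your data) shows the difference:

```r
# Two tiny stand-in data frames with different numbers of usable values
data <- list(data.frame(city = c(1, 2, 3)),
             data.frame(city = c(10, NA)))

# Pooled: all values go into one vector, then one mean
pooled <- mean(unlist(lapply(data, function(d) d$city)), na.rm = TRUE)

# Mean of means: one mean per data frame, then the average of those
mean_of_means <- mean(vapply(data,
                             function(d) mean(d$city, na.rm = TRUE),
                             numeric(1)))

pooled          # (1 + 2 + 3 + 10) / 4
mean_of_means   # (2 + 10) / 2
```

Which of the two you want depends on whether each observation or each file should carry equal weight.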