I am trying to multiply the values stored in a list containing 1,000 values with another list containing ages. Ultimately, I want to store 1,000 rows to a dataframe.
I wonder whether it's better to use lapply or a for loop here.
list 1
lambdaSamples1 <- lapply(
floor(runif(numSamples, min = 1, max = nrow(mcmcMatrix))),
function(x) mcmcMatrix[x, lambdas[[1]]])
The output is a list of 1,000 values.
list 2
ager1= 14:29
What I want to do is
for (i in 1: numSamples) {
assign(paste0("newRow1_", i), 1 - exp(-lambdaSamples1[[i]] * ager1))
}
Now I have 1,000 rows of values that I want to store in a predefined data frame, outDf_1 (nrow = 1000, ncol = length(ager1)).
I tried
for (i in 1:numSamples) {
outDf_1[i,] <- newRow1_i
}
I want to store newRow1_1, ..., newRow1_1000 in the 1,000 rows of the outDf_1 data frame.
Should I approach this a different way?
I think you're overcomplicating this a bit. Many operations in R are vectorized, so you shouldn't need lapply or for loops for this. You didn't give us any data to work with, but the code below should do what you want in a faster and more straightforward way.
lambdaSamples1 <- mcmcMatrix[sample(nrow(mcmcMatrix), numSamples, replace = TRUE),
lambdas[[1]]]
outDF_1 <- 1 - exp(-lambdaSamples1 %*% t(ager1))
Just note that this makes outDF_1 a matrix, not a data frame.
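If a data frame is needed, the matrix can be converted afterwards. Here is a minimal sketch of the whole thing; since the question has no data, mcmcMatrix, lambdas and numSamples below are made-up stand-ins, and outer() is used in place of %*% (for a plain vector of samples it should give the same result).

```r
set.seed(1)

# Hypothetical stand-ins for objects the question does not show
numSamples <- 1000
mcmcMatrix <- matrix(rexp(5000 * 2, rate = 10), ncol = 2,
                     dimnames = list(NULL, c("lambda1", "lambda2")))
lambdas <- list("lambda1", "lambda2")
ager1 <- 14:29

# Draw numSamples rows at random and keep the first lambda column
lambdaSamples1 <- mcmcMatrix[sample(nrow(mcmcMatrix), numSamples, replace = TRUE),
                             lambdas[[1]]]

# outer() builds the numSamples x length(ager1) table of lambda * age in one step
out_mat <- 1 - exp(-outer(lambdaSamples1, ager1))

# Convert to a data frame and label the columns by age
outDf_1 <- as.data.frame(out_mat)
names(outDf_1) <- paste0("age_", ager1)
dim(outDf_1)
```

Each row of outDf_1 then corresponds to one of the newRow1_i vectors from the question, without creating 1,000 separate variables with assign().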
To do this for multiple ages, you could use a loop to save your resulting matrices in a list:
outDF <- list()
x <- 5
for (i in seq_len(x)) {
lambdaSamples <- mcmcMatrix[sample(nrow(mcmcMatrix), numSamples, replace = TRUE),
lambdas[[1]]]
outDF[[i]] <- 1 - exp(-lambdaSamples %*% t(ager[[i]]))
}
Here, ager1, ..., agerx are expected to be stored in a list (ager).
I am creating 15 rows in a dataframe, like this. I cannot show my real code, but the create row function involves complex calculations that can be put in a function. Any ideas on how I can do this using lapply, apply, etc. to create all 15 in parallel and then concatenate all the rows into a dataframe? I think using lapply will work (i.e. put all rows in a list, then unlist and concatenate, but not exactly sure how to do it).
for( i in 1:15 ) {
row <- create_row()
# row is essentially a dataframe with 1 row
my_df <- rbind(my_df, row)  # assign the result; rbind does not modify my_df in place
}
Something like this should work for you,
create_row <- function(){
rnorm(10, 0,1)
}
my_list <- vector(100, mode = "list")
my_list_2 <- lapply(my_list, function(x) create_row())
data.frame(t(sapply(my_list_2,c)))
The create_row function is just there to make the example reproducible. We predefine an empty list, fill it with the results of create_row(), and then convert the resulting list to a data frame.
Alternatively, predefine a data frame and use apply over the row margin, then use the t (transpose) function to get the output in the right shape:
df <- data.frame(matrix(ncol = 10, nrow = 100))
t(apply(df, 1, function(x) create_row()))  # create_row() takes no arguments
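If create_row() really returns a one-row data frame, as in the question, a common pattern is to collect the rows in a list and bind them once at the end. A minimal sketch, assuming a made-up create_row(i) that takes the row index (the real function is not shown in the question):

```r
# Hypothetical stand-in for the real create_row(): returns a one-row data frame
create_row <- function(i) {
  data.frame(id = i, value = i^2, label = paste0("row_", i))
}

# Build all 15 rows first, then bind them in a single call
rows <- lapply(1:15, create_row)
my_df <- do.call(rbind, rows)
dim(my_df)
```

If the rows are expensive to compute, parallel::mclapply has the same calling convention as lapply on Unix-like systems, though whether it actually helps depends on the real calculation.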
I have 100 matrices which each have 604800 columns, and 101 rows.
For each matrix, I need to reduce the number of columns to 60480 by averaging each block of 10 consecutive columns.
For example, for a vector
c(1,2,3,4,5,6,7,8,9,10,...)
The 5 column average would be:
c(3,8,13,18,...)
The code I am using to do this is:
col.av = tapply(col, rep(1:(length(col)/10), each = 10), mean)
Where col is one of my 101 x 604800 matrices. I have a for loop which iterates over the 100 matrices, however my problem is in the length of time needed to compute one run.
If I am just using one matrix, it takes 20 minutes+ to execute which is not feasible.
Are there any suggestions on how I can improve the speed of computation?
Thanks
If you are fine with a for loop, this one works for your case:
col.av <- matrix(0, nrow(col), ncol(col)/10)
for (i in 1:ncol(col.av)) {
col.av[,i] <- rowMeans(col[,(10*(i-1)+1):(10*i)])
}
Or without a for loop, using a custom function for readability. You can always wrap this in your for loop over the matrices or in a call to lapply.
#generate data
nc=604800
nr=101
test_m <- matrix(rnorm(nc*nr),ncol=nc)
#function to get rowmeans by 'window'-columns
get_rowmeans <- function(mm, window=10){
indices <- seq(1,ncol(mm),by=window)
res <- sapply(indices, function(i){
return(rowMeans(mm[,i:(i+(window-1))]))
})
res
}
tt <- get_rowmeans(test_m)
#check one
> all(tt[,1]==rowMeans(test_m[,1:10]))
[1] TRUE
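Another option that may be worth timing is to sum the columns in strides instead of calling rowMeans once per block. This is only a sketch (block_col_means is a made-up name, and I have not benchmarked it on the full 101 x 604800 matrix); it assumes the number of columns is a multiple of the window size.

```r
# Average each block of `window` consecutive columns of matrix m
block_col_means <- function(m, window = 10) {
  stopifnot(ncol(m) %% window == 0)
  starts <- seq(1, ncol(m), by = window)
  # For each offset w, take the (w+1)-th column of every block at once,
  # then add the resulting matrices together
  total <- Reduce(`+`, lapply(seq_len(window) - 1,
                              function(w) m[, starts + w, drop = FALSE]))
  total / window
}

# The small example from the question: 5-column averages of 1:10
tiny <- block_col_means(matrix(1:10, nrow = 1), window = 5)

# A larger check against rowMeans on one block
set.seed(42)
small <- matrix(rnorm(101 * 600), nrow = 101)
res <- block_col_means(small)
all.equal(res[, 2], rowMeans(small[, 11:20]))
```

This does only `window` subsetting operations regardless of how many blocks there are, so the work is done in a handful of vectorized matrix additions.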
I am relatively new to R and have a complicated situation to solve. I have uploaded a list of over 1000 data frames into R and called this list x. What I want to do is take certain data frames and take the mean and variance of the entire data frames (excluding the first column of each) and save these into two separate vectors. For example I wish to take the mean and variance of every third data frame in the list starting from element (3) and going to element (54).
So what I ultimately want are two vectors:
meanvector=c(mean(data frame(3)), mean(data frame(6)),..., mean(data frame(54)))
variancevector=c(var(data frame (3)), var(data frame (6)), ..., var(data frame(54)))
This problem is way above my knowledge level but I am thinking I can do this effectively using some sort of loop but I do not know how to go about making such loop. Any help would be much appreciated! Thank you in advance.
You can use lapply and pass indices as follows:
ids <- seq(3, 54, by=3)
out <- do.call(rbind, lapply(ids, function(idx) {
t <- unlist(x[[idx]][, -1])
c(mean(t), var(t))
}))
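If x is a list of 1000 data frames, you can use lapply to return the means and variances of a subset of this list.
ix <- seq(3, 54, by = 3)  # every third data frame, from element 3 to element 54
lapply(x[ix], function(df) {
# exclude the first column and flatten the rest into one numeric vector,
# because mean() and var() on a whole data frame do not return a single number
vals <- unlist(df[, -1])
c(mean = mean(vals), var = var(vals))
})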
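To get the two separate vectors the question asks for, vapply is one way to do it. A small sketch on made-up data (the list x below is simulated, with an id column standing in for the first column to exclude):

```r
set.seed(1)

# Hypothetical data: 60 data frames whose first column should be ignored
x <- lapply(1:60, function(i) data.frame(id = 1:5, a = rnorm(5), b = rnorm(5)))

ids <- seq(3, 54, by = 3)  # elements 3, 6, ..., 54

# df[-1] drops the first column; unlist() flattens the rest into one vector
meanvector     <- vapply(x[ids], function(df) mean(unlist(df[-1])), numeric(1))
variancevector <- vapply(x[ids], function(df) var(unlist(df[-1])),  numeric(1))
length(meanvector)
```

vapply checks that every result is a single number, so a data frame with non-numeric columns fails loudly instead of silently producing NA.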
I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51){
for (k in 1:6){
for (m in 1:4){
sub <- subset(data,data$var1==i & data$var2==k)
sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
sub$newvar <- m
datanew <- rbind(datanew,sub)
}
}
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=20e2, nrow=100))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(4, ncol(mydf)-4-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
mapply(function(i, k)
# The function that you are `mapply`ing is:
# rbind'ing a list of dataframes, which were subsetted by matching var1 & var2
# and then multiplying by a value in filingstat
do.call(rbind,
# iterating over m
lapply(1:4, function(m)
# the cbind is for adding the newvar=m, at the end of the subtable
cbind(
# we transpose twice: first the subset to multiply our vector.
# Then the result, to get back our original form
t( t(subset(mydf, var1==i & mydf$var2==k)) *
(multpVec * filingstat0711[i,k,m] + invVec)),
# this is an argument to cbind
"newvar"=m)
)),
# the two lists you are passing as arguments are the columns of the expanded grid
ixk$i, ixk$k, SIMPLIFY=FALSE
)
# flatten the data frame
newdf <- do.call(rbind, newdf)
Two points to note:
Try not to use names like data, table, df, sub etc. for your variables, since these are also the names of commonly used functions.
In the above code I used mydf in place of data.
You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation
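A different route that avoids subsetting altogether is to repeat every row once per category and look up the scalar by indexing the array with a three-column matrix. This is a sketch on a tiny made-up data set, not the poster's data: scal plays the role of filingstat0711, and cols is an assumed list of the columns to scale.

```r
set.seed(1)
n <- 200

# Hypothetical data: two grouping variables, an id, two columns to scale,
# and a last column that must stay untouched
mydf <- data.frame(var1 = sample(1:3, n, replace = TRUE),
                   var2 = sample(1:2, n, replace = TRUE),
                   id   = seq_len(n),
                   a    = rnorm(n),
                   b    = rnorm(n),
                   last = rnorm(n))
scal <- array(runif(3 * 2 * 4), dim = c(3, 2, 4))  # stands in for filingstat0711

# Repeat the whole data set once per category m = 1..4
expanded <- mydf[rep(seq_len(n), times = 4), ]
expanded$newvar <- rep(1:4, each = n)

# Matrix indexing: one (i, k, m) triple per row picks that row's scalar
mult <- scal[cbind(expanded$var1, expanded$var2, expanded$newvar)]

cols <- c("a", "b")
expanded[cols] <- expanded[cols] * mult
dim(expanded)
```

The rows come out ordered by m rather than by (i, k, m) as in the original loop, so sort afterwards if the order matters.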
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it. The moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve this? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
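For the dynamically named columns in the question, the same list idea applies: fill a named list inside the loop and convert it once at the end. A minimal sketch, where the var_ naming scheme and the column values are made up for illustration:

```r
cols <- list()  # start with an empty list instead of an empty data frame
for (i in 1:3) {
  col_name <- paste0("var_", i)   # hypothetical name generated inside the loop
  cols[[col_name]] <- rep(i, 2)   # column values depend on the loop variable
}
d <- as.data.frame(cols)          # one conversion after the loop
names(d)
```

This sidesteps the "replacement has 2 rows, data has 0" error, because the number of rows is only fixed when the list is converted.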
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
what this does:
create a matrix with rows and columns according to the expected growth
insert 2 random numbers into the matrix
convert this into a dataframe after the loop has finished.
this works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
output will look like this
df #enter
x y z #col names
1 2 3
Thanks Notable1, this works for me with the textreadr package.
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "*.PDF")
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(file.path(diretorio, arquivos[k]), ocr = TRUE)  # list.files returns bare file names, so prepend the directory
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case where I needed to use a data frame within a for loop. In this case it was "efficient" enough; however, keep in mind that the database was small and the iterations in the loop were very simple. But maybe the code could be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (Tokyo, New York, São Paulo, Seoul and Mexico City), each with its own raster grids. I had a spatial point database with more than 1000 observations spread across the 5 locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis, I needed not only the raster values but also the unique ID of each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code that uses a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
I was looking for the same thing, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
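Writing out every element by hand stops scaling quickly; do.call can pass the whole list to rbind at once. A small sketch of the same example:

```r
set.seed(1)
a <- vector("list", 3)
for (i in seq_along(a)) {
  a[[i]] <- data.frame(x = rnorm(2), y = runif(2))
}

# Equivalent to rbind(a[[1]], a[[2]], a[[3]]), whatever the length of a
combined <- do.call(rbind, a)
nrow(combined)
```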