I am attempting to write a function which accepts a dataframe, and then generates subset dataframes within a for() loop. As a first step, I tried the following:
dfcreator<-function(X,Z){
for(i in 1:Z){
df<-subset(X,Stratum==Z) #build dataframe from observations where index=value
assign(paste0("pop", Z),df) #name dataframe
}
}
This however does not save anything in to memory, and when I try to specify a return() I am still not getting what I need. For reference, I am using the
Sweden data set (which is native to RStudio).
EDIT Per Melissa's Advice!
I tried to implement the following code:
sampler <- function(df, n,...) {
return(df[sample(nrow(df),n),])
}
sample_list<-map2(data_list, stratumSizeVec, sampler)
where stratumSizeVec is a 1X7 df and data_list is a list of seven dfs. When I do this, I get seven samples in sample list all of the same size equal to stratumSizeVec[1]. Why is map2 not inputting the in the following manner
sampler(data_list$pop0,stratumSizeVec[1])
sampler(data_list$pop1,stratumSizeVec[2])
...
sampler(data_list$pop6,stratumSizeVec[7])
Furthermore, is there a way to "nest" the map2 function within lapply?
I'm confused as to why you never actually utilize i anywhere in your loop. It looks like you're creating Z copies of the data set where Stratum == Z - is that what you are after?
as for your code, I would use the following:
data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", sort(unique(df$Stratum)))
This doesn't define a function, we are calling base-R function (namely split) which splits up a data frame based on some vector (here, we use df$Stratum). The result is a list of data frames, each with a single value of Stratum.
Random sampling from rows
sampled_data <- lapply(data_list, function(df, n,...) { # n is the number of rows to take, the dots let you send other information to the `sample` function.
df[sample(nrow(df), n, ...),]
},
n = 5,
replace = FALSE # this is default, but the purpose of using the ... notation is to allow this (and any other options in the `sample` function) to be changed.
)
You can also define the function separately:
sampler <- function(df, n,...) {
df[sample(nrow(df), n, ...),]
}
sampled_data <- lapply(data_list, sampler, n = 10) # replace 10 with however many samples you want.
purrr:map2 method
As defined, the sampler function does not need to be modified, each element of the first list (data_list) is put into the first argument of sampler, and the corresponding element of the 2nd "list" (sampleSizeVec) is put into the 2nd argument.
library(purrr)
map2(data_list, sampleSizeVec, sampler, replace = FALSE) # replace = FALSE not needed, there as an example only.
Related
I'd like to use a loop function to recognise names from a list/dataframe as an actual list/dataframe name in the R script (for data analysis or manipulation).
I will create some pseudo data to try to help show what i'm trying to do.
Here is code to create 3 lists
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
This code creates a list containing those list names
vars <- c("height","weight","income")
The code below doesn't run, but I would like to use a loop code like this, where it takes the name from the list position and uses it in script as a list name. Thus it's using the name to calculate the mean, and it's using the name to create a new object.
for (i in 1:3)
{mean_**vars[i]** = mean(**vars[i]**) }
The result should be 3 objects "mean_height", "mean_weight", "mean_income" which contain the mean scores
I'm not so much interested in the calculating of mean scores, I'm interested in the ability to use the names from the list. I want to be able to expand this to other analyses that are repetitive.
Apologies if above hasn't been articulated too well, I'm quite new to R, so I hope it makes some sense.
Any help will be most useful, or if you can point me in the right direction that would be great.
This may be what you're looking for, where lapply applies the mean function to each of the items in vars (a list of dataframes). Note that you want to make the list of dataframes using the variable names.
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
vars <- list(height, weight, income)
lapply(vars, function(x) mean(x))
Then create an output dataframe using that:
df1 <- data.frame(lapply(vars, function(x) mean(x)))
colnames(df1) <- c("mean_height", "mean_weight", "mean_income")
df1
From your additional comment, using vars <- list(height, weight, income) should allow you do this:
mean(height)
mean(vars[[1]])
[1] 160.48
[1] 160.48
This should work to output dynamically named variables:
vars <- list(height = height, weight = weight, income = income)
for (i in names(vars)){
assign(paste("mean_", i, sep = ""), mean(vars[[i]]))
}
mean_height
mean_weight
mean_income
[1] 163.28
[1] 90.465
[1] 109686.5
However, I'd suggest not programming that way since it can cause issues and it's not very scalable. E.g., you could end up with 10000 variables.
I guess what you want is something like below, which produces three objects into your global environment for the means of weight, height, and income from list list, i.e.,
list2env(setNames(Map(mean,lst),paste0("mean_",names(lst))),envir = .GlobalEnv)
DATA
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
lst <- list(height,weight,income)
A more common approach in R is to use lists of data, rather than separate variables.
Like this:
# make this reproducible
set.seed(123)
# make an empty list for the data
raw_data <- list()
# then fill the list. The data can be of varying length in a list.
raw_data$height <- sample(120:200,200,TRUE)
raw_data$weight <- sample(40:140,200,TRUE)
raw_data$income <- sample(20000:200000,200, TRUE)
Then looping becomes a one-liner and your names are preserved, using the *apply family of functions:
mean_data <- lapply(raw_data, mean)
# print that
mean_data
$height
[1] 159.06
$weight
[1] 90.83
$income
[1] 114000.7
Note what we didn't have to do:
know the number of variables.
have variables all the same length.
build a loop and keep track of names.
All handled automagically. Nice.
I have a function in R which returns a list with N columns each of M rows of multiple types - date, numeric and char. I am using sapply to create multiple copies of these lists, which then end up in a top level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * number of list rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName)
{
currDate = seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
t1 = runif(11)
t2 = runif(11)
groupNum = c(rep(1,5), rep(2,6))
varName = rep(varName, 11)
dataout= data.frame(currDate, t1, t2, groupNum, varName)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results, there's no reason the function needs to produce a list instead of say a data.frame. I'm certain I'm doing something dumb, any help appreciated.
I am looking for a best practice to store multiple vector results of an evaluation performed at several different values. Currently, my working code does this:
q <- 55
value <- c(0.95, 0.99, 0.995)
a <- rep(0,q) # Just initialize the vector
b <- rep(0,q) # Just initialize the vector
for(j in 1:length(value)){
for(i in 1:q){
a[i]<-rnorm(1, i, value[j]) # just as an example function
b[i]<-rnorm(1, i, value[j]) # just as an example function
}
df[j] <- data.frame(a,b)
}
I am trying to find the best way to store individual a and b for each value level
To be able to iterate through the variable "value" later for graphing
To have the value of the variable "value" and/or a description of it available
I'm not exactly sure what you're trying to do, so let me know if this is what you're looking for.
q = 55
value <- c(sd95=0.95, sd99=0.99, sd995=0.995)
a = sapply(value, function(v) {
rnorm(q, 1:q, v)
})
In the code above, we avoid the inner loop by vectorizing. For example, rnorm(55, 1:55, 0.95) will give you 55 random normal deviates, the first drawn from a distribution with mean=1, the second from a distribution with mean=2, etc. Also, you don't need to initialize a.
sapply takes the place of the outer loop. It applies a function to each value in value and returns the three vectors of random draws as the data frame a. I've added names to the values in value and sapply uses those as the column names in the resulting data frame a. (It would be more standard to make value a list, rather than a vector with named elements. You can do that with value <- list(sd95=0.95, sd99=0.99, sd995=0.995) and the code will otherwise run the same.)
You can create multiple data frames and store them in a list as follows:
q <- list(a=10, b=20)
value <- list(sd95=0.95, sd99=0.99, sd995=0.995)
df.list = sapply(q, function(i) {
sapply(value, function(v) {
rnorm(i, 1:i, v)
})
})
This time we have two different values for q and we wrap the sapply code from above inside another call to sapply. The inner sapply does the same thing as before, but now it gets the value of q from the outer sapply (using the dummy variable i). We're creating two data frames, one called a and the other called b. a has 10 rows and b has 20 (due to the values we set in q). Both data frames are stored in a list called df.list.
This post contains two questions. The first is some related with the second.
First, suppose that I want define one function that receives two arguments: one data frame and one variable(column) and I would like to do some counts or statistics. In first time, I have to determine the variable position. For example, suppose that my two first rows of the df are
> df
person age rent
1 23 1000
2 35 1.500
and my function is like this
> myfun<- function(df, var)
{
# determining the variable
ind<- which(names(df) %in% var )
# selecting the variable
v <- df[,ind]
# rest of function
....
}
I think that it may be more easy... Is there some way to determine v directly?
Second Question: I have a large list of data frames(samples of one population). All data frame have the same variables and one of these variable is the rent. I would like to calculate the mean of the rent variable for each sample and I would like to use the lapply function. For one sample, I can do the following code
> mean(sample$rent , na.rm = T)
All that I want is do something like this
> apply(list, mean( , variablefix = rent))
One option is create a new mean function with the rent argument being fix or with only one argument and apply the lappy function:
>mean_rent <- function(df){...}
>lapply(df, mean_rent)
But, I want a way to use the apply function directly in only one line
Some ideas?
Question One: you can also use the names (i.e a character string) or a variable containing the name to index data.frames (and vectors,matrices etc.), so you just have to do:
myfun<- function(df, var) {
# select the column
v <- df[,var]
# rest of function
}
but it is more common to define the function on a vector and then just call it with myfun(df[,var])
Question Two: Instead of assigning the new function to a variable, you can also just pass it on directly, i.e.
lapply(list_of_dfs, function(df){ mean( df$rent ) })
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I cant really do anything with it, the moment I try to populate it, I run into an error
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What may be a good way to achieve what I am looking to do. Please let me know if I wasnt clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
what this does:
create a matrix with rows and columns according to the expected growth
insert 2 random numbers into the matrix
convert this into a dataframe after the loop has finished.
this works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
output will look like this
df #enter
x y z #col names
1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "*.PDF")
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(arquivos[k],ocr = T)
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case in where I was needing to use a data frame within a for loop function. In this case, it was the "efficient", however, keep in mind that the database was small and the iterations in the loop were very simple. But maybe the code could be useful for some one with similar conditions.
The for loop purpose was to use the raster extract function along five locations (i.e. 5 Tokio, New York, Sau Paulo, Seul & Mexico city) and each location had their respective raster grids. I had a spatial point database with more than 1000 observations allocated within the 5 different locations and I was needing to extract information from 10 different raster grids (two grids per location). Also, for the subsequent analysis, I was not only needing the raster values but also the unique ID for each observations.
After preparing the spatial data, which included the following tasks:
Import points shapefile with the readOGR function (rgdap package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here the for loop code with the use of a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
was looking for the same and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])