dynamically call data.frame and fill rows - r

I have the following problem:
I am creating a flexible number of data.frames (matrices would do the job as well). All of them are full except the last one, which is only partly filled and should therefore be treated differently.
I now want to access the n-th data.frame, plate n, by name and then fill it with data. I would like to do something like
assign(paste0("plate", n)[x,y], %some data%)
where [x,y] is the position in the data.frame (or matrix) and n is the index of the data.frame.
However, it always gives me
incorrect number of dimensions

You can't. See ?assign:
‘assign’ does not dispatch assignment methods, so it cannot be
used to set elements of vectors, names, attributes, etc.
(if you can't set elements of vectors you can't do it for dataframes either).
It's probably better to create a list of dataframes, rather than manually labelling plate1, plate2 etc. For example, supposing that plates was a list such that plates[[1]] was plate 1, plates[[2]] was plate 2, etc:
n <- length(plates)
plates[[n]] # is the plateN dataframe
plates[[n]][x, y] <- some_data
In terms of constructing the plates list, it will depend on your particular example (how you are doing it at the moment), but probably a for loop will be helpful here, e.g.:
plates <- list()
# supposing you know `n`, the number of plates
for (i in 1:n) {
plates[[i]] <- code_that_constructs_the_dataframe
}
If you're feeling adventurous, lapply is quite nice, e.g.
# suppose each plate was constructed by reading in a CSV
list_of_csvs <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
plates <- lapply(list_of_csvs, read.csv)
# then plates[[1]] is the dataframe from the first csv, etc
but as you seem new to R, perhaps it is best to stick with the for loops for now.
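To make the list approach concrete, here is a minimal sketch. The plate size (8 x 12), the number of plates, and the value written are assumptions made only for illustration; they are not taken from the question. Naming the list elements keeps the "plate n" labels available without assign:

```r
n <- 3                                   # assumed number of plates
plates <- vector("list", n)              # pre-allocate the list
for (i in seq_len(n)) {
  # assumed layout: an empty 8 x 12 plate filled with NA
  plates[[i]] <- as.data.frame(matrix(NA_real_, nrow = 8, ncol = 12))
}
names(plates) <- paste0("plate", seq_len(n))

# Fill one cell of the n-th plate, addressed by position or by name
x <- 2; y <- 5
plates[[n]][x, y] <- 0.42
plates[[paste0("plate", n)]][x, y]       # same cell, looked up by name
```

Because the last plate is just plates[[n]], treating it differently from the others comes down to checking i == n inside the loop.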

Related

R add to a list in a loop, using conditions

I have a data.frame dim = (200,500)
I want to run shapiro.test on each column of my data frame and append the result to a list. This is what I'm trying:
colstoremove <- list();
for (i in range(dim(I.df.nocov)[2])) {
x <- shapiro.test(I.df.nocov[1:200,i])
colstoremove[[i]] <- x[2]
}
However this is failing. Some pointers? (background is mainly python, not much of an R user)
Consider lapply(): a data frame is a list of columns, so lapply() applies the function to each column and returns a list with one element per column:
colstoremove <- lapply(I.df.nocov, function(col) shapiro.test(col)[2])
Here is what happens in
for (i in range(dim(I.df.nocov)[2]))
For the sake of example, I assume that I.df.nocov contains 100 rows and 5 columns.
dim(I.df.nocov) is the vector of I.df.nocov dimensions, i.e. c(100, 5)
dim(I.df.nocov)[2] is the 2nd dimension of I.df.nocov, i.e. 5
range(x) is a 2-element vector which contains the minimal and maximal values of x. For example, range(c(4,10,1)) is c(1,10). So range(dim(I.df.nocov)[2]) is c(5,5).
Therefore, the loop iterates twice: the first time with i=5, and the second time also with i=5. Not surprising that it fails!
The problem is that R's function range and Python's function with the same name do completely different things. The closest equivalent of Python's range is seq. For example, seq(5) is c(1,2,3,4,5), seq(3,5) is c(3,4,5), and seq(1,10,2) is c(1,3,5,7,9). You may also write 1:n, which is the same as seq(n), and m:n, which is the same as seq(m,n) (but the priority of ':' is very high, so 1:2*x is interpreted as (1:2)*x).
Generally, if something does not work in R, print the subexpressions, starting with the innermost one and working outward. If some subexpression is too big to print, use str(x) (str stands for "structure"). And never assume that functions in Python and R are the same! A function with the same name may do something quite different.
On a side note, instead of dim(I.df.nocov)[2] you could just write ncol(I.df.nocov) (there is also a function nrow).
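Putting these points together, a corrected version of the loop could look like the sketch below. The data frame df is random stand-in data (an assumption, since the original I.df.nocov is not shown), and seq_len(n) is used instead of 1:n because it yields an empty sequence when n is 0:

```r
set.seed(1)
# Stand-in for I.df.nocov: 200 rows, 5 columns of random normal data
df <- as.data.frame(matrix(rnorm(200 * 5), nrow = 200))

range(ncol(df))     # both elements are 5, which is why the original loop failed
seq_len(ncol(df))   # 1 2 3 4 5, the analogue of Python's range for indexing

# Loop version: store one p-value per column
pvals <- vector("list", ncol(df))
for (i in seq_len(ncol(df))) {
  pvals[[i]] <- shapiro.test(df[[i]])$p.value
}

# Equivalent without an explicit loop, returning a named numeric vector
pvals_vec <- vapply(df, function(col) shapiro.test(col)$p.value, numeric(1))
```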

R - create new dataframes by random selection of consisting dataframes

I'm an R beginner and I have a little problem. I want to create new dataframes by randomly selecting from existing dataframes.
I have 4 categories, each divided into 10 dataframes, and I want to create 10 new dataframes, each containing 1 dataframe from each category.
For example, these are my dataframes:
Cat_1_Data_1 Cat_2_Data_1 Cat_3_Data_1 Cat_4_Data_1
Cat_1_Data_2 Cat_2_Data_2 Cat_3_Data_2 Cat_4_Data_2
Cat_1_Data_3 Cat_2_Data_3 Cat_3_Data_3 Cat_4_Data_3
Cat_1_Data_4 Cat_2_Data_4 Cat_3_Data_4 Cat_4_Data_4
Cat_1_Data_5 Cat_2_Data_5 Cat_3_Data_5 Cat_4_Data_5
Cat_1_Data_6 Cat_2_Data_6 Cat_3_Data_6 Cat_4_Data_6
Cat_1_Data_7 Cat_2_Data_7 Cat_3_Data_7 Cat_4_Data_7
Cat_1_Data_8 Cat_2_Data_8 Cat_3_Data_8 Cat_4_Data_8
Cat_1_Data_9 Cat_2_Data_9 Cat_3_Data_9 Cat_4_Data_9
Cat_1_Data_10 Cat_2_Data_10 Cat_3_Data_10 Cat_4_Data_10
Creating new dataframes (that's how i do it):
new_data_1 <- rbind(Cat_1_Data_1,Cat_2_Data_1,Cat_3_Data_1,Cat_4_Data_1)
...
new_data_10 <- rbind(Cat_1_Data_10,Cat_2_Data_10,Cat_3_Data_10,Cat_4_Data_10)
But I want a random pick of the datasets, like:
new_data_1 <- rbind(Cat_1_Data_[Random 1-10],Cat_2_Data_[Random 1-10]... and so on)
...
new_data_10 <- rbind(Cat_1_Data_[Random 1-10],Cat_2_Data_[Random 1-10]...and so on)
Is there any way to solve this problem? I don't really know how to approach it :(
Here is one sampling strategy that would work.
Create lists of your data.frames, one per category, shuffling them as you go:
dflist.cat1 <- sample(list(Cat_1_Data_1, Cat_1_Data_2, ...))
dflist.cat2 <- sample(list(Cat_2_Data_1, Cat_2_Data_2, ...))
...
Run lapply to rbind the corresponding element of each list. This will result in a list of length 10:
dflist.new <- lapply(1:10, function(i){
rbind(dflist.cat1[[i]],
dflist.cat2[[i]],
dflist.cat3[[i]],
dflist.cat4[[i]])
})
You can access your data.frames using dflist.new[[1]] for the first one, and so on.
I am sure there is a more elegant way to do this with 2-dimensional list indices, but this works well for a small number of categories.
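A more compact variant of the same strategy keeps the four categories in one nested list, so the code does not grow with the number of categories. This is only a sketch: the tiny data frames below are made-up stand-ins for the real Cat_*_Data_* objects, and the names cats and shuffled are introduced here for illustration.

```r
set.seed(42)
# Stand-in data: cats[[k]][[d]] plays the role of Cat_k_Data_d
cats <- lapply(1:4, function(k) {
  lapply(1:10, function(d) data.frame(category = k, dataset = d))
})

# Shuffle each category independently (sample() on a list permutes its elements)
shuffled <- lapply(cats, sample)

# The i-th new data frame stacks the i-th element of every shuffled category
dflist.new <- lapply(1:10, function(i) {
  do.call(rbind, lapply(shuffled, `[[`, i))
})

dflist.new[[1]]   # one row per category here, each from a randomly chosen dataset
```

As in the answer above, every original data frame is used exactly once, because each category is permuted rather than sampled with replacement.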

Creating a new nested list element that is a combination of two existing nested list elements (in R)

I am looking for a hint about how to create a new nested list element from two existing nested list elements. In the current form of the script I am working on, I create a list called tardis that is n elements long, based on the number of elements in an input list. In the example below, that input list, dataLayers, is 2 elements long.
After creating tardis, the script populates it by reading in data from 1200 netCDF files. Each of the 12 elements in 'mean' and 'sd' in tardis are matrices of geographic data, tardis[['data']][[decade]][['mean']][[month]], for example, for the 12 calendar months. When the list is fully populated I would like to create some derived variables. For example, in the snippet below, I would like to create a variable TOTALPRECIP by adding SNOW and RAIN. In doing this, I would like to create TOTALPRECIP from SNOW + RAIN as a third list element in tardis with the exact nested structure as the other two elements (adding them together and preserving the structure).
Is this possible with apply or its related functions?
begin <- 1901
end <- 1991
dataLayers <- c("SNOW","RAIN")
tardis<-list()
for (i in 1:length(dataLayers)){
tardis[[dataLayers[i]]]<-list('longName'='timeLord','units'='theDr','data'=list())
for (j in seq(begin,end,10)){
tardis[[dataLayers[[i]]]][['data']][[as.character(j)]]<-list('mean'=vector(mode='list',length=12),'sd'=vector(mode='list',length=12))
}
}
#add SNOW AND RAIN
print(names(tardis))
>[1] "SNOW" "RAIN" "TOTALPRECIP"
Here are your for loops using replicate (Note that the expression value for each replicate is the same expression you have in the assignment portion of your for loop)
## This is your inner for-loop, using replicate
inds <- seq(begin, end, 10)
datas <- replicate(length(inds), list('mean'=vector(mode='list',length=12),'sd'=vector(mode='list',length=12))
, simplify=FALSE)
names(datas) <- inds
# This is your outer loop
tardis2 <- replicate(length(dataLayers), list('longName'='timeLord','units'='theDr','data'=datas)
, simplify=FALSE)
names(tardis2) <- dataLayers
# Compare Results
identical(tardis2, tardis)
# [1] TRUE
However, I'm not sure if lists are really the best structure for this. Have you considered data.frames?
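The replicate version rebuilds the empty structure but does not yet add SNOW and RAIN. One way to derive TOTALPRECIP while preserving the nesting is to walk both layers in parallel with Map. The sketch below makes several assumptions: the 2 x 2 matrices are stand-ins for the real geographic grids, make_layer is a hypothetical helper used only to build test data, the longName value is invented, and only the monthly means are summed. Standard deviations cannot simply be added, so the sd slots are left empty here.

```r
# Hypothetical helper: builds one populated layer with constant 2 x 2 matrices
make_layer <- function(value) {
  decades <- as.character(seq(1901, 1991, 10))
  data <- lapply(decades, function(d) {
    list(mean = replicate(12, matrix(value, 2, 2), simplify = FALSE),
         sd   = replicate(12, matrix(0, 2, 2), simplify = FALSE))
  })
  names(data) <- decades
  list(longName = "timeLord", units = "theDr", data = data)
}
tardis <- list(SNOW = make_layer(1), RAIN = make_layer(2))

# Outer Map pairs up decades, inner Map pairs up months and adds the matrices
tardis[["TOTALPRECIP"]] <- list(
  longName = "total precipitation",       # assumed label
  units    = tardis[["SNOW"]][["units"]],
  data     = Map(function(snow, rain) {
    list(mean = Map(`+`, snow[["mean"]], rain[["mean"]]),
         sd   = vector(mode = "list", length = 12))  # left empty on purpose
  }, tardis[["SNOW"]][["data"]], tardis[["RAIN"]][["data"]])
)

names(tardis)
```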

is.na() in R for loop not quite understood

I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector that is longer than rich.seq. The TRUE positions no longer line up with the values they were computed for, and every TRUE position beyond the end of rich.seq yields NA, so the result is not what you expect.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
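The effect can be reproduced in isolation. In this sketch, keep is a hand-built stand-in for the logical vector computed from details.table in the question (FALSE for 5 and 6, TRUE elsewhere):

```r
rich.seq <- 4:34
keep <- rep(TRUE, length(rich.seq))
keep[rich.seq %in% c(5, 6)] <- FALSE   # rows that already reached the target

first  <- rich.seq[keep]   # 29 values: 5 and 6 removed, as intended
second <- first[keep]      # same 31-element mask applied to a 29-element vector

length(second)             # still 29, but not the same 29 values
sum(is.na(second))         # the two TRUE positions past the end become NA
7 %in% second              # 7 and 8 are dropped, although they should be kept
```

Recomputing the mask from a table that always has one row per original value, as the follow-up answer does, avoids the mismatch.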
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.

Efficient function to return varying length vector from lookup table

I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
The numbers in the "types" vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one vector in places for each observation. lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops it looks something like:
for (i in 1:length(types)){
row<-types[i]
columns<-places[[i]]
this.obs<-lookup.counts[row,columns] #the counts of this type in each place
total<-sum(this.obs)
this.obs<-this.obs/total #the share of observations of this type in these places
pick<-runif(1,min=0,max=1)
#the following should really be a 'while' loop, but regardless it needs help
for(j in 1:length(this.obs[])){
if(this.obs[j] > pick){
#pick is less than this county so assign
pick<- 100 #just a way of making sure an observation doesn't get assigned twice
assigned.places[i]<-colnames(lookup.counts)[j]
}else{
#pick is greater, move to the next category
pick<- pick-this.obs[j]
}
}
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternatives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
The different lengths of your lists are a severe obstacle to vectorizing your outer loop. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply probabilities by that vector to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over that.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
apply(cbind(types, places), 1, function(x) {
sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
})
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames.
You can check whether things are faster if you don't cbind stuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
sapply(1:length(types), function(i) {
sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
})
]
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test<-function(i) {
this.places<-colnames(lookup.counts)[places[[i]]]
this.obs<-lookup.counts[types[i],this.places]
sample(this.places,size=1,prob=this.obs)
}
# Applies the function for all i
sapply(1:length(types),test)
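One caveat with the variants that call sample() on numeric place indices: when x is a single number greater than 1, sample(x, 1) draws from 1:x rather than returning x, so an observation whose only place is, say, 3 would not be handled correctly. The sketch below sidesteps this by sampling a position with sample.int and then indexing. To exercise that case, the second entry of places is changed from 1 to 3; this change and the function name pick_place are assumptions made for illustration.

```r
set.seed(123)
types  <- c(1, 3, 3)
places <- list(c(1, 2, 3), 3, c(2, 3))   # second observation: only place 3
lookup.counts <- matrix(runif(9, min = 0, max = 10), nrow = 3, ncol = 3,
                        dimnames = list(NULL, paste0("V", 1:3)))

# Pick one admissible place index for observation i, weighted by the counts
pick_place <- function(i) {
  candidates <- places[[i]]
  weights <- lookup.counts[types[i], candidates]
  # sample a position within candidates, so a single candidate is always kept
  candidates[sample.int(length(candidates), size = 1, prob = weights)]
}

idx <- vapply(seq_along(types), pick_place, numeric(1))
assigned.places <- colnames(lookup.counts)[idx]
assigned.places
```

The character-based version above does not have this problem, because sample() only expands a length-one argument when it is numeric.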
