Saving many subsets as dataframes using "for"-loops - r

this question might be very simple, but I do not find a good way to solve it:
I have a dataset with many subgroups which need to be analysed all-together and on their own. Therefore, I want to use subsets for the groups and use them for the later analysis. As well, the defintion of the subsets as the analysis should be partly done with loops in order to save space and to ensure that the same analysis has been done with all subgroups.
Here is an example of my code using an example dataframe from the boot package:
data(aids)
qlist <- c("1","2","3","4")
for (i in length(qlist)) {
paste("aids.sub.",qlist[i],sep="") <- subset(aids, quarter==qlist[i])
}
The variable which contains the subgroups in my dataset is stored as a string, therefore I added the qlist part which would be not required otherwise.

Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter==x))
Equivalently, avoiding the subset():
lapply(qlist, function(x) aids[aids$quarter==x,])
It is likely the case that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you can use one of the subsets, as created below). But you can also iterate over it (using for or lapply) without having to construct variable names.
To do the job as you are asking, use assign:
for (i in qlist) {
assign(paste("aids.sub.",i,sep=""), subset(aids, quarter==i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.

Related

How to reduce memory usage in R while looping through dataframes

I have two data frames and i am doing a grouping operation based on weighted scores. I used PROFVIS to profile the code and figured that looping through the dataframes to check and add group labels is a costly operation. I understand we can use lapply, but not sure how to parse two dataframes and a new variable for this. Please help. I just need to reduce the time and space complexity of this code using apply functions.
rank1<-c()
occup_cats<-c()
for(i in 1:length(data_set$primary_occupation)){
for(j in 1:length(occup_cat_prop$Category)){
**if((as.character(data_set$primary_occupation[i])) == (as.character(occup_cat_prop$income_source[j])))**{
rank1[i]<-occup_cat_prop$prop[j]
occup_cats[i]<-as.character(occup_cat_prop$Category[j])
}
}
}

R approach for iterative querying

This is a question of a general approach in R, I'm trying to find a way into R language but the data types and loop approaches (apply, sapply, etc) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (colums)
site segment id
google.com Googleuser 123
bing.com Binguser 456
How to manage such a list of value groups (row by row)? data.frames are column focused, you cant write a data.frame row by row in an R script. So the only way I found to define this initial config table is a csv, which is really an approach I try to avoid, but I can't find a way to make it more elegant.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
config <- define_request(site, segment, id)
result <- query_api(config)
return result
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now sapply allows to use one parameter-list and multiple static parameters. The mapply works, but it will give me my data in some crazy output I cant handle or even understand exactly what it is.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.

How to make loops in R that operate on and return multiple objects

This is my first post, and I think I have looked thoroughly for my answer with no luck, but I might not be typing in the right search terms, since I am relatively new to R. I apologize if this has been answered before and if it has a link would be greatly appreciated.
In essence, I am trying to make a loop that will operate on a set of data frames that I have read into R from .txt files using read.table. I am working with simulated vegetation data organized into many species by site matrices, so it would be best for me if I could create loops that will just operate on the objects I have read in using some functions I have made and then put out new objects into my workspace with a specific naming pattern (e.g. put "_av" on the end of the name of the object operated on when creating a new object).
for convenience sake, lets say I have only four matrices I want to work with, all which contain the phrase "mod" for model. I have read that I can put these data frames into a list of data frames by the following code:
list.mods=lapply(ls(pattern="mod"),get)
This does create a list which I have been having trouble on getting my functions to actually operate on. From what I read this is the best way to make a list of objects you want to operate on.
So lets say that list.mods is now my list of operable matrices - mod1, mod2, mod3, and mod4. Also, lets say I have a function that simply calculates Bray-Curtis dissimilarity as follows:
bc=function(x){
vegdist(x,method="bray")
}
I can use this by typing in:
mod1.bc=bc(mod1)
That works. But it seems like I should be able to apply my list of models to the function bc and have it output the models with a pattern mod1.bc, mod2.bc, mod3.bc, and mod4.bc. I cannot get my list of files to work in the function much less save each operation as a new object with a patterned name.
What am I doing wrong? In the end I might have as many as a hundred models or more and would really appreciate being able to create a list of items that I can run through loops.
Thanks in advance.
You can use lapply again:
new.list.mods <- lapply(list.mods, bc)
This will return a new list in which each element is the result of applying bc to the corresponding element of list.mods.
The 'apply' family of functions in R basically allows you to save typing. If that's easier for you to understand, you can use a 'for loop' instead. Of course you will need to know how to access elements in a list for that. There is a question about that.
How about collecting the names of the models/objects you want into a list:
mod_list <- sapply(ls(pattern = "mod"), as.name)
and then looping over them with your function:
output_list <- lapply(eval(mod_list), bc)
With this approach you avoid creating the potentially large and redundant list.mods object in your example. Also, I think this will result in conveniently named lists.

Convert data frame to list

I am trying to go from a data frame to a list structure in R (and I know technically a data frame is a list). I have a data frame containing reference chemicals and their mechanisms different targets. For example, estrogen is an estrogen receptor agonist. What I would like is to transform the data frame to a list, because I am tired of typing out something like:
refchem$chemical_id[refchem$target=="AR" & refchem$mechanism=="Agonist"]
every time I need to access the list of specific reference chemicals. I would much rather access the chemicals by:
refchem$AR$Agonist
I am looking for a general answer, even though I have given a simplified example, because not all targets have all mechanisms.
This is really easy to accomplish with a loop:
example <- data.frame(target=rep(c("t1","t2","t3"),each=20),
mechan=rep(c("m1","m2"),each=10,3),
chems=paste0("chem",1:60))
oneoption <- list()
for(target in unique(example$target)){
oneoption[[target]] <- list()
for(mech in unique(example$mechan)){
oneoption[[target]][[mech]] <- as.character(example$chems[ example$target==target & example$mechan==mech ])
}
}
I am just wondering if there is a more clever way to do it. I tried playing around with lapply and did not make any progress.
Using split:
split(refchem, list(refchem$target, refchem$mechanism))
Should do the trick.
The new way to access would be refchem$AR.Agonist
If you make a keyed data.table instead, ...
you'll still have all the data in one data.frame (instead of a possibly-nested list of many);
you may find iterating over these subsets nicer; and
the syntax is pretty clean:
To access a subset:
DT[.('AR','Agonist')]
To do something for each group, that will be rbinded together in the result:
DT[,{do stuff},by=key(DT)]
Similar to aggregate(), any list of vectors of the correct length can go into the by, not just the key.
Finally, DT came from...
require(data.table)
DT <- data.table(refchem,key=c('target','mechanism'))
You can also use a plyr function:
library(plyr)
dlply(example, .(target, mechan))
It has the added advantage of using a function to process the data, if needed (there's an implicit identity in the above).

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing<-function(Data,GroupVar,condition){
g=Data$c+3
h=Data$c+1
NewD<-data.frame(a,b,g,h)
subset(NewD,select=c(a,b,GroupVar),GroupVar%in%condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the ease of the function there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
It's happening right as the function is trying to define GroupVar in the beginning. R is looking for the object h by itself (not within the dataframe).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
....
DF <- subset(Data, select=c("a","b", GroupVar))
DF <- DF[DF[,3] %in% condition,]
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.

Resources