How to reduce memory usage in R while looping through dataframes

I have two data frames and I am doing a grouping operation based on weighted scores. I used profvis to profile the code and found that looping through the data frames to check and add group labels is a costly operation. I understand we can use lapply, but I am not sure how to pass two data frames and a new variable to it. Please help; I just need to reduce the time and space complexity of this code using apply-family functions.
rank1 <- c()
occup_cats <- c()
for (i in 1:length(data_set$primary_occupation)) {
  for (j in 1:length(occup_cat_prop$Category)) {
    if (as.character(data_set$primary_occupation[i]) == as.character(occup_cat_prop$income_source[j])) {
      rank1[i] <- occup_cat_prop$prop[j]
      occup_cats[i] <- as.character(occup_cat_prop$Category[j])
    }
  }
}
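For reference, a vectorized lookup is a common replacement for this nested-loop pattern. A minimal sketch, assuming each income_source value appears at most once in occup_cat_prop (match() returns the first hit per element):
# match() computes, in one pass, the row of occup_cat_prop that
# corresponds to each primary_occupation; unmatched entries yield NA.
idx        <- match(as.character(data_set$primary_occupation),
                    as.character(occup_cat_prop$income_source))
rank1      <- occup_cat_prop$prop[idx]
occup_cats <- as.character(occup_cat_prop$Category[idx])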

Related

function to remove all observations that contain a "prohibited" value - R

I have a large dataset looking like:
There are 43 different values for PID overall. I have identified the PIDs that need to be removed and summarized them in a vector:
I want to remove all observations (rows) from my data set that contain one of the PIDs from the vector NullNK. I have tried writing a function for it, but I get an error (I have never written functions before):
for (i in length(NullNK)) {
  SR_DynUeber_einfam <- SR_DynUeber_einfam[-which(SR_DynUeber_einfam$PID == NullNK(i)), ]
}
How can I efficiently remove the observations containing PIDs from the NullNK vector from my original data set?
What is wrong with my function?
Thanks!
For basic operations like this, for loops are often not needed. This does what you are looking for:
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK,]
One mistake in your function is NullNK(i): you subset a vector in R with square brackets, NullNK[i]. Another is for (i in length(NullNK)), which iterates only once, over the single value length(NullNK); to loop over every index you would write for (i in seq_along(NullNK)).
Hope this helps!
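A toy illustration of the %in% filter, with made-up data following the question's names:
SR_DynUeber_einfam <- data.frame(PID = c(1, 2, 3, 4, 2), value = c(10, 20, 30, 40, 50))
NullNK <- c(2, 4)
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK, ]  # keeps the rows with PID 1 and 3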

R - Perform a Levene test with two samples of different sizes

So let's say that I have these two arrays with different sizes.
And I'd like to perform a Levene test with the function leveneTest().
However, the way I have been taught converts these to a data frame and then calls melt to make a data structure readable by the function. This ends up recycling the shorter array so that both columns have the same length.
dataServers <- as.data.frame(cbind(down25, down27))  # dataServers is 124*2 now: cbind recycles the shorter vector
dataServers <- melt(dataServers, variable.name = "Server", value.name = "DownTimes")  # melt() is from reshape2
leveneTest(DownTimes ~ Server, dataServers, center = mean)
What would be the easiest way around this?
You can simply stack your two samples of different sizes into one data frame and perform the leveneTest:
stacked <- stack(list(down25 = down25, down27 = down27))
car::leveneTest(values ~ ind, data = stacked, center = mean)
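For reference, stack() on a named list returns a long-format data frame with a values column and an ind grouping factor, so the groups may have different lengths:
str(stack(list(a = 1:3, b = 1:5)))
# 'data.frame': 8 obs. of 2 variables:
#  $ values: int  1 2 3 1 2 3 4 5
#  $ ind   : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2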
Could you not just make them independent data frames, add an identifier column to each, and then rbind them together and run the test? A reproducible sketch follows after these steps:
make dataframe1 with id and value
make dataframe2 with id and value
rbind the two together and skip the melt step
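A minimal sketch of that approach, assuming car is installed; down25 and down27 are simulated here to stand in for the question's samples:
library(car)
set.seed(1)
down25 <- rnorm(124)  # stand-in sample 1
down27 <- rnorm(98)   # stand-in sample 2, a different size
df1 <- data.frame(Server = "down25", DownTimes = down25)
df2 <- data.frame(Server = "down27", DownTimes = down27)
combined <- rbind(df1, df2)                 # no recycling, no melt step
combined$Server <- factor(combined$Server)  # leveneTest expects a factor grouping
leveneTest(DownTimes ~ Server, data = combined, center = mean)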

R approach for iterative querying

This is a question about a general approach in R. I'm trying to find my way into the R language, but the data types and loop approaches (apply, sapply, etc.) are still a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (columns):
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
How do I manage such a list of value groups (row by row)? Data frames are column-focused; you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a CSV, which is an approach I try to avoid, but I can't find anything more elegant.
Now I want to query my data, let's say with this function:
query.data <- function(site, segment, id) {
  config <- define_request(site, segment, id)
  result <- query_api(config)
  return(result)
}
This will give me a data.frame as a result, which means every query returns the same columns. So my result should be one big data.frame, not a list of similar data.frames.
Now, sapply allows one varying parameter list plus multiple static parameters. mapply works, but it gives me my data in some crazy output that I can't handle or even understand exactly what it is.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.
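A sketch of the whole pattern: query.data, define_request and query_api are the question's hypothetical functions, so a stub stands in for the API call here, and the config table is defined directly as a data.frame.
# Stub mimicking the question's query.data: returns one data.frame per call.
query.data <- function(site, segment, id) {
  data.frame(site = site, segment = segment, id = id, visits = nchar(site))
}
# The config table can live in the script as a plain data.frame:
config <- data.frame(site    = c("google.com", "bing.com"),
                     segment = c("Googleuser", "Binguser"),
                     id      = c(123, 456))
# SIMPLIFY = FALSE keeps a list of data.frames instead of an unwieldy matrix:
list.of.dfs <- mapply(query.data, config$site, config$segment, config$id,
                      SIMPLIFY = FALSE)
big.df <- do.call(rbind, list.of.dfs)  # one aggregated data.frame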

Converting a list of data frames into individual data frames in R [duplicate]

I have been searching high and low for what I think is an easy solution.
I have a large data frame that I split by factors.
eqRegions <- split(eqDataAll, eqDataAll$SeismicRegion)
This now creates a list object of the data frames by region; there are 8 in total. I would like to loop through the list to make individual data frames using another name.
I can execute the following to convert the list items to individual data frames, but I suspect there is a loop mechanism that would be faster if I have many factors.
testRegion1 <- eqRegions[[1]]
testRegion3 <- eqRegions[[3]]
I can manually perform the above and it handles it nicely, but if I have many regions it's not efficient. What I would like to do is the equivalent of the following:
for (i in 1:length(eqRegions)) {
  region[i] <- as.data.frame(eqRegions[[i]])
}
I think the key is to define region before the loop, but it keeps overwriting itself rather than incrementing. Many thanks.
Try
list2env(eqRegions,envir=.GlobalEnv)
This should work. The names of the data.frames created will be equal to the names within eqDataAll$SeismicRegion. Anyway, this practice of populating individual data.frames is not recommended; the more I work with R, the more I love/use lists.
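A self-contained toy demo of that pattern (eqDataAll is simulated here, since the original data is not shown):
eqDataAll <- data.frame(SeismicRegion = rep(c("North", "South"), each = 3),
                        magnitude     = c(4.1, 5.2, 6.0, 4.8, 5.5, 6.3))
eqRegions <- split(eqDataAll, eqDataAll$SeismicRegion)
list2env(eqRegions, envir = .GlobalEnv)  # creates data.frames North and South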
lapply(names(eqRegions), function(x) assign(x, eqRegions[[x]], envir = .GlobalEnv))
Edit: use the list2env solution posted above; I was not aware of the list2env function.
attach(eqRegions) should be enough. But I recommend working with them in list form using lapply. I guarantee it will result in simpler code.
list2env returns the data frames to the global environment, named after the names in the list. An alternative, if you want the data frames to share one base name and be identified by i from a loop:
for (i in 1:length(eqRegions)) {
  assign(paste0("eqRegions", i), as.data.frame(eqRegions[[i]]))
}
This can be slow if the list gets too long.
As an alternative, a "best practice" when splitting data like this is to keep the data.frames within a list, as provided by split. To process it, use sapply or lapply (with many factors) and capture the output back in a list. For instance:
eqRegionsProcessed <- lapply(eqRegions, function(df) {
## do something meaningful here
})
This obviously only works if you are doing the same thing to each data.frame.
If you really must break them out and deal with each data.frame uniquely, then @MatthewPlourde's and @MaratTalipov's answers will work.

Saving many subsets as dataframes using "for"-loops

This question might be very simple, but I cannot find a good way to solve it:
I have a dataset with many subgroups which need to be analysed both all together and on their own. Therefore, I want to use subsets for the groups and use them in the later analysis. Both the definition of the subsets and the analysis should be partly done with loops, in order to save space and to ensure that the same analysis has been done with all subgroups.
Here is an example of my code using an example dataframe from the boot package:
library(boot)  # provides the aids example data
data(aids)
qlist <- c("1", "2", "3", "4")
for (i in length(qlist)) {
  paste("aids.sub.", qlist[i], sep = "") <- subset(aids, quarter == qlist[i])
}
The variable which contains the subgroups in my dataset is stored as a string; that is why I added the qlist part, which would not be required otherwise.
Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter == x))
Equivalently, avoiding subset():
lapply(qlist, function(x) aids[aids$quarter == x, ])
It is likely the case that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you can use one of the subsets, as created below). But you can also iterate over it (using for or lapply) without having to construct variable names.
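For instance, a named list keeps the subsets addressable without creating loose variables (a sketch reusing the question's aids and qlist objects):
aids.subs <- lapply(qlist, function(x) subset(aids, quarter == x))
names(aids.subs) <- paste0("aids.sub.", qlist)
head(aids.subs[["aids.sub.2"]])  # pull one quarter's data frame by name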
To do the job as you are asking, use assign:
for (i in qlist) {
  assign(paste("aids.sub.", i, sep = ""), subset(aids, quarter == i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.
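The assigned data frames can then be retrieved programmatically with get(), for example:
head(get(paste0("aids.sub.", qlist[1])))  # inspect the first subset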
