In R, Create Summary Data Frame from Multiple Objects - r

I'm trying to create a "summary" data frame that holds some high-level stats about a few objects in my R project. I'm having trouble even accomplishing this simple task and I've tried using For loops and Apply functions with no luck.
After searching (a lot) on SO I'm seeing that For loops might not be the best performing option, so I'm open to any solution that gets the job done.
I have three objects: text1, text2 and text3, of class "Large Character (vectors)" (imagine I might be exploring these objects and will create an NLP predictive model from them). Each is > 250 MB in size (upwards of 1 million "rows" each) once loaded into R.
My goal: Store the results of object.size() length() and max(nchar()) in a table for my 3 objects.
Method 1: Use an Apply() Function
Issue: I haven't successfully applied multiple functions to a single object. I understand how to do simple applies like lapply(x, mean) but I'm falling short here.
Method 2: Bind Rows Using a For loop
I'm liking this solution because I almost know how to implement it. A lot of SO users say this is a bad approach, but I'm lacking other ideas.
sources <- c("text1", "text2", "text3")
text.summary <- data.frame()
for (i in sources){
text.summary[i ,] <- rbind(i, object.size(get(i)), length(get(i)),
max(nchar(get(i))))
}
Issue: This returns the error data length exceeds size of matrix - I know I could define the structure of my data frame (on line 2), but I've seen too much feedback on other questions that advise against doing this.
Thanks for helping me understand the proper way to accomplish this. I know I'm going to have trouble doing NLP if I can't even figure out this simple problem, but R is my first foray into programming. Oof!

Just try for example:
do.call(rbind, lapply(list(text1,text2,text3),
function(x) c(objectSize=c(object.size(x)),length=length(x),max=max(nchar(x)))))
You'll obtain a matrix with one row per object. You can coerce it to a data frame later with as.data.frame() if you need one.
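A minimal sketch of the same idea that ends in a data frame with labelled rows. The three short vectors are hypothetical stand-ins for the large text objects, and the column name maxChars is my own choice; vapply() is used here only because it checks the shape of each result.

```r
# Hypothetical stand-ins for the three large character vectors
text1 <- c("a", "bb", "ccc")
text2 <- c("dddd", "e")
text3 <- c("ff", "ggggg", "h", "ii")

# A named list, so the object names survive as row labels
texts <- list(text1 = text1, text2 = text2, text3 = text3)

# One set of statistics per object; vapply checks that each result
# is a numeric vector of length 3
stats <- vapply(
  texts,
  function(x) c(objectSize = as.numeric(object.size(x)),
                length     = length(x),
                maxChars   = max(nchar(x))),
  numeric(3)
)

# vapply returns one column per object, so transpose before converting
text.summary <- as.data.frame(t(stats))
text.summary
```

Keeping the objects in a named list also avoids get(), since the names travel with the data.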

Related

Am I using the most efficient (or right) R instructions?

This is my first question, so I'll try to get straight to the point.
I'm currently working with tables, and I chose R because it puts no limit on data frame size and can perform many operations on the data in the tables. I'm happy with that: merges, concatenation, and row and column manipulation all work fine. But I recently had to run a loop, at about 0.00001 sec per instruction, over a table with 6 million rows, and it took over an hour.
Maybe R was the wrong choice to begin with. I've looked for more efficient ways to run some operations (for example, assigning into a preallocated list instead of using c(list, new_element)). As far as I can tell, this isn't something you can optimize with a cleverer algorithm or data structure such as a graph or heap: it's just a table, and you have to iterate through all of it. So I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract...) that take less time, or RStudio settings that improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list",nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"]%>%pull(1),"%Y-%m-%d")),format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want to fill the list and then append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 million rows: 25.25 sec.
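To illustrate the "use vectors, avoid loops" advice, here is a minimal sketch that converts and formats a whole date column in one call. The small data frame is a hypothetical stand-in for the 6-million-row table, and the new column name Date_formatted is my own.

```r
# Hypothetical stand-in for the Date_of_count column of the real table
df <- data.frame(Date_of_count = c("2020-01-15", "2020-02-29", "2021-12-31"),
                 stringsAsFactors = FALSE)

# as.Date() and format() are vectorized: one call handles every row,
# so no loop and no intermediate list are needed
df$Date_formatted <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"),
                            "%Y-%m-%d")
df
```

The result is already a character vector of the same length as the table, so it can be assigned directly as a column.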

Rstudio - how to write smaller code

I'm brand new to programming and am picking up RStudio as a stats tool.
I have a dataset which includes multiple questionnaires divided by weeks, and I'm trying to organize the data into meaningful chunks.
Right now this is what my code looks like:
w1a=table(qwest1,talm1)
w2a=table(qwest2,talm2)
w3a=table(quest3,talm3)
Where quest and talm are the names of the variable and the number denotes the week.
Is there a way to compress all those lines into one line of code so that I could make w1a,w2a,w3a... each their own object with the corresponding questionnaire added in?
Thank you for your help, I'm very new to coding and I don't know the etiquette or all the vocabulary.
This might do what you wanted (but not what you asked for):
tbl_list <- mapply(table, list(qwest1, qwest2, quest3),
list(talm1, talm2, talm3), SIMPLIFY = FALSE)  # SIMPLIFY = FALSE always returns a list
names(tbl_list) <- c('w1a', 'w2a','w3a')
You are committing a fairly typical new-R-user error in creating multiple similarly named and structured objects but not putting them in a list. This is my effort at pushing you in that direction. Could also have been done via:
qwest_lst <- list(qwest1, qwest2, quest3)
talm_lst <- list(talm1, talm2, talm3)
tbl_lst <- mapply(table, qwest_lst, talm_lst, SIMPLIFY = FALSE)
names(tbl_lst) <- paste0('w', 1:3, 'a')
There are other ways to programmatically access objects with character vectors using get or mget.
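A minimal sketch of looking up objects by name with mget(), assuming weekly variables that follow a regular naming pattern. The two weeks of data below are hypothetical, and Map() is used in place of mapply() because it always returns a list.

```r
# Hypothetical weekly variables standing in for the questionnaire data
quest1 <- c("yes", "no", "yes"); talm1 <- c("a", "a", "b")
quest2 <- c("no", "no", "yes");  talm2 <- c("b", "a", "b")

# mget() looks up several objects by name and returns a named list
quest_lst <- mget(paste0("quest", 1:2))
talm_lst  <- mget(paste0("talm", 1:2))

# Map() pairs the lists element by element, giving one table per week
tbl_lst <- Map(table, quest_lst, talm_lst)
names(tbl_lst) <- paste0("w", 1:2, "a")
tbl_lst$w1a
```

Each weekly table is then available as tbl_lst$w1a, tbl_lst$w2a and so on, without creating separate objects.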

Explaining Simple Loop in R

I successfully wrote a for loop in R. That is okay and I am very happy that it works. But I also want to understand what I've done exactly because I will have to work with loops later on in my analysis as well.
I work with Raster Data (DEMs). I load them into the environment as rasters and then I use the getValues function in the loop as I want to do some calculations. Looks as follows:
list <- dir(pattern=".tif", full.names=T)
tif.files <- list()
tif.files.values <- tif.files
for (i in 1: length(list)){
tif.files[[i]] <- raster (list[[i]])
tif.files.values[[i]] <- getValues(tif.files[[i]])
}
Okay, so far so good. What I don't get is why I have to define tif.files and tif.files.values before I use them in the loop, or why they have to be defined exactly the way I did it. For the first part, the raster operation, I was following a pattern. Maybe someone can explain the context. I really want to understand R.
When you do:
tif.files[[i]] <- raster (list[[i]])
then tif.files[[i]] is the result of running raster(list[[i]]), so that is storing the raster object. This object contains the metadata (extent, number of rows, cols etc) and the data, although if the tiff is huge it doesn't actually read it in at the time.
tif.files.values[[i]] <- getValues(tif.files[[i]])
That line calls getValues on the raster object, which reads the values from the raster and returns a vector. The values of the grid cells are now in tif.files.values[[i]].
Experiment by printing tif.files[[1]] and tif.files.values[[1]] at the R prompt.
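On why tif.files and tif.files.values must exist before the loop: an assignment of the form x[[i]] <- value modifies an existing object, so R needs x to be there first. A minimal sketch with hypothetical file names, using nchar() as a placeholder for raster() and getValues() so that no package is required:

```r
# Hypothetical file names; nothing is read from disk in this illustration
files <- c("a.tif", "b.tif", "c.tif")

# Without this line the first assignment in the loop would fail with
# "object 'results' not found", because [[<- modifies an existing object
results <- vector("list", length(files))

for (i in seq_along(files)) {
  results[[i]] <- nchar(files[[i]])  # placeholder for raster()/getValues()
}
results
```

Starting from list(), as in the question, also works because R grows the list on each assignment; vector("list", n) simply allocates all n slots up front.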
Note
This is R, not RStudio, which is the interface you are using that has all the buttons and menus. The R language exists quite happily without it, and your question is just a language question. I've edited and tagged it now for you.

How to make loops in R that operate on and return multiple objects

This is my first post, and I think I have looked thoroughly for my answer with no luck, but I might not be typing in the right search terms, since I am relatively new to R. I apologize if this has been answered before; if it has, a link would be greatly appreciated.
In essence, I am trying to make a loop that will operate on a set of data frames that I have read into R from .txt files using read.table. I am working with simulated vegetation data organized into many species by site matrices, so it would be best for me if I could create loops that will just operate on the objects I have read in using some functions I have made and then put out new objects into my workspace with a specific naming pattern (e.g. put "_av" on the end of the name of the object operated on when creating a new object).
For convenience's sake, let's say I have only four matrices I want to work with, all of which contain the phrase "mod" for model. I have read that I can put these data frames into a list of data frames with the following code:
list.mods=lapply(ls(pattern="mod"),get)
This does create a list, but I have been having trouble getting my functions to actually operate on it. From what I read, this is the best way to make a list of objects you want to operate on.
So let's say that list.mods is now my list of operable matrices: mod1, mod2, mod3, and mod4. Also, let's say I have a function that simply calculates Bray-Curtis dissimilarity as follows:
bc=function(x){
vegdist(x,method="bray")
}
I can use this by typing in:
mod1.bc=bc(mod1)
That works. But it seems like I should be able to apply my list of models to the function bc and have it output the models with a pattern mod1.bc, mod2.bc, mod3.bc, and mod4.bc. I cannot get my list of files to work in the function much less save each operation as a new object with a patterned name.
What am I doing wrong? In the end I might have as many as a hundred models or more and would really appreciate being able to create a list of items that I can run through loops.
Thanks in advance.
You can use lapply again:
new.list.mods <- lapply(list.mods, bc)
This will return a new list in which each element is the result of applying bc to the corresponding element of list.mods.
The 'apply' family of functions in R basically allows you to save typing. If that's easier for you to understand, you can use a 'for loop' instead. Of course you will need to know how to access elements in a list for that. There is a question about that.
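A minimal sketch of getting results with patterned names such as mod1.bc. The two small matrices are hypothetical, and dist() from base R stands in for vegdist() so that the example runs without the vegan package; the anchored pattern "^mod[0-9]+$" is my own choice, to avoid picking up other objects whose names contain "mod".

```r
# Hypothetical small matrices standing in for the species-by-site data
mod1 <- matrix(1:6, nrow = 3)
mod2 <- matrix(6:1, nrow = 3)

# Stand-in for the question's bc(); dist() replaces vegan::vegdist() here
bc <- function(x) dist(x)

# mget() keeps the object names, unlike lapply(ls(...), get)
list.mods <- mget(ls(pattern = "^mod[0-9]+$"))

# Apply bc to every model and label each result after its source object
bc.results <- lapply(list.mods, bc)
names(bc.results) <- paste0(names(list.mods), ".bc")
names(bc.results)
```

The results stay together in one list and are reached as bc.results$mod1.bc, which scales to a hundred models without a hundred workspace objects.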
How about collecting the names of the models/objects you want into a list:
mod_list <- sapply(ls(pattern = "mod"), as.name)
and then looping over them with your function:
output_list <- lapply(mod_list, function(nm) bc(eval(nm)))  # evaluate each name to get its object
With this approach you avoid creating the potentially large and redundant list.mods object in your example. Also, I think this will result in conveniently named lists.

call columns from inside a for loop in R

I basically want to be able to call columns from inside a for loop (in reality two nested for loops), using paste() and the i (j..) value of the loop to access my data frames column-wise in a flexible manner.
#for the showcase I use the standard cars example
r1 <- cars
r2 <- cars
# in case there are more data to consider I would want to add, ore remove further with out changing the rest
# here I am entering the "dimension" of what I want to compare for the showcase its only one
num_r <- 2 #total number of reactors in the experiment
for( i in 1:num_r)
{
# shoud create proxie variable to be processed further
assign(paste("proxi_r",i,sep="", colapse="") , do.call("matrix",
list(get(paste("r",i,"$speed",sep="", colapse="" )))))
# further operations of gluing and arranging data follow so they fit tests formatting requirements
}
which gives me:
Error in get(paste("r", i, "$speed", sep = "", colapse = "")) :
object 'r1$speed' not found
But when I type r1$speed, it obviously exists?
So far I have searched for "R object dont exist inside loop", "using paste() to access variables inside loop", "for loops and objects", "do.call inside loops" and similar.
Is there any way to circumvent get(), so that I don't have to look into the topic of environments but can keep my loops flexible? I don't want to re-edit my script every time the experimental configuration changes, which is really time-consuming and lets a lot of errors sneak in.
The size of the data, with extensive use of Excel macros (which everyone in the lab here is using), has crashed Excel several times :) , so there is no going back to the comfort zone.
I am now trying to dig into R programming with an R statistics book and a lot of googling and reading tutorials, so please forgive my naive approach and my lousy English.
I would be very thankful for any tips, as I feel sort of stuck right now.
This is a common confusion. You've created an object name "r1$speed" , i.e. a complete character string. This is not the same as the object r1 subsetted by $speed .
Try using get(paste('r',i,collapse='',sep=''))$speed
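If you would rather avoid get() altogether, one option is to keep the data frames in a named list and index columns with [[ ]], which accepts a character string. A minimal sketch using the built-in cars data from the question; the names reactors, column and proxies are hypothetical.

```r
# Keep the reactor data in one named list instead of r1, r2, ...
reactors <- list(r1 = cars, r2 = cars)

# [[ ]] accepts a column name stored in a variable,
# so no get() and no pasting of "$" is needed
column <- "speed"
proxies <- lapply(reactors, function(d) matrix(d[[column]]))

dim(proxies$r1)
```

Adding or removing a reactor then means changing only the list, and the rest of the script loops over whatever it contains.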
