Am I using the most efficient (or right) R instructions? - r

This is my first question, so I'll try to get straight to the point.
I'm currently working with tables and I chose R because it puts no limit on data frame sizes and can perform many operations on the data within the tables. I am happy with that, as I can manipulate the data at will: merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec/instruction, over a 6-million-row table, and it took over an hour.
Maybe choosing R was the wrong approach to begin with. I've tried to look for the most efficient ways to run some operations (using list assignments instead of c(list,new_element)), but as far as I can tell this is not something you can optimize with an algorithm or data structure such as graphs or heaps (it is just tables; you have to iterate through all of it). So I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract...) that take less time, or some RStudio configuration that improves performance.
This is the loop, in case it helps to understand the question:
my_list <- vector("list",nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"]%>%pull(1),"%Y-%m-%d")),format = "%Y-%m-%d")
}
The table, as aforementioned, has over 6 Mill rows and 25 variables. I want the list to be filled to append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.

In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 Mill rows: 25.25 sec.
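The three points above can be sketched with a vectorized conversion. This is only a minimal illustration, assuming a small stand-in data frame (the three dates are made up) whose Date_of_count column holds character dates; it is not the exact code that produced the timing above.

```r
# Hypothetical stand-in for the 6-million-row table from the question
df <- data.frame(
  Date_of_count = c("2021-01-05", "2021-02-10", "2021-03-15"),
  stringsAsFactors = FALSE
)

# One vectorized call parses the whole column: no loop, no per-row pull()
parsed <- as.Date(df$Date_of_count, format = "%Y-%m-%d")

# format() is vectorized as well, so the new column is built in a single step
df$Date_formatted <- format(parsed, "%Y-%m-%d")
```

Because both as.Date() and format() operate on the entire vector, the cost of the R-level loop and of the repeated per-row conversions disappears.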

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years, and the illustration of the structure is based on this test). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, and then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested that using loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data but rather to transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

Not aggregating correctly

My goal with this code is to create a loop that aggregates each company's word frequency by a principle vector I created and adds the result to a list. The problem is that, after I run it, each list element contains only the 7 principles that I have rather than the word frequencies alongside them. The word frequencies are a given column of the FREQBYPRINC.AG data frame. Running this code without the loop, just testing a single column, works with no problem. For some reason, the loop doesn't give me the correct data frames in the list. Any suggestions?
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
attach(FREQBYPRINC.AG)
list.agg[i]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
From your description, it seems the code looks like it should work but does not.
Since you previously created list.agg as a list, you need to assign into it with double square brackets. With single brackets, only the first column of each aggregate() result (the principles) is stored. Try this:
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
  # [[i]] stores the whole data frame returned by aggregate()
  list.agg[[i]]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
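The difference between the two bracket forms can be seen on a toy example. The sketch below assumes a made-up data frame (toy, with hypothetical columns principle, co1 and co2) in place of FREQBYPRINC.AG, and it takes the grouping vector from the data frame itself rather than relying on attach().

```r
# Hypothetical toy data standing in for FREQBYPRINC.AG
toy <- data.frame(
  principle = c("a", "b", "a", "b"),
  co1 = c(1, 2, 3, 4),
  co2 = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

res <- vector("list", 2)
for (i in 1:2) {
  # Double brackets keep the full two-column result (Type and x)
  res[[i]] <- aggregate(toy[, i + 1],
                        by = list(Type = toy$principle),
                        FUN = sum, na.rm = TRUE)
}

# Each element is now a data frame with the group labels and the sums
res[[1]]
```

With `res[i] <- ...` instead, R would try to place a two-column data frame into a single list slot, keep only its first column, and issue a warning about the replacement length.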

R approach for iterative querying

This is a question of a general approach in R, I'm trying to find a way into R language but the data types and loop approaches (apply, sapply, etc) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (columns):
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
How do I manage such a list of value groups (row by row)? data.frames are column-focused, and you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a CSV, which is an approach I'm trying to avoid, but I can't find a more elegant way.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
  config <- define_request(site, segment, id)
  result <- query_api(config)
  return(result)  # return is a function in R, so it needs parentheses
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now sapply allows one parameter list plus multiple static parameters. mapply works, but it gives me my data in some crazy output that I can't handle or even understand exactly.
In principle the list of data.frames is OK and the data is correct, but it feels cumbersome to me.
What core concepts of R have I not understood yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.
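As a rough sketch of how this could fit the question, the config table can be written inline as a data.frame and the per-row results stacked with do.call(rbind, ...). The query.data() below is a hypothetical stand-in that only fabricates a small data frame (the hits column is invented); the real function would call the API.

```r
# Config table defined inline, one vector per column
config <- data.frame(
  site    = c("google.com", "bing.com"),
  segment = c("Googleuser", "Binguser"),
  id      = c(123, 456),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real API query; always returns the same columns
query.data <- function(site, segment, id) {
  data.frame(site = site, segment = segment, id = id,
             hits = nchar(site), stringsAsFactors = FALSE)
}

# One call per config row; SIMPLIFY = FALSE keeps a plain list of data frames
list.of.dfs <- mapply(query.data, config$site, config$segment, config$id,
                      SIMPLIFY = FALSE, USE.NAMES = FALSE)

# Stack the identically shaped data frames into one
big.df <- do.call(rbind, list.of.dfs)
```

Setting SIMPLIFY = FALSE is what avoids the hard-to-read matrix-like output that mapply produces by default.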

Best way of storing data in 100 objects for later retrieval?

When doing sequencing, I normally apply TraMineR's seqdef function on a dataset to generate a single sequence object:
sequence_object <- seqdef(data)
However, let's say I want to loop through a dataframe and generate 1 sequence object per every chunk of 10 columns. Then I would do something like this:
colpicks <- seq(10,1000,by=10)
mapply(function(start,stop) seqdef(df[,start:stop]), colpicks-9, colpicks)
Now, I want to store these objects in some suitable manner. Two questions:
What is the most suitable way of storing (or maybe just automatically naming) 100 objects, so that I can easily loop through each of them at a later point?
How can I modify my code above so that it stores the data per your answer to (1)?
"Most suitable" is completely subjective and dependent on your goal.
I'm assuming this question is related to your previous question, and thus I would suggest setting the SIMPLIFY argument of mapply to FALSE (note the upper case; a lower-case simplify would be passed on to the function instead):
myMatrixList <- mapply(.... , SIMPLIFY=FALSE)
However, even that is not necessary, as you can just combine the sapply from the previous question and skip the middle step
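For illustration only, here is a sketch of the list-based storage with SIMPLIFY = FALSE. It assumes a small made-up data frame with 30 columns, and make_object() is a hypothetical stand-in for seqdef() so the example runs without TraMineR; the element names are likewise an invented convention.

```r
# Hypothetical data: 5 rows, 30 columns
df <- as.data.frame(matrix(seq_len(5 * 30), nrow = 5))
colpicks <- seq(10, 30, by = 10)

# Stand-in for seqdef(): simply returns the selected chunk of columns
make_object <- function(start, stop) df[, start:stop]

# SIMPLIFY = FALSE guarantees a list with one element per chunk
objects <- mapply(make_object, colpicks - 9, colpicks, SIMPLIFY = FALSE)

# Name each element after its column range for easy retrieval later
names(objects) <- paste0("seq_", colpicks - 9, "_", colpicks)

# Looping over the stored objects at a later point
chunk_widths <- vapply(objects, ncol, integer(1))
```

A single named list keeps all objects together, so they can be retrieved with objects[["seq_11_20"]] or processed in bulk with lapply().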

Merging dataframes in R on a pre-sorted column?

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data,user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine.
But merge doesn't take advantage of these dataframes being sorted on the common column, which makes the merge pretty CPU/memory heavy. However, this merge could be done in O(n).
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?
I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve.
For most practical purposes, data.table=data.frame + index. As a consequence, when used right, this improves performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've got it up, functions like merge can easily use the index for better performance.
If your set of common keys/indexes is totally overlapping, that is...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE, and they are, as you said, sorted, and there are no key duplicates, then your merging problem is reduced to adding columns to a data.frame. Something along the lines of...
library(log4r)
# my.logger is assumed to be an existing log4r logger object
t1 <- system.time(z <- merge(user, some.data, by='user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))
t2 <- Sys.time()
r <- data.frame(user.id=user$user.id, V1.x=user$V1, V2.x=user$V2)
# copy the columns of some.data over; rows already line up by assumption
r[,names(some.data)] <- some.data[,names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3-t2))
If the assumptions above do not hold, then it gets slightly messier (set union of both key sets, translation function, NA padding), but the sorting and overlapping assumptions alone get you a long way ahead.
Notice also that the timing of the second approach is biased, since it calls Sys.time() twice, unlike the merge() timing, which calls system.time() only once.
(Excuse my lame usage of S.O. mark-up)
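The column-binding idea can also be written with an explicit guard for its assumptions. The following is a minimal base R sketch on made-up data (the values and the data_1 and V1 columns are invented for illustration): it checks that both key columns are identical and free of duplicates before using cbind() in place of merge().

```r
# Hypothetical, already sorted inputs sharing the same keys
user <- data.frame(user.id = 1:4, V1 = c("a", "b", "c", "d"),
                   stringsAsFactors = FALSE)
some.data <- data.frame(user.id = 1:4, data_1 = c(10, 20, 30, 40))

# Guard: keys must match position by position and contain no duplicates
stopifnot(identical(user$user.id, some.data$user.id),
          !anyDuplicated(user$user.id))

# Rows already line up, so the "merge" is just a column bind
fast <- cbind(user, some.data[setdiff(names(some.data), "user.id")])

# Reference result from merge() for comparison
slow <- merge(user, some.data, by = "user.id")
```

If the guard fails, falling back to merge() (or a keyed data.table) is the safer route, since cbind() would silently pair unrelated rows.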

Resources