How to build a function with summation - r

I need to create a function that works like the function counts described in the picture.
Below is what I have tried so far, adapting code from an answer I found here, but I can't work out how the elements of that function translate to my case.
Function I need to replicate
How my database looks
Also, the function can actually be simplified with respect to the one above because the database already shows the total number of cites per year... Am I wrong? This is what I have so far:
j <- patents_grant$Company
t <- patents_grant$Year
x <- patents_grant$count

fun_counts <- function(j, t) {
  for (i in j)
    sum(x[1:M, j], na.rm = T)
}

counts_try <- sapply(1:j, fun_counts, M = 3)
I'm pretty sure this one must be easy to build and I just don't have the knowledge. So even if you just have suggestions on good places to look at to learn how to build functions, that would be immensely appreciated.

What you actually want to do is calculate the 5 (or 3)-year moving total of the number of patents lagged by one year and transformed with log(x + 1).
I have created example data in patent_grants, and this is what one can do:
patent_grants <- expand.grid(Company = LETTERS[1:6],
                             Year = 1990:2010)
patent_grants$count <- rpois(nrow(patent_grants), 4)
M <- 5
We sort the data by Company and Year, which makes things a lot easier, and create a new column for the transformed moving total:
patent_grants <- patent_grants[with(patent_grants, order(Company, Year)),]
patent_grants$count_avg <- NA
We now split the data by Company, since the value is calculated per Company (this is done once for the vector that will hold our moving total, and once for the patent counts). For each company_data (i.e. the patent counts sorted by Year), we then calculate the moving total [the looping over the Companies is done with lapply].
For each year, we select the data of the following M years with company_data[(t + 1):(t + M)], drop possible NA values with na.omit, take the sum, and apply log(x + 1), which is equivalent to log1p(x) [the "for each year" part is done with sapply].
split(patent_grants$count_avg, patent_grants$Company) <-
  lapply(split(patent_grants$count, patent_grants$Company), function(company_data) {
    sapply(seq_along(company_data), function(t) {
      log1p(sum(na.omit(company_data[(t + 1):(t + M)])))
    })
  })
Created on 2022-06-14 by the reprex package (v2.0.1)
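For comparison, here is a hedged sketch of the same calculation with dplyr (untested; it assumes the same patent_grants example data and M as above and should reproduce the count_avg column):
library(dplyr)

patent_grants <- patent_grants %>%
  group_by(Company) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(count_avg = sapply(seq_along(count), function(t) {
    # sum of the M following years, then log(x + 1)
    log1p(sum(count[(t + 1):(t + M)], na.rm = TRUE))
  })) %>%
  ungroup()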

Related

How to: make my code more efficient in R by not creating new dataframes repeatedly?

Problem:
Firstly, I am just starting out. While I was proud of my code, coming back to it and using it on a different variable has made me realise how inefficient and non-replicable it is. In particular, step #3) has a manual component when excluding columns (downpour, precipitation, rainwater), which is not very replicable. Could anyone advise? (It looked worse before, if you can believe it.)
Code:
# 1) Filter for dictionaries containing 1,000 noun counts or more
f1_raincount <- raincount %>% filter(total_ncount >= 1000)

# 2) Filter for dictionaries which contain 3 or more tokens from our set of rain-related tokens
f2_raincount <- f1_raincount

# Compute rain-set count
f2_raincount$set_count <- f2_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(1, function(x) sum(x != 0, na.rm = TRUE))
f2_raincount <- f2_raincount %>% filter(set_count >= 3)

# 3) Select for rain-related noun tokens with frequencies greater than 10 across dictionaries
# First, compute dictionary counts
f3_raincount <- f2_raincount
f3_dict_long <- f3_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(2, function(x) sum(x != 0))

# Second, exclude those under 10: downpour, precipitation, rainwater
f3_raincount <- f3_raincount %>% select(-c(downpour, precipitation, rainwater))

# 4) Given the exclusion in #3, compute rain-set count and filter again
f4_raincount <- f3_raincount
f4_raincount$set_count2 <- f4_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(1, function(x) sum(x != 0))
f4_raincount <- f4_raincount %>%
  filter(set_count2 >= 3) %>%
  select(id:dictsize) # select final rain set
What I normally do is put all ETL code inside an ETL function, even if I only plan to run it once in the entire script.
Why?
It's easier to debug if errors arise, using debug().
While we're on the topic of debugging, it's also easier because the environment will only contain the variables that are actually used, not everything else.
Auxiliary variables are automatically deleted once the function call is over.
It's easier to document that chunk of code with a title.
It's more reproducible.
Because of this, my scripts tend to be 20% setting parameters and libraries, 60% functions, and 20% code that runs those functions.
Your final code should then look like this:
f4_raincount <- funcName(raincount)
with all the other messy code naturally living inside funcName.
As for the actual code, I'd need an actual example (data table and libraries), since it looks to me that you are just adding count columns, which could be done with the mutate function from dplyr. If that is indeed the case, then you have a lot of optimization ahead of you :P. But not knowing what cloud:thunderstorm is, it's hard to give you more feedback.
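For what it's worth, a hedged sketch of what that mutate-based version could look like (untested, since the real data and the cloud:thunderstorm columns aren't shown):
library(dplyr)

f2_raincount <- raincount %>%
  filter(total_ncount >= 1000) %>%
  # count non-zero rain tokens per row without apply()
  mutate(set_count = rowSums(across(cloud:thunderstorm, ~ .x != 0), na.rm = TRUE)) %>%
  filter(set_count >= 3)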
EDIT:
ETL (Extract Transform Load) might not have been a good term for me to mention, since we are only Transforming data and neither Extracting nor Loading.
Either way, I believe it's best if I demonstrate on a chunk of code.
Imagine we have a data frame df.MyData and want to calculate the ratio between 2 variables times a certain ratio (just because).
Here's how one might approach this simple problem:
library(dplyr)

df.MyData <- data.frame( # this is of course a bad idea, but a real-world example would make this code irreproducible, so I went with it anyway
  Group = c("A", "A", "B", "B", "B"),
  Value = c(1, 3, 1, 4, 5)
)
n.Ratio <- 2

n.Value_A <- sum(filter(df.MyData, Group == "A")$Value)
n.Value_B <- sum(filter(df.MyData, Group == "B")$Value)
n.Result  <- n.Value_A / n.Value_B * n.Ratio
Here's how I would do it:
# LIBRARY ####
library(dplyr)

# PARAMETERS ####
df.MyData <- data.frame( # this is of course a bad idea, but a real-world example would make this code irreproducible, so I went with it anyway
  Group = c("A", "A", "B", "B", "B"),
  Value = c(1, 3, 1, 4, 5)
)

# FUNCTIONS ####
fn.CalculateRatio <- function(pf.df.MyData = df.MyData, pf.n.Ratio = 2)
{
  n.Value_A <- sum(filter(pf.df.MyData, Group == "A")$Value)
  n.Value_B <- sum(filter(pf.df.MyData, Group == "B")$Value)
  n.Result  <- n.Value_A / n.Value_B * pf.n.Ratio
  return(n.Result)
}

# PROCESS ####
fn.CalculateRatio()
My approach clearly has more code, so it very well might be disregarded by many, but I prefer it nonetheless as I tend to find it more organized on bigger pieces of code.
Your example would look like this:
fn.MyFunc <- function(pf.raincount = raincount){
  # 1) filter for dictionaries containing 1,000 noun counts or more
  f1_raincount <- pf.raincount %>% filter(total_ncount >= 1000)
  .......[your code (excluding the first 2 rows) goes here]
  return(f4_raincount)
}
fn.MyFunc()
You could naturally go the extra mile and replace what look like arbitrary numbers (1000 and 3) with other variables, placing them in the function itself. That way, should you want to change them, you simply state the values you want explicitly when running the function:
fn.MyFunc(pf.raincount = NEWraincount)
or something else if you define other parameters.
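For example, a hedged sketch of that parameterisation (pf.n.MinNounCount and pf.n.MinSetCount are made-up names that follow the same prefix convention):
fn.MyFunc <- function(pf.raincount = raincount,
                      pf.n.MinNounCount = 1000,
                      pf.n.MinSetCount = 3)
{
  f1_raincount <- pf.raincount %>% filter(total_ncount >= pf.n.MinNounCount)
  # ...the rest of the pipeline goes here, with pf.n.MinSetCount replacing the literal 3...
  return(f4_raincount)
}
fn.MyFunc(pf.raincount = NEWraincount, pf.n.MinNounCount = 500)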
I use prefixes on all my variables to identify what they are: fn for functions, df for data frames, pf for function parameters, n for length-1 numeric vectors... The list is quite extensive, and I go as far as having a rulebook with all the rules I use to stay consistent across projects, but that's a story for another day.
Finally, I find # XXXXXX #### very useful, as it lets me fold chunks of code when I'm not working on them.
Again, this is how I find organization in hundreds or thousands of lines of code. Each of us has to figure out their own style; the only thing I believe we all agree on is that consistency is key.
I'm straying a bit off topic, though. The key idea of using this wrapper function is that the other tables you define stay inside the function environment and get deleted afterwards. It's better to edit the code so they aren't created in the first place, but at the very least you can use this method as a band-aid, since it takes almost no time or skill (I didn't have to understand your code to write this post).

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M of entries) using R - a language I am not yet completely familiar with.
Basically, what I try to do is the following :
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every row in my data frame.
My instinctive (read: R-noob-ish) way of programming this led me to a piece of code that I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my data frame without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
Also, I have programmed in other languages such as Python or C++ (a long time ago), and I know that three nested for loops usually mean you're doing it wrong. But then again, I did not find a way to avoid them in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT :
Thank you for your first advice.
I tried the plyr package and was able to reduce the size of my code chunk; it does what is expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
# Example input
library(plyr)
# (some_time, the window length, is assumed to be defined elsewhere)

DF <- data.frame(sample(1:400, 500000, replace = TRUE),
                 sample(1:25, 500000, replace = TRUE),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)

tempDF <- DF[sample(nrow(DF), 100), ] # for testing purposes

Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example, we only browse 100 observations, which takes ~4 s (measured with system.time()), but if I want to scan all 500,000 observations of DF, it would take more than 5 hours.
I have looked at data.table, but I am still trying to understand how to use it for now.
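For reference, here is a hedged, untested sketch of the same lookup with data.table (it assumes DF and some_time from the example above); a non-equi self-join finds, for each row, the rows of the same Trial whose Start falls inside that row's window:
library(data.table)

DT <- as.data.table(DF)
DT[, rn := .I]  # integer row id, analogous to the rn column above

# one window per row: (Start, Start + some_time) within the same Trial
windows <- DT[, .(Trial, Ref, rn, lo = Start, hi = Start + some_time)]

Test_dt <- DT[windows,
              .(window_rn = i.rn, Elec = i.Ref,
                Trial, Ref = x.Ref, Start = x.Start),
              on = .(Trial, Start > lo, Start < hi),
              nomatch = NULL,
              allow.cartesian = TRUE]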

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR    POS  var_A  var_B
  1    192   0.9    0.7
  1   2000   0.8    0.3
  2      3   0.21   0.76
  2  30009   0.36   0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another; the dependence varies with the distance between them (the closer, the more dependent), so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000

for(j in 1:iter) {
  df = freq[0,]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR == i,]
    fim = t[nrow(t), 2]
    i = t[1, 2]
    f = i + x
    while(f < fim) {
      rows = which(t$POS >= i & t$POS < f)
      s = sample(rows)
      df = rbind(df, t[s,])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient approach would be to use the boot package, whose functions include parallel-processing capabilities.
In particular, tsboot, the time-series bootstrap function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use the bootstrap functions in the boot package, one must first wrap the statistic of interest in a function that includes an index argument. This is how the bootstrap-generated indices pass the sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed" and l = 1000 arguments indicate that you want fixed blocks of length (l) 1000. However, you could use blocks of any size, say 5 or 10, if you're trying to capture nearest-neighbour dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
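If it helps, a hedged usage sketch of capturing and summarising the output (same call as above, simply assigned to an object; data stands in for your data.frame):
b <- tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000,
            parallel = "multicore", ncpus = 4)
b$t0                            # the statistic computed on the original data
quantile(b$t, c(0.025, 0.975))  # a simple percentile interval over the replicates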
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)

res <- lapply(unique(freq$CHR), function(x){
  freq_sel <- freq[freq$CHR == x,]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999,])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
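As a hedged usage sketch, one bootstrap replicate of the statistic can then be obtained by combining the per-chromosome samples and correlating the two variables (res as returned above):
sampled <- do.call(rbind, res)
cor(sampled$var_A, sampled$var_B)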
EDIT:
library(doParallel)
library(foreach)
library(plyr)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

res <- foreach(x = unique(freq$CHR), .packages = 'plyr') %dopar% {
  freq_sel <- freq[freq$CHR == x,]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999,])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1),] else x))
}

stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
library(dplyr)

l = 1000

teste = freq %>%
  mutate(w = ceiling(POS/l)) %>%
  group_by(CHR, w) %>%
  sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with sample_n(1)), and apply whatever statistic of interest you want.
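For instance, a hedged sketch of that repetition (untested; block_cor and cors are made-up names), wrapping the sampling in a function and replicating it to get a bootstrap distribution of the correlation:
block_cor <- function(freq, l = 1000) {
  freq %>%
    mutate(w = ceiling(POS / l)) %>%
    group_by(CHR, w) %>%
    sample_n(1) %>%
    ungroup() %>%
    summarise(r = cor(var_A, var_B)) %>%
    pull(r)
}

cors <- replicate(1000, block_cor(freq, l = 1000))
quantile(cors, c(0.025, 0.975))  # simple percentile interval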

Converting slow "WHILE" loop to "apply"-type function

I have created a while loop that is being executed across a sizable data set. The loop is as such:
i = 1
while(i <= m){
  Date = London.Events$start_time[i]
  j = 1
  while(j <= n){
    Hotel = London.Hotels$AS400.ID[j]
    Day.Zero[i, j] = sum(London.Bookings$No.of.Rooms[London.Bookings$Stay.Date == Date &
                                                     London.Bookings$Legacy.Hotel.Code == Hotel])
    j = j + 1
  }
  i = i + 1
}
Where:
m = 9957 #Number of Events
n = 814 #Number of Hotels
Day.Zero = as.data.frame(matrix(0, 9957, 814))
Briefly explained: for each combination of date and hotel (pulled from two other data frames), produce the sum of the column London.Bookings$No.of.Rooms and deposit it in the corresponding cell of the matrix.
The loop appears to run without error; however, after stopping it at 5+ minutes it was still running and nowhere near complete!
I would like to know how one of the apply family of functions could be used as a replacement here for much faster completion.
Thanks!
Probably,
xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)
gets you something similar to what you want.
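If the end goal is the full events-by-hotels Day.Zero matrix, here is a hedged sketch building on that xtabs call (untested; object and column names are taken from the question):
tab <- xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)

# reorder rows/columns to match the events and hotels tables; missing combinations become 0
Day.Zero <- tab[match(as.character(London.Events$start_time), rownames(tab)),
                match(as.character(London.Hotels$AS400.ID), colnames(tab))]
Day.Zero[is.na(Day.Zero)] <- 0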
Using library dplyr, you can do something like the following (assuming your input data frame has such column names - vaguely interpreted from your code / question):
library(dplyr)
London.Bookings %>% group_by(Legacy.Hotel.Code, Stay.Date) %>% summarise(Total.No.of.Rooms = sum(No.of.Rooms))
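If the wide date-by-hotel layout is needed afterwards, a hedged follow-up sketch (assumes the tidyr package is available; same assumed column names):
library(dplyr)
library(tidyr)

London.Bookings %>%
  group_by(Legacy.Hotel.Code, Stay.Date) %>%
  summarise(Total.No.of.Rooms = sum(No.of.Rooms), .groups = "drop") %>%
  pivot_wider(names_from = Legacy.Hotel.Code,
              values_from = Total.No.of.Rooms,
              values_fill = 0)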

Using mapply() in R over rows, vs. columns

I handle a great deal of survey data and the like in my work, and I often have to write scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns of subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables: the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores, gender, age) {
  if(gender == "Male") {
    if(8 <= age & age <= 11) {convertvec <- c(1:12)}
    if(12 <= age & age <= 14) {convertvec <- c(13:24)}
  }
  else if(gender == "Female") {
    if(8 <= age & age <= 11) {convertvec <- c(25:36)}
    if(12 <= age & age <= 14) {convertvec <- c(37:48)}
  }
  converted_scores <- rep(0, 12)
  for(z in 1:12) {
    converted_scores[z] <- conversion_table[(unlist(rawscores) + 1)[z],
                                            convertvec[z]]
  }
  rm(z)
  return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[, 21:32] <- 0
for(x in 1:dim(mydata)[1]) {
  tscc_scores[x, 21:32] <- converter(mydata[x, 7:18],
                                     mydata[x, "gender"],
                                     mydata[x, "age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach: it does not iterate through the vectors passed to the subsequent arguments, but instead passes each of them whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite converter so that it takes vectors of gender and age plus a row index, which you then use to do lookups and assignments to converted_scores, using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply, since it will convert all of its x arguments to "character" class because the gender column is "character". It wasn't clear whether your code normdf[rawscores + 1, convertvec] was supposed to be an array extraction or a function call.
Untested in the absence of a working example (with normdf, mydata):
converted_scores <- matrix(NA, nrow = NROW(rawscores), ncol = 12)

converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))
  aidx <- findInterval(age, c(8, 12, 15))
  ag.idx <- gidx + 2*aidx - 1
  # the aidx factor needs to match the number of valid age categories
  cvt <- cvt.arr[ag.idx, ]
  converted_scores[idx] <- normdf[rawscores + 1, convertvec]
  return(converted_scores)
}

cvt.arr <- matrix(1:48, nrow = 4, byrow = TRUE)[c(1, 3, 2, 4), ] # the genders alternate
cvt.scores <- mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age)
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[, 1:12] <- matrix(idxs$norm, ncol = 12)
The tricky part here is the idxs data frame, which ties everything together. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here as a first thought and see whether I can come up with a better answer, either based on joran's comment or by using a three- or four-dimensional array for normdf. Not sure yet.

Resources