Use of x`apply` to speed up loops

This is a more focused version of another question I have open, "Vectorize/Speed up Code with Nested For Loops".
Basically, I want to speed up the execution of this code. I was thinking of using one of the apply family of functions. The apply function would have to use/perform the following:
Input: loop over regions 1 to 10; vectors sed and borewidth with preallocated dimensions filled with NAs
Process: fill data in each of sed and borewidth in the manner implemented in the inner for loop
Output: sed and borewidth vectors
Assumptions (h/t Simon Urbanek): the begin and finish points of the rows are contiguous and sequential, and each region begins at 0.
Code is as below:
for (region in 1:10) {
# subset standRef and sample by region code
standRef.region <- standRef[which(standRef$region == region),]
sample.region <- sample[which(sample$region == region),]
for (i in 1:nrow(sample.region))
{
# create a dataframe - locations - that includes:
# 1) those indices of standRef.region in which the value of the location column is greater than the value of the ith row of the begin column of sample.region
# 2) those indices of standRef.region in which the value of the location column is less than the value of the ith row of the finish column of sample.region
locations <- standRef.region[which((standRef.region$location > sample.region$begin[i]) & (standRef.region$location < sample.region$finish[i])),]
sed[end_tracker:(end_tracker + nrow(locations))] <- sample.region$sed[i]
borewidth[end_tracker:(end_tracker + nrow(locations))] <- sample.region$borewidth[i]
# update end_tracker to the number of locations rows for this iteration
end_tracker <- end_tracker + nrow(locations)
}
cat("Finished region", region,"\n")
}
Sample data for borewidth and sed. Edit: corrected formatting error in dput
structure(list(region = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
begin = c(0L, 2253252L, 7091077L, 9120205L, 0L, 135094L,
941813L, 5901391L, 6061324L), finish = c(2253252L, 7091077L,
9120205L, 17463033L, 135094L, 941813L, 5901391L, 6061324L,
7092402L), sed = c(3.31830840984048, 1.38014704208403, 6.13049140975458,
2.10349875097134, 0.48170587509345, 0.13058713509175, 9.13509713513509,
6.13047153058701, 3.81734081501503), borewidth = c(3L, 5L,
2L, 1L, 1L, 1L, 2L, 4L, 4L)), .Names = c("region", "begin",
"finish", "sed", "borewidth"), class = "data.frame", row.names = c(NA,
-9L))
TIA.

With some extra assumptions based on the data you posted (incl. the other question), this is one way you could do it:
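index <- unlist(lapply(unique(standRef$region), function(reg) {
  # rows of standRef and sample that belong to this region
  reg.filter <- which(standRef$region == reg)
  samp.filter <- which(sample$region == reg)
  # cut() gives, for each location, the position of the interval it falls in
  samp.filter[cut(standRef$location[reg.filter], c(0L, sample$finish[samp.filter]), labels = FALSE)]
}))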
sed <- sample$sed[index]
borewidth <- sample$borewidth[index]
The extra assumption is that your samples are contiguous, sequential (all your examples were) and start at 0. This allows us to use cut() on the $finish values instead of treating each interval separately. One difference is that your code left gaps at the breaks, but I'm assuming that was not intentional.
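To see why cut() can replace the inner loop, here is a minimal sketch on made-up toy values. The names finish, sed_vals and location are hypothetical and stand in for one region of sample$finish, sample$sed and standRef$location.

```r
# Hypothetical toy data: one region with three contiguous intervals starting at 0
finish   <- c(10L, 25L, 40L)           # interval end points: (0,10], (10,25], (25,40]
sed_vals <- c(3.3, 1.4, 6.1)           # one value per interval
location <- c(2L, 10L, 11L, 30L, 40L)  # positions to look up

# cut() returns, for each location, the index of the interval it falls in.
# With the default right = TRUE each interval includes its upper end point.
idx <- cut(location, breaks = c(0L, finish), labels = FALSE)
idx

# Indexing by idx expands the per-interval values to one value per location
sed_vals[idx]
```

Note that a location of exactly 0 would give NA with these defaults; passing include.lowest = TRUE to cut() is one way to include it.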

Related

Merging three factors so their dependent variable sums in R

Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)

If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))

Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
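# the first argument is left unnamed: aggregate()'s formula method calls it
# `formula` before R 4.2.0 and `x` from 4.2.0 on
aggregate(NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)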

There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk adds a new column to df named group.sum, holding the sum of NUMBER within each group defined by group_by().
If you want to condense the data frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do it, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum # overwrite NUMBER with the group totals
df <- df[,-ncol(df)] # drop the last column (group.sum)
df <- unique(df)
A side note: I wouldn't recommend the chunk above, because you lose information in your data, and your data is tidier with just the extra column group.sum.

I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp. Applied to your data, the same idea looks like this: assign one new name per existing level, in the order given by levels(temp$PROG). Levels that receive the same name are merged.
temp$PROG2 <- temp$PROG
levels(temp$PROG2) # check the order first: positions 4 to 8 are the Grad levels
levels(temp$PROG2) <- c(rep("Other",3),rep("Grad",5),rep("Other",12))
temp
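As a small worked example of merging levels and then summing, here is a sketch on a hypothetical four-row data frame (toy) that mirrors the 50 + 25 + 2 = 77 example from the question. The extra "Basic" row and its value are made up to show that other levels are left out.

```r
# Hypothetical toy data mirroring the example in the question
toy <- data.frame(
  YEAR   = factor(rep("92/93", 4)),
  AGE    = factor(rep("20-24", 4)),
  PROG   = factor(c("Grad2", "Grad3", "Grad.H", "Basic")),
  NUMBER = c(50L, 25L, 2L, 10L)
)

# Build one new name per existing level; levels given the same name are merged
old <- levels(toy$PROG)
levels(toy$PROG) <- ifelse(old %in% c("Grad2", "Grad3", "Grad.H"), "Grad", old)

# Keep only the merged level and sum NUMBER within each YEAR/AGE combination
grad <- aggregate(NUMBER ~ YEAR + AGE,
                  data = subset(toy, PROG == "Grad"),
                  FUN = sum)
grad
```

Renaming by name rather than by position avoids depending on the order in which the levels happen to be stored.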

Create function to count values across list of columns

R folks:
I have a dataframe with many sets of columns. Each set is a bank of survey items. I would like to count the number of columns in each set having a certain value. I wrote a function to do this but it results in a list of repeated values that is appended to my dataframe.
df<- structure(list(RespondentID = c(6764279930, 6779986023, 6760279439,
6759243066),
q1 = c(3L, 3L, 4L, 1L),
q2 = c(2L, 2L, 4L, 4L),
q3 = c(4L, 2L, 4L, 5L),
q0010_0004 = c(1L, 2L, 3L, 1L)),
.Names = c("RespondentID", "q1", "q2", "q3", "q4"),
row.names = c(NA, 4L), class = "data.frame")
group1<-c("q1","q2","q3","q4")
# Objective: Count number of ratings==4 for each row
# Make function that receives list of columns &
# then returns ONE column in dataframe with total # columns
# having certain value (in this case, 4)
countcol<-function(colgroup) {
s<-subset(df, select=c(colgroup)) #select only the columns designated by list
s$sum<-Reduce("+", apply(X=s,1,FUN=function(x) (sum(x==4, na.rm = TRUE)))) # count instances of value==4
s2<-subset(s,select=c(sum)) # return ONE column with result for each row
return(s2$sum) }
countcol(group1)
My function, countcol runs without errors but as stated above results in what appears to be a transposed list of results for each row. I would like to have ONE number for each row that indicates the count of values.
I attempted various apply functions here but could not prevail. Anyone have a tip?
Thanks!

rowSums can give the result the OP is looking for. This returns the count of ratings == 4 for each row.
rowSums(df[2:5]==4)
#1 2 3 4
#1 0 3 1
Or just part of the OP's function can give the answer.
apply(df[2:5], 1, function(x)(sum(x==4)))
#1 2 3 4
#1 0 3 1
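Since the question asks for a function that takes a group of columns, the rowSums approach could be wrapped as below. This is only a sketch: count_value is a hypothetical helper name, the column n4 is made up, and the data frame is retyped from the question's dput.

```r
# Same data as in the question, retyped as a plain data.frame
df <- data.frame(
  RespondentID = c(6764279930, 6779986023, 6760279439, 6759243066),
  q1 = c(3L, 3L, 4L, 1L),
  q2 = c(2L, 2L, 4L, 4L),
  q3 = c(4L, 2L, 4L, 5L),
  q4 = c(1L, 2L, 3L, 1L)
)
group1 <- c("q1", "q2", "q3", "q4")

# count_value() is a hypothetical helper: for each row, count how many of the
# columns named in `cols` equal `value` (NAs are ignored)
count_value <- function(data, cols, value = 4) {
  rowSums(data[cols] == value, na.rm = TRUE)
}

# One number per row, stored as a single new column
df$n4 <- count_value(df, group1)
df$n4
```

Passing the data frame as an argument, rather than reading df from the global environment as countcol does, makes the helper reusable for other banks of survey items.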

Can I use %in% to search and match two columns?

I have a large dataframe and a vector of terms of interest that I want to pull out. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull the information out into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and print the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")

First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example I am not sure this is exactly what you want, but R uses the operator | for "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
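If changing y by hand is not practical, one possible alternative is to lower-case both sides before matching. The sketch below assumes that ignoring case is acceptable for your terms. The data frame is retyped from the question's dput with character columns, and keep is a made-up name.

```r
# Data from the question, retyped with character columns
data <- data.frame(
  Gene     = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  LocusTag = c("pBPS1_01", "pBPS1_02", "pLeuDn_02", "pleuBTgp4", "pleuBTgp5"),
  hit      = c("Ibp protein", "repA1 protein", "ORF1",
               "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = FALSE
)
y <- c("ibp", "orf1")

# A row is kept if either column matches any term, ignoring case
keep <- tolower(data$Gene) %in% tolower(y) | tolower(data$hit) %in% tolower(y)
new.data <- data[keep, ]
new.data
```

%in% matches whole values only, so "ibp" matches the Gene value "ibp" but not the hit value "Ibp protein".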

Constructing All Possible Pairs within Groups

I have a large amount of graph data in the following form. Suppose a person has multiple interests.
person,interest
1,1
1,2
1,3
2,1
2,5
2,2
3,2
3,5
...
I want to construct all pairs of interests for each user. I would like to convert this into an edgelist like the following. I want the data in this format so that I can convert it into an adjacency matrix for graphing etc.
person,x_interest,y_interest
1,1,2
1,1,3
1,2,3
2,1,5
2,1,2
2,5,2
3,2,5
There is one solution here: Pairs of Observations within Groups but it works only for small datasets as the call to table wants to generate more than 2^31 elements. Is there another way that I can do this without having to rely on table?

We can use data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1) and, grouped by 'person', get the unique pairwise combinations of 'interest' to create two columns ('x_interest' and 'y_interest').
library(data.table)
setDT(df1)[,{tmp <- combn(unique(interest),2)
list(x_interest=tmp[c(TRUE, FALSE)], y_interest= tmp[c(FALSE, TRUE)])} , by = person]
NOTE: To speed up, combnPrim from library(gRbase) could be used in place of combn.
data
df1 <- structure(list(person = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
interest = c(1L,
2L, 3L, 1L, 5L, 2L, 2L, 5L)), .Names = c("person", "interest"
), class = "data.frame", row.names = c(NA, -8L))
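For comparison, the same grouping idea can be sketched in base R with split() and combn(). This is an illustration rather than a tuned solution: the names pairs_by_person, n_pairs and edges are made up, and the data.table version is likely faster on large data.

```r
df1 <- data.frame(
  person   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  interest = c(1L, 2L, 3L, 1L, 5L, 2L, 2L, 5L)
)

# One data frame of pairs per person
pairs_by_person <- lapply(split(df1$interest, df1$person), function(v) {
  v <- unique(v)
  # Guard: with a single number n, combn(n, 2) would enumerate pairs of 1:n
  if (length(v) < 2) return(NULL)
  m <- combn(v, 2)  # 2 x k matrix, one pair per column
  data.frame(x_interest = m[1, ], y_interest = m[2, ])
})

# Number of pairs per person, used to repeat the person id
n_pairs <- vapply(pairs_by_person, NROW, integer(1))

edges <- do.call(rbind, pairs_by_person)
edges <- data.frame(
  person = rep(as.integer(names(pairs_by_person)), n_pairs),
  edges
)
rownames(edges) <- NULL
edges
```

The length check matters because a person with exactly one interest would otherwise produce spurious pairs.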

Counting an event only every X days per subject (in an irregular time series)

I've got data where I'm counting episodes of care (like ER visits). The trick is, I can't count every single visit, because sometimes a 2nd or 3rd visit is actually a follow-up for a previous problem. So I've been given direction to count visits by using a 30 day "clean period" or "black out period", such that, I look for the first event (VISIT 1) by patient (min date), I count that event, then apply rules so as NOT to count any visits that occur in the 30 days following the first event. After that 30 day window has elapsed, I can begin looking for the 2nd visit (VISIT 2), count that one, then apply the 30 day black out again (NOT counting any visits that occur in the 30 days after visit #2)... wash, rinse, repeat...
I have rigged together a very sloppy solution that requires a lot of babysitting and manual checking of steps(see below). I have to believe that there is a better way. HELP!
data1 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "patient1", "patient2",
"patient3"), class = "factor"), Date = structure(c(14610, 14610,
14627, 14680, 14652, 14660, 14725, 15085, 15086, 14642, 14669,
14732, 14747, 14749), class = "Date"), test = c(1L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L)), .Names = c("ID", "Date",
"test"), class = "data.frame", row.names = c(NA, 14L))
library(doBy)
## create a table of first events
step1 <- summaryBy(Date~ID, data = data1, FUN=min)
step1$Date30 <- step1$Date.min+30
step2 <- merge(data1, step1, by.x="ID", by.y="ID")
## use an ifelse to essentially remove any events that shouldn't be counted
step2$event <- ifelse(as.numeric(step2$Date) >= step2$Date.min & as.numeric(step2$Date) <= step2$Date30, 0, 1)
## basically repeat steps above until I dont capture any more events
## there just has to be a better way
data3 <- step2[step2$event==1,]
data3<- data3[,1:3]
step3 <- summaryBy(Date~ID, data = data3, FUN=min)
step3$Date30 <- step3$Date.min+30
step4 <- merge(data3, step3, by.x="ID", by.y="ID")
step4$event <- ifelse(as.numeric(step4$Date) >= step4$Date.min & as.numeric(step4$Date) <= step4$Date30, 0, 1)
data4 <- step4[step4$event==1,]
data4<- data4[,1:3]
step5 <- summaryBy(Date~ID, data = data4, FUN=min)
step5$Date30 <- step5$Date.min+30
## then I rbind the "keepers"
## in this case steps 1 and 3 above
final <- rbind(step1,step3, step5)
## then reformat
final <- final[,1:2]
final$Date.min <- as.Date(final$Date.min,origin="1970-01-01")
## again, extremely clumsy, but it works... HELP! :)

This solution is loop-free and uses only base R. It produces a logical vector ok which selects the acceptable rows of data1.
ave runs the indicated anonymous function over each patient separately.
We define a state vector consisting of the current date and the start of the period for which no other dates are considered. Each date is represented by as.numeric(x) where x is the date. step takes the state vector and the current date and updates the state vector. Reduce runs it over the data and then we take only results for which the minimum and current date are the same and for which the current date is not a duplicate.
step <- function(init, curdate) {
c(curdate, if (curdate > init[2] + 30) curdate else init[2])
}
ok <- !!ave(as.numeric(data1$Date), paste(data1$ID), FUN = function(d) {
  # one row per date: column 1 is the current date, column 2 the start of its window
  x <- do.call("rbind", Reduce(step, d, c(-Inf, 0), accumulate = TRUE))
  x[-1,1] == x[-1,2] & !duplicated(x[-1,1])
})
data1[ok, ]
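The state update that Reduce performs can also be written as an explicit loop, which may be easier to follow. The sketch below is an illustration of the same rule, not the answer's code: keep_visits is a hypothetical helper name and p1 holds the four dates of patient1 from the question.

```r
# keep_visits() is a hypothetical helper: TRUE for each visit that should be
# counted, i.e. one that falls more than `period` days after the last counted visit
keep_visits <- function(dates, period = 30) {
  d <- as.numeric(dates)
  keep <- logical(length(d))
  last_counted <- -Inf
  for (i in order(d)) {            # walk through the visits in date order
    if (d[i] > last_counted + period) {
      keep[i] <- TRUE
      last_counted <- d[i]         # this visit starts a new blackout window
    }
  }
  keep
}

# Dates of patient1 from the question
p1 <- as.Date(c(14610, 14610, 14627, 14680), origin = "1970-01-01")
keep_visits(p1)
```

A duplicate date is never counted twice, because the second copy is not more than 30 days after the first.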

Since that kind of manipulation is not straightforward and is error-prone, I would write a separate function to discard events in the blackout period. The function contains a loop, which basically does what you were doing by hand, until there is nothing left to do.
blackout <- function(dates, period=30) {
dates <- sort(dates)
while( TRUE ) {
spell <- as.numeric(diff(dates)) <= period
if(!any(spell)) { return(dates) }
i <- which(spell)[1] + 1
dates <- dates[-i]
}
}
# Tests
stopifnot(
length(
blackout( seq.Date(Sys.Date(), Sys.Date()+50, by=1) )
) == 2
)
stopifnot(
length(
blackout( seq.Date(Sys.Date(), by=31, length=5) )
) == 5
)
It can be used as follows.
library(plyr)
ddply(data1, "ID", summarize, Date=blackout(Date))

How about
do.call('rbind', lapply(split(data1, factor(data1$ID)), function(x) (x <- x[order(x$Date),])[c(T, diff(x$Date) > 30),]))
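One thing to check before using this one-liner: it compares each visit with the visit immediately before it, which is not always the same as a 30-day window that starts at the last counted visit. The sketch below uses made-up visit days (days) to show a case where the two rules differ. The names gap_rule and anchored are hypothetical.

```r
days <- c(0, 20, 40)  # hypothetical visit days for one patient

# Rule used in the one-liner: keep a visit if the gap to the previous visit exceeds 30
gap_rule <- c(TRUE, diff(days) > 30)

# Rule described in the question: keep a visit if it is more than 30 days
# after the last visit that was counted
anchored <- logical(length(days))
last_counted <- -Inf
for (i in seq_along(days)) {
  if (days[i] > last_counted + 30) {
    anchored[i] <- TRUE
    last_counted <- days[i]
  }
}

gap_rule
anchored
```

Here day 40 is only 20 days after the uncounted visit on day 20, so the gap rule drops it, while the anchored rule counts it because it is 40 days after the counted visit on day 0.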
