Convert data frame to matrix without looping - r

The Question:
I have a data frame with a column that shows whether an event occurred, and columns for month, day, and year. These last 3 were converted to a date vector. I want to make a matrix that shows whether or not an event occurred within a given time period. In this matrix, a row represents a site and a column a date. I was able to write a for loop to do it, but it seemed like there might be a better way to do this, either with apply or some other basic operation. How would you do this?
The Code:
#Initialize events matrix
events = matrix(FALSE,nrow(predicted),ncol(predicted))
# Mark the presence of events
for (i in 1:nrow(events)){
if ((days_from_start[i]>-1)&(days_from_start[i]<=ncol(predicted)))
events[i,days_from_start[i]] = !input_data$Event[i]
}
The Background:
The next step is to compare the events matrix against various model outputs with the same shape. There are relatively few events in the data frame compared to the matrix size; the (probably incorrect) assumption is that the data frame completely lists all events and that unlisted matrix cells did not experience an event. I’m very new to R, so I’d be interested in hearing about other approaches to the same problem, if you think I’m going about it the hard way.
The Data:
> input_data$Event[1:5]
[1] FALSE FALSE FALSE FALSE TRUE
> input_data$Year[1:5]
[1] 2010 2010 2011 2010 2010
> days_from_start[1:5]
Time differences in days
[1] 834 1018 1106 847 1055
> dim(predicted)
[1] 649 732

Since events[i, days_from_start[i]] accesses more or less arbitrary locations in the events matrix (presumably there is no pattern to days_from_start), it may be difficult to avoid a loop. Possibly something like the following will work. I haven't tested this, since you posted no datasets.
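# which(foo) replaces (1:i)[foo], which depended on the leftover loop variable i;
# > 0 (not > -1) keeps a zero offset out of the index pairs
foo <- (days_from_start > 0) & (days_from_start <= ncol(predicted))
index_matrix <- cbind(which(foo), as.integer(days_from_start[which(foo)]))
events[index_matrix] <- !input_data$Event[index_matrix[, 1]]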
The first statement creates a logical vector that is TRUE for the rows you want to update.
The next statement builds the set of (row, column) index pairs where data will be inserted into the events matrix. The last statement does the insertion.
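As a small, self-contained illustration of the matrix-indexing idea above: the toy values below are hypothetical stand-ins for the question's predicted, days_from_start and input_data objects, and valid and idx are illustrative names. It keeps the negation of Event only because the question's loop does.

```r
# Hypothetical toy inputs standing in for the question's objects
predicted <- matrix(0, nrow = 4, ncol = 5)
days_from_start <- c(2, 7, 0, 5)   # 7 and 0 fall outside columns 1..5
input_data <- data.frame(Event = c(FALSE, FALSE, TRUE, FALSE))

events <- matrix(FALSE, nrow(predicted), ncol(predicted))

# Rows whose day offset is a valid column index
valid <- which(days_from_start >= 1 & days_from_start <= ncol(predicted))

# Each row of idx names one (row, column) cell of events
idx <- cbind(valid, days_from_start[valid])

# Indexing a matrix by a two-column matrix assigns all cells in one step
events[idx] <- !input_data$Event[valid]
```

Here only rows 1 and 4 have usable offsets, so only events[1, 2] and events[4, 5] are set.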

Related

How do I find combined proportions in an R table?

Excuse the horrible title. I don't think I'd be able to summarise this problem in such few words.
So I have a table in R with data whose proportions by row have been calculated using the prop.table function ( prop.table(tab, 1) ). It looks like this:
The row headings (i.e. Q1-00-05, etc.) denote times of the day. The column headings TRUE and FALSE denote whether a particular 999 call was responded to within 10 minutes.
What I need from this table is the proportion of 999 calls responded to efficiently (< 10 mins) between 1800hrs and 0500 hrs.
I tried doing this:
tab2<-table(callouts$daytime=="Q4-18-23"|"Q1-00-05", callouts$tenmins)
but this proved fruitless. I got an error message saying:
operations are possible only for numeric, logical or complex types
I expected the table to come out with TRUE or FALSE as the row headings (for whether the callout time was within this time frame or not) and TRUE or FALSE as the column headings (for whether the response time was sub-10mins)
Any help would be much appreciated. Thanks!
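The error arises because == binds more tightly than |, so R evaluates (callouts$daytime == "Q4-18-23") | "Q1-00-05", and | cannot be applied to a character string. A minimal sketch of one way to get the expected table, using %in% instead; the toy data and the "Q2-06-11" and "Q3-12-17" labels are made up for illustration, and only the column names come from the question:

```r
# Hypothetical toy data; only the column names come from the question
callouts <- data.frame(
  daytime = c("Q1-00-05", "Q2-06-11", "Q4-18-23", "Q4-18-23", "Q3-12-17", "Q1-00-05"),
  tenmins = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# %in% tests each daytime value against both labels at once
night <- callouts$daytime %in% c("Q4-18-23", "Q1-00-05")
tab2 <- table(night, callouts$tenmins)

# Row proportions; the ["TRUE", "TRUE"] cell is the share of
# night-time calls that were answered within 10 minutes
prop_night <- prop.table(tab2, 1)["TRUE", "TRUE"]
```

With this toy data, four calls fall in the night-time window and three of them were answered in time.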

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id=c(rep(1,5),rep(2,10))
,day=c(1:5,1:10)
,device_repaired=sample(0:1,15,replace=TRUE)
,device_replaced=sample(0:1,15,replace=TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3,1,data,"device_repaired",2)
getCalculation <- function(fday,fdeviceid,fdata,fattribute,fpreviousdays){
  # Subset dataset
  df = subset(fdata,day<fday & day>(fday-fpreviousdays-1) & device_id==fdeviceid)
  # Make sure there's enough data; if so, make calculation
  if(nrow(df)<fpreviousdays){
    calculation = NA
  } else {
    calculation = sum(df[,fattribute])
  }
  return(calculation)
}
My problem is that the number of attributes available (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown rapidly, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these arguments to an lapply call.
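On the narrower point of passing extra arguments through an apply-style call, one possible sketch uses mapply with MoreArgs. The deterministic toy values and the name repaired_last_2days are illustrative assumptions. This replaces the explicit loop but still calls getCalculation once per row, so on its own it is not expected to be much faster; parallel::mcmapply accepts the same arguments if parallel execution is wanted.

```r
# Deterministic toy data (hypothetical values) with the sample's column names
data <- data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                   day = c(1:5, 1:10),
                   device_repaired = rep(c(1, 0, 1), 5))

# Same logic as the function in the question, written compactly
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) &
                 device_id == fdeviceid)
  if (nrow(df) < fpreviousdays) NA else sum(df[, fattribute])
}

# mapply walks day and device_id in parallel, one call per row;
# MoreArgs holds the arguments that stay fixed across calls
repaired_last_2days <- mapply(getCalculation, data$day, data$device_id,
                              MoreArgs = list(fdata = data,
                                              fattribute = "device_repaired",
                                              fpreviousdays = 2))
```

The result has one entry per row of data, with NA wherever fewer than two earlier days exist for that device.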

Iterate process in R using range of vectors derived from matrix

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two sets of data matrices. "A":
            [Ac1] [Ac2] ... [Ac500]
[Ar1]          25    30 ...      15
[Ar2]           7    54 ...      41
...
[Ar25000]
and
"B", which is similar in the number of columns, but not the number of rows:
            [Bc1] [Bc2] ... [Bc500]
[Br1]          25    30 ...      15
[Br2]           7    54 ...      41
...
[Br20000]
I'm running a module ("npSeq") in R that takes matrix A as a constant input, together with a horizontal vector that includes all of the values from one row of matrix B, e.g. row [1]. The module returns a separate list of values. I will need to run the analysis independently for all of the rows in matrix B, saving all of the returned lists, which I will then need to combine.
However, I would like to know if there is a way to automate the process so that the module runs using a vector derived from row [Br1], saves the returned list, and then runs the process again using the vector derived from row [Br2], repeating the process until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.
You can use lapply to loop over B's row indices:
result.list <- lapply(1:nrow(B), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.
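To make the pattern concrete without the real module, here is a small sketch in which run_analysis is a hypothetical placeholder for npSeq (its return value is invented), and A and B are tiny stand-in matrices. It also shows one way the per-row lists might be combined afterwards, assuming each returned list holds scalar values.

```r
# Small stand-ins: A and B share their column count, as in the question
A <- matrix(1:12, nrow = 4, ncol = 3)
B <- matrix(c(25, 7, 30, 54, 15, 41), nrow = 2, ncol = 3)

# Hypothetical placeholder for npSeq(A, b); the real module's output will differ
run_analysis <- function(A, b) list(total = sum(A %*% b), largest = max(b))

# One call per row of B; seq_len() also behaves sensibly if B has zero rows
result.list <- lapply(seq_len(nrow(B)), function(i) run_analysis(A, B[i, ]))

# Combine the per-row lists into one data frame, one row per row of B
combined <- do.call(rbind, lapply(result.list, as.data.frame))
```

If the real lists contain vectors of different lengths, keeping result.list as a list is likely the safer choice.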

Selecting multiple parts of a list

I have a data frame with 100 entries, and I want to get a field's value for a subset of the entries. Specifically, I want every other block of 10 entries (i.e. indices 1-10, 21-30, 41-50, 61-70, ...)
The only way I've been able to do this is via: c(data$field[1:10],data$field[21:30],...)
But this seems like a horrible solution, especially if the size of the data frame changes.
You can do
data$field[rep(c(TRUE, FALSE), each = 10)]
where rep creates a vector of ten TRUE followed by ten FALSE, which is recycled as needed when used for indexing.
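A quick way to convince yourself of the recycling, using a plain vector as a stand-in for data$field (field, keep and selected are illustrative names):

```r
field <- 1:100                           # stands in for data$field
keep <- rep(c(TRUE, FALSE), each = 10)   # 20 values: ten TRUE, then ten FALSE
selected <- field[keep]                  # keep is recycled five times to cover 100 entries
```

Because the index is built from the block size rather than from hard-coded ranges, the same line keeps working when the number of entries changes.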

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame for the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT to the total occurrences by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri=levels(u$uri)[l],
                        hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"

[[1]]$hits
  u$time%/%10 u$action
1   135568390      0.5


[[2]]
[[2]]$uri
[1] "/some/uri"

[[2]]$hits
  u$time%/%10 u$action
1   135568390      0.5
Or otherwise filter the data frame by URI before computing the ratio.
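As a sketch of that filter-first variant in base R: the column names u_top$hit and u_top$bucket and the object names top_uris and hit_ratio are illustrative, and the data are just the four sample rows from the question. Comparing action against the string "TCP_HIT" avoids relying on factor level order, so it works whether action is a factor or a character column.

```r
# Toy data matching the sample rows in the question
u <- data.frame(time = c(1355683900, 1355683900, 1355683905, 1355683906),
                uri = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
                action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS"))

# Keep only the most frequent URIs (top 50, as in the question)
top_uris <- head(names(sort(table(u$uri), decreasing = TRUE)), 50)
u_top <- u[u$uri %in% top_uris, ]

# Hit ratio per URI and ten-second bucket;
# the mean of a logical vector is the fraction of TRUE values
u_top$hit <- u_top$action == "TCP_HIT"
u_top$bucket <- u_top$time %/% 10
hit_ratio <- aggregate(hit ~ uri + bucket, data = u_top, FUN = mean)
```

The result is a single long data frame with one row per URI and bucket, which tends to be easier to plot than a nested list.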
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here--data.table is just faster.
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
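# h is assumed to be the list returned by the lapply call in the answer above
for (i in seq_along(h)) {
  name <- unlist(h[[i]][1])                            # URI for this element
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))   # its hits table
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}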
