Iterate process in R using range of vectors derived from matrix - r

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two sets of data matrices. "A":
[Ac1] [Ac2] ... [Ac500]
[Ar1] 25 30 ... 15
[Ar2] 7 54 ... 41
...
[Ar25000]
and
"B" which is similar in the number of columns, but not the number of rows
[Bc1] [Bc2] ... [Bc500]
[Br1] 25 30 ... 15
[Br2] 7 54 ... 41
...
[Br20000]
I'm running a module ("npSeq") in R that takes two inputs: matrix A, which stays the same on every run, and a horizontal vector that holds all of the values from one row of matrix B, e.g. row [Br1]. The module returns a separate list of values. I need to run the analysis independently for every row in matrix B, saving all of the returned lists, which I will then need to combine.
However, I would like to know if there is a way to automate the process so that the module runs using a vector derived from row [Br1], saves the returned list, then runs again using the vector derived from row [Br2], and so on until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.

You can use lapply to loop over B's row indices:
result.list <- lapply(1:nrow(B), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.
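To make the pattern concrete, here is a minimal sketch on toy data. The function analyse_row is a hypothetical stand-in for npSeq (whose real arguments and return value are not shown in the question); it only mimics the "matrix plus one row vector in, list out" shape. The matrix sizes are also made up.

```r
# Hypothetical stand-in for the real analysis function (NOT npSeq itself):
# takes the full matrix A and one row of B, returns a list.
analyse_row <- function(A, b) {
  list(total = sum(b), cor_with_first_row = cor(A[1, ], b))
}

set.seed(42)
A <- matrix(rnorm(5 * 4), nrow = 5)   # toy matrix: 5 rows, 4 columns
B <- matrix(rnorm(3 * 4), nrow = 3)   # toy matrix: 3 rows, 4 columns

# One call per row of B; seq_len() is safe even if B has zero rows
result.list <- lapply(seq_len(nrow(B)), function(i) analyse_row(A, B[i, ]))

# Combine one element of every returned list into a single vector
totals <- vapply(result.list, function(r) r$total, numeric(1))
```

The returned lists stay in row order, so result.list[[i]] always belongs to row i of B.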

Related

Convert data frame to matrix without looping

The Question:
I have a data frame with a column that shows whether an event occurred, and columns for month, day, and year. These last 3 were converted to a date vector. I want to make a matrix that shows whether or not an event occurred within a given time period. In this matrix, a row represents a site and a column a date. I was able to write a for loop to do it, but it seemed like there might be a better way to do this, either with apply or some other basic operation. How would you do this?
The Code:
#Initialize events matrix
events = matrix(FALSE,nrow(predicted),ncol(predicted))
# Mark the presence of events
for (i in 1:nrow(events)){
if ((days_from_start[i]>-1)&(days_from_start[i]<=ncol(predicted)))
events[i,days_from_start[i]] = !input_data$Event[i]
}
The Background:
The next step is to compare the events matrix against various model outputs with the same shape. There are relatively few events in the data frame compared to the matrix size; the (probably incorrect) assumption is that the data frame completely lists all events and that unlisted matrix cells did not experience an event. I’m very new to R, so I’d be interested in hearing about other approaches to the same problem, if you think I’m going about it the hard way.
The Data:
> input_data$Event[1:5]
[1] FALSE FALSE FALSE FALSE TRUE
> input_data$Year[1:5]
[1] 2010 2010 2011 2010 2010
> days_from_start[1:5]
Time differences in days
[1] 834 1018 1106 847 1055
> dim(predicted)
[1] 649 732
Since events[i,days_from_start[i]] is accessing more or less random locations in the events matrix (presumably there is no pattern to days_from_start), it may be difficult to avoid a loop. Possibly something like the following will work. I haven't tested this since you posted no datasets.
foo <- (days_from_start >= 1) & (days_from_start <= ncol(predicted))
rows <- which(foo)
index_matrix <- cbind(rows, as.numeric(days_from_start[rows]))
events[index_matrix] <- !input_data$Event[rows]
The first line creates a vector of logicals, TRUE where you want to do something. (I test for >= 1 rather than > -1 because column 0 does not exist.)
The next two lines create a set of (row, column) index pairs where you're going to insert data into the events matrix. The last line does the insertion.
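The index-matrix idea can be checked on a small example. The sketch below uses made-up values (event, days and n_days are hypothetical stand-ins for input_data$Event, days_from_start and ncol(predicted)) and compares the loop with the vectorised assignment. It keeps the negation from the question's code and uses >= 1 as the lower bound, since column 0 does not exist.

```r
# Toy data (hypothetical values)
event  <- c(FALSE, TRUE, TRUE, FALSE)
days   <- c(2, 5, -3, 1)   # day 5 and day -3 fall outside a 4-column matrix
n_days <- 4

# Loop version, as in the question
events_loop <- matrix(FALSE, length(days), n_days)
for (i in seq_along(days)) {
  if (days[i] >= 1 && days[i] <= n_days) events_loop[i, days[i]] <- !event[i]
}

# Vectorised version: a two-column matrix of (row, column) pairs
events_vec <- matrix(FALSE, length(days), n_days)
ok <- days >= 1 & days <= n_days
events_vec[cbind(which(ok), days[ok])] <- !event[ok]

same <- identical(events_loop, events_vec)
```

Indexing a matrix with a two-column matrix picks one cell per row of the index, which is what makes the single assignment equivalent to the loop.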

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the solution proposed in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home on the example dataset and it worked fine. However, in my real data the problem is not solved.
Here's the output of my actual data (original 37 firms)
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra zero count, even though I explicitly ask to keep only the groups that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad<-cbind(as.data.frame(matrix(seq(1:40),8,5)),factors = c("q","w","e","r"), year = c("1991","1992", "1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid data frame based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad, and it reports two of the factors as NA. In itself this is not much of a problem, but in my real dataset, which has many more variables and more subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why what works perfectly in a small data.frame would work differently in a larger one, here p.data (N = 592, T = 16 and n = 37)? I find that when I run two identical tapply functions, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared, literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
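One possible explanation, offered as a sketch and not confirmed against the real data: tapply returns one value per group, not one per row, so using its result directly as a row index makes R recycle a short logical vector over all rows and select the wrong ones. That would also change every group sum in the subset. The toy data frame below is hypothetical and uses plain base R rather than plm; ave() returns the group sum repeated on every row, which gives a row-wise condition.

```r
# Toy data (hypothetical): 3 firms, 2 rows each; firm "a" sums to zero
df <- data.frame(firm  = factor(rep(c("a", "b", "c"), each = 2)),
                 value = c(0, 0, 1, 2, 0, 3))

# One sum per firm: a vector of length 3, not length 6
group_sums <- tapply(df$value, df$firm, sum)

# Recycled over 6 rows, the length-3 logical picks rows 2, 3, 5, 6,
# so firm "a" survives and firms "b" and "c" lose rows
wrong <- df[group_sums != 0, ]

# Row-wise version: ave() repeats each firm's sum on all of its rows
row_sums <- ave(df$value, df$firm, FUN = sum)
right <- droplevels(df[row_sums != 0, ])

# Sums on the correct subset match the original non-zero sums
right_sums <- tapply(right$value, right$firm, sum)
```

If this is the cause, droplevels() was working all along; the rows handed to it were simply not the intended ones.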

Multiple Matrix Operations in R with loop based on matrix name

I'm a novice R user, who's learning to use this coding language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I managed to get the following:
# kn.matrices<-with(patents, table(Class,year,firm))
# kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
# for (k in 1:37)
# kn.firms = assign(paste("firm", k ,sep=''),kn.matrices[,,k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk with k going from 1 until 37.
I would like to perform many operations in each of the firm-specific matrices (e.g. compare the numbers in app.yr 't' with the average of the 3 previous years across all rows) so I am looking for a way that allows me to loop the operations for every matrix named firm1,firm2,firm3...,firm37 and that generates new matrices with consistent naming, e.g. firm1.3yearcomparison
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following comments I'm trying to add a minimal reproducible example
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.
It looks like you already have a 3-dimensional array with all your data. You can view it as your firm matrices piled one on top of the other. You don't need to split it into separate matrices and loop over their names. Instead, you can use R's apply function and the array extraction operators. The help topic on the apply() family should show you how to do what you want. Here are a few basic examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])
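As a sketch of how the three-year comparison from the question might be written against such an array: the array kn below is a small, randomly generated stand-in for kn.matrices, and three_year_diff is a hypothetical helper name. It subtracts the mean of the three preceding years from each year's count, row by row, and assumes at least four year columns.

```r
set.seed(1)
# Hypothetical stand-in for kn.matrices: 4 classes x 6 years x 3 firms
kn <- array(rpois(4 * 6 * 3, lambda = 2), dim = c(4, 6, 3),
            dimnames = list(class = paste0("c", 1:4),
                            year  = 1989:1994,
                            firm  = c("a", "b", "c")))

# For one class-by-year matrix: value in year j minus the mean of the
# three preceding years. The first three years have no full window: NA.
three_year_diff <- function(m) {
  out <- matrix(NA_real_, nrow(m), ncol(m), dimnames = dimnames(m))
  for (j in 4:ncol(m)) {
    out[, j] <- m[, j] - rowMeans(m[, (j - 3):(j - 1), drop = FALSE])
  }
  out
}

# One result matrix per firm, stored in a named list instead of
# separately named objects
firms <- dimnames(kn)$firm
comparison <- lapply(firms, function(f) three_year_diff(kn[, , f]))
names(comparison) <- paste0(firms, ".3yearcomparison")
```

Keeping the results in a named list (comparison[["a.3yearcomparison"]]) avoids assign() and makes it easy to loop over all firms again later.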

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping bin size constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample(unlist(data))
list(one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector holding a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by supplying a factor argument to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as an argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
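The split-based approach can be wrapped into a small function. The sketch below is one way to write it; partition_sample is a hypothetical name (not an existing R function), and it samples without replacement. The grouping vector is built as a factor with explicit levels so that bins of size zero would still come back as empty vectors.

```r
# Randomly partition x into bins with the given sizes (no replacement).
# partition_sample is a hypothetical helper, not part of base R.
partition_sample <- function(x, sizes) {
  stopifnot(sum(sizes) == length(x))
  shuffled <- x[sample.int(length(x))]          # random permutation of x
  groups <- factor(rep(seq_along(sizes), times = sizes),
                   levels = seq_along(sizes))   # bin label for each position
  unname(split(shuffled, groups))
}

data  <- list(c(1, 5, 6, 3), c(2, 4, 7, 8), c(9), c(10, 11, 15), c(12, 13, 14))
sizes <- lengths(data)                           # integer vector 4 4 1 3 3

set.seed(123)
rand.data <- partition_sample(unlist(data), sizes)

# Several randomizations, kept as a list of lists
many <- replicate(3, partition_sample(unlist(data), sizes), simplify = FALSE)
```

The grouping vector is as long as the data, but rep() and split() both run in compiled code, so for about 10^6 values this should not be the bottleneck; timing it on the real data is the only way to be sure.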

Find powerset of all unique combinations of vector of strings

I am trying to find all of the unique groupings of a vector/list of items, length 39. Below is the code I have:
x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
"TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
"TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
"TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
"TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
"TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
"TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
"TmaxKLEX","TminKSDF","TmaxKSDF")
# Generate a list with the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
However, the code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. thanks.
To calculate all unique subsets, you are simply creating all binary vectors with the same length as the cardinality of the original set of items. If there are 39 items, then you are looking at all binary vectors of length 39. Each element of each vector identifies, yes or no, whether or not the item is in the corresponding subset.
As there are 39 items, and each can either be in or not-in a given subset, then there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.
That is, as @joran said, about 550 billion vectors. Even if you store them as binary vectors, which is the most compact representation (i.e. without strings), you will need about 550 billion * 39 bits to hold all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes, or roughly 2.7 TB. If you insist on using the characters, you're likely to be in the many tens of terabytes.
It's certainly feasible to buy a system that can support this, but not very cost-effective.
At a meta-level, it is very likely, as @JD said, that this is not the path you really need to take. I recommend posting a new question; maybe it can be refined here or on the statistics-related SE site.
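The arithmetic above, and the binary-vector view of subsets, can be sketched in a few lines of R. The helper subset_from_index is a hypothetical name; it builds the k-th subset on demand from the bits of k, so nothing has to be stored. It assumes k is small enough to be represented exactly as a double, which holds for 39 items.

```r
n <- 39
n_subsets <- 2^n - 1              # number of non-empty subsets
bytes_packed <- n_subsets * n / 8 # storage if each subset takes n bits

# Return the subset of x encoded by the bits of k (bit 1 = first item).
# subset_from_index is a hypothetical helper, not an existing function.
subset_from_index <- function(x, k) {
  bits <- (k %/% 2^(seq_along(x) - 1)) %% 2 == 1
  x[bits]
}

# k = 5 is binary 101, so it selects the first and third items
example_subset <- subset_from_index(c("a", "b", "c"), 5)
```

Generating subsets one at a time like this keeps memory use flat, although visiting all 2^39 - 1 of them would still take a very long time.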
You might try using expand.grid. From its help page:
"Create a data frame from all combinations of the supplied vectors or factors. See the description of the return value for precise details of the way this is done."
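A minimal sketch of the expand.grid idea on three made-up items: each row of the grid is one in/out pattern, and each pattern selects one subset. The grid has 2^n rows, so this is only practical for small n and would not work for the 39 items in the question.

```r
items <- c("a", "b", "c")   # toy items (hypothetical)

# One logical column per item; 2^3 = 8 rows, one per in/out pattern
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(items)))

# Turn every row into the subset it describes
subsets <- lapply(seq_len(nrow(grid)), function(i) items[unlist(grid[i, ])])

# Drop the empty set (the all-FALSE row)
subsets <- subsets[lengths(subsets) > 0]
```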

Resources