Selecting multiple parts of a list - r

I have a data frame with 100 entries, and I want to get a field's value for a subset of the entries. Specifically, I want every other block of 10 entries (i.e. indices 1-10, 21-30, 41-50, 61-70, ...).
The only way I've been able to do this is via: c(data$field[1:10],data$field[21:30],...)
But this seems like a horrible solution, especially if the size of the data frame changes.

You can do
data$field[rep(c(TRUE, FALSE), each = 10)]
where rep creates a vector of ten TRUE followed by ten FALSE, which is recycled as needed when used for indexing.
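A minimal sketch of the recycling behaviour, assuming a made-up data frame whose field column simply holds 1 to 100 so the selected positions are easy to read off:

```r
# Hypothetical data: the values equal their own row index.
data <- data.frame(field = 1:100)

# Ten TRUE followed by ten FALSE; R recycles this 20-element mask
# across all 100 rows when it is used as a logical index.
mask <- rep(c(TRUE, FALSE), each = 10)
picked <- data$field[mask]

print(picked)
```

Because the mask is recycled rather than built to a fixed length, the same line keeps working if the data frame grows or shrinks.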

Related

How to remove some values from a 4-dimensional matrix?

I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R. I need to remove some values from the matrix. For example, for year 1 I need to remove simulations number 1 and 2; for year 2 I need to remove simulation number 5.
Can anyone suggest how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged": there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1,1:2,,] <- NA or a[2,5,,] <- NA in your examples.
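A short sketch of both ideas, assuming an array a filled with placeholder numbers in the 10x5x20x10 layout from the question:

```r
# Placeholder array: Year x Simulation x Flow x Time instant.
a <- array(seq_len(10 * 5 * 20 * 10), dim = c(10, 5, 20, 10))

# Negative indices drop whole slices and return a smaller array.
without_year1 <- a[-1, , , ]
print(dim(without_year1))

# Blank out selected year/simulation combinations instead of removing them.
a[1, 1:2, , ] <- NA   # year 1, simulations 1 and 2
a[2, 5, , ] <- NA     # year 2, simulation 5

# Summaries then need na.rm = TRUE to skip the blanked cells.
total <- sum(a, na.rm = TRUE)
print(sum(is.na(a)))
```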
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...),
                Year2 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...))
Then you could easily remove years or simulations within years (by setting them to NULL), but it would make indexing a little bit harder (e.g. "retrieve Simulation1 values for all years" would require an lapply or a loop across years).
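A small sketch of the list-of-lists layout, assuming two years with two simulations each and constant-valued 20x10 placeholder matrices:

```r
# Placeholder matrices (Flow x Time); the values are illustrative only.
results <- list(
  Year1 = list(Simulation1 = matrix(1, 20, 10),
               Simulation2 = matrix(2, 20, 10)),
  Year2 = list(Simulation1 = matrix(3, 20, 10),
               Simulation2 = matrix(4, 20, 10))
)

# Assigning NULL removes one simulation from one year only.
results$Year1$Simulation2 <- NULL
print(names(results$Year1))

# Retrieving the same simulation across all years needs an lapply.
sim1_by_year <- lapply(results, function(year) year$Simulation1)
print(names(sim1_by_year))
```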

How do I find combined proportions in an R table?

Excuse the horrible title. I don't think I'd be able to summarise this problem in such few words.
So I have a table in R with data whose proportions by row have been calculated using the prop.table function ( prop.table(tab, 1) ). It looks like this:
The row headings (i.e. Q1-00-05, etc.) denote times of the day. The column headings TRUE and FALSE denote whether a particular 999 call was responded to within 10 minutes.
What I need from this table is the proportion of 999 calls responded to efficiently (< 10 mins) between 1800hrs and 0500 hrs.
I tried doing this:
tab2<-table(callouts$daytime=="Q4-18-23"|"Q1-00-05", callouts$tenmins)
but this proved fruitless. I got an error message saying:
operations are possible only for numeric, logical or complex types
I expected the table to come out with TRUE or FALSE as the row headings (for whether the callout time was within this time frame or not) and TRUE or FALSE as the column headings (for whether the response time was sub-10mins)
Any help would be much appreciated. Thanks!
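The error arises because | is applied to the two character strings before the comparison is made. One possible way to get the table described above, offered as a sketch rather than a definitive answer, is to test membership with %in%. The callouts data below is invented for illustration, and the labels "Q2-06-11" and "Q3-12-17" are assumed names for the other time bands:

```r
# Invented stand-in for the real callouts data frame.
callouts <- data.frame(
  daytime = c("Q1-00-05", "Q2-06-11", "Q4-18-23", "Q3-12-17", "Q4-18-23"),
  tenmins = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# TRUE for calls between 1800hrs and 0500hrs, FALSE otherwise.
night <- callouts$daytime %in% c("Q4-18-23", "Q1-00-05")

# Rows: within the time frame or not. Columns: responded within 10 minutes or not.
tab2 <- table(night, callouts$tenmins)
props <- prop.table(tab2, 1)
print(props)
```

The cell props["TRUE", "TRUE"] is then the proportion of night-time calls answered within 10 minutes.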

Couldn't reduce the looping variable inside the "for" loop in R

I have a for loop that does a matrix manipulation in R. When certain checks are true, I need to come back to the same row again, which means i needs to be reduced by 1.
for(i in 1:10)
{
if(some chk)
{
i=i-1
}
}
However, i is not actually reduced. For example, in the 5th row I reduce i to 4, so the next iteration should use 5 again, but it uses 6.
Please advise.
My intention is:
I check the first-column values of a matrix. If I find a duplicate value, I take its second-column value, append it to the second column of the first row with that value, and remove the duplicate row. So, when I remove a row, I do not need to increase i in the loop. (This is just a map-reduce method: append the values that share a key.)
The loop variable in an R for loop is reassigned from the sequence at the start of every iteration, so changing it inside the loop body does not affect which element comes next. What you have written would be solved completely differently in normal R code. The exact solution depends on the actual problem; there isn't a generic, direct replacement (except by replacing the whole thing with a while loop, but this is both ugly and probably unnecessary).
To illustrate this, consider these two typical examples.
Assume you want to filter all duplicated elements from a list. Instead of looping over the list and copying all duplicated elements, you can use the duplicated function which tells you, for each element, whether it’s a duplicate.
Secondly, you use standard R subsetting syntax to select just those elements which are not a duplicate:
x = x[! duplicated(x)]
(This example works on a one-dimensional vector or list, but it can be generalised to more dimensions.)
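As a sketch of how this might generalise to the key/value case described in the question, the following uses an invented two-column data frame m and tapply to paste together the values that share a key. The column names key and value, and the comma separator, are assumptions for illustration:

```r
# One-dimensional case: keep the first occurrence of each element.
x <- c(3, 1, 3, 2, 1)
x_unique <- x[!duplicated(x)]
print(x_unique)

# Invented key/value data standing in for the two matrix columns.
m <- data.frame(
  key   = c("a", "b", "a", "c", "b"),
  value = c("1", "2", "3", "4", "5"),
  stringsAsFactors = FALSE
)

# Append the values of rows sharing a key, one result per distinct key.
merged <- tapply(m$value, m$key, paste, collapse = ",")
print(merged)
```

No loop index has to be adjusted here, because the grouping is done in one vectorised step.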
For a more complex case, let’s say that you have a vector of numbers and, for every even number in the vector, you want to double the preceding number (this is highly artificial but in signal processing you might face similar problems). In other words:
input = c(1, 3, 2, 5, 6, 7, 1, 8)
output = ???
output
# [1] 1 6 2 10 6 7 2 8
… we want to fill in ???. In the first step, we check which numbers are even:
even = input %% 2 == 0
# [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
Next, we shift the result down – because we want to know whether the next number is even – by removing the first element, and appending a dummy element (FALSE) at the end.
even = c(even[-1], FALSE)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
And now we can multiply just these inputs by two:
output = input
output[even] = output[even] * 2
There, done.
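Returning to the loop variable from the question, here is a small sketch of why decrementing i has no lasting effect in a for loop, together with the while-loop alternative mentioned above. The rule "revisit row 3 once" is a made-up stand-in for the real check:

```r
# In a for loop, i is reassigned from the sequence on every iteration.
visited <- integer(0)
for (i in 1:5) {
  visited <- c(visited, i)
  i <- i - 1  # only changes i for the remainder of this iteration
}
print(visited)

# With while, the index is under manual control and can be held back.
i <- 1
retried <- FALSE
order_seen <- integer(0)
while (i <= 5) {
  order_seen <- c(order_seen, i)
  if (i == 3 && !retried) {
    retried <- TRUE  # stay on row 3 for one more pass
  } else {
    i <- i + 1
  }
}
print(order_seen)
```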

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations,
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data on the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
What I actually intended to do was to stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
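gregexpr does in fact return an object of length 1 for every row here: a row with no match yields the single value -1, and a row with one comma yields a single position. If you want to find the rows which have a match vs the ones which don't, then you need to look at the returned value, not the length.
Try grepl(",", testdat$genome_coordinates, fixed = TRUE) to get the rows with a comma, or sapply(gregexpr(",", testdat$genome_coordinates, fixed = TRUE), function(m) sum(m > 0)) to count the commas in each row.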
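A sketch of both steps on the four example entries from the question: counting commas by inspecting the match positions, then expanding multi-location entries into one row each. The output column names (build, chrom, start, end, strand, source_row) are assumptions chosen for illustration, and source_row is one possible way to join the adjacent columns back on:

```r
testdat <- data.frame(
  genome_coordinates = c("[hg19:2:224840068-224840089:-]",
                         "[hg19:17:37092945-37092969:-]",
                         "[hg19:20:3904018-3904040:+]",
                         "[hg19:16:67000244-67000248,67000628-67000647:+]"),
  stringsAsFactors = FALSE
)

# Count commas: a failed match is reported as -1, so count positive positions.
matches <- gregexpr(",", testdat$genome_coordinates, fixed = TRUE)
testdat$multiple <- sapply(matches, function(pos) sum(pos > 0))

# Strip the brackets, then split each entry on ':'.
clean <- gsub("^\\[|\\]$", "", testdat$genome_coordinates)
parts <- strsplit(clean, ":", fixed = TRUE)

# Build one output row per start-end range within each entry.
rows <- lapply(seq_along(parts), function(i) {
  p <- parts[[i]]
  ranges <- strsplit(p[3], ",", fixed = TRUE)[[1]]
  bounds <- do.call(rbind, strsplit(ranges, "-", fixed = TRUE))
  data.frame(source_row = i, build = p[1], chrom = p[2],
             start = as.numeric(bounds[, 1]),
             end = as.numeric(bounds[, 2]),
             strand = p[4], stringsAsFactors = FALSE)
})
locations <- do.call(rbind, rows)
print(locations)
```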

Find powerset of all unique combinations of vector of strings

I am trying to find all of the unique groupings of a vector/list of items, length 39. Below is the code I have:
x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
"TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
"TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
"TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
"TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
"TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
"TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
"TmaxKLEX","TminKSDF","TmaxKSDF")
# Generate a list with the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
However, the code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. thanks.
To calculate all unique subsets, you are simply creating all binary vectors with the same length as the cardinality of the original set of items. If there are 39 items, then you are looking at all binary vectors of length 39. Each element of each vector identifies, yes or no, whether or not the item is in the corresponding subset.
As there are 39 items, and each can either be in or not-in a given subset, then there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.
That is, as @joran said, about 549B vectors. Given that binary vectors represent the data most compactly (i.e. without strings), you will need 549B * 39 bits to return all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes. If you insist on using the characters, you're likely to be in the many tens of terabytes.
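The arithmetic above can be checked directly. The sketch below also includes a hypothetical helper, subset_from_index, showing one way a single subset could be produced from its index on demand instead of storing them all:

```r
n_items <- 39

# Number of non-empty subsets, and the bytes needed at 39 bits per subset.
n_subsets <- 2^n_items - 1
bytes_needed <- n_subsets * n_items / 8
print(format(n_subsets, big.mark = ","))
print(bytes_needed)

# Hypothetical helper: treat k as a binary vector and keep the items
# whose bit is set. Uses double arithmetic, so it is fine for 39 items.
subset_from_index <- function(items, k) {
  bits <- (k %/% 2^(seq_along(items) - 1)) %% 2 == 1
  items[bits]
}

example <- subset_from_index(c("a", "b", "c"), 5)  # binary 101
print(example)
```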
It's certainly feasible to buy a system that can support this, but not very cost-effective.
At a meta-level, it is very likely, as @JD said, that this is not the path you really need to go down. I recommend posting a new question, and maybe it can be refined here or on the statistics-related SE site.
You might try using expand.grid.
Create a data frame from all combinations of the supplied vectors or
factors. See the description of the return value for precise details
of the way this is done.
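A sketch of how expand.grid might be used for this, on an invented three-item vector. Each row of the grid is an in/out indicator for one subset, so with 39 items it would run into the same memory limit discussed above:

```r
# Small stand-in for the 39-item vector.
x <- c("a", "b", "c")

# One TRUE/FALSE column per item; one row per subset (including the empty set).
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(x)))
print(nrow(grid))

# Turn each indicator row into the corresponding subset of x.
subsets <- lapply(seq_len(nrow(grid)), function(i) x[unlist(grid[i, ])])
print(subsets[[nrow(grid)]])
```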
