Suppose we have two categorical variables, A and B, that can each take 6 values, so there are 36 possible combinations. I want to create a new variable category that enumerates these possibilities based on the values of A and B. Is there a way of doing this without hard-coding?
apply(expand.grid(unique(A), unique(B)), 1, paste, collapse="")
From the innermost function to the outermost:
unique returns the unique values of its argument
expand.grid returns a data frame containing the Cartesian product of its components
apply applies a given function to the specified matrix/data frame/... along the given dimension (1 = rows, 2 = columns)
paste concatenates strings or vector elements
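For instance, a minimal sketch (A and B here are hypothetical example vectors, each taking 6 distinct values):
# Hypothetical example data: two variables with 6 values each
A <- rep(letters[1:6], each = 6)
B <- rep(1:6, times = 6)
# Enumerate all 36 combinations as single labels
combos <- apply(expand.grid(unique(A), unique(B)), 1, paste, collapse = "")
length(combos)  # 36
head(combos)    # "a1" "b1" "c1" "d1" "e1" "f1"
# To label each observation with its combined category directly,
# interaction() (or paste(A, B, sep = "")) gives the per-observation version
category <- interaction(A, B, sep = "")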
I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R. I need to remove some values from the matrix. For example, for year 1 I need to remove simulations number 1 and 2; for year 2 I need to remove simulation number 5.
Can anyone suggest how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged", there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1,1:2,,] <- NA or a[2,5,,] <- NA in your examples.
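A small sketch with random data of the dimensions in the question (10 years x 5 simulations x 20 flows x 10 time instants) to make both options concrete:
a <- array(rnorm(10 * 5 * 20 * 10), dim = c(10, 5, 20, 10))
a_no_year1 <- a[-1, , , ]   # drop year 1 entirely (9 x 5 x 20 x 10)
a_no_sim5  <- a[, -5, , ]   # drop simulation 5 entirely (10 x 4 x 20 x 10)
# Or blank out only specific year/simulation combinations:
a[1, 1:2, , ] <- NA         # year 1, simulations 1 and 2
a[2, 5, , ]   <- NA         # year 2, simulation 5
mean(a[1, , , ], na.rm = TRUE)  # remember na.rm when summarising afterwards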
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...),
Year2 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...))
Then you could easily remove years or simulations within years (by setting them to NULL), but it would make indexing a little bit harder (e.g. "retrieve Simulation1 values for all years" would require an lapply or a loop across years).
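A brief sketch of those operations, assuming a results list built as above:
results$Year1$Simulation2 <- NULL   # drop simulation 2 within year 1
results$Year2 <- NULL               # drop year 2 entirely
# "Simulation1 values for all remaining years" then needs an lapply or loop:
sim1_by_year <- lapply(results, function(y) y$Simulation1)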
I have some code I'm working with that has the following line:
data2 <- apply(data1[,-c(1:(index-1))],2,log)
I understand that this creates a new object from data1 by log-transforming the values column-wise, with some columns eliminated, but I don't understand how the columns are removed. What does 1:(index-1) do exactly?
The ":" operator creates an integer sequence. Because (1:(index-1)) is numeric and is used in the second position of the extraction operator "[" applied to a data frame, it refers to column numbers. The person writing the code didn't need the c() function; it could have been written more economically as:
data1[,-(1:(index-1))]
# the outer parentheses are needed: without them, -1:(index-1) would be the sequence from -1 up to index-1
So it removes the first index-1 columns from the object passed to apply. (As MrFlick points out, index must have been defined before this gets passed to R; there is no default value or interpretation for index in R.)
Suppose index is 5; then index - 1 is 4, so the sequence is 1:4, i.e. 1 2 3 4. The leading - means those columns are dropped, and apply then loops over the remaining columns since MARGIN = 2.
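A tiny worked example, with index assumed to be 5 and a hypothetical data1 of six columns:
index <- 5
data1 <- data.frame(id = 1:3, a = 1:3, b = 4:6, c = 7:9,
                    x = c(1, 10, 100), y = c(2, 20, 200))
1:(index - 1)                   # 1 2 3 4
data1[, -(1:(index - 1))]       # drops the first four columns, keeping x and y
data2 <- apply(data1[, -(1:(index - 1))], 2, log)  # log of each remaining column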
I have a data frame that has 182 elements
and I want to split it into 26 parts with 7 elements each, but in the same order as the original data frame.
I saw the split() function, but I read that it splits randomly, and I want each run of 7 consecutive elements to end up in the same part. What function can I use?
Where did you read that split is random? That is not true.
The documentation is pretty clear at ?split...
split(x, f, drop = FALSE, ...)
split divides the data in the vector x into the groups defined by f
...
x vector or data frame containing values to be divided into groups.
f a ‘factor’ in the sense that as.factor(f) defines the grouping, or a list of such factors in which case their interaction is used for the grouping.
...
The split is based on the second argument, f. The split is only as random as f is - you can choose a random f or whatever non-random f you would like. In this case, "I want to split it into 26 parts with 7 elements each", we can make a good f using rep:
split(your_data, f = rep(1:26, each = 7))
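A quick check on a toy vector of length 182 (the same call works on a data frame, where split() divides the rows):
x <- 1:182
parts <- split(x, rep(1:26, each = 7))
length(parts)   # 26
parts[[1]]      # 1 2 3 4 5 6 7 -- the original order is preserved within each part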
Given data like this
C1<-c(3,-999.000,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,4,3,6,-999.000)
DF<-data.frame(ID=c("A","B","C","D","E"),C1=C1,C2=C2,C3=C3)
How do I go about removing the -999.000 data in all of the columns?
I know this works per column
DF2<-DF[!(DF$C1==-999.000 | DF$C2==-999.000 | DF$C3==-999.000),]
But I'd like to avoid referencing each column. I am thinking there is an easy way to reference all of the columns in a particular data frame, e.g.:
DF3<-DF[!(DF[,]==-999.000),]
or
DF3<-DF[!(DF[,(2:4)]==-999.000),]
but obviously these do not work
And out of curiosity, bonus points if you can tell me why I need that last comma before the closing square bracket, as in:
==-999.000),]
The following may work
DF[!apply(DF==-999,1,sum),]
or if you can have multiple -999 on a row
DF[!(apply(DF==-999,1,sum)>0),]
or
DF[!apply(DF==-999,1,any),]
Based on your code, I'll assume that you want to remove all rows that contain -999.
DF2 <- DF[rowSums(DF == -999) == 0, ]
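With the example DF from the question this keeps rows A, C and D; the rows containing -999 are dropped:
DF2
#   ID C1 C2 C3
# 1  A  3  3  5
# 3  C  4  3  3
# 4  D  4  4  6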
As for your bonus question: A data frame is a list of vectors, all of which have the same length. If we think of the vectors as columns, then a data frame can be thought of as a matrix where the columns might have different types (numeric, character, etc). R allows you to refer to elements of a data frame much the same way you refer to elements of a matrix; by using row and column indices. So DF[i, j] refers to the ith element in the jth vector of DF, which you can think of as the ith row and jth column. So if you want to retain only some of the rows of the data frame and all columns, you can use a matrix-like notation: DF[row.indices, ].
To address your "bonus" question, if we go to the documentation for ?Extract.data.frame we will find:
Data frames can be indexed in several modes. When [ and [[ are used
with a single index (x[i] or x[[i]]), they index the data frame as if
it were a list. In this usage a drop argument is ignored, with a
warning.
and also:
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they
act like indexing a matrix: [[ can only be used to select one element.
Note that for each selected column, xj say, typically (if it is not
matrix-like), the resulting column will be xj[i], and hence rely on
the corresponding [ method, see the examples section.
So you need the comma to ensure that R knows you are referring to a row, not a column.
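A quick illustration with the DF from the question:
DF[1, ]   # two indices: row 1, all columns (matrix-style indexing)
DF[1]     # one index: the first column as a data frame (list-style indexing)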
I'm not sure whether your goal is to remove all the rows that contain at least one -999 value; if so, a possible approach is to convert them to NA and then drop incomplete rows:
DF[DF==-999] <- NA
na.omit(DF)
ID C1 C2 C3
1 A 3 3 5
3 C 4 3 3
4 D 4 4 6
I understand what tapply() does in R. However, I cannot parse this description of it from the documentation:
Apply a Function Over a "Ragged" Array
Description:
Apply a function to each cell of a ragged array, that is to each
(non-empty) group of values given by a unique combination of the
levels of certain factors.
Usage:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
When I think of tapply, I think of group by in sql. You group values in X together by its parallel factor levels in INDEX and apply FUN to those groups. I have read the description of tapply 100 times and still can't figure out how what it says maps to how I understand tapply. Perhaps someone can help me parse it?
#joran's great answer helped me understand it (so please vote for it - I would have added this as a comment if it weren't too long for that), but this may be of help to some:
In quite a few languages, you have two-dimensional arrays. Depending on the language, these arrays have fixed dimensions (i.e. each row has the same number of columns), or some languages allow the number of items per row to differ. So instead of:
A: 1 2 3
B: 4 5 6
C: 7 8 9
You could get something like
A: 1 3
B: 4 5 6
C: 8
This is called a ragged array because, well, the right side of it looks ragged.
In typical R-style, we might represent this as two vectors:
values<-c(1,3,4,5,6,8)
names<-c("A", "A", "B", "B", "B", "C")
So calling tapply with these two vectors as its first two arguments indeed allows us to apply a function to each 'row' of our ragged array.
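For example, with sum as the function, using the two vectors above:
tapply(values, names, sum)
#  A  B  C
#  4 15  8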
Let's see what the R documentation says on the subject:
The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.
The list of factors you supply via INDEX together specify a collection of subsets of X, of possibly different lengths (hence, the 'ragged' descriptor). And then FUN is applied to each subset.
EDIT: #Joris makes an excellent point in the comments. It may be helpful to think of tapply(X,Y,...) as a wrapper for sapply(split(X,Y),...) in that if Y is a list of grouping factors, it builds a new, single grouping factor based on their unique levels, splits X accordingly and applies FUN to each piece.
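A minimal sketch of that equivalence, with hypothetical toy data:
x <- c(1, 3, 4, 5, 6, 8)
g <- c("A", "A", "B", "B", "B", "C")
tapply(x, g, mean)
sapply(split(x, g), mean)
# both return: A = 2, B = 5, C = 8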
EDIT: Here's an illustrative example:
library(lattice)
library(plyr)
library(reshape2)   # needed for melt() below
set.seed(123)
#Make this example unbalanced
dat <- barley[sample(1:120,50),]
#Suppose we want the avg yield by year/site:
table(dat$year,dat$site)
#That's what they mean by 'ragged' array; there are different
# numbers of obs at each comb of levels
#In plyr we could use ddply:
ddply(dat,.(year,site),.fun=function(x){mean(x$yield)})
#Which gives the same result (listed in a diff order) as:
melt(tapply(dat$yield, list(dat$year, dat$site), mean))