I'm trying to get a hold on how the apply function works. Here is what I tried:
df = data.frame(x=c(1,2,3,4,5), x2=c(1,2,3,4,5))
apply(df$x2, 2, function(x) x*2) #doesn't work
apply(df["x2"], 2, function(x) x*2) #works
apply(df[,2], 2, function(x) (x*2)) #doesn't work
apply(df[2], 2, function(x) x*2) #works (surprisingly)
apply(df[2,], 1, function(x) x*2) #works, but gives me vertical vector
apply(df[2,], 2, function(x) x*2) #works; this gives me the output I expected in line above
Questions (as indicated by comments):
Why doesn't line 2 work although line 3 does?
Why can I use [2,] to refer to row 2 (line 6), but cannot use [,2] to refer to column 2 (line 4), but have to use [2] (line 5) instead?
In line 6 I expected to get what I got from line 7: row 2 (with doubled values) in a row. Why didn't I get this from line 6, I indicated row with MARGIN=2?
apply needs to be used on something whose dim attribute has positive length. For simplicity, think of an object that has rows and columns.
That's why MARGIN takes the values 1 and 2, standing for row-wise and column-wise operation.
Check your Input values like this:
dim(df["x2"])
dim(df[,2]) #this is null, so it does not work
df[,2] gives you a vector, the same as df$x2. A vector does not have rows and columns, so it does not work with apply.
In order to understand what you are doing wrong:
Type ?"[" into your console and read everything. Also play around, which is what you are already doing!
Have a closer look at the drop argument.
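A minimal sketch of what the drop argument changes, reusing the data frame from the question (the variable names v, d and doubled are only illustrative):

```r
# Same data frame as in the question
df <- data.frame(x = c(1, 2, 3, 4, 5), x2 = c(1, 2, 3, 4, 5))

v <- df[, 2]                # drop = TRUE by default: result is a plain vector
d <- df[, 2, drop = FALSE]  # drop = FALSE: result stays a one-column data frame

dim(v)  # NULL, so apply() has no margin to work over
dim(d)  # 5 rows, 1 column

doubled <- apply(d, 2, function(x) x * 2)  # works because d has two dimensions
```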
Lastly, with df[2,] you're subsetting a single row. It's still a data frame.
Check dim(df[2,])
apply(df[2,], 1, function(x) x*2) #works, but gives me vertical vector
apply(df[2,], 2, function(x) x*2) #works; this gives me the output I expected in line above
The reason you don't get the same output is the whole reason why apply exists. Please read ?apply to understand.
When you have questions after reading the two mentioned resources, feel free to ask more.
Here is a little example:
m <- matrix(1:9,nrow=3)
m
apply(m,1,max) #row-wise max value
apply(m,2,max) #col-wise max value
The problem is subsetting:
First:
df$x2 and df[, 2] are different from df["x2"] and df[2], as the former return a numeric vector, the latter return a data.frame.
Second:
df[2, ] returns the second row of your data.frame. If you use MARGIN = 1 you go through the rows, each row is represented as a (named) vector of length equal to the number of columns in your data.frame.
If you use MARGIN = 2 you go through the columns, again, each column is represented as a (named) vector of length equal to the number of rows in your data.frame.
Why doesn't line 2 work although line 3 does?
df$x2 is a vector, i.e. c(1,2,3,4,5), whereas df["x2"] is a data frame with just one column. The vector has no second dimension to apply over. See ?'[' in R for details of how subsetting works; this isn't really related to the apply function.
Why can I use [2,] to refer to row 2 (line 6), but cannot use [,2] to refer to column 2 (line 4), but have to use [2] (line 5) instead?
Again, see the subsetting help page, but df[,2,drop=FALSE] is probably what you need.
In line 6 I expected to get what I got from line 7: row 2 (with double values) in a row. Why didn't I get this from line 6, I indicated row with MARGIN=2?
The value section of ?apply explains the dimensions that you can expect as output from a call to apply:
If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise.
In this case we see that:
> dim(df[2,])
# [1] 1 2
and so:
apply(df[2,], 1, function(x) x*2)
has n=2 and dim(df[2,])[1]=1, so you should expect an output with dimensions c(2,1).
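The rule quoted from ?apply can be checked directly. This is a small sketch using the question's data frame; by_row and by_col are illustrative names:

```r
df <- data.frame(x = c(1, 2, 3, 4, 5), x2 = c(1, 2, 3, 4, 5))
row2 <- df[2, ]  # 1 row, 2 columns

# MARGIN = 1: FUN is called once (one row) and returns a vector of length n = 2,
# so the result has dimension c(n, dim(X)[1]) = c(2, 1), the "vertical" output
by_row <- apply(row2, 1, function(x) x * 2)
dim(by_row)

# MARGIN = 2: FUN is called twice (two columns), each call returns length n = 1,
# so the result is a plain named vector without dimensions
by_col <- apply(row2, 2, function(x) x * 2)
dim(by_col)

# Transposing the MARGIN = 1 result gives the row layout back
t(by_row)
```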
You should look at each type and dimension of the expression
> typeof(df$x2)
[1] "double"
> dim(df$x2)
NULL
> typeof(df["x2"])
[1] "list"
> dim(df["x2"])
[1] 5 1
> typeof(df[, 2])
[1] "double"
> dim(df[, 2])
NULL
> typeof(df[2])
[1] "list"
> dim(df[2])
[1] 5 1
> typeof(df[2, ])
[1] "list"
> dim(df[2,])
[1] 1 2
Line 2 does not work because you are trying to apply a function to a variable whose dimension is NULL (dim(X) must have positive length). The rest is similar. You must pay attention to the type of the expression you pass to apply. I recommend simply printing the values to check whether they are suitable for the apply function.
I have two lists, each list contains two vectors i.e,
x <- list(c(1,2),c(3,4))
y <- list(c(2,4),c(5,6))
z <- list(c(0,0),c(1,1), c(2,3),c(4,5))
I would like to use for loop to iterate over the first list and if statement for the second list as follows:
for (j in 1:seq(x)){
if(y[[j]] == c(2,4))
z[[j]] <- c(0,0)
}
I would like to iterate over the first list and for each iteration I would like to give a condition for the second list. My function is complex, so I upload this example which is similar to what I am trying to do with my original function. So that is, I would like to choose the values of z based on the values of y. For x I just want to run the code based on the length of x.
When I run it, I got this message:
Warning messages:
1: In 1:seq(x) : numerical expression has 2 elements: only the first used
2: In if (y[[j]] == c(2, 4)) y[[j]] <- c(0, 0) :
the condition has length > 1 and only the first element will be used
I searched this website and saw a similar question, but it was not helpful (if loop inside a for loop which iterates over a list in R?). That question covers only the first part of my question, so it does not help me with my problem.
Any help, please?
The first warning is caused by using seq(x), which returns [1] 1 2 here, in combination with the colon operator, which creates a sequence from its LHS to its RHS. Both values on the left and right of the colon must be of length 1; otherwise it takes the first element and discards the rest. So 1:seq(x) is the same as writing 1:1.
The second warning is that the if statement gets 2 logical values from your condition (in R 4.2.0 and later this is an error rather than a warning):
y[[1]] == c(2, 4)
[1] TRUE TRUE
If you want to test if elements of the vector are the same you can use your notation. If you want to test if the vectors are the same, you can use all.equal.
isTRUE(all.equal(y[[1]], c(2,4)))
[1] TRUE
It returns TRUE if vectors are equal (but not FALSE if they are not, which is why it needs to be used along with isTRUE()).
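As a small illustration of why the isTRUE() wrapper is needed (the vector a is made up for this sketch; identical() is shown only as a stricter base R alternative, not something the answer above relies on):

```r
a <- c(2, 4)

# When the vectors differ, all.equal() returns a character description,
# not FALSE, so it cannot be used directly as an if() condition
diff_report <- all.equal(a, c(2, 5))
is.character(diff_report)

# Wrapping it in isTRUE() always yields a single TRUE or FALSE
isTRUE(all.equal(a, c(2, 5)))
isTRUE(all.equal(a, c(2, 4)))

# identical() tests exact equality and also returns a single logical
identical(a, c(2, 4))
```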
To get rid of the warnings, you can do:
for (j in seq_along(x)){
if (isTRUE(all.equal(y[[j]], c(2,4)))) {
z[[j]] <- c(0,0)
}
}
Note: seq_along(x) is a fast primitive that generates the indices of x, and unlike 1:length(x) it also behaves correctly when x has length zero.
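One practical difference between seq_along(x) and 1:length(x) shows up with empty input. A minimal sketch, where the empty list and the counters are made up for illustration:

```r
empty <- list()

idx_colon <- 1:length(empty)   # 1:0 counts down: c(1, 0)
idx_along <- seq_along(empty)  # integer(0): nothing to iterate over

# Count how many times each loop body runs
n_colon <- 0
for (j in 1:length(empty)) n_colon <- n_colon + 1

n_along <- 0
for (j in seq_along(empty)) n_along <- n_along + 1

n_colon  # the loop ran twice although the list is empty
n_along  # the loop did not run at all
```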
For the first part, seq(x) returns [1] 1 2, so you need to use j in seq(x) or j in 1:length(x) rather than 1:seq(x).
For the second part, the comparison you used generates as many TRUE and FALSE values as there are elements in the vectors, so you can use setequal(x,y) instead, which returns a single TRUE or FALSE. Note that it compares its arguments as sets, ignoring order and duplicates, so setequal(c(4,2), c(2,4)) is also TRUE; use identical() if order matters.
The final code can be:
for (j in 1:length(x)){
if (setequal(y[[j]], c(2,4))) {
z[[j]] <- c(0,0)
}
}
or:
for (j in seq(x)){
if (setequal(y[[j]], c(2,4))) {
z[[j]] <- c(0,0)
}
}
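Before relying on setequal() in the condition, it may help to see how it differs from an exact comparison. A short sketch with made-up vectors:

```r
# setequal() treats its arguments as sets: order and duplicates are ignored
same_set_reordered <- setequal(c(4, 2), c(2, 4))
same_set_duplicate <- setequal(c(2, 2, 4), c(2, 4))

# identical() requires the same values in the same order
exact_reordered <- identical(c(4, 2), c(2, 4))
exact_same <- identical(c(2, 4), c(2, 4))
```

If y[[j]] must match c(2,4) element by element, identical() or isTRUE(all.equal()) is probably the safer condition.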
There are two examples of function Reduce() in Hadley Wickham's book Advanced R. Both work well.
Reduce(`+`, 1:3) # -> ((1 + 2) + 3)
Reduce(sum, 1:3) # -> sum(sum(1, 2), 3)
However, when using mean in Reduce(), it does not follow the same pattern. The outcome is always the first element of the list.
> Reduce(mean, 1:3)
[1] 1
> Reduce(mean, 4:2)
[1] 4
The two functions sum() and mean() are very similar. Why does one work fine with Reduce(), but the other does not? How do I know if a function behaves normally in Reduce() before it gives an incorrect result?
This has to do with the fact that, unlike sum or +, mean expects its data as a single argument (i.e. a vector of values), and as such cannot be applied in the manner that Reduce operates, namely:
Reduce uses a binary function to successively combine the elements of
a given vector and a possibly given initial value.
Take note of the signature of mean:
mean(x, ...)
When you pass multiple values to it, the function will match x to the first value and treat the rest as other arguments (for the default method, trim and na.rm) rather than as data to average. For example, when you call Reduce(mean, 1:3), this is more or less what is going on:
mean(1, 2)
#[1] 1
mean(mean(1, 2), 3)
#[1] 1
Compare this with the behavior of sum, which accepts a variable number of values:
sum(1, 2)
#[1] 3
sum(sum(1, 2), 3)
#[1] 6
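To see what a genuinely binary version would do, one can wrap mean in a two-argument function. This is only a sketch: pair_mean is a hypothetical helper, and note that reducing with it does not compute the overall mean, because a mean of means weights the elements differently.

```r
# Hypothetical binary wrapper: averages exactly two values
pair_mean <- function(a, b) mean(c(a, b))

# Reduce now combines pairs: pair_mean(pair_mean(1, 2), 3)
reduced <- Reduce(pair_mean, 1:3)

# The plain mean of the whole vector, for comparison
overall <- mean(1:3)

# accumulate = TRUE shows the intermediate results of each step
steps <- Reduce(`+`, 1:3, accumulate = TRUE)
```

A quick way to check whether a function is suitable for Reduce() is to call it by hand with two arguments, e.g. f(1, 2), and see whether both values are used.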
The technique of indexing a data frame with an empty index features several times in Hadley Wickham's Advanced R, but is only explained there in passing. I'm trying to figure out the rules governing indexing a list with an empty index. Consider the following four statements.
> (l <- list(a = 1, b = 2))
$a
[1] 1
$b
[1] 2
> (l[] <- list(c = 3))
$c
[1] 3
> l
$a
[1] 3
$b
[1] 3
> l[]
$a
[1] 3
$b
[1] 3
Questions:
Why is the output from the second statement different from the output from the third statement? Isn't assignment supposed to return the object being assigned to, in which case the second statement should yield the same output as the third one?
How did the assignment in the second statement result in the output shown after the third statement? What are the rules governing assignment to an emptily indexed list?
Why did the fourth statement yield the output shown? What are the rules governing indexing a list with an empty index when it is not on the left-hand side of an assignment?
In short l[] will return the whole list.
(l <- list(a = 1, b = 2))
l[]
l[] <- list(c=3) is essentially reassigning what was assigned to each index to now be the result of list(c=3). For this example, it is the same as saying l[[1]] <- 3 and l[[2]] <- 3. From the ?'[' page, which mentions empty indexing a few times:
When an index expression appears on the left side of an assignment (known as subassignment) then that part of x is set to the value of the right hand side of the assignment.
and also
An empty index selects all values: this is most often used to replace all the entries but keep the attributes.
So, I roughly take this to mean each index of l should evaluate to list(c=3).
When you enter (l[] <- list(c = 3)) what is being returned is the replacement value. When you then enter l or l[] you will see that the values at each index have been replaced by list(c=3).
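The behaviour described above can be checked with a short sketch. The names returned and df are illustrative, and the data frame at the end is made-up data showing the idiom where an empty index is typically used:

```r
l <- list(a = 1, b = 2)

# The value of the assignment expression is the right-hand side
returned <- (l[] <- list(c = 3))
names(returned)  # the replacement value keeps its own name

# l itself keeps its names; only the values were replaced (recycled)
names(l)
l$a
l$b

# Typical use: transform every column but keep the object a data frame
df <- data.frame(x = 1:3, y = 4:6)  # hypothetical data
df[] <- lapply(df, function(col) col * 2)
class(df)
```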
In addition to the previous answer, check this out. Note that the behaviour is totally the same with ordinary vectors and lists, so it cannot be labeled as "list-specific".
v <- 1:3
names(v) <- c("one", "two", "three")
r <- 4:5
names(r) <- c("four", "five")
(v[] <- r)
four five
4 5
Warning message:
In v[] <- r :
number of items to replace is not a multiple of replacement length
v
one two three
4 5 4
Assignment via subsetting keeps the initial attributes (here, names), so the names from the right side of the assignment are lost. What is also important is that assigning via subsetting follows recycling rules. In your example, all values are reassigned to 3; in my example there is partial recycling with a warning due to the length incompatibility.
To sum up,
Assignment with <- returns the evaluated right-hand side, before recycling rules are applied.
All values of l become 3 because of recycling, since the lengths of the two objects differ.
Without the assignment operator, l or v is essentially the same as l[] or v[].
I have a vector, in this case a character vector. I want all the elements which only appear once in the vector, but the solution should be generalizable for limits other than 1.
I can pick them manually if I use the table function. I thought that the solution would look something like
frequencies <- table(myVector)
myVector[??#frequencies <= 1]
But first of all, I don't know the slot name which will have to go into ??, and searches for documentation on the table object lead me to nowhere.
Second, while the documentation for table() says that it returns 'an object of class "table"', trying the above with some random word used instead of ??, I didn't get a "no such slot" error, but
Error: trying to get slot "frequencies" from an object of a basic class ("function") with no slots
which seems to indicate that the above won't function even if I knew the slot name.
So what is the correct solution, and how do I get at the separate columns in a table when I need them?
D'oh, the documentation of the table function led me on a merry chase of imaginary object slots.
Whatever the table() function returns, it acts as a simple numeric vector. So my solution idea works when written as:
threshold <- 1
frequencies <- table(myVector)
frequencies[frequencies <= threshold]
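Note that the expression above returns entries of the table (counts with names), not elements of the original vector. If the elements themselves are wanted, one possible sketch is to go through names(); the contents of myVector and the name rare are made up for illustration:

```r
myVector <- c("a", "b", "a", "c", "d", "d")  # hypothetical example data
threshold <- 1

frequencies <- table(myVector)

# Names of the values that occur at most `threshold` times
rare <- names(frequencies)[frequencies <= threshold]

# Keep only those elements of the original vector
myVector[myVector %in% rare]
```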
You don’t need table for this:
vector <- c(1, 0, 2, 2, 3, 2, 1, 4)
threshold <- 1
Filter(function (elem) length(which(vector == elem)) <= threshold, vector)
# [1] 0 3 4
You can use table, but then you get the result as character strings rather than numbers. You can convert them back, of course, but it’s somehow less elegant:
tab <- table(vector)
names(tab)[tab <= threshold]
# [1] "0" "3" "4"
I'd like to know the reason why the following does not work on the matrix structure I have posted here (I've used the dput command).
When I try running:
apply(mymatrix, 2, sum)
I get:
Error in FUN(newX[, i], ...) : invalid 'type' (list) of argument
However, when I check to make sure it's a matrix I get the following:
is.matrix(mymatrix)
[1] TRUE
I realize that I can get around this problem by unlisting the data into a temp variable and then just recreating the matrix, but I'm curious why this is happening.
?is.matrix says:
'is.matrix' returns 'TRUE' if 'x' is a vector and has a '"dim"'
attribute of length 2) and 'FALSE' otherwise.
Your object is a list with a dim attribute. A list is a type of vector (even though it is not an atomic type, which is what most people think of as vectors), so is.matrix returns TRUE. For example:
> l <- as.list(1:10)
> dim(l) <- c(10,1)
> is.matrix(l)
[1] TRUE
To convert mymatrix to an atomic matrix, you need to do something like this:
mymatrix2 <- unlist(mymatrix, use.names=FALSE)
dim(mymatrix2) <- dim(mymatrix)
# now your apply call will work
apply(mymatrix2, 2, sum)
# but you should really use (if you're really just summing columns)
colSums(mymatrix2)
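Since the original mymatrix is not reproduced here, the following sketch builds a small list-matrix by hand (lm and num are made-up names) to show the same behaviour end to end:

```r
# A list with a dim attribute: is.matrix() is TRUE, but the storage type is list
lm <- matrix(list(1, 2, 3, 4), nrow = 2)
is.matrix(lm)
typeof(lm)

# sum() cannot handle list input, so apply() fails; capture the error message
msg <- tryCatch(apply(lm, 2, sum), error = function(e) conditionMessage(e))

# Convert to an atomic numeric matrix with the same shape
num <- matrix(unlist(lm, use.names = FALSE), nrow = nrow(lm))
typeof(num)
colSums(num)
```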
The elements of your matrix are not numeric; they are lists. To see this, you can do:
apply(m,2, class) # here m is your matrix
So if you want the column sums, you have to coerce the elements to numeric and then apply colSums, which is a shortcut for apply(x, 2, sum):
colSums(apply(m, 2, as.numeric)) # this will give you the sum you want.