I have the following numeric vectors x and y
x <- c(a=1,b=2,c=3)
y <- c(d=2,e=1,f=4)
I want to find the parallel maximum of each element in the vectors, so I used:
> pmax(x,y)
a b c
2 2 4
The output has the right values; however, it returns the wrong names. The documentation for pmax mentions that it returns the attributes of the first argument, hence the a b c. Is there a way of getting the names of the maximum values? The desired output is as follows:
d b f
2 2 4
One option is to use max.col to find the column index of the maximum value in each row. For that, we create a matrix by cbinding the vectors ('xy') and another one from their names ('nmxy'). We then build a row/column index ('ij'), use it to subset the elements of 'xy', and set the names from the matching elements of 'nmxy'.
xy <- cbind(x,y)
nmxy <- cbind(names(x), names(y))
ij <- cbind(1:nrow(xy), max.col(xy))
setNames(xy[ij], nmxy[ij])
# d b f
# 2 2 4
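One caveat worth sketching: max.col breaks ties at random by default (ties.method = "random"), so when x and y hold equal values at some position the chosen name can change between runs. The example below is a minimal sketch that assumes you want the name from x on ties; the modified y (with a deliberate tie in the first position) is made up for illustration.

```r
x <- c(a = 1, b = 2, c = 3)
y <- c(d = 1, e = 1, f = 4)   # illustrative data: first elements tie on purpose

xy   <- cbind(x, y)
nmxy <- cbind(names(x), names(y))

# ties.method = "first" always picks the leftmost column (here: x) on ties
ij  <- cbind(seq_len(nrow(xy)), max.col(xy, ties.method = "first"))
res <- setNames(xy[ij], nmxy[ij])
print(res)
```

Using ties.method = "last" instead would prefer the name from y.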
First store the result:
r <- pmax(x,y)
Then rename the elements whose value came from y:
names(r)[y == r] <- names(y)[y == r]
If you want to be fancy, you can mask the pmax function with a wrapper that produces the desired output (the original stays available as base::pmax).
old.pmax = pmax
pmax <- function(x,y){
r <- old.pmax(x,y)
names(r)[y == r] <- names(y)[y == r]
return(r)
}
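If masking pmax feels too invasive, the same idea can live in a separately named helper. This is only a sketch: the name pmax_named is hypothetical (not a base R function), it assumes two vectors of equal length without NAs, and it keeps the name from x on ties, which differs from the y == r rule above.

```r
# Hypothetical helper; pmax_named is an illustrative name, not part of base R.
pmax_named <- function(x, y) {
  stopifnot(length(x) == length(y))
  r <- pmax(x, y)                  # values; names come from x at this point
  take_y <- y > x                  # TRUE where y strictly wins (ties keep x's name)
  names(r)[take_y] <- names(y)[take_y]
  r
}

x <- c(a = 1, b = 2, c = 3)
y <- c(d = 2, e = 1, f = 4)
res <- pmax_named(x, y)
print(res)
```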
As an exercise, I was given two samples, u and v, drawn after setting a seed, and asked to show how many of the values that are in v but not in u fall into the bins [1,50] and [51,100]. I am then asked to add a line of code to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50)))
But for the second part, I don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12
If we want to use a comparison operator and sum, create a logical vector and take its sum. sum(i1) counts the values above 50 and sum(!i1) counts the rest:
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as in setdiff), then take the unique values first:
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
Checking against the output of table (the table lists the lower bin first, so the counts go in the order out2, out1):
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out2, out1), check.attributes = FALSE)
#[1] TRUE
Since you are cutting at a single break point, you can verify your answer using > directly.
This is your code:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9
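One detail to keep in mind when comparing the two approaches: by default findInterval uses intervals that are closed on the left, so a value of exactly 50 lands in bin 1, while 50 > 50 is FALSE. The sketch below uses a made-up vector vals to show the difference. The left.open = TRUE argument is one way (under that assumption about the intended bins) to get [1,50] and [51,100].

```r
vals <- c(49, 50, 51)   # illustrative values around the break point

# Default: intervals are [.., 50) and [50, ..), so 50 falls in bin 1
default_bins <- findInterval(vals, 50)

# left.open = TRUE: intervals are (.., 50] and (50, ..), so 50 falls in bin 0
left_open_bins <- findInterval(vals, 50, left.open = TRUE)

# The relational operator agrees with the left-open version
above_50 <- vals > 50

print(default_bins)
print(left_open_bins)
print(above_50)
```

If 50 never occurs among the values being binned, both versions give the same counts.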
See how addition works over components:
a<-1:3
a+a #Gives (1+1), (2+2), (3+3)
I've considered using loops over argument lengths or transforming them into a data.frame and then using apply but I have the intuition there is a more efficient way of going about this.
Specifically, I'd like to calculate the mean of each set of components ignoring zero values, like so:
function(x) {
mean(x[x!=0])
}
Except x would be the i-th components of an arbitrary number of arguments.
If I understand correctly, mapply or its wrapper Map would work fairly well here.
mapply(function(...) {temp <- c(...); mean(temp[temp != 0])}, 1:10, 11:20)
[1] 6 7 8 9 10 11 12 13 14 15
With mapply, the given function is applied to the collection of the first elements of each vector, then the collection of the second elements, and so on. The function creates a new vector with c and then calculates the mean of all non-zero elements. Here, mapply simplifies the result to an atomic vector.
Map(function(...) {temp <- c(...); mean(temp[temp != 0])}, 1:10, 11:20)
returns a list instead. This could be wrapped in unlist to return a vector.
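Since the example above contains no zeros, here is a small sketch with made-up vectors a, b and d that do, so the zero-skipping is visible. The helper name mean_nonzero is hypothetical, chosen only for illustration.

```r
# Hypothetical helper: mean of the i-th components, ignoring zeros
mean_nonzero <- function(...) {
  vals <- c(...)
  mean(vals[vals != 0])
}

a <- c(1, 0, 3)
b <- c(0, 0, 6)
d <- c(5, 4, 0)

res_vec  <- mapply(mean_nonzero, a, b, d)        # atomic vector
res_list <- unlist(Map(mean_nonzero, a, b, d))   # list, flattened with unlist

print(res_vec)
```

Note that a position where every component is zero would yield NaN, because the mean of an empty vector is NaN.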
If we need the element-wise sum across multiple vectors:
Reduce(`+`, listofvectors)
Or rbind (or cbind) them into a matrix and then take the colSums (or rowSums):
colSums(m1)
Update
Regarding the second part of the question (not clear), if it is to get the mean of individual vectors in a list excluding the 0 value
sapply(listofvectors, function(x) mean(x[x!=0]))
Or if we need the mean of sequence of elements in the matrix (created by rbinding the vectors), then replace the 0 values with NA, and get the colMeans with na.rm = TRUE
colMeans(replace(m1, m1==0, NA), na.rm = TRUE)
colMeans(replace(m2, m2==0, NA), na.rm = TRUE)
#[1] 6 7 8 9 10 11 12 13 14 15
NOTE: The matrix approach with colMeans is vectorized; no explicit looping is done here.
data
a1 <- 1:5
b1 <- 6:10
c1 <- 11:15
listofvectors <- list(a1, b1, c1)
m1 <- rbind(a1, b1, c1)
m2 <- rbind(1:10, 11:20)
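The data above contains no zeros, so the NA replacement has no visible effect there. As a sketch with made-up vectors that do contain zeros (the values are assumptions for illustration only), the three approaches can be compared side by side:

```r
# Illustrative data with zeros
a1 <- c(1, 0, 3)
b1 <- c(4, 5, 0)
c1 <- c(0, 0, 9)
listofvectors <- list(a1, b1, c1)
m1 <- do.call(rbind, listofvectors)   # one vector per row

sums_reduce <- Reduce(`+`, listofvectors)   # element-wise sum via Reduce
sums_matrix <- colSums(m1)                  # same sums via the matrix

# Element-wise mean ignoring zeros: zeros become NA and are dropped by na.rm
means_nonzero <- colMeans(replace(m1, m1 == 0, NA), na.rm = TRUE)

print(sums_reduce)
print(means_nonzero)
```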
I have a matrix A with 40000 rows and 9 columns and a vector B with 40000 items.
Each item in B is a number from 1 to 9. For each row, I want to set the column of A given by the corresponding item in B to 1.
Right now, I'm using a for loop for it.
for(r in 1:40000){
A[r,B[r]]=1
}
Is there a way to vectorize it?
Thanks
You could try
A[cbind(1:nrow(A), B)] <- 1
Checking results with the OP's code
for(r in 1:nrow(A1)){
A1[r, B[r]] <- 1
}
identical(A, A1)
#[1] TRUE
Here we use a matrix that we created with cbind. From ?"[":
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.
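To make the quoted rule concrete, here is a minimal sketch on a small made-up matrix: each row of the index matrix idx is one (row, column) pair, and the assignment touches exactly those cells.

```r
A <- matrix(0, nrow = 4, ncol = 3)   # illustrative 4 x 3 matrix of zeros
B <- c(2, 3, 1, 2)                   # target column for each row

idx <- cbind(seq_len(nrow(A)), B)    # rows of idx are (row, column) pairs
A[idx] <- 1                          # sets A[1,2], A[2,3], A[3,1], A[4,2]

print(A)
```

seq_len(nrow(A)) is used here instead of 1:nrow(A) only because it also behaves sensibly for a matrix with zero rows.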
data
set.seed(24)
A <- matrix(sample(1:40, 25*9, replace=TRUE), ncol=9)
B <- sample(1:9, 25, replace=TRUE)
A1 <- A
I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
# >>> How do I access arg$from and arg$to? <<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.
The problem is that apply(...) requires a matrix or array as the first argument. If you pass a dataframe, it will coerce that to a matrix, and each row then reaches your function as a plain (named) vector. R vectors are 1-indexed, so the first element is arg[1], not arg[0]. Also, elements of an atomic vector cannot be referenced using the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1,1,f)
would work.
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
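Both points can be sketched briefly. Because each row arrives as a named vector, elements can also be picked by column name with [[ ]], and a dataframe with a character column shows the coercion. The dataframe df_mixed and its label column are made up for illustration.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))

# Each row is a named numeric vector, so name-based access works
diffs <- apply(df1, 1, function(row) row[["to"]] - row[["from"]])
print(diffs)

# Illustrative dataframe with a character column: every row becomes character
df_mixed <- data.frame(from = c(1, 2), label = c("a", "b"))
row_classes <- apply(df_mixed, 1, function(row) class(row))
print(row_classes)
```

In the mixed case the numbers would have to be converted back with as.numeric inside the function, which is one reason mapply over the individual columns can be more convenient.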
Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
from to newvar
1 1 2 3
2 2 3 5
3 3 4 7
I am trying to obtain the sum of values of a column (B) based on the interval between two values on another column (A) in a "reference" dataframe (df):
A <- seq(1:10)
B <- c(4,3,5,7,5,7,4,7,3,7)
df <- data.frame(A,B)
I have found two ways of doing this:
y <- sum(subset(df, A < 3 & A >= 1, select = "B"))
> y
[1] 7
and
z <- with(df,sum(df[A<3 & A>=1,"B"]))
> z
[1] 7
However, I would like to do this based on two vectors of values stored in another dataframe:
C <- c(3,7,7)
D <- c(1,1,5)
df2 <- data.frame(C,D)
to obtain a column of y values for each pair of C and D values.
I have created a function:
myfn <- function(c,d) {
y <-sum(subset(df, A < c & A >= d, select = "B"))
return(y)
}
Which works fine with numbers
myfn(3,1)
[1] 7
but not with vectors.
myfn(c=C,d=D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
> myfn(df2$C,df2$D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
>
Does anyone have any suggestion about how I could calculate such interval sums for a sequence of values?
Try:
mapply(myfn, C, D)
# [1] 7 31 12
The problem is that your function is not naturally vectorized. You can see that because your return value comes from sum, which collapses all of its input into a single value instead of working element by element.
Beyond that, if you look at myfn, the expression A < c & A >= d doesn't make sense when c and d have more than one value. There, you are comparing each value of df$A to the corresponding value in your C and D vectors (so first value to first, second to second, etc., with the shorter vectors recycled), instead of comparing all the values of df$A to each value in C and D in turn.
By using mapply, I'm basically calling your function once per pair, with a single value from C and a single value from D as arguments each time.
Fortunately, in your case the length of df$A is not a multiple of the length of C and D, so you actually got a warning. If they were the same length, you would not have gotten a warning, and you would still have gotten a single value instead of the three you are presumably looking for.
There are better ways to do this, but the mapply approach is pretty trivial here and works with your code pretty much as is.
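One possible alternative, offered only as a sketch and not necessarily what was meant above, is to compare every value of A against every bound at once with outer and then sum column-wise. It reuses the data from the question.

```r
df <- data.frame(A = 1:10, B = c(4, 3, 5, 7, 5, 7, 4, 7, 3, 7))
C <- c(3, 7, 7)   # upper bounds (exclusive)
D <- c(1, 1, 5)   # lower bounds (inclusive)

# in_range[i, j] is TRUE when D[j] <= A[i] < C[j]
in_range <- outer(df$A, C, "<") & outer(df$A, D, ">=")

# Multiply each column by B (recycled down the rows) and sum per interval
res <- colSums(in_range * df$B)
print(res)
```

This builds a length(A) by length(C) matrix, so for very long inputs the mapply version may be the more memory-friendly choice.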
Another way...
is.between <- function(x,vec){
return(x>=min(vec) & x<max(vec))
}
apply(df2,1,function(x){sum(df[is.between(df$A,x),]$B)})
# [1] 7 31 12