Most efficient way to select the last element of an array in R?

I'm looking for the most efficient way (i.e., the fewest keystrokes) to index the last element of an array.
So something like
a <- c(1,2,3)
n <- length(a)
b <- a[n]
should not be used; I would like to use just a single command.
In the example above I could use
b <- a[length(a)]
but I wonder if something shorter does exist.
Say I want to select part of an array, like
a <- seq(from = 1, to = 10, by = 1)
b <- a[3:length(a)]
Is there a shorter way to do it?

For the first case, you can use:
> tail(a, 1)
[1] 3
Not that that really qualifies as shorter.
For the second example
> tail(a, -2)
[1] 3 4 5 6 7 8 9 10
but in general, no, there is nothing shorter. R doesn't have an operator or syntactic sugar for the end of a vector or array, in the sense of something that evaluates to the last position of the array. That is what length() is used for.
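If you do this often, you can wrap the idiom in a one-line helper (a minimal sketch; the name last is arbitrary here, and note that packages such as dplyr export a function of the same name):
> last <- function(x) x[length(x)]
> last(a)
[1] 3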

Use tail() to get the tail end of an object:
x <- 1:100
By default, tail() returns 6 elements...
tail(x)
[1] 95 96 97 98 99 100
... but you can change that:
tail(x, 10)
[1] 91 92 93 94 95 96 97 98 99 100
Similarly, there is head() to get the first elements:
head(x, 7)
[1] 1 2 3 4 5 6 7
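Both head() and tail() also accept a negative n, which drops that many elements from the opposite end instead of selecting them:
tail(x, -97)
[1]  98  99 100
head(x, -97)
[1] 1 2 3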

Related

Create frequency vector based on input vector

I have a variable test with the following structure:
> test <- c(9,87)
> names(test) <- c("VGP", "GGW")
> dput(test)
structure(c(9, 87), .Names = c("VGP", "GGW"))
> class(test)
[1] "numeric"
This is a very simplified version of the input vector, but I want an output as a vector of length 100 which contains the frequency of each number 1-100 inclusive. The real input vector is of length ~1000000, so I am looking for an approach that will work for a vector of any length, assuming only numbers 1-100 are in it.
In this example, all positions except 9 and 87 will show up as 0, and the 9th and 87th positions will both be 50.
How can I generate this output?
If we are looking for a proportion that also covers the values not present in the vector (reporting those as 0), convert the vector to a factor with the levels specified, then apply table and prop.table:
100*prop.table(table(factor(test, levels = 1:100)))
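For the example vector this returns a table of length 100 with 50 at positions 9 and 87 and 0 everywhere else; inspecting a few positions around the non-zero entries:
> pct <- 100*prop.table(table(factor(test, levels = 1:100)))
> pct[c(8, 9, 10, 86, 87, 88)]
 8  9 10 86 87 88 
 0 50  0  0 50  0 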
freq <- vector(mode = "numeric", length = 100)
for (i in X) {
  if (i >= 1 && i <= 100)
    freq[i] <- freq[i] + 1
}
freq
X is the vector containing the 10000 elements.
The if condition ensures that the values are in the range [1, 100].
Hope this helps.
If you have a numeric vector and just want to get a frequency table of the values, use the table function.
set.seed(1234)
d <- sample(1:10, 1000, replace = TRUE)
x <- table(d)
x
# 1 2 3 4 5 6 7 8 9 10
# 92 98 101 104 87 112 104 94 88 120
If there is a possibility of missing values (say 11 is a possible value in my example), then I'd do the following:
y <- rep(0, 11)
names(y) <- as.character(1:11)
y[as.numeric(names(x))] <- x
y
#  1   2   3   4   5   6   7   8   9  10  11
# 92  98 101 104  87 112 104  94  88 120   0
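As an aside (not part of the original answer): when the values are known to be small positive integers, base R's tabulate() returns the counts for 1 through nbins directly, zeros included, which avoids the names bookkeeping above:
tabulate(d, nbins = 11)
# [1]  92  98 101 104  87 112 104  94  88 120   0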

R list get first item of each element

This should probably be very easy for someone to answer, but I have had no success in finding the answer anywhere.
I am trying to return, from a list in R, the first item of each element of the list.
> a
[1] 1 2 3
> b
[1] 11 22 33
> c
[1] 111 222 333
> d <- list(a = a,b = b,c = c)
> d
$a
[1] 1 2 3
$b
[1] 11 22 33
$c
[1] 111 222 333
Based on the construction of my list d above, I want to return a vector with three values:
return 1 11 111
sapply(d, "[[", 1) should do the trick.
A bit of explanation:
sapply: iterates over the elements in the list
[[: is the subset function. So we are asking sapply to use the subset function on each list element.
1 : is an argument passed to "[["
It turns out that "[" or "[[" can be called in a traditional manner, which may help to illustrate the point:
x <- 10:1
"["(x, 3)
# [1] 8
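Applied to the list d from the question, the element names carry through to the result:
sapply(d, "[[", 1)
#   a   b   c 
#   1  11 111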
You can do
output <- sapply(d, function(x) x[1])
If you don't need the names
names(output) <- NULL

Fastest way to find nearest value in vector

I have two integer/posixct vectors:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
Now my resulting vector c should contain, for each element of vector a, the nearest element of b:
c <- c(4,4,4,4,4,6,6,...)
I tried it with apply and which.min(abs(a - b)) but it's very very slow.
Is there any more clever way to solve this? Is there a data.table solution?
As presented in this link, you can do either:
which(abs(x - your.number) == min(abs(x - your.number)))
or
which.min(abs(x - your.number))
where x is your vector and your.number is the value. If you have a matrix or data.frame, simply convert it to a numeric vector in an appropriate way and then try this on the resulting numeric vector.
For example:
x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))
would output:
[1] 21 22
Update: Based on the very kind comment of hendy I have added the following to make it more clear:
Note that the answers above (i.e. 21 and 22) are the indices of the items (this is how which() works in R), so if you want to get the actual values, you have to use these indices to retrieve them. Let's have another example:
x <- seq(from = 100, to = 10, by = -5)
x
[1] 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10
Now let's find the number closest to 42:
your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]
which would output the "value" we are looking for from the x vector:
[1] 40
Not quite sure how it will behave with your volume, but cut is quite fast.
The idea is to cut your vector a at the midpoints between the elements of b.
Note that I am assuming the elements in b are strictly increasing!
Something like this:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)
cut(a, breaks=cuts, labels=b)
# [1] 4 4 4 4 4 6 6 6 10 10 10 10 10 16 16
# Levels: 4 6 10 16
It is even faster to use a lower-level function like findInterval (which, again, assumes that the breakpoints are non-decreasing).
findInterval(a, cuts)
[1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4
So of course you can do something like:
index = findInterval(a, cuts)
b[index]
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Note that you can choose what happens to elements of a that are equidistant to an element of b by passing the relevant arguments to cut (or findInterval); see their help pages.
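For instance, with cut's right = FALSE the intervals become left-closed, so a value sitting exactly on a midpoint (here 5) is assigned to the upper neighbour instead of the lower one:
cut(a, breaks=cuts, labels=b, right=FALSE)
# [1] 4  4  4  4  6  6  6  10 10 10 10 10 16 16 16
# Levels: 4 6 10 16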
library(data.table)
a <- data.table(Value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[, merge := Value]
b <- data.table(Value = c(4,6,10,16))
b[, merge := Value]
setkeyv(a, c('merge'))
setkeyv(b, c('merge'))
Merge_a_b <- a[b, roll = 'nearest']
In data.table, when we merge two data tables there is a roll = 'nearest' option which matches each element to the nearest element of the other table. The size of the resulting data table equals the size of whichever table is inside the brackets (here b). As usual, the merge requires a common key.
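Building on that remark about bracket order (a sketch, not part of the original answer): putting a inside the brackets instead yields one row per element of a, i.e. the nearest b value for every element of a, which is what the question asks for:
nearest <- b[a, roll = 'nearest']  # one row per row of a
nearest$Value                      # the nearest element of b for each element of a
(How ties between two equidistant values are resolved follows data.table's rolling-join rules.)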
For those who would be satisfied with the slow solution:
sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
Here might be a simple base R option, using max.col + outer:
b[max.col(-abs(outer(a,b,"-")))]
which gives
> b[max.col(-abs(outer(a,b,"-")))]
[1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just doesn't handle multiple values at once).
To get around this we can lapply over your vector a and find the closest element for each value.
library(DescTools)
lapply(a, function(i) Closest(x = b, a = i))
You might notice that more values are being returned than exist in a. This is because Closest will return both values if the value you are testing is exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).
To get around this, put either min or max around the result:
lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))
Then unlist the result to get a plain vector :)
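Putting the last two steps together:
unlist(lapply(a, function(i) min(Closest(x = b, a = i))))
# a plain numeric vector of nearest b values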

How to pull out values corresponding to a random selection and get the cumulative summation of them?

Let's say I have a data frame with two columns for now:
df <- data.frame(scores_set1 = c(32,45,65,96,45,23,23,14),
                 scores_set2 = c(32,40,60,98,21,23,21,63))
I want to randomly select some rows
selected_indeces<- sample(c(1:8), 4, replace = FALSE)
Now I want to add up the values at selected_indeces sequentially, meaning that for the first selected index I just need the value of that specific row; for the second, the second row's value plus the first selected value; and for the nth index, the sum of all values already selected plus the value of the nth row. So first I need a matrix to put the results in:
cumulative_loss<-matrix(rep(NA,8*2),nrow=8,ncol=2)
and then one loop for each column and another for each selected_index
for (s in 1:ncol(df))                     # for each column
{
  for (i in 1:length(selected_indeces))   # for each randomly selected index
  {
    if (i == 1)
    {
      cumulative_loss[i, s] <- df[selected_indeces[i], s]
    }
    if (i > 1)
    {
      cumulative_loss[i, s] <- df[selected_indeces[i], s] +
                               df[selected_indeces[i-1], s]
    }
  }
}
The script runs, though it may be a naive way of doing such a thing. The problem is that for i = 4 it only adds the values of the fourth and third selections, whereas I want it to add the first, second, third and fourth random selections and return that.
Conveniently, cumsum() works on data.frames directly, in which case it runs on each column independently. Thus we can index out the selected rows of df with an index operation and pass the result directly to cumsum() to get the required output:
set.seed(0L);
sel <- sample(1:8,4L);
sel;
## [1] 8 2 3 6
df[sel,];
## scores_set1 scores_set2
## 8 14 63
## 2 45 40
## 3 65 60
## 6 23 23
cumsum(df[sel,]);
## scores_set1 scores_set2
## 8 14 63
## 2 59 103
## 3 124 163
## 6 147 186
To select different indexes for each column, we can use apply():
set.seed(0L);
apply(df,2L,function(col) cumsum(col[sample(1:8,4L)]));
## scores_set1 scores_set2
## [1,] 14 63
## [2,] 59 103
## [3,] 124 126
## [4,] 147 147
If you want to compute the indexes in advance, it becomes slightly trickier. Here's one way of doing it:
set.seed(0L);
sels <- replicate(2L,sample(1:8,4L)); sels;
## [,1] [,2]
## [1,] 8 8
## [2,] 2 2
## [3,] 3 6
## [4,] 6 5
sapply(seq_len(ncol(df)),function(ci) cumsum(df[[ci]][sels[,ci]]));
## [,1] [,2]
## [1,] 14 63
## [2,] 59 103
## [3,] 124 126
## [4,] 147 147
Here's a way to do this with data.table (taking into account your comment on @bgoldst's answer):
library(data.table); setDT(df)
#sample 4 elements of each column (i.e., every element of .SD), then cumsum them
df[ , lapply(.SD, function(x) cumsum(sample(x, 4)))]
If you want to use different indices for each column, I would pre-choose them first:
set.seed(1023)
idx <- lapply(integer(ncol(df)), function(...) sample(nrow(df), 4))
idx
# [[1]] #indices for column 1
# [1] 2 8 6 3
#
# [[2]] #indices for column 2
# [1] 4 8 5 1
Then modify the above slightly:
df[ , lapply( seq_along(.SD), function(jj) cumsum(.SD[[jj]][ idx[[jj]] ]) )]
This is the craziest compendium of brackets/parentheses I've ever written in a functional line of code, so I guess it makes sense to break things down a bit:
seq_along(.SD) picks out the index number of each column, jj
.SD[[jj]] selects the jj-th column, idx[[jj]] selects the indices for that column, and .SD[[jj]][ idx[[jj]] ] picks the idx[[jj]] rows of the jj-th column; this is equivalent to .SD[idx[[jj]], jj, with = FALSE]
Lastly, we cumsum the length(idx[[jj]]) rows we chose for column jj.
Result:
# V1 V2
# 1: 45 98
# 2: 59 161
# 3: 82 182
# 4: 147 214
With dplyr, if we want to sample each column separately and take the cumsum, we can use mutate_each and then select the first 4 rows with head.
library(dplyr)
df %>%
mutate_each(funs(cumsum(sample(.)))) %>%
head(.,4)
If the sample needs to be of whole rows (the same rows for every column):
df %>%
slice(sample(row_number(), 4)) %>%
mutate_each(funs(cumsum))
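A side note beyond the original answer: mutate_each() has since been deprecated in dplyr; in current releases the same per-column sample-then-cumsum idea is written with across():
library(dplyr)
df %>%
  mutate(across(everything(), ~ cumsum(sample(.x)))) %>%
  head(4)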

How do you apply a function to a nested list?

I need to get the maximum of a variable in a nested list. For a certain station number "s" and a certain member "m", mylist[[s]][[m]] is of the form:
station date.time member bias
6019 2011-08-06 12:00 mbr003 86
6019 2011-08-06 13:00 mbr003 34
For each station, I need to get the maximum bias across all members. For s = 3, I managed to do it through:
library(plyr)
var1 <- mylist[[3]]
var2 <- lapply(var1, `[`, 4)
var3 <- laply(var2, .fun = max)
max.value <- max(var3)
Is there a way of avoiding the column number "4" in the second line and using the variable name $bias in lapply or a better way of doing it?
You can use [ with the names of columns of data frames as well as their index. So foo[4] will have the same result as foo["bias"] (assuming that bias is the name of the fourth column).
$bias isn't really the name of that column. $ is just another function in R, like [, that is used for accessing columns of data frames (among other things).
But now I'm going to go out on a limb and offer some advice on your data structure. If each element of your nested list contains the data for a unique combination of station and member, here is a simplified toy version of your data:
dat <- expand.grid(station = rep(1:3,each = 2),member = rep(1:3,each = 2))
dat$bias <- sample(50:100,36,replace = TRUE)
tmp <- split(dat,dat$station)
tmp <- lapply(tmp,function(x){split(x,x$member)})
> tmp
$`1`
$`1`$`1`
station member bias
1 1 1 87
2 1 1 82
7 1 1 51
8 1 1 60
$`1`$`2`
station member bias
13 1 2 64
14 1 2 100
19 1 2 68
20 1 2 74
etc.
tmp is a list of length three, where each element is itself a list of length three. Each element is a data frame as shown above.
It's really much easier to record this kind of data as a single data frame. You'll notice I constructed it that way first (dat) and then split it twice. In this case you can rbind it all together again using code like this:
newDat <- do.call(rbind,lapply(tmp,function(x){do.call(rbind,x)}))
rownames(newDat) <- NULL
In this form, these sorts of calculations are much easier:
library(plyr)
#Find the max bias for each unique station+member
ddply(newDat,.(station,member),summarise, mx = max(bias))
station member mx
1 1 1 87
2 1 2 100
3 1 3 91
4 2 1 94
5 2 2 88
6 2 3 89
7 3 1 74
8 3 2 88
9 3 3 99
#Or maybe the max bias for each station across all members
ddply(newDat,.(station),summarise, mx = max(bias))
station mx
1 1 100
2 2 94
3 3 99
Here is another solution using repeated lapply.
lapply(tmp, function(x) lapply(lapply(x, '[[', 'bias'), max))
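If the goal is a single maximum per station rather than a nested list, the inner results can be collapsed with sapply (a sketch along the same lines; with the sample data above it matches the per-station ddply result):
sapply(tmp, function(x) max(sapply(x, `[[`, 'bias')))
#   1   2   3 
# 100  94  99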
You may need to use [[ instead of [, but it should work fine with a string (don't use the $). Try:
var2 <- lapply( var1, `[`, 'bias' )
or
var2 <- lapply( var1, `[[`, 'bias' )
depending on whether var1 is a list.
