I would like to create a vector in which each element is the n-th element plus the x following elements of another vector.
For example, if I have the vector a:
a <- c(1,2,3,4,5,6,7,8,9,10)
My new vector b should have the elements
b <- c(1,2,5,6,9,10)
that is, the first pair of elements, the third pair, the fifth pair, and so on.
Any help would be much appreciated!
Logical indexing with recycling easily does this:
a <- c(1,2,3,4,5,6,7,8,9,10)
a[c(T,T,F,F)]
## [1] 1 2 5 6 9 10
From your comment to the question:
n <- 4
x <- 2
a[c(rep(T, n-x), rep(F,x))]
## [1] 1 2 5 6 9 10
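The same recycling idea can be wrapped in a small function. This is only a sketch: take_blocks and its arguments keep and block are hypothetical names, and it assumes you want the first keep elements out of every run of block elements.

```r
# Hypothetical helper: keep the first `keep` elements of every block of
# `block` consecutive elements, relying on recycling of the logical index.
take_blocks <- function(v, keep, block) {
  pattern <- rep(c(TRUE, FALSE), times = c(keep, block - keep))
  v[pattern]
}

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
res <- take_blocks(a, keep = 2, block = 4)
res
## [1]  1  2  5  6  9 10
```

Spelling out TRUE and FALSE instead of T and F avoids surprises if someone has assigned to T or F elsewhere in the session.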
I have two integer/posixct vectors:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
Now my resulting vector c should contain for each element of vector a the nearest element of b:
c <- c(4,4,4,4,4,6,6,...)
I tried it with apply and which.min(abs(a - b)) but it's very very slow.
Is there any more clever way to solve this? Is there a data.table solution?
As it is presented in this link you can do either:
which(abs(x - your.number) == min(abs(x - your.number)))
or
which.min(abs(x - your.number))
where x is your vector and your.number is the value. If you have a matrix or data.frame, first convert it to a numeric vector (for example with as.vector or unlist) and then apply this to the resulting vector.
For example:
x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))
would output:
[1] 21 22
Update: Based on the very kind comment of hendy I have added the following to make it more clear:
Note that the results above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so if you want the actual values, you have to use these indexes to extract them. Let's look at another example:
x <- seq(from = 100, to = 10, by = -5)
x
[1] 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10
Now let's find the number closest to 42:
your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]
which would output the "value" we are looking for from the x vector:
[1] 40
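The two forms differ only when there is a tie. A small sketch, reusing x and your.number from the first example (d, all_ties and first_only are illustrative names):

```r
x <- 1:100
your.number <- 21.5
d <- abs(x - your.number)

all_ties   <- which(d == min(d))  # every index at the minimum distance
first_only <- which.min(d)        # only the first such index

x[all_ties]
## [1] 21 22
x[first_only]
## [1] 21
```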
Not quite sure how it will behave with your volume but cut is quite fast.
The idea is to cut your vector a at the midpoints between the elements of b.
Note that I am assuming the elements in b are strictly increasing!
Something like this:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)
cut(a, breaks=cuts, labels=b)
# [1] 4 4 4 4 4 6 6 6 10 10 10 10 10 16 16
# Levels: 4 6 10 16
This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).
findInterval(a, cuts)
[1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4
So of course you can do something like:
index = findInterval(a, cuts)
b[index]
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Note that you can choose what happens to elements of a that are equidistant between two elements of b by passing the relevant arguments to cut (right) or findInterval (left.open); see their help pages.
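The midpoint approach can be put into one function, with the tie rule exposed as an argument. This is a sketch rather than a tuned implementation: nearest_sorted and ties_low are hypothetical names, and the function sorts and de-duplicates b itself so that the increasing-breakpoints assumption holds.

```r
# Hypothetical helper: for each element of `a`, return the nearest element of `b`.
# ties_low = FALSE sends exact midpoints to the larger neighbour,
# ties_low = TRUE sends them to the smaller one.
nearest_sorted <- function(a, b, ties_low = FALSE) {
  b <- sort(unique(b))                        # findInterval needs sorted breaks
  cuts <- c(-Inf, b[-1] - diff(b) / 2, Inf)   # midpoints between neighbours
  b[findInterval(a, cuts, left.open = ties_low)]
}

a <- 1:15
b <- c(4, 6, 10, 16)
up   <- nearest_sorted(a, b)
down <- nearest_sorted(a, b, ties_low = TRUE)
up
##  [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
down
##  [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16
```

The only elements that differ between the two results are 5, 8 and 13, which sit exactly halfway between two elements of b.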
library(data.table)
a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,merge:=Value]
b=data.table(Value=c(4,6,10,16))
b[,merge:=Value]
setkeyv(a,c('merge'))
setkeyv(b,c('merge'))
Merge_a_b=a[b,roll='nearest']
In data.table, a rolling join x[i, roll='nearest'] matches each row of i (the table inside the brackets) to the row of x whose key is nearest. The result therefore has one row per row of i: a[b, roll='nearest'] above has as many rows as b. To get the nearest element of b for every element of a, as the question asks, swap the two tables: b[a, roll='nearest']. As usual, both tables need a common key for the join.
For those who would be satisfied with the slow solution:
sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
Here might be a simple base R option, using max.col + outer:
b[max.col(-abs(outer(a,b,"-")))]
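which gives (max.col breaks ties at random by default, so elements of a that are equidistant from two elements of b may come out differently from run to run)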
> b[max.col(-abs(outer(a,b,"-")))]
[1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
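If a reproducible result is needed, the ties.method argument of max.col can be set explicitly. A short sketch with the example data (neg_dist, low and high are illustrative names); note that outer builds a length(a) by length(b) matrix, so this is only practical for small inputs:

```r
a <- 1:15
b <- c(4, 6, 10, 16)
neg_dist <- -abs(outer(a, b, "-"))   # largest value = smallest distance

low  <- b[max.col(neg_dist, ties.method = "first")]  # ties go to the smaller b
high <- b[max.col(neg_dist, ties.method = "last")]   # ties go to the larger b

low
##  [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16
high
##  [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
```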
Late to the party, but there is now a function in the DescTools package called Closest that does almost exactly what you want (it just handles only one target value at a time).
To get around this, we can lapply over your vector a and find the closest value for each element.
library(DescTools)
lapply(a, function(i) Closest(x = b, a = i))
You might notice that more values are being returned than exist in a. This is because Closest will return both values if the value you are testing is exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).
To get around this, put either min or max around the result:
lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))
Then unlist the result to get a plain vector :)
How should I subset a matrix by specifying both the row and the column of each item? I'm currently using sapply, but I don't find that particularly elegant:
> mat <- data.frame(a=c(1,2,3),b=c(7,6,5))
> mat
a b
1 1 7
2 2 6
3 3 5
> rowSel <- 1:3
> colSel <- c(1,2,1)
> sapply(rowSel,function(i){mat[i,colSel[i]]})
[1] 1 6 3
A shorter way:
mat[cbind(rowSel, colSel)]
#[1] 1 6 3
This uses indexing by a two-column matrix. The first column contains the row index, the second the column index. Each row of the two-column matrix selects one element of mat.
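The same two-column index also works on the left-hand side of an assignment. A minimal sketch on a plain matrix holding the same values as mat (m, idx, picked and m2 are illustrative names):

```r
m <- matrix(c(1, 2, 3, 7, 6, 5), ncol = 2,
            dimnames = list(NULL, c("a", "b")))
idx <- cbind(1:3, c(1, 2, 1))   # one (row, column) pair per line

picked <- m[idx]                # read the three selected cells
picked
## [1] 1 6 3

m2 <- m
m2[idx] <- 0                    # overwrite exactly those cells
m2[2, ]
## a b
## 2 0
```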
I have 2 objects:
A data frame with 3 variables:
v1 <- 1:10
v2 <- 11:20
v3 <- 21:30
df <- data.frame(v1,v2,v3)
A numeric vector with 3 elements:
nv <- c(6,11,28)
I would like to compare the first variable to the first number, the second variable to the second number and so on.
which(df$v1 > nv[1])
which(df$v2 > nv[2])
which(df$v3 > nv[3])
Of course in reality my data frame has a lot more variables so manually typing each variable is not an option.
I encounter these kinds of problems quite frequently. What kind of documentation would I need to read to be fluent in these matters?
One option is to compare objects of equal size. To do this, replicate each element of 'nv' by the number of rows of 'df' (rep(nv, each=nrow(df))) and compare the result with df, or use the col function, which gives the column index of every cell, so that nv[col(df)] produces the same expansion as rep.
which(df > nv[col(df)], arr.ind=TRUE)
If you need a logical matrix holding the comparison of each column with the corresponding element of 'nv':
sweep(df, 2, nv, FUN='>')
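To turn the row/column matrix returned by which(..., arr.ind=TRUE) into one vector of row numbers per variable, the rows can be split by column. A sketch using the rep expansion described above (expanded, hits and by_col are illustrative names):

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# Repeat each threshold once per row so it lines up with its column
expanded <- matrix(rep(nv, each = nrow(df)), nrow = nrow(df))
hits <- which(as.matrix(df) > expanded, arr.ind = TRUE)

# One vector of row indices per column of df
by_col <- split(hits[, "row"], hits[, "col"])
by_col[["1"]]
## [1]  7  8  9 10
by_col[["3"]]
## [1]  9 10
```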
You could also use mapply:
mapply(FUN=function(x, y)which(x > y), x=df, y=nv)
#$v1
#[1] 7 8 9 10
#
#$v2
#[1] 2 3 4 5 6 7 8 9 10
#
#$v3
#[1] 9 10
I think these sorts of situations are tricky because normal looping solutions (e.g. the apply function) only loop through one object, but you need to loop both through df and nv simultaneously. One approach is to loop through the indices and to use them to grab the appropriate information from both df and nv. A convenient way to loop through indices is the sapply function:
sapply(seq_along(nv), function(x) which(df[,x] > nv[x]))
# [[1]]
# [1] 7 8 9 10
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 9 10
I have a problem that I can't resolve.
Example dataset:
company <- c("compA","compB","compC")
compA <- c(1,2,3)
compB <- c(2,3,1)
compC <- c(3,1,2)
df <- data.frame(company,compA,compB,compC)
I want to create a new column containing the value from the column whose name appears in the "company" column of the same row. The resulting extraction would be:
df$new <- c(1,3,2)
df
The way you have it set up, there's one row and one column for every company, and the rows and columns are in the same order. If that's your real dataset, then as others have said diag(...) is the solution (and you should select that answer).
If your real dataset has more than one instance of a company (e.g., more than one row per company), then this is more general:
# using your df
sapply(1:nrow(df),function(i)df[i,as.character(df$company[i])])
# [1] 1 3 2
# more complex case
set.seed(1) # for reproducible example
newdf <- data.frame(company=LETTERS[sample(1:3,10,replace=T)],
A = sample(1:3,10,replace=T),
B=sample(1:5,10,replace=T),
C=1:10)
head(newdf)
# company A B C
# 1 A 1 5 1
# 2 B 1 2 2
# 3 B 3 4 3
# 4 C 2 1 4
# 5 A 3 2 5
# 6 C 2 2 6
sapply(1:nrow(newdf),function(i)newdf[i,as.character(newdf$company[i])])
# [1] 1 2 4 4 3 6 7 2 5 3
EDIT: eddi's answer is probably better. It is more likely that you would have the dataframe to work with rather than the individual row vectors.
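A vectorised alternative to the sapply loop is to index by a two-column matrix of (row, column) pairs. This is a sketch, assuming every value in company is also the name of one of the value columns (vals and col_idx are illustrative names):

```r
company <- c("compA", "compB", "compC")
df <- data.frame(company,
                 compA = c(1, 2, 3),
                 compB = c(2, 3, 1),
                 compC = c(3, 1, 2))

vals <- as.matrix(df[, -1])   # numeric value columns only
# Position of each row's company among the value columns
col_idx <- match(as.character(df$company), colnames(vals))
df$new <- vals[cbind(seq_len(nrow(df)), col_idx)]
df$new
## [1] 1 3 2
```

Dropping the company column before as.matrix keeps the matrix numeric; converting the whole data frame would coerce every value to character.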
I am not sure if I understand your question, it is unclear from your description. But it seems you are asking for the diagonals of the data values since this would be the place where "name is in the column "company" of the same line". The following will do this:
df$new <- diag(matrix(c(compA,compB,compC), nrow = 3, ncol = 3))
The diag function will return the diagonal of the matrix for you. So I first concatenated the three original vectors into one vector and then specified it to be wrapped into a matrix of three rows and three columns. Then I took the diagonal. The whole thing is then added to the dataframe.
Did that answer your question?
Let's say I have two vectors:
x <- c(1,16,20,7,2)
y <- c(1, 7, 5,2,4,16,20,10)
I want to remove elements in y that are not in x. That is, I want to remove elements 5, 4, 10 from y.
y
[1] 1 7 2 16 20
In the end, I want vectors x and y to have the same elements. Order does not matter.
My thoughts: the match function lists the indices where the two vectors contain a matching element, but I need a function that does essentially the opposite: one that gives the indices where the elements of the two vectors don't match.
# this lists the indices in y that match the elements in x
match(x,y)
[1] 1 6 7 2 4 # these are the indices that I want; I want to remove
# the other indices from y
Does anyone know how to do this? thank you
You are after intersect
intersect(x,y)
## [1] 1 16 20 7 2
If you want the indices for the elements of y in x, using which and %in% (%in% uses match internally, so you were on the right track here)
which(y %in% x)
## [1] 1 2 4 6 7
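The logical vector from %in% can also subset y directly, which keeps the elements in their original order in y. A short sketch with the vectors from the question (keep and y_kept are illustrative names):

```r
x <- c(1, 16, 20, 7, 2)
y <- c(1, 7, 5, 2, 4, 16, 20, 10)

keep <- y %in% x     # TRUE where the element of y also occurs in x
y_kept <- y[keep]
y_kept
## [1]  1  7  2 16 20
```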
As @joran points out in the comments, intersect drops duplicates, so a safer option, if you want to keep every matching element, would be something like
intersection <- function(x, y) {
  # keep every element of x that also occurs in y, duplicates included
  x[x %in% y]
}
x <- c(1,1,2,3,4)
y <- c(1,2,3,3)
intersection(x,y)
## [1] 1 1 2 3
# compare with
intersect(x,y)
## [1] 1 2 3
intersection(y,x)
## [1] 1 2 3 3
# compare with
intersect(y, x)
## [1] 1 2 3
You then need to be careful about argument order with this modified function (an issue intersect avoids, as it drops duplicated elements).
If you want the indices of those elements of y not in x, simply prefix with !, as %in% returns a logical vector
which(!y%in%x)
##[1] 3 5 8
Or if you want the elements use setdiff
setdiff(y,x)
## [1] 5 4 10