How can I extract a part of a vector to another vector (including positions) - r

I have a vector with different values (positive and negative), and I want to select only the 10 lowest odd values and the 10 lowest even values. Help me, please!

This is a way to do it using base R.
# vector with odd and even numbers
x <- sample(-100:100, 30)
The modulo operator (%%) in R helps get the job done. You can use it this way:
c(
# Extract the 10 lowest even numbers
head(sort(x[x %% 2 == 0]), 10),
# Extract the 10 lowest odd numbers
head(sort(x[x %% 2 == 1]), 10)
)
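Since the vector contains negative values, it may help to confirm how %% treats them. A minimal sketch, with a made-up vector `vals` used only for illustration: in R the result of %% takes the sign of the divisor, so negative odd numbers still give a remainder of 1 and the `== 1` test above catches them.

```r
# Illustrative values only (not from the question)
vals <- c(-3, -2, 3, 4)

# Remainders are never negative when the divisor is positive
remainders <- vals %% 2

# Odd values, including the negative one
is_odd <- remainders == 1
print(vals[is_odd])
```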

Given vector v as your input vector, you can obtain the desired output (including positions) via the following code:
names(v) <- seq_along(v)
# lowest 10 odd numbers
low_odd <- sort(v[v%%2==1])[1:10]
# positions of those odd numbers in v
low_odd_pos <- as.numeric(names(low_odd))
# lowest 10 even numbers
low_even <- sort(v[v%%2==0])[1:10]
# positions of those even numbers in v
low_even_pos <- as.numeric(names(low_even))
Example
set.seed(1)
v <- sample(-50:50)
then
> low_odd
43 101 39 95 85 72 7 73 45 29
-49 -47 -45 -43 -41 -39 -37 -35 -33 -31
> low_odd_pos
[1] 43 101 39 95 85 72 7 73 45 29
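If you would rather not attach names to the vector, one alternative is to keep the indices returned by which() and reorder them with order(). This is only a sketch on the same example setup; `odd_idx` and the other variable names are illustrative.

```r
set.seed(1)
v <- sample(-50:50)

# positions of all odd values in v
odd_idx <- which(v %% 2 == 1)

# reorder those positions by the values they point to, keep the first 10
low_odd_pos <- head(odd_idx[order(v[odd_idx])], 10)

# the 10 lowest odd values themselves
low_odd <- v[low_odd_pos]
```

The same pattern with `v %% 2 == 0` should give the even values and their positions. Using head() instead of [1:10] avoids NA padding when fewer than 10 matches exist.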

Related

How to use Logical indexing and min function to find the row which has min value?

So, I know how to find it using the subset function. Is there any way not to use subset function?
Example dataset:
Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75
So, let's say I need to write code to find the min value in Column B.
My Code: subset(df, B == min(B))
My question:
How to use Logical indexing and min function for this dataset? I don't wanna use subset.
You can use which to find the positions.
x <- c(2,1,3,1)
which(x == min(x))
#[1] 2 4
To get the first hit which.min could be used.
which.min(x)
#[1] 2
With the given data set.
x <- read.table(header=TRUE, text="Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")
which(x$B == min(x$B))
#[1] 2
which(x[2:3] == min(x[2:3]), arr.ind = TRUE)
# row col
#[1,] 2 1
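To return the whole row rather than its position, the logical vector can be used directly as a row index. A minimal sketch on the same data; the names `df`, `min_rows` and `first_min_row` are illustrative.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")

# all rows where B equals its minimum (keeps ties)
min_rows <- df[df$B == min(df$B), ]

# only the first such row
first_min_row <- df[which.min(df$B), ]

print(min_rows)
```

The first form keeps every tied row, while which.min() returns a single row. If B could contain NA values, wrapping the condition in which() is probably the safer choice.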

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
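To go from finding the close pairs to removing them, one option is to look only at the upper triangle of the distance matrix, so that each pair appears once, and drop the later row of each pair. A minimal sketch on a small made-up matrix; `pts`, the reuse of the 0.025 threshold, and the keep-the-earlier-row policy are all assumptions for illustration.

```r
# Small hand-made example: rows 3 and 5 are near-copies of rows 1 and 2
pts <- rbind(c(0, 0), c(1, 1), c(0.01, 0), c(5, 5), c(1, 1.01))

DM <- as.matrix(dist(pts))

# Mask the diagonal and lower triangle so each pair shows up once (row < col)
DM[lower.tri(DM, diag = TRUE)] <- Inf

close_pairs <- which(DM < 0.025, arr.ind = TRUE)

# For each close pair, drop the row with the larger index
drop_rows <- unique(close_pairs[, "col"])

# setdiff() also behaves sensibly when drop_rows is empty
pts_clean <- pts[setdiff(seq_len(nrow(pts)), drop_rows), , drop = FALSE]
print(pts_clean)
```

Since the semi-duplicates sit at the same positions in every matrix of the array, the same `drop_rows` could presumably be computed once and reused for each slice.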

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
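For the original request (a function that accepts the names of the columns to transform and returns the full dataset), a base R version might look like the sketch below. The function name `normalize_cols` and its `cols` argument are hypothetical, not part of any package, and the data frame is retyped from the question.

```r
# Hypothetical helper: min-max scale only the named columns, keep the rest
normalize_cols <- function(df, cols) {
  for (col in cols) {
    rng <- range(df[[col]], na.rm = TRUE)
    # note: a constant column gives rng[2] == rng[1] and therefore NaN
    df[[col]] <- (df[[col]] - rng[1]) / (rng[2] - rng[1])
  }
  df
}

myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"),
  stringsAsFactors = FALSE
)

myNorm <- normalize_cols(myData, c("Var1", "Var2"))
print(myNorm)
```

RecNo and Var3 pass through unchanged, so no conversion to factor is needed.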

Integers that are not divisible by several numbers

I am trying to print a vector with the integers between 1 and 100 that are not divisible by 2, 3 and 7 in R.
I tried seq but I am not sure how to continue.
Another option is to use Filter to, well, filter the sequence for any number that meets your condition:
Filter(function(i) { all(i %% c(2,3,7) != 0) }, seq(100))
## [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
Note that while this may (IMO) be the most readable, it's the worst in terms of performance (so far):
UPDATED to take into account rawr's for loop solution:
microbenchmark(
filter={ v1 <- seq(100); Filter(function(i) { all(i %% c(2,3,7) != 0) }, v1) },
reduce={ v1 <- seq(100); v1[!Reduce(`|`,lapply(c(2,3,7), function(x) !(v1 %%x)))] },
rowout={ v1 <- seq(100); v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0] },
looopy={ v1 <- seq(100); for (ii in c(2,3,7)) v1 <- v1[-which(v1 %% ii == 0)]; v1 },
times=1000
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## filter 108.280 118.7000 143.88592 126.2155 136.6290 2349.952 1000 c
## reduce 21.552 23.8095 25.91997 24.8150 25.8580 144.067 1000 ab
## rowout 26.075 28.4920 31.11812 29.5350 31.2125 184.225 1000 b
## looopy 14.149 16.0765 18.11806 16.8995 17.8595 160.485 1000 a
To make it fair I added sequence generation to all of them (and, I was doing this to compare relative performance vs actual speed anyway, so the comparison results still work).
Original statement:
"Unsurprisingly, akrun's is optimal :-)"
is now superseded by:
"Unsurprisingly, rawr's is optimal :-)"
Basically you want to compute each of the numbers in 1:100 modulo 2, 3, and 7. You could use outer to perform all the modulo operations in a single vectorized operation, using rowSums to identify the elements in 1:100 that are not perfectly divided by 2, 3, or 7.
v1 <- 1:100
v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0]
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
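The outer/rowSums idea generalises to any set of divisors. A small sketch, wrapping it in a helper; the name `not_divisible_by` is made up for illustration.

```r
# Hypothetical helper: keep the elements of v divisible by none of the divisors
not_divisible_by <- function(v, divisors) {
  # one row per element of v, one column per divisor
  remainders <- outer(v, divisors, "%%")
  # keep rows where no remainder is zero
  v[rowSums(remainders == 0) == 0]
}

res <- not_divisible_by(1:100, c(2, 3, 7))
print(res)
```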
We can loop over the divisors with lapply and the modulo operator, convert each remainder of 0 to TRUE by negating (!), use Reduce with | to flag the elements that are divisible by any of the divisors, then negate and subset 'v1'.
v1[!Reduce(`|`,lapply(c(2,3,7), function(x) !(v1 %%x)))]
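Or, instead of looping, this can also be done in a faster way. Note that ! has lower precedence than +, so the leading ! negates the whole sum; the extra parentheses below only make that explicit.
v1[!((!v1%%2) + (!v1%%3) + (!v1%%7))]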
data
v1 <- seq(100)
The other answers are better, but if you really need to use a for loop, as this question suggests, this could be a possibility:
x <- vector()
n <- 1L
for(i in 1:100){if (i%%2!=0 & i%%3!=0 & i%%7!=0) {x[n] <- i; n <- n+1}}
#> x
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
As already mentioned, the other answers posted here are better because they exploit the vectorized capabilities of R. The short code shown here is probably slower than any of the other answers and more complicated to maintain. It is the typical syntax of other programming languages, like C or FORTRAN, applied to R. It works, but it is not the way things should be done.
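Rather than using modulo arithmetic explicitly, we can generate the negative modulo sequence easily by counting down. Then we take the elementwise minimum of the three sequences with pmin(), which is zero wherever any of them is zero (that is, wherever the index is divisible by 2, 3, or 7), and drop the result into which().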
which(as.logical(pmin(rep_len(1:0, 100),
rep_len(2:0, 100),
rep_len(6:0, 100))))
If we want to be a bit less hardcoded, we might use do.call with lapply():
which(as.logical(do.call(pmin, lapply(c(2,3,7)-1, function(x)rep_len(x:0, 100)))))
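To see why counting down works, here is a short check that the countdown sequence matches the "negative modulo" values, and that the pmin() combination keeps only indices divisible by none of the divisors. The variable names are illustrative.

```r
n <- 100
idx <- seq_len(n)

# Counting down 2, 1, 0, 2, 1, 0, ... is the same as (-i) %% 3 for i = 1..n
countdown_3 <- rep_len(2:0, n)
neg_mod_3 <- (-idx) %% 3
same <- all(countdown_3 == neg_mod_3)

# pmin() is zero wherever any of the sequences hits zero, so keeping the
# strictly positive entries keeps indices not divisible by 2, 3, or 7
keep <- which(pmin(rep_len(1:0, n), rep_len(2:0, n), rep_len(6:0, n)) > 0)
print(keep)
```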
EDIT:
Here's one way to do it using logicals:
v1 <- logical(100); for (ii in c(2,3,7) -1) v1 <- v1 | rep_len(rep(c(F,T), c(ii,1)), 100) ; which(!v1)
I had the same problem in my class. I assumed the teacher had given me all the information I needed to find the answer, and I was correct. This is week one and all that other silly stuff all you other advanced people used has not come up.
I did this though.
r = c(1:100)
which(r %% 3 != 0 & r %% 7 != 0 & r %% 2 != 0)
Use the which function.

R: applying a function on whole dataset to find points within a circle

I'm having difficulty applying my function to a data frame in R. I have a data.frame with three columns: the ID of a point, its location on the x axis, and its location on the y axis. For a given point, I need to find the IDs of the points that lie in its neighborhood. I've written a function that checks whether a point lies within a circle centered on the observed point and returns its ID if so.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
# radius set to 1 km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000){
  dx = (x_loc-center_x)
  dy = (y_loc-center_y)
  if (dx^2 + dy^2 <= r^2){
    print(pointid)} ##else {print('')}
}
Unfortunately, I don't know how to apply this function to the whole data frame, so that once I enter the location of the observed point it returns the IDs of all points that lie in its neighborhood. Ideally, I would like to find this relation for all points automatically, so that it returns the points that lie in the neighborhood of each point in the dataset. Until now I have been entering center_x and center_y manually.
Thank you very much in advance for your advice!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle ids that are not simply 1..n (for example, shuffled or arbitrary ids):
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3
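Adapted to the column names from the question, the same approach could be wrapped in a function. This is a sketch under assumptions: the helper name `neighbors_within`, the toy coordinates, and the choice to exclude each point from its own neighbor list are all illustrative.

```r
# Toy data using the question's column names (values are made up)
locdata <- data.frame(
  point_id = c("a", "b", "c", "d"),
  x_loc = c(0, 3, 0, 10),
  y_loc = c(0, 3, 1, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ids of all other points within radius r of each point
neighbors_within <- function(df, r) {
  d <- as.matrix(dist(df[, c("x_loc", "y_loc")]))
  diag(d) <- Inf  # a point is not counted as its own neighbor
  out <- lapply(seq_len(nrow(d)), function(i) df$point_id[d[i, ] <= r])
  names(out) <- df$point_id
  out
}

nb <- neighbors_within(locdata, r = 5)
print(nb)
```

lapply() is used here instead of apply() because apply() simplifies its result to a matrix whenever every point happens to have the same number of neighbors, whereas lapply() always returns a list.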
