R: alternate two/four vectors after every two/four values

Say I have two vectors with values that come from formulas:
A <- c(0.11, -0.11, -.20, .20, -0.18, 0.18)
B <- c(-0.11, 0.11, .20, -.20, 0.18, -0.18)
What I wish to accomplish is to merge the vectors into one vector made of the first two values of A, then the 3rd and 4th values of B, then the 5th and 6th values of A (in the actual data set the vectors are 96 values long), to end up with:
V <- c(0.11, -0.11, .20, -.20, -0.18, 0.18)
I wish to accomplish the same with four vectors where it switches between vectors every 4 values. Seeing as the vectors are long, I don't want to have to resort to the use of indices.
I've fumbled around a lot with combinations of the c() and rbind() functions, but always end up merging incorrectly.
An example of code I've tried (with called objects substituted with possible values):
c(rbind(1.2 - (1.2 + 1.2/2), 1.2 - 1.2/2),
  rbind(1.2 - 1.2/2, 1.2 - (1.2 + 1.2/2)))
This would end up with the second vector simply appended after the first one ends, rather than interleaved with it. I've tried different combinations, but none worked out for me.
Does anybody have a nifty trick up their sleeve?

Here's a wrapper function that will accept any number of vectors and give you the desired result (the vectors are assumed to be of the same length):
Myfunc <- function(...) {
  temp <- cbind(...)   # one column per input vector
  len <- ncol(temp)
  # Build (row, column) index pairs: column 1 for the first `len` rows,
  # column 2 for the next `len` rows, and so on, recycled across all rows.
  suppressWarnings(temp[cbind(seq(nrow(temp)), rep(seq(len), each = len))])
}
Myfunc(A, B)
## [1] 0.11 -0.11 0.20 -0.20 -0.18 0.18
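The trick here is R's matrix indexing: subsetting a matrix with a two-column matrix of (row, column) pairs extracts one element per pair. A small illustration of the mechanism (my own, not part of the original answer):
m <- matrix(1:6, nrow = 3)     # column 1 is 1:3, column 2 is 4:6
m[cbind(1:3, c(1, 2, 1))]      # extracts m[1,1], m[2,2], m[3,1]
## [1] 1 5 3
The recycled column indices in Myfunc (with the recycling warning suppressed) are what make the result alternate between the input vectors.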
On four vectors (provided by the OP in the comments):
A <- 1:16 ; B <- 21:36 ; C <- 41:56 ; D <- 61:76
Myfunc(A, B, C, D)
## [1] 1 2 3 4 25 26 27 28 49 50 51 52 73 74 75 76

> (1:6) %% 4 %in% c(1,2)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
> (1:12) %% 8 %in% c(1,2,3,4)
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
etc.
Or, for your example:
> D <- rep(0,6)
> D[(1:6) %% 4 %in% c(1,2)] <- A[(1:6) %% 4 %in% c(1,2)]
> D[!(1:6) %% 4 %in% c(1,2)] <- B[!(1:6) %% 4 %in% c(1,2)]
> D
[1] 0.11 -0.11 0.20 -0.20 -0.18 0.18
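
The same logical-index idea extends to the four-vector case. A sketch of my own (using the four example vectors from the other answer, with a block index instead of %%):
A <- 1:16 ; B <- 21:36 ; C <- 41:56 ; D <- 61:76
# which source vector each position belongs to: 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
idx <- ((ceiling(seq_along(A) / 4) - 1) %% 4) + 1
V <- numeric(length(A))
V[idx == 1] <- A[idx == 1]
V[idx == 2] <- B[idx == 2]
V[idx == 3] <- C[idx == 3]
V[idx == 4] <- D[idx == 4]
V
## [1]  1  2  3  4 25 26 27 28 49 50 51 52 73 74 75 76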

Loop results in wrong position/order

I need to calculate the results of a very simple formula (a weighted average) that uses two variables (A and B) and two weight factors (A_prop and B_prop). The calculation is to be performed on a very large data set, and the weight factors are stored in another data frame that I have called grid here.
My approach was first to create repetitions of the data for each weight-factor combination and then perform the calculations. So far, nothing strange. However, I then thought about calculating the values inside a loop. Everything seemed to be in place, but when I checked the results of both approaches, they did not match. The results from the calculation inside the loop are incorrect.
I know I should just move on and keep the approach that gives me the correct results, especially since the number of lines is quite small. No big problem. However... I can't just live with this. I'm about to tear my hair out.
Can anyone explain to me why the results are not the same? What's wrong with the loop calculation?
Also, any suggestion for a more elegant approach will be welcome.
(Note: this is my first time using a reprex. I hope it is as it should be.)
> require(tidyverse)
> require(magicfor)
> require(readxl)
> require(reprex)
> dput(dt)
structure(list(X = 1:5, A = c(83.73, 50.4, 79.59, 62.96, 0),
B = c(100, 86.8, 80.95, 81.48, 0), weight = c(201.6, 655,
220.5, 280, 94.5), ind = c(733L, 26266L, 6877L, 8558L, 16361L
)), class = "data.frame", row.names = c(NA, -5L))
> dput(grid)
structure(list(A_prop = c(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8,
0.85, 0.9, 0.95, 1), B_prop = c(0.5, 0.45, 0.4, 0.35, 0.3, 0.25,
0.2, 0.15, 0.1, 0.05, 0), id = 1:11, tag = structure(1:11, .Label = c("Aprop_0.5",
"Aprop_0.55", "Aprop_0.6", "Aprop_0.65", "Aprop_0.7", "Aprop_0.75",
"Aprop_0.8", "Aprop_0.85", "Aprop_0.9", "Aprop_0.95", "Aprop_1"
), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
> foo<-function(data,i){
+ val<-(data$A*grid[i,1])+(data$B*grid[i,2])
+ return(val)
+ }
> magic_for(print, progress=FALSE,silent = TRUE)
> for(i in grid$id){
+
+ score<-(dt$A*grid[i,1])+(dt$B*grid[i,2])
+
+ weight=dt$weight
+ A<-dt$A
+ B<-dt$B
+
+ ind=dt$ind
+
+ print(score)
+ print(weight)
+ print(ind)
+ print(A)
+ print(B)
+ }
> rest<-magic_result_as_dataframe()
> magic_free()
> rest2<-left_join(rest,grid,by=c("i"="id"))%>%
+ arrange(ind,tag)%>%
+ mutate(score2=(A*A_prop)+(B*B_prop))
> head(rest2)
i score weight ind A B A_prop B_prop tag score2
1 1 91.8650 201.6 733 83.73 100 0.50 0.50 Aprop_0.5 91.8650
2 2 84.5435 201.6 733 83.73 100 0.55 0.45 Aprop_0.55 91.0515
3 3 86.1705 201.6 733 83.73 100 0.60 0.40 Aprop_0.6 90.2380
4 4 87.7975 201.6 733 83.73 100 0.65 0.35 Aprop_0.65 89.4245
5 5 89.4245 201.6 733 83.73 100 0.70 0.30 Aprop_0.7 88.6110
6 6 91.0515 201.6 733 83.73 100 0.75 0.25 Aprop_0.75 87.7975
The problem is actually your left_join and NOT the for loop. For future posts, I would recommend that you also provide a minimal(istic) example.
I will demonstrate what went wrong in your code.
Say, we have these data frames, which should be similar to your real-world data:
dt <- data.frame(
A = c(2,3,4),
B = c(20,30,40)
)
grid <- data.frame(
A_prop = c(0.5, 0.6),
B_prop = c(0.5, 0.4),
id = c(1,2),
tag = c("A_prop0.5", "A_prop0.6"))
We expect the following outputs:
Expected Output dt[1,] & A_prop 0.5 and B_prop 0.5
2 * 0.5 + 20 * 0.5 #= 11
Expected Output dt[2,] & A_prop 0.5 and B_prop 0.5
3 * 0.5 + 30 * 0.5 #= 16.5
Expected Output dt[3,] & A_prop 0.5 and B_prop 0.5
4 * 0.5 + 40 * 0.5 #= 22
Expected Output dt[1,] & A_prop 0.6 and B_prop 0.4
2 * 0.6 + 20 * 0.4 #= 9.2
Expected Output dt[2,] & A_prop 0.6 and B_prop 0.4
3 * 0.6 + 30 * 0.4 #= 13.8
Expected Output dt[3,] & A_prop 0.6 and B_prop 0.4
4 * 0.6 + 40 * 0.4 #= 18.4
I have never used the "magicfor" library, but the problem lies in your way of joining i and id.
I would write the for loop as follows:
l <- list()
for (i in grid$id) {
  score <- (dt$A * grid[i, 1]) + (dt$B * grid[i, 2])
  A <- dt$A
  B <- dt$B
  iteration <- rep(i, 3)  # to keep track of which iteration the result was created in
  l[[i]] <- list(
    score = score,
    A = A,
    B = B,
    iteration = iteration
  )
}
Now I bind the list to a data frame and do the left_join as you did in your example:
l <- bind_rows(l)
l_merged <- grid %>% left_join(l, by = c("id"="iteration")) %>%
mutate(score2 = (A*A_prop + B*B_prop))
Then test that score and score2 are the same:
transmute(l_merged, identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
Now to the actual problem
I have adapted your code a little bit. I have added the iteration number to the output.
magic_for(print, progress = FALSE, silent = TRUE)
for (i in grid$id) {
  score <- (dt$A * grid[i, 1]) + (dt$B * grid[i, 2])
  A <- dt$A
  B <- dt$B
  iteration <- rep(i, 3)
  print(score)
  print(A)
  print(B)
  print(iteration)
}
rest<-magic_result_as_dataframe()
magic_free()
Now, if we look at the output and compare i and iteration, we can see that these are not identical. Therefore your left_join() has produced a confusing result.
rest %>% arrange(i)
i score A B iteration
1 1 11.0 2 20 1
2 1 22.0 4 40 1
3 1 13.8 3 30 2
4 2 16.5 3 30 1
5 2 9.2 2 20 2
6 2 18.4 4 40 2
To finalise, we can test it:
grid %>% left_join(rest, by = c("id"="i")) %>% # using i for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
The join with i does not produce the correct results.
But the join with iteration will:
grid %>% left_join(rest, by = c("id"="iteration")) %>% # using the "manually" produced iteration for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
I am not sure why the i from "magicfor" is different from the manually created iteration. I certainly understand your confusion...
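As for the more elegant approach the OP asked about: a loop-free sketch of my own (not part of the original answer), using the OP's original dt and grid. crossing() comes with tidyr, which the OP already loads via tidyverse; it pairs every row of dt with every row of grid, so there is no join on iteration numbers to mismatch:
library(tidyverse)
res <- crossing(dt, grid) %>%
  mutate(score = A * A_prop + B * B_prop) %>%
  arrange(ind, tag)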

Extract multiple values from a dataset by subsetting with a vector

I have a data frame called "Navi", with 72 rows that describe all possible combinations of three variables f,g and h.
head(Navi)
f g h
1 40.00000 80 0.05
2 57.14286 80 0.05
3 74.28571 80 0.05
4 91.42857 80 0.05
5 108.57143 80 0.05
6 125.71429 80 0.05
I have a dataset that also contains these 3 variables f,g and h along with several others.
head(dataset1[,7:14])
# A tibble: 6 x 8
h f g L1 L2 Ref1 Ref2 FR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.02 20 100 53 53 0.501 2.00 2
2 0.02 20 260 67 67 0.200 5.01 5.2
3 0.02 20 420 72 71 0.128 7.83 8.4
4 0.02 20 580 72 72 0.0956 10.5 11.6
5 0.02 20 740 73 73 0.0773 12.9 14.8
6 0.02 20 900 72 71 0.0655 15.3 18
What I'm trying to do is:
for each row in the combinations data frame, filter the dataset by the three variables f,g and h.
Then, if there are exact matches, give me the matching rows of this dataset, extract the values in the columns "L1" and "FR", and calculate their average. Save the average value in the vectors "L_M2" and "FR_M2".
If there aren't exact matches, give me the rows where f, g and h in the dataset are closest to f, g and h from the data frame. Then extract all values for "L1" and "FR" in these rows and calculate the average. Save the average value in the vectors "L_M2" and "FR_M2".
What I've already tried:
I created two empty vectors where the extracted values shall be saved later on.
Then I am looping over every row of the combinations data frame, filtering the dataset by f,g and h.
The result would be multiple rows, where the values for f,g and h are the same in the dataset as in the row of the combinations data frame.
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$g == Navi[i, 2]
                           & dataset1$h == Navi[i, 3], ]
}
The thing is, I don't know what to do from here on. I don't know how to check for rows with closest values by multiple variables, if there are no exact matches...
I only did something more or less similar in the past, when I only checked for the closest "g" value, like this:
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$g == Navi[i, 2]
                           & dataset1$h == Navi[i, 3], ]
  Differences <- abs(Navi[i, 2] - matchingRows$g)
  indexofMin <- which(Differences == min(Differences))
  L_M2 <- append(L_M2, matchingRows$L1[[indexofMin]], after = length(L_M2))
  FR_M2 <- append(FR_M2, matchingRows$FR[[indexofMin]], after = length(FR_M2))
}
So can anybody tell me how to achieve this extraction process? I am still pretty new to R, so please tell me if I made a rookie mistake or forgot to include some crucial information. Thank you!
First, convert your data to data frames (if not done before):
Navi <- data.frame(Navi)
Savi <- data.frame(dataset1[,7:14])
Then use merge to filter your lines:
df1 <- merge(Navi, Savi, by = c("f","g","h"))
Save "L1" and "FR" average from df1:
Average1 <- ((df1$L1+df1$FR)/2)
Get your new Navi data frame, which does not have an exact match on the f, g, h columns:
Navi_new <- Navi[!duplicated(rbind(df1[, c("f", "g", "h")], Navi))[-seq_len(nrow(df1))], ]
For comparing the values with the nearest match:
A1 <- vapply(Navi_new$f, function(x) x - Savi$f, numeric(nrow(Savi)))
A2 <- apply(abs(A1), 2, which.min)
B1 <- vapply(Navi_new$g, function(x) x - Savi$g, numeric(nrow(Savi)))
B2 <- apply(abs(B1), 2, which.min)
C1 <- vapply(Navi_new$h, function(x) x - Savi$h, numeric(nrow(Savi)))
C2 <- apply(abs(C1), 2, which.min)
A2, B2 and C2 now hold, for each row of Navi_new, the index of the Savi row whose f, g or h is closest; you can use those indices to get the average of "L1" and "FR" as three steps back.
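Note that A2, B2 and C2 each find the nearest row for one variable separately, so they can point at three different rows. A hedged sketch of my own that combines all three into a single nearest match (assuming f, g and h are comparable on their raw scales; rescale them first if they are not):
nearest <- vapply(seq_len(nrow(Navi_new)), function(i) {
  d <- (Savi$f - Navi_new$f[i])^2 +
       (Savi$g - Navi_new$g[i])^2 +
       (Savi$h - Navi_new$h[i])^2
  which.min(d)  # index of the overall closest Savi row
}, integer(1))
Average2 <- (Savi$L1[nearest] + Savi$FR[nearest]) / 2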

Merging dataframes with all.equal on numeric(float) keys?

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was the use of the all.equal function to remove floating point artifacts, however I don't believe there is a way to do this within the merge function.
Currently I get around it by forcing one of the number columns to character and then back to numeric after the merge, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object, and a number of functions are called to extract features from the time series (one of which is the time value), which are returned as a data frame. Meanwhile, another set of functions is called to get other features from the time series as targets. There are also potentially other series having features generated to complement the original series. These values then have to be reunited using the time value.
Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.
Round to the level of precision that will allow the numbers to be equal.
> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically, then you should tell us more about the data and whether we can perhaps assume that mismatches will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a tolerance of sqrt(.Machine$double.eps), which is about 1.5e-8, so in practice rounding to around 8 decimal places gives a similar tolerance.
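If you would rather not round at all, here is a sketch of my own of a tolerance-based match (again assuming that rows which should match differ only by floating point noise below the all.equal tolerance):
tol <- sqrt(.Machine$double.eps)
idx <- sapply(df1$number, function(x) {
  hit <- which(abs(df2$number - x) < tol)
  if (length(hit) == 1) hit else NA_integer_  # NA if there is no unique match
})
merged3 <- cbind(df1, LETTERS = df2$LETTERS[idx])
Note that this behaves like a left join on df1 rather than a full merge.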

How to calculate the amount of numbers inside a specific range

I'm still having problems counting numbers.
I'm trying to find the amount of numbers inside [-0.5, 0.5] with the first line, and the amount of numbers outside the same range with the second line.
I use abc = rnorm(100, mean=0, sd=1). So I have 100 numbers in total, but I only get 35 numbers inside the range and 35 outside the range, which doesn't add up to 100.
length(abc[abc>=-0.5 & abc<=0.5])
[1] 35
length(abc[abc<-0.5 & abc>0.5])
[1] 35
Then I tried:
length(which(abc>=-0.5 & abc<=0.5))
[1] 40
length(which(abc<-0.5 & abc>0.5))
[1] 26
And it still doesn't add up to 100. What's wrong?
You are after:
R> set.seed(1)
R> abc = rnorm(100, mean=0, sd=1)
R> length(abc[abc >= -0.5 & abc <= 0.5])
[1] 41
R> length(abc[abc < -0.5 | abc > 0.5])
[1] 59
What went wrong
Two things:
abc < -0.5 & abc > 0.5 is asking for values that are less than -0.5 and greater than 0.5 at the same time; no number satisfies both, so you want | (or) instead.
However, you actually had abc[abc<-0.5 & abc>0.5], which does something quite different because of how it is parsed. Let's pull it apart:
R> abc[abc<-0.5 & abc>0.5]
[1] 1.5953 0.7383 0.5758 1.5118 1.1249 0.9438 <snip>
Now let's look at abc
R> abc
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
You've changed the value of abc! This is because <- is the assignment operator. You have set abc equal to 0.5 & abc > 0.5. To avoid this, use spacing (as in my code).
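A quick illustration (my own) of how much the spacing matters:
x <- c(-1, 0, 1)
x < -0.5   # comparison: is each value less than -0.5?
## [1]  TRUE FALSE FALSE
x<-0.5     # parsed as x <- 0.5: an assignment that silently overwrites x
x
## [1] 0.5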
When wanting to find numbers inside and outside a radius like this, it can be helpful to consider the absolute value, and you then only have one comparison to make:
length(abc[abs(abc)<=0.5])
[1] 41
length(abc[abs(abc)>0.5])
[1] 59
Or you can use cut and table to do it in one line:
table(cut(abs(abc),c(-Inf,0.5,Inf)))
(-Inf,0.5] (0.5,Inf]
41 59
As a shortcut, you can also do it this way:
set.seed(1)
abc <- rnorm(100, mean=0, sd=1)
sum(abc>=-0.5 & abc<=0.5)
# [1] 41
sum(abc< -0.5 | abc>0.5)
# [1] 59
This works because sum considers TRUE as 1 and FALSE as 0.
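For example, the two counts always add up to the length of the vector, because every element falls on exactly one side:
sum(abc >= -0.5 & abc <= 0.5) + sum(abc < -0.5 | abc > 0.5)
## [1] 100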
Alternatively via subset:
set.seed(1)
abc <- rnorm(100, mean=0, sd=1)
length(subset(abc, abc >= (-0.5) & abc <= 0.5))
[1] 41
length(subset(abc, abc < (-0.5) | abc > 0.5))
[1] 59

finding unique vector elements in a list efficiently

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function that checks every vector against every other one.
F1 <- function(x){
  to_remove <- c()
  for (i in 1:length(x)) {
    for (j in 1:length(x)) {
      # keep the first occurrence, mark later duplicates for removal
      if (i < j && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if (is.null(to_remove)) x else x[-unique(to_remove)]
}
The problem is that this function becomes very slow as the size of the input list x increases, partly due to the two large index vectors allocated for the for loops and the repeated growing of to_remove. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)
As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine.
And thus, so should unique(ll).
I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense).
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, T))
ll
Putting them up against some benchmarks:
library(digest)  # needed for fun2's hashing

fun1 <- function(ll) {
  ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
fun2 <- function(ll) {
  ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
  ll[!duplicated(ll)]
}
fun4 <- function(ll) {
  unique(ll)
}
# Make sure all give the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
    identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
test elapsed relative user.self sys.self
3 unique 0.048 1.000 0.049 0.000
2 duplicated 0.050 1.042 0.050 0.000
1 digest 8.427 175.563 8.415 0.038
# I took out fun1, since when ll is large, it ran extremely slow
Fastest Option:
unique(ll)
You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"