I am trying to build a data frame with a series of columns that each contain a random reassignment of another column. The data has some structure which needs to be maintained: namely, I want to randomise the assignment of L many times over, while maintaining the structure in V. I want to take a data frame that looks like this:
L B V A
1 1 1 2 10.9
2 1 1 2 6.5
3 1 1 2 8.6
4 1 1 3 11.1
5 1 1 4 13.1
6 1 1 6 11.5
And create this:
ID L B V A R1 R2 R3 R4 R5
1 1_1_2 1 1 2 10.9 27 20 19 6 26
2 1_1_2 1 1 2 6.5 27 20 19 6 26
3 1_1_2 1 1 2 8.6 27 20 19 6 26
4 1_1_3 1 1 3 11.1 6 28 4 26 26
5 1_1_4 1 1 4 13.1 16 2 6 14 32
6 1_1_6 1 1 6 11.5 17 21 3 11 25
I can do this manually using the script below, but I wonder if there is a smooth way to automate it, because I want to do it for hundreds of randomisations to make columns R1, R2, R3, ..., Rn (so a loop would be preferred to manual repetition of the code).
# Example Data Frame #
df = data.frame(sample(1:33, 1000, replace = T), sample(1:3, 1000, replace = T), sample(1:9, 1000, replace = T), round(rnorm(1000, 10, 2),1))
colnames(df) = c("L", "B", "V", "A")
# recode V to consecutive integer ids, then rebuild df with factor columns (the new V is the recoded id)
df = transform(df, id = as.numeric(factor(df$V)))
df = data.frame(as.factor(df[,1]), as.factor(as.numeric(df[,2])), as.factor(df[,5]), as.numeric(df[,4]))
colnames(df) = c("L","B","V","A")
df = df[order(df$L, df$B, df$V),]
rownames(df) = NULL
head(df)
# ID #
df$ID = paste(df[,1], df[,2], df[,3], sep = "_")
ID = unique(as.vector(df$ID))
# R1 #
ID2 = data.frame(ID, sample(ID)); colnames(ID2) = c("ID","R1")
df = merge(df, ID2)
df$R1 = as.factor(do.call(rbind, strsplit(as.vector(df$R1), split="_"))[,1])
# R2 #
ID2 = data.frame(ID, sample(ID)); colnames(ID2) = c("ID","R2")
df = merge(df, ID2)
df$R2 = as.factor(do.call(rbind, strsplit(as.vector(df$R2), split="_"))[,1])
# R3 #
ID2 = data.frame(ID, sample(ID)); colnames(ID2) = c("ID","R3")
df = merge(df, ID2)
df$R3 = as.factor(do.call(rbind, strsplit(as.vector(df$R3), split="_"))[,1])
# R4 #
ID2 = data.frame(ID, sample(ID)); colnames(ID2) = c("ID","R4")
df = merge(df, ID2)
df$R4 = as.factor(do.call(rbind, strsplit(as.vector(df$R4), split="_"))[,1])
# R5 #
ID2 = data.frame(ID, sample(ID)); colnames(ID2) = c("ID","R5")
df = merge(df, ID2)
df$R5 = as.factor(do.call(rbind, strsplit(as.vector(df$R5), split="_"))[,1])
How can I create a loop which will make this happen for any number n of columns?
Following from the above code I finally got to an answer:
# ID #
df$ID = paste(df[,1], df[,2], df[,3], sep = "_")
ID = unique(as.vector(df$ID))
n = 5
Rs = paste0("R", 1:n)   # the new column names R1..Rn
Rs
for(i in 1:n){
  ID2 = data.frame(ID, R = sample(ID))
  # match() keeps df's row order; merge() re-sorts rows by ID as character,
  # which can silently misalign the assignment
  permuted = as.character(ID2$R[match(df$ID, ID2$ID)])
  df[, 5 + i] = as.factor(do.call(rbind, strsplit(permuted, split = "_"))[, 1])
  colnames(df)[5 + i] = Rs[i]
}
head(df)
Gives the result:
L B V A ID R1 R2 R3 R4 R5
1 1 1 2 10.1 1_1_2 21 12 27 4 26
2 1 1 4 7.7 1_1_4 7 29 2 9 10
3 1 1 5 9.7 1_1_5 27 27 3 1 22
4 1 1 5 8.3 1_1_5 27 27 3 1 22
5 1 1 7 9.5 1_1_7 13 15 32 19 11
6 1 1 7 12.4 1_1_7 13 15 32 19 11
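For comparison, the same R1..Rn columns can be generated without merge() at all, by building a named lookup vector for each randomisation. A minimal sketch, assuming df$ID and ID exist as defined above:
n <- 5
for (i in 1:n) {
  perm <- setNames(sample(ID), ID)   # random permutation of the unique IDs
  # keep only the L part (the text before the first "_") of the permuted ID
  df[[paste0("R", i)]] <- as.factor(sub("_.*", "", perm[df$ID]))
}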
I have a dataframe with two species, A and B, and variables a and b associated with a total of 100 rows.
I want to create a sampler such that each set randomly picks 6 rows (reps) from the df dataset. However, the samples for A must come only from rows associated with sp A in df, and similarly for B. I want to do this 500 times over for each of species A and B.
I attempted a for loop, but when I ran it, sampling contained a single row with 6 columns. I would appreciate any guidance.
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
df.sample <- for(i in 1:1000){
  sampling <- sample(df[i,],6,replace = TRUE)
}
#Output in a single row
a a.1 sp b sp.1 a.2
1000 1.68951 1.68951 B 1.395995 B 1.68951
#Expected dataframe
df.sample
set rep a b sp
1 1 1 9 A
1 2 3 2 A
1 3 0 2 A
1 4 1 2 A
1 5 1 6 A
1 6 4 2 A
2 1 1 2 B
2 2 5 2 B
2 3 1 2 B
2 4 1 6 B
2 5 1 8 B
2 6 9 2 B
....
Here's how I would do it (using tidyverse):
data:
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# create an empty table with desired columns
library(tidyverse)
output <- tibble(a = numeric(),
                 b = numeric(),
                 sp = character(),
                 set = numeric())
# sampling in a loop
set.seed(42)
for(i in 1:500){
  samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
  samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
  output %>% add_row(bind_rows(samp1, samp2)) -> output
}
Result
> head(output, 20)
# A tibble: 20 × 4
a b sp set
<dbl> <dbl> <chr> <dbl>
1 2.59 3.31 A 1
2 1.84 1.66 A 1
3 2.35 1.17 A 1
4 2.33 1.95 A 1
5 0.418 1.11 A 1
6 1.19 2.54 A 1
7 2.35 0.899 B 1
8 1.19 1.63 B 1
9 0.901 0.986 B 1
10 3.12 1.75 B 1
11 2.28 2.61 B 1
12 1.37 3.47 B 1
13 2.33 1.95 A 2
14 1.84 1.66 A 2
15 3.76 1.26 A 2
16 2.96 3.10 A 2
17 1.03 1.81 A 2
18 1.42 2.00 A 2
19 0.901 0.986 B 2
20 2.37 1.39 B 2
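As a side note, add_row() copies the growing tibble on every iteration. For hundreds of repetitions it is usually faster to collect the pieces in a list and bind once at the end; a sketch of the same loop:
out_list <- vector("list", 500)
for (i in 1:500) {
  samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE)
  samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE)
  out_list[[i]] <- bind_rows(samp1, samp2) %>% mutate(set = i)
}
output <- bind_rows(out_list)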
You could first split df by species. Random rows within each species can then be drawn with x[sample(nrow(x), 6), ]. Passing that into replicate() repeats the sampling as many times as you like. Here dplyr::bind_rows() is used to combine the samples and add a new column set indicating the sampling index.
lapply(split(df, df$sp), function(x) {
  dplyr::bind_rows(
    replicate(3, x[sample(nrow(x), 6), ], FALSE),
    .id = "set"
  )
})
Output
$A
set a b sp
1 1 1.52480034 3.41257975 A
2 1 1.82542370 2.08511584 A
3 1 1.80019901 1.39279162 A
4 1 2.20765154 2.11879412 A
5 1 1.61295185 2.04035172 A
6 1 1.92936567 2.90362816 A
7 2 0.88903679 2.46948106 A
8 2 3.19223788 2.81329767 A
9 2 1.28629416 2.69275525 A
10 2 2.61044815 0.82495427 A
11 2 2.30928735 1.67421328 A
12 2 -0.09789704 2.62434719 A
13 3 2.10386603 1.78157862 A
14 3 2.17542841 0.84016203 A
15 3 3.22202227 3.49863423 A
16 3 1.07929909 -0.02032945 A
17 3 2.95271838 2.34460193 A
18 3 1.90414536 1.54089645 A
$B
set a b sp
1 1 3.5130317 -0.4704879 B
2 1 3.0053072 1.6021795 B
3 1 4.1167657 1.1123342 B
4 1 1.5460589 3.2915979 B
5 1 0.8742753 0.9132530 B
6 1 2.0882660 1.5588471 B
7 2 1.2444645 1.8199525 B
8 2 2.7960117 2.6657735 B
9 2 2.5970774 0.9984187 B
10 2 1.1977317 3.7360884 B
11 2 2.2830643 1.0452440 B
12 2 3.1047150 1.5609482 B
13 3 2.9309124 1.5679255 B
14 3 0.8631965 1.3501631 B
15 3 1.5460589 3.2915979 B
16 3 2.7960117 2.6657735 B
17 3 3.1047150 1.5609482 B
18 3 2.8735390 0.6329279 B
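To draw 500 sets per species, sample with replacement as the question asks, and get everything in one data frame, you could raise the replicate() count and bind the two list elements together; a sketch following the same pattern:
all_sets <- dplyr::bind_rows(
  lapply(split(df, df$sp), function(x) {
    dplyr::bind_rows(
      replicate(500, x[sample(nrow(x), 6, replace = TRUE), ], FALSE),
      .id = "set"
    )
  })
)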
If I understood what you want correctly, it can be done with the following code:
# Create the initial data frame
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# Rows with sp=A
row.A <- which(df$sp=="A")
row.B <- which(df$sp=="B")
# Sampling data.frame
sampling <- data.frame(matrix(ncol = 5, nrow = 0))
# "rep" column for each iteration
rep1 <- rep(1:6,2)
# Build the data.frame
for(i in 1:500){
  # Sampling row.A
  s.A <- sample(row.A, 6, replace = T)
  # Sampling row.B
  s.B <- sample(row.B, 6, replace = T)
  # Data frame with the subset of df and "set" and "rep" values
  sampling <- rbind(sampling, cbind(set = rep(i, 12), rep = rep1, df[c(s.A, s.B), ]))
}
# Delete row.names of sampling and redefine sampling's column names
row.names(sampling) <- NULL
colnames(sampling) <- c("set", "rep", "a", "b", "sp")
And the output looks like this:
set rep a b sp
1 1 3.713663 2.717456 A
1 2 2.456070 2.803443 A
1 3 2.166655 1.395556 A
1 4 1.453738 5.662969 A
1 5 2.692518 2.971156 A
1 6 2.699634 3.016791 A
I have the following data frame in R:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
                 time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
                 target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
                 comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
And another data frame:
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
                   name = c('a','b','c','d','e','f','g','h','i','j'))
If we take a look at df, sorted by target and comb, we will notice that there are basically "groups". For example, for target=1 and comb=0 there are four entries (p1_start, p1_end, p2_start, p2_end), and it is the same for all other target/comb combinations.
data, on the other hand, contains entries whose time is a timestamp.
Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2 meaning it happened between p1_start,p1_end so it should get the values target=1 and comb=0 mapped to the data data frame.
Example 2: The entries of data with time=14 happened between p2_start,p2_end so they should get the values target=1 and comb=1 mapped to the data data frame.
Idea: I thought I could iterate over df by target and comb and, for each combination of them, check whether there are rows in data whose time lies in between. The check could be done with the following command:
data[which(data$time > p1_start & data$time < p2_end),]
once I get the rows it is easy to append the values.
Problem: how could I do the iteration? I tried with the following:
df %>%
  group_by(target, comb) %>%
  print(data[which(data$time > df$p1_start & data$time < df$p2_end),])
But I am getting an error that time has not been initialized
Your problem is best known as a non-equi join: we need to find the range in a given data frame that corresponds to each value in one or more given vectors. This is handled well by the data.table package.
We first transform your df into a format suitable for performing the join, and then join data with df on time <= end and time >= start. Here is the code:
library(data.table)
# split "p1_start" etc. into type ("p1") and name ("start")
setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
# reshape to wide format so each interval gets start and end columns
df <- dcast(df, ... ~ name, value.var = "time")
cols <- c("target", "comb", "type")
# non-equi join: update data by reference with the matching columns from df
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time <= end, time >= start)]
After dcast, df looks like this
target comb type end start
1: 1 0 p1 3 1
2: 1 0 p2 7 5
3: 1 1 p1 11 9
4: 1 1 p2 15 13
5: 2 0 p1 19 17
6: 2 0 p2 23 21
7: 2 1 p1 27 25
8: 2 1 p2 31 29
And the output is
> data
time name target comb type
1: 2 a 1 0 p1
2: 5 b 1 0 p2
3: 8 c NA NA <NA>
4: 14 d 1 1 p2
5: 14 e 1 1 p2
6: 20 f NA NA <NA>
7: 21 g 2 0 p2
8: 26 h 2 1 p1
9: 28 i NA NA <NA>
10: 28 j NA NA <NA>
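Rows of data whose time falls in none of the intervals (8, 20 and 28 here, which lie in the gaps between one end and the next start) are left as NA.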
Here is a tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
  rename(name_df = name) %>%
  mutate(x = time + 1) %>%
  pivot_longer(
    cols = c(time, x),
    names_to = "helper",
    values_to = "time"
  ) %>%
  right_join(data, by = "time") %>%
  select(time, name, target, comb)
time name target comb
<dbl> <chr> <dbl> <dbl>
1 2 a 1 0
2 5 b 1 0
3 8 c 1 0
4 14 d 1 1
5 14 e 1 1
6 20 f 2 0
7 21 g 2 0
8 26 h 2 1
9 28 i 2 1
10 28 j 2 1
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
                 time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
                 target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
                 comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
                   name = c('a','b','c','d','e','f','g','h','i','j'))
library(fuzzyjoin)
library(tidyverse)
tmp <- df %>%
  separate(name,
           into = c("p", "period"),
           sep = "_",
           remove = TRUE) %>%
  pivot_wider(
    id_cols = c(p, target, comb),
    names_from = period,
    values_from = time
  ) %>%
  select(-p)
fuzzy_left_join(
  x = data,
  y = tmp,
  by = c("time" = "start",
         "time" = "end"),
  match_fun = list(`>=`, `<=`))
#> time name target comb start end
#> 1 2 a 1 0 1 3
#> 2 5 b 1 0 5 7
#> 3 8 c NA NA NA NA
#> 4 14 d 1 1 13 15
#> 5 14 e 1 1 13 15
#> 6 20 f NA NA NA NA
#> 7 21 g 2 0 21 23
#> 8 26 h 2 1 25 27
#> 9 28 i NA NA NA NA
#> 10 28 j NA NA NA NA
Created on 2022-01-11 by the reprex package (v2.0.1)
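The same interval lookup can also be done in plain base R, reusing the reshaped tmp from above; a minimal sketch, assuming each timestamp matches at most one interval:
idx <- sapply(data$time, function(t) {
  hit <- which(t >= tmp$start & t <= tmp$end)   # which interval contains t, if any
  if (length(hit) > 0) hit[1] else NA_integer_
})
cbind(data, tmp[idx, c("target", "comb")])      # NA rows where no interval matched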
I have one data frame that has 6 variables.
df is the name of my data frame.
I found the expectation of e (variable in df) using
Ee <- mean(df[["e"]])
How do I find E[e|Z=z] for z in {0,1}?
Similarly, how do I find E[e|X=x] for x in {1...20} using the sapply function?
Here's a thought:
set.seed(42)
sampdata <- data.frame(e = runif(1000), z = sample(0:1, size=1000, replace=TRUE), x = sample(1:20, size=1000, replace=TRUE))
head(sampdata)
# e z x
# 1 0.9148060 1 15
# 2 0.9370754 0 2
# 3 0.2861395 1 13
# 4 0.8304476 1 12
# 5 0.6417455 1 4
# 6 0.5190959 0 7
aggregate(e ~ z, data = sampdata, FUN = mean)
# z e
# 1 0 0.4910876
# 2 1 0.4852118
aggregate(e ~ x, data = sampdata, FUN = mean)
# x e
# 1 1 0.5097038
# 2 2 0.4495141
# 3 3 0.5077897
# 4 4 0.5300375
# 5 5 0.4549345
# 6 6 0.5122537
# 7 7 0.4704425
# 8 8 0.4911532
# 9 9 0.5572367
# 10 10 0.4634067
# 11 11 0.4408758
# 12 12 0.4815633
# 13 13 0.5503166
# 14 14 0.4922317
# 15 15 0.5205427
# 16 16 0.4999023
# 17 17 0.4784551
# 18 18 0.4282990
# 19 19 0.4202285
# 20 20 0.4852303
But if you feel you must use sapply, then this is equivalent.
sapply(setNames(nm = unique(sampdata$z)), function(Z) mean(sampdata[["e"]][ sampdata[["z"]] == Z ]))
# 1 0
# 0.4852118 0.4910876
sapply(setNames(nm = unique(sampdata$x)), function(X) mean(sampdata[["e"]][ sampdata[["x"]] == X ]))
# 15 2 13 12 4 7 19 16 10 1
# 0.5205427 0.4495141 0.5503166 0.4815633 0.5300375 0.4704425 0.4202285 0.4999023 0.4634067 0.5097038
# 9 3 14 18 11 20 5 8 17 6
# 0.5572367 0.5077897 0.4922317 0.4282990 0.4408758 0.4852303 0.4549345 0.4911532 0.4784551 0.5122537
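For completeness, base R's tapply() computes the same conditional means in one call each:
tapply(sampdata$e, sampdata$z, mean)   # E[e | Z = z]
tapply(sampdata$e, sampdata$x, mean)   # E[e | X = x]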
An option with dplyr
library(dplyr)
sampdata %>%
  group_by(z) %>%
  summarise(e = mean(e))
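The E[e|X=x] case is analogous:
sampdata %>%
  group_by(x) %>%
  summarise(e = mean(e))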
data
set.seed(42)
sampdata <- data.frame(e = runif(1000), z = sample(0:1, size=1000, replace=TRUE),
                       x = sample(1:20, size=1000, replace=TRUE))
I have two data frames, df1 and df2, and I am looking for the simplest operation to get df3.
I want to replace rows in df1 with rows from df2 where the id matches (so rbind.fill is not a solution), and append rows from df2 whose id does not exist in df1, but only for the columns that exist in df2.
I guess I could use several joins and anti-joins and then merge, but I wonder if a function for that operation already exists.
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
df1
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 14 19 24
5 15 20 25
df2
id c1 c2
4 1 5
5 2 6
6 3 7
7 4 8
df3
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 1 5 24
5 2 6 25
6 3 7 NULL
7 4 8 NULL
We can use {powerjoin}: make a full join and deal with the conflicts using coalesce_xy (which is really dplyr::coalesce):
library(powerjoin)
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
power_full_join(df1, df2, by = "id", conflict = coalesce_xy)
# id c1 c2 c3
# 1 1 11 16 21
# 2 2 12 17 22
# 3 3 13 18 23
# 4 4 14 19 24
# 5 5 15 20 25
# 6 6 3 7 NA
# 7 7 4 8 NA
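Note that coalesce_xy gives df1's values priority on conflict (hence 14, 19 for id 4 above). If you want df2 to win, as in the expected df3, powerjoin also ships coalesce_yx (assuming the installed version exports it):
power_full_join(df1, df2, by = "id", conflict = coalesce_yx)
# row for id 4 then becomes: 4 1 5 24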
I ended up with:
library(dplyr)

special_combine <- function(df1, df2){
  df1_int <- df1[, colnames(df1) %in% colnames(df2)]
  df1_ext <- df1[, c("id", colnames(df1)[!colnames(df1) %in% colnames(df2)])]
  df3 <- bind_rows(df1_int, df2)
  df3 <- df3[!duplicated(df3$id, fromLast = TRUE), ] %>%
    dplyr::left_join(df1_ext, by = "id") %>%
    dplyr::arrange(id)
  df3
}
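A quick check with the df1 and df2 defined above reproduces df3 (with NA, which is what R actually prints, rather than NULL):
special_combine(df1, df2)
#   id c1 c2 c3
# 1  1 11 16 21
# 2  2 12 17 22
# 3  3 13 18 23
# 4  4  1  5 24
# 5  5  2  6 25
# 6  6  3  7 NA
# 7  7  4  8 NA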
I'm working with R.
WHAT I HAVE:
ID_1 ID_2 Date x_1 y_2
1 12 3 2011-12-21 15 10
2 12 13 2011-12-22 50 40
3 3 12 2011-12-22 20 30
4 15 13 2011-12-23 30 20
...
and so on
TARGET:
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1 12 3 2011-12-21 15 10 0 0
2 12 13 2011-12-22 50 40 15 0
3 3 12 2011-12-22 20 30 10 50
4 15 13 2011-12-23 30 20 0 40
...
and so on
I want to see in XX_1 and YY_2 the values from the columns x_1 and y_2 corresponding to the previous occurrence of the values of ID_1 and ID_2, or 0 in case no value is available before that date. I don't know how to handle the fact that the same value can appear in either ID_1 or ID_2 (like IDs 3 and 12 in the example).
@Ekatef ID1 AND ID2 (find a match for the whole ID row, even if the order of the IDs is switched):
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1 12 3 2011-12-21 15 10 0 0
2 12 13 2011-12-22 50 40 0 0
3 3 12 2011-12-22 20 30 10 15
4 15 13 2011-12-23 30 20 0 0
5 12 13 2011-12-23 10 5 50 40
The OP has requested to copy the previous value for an ID (if any) to the appropriate new column.
This can be solved by reshaping multiple columns simultaneously from wide to long format, finding the previous value by shifting / lagging, and reshaping back to wide format:
library(data.table)
setDT(DF)[, rn := .I]
long <- melt(DF, id.vars = c("rn", "Date"), measure.vars = patterns("^ID", "^x|y"),
             value.name = c("ID", "value"))
long[order(Date), previous := shift(value, fill = 0), by = ID]
dcast(long, rn + Date ~ variable, value.var = c("ID", "value", "previous"))
rn Date ID_1 ID_2 value_1 value_2 previous_1 previous_2
1: 1 2011-12-21 12 3 15 10 0 0
2: 2 2011-12-22 12 13 50 40 15 0
3: 3 2011-12-22 3 12 20 30 10 50
4: 4 2011-12-23 15 13 30 20 0 40
Alternatively, the final call to dcast() can be replaced by an update while joining:
DF[long, on = .(rn),
   c("XX_1", "YY_2") := .(previous[variable == 1L], previous[variable == 2L])][
     , rn := NULL]
DF
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1: 12 3 2011-12-21 15 10 0 0
2: 12 13 2011-12-22 50 40 15 0
3: 3 12 2011-12-22 20 30 10 50
4: 15 13 2011-12-23 30 20 0 40
which reproduces exactly OP's expected result.
Data
library(data.table)
DF <- fread(
"i ID_1 ID_2 Date x_1 y_2
1 12 3 2011-12-21 15 10
2 12 13 2011-12-22 50 40
3 3 12 2011-12-22 20 30
4 15 13 2011-12-23 30 20 ",
drop = 1L
)
If I understand you correctly, the target ID should be looked up from left to right and from bottom to top in all the rows strictly above the given ID value. I would write the function to find the coordinates of the preceding ID like this:
# find the indices of the preceding ID value
# #id_matrix == your_data_frame[, c("ID_1", "ID_2")]
# [#i_of_row, #i_of_col] are the coordinates of the considered ID
# i_of_row > 1
FindPreviousID <- function(id_matrix, i_of_row, i_of_col) {
  shorten_matrix <- id_matrix[1:(i_of_row - 1), , drop = FALSE]
  # search the flattened matrix backwards for the nearest previous occurrence
  # (use id_matrix, not a global object, so the function is self-contained)
  rev_ind <- match(table = rev(t(shorten_matrix)),
                   x = id_matrix[i_of_row, i_of_col], nomatch = NA_real_)
  n_row_found <- floor((length(shorten_matrix) - rev_ind) / 2) + 1
  n_col_found <- (length(shorten_matrix) - rev_ind) %% ncol(shorten_matrix) + 1
  return(c(row = n_row_found, col = n_col_found))
}
...and use it to calculate XX_1 and YY_2:
# emulate the original dataframe
ID_1 <- c(12, 12, 3, 15, 16, 3)
ID_2 <- c(3, 13, 12, 13, 17, 15)
ids <- cbind(ID_1, ID_2) # IDs columns
x1 <- c(15, 50, 20, 30, 51, 60)
y2 <- c(10, 40, 30, 20, 53, 62)
vars <- cbind(x1, y2) # x&y columns
# assuming that the first XX_1 & YY_2 should always be 0
indices_XX <- sapply(FUN = function(i) FindPreviousID(id_matrix = ids, i_of_col = 1, i),
                     X = seq(along.with = ids[, 1])[-1])
indices_YY <- sapply(FUN = function(i) FindPreviousID(id_matrix = ids, i_of_col = 2, i),
                     X = seq(along.with = ids[, 1])[-1])
# construct XX and YY columns
XX_column <- c(NA, vars[t(indices_XX)])
XX_column[is.na(XX_column)] <- 0
YY_column <- c(NA, vars[t(indices_YY)])
YY_column[is.na(YY_column)] <- 0
Hope that helps :)
Upd. If you want to find a pair of IDs instead of a single ID, the function should be redesigned. One possible solution looks like this:
FindPreviousIDsPair <- function(id_matrix, i_of_row) {
  shorten_matrix <- id_matrix[1:(i_of_row - 1), , drop = FALSE]
  string_to_search_for <- id_matrix[i_of_row, ]
  string_to_search_for_sorted <- string_to_search_for[order(string_to_search_for)]
  found_rows_boolean <- sapply(FUN = function(i)
    all(shorten_matrix[i, order(shorten_matrix[i, ])] == string_to_search_for_sorted),
    X = 1:(i_of_row - 1))
  found_row_n <- ifelse(any(found_rows_boolean),
                        max(which(found_rows_boolean)), NA_real_)
  found_col_of_DI1 <- ifelse(any(found_rows_boolean),
                             match(string_to_search_for[1], shorten_matrix[found_row_n, ]), NA_real_)
  found_col_of_DI2 <- ifelse(any(found_rows_boolean),
                             match(string_to_search_for[2], shorten_matrix[found_row_n, ]), NA_real_)
  return(c(found_row_n, found_col_of_DI1, found_col_of_DI2))
}
Application of the redesigned look-up function to calculate XX and YY:
indices_of_vars <- sapply(FUN = function(i) FindPreviousIDsPair(id_matrix = ids, i),
                          X = seq(along.with = ids[, 1])[-1])
indices_XX <- indices_of_vars[1:2, ]
indices_YY <- indices_of_vars[c(1, 3), ]
XX_column <- c(NA, vars[t(indices_XX)])
XX_column[is.na(XX_column)] <- 0
YY_column <- c(NA, vars[t(indices_YY)])
YY_column[is.na(YY_column)] <- 0
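The two vectors can then be bound back to the emulated data to mirror the target layout; a short sketch using the objects defined above:
# assemble the result (XX_1/YY_2 are 0 where no earlier matching pair exists)
result <- data.frame(ID_1, ID_2, x_1 = x1, y_2 = y2,
                     XX_1 = XX_column, YY_2 = YY_column)
result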