I have a data frame to which I would like to add a column that identifies, for each row, the closest value in the same column among all previous rows only (ignoring the row itself).
I found a closest value function but am unsure how to limit it to only previous rows. In the following example I would like to find the closest Revenue value considering only previous rows.
set.seed(1)
df<-data.frame(id=c(1:20),Revenue=sample(20))
closest<-function(xv,sv){
xv[which(abs(xv-sv)==min(abs(xv-sv)))] }
You can try the code below using dist + apply
transform(
df,
close_prev = Revenue[apply(`diag<-`(m <- as.matrix(dist(Revenue)), Inf) / upper.tri(m), 2, which.min)]
)
which gives
id Revenue close_prev
1 1 4 4
2 2 7 4
3 3 1 4
4 4 2 1
5 5 13 7
6 6 19 13
7 7 11 13
8 8 17 19
9 9 14 13
10 10 3 4
11 11 18 19
12 12 5 4
13 13 9 7
14 14 16 17
15 15 6 7
16 16 15 14
17 17 12 13
18 18 10 11
19 19 20 19
20 20 8 7
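To see why the dist + apply one-liner works, it may help to unpack it on the first four Revenue values. This is only an illustrative sketch; the names x, m, masked and idx are made up for the example.

```r
x <- c(4, 7, 1, 2)

# Pairwise absolute differences between all values
m <- as.matrix(dist(x))

# A value must never match itself
diag(m) <- Inf

# Dividing by a logical matrix keeps the upper triangle (x / TRUE)
# and turns everything else into Inf (x / FALSE), so column j only
# "sees" rows 1..(j - 1), i.e. the previous values
masked <- m / upper.tri(m)

# Row index of the smallest remaining distance in each column
idx <- apply(masked, 2, which.min)

x[idx]
```

The first column has no previous rows, so it is all Inf and which.min falls back to index 1, which is why the first row is matched with itself.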
To get only one closest value for each number, you can change the function to use which.min and then apply it row by row as follows.
library(dplyr)
library(purrr)
closest <- function(xv,sv) xv[which.min(abs(xv-sv))]
df %>%
mutate(close_prev = map_dbl(row_number(),
~closest(Revenue[seq_len(max(.x - 1, 1))], Revenue[.x])))
# id Revenue close_prev
#1 1 4 4
#2 2 7 4
#3 3 1 4
#4 4 2 1
#5 5 13 7
#6 6 19 13
#7 7 11 13
#8 8 17 19
#9 9 14 13
#10 10 3 4
#11 11 18 19
#12 12 5 4
#13 13 9 7
#14 14 16 17
#15 15 6 7
#16 16 15 14
#17 17 12 13
#18 18 10 11
#19 19 20 19
#20 20 8 7
On each iteration, all the previous values (rows 1 through .x - 1) are passed to the closest function. max(.x - 1, 1) handles the first row: since there is no value before it, it is compared with itself.
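The same idea can be sketched in base R without dplyr or purrr. The helper name closest_prev is hypothetical, and the input is just the first five Revenue values from the example.

```r
# For each position i, look only at x[1..(i - 1)] (or x[1] for the first
# element) and return the value with the smallest absolute difference
closest_prev <- function(x) {
  vapply(seq_along(x), function(i) {
    prev <- x[seq_len(max(i - 1, 1))]
    prev[which.min(abs(prev - x[i]))]
  }, numeric(1))
}

res <- closest_prev(c(4, 7, 1, 2, 13))
res
# [1] 4 4 4 1 7
```

Like which.min itself, this returns the earliest previous value when two are equally close.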
I've seen many questions about creating a new ID variable based on conditions across multiple columns. However, the condition is usually that a row is marked as a duplicate only if var1 AND var2 are both duplicated.
My question is how do you create a new variable ID and mark for duplicates if
var1 is duplicate, OR
var2 is duplicate, OR
var3 is duplicate.
Example dataset (EDITED):
pat var1 var2 var3
1 1 1 10 1
2 2 16 10 11
3 3 21 27 2
4 4 22 29 2
5 5 31 35 3
6 6 44 47 4
7 7 5 50 5
8 8 6 60 6
9 9 7 70 7
10 10 8 80 7
11 11 9 90 8
12 12 10 11 9
13 13 11 13 91
14 14 11 14 10
15 15 NA 15 15
16 16 NA 15 16
17 17 12 NA 17
18 18 13 NA 18
sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
var1 = c(1,16,21,22,31,44,5,6,7,8,9,10,11,11, NA,NA,12,13),
var2 = c(10,10,27,29,35,47,50,60,70,80,90,11,13,14,15,15,NA,NA),
var3 = c(1,11,2,2,3,4,5,6,7,7,8,9,91,10,15,16,17,18))
So if one of the three var variables is duplicated, then the new ID variable should show a duplicate ID number.
Desired output (EDITED):
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
I couldn't find a question based on similar conditions, therefore I'm asking it.
Many thanks in advance.
EDIT: Ben's answer works perfectly if there are no NA values present. Unfortunately, I did not mention that I also have NA values in var1, var2 or var3. An NA value means that the ID number for var1/2/3 was missing. So I've adjusted the question a bit and added some NA values.
The added question is:
Is it possible for a script to report a duplicate if var1=c(NA,NA), var2=c(1,1) and var3=c(1,2), but to report a unique number if var1=c(NA,NA), var2=c(1,2) and var3=c(1,2)?
Maybe you could try the following. Here we use tail and head to refer to rows 2 through 14 compared to 1 through 13 (effectively comparing each row with the prior row).
We compare each row with the previous row element by element and use rowSums to count how many columns match. If that count is zero (no column matches), the result is TRUE (or 1), and the ID increases by one at that row. These values are cumulatively summed with cumsum.
The use of c will make the first ID 1. Also, the cumsum is adjusted by 1 to account for the initial ID of 1.
sample$ID <-
c(1, cumsum(rowSums(tail(sample[-1], -1) == head(sample[-1], -1)) == 0) + 1)
sample
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
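The head/tail comparison can be unpacked step by step. The following is a small sketch on the first four rows of the example (without the pat column); the names toy, same, new_group and id are made up for illustration.

```r
toy <- data.frame(var1 = c(1, 16, 21, 22),
                  var2 = c(10, 10, 27, 29),
                  var3 = c(1, 11, 2, 2))

# Rows 2..n compared element-wise with rows 1..(n - 1)
same <- tail(toy, -1) == head(toy, -1)

# TRUE where a row shares no value with the row before it
new_group <- rowSums(same) == 0

# The first row always gets ID 1; every TRUE afterwards bumps the ID
id <- unname(c(1, cumsum(new_group) + 1))
id
# [1] 1 1 2 2
```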
Edit: Based on comment below, there are occasions where the value is NA which should be ignored. In the example above, NA repeated (such as var2 in rows 17-18) does not count as a duplicate.
Here is another approach. You can use sapply to go through the rows numbers of your data.frame.
You can use mapply to subtract each var in the next row from the same var in a given row, and check whether any difference is zero. Note that na.rm = TRUE makes any() ignore NA values, so a comparison involving a missing value never counts as a match.
sample$ID <-
  c(1,
    cumsum(
      sapply(
        seq_len(nrow(sample)-1),
        \(x) {
          !any(mapply(`-`, sample[x, -1, drop = TRUE], sample[x + 1, -1, drop = TRUE]) == 0, na.rm = TRUE)
        }
      )
    ) + 1
  )
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
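To illustrate how na.rm = TRUE handles the two cases from the question, here is a minimal sketch. The helper has_duplicate and the row_* lists are hypothetical names used only for this example.

```r
# TRUE if two rows share at least one non-missing value in the same column
has_duplicate <- function(a, b) {
  any(mapply(`-`, a, b) == 0, na.rm = TRUE)
}

# var1 = c(NA, NA), var2 = c(1, 1), var3 = c(1, 2): var2 matches
row_a <- list(var1 = NA_real_, var2 = 1, var3 = 1)
row_b <- list(var1 = NA_real_, var2 = 1, var3 = 2)
dup_1 <- has_duplicate(row_a, row_b)   # TRUE

# var1 = c(NA, NA), var2 = c(1, 2), var3 = c(1, 2): nothing matches
row_c <- list(var1 = NA_real_, var2 = 1, var3 = 1)
row_d <- list(var1 = NA_real_, var2 = 2, var3 = 2)
dup_2 <- has_duplicate(row_c, row_d)   # FALSE
```

NA - NA is NA, so the missing var1 values drop out of any() instead of being treated as equal.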
I have a set of marbles of different colors and weights, and I want to split them into groups based on their weight and color.
The conditions are:
A group cannot weigh more than 100 units
A group cannot have more than 5 different-colored marbles.
A reproducible example:
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
color weight
1 1 22
2 15 33
3 13 35
4 11 13
5 6 26
6 8 15
7 10 3
8 16 22
9 14 21
10 3 16
11 4 26
12 20 30
13 9 31
14 2 16
15 7 12
16 17 13
17 19 19
18 5 17
19 12 12
20 18 40
And what I want is this group column:
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
TIA.
The code below doesn't produce an optimal assignment to the groups; it just works sequentially through the data frame. It uses rowwise and might not be the most efficient way, as it's not a vectorized approach.
library(dplyr)
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
Below I create a rowwise function which we can apply using dplyr
assign_group <- function(color, weight) {
  # Conditions: would the current group still be valid with this marble added?
  clists = append(color_list, color)
  sum_val = group_sum + weight
  num_colors = length(unique(clists))
  assign_condition = (sum_val <= 100 & num_colors <= 5)
  # Assign globals: extend the current group, or start a new one with this marble
  cval <- if(assign_condition) clists else c(color)
  sval <- ifelse(assign_condition, sum_val, weight)
  gval <- ifelse(assign_condition, group_number, group_number + 1)
  assign("color_list", cval, envir = .GlobalEnv)
  assign("group_sum", sval, envir = .GlobalEnv)
  assign("group_number", gval, envir = .GlobalEnv)
  # group_number is looked up in the global environment, so this is the updated value
  res = group_number
  return(res)
}
I then set up a few global variables to track the allocation of the marbles to each group.
# globals
color_list <<- c()
group_sum <<- 0
group_number <<- 1
Finally run this function using mutate
test <- marbles %>% rowwise() %>% mutate(group = assign_group(color,weight)) %>% data.frame()
Which results in the below
color weight group
1 6 27 1
2 12 16 1
3 15 32 1
4 20 25 1
5 19 5 2
6 2 21 2
7 16 39 2
8 17 4 2
9 11 16 2
10 7 7 3
11 10 5 3
12 1 30 3
13 13 7 3
14 9 39 3
15 14 7 4
16 8 17 4
17 18 9 4
18 4 36 4
19 3 1 4
20 5 3 5
And seems to meet the constraints
test %>% group_by(group) %>% summarise(tot_w = sum(weight), n_c = length(unique(color)) )
group tot_w n_c
<dbl> <int> <int>
1 1 100 4
2 2 85 5
3 3 88 5
4 4 70 5
5 5 3 1
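The same sequential assignment can also be sketched without global state, using a plain loop that carries the running weight and the colors seen so far. This is only an alternative sketch, not the code above: the function name assign_groups and its arguments max_weight and max_colors are made up, and it assumes no single marble weighs more than max_weight. The data are the marbles from the question.

```r
assign_groups <- function(color, weight, max_weight = 100, max_colors = 5) {
  group <- integer(length(weight))
  g <- 1L        # current group number
  total <- 0     # running weight of the current group
  seen <- c()    # distinct colors in the current group
  for (i in seq_along(weight)) {
    new_seen <- union(seen, color[i])
    # Start a new group if adding this marble would break a constraint
    if (total + weight[i] > max_weight || length(new_seen) > max_colors) {
      g <- g + 1L
      total <- 0
      new_seen <- color[i]
    }
    total <- total + weight[i]
    seen <- new_seen
    group[i] <- g
  }
  group
}

color  <- c(1, 15, 13, 11, 6, 8, 10, 16, 14, 3, 4, 20, 9, 2, 7, 17, 19, 5, 12, 18)
weight <- c(22, 33, 35, 13, 26, 15, 3, 22, 21, 16, 26, 30, 31, 16, 12, 13, 19, 17, 12, 40)
groups <- assign_groups(color, weight)
groups
# [1] 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5
```

On the question's data this reproduces the desired group column. Like the rowwise version, it is greedy and makes no attempt to minimise the number of groups.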
In base R, you could write a recursive function as shown below:
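create_group = function(df,a){
  # Initial guess: start a new group every 100 units of cumulative weight
  if(missing(a)) a = cumsum(df$weight)%/%100
  # b flags every 6th row within a group
  b = !ave(df$color,a,FUN=seq_along)%%6
  # d flags rows whose running weight within the adjusted group exceeds 100
  d = ave(df$weight,a+b,FUN=cumsum)>100
  a = a+b+d
  # Repeat until no row is flagged, then return 1-based group numbers
  if (any(b|d)) create_group(df,a) else cbind(df,group = a+1)
}
create_group(marbles)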
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
I'm trying to fill a missing ID column of a data frame as shown below. It's not blank in the first row it applies to and then blank until the next ID. I wrote ugly code to do this in a for loop, but wonder if there's a tidy-ier way to do this. Any suggestions?
Here's what I've got:
code data
1 A 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 B 11
12 12
13 13
14 14
15 15
16 C 16
17 17
18 18
19 19
20 20
I want:
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
Code I've got now:
# Create mock data frame
df <- data.frame(code = c("A", rep("", 9),
"B", rep("", 4),
"C", rep("", 4)),
data = 1:20)
# For loop over rows (BAD!)
for (i in seq(2, nrow(df))) {
df[i,]$code <- ifelse(df[i,]$code == "", df[i-1,]$code, df[i, ]$code)
}
There is a tidyr way to do it: the fill function. You first need to replace the zero-length strings with NA for this to work, which you can easily do using the mutate and na_if functions from dplyr.
library(dplyr)
library(tidyr)
df %>%
  mutate(code = na_if(code, "")) %>%
  fill(code)
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
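If adding a package dependency is not an option, a base R alternative (not the tidyr approach above) can be sketched with cumsum. It assumes the first element is non-empty; the vector below is a shortened made-up version of the code column.

```r
code <- c("A", "", "", "B", "", "C", "")

# TRUE at every row that carries its own label
has_label <- code != ""

# cumsum(has_label) counts how many labels have been seen so far,
# which is exactly the index of the label that applies to each row
filled <- code[has_label][cumsum(has_label)]
filled
# [1] "A" "A" "A" "B" "B" "C" "C"
```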
I have a vector like this:
w<-c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)
I would like an index that restarts at 1 at each value of 1 and counts up from there, with NA before the first 1.
output : NA,NA,NA,NA,NA,1,2,3,4,5,6,7,1,2,3,4,5,1,2,3,4,5,6,7,8,9
ideally applicable to a data frame.
Thanks
Edit: w is a data frame. What I want is roughly what this code does:
m<-as.data.frame(w)
m[m!=1] <- row(m)[m!=1]
m
w
1 1
2 2
3 3
4 4
5 5
6 1
7 7
8 8
9 9
10 10
11 11
12 12
13 1
14 14
15 15
16 16
17 17
18 1
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
but with the counter returning to 1 at each row where the value is 1:
> m
w wanted
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 1 1
7 7 2
8 8 3
9 9 4
10 10 5
11 11 6
12 12 7
13 1 1
14 14 2
15 15 3
16 16 4
17 17 5
18 1 1
19 19 2
20 20 3
21 21 4
22 22 5
23 23 6
24 24 7
25 25 8
26 26 9
Thanks
This assumes that the data is ordered in the way shown in example.
m$wanted <- with(m, ave(w, cumsum(c(TRUE,diff(w) <0)), FUN=seq_along))
m$wanted
#[1] 1 2 3 4 5 1 2 3 4 5 6 7 1 2 3 4 5 1 2 3 4 5 6 7 8 9
For the given data including repeated 1's and non-sequential input, the following works:
m[9,1] <- 100
m[3,1] <- 55
m[14,1] <- 60
m[25,1] <- 1
m[19,1] <- 1
m$result <- 1:nrow(m) - which(m$w == 1)[cumsum(m$w == 1)] + 1
But if the data does not start on 1:
m[1,1] <- 2
Then this works:
firstone <- which(m$w == 1)[1]
subindex <- m[firstone:nrow(m),'w'] == 1
m$result <- c(rep(NA,firstone-1),1:length(subindex) - which(subindex)[cumsum(subindex)] + 1)
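For the original 0/1 vector from the question, the wanted column (including the leading NA values) can also be sketched with cumsum and ave. This is an alternative to the code above, not a restatement of it; grp and wanted are illustrative names.

```r
w <- c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)

# Each 1 starts a new run; rows before the first 1 get run number 0
grp <- cumsum(w == 1)

# Count positions within each run
wanted <- ave(seq_along(w), grp, FUN = seq_along)

# Rows before the first 1 have no index
wanted[grp == 0] <- NA
wanted
#  [1] NA NA NA NA NA  1  2  3  4  5  6  7  1  2  3  4  5  1  2  3  4  5  6  7  8  9
```

For a data frame, the same lines apply with m$w in place of w.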
I have a dataframe that looks like this:
df <- data.frame(a = 1:20)
df$b <- 2:21
df$c <- 3:22
> df
a b c
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
11 11 12 13
12 12 13 14
13 13 14 15
14 14 15 16
15 15 16 17
16 16 17 18
17 17 18 19
18 18 19 20
19 19 20 21
20 20 21 22
I would like to add another column to the data frame (df$d) so that each block of 5 rows takes the value of df$a from the first row of that block. I tried to pick out those rows with df$d[seq(1, nrow(df), 4)].
I have tried the manual way, but was wondering if there is a for loop or shorter way that can do this easily. I'm new to R, so I apologize if this seems trivial to some people.
"Manual" way:
df$d[1:5] <- df$a[1]
df$d[6:10] <- df$a[6]
df$d[11:15] <- df$a[11]
df$d[16:20] <- df$a[16]
>df
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
I have tried
for (i in 1:nrow(df))
{df$d[i:(i+4)] <- df$a[seq(1, nrow(df), 4)]}
But this is not going the way I want it to. What am I doing wrong?
This should work:
df$d <- rep(df$a[seq(1,nrow(df),5)],each=5)
And here's a data.table solution:
library(data.table)
dt = data.table(df)
dt[, d := a[1], by = (seq_len(nrow(dt))-1) %/% 5]
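Note that rep(..., each = 5) only lines up when the number of rows is a multiple of 5. A sketch that avoids this assumption computes, for every row, the index of the first row of its block; the 12-row data frame and the name block_start are made up for the example.

```r
df <- data.frame(a = 1:12)

# Row i belongs to block (i - 1) %/% 5; that block starts at row block * 5 + 1
block_start <- (seq_len(nrow(df)) - 1) %/% 5 * 5 + 1

df$d <- df$a[block_start]
df$d
# [1]  1  1  1  1  1  6  6  6  6  6 11 11
```

This uses the same integer-division grouping as the data.table answer, just expressed as a direct index into df$a.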
I'd use logical indexing after initializing to NA
df$d <- NA
df$d <- rep(df$a[ c(TRUE, rep(FALSE,4)) ], each=5)
df
#--------
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16