Fill column with prior nonmissing value, no ID - r

I'm trying to fill in a partially missing ID column of a data frame, as shown below. The ID appears only in the first row it applies to and is then blank until the next ID. I wrote ugly code to do this in a for loop, but wonder if there's a tidy-ier way to do it. Any suggestions?
Here's what I've got:
code data
1 A 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 B 11
12 12
13 13
14 14
15 15
16 C 16
17 17
18 18
19 19
20 20
I want:
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
Code I've got now:
# Create mock data frame
df <- data.frame(code = c("A", rep("", 9),
                          "B", rep("", 4),
                          "C", rep("", 4)),
                 data = 1:20)
# For loop over rows (BAD!)
for (i in seq(2, nrow(df))) {
  df[i, ]$code <- ifelse(df[i, ]$code == "", df[i - 1, ]$code, df[i, ]$code)
}

There is a tidyr way to do it: the fill() function. You also need to replace the zero-length strings with NA for this to work, which you can easily do using the mutate() and na_if() functions from dplyr.
library(dplyr)
library(tidyr)

df %>%
  mutate(code = na_if(code, "")) %>%
  fill(code)
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
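For completeness, an alternative sketch (not part of the answer above) that carries the last non-blank value forward with zoo::na.locf(); it assumes the zoo package is installed and starts from the original df:
library(zoo)
df$code[df$code == ""] <- NA   # turn the blanks into NA, as fill() also required
df$code <- na.locf(df$code)    # last observation carried forward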

Related

Assign value to data based on more than two conditions and on other data

I have a data frame that looks like this
> df
name time count
1 A 10 9
2 A 12 17
3 A 24 19
4 A 3 15
5 A 29 11
6 B 31 14
7 B 7 7
8 B 30 18
9 C 29 13
10 C 12 12
11 C 3 16
12 C 4 6
and for each name group (A, B, C) I would need to assign a category following the rules below:
if time <= 10, then category = 1
if 10 < time <= 20, then category = 2
if 20 < time <= 30, then category = 3
if time > 30, then category = 4
to have a data frame that looks like this:
> df_final
name time count category
1 A 10 9 1
2 A 12 17 2
3 A 24 19 3
4 A 3 15 1
5 A 29 11 3
6 B 31 14 4
7 B 7 7 1
8 B 30 18 3
9 C 29 13 3
10 C 12 12 2
11 C 3 16 1
12 C 4 6 1
after that I would need to sum the values in count based on their category. The ultimate data frame should look like this:
> df_ultimate
name count category
1 A 24 1
2 A 17 2
3 A 30 3
4 A NA 4
5 B 7 1
6 B NA 2
7 B 18 3
8 B 14 4
9 C 22 1
10 C 12 2
11 C 13 3
12 C NA 4
I have tried to play around with summarise and group_by but without much success.
Thanks for your help.
With cut + complete:
library(dplyr)
library(tidyr)
df %>%
  group_by(name,
           category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
  summarise(count = sum(count)) %>%
  complete(category)
# # Groups: name [3]
# name category count
# 1 A 1 24
# 2 A 2 17
# 3 A 3 30
# 4 A 4 NA
# 5 B 1 7
# 6 B 2 NA
# 7 B 3 18
# 8 B 4 14
# 9 C 1 22
# 10 C 2 12
# 11 C 3 13
# 12 C 4 NA

Create ID variable: if ≥1 column duplicate then mark as duplicate

I've seen many questions about creating a new ID variable based on conditions across multiple columns. However, it is usually: if var1 AND var2 are duplicated, then mark them with a duplicate ID number.
My question is how to create a new ID variable and mark rows as duplicates if
var1 is duplicated, OR
var2 is duplicated, OR
var3 is duplicated.
Example dataset (EDITED):
pat var1 var2 var3
1 1 1 10 1
2 2 16 10 11
3 3 21 27 2
4 4 22 29 2
5 5 31 35 3
6 6 44 47 4
7 7 5 50 5
8 8 6 60 6
9 9 7 70 7
10 10 8 80 7
11 11 9 90 8
12 12 10 11 9
13 13 11 13 91
14 14 11 14 10
15 15 NA 15 15
16 16 NA 15 16
17 17 12 NA 17
18 18 13 NA 18
sample <- data.frame(pat  = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
                     var1 = c(1,16,21,22,31,44,5,6,7,8,9,10,11,11,NA,NA,12,13),
                     var2 = c(10,10,27,29,35,47,50,60,70,80,90,11,13,14,15,15,NA,NA),
                     var3 = c(1,11,2,2,3,4,5,6,7,7,8,9,91,10,15,16,17,18))
So if any one of the three var variables matches the previous row, the new ID variable should give both rows the same (duplicate) ID number.
Desired output (EDITED):
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
I couldn't find a question based on similar conditions, therefore I'm asking it.
Many thanks in advance.
EDIT The answer from Ben works perfectly if there are no NA values present. Unfortunately, I did not mention that I also have NA values in var1, var2, or var3. An NA value means that the ID number for that variable is missing. So I've adjusted the question a bit and added some NA values.
The added question is:
Is it possible for the code to judge: if var1 = c(NA, NA), var2 = c(1, 1), and var3 = c(1, 2), report a duplicate, but if var1 = c(NA, NA), var2 = c(1, 2), and var3 = c(1, 2), report unique numbers?
Maybe you could try the following. Here we use tail and head to take rows 2 through 14 and rows 1 through 13 of the data (effectively comparing each row with the prior row).
The element-wise == comparison marks which columns match the previous row, and rowSums counts those matches. If no column matches (rowSums is zero), the result is TRUE (or 1) and the row starts a new ID; these indicators are cumulatively summed with cumsum.
The use of c makes the first ID 1. Also, the cumsum is shifted by 1 to account for that initial ID.
sample$ID <-
c(1, cumsum(rowSums(tail(sample[-1], -1) == head(sample[-1], -1)) == 0) + 1)
sample
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
Edit: Based on the comment below, there are occasions where the value is NA, and these should be ignored. In the example above, a repeated NA (such as var2 in rows 17-18) does not count as a duplicate.
Here is another approach. You can use sapply to loop over the row numbers of your data.frame.
mapply subtracts each var in a row from the corresponding var in the next row, and any() checks whether any difference is zero (i.e., a match between consecutive rows); na.rm = TRUE makes that check ignore the NA comparisons.
sample$ID <- c(
  1,
  cumsum(
    sapply(
      seq_len(nrow(sample) - 1),
      \(x) {
        !any(mapply(`-`, sample[x, -1, drop = TRUE], sample[x + 1, -1, drop = TRUE]) == 0,
             na.rm = TRUE)
      }
    )
  ) + 1
)
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
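The same row-by-row comparison can also be written with dplyr. This is a sketch of my own (not from the answers above): a row starts a new ID when none of var1 to var3 equals the value in the previous row, and NA comparisons are dropped via na.rm = TRUE.
library(dplyr)
sample %>%
  mutate(matches   = rowSums(across(var1:var3, ~ .x == lag(.x)), na.rm = TRUE),
         new_group = matches == 0 & row_number() > 1,  # first row never opens a new group
         ID        = cumsum(new_group) + 1) %>%
  select(-matches, -new_group)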

Split into groups based on (multiple) conditions?

I have a set of marbles, of different colors and weights, and I want to split them into groups based on their weight and color.
The conditions are:
A group cannot weigh more than 100 units
A group cannot have more than 5 different-colored marbles.
A reproducible example (note that without set.seed() your random draw will differ from the one shown):
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
color weight
1 1 22
2 15 33
3 13 35
4 11 13
5 6 26
6 8 15
7 10 3
8 16 22
9 14 21
10 3 16
11 4 26
12 20 30
13 9 31
14 2 16
15 7 12
16 17 13
17 19 19
18 5 17
19 12 12
20 18 40
And what I want is this group column:
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
TIA.
The below isn't an optimal assignment to the groups; it just works sequentially through the data frame. It uses rowwise() and might not be the most efficient way, as it's not a vectorized approach.
library(dplyr)
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
Below I create a rowwise function which we can apply using dplyr:
assign_group <- function(color, weight) {
  # Conditions: group totals if the current marble is added to the current group
  clists = append(color_list, color)
  sum_val = group_sum + weight
  num_colors = length(unique(clists))
  assign_condition = (sum_val <= 100 & num_colors <= 5)
  # Assign globals: either grow the current group or start a new one
  cval <- if (assign_condition) clists else c(color)
  sval <- ifelse(assign_condition, sum_val, weight)
  gval <- ifelse(assign_condition, group_number, group_number + 1)
  assign("color_list", cval, envir = .GlobalEnv)
  assign("group_sum", sval, envir = .GlobalEnv)
  assign("group_number", gval, envir = .GlobalEnv)
  res = group_number
  return(res)
}
I then set up a few global variables to track the allocation of the marbles to each group.
# globals
color_list <<- c()
group_sum <<- 0
group_number <<- 1
Finally, run this function using mutate():
test <- marbles %>% rowwise() %>% mutate(group = assign_group(color,weight)) %>% data.frame()
Which results in the below
color weight group
1 6 27 1
2 12 16 1
3 15 32 1
4 20 25 1
5 19 5 2
6 2 21 2
7 16 39 2
8 17 4 2
9 11 16 2
10 7 7 3
11 10 5 3
12 1 30 3
13 13 7 3
14 9 39 3
15 14 7 4
16 8 17 4
17 18 9 4
18 4 36 4
19 3 1 4
20 5 3 5
And it seems to meet the constraints:
test %>% group_by(group) %>% summarise(tot_w = sum(weight), n_c = length(unique(color)) )
group tot_w n_c
<dbl> <int> <int>
1 1 100 4
2 2 85 5
3 3 88 5
4 4 70 5
5 5 3 1
In base R you could write a recursive function, as shown below:
create_group = function(df, a) {
  if (missing(a)) a = cumsum(df$weight) %/% 100
  b = !ave(df$color, a, FUN = seq_along) %% 6
  d = ave(df$weight, a + b, FUN = cumsum) > 100
  a = a + b + d
  if (any(b | d)) create_group(df, a) else cbind(df, group = a + 1)
}
create_group(marbles)
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
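A quick sanity check (my addition, not part of the answer) that both constraints hold for every group produced by the recursive approach:
res <- create_group(marbles)
do.call(rbind, lapply(split(res, res$group), function(g) {
  data.frame(group = g$group[1],
             tot_w = sum(g$weight),            # must be <= 100
             n_c   = length(unique(g$color)))  # must be <= 5
}))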

R: fill new columns in data.frame based on row values by condition?

I want to create a new column in my data.frame, based on values in its rows.
If "type" is not equal to "a", my "new.area" column should contain the "area" value from the type "a" row with the same "distance". This applies across multiple "distances".
Example:
# create data frame
distance<-rep(seq(1,5, by = 1),2)
area<-c(11:20)
type<-rep(c("a","b"),each = 5)
# check data.frame
(my.df<-data.frame(distance, area, type))
distance area type
1 1 11 a
2 2 12 a
3 3 13 a
4 4 14 a
5 5 15 a
6 1 16 b
7 2 17 b
8 3 18 b
9 4 19 b
10 5 20 b
I want to create a new column (my.df$new.area) where, for every "distance" in the rows, there will be the value of "area" of type "a":
distance area type new.area
1 1 11 a 11
2 2 12 a 12
3 3 13 a 13
4 4 14 a 14
5 5 15 a 15
6 1 16 b 11
7 2 17 b 12
8 3 18 b 13
9 4 19 b 14
10 5 20 b 15
I know how to do this manually for a single distance:
my.df$new.area[my.df$distance == 1] <- 11
But how do I do it automatically?
Here is a base R solution using index subsetting ([) and match:
my.df$new.area <- with(my.df, area[type == "a"][match(distance, distance[type == "a"])])
which returns
my.df
distance area type new.area
1 1 11 a 11
2 2 12 a 12
3 3 13 a 13
4 4 14 a 14
5 5 15 a 15
6 1 16 b 11
7 2 17 b 12
8 3 18 b 13
9 4 19 b 14
10 5 20 b 15
area[type == "a"] supplies the vector of possible values. match() returns, for each value of distance, the position of that distance among the type "a" rows, which is then used to index that vector. with() is used to avoid the repeated use of my.df$.
We can use data.table
library(data.table)
setDT(my.df)[, new.area := area[type=="a"] , distance]
my.df
# distance area type new.area
# 1: 1 11 a 11
# 2: 2 12 a 12
# 3: 3 13 a 13
# 4: 4 14 a 14
# 5: 5 15 a 15
# 6: 1 16 b 11
# 7: 2 17 b 12
# 8: 3 18 b 13
# 9: 4 19 b 14
#10: 5 20 b 15
Or, since distance is already the sequence 1:5 within each type, we can use it directly as a numeric index:
with(my.df, area[type=="a"][distance])
#[1] 11 12 13 14 15 11 12 13 14 15
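For reference, a grouped dplyr sketch of the same idea (my own alternative, not from the answers above); it assumes each distance has exactly one type "a" row:
library(dplyr)
my.df %>%
  group_by(distance) %>%
  mutate(new.area = area[type == "a"]) %>%  # the single type "a" area within each distance
  ungroup()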

Changing every set of 5 rows in R

I have a dataframe that looks like this:
# start from an empty list so the column assignments below work
df <- list()
df$a <- 1:20
df$b <- 2:21
df$c <- 3:22
df <- as.data.frame(df)
> df
a b c
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
11 11 12 13
12 12 13 14
13 13 14 15
14 14 15 16
15 15 16 17
16 16 17 18
17 17 18 19
18 18 19 20
19 19 20 21
20 20 21 22
I would like to add another column to the data frame (df$d) so that each block of 5 rows takes the value of df$a from the first row of that block (rows 1, 6, 11, 16).
I have tried the manual way, but was wondering if there is a for loop or a shorter way that can do this easily. I'm new to R, so I apologize if this seems trivial to some people.
"Manual" way:
df$d[1:5] <- df$a[1]
df$d[6:10] <- df$a[6]
df$d[11:15] <- df$a[11]
df$d[16:20] <- df$a[16]
>df
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
I have tried
for (i in 1:nrow(df))
{df$d[i:(i+4)] <- df$a[seq(1, nrow(df), 4)]}
But this is not going the way I want it to. What am I doing wrong?
This should work:
df$d <- rep(df$a[seq(1,nrow(df),5)],each=5)
And here's a data.table solution:
library(data.table)
dt = data.table(df)
dt[, d := a[1], by = (seq_len(nrow(dt))-1) %/% 5]
I'd use logical indexing after initializing to NA
df$d <- NA
df$d <- rep(df$a[ c(TRUE, rep(FALSE,4)) ], each=5)
df
#--------
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
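Another base R option (a sketch of my own, not from the answers above): ave() applies a function within each block of 5 rows and recycles the single returned value across that block.
df$d <- ave(df$a, (seq_len(nrow(df)) - 1) %/% 5, FUN = function(x) x[1])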
