I'm trying to create a rolling average of a column based on an ID column and a measurement time label in R, but I am having a lot of trouble with it.
Here is what my dataframe looks like:
ID Measurement Value
A 1 10
A 2 12
A 3 14
B 1 10
B 2 12
B 3 14
B 4 10
The problem is that the measurement counts vary from 9 to 76 per ID, so I haven't found a solution that creates a rolling-average column for each ID while handling the varying window length.
My goal is a dataframe like this:
ID Measurement Value Average
A 1 10 NA
A 2 12 11
A 3 14 12
B 1 10 NA
B 2 12 11
B 3 14 12
B 4 10 11.5
With your data:
library(dplyr)
dat %>%
group_by(Id) %>%
mutate(Avrg = cumsum(Value)/(1:n()))
# A tibble: 7 x 4
# Groups: Id [2]
Id Measurement Value Avrg
<chr> <int> <int> <dbl>
1 A 1 10 10
2 A 2 12 11
3 A 3 14 12
4 B 1 10 10
5 B 2 12 11
6 B 3 14 12
7 B 4 10 11.5
Data:
structure(list(Id = c("A", "A", "A", "B", "B", "B", "B"),
Measurement = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
Value = c(10L, 12L, 14L, 10L, 12L, 14L, 10L)
),
class = "data.frame", row.names = c(NA, -7L))
P.S. I am pretty sure that the average of 10 is 10, not NA
library(dplyr)
data %>%
group_by(ID) %>%
mutate(rolling_mean = cummean(Value))
First row will be mean of first value for each group (ID), not NA.
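If the NA in the first row of each group is really wanted (as in the expected output), a small tweak of the same idea works; a minimal sketch:
library(dplyr)
data %>%
  group_by(ID) %>%
  # cummean() gives the running average; blank out the first value of each group
  mutate(rolling_mean = replace(cummean(Value), 1L, NA)) %>%
  ungroup()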
This uses no packages. It calculates the cumulative average by ID except that for Measurement equal to 1 it forces the average to be NA.
transform(DF, Avg = ave(Value, ID, FUN = cumsum) /
ifelse(Measurement == 1, NA, Measurement))
giving:
ID Measurement Value Avg
1 A 1 10 NA
2 A 2 12 11.0
3 A 3 14 12.0
4 B 1 10 NA
5 B 2 12 11.0
6 B 3 14 12.0
7 B 4 10 11.5
Note
The input DF in reproducible form is:
Lines <- "ID Measurement Value
A 1 10
A 2 12
A 3 14
B 1 10
B 2 12
B 3 14
B 4 10"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
Related
I have a dataframe that contains NA values, and I want to remove some rows that have an NA (i.e., are not complete cases). However, I only want to remove rows at the beginning and end of the dataframe; I want to keep any rows with an NA that are not among the first or last rows. What is the most efficient way to remove these rows with NAs in one step, without using a row index? This is related to my previous question, but here I also want to remove the first rows at the same time. There are other posts that focus on removing only the first rows, but not both.
Data
df <- structure(list(var1 = 1:15,
var2 = c(3, NA, 3, NA, 2, NA, 3, 4, 2, NA, 4, 2, 45, 2, 1),
var3 = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA),
var4 = c(NA, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, NA)),
class = "data.frame", row.names = c(NA, -15L))
Expected Output
So, in this example, I removed rows 1 to 2 and 12 to 15 since they contain an NA, while rows 3 and 11 do not have an NA.
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
I know that I could have 2 statements in filter to remove the top and bottom rows (shown below). But I'm wondering if there is a more efficient way to do this with really large datasets (open to any method tidyverse, base R, data.table, etc.).
library(dplyr)
df %>%
filter(cumsum(complete.cases(.)) != 0 &
rev(cumsum(rev(complete.cases(.)))) != 0)
base R
r <- rle(complete.cases(df))
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# - attr(*, "class")= chr "rle"
r$values[ -c(1, length(r$values)) ] <- TRUE
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
# - attr(*, "class")= chr "rle"
df[inverse.rle(r),]
# var1 var2 var3 var4
# 3 3 3 8 8
# 4 4 NA 9 9
# 5 5 2 10 10
# 6 6 NA 11 11
# 7 7 3 12 12
# 8 8 4 13 13
# 9 9 2 14 14
# 10 10 NA 15 15
# 11 11 4 16 16
dplyr
For your question about efficiency, you can adapt the rle solution to dplyr as well (that should be trivial), but I see no reason why using complete.cases and cumany/rev would be a problem. You can improve on your attempt by not calculating complete.cases(.) twice; store it in an interim column instead.
library(dplyr)
df %>%
mutate(aux = complete.cases(cur_data())) %>%
filter(cumany(aux) & rev(cumany(rev(aux))))
# var1 var2 var3 var4 aux
# 1 3 3 8 8 TRUE
# 2 4 NA 9 9 FALSE
# 3 5 2 10 10 TRUE
# 4 6 NA 11 11 FALSE
# 5 7 3 12 12 TRUE
# 6 8 4 13 13 TRUE
# 7 9 2 14 14 TRUE
# 8 10 NA 15 15 FALSE
# 9 11 4 16 16 TRUE
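If the helper column should not appear in the final result (as in the expected output above), it can be dropped at the end; a small variation of the same pipeline:
df %>%
  mutate(aux = complete.cases(cur_data())) %>%
  filter(cumany(aux) & rev(cumany(rev(aux)))) %>%
  select(-aux)   # drop the interim column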
data.table
(Just an adaptation of the dplyr version.)
library(data.table)
setDT(df)
df[, aux := complete.cases(.SD)
][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]
# var1 var2 var3 var4 aux
# <int> <num> <int> <int> <lgcl>
# 1: 3 3 8 8 TRUE
# 2: 4 NA 9 9 FALSE
# 3: 5 2 10 10 TRUE
# 4: 6 NA 11 11 FALSE
# 5: 7 3 12 12 TRUE
# 6: 8 4 13 13 TRUE
# 7: 9 2 14 14 TRUE
# 8: 10 NA 15 15 FALSE
# 9: 11 4 16 16 TRUE
I would do
na_count <- rowSums(is.na(df))
df <- df %>%
slice(min(which(na_count==0)):max(which(na_count==0)))
Output
> df
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
I think we overcomplicated it a bit; the most efficient approach, I think, is just plain base R.
Directly take all your complete cases
s <- which(complete.cases(df))
We cannot simply subset on s, as we want to keep all the "in between" incomplete rows too; we can achieve that by subsetting from the first up to the last index.
# note: first()/last() come from dplyr (or data.table), so strictly base R would be df[s[1]:s[length(s)], ]
df[first(s):last(s), ]
continuing a rle love fest:
(which(rle(rowSums(df))$values != 'NA')[1]):dplyr::last(which(rle(rowSums(df))$values != 'NA'))
[1] 3 4 5 6 7 8 9 10 11
or, dispensing with dplyr
(which(rle(rowSums(df))$values != 'NA')[1]):(which(rle(rowSums(df))$values != 'NA'))[[(length(which(rle(rowSums(df))$values != 'NA')))]]
[1] 3 4 5 6 7 8 9 10 11
These are the row indices to keep, so wrapping either expression in df[ ..., ] gives the subsetted data frame.
Another possible solution (thanks, #r2evans, for suggesting complete.cases):
library(dplyr)
df %>%
mutate(aux = !complete.cases(.)) %>%
filter(!cumall(aux)) %>%
arrange(desc(var1)) %>%
filter(!cumall(aux)) %>%
arrange(var1) %>%
select(-aux)
#> var1 var2 var3 var4
#> 1 3 3 8 8
#> 2 4 NA 9 9
#> 3 5 2 10 10
#> 4 6 NA 11 11
#> 5 7 3 12 12
#> 6 8 4 13 13
#> 7 9 2 14 14
#> 8 10 NA 15 15
#> 9 11 4 16 16
Bit late to the party, but base R in a single expression:
df[Reduce(
function(x, y){
seq(from = x, to = y)
},
range(
which(
complete.cases(df)
)
)
), ]
Benchmark
Here, I create a bigger dataset with 1,000,000 rows of 3 variables to determine which method is the fastest. Note: it will take a few seconds to apply the NA values randomly to the 3 columns for the first 100,000 rows and the last 100,000 rows. Essentially, with this example, we want to remove the first 100,000 rows and the last 100,000 rows.
Dataset
set.seed(203)
df <- data.frame(var1 = sample(x = 1:500, size = 1000000, replace = TRUE),
var2 = sample(x = 1:500, size = 1000000, replace = TRUE),
var3 = sample(x = 1:500, size = 1000000, replace = TRUE))
df[1:100000,] <- plyr::ddply(df[1:100000,], plyr::.(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA;x})
df[900000:1000000,] <- plyr::ddply(df[900000:1000000,], plyr::.(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA;x})
df[300000:400000,2] <- NA
Output
It looks like #MerijnvanTilborg's data.table solution is the fastest, followed by #r2evans' data.table version, on this sample dataset.
Code
library(tidyverse)
library(data.table)
df1 <- df
dt1 <- as.data.table(df)
dt2 <- as.data.table(df)
bm <- microbenchmark::microbenchmark(baseR_r2evans = {r <- rle(complete.cases(df1));
r$values[ -c(1, length(r$values)) ] <- TRUE; df[inverse.rle(r),]},
dplyr_r2evans = {df %>%
dplyr::mutate(aux = complete.cases(cur_data())) %>%
dplyr::filter(cumany(aux) & rev(cumany(rev(aux))))},
datatable_r2evans = {dt1[, aux := complete.cases(.SD)
][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]},
valkyr = {na_count <- rowSums(is.na(df)); df %>%
dplyr::slice(min(which(na_count==0)):max(which(na_count==0)))},
PaulS = {df %>%
dplyr::mutate(aux = !complete.cases(.)) %>%
dplyr::filter(!cumall(aux)) %>%
dplyr::arrange(desc(var1)) %>%
dplyr::filter(!cumall(aux)) %>%
dplyr::arrange(var1) %>%
dplyr::select(-aux)},
Chris = {df[(which(rle(rowSums(df))$values != 'NA')[1]):(which(rle(rowSums(df))$values != 'NA'))[[(length(which(rle(rowSums(df))$values != 'NA')))]],]},
AndrewGB = {df %>%
dplyr::filter(cumsum(complete.cases(.)) != 0 &
rev(cumsum(rev(complete.cases(.)))) != 0)},
Merijn_baseR = {s <- which(complete.cases(df));
df[first(s):last(s), ]},
Merijn_datatable = {dt2[, aux := complete.cases(.SD)][first(which(aux)):last(which(aux))]},
times = 1000
)
R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having problems computing a new variable that would capture the unique values (parties) of my two columns Party and Party2013 per group. The column Party2013 measures the vote in the 2013 election, and Party measures voters' intentions after 2013. Every time I try n_distinct or length, I get the count of unique values in the two columns separately, but not taken together.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above, I normally get a count of 3 instead of the desired 2.
I've tried the following commands but got only the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group, and not the number of distinct values in each column separately. Thanks.
You can subset the data from cur_data() and unlist it to get a vector. Use n_distinct to count the number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
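As a side note, cur_data() is deprecated as of dplyr 1.1.0 in favour of pick(); on recent versions the same idea can be written as follows (a minimal sketch):
library(dplyr)
df %>%
  group_by(ID) %>%
  # pick() returns the selected columns for the current group as a data frame
  mutate(Count = n_distinct(unlist(pick(Party, Party2013)), na.rm = TRUE)) %>%
  ungroup()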
In situations like this I always like to simplify the problem by changing the data into long format, since such problems are easier to solve when all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop the NAs that were being counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
You can also do it this way:
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2
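Note that paste(Party, Party2013) counts distinct (Party, Party2013) pairs, which is why the Count here (3 for ID 1, 2 for ID 2) differs from the expected output. A minimal sketch that counts distinct parties instead, assuming NAs should be dropped:
library(dplyr)
data %>%
  group_by(ID) %>%
  # concatenate both columns, then count distinct non-NA values per ID
  mutate(Count = n_distinct(c(Party, Party2013), na.rm = TRUE)) %>%
  ungroup()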
I have a data frame like this:
name count
a 3
a 5
a 8
b 2
a 9
b 7
So I want to calculate the row differences grouped by name. My code is:
data %>% group_by(name) %>% mutate(last_count = lag(count), diff = count - last_count)
However, I get a result like the table below:
name count last_count diff
a 3 NA NA
a 5 3 2
a 8 5 3
b 2 NA NA
a 9 8 1
b 7 2 5
But what I want should look like this:
name count last_count diff
a 3 NA NA
a 5 3 2
a 8 5 3
b 2 NA NA
a 9 NA NA
b 7 NA NA
Thanks in advance to whoever can help me fix it!
Does this work:
> library(dplyr)
> df %>% mutate(last_count = case_when(name == lag(name) ~ lag(count), TRUE ~ NA_real_),
diff = case_when(name == lag(name) ~ count - lag(count), TRUE ~ NA_real_))
# A tibble: 6 x 4
name count last_count diff
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 5 3 2
3 a 8 5 3
4 b 2 NA NA
5 a 9 NA NA
6 b 7 NA NA
We could use rleid to create a grouping column based on the adjacent matching values in the 'name' column and then apply the diff
library(dplyr)
library(data.table)
data %>%
group_by(grp = rleid(name)) %>%
mutate(last_count = lag(count), diff = count - last_count) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 6 x 4
# name count last_count diff
# <chr> <int> <int> <int>
#1 a 3 NA NA
#2 a 5 3 2
#3 a 8 5 3
#4 b 2 NA NA
#5 a 9 NA NA
#6 b 7 NA NA
Or using base R with ave and rle
data$diff <- with(data, ave(count, with(rle(name),
       rep(seq_along(values), lengths)), FUN = function(x) c(NA, diff(x))))
data
data <- structure(list(name = c("a", "a", "a", "b", "a", "b"), count = c(3L,
5L, 8L, 2L, 9L, 7L)), class = "data.frame", row.names = c(NA,
-6L))
I have a data.frame like this:
df<-data.frame( Id = paste0("g",1:6),
a= c(6:11),
b = c(10:13,NA,NA),
c = c(7:10,NA,10),
d = c(NA,7:9,NA,13),
e= c(NA,6:10),
f= c(NA,NA,NA,4:5,NA))
colnames(df)=c("ID",rep("normal",3),rep("patient",3))
> df
ID normal normal normal patient patient patient
1 g1 6 10 7 NA NA NA
2 g2 7 11 8 7 6 NA
3 g3 8 12 9 8 7 NA
4 g4 9 13 10 9 8 4
5 g5 10 NA NA NA 9 5
6 g6 11 NA 10 13 10 NA
This df contains data for two groups (normal and patient). I am going to perform some analysis on all rows, therefore every group in each row must have at least two values. I used the following code to drop the rows where the groups do not all have at least two values:
fx=function(x){length(x[!is.na(x)])>=2}
f1=apply(df[,2:4], 1,fx)#filter based on group normal
f2=apply(df[,5:7], 1,fx)#filter based on group patient
df=subset(df,f1&f2)
> df
ID normal normal.1 normal.2 patient patient.1 patient.2
2 g2 7 11 8 7 6 NA
3 g3 8 12 9 8 7 NA
4 g4 9 13 10 9 8 4
6 g6 11 NA 10 13 10 NA
But this code is only practical for data with a few groups. My main data has 100 groups (each group has 3 replicates), with colnames(df) = paste0("grp", sort(rep(1:100, 3))),
therefore I need some simple code that filters the rows of a data.frame with 100 groups.
My goal: delete the rows that do not have at least two values in every group.
Could do:
library(dplyr)
names(df) <- paste0(names(df), 1:ncol(df))
df %>%
filter(
rowSums(!is.na(select(., contains("normal")))) >= 2 &
rowSums(!is.na(select(., contains("patient")))) >= 2
)
We could differentiate "normal" and "patient" columns and select the rows using rowSums
normal_cols <- grep("normal", names(df))
patient_cols <- grep("patient", names(df))
df[rowSums(!is.na(df[normal_cols])) >= 2 & rowSums(!is.na(df[patient_cols])) >= 2,]
# ID normal normal normal patient patient patient
#2 g2 7 11 8 7 6 NA
#3 g3 8 12 9 8 7 NA
#4 g4 9 13 10 9 8 4
#6 g6 11 NA 10 13 10 NA
Or, using the fx function you have defined, we can use apply twice, once on each set of columns, and select the rows using subset.
fx = function(x) {length(x[!is.na(x)])>=2}
subset(df, apply(df[normal_cols], 1,fx) & apply(df[patient_cols], 1,fx))
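For the 100-group case described in the question, the same rowSums idea can be applied programmatically to every group of columns; a minimal sketch, assuming the first column is the ID and every other column's name is its group label (e.g. "normal"/"patient" or "grp1".."grp100"):
grp <- names(df)[-1]                         # group label of each data column
ok  <- sapply(split.default(df[-1], grp),    # split the columns by group
              function(g) rowSums(!is.na(g)) >= 2)
df[rowSums(ok) == ncol(ok), ]                # keep rows passing the check in every group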
We may use reshape to get a long format and count the non-missing values per group.
First rule in such matters is appRopriate column names, i.e. <chr_prefix>.<num_suffix>.
names(df) <- c("ID", paste(rep(c("normal", "patient"), each=3), 1:3, sep="."))
Now we reshape into long format and split by "ID". We only want those IDs where every group has at least two non-missing values; we store this in a vector s with which we may subset the data frame df.
r <- reshape(df, idvar="ID", direction="long", varying=list(2:4, 5:7), times=1:3)
s <- by(r[-1], r$ID, function(i) all(colSums(!is.na(i)) >= 2))
df[s, ]
# ID normal normal normal patient patient patient
# 2 g2 7 11 8 7 6 NA
# 3 g3 8 12 9 8 7 NA
# 4 g4 9 13 10 9 8 4
# 6 g6 11 NA 10 13 10 NA
Data
df <- structure(list(Id = structure(1:6, .Label = c("g1", "g2", "g3",
"g4", "g5", "g6"), class = "factor"), a = 6:11, b = c(10L, 11L,
12L, 13L, NA, NA), c = c(7, 8, 9, 10, NA, 10), d = c(NA, 7, 8,
9, NA, 13), e = c(NA, 6L, 7L, 8L, 9L, 10L), f = c(NA, NA, NA,
4L, 5L, NA)), class = "data.frame", row.names = c(NA, -6L))
I have a dataframe which looks like:
Student_ID Number Position
VB-123 10 2
VB-456 15 5
VB-789 25 25
VB-889 12 2
VB-965 15 7
VB-758 45 9
VB-245 25 25
I want to add a new column and assign a value based on the conditions below:
If only Number is duplicated in the entire dataframe, then assign A
If only Position is duplicated in the entire dataframe, then assign B
If both Number and Position are duplicated, then assign C
If none of them is duplicated, then assign D.
The output would look like:
Student_ID Number Position Assign
VB-123 10 2 B
VB-456 15 5 A
VB-789 25 25 C
VB-889 12 2 B
VB-965 15 7 A
VB-758 45 9 D
VB-245 25 25 C
With dplyr,
library(dplyr)
students <- data.frame(Student_ID = c("VB-123", "VB-456", "VB-789", "VB-889", "VB-965", "VB-758", "VB-245"),
Number = c(10L, 15L, 25L, 12L, 15L, 45L, 25L),
Position = c(2L, 5L, 25L, 2L, 7L, 9L, 25L))
students2 <- students %>%
mutate_at(vars(Number, Position), funs(n = table(.)[as.character(.)])) %>%
mutate(Assign = case_when(Number_n > 1 & Position_n > 1 ~ 'C',
Number_n > 1 ~ 'A',
Position_n > 1 ~ 'B',
TRUE ~ 'D'))
students2
#> Student_ID Number Position Number_n Position_n Assign
#> 1 VB-123 10 2 1 2 B
#> 2 VB-456 15 5 2 1 A
#> 3 VB-789 25 25 2 2 C
#> 4 VB-889 12 2 1 2 B
#> 5 VB-965 15 7 2 1 A
#> 6 VB-758 45 9 1 1 D
#> 7 VB-245 25 25 2 2 C
As an alternative to the mutate_at line, you could use add_count twice, renaming as necessary. To remove the intermediary columns, tack on select(-matches('_n$')).
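A sketch of that add_count() variant (the name = argument needs a reasonably recent dplyr; older versions require a rename() step instead, as noted above). It also avoids funs(), which has since been deprecated:
library(dplyr)
students %>%
  add_count(Number, name = "Number_n") %>%       # rows sharing this Number
  add_count(Position, name = "Position_n") %>%   # rows sharing this Position
  mutate(Assign = case_when(Number_n > 1 & Position_n > 1 ~ 'C',
                            Number_n > 1 ~ 'A',
                            Position_n > 1 ~ 'B',
                            TRUE ~ 'D')) %>%
  select(-matches('_n$'))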
You can more or less replicate the logic in base by assigning to subsets:
students2 <- cbind(students, lapply(students[2:3], function(x) table(x)[as.character(x)]))
students2$Assign <- 'D'
students2$Assign[students2$Number.Freq > 1 & students2$Position.Freq > 1] <- 'C'
students2$Assign[students2$Number.Freq > 1 & students2$Position.Freq == 1] <- 'A'
students2$Assign[students2$Number.Freq == 1 & students2$Position.Freq > 1] <- 'B'
students2[4:7] <- NULL
students2
#> Student_ID Number Position Assign
#> 1 VB-123 10 2 B
#> 2 VB-456 15 5 A
#> 3 VB-789 25 25 C
#> 4 VB-889 12 2 B
#> 5 VB-965 15 7 A
#> 6 VB-758 45 9 D
#> 7 VB-245 25 25 C
Here is an option using base R. Create a list of column names in the order of evaluation ('l1'), pre-assign 'D' to create the 'Assign' column in 'dat', loop through the sequence of 'l1', subset the columns of data based on the column names in 'l1', use duplicated to find the duplicate elements, and reassign the 'Assign' column to the corresponding LETTER.
l1 <- list("Number", "Position", c("Number", "Position"))
dat$Assign <- rep("D", nrow(dat))
for(i in seq_along(l1)){
df <- dat[l1[[i]]]
i1 <- duplicated(df)|duplicated(df, fromLast = TRUE)
dat$Assign <- replace(dat$Assign, i1, LETTERS[i])
}
-output
dat
# Student_ID Number Position Assign
#1 VB-123 10 2 B
#2 VB-456 15 5 A
#3 VB-789 25 25 C
#4 VB-889 12 2 B
#5 VB-965 15 7 A
#6 VB-758 45 9 D
#7 VB-245 25 25 C
A solution using dplyr.
library(dplyr)
dat2 <- dat %>% count(Number)
dat3 <- dat %>% count(Position)
dat4 <- dat %>% count(Number, Position)
dat5 <- dat %>%
left_join(dat2, by = "Number") %>%
left_join(dat3, by = "Position") %>%
left_join(dat4, by = c("Number", "Position")) %>%
mutate(Assign = case_when(
n > 1 ~ "C",
n.x > 1 & n.y == 1 ~ "A",
n.y > 1 & n.x == 1 ~ "B",
TRUE ~ "D"
)) %>%
select(-n.x, -n.y, -n)
dat5
# Student_ID Number Position Assign
# 1 VB-123 10 2 B
# 2 VB-456 15 5 A
# 3 VB-789 25 25 C
# 4 VB-889 12 2 B
# 5 VB-965 15 7 A
# 6 VB-758 45 9 D
# 7 VB-245 25 25 C
DATA
dat <- read.table(text = "Student_ID Number Position
'VB-123' 10 2
'VB-456' 15 5
'VB-789' 25 25
'VB-889' 12 2
'VB-965' 15 7
'VB-758' 45 9
'VB-245' 25 25",
header = TRUE, stringsAsFactors = FALSE)