Replace data entry errors with the most common value - dplyr - r

I have a data frame which contains some data entry errors.
I wish to replace these outlier values per group with the most common value per group.
My data looks as follows:
df <- data.frame(CODE = c("J1745","J1745","J1745","J1745","J1100","J1100","J1100","J1100","J1100","J1100"),NDC = c(1234,1234,1234,1234,5678,5678,5678,5678,5678,5678),DOSAGE = c("10ML","10 ML","10 ML","10 ML","5 ML","5 ML","5 ML","5 ML","50 ML","5 ML"),DESC = c("TEXT1","TEXT 1","TEXT 1","TEXT 1","TEXT 2","TEXT 2","TEXT 2","TEXT 2","TEXT 10","TEXT 2"))
As you can see my DOSAGE and DESC columns contain some inconsistencies and I would like to replace them with the most common value within each group.
My desired output looks as follows:

I agree with the comment that this is potentially dangerous.
The code below replaces elements that occur at most n times within a group (n = 1 by default) with that group's most common value. I use base-R machinery within the replacement function because that's what I know how to do.
library(dplyr)
repl_common <- function(x, n = 1) {
  tt <- tapply(x, x, length)       ## count number of instances of each value
  m <- names(tt)[which.max(tt)]    ## find mode
  x[tt[as.character(x)] <= n] <- m ## replace values occurring n times or fewer
  return(x)
}
## apply by group across specified columns
df %>% group_by(CODE) %>% mutate(across(c(DOSAGE, DESC), repl_common))
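To see what the n threshold does, here is a small sketch on a single vector. The dosage values are made up for illustration, and the function is repeated from above only so that the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
repl_common <- function(x, n = 1) {
  tt <- tapply(x, x, length)
  m <- names(tt)[which.max(tt)]
  x[tt[as.character(x)] <= n] <- m
  x
}

# Hypothetical values: "5 ML" occurs 3 times, "50 ML" twice, "5ML" once
dosage <- c("5 ML", "5 ML", "5 ML", "50 ML", "50 ML", "5ML")

res1 <- repl_common(dosage)         # only the singleton "5ML" is replaced
res2 <- repl_common(dosage, n = 2)  # "50 ML" (2 occurrences) is replaced as well
res1
res2
```

With the default n = 1 a repeated typo survives, and with a larger n a legitimate minority value gets overwritten, which is why this kind of cleaning is potentially dangerous.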

You can use a Mode function like the one below to get the most common value.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Apply this function by group.
library(dplyr)
df %>% group_by(CODE, NDC) %>% mutate(across(c(DOSAGE, DESC), Mode)) %>% ungroup()
# CODE NDC DOSAGE DESC
# <chr> <dbl> <chr> <chr>
# 1 J1745 1234 10 ML TEXT 1
# 2 J1745 1234 10 ML TEXT 1
# 3 J1745 1234 10 ML TEXT 1
# 4 J1745 1234 10 ML TEXT 1
# 5 J1100 5678 5 ML TEXT 2
# 6 J1100 5678 5 ML TEXT 2
# 7 J1100 5678 5 ML TEXT 2
# 8 J1100 5678 5 ML TEXT 2
# 9 J1100 5678 5 ML TEXT 2
#10 J1100 5678 5 ML TEXT 2
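Two edge cases of this Mode function are worth knowing before relying on it. The sketch below uses made-up vectors: on a tie it returns whichever value appears first, and NA counts as a value like any other.

```r
# Same Mode function as above, repeated so this snippet is self-contained
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

tie <- Mode(c("a", "b", "b", "a"))  # "a" and "b" both occur twice
with_na <- Mode(c(NA, NA, "x"))     # NA occurs more often than "x"
tie
with_na
```

If NA should never win, one option is to call Mode(x[!is.na(x)]) instead.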

Related

Vectorised calculation

I have data in a wide format: a Category column listing types of transport, followed by one column of totals per type of transport.
I want to create a Calc column that sums each row across those columns, excluding the column whose name matches the row's Category.
So the row total for Car would be Train + Bus, and the row total for Train would be Car + Bus.
If the Category column contains a type of transport that isn't one of the column names, Calc should be NA.
The data frame is below, with the expected Calc column already added.
Category<-c("Car","Train","Bus","Bicycle")
Car<-c(9,15,25,5)
Train<-c(8,22,1,7)
Bus<-c(5,2,4,8)
Calc<-c(13, 17,26,NA)
df<-data.frame(Category,Car,Train,Bus,Calc, stringsAsFactors = FALSE)
Can anyone suggest how to add the Calc column as per above? Ideally a vectorised calculation without a loop.
Here is an alternative in base R. You can use apply to go row-wise through your data.frame. If the row's Category matches one of the column names, sum the row while excluding both the Category column and the column named after that category. Otherwise, use NA.
## df here holds only Category, Car, Train and Bus;
## an existing Calc column would be included in the sum
df$Calc <- apply(
  df,
  1,
  \(x) {
    if (x["Category"] %in% names(x)) {
      sum(as.numeric(x[setdiff(names(x), c(x["Category"], "Category"))]))
    } else {
      NA_real_
    }
  }
)
df
Output
Category Car Train Bus Calc
1 Car 9 8 5 13
2 Train 15 22 2 17
3 Bus 25 1 4 26
4 Bicycle 5 7 8 NA
Here is a tidyverse solution:
df<-data.frame(Category,Car,Train,Bus, stringsAsFactors = FALSE)
library(dplyr)
library(tidyr)
df |>
  pivot_longer(cols = !Category,
               names_to = "cat2",
               values_to = "value") |>
  group_by(Category) |>
  # keep values only when the category matches one of the columns
  mutate(value = case_when((Category %in% cat2) ~ value,
                           TRUE ~ NA_real_)) |>
  # drop the column that corresponds to the row's own category
  filter(cat2 != Category) |>
  summarize(Calc = sum(value)) |>
  left_join(df, by = "Category")
# A tibble: 4 × 5
Category Calc Car Train Bus
<chr> <dbl> <dbl> <dbl> <dbl>
1 Bicycle NA 5 7 8
2 Bus 26 25 1 4
3 Car 13 9 8 5
4 Train 17 15 22 2
Using rowSums and a matrix for indexing.
# Example data
Category <- c("Car","Train","Bus","Bicycle")
Car <- c(9,15,25,5)
Train <- c(8,22,1,7)
Bus <- c(5,2,4,8)
df <- data.frame(Category,Car,Train,Bus, stringsAsFactors = FALSE)
# add the "Calc" column
df$Calc <- rowSums(df[,2:4]) - df[,2:4][matrix(c(1:nrow(df), match(df$Category, colnames(df)[2:4])), ncol = 2)]
df
#> Category Car Train Bus Calc
#> 1 Car 9 8 5 13
#> 2 Train 15 22 2 17
#> 3 Bus 25 1 4 26
#> 4 Bicycle 5 7 8 NA
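The key step in the rowSums approach is indexing with a two-column matrix, where each row of the index is a (row, column) pair. Here is a minimal sketch on a plain matrix with made-up numbers; it also shows why a Category without a matching column ends up as NA.

```r
# 2 x 3 matrix: first row is 1, 3, 5 and second row is 2, 4, 6
m <- matrix(1:6, nrow = 2)

# Each row of idx picks one cell: (row 1, col 3) and (row 2, no column)
idx <- cbind(c(1, 2), c(3, NA))

picked <- m[idx]
picked
```

match() returns NA for "Bicycle", so the corresponding index row contains NA, the picked value is NA, and the subtraction from rowSums gives NA for that row.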

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got, been trying to tweak this but not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
one <- maxsbp[[i]]
index <- which(one$sbp == max(one$sbp))
select <- one[index,]
r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work quite well here. First filter out all values of 300 or more, since you do not want to keep any of them. Then group by id, sort, and keep the first row.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
library(dplyr)
my.df %>%
  filter(sbp < 300) %>%  # retain only values below 300 (this also drops NA)
  group_by(id) %>%       # group by id
  arrange(-sbp) %>%      # sort by sbp in descending order
  top_n(1, sbp)          # retain the largest value within each id
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
In R, you rarely need explicit for loops for tasks like this.
There are functions that perform such grouped operations for you.
For example, in base R you can use subset and ave:
subset(df,sbp == ave(sbp,id,FUN = function(x) max(x[x < 300],na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
group_by(id) %>%
filter(sbp == max(sbp[sbp < 300], na.rm = TRUE))
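Note that the `sbp == max(...)` approaches keep every row that ties for the maximum within an id. If exactly one row per id is needed, a base R sketch along the following lines may help. It assumes the question's data is stored in df, and that any one of the tied rows is acceptable.

```r
# Data from the question
df <- data.frame(
  id = c(13480, 13480, 13480, 13520, 13520, 13520, 13580, 13580),
  sex = "M",
  visits = c(2, 3, 4, 2, 3, 4, 2, 3),
  sbp = c(124, 306, 116, 124, 116, 120, NA, 124)
)

ok <- df[!is.na(df$sbp) & df$sbp < 300, ]  # drop missing readings and those of 300 or over
ok <- ok[order(ok$id, -ok$sbp), ]          # highest sbp first within each id
res <- ok[!duplicated(ok$id), ]            # keep the first row per id
res
```

Because duplicated() flags every row after the first for each id, at most one row per id survives, even when two visits share the same highest sbp.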
slice_head can also be used
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this
my.df %>% filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups: id, sex [3]
id sex sbp
<dbl> <chr> <dbl>
1 13480 M 124
2 13520 M 124
3 13580 M 124

use assign / create new object with value (dplyr)

I want to create a new variable with the value of column number in my DF by its name.
I have managed to do this:
firstCol <- which(colnames(Mydf) == "Cars")
It takes the position of the column named "Cars" and assigns that number to the object firstCol. It works fine in base R.
Lately I've been using dplyr and pipes, and I'm trying to do the same thing inside a pipe (%>%), but I can't work out how.
Can you help me?
thanks,
Ido
The dplyr way to do this is select.
Here is an example using some made up data:
library(dplyr)
df <- data.frame(cars = sample(LETTERS, 100, replace = TRUE),
                 mpg = runif(100, 15, 45),
                 color = sample(c("green", "red","blue", "silver"),
                                100, replace = TRUE)) %>% as_tibble()
df %>% select(cars)
# A tibble: 100 x 1
cars
<chr>
1 R
2 V
3 I
4 Q
5 P
6 D
7 J
8 Q
9 R
10 A
# ... with 90 more rows
You can also remove columns with select(-col_name)
df %>% select(-mpg)
# A tibble: 100 x 2
cars color
<chr> <chr>
1 R blue
2 V silver
3 I red
4 Q green
5 P silver
6 D silver
7 J green
8 Q blue
9 R red
10 A silver
# ... with 90 more rows
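If what you need is the column position itself rather than the column, one possible sketch is to pipe the column names into match(). The data frame below is a made-up stand-in for Mydf.

```r
library(dplyr)

# Hypothetical data frame standing in for Mydf
Mydf <- data.frame(Id = 1:3, Cars = c("A", "B", "C"), Mpg = c(21, 30, 25))

# Pipe the column names into match(); the dot marks where they are inserted
firstCol <- Mydf %>% names() %>% match("Cars", .)
firstCol
```

match() returns NA when no column has that name, whereas which(colnames(Mydf) == "Cars") returns an empty vector, so the two behave differently for a missing column.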

How to split dataframes into different dataframes based on one column name values that starts with some prefix?

How can I split a data frame into separate data frames based on the values of one column (say sensor_name) that start with a prefix such as "RI_" or "AI_" in R, so that I have two data frames, one for RI and another for AI?
I have tried the following code, but it only works when I pivot my data frame.
map(set_names(c("RI", "AI","FI")),~select(temp_df,starts_with(.x),starts_with("time_stamp")))
I expect the output to have two different dataframes,
RI_df:
AI_df:
It would be great if anyone help me with this since I just started to work on R programming language.
An option is split from base R
lst1 <- split(df1, substr(df1$sensor_name, 1,2))
names(lst1) <- paste0(names(lst1), "_df")
If the prefix length is variable
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
Or using tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_split(grp = str_remove(sensor_name, "_.*"), .keep = FALSE)
NOTE: It is not recommended to create multiple objects in the global env. For that reason, keep the data frames in the list and do all the analysis on that list itself.
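As a rough illustration of working on the list directly, the sketch below splits a small made-up data frame (the value column is an assumption, not part of the question) and summarises each piece with sapply.

```r
# Hypothetical sample data; only sensor_name comes from the question
df1 <- data.frame(
  sensor_name = c("RI_1", "RI_2", "AI_1", "AI_22"),
  value = c(1, 2, 3, 4)
)

# Split on everything before the first underscore
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
names(lst1) <- paste0(names(lst1), "_df")

# One data frame is a single element of the list
ri <- lst1$RI_df

# The same operation applied to every piece without separate objects
means <- sapply(lst1, function(d) mean(d$value))
means
```

Each piece stays reachable by name (lst1$RI_df, lst1$AI_df), and lapply or sapply runs the same step on all of them.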
Another approach from base R
df <- data.frame(sensor_name=c("R1_111","R1_113","A1_124","A1_2444"),
A=c(1,2,24,4),B=c(2,2,1,2),C=c(3,4,4,2))
df[grepl("R1",df$sensor_name),]
sensor_name A B C
1 R1_111 1 2 3
2 R1_113 2 2 4
df[grepl("A1",df$sensor_name),]
sensor_name A B C
3 A1_124 24 1 4
4 A1_2444 4 2 2
Create a variable to identify each group. After that you can subset the data to separate the groups. Functions from the stringr package can extract the relevant text from the longer sensor name.
library(stringr)
library(dplyr)
# Sample data
X <- tibble(
sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
A = c(1, 2, 3, 4),
B = c(5, 6, 7, 8),
C = c(9, 10, 11, 12)
)
# Extract text to identify groups
X <- X %>%
mutate(prefix = str_replace(sensor, "_.*", ""))
# Subset for desired group
X %>% filter(prefix == "AI")
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
# Or, split all the groups
lapply(unique(X$prefix), function(x) {
X %>% filter(prefix == x)
})
[[1]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 RI_1 1 5 9 RI
2 RI_2 2 6 10 RI
[[2]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
Depending on what you are doing with these groups, you may do better to use group_by() from the dplyr package.

Using regex to extract email address after # in dplyr pipe and then groupby to count occurrences [duplicate]

I have a data frame with a column called registrant_email. I want to extract the part of each email address after the # symbol (e.g. gmail, yahoo, hotmail), group by it, and count the occurrences of each.
registrant_email
chamukan#yahoo.com
tmrsons1974#yahoo.com
123ajumohan#gmail.com
123#websiterecovery.org
salesdesk#2techbrothers.com
salesdesk#2techbrothers.com
Now I can extract emails after # using below code
sub(".*#", "", df$registrant_email)
How can I use this in a dplyr pipe and then count the occurrences of each email domain?
tidyr::separate is useful for splitting columns:
library(tidyr)
library(dplyr)
# separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '#') %>%
# tally occurrences for each level of `domain`
count(domain)
## # A tibble: 4 x 2
## domain n
## <chr> <int>
## 1 2techbrothers.com 2
## 2 gmail.com 1
## 3 websiterecovery.org 1
## 4 yahoo.com 2
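The sub() call from the question can also be used directly inside the pipe. Here is a minimal sketch, assuming the data frame is called df and has the registrant_email column shown in the question; the column name domain is my own choice.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  registrant_email = c("chamukan#yahoo.com", "tmrsons1974#yahoo.com",
                       "123ajumohan#gmail.com", "123#websiterecovery.org",
                       "salesdesk#2techbrothers.com", "salesdesk#2techbrothers.com"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(domain = sub(".*#", "", registrant_email)) %>%  # keep the text after "#"
  count(domain)                                          # one row per domain with its count n
res
```

count(domain) is shorthand for group_by(domain) followed by summarise(n = n()), so no explicit grouping step is needed.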
After splitting into a character matrix and coercing it to a data.frame, we can use common dplyr idioms
library(dplyr)
library(stringr)
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% group_by(X2) %>% count(X1)
The result is as follows
X2 X1 n
<fctr> <fctr> <int>
1 2techbrothers.com salesdesk 2
2 gmail.com 123ajumohan 1
3 websiterecovery.org 123 1
4 yahoo.com chamukan 1
5 yahoo.com tmrsons1974 1
If you want to set variable names for better code comprehension, you can use
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% setNames(c("local", "domain")) %>%
group_by(domain) %>% count(local)
We can use base R methods for this
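# comment.char = "" stops read.table from treating "#" as the start of a comment
aggregate(V1~V2, read.table(text = df1$registrant_email,
sep="#", comment.char = "", stringsAsFactors=FALSE), FUN = length)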
# V2 V1
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
Or use the OP's method and wrap it with table
as.data.frame(table(sub(".*#", "", df1$registrant_email)))
# Var1 Freq
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
