Automating conditional logic for database data checks in R

I am trying to do a large data check for a database. Some fields in the database are hidden, so when I am doing the data check, I need to ignore all hidden fields. Fields are hidden based on conditional logic stored in the database. I have exported this conditional logic and stored it in a dataframe in R. Now I need to automate the data check, either by evaluating the conditional expression stored as a text string (which I do not think is possible) or by finding a way around the problem.
Below is example code that I need to solve:
library(dplyr)
id <- c(1001, 1002, 1003, 1004, 1005, 1001, 1002, 1003, 1004, 1005)
target_var <- c("race","race","race","race","race", "race_other",
"race_other", "race_other", "race_other", "race_other")
value <- c(1, NA, 1, 1, 6, NA, NA, NA, NA, "Asian")
branching_logic <- c(NA, NA, NA, NA, NA,
"race == 6", "race == 6", "race == 6",
"race == 6", "race == 6")
race <- c(NA, NA, NA,NA, NA, 1, 1, 1, 6, 6)
data <- data.frame(id, target_var, value, branching_logic, race) %>%
mutate(data_check_result = case_when(
!is.na(value) ~ "No Missing Data",
is.na(value) & is.na(branching_logic) ~ "Missing Data 1",
is.na(value) & race == 6 ~ "Missing Data 2",
is.na(value) & race != 6 ~ "Hidden field",
))
It would be great if I could replace (race==6) with a variable or somehow directing the script to the conditional expression already saved as a string, but I know that R can't do that.
The above problem has four categories which the data could fall into:
No Missing Data: only if the value is non-NA
Missing Data 1: if the value is NA and there is no branching logic that hid the variable
Missing Data 2: if the value is NA and the branching logic is met to show the field
Hidden Field: if the value is NA and the branching logic is NOT met to show the field
I have thousands of fields to check with accompanying branching logic, so I need a way to use the branching logic saved in the "branching_logic" column within the script.
IMPORTANT NOTE: The case here is the simplest case. Many target_var variables and value variables have branching logic that looks at multiple other variables to determine whether to hide the field (Ex. race==6 & race==1)
This is only my second time posting, and I usually do not see such in depth problems here, but it would be great if someone has an idea!

You can store the expression you want to evaluate as a string if you pass it into parse() first as explained in this answer.
Here's a simple example of how you can store the expression in a column and then feed it to dplyr::case_when().
library(tidyverse)
set.seed(1)
d <- tibble(
a = sample(10),
b = sample(10),
c = "a > b"
)
d %>%
mutate(a_bigger = case_when(
eval(parse(text = c)) ~ "Y",
TRUE ~ "N"
))
#> # A tibble: 10 x 4
#> a b c a_bigger
#> <int> <int> <chr> <chr>
#> 1 9 3 a > b Y
#> 2 4 1 a > b Y
#> 3 7 5 a > b Y
#> 4 1 8 a > b N
#> 5 2 2 a > b N
#> 6 5 6 a > b N
#> 7 3 10 a > b N
#> 8 10 9 a > b Y
#> 9 6 4 a > b Y
#> 10 8 7 a > b Y
Created on 2022-03-07 by the reprex package (v2.0.1)
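Note that in the example above every row holds the same expression. When the stored logic differs from row to row, or is NA for fields with no branching logic, one option is to evaluate each string against its own row. The following is only a minimal sketch under that assumption: eval_logic() and the small checks table are hypothetical names made up for illustration, and eval(parse()) should only be run on strings you trust.

```r
library(dplyr)

# Hypothetical miniature of the question's data: one row per field to check.
checks <- tibble(
  value           = c("1", NA, NA, NA),
  branching_logic = c(NA, NA, "race == 6", "race == 6"),
  race            = c(NA, NA, 6, 1)
)

# Evaluate one logic string using the values of a single row.
# Returns NA when there is no branching logic; a condition that
# evaluates to NA (e.g. race is missing) is treated as "not met".
eval_logic <- function(logic, row) {
  if (is.na(logic)) return(NA)
  isTRUE(eval(parse(text = logic), envir = row))
}

# One TRUE/FALSE/NA per row: is the field shown according to its own logic?
shown <- vapply(
  seq_len(nrow(checks)),
  function(i) eval_logic(checks$branching_logic[i], as.list(checks[i, ])),
  logical(1)
)

result <- checks %>%
  mutate(
    shown = shown,
    data_check_result = case_when(
      !is.na(value)          ~ "No Missing Data",
      is.na(branching_logic) ~ "Missing Data 1",
      shown                  ~ "Missing Data 2",
      TRUE                   ~ "Hidden field"
    )
  )

print(result)
```

Because each string is evaluated with the whole row as its environment, logic that refers to several variables should work the same way, provided every variable it names is a column of the table.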

Related

Mutate several columns based on one condition

I'd like to assign different values to several columns, based on the value in another column, i.e. do a multiple mutate based on a single condition.
For example, I would have a dataframe like this:
df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
length = c(NA, NA, 109.30),
tonnage = c(NA, NA, 3507.00),
power = c(NA, NA, 7149.05))
I'd like to manually assign a set of values to length, tonnage, and power when cfr == IRL000I12572, another set of values when cfr == ESP000023522, and keep the given values when cfr == ESP000023194.
Right now, I'm doing it using either an ifelse or case_when statement in my mutate, but I end up with three lines per cfr (and I have many)...
For example:
df <- df %>%
mutate(length = ifelse(cfr == "IRL000I12572", 22.5, length),
tonnage = ifelse(cfr == "IRL000I12572", 153.00, tonnage),
power = ifelse(cfr == "IRL000I12572", 370, power))
Is there a way to 'condense' the statement and have only one per cfr value, to assign the three different length, tonnage, and power values in one row?
Thanks!
You can use rows_update() from dplyr. Note that it was marked as an experimental function when this answer was written, so check its status in your dplyr version before relying on it.
library(dplyr)
df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
length = c(NA, NA, 109.30),
tonnage = c(NA, NA, 3507.00),
power = c(NA, NA, 7149.05))
df_update <- tibble(cfr = "IRL000I12572",
length = 22.5,
tonnage = 153.00,
power = 370)
df %>%
rows_update(df_update, by = "cfr")
# A tibble: 3 x 5
cfr vessel_name length tonnage power
<chr> <chr> <dbl> <dbl> <dbl>
1 IRL000I12572 RACHEL JAY 22.5 153 370
2 ESP000023522 ALAKRANTXU NA NA NA
3 ESP000023194 DONIENE 109. 3507 7149.
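The same call should extend to several cfr values at once by putting one row per cfr in the update table, which gives the "one row per cfr" layout asked for. This is a sketch only: the values for ESP000023522 are made-up placeholders, since the question does not give them.

```r
library(dplyr)

df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
             vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
             length = c(NA, NA, 109.30),
             tonnage = c(NA, NA, 3507.00),
             power = c(NA, NA, 7149.05))

# One row per cfr to correct; the second row holds placeholder values.
df_update <- tribble(
  ~cfr,           ~length, ~tonnage, ~power,
  "IRL000I12572",    22.5,      153,    370,
  "ESP000023522",    30.0,      200,    500
)

# Rows of df whose cfr appears in df_update are overwritten;
# all other rows (here ESP000023194) keep their original values.
updated <- rows_update(df, df_update, by = "cfr")
print(updated)
```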
You can also make use of across to pull from a reference list (or vector). But this would require a different reference table or some other code feature per lookup ID.
x <- list(length = 22.5,
tonnage = 153.00,
power = 370)
df %>%
mutate(across(names(x), ~ ifelse(cfr == "IRL000I12572", x[[cur_column()]], .)))
In base R you could do:
df[df$cfr == "IRL000I12572", -c(1:2)] <- list(22.5, 153.00, 370)
So that
df
#> # A tibble: 3 x 5
#> cfr vessel_name length tonnage power
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 IRL000I12572 RACHEL JAY 22.5 153 370
#> 2 ESP000023522 ALAKRANTXU NA NA NA
#> 3 ESP000023194 DONIENE 109. 3507 7149.

Grab specific values from columns with disordered rows

As the consequence of a misformatted file (which is unfortunately the only file available) I have a few thousand columns each with 9 rows of data in them. Unfortunately, the actual values are in a different order in each column.
I need to extract matched locus_tag= and gene= and/or product= values for each column whilst keeping the column order intact so that these do not get mismatched. Another complication is that these are formatted as "gene=ltas" so I had thought some kind of grepl would be useful.
However, I also need them ordered so that each row contains either the correct value (e.g. gene=) or NA:
Column A | Column B
gene = ltas | NA
NA | product = hypothetical protein
locus_tag = RAS_R12345 | locus_tag = RAS_R14053
Here is an example of the data that I am working with:
header 1 | header 2
Parent=gene-SAS_RS00035 | Name=hutH
gbkey=CDS | gene_biotype=protein_coding
inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1 | locus_tag=SAS_RS00040
Dbxref=Genbank:WP_000449218.1 | gbkey=Gene
locus_tag=SAS_RS00035 | old_locus_tag=SAS0008
Name=WP_000449218.1 | gene=hutH
cds-WP_000449218.1 | gene-SAS_RS00040
protein_id=WP_000449218.1 |
product=NAD(P)H-hydrate dehydratase |
I'm not sure where to start with coding this as it is so disordered and poorly formatted, so any advice would be very welcome.
How about this:
dat <- structure(list(`header 1` = c("Parent=gene-SAS_RS00035", "gbkey=CDS",
"inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1",
"Dbxref=Genbank:WP_000449218.1", "locus_tag=SAS_RS00035", "Name=WP_000449218.1",
"cds-WP_000449218.1", "protein_id=WP_000449218.1", "product=NAD(P)H-hydrate dehydratase"
), `header 2` = c("Name=hutH", "gene_biotype=protein_coding",
"locus_tag=SAS_RS00040", "gbkey=Gene", "old_locus_tag=SAS0008",
"gene=hutH", "gene-SAS_RS00040", "", "")), row.names = c(NA,
9L), class = "data.frame")
# For each column, keep the "product=" entry if exactly one is found, otherwise NA.
prod <- apply(dat, 2, function(x){
  prod_ind <- grep("^product", x)
  if(length(prod_ind) == 1){
    out <- gsub("(product=.*)", "\\1", x[prod_ind])
  }else{
    out <- NA
  }
  out
})
# Same for the "gene=" entry ("^gene=" does not match "gene_biotype=" or "gene-...").
gene <- apply(dat, 2, function(x){
  gene_ind <- grep("^gene=", x)
  if(length(gene_ind) == 1){
    out <- gsub("(gene=.*)", "\\1", x[gene_ind])
  }else{
    out <- NA
  }
  out
})
# Same for the "locus_tag=" entry ("old_locus_tag=" is excluded by the "^" anchor).
locus <- apply(dat, 2, function(x){
  locus_tag_ind <- grep("^locus_tag=", x)
  if(length(locus_tag_ind) == 1){
    out <- gsub("(locus_tag=.*)", "\\1", x[locus_tag_ind])
  }else{
    out <- NA
  }
  out
})
dplyr::bind_rows(gene, prod, locus)
#> # A tibble: 3 × 2
#> `header 1` `header 2`
#> <chr> <chr>
#> 1 <NA> gene=hutH
#> 2 product=NAD(P)H-hydrate dehydratase <NA>
#> 3 locus_tag=SAS_RS00035 locus_tag=SAS_RS00040
Created on 2022-04-05 by the reprex package (v2.0.1)
The example above does this in three parts, each time searching for one of the things you're interested in. Then, it combines all the results together.
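Since the three parts differ only in the key they search for, they could probably be folded into one helper that is applied once per key. A minimal base R sketch, assuming each key appears at most once per column; extract_key() and the cut-down dat below are hypothetical and only for illustration:

```r
# Cut-down version of the example data: two columns, entries in arbitrary order.
dat <- data.frame(
  `header 1` = c("locus_tag=SAS_RS00035", "gbkey=CDS",
                 "product=NAD(P)H-hydrate dehydratase"),
  `header 2` = c("gene=hutH", "locus_tag=SAS_RS00040", "gbkey=Gene"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Return the single "key=..." entry of a column, or NA if there is none
# (or more than one, which this sketch treats as ambiguous).
extract_key <- function(column, key) {
  hits <- grep(paste0("^", key, "="), column, value = TRUE)
  if (length(hits) == 1) hits else NA_character_
}

keys <- c("gene", "product", "locus_tag")

# Rows are keys, columns are the original columns, so column order is kept.
result <- sapply(dat, function(column) {
  vapply(keys, function(key) extract_key(column, key), character(1))
})

print(result)
```

Adding another key then only means extending the keys vector rather than copying a block of code.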

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
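As a rough check of the grepl approach, here is a small self-contained sketch on made-up values shaped like the question's data (the numbers are shortened placeholders, not the real measurements). It also converts the measurement columns to numeric afterwards, which is an extra step not requested in the question.

```r
# Made-up data in the same shape as the question's columns.
df <- data.frame(
  Sample  = c("Standard1", "Standard2", "Control1"),
  EGF     = c("6.72", "Invalid(N/A)", "167.83"),
  `FGF-2` = c("Invalid(2002.97)", "44.33", "215.58"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Logical matrix with TRUE wherever a cell starts with "Invalid".
invalid <- sapply(df, function(x) grepl("^Invalid", x))

# Set exactly those cells to NA.
is.na(df) <- invalid

# Optional extra step: the measurement columns can now be converted to numbers.
measure_cols <- c("EGF", "FGF-2")
df[measure_cols] <- lapply(df[measure_cols], as.numeric)

print(df)
```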
The str_replace function is designed to edit the matched part of a character string rather than to blank out whole values, and its pattern is case-sensitive, so "invalid" never matches "Invalid(N/A)". Also, across(everything(), ...) targets all of the columns, including the numeric one. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA

Create new column in string partial match-based dataframe without repeats

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
DF:
GL GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300 Bulk Gas
4 551000 Supply AB
5 551000 Supply XPTO
6 551100 Supply AB
7 551300 Intern
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string then I want KIND to be Supply.
In all other cases I want KIND to be Other.
Then, I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = T), "Supply",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
But with that, I have everything that matches Supply, for example, classified. However, as in DF lines 4 and 5, the same GL has two Supply, which for me is unnecessary. In fact, I need only one type of GLDESC to be matched if for the same GL the string is repeated.
Edit: I cannot delete any rows. I want to have this as output:
GL GLDESC KIND
A Supply1 Supply
A Supply2 N/A
A Supply3 N/A
A Supply4 N/A
A Supply5 N/A
A Supply6 N/A
A Payroll1 Payroll
B Supply2 Supply
B Payroll Payroll
If we need the repeating element to be NA, use duplicated on 'GLDESC' to get a logical vector and assign those elements in 'KIND' created with ifelse to NA
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
If we need to change the values by a grouping variable
library(dplyr)
DF %>%
group_by(GL) %>%
mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
# A tibble: 9 x 3
# Groups: GL [2]
# GL GLDESC KIND
# <chr> <chr> <chr>
#1 A Supply1 Supply
#2 A Supply2 <NA>
#3 A Supply3 <NA>
#4 A Supply4 <NA>
#5 A Supply5 <NA>
#6 A Supply6 <NA>
#7 A Payroll1 Payroll
#8 B Supply2 Supply
#9 B Payroll Payroll
Or with the full changes
library(stringr)
DF1 %>%
mutate(KIND = str_remove(GLDESC, "\\d+"),
KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
group_by(GL) %>%
mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")
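Putting the pieces together, the classification from the question and the per-group de-duplication can be chained in one pipeline. The sketch below is an assumption-laden illustration on a shortened, made-up table: it uses the question's partial, case-insensitive matching via grepl rather than str_remove, and it only blanks repeated "Supply" rows, as in the answer above.

```r
library(dplyr)

# Shortened, made-up version of the example data.
DF1 <- data.frame(
  GL     = c("A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Payroll1", "Supply2", "Intern"),
  stringsAsFactors = FALSE
)

result <- DF1 %>%
  # Step 1: classify every row by partial match on GLDESC.
  mutate(KIND = case_when(
    grepl("supply",  GLDESC, ignore.case = TRUE) ~ "Supply",
    grepl("payroll", GLDESC, ignore.case = TRUE) ~ "Payroll",
    TRUE                                         ~ "Other"
  )) %>%
  # Step 2: within each GL, keep only the first "Supply" and blank the repeats.
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_)) %>%
  ungroup()

print(result)
```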

Efficient way to conditionally edit value labels

I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways.
The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.
d <- dput(structure(list(var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1), labels = structure(c(1,
2, 3, 8, 9), .Names = c("Protection of environment should be given priority",
"Economic growth should be given priority", "[DON'T READ] Both equally",
"[DON'T READ] Don't Know", "[DON'T READ] Refused")), class = "labelled")), .Names = "var1", row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")))
d$var1
<Labelled double>
[1] 1 2 NA NA 3 NA 1 1
Labels:
value label
1 Protection of environment should be given priority
2 Economic growth should be given priority
3 [DON'T READ] Both equally
8 [DON'T READ] Don't Know
9 [DON'T READ] Refused
If a value label begins with "[DON'T READ]" I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."
Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.
library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
"Economic growth should be given priority" = 2,
"Both equally (VOL)" = 3,
"Don't Know (VOL)" = 8,
"Refused (VOL)" = 9)
How can I achieve the result of the function directly above in a way that can be applied to every variable in a data.frame?
The solution must work regardless of the specific value. (In this instance it is values 3,8, & 9 that need alteration, but this is not necessarily the case).
There are a few ways to do this. You could use lapply() or (if you want a one(ish)-liner) you could use any of the scoped variants of mutate():
1) Using lapply()
This method loops over all columns with gsub() to remove the part you do not want and adds the " (VOL)" to the end of the string. Of course you could use this with a subset as well!
d[] <- lapply(d, function(x) {
labels <- attributes(x)$labels
names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
attributes(x)$labels <- labels
x
})
d$var1
[1] 1 2 NA NA 3 NA 1 1
attr(,"labels")
Protection of environment should be given priority Economic growth should be given priority
1 2
Both equally (VOL) Don't Know (VOL)
3 8
Refused (VOL)
9
attr(,"class")
[1] "labelled"
2) Using mutate_all()
Using the same logic (with the same result) you could change the name of the labels in a tidier way:
d %>%
mutate_all(~{names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(attributes(.)$labels));.}) %>%
map(attributes) # just to check on the result
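One caveat for wider data frames: columns that carry no labels attribute (an id column, for example) would likely make the names<- assignment fail, because there is nothing to rename. A minimal base R sketch that skips such columns follows; relabel_vol() and the small d below are hypothetical, and the labels are attached with structure() rather than haven so that the example stays self-contained.

```r
# Rewrite "[DON'T READ] xyz" labels to "xyz (VOL)"; other labels are untouched.
relabel_vol <- function(x) {
  labels <- attr(x, "labels")
  if (is.null(labels)) return(x)  # unlabelled column: return unchanged
  names(labels) <- sub("^\\[DON'T READ\\]\\s*(.*)$", "\\1 (VOL)", names(labels))
  attr(x, "labels") <- labels
  x
}

# Hypothetical data: one plain column and one column with value labels.
d <- data.frame(id = 1:3)
d$var1 <- structure(
  c(1, 3, 8),
  labels = c("Economic growth should be given priority" = 1,
             "[DON'T READ] Both equally" = 3,
             "[DON'T READ] Don't Know" = 8)
)

# Apply to every column; d[] keeps d a data.frame.
d[] <- lapply(d, relabel_vol)

print(names(attr(d$var1, "labels")))
```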
