I am trying to create a report on data validation in R; I have used the validate package to generate the general summary of the data, but I need to get the specifics of what is failing our validation checks.
What I want to end up with is a data frame of ids, columns that fail their test and the value that is failing the test. However, not all of the columns are mandatory, so I need to be able to check whether the data passes without knowing whether or not the column is going to be there.
For other data frames with mandatory data, I converted each value to TRUE/FALSE according to whether it passes its test. For example:
library(dplyr)
library(validate)
library(tidyr)
test_df = data.frame(id = 1:10,
                     a = 11:20,
                     b = c(21:25, 36, 27:30),
                     c = c(41, 52, 43:50))
text_check = test_df %>% transmute(
  id = id,
  a = a > 21,
  b = b > 31,
  c = c > 51
)
value_fails <- data.frame(id = test_df$id, text_check[,-1][colSums(text_check[,-1]) > 0])
value_failures_gath = gather(value_fails, column, changed, -id) %>% filter(changed == TRUE)
value_failures_gath$Value = apply(value_failures_gath, 1, function(x)
  test_df[test_df$id == x[['id']], grep(x[['column']], colnames(test_df))])
value_failures_gath <- value_failures_gath %>% arrange(id, column)
value_failures_gath$changed <- NULL
colnames(value_failures_gath) <- c('ID', 'Field', 'Value')
> value_failures_gath
ID Field Value
1 2 c 52
2 6 b 36
I have a data frame with the checks I want to create, in the style of:
second_data_check = data.frame(a = 'a > 21',
                               b = 'b > 31',
                               c = 'c > 51',
                               d = 'd > 61',
                               stringsAsFactors = FALSE)
I can't just run these as they are, since we don't have column d to check, but other data frames that are run through this validation might have column d and not column b, for example. I can filter this data frame to only include the tests for the columns we have, but then is there a way to apply the tests in this data frame as checks? Is there a better way to do this?
Thanks so much for the help!
I would set up the checks one at a time so that you can check variable existence before evaluation. Would the following solution work?
text_check = data.frame(id = test_df$id)
if('a' %in% colnames(test_df)){
  text_check_temp = test_df %>% transmute(a = a > 21)
  text_check <- cbind(text_check, text_check_temp)
}
if('b' %in% colnames(test_df)){
  text_check_temp = test_df %>% transmute(b = b > 31)
  text_check <- cbind(text_check, text_check_temp)
}
if('c' %in% colnames(test_df)){
  text_check_temp = test_df %>% transmute(c = c > 51)
  text_check <- cbind(text_check, text_check_temp)
}
if('d' %in% colnames(test_df)){
  text_check_temp = test_df %>% transmute(d = d > 61)
  text_check <- cbind(text_check, text_check_temp)
}
I was trying to refactor this further by looping through the transmute checks, but I was unable to figure out how to evaluate string formulas properly.
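For what it's worth, a minimal sketch of such a loop, assuming the checks are kept as strings keyed by column name (as in second_data_check) and evaluated with eval(parse()):
checks <- c(a = 'a > 21', b = 'b > 31', c = 'c > 51', d = 'd > 61')
# keep only the checks whose column actually exists in this data frame
checks <- checks[names(checks) %in% colnames(test_df)]
text_check <- data.frame(id = test_df$id)
for (col in names(checks)) {
  # evaluate the string check with the data frame as the environment
  text_check[[col]] <- eval(parse(text = checks[[col]]), envir = test_df)
}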
Jason
I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n = 237. Let's call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns: column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log-transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end.
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
  nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr))
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently: dplyr::across() was introduced in dplyr 1.0.0 (2020), and it is built for exactly this task of applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
data_values <- purrr::rerun(.n = 165,
                            sample(x = 100,
                                   size = 237,
                                   replace = TRUE))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
  set_names(variable_names) %>%
  as_tibble() %>%
  mutate(diagnosis = 1:237) %>%
  relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = list(log10 = log10),
    .names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame, I would prefer to make a new data frame that contains only the transformed values, and change the column names:
logs_only <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = log10)) %>%
  rename_with(.cols = contains('nutrient_variable'),
              .fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
We can use mutate_at:
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
  mutate_at(2:166, list(nutr_log = ~ log(.)))
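With dplyr 1.0 and later, where mutate_at() is superseded, a sketch of the across() equivalent (same column positions as in the question):
nutr_all_log <- nutr_all_nomiss %>%
  mutate(across(2:166, log, .names = "{.col}_nutr_log"))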
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply:
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
  diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
  nutr_all[i] <- runif(n = 237, 1, 10)
  names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below, using dplyr, will do your task: it log-transforms every numeric variable in the dataset. It also checks whether a column has negative values; as currently written, the function simply does not calculate the log for those columns.
logTransformation <- function(ds)
{
  # this function creates a log transformation of the data frame,
  # for only those variables which are positive in nature
  # args:
  #   ds : Dataset
  require(dplyr)
  if(!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # to get only positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(all_of(varList)) %>%
    dplyr::mutate_at(
      setNames(varList, paste0(varList, "_log")), log)
  return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with negative values too, replace varList as below:
varList <- names(ds)
I am trying to pass a variable Phyla (which is also the name of a df column of interest) into other functions. However, I get the error: Error: Column `tax_level` is unknown. Which I understand. It would just be more convenient to state the column you want to use once in the function, since it will also be repeated numerous times in the script. I have tried using OTU_melt_grouped[, 1], since this will always be the first column to use in the dcast function, but I get the error: Error: Must use a vector in `[`, not an object of class matrix. Moreover, it does not solve my problem in the group_by function, since I want to be able to specify Phyla, Class, Order, etc.
I am sure there must be a simple solution, but I don't know where to start!
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
  OTU_melt_grouped <- data %>%
    group_by(tax_level, variable) %>%
    summarise(value = sum(value))
  taxa_cols <- dcast(OTU_melt_grouped, variable ~ tax_level)
  rownames(taxa_cols) <- meta_data$site
  taxa_cols <- taxa_cols[-1]
  return(taxa_cols)
}
tax_test <- taxa_specific_columns_func(OTU_melt)
As we are passing an unquoted variable, we can make use of the curly-curly ({{ }}) operator in group_by:
library(dplyr)
library(tidyr)
library(tibble)
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
  data %>%
    group_by({{tax_level}}, variable) %>%
    summarise(value = sum(value)) %>%
    pivot_wider(names_from = {{tax_level}}, values_from = value) %>%
    column_to_rownames("variable")
}
taxa_specific_columns_func(OTU_melt)
# A B C D E
#a 0.01859254 0.42141238 -0.196961 -0.1859115 -0.2901680
#b -0.64700080 NA -0.161108 NA NA
#c -0.03297331 0.05871052 -1.963341 NA 0.7608218
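Other taxonomic ranks can be passed the same way, e.g. taxa_specific_columns_func(OTU_melt, Class), provided that column is present in the data.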
data
set.seed(48)
OTU_melt <- data.frame(Phyla = rep(LETTERS[1:5], each = 3),
                       variable = sample(letters[1:3], 15, replace = TRUE),
                       value = rnorm(15))
I have the following code I'd like to run for multiple columns in a data frame called ccc.
ccc %>%
  group_by(LA) %>%
  summarise(Def = sum(DefaultOct05 == 'Def'),
            NDef = sum(DefaultOct05 != 'Def'),
            DRate = mean(DefaultOct05 == 'Def'))
LA is the name of one of the columns. How would I set up a loop to run through a number of different columns?
I've tried the following.
for (i in 26:ncol(ccc)) {
  ccc %>%
    group_by(i) %>%
    summarise(Def = sum(DefaultOct05 == 'Def'),
              NDef = sum(DefaultOct05 != 'Def'),
              DRate = mean(DefaultOct05 == 'Def'))
}
But I get the following error message.
Error in resolve_vars(new_groups, tbl_vars(.data)) :
unknown variable to group by : i
What most people will miss in your question is a reproducible data set. Without one, it's often very hard to reproduce your problem and solve it.
If I understood you correctly, your data set looks like the one below:
set.seed(1)
ccc = data.frame(Default = sample(c(0, 1), 100, replace = TRUE),
                 LA = sample(c("X", "Y", "Z"), 100, replace = TRUE),
                 DC = sample(c("A", "B", "C"), 100, replace = TRUE))
do.call() applies rbind() to the list elements that follow.
lapply(dat, function(x)) applies the function to every element of dat, in our case the columns.
library(dplyr)
do.call(rbind, lapply(ccc, function(Var) {
  dat = data.frame(Var, Default = ccc$Default) %>%
    group_by(Var) %>%
    summarise(Def = sum(Default),
              NDef = n() - sum(Default),
              DRate = mean(Default))
  return(as.data.frame(dat))
}))
"LA is the name of one of the columns"
Actually, dplyr's group_by() works on the variables inside the columns, not on a column name held in a loop index like i, so I suspect you want to do something slightly different.
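For the loop itself, a minimal sketch using rlang::sym() and !! to turn each column name into a symbol that group_by() understands (assuming the reproducible ccc above, where Default is 0/1):
library(dplyr)
library(rlang)
# loop over grouping columns by name
for (col in c("LA", "DC")) {
  res <- ccc %>%
    group_by(!!sym(col)) %>%
    summarise(Def = sum(Default == 1),
              NDef = sum(Default != 1),
              DRate = mean(Default == 1))
  print(res)
}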
If you want to apply the same function to different columns, you could use summarise_at:
df <- data.frame(id = c(1:20),
                 a1 = runif(20),
                 b1 = runif(20),
                 c1 = runif(20))
library(dplyr)
df %>% summarise_at(c("a1", "b1", "c1"), funs(med = median,
                                              avr = mean))
# result:
# a1_med b1_med c1_med a1_avr b1_avr c1_avr
# 1 0.6444056 0.5266252 0.6420554 0.5605837 0.4983654 0.5546381
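For newer dplyr (1.0 and later), where summarise_at() and funs() are superseded, a sketch of the across() equivalent:
df %>% summarise(across(c(a1, b1, c1),
                        list(med = median, avr = mean),
                        .names = "{.col}_{.fn}"))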
I am trying to transform a long data frame into a wide one and flag cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets: see the example (copy and paste it into your RStudio session), but when I try to do it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set is needed for further processing.
Is there any workaround for this problem? I bet a lot of people try to clean data and run into the same problem.
Please help me
Here is the code:
First chunk "example "makes small data set for good visualisation how it supiosed to look
Second chunk "real data" is sliced portion of data set from churn library
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
                 var2 = (7:12),
                 factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"),
                 flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors" , value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
Spread can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries. The answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I instead count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
  group_by(state, account_length, area_code, international_plan) %>%
  tally %>%
  dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
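For what it's worth, with tidyr 1.0.0 and later the same idea can be written with pivot_wider(), which handles duplicates explicitly through values_fn; a sketch, assuming churnTrain prepared as above:
library(dplyr)
library(tidyr)
churnTrain3 <- churnTrain %>%
  mutate(temporary = 1) %>%
  pivot_wider(names_from = state,
              values_from = temporary,
              values_fill = 0,
              values_fn = sum)  # sum the duplicates instead of erroring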
As others point out, you need to input a unique vector into spread. My solution uses base R:
library(C50)
f <- function(df, key){
  if (sum(names(df) == key) == 0) stop("No such key")
  u <- unique(df[[key]])
  id <- matrix(0, dim(df)[1], length(u))
  uu <- lapply(df[[key]], function(x) which(u == x)) ## check 43697442 for details
  for(i in 1:dim(df)[1]) id[i, uu[[i]]] <- 1
  colnames(id) <- as.character(u)
  return(cbind(df, id))
}
df <- data.frame(var1 = (1:6),
                 var2 = (7:12),
                 factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"))
f(df, key='fact')    # no such column: stops with "No such key"
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although you may see a for-loop and other temporary variables inside the f function, it is actually not slow.
I'd like to pull some data from a sql server with a dynamic filter. I'm using the great R package dplyr in the following way:
#Create the filter
filter_criteria = ~ column1 %in% some_vector
#Connect to the database
connection <- src_mysql(dbname <- "mydbname",
user <- "myusername",
password <- "mypwd",
host <- "myhost")
#Get data
data <- connection %>%
  tbl("mytable") %>%                   #Specify which table
  filter_(.dots = filter_criteria) %>% #Non-standard evaluation filter
  collect()                            #Pull data
This piece of code works fine but now I'd like to loop it somehow on all the columns of my table, thus I'd like to write the filter as:
#Dynamic filter
i <- 2 #With a loop on this i for instance
which_column <- paste0("column",i)
filter_criteria <- ~ which_column %in% some_vector
And then reapply the first code with the updated filter.
Unfortunately, this approach doesn't give the expected results: it does not raise any error, but it doesn't pull any results into R either.
In particular, I looked a bit into the SQL query generated by the two pieces of code and there is one important difference.
While the first, working, code generates a query of the form:
SELECT ... FROM ... WHERE
`column1` IN ....
(a backtick ` around the column name), while the second one generates a query of the form:
SELECT ... FROM ... WHERE
'column1' IN ....
(a single quote ' around the column name).
Does anyone have any suggestion on how to formulate the filtering condition to make it work?
It's not really related to SQL. This example in R does not work either:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
df %>% filter_(~ "v1" == 1)
It does not work because you need to pass filter_ the expression ~ v1 == 1, not the expression ~ "v1" == 1.
To solve the problem, simply use the quoting function quo() and the unquoting operator !!:
library(dplyr)
which_column <- quo(v1)
df %>% filter(!!which_column == 1)
An alternative solution: with dplyr version 0.5.0 (and possibly earlier), it is possible to pass a composed string as the .dots argument, which I find more readable than the lazyeval::interp solution:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_col <- "v1"
which_val <- 1
df %>% filter_(.dots = paste0(which_col, " == ", which_val))
v1 v2
1 1 1
2 1 2
3 1 4
UPDATE for dplyr 0.6 and later:
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
df %>% filter(UQ(rlang::sym(which_col))==which_val)
#OR
df %>% filter((!!rlang::sym(which_col))==which_val)
(Similar to @Matthew's response for dplyr 0.6, but I assume that which_col is a string variable.)
2nd UPDATE: Edwin Thoen created a nice cheatsheet for tidy evaluation: https://edwinth.github.io/blog/dplyr-recipes/
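Applied back to the original database query, a minimal sketch with the column name held as a string (connection, mytable, and some_vector as in the question):
i <- 2
which_column <- paste0("column", i)
data <- connection %>%
  tbl("mytable") %>%
  filter(!!rlang::sym(which_column) %in% some_vector) %>%
  collect()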
Here's a slightly less verbose solution, one which uses the typical behavior of the extraction function `[` to select a column by a character value rather than converting it to a language element:
df %>% filter(., '['(., which_column)==1 )
set.seed(123)
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_column <- "v1"
df %>% filter(., '['(., which_column)==1)
# v1 v2
#1 1 5