I have the following code I'd like to run for multiple columns in a data frame called ccc.
ccc %>%
group_by(LA) %>%
summarise(Def = sum(DefaultOct05 == 'Def'),
NDef = sum(DefaultOct05 != 'Def'),
DRate = mean(DefaultOct05 == 'Def'))
LA is the name of one of the columns. How would I set up a loop to run through a number of different columns?
I've tried the following.
for (i in 26:ncol(ccc)) {
ccc %>%
group_by(i) %>%
summarise(Def = sum(DefaultOct05 == 'Def'),
NDef = sum(DefaultOct05 != 'Def'),
DRate = mean(DefaultOct05 == 'Def'))
}
But I get the following error message.
Error in resolve_vars(new_groups, tbl_vars(.data)) :
unknown variable to group by : i
What most people will miss in your question is a reproducible data set. Without it, it's often very hard to reproduce your problem and solve it.
If I understood you correctly, your data set looks like the one below:
set.seed(1)
ccc=data.frame(Default=sample(c(0,1),100,replace = TRUE),LA=sample(c("X","Y","Z"),100,replace = TRUE),DC=sample(c("A","B","C"),100,replace = TRUE))
lapply(dat, function(x) ...) applies the function to every element of dat; for a data frame, the elements are its columns.
do.call(rbind, ...) then row-binds the resulting list of data frames into a single data frame.
library(dplyr)
do.call(rbind, lapply(ccc[c("LA", "DC")], function(Var) {
  # pair each grouping column with Default, then summarise per level
  dat <- data.frame(Var, Default = ccc$Default) %>%
    group_by(Var) %>%
    summarise(Def = sum(Default), NDef = n() - sum(Default), DRate = mean(Default))
  as.data.frame(dat)
}))
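The error in the question arises because group_by(i) looks for a column literally named i. A minimal sketch of one way around this, assuming dplyr 1.0.0 or later, is to loop over column names and select the grouping column with across(all_of(...)). The data below is made up for illustration, and the vector group_cols is a hypothetical name:

```r
library(dplyr)

# made-up data in the shape described in the question
set.seed(1)
ccc <- data.frame(
  DefaultOct05 = sample(c("Def", "NDef"), 100, replace = TRUE),
  LA = sample(c("X", "Y", "Z"), 100, replace = TRUE),
  DC = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# hypothetical: the names of the columns to group by, one at a time
group_cols <- c("LA", "DC")

results <- lapply(group_cols, function(col) {
  ccc %>%
    group_by(across(all_of(col))) %>%   # group by the column whose name is stored in col
    summarise(Def   = sum(DefaultOct05 == "Def"),
              NDef  = sum(DefaultOct05 != "Def"),
              DRate = mean(DefaultOct05 == "Def"))
})
names(results) <- group_cols

results$LA
```

Each element of results is one summary table, so results$DC gives the same summary grouped by DC.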
"LA is the name of one of the columns"
Actually, group_by() in dplyr groups rows by the values found within a column, so I guess you want to do something else.
If you want to apply the same function to different columns you could use summarize_at.
df <- data.frame( id = c(1:20),
a1 = runif(20),
b1 = runif(20),
c1 = runif(20)
)
library(dplyr)
# funs() is deprecated since dplyr 0.8.0; a named list of functions works the same way
df %>% summarise_at(c("a1","b1","c1"), list(med = median,
avr = mean))
# result:
# a1_med b1_med c1_med a1_avr b1_avr c1_avr
# 1 0.6444056 0.5266252 0.6420554 0.5605837 0.4983654 0.5546381
My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
filter(x == "something complex") %>%
omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve for just this problem but understand if there is an underlying method for multiple cases.
set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
sample_n(100) %>%
mutate(period = sample(c(1,2,3),
size = 100,
replace = TRUE),
event = sample(c("A","B"),
size = 100,
replace = TRUE,
prob = c(0.1,0.9))) %>%
select(period, time_sec, event) %>%
arrange(period, time_sec)
filter_within_timeframe <- function (.data, condition, time, lead_time = 0, lag_time = 0){
condition <- enquo(condition)
time <- enquo(time)
filtered <- .data %>% slice(., 1:max(which(!!condition))) %>%
group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
filter(., (last(!!time) - !!time) <= lead_time & (last(!!time) -
!!time) >= lag_time)
return(filtered)
}
# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)
# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",
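The join idea mentioned in the question can be sketched with dplyr::anti_join(), which keeps the rows of the first table that have no match in the second. This is only a sketch on made-up data (df, matched and the filter condition are hypothetical), and it assumes rows are uniquely identifiable by the join columns; if the data can contain exact duplicate rows, adding a row id first would be safer.

```r
library(dplyr)

# made-up data; id makes every row unique
df <- tibble(id = 1:10, x = rep(c("a", "b"), 5))

# stand-in for "something complex": any pipeline that returns a subset of df's rows
matched <- df %>%
  filter(x == "a", id > 3)

# everything from df that the search did NOT return
everything_except <- anti_join(df, matched, by = names(df))

everything_except
```

The same pattern should apply to the output of a helper such as filter_within_timeframe(), provided the join is restricted to columns that exist in both tables.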
I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns: column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log-transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end.
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently: dplyr::across() was introduced in dplyr 1.0.0 (mid-2020) and is built for exactly this task, applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
data_values <- purrr::rerun(.n = 165,
sample(x=100,
size=237,
replace = T))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log = log),  # natural log, as asked in the question
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log)) %>%  # natural log, as asked in the question
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log'))
logs_only[1:5, 1:3]
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
# take the log of the existing columns 2:166 and store it under the new names
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below uses dplyr to log-transform every numeric variable in a data set. It first checks each column for non-positive values and, as written, skips those columns rather than computing their log.
logTransformation <- function(ds)
{
  # this function creates log transformations of the data frame's variables,
  # but only for variables that are strictly positive
  # args:
  # ds : Dataset
  require(dplyr)
  # is.data.frame() also accepts tibbles, unlike class(ds) == "data.frame"
  if (!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # to get only positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(dplyr::all_of(varList)) %>%
    # add a "<name>_log" column for each selected variable
    dplyr::mutate(dplyr::across(dplyr::all_of(varList), log, .names = "{.col}_log"))
  return(ds)
}
You can use it for your case as follows:
# assuming your binary variable is named binaryVar
nutr_allTransformed<- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with non-positive values too, replace the varList assignment as below (log() then returns NaN for negative values and -Inf for zero):
varList<- names(ds)
I have two data frames (table1 and randomdata) with the following schema:
#randomdata
randomdata$cube = {1,5,3,3,4,5,5,2,2,6,1,2,....} (1000 rows)
#table1
table1$side = {1,2,3,4,5,6} (6 rows)
table1$frequency = NULL
I want to count the occurrences of the different sides of the cube (in the first 10 rows of randomdata$cube) and assign each result to the corresponding row of table1$frequency (based on table1$side).
I can do this successfully this way:
table1$frequency[1] <- sum(randomdata$cube[1:10] == 1)
table1$frequency[2] <- sum(randomdata$cube[1:10] == 2)
table1$frequency[3] <- sum(randomdata$cube[1:10] == 3)
...
table1$frequency[6] <- sum(randomdata$cube[1:10] == 6)
This works very well, but there must be a better way.
Instead of 6 statements, I imagine something like this:
table1$frequency <- sum(randomdata$cube[1:10] == table1$side)
Can someone show me a more dynamic way to do this?
Thank you.
We can do this by converting the 'cube' column to a factor with levels specified as 1:6 and then calling table(). Without the explicit levels, sides that do not occur in the first 10 rows would be dropped from the table output; with them, a missing side is reported as 0.
table1$frequency <- table(factor(randomdata$cube[1:10], levels = 1:6))
Or using tidyverse
library(tidyverse)
randomdata %>%
slice(1:10) %>%
count(cube = factor(cube, levels = 1:6), .drop = FALSE) %>%
pull(n) %>%
mutate(table1, frequency = .)
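The one-liner imagined in the question, sum(randomdata$cube[1:10] == table1$side), does not work because == compares the two vectors element by element (recycling the shorter one) and sum() then collapses everything to a single number. A minimal base R sketch of two alternatives follows; the twelve cube values are the ones listed in the question, and first10 is a hypothetical helper name:

```r
# first twelve values from the question's randomdata$cube
randomdata <- data.frame(cube = c(1, 5, 3, 3, 4, 5, 5, 2, 2, 6, 1, 2))
table1 <- data.frame(side = 1:6)

first10 <- randomdata$cube[1:10]

# one comparison per side, which mirrors the six hand-written statements
table1$frequency <- vapply(table1$side,
                           function(s) sum(first10 == s),
                           integer(1))

# alternative: tabulate() counts the integers 1..nbins and keeps zero counts
freq_tab <- tabulate(first10, nbins = 6)

table1
```

Note that tabulate() assumes the sides are exactly 1 to 6 in order, whereas the vapply() version follows whatever is stored in table1$side.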
My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of two 1's in a row). The data is an xts time series with 180K+ columns and 84 months. I've provided a small, simplified data set. I've used a nested for loop, but it's taking way too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for(col in 1:ncol(mod_sig)) {
for(row in 1:nrow(mod_sig)) {
val <- ifelse(mod_sig[row,col] == 1,
ifelse(mod_sig[row-1,col] == 0,
ifelse(mod_sig[row+1,col] == 0,0,1),1),0)
Signals_Fin[row, col] <- val
}
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given and some help from a colleague.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
# funs() is deprecated since dplyr 0.8.0; formula lambdas do the same job
Signals_fin = mod_sig %>%
mutate_all(~ ifelse(. == 1 & (lag(.) == 1 | lead(.) == 1), 1, 0)) %>%
mutate_all(~ ifelse(is.na(.), 0, .))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), function(x){
if_else(x == 1 &
(lag(x, order_by = .$months) == 1 |
lead(x, order_by = .$months) == 1),
1,
0)
})
As suggested by @Ryan, his mutate_at call is more elegant. It's important that everything is already sorted, though:
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
as.data.frame() %>%
mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
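For a very wide data set such as the 180K+ columns mentioned in the question, it may also be worth trying the same lag/lead logic on a plain matrix, so that all columns are handled in one vectorised step. This is only a sketch, not benchmarked here; it assumes the signals are strictly 0/1 with no missing values, and the names m, prev, nxt and keep are hypothetical. For an xts object, coredata(mod_sig) would supply the matrix.

```r
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))

m <- as.matrix(mod_sig)
n <- nrow(m)

# value in the previous month (a row of 0s stands in for "before the first month")
prev <- rbind(0, m[-n, , drop = FALSE])
# value in the next month (a row of 0s stands in for "after the last month")
nxt <- rbind(m[-1, , drop = FALSE], 0)

# keep a 1 only if at least one neighbouring month is also 1
keep <- m * ((prev + nxt) > 0)

keep[, "a"]
```

Treating the months outside the series as 0 matches the behaviour of the answer code above, where the NA values produced by lag() and lead() at the edges are replaced with 0.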
I am trying to create a report on data validation in R; I have used the validate package to generate the general summary of the data, but I need to get the specifics of what is failing our validation checks.
What I want to end up with is a data frame of ids, columns that fail their test and the value that is failing the test. However, not all of the columns are mandatory, so I need to be able to check whether the data passes without knowing whether or not the column is going to be there.
For other data frames with mandatory data, I converted it to True/False whether it passes the tests. For example:
library(dplyr)
library(validate)
library(tidyr)
test_df = data.frame(id = 1:10,
a = 11:20,
b = c(21:25,36,27:30),
c = c(41,52,43:50))
text_check = test_df %>% transmute(
a = a>21,
b = b > 31,
c = c> 51
)
value_fails<-data.frame(id = test_df$id, text_check[,-1][colSums(text_check[,-1]) > 0])
value_failures_gath = gather(value_fails, column, changed, -id) %>% filter(changed == TRUE)
value_failures_gath$Value = apply(value_failures_gath, c(1), function(x)
test_df[test_df$id == x[['id']], grep(x[['column']], colnames(test_df))])
value_failures_gath<-value_failures_gath %>% arrange(id, column)
value_failures_gath$changed<-NULL
colnames(value_failures_gath)<-c('ID','Field','Value')
> value_failures_gath
ID Field Value
1 2 c 52
2 6 b 36
I have a data frame with the checks I want to create, in the style of:
second_data_check = data.frame(a = 'a>21',
b = 'b > 31',
c = 'c> 51',
d = 'd> 61')
I can't just run these as are, since we don't have column D to check, but other data frames that are run through this validation might have column D but not have column B for example. I can filter this data frame to only include the tests for the columns we have but then is there a way to apply the tests in this data frame as checks? Is there a better way to do this?
Thanks so much for the help!
I would set up the checks one at a time so that you can check variable existence before evaluation. Would the following solution work?
text_check = data.frame(id=test_df$id)
if('a' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(a=a>21)
text_check <- cbind(text_check, text_check_temp)
}
if('b' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(b=b>31)
text_check <- cbind(text_check, text_check_temp)
}
if('c' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(c=c>51)
text_check <- cbind(text_check, text_check_temp)
}
if('d' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(d=d>61)
text_check <- cbind(text_check, text_check_temp)
}
I was trying to further refactor the code by looping through the transmute checks but was unable to figure out how to evaluate string formulas properly.
Jason
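On the open point about evaluating the string formulas: one possible approach, sketched here in base R, is to keep only the checks whose column exists and evaluate each string with eval(parse(text = ...)) using the data frame as the evaluation environment. The object names checks, present, flags and failures are hypothetical, and since eval(parse()) runs arbitrary code, this assumes the check strings come from a trusted source.

```r
test_df <- data.frame(id = 1:10,
                      a = 11:20,
                      b = c(21:25, 36, 27:30),
                      c = c(41, 52, 43:50))

second_data_check <- data.frame(a = 'a>21',
                                b = 'b > 31',
                                c = 'c> 51',
                                d = 'd> 61',
                                stringsAsFactors = FALSE)

# named character vector of checks, restricted to columns that are present
checks  <- unlist(second_data_check)
present <- checks[names(checks) %in% names(test_df)]

# evaluate each check string against the columns of test_df
flags <- lapply(present, function(chk) eval(parse(text = chk), envir = test_df))

# collect id, field and offending value for every failed check
failures <- do.call(rbind, lapply(names(flags), function(col) {
  hit <- which(flags[[col]])
  data.frame(ID = test_df$id[hit],
             Field = rep(col, length(hit)),
             Value = test_df[[col]][hit],
             stringsAsFactors = FALSE)
}))
failures <- failures[order(failures$ID, failures$Field), ]

failures
```

Because the check for d is dropped before evaluation, the same code should run unchanged on data frames that have column d but lack, say, column b.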