I would like to create a new column, populate it only when a specific condition is met (here x > 2), and then directly overwrite another existing column (here auxiliary) for those rows where the condition (x > 2) returned TRUE.
df <- tibble(x = 1:5, y = 1:5, auxiliary = NA)
# A tibble: 5 x 3
x y auxiliary
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 NA
I can do this successfully with two separate calls within mutate():
df %>%
mutate(result = if_else(condition = x > 2,
true = x+y,
false = NA_real_),
auxiliary = if_else(condition = x > 2,
true = "Calculation done",
false = NA_character_))
# A tibble: 5 x 4
x y auxiliary result
<int> <int> <chr> <dbl>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 Calculation done 6
4 4 4 Calculation done 8
5 5 5 Calculation done 10
But there is some code repetition (condition = x > 2) which, in more complex cases, makes the code hard to read and error-prone, especially when there are multiple conditions.
Is there a way to simplify the code above by not repeating the condition? That is:
Create new variable (mutate())
Document only if condition is matched (if_else or case_when())
Write another column's value only if the row's condition is matched. (I'm stuck here)
Something that would look like this:
df %>%
mutate(result = case_when(
x > 2 ~ x + y & auxiliary == "Calculation done", # we'd add the column reference here...
TRUE ~ NA_real_ & auxiliary = NA_character_))
Many thanks! Any solution from the tidyverse would be ideal.
You can save the result of the condition in a column and use that to avoid evaluating the same condition again and again.
library(dplyr)
df <- tibble(x = 1:5, y = 1:5)
df %>%
mutate(condition = x > 2,
result = if_else(condition,
true = x+y,
false = NA_integer_),
auxiliary = if_else(condition,
true = "Calculation done",
false = NA_character_))
# x y condition result auxiliary
# <int> <int> <lgl> <int> <chr>
#1 1 1 FALSE NA NA
#2 2 2 FALSE NA NA
#3 3 3 TRUE 6 Calculation done
#4 4 4 TRUE 8 Calculation done
#5 5 5 TRUE 10 Calculation done
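If the helper column should not stay in the result, it can simply be dropped at the end of the pipe; a minimal sketch of the same idea:
df %>%
  mutate(condition = x > 2,
         result = if_else(condition, x + y, NA_integer_),
         auxiliary = if_else(condition, "Calculation done", NA_character_)) %>%
  select(-condition)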
I would suggest saving the condition that is used multiple times as a string and then using the string as a variable in the code, e.g.:
condition <- "x>2"
df %>%
mutate(result = ifelse(eval(parse(text=condition)),
x+y,
NA),
auxiliary = ifelse(eval(parse(text=condition)),
"Calculation done",
NA))
Note that I am using the base ifelse function to avoid the restriction of having to use the same type in both branches ("dplyr::if_else is specifically written to force you to have the same type in your true and false arguments"). See e.g. Different behavior of if else statement and if_else for further information.
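If you would rather not eval(parse()) a string, a similar effect can be had by storing the condition as a quoted expression and injecting it with rlang's !! operator; a minimal sketch, assuming rlang is installed:
library(dplyr)
library(rlang)

cond <- expr(x > 2)  # the condition, written down once

df %>%
  mutate(result = ifelse(!!cond, x + y, NA),
         auxiliary = ifelse(!!cond, "Calculation done", NA))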
It is possible to achieve the kind of abstraction you would like to have, but it does require more setup. mutate() is actually more flexible than you might think: you can pass it a whole script. Suppose you write something like A %>% mutate({...}). If the script {...} returns a data frame, then its columns will be created directly in A, or will replace existing columns in A that share the same names. So you can do
df %>% mutate({
cond <- x > 2
out <- tibble(.rows = n())
mapply(
\(var, true, false) out[[var]] <<- if_else(cond, true, false),
var = c("result", "auxiliary"),
true = list(x + y, "Calculation done"),
false = list(NA_integer_, NA_character_)
)
out
})
Output
# A tibble: 5 x 4
x y auxiliary result
<int> <int> <chr> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 Calculation done 6
4 4 4 Calculation done 8
5 5 5 Calculation done 10
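For just two new columns, the same mechanism works without the mapply() abstraction; a shorter sketch of the same idea (this relies on dplyr >= 1.0.0 splicing an unnamed data frame result into columns, exactly as the code above does), with the condition still evaluated only once:
df %>% mutate({
  cond <- x > 2
  tibble(result = if_else(cond, x + y, NA_integer_),
         auxiliary = if_else(cond, "Calculation done", NA_character_))
})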
Related
I want to do a row-wise check whether multiple columns are all equal or not. I came up with a convoluted approach that counts the occurrences of each value per group, but this seems somewhat... cumbersome.
sample data
sample_df <- data.frame(id = letters[1:6], group = rep(c('r','l'),3), stringsAsFactors = FALSE)
set.seed(4)
for(i in 3:5) {
  sample_df[i] <- sample(1:4, 6, replace = TRUE)
}
sample_df
desired output
library(tidyverse)
sample_df %>%
gather(var, value, V3:V5) %>%
mutate(n_var = n_distinct(var)) %>% # get the number of columns
group_by(id, group, value) %>%
mutate(test = n_distinct(var) == n_var ) %>% # check how frequent values occur per "var"
spread(var, value) %>%
select(-n_var)
#> # A tibble: 6 x 6
#> # Groups: id, group [6]
#> id group test V3 V4 V5
#> <chr> <chr> <lgl> <int> <int> <int>
#> 1 a r FALSE 3 3 1
#> 2 b l FALSE 1 4 4
#> 3 c r FALSE 2 4 2
#> 4 d l FALSE 2 1 2
#> 5 e r TRUE 4 4 4
#> 6 f l FALSE 2 2 3
Created on 2019-02-27 by the reprex package (v0.2.1)
Does not need to be dplyr. I just used it for showing what I want to achieve.
There are a bunch of ways to check for equality row-wise. Two good ways:
# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)
# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
If you only want to test some columns, then use a subset of columns rather than the whole data frame:
cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
Note I use df[cols_to_test] instead of df[, cols_to_test] when I want to be sure the result is a data.frame even if cols_to_test has length 1.
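Applied to the question's sample_df (columns V3 to V5), a sketch of the first approach:
cols_to_test <- c("V3", "V4", "V5")
# TRUE where every tested column equals the first tested column in that row
sample_df$test <- rowSums(sample_df[cols_to_test] == sample_df[[cols_to_test[1]]]) == length(cols_to_test)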
This question already has answers here:
Idiom for ifelse-style recoding for multiple categories
(13 answers)
Closed 4 years ago.
I have a dataframe like the one below:
Records:
ID Remarks Value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14
Using the above dataframe, I want to add a new column Status to it.
If Remarks is ABC, AAB or ABB, then Status should be TRUE; for XYZ and ZZX it should be FALSE.
I am using the method below for that, but it doesn't work.
Records$Status<-ifelse(Records$Remarks %in% ("ABC","AAB","ABB"),"TRUE",
ifelse(Records$Remarks %in%
("XYZ","ZZX"),"FALSE"))
And, based on Status, I want to derive the following output:
ID TRUE FALSE Sum
1 2 1 37
2 1 1 26
Records$Status<-ifelse(Records$Remarks %in% c("ABC","AAB","ABB"),TRUE,
ifelse(Records$Remarks %in%
c("XYZ","ZZX"),FALSE, NA))
You need to enclose your lists of strings with c(), and add an "else" condition for the second ifelse (but see Roman's answer below for a better way of doing this with case_when). Also note that here I changed the "TRUE" and "FALSE" (character class) into TRUE and FALSE (the logical class).
For the summary (using dplyr):
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Status), falses=sum(!Status), sum=sum(Value))
# A tibble: 2 x 4
ID trues falses sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Of course, if you don't really need the intermediate Status column but just want the summary table, you can skip the first step altogether:
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Remarks %in% c("ABC","AAB","ABB")),
falses=sum(Remarks %in% c("XYZ","ZZX")),
sum=sum(Value))
Since it makes sense to use dplyr for your second question (see #iod's answer), it is also a good opportunity to use the package's very straightforward case_when() function for the first part.
Records %>%
mutate(Status = case_when(Remarks %in% c("ABC", "AAB", "ABB") ~ TRUE,
Remarks %in% c("XYZ", "ZZX") ~ FALSE,
TRUE ~ NA))
ID Remarks Value Status
1 1 ABC 10 TRUE
2 1 AAB 12 TRUE
3 1 ZZX 15 FALSE
4 2 XYZ 12 FALSE
5 2 ABB 14 TRUE
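With more recent dplyr (1.1.0 or later), the catch-all can also be written with the .default argument instead of TRUE ~ NA; a sketch:
Records %>%
  mutate(Status = case_when(Remarks %in% c("ABC", "AAB", "ABB") ~ TRUE,
                            Remarks %in% c("XYZ", "ZZX") ~ FALSE,
                            .default = NA))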
This approach will scale to a large number of remarks.
Load the data and prepare a matching data frame
The second data frame maps each remark to its TRUE or FALSE value.
library(readr)
library(dplyr)
library(tidyr)
dtf <- read_table("id remarks value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14")
truefalse <- tibble(remarks = c("ABC", "AAB", "ABB", "ZZX", "XYZ"),
tf = c(TRUE, TRUE, TRUE, FALSE, FALSE))
Group by id and summarise
This is the format asked for in the question:
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id) %>%
summarise(true = sum(tf),
false = sum(!tf),
value = sum(value))
# A tibble: 2 x 4
id true false value
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Alternative proposal: group by id, tf and summarise
This option retains more detail on how value is distributed across the grouping variables id and tf.
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id, tf) %>%
summarise(n = n(),
value = sum(value))
# A tibble: 4 x 4
# Groups: id [?]
id tf n value
<int> <lgl> <int> <int>
1 1 FALSE 1 15
2 1 TRUE 2 22
3 2 FALSE 1 12
4 2 TRUE 1 14
In most cases, life is easier and lines are shorter without ifelse:
# short version
df$Status <- df$Remarks %in% c("ABC","AAB","ABB")
This version is OK for most purposes, but it has a shortcoming: Status will be FALSE if Remarks is NA or, say, "garbage", whereas one might want it to be NA in those cases and FALSE only if Remarks %in% c("XYZ", "ZZX"). So one can add and multiply the conditions (using the facts that NA^0 is 1, NA^1 is NA, and 0 * NA is NA) and finally convert the result to logical:
df$Status <- with(df, as.logical(
  (Remarks %in% c("ABC","AAB","ABB")) *
  NA^(!(Remarks %in% c("ABC","AAB","ABB") + Remarks %in% c("XYZ","ZZX")))
))
And the summary table with base R:
aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
Umm... perhaps some formatting would be useful:
t1 <- aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
t1 <- t1[, c(1,3,2)]
colnames(t1) <- c("ID", "", "Sum")
t1
# ID FALSE TRUE Sum
# 1 1 1 2 37
# 2 2 1 1 26
This one returns the correct result only if there are just the two mentioned groups ("ABC", "AAB", "ABB" vs. "XYZ", "ZZX", ...). To me, #iod's solution is more R-like, but I've tried to avoid ifelse and do it another way:
Code:
library(tidyverse)
dt %>%
group_by(ID, Status = grepl("^A[AB][CB]$", Remarks)) %>%
summarise(N = n(), Sum = sum(Value)) %>%
spread(Status, N) %>%
summarize_all(sum, na.rm = T) %>% # data still grouped by ID
select("ID", "TRUE", "FALSE", "Sum")
# A tibble: 2 x 4
ID `TRUE` `FALSE` Sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Data:
dt <- structure(
list(ID = c(1L, 1L, 1L, 2L, 2L),
Remarks = c("ABC", "AAB", "ZZX", "XYZ", "ABB"),
Value = c(10L, 12L, 15L, 12L, 14L)),
.Names = c("ID", "Remarks", "Value"), class = "data.frame", row.names = c(NA, -5L)
)
I am pretty new to R, so this question may be a bit naive.
I have got a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins, all within a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = RT), because I want this to be flexible.
I've tried several ways of referring to the column name via this variable, but I cannot get it to work. Amongst others, I have tried get(), eval() and [[]].
simplified example code
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
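With current dplyr (1.0.0 or later), the lazyeval machinery is no longer needed: a column name stored as a string can be looked up in the data mask with .data[[ ]]. A minimal sketch, assuming the same data_st, bin2use and nbin as above:
library(dplyr)
library(ggplot2) # for cut_number()

binned_data <- data_st %>%
  group_by(Subject) %>%
  mutate(Bin = cut_number(.data[[bin2use]], nbin, label = FALSE)) %>%
  ungroup()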
I want to create a frequency table from a data frame and save it in Excel. Using the table() function I can only get the frequencies of one particular column, but I want frequency tables for all the columns together, even though the levels or types of the variables may differ from column to column. Something like a summary of a data frame, but with only frequencies, no means or other measures.
I was trying something like this
for(i in 1:230){
  rm(tb)
  tb <- data.frame(table(mydata[i]))
  tb2 <- cbind(tb2, tb)
}
But it's showing the following Error
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 15, 12
In place of cbind() I also used data.frame(), but the error didn't change.
You are getting an error because you are trying to combine data frames that have different dimensions. From what I understand, your problem is two-fold: (1) you want to get the frequency distribution of each column regardless of type; and (2) you want to save all of the results in a single Excel sheet.
For the first problem, you can use the mapply() function.
set.seed(1)
dat <- data.frame(
x = sample(LETTERS[1:5], 15, replace = TRUE),
y = rbinom(5, 15, prob = 0.4)
)
mylist <- mapply(table, dat); mylist
# $x
#
# A B C D E
# 2 5 1 4 3
#
# $y
#
# 5 6 7 11
# 3 3 6 3
You can also use purrr::map().
library(purrr)
dat %>% map(table)
The second problem has several solutions in this question: Export a list into a CSV or TXT file in R. In particular, LyzandeR's answer will enable you to do just what you intended. If you prefer to save the outputs in separate files, you can do:
mapply(write.csv, mylist, file=paste0(names(mylist), '.csv'))
Maybe an rbind solution is better as it allows you to handle variables with different levels:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
# x y
# 1 A 1
# 2 A 1
# 3 B 2
# 4 C 1
dt_res = data.frame()
for (i in 1:ncol(dt)){
dt_temp = data.frame(t(table(dt[,i])))
dt_temp$Var1 = names(dt)[i]
dt_res = rbind(dt_res, dt_temp)
}
names(dt_res) = c("Variable","Levels","Freq")
dt_res
# Variable Levels Freq
# 1 x A 2
# 2 x B 1
# 3 x C 1
# 4 y 1 3
# 5 y 2 1
And an alternative (probably faster) process using apply:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
ff = function(x){
y = data.frame(t(table(x)))
y$Var1 = NULL
names(y) = c("Levels","Freq")
return(y)
}
dd = do.call(rbind, apply(dt, 2, ff))
dd
# Levels Freq
# x.1 A 2
# x.2 B 1
# x.3 C 1
# y.1 1 3
# y.2 2 1
# extract variable names from row names
dd$Variable = sapply(row.names(dd), function(x) unlist(strsplit(x,"[.]"))[1])
dd
# Levels Freq Variable
# x.1 A 2 x
# x.2 B 1 x
# x.3 C 1 x
# y.1 1 3 y
# y.2 2 1 y
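Either combined data frame can then be written to a file that Excel opens directly; a sketch (the file name is just an example):
write.csv(dd, "frequency_tables.csv", row.names = FALSE)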
Edit (2021-03-29): tidyverse Principles
Here is some updated code that uses the tidyverse, specifically functions from dplyr, tibble, and purrr. The code is a bit more readable and easier to follow. An example data set is provided.
tibble(
a = rep(c(1:3), 2),
b = factor(rep(c("Jan", "Feb", "Mar"), 2)),
c = factor(rep(LETTERS[1:3], 2))
) ->
dat
dat #print df
# A tibble: 6 x 3
a b c
<int> <fct> <fct>
1 1 Jan A
2 2 Feb B
3 3 Mar C
4 1 Jan A
5 2 Feb B
6 3 Mar C
Get counts and proportions across columns.
library(purrr)
library(dplyr)
library(tibble)
#library(tidyverse) #to load assortment of pkgs
#output tables - I like to use parentheses & spell out my functions
purrr::map(
dat, function(.x) {
count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100))
})
#here is the same code but more concise (purrr formula shorthand)
purrr::map(dat, ~ count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100)))
$a
# A tibble: 3 x 3
x n pct
<int> <int> <dbl>
1 1 2 33.3
2 2 2 33.3
3 3 2 33.3
$b
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 Feb 2 33.3
2 Jan 2 33.3
3 Mar 2 33.3
$c
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 A 2 33.3
2 B 2 33.3
3 C 2 33.3
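If a single combined table is handier (for example to write everything to one sheet), the per-column results can be row-bound with an id column; a sketch (values are converted to character so columns of different types can be stacked):
purrr::map_dfr(dat, ~ count(tibble(x = as.character(.x)), x) %>%
                 mutate(pct = n / sum(n) * 100),
               .id = "variable")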
Old code...
The table() function returns a "table" object, which is nigh impossible to manipulate using R in my experience, so I tend to just write my own function to circumvent this issue. Reusing the dat defined above (wide-formatted data with some categorical features), we can use lapply() in conjunction with base R's table() function to create a list of frequency counts for each feature.
freqList = lapply(select_if(dat, is.factor),
function(x) {
df = data.frame(table(x))
names(df) = c("x", "y")
return(df)
}
)
This approach allows each list object to be easily indexed and further manipulated if necessary, which can be really handy with data frames containing a lot of features. Use print(freqList) to view all of the frequency tables.
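For example, a single feature's table can be pulled out of the list by name; a quick sketch:
freqList$b # frequency table for the b column
freqList[["c"]] # the same idea with [[ ]]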
First, I am very new to R, so I'm aware that I may be making an obvious mistake. I have searched for an answer, but maybe I'm searching for the wrong thing.
I am trying to apply a function that adds a new column to a dataframe based on the contents of each row. But it looks to me like the row values are not being handled properly by mutate() when using rowwise(). I've created a toy example to demonstrate my problem.
library(dplyr)
x<-c("A","B")
y<-c(1,2)
df<-data.frame(x,y)
Then I have a function to create a new column called z which adds 1 to y if the value of x is "A" and adds 2 to y if the value of x is "B". Note that I have added print(x) to show what is going on.
calculatez <- function(x,y){
print(x)
if(x == "A"){
return (y+1)
}
else{
return(y+2)
}
}
I then try to use mutate:
df %>%
rowwise() %>%
mutate(z = calculatez(x,y))
and I get the following: 2 has been added to both rows (rather than 1 to the first row), and "A" and "B" have been passed into the function as 1 and 2.
[1] 1
[1] 2
Source: local data frame [2 x 3]
Groups:
x y z
1 A 1 3
2 B 2 4
If I remove the rowwise() call, "A" and "B" appear to be passed properly, but clearly I don't get the right result.
df %>%
mutate(z = calculatez(x,y))
[1] A B
Levels: A B
x y z
1 A 1 2
2 B 2 3
Warning message:
In if (x == "A") { :
the condition has length > 1 and only the first element will be used
I can get it to work if I do it without writing my own function, and then I don't get the error message about the length of the condition. So I don't think I properly understand what rowwise() is doing.
df %>%
mutate(z = ifelse(x=="A",y+1,y+2))
x y z
1 A 1 2
2 B 2 4
But I want to be able to use my own function, because in my real application the condition is more complicated and would be difficult to read as lots of nested ifelse calls inside mutate().
I can get around the problem by changing my condition to if(x == 1), but that would make my code difficult to understand.
I don't want to waste your time, so sorry if I'm missing something obvious. Any tips on where I'm going wrong?
You could use rowwise() with do():
df %>%
rowwise() %>%
do(data.frame(., z= calculatez(.$x, .$y)))
gives the output
x y z
#1 A 1 2
#2 B 2 4
Or you could do:
df %>%
group_by(N=row_number()) %>%
mutate(z=calculatez(x,y))%>%
ungroup() %>%
select(-N)
Using a different dataset:
df <- structure(list(x = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), y = c(1, 2, 1, 2, 1)), .Names = c("x",
"y"), row.names = c(NA, -5L), class = "data.frame")
Running the above code gives:
# x y z
#1 A 1 2
#2 A 2 3
#3 B 1 3
#4 B 2 4
#5 B 1 3
If you are using data.table
library(data.table)
setDT(df)[, z := calculatez(x,y), by=seq_len(nrow(df))]
df
# x y z
# 1: A 1 2
# 2: A 2 3
# 3: B 1 3
# 4: B 2 4
# 5: B 1 3
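Another option that keeps the custom function exactly as written is to vectorise the call itself with mapply(); a sketch (as.character() guards against x being a factor, which is what produced the surprising 1 and 2 values above):
df %>%
  mutate(z = mapply(calculatez, as.character(x), y, USE.NAMES = FALSE))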