Applying mutate to multiple columns and rows in dplyr - r

A pretty simple question, but it has me dumbfounded.
I have a table and am trying to round each column to 2 decimal places using mutate_all (or another dplyr function). I know this can be done with certain apply functions, but I prefer the dplyr/tidyverse framework.
DF = data.frame(A = seq(from = 1, to = 2, by = 0.0255),
B = seq(from = 3, to = 4, by = 0.0255))
Rounded.DF = DF%>%
mutate_all(funs(round(digits = 2)))
This does not work however and just gives me a 2 in every column. Thoughts?

You need a "dot" in the round function. The dot is a placeholder for where mutate_all should place each column that you are trying to manipulate.
Rounded.DF = DF%>%
mutate_all(funs(round(., digits = 2)))
To make it more intuitive you can write the exact same thing as a custom function and then reference that function inside the mutate_all:
round_2_dgts <- function(x) {round(x, digits = 2)}
Rounded.DF = DF%>%
mutate_all(funs(round_2_dgts))
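
In dplyr 0.8.0 and later, funs() is deprecated, and from dplyr 1.0.0 onward across() supersedes the mutate_all() family. A minimal sketch of the same rounding with across(), assuming dplyr >= 1.0.0 is installed, might look like this:

```r
library(dplyr)

# Same example data as in the question
DF <- data.frame(A = seq(from = 1, to = 2, by = 0.0255),
                 B = seq(from = 3, to = 4, by = 0.0255))

# across(everything(), ...) applies the lambda to every column;
# .x stands for the current column, like the "dot" in funs()
Rounded.DF <- DF %>%
  mutate(across(everything(), ~ round(.x, digits = 2)))

head(Rounded.DF)
```

The lambda plays the same role as the dot placeholder above: without passing the column explicitly, round() never receives the data.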

Related

Add a Column created Within a Function to a dataframe in R

I have searched and tried multiple previously asked questions that might be similar to my question, but none worked.
I have a dataframe in R called df2, a column called df2$col. I created a function to take the df, the df$col, and two parameters that are names for two new columns I want created and worked on within the function. After the function finishes running, I want a return df with the two new columns included. I get the two columns back indeed, but they are named after the placeholders in the function shell. See below:
df2 = data.frame(col = c(1, 3, 4, 5),
col1 = c(9, 6, 8, 3),
col2 = c(8, 2, 8, 4))
The function I created takes col and does something to it, then returns the transformed col as well as the two newly created columns:
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- 2
hi_perc <- 6
df$df_col_flagH <- as.factor(ifelse(df_col_name<lo_perc, 1, 0))
df$df_col_flagL <- as.factor(ifelse(df_col_name>hi_perc, 1, 0))
df_col_name <- df_col_name + 1.4
df_col_name <- df_col_name * .12
return(df)
}
When I call the function, no_way(df2, col, df$new_col, df$new_col2), instead of getting a df with col, col1, col2, new_col1, new_col2, I get the first three right, but the last two are named after the function's parameters: col, col1, col2, df_col_flagH, df_col_flagL. I want the function to return the df with the new column names I give it when I call it. Please help.
I don't see what your function is trying to do, but this might point you in the right direction:
no_way <- function(df = df2, df_col_name = "col", df_col_flagH = "col1", df_col_flagL = "col2") {
lo_perc <- 2
hi_perc <- 6
df[[df_col_flagH]] <- as.factor(ifelse(df[[df_col_name]] < lo_perc, 1, 0)) # as.factor?
df[[df_col_flagL]] <- as.factor(ifelse(df[[df_col_name]] > hi_perc, 1, 0))
df[[df_col_name]] <- (df[[df_col_name]] + 1.4) * 0.12 # Do in one step
return(df)
}
I needed to call the function with the new column names as strings instead:
no_way(mball, 'TEAM_BATTING_H', 'hi_TBH', 'lo_TBH')
Additionally, I had to use double brackets ([[ ]]) around the target column in my function.
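
For reference, here is a self-contained sketch of the string-based version applied to the df2 from the question. The names "new_col1" and "new_col2" are only illustrative, and the defaults from the answer are dropped so every argument is passed explicitly:

```r
df2 <- data.frame(col  = c(1, 3, 4, 5),
                  col1 = c(9, 6, 8, 3),
                  col2 = c(8, 2, 8, 4))

no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
  lo_perc <- 2
  hi_perc <- 6
  # df[[name]] looks the column up by the string held in `name`;
  # df$name would instead look for a column literally called "name"
  df[[df_col_flagH]] <- as.factor(ifelse(df[[df_col_name]] < lo_perc, 1, 0))
  df[[df_col_flagL]] <- as.factor(ifelse(df[[df_col_name]] > hi_perc, 1, 0))
  df[[df_col_name]] <- (df[[df_col_name]] + 1.4) * 0.12
  df
}

out <- no_way(df2, "col", "new_col1", "new_col2")
names(out)  # "col" "col1" "col2" "new_col1" "new_col2"
```

The key difference from the original attempt is that df$df_col_flagH always creates a column named df_col_flagH, whereas df[[df_col_flagH]] uses the value of the argument as the column name.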

Tidyverse: If_else + str_length + str_pad to mutate 1 column

I have found quite a few threads on each part of the code snippet I am trying to write, but none that combine them the way I am trying to.
I have a dataframe of customer information.
One column is a customer ID (CID); the second column is the customer-specific identifier (CSI).
That means a single customer ID can represent many specific customers from a bigger pool, and the CSI tells me which specific customer from that pool I am looking at.
Data would look like this:
data.frame("CID"=c("1","2","3","4","1","2","3","4"),
"Customer_Pool"=c("Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies",
"Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies"),
"CSI"=c("01","01","01","01","02","02","02","02"),
"Customer_name"=c("Janet","Jane", "Jill", "Jenna", "Joe", "Jim", "Jack", "Jimmy"))
I am trying to combine the CID and CSI numbers. The problem is that I need every CID to be two digits (01 instead of 1, for example) to match the CIDs from 10-99.
Here is what I have been trying:
DF <- DF %>% mutate(CID = if_else(str_length(CID = 1),
str_pad(CID, width = 2, side = "left), CID))
The error I am getting says: error in str_length(CID = 1): unused argument (CID = 1)
How would I correct this?
You have some syntax issues here. Try
DF <- DF %>% mutate(CID = if_else(str_length(CID) == 1,
str_pad(CID, width = 2, side = "left", pad="0"), CID))
When you call str_length(CID = 1), it looks like you are passing a parameter named "CID" to str_length, which it knows nothing about. Rather, you want to take the string length of CID and then compare that to 1 with == to test for equality (not =, which is for parameter names and assignments).
But really the if_else isn't necessary here. If everything has to be 2 digits, then just do
DF <- DF %>% mutate(CID = str_pad(CID, width = 2, side = "left", pad="0"))
str_pad will only pad when needed.
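
A quick check of that behaviour, as a small sketch on a made-up vector (the values in cid are illustrative, not from the question's data):

```r
library(stringr)

cid <- c("1", "2", "10", "99")

# Strings already at least `width` characters long are returned unchanged
padded <- str_pad(cid, width = 2, side = "left", pad = "0")
padded  # "01" "02" "10" "99"
```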
Base R solution:
df$p_key <- with(df, paste(ifelse(nchar(CID) == 1, paste0("0", CID), CID), CSI, sep = "-"))
Tidyverse using Mr Flick's clean solution:
library(tidyverse)
df %>%
mutate(p_key = str_c(str_pad(CID, width = 2, side = "left", pad = "0"), CSI, sep = "-"))

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify to ensure it matches the exact column names and your comparison condition.
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair
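
The key ideas above can be sketched concretely as follows. This is one possible reading of the question, not the answerer's exact code: base R's combn() stands in for gtools::combinations(), merge(..., by = NULL) produces the cross join instead of a dummy column, and the names combos4, combos2, P1, P2 and contains_pair are made up for the illustration.

```r
library(dplyr)

v1 <- LETTERS[1:10]

# All 4-letter combinations (columns V1..V4), binned into 15 groups of 14
combos4 <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
combos4$group <- rep(1:15, each = 14)

# All 2-letter combinations, renamed so they do not clash with V1..V4
combos2 <- as.data.frame(t(combn(v1, 2)), stringsAsFactors = FALSE)
names(combos2) <- c("P1", "P2")

# by = NULL makes merge() return the Cartesian product of the two tables
pair_freq <- merge(combos2, combos4, by = NULL) %>%
  # indicator: both letters of the pair occur in the 4-letter combination
  mutate(contains_pair =
           (P1 == V1 | P1 == V2 | P1 == V3 | P1 == V4) &
           (P2 == V1 | P2 == V2 | P2 == V3 | P2 == V4)) %>%
  # sum the indicator per pair and group to get the frequency table
  group_by(P1, P2, group) %>%
  summarise(freq = sum(contains_pair), .groups = "drop")

head(pair_freq)
```

The result has one row per pair and group, so it can be reshaped into a wide frequency table if that is more convenient.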

How to mutate for loop in dplyr

I want to create multiple lag variables for a column in a data frame for a range of values. The code below successfully does what I want but is not scalable for what I need (hundreds of iterations).
Lake_Lag <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
group_by(StationID,Test) %>%
arrange(StationID,Test,VisitDate) %>%
mutate(lag.Result1 = dplyr::lag(Result, n = 1, default = NA))%>%
mutate(lag.Result5 = dplyr::lag(Result, n = 5, default = NA))%>%
mutate(lag.Result10 = dplyr::lag(Result, n = 10, default = NA))%>%
mutate(lag.Result15 = dplyr::lag(Result, n = 15, default = NA))%>%
mutate(lag.Result20 = dplyr::lag(Result, n = 20, default = NA))
I would like to be able to use a list c(1,5,10,15,20) or a range 1:150 to create lagging variables for my data frame.
Here's an approach that makes use of some 'tidy eval helpers' included in dplyr that come from the rlang package.
The basic idea is to create a new column in mutate() whose name is based on a string supplied by a for-loop.
library(dplyr)
grouped_data <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
group_by(StationID,Test) %>%
arrange(StationID,Test,VisitDate)
for (lag_size in c(1, 5, 10, 15, 20)) {
new_col_name <- paste0("lag_result_", lag_size)
grouped_data <- grouped_data %>%
mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}
The !!sym(new_col_name) := is a dynamic way of writing lag_result_1 =, lag_result_5 =, etc. when using functions like mutate() or summarize() from the dplyr package.
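
Since the Lake Champlain data is not included here, the following sketch runs the same loop on a small made-up data frame (toy, with only StationID, VisitDate and Result) so the output can be inspected. It calls rlang::sym() explicitly; rlang is installed as a dependency of dplyr.

```r
library(dplyr)

# Hypothetical stand-in for the monitoring data
toy <- data.frame(StationID = rep(c("s1", "s2"), each = 4),
                  VisitDate = rep(1:4, times = 2),
                  Result    = c(10, 20, 30, 40, 1, 2, 3, 4))

lagged <- toy %>%
  group_by(StationID) %>%
  arrange(StationID, VisitDate)

for (lag_size in c(1, 2)) {
  new_col_name <- paste0("lag_result_", lag_size)
  # !!sym(...) := turns the string into the name of the new column
  lagged <- lagged %>%
    mutate(!!rlang::sym(new_col_name) := dplyr::lag(Result, n = lag_size))
}

lagged
```

Because the data frame stays grouped inside the loop, each lag is computed within a station and never carries values across from the previous one.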
We can use shift from data.table, which can take multiple values for n. According to ?shift
n - Non-negative integer vector denoting the offset to lead or lag the input by. To create multiple lead/lag vectors, provide multiple values to n
Convert the 'data.frame' to 'data.table' (setDT), order by 'StationID', 'Test', 'VisitDate' in i, group by 'StationID', 'Test', get the lag (the default type of shift is "lag") of 'Result' with n as a vector of values, and assign (:=) the output to a vector of column names (created with paste0)
library(data.table)
i1 <- c(1, 5, 10, 15, 20)
setDT(Lake_Champlain_long.term_monitoring_1992_2016)[order(StationID,
Test, VisitDate), paste0("lag.Result", i1) := shift(Result, n = i1),
by = .(StationID, Test)][]
NOTE: This is a much more efficient solution.
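
A small self-contained sketch of the shift() approach on made-up data follows. The table toy_dt and the vector lags are illustrative names, and setorder() is used here to sort by reference before grouping, which is a slight variation on ordering in i as shown above:

```r
library(data.table)

# Hypothetical stand-in for the monitoring data
toy_dt <- data.table(StationID = rep(c("s1", "s2"), each = 4),
                     VisitDate = rep(1:4, times = 2),
                     Result    = c(10, 20, 30, 40, 1, 2, 3, 4))

lags <- c(1, 2)

# Sort in place, then create one lag column per value in `lags`
setorder(toy_dt, StationID, VisitDate)
toy_dt[, paste0("lag.Result", lags) := shift(Result, n = lags), by = StationID]

toy_dt
```

shift() returns a list with one vector per value of n, which is why a character vector of the same length can sit on the left-hand side of :=.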

dplyr show all rows and columns for small data.frame inside a tbl_df

How do I force dplyr to show all columns and rows of a rather small data.frame? Take the ddf object below, for example:
df = data.frame(a=rnorm(100), b=c(rep('x', 50), rep('y', 50)), c=sample(1:20, 100, replace=T), d=sample(letters,100, replace=T), e=sample(LETTERS,100,replace=T), f=sample("asdasdasdasdfasdfasdfasdfasdfasdfasdfasd asdfasdfsdfsd", 100, replace=T))
ddf= tbl_df(df)
If you still want to use dplyr and print your data frame in full, just run
print.data.frame(ddf)
ddf
Ah, I was getting angry with dplyr and therefore could not see it. The solution is simple: as.data.frame(ddf). That converts the dplyr-backed data.frame to a generic data.frame.
You can use the function print and adjust the n parameter to adjust the number of rows to show.
For example, the following commands will show 20 rows.
print(ddf, n = 20)
You can also use the typical dplyr pipe syntax.
ddf %>% print(n = 20)
If you want to show all rows, you can use n = Inf (infinity).
print(ddf, n = Inf)
ddf %>% print(n = Inf)
From the docs:
You can control the default appearance with options:
options(tibble.print_max = n, tibble.print_min = m): if there are more
than n rows, print only the first m rows. Use options(tibble.print_max
= Inf) to always show all rows.
options(tibble.width = Inf) will always print all columns, regardless
of the width of the screen.
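
Putting the per-call and global options together, a minimal sketch might look like the following. The small ddf is made up for the illustration, and as_tibble() from the tibble package is used in place of tbl_df(), which is an assumption about the installed versions:

```r
library(tibble)

# Hypothetical small table: 30 rows, 2 columns
ddf <- as_tibble(data.frame(a = 1:30, b = rep(c("x", "y"), 15)))

# One-off: show every row and every column for this call only
print(ddf, n = Inf, width = Inf)

# Session-wide: change the defaults, keeping the old values to restore later
old_opts <- options(tibble.print_max = Inf, tibble.width = Inf)
ddf
options(old_opts)
```

Saving the return value of options() makes it easy to go back to the compact default printing once the full view is no longer needed.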
