How to write a function to rename multiple columns at once? - r

df1 <- df %>%
rename(newcol1 = oldcol1) %>%
rename(newcol2 = oldcol2) %>%
rename(newcol3 = oldcol3) %>%
rename(newcol4 = oldcol4) %>%
rename(newcol5 = oldcol5)
I am trying to write a function (something I only just learned how to do) that does the same thing as above.
renaming = function(df, oldcol, newcol) {
rename(df, newcol = oldcol)
}
but I am not sure how to handle multiple columns.
Any help would be much appreciated!

Using base R
names(df) <- c("newname1", "newname2", "newname3") # for all varnames
names(df)[c(1,3,4)] <- c("newname1", "newname3", "newname4") # for varnames 1,3,4
names(df)[names(df) == "oldname"] <- "newname" # for one varname
Using data.table
setnames(dt, old=c("oldname1", "oldname2"), new=c("newname1", "newname2"))
Using dplyr/tidyverse
df %>% rename(newname1 = oldname1, newname2 = oldname2)
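To wrap this in a function, as the question asks, one possible sketch in base R is shown below. The helper name rename_cols and the toy data frame are made up for illustration; the function takes two parallel character vectors of old and new names.

```r
# Hypothetical helper: rename several columns at once.
# `old` and `new` are character vectors of equal length.
rename_cols <- function(df, old, new) {
  stopifnot(length(old) == length(new), all(old %in% names(df)))
  # match() finds the position of each old name among the current names
  names(df)[match(old, names(df))] <- new
  df
}

# Toy data, invented for this example
df <- data.frame(oldcol1 = 1:2, oldcol2 = 3:4, oldcol3 = 5:6)
df1 <- rename_cols(df, c("oldcol1", "oldcol3"), c("newcol1", "newcol3"))
names(df1)
```

Columns that are not listed in old keep their names, and the stopifnot() call fails early if an old name is missing from the data frame.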

You can use set_names from the tidyverse package purrr.
Reproducible example:
> df <- iris
> df1 <- df %>%
purrr::set_names(c("d","x","y","z","a"))
> df1
d x y z a
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

Related

Turn a data frame column into vector with names as row names

I have an output which looks like the following code:
data.frame(H = c(1.5,4.5,5,8)) %>% `rownames<-`(c("a","b","c","d"))
H
a 1.5
b 4.5
c 5.0
d 8.0
Ideally, using dplyr, I would like to convert it to a vector like this:
a b c d
1.5 4.5 5.0 8.0
Is there any way I can do this without defining any new variables and using only the pipe operator?
Using unlist will not result in the desired outcome.
data.frame(H = c(1.5,4.5,5,8)) %>% `rownames<-`(c("a","b","c","d")) %>% unlist()
H1 H2 H3 H4
1.5 4.5 5.0 8.0
With rownames_to_column + deframe from tibble:
library(tibble)
df %>%
rownames_to_column() %>%
deframe()
# a b c d
#1.5 4.5 5.0 8.0
Another option with pull
library(dplyr)
library(tibble)
df %>%
rownames_to_column() %>%
pull(H, rowname)
Or with the exposition pipe %$%:
library(magrittr)
df %$%
set_names(H, rownames(.))
For the sake of completeness, the base R one-liner:
setNames(df$H, rownames(df))
That can also be piped, with magrittr's %$%:
df %$%
setNames(H, rownames(.))
Data:
df <- data.frame(H = c(1.5,4.5,5,8)) %>%
`rownames<-`(c("a","b","c","d"))
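If this conversion is needed in several places, the base R one-liner can be wrapped in a small function. This is only a sketch; the helper name column_to_named_vector is hypothetical and not part of any package.

```r
# Hypothetical helper: pull one column out of a data frame as a vector
# whose names are the data frame's row names.
column_to_named_vector <- function(df, col) {
  setNames(df[[col]], rownames(df))
}

df <- data.frame(H = c(1.5, 4.5, 5, 8), row.names = c("a", "b", "c", "d"))
h <- column_to_named_vector(df, "H")
h

# For contrast, unlist() builds names from the column name, not the row names
names(unlist(df))
```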

How to dynamically create variables and combine it to the dataframe in r?

I am running kmeans for several numbers of clusters and then trying to combine the cluster results with the original data frame.
I am using the code from the post https://stats.stackexchange.com/questions/10838/produce-a-list-of-variable-name-in-a-for-loop-then-assign-values-to-the to create variables dynamically, modified for my needs.
Original code from that post:
x <- as.list(rnorm(10000))
names(x) <- paste("a", 1:length(x), sep = "")
list2env(x , envir = .GlobalEnv)
Now applying this on iris data:
library(tidyverse)
library(ggthemes)
library(factoextra)
This works fine and creates three lists of cluster results:
# running for 1 to 3 clusters
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# iris_df <- cbind(iris, cluster_res_list)
} )
Issue: when I try to combine them with the original dataset, I get an error: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"kmeans"’ to a data.frame
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# to combine each cluster result to original df
iris_df <- cbind(iris, cluster_res_list)
} )
The output from kmeans can be viewed as a matrix using the fitted function. The row names of that matrix identify the cluster each observation was assigned to. If you want to add a column to the original data frame that identifies the cluster assignment, then something like this would work.
Using 3 clusters as an example:
cluster_num <- 3
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris) %>%
tail()
iris_clus Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 2 6.7 3.3 5.7 2.5 virginica
146 2 6.7 3.0 5.2 2.3 virginica
147 1 6.3 2.5 5.0 1.9 virginica
148 2 6.5 3.0 5.2 2.0 virginica
149 2 6.2 3.4 5.4 2.3 virginica
150 1 5.9 3.0 5.1 1.8 virginica
Inserting this into the lapply from your example
lapply(1:3, function(cluster_num) {
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris)
})
Here's one way to combine it all into one data set, with one column per model:
clusters <- Reduce(cbind, lapply(1:3, function(cluster_num) {
result <- iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .)
names(result) <- paste("iris_clus", cluster_num, sep = "_")
return(result)
}))
cbind(iris, clusters)
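As an alternative sketch, the cluster assignments can also be read directly from the cluster component of the object that kmeans returns, without going through fitted(). The seed value and the name iris_clustered below are arbitrary choices for illustration, and the labels come back as integers rather than as the character row names used above.

```r
set.seed(42)             # kmeans uses random starts; fix the seed for repeatability
features <- iris[, 1:4]  # numeric columns only (drops Species)

# One column of cluster labels for each number of clusters
clusters <- sapply(1:3, function(k) {
  kmeans(features, centers = k, nstart = 25)$cluster
})
colnames(clusters) <- paste("iris_clus", 1:3, sep = "_")

iris_clustered <- cbind(iris, clusters)
head(iris_clustered)
```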

Run the same codes with data and variable names changed in R

I need to run very similar code for 3 different datasets. My current code looks like this:
## data a
a_dat2 <- merge(a_dat, zip, by = "zip", all.x = T)
a_dat2 <- a_dat2 %>%
group_by(zip) %>%
summarize(dist_a_min = min(dist))
## data b
b_dat2 <- merge(b_dat, zip, by = "zip", all.x = T)
b_dat2 <- b_dat2 %>%
group_by(zip) %>%
summarize(dist_b_min = min(dist))
## data c
c_dat2 <- merge(c_dat, zip, by = "zip", all.x = T)
c_dat2 <- c_dat2 %>%
group_by(zip) %>%
summarize(dist_c_min = min(dist))
The code for the 3 datasets is the same except that the name of the data varies: a_dat, b_dat, c_dat. The name of the summary variable varies too: dist_a_min, dist_b_min, dist_c_min. What function or loop can I use to shorten the code so that I don't need to copy and paste it for each dataset?
One option is to place the datasets in a list with mget, loop through the list with imap, join each one (see ?left_join) with the 'zip' dataset, group by 'zip', and take the min of 'dist', building the column name from the first letter of each object's name:
library(tidyverse)
mget(ls(pattern = "_dat2$")) %>%
imap(~ left_join(.x, zip, by = 'zip') %>%
group_by(zip) %>%
summarise(!! str_c('dist_', substr(.y, 1, 1), '_min') := min(dist)))
Or another option is to create a function for repeated tasks
joinSumm <- function(dat, groupName, colName, data2) {
groupName <- enquo(groupName)
colName <- enquo(colName)
nm1 <- str_c('dist_', str_sub(rlang::as_name(enquo(dat)), 1, 1), '_min')
dat %>%
left_join(data2, by = rlang::as_name(groupName)) %>%
group_by(!! groupName) %>%
summarise((!! nm1) := min(!! colName))
}
joinSumm(a_dat2, zip, dist, zip)
joinSumm(b_dat2, zip, dist, zip)
A reproducible example with built-in dataset iris (without the join part)
list(a_dat = iris, b_dat = iris, c_dat = iris) %>%
imap(~ .x %>%
group_by(Species) %>%
summarise(!! str_c('dist_', substr(.y, 1, 1), '_min') := min(Sepal.Length)))
#$a_dat
# A tibble: 3 x 2
# Species dist_a_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
#$b_dat
# A tibble: 3 x 2
# Species dist_b_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
#$c_dat
# A tibble: 3 x 2
# Species dist_c_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
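For comparison, the same loop-over-a-named-list idea can be sketched in base R with aggregate and Map. The helper name min_by_group and the toy data are invented for this example, and the join with the zip dataset is left out, as in the iris example above.

```r
# Hypothetical helper: minimum of `value` per `group`, with the output
# column named after a short dataset label (e.g. "a" -> dist_a_min).
min_by_group <- function(dat, label, group = "zip", value = "dist") {
  out <- aggregate(dat[[value]], by = list(dat[[group]]), FUN = min)
  names(out) <- c(group, paste0("dist_", label, "_min"))
  out
}

# Toy data, made up for illustration
datasets <- list(
  a = data.frame(zip = c("10001", "10001", "20002"), dist = c(3.5, 1.2, 7.0)),
  b = data.frame(zip = c("10001", "20002", "20002"), dist = c(2.0, 9.1, 4.4))
)

# Map() passes each dataset together with its name and keeps the list names
results <- Map(min_by_group, datasets, names(datasets))
results$a
```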

Applying function to each group and column of R dataframe

I need to apply this function
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
to a dataset that looks like this:
Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2
... ... ... ...
I have to apply replace_outliers to each category separately and to each column. How can I achieve that?
You can use the package dplyr. Use group_by to do it for each Category and mutate_if to apply the function to all numeric columns.
library(dplyr)
df <- read.table(header = TRUE, text =
" Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2")
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
df %>% group_by(Category) %>%
mutate_if(is.numeric, replace_outliers)
Use mutate_all within a group_by:
library(dplyr)
df %>%
group_by(Category) %>%
mutate_all(replace_outliers) %>%
ungroup()
Consider base R with by (to split by category), sapply (to call function), and do.call to bind all groups back together:
df_list <- by(df, df$Category, function(sub) {
sub[-1] <- sapply(sub[-1], replace_outliers)
sub
})
final_df <- do.call(rbind, unname(df_list))
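Another base R possibility, sketched here with made-up data, is ave, which applies a function within groups and returns the results in the original row order, so nothing needs to be bound back together. The function replace_outliers is the one from the question; the toy values are chosen so that each numeric column has one obvious outlier.

```r
# replace_outliers as defined in the question
replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

# Toy data, invented for illustration
df <- data.frame(
  Category = rep(c("a", "b"), each = 5),
  a = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  b = c(5, 6, 7, 8, 9, 1, 2, 3, 4, 50)
)

# Apply the function to every numeric column, separately within each Category
num_cols <- names(df)[sapply(df, is.numeric)]
df[num_cols] <- lapply(df[num_cols], function(col) {
  ave(col, df$Category, FUN = replace_outliers)
})
df
```

With this data, the 100 in column a and the 50 in column b are each replaced by the median of their own category.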

How to do rowSums over many columns in ``dplyr`` or ``tidyr``?

For example, is it possible to do this in dplyr:
new_name <- "Sepal.Sum"
col_grep <- "Sepal"
iris <- cbind(iris, tmp_name = rowSums(iris[,grep(col_grep, names(iris))]))
names(iris)[names(iris) == "tmp_name"] <- new_name
This adds up all the columns that contain "Sepal" in the name and creates a new variable named "Sepal.Sum".
Importantly, the solution needs to rely on a grep (or dplyr:::matches, dplyr:::one_of, etc.) when selecting the columns for the rowSums function, and have the name of the new column be dynamic.
My application has many new columns being created in a loop, so an even better solution would use mutate_each_ to generate many of these new columns.
Here is a dplyr solution that uses the contains() helper inside select.
iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>% rowSums()) -> iris2
head(iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
and here the benchmarks:
Unit: milliseconds
expr
iris2 <- iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>% rowSums())
min lq mean median uq max neval
1.816496 1.86304 2.132217 1.928748 2.509996 5.252626 100
I didn't want to post this as a comment because it's too long.
There is not much in it in terms of timing for the solutions that have been proposed (except the data.table solution, which appears slower), and none stands out as clearly more elegant.
library(dplyr)
library(data.table)
new_name <- "Sepal.Sum"
col_grep <- "Sepal"
# Make iris bigger
data(iris)
for(i in 1:18){
iris <- bind_rows(iris, iris)
}
iris1 <- iris
system.time({
# Base solution
iris1 <- cbind(iris1, tmp_name = rowSums(iris1[,grep(col_grep, names(iris1))]))
names(iris1)[names(iris1) == "tmp_name"] <- new_name
})
# 1.26
system.time({
# less elegant dplyr solution
iris %>% select(matches(col_grep)) %>% rowSums() %>%
data.frame(.) %>% bind_cols(iris, .) %>% setNames(., c(names(iris), new_name))
})
# 1.14
system.time({
# bit more elegant dplyr solution
iris %>% mutate(tmp_name = rowSums(.[] %>% select(matches(col_grep)))) %>%
rename_(.dots = setNames("tmp_name", new_name))
})
# 1.12
data(iris)
# Make iris bigger
for(i in 1:18){
iris <- rbindlist(list(iris, iris))
}
system.time({
setDT(iris)[, tmp_name := rowSums(.SD[,grep(col_grep, names(iris)), with = FALSE])]
setnames(iris, "tmp_name", new_name)
})
# 2.39
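With more recent versions of dplyr, the same result can probably be written more compactly. The sketch below assumes dplyr 1.0.0 or later, where across() is available, and uses !! with := to set the new column name dynamically. It was not part of the benchmark above, so no timing is claimed for it.

```r
library(dplyr)

new_name <- "Sepal.Sum"
col_grep <- "Sepal"

data(iris)  # restore the original 150-row dataset
iris2 <- iris %>%
  # across() selects the matching columns; rowSums() adds them up per row
  mutate(!!new_name := rowSums(across(matches(col_grep))))

head(iris2)
```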
