Applying function to each group and column of R dataframe

I need to apply this function
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
to a dataset that looks like this:
Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2
... ... ... ...
I have to apply replace_outliers to each category separately, and to each column within it. How can I achieve that?

You can use the dplyr package. Use group_by so that the work is done separately for each Category, and mutate_if to apply the function to all numeric columns:
library(dplyr)
df <- read.table(header = TRUE, text =
" Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2")
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
df %>% group_by(Category) %>%
mutate_if(is.numeric, replace_outliers)
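In dplyr 1.0.0 and later, the scoped verbs such as mutate_if are superseded by across(). A minimal sketch of the same grouped call in that style follows. The data frame below is made up for illustration (it is not the question's data), so that one group contains an obvious outlier.

```r
library(dplyr)

# Illustrative data (an assumption, not from the question):
# group "x" has an outlier (100) in column a
df <- data.frame(
  Category = rep(c("x", "y"), each = 5),
  a = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

# Same function as in the question
replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

# Apply the function to every numeric column, separately within each Category
result <- df %>%
  group_by(Category) %>%
  mutate(across(where(is.numeric), replace_outliers)) %>%
  ungroup()
```

In this sketch the value 100 in group "x" is replaced by that group's median, and group "y" is left unchanged.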

Use mutate_all within a group_by:
library(dplyr)
DF %>%
group_by(Category) %>%
mutate_all(replace_outliers) %>%
ungroup()

Consider base R with by (to split by category), sapply (to call function), and do.call to bind all groups back together:
df_list <- by(data, data$Category, function(sub) {  # column name is case-sensitive: Category, as in the question
sub[-1] <- sapply(sub[-1], replace_outliers)
sub
})
final_df <- do.call(rbind, unname(df_list))
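Another base R possibility, sketched here under the assumption that the original row order should be preserved, is ave(), which applies a function within groups and puts the results back in place. The data frame and the names num_cols and cleaned are made up for illustration.

```r
# Same function as in the question
replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

# Illustrative data (an assumption, not from the question)
df <- data.frame(
  Category = rep(c("x", "y"), each = 5),
  a = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  b = c(5, 6, 7, 8, 9, 1, 1, 1, 1, 50)
)

# Names of the numeric columns to process
num_cols <- names(df)[sapply(df, is.numeric)]

# For each numeric column, apply replace_outliers within each Category;
# ave() returns a vector aligned with the original rows
cleaned <- df
cleaned[num_cols] <- lapply(df[num_cols], function(col) {
  ave(col, df$Category, FUN = replace_outliers)
})
```

Unlike the by/rbind route, this never splits the data frame, so no re-binding step is needed.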

Related

Iterate sequentially over two lists in R

I have two data frames that look something like this:
library(tidyverse)
iris <- iris%>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
My aim is to reduce the values of the variables in iris that are above the maximum values of the corresponding variable in iris2, to match the maximum value in iris2.
I have written a function that does this.
max(iris$Sepal.Length)
[1] 9.9
max(iris2$Sepal_Length)
[1] 7.9
# I want every value of iris that is >= the max value of iris2 to be set to the max value of iris2.
# my function:
fixmax<- function(data,data2,var1,var2) {
data<- data %>%
mutate("{var1}" := ifelse(get(var1)>=max(data2[[var2]],na.rm = T),
max(data2[[var2]],na.rm = T),get(var1)))
return(data)
}
# apply my function to a variable
tst_iris <- fixmax(iris,iris2,"Sepal.Length","Sepal_Length")
max(tst_iris$Sepal.Length)
7.9 # it works!
The challenge I face is that I would like to iterate my function sequentially over two lists of variables, i.e. Sepal.Length with Sepal_Length, Sepal.Width with Sepal_Width, etc.
Does anyone know how I can do this?
I tried using Map, but I am doing something wrong.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
My goal is to obtain a df (final_iris) where every variable has been adjusted using the criteria specified by fixmax.
I know I can do this by running my function on every variable like so.
final_iris <- iris
final_iris <- fixmax(final_iris,iris2,"Sepal.Length","Sepal_Length")
final_iris <- fixmax(final_iris,iris2,"Sepal.Width","Sepal_Width")
final_iris <- fixmax(final_iris,iris2,"Petal.Length","Petal_Length")
final_iris <- fixmax(final_iris,iris2,"Petal.Width","Petal_Width")
But in the real data, I have to run this operation tens of times and I would like to be able to loop my function sequentially.
Does anyone know how I loop my fixmax over lst1 and lst2 sequentially?

Rather than explicitly iterating over the different datasets and columns by name, you can take advantage of the vectorization built into R. If the data frames have the same column ordering, a function mapped over both data frames with mapply or purrr::map2 will iterate column by column without the need to specify column names.
Given two input data frames (df_small and df_big), the steps are:
1. Calculate the max of each column in df_small to create df_small_max.
2. Apply the pmin function to each column of df_big and the matching value of df_small_max using mapply (or purrr::map2_dfc if you prefer tidyverse mapping).
#set up fake data
df_small <- iris[,1:4]
df_big <- df_small + 2
# find max of each col in df_small
df_small_max <- sapply(df_small, max)
# replace values of df_big which are larger than df_small_max
df_big_fixed <- mapply(pmin, df_big, df_small_max)
# sanity check -- Note the change in Sepal.Width
df_small_max
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 7.9 4.4 6.9 2.5
head(df_big, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.1 5.5 3.4 2.2
#> 2 6.9 5.0 3.4 2.2
#> 3 6.7 5.2 3.3 2.2
head(df_big_fixed, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 7.1 4.4 3.4 2.2
#> [2,] 6.9 4.4 3.4 2.2
#> [3,] 6.7 4.4 3.3 2.2
Created on 2021-07-31 by the reprex package (v2.0.0)
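As the [1,] row labels in the output show, mapply simplifies the result to a matrix. If a data frame is wanted instead, one possible variation (a sketch, not part of the original answer) is to use Map and assign the resulting list back into the columns:

```r
# Same fake data as above
df_small <- iris[, 1:4]
df_big <- df_small + 2

# Max of each column in df_small
df_small_max <- sapply(df_small, max)

# Map returns a list with one capped column per input column;
# assigning into df_big_fixed[] keeps the data frame structure
df_big_fixed <- df_big
df_big_fixed[] <- Map(pmin, df_big, df_small_max)
```

Here df_big_fixed remains a data frame with the original column names, which is usually more convenient for further processing than a matrix.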

It's likely that your issue is related to the fact that dataframes are themselves lists. Map() expects the non-function arguments to be lists of the same length. Any arguments that are shorter than the longest list are "recycled" to match its length.
Currently, you have:
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
This is actually equivalent to:
final_iris<- Map(fixmax,
list(iris$Sepal.Length,
iris$Sepal.Width,
iris$Petal.Length,
iris$Petal.Width,
iris$Species),
list(iris2$Sepal_Length,
iris2$Sepal_Width,
iris2$Petal_Length,
iris2$Petal_Width,
iris2$Species),
lst1,
lst2)
(To understand why, you must remember that dataframes like iris and iris2 are, technically, under the hood, lists of [atomic] vectors.)
I suspect that you want iris and iris2 to be supplied whole to each call to fixmax(). For Map() to recycle these two dataframes, they need to be supplied as single-element lists, like so:
final_iris<- Map(fixmax, list(iris), list(iris2),lst1,lst2)
To combine a list of dataframes into a single dataframe do
do.call(rbind, final_iris)
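Note that each call in the Map() version starts from the original dataframe, so every element of the resulting list has only one column adjusted, and rbind stacks those copies on top of each other. One way to apply the adjustments cumulatively instead is to thread the dataframe through Reduce(). The sketch below uses fixmax_base, a hypothetical base R stand-in for the question's fixmax() with the same arguments, and the names big and small (also assumptions) so that iris itself is not overwritten.

```r
# Hypothetical base R stand-in for the question's fixmax(), same arguments
fixmax_base <- function(data, data2, var1, var2) {
  data[[var1]] <- pmin(data[[var1]], max(data2[[var2]], na.rm = TRUE))
  data
}

# Rebuild the question's two dataframes under new names
big <- iris
big[1:4] <- big[1:4] + 2
small <- iris
names(small) <- sub(".", "_", names(small), fixed = TRUE)

lst1 <- names(big)[1:4]
lst2 <- names(small)[1:4]

# Start from `big` and apply fixmax_base once per column pair,
# passing the result of each step on to the next
final <- Reduce(
  function(d, i) fixmax_base(d, small, lst1[i], lst2[i]),
  seq_along(lst1),
  big
)
```

The result is a single dataframe with the same number of rows as the input, in which all four numeric columns have been capped.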

Here is a mostly base way. I also renamed the variables because I had some trouble replicating since originally the approach would save over the iris object.
The approach is that instead of mutating a data.frame object, we instead only return the vector of the expected values from our modified function. Then, we re-assign those values back to our original data.frame.
fixmax2 = function(x, y) {
max_y = max(y, na.rm = TRUE)
ifelse(x >= max_y, max_y, x)  # keep x itself when it is below the cap
}
cols = which(sapply(df_plus, is.numeric))
df_plus[cols] = Map(fixmax2, df_plus[cols], df_iris[cols])
df_plus
Raw data:
library(dplyr)
df_plus = iris %>% mutate_at((1:4), ~. + 2) ## let's not save over iris
df_iris = iris
names(df_iris)<-sub(".", "_", names(df_iris), fixed = TRUE)

Is this what you're expecting?
my_a <- iris %>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
my_var <- which(my_a$Sepal.Length >= max(iris2$Sepal_Length) & my_a$Sepal.Width >= max(iris2$Sepal_Width))
if (length(my_var)) {
my_a <- my_a[my_var,]
}

Your function seems convoluted and hard to read at first glance. We can tidy it up so that it returns min(x, max_val) for each value in a column, using a short helper function:
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
Finally, we want to apply this automatically and sequentially using the two dataframes. We will use a simple for loop. Code to set up the problem is attached.
#libraries
library(tidyverse)
#set up fake data
iris_big <- iris%>% mutate_at((1:4),~.+2)
iris_small <- iris
names(iris_small)<- sub(".", "_", names(iris_small), fixed = TRUE)
#check which is the bigger one and the smaller
max(iris_big$Sepal.Length) #bigger
max(iris_small$Sepal_Length) #smaller
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
#apply it to get a final result
iris_final <- iris_big
# iterate over columns, assuming same positions
# you can edit the 1:ncol(iris_final) to only take the columns you want
for (i in 1:ncol(iris_final)) {
#check numeric
if (is.numeric(iris_final[,i])) {
#applies the function - notice we call iris_final and iris_small
iris_final[,i] <- sapply(iris_final[,i],
adjust_max,
max_val = max(iris_small[,i]))
}
}
#check answer is correct
apply(iris_final[,1:4], 2, max)
apply(iris_small[,1:4], 2, max)
tail(iris_final)

For a tidyverse approach you can use transmute instead of mutate. transmute returns only the one new column in each iteration, whereas mutate would return all the columns every time.
Apart from that, to make it more tidyverse-friendly I am using .data instead of get, and pmin instead of the more complicated ifelse solution.
library(dplyr)
library(purrr)
fixmax<- function(data,data2,var1,var2) {
data<- data %>% transmute("{var1}" := pmin(.data[[var1]], max(data2[[var2]])))
return(data)
}
To apply the function to each pair of columns you can use map2_dfc which will also combine the results in one dataframe.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
Compare the max values of two dataframes before applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 9.9 6.4 8.9 4.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
Apply the function -
iris[lst1] <- map2_dfc(lst1, lst2, ~fixmax(iris, iris2, .x, .y))
Compare the max values of two dataframes after applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 7.9 4.4 6.9 2.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5

You should consider using column indices; a complete (not including the data-frame construction) base R solution could look like:
# Resolve the indices of the numeric vectors in
# iris: num_cols => integer vector
num_cols <- which(
vapply(
iris,
is.numeric,
logical(1)
),
arr.ind = TRUE
)
# Map the pmin function over iris to select the
# minimum of the vector element in iris and the
# maximum values of that vector in iris2:
# iris => data.frame
iris[,num_cols] <- Map(function(i){
pmin(
iris[,i],
max(
iris2[,i],
na.rm = TRUE
)
)
},
num_cols
)

You can do this by creating a matrix of the max value repeated in each column and use pmin to take the minimum values between the max values in iris2 and the values in the other dataframe. I created a new fixmax function which only takes the two dataframes as arguments.
Preparing the data
library(tidyverse)
initial <- iris %>% mutate_at(1:4, ~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
print(max(initial$Sepal.Length))
# [1] 9.9
print(max(iris2$Sepal_Length))
# [1] 7.9
Creating the function
fixmax <- function(df, dfmax){
colids <- which(unlist(lapply(dfmax, is.numeric)))
dfmax <- apply(dfmax[, colids], 2, max) %>%
matrix(nrow=nrow(dfmax), ncol=length(colids), byrow=TRUE) %>%
as.data.frame()
df[, colids] <- pmin(df[,colids], dfmax)
return(df)
}
Testing the function
newiris <- fixmax(initial, iris2)
print(max(newiris$Sepal.Length))
# [1] 7.9
assertthat::assert_that(!identical(newiris, iris2))
# [1] TRUE
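imax = apply(iris2[, 1:4], 2, max) %>%
matrix(nrow=nrow(iris2), ncol=4, byrow=TRUE) %>%
as.data.frame()
# every value is either unchanged or capped at the column maximum of iris2
# (element-wise | is needed here; || accepts only length-one operands)
assertthat::assert_that(all((initial[, 1:4] == newiris[, 1:4]) | (imax == newiris[, 1:4])))
# [1] TRUE
assertthat::assert_that(all(newiris[, 1:4] <= imax))
# [1] TRUE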
print(head(newiris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.1 4.4 3.4 2.2 setosa
# 2 6.9 4.4 3.4 2.2 setosa
# 3 6.7 4.4 3.3 2.2 setosa
# 4 6.6 4.4 3.5 2.2 setosa
# 5 7.0 4.4 3.4 2.2 setosa
# 6 7.4 4.4 3.7 2.4 setosa

How to dynamically create variables and combine it to the dataframe in r?

I am running kmeans for several numbers of clusters and then trying to combine the cluster results with the original dataframe.
From the post https://stats.stackexchange.com/questions/10838/produce-a-list-of-variable-name-in-a-for-loop-then-assign-values-to-the I am using the code below to create variables dynamically, modified to fit my needs.
Original code from that post:
x <- as.list(rnorm(10000))
names(x) <- paste("a", 1:length(x), sep = "")
list2env(x , envir = .GlobalEnv)
Now applying this on iris data:
library(tidyverse)
library(ggthemes)
library(factoextra)
This works fine for creating 3 lists of cluster results:
# running for 1 to 3 clusters
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# iris_df <- cbind(iris, cluster_res_list)
} )
Issue: When I try to combine them with the original dataset I am getting an error: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"kmeans"’ to a data.frame
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# to combine each cluster result to original df
iris_df <- cbind(iris, cluster_res_list)
} )

The output from kmeans can be viewed as a matrix using the fitted function. The row names of the matrix identify the clusters. If you want to add a column to the original data frame that identifies the cluster assignment, then something like this would work.
Using 3 clusters as an example:
cluster_num <- 3
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris) %>%
tail()
iris_clus Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 2 6.7 3.3 5.7 2.5 virginica
146 2 6.7 3.0 5.2 2.3 virginica
147 1 6.3 2.5 5.0 1.9 virginica
148 2 6.5 3.0 5.2 2.0 virginica
149 2 6.2 3.4 5.4 2.3 virginica
150 1 5.9 3.0 5.1 1.8 virginica
Inserting this into the lapply from your example
lapply(1:3, function(cluster_num) {
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris)
})
Here's one way to combine it all into one data set, with one column per model:
clusters <- Reduce(cbind, lapply(1:3, function(cluster_num) {
result <- iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .)
names(result) <- paste("iris_clus", cluster_num, sep = "_")
return(result)
}))
cbind(iris, clusters)
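The object returned by kmeans also stores the assignments directly in its cluster component, so a shorter variation is possible. The following is a sketch in base R only; the names features, cluster_cols and iris_clustered are made up for illustration, and set.seed is added only to make the run repeatable.

```r
set.seed(42)  # kmeans uses random starts; fix the seed for repeatability

# Numeric columns only (Species dropped)
features <- iris[, 1:4]

# One vector of cluster assignments per number of clusters (1 to 3)
cluster_cols <- lapply(1:3, function(cluster_num) {
  kmeans(features, centers = cluster_num, nstart = 25)$cluster
})
names(cluster_cols) <- paste("iris_clus", 1:3, sep = "_")

# Bind the three assignment columns to the original data
iris_clustered <- cbind(iris, as.data.frame(cluster_cols))
```

This yields the columns iris_clus_1, iris_clus_2 and iris_clus_3 as integers rather than as the character row names returned by fitted().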

Function in tidyverse

I want to write a tidyverse-style pipeline that uses intermediate functions.
I have a structure like this:
temp1 = sapply(df, function(x) .....)
temp2 = sapply(temp1, function(x) .......... )
temp3 = sapply(df, function(x) ..........)
temp = data.frame(temp2/temp3)
And I want to get something like this:
sapply(df, function(x) .......) %>% sapply(df, function(x) ....... )
%>% ......
Reproducible example:
df = data.frame(a = c(1,2,3), b = c(1,2,3))
temp1 = sapply(df, function(x) x*3)
temp2 = sapply(temp1, function(x) x+4 )
temp3 = sapply(df, function(x) x/4)
temp = data.frame(temp2/temp3)

Assuming you have more complicated functions to perform on every column than the ones shown, you could use purrr functions like:
library(purrr)
map2_df(map(df, ~.x * 3 + 4), map(df, ~.x/4), `/`)
# a b
# <dbl> <dbl>
#1 28 28
#2 20 20
#3 17.3 17.3

To the best of my knowledge, the pipe operator does not remember the first block of the chain, only the previous one, so you have to use an intermediate step.
However, you can simplify the first part of your code to a pipeline:
temp1 = df %>% sapply(function(x) x*3) %>% sapply(function(x) x+4)
temp = temp1/sapply(df, function(x) x/4)

You can use brackets to wrap a whole pipe chain and use it as a data frame.
(df %>% sapply(., function(x) x*3) %>% sapply(., function(x) x+4 )) /
(df %>% sapply(., function(x) x/4) )
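Because every step in the reproducible example works column by column, the intermediate results can also be folded into a single function applied once per column. This is a base R sketch under that assumption; it would not apply if a step needed values from other columns.

```r
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))

# For each column: (x * 3 + 4) divided by (x / 4), computed in one pass
temp <- as.data.frame(lapply(df, function(x) (x * 3 + 4) / (x / 4)))
```

For the example data this gives 28, 20 and about 17.3 in both columns, matching the purrr output shown earlier.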

Which field is different using dplyr?

I have the following data
df <- data.frame(val1=c(1.2,0.5,3.8,2.5,7.4),
val2=c(1.2,2.5,3.8,2.5,2.4),
val3=c(1.2,2.5,3.6,2.5,7.4),
val4=c(1.2,2.5,3.8,2.5,4.4),
val5=c(1.2,2.5,3.8,2.9,7.4))
I'd like to find which field differs from the rest. I expect a result like the one below to be added to the data frame:
cbind(df,results = c("all_equal", "val1","val3","val5","morethan1"))
Is there an easy way to do this? I have an extensive loop that produces this result, which I won't post here. I'm looking for a quick solution that I may have overlooked (maybe using dplyr).

First, define a function to calculate the mode. I used the function found here: https://stackoverflow.com/a/8189441/7669809
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
After that, we can use the following code to get the desired output.
apply(df, 1, function(x){
x_mode <- Modes(x)
if (all(x == x_mode)){
return("all_equal")
} else if (sum(x != x_mode) > 1){
return("morethan1")
} else {
ind <- which(x != x_mode)
return(paste0("val", ind))
}
})
# [1] "all_equal" "val1" "val3" "val5" "morethan1"
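The code above builds the label from the column position and a hard-coded "val" prefix. A variation that reports the actual column name is sketched below; label_row is a made-up helper name, and taking only the first mode when several values tie is an assumption of this sketch, not something the question specifies.

```r
df <- data.frame(val1 = c(1.2, 0.5, 3.8, 2.5, 7.4),
                 val2 = c(1.2, 2.5, 3.8, 2.5, 2.4),
                 val3 = c(1.2, 2.5, 3.6, 2.5, 7.4),
                 val4 = c(1.2, 2.5, 3.8, 2.5, 4.4),
                 val5 = c(1.2, 2.5, 3.8, 2.9, 7.4))

# Mode function from the answer above
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# Hypothetical helper: label one row using the column names it carries
label_row <- function(x) {
  x_mode <- Modes(x)[1]          # assumption: on ties, use the first mode
  odd <- names(x)[x != x_mode]   # names of the columns that differ
  if (length(odd) == 0) {
    "all_equal"
  } else if (length(odd) > 1) {
    "morethan1"
  } else {
    odd
  }
}

# apply() passes each row as a vector named by the column names
results <- apply(df, 1, label_row)
df$results <- as.character(results)
```

Since apply() hands each row to the function as a named vector, the same code works when the columns are not called val1 to val5.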

Here is one option with tidyverse: reshape the data into 'long' format, apply the conditions to create the 'result' column, and then bind that column to the original dataset.
library(tidyverse)
rownames_to_column(df, 'rn') %>%
gather(key, val, matches('^val')) %>%
group_by(rn) %>%
mutate(Mode = Modes(val)) %>%
summarise(result = case_when(all(val == Mode) ~ "all_equal",
sum(val != Mode) > 1 ~ "morethan1",
TRUE ~ paste0("val", which(val != Mode), collapse=","))) %>%
select(result) %>%
bind_cols(df, .)
# val1 val2 val3 val4 val5 result
#1 1.2 1.2 1.2 1.2 1.2 all_equal
#2 0.5 2.5 2.5 2.5 2.5 val1
#3 3.8 3.8 3.6 3.8 3.8 val3
#4 2.5 2.5 2.5 2.5 2.9 val5
#5 7.4 2.4 7.4 4.4 7.4 morethan1
The Modes function
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}

How to write a function to rename multiples columns at once?

df1 <- df %>%
rename(newcol1 = oldcol1) %>%
rename(newcol2 = oldcol2) %>%
rename(newcol3 = oldcol3) %>%
rename(newcol4 = oldcol4) %>%
rename(newcol5 = oldcol5)
I have just learned how to write functions, and I am trying to write one that does the same thing as the code above.
renaming = function(df, oldcol, newcol) {
rename(df, newcol = oldcol)
}
But I am not sure how to handle multiple columns.
Any help would be much appreciated!

Using base R
names(df) <- c("newname1", "newname2", "newname3") # for all varnames
names(df)[c(1,3,4)] <- c("newname1", "newname3", "newname4") # for varnames 1,3,4
names(df)[names(df) == "oldname"] <- "newname" # for one varname
Using data.table
setnames(dt, old=c("oldname1", "oldname2"), new=c("newname1", "newname2"))
Using dplyr/tidyverse
df %>% rename(newname1 = oldname1, newname2 = oldname2)
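To wrap the renaming in a function that takes vectors of old and new names, as the question asks, a minimal base R sketch could look like this. The name rename_cols and its argument names are made up for illustration.

```r
# Hypothetical helper: rename several columns at once.
# old and new are character vectors of equal length, paired by position.
rename_cols <- function(df, old, new) {
  stopifnot(length(old) == length(new), all(old %in% names(df)))
  # Locate each old name and overwrite it with the matching new name
  names(df)[match(old, names(df))] <- new
  df
}

# Usage on a small example data frame
df <- data.frame(oldcol1 = 1:3, oldcol2 = 4:6, other = 7:9)
df1 <- rename_cols(df,
                   old = c("oldcol1", "oldcol2"),
                   new = c("newcol1", "newcol2"))
```

Columns not listed in old keep their names, and the stopifnot check fails early if an old name is misspelled.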

You can use set_names from the tidyverse package purrr.
Reproducible example:
> df <- iris
> df1 <- df %>%
purrr::set_names(c("d","x","y","z","a"))
> df1
d x y z a
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
