Replace outliers of a dataframe with the mean value [duplicate] - r

This question already has an answer here:
How to replace outlier values?
(1 answer)
Closed 1 year ago.
I want to find all the outliers in a dataframe and replace them by the mean of the variable (column).
This is a big dataframe, composed of 46 obs. of 147 variables.
I was thinking of doing something like
new_df <- for (i in scaled.df){
i[!i %in% boxplot.stats(i)$out]
and then replacing the removed values, but that loop creates a NULL object. I believe the reason is that the new vectors won't have the same length.
Any ideas? Thx

You can write a function to do this -
replace_outlier_with_mean <- function(x) {
replace(x, x %in% boxplot.stats(x)$out, mean(x))
}
To apply for multiple columns you can use lapply -
scaled.df[] <- lapply(scaled.df, replace_outlier_with_mean)
Or in dplyr -
library(dplyr)
scaled.df %>% mutate(across(everything(), replace_outlier_with_mean))
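As a quick check of what the function does, here is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not the asker's data). Note that `mean(x)` in the answer is computed with the outliers still included. The second function, `replace_outlier_with_clean_mean`, is a hypothetical variant that averages only the non-outlying values, in case that is what you need.

```r
# Function from the answer: outliers are the points flagged by boxplot.stats()
replace_outlier_with_mean <- function(x) {
  replace(x, x %in% boxplot.stats(x)$out, mean(x))
}

# Hypothetical variant: the mean is taken over the non-outlying values only
replace_outlier_with_clean_mean <- function(x) {
  is_out <- x %in% boxplot.stats(x)$out
  replace(x, is_out, mean(x[!is_out], na.rm = TRUE))
}

# Made-up data: column a has one obvious outlier (100), column b has none
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))

with_mean <- toy
with_mean[] <- lapply(toy, replace_outlier_with_mean)

with_clean_mean <- toy
with_clean_mean[] <- lapply(toy, replace_outlier_with_clean_mean)

print(with_mean)
print(with_clean_mean)
```

Assigning into `with_mean[]` rather than `with_mean` keeps the result a data frame instead of a plain list.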

Related

A R function to transform column values into new ones does not work [duplicate]

This question already has answers here:
Update data frame via function doesn't work
(6 answers)
Closed 1 year ago.
I have written a function to change values not being NA in each column into a new value. The following example illustrates the problem:
df <- data.frame(A=c(1,NA,1,1,NA),
B=c(NA,1,NA,1,NA),
C=c(1,NA,1,NA,1))
1's should be changed into 0's with the function:
cambio <- function(d,v){
d[[v]][!is.na(d[[v]])] <- 0
}
The column is named within the function with [[]], and it is passed with quotes as argument to the function. I learned this in a clear and useful response to the post Pass a data.frame column name to a function.
However, after running the function, for example, with the first variable,
cambio(df,"A")
the values of that column remain unchanged.
Why does this function not work as expected?
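Your assignment
d[[v]][!is.na(d[[v]])] <- 0
already does what you want: it puts a zero on every value that is not NA. The problem is that the function modifies a local copy of d and never returns it. You are just missing the return(d) statement:
cambio <- function(d,v){
d[[v]][!is.na(d[[v]])] <- 0
return(d)
}
Then assign the result back to the data frame, e.g. df <- cambio(df,"A").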
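A minimal sketch of the return-and-reassign pattern, using the question's own `df` and keeping its `!is.na` condition (non-NA values become 0). The names `result` and `original_A` are only illustrative.

```r
df <- data.frame(A = c(1, NA, 1, 1, NA),
                 B = c(NA, 1, NA, 1, NA),
                 C = c(1, NA, 1, NA, 1))

cambio <- function(d, v) {
  # d is a local copy here; changes are invisible to the caller unless returned
  d[[v]][!is.na(d[[v]])] <- 0
  d
}

original_A <- df$A
result <- cambio(df, "A")  # df itself is still untouched at this point

print(df$A)      # original values
print(result$A)  # non-NA values replaced by 0

# To update df in place, assign the result back:
# df <- cambio(df, "A")
```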
Here are a few base R solutions:
one:
replace(df, df == 1, 0)
two:
replace(df, !is.na(df), 0)
three:
data.frame(lapply(df, pmin, 0))

Loop over unique subsets [duplicate]

This question already has answers here:
Calculating ratios by group with dplyr
(2 answers)
Closed 2 years ago.
I have a problem with a for() loop running over unique site names in a data frame. The function is running, but it keeps on returning NULL result and I cannot find the mistake I must be making. $POP_RELCATEGORY is numeric. The code is:
x <- split(xx, xx$LOCALITY)
testtest <- for(i in length(unique(names(x)))){
curr_year <- max(x[[i]]$POP_RELCATEGORY[x[[i]]$ROK == 2019])
prev_year <- max(x[[i]]$POP_RELCATEGORY[x[[i]]$ROK == 2018])
return(curr_year/prev_year)
}
testtest
The ideal output would be a vector consisting of curr_year/prev_year for each unique site (locality).
Thank you
You don't need to split the data into separate data frames for this task. There are functions available that handle this kind of grouped manipulation. For example, using dplyr (with xx being the data frame from the question) you can try:
library(dplyr)
xx %>%
group_by(LOCALITY) %>%
summarise(curr_year = max(POP_RELCATEGORY[ROK == 2019]),
prev_year = max(POP_RELCATEGORY[ROK == 2018]),
result = curr_year/prev_year)
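As for why the original loop printed NULL: a `for` loop in R always evaluates to NULL, so assigning it to `testtest` can never capture the ratios. In addition, `for(i in length(...))` iterates over a single value (the count) rather than `1:n`, and `return()` only makes sense inside a function. If you prefer to keep the `split()` approach in base R, a minimal sketch with `sapply()` could look like this (the data values below are made up for illustration).

```r
# Made-up data with the question's column names
xx <- data.frame(
  LOCALITY        = c("a", "a", "a", "b", "b"),
  ROK             = c(2018, 2019, 2019, 2018, 2019),
  POP_RELCATEGORY = c(2, 4, 6, 5, 10)
)

# sapply() collects the value of each function call into a named vector
ratios <- sapply(split(xx, xx$LOCALITY), function(site) {
  curr_year <- max(site$POP_RELCATEGORY[site$ROK == 2019])
  prev_year <- max(site$POP_RELCATEGORY[site$ROK == 2018])
  curr_year / prev_year
})

print(ratios)
```

The result is a numeric vector named by locality, which matches the "ideal output" described in the question.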

R: Function that uses variable dataframe names from a vector [duplicate]

This question already has answers here:
How to convert certain columns only to numeric?
(4 answers)
Make a list from ls(pattern="") [R]
(1 answer)
Closed 2 years ago.
I have a number of x dataframes (depending on previous operation). The names of the dataframes are stored in a different vector:
> list.industries
[1] "misc" "machinery" "electronics" "drugs" "chemicals"
Now, I want to set every column after the 4th as numeric. As the number of created dataframes and, therefore, the names change, I want to ask, if there is any way to do it automatically.
I tried:
for (i in 1:length(list.industries)) {
paste0(list.industries) <- lapply(paste0(list.industries)[,4:ncol(paste0(list.industries))] , as.numeric)
}
The idea is that the loop automatically takes each data frame name from the vector list.industries and converts that data frame's columns to numeric.
Is there a way to use the name of a data frame, stored in a vector, as a variable?
Thanks!
You can use mget to get the data frames as a named list, convert every column from the 4th onwards to numeric, and return each data frame.
new_data <- lapply(mget(list.industries), function(x) {
x[, 4:ncol(x)] <- lapply(x[, 4:ncol(x)], as.numeric)
x
})
new_data will be a named list of data frames. If you want the changes to be reflected in the original data frames, use list2env.
list2env(new_data, .GlobalEnv)
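Here is a self-contained sketch of the `mget()` / `list2env()` round trip. To avoid touching the global environment, it uses a separate environment `env`; that environment, the two toy data frames, and their column names are assumptions for illustration. It also indexes columns with `x[cols]` instead of `x[, cols]`, so the result stays a list of columns even when only one column is selected.

```r
# Toy setup: two data frames stored by name in their own environment
env <- new.env()
env$misc <- data.frame(id = 1:2, name = c("x", "y"), grp = c("u", "v"),
                       v1 = c("1", "2"), v2 = c("3.5", "4.5"),
                       stringsAsFactors = FALSE)
env$drugs <- data.frame(id = 3:4, name = c("p", "q"), grp = c("u", "u"),
                        v1 = c("7", "8"), v2 = c("0.5", "1.5"),
                        stringsAsFactors = FALSE)
list.industries <- c("misc", "drugs")

# mget() looks the names up and returns a named list of data frames
new_data <- lapply(mget(list.industries, envir = env), function(x) {
  cols <- 4:ncol(x)                       # assumes each data frame has >= 4 columns
  x[cols] <- lapply(x[cols], as.numeric)  # list-style indexing never drops to a vector
  x
})

# Write the converted data frames back under their original names
list2env(new_data, envir = env)

str(env$misc)
```

With the question's setup you would drop the `envir` arguments to `mget()` and pass `.GlobalEnv` to `list2env()`, as in the answer.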
You could use this fragment (untested):
one_df <- function(x) {
dat <- get(x)
for (i in seq(4, ncol(dat))) dat[,i] <- as.numeric(dat[,i])
return(dat)
}
ans <- lapply(list.industries, one_df)
So in short: you are looking for get.

Apply a function to each column in a dataframe in R [duplicate]

This question already has answers here:
Replace all occurrences of a string in a data frame
(7 answers)
Closed 2 years ago.
I would like to replace a series of "99"s in my dataframe with NA. To do this for one column I am using the following line of code, which works just fine.
data$column[data$column == "99"] = NA
However, as I have a large number of columns I want to apply this to all columns. The following line of code isn't doing it. I assume it is because the third "x" is again a reference to the dataframe and not to a specific column.
data = lapply(data, function(x) {x[x == "99"] = NA})
Any advice on what I should change?
If you want to replace all 99, simply do
data[data=="99"] <- NA
If you want to stick to the apply function, note that apply coerces the data frame to a matrix and returns a matrix rather than a data frame:
apply(data, 2, function(x) replace(x, x=="99", NA))
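The `lapply()` attempt in the question fails for a different reason than the asker suspected: the anonymous function ends with an assignment, so it returns the assigned value (`NA`) instead of the modified column. A minimal sketch of a corrected version, on a made-up data frame (`data` and `fixed` here are illustrative):

```r
data <- data.frame(a = c("1", "99", "3"),
                   b = c("99", "5", "6"),
                   stringsAsFactors = FALSE)

fixed <- data
# Return x explicitly, and assign into fixed[] so the result stays a data frame
fixed[] <- lapply(fixed, function(x) {
  x[x == "99"] <- NA
  x
})

print(fixed)
```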

Functionally add new column based on division of two others [duplicate]

This question already has answers here:
Dynamically select data frame columns using $ and a character value
(10 answers)
Closed 7 years ago.
Background
Sorry if this is a repeat, I couldn't find an exact match to this question.
So as part of a larger function, I'm trying to add a new column in a data.frame which is basically the division of two variables within that data.frame.
For example:
data(iris)
iris_test <- function(dataset, var1, var2) {
data <- dataset
data$length_width <- data$var1/data$var2
return(data)
}
If I then use this function
iris <- iris_test(iris, 'Petal.Length', 'Petal.Width')
I would hopefully generate a new column with data$length_width, however the code is breaking.
Error in `$<-.data.frame`(`*tmp*`, "length_width", value = numeric(0)) :
replacement has 0 rows, data has 150
I suspect you could do something fancy with paste() or formula() but really I want to understand what is happening and why.
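You cannot use character variables with the dollar notation. data$var1 looks for a column literally named var1, which does not exist, so it returns NULL. NULL/NULL is numeric(0), hence the "replacement has 0 rows" error. Use [[ ]] instead, which evaluates its argument. Try this: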
data(iris)
iris_test <- function(dataset, var1, var2) {
data <- dataset
data$length_width <- data[[var1]]/data[[var2]]
return(data)
}
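The difference between `$` and `[[` can be seen directly in a short sketch (the names `iris_copy`, `literal`, `dynamic`, and `out` are only illustrative):

```r
iris_copy <- datasets::iris
var1 <- "Petal.Length"

literal <- iris_copy$var1    # looks for a column literally called "var1"
dynamic <- iris_copy[[var1]] # evaluates var1 and looks up "Petal.Length"

print(is.null(literal))
print(head(dynamic))

# The corrected function from the answer
iris_test <- function(dataset, var1, var2) {
  data <- dataset
  data$length_width <- data[[var1]] / data[[var2]]
  return(data)
}

out <- iris_test(iris_copy, "Petal.Length", "Petal.Width")
print(head(out$length_width))
```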
