I am analysing some data with a wage variable. The values contain the symbol '€' and an 'M' or 'K' suffix.
I tried to use the gsub() function to address this, yet my code doesn't work:
Integer_converter <- function(strWage) {
  Factor_Wage = gsub("€", " ", strWage)
}

Factor_converter_1 <- function(strWage) {
  Integer_Wage = gsub("M", " ", strWage)
}

Factor_converter_2 <- function(strWage) {
  Integer_wage = as.integer(as.integer(gsub("K", "", strWage)) / 100)
}
The actual values are listed as follows:
$ Wage <fct> €405K, €195K, €205K, €240K, €175K, €25K, €205K, €57K, €140K, €135K, €15K, €45K, €40K, €76K, €17K, €125K, …
and I want to convert it into
$ Wage <dbl> 0.405, 0.195, 0.205, 0.240, 0.175, 0.025, 0.205, 0.057, 0.140, 0.135, 0.015, 0.045, 0.040, 0.076, 0.017, 0.125, …
We can use parse_number from readr to extract the number and divide by 1000.
library(readr)
parse_number(as.character(df1$Wage))/1000
#[1] 0.405 0.195 0.205 0.240 0.175 0.025 0.205 0.057 0.140
#[10] 0.135 0.015 0.045 0.040 0.076 0.017 0.125
It extracts the numeric part, and we then just divide by 1000.
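The same extraction can be done in base R, in the spirit of the original gsub() attempt (a sketch, assuming every value ends in "K" as in df1):
as.numeric(gsub("[€K]", "", as.character(df1$Wage))) / 1000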
It can also be done within a tidyverse chain:
library(dplyr)
df1 %>%
  mutate(Wage = parse_number(as.character(Wage)) / 1000)
If there are "M" in addition to "K", we can use gsubfn
library(gsubfn)
unname(sapply(gsubfn("[A-Z]", list(K = '/1e3', M = '/1e6'),
sub("€", "", df2$Wage)), function(x) eval(parse(text = x))))
data
df1 <- data.frame(Wage = c("€405K", "€195K", "€205K", "€240K", "€175K",
"€25K", "€205K", "€57K", "€140K", "€135K", "€15K", "€45K",
"€40K", "€76K", "€17K", "€125K"))
df2 <- data.frame(Wage = c("€405K", "€195K", "€205K", "€240K", "€175K",
"€25K", "€205K", "€57K", "€140K", "€135K", "€15M", "€45K",
"€40K", "€76K", "€17M", "€125K"))
I have a data frame a with two columns, PC and Price, as given below:
dput(a)
structure(list(PC = c(0, -0.009, 0, -0.008, 0.003, 0.008, -0.017,
0.032, 0.062, -0.02), Price = c(111.449554, 110.495506, 110.476242,
109.560745, 109.859482, 110.68824, 108.780159, 112.239769, 119.216805,
116.826897)), class = "data.frame", row.names = c(NA, -10L))
I have lookbackPeriod = 2, which means I have to calculate the rolling sum of a$PC over every lookbackPeriod rows, and
this I can calculate with:
b <- data.frame(MScore = rollsum(a$PC, k = lookbackPeriod))
I also have holdPeriod = 2, which means I have to calculate the percentage change in Price over every holdPeriod rows, but each holdPeriod window starts where the corresponding lookbackPeriod window ends.
For example, the rolling sum uses the values inside the lookback window, while the percentage change uses the price where that window ends and the price holdPeriod rows later.
So the final output b for the given input will have nrow(a) - lookbackPeriod + 1 - holdPeriod observations (here 10 - 2 + 1 - 2 = 7).
The following kind of gives me the result, but the last line coerces the LHS to a list, and the first two records of the Delt() result are NA:
b <- data.frame(MScore = rollsum(a$PC, k = lookbackPeriod))
b <- b[1:(nrow(b) - holdPeriod + 1), ]
a <- a[lookbackPeriod:nrow(a), ]
b$PChg <- as.data.frame(Delt(a$Price, k = holdPeriod))
Use rollapply with a width equal to the sum of the two periods, left-aligning the results and using an appropriate function for each column:
library(zoo)
k1 <- lookbackPeriod
k <- k1 + holdPeriod
roll <- function(...) rollapply(..., fill = NA, align = "left")
transform(a, MScore = roll(PC, k, function(x) sum(x[1:k1])),
          PChg = roll(Price, k, function(x) x[k]/x[k1] - 1))
giving:
PC Price MScore PChg
1 0.000 111.4496 -0.009 -0.008459720
2 -0.009 110.4955 -0.009 -0.005582739
3 0.000 110.4762 -0.008 0.010291049
4 -0.008 109.5607 -0.005 -0.009824578
5 0.003 109.8595 0.011 0.014017108
6 0.008 110.6882 -0.009 0.095942551
7 -0.017 108.7802 0.015 0.040869008
8 0.032 112.2398 NA NA
9 0.062 119.2168 NA NA
10 -0.020 116.8269 NA NA
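As a quick check of row 1: MScore = PC[1] + PC[2] = 0 + (-0.009) = -0.009, and PChg = Price[4]/Price[2] - 1 = 109.5607/110.4955 - 1 ≈ -0.00846, matching the table.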
This variation would also work:
f <- function(x) c(MScore = sum(x[1:k1, 1]), PChg = x[k, 2]/x[k1, 2] - 1)
cbind(a, rollapply(a, k, f, fill = NA, align = "left", by.column = FALSE))
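As a quick consistency check (a sketch using the objects defined above), both forms should agree on the non-NA rows:
res1 <- transform(a, MScore = roll(PC, k, function(x) sum(x[1:k1])),
                  PChg = roll(Price, k, function(x) x[k]/x[k1] - 1))
res2 <- cbind(a, rollapply(a, k, f, fill = NA, align = "left", by.column = FALSE))
all.equal(res1$MScore, res2$MScore)  # expected TRUE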
I want to calculate normalised ratios for all possible column combinations of a large matrix in R, efficiently. I asked a similar question earlier here, and with small data the solutions provided there worked fine. But when I try to apply the same solutions to a large dataset (400 x 2151), my system hangs. My system has 16 GB of RAM and an Intel i7 processor. Here is the code with data:
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
Solution provided by @Ronak Shah:
cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])
Or the following solution, as provided by @akrun:
f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")
Both solutions take a very long time and the system hangs. So, how can I do this efficiently?
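As a rough size check: expand.grid(cols, cols) yields 2151^2 = 4,626,801 column pairs, so new_data is a 400 x 4,626,801 matrix of doubles, i.e. about 400 x 4,626,801 x 8 bytes ≈ 14.8 GB, which by itself nearly exhausts the 16 GB of RAM before any intermediate copies are made.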
Update
Here is the expected output, illustrated on a small fake dataset:
library(tidyverse)
#Fake dataset
df = structure(list(var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024),
var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025),
var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026),
var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028),
var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)),
class = "data.frame", row.names = c(NA, -7L))
cols <- 1:ncol(df)
mat_out <- do.call(cbind, lapply(cols, function(xj)
sapply(cols, function(xi) (df[, xj] - df[, xi])/(df[, xj] + df[, xi]))))
colnames(mat_out) <- outer(names(df), names(df), paste, sep = ",")
y <- read.table(text = "s_no y
1 95.512
2 97.9
3 92.897
4 94.209
5 87.472
6 91.109
7 92.83", header = T)
mat_out %>% as.data.frame() %>%
mutate(id = row_number()) %>%
left_join(y, by = c("id" = "s_no")) %>%
pivot_longer(cols = -c(y, id)) %>%
group_by(name) %>%
mutate(correl = cor(value, y, use = "complete.obs")) %>%
distinct(name, .keep_all = TRUE) %>%
separate(name, c("Wav1", "Wav2"), sep = ",") %>%
select(-c("id", "y", "value")) %>%
pivot_wider(names_from = Wav2, values_from = correl)
#> # A tibble: 5 × 6
#> Wav1 var_1 var_2 var_3 var_4 var_5
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 var_1 NA -0.190 -0.358 -0.537 -0.551
#> 2 var_2 0.190 NA -0.322 -0.528 -0.544
#> 3 var_3 0.358 0.322 NA -0.682 -0.667
#> 4 var_4 0.537 0.528 0.682 NA -0.595
#> 5 var_5 0.551 0.544 0.667 0.595 NA
You can use Rcpp (with RcppArmadillo) to make your code faster:
Rcpp::cppFunction("
std::vector<double> my_fun(arma::mat& x, arma::vec& y){
int p = x.n_cols - 1;
std::vector<double> result;
for(int i = 0; i<p; i++){
auto m = (x.cols(i+1, p).each_col() - x.col(i));
m /= (x.cols(i+1, p).each_col() + x.col(i));
auto a = arma::conv_to<std::vector<double>>::from(arma::cor(m, y));
result.insert(result.end(), a.begin(), a.end());
}
return result;
}", 'RcppArmadillo')
my_fun(df, y)  # takes approximately 14 seconds
You could use STL functions directly to make it even faster, though the code would be longer; on my computer that takes about 6 seconds.
mat <- matrix(rexp(860400), nrow = 400, ncol = 2151)
y <- rnorm(nrow(mat), 700, 10)
my_fun(mat, y) # This works
Since memory seems to be your main issue, how about using iterators? With the package RcppAlgos*, we can use permuteIter to calculate your ratios N at a time.
If one must have names, we need an additional iterator. This means you must keep 2 iterators in sync, which can become tedious. Fortunately, with the summary() methods of permuteIter, we can always see what the current index is and reset them with a wide range of options (e.g. random access [[, front(), back(), or startOver()).
library(RcppAlgos)
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
ratioIter <- permuteIter(ncol(df), 2, FUN = function(x) {
(df[, x[2]] - df[, x[1]]) / (df[, x[2]] + df[, x[1]])
})
## if you really want to name your output, you must have
## an additional name iterator... not very elegant
nameIter <- permuteIter(paste0("col", 1:ncol(df)), 2, FUN = function(x) {
  paste0(rev(x), collapse = "_")
})
firstIter <- matrix(ratioIter$nextIter(), ncol = 1)
firstName <- nameIter$nextIter()
colnames(firstIter) <- firstName
head(firstIter)
col2_col1
[1,] 0.2990054
[2,] -0.9808111
[3,] -0.9041054
[4,] 0.7970873
[5,] 0.8625776
[6,] 0.2768359
## nextNIter returns a list, so we call do.call(cbind, ...)
next5Iter <- do.call(cbind, ratioIter$nextNIter(5))
next5Names <- unlist(nameIter$nextNIter(5))
colnames(next5Iter) <- next5Names
head(next5Iter)
col3_col1 col4_col1 col5_col1 col6_col1 col7_col1
[1,] -0.28099710 0.1665687 0.40565958 -0.7524038 -0.7132844
[2,] -0.81434900 -0.4283759 -0.89811556 -0.8462906 -0.5399741
[3,] -0.02289368 0.4285012 0.05087853 -0.5091659 -0.2328995
[4,] -0.06825458 0.3126928 0.68968843 -0.2180618 0.6651785
[5,] 0.33508319 0.7389108 0.84733425 0.9065263 0.8977107
[6,] 0.61773589 0.3443120 0.61084584 0.5727938 0.3888807
You should note that this does not show results where i == j (these give NaN), so the total number is just under 2151^2 (in fact it is exactly 2151^2 - 2151 = 4,624,650, which matches totalResults below).
ratioIter$summary()
$description
[1] "Permutations of 2151 choose 2"
$currentIndex
[1] 6
$totalResults
[1] 4624650
$totalRemaining
[1] 4624644
There are even random access and previous iterators as well:
## Get the last ratio
lastIter <- ratioIter$back()
lastName <- nameIter$back()
mLast <- matrix(lastIter, ncol = 1)
colnames(mLast) <- lastName
head(mLast)
col2150_col2151
[1,] -0.6131926
[2,] 0.9936783
[3,] 0.1373538
[4,] 0.1014347
[5,] -0.5061608
[6,] 0.5773503
## iterate backwards with the previous methods
prev5Iter <- do.call(cbind, ratioIter$prevNIter(5))
prev5Names <- unlist(nameIter$prevNIter(5))
colnames(prev5Iter) <- prev5Names
head(prev5Iter)
col2149_col2151 col2148_col2151 col2147_col2151 col2146_col2151 col2145_col2151
[1,] -0.75500069 -0.72757136 -0.94457988 -0.82858884 -0.25398782
[2,] 0.99696694 0.99674084 0.99778638 0.99826472 0.95738947
[3,] 0.27701596 0.45696010 0.00682574 0.01529448 -0.62368764
[4,] -0.09508689 -0.90698165 -0.38221934 -0.41405984 0.01371556
[5,] -0.31580709 -0.06561386 -0.07435058 -0.08033145 -0.90692881
[6,] 0.82697720 0.86858595 0.81707206 0.75627297 0.46272349
## Get a random sample
set.seed(123)
randomIter <- do.call(cbind, ratioIter[[sample(4624650, 5)]])
## We must reset the seed in order to get the same output for the names
set.seed(123)
randomNames <- unlist(nameIter[[sample(4624650, 5)]])
colnames(randomIter) <- randomNames
head(randomIter)
col1044_col939 col20_col1552 col412_col2014 col1751_col1521 col337_col1295
[1,] -0.3902066 0.4482747 -0.108018200 -0.1662857 -0.3822436
[2,] -0.2358101 0.9266657 -0.657135882 0.0671608 -0.6821823
[3,] -0.7054217 0.8944720 0.092363665 0.2667708 0.1908249
[4,] -0.1574657 0.2775225 -0.221737223 0.3381454 -0.5705021
[5,] -0.4282909 -0.4406433 0.092783086 -0.7506674 -0.1276932
[6,] 0.9998189 -0.2497586 -0.009375891 0.7071864 -0.2425258
Lastly, it is written in C++ so it is very fast:
system.time(ratioIter$nextNIter(1e3))
# user system elapsed
# 0 0 0
* I am the author of RcppAlgos
I have a dataset in this format:
structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26,
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47,
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43,
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11,
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA,
-4L))
Every id is unique.
I would like to make two clusters:
First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before
Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after
I used the two-step cluster procedure in SPSS for this.
How can I do the same in R?
This question is quite complex; this is how I'd approach the problem, hoping to help and maybe to start a discussion about it.
Note:
this is how I think I could approach the problem;
I do not know two-step clustering, so I use k-means;
it's based on your data, but you can easily generalise it: I've made it dependent on your data because that is simpler to explain.
So, you create the first clustering with the before variables; then the values of the variables change (the after variables), and you want to see whether the ids stay in the same cluster.
This suggests that you only need the first set of clusters (for the before variables) and then check whether the ids have changed: there is no need for a second clustering, only a check of whether each id has moved from one cluster to the other.
# first, you make your model of clustering, I'm using a simple kmeans
set.seed(1234)
model <- kmeans(df[,2:6],2)
# you put the clusters in the dataset
df$before_cluster <- model$cluster
Now the idea is to calculate the Euclidean distance from each id's after variables to the centroids calculated on the before variables:
# for the first cluster
cl1 <- list()
for (i in 1:nrow(df)) {
  cl1[[i]] <- dist(rbind(df[i, 7:11], model$centers[1, ]))
}
cl1 <- do.call(rbind, cl1)
colnames(cl1) <- 'first'

# for the second cluster
cl2 <- list()
for (i in 1:nrow(df)) {
  cl2[[i]] <- dist(rbind(df[i, 7:11], model$centers[2, ]))
}
cl2 <- do.call(rbind, cl2)
colnames(cl2) <- 'second'

# put them together
df <- cbind(df, cl1, cl2)
Now the last part: you can determine whether an id has changed cluster by taking the smaller of the two distances to the centroids (the smaller distance marks the "new" cluster):
df$new_cl <- ifelse(df$first < df$second, 1, 2)
df
  id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after before_cluster     first    second new_cl
1  1       -0.16       -0.70        2.47       -1.08        0.54      -0.43      -0.69       2.97      -1.11       1.22              2 0.6852372 0.8151840      1
2  2       -0.31       -1.06        2.97       -1.22        0.40      -0.18      -0.38       3.15      -0.99       0.41              1 0.7331098 0.5208887      2
3  3       -0.26       -0.51        2.91       -0.92        0.46      -0.59      -1.19       3.35      -1.26       1.01              2 0.6117598 1.1180004      1
4  4       -0.77       -0.81        3.01       -1.16        0.79       0.64      -0.77       1.52      -0.39       0.24              1 2.0848381 1.5994765      2
Seems they all have changed cluster.
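A more compact way to get the same distances, and one that generalises to more than two clusters (a sketch assuming the df and model objects above; new_cl2 is just an illustrative name):
# distance of every id's after-values to every centroid, then pick the nearest
dists <- apply(df[, 7:11], 1, function(row)
  apply(model$centers, 1, function(ctr) sqrt(sum((row - ctr)^2))))
df$new_cl2 <- apply(dists, 2, which.min)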
I have 48 dataframes and I wish to estimate a linear regression for each of the stocks in each of the dataframes (the CAPM). Each dataframe contains the same stocks, around 470 of them, plus the S&P 500, and holds 36 months of data. Originally I had one large dataframe, but I split the data into the 48 dataframes (this might not have been the smartest move, but it is the way I solved the problem).
When I run the following code, it works fine. Note that Block1 is hard-coded:
beta_results <- lapply(symbols, function(x) {
  temp <- as.data.frame(Block1)
  input <- as.formula(paste("temp$", x, "~ temp$SP500"))
  capm <- lm(input)
  coefficients(capm)
})
Rather than changing the code for each of the 48 blocks (i.e. Block1 to Block2, etc.), I attempted the following, which in hindsight is complete rubbish. What I need is a way to increment i from 1 to 48. I tried putting all the dataframes in a list, but given the way I have the regression working I would be processing two lists, and that was beyond me.
beta_results <- lapply(seq_along(symbols), function(i, x) {
  temp <- as.data.frame(paste0("Block", i))
  input <- as.formula(paste("temp$", x, "~ temp$SP500"))
  capm <- lm(input)
  coefficients(capm)
})
Code for some example dataframes etc are:
symbols <- c("A", "AAPL", "BRKB")
Block1 to BlockN would take the form of
A AAPL BRKB SP500
2016-04-29 -0.139 0.111 0.122 0.150
2016-05-31 0.071 0.095 0.330 0.200
2016-06-30 -0.042 -0.009 0.230 0.150
2016-07-29 0.090 0.060 0.200 0.100
2016-08-31 0.023 0.013 0.005 0.050
2016-09-30 0.065 0.088 0.002 0.100
Consider a nested lapply where the outer loop iterates through the list of dataframes and the inner loop through each symbol. The result is a 48-member list, each element containing around 470 sets of beta coefficients.
Also, as an aside, it is preferable to keep many similarly structured objects in a list, especially when running the same operations on them, as it avoids flooding your global environment (you manage 1 list instead of 48 dataframes):
# LIST OF DATA FRAMES FROM ALL GLOBAL VARIABLES CONTAINING "Block"
dfList <- mget(ls(pattern="Block"))
# NESTED LAPPLY
results_list <- lapply(dfList, function(df) {
  beta_results <- lapply(symbols, function(x) {
    input <- reformulate("SP500", response = x)
    capm <- lm(input, data = df)
    coefficients(capm)
  })
  beta_results
})
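If you want the inner results labelled by ticker, one optional addition (a sketch; setNames is base R, and this assumes the symbols vector and Block1 naming from the question):
results_list <- lapply(results_list, setNames, symbols)
results_list[["Block1"]][["AAPL"]]  # coefficients for AAPL in Block1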
@Parfait's answer is the correct one for the OP's question of using lapply to process a list of data frames.
The following example shows how data.table can be used to get the coefficients of lm(stock~SP500) for each stock (using the Block1 example data):
library(data.table)
dt <- structure(list(date = c("2016-04-29", "2016-05-31", "2016-06-30",
"2016-07-29", "2016-08-31", "2016-09-30"), A = c(-0.139, 0.071,
-0.042, 0.09, 0.023, 0.065), AAPL = c(0.111, 0.095, -0.009, 0.06,
0.013, 0.088), BRKB = c(0.122, 0.33, 0.23, 0.2, 0.005, 0.002),
SP500 = c(0.15, 0.2, 0.15, 0.1, 0.05, 0.1)), .Names = c("date",
"A", "AAPL", "BRKB", "SP500"), row.names = c(NA, -6L), class = "data.frame")
setDT(dt)
# Convert to long format for easier lm
dt_melt <- melt(dt, id.vars = c("date", "SP500"))
# Extract coefficients by doing lm for each unique variable (i.e. stock)
dt_lm <- dt_melt[, as.list(coefficients(lm(value~SP500))), by = variable]
# Fix column names
setnames(dt_lm, c("stock", "intercept", "slope"))
> dt_lm
stock intercept slope
1: A 0.05496970 -0.3490909
2: AAPL 0.01421212 0.3636364
3: BRKB -0.10751515 2.0454545
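The same idea extends across all 48 blocks with one more lapply; a sketch, assuming the dfList of the earlier answer and that each block has date and SP500 columns like dt above:
all_betas <- rbindlist(lapply(dfList, function(df) {
  long <- melt(as.data.table(df), id.vars = c("date", "SP500"))
  long[, as.list(coefficients(lm(value ~ SP500))), by = variable]
}), idcol = "block")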
Here is the test I'm interested in:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
How can I adapt this code into a function that accepts a vector of numeric values and returns a logical vector specifying which data points to remove?
I have attempted to do this below, but I'm getting stuck because when I sort the vector to return, it doesn't line up with the input vector data.
# input data
y = c(-0.25, 0.68, 0.94, 1.15, 1.20, 1.26, 1.26,
1.34, 1.38, 1.43, 1.49, 1.49, 1.55, 1.56,
1.58, 1.65, 1.69, 1.70, 1.76, 1.77, 1.81,
1.91, 1.94, 1.96, 1.99, 2.06, 2.09, 2.10,
2.14, 2.15, 2.23, 2.24, 2.26, 2.35, 2.37,
2.40, 2.47, 2.54, 2.62, 2.64, 2.90, 2.92,
2.92, 2.93, 3.21, 3.26, 3.30, 3.59, 3.68,
4.30, 4.64, 5.34, 5.42, 6.01)
## Generate normal probability plot.
qqnorm(y)
removeoutliers = function(dfinputcol) {
y = as.vector(dfinputcol)
## Create function to compute the test statistic.
rval = function(y){
ares = abs(y - mean(y))/sd(y)
df = data.frame(y, ares)
r = max(df$ares)
list(r, df)}
## Define values and vectors.
n = length(y)
alpha = 0.05
lam = c(1:10)
R = c(1:10)
## Compute test statistic until r=10 values have been
## removed from the sample.
for (i in 1:10){
if(i==1){
rt = rval(y)
R[i] = unlist(rt[1])
df = data.frame(rt[2])
newdf = df[df$ares!=max(df$ares),]}
else if(i!=1){
rt = rval(newdf$y)
R[i] = unlist(rt[1])
df = data.frame(rt[2])
newdf = df[df$ares!=max(df$ares),]}
## Compute critical value.
p = 1 - alpha/(2*(n-i+1))
t = qt(p,(n-i-1))
lam[i] = t*(n-i) / sqrt((n-i-1+t**2)*(n-i+1))
}
## Print results.
newdf = data.frame(c(1:10),R,lam)
names(newdf)=c("Outliers","TestStat.", "CriticalVal.")
# determine how many outliers to remove
toremove = max(newdf$Outliers[newdf$TestStat. > newdf$CriticalVal.])
# create vector of same size as input vector
logicalvectorTifshouldremove = logical(length=length(y))
# but how to determine which outliers to remove?
# set largest data points as outliers to remove.. but could be the smallest in some data sets..
logicalvectorTifshouldremove = replace(logicalvectorTifshouldremove, tail(sort(y), toremove), TRUE)
return (logicalvectorTifshouldremove)
}
# this should have 3 data points set to TRUE .. but it has 2 and they aren't the correct ones
output = removeoutliers(y)
length(output[output==T])
I think it is written on the page you linked (not exactly, but in two sentences):
Remove the r observations that maximize |x_i - mean(x)|
So you get the data without the outliers simply by removing the toremove values with the largest absolute deviations from the mean, using:
y[abs(y-mean(y)) < sort(abs(y-mean(y)),decreasing=TRUE)[toremove]]
You do not need the last two lines of your code: sort(abs(y - mean(y)), decreasing = TRUE)[toremove] is the toremove-th largest deviation, so the comparison keeps everything below it. By the way, you can compute the count directly with:
toremove = max(which(newdf$TestStat. > newdf$CriticalVal.))
To simplify a bit, the final function could be:
# Compute the critical value for ESD Test
esd.critical <- function(alpha, n, i) {
p = 1 - alpha/(2*(n-i+1))
t = qt(p,(n-i-1))
return(t*(n-i) / sqrt((n-i-1+t**2)*(n-i+1)))
}
removeoutliers = function(y) {
## Define values and vectors.
y2 = y
n = length(y)
alpha = 0.05
toremove = 0
## Compute test statistic until r=10 values have been
## removed from the sample.
for (i in 1:10){
if(sd(y2)==0) break
ares = abs(y2 - mean(y2))/sd(y2)
Ri = max(ares)
y2 = y2[ares!=Ri]
## Compute critical value.
if(Ri>esd.critical(alpha,n,i))
toremove = i
}
# Values to keep
if(toremove>0)
y = y[abs(y-mean(y)) < sort(abs(y-mean(y)),decreasing=TRUE)[toremove]]
return (y)
}
which returns:
> removeoutliers(y)
[1] -0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49
[12] 1.49 1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91
[23] 1.94 1.96 1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26
[34] 2.35 2.37 2.40 2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93
[45] 3.21 3.26 3.30 3.59 3.68 4.30 4.64
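The original question asked for a logical vector marking the points to remove rather than the cleaned vector; a small wrapper sketch (it assumes, as here, that no kept value is exactly equal to a removed one):
is_outlier <- !(y %in% removeoutliers(y))
which(is_outlier)  # positions of 5.34, 5.42 and 6.01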
You can use winsorize() from the robustHD package:
library('robustHD')
set.seed(1234)
x <- rnorm(10)
x[1] <- x[1] * 10
x[2] <- x[2] * 11
x[10] <- x[10] * 10
x
[1] -12.0706575 3.0517217 1.0844412 -2.3456977 0.4291247 0.5060559 -0.5747400 -0.5466319 -0.5644520 -8.9003783
boxplot(x)
y <- winsorize(x)
y
[1] -4.5609058 3.0517217 1.0844412 -2.3456977 0.4291247 0.5060559 -0.5747400 -0.5466319 -0.5644520 -4.5609058
boxplot(y)
So if you have a data frame rather than a single vector, you can use sapply() to apply the winsorize function to each column.
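For example, a minimal sketch for a data frame dat with numeric columns (dat is a hypothetical name, not from the post):
dat_w <- as.data.frame(sapply(dat, winsorize))
boxplot(dat_w)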
For more information about this package, you can follow this link: http://cran.r-project.org/web/packages/robustHD/index.html