Inserting outliers to a dataframe

I am trying to create a function that injects outliers into an existing data frame.
I started by creating a new data frame, outs, using the max and min values of the original data frame. This outs data frame will contain a certain amount of outlier data.
Later I want to inject the outlier values from the outs data frame into the original data frame.
What I want to end up with is a function that injects a given amount of outliers into an original data frame.
I have several problems. First, I do not know whether I am using runif correctly to create a data frame of outliers. Second, I do not know how to inject the outliers into temp.
The code I've tried so far is:
addOutlier <- function (data, amount){
  maxi <- apply(data, 2, function(x) (mean(x)+(3*(sd(x)))))
  mini <- apply(data, 2, function(x) (mean(x)-(3*(sd(x)))))
  temp <- data
  amount2 <- ifelse(amount<1, (prod(dim(data))*amount), amount)
  outs <- runif(amount2, 2, min = mini, max = maxi) # outliers
  if (amount2 >= prod(dim(data))) stop("exceeded data size")
  for (i in 1:length(outs))
    temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- outs
  return (temp)
}
Any help to make this work will be deeply appreciated.

My understanding is that you're trying to add a set amount of outliers to each column of your data frame. Alternatively, you also seem to be looking into adding a percentage of outliers to each column. I wrote a solution only for the former case, but the latter should be pretty easy to implement if you really need it. Note how I broke things down into two functions to (hopefully) help clarify what is going on. Hope this helps!
add.outlier.to.vector <- function(vector, amount) {
  cells.to.modify <- sample(1:length(vector), amount, replace=F)
  mean.val <- mean(vector)
  sd.val <- sd(vector)
  min.val <- mean.val - 3 * sd.val
  max.val <- mean.val + 3 * sd.val
  vector[cells.to.modify] <- runif(amount, min=min.val, max=max.val)
  return(vector)
}
add.outlier.to.data.frame <- function (temp, amount){
  for (i in 1:ncol(temp)) {
    temp[,i] <- add.outlier.to.vector(temp[,i], amount)
  }
  return (temp)
}
data <- data.frame(
  a=c(1,2,3,4),
  b=c(7,8,9,10)
)
add.outlier.to.data.frame(data, 2)
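The percentage case mentioned above could be sketched as follows. This is only an illustration, not part of the original answer: the helper name add.outlier.fraction is hypothetical, and so is the choice to draw replacement values between 3 and 4 standard deviations from the mean. Values drawn with runif between mean - 3*sd and mean + 3*sd stay inside the usual 3-sigma band, so this variant deliberately samples outside it.

```r
# Hypothetical helper (an assumption, not from the original answer):
# amount < 1 is read as a fraction of the vector, otherwise as a count.
add.outlier.fraction <- function(vector, amount) {
  n <- if (amount < 1) ceiling(length(vector) * amount) else amount
  if (n > length(vector)) stop("exceeded vector size")
  cells.to.modify <- sample(seq_along(vector), n, replace = FALSE)
  mean.val <- mean(vector)
  sd.val <- sd(vector)
  # Pick a random side for each outlier, then place it 3 to 4 standard
  # deviations away from the mean (the 3-to-4 range is an arbitrary choice).
  side <- sample(c(-1, 1), n, replace = TRUE)
  vector[cells.to.modify] <- mean.val + side * runif(n, min = 3, max = 4) * sd.val
  vector
}

set.seed(1)
x <- rnorm(100)
y.count <- add.outlier.fraction(x, 5)       # replace exactly 5 cells
y.fraction <- add.outlier.fraction(x, 0.1)  # replace about 10% of the cells
```

The same helper can be dropped into the loop of add.outlier.to.data.frame in place of add.outlier.to.vector.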

Related

R: How to find the mean across categories of variables

I have a data frame with about 50 numeric variables. I want to create a new column with values for the mean of a certain number of these variables which fall into the same category. So for example, I might want to create a new variable called df$mean_weight which contains values for the averages across rows of respondents' df$weight1, df$weight2, df$weight3. And the same thing with height variables etc.
This is what I have so far:
find_mean = function(...) {
input_list = list(...)
output_list = sapply(input_list,mean, na.rm=TRUE)
return(output_list)
}
df$mean_weight = find_mean(df$weight1, df$weight2, df$weight3)
The problem is this gives me an error saying the replacement has fewer rows than my original data. For some reason this error isn't present when I try height variables with the same code, however.

I can't reproduce your error. The function works fine for a sample dataset I generate.
# Sample data
set.seed(2017);
df <- as.data.frame(matrix(runif(200), ncol = 5));
colnames(df) <- paste0("weight", seq(1:5));
# Your function
find_mean = function(...) {
input_list = list(...)
output_list = sapply(input_list,mean, na.rm=TRUE)
return(output_list)
}
find_mean(df$weight1, df$weight2, df$weight3)
#[1] 0.4736851 0.5569710 0.4300163
You can also achieve the same output in a single line:
sapply(c("weight1", "weight2", "weight3"), function(x) mean(df[, x]))
# weight1 weight2 weight3
#0.4736851 0.5569710 0.4300163
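If the goal is one mean per respondent (per row) rather than one mean per column, which is what the question describes, a sketch along these lines may be closer to what is needed. It reuses the sample data from above; the name weight_cols is introduced here only for illustration. Note that find_mean returns one value per input column, which would explain the "replacement has fewer rows" error whenever the data frame has a different number of rows than there are columns passed in.

```r
# Sample data, generated the same way as in the answer above
set.seed(2017)
df <- as.data.frame(matrix(runif(200), ncol = 5))
colnames(df) <- paste0("weight", 1:5)

# Columns that belong to the same category (illustrative name)
weight_cols <- c("weight1", "weight2", "weight3")

# rowMeans gives one value per row, so the result fits as a new column
df$mean_weight <- rowMeans(df[, weight_cols], na.rm = TRUE)
head(df$mean_weight)
```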

function to be run multiple times to generate the final dataset in R

I am new to R and have written a function that needs to be run multiple times to generate the final dataset.
The number of runs is determined by the vector of unique years, and the function should produce an output for each of these years.
However, I am not getting the right output.
Desired output: for example, if it takes 10 samples from each year, then after the 10th run I should have 100 rows of correct output.
create_strsample <- function(n1,n2){
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <-sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE)
r2<-sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE)
#final.data <-k1[c(r1,r2),]
sample.data <- lapply(yr, function(x) {f.data<-create_strsample(200,800)})
k1 <- do.call(rbind,k1)
return(k1)
}
final <- do.call(rbind,sample.data)
return(final)
}
stratified.sample.data <- create_strsample(200,800)

An MWE would have been nice, but I'll give you a template for this kind of question. Note that this is not optimized for speed (or anything else), only for ease of understanding.
As noted in the comments, that call to create_strsample in the loop looks weird and probably isn't what you really want.
data <- data.frame() # we need an empty, but existing variable for the first loop iteration
for (i in 1:10) {
  temp <- runif(1,max=i) # do something...
  data <- rbind(data,temp) # ... and add this to 'data'
} # repeat 10 times
rm(temp) # don't need this anymore
That return(k1) in the loop also looks wrong.

After your suggestion, @herbaman, I tried the following, which gives the desired output without the lapply.
create_strsample <- function(n1,n2){
  final.data <- NULL
  yr <- c(2010,2011,2012,2013)
  for(i in 1:length(yr)){
    k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
    r1 <- k1[sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE), ]
    r2 <- k1[sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE), ]
    sample.data <- rbind(r1,r2)
    final.data <- rbind(final.data, sample.data)
  }
  return(final.data)
}
stratified.sample.data <- create_strsample(200,800)
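For comparison, the same per-year sampling can be written with lapply and a single do.call(rbind, ...) instead of growing final.data inside a loop. This is a sketch on made-up data: the accounts data frame, the function name sample_by_year, and the sample sizes are assumptions for illustration only. It indexes with sample.int so that a group with a single matching row is not misread by sample() as "sample from 1:x".

```r
# Hypothetical toy data standing in for the asker's `data`
set.seed(42)
accounts <- data.frame(
  account_opening_date = as.Date("2010-01-01") + sample(0:1460, 400, replace = TRUE),
  account_closing_date = as.Date(NA)
)
closed <- sample(nrow(accounts), 200)
accounts$account_closing_date[closed] <- accounts$account_opening_date[closed] + 30

# n1 closed and n2 open accounts are drawn from each opening year
sample_by_year <- function(d, n1, n2, years = 2010:2013) {
  opening_year <- format(d$account_opening_date, "%Y")
  per_year <- lapply(years, function(y) {
    k <- d[opening_year == y, ]
    closed_rows <- which(!is.na(k$account_closing_date))
    open_rows <- which(is.na(k$account_closing_date))
    k[c(closed_rows[sample.int(length(closed_rows), n1)],
        open_rows[sample.int(length(open_rows), n2)]), ]
  })
  do.call(rbind, per_year)  # stack the yearly samples into one data frame
}

result <- sample_by_year(accounts, 5, 5)
nrow(result)  # 4 years times (5 + 5) rows
```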

Removing rows with Z score above 3

data : https://drive.google.com/file/d/0B20HmmYd0lsFbnE4RUh6N0xtUHc/edit?usp=sharing
Where dat$C is true, I want to remove items whose RT has a Z score of 3 or above within each s x S combination.
I thought of two ways to do this (the clean function and the line using the plyr package below), but one removes more rows than the other. Can somebody explain why my clean function does not agree with the line using the plyr package?
dat <- read.table(file="dat.txt")
# 3SD clean function
clean <- function(df) {
dfc <- df[as.logical(df$C),]
n=tapply(df$RT,list(df$s,df$S),length)
ns=tapply(df$RT,list(df$s),length)
mn=tapply(df$RT,list(df$s,df$S),mean)
sd=tapply(df$RT,list(df$s,df$S),sd)
upper <- mn+3*sd
bad <- logical(dim(df)[1])
levs <- paste(df$s,df$S,sep=".")
for (i in levels(df$s)) for (j in levels(df$S)) {
lev <- paste(i,j,sep=".")
bad[levs==lev] <- df[levs==lev,"RT"] > upper[i,j]
}
df=df[!bad,]
nok=tapply(df$RT,list(df$s,df$S),length)
pbad=100-100*nok/n
print(aperm(round(pbad,1),c(2,1)))
nok=tapply(df$RT,list(df$s),length)
pbad=100-100*nok/ns
print(sort(round(pbad,1)))
print(mean(pbad,na.rm=T))
df
}
require(plyr)
str(ddply(dat,.(s,S,C),function(x) x[scale(x$RT)< 3.00,]))
str(clean(dat))

I was not able to get your sample data.
Assuming you have already calculated the z score and stored it in a column of a data frame,
you could simply write
mydata[mydata$score <=3, ]
and that should be enough!
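If the z score has not been computed yet, one way to get it per s x S cell in base R is ave(). The sketch below uses made-up data (the real dat.txt is not available here) with the column names s, S and RT taken from the question; the column name z is introduced for illustration. It may also help when comparing the two approaches in the question, since clean() groups by s and S and tests only the upper bound, while the ddply line groups by s, S and C.

```r
# Made-up data with the same column names as in the question
set.seed(1)
dat <- data.frame(
  s = factor(rep(c("a", "b"), each = 50)),
  S = factor(rep(c("x", "y"), times = 50)),
  RT = c(rnorm(98, mean = 0.5, sd = 0.1), 5, 5)  # rows 99 and 100 are extreme
)

# z score of RT within each s-by-S cell
dat$z <- ave(dat$RT, dat$s, dat$S, FUN = function(v) (v - mean(v)) / sd(v))

# Keep only rows whose within-cell z score is below 3
kept <- dat[dat$z < 3, ]
```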

r Hmisc::dataframeReduce - replicate actions from one dataset to identically structured dataset

I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.
dataframeReduce in the Hmisc package is what I've been using, but applying the function to different datasets results in slightly different actions.
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
The final line ends up erroring because testPredictors has had a column removed that trainPredictors retains. All other sets should have the transformations that were applied to trainPredictors applied to them.
Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?
An example
Using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
library("Hmisc")
trainPredictors<-NAins(mtcars, .1)
testPredictors<-NAins(mtcars, .3)
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]

If your goal is to have the same variables with the same levels, then avoid calling dataframeReduce a second time. Instead, keep the columns that the dataframeReduce operation produced on the train set and apply the factor-reduction logic to the test set in a way that gives whatever degree of homology the subsequent comparison operations need. If a predict operation is planned, then the levels must be the same, and you need to modify the code in dataframeReduce that works on the levels:
if (is.category(x) || length(unique(x)) == 2) {
tab <- table(x)
if ((min(tab)/n) < minprev) {
if (is.category(x)) {
x <- combine.levels(x, minlev = minprev)
s <- "grouped categories"
if (length(levels(x)) < 2)
s <- paste("prevalence<", minprev, sep = "")
}
else s <- paste("prevalence<", minprev, sep = "")
}
}
So a better problem statement is likely to produce a better strategy. This will probably require both knowing what levels are in the entire set and in the train and test sets as well as what testing or predictions are anticipated (but not yet stated).
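As a rough base-R illustration of that idea (not a use of Hmisc itself), the sketch below decides which columns to keep on the training set only, then reuses those columns and the training factor levels on the test set. The toy data frames and the simple missing-fraction rule are assumptions standing in for the real dataframeReduce step.

```r
# Toy data; drop_me is mostly missing in the training set only
train <- data.frame(
  x = c(1, 2, 3, 4),
  g = factor(c("a", "b", "a", "b")),
  drop_me = c(NA, NA, NA, 1)
)
test <- data.frame(
  x = c(5, NA, 7),
  g = factor(c("a", "c", "b")),
  drop_me = c(1, 2, 3)
)

# Stand-in for the reduction step: keep columns with at most 20% missing,
# decided on the training set only
keep <- names(train)[colMeans(is.na(train)) <= 0.2]
train_reduced <- train[, keep, drop = FALSE]
test_reduced <- test[, keep, drop = FALSE]

# Give test factors exactly the training levels; unseen levels become NA
for (col in keep) {
  if (is.factor(train_reduced[[col]])) {
    test_reduced[[col]] <- factor(test_reduced[[col]],
                                  levels = levels(train_reduced[[col]]))
  }
}
```

Whether unseen levels should become NA or be merged into another category depends on the planned analysis, which is the open point raised above.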

What is wrong with this looping?

I can't understand what is wrong with this R code. I have several rows and columns, each holding a measurement or NA, and I basically want to get the min and max value in each row, looking across the several columns:
require(plyr)
census <- read.csv("sps_census.csv")
info <- read.csv("sps_info.csv")
for (i in 1: nrow(census)) {
trans <- census[i,c("dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8", "dbh9")]
index.1 <- which (trans != "NA") #some NAs are in the data
census$min.dbh <- min(trans[1,index.1])
census$min.dbh.index <- min(index.1)
census$max.dbh <- max(trans[1,index.1])
census$max.dbh.index <- max(index.1)
}

In this line (and the other three similar lines):
census$min.dbh <- min(trans[1,index.1])
you are assigning the same value to an entire column, which is clearly not what you intend.
Perhaps you want something like this:
census$min.dbh[i] <- min(trans[1,index.1])
Note that you can use apply to do this sort of thing. It would be a lot easier for someone to write a working apply example if you supplied example data (i.e., made your question a reproducible example).

You can use apply:
index <- c("dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8", "dbh9") #or paste("dbh",1:9,sep="")
census$min.dbh <- apply(census[index], 1, min, na.rm=TRUE)
census$min.dbh.index <- apply(census[index], 1, function(x){ min(which(!is.na(x))) })
census$max.dbh <- apply(census[index], 1, max, na.rm=TRUE)
census$max.dbh.index <- apply(census[index], 1, function(x){ max(which(!is.na(x))) })
Note that I'm using is.na(x) instead of x != "NA".
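Since the question has no example data, here is a small made-up census with three dbh columns (an assumption for illustration; the real file has nine) showing what the apply calls return. As in the question, the index columns hold the position of the first and last non-NA measurement in each row.

```r
# Made-up data: three rows, three measurement columns, some NAs
census <- data.frame(
  dbh1 = c(10, NA, 7),
  dbh2 = c(12, 5, NA),
  dbh3 = c(NA, 9, 8)
)
index <- paste0("dbh", 1:3)

census$min.dbh <- apply(census[index], 1, min, na.rm = TRUE)
census$max.dbh <- apply(census[index], 1, max, na.rm = TRUE)
census$min.dbh.index <- apply(census[index], 1, function(x) min(which(!is.na(x))))
census$max.dbh.index <- apply(census[index], 1, function(x) max(which(!is.na(x))))

census$min.dbh        # 10 5 7
census$max.dbh        # 12 9 8
census$min.dbh.index  # 1 2 1
census$max.dbh.index  # 2 3 3
```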
