data : https://drive.google.com/file/d/0B20HmmYd0lsFbnE4RUh6N0xtUHc/edit?usp=sharing
For the rows where dat$C is true, I want to remove items whose RT has a z-score of 3 or above within each s x S combination.
I thought of two ways to do this (the clean function and the plyr one-liner below), but one removes more rows than the other. Can somebody explain why my clean function does not agree with the plyr line?
dat <- read.table(file="dat.txt")
# 3SD clean function
clean <- function(df) {
dfc <- df[as.logical(df$C),]
n=tapply(df$RT,list(df$s,df$S),length)
ns=tapply(df$RT,list(df$s),length)
mn=tapply(df$RT,list(df$s,df$S),mean)
sd=tapply(df$RT,list(df$s,df$S),sd)
upper <- mn+3*sd
bad <- logical(dim(df)[1])
levs <- paste(df$s,df$S,sep=".")
for (i in levels(df$s)) for (j in levels(df$S)) {
lev <- paste(i,j,sep=".")
bad[levs==lev] <- df[levs==lev,"RT"] > upper[i,j]
}
df=df[!bad,]
nok=tapply(df$RT,list(df$s,df$S),length)
pbad=100-100*nok/n
print(aperm(round(pbad,1),c(2,1)))
nok=tapply(df$RT,list(df$s),length)
pbad=100-100*nok/ns
print(sort(round(pbad,1)))
print(mean(pbad,na.rm=T))
df
}
require(plyr)
str(ddply(dat,.(s,S,C),function(x) x[scale(x$RT)< 3.00,]))
str(clean(dat))
I was not able to download your sample data.
Assuming you have already calculated the z-scores and stored them in a column of the data frame (here called score), you could simply write
mydata[mydata$score < 3, ]
and that should be enough.
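The per-group z-score that this answer assumes can be computed with base R's ave(). The following is only a sketch on made-up data: the data frame toy, its values and the planted extreme RTs are assumptions for illustration, not the asker's file. It groups by s and S only. Note that the plyr line in the question also groups by C and keeps z < 3, while clean keeps RT <= mean + 3*sd over s and S, so those two would not be expected to remove the same rows.

```r
set.seed(42)

# Hypothetical toy data: 2 subjects (s) x 2 stimuli (S), 20 trials per cell
toy <- data.frame(
  s  = factor(rep(c("s1", "s2"), each = 40)),
  S  = factor(rep(rep(c("left", "right"), each = 20), times = 2)),
  RT = rexp(80) + 0.3
)
toy$RT[c(5, 45)] <- 50  # plant two extreme response times

# z-score of RT within each s x S cell; ave() returns a vector aligned with the rows
toy$z <- ave(toy$RT, toy$s, toy$S, FUN = function(v) (v - mean(v)) / sd(v))

# keep only the rows whose within-cell z-score is below 3
kept <- toy[toy$z < 3, ]
nrow(toy) - nrow(kept)  # number of rows removed
```

Because ave() keeps the original row order, the z column can be used directly as a filter without merging group statistics back in.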
I am trying to simulate changes to a data frame over several steps, where each step depends on the previous one. Here is a very simple example to illustrate my problem.
I create a dataframe with two columns
a=runif(10)
b=runif(10)
data_1=data.frame(a,b)
data_1
a b
1 0.94922669 0.47418098
2 0.26702201 0.79179699
3 0.57398333 0.25158378
4 0.52724079 0.61531202
5 0.03999831 0.95233479
6 0.15171673 0.64564561
7 0.51353129 0.75676464
8 0.60312432 0.85318316
9 0.52900913 0.06297818
10 0.75459362 0.40209925
Then I would like to run n steps, where each step creates a new data frame at i+1 that is a function (let's call it "whatever") of the data frame at i: data_2 is a transformation of data_1, data_3 a transformation of data_2, etc.
iterations=function(nsteps)
{
lapply(1:nsteps,function(i)
{
data_i+1=whatever(data_i)
})
}
Whatever function I use, I get an error message saying:
Error in whatever(data_i) : object 'data_i' not found
Can someone help me figure out what I am missing?
See if you can get some inspiration from the following example.
First, a whatever function to be applied to the previous dataframe.
whatever <- function(DF) {
DF[[2]] <- DF[[2]]*2
DF
}
Now the function you want. I have added an extra argument, the data frame x.
The function starts by creating the list to be returned. Each element of data_list is a data frame computed from the previous one.
iterations <- function(nsteps, x){
data_list <- vector("list", length = nsteps)
data_list[[1]] <- x
for(i in seq_len(nsteps)[-1]){
data_list[[i]] <- whatever(data_list[[i - 1]])
}
names(data_list) <- sprintf("data_%d", seq_len(nsteps))
data_list
}
And apply iterations to an example dataframe.
df1 <- data.frame(A = letters[1:10], X = 1:10)
iterations(10, df1)
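The same chain of steps can also be written without an explicit loop by using base R's Reduce() with accumulate = TRUE. This is only a sketch of an alternative, not a replacement for the loop above: the name iterations_reduce is made up here, whatever is repeated so the block runs on its own, and it assumes nsteps is at least 2.

```r
# Same example transformation as above: double the second column
whatever <- function(DF) {
  DF[[2]] <- DF[[2]] * 2
  DF
}

# Hypothetical alternative to iterations(); assumes nsteps >= 2
iterations_reduce <- function(nsteps, x) {
  stopifnot(nsteps >= 2)
  # Reduce() feeds each result back into the next call; accumulate = TRUE
  # keeps every intermediate data frame, starting with x itself
  data_list <- Reduce(function(prev, step) whatever(prev),
                      seq_len(nsteps - 1), accumulate = TRUE, init = x)
  names(data_list) <- sprintf("data_%d", seq_len(nsteps))
  data_list
}

df1 <- data.frame(A = letters[1:3], X = 1:3)
res <- iterations_reduce(3, df1)
res$data_3
```

The step index passed by Reduce() is ignored here because whatever depends only on the previous data frame.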
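You might be looking for a combination of assign, get and paste (data_i itself is never defined, so the previous data frame has to be looked up by name):
assign(paste("data_", i + 1, sep = ""), whatever(get(paste("data_", i, sep = ""))))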
I am new to R and have written a function that needs to run multiple times to generate the final dataset.
The number of runs is determined by the vector of unique years, and on each run the function produces an output for that year.
I am still not getting the right output.
Desired output: if, for example, it takes 10 samples from each year, then after the 10th run I should have 100 rows of correct output.
create_strsample <- function(n1,n2){
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <-sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE)
r2<-sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE)
#final.data <-k1[c(r1,r2),]
sample.data <- lapply(yr, function(x) {f.data<-create_strsample(200,800)})
k1 <- do.call(rbind,k1)
return(k1)
}
final <- do.call(rbind,sample.data)
return(final)
}
stratified.sample.data <- create_strsample(200,800)
An MWE would have been nice, but I'll give you a template for this kind of question. Note that this is not optimized for speed (or anything else), only for ease of understanding.
As noted in the comments, that call to create_strsample in the loop looks weird and probably isn't what you really want.
data <- data.frame() # we need an empty, but existing variable for the first loop iteration
for (i in 1:10) {
temp <- runif(1,max=i) # do something...
data <- rbind(data,temp) # ... and add this to 'data'
} # repeat 10 times
rm(temp) # don't need this anymore
That return(k1) in the loop also looks wrong.
Following your suggestion, @herbaman, I later tried this, which gives the desired output without the lapply.
create_strsample <- function(n1,n2){
final.data <- NULL
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <- k1[sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE), ]
r2 <- k1[sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE), ]
sample.data <- rbind(r1,r2)
final.data <- rbind(final.data, sample.data)
}
return(final.data)
}
stratified.sample.data <- create_strsample(200,800)
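For comparison, here is a sketch of the same per-year sampling written with split(), lapply() and a single do.call(rbind, ...) at the end, which avoids growing the result inside the loop. The accounts data frame and the function name sample_by_year are made up for illustration; only the two column names come from the question.

```r
set.seed(1)

# Hypothetical toy data: 4 opening years, 10 accounts each, half of them closed
accounts <- data.frame(
  account_opening_date = rep(c("2010-03-01", "2011-06-15",
                               "2012-09-30", "2013-01-20"), each = 10),
  account_closing_date = rep(c(NA, "2014-01-01"), times = 20),
  stringsAsFactors = FALSE
)

sample_by_year <- function(df, n_closed, n_open) {
  yr <- format(as.Date(df$account_opening_date), "%Y")
  pieces <- lapply(split(df, yr), function(k) {
    closed <- which(!is.na(k$account_closing_date))
    open   <- which(is.na(k$account_closing_date))
    # sample positions via sample.int() so that a single candidate row
    # is not misread by sample() as "sample from 1:n"
    k[c(closed[sample.int(length(closed), n_closed)],
        open[sample.int(length(open), n_open)]), ]
  })
  do.call(rbind, pieces)  # one rbind at the end instead of one per iteration
}

res <- sample_by_year(accounts, n_closed = 2, n_open = 3)
table(format(as.Date(res$account_opening_date), "%Y"))
```

As in the original function, this stops with an error if a year has fewer closed or open accounts than requested.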
I am trying to write a function that injects outliers into an existing data frame.
I started by creating a new data frame, outs, using the max and min values of the original data frame. This outs data frame will contain a certain amount of outlier data.
Later I want to inject the outlier values from outs into the original data frame.
What I want to end up with is a function that injects a certain amount of outliers into an original data frame.
I have several problems. First, I do not know whether I am using runif correctly to create the data frame of outliers; second, I do not know how to inject the outliers into temp.
The code I have tried so far is:
addOutlier <- function (data, amount){
maxi <- apply(data, 2, function(x) (mean(x)+(3*(sd(x)))))
mini <- apply(data, 2, function(x) (mean(x)-(3*(sd(x)))))
temp <- data
amount2 <- ifelse(amount<1, (prod(dim(data))*amount), amount)
outs <- runif(amount2, 2, min = mini, max = maxi) # outliers
if (amount2 >= prod(dim(data))) stop("exceeded data size")
for (i in 1:length(outs))
temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- outs
return (temp)
}
Any help to make this work will be deeply appreciated.
My understanding is that you are trying to add a set amount of outliers to each column of your data frame. Alternatively, you also seem to be looking into adding a percentage of outliers to each column. I wrote a solution only for the former case, but the latter should be pretty easy to implement if you really need it. Note how I broke things down into two functions, to (hopefully) help clarify what is going on. Hope this helps!
add.outlier.to.vector <- function(vector, amount) {
cells.to.modify <- sample(1:length(vector), amount, replace=F)
mean.val <- mean(vector)
sd.val <- sd(vector)
min.val <- mean.val - 3 * sd.val
max.val <- mean.val + 3 * sd.val
vector[cells.to.modify] <- runif(amount, min=min.val, max=max.val)
return(vector)
}
add.outlier.to.data.frame <- function (temp, amount){
for (i in 1:ncol(temp)) {
temp[,i] <- add.outlier.to.vector(temp[,i], amount)
}
return (temp)
}
data <- data.frame(
a=c(1,2,3,4),
b=c(7,8,9,10)
)
add.outlier.to.data.frame(data, 2)
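The percentage variant mentioned in this answer might look like the sketch below. It follows the same approach as add.outlier.to.vector (replacement values drawn uniformly between mean - 3*sd and mean + 3*sd); the function name add.outlier.fraction and the rounding rule are assumptions made here, not part of the original answer.

```r
# Hypothetical variant: replace a fraction (0..1) of the cells instead of a fixed count
add.outlier.fraction <- function(vector, fraction) {
  stopifnot(fraction >= 0, fraction <= 1)
  amount <- round(length(vector) * fraction)     # number of cells to overwrite
  cells.to.modify <- sample.int(length(vector), amount)
  # compute the range from the original values, before anything is overwritten
  min.val <- mean(vector) - 3 * sd(vector)
  max.val <- mean(vector) + 3 * sd(vector)
  vector[cells.to.modify] <- runif(amount, min = min.val, max = max.val)
  vector
}

set.seed(1)
v <- as.numeric(1:20)
out <- add.outlier.fraction(v, 0.25)  # overwrites 5 of the 20 values
out
```

To apply it to every column of a data frame, it can be dropped into the same loop that add.outlier.to.data.frame uses.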
My dataset looks something like this:
a <- rnorm(2)
b <- rnorm(2)-3
x <- rnorm(13)
y <- rnorm(2)-1
z <- rnorm(2)-2
eg <- expand.grid(a,b,x,y,z)
treatment <- c(rep(1, 2), rep(0,3))
eg <- data.frame(t(eg))
row.names(eg) <- NULL
eg <- cbind(treatment, eg)
What I need to do is run t-tests on each column, comparing the treatment =1 group to the treatment=0 group. I'd like to then have a vector of p-values. I've tried (several versions of) doing this through a loop, but I continue to receive the same error message: "undefined columns selected." Here's my code currently:
p.values <- c(rep(NA, 208))
for (i in 2:209) {
x <- data.frame(eg[eg$treatment==1][,i][1:2])
y <- data.frame(eg[eg$treatment==0][,i][3:5])
value <- t.test(x=x, y=y)['p.value']
p.values[i] <- value
}
I added the data.frame() after reading someone mention that for loops only loop through dataframes, but it didn't change anything. I am sure there is an easier way to do this, perhaps by using something in the apply family? Does anyone have any suggestions? Thanks so much!
A couple of options, both using sapply:
sapply(
eg[-1], function(x) t.test(x[eg$treatment==1],x[eg$treatment==0])[["p.value"]]
)
Or looping over the names instead:
sapply(
names(eg[-1]),
function(x) t.test(as.formula(paste(x,"~ treatment")),data=eg)[["p.value"]]
)
Or even mapply:
mapply(function(x,y) t.test(x ~ y,data=cbind(x,y))[["p.value"]], eg[-1], eg[1])
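For a self-contained check, the first option can be run on data built the same way as in the question. This sketch swaps sapply for vapply, which additionally verifies that every column yields exactly one numeric p-value; the seed and the name p.values are arbitrary choices made here.

```r
set.seed(123)

# Rebuild example data as in the question: 5 rows, a treatment column plus 208 value columns
a <- rnorm(2)
b <- rnorm(2) - 3
x <- rnorm(13)
y <- rnorm(2) - 1
z <- rnorm(2) - 2
eg <- data.frame(t(expand.grid(a, b, x, y, z)))
row.names(eg) <- NULL
eg <- cbind(treatment = c(rep(1, 2), rep(0, 3)), eg)

# One Welch t-test per column, comparing treatment == 1 rows with treatment == 0 rows
p.values <- vapply(
  eg[-1],
  function(col) t.test(col[eg$treatment == 1], col[eg$treatment == 0])$p.value,
  numeric(1)
)

length(p.values)  # one p-value per non-treatment column
head(p.values)
```

The result is a named numeric vector, so individual columns can be looked up by name, for example p.values["X1"].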
I am trying to assign values from a data frame into a matrix. Columns 2 and 3 map to the rows and columns of the matrix, respectively. This is not working because sim.mat does not store the values.
score <- function(x, sim.mat) {
r <- as.numeric(x[2])
c <- as.numeric(x[3])
sim.mat[r,c] <- as.numeric(x[4])
}
mat <- apply(sim.data, 1, score, sim.mat)
Is this the right approach? If so, how can I get it to work?
No need for apply, try this:
score <- function(x, sim.mat) {
r <- as.numeric(x[[2]])
c <- as.numeric(x[[3]])
sim.mat[cbind(r,c)] <- as.numeric(x[[4]])
sim.mat
}
mat <- score(sim.data, sim.mat)
Check the "Matrices and arrays" section of ?"[" for documentation.
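To see the matrix-indexing form in action, here is a small sketch with made-up inputs. The contents of sim.data and the 3 x 3 size of sim.mat are assumptions for illustration; only the column positions (2 = row, 3 = column, 4 = value) follow the question.

```r
# Hypothetical inputs: column 2 holds the row index, column 3 the column index, column 4 the value
sim.data <- data.frame(
  id    = c("a", "b", "c"),
  row   = c(1, 2, 3),
  col   = c(2, 3, 1),
  value = c(0.5, 0.7, 0.9)
)
sim.mat <- matrix(0, nrow = 3, ncol = 3)

# A two-column index matrix addresses one (row, column) cell per line,
# so all values are assigned in a single vectorised step
idx <- cbind(sim.data[[2]], sim.data[[3]])
sim.mat[idx] <- sim.data[[4]]
sim.mat
```

If the same (row, column) pair appears more than once in the index matrix, the last value assigned to that cell is the one that remains.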
If you really want to use apply as you did, your function needs to modify sim.mat in the enclosing environment with <<-:
score <- function(x, sim.mat) {
r <- as.numeric(x[2])
c <- as.numeric(x[3])
sim.mat[r,c] <<- as.numeric(x[4])
}
apply(sim.data, 1, score, sim.mat)
sim.mat
This type of programming where functions have side-effects is really not recommended.