I am new to R and have a question. To check a variable for outliers, we generally use:
boxplot(train$rate)
Here rate is a variable in my dataset and train is the dataset's name. When I have many variables, say 100 or 150, checking each one for outliers individually is very time-consuming. Is there a function that shows the outliers of all 100 variables in one boxplot?
If so, which function removes the outliers of all those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2 <- train # Copy the old dataset
outvalue <- list() # Create two empty lists
outindex <- list()
for (i in seq_len(ncol(train2))) { # For every column in your dataset
  outvalue[[i]] <- boxplot(train2[, i])$out # Plot and get the outlier values
  outindex[[i]] <- which(train2[, i] %in% outvalue[[i]]) # Get the outlier indices
  train2[outindex[[i]], i] <- NA # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you don't want to plot the data and just want the outliers, you could look into other outlier functions. The extremevalues package has a function, getOutliers, that takes a different approach to identifying outliers and doesn't require a plot:
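library(extremevalues) # provides getOutliers
outRight <- list()
outLeft <- list()
for (i in seq_len(ncol(train2))) {
  res <- getOutliers(train2[, i]) # Run the detection once per column
  outRight[[i]] <- res$iRight # Indices of outliers on the right (high) side
  outLeft[[i]] <- res$iLeft # Indices of outliers on the left (low) side
  train2[outRight[[i]], i] <- NA
  train2[outLeft[[i]], i] <- NA
}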
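If plotting is the slow part, base R's boxplot.stats applies the same rule as boxplot without drawing anything. The following is a minimal sketch under assumptions: the toy train data frame is made up for illustration, and only numeric columns are processed.

```r
# Toy data standing in for the asker's dataset (values are invented)
train <- data.frame(rate  = c(1, 2, 3, 4, 100),
                    score = c(10, 11, 12, 13, 14),
                    label = letters[1:5])

train2 <- train
# Only numeric columns can have boxplot outliers
is_num <- vapply(train2, is.numeric, logical(1))

# Replace boxplot-rule outliers with NA, column by column, without plotting
train2[is_num] <- lapply(train2[is_num], function(col) {
  col[col %in% boxplot.stats(col)$out] <- NA
  col
})
```

Keeping the original train untouched makes it easy to run the analysis both with and without the flagged values.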
The function boxplot returns a value. If you look at the Value section of its help page, you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!train$rate %in% bp$out] # to remove the outliers (also safe when there are none)
I also would not do that. Outliers are data, and they are normal and likely to occur. By eliminating them you no longer take the entirety of your data into account, which is bad practice.
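As for seeing every variable at once: boxplot also accepts a whole data frame and draws one box per column, and its return value records which column each outlier came from. A hedged sketch with invented data (the train values below are assumptions, and plot = FALSE is used here only so the example runs without a graphics device):

```r
# Toy data standing in for the asker's dataset (values are invented)
train <- data.frame(rate  = c(1, 2, 3, 4, 100),
                    score = c(10, 11, 12, 13, 14),
                    label = letters[1:5])

# Keep numeric columns only; boxplot cannot handle character columns
train_num <- train[vapply(train, is.numeric, logical(1))]

# Drop plot = FALSE to draw all boxes side by side in one figure
bp <- boxplot(train_num, plot = FALSE)

# bp$group gives the column index of each value in bp$out
out_by_var <- split(bp$out, factor(bp$names[bp$group], levels = bp$names))
out_by_var
```

With 100 or more columns the single figure gets crowded, so the list of outliers per variable is usually the more practical output.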
I'm trying to get the correlation coefficient for corresponding columns of two CSV files. I simply use the following but get an error. Each CSV file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error: 'x' must be numeric
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel", pairing up their elements. If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical variable in X. So you can first separate that categorical variable from X and then use X in the cor() function.
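Putting the two answers together, one possible sketch keeps only the numeric columns and passes method = "spearman" through mapply's MoreArgs argument. The two small data frames are invented for illustration; the real ones would come from read.csv.

```r
# Invented stand-ins for the two CSV files, each with a non-numeric column
first_values  <- data.frame(a = c(1, 2, 3, 4, 5),
                            b = c(2, 1, 4, 3, 5),
                            grp = c("x", "y", "x", "y", "x"))
second_values <- data.frame(a = c(2, 4, 6, 8, 10),
                            b = c(5, 4, 3, 2, 1),
                            grp = c("x", "y", "x", "y", "x"))

# Names of the numeric columns (assumes both files share the same layout)
num_cols <- names(first_values)[vapply(first_values, is.numeric, logical(1))]

# One Spearman coefficient per pair of corresponding columns
rho <- mapply(cor, first_values[num_cols], second_values[num_cols],
              MoreArgs = list(method = "spearman"))
rho
```

The result is a named numeric vector with one entry per column pair, here 1 for a and -0.8 for b.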
Dear friends, I would appreciate it if someone could help me with a question about R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time (order does not matter, so these are combinations rather than permutations).
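To cover every subset size rather than just 3, combn can be called once per size and the results collected into one list. A minimal sketch, assuming the variables are named V1 to V8 as above:

```r
varnames <- paste0("V", 1:8)

# For each size m = 1..8, list all combinations of that size, then flatten
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 255, i.e. 2^8 - 1 non-empty subsets
```

Each element of subsets is a character vector of column names, so yourdataframe[subsets[[k]]] would select the k-th group of variables.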
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table) # provides as.data.table and setnames
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})
#exclude the first element, which is NULL (it corresponds to the empty subset)
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
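The hand-built indicator matrix above could also be generated with expand.grid, which generalizes to any nn. This is only a sketch of an alternative, not the answer's own method, and the subsets come out in a different row order (the first variable varies fastest).

```r
nn <- 8L
varnames <- paste0("V", seq_len(nn))

# One row per subset, one logical column per variable: 2^nn rows in total
grid <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Column names included in each subset; the first row is the empty subset
incl <- lapply(seq_len(nrow(grid)), function(r) varnames[grid[r, ]])
```

The resulting incl list can be used in the same way as the one built above, for example to subset the table once per element.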
I am currently working on code that applies to various datasets from an experiment covering a wide range of variables, which might not all be present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function that retains the columns present in the input dataset and deletes the rest. Here is an example of what I want to achieve:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part, "write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[, names(df) %in% possible, drop = FALSE] # drop = FALSE keeps a data frame even if only one column remains
As akrun wrote, use intersect(x,y) or
x[x %in% y]
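For reference, here is how the suggested calls behave on the vectors from the question. setdiff is added only to show the complementary operation (the elements that would be dropped).

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

kept    <- x[x %in% y]    # elements of x that also occur in y
same    <- intersect(x, y) # same result here, with duplicates removed
dropped <- setdiff(x, y)   # elements of x that do not occur in y

kept     # "c" "f" "g"
dropped  # "a" "b" "d" "e"
```

The difference between the first two only shows up when x contains repeated values: %in% keeps the repeats, while intersect returns each value once.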
I have two columns of paired values in a data frame. I want to bin the data in one column using the cut2 function from the Hmisc package so that there are at least, say, 25 data points in each bin. However, I also need the corresponding values from the other column. Is there a convenient way to do that in R? I have to bin column B.
A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
.........
It's not clear what you mean by wanting the "corresponding values of the other column". The first part is easy to accomplish using the g (# of groups) argument:
dfrm$Agrp <- cut2(dfrm$A, g=trunc(length(dfrm$A)/25) )
You can aggregate means or medians of B within Agrp's using tapply or ave or one of the Hmisc summary functions. There are several worked examples in one of today's questions: How to get Summary statistics by group as well as many other examples of using those functions or aggregate or the pkg:plyr functions.
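A rough illustration of the aggregation step, using only base R so that it runs without Hmisc. Note the assumptions: the data are random toy values, and cut with quantile breaks is a stand-in for cut2(dfrm$A, g = ...), so the interval labels will not match what cut2 prints.

```r
# Toy data standing in for the asker's two columns (values are random)
set.seed(1)
dfrm <- data.frame(A = rnorm(100), B = runif(100))

# Quantile-based groups of about 25 rows each (base-R stand-in for cut2)
n_groups <- trunc(nrow(dfrm) / 25)
breaks <- quantile(dfrm$A, probs = seq(0, 1, length.out = n_groups + 1))
dfrm$Agrp <- cut(dfrm$A, breaks = breaks, include.lowest = TRUE)

# Summaries of B within each group of A
mean_B   <- tapply(dfrm$B, dfrm$Agrp, mean)
median_B <- aggregate(B ~ Agrp, data = dfrm, FUN = median)
```

tapply returns a named vector with one mean per group, while aggregate returns a data frame, which is often easier to merge back into other results.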
Given that the number of B values will not necessarily be constant across groups the only way I can think to deliver the individual values by A-grouped-value would be with split. I added an extra row to illustrate that a non-even split might need to return a list rather than a more "rectangular" object :
library(Hmisc) # provides cut2
dat <- read.table(text="A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348\n 3.5943 3.796", header=TRUE)
dat$Agrp <- cut2(dat$A, g=trunc(length(dat$A)/3) )
split(dat$B, dat$Agrp)
#-----
$`[-16.49, 2.63)`
[1] 1.680173 1.868667 1.900002 2.005348
$`[ 2.63,11.01]`
[1] 1.866603 1.960929 3.796000
If you want the vector of values on which the splits were done then that can be accomplished by using regex on levels(dat$Agrp).
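One way the regex step might look, as a sketch: the level strings are copied from the output shown above so that the example runs without Hmisc; in practice lv would be levels(dat$Agrp).

```r
# Level labels as printed above; in practice: lv <- levels(dat$Agrp)
lv <- c("[-16.49, 2.63)", "[ 2.63,11.01]")

# Pull out every (possibly negative) decimal number from each label
nums <- regmatches(lv, gregexpr("-?[0-9]+\\.?[0-9]*", lv))

# Unique, sorted cut points across all labels
cuts <- sort(unique(as.numeric(unlist(nums))))
cuts  # -16.49  2.63 11.01
```

The recovered values are the rounded endpoints that appear in the labels, not necessarily the exact data values used for the split.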