How to remove outliers along with factor variables? - r

I am trying to remove outliers from a dataset s consisting of 3 variables:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
b 10 winter
c 12 winter
I used outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out] to remove the outliers from s and got something like this:
consumption
0.1
0.2
0.3
0.4
However, I want to get something like this below:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
But subsetting with $out this way only returns the numeric column, without the factor columns.

I found a solution: take the vector of non-outlier values returned by s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out], which I call l here:
consumption
0.1
0.2
0.3
0.4
Knowing the largest non-outlier value, I can then take a subset of s with the condition that consumption is at most max(l). This works here because all the outliers are on the high side.
new <- subset(s, consumption <= max(l))
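A more direct route, sketched here under assumptions, is to use the same logical test as a row index on the whole data frame, so that id and period are kept along with consumption. The data below is a made-up, slightly larger variant of s (chosen so that boxplot.stats actually flags 10 and 12), and the names out_vals, keep and s_clean are illustrative.

```r
# Hypothetical data in the same shape as s (values are made up)
s <- data.frame(
  id          = c("a", "a", "b", "a", "b", "c", "a", "b", "c", "b", "c"),
  consumption = c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.1, 0.4, 0.3, 10, 12),
  period      = c(rep("summer", 5), rep("winter", 6))
)

# Values that boxplot.stats flags as outliers (beyond 1.5 * IQR from the hinges)
out_vals <- boxplot.stats(s$consumption)$out

# Logical vector: TRUE for rows whose consumption is not an outlier
keep <- !s$consumption %in% out_vals

# Index rows (note the trailing comma) so every column is retained
s_clean <- s[keep, ]
s_clean
```

Because the filter is applied to rows rather than to the consumption vector, it also handles outliers on the low side without any extra condition.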

Related

R - How to create a new column in a dataframe with calculations based on condition of another column

In a project, I measured the iodine concentration of tumors (column=ROI_IC) at different off-center positions (column=Offcenter; table heights) in a CT scanner. I know the true concentration of each of the tumors (column=Real_IC; there are 4 different tumors with 4 different Real_IC concentrations). Each tumor was measured at each off-center position 10 times (column=Measurement_repeat). I calculated the absolute error between the measured iodine concentration and the real iodine concentration (column=absError_IC).
This is just the head of the data:
Offcenter Measurement_repeat Real_IC ROI_IC absError_IC
1 0 1 0.0 0.4 0.4
2 0 2 0.0 0.3 0.3
3 0 3 0.0 0.3 0.3
4 0 4 0.0 0.0 0.0
5 0 5 0.0 0.0 0.0
6 0 6 0.0 -0.1 0.1
7 0 7 0.0 -0.2 0.2
8 0 8 0.0 -0.2 0.2
9 0 9 0.0 -0.1 0.1
10 0 10 0.0 0.0 0.0
11 0 1 0.4 0.4 0.0
12 0 2 0.4 0.3 0.1
13 0 3 0.4 0.2 0.2
14 0 4 0.4 0.0 0.4
15 0 5 0.4 0.0 0.4
16 0 6 0.4 -0.1 0.5
17 0 7 0.4 0.1 0.3
18 0 8 0.4 0.3 0.1
19 0 9 0.4 0.6 0.2
20 0 10 0.4 0.7 0.3
Now I would like to create a new column called corrError_IC.
In this column, the measured iodine concentration (ROI_IC) should be corrected by the mean absolute error (mean of 10 measurements) that was found for that specific Real_IC concentration at Offcenter = 0.
Because there are 4 tumor concentrations, there are 4 mean values at Offcenter = 0 that I want to apply to the other off-center values.
mean1=mean of the 10 absError-IC measurements of the `Real_IC=0`
mean2=mean of the 10 absError-IC measurements of the `Real_IC=0.4`
mean3=mean of the 10 absError-IC measurements of the `Real_IC=3`
mean4=mean of the 10 absError-IC measurements of the `Real_IC=5`
Basically, I want the average absolute error for a specific tumor at Offcenter = 0 (there are 4 different tumor types with 4 different Real_IC values), and then I want to correct all tumors at the other Offcenter positions by these absolute error values derived from the Offcenter = 0 data.
I tried ifelse statements but I was not able to figure it out.
EDIT: Off-center has specific levels: c(-6,-4,-3,-2,-1,0,1,2,3,4,6)
Here is how I would approach this problem:
1. Compute the mean of absError_IC grouped by Real_IC.
2. Left join the original data.frame with the grouped means.
Code Example
## replicate sample data sets
ROI_IC = c(0.4, 0.3, 0.3, 0.0, 0.0, -0.1, -0.2, -0.2, -0.1, 0.0,
0.4, 0.3, 0.2, 0.0, 0.0, -0.1, 0.1, 0.3, 0.6, 0.7)
df = data.frame("Offcenter"=rep(0, 40),
"Measurement_repeat"=rep( c(1:10), 4),
"Real_IC"=rep( c(0,0.4,3,5), each=10),
"ROI_IC"=rep(ROI_IC, 2),
stringsAsFactors=F)
df$absError_IC = abs(df$Real_IC - df$ROI_IC)
## compute mean of "absError_IC" grouped by "Real_IC", using only the Offcenter == 0 rows
## (x and by must be subset the same way, otherwise their lengths differ)
at_center = df$Offcenter == 0
mean_values = aggregate(df$absError_IC[at_center],
                        by=list("Real_IC"=df$Real_IC[at_center]),
                        FUN=mean)
names(mean_values)[which(names(mean_values)=="x")] = "MAE"
## left join to append column
df = merge(df, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=TRUE, all.y=FALSE, sort=FALSE)
## notice that column order shifts based on "key"
df[c(1:5, 10:15), ]
I suggest using the data.table package, which is particularly useful when you need to manipulate large data sets.
library(data.table)
dt = as.data.table(df)  ## df as built above, before the MAE merge; or dt = fread(<path>)
## dt[, absError_IC := abs(Real_IC - ROI_IC)]
## compute grouped mean from the Offcenter == 0 rows
mean_values = dt[Offcenter == 0, list("MAE"=mean(absError_IC)), by=list(Real_IC)]
## left join
dt = merge(dt, mean_values, by="Real_IC", all.x=TRUE, all.y=FALSE, sort=FALSE)
Consider ave for inline aggregation. Its first argument is the numeric vector, the following arguments are the grouping fields, and the last argument, which must be passed by name as FUN, is the function to apply: ave(num_vector, ..., FUN=func).
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=mean))
To handle NAs, wrap the function so that it passes na.rm:
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=function(x) mean(x, na.rm=TRUE)))
I found a way to compute what I want by creating an extra column taking the average absolute errors from the 4 Real_IC levels for Off-center = 0 and matching them whenever Real_IC has a certain level.
In a second step, I subtract these from the ROI_ICs. However, how can I simplify that code to a more general form (at the moment I calculate the average absErrors based on their row location)? Sorry I am an absolute beginner ;(
Of note: My data.frame is called "ds_M"
#Define absolute errors for the 4 Real_IC levels as variables
average1<-mean(ds_M$absError_IC[1:10]) #for Real_IC=0
average2<-mean(ds_M$absError_IC[11:20]) #for Real_IC=0.4
average3<-mean(ds_M$absError_IC[21:30]) #for Real_IC=3
average4<-mean(ds_M$absError_IC[31:40]) #for Real_IC=5
# New column assigning the correction factor to each Real_IC level
ds_M$absCorr[ds_M$Real_IC==0]<-average1
ds_M$absCorr[ds_M$Real_IC==0.4]<-average2
ds_M$absCorr[ds_M$Real_IC==3]<-average3
ds_M$absCorr[ds_M$Real_IC==5]<-average4
# Calculate new column with corrected ROI_ICs
ds_M$corrError_IC<-ds_M$ROI_IC - ds_M$absCorr
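One way to generalize this, sketched under assumptions, is to compute the Offcenter = 0 means per Real_IC level with tapply and then look them up by level instead of by row position. The ds_M built below is simulated stand-in data (two off-center positions, random ROI_IC values), and the names ref and mae are illustrative; only the last four statements are the actual technique.

```r
# Simulated stand-in for ds_M: 2 off-center positions x 4 tumors x 10 repeats
set.seed(1)
ds_M <- expand.grid(Measurement_repeat = 1:10,
                    Real_IC = c(0, 0.4, 3, 5),
                    Offcenter = c(0, 2))
ds_M$ROI_IC <- round(ds_M$Real_IC + rnorm(nrow(ds_M), sd = 0.3), 1)
ds_M$absError_IC <- abs(ds_M$ROI_IC - ds_M$Real_IC)

# Mean absolute error per Real_IC level, using only the Offcenter == 0 rows
ref <- ds_M[ds_M$Offcenter == 0, ]
mae <- tapply(ref$absError_IC, ref$Real_IC, mean)

# Look up each row's correction by its Real_IC level (names of mae are "0", "0.4", ...)
ds_M$absCorr <- as.numeric(mae[as.character(ds_M$Real_IC)])

# Corrected values, as in the row-position version above
ds_M$corrError_IC <- ds_M$ROI_IC - ds_M$absCorr
```

This does not depend on row order or on the number of Real_IC levels, so it keeps working if tumors or off-center positions are added.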

Clustering a set of countries based on cultural similarity on R

I am having some problems trying to cluster countries using a sort of cultural correlation that I already have.
Basically, the dataset looks like this, with 90 rows (one per country) and 91 columns (90 country columns plus one identifying the nation in each row):
Nation Ita Fra Ger Esp Eng ...
Ita NA 0.2 0.1 0.6 0.4 ...
Fra 0.2 NA 0.2 0.1 0.3 ...
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA
... .....
...
I am looking for an algorithm that clusters my countries into groups (for instance groups of 3 or, even better, more flexible clusters where neither the number of clusters nor the number of countries per cluster is fixed ex ante),
so that the output is, for instance:
Nation cluster
Ita 1
Fra 2
Ger 3
Esp 1
Eng 3
......
#DATA
df1 = read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
"Nation Ita Fra Ger Esp Eng
Ita NA 0.2 0.1 0.6 0.4
Fra 0.2 NA 0.2 0.1 0.3
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA")
df1 = replace(df1, is.na(df1), 0)
row.names(df1) = df1[,1]
df1 = df1[,-1]
# Run PCA to visualize similarities
pca = prcomp(as.matrix(df1))
pca_m = as.data.frame(pca$x)
plot(pca_m$PC1, pca_m$PC2)
text(x = pca_m$PC1, pca_m$PC2, labels = row.names(df1))
# Run k-means and choose centers based on pca plot
kk = kmeans(x = df1, centers = 3)
kk$cluster
# Ita Fra Ger Esp Eng
# 3 1 2 1 1
Hierarchical Agglomerative Clustering (HAC), one of the oldest clustering methods, can also be implemented with similarity instead of distance.
Conceptually, you always search for the maximum similarity (e.g., Ita and Ger) and merge that pair, repeating until the desired number of clusters remains.
In your case, though, it's probably easier to just use 1 - sim as the distance and rely on the existing implementations.
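The 1 - sim route could look like the sketch below. It makes several assumptions that are not in the question: the sample matrix is not symmetric, so it is symmetrized by averaging with its transpose; average linkage and k = 3 are arbitrary illustrative choices; and the names sim, hc and clusters are hypothetical.

```r
# Similarity matrix from the question's sample (diagonal is NA)
sim <- matrix(c( NA, 0.2, 0.1, 0.6, 0.4,
                0.2,  NA, 0.2, 0.1, 0.3,
                0.7, 0.1,  NA, 0.5, 0.4,
                0.6, 0.1, 0.5,  NA, 0.2,
                0.4, 0.3, 0.4, 0.2,  NA),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("Ita", "Fra", "Ger", "Esp", "Eng"),
                              c("Ita", "Fra", "Ger", "Esp", "Eng")))

# Assumption: make the matrix symmetric by averaging, and set self-similarity to 1
sim <- (sim + t(sim)) / 2
diag(sim) <- 1

# Turn similarity into a distance and run hierarchical clustering
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree into a chosen number of clusters (use h = ... to cut by height instead)
clusters <- cutree(hc, k = 3)
clusters
```

Cutting by height with cutree(hc, h = ...) rather than by k leaves the number of clusters and their sizes free, which is closer to the flexible grouping asked for.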
You might consider using spectral clustering, which is k-means applied to the leading eigenvectors (those with the smallest eigenvalues) of the Laplacian of your similarity graph. https://en.wikipedia.org/wiki/Spectral_clustering

Selecting rows with time in R

I have a data frame that looks like this:
Subject Time Freq1 Freq2 ...
A 6:20 0.6 0.1
A 6:30 0.1 0.5
A 6:40 0.6 0.1
A 6:50 0.6 0.1
A 7:00 0.3 0.4
A 7:10 0.1 0.5
A 7:20 0.1 0.5
B 6:00 ... ...
I need to delete the rows whose time is not in the range from 7:00 to 7:30. So in this case, all the 6:00, 6:10, 6:20... rows.
I have tried creating a data frame with just the times I want to keep, but R does not seem to recognize the times as a number or as a name. And I get the same error when trying to directly remove the ones I don't need. It is probably quite simple, but I haven't found any solution.
Any suggestions?
We can convert the Time column to a Period class with the lubridate package and then filter the data frame on that column, keeping only the rows from 7:00 to 7:30.
library(dplyr)
library(lubridate)
dat2 <- dat %>%
  mutate(HM = hm(Time)) %>%
  filter(HM >= hm("7:00"), HM <= hm("7:30")) %>%
  select(-HM)
dat2
#   Subject Time Freq1 Freq2
# 1       A 7:00   0.3   0.4
# 2       A 7:10   0.1   0.5
# 3       A 7:20   0.1   0.5
DATA
dat <- read.table(text = "Subject Time Freq1 Freq2
A '6:20' 0.6 0.1
A '6:30' 0.1 0.5
A '6:40' 0.6 0.1
A '6:50' 0.6 0.1
A '7:00' 0.3 0.4
A '7:10' 0.1 0.5
A '7:20' 0.1 0.5
B '6:00' NA NA",
header = TRUE)
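If adding packages is not an option, a base R sketch is to convert the "H:MM" strings to minutes since midnight and compare numbers. The helper to_minutes and the variable names below are hypothetical, and the sketch assumes every Time value has the form hours:minutes.

```r
# Same sample data as above, built directly as a data frame
dat <- data.frame(
  Subject = c("A", "A", "A", "A", "A", "A", "A", "B"),
  Time    = c("6:20", "6:30", "6:40", "6:50", "7:00", "7:10", "7:20", "6:00"),
  Freq1   = c(0.6, 0.1, 0.6, 0.6, 0.3, 0.1, 0.1, NA),
  Freq2   = c(0.1, 0.5, 0.1, 0.1, 0.4, 0.5, 0.5, NA)
)

# Hypothetical helper: "H:MM" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) 60 * as.numeric(p[1]) + as.numeric(p[2]), numeric(1))
}

mins <- to_minutes(dat$Time)

# Keep only the rows from 7:00 to 7:30 inclusive
kept <- dat[mins >= 7 * 60 & mins <= 7 * 60 + 30, ]
kept
```

The as.character call is there so the helper also works when Time was read in as a factor.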

Calculate average within a specified range

I am using the 'diamonds' dataset from ggplot2 and want to find the average of the 'carat' column. However, I want the average within each 0.1-wide range:
Between
0.2 and 0.29
0.3 and 0.39
0.4 and 0.49
etc.
You can use aggregate to compute the mean by group, where the group is carat %/% 0.1 (the whole number of 0.1 steps) multiplied back by 0.1:
library(ggplot2)
averageBy <- 0.1
aggregate(diamonds$carat, list(diamonds$carat %/% averageBy * averageBy), mean)
This gives the mean for each 0.1-wide bin:
Group.1 x
1 0.2 0.2830764
2 0.3 0.3355529
3 0.4 0.4181711
4 0.5 0.5341423
5 0.6 0.6821408
6 0.7 0.7327491
...
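One caveat with %/% on decimals: 0.1 is not exactly representable in floating point, so a value sitting exactly on a bin edge, such as 0.3, can be assigned to the bin below. A hedged workaround, assuming carat is recorded to two decimals, is to bin on integer hundredths. The carat vector below is made up for illustration and is not the diamonds data; bin and res are illustrative names.

```r
# Made-up carat values, including values exactly on bin edges (0.3, 0.4)
carat <- c(0.2, 0.23, 0.29, 0.3, 0.31, 0.39, 0.4)

# Convert to whole hundredths first, then integer-divide by 10 to get the bin
bin <- (round(carat * 100) %/% 10) / 10

# Mean of carat within each bin
res <- aggregate(carat, list(bin = bin), mean)
res
```

With the real data the same expression would be applied to diamonds$carat in place of the made-up vector.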

r data simulations using parLapply

I am trying to run a simulation using the following code and data. Each iteration draws n from rpois and then n values from rbeta. This part works fine.
The only issue is that for each of the n values it should pick an ID from the table below by probability-weighted sampling (id_prob) with sample, but for some reason it only gets one ID for all n values.
library(parallel)
Sims<-10000
cl <- makeCluster(num_cores)
clusterEvalQ(cl,library(evir))
## export after Sims is defined; the closing parenthesis was missing
clusterExport(cl, varlist=c("Sims","ID","id_prob","beta_a","beta_b"))
set.seed(0)
system.time(x1<-parLapply(cl,1:Sims, function(i){
  id<-sample(ID,1,replace=TRUE,prob=id_prob)
  n<-rpois(1,9)
  rbeta(n,beta_a[id],beta_b[id])
}
))
ID Rate id_prob Beta_a Beta_b
1 1.5 16.7% 0.5 0.5
2 2 22.2% 0.4 0.4
3 1 11.1% 0.3 0.3
4 1.5 16.7% 0.6 0.6
5 2 22.2% 0.1 0.1
6 1 11.1% 0.2 0.2
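The code above calls sample with size 1, so a single ID is drawn per iteration and reused for all n values. A possible fix, sketched here, is to draw n first and then sample n IDs. The vectors are typed in from the table, with id_prob taken as Rate / sum(Rate), which matches the percentages shown; sim_one is a hypothetical name, and the sketch runs with lapply and fewer iterations so that it needs no cluster.

```r
# Inputs typed in from the table above
ID      <- 1:6
id_prob <- c(1.5, 2, 1, 1.5, 2, 1) / 9   # 16.7%, 22.2%, 11.1%, ...
beta_a  <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)
beta_b  <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)

# One iteration: draw n, then one weighted ID per simulated value
sim_one <- function(i) {
  n  <- rpois(1, 9)
  id <- sample(ID, n, replace = TRUE, prob = id_prob)
  data.frame(id = id, value = rbeta(n, beta_a[id], beta_b[id]))
}

set.seed(0)
x1 <- lapply(1:1000, sim_one)
head(x1[[1]])
```

The same function can be passed to parLapply(cl, 1:Sims, sim_one) once ID, id_prob, beta_a and beta_b have been exported to the workers. Note that set.seed on the master does not seed the workers; parallel::clusterSetRNGStream is the usual way to get reproducible parallel draws.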
