I am having trouble clustering countries using a measure of cultural correlation that I already have.
The dataset has 90 rows and 91 columns (90 country columns plus one that identifies the nation in each row), and it looks like this:
Nation Ita Fra Ger Esp Eng ...
Ita NA 0.2 0.1 0.6 0.4 ...
Fra 0.2 NA 0.2 0.1 0.3 ...
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA
... .....
...
I am looking for an algorithm that clusters my countries into groups (for instance groups of 3 or, even better, more flexible clusters, where neither the number of clusters nor the number of countries per cluster is fixed ex ante),
so that the output is, for instance:
Nation cluster
Ita 1
Fra 2
Ger 3
Esp 1
Eng 3
......
#DATA
df1 = read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
"Nation Ita Fra Ger Esp Eng
Ita NA 0.2 0.1 0.6 0.4
Fra 0.2 NA 0.2 0.1 0.3
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA")
df1 = replace(df1, is.na(df1), 0)
row.names(df1) = df1[,1]
df1 = df1[,-1]
# Run PCA to visualize similarities
pca = prcomp(as.matrix(df1))
pca_m = as.data.frame(pca$x)
plot(pca_m$PC1, pca_m$PC2)
text(x = pca_m$PC1, pca_m$PC2, labels = row.names(df1))
# Run k-means and choose centers based on pca plot
kk = kmeans(x = df1, centers = 3)
kk$cluster
# Ita Fra Ger Esp Eng
# 3 1 2 1 1
Hierarchical agglomerative clustering (HAC), one of the oldest clustering methods, can also be implemented with similarities instead of distances.
Conceptually, you repeatedly find the most similar pair (e.g., Ita and Ger) and merge it, until the desired number of clusters remains.
In your case, though, it's probably easier to just use 1 - sim as the distance and rely on the existing implementations.
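As a rough sketch of the 1 - sim idea, the snippet below feeds the five-country example into hclust() and cuts the tree with cutree(). Two choices here are assumptions and not part of the question: the matrix is symmetrized by averaging the two directions (the sample has Ita/Ger = 0.1 but Ger/Ita = 0.7), and average linkage is used.

```r
# Similarity matrix from the question's five-country example
nations <- c("Ita", "Fra", "Ger", "Esp", "Eng")
sim <- matrix(c(
   NA, 0.2, 0.1, 0.6, 0.4,
  0.2,  NA, 0.2, 0.1, 0.3,
  0.7, 0.1,  NA, 0.5, 0.4,
  0.6, 0.1, 0.5,  NA, 0.2,
  0.4, 0.3, 0.4, 0.2,  NA),
  nrow = 5, byrow = TRUE, dimnames = list(nations, nations))

# Assumption: average the two directions so the matrix is symmetric
sim_sym <- (sim + t(sim)) / 2
diag(sim_sym) <- 1                      # a country is fully similar to itself

# Turn similarity into a dissimilarity and run HAC
d  <- as.dist(1 - sim_sym)
hc <- hclust(d, method = "average")     # linkage choice is an assumption

# Either fix the number of clusters ...
clusters <- cutree(hc, k = 3)
# ... or cut at a height, so the number of clusters is not fixed in advance
clusters_by_height <- cutree(hc, h = 0.6)

data.frame(Nation = names(clusters), cluster = clusters, row.names = NULL)
```

Cutting by height (h) rather than by k is one way to get the "flexible" number of clusters asked for; the threshold 0.6 is only a placeholder.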
You might consider using spectral clustering, which is k-means applied to the leading eigenvectors of the graph Laplacian built from your similarity matrix (for the unnormalized Laplacian, these are the eigenvectors with the smallest eigenvalues). https://en.wikipedia.org/wiki/Spectral_clustering
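A minimal base-R sketch of unnormalized spectral clustering on the same five-country example follows. It is illustrative only: the symmetrization by averaging, the choice k = 2, and the use of the unnormalized Laplacian are all assumptions.

```r
nations <- c("Ita", "Fra", "Ger", "Esp", "Eng")
sim <- matrix(c(
    0, 0.2, 0.1, 0.6, 0.4,
  0.2,   0, 0.2, 0.1, 0.3,
  0.7, 0.1,   0, 0.5, 0.4,
  0.6, 0.1, 0.5,   0, 0.2,
  0.4, 0.3, 0.4, 0.2,   0),
  nrow = 5, byrow = TRUE, dimnames = list(nations, nations))

# Assumption: symmetrize the weights; the diagonal stays 0 (no self-loops)
W <- (sim + t(sim)) / 2

# Unnormalized graph Laplacian L = D - W
L <- diag(rowSums(W)) - W

# eigen() returns eigenvalues in decreasing order,
# so the k smallest ones are in the last k columns
k   <- 2
n   <- nrow(W)
eig <- eigen(L, symmetric = TRUE)
U   <- eig$vectors[, (n - k + 1):n, drop = FALSE]

# k-means on the rows of the spectral embedding
set.seed(1)
km <- kmeans(U, centers = k, nstart = 20)
spectral_clusters <- setNames(km$cluster, nations)
spectral_clusters
```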
I am trying to remove outliers from a dataset s consisting of 3 variables:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
b 10 winter
c 12 winter
I used outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out] to remove the outliers from s and got something like this:
consumption
0.1
0.2
0.3
0.4
However, I want to get something like this below:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
But indexing with $out this way only returns the numeric column, without the id and period columns (which are factors).
I found a workaround: take the output of outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out], which I call l here:
consumption
0.1
0.2
0.3
0.4
Knowing the largest retained value, I can then take a subset of s by requiring consumption to be no greater than max(l).
new <- subset(s, consumption <= max(l))
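One way to keep every column while dropping the outlier rows is to build a logical index from boxplot.stats() and subset the whole data frame with it. The sketch below uses made-up data: the question's example is extended to ten rows so that the default 1.5 * IQR rule actually flags 10 and 12. Only the column names come from the question.

```r
# Illustrative data (values are invented; column names follow the question)
s <- data.frame(
  id          = c("a", "a", "b", "a", "b", "c", "a", "b", "b", "c"),
  consumption = c(0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 10, 12),
  period      = c(rep("summer", 4), rep("winter", 6)),
  stringsAsFactors = FALSE
)

# Values flagged by the boxplot rule
out_vals <- boxplot.stats(s$consumption)$out

# Index rows, not the single column, so id and period are kept
s_clean <- s[!s$consumption %in% out_vals, ]
s_clean
```

Unlike a cut-off on the minimum or maximum, this also handles outliers at both ends of the distribution.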
In a project, I measured the iodine concentration of tumors (column=ROI_IC) at different off-center positions (column=Offcenter), i.e. table heights, in a CT scanner. I know the true concentration of each of the tumors (column=Real_IC; there are 4 different tumors with 4 different Real_IC concentrations). Each tumor was measured at each off-center position 10 times (column=Measurement_repeat). I calculated the absolute error between the measured iodine concentration and the real iodine concentration (column=absError_IC).
This is just the head of the data:
Offcenter Measurement_repeat Real_IC ROI_IC absError_IC
1 0 1 0.0 0.4 0.4
2 0 2 0.0 0.3 0.3
3 0 3 0.0 0.3 0.3
4 0 4 0.0 0.0 0.0
5 0 5 0.0 0.0 0.0
6 0 6 0.0 -0.1 0.1
7 0 7 0.0 -0.2 0.2
8 0 8 0.0 -0.2 0.2
9 0 9 0.0 -0.1 0.1
10 0 10 0.0 0.0 0.0
11 0 1 0.4 0.4 0.0
12 0 2 0.4 0.3 0.1
13 0 3 0.4 0.2 0.2
14 0 4 0.4 0.0 0.4
15 0 5 0.4 0.0 0.4
16 0 6 0.4 -0.1 0.5
17 0 7 0.4 0.1 0.3
18 0 8 0.4 0.3 0.1
19 0 9 0.4 0.6 0.2
20 0 10 0.4 0.7 0.3
Now I would like to create a new column called corrError_IC.
In this column, the measured iodine concentration (ROI_IC) should be corrected based on the mean absolute error (mean of 10 measurements) that was found for that specific Real_IC concentration at Offcenter = 0
Because there are 4 tumor concentrations there are 4 mean values at Off-center =0 that I want to apply on the other off-center-values.
mean1=mean of the 10 absError_IC measurements of the `Real_IC=0`
mean2=mean of the 10 absError_IC measurements of the `Real_IC=0.4`
mean3=mean of the 10 absError_IC measurements of the `Real_IC=3`
mean4=mean of the 10 absError_IC measurements of the `Real_IC=5`
Basically, I want the average absolute error for a specific tumor at Offcenter = 0 (there are 4 different tumor types with four different Real_IC values), and then I want to correct all tumors at the other Offcenter positions by these absolute error values derived from the Offcenter = 0 data.
I tried ifelse statements but I was not able to figure it out.
EDIT: Off-center has specific levels: c(-6,-4,-3,-2,-1,0,1,2,3,4,6)
Here is how I would approach this problem:
1. Compute the mean of absError_IC at Offcenter == 0, grouped by Real_IC.
2. Left join the original data.frame with the grouped means.
Code example:
## replicate sample data sets
ROI_IC = c(0.4, 0.3, 0.3, 0.0, 0.0, -0.1, -0.2, -0.2, -0.1, 0.0,
0.4, 0.3, 0.2, 0.0, 0.0, -0.1, 0.1, 0.3, 0.6, 0.7)
df = data.frame("Offcenter"=rep(0, 40),
"Measurement_repeat"=rep( c(1:10), 4),
"Real_IC"=rep( c(0,0.4,3,5), each=10),
"ROI_IC"=rep(ROI_IC, 2),
stringsAsFactors=F)
df$absError_IC = abs(df$Real_IC - df$ROI_IC)
## compute mean of "absError_IC" at Offcenter == 0, grouped by "Real_IC"
## (filter values and grouping key together so their lengths always match)
ref = df[df$Offcenter==0, ]
mean_values = aggregate(ref$absError_IC,
by=list("Real_IC"=ref$Real_IC),
FUN=mean)
names(mean_values)[which(names(mean_values)=="x")] = "MAE"
## left join to append column
df = merge(df, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=TRUE, all.y=FALSE, sort=FALSE)
## notice that column order shifts based on "key"
df[c(1:5, 10:15), ]
I suggest using the data.table package, which is particularly useful when you need to manipulate large data.
library(data.table)
## dt = data.table(df) or dt = fread(<path>)
## dt[, absError_IC := abs(Real_IC - ROI_IC)]
## compute grouped mean, using only the Offcenter == 0 rows
mean_values = dt[Offcenter==0, list("MAE"=mean(absError_IC)), by=list(Real_IC)]
## left join
dt = merge(dt, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=TRUE, all.y=FALSE, sort=FALSE)
Consider ave for inline aggregation. Its first argument is the numeric vector, the following arguments are the grouping variables, and the last argument, which must be passed by name as FUN, is the summary function: ave(num_vector, ..., FUN=func).
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=mean))
To handle NAs, wrap the function so that it passes na.rm:
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=function(x) mean(x, na.rm=TRUE)))
I found a way to compute what I want by creating an extra column taking the average absolute errors from the 4 Real_IC levels for Off-center = 0 and matching them whenever Real_IC has a certain level.
In a second step, I subtract these from the ROI_ICs. However, how can I simplify that code to a more general form (at the moment I calculate the average absErrors based on their row location)? Sorry I am an absolute beginner ;(
Of note: My data.frame is called "ds_M"
#Define absolute errors for the 4 Real_IC levels as variables
average1<-mean(ds_M$absError_IC[1:10]) #for Real_IC=0
average2<-mean(ds_M$absError_IC[11:20]) #for Real_IC=0.4
average3<-mean(ds_M$absError_IC[21:30]) #for Real_IC=3
average4<-mean(ds_M$absError_IC[31:40]) #for Real_IC=5
# New column assigning the correction factor to each Real_IC level
ds_M$absCorr[ds_M$Real_IC==0]<-average1
ds_M$absCorr[ds_M$Real_IC==0.4]<-average2
ds_M$absCorr[ds_M$Real_IC==3]<-average3
ds_M$absCorr[ds_M$Real_IC==5]<-average4
# Calculate new column with corrected ROI_ICs
ds_M$corrError_IC<-ds_M$ROI_IC - ds_M$absCorr
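One possible way to generalize the code above, so that it no longer depends on row positions or hard-coded Real_IC levels, is to compute the Offcenter = 0 means with tapply() and look them up by name. The data below is simulated purely for illustration (the noise level and the three Offcenter values are assumptions); only the column names follow the question.

```r
# Simulated stand-in for ds_M (values are invented)
set.seed(42)
ds_M <- expand.grid(Measurement_repeat = 1:10,
                    Real_IC = c(0, 0.4, 3, 5),
                    Offcenter = c(-1, 0, 1))
ds_M$ROI_IC      <- ds_M$Real_IC + rnorm(nrow(ds_M), sd = 0.2)
ds_M$absError_IC <- abs(ds_M$ROI_IC - ds_M$Real_IC)

# Mean absolute error per Real_IC level, using only the Offcenter == 0 rows
ref <- ds_M[ds_M$Offcenter == 0, ]
mae <- tapply(ref$absError_IC, ref$Real_IC, mean)

# Look up the matching mean for every row by Real_IC level
ds_M$absCorr <- as.numeric(mae[as.character(ds_M$Real_IC)])

# Corrected values, as in the original code
ds_M$corrError_IC <- ds_M$ROI_IC - ds_M$absCorr
head(ds_M)
```

The lookup by as.character(Real_IC) relies on the levels printing identically in both places, which holds for values such as 0, 0.4, 3 and 5.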
I have data like below. I wish to extract the first and last year from each string here called my.string. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result below the example data set. I am using R.
When a string contains two years, those years are contained within a portion of the string that looks like this: ga49.51 or ea22.24.
When a string contains only one year, that year is contained in a portion of the string that looks like this: time11.
I know a bit about regex, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.
EDIT
Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.
my.data <- read.table(text = '
my.string cov1 cov2
42:Alpha:ga6.8 -0.1 2.2
43:Alpha:ga9.11 -2.5 0.6
44:Alpha:ga30.32 -1.3 0.5
45:Alpha:ga49.51 -2.5 0.6
50:Alpha:time1:ga.time -1.7 0.9
51:Alpha:time2:ga.time -1.5 0.8
52:Alpha:time3:ga.time -1.0 1.0
2:Beta:ea2.9 -1.7 0.6
3:Beta:ea17.19 -5.0 0.8
4:Beta:ea22.24 -6.4 1.0
8:Beta:as 0.2 0.6
9:Beta:sd 1.7 0.4
12:Beta:time1:ea.tim -2.6 1.8
13:Beta:time10:ea.ti -3.6 1.1
14:Beta:time11:ea.ti -3.1 0.7
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
desired.result <- read.table(text = '
my.string cov1 cov2 time1 time2
42:Alpha:ga6.8 -0.1 2.2 6 8
43:Alpha:ga9.11 -2.5 0.6 9 11
44:Alpha:ga30.32 -1.3 0.5 30 32
45:Alpha:ga49.51 -2.5 0.6 49 51
50:Alpha:time1:ga.time -1.7 0.9 1 NA
51:Alpha:time2:ga.time -1.5 0.8 2 NA
52:Alpha:time3:ga.time -1.0 1.0 3 NA
2:Beta:ea2.9 -1.7 0.6 2 9
3:Beta:ea17.19 -5.0 0.8 17 19
4:Beta:ea22.24 -6.4 1.0 22 24
8:Beta:as 0.2 0.6 NA NA
9:Beta:sd 1.7 0.4 NA NA
12:Beta:time1:ea.tim -2.6 1.8 1 NA
13:Beta:time10:ea.ti -3.6 1.1 10 NA
14:Beta:time11:ea.ti -3.1 0.7 11 NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
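I suggest using the stringr package to extract the data you need, since it handles NA values better and also allows a constrained-width lookbehind. Note that str_extract returns character vectors, so wrap the calls in as.integer() if you need numeric columns as in desired.result: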
> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
my.string cov1 cov2 time1 time2
1 42:Alpha:ga6.8 -0.1 2.2 6 8
2 43:Alpha:ga9.11 -2.5 0.6 9 11
3 44:Alpha:ga30.32 -1.3 0.5 30 32
4 45:Alpha:ga49.51 -2.5 0.6 49 51
5 50:Alpha:time1:ga.time -1.7 0.9 1 <NA>
6 51:Alpha:time2:ga.time -1.5 0.8 2 <NA>
7 52:Alpha:time3:ga.time -1.0 1.0 3 <NA>
8 2:Beta:ea2.9 -1.7 0.6 2 9
9 3:Beta:ea17.19 -5.0 0.8 17 19
10 4:Beta:ea22.24 -6.4 1.0 22 24
11 8:Beta:as 0.2 0.6 <NA> <NA>
12 9:Beta:sd 1.7 0.4 <NA> <NA>
13 12:Beta:time1:ea.tim -2.6 1.8 1 <NA>
14 13:Beta:time10:ea.ti -3.6 1.1 10 <NA>
15 14:Beta:time11:ea.ti -3.1 0.7 11 <NA>
The first regex matches:
(?<=time)\\d+ - 1+ digits that have time before them
| - or
(?<=\\b[ge]a)\\d+ - 1+ digits that have ga or ea in front, with a word boundary before the ga/ea
The second regex matches:
(?<=\\b[ge]a\\d{1,100}\\.) - check if the current position is preceded with ga or ea (starting at a word boundary) followed with 1 to 100 digits (I believe that should be enough for your scenario, 100-digit chunks are hardly expected here, you may even decrease the value), and then a .
\\d+ - 1+ digits
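For comparison, here is a base-R sketch of the idea from the question's EDIT: drop the number before the first colon, then take the remaining digit runs. It assumes that, after that prefix is removed, the only digits left are the years, which holds for the sample strings but is not guaranteed in general.

```r
my.string <- c("42:Alpha:ga6.8", "45:Alpha:ga49.51",
               "50:Alpha:time1:ga.time", "8:Beta:as",
               "14:Beta:time11:ea.ti")

# Remove the leading row number up to and including the first colon
rest <- sub("^\\d+:", "", my.string)

# Collect every run of digits that remains (a list, one element per string)
nums <- regmatches(rest, gregexpr("\\d+", rest))

# First and second run; indexing past the end yields NA
time1 <- as.integer(sapply(nums, `[`, 1))
time2 <- as.integer(sapply(nums, `[`, 2))

data.frame(my.string, time1, time2)
```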
Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:
Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*
Replace: $0\t$1\t$2\t$3
Breakdown:
.*(?: ... ).* ensures that the whole line is matched, and uses a non-capturing group for the main alternation
time(\d+): this is the first half of the alternation, capturing any digits after a "time"
(?:[ge]a)(\d+)\.(\d+): the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group
Replacement: $0 puts the whole line back. Each of the other capture groups is then appended, separated by tabs.
See regex101 example
I am trying to run a simulation using the following code and data. Each iteration draws n from rpois and then n values from rbeta. This part works fine.
The only issue is that each of the n values should get its own id from the table below, drawn by probability-weighted (id_prob) sampling with sample, but for some reason only one ID is used for all n values.
library(parallel)
Sims<-10000
cl <- makeCluster(num_cores)
clusterEvalQ(cl,library(evir))
clusterExport(cl, varlist=c("Sims","ID","id_prob","beta_a","beta_b"))
set.seed(0)
system.time(x1<-parLapply(cl,1:Sims, function(i){
id<-sample(ID,1,replace=TRUE,prob=id_prob)
n<-rpois(1,9)
rbeta(n,beta_a[id],beta_b[id])
}
))
ID Rate id_prob Beta_a Beta_b
1 1.5 16.7% 0.5 0.5
2 2 22.2% 0.4 0.4
3 1 11.1% 0.3 0.3
4 1.5 16.7% 0.6 0.6
5 2 22.2% 0.1 0.1
6 1 11.1% 0.2 0.2
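A likely cause is that sample(ID, 1, ...) draws a single id per iteration, before n is known. A minimal, non-parallel sketch of one possible fix is to draw n first and then sample n ids. The vectors below are transcribed from the table, with id_prob taken as Rate / 9 (an assumption about how the percentages were derived); sim_once is a hypothetical helper name.

```r
ID      <- 1:6
id_prob <- c(1.5, 2, 1, 1.5, 2, 1) / 9   # assumed to match the id_prob column
beta_a  <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)
beta_b  <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)

sim_once <- function(i) {
  n  <- rpois(1, 9)                                    # number of draws first
  id <- sample(ID, n, replace = TRUE, prob = id_prob)  # one id per draw
  # rbeta is vectorized over its shape arguments, so each value uses its own id
  data.frame(id = id, value = rbeta(n, beta_a[id], beta_b[id]))
}

set.seed(0)
x1 <- lapply(1:100, sim_once)
x1[[1]]
```

The same function body could be passed to parLapply; in that case clusterSetRNGStream(cl, 0) is the usual way to get reproducible streams on the workers, since set.seed in the master session does not seed them.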
I've combined the outputs for each user and item (for a recommendation system) into this all x all R data.table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard coding in the variable names.
Normally in a data.frame I could create a list and just input that, but as it's character format, the data.table doesn't like it. I've tried using a list with "with=FALSE" and have had some success when trying basic subsetting of the data.table but not with the correlation syntax that I need...
Any help is much, much appreciated!
Thanks,
Andrew
Here's what I would do:
mDT = melt(scoresDT,
id.vars = c("user","item"),
measure.vars = patterns("item_score_", "user_score_"),
value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
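If the wide layout has to stay, one alternative sketch is to hold the column names in character vectors, pull the two blocks out as matrices, and correlate them row by row. The names user_cols, item_cols and match_cor are made up for this example, and scoresDT is rebuilt here from the table in the question.

```r
library(data.table)

scoresDT <- data.table(
  user = rep(c("A", "B"), each = 3),
  item = rep(1:3, times = 2),
  user_score_1 = rep(c(0.5, -0.6), each = 3),
  user_score_2 = rep(c(0.6, -0.1), each = 3),
  user_score_3 = rep(c(-0.2, 0.9), each = 3),
  item_score_1 = rep(c(0.2, 0.4, -0.2), times = 2),
  item_score_2 = rep(c(0.8, 0.1, -0.4), times = 2),
  item_score_3 = rep(c(-0.3, -0.8, -0.1), times = 2)
)

# Column names built dynamically instead of being hard-coded in j
user_cols <- paste0("user_score_", 1:3)
item_cols <- paste0("item_score_", 1:3)

# Extract both blocks as numeric matrices
U <- as.matrix(scoresDT[, user_cols, with = FALSE])
I <- as.matrix(scoresDT[, item_cols, with = FALSE])

# Row-wise correlation between the user block and the item block
r <- sapply(seq_len(nrow(U)), function(k) cor(U[k, ], I[k, ]))
scoresDT[, match_cor := r]
scoresDT[, .(user, item, match_cor)]
```

This assumes one row per user/item pair, as in the table shown; the melt approach above remains the more idiomatic route when the number of score columns varies.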