I would like to explore the profile of two modalities of a categorical variable over time with respect to a given set of other categorical variables. I paste a reproducible example of such a dataset below.
set.seed(90114)
V1<-sample(rep(c("a", "A"), 100))
V2<-sample(rep(c("a", "A", "b", "B"), 50))
V3<-sample(rep(c("F", "M", "I"), 67), 200)
V4<-sample(rep(c("C", "R"), 100))
V5<-sample(rep(c(1970, 1980, 1990, 2000, 2010), 40))
data<-data.frame(V1, V2, V3, V4, V5)
To explore the behavior of these modalities, I decided to use Multiple Correspondence Analysis (package FactoMineR). To account for variation over time, one possibility is to split the dataset into 5 subsamples representing the different levels of V5 and then run MCA on each subset. The rest of the analysis consists of comparing the positions of the modalities across the different biplots. However, this practice is problematic if the original dataset is small: the dimensions can be flipped or, worse, the locations of the active variables are likely to change from one plot to the other.
To avoid the problem, one solution could be to stabilize the position of the active variables across all the subsets and predict the coordinates of the supplementary variable afterwards, allowing the latter to move over time. I read somewhere that the coordinates of a modality can be obtained by computing the weighted mean of the coordinates of the individuals in which this modality is found. Finding the coordinates of a modality for the year 1970 would then boil down to computing the weighted mean of the coordinates of the individuals in the 1970 subset that carry that modality. However, I don't know whether this is common practice, and if it is, I don't know how to implement the calculation. I paste the rest of the code so you can visualize the problem.
data.mca<-MCA(data[, -5], quali.sup=1, graph=F)
# Retrieve the coordinates of the first and second dimension
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
# Append the coordinates to the original dataframe
data1<-data.frame(data, DIM1, DIM2)
# Split the data into 5 clusters according to V5 ("year")
data1.split<-split(data1, data1$V5)
data1.split<-lapply(data1.split, function(x) x[, -5]) # drop the fifth column (the years), no longer needed
seventies<-as.data.frame(data1.split[1])
eightties<-as.data.frame(data1.split[2])
# ...
a.1970<-seventies[seventies$X1970.V1=="a",]
A.1970<-seventies[seventies$X1970.V1=="A",]
# The idea, then, is to find the coordinates of the modalities "a" and "A" by computing the weighted mean of their respective individuals for each subset. The arithmetic mean would yield
# a.1970.DIM1<-mean(a.1970$X1970.DIM1) # 0.0818
# a.1970.DIM2<-mean(a.1970$X1970.DIM2) # 0.1104
# and so on for the other levels of V5.
I thank you in advance for your help!
I found a solution to my problem. We can simply weight the mean of the coordinates by the values returned by row.w in FactoMineR. To account for the dilatation of the MCA, the resulting coordinates of the barycentres should be divided by the square root of the eigenvalue of the corresponding dimension.
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
WEIGHT<-data.mca$call$row.w
data1<-data.frame(data, WEIGHT, DIM1, DIM2)
# Splitting the dataset according to values of V1
v1_a<-data1[data1$V1=="a",]
v1_A<-data1[data1$V1=="A",]
# Computing the weighted average of the coordinates of Dim1 and Dim2 for the first category of V1
V1_a_Dim1<-sum(v1_a$WEIGHT*v1_a$DIM1)/100 # -0.0248
v1_a_Dim2<-sum(v1_a$WEIGHT*v1_a$DIM2)/100 # -0.0382
# Account for the dilatation of the dimensions...
V1_a_Dim1/sqrt(data.mca$eig[1,1])
[1] -0.03923839
v1_a_Dim2/sqrt(data.mca$eig[2,1])
[1] -0.06338353
# ... which is the same as the following:
categories<-data.mca$quali.sup$coord[, 1:2]
categories
# Dim 1 Dim 2
# V1_a -0.03923839 -0.06338353
# V1_A 0.03923839 0.06338353
This can be applied to different partitions of the data according to V5 or any other categorical variable.
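For completeness, here is a minimal sketch of that per-year calculation for the modality "a", splitting data1 (as defined above) on V5 and reusing the weights and eigenvalues from the global MCA. Dividing by the sum of the weights in each subset rather than a hard-coded count gives the same result here, since the default row weights are all 1.
# Sketch: per-year barycentres of "a" on the global MCA dimensions.
# Splitting on V5 keeps the active variables fixed and lets the
# supplementary category move over time.
by_year <- split(data1, data1$V5)
coords_a <- t(sapply(by_year, function(d) {
  d_a <- d[d$V1 == "a", ]
  c(Dim1 = sum(d_a$WEIGHT * d_a$DIM1) / sum(d_a$WEIGHT) / sqrt(data.mca$eig[1, 1]),
    Dim2 = sum(d_a$WEIGHT * d_a$DIM2) / sum(d_a$WEIGHT) / sqrt(data.mca$eig[2, 1]))
}))
coords_a  # one row of supplementary coordinates per level of V5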
I am trying to construct a meta-regression to look at the distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same grouping variable with two levels: before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with each distance to centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single standard deviation associated with each centroid.
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
rm(list=ls())
library(vegan)
library(dplyr)
# Species and environmental data
dune2.spe <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune2.env <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune)     # matrix with species data (20 samples in rows and 30 species in columns)
data(dune.env) # matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
# Select two grouping levels for 'Use'
dune_data <- cbind(dune2.spe, dune2.env) %>%
  filter(Use == 'Pasture' | Use == 'Hayfield')
dune_data$Use <- recode_factor(dune_data$Use, 'Pasture' = 'Before')
dune_data$Use <- recode_factor(dune_data$Use, 'Hayfield' = 'After')
dune_sp <- dune_data %>%
  dplyr::select(1:28)
dune_en <- dune_data %>%
  dplyr::select(29:33)
# Transform the species counts (Hellinger transformation)
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm=T)
(dune_disper <- betadisper(dune_distmat, type="centroid", group=dune_en$Use))
plot(dune_disper, label=FALSE)
I am trying to arrive at the following output:
Group  before_distance  n_before  sd_before  after_distance  n_after  sd_after
Dune   0.4009           5         ?          0.4314          7        ?
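For reference, one way to assemble such a summary is from the per-sample distances that the betadisper object stores in $distances, together with the grouping in $group. A minimal sketch (the column names match the desired output above and are otherwise my own):
# Sketch: per-group n, mean distance and sd from the betadisper object
dist_df <- data.frame(distance = dune_disper$distances,
                      group    = dune_disper$group)
dist_df %>%
  group_by(group) %>%
  summarise(mean_distance = mean(distance),
            n  = n(),
            sd = sd(distance))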
I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have two species, A and B, and want to find the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar datasets and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as the threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
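For reference, that by-hand check can be made systematic with a small sweep over the observed values as candidate thresholds (a sketch; it simply counts the correct assignments when "A" is predicted above the threshold):
# Sketch: brute-force threshold search over the observed values
thresholds <- sort(unique(D$Value))
correct <- sapply(thresholds, function(t)
  sum((D$Value > t) == (D$Species == "A")))
thresholds[which.max(correct)]  # threshold with the most correct assignments
max(correct)                    # number of correct assignments at that threshold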
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there perhaps a different or better approach to solving my problem? I have several values I need to do this for, so I really do not want to just test values at random until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your cutoff change as you adjust the threshold, and lets you select the threshold at which the classification performs best. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
I'm dealing with a dataframe of multiple rows (objects) over multiple columns (variables). I want to see whether any rows (objects) are correlated. From reading about cor(), it seems that for a single variable I can transpose my dataframe and feed it into cor(), but how do I deal with multiple variables per observation/object? The end goal is to plot the correlation matrix on a heatmap to eyeball interesting objects.
Examples as below:
Treatment <- c('Drug A','Drug B','Drug C','Drug D','Drug E','Drug F')
Measurment_V1 <- runif(6, 0, 3000)
Measurment_V2 <- runif(6, 0, 20)
Measurment_V3 <- runif(6, 0, 1)
Measurment_V4 <- runif(6, 0, 120000)
Measurment_V5 <- runif(6, 0, 100)
df<- as.data.frame(cbind(Treatment,Measurment_V1,Measurment_V2,Measurment_V3,Measurment_V4,Measurment_V5))
Each drug is described by measurements V1-V5 (in reality there are a few hundred columns).
So how can I get a correlation matrix between all the drugs and then plot it on a heatmap, as the Hmisc library can do?
This might do it:
# Redo your data frame
df <- data.frame(Treatment,Measurment_V1,Measurment_V2,Measurment_V3,Measurment_V4,Measurment_V5)
# Transpose numeric columns
dft <- as.data.frame(t(df[,2:6]))
# Rename vars
names(dft) <- c("Drug_A","Drug_B","Drug_C","Drug_D","Drug_E","Drug_F")
# Correlation matrix
cor(dft)
Output:
Drug_A Drug_B Drug_C Drug_D Drug_E Drug_F
Drug_A 1.0000000 0.9995697 0.9999240 0.9999939 0.9998902 0.9999665
Drug_B 0.9995697 1.0000000 0.9998554 0.9994612 0.9998946 0.9997758
Drug_C 0.9999240 0.9998554 1.0000000 0.9998748 0.9999969 0.9999911
Drug_D 0.9999939 0.9994612 0.9998748 1.0000000 0.9998324 0.9999320
Drug_E 0.9998902 0.9998946 0.9999969 0.9998324 1.0000000 0.9999777
Drug_F 0.9999665 0.9997758 0.9999911 0.9999320 0.9999777 1.0000000
You can then use the above correlation matrix to plot a heatmap.
Notice that I used data.frame() to rebuild your data frame: as.data.frame(cbind(...)) turns every column into character, whereas data.frame() keeps the measurement columns numeric.
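For the heatmap step mentioned above, a minimal sketch using base R's heatmap() (packages such as corrplot or Hmisc would work just as well):
# Sketch: heatmap of the drug-by-drug correlation matrix
drug_cor <- cor(dft)
heatmap(drug_cor, symm = TRUE)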
I think you're actually looking at this problem the wrong way around. You should be treating the drugs as the variables and investigating the correlation structure of the measurements.
I.e. the correlation matrix of interest is
cor(cbind(Measurment_V1, Measurment_V2, Measurment_V3, Measurment_V4, Measurment_V5))
One approach is to do PCA on the measurements so that you can place the drugs in a standardised space.
Then you could look for clustering of the drugs in that space to see which are similar to each other. Note that it is much harder to cluster in the original space of the measurements because they are on very different scales - you have to standardise them somehow, which is what the PCA can do. It also reduces the dimensionality of the measurement space, which will help you visualise what is going on.
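A minimal sketch of that idea, assuming the measurement vectors from the question are still in the workspace; prcomp() with scale. = TRUE handles the standardisation:
# Sketch: PCA on the standardised measurements, drugs plotted as scores
num <- data.frame(Measurment_V1, Measurment_V2, Measurment_V3,
                  Measurment_V4, Measurment_V5)
pca <- prcomp(num, scale. = TRUE)
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1:2], labels = Treatment)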
I am using the butter function from the signal package in R to filter my data.
The data is Fourier-transformed (complex numbers), and I want to filter out specific frequencies (I have a 256 Hz sampling frequency, hence a 128 Hz spectrum, and say I want to keep only the 4-8 Hz band).
I find conflicting documentation on what the correct values for W in this function should be:
butter(n, W, type = c("low", "high", "stop", "pass"),
plane = c("z", "s"), ...)
should they be 4/256 and 8/256?
I find it hard to interpret whether my results with 4/256 and 8/256 are correct.
strength <- ftdata
low <- 4
high <- 8
bf <- butter(5, low/256, type="high")
bfsig <- filter(bf, strength)
bf <- butter(5, high/256, type="low")
bfsig <- filter(bf, bfsig)
plot(f,bfsig, type='l')
# Where ftdata is just the output from my Fourier transform. The 5 is the order of the filter.
I just want to know whether the "/256" makes sense here.
EDIT:
Also, any idea why the below does not return the same result as the above?
filt <- butter(5,c(low/256,high/256), "pass")
bfsig <- filter(filt, strength)
It should just be a combination of the low and high-pass filters.
In DSP, frequency is usually expressed as a normalized frequency. For butter() in the signal package (which follows the Octave/Matlab convention), W is normalized to the Nyquist frequency, i.e. half the sampling rate, so W = 1 corresponds to 128 Hz here. The 4-8 Hz band therefore corresponds to W = 4/128 and 8/128, not 4/256 and 8/256.
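A minimal sketch on a synthetic time-domain signal (fs, tt and x are made-up names for illustration; the band edges are given as fractions of the Nyquist frequency):
# Sketch: 4-8 Hz band-pass on a signal sampled at 256 Hz
library(signal)
fs <- 256                                                # sampling rate in Hz
tt <- seq(0, 2, by = 1 / fs)                             # 2 seconds of samples
x  <- sin(2 * pi * 6 * tt) + rnorm(length(tt), sd = 0.2) # 6 Hz tone plus noise
bp   <- butter(5, c(4, 8) / (fs / 2), type = "pass")     # W relative to fs/2 = 128 Hz
x_bp <- filtfilt(bp, x)                                  # zero-phase band-pass filtering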
To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items I counted the number of times they occur in the same order; the highest count was 489 for one pair. I then "calculated" the similarity/correlation as frequency / "max frequency over all pairs" (489). Now I have the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best one in this case. I also tried Jaccard's coefficient/index, but got almost the same results.
The dataset: The dataset contains material numbers V1 and V2, and N is the similarity between the two material numbers, ranging from 0 to 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested it: you have more than 95% of pairs without information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be mitigated using a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think that for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: The clustering does not do very well. I get very concentrated clusters; a lot of items end up in the first cluster. Also, according to the usual ways of validating PAM clusters, my clustering results are poor. Is this due to the similarity analysis? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?
The whole code and results:
# Suppose X is the dataset
library(data.table)
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
library(cluster)
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
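For what it's worth, one common way to quantify how weak (or not) this cluster structure is, is the average silhouette width that pam() already computes; a sketch (silhouette() comes from the cluster package loaded above):
# Sketch: silhouette-based check of the PAM solution
pam2$silinfo$avg.width   # average silhouette width (values near 0 indicate weak structure)
plot(silhouette(pam2))   # per-cluster silhouette plot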
I'm not sure where you get the 696 number from. After you rbind, you have a data frame with 567 unique values for V1 and V2; you then perform the dcast and end up with a 567 x 567 matrix, as expected. Clustering-wise, I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see why PAM is applicable here (as I mentioned in the comments, PAM works best for mixed variable types).
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red.
# Color intensity and circle size are proportional to the correlation coefficients.
# The legend on the right of the correlogram shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
# Apply a correlation filter at 0.80
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.