Related
I applied the UMAP dimentionaity reduction over my data, and clustred it. I got three different clusters:
I have the data that specefices to which cluster does eahc sample belong, with the name of the sample and everything. Here is a subsample of it, let's call it df_cluster:
structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
The samples of df_cluster are the same in the original data, data, which I used for the clustering. Which is basically just the samples you saw as rows, and features as columns, looks something like this:
structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
I just want to view each of those features (the columns of data), in each cluster, using a box plot or a violin plot. Kind of a comparison between the clusters.
So in the X-axis I'll have clusters 1, 2, and 3, the Y-axis would be the values. Each feature will get a plot. I've drawn an example by hand to make it more clear:
You could use facets.
But first you need to pivot the dataframe.
df_cluster <- structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
data <- structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
library(tidyverse)
data %>%
as.data.frame() %>%
rownames_to_column("Patient") %>%
left_join(df_cluster %>% rownames_to_column("Patient") %>% select(Patient, cluster)) %>%
pivot_longer(- c(cluster, Patient)) %>% #Pivot the dataframe
ggplot(aes(as.factor(cluster), value)) +
geom_boxplot() +
facet_grid(~ name)
I have some data that looks like this: https://i.imgur.com/hzEd7bT.png
These will be pro league of legends matches once they occur over the course of the next few months. I filled out a few as examples.
Rows 6-10 are champions that each team banned. Rows 11-15 are champions that each team picked.
Each week has about 10 games and there are 9 weeks.
The B and R at the top are Blue (side) and Red (side) in the game. Blue side always gets first choice of champion and red side always gets last choice.
I want to find the best (or worst) synergizing champions
To clarify what I mean by this, in my screenshot the team with Brand and Yuumi won both times while the team with Aurelion Sol and Azir lost both times.
Optimally, I want to know how many times a 2, 3, 4, or 5 characters were picked and the corresponding winrate.
Edit: I am not sure exactly how the data needs to look in R because I have never done this before, but I made two different versions of inputting it below.
LoLGames <- matrix(c('W','Annie','Ezreal','Yuumi','Camille','Brand',
'L','Nasus', 'Aurelion Sol', 'Azir', 'Blitzcrank', 'Caitlyn',
'L','Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille',
'W','Bard', 'Ashe', 'Yuumi', 'Kogmaw', 'Brand'),
ncol = 6, byrow = TRUE)
colnames(LoLGames) <- c("Result","Champ1","Champ2","Champ3","Champ4","Champ5")
rownames(LoLGames) <- c("Game1","Game2","Game3","Game4")
LoLGames <- as.table(LoLGames)
*Corresponding dput
structure(c("W", "L", "L", "W", "Annie", "Nasus", "Nasus", "Bard",
"Ezreal", "Aurelion Sol", "Aurelion Sol", "Ashe", "Yuumi", "Azir",
"Blitzcrank", "Yuumi", "Camille", "Blitzcrank", "Ezreal", "Kogmaw",
"Brand", "Caitlyn", "Camille", "Brand"), .Dim = c(4L, 6L), .Dimnames = list(
c("Game1", "Game2", "Game3", "Game4"), c("Result", "Champ1",
"Champ2", "Champ3", "Champ4", "Champ5")), class = "table")
Result <- c('W','L','L','W',NA)
G1W <- c('Annie','Ezreal','Yuumi','Camille','Brand')
G1L <- c('Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille')
G2L <- c('Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille')
G2W <- c('Bard', 'Ashe', 'Yuumi', 'Kogmaw', 'Brand')
LoLDf <- data.frame(Result, G1W, G1L, G2L, G2W)
*Corresponding dput
structure(list(Result = structure(c(2L, 1L, 1L, 2L, NA), .Label = c("L",
"W"), class = "factor"), G1W = structure(c(1L, 4L, 5L, 3L, 2L
), .Label = c("Annie", "Brand", "Camille", "Ezreal", "Yuumi"), class = "factor"),
G1L = structure(c(5L, 1L, 2L, 4L, 3L), .Label = c("Aurelion Sol",
"Blitzcrank", "Camille", "Ezreal", "Nasus"), class = "factor"),
G2L = structure(c(5L, 1L, 2L, 4L, 3L), .Label = c("Aurelion Sol",
"Blitzcrank", "Camille", "Ezreal", "Nasus"), class = "factor"),
G2W = structure(c(2L, 1L, 5L, 4L, 3L), .Label = c("Ashe",
"Bard", "Brand", "Kogmaw", "Yuumi"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best used for proportional data. But it doesn't seem to be working. It's running but giving the above warning, and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is somewhat not correct; the binomial distribution has support on the non-negative integer values so it doesn't make sense to pass in non-integer values without the samples totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warming will go away, but then you'll not be able to use AIC and related tools that expect a real likelihood.
If you proportions are true proportions then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on reals values on the unit interval (0, 1) (but technically not 0 or 1 — mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
I also found that rather than calculating a total, but using cbind() with the 2 columns of interest removed the warning e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")
I am plotting the proportion of deep sleep (y axis) vs days (x axis). I would like to add vertical shaded area for a better understanding (e.g. grey for week-ends, orange for sick period...).
I have tried using geom_ribbon (I created a variable taking the value of 30, with is the top of my y axis if the data is during the WE - information given in another column), but instead of getting rectangles, I get trapezes.
In another post, someone proposed the use of "geom_rect", or "annotate" if one's know the x and y coordinates, but I don't see how to adapt it in my case, when I want to have the colored area repeated to all week-end (it is not exactly every 7 days because some data are missing).
Do you have any idea ?
Many thanks in advance !
ggplot(Sleep.data, aes(x = DATEID)) +
geom_line(aes(y = P.DEEP, group = 1), col = "deepskyblue3") +
geom_point(aes(y = P.DEEP, group = 1, col = Sign.deep)) +
guides(col=FALSE) +
geom_ribbon(aes(ymin = min, ymax = max.WE), fill = '#6495ED80') +
facet_grid(MONTH~.) +
geom_hline(yintercept = 15, col = "forestgreen") +
geom_hline(yintercept = 20, col = "forestgreen", linetype = "dashed") +
geom_vline(xintercept = c(7,14,21,28), col = "grey") +
scale_x_continuous(breaks=seq(0,28,7)) +
scale_y_continuous(breaks=seq(0,30,5)) +
labs(x = "Days",y="Proportion of deep sleep stage", title = "Deep sleep")
Proportion of deep sleep vs time
Head(Sleep.data)
> dput(head(Sleep.data))
structure(list(DATE = structure(c(1L, 4L, 7L, 10L, 13L, 16L), .Label = c("01-Dec-17",
"01-Feb-18", "01-Jan-18", "02-Dec-17", "02-Feb-18", "02-Jan-18",
"03-Dec-17", "03-Feb-18", "03-Jan-18", "04-Dec-17", "04-Feb-18",
"04-Jan-18", "05-Dec-17", "05-Feb-18", "05-Jan-18", "06-Dec-17",
"06-Feb-18", "06-Jan-18", "07-Dec-17", "07-Feb-18", "07-Jan-18",
"08-Dec-17", "08-Jan-18", "09-Dec-17", "09-Feb-18", "09-Jan-18",
"10-Dec-17", "10-Jan-18", "11-Dec-17", "11-Feb-18", "11-Jan-18",
"12-Dec-17", "12-Jan-18", "13-Dec-17", "13-Feb-18", "13-Jan-18",
"14-Dec-17", "14-Feb-18", "14-Jan-18", "15-Dec-17", "15-Jan-18",
"16-Dec-17", "16-Jan-18", "17-Dec-17", "17-Jan-18", "18-Dec-17",
"18-Jan-18", "19-Dec-17", "19-Jan-18", "20-Dec-17", "21-Dec-17",
"21-Jan-18", "22-Dec-17", "22-Jan-18", "23-Dec-17", "23-Jan-18",
"24-Dec-17", "24-Jan-18", "25-Dec-17", "25-Jan-18", "26-Dec-17",
"26-Jan-18", "27-Dec-17", "27-Jan-18", "28-Dec-17", "28-Jan-18",
"29-Dec-17", "29-Jan-18", "30-Dec-17", "30-Jan-18", "31-Dec-17",
"31-Jan-18"), class = "factor"), DATEID = 1:6, MONTH = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Decembre", "Janvier", "Février"
), class = "factor"), DURATION = c(8.08, 7.43, 6.85, 6.23, 7.27,
6.62), D.DEEP = c(1.67, 1.37, 1.62, 1.75, 1.95, 0.9), P.DEEP = c(17L,
17L, 21L, 24L, 25L, 12L), STIMS = c(0L, 0L, 0L, 0L, 390L, 147L
), D.REM = c(1.7, 0.95, 0.95, 1.43, 1.47, 0.72), P.REM = c(17L,
11L, 12L, 20L, 19L, 9L), D.LIGHT = c(4.7, 5.12, 4.27, 3.05, 3.83,
4.98), P.LIGHT = c(49L, 63L, 55L, 43L, 49L, 66L), D.AWAKE = c(1.45,
0.58, 0.47, 0.87, 0.37, 0.85), P.AWAKE = c(15L, 7L, 6L, 12L,
4L, 11L), WAKE.UP = c(-2L, 0L, 2L, -1L, 3L, 1L), AGITATION = c(-1L,
-3L, -1L, -2L, 2L, -1L), FRAGMENTATION = c(1L, -2L, 2L, 1L, 0L,
-1L), PERIOD = structure(c(3L, 3L, 4L, 4L, 4L, 4L), .Label = c("HOLIDAYS",
"SICK", "WE", "WORK"), class = "factor"), SPORT = structure(c(2L,
1L, 2L, 2L, 2L, 1L), .Label = c("", "Day", "Evening"), class = "factor"),
ACTIVITY = structure(c(6L, 1L, 3L, 4L, 5L, 1L), .Label = c("",
"Bkool", "eBike", "Gym", "Natation", "Run"), class = "factor"),
TABLETS = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), Ratio = c(1.15,
2.36, 3.45, 2.01, 5.27, 1.06), Sign = structure(c(2L, 2L,
2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
Sign.ratio = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), Sign.deep = structure(c(2L, 2L,
2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
Sign.awake = structure(c(1L, 2L, 2L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), Sign.light = structure(c(2L, 1L,
1L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
index = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), min = c(0, 0, 0, 0, 0, 0), max.WE = c(30,
30, 0, 0, 0, 0)), .Names = c("DATE", "DATEID", "MONTH", "DURATION",
"D.DEEP", "P.DEEP", "STIMS", "D.REM", "P.REM", "D.LIGHT", "P.LIGHT",
"D.AWAKE", "P.AWAKE", "WAKE.UP", "AGITATION", "FRAGMENTATION",
"PERIOD", "SPORT", "ACTIVITY", "TABLETS", "Ratio", "Sign", "Sign.ratio",
"Sign.deep", "Sign.awake", "Sign.light", "index", "min", "max.WE"
), row.names = c(NA, 6L), class = "data.frame")
Thanks for adding the data, that makes it easier to understand exactly what you're working with and to confirm that an answer actually addresses your question.
I thought it would be helpful to make a separate table with just the start and end of each contiguous set of rows with the same PERIOD. I did this using dplyr::case_when, assuming we should mark dates as a "start" if they are the first row in the table (row_number() == 1), or they have a different PERIOD value than the prior row. I mark dates as an "end" if they are the last row of the table, or have a different PERIOD than the next row. I only keep the starts and ends, and spread these into new columns called start and end.
library(tidyverse)
Period_ranges <- Sleep.data %>%
mutate(period_status = case_when(row_number() == 1 ~ "start",
PERIOD != lag(PERIOD) ~ "start",
row_number() == n() ~ "end",
PERIOD != lead(PERIOD) ~ "end",
TRUE ~ "other")) %>%
filter(period_status %in% c("start", "end")) %>%
select(DATEID, PERIOD, period_status) %>%
mutate(PERIOD_NUM = cumsum(PERIOD != lag(PERIOD) | row_number() == 1)) %>%
spread(period_status, DATEID)
# Output based on sample data only. If there's a problem with the full data, please add more. To share full data, use `dput(Sleep.data)` or to share 20 rows use `dput(head(Sleep.data, 20))`.
>Period_ranges
PERIOD PERIOD_NUM end start
1 WE 1 2 1
2 WORK 2 6 3
We can now use that in the plot. If you want to toggle the inclusion or fiddle with the appearance separately of different PERIOD types, you could modify the code below with Period_ranges %>% filter(PERIOD == "WE"),
ggplot(Sleep.data, aes(x = DATEID)) +
# Here I specify that this geom should use its own data.
# I start the rectangles half a day before and end half a day after to fill the space.
geom_rect(data = Period_ranges, inherit.aes = F,
aes(xmin = start - 0.5, xmax = end + 0.5,
ymin = 0, ymax = 30,
fill = PERIOD), alpha = 0.5) +
# Here we can specify the shading color for each type of PERIOD
scale_fill_manual(values = c(
"WE" = '#6495ED80',
"WORK" = "gray60"
)) +
# rest of your code
Chart based on data sample:
I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove the column 18 as follow:
ufo2 <- ufo[,1:17]
and then convert 3 non0numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of the outlier.scores are NA!!!
Do I have any mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example produceCode. Are you sure, this should be part of your similarity?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.