Aggregating Column Values In Kusto - azure-data-explorer

How do I transform Kusto data that looks like this:
let fauxData = datatable (OrgName:string, Status:string, EastUS:long, SouthCentralUS:long, WestUS2:long)
['Apple', 'Red', 50, 10, 90,
'Apple', 'Orange', 30, 30, 10,
'Apple', 'Yellow', 10, 0, 0,
'Apple', 'Green', 10, 60, 0,
'Ball', 'Red', 20, 20, 20,
'Ball', 'Orange', 30, 30, 30,
'Ball', 'Yellow', 0, 0, 0,
'Ball', 'Green', 50, 50, 50,
];
To look like this:
['Apple', 'ComboOfRedandOrange', 80, 40, 100,
'Apple', 'ComboOfGreenandYellow', 20, 60, 0,
'Ball', 'ComboOfRedandOrange', 50, 50, 50,
'Ball', 'ComboOfGreenandYellow', 50, 50, 50,
]

You can use the following query to achieve your goal:
let fauxData = datatable (OrgName:string, Status:string, EastUS:long, SouthCentralUS:long, WestUS2:long)
['Apple', 'Red', 50, 10, 90,
'Apple', 'Orange', 30, 30, 10,
'Apple', 'Yellow', 10, 0, 0,
'Apple', 'Green', 10, 60, 0,
'Ball', 'Red', 20, 20, 20,
'Ball', 'Orange', 30, 30, 30,
'Ball', 'Yellow', 0, 0, 0,
'Ball', 'Green', 50, 50, 50,
];
fauxData
| extend combo = case(Status in ('Red', 'Orange'), 'ComboOfRedandOrange',
Status in ('Green', 'Yellow'), 'ComboOfGreenandYellow',
'Unknown')
| summarize sum(EastUS), sum(SouthCentralUS), sum(WestUS2) by OrgName, combo
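The expected totals can be cross-checked outside Kusto. The following is a minimal Python sketch (not KQL) of the same logic: map each Status to a combined label, as the case() expression does, then sum the three region columns per OrgName and label. The names rows, COMBOS, and aggregate are made up for this illustration.

```python
from collections import defaultdict

# Same rows as fauxData: (OrgName, Status, EastUS, SouthCentralUS, WestUS2)
rows = [
    ("Apple", "Red", 50, 10, 90),
    ("Apple", "Orange", 30, 30, 10),
    ("Apple", "Yellow", 10, 0, 0),
    ("Apple", "Green", 10, 60, 0),
    ("Ball", "Red", 20, 20, 20),
    ("Ball", "Orange", 30, 30, 30),
    ("Ball", "Yellow", 0, 0, 0),
    ("Ball", "Green", 50, 50, 50),
]

# Status -> combined label (mirrors the case() expression in the query)
COMBOS = {
    "Red": "ComboOfRedandOrange",
    "Orange": "ComboOfRedandOrange",
    "Green": "ComboOfGreenandYellow",
    "Yellow": "ComboOfGreenandYellow",
}


def aggregate(rows):
    """Sum the three region columns per (OrgName, combo) pair."""
    totals = defaultdict(lambda: [0, 0, 0])
    for org, status, east, south_central, west in rows:
        key = (org, COMBOS.get(status, "Unknown"))
        for i, value in enumerate((east, south_central, west)):
            totals[key][i] += value
    return {key: tuple(sums) for key, sums in totals.items()}


result = aggregate(rows)
for (org, combo), sums in sorted(result.items()):
    print(org, combo, *sums)
```

Each region must be summed from its own column; summing the same column twice would silently produce a wrong third total.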

You could try something like this:
let T =
datatable(OrgName: string, Status: string, EastUS: long, SouthCentralUS: long, WestUS2: long)
[
'Apple', 'Red', 50, 10, 90,
'Apple', 'Orange', 30, 30, 10,
'Apple', 'Yellow', 10, 0, 0,
'Apple', 'Green', 10, 60, 0,
'Ball', 'Red', 20, 20, 20,
'Ball', 'Orange', 30, 30, 30,
'Ball', 'Yellow', 0, 0, 0,
'Ball', 'Green', 50, 50, 50,
]
;
let F = (statuses:dynamic)
{
T
| where Status in(statuses)
| summarize sum(EastUS), sum(SouthCentralUS), sum(WestUS2) by OrgName, Status = strcat("ComboOf", strcat_array(statuses, "and"))
}
;
union
F(dynamic(['Red', 'Orange'])),
F(dynamic(['Green', 'Yellow']))
| order by OrgName asc, Status asc

Related

plotting a graph with multiple bars in R

I am struggling to plot the following data and think it is because of the format of the data.
structure(list(HE_Provider = c("Coventry University", "The University of Leicester",
"Total"), Bath_and_North_East_Somerset = c(15, 20, 205), Bedford = c(85,
90, 1040), Blackburn_with_Darwen = c(10, 20, 95), Blackpool = c(10,
5, 60), `Bournemouth,_Poole_and_Christchurch` = c(35, 15, 285
), Bracknell_Forest = c(15, 10, 210), Buckinghamshire = c(195,
145, 1835), Cambridgeshire = c(130, 160, 2500), Central_Bedfordshire = c(115,
70, 1120), Cheshire_East = c(45, 55, 935), Cheshire_West_and_Chester = c(25,
40, 535), City_of_Bristol = c(40, 35, 390), City_of_Derby = c(65,
135, 4115), City_of_Kingston_upon_Hull = c(25, 20, 265), City_of_Leicester = c(315,
1275, 6860), City_of_Nottingham = c(65, 145, 5405), City_of_Plymouth = c(15,
10, 135), City_of_Portsmouth = c(15, 15, 130), City_of_Southampton = c(15,
20, 140), `City_of_Stoke-on-Trent` = c(50, 15, 475), City_of_York = c(35,
20, 350), Cornwall = c(25, 25, 300), County_Durham = c(20, 40,
330), Cumbria = c(30, 20, 305), Darlington = c(0, 15, 110), Derbyshire = c(100,
145, 6925), Devon = c(50, 50, 630), Dorset = c(30, 20, 285),
East_Riding_of_Yorkshire = c(75, 45, 760), East_Sussex = c(55,
50, 650), Essex = c(365, 180, 3320), Gloucestershire = c(150,
85, 905), Greater_London = c(5550, 1930, 18285), Greater_Manchester = c(245,
280, 2820), Halton = c(5, 10, 80), Hampshire = c(180, 120,
1485), Hartlepool = c(5, 10, 55), Herefordshire = c(50, 15,
235), Hertfordshire = c(385, 270, 4815), Isle_of_Wight = c(10,
5, 90), Isles_of_Scilly = c(0, 0, 0), Kent = c(365, 195,
2590), Lancashire = c(75, 125, 985), Leicestershire = c(540,
980, 8010), Lincolnshire = c(145, 190, 7710), Luton = c(105,
75, 685), Medway = c(95, 35, 425), Merseyside = c(75, 120,
975), Middlesbrough = c(10, 5, 65), Milton_Keynes = c(265,
170, 2205), Norfolk = c(120, 115, 2410), North_East_Lincolnshire = c(20,
10, 810), North_Lincolnshire = c(20, 20, 810), North_Somerset = c(25,
15, 205), North_Yorkshire = c(500, 80, 1160), Northamptonshire = c(680,
510, 7505), Northumberland = c(10, 25, 235), Nottinghamshire = c(140,
185, 9410), Oxfordshire = c(280, 135, 1785), Peterborough = c(85,
135, 1560), Reading = c(75, 25, 260), Redcar_and_Cleveland = c(5,
5, 90), Rutland = c(5, 35, 345), Shropshire = c(60, 30, 500
), Slough = c(95, 40, 270), Somerset = c(40, 40, 490), South_Gloucestershire = c(40,
25, 310), South_Yorkshire = c(105, 180, 3220), `Southend-on-Sea` = c(35,
25, 345), Staffordshire = c(370, 150, 3825), `Stockton-on-Tees` = c(20,
15, 145), Suffolk = c(115, 115, 1935), Surrey = c(195, 155,
2900), Swindon = c(50, 25, 225), Telford_and_Wrekin = c(60,
20, 360), Thurrock = c(140, 40, 370), Torbay = c(5, 5, 65
), Tyne_and_Wear = c(45, 60, 680), Warrington = c(20, 20,
290), Warwickshire = c(2080, 210, 2825), West_Berkshire = c(35,
25, 300), West_Midlands = c(8315, 915, 8220), West_Sussex = c(105,
95, 1115), West_Yorkshire = c(200, 245, 3005), Wiltshire = c(90,
55, 630), Windsor_and_Maidenhead = c(40, 25, 405), Wokingham = c(70,
35, 395), Worcestershire = c(350, 110, 1350), `England_(county_unitary_authority_unknown)` = c(0,
10, 770), Total_England = c(24990, 11530, 154930), Total = c(25380,
11845, 158480)), row.names = c(NA, -3L), class = "data.frame")
I would like to plot the regions along the x-axis (they currently have no column of their own), with the numbers on the y-axis and the university as the fill.
This type of problem generally has to do with reshaping the data. The data should be in long format, but it is in wide format. See this post on how to reshape the data from wide to long format.
Reshape the data and plot with geom_col.
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
})
df1 %>%
select(-matches("England"), -matches("Total")) %>%
pivot_longer(-HE_Provider, names_to = "Region") %>%
ggplot(aes(Region, value, fill = HE_Provider)) +
geom_col() +
theme_bw(base_size = 10) +
theme(axis.text.x = element_text(size = 7, angle = 75, vjust = 1, hjust = 1),
legend.position = "bottom")
Created on 2022-12-06 with reprex v2.0.2
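To make the wide-to-long idea concrete, here is a small illustration written in Python rather than R. It is only a sketch of the reshaping step, using two region columns and two providers from the data above; wide and to_long are hypothetical names. Each wide row becomes one long row per region column, which is the shape that pivot_longer() produces for ggplot.

```python
# Wide format: one row per provider, one column per region (subset of the data)
wide = [
    {"HE_Provider": "Coventry University", "Bedford": 85, "Blackpool": 10},
    {"HE_Provider": "The University of Leicester", "Bedford": 90, "Blackpool": 5},
]


def to_long(rows, id_col, names_to="Region", values_to="value"):
    """Turn every non-id column into its own (id, name, value) row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue  # the id column is repeated on each output row instead
            long_rows.append({id_col: row[id_col], names_to: col, values_to: val})
    return long_rows


long_rows = to_long(wide, "HE_Provider")
for r in long_rows:
    print(r)
```

In the long table the region is an ordinary column, so it can be mapped to the x-axis, with value on the y-axis and HE_Provider as the fill.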
We could bring the data into long format. For the y-axis we used a log scale:
library(tidyverse)
df %>%
pivot_longer(-HE_Provider) %>%
group_by(HE_Provider, name) %>%
summarise(sum_value = sum(value)) %>%
ggplot(aes(x=name, y=log(sum_value), fill=HE_Provider))+
geom_col(position=position_dodge())+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

fill delaunay triangles with colors of vertex points in R

Here is a reprex:
data<- structure(list(lanmark_id = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67), V1 = c(0.00291280916742007,
0.00738863171211713, 0.0226678081211574, 0.0475105228945172,
0.0932285720818941, 0.167467706279089, 0.257162845610094, 0.365202733889021,
0.49347857580521, 0.623654594804239, 0.738846221030799, 0.838001377618909,
0.911583795022151, 0.954620025430512, 0.976736039833402, 0.99275439380643,
1.00100526672829, 0.0751484964183746, 0.136267471453466, 0.223219796351563,
0.312829176190895, 0.396253287447153, 0.589077347394549, 0.682150866526948,
0.771279538477539, 0.856242644022999, 0.915433541338973, 0.493665602840245,
0.491283285973581, 0.488913167946858, 0.486968906096063, 0.384707082576335,
0.43516446651127, 0.48730704698643, 0.541730425616146, 0.590794609520034,
0.176234316360877, 0.230353437655898, 0.295908510434122, 0.350673723300921,
0.2927721757992, 0.228392965512228, 0.634474821310078, 0.692554938010577,
0.757884656518485, 0.809961553290539, 0.760324208523177, 0.696892501347341,
0.299062528225204, 0.371899560139738, 0.440183530232855, 0.488448817156316,
0.542120710507391, 0.613931454931259, 0.683122622479693, 0.614367295821043,
0.544516611213321, 0.487065702940653, 0.43466839036949, 0.367662837035504,
0.329392110306872, 0.439192556373207, 0.488617118648197, 0.543288506065858,
0.652131615571443, 0.541622182786469, 0.486664920417254, 0.437126878794749
), V2 = c(0.201088019764115, 0.335422141956174, 0.468591127485112,
0.597955245417373, 0.719502795031081, 0.826191980419368, 0.912263437847338,
0.978932088608654, 0.996572250349122, 0.975164350943783, 0.906204543800476,
0.817791059656974, 0.711167374856116, 0.587462637963028, 0.457981280500493,
0.327526817895531, 0.19652402489511, 0.0832018969548692, 0.0247526745448235,
0.00543973063471442, 0.0169853862992864, 0.0463565705952832,
0.0442986445765913, 0.0151651597693172, 0.00747493463745755,
0.0263496825405166, 0.0805712600069456, 0.160307477500307, 0.24640401358039,
0.332244740019727, 0.420995916418539, 0.486383354389177, 0.505514985155285,
0.521022030162301, 0.5059272511442, 0.48818970795347, 0.184054088286897,
0.153658218058329, 0.153359749238857, 0.186997311695192, 0.20294291755153,
0.204166125257439, 0.186997311695192, 0.153386090373069, 0.155932705636629,
0.184603717976376, 0.203900583330345, 0.202836636618411, 0.670663080116174,
0.635972857244521, 0.619932598923225, 0.632625553953685, 0.620132318139554,
0.637530241507316, 0.668109937001625, 0.718821664744205, 0.73956412947459,
0.744898219300658, 0.74046882628352, 0.720755964662638, 0.672731384920681,
0.666152981987244, 0.670464844757437, 0.664772611108765, 0.671145517468628,
0.673968618595099, 0.67986363963374, 0.675352028351748), coef2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0.565178003460693, 0, 0, 0, 0, 0, 0, 0,
0, 0.0433232019717308, 0.0433232019717308, 0.442833876807268,
0.574211955093656, 0.574211955093656, 0.574211955093656, 0.574211955093656,
0.442833876807268, 0.0433232019717308, 0.0433232019717308, 0.0612451242746323,
0.0612451242746323, 0, 0, 0, 0, 0, 0, 0, 0.343056259557492, 0.701076795777046,
0.674029769391816, 0, 0.538117834886036, 0.990039002564078, 0.451921167678043,
0.701076795777046, 0.701076795777046, 0.316009233172263, 0.990039002564078,
0.990039002564078, 0.878350036859346, 0.343364662128988, 0.282119537854356,
0.282119537854356, 0.282119537854356, 0.343364662128988, 0.384793696241895,
0.608382647917744, 0.608382647917744, 1, 0.608382647917744, 0.608382647917744,
0.384793696241895, 0.501936678206125, 0.501936678206125, 0, 0.878350036859346,
0, 0.501936678206125, 0.501936678206125)), row.names = c(NA,
-68L), class = c("tbl_df", "tbl", "data.frame"))
I used this data to create a Delaunay plot in R:
library(tidyverse)
library(ggforce)
data%>%
mutate(coef2 = coef2/max(coef2))%>%
ggplot(aes(V1, V2))+
geom_delaunay_tile(aes(colour = coef2, fill = coef2), alpha = .5)+
geom_delaunay_segment2(aes(colour = coef2, fill = coef2))+
geom_point(aes(colour = coef2))+
ylim(1,0)+
scale_color_viridis_c(option = "magma")+
scale_fill_viridis_c(option = "magma")+
theme_minimal()
Which gives this:
I want to fill each triangle with a blend of colors that matches the colors of its vertex points, just as the lines are colored.
As you can see, I have tried using fill = coef2 within the geom_delaunay layers, but this doesn't really achieve what I want.
Is there a way to do this in R?
Many thanks!

Normalize and control for missing data in an extreme dataset

I have the following dataset
structure(list(q1 = c(5, 40, 200, 100, 100, 3, 200, 10, 10, 50,
50, 20, 600, 20, 15, 20, 80, 50, 0, 0, 45, 40, 20, 100, 20, 100,
3, 30, 10, 3, 20, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 100, 0, 5, 5,
5, 2, 0, 0, 0, 0, 0, 100, 0, 0, 0, 10, 5, 0, 50), q2 = c(5, 40,
200, 80, 100, 2, 100, 11, 10, 5, 50, 60, 600, 10, 10, 30, 50,
0, 0, 0, 45, 30, 10, 20, 20, 20, 5, 30, 30, 3, 20, 0, 20, 0,
0, 0, 20, 0, 5, 2, 60, 0, 40, 10, 5, 0, 0, 0, 0, 5, 0, 0, 0,
0, 0, 0, 10, 20, 0, 0), q3 = c(2, 70, 400, 160, 350, 100, 500,
20, 100, 500, 300, 20, 1000, 20, 20, 200, 80, 100, 70, 50, 0,
20, 40, 0, 0, 200, 5, 0, 100, 3, 50, 60, 0, 0, 0, 20, 100, 30,
40, 50, 50, 1000, 60, 0, 10, 160, 20, 40, 40, 200, 20, 20, 15,
150, 10, 15, 10, 100, 0, 10), q4 = c(50, 30, 300, 160, 300, 100,
500, 20, 100, 25, 200, 30, 600, 20, 0, 0, 50, 20, 200, 50, 50,
20, 30, 0, 0, 50, 3, 20, 60, 3, 0, 60, 0, 0, 0, 15, 100, 30,
30, 20, 100, 1000, 30, 10, 10, 50, 3, 20, 0, 100, 15, 20, 1510,
0, 10, 20, 0, 50, 0, 0), q5 = c(20, 50, 200, 40, 100, 100, 100,
15, 20, 50, 50, 50, 1000, 20, 15, 30, 50, 30, 15, 15, 25, 20,
20, 20, 20, 150, 3, 50, 30, 10, 30, 30, 50, 20, 20, 15, 20, 30,
8, 20, 100, 500, 30, 10, 30, 20, 3, 20, 20, 15, 30, 0, 45, 20,
0, 15, 30, 40, 20, 15), q6 = c(0, 70, 100, 160, 100, 100, 50,
15, 10, 25, 1000, 50, 1000, 20, 0, 0, 80, 0, 0, 0, 35, 30, 10,
20, 20, 100, 3, 10, 60, 10, 0, 100, 30, 50, 100, 15, 30, 30,
17, 5, 30, 1000, 80, 20, 30, 80, 40, 80, 20, 20, 40, 30, 30,
0, 0, 20, 10, 40, 20, 50), q7 = c(5, 50, 200, 100, 100, 5, 20,
10, 0, 300, 50, 20, 300, 20, 0, 200, 80, 10, 15, 0, 30, 20, 40,
20, 20, 100, 3, 15, 50, 15, 80, 20, 0, 30, 0, 15, 20, 30, 10,
20, 30, 100, 70, 20, 3, 20, 30, 40, 30, 10, 15, 0, 30, 30, 0,
5, 50, 30, 0, 30), q8 = c(0, 30, 50, 100, 20, 5, 5, 8, 10, 5,
30, 20, 100, 20, 0, 0, 50, 20, 0, 0, 35, 20, 20, 0, 30, 20, 5,
6, 30, 15, 10, 10, 30, 0, 0, 0, 20, 30, 6, 5, 50, 100, 10, 10,
5, 35, 20, 80, 20, 20, 15, 0, 15, 0, 0, 5, 10, 40, 0, 15), q9 = c(20,
40, 0, 180, 0, 0, 0, 1, 20, 500, 100, 20, 1000, 0, 20, 0, 80,
50, 0, 15, 45, 20, 20, 0, 20, 200, 3, 80, 50, 15, 30, 30, 30,
0, 20, 0, 50, 0, 45, 200, 0, 0, 5, 20, 10, 180, 50, 90, 20, 50,
20, 0, 15, 0, 0, 30, 50, 40, 0, 30), q10 = c(10, 70, 0, 200,
0, 0, 10, 1, 15, 15, 100, 20, 1000, 0, 0, 0, 80, 30, 0, 10, 30,
30, 10, 0, 15, 20, 5, 30, 40, 15, 10, 30, 100, 0, 0, 5, 50, 30,
20, 15, 30, 0, 5, 10, 10, 90, 25, 90, 15, 25, 20, 0, 15, 0, 0,
35, 10, 20, 0, 15), q11 = c(20, 60, 200, 120, 100, 9, 100, 15,
25, 150, 100, 30, 100, 20, 15, 50, 80, 50, 20, 15, 30, 20, 30,
20, 15, 150, 10, 20, 50, 10, 35, 20, 50, 20, 0, 20, 0, 30, 35,
20, 80, 100, 60, 20, 50, 20, 60, 20, 50, 25, 35, 0, 30, 0, 0,
30, 30, 40, 20, 20), q12 = c(20, 50, 200, 120, 100, 3, 50, 12,
10, 15, 50, 30, 100, 20, 0, 30, 60, 0, 0, 5, 25, 30, 10, 20,
10, 1000, 5, 0, 60, 10, 20, 0, 5, 25, 0, 15, 0, 30, 31, 2, 35,
1000, 10, 10, 15, 20, 25, 80, 50, 20, 35, 0, 20, 0, 0, 10, 20,
30, 0, 15), q13 = c(200, 80, 0, 200, 25, 200, 10, 20, 50, 15,
1000, 70, 1000, 50, 0, 0, 80, 40, 30, 0, 100, 30, 20, 20, 40,
100, 5, 50, 100, 20, 0, 30, 30, 0, 50, 10, 30, 30, 45, 10, 120,
1000, 50, 202, 100, 200, 15, 120, 25, 20, 35, 0, 45, 0, 50, 50,
50, 30, 0, 30), q14 = c(0, 50, 200, 200, 0, 5, 100, 5, 20, 300,
300, 40, 1000, 10020, 20, 0, 80, 30, 0, 15, 50, 50, 20, 0, 40,
300, 3, 20, 100, 5, 0, 50, 100, 0, 0, 0, 30, 100, 20, 100, 40,
100, 5, 10, 10, 10, 50, 120, 0, 50, 15, 50, 50, 0, 50, 15, 100,
40, 0, 50), q15 = c(50, 40, 50, 150, 100, 30, 0, 8, 25, 100,
100, 100, 0, 100, 0, 0, 50, 10, 0, 50, 150, 1000, 10, 0, 120,
0, 5, 100, 20, 10, 10, 0, 100, 0, 0, 5, 100, 30, 45, 200, 100,
200, 20, 5, 0, 0, 50, 100, 50, 100, 10, 0, 0, 0, 50, 30, 100,
50, 0, 50), q16 = c(50, 50, 200, 100, 200, 15, 200, 15, 50, 500,
150, 50, 1000, 20, 0, 100, 100, 30, 0, 50, 60, 30, 50, 100, 100,
100, 10, 100, 100, 15, 200, 50, 30, 0, 0, 15, 30, 30, 5, 50,
15, 1000, 5, 20, 100, 0, 80, 20, 0, 300, 20, 0, 100, 0, 0, 20,
100, 100, 0, 200), q17 = c(0, 30, 100, 140, 100, 5, 100, 15,
15, 15, 100, 60, 1000, 50, 0, 0, 50, 0, 0, 0, 60, 20, 10, 0,
40, 100, 5, 30, 60, 15, 10, 30, 0, 0, 20, 15, 20, 30, 10, 10,
50, 1000, 30, 10, 20, 30, 0, 80, 0, 50, 15, 0, 30, 0, 0, 15,
10, 60, 0, 50), q18 = c(0, 60, 0, 80, 20, 5, 0, 5, 25, 500, 250,
70, 800, 0, 20, 100, 100, 100, 50, 50, 70, 30, 50, 0, 50, 300,
5, 100, 50, 15, 20, 50, 30, 0, 0, 0, 50, 0, 90, 100, 50, 100,
0, 10, 1000, 0, 20, 80, 5, 100, 20, 0, 0, 0, 0, 30, 0, 100, 0,
0), q19 = c(0, 30, 0, 80, 0, 5, 0, 15, 25, 15, 100, 60, 800,
50, 0, 0, 80, 0, 0, 0, 45, 20, 10, 0, 20, 500, 5, 30, 60, 15,
50, 50, 0, 0, 50, 0, 20, 0, 20, 15, 0, 0, 0, 10, 75, 100, 10,
80, 5, 30, 20, 0, 15, 0, 0, 20, 0, 50, 10, 0), q20 = c(100, 60,
200, 150, 200, 30, 200, 100, 50, 1500, 100, 40, 400, 5020, 35,
150, 80, 100, 100, 50, 70, 30, 40, 100, 50, 200, 20, 0, 50, 10,
100, 30, 0, 60, 30, 50, 20, 30, 63, 40, 100, 100, 0, 20, 50,
200, 50, 50, 30, 50, 30, 0, 45, 35, 30, 45, 50, 50, 30, 40),
q21 = c(100, 30, 200, 150, 100, 40, 100, 10, 20, 15, 100,
30, 400, 20, 10, 0, 60, 0, 0, 0, 10, 20, 10, 20, 15, 20,
5, 30, 50, 10, 10, 20, 0, 0, 0, 15, 20, 30, 15, 10, 30, 100,
0, 10, 15, 0, 30, 120, 10, 10, 35, 0, 2525, 35, 50, 40, 10,
30, 20, 15), q22 = c(100, 70, 100, 150, 100, 5, 100, 5, 25,
250, 100, 50, 1000, 20, 15, 70, 80, 100, 10, 20, 30, 30,
20, 50, 50, 200, 10, 40, 40, 15, 100, 20, 50, 60, 20, 15,
30, 30, 10, 30, 100, 100, 25, 20, 10, 100, 80, 50, 25, 20,
35, 0, 30, 20, 0, 20, 50, 50, 0, 50), q23 = c(10, 40, 100,
150, 100, 3, 10, 10, 20, 4, 100, 60, 700, 20, 0, 0, 60, 0,
0, 0, 20, 20, 10, 20, 40, 20, 5, 2, 60, 15, 10, 20, 5, 0,
20, 0, 30, 30, 10, 2, 1010, 0, 10, 1010, 10, 10, 5, 80, 3,
20, 20, 0, 25, 0, 0, 20, 10, 30, 0, 15)), row.names = c(NA,
-60L), class = "data.frame")
Edit: 0 is not missing data; the values are amounts in $.
When I look at it graphically, it looks far from ideal:
boxplot(as.matrix(example))
plot(density(as.matrix(example)))
I would like to normalize this data by a transformation and control for outliers so I have 2 questions:
QUESTION 1
How would you deal with outliers in this dataset? I don't want to lose data, so I would like to replace them, but it is unclear to me which method to use. Is there a package that would help me automate this? I would also like to understand the rationale of the method used.
QUESTION 2
Having controlled for outliers I want to transform the variables into normality. For this I have two packages I tend to use:
library(rcompanion)
a<- transformTukey(as.matrix(example))
and
library(LambertW)
b<-Gaussianize(example, type = "h")
However, I am not too sure how they work mathematically, how to assess whether they are doing a good job, which one is better, or whether there is another, more practical solution.
It's not completely clear what you're trying to do with the data, but I would start with the simple things and go from there. For example, look at what hist() shows and run the usual checks for a normal distribution (you can find better descriptions of these online). For outliers, I usually fit a simple lm() and look at its diagnostic plots, which flag the outlying points and show where a 'cutoff' would be. The type of data normally gives some insight into which normalization method to use, but in general a log transformation is a good default choice.
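One simple way to keep every observation while limiting extreme values is to cap them at chosen percentiles (often called winsorizing) and then apply a log transform. The sketch below illustrates that idea only; it is not a method proposed in the answer above, and it is written in Python rather than R. The 5th/95th percentile cut-offs are an arbitrary assumption, the sample is the first 14 values of q14, and log1p is used because the data contain zeros, for which a plain log is undefined. The name winsorize is hypothetical.

```python
import math
import statistics

# First 14 values of q14 from the dataset above (includes the 10020 outlier)
values = [0, 50, 200, 200, 0, 5, 100, 5, 20, 300, 300, 40, 1000, 10020]


def winsorize(xs, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles instead of dropping them."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(xs, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in xs]


capped = winsorize(values)

# log1p(x) = log(1 + x), so zeros map to 0 instead of -inf
transformed = [math.log1p(x) for x in capped]

for raw, cap, t in zip(values, capped, transformed):
    print(raw, cap, round(t, 3))
```

Whether capping is appropriate depends on whether the extreme values are data-entry errors or genuine amounts, so the cut-offs should be justified for the data at hand rather than taken from this sketch.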

How to make a profile plot (principal component analysis) in R?

I'm currently running principal component analysis. For the interpretation I want to create a profile (pattern) plot to visualize the correlation between each principal component and the original variables. Is anyone familiar with a package or code to create this in R? I'm using the prcomp() function in R.
See examples:
https://canadianaudiologist.ca/predicting-speech-perception-from-the-audiogram-and-vice-versa/
https://blogs.sas.com/content/iml/2019/11/04/interpret-graphs-principal-components.html
This is similar data to my db:
db <- structure(list(T025 = c(20, 60, 20, 10, 85, 5, 15, 10, 10, 25,
15, 5, 15, 30, 15, 15, 10, 25, 45, 25, 55, 20, 65, 20, 10, 10,
15, 15, 30, 35, 10, 50, 20, 15, 30, 15, 20, 35, 30, 20, 10, 20,
30, 15, 40, 15, 10, 10, 20, 25, -5, 10, 40, 0, 15, 5, 15, 30,
15, 80, 15, 35, 10, 50, 25, 10, 15, 20, 20, 20, 25, 20, 30, 10,
20, 50, 25, 25, 55, 30, 20, 30, 15, 10, 15, 15, 35, 20, 30, 15,
40, 20, 25, 15, 20, 35, 15, 25, 20, 40, 0, 20, 10, 10, 15, 10,
20, 10, 35, 35, 25, 30, 20, 25, 15, 30, 35, 25, 30, 5, 20, 30,
15, 25, 10), T05 = c(0, 25, 0, 5, 25, 5, 0, 0, 5, 5, 5, -5, 5,
15, 15, 5, 0, 15, 25, 15, 50, 20, 45, 5, 5, 5, 0, 10, 10, 10,
5, 20, 15, 10, 20, 10, -5, 10, 30, -5, 0, 10, 35, 5, 40, 0, 0,
-5, 15, 25, 0, 5, 35, -5, 5, 0, 5, 5, 10, 70, 0, 20, 5, 30, 10,
10, 5, 5, 25, 10, 20, 5, 25, 5, 10, 35, 15, 10, 45, 15, 15, 25,
10, 5, 10, 5, 20, 15, 15, 5, 10, 10, 20, 5, 15, 25, 5, 20, 10,
35, -10, 5, 0, -5, 0, 5, 15, 5, 15, 35, 20, 25, 10, 15, 15, 25,
45, 0, 25, 0, 5, 25, 0, 20, 5), T1 = c(25, 20, 25, 20, 50, 10,
15, 20, 25, 25, 25, 25, 15, 45, 25, 25, 20, 35, 40, 35, 65, 45,
45, 30, 25, 20, 5, 20, 30, 25, 20, 35, 25, 25, 35, 15, 15, 25,
45, 20, 25, 35, 40, 25, 60, 15, 15, 15, 25, 45, 20, 20, 60, 15,
20, 25, 45, 45, 25, 75, 10, 45, 15, 50, 20, 25, 20, 15, 40, 30,
50, 20, 40, 20, 35, 50, 35, 15, 50, 30, 20, 45, 25, 25, 20, 45,
30, 35, 30, 30, 15, 15, 30, 25, 25, 25, 15, 40, 25, 55, 20, 30,
10, 15, 50, 15, 40, 20, 20, 55, 35, 45, 20, 50, 35, 20, 65, 10,
35, 15, 30, 55, 25, 15, 25), T2 = c(20, 20, 15, 25, 70, 10, 15,
45, 50, 30, 20, 25, 10, 40, 20, 40, 30, 40, 25, 30, 45, 25, 50,
20, 20, 20, 10, 10, 45, 10, 5, 40, 20, 15, 50, 25, 15, 20, 25,
30, 20, 30, 35, 15, 65, 20, 25, 10, 10, 60, 25, 20, 70, 5, 15,
15, 15, 25, 15, 60, 25, 55, 5, 50, 30, 35, 5, 10, 30, 10, 55,
25, 40, 35, 40, 45, 25, 20, 35, 40, 5, 40, 10, 25, 10, 40, 30,
20, 25, 25, 10, 25, 30, 45, 20, 25, 10, 55, 40, 60, 5, 10, 10,
5, 20, 0, 40, 20, 35, 80, 25, 40, 15, 55, 25, 15, 65, 5, 25,
5, 35, 45, 10, 5, 10), T4 = c(10, 25, 35, 35, 70, 20, 15, 70,
55, 30, 50, 35, 40, 40, 35, 45, 60, 50, 15, 25, 70, 10, 60, 40,
30, 15, 15, 15, 50, 5, 20, 70, 5, 35, 65, 40, 20, 65, 50, 30,
45, 55, 65, 35, 45, 35, 40, 20, 5, 65, 20, 25, 75, 10, 25, 25,
10, 25, 20, 55, 20, 65, 5, 60, 70, 45, 15, 25, 35, 5, 70, 55,
65, 40, 35, 55, 35, 45, 45, 45, 20, 40, 25, 50, 15, 55, 55, 40,
30, 60, 10, 60, 40, 35, 30, 65, 5, 75, 55, 80, 15, 30, 55, 15,
50, 25, 45, 30, 45, 90, 20, 45, 20, 40, 35, 20, 70, 20, 30, 45,
50, 55, 45, 5, 45), T8 = c(5, 55, 55, 40, 75, 40, 5, 70, 25,
10, 50, 55, 5, 35, 10, 30, 40, 55, 20, 20, 65, -5, 55, 50, -10,
45, 5, 50, 65, 20, 0, 75, 15, 30, 50, 50, 30, 70, 45, 25, 35,
40, 85, 30, 60, 50, 55, 15, 10, 75, 60, 20, 90, 0, 20, 55, -10,
20, 10, 45, 20, 65, 0, 70, 85, 0, -5, 30, 35, 5, 80, 45, 60,
25, 35, 55, 30, 45, 65, 45, -5, 35, 35, 40, 50, 55, 50, 70, 45,
40, 0, 55, 45, 30, 0, 56, 0, 45, 50, 70, 15, 20, 45, -10, 45,
55, 45, 20, 50, 85, 5, 50, 10, 20, 25, 0, 70, 0, 25, 5, 45, 35,
40, -5, 25)), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22", "23", "24", "25", "26",
"177", "191", "200", "205", "208", "212", "231", "236", "240",
"246", "250", "259", "263", "264", "275", "276", "282", "293",
"303", "304", "307", "309", "315", "316", "320", "322", "324",
"327", "333", "338", "343", "356", "365", "377", "379", "393",
"395", "399", "405", "411", "426", "428", "439", "448", "451",
"459", "490", "495", "498", "513", "515", "521", "524", "528",
"532", "550", "552", "559", "566", "570", "577", "583", "587",
"595", "624", "638", "641", "645", "647", "650", "660", "668",
"677", "683", "688", "691", "702", "704", "710", "719", "730",
"732", "748", "752", "758", "766", "772", "780", "782", "790",
"810", "828", "830", "836", "853", "862", "880", "889", "896"
), class = "data.frame")
db.pca <- prcomp(db, center= TRUE, scale.=TRUE)
summary(db.pca)
str(db.pca)
ggbiplot(db.pca)
screeplot(db.pca, type="line")
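Here is a way to do it with the FactoMineR package. Note that cos2 holds the squared correlations between the variables and the components; res.pca$var$cor holds the signed correlations. The plot is a base R plot.
library(FactoMineR)
res.pca <- PCA(iris[-5], graph = FALSE)
cos2 <- res.pca$var$cos2
old_par <- par(xpd = TRUE)
matplot(
cos2,
type = "l",
xlab = "variable",
ylab = "squared correlation (cos2)",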
main = "Component Pattern Profiles",
xaxt = "n"
)
axis(1, at = 1:nrow(cos2), labels = rownames(cos2))
legend(
x = "bottom",
inset = c(0, -0.2),
legend = colnames(cos2),
col = 1:ncol(cos2),
lty = 1:ncol(cos2),
bty = "n",
horiz = TRUE
)
par(old_par)
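Using your data, I did this:
library(ggplot2)
comp <- prcomp(db, center = TRUE, scale. = TRUE)
# one row per (component, variable) pair; starts as an empty 0 x 3 matrix
b <- matrix(ncol = 3)[-1, ]
for (i in 1:ncol(comp$x)) {
for (j in colnames(db)) {
# correlation between the scores of component i and original variable j
b <- rbind(b, c(i, j, cor.test(comp$x[, i], db[, j])$estimate))
}
}
b <- as.data.frame(b)
# rbind() stored everything as character, so convert the correlations back
b$cor <- as.numeric(b$cor)
ggplot(b, aes(x = V2, y = cor, group = V1, col = V1)) +
geom_line() +
theme_classic()
And I obtained this:
Did it help?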

R clustering- silhouette with observation labels

I do hierarchical clustering with the cluster package in R. Using the silhouette function, I can get the silhouette plot of my cluster output for any given height (h) cut-off in the dendrogram.
# run hierarchical clustering
if(!require("cluster")) { install.packages("cluster"); require("cluster") }
tmp <- matrix(c( 0, 20, 20, 20, 40, 60, 60, 60, 100, 120, 120, 120,
20, 0, 30, 50, 60, 80, 40, 80, 120, 100, 140, 120,
20, 30, 0, 40, 60, 80, 80, 80, 120, 140, 140, 80,
20, 50, 40, 0, 60, 80, 80, 80, 120, 140, 140, 140,
40, 60, 60, 60, 0, 20, 20, 20, 60, 80, 80, 80,
60, 80, 80, 80, 20, 0, 20, 20, 40, 60, 60, 60,
60, 40, 80, 80, 20, 20, 0, 20, 60, 80, 80, 80,
60, 80, 80, 80, 20, 20, 20, 0, 60, 80, 80, 80,
100, 120, 120, 120, 60, 40, 60, 60, 0, 20, 20, 20,
120, 100, 140, 140, 80, 60, 80, 80, 20, 0, 20, 20,
120, 140, 140, 140, 80, 60, 80, 80, 20, 20, 0, 20,
120, 120, 80, 140, 80, 60, 80, 80, 20, 20, 20, 0),
nr=12, dimnames=list(LETTERS[1:12], LETTERS[1:12]))
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
plot(sil_cl)
This gives the figure below, which is the point that frustrates me. How can I use the observation labels rownames(tmp) in the silhouette plot as opposed to the numeric indices (1 to 12) - which make no sense whatsoever to me.
I'm not sure why, but the silhouette call seems to drop the row names. You can add them back with:
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
rownames(sil_cl) <- rownames(tmp)
plot(sil_cl)
I found that adding the argument cex.names = par("cex.axis") to the plot() function gives you the desired labels:
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
plot(sil_cl, cex.names = par("cex.axis"))
