Sum correlated variables - r

I have a list of 200 variables and I want to sum those that are highly correlated.
Assume this is my data:
mydata <- structure(list(APPLE= c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
PEAR= c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
PLUM = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,20L, 10L, 10L, 10L, 10L, 10L),
BANANA= c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
LEMON = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)),
.Names = c("APPLE", "PEAR", "PLUM", "BANANA", "LEMON"),
class = "data.frame", row.names = c(NA,-16L))
I found the following code, but I am not sure how to adapt it for my purpose:
https://stackoverflow.com/a/39484353/4797853
var.corelation <- cor(as.matrix(mydata), method="pearson")
library(igraph)
# prevent duplicated pairs
var.corelation <- var.corelation*lower.tri(var.corelation)
check.corelation <- which(var.corelation>0.62, arr.ind=TRUE)
graph.cor <- graph.data.frame(check.corelation, directed = FALSE)
groups.cor <- split(unique(as.vector(check.corelation)), clusters(graph.cor)$membership)
lapply(groups.cor,FUN=function(list.cor){rownames(var.corelation)[list.cor]})
The output that I am looking for is 2 data frames as follows:
DF1
GROUP1 GROUP2
3 16
4 40
ETC..
Each value is the row-wise sum of the variables within a group.
DF2
ORIGINAL_VAR GROUP
APPLE 1
PEAR 1
PLUM 2
BANANA 2
LEMON 2

Try this (assuming that you have only clustered into 2 groups):
DF1 <- cbind.data.frame(GROUP1=rowSums(mydata[,groups.cor[[1]]]),
GROUP2=rowSums(mydata[,groups.cor[[2]]]))
DF1
GROUP1 GROUP2
1 3 16
2 4 40
3 10 72
4 8 24
5 732 14
6 130 30
7 86 5
8 912 10
9 1752 3
10 156 2114
11 1374 22
12 756 14
13 756 114
14 68 22
15 106 14
16 84 14
DF2 <- NULL
for (i in 1:2) {
DF2 <- rbind(DF2,
cbind.data.frame(ORIGINAL_VAR=rownames(var.corelation)[groups.cor[[i]]],
GROUP=i))
}
DF2
ORIGINAL_VAR GROUP
1 PEAR 1
2 APPLE 1
3 BANANA 2
4 LEMON 2
5 PLUM 2
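The code above hard-codes two groups. A minimal base-R sketch of the same idea for any number of groups follows. It assumes the groups are available as a list of column-index vectors (as groups.cor is above); the toy data and the `groups` list here are made up for illustration so the snippet runs on its own.

```r
# Toy data (first three rows of mydata) and a hypothetical grouping;
# in the thread these would be mydata and groups.cor
mydata <- data.frame(APPLE = c(1L, 2L, 5L), PEAR = c(2L, 2L, 5L),
                     PLUM = c(10L, 20L, 10L), BANANA = c(2L, 10L, 31L),
                     LEMON = c(4L, 10L, 31L))
groups <- list(`1` = c(2L, 1L), `2` = c(4L, 5L, 3L))  # column indices per group

# DF1: one summed column per group, however many groups there are
DF1 <- as.data.frame(lapply(groups, function(idx) {
  rowSums(mydata[, idx, drop = FALSE])
}))
names(DF1) <- paste0("GROUP", seq_along(groups))

# DF2: lookup table from original variable to group number
DF2 <- data.frame(
  ORIGINAL_VAR = names(mydata)[unlist(groups)],
  GROUP = rep(seq_along(groups), lengths(groups))
)
```

Using drop = FALSE keeps rowSums working even if a group happens to contain a single column.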


R: how to display a table with a heat map-type representation of percentage values

How can I display a table in R with a heat-map-style representation of percentage values, as in Excel (the same as shown in the screenshot)?
In the heat-map table/plot, I want all the columns shown below except TOTAL (%), with conditional formatting such that lower values are displayed in green and higher values in red.
The 0 or early(%) column should not be highlighted in the heat map.
See the attached Excel screenshot for what I am looking for.
I do not understand how to approach this kind of Excel-to-R conversion.
The table in my database has the columns shown below.
User 0 or early(%) <=5(%) <=10(%) <=15(%) <=20(%) <=25(%) TOTAL (%)
A 57 15 18 5 5 0 100
B 64 22 12 2 0 0 100
C 73 12 10 3 2 0 100
D 45 37 7 4 3 5 100
E 87 4 2 2 1 4 100
F 44 39 3 0 1 13 100
G 84 7 2 5 2 0 100
H 90 3 0 7 0 0 100
I 88 2 2 7 2 0 100
J 43 17 0 34 6 0 100
K 69 4 2 20 2 2 100
L 37 5 5 0 5 49 100
M 69 18 0 10 3 0 100
N 59 8 3 30 0 0 100
O 91 6 3 0 0 0 100
P 50 7 10 27 3 3 100
Q 40 23 7 13 10 7 100
If you want to reproduce the same "heatmap" as the one you obtained with Excel, I would consider using the formattable package rather than ggplot2. formattable allows data frames to be rendered as HTML tables with formatter functions applied, which resembles conditional formatting in Microsoft Excel (https://cran.r-project.org/web/packages/formattable/vignettes/formattable-data-frame.html).
I drew on @MrFlick's answer to this post, "Is it possible to use more than 2 colors in the color_tile function?", to build the following answer.
First, we create a function that builds the color pattern for the heatmap. Based on your Excel output, 0% values are green, and the remaining values follow a gradient from yellow to orange to red.
library(formattable)
color_tile2 <- function (...) {
formatter("span", style = function(x) {
style(display = "block",
padding = "0 4px",
`border-radius` = "4px",
`background-color` = ifelse(x ==0, "green", csscolor(matrix(as.integer(colorRamp(...)(normalize(as.numeric(x)))),
byrow=TRUE, dimnames=list(c("red","green","blue"), NULL), nrow=3))))
},
x ~ percent(x/100))}
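To see what the color mapping inside color_tile2 does, here is a minimal base-R sketch that does not need formattable. It assumes that normalize() amounts to min-max scaling to [0, 1]; the sample values are made up for illustration.

```r
# Base-R sketch of the value-to-color mapping (assumption: min-max scaling)
x <- c(0, 15, 22, 37)                            # made-up sample percentages
ramp <- colorRamp(c("yellow", "orange", "red"))  # function mapping [0, 1] to RGB
scaled <- (x - min(x)) / (max(x) - min(x))       # rescale values to [0, 1]
rgb_mat <- ramp(scaled)                          # one row per value: R, G, B in 0..255
hex <- rgb(rgb_mat[, 1], rgb_mat[, 2], rgb_mat[, 3], maxColorValue = 255)
hex[x == 0] <- "green"                           # zeros get a fixed color, as in color_tile2
data.frame(x, color = hex)
```

The largest value lands at the red end of the ramp, and the zero is overridden with green rather than taking the yellow end.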
Next, apply the function defined above to the data frame so that only particular columns are colored:
library(formattable)
formattable(df, align = "c", list(
area(col = `<=5(%)`:`<=25(%)`) ~color_tile2(c("yellow","orange","red")),
User = FALSE,
`TOTAL_(%)` = FALSE,
`0_or_early(%)` = formatter("span", style = ~style(color = "darkgreen"), x ~ percent(x/100)))
)
Does this look like what you are trying to get?
Reproducible example
df <- structure(list(User = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M", "N", "O", "P", "Q"), `0_or_early(%)` = c(57L,
64L, 73L, 45L, 87L, 44L, 84L, 90L, 88L, 43L, 69L, 37L, 69L, 59L,
91L, 50L, 40L), `<=5(%)` = c(15L, 22L, 12L, 37L, 4L, 39L, 7L,
3L, 2L, 17L, 4L, 5L, 18L, 8L, 6L, 7L, 23L), `<=10(%)` = c(18L,
12L, 10L, 7L, 2L, 3L, 2L, 0L, 2L, 0L, 2L, 5L, 0L, 3L, 3L, 10L,
7L), `<=15(%)` = c(5L, 2L, 3L, 4L, 2L, 0L, 5L, 7L, 7L, 34L, 20L,
0L, 10L, 30L, 0L, 27L, 13L), `<=20(%)` = c(5L, 0L, 2L, 3L, 1L,
1L, 2L, 0L, 2L, 6L, 2L, 5L, 3L, 0L, 0L, 3L, 10L), `<=25(%)` = c(0L,
0L, 0L, 5L, 4L, 13L, 0L, 0L, 0L, 0L, 2L, 49L, 0L, 0L, 0L, 3L,
7L), `TOTAL_(%)` = c(100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L)), row.names = c(NA,
-17L), class = c("data.table", "data.frame"))

Calculate mean for column grouped by values of two other columns [duplicate]

I have a dataframe with 5 columns. I know how to calculate the mean for one column grouped by another column. However, I need to group it by two columns. For example, I want to calculate the mean for column 5 grouped by column 1 and column 2.
df <- structure(list(Country = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), .Label = c("AT", "CH", "DE"), class = "factor"),
Occupation = c(1L, 3L, 5L, 3L, 1L, 2L, 5L, 3L, 5L, 3L, 1L,
2L, 1L, 5L, 3L, 3L, 1L, 3L, 2L, 5L, 5L, 1L, 2L, 1L, 3L),
Age = c(20L, 46L, 30L, 12L, 73L, 53L, 19L, 43L, 65L, 53L,
19L, 34L, 76L, 25L, 45L, 39L, 18L, 59L, 37L, 24L, 19L, 60L,
51L, 32L, 29L), Gender = structure(c(1L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L), .Label = c("female", "male"), class = "factor"),
Income = c(100L, 80L, 78L, 29L, 156L, 56L, 95L, 104L, 87L,
56L, 203L, 45L, 112L, 78L, 56L, 140L, 99L, 67L, 89L, 109L,
43L, 145L, 30L, 101L, 77L)), class = "data.frame", row.names = c(NA,
-25L))
head(df)
Country Occupation Age Gender Income
1 AT 1 20 female 100
2 AT 3 46 female 80
3 AT 5 30 male 78
4 AT 3 12 male 29
5 AT 1 73 male 156
6 AT 2 53 female 56
So what I want to do is calculate the mean of the column ‘Income’, grouped by Country and Occupation. E.g., I want to calculate the mean of ‘Income’ for all those people living in country ‘AT’ with occupation ‘3’, the mean of ‘Income’ for all those living in country ‘CH’ with occupation ‘1’, and so on.
(1) base method (aggregate)
mean.df <- aggregate(Income ~ Country + Occupation, df, mean)
names(mean.df)[3] <- "Income_Mean"
merge(df, mean.df)
(2) base method (tapply)
mean.df1 <- tapply(df$Income, list(df$Country, df$Occupation), mean)
mean.df2 <- as.data.frame(as.table(mean.df1))
names(mean.df2) <- c("Country", "Occupation", "Income_Mean")
merge(df, mean.df2)
(3) stats method (ave)
df2 <- df
df2$Income_Mean <- ave(df$Income, df$Country, df$Occupation)
(4) dplyr method
library(dplyr)
df %>% group_by(Country, Occupation) %>%
mutate(Income_Mean = mean(Income))
Output :
Country Occupation Age Gender Income Income_Mean
<fct> <int> <int> <fct> <int> <dbl>
1 AT 1 20 female 100 128
2 AT 3 46 female 80 71
3 AT 5 30 male 78 86.5
4 AT 3 12 male 29 71
5 AT 1 73 male 156 128
6 AT 2 53 female 56 56
7 AT 5 19 male 95 86.5
8 AT 3 43 male 104 71
9 CH 5 65 male 87 82.5
10 CH 3 53 female 56 84
# ... with 15 more rows
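The approaches above attach the group mean to every original row. As a hedged illustration of the difference between one row per group and one value per row, here is a small base-R sketch on made-up data (df_small is hypothetical, not the df from the question).

```r
# Made-up toy data with the same column names as in the question
df_small <- data.frame(
  Country    = c("AT", "AT", "AT", "CH", "CH"),
  Occupation = c(1L, 3L, 1L, 5L, 5L),
  Income     = c(100L, 80L, 156L, 87L, 78L)
)

# One row per (Country, Occupation) combination
per_group <- aggregate(Income ~ Country + Occupation, df_small, mean)

# One value per original row, aligned with df_small (ave() defaults to mean)
per_row <- ave(df_small$Income, df_small$Country, df_small$Occupation)
```

Here per_group has three rows, while per_row has five values, one for each row of df_small.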
Using sqldf:
library(sqldf)
sqldf("select Country,Occupation,Age,Gender,avg(Income) from df group by Country,Occupation")
OR
Using data.table:
library(data.table)
df=data.table(df)
df[, mean(Income), by = list(Country,Occupation)]
Output (of the sqldf query):
Country Occupation Age Gender avg(Income)
1 AT 1 73 male 128.0
2 AT 2 53 female 56.0
3 AT 3 43 male 71.0
4 AT 5 19 male 86.5
5 CH 1 18 female 138.0
6 CH 2 34 male 45.0
7 CH 3 39 male 84.0
8 CH 5 25 female 82.5
9 DE 1 32 female 123.0
10 DE 2 51 female 59.5
11 DE 3 29 male 72.0
12 DE 5 19 male 76.0

delete observations by days in R

My dataset has the following structure:
df=structure(list(Data = structure(c(12L, 13L, 14L, 15L, 16L, 17L,
18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("01.01.2018",
"02.01.2018", "03.01.2018", "04.01.2018", "05.01.2018", "06.01.2018",
"07.01.2018", "12.02.2018", "13.02.2018", "14.02.2018", "15.02.2018",
"25.12.2017", "26.12.2017", "27.12.2017", "28.12.2017", "29.12.2017",
"30.12.2017", "31.12.2017"), class = "factor"), sku = 1:18, metric = c(100L,
210L, 320L, 430L, 540L, 650L, 760L, 870L, 980L, 1090L, 1200L,
1310L, 1420L, 1530L, 1640L, 1750L, 1860L, 1970L), action = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("Data", "sku", "metric", "action"), class = "data.frame", row.names = c(NA,
-18L))
I need to delete observations that fall on certain dates.
However, this dataset has an action variable, which takes only two values, 0 and 1.
Observations on these dates should be deleted only where action is 0.
The dates are stored in a separate dataset.
datedata=structure(list(Data = structure(c(18L, 19L, 20L, 21L, 22L, 5L,
7L, 9L, 11L, 13L, 15L, 17L, 23L, 1L, 2L, 3L, 4L, 6L, 8L, 10L,
12L, 14L, 16L), .Label = c("01.05.2018", "02.05.2018", "03.05.2018",
"04.05.2018", "05.03.2018", "05.05.2018", "06.03.2018", "06.05.2018",
"07.03.2018", "07.05.2018", "08.03.2018", "08.05.2018", "09.03.2018",
"09.05.2018", "10.03.2018", "10.05.2018", "11.03.2018", "21.02.2018",
"22.02.2018", "23.02.2018", "24.02.2018", "25.02.2018", "30.04.2018"
), class = "factor")), .Names = "Data", class = "data.frame", row.names = c(NA,
-23L))
How can I do it?
A solution is to use dplyr::filter as:
library(dplyr)
library(lubridate)
df %>% mutate(Data = dmy(Data)) %>%
filter(action==1 | (action==0 & !(Data %in% dmy(datedata$Data))))
# Data sku metric action
# 1 2017-12-25 1 100 0
# 2 2017-12-26 2 210 0
# 3 2017-12-27 3 320 0
# 4 2017-12-28 4 430 0
# 5 2017-12-29 5 540 0
# 6 2017-12-30 6 650 0
# 7 2017-12-31 7 760 0
# 8 2018-01-01 8 870 0
# 9 2018-01-02 9 980 1
# 10 2018-01-03 10 1090 1
# 11 2018-01-04 11 1200 1
# 12 2018-01-05 12 1310 1
# 13 2018-01-06 13 1420 1
# 14 2018-01-07 14 1530 1
# 15 2018-02-12 15 1640 1
# 16 2018-02-13 16 1750 1
# 17 2018-02-14 17 1860 1
# 18 2018-02-15 18 1970 1
I guess this will work. First use match to see whether each date in df has a match in datedata, then filter out the matching rows that have action equal to 0:
library(dplyr)
df <- df %>% mutate(Data.flag = match(Data, datedata$Data)) %>%
filter(is.na(Data.flag) | action != 0)
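None of the dates in datedata occur in the sample df, so no rows are removed there. The following base-R sketch uses made-up data with overlapping dates to show the intended behavior; df_small and drop_dates are hypothetical stand-ins for df and datedata$Data.

```r
# Made-up data: two rows fall on a listed date with action 0, one with action 1
df_small <- data.frame(
  Data   = c("25.12.2017", "21.02.2018", "21.02.2018", "05.03.2018"),
  action = c(0L, 0L, 1L, 0L)
)
drop_dates <- c("21.02.2018", "05.03.2018")

# Parse day.month.year strings into Date objects
d       <- as.Date(df_small$Data, format = "%d.%m.%Y")
d_drop  <- as.Date(drop_dates, format = "%d.%m.%Y")

# Drop a row only if its date is listed AND action is 0
to_drop <- df_small$action == 0 & d %in% d_drop
kept    <- df_small[!to_drop, ]
```

The row dated 21.02.2018 with action 1 survives, while both action-0 rows on listed dates are removed.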

How to see the distribution of a column in R?

I have a dataset that looks like this:
A tibble: 935 x 17
wage hours iq kww educ exper tenure age married black south urban sibs brthord meduc
<int> <int> <int> <int> <int> <int> <int> <int> <fctr> <fctr> <fctr> <fctr> <int> <int> <int>
1 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8
2 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14
3 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14
4 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12
5 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6
6 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8
7 600 40 91 24 10 13 0 30 0 0 0 1 1 2 8
8 1081 40 114 50 18 8 14 38 1 0 0 1 2 3 8
9 1154 45 111 37 15 13 1 36 1 0 0 0 2 3 14
10 1000 40 95 44 12 16 16 36 1 0 0 1 1 1 12
...
What can I run to see the distribution of wage (the first column)? Specifically, I want to see how many people have a wage of under $300.
Which ggplot function can I use?
You can get the cumulative histogram:
library(ggplot2)
ggplot(df,aes(wage))+geom_histogram(aes(y=cumsum(..count..)))+
stat_bin(aes(y=cumsum(..count..)),geom="line",color="green")
If you specifically want to know the count of entries that meet a certain condition, you can subset with base R and count the rows with count() from dplyr:
library(dplyr)
count(df[df$wage > 1000,])
## # A tibble: 1 x 1
## n
## <int>
## 1 3
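A count like this can also be obtained without any package, since summing a logical vector counts its TRUE values. A minimal sketch using the ten wage values shown in the question (the 700 threshold is chosen only for illustration):

```r
# First ten wage values from the question
wage <- c(769, 808, 825, 650, 562, 1400, 600, 1081, 1154, 1000)

n_under_700 <- sum(wage < 700)   # number of entries below 700
share_700   <- mean(wage < 700)  # proportion of entries below 700
n_under_300 <- sum(wage < 300)   # the threshold asked about in the question
```

If wage contains missing values, adding na.rm = TRUE to sum() and mean() avoids an NA result.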
Data:
df <- structure(list(wage = c(769L, 808L, 825L, 650L, 562L, 1400L,
600L, 1081L, 1154L, 1000L), hours = c(40L, 50L, 40L, 40L, 40L,
40L, 40L, 40L, 45L, 40L), iq = c(93L, 119L, 108L, 96L, 74L, 116L,
91L, 114L, 111L, 95L), kww = c(35L, 41L, 46L, 32L, 27L, 43L,
24L, 50L, 37L, 44L), educ = c(12L, 18L, 14L, 12L, 11L, 16L, 10L,
18L, 15L, 12L), exper = c(11L, 11L, 11L, 13L, 14L, 14L, 13L,
8L, 13L, 16L), tenure = c(2L, 16L, 9L, 7L, 5L, 2L, 0L, 14L, 1L,
16L), age = c(31L, 37L, 33L, 32L, 34L, 35L, 30L, 38L, 36L, 36L
), married = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L), black = c(0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), south = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), urban = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L), sibs = c(1L, 1L, 1L, 4L, 10L, 1L, 1L, 2L, 2L, 1L
), brthord = c(2L, NA, 2L, 3L, 6L, 2L, 2L, 3L, 3L, 1L), meduc = c(8L,
14L, 14L, 12L, 6L, 8L, 8L, 8L, 14L, 12L)), .Names = c("wage",
"hours", "iq", "kww", "educ", "exper", "tenure", "age", "married",
"black", "south", "urban", "sibs", "brthord", "meduc"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Try this:
library(dplyr)
library(ggplot2)
df <- df %>% filter(wage < 300)
qplot(wage, data = df)

R ggplot2 - How to plot 2 boxplots on the same x value

Suppose I have two boxplots.
trial1 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial1))
trial1 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
trial2 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial2))
trial2 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
How can I plot Trial 1 and Trial 2 on the same plot, at the same respective X positions? They have the same range of y.
I looked at geom_boxplot(position="identity"), but that plots the two conditions (fill) on the same X.
I want to plot two y columns on the same X.
Edit: the dataset
User Condition Scenario Trial1 Trial2
1 1 ME a 67 41
2 1 ME b 70 42
3 1 ME c 40 15
4 1 ME d 65 23
5 1 ME e 45 45
6 1 SE a 100 34
7 1 SE b 54 23
8 1 SE c 70 23
9 1 SE d 56 15
10 1 SE e 30 20
11 2 ME a 42 23
12 2 ME b 22 12
13 2 ME c 28 8
14 2 ME d 22 8
15 2 ME e 38 37
16 2 SE a 59 18
17 2 SE b 65 14
18 2 SE c 75 7
19 2 SE d 37 9
20 2 SE e 31 7
dput()
structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Condition = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("ME", "SE"), class = "factor"), Scenario =
structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
Trial1 = c(67L, 70L, 40L, 65L, 45L, 100L, 54L, 70L, 56L,
30L, 42L, 22L, 28L, 22L, 38L, 59L, 65L, 75L, 37L, 31L), Trial2 = c(41L,
42L, 15L, 23L, 45L, 34L, 23L, 23L, 15L, 20L, 23L, 12L, 8L,
8L, 37L, 18L, 14L, 7L, 9L, 7L)), .Names = c("User", "Condition",
"Scenario", "Trial1", "Trial2"), class = "data.frame", row.names = c(NA,
-20L))
You could try using interaction to combine two of your factors and plot against a third. For example, assuming you want to fill by condition as in your original code:
library(tidyr)
library(ggplot2)
completionTime %>%
gather(trial, value, -Scenario, -Condition, -User) %>%
ggplot(aes(interaction(Scenario, trial), value)) + geom_boxplot(aes(fill = Condition))
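To make the reshaping step concrete, here is a base-R sketch of what the wide-to-long conversion and interaction() produce. The two-row data frame is made up for illustration and is not the completionTime data.

```r
# Made-up wide data: one column per trial
completion <- data.frame(
  Scenario  = c("a", "b"),
  Condition = c("ME", "SE"),
  Trial1    = c(67, 54),
  Trial2    = c(41, 23)
)

# Long format: one row per (observation, trial) pair
long <- data.frame(
  Scenario  = rep(completion$Scenario, 2),
  Condition = rep(completion$Condition, 2),
  trial     = rep(c("Trial1", "Trial2"), each = nrow(completion)),
  value     = c(completion$Trial1, completion$Trial2)
)

# Combined factor used on the x axis; the first factor varies fastest
long$group <- interaction(long$Scenario, long$trial)
levels(long$group)
```

With the default level order, all Trial1 boxes come before all Trial2 boxes. If the two trials of a scenario should sit next to each other, interaction(Scenario, trial, lex.order = TRUE) may be closer to what is wanted.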
