Combine or merge values - r

I would like to combine (merge) some of my values so that they become a single value in R. For instance, to combine 1+3, 5+6, 10+11, 12+13. Does anyone know how to do that? :-)
tibble::tibble(
  Educational_level = c(1, 3, 5, 6, 10, 11, 12, 13)
)
This is what I have tried, but it does not merge the factor levels I want when I run the linear regression.
ess7no <- ess7no %>%
mutate(edlvdno = as_factor(edlvdno)) %>%
mutate(edlvdno = recode(edlvdno, "1" = "3" , "5" = "6", "10" = "11", "12" = "13"))

# Load dplyr first: it provides tibble(), %>%, mutate(), group_by() and summarise()
library(dplyr)
df <- tibble(
  Educational_level = c(1, 3, 5, 6, 10, 11, 12, 13),
  Labels = c(
    "Not graduated", "Primary school", "High school", "High school",
    "Bachelor", "Bachelor", "Master", "Master"
  )
)
df %>%
mutate(Labels = ifelse(Labels %in% c(
"Not graduated",
"Primary school"
), stringr::str_c("Not graduated", "_", "Primary school"), Labels)) %>%
group_by(Labels) %>%
summarise(Educational_level = sum(Educational_level))
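If the goal is for the regression to treat 1 and 3 (and the other pairs) as one category, the factor levels themselves need to be merged. Here is a minimal base R sketch of that idea. The vector edu only stands in for the edlvdno column, and the merged level names such as "1_3" are made up for illustration.

```r
# Stand-in for the education column from the question
edu <- c(1, 3, 5, 6, 10, 11, 12, 13)
edu_f <- factor(edu)

# Assigning a named list to levels() merges the old levels listed
# under each new name (the new names here are hypothetical)
levels(edu_f) <- list(
  "1_3"   = c("1", "3"),
  "5_6"   = c("5", "6"),
  "10_11" = c("10", "11"),
  "12_13" = c("12", "13")
)

table(edu_f)
```

A factor merged this way can be used as a predictor in lm(), which then estimates one coefficient per merged level rather than per original code.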

Related

Labelling min, median, max of boxplot, using R-base

I am trying to label the min, median, and max values on the boxplot that I created. However, the boxplot is built from two different data frames, so I am confused about how I should label the values.
Dummy variable:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
boxplot(class1$Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
I am trying to include the data values in the boxplot (shown in image), along with what each one is (e.g. min, median, max).
Many thanks
You could use the function text together with fivenum: pass the five-number summary of each box to the labels argument and place the labels using the x and y positions, like this:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age), labels = fivenum(class1$Age), x=0.5)
text(y = fivenum(class2$Age1), labels = fivenum(class2$Age1), x=2.5)
Created on 2023-01-01 with reprex v2.0.2
If you only want the min (1), median(3) and max(5) you can simply extract the first, third and fifth value of the fivenum function like this:
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age)[c(1,3,5)], labels = fivenum(class1$Age)[c(1,3,5)], x=0.5)
text(y = fivenum(class2$Age1)[c(1,3,5)], labels = fivenum(class2$Age1)[c(1,3,5)], x=2.5)
Created on 2023-01-01 with reprex v2.0.2
The following code adds a new column, Class, containing the class name to both data frames, which are then bound together with rbind.
The boxplot is then drawn, with the at argument leaving a bit more space between the boxes.
tapply computes fivenum for each Class, and these numbers go into a new data frame that holds the annotation text used by text.
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
Class <- rep("Class1", 5)
class1 <- data.frame(Name, Age, Class)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1 <- c(33, 21, 56, 32, 65, 32, 89)
Class1 <- rep("Class2", 7)
class2 <- data.frame(Name = Name1, Age = Age1, Class = Class1)
df <- rbind(class1, class2)
bp <- boxplot(df$Age ~ factor(df$Class),
names = c("Class1", "Class2"),
ylim = c(0, 100),
xlim = c(0, 5),
xlab = "", ylab = "Age",
frame = F,
at = c(1, 3)
)
box(bty = "l")
fn <- tapply(df$Age, df$Class, fivenum)
tex <- data.frame(
Class = c("Class1", "Class2"),
max = c(fn$Class1[5], fn$Class2[5]),
min = c(fn$Class1[1], fn$Class2[1]),
median = c(fn$Class1[3], fn$Class2[3])
)
text(x = c(1, 3), y = tex$max + 2.5, paste(tex$max, "(max)", sep = ""))
text(x = c(1, 3), y = tex$min - 2.5, paste(tex$min, "(min)", sep = ""))
text(x = c(1.9, 3.9), y = tex$median, paste(tex$median, "(median)", sep = ""))
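The answers above compute the label values with fivenum. As a complementary sketch, boxplot also returns the statistics it draws, so the same numbers can be read from its return value. For this data they match fivenum because there are no outliers; when outliers exist, the whisker ends in stats are not the raw min and max.

```r
Age  <- c(23, 41, 32, 58, 26)
Age1 <- c(33, 21, 56, 32, 65, 32, 89)

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(Age, Age1, names = c("class1", "class2"), plot = FALSE)

# Each column of bp$stats holds, for one box: lower whisker, lower hinge,
# median, upper hinge, upper whisker
bp$stats

# Points lying beyond the whiskers, if any
bp$out
```

Rows 1, 3 and 5 of bp$stats could then be passed to text in the same way as the fivenum values.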

plot multiple datasets and compare categories with barplots

I have three datasets with the same variables, and I want to compare one variable over 29 different categories between the three datasets. The example below should work as a reproducible example. I already tried to plot it, but the output was not as expected. I would like to have the three bars next to each other, with a small panel inside the plot for every category.
List_before <- data.frame(
  name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
  number_trackers = c(1, 2, 3, 4, 5, 6),
  category = c("Ads", "Analytics", "Ads", "Analytics", "Ads", "Ads"),
  c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
List_short_after <- data.frame(
  name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
  number_trackers = c(1, 2, 3, 4, 5, 6),
  category = c("Ads", "Analytics", "Ads", "Analytics", "Ads", "Ads"),
  c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
List_after <- data.frame(
  name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
  number_trackers = c(1, 2, 3, 4, 5, 6),
  category = c("Ads", "Analytics", "Ads", "Analytics", "Ads", "Ads"),
  c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
ggplot(data = NULL,
mapping = aes(y = number_trackers,x=category)) +
geom_col(data = List_before,fill= "#ca93ef", colour="#ca93ef") +
geom_col(data = List_short_after,fill= "#5034c4", colour="#5034c4") +
geom_col(data = List_after,fill= "#795fc6", colour="#795fc6") +
facet_wrap(facets = vars(category))+
theme_minimal() +
theme(text = element_text(color = "#795fc6",size=12,face="bold"),
axis.text = element_text(color = "#795fc6",size=14,face="bold"))+
labs( y = "Number Trackers", x = "Categories")
This is how the plot should look, just with 3 bars instead of 2: https://i.stack.imgur.com/nDq36.png
Here's code that may help you reach your goal. Note that I took some liberties with your input data because it seems to be incomplete in your question.
library(ggplot2)
List_before <- data.frame(
list_id = "list_before",
name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
number_trackers = sample(c(1, 2, 3, 4, 5, 6)),
category = c("Ads", "Analytics", "Other 1", "Other 2", "Other 3", "Other 4"),
c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
List_short_after <- data.frame(
list_id = "list_short_after",
name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
number_trackers = sample(c(1, 2, 3, 4, 5, 6)),
category = c("Ads", "Analytics", "Other 1", "Other 2", "Other 3", "Other 4"),
c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
List_after <- data.frame(
list_id = "list_after",
name = c("Tracker1", "Tracker2", "Tracker3", "Tracker4","Tracker5","Tracker6"),
number_trackers = sample(c(1, 2, 3, 4, 5, 6)),
category = c("Ads", "Analytics", "Other 1", "Other 2", "Other 3", "Other 4"),
c4 = c("url1.com","ur2.com","url3.com","url4.com","url5.com","url6.com"))
df <- rbind(List_before, List_short_after, List_after)
df$list_id <- as.factor(df$list_id)
df$category <- as.factor(df$category)
ggplot(df, aes(y = number_trackers, x = list_id)) +
geom_bar(aes(fill = list_id), stat = "identity", position = position_dodge()) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
facet_grid(~category) +
labs(y = "Number of Trackers", x = NULL)
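The key step in the answer above is stacking the three data frames with an identifying column, so that a single plotting call can place the bars side by side. As a rough base R sketch of the same idea (a swapped-in technique, not the ggplot2 code above), xtabs can sum the counts per list and category and barplot can draw them grouped. The helper make_list and the counts for the second and third lists are made up for illustration.

```r
# Hypothetical helper that builds one of the three lists
make_list <- function(id, counts) {
  data.frame(
    list_id = id,
    category = c("Ads", "Analytics", "Ads", "Analytics", "Ads", "Ads"),
    number_trackers = counts
  )
}

# Stack the three lists; list_id records where each row came from
df <- rbind(
  make_list("before", c(1, 2, 3, 4, 5, 6)),
  make_list("short_after", c(2, 1, 4, 3, 6, 5)),
  make_list("after", c(1, 1, 2, 2, 3, 3))
)

# Sum number_trackers for every list_id / category combination
counts <- xtabs(number_trackers ~ list_id + category, data = df)
counts

# One group of bars per category, one bar per list within each group
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Categories", ylab = "Number Trackers")
```

Summing first matters here because each category occurs on several rows of the question's data.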

Heatmap in R with raw values

I have this dataframe:
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
I would like to create a heatmap, with patient ID in the x axis and clas1, clas2 and clas3 in the y axis. The values represented in the heat map would be the raw value of each "clas". Here I post a drawing of what I would like
I apologise because I don't have available more colours to represent this, but this is only an example and any colour scale could be used.
An important thing is that I would like to distinguish between zeros and NAs so ideally NAs have their own colour or appear in white (empty).
I hope this is understandable enough.
But any questions just ask
Many thanks!
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
library(tidyverse)
df %>% pivot_longer(!PatientID) %>%
ggplot(aes(x= PatientID, y = name, fill = value)) +
geom_tile()
Created on 2021-05-25 by the reprex package (v2.0.0)
Here is a base R option with `heatmap`:
heatmap(t(`row.names<-`(as.matrix(df[-1]), df$PatientID)))
# Which is like
# x <- as.matrix(df[-1])
# row.names(x) <- df$PatientID
# heatmap(t(x))
Preparing the data
I'll give 4 options, in all four you need to assign the rownames and remove the id column. I.e.:
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
rownames(df) <- df$PatientID
df$PatientID <- NULL
df
The output is:
> df
     clas1 clas2 clas3
3454     1     4     1
345      0     1    NA
5        5     0     0
348     NA     3     5
567     NA     1     5
79       4     0     5
Base R
With base R (decent output):
heatmap(as.matrix(df))
gplots
With gplots (a bit ugly, but many more parameters to control):
library(gplots)
heatmap.2(as.matrix(df))
heatmaply
With heatmaply you have nicer defaults to use for the dendrograms (it also organizes them in a more "optimal" way).
You can learn more about the package here.
Static
Static heatmap with heatmaply (better defaults, IMHO)
library(heatmaply)
ggheatmap(df)
Now with colored dendrograms
library(heatmaply)
ggheatmap(df, k_row = 3, k_col = 2)
With no dendrogram:
library(heatmaply)
ggheatmap(df, dendrogram = F)
Interactive
Interactive heatmap with heatmaply (hover tooltip, and the ability to zoom - it's interactive!):
library(heatmaply)
heatmaply(df)
And anything you can do with the static ggheatmap you can also do with the interactive heatmaply version.
Here is another option:
library(tidyverse)
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
# named vector for heatmap
cols <- c("0" = "white",
"1" = "green",
"2" = "orange",
"3" = "yellow",
"4" = "pink",
"5" = "black",
"99" = "grey")
labels_legend <- c("0" = "0",
"1" = "1",
"2" = "2",
"3" = "3",
"4" = "4",
"5" = "5",
"99" = "NA")
df1 <- df %>%
pivot_longer(
cols = starts_with("clas"),
names_to = "names",
values_to = "values"
) %>%
mutate(PatientID = factor(PatientID, levels = c("3454", "345", "5", "348", "567", "79")))
ggplot(
df1,
aes(factor(PatientID), factor(names))) +
geom_tile(aes(fill= factor(values))) +
# geom_text(aes(label = values), size = 5, color = "black") + # text in tiles
scale_fill_manual(
values = cols,
breaks = c("0", "1", "2", "3", "4", "5", "99"),
labels = labels_legend,
aesthetics = c("colour", "fill"),
drop = FALSE
) +
scale_y_discrete(limits=rev) +
coord_equal() +
theme(line = element_blank(),
title = element_blank()) +
theme(legend.direction = "horizontal", legend.position = "bottom")

Categorical variable of more than five categories not showing on sumtable in R

I am trying to conduct a balance test for treatment and control groups.
Using sumtable from vtable package, I constructed a summary statistics table by group.
However, a categorical variable of more than 5 categories does not show on the table.
So for example I have a sample dataframe like this:
Treatment <- c("Treated", "Control", "Control", "Treated", "Treated", "Treated", "Control", "Treated", "Control", "Control")
City <- c(1, 4, 6, 2, 3, 3, 2, 5, 4, 6)
Age <- c(56, 70, 12, 54, 23, 9, 33, 38, 27, 49)
Gender <- c(1, 2, 3, 2, 2, 1, 1, 3, 2, 1)
df <- data.frame(Treatment, City, Age, Gender)
I label City and Gender accordingly:
label_city <- c("1" = "City A",
"2" = "City B",
"3" = "City C",
"4" = "City D",
"5" = "City E",
"6" = "City F")
df$City <- label_city[match(df$City, names(label_city))]
label_gender <- c("1" = "Male",
"2" = "Female",
"3" = "Other")
df$Gender <- label_gender[match(df$Gender, names(label_gender))]
Then I create the table:
library(vtable)
sumtable(df, group = "Treatment", group.test = TRUE)
I get a summary statistics table with Age and Gender, but without City.
When I restrict City to up to five categories, it appears on the table.
Is there a way to make City present in the summary table with all the categories?
Got an answer from the maintainer:
vtable automatically converts character variables into factors for display, but it doesn't do so when there are too many different values of the variable, because then it's probably an actual string variable and there would be N different categories.
So after converting the column from character to factor (see "Convert data.frame column format from character to factor"), all the categories were displayed in the sumtable output.
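The conversion described above can be sketched in base R as follows, using the City column from the question. The sumtable call is left commented out because it needs the vtable package installed; it is copied from the question, not a new recommendation.

```r
Treatment <- c("Treated", "Control", "Control", "Treated", "Treated",
               "Treated", "Control", "Treated", "Control", "Control")
City <- c(1, 4, 6, 2, 3, 3, 2, 5, 4, 6)
df <- data.frame(Treatment, City)

label_city <- c("1" = "City A", "2" = "City B", "3" = "City C",
                "4" = "City D", "5" = "City E", "6" = "City F")

# Look up the label for each code, then store the result as a factor
# (not a character vector) so all six categories are kept as levels
df$City <- factor(unname(label_city[as.character(df$City)]),
                  levels = unname(label_city))

str(df$City)

# With vtable installed, the table from the question can then be built:
# vtable::sumtable(df, group = "Treatment", group.test = TRUE)
```

If the column already holds the labels as character strings, as.factor(df$City) would do the same conversion with the levels in alphabetical order.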

Setting the order level when using barplots

I'm trying to plot a series of demographic factors. Each plot shows the frequency distribution of a demographic variable by gender. It runs nicely, but some of the labels are in alphabetical order rather than a meaningful order, e.g. Education, Marital Status and SIC2007.
Data structure
structure(list(DMSex = c("Male", "Female", "Male", "Male"), Income = c(980,
-8, 3000, 120), IncCat = c("-1", "-8", "-1", "-1"), HrWkAc = c(-1,
-1, -1, -1), ShiftWk = c(-1, -1, -1, -1), ShiftPat = c(-1, -1,
-1, -1), SOC2010C = c("-1", "9.2.3.3", "-1", "-1"), XSOC2010 = c(-1,
9233, -1, -1), IndexNo = c(-1, 1398, -1, -1), ES2010 = c(-1,
7, -1, -1), nssec = c(-1, 13.4, -1, -1), SECFlag = c(-1, 0, -1,
-1), LSOC2000 = c("-1", "9.2.3.3", "-1", "-1"), XSOC2000 = c(-1,
9233, -1, -1), seg = c(-1, 11, -1, -1), sc = c(-1, 5, -1, -1),
SIC2007 = c(-1, 87, -1, -1), Educ = c(1, 1, -1, 2), EducCur = c(10,
1, -1, -1), FinFTEd = c(-1, -1, -1, 1), FinFTEdY = c(-1,
-1, -1, 21), HiQual = c(22, 10, -1, 1), sic20070 = c(-1,
87, -1, -1), dhhtype = c(6, 8, 7, 3), dagegrp = c(2, 3, 3,
3), dmarsta = c("Single, never married", "Single, never married",
"Interview not achieved", "Married/cohabitating"), dhiqual = c(" Secondary",
" A level or equivalent", "Item not applicable", "Degree or higher"
), dnssec8 = c(-1, 8, -1, -1), duresmc = c(14, 15, 11, 16
), dgorpaf = c(7, 8, 5, 10), dukcntr = c(1, 1, 1, 1), dnrkid04 = c(0,
0, 0, 0), dilodefr = c(3, 3, -1, 3), deconact = c(8, 8, -1,
11), dtenure = c(2, 3, 2, 3), dtotac = c(-1, -1, -1, -1),
dtotus = c(-1, -1, -1, -1), dsic = c("Item not applicable",
"Public admin, education and health", "Item not applicable",
"Item not applicable"), dsoc = c(-1, 9, -1, -1), DVAge_category = c("15 to 30",
"15 to 30", "15 to 30", "15 to 30"), Income_category = c("Less than 1000",
"Less than 1000", "1001 to 3000", "Less than 1000"), HoursWorked_category = c("Less than 20 hours",
"Less than 20 hours", "Less than 20 hours", "Less than 20 hours"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
#Age variable
demographics$dagegrp_category<-ifelse(demographics$dagegrp_01 > 2 & demographics$dagegrp < 6, age<-"15 to 30",
ifelse(demographics$dagegrp> 6 & demographics$dagegrp < 9, age<-"31 to 45",
ifelse(demographics$dagegrp > 9 & demographics$dagegrp < 12 , age<-"46 to 60",
ifelse(demographics$dagegrp > 12 & demographics$dagegrp < 15 , age<-"61 to 75",
ifelse(demographics$dagegrp > 15 & demographics$dagegrp < 18 , age<-"76+",
age<- "zombie")))))
demographics$DVAge_category<-c("15 to 30","31 to 45", "46 to 60","61 to 75", "76+", "zombie")[findInterval(demographics$dagegrp , c(-Inf, 6, 10, 12, 15,18, Inf))]
Age<-as.vector(demographics$DVAge_category)
#Gender variable
demographics$DMSex[demographics$DMSex==1]<-"Male"
demographics$DMSex[demographics$DMSex==2]<-"Female"
Gender<-as.vector(demographics$DMSex)
#Income variable
demographics$Income_category<-ifelse(demographics$Income < 1001, income<-"Less than 1000",
ifelse(demographics$Income > 999 & demographics$Income < 3001, income<-"1001 to 3000",
ifelse(demographics$Income > 3001 & demographics$Income < 6001, income <-"3001 to 6000",
ifelse(demographics$Income > 6001 & demographics$Income < 10001 , income<-"6001 to 10000",
income<- "zombie"))))
demographics$Income_category<-c("Less than 1000","1001 to 3000", "3001 to 6000", "6001 to 10000","zombie")[findInterval(demographics$Income , c(-Inf, 1001, 3001, 6001,10001, Inf) ) ]
Income<-as.vector(demographics$Income_category)
#Marital status variable
demographics$dmarsta[demographics$dmarsta==-1]<-"Interview not achieved"
demographics$dmarsta[demographics$dmarsta==1]<-"Single, never married"
demographics$dmarsta[demographics$dmarsta==2]<-"Married/cohabitating"
demographics$dmarsta[demographics$dmarsta==3]<-"Divorced/widowed"
MaritalStatus<-as.vector(demographics$dmarsta)
#Education
demographics$dhiqual[demographics$dhiqual==-8]<-"Don't know"
demographics$dhiqual[demographics$dhiqual==-1]<-"Item not applicable"
demographics$dhiqual[demographics$dhiqual==1]<-"Degree or higher"
demographics$dhiqual[demographics$dhiqual==2]<-"Higher education"
demographics$dhiqual[demographics$dhiqual==3]<-" A level or equivalent"
demographics$dhiqual[demographics$dhiqual==4]<-" Secondary"
demographics$dhiqual[demographics$dhiqual==5]<-" Other"
Education<-as.vector(demographics$dhiqual)
#Hours worked per week in main job variable
demographics$HoursWorked_category<-ifelse(demographics$dtotac < 21, workhours<-"Less than 20 hours",
ifelse(demographics$dtotac > 20 & demographics$dtotac< 41, workhours <-"Between 21 to 40 hours",
ifelse(demographics$dtotac > 40 & demographics$dtotac < 61, workhours <-"Between 41 to 60 hours",
ifelse(demographics$dtotac > 62, workhours<-"More than 61 hours",
workhours<- "Not Applicable"))))
demographics$HoursWorked_category<-c("Less than 20 hours", "Between 21 to 40 hours", "Between 41 to 60 hours","More than 61 hours","Not Applicable")[findInterval(demographics$dtotac, c(-Inf, 21, 41, 61, 62, Inf) ) ]
WorkHours<-as.vector(demographics$HoursWorked_category)
#DV: SIC 2007 industry divisions (grouped)
demographics$dsic[demographics$dsic==-8]<-"Don't know"
demographics$dsic[demographics$dsic==-1]<-"Item not applicable"
demographics$dsic[demographics$dsic==1]<-"Agriculture, forestry and fishing"
demographics$dsic[demographics$dsic==2]<-"Manufacturing"
demographics$dsic[demographics$dsic==3]<-"Energy and water supply"
demographics$dsic[demographics$dsic==4]<-"Construction"
demographics$dsic[demographics$dsic==5]<-"Distribution, hotels and restaurants"
demographics$dsic[demographics$dsic==6]<-"Transport and communication"
demographics$dsic[demographics$dsic==7]<-"Banking and finances"
demographics$dsic[demographics$dsic==8]<-"Public admin, education and health"
demographics$dsic[demographics$dsic==9]<-"Other services"
demographics$industry_category<-c("Don't know", "Item not applicable", "Agriculture, forestry and fishing","Manufacturing","Energy and water supply",
"Construction", "Distribution, hotels and restaurants", "Transport and communication", "Banking and finances",
"Public admin, education and health", "Other service")
SIC2007<-as.vector(demographics$dsic)
# creating df
df<-data.frame(Gender, Age, Education, MaritalStatus, Income, WorkHours, SIC2007)
df %>%
#tidy, not gender
gather(variable, value, -c(Gender))%>%
#group by value, variable, then gender
group_by(value, variable, Gender) %>%
#summarise to obtain table cell frequencies
summarise(freq=n()) %>%
#Plot
ggplot(aes(x=value, y=freq, group=Gender))+geom_bar(aes(fill=Gender), stat='identity', position='dodge')+ facet_wrap(~variable, scales='free_x') + theme(legend.position="right", axis.text.x = element_text(angle = 60, hjust = 1)) + labs(x="Characteristics", y="Frequencies")
In ggplot2, the data is ordered according to the factor levels of the data.frame column.
To (re)set the order in your plot, just set the order of the factor by:
df$variable <- factor(df$variable, levels = c(...))
You could do this by first storing the data.frame, before piping to the ggplot function, then manually setting the levels of the variables you want to change. It is maybe a bit inefficient, but this should do the trick:
## Make your plotting data.frame
df2 <- df %>%
gather(variable, value, -c(Gender))%>%
group_by(value, variable, Gender) %>%
summarise(freq=n())
## Apply custom order to MaritalStatus variable:
custom <- c(sort(unique(MaritalStatus))[c(4,3,1,2)],
....)
df2$variable <- factor(df2$variable, levels = c(levels(df2$variable)[!levels(df2$variable) %in% custom],
custom))
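To make the effect of the levels argument concrete, here is a small base R sketch with a few marital status values taken from the question. The chosen custom order is only an assumption about what a meaningful order might be. Note that the question's plot maps value to the x axis, so in that plot it is the level order of the value column that ggplot2 would follow.

```r
status <- c("Single, never married", "Married/cohabitating",
            "Divorced/widowed", "Single, never married")

# Default: factor() sorts the levels alphabetically
default_order <- levels(factor(status))
default_order

# Custom order (an assumed "meaningful" order, for illustration)
custom_order <- c("Single, never married", "Married/cohabitating",
                  "Divorced/widowed")
status_f <- factor(status, levels = custom_order)

# Tables, and likewise axis labels, follow the level order
table(status_f)
```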
