How to plot wordcloud based on multiple columns? - r

How can I make a wordcloud plot based on the values of two columns?
I have a dataframe as follows:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
I want to make a wordcloud plot for df$Name based on the values in df$Age and df$Pval. I used the following code:
library("tm")
library("SnowballC")
library("wordcloud")
library("wordcloud2")
library("RColorBrewer")
set.seed(1234)
wordcloud(words = df$Name, freq = df$Age, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Here Luther and Ben are the same size, but I need Luther to be slightly bigger than Ben because it has the lower Pval.

A quick fix workaround:
library("dplyr")
library("scales")
library("wordcloud")
library("RColorBrewer")
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
df <- df %>%
  group_by(Age) %>%
  mutate(rank = rank(Pval)) %>% # rank the p-values within each age group
  # rescale to [0, 1] so that the bonus never exceeds one year of age
  mutate(weight = scales::rescale(rank / max(rank), to = c(0, 1))) %>%
  mutate(weight = Age + (1 - weight)) # inverted: a smaller p-value gives a larger weight
# The last step adds 0.5 when nobody else has the same age, and 1 when someone
# else does but you have the smallest p-value (it should also work when
# more than two people share an age).
set.seed(1234)
wordcloud(words = df$Name, freq = df$weight, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Fun and interesting question btw
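A base-R variation on the same idea, as a rough sketch rather than a tested recipe: use -log10(Pval) as a tie-breaking bonus and squeeze it into [0, 0.9] so it can never outweigh a one-year age difference. The names score and bonus and the 0.9 cap are assumptions made for illustration; they are not part of the answer above.

```r
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)

# Smaller p-values give larger scores on the -log10 scale.
score <- -log10(df$Pval)

# Rescale the scores to [0, 0.9] (hypothetical cap) so that Age stays the
# dominant term and Pval only breaks ties between equal ages.
bonus <- 0.9 * (score - min(score)) / (max(score) - min(score))
df$weight <- df$Age + bonus

# Rows ordered from largest to smallest weight.
df[order(-df$weight), ]
```

The resulting df$weight can be passed as freq to wordcloud() in place of df$Age. Whether such a small difference is visible in the plot depends on how wordcloud scales the font sizes.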

Related

Labelling min, median, max of boxplot, using R-base

I am trying to label the min, median, and max values on the boxplot that I created. However, the boxplot is drawn from two different data frames, so I am confused about how to label the values.
Dummy variable:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
boxplot(class1$Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
I am trying to add each value to the boxplot (shown in image), along with what it indicates (e.g. min, median, max).
Many thanks
You could use the text function with fivenum to get the five-number summary of each boxplot, pass it to the labels argument, and place the labels with the x and y positions like this:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age), labels = fivenum(class1$Age), x=0.5)
text(y = fivenum(class2$Age1), labels = fivenum(class2$Age1), x=2.5)
Created on 2023-01-01 with reprex v2.0.2
If you only want the min (1), median(3) and max(5) you can simply extract the first, third and fifth value of the fivenum function like this:
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age)[c(1,3,5)], labels = fivenum(class1$Age)[c(1,3,5)], x=0.5)
text(y = fivenum(class2$Age1)[c(1,3,5)], labels = fivenum(class2$Age1)[c(1,3,5)], x=2.5)
The following code adds a new column, Class, holding the class name to both data frames, then binds them together with rbind.
The boxplot is then drawn with the at argument, which leaves a bit more space between the boxes.
tapply computes fivenum for each Class, and those numbers go into a new data frame that holds the values for the annotations drawn with text.
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
Class <- rep("Class1", 5)
class1 <- data.frame(Name, Age, Class)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1 <- c(33, 21, 56, 32, 65, 32, 89)
Class1 <- rep("Class2", 7)
class2 <- data.frame(Name = Name1, Age = Age1, Class = Class1)
df <- rbind(class1, class2)
bp <- boxplot(df$Age ~ factor(df$Class),
  names = c("Class1", "Class2"),
  ylim = c(0, 100),
  xlim = c(0, 5),
  xlab = "", ylab = "Age",
  frame = FALSE,
  at = c(1, 3)
)
box(bty = "l")
fn <- tapply(df$Age, df$Class, fivenum)
tex <- data.frame(
Class = c("Class1", "Class2"),
max = c(fn$Class1[5], fn$Class2[5]),
min = c(fn$Class1[1], fn$Class2[1]),
median = c(fn$Class1[3], fn$Class2[3])
)
text(x = c(1, 3), y = tex$max + 2.5, paste(tex$max, "(max)", sep = ""))
text(x = c(1, 3), y = tex$min - 2.5, paste(tex$min, "(min)", sep = ""))
text(x = c(1.9, 3.9), y = tex$median, paste(tex$median, "(median)", sep = ""))
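Both answers above compute the numbers with fivenum. As an alternative sketch, boxplot() itself returns what it draws: rows 1, 3 and 5 of its $stats matrix are the lower whisker end, the median and the upper whisker end for each box. These match the min and max only when a group has no outliers, which is the case for this data. The names class1_age, class2_age and labels are chosen here for illustration.

```r
class1_age <- c(23, 41, 32, 58, 26)
class2_age <- c(33, 21, 56, 32, 65, 32, 89)

# plot = FALSE returns the computed statistics without drawing anything.
bp <- boxplot(class1_age, class2_age, plot = FALSE)

# Keep the lower whisker, median and upper whisker; one column per box.
labels <- bp$stats[c(1, 3, 5), ]
rownames(labels) <- c("min", "median", "max")
colnames(labels) <- c("class1", "class2")
labels
```

Each column of labels can then be passed to text() as both y and labels, with x set next to the corresponding box.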

How to apply functions depending on the column, and mutate into new data frame?

I came up with the idea of representing stats on a chart like this (example of the plot), and built it like this:
df_n <- df_normalized %>%
transmute(
Height_x = round(Height*cos_my(45), 2),
Height_y = round(Height*sin_my(45), 2),
Weight_x = round(Weight*cos_my(45*2), 2),
Weight_y = round(Weight*sin_my(45*2), 2),
Reach_x = round(Reach*cos_my(45*3), 2),
Reach_y = round(Reach*sin_my(45*3), 2),
SLpM_x = round(SLpM*cos_my(45*4), 2),
SLpM_y = round(SLpM*sin_my(45*4), 2),
Str_Def_x = round(`Str_Def %`*cos_my(45*5), 2),
Str_Def_y = round(`Str_Def %`*sin_my(45*5), 2),
TD_Avg_x = round(TD_Avg*cos_my(45*6), 2),
TD_Avg_y = round(TD_Avg*sin_my(45*6), 2),
TD_Acc_x = round(`TD_Acc %`*cos_my(45*7), 2),
TD_Acc_y = round(`TD_Acc %`*sin_my(45*7), 2),
Sub_Avg_x = round(Sub_Avg*cos_my(45*8), 2),
Sub_Avg_y = round(Sub_Avg*sin_my(45*8), 2))
Now I want to do this in a smarter way, so I created a data frame with the same number of rows, empty_df, and in a for loop I try to mutate in a new column on every iteration. For example, I want to multiply the 1st column by cos(30), the 2nd by cos(30*2), and so on.
But...
Only the last column survives, because on every iteration the new column gets the same name, 'column'.
I want each column to be named by the value of the variable column, built with paste0().
reprex_df <- structure(list(Height = c(190, 180, 183, 196, 185),
Weight = c(120, 77, 93, 120, 84),
Reach = c(193, 180, 188, 203, 193),
SLpM = c(2.45, 3.8, 2.05, 7.09, 3.17),
`Str_Def %` = c(58, 56, 55, 34, 44),
TD_Avg = c(1.23, 0.33, 0.64, 0.91, 0),
`TD_Acc %` = c(24, 50, 20, 66, 0),
Sub_Avg = c(0.2, 0, 0, 0, 0)), row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"))
temp <- apply(reprex_df[,1], function(x) x*cos(60), MARGIN = 2)
temp
empty_df <- data.frame(first_column = replicate(length(temp),1))
for (x in 1:8) {
temp <- apply(df[,x], function(x) round(x*cos((360/8)*x),2), MARGIN = 2)
column <- paste0("Column_",x)
empty_df <- mutate(empty_df, column = temp)
}
Later I want to turn this into a function that takes a data frame and returns a data frame with the X and Y coordinates.
So, how should I do it?
Perhaps this helps
library(purrr)
library(stringr)
nm1 <- names(reprex_df)
nm_cos <- str_c(names(reprex_df), "_x")
nm_sin <- str_c(names(reprex_df), "_y")
# cos() and sin() take radians, so the 45-degree steps are converted first
reprex_df[nm_cos] <- map2(reprex_df, seq_along(nm1),
  ~ round(.x * cos(45 * .y * pi / 180), 2))
reprex_df[nm_sin] <- map2(reprex_df[nm1], seq_along(nm1),
  ~ round(.x * sin(45 * .y * pi / 180), 2))
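The question also asks for a function that takes a data frame and returns the coordinates. A minimal base-R sketch of that, assuming the angle step is meant in degrees (as the 45 and 360/8 in the question suggest) and converting to radians because R's cos() and sin() work in radians. The function name to_xy and its step_deg argument are hypothetical.

```r
# Hypothetical helper: column i is placed at angle i * step_deg (in degrees).
to_xy <- function(df, step_deg = 360 / ncol(df)) {
  angles <- seq_along(df) * step_deg * pi / 180 # degrees -> radians
  x <- Map(function(col, a) round(col * cos(a), 2), df, angles)
  y <- Map(function(col, a) round(col * sin(a), 2), df, angles)
  names(x) <- paste0(names(df), "_x")
  names(y) <- paste0(names(df), "_y")
  # check.names = FALSE keeps names such as "Str_Def %_x" unchanged
  data.frame(c(x, y), check.names = FALSE)
}

# Small usage example with two of the columns from the question.
stats <- data.frame(Height = c(190, 180), Weight = c(120, 77))
to_xy(stats, step_deg = 45)
```

The result holds all the _x columns first and then all the _y columns, the same layout as the map2 answer above.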

How to make a table with gt() in R?

So, I have this data I'm trying to format into a nice table. Currently, I've just been using the kable() command from the knitr package, but I am trying to learn how to make nice and pretty tables that look more professional. I have the following code:
library(gt)
library(tidyverse)
library(glue)
Player <- c("Russel Westbrook", "James Harden", "Kawhi Leonard", "Lebron James",
"Isaiah Thomas", "Stephen Curry", "Giannis Antetokounmpo", "John Wall",
"Anthony Davis", "Kevin Durant")
Overall_proportion <- c(0.845, 0.847, 0.880, 0.674, 0.909, # q-the ratio of clutch makes
0.898, 0.770, 0.801, 0.802, 0.875) # by clutch attempts
Clutch_makes <- c(64, 72, 55, 27, 75, # Y-values
24, 28, 66, 40, 13)
Clutch_attempts <- c(75, 95, 63, 39, 83, # Clutch_attempts -values
26, 41, 82, 54, 16)
NBA_stats <- as.data.frame(cbind(Player, Overall_proportion, Clutch_makes, Clutch_attempts))
# creating the various quartiles for the posterior distributions
q25 <- qbeta(0.250, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q50 <- qbeta(0.500, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q75 <- qbeta(0.750, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q90 <- qbeta(0.900, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q_low <- qbeta(0.025, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q_high <- qbeta(0.975, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
Player_distribution_table <- cbind(q25, q50, q75, q90, q_low, q_high)
rownames(Player_distribution_table) <- Player
I'm just trying to turn this into a table where the row names are those of the players, and the column names are "25th percentile, 50th percentile" etc.
Thank you!
gt needs a data.frame or tibble. Player_distribution_table is a matrix (because you used cbind). You can pass a data frame to gt with rownames_to_stub = TRUE to show the player names as row labels.
Player_distribution_table <- data.frame(q25, q50, q75, q90, q_low, q_high)
rownames(Player_distribution_table) <- Player
gt::gt(Player_distribution_table, rownames_to_stub = TRUE)
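The question also asks for column names such as "25th percentile". One way to get them, sketched here with the first three players only, is to build the data frame with those names before handing it to gt. The names probs, quantiles and tbl are chosen for illustration, and the gt call is guarded so that the snippet still runs where gt is not installed.

```r
Player <- c("Russel Westbrook", "James Harden", "Kawhi Leonard")
Clutch_makes <- c(64, 72, 55)
Clutch_attempts <- c(75, 95, 63)

# Same posterior quantiles as in the question, one column per probability.
probs <- c(0.25, 0.5, 0.75, 0.9, 0.025, 0.975)
quantiles <- sapply(probs, function(p) {
  qbeta(p, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
})

tbl <- data.frame(quantiles, row.names = Player, check.names = FALSE)
names(tbl) <- c("25th percentile", "50th percentile", "75th percentile",
                "90th percentile", "2.5th percentile", "97.5th percentile")

# Render with gt when it is available.
if (requireNamespace("gt", quietly = TRUE)) {
  gt::gt(tbl, rownames_to_stub = TRUE)
}
```

gt also provides cols_label() for relabelling columns after the table object has been created, which keeps the short names in the data frame.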

Summing ranks for variable with fewest entries

I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average"))
# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2])>as.numeric(asc_or_desc[2,2])) {
a.df <- arrange(a.df, desc(State))
} else {
a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
Please try this code and tell me whether I am on the right track.
I replaced the section you called clumsy with this:
... %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
The whole code:
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average")) %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
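# -----------------------------------------------------------------------------
# nA and nB: sizes of the smaller and the larger group, as in the question
nA <- min(count(ungroup(a.df), State)$n)
nB <- max(count(ungroup(a.df), State)$n)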
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
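Note that the pipeline above orders the states by their largest value, which puts Alaska first here because it contains 557, not because it has the fewest entries. A sketch that picks the smaller group by its count instead, in base R, so that no sorting is needed. The names n_by_state and smallest are chosen for illustration; on a tie in group sizes, which.min() takes the first state.

```r
alaska <- c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557)
calif <- c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)

a.df <- data.frame(
  State = rep(c("Alaska", "Calif"), times = c(length(alaska), length(calif))),
  Value = c(alaska, calif)
)
a.df$StateRank <- rank(a.df$Value, ties.method = "average")

# Count the entries per state and pick the state with the fewest.
n_by_state <- table(a.df$State)
smallest <- names(which.min(n_by_state))
nA <- min(n_by_state)
nB <- max(n_by_state)

# Sum the ranks of the smaller group directly.
a.U <- sum(a.df$StateRank[a.df$State == smallest])
```

The remaining lines for a.E, a.V, a.Z and a.P from the question can follow unchanged.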

Weighted mean calculation in R with missing values

Does anyone know if it is possible to calculate a weighted mean in R when some values are missing, so that the weights of the values that are present are scaled up proportionately?
To convey this clearly, I created a hypothetical scenario. This describes the root of the question, where the scalar needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
~student, ~test1, ~test2, ~test3,
"Mark", 90, 91, 92,
"Mike", NA, 79, 98,
"Nick", 81, NA, 83)
weights <- tribble(
~test, ~weight,
"test1", 0.2,
"test2", 0.4,
"test3", 0.4)
scores %>%
gather(test, score, -student) %>%
left_join(weights, by = "test") %>%
group_by(student) %>%
summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#> student result
#> <chr> <dbl>
#> 1 Mark 91.20000
#> 2 Mike 88.50000
#> 3 Nick 82.33333
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base package stats together with sapply for this. Note that if your scores and weights are stored as R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
weighted.mean(unlist(dat[i,]), unlist(w[i, ]), na.rm = TRUE)
})
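One caveat on the data above: the entries of w equal score times rescaled weight (for example 18 = 0.2 * 90), so they are already weighted scores rather than weights. A base-R sketch that uses the per-test weights 0.2, 0.4 and 0.4 from the tidyverse answer instead, and relies on weighted.mean to drop the weight of any missing score and renormalise over the rest. The names weights and result are chosen for illustration.

```r
dat <- data.frame(
  Test1 = c(90, NA, 81),
  Test2 = c(91, 79, NA),
  Test3 = c(92, 98, 83),
  row.names = c("Mark", "Mike", "Nick")
)

# One weight per test, matched to the columns by name.
weights <- c(Test1 = 0.2, Test2 = 0.4, Test3 = 0.4)

# With na.rm = TRUE a missing score is dropped together with its weight,
# and the mean is divided by the sum of the remaining weights.
result <- apply(as.matrix(dat), 1, function(scores) {
  weighted.mean(scores, weights[names(scores)], na.rm = TRUE)
})
result
```

This reproduces the 91.2, 88.5 and 82.33 shown in the tidyverse output above.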
