Creating boxplot based on three variables - r

How can I create a box plot like the one shown from a data frame like this:
df <- data.frame(category = c(2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 1, 3), educational_years = c(6, 15, 16, 6, 6, 6, 12, 12, 12, 4, 12, 15, 6, 4), gender = c("M", "F", "F", "F", "M", "M", "F", "M", "F", "F", "M", "F", "M", "F"))
with category on the x axis, age (educational_years in this sample) on the y axis, and gender as the grouping factor?

You don't have enough data to create a boxplot like the one shown. For example, you have only a single data point for category 1, so you will get a single horizontal line there. You have only "M" values for category 2, so you will get a single box there. You have only a single "M" value in category 3, so you will get a horizontal line instead of a box for that group.
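The group sizes behind that diagnosis can be checked directly with a cross-tabulation. A minimal sketch, assuming the question's data frame is stored under the name df (the name the plotting code uses); counts is a name introduced here for illustration:

```r
# Rebuild the sample data from the question under the assumed name df
df <- data.frame(
  category = c(2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 1, 3),
  educational_years = c(6, 15, 16, 6, 6, 6, 12, 12, 12, 4, 12, 15, 6, 4),
  gender = c("M", "F", "F", "F", "M", "M", "F", "M", "F", "F", "M", "F", "M", "F")
)

# Number of observations in each category/gender cell
counts <- table(category = df$category, gender = df$gender)
print(counts)
```

A cell with a count of 0 produces no box at all, and a cell with a count of 1 collapses to a single line.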
Assuming this is only a sample of your data, rather than the full data set, the code would look like this:
library(ggplot2)
# df is the data frame from the question
ggplot(df, aes(factor(category), educational_years, fill = gender)) +
  geom_boxplot()
With the sample data as posted, the result looks like this: [image not preserved]

Related

Find the overlap of two datasets

I have two different datasets as I've shown below: df_A and df_B.
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now I want to see the overlap of these two datasets on book_name. That is, I want a list of the book_name values that appear in both datasets, and a measure of how similar the two datasets are according to the book_name column.
Is there a reliable way to do this?
You can do an inner join between the two data frames, which keeps only the rows whose book_name appears in both of them.
This should do the trick:
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Inner join on book_name to keep only the values common to both data frames
result <-
  df_A %>%
  inner_join(df_B, by = "book_name")
Here is a base R solution using intersect():
overlap <- subset(df_A, book_name %in% intersect(book_name, df_B$book_name))
which gives
> overlap
# A tibble: 3 x 2
  book_name sales_id
  <chr>        <dbl>
1 A                1
2 C                3
3 E                5
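For the second part of the question, how similar the two datasets are, one common choice is the Jaccard index: the number of shared names divided by the number of distinct names across both sets. This measure is an assumption here, since the question does not name one. A minimal base R sketch on the book_name values alone (names_A, names_B, common, all_names and jaccard are illustrative names):

```r
# book_name values copied from df_A and df_B in the question
names_A <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
names_B <- c("A", "N", "C", "E", "K", "R", "S", "U", "Z", "Y")

# Names present in both sets
common <- intersect(names_A, names_B)

# Distinct names across both sets
all_names <- union(names_A, names_B)

# Jaccard index: size of the intersection over size of the union
jaccard <- length(common) / length(all_names)

print(common)
print(jaccard)
```

With the tibbles from the question, names_A and names_B would simply be df_A$book_name and df_B$book_name.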

Numbering characters in a string

I want to number the letters in a large dataset. Some letters occur multiple times and are numbered ("A1", "A2"), others also occur multiple times but are not numbered. There are also letters that occur only once... but maybe it's easier to look at the example data below.
The numbers in df$nr are the desired result. How can I get df$nr from df$word and df$letter?
df <-tibble(word=c(rep("Amamam", 17), rep("Bobob", 14)),
letter=c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a", "m", "m", "a", "a", "m", "m",
"B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b", "o", "o", "o", "b"),
nr=c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5) )
We can group by 'word', remove the numeric part from the 'letter' column, and convert the result to a run-length id (rleid from data.table):
library(dplyr)
library(stringr)
library(data.table)
df1 <- df %>%
  group_by(word) %>%
  mutate(nr1 = rleid(str_remove(letter, "\\d+")))
all.equal(df1$nr, df1$nr1)
#[1] TRUE
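If data.table is not available, the run-length id can be sketched in base R: a new run starts wherever a value differs from the one before it, so a cumulative sum over those change points gives the id. A minimal sketch on the example vectors; run_id is a hypothetical helper written for this illustration, not a library function:

```r
word <- c(rep("Amamam", 17), rep("Bobob", 14))
letter <- c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a",
            "m", "m", "a", "a", "m", "m",
            "B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b",
            "o", "o", "o", "b")

# Strip the digits so that "A1" and "A2" count as the same letter
base_letter <- sub("[0-9]+", "", letter)

# Hypothetical helper: id of each run of equal consecutive values
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Apply the helper within each word; ave() works on the row indices
# so that the result stays an integer vector
nr <- ave(seq_along(base_letter), word,
          FUN = function(i) run_id(base_letter[i]))
print(nr)
```

Working on the row indices rather than on the character vector keeps ave() from coercing the ids back to character.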

Using matplot in R whenever certain column changes

Sorry in advance because I am new at asking questions here and don't know how to input this table properly.
Say I have a data frame in R constructed like:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind(team, value)
I want to create a plot that will give me 3 lines graphing the values for teams A, B, and C. I believe I can do this inputting the matrix m into matplot somehow, but I'm not sure how.
EDIT: I've gotten a lot closer to solving my problem. However I've realized that for some reason, with the code I have, "Value" is a list of 745 which matches the number of rows in my dataframe m. However when I unlist(Value) it turns into a numeric of length 894. Any ideas on why this would happen?
You can try something like this:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind.data.frame(team, value)
library(ggplot2)
ggplot(m, aes(x=as.factor(1:nrow(m)), y=value, group=team, col=team)) +
  geom_line(lwd=2) + xlab('index')
If you have the same number of ordered values for each team, you could use matplot to visualize them, but the data should be converted to a matrix first:
m = cbind.data.frame(team, value, index = rep(1:3, 3))
m <- reshape(m, v.names = 'value', idvar = 'team', direction = 'wide', timevar = 'index')
matplot(t(m[, 2:4]), type = 'l', lty = 1)
legend('top', legend = m[, 1], lty = 1, col = 1:3)
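Note that cbind(team, value) in the question builds a character matrix, because a matrix holds a single type, so value is no longer numeric there. A shorter route to the matrix that matplot needs, sketched under the same assumption of equally many ordered values per team, is to split value by team (wide is an illustrative name):

```r
team  <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# One column per team, one row per position within the team;
# this only simplifies to a matrix when every team has the same length
wide <- sapply(split(value, team), identity)

# matplot draws one line per column
matplot(wide, type = "l", lty = 1, col = 1:3, xlab = "index", ylab = "value")
legend("top", legend = colnames(wide), lty = 1, col = 1:3)
```

If the teams had different numbers of values, sapply() would return a list instead of a matrix, and the shorter teams would need padding with NA first.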

Random stratified sampling with different proportions

I am trying to split a dataset 80/20 into training and testing sets. I am trying to split by location, which is a factor with 4 levels; however, the levels have not been sampled equally. Out of 1892 samples:
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I am trying to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20 so that each location is represented in the same proportion in the training and testing sets. I've seen one post about this that uses the stratified function from the splitstackshape package, but it doesn't seem to split my factor levels the way I expect.
Here is a simplified reproducible example -
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, and D correspond to the factor levels, in approximately the same proportions as in the actual dataset (~10, 32, 32, and 26%, respectively).
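Using bothSets = TRUE should return a list containing both parts of the split, the sampled rows and the remaining rows (whose union should be the original data frame):
library(splitstackshape)
splt <- stratified(df, "xx", size=16/nrow(df), replace=FALSE, bothSets=TRUE)
# size is 0.8 here, so the first element is the 80% sample (training set)
# and the second element holds the remaining rows (validation set)
train <- splt[[1]]
valid <- splt[[2]]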
## check
df2 <- as.data.frame(do.call("rbind", splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],
          check.names=FALSE)
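For comparison, the same per-level 80/20 split can be sketched in base R without splitstackshape. This is an alternative to the answer above, not a description of what stratified() does internally; the seed and the names train_idx, train and test are arbitrary choices for illustration:

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C",
        "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

# For each level of xx, draw 80% of its row indices (rounded) at random
train_idx <- unlist(
  lapply(split(seq_len(nrow(df)), df$xx), function(i) {
    i[sample.int(length(i), size = round(0.8 * length(i)))]
  }),
  use.names = FALSE
)

train <- df[train_idx, ]
test  <- df[-train_idx, ]

print(table(train$xx))
print(table(test$xx))
```

Rounding matters for very small strata: level A has only two rows, 80% of 2 rounds up to 2, and so both rows land in the training set and none in the test set.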

Chart with built in groupby and secondary Y %s in r

Thanks for this wonderful community and expert responses. This is my first question on stackoverflow. I did research but couldn't find what I am trying to do.
How can I write efficient code in R that creates a chart with a secondary Y axis and also groups by a given variable to get total counts? I want the grouping to happen within the code, rather than having to create a separate data frame for every variable that I want to plot on the X axis.
I have thousands of rows and hundreds of columns in an r dataframe. My sample data looks like this. (20 x 5)
tv = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0)
pr1 =c("AA", "AB", "ZH", "AA", "ZA", "AB", "ZA", "ZA", "AA", "AA", "ZA", "AA", "ZG", "AA", "ZF", "AB", "AA", "AB", "AA", "AA")
pr2 =c("B", "F", "F", "J", "E", "E", "J", "B", "J", "F", "B", "B", "J", "B", "F", "J", "B", "F", "B", "E")
pr3 =c(13, 13, 25, 13, 13, 13, 13, 1, 13, 13, 13, 13, 25, 13, 25, 1, 13, 13, 13, 13)
sample_data = data.frame("SN"= c(1:20),"Target_Vbl"=tv,Predictor_1=pr1,Predictor_2=pr2,Predictor_3=pr3)
From this sample data, I can create the chart I am looking for in Excel, but I am lost when it comes to plotting it in R. I want to reuse the code for any other predictor variable, but my Y axes will always remain the same: the primary Y axis is the total count of Target_Vbl, and the secondary Y axis is the % of ones for a given category of the predictor variable plotted on the X axis.
The chart should look like the one below, currently plotted for Predictor_1 (drawn in Excel): [image not preserved]
Edit: after trying plotrix
Continuing with sample_data, I created a summary data frame to use with the plotrix package. (Thanks lawyeR) twoord.plot takes me closer to what I am looking for, but there are a few discrepancies:
1. I am not getting bars for tc (the total count per level of Predictor_1) on the left Y axis. I tried "bar" in the "type" option, but it did not work.
2. The X axis labels don't show the values from the data but default to numbers. They should show "AA", "AB", "ZA", etc., not 1, 2, 3...
3. Is there a way to make the overall process more concise? I feel my code is crude at best. Any pointers would be helpful.
library(sqldf)
smry = sqldf("Select Predictor_1, count(Target_Vbl) as tc, sum(Target_Vbl)
as conv from sample_data Group by Predictor_1")
smry$ratio = round((smry$conv/smry$tc),2)
library(plotrix)
twoord.plot(smry$Predictor_1, smry$tc,
smry$Predictor_1, smry$ratio,
type= c("l", "l"), lcol=3,rcol=4,do.first="plot_bg(\"gray\")")
The graph now looks like this: [image: output of twoord.plot]
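One possible approach, sketched in base R graphics rather than plotrix: compute the counts and the share of ones per level with tapply(), draw the counts as bars, then overlay the share on a second Y axis. summarise_by and plot_by are hypothetical helper names introduced here; the colors, axis labels, and the 0 to 1 range of the right axis are assumptions, not part of the question.

```r
tv  <- c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0)
pr1 <- c("AA", "AB", "ZH", "AA", "ZA", "AB", "ZA", "ZA", "AA", "AA",
         "ZA", "AA", "ZG", "AA", "ZF", "AB", "AA", "AB", "AA", "AA")
pr2 <- c("B", "F", "F", "J", "E", "E", "J", "B", "J", "F",
         "B", "B", "J", "B", "F", "J", "B", "F", "B", "E")
pr3 <- c(13, 13, 25, 13, 13, 13, 13, 1, 13, 13,
         13, 13, 25, 13, 25, 1, 13, 13, 13, 13)
sample_data <- data.frame(SN = 1:20, Target_Vbl = tv, Predictor_1 = pr1,
                          Predictor_2 = pr2, Predictor_3 = pr3)

# Hypothetical helper: total count and share of ones per predictor level
summarise_by <- function(data, predictor, target = "Target_Vbl") {
  tc   <- tapply(data[[target]], data[[predictor]], length)
  conv <- tapply(data[[target]], data[[predictor]], sum)
  data.frame(level = names(tc), tc = as.vector(tc), conv = as.vector(conv),
             ratio = round(as.vector(conv / tc), 2),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: bars for counts (left axis), line for share (right axis)
plot_by <- function(data, predictor, target = "Target_Vbl") {
  s <- summarise_by(data, predictor, target)
  old <- par(mar = c(5, 4, 4, 4) + 0.1)  # leave room for the right axis
  on.exit(par(old))
  mids <- barplot(s$tc, names.arg = s$level, col = "grey80",
                  xlab = predictor, ylab = "Total count")
  usr <- par("usr")                      # x range used by the bar plot
  par(new = TRUE)                        # draw on top of the bars
  plot(mids, s$ratio, type = "b", pch = 16, col = "blue", axes = FALSE,
       xlab = "", ylab = "", xlim = usr[1:2], xaxs = "i", ylim = c(0, 1))
  axis(4)
  mtext("Share of ones", side = 4, line = 2.5)
  invisible(s)
}

smry <- summarise_by(sample_data, "Predictor_1")
print(smry)
plot_by(sample_data, "Predictor_1")
```

Because the predictor is passed as a column name, the same call can be reused for other columns, for example plot_by(sample_data, "Predictor_2"), without building a separate summary data frame by hand.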
