Find and visualize best and worst items using boxplot - r

I am using the jokes dataset Dataset 2 (jester_dataset_2.zip) from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.
The data look like this
> str(tabulka)
'data.frame': 1761439 obs. of 3 variables:
$ User : int 1 1 1 1 1 1 1 1 1 1 ...
$ Joke : int 5 7 8 13 15 16 17 18 19 20 ...
$ Rating: num 0.219 -9.281 -9.281 -6.781 0.875 ...
Here is a subset of Dataset 2.
> head(tabulka)
User Joke Rating
1 1 5 0.219
2 1 7 -9.281
3 1 8 -9.281
4 1 13 -6.781
5 1 15 0.875
6 1 16 -9.656
I found out I can't use ANOVA because the assumption of homogeneity of variance does not hold. Hence I am using the Kruskal–Wallis test from the agricolae package in R.
KWtest <- with ( tabulka , kruskal ( Rating , Joke ))
Here are the groups.
> head(KWtest$groups)
trt means M
1 53 1085099 a
2 105 1083264 a
3 89 1077435 ab
4 129 1072706 b
5 35 1070016 bc
6 32 1062102 c
The thing is, I don't know how to visualize the joke groups appropriately. I am using a boxplot to show the spread of ratings for each joke.
barvy <- c ("yellow", "grey")
boxplot (Rating ~ Joke, data = tabulka,
col = barvy,
xlab = "Joke",
ylab = "Rating",
ylim=c(-7,7))
It would be nice to somehow color each box (each joke) with an appropriate color according to the color given by the KW test.
How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?

Interesting question per se. It's easy to color each box according to the group the joke belongs to. However, I think this is just an intermediate solution; there must be a better visualization for these data. So it is certainly not the best one, but here is my version:
library(tidyverse)
# download data (jokes, part 1) to temporary file, and unzip
tmp <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip", tmp)
tmp <- unzip(tmp)
# read data from temp
vtipy <- readxl::read_excel(tmp, col_names = F, na = '99')
# clean data
vtipy <- vtipy %>%
mutate(user = 1:n()) %>%
gather(key = 'joke', value = 'rating', -c('..1', 'user')) %>%
rename(n = '..1', ) %>%
filter(!is.na(rating)) %>%
mutate(joke = as.character(as.numeric(gsub('\\.+', '', joke)) - 1)) %>%
select(user, n, joke, rating)
# your code
KWtest <- with(vtipy, agricolae::kruskal(rating, joke))
# join groups from KWtest to original data, clean and plot
KWtest$groups %>%
rownames_to_column('joke') %>%
select(joke, groups) %>%
right_join(vtipy, by = 'joke') %>%
mutate(joke = stringi::stri_pad_left(joke, 3, '0')) %>%
ggplot(aes(x = joke, y = rating, fill = groups)) +
geom_boxplot(show.legend = F) +
scale_x_discrete(breaks = stringi::stri_pad_left(c(1, seq(5, 100, by = 5)), 3, '0')) +
ggthemes::theme_tufte() +
labs(x = 'Joke', y = 'Rating')
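As for a simpler way to find the best and worst jokes: one option, which swaps the Kruskal–Wallis grouping for a plain ranking, is to order the jokes by their median rating and highlight the two extremes. This is only a sketch in base R; the ratings below are simulated (not the Jester data), and the colors and the JokeOrdered column are my own choices.

```r
# Simulated ratings for five jokes (made-up data, not the Jester dataset)
set.seed(1)
tabulka <- data.frame(
  Joke   = rep(1:5, each = 50),
  Rating = rnorm(250, mean = rep(c(-5, -1, 0, 1, 5), each = 50), sd = 2)
)

# Median rating per joke, sorted from worst to best
med   <- sort(tapply(tabulka$Rating, tabulka$Joke, median))
worst <- names(med)[1]
best  <- names(med)[length(med)]

# Reorder the factor levels by median so the boxes appear from worst to best
tabulka$JokeOrdered <- reorder(factor(tabulka$Joke), tabulka$Rating, FUN = median)

# Grey for all jokes, with the worst and the best one highlighted
barvy <- rep("grey", nlevels(tabulka$JokeOrdered))
barvy[c(1, length(barvy))] <- c("tomato", "yellow")

boxplot(Rating ~ JokeOrdered, data = tabulka,
        col = barvy,
        xlab = "Joke (ordered by median rating)",
        ylab = "Rating")
```

With the real data you would keep your own tabulka and skip the simulation step; head(med) and tail(med) then list the lowest and highest rated jokes.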

select top n values by group with n depending on other value in data frame

I'm quite new to R and coding in general. Your help would be highly appreciated :)
I'm trying to select the top n values by group, with n depending on another value (called factor below) in my data frame. The selected values should then be summarised by group to calculate the mean (d100). My goal is to get one d100 value per group.
(Background: In forestry there is an indicator called d100 which is the mean diameter of the 100 thickest trees per hectare. If the size of the sampling area is smaller than 1 ha you need to select accordingly fewer trees to calculate d100. That's what the factor is for.)
First I tried to put the factor inside my data frame as its own column. Then I thought it might help to have something like a lookup table, because R said that n must be a single number. But I don't know how to create a lookup function (see the last part of the sample code). Or maybe summarising df$factor before using it would do the trick?
Sample data:
(I indicated expressions where I'm not sure how to code them in R like this: 'I dont know how')
# creating sample data
library(tidyverse)
df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
BHD = c(rnorm(23, mean = 30, sd = 5)),
factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8), rep(pi*(25/100)^2, each = 10))
)
# group by ID, then select top_n values of df$BHD with n depending on value of df$factor
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*df$factor,
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
# other thought: having a "lookup-table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))
# then
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*lt$factor 'where lt$group == df$group',
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
I already found this answer to a problem which seems similar to mine, but it didn't quite help.
Since all the factor values are the same within each group, you can select any one factor value.
library(dplyr)
df %>%
group_by(group) %>%
top_n(BHD, n = 100* first(factor)) %>%
ungroup
# group BHD factor
# <dbl> <dbl> <dbl>
# 1 1 25.8 0.0707
# 2 1 24.6 0.0707
# 3 1 27.6 0.0707
# 4 1 28.3 0.0707
# 5 1 29.2 0.0707
# 6 2 28.8 0.126
# 7 2 39.5 0.126
# 8 2 23.1 0.126
# 9 2 27.9 0.126
#10 2 31.7 0.126
# … with 13 more rows
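To get from the selected rows to one d100 value per group, the selection and the mean can also be done in a single summarise() call. The following is a minimal sketch, assuming that rounding 100 * factor down to a whole number of trees is acceptable; the data and the n_trees column are made up for illustration.

```r
library(dplyr)

# Made-up plots: `factor` is the plot area in hectares (hypothetical values)
df <- data.frame(
  group  = c(1, 1, 1, 1, 2, 2, 2),
  BHD    = c(10, 20, 30, 40, 5, 15, 25),
  factor = c(rep(0.025, 4), rep(0.015, 3))
)

d100 <- df %>%
  group_by(group) %>%
  summarise(
    # number of trees to keep in this group (assumption: round down)
    n_trees = floor(100 * first(factor)),
    # mean diameter of the n_trees thickest trees
    d100    = mean(head(sort(BHD, decreasing = TRUE), n_trees))
  )

d100
```

Sorting inside summarise() avoids passing a per-group n to slice_max(), which is what triggered the "n must be a single number" complaint.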

2d plot with 3rd variable as color in RStudio

I have a dataset as CSV with three columns:
timestamp (e.g. 2018/12/15)
keyword (e.g. "hello")
count (e.g. 7)
I want one plot where all the lines of the same keyword are connected with each other and timestamp is on the X- and count is on the Y- axis. I would like each keyword to have a different color for its line and the line being labeled with the keyword.
The CSV has only ~30.000 rows and R runs on a dedicated machine. Performance can be ignored.
I tried various approaches with mathplot and ggplot in this forum, but didn't get it to work with my own data.
What is the easiest solution to do this in R?
Thanks!
EDIT:
I tried customizing Roman's code and tried the following:
csvdata <- read.csv("c:/mydataset.csv", header=TRUE, sep=",")
time <- csvdata$timestamp
count <- csvdata$count
keyword <- csvdata$keyword
time <- rep(time)
xy <- data.frame(time, word = c(keyword), count, lambda = 5)
library(ggplot2)
ggplot(xy, aes(x = time, y = count, color = keyword)) +
theme_bw() +
scale_color_brewer(palette = "Set1") + # choose appropriate palette
geom_line()
This creates a correct canvas, but no points/lines in it...
DATA:
head(csvdata)
keyword count timestamp
1 non-distinct-word 3 2018/08/09
2 non-distinct-word 2 2018/08/10
3 non-distinct-word 3 2018/08/11
str(csvdata)
'data.frame': 121 obs. of 3 variables:
$ keyword : Factor w/ 10 levels "non-distinct-word",..: 5 5 5 5 5 5 5 5 5 5 ...
$ count : int 3 2 3 1 6 6 2 3 2 1 ...
$ timestamp: Factor w/ 103 levels "2018/08/09","2018/08/10",..: 1 2 3 4 5 6 7 8 9 10 ...
Something like this?
# Generate some data. This is the part the poster of the question normally provides.
today <- as.Date(Sys.time())
time <- rep(seq.Date(from = today, to = today + 30, by = "day"), each = 2)
xy <- data.frame(time, word = c("hello", "world"), count = rpois(length(time), lambda = 5))
library(ggplot2)
ggplot(xy, aes(x = time, y = count, color = word)) +
theme_bw() +
scale_color_brewer(palette = "Set1") + # choose appropriate palette
geom_line()
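Regarding the edit in the question: the str() output shows timestamp stored as a factor, and with a discrete x axis ggplot puts every date in its own group, so geom_line() has nothing to connect. That is most likely why the canvas stays empty. A sketch of the fix, using a small made-up stand-in for csvdata with the same column types:

```r
library(ggplot2)

# Made-up stand-in for csvdata; timestamp is a factor, as in the str() output
csvdata <- data.frame(
  keyword   = factor(rep(c("hello", "world"), each = 3)),
  count     = c(3, 2, 3, 1, 6, 2),
  timestamp = factor(rep(c("2018/08/09", "2018/08/10", "2018/08/11"), times = 2))
)

# Convert the factor to a real Date so the x axis becomes continuous
csvdata$timestamp <- as.Date(as.character(csvdata$timestamp), format = "%Y/%m/%d")

# Map color to the keyword column of the data frame itself
p <- ggplot(csvdata, aes(x = timestamp, y = count, color = keyword)) +
  theme_bw() +
  scale_color_brewer(palette = "Set1") +
  geom_line()
print(p)
```

Note that Set1 only has nine colors; with ten keywords a palette such as "Paired" or "Set3" (twelve colors each) would be needed.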

Creating a line graph (mean), arranged by facets, with standard error of mean error bars: ggplot

Hi Stack Overflow community,
I have a dataset:
conc branch length stage factor
1 1000 3 573.5 e14 NRG4
2 1000 7 425.5 e14 NRG4
3608 1000 44 5032.0 P10 NRG4
3609 1000 0 0.0 P10 NRG4
FYI
> str(dframe1)
'data.frame': 3940 obs. of 5 variables:
$ conc : Factor w/ 6 levels "0","1","10","100",..: 6 6 6 6 6 6 6 6 6 6 ...
$ branch: int 3 7 5 0 1 0 0 4 1 1 ...
$ length: num 574 426 204 0 481 ...
$ stage : Factor w/ 8 levels "e14","e16","e18",..: 1 1 1 1 1 1 1 1 1 1 ...
$ factor: Factor w/ 2 levels "","NRG4": 2 2 2 2 2 2 2 2 2 2 ...
I would like to create faceted line graphs, plotting the mean +/- standard error of the mean.
I have tried experimenting and building a ggplot from others (here and on the web).
I have successfully used scripts that will make bargraphs this way:
errbar.ggplot.facets <- ggplot(dframe1, aes(x = conc, y = length))
### function to calculate the standard error of the mean
se <- function(x) sd(x)/sqrt(length(x))
### function to be applied to each panel/facet
my.fun <- function(x) {
data.frame(ymin = mean(x) - se(x),
ymax = mean(x) + se(x),
y = mean(x))}
g.err.f <- errbar.ggplot.facets +
stat_summary(fun.y = mean, geom = "bar",
fill = clrs.hcl(48)) +
stat_summary(fun.data = my.fun, geom = "linerange") +
facet_wrap(~ stage) +
theme_bw()
print(g.err.f)
Source: http://teachpress.environmentalinformatics-marburg.de/2013/07/creating-publication-quality-graphs-in-r-7/
In fact, I have created facetted line graphs with this script:
ggplot(data=dframe1, aes(x=conc, y = length, group = stage)) +
geom_line() + facet_wrap(~stage)
image: postimg.org/image/ebpdc0sb7
However, for that I used a transformed dataset containing only the means, with the SEM in another column, and I don't know how to add the error bars.
Given the complexity (for me) of the bargraphs + error line scripts above, I have not yet been able to integrate/synthesize these into something I need.
In this case, the colour is not important to have.
P.S. I apologise for the long thread (and perhaps the overkill on some details). This is my first online R question, so not sure of correct etiquette. Thank you all in advance for being so helpful!
Darian
In case your dataframe has a column for the mean and the se you could do something like this:
library("dplyr")
library("ggplot2")
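# Standard error of the mean (as defined in the question)
se <- function(x) sd(x) / sqrt(length(x))
# Create a dummy data frame with columns mean and se
df <- mtcars %>%
group_by(gear, cyl) %>%
summarise(mean_mpg = mean(mpg), se_mpg = se(mpg))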
ggplot(df, aes(x = gear, y = mean_mpg)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean_mpg - se_mpg, ymax = mean_mpg + se_mpg)) +
facet_wrap(~cyl)
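Since the question asks for lines rather than bars, the same summarised data frame can be drawn with geom_line() instead. This is a sketch under the same assumptions as above, with mtcars standing in for dframe1 (gear plays the role of conc, mpg of length, cyl of stage); the error bar width is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# One row per facet/x combination with its mean and SEM
df <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg), se_mpg = se(mpg), .groups = "drop")

# Combinations with a single observation have an NA SEM,
# so ggplot drops their error bars with a warning
p <- ggplot(df, aes(x = gear, y = mean_mpg, group = 1)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean_mpg - se_mpg, ymax = mean_mpg + se_mpg),
                width = 0.1) +
  facet_wrap(~cyl) +
  theme_bw()
print(p)
```

With a factor on the x axis, as conc is in dframe1, the group aesthetic is what lets geom_line() connect the points within each panel.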

Graphing Activity budgets that will also incorporate behavior

I have been given a challenging problem and was hoping for some recommendations.
I have activity data that I would like to display graphically and am looking for a package or program that could be used to solve my problem (preferably R).
The data are counts of movements (Activity) collected hourly (Time of day) for 3 weeks (Calendar Date) or more, with associated variables (Food/Vegetation).
Typically, as I've been told, the data can be processed and graphed in a program called Clocklab, which is a Matlab product. However, the added complication is the desire to plot the data according to a classification of feeding groups. I was trying to find an equivalent program or package in R for this but have come up short.
What the data looks like is simply:
Activity time of day Food type Calendar Date
0 01:00 B 03/24/2007
13 02:00 --- 03/24/2007
0 03:00 B 03/24/2007
0 04:00 B 03/24/2007
: : : :
1246 18:00 C 03/24/2007
3423 19:00 C 03/24/2007
: : : :
0 00:00 --- 03/25/2007
These are circadian (circular) activity-budget data, and I would like a graph, possibly 3-D, that shows the diet selection and how much activity is associated with each diet over time for multiple days or weeks. I would do this per individual and then at the population level. I have a link to the program and an example plot of what Clocklab typically produces.
Absent real data, this is the best I can come up with. No special packages required, just ggplot2 and plyr:
library(ggplot2)
library(plyr)
#Some imagined data
dat <- data.frame(time = factor(rep(0:23,times = 20)),
count = sample(200,size = 480,replace = TRUE),
grp = sample(LETTERS[1:3],480,replace = TRUE))
head(dat)
time count grp
1 0 79 A
2 1 19 A
3 2 9 C
4 3 11 A
5 4 123 B
6 5 37 A
dat1 <- ddply(dat,.(time,grp),summarise,tot = sum(count))
> head(dat1)
time grp tot
1 0 A 693
2 0 B 670
3 0 C 461
4 1 A 601
5 1 B 890
6 1 C 580
ggplot(data = dat1,aes(x = time,y = tot,fill = grp)) +
geom_bar(stat = "identity",position = "stack") +
coord_polar()
I just coded the hours of the day as integers 0-23, and simply grabbed some random values for Activity counts. But this seems like it's generally what you're after.
Edit
A few more options based on comments:
#Force some banding patterns
xx <- sample(10,9,replace = TRUE)
dat <- data.frame(time = factor(rep(0:23,times = 20)),
day = factor(rep(1:20,each = 24),levels = 20:1),
count = rep(c(xx,rep(0,4)),length.out = 20*24),
grp = sample(LETTERS[1:3],480,replace = TRUE))
Option one using faceting:
ggplot(dat,aes(x = time,y = day)) +
facet_wrap(~grp,nrow = 3) +
geom_tile(aes(alpha = count))
Option two using color (i.e. fill):
ggplot(dat,aes(x = time,y = day)) +
geom_tile(aes(alpha = count,fill = grp))
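To get from the layout shown in the question to the time/day/grp columns used above, the hour and date strings have to be parsed first. A rough base R sketch follows; the column names are hypothetical, the rows are typed in from the excerpt in the question, and treating "---" as a missing food type is my assumption.

```r
# Rows typed in from the excerpt in the question (column names are hypothetical)
raw <- data.frame(
  Activity     = c(0, 13, 0, 1246, 3423, 0),
  TimeOfDay    = c("01:00", "02:00", "03:00", "18:00", "19:00", "00:00"),
  FoodType     = c("B", "---", "B", "C", "C", "---"),
  CalendarDate = c(rep("03/24/2007", 5), "03/25/2007"),
  stringsAsFactors = FALSE
)

# Hour of day as an integer 0-23, and the date as a real Date
raw$hour <- as.integer(substr(raw$TimeOfDay, 1, 2))
raw$day  <- as.Date(raw$CalendarDate, format = "%m/%d/%Y")

# Assumption: "---" means no food type was recorded
raw$FoodType[raw$FoodType == "---"] <- NA

# Total activity per hour and food type (rows with NA food type are dropped)
tot <- aggregate(Activity ~ hour + FoodType, data = raw, FUN = sum)
tot
```

The hour, day and FoodType columns can then take the place of time, day and grp in the plots above.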

frequency plot using ggplot hangs or not showing plot

I have a dataframe with 880,000 rows and 2 columns ('width', 'group') in the following form:
width group
20 a
25 a
20 a
25 a
35 b
40 c
20 d
25 d
I want to create a frequency polygon for all the four groups in the same figure but so far I remained unsuccessful.
df1 = cbind(ceiling(rnorm(20, 30,5)), 'a')
df2 = cbind(ceiling(rnorm(40, 80,10)), 'b')
df3 = cbind(ceiling(rnorm(30, 50,8)), 'c')
df4 = cbind(ceiling(rnorm(35, 30,7)), 'd')
dfrm = rbind(df1,rbind(df2,rbind(df3,df4)))
colnames(dfrm)=c('width', 'group')
dfrm = as.data.frame(dfrm)
qplot(width, data = dfrm, geom="freqpoly", binwidth = 100) #not showing any plot
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 1000) #create more than four plots
I need to draw something similar to the following:
http://had.co.nz/ggplot2/graphics/996ae62d750dfccac8805fa0c87168cc.png
Or
http://had.co.nz/ggplot2/graphics/55078149a733dd1a0b42a57faf847036.png
There are a couple of problems. First, the way you have created dfrm, width is a factor.
> str(dfrm)
'data.frame': 125 obs. of 2 variables:
$ width: Factor w/ 60 levels "106","20","21",..: 7 7 17 10 9 9 6 7 17 4 ...
$ group: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 1 1 1 1 1 1 ...
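This is because cbind creates a matrix, in which all elements must have the same type; since one column is character, the result is a character matrix. The later conversion to a data.frame turns both columns into factors (this was the default before R 4.0.0; from 4.0.0 on they stay character, which is still not numeric). This can be fixed with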
dfrm$width <- as.numeric(as.character(dfrm$width))
or better, not making matrices to begin with
df1 = data.frame(width=ceiling(rnorm(20, 30,5)), group='a')
df2 = data.frame(width=ceiling(rnorm(40, 80,10)), group='b')
df3 = data.frame(width=ceiling(rnorm(30, 50,8)), group='c')
df4 = data.frame(width=ceiling(rnorm(35, 30,7)), group='d')
dfrm = rbind(df1,df2,df3,df4)
This is enough to make a graph
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 1000)
Though it looks like there is only one line, there are actually 4, all on top of each other. You only see the last one drawn (group "d"). This points out the second problem: your binwidth is way too large for this data.
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 10)
geom_freqpoly does not appear to have a fill aesthetic, though.
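The coercion problem described above is easy to check in isolation, and it also shows why the as.character() step matters when converting the factor back. A small base R sketch with three made-up widths; stringsAsFactors = TRUE is set explicitly to mimic the pre-4.0.0 default.

```r
# cbind() of numbers and a string gives a character matrix
m <- cbind(c(20, 25, 106), "a")
is.character(m)

# Converting to a data frame with factors, as older R versions did by default
d <- as.data.frame(m, stringsAsFactors = TRUE)
is.factor(d$V1)

# as.numeric() on a factor returns the level codes, not the values
codes  <- as.numeric(d$V1)

# Going through as.character() first recovers the original numbers
values <- as.numeric(as.character(d$V1))
values
```

The level codes follow the alphabetical order of the strings ("106" sorts before "20"), which is why they bear no relation to the actual widths.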
