I have a data.table / data frame with lists as values. I would like to make a box or violin plot of the values, one violin/box representing one row of my data set, but I can't figure out how.
Example:
test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
ggplot(data = test.dt, aes(x = as.factor(id), y = v1)) + geom_boxplot()
I get the following message:
Warning message:
Computation failed in stat_boxplot():
'x' must be atomic
So my guess is that maybe I should split the lists of the values to rows somehow. I.e.: the row with a as id would be transformed to 3 rows (corresponding to the length of the vector in v1) with the same id, but the values would be split among them.
Firstly I don't know how to transform the data.table as mentioned, secondly I don't know either if this would be the solution at all.
Indeed, you need to unnest your dataset before plotting:
library(tidyverse)
unnest(test.dt, cols = v1) %>%
ggplot(data = ., aes(x = as.factor(id), y = v1)) + geom_boxplot()
I believe what you are looking for is the very handy unnest() function. The following code works:
library(data.table)
library(tidyverse)
test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
test.dt <- test.dt %>% unnest(cols = v1)
ggplot(test.dt, aes(x = as.factor(id), y = v1)) +
  geom_boxplot()
If you don't want to import the whole tidyverse, the unnest() function is from the tidyr package.
This is what unnest() does with example data:
> data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
id v1
1: a 1, 0,10
2: b 1,2,3,4,5
3: c 3
> data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3)) %>% unnest()
id v1
1: a 1
2: a 0
3: a 10
4: b 1
5: b 2
6: b 3
7: b 4
8: b 5
9: c 3
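For readers who prefer to stay within data.table, the same reshaping can be sketched without tidyr. This is only an illustration of the "split the lists into rows" idea from the question; the name long.dt is made up for the example.

```r
library(data.table)

test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))

# Repeat each id once per element of its list entry, then flatten the list column.
# unlist() coerces the mixed integer/double entries to a single double vector.
long.dt <- test.dt[, .(id = rep(id, lengths(v1)), v1 = unlist(v1))]

long.dt  # 9 rows: one per value, ready for geom_boxplot() or geom_violin()
```

The resulting long.dt can be passed to ggplot() exactly like the unnested table above.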
I have a peculiar problem with arranging boxplots in a given x-axis order. I am adding two boxplots from different data frames to the same plot, and each time I add the second geom_boxplot, R reorders my x-axis alphabetically instead of following the ordered levels of factor(x).
So, I have two data frames of different lengths looking something like this:
df1:
id value
1 A 1
2 A 2
3 A 3
4 A 5
5 B 10
6 B 8
7 B 1
8 C 3
9 C 7
df2:
id value
1 A 4
2 A 5
3 B 6
4 B 8
There are always more observations per id in df1 than in df2, and there are some ids in df1 that are not available in df2.
I'd like df1 to be sorted by the median(value) (ascending) and to first plot boxplots for each id in that order.
Then I add a second layer with boxplots for all other measurements per id from df2, which should maintain the same order on the x-axis.
Here's how I approached that:
vec <- df1 %>%
group_by(id) %>%
summarize(m = median(value)) %>%
arrange(m) %>%
pull(id)
p1 <- df1 %>%
ggplot(aes(x = factor(id, levels = vec), y = value)) +
geom_boxplot()
p1
p2 <- p1 +
geom_boxplot(data = df2, aes(x = factor(id, levels = vec), y = value))
p2
p1 shows the right order (ids are ordered based on ascending medians), p2 always throws my order off and goes back to plotting ids alphabetically (my id is a character column with names actually). I tried with sample dataframes and the above code achieves what is required. Hence, I am not sure what could be specifically wrong about my data so that the code fails when applied to the specific data and not the above mock data.
Any ideas?
Thanks a lot in advance!
If I understood correctly, this should work.
library(tidyverse)
# Sample data
df1 <-
tibble(
id = c("A","A","A","A","B","B","B","C","C"),
value = c(1,2,3,5,10,8,1,3,7),
type = "df1"
)
df2 <-
tibble(
id = c("A","A","B","B"),
value = c(4,5,6,8),
type = "df2"
)
df <-
# Create single data.frame
df1 %>%
bind_rows(df2) %>%
# Reorder id by median(value)
mutate(id = fct_reorder(id,value,median))
df %>%
ggplot(aes(id, y = value, fill = type)) +
geom_boxplot()
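If the two data frames should stay separate, as in the question, another option is to fix the axis order on the scale rather than in each layer. The following is a minimal sketch, assuming the sample data above and that id is a character column; the fill and width values are arbitrary choices to tell the two layers apart.

```r
library(dplyr)
library(ggplot2)

df1 <- data.frame(id = c("A","A","A","A","B","B","B","C","C"),
                  value = c(1,2,3,5,10,8,1,3,7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("A","A","B","B"),
                  value = c(4,5,6,8),
                  stringsAsFactors = FALSE)

# ids of df1 ordered by ascending median(value)
vec <- df1 %>%
  group_by(id) %>%
  summarize(m = median(value)) %>%
  arrange(m) %>%
  pull(id)

# Both layers map the plain id column; the scale alone decides the order.
p2 <- ggplot(df1, aes(x = id, y = value)) +
  geom_boxplot() +
  geom_boxplot(data = df2, fill = "grey80", width = 0.4) +
  scale_x_discrete(limits = vec)
```

Because the order lives in scale_x_discrete(limits = ...), it does not depend on how each layer's data happens to be sorted or factored.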
I measured the inhibiting power of bacteria on viruses. I have a data matrix of n rows (individuals) and 4 columns (a, b, c, x). Depending on column x, I would like to classify them as good or bad inhibitors. However, I am not sure how to set a threshold on column x, depending on the other measured columns (a, b, c). Is there any R function that could separate/group my data frame?
In dplyr logic there is group_by(), it works like this:
library(dplyr)
df %>%
group_by(A) %>% # df is now grouped by column A
summarise(Mean = mean(C)) # calculates the mean of C for each group of A, summarise will delete any other columns not summarised and show only distinct rows
df %>%
group_by(A) %>%
mutate(Mean = mean(C)) # This will add the grouped mean to each row without changing the data frame
If you summarise then you are done but after group_by and mutate you have to ungroup your data frame at some point.
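As a small illustration of the ungroup step, here is a sketch on made-up data. The columns A and C follow the snippet above, while the 0.5 threshold and the Good/Bad labels are assumptions chosen only for the example.

```r
library(dplyr)

# Hypothetical data: A is the grouping column, C the measured value
df <- data.frame(A = c("g1", "g1", "g2", "g2"),
                 C = c(0.2, 0.4, 0.7, 0.9))

res <- df %>%
  group_by(A) %>%
  mutate(Mean = mean(C)) %>%   # group mean attached to every row
  ungroup() %>%                # drop the grouping before further row-wise work
  mutate(Label = if_else(Mean > 0.5, "Good", "Bad"))
```

After ungroup(), later verbs operate on the whole data frame again instead of per group.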
data.table example below. In the data, we have 50 observations (a) across 5 groups (Group).
Data
library(data.table)
dt = data.table(
  a = runif(50),
  Group = sample(LETTERS[1:5], 50, replace = TRUE)
)
Example 1
Firstly, we can calculate the Group mean of a and label it 'Good' if it is above 0.5 and 'Bad' if below. Note that this summary does not include a.
dt1 = dt[, .(Mean = mean(a)), keyby = Group][, Label := ifelse(Mean > 0.5, 'Good', 'Bad')]
> dt1
Group Mean Label
1: A 0.2982229 Bad
2: B 0.4102181 Bad
3: C 0.6201973 Good
4: D 0.4841881 Bad
5: E 0.4443718 Bad
Example 2
Similarly to Fnguyen's answer, the following code will not summarise the data per group; it will merely show the Group Mean and Label next to each observation.
dt2 = dt[, Mean := mean(a), by = Group][, Label := ifelse(Mean > 0.5, 'Good', 'Bad')]
> head(dt2)
a Group Mean Label
1: 0.4253110 E 0.4443718 Bad
2: 0.4217955 A 0.2982229 Bad
3: 0.7389260 E 0.4443718 Bad
4: 0.2499628 E 0.4443718 Bad
5: 0.3807705 C 0.6201973 Good
6: 0.2841950 E 0.4443718 Bad
Example 3
Lastly, we can of course apply a conditional argument to create a new column without having previously calculated a Grouped variable. The following example tests a combined condition on columns a and b.
dt3 = data.table(a = runif(100), b = runif(100))
dt3[, abGrThan0.5 := a > 0.5 & b > 0.5]
> head(dt3)
a b abGrThan0.5
1: 0.5132690 0.02104807 FALSE
2: 0.8466798 0.96845916 TRUE
3: 0.5776331 0.79215074 TRUE
4: 0.9740055 0.59381244 TRUE
5: 0.4311248 0.07473373 FALSE
6: 0.2547600 0.09513784 FALSE
df<-data.frame(gender = c('A', 'B', 'B','B','A'),q01 = c(1, 6, 3,8,5),q02 = c(5, 3, 6,5,2))
gender q01 q02
1 A 1 5
2 B 6 3
3 B 3 6
4 B 8 5
5 A 5 2
I want to calculate q01*2+q02 and then get the mean by gender group,the expected result as below:
A 9.5
B 16
I tried but failed:
df %>% aggregate(c(q01,q02)~gender,mean(q01*2+q02))
Error in mean(q01 * 2 + q02) : object 'q01' not found
df %>% group_by(gender) %>% mean(.$q01*2+.$q02)
[1] NA
Warning message:
In mean.default(., .$q01 * 2 + .$q02) :
argument is not numeric or logical: returning NA
What's the problem?
In the OP's dplyr + aggregate code, the data argument is not specified, and c() concatenates the two columns into a single vector. Even with the data supplied, the call fails:
aggregate(c(q01,q02)~gender,df, mean(q01*2+q02))
Error in model.frame.default(formula = c(q01, q02) ~ gender, data =
df) : variable lengths differ (found for 'gender')
Here, c(q01, q02) concatenates the two columns (like c(1:5, 6:10)), so the left-hand side is twice as long as gender. In addition, the expression passed as FUN cannot be evaluated, because 'q01' and 'q02' are not found outside the data frame.
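The length mismatch is easy to check directly on the question's data; this is just a quick illustration, and the name lhs is made up for the example.

```r
df <- data.frame(gender = c('A', 'B', 'B', 'B', 'A'),
                 q01 = c(1, 6, 3, 8, 5),
                 q02 = c(5, 3, 6, 5, 2))

# c() stacks the two columns end to end instead of keeping them side by side
lhs <- c(df$q01, df$q02)

length(lhs)        # 10
length(df$gender)  # 5, hence "variable lengths differ"
```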
Instead, we can cbind to create new column with the formula method of aggregate and then get the mean
library(dplyr)
df %>%
aggregate(cbind(q = q01 * 2 + q02) ~ gender, data = ., mean)
# gender q
#1 A 9.5
#2 B 16.0
NOTE: In dplyr, the data from the lhs of %>% can be specified with a dot (.).
NOTE2: Here, we assume that the question is to understand how the aggregate can be made to work in the %>%. If it is just to get the mean, the whole process can be done with dplyr
f1 <- function(x, y, val) mean(x * val + y)
df %>%
group_by(gender) %>%
summarise(q = f1(q01, q02, 2))
Or using data.table methods
library(data.table)
setDT(df)[, .(q = mean(q01 * 2 + q02)), .(gender)]
# gender q
#1: A 9.5
#2: B 16.0
Or using base R with by
stack(by(df[-1], df[1], FUN = function(x) mean(x[,1] * 2 + x[,2])))
Or with aggregate
aggregate(cbind(q = q01 * 2 + q02) ~ gender, df, mean)
Better to keep dplyr and base approaches separate. Each of them has its own way of handling data. With dplyr you can do
library(dplyr)
df %>%
mutate(q = q01 * 2 + q02) %>%
group_by(gender) %>%
summarise(q = mean(q))
# gender q
# <fct> <dbl>
#1 A 9.5
#2 B 16
and using base R aggregate
aggregate(q~gender, transform(df, q = q01*2+q02), mean)
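Another base R possibility, shown here only as a sketch, is tapply(), which returns a named vector rather than a data frame. It assumes the same df as in the question; the name res is made up for the example.

```r
df <- data.frame(gender = c('A', 'B', 'B', 'B', 'A'),
                 q01 = c(1, 6, 3, 8, 5),
                 q02 = c(5, 3, 6, 5, 2))

# Compute the score per row, then average it within each gender
res <- with(df, tapply(q01 * 2 + q02, gender, mean))
res
#    A    B
#  9.5 16.0
```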
Sticking with the same logic:
df %>%
do(aggregate(I(q01*2)+q02~gender,
data=.,mean)) %>%
setNames(.,nm=c("gender","q"))
gender q
1 A 9.5
2 B 16.0
NOTE:
I do note that do's lifecycle is marked as questioning.
I have two columns in a data.frame, that should have levels sorted in the same order, but I don't know how to do it in a straightforward manner.
Here's the situation:
library(ggplot2)
library(dplyr)
library(magrittr)
set.seed(1)
df1 <- data.frame(rating = sample(c("GOOD","BAD","AVERAGE"),10,T),
div = sample(c("A","B","C"),10,T),
n = sample(100,10,T))
# I'm adding a label column that I use for plotting purposes
df1 <- df1 %>% group_by(rating) %>% mutate(label = paste0(rating," (",sum(n),")")) %>% ungroup
# # A tibble: 10 x 4
# rating div n label
# <fctr> <fctr> <int> <chr>
# 1 BAD C 48 BAD (220)
# 2 BAD B 87 BAD (220)
# 3 BAD C 44 BAD (220)
# 4 GOOD B 25 GOOD (77)
# 5 AVERAGE B 8 AVERAGE (117)
# 6 AVERAGE C 10 AVERAGE (117)
# 7 AVERAGE A 32 AVERAGE (117)
# 8 GOOD B 52 GOOD (77)
# 9 AVERAGE C 67 AVERAGE (117)
# 10 BAD C 41 BAD (220)
# rating levels are sorted
df1$rating <- factor(df1$rating,c("BAD","AVERAGE","GOOD"))
ggplot(df1,aes(x=rating,y=n,fill=div)) + geom_col() # plots in the order I want
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col() # doesn't because levels aren't sorted
How do I manage to copy the factor order from one column to another ?
I can make it work this way but I think it's really awkward:
lvls <- df1 %>% select(rating,label) %>% unique %>% arrange(rating) %>% extract2("label")
df1$label <- factor(df1$label,lvls)
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col()
Instead of adding a label column and using aes(x = label, ...), you may stick to aes(x = rating, ...) and create the labels in scale_x_discrete:
ggplot(df1, aes(x = rating, y = n, fill = div)) +
geom_col() +
scale_x_discrete(labels = df1 %>%
group_by(rating) %>%
summarize(n = sum(n)) %>%
mutate(lab = paste0(rating, " (", n, ")")) %>%
pull(lab))
Once you have set the levels of rating, you can use forcats to set the levels of label by the order of rating like this...
library(forcats)
df1 <- df1 %>% group_by(rating) %>%
mutate(label=paste0(rating," (",sum(n),")")) %>%
ungroup %>%
arrange(rating) %>% #sort by rating
mutate(label=fct_inorder(label)) #set levels by order in which they appear
Or you can use forcats::fct_reorder to do the same thing...
df1$label <- fct_reorder(df1$label, as.numeric(df1$rating))
The plot then has the bars in the right order.
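The same idea can also be sketched in base R, without forcats: take the labels in the order of rating and use their unique values as levels. The small data frame below is made up so the result is easy to verify by hand.

```r
df1 <- data.frame(
  rating = factor(c("GOOD", "BAD", "AVERAGE", "BAD"),
                  levels = c("BAD", "AVERAGE", "GOOD")),
  n = c(10, 20, 30, 40)
)

# Label = rating plus the total n for that rating
df1$label <- paste0(df1$rating, " (", ave(df1$n, df1$rating, FUN = sum), ")")

# Copy the level order of rating onto label
df1$label <- factor(df1$label, levels = unique(df1$label[order(df1$rating)]))

levels(df1$label)
# "BAD (60)" "AVERAGE (30)" "GOOD (10)"
```

This relies on each rating mapping to exactly one label, which holds here because the label is built from the rating itself.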
For the following dataset:
d = data.frame(date = as.Date(as.Date('2015-01-01'):as.Date('2015-04-10'), origin = "1970-01-01"),
group = rep(c('A','B','C','D'), 25), value = sample(1:100))
head(d)
date group value
1: 2015-01-01 A 4
2: 2015-01-02 B 32
3: 2015-01-03 C 46
4: 2015-01-04 D 40
5: 2015-01-05 A 93
6: 2015-01-06 B 10
.. can anyone advise a more elegant way to calculate a cumulative total of values by group than this data.table method?
library(data.table)
setDT(d)
d.cast = dcast.data.table(d, group ~ date, value.var = 'value', fun.aggregate = sum)
c.sum = d.cast[, as.list(cumsum(unlist(.SD))), by = group]
.. which is pretty clunky and yields a flat matrix that needs dplyr::gather or reshape2::melt to reformat.
Surely R can do better than this??
If you just want cumulative sums per group, then you can do
transform(d, new=ave(value,group,FUN=cumsum))
with base R.
This should work
library(dplyr)
d %>%
group_by(group) %>%
arrange(date) %>%
mutate(Total = cumsum(value))
As this question was tagged with data.table, you are probably looking for the following (a modification of @Frank's comment):
setDT(d)[order(date), new := cumsum(value), by = group]
This will simultaneously process the data in date order (not sure if that is needed; if not, you can drop order(date)) and update your data set in place using the := operator.
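To see what the in-place update produces, here is a small check on a reduced, hand-made version of the data (two groups, six dates), so the running totals can be verified by eye.

```r
library(data.table)

d <- data.table(
  date  = as.Date("2015-01-01") + 0:5,
  group = rep(c("A", "B"), 3),
  value = c(4, 32, 46, 40, 93, 10)
)

# Running total within each group, accumulated in date order, added by reference
d[order(date), new := cumsum(value), by = group]

d$new
# 4 32 50 72 143 82
```

The original row order of d is kept; only the new column is added.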
Is this it?
sp <- split(d, d$group)
res <- lapply(seq_along(sp), function(i) cumsum(sp[[i]]$value))
res <- lapply(seq_along(res), function(i){
sp[[i]]$c.sum <- res[[i]]
sp[[i]]
})
res <- do.call(rbind, res)
res