I had a problem today figuring out a way to do an aggregation in dplyr in R but for some reason was unable to come up with a solution (although I think this should be quite easy).
I have a data set like this:
structure(list(date = structure(c(16431, 16431, 16431, 16432,
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green",
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L,
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190,
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")
which shows like this:
date colour shape value
1 2014-12-27 red square 100
2 2014-12-27 blue square 130
3 2014-12-27 blue triangle 100
4 2014-12-28 green triangle 180
5 2014-12-28 green triangle 125
6 2014-12-28 red square 190
7 2014-12-29 red circle 120
8 2014-12-29 blue circle 100
9 2014-12-29 blue circle 140
My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:
date colour shape value
1 27/12/2014 blue square 110
2 28/12/2014 green triangle 165
3 29/12/2014 blue circle 120
I ended up doing it using split and writing my own function to calculate the above for a data.frame, then used snow::clusterApply to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr. Efficiency is really important for this so being able to run it in one chain is quite important.
You could do
dat %>% group_by(date) %>%
summarize(colour = names(which.max(table(colour))),
shape = names(which.max(table(shape))),
value = mean(value))
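For anyone who wants to check the logic without dplyr, here is a minimal base R sketch of the same idea. The helper name most_frequent is made up for illustration, and the data is retyped from the table in the question. Note that which.max() returns the first maximum, so ties are broken in favour of whichever value comes first in table() order (alphabetical for character vectors).

```r
# Hypothetical helper: most frequent value of a vector.
# On ties, the first entry in table() order wins.
most_frequent <- function(x) names(which.max(table(x)))

# Data retyped from the question
dat <- data.frame(
  date   = as.Date(rep(c("2014-12-27", "2014-12-28", "2014-12-29"), each = 3)),
  colour = c("red", "blue", "blue", "green", "green", "red",
             "red", "blue", "blue"),
  shape  = c("square", "square", "triangle", "triangle", "triangle", "square",
             "circle", "circle", "circle"),
  value  = c(100, 130, 100, 180, 125, 190, 120, 100, 140),
  stringsAsFactors = FALSE
)

# One summary row per day, then bind the rows back together
res <- do.call(rbind, lapply(split(dat, dat$date), function(d) {
  data.frame(date   = d$date[1],
             colour = most_frequent(d$colour),
             shape  = most_frequent(d$shape),
             value  = mean(d$value),
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

This reproduces the expected output from the question (blue/square/110, green/triangle/165, blue/circle/120). It is only meant to illustrate the logic, not as a faster alternative for 10M rows.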
I have a dataset composed of more than 100 columns and all columns are of type factor. Ex:
animal fruit vehicle color
cat orange car blue
dog apple bus green
dog apple car green
dog orange bus green
In my dataset I need to remove all columns with factors that have fewer than 5 observations per level. In this example, if I want to remove all columns in which any level has 1 observation or fewer, like blue or cat, the algorithm will remove the columns animal and color. What is the most elegant way to do this?
We can use Filter with table
Filter(function(x) !any(table(x) < 2), df1)
# fruit vehicle
#1 orange car
#2 apple bus
#3 apple car
#4 orange bus
data
df1 <- structure(list(animal = structure(c(1L, 2L, 2L, 2L), .Label = c("cat",
"dog"), class = "factor"), fruit = structure(c(2L, 1L, 1L, 2L
), .Label = c("apple", "orange"), class = "factor"), vehicle = structure(c(2L,
1L, 2L, 1L), .Label = c("bus", "car"), class = "factor"), color = structure(c(1L,
2L, 2L, 2L), .Label = c("blue", "green"), class = "factor")),
row.names = c(NA,
-4L), class = "data.frame")
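Since the real threshold in the question is 5 rather than 2, the Filter idea can be wrapped in a small function with the cutoff as a parameter. This is only a sketch: keep_common_levels and min_count are names I made up, and the example data is retyped from the question. One thing to be aware of is that table() on a factor also counts unused levels (with count 0), so you may want droplevels() first if your factors carry empty levels.

```r
# Hypothetical helper: keep only the columns in which every level
# occurs at least min_count times
keep_common_levels <- function(df, min_count) {
  Filter(function(x) all(table(x) >= min_count), df)
}

# Example data retyped from the question
df1 <- data.frame(
  animal  = c("cat", "dog", "dog", "dog"),
  fruit   = c("orange", "apple", "apple", "orange"),
  vehicle = c("car", "bus", "car", "bus"),
  color   = c("blue", "green", "green", "green"),
  stringsAsFactors = TRUE
)

kept <- keep_common_levels(df1, min_count = 2)
names(kept)
```

With min_count = 2 only fruit and vehicle survive, which matches the output above; with min_count = 1 all four columns are kept.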
We can use select_if from dplyr
library(dplyr)
df1 %>% select_if(~all(table(.) > 1))
# fruit vehicle
#1 orange car
#2 apple bus
#3 apple car
#4 orange bus
Warning: this question seems so easy that I as a beginner probably did not manage to find the right solution among the more complex topics on SO (looked here, here, here and at more places)
I would like to fill a column in my dataframe, based on another column, and using as input further columns.
This is much clearer with an example:
Version1 Version2 Version3 Version4 Presented_version Color
1 blue red green yellow 1 NA
2 red blue yellow green 4 NA
3 yellow green red blue 3 NA
I would like to fill the column "Color" with the value of either Version1/Version2/Version3/Version4. The column Presented_version tells me which of these four values is needed.
For example, in row 1, Presented_version is 1, so the value needed is in "Version1" ("blue"). Color in row 1 should be blue.
Could someone show me a way to do this without looping over the dataframe using lots of "if" statements?
structure(list(Version1 = structure(1:3, .Label = c("blue", "red",
"yellow"), class = "factor"), Version2 = structure(c(3L, 1L,
2L), .Label = c("blue", "green", "red"), class = "factor"), Version3 = structure(c(1L,
3L, 2L), .Label = c("green", "red", "yellow"), class = "factor"),
Version4 = structure(3:1, .Label = c("blue", "green", "yellow"
), class = "factor"), Presented_version = c(1L, 4L, 3L),
Color = c(NA, NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
=======================
EDITED!
I simplified the example to explain my question but the example above differs in several ways from my actual dataset, and the solutions therefore make assumptions which my data do not actually meet.
Here is a more accurate representation of the data.frame. In particular, there is no fixed match between Presented_version and the content of the Version1...Version4 columns (that differs depending on an extra column, which I called Painter now), and Version1 to Version4 are not necessarily in columns 1 to 4 in my dataset.
FillerColumn Painter Version1 Version2 Version3 Version4 Version_presented Color FillerColumn.1
1 77 A blue red green yellow 1 NA 77
2 77 B red blue yellow green 4 NA 77
3 77 C yellow green red blue 3 NA 77
4 77 D red blue yellow green 1 NA 77
structure(list(FillerColumn = c(77L, 77L, 77L, 77L), Painter = structure(1:4, .Label = c("A",
"B", "C", "D"), class = "factor"), Version1 = structure(c(1L,
2L, 3L, 2L), .Label = c("blue", "red", "yellow"), class = "factor"),
Version2 = structure(c(3L, 1L, 2L, 1L), .Label = c("blue",
"green", "red"), class = "factor"), Version3 = structure(c(1L,
3L, 2L, 3L), .Label = c("green", "red", "yellow"), class = "factor"),
Version4 = structure(c(3L, 2L, 1L, 2L), .Label = c("blue",
"green", "yellow"), class = "factor"), Version_presented = c(1L,
4L, 3L, 1L), Color = c(NA, NA, NA, NA), FillerColumn.1 = c(77L,
77L, 77L, 77L)), class = "data.frame", row.names = c(NA,
-4L))
We can use a vectorized option with row/column indexing to extract the values instead of any loop
df1$Color <- df1[1:4][cbind(1:nrow(df1), df1$Presented_version)]
df1$Color
#[1] "blue" "green" "red"
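For the edited layout, where the Version columns are not in positions 1 to 4, the same matrix-indexing trick can be done by column name instead of position. A sketch, assuming that a Version_presented value of k refers to the column named "Version" followed by k (the edit says the real mapping also depends on Painter, which is not handled here). The data is retyped from the edited question and version_cols is just a name I picked.

```r
# Data retyped from the edited question (Color and FillerColumn.1 left out)
df2 <- data.frame(
  FillerColumn = 77L,
  Painter  = c("A", "B", "C", "D"),
  Version1 = c("blue", "red", "yellow", "red"),
  Version2 = c("red", "blue", "green", "blue"),
  Version3 = c("green", "yellow", "red", "yellow"),
  Version4 = c("yellow", "green", "blue", "green"),
  Version_presented = c(1L, 4L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Select the Version columns by name and turn them into a character matrix
version_cols <- grep("^Version[0-9]+$", names(df2), value = TRUE)
m <- as.matrix(df2[version_cols])

# Assumption: value k points at the column named paste0("Version", k)
col_idx <- match(paste0("Version", df2$Version_presented), colnames(m))

# A two-column (row, column) matrix picks one cell per row
df2$Color <- m[cbind(seq_len(nrow(df2)), col_idx)]
df2$Color
```

The pattern "^Version[0-9]+$" is used so that Version_presented itself is not picked up as one of the value columns.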
Benchmarks
dfN <- df1[rep(seq_len(nrow(df1)), 1e6),]
system.time({
dfN[1:4][cbind(1:nrow(dfN), dfN$Presented_version)]
})
# user system elapsed
# 1.216 0.110 1.321
system.time({
cols <- grep("^Version", names(dfN))
unlist(mapply(function(x, y) dfN[x, cols][y],
1:nrow(dfN),dfN$Presented_version))
})
# user system elapsed
# 319.907 1.644 322.418
Now, let's see the other option with apply
system.time({
apply(dfN, 1, function(x) x[cols][as.numeric(x["Presented_version"])])
})
# user system elapsed
# 14.240 0.365 14.550
I like to mess with the data set. Try a data.table melt approach
library(data.table)
library(stringr)  # for str_extract()
df <- setDT(df)
df1 <- melt.data.table(df,
id.vars = c('Presented_version'),
measure.vars = patterns('Version'),
value.name = 'Color',
variable.name = 'Version')[
, version1 := str_extract(Version, '\\d+')][
Presented_version == version1][
, version1 := NULL]
resulting in
Presented_version Version Color
1: 1 Version1 blue
2: 3 Version3 red
3: 4 Version4 green
And, if you want the information in the same original structure
merge(df,
df1[, .(Presented_version, Color)],
by = 'Presented_version')
Presented_version Version1 Version2 Version3 Version4 Color
1: 1 blue red green yellow blue
2: 3 yellow green red blue red
3: 4 red blue yellow green green
One way using mapply
cols <- grep("^Version", names(df))
df$Color <- unlist(mapply(function(x, y) df[x, cols][y],
1:nrow(df),df$Presented_version))
df
# Version1 Version2 Version3 Version4 Presented_version Color
#1 blue red green yellow 1 blue
#2 red blue yellow green 4 green
#3 yellow green red blue 3 red
And with apply
apply(df, 1, function(x) x[cols][as.numeric(x["Presented_version"])])
#[1] "blue" "green" "red"
I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside of the plot to render a new geom_line() per each returned value of the input$productId vector. -- however after some research it seems that's the wrong way to go about things.
Currently I'm trying to melt() the data down to something useful, and then aes(...group=product_id), but getting errors on whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axis and values are all wonky, and not what I'm expecting from setting the plot.
If I'm understanding it correctly, you don't need any melting; you just need to aggregate your data, summing up count by date and product_id. You can use data.table for this purpose:
testSet = data.table(testSet)
aggrSet = testSet[, .(count=sum(count)), by=.(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
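When you melt the way you did, with only product_id as the id variable, every other column (date, store_id and count) is stacked into the single value column you named count, even though they have different types, while the column you named date only holds the original column names.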
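If you would rather not pull in data.table, roughly the same aggregation can be sketched in base R with aggregate(), followed by a subset on the checked products. The data is the 10-row dput from the question; selected is a made-up stand-in for whatever input$productId returns in the shiny app.

```r
# The 10 rows from the dput in the question
testSet <- data.frame(
  date = as.Date(c(16662, 16662, 16664, 16664, 16664,
                   16665, 16665, 16665, 16666, 16666),
                 origin = "1970-01-01"),
  store_id   = 3L,
  product_id = c(1L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L),
  count      = c(8L, 1L, 7L, 3L, 1L, 2L, 18L, 1L, 0L, 2L)
)

# Sum count per date and product, no melting needed
aggrSet <- aggregate(count ~ date + product_id, data = testSet, FUN = sum)

# Hypothetical stand-in for the checkbox selection (input$productId)
selected <- c(1, 3)
plotSet <- aggrSet[aggrSet$product_id %in% selected, ]
plotSet
```

plotSet can then go straight into ggplot with aes(x = date, y = count, group = product_id, colour = factor(product_id)) and a single geom_line(), so no loop over the selected products is required.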
I have a data.table that looks like the following:
ID Date Team MonthFactor
1 2512 2015-04-24 Purple 2015-04
2 2512 2015-04-25 Purple 2015-04
3 2512 2015-04-26 Purple 2015-04
4 2512 2015-04-27 Purple 2015-04
I would like to get the number of rows grouped by both Team and MonthFactor, including when there are no rows from a given month, i.e. if the purple team had no entries in the month of May but yellow did, the summarized table would look like:
Team MonthFactor N
1 Purple 2015-04 10
2 Purple 2015-05 0
3 Yellow 2015-04 5
4 Yellow 2015-05 7
Doing this would be trivial if I didn't need the "empty" groups, but I can't wrap my head around how to specify the groups that need to be evaluated when there might not be rows that contain a given monthFactor.
You can achieve that by using a cross-join:
# name the CJ columns explicitly so the join does not rely on auto-generated names
dat[, .N, .(Team, MonthFactor)
][CJ(Team = Team, MonthFactor = MonthFactor, unique = TRUE), on = c("Team", "MonthFactor")
][is.na(N), N := 0L][]
this gives:
Team MonthFactor N
1: Purple 2015-04 2
2: Purple 2015-05 0
3: Yellow 2015-04 5
4: Yellow 2015-05 3
The advantage of this method is that it is easier to include other variables as well. Supposing that ID is just a numeric value, consider this example:
# name the CJ columns explicitly so the join does not rely on auto-generated names
dat[, .(.N, sID = sum(ID)), .(Team, MonthFactor)
][CJ(Team = Team, MonthFactor = MonthFactor, unique = TRUE), on = c("Team", "MonthFactor")
][is.na(N), `:=` (N = 0L, sID = 0L)][]
which gives:
Team MonthFactor N sID
1: Purple 2015-04 2 5024
2: Purple 2015-05 0 0
3: Yellow 2015-04 5 12560
4: Yellow 2015-05 3 7536
Used data:
dat <- structure(list(ID = c(2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L),
Date = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 4L, 2L, 3L, 4L), .Label = c("2015-04-24", "2015-04-25", "2015-04-26", "2015-04-27"), class = "factor"),
Team = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Purple", "Yellow"), class = "factor"),
MonthFactor = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("2015-04", "2015-05"), class = "factor")),
.Names = c("ID", "Date", "Team", "MonthFactor"), class = c("data.table", "data.frame"), row.names = c(NA, -10L))
Perhaps this could work
data.table(table(dt$Team,dt$MonthFactor))
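To spell out why the table() idea gives the empty groups: table() counts every combination of factor levels, including those with no rows, so the zero for Purple in 2015-05 appears for free as long as both columns are factors. A base R sketch, with the Team and MonthFactor columns retyped from the "Used data" above (counts is just a name I picked):

```r
# Team and MonthFactor retyped from the example data
dat <- data.frame(
  Team        = factor(c(rep("Purple", 2), rep("Yellow", 8))),
  MonthFactor = factor(c(rep("2015-04", 7), rep("2015-05", 3)))
)

# table() includes level combinations that never occur, with count 0
counts <- as.data.frame(table(Team = dat$Team, MonthFactor = dat$MonthFactor),
                        responseName = "N")

# Sort to get the layout asked for in the question
counts <- counts[order(counts$Team, counts$MonthFactor), ]
rownames(counts) <- NULL
counts
```

This only works for counts. If you also need other summaries per group (like the sum of ID), the cross-join approach is the more flexible one.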
I'm trying to make a simple bar chart that first distinguishes between two groups, say based on sex (male or female), and then, after running the stats, marks significance: each sample/individual has a P-value that is either significant or not. I know how to color code the bars between male and female, but I want R to automatically put a star above each sample/individual that has a P-value of less than, say, 0.05.
I'm currently just using the simple barplot(x) function.
I've tried to look around for answers but haven't found anything for this yet.
Below is a link to my example data set:
DivShare File - test.csv (http://www.divshare.com/download/22797284-187)
I'd like to put the time on the y axis, color code the bars to distinguish between Male and Female, and then, for individuals in either group who have a 1 under significance, put a star above their corresponding bar.
Thanks for any suggestions in advance.
I messed with your data a bit to make it friendlier:
## dput(read.csv("barcharttest.csv"))
x <- structure(list(ID = 1:7,
sex = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 2L), .Label = c("female", "male"),
class = "factor"),
val = c(309L, 192L, 384L, 27L, 28L, 245L, 183L),
stat = structure(c(1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("NS", "sig"),
class = "factor")),
.Names = c("ID", "sex", "val", "stat"),
class = "data.frame", row.names = c(NA, -7L))
Which looks like this:
ID sex val stat
1 1 female 309 NS
2 2 female 192 sig
3 3 female 384 sig
4 4 male 27 NS
5 5 male 28 sig
6 6 female 245 NS
7 7 male 183 NS
Now the plot:
sexcols <- c("pink","blue")
## png("barplot.png") ## for output graph
par(las=1,bty="l") ## I prefer these settings; see ?par
b <- with(x,barplot(val,col=sexcols[sex])) ## b saves x coords of bars
legend("topright",levels(x$sex),fill=sexcols,bty="n")
## use xpd=NA to make sure that star on tallest bar doesn't get clipped;
## pos=3 puts the text above the (x,y) location specified
text(b,x$val,ifelse(x$stat=="sig","*",""),pos=3,cex=2,xpd=NA)
axis(side=1,at=b,labels=x$ID)
## dev.off()
I should also add "Time" and "ID" labels on the relevant axes.
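A possible way to add those labels, as a sketch rather than the only option: barplot() accepts xlab, ylab and names.arg directly, so the separate axis() call is not needed. The data is retyped from the table above; the extra headroom in ylim is my own choice to keep the star on the tallest bar inside the plot region (instead of xpd=NA).

```r
# Data retyped from the table above
x <- data.frame(
  ID   = 1:7,
  sex  = factor(c("female", "female", "female", "male", "male", "female", "male")),
  val  = c(309, 192, 384, 27, 28, 245, 183),
  stat = c("NS", "sig", "sig", "NS", "sig", "NS", "NS")
)

sexcols <- c("pink", "blue")

# names.arg labels each bar with its ID; xlab/ylab label the axes
b <- barplot(x$val, col = sexcols[x$sex], names.arg = x$ID,
             xlab = "ID", ylab = "Time",
             ylim = c(0, max(x$val) * 1.15))  # headroom for the stars
legend("topright", levels(x$sex), fill = sexcols, bty = "n")

# Put a star above the significant bars only
sig <- x$stat == "sig"
text(b[sig], x$val[sig], "*", pos = 3, cex = 2)
```

Here b holds the x coordinates of the bar midpoints, exactly as in the code above, so the stars line up with the bars.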