Interactively plotting multiple lines with shiny and ggplot2 - r

I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside the plot to render a new geom_line() for each value returned in the input$productId vector -- but after some research that seems to be the wrong way to go about it.
Currently I'm trying to melt() the data down to something useful and then use aes(..., group = product_id), but I'm getting errors on whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axes and values are all wonky, and not what I'm expecting from setting up the plot.

If I'm understanding it correctly you don't need any melting; you just need to aggregate your data, summing up count by date and product_id. You can use data.table for this purpose:
library(data.table)
testSet <- data.table(testSet)
aggrSet <- testSet[, .(count = sum(count)), by = .(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
When you melt the way you did, you merge two variables of different types into the single value column: date (Date) and store_id (integer).
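A minimal sketch of the plotting side, assuming aggrSet from above; the checked vector stands in for input$productId from the checkboxGroupInput in a Shiny server:

```r
library(data.table)
library(ggplot2)

# stand-in for input$productId (the checked boxes) outside Shiny
checked <- c(1, 3)

# keep only the checked products, one coloured line per product_id
plotSet <- aggrSet[product_id %in% checked]
ggplot(plotSet, aes(x = date, y = count, colour = factor(product_id))) +
  geom_line() +
  labs(colour = "product_id")
```

Wrapping product_id in factor() makes the colour scale discrete, so each product gets its own legend entry and line.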


group_by edit distance between rows over multiple columns

I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating:
This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate differences; the stringdist function may be better, or another option may be out there to try.
df<- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
# dplyr is needed for the pipe, filter() and arrange() used below
library(dplyr)
# apply the standard distance function and cluster
scores <- hclust(dist(df[, 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with more than 2 people in them
suspectgroups <- table(groups)[table(groups) >=2]
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
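Since the question asks specifically about dplyr: a rough sketch of the pairwise part using a self-join within class. The suffixed column names (id1, q11, etc.) are produced by the join, and the output is one row per pair rather than the exact (id1, id2 = difference) strings requested; recent dplyr versions will warn about the many-to-many join, which is expected here:

```r
library(dplyr)

# join the data to itself by class, keep each unordered pair once,
# then compute the Manhattan distance across the four questions
pairs <- df %>%
  inner_join(df, by = "class", suffix = c("1", "2")) %>%
  filter(id1 < id2) %>%
  mutate(difference = abs(q11 - q12) + abs(q21 - q22) +
                      abs(q31 - q32) + abs(q41 - q42))

# pairs with difference 0 are the candidate cheating pairs
pairs %>% filter(difference == 0) %>% select(class, id1, id2)
```

On the dput data this flags (16, 36) and (24, 35), matching the cheating column in the question's expected output.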

R code to assign a sequence based off of multiple variables [duplicate]

This question already has answers here:
Recode dates to study day within subject
(2 answers)
Closed 3 years ago.
I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large so a loop won't work, is it possible to do this with the dplyr package or maybe a combination of cumsum/diff?
An option is to group by 'ID' and then match the 'Day' values against the unique values of the 'Day' column:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(desired = match(Day, unique(Day)))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")
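Since the question mentions cumsum/diff: an equivalent vectorised sketch that increments a counter whenever Day changes within an ID (for sorted data like this, dplyr::dense_rank(Day) inside the same group_by gives the same result):

```r
library(dplyr)

# flag each row where Day differs from the previous row in the group,
# then a running sum of those flags gives the 1-based sequence
df1 %>%
  group_by(ID) %>%
  mutate(desired = cumsum(Day != lag(Day, default = first(Day))) + 1) %>%
  ungroup()
```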

How can I keep a tally of the count associated with each type of record? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have some data:
structure(list(date = structure(c(17888, 17888, 17888, 17888,
17889, 17889, 17891, 17891, 17891, 17891, 17891, 17892, 17894
), class = "Date"), type = structure(c(4L, 6L, 15L, 16L, 2L,
5L, 2L, 3L, 5L, 6L, 8L, 2L, 2L), .Label = c("aborted-live-lead",
"conversation-archived", "conversation-auto-archived", "conversation-auto-archived-store-offline-or-busy",
"conversation-claimed", "conversation-created", "conversation-dropped",
"conversation-restarted", "conversation-transfered", "cs-transfer-connected",
"cs-transfer-ended", "cs-transfer-failed", "cs-transfer-initiate",
"cs-transfer-request", "getnotified-requested", "lead-created",
"lead-expired"), class = "factor"), count = c(1L, 1L, 1L, 1L,
3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
It looks like this:
> head(dat)
# A tibble: 6 x 3
date type count
<date> <fct> <int>
1 2018-12-23 conversation-auto-archived-store-offline-or-busy 1
2 2018-12-23 conversation-created 1
3 2018-12-23 getnotified-requested 1
4 2018-12-23 lead-created 1
5 2018-12-24 conversation-archived 3
6 2018-12-24 conversation-claimed 1
For each unique type value, there is an associated count per day.
How can I count all of the values of each type (regardless of the date) and list them in a two-column data frame (in a format like this):
type count
------ ------
conversation-created 10
conversation-archived 4
lead-created 2
...
The reason for this is to show an overall count of each event type over the entire date range.
I presume that I have to use the select() function from dplyr but I am sure I am missing something.
This is what I have so far - it sums every value in the count column, which isn't what I want, as I want it broken down by type:
dat %>%
select(type, count) %>%
summarise(count = sum(count)) %>%
ungroup()
Seems like a combination of group_by and summarize with sum does the job:
dat %>% group_by(type) %>% summarise(count = sum(count))
# A tibble: 8 x 2
# type count
# <fct> <int>
# 1 conversation-archived 7
# 2 conversation-auto-archived 1
# 3 conversation-auto-archived-store-offline-or-busy 1
# 4 conversation-claimed 3
# 5 conversation-created 3
# 6 conversation-restarted 1
# 7 getnotified-requested 1
# 8 lead-created 1
There is no need for select as summarize will drop all the other variables anyway. Or perhaps you are confusing select with group_by, which is what we want in this case - to summarize those values of count where type takes the same value.
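For reference, the same aggregation in base R, assuming dat as in the dput above:

```r
# sum count within each type; gives the same totals as the dplyr version
aggregate(count ~ type, data = dat, FUN = sum)

# or as a named table
xtabs(count ~ type, data = dat)
```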

Identifying 24hour periods in GPS data

I would like to identify sequential 24 hour periods in GPS data. I have a datetime column that is numerical (ex: 41422.29) and I know each rounded number is a day. I know how to get the day (just round), however my schedule does not specifically follow days. Instead, I would like to identify all of the rows that are within 24 hours of the first row, and then go from there. I cannot use a count of rows, as 24 hours is not divided into equal increments.
This is my logic so far, though it doesn't get me where I need to be:
for (i in 1:length(example)) {
  base <- round(example$DT_LMT[i], digits = 0)
  if (example$DT_LMT[i] <= base + 1) {
    example$DaySeq <- base
  } else {
    base + 1
  }
}
I have a dummy data set example, with the kind of thing I would like:
structure(list(ID = 1:19, DT_LMT = c(41423.62517, 41423.79236,
41423.95868, 41424.12534, 41424.29203, 41424.45888, 41424.62535,
41424.79186, 41424.95852, 41425.12502, 41425.29185, 41425.75016,
41425.79201, 41425.83352, 41425.87534, 41425.91744, 41425.95868,
41426.00105, 41426.04257), NEED = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), .Names = c("ID",
"DT_LMT", "NEED"), class = "data.frame", row.names = c(NA, -19L
))
Here is one approach, assuming df is the data assigned in your question. I created a new variable, need, which I believe is your desired outcome.
transform(df, need = trunc(DT_LMT - DT_LMT[1]) + 1)
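The same idea with dplyr, if you prefer pipes (assuming df is the dput data above):

```r
library(dplyr)

# whole days elapsed since the first fix, made 1-based
df %>% mutate(need = trunc(DT_LMT - DT_LMT[1]) + 1)
```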
I would add 1 to the first value and use that to filter the data frame.
data<-data.frame(ID = 1:19, DT_LMT = c(41423.62517, 41423.79236,
41423.95868, 41424.12534, 41424.29203, 41424.45888, 41424.62535,
41424.79186, 41424.95852, 41425.12502, 41425.29185, 41425.75016,
41425.79201, 41425.83352, 41425.87534, 41425.91744, 41425.95868,
41426.00105, 41426.04257), NEED = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))
data[data$DT_LMT<=data$DT_LMT[1]+1,]
Output:
ID DT_LMT NEED
1 1 41423.63 1
2 2 41423.79 1
3 3 41423.96 1
4 4 41424.13 1
5 5 41424.29 1
6 6 41424.46 1
If you want to split the data into a list by 24 hour period.
split(data, floor(data$DT_LMT - data$DT_LMT[1]))
Output:
$`0`
ID DT_LMT NEED
1 1 41423.63 1
2 2 41423.79 1
3 3 41423.96 1
4 4 41424.13 1
5 5 41424.29 1
6 6 41424.46 1
$`1`
ID DT_LMT NEED
7 7 41424.63 2
8 8 41424.79 2
9 9 41424.96 2
10 10 41425.13 2
11 11 41425.29 2
$`2`
ID DT_LMT NEED
12 12 41425.75 3
13 13 41425.79 3
14 14 41425.83 3
15 15 41425.88 3
16 16 41425.92 3
17 17 41425.96 3
18 18 41426.00 3
19 19 41426.04 3
To add a column with the day.
# the subtraction is vectorised, so no lapply is needed (and this
# gives a plain numeric column rather than a list column)
data$day <- floor(data$DT_LMT - data$DT_LMT[1])

making a presence/absence timeline in r for multiple y objects

This is my first time using SO and I am an R newbie; sorry if this is a little basic or unclear (or if the question has already been answered... I'm struggling with coding and need pretty specific answers to understand)
I would like to produce an image similar to this one:
Except I would like it to be oriented horizontally on a timeline, and with two vertical lines drawn from the x-axis.
I can set the data up simply, and there are only two variables - date and Tag.
Tag Date
1 1 1/1/2014
2 3 1/1/2014
3 1 1/3/2014
4 2 1/3/2014
5 3 1/3/2014
6 5 1/3/2014
7 2 1/4/2015
8 3 1/4/2015
9 4 1/4/2015
10 6 1/4/2015
11 1 1/5/2014
12 2 1/5/2014
13 4 1/5/2014
14 6 1/5/2014
15 1 1/6/2014
16 2 1/6/2014
17 3 1/6/2014
18 4 1/6/2014
19 6 1/6/2014
20 2 1/7/2014
21 4 1/7/2014
22 1 1/8/2014
23 2 1/8/2014
24 6 1/8/2014
Here is a drawn image of what I would like to accomplish:
To recap - I want to take this data, which shows the dates of detection of animals at a certain location, and plot it on a timeline with two vertical lines on two dates. If an animal (say, tag 2) was detected on consecutive days, I would like to connect those dates with a line; if a detection happened without detections on consecutive days, a simple dot will suffice. I imagine the y-axis is stacked with each individual Tag, and the x-axis is a date scale: for each date, if a tag ID was detected, its corresponding x,y coordinate is marked; if a tag was not detected on that date, the corresponding x,y coordinate remains blank.
Here's a follow-up question:
I want to add a shaded background to some of the dates. I figured that I can do this using geom_rect, but I keep getting the following error:
Error: Invalid input: date_trans works with objects of class Date only
using the code you wrote, this is what I have added to receive the error:
geom_rect(aes(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf), fill="red", alpha=0.25)
this code will plot, but is not transparent, and so becomes fairly useless:
geom_rect(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf)
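Not confirmed by an answer here, but the error message suggests the x-axis is a Date scale, so the numeric xmin/xmax need converting to Date; and using annotate() instead of geom_rect(aes(...)) avoids drawing one overlapping rectangle per data row, which is likely why the aes() version looks opaque. A sketch, assuming dat from the answer below:

```r
library(ggplot2)

# 16075/16078 are days since 1970-01-01, so convert them to Date;
# annotate() draws a single rectangle rather than one per row
shade <- annotate("rect",
                  xmin = as.Date(16075, origin = "1970-01-01"),
                  xmax = as.Date(16078, origin = "1970-01-01"),
                  ymin = -Inf, ymax = Inf,
                  fill = "red", alpha = 0.25)

ggplot(dat, aes(Date, Tag)) + geom_point() + shade
```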
You first need to change your date format into Date. Then you need to figure out if dates are consecutive. And finally you need to plot them. Below is a possible solution using the packages dplyr and ggplot2.
# needed packages
require(ggplot2)
require(dplyr)
# input your data (changed to 2014 dates)
dat <- structure(list(Tag = c(1L, 3L, 1L, 2L, 3L, 5L, 2L, 3L, 4L, 6L, 1L, 2L, 4L, 6L, 1L, 2L, 3L, 4L, 6L, 2L, 4L, 1L, 2L, 6L), Date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L), .Label = c("1/1/2014", "1/3/2014", "1/4/2014", "1/5/2014", "1/6/2014", "1/7/2014", "1/8/2014"), class = "factor")), .Names = c("Tag", "Date"), class = "data.frame", row.names = c(NA, -24L))
# change date to Date format
dat[, "Date"] <- as.Date(dat[, "Date"], format='%m/%d/%Y')
# adding consecutive tag for first day of cons. measurements
dat <- dat %>% group_by(Tag) %>% mutate(consecutive=c(diff(Date), 2)==1)
# plotting command
ggplot(dat, aes(Date, Tag)) + geom_point() + theme_bw() +
geom_line(aes(alpha=consecutive, group=Tag)) +
scale_alpha_manual(values=c(0, 1), breaks=c(FALSE, TRUE), guide='none')
