I have a time series like this:
created_time,reaction_counts
2016-01-18T08:05:44+0000,65
2016-01-18T08:05:44+0000,65
2016-01-18T08:05:44+0000,65
2016-02-23T01:42:48+0000,468
2016-02-23T03:51:37+0000,125
2016-02-23T09:49:01+0000,433
2016-02-23T10:09:32+0000,72
2016-02-26T07:45:10+0000,137
2016-02-26T11:48:09+0000,120
2016-02-27T03:27:39+0000,70
2016-02-28T09:28:16+0000,145
2016-03-02T00:17:14+0000,122
2016-03-02T05:34:41+0000,108
2016-03-02T09:04:45+0000,296
And I want to aggregate it by month (and also by year) and plot a histogram.
How do I do it?
Thanks!
You can use the following code to aggregate the timestamped data to monthly or yearly totals:
library(lubridate)  # ymd_hms() for parsing the timestamps
library(dplyr)      # mutate(), group_by(), summarise()
try <- structure(list(created_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("2016-01-18T08:05:44+0000",
"2016-02-23T01:42:48+0000", "2016-02-23T03:51:37+0000", "2016-02-23T09:49:01+0000",
"2016-02-23T10:09:32+0000", "2016-02-26T07:45:10+0000", "2016-02-26T11:48:09+0000",
"2016-02-27T03:27:39+0000", "2016-02-28T09:28:16+0000", "2016-03-02T00:17:14+0000",
"2016-03-02T05:34:41+0000", "2016-03-02T09:04:45+0000"), class = "factor"),
reaction_counts = c(65L, 65L, 65L, 468L, 125L, 433L, 72L,
137L, 120L, 70L, 145L, 122L, 108L, 296L)), class = "data.frame", row.names = c(NA,
-14L))
df <- mutate_at(try, "created_time", ymd_hms)
Monthly conversion
monthly = df %>%
  mutate(month = format(created_time, "%m"), year = format(created_time, "%Y")) %>%
  group_by(month, year) %>%
  summarise(total = sum(reaction_counts))
For histogram plotting of monthly data
hist(monthly$total)
Yearly conversion
yearly = df %>%
  mutate(month = format(created_time, "%m"), year = format(created_time, "%Y")) %>%
  group_by(year) %>%
  summarise(total = sum(reaction_counts))
For histogram plotting of yearly data
hist(yearly$total)
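Note that hist() on the totals shows the distribution of the monthly (or yearly) sums, not one bar per month. If a chart of the total per month is what is wanted, a bar plot may be closer. A minimal base R sketch, assuming a small subset of the sample rows (the names toy, year_month and monthly_totals are introduced here for illustration only):

```r
# Toy subset of the question's data (values copied from four of the sample rows)
toy <- data.frame(
  created_time = as.POSIXct(
    c("2016-01-18 08:05:44", "2016-02-23 01:42:48",
      "2016-02-23 03:51:37", "2016-03-02 00:17:14"),
    tz = "UTC"
  ),
  reaction_counts = c(65L, 468L, 125L, 122L)
)

# Key that combines year and month, so the same month in different years stays separate
toy$year_month <- format(toy$created_time, "%Y-%m")

# Sum the counts within each year-month
monthly_totals <- aggregate(reaction_counts ~ year_month, data = toy, FUN = sum)

# One bar per month
barplot(monthly_totals$reaction_counts,
        names.arg = monthly_totals$year_month,
        xlab = "Month", ylab = "Total reactions")
```

Grouping by format(created_time, "%Y") instead gives the yearly version of the same chart.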
I use ggplot2 2.2.0 and R version 3.3.2 w64
According to http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/ I can specify a function to provide the facet labels.
I plot patient data of a study:
I have a dataframe with the Ids and the data, and I have a second dataframe containing some general information (age and gender)
patmeta <- data.frame(
"pat_id"=c(66, 103, 219, 64, 62, 111, 232),
"gender"=c("f","f","f", "m","f", "f", "f"),
"age"=c(56, 32, 73, 58,37,33,52))
I defined a global labeller function and a special one for my pat_id (pat_id_fac is the same as pat_id but as a factor, pat_id is numeric)
PatIdLabeller <- function(id) {
  res <- sprintf("Pat %s (%i y, %s)", id,
                 subset(patmeta, pat_id == id)$age,
                 subset(patmeta, pat_id == id)$gender)
  return(res)
}
globalLabeller <- labeller(
  pat_id_fac = PatIdLabeller,
  pat_id = PatIdLabeller,
  .default = label_both
)
Testing the PatIdLabeller function gives the desired output (though I think using subset is not the most elegant way to do it), e.g.
> PatIdLabeller('103')
[1] "Pat 103 (32 y, f)"
In ggplot, however, the IDs are correct but age and gender are the same for every facet (the values from the last row of patmeta), as you can see in the picture.
A subset of my qdat is the following
structure(list(pat_id = c(103L, 103L, 103L, 64L, 64L, 64L, 66L,
66L, 66L, 219L, 219L, 219L, 62L, 62L, 62L, 111L, 111L, 111L,
232L, 232L, 232L), pat_id_fac = structure(c(4L, 4L, 4L, 2L, 2L,
2L, 3L, 3L, 3L, 6L, 6L, 6L, 1L, 1L, 1L, 5L, 5L, 5L, 7L, 7L, 7L
), .Label = c("62", "64", "66", "103", "111", "219", "232"),
class = c("ordered", "factor")),
Activity = structure(c(9L, 3L, 9L, 2L, 9L, 9L, 9L,
2L, 2L, 3L, 8L, 4L, 2L, 2L, 2L, 4L, 4L, 7L, 2L, 2L, 9L), .Label = c("",
"Anderes", "Essen", "Hausarbeit", "Hobbies", "Körperpflege",
"Liegen", "Medienkonsum", "Sozialer Kontakt"), class = "factor")),
.Names = c("pat_id", "pat_id_fac", "Activity"), row.names = c(1L, 2L, 3L,
128L, 129L, 130L, 199L, 200L, 201L, 217L, 218L, 219L, 343L, 344L, 345L,
397L, 398L, 399L, 451L, 452L, 453L), class = "data.frame")
g.bar.activities <-
ggplot(data=qdat, aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id_fac, labeller= globalLabeller)
From other questions and answers, I know I could define a character vector, but I am lazy and would like a more elegant solution that reuses my patmeta, because the list of study participants will become quite long and evolve over time.
With smaller test data set
t <- data.frame("pat_id"=c(103, 103, 103, 219, 219, 219),
"Activity" = c("sleep", "sleep", "eat", "eat", "eat", "sleep"))
patmeta <- data.frame("pat_id"=c(103, 219),
"gender"=c("m","f"), "age"=c(32,52))
ggplot(data=t, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=globalLabeller)
I get exactly what I want. I don't see the difference.
It appears that the subsetting is not working properly, because the labeller receives all of the ids at once and == compares element-wise: it pairs the i-th pat_id in patmeta with the i-th id passed in, rather than looking up each id in patmeta. With the data shown, the two orderings differ, and only the last pair (232 and 232) happens to line up, so every facet gets the age and gender from the last row of patmeta.
You can see this in action if you try any of the following:
PatIdLabeller(c(103, 66))
gives character(0) and this warning:
In pat_id == id : longer object
length is not a multiple of shorter object length
because no rows are returned, and R has to recycle the shorter vector in the == comparison.
ggplot(data=head(qdat), aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id, labeller= globalLabeller)
gives a plot with duplicated age/gender again, and this warning
In pat_id == id : longer object length is not a
multiple of shorter object length
(ditto above).
Of note, even with your smaller data set, if you reverse the row order of your new patmeta (so that 219 is before 103), then run the code you get
Error in FUN(X[[i]], ...) : Unknown input
because the labeller is returning an empty character() (as above).
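The position-by-position behaviour of == can be reproduced without ggplot. The sketch below uses the patmeta from the question and assumes the ids arrive in the order of the pat_id_fac factor levels; by_position, rows and labels are names introduced here for illustration. It contrasts == with match(), which looks each id up instead of pairing by position:

```r
patmeta <- data.frame(
  pat_id = c(66, 103, 219, 64, 62, 111, 232),
  gender = c("f", "f", "f", "m", "f", "f", "f"),
  age    = c(56, 32, 73, 58, 37, 33, 52),
  stringsAsFactors = FALSE
)

# Facet values, assumed to arrive in the order of the pat_id_fac levels
ids <- c("62", "64", "66", "103", "111", "219", "232")

# == pairs elements by position: row 1 with ids[1], row 2 with ids[2], ...
by_position <- patmeta$pat_id == ids

# match() looks up each id and returns its row number in patmeta (NA if absent)
rows <- match(ids, as.character(patmeta$pat_id))

# Labels built from the looked-up rows
labels <- sprintf("Pat %s (%s y, %s)", ids, patmeta$age[rows], patmeta$gender[rows])
```

Here by_position is TRUE only for the last element, which is why subset() kept only the last row of patmeta.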
I don't have a lot of experience with labellers (this answer was a good chance to explore them), but this one should work by using left_join from dplyr, rather than trying to use ==.
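myLabeller <- function(x){
  # x is the data frame of facet values; build one label vector per column
  lapply(x, function(y){
    toLabel <-
      data.frame(pat_id = y) %>%
      left_join(patmeta, by = "pat_id")  # look up age and gender for each id
    paste0("Pat ", toLabel$pat_id
           , " (", toLabel$age, "y, "
           , toLabel$gender, ")")
  })
}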
Using it as the labeller:
ggplot(data=qdat, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=myLabeller) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
An alternative option would be to skip the labeller step, and just generate the label you actually want to use directly. Here, just merge the meta data with the patient data (using left_join from dplyr), then generate a column using the format/style that you want (here, using mutate from dplyr and paste0).
forPlotting <-
  qdat %>%
  left_join(patmeta) %>%
  mutate(forFacet = paste0("Pat ", pat_id
                           , " (", age, "y, "
                           , gender, ")"))
Then, use that data for plotting, and the new column for faceting.
ggplot(forPlotting, aes(x=Activity)) +
geom_bar() +
facet_wrap(~forFacet) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note that the facets are now sorted alphabetically, but you could adjust that as needed by setting the column as a factor with explicitly sorted levels when you make it.
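A minimal sketch of that factor step, assuming a small made-up forPlotting with the same column names as above (by_id is a name introduced here). The levels are taken from the labels in numeric pat_id order, so the facets follow the ids rather than the alphabet:

```r
# Made-up stand-in for the joined data, with the label column already built
forPlotting <- data.frame(
  pat_id   = c(103, 103, 64, 219),
  forFacet = c("Pat 103 (32y, f)", "Pat 103 (32y, f)",
               "Pat 64 (58y, m)", "Pat 219 (73y, f)"),
  stringsAsFactors = FALSE
)

# Row order that sorts by the numeric id
by_id <- order(forPlotting$pat_id)

# Use the labels in that order (duplicates dropped) as the factor levels
forPlotting$forFacet <- factor(forPlotting$forFacet,
                               levels = unique(forPlotting$forFacet[by_id]))

levels(forPlotting$forFacet)
```

facet_wrap(~forFacet) then lays out the panels in the order of these levels.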
This is my data frame:
head(sep)
I need to find the percentage of all SEP 11 in each row.
For instance, in the first row, the percentage of SEP 11 is
100 * ((63 + 124)/ (63 + 124 + 0 + 0))
I would like this stored in a newly created 8th column.
Thanks
dput
> dput(head(sep))
structure(list(Site = structure(1:6, .Label = c("31R001", "31R002",
"31R003", "31R004", "31R005", "31R006", "31R007", "31R008", "31R011",
"31R013", "31R014", "31R016", "31R018", "31R019", "31R020", "31R021",
"31R022", "31R023", "31R024", "31R025", "31R026", "31R027", "31R029",
"31R030", "31R031", "31R032", "31R034", "31R035", "31R036", "31R038",
"31R039", "31R040", "31R041", "31R042", "31R043", "31R044", "31R045",
"31R046", "31R048", "31R049", "31R050", "31R051", "31R052", "31R053",
"31R054", "31R055", "31R056", "31R057", "31R058", "31R059", "31R060",
"31R061", "31R069", "31R071", "31R072", "31R075", "31R435", "31R440",
"31R445", "31R450", "31R455", "31R460", "31R470", "31R600", "31R722",
"31R801", "31R825", "31R826", "31R829", "31R840", "31R843", "31R861",
"31R880"), class = "factor"), Latitude = c(33.808874, 33.877256,
33.820825, 33.852373, 33.829697, 33.810274), Longitude = c(-117.844048,
-117.700135, -117.811845, -117.795516, -117.787532, -117.830429
), Windows.SEP.11 = c(63L, 174L, 11L, 85L, 163L, 71L), Mac.SEP.11 = c(0L,
1L, 4L, 0L, 0L, 50L), Windows.SEP.12 = c(124L, 185L, 9L, 75L,
23L, 5L), Mac.SEP.12 = c(0L, 1L, 32L, 1L, 0L, 50L)), .Names = c("Site",
"Latitude", "Longitude", "Windows.SEP.11", "Mac.SEP.11", "Windows.SEP.12",
"Mac.SEP.12"), row.names = c(NA, 6L), class = "data.frame")
Assuming that you want the row sums of the columns that have "Windows" in their names (which is what your example formula computes), we subset the dataset (called "sep1" here) using grep. Then take rowSums(Sub1), divide by the rowSums of all four count columns (sep1[4:7]), multiply by 100, and assign the result to a new column ("newCol").
Sub1 <- sep1[grep("Windows", names(sep1))]
sep1$newCol <- 100*rowSums(Sub1)/rowSums(sep1[4:7])
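As a check against the example in the question, here is a sketch on the first three rows of the dput, with only the four count columns kept. The names count_cols, windows_cols, sep11_cols, pctWindows and pctSep11 are introduced here. It also shows the other possible reading of the question, the share of the two SEP.11 columns, in case that is what was meant:

```r
# First three rows of the question's data, count columns only
sep1 <- data.frame(
  Windows.SEP.11 = c(63L, 174L, 11L),
  Mac.SEP.11     = c(0L, 1L, 4L),
  Windows.SEP.12 = c(124L, 185L, 9L),
  Mac.SEP.12     = c(0L, 1L, 32L)
)

# Select columns by name, so the code does not depend on column positions
count_cols   <- grep("SEP", names(sep1), value = TRUE)
windows_cols <- grep("Windows", count_cols, value = TRUE)
sep11_cols   <- grep("SEP\\.11$", count_cols, value = TRUE)

# Percentage of Windows counts per row (the question's formula: 100 in row 1)
sep1$pctWindows <- 100 * rowSums(sep1[windows_cols]) / rowSums(sep1[count_cols])

# Alternative reading: percentage of SEP.11 counts per row
sep1$pctSep11 <- 100 * rowSums(sep1[sep11_cols]) / rowSums(sep1[count_cols])
```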
I am using dplyr to build a SUMIF-style summary of my data frame. However, it does not give me the desired output:
> dput(sys)
structure(list(NUMERIC = c(244L, 24L, 1L, 2L, 4L, 111L, 23L,
2L, 3L, 4L, 24L), VAL = c("FALSE", "FALSE", "TES", "TEST", "TRUE",
"TRUE", "TRUE", "asdfs", "asdfs", "safd", "sd"), IDENTIFIER = c(99L,
99L, 98L, 98L, 99L, 99L, 99L, 13L, 13L, 99L, 12L)), .Names = c("NUMERIC",
"VAL", "IDENTIFIER"), row.names = c(NA, 11L), class = c("grouped_dt",
"tbl_dt", "tbl", "grouped_dt", "tbl_dt", "tbl", "data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000100788>, sorted = c("VAL",
"IDENTIFIER"), vars = list(VAL, IDENTIFIER))
>
>
> sys <- group_by(sys, VAL, IDENTIFIER)
> df.summary <- summarise(sys,
+ numeric = sum(NUMERIC)
+ )
>
> (df.summary)
numeric
1 442
My desired result should look like that:
Any recommendation as to what I am doing wrong?
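This can occur when plyr is loaded after dplyr: plyr's summarise then masks dplyr's and ignores the grouping, so you get a single total. You can either run this in a new R session (without plyr loaded) or call the dplyr version explicitly: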
dplyr::summarise(sys,
numeric = sum(NUMERIC)
)
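The masking mechanism itself can be illustrated in base R, without either package. In this sketch a user-defined rev() is a hypothetical stand-in for the function from the later-loaded package, and base::rev plays the role of dplyr::summarise:

```r
# A later definition with the same name masks the original for unqualified calls
rev <- function(x) x          # hypothetical stand-in for the masking function

masked   <- rev(1:3)          # resolves to the masking definition
explicit <- base::rev(1:3)    # the pkg:: prefix always reaches the intended function

rm(rev)                       # remove the mask again
```

With plyr and dplyr the idea is the same: an unqualified summarise() resolves to whichever package was attached last, while dplyr::summarise() is unambiguous.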
So I have an example dataframe that holds the columns id, count and username, with id and count being numbers and username being a string.
For every row of the dataframe I want to set a value of a new column called 'ratio', with ratio being defined as
count / number of rows where username == the username in this row
Example from the provided data:
In every row where the username is 'Tom' the ratio would be count/4 , because the user Tom is found four times in the data.
This is just a simplified version of my problem, a for-loop is not an option because my original dataframe has about 3.4 million rows and my previous approach where I used for-loops to iterate the unique values of e.g. 'username' to solve this problem takes forever.
dput of my dataframe:
structure(list(id = 1:20, count = c(140L, 89L, 17L, 114L, 129L,
86L, 21L, 50L, 197L, 160L, 8L, 14L, 78L, 208L, 155L, 55L, 63L,
20L, 189L, 79L), usernames = structure(c(4L, 3L, 5L, 5L, 2L,
3L, 1L, 1L, 3L, 1L, 3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L), .Label = c("Jerry",
"Mark", "Phil", "Tina", "Tom"), class = "factor")), .Names = c("id",
"count", "usernames"), row.names = c(NA, 20L), class = "data.frame")
I hope I provided everything for you to understand and reproduce the problem, if something's missing don't hesitate to mention it in the comments.
There are several options. Here are three: one in base R, one with data.table, and one with "plyr". All of them assume we're starting with a data.frame named "mydf":
Base R
within(mydf, {
  temp <- as.numeric(ave(as.character(usernames), usernames, FUN = length))
  ratio <- count/temp
  rm(temp)
})
data.table
library(data.table)
DT <- data.table(mydf)
DT[, ratio := count/.N, by = "usernames"]
DT
plyr
library(plyr)
ddply(mydf, .(usernames), transform,
ratio = count/length(usernames))
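You can also use ave for this. Apply it to the numeric count column rather than to usernames: usernames is a factor here, and ave cannot store the group lengths in a factor.
transform(d, ratio = count / ave(count, usernames, FUN = length))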
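Another vectorised base R variant, offered as a sketch rather than a benchmarked solution, counts the rows per user once with table() and then looks the counts up by name. It uses the data from the question (n_per_user is a name introduced here):

```r
# Data from the question's dput, rebuilt with plain constructors
user_levels <- c("Jerry", "Mark", "Phil", "Tina", "Tom")
mydf <- data.frame(
  id = 1:20,
  count = c(140L, 89L, 17L, 114L, 129L, 86L, 21L, 50L, 197L, 160L,
            8L, 14L, 78L, 208L, 155L, 55L, 63L, 20L, 189L, 79L),
  usernames = factor(user_levels[c(4, 3, 5, 5, 2, 3, 1, 1, 3, 1,
                                   3, 2, 5, 5, 4, 4, 2, 2, 2, 3)],
                     levels = user_levels)
)

# Number of rows per username, computed once
n_per_user <- table(mydf$usernames)

# Look up each row's group size by name and divide
mydf$ratio <- mydf$count / as.vector(n_per_user[as.character(mydf$usernames)])

head(mydf)
```

Tom occurs four times in this data, so each of his rows gets count/4, as described in the question.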