How to group daily data into months in a dataframe using dplyr - r

I have a dataframe containing daily counts of the number of group members seen. I want to get a monthly mean of the number of group members seen (produced in a data frame). I've been trying to use dplyr as it is much simpler than creating a new data frame and filling it using a for loop. I'm very new to coding and would like to be able to do this for multiple groups. My dataframe looks like this:
'data.frame': 148 obs. of 7 variables:
$ Date : Date, format: "2013-05-01" "2013-05-02" ...
$ Group : chr "WK" "WK" "WK" "WK" ...
$ Session : Factor w/ 12 levels "AM","AM1","AM2",..: 9 1 9 9 1 9 9 1 1 1 ...
$ Group.Members.Seen : num 7 6 8 9 9 6 8 9 4 9 ...
$ Roving.Males : num NA NA NA NA NA NA NA NA NA NA ...
$ Undyed.Group.Members.Seen: num NA NA NA NA NA NA NA NA NA NA ...
$ Non.group.Other : num NA NA NA NA NA NA NA NA NA NA ...
I don't have an observation for every day, and sometimes have multiple observations for a day. In this particular instance, there is only data in the Group.Members.Seen column; however, in other datasets I do have numbers in the Roving.Males, Undyed.Group.Members.Seen, and Non.group.Other columns.
For this particular dataset, I only want to work with the Date and Group.Members.Seen columns, as I only have data in those columns. I've used select to select those columns, then tried to use mutate, group_by, and summarise to get what I want. However, I think the problem is with the dates. I have also tried aggregate, but I don't think that is the best approach.
test <- WK.2013 %>%
select(Date, Group.Members.Seen) %>%
mutate(mo = Date(format="%m"), mean.num.members = mean(Group.Members.Seen)) %>%
group_by(Date(format="%m")) %>%
summarise(mean = mean(Group.Members.Seen))
Error message is saying it cannot find the function "Date", which is probably the beginning of a long string of problems with that code.

You can try the lubridate package and round dates to the month, year, or other units.
library(lubridate)
mydate <- today()
> floor_date(today(),unit = "month")
[1] "2019-07-01"
> floor_date(mydate,unit = "month")
[1] "2019-07-01"
> round_date(mydate,unit = "month")
[1] "2019-08-01"
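To connect floor_date() back to the question, here is a minimal sketch of the monthly mean with dplyr. The small WK.2013 data frame is a made-up stand-in that only reuses the column names from the question, and the output column names (month, mean.members) are my own choice. In base R, format(Date, "%Y-%m") would give a similar month key.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for WK.2013 (column names taken from the question)
WK.2013 <- data.frame(
  Date = as.Date(c("2013-05-01", "2013-05-02", "2013-05-02",
                   "2013-06-10", "2013-06-15")),
  Group = "WK",
  Group.Members.Seen = c(7, 6, 8, 9, 9)
)

monthly <- WK.2013 %>%
  # Map every date to the first day of its month
  mutate(month = floor_date(Date, unit = "month")) %>%
  # Grouping by Group as well lets the same code handle several groups
  group_by(Group, month) %>%
  summarise(mean.members = mean(Group.Members.Seen, na.rm = TRUE),
            .groups = "drop")

print(monthly)
```

Days with several observations simply contribute several rows to that month's mean, and missing days are ignored.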

It's hard to say for sure if this will work without seeing the actual data but could you try the apply.monthly function from the xts package?
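A minimal sketch of the apply.monthly route, assuming the counts are first turned into an xts object. The dates and counts below are made up for illustration.

```r
library(xts)

# Hypothetical daily counts indexed by date
dates <- as.Date(c("2013-05-01", "2013-05-02", "2013-05-03",
                   "2013-06-10", "2013-06-15"))
seen <- xts(c(7, 6, 8, 9, 9), order.by = dates)

# One mean per calendar month; each result is stamped with the
# last observed date of that month
monthly <- apply.monthly(seen, mean)
print(monthly)
```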

Related

How do I impute values by factor levels using 'missForest'?

I am trying to impute missing values in my dataframe with the non-parametric method available in missForest.
My data (OneDrive link) consists of one categorical variable and five continuous variables.
head(data)
phylo sv1 sv2 sv3 sv4 sv5
1 Phaon_camerunensis 6.03803 NA 5121.257 NA 70
2 Umma_longistigma 6.03803 NA 5121.257 NA 53
3 Umma_longistigma 6.03803 NA 5121.257 NA 64
4 Umma_longistigma 6.03803 NA 5121.257 NA 63
5 Sapho_ciliata 6.03803 NA 5121.257 NA 63
6 Sapho_gloriosa 6.03803 NA 5121.257 NA 63
I was successful at first using missForest()
imp<- missForest(data[2:6])
However, instead of aggregating over the whole data matrix (or vector? idk exactly) I would like to impute missing values by phylo.
I tried data[2:6] %>% group_by(phylo) %>% and sapply(split(data[2:6], data$phylo)) %>% but no success.
Any guess on how to deal with it?
If you want to run missForest for each group, you can use group_map:
imp <- df %>% group_by(phylo) %>% group_map(~ missForest(.))
To get only the first item of each result (the imputed data, ximp):
imp2 <- t(sapply(imp, "[[", 1))

Displaying answers on ranking question in R

I have the following variables which are the result of one ranking question. On that question, participants got the 7 listed motivations presented and should rank them. Here, value 1 means the participant put the motivation on position 1, and value 7 means he put it on last position. The ranking is expressed through the numbers on these variables (numbers 1 to 7):
'data.frame': 25 obs. of 8 variables:
$ id : num 8 9 10 11 12 13 14 15 16 17 ...
$ motivation_quantity : num NA 3 1 NA 3 NA NA NA 1 NA ...
$ motivation_quality : num NA 1 6 NA 3 NA NA NA 3 NA ...
$ motivation_timesaving : num NA 6 4 NA 2 NA NA NA 5 NA ...
$ motivation_contribution : num NA 4 2 NA 1 NA NA NA 2 NA ...
$ motivation_alternativelms: num NA 5 3 NA 6 NA NA NA 7 NA ...
$ motivation_inspiration : num NA 2 7 NA 4 NA NA NA 4 NA ...
$ motivation_budget : num NA 7 5 NA 7 NA NA NA 6 NA ...
What I want to do now is calculate and visualize the results of the ranking question (i.e. visualize how the motivations were ranked). Since I haven't worked with R for a long time, I am not sure how best to do this.
One way I could imagine is to first calculate the top 3 answers (the motivations most frequently ranked in positions "1", "2", and "3" across participants).
I would really appreciate it if someone could help with this or even show a better way to analyse and visualize my data.
I originally had a visualization in Microsoft Forms, but it was destroyed by a bug overnight. It looked like this:
R stores these variables as numeric (in statistical terms, it treats them as continuous). The goal is to convert them into categorical variables (called factors in R).
Let's get to work:
library(dplyr)
library(tidyr)
library(ggplot2)
# we shall name your dataframe df. The id column must stay as it is, so we convert only the motivation_* columns into factors (categorical variables). Note that is.numeric() is also TRUE for integer columns, so converting id to integer would not protect it from mutate_if(is.numeric, ...).
df <- df %>% mutate_at(vars(starts_with("motivation_")), factor,
levels = 1:7)
# I guess in your case that would be all, but if you wanted the content of the dataframe to be position_1, position_2 ... position_7, use this call with labels instead of the one above:
df <- df %>% mutate_at(vars(starts_with("motivation_")), factor,
levels = 1:7,
labels = paste("position", 1:7, sep = "_"))
# For the visualisation now, we need to use the function gather in order to convert the df dataframe into a two column dataframe (and keeping the id column), we shall name this new dataframe df1
df1 <- df %>% gather(key=Questions, value=Answers, motivation_quantity:motivation_budget,-id )
# the df1 dataframe now includes three columns : the id column - the Questions columns - the Answers column.
# we can now apply the ggplot function on the new dataframe for the visualisation
# first the colours
colours <- c("firebrick4","firebrick3", "firebrick1", "gray70", "blue", "blue3" ,"darkblue")
# ATTENTION since there are NAs in your dataframe, either you can recode them as zeros or delete them (for the visualisation) using the subset function within the ggplot function as follows :
ggplot(subset(df1,!is.na(Answers)))+
aes(x=Questions,fill=Answers)+
geom_bar()+
coord_flip()+
scale_fill_manual(values = colours) +
ylab("position_levels")
# of course you can enter many modifications into the visualisation but in total I think that's what you need.
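The top-3 count that the question asks about can be sketched in base R on the numeric ranks, before any conversion to factors. The data frame below copies the first ten rows from the str() output in the question; the names rank_df, top3, and mean_rank are hypothetical.

```r
# First ten rows as shown in the str() output of the question
rank_df <- data.frame(
  id = 8:17,
  motivation_quantity       = c(NA, 3, 1, NA, 3, NA, NA, NA, 1, NA),
  motivation_quality        = c(NA, 1, 6, NA, 3, NA, NA, NA, 3, NA),
  motivation_timesaving     = c(NA, 6, 4, NA, 2, NA, NA, NA, 5, NA),
  motivation_contribution   = c(NA, 4, 2, NA, 1, NA, NA, NA, 2, NA),
  motivation_alternativelms = c(NA, 5, 3, NA, 6, NA, NA, NA, 7, NA),
  motivation_inspiration    = c(NA, 2, 7, NA, 4, NA, NA, NA, 4, NA),
  motivation_budget         = c(NA, 7, 5, NA, 7, NA, NA, NA, 6, NA)
)

# Keep only the ranking columns
ranks <- rank_df[, grepl("^motivation_", names(rank_df))]

# How often each motivation was ranked in position 1, 2, or 3
top3 <- colSums(ranks <= 3, na.rm = TRUE)

# Mean rank per motivation (lower means more important)
mean_rank <- colMeans(ranks, na.rm = TRUE)

print(sort(top3, decreasing = TRUE))
print(sort(mean_rank))
```

Participants who skipped the question (all NA) are dropped by na.rm = TRUE, so they do not affect either summary.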

Error "number of items to replace is not a multiple of replacement length" in single xts element

I know this error is well-known and I've read many questions, but I still can't figure out why I have this problem in my case.
I have a 74-column xts object with closing prices of certain stocks (data is from a csv file). Here's what the data looks like:
head(clPrices_xts[1:5,1:10])
ACINO.HLDG.N ACTELION.N ADDEX.N ALPIQ.HOLDING.N AMS ARBONIA.I ARBONIA.N ARPIDA.N ASCOM.N.10 ATTISHOLZ.N
1996-08-02 NA NA NA NA NA 700 NA NA NA 516
1996-08-05 NA NA NA NA NA 700 NA NA NA 530
1996-08-06 NA NA NA NA NA 700 NA NA NA 530
1996-08-07 NA NA NA NA NA NA NA NA NA 532
1996-08-08 NA NA NA NA NA 680 NA NA NA 540
str(clPrices_xts)
An ‘xts’ object on 1996-08-02/2017-09-13 containing:
Data: num [1:5900, 1:74] NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:74] "ACINO.HLDG.N" "ACTELION.N" "ADDEX.N" "ALPIQ.HOLDING.N" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
I need to manipulate this time series (ts). For example, I would like to modify the value for ACTELION.N on the first day of the ts:
clPrices_xts["1996-08-02"]$ACTELION.N
ACTELION.N
1996-08-02 NA
clPrices_xts["1996-08-02"]$ACTELION.N <- 0
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
Does anyone have an idea why I'm getting this error? To me, it seems that there's only one single element that I want to modify...
N.B. :
Not sure if it's useful, but I transform the data from long to wide data using reshape2::dcast function, and then to xts:
as.xts(clPrices_wide[,-1] %>%
apply(2, function(x) ifelse(is.nan(x), NA, x))
, order.by = clPrices_wide$TRADE_DATE)
You get the message because you are selecting a whole row, which spans multiple positions (length > 1), but writing only one value. To select a single element, you should specify both the row and the column index, for example: clPrices_xts["1996-08-02", "ACTELION.N"] <- 0
I hope that is the problem. Cheers!
EDIT:
I found another solution, by adding a comma within the brackets:
clPrices_xts["1996-08-02",]$ACTELION.N <- 0
works as well.
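A small reproducible sketch of the row-and-column assignment. The two-column prices object is a made-up stand-in for clPrices_xts that reuses two of its column names.

```r
library(xts)

# Hypothetical two-column stand-in for clPrices_xts
prices <- xts(
  matrix(c(NA, NA, 700, 700), nrow = 2,
         dimnames = list(NULL, c("ACTELION.N", "ARBONIA.I"))),
  order.by = as.Date(c("1996-08-02", "1996-08-05"))
)

# Address exactly one cell: row by date, column by name
prices["1996-08-02", "ACTELION.N"] <- 0
print(prices)
```

Only the addressed cell changes; the other date and the other column keep their values.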
Try this (note that it sets every column of that row to 0, not only ACTELION.N):
clPrices_xts["1996-08-02",] <- c(0)

Fill in time series gaps with both LOCF and NOCB methods but acknowledge breaks in time series

There are edits to this post at the end.
I have a large dataset of daily dietary records for a population of individuals. There are data missing at random from each of the individuals. This is an example for one individual (I will eventually generalize this solution to the population):
> str(final_daily)
'data.frame': 387 obs. of 10 variables:
$ Date : chr "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
$ MEID.1 : Factor w/ 97 levels "","1","1.1","1.1a",..: NA NA NA 17 24 NA NA NA NA NA ...
$ MEID.2 : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
$ MEID.3 : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
$ MEID.4 : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
$ MEID.5 : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
$ MEID.6 : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
$ DAYT : int NA NA NA 1 8 NA NA NA NA NA ...
$ DATT : int NA NA NA 1 1 NA NA NA NA NA ...
$ Reason.For.Change: chr "0" "0" "0" "0" ...
I am aware of the implementations that can be used to fill in missing data such as last observation carried forward (LOCF) and next observation carried backwards (NOCB). Importantly, the missing data gaps can exist for as few as a single date to up to months of days at a time.
I would like to create an imputation method that uses LOCF for the first half of the missing time period and NOCB for the second half of the missing time period. This is more important for large time series gaps (I don't want to use dietary intake on February 28 to be representative for August 1 when August 2 is available). Can anyone suggest a possible solution here?
Importantly, I also have a column (Reason.For.Change) which should constrain the imputation methods as in Filling in missing (blanks) in a data table, per category - backwards and forwards. For example, when Reason.For.Change has a value >0, the imputation should recognize this. In other words, Reason.For.Change values >0 denote "different" time series within an individual that starts on the day where Reason.For.Change is >0, and these time series must be imputed separately.
Essentially, this column creates two conditions. First, when a record is not available on the date prior to a date where Reason.For.Change is >0, only LOCF can be used. Second, since a record of diet intake is not available on the same date that Reason.For.Change is >0, only NOCB can be used. (This second example is analogous to the example in Filling in missing (blanks) in a data table, per category - backwards and forwards where patients are missing 'doctor' on their first visit.)
Any advice/direction is appreciated to accomplish the following, which I summarize below:
1) Imputation method for time series gaps that uses LOCF and NOCB for the first and last 50% of the gap, respectively
2) Imputation method in 1) that acknowledges breaks in the time series denoted by values >0 on a date and allows for LOCF up to the 'break-date' and NOCB filling back to and including the break-date
[Edit] After thinking some more, the implementations in R -- Carry last observation forward n times and Fill NA in a time series only to a limited number seem to offer a step in the direction of addressing 1) here in my question. However, I would like to generalize their use of LOCF n-times to LOCF for length(missing data)/2 ...
[Edit 2] After thinking even more, I have added a new column in my dataframe, GAP_DAYS, which counts the number of days in the missing time period (gap). Here is str() on the data after the new column was added.
> str(final_daily_intake2)
'data.frame': 387 obs. of 11 variables:
$ Date : chr "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
$ MEID.1 : chr NA NA NA "14" ...
$ MEID.2 : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
$ MEID.3 : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
$ MEID.4 : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
$ MEID.5 : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
$ MEID.6 : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
$ DAYT : int NA NA NA 1 8 NA NA NA NA NA ...
$ DATT : int NA NA NA 1 1 NA NA NA NA NA ...
$ Reason.For.Change: chr "0" "0" "0" "0" ...
$ GAP_Days : chr "1" "2" "3" "NA" ...
I was thinking that this could be used to determine the n number of days to use LOCF on, for each gap period. For example, in the first missing data time period, there are 3 days missing (hence 1, 2, 3, in the str() for GAP_Days). In this example, since it is an odd number of days, I would like LOCF to use the result of round(3 * 0.5) to obtain a value of 2, which would be used as input to LOCF. In a longer time period, for example, where the length of GAP_Days is 30, LOCF would use the result of round(30 * 0.5) such that LOCF would be used for 15 days.
I think this approach can be used to go over the dataframe once with LOCF, and then a second time with NOCB. (Although I still haven't addressed the need to acknowledge breaks in the time series denoted by Reason.For.Change).
Much thanks.
Since the text is very long, I'll point out the questions again:
1) Imputation method for time series gaps that includes LOCF and NOCB for the first and last 50% of the gap
2) Imputation method in 1) that acknowledges breaks in the time series denoted by values >0 on a date and allows for LOCF up to the 'break-date' and NOCB filling back to and including the break-date
As far as I know, there is no package available for R that directly enables you to do either of these tasks.
To 1):
There are quite a bunch of packages which contain a locf option:
imputeTS::na_locf()
zoo::na.locf()
xts::na.locf()
spacetime::na.locf()
Indeed, your idea for imputation makes good sense.
But none of the packages has an option for your requested behavior. What you can do with e.g. zoo is set the maxgap parameter. Runs of more than maxgap NAs are then retained, which means you can/must treat them separately afterwards.
You would have to program your requested behavior on your own.
Another idea could be to use other, more advanced functions of these packages that make use of both sides of the NA gaps.
An example would be imputeTS::na_ma(), which imputes the values with a moving average (you can set the window size).
There are also even more advanced functions like
imputeTS::na_kalman()
imputeTS::na_interpolation()
forecast::na.interp()
zoo::na.StructTS()
Some of these also take into account seasonal behavior (weekday patterns), trend, and other things. The problem with these is, of course, that they are not as easy to reason about as simple algorithms like LOCF or a moving average.
To 2):
There is also no premade function for this. This would also have to be coded individually.
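As a starting point for coding it yourself, here is a rough base R sketch covering both 1) and 2). The function names fill_half and fill_half_by_segment are hypothetical, and the split rule is an assumption: for an odd gap the extra day goes to LOCF (ceiling(len / 2)), which matches the 3-day and 30-day examples in the question but not necessarily what you want for a 1-day gap. Gaps at the start of a series are filled entirely by NOCB, and gaps at the end entirely by LOCF.

```r
# Fill each run of NAs: first half with the last observation (LOCF),
# second half with the next observation (NOCB)
fill_half <- function(x) {
  n <- length(x)
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (k in which(runs$values)) {
    s <- starts[k]
    e <- ends[k]
    len <- e - s + 1
    has_prev <- s > 1   # an observed value exists before the gap
    has_next <- e < n   # an observed value exists after the gap
    if (has_prev && has_next) {
      n_fwd <- ceiling(len / 2)  # assumption: odd gaps favour LOCF
    } else if (has_prev) {
      n_fwd <- len               # trailing gap: LOCF only
    } else if (has_next) {
      n_fwd <- 0                 # leading gap: NOCB only
    } else {
      next                       # nothing observed at all, leave the NAs
    }
    if (n_fwd > 0) x[s:(s + n_fwd - 1)] <- x[s - 1]
    if (n_fwd < len) x[(s + n_fwd):e] <- x[e + 1]
  }
  x
}

# Treat every date with reason > 0 as the first day of a new series
# and impute each series separately
fill_half_by_segment <- function(x, reason) {
  segment <- cumsum(!is.na(reason) & as.numeric(reason) > 0)
  ave(x, segment, FUN = fill_half)
}

# Made-up example: a break on day 7 sits inside the second gap
x      <- c(5, NA, NA, NA, 8, NA, NA, NA, NA, 2)
reason <- c("0", "0", "0", "0", "0", "0", "1", "0", "0", "0")

print(fill_half(x))
print(fill_half_by_segment(x, reason))
```

In the example, day 7 is filled with 8 when breaks are ignored but with 2 when the break is respected, because the break-date itself is filled backwards from the next observation. The same functions can be applied per column and per individual.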

R subsetting a data frame with conditionals on date values where some fields have no date

I have a data frame:
'data.frame': 2611029 obs. of 10 variables:
$ eid : int 28 28 28 28 28 36 36 36 36 37 ...
$ created : Factor w/ 36204 levels "0000-00-00 00:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
$ class_id : int NA NA NA NA NA NA NA NA NA NA ...
$ min.e.event_time.: Factor w/ 16175 levels "2013-04-15 11:17:19",..: NA NA NA NA NA NA NA NA NA NA ...
$ lead_date : Factor w/ 11199 levels "2012-10-11 18:39:12",..: NA NA NA NA NA NA NA NA NA NA ...
$ camp : int 44698 44698 44699 44701 44701 44715 44715 44909 44909 44699 ...
$ event_date : Factor w/ 695747 levels "2008-01-18 12:18:01",..: 1 5 2 32 36 6 17039 23 24 2 ...
$ event : Factor w/ 3 levels "click","open",..: 3 2 3 3 2 3 2 3 2 3 ...
$ message_name : Factor w/ 2707 levels ""," 2015-03 CAD Promotion Update",..: 2163 2163 2163 1106 1106 2163 2163 1990 1990 2163 ...
$ subject_lin : Factor w/ 2043 levels ""," Christie Office Holiday Hours",..: 613 613 613 248 248 613 613 612 612 613 ...
Each line item is an instance of a user (eid) having received an email (event_date).
event_date, lead_date and created are all dates. Until now I have converted these dates using as.Date() after subsetting the data to only those records with complete.cases() for these dates. This allowed me to do aggregation and subsetting based on conditionals, e.g. where event_date < lead_date.
If I try to convert dates in data as is, without removing na values, I receive the message
Error in charToDate(x) :
character string is not in a standard unambiguous format
The purpose of the analysis is to look at the impact of receiving an email on becoming a lead (thus lead_date would be populated, NA otherwise). I therefore don't want to exclude people who never became a lead by subsetting the entire df on complete lead dates.
But I still want to perform calculations on those records with dates, leaving the NAs as their own group.
Is there anything I can do here? I want R to ignore NA results when using functions like subset or aggregation. I also want to convert all the non NA dates into dates using as.Date()
** following posting**
I probably could have asked this in a much simpler way: can I convert a field in a data frame to a date where it's feasible and ignore na values otherwise?
Replace all your as.Date( ) calls with as.Date( , format="%Y-%m-%d")
> as.Date(factor("0000-00-00 00:00:00"))
Error in charToDate(x) :
character string is not in a standard unambiguous format
> as.Date(factor("0000-00-00 00:00:00"), format="%Y-%m-%d")
[1] NA
Then describe the problems (code and errors) you encounter with the updated dataset. It's not possible to predict where you are getting stuck on the next steps from the description. There is an is.na function that can be used in combination with other logical tests.
Do remember that is.na(NA) | NA will return TRUE. That works with | (OR) but not with & (AND): is.na(NA) & NA returns NA.
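To make this concrete, here is a small base R sketch. The emails data frame is a made-up miniature that reuses the column names from the question, and before_lead and never_lead are hypothetical names.

```r
# Hypothetical miniature of the data frame in the question
emails <- data.frame(
  eid = c(28, 36, 37),
  event_date = factor(c("2008-01-18 12:18:01", "2013-04-15 11:17:19",
                        "2014-01-01 00:00:00")),
  lead_date = factor(c("2012-10-11 18:39:12", NA, "0000-00-00 00:00:00"))
)

# Convert in place; NA and unparseable values become NA dates
date_cols <- c("event_date", "lead_date")
emails[date_cols] <- lapply(emails[date_cols], as.Date, format = "%Y-%m-%d")

# Rows where the comparison is TRUE; which() drops the NA comparisons
before_lead <- emails[which(emails$event_date < emails$lead_date), ]

# Rows without a usable lead date form their own group
never_lead <- emails[is.na(emails$lead_date), ]

print(before_lead)
print(never_lead)
```

Wrapping the condition in which() keeps rows with NA comparisons out of the result without removing them from the data frame itself.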
