Add missing values in time series efficiently in R

I have 500 datasets (panel data). In each one I have a time series (week) across different shops (store). Within each shop, I need to add the missing time series observations.
A sample of my data would be:
store week value
1 1 50
1 3 52
1 4 10
2 1 4
2 4 84
2 5 2
which I would like to look like:
store week value
1 1 50
1 2 0
1 3 52
1 4 10
2 1 4
2 2 0
2 3 0
2 4 84
2 5 2
I currently use the following code (which works, but takes very very long on my data):
stores <- unique(mydata$store)
for (i in 1:length(stores)) {
  mydata <- merge(
    expand.grid(week = min(mydata$week):max(mydata$week)),
    mydata, all = TRUE)
  mydata[is.na(mydata)] <- 0
}
Are there better and more efficient ways to do so?

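For reference, the sample above as a reproducible data frame (assumed to be named df, as used in the answer below):
df <- data.frame(store = c(1, 1, 1, 2, 2, 2),
                 week  = c(1, 3, 4, 1, 4, 5),
                 value = c(50, 52, 10, 4, 84, 2))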
Here's a dplyr/tidyr option you could try:
library(dplyr); library(tidyr)
group_by(df, store) %>%
  complete(week = full_seq(week, 1L), fill = list(value = 0))
#Source: local data frame [9 x 3]
#
# store week value
# (int) (int) (dbl)
#1 1 1 50
#2 1 2 0
#3 1 3 52
#4 1 4 10
#5 2 1 4
#6 2 2 0
#7 2 3 0
#8 2 4 84
#9 2 5 2
By default, if you don't specify the fill parameter, new rows will be filled with NA. Since you seem to have many other columns, I would advise leaving out the fill parameter so you end up with NAs and, if required, adding another step with mutate_each to turn the NAs into 0 (where that's appropriate).
group_by(df, store) %>%
  complete(week = full_seq(week, 1L)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)), -store, -week)
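Note that mutate_each() has since been superseded in dplyr; a roughly equivalent sketch of the same zero-filling step using across() and replace_na() (assuming dplyr >= 1.0 and tidyr are available) would be:
library(dplyr); library(tidyr)

df %>%
  group_by(store) %>%
  complete(week = full_seq(week, 1L)) %>%
  ungroup() %>%
  mutate(across(-c(store, week), ~ replace_na(.x, 0)))  # zero-fill every non-key column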

Related

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format per group, and only for some of these variables (in this example specifically for player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id = c(1, 1),
           player_id = c(1, 2),
           player_decision = c(10, 20),
           player_1_contribution = c(6, 6),
           player_2_contribution = c(5, 5),
           player_3_contribution = c(4, 4))
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with columns for the players' contributions, then join this data frame back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)

wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide)
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1

Sort across rows to obtain three largest values

There is an injury score called the ISS score.
I have a table of injury data in rows according to patient ID (pt_id).
I would like to obtain the top three values across the 6 injury columns.
Column values range from 0 to 5.
pt_id head face abdo pelvis Extremity External
1 4 0 0 1 0 3
2 3 3 5 0 3 2
3 0 0 2 1 1 1
4 2 0 0 0 0 1
5 5 0 0 2 0 1
My output for the above example would be
pt_id n1 n2 n3
1 4 3 1
2 5 3 3
3 2 1 1
4 2 1 0
5 5 2 1
Values can be in a list or in new columns, as calculating the score is simple from that point on.
I had thought that I would be able to create a list for the 6 injury columns and then apply a sort to each list, taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[,2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame, as unfortunately some of the data in my data set includes NA values.
We could use apply row-wise to sort the data frame and take only the first three values in each row.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
# pt_id 1 2 3
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
As some values may be NA, it is better to apply sort using an anonymous function and then take the top 3 values using head.
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
A tidyverse option is to first gather the data, arrange it in descending order and, for every pt_id, select only the first three values. We then replace the injury column with the ranks 1 to 3 (which become the column names) and finally spread the data back to wide format.
library(tidyverse)
df %>%
  gather(injury, value, -pt_id) %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%
  mutate(injury = 1:3) %>%
  spread(injury, value)
# pt_id `1` `2` `3`
# <int> <int> <int> <int>
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
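As a follow-up sketch (not part of the answers above): if the ISS is taken to be the sum of the squares of the three highest region scores, with NA treated as no injury, it can be computed directly from those top-three values:
scores <- df[-1]                     # drop pt_id
scores[is.na(scores)] <- 0           # treat NA as "no injury"
top3 <- t(apply(scores, 1, sort, decreasing = TRUE))[, 1:3, drop = FALSE]
data.frame(pt_id = df$pt_id, ISS = rowSums(top3^2))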

Panel Data in R: Get complete cases of data based on individuals

I'm working on an unbalanced panel dataset. The data came from a game, and for every user (user_id) in the record I have data for every level (level) of the game. As data recording started some time after the introduction of the game, for some users I don't have data regarding the first levels; that's why I want to throw them out in a first step.
I've tried the complete.cases function, but it only excludes the rows with missing values (NAs), not the data for the whole user who has missing values in level 1.
panel <- panel[complete.cases(panel), ]
That's why I need code that excludes every user who has no record in level 1 (which in my dataset means he has an NA in one of the dependent variables, i.e. the number of activities).
Update #1:
Data looks like this (thanks to thc):
> game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=c(0,150,170,80,100,110,75,100,0))
> game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I now want to exclude data from player 1, because he has a score of 0 in level 1.
Here is one approach
Example data:
game_data <- data.frame(player = c(1,1,2,2,2,3,3,3), level = c(2,3,1,2,3,1,2,3), score=sample(100, 8))
> game_data
player level score
1 1 2 19
2 1 3 13
3 2 1 65
4 2 2 32
5 2 3 22
6 3 1 98
7 3 2 58
8 3 3 84
library(dplyr)
game_data %>% group_by(player) %>% filter(any(level == 1)) %>% as.data.frame
player level score
1 2 1 65
2 2 2 32
3 2 3 22
4 3 1 98
5 3 2 58
6 3 3 84
I think I have now found a solution with your help:
game_data %>% group_by(player) %>% filter(any(level == 1 & score > 0)) %>% as.data.frame
This seems to work, and I just needed a little adjustment to your code, thc. Thank you very much for your help!
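For reference, an equivalent base R sketch of that final filter (keeping only players who have a level-1 row with a positive score) might be:
keep <- with(game_data, player %in% player[level == 1 & score > 0])
game_data[keep, ]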

How can I arrange data from wide format to long format, and specify relationships

Currently I have a file which I need to convert from wide format to long format. An example of the data is:
Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0
However, I need it in the long format as follows
Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0
So far, in R, I have tried to use the melt function:
datalong <- melt(exp2_simon_shortform, id ="Subject")
But it treats every single column name as a unique variable each with its own value. Does anybody know how I could get from wide to long as specified, making reference to the column header names?
Cheers.
EDIT: I've realised I made an error. My final output needs to be as follows; from the Cat1_ portion, I actually need to extract "Cat" and "1".
Subject Animal CatNumber Weight Sickness
1 Cat 1 10 1
1 Cat 2 11 0
1 Cat 3 12 0
2 Cat 1 7 1
2 Cat 2 8 0
2 Cat 3 9 0
Any updated solutions much appreciated.
The "dplyr" + "tidyr" approach might be something like:
library(dplyr)
library(tidyr)
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val)
# Subject CatNumber Sick Weight
# 1 1 Cat1 1 10
# 2 1 Cat2 0 11
# 3 1 Cat3 0 12
# 4 2 Cat1 1 7
# 5 2 Cat2 0 8
# 6 2 Cat3 0 9
Add a mutate in there along with gsub to remove the "Cat" part of the "CatNumber" column.
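A sketch of that extra step (continuing the pipe above; the Animal/CatNumber columns follow the edit in the question):
library(dplyr)
library(tidyr)

mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) %>%
  mutate(Animal = gsub("[0-9]+$", "", CatNumber),                 # "Cat1" -> "Cat"
         CatNumber = as.integer(gsub("[^0-9]", "", CatNumber)))   # "Cat1" -> 1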
Update
Based on the discussions in chat, your data actually look something more like:
A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")
colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])
subject = 1:60
set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE),
            nrow = length(subject), dimnames = list(NULL, colNames))
mydf <- data.frame(Subject = subject, M)
Thus, you will need to do a few additional steps to get the output you desire. Try:
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Subject) %>%                                      ## Your ID variable
  gather(var, val, -Subject) %>%                             ## Make long data. Everything except your IDs
  separate(var, into = c("partA", "partB")) %>%              ## Split new column into two parts
  mutate(partA = gsub("(.*)([0-9]+)", "\\1_\\2", partA)) %>% ## Make new col easy to split
  separate(partA, into = c("A1", "A2")) %>%                  ## Split this new column
  spread(partB, val)                                         ## Transform to wide form
Which yields:
Source: local data frame [900 x 5]
Subject A1 A2 ResumptionMisses ResumptionTime
(int) (chr) (chr) (int) (int)
1 1 ATCint 1 9 3
2 1 ATCint 2 4 3
3 1 ATCint 3 2 2
4 1 ATCint 4 7 4
5 1 ATCint 5 7 1
6 1 Blank 1 4 10
7 1 Blank 2 2 4
8 1 Blank 3 7 5
9 1 Blank 4 1 9
10 1 Blank 5 10 10
.. ... ... ... ... ...
You can do it with base reshape, like:
reshape(dat, idvar = "Subject", direction = "long", varying = list(2:4, 5:7),
        v.names = c("Weight", "Sick"), timevar = "CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
Alternatively, since reshape expects names like variablename_groupname, you could change the names and then let reshape do the hard work:
names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar = "Subject", direction = "long", varying = -1,
        sep = "_", timevar = "CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
We can use melt from library(data.table), which can take multiple patterns for the measure variables.
library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure = patterns('Weight$', 'Sick$'),
           variable.name = 'CatNumber', value.name = c('Weight', 'Sick'))[order(Subject)]
DT
# Subject CatNumber Weight Sick
#1: 1 1 10 1
#2: 1 2 11 0
#3: 1 3 12 0
#4: 2 1 7 1
#5: 2 2 8 0
#6: 2 3 9 0
If we need the 'Animal' column, we can grep for the 'Cat' columns, remove the suffix substring with sub, and assign (:=) the result to create the 'Animal' column.
DT[, Animal := sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE))]
DT
# Subject CatNumber Weight Sick Animal
#1: 1 1 10 1 Cat
#2: 1 2 11 0 Cat
#3: 1 3 12 0 Cat
#4: 2 1 7 1 Cat
#5: 2 2 8 0 Cat
#6: 2 3 9 0 Cat
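As an addendum (not part of the original answers): the newer tidyr interface can produce the Animal and CatNumber columns in a single step. A sketch, assuming tidyr >= 1.0 and the original wide data frame (here called dat, before any renaming); the regex captures the animal name, the number, and the measure name, and ".value" turns the measure names back into columns:
library(tidyr)

pivot_longer(dat,
             cols = -Subject,
             names_pattern = "([A-Za-z]+)([0-9]+)_(.*)",
             names_to = c("Animal", "CatNumber", ".value"))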

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1 (single column mode): 1, 2, 3
df2 (single column sex): 1, 2
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr, obtaining all combinations (with non-existent ones = 0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1, df3)) you recover the modes not in df3, but sex appears as NA; the same happens if you do left_join(df2, df3).
So how can you do both left joins to recover all absent combinations, with cases = 0? dplyr is preferred, but sqldf is an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
  group_by(mode, sex) %>%
  summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format:
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
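As an aside (not part of the original answers): in later tidyr/dplyr versions the same merge-then-join idea can be written without base merge, for example with tidyr::crossing() and replace_na(). A sketch, assuming those functions are available:
library(dplyr)
library(tidyr)

crossing(mode = df1$mode, sex = df2$sex) %>%   # every mode/sex combination
  left_join(df3, by = c("mode", "sex")) %>%    # attach observed cases
  mutate(cases = replace_na(cases, 0)) %>%     # zero-fill missing combinations
  group_by(mode, sex) %>%
  summarise(cases = sum(cases))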
