Rearrange dataframe to fit longitudinal model in R - r

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.

Using the data.table package you can group, count and assign to a new column in a single expression. The data.table syntax is dt[i, j, by]. Here i selects or orders rows; it is empty in this case, so all rows are used in their original order. j says what to do: here it counts the number of rows in each group with .N and assigns the result to the new column count using the assignment operator :=. by takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()

You can use count to count the number of jobs for each Region, NHS_Trust and Week, and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
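Note that complete(Region, Week = 1:10) expands only Region by Week, so the added rows get NA for NHS_Trust. A minimal sketch of one alternative, assuming the goal is one row per week for every Region/NHS_Trust pair that actually occurs in the data, wraps that pair in tidyr::nesting(). The name jobs is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, TRUE),
  Week = sample(1:10, 20, TRUE),
  Region = sample(1:15, 20, TRUE)
)

jobs <- df1 %>%
  # one row per observed Region / NHS_Trust / Week, with the number of postings
  count(Region, NHS_Trust, Week, name = "Jobs") %>%
  # keep only observed Region / NHS_Trust pairs, but give each of them weeks 1 to 10;
  # weeks without postings get Jobs = 0
  complete(nesting(Region, NHS_Trust), Week = 1:10, fill = list(Jobs = 0L))

head(jobs)
```

Using nesting() avoids creating Region/NHS_Trust pairs that never appear together, which is probably what you want if each trust belongs to a single region.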

I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends them to df2 with a corresponding Jobs value of 0. The check relies on the new variable combo, which is just the values of those three fields concatenated with underscores.
Edit: I am pretty sure the people here can come up with something more elegant than this.
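For comparison, the same "fill every missing combination with 0" idea as the loop can be sketched in base R without a loop, by building the full grid with expand.grid() and merging the counts onto it. This is only a sketch; the names counts, grid and full are illustrative and not part of the answer above.

```r
# Same toy data as in the question
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, TRUE),
  Week = sample(1:10, 20, TRUE),
  Region = sample(1:15, 20, TRUE)
)

# Count postings per observed Region / NHS_Trust / Week combination
counts <- aggregate(Jobs ~ Region + NHS_Trust + Week,
                    data = transform(df1, Jobs = 1L), FUN = sum)

# Every combination of the observed values, as in the nested loops
grid <- expand.grid(Region = unique(df1$Region),
                    NHS_Trust = unique(df1$NHS_Trust),
                    Week = unique(df1$Week))

# Left join the counts onto the grid; unmatched combinations become 0
full <- merge(grid, counts, all.x = TRUE)
full$Jobs[is.na(full$Jobs)] <- 0L

tail(full)
```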

Related

How to merge rows based on conditions with characters values? (Household data)

I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers: manager+worker="I"; manager+employee="II"; employee+employee="VI"; worker+worker="III"; etc.
For the variable "work_night", I would like to apply a score (if both answered Yes then score = 2, if one answered Yes then score = 1, and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr? I know that I need group_by() but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
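A shorter variant can be sketched with a hand-written lookup table, assuming every household has exactly two members and that only the four combinations named in the question matter. The names pcs_lookup and households are illustrative; any pair not listed in the lookup would come out as NA.

```r
library(dplyr)

PCS <- c("worker", "manager", "employee", "employee", "worker",
         "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum)

# Keys are the two job labels sorted alphabetically and joined with "+",
# so the order of the two people within a household does not matter.
pcs_lookup <- c("manager+worker"    = "I",
                "employee+manager"  = "II",
                "worker+worker"     = "III",
                "employee+employee" = "VI")

households <- df %>%
  group_by(HHnum) %>%
  summarise(
    PCS = unname(pcs_lookup[paste(sort(PCS), collapse = "+")]),
    work_night = sum(work_night == "Yes"),  # 0, 1 or 2 night workers
    .groups = "drop"
  )

households
```

With this lookup the categories follow the mapping given in the question rather than the order in which combinations happen to be generated.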

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) that contains many variables of the same class (dbl). They belong to different settings. One variable (a column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the row-sum variable (I call it "s1").
Problem:
Each setting has a different number of variables (and of course they are named differently).
Because the structure is the same and only the variables differ, this is a typical case for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) compute the rowSums of the setting's variables (this does not work; why?).
Because the function is for a specially designed dataset, it includes two predefined variables:
day - any day of the investigation period
N - the number of cases investigated on that day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Another option is to select the columns with the quosures and then apply rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
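If the column names are available as a character vector, the quosure handling can be avoided altogether. The following is only a sketch, assuming dplyr 1.1.0 or later (for pick()); the function name sum_setting and its vars argument are hypothetical, not part of the answers above.

```r
library(dplyr)

set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
                  col2 = runif(20))

# vars: character vector with the names of the columns of one setting
sum_setting <- function(dataset, vars) {
  dataset %>%
    select(day, N, all_of(vars)) %>%
    # pick() returns the selected columns as a data frame, which rowSums() accepts
    mutate(s1 = rowSums(pick(all_of(vars)), na.rm = TRUE))
}

out <- sum_setting(df1, c("col1", "col2"))
head(out, 3)
```

Passing names as strings also makes it easy to keep one vector of column names per setting and loop over them.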

Extracting counts of a variable grouped at 2 levels

I have weather data tagged by year, month and day. Here is some of the data:
Date MinT Year Month
1976-01-01 1.1 1976 1
1976-01-02 0.3 1976 1
1976-01-03 1.3 1976 1
The data run is 1976:2016 for all months. Call this TestData.
I can group and subset as follows (it is very clunky but that is because I have been trying to test each step)
temp1 <- TestData %>%
group_by(Year)
temp2 <- temp1 %>%
subset(between(Month, 1, 3))
temp3 <- temp2
v1 <- replace(temp3$MinT, temp3$MinT > -2.0, 0) ### replaces values above the threshold with 0
temp3["v1"] <- v1
index1 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = sum)) ## sums the values at or below -2 degrees over months 1-3
index2 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = length)) ## counts the number of items in each year for the selected period.
index2 gives me a count of the days in each month. I can use index1 and 2 to create index of 'weather for the month'.
What I would like is to be able to get a count of all of the days below -2 (or whatever) and so get an index of comparable severity for each month.
The v1 assignment is necessary because if I use rle to count instances, some months will have zero instances and they drop from the final tally meaning the compiled table of indices against minT, year and month has index vectors of different lengths which R doesn't like. I have tried rle as the FUN in the index2 assignment but that would not let me reach the day counts. The same was true for using a range value with length in that assignment (index3) as well.
Short of generating a mini table for each year, I am stuck. Does anyone have any suggestions?
I guess summarise is the function you are looking for. Something like this (different data, same principle):
library(dplyr)
library(latticeExtra) # provides the SeatacWeather example data
threshold <- 40
SeatacWeather %>%
group_by(year, month) %>%
filter(min.temp < threshold) %>%
summarise(days_below_threshold = n())
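One caveat: filter() drops every month that has no day below the threshold, which is exactly the "vectors of different lengths" problem described in the question. A sketch that keeps those months with a count of 0 sums a logical condition inside summarise() instead of filtering first. The data below are made up to mimic the Date/MinT/Year/Month layout from the question; severity, days_below and n_days are illustrative names.

```r
library(dplyr)

# Made-up stand-in for TestData: two years of daily minimum temperatures
set.seed(42)
TestData <- data.frame(
  Date = seq(as.Date("1976-01-01"), as.Date("1977-12-31"), by = "day")
)
TestData$Year  <- as.integer(format(TestData$Date, "%Y"))
TestData$Month <- as.integer(format(TestData$Date, "%m"))
TestData$MinT  <- round(rnorm(nrow(TestData), mean = 5, sd = 5), 1)

threshold <- -2

severity <- TestData %>%
  group_by(Year, Month) %>%
  summarise(
    days_below = sum(MinT < threshold),  # TRUE counts as 1, so this is a day count
    n_days = n(),                        # number of days in the month
    .groups = "drop"
  )

head(severity)
```

Because no rows are removed before summarising, every Year/Month pair appears in the result, so the index vectors all have the same length.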

R dplyr chaining group by into function

I have a dataframe that looks something like this:
time id trialNum trialType accX gravX
1 1 6 7 low -0.38876217 10.185266
2 2 1 6 low 0.68254705 10.741545
3 3 3 15 high -0.21906854 9.466929
4 4 2 15 none -0.03370001 9.490829
5 5 4 1 high 0.16511542 10.986796
6 6 9 2 none -0.10441621 9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
id = sample(1:10, size=50, replace=T),
trialNum = sample(1:15, size = 50, replace=T),
trialType = sample(c("none", "low", "high"),
size = 50, replace=T),
accX = sin(seq(1,50,1)),
gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning mean time, and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do is apply this function to each combination of id, trialNum, and trialType using group_by:
tempTrial <-
group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime(.)[1],
varianceTime= calcStepTime(.)[2])
The problem is that in the output dataframe (tempTrial) every row of meanTime and varianceTime is identical.
In this toy dataset, sometimes the columns all show NA (this doesn't happen in my actual dataset).
Am I doing something incorrectly to cause each row to be identical for the 2 columns? It should be taking each combination of id, trialNum and trialType, and calculating peak times for each of those separately. However, it seems it's only storing a single value for each combination?
The chain is working properly in the sense that . refers to the grouped data frame group_by(testDF, id, trialNum, trialType). Since your defined function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame).
So your problem here is the incorrect use of summarise. Latrunculia's answer shows you that the proper way to use summarise in the way you expect is to apply the function to combinations of columns in your data frame, in which case the function applies by group in each variable.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>% do(meanTime = calcStepTime(.)[1], varianceTime= calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
calcStepTime <- function(var1, var2, var3){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, var1 - var2)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- var3[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
testDF %>% group_by(id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime( accX, gravX, time)[1],
varianceTime= calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the testDF data frame into it. It breaks for the grouped DF, but I can't tell whether that's because the function is not defined for the subsets or because of a problem with the function.
Let me know if it works for the full data.
As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50 observation datasets. This occurs when either no peak or a single peak was found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
id = sample(1:2, size=200, replace=T),
trialNum = sample(1:1, size = 200, replace=T),
trialType = sample(c("low"), size = 200, replace=T),
accX = sin(seq(1,200,1)),
gravX = 0.1)
If you change the return type of your function to a data_frame (tibble), like so:
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return (data_frame("meanStepTime" = meanStepTime,
"varianceStepTime" = varianceStepTime))
}
Then you can take advantage of purrr::by_slice() (in later purrr releases this function lives in the purrrlyr package) for a fairly elegant solution:
library(purrr)
testDF %>%
group_by(id, trialNum, trialType) %>%
by_slice(calcStepTime, .collate="cols")
I got this from my test sample:
# A tibble: 2 x 5
id trialNum trialType meanStepTime1 varianceStepTime1
<int> <int> <fctr> <dbl> <dbl>
1 1 1 low 42.75 802.2500
2 2 1 low 39.75 616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.
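In current dplyr, a similar per-group call can be sketched with group_modify(), which passes each group's rows to a function and expects a data frame back. To keep the sketch free of the signal and quantmod dependencies, it uses a stand-in function (step_time_summary, a hypothetical name) that finds local maxima of the raw signal with base R instead of filtering and calling findPeaks(); it is not a replacement for calcStepTime.

```r
library(dplyr)

# Stand-in for calcStepTime(): df holds the rows of one group, key the group labels.
step_time_summary <- function(df, key) {
  x <- df$accX - df$gravX
  # position i is a peak when x[i] is larger than both of its neighbours
  peaks <- which(diff(sign(diff(x))) == -2) + 1
  gaps <- diff(df$time[peaks])
  data.frame(meanStepTime = mean(gaps), varianceStepTime = var(gaps))
}

# Simplified test data: two ids with 100 consecutive samples each
testDF <- data.frame(time = 1:200,
                     id = rep(1:2, each = 100),
                     accX = sin(seq(1, 200, 1)),
                     gravX = 0.1)

result <- testDF %>%
  group_by(id) %>%
  group_modify(step_time_summary)

result
```

Each group is handled separately here, so the two rows differ; with very short groups the function can still return NaN/NA, as noted above.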

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas"), 50, replace = TRUE),
june = sample(1:50, 50, replace=TRUE),
july = sample(100:150, 50, replace=TRUE)
)
So, what I want to do is 1) group this data by "region", then add a new column for each of the following months that is the sum of that month's value (in the real dataframe, there are many periods that follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
name <- paste(col, "_tot", sep="")
data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problems using dplyr (first, since that is what you tried), and followed by data.table as well as base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
This assumes every column but the first two is monthly data. An explanation by line:
1. we use as.name and lapply to generate a list of the column names we want to mutate, as symbols
2. we give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. we use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
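mutate_each_q comes from an old dplyr release. With dplyr 1.0 or later, roughly the same result can be sketched with across() and its .names argument, assuming the monthly columns are the only numeric ones. The name with_totals is illustrative.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june = sample(1:50, 50, replace = TRUE),
  july = sample(100:150, 50, replace = TRUE)
)

with_totals <- data %>%
  group_by(region) %>%
  # apply sum() to every numeric column within each region and
  # store the result in a new column named <column>_tot
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()

head(with_totals)
```

Because .names is given, the original june and july columns are kept and the totals are added alongside them.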
data.table:
new.names <- paste0(tail(names(data), 2L), "_tot") # Make new names
data.table(data)[,
(new.names):=lapply(.SD, sum), # `lapply` `sum` to the selected columns (those in .SD), and assign to `new.names` columns
by=region, .SDcols=-1 # group by `region`, and exclude first column from `.SD` (note `region` is excluded as well by reason of being in `by`)
][] # extra `[]` just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
cbind,
list(
data,
setNames(
lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
paste0(names(data[-(1:2)]), "_tot")
) ) )
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas mideast
563 768
americas mideast
2538 3802
You can get all outputs in a list if you want.
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data,key = month,value = monthval,-c(biz,region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),by = list(region = d2$region,month = d2$month),sum)
month_tots$month <- paste0(month_tots$month,'_tot')
# long to wide
month_tots <- spread(data = month_tots,key = month,value = total)
# recombine
merge(data,month_tots,by = 'region',all.x = T)
