I have a matrix-like data frame with an additional column denoting time. It contains information on the number of enrolled students in a given school, from grade 5 (column A) to grade 9 (column E).
time A B C D E
1 13 1842 1844 1689 1776 1716
2 14 1898 1785 1807 1617 1679
3 15 2065 1865 1748 1731 1590
4 16 2215 1994 1811 1708 1703
5 17 2174 2122 1903 1765 1699
I need to trace the size of each cohort over time, meaning that I need row-wise information on how many fifth graders from each starting year remained in the school through grades 6 to 9. For example, for the cohort that began fifth grade in 2013, I want to know how many remained in sixth grade in 2014, and so on.
Expected output
This is what I would like to end up with:
start.time point.A point.B point.C point.D point.E
1 13 1842 1785 1748 1708 1699
2 14 1898 1865 1811 1765 NA
3 15 2065 1994 1903 NA NA
4 16 2215 2122 NA NA NA
5 17 2174 NA NA NA NA
I have looked at diag() from base R, but I could only get the data from the main diagonal. Ideally, I'd like to accomplish this using dplyr syntax and the pipe.
Data
structure(list(time = 13:17, A = c(1842, 1898, 2065, 2215, 2174), B = c(1844, 1785, 1865, 1994, 2122), C = c(1689, 1807, 1748, 1811, 1903), D = c(1776, 1617, 1731, 1708, 1765), E = c(1716, 1679, 1590, 1703, 1699)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), vars = "time", drop = TRUE, indices = list(
0L, 1L, 2L, 3L, 4L), group_sizes = c(1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
time = 13:17), class = "data.frame", row.names = c(NA, -5L), vars = "time", drop = TRUE, .Names = "time"), .Names = c("time", "A", "B", "C", "D", "E"))
Convert the input DF, except for its first column, to a matrix mat. Since row(mat) - col(mat) is constant along each diagonal, split mat on that value, creating a list L of ts-class series. We use the ts class because we can later cbind such series even when they have different lengths. The diagonals for which row(mat) - col(mat) >= 0 are the only ones we want, so pick those off, cbind them together and transpose the result. Then replace all columns of DF except the first with that. No packages are used.
mat <- as.matrix(DF[-1])                          # drop the time column
L <- lapply(split(mat, row(mat) - col(mat)), ts)  # one ts series per diagonal
replace(DF, -1, t(do.call("cbind", L[as.numeric(names(L)) >= 0])))
giving:
time A B C D E
1 13 1842 1785 1748 1708 1699
2 14 1898 1865 1811 1765 NA
3 15 2065 1994 1903 NA NA
4 16 2215 2122 NA NA NA
5 17 2174 NA NA NA NA
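To see why splitting on row(mat) - col(mat) works, here is the index matrix for a small case; each diagonal gets a constant label: 0 on the main diagonal, positive below it and negative above it.
m <- matrix(0, 3, 3)
row(m) - col(m)
#      [,1] [,2] [,3]
# [1,]    0   -1   -2
# [2,]    1    0   -1
# [3,]    2    1    0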
Since you mentioned dplyr in your question, you could use dplyr::lead to shift the values of columns B through E up by 1, 2, 3 and 4 rows respectively, and then bind the result to columns time and A from your original data, as follows:
library(tidyverse)
bind_cols(df[, 1:2], map2_df(.x = df[, c(3:ncol(df))],
.y = seq_along(df[, 3:ncol(df)]),
.f = ~dplyr::lead(x = .x, n = .y)))
# A tibble: 5 x 6
# Groups: time [5]
# time A B C D E
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 13 1842 1785 1748 1708 1699
#2 14 1898 1865 1811 1765 NA
#3 15 2065 1994 1903 NA NA
#4 16 2215 2122 NA NA NA
#5 17 2174 NA NA NA NA
Note that your data is grouped by time the way you provided it.
With some grouping and arranging and row_number(), we can do this with dplyr and tidyr, and we don't lose values.
It looks a bit messy, but here I create a two-dimensional index in which the second dimension is inverted. When the two index positions are summed, cells on the same diagonal share a value.
data %>%
ungroup() %>%
mutate(row = row_number()) %>%
gather(class, stud, A:E) %>%
arrange(row, desc(class)) %>%
group_by(row) %>%
mutate(time_left = row_number()) %>%
ungroup() %>%
transmute(time, class, stud, start_year = time_left + row - 1) %>%
ggplot(aes(time, stud, color = factor(start_year))) +
geom_line() +
geom_point()
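If you want the wide cohort table from the question rather than a plot, the same long data can be spread back out. A sketch reusing the pipeline above: filtering start_year >= 5 keeps only cohorts whose fifth-grade year falls inside the observed window (start_year 5 corresponds to time 13).
data %>%
  ungroup() %>%
  mutate(row = row_number()) %>%
  gather(class, stud, A:E) %>%
  arrange(row, desc(class)) %>%
  group_by(row) %>%
  mutate(time_left = row_number()) %>%
  ungroup() %>%
  transmute(class, stud, start_year = time_left + row - 1) %>%
  filter(start_year >= 5) %>%
  spread(class, stud)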
Replace the mirrored upper triangle of "d" with the values from the lower triangle.
m <- as.matrix(d[-1])
d[-1] <- NA
d[-1][upper.tri(m, diag = TRUE)[ , ncol(m):1]] <- m[lower.tri(m, diag = TRUE)]
# time A B C D E
# 1 13 1842 1785 1748 1708 1699
# 2 14 1898 1865 1811 1765 NA
# 3 15 2065 1994 1903 NA NA
# 4 16 2215 2122 NA NA NA
# 5 17 2174 NA NA NA NA
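To see what the column-reversed mask selects, here is the 3x3 case: reversing the columns of the upper triangle yields the anti-diagonal and everything above it, which is exactly where the surviving cohort values land.
m3 <- matrix(0, 3, 3)
upper.tri(m3, diag = TRUE)[, 3:1]
#       [,1]  [,2]  [,3]
# [1,]  TRUE  TRUE  TRUE
# [2,]  TRUE  TRUE FALSE
# [3,]  TRUE FALSE FALSE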
Is there a way to sum variables (e.g. sales and units) for all unique variable names (brands like Coke and Pepsi) within a data frame?
To help, here is some example data.
set.seed(123)
period <- seq(as.Date('2021/01/01'), as.Date('2021/01/07'), by="day")
Coke_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Regular_Sales <- sample(500:1000,7, replace = TRUE)
Coke_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Regular_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
df <- data.frame(period, Coke_Regular_Units, Coke_Diet_Units, Coke_Regular_Sales, Coke_Diet_Sales,
                 Pepsi_Regular_Units, Pepsi_Diet_Units, Pepsi_Regular_Sales, Pepsi_Diet_Sales)
> head(df)
period Coke_Regular_Units Coke_Diet_Units Coke_Regular_Sales Coke_Diet_Sales Pepsi_Regular_Units
1 2021-01-01 1414 1117 589 847 1425
2 2021-01-02 1462 1298 590 636 1648
3 2021-01-03 1178 1228 755 976 1765
4 2021-01-04 1525 1243 696 854 1210
5 2021-01-05 1194 1013 998 827 1931
6 2021-01-06 1937 1373 590 525 1589
Pepsi_Diet_Units Pepsi_Regular_Sales Pepsi_Diet_Sales
1 1554 608 943
2 1870 762 808
3 1372 892 634
4 1843 924 808
5 1142 829 910
6 1543 522 723
I'd like code to automatically calculate Coke_Sales, Coke_Units, Pepsi_Sales, Pepsi_Units, Regular_Sales and Diet_Units.
I am currently doing it like this for each variable:
library(dplyr)
df$Coke_Sales <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Sales")))))
df$Coke_Units <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Units")))))
This is OK for a small number of variables, but I need to do this for hundreds of them. Is there a function that enables this? It would need to automatically find the unique variable names like Coke, Pepsi, Diet and Regular. The metric is the last part of the variable name, so it doesn't necessarily need to be auto-detected, though that would be great. If it makes it any easier, it would be fine to specify the metrics, as there are at most three metrics but hundreds of brands.
If it can't be automated, is there a way it can be simplified where I specify the variables required? Not perfect, but still an improvement. For example, including these lines of code to specify the variables to sum and the metrics required:
VarsToSum <- c("Coke", "Pepsi", "Diet", "Regular")
Metrics <- c("Sales", "Units")
If it can't be accomplished that way either, maybe I need to break it into smaller steps; any tips would be great. Trying to think it through: should I find the unique names before the "_" separator, then calculate "Sales" and "Units" for those unique names? Would this be the best way to do it? Or should I reshape the data? Are there any other routes to get there?
Any help, or directions how to achieve this would be greatly appreciated. Thanks
Here is a data.table approach...
library( data.table )
setDT(df) #make it a data.table
#melt to long
ans <- melt( df, id.vars = "period", variable.factor = FALSE )
#split variable to 3 new columns
ans[, c("brand", "type", "what") := tstrsplit( variable, "_" ) ]
# > head(ans)
# period variable value brand type what
# 1: 2021-01-01 Coke_Regular_Units 1414 Coke Regular Units
# 2: 2021-01-02 Coke_Regular_Units 1462 Coke Regular Units
# 3: 2021-01-03 Coke_Regular_Units 1178 Coke Regular Units
# 4: 2021-01-04 Coke_Regular_Units 1525 Coke Regular Units
# 5: 2021-01-05 Coke_Regular_Units 1194 Coke Regular Units
# 6: 2021-01-06 Coke_Regular_Units 1937 Coke Regular Units
#summarise however you like
ans[, .(total = sum(value) ), by = .(brand, type, what)]
# brand type what total
# 1: Coke Regular Units 10527
# 2: Coke Diet Units 8936
# 3: Coke Regular Sales 5158
# 4: Coke Diet Sales 5171
# 5: Pepsi Regular Units 11160
# 6: Pepsi Diet Units 10813
# 7: Pepsi Regular Sales 5447
# 8: Pepsi Diet Sales 5491
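If you also want the per-period wide columns the question computed with rowSums (Coke_Sales, Coke_Units, ...), a sketch: cast the long table back out, summing Regular and Diet within each brand/metric (dcast's default separator is "_").
wide <- dcast(ans, period ~ brand + what, value.var = "value", fun.aggregate = sum)
# columns: period, Coke_Sales, Coke_Units, Pepsi_Sales, Pepsi_Units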
Using outer() to paste the name fragments together and grep() to match the columns to sum:
sapply(outer(c("Coke", "Pepsi"), c("Sales", "Units"), paste, sep=".*"), function(x)
rowSums(df[grep(x, names(df))]))
# Coke.*Sales Pepsi.*Sales Coke.*Units Pepsi.*Units
# [1,] 1436 1551 2531 2979
# [2,] 1226 1570 2760 3518
# [3,] 1731 1526 2406 3137
# [4,] 1550 1732 2768 3053
# [5,] 1825 1739 2207 3073
# [6,] 1115 1245 3310 3132
# [7,] 1446 1575 3481 3081
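To attach these sums back onto df under the requested names, one option (a sketch):
sums <- sapply(outer(c("Coke", "Pepsi"), c("Sales", "Units"), paste, sep = ".*"),
               function(x) rowSums(df[grep(x, names(df))]))
colnames(sums) <- sub("\\.\\*", "_", colnames(sums))  # "Coke.*Sales" -> "Coke_Sales"
df <- cbind(df, sums)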
Here's a solution similar in spirit to that of @Wimpel, but with the tidyverse:
library(tidyverse)
summary_df <-
df %>%
pivot_longer(cols = ends_with("Sales") | ends_with("Units"),
names_to = c("brand", "type", ".value"),
names_pattern = "(.*)_(.*)_(.*)") %>%
group_by(brand) %>%
summarize(Sales = sum(Sales),
Units = sum(Units)) %>%
pivot_wider(names_from = "brand",
values_from = c("Sales", "Units"),
names_glue = "{brand}_{.value}")
summary_df
# # A tibble: 1 x 4
# Coke_Sales Pepsi_Sales Coke_Units Pepsi_Units
# <int> <int> <int> <int>
# 1 10329 10938 19463 21973
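The question's rowSums() approach kept one row per period; to reproduce that here, just add period to the grouping (a sketch, assuming the period column is present in df):
df %>%
  pivot_longer(cols = ends_with("Sales") | ends_with("Units"),
               names_to = c("brand", "type", ".value"),
               names_pattern = "(.*)_(.*)_(.*)") %>%
  group_by(period, brand) %>%
  summarize(Sales = sum(Sales), Units = sum(Units), .groups = "drop") %>%
  pivot_wider(names_from = "brand",
              values_from = c("Sales", "Units"),
              names_glue = "{brand}_{.value}")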
I'm looking at covid-19 data to calculate estimates for the reproductive number R0.
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(TTR)
# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv"
DoubleCOV <- read.csv(url, stringsAsFactors = FALSE)
names(DoubleCOV)[1] <- "countyFIPS"
DoubleCovid <- pivot_longer(DoubleCOV, cols=starts_with("X"),
values_to="cases",
names_to=c("X","date_infected"),
names_sep="X") %>%
mutate(infected = as.Date(date_infected, format="%m.%d.%y"),
countyFIPS = str_pad(as.character(countyFIPS), 5, pad="0"))
#data is by county, summarise for the state of interest
stateData <- DoubleCovid %>% filter(State == "AL") %>% filter(cases != 0) %>%
group_by(infected) %>% summarise(sum(cases)) %>%
mutate(DaysSince = infected - min(infected))
names(stateData)[2] <- "cumCases"
#3 day moving average to smooth a little
stateData <- stateData %>% mutate(MA = runMean(cumCases,3))
#calculate doubling rate (DR) and then R0 = infectious period / doubling rate
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
CDplot <- stateData %>%
ggplot(mapping = aes(x = as.numeric(DaysSince), y = R0)) +
geom_line(color = "firebrick")
print(CDplot)
So in the above the state of interest is Alabama, hence filter(State == "AL") and this works.
But if I change the state to "NY" I get
Error in `$<-.data.frame`(`*tmp*`, "DR", value = c(NA, NA, NA, 0.733907206043719 :
replacement has 4 rows, data has 39
head(stateData) yields
infected cumCases DaysSince MA
<date> <int> <drtn> <dbl>
1 2020-03-02 1 0 days NA
2 2020-03-03 2 1 days NA
3 2020-03-04 11 2 days 4.67
4 2020-03-05 23 3 days 12
5 2020-03-06 25 4 days 19.7
6 2020-03-07 77 5 days 41.7
The moving average values in rows 3 and 4 (4.67 and 12) would yield a doubling rate of 0.734, which aligns with the value in the error message value = c(NA, NA, NA, 0.733907206043719 but why does it throw an error after that?
Bonus question: I know loops are frowned upon in R... is there a way to get the moving average and R0 calculations without one?
You have to initialise the new variables before you can access them by the j index. Due to recycling, Alabama, which has 28 rows (divisible by 4), does not return an error, only warnings about uninitialised columns. New York, however, has 39 rows, which is not divisible by 4, so recycling fails and R returns an error. You shouldn't ignore warnings; sometimes you can get away with it, but it's not a good idea.
Try this to see what R is actually doing:
stateData[4]
You should get all rows of the 4th column, not the 4th row.
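A minimal illustration of the recycling rule with a plain data frame (a sketch, unrelated to the COVID data):
d <- data.frame(x = 1:6)
d$y[2] <- 10   # y doesn't exist yet: creates c(NA, 10), recycled into 6 rows
d$y
# [1] NA 10 NA 10 NA 10
d2 <- data.frame(x = 1:7)
d2$y[2] <- 10  # length 2 does not divide 7 rows, so recycling fails
# Error in `$<-.data.frame`(`*tmp*`, "y", value = c(NA, 10)) :
#   replacement has 2 rows, data has 7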
Solution: initialise your DR and R0 columns first.
stateData$DR <- NA
stateData$R0 <- NA
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
For the bonus question, you can compute DR and R0 with lag() in the same mutate() call as MA:
stateData <- stateData %>% mutate(MA = runMean(cumCases,3),
DR = log(2)/log(MA/lag(MA)),
R0 = 14 / DR)
stateData
# A tibble: 28 x 6
infected cumCases DaysSince MA DR R0
<date> <int> <drtn> <dbl> <dbl> <dbl>
1 2020-03-13 5 0 days NA NA NA
2 2020-03-14 11 1 days NA NA NA
3 2020-03-15 22 2 days 12.7 NA NA
4 2020-03-16 29 3 days 20.7 1.42 9.89
5 2020-03-17 39 4 days 30 1.86 7.53
6 2020-03-18 51 5 days 39.7 2.48 5.64
7 2020-03-19 78 6 days 56 2.01 6.96
8 2020-03-20 106 7 days 78.3 2.07 6.78
9 2020-03-21 131 8 days 105 2.37 5.92
10 2020-03-22 167 9 days 135. 2.79 5.03
# ... with 18 more rows
I'm using Alabama's data.
I have a bunch of time series data stacked on top of one another in a data frame; one series for each region in a country. I'd like to apply the seas() function (from the seasonal package) to each series, iteratively, to make the series seasonally adjusted. To do this, I first have to convert the series to a ts class. I'm struggling to do all this using purrr.
Here's a minimum worked example:
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
For each region (indexed by a number) I'd like to perform the following operations. Here's the first region as an example:
tem1 <- df %>% filter(region==1)
tem2 <- ts(data = tem1$var, frequency = 4, start=c(1990,1))
tem3 <- seas(tem2)
tem4 <- as.data.frame(tem3$data)
I'd then like to stack the output (ie. the multiple tem4 data frames, one for each region), along with the region and quarter identifiers.
So, the start of the output for region 1 would be this:
final seasonaladj trend irregular region quarter
1 27 27 96.95 -67.97279 1 1
2 126 126 96.95 27.87381 1 2
3 124 124 96.95 27.10823 1 3
4 127 127 96.95 30.55075 1 4
5 173 173 96.95 75.01355 1 5
6 130 130 96.95 32.10672 1 6
The data for region 2 would be below this etc.
I started with the following but without luck so far. Basically, I'm struggling to get the time series into the tibble:
seas.adjusted <- df %>%
group_by(region) %>%
mutate(data.ts = map(.x = data$var,
.f = as.ts,
start = 1990,
freq = 4))
I don't know much about the seasonal adjustment part, so there may be things I missed, but I can help with moving your calculations into a map-friendly function.
After grouping by region, you can nest the data so there's a nested data frame for each region. Then you can run essentially the same code as you had, but inside a function in map. Unnesting the resulting column gives you a long-shaped data frame of adjustments.
Like I said, I don't have the expertise to know whether those last two columns having NAs is expected or not.
Edit: Based on @wibeasley's question about retaining the quarter column, I'm adding a mutate that adds a column of the quarters listed in the nested data frame.
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
df %>%
group_by(region) %>%
nest() %>%
mutate(data.ts = map(data, function(x) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(quarter = x$quarter)
})) %>%
unnest(data.ts)
#> # A tibble: 200 x 8
#> region final seasonaladj trend irregular quarter seasonal adjustfac
#> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 27 27 97.0 -68.0 1 NA NA
#> 2 1 126 126 97.0 27.9 2 NA NA
#> 3 1 124 124 97.0 27.1 3 NA NA
#> 4 1 127 127 97.0 30.6 4 NA NA
#> 5 1 173 173 97.0 75.0 5 NA NA
#> 6 1 130 130 97.0 32.1 6 NA NA
#> 7 1 6 6 97.0 -89.0 7 NA NA
#> 8 1 50 50 97.0 -46.5 8 NA NA
#> 9 1 135 135 97.0 36.7 9 NA NA
#> 10 1 105 105 97.0 8.81 10 NA NA
#> # ... with 190 more rows
I also gave a bit more thought to doing this without nesting, and instead tried doing it with a split. Passing that list of data frames into imap_dfr let me take each split piece of the data frame and its name (in this case, the value of region), then return everything rbinded back together into one data frame. I sometimes shy away from nested data just because I have trouble seeing what's going on, so this is an alternative that is maybe more transparent.
df %>%
split(.$region) %>%
imap_dfr(function(x, reg) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(region = reg, quarter = x$quarter)
}) %>%
select(region, quarter, everything()) %>%
head()
#> region quarter final seasonaladj trend irregular seasonal adjustfac
#> 1 1 1 27 27 96.95 -67.97274 NA NA
#> 2 1 2 126 126 96.95 27.87378 NA NA
#> 3 1 3 124 124 96.95 27.10823 NA NA
#> 4 1 4 127 127 96.95 30.55077 NA NA
#> 5 1 5 173 173 96.95 75.01353 NA NA
#> 6 1 6 130 130 96.95 32.10669 NA NA
Created on 2018-08-12 by the reprex package (v0.2.0).
I put all the action inside of f(), and then called it with purrr::map_df(). The re-inclusion of quarter is a hack.
f <- function( .region ) {
d <- df %>%
dplyr::filter(region == .region)
y <- d %>%
dplyr::pull(var) %>%
ts(frequency = 4, start=c(1990,1)) %>%
seas()
y$data %>%
as.data.frame() %>%
# dplyr::select(-seasonal, -adjustfac) %>%
dplyr::mutate(
quarter = d$quarter
)
}
purrr::map_df(1:10, f, .id = "region")
results:
region final seasonaladj trend irregular quarter seasonal adjustfac
1 1 27.00000 27.00000 96.95000 -6.797279e+01 1 NA NA
2 1 126.00000 126.00000 96.95000 2.787381e+01 2 NA NA
3 1 124.00000 124.00000 96.95000 2.710823e+01 3 NA NA
4 1 127.00000 127.00000 96.95000 3.055075e+01 4 NA NA
5 1 173.00000 173.00000 96.95000 7.501355e+01 5 NA NA
6 1 130.00000 130.00000 96.95000 3.210672e+01 6 NA NA
7 1 6.00000 6.00000 96.95000 -8.899356e+01 7 NA NA
8 1 50.00000 50.00000 96.95000 -4.647254e+01 8 NA NA
9 1 135.00000 135.00000 96.95000 3.671077e+01 9 NA NA
10 1 105.00000 105.00000 96.95000 8.806955e+00 10 NA NA
...
96 5 55.01724 55.01724 60.25848 9.130207e-01 16 1.9084928 1.9084928
97 5 60.21549 60.21549 59.43828 1.013076e+00 17 1.0462424 1.0462424
98 5 58.30626 58.30626 58.87065 9.904130e-01 18 0.1715082 0.1715082
99 5 61.68175 61.68175 58.07827 1.062045e+00 19 1.0537962 1.0537962
100 5 59.30138 59.30138 56.70798 1.045733e+00 20 2.5294523 2.5294523
...
I have an untidy dataset that combines two variables (some missing) in each of two columns (a small subsample in the data frame 'test' below). I'm struggling to create the desired tidy dataset below.
untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")
Desired data frame
N_patients N_ears pct_patients pct_ears
173 NA 58.61 NA
60 NA 13.30 NA
54 96 11.11 NA
168 328 52.38 NA
906 1685 14.79 10.45
Thanks!
It seems there is always an edge case: both answers fail to consider something about the 5th row. It seems to be just a regex issue. Suggestions on how to fix it?
untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
i.e. in row 5, [35.33%] is parsed as pct_patients rather than pct_ears
N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1 173 58.61% 173 NA 58.61 NA
2 60 13.30% 60 NA 13.30 NA
3 54 [96] 11.11% 54 96 11.11 NA
4 168 [328] 52.38% 168 328 52.38 NA
5 75 [150] [35.33%] 75 150 35.33 NA
Happily, this is pretty easy with the tidyr package in the tidyverse.
library(tidyverse)
test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"),
`% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")),
.Names = c("N [ears]", "% Otorrhea"),
row.names = c(NA, 5L), class = "data.frame")
test %>%
separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
mutate_each(funs(parse_number))
#> N_patients N_ears pct_patients pct_ears
#> 1 173 NA 58.61 NA
#> 2 60 NA 13.30 NA
#> 3 54 96 11.11 NA
#> 4 168 328 52.38 NA
#> 5 906 1685 14.79 10.45
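mutate_each() has since been deprecated; in current dplyr the last step would use across() instead (same result, a sketch):
test %>%
  separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
  separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
  mutate(across(everything(), parse_number))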
Here is an alternative using the extract() function with regular expressions:
library(tidyr)
test %>%
extract(`N [ears]`, into = c("N_patients", "N_ears"),
regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>%
extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"),
regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")
# N_patients N_ears pct_patients pct_ears
#1 173 <NA> 58.61 <NA>
#2 60 <NA> 13.30 <NA>
#3 54 96 11.11 <NA>
#4 168 328 52.38 <NA>
#5 906 1685 14.79 10.45
Here we use a non-capture group (?:...) followed by ? to make the bracketed ears values optional.
The best answer for my actual dataset was provided in a comment by alistaire (https://stackoverflow.com/users/4497050/alistaire). It is shown below, wrapped in a simple function.
library(tidyverse)
make_tidy <- function(untidy){
tidy <- untidy %>%
separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>%
separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
}
tidy_2 <- make_tidy(untidy_2)
Correctly parses untidy_2
> tidy_2
# A tibble: 5 × 4
N_patients N_ears pct_patients pct_ears
* <int> <int> <dbl> <dbl>
1 173 NA 58.61 NA
2 60 NA 13.30 NA
3 54 96 11.11 NA
4 168 328 52.38 NA
5 906 1685 14.79 10.45
I have a dataframe that provides all kinds of sales info- date, session, time, day of week, product type, total sales, etc. It also includes a single column that provides the order in which all products were purchased in that session. Some of the products are text names, some are numbers.
The products with text names never change, but the products with numerical names rotate as new ones are developed. (This is why they are listed in a single column: the "numerical" products change so much that the data frame would get maddeningly wide in just a few months, plus some other issues.)
Here's a small subset:
Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:ProductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4
Okay, so in the above example, in session 20764 (the first one), sales were $15 and it was all spent on ProductA. In the next session, ProductB was purchased three times. In the third session, product 1069 was purchased twice, and so on.
I am going to be doing a lot with this, but I don't know how to tell R that, in this column, a ':' acts as a separator between products, and an 'x' signifies the number of 'numerical' products that were purchased. Any ideas?
Some examples of what I am trying to know:
1. Which Product was purchased first in a session;
2. Which products were purchased most often with each other; and,
3. I'd like to be able to, say, aggregate sessions that contain certain combinations of products (e.g., 1067 and 1046 and Quinto).
I know this is a broad request for on here, but any info on how to get R to recognize these unique-to-this-column identifiers would be tremendously helpful. Thanks in advance.
Also, here's the dput()
structure(list(Session = c(20764L, 31976L, 27966L, 324L, 6943L,
14899L, 25756L, 27279L, 31981L, 2782L), TotSales = c(5, 5, 20,
1, 25, 2, 9, 5, 1, 520), GameList = structure(c(6L, 9L, 4L, 3L,
5L, 2L, 8L, 7L, 3L, 1L), .Label = c("1046x2:1046x2:1046x1:1046x1:1046x1:1046x4",
"1062x2", "1067x1", "1069x2", "1071x1:1064x1:1038x2:1034x1:ProductE",
"ProductA", "ProductD:ProductD:ProductD:ProductC", "ProductB:ProductB:ProductC",
"ProductB:ProductB:ProductB"), class = "factor")), .Names = c("Session",
"TotSales", "GameList"), row.names = c(320780L, 296529L, 98969L,
47065L, 19065L, 92026L, 327431L, 291843L, 296534L, 15055L), class = "data.frame")
Here is an alternative using data.table. I won't answer all your questions, but this should get you going. First, convert to long format:
library(data.table)
dt <- data.table(df) # assumes your data is in `df`
split_fun <- function(x) {
y <- unlist(strsplit(as.character(x), ":"))
z <- strsplit(y, "(?<=[0-9])+x(?=[0-9]+$)", perl=T)
unlist(lapply(z, function(x) if(length(x) == 2) rep(x[[1]], x[[2]]) else x[[1]]))
}
dt.long <- dt[, list(TotSales, split_fun(GameList)), by=Session]
Now, to answer Q1 (first product in session):
dt.long[, head(V2, 1L), by=Session]
Produces:
Session V1
1: 20764 ProductA
2: 31976 ProductB
3: 27966 1069
4: 324 1067
... 6 rows omitted
And Q3 (aggregate sessions that contain multiple products):
dt.long[,
if(length(items <- .SD[all(c("ProductB") %in% V2), V2])) paste0(items, collapse=", "),
by=Session
]
Produces (note that this sample has no sessions sharing more than one product, but you can easily modify c("ProductB") above to require multiple products in your real data):
Session V1
1: 31976 ProductB, ProductB, ProductB
2: 25756 ProductC, ProductC, ProductB
Q2 is a bit trickier, but I'll leave that one to you. I'm also not 100% sure what you mean by that question. One thing worth highlighting, dt.long here has the products repeated however many times they were "xed". For example, with session 27966, product 1069 shows up twice, so you can count rows for each product if you want:
> dt.long[Session==27966]
Session TotSales V2
1: 27966 25 1069
2: 27966 25 1069
Note that the regular expression we use to split products will work so long as you don't have products with names (not codes) like "BLHABLBHA98877x998".
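For Q2, one possible starting point (a rough sketch building on dt.long from above): enumerate each session's distinct product pairs with combn() and tabulate how often each pair co-occurs.
pairs <- dt.long[, {
  p <- sort(unique(V2))
  if (length(p) > 1L) as.data.table(t(combn(p, 2L))) else NULL
}, by = Session]
pairs[, .N, by = .(V1, V2)][order(-N)]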
You need to parse the GameList column. This is probably kind of slow for bigger datasets, but should show the general idea:
options(stringsAsFactors=FALSE)
DF <- read.table(text="Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:ProductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4", header=TRUE)
# expand to one row per purchased product by splitting GameList on ":"
DF <- do.call(rbind,
              lapply(seq_len(nrow(DF)),
                     function(i) cbind.data.frame(DF[i, -3],
                                                  Game = strsplit(DF$GameList, ":", fixed = TRUE)[[i]])))
# split "1069x2"-style codes on "x"; plain text names get an amount of 1
DF <- cbind(DF,
            t(sapply(strsplit(DF$Game, "x", fixed = TRUE),
                     function(x) { if (length(x) < 2L) x <- c(x, 1); x })))
DF <- DF[,-3]
names(DF)[3:4] <- c("Game", "Amount")
DF$Amount <- as.integer(DF$Amount)
DF$index <- seq_len(nrow(DF))
# Session TotSales Game Amount index
# 1 20764 15 ProductA 1 1
# 2 31976 7 ProductB 1 2
# 3 31976 7 ProductB 1 3
# 4 31976 7 ProductB 1 4
# 31 27966 25 1069 2 5
# 41 324 3 1067 1 6
# 7 6943 28 1071 1 7
# 8 6943 28 1064 1 8
# 9 6943 28 1038 2 9
# 10 6943 28 1034 1 10
# 11 6943 28 ProductE 1 11
# 6 14899 12 1062 2 12
# 13 25756 8 ProductC 1 13
# 14 25756 8 ProductC 1 14
# 15 25756 8 ProductB 1 15
# 16 27279 6 ProductD 1 16
# 17 27279 6 ProductD 1 17
# 18 27279 6 ProductD 1 18
# 19 27279 6 ProductC 1 19
# 91 31981 4 1067 1 20
# 21 2782 529 1046 2 21
# 22 2782 529 1046 2 22
# 23 2782 529 1046 1 23
# 24 2782 529 1046 1 24
# 25 2782 529 1046 1 25
# 26 2782 529 1046 4 26
Note that I assume that there is no x in the product names. If there is, you need a regex as shown by @BrodieG for splitting.
Now you can do things like this:
aggregate(Game~Session, DF, head, 1)
# Session Game
# 1 324 1067
# 2 2782 1046
# 3 6943 1071
# 4 14899 1062
# 5 20764 ProductA
# 6 25756 ProductC
# 7 27279 ProductD
# 8 27966 1069
# 9 31976 ProductB
# 10 31981 1067