Reshape Data Long to Wide - understanding reshape parameters - R

I have a long format dataframe dogs that I'm trying to reformat to wide using the reshape() function. It currently looks like so:
dogid month year trainingtype home school timeincomp
12345 1 2014 1 1 1 340
12345 2 2014 1 1 1 360
31323 12 2015 2 7 3 440
31323 1 2014 1 7 3 500
31323 2 2014 1 7 3 520
The dogid column is a bunch of ids, one for each dog. The month column varies from 1 to 12 for the 12 months, and year from 2014 to 2015. Trainingtype varies from 1 to 2. Each dog has a timeincomp value for every month-year-trainingtype combination, so 48 entries per dog. Home and school vary from 1 to 8 and are constant per dog (every entry for the same dog has the same home and school). timeincomp is my response variable.
I would like my table to look like so:
dogid home school month1year2014trainingtype1 month2year2014trainingtype1
12345 1 1 340 360
31323 7 3 500 520
etc. (with columns for each month-year-trainingtype combination)
What parameters should I use in reshape to achieve this?

You can use the function dcast from the package reshape2; its parameters are easier to understand than reshape's. The left side of the formula is the one that stays long, while the right side is the one that goes wide.
fun.aggregate is the function to apply when there is more than one value per case. If you're sure you don't have repeated cases, you can use mean or sum.
dcast(data, formula = dogid + home + school ~ month + year + trainingtype,
      value.var = 'timeincomp',
      fun.aggregate = sum)
I hope it works:
dogid home school 1_2014_1 2_2014_1 12_2015_2
1 12345 1 1 340 360 0
2 31323 7 3 500 520 440
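Note that the missing combination shows up as 0 above because fun.aggregate = sum returns 0 for empty groups; if you'd rather see NA there, a possible tweak (a sketch using dcast's fill argument) is:
dcast(data, dogid + home + school ~ month + year + trainingtype,
      value.var = 'timeincomp',
      fun.aggregate = sum,
      fill = NA_real_)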

In this case, using base reshape, you essentially want an interaction() of the three time variables to define your wide variables, so:
idvars <- c("dogid","home","school")
grpvars <- c("year","month","trainingtype")
outvar <- "timeincomp"
time <- interaction(dat[grpvars])
reshape(
  cbind(dat[c(idvars, outvar)], time),
  idvar = idvars,
  timevar = "time",
  direction = "wide"
)
# dogid home school timeincomp.2014.1.1 timeincomp.2014.2.1 timeincomp.2015.12.2
#1 12345 1 1 340 360 NA
#3 31323 7 3 500 520 440
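If you want column names closer to the requested month1year2014trainingtype1 pattern, a variation (a sketch on the same dat; the pasted time column is my own construction) builds the time variable with paste0 instead of interaction():
# build a combined time label in the requested naming pattern
dat$time <- with(dat, paste0("month", month, "year", year, "trainingtype", trainingtype))
reshape(dat[c("dogid", "home", "school", "time", "timeincomp")],
        idvar = c("dogid", "home", "school"),
        timevar = "time",
        direction = "wide")
# columns come out as timeincomp.month1year2014trainingtype1 and so on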

You can do the same thing using the new replacement for reshape2, tidyr:
library(tidyr)
library(dplyr)
data %>%
  unite(newcol, c(year, month, trainingtype)) %>%
  spread(newcol, timeincomp)
dogid home school 2014_1_1 2014_2_1 2015_12_2
1 12345 1 1 340 360 NA
2 31323 7 3 500 520 440
First, we unite the year, month and trainingtype columns into a new column called newcol, then we spread the data with timeincomp as our value variable.
The NA appears where we have no value; you can fill it with something else by setting the fill argument of spread().
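For example, to fill missing combinations with 0 instead of NA (a sketch on the same data):
data %>%
  unite(newcol, c(year, month, trainingtype)) %>%
  spread(newcol, timeincomp, fill = 0)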

With tidyr_1.0.0 and above, another option is pivot_wider
library(tidyverse)
df <- tribble(
  ~dogid, ~month, ~year, ~trainingtype, ~home, ~school, ~timeincomp,
   12345,      1,  2014,             1,     1,       1,         340,
   12345,      2,  2014,             1,     1,       1,         360,
   31323,     12,  2015,             2,     7,       3,         440,
   31323,      1,  2014,             1,     7,       3,         500,
   31323,      2,  2014,             1,     7,       3,         520
)
df %>% pivot_wider(
  id_cols = c(dogid, home, school),
  names_from = c(month, year, trainingtype),
  values_from = timeincomp
)
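In newer tidyr versions, names_glue can additionally shape the wide column names, e.g. to reproduce the asker's month1year2014trainingtype1 pattern; a sketch (check that your tidyr version has names_glue):
df %>% pivot_wider(
  id_cols = c(dogid, home, school),
  names_from = c(month, year, trainingtype),
  names_glue = "month{month}year{year}trainingtype{trainingtype}",
  values_from = timeincomp
)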

Related

Using R, how can I count objects according to multiple conditions?

I am trying to count objects in data frame of 911 calls according to certain conditions and I am having trouble with the logic. My actual data has over 3 million rows, so I've tried to simplify my problem by considering this small subset:
dat <- structure(list(call = c("14-1234", "14-4523", "14-7711", "14-8199", "14-3124"),
                      badge = c("8456", "1098", "3432", "4750", "5122"),
                      off.sex = c("Male", "Male", "Female", "Male", "Male"),
                      shift = c("1", "1", "1", "1", "2"),
                      assignedmin = c(1902, 1870, 1950, 1899, 1907),
                      clearedmin = c(1980, 1910, 1990, 1912, 1956)),
                 class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The variable "call" identifies 911 calls, "badge" identifies officers, "shift" basically identifies a stretch of time in a particular area. The specific minute a call comes in is given by "assignedmin" and the call is considered cleared at the time given by "clearedmin."
I want to count how many officers on a given shift are able to respond to a particular call. For example, for call 14-1234, officer 8456 is assigned at time 1902. How many other officers would have been able to respond to that call? Officer 1098 was preoccupied with a different call from minute 1870 to minute 1910, and so would not have been able to respond to the call occurring at minute 1902. However, based on this simple data set officer 3432 would not have been busy at that time and so would be considered available. Officer 5122 was unoccupied at that time, but was on a different shift and so would not be considered available.
Desired output:
call badge off.sex shift assignedmin clearedmin n_shift n_avail n_unavail n_shift_male n_male_avail
1 14-1234 8456 Male 1 1902 1980 4 2 2 3 1
2 14-4523 1098 Male 1 1870 1910 4 4 0 3 3
3 14-7711 3432 Female 1 1950 1990 4 3 1 3 2
4 14-8199 4750 Male 1 1899 1912 4 3 1 3 2
5 14-3124 5122 Male 2 1907 1956 1 1 1 1 1
I hope this is not too convoluted. Basically, at the time given by assignedmin, an officer is available if he or she is on the same shift and not occupied with another call. I can easily count the number of officers on a shift using dplyr and data.table like so:
dat <- dat %>%
  group_by(shift) %>%
  mutate(n_shift = uniqueN(badge),
         n_shift_male = uniqueN(badge[off.sex == 'Male'])) %>%
  ungroup()
An option using data.table: count the number of officers per shift, then perform a non-equi self join to find n_unavail, and finally compute n_avail = n_shift - n_unavail:
library(data.table)
# count officers (and male officers) per shift
setDT(dat)[, c("n_shift", "n_shift_male") := .(.N, sum(off.sex == "Male")), shift]
# non-equi self join: an officer on the same shift is unavailable if their call
# interval covers the focal call's assigned minute; .N - 1L excludes the focal officer
dat[, c("n_unavail", "n_male_not_avail") :=
      dat[dat, on = .(shift, assignedmin <= assignedmin, clearedmin >= assignedmin),
          by = .EACHI, .(.N - 1L, sum(x.off.sex[x.call != i.call] == "Male"))][,
            (1L:3L) := NULL]
]
dat[, c("n_avail", "n_male_avail") := .(n_shift - n_unavail, n_shift_male - n_male_not_avail)]
output:
call badge off.sex shift assignedmin clearedmin n_shift n_shift_male n_unavail n_male_not_avail n_avail n_male_avail
1: 14-1234 8456 Male 1 1902 1980 4 3 2 2 2 1
2: 14-4523 1098 Male 1 1870 1910 4 3 0 0 4 3
3: 14-7711 3432 Female 1 1950 1990 4 3 1 1 3 2
4: 14-8199 4750 Male 1 1899 1912 4 3 1 1 3 2
5: 14-3124 5122 Male 2 1907 1956 1 1 0 0 1 1
The n_unavail column can be filled as below. First, I join the table to itself on shift, so that there is a row for every officer combination within the same shift (this can be infeasible if your dataset is large). Then I calculate whether the other officer (the columns suffixed "_other") was occupied at the time of the call, and count them.
dat %>%
  left_join(dat, by = "shift", suffix = c("", "_other")) %>%
  mutate(unavail = (assignedmin_other < assignedmin & clearedmin_other > assignedmin)) %>%
  group_by(call) %>%
  summarise(n_avail = sum(!unavail),
            n_unavail = sum(unavail))
# call n_avail n_unavail
# <chr> <int> <int>
# 1 14-1234 2 2
# 2 14-3124 1 0
# 3 14-4523 4 0
# 4 14-7711 3 1
# 5 14-8199 3 1
This can be joined to your table to get your desired result.
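For instance, a sketch of that final step (assuming the summary above was saved as avail_counts, a name of my own):
avail_counts <- dat %>%
  left_join(dat, by = "shift", suffix = c("", "_other")) %>%
  mutate(unavail = (assignedmin_other < assignedmin & clearedmin_other > assignedmin)) %>%
  group_by(call) %>%
  summarise(n_avail = sum(!unavail),
            n_unavail = sum(unavail))
dat %>% left_join(avail_counts, by = "call")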

split-apply-combine R

I have a data table with several columns.
Let's say:
Location, which may include Los Angeles, etc.
age_Group, let's say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money in certain intervals, let's say interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity).
How shall I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding up the spending of all people belonging to the same city, age_group, year, and month.
You can first create a new column (spending_cat) using, for example, the cut function. Afterwards you can add the new variable as a grouping variable, and then you just need to count:
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = TRUE),
                 spending = rnorm(1000))
df %>%
  mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
  group_by(group, spending_cat) %>%
  summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
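Applied to the asker's setup, a sketch (using the column names from the question; only the interval endpoints spelled out there are used, with the intermediate breaks left to fill in) would first sum spending per group and then bin the totals:
library(dplyr)
data %>%
  group_by(location, age_Group, year, month) %>%
  summarise(total_spending = sum(spending)) %>%
  mutate(spending_cat = cut(total_spending,
                            breaks = c(1, 100, 1000, Inf),  # illustrative: add the asker's remaining breaks
                            labels = c("interval_1", "interval_2", "interval_20")))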

Add row with group sum in new column at the end of group category

I have been searching for this information since yesterday, but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
I want to add a column with the sum of PRICE by group (for instance, the total amount for concept Lunch, code 1, by name) in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in base R.
Thanks!!
In base R you can use ave to add the new column. We insert the sum of the group only in its last row:
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
  ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
  group_by(CODE, CONCEPT, PNR, NAME) %>%
  mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE), NA))
For a base R option, you may try merging the original data frame with the aggregate:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPTO. PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
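If you also want the total shown only on the last row of each group, as in the desired output, a possible follow-up (a sketch; assumes out was sorted and reassigned as above):
out <- out[with(out, order(CODE, NAME)), ]
# keep the total only on the last row of each NAME/CODE group
last_row <- !duplicated(out[c("NAME", "CODE")], fromLast = TRUE)
out$PRICE[!last_row] <- NA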

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want the Staff ID and Date to be unique in each row, summing Days and averaging Result.
I can't work out how to do this in R; I'm sure I need to do lots of aggregations, but I keep getting results different from what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1, 1, 1, 2, 2),
                 Result = c(50, 75, 60, 20, 11),
                 Date = c(2007, 2006, 2007, 2009, 2009),
                 Days = c(4, 5, 3, 3, 2))
df %>%
  group_by(Staff, Date) %>%
  summarise(Result = floor(mean(Result)),
            Days = sum(Days)) %>%
  data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula, and then merge the two aggregates:
merge(aggregate(Result ~ Staff + Date, data = df, mean),
      aggregate(Days ~ Staff + Date, data = df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group can be found satisfying given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) made during the last 6 years (counting back from the focal year) in the same group?"
Ideally the dataframe should look like this:
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically, for each row we should look into the subset for its group and see if any(dat$year == conditions). It is very easy to do with a for loop, but that is of no use here: the dataframe is massive (several million rows) and a loop would take forever.
I am searching for an efficient way with vectorized functions or a fast package.
Thanks!
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count() - which is also a plyr function.
9M rows took ~4 sec:
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"), size = 9000000, replace = TRUE),
                  year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                                size = 9000000, replace = TRUE))
test <- function(y, g, df) {
  d <- df[df$year >= y - 6 &
          df$year < y &
          df$group == g, ]
  return(nrow(d))
}
rollup <- function() {
  summ <- count(dat)  # add a frequency to each combination
  return(ddply(summ, .(group, year), transform, t = test(as.numeric(year), group, summ) * freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably call it an ugly one) with the package data.table: the idea is to merge the data.table with itself using the fast merge function. This gives every possible combination between a given year of a group and all other years of the same group.
Then proceed with an ifelse for every row, testing the condition you're looking for.
Finally, aggregate everything with sum to know how many times each given year has another year of the same group within the given timespan.
On my computer, it took a few milliseconds, instead of the hours that plyr would probably have taken.
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this:
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then:
z = merge(dat, dat, by = "group", all = TRUE, allow.cartesian = TRUE)  # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0)  # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears)  # we want to sum this up afterwards
z$year.y = NULL  # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another observation of the same group in the previous six years are given a "1":
group year.x sixyears
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake: it deals with NAs seamlessly.
Here's another possibility also using data.table but including diff().
dat <- data.table(group = rep(c("a","b","c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                  key = "group")
valid_case <- subset(dat[, list(valid_case = diff(year)), by = key(dat)],
                     abs(valid_case) < 6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs, since they propagate through diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think about whether avoiding them altogether helps. There's probably a more idiomatic way to express the condition in the ifelse statement using data.table joins; that could potentially speed things up, although my experience has never found %in% to be the limiting factor.
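On that last point, here is a sketch of a non-equi self join (data.table 1.9.8+, where non-equi joins were introduced; the lo/hi helper columns are my own) that counts, for each focal row, other observations of the same group in the preceding six years:
library(data.table)
dat <- data.table(group = rep(c("a","b","c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))
dat[, `:=`(lo = year - 6, hi = year)]  # window [year - 6, year), excluding the focal year
cnt <- dat[dat, on = .(group, year >= lo, year < hi), by = .EACHI,
           .(n = sum(!is.na(x.year)))]  # 0 when no other observation falls in the window
dat[, six_years := as.integer(cnt$n > 0)][, c("lo", "hi") := NULL]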
