Extracting specific data from hierarchical data in R

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
The Total.Deaths will then be the sum for each of these groups.
So I want it to look like this
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
I have a lot of data and have searched for a few days, but I have been unable to find a function to help me do this.

There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:
#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)
#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))
#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
Then we can generate summaries using ddply and summarise from the plyr package:
library(plyr)
datNew <- ddply(dat, .(District, Gender, Year, Month, AgeGroupNew), summarise,
                TotalDeaths = sum(Total.Deaths))
I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.
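If you would rather not rely on how the age labels happen to sort, the same recoding can be sketched in base R with an explicit lookup vector and aggregate(). This is only an illustration on the 21 January rows from the question; the names lookup and AgeGroupNew are my own choices, and anything not listed in the lookup is assumed to belong to "15+".

```r
# Toy data: the 21 January rows from the question (Eastern, Female, 2003)
age <- c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "20-24", "25-29",
         "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
         "65-69", "70-74", "75-79", "80-84", "85+")
dat <- data.frame(District = "Eastern", Gender = "Female", Year = 2003, Month = 1,
                  Age.Group = age,
                  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5,
                                   8, 5, 4, 7, 8, 5, 10, 11),
                  stringsAsFactors = FALSE)

# Explicit lookup (assumption: every level not listed here belongs to "15+")
lookup <- c("-1" = "0", "-2" = "0", "0" = "0",
            "01-4" = "01-4", "05-09" = "05-14", "10-14" = "05-14")
dat$AgeGroupNew <- unname(lookup[dat$Age.Group])
dat$AgeGroupNew[is.na(dat$AgeGroupNew)] <- "15+"

# Sum the deaths within each combination of the grouping columns
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)
datNew
```

On these rows the four sums come out as 4, 1, 1 and 91, which agrees with the hand count above.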

Related

R - manipulating time series data

I have a time-series dataset with yearly values for 30 years for >200,000 study units that all start off as the same value of 'healthy==1' and can transition to 3 classes - 'exposed==2', 'infected==3' and 'recover==4'; some units also remain as 'healthy' throughout the time series. The dataset is in long format.
I would like to manipulate the dataset so that it keeps all 30 years for each unit but is collapsed to only 'healthy==1' and 'infected==3', i.e. I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit becomes 'infected==3', it remains infected for the remainder of the time series, even though it might 'recover==4' or change state again (get infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am kinda stumped on how to code this in R; any ideas would be greatly appreciated.
Example of the dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
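To see the logical-index assignment at work, here is a tiny made-up data frame (the values are hypothetical, only the column name comes from the question):

```r
# Hypothetical five-row example
df <- data.frame(annual_change_val = c(1, 2, 3, 4, 2))

# The comparison gives a logical mask: TRUE where the value is 2
mask <- df$annual_change_val == 2

# Assigning through the mask only touches the TRUE positions
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
df$annual_change_val
```

After the two assignments the column holds 1 1 3 3 1, so only the 2s and 4s were changed.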
Update, based on comment/clarification.
Here, I recode the values as before and then create a temp variable called first_inf, which holds the first year in which the UID was "infected" (status=3), or NA if it never was. I then set the status to 3 for that year and every later year (within UID), so a unit stays infected even if it later recovers or changes state again.
library(dplyr)
d %>%
  mutate(status = if_else(annual_change_val %in% c(1, 2), 1, 3)) %>%
  group_by(UID) %>%
  # first infected year per unit; NA for units that are never infected
  mutate(first_inf = if (any(status == 3)) as.numeric(min(year[status == 3])) else NA_real_,
         status = if_else(!is.na(first_inf) & year >= first_inf, 3, status)) %>%
  ungroup() %>%
  select(-first_inf)
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
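For the "once infected, always infected" rule, a base R sketch with ave() and cummax() may also help. The first two units are a four-year excerpt of the question's data; unitX is a hypothetical unit I added so that a return to state 1 after infection is covered. It assumes onset is marked by the value 3 and that rows are sorted by year within each unit.

```r
d <- data.frame(
  UID = rep(c("control1", "control64167", "unitX"), each = 4),
  year = rep(1997:2000, times = 3),
  annual_change_val = c(1, 1, 1, 1,   # control1: always healthy
                        2, 3, 3, 4,   # control64167: exposed, infected, recovered
                        1, 3, 1, 2)   # unitX (hypothetical): drops back after infection
)

# cummax() needs the years in order within each unit
d <- d[order(d$UID, d$year), ]

# Running flag per unit: 0 until the first 3 appears, 1 from then on
ever_inf <- ave(as.integer(d$annual_change_val == 3), d$UID, FUN = cummax)
d$status <- ifelse(ever_inf == 1, 3, 1)
d
```

Here control1 stays at 1 for all four years, while the other two units are 1 in 1997 and 3 from 1998 onwards.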

Fill values in between values in rows in R based on condition

I have data that look like this:
id <- c(rep(1,5), rep(2,5), rep(3,4), rep(4,2), rep(5, 1))
year <- c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1994,1990,1994, 1994)
gender <- c(rep("female", 5), rep("male", 5), rep("male", 4), rep("female", 2), rep("male", 1))
dat <- data.frame(id,year,gender)
As you can see, id 1 and 2 have observations for every year between 1990 and 1994, while there are missing observations in between 1990 and 1994 for ids 3 and 4, and, finally, only one observation for id 5.
What I want to do is to copy columns id and gender and insert the missing observations for ids 3 and 4 so that there are observations from 1990 to 1994, while I want to do nothing with ids 1, 2 or 5. Is there a way to create a sequence of numbers from the oldest to the newest observation, based on the condition that there is a gap between two numbers, grouped by a variable such as id?
The final result should look like this:
id year gender
<dbl> <dbl> <chr>
1 1 1990 female
2 1 1991 female
3 1 1992 female
4 1 1993 female
5 1 1994 female
6 2 1990 male
7 2 1991 male
8 2 1992 male
9 2 1993 male
10 2 1994 male
11 3 1990 male
12 3 1991 male
13 3 1992 male
14 3 1993 male
15 3 1994 male
16 4 1990 female
17 4 1991 female
18 4 1992 female
19 4 1993 female
20 4 1994 female
21 5 1994 male
Filter the dataset for id 3 and 4, complete their observations and bind the data to other id's where id is not 3 and 4.
library(dplyr)
library(tidyr)
complete_id <- c(3, 4)
dat %>%
filter(id %in% complete_id) %>%
complete(id, year = 1990:1994) %>%
fill(gender) %>%
bind_rows(dat %>% filter(!id %in% complete_id)) %>%
arrange(id)
# id year gender
#1 1 1990 female
#2 1 1991 female
#3 1 1992 female
#4 1 1993 female
#5 1 1994 female
#6 2 1990 male
#7 2 1991 male
#8 2 1992 male
#9 2 1993 male
#10 2 1994 male
#11 3 1990 male
#12 3 1991 male
#13 3 1992 male
#14 3 1993 male
#15 3 1994 male
#16 4 1990 female
#17 4 1991 female
#18 4 1992 female
#19 4 1993 female
#20 4 1994 female
#21 5 1994 male
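If you would rather not hard-code which ids need completing, one possible base R sketch builds the full year sequence from each id's own oldest to newest observation. It assumes gender is constant within an id (true for the example data); the name filled is my own.

```r
id <- c(rep(1, 5), rep(2, 5), rep(3, 4), rep(4, 2), rep(5, 1))
year <- c(1990, 1991, 1992, 1993, 1994, 1990, 1991, 1992, 1993, 1994,
          1990, 1991, 1992, 1994, 1990, 1994, 1994)
gender <- c(rep("female", 5), rep("male", 5), rep("male", 4),
            rep("female", 2), rep("male", 1))
dat <- data.frame(id, year, gender)

# For each id, rebuild the rows from its first to its last observed year
filled <- do.call(rbind, lapply(split(dat, dat$id), function(g) {
  data.frame(id = g$id[1],
             year = seq(min(g$year), max(g$year)),
             gender = g$gender[1])   # assumption: one gender per id
}))
rownames(filled) <- NULL
filled
```

An id with a single observation, such as id 5, keeps its one row because its first and last year coincide.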

How can I plot a histogram of crime type vs HOURS in R

I have a big dataset with different variables, and I want to make a histogram of type of crime against HOURS. How can I do that in R?
DATE TIME PLACE ZONE TYPE.OF.CRIME WEEK
1 2011/01/01 23:00 KIEPIES CLUB <NA> ARMED ROBBERY 1
2 2011/01/03 10:00 AUSSPANNPLATZ Zone 14 ARMED ROBBERY 1
3 2011/01/07 14:00 UNAM BUSHES Zone 16 ARMED ROBBERY 1
4 2011/01/08 21:34 TOTAL SERV. STATION, KHOMASDAL Zone 9 ARMED ROBBERY 1
5 2011/01/15 <NA> WOODPALM STR 625 Zone 11 ARMED ROBBERY 2
6 2011/01/03 14:03 C KANDOVAZU STR Zone 5 ASSAULT GBH 1
HOUR day month year HOURS
1 23 1 1 2011 23
2 10 3 1 2011 10
3 14 7 1 2011 14
4 21 8 1 2011 21
5 <NA> 15 1 2011 <NA>
6 14 3 1 2011 14
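library(ggplot2)
# A histogram bins one numeric variable, so HOURS (converted to numeric) goes on x and crime type becomes the fill
ggplot(df, aes(x = as.numeric(as.character(HOURS)), fill = TYPE.OF.CRIME)) +
  geom_histogram(binwidth = 1)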
Something like this should work.
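Before plotting, it can be worth checking the counts the plot is supposed to show. A rough base R sketch using only the six rows printed in the question (so the data frame here is a cut-down stand-in for the real one):

```r
# The six rows shown in the question, reduced to the two columns of interest
df <- data.frame(
  TYPE.OF.CRIME = c(rep("ARMED ROBBERY", 5), "ASSAULT GBH"),
  HOURS = c(23, 10, 14, 21, NA, 14)
)

# Cross-tabulate crime type by hour; rows with a missing hour are dropped by default
counts <- table(df$TYPE.OF.CRIME, df$HOURS)
counts

# Uncomment to draw grouped bars, one group per hour:
# barplot(counts, beside = TRUE, legend.text = TRUE, xlab = "Hour", ylab = "Crimes")
```

Each cell of counts is the number of crimes of that type in that hour, which is what a histogram split by crime type displays.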

Help with persisting problem when using 'subset' function in R

I would like to use the subset function in R to extract smaller groups of panel study time series data.
My data consists of a dataframe made up of six columns: district(8 districts), gender, age interval(4 groups), year, month and a count column.
Example:
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 91
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
I would like to extract a smaller subset for each district, gender and age interval to get something like this:
District Gender Year Month AgeGroupNew TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
So far I have been trying to use this, thanks to DWin pointing it out in a previous question.
subset(datNew, subset=(District=="Eastern" & Gender=="Female" & AgeGroupNew=="01-4"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
But R keeps on giving me the output as above - which it shouldn't.
I have tried other combinations with success, but it seems using 'District' in the subset causes this <0 rows> (or 0-length row.names).
This works:
> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
District Gender Year Month AgeGroupNew TotalDeaths
77 Eastern Female 2004 8 0 10
269 Eastern Male 2004 8 0 6
461 Khayelitsha Female 2004 8 0 13
653 Khayelitsha Male 2004 8 0 15
845 Klipfontein Female 2004 8 0 7
1037 Klipfontein Male 2004 8 0 6
but not
> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
Any reason why District is causing this? It's absolutely wrong that there are 0 rows with that combination of the subset - there's enough data to my knowledge.
I've tried experimenting - and from other posts, this is a baby step closer to what I want to achieve, but still not working:
> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
5 Eastern Female 2003 2 0 4
9 Eastern Female 2003 3 0 5
13 Eastern Female 2003 4 0 12
17 Eastern Female 2003 5 0 7
21 Eastern Female 2003 6 0 13
With this I am unable to choose any of the other districts, such as "Southern", "Khayelitsha", etc., no matter how I change datNew[[1 or 2 or 3]] and District[[1 or 2 or 3]].
I don't really know what %in% does above.
I am so stuck. Any help, please.
Prediction: Give us the results of str(datNew$District[1]) and all will be revealed. I predict there is a non-printing character that will show up, perhaps a trailing space (or two).
So with the results of str(...) the correct code would be:
subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
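Rather than typing the trailing space every time, the stray whitespace could be stripped once with trimws(). The three-row data frame below is a made-up stand-in that only mimics the trailing space; it also shows why the %in% District[1] attempt matched, since inside subset() District[1] is simply the first value of the column, space included.

```r
# Hypothetical stand-in for datNew, with a trailing space in District
datNew <- data.frame(
  District = c("Eastern ", "Eastern ", "Southern "),
  Gender = c("Female", "Male", "Female"),
  AgeGroupNew = c("0", "0", "0"),
  stringsAsFactors = FALSE
)

# Exact match fails because "Eastern" is not "Eastern "
n_before <- nrow(subset(datNew, District == "Eastern"))

# District[1] is "Eastern " (with the space), so %in% finds the rows
n_in <- nrow(subset(datNew, District %in% District[1]))

# Strip leading/trailing whitespace once, then match the clean label
datNew$District <- trimws(datNew$District)
n_after <- nrow(subset(datNew, District == "Eastern" & Gender == "Female"))
c(n_before, n_in, n_after)
```

Comparing nchar(datNew$District) before and after trimming is another quick way to confirm that whitespace was the culprit.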

How to extract longitudinal time-series data from a dataframe in R for time-series analysis and imputation

Thanks to joran for helping me to group data in my previous question where I wanted to make a data frame in R smaller so that I can do time-series analysis on the data.
Now I would like to actually further extract data from the dataframe. The dataframe is made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, gender, year, month and age group. The sixth column is the number of death counts for that specific combination. An extract looks like this:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2006 11 01-4 0
Northern Male 2006 11 05-14 1
Northern Male 2006 11 15+ 83
Northern Male 2006 12 0 3
Northern Male 2006 12 01-4 0
Northern Male 2006 12 05-14 0
Northern Male 2006 12 15+ 106
Southern Female 2003 1 0 6
Southern Female 2003 1 01-4 0
Southern Female 2003 1 05-14 3
Southern Female 2003 1 15+ 136
Southern Female 2003 2 0 6
Southern Female 2003 2 01-4 0
Southern Female 2003 2 05-14 1
Southern Female 2003 2 15+ 111
Southern Female 2003 3 0 2
Southern Female 2003 3 01-4 0
Southern Female 2003 3 05-14 1
Southern Female 2003 3 15+ 141
Southern Female 2003 4 0 4
I am new to time-series, and I think I will need to do this to analyse the data: I will need to extract smaller 'time-series' data objects that are unique and longitudinal data. For example from this above dataframe, I want to extract smaller data objects like this for each District, Gender and AgeGroup:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
I tried something in Excel, creating pivot tables with this data, and then tried to extract the string of information - but failed. After that I discovered reshape in R, but I either don't know the code or perhaps should not use reshape to do this.
I am not even certain if this is the correct way to analyse this cross-sectional time-series data, i.e. whether another format is actually required to analyse this data with functions such as read.ts(), ts() and arima().
My eventual aim is to use this data and the amelia2 package with its functions to impute for missing TotalDeaths for certain months in 2007 and 2008, where the data is of course missing.
Any help, how to do this and perhaps suggestions on how to tackle this problem would be gratefully appreciated.
For the narrow question of how to best extract:
subset(dfrm, subset=(District=="Northern" & Gender=="Male" & AgeGroup=="01-4"))
subset also has a select argument to narrow down the columns. I suspect a search on the term "extract" you were using would have only pulled up hits for the ?Extract page which surprisingly has no link to subset. (I trimmed a trailing space from an earlier version of the AgeGroup specification.)
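To get every District/Gender/AgeGroup combination at once instead of writing one subset() call per combination, split() followed by ts() is one possible route. This is a sketch on invented counts (rpois is used purely to fill the column), and it assumes each combination has one row per month with no gaps; pieces and series are names I made up.

```r
# Made-up monthly data: 2 districts x 2 genders x 2 age groups x 24 months
dfrm <- expand.grid(Month = 1:12, Year = 2003:2004,
                    AgeGroup = c("0", "01-4"),
                    Gender = c("Female", "Male"),
                    District = c("Northern", "Southern"),
                    stringsAsFactors = FALSE)
set.seed(1)
dfrm$TotalDeaths <- rpois(nrow(dfrm), lambda = 3)  # invented counts, illustration only

# Make sure rows are in time order within each combination
dfrm <- dfrm[order(dfrm$District, dfrm$Gender, dfrm$AgeGroup, dfrm$Year, dfrm$Month), ]

# One data frame per District/Gender/AgeGroup combination
pieces <- split(dfrm, list(dfrm$District, dfrm$Gender, dfrm$AgeGroup), drop = TRUE)

# Turn each piece into a monthly ts object starting at its first year and month
series <- lapply(pieces, function(p) {
  ts(p$TotalDeaths, start = c(p$Year[1], p$Month[1]), frequency = 12)
})
series[["Northern.Male.01-4"]]
```

Each list element is named District.Gender.AgeGroup, so a single series can be pulled out by name and passed on to functions such as arima().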
