Help with a persisting problem when using the 'subset' function in R

I would like to use the subset function in R to extract smaller groups from panel-study time-series data.
My data is a data frame with six columns: district (8 districts), gender, year, month, age interval (4 groups) and a count column.
Example:
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 91
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
I would like to extract a smaller subset for each district, gender and age interval, to get something like this:
District Gender Year Month AgeGroupNew TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
So far I have been trying to use this, thanks to DWin pointing it out in a previous question.
subset(datNew, subset=(District=="Eastern" & Gender=="Female" & AgeGroupNew=="01-4"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
But R keeps giving me the output above, which it shouldn't.
I have tried other combinations with success, but using 'District' in the subset seems to cause the <0 rows> (or 0-length row.names) result.
This works:
> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
District Gender Year Month AgeGroupNew TotalDeaths
77 Eastern Female 2004 8 0 10
269 Eastern Male 2004 8 0 6
461 Khayelitsha Female 2004 8 0 13
653 Khayelitsha Male 2004 8 0 15
845 Klipfontein Female 2004 8 0 7
1037 Klipfontein Male 2004 8 0 6
but not
> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
Any reason why District is causing this? It's absolutely wrong that there are 0 rows with that combination of the subset - there's enough data to my knowledge.
I've tried experimenting - and from other posts, this is a baby step closer to what I want to achieve, but still not working:
> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
5 Eastern Female 2003 2 0 4
9 Eastern Female 2003 3 0 5
13 Eastern Female 2003 4 0 12
17 Eastern Female 2003 5 0 7
21 Eastern Female 2003 6 0 13
With this I am unable to choose any of the other districts, such as "Southern", "Khayelitsha", etc., no matter how I change the index in datNew[[1 or 2 or 3]] and District[[1 or 2 or 3]].
I also don't really know what %in% does above.
I am so stuck. Any help, please.

Prediction: give us the result of str(datNew$District[1]) and all will be revealed. I predict that a non-printing character will show up, perhaps a trailing space (or two).
So, given the result of str(...), the correct code would be:
subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
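The trailing-space diagnosis can be reproduced on a toy data frame. This is a minimal sketch, not the asker's data: `toy` and its values are made up for illustration, and trimws() (available in base R since 3.2.0) is one possible way to clean the column instead of matching the padded string.

```r
# Hypothetical toy data: District values carry a trailing space, as diagnosed above
toy <- data.frame(
  District = c("Eastern ", "Eastern ", "Northern "),
  Gender = c("Female", "Male", "Female"),
  TotalDeaths = c(4, 6, 2),
  stringsAsFactors = FALSE
)

# The padded value is not equal to the clean string, so subset() finds no rows
n_before <- nrow(subset(toy, District == "Eastern" & Gender == "Female"))

# nchar() reveals the extra character that printing hides
district_lengths <- nchar(toy$District)

# Strip leading/trailing whitespace once, then compare against the clean string
toy$District <- trimws(toy$District)
n_after <- nrow(subset(toy, District == "Eastern" & Gender == "Female"))
```

This would also explain why datNew[[1]] %in% District[1] matched in the question: %in% tests whether each element on the left occurs in the vector on the right, and District[1] is itself the padded value, so the comparison was padded string against padded string.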

Related

Trying to combine observations repeated in a single column

Here is my data:
Year Count Common.name County
1 1993 0 Spotted Salamander Bennington
2 1993 6 Spotted Salamander Bennington
3 1993 12 Eastern Newt Bennington
4 1993 23 Eastern Newt Bennington
5 1993 1 American Toad Bennington
6 1993 2 Wood Frog Bennington
Here is what I want my data to look like:
Year Count Common.name County
1 1993 6 Spotted Salamander Bennington
2 1993 35 Eastern Newt Bennington
3 1993 97 American Toad Bennington
4 1993 2 Wood Frog Bennington
5 1993 209 Green Frog Bennington
6 1994 78 Spotted Salamander Chittenden
I have data from 1993 - 2017, sampling different counties on different dates. I would like to combine the year, count, and county for a given species. I don't know how to add them together appropriately.
I think what you need is aggregate.
DAT = read.table(text='Year Count Common.name County
1 1993 0 "Spotted Salamander" Bennington
2 1993 6 "Spotted Salamander" Bennington
3 1993 12 "Eastern Newt" Bennington
4 1993 23 "Eastern Newt" Bennington
5 1993 1 "American Toad" Bennington
6 1993 2 "Wood Frog" Bennington',
header=TRUE)
aggregate(DAT$Count, list(DAT$Year, DAT$Common.name, DAT$County), sum)
Group.1 Group.2 Group.3 x
1 1993 American Toad Bennington 1
2 1993 Eastern Newt Bennington 35
3 1993 Spotted Salamander Bennington 6
4 1993 Wood Frog Bennington 2
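The same aggregation can also be written with aggregate's formula interface, which keeps the original column names instead of Group.1, Group.2, Group.3 and x. A small sketch on the six sample rows above; the data frame is rebuilt with data.frame() only so the example is self-contained.

```r
# The six sample rows from the question
DAT <- data.frame(
  Year = rep(1993, 6),
  Count = c(0, 6, 12, 23, 1, 2),
  Common.name = c("Spotted Salamander", "Spotted Salamander", "Eastern Newt",
                  "Eastern Newt", "American Toad", "Wood Frog"),
  County = rep("Bennington", 6)
)

# Sum Count within each Year / Common.name / County combination
res <- aggregate(Count ~ Year + Common.name + County, data = DAT, FUN = sum)
```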

Sum up values with same ID from different column in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum up all values for the same year to one value per year. So that I receive one value per year in every column. I want to iterate this through the whole data set for all years and countries. Any help is much appreciated. Thank you!
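One possible approach is base R's aggregate(). The sketch below uses the first nine sample rows; the choice of which columns to sum (here protest and duration) is an assumption, since summing id or protestnumber would not give a meaningful total.

```r
# First nine sample rows (1990 to 1992), rebuilt so the example is self-contained
protests <- data.frame(
  country = "Canada",
  ccode = 20,
  year = c(rep(1990, 6), 1991, 1991, 1992),
  region = "North America",
  protest = 1,
  duration = c(1, 1, 1, 57, 2, 1, 8, 5, 2)
)

# One row per country and year; protest and duration are summed within each group
yearly <- aggregate(cbind(protest, duration) ~ country + ccode + year + region,
                    data = protests, FUN = sum)
```

Because country is part of the grouping formula, the same call would cover all countries in the full data set without an explicit loop.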

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); the original data frame has many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
That is, I would like to add rows with value = NA for the missing years, for each country and each disease.
For example, the HIV data for Italy are missing for 2002 to 2004, so I want to add those rows with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.
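The two steps can be looked at separately on the question's example data. This is a sketch: the names grid, filled and key are introduced here for illustration, and the final match() step is an addition that restores indx on the newly created rows, where the merge leaves it NA.

```r
# Reproducible data from the question
indx <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
           rep("cancer", times = 4), rep("hiv", times = 6))
dfl <- data.frame(indx, country, year, death, value = 1:22)

# Step 1: every combination of the unique countries, causes of death and years
grid <- expand.grid(lapply(dfl[c("country", "death", "year")], unique))

# Step 2: left join onto the grid; combinations absent from dfl get NA
filled <- merge(grid, dfl, all.x = TRUE)

# indx is also NA on the added rows, so look it up again by country
key <- unique(dfl[c("indx", "country")])
filled$indx <- key$indx[match(filled$country, key$country)]
```

With 3 countries, 2 causes and 6 years, the grid has 36 rows, of which 22 come from the original data.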
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method (df below is the question's data frame, which is called dfl there). You create two new data.frames, one that contains all combinations of country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two of those columns carry the same information (here indx and country, which map one-to-one), it will over-expand or leave NAs where you don't want them. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or temporarily merge the two columns into one:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
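The over-expansion is easy to see by counting combinations. The sketch below uses base R only and mimics the crossing that complete performs rather than calling tidyr itself; n_crossed, pairs and n_nested are names made up for this illustration.

```r
# Reproducible data from the question
dfl <- data.frame(
  indx = c(rep(1, 9), rep(4, 7), rep(2, 6)),
  country = c(rep("Italy", 9), rep("France", 7), rep("Spain", 6)),
  year = c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005),
  death = c(rep("hiv", 3), rep("cancer", 6), rep("hiv", 3), rep("cancer", 4),
            rep("hiv", 6)),
  value = 1:22
)

# Crossing indx and country independently pairs every index with every country
n_crossed <- nrow(expand.grid(indx = unique(dfl$indx),
                              country = unique(dfl$country),
                              death = unique(dfl$death),
                              year = unique(dfl$year)))

# Treating (indx, country) as one unit keeps only the pairs present in the data
pairs <- unique(dfl[c("indx", "country")])
n_nested <- nrow(pairs) * length(unique(dfl$death)) * length(unique(dfl$year))
```

The nested count of 36 matches the [36 x 5] result shown above, while the independent crossing gives 108 rows. tidyr also provides a nesting() helper intended for this situation, so complete(df, nesting(indx, country), death, year) may be a shorter alternative to the unite/separate round trip; that call is not run here.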

How to extract longitudinal time-series data from a dataframe in R for time-series analysis and imputation

Thanks to joran for helping me to group data in my previous question where I wanted to make a data frame in R smaller so that I can do time-series analysis on the data.
Now I would like to actually further extract data from the dataframe. The dataframe is made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, gender, year, month and age group. The sixth column is the number of death counts for that specific combination. An extract looks like this:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2006 11 01-4 0
Northern Male 2006 11 05-14 1
Northern Male 2006 11 15+ 83
Northern Male 2006 12 0 3
Northern Male 2006 12 01-4 0
Northern Male 2006 12 05-14 0
Northern Male 2006 12 15+ 106
Southern Female 2003 1 0 6
Southern Female 2003 1 01-4 0
Southern Female 2003 1 05-14 3
Southern Female 2003 1 15+ 136
Southern Female 2003 2 0 6
Southern Female 2003 2 01-4 0
Southern Female 2003 2 05-14 1
Southern Female 2003 2 15+ 111
Southern Female 2003 3 0 2
Southern Female 2003 3 01-4 0
Southern Female 2003 3 05-14 1
Southern Female 2003 3 15+ 141
Southern Female 2003 4 0 4
I am new to time-series, and I think I will need to do this to analyse the data: I will need to extract smaller 'time-series' data objects that are unique and longitudinal data. For example from this above dataframe, I want to extract smaller data objects like this for each District, Gender and AgeGroup:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
I tried something in Excel, creating pivot tables with this data, and then tried to extract the string of information, but failed. After that I discovered reshape in R, but I either don't know the code or perhaps should not use reshape to do this.
I am not even certain if this is the correct way to analyse this cross-sectional time-series data, i.e. whether another format is actually required to analyse this data with functions such as read.ts(), ts() and arima().
My eventual aim is to use this data and the amelia2 package with its functions to impute for missing TotalDeaths for certain months in 2007 and 2008, where the data is of course missing.
Any help, how to do this and perhaps suggestions on how to tackle this problem would be gratefully appreciated.
For the narrow question of how to best extract:
subset(dfrm, subset=(District=="Northern" & Gender=="Male" & AgeGroup=="01-4"))
subset also has a select argument to narrow down the columns. I suspect a search on the term "extract" you were using would have only pulled up hits for the ?Extract page which surprisingly has no link to subset. (I trimmed a trailing space from an earlier version of the AgeGroup specification.)
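As an illustration of the select argument, and of how one extracted group might be turned into a monthly series for ts(), here is a sketch on made-up data. The toy data frame, its counts, and the 2003 start date are assumptions for the example, not the asker's data.

```r
# Hypothetical toy data: two districts, one gender and age group, 24 months each
dfrm <- expand.grid(Month = 1:12, Year = 2003:2004,
                    District = c("Northern", "Southern"),
                    Gender = "Male", AgeGroup = "01-4",
                    stringsAsFactors = FALSE)
dfrm$TotalDeaths <- seq_len(nrow(dfrm))  # made-up counts

# Keep one District/Gender/AgeGroup combination and only the columns needed
north <- subset(dfrm,
                subset = District == "Northern" & Gender == "Male" & AgeGroup == "01-4",
                select = c(Year, Month, TotalDeaths))

# Make sure rows are in time order before building the series
north <- north[order(north$Year, north$Month), ]

# Monthly time series starting January 2003
deaths_ts <- ts(north$TotalDeaths, start = c(2003, 1), frequency = 12)
```

To get one such data frame per combination in a single step, split(dfrm, list(dfrm$District, dfrm$Gender, dfrm$AgeGroup), drop = TRUE) returns a named list of subsets.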

Extracting specific data from hierarchical-data in R

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
The Total.Deaths will then be the sum for each of these groups.
So I want it to look like this
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
I have a lot of data and have searched for a few days, but have been unable to find a function to help me do this.
There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:
#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)
#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))
#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
Then we can generate summaries using ddply and summarise from the plyr package:
library(plyr)
datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
TotalDeaths = sum(Total.Deaths))
I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.
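The comparisons on an ordered factor rely on the levels sorting as expected, which can depend on the locale's collation rules. As a hedged alternative, the groups can be assigned by explicit membership, and the totals computed with base R's aggregate() instead of ddply. The sketch below uses only the first month of the sample data and swaps in these base R techniques; it is not the answer's original method.

```r
# First month of the sample data from the question
dat <- data.frame(
  District = "Eastern", Gender = "Female", Year = 2003, Month = 1,
  Age.Group = c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "20-24",
                "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59",
                "60-64", "65-69", "70-74", "75-79", "80-84", "85+"),
  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11),
  stringsAsFactors = FALSE
)

# Recode by explicit membership so the result does not depend on sort order
grp <- as.character(dat$Age.Group)
dat$AgeGroupNew <- ifelse(grp %in% c("-1", "-2", "0"), "0",
                   ifelse(grp == "01-4", "01-4",
                   ifelse(grp %in% c("05-09", "10-14"), "05-14", "15+")))

# Sum deaths within each District / Gender / Year / Month / new age group
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)
```

On this month the 15+ group sums to 91, which agrees with the hand count mentioned above.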
