Trying to combine observations repeated in a single column - r

Here is my data:
Year Count Common.name County
1 1993 0 Spotted Salamander Bennington
2 1993 6 Spotted Salamander Bennington
3 1993 12 Eastern Newt Bennington
4 1993 23 Eastern Newt Bennington
5 1993 1 American Toad Bennington
6 1993 2 Wood Frog Bennington
Here is what I want my data to look like:
Year Count Common.name County
1 1993 6 Spotted Salamander Bennington
2 1993 35 Eastern Newt Bennington
3 1993 97 American Toad Bennington
4 1993 2 Wood Frog Bennington
5 1993 209 Green Frog Bennington
6 1994 78 Spotted Salamander Chittenden
I have data from 1993 to 2017, sampling different counties on different dates. I would like one row per species, year, and county, with the counts for that combination added together. I don't know how to sum them appropriately.

I think what you need is aggregate.
DAT = read.table(text='Year Count Common.name County
1 1993 0 "Spotted Salamander" Bennington
2 1993 6 "Spotted Salamander" Bennington
3 1993 12 "Eastern Newt" Bennington
4 1993 23 "Eastern Newt" Bennington
5 1993 1 "American Toad" Bennington
6 1993 2 "Wood Frog" Bennington',
header=TRUE)
aggregate(DAT$Count, list(DAT$Year, DAT$Common.name, DAT$County), sum)
Group.1 Group.2 Group.3 x
1 1993 American Toad Bennington 1
2 1993 Eastern Newt Bennington 35
3 1993 Spotted Salamander Bennington 6
4 1993 Wood Frog Bennington 2
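If the generic Group.1 / x column names are a nuisance, the formula interface of aggregate keeps the original names. This is a minimal sketch on the same sample data; the object name totals is only illustrative.

```r
# Same sample rows as above; the unnamed first column becomes row names
DAT <- read.table(text = 'Year Count Common.name County
1 1993 0 "Spotted Salamander" Bennington
2 1993 6 "Spotted Salamander" Bennington
3 1993 12 "Eastern Newt" Bennington
4 1993 23 "Eastern Newt" Bennington
5 1993 1 "American Toad" Bennington
6 1993 2 "Wood Frog" Bennington',
header = TRUE)

# Sum Count within each Year / Common.name / County combination
totals <- aggregate(Count ~ Year + Common.name + County, data = DAT, FUN = sum)
totals
```

The result has the columns Year, Common.name, County and Count, with one row per combination.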

Related

Insert missing rows in a dataframe based in values criteria

Given a dataframe like this
country rest count
Argentina pizza 26
Argentina asador 22
Brazil feijoada 52
Brazil pizza 67
Germany pizza 22
Germany biergarten 52
Germany kebab 20
Suppose we want every unique value in the 'rest' column to appear once for every country in the dataframe, with a count of 0 where that combination has no row. My desired output would look like this:
country rest count
Argentina pizza 26
Argentina asador 22
Argentina feijoada 0
Argentina biergarten 0
Argentina kebab 0
Brazil pizza 67
Brazil feijoada 52
Brazil asador 0
Brazil biergarten 0
Brazil kebab 0
Germany pizza 22
Germany biergarten 52
Germany kebab 20
Germany asador 0
Germany feijoada 0
Is there any simple way to reach this output through dplyr?
tidyr::complete(dat, country, rest, fill=list(count=0))
# # A tibble: 15 x 3
# country rest count
# <chr> <chr> <dbl>
# 1 Argentina asador 22
# 2 Argentina biergarten 0
# 3 Argentina feijoada 0
# 4 Argentina kebab 0
# 5 Argentina pizza 26
# 6 Brazil asador 0
# 7 Brazil biergarten 0
# 8 Brazil feijoada 52
# 9 Brazil kebab 0
# 10 Brazil pizza 67
# 11 Germany asador 0
# 12 Germany biergarten 52
# 13 Germany feijoada 0
# 14 Germany kebab 20
# 15 Germany pizza 22
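For anyone without tidyr, roughly the same completion can be sketched in base R by building every country/rest pair and left-joining the data onto it. The names grid and full are made up for this example, and dat is rebuilt here from the table in the question.

```r
dat <- data.frame(
  country = c("Argentina", "Argentina", "Brazil", "Brazil",
              "Germany", "Germany", "Germany"),
  rest    = c("pizza", "asador", "feijoada", "pizza",
              "pizza", "biergarten", "kebab"),
  count   = c(26, 22, 52, 67, 22, 52, 20),
  stringsAsFactors = FALSE
)

# Every combination of country and rest (3 countries x 5 values)
grid <- expand.grid(country = unique(dat$country),
                    rest    = unique(dat$rest),
                    stringsAsFactors = FALSE)

# Left join: combinations missing from dat get NA, which is then set to 0
full <- merge(grid, dat, by = c("country", "rest"), all.x = TRUE)
full$count[is.na(full$count)] <- 0
full <- full[order(full$country, full$rest), ]
full
```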

Sum up values with same ID from different column in R [duplicate]

My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum all values for the same year so that I end up with one value per year in every column. I want to do this across the whole data set, for all years and countries. Any help is much appreciated. Thank you!
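One possible approach is base R's aggregate with cbind on the left-hand side of the formula. This is only a sketch: it uses a handful of the rows above, keeps just the columns needed, and assumes that protest and duration are the columns worth summing (summing protestnumber or id would probably not be meaningful). The names protests and yearly are illustrative.

```r
# A few of the rows from the question, reduced to the relevant columns
protests <- read.table(text = "id country year protest duration
201990001 Canada 1990 1 1
201990002 Canada 1990 1 1
201990004 Canada 1990 1 57
201991001 Canada 1991 1 8
201991002 Canada 1991 1 5
201992001 Canada 1992 1 2",
header = TRUE)

# One row per country and year, with protest and duration summed
yearly <- aggregate(cbind(protest, duration) ~ country + year,
                    data = protests, FUN = sum)
yearly
```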

How to remove rows in data frame after frequency tables in R

I have 3 data frames from which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured in a manner similar to a data frame called x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the frequency of each country per continent, so I did
x2<-table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. For the example above it will be the 2 countries in Asia and I want my final data set to look like presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
group_by(Continent) %>%
filter(n()>2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or using the x2
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
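Since the question asks for one method that works on all three data frames, any of the approaches above can be wrapped in a small function and applied with lapply. The sketch below uses the table-based variant; drop_small_groups, its arguments, and the second data frame in the list are all made up for illustration.

```r
# Keep only rows whose group occurs at least min_n times
drop_small_groups <- function(df, group_col = "Continent", min_n = 3) {
  counts <- table(df[[group_col]])
  keep <- names(counts)[counts >= min_n]
  df[df[[group_col]] %in% keep, , drop = FALSE]
}

x <- data.frame(
  Country   = c("Kenya", "Gabon", "Spain", "Belgium", "China",
                "Nigeria", "Holland", "Italy", "Japan"),
  Continent = c("Africa", "Africa", "Europe", "Europe", "Asia",
                "Africa", "Europe", "Europe", "Asia"),
  Ranking   = c(17, 23, 4, 3, 10, 14, 1, 5, 6),
  stringsAsFactors = FALSE
)

# Stand-in for the three real data frames
frames <- list(a = x, b = x[x$Continent != "Europe", ])
cleaned <- lapply(frames, drop_small_groups)
cleaned$a
```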

Help with persisting problem when using 'subset' function in R

I would like to use the subset function in R to extract smaller groups of panel study time series data.
My data consists of a dataframe made up of six columns: district(8 districts), gender, age interval(4 groups), year, month and a count column.
Example:
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 91
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
I would like to extract smaller subset for each district, Gender and age interval to get something like this:
District Gender Year Month AgeGroupNew TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
So far I have been trying to use this, thanks to DWin pointing it out in a previous question.
subset(datNew, subset=(District=="Eastern" & Gender=="Female" & AgeGroupNew=="01-4"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
But R keeps on giving me the output as above - which it shouldn't.
I have tried other combinations with success, but it seems using 'District' in the subset causes this <0 rows> (or 0-length row.names).
This works:
> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
District Gender Year Month AgeGroupNew TotalDeaths
77 Eastern Female 2004 8 0 10
269 Eastern Male 2004 8 0 6
461 Khayelitsha Female 2004 8 0 13
653 Khayelitsha Male 2004 8 0 15
845 Klipfontein Female 2004 8 0 7
1037 Klipfontein Male 2004 8 0 6
but not
> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
Any reason why District is causing this? It's absolutely wrong that there are 0 rows with that combination of the subset - there's enough data to my knowledge.
I've tried experimenting - and from other posts, this is a baby step closer to what I want to achieve, but still not working:
> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
5 Eastern Female 2003 2 0 4
9 Eastern Female 2003 3 0 5
13 Eastern Female 2003 4 0 12
17 Eastern Female 2003 5 0 7
21 Eastern Female 2003 6 0 13
With this I am unable to choose any of the other districts, such as "Southern", "Khayelitsha", etc., no matter how I change the indices in datNew[[1 or 2 or 3]] and District[[1 or 2 or 3]].
I don't really know what %in% does above?
I am so stuck. Any help, please.
Prediction: Give us the results of str(datNew$District[1]) and all will be revealed. I predict there is a non-printing character that will show up, perhaps a trailing space (or two).
So with the results of str(...) the correct code would be:
subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
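Rather than typing the trailing space into every comparison, the column can be cleaned once with trimws (base R, 3.2.0 and later). The sketch below uses a tiny made-up datNew with the suspected trailing spaces. It also shows what %in% does: it tests whether each value on the left appears anywhere in the vector on the right. That is why datNew[[1]] %in% District[1] matched in the question: inside subset, District[1] is the first stored value, trailing space included.

```r
# Toy data with the suspected trailing space in District (assumed values)
datNew <- data.frame(
  District    = c("Eastern ", "Eastern ", "Northern "),
  Gender      = c("Female", "Male", "Male"),
  AgeGroupNew = c("0", "0", "01-4"),
  TotalDeaths = c(4, 6, 0),
  stringsAsFactors = FALSE
)

# Before cleaning: "Eastern" does not equal "Eastern "
before <- nrow(subset(datNew, District == "Eastern"))

# Strip leading/trailing whitespace once, then compare normally
datNew$District <- trimws(datNew$District)
east <- subset(datNew, District == "Eastern" & Gender == "Female" & AgeGroupNew == "0")

# %in% selects rows whose District is any of the listed values
males <- subset(datNew, District %in% c("Eastern", "Northern") & Gender == "Male")
```

If District is stored as a factor, trimws returns a character vector, which is usually fine for subsetting.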

Extracting specific data from hierarchical-data in R

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
The Total.Deaths will then be the sum for each of these groups.
So I want it to look like this
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
I have a lot of data and have searched for a few days, but have been unable to find a function to help me do this.
There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:
#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)
#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))
#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
Then we can generate summaries using ddply and summarise from the plyr package:
library(plyr)
datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
TotalDeaths = sum(Total.Deaths))
I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.
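The comparison-based recode above relies on how the factor levels happen to sort, which can depend on the locale's collation rules. A sketch that avoids this by listing the old codes explicitly, and that uses base aggregate in place of ddply, might look like the following. The data frame is rebuilt here from the first month of the question's sample.

```r
dat <- data.frame(
  District = "Eastern", Gender = "Female", Year = 2003, Month = 1,
  Age.Group = c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19",
                "20-24", "25-29", "30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74", "75-79",
                "80-84", "85+"),
  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8,
                   5, 10, 11),
  stringsAsFactors = FALSE
)

# Explicit membership tests; anything not listed falls into "15+"
dat$AgeGroupNew <- ifelse(dat$Age.Group %in% c("-1", "-2", "0"), "0",
                   ifelse(dat$Age.Group == "01-4", "01-4",
                   ifelse(dat$Age.Group %in% c("05-09", "10-14"), "05-14",
                          "15+")))

# Sum deaths within each district/gender/year/month/new age group
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)
datNew
```

On these rows the 15+ group sums to 91, which agrees with the hand count mentioned above.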
