Subsetting two factors in R

I have a huge dataset, and I have a column called season. There are 4 seasons i.e. Winter, Spring, Summer and Autumn.
Region Year Male Female Area DATE Day Month Season
WEST 1996 0 1 4 06-04-96 Saturday April Spring
EAST 1996 0 1 16 29-06-96 Saturday June Summer
WEST 1996 0 1 4 19-10-96 Saturday October Winter
WEST 1996 0 1 4 20-10-96 Sunday October Winter
EAST 1996 0 1 16 01-11-96 Friday November Winter
EAST 1996 0 1 16 11-11-96 Monday November Winter
WEST 1996 0 1 4 19-11-96 Tuesday November Winter
WEST 1996 0 1 4 28-11-96 Thursday November Winter
WEST 1996 0 1 4 10-12-96 Tuesday December Winter
WEST 1997 0 1 4 17-01-97 Friday January Winter
WEST 1997 0 1 4 28-03-97 Friday March Spring
I am trying to create a subset that contains only the entries whose Season is Winter or Autumn.
First I created a subset of the portion I want:
secondphase<-subset(eb1, Area>16)
Now, from this subset, I want the rows where Season is Winter or Autumn.
I tried this code:
th2<-subset(secondphase, Season== "Winter")
th3<-subset(secondphase, Season=="Autumn")
Is there a way to merge these two subsets? Or can I create a single subset with both conditions, i.e. Area > 16 and Season equal to Winter or Autumn?
Thanks for the Help.

You could also use the dplyr package with the filter function:
library(dplyr)
filter(secondphase, grepl("Winter|Autumn", Season))
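One caveat with the grepl approach is that it matches substrings, not whole values. The sketch below uses only base R and a made-up season vector (the "Late Winter" value is hypothetical and does not appear in the question's data) to compare an unanchored pattern, an anchored pattern, and %in%:

```r
# Hypothetical values; "Late Winter" is invented to show the substring issue
season <- c("Winter", "Autumn", "Summer", "Late Winter")

# Unanchored pattern: matches anything that contains Winter or Autumn
loose <- grepl("Winter|Autumn", season)

# Anchored pattern: matches only the exact values
strict <- grepl("^(Winter|Autumn)$", season)

# %in% tests exact membership and needs no regular expression
exact <- season %in% c("Winter", "Autumn")

print(data.frame(season, loose, strict, exact))
```

If the Season column only ever holds the four season names, all three give the same answer, so the choice is mostly a matter of taste.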

Method 1
my_subset <- eb1[eb1$Season %in% c("Winter", "Autumn") & eb1$Area > 16,]
Method 2
th2 <- subset(secondphase, Season== "Winter")
th3 <- subset(secondphase, Season=="Autumn")
final <- rbind(th2, th3)
Method 3
final <-subset(eb1[eb1$Area > 16,], Season== "Winter" | Season=="Autumn")

With a data.table approach,
library("data.table")
DT<-data.table(eb1)
subsetDT<-subset(DT, Season %in% c("Autumn","Winter") & Area > 16)
does the job.
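For anyone who wants to try the %in% approach without the original file, here is a minimal sketch on a small made-up data frame (the Area values are hypothetical, chosen so that some rows pass Area > 16). It also shows droplevels, which may be useful because a subset of a factor keeps its unused levels. Note that writing Season == c("Winter", "Autumn") would not do the same thing, since == compares element by element with recycling.

```r
# Hypothetical stand-in for eb1; only the columns used in the filter
eb1 <- data.frame(
  Region = c("WEST", "EAST", "WEST", "EAST", "WEST"),
  Area   = c(4, 20, 18, 25, 30),
  Season = c("Spring", "Summer", "Winter", "Autumn", "Winter"),
  stringsAsFactors = TRUE
)

# One step: both conditions combined with &
both <- subset(eb1, Area > 16 & Season %in% c("Winter", "Autumn"))

# Drop the factor levels that no longer occur (Spring, Summer)
both$Season <- droplevels(both$Season)

print(both)
print(levels(both$Season))
```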

Related

String comparison in 2 adjacent rows of a data frame in R

I have a data frame with 627 observations and 16 variables. I am considering one column named "ZoneDivision", which is a factor with the levels North Eastern, Eastern and South Eastern.
I want to compare adjacent row values and create a new column that is 1 if two adjacent rows have the same zone and 0 if they differ.
I referred to the following links to find a way out:
[here] Matching two Columns in R
[here] compare row values over multiple rows (R)
library(dplyr)
a <- c(rep("Eastern",3),rep ("North Eastern", 6),rep("South Eastern", 3))
a=data.frame(a)
colnames(a)="ZoneDivision"
#comparing the zones
library(plyr)
ddply(n, .(ZoneDivision),summarize,ZoneMatching=Position(isTRUE,ZoneDivision))
Expected Result
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
Actual Result
ZoneDivision ZoneMatching
1 Eastern NA
2 North Eastern NA
3 South Eastern NA
How should I proceed? Please help!!
Using base R, we can do
as.numeric(c(NA, a$ZoneDivision[-1] == a$ZoneDivision[-nrow(a)]))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1
The data.table way:
a <- c(rep("Eastern",3),rep ("North Eastern", 6),rep("South Eastern", 3))
dt <- as.data.table(a)
dt[,'ZoneMatching' := as.numeric(.SD[,a] == shift(.SD[,a],1))]
This adds a new ZoneMatching column holding the numeric value of the logical comparison between the a column and its lagged values, which the shift() function generates.
You can use lag to get that:
library(dplyr)
a %>%
mutate(ZoneMatching = as.numeric((ZoneDivision == lag(ZoneDivision, 1))))
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
We can use base R
with(a, c(NA, +(head(ZoneDivision, -1) == tail(ZoneDivision, -1))))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1
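All of the answers above do the same thing: compare each value with the one before it. A self-contained base R sketch of that idea, spelling out the lag by hand (the names zones, z and previous are just illustrative), might look like this:

```r
# Rebuild the example column from the question
zones <- data.frame(
  ZoneDivision = c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3)),
  stringsAsFactors = TRUE
)

# Work on character values so the comparison does not depend on factor levels
z <- as.character(zones$ZoneDivision)

# "Lag by one": shift everything down a row; the first row has no predecessor
previous <- c(NA, z[-length(z)])

# TRUE/FALSE/NA becomes 1/0/NA
zones$ZoneMatching <- as.integer(z == previous)

print(zones)
```

The first row is NA because there is nothing to compare it with, which matches the expected result in the question.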

Trying to combine observations repeated in a single column

Here is my data:
Year Count Common.name County
1 1993 0 Spotted Salamander Bennington
2 1993 6 Spotted Salamander Bennington
3 1993 12 Eastern Newt Bennington
4 1993 23 Eastern Newt Bennington
5 1993 1 American Toad Bennington
6 1993 2 Wood Frog Bennington
Here is what I want my data to look like:
Year Count Common.name County
1 1993 6 Spotted Salamander Bennington
2 1993 35 Eastern Newt Bennington
3 1993 97 American Toad Bennington
4 1993 2 Wood Frog Bennington
5 1993 209 Green Frog Bennington
6 1994 78 Spotted Salamander Chittenden
I have data from 1993 - 2017, sampling different counties on different dates. I would like to combine the year, count, and county for a given species. I don't know how to add them together appropriately.
I think what you need is aggregate.
DAT = read.table(text='Year Count Common.name County
1 1993 0 "Spotted Salamander" Bennington
2 1993 6 "Spotted Salamander" Bennington
3 1993 12 "Eastern Newt" Bennington
4 1993 23 "Eastern Newt" Bennington
5 1993 1 "American Toad" Bennington
6 1993 2 "Wood Frog" Bennington',
header=TRUE)
aggregate(DAT$Count, list(DAT$Year, DAT$Common.name, DAT$County), sum)
Group.1 Group.2 Group.3 x
1 1993 American Toad Bennington 1
2 1993 Eastern Newt Bennington 35
3 1993 Spotted Salamander Bennington 6
4 1993 Wood Frog Bennington 2
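If the Group.1 / Group.2 / x column names are a nuisance, the formula interface of aggregate keeps the original names. A small sketch using the same six rows (built with data.frame here instead of read.table, purely for convenience):

```r
# The six sample rows from the question
DAT <- data.frame(
  Year        = rep(1993, 6),
  Count       = c(0, 6, 12, 23, 1, 2),
  Common.name = c("Spotted Salamander", "Spotted Salamander",
                  "Eastern Newt", "Eastern Newt",
                  "American Toad", "Wood Frog"),
  County      = rep("Bennington", 6)
)

# Sum Count within each Year / species / County combination
totals <- aggregate(Count ~ Year + Common.name + County, data = DAT, FUN = sum)

print(totals)
```

The result has columns Year, Common.name, County and Count, one row per combination.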

Sum up values with same ID from different column in R

My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum up all values for the same year so that every column has one value per year. I want to do this across the whole data set, for all years and countries. Any help is much appreciated. Thank you!
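One possible way to do this in base R is aggregate with cbind on the left-hand side of the formula. The sketch below uses the first nine rows of the sample; which columns should be summed (protest and duration here) is an assumption, since the question does not say, and the name per_year is illustrative.

```r
# First nine rows of the sample data, keeping only the columns used below
protests <- data.frame(
  country  = "Canada",
  year     = c(1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1992),
  protest  = 1,
  duration = c(1, 1, 1, 57, 2, 1, 8, 5, 2)
)

# cbind() on the left sums several columns at once, per country and year
per_year <- aggregate(cbind(protest, duration) ~ country + year,
                      data = protests, FUN = sum)

print(per_year)
```

Columns such as id or protestnumber probably should not be summed, so they are left out of the cbind call.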

Help with persisting problem when using 'subset' function in R

I would like to use the subset function in R to extract smaller groups of panel study time series data.
My data consists of a dataframe made up of six columns: district(8 districts), gender, age interval(4 groups), year, month and a count column.
Example:
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 91
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
I would like to extract smaller subset for each district, Gender and age interval to get something like this:
District Gender Year Month AgeGroupNew TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
and so on, up to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
So far I have been trying to use this, thanks to DWin pointing it out in a previous question.
subset(datNew, subset=(District=="Eastern" & Gender=="Female" & AgeGroupNew=="01-4"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
But R keeps on giving me the output as above - which it shouldn't.
I have tried other combinations with success, but it seems using 'District' in the subset causes this <0 rows> (or 0-length row.names).
This works:
> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
District Gender Year Month AgeGroupNew TotalDeaths
77 Eastern Female 2004 8 0 10
269 Eastern Male 2004 8 0 6
461 Khayelitsha Female 2004 8 0 13
653 Khayelitsha Male 2004 8 0 15
845 Klipfontein Female 2004 8 0 7
1037 Klipfontein Male 2004 8 0 6
but not
> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District Gender Year Month AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)
Any reason why District is causing this? It's absolutely wrong that there are 0 rows with that combination of the subset - there's enough data to my knowledge.
I've tried experimenting - and from other posts, this is a baby step closer to what I want to achieve, but still not working:
> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern Female 2003 1 0 4
5 Eastern Female 2003 2 0 4
9 Eastern Female 2003 3 0 5
13 Eastern Female 2003 4 0 12
17 Eastern Female 2003 5 0 7
21 Eastern Female 2003 6 0 13
With this I am unable to choose from the other Districts, such as "Southern", "Khayelitsha", etc. No matter what I change datNew[[1 or 2 or 3]] and District[[1 or 2 or 3]].
I don't really know what %in% does above?
I am so stuck. Any help, please.
Prediction: Give us the results of str(datNew$District[1]) and all will be revealed. I predict there is a non-printing character that will show up, perhaps a trailing space (or two).
So with the results of str(...) the correct code would be:
subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
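Instead of typing the trailing space into every comparison, the levels could be cleaned once with trimws. The sketch below assumes the problem really is trailing whitespace, as predicted above, and uses a made-up four-row data frame. It also shows what %in% does, since that was asked.

```r
# Hypothetical data with a trailing space in every District value
datNew <- data.frame(
  District    = c("Eastern ", "Eastern ", "Northern ", "Northern "),
  Gender      = c("Female", "Male", "Female", "Male"),
  TotalDeaths = c(4, 6, 3, 5),
  stringsAsFactors = TRUE
)

# The extra character shows up in the length of each level
level_lengths <- nchar(levels(datNew$District))

# Exact comparison fails while the space is there
rows_before <- nrow(subset(datNew, District == "Eastern"))

# Strip leading/trailing whitespace from the factor levels, once
levels(datNew$District) <- trimws(levels(datNew$District))
rows_after <- nrow(subset(datNew, District == "Eastern"))

# %in% asks, for each element on the left, whether it occurs on the right
membership <- c("Eastern", "Southern") %in% c("Eastern", "Northern")

print(list(level_lengths, rows_before, rows_after, membership))
```

This also explains why datNew[[1]] %in% District[1] worked in the question: inside subset, District[1] is simply the first value of the District column, trailing space included, so the test can only ever select the district of the first row.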

Extracting specific data from hierarchical-data in R

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
The Total.Deaths will then be the sum for each of these groups.
So I want it to look like this
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
I have a lot of data and have searched for a few days, but have been unable to find a function to help me do this.
There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:
#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)
#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))
#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
Then we can generate summaries using ddply and summarise:
library(plyr)
datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
TotalDeaths = sum(Total.Deaths))
I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.
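As a cross-check of the 91, here is a sketch that does the same recoding with %in% instead of ordered-factor comparisons, and sums with base aggregate instead of ddply. It uses only the 21 rows for month 1 from the question; this is an alternative under those assumptions, not the method from the answer above.

```r
# Month 1 rows from the question (Eastern, Female, 2003)
age <- c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "20-24", "25-29",
         "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
         "65-69", "70-74", "75-79", "80-84", "85+")
deaths <- c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11)

dat <- data.frame(District = "Eastern", Gender = "Female", Year = 2003,
                  Month = 1, Age.Group = age, Total.Deaths = deaths,
                  stringsAsFactors = FALSE)

# Start everyone in "15+", then overwrite the three younger groups explicitly
dat$AgeGroupNew <- "15+"
dat$AgeGroupNew[dat$Age.Group %in% c("-1", "-2", "0")] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group %in% c("05-09", "10-14")] <- "05-14"

# Sum deaths within each combination, including the new age group
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)

print(datNew)
```

Listing the codes explicitly avoids relying on how the age labels happen to sort as characters, at the cost of a little more typing.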
