I have count data from different regions per year. The original data is structured like this:
count region year
1 1 A 2011
2 2 A 2010
3 1 A 2009
4 5 A 2008
5 4 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
16 4 D 2011
17 3 D 2010
18 2 D 2009
19 1 D 2008
20 4 D 2007
I now need to combine (sum) the values for regions A and D only, per year, and keep A as the region label for the combined sums. The output should look like this:
count region year
1 5 A 2011
2 5 A 2010
3 3 A 2009
4 6 A 2008
5 8 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
The counts for regions B and C should not change. I have tried, but never got the output I need. Does anyone have a tip? I would be very grateful.
We can recode region D as A, then group by region and year and sum the counts:
library(dplyr)
df1 %>%
group_by(region = replace(region, region == 'D', 'A'), year) %>%
summarise(count = sum(count), .groups = 'drop')
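For comparison, here is a base R sketch of the same idea. It assumes df1 holds the data shown in the question (rebuilt here so the snippet runs on its own); the name out and the final sort step are my own additions, not part of the answer above.

```r
# Rebuild the question's data (assumed to be what df1 contains)
df1 <- data.frame(
  count  = c(1, 2, 1, 5, 4,  2, 2, 1, 5, 3,  3, 3, 2, 1, 3,  4, 3, 2, 1, 4),
  region = rep(c("A", "B", "C", "D"), each = 5),
  year   = rep(2011:2007, times = 4)
)

# Recode D as A so both regions fall into the same group
df1$region <- as.character(df1$region)
df1$region[df1$region == "D"] <- "A"

# Sum the counts within each region/year combination
out <- aggregate(count ~ region + year, data = df1, FUN = sum)

# Sort by region, then newest year first, as in the desired output
out <- out[order(out$region, -out$year), ]
```

Regions B and C pass through unchanged because each of their region/year groups contains a single row.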
I want to calculate the sum for this data.frame for the years 2005, 2006, 2007 and the categories a, b, c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
year category value
2005 a 1
2005 a 1
2005 a 1
2006 b 2
2006 b 2
2006 b 2
2007 c 3
2007 c 3
2007 c 3
2006 a 3
2007 b 6
2008 c 9
Any idea how this could be implemented?
add_row or cbind maybe?
How about like this using the dplyr package:
library(dplyr)
df %>%
group_by(year, category) %>%
summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add the sum as a new column than collapse the rows, replace summarise() with mutate():
df %>%
group_by(year, category) %>%
mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
A base R solution using aggregate
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22
I have a dataset that looks like this:
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2
But I do not want to have NAs in the cat column. Instead, I want every line within the same site to get the same value of cat.
Like this:
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA 1
3 10 2015 2.0 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0.0 2
8 11 2010 NA 2
9 11 2009 1.0 2
Any idea on how I can do that?
Use na.aggregate from zoo to fill in the NA values (by default it replaces them with the mean of the non-NA values), and use ave to apply it per site.
library(zoo)
transform(DF, cat = ave(cat, site, FUN = na.aggregate))
giving:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
Note
The input used, in reproducible form, is:
Lines <- "
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2"
DF <- read.table(text = Lines)
A complete base R alternative:
transform(DF, cat = ave(cat, site, FUN = function(x) x[!is.na(x)][1]))
which gives:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
The same logic implemented with dplyr:
library(dplyr)
DF %>%
group_by(site) %>%
mutate(cat = na.omit(cat)[1])
Or with na.locf of the zoo-package:
library(zoo)
transform(DF, cat = ave(cat, site, FUN = function(x) na.locf(na.locf(x, fromLast = TRUE, na.rm = FALSE))))
Or with fill from tidyr:
library(tidyr)
library(dplyr)
DF %>%
group_by(site) %>%
fill(cat) %>%
fill(cat, .direction = "up")
NOTE: I wonder what the added value of the cat column is when cat has to be the same for each site. You'll end up with two grouping variables that do exactly the same thing, which makes one of them redundant imo.
You can also use tidyr::fill
library(dplyr)
library(tidyr)
DF %>%
group_by(site) %>%
fill(cat,.direction = "up") %>%
fill(cat,.direction = "down") %>%
ungroup
# # A tibble: 9 x 4
# site year territories cat
# <int> <int> <dbl> <int>
# 1 10 2017 0 1
# 2 10 2016 NA 1
# 3 10 2015 2 1
# 4 10 2014 NA 1
# 5 10 2013 NA 1
# 6 11 2012 NA 2
# 7 11 2011 0 2
# 8 11 2010 NA 2
# 9 11 2009 1 2
dat=data.frame(
year=c(rep(2007,5),rep(2008,3),rep(2009,3)),
province=c("a","a","b","c","d","a","c","d","b","c","d"),
sale=1:11)
tapply(dat$sale,list(dat$year,dat$province),sum)
a b c d
2007 3 3 4 5
2008 6 NA 7 8
2009 NA 9 10 11
In this case, how can I change the tapply call into an aggregate call to get the same result?
It would not be arranged as a table, but rather as a "long format" presentation.
> aggregate(dat$sale,list(dat$year,dat$province),sum)
Group.1 Group.2 x
1 2007 a 3
2 2008 a 6
3 2007 b 3
4 2009 b 9
5 2007 c 4
6 2008 c 7
7 2009 c 10
8 2007 d 5
9 2008 d 8
10 2009 d 11
Whether you consider that the same is not clear. The information content is the same.
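If the table layout matters, the long result can be spread back out. Below is a minimal sketch using base R's reshape(); the names agg and wide are mine, and the columns come out named sale.a, sale.b, and so on rather than a, b, as in the tapply result.

```r
dat <- data.frame(
  year     = c(rep(2007, 5), rep(2008, 3), rep(2009, 3)),
  province = c("a", "a", "b", "c", "d", "a", "c", "d", "b", "c", "d"),
  sale     = 1:11
)

# Long format: one row per year/province combination that occurs in the data
agg <- aggregate(sale ~ year + province, data = dat, FUN = sum)

# Spread province across columns; combinations that never occur become NA
wide <- reshape(agg, idvar = "year", timevar = "province", direction = "wide")
```

The result is a data frame with year as an ordinary column, whereas tapply returns a matrix with the years as row names.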
I have a dataframe with counts of different items, in different years:
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
And I would like to add a "year.rank" column, which gives an item's rank within a given year, where a higher count leads to a higher "rank". With the above, it would look like:
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
I know I could do this for the whole data frame using order(df$count), but I'm not sure how I would do it by year.
There is a rank function to help you with that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
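The choice of ties.method matters for 2011, where items a and c both have a count of 3. A small sketch of the difference, using a named vector that I made up from the 2011 rows of the question:

```r
counts_2011 <- c(a = 3, b = 8, c = 3)  # the 2011 counts from the question

# "first": tied items get distinct ranks in order of appearance
r_first <- rank(-counts_2011, ties.method = "first")

# "min": tied items share the best rank they would otherwise occupy
r_min <- rank(-counts_2011, ties.method = "min")
```

With "first", a is ranked 2 and c is ranked 3, as in the desired output; with "min", both are ranked 2.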
data.table version for practice:
library(data.table)
DT <- as.data.table(df)
DT[,yrrank:=rank(-count,ties.method="first"),by=year]
item year count yrrank
1: a 2010 1 3
2: b 2010 4 2
3: c 2010 6 1
4: a 2011 3 2
5: b 2011 8 1
6: c 2011 3 3
7: a 2012 5 3
8: b 2012 7 2
9: c 2012 9 1
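Using the order function (order on its own returns sorting positions rather than ranks, so it is applied twice to turn the positions into ranks):
transform(df, x = ave(count, year, FUN = function(x) order(order(x, decreasing = TRUE))))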
item year count x
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
EDIT
You can also use plyr here:
library(plyr)
ddply(df, .(year), transform, x = order(order(count, decreasing = TRUE)))
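A caveat on order-based ranking: order() returns the positions that would sort a vector, which is the inverse permutation of the ranks. The two happen to coincide on the sample data in the question, but not in general. A quick check on a vector I chose for illustration:

```r
x <- c(2, 1, 3)

# Positions of the largest, second largest, ... element
pos <- order(x, decreasing = TRUE)

# Rank of each element, largest first
ranks <- rank(-x, ties.method = "first")

# Applying order() to the positions inverts the permutation and recovers the ranks
ranks_via_order <- order(pos)
```

Here pos is 3, 1, 2 while ranks is 2, 3, 1, so a single order() call would label the rows incorrectly.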
Using dplyr you could do it as follows:
library(dplyr) # 0.4.1
df %>%
group_by(year) %>%
mutate(yrrank = row_number(-count))
#Source: local data frame [9 x 4]
#Groups: year
#
# item year count yrrank
#1 a 2010 1 3
#2 b 2010 4 2
#3 c 2010 6 1
#4 a 2011 3 2
#5 b 2011 8 1
#6 c 2011 3 3
#7 a 2012 5 3
#8 b 2012 7 2
#9 c 2012 9 1
It is the same as:
df %>%
group_by(year) %>%
mutate(yrrank = rank(-count, ties.method = "first"))
Note that the resulting data is still grouped by "year". If you want to remove the grouping you can simply extend the pipe with %>% ungroup().
While using the answers given by others, I found that the following performs faster than the transform and dplyr variants:
df$year.rank <- ave(df$count, df$year, FUN = function(x) rank(-x, ties.method = "first"))
Say I have two data frames, A and B:
mth <- c(rep(1:5,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
year <- c(2008:2012)
mth <- c(1:5)
B <- data.frame(cbind(year,mth))
What I want should look like this:
mth <- c(rep(2008:2012,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
Basically, I need to replace the mth column in A with the corresponding year from B. Maybe I didn't search with the right keywords, but I couldn't work out how to do this (I tried which()). Please help, thank you.
A2 <- merge(A,B, by = "mth")[ , -1]
names(A2)[(which(names(A2)=="year"))] <- "mth"
> A2
day hr v mth
1 10 3 3 2008
2 11 3 3 2008
3 11 4 4 2009
4 10 4 4 2009
5 11 5 5 2010
6 10 5 5 2010
7 11 6 4 2011
8 10 6 4 2011
9 10 7 3 2012
10 11 7 3 2012
Probably the easiest solution is to use merge, which is equivalent to a sql join in a lot of ways:
merge(A,B)
#-----
mth day hr v year
1 1 10 3 3 2008
2 1 11 3 3 2008
3 2 11 4 4 2009
4 2 10 4 4 2009
5 3 11 5 5 2010
6 3 10 5 5 2010
7 4 11 6 4 2011
8 4 10 6 4 2011
9 5 10 7 3 2012
10 5 11 7 3 2012
You could also probably use match like this to replace mth in place:
A$mth <- B[match(A$mth, B$mth),1]
#-----
mth day hr v
1 2008 10 3 3
2 2009 10 4 4
3 2010 10 5 5
4 2011 10 6 4
5 2012 10 7 3
6 2008 11 3 3
7 2009 11 4 4
8 2010 11 5 5
9 2011 11 6 4
10 2012 11 7 3
While a little dense, that code indexes B by matching the two mth columns from A and B and then grabs the first column.
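The same lookup can be split into two steps to make it easier to follow. This is only a sketch on cut-down versions of A and B (the hr and v columns are omitted), and idx is a name I introduced:

```r
A <- data.frame(mth = rep(1:5, 2), day = rep(c(10, 11), each = 5))
B <- data.frame(year = 2008:2012, mth = 1:5)

# For each row of A, the row number of B that has the same mth
idx <- match(A$mth, B$mth)

# Pull the year from those rows; same as B[idx, 1], but selects the column by name
A$mth <- B$year[idx]
```

Any mth value in A that has no counterpart in B would come back as NA, since match() returns NA for unmatched values.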