Add a "rank" column to a data frame - r

I have a dataframe with counts of different items, in different years:
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
And I would like to add a "year.rank" column, which gives an item's rank within a given year, where a higher count leads to a higher "rank". With the above, it would look like:
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
I know I could do this for the whole data frame using order(df$count), but I'm not sure how I would do it by year.

There is a rank function to help you with that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
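The choice of ties.method only matters when counts tie within a year, as items a and c do in 2011. A small sketch of how three of the documented options treat that group (the vector below is just the 2011 counts copied from the question, and the variable names are illustrative):

```r
# 2011 counts for items a, b, c; a and c are tied
x <- c(3, 8, 3)

# Negating x makes the largest count receive rank 1
r_first   <- rank(-x, ties.method = "first")    # ties broken by position: 2 1 3
r_min     <- rank(-x, ties.method = "min")      # tied items share the lower rank: 2 1 2
r_average <- rank(-x, ties.method = "average")  # tied items share the mean rank: 2.5 1.0 2.5
```

"first" reproduces the output requested in the question; "min" is probably closer to what people expect from a competition-style ranking.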

data.table version for practice:
library(data.table)
DT <- as.data.table(df)
DT[,yrrank:=rank(-count,ties.method="first"),by=year]
item year count yrrank
1: a 2010 1 3
2: b 2010 4 2
3: c 2010 6 1
4: a 2011 3 2
5: b 2011 8 1
6: c 2011 3 3
7: a 2012 5 3
8: b 2012 7 2
9: c 2012 9 1

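Using the order function. Note that order() returns a sorting permutation rather than ranks, so it is applied twice here to turn that permutation into ranks:
transform(df, x = ave(count, year, FUN = function(x) order(order(x, decreasing = TRUE))))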
item year count x
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
EDIT
You can also use plyr here:
library(plyr)
ddply(df, .(year), transform, x = order(order(count, decreasing = TRUE)))
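One caveat when using order() for ranking: order() gives the positions of the sorted elements, not the rank of each element. For the sample data in the question the two happen to coincide, because every year's permutation is its own inverse. A minimal sketch with a made-up vector where they differ:

```r
x <- c(20, 10, 30)

# order() answers "which position holds the 1st, 2nd, 3rd largest value?"
perm <- order(x, decreasing = TRUE)        # 3 1 2

# rank() answers "what is the rank of each element?"
ranks <- rank(-x, ties.method = "first")   # 2 3 1

# Applying order() to the permutation inverts it, which recovers the ranks
ranks_via_order <- order(perm)             # 2 3 1
```

So a single order() call is only safe as a rank when the data are known to produce a self-inverse permutation; otherwise rank() or order(order(...)) is the safer choice.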

Using dplyr you could do it as follows:
library(dplyr) # 0.4.1
df %>%
group_by(year) %>%
mutate(yrrank = row_number(-count))
#Source: local data frame [9 x 4]
#Groups: year
#
# item year count yrrank
#1 a 2010 1 3
#2 b 2010 4 2
#3 c 2010 6 1
#4 a 2011 3 2
#5 b 2011 8 1
#6 c 2011 3 3
#7 a 2012 5 3
#8 b 2012 7 2
#9 c 2012 9 1
It is the same as:
df %>%
group_by(year) %>%
mutate(yrrank = rank(-count, ties.method = "first"))
Note that the resulting data is still grouped by "year". If you want to remove the grouping you can simply extend the pipe with %>% ungroup().
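As a hedged sketch of the full pipeline with the grouping removed at the end: the example below also adds min_rank(), which (unlike row_number()) gives tied counts the same rank. The column name yrrank_min is only illustrative, and a reasonably recent dplyr is assumed.

```r
library(dplyr)

df <- data.frame(item  = rep(c("a", "b", "c"), 3),
                 year  = rep(c("2010", "2011", "2012"), each = 3),
                 count = c(1, 4, 6, 3, 8, 3, 5, 7, 9))

ranked <- df %>%
  group_by(year) %>%
  mutate(yrrank     = row_number(desc(count)),  # ties broken by position
         yrrank_min = min_rank(desc(count))) %>%
  ungroup()                                     # drop the grouping again
```

In 2011, where a and c both have a count of 3, yrrank gives them ranks 2 and 3 while yrrank_min gives both rank 2.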

While using the answers given by others, I found that the following performs faster than the transform and dplyr variants:
df$year.rank <- ave(df$count, df$year, FUN = function(x) rank(-x, ties.method = "first"))

Related

How to sum values with multiple conditions per year in R

I have count data from different regions per year. The original data is structured like this:
count region year
1 1 A 2011
2 2 A 2010
3 1 A 2009
4 5 A 2008
5 4 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
16 4 D 2011
17 3 D 2010
18 2 D 2009
19 1 D 2008
20 4 D 2007
I now need to combine (sum) the values only for region A and D per year and keep the value A for the column regions of these calculated sums. The output should look like this:
count region year
1 5 A 2011
2 5 A 2010
3 3 A 2009
4 6 A 2008
5 8 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
The counts for region B and C should not be changed. I tried but never received the needed output. Does anyone have a tip? I would be very grateful.
We can replace D with A in the region column and then sum the counts by region and year:
library(dplyr)
df1 %>%
group_by(region = replace(region, region == 'D', 'A'), year) %>%
summarise(count = sum(count), .groups = 'drop')
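For comparison, the same relabel-then-sum idea can be sketched in base R with aggregate(). The data frame below is rebuilt from the table in the question; the name res is only illustrative.

```r
# Rebuild the question's data
df1 <- data.frame(count  = c(1, 2, 1, 5, 4,  2, 2, 1, 5, 3,
                             3, 3, 2, 1, 3,  4, 3, 2, 1, 4),
                  region = rep(c("A", "B", "C", "D"), each = 5),
                  year   = rep(2011:2007, times = 4),
                  stringsAsFactors = FALSE)

# Relabel region D as A so that both fall into the same group
df1$region[df1$region == "D"] <- "A"

# Sum counts within each region/year combination
res <- aggregate(count ~ region + year, data = df1, FUN = sum)

# Order as in the desired output: by region, years descending
res <- res[order(res$region, -res$year), ]
```

Regions B and C are untouched because each of their region/year groups contains a single row.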

Change the form while merging multiple data frames

I have several data frames that are all in same format, like:
price <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
size <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
performance <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
> price
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> size
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> performance
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
I want to merge these data frames, but into a different shape. The desired output looks like:
> df
name Year price size performance
1 A 2001 1 1 1
2 A 2002 2 2 2
3 A 2003 3 3 3
4 B 2001 2 2 2
5 B 2002 3 3 3
6 B 2003 4 4 4
7 C 2001 4 4 4
8 C 2002 5 5 5
9 C 2003 6 6 6
which arranges the data by name and then by date. Since I have over 2000 names and 180 dates in each of the 20 data frames, it's too difficult to sort it by just inputting each specific name.
You need to convert your data frames to long format, then join them together:
library(tidyverse)
price_long <- price %>% gather(key, value = "price", -Year)
size_long <- size %>% gather(key, value = "size", -Year)
performance_long <- performance %>% gather(key, value = "performance", -Year)
price_long %>%
left_join(size_long) %>%
left_join(performance_long)
Joining, by = c("Year", "key")
Joining, by = c("Year", "key")
Year key price size performance
1 2001 A 1 1 1
2 2002 A 2 2 2
3 2003 A 3 3 3
4 2001 B 2 2 2
5 2002 B 3 3 3
6 2003 B 4 4 4
7 2001 C 4 4 4
8 2002 C 5 5 5
9 2003 C 6 6 6
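With 20 data frames, writing out each gather() and left_join() by hand gets tedious. A hedged sketch of the same long-then-join idea over a named list, assuming tidyr 1.0 or later (where pivot_longer() is the recommended replacement for gather()); the names frames, long and result are illustrative:

```r
library(dplyr)
library(tidyr)

price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price          # the question uses identical values in all three
performance <- price

frames <- list(price = price, size = size, performance = performance)

# Reshape each frame to long format, naming the value column after the frame
long <- Map(function(d, nm) {
  pivot_longer(d, cols = -Year, names_to = "name", values_to = nm)
}, frames, names(frames))

# Join all long frames on the shared keys, then order by name and year
result <- Reduce(function(a, b) left_join(a, b, by = c("Year", "name")), long) %>%
  select(name, Year, everything()) %>%
  arrange(name, Year)
```

Adding another data frame then only requires adding it to the frames list.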
You can use data.table:
library(data.table)
a=list(price=price,size=size,performance=performance)
dcast(melt(rbindlist(a,T,idcol = "name"),1:2),variable+Year~name)
variable Year performance price size
1: A 2001 1 1 1
2: A 2002 2 2 2
3: A 2003 3 3 3
4: B 2001 2 2 2
5: B 2002 3 3 3
6: B 2003 4 4 4
7: C 2001 4 4 4
8: C 2002 5 5 5
9: C 2003 6 6 6
We can combine the data frames, gather and spread the combined data frame.
library(tidyverse)
dat <- list(price, size, performance) %>%
setNames(c("price", "size", "performance")) %>%
bind_rows(.id = "type") %>%
gather(name, value, A:C) %>%
spread(type, value) %>%
arrange(name, Year)
dat
# Year name performance price size
# 1 2001 A 1 1 1
# 2 2002 A 2 2 2
# 3 2003 A 3 3 3
# 4 2001 B 2 2 2
# 5 2002 B 3 3 3
# 6 2003 B 4 4 4
# 7 2001 C 4 4 4
# 8 2002 C 5 5 5
# 9 2003 C 6 6 6
dplyr::bind_rows comes in quite handy in such scenarios. A solution could be:
library(tidyverse)
bind_rows(list(price = price, size = size, performance = performance), .id="Type") %>%
gather(Key, Value, - Type, -Year) %>%
spread(Type, Value)
# Year Key performance price size
# 1 2001 A 1 1 1
# 2 2001 B 2 2 2
# 3 2001 C 4 4 4
# 4 2002 A 2 2 2
# 5 2002 B 3 3 3
# 6 2002 C 5 5 5
# 7 2003 A 3 3 3
# 8 2003 B 4 4 4
# 9 2003 C 6 6 6
The above solution is very similar to the one by @www. It just avoids the use of setNames.
To round it out, here's a package-free base R answer.
# gather the data.frames into a list
myList <- mget(ls())
Note that the three data.frames are the only objects in my environment.
# get the final data.frame
Reduce(merge,
Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
myList, names(myList)))
This returns
Year ID performance price size
1 2001 A 1 1 1
2 2001 B 2 2 2
3 2001 C 4 4 4
4 2002 A 2 2 2
5 2002 B 3 3 3
6 2002 C 5 5 5
7 2003 A 3 3 3
8 2003 B 4 4 4
9 2003 C 6 6 6
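Because mget(ls()) picks up every object in the environment, it may be safer to build the list explicitly. A sketch of the same Reduce/Map approach under that assumption (frames, long and merged are illustrative names), with a final sort to get the name-then-year order the question asks for:

```r
price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price
performance <- price

# Name the list explicitly instead of relying on the environment's contents
frames <- list(price = price, size = size, performance = performance)

# stack() turns the A, B, C columns into a values column plus an indicator column
long <- Map(function(x, nm) {
  setNames(cbind(x[1], stack(x[-1])), c("Year", nm, "ID"))
}, frames, names(frames))

# merge() joins on the columns the frames share, here Year and ID
merged <- Reduce(merge, long)

# Sort by name, then year
merged <- merged[order(merged$ID, merged$Year), ]
```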

Sum rows by interval Dataframe

I need help with a problem in a research project.
The problem is this: I have a big data frame called FRAMETRUE, and I need to sum certain columns row by row into a new column that I will call Group1.
For example:
head(FRAMETRUE)
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
I must sum the values in each row from 1989 to 1995 into a new column called Group1. The Group1 column should look like:
Group1
22
23
and so on...
I know it must be something simple, I just don't get it, I'm still learning R
Here's one way to do it in R. The trick is to select the columns with [ and pass them to rowSums:
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, 2:8], na.rm = TRUE)
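Positional indices such as 2:8 break silently if columns are added or reordered. A sketch of the same rowSums call selecting the year columns by name instead; it assumes the columns are literally named "1989" to "1997", as the printed table suggests, and rebuilds the sample rows from the question:

```r
# check.names = FALSE keeps the year column names as "1989", "1990", ...
FRAMETRUE <- data.frame(
  Municipalities = c("A", "B", "C", "D", "E"),
  `1989` = c(3, 7, 10, 7, 5),
  `1990` = c(3, 1, 15, 0, 1),
  `1991` = c(5, 2, 1, NA, 2),
  `1992` = c(2, 4, 3, 5, 4),
  `1993` = c(3, 5, 2, 3, 0),
  `1994` = c(4, 0, NA, 6, 3),
  `1995` = c(2, 4, 2, 4, 5),
  `1996` = c(5, 8, 5, 5, 4),
  `1997` = c(3, 9, 3, 5, 2),
  check.names = FALSE
)

# Select the interval by column name rather than by position
years <- as.character(1989:1995)

# na.rm = TRUE treats missing values as zero in the sum
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, years], na.rm = TRUE)
```

The first two values of Group1 are 22 and 23, matching the question.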
A dplyr solution that allows you to refer to your columns by their names:
library(dplyr)
municipalities <- LETTERS[1:4]
year1989 <- sample(4)
year1990 <- sample(4)
year1991 <- sample(4)
df <- data.frame(municipalities,year1989,year1990,year1991)
# df
municipalities year1989 year1990 year1991
1 A 4 2 2
2 B 3 1 3
3 C 1 3 4
4 D 2 4 1
# Calculate row sums here
df <- mutate(df, Group1 = rowSums(select(df, year1989:year1991)))
# df
municipalities year1989 year1990 year1991 Group1
1 A 4 2 2 8
2 B 3 1 3 7
3 C 1 3 4 8
4 D 2 4 1 7

How to get all my values within the same category to be equal in my data frame?

So, I have a dataset that looks like this:
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2
But I do not want to have NAs in the cat column. Instead, I want every line within the same site to get the same value of cat.
Just like this:
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA 1
3 10 2015 2.0 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0.0 2
8 11 2010 NA 2
9 11 2009 1.0 2
Any idea on how I can do that?
Use na.aggregate to fill in the NA values using ave to do it by site.
library(zoo)
transform(DF, cat = ave(cat, site, FUN = na.aggregate))
giving:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
Note
The input used, in reproducible form, is:
Lines <- "
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2"
DF <- read.table(text = Lines)
A complete base R alternative:
transform(DF, cat = ave(cat, site, FUN = function(x) x[!is.na(x)][1]))
which gives:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
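Taking the first non-NA value silently assumes that each site has exactly one distinct cat. A small sketch that checks this assumption before filling; the data frame is rebuilt from the question, and n_distinct_cat and filled are illustrative names:

```r
DF <- data.frame(site        = c(10, 10, 10, 10, 10, 11, 11, 11, 11),
                 year        = 2017:2009,
                 territories = c(0, NA, 2, NA, NA, NA, 0, NA, 1),
                 cat         = c(1, NA, 1, NA, NA, NA, 2, NA, 2))

# Number of distinct non-missing cat values per site; should be 1 everywhere
n_distinct_cat <- tapply(DF$cat, DF$site, function(x) length(unique(na.omit(x))))
stopifnot(all(n_distinct_cat == 1))

# ave() applies FUN within each site and recycles the single result over the group
filled <- transform(DF, cat = ave(cat, site, FUN = function(x) x[!is.na(x)][1]))
```

If a site had no non-missing cat at all, x[!is.na(x)][1] would return NA and the check above would stop with an error, which is usually preferable to a silent gap.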
The same logic implemented with dplyr:
library(dplyr)
DF %>%
group_by(site) %>%
mutate(cat = na.omit(cat)[1])
Or with na.locf of the zoo-package:
library(zoo)
transform(DF, cat = ave(cat, site, FUN = function(x) na.locf(na.locf(x, fromLast = TRUE, na.rm = FALSE))))
Or with fill from tidyr:
library(tidyr)
library(dplyr)
DF %>%
group_by(site) %>%
fill(cat) %>%
fill(cat, .direction = "up")
NOTE: I wonder what the added value of the cat column is when cat has to be the same for each site. You'll end up with two grouping variables that do exactly the same thing, making one of them redundant, in my opinion.
You can also use tidyr::fill
library(dplyr)
library(tidyr)
DF %>%
group_by(site) %>%
fill(cat,.direction = "up") %>%
fill(cat,.direction = "down") %>%
ungroup
# # A tibble: 9 x 4
# site year territories cat
# <int> <int> <dbl> <int>
# 1 10 2017 0 1
# 2 10 2016 NA 1
# 3 10 2015 2 1
# 4 10 2014 NA 1
# 5 10 2013 NA 1
# 6 11 2012 NA 2
# 7 11 2011 0 2
# 8 11 2010 NA 2
# 9 11 2009 1 2

R paired column index

Say I have two data frames, A and B:
mth <- c(rep(1:5,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
year <- c(2008:2012)
mth <- c(1:5)
B <- data.frame(cbind(year,mth))
What I want should look like:
mth <- c(rep(2008:2012,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
Basically, what I need is to replace the column mth in A with the column year from B. Maybe I didn't search for the right keyword, but I was not able to get what I want (I tried which()). Please help, thank you.
A2 <- merge(A,B, by = "mth")[ , -1]
names(A2)[(which(names(A2)=="year"))] <- "mth"
> A2
day hr v mth
1 10 3 3 2008
2 11 3 3 2008
3 11 4 4 2009
4 10 4 4 2009
5 11 5 5 2010
6 10 5 5 2010
7 11 6 4 2011
8 10 6 4 2011
9 10 7 3 2012
10 11 7 3 2012
Probably the easiest solution is to use merge, which is equivalent to a sql join in a lot of ways:
merge(A,B)
#-----
merge(A, B)
mth day hr v year
1 1 10 3 3 2008
2 1 11 3 3 2008
3 2 11 4 4 2009
4 2 10 4 4 2009
5 3 11 5 5 2010
6 3 10 5 5 2010
7 4 11 6 4 2011
8 4 10 6 4 2011
9 5 10 7 3 2012
10 5 11 7 3 2012
You could also probably use match like this to replace mth in place:
A$mth <- B[match(A$mth, B$mth),1]
#-----
mth day hr v
1 2008 10 3 3
2 2009 10 4 4
3 2010 10 5 5
4 2011 10 6 4
5 2012 10 7 3
6 2008 11 3 3
7 2009 11 4 4
8 2010 11 5 5
9 2011 11 6 4
10 2012 11 7 3
While a little dense, that code indexes B by matching the two mth columns from A and B and then grabs the first column.
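Unpacked into steps, the match approach might look like the sketch below. It rebuilds A and B with data.frame() directly, refers to the year column by name rather than by position, and adds a guard for months that have no entry in B; idx is an illustrative name.

```r
A <- data.frame(mth = rep(1:5, 2),
                day = rep(c(10, 11), each = 5),
                hr  = rep(3:7, 2),
                v   = rep(c(3, 4, 5, 4, 3), 2))
B <- data.frame(year = 2008:2012, mth = 1:5)

# For each row of A, find the row of B with the same mth
idx <- match(A$mth, B$mth)   # 1 2 3 4 5 1 2 3 4 5

# match() returns NA for values missing from B; fail early if that happens
stopifnot(!anyNA(idx))

# Look up the year for each row and overwrite mth in place
A$mth <- B$year[idx]
```

Unlike merge, this keeps the rows of A in their original order.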

Resources