How to add a ranking column for this dataset in R?

My data is as follows:
df <- data.frame(
  comp_name = c("A", "B", "C", "D", "E", "F", "G", "H", "J", "K", "L", "M"),
  country   = c("US", "UK", "France", "Germany", "US", "UK", "France", "Germany",
                "US", "UK", "France", "Germany"),
  profit    = c(100, 125, 150, 165, 150, 110, 110, 125, 130, 250, 95, 100)
)
df:
comp_name country profit
1 A US 100
2 B UK 125
3 C France 150
4 D Germany 165
5 E US 150
6 F UK 110
7 G France 110
8 H Germany 125
9 J US 130
10 K UK 250
11 L France 95
12 M Germany 100
I would like to add a rank column to this data frame which ranks companies by profit by country, like this:
comp_name country profit rank
1 A US 100 3
2 B UK 125 2
3 C France 150 1
4 D Germany 165 1
5 E US 150 1
6 F UK 110 3
7 G France 110 2
8 H Germany 125 2
9 J US 130 2
10 K UK 250 1
11 L France 95 3
12 M Germany 100 3
I'm relatively new to R and don't know where to start with this. Any help would be greatly appreciated. Thanks!

Does this work?
library(dplyr)

df %>%
  group_by(country) %>%
  mutate(rank = rank(desc(profit)))
# A tibble: 12 x 4
# Groups: country [4]
comp_name country profit rank
<chr> <chr> <dbl> <dbl>
1 A US 100 3
2 B UK 125 2
3 C France 150 1
4 D Germany 165 1
5 E US 150 1
6 F UK 110 3
7 G France 110 2
8 H Germany 125 2
9 J US 130 2
10 K UK 250 1
11 L France 95 3
12 M Germany 100 3
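One thing to watch: rank() averages ties by default, so two companies tied for the top profit in a country would both get rank 1.5. If integer ranks are wanted under ties, dplyr's min_rank() and dense_rank() are the usual choices; a minimal sketch:

library(dplyr)

df %>%
  group_by(country) %>%
  mutate(
    rank_min   = min_rank(desc(profit)),    # ties share the lowest rank: 1, 1, 3
    rank_dense = dense_rank(desc(profit))   # ties share a rank with no gaps: 1, 1, 2
  ) %>%
  ungroup()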

An option with data.table
library(data.table)
setDT(df)[, Rank := frank(-profit), country]
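frank() also accepts a ties.method argument if averaged ranks are not what you want; for example, for dense ranks:

library(data.table)
setDT(df)[, Rank := frank(-profit, ties.method = "dense"), country]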

A base R option using rank + ave
transform(
  df,
  Rank = ave(-profit, country, FUN = rank)
)
gives
comp_name country profit Rank
1 A US 100 3
2 B UK 125 2
3 C France 150 1
4 D Germany 165 1
5 E US 150 1
6 F UK 110 3
7 G France 110 2
8 H Germany 125 2
9 J US 130 2
10 K UK 250 1
11 L France 95 3
12 M Germany 100 3

df %>%
  dplyr::group_by(country) %>%
  dplyr::group_map(function(x, y) {
    x %>% dplyr::mutate(rank = rank(-profit))
  }, .keep = TRUE) %>%   # .keep = TRUE keeps the country column in each piece
  dplyr::bind_rows()
Karthik S provided a cleaner answer; the group_map() step here is redundant, since mutate() on the grouped data already does the job.

Related

group "weighted" mean with multiple grouping variables and excluding own group value

I'm trying to get group "weighted" mean with multiple grouping variables and excluding own group value. This is related to my earlier post Get group mean with multiple grouping variables and excluding own group value, but when I applied it to my actual question (which is getting the weighted mean) I found out that it's much more complicated than getting the simple mean. Here's what I mean by that.
library(dplyr)

# data_frame() is deprecated in newer dplyr; tibble() is the drop-in replacement
df <- data_frame(
  state  = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year   = rep(c(2011:2012), 6),
  value  = c(91, 46, 37, 80, 33, 97, 4, 19, 85, 90, 56, 94),
  wt     = c(1, 4, 3, 5, 1, 4, 5, 1, 5, 5, 4, 1)
) %>% arrange(state, year)
For unweighted mean case, the following code (from the accepted answer of my earlier post) should work.
df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n() - 1))
The desired variable new_val, the leave-one-out weighted mean, is shown below. For instance, the first two rows of the new_val column are computed as 37*3/4 + 33*1/4 = 36 and 91*1/2 + 33*1/2 = 62.
# A tibble: 12 x 6
state county year value wt new_val
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 AL a 2011 91 1 36
2 AL b 2011 37 3 62
3 AL c 2011 33 1 50.5
4 AL a 2012 46 4 87.6
5 AL b 2012 80 5 71.5
6 AL c 2012 97 4 64.9
7 CA d 2011 4 5 72.1
8 CA e 2011 85 5 27.1
9 CA f 2011 56 4 44.5
10 CA d 2012 19 1 90.7
11 CA e 2012 90 5 56.5
12 CA f 2012 94 1 78.2
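As a quick check of those hand calculations with base R's weighted.mean():

weighted.mean(c(37, 33), c(3, 1))  # 36   (row 1: drop county a's own value)
weighted.mean(c(91, 33), c(1, 1))  # 62   (row 2: drop county b's own value)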
I searched for similar posts with weighted mean in mind, but all the available ones were for the simple mean cases. Any comments would be greatly appreciated. Thank you!
We can use map_dbl() to exclude the current row from each weighted.mean() call:
library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(new_val = purrr::map_dbl(row_number(),
                                  ~ weighted.mean(value[-.x], wt[-.x])))
# state county year value wt new_val
# <chr> <chr> <int> <dbl> <dbl> <dbl>
# 1 AL a 2011 91 1 36
# 2 AL b 2011 37 3 62
# 3 AL c 2011 33 1 50.5
# 4 AL a 2012 46 4 87.6
# 5 AL b 2012 80 5 71.5
# 6 AL c 2012 97 4 64.9
# 7 CA d 2011 4 5 72.1
# 8 CA e 2011 85 5 27.1
# 9 CA f 2011 56 4 44.5
#10 CA d 2012 19 1 90.7
#11 CA e 2012 90 5 56.5
#12 CA f 2012 94 1 78.2
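For larger groups the per-row map_dbl() call can be avoided entirely: the leave-one-out weighted mean has a closed form, since dropping row i just removes its contribution from both the numerator and the denominator. A vectorised sketch:

library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(new_val = (sum(value * wt) - value * wt) / (sum(wt) - wt)) %>%
  ungroup()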

cumulative count of character vector

I want to make a cumulative count of country names from a data frame:
df <- data.frame(
  country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam"),
  year    = c(1834, 1846, 1847, 1852, 1860, 1865)
)
I have tried different versions of count(), cumsum() and tally() but can't seem to get it right.
Output should look like:
country year n
Sweden 1834 1
Germany 1846 2
Sweden 1847 2
Sweden 1852 2
Germany 1860 2
Vietnam 1865 3
library(dplyr)

df %>%
  mutate(count = cumsum(!duplicated(country))) %>%
  as_tibble()
#> # A tibble: 6 x 3
#> country year count
#> <fctr> <dbl> <int>
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
or
dist_cum <- function(var) {
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
}

df %>% mutate(var2 = dist_cum(country))
#> country year var2
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
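For completeness, the same running distinct count is a one-liner in data.table as well:

library(data.table)
setDT(df)[, n := cumsum(!duplicated(country))][]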
You can try this, although note that ddply() here counts rows per country/year pair rather than producing the running count asked for:
library(plyr)
df <- data.frame(
  country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam", "Germany"),
  year    = c(1834, 1846, 1847, 1852, 1860, 1865, 1860)
)
counts <- ddply(df, .(country, year), nrow)
The output is:
> counts
  country year V1
1 Germany 1846  1
2 Germany 1860  2
3  Sweden 1834  1
4  Sweden 1847  1
5  Sweden 1852  1
6 Vietnam 1865  1

grouping data in R and summing by decade

I have the following dataset:
ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931
I need to summarise the data by decade (the 1920s and the 1930s): total points for ireland, england and france over 1920-1922, and another set of totals over 1930-1931.
Any ideas? I have tried but failed.
Dataset:
x <- read.table(text = "ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = TRUE)
How about dividing the years by 10 and then summarizing?
library(dplyr)

x %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade) %>%
  summarize_all(sum) %>%
  select(-year)
# A tibble: 2 x 4
#   decade ireland england france
#    <dbl>   <int>   <int>  <int>
# 1   1920      15       8       7
# 2   1930       5       6       7
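In current dplyr, summarize_all() is superseded; the same result with across(), summing only the point columns:

library(dplyr)

x %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade) %>%
  summarize(across(c(ireland, england, france), sum))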
A base R solution.
As A5C1D2H2I1M1N2O1R2T1 mentioned, you can use findInterval() to assign the corresponding decade to each year and then aggregate() to group by decade:
txt <-
"ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931"
df <- read.table(text=txt, header=T)
decades <- c(1920, 1930, 1940)
df$decade<- decades[findInterval(df$year, decades)]
aggregate(cbind(ireland, england, france) ~ decade, data = df, sum)
Output:
decade ireland england france
1 1920 15 8 7
2 1930 5 6 7
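Integer division gives the same bucketing without having to list the break points in advance, assuming year is numeric:

df$decade <- df$year %/% 10 * 10
aggregate(cbind(ireland, england, france) ~ decade, data = df, sum)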

filter a df with NA to get only individuals that appear more than one time in r

I am using a national survey to run a regression: the survey is conducted every two years, and some individuals are interviewed repeatedly while others appear only once.
Now I want to turn the df into a panel, keeping only the individuals that appear more than once. The df is like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual and nordp is the code number that the individual had in the previous survey; when a new individual is interviewed, R fills nordp with NA. For example, individual 3 of family 10 has nordp = NA in 2002 because it is the first time she is interviewed, while in 2004 her nordp is 3 (the number she had in 2002).
I can't use nord to filter the df because family composition may change: for example, in 2002 the mother in family x has nordp = 2 (meaning her nord was 2 in 2000) and nord = 2, but in the next survey her nord could become 1 (say, if she gets divorced) while her nordp is still 2.
I tried to filter using this command:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
but I don't get the right df, because if the same family contains more than one newly inserted individual (nordp = NA), they are all treated as the same person, since nordp is NA the first time each of them appears.
How can I also keep the individuals that appear for the first time in a certain year (nordp = NA)? I tried to build a condition using age (the age in t should equal the age in t-2 plus 2; for example, age 20 in 2000 becomes 22 in 2002), but it didn't work.
Note that the df has thousands of observations, so I can't check manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, only the individuals that appear more than once are kept; the individual with nquest = 10, nordp = 3 appears twice (2002 and 2004), but with my command she is dropped, because in the first year her nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over years. Here's an idea inspired by Identify groups of linked episodes which chain together. First, load the igraph package, via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
  fields <- names(d)  # store original column names
  # give first-time interviewees (nordp = NA) unique placeholder codes
  d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
  # encode year + code so each (year, code) pair is a unique node label
  d$nordp_x <- (d$year - 2) * 1000 + d$nordp
  d$nord_x  <- d$year * 1000 + d$nord
  dd <- d[, c("nordp_x", "nord_x")]
  # each row is an edge linking a person's previous code to the current one;
  # connected components of this graph are individuals
  gr.test <- graph.data.frame(dd)
  links <- data.frame(org_id = unique(unlist(dd)),
                      id     = clusters(gr.test)$membership)
  d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
  d$uid <- d$nquest * 100 + d$id  # make the ID unique across families
  d[, c(fields, "uid")]
}
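A compatibility note: graph.data.frame() and clusters() are older igraph names that still work; in current igraph the equivalents are graph_from_data_frame() and components(), so those two calls can also be written as:

gr.test    <- igraph::graph_from_data_frame(dd)      # same graph, current API name
membership <- igraph::components(gr.test)$membership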
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001

stratified sampling with group size below sample size in R

I have response data by market in the format:
head(df)
ID market q1 q2
470 France 1 3
625 Germany 0 2
155 Italy 1 6
648 Spain 0 5
862 France 1 7
699 Germany 0 8
460 Italy 1 6
333 Spain 1 5
776 Spain 1 4
and the following frequencies:
table(df$market)
France 140
Germany 300
Italy 50
Spain 75
I need to create a data frame with a sample of 100 responses per market, taking all responses (without replacement) whenever a market has fewer than 100.
So:
table(df_new$market)
France 100
Germany 100
Italy 50
Spain 75
Thanks in advance!
The following looks valid:
set.seed(10)
DF <- data.frame(c1 = sample(LETTERS[1:4], 25, replace = TRUE), c2 = runif(25))

# cap each group's sample size at 5 (or at the group size, if smaller)
freqs <- as.data.frame(table(DF$c1))
freqs$ss <- ifelse(freqs$Freq >= 5, 5, freqs$Freq)
freqs
#  Var1 Freq ss
#1    A    4  4
#2    B   11  5
#3    C    7  5
#4    D    3  3

# sample 'ss' rows from each group, then bind the pieces back together
res <- mapply(function(x, y) DF[sample(which(DF$c1 %in% x), y), ],
              x = freqs$Var1, y = freqs$ss, SIMPLIFY = FALSE)
do.call(rbind, res)
# c1 c2
#5 A 0.3558977
#17 A 0.2289039
#6 A 0.5355970
#13 A 0.9546536
#3 B 0.2395891
#25 B 0.8015470
#10 B 0.4226376
#15 B 0.5005032
#19 B 0.7289646
#11 C 0.7477465
#9 C 0.8998325
#12 C 0.8226526
#1 C 0.7066469
#4 C 0.7707715
#23 D 0.4861003
#20 D 0.2498805
#21 D 0.1611833
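With current dplyr this is a single grouped slice_sample() call; per its documentation, when a group has fewer rows than n, the sample is silently truncated to the group size, which is exactly the behaviour asked for:

library(dplyr)

set.seed(10)
df_new <- df %>%
  group_by(market) %>%
  slice_sample(n = 100) %>%   # markets with fewer than 100 rows keep all their rows
  ungroup()

table(df_new$market)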
