R: creating a new variable with conditions using dplyr

Hi, I am trying to create a new variable with dplyr.
My data looks like the following:
      Land happy  year
    <fctr> <int> <dbl>
1 Country1     9  2002
2 Country1     8  2012
3 Country3     5  2008
...
To create a variable with the mean of happy per Land and year, I used this code:
New <- df %>%
  group_by(Land, year) %>%
  mutate(mean.happy = mean(happy, na.rm = TRUE))
Now I would like to make a variable with this content:
(mean of happy in 2012) - (mean of happy in 2008) for each country.
How can I build a new variable with these conditions?

Here's a dplyr/tidyr solution.
library(dplyr)
library(tidyr)
df <- df %>%
  group_by(Land, year) %>%
  mutate(mean.happy = mean(happy, na.rm = TRUE)) %>%
  spread(year, mean.happy)
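Here spread() turns each year into its own column, so the 2012 minus 2008 difference can then be taken per Land. A minimal sketch of that final step (my addition, not part of the answer above; it assumes the wide columns end up named 2008 and 2012, and it summarises before spreading so there is one row per Land and year):
library(dplyr)
library(tidyr)
df %>%
  group_by(Land, year) %>%
  summarise(mean.happy = mean(happy, na.rm = TRUE)) %>%
  spread(year, mean.happy) %>%
  # countries missing a 2008 or 2012 observation get NA here
  mutate(diff.happiness = `2012` - `2008`)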

Here's a data.table solution. It's typically faster.
library(data.table)
dt <- read.table("clipboard", header = TRUE)
setDT(dt)
dt[, mean.happy := mean(happy), by = .(Land, year)]
dt[, diff.happiness := mean(happy[year == 2012]) - mean(happy[year == 2008])]
> dt
       Land happy year mean.happy diff.happiness
1: Country1     9 2002          9              3
2: Country1     8 2012          8              3
3: Country3     5 2008          5              3
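The question asks for the difference per country, while the line above computes it over the whole table. A hedged tweak (assuming the same dt) adds by = Land so the difference is taken within each country; note that in the three example rows every country lacks either a 2008 or a 2012 value, so this gives NaN until the full data is used:
dt[, diff.happiness := mean(happy[year == 2012]) - mean(happy[year == 2008]), by = Land]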

Related

Mean of a few months for monthly data in R

I want to find the average of the months from Nov to March, say Nov 1982 to Mar 1983. Then, for my result, I want the year in one column and the mean in another. If the mean is taken up to Mar 1983, I want the year to be shown as 1983 along with that mean.
This is what my data looks like.
I want my result to look like this.
1983 29.108
1984 26.012
I am not very good with R packages, so if there is an easy way to do this, I would really appreciate any help. Thank you.
Here is one approach to get the average of Nov-March for every year.
library(dplyr)
df %>%
  # Remove data for months April-October
  filter(!between(month, 4, 10)) %>%
  # Arrange the data by year and month
  arrange(year, month) %>%
  # Remove the first 3 months of the first year and
  # the last 2 months of the last year
  filter(!(year == min(year) & month %in% 1:3 |
           year == max(year) & month %in% 11:12)) %>%
  # Create a group column that increments at every November entry
  group_by(grp = cumsum(month == 11)) %>%
  # Take the average for each Nov-Mar window
  summarise(year = last(year),
            value = mean(value)) %>%
  select(-grp)
# A tibble: 2 x 2
#    year  value
#   <int>  <dbl>
# 1  1982  0.308
# 2  1983 -0.646
data
It is easier to help if you provide data in a reproducible format which can be copied easily.
set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12),month = 1:12,value = rnorm(36))
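An equivalent way to build the same Nov-Mar windows (a sketch using the reproducible df above; the season_year helper column is my own naming, not part of the original answer) is to label each window by the year of its March and keep only complete windows:
library(dplyr)
df %>%
  filter(month %in% c(11, 12, 1, 2, 3)) %>%
  # label each Nov-Mar window by the year in which it ends
  mutate(season_year = ifelse(month >= 11, year + 1, year)) %>%
  group_by(season_year) %>%
  # drop incomplete windows at the start and end of the series
  filter(n() == 5) %>%
  summarise(value = mean(value))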
With dplyr
# Remove the "#" at the beginning of the next line if dplyr (or the tidyverse) is not installed
# install.packages("dplyr")
library(dplyr)  # load the library
colnames(df) <- c("year", "month", "value")  # here I assume your dataset is named df
df <- df %>%
  group_by(year) %>%
  summarize(av_value = mean(value))
You can do this as follows using the tidyverse.
require(tidyverse)
year <- rep(1982:1984, each = 12)  # one year value per month
month <- rep(1:12, 3)
value <- runif(length(month))
dat <- data.frame(year, month, value)
head(dat)
dat now has the same structure as your data: one row per year and month.
The trick then is to group_by and summarise
dat %>%
  group_by(year) %>%
  summarise(value = mean(value))
Which gives you
# A tibble: 3 × 2
year value
<int> <dbl>
1 1982 0.450
2 1983 0.574
3 1984 0.398

How to filter a grouped dataframe with a conditional statement using dplyr?

I want to filter a dataframe using dplyr using a conditional. The condition I want to test is whether a country-year combination has two versions.
df <- data.frame(country = c("country1", "country2", "country1", "country2", "country3"),
                 year = rep(2011, 5),
                 version = c("versionA", "versionA", "versionB", "versionB", "versionB"))
Here is what I tried after looking here:
df %>%
group_by(country, year) %>%
{if unique(version)==1 . else filter(version == "versionA")}
What I am hoping to get is a dataframe that looks like this:
country year version
country1 2011 versionA
country2 2011 versionA
country3 2011 versionB
To count the number of unique values, we can use n_distinct() and filter the rows based on that.
library(dplyr)
df %>%
  group_by(country, year) %>%
  filter(if (n_distinct(version) == 2) version == 'versionA' else TRUE)
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
After grouping by 'country' and 'year', if the number of distinct elements is greater than 1, keep the 'versionA' rows; otherwise keep the first row of the group.
library(dplyr)
df %>%
  group_by(country, year) %>%
  filter((n_distinct(version) > 1 & version == 'versionA') | row_number() == 1)
# A tibble: 3 x 3
# Groups: country, year [3]
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
Or this can be written as an if/else
df %>%
  group_by(country, year) %>%
  filter(if (n_distinct(version) > 1) version == 'versionA'
         else row_number() == 1)
# A tibble: 3 x 3
# Groups: country, year [3]
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
Or another option is arrange
df %>%
  arrange(country, year, version != 'versionA') %>%
  group_by(country, year) %>%
  slice(1)
Or with summarize
df %>%
  group_by(country, year) %>%
  summarise(version = if (n_distinct(version) > 1) 'versionA' else first(version))
Or using data.table
library(data.table)
setDT(df)[, .SD[if (n_distinct(version) > 1) version == 'versionA'
                else 1], .(country, year)]
Base R one-liner (thanks @akrun):
df[!duplicated(df[1:2]), ]
Base R one-liner:
df[!duplicated(df[c("country", "year")]), ]
Tidyverse solution:
library(tidyverse)
df %>%
  filter(!duplicated(paste(country, year)))
A more generic base R solution:
# Create a counter of versions for each year/country combination:
df$tmp <- with(lapply(df, function(x) if (is.factor(x)) as.character(x) else x),
               ave(version, paste0(country, year), FUN = seq_along))
# Subset the dataframe to keep only the first record for each year/country:
df[which(df$tmp == 1), ]
A more generic tidyverse solution:
df %>%
  arrange(version) %>%
  filter(!duplicated(paste(country, year)))
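A related dplyr sketch (my addition, assuming 'versionA' sorts before 'versionB', as it does here) keeps the first row per country/year with distinct():
library(dplyr)
df %>%
  arrange(country, year, version) %>%
  distinct(country, year, .keep_all = TRUE)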

R: Grouping in a hierarchy

I'm working on a dataset with a grouping system based on six digits. The first two digits denote the grouping at the top level, the next two denote different sub-groups, and the last two digits denote the specific type within the sub-group. I want to group the data at the top level of the hierarchy (the first two digits only) and count unique names in each group.
An example for the GroupID 010203:
01 denotes BMW
02 denotes 3-series
03 denotes 320i (the exact model)
All I care about in this example is how many of each brand there is.
Toy dataset and wanted output:
library(data.table)
df <- data.table(Quarter = c('Q4', 'Q4', 'Q4', 'Q4', 'Q3'),
                 # GroupID as character so the leading zeros are kept
                 GroupID = c('010203', '150503', '010101', '150609', '010000'),
                 Name = c('AAAA', 'AAAA', 'BBBB', 'BBBB', 'CCCC'))
Output:
Quarter Group Counts
Q3 01 1
Q4 01 2
Q4 15 2
Using data.table we could do:
library(data.table)
df[, Group := substr(GroupID, 1, 2)][
   , Counts := .N, by = list(Group, Quarter)][
   , head(.SD, 1), by = .(Quarter, Group, Counts)][
   , .(Quarter, Group, Counts)]
Returns:
Quarter Group Counts
1: Q4 01 2
2: Q4 15 2
3: Q3 01 1
With dplyr and stringr we could do something like:
library(dplyr)
library(stringr)
df %>%
  mutate(Group = str_sub(GroupID, 1, 2)) %>%
  group_by(Group, Quarter) %>%
  summarise(Counts = n()) %>%
  ungroup()
Returns:
# A tibble: 3 x 3
Group Quarter Counts
<chr> <fct> <int>
1 01 Q3 1
2 01 Q4 2
3 15 Q4 2
Since you are already using data.table, you can do:
df[, Group := substr(GroupID, 1, 2)]
df <- df[, Counts := .N, .(Group, Quarter)][, .(Group, Quarter, Counts)]
df <- unique(df)
print(df)
   Group Quarter Counts
1:    01      Q4      2
2:    15      Q4      2
3:    01      Q3      1
Here's my simple solution with plyr and base R; it is lightning fast.
library(plyr)
df$breakid <- as.character(substr(df$GroupID, start = 1, stop = 2))
d <- plyr::count(df, c("Quarter", "breakid"))
Result
Quarter breakid freq
Q3 01 1
Q4 01 2
Q4 15 2
Alternatively, using tapply (and data.table indexing):
df$Brand <- substr(df$GroupID, 1, 2)
tapply(df$Brand, df[, .(Quarter, Brand)], length)
(If you don't care about the output being a matrix).
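The question asks for the number of unique names per brand and quarter; the answers above count rows, which coincides here because each row holds one name. A hedged data.table sketch (same df, character GroupID assumed) that counts distinct names explicitly uses uniqueN():
library(data.table)
df[, .(Counts = uniqueN(Name)), by = .(Quarter, Group = substr(GroupID, 1, 2))]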

Conditional summing across columns with dplyr

I have a data frame with four habitats sampled over eight months. Ten samples were collected from each habitat each month. The number of individuals of each species in each sample was counted. The following code generates a smaller data frame with a similar structure.
# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
I want to sum the total number of individuals by month, across all species sampled. I'm using ddply (preferred) but I'm open to other suggestions.
The closest I get is to add together the sum of each column, as shown here.
library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))
# Month tot_by_mon
# 1 Jan 84
# 2 Feb 92
# 3 Mar 67
This works, but I wonder if there is a generic method to handle cases with an "unknown" number of species. That is, the first species always begins in the 4th column but the last species could be in the 10th or 42nd column. I do not want to hard code the actual species names into the summary function. Note that the species names vary widely, such as Doryflav and Pheibica.
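Since ddply is preferred, here is a sketch of a generic version (my addition, not one of the posted answers; it assumes every species column name starts with "Species"):
library(plyr)
ddply(df, ~ Month, function(d) {
  # sum every column whose name starts with "Species", however many there are
  c(tot_by_mon = sum(d[grep("^Species", names(d))]))
})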
Similar to @useR's answer with data.table's melt, you can use tidyr to reshape with gather:
library(tidyr)
library(dplyr)
gather(df, Species, Value, matches("Species")) %>%
  group_by(Month) %>%
  summarise(z = sum(Value))
# A tibble: 3 x 2
Month z
<fctr> <int>
1 Jan 90
2 Feb 81
3 Mar 70
If you know the columns by position instead of a pattern to be "matched"...
gather(df, Species, Value, -(1:3)) %>%
  group_by(Month) %>%
  summarise(z = sum(Value))
(Results shown using @akrun's set.seed(123) example data.)
Here's another solution with data.table without needing to know the names of the "Species" columns:
library(data.table)
DT = melt(setDT(df), id.vars = c("Habitat", "Month", "Sample"))
DT[, .(tot_by_mon=sum(value)), by = "Month"]
or if you want it compact, here's a one-liner:
melt(setDT(df), 1:3)[, .(tot_by_mon=sum(value)), by = "Month"]
Result:
Month tot_by_mon
1: Jan 90
2: Feb 81
3: Mar 70
Data: (Setting seed to make example reproducible)
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
Supposing the species columns all start with Species, you can select them by that prefix and sum using group_by %>% do:
library(tidyverse)
df %>%
  group_by(Month) %>%
  do(tot_by_mon = sum(select(., starts_with('Species')))) %>%
  unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
If the column names don't follow a pattern, you can select by column position instead, for instance if the species columns run from the 4th column to the end of the data frame:
df %>%
  group_by(Month) %>%
  do(tot_by_mon = sum(select(., 4:ncol(.)))) %>%
  unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
Here is another option with data.table without reshaping to 'long' format
library(data.table)
setDT(df)[, .(tot_by_mon = Reduce(`+`, lapply(.SD, sum))), Month,
          .SDcols = Species1:Species3]
# Month tot_by_mon
#1: Jan 90
#2: Feb 81
#3: Mar 70
Or, with the tidyverse, we can also make use of the purrr map functions:
library(dplyr)
library(purrr)
library(tidyr)  # for nest()
df %>%
  group_by(Month) %>%
  nest(starts_with('Species')) %>%
  mutate(tot_by_mon = map_int(data, ~ sum(unlist(.x)))) %>%
  select(-data)
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 90
#2 Feb 81
#3 Mar 70
data
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2),
levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
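On newer dplyr versions (a sketch assuming dplyr 1.0 or later for across(); not part of the original answers), the same monthly total can be written without reshaping or nesting:
library(dplyr)
df %>%
  group_by(Month) %>%
  # across() picks up every column whose name starts with "Species",
  # so newly added species columns are included automatically
  summarise(tot_by_mon = sum(across(starts_with("Species"))))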

max([column]) where name = (each unique name in the name column) for each year in R

I am using the baby names data in R for practice.
total_n <- babynames %>%
  mutate(name_gender = paste(name, sex)) %>%
  group_by(year) %>%
  summarise(total_n = sum(n, na.rm = TRUE)) %>%
  arrange(total_n)
bn <- inner_join(babynames, total_n, by = "year")
df <- bn %>%
  mutate(pct_of_names = n / total_n) %>%
  group_by(name, year) %>%
  summarise(pct = sum(pct_of_names))
The resulting data frame contains, for each name, all the years and the related pct for that year. I am stuck on getting the year with the highest pct for each name. How do I do this?
Pretty simple, once you know where the babynames data comes from. You had everything needed:
library(dplyr)
library(babynames)
total_n <- babynames %>%
  mutate(name_gender = paste(name, sex)) %>%
  group_by(year) %>%
  summarise(total_n = sum(n, na.rm = TRUE)) %>%
  arrange(total_n)
bn <- inner_join(babynames, total_n, by = "year")
df <- bn %>%
  mutate(pct_of_names = n / total_n) %>%
  group_by(name, year) %>%
  summarise(pct = sum(pct_of_names))
You were missing this final step:
df %>%
  group_by(name) %>%
  filter(pct == max(pct))
# A tibble: 95,025 x 3
# Groups: name [95,025]
name year pct
<chr> <dbl> <dbl>
1 Aaban 2014 4.338256e-06
2 Aabha 2014 2.440269e-06
3 Aabid 2003 1.316094e-06
4 Aabriella 2015 1.363073e-06
5 Aada 2015 1.363073e-06
6 Aadam 2015 5.997520e-06
7 Aadan 2009 6.031433e-06
8 Aadarsh 2014 4.880538e-06
9 Aaden 2009 3.335645e-04
10 Aadesh 2011 1.370356e-06
# ... with 95,015 more rows
group_by and filter are your friends.
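A hedged alternative on newer dplyr versions (assuming dplyr 1.0 or later; slice_max() is not part of the original answer) makes the tie handling explicit:
library(dplyr)
df %>%
  group_by(name) %>%
  slice_max(pct, n = 1, with_ties = FALSE)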
