Creating a distinct-values count column up to a certain point in time in R

I have a question on how to count unique values up to a certain point in time. For example, I want to know how many unique locations a person has lived in up to that point.
created <- c(2009, 2010, 2010, 2011, 2012, 2011)
person <- c('A', 'A', 'A', 'A', 'B', 'B')
location <- c('London', 'Geneva', 'London', 'New York', 'London', 'London')
df <- data.frame(created, person, location)
I want to create a variable called unique that counts how many distinct places the person has lived in up to that point in time. I have tried the following. Any suggestions?
library(dplyr)
df %>%
  group_by(person, location) %>%
  arrange(created, .by_group = TRUE) %>%
  mutate(unique = distinct(location))
The expected result would be unique <- c(1, 2, 2, 3, 1, 1).

One way is to use cumsum and duplicated
library(dplyr)
df %>%
  group_by(person) %>%
  mutate(unique = cumsum(!duplicated(location)))
# created person location unique
# <dbl> <fct> <fct> <int>
#1 2009 A London 1
#2 2010 A Geneva 2
#3 2010 A London 2
#4 2011 A New York 3
#5 2012 B London 1
#6 2011 B London 1
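To see why this works, here is a small illustration on a plain vector (the locations of person A above):
x <- c("London", "Geneva", "London", "New York")
!duplicated(x)          # TRUE TRUE FALSE TRUE  -> marks the first appearance of each value
cumsum(!duplicated(x))  # 1 2 2 3               -> running count of distinct values seen so far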

We can use cummax
library(dplyr)
df %>%
  group_by(person) %>%
  mutate(unique = cummax(match(location, unique(location))))
# A tibble: 6 x 4
# Groups: person [2]
# created person location unique
# <dbl> <fct> <fct> <int>
#1 2009 A London 1
#2 2010 A Geneva 2
#3 2010 A London 2
#4 2011 A New York 3
#5 2012 B London 1
#6 2011 B London 1
Or with base R
df$unique <- with(df, ave(location, person,
                          FUN = function(x) cummax(match(x, unique(x)))))
data
df <- structure(list(created = c(2009, 2010, 2010, 2011, 2012, 2011
), person = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), location = structure(c(2L, 1L, 2L, 3L,
2L, 2L), .Label = c("Geneva", "London", "New York"), class = "factor")),
class = "data.frame", row.names = c(NA,
-6L))

Related

How to aggregate R dataframe of two columns based on values of another

My dataframe is as follows, in which gender == "1" refers to men and gender == "2" refers to women. Occupations go from A to U and year goes from 2010 to 2018 (I give a small example):
Gender Occupation Year
1 A 2010
1 A 2010
2 A 2010
1 B 2010
2 B 2010
1 A 2011
2 A 2011
1 C 2011
2 C 2011
I want an output that counts the number of rows for each distinct combination of year, occupation and gender, like you can see next:
Year | Occupation | Men | Woman
2010 | A | 2 | 1
2010 | B | 1 | 1
2011 | A | 1 | 1
2011 | C | 1 | 1
I have tried the following:
Nr_gender_occupation <- data %>%
  group_by(year, occupation) %>%
  summarise(
    Men = aggregate(gender == "1" ~ occupation, FUN = count),
    Women = aggregate(gender == "2" ~ occupation, FUN = count)
  )
We could use the index in 'Gender' to change the values, then, with pivot_wider from tidyr, reshape the data into 'wide' format.
library(dplyr)
library(tidyr)
data %>%
  mutate(Gender = c("Male", "Female")[Gender]) %>%
  pivot_wider(names_from = Gender, values_from = Gender, values_fn = length)
Output
# A tibble: 4 x 4
# Occupation Year Male Female
# <chr> <int> <int> <int>
#1 A 2010 2 1
#2 B 2010 1 1
#3 A 2011 1 1
#4 C 2011 1 1
Or use table with unnest
library(tidyr)
data %>%
group_by(Year, Occupation) %>%
summarise(out = list(table(Gender)), .groups = 'drop') %>%
unnest_wider(out)
Or we can use count with pivot_wider
data %>%
  count(Gender, Occupation, Year) %>%
  pivot_wider(names_from = Gender, values_from = n)
data
data <- structure(list(Gender = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
Occupation = c("A", "A", "A", "B", "B", "A", "A", "C", "C"
), Year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L,
2011L, 2011L)), class = "data.frame", row.names = c(NA, -9L
))
You can also do a count within your groups:
library(dplyr)
df %>%
  group_by(Occupation, Year) %>%
  summarize(Men = sum(Gender == 1),
            Woman = sum(Gender == 2), .groups = "drop")
Output
Occupation Year Men Woman
<chr> <dbl> <int> <int>
1 A 2010 2 1
2 A 2011 1 1
3 B 2010 1 1
4 C 2011 1 1
A data.table option using dcast
library(data.table)
dcast(setDT(df), Year + Occupation ~ c("Men", "Woman")[Gender])
gives
Year Occupation Men Woman
1: 2010 A 2 1
2: 2010 B 1 1
3: 2011 A 1 1
4: 2011 C 1 1

R function to paste information from different rows with a common column? [duplicate]

This question already has an answer here:
dplyr::first() to choose first non NA value
(1 answer)
Closed 2 years ago.
I understand we can use the dplyr function coalesce() to unite different columns, but is there such a function to unite rows?
I am struggling with a confusing, incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.:
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # fill() comes from tidyr

# Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L),
                     sex = structure(c(2L, NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"),
                     age = c(NA, 3L, 2L, NA, 2L),
                     source = c(1L, 1L, 2L, NA, NA)),
                class = "data.frame", row.names = c(NA, -5L))

df %>%
  group_by(id) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(.fns = ~ na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]

How to add additional columns using tidyr group_by function in R?

This question is a follow-up to my post from this answer.
Data
df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020",
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A",
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF",
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple",
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L,
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Solution
library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
  group_by(Date, Market) %>%
  group_by(Revenue = c(Quantity %*% Price),
           TotalCost = c(Quantity %*% Cost),
           Product, .add = TRUE) %>%
  summarise(Sold = sum(Quantity)) %>%
  pivot_wider(names_from = Product, values_from = Sold)
# A tibble: 2 x 7
# Groups: Date, Market, Revenue, TotalCost [2]
# Date Market Revenue TotalCost Apple Banana Orange
# <chr> <chr> <dbl> <dbl> <int> <int> <int>
#1 6/24/2020 A 135 37.5 35 20 20
#2 6/25/2020 A 25 15 10 15 NA
@akrun's solution works well. Now I'd like to know how to add three more columns for quantity sold by salesman to the existing results, so the final output will look like this:
Date      | Market | Revenue | Total Cost | Apples Sold | Bananas Sold | Oranges Sold | MF | RP | FR
6/24/2020 | A      | 135     | 37.5       | 35          | 20           | 20           | 20 | 35 | 20
6/25/2020 | A      | 25      | 15         | 15          | 25           | NA           | 25 | NA | NA
One option would be to do the group-by operations separately, as these are done on separate columns, and then do a join by the common columns, i.e. 'Date' and 'Market'.
library(dplyr)
library(tidyr)
out1 <- df1 %>%
  group_by(Date, Market) %>%
  group_by(Revenue = c(Quantity %*% Price),
           TotalCost = c(Quantity %*% Cost),
           Product, .add = TRUE) %>%
  summarise(Sold = sum(Quantity)) %>%
  pivot_wider(names_from = Product, values_from = Sold)

out2 <- df1 %>%
  group_by(Date, Market, Salesman) %>%
  summarise(SalesSold = sum(Quantity)) %>%
  pivot_wider(names_from = Salesman, values_from = SalesSold)

left_join(out1, out2)
# A tibble: 2 x 10
# Groups: Date, Market, Revenue, TotalCost [2]
# Date Market Revenue TotalCost Apple Banana Orange FR MF RP
# <chr> <chr> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
#1 6/24/2020 A 135 37.5 35 20 20 20 20 35
#2 6/25/2020 A 25 15 10 15 NA NA 25 NA

How do I sample single (random) rows that can be grouped by a column's values? [duplicate]

This question already has answers here:
Random row selection in R
(2 answers)
Closed 6 years ago.
Here is a sample of the data
p <- structure(list(name = structure(1:5, .Label = c("Alice", "Bob",
"Charlie", "Dennis", "Earl"), class = "factor"), cohort = structure(c(3L,
3L, 2L, 2L, 1L), .Label = c("X", "Y", "Z"), class = "factor"),
group = structure(c(1L, 1L, 2L, 2L, 1L), .Label = c("A",
"B"), class = "factor"), var = c(1L, 2L, 1L, 3L, 4L)), .Names = c("name",
"cohort", "group", "var"), class = "data.frame", row.names = c(NA,
-5L))
that looks like
name cohort group var
1 Alice Z A 1
2 Bob Z A 2
3 Charlie Y B 1
4 Dennis Y B 3
5 Earl X A 4
and I need something like the following, based on the cohort column. I need to sample one row in each cohort (possibly randomly) so that I don't have multiple people belonging to the same cohort.
name cohort group var
2 Bob Z A 2
3 Charlie Y B 1
5 Earl X A 4
I can group_by cohort, but then I'm not sure how to proceed to create a new data frame with only the rows that I need.
You can try to use aggregate with sample to choose which value to keep, first changing the name and group columns from factor to character:
p$name <- as.character(p$name)
p$group <- as.character(p$group)
aggregate(. ~ cohort, data = p, FUN = function(x) x[sample(seq_along(x), 1)])
# cohort name group var
#1 X Earl A 4
#2 Y Dennis B 1
#3 Z Bob A 2
You can group by cohort and pipe it to sample_n where 1 indicates that you want one sample per group
library(dplyr)
p %>% group_by(cohort) %>% sample_n(1)
Source: local data frame [3 x 4]
Groups: cohort [3]
name cohort group var
(fctr) (fctr) (fctr) (int)
1 Earl X A 4
2 Dennis Y B 3
3 Alice Z A 1
Second run:
name cohort group var
(fctr) (fctr) (fctr) (int)
1 Earl X A 4
2 Charlie Y B 1
3 Bob Z A 2
"Possibly random but not necessarily" happens to be, what SQL gives:
library(sqldf)
sqldf("SELECT * FROM p GROUP BY cohort")
in this case I get
> sqldf("SELECT * FROM p GROUP BY cohort")
name cohort group var
1 Earl X A 4
2 Dennis Y B 3
3 Bob Z A 2
You can use the sampling library:
library(sampling)
n <- length(unique(p$cohort))
# draw one unit per cohort by simple random sampling without replacement ('srswor')
s <- strata(p, 'cohort', rep(1, n), method = 'srswor')
p$id <- row.names(p)
p[p$id %in% s$ID_unit, ]

Combine result from top_n with an "Other" category in dplyr

I have a data frame dat1
Country Count
1 AUS 1
2 NZ 2
3 NZ 1
4 USA 3
5 AUS 1
6 IND 2
7 AUS 4
8 USA 2
9 JPN 5
10 CN 2
First I want to sum "Count" per "Country". Then the top 3 total counts per country should be combined with an additional row "Others", which is the sum of the countries that are not part of the top 3.
The expected outcome therefore would be:
Country Count
1 AUS 6
2 JPN 5
3 USA 5
4 Others 7
I have tried the below code, but could not figure out how to place the "Others" row.
dat1 %>%
  group_by(Country) %>%
  summarise(Count = sum(Count)) %>%
  arrange(desc(Count)) %>%
  top_n(3)
This code currently gives:
Country Count
1 AUS 6
2 JPN 5
3 USA 5
Any help would be greatly appreciated.
dat1 <- structure(list(Country = structure(c(1L, 5L, 5L, 6L, 1L, 3L,
1L, 6L, 4L, 2L), .Label = c("AUS", "CN", "IND", "JPN", "NZ",
"USA"), class = "factor"), Count = c(1L, 2L, 1L, 3L, 1L, 2L,
4L, 2L, 5L, 2L)), .Names = c("Country", "Count"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Instead of top_n, this seems like a good case for the convenience function tally. It uses summarise, sum and arrange under the hood.
Then use factor to create an "Other" category. Use the levels argument to set "Other" as the last level. "Other" will then be placed last in the table (and in any subsequent plot of the result).
If "Country" is a factor in your original data, you may wrap Country[1:3] in as.character (see the sketch after the output below).
group_by(df, Country) %>%
  tally(Count, sort = TRUE) %>%
  group_by(Country = factor(c(Country[1:3], rep("Other", n() - 3)),
                            levels = c(Country[1:3], "Other"))) %>%
  tally(n)
# Country n
# (fctr) (int)
#1 AUS 6
#2 JPN 5
#3 USA 5
#4 Other 7
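A minimal sketch of that as.character variant, assuming dat1 as defined in the question (with Country stored as a factor):
library(dplyr)
group_by(dat1, Country) %>%
  tally(Count, sort = TRUE) %>%
  # convert the top-3 factor values to character before combining with "Other"
  group_by(Country = factor(c(as.character(Country[1:3]),
                              rep("Other", n() - 3)),
                            levels = c(as.character(Country[1:3]), "Other"))) %>%
  tally(n)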
You can use fct_lump from the forcats library
library(forcats)
dat1 %>%
  group_by(fct_lump(Country, n = 3, w = Count)) %>%
  summarize(Count = sum(Count))
This should do it; you can also change the "Other" label using the other_level parameter inside fct_lump, as sketched below.
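For instance, a small sketch (assuming dat1 as defined in the question) that relabels the lumped level "Others" to match the expected output:
library(dplyr)
library(forcats)
dat1 %>%
  # keep the 3 countries with the largest total Count, lump the rest together as "Others"
  group_by(Country = fct_lump(Country, n = 3, w = Count, other_level = "Others")) %>%
  summarise(Count = sum(Count))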
We could do this in two steps: first create a sorted data.frame, and then rbind the top three rows with a summary of the last rows:
d <- df %>%
  group_by(Country) %>%
  summarise(Count = sum(Count)) %>%
  arrange(desc(Count))
rbind(top_n(d, 3),
      slice(d, 4:n()) %>% summarise(Country = "other", Count = sum(Count)))
output
Country Count
(fctr) (int)
1 AUS 6
2 JPN 5
3 USA 5
4 other 7
Here is an option using data.table. We convert the 'data.frame' to a 'data.table' (setDT(dat1)); grouped by 'Country' we get the sum of 'Count', then order by 'Count', and we rbind the first three observations with a list of 'Others' and the sum of 'Count' of the rest of the observations.
library(data.table)
setDT(dat1)[, list(Count = sum(Count)), Country][order(-Count),
    rbind(.SD[1:3], list(Country = 'Others', Count = sum(.SD[[2]][4:.N])))]
# Country Count
#1: AUS 6
#2: USA 5
#3: JPN 5
#4: Others 7
Or using base R
d1 <- aggregate(. ~ Country, dat1, FUN = sum)
i1 <- order(-d1$Count)
rbind(d1[i1, ][1:3, ],
      data.frame(Country = 'Others', Count = sum(d1$Count[i1][4:nrow(d1)])))
You could even use xtabs() and manipulate the result. This is a base R answer.
s <- sort(xtabs(Count ~ ., dat1), decreasing = TRUE)
setNames(
  as.data.frame(as.table(c(head(s, 3), Others = sum(tail(s, -3))))),
  names(dat1)
)
# Country Count
# 1 AUS 6
# 2 JPN 5
# 3 USA 5
# 4 Others 7
A function some might find useful:
# the pipe used below comes from magrittr (also re-exported by dplyr)
library(magrittr)

top_cases = function(v, top, other = 'other'){
  cv = class(v)
  v = as.character(v)
  # values not listed in 'top' become the 'other' label
  v[factor(v, levels = top) %>% is.na()] = other
  # restore the factor class with 'other' as the last level
  if(cv == 'factor') v = factor(v, levels = c(top, other))
  v
}
E.g.:
> table(state.region)
state.region
Northeast South North Central West
9 16 12 13
> top_cases(state.region, c('South','West'), 'North') %>% table()
.
South West North
16 13 21
iris %>% mutate(Species = top_cases(Species, c('setosa','versicolor')))
For those interested in keeping only the categories above some percentage threshold and placing the rest into an 'other' category, here's some code.
Here, any values below 5% go into the 'other' category, the 'other' category is summed, and its label records the number of categories aggregated into it.
# assumes 'sub' is a data frame with a proportion column 'value' (proportions summing to 1)
othernum <- nrow(sub[sub$value < .05, ])   # number of categories below the 5% threshold
sub <- subset(sub, value > .05)            # keep only the categories above 5%
toplot <- rbind(sub,
                c(paste("Other (", othernum, " types)", sep = ""), 1 - sum(sub$value)))
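A self-contained sketch of the same idea, assuming dat1 from the question above and using the forcats helper fct_lump_prop (forcats 0.5.0+), which lumps by proportion directly:
library(dplyr)
library(forcats)
dat1 %>%
  # lump together every country contributing less than 10% of the total Count
  group_by(Country = fct_lump_prop(Country, prop = 0.1, w = Count,
                                   other_level = "Other")) %>%
  summarise(Count = sum(Count))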
