Aggregate/Group_by second minimum value in R

I have used either group_by() in dplyr or the aggregate() function to aggregate across columns in R. For my current problem I want to group by an individual, but find the second-lowest value of one column (Number) and the lowest value of another (Year). So, if my data looks like this:
Number Individual Year Value
123 M. Smith 2010 234
435 M. Smith 2011 346
435 M. Smith 2012 356
524 M. Smith 2015 432
119 J. Jones 2010 345
119 J. Jones 2012 432
254 J. Jones 2013 453
876 J. Jones 2014 654
I want it to become:
Number Individual Year Value
435 M. Smith 2011 346
254 J. Jones 2013 453
Thank you.

We can use the dplyr package. dt2 is the final output. The idea is to filter out the minimum in the Number column, then arrange the data frame by Individual, Number, and Year. Finally, select the first row of each group.
# Load package
library(dplyr)
# Create example data frame
dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
group_by(Individual) %>%
filter(Number != min(Number)) %>%
arrange(Individual, Number, Year) %>%
slice(1)
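As a variation on the same idea, here is a minimal sketch that uses dense_rank() to pick the second-lowest distinct Number and slice_min() to break ties by the earliest Year. It assumes dplyr 1.0.0 or later (for slice_min()); the names dt and second_lowest are only for illustration.

```r
library(dplyr)

# Same example data as in the question, rebuilt so the block runs on its own
dt <- data.frame(
  Number = c(123, 435, 435, 524, 119, 119, 254, 876),
  Individual = rep(c("M. Smith", "J. Jones"), each = 4),
  Year = c(2010, 2011, 2012, 2015, 2010, 2012, 2013, 2014),
  Value = c(234, 346, 356, 432, 345, 432, 453, 654),
  stringsAsFactors = FALSE
)

second_lowest <- dt %>%
  group_by(Individual) %>%
  # dense_rank() gives tied values the same rank, so rank 2 is the
  # second-lowest distinct Number within each individual
  filter(dense_rank(Number) == 2) %>%
  # among those rows, keep the one with the earliest Year
  slice_min(Year, n = 1, with_ties = FALSE) %>%
  ungroup()

print(second_lowest)
```

Note that an individual with only one distinct Number has no rank 2 and would simply drop out of the result.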

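We can use dplyr. Drop each group's minimum Number, then take the smallest remaining Number; sorting by Year first means ties resolve to the earliest Year.
library(dplyr)
df1 %>%
  group_by(Individual) %>%
  arrange(Individual, Number, Year) %>%
  filter(Number != min(Number)) %>%
  slice(which.min(Number))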
# A tibble: 2 x 4
# Groups: Individual [2]
# Number Individual Year Value
# <int> <chr> <int> <int>
#1 254 J. Jones 2013 453
#2 435 M. Smith 2011 346

Related

How to assign new variables after group_split by automatically?

I am trying to split a data frame by two variables, year and sector. I split it with group_split, but every time I need one of the pieces I have to access it with the $ operator. I want to name the pieces automatically so that I do not need $ for every use. I know I can assign them to new names by hand, but I have more than 70 values, so that is time-consuming.
dummy <- data.frame(year = rep(2014:2020, 12),
sector = rep(c("auto","retail","sales","medical"),3),
emp = sample(1:2000, size = 84))
dummy %>%
  group_split(year) %>%
  set_names(nm = unique(dummy$year)) -> dummy_year
# Names that start with a digit must be quoted with backticks after $
head(dummy_year$`2014`)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
I want to call them like
some_kind_of_function(dummy_year, assign new variable by date)
head(year_2014)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
maybe a for loop?
Maybe you want something like this:
library(dplyr)
dummy %>%
split(f = paste0("year_", as.factor(.$year)))
group_split does not return a named list. We can use split from base R instead:
lst1 <- split(dummy, dummy$year)
names(lst1) <- paste0('year_', names(lst1))
If we want to create objects (not recommended), use list2env
list2env(lst1, .GlobalEnv)
-output
> year_2014
year sector emp
1 2014 auto 740
8 2014 medical 123
15 2014 sales 700
22 2014 retail 166
29 2014 auto 323
36 2014 medical 653
43 2014 sales 986
50 2014 retail 1814
57 2014 auto 1381
64 2014 medical 661
71 2014 sales 1362
78 2014 retail 641
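Since creating one global object per year is discouraged, here is a minimal sketch of working with the named list directly. The set.seed() call and the names by_year and rows_per_year are assumptions added for illustration; they are not part of the original answers.

```r
set.seed(1)  # only so the sampled emp values are reproducible

dummy <- data.frame(
  year = rep(2014:2020, 12),
  sector = rep(c("auto", "retail", "sales", "medical"), 21),
  emp = sample(1:2000, size = 84)
)

# split() names the list elements after the grouping values
by_year <- split(dummy, dummy$year)
names(by_year) <- paste0("year_", names(by_year))

# Access one piece by name with [[ ]] instead of creating a global variable
head(by_year[["year_2014"]])

# Apply the same operation to every piece without naming each one
rows_per_year <- vapply(by_year, nrow, integer(1))
print(rows_per_year)
```

Keeping the pieces in one list means lapply() or vapply() can process all years at once, which is usually the reason to avoid list2env().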

R: fill out a new column based on several variables in another dataset

I have a first dataframe with 4 columns (ID, Year, X and Y)
Id Year X Y
1 2017 20_24
1 2016 45_49
2 2017 30_34
2 2014 20_24
4 2014 14_19
4 2015 20_24
I would like to fill in the Y column using another dataset.
The second dataset has the same ID and Year variables; its other columns are named after the values of column X in the first dataset.
Id Year 14_19 20_24 30_34 45_49
1 2017 123 122 5555 4444
1 2016 456 543 8888 333
1 2015 5644 0908 0987 5456
1 2014 5642 767 233 323
2 2017 123 123 5666 989
2 2016 456 876 55 45
2 2015 786 789 324 77
2 2014 633 543 334 34
3 2017 123 123 321 44
3 2016 456 345 45645 23
3 2015 876 4556 6554 23
So I would like Y to be filled in wherever ID, Year and the value of X match a row and column of the second dataset. How is this possible?
Thanks!
Try this dplyr and tidyr solution:
library(dplyr)
library(tidyr)
result <- df2 %>%
gather("X", "Y", -c("ID", "Year")) %>%
right_join(df1, by = c("ID", "Year", "X"))
Or with pivot_longer():
# Pivot every age-band column, i.e. everything except the ID and Year keys
result <- df2 %>%
  pivot_longer(cols = -c(ID, Year),
               names_to = "X",
               values_to = "Y") %>%
  right_join(df1, by = c("ID", "Year", "X"))
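To make the reshape-then-join idea concrete, here is a small self-contained sketch on a cut-down version of the two tables. It assumes the key column is called ID in both data frames and that df1 has no pre-existing Y column; long and result are illustrative names.

```r
library(dplyr)
library(tidyr)

# First dataset: the rows whose Y value we want to look up
df1 <- data.frame(
  ID = c(1, 1, 2),
  Year = c(2017, 2016, 2017),
  X = c("20_24", "45_49", "30_34"),
  stringsAsFactors = FALSE
)

# Second dataset: one column per age band
# (check.names = FALSE keeps the column names that start with a digit)
df2 <- data.frame(
  ID = c(1, 1, 2),
  Year = c(2017, 2016, 2017),
  `14_19` = c(123, 456, 123),
  `20_24` = c(122, 543, 123),
  `30_34` = c(5555, 8888, 5666),
  `45_49` = c(4444, 333, 989),
  check.names = FALSE
)

# One row per (ID, Year, age band), with the band name in X and its value in Y
long <- df2 %>%
  pivot_longer(cols = -c(ID, Year), names_to = "X", values_to = "Y")

# Keep every row of df1 and attach the matching Y
result <- df1 %>%
  left_join(long, by = c("ID", "Year", "X"))

print(result)
```

A left_join() from df1 is equivalent to the right_join() used above, but keeps the row and column order of df1.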

Group By and Summarize

I have a quick question. I ran group_by and summarize on my data as shown below. How do I get the length of each variable (Trump, Obama, McConnell) individually?
dta.subset.tabluea = dta.subset %>%
group_by(variable,catvalue2) %>%
summarize(value = length(catvalue))
The output I got was
variable catvalue2 value
1 Trump Slightly Warm 216
2 Trump Very Cold 778
3 Trump Very Warm 311
4 Trump <NA> 176
5 Obama Slightly Warm 251
6 Obama Very Cold 427
7 Obama Very Warm 676
8 Obama <NA> 224
9 McConnell Slightly Warm 248
10 McConnell Very Cold 731
11 McConnell Very Warm 60
12 McConnell <NA> 444
How do I put the length of each variable (Trump, Obama, McConnell) in another column? I need this so I can compute percentages.
If I do the following, I get the same values as in the first column.
summarize(value = length(catvalue), varvalue = length(vatvalue))
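One possible approach, sketched under assumptions: count the rows per (variable, catvalue2) pair first, then regroup by variable alone so that sum() yields the per-variable total. The toy data frame dta below is invented for illustration and is not the asker's data; varvalue and pct are hypothetical column names.

```r
library(dplyr)

# Toy stand-in for dta.subset (hypothetical values)
dta <- data.frame(
  variable = c("Trump", "Trump", "Trump", "Obama", "Obama"),
  catvalue2 = c("Very Cold", "Very Cold", "Very Warm", "Very Warm", "Very Cold"),
  stringsAsFactors = FALSE
)

counts <- dta %>%
  # rows per (variable, catvalue2) combination
  count(variable, catvalue2, name = "value") %>%
  # regroup by variable only, so sum(value) is the total for that variable
  group_by(variable) %>%
  mutate(varvalue = sum(value),
         pct = 100 * value / varvalue) %>%
  ungroup()

print(counts)
```

Inside a single summarize() call both lengths are computed within the same (variable, catvalue2) group, which is why they come out identical; the second grouping step is what separates the two totals.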

Gathering data by paired columns

I'm having trouble in shaping my dataframe.
Here's an example:
id institution name1 id1 name2 id2
1 usp Miles Davis 123 Arturo Sandoval 111
2 unb Chet Baker 321 Clifford Brown 121
3 usp Wayne Shorter 222 Hermeto Pascoal 322
4 Puc-rio John Coltrane 333 Charlie Parker 112
I need to keep the id and institution columns and gather the other ones like this:
id institution name_all id_all
1 usp Miles Davis 123
1 usp Arturo Sandoval 111
2 unb Chet Baker 321
2 unb Clifford Brown 121
3 usp Wayne Shorter 222
3 usp Hermeto Pascoal 322
4 Puc-rio John Coltrane 333
4 Puc-rio Charlie Parker 112
I'm using the gather function from tidyr:
df %>%
gather(name_all, id_all, -id, -institution)
but it comes like this:
id institution name id
1 usp name1 Miles Davis
1 usp id1 123
2 unb name1 Chet Baker
2 unb id2 121
Any ideas on how to pair those values? I have more than 5 columns to do so, I think that I'm missing an argument to specify which one of them are paired. I hope I've made myself clear.
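For a tidyverse solution, you can:
library(dplyr)
library(tidyr)
df %>%
  gather(ColType, ColValue, -id, -institution) %>%
  # Split e.g. "name1" into the pair number ("1") and the column stem ("name").
  # The "_all" suffix keeps the spread "id" stem from clashing with the existing id column.
  mutate(id_number = gsub("^(\\D*)(\\d*)$", "\\2", ColType, ignore.case = TRUE, perl = TRUE),
         ColType = paste0(gsub("^(\\D*)(\\d*)$", "\\1", ColType, ignore.case = TRUE, perl = TRUE), "_all")
  ) %>%
  spread(ColType, ColValue) %>%
  select(-id_number)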
I'm sure that there is a more elegant solution, but you can try:
library(dplyr)
library(tidyr)
library(readr) # for parse_number()
df %>%
  gather(var, name_all, -matches("id|institution")) %>%
  gather(var2, val, -c(id, institution, var, name_all)) %>%
  mutate(id_all = ifelse(parse_number(var) == parse_number(var2), val, NA)) %>%
  na.omit() %>%
  select(-var, -var2, -val) %>%
  arrange(id)
id institution name_all id_all
1 1 usp Miles_Davis 123
2 1 usp Arturo_Sandoval 111
3 2 unb Chet_Baker 321
4 2 unb Clifford_Brown 121
5 3 usp Wayne_Shorter 222
6 3 usp Hermeto_Pascoal 322
7 4 Puc-rio John_Coltrane 333
8 4 Puc-rio Charlie_Parker 112
First, it transforms the data from wide to long, excluding the variables whose names contain "institution" or "id". Second, it performs a second wide-to-long transformation so that all the numbered "id" variables and their values become separate rows. Third, it checks whether the "name" variable has the same number as the "id" variable. If so, it assigns the appropriate value, otherwise NA. Finally, it removes the rows with NAs, drops the redundant variables and arranges the data.
Sample data:
df <- read.table(text = "
id institution name1 id1 name2 id2
1 usp Miles_Davis 123 Arturo_Sandoval 111
2 unb Chet_Baker 321 Clifford_Brown 121
3 usp Wayne_Shorter 222 Hermeto_Pascoal 322
4 Puc-rio John_Coltrane 333 Charlie_Parker 112", header = TRUE, stringsAsFactors = FALSE)
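With tidyr 1.0.0 or later, paired columns can also be gathered in one step using the ".value" sentinel of pivot_longer(). The following is a sketch rather than a drop-in answer: the temporary name row_id is an assumption, introduced only because the "id" stem of id1/id2 would otherwise collide with the existing id column.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = 1:4,
  institution = c("usp", "unb", "usp", "Puc-rio"),
  name1 = c("Miles_Davis", "Chet_Baker", "Wayne_Shorter", "John_Coltrane"),
  id1 = c(123, 321, 222, 333),
  name2 = c("Arturo_Sandoval", "Clifford_Brown", "Hermeto_Pascoal", "Charlie_Parker"),
  id2 = c(111, 121, 322, 112),
  stringsAsFactors = FALSE
)

paired <- df %>%
  # move the row identifier out of the way of the "id" stem
  rename(row_id = id) %>%
  pivot_longer(
    cols = matches("\\d$"),            # name1, id1, name2, id2
    names_to = c(".value", "pair"),    # stem becomes a column, digit becomes "pair"
    names_pattern = "^(\\D+)(\\d+)$"
  ) %>%
  select(-pair) %>%
  rename(name_all = name, id_all = id) %>%
  rename(id = row_id)

print(paired)
```

Because the pairing comes from the trailing digit in the column names, the same call should extend to more than two pairs (name3/id3 and so on) without changes.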

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
should get a new column named "Rate", taken from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset and join on 'Country' and 'Year'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable.name = "Country", value.name = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names)
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
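For the two-criteria case specifically, base R can also do the lookup without a loop or any reshaping, by indexing a matrix with a two-column character matrix of (row name, column name) pairs. This is a sketch on the example data; rate_matrix is an illustrative name, and the approach does not generalise directly to more than two criteria, where the melt-and-join answers above are the better fit.

```r
A <- data.frame(
  Country = c("USA", "Canada", "Canada"),
  Year = c(1994, 1997, 1998),
  Number = c(455, 342, 987),
  stringsAsFactors = FALSE
)

B <- data.frame(
  Year = 1993:1998,
  USA = c(21, 41, 56, 85, 65, 1),
  Canada = c(654, 321, 789, 123, 456, 999)
)

# Rates as a matrix with years as row names and countries as column names
rate_matrix <- as.matrix(B[, -1])
rownames(rate_matrix) <- as.character(B$Year)

# Each row of the index matrix is one (year, country) pair to look up
A$Rate <- rate_matrix[cbind(as.character(A$Year), A$Country)]

print(A)
```

Every (Year, Country) pair in A must exist in B for this to work; a missing pair raises a subscript error rather than returning NA.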

Resources