Joining two data frames using range of values - r

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band each income falls inside. Where several observations (incomes) fall within the same range, I would like to keep the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.

Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for ease of understanding. You can omit it and simply chain the two pipelines together with %>%.

Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
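For comparison, the same range join can be sketched in base R without extra packages. This is only an illustrative sketch: the small data_occ below is a hypothetical, deterministic stand-in for the question's random sample, and the cross-merge step may be slow on large tables.

```r
# Hypothetical, deterministic stand-ins for the question's data
income_range <- data.frame(id = "France",
                           inc_lower = c(10, 21, 31, 41, 51, 61, 71),
                           inc_high  = c(20, 30, 40, 50, 60, 70, 80),
                           perct     = 1:7)
data_occ <- data.frame(id     = c(rep("France", 9), "Belgium"),
                       income = c(15, 12, 25, 38, 33, 47, 55, 66, 79, 11),
                       occ    = c("clerk", "manager", "manual", "office", "skilled",
                                  "clerk", "manager", "manual", "office", "clerk"))

# 1. Pair every band with every income that shares the same id
pairs <- merge(income_range, data_occ, by = "id")

# 2. Keep only the pairs whose income lies inside the band
pairs <- pairs[pairs$income >= pairs$inc_lower & pairs$income <= pairs$inc_high, ]

# 3. Sort by band and income, then keep the first (lowest) income per band
pairs <- pairs[order(pairs$perct, pairs$income), ]
result <- pairs[!duplicated(pairs$perct), ]
rownames(result) <- NULL
result
```

Bands with no matching income are dropped by step 2; if they should be kept, one option is to merge result back onto income_range with all.x = TRUE.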

Related

R: Loop through all unique values and count them

I have a dataset with staff information. I have a column that lists their current age and a column that lists their salary. I want to create an R data frame that has 3 columns: one to show all the unique ages, one to count the number of people who are that age and one to give me the median salary for each particular age. On top of this, I would like to group those who are under 21 and over 65. Ideally it would look like this:
age       number of people  median salary
Under 21  36                26,300
22        15                26,300
23        30                27,020
24        41                26,300
etc
Over65    47                39,100
The current dataset has hundreds of columns and thousands of rows but the columns that are of interest are like this:
ageyears  sal22
46        28,250
32        26,300
19        27,020
24        26,300
53        36,105
47        39,100
47        26,200
70        69,500
68        75,310
I'm a bit lost on the best way to do this but assume some sort of loop would work best? Thanks so much for any direction or help.
library(tidyverse)
sample_data <- tibble(
age = sample(17:70, 100, replace = TRUE) %>% as.character(),
salary = sample(20000:90000, 100, replace = TRUE)
)
# A tibble: 100 × 2
age salary
<chr> <int>
1 56 35130
2 56 44203
3 20 28701
4 47 66564
5 66 60823
6 54 36755
7 66 30731
8 68 21338
9 19 80875
10 61 44547
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
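sample_data %>%
# age is stored as character, so compare numerically rather than by string order
mutate(age = case_when(as.numeric(age) <= 21 ~ "Under 21",
as.numeric(age) >= 65 ~ "Over 65",
TRUE ~ age)) %>%
group_by(age) %>%
summarise(count = n(),
median_salary = median(salary))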
# A tibble: 38 × 3
age count median_salary
<chr> <int> <dbl>
1 22 4 46284.
2 23 3 55171
3 25 3 74545
4 27 1 37052
5 28 3 66006
6 29 1 82877
7 30 2 40342.
8 31 2 27815
9 32 1 32282
10 33 3 64523
# … with 28 more rows
# ℹ Use `print(n = ...)` to see more rows
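No explicit loop is needed for this. As a rough base R sketch on the nine rows shown in the question: it assumes the salaries have already been read as numbers (no thousands separators), and that "Under 21" and "Over 65" are strict bounds, which is a guess about the intended grouping. The names staff, age_group and summary_df are made up for illustration.

```r
# The nine rows from the question, with salaries as plain numbers (assumption)
staff <- data.frame(
  ageyears = c(46, 32, 19, 24, 53, 47, 47, 70, 68),
  sal22    = c(28250, 26300, 27020, 26300, 36105, 39100, 26200, 69500, 75310)
)

# Collapse the tails into two labels; every other age keeps its own value
staff$age_group <- ifelse(staff$ageyears < 21, "Under 21",
                   ifelse(staff$ageyears > 65, "Over 65",
                          as.character(staff$ageyears)))

# One aggregate per statistic; both return the groups in the same sorted order
counts  <- aggregate(sal22 ~ age_group, data = staff, FUN = length)
medians <- aggregate(sal22 ~ age_group, data = staff, FUN = median)

summary_df <- data.frame(age           = counts$age_group,
                         n_people      = counts$sal22,
                         median_salary = medians$sal22)
summary_df
```

Because the group labels are character, the rows sort alphabetically ("Over 65" and "Under 21" come last); reorder them afterwards if the display order matters.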

How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
map(list(df_2012, df_2013, df_2014), function(x)
x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
# full_join keeps ids missing from some years; `all = TRUE` is a merge() argument, not a dplyr one
reduce(., function(x, y)
full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 France 1 7 34 2 9 95
2 Belgium 2 2 3 53 10 30
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
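If the number of years is large, the list of data frames can be built from the year vector instead of being typed out, which is close to what the original loop was reaching for. A minimal sketch, assuming the objects follow the df_<year> naming pattern and the columns follow the varA_<year> pattern; the trimmed-down data frames below are only there to make the block self-contained.

```r
# Trimmed-down versions of the question's yearly data frames
ids <- c("France", "Belgium", "Spain")
df_2012 <- data.frame(id = ids, varA_2012 = c(1, 2, 3),
                      varB_2012 = c(7, 2, 9), varC_2012 = c(1, 56, 0))
df_2013 <- data.frame(id = ids, varA_2013 = c(34, 3, 56),
                      varB_2013 = c(2, 53, 5), varC_2013 = c(24, 3, 45))
df_2014 <- data.frame(id = ids, varA_2014 = c(9, 10, 5),
                      varB_2014 = c(95, 30, 75), varC_2014 = c(99, 0, 51))

years <- 2012:2014

# Look the data frames up by name: df_2012, df_2013, df_2014
dfs <- mget(paste0("df_", years))

# For each year keep id plus that year's varA and varB columns
pieces <- Map(function(d, y) d[, c("id", paste0(c("varA_", "varB_"), y))],
              dfs, years)

# Full join of all pieces on id
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), pieces)
panel_df
```

Keeping the yearly data frames in a named list from the moment they are imported would avoid the mget() lookup altogether.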

similarity score between vectors and creating column vectors based on a function

I have a sample on which I want to create an aggregate measure based on similarity scores of the person's movie interests. For example consider the following data.
person <- c( 'John', 'John', 'Vikram', 'Kris', 'Kris', 'Lara', 'Mohi', 'Mohi', 'Mohi')
year<- c(2010, 2011,2010,2010, 2011, 2010, 2010, 2011, 2012)
sciencefiction <- c( 4, 5, 0, 44,32, 5, 32, 43,33)
romantic <- c( 19, 28, 56, 7, 4, 33, 2,1,2)
comedy<- c(22,34, 22,34,44, 54, 54,32,44)
timespent<- c(30,40, 100,33, 22, 80, 96, 22,34)
df<- data.frame(person, year, sciencefiction, romantic, comedy, timespent)
I want to create a variable called similarity score, which is the sum, over every other person j in the same year, of person i's distance from person j multiplied by the time spent by j. For example, for John in 2010 it would be
score[john, 2010]= 0.8 * 100+ 0.6 * 33+ .98 * 80 + .73* 96 = 248.28
The 0.8 is the distance between John and Vikram, computed as the cosine of the angle between their two vectors, a.b/(|a| |b|), where each vector is formed from sciencefiction, romantic and comedy (v[i] = 4i+19j+22k and v[j] = 0i+7j+34k), and 100 is the time Vikram spent watching movies in 2010. The other comparisons are made in the same way and summed for John. Is there a way to do this in R and create a column called score using the above procedure? Thanks
I'll step through this solution. Skip down to the bottom for the overall result.
Up front: because 2012 only has one person (Mohi), there is no output. You can easily capture this either by not filtering out self-comparisons (which should score 0) or re-merging in missing person/year rows.
Update 2: your df$person needs to be character, so either create your data with
df <- data.frame(..., stringsAsFactors = FALSE)
or modify it in-place with
df$person <- as.character(df$person)
Dependencies
I'm using dplyr here primarily because I think it clearly communicates what is going on. There is nothing in the code that could not be replaced with base functions (or even data.table).
library(dplyr)
One could use tidyr::crossing instead of expand.grid and purrr::pmap instead of mapply. They have strengths but are mostly drop-in replacements, so I leave it up to the reader.
A simple geometric angle-calculation function, for simplicity/reference
angle <- function(a, b, zero = NaN) {
num <- (a %*% b)
denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
if (denom == 0) zero else (num / denom)
}
Update: if either of the vectors is all-0, then R calculates 0/0 as NaN. Depending on your use, it may make sense to change this to 0 or NA.
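As a quick check of the function, here it is applied to John's and Vikram's 2010 rows from the question's data, plus the all-zero case. The function body is repeated only so the block runs on its own; the variable names are illustrative.

```r
# Same function as above, repeated so this block is self-contained
angle <- function(a, b, zero = NaN) {
  num <- (a %*% b)
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) zero else (num / denom)
}

john   <- c(4, 19, 22)   # sciencefiction, romantic, comedy for John, 2010
vikram <- c(0, 56, 22)   # same columns for Vikram, 2010

# %*% returns a 1x1 matrix, so strip it down to a plain number
john_vikram <- as.numeric(angle(john, vikram))
john_vikram   # roughly 0.877

# An all-zero vector makes the denominator 0, so `zero` is returned instead
all_zero_default <- angle(c(0, 0, 0), john)             # NaN by default
all_zero_na      <- angle(c(0, 0, 0), john, zero = NA)  # NA when requested
```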
Identify unique combinations (not permutations)
df %>%
distinct(year, person) %>%
group_by(year) %>%
do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
ungroup() %>%
filter(person != person2) %>%
mutate(
p1 = pmin(person, person2),
p2 = pmax(person, person2)
) %>%
distinct() %>%
select(-person, -person2)
# # A tibble: 13 × 3
# year p1 p2
# <dbl> <chr> <chr>
# 1 2010 John Vikram
# 2 2010 John Kris
# 3 2010 John Lara
# 4 2010 John Mohi
# 5 2010 Kris Vikram
# 6 2010 Lara Vikram
# 7 2010 Mohi Vikram
# 8 2010 Kris Lara
# 9 2010 Kris Mohi
# 10 2010 Lara Mohi
# 11 2011 John Kris
# 12 2011 John Mohi
# 13 2011 Kris Mohi
If you ran the pipeline only up to and including the expand.grid step, you'd end up with redundant pairs, e.g. "John, Vikram" and "Vikram, John". Because I'm inferring you are interested in pairwise combinations rather than permutations, the rest of that code block removes the redundant rows.
Bring in each person's data
(continuing in the pipe with the previous data)
... %>%
left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2"))
# # A tibble: 13 × 11
# year p1 p2 sciencefiction1 romantic1 comedy1 timespent1 sciencefiction2 romantic2 comedy2 timespent2
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 4 19 22 30 0 56 22 100
# 2 2010 John Kris 4 19 22 30 44 7 34 33
# 3 2010 John Lara 4 19 22 30 5 33 54 80
# 4 2010 John Mohi 4 19 22 30 32 2 54 96
# 5 2010 Kris Vikram 44 7 34 33 0 56 22 100
# 6 2010 Lara Vikram 5 33 54 80 0 56 22 100
# 7 2010 Mohi Vikram 32 2 54 96 0 56 22 100
# 8 2010 Kris Lara 44 7 34 33 5 33 54 80
# 9 2010 Kris Mohi 44 7 34 33 32 2 54 96
# 10 2010 Lara Mohi 5 33 54 80 32 2 54 96
# 11 2011 John Kris 5 28 34 40 32 4 44 22
# 12 2011 John Mohi 5 28 34 40 43 1 32 22
# 13 2011 Kris Mohi 32 4 44 22 43 1 32 22
Calculate the angle per pair
... %>%
mutate(
angle = mapply(function(a,b,c, d,e,f) angle(c(a,b,c), c(d,e,f), zero=NA),
sciencefiction1, romantic1, comedy1,
sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
) %>%
select(year, p1, p2, starts_with("timespent"), angle)
# A tibble: 13 × 6
# year p1 p2 timespent1 timespent2 angle
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 30 100 0.8768294
# 2 2010 John Kris 30 33 0.6427461
# 3 2010 John Lara 30 80 0.9851037
# 4 2010 John Mohi 30 96 0.7347653
# 5 2010 Kris Vikram 33 100 0.3380778
# 6 2010 Lara Vikram 80 100 0.7948679
# 7 2010 Mohi Vikram 96 100 0.3440492
# 8 2010 Kris Lara 33 80 0.6428056
# 9 2010 Kris Mohi 33 96 0.9256539
# 10 2010 Lara Mohi 80 96 0.7881070
# 11 2011 John Kris 40 22 0.7311130
# 12 2011 John Mohi 40 22 0.5600843
# 13 2011 Kris Mohi 22 22 0.9533073
Finally, the score
... %>%
group_by(year, person = p1) %>%
summarize(
score = angle %*% timespent2
) %>%
ungroup()
# # A tibble: 6 × 3
# year person score
# <dbl> <chr> <dbl>
# 1 2010 John 258.23933
# 2 2010 Kris 174.09501
# 3 2010 Lara 155.14507
# 4 2010 Mohi 34.40492
# 5 2011 John 28.40634
# 6 2011 Kris 20.97276
I'm guessing the difference between my 258.24 and your 248.28 is due to the second vector (Vikram's values).
TL;DR
All at once:
df %>%
distinct(year, person) %>%
group_by(year) %>%
do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
ungroup() %>%
filter(person != person2) %>%
mutate(
p1 = pmin(person, person2),
p2 = pmax(person, person2)
) %>%
select(-person, -person2) %>%
distinct() %>%
# p-wise lookups
left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2")) %>%
# calc angles
mutate(
angle = mapply(function(a,b,c, d,e,f) angle(c(a,b,c), c(d,e,f)),
sciencefiction1, romantic1, comedy1,
sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
) %>%
# calc scores
group_by(year, person = p1) %>%
summarize(
score = angle %*% timespent2
) %>%
ungroup()
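Note that grouping by p1 credits each pair only to the alphabetically first person, so, for example, Mohi's 2010 score above contains only the Vikram term. If the intent is that every person is compared with every other person in the same year, as the question's wording suggests, one possible base R variant is sketched below. The helper names cosine and score_one are hypothetical, and a person who is alone in a year gets NA.

```r
df <- data.frame(
  person = c("John", "John", "Vikram", "Kris", "Kris", "Lara", "Mohi", "Mohi", "Mohi"),
  year   = c(2010, 2011, 2010, 2010, 2011, 2010, 2010, 2011, 2012),
  sciencefiction = c(4, 5, 0, 44, 32, 5, 32, 43, 33),
  romantic       = c(19, 28, 56, 7, 4, 33, 2, 1, 2),
  comedy         = c(22, 34, 22, 34, 44, 54, 54, 32, 44),
  timespent      = c(30, 40, 100, 33, 22, 80, 96, 22, 34),
  stringsAsFactors = FALSE
)

genres <- c("sciencefiction", "romantic", "comedy")

# Cosine of the angle between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Score for row i: sum over all other people j in the same year of
# cosine(i, j) * timespent[j]
score_one <- function(i) {
  others <- which(df$year == df$year[i] & df$person != df$person[i])
  if (length(others) == 0) return(NA_real_)
  a <- unlist(df[i, genres])
  sum(vapply(others,
             function(j) cosine(a, unlist(df[j, genres])) * df$timespent[j],
             numeric(1)))
}

df$score <- vapply(seq_len(nrow(df)), score_one, numeric(1))
df[, c("person", "year", "score")]
```

John's 2010 value agrees with the 258.24 shown above, since John is p1 in all of his pairs; the other people's scores differ because they now include every partner.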

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5

Reducing rows and expanding columns of data.frame in R

I have this data.frame in R.
> a <- data.frame(year = c(2001,2001,2001,2001), country = c("Japan", "Japan","US","US"), type = c("a","b","a","b"), amount = c(35,67,39,45))
> a
year country type amount
1 2001 Japan a 35
2 2001 Japan b 67
3 2001 US a 39
4 2001 US b 45
How should I transform this into a data.frame that looks like this?
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Basically I want the number of rows to be the number of (year x country) pairs, and I want to create additional columns for each type.
Base R solution, but it requires renaming the columns and rows afterwards:
reshape(a, v.names="amount", timevar="type", idvar="country", direction="wide")
year country amount.a amount.b
1 2001 Japan 35 67
3 2001 US 39 45
reshape2 solution
library(reshape2)
dcast(a, year+country ~ paste("type", type, sep="."), value.var="amount")
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Another way would be to use spread in the tidyr package and rename in the dplyr package to deliver the expected outcome.
library(dplyr)
library(tidyr)
spread(a,type, amount) %>%
rename(type.a = a, type.b = b)
# year country type.a type.b
#1 2001 Japan 35 67
#2 2001 US 39 45
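In current tidyr, spread is superseded by pivot_wider, which can also add the "type." prefix directly and so avoids the separate rename step. A small sketch, assuming tidyr 1.0.0 or later is installed:

```r
library(tidyr)

a <- data.frame(year    = c(2001, 2001, 2001, 2001),
                country = c("Japan", "Japan", "US", "US"),
                type    = c("a", "b", "a", "b"),
                amount  = c(35, 67, 39, 45))

# One row per (year, country); one column per type, named type.a, type.b
wide <- pivot_wider(a,
                    names_from   = type,
                    values_from  = amount,
                    names_prefix = "type.")
wide
```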
