My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
library(dplyr)
library(tidyr)
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
emp = rnorm(18,20,2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
emp = rnorm(15,10,1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
group_by(State) %>%
mutate(Occ1 = emp*Occ1,
Occ2 = emp*Occ2,
Occ3 = emp*Occ3,
Occ4 = emp*Occ4,
Occ5 = emp*Occ5
)
Ind_combined <- Ind_combined %>%
gather(key = "occupation",
value = "emp2",
-State,
-industry,
-emp
)
Ind_combined <- Ind_combined %>%
group_by(State, occupation) %>%
summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df,Ind_combined)
My solution seems pretty inefficient; is there a better / faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1: essentially Ind_Occ_Conc contains weights, and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
The final output would then have the Ind values in a new column: one value for all rows of state a, one for state b, and one for state c.
Hope it will help ;)
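Since the concordance weights do not vary by state, another option might be a single matrix product per state instead of the column-by-column mutate and gather. A rough sketch (the object name emp2_by_state is just illustrative; it assumes the industries within each state appear in the same order as the rows of Ind_Occ_Conc, and dplyr >= 0.8.1 for group_modify):
conc <- as.matrix(Ind_Occ_Conc)   # 6 industries x 5 occupations
# for each state, emp2 = t(conc) %*% industry employment vector
emp2_by_state <- ind_df %>%
  group_by(State) %>%
  group_modify(~ tibble(occupation = colnames(conc),
                        emp2 = as.vector(crossprod(conc, .x$emp)))) %>%
  ungroup()
# join onto the original Occ_df (before emp2 was added)
Occ_df <- left_join(Occ_df, emp2_by_state, by = c("State", "occupation"))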
Related
site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
The final data frame df.dummy combines site, rep, treatment, sp.1 and sp.2.
For each site, I want to summarise various groups. Two examples: "A.low / A.high" = "sp.1 / sp.1"; "A.low / A.mix" = "sp.1 / sp.2". As you will notice, there are two replicates of each treatment per site, and I want all permutations of those in my final columns. My final product would resemble something like:
site  rep  treatment     value
1     1/3  A.low/A.high  Inf
1     1/4  A.low/A.high  1
I started to use dplyr, but I am really not sure how to proceed, especially with all the combinations:
df.dummy %>%
group_by(site) %>%
summarise(value.1 = sp.1[treatment == "A.low"] / sp.1[treatment == "A.high"])
You could use reshape2 to get the data in a format that is easier to work with.
The code below separates out the sp.1 and sp.2 data. acast is used so that each data frame consists of a single row per site, and each column is a unique sample, with the values taken from sp.1 or sp.2.
Name the columns something unique and combine the dataframes with cbind.
Now each column can be compared based on your requirements.
library(dplyr)
library(reshape2)
##your setup
site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
##create unique ids and create a dataframe containing 1 value column
sp1 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.1)
sp2 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.2)
##reshape the data so that each treatment and replicate is assigned a single column
##each row will be a single site
##each column will contain the values from sp.1 or sp.2
sp1 <- reshape2::acast(data = sp1, formula = site ~ id, value.var = "sp.1")
sp2 <- reshape2::acast(data = sp2, formula = site ~ id, value.var = "sp.2")
##rename columns something sensible and unique
colnames(sp1) <- c("low.1.sp1", "low.2.sp1", "high.3.sp1", "high.4.sp1",
"mix.5.sp1", "mix.6.sp1", "mix.7.sp1", "mix.8.sp1")
colnames(sp2) <- c("low.1.sp2", "low.2.sp2", "high.3.sp2", "high.4.sp2",
"mix.5.sp2", "mix.6.sp2", "mix.7.sp2", "mix.8.sp2")
##combine datasets
dat <- sp1 %>% cbind(sp2) %>% as.data.frame()
##choose which columns to compare. Some examples shown below
dat <- dat %>% mutate(ratio_1_3 = low.1.sp1/high.3.sp1, ratio_1_4 = low.1.sp1/high.4.sp1,
                      ratio_2_3 = low.2.sp1/high.3.sp2)
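If every pairwise combination is wanted rather than a hand-picked few, the ratios can be generated programmatically. A rough sketch with base combn, building on dat and the renamed sp1 columns above (object names are illustrative):
# all pairs of sp.1 columns, e.g. low.1.sp1 / high.3.sp1, low.1.sp1 / low.2.sp1, ...
ratio_pairs <- combn(colnames(sp1), 2, simplify = FALSE)
ratios <- sapply(ratio_pairs, function(p) dat[[p[1]]] / dat[[p[2]]])
colnames(ratios) <- sapply(ratio_pairs, paste, collapse = "_over_")
dat <- cbind(dat, ratios)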
I haven't been able to find an exact Q/A to match this problem, though there are several related ones. I'm trying to calculate a distance matrix for all groups of points defined by an ID column, then select the two points that are farthest apart within each group, retaining the original group id. The number of points in each group is either 2, 4 or 6.
My sf df:
library(sf)
library(dplyr)
df <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
I've tried code such as:
a <- df %>%
group_by(id) %>%
st_distance(.)
Though this just returns a distance matrix of all points.
The below gives me what I want, though I fear it would be slow on large datasets:
maxMin <- do.call(rbind, lapply(unique(df$id), function(x) {
  grp <- df %>% filter(id == x)
  d <- st_distance(grp)
  grp %>% slice(unique(as.vector(which(d == max(d), arr.ind = TRUE))))
}))
You can use dplyr::group_split to split your data frame into a list per group. You can then apply whatever function you want to that list using map/lapply.
Script below keeps the 2 points which are furthest apart in each group.
library(sf)
library(tidyverse)
# dummy data
data <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
# split it into a list per ID
data_group <- data %>%
group_by(id) %>%
group_split()
#apply a function to each list
distance_per_group <- map(data_group, function(x){
distance_matrix <- st_distance(x)
biggest_distance <- as.numeric(which(distance_matrix == max(distance_matrix), arr.ind = TRUE)[1,])
farthest_apart <- x[biggest_distance,]
})
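If a single sf data frame is wanted back rather than a list, the pieces can be recombined afterwards, assuming each list element keeps its sf class:
farthest_points <- do.call(rbind, distance_per_group)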
I have 3 data sets: one of people, one of stores, and one of farmers markets, each with lat/long points.
library(dplyr)
PeopleEx <- data.frame(replicate(2,sample(0:10,10,rep=TRUE))) %>%
rename(lat = X1, long = X2) %>%
mutate(ID = 1:10, type = "people")
StoresEx <- data.frame(replicate(2,sample(0:10,5,rep=TRUE))) %>%
rename(lat = X1, long = X2) %>%
mutate(ID = 1:5, type = "store")
MarketsEx <- data.frame(replicate(2,sample(0:10,1,rep=TRUE))) %>%
rename(lat = X1, long = X2) %>%
mutate(ID = 1, type = "market")
Each has columns "lat", "long", "ID", and "type", where type can be "people", "market", or "store". I am trying to find the distance between each person and the closest store, as well as the closest market. To do so I made a data frame with all 3 data sets.
DfExample <- rbind(PeopleEx, StoresEx, MarketsEx)
I've been using the sp and rgeos packages to calculate distance between points.
library(sp)
library(rgeos)
sp.DfExample <- DfExample
coordinates(sp.DfExample) <- ~long+lat
The first 10 entries are the people and the last 6 are stores/markets, so I've used this code:
Distances <- gDistance(sp.DfExample, byid = TRUE)
DistanceStore <- Distances[1:10, 11:16]
MaxDist <- DistanceStore
apply(MaxDist, 1, FUN=min)
This works in that it reports the minimum distance from each person to a store/market; however, it drops the type. Any ideas on how to keep the type from dropping, or different packages/approaches to use instead?
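One way to keep the type is to select the store and market columns of the distance matrix by type rather than by position, and take the row minima separately. A rough sketch building on the objects above (the column names in nearest are just illustrative):
people_rows <- which(DfExample$type == "people")
store_cols  <- which(DfExample$type == "store")
market_cols <- which(DfExample$type == "market")
nearest <- data.frame(
  ID             = DfExample$ID[people_rows],
  dist_to_store  = apply(Distances[people_rows, store_cols,  drop = FALSE], 1, min),
  dist_to_market = apply(Distances[people_rows, market_cols, drop = FALSE], 1, min)
)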
I have a data frame which contains a 'year' variable with values between 1 and 100000, repeating multiple times. I have another data frame with 1000 'loss amounts', each with an associated probability. I'd like to merge loss amounts onto the year data frame by sampling from the loss amounts table. I want to sample without replacement within each level of the year variable, i.e. within each level of the year variable the loss amounts should be unique.
Reproducible example below, where I can only get it to sample without replacement across the full 'year' dataset and not within the different levels of the year variable as required. Is there a way of doing this (ideally without using loops, as I need the code to run quickly)?
library(tidyverse)
#mean frequency
freq <- 100
years <- 100000
#create data frame with number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index=1:length(num_losses), num=num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>% unlist() %>% tibble(year = .)
#lookup table with loss amounts
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
mutate(total_prob = cumsum(prob)/sum(prob),
pdf = total_prob - lag(total_prob),
pdf = ifelse(is.na(pdf), total_prob, pdf))
#add on amounts to year table by sampling from lookup table
sample_from_lookup <- function(number){
amount <- sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- sample_from_lookup(nrow(year))
year <- tibble(year = year$year, amount = amounts)
According to your description, maybe you can try replicate within your sample_from_lookup, i.e.,
sample_from_lookup <- function(number){
amount <- replicate(number,
sample(lookup$amount,
1,
replace = FALSE,
prob = lookup$pdf))
}
In this case, you need to set the sample size to 1 in your sample call.
I ended up using split to break the 'year' data into groups within a list, then running the (slightly amended) sample_from_lookup function on each element of the list using map. Amended code below.
library(tidyverse)
#mean frequency
freq <- 5
years <- 100
#create data frame with number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index=1:length(num_losses), num=num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>% unlist() %>% tibble(year = .)
year_split = split(year, year$year)
#lookup table
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
mutate(total_prob = cumsum(prob)/sum(prob),
pdf = total_prob - lag(total_prob),
pdf = ifelse(is.na(pdf), total_prob, pdf))
#add on amounts to year table by sampling from lookup table
sample_from_lookup <- function(x){
number = NROW(x)
amount <- sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- map(year_split, sample_from_lookup) %>% unlist() %>% tibble(amount = .)
year <- tibble(year = year$year, amount = amounts$amount)
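A possibly more compact variant of the same idea samples within each year group directly with a grouped mutate, instead of split and map. A sketch, assuming the number of losses in any single year never exceeds the 1000 rows of lookup:
year <- year %>%
  group_by(year) %>%
  mutate(amount = sample(lookup$amount, n(), replace = FALSE, prob = lookup$pdf)) %>%
  ungroup()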
I am trying to run an iterative for loop to calculate correlations for levels of a factor variable. I have 16 rows of data for each of 32 teams in my data set. I want to correlate year with points for each of the teams individually. I can do this one by one but want to get better at looping.
correlate <- data %>%
select(Team, Year, Points_Game) %>%
filter(Team == "ARI") %>%
select(Year, Points_Game)
cor(correlate)
I made an object "teams" by:
teams <- levels(data$Team)
A little help in using [i] to iterate over all 32 teams to get each team's correlation of year and points would be greatly appreciated!
require(dplyr)
# dummy data
data = data.frame(
Team = rep(paste0("T", 1:32), each = 10),
Year = rep(2000:2009, times = 32),
Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each team
# r - correlation coefficient
correlate <- data %>%
group_by(Team) %>%
summarise(r = cor(Year, Points_Game))
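If you specifically want the [i]-style loop, a rough sketch over the same dummy data might look like this (using unique() rather than levels(), since Team may not be a factor):
teams <- unique(data$Team)
r <- numeric(length(teams))
for (i in seq_along(teams)) {
  team_data <- data[data$Team == teams[i], ]
  r[i] <- cor(team_data$Year, team_data$Points_Game)
}
team_cor <- data.frame(Team = teams, r = r)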
The data.table way:
library(data.table)
# dummy data (same as #Aleksandr's)
dat <- data.table(
Team = rep(paste0("T", 1:32), each = 10),
Year = rep(2000:2009, times = 32),
Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each Team
result <- dat[ , .(r = cor(Year, Points_Game)), by = Team]