I'm not exactly sure how to describe this, but I'll gladly edit the title and/or post to reflect comments and answers.
Problem
I have two data.frames that I would like to merge with a combination of a left join, an outer join, and a rolling join.
One of the key columns (year) is for the rolling join.
Another key column (cat) is common to both data.frames. In the example below I've only supplied exemplary subsets of the full data, which has thousands of values for cat.
The first data.frame, X, has an additional key column cnty (county), and the second data.frame, Y, has an additional key column pol (pollutant).
For each group defined by cat and year, I would like the final result to contain a cartesian product of cnty and pol, with value columns tput (from X) and emfac (from Y). The goal is to be able to compute emfac * tput.
Here is an exemplary subset of X:
cat year cnty tput
1 29 2011 ALA 67852
2 29 2011 CC 33893
3 29 2011 MRN 11319
... and here is an exemplary subset of Y:
cat year pol emfac
1 29 1975 TOG 2.4
2 29 1975 PM 5.3
Closest attempt so far
I can almost, but not quite, get the output I want:
X <- structure(list(
cat = c(29L, 29L, 29L),
year = c(2011L, 2011L, 2011L),
cnty = c("ALA", "CC", "MRN"),
tput = c(67852, 33893, 11319)),
.Names = c("cat", "year", "cnty", "tput"),
class = c("data.frame"), row.names = c(NA, -3L))
Y <- structure(list(
cat = c(29L, 29L),
year = c(1975, 1975),
pol = c("PM", "TOG"),
emfac = c(5.3, 2.4)),
.Names = c("cat", "year", "pol", "emfac"),
class = c("data.frame"), row.names = c(NA, -2L))
library(data.table)
X <- data.table(X, key = c("cat", "cnty", "year"))
Y <- data.table(Y, key = c("cat", "pol", "year"))
Y[X, roll = TRUE]
cat year pol emfac cnty tput
1: 29 2011 PM 5.3 ALA 67852
2: 29 2011 PM 5.3 CC 33893
3: 29 2011 PM 5.3 MRN 11319
This is my "nearest miss". Most of my other attempts are much more wrong.
Expected result
cat year pol emfac cnty tput
1: 29 2011 PM 5.3 ALA 67852
2: 29 2011 PM 5.3 CC 33893
3: 29 2011 PM 5.3 MRN 11319
4: 29 2011 TOG 2.4 ALA 67852
5: 29 2011 TOG 2.4 CC 33893
6: 29 2011 TOG 2.4 MRN 11319
What am I doing wrong?
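One way to get the expected cartesian expansion (a sketch, not from the original post; the intermediate table XX and the use of unique(Y$pol) are my own choices) is to expand X by every pollutant first, so the cnty x pol product exists before the rolling join on year:

```r
library(data.table)

# Example data from the question
X <- data.table(cat = 29L, year = 2011L,
                cnty = c("ALA", "CC", "MRN"),
                tput = c(67852, 33893, 11319))
Y <- data.table(cat = 29L, year = 1975L,
                pol = c("PM", "TOG"),
                emfac = c(5.3, 2.4))

# Expand X so each (cat, year, cnty) row is paired with every pollutant,
# creating the cnty x pol cartesian product up front
XX <- X[, .(pol = unique(Y$pol)), by = .(cat, year, cnty, tput)]

# Key both tables identically, with the rolling column (year) last
setkey(XX, cat, pol, year)
setkey(Y, cat, pol, year)

# year 1975 in Y rolls forward to match year 2011 in XX
result <- Y[XX, roll = TRUE]
```

The target metric then follows directly with, e.g., result[, emfac * tput] (any assigned column name is arbitrary).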
I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
map(list(df_2012, df_2013, df_2014), function(x)
x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
reduce(function(x, y)
full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
I need some help understanding the concept of joining.
I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:
Animal | Weight | Age
-------|--------|----
Dog    | 12     | 5
Cat    | 4      | 19
Fish   | 2      | 4
Mouse  | 1      | 2

Animal | Award
-------|------
Dog    | 1st
Cat    | 1st
Fish   | 3rd
Mouse  | 5th
These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.
But I don't understand it when it's something like this:
Mortality Rate (Heart Attack)

Year | Place  | Death Rate (Heart Attack)
-----|--------|--------------------------
2011 | Paris  | 200
2011 | Paris  | 94
2011 | Rome   | 23
2009 | London | 15
Mortality Rate (Car Crash)

Year | Place     | Death Rate (Car Crash)
-----|-----------|-----------------------
2011 | London    | 987
2012 | London    | 34
2012 | Paris     | 09
2007 | Melbourne | 12
The variable TYPES are the same (years, cities, and death rates). But the year values aren't the same, they aren't in the same order, there isn't the same number of 2011s, the locations are different, and there are obviously two different death rates that need to be two different columns. So how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values across a larger data set?
I understand there are different types of joins that do different things, but I'm struggling to understand how the years and cities would sit if you want to compare the two different death rates across cities and years.
Thank you!
If you do
merge(heart, car, all=TRUE)
# Year Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne NA 12
# 2 2009 London 15 NA
# 3 2011 London NA 987
# 4 2011 Paris 200 NA
# 5 2011 Paris 94 NA
# 6 2011 Rome 23 NA
# 7 2012 London NA 34
# 8 2012 Paris NA 9
merge automatically looks for matching column names and merges on them. It matches on the combination of values in those columns, so rows won't be mixed. More verbosely, you could do
merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))
which is actually what happens in this case.
Data:
heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris",
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L,
15L)), class = "data.frame", row.names = c(NA, -4L))
car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London",
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L,
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))
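For comparison, a sketch of the dplyr equivalent of that merge call (dplyr::full_join is the analogue of merge(..., all = TRUE); the same heart and car data frames are rebuilt here for completeness):

```r
library(dplyr)

heart <- data.frame(Year = c(2011L, 2011L, 2011L, 2009L),
                    Place = c("Paris", "Paris", "Rome", "London"),
                    Death_Rate_heart = c(200L, 94L, 23L, 15L))
car <- data.frame(Year = c(2011L, 2012L, 2012L, 2007L),
                  Place = c("London", "London", "Paris", "Melbourne"),
                  Death_Rate_Car = c(987L, 34L, 9L, 12L))

# Keep every (Year, Place) pair from both tables; the side without
# a match is filled with NA
both <- full_join(heart, car, by = c("Year", "Place"))
```

Since no (Year, Place) pair appears in both tables here, every row ends up with an NA in one of the two rate columns, which is exactly the merge output shown above.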
I need to 'multiply' two df's together to create all possible combinations, to use in a Tableau scenario.
The scenario is as follows:
I have a df1 of cars and their associated MPGs, and a df2 of zipcodes, and their associated distance from a fixed point (calculating carbon footprint). Once I get the df3 created I can do more math over the entire df to get to my final metric.
I tried my best below to represent a sample of each df, and the resulting df3 that I am looking to create. df1 is 15,000 rows, and df2 is 535 rows, meaning df3 will have 8m rows.
There may be a better way to do this in tableau; however, I am more comfortable in R.
DF1
mpg|year|make |model
--------------------
21|1985|dodge|charger
19|1993|Audi |100
DF2
zipcode|distance
---------------
20015 | 8.91
20020 | 12.72
DF3
mpg|year|make |model |zipcode|distance
-----------------------------------------
21|1985|dodge|charger| 20015 |8.91
19|1993|Audi |100 | 20015 |8.91
21|1985|dodge|charger| 20020 |12.72
19|1993|Audi |100 | 20020 |12.72
We can use crossing
library(tidyr)
crossing(DF1, DF2)
# mpg year make model zipcode distance
#1 21 1985 dodge charger 20015 8.91
#2 21 1985 dodge charger 20020 12.72
#3 19 1993 Audi 100 20015 8.91
#4 19 1993 Audi 100 20020 12.72
data
DF1 <- structure(list(mpg = c(21L, 19L), year = c(1985L, 1993L), make = c("dodge",
"Audi"), model = c("charger", "100")), class = "data.frame", row.names = c(NA,
-2L))
DF2 <- structure(list(zipcode = c(20015L, 20020L), distance = c(8.91,
12.72)), class = "data.frame", row.names = c(NA, -2L))
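If you prefer base R to tidyr, merge with by = NULL also returns the cartesian product; a minimal sketch using the same structures:

```r
DF1 <- data.frame(mpg = c(21L, 19L), year = c(1985L, 1993L),
                  make = c("dodge", "Audi"), model = c("charger", "100"))
DF2 <- data.frame(zipcode = c(20015L, 20020L), distance = c(8.91, 12.72))

# by = NULL tells merge to skip matching and return the cross join:
# every row of DF1 paired with every row of DF2
DF3 <- merge(DF1, DF2, by = NULL)
```

With 15,000 rows in DF1 and 535 rows in DF2 this yields the roughly 8M-row DF3 described in the question, so expect it to take a moment and some memory.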
I know there are already a lot of questions posed about "sum by group"; however, none of them solves my problem. Here it is:
df1 is my simplified data set
> df1 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
df2 is the desired result (see var2):
> df2 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301),
var2= c(130,130,700,700,35,35,350,350,132,132,702,702) )
So I would like to calculate the sums of var1 grouped by ID and the first two digits of category.
That is, if the first two digits of category are 09 (or 10, and so on), then var2 should be assigned the sum of var1 within the group defined by ID and that two-digit prefix, so that equal IDs with the same prefix all get the same sum.
I tried to achieve that with
> df1$var2 = rep(NA, rep(length(df1$ID)))
df1$var2 = ifelse(substr(df1$category,1,2)=="09", by(df1[Year==2009,]$var1, df1[Year==2009,]$ID,sum), df1$var2)
df1$Var2 = ifelse(substr(df1$category,1,2)=="10", by(df1[Year==2010,]$var1, df1[Year==2010,]$ID,sum), df1$var1)
But here the sums are not assigned to the correct item.
Could somebody help me out?
df1 = data.frame( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910",NA,"0911","0913", "0914", "0910","0910",NA,"1014","1012",NA,"1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
I added NA values to the OP's original data frame to reflect the full specification they desired.
df1$category_sub = substr(df1$category, 1, 2)
df1_aggre = aggregate(var1 ~ ID + category_sub, data = df1, sum)
names(df1_aggre)[3] = "var2"
df2 = merge(df1, df1_aggre, all=TRUE)
df2[order(df2$Year),]
Result:
> df2[order(df2$Year),]
ID category_sub Year category var1 var2
1 1621 09 2009 0910 60 60
4 1621 <NA> 2009 <NA> 70 NA
5 1628 09 2009 0911 400 700
6 1628 09 2009 0913 300 700
9 3101 09 2009 0914 15 35
10 3101 09 2009 0910 20 35
11 3105 09 2009 0910 200 200
12 3105 <NA> 2009 <NA> 150 NA
2 1621 10 2010 1014 61 132
3 1621 10 2010 1012 71 132
7 1628 10 2010 1013 301 301
8 1628 <NA> 2010 <NA> 401 NA
I first extracted the first two integers from category and grouped var1 by ID and category_sub. I then renamed var1 to var2 and merged df1 and df1_aggre by ID and category_sub with all=TRUE option. This specifies a full outer join. The resulting dataframe was unsorted, so I sorted df2 by Year to get the desired result.
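Since the question already loads data.table, the same var2 can be computed in a single grouped := step; a sketch against the NA-free df1 from the question:

```r
library(data.table)

df1 <- data.table(
  Year = c(2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010),
  ID = c(1621, 1621, 1628, 1628, 3101, 3101, 3105, 3105, 1621, 1621, 1628, 1628),
  category = c("0910", "0910", "0911", "0913", "0914", "0910", "0910", "0911",
               "1014", "1012", "1011", "1013"),
  var1 = c(60, 70, 400, 300, 15, 20, 200, 150, 61, 71, 401, 301))

# Group by ID and the first two digits of category, assigning the
# group sum of var1 back to every row of the group
df1[, var2 := sum(var1), by = .(ID, substr(category, 1, 2))]
```

Note this assigns a sum to every row; rows whose category is NA would form their own group rather than receiving NA as in the merge-based answer above.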
I have a data frame that looks like this:
ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1 A 80 XX 100 NA NA BB 45
2 XX 90 NA NA AA 80 XX 80
I want to write a script that would, for entries that don't have NAs in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score, and the year taken from the column title. In this example, ID 1 would have three entries. The expected output would look like this:
ID Subject Test Score Year
1 rd A 80 2011
1 mt XX 100 2011
1 mt BB 45 2012
2 rd XX 90 2011
2 rd AA 80 2012
2 mt XX 80 2012
I've tried both reshape and various forms of merged.stack. These work in the sense that I get output that is on the road to being right, but I can't understand the inputs well enough to get all the way there:
library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")
I've had more success (gotten closer) with reshape:
y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE),
timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
new.row.names=NULL)
This will get your data into the right format:
df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
timevar="Year", varying=list(grep("Test", names(df), value=TRUE),
grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
new.row.names=NULL)
Then omitting NA:
df.long = df.long[!is.na(df.long$Test),]
Then splitting Year to remove Test_:
df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
And ordering by ID:
df.long[order(df.long$ID),]
ID Year Test Score
1 1 2011 A 80
5 1 2012 XX 100
2 2 2011 XX 90
9 2 2013 AA 80
6 3 2012 A 10
3 4 2011 A 50
7 4 2012 XX 60
10 4 2013 AA 99
4 5 2011 C 50
8 5 2012 A 75
Using reshape:
dat.long <- reshape(dat, direction="long", varying=list(c(2, 4,6), c(3, 5,7)),
times=2011:2013,timevar='Year',
sep="_", v.names=c("Test", "Score"))
dat.long[complete.cases(dat.long),]
ID Year Test Score id
1.2011 1 2011 A 80 1
2.2011 2 2011 XX 90 2
4.2011 4 2011 A 50 4
5.2011 5 2011 C 50 5
1.2012 1 2012 XX 100 1
3.2012 3 2012 A 10 3
4.2012 4 2012 XX 60 4
5.2012 5 2012 A 75 5
2.2013 2 2013 AA 80 2
4.2013 4 2013 AA 99 4
The main problem is that your data is "double wide", in a way. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data into a very long format and then convert it to a semi-wide format.
However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange the names of your data: the part of the name that will become the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of each variable name.
For this, we can use gsub to rearrange the existing names.
library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID" "test_rd_2011" "score_rd_2011" "test_mt_2011"
# [5] "score_mt_2011" "test_rd_2012" "score_rd_2012" "test_mt_2012"
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
# ID Subject Year test score
# 1: 1 mt 2011 XX 100
# 2: 1 mt 2012 BB 45
# 3: 1 rd 2011 A 80
# 4: 2 mt 2012 XX 80
# 5: 2 rd 2011 XX 90
# 6: 2 rd 2012 AA 80
For the benefit of others, "mydf" in this answer is defined as:
mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"),
rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA),
mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"),
rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"),
mt_score_2012 = c(45L, 80L)),
.Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011",
"mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012",
"mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))
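A tidyr alternative (not part of the original answers; a sketch) reaches the same long result without renaming, by sending the test/score part of each column name to ".value":

```r
library(tidyr)
library(dplyr)

mydf <- data.frame(ID = 1:2,
                   rd_test_2011 = c("A", "XX"),  rd_score_2011 = c(80L, 90L),
                   mt_test_2011 = c("XX", NA),   mt_score_2011 = c(100L, NA),
                   rd_test_2012 = c(NA, "AA"),   rd_score_2012 = c(NA, 80L),
                   mt_test_2012 = c("BB", "XX"), mt_score_2012 = c(45L, 80L))

# Each name follows the pattern <Subject>_<measure>_<Year>; the ".value"
# sentinel routes the measure part (test/score) into its own column
out <- mydf %>%
  pivot_longer(-ID,
               names_to = c("Subject", ".value", "Year"),
               names_sep = "_") %>%
  filter(!is.na(test))
```

This yields the same six complete rows as the merged.stack result, with columns ID, Subject, Year, test, and score.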