How to execute a left join in R? - r

Below is the sample data and one manipulation. The first data set is employment specific to an industry. The second data set is overall employment and unemployment rate. I am seeking to do a left join (or at least that's what I think it should be) to achieve the desired result below. When I do it, I get a one to many issue with the row count growing. In this example, it goes from 14 to 18. In the larger data set, it goes from 228 to 4348. Primary question is if this can be done with only a properly written join script or is there more to it?
area1<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
month<-c(1,2,3,4,5,6,7,8,9,10,11,12,1,2)
emp1 <-c(10,11,12,13,14,15,16,17,20,21,22,24,26,28)
firstset<-data.frame(area1,periodyear,month,emp1)
area1<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear1<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
period<-c(01,02,03,04,05,06,07,08,09,10,11,12,01,02)
rate<-c(3.0,3.2,3.4,3.8,2.5,4.5,6.5,9.1,10.6,5.5,7.8,6.5,4.5,2.9)
emp2<-c(1001,1002,1005,1105,1254,1025,1078,1106,1099,1188,1254,1250,1301,1188)
secondset<-data.frame(area2,periodyear1,period,rate,emp2)
secondset <- secondset%>%mutate(month = as.numeric(period))
secondset <- left_join(firstset,secondset, by=c("month"))
Desired Result (14 rows with below being the first 3)
area1 periodyear month emp1 rate emp2
000000 2020 1 10 3.0 1001
000000 2020 2 11 3.2 1002
000000 2020 3 12 3.4 1005

We may have to add 'periodyear' as well in the by
library(dplyr)
left_join(firstset,secondset, by=c("periodyear" = "periodyear1",
"area1" = "area2", "month"))
-output
area1 periodyear month emp1 period rate emp2
1 0 2020 1 10 1 3.0 1001
2 0 2020 2 11 2 3.2 1002
3 0 2020 3 12 3 3.4 1005
...

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are continuously remeasured every couple of years. The data sort of looks like the table at the bottom. I used the following code to separate the dataset to slice the initial measurement at t1. Now, I want to slice t2 which is the remeasurement that is one step greater than the minimum_Cycle or minimum_Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2) and the measured_year intervals and cycle intervals are different.
I would really appreciate the help. I have stuck on this for quite sometime now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order:
df_Time1 %>%
group_by(State, County, Plot) %>%
mutate(t = order(Cycle))
You can then filter on t == 1 or t == 2, etc.

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column which is currently in a numeric YYYYMMDD format into YYYY by simply taking the first four digits of each entry in the data column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages of duplicate ID/Time pairs (In this case rows 1 and 2 which would both be given the tag: A,2004). To solve this issue, I would like to delete row 1 in the original data, and only keep the newer observation from the year 2004. This would the provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping for someone to help me out with a loop or a package suggestion with which I can only keep the row with the newer/later observation within a year, if this occurs, also for application to larger data sets.. I believe this involves a couple commands of conditional formatting which I am having difficulties putting together currently. I believe a loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date/takes the "larger" date would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend to keep the Date_column as a reference to pick the later observation and mutate a new column for only the year,since you want the latest observation each year.
Data$year<- substr(Data$Date_column,1,4)
> Data$Date_column<- lubridate::ymd(Data$Date_column)
>
> Data %>% arrange(desc(Date_column)) %>%
+ distinct(ID_column,year,.keep_all = TRUE) %>%
+ arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
since we arranged in the actual date in descending order, you guarantee that dropped rows for the unique combination of ID and year is the oldest. you can change the arrangement for the opposite; to get the oldest occuerence

How to calculate the average of a column using other columns as a reference using excel

I have three columns in excel year, month value.
I want to average value considering month and year. In R language this function is done by group_by(). In excel how could this be done?
year month value
2019 1 12
2019 1 34
2019 2 56
2019 2 15
2020 1 16
2020 3 67
2020 4 89
2018 6 123
2018 6 45
2018 7 98
2019 3 53
2019 1 23
2020 1 12
2020 3 1
If one has Office 365 we can use:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(CHOOSE({1,2},y,m)),{1,2}),
CHOOSE({1,1,2},u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))
Put this in the first cell and it will spill the results.
Once the HSTACK is release we can replace the CHOOSE with it:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(HSTACK(y,m)),{1,2}),
HSTACK(u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))
Averageifs would do what you want, but you might want to review using the Filter function to duplicate the Group_By() method for other similar procedures. Once grouped, you can sum/average/sort, etc.
Averageifs:
=AVERAGEIFS(C:C,A:A,2018,B:B,6)
Filter:
=filter(C:C,(A:A=2018)*(B:B=6))
=Average(filter(C:C,(A:A=2018)*(B:B=6)))
See this spreadsheet for examples of both. I realize you're using Excel, but these formulas should work on both (though they are not the same)

How to add the value of a row to other rows based on some criteria in R?

I have a panel data for costs, sampled monthly for various product types. I also have "Generic" costs which doesn't belong to any product type. A super simple representative df looks like this:
type <- c("A","A","B","B","C","C","Generic","Generic")
year <- c(2020,2020,2020,2020,2020,2020,2020,2020)
month <- c(1,2,1,2,1,2,1,2)
cost <- c(1,2,3,4,5,6,600,630)
volume <- c(10,11,20,21,30,31,60,63)
df <- data.frame(type,year,month,cost,volume)
type year month cost volume
A 2020 1 1 10
A 2020 2 2 11
B 2020 1 3 20
B 2020 2 4 21
C 2020 1 5 30
C 2020 2 6 31
Generic 2020 1 600 60
Generic 2020 2 630 63
I need to distribute the "Generic" costs to product types according to their "Volume".
For example,
For 2020-1, the volume ratio of
product type A: 10 / (10 + 20 + 30) = 1/6
product type B: 20 / (10 + 20 + 30) = 2/6
product type C: 30 / (10 + 20 + 30) = 3/6
For 2020-2, the volume ratio of
product type A: 11 / (11 + 21 + 31) = 11/63
product type B: 21 / (11 + 21 + 31) = 21/63
product type C: 31 / (11 + 21 + 31) = 31/63
So, I would like to distribute "Generic" costs for 2020-1 to product types like this:
1/6 * 600 = 100 for product type A
2/6 * 600 = 200 for product type B
3/6 * 600 = 300 for product type C
Similarly for 2020-2, I would like to distribute "Generic" costs like:
11/63 * 630 = 110 for product type A
21/63 * 630 = 210 for product type B
31/63 * 630 = 310 for product type C
In the end, I would like to end up with the following data frame:
type year month new_cost volume
A 2020 1 101 10
A 2020 2 112 11
B 2020 1 203 20
B 2020 2 214 21
C 2020 1 305 30
C 2020 2 316 31
I already have the total volume in the original dataframe within the "Generic" type, so there is no need to calculate that seperately.
I was trying to do these calculations via dplyr package's group_by() and mutate() functions, but I couldn't figure out how.
Any help is appreciated.
We can do this using data.table, by first merging in the generic costs separately and spreading them according to the percentage of volume made up by each type in each month/year:
df <- setDT(df)
generic <- df[type == "Generic"]
setnames(generic, "cost", "generic_cost")
df <- df[type !="Generic"]
df[, volume_ratio:=volume/sum(volume), by = c("year", "month")]
df <- merge(df, generic[,c("year", "month", "generic_cost")], by = c("year", "month"))
df[,new_cost:=cost + (generic_cost*volume_ratio)]
Which gives us:
df
year month type cost volume volume_ratio generic_cost new_cost
1: 2020 1 A 1 10 0.1666667 600 101
2: 2020 1 B 3 20 0.3333333 600 203
3: 2020 1 C 5 30 0.5000000 600 305
4: 2020 2 A 2 11 0.1746032 630 112
5: 2020 2 B 4 21 0.3333333 630 214
6: 2020 2 C 6 31 0.4920635 630 316
This has a few extra columns, but new cost seems to be the most important column of interest.

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm) a count of things in that size bin, a site and year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into each the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but I did not work for me, not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

Resources