Issue with indexing using two data frames in R

I have two data frames, Table_1 and Table_2, and I need to add a column "Index" to Table_2 whose value is 1 for rows that match Table_1 and 0 for all others.
Basically, I need to match the "Pol", "Cty", "STATE" and "CRP" columns of Table_1 against the "Pol_No", "CTY", "STATE" and "CRP" columns of Table_2.
I prefer the data.table method.
Table_1:
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11
.
.
Table_2:
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522
.
.
Desired output as below.
Table_2:
STATE CTY CRP Pol_No Plan Price Index
AL 1 11 150410 90 4563 0
AL 1 21 45023 90 5402 0
MO 143 11 85010 90 2522 1
.
.
How can I achieve this in R?
Any help is appreciated.
Thanks.

Here's an entirely data.table solution:
merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)[,c('STATE.y','CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
#-----
STATE.y CTY Cty CRP.y Pol Plan Price
1: AL 1 NA 21 45023 90 5402
2: MO 143 23 11 85010 90 2522
3: AL 1 NA 11 150410 90 4563
#--------
t3 <- merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)[ ,
c('STATE.y','CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
t3[ , index := as.numeric(!is.na(Cty))]
t3
#--------
STATE.y CTY Cty CRP.y Pol Plan Price index
1: AL 1 NA 21 45023 90 5402 0
2: MO 143 23 11 85010 90 2522 1
3: AL 1 NA 11 150410 90 4563 0
To get column names right after merge(.. I first looked at:
merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)
Pol Cty Avg STATE.x CRP.x STATE.y CTY CRP.y Plan Price
1: 45023 NA NA <NA> NA AL 1 21 90 5402
2: 85010 23 1123 MO 11 MO 143 11 90 2522
3: 150410 NA NA <NA> NA AL 1 11 90 4563

I think this is a straight-forward multi-column join:
library(dplyr)
t2 %>%
left_join(transmute(t1, CTY=Cty, STATE, Index=1L), by=c("CTY", "STATE")) %>%
mutate(Index = if_else(is.na(Index), 0L, Index))
# STATE CTY CRP Pol_No Plan Price Index
# 1 AL 1 11 150410 90 4563 0
# 2 AL 1 21 45023 90 5402 0
# 3 MO 143 11 85010 90 2522 1
EDIT
I've been trying to learn data.table, thought I'd give this a try. It feels a little clumsy to me, I'm sure there is a way to streamline it.
library(data.table)
setDT(t1); setDT(t2)
For convenience, set the column names to be the same (I'm not sure how to make it happen easily otherwise) ... one is "Cty", the other is "CTY". Make them the same.
colnames(t1)[2] <- "CTY"
Now, the merge.
t1[,.(CTY,STATE,CRP,Index=1),][t2,on=c("CTY","STATE","CRP")]
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 NA 150410 90 4563
# 2: 1 AL 21 NA 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
Notes:
the first bracket-op is selecting just the three joining columns and assigning the fourth, Index;
the actual join is in the second bracket-op, the first is just a selection
data.table operations typically work by side effect (by reference), but joins and selections like this do not, so the result is returned without modifying the underlying tables; to keep it, we need to store it (back in t2, perhaps)
It's close ... now just update the Index field, since it's either 1 where the data co-exists or NA otherwise.
t2 <- t1[,.(CTY,STATE,CRP,Index=1),][t2,on=c("CTY","STATE","CRP")]
t2[,Index := as.integer(!is.na(Index)),]
t2
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 0 150410 90 4563
# 2: 1 AL 21 0 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
Data:
t1 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11')
t2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522')
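If no join is needed, the flag can also be computed by comparing composite keys directly. The following is a minimal base R sketch, assuming the match is on STATE, CTY/Cty and CRP (the same columns as the data.table join above). The names key1 and key2 are made up for illustration, and Pol/Pol_No can be added to the key in the same way if required.

```r
# Sample data from the question
t1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11')
t2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522')

# Build one composite key per row (hypothetical helper names)
key1 <- paste(t1$STATE, t1$Cty, t1$CRP, sep = "|")
key2 <- paste(t2$STATE, t2$CTY, t2$CRP, sep = "|")

# 1 where the key of a Table_2 row also occurs in Table_1, 0 otherwise
t2$Index <- as.integer(key2 %in% key1)
t2
```

Because nothing is merged, duplicate key rows in Table_1 (such as the two MO/23/11 rows) cannot duplicate rows of Table_2.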

This is not a nice solution but it works for data.table. You need sqldf which works for dataframes and data tables.
library(data.table)
df1<-data.table(Pol=c(85010,75022,35014),Cty=c(23,23,143), Avg=c(1123,1123,450),STATE=c("MO","MO","MO"), CRP=c(11,11,11))
df2=data.table(STATE=c("AL","AL","MO"),CTY=c(1,1,143),CRP=c(11,21,11),Pol_No=c(150410,45023,85010),Plan=c(90,90,90),Price=c(4563,5402,2522))
library(sqldf)
#left join
df<-sqldf("select df2.STATE,df2.CTY,df2.CRP,df2.Pol_No,df2.Plan,df2.Price,df1.Pol from df2 left join df1 on df1.Pol=df2.Pol_No")
#create index
df$index<-ifelse(is.na(df$Pol),0,1)
#delete extra column
df$Pol<-NULL
> df
STATE CTY CRP Pol_No Plan Price index
1 AL 1 11 150410 90 4563 0
2 AL 1 21 45023 90 5402 0
3 MO 143 11 85010 90 2522 1

Related

Dividing a column cell with a different number based on number of observations in a panel long format

I have the following data, which has a panel structure. I need to normalize each cell: the observation for a country should be multiplied by the number of non-missing observations for that country divided by the total number of observations per country in the panel (here 10; in my data 1100). I have showcased three countries (AL, UK, FR), but I have 92 in total, so I need a general formula (mutate with by = country?).
This is my data
df1 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2,3,2,3,2,3,2,NA,1,2,1,2,1,2,1,2,1,2,NA,NA,NA,NA,NA,NA,NA,NA,4,NA))
df1
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 2
4 AL 3
5 AL 2
6 AL 3
7 AL 2
8 AL 3
9 AL 2
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 4
30 FR NA
Now, what I want is to multiply each cell by (number of observations available for that country / total observations), like so:
df2 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,
NA,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,
2*10/10,1*10/10,2*10/10,NA,NA,NA,NA,NA,NA,NA,NA,4*1/10,NA))
df2
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 1.4
4 AL 2.1
5 AL 1.4
6 AL 2.1
7 AL 1.4
8 AL 2.1
9 AL 1.4
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 0.4
30 FR NA
I am interested in solving the problem obviously BUT I would really really appreciate it if you could show me how to do this for multiple columns as my original data needs this same operation done for many columns where the country tickers (AL, UK, FR in example) remains the same.
You can do :
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * sum(!is.na(Obs))/n()) %>%
ungroup
# Country Obs
# <chr> <dbl>
# 1 AL NA
# 2 AL NA
# 3 AL 1.4
# 4 AL 2.1
# 5 AL 1.4
# 6 AL 2.1
# 7 AL 1.4
# 8 AL 2.1
# 9 AL 1.4
#10 AL NA
# … with 20 more rows
sum(!is.na(Obs)) counts number of non-NA values in the Country whereas n() gives the number of rows for the Country.
For multiple columns -
df1 %>%
group_by(Country) %>%
mutate(across(col1:col4, ~. * sum(!is.na(.))/n())) %>%
ungroup
This will be applied to col1 to col4 in your dataframe.
Using data.table
library(data.table)
setDT(df1)[, Obs := Obs * mean(!is.na(Obs)), Country]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * mean(!is.na(Obs)))
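To see why multiplying by mean(!is.na(Obs)) is the same as multiplying by sum(!is.na(Obs))/n(), and how the idea extends to several columns without extra packages, here is a small base R sketch. The column names col1 and col2, the helper name share and the toy values are assumptions for illustration only.

```r
# Toy panel: two countries, four rows each (values are made up)
df <- data.frame(
  Country = rep(c("AL", "UK"), each = 4),
  col1 = c(NA, 2, 3, NA, 1, 2, 1, 2),
  col2 = c(1, NA, NA, NA, 4, 4, NA, 4)
)
cols <- c("col1", "col2")

# Share of non-missing values per country: the mean of a 0/1 indicator
# equals (count of non-NA) / (number of rows in the group)
share <- function(x, g) ave(as.numeric(!is.na(x)), g, FUN = mean)

# Scale every selected column by its own per-country share
df[cols] <- lapply(df[cols], function(x) x * share(x, df$Country))
df
```

Each column gets its own share, so a country with many gaps in one column and none in another is scaled differently in each.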

finding all flights that have at least three years of data in R

I am using the flight dataset that is freely available in R.
library(readr)
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")
Now, let's say I want to find all flights that have been flying for at least three consecutive years, i.e. there are dates available for three years in the date column. Basically, I am only interested in the year part of the date.
I was thinking of the following approach: create a unique list of all plane names, then for each plane get all the dates and see if there are three consecutive years.
I started as follows:
NOyears = 3
planes <- unique(flights$plane)
# at least 3 consecutive years
for (plane in planes){
plane = "N576AA"
allyears <- which(flights$plane == plane)
}
But I am stuck here. This whole approach is starting to look too complicated to me. Is there an easier/faster way, considering that I am working on a very large dataset?
Note: I want to be able to specify the number of years later on, which is why I included NOyears = 3 in the first place.
EDIT:
I have just noticed this question on SO. Very interesting use of diff and cumsum, which are both new to me. Maybe a similar approach is possible here using data.table?
dplyr will do the trick here
library(dplyr)
library(lubridate)
flights %>%
mutate(year = year(date)) %>%
group_by(plane) %>%
summarise(range = max(year) - min(year)) %>%
filter(range >= 2)
Though I'm not seeing any planes that meet criteria!
Edit: Per mnist's comment, consecutive years are a little more tricky, but here's a working example with consecutive months (the data you supplied only has one year) - just swap out for years!
nMonths = 6
flights %>%
mutate(month = month(date)) %>% #Calculate month
count(plane, month) %>% #Summarize to one row for each plane/month combo
arrange(plane, month) %>% #Arrange by plane, month so we can look at consecutive months
group_by(plane) %>% #Within each plane...
mutate(consecutiveMonths = sequence(rle(cumsum(c(0, diff(month) != 1)))$lengths)) %>% #...number each row within its run of consecutive months (1, 2, 3, ...)
group_by(plane) %>% #Then, for each plane...
summarise(maxConsecutiveMonths = max(consecutiveMonths)) %>% #...return the maximum number of consecutive months
filter(maxConsecutiveMonths >= nMonths) #And keep only those planes that meet criteria!
Here is another option using data.table:
#summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]
#add a dummy flight to demonstrate consecutive years
yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight="111", year=2011:2014)))
setkey(yearly, carrier, flight, year)
yearly[, c("rl", "rw") := {
iscons <- cumsum(c(0L, diff(year)!=1L))
.(iscons, rowid(carrier, flight, iscons))
}]
yearly[rl %in% yearly[rw>=3L]$rl]
output:
carrier flight year rl rw
1: ZZ 111 2011 5117 1
2: ZZ 111 2012 5117 2
3: ZZ 111 2013 5117 3
4: ZZ 111 2014 5117 4
Here is a data.table approach (using month, since there is only one year in that file, filtering flights that operated consecutively during 12 months):
library(data.table)
flights <- fread("http://ucl.ac.uk/~uctqiax/data/flights.csv")
flights[, month:=month(date)]
setkey(flights, plane, date)
flights[, max_run:=lapply(.SD, function(x) max(rle(cumsum(c(0, diff(unique(x))) > 1))$lengths)),
.SDcols="month", by="plane"][max_run > 11][]
#> date hour minute dep arr dep_delay arr_delay carrier
#> 1: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 2: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 3: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 4: 2011-01-02 12:00:00 NA NA NA NA NA NA XE
#> 5: 2011-01-02 12:00:00 NA NA NA NA NA NA XE
#> ---
#> 151636: 2011-11-21 12:00:00 10 56 1056 1359 25 37 FL
#> 151637: 2011-12-09 12:00:00 18 36 1836 2126 -5 -4 FL
#> 151638: 2011-12-13 12:00:00 17 27 1727 2013 -3 -7 FL
#> 151639: 2011-12-14 12:00:00 6 28 628 914 -2 -8 FL
#> 151640: 2011-12-14 12:00:00 11 57 1157 1438 -3 -14 FL
#> flight dest plane cancelled time dist month max_run
#> 1: 2174 PNS 1 NA 489 1 12
#> 2: 2277 BRO 1 NA 308 1 12
#> 3: 2811 MOB 1 NA 427 1 12
#> 4: 2204 OKC 1 NA 395 1 12
#> 5: 2570 BTR 1 NA 253 1 12
#> ---
#> 151636: 298 ATL N983AT 0 98 696 11 12
#> 151637: 296 ATL N983AT 0 89 696 12 12
#> 151638: 292 ATL N983AT 0 87 696 12 12
#> 151639: 290 ATL N983AT 0 86 696 12 12
#> 151640: 286 ATL N983AT 0 87 696 12 12
Created on 2020-05-14 by the reprex package (v0.3.0)
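The diff/cumsum idea used in the answers above can be isolated in a small helper. This base R sketch uses a made-up toy table (the plane IDs and years are assumptions, since the linked file only covers 2011) and a hypothetical function name, longest_run. NOyears is the threshold from the question.

```r
# Length of the longest run of consecutive integers in x (e.g. years)
longest_run <- function(x) {
  x <- sort(unique(x))
  if (length(x) == 0) return(0L)
  run_id <- cumsum(c(0L, diff(x) != 1L))  # new id whenever a gap occurs
  max(rle(run_id)$lengths)
}

# Toy data: plane "A" flies 2011-2013, plane "B" skips 2012
toy <- data.frame(
  plane = c("A", "A", "A", "A", "B", "B", "B"),
  year  = c(2011, 2012, 2012, 2013, 2011, 2013, 2014)
)

NOyears <- 3
runs <- tapply(toy$year, toy$plane, longest_run)  # one value per plane
keep <- names(runs)[runs >= NOyears]              # planes meeting the threshold
keep
```

Reducing each plane to its unique years first keeps the per-group work small, which matters on a very large table.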

group "weighted" mean with multiple grouping variables and excluding own group value

I'm trying to get group "weighted" mean with multiple grouping variables and excluding own group value. This is related to my earlier post Get group mean with multiple grouping variables and excluding own group value, but when I applied it to my actual question (which is getting the weighted mean) I found out that it's much more complicated than getting the simple mean. Here's what I mean by that.
df <- data_frame(
state = rep(c("AL", "CA"), each = 6),
county = rep(letters[1:6], each = 2),
year = rep(c(2011:2012), 6),
value = c(91,46,37,80,33,97,4,19,85,90,56,94),
wt = c(1,4,3,5,1,4,5,1,5,5,4,1)
) %>% arrange(state, year)
For unweighted mean case, the following code (from the accepted answer of my earlier post) should work.
df %>%
group_by(state, year) %>%
mutate(q = (sum(value) - value) / (n()-1))
The desired variable new_val, which is the weighted mean, would be the following. For instance, the first two rows of new_val column are calculated as 37*3/4 + 33*1/4 = 36, 91*1/2 + 33*1/2 = 62.
# A tibble: 12 x 6
state county year value wt new_val
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 AL a 2011 91 1 36
2 AL b 2011 37 3 62
3 AL c 2011 33 1 50.5
4 AL a 2012 46 4 87.6
5 AL b 2012 80 5 71.5
6 AL c 2012 97 4 64.9
7 CA d 2011 4 5 72.1
8 CA e 2011 85 5 27.1
9 CA f 2011 56 4 44.5
10 CA d 2012 19 1 90.7
11 CA e 2012 90 5 56.5
12 CA f 2012 94 1 78.2
I searched for similar posts with weighted mean in mind, but all the available ones were for the simple mean cases. Any comments would be greatly appreciated. Thank you!
We can use map_dbl to exclude current row in the calculation of weighted.mean
library(dplyr)
df %>%
group_by(state, year) %>%
mutate(new_val = purrr::map_dbl(row_number(),
~weighted.mean(value[-.x], wt[-.x])))
# state county year value wt new_val
# <chr> <chr> <int> <dbl> <dbl> <dbl>
# 1 AL a 2011 91 1 36
# 2 AL b 2011 37 3 62
# 3 AL c 2011 33 1 50.5
# 4 AL a 2012 46 4 87.6
# 5 AL b 2012 80 5 71.5
# 6 AL c 2012 97 4 64.9
# 7 CA d 2011 4 5 72.1
# 8 CA e 2011 85 5 27.1
# 9 CA f 2011 56 4 44.5
#10 CA d 2012 19 1 90.7
#11 CA e 2012 90 5 56.5
#12 CA f 2012 94 1 78.2
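For larger data, the row-wise map_dbl call can be avoided: the weighted mean of the other rows equals (group sum of value*wt minus the row's own value*wt) divided by (group sum of wt minus the row's own wt). The sketch below is a base R version of that identity on the question's data. It assumes every state/year group has at least two rows, otherwise the denominator is zero.

```r
df <- data.frame(
  state  = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year   = rep(2011:2012, 6),
  value  = c(91, 46, 37, 80, 33, 97, 4, 19, 85, 90, 56, 94),
  wt     = c(1, 4, 3, 5, 1, 4, 5, 1, 5, 5, 4, 1)
)

vw <- df$value * df$wt
# Group totals, repeated on every row of the group
sum_vw <- ave(vw, df$state, df$year, FUN = sum)
sum_w  <- ave(df$wt, df$state, df$year, FUN = sum)

# Remove the row's own contribution from numerator and denominator
df$new_val <- (sum_vw - vw) / (sum_w - df$wt)
df[order(df$state, df$year), ]
```

The same expression also works inside mutate() after group_by(state, year), using sum(value * wt) and sum(wt) directly.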

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But I didn't have the R skills to get it to work if the lag is more than one time unit. Any suggestions?
ADDITION
So what i want is the growth-rate, that is (24/20)-1=0.2 in the example below. Not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 4-period-lagged quarter) - 1, i.e. same quarter one year earlier
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
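The lag distance is what matters here: comparing with the same quarter one year earlier means a lag of 4 rows within each Sector/Country group, provided the rows are sorted by Quarter and no quarter is missing. A dependency-free sketch of the same calculation follows. lag_k is a hypothetical helper, not a base R function, and only the USA part of the example data is used.

```r
df <- data.frame(Sector = rep(c("A", "B"), each = 8), Country = "USA",
                 Quarter = rep(1:8, 2), Income = 20:35)

# Shift x down by k positions, padding the start with NA
lag_k <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Sort so that a lag of 4 rows is a lag of 4 quarters within each group
df <- df[order(df$Sector, df$Country, df$Quarter), ]
df$growth <- ave(as.numeric(df$Income), df$Sector, df$Country,
                 FUN = function(x) x / lag_k(x, 4) - 1)
head(df, 8)
```

Changing the second argument of lag_k gives other horizons, for example 1 for quarter-on-quarter growth.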
A base R alternative is to merge the data with a copy of itself whose quarters are shifted forward by four:
df3 = df
df3$Quarter = df3$Quarter + 4
df = merge(df, df3, c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = TRUE)
df$growth = df$Income / df$Income_prev - 1
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 NA NA
2 A USA 2 21 NA NA
3 A USA 3 22 NA NA
4 A USA 4 23 NA NA
5 A USA 5 24 20 0.20000000
6 A USA 6 25 21 0.19047619
7 A USA 7 26 22 0.18181818
8 A USA 8 27 23 0.17391304
9 A UK 1 32 NA NA
10 A UK 2 33 NA NA
11 A UK 3 34 NA NA
12 A UK 4 35 NA NA
13 A UK 5 36 32 0.12500000
14 A UK 6 37 33 0.12121212
15 A UK 7 38 34 0.11764706
16 A UK 8 39 35 0.11428571
17 B USA 1 28 NA NA
18 B USA 2 29 NA NA
19 B USA 3 30 NA NA
20 B USA 4 31 NA NA
21 B USA 5 32 28 0.14285714
22 B USA 6 33 29 0.13793103
23 B USA 7 34 30 0.13333333
24 B USA 8 35 31 0.12903226
25 B UK 1 40 NA NA
26 B UK 2 41 NA NA
27 B UK 3 42 NA NA
28 B UK 4 43 NA NA
29 B UK 5 44 40 0.10000000
30 B UK 6 45 41 0.09756098
31 B UK 7 46 42 0.09523810
32 B UK 8 47 43 0.09302326

select maximum row value by group

I've been trying to do this with my data by looking at other posts, but I keep getting an error. My data new looks like this:
id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90
I want to select the row with the highest year value by id. So the wanted output is:
id year name gdp
1 1982 Jamie 70
2 1992 Kate 67
3 1996 Joe 90
From Selecting Rows which contain daily max value in R I tried the following but did not work
ddply(new,~id,function(x){x[which.max(new$year),]})
I've also tried
tapply(new$year, new$id, max)
But this didn't give me the wanted output.
Any suggestions would really help!
Another option that scales well for large tables is using data.table.
DT <- read.table(text = "id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90",
header = TRUE)
require("data.table")
DT <- as.data.table(DT)
setkey(DT,id,year)
res = DT[, list(year = max(year)), by = id]  # latest year per id
res
setkey(res,id,year)
DT[res]
# id year name gdp
# 1: 1 1982 Jamie 70
# 2: 2 1992 Kate 67
# 3: 3 1996 Joe 90
Just use split:
df <- do.call(rbind, lapply(split(df, df$id),
function(subdf) subdf[which.max(subdf$year)[1], ]))
For example,
df <- data.frame(id = rep(1:10, each = 3), year = round(runif(30,0,10)) + 1980, gdp = round(runif(30, 40, 70)))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 1 1981 47
# 3 1 1987 69
# 4 2 1985 57
# 5 2 1989 41
# 6 2 1988 54
df <- do.call(rbind, lapply(split(df, df$id), function(subdf) subdf[which.max(subdf$year)[1], ]))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 2 1989 41
# 3 3 1989 55
# 4 4 1988 62
# 5 5 1989 48
# 6 6 1990 41
You can do this with duplicated
# your data
df <- read.table(text="id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90" , header=TRUE)
# Sort by id and year (latest year is last for each id)
df <- df[order(df$id , df$year), ]
# Select the last row by id
df <- df[!duplicated(df$id, fromLast=TRUE), ]
ave works here yet again, and will account for a circumstance with multiple rows for the maximum year.
new[with(new, year == ave(year,id,FUN=max) ),]
# id year name gdp
#3 1 1982 Jamie 70
#6 2 1992 Kate 67
#9 3 1996 Joe 90
Your ddply effort looks good to me, but you referenced the original dataset in the callback function.
ddply(new,~id,function(x){x[which.max(new$year),]})
# should be
ddply(new,.(id),function(x){x[which.max(x$year),]})
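The reason the original call fails can be reproduced without plyr: inside the callback, x holds only the rows of one id, while which.max(new$year) is a position in the full table. A short base R check using the question's data (the name one_id is just for illustration):

```r
new <- read.table(header = TRUE, text = "id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90")

one_id <- new[new$id == 1, ]        # what the callback receives for id 1

which.max(new$year)                 # position in the full table (row 9)
one_id[which.max(new$year), ]       # out of range for a 3-row subset: a row of NAs
one_id[which.max(one_id$year), ]    # position within the subset: the 1982 row
```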
