Format of time series data with regression variables - r

I am new to R and am attempting an analysis in R with the tslm() function.
Sample data in csv format:
UnitSales GDP GDPPerCap CPI PropInvIndex DispIncTopDecile TransCommSecDecile CivilVehOwn Urban AutoFin
2000 1 1198243.4 949 81.62 4984 10643 618 1609 36.22 0
2001 2 1324337.8 1042 81.38 6344 14219 782 1802 37.66 0
2002 3 1453827.558 1135 80.8 7790.9223 18995.9 991.2 2053.17 39.08978381 0
2003 4 1640958.735 1274 81.83 10153.8009 21837.3 1106 2382.93 40.53022975 0
2004 5 1931644.33 1409 85.02 13158.2516 25377.2 1274.2 2693.71 41.76000862 0
2005 6 2256902.591 1731 86.56 15909.2471 28773.1 1590.3 3160 42.98999663 0
2006 7 2712950.885 2069 87.83 19422.9174 31967.3 1801 3697.3531 44.34301016 0
2007 9 3494055.942 2651 92.02 25288.8373 36784.5 2467.7 4358.355 45.8892446 0.1
2008 11 4521827.271 3414 97.45 31203.1942 43613.8 2632.9 5099.6094 46.98950317 0.12
2009 13 4990233.519 3749 96.76 36241.808 46826.1 3181.9 6280.6086 48.34170101 0.14
2010 15 5930502.27 4433 100 48259.403 51431.6 3630.6 7801.8259 49.94966105 0.16
2011 18 7321891.955 5447 105.45 61796.8858 58841.9 3963 9356.3163 51.27027127 0.18
2012 21 8229490.03 6093 108.22 71803.7869 63824.2 4304.1 10933.0912 52.57008656 0.22
I load the data and then attempt to run:
testc <- tslm(UnitSales~GPD+trend, data=lm0015c)
A simple attempt to model UnitSales from the GDP variable plus the trend.
I get the following error:
Error in tslm(UnitSales ~ GPD + trend, data = lm0015c) :
Not time series data
How do I designate the data as a time series?

You can create a time series formatted version of your data directly with the ts() function.
yourGreatData <- ts(d[,2:length(d)], start = 2000, end = 2012)

I discovered as easier solution to this problem is simply by coercing your dataframe as a time series:
testc <- tslm(UnitSales~GPD+trend, data=as.ts(lm0015c))

Related

plm - industry + year fixed effects with firm-year data

My question is, how do I include industry and year fixed effects in plm, when I have multiple firms in same industry in same year? Repex of my data looks like this:
Year Industry CompanyID CEOID CEO.background MBA.CEO CEO.Tenure Female.CEO CEO.age Capex Log.TA Leverage
2005 6 1075 10739 0 0 6.92 0 55 0.08623238 9.199961396 0.330732917
2006 6 1075 10739 0 0 7.92 0 56 0.097455145 9.334559982 0.26575725
2007 6 1075 10739 0 0 8.92 0 56 0.113033772 9.346263914 0.285439531
2008 6 1075 10739 0 0 9.92 0 57 0.108640177 9.327564318 0.322985772
2009 6 1075 5835 0 0 0.67 0 54 0.08526524 9.360491034 0.333880116
2010 6 1075 5835 0 0 1.67 0 55 0.081452292 9.376545673 0.32197511
2005 6 1743 8379 0 0 17.43 0 65 0.236487293 6.693007633 0.021915227
2006 6 1743 26012 0 1 0.91 0 59 0.319264835 6.820455133 0.023157959
2007 6 1743 26012 0 1 1.91 0 58 0.207384938 6.844512984 0.020087012
2008 6 1743 26012 0 1 2.92 0 59 0.130632264 6.890964093 0.017103795
2009 6 1743 26012 0 1 3.92 0 60 0.112029325 6.879662342 0.017283796
2010 6 1743 30801 0 0 1 1 47 0.02804693 6.767971236 0.044755539
2005 7 1004 9249 0 0 9.65 0 53 0.076370794 6.596094672 0.31534354
2006 7 1004 9249 0 0 10.65 0 54 0.114891589 6.886346743 0.327808308
2007 7 1004 9249 0 0 11.65 0 55 0.097727719 6.973199328 0.307086799
2008 7 1004 9249 0 0 12.65 0 56 0.112119583 7.216716829 0.389800369
2009 7 1004 9249 0 0 13.65 0 57 0.086281135 7.228033526 0.331455792
2010 7 1004 9249 0 0 14.65 0 58 0.298922358 7.313914813 0.291147083
CEO.background, MBA.CEO, and Female.CEO are time-invariant dummies for each CEO and industry time-invariant dummy for firm, while rest are time varying firm/CEO attributes.
I would like to run the following fixed effects for industry/year regression code:
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage, data=repexcapex, index = (c("Industry", "Year")), model = "within", effect = "twoways")
However, if I have multiple companies in same industry like above data (company ID 1075/1743 both in industry 6), the code gives an error about duplicates.
Error in pdim.default(index[[1]], index[[2]]) :
duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
[...]
If I kill the first 5 rows and run it with just 1 firm per industry, the code works.
How should I formulate my regression to be able to include both industry and year fixed effects? Is running the code with industry dummies like below equivalent to industry fixed effects:
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage + factor(Industries), data=repexcapex, index = (c("Year")), model = "within", effect = "individual")
this is the formatted data:
repexcapex <- read.table(text="
Year,Industry,CompanyID,CEOID,CEO.background,MBA.CEO,CEO.Tenure,Female.CEO,CEO.age,Capex,Log.TA,Leverage
2005,6,1075,10739,0,0,6.92,0,55,0.08623238,9.199961396,0.330732917
2006,6,1075,10739,0,0,7.92,0,56,0.097455145,9.334559982,0.26575725
2007,6,1075,10739,0,0,8.92,0,56,0.113033772,9.346263914,0.285439531
2008,6,1075,10739,0,0,9.92,0,57,0.108640177,9.327564318,0.322985772
2009,6,1075,5835,0,0,0.67,0,54,0.08526524,9.360491034,0.333880116
2010,6,1075,5835,0,0,1.67,0,55,0.081452292,9.376545673,0.32197511
2005,6,1743,8379,0,0,17.43,0,65,0.236487293,6.693007633,0.021915227
2006,6,1743,26012,0,1,0.91,0,59,0.319264835,6.820455133,0.023157959
2007,6,1743,26012,0,1,1.91,0,58,0.207384938,6.844512984,0.020087012
2008,6,1743,26012,0,1,2.92,0,59,0.130632264,6.890964093,0.017103795
2009,6,1743,26012,0,1,3.92,0,60,0.112029325,6.879662342,0.017283796
2010,6,1743,30801,0,0,1,1,47,0.02804693,6.767971236,0.044755539
2005,7,1004,9249,0,0,9.65,0,53,0.076370794,6.596094672,0.31534354
2006,7,1004,9249,0,0,10.65,0,54,0.114891589,6.886346743,0.327808308
2007,7,1004,9249,0,0,11.65,0,55,0.097727719,6.973199328,0.307086799
2008,7,1004,9249,0,0,12.65,0,56,0.112119583,7.216716829,0.389800369
2009,7,1004,9249,0,0,13.65,0,57,0.086281135,7.228033526,0.331455792
2010,7,1004,9249,0,0,14.65,0,58,0.298922358,7.313914813,0.291147083",
sep=",",header=TRUE)
As your dependent variable Capex seems to be a company-specific measure, likely the unit of observation (= what plm calls the individual dimension) is company (variable CompanyID) which is to be specified in the index argument.
Thus, a basic 2-way model can be estimated by:
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage, data=repexcapex, index = (c("CompanyID", "Year")), model = "within", effect = "twoways")
To add industry fixed effects, include +factor(Industry) in the formula. Likely, this variable will drop out of the estimation as it is correlated with the other fixed effects (it is for the small sample data you provided).

Looping over two R dataframes to create a third dataframe

I am trying to backtest a trading strategy.
I am coding in R 3.6
The data is in two dataframes. The first is five years of daily price activity (i.e. fiveyrdaily). The second dataframe is price activity for the same five years only at the one minute level (i.e. fiveyrminutes). This data includes sessions low, high, open, and close.
My strategy is to loop over the respective days of the minute dataframe, while referring to and populating data on the daily dataframe to determine the following:
Condition that warrants the opening of a position (i.e. long or short).
Whether the target or stop price has been achieved.
Log the "trade" on a third dataframe (ORBOorders).
The embedded loops are not working the way I thought they would and I can't figure out why.
I know I can make the code simplier but it keeps getting drawn out so I can figure out where it is I'm going wrong.
I understand that appending to a vector in a loop is not the preferred way to handle vectors but I don't know how large the vector will be after the program is complete. I'd love to hear any ideas.
This is a strategy that seeks to catch price as it moves out of the range established in the first five minutes of the day.
This is the code to identify the direction of price movement after the initial 5 minutes. The counter is not working. The goal is to know the index of a specific event within the subsetted fiveyrminute dataframe (i.e. todaysdate); I will use this in the next batch of code below. The counter starts at 6 for the sixth minute. The output is never higher than 7, so, it's not counting with each iteration of the loop. I can't figure out why.
for (i in 1:length(fiveyrdaily$Date)){
todaysdate <- subset(fiveyrminutes, fiveyrdaily$Date[i] == fiveyrminutes$Date) # subset so I'm only seeing the respective days in the minutes dataframe
counter3 <- 6 # start a counter so I can track which position I'm in within the minutes dataframe, start after ORBO
for (t in counter:length(todaysdate$Date)){ #loop over the minutes dataframe
if ((fiveyrdaily$FiveHighClose[i] == 0) & (todaysdate$Close[t] > fiveyrdaily$FiveMinuteHigh[i])) {
fiveyrdaily$FiveHighClose[i] <- todaysdate$Close[t]
fiveyrdaily$FiveHighAfterClose[i] <- todaysdate$High[t] #record the session's high
counter3 <- counter3 + 1
fiveyrdaily$counterhigh[i] <- counter3 # record the position of this event within the subset of the minute dataframe
}
else if ((fiveyrdaily$FiveLowClose[i] == 0) & (todaysdate$Close[t] < fiveyrdaily$FiveMinuteLow[i])) {
fiveyrdaily$FiveLowClose[i] <- todaysdate$Close[t]
fiveyrdaily$FiveLowAfterClose[i] <- todaysdate$Low[t] # record the session's low
counter3 <- counter3 + 1
fiveyrdaily$counterlow[i] <- counter3
}
counter3 <- counter + 1
}
}
This is the "for" loop to determine if to buy and record the outcome.
#Open a long position
for (i in 1:length(fiveyrdaily$Date)){
if ((fiveyrdaily$counterhigh[i] < fiveyrdaily$counterlow[i]) & openposition < 2) {
#entryprice is the price predetermined ticks above the high
entryprice <- fiveyrdaily$FiveHighAfterClose[i] + (openticksaway * tickvalue)
#create stoploss
stoplossprice <- entryprice - stoplossvalue
#uncover target closest to five minute high
if (possibletargets > fiveyrdaily$FiveHighAfterClose[i]){ fiveyrdaily$ORBOtarget[i] <- min(possibletargets)}
poscounter <- fiveyrdaily$counterhigh[i]
beginforloop <- todaysdate[poscounter]
for (c in beginforloop:length(todaysdate$Date)){ #see where to open position starting from the occurence of the high
if ((entryprice > todaysdate$Low[c]) & (entryprice < todaysdate$High[c]) & (openposition < 2)){ #determine if conditions warrant entry
openposition <- openposition + 1 # open a position
if (openposition > 0){ #trade management
openpos <- 1
if ((stoplossprice < todaysdate$high[c]) & (stoplossprice > todaysdate$Low[c])){ # determine if stoploss has been hit
orderdate <- c(orderdate, todaysdate[c]) #enter data into orders dataframe
orderstrategy <- c(orderstrategy, "ORBO")
ordertype <- c(ordertype, "Long")
ordersymbol <- c(ordersymbol, "ES")
orderentry <- c(orderentry, entryprice)
orderclose <- c(orderclose, stoplossprice)
orderprofit <- c(orderprofit, abs((entryprice - stoplossprice) * pointvalue))
openpos <- 0
}
else if ((fiveyrdaily$ORBOtarget[i] < todaysdate$high[c]) & (fiveyrdaily$ORBOtarget[i] > todaysdate$Low[c])){ # determine if target has been hit
orderdate <- c(orderdate, todaysdate[c]) #enter data into orders dataframe
orderstrategy <- c(orderstrategy, "ORBO")
ordertype <- c(ordertype, "Long")
ordersymbol <- c(ordersymbol, "ES")
orderentry <- c(orderentry, entryprice)
orderclose <- c(orderclose, stoplossprice)
orderprofit <- c(orderprofit, abs((fiveyrdaily$ORBOtarget[i] - entryprice) * pointvalue))
openpos <- 0
}
}
}
}
}
}
Here is the data that is in the daily dataframe - fiveyrdaily:
> tail(fiveyrdaily)
Date Time Open High Low Close Vol OI UpperBand
1254 06/12/2019 16:15 2883.50 2889.75 2875.25 2881.00 1205406 2495060 2919.12
1255 06/13/2019 16:15 2894.75 2900.50 2886.75 2898.50 523312 448119 2925.39
1256 06/14/2019 16:15 2893.00 2899.75 2884.25 2894.75 1318938 951568 2927.99
1257 06/17/2019 16:15 2895.50 2902.75 2892.00 2896.25 1649621 1595842 2932.71
1258 06/18/2019 16:15 2914.00 2936.50 2910.25 2926.25 2257843 2093571 2944.19
1259 06/19/2019 16:15 2925.50 2936.75 2915.25 2933.50 1639495 2093571 2954.61
LowerBand MidLine PP RSI OverBot OverSld SlowK SlowD OverBot.1
1254 2751.13 2835.13 2892.58 56.24 70 30 86.82 87.54 80
1255 2749.21 2837.30 2882.00 59.06 70 30 87.60 88.11 80
1256 2748.24 2838.11 2895.25 58.19 70 30 89.01 87.81 80
1257 2746.94 2839.82 2892.92 58.46 70 30 91.79 89.47 80
1258 2743.68 2843.94 2897.00 63.41 70 30 92.63 91.14 80
1259 2740.02 2847.31 2924.33 64.51 70 30 95.20 93.21 80
OverSld.1 Volume Momentum ZeroLine InOrOut GapUpOrDown TypeOfDay PPTrend
1254 20 1205406 49.25 0 Disregard Disregard Bear Uptrend
1255 20 523312 93.50 0 Disregard Gap Up Bull Uptrend
1256 20 1318938 114.75 0 Disregard Disregard Bull Uptrend
1257 20 1649621 105.75 0 Disregard Disregard Bull Uptrend
1258 20 2257843 173.75 0 Disregard Gap Up Bull Uptrend
1259 20 1639495 184.00 0 Disregard Disregard Bull Uptrend
FiveMinuteLow FiveMinuteHigh FiveMinHtoL DollarFiveHtoL FiveHighClose
1254 2881.25 2889.75 8.50 425.0 0.00
1255 2892.00 2895.75 3.75 187.5 2896.00
1256 2886.25 2893.75 7.50 375.0 2894.75
1257 2892.00 2897.00 5.00 250.0 2897.25
1258 2910.25 2915.50 5.25 262.5 2920.75
1259 2923.25 2927.25 4.00 200.0 2930.00
FiveHighAfterClose FiveLowClose FiveLowAfterClose ORBOtarget counterhigh
1254 0.00 2881.00 2880.75 0 0
1255 2896.00 2891.75 2891.00 0 6
1256 2894.75 2886.00 2885.00 0 7
1257 2897.50 0.00 0.00 0 7
1258 2921.75 0.00 0.00 0 7
1259 2931.50 2922.50 2922.25 0 7
counterlow DollarOpentoHigh DollarOpentoClose DollarOpentoLow
1254 7 312.5 125.0 412.5
1255 7 287.5 187.5 400.0
1256 7 337.5 87.5 437.5
1257 0 362.5 37.5 175.0
1258 0 1125.0 612.5 187.5
1259 7 562.5 400.0 512.5
This is the data that is in the minutes dataframe - fiveyrminutes:
Date Time Open High Low Close Up Down UpperBand
509796 06/19/2019 16:10 2932.25 2932.50 2932.25 2932.50 717 430 2935.66
509797 06/19/2019 16:11 2932.25 2932.50 2932.25 2932.25 125 276 2935.46
509798 06/19/2019 16:12 2932.25 2932.75 2932.25 2932.75 612 604 2934.95
509799 06/19/2019 16:13 2932.50 2933.25 2932.50 2933.00 830 153 2934.66
509800 06/19/2019 16:14 2933.25 2933.25 2932.75 2933.00 676 376 2934.26
509801 06/19/2019 16:15 2932.75 2934.00 2932.75 2933.25 2929 2026 2933.90
LowerBand MidLine PP RSI OverBot OverSld SlowK SlowD OverBot.1
509796 2930.27 2932.96 2932.42 47.94 70 30 45.45 40.66 80
509797 2930.24 2932.85 2932.42 46.22 70 30 53.70 46.21 80
509798 2930.45 2932.70 2932.33 50.07 70 30 63.83 54.33 80
509799 2930.56 2932.61 2932.58 51.92 70 30 72.73 63.42 80
509800 2930.76 2932.51 2932.92 51.92 70 30 83.33 73.30 80
509801 2930.97 2932.44 2933.00 53.90 70 30 85.00 80.35 80
OverSld.1 Volume Momentum ZeroLine
509796 20 1147 -0.50 0
509797 20 401 -1.00 0
509798 20 1216 1.75 0
509799 20 983 1.75 0
509800 20 1052 2.00 0
509801 20 4955 1.75 0
This is the output for the orders dataframe, notice how it's empty - ORBOorders:
ORBOorders
orderdate orderstrategy ordertype ordersymbol orderentry orderclose
1 0 0
orderprofit
1 0
These are the problems:
-counter3 isn't working (to find after which point I should buy)
-The second batch of code gives this error:
Error in beginforloop:length(todaysdate$Date) : argument of length 0
In addition: Warning message: In if (possibletargets >
fiveyrdaily$FiveHighAfterClose[i]) { : the condition has length > 1
and only the first element will be used
-The ORBOorder dataframe has absolutely no data in it.
Thanks for any help in advance!

Selecting pairs of odd even values in R

I have a large dataset as follows:
head(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type
Sampletype
400 32691 535 -28 3382.981 34.74480 1 S3 2011 2 ha
401 32701 536 -28 3375.263 34.86087 1 S3 2011 2 ha
402 32711 537 -28 3308.103 34.83100 1 S3 2011 2 ha
403 32721 538 -28 3368.721 31.58641 1 S3 2011 2 ha
404 32731 539 -28 3368.604 34.72326 1 S3 2011 2 ha
405 32741 540 -28 3314.713 32.83147 1 S3 2011 2 ha
tail(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
5445 70880 3962 -28.4 3390.458 29.12815 34 S4 2016 2 ha
5446 70890 3963 -28.5 3358.861 37.14896 34 S4 2016 2 ha
5447 70900 3964 -28.5 3363.626 26.71573 34 S4 2016 2 ha
5448 70910 3965 -28.5 3408.907 26.69665 34 S4 2016 2 ha
5449 70920 3966 -28.5 3348.463 29.01492 34 S4 2016 2 ha
5450 70930 3967 -28.4 3375.247 26.78261 34 S4 2016 2 ha
I am looking to create a variable to identify pairs of odd and even based on the variable GU.Number. These numbers identify duplicates of the same object - have same d13.C values.
For example,
535 - 536
537 - 538
3963-3964
3965-3966 are pairs.
Note, the column of GU.Number is not a sequence, some numbers are missing.
even.rows <- which(!(humic$GU.Number %% 2))
has.pair <- rep(0,nrow(humic))
for(i in even.rows){
has.pair[i] <- max((humic$GU.Number[i] + c(1,-1)) %in% humic$GU.Number)
}
# add as column of data
humic$has.pair <- has.pair
The has.pair column will be 1 if the GU.Number is even and there exists an odd GU.Number one less or one greater than the given GU.Number. Otherwise it will be 0. As a one-liner:
humic$has.pair <- sapply(1:nrow(humic),
function(x) with(humic,(!(GU.Number[x] %% 2))*max((GU.Number[x] + c(1,-1)) %in% GU.Number)))

How to give a function a specific column of a list using a for-loop but prevent that output is named according to the iterator command

Given the following example:
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg, data = dat.bcg, append = TRUE)
dat
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
trial author year tpos tneg cpos cneg ablat alloc yi vi
1 1 Aronson 1948 4 119 11 128 44 random -0.8893 0.3256
2 2 Ferguson & Simes 1949 6 300 29 274 55 random -1.5854 0.1946
3 3 Rosenthal et al 1960 3 228 11 209 42 random -1.3481 0.4154
4 4 Hart & Sutherland 1977 62 13536 248 12619 52 random -1.4416 0.0200
5 5 Frimodt-Moller et al 1973 33 5036 47 5761 13 alternate -0.2175 0.0512
6 6 Stein & Aronson 1953 NA NA NA NA 44 alternate NA NA
7 7 Vandiviere et al 1973 8 2537 10 619 19 random -1.6209 0.2230
8 8 TPT Madras 1980 505 87886 499 87892 NA random 0.0120 0.0040
9 9 Coetzee & Berjak 1968 29 7470 45 7232 27 random -0.4694 0.0564
10 10 Rosenthal et al 1961 17 1699 65 1600 42 systematic -1.3713 0.0730
11 11 Comstock et al 1974 186 50448 141 27197 18 systematic -0.3394 0.0124
12 12 Comstock & Webster 1969 5 2493 3 2338 33 systematic 0.4459 0.5325
13 13 Comstock et al 1976 27 16886 29 17825 33 systematic -0.0173 0.0714
Now what i basically want is to iterate with the rma() command (only for mods argument) from - let's say - [7:8] and to store this result in a variable equal to the columnname.
Two problems:
1) When i enter the command:
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
The modname is named as dat[[8]]. But I want the modname to be the columname (i.e. colnames(dat[i]))
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 0.5543 1.4045 0.3947 0.7312 -5.4888 6.5975
dat[[8]] -0.0312 0.0435 -0.7172 0.5477 -0.2185 0.1560
2) Now imagine that I have a lot of columns more and I want to iterate from [8:53], such that each result gets stored in a variable named equal to the columnname.
Problem 2) has been solved:
for(i in 7:8){
assign(paste(colnames(dat[i]), i, sep=""), rma(yi, vi, data = dat, mods = ~dat[[i]], subset = (alloc=="systematic"), knha = TRUE))}
To answers 1st part of your question, you can change the names by accessing the attributes of the model object.
In this case
# inspect the attributes
attr(model$vb, which = "dimnames")
# assign the name
attr(model$vb, which = "dimnames")[[1]][2] <- paste(colnames(dat)[8])

How to sum a variable and extract it related variable to appear in the answer

I am new to R and to Stackoverflow and I need an assistant in sorting and extracting information from a data frame I created. I need to extract which IATA and NAME has received the most commission. The result should print: 3301, you pay, 12. I can subset each and every IATA but it is a long process. What will be the best function in R to sort all this information and print out this information.
IATA NAME TICKET_NUM PAX FARE TAX COMM NET
3300 pay more 700 john cohen 10 1.1 2 8
3300 pay more 701 james levy 11 1.2 2 9
3300 pay more 702 jonathan arbel 12 1.2 3 9
3300 pay more 703 gil matan 9 1.0 2 7
3301 you pay 704 ron natan 19 2.0 6 9
3301 you pay 705 don horvitz 18 2.0 6 9
3302 pay by ticket 706 lutter kaplan 9 1.2 0 9
3303 enjoy 707 lutter omega 12 1.2 0 12
3303 enjoy 708 graig daniel 14 1.3 1 13
3303 enjoy 730 orly rotenberg 15 1.0 1 14
3303 enjoy 731 yohan bach 12 1.0 1 11
This seems to return what you requested (using Jeremy's code for the second part):
comm <- read.table(text = '
IATA NAME TICKET_NUM PAX FARE TAX COMM NET
3300 pay.more 700 john.cohen 10 1.1 2 8
3300 pay.more 701 james.levy 11 1.2 2 9
3300 pay.more 702 jonathan.arbel 12 1.2 3 9
3300 pay.more 703 gil.matan 9 1.0 2 7
3301 you.pay 704 ron.natan 19 2.0 6 9
3301 you.pay 705 don.horvitz 18 2.0 6 9
3302 pay.by.ticket 706 lutter.kaplan 9 1.2 0 9
3303 enjoy 707 lutter.omega 12 1.2 0 12
3303 enjoy 708 graig.daniel 14 1.3 1 13
3303 enjoy 730 orly.rotenberg 15 1.0 1 14
3303 enjoy 731 yohan.bach 12 1.0 1 11
', header=TRUE, stringsAsFactors = FALSE)
comm2 <- with(comm, aggregate(COMM ~ IATA + NAME, FUN = function(x) sum(x, na.rm = TRUE)))
comm2
max_comm <- comm2[comm2$COMM == max(comm2$COMM),]
max_comm
IATA NAME COMM
4 3301 you.pay 12
Here is an explanation of the first statement:
The with function identifies the data set to use (here comm). The function aggregate is a general function for performing operations on groups. You want to operate on COMM by IATA and NAME. You write that: COMM ~ IATA + NAME. Next you specify the desired function to perform on COMM (here sum). You do that with FUN = function(x) sum(x). In case there are any missing observations in COMM I added na.rm = TRUE within the sum(x) function.
Call that table comm
max_comm <- comm[comm$COMM == max(comm$COMM),]
or just sort it and look at the head
head(comm[order(-comm$COMM),])
Edit: If you want to sum by IATA first then use data.table
library(data.table)
comm2 <- data.table(comm)
sum_comm <- comm2[, list(COMM_SUM=sum(COMM)), by = c("IATA","NAME")]
data.table has an unusual syntax, you could also try dplyr which is supposed to be roughly as good as data.table now

Resources