I have this script, which calculates the profit and loss of trades. It works fine,
but I think it can be improved. It would be great to get rid of the for loops, at least to make the code more compact.
Can anyone please help me out?
The logic to calculate profit/loss is first to match the sell trades with potential buy trades. A single sell trade can be matched with multiple buys, so its cost may be spread across several buys.
Steps:
1. Separate the trades into buys and sells, each in increasing date order.
2. Calculate the average cost price.
3. Calculate profit/loss = (selling price - cost price) * matching volume.
Thanks
Here is the sample data set
> structure(list(AsxCode = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "QAN", class = "factor"), Order.Type = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Buy", "Sell"), class = "factor"), Trade.Date = structure(c(13L, 12L, 12L, 11L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("2014-03-28", "2014-05-22", "2014-11-07", "2014-11-18", "2014-12-04", "2015-03-02", "2015-03-24", "2015-03-27", "2015-05-11", "2015-05-15", "2015-08-21", "2016-04-15", "2016-04-18"), class = "factor"), Price = c(3.75, 4.05, 4.01, 3.55, 3.68, 3.38, 2.9, 2.98, 2.9, 2.05, 1.8, 1.65, 1.25, 1.07), Quantity = c(850L, 1350L, 150L, 1000L, 1500L, 1400L, 1091L, 2000L, 1750L, 600L, 366L, 375L, 500L, 500L), Consideration = c(3198.5, 5456.5, 590.5, 3561, 5531, 4743, 3152.9, 5949, 5086, 1241, 669.8, 629.75, 614, 546), match_status = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), match_vol = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), avg_price = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), profit_loss = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("AsxCode", "Order.Type", "Trade.Date", "Price", "Quantity", "Consideration", "match_status", "match_vol", "avg_price", "profit_loss"), row.names = c(NA, -14L), class = "data.frame")
AsxCode Order.Type Trade.Date Price Quantity Consideration match_status match_vol avg_price profit_loss
1 QAN Buy 2016-04-18 3.75 850 3198.50 NA 0 0 0
2 QAN Sell 2016-04-15 4.05 1350 5456.50 NA 0 0 0
3 QAN Sell 2016-04-15 4.01 150 590.50 NA 0 0 0
4 QAN Buy 2015-08-21 3.55 1000 3561.00 NA 0 0 0
5 QAN Buy 2015-05-15 3.68 1500 5531.00 NA 0 0 0
6 QAN Buy 2015-05-11 3.38 1400 4743.00 NA 0 0 0
7 QAN Sell 2015-03-27 2.90 1091 3152.90 NA 0 0 0
8 QAN Sell 2015-03-24 2.98 2000 5949.00 NA 0 0 0
9 QAN Buy 2015-03-02 2.90 1750 5086.00 NA 0 0 0
10 QAN Buy 2014-12-04 2.05 600 1241.00 NA 0 0 0
11 QAN Buy 2014-11-18 1.80 366 669.80 NA 0 0 0
12 QAN Buy 2014-11-07 1.65 375 629.75 NA 0 0 0
13 QAN Sell 2014-05-22 1.25 500 614.00 NA 0 0 0
14 QAN Buy 2014-03-28 1.07 500 546.00 NA 0 0 0
calculate.profit <- function(trades){
  trades$match_vol <- 0
  s <- trades[trades$Order.Type== 'Sell', ]
  sell.trades <- s[order(s$Trade.Date, decreasing=FALSE),]
  b <- trades[trades$Order.Type== 'Buy', ]
  buy.trades <- b[order(b$Trade.Date, decreasing=FALSE),]
  # Don't want to execute the for loop when there is no sell trades. In other words when there is no profit/loss unless you sell
  if(nrow(sell.trades)==0){
    return (buy.trades)
  }
  # for each sell find the associated buys
  for(i in 1:nrow(sell.trades))
  {
    # calculate average price. The Consideration column contains total cost
    s.price <- sell.trades[i, 'Consideration']/sell.trades[i,'Quantity']
    for(j in 1:nrow(buy.trades))
    {
      # this part matches sell with a buy trade
      # if sell volume and buy volume are same, the sell is fully matched otherwise it has to find the remaining sell units.
      s.vol <- sell.trades[i,'Quantity'] - sell.trades[i,'match_vol']
      b.vol <- buy.trades[j, 'Quantity'] - buy.trades[j, 'match_vol']
      if (b.vol != 0)
      {
        b.price <- buy.trades[j, 'Consideration']/buy.trades[j, 'Quantity']
        # contains the volume which is matched between buy and sell
        # trades
        match.vol <- min(s.vol, b.vol)
        profit <- match.vol * (s.price - b.price)
        buy.trades[j, 'match_vol'] <- match.vol + buy.trades[j, 'match_vol']
        sell.trades[i, 'profit_loss'] <- profit + sell.trades[i, 'profit_loss']
        sell.trades[i, 'match_vol'] <- match.vol + sell.trades[i, 'match_vol']
      }
      # sell parcel fully processed
      if (sell.trades[i ,'match_vol'] == sell.trades[i ,'Quantity'])
      {
        j=1
        break;
      }
    }
  }
  return (rbind(buy.trades, sell.trades))
}
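For comparison, the same in-order matching can be sketched without explicit loops by treating each trade as an interval on a running total of units and taking interval overlaps. This is only a sketch under assumptions: the name fifo_profit and its arguments are hypothetical, prices are per unit (Consideration / Quantity), and buys and sells are assumed to be already sorted by date. Like the loop above, it does not check that a buy actually precedes the sell it is matched with.

```r
# Hypothetical helper (not part of the original script): matches sells
# against buys in order, using cumulative volumes instead of nested loops.
fifo_profit <- function(buy_qty, buy_price, sell_qty, sell_price) {
  # each trade occupies the unit interval (start, end] on a running total
  buy_end    <- cumsum(buy_qty)
  buy_start  <- buy_end - buy_qty
  sell_end   <- cumsum(sell_qty)
  sell_start <- sell_end - sell_qty

  # overlap[i, j] = units of sell i matched against buy j
  overlap <- outer(sell_end, buy_end, pmin) - outer(sell_start, buy_start, pmax)
  overlap[overlap < 0] <- 0

  # price_diff[i, j] = sell price i minus buy price j
  price_diff <- outer(sell_price, buy_price, "-")

  data.frame(match_vol   = rowSums(overlap),
             profit_loss = rowSums(overlap * price_diff))
}

# Made-up numbers: two buys (500 @ 1, 375 @ 2), two sells (600 @ 3, 400 @ 4)
res <- fifo_profit(buy_qty = c(500, 375), buy_price = c(1, 2),
                   sell_qty = c(600, 400), sell_price = c(3, 4))
print(res)
```

The second sell is only partly matched here because the buys run out, which mirrors what the nested loops would do.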
There are a number of improvements that could be made.
Preallocating object sizes
The most obvious thing would be to preallocate object sizes. The received wisdom is that it is inefficient to grow objects inside loops. Hence you would do:
# On example of a single column
sell.trades.vec <- vector(mode = "numeric", length = nrow(buy.trades))
in order to avoid objects being expanded within loops.
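A minimal illustration of the difference, using made-up names (grown, prealloc) and a toy computation rather than the trade data: both loops produce the same values, but only the first one has to extend the vector on every iteration.

```r
n <- 1000

# Growing: the vector starts empty and is extended on every iteration
grown <- numeric(0)
for (i in seq_len(n)) {
  grown[i] <- i^2
}

# Preallocating: the vector has its final length before the loop starts
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

same_result <- identical(grown, prealloc)
print(same_result)
```

For a handful of trades the difference is unlikely to matter; it is mainly a habit that pays off with long loops.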
seq_along()
Broadly speaking, it is neater to use seq_along instead of 1:something. Have a look:
> a <- NULL
> 1:length(a)
[1] 1 0
> seq_along(a)
integer(0)
Likewise, note that seq_along works on the length of its argument, not its value:
> 1:0
[1] 1 0
> seq_along(0)
[1] 1
I'm guessing that you will (most probably) always have a sensible nrow value, but seq_along may be worth considering in case there is a risk of getting odd data.
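For loops over rows, the counterpart of seq_along is seq_len(nrow(...)). A small sketch with a hypothetical empty data frame shows why it matters: 1:nrow(x) still iterates twice when there are no rows.

```r
# Hypothetical empty input, e.g. a day with no sell trades
empty_trades <- data.frame(Quantity = integer(0))

# 1:nrow() yields c(1, 0), so the body runs twice on an empty data frame
runs_colon <- 0
for (i in 1:nrow(empty_trades)) {
  runs_colon <- runs_colon + 1
}

# seq_len() yields integer(0), so the body is skipped entirely
runs_seq_len <- 0
for (i in seq_len(nrow(empty_trades))) {
  runs_seq_len <- runs_seq_len + 1
}

print(c(runs_colon = runs_colon, runs_seq_len = runs_seq_len))
```

With seq_len in place, an explicit early return for the "no sell trades" case is only needed if a different return value is wanted.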
Related
I have several datasets.
The first one
lid=structure(list(x1 = 619490L, x2 = 10L, x3 = 0L, x4 = 6089230L,
x5 = 0L, x6 = -10L), class = "data.frame", row.names = c(NA,
-1L))
second dataset
lidar=structure(list(A = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), B = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
C = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
How can I compute, for each row of the lidar dataset, a value relative to the lid dataset using the formula
(lidar$A-lid$x1)/lid$x3
and then
(lidar$B-lid$x4)/lid$x6
So for the first row the result would be
(lidar$A-lid$x1)/lid$x3 = 1874,876, but everything after the comma is discarded, giving 1874
(lidar$B-lid$x4)/lid$x6 = 1122
Also, for column lidar$C in the lidar dataset,
subtract the smallest value from the largest value. In this case lidar$c11-lidar$c1=5,5
so the desired output will be
A B C Intensity ReturnNumber NumberOfReturns row col subs(lidar$Cmax-lidar$Cmin)
638238.76 6078001.09 186.64 13 2 2 1874 1122 5,5
638238.76 6078001.09 186.59 99 1 1 1874 1122 5,5
638239.29 6078001.15 199.28 5 1 3 1874 1122 5,5
638235.39 6078001.15 189.37 2 2 2 1874 1122 5,5
The result of the subtraction (lidar$Cmax-lidar$Cmin) is the same for all rows.
row and col are the results of this arithmetic:
(lidar$A-lid$x1)/lid$x3 (row)
then
(lidar$B-lid$x4)/lid$x6 (col)
With the part after the comma, these values (row and col) differ between rows, but once that part is removed they look the same.
How can I get the desired output from these arithmetic operations?
Any help is appreciated. Thank you.
If I understand your purpose correctly, the main question is how to remove the part after comma, which is a decimal separator in your examples.
If that's true, one way of doing that is to split the number into two parts, one which comes before the comma and another one which comes after it, and then extract only the first part. In R you can do this by strsplit(). However, this function requires the input to be characters, not numerics. So, you need to coerce the numbers into characters, do the splitting, coerce the result back to numbers, and then extract its first element.
Here is an example of a function to implement the steps:
remove_after_comma <- function(num_with_comma){
myfun <- function(num_with_comma) {
num_with_comma|>
as.character() |>
strsplit("[,|.]") |>
unlist() |>
as.numeric() |>
getElement(1)
}
vapply(num_with_comma, myfun, FUN.VALUE = numeric(1))
}
Notes:
[,|.] is used to anticipate other systems that use . instead of , as the decimal separator.
vapply is used so that the function can be applied to a numeric vector, such as a numeric column.
Check:
remove_after_comma(c(a = '1,5', b = '12,74'))
# a b
# 1 12
(4:10)/3
#[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
remove_after_comma ((4:10)/3)
#[1] 1 1 2 2 2 3 3
Assuming that lid$x3 = 10L in your example:
(lidar$A-lid$x1)/lid$x3
#[1] 1874.876 1874.876 1874.929 1874.539 1874.386 1874.386 1874.555 1874.197
#[9] 1874.191 1873.841
remove_after_comma((lidar$A-lid$x1)/lid$x3)
#[1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1873
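For numeric input, base R's trunc() does the same job without the detour through strings. The sketch below uses made-up values and assumes that "discarding the part after the comma" means truncation toward zero; floor() and round() are shown for contrast, because they differ for negative or large fractional parts.

```r
# Illustrative values only (two taken from the example arithmetic, one negative)
x <- c(1874.876, 1122.891, -2.7)

truncated <- trunc(x)   # drops the fractional part (toward zero)
floored   <- floor(x)   # always rounds down
rounded   <- round(x)   # rounds to the nearest integer

print(rbind(x, truncated, floored, rounded))
```

Here trunc() gives 1874, 1122 and -2, while round() would give 1875, 1123 and -3.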
I'm not sure if this is what you mean, but trunc() drops the decimal part:
lidar$row <- trunc((lidar$A-lid$x1)/lid$x3)
lidar$col <- trunc((lidar$B-lid$x4)/lid$x6)
lidar$cdif <- max(lidar$C)-min(lidar$C)
I have the following social network dataset where participants (ego) were asked who provided social, work, and care support in their lives. Those who provided support (alter) were classified according to their relationship with ego (circle) resulting in the following dataset:
ego alter circle social work care
3400 3403 1 0 0 1
3400 3402 1 0 1 0
3400 3401 1 1 0 0
3500 3504 1 0 0 0
3500 3503 1 0 0 0
3500 3502 1 0 1 1
3500 3501 2 1 0 0
3600 3604 1 0 0 0
3600 3603 3 0 0 1
3600 3602 3 0 1 0
3600 3601 2 1 0 0
3700 3702 1 0 1 1
3700 3703 1 0 0 1
3700 3701 2 1 0 0
…
So, for example, in row 1, alter 3403 of social circle 1, did not provide social or work support but provided care support for ego 3400.
My question for you all is: how can I cross tabulate the variable circle with each of the support variables (social, work, and care) and then calculate the averages with ego?
Below is the resulting cross tabulation with totals and percentages, but I need the averages taking into account each ego.
Crosstab result
First, reproducible data using dput():
social <- structure(list(ego = c(3400L, 3400L, 3400L, 3500L, 3500L, 3500L,
3500L, 3600L, 3600L, 3600L, 3600L, 3700L, 3700L, 3700L), alter = c(3403L,
3402L, 3401L, 3504L, 3503L, 3502L, 3501L, 3604L, 3603L, 3602L,
3601L, 3702L, 3703L, 3701L), circle = c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 3L, 3L, 2L, 1L, 1L, 2L), social = c(0L, 0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L), work = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), care = c(1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-14L))
Now, counts,
(tbl.count <- aggregate(cbind(social, work, care)~circle, social, sum))
# circle social work care
# 1 1 1 3 4
# 2 2 3 0 0
# 3 3 0 1 1
and means,
(tbl.mean <- aggregate(cbind(social, work, care)~circle, social, mean))
# circle social work care
# 1 1 0.1111111 0.3333333 0.4444444
# 2 2 1.0000000 0.0000000 0.0000000
# 3 3 0.0000000 0.5000000 0.5000000
and percentages,
(tbl.pct <- aggregate(cbind(social, work, care)~circle, social, function(x) mean(x)*100))
# circle social work care
# 1 1 11.11111 33.33333 44.44444
# 2 2 100.00000 0.00000 0.00000
# 3 3 0.00000 50.00000 50.00000
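If "averages taking into account each ego" means that each ego should count equally regardless of how many alters it named, one possible reading is a two-step aggregation: first a proportion per ego and circle, then the mean of those proportions across egos. The sketch below assumes that interpretation and uses a small hypothetical data frame net (the first two egos only); egos with no alters in a circle simply do not contribute to that circle's average.

```r
# Hypothetical subset: egos 3400 and 3500 from the example data
net <- data.frame(
  ego    = c(3400, 3400, 3400, 3500, 3500, 3500, 3500),
  circle = c(1, 1, 1, 1, 1, 1, 2),
  social = c(0, 0, 1, 0, 0, 0, 1),
  work   = c(0, 1, 0, 0, 0, 1, 0),
  care   = c(1, 0, 0, 0, 0, 1, 0)
)

# Step 1: proportion of each ego's alters in a circle giving each support type
per_ego <- aggregate(cbind(social, work, care) ~ ego + circle,
                     data = net, FUN = mean)

# Step 2: average those per-ego proportions within each circle
avg_over_egos <- aggregate(cbind(social, work, care) ~ circle,
                           data = per_ego, FUN = mean)
print(avg_over_egos)
```

This differs from the pooled means above whenever egos have different numbers of alters in a circle.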
Sorry for asking what might be a very basic question, but I am stuck in a conundrum and cannot seem to get out of it.
I have data that looks like
Medicine Biology Business sex weights
0 1 0 1 0.5
0 0 1 0 1
1 0 0 1 05
0 1 0 0 0.33
0 0 1 0 0.33
1 0 0 1 1
0 1 0 0 0.33
0 0 1 1 1
1 0 0 1 1
where the first three are fields of study, and the fourth variable is gender. Obviously, there are many more observations.
What I want is the mean of each field of study (medicine, biology, business) by the variable sex (so the mean for men and the mean for women). To do so, I have used the following code:
barplot_sex<-aggregate(x=df_dummies[,1:19] , by=list(df$sex),
FUN= function(x) mean(x))
This works perfectly and gives me what I needed. My problem is that I now need to use a weighted mean, but I cannot use
FUN= function(x) weighted.mean(x, weights)
as there are many more observations than fields of study.
The only alternative I managed to do was to edit(boxplot) and change the values manually, but then R doesn't save the changes. Plus, I am sure there must be a trivial way to do exactly what I need.
Any help would be greatly appreciated.
Bests,
Gabriele
Using by.
by(dat, dat$sex, function(x) sapply(x[, 1:3], weighted.mean, x[, "weights"]))
# dat$sex: 0
# Medicine Biology Business
# 0.0000000 0.3316583 0.6683417
# ---------------------------------------------------------------------------------------
# dat$sex: 1
# Medicine Biology Business
# 0.82352941 0.05882353 0.11764706
Data:
dat <- structure(list(Medicine = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L
), Biology = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Business = c(0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), sex = c(1L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), weights = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
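If a plain matrix is more convenient than the by object (for example to pass to barplot), the same computation can be sketched with split and sapply. The name wmeans is made up for this example; the data is the dat object from above, repeated so the block runs on its own.

```r
dat <- data.frame(
  Medicine = c(0, 0, 1, 0, 0, 1, 0, 0, 1),
  Biology  = c(1, 0, 0, 1, 0, 0, 1, 0, 0),
  Business = c(0, 1, 0, 0, 1, 0, 0, 1, 0),
  sex      = c(1, 0, 1, 0, 0, 1, 0, 1, 1),
  weights  = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33, 1, 1)
)

# Split rows by sex, then take the weighted mean of each field of study
# using only that group's weights
wmeans <- t(sapply(split(dat, dat$sex), function(d) {
  sapply(d[1:3], weighted.mean, w = d$weights)
}))
print(wmeans)
```

The key point in both versions is that the weights are subset together with the rows, so their length always matches the values being averaged.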
I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = (df[j,i+4]) # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = "\\bim\\b")
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
I was able to get my code working by putting the expression in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
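A quick check of that pattern on a few sample strings (the texts vector below is made up, loosely based on the example data) shows that the apostrophe needs no special handling inside double quotes and that ignore.case takes care of capitalisation:

```r
# Made-up sample strings for illustration
texts <- c("I'm alright", "I am a good person.", "No, i'm not in a mood")

# Double quotes let the apostrophe appear unescaped in the pattern
pattern <- "\\bI'm\\b"

has_im <- grepl(pattern, texts, ignore.case = TRUE)
print(has_im)

# 0/1 indicator column, as in the loop version
im_flag <- as.integer(has_im)
```

Because grepl is vectorised over the text, this also removes the need for the inner loop over rows.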
Please find here a very small subset of a long data.table I am working with
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater than comparison in 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (is greater than?) each value in the first row to each cell in the second and third row, the total sum would be 10 points. In every single comparison, the value in the first row is bigger and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv -1).
The result I am looking for for row 1 should then be 10 / 10 = 1.
For pnum == 4298558 the columns net_kn and age_kn both have value 1 in the two rows (for invid 103 and 104), so each should get 0.5 points (if there were three inventors with value 1, each would get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.
Thus note the difference: There are three different dyadic comparisons
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
Any suggestions on how to compute this efficiently?
Thanks!
vars = grep('_kn', names(dt), value = T)
# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
dt[, res := res + get(var) / .N, by = c('pnum', var)]
# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the NaN case above (arguably the correct answer) is left to the reader.
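For readers without data.table, the same weighting idea can be sketched in base R with ave(), this time returning 0 instead of NaN for groups where every value is zero. The function name score_rows and the toy data frame are hypothetical; the sketch assumes the _kn columns only contain 0 and 1, as in the example.

```r
# Hypothetical base-R version of the weighting idea above
score_rows <- function(d, vars, group = "pnum") {
  res <- numeric(nrow(d))
  for (v in vars) {
    # number of rows in the same group sharing this row's value
    n_tied <- ave(d[[v]], d[[group]], d[[v]], FUN = length)
    # a 1 shared by k rows contributes 1/k; a 0 contributes nothing
    res <- res + d[[v]] / n_tied
  }
  # normalise within each group, mapping all-zero groups to 0
  total <- ave(res, d[[group]], FUN = sum)
  ifelse(total > 0, res / total, 0)
}

# Toy data: three groups, two indicator columns
toy <- data.frame(
  pnum = c(1, 1, 1, 2, 2, 3, 3),
  a_kn = c(1, 0, 0, 0, 1, 0, 0),
  b_kn = c(1, 0, 0, 1, 1, 0, 0)
)
toy$res <- score_rows(toy, vars = c("a_kn", "b_kn"))
print(toy)
```

In the toy data the second group splits the tied b_kn point, giving 0.25 and 0.75, and the all-zero third group gets 0 rather than NaN.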
Here's a fastish solution using dplyr:
library(dplyr)
dt %>%
group_by(pnum) %>% # group by pnum
mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
#give a 1 if the value is the max, and not 0. Only for the column with kn
mutate_each(funs(. / sum(.)) , ends_with('kn')) %>%
#correct for multiple maximums
select(ends_with('kn')) %>%
#remove all non kn columns
do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
#make a new data frame with x = row sums for each individual
# and y = the sum of the column sums
mutate(out = x/y)
#divide by y (we could just use /5 if we always have five columns)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back using e.g.:
x[is.na(x)] <- 0