R - Perform two calculations in different parts of a loop

OK, so I have written a loop which, for its first part, is intended to perform one multiplication, multiplying two columns. Then, for the remainder of the loop, it is to perform another multiplication using two different columns than the first.
The columns for multiplication are ocret and clret, which are each multiplied against response.
My code for this:
train.set$output[[1]] = if (train.set$response[[1]] == 1) {
  apply(train.set[, c('ocret', 'response')], 1, function(x) { x[1] * x[2] })
}
for (i in 2:nrow(train.set)) {
  train.set$output[i] = if (train.set$response[i] == 1) {
    apply(train.set[, c('clret', 'response')], 1, function(x) { x[1] * x[2] })
    train.set$output[i - 1]
  }
}
The idea was that on first finding response == 1, it would perform the ocret * response calculation.
The second part of the loop was to start on row 2, so as not to overwrite the first part, and continue looping down row by row performing the clret * response calculation.
The logic makes sense to me, and this is pretty much my first attempt at a loop. When I run the code, nothing happens: it doesn't create the output column. Can anyone give me any pointers? I keep re-reading it and it still makes sense to me, so I'm not sure what I'm missing. Any explanation is greatly appreciated.
Example data frame and output below:
ocret clret response output
1 0.00730616 0.003382433 0 0
2 -0.084899894 -0.088067766 0 0
3 0.047208568 0.054174679 1 0.047208568
4 -0.002671414 -0.004543992 0 0
5 -0.039943462 -0.040290793 0 0
6 -0.01428499 -0.013506524 0 0
7 -0.037054965 -0.038517845 0 0
8 -0.058027611 -0.057394837 1 -0.058027611
9 -0.004014491 -0.011332705 1 -0.011332705
10 -0.079419682 -0.076167096 1 -0.076167096
11 -0.003424577 -0.011759287 1 -0.011759287
12 0.099260455 0.115800375 1 0.115800375
13 -0.011841897 -0.005322141 1 -0.005322141
14 -0.087230999 -0.090349775 1 -0.090349775
15 0.040570359 0.042507445 1 0.042507445
16 -0.001846555 -0.006212821 1 -0.006212821
17 0.044398056 0.047684898 1 0.047684898
18 -0.025856823 -0.030799705 0 0
19 -0.057677505 -0.061012471 0 0
20 0.010043567 0.012634046 0 0
21 -0.020609404 -0.034511205 0 0
Row 3: ocret * response
Row 8: ocret * response
Rows 9 to 17: clret * response

A for loop may not be required. We can use dplyr and data.table to get the desired output (dt2).
library(dplyr)
library(data.table)
dt2 <- dt %>%
  mutate(RunID = rleid(response)) %>%
  group_by(RunID) %>%
  mutate(output = ifelse(response == 0, 0,
                         ifelse(row_number() == 1, ocret, clret))) %>%
  ungroup() %>%
  select(-RunID)
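Here rleid() (from data.table) assigns a new id each time response changes value, so every consecutive run of 1s becomes its own group, and row_number() == 1 picks out the first row of that run. A small illustration:
rleid(c(0, 0, 1, 1, 1, 0, 1))
# [1] 1 1 2 2 2 3 4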
Data Preparation
dt <- read.table(text = " ocret clret response
1 0.00730616 0.003382433 0
2 -0.084899894 -0.088067766 0
3 0.047208568 0.054174679 1
4 -0.002671414 -0.004543992 0
5 -0.039943462 -0.040290793 0
6 -0.01428499 -0.013506524 0
7 -0.037054965 -0.038517845 0
8 -0.058027611 -0.057394837 1
9 -0.004014491 -0.011332705 1
10 -0.079419682 -0.076167096 1
11 -0.003424577 -0.011759287 1
12 0.099260455 0.115800375 1
13 -0.011841897 -0.005322141 1
14 -0.087230999 -0.090349775 1
15 0.040570359 0.042507445 1
16 -0.001846555 -0.006212821 1
17 0.044398056 0.047684898 1
18 -0.025856823 -0.030799705 0
19 -0.057677505 -0.061012471 0
20 0.010043567 0.012634046 0
21 -0.020609404 -0.034511205 0",
header = TRUE, stringsAsFactors = FALSE)
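If you would rather keep a loop: part of the problem with the original code is that an if without an else evaluates to NULL when the condition is FALSE, and the apply() call recomputes the product for every row of the frame instead of only row i. A minimal corrected base R sketch, assuming the dt prepared above:
dt$output <- 0
for (i in seq_len(nrow(dt))) {
  if (dt$response[i] == 1) {
    # the first row of a run of 1s uses ocret; later rows of the run use clret
    first_of_run <- (i == 1) || (dt$response[i - 1] == 0)
    dt$output[i] <- if (first_of_run) dt$ocret[i] else dt$clret[i]
  }
}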

Related

How to correctly merge two files and count values before Fisher's test in R?

I am very new to R, so I apologise if this looks simple to someone.
I am trying to join two files and then perform a one-sided Fisher's exact test to determine whether there is a greater burden of qualifying variants in casefile or controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
  # Dominant model
  case_count <- dat[i,]$CASE_COUNT_HET + dat[i,]$CASE_COUNT_HOM
  control_count <- dat[i,]$CONTROL_COUNT_HET + dat[i,]$CONTROL_COUNT_HOM
  if(case_count > args$casesize){
    case_count <- args$casesize
  } else if(case_count < 0){
    case_count <- 0
  }
  if(control_count > args$controlsize){
    control_count <- args$controlsize
  } else if(control_count < 0){
    control_count <- 0
  }
  mat <- cbind(c(case_count, (args$casesize - case_count)),
               c(control_count, (args$controlsize - control_count)))
  dat[i,]$P_DOM <- fisher.test(mat, alternative="greater")$p.value
and the problem starts here:
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
case_count and control_count come out as NULL, even though the corresponding columns in both input files are NOT empty.
When I ran the script above with absolute numbers (1000 and 2000) assigned to case_count and control_count, it worked without issues.
The main purpose of the code:
https://github.com/mhguo1/TRAPD
Run burden testing. This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.
It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as:
Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A
--casesize: Number of cases that were tested in Step 2A
--controlfile: Path to the counts file for the controls, as generated in Step 2B
--controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size
--output: Output file path/name
Output: A tab delimited file with 10 columns:
#GENE: Gene name
CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene
CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene
CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene
CASE_TOTAL_AC: Total AC for a given gene
CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene
CONTROL_COUNT_HOM: Number of controls carrying homozygous qualifying variants in a given gene
CONTROL_TOTAL_AC: Total AC for a given gene
P_DOM: p-value under the dominant model
P_REC: p-value under the recessive model
I am trying to run a genetic variant burden test with VCF files and external gnomAD controls. I found this repo suitable and am now trying to fix bugs in it.
As a newbie in R statistics, I will be happy about any suggestion. Thank you!
If you want all rows from both files, you can use a full join with by = "GENE" and whatever suffixes you wish:
library(dplyr)
z <- full_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
6 ENSG00000174776 NA NA NA NA
7 ENSG00000076864 NA NA NA NA
8 ENSG00000086015 NA NA NA NA
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
6 1 1 0 2
7 1 0 1 13
8 1 0 1 25
If you want only the GENEs that appear in both files, use inner_join:
z <- inner_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
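For what it's worth, the NULL values in the original script most likely come from the column names: both input files use CASE_COUNT_* headers, so the merged frame has no CONTROL_COUNT_HET column, and dat$CONTROL_COUNT_HET returns NULL. With the suffixed join above in place, the dominant-model test can then reference the columns explicitly. A minimal sketch (the sample sizes are hypothetical placeholders; use args$casesize and args$controlsize in the real script):
casesize <- 100
controlsize <- 60000
z[is.na(z)] <- 0
z$P_DOM <- NA_real_
for (i in 1:nrow(z)) {
  case_count <- z$CASE_COUNT_HET.CASE[i] + z$CASE_COUNT_HOM.CASE[i]
  control_count <- z$CASE_COUNT_HET.CONTROL[i] + z$CASE_COUNT_HOM.CONTROL[i]
  mat <- cbind(c(case_count, casesize - case_count),
               c(control_count, controlsize - control_count))
  z$P_DOM[i] <- fisher.test(mat, alternative = "greater")$p.value
}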

How to count the number of positive and negative wind direction values in an austral summer that spans two years (October-March)

I'm working with some wind direction data for a potential paper. I am trying to compare the number of days the wind is blowing easterly (negative U) and the number of days it is blowing westerly (positive U). I need to calculate this over an austral summer, so the period between October and March e.g.: October 1993 to March 1994.
Here is a sample of my data frame:
Year Month Day Hour Minutes Seconds Ws U V
1 1993 1 1 0 0 0 3.750620 2.822403 1.281318
2 1993 1 1 6 0 0 4.207054 3.600465 1.719147
3 1993 1 1 12 0 0 5.050543 3.155271 3.243411
4 1993 1 1 18 0 0 3.165194 -0.477054 2.926124
5 1993 1 2 0 0 0 1.529690 -0.721395 -0.503101
6 1993 1 2 6 0 0 1.950233 0.303333 -1.728295
7 1993 1 2 12 0 0 4.548992 -2.868217 3.307519
8 1993 1 2 18 0 0 6.563643 -6.245194 1.744419
9 1993 1 3 0 0 0 5.868992 -5.805969 -0.594031
10 1993 1 3 6 0 0 6.530620 -6.446667 -0.689535
11 1993 1 3 12 0 0 7.085736 -6.657984 1.834884
12 1993 1 3 18 0 0 7.685349 -7.111008 2.571783
13 1993 1 4 0 0 0 6.508760 -6.414574 -0.678837
14 1993 1 4 6 0 0 6.141860 -6.006822 -0.272558
15 1993 1 4 12 0 0 7.388295 -6.744574 1.862868
16 1993 1 4 18 0 0 7.281163 -7.054264 0.896512
17 1993 1 5 0 0 0 4.847287 -4.431628 -0.813643
18 1993 1 5 6 0 0 3.482558 -1.670078 2.048915
19 1993 1 5 12 0 0 5.698992 1.097287 5.433721
20 1993 1 5 18 0 0 4.894031 1.445736 4.440465
21 1993 1 6 0 0 0 1.983411 0.783023 1.556047
22 1993 1 6 6 0 0 2.315891 -1.225891 1.756744
23 1993 1 6 12 0 0 4.525581 -4.016124 1.723721
24 1993 1 6 18 0 0 5.123566 -4.618682 0.759225
25 1993 1 7 0 0 0 3.449147 -2.639457 -1.627442
26 1993 1 7 6 0 0 2.067364 1.185891 -0.760233
27 1993 1 7 12 0 0 5.675814 3.872171 3.419690
28 1993 1 7 18 0 0 6.278450 3.989767 4.684031
29 1993 1 8 0 0 0 6.562636 5.496667 3.329302
30 1993 1 8 6 0 0 7.762636 5.280310 5.516589
31 1993 1 8 12 0 0 9.283953 5.575659 7.294264
So far I have managed to do this calculation for one month only (see code below), but I'm unsure how to do it from October of one year to March of the next. When I tried filter(wind,Year==1993:1994,Month==10:3,U>0) I got the warning message:
In Month == 10:3 :
longer object length is not a multiple of shorter object length
This is what I have done so far to calculate the number of positive and negative directions for October 1993, which has worked. I am new to R and Stack Overflow, so I hope I have set this out correctly!
filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos<-filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos
filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg<-filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg
sum(Oct_1993_pos$U>0)
sum(Oct_1993_neg$U<0)
Your first error (Month == 10:3) occurs because you are comparing a vector (Month) with another vector. When you do this, you get an element-wise comparison, i.e. Month[1] == 10, Month[2] == 9, etc. When the vectors are of unequal length, R recycles the shorter one, but warns when the longer length is not an exact multiple of the shorter:
c(1,2,3,1,2,3) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
c(1,2,3,1,2) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE
Warning message:
In c(1, 2, 3, 1, 2) == c(1, 2) :
longer object length is not a multiple of shorter object length
For counting positive and negative U's, you can exploit the fact that summing logicals counts the number of TRUEs:
sum(c(FALSE, TRUE, TRUE, FALSE))
[1] 2
And you can obtain such logicals with a simple comparison:
sum(U > 0)
For your calculations I would recommend using dplyr. With this you can repeat your counting across any collection of subsets. Try:
# if following fails, run install.packages("dplyr")
library(dplyr)
monthly <- wind %>% group_by(Year, Month) %>%
summarise(
pos=sum(U > 0),
neg=sum(U < 0),
nowind=sum(U == 0),
entries=n()
)
Edit in response to comment:
Depending on if you need intermediate results or not, we could do a couple of things. Regarding the period October to March, you have to be careful if your data spans several years.
monthly %>% filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>% ungroup %>%
  summarise_at(vars(pos, neg, nowind, entries), sum)
or, just filter before you summarise:
wind %>% filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>%
summarise(
pos=sum(U > 0),
neg=sum(U < 0),
nowind=sum(U == 0),
entries=n()
)
Beware here that I am using single boolean operators (|, &) and not double (||, &&), as we want to keep the element-wise comparisons (the double variant collapses everything to a single element).
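A quick illustration of the difference:
c(TRUE, FALSE) | c(TRUE, TRUE)   # element-wise: TRUE TRUE
# `||` combines only single values; on vectors it historically used just the
# first elements, and recent R versions raise an error instead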
If you want to see winter vs. summer periods, across multiple years, we have to figure how to group the seasons correctly. For this, I will build a data set of years and months:
library(tidyr)
seasons <- crossing(month = 1:12, year = 1992:1994) %>% arrange(year, month) %>%
  mutate(
    season_start = month %in% c(4, 10),  # seasons turn over in April and October, keeping Oct-Mar together
    season = cumsum(season_start)
  )
With this approach, we've split the problem in two: 1) Define the seasons you wish to summarise over, and 2) summarise it.
inner_join(wind, seasons, by=c('Year'='year','Month'='month')) %>%
group_by(season) %>%
summarise(
seasonstart = paste0(min(Year), '-', min(Month)),
pos=sum(U > 0),
neg=sum(U < 0),
nowind=sum(U == 0),
entries=n()
)
So, to summarise over the period October-March: same as before, just define a different grouping.
As an exercise, try adding Year and/or Month to the group_by call in the last example.
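If only the October-March window matters, an alternative sketch (assuming the same wind data frame as above) derives a "season year" directly and skips the join:
wind %>%
  filter(Month >= 10 | Month <= 3) %>%
  mutate(season = ifelse(Month >= 10, Year, Year - 1)) %>%  # Oct-Dec count toward the summer starting that year
  group_by(season) %>%
  summarise(pos = sum(U > 0), neg = sum(U < 0), entries = n())
Here season 1993 covers October 1993 through March 1994.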

confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

I am trying to train a model using bstTree method and print out the confusion matrix. adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0
I had a similar problem that led to this error. I used the function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked a couple of things:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty of values, since it is a probability
unique(predicted) -> 2 levels: 0 and 1
I concluded that the problem was with the cutoff part of the function, so I applied the cutoff beforehand:
predicted<-ifelse(predicted> 0.5,1,0)
and then ran the confusionMatrix function, which now works just fine:
cm <- confusionMatrix(actual, predicted)
cm$table
This generated the correct outcome.
One takeaway for your case, which might improve interpretation once you get the code working:
you mixed up the input values for your confusion matrix (as per the confusionMatrix package documentation). Instead of:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
you should have written:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
As said, this will most likely help you interpret the confusion matrix once you figure out a way to make it work.
Hope it helps.
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
and the error tells it all. Plotting a ROC curve simply checks the effect of different threshold points. Rounding happens based on the threshold, e.g. with a threshold of 0.5, a prediction of 0.7 is converted to 1 (the TRUE class) and 0.3 goes to 0 (the FALSE class). Threshold values lie in the range (0, 1).
In your case, regardless of the threshold, you will always get all observations in the TRUE class, as even the minimum prediction is greater than 1. (That's why #phiver was wondering whether you are doing regression instead of classification.) Without any zero in the predictions there is no level in 'prediction' that coincides with the zero level of adverse_effects, hence this error.
PS: It will be difficult to tell the root cause of the error without you posting your data.
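As a hedged sketch of the usual fix in caret (an assumption, since we cannot see the data): make the outcome a factor before training, so bstTree is fit as a classifier, and predict through the train object itself, which then returns class labels suitable for confusionMatrix:
trainSplit$adverse_effects <- factor(trainSplit$adverse_effects)
testSplit$adverse_effects <- factor(testSplit$adverse_effects)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit,
                       method = "bstTree", trControl = ctrl)
pred_bstTree <- predict(model_bstTree, newdata = testSplit)  # factor class predictions
confusionMatrix(pred_bstTree, testSplit$adverse_effects)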

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I've got a vector of 200,000 lines containing street addresses (strings): data.
Example:
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ...]): list_ngrams
Example of list_ngrams:
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numeric matrix initialized with 0: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains the 5-gram.
Example:
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...] whose id is 5. So I put 1 in the column called 5. The corresponding line of the idv_x_bags matrix will then look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(rep(0, nrow(data) * 31), nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    if (grepl(ngram, data[i,]) == TRUE) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library but encountered many problems. I also tried to recode it using apply, but couldn't get it to work.
Here is what I did:
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    if (grepl(y[2], x[1]) == TRUE) { idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1 }
  })
})
I need some help recoding my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one out and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1, id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
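Another way to cut the runtime drastically, with or without dplyr, is to invert the loops: grepl() is vectorized over its x argument, so looping over the 131 n-grams instead of the 200,000 addresses needs only 131 pattern scans in total. A base R sketch, assuming data$address and list_ngrams as described in the question:
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31,
                     dimnames = list(NULL, sort(unique(list_ngrams$idSac))))
for (j in 1:nrow(list_ngrams)) {
  # one vectorized scan of all addresses for this n-gram
  hits <- grepl(list_ngrams$ngram[j], data$address, fixed = TRUE)
  idv_x_bags[hits, as.character(list_ngrams$idSac[j])] <- 1
}
fixed = TRUE treats the n-gram as a literal string rather than a regular expression, which is both safer and faster here.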

How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns of each row in a data frame with a given vector, W. I am trying the following:
W = c(1,2,3);
ddply(df, .(id), transform, inner_product=c(col1, col2, col3) %*% W);
This works, but I may typically have an arbitrary number of columns. Can I generalize the above expression to handle that case?
Update:
This is an updated example as asked for in the comments:
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id
1 0 0 3.756 61 278 spam 1
> df_out[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id inner_product
1 0 0 3.756 61 278 spam 1 3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with the array W = (1,2,3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead, I want to perform the inner product over all the dimensions without having to list each one. Also, the conversion to a matrix and back to a data frame seems to be an expensive operation?
A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
set.seed(1)
df <- data.frame(
id=sample(1:5, 20, replace=TRUE),
col1 = runif(20),
col2 = runif(20),
col3 = runif(20),
col4 = runif(20)
)
W <- c(1,2,3,4)
ddply(df, .(id), function(x)as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517
If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) serves any useful purpose, since you are not doing a sum of cross-products within id; and if you were, then you wouldn't be using transform but some other aggregating function.
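And if the value columns follow a naming pattern (col1, col2, ...), the column selection itself can be made dynamic; a sketch assuming that naming and a W of matching length:
value_cols <- grep("^col", names(df), value = TRUE)  # pick the value columns by name
df$inner_product <- as.vector(as.matrix(df[, value_cols]) %*% W)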
