How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns for each row in a data frame with a given array, W. I am trying the following:
W = c(1,2,3);
ddply(df, .(id), transform, inner_product=c(col1, col2, col3) %*% W);
This works but I typically may have an arbitrary number of columns. Can I generalize the above expression to handle that case?
Update:
This is an updated example as asked for in the comments:
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id
1 0 0 3.756 61 278 spam 1
> df_out[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id inner_product
1 0 0 3.756 61 278 spam 1 3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with an array W = (1, 2, 3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead, I want to perform the inner product over all the dimensions without having to list each one. Also, wouldn't converting to a matrix and back to a data frame be an expensive operation?

A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
library(plyr)  # for ddply
set.seed(1)
df <- data.frame(
id=sample(1:5, 20, replace=TRUE),
col1 = runif(20),
col2 = runif(20),
col3 = runif(20),
col4 = runif(20)
)
W <- c(1,2,3,4)
ddply(df, .(id), function(x)as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517

If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) serves any useful purpose, since you are not doing a sum of cross-products within id, and if you were, you wouldn't be using transform but some other aggregating function.
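To tie this back to the spam example from the question, here is a minimal sketch that takes the first length(W) columns without naming them (num_cols is just an illustrative name; "type" and "id" are excluded because they are not part of the product):
num_cols <- setdiff(names(spamdf), c("type", "id"))[seq_along(W)]
spamdf$inner_product <- as.vector(as.matrix(spamdf[, num_cols]) %*% W)
For the first row this should reproduce the 3.2 seen in df_out above (0*1 + 0.64*2 + 0.64*3).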

Related

Apply a function with if inside to a dataframe to take a value in a list in R

Hello everybody, and thank you in advance for any help.
I imported a txt file named "project" into R. The resulting data frame, called "data", consists of 12 columns with information on 999 households.
head(data)
im iw r am af a1c a2c a3c a4c a5c a6c a7c
1 0.00 20064.970 5984.282 0 38 0 0 0 0 0 0 0
2 15395.61 7397.191 0.000 42 30 1 0 0 0 0 0 0
3 16536.74 18380.770 0.000 33 28 1 0 0 0 0 0 0
4 20251.87 14042.250 0.000 38 38 1 1 0 0 0 0 0
5 17967.04 12693.240 0.000 24 39 1 0 0 0 0 0 0
6 12686.43 21170.450 0.000 62 42 0 0 0 0 0 0 0
im=male income
iw=female income
r=rent
am=male age
af=female age
a1c, a2c, ..., a7c take the value 1 when there is a child under the age of 18
and the value 0 when there is no child in the household.
Now I have to calculate the taxed income separately for the male and the female in each household based on some criteria, so I am trying to create one function that calculates 2 numbers, and then apply this function to my data frame to return a list with these numbers.
Specifically, I want something like this:
fact<-function(im,iw,r,am,af,a1c,a2c,a3c,a4c,a5c,a6c,a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
How can I fix this function so that I can apply it to my data frame?
Can a function return 2 values?
How can I apply the function?
Then I tried this:
fact<-function(im=data$im,iw=data$iw,r=data$r,am=data$am,af=data$af,a1c=data$a1c,a2c=data$a2c,a3c=data$a3c,a4c=data$a4c,a5c=data$a5c,a6c=data$a6c,a7c=data$a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
fact(data[1,])
but I get this error: Error in fact(data[1, ]) : object 'mti' not found
When I tried the function with only "fti", it ran, but gave wrong results.
Besides the need to return multiple values using c(mti, fti), your function doesn't have a default value if none of the conditions in the function are TRUE, so mti is never created.
Add mti <- NA at the start of your function, so NA is the default value.
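Putting that together, a minimal sketch of how the fixed function might look and how it could be applied row by row (this assumes the column names shown in head(data); the repeated a5c term in the question looks like a typo and appears once here; only the first two conditions are written out, the rest follow the same pattern):
fact <- function(im, iw, r, am, af, a1c, a2c, a3c, a4c, a5c, a6c, a7c) {
  mti <- NA  # defaults so both objects always exist
  fti <- NA
  child_ded <- 500 * (a1c + a2c + a3c + a4c + a5c + a6c + a7c)
  if (am > 0 && am < 67 && af > 0)  mti <- im - r/2 - child_ded
  if (am > 0 && am < 67 && af == 0) mti <- im - r - child_ded
  ## ... the remaining male and female conditions follow the same pattern ...
  c(mti = mti, fti = fti)  # return both values as a named vector
}
## apply the function to every row of the data frame, matching arguments by column name
taxed <- t(apply(data, 1, function(row) do.call(fact, as.list(row))))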

Binary representation of breast cancer wisconsin database

I want to produce a binary representation of the well-known breast cancer Wisconsin database.
The initial data set has 31 numerical variables, and one categorical variable.
id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
2 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
3 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
4 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
5 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
I want to produce a binary representation of this dataframe by:
transforming the diagnosis column (levels M, B) into two columns, diagnosis_M and diagnosis_B, and putting 1 or 0 in the relevant row depending on the value in the initial column (M or B).
taking the mean of each numerical column and splitting it into two columns depending on whether the values are greater or lower than that mean. E.g., for the column radius_mean, create a column radius_mean_great in which we put 1 if the value > mean and 0 otherwise, and a column radius_mean_low with the inverse.
library(mlbench)
library("RCurl")
library("curl")
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)
Well, there are several ways to binarize the data set; I found the following, and I hope it helps:
df <- breast.cancer.fr[,3:32]  # the 30 numeric feature columns
df2 <- matrix(NA, ncol = 2*ncol(df), nrow = nrow(df))
for(i in 1:ncol(df)){
  df2[,2*i-1] <- as.numeric(df[,i] > mean(df[,i]))   # 1 if above the column mean
  df2[,2*i]   <- as.numeric(df[,i] <= mean(df[,i]))  # 1 if at or below the column mean
}
colnames(df2) <- c(rbind(paste0(names(df),"_great"), paste0(names(df),"_low")))
library(dplyr)
df3 <- select(breast.cancer.fr, id_number, diagnosis) %>%
  mutate(diagnosis_M = as.numeric(diagnosis == "M")) %>%
  mutate(diagnosis_B = as.numeric(diagnosis == "B"))
df <- cbind(df3[,-2], df2)
df[1:10,1:7]
id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1 842302 1 0 1 0 0 1
2 842517 1 0 1 0 0 1
3 84300903 1 0 1 0 1 0
4 84348301 1 0 0 1 1 0
5 84358402 1 0 1 0 0 1
6 843786 1 0 0 1 0 1
7 844359 1 0 1 0 1 0
8 84458202 1 0 0 1 1 0
9 844981 1 0 0 1 1 0
10 84501001 1 0 0 1 1 0
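For comparison, a more compact sketch of the same binarization using sapply instead of the explicit loop (it reuses df3 from the dplyr step above; the only difference is that the _great and _low columns come out grouped rather than interleaved):
num <- breast.cancer.fr[, 3:32]                      # the 30 numeric feature columns
great <- sapply(num, function(col) as.numeric(col > mean(col)))
low <- 1 - great                                     # complement of "great"
colnames(great) <- paste0(names(num), "_great")
colnames(low) <- paste0(names(num), "_low")
binary <- cbind(df3[, -2], great, low)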

Rank the dataframe values in R

I want to fetch the row numbers of the 5 highest values in a column of a data frame and then, in another data frame, put 100 at those same row numbers and 0 everywhere else.
I know how to sort / order a column in a dataframe using order() function.
df=data.frame(a=c(345,948,290,189,3848,302,384,456,383,201,35,346,1.46,4.66,3,5,63,43,6432,4336,345,354,1245,342,523,743,248,932.5))
For example, df[order(-df$a),] results in
6432.00 4336.00 3848.00 1245.00 948.00 932.50 743.00 523.00 456.00 384.00 383.00 354.00 346.00 345.00 345.00 342.00 302.00 290.00 248.00 201.00 189.00 63.00 43.00 35.00 5.00 4.66 3.00 1.46
However, I am not able to meet my specific requirement.
I would expect to see df1 as
0 100 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 100 100 0 0 100 0 0 0 0 0
df$b <- ifelse(df$a %in% sort(df$a, T)[1:5], 100, 0)
We could use the rank function:
df$b <- (rank(-df$a) <= 5) * 100
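One caveat for both approaches: if the fifth-highest value is tied, these one-liners can mark more or fewer than 5 rows. A small sketch of guarding against that (ties.method is a standard rank() argument):
df$b <- (rank(-df$a, ties.method = "first") <= 5) * 100  # distinct ranks, so exactly 5 rows get 100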

R Team Roster Optimization w/ lpSolve

I am new to R and have a particular fantasy sports team optimization problem I would like to solve. I have seen other posts use lpSolve for similar problems, but I cannot seem to wrap my head around the code. An example data table is below. Every player is on a team, plays a particular role, has a salary, and has average points produced per game. The constraints I need are: exactly 8 players; no more than 3 players from any one team; at least one player for each role (of 5); and a cumulative salary of no more than $10,000.
Team Player Role Avgpts Salary
Bears A T 22 930
Bears B M 19 900
Bears C B 30 1300
Bears D J 25 970
Bears E S 20 910
Jets F T 21 920
Jets G M 26 980
[...]
In R, I write the following:
> obj = DF$AVGPTS
> con = rbind(t(model.matrix(~ Role + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> dir = c(">=",">=",">=",">=",">=","==","<=")
> rhs = c(1,1,1,1,1,8,10000)
> result = lp("max", obj, con, dir, rhs, all.bin = TRUE)
This code works fine for producing the optimal fantasy team, except that it ignores the limitation that no more than 3 players may come from any one team. This is where I am stuck, and I suspect the fix relates to the con argument. Any help is appreciated.
What if you added team indicators to con, similar to the way you handled the roles?
If you add t(model.matrix(~ Team + 0, DF)) you'll have indicators for each team in your constraint. For the example you gave:
> con <- rbind(t(model.matrix(~ Role + 0,DF)), t(model.matrix(~ Team + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> con
1 2 3 4 5 6 7
RoleB 0 0 1 0 0 0 0
RoleJ 0 0 0 1 0 0 0
RoleM 0 1 0 0 0 0 1
RoleS 0 0 0 0 1 0 0
RoleT 1 0 0 0 0 1 0
TeamBears 1 1 1 1 1 0 0
TeamJets 0 0 0 0 0 1 1
1 1 1 1 1 1 1
930 900 1300 970 910 920 980
We now need to update dir and rhs to account for this:
dir <- c(">=",">=",">=",">=",">=",rep('<=',n_teams),"<=","<=")
rhs <- c(1,1,1,1,1,rep(3,n_teams),8,10000)
With n_teams set appropriately.
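Putting the pieces together, a minimal sketch of the full call (column names are assumed to match the sample table, i.e. Team, Role, Avgpts, Salary; the player-count row uses "==" to enforce exactly 8 players as stated in the question):
library(lpSolve)
n_teams <- length(unique(DF$Team))
obj <- DF$Avgpts
con <- rbind(t(model.matrix(~ Role + 0, DF)),
             t(model.matrix(~ Team + 0, DF)),
             rep(1, nrow(DF)),
             DF$Salary)
dir <- c(rep(">=", 5), rep("<=", n_teams), "==", "<=")
rhs <- c(rep(1, 5), rep(3, n_teams), 8, 10000)
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
DF[result$solution == 1, ]  # the selected roster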

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG#'s as column names. The matrix is populated with 0's and 1's for presence/absence, e.g.:
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->'pn'
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to collapse the matrix by combining every seven rows into a weekly matrix that sums the 0's and 1's from the daily matrix; then the same site-fidelity script would work on a weekly basis. The problem is that I'm a newbie and have had trouble finding out how to collapse the daily matrix into a weekly one. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
require(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7)[seq_len(nrow(xdt))]] ##fixed (truncated so the index matches the number of rows)
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
I can help you better preserve rownames if you provide a reproducible example (see "How to make a great R reproducible example?").
Edit:
Also, it's very atypical to use right assignment (->) the way you do above.
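For what it's worth, a minimal sketch of carrying the dates through the conversion (keep.rownames is a standard as.data.table argument; this also avoids the length-mismatch issue noted above):
xdt <- as.data.table(x, keep.rownames = TRUE)      # row names land in a column called "rn"
xdt[, week := (.I - 1L) %/% 7L + 1L]               # seven consecutive rows per week
xdt[, lapply(.SD, sum), by = week, .SDcols = setdiff(names(xdt), c("rn", "week"))]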
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0
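If the goal is really just to collapse every seven consecutive rows by summing, as described in the question, here is a base-R sketch of that (it assumes the rows are consecutive days with no gaps, and reuses the sites data from above):
week_index <- (seq_len(nrow(sites)) - 1) %/% 7 + 1      # 1 for days 1-7, 2 for days 8-14, ...
weekly <- rowsum(as.matrix(sites), group = week_index)  # column sums within each week block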
