R summing row one with all rows

I am trying to analyse website data for AB testing.
My reference point is based on experimentName = Experiment 1 (control version)
experimentName UniquePageView UniqueFrequency NonUniqueFrequency
1 Experiment 1 459 294 359
2 Experiment 2 440 286 338
3 Experiment 3 428 273 348
What I need to do is sum every UniquePageView, UniqueFrequency and NonUniqueFrequency row when experimentName = Experiment 1
e.g.
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 2 ',
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 3 '
and so on (I could have an unlimited number of experiments),
then do the same for UniqueFrequency and NonUniqueFrequency (I could have an unlimited number of columns as well)
Result expected:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 918 588 718
2 Experiment 2 440 286 338 899 580 697
3 Experiment 3 428 273 348 887 567 707
here is the math behind it:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 459 + 459 294 + 294 359 + 359
2 Experiment 2 440 286 338 459 + 440 294 + 286 359 + 338
3 Experiment 3 428 273 348 459 + 428 294 + 273 359 + 348

In base R, you can do this in one line by column binding (with cbind) the initial data frame to the sum of its numeric columns and a data frame that just repeats the "Experiment 1" row.
cbind(dat, dat[,-1] + dat[rep(which(dat$experimentName == "Experiment 1"), nrow(dat)), -1])
# experimentName UniquePageView UniqueFrequency NonUniqueFrequency UniquePageView UniqueFrequency
# 1 Experiment 1 459 294 359 918 588
# 2 Experiment 2 440 286 338 899 580
# 3 Experiment 3 428 273 348 887 567
# NonUniqueFrequency
# 1 718
# 2 697
# 3 707
To update the column names at the end (assuming you stored the resulting data frame in res), you could use:
names(res)[5:7] <- c("CombinedPageView", "CombinedUniqueFrequency", "CombinedNonUniqueFrequency")
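Since the question mentions an unlimited number of rows and columns, here is a minimal sketch of the same idea that does not hard-code column positions. It assumes every column other than experimentName is numeric; the "Pooled" prefix and the names num_cols, control and pooled are made up for illustration.

```r
dat <- data.frame(
  experimentName = c("Experiment 1", "Experiment 2", "Experiment 3"),
  UniquePageView = c(459, 440, 428),
  UniqueFrequency = c(294, 286, 273),
  NonUniqueFrequency = c(359, 338, 348)
)

# Every numeric column is pooled, however many there are
num_cols <- names(dat)[sapply(dat, is.numeric)]

# Values of the control row as a named numeric vector
control <- unlist(dat[dat$experimentName == "Experiment 1", num_cols])

# Add the control value of each column to every row of that column
pooled <- sweep(as.matrix(dat[num_cols]), 2, control, "+")
colnames(pooled) <- paste0("Pooled", num_cols)

res <- cbind(dat, pooled)
res
```

Naming the new columns while building them avoids having to count column positions afterwards.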

Do you know how to use dplyr? If you're new to R, this is a pretty good lesson to learn. dplyr includes the functions filter and summarise, which are all you need to solve this problem - very simple!
First, take your data frame
df
Then, filter to only the data you want, in this case when experimentName = Experiment 1
df
df <- filter(df, experimentName == "Experiment 1")
Now, summarise to find the sums of UniquePageView, UniqueFrequency and NonUniqueFrequency
df
df <- filter(df, experimentName == "Experiment 1")
summarise(df, SumUniquePageView = sum(UniquePageView),
SumUniqueFrequency = sum(UniqueFrequency),
SumNonUniqueFrequency = sum(NonUniqueFrequency))
This will return a small table with the answers you're looking for. For a slightly more advanced (but more concise) way to do this, you can use the pipe operator %>% from the magrittr package, which dplyr re-exports. It takes the object on its left and passes it as the first argument to the function on its right, as follows:
df %>% filter(experimentName == "Experiment 1") %>%
summarise(SumUniquePageView = sum(UniquePageView),
SumUniqueFrequency = sum(UniqueFrequency),
SumNonUniqueFrequency = sum(NonUniqueFrequency))
If you don't yet have the package, you can get it with install.packages("dplyr") and then load it with library(dplyr).


Ratio 1:1 for exact matching using MatchIt package

library(MatchIt)
df <- data.frame(lalonde)
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
method = "exact")
m.data1<-match.data(m.out1)
I would like to know how I can get the same size for both the control and treatment samples after running an exact matching with MatchIt package. Ideally, I would like to randomly pick a control if a treated unit has been matched to more than one control.
My real dataset is not lalonde. It is actually an extremely large one. So I might have many controls associated with a treated unit and I want to draw one randomly for each treated unit.
For exact matching you could use this code.
library(Matching)
data("lalonde")
Y <- lalonde$re78
Tr <- lalonde$treat
X <- lalonde[setdiff(names(lalonde), c('re78', 'treat'))]
set.seed(42) ## makes the random tie-breaking reproducible; comment out to get a new draw on each run
rmtch <- Match(Y=Y, Tr=Tr, X=X, exact=TRUE, ties=FALSE)
summary(rmtch)
# Estimate... 1678.6
# SE......... 981
# T-stat..... 1.7111
# p.val...... 0.087055
#
# Original number of observations.............. 445
# Original number of treated obs............... 185
# Matched number of observations............... 55
# Matched number of observations (unweighted). 55
#
# Number of obs dropped by 'exact' or 'caliper' 130
str(rmtch) ## what is stored in Match object
rmtch$index.control ## indices of control units
# [1] 261 254 188 279 288 317 323 280 186 311 305 234 337 302 219 345 234 328
# [19] 271 218 253 249 339 271 339 344 351 253 328 339 255 217 254 197 254 284
# [37] 266 252 253 280 208 226 209 354 204 282 350 296 202 247 219 330 347 280
# [55] 344
If you re-run the code without set.seed(), you will see that the IDs change slightly, and the changes would probably be more noticeable with a larger dataset.
To make the random selection of control units reproducible, call set.seed() first. With ties=FALSE, Match breaks ties randomly, so each treated unit keeps a single control (see the ?Match help page).
The easiest way is to do 1:1 nearest neighbor matching with exact matching constraints:
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
method = "nearest",
exact = ~ age + race + educ)
If you are doing coarsened exact matching, there is a built-in option for this, which you request by setting k2k = TRUE:
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
method = "cem", k2k = TRUE,
cutpoints = 0)
Setting cutpoints = 0 requests exact matching (no coarsening).
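If you would rather keep method = "exact" and thin the result yourself, the random draw can be done on the matched data. The following is only a sketch on a made-up data frame: it assumes the matched data has a treat column and a subclass column (as match.data() returns for exact matching), and the helper pick_pairs is hypothetical. Within each subclass it keeps equally many treated and control rows, so surplus treated units are dropped as well.

```r
set.seed(1)  # only so the random draw is reproducible

# Hypothetical stand-in for the output of match.data()
matched <- data.frame(
  id = 1:9,
  treat = c(1, 0, 0, 0, 1, 1, 0, 1, 0),
  subclass = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3))
)

# Keep k treated and k controls per subclass, chosen at random,
# where k is the size of the smaller group in that subclass
pick_pairs <- function(rows) {
  treated <- which(rows$treat == 1)
  control <- which(rows$treat == 0)
  k <- min(length(treated), length(control))
  keep <- c(treated[sample.int(length(treated), k)],
            control[sample.int(length(control), k)])
  rows[keep, ]
}

balanced <- do.call(rbind, lapply(split(matched, matched$subclass), pick_pairs))
table(balanced$treat)
```

Indexing with sample.int() rather than calling sample() on the row numbers avoids the surprise that sample(x, 1) draws from 1:x when x has length one.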

Stratifying multiple columns for cross-validation

There are many ways I've seen to stratify a sample by a single variable to use for cross-validation. The caret package does this nicely with the createFolds() function. By default it seems that caret will partition such that each fold has roughly the same target event rate.
What I want to do, though, is stratify by the target rate and by time. I've found a function that can partially do this: stratified() from the splitstackshape package. The issue with that function, though, is that it returns a single sample; it doesn't split the data into k groups under the given conditions.
Here's some dummy data to reproduce.
set.seed(123)
time = rep(seq(1:10),100)
target = rbinom(n=100, size=1, prob=0.3)
data = as.data.frame(cbind(time,target))
table(data$time,data$target)
0 1
1 60 40
2 80 20
3 80 20
4 60 40
5 80 20
6 80 20
7 60 40
8 60 40
9 70 30
10 80 20
As you can see, the target event rate is not the same across time. It's 40% in time 1 and 20% in time 2, etc. I want to preserve this when creating the folds used for cross-validation. If I understand correctly, caret will partition by the overall event rate.
table(data$target)
0 1
710 290
This rate of ~30% will be preserved overall, but target event rate over time will not.
We can get one sample like this:
library(splitstackshape)
train.index <- stratified(data,c("target","time"),size=.2)
I need to repeat this though 4 more times for a 5-fold cross validation and it needs to be done such that once a row is assigned it can't be assigned again. I feel like there should be a function designed for this already. Any ideas?
I know this post is old but I just had the same problem and I couldn't find another solution. In case anyone else needs an answer, here's the solution I'm implementing.
library(data.table)
mystratified <- function(indt, group, NUM_FOLDS) {
indt <- setDT(copy(indt))
if (is.numeric(group))
group <- names(indt)[group]
temp_grp <- temp_ind <- NULL
indt[, `:=`(temp_ind, .I)]
indt[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
samp_sizes <- indt[, .N, by = group]
samp_sizes[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
inds <- split(indt$temp_ind, indt$temp_grp)[samp_sizes$temp_grp]
z = unlist(inds,use.names=F)
model_folds <- suppressWarnings(split(z, 1:NUM_FOLDS))
}
This is basically a rewrite of splitstackshape::stratified. It works as follows, returning a list of validation indices for each fold.
myfolds = mystratified(indt = data, group = colnames(data), NUM_FOLDS = 5)
str(myfolds)
List of 5
$ 1: int [1:200] 1 91 181 261 351 441 501 591 681 761 ...
$ 2: int [1:200] 41 101 191 281 361 451 541 601 691 781 ...
$ 3: int [1:200] 51 141 201 291 381 461 551 641 701 791 ...
$ 4: int [1:200] 61 151 241 301 391 481 561 651 741 801 ...
$ 5: int [1:200] 81 161 251 341 401 491 581 661 751 841 ...
So, for instance the train and validation data for each fold are:
# first fold
train = data[-myfolds[[1]],]
valid = data[myfolds[[1]],]
# second fold
train = data[-myfolds[[2]],]
valid = data[myfolds[[2]],]
# etc...
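For comparison, a similar assignment can be sketched in base R without data.table. This is an alternative technique, not the function above: it uses ave() to shuffle fold labels within each time/target stratum. The dummy data mirrors the question's; k and assign_folds are names made up for this sketch.

```r
set.seed(123)

# Dummy data shaped like the question's: 10 time points, binary target
data <- data.frame(
  time = rep(1:10, 100),
  target = rep(rbinom(n = 100, size = 1, prob = 0.3), 10)
)

k <- 5  # number of folds

# Within one stratum, hand out fold labels 1..k as evenly as possible,
# then shuffle them so the assignment is random
assign_folds <- function(i) {
  ids <- rep_len(seq_len(k), length(i))
  ids[sample.int(length(ids))]
}

# ave() applies assign_folds separately to each time/target combination
data$fold <- ave(seq_len(nrow(data)), data$time, data$target, FUN = assign_folds)

table(data$fold)                       # rows per fold
table(data$time, data$target, data$fold)  # stratum counts within each fold
```

Each row gets exactly one fold label, so no row can be assigned twice; the validation set for fold j is data[data$fold == j, ] and the training set is the rest.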

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond with time duration (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject- I have only included first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below which is for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the number of row and have the data pulled across from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header=TRUE, stringsAsFactors=FALSE)
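The "'+' not meaningful for factor" message in the question suggests the columns were read in as factors, in which case the cumsum line needs numeric columns first. A minimal sketch, assuming that is what happened; the data frame df_fac is made up here to reproduce the situation with three of the columns.

```r
# Made-up data frame whose duration columns are factors, as in the question
df_fac <- data.frame(
  Sub = c(6, 7),
  sup_b = factor(c("175", "130")),
  hdt = factor(c("434", "592")),
  sup_2 = factor(c("596", "592"))
)

# Convert via character: as.numeric() on a factor alone returns level codes
df_fac[-1] <- lapply(df_fac[-1], function(x) as.numeric(as.character(x)))

# Row-wise running total; apply() returns one column per row, hence t()
df_fac[-1] <- t(apply(df_fac[-1], 1, cumsum))
df_fac
```

After the conversion, each row holds the cumulative duration up to and including that condition.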

How do I make sure numbers are numeric from a .txt?

I'm setting up a script to extract the thickness and voltages from a single column text file and perform a Weibull distribution on it. When I try to use fitdistr() I get an error stating "'x' must be a non-empty numeric vector". R is supposed to interpret numbers in text files as numeric but that doesn't seem to be happening. Any thoughts?
filename <- "SampleBreakdownSet.txt"
d <- read.table(filename, header = FALSE, sep = "")
#Extract thickness from the dataset; set to variable t
t = d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV = tail(d,(nrow(d)-1))
#Calculates the breakdown field from the thickness and BDV
BDF = (BDV*10000) / t
#Calculates the Weibull parameters from the input breakdown voltages.
fitdistr(BDF, densfun ="weibull", lower = 0)
fitdistr(BDF, densfun ="weibull", lower = 0)
Error in fitdistr(BDF, densfun = "weibull", lower = 0) :
'x' must be a non-empty numeric vector
Sample data I'm using:
2
200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243
You are passing a data.frame to fitdistr, but you should be passing the vector itself.
Try this:
d <- read.table(text='200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243', header=FALSE)
t <- d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV <- d[-1, 1]
BDF <- (BDV*10000) / t
library(MASS)
fitdistr(BDF, densfun ="weibull", lower = 0)
Alternatively, if you keep BDF as a data frame as in your original code, you could refer to the relevant column when calling fitdistr, e.g.:
fitdistr(BDF$V1, densfun ="weibull", lower = 0)
# shape scale
# 2.745485e+00 1.997509e+04
# (3.716797e-01) (1.283667e+03)
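A quick way to see the difference is to check what each subsetting form returns. This is a small sketch on a shortened copy of the data; the names BDV_df and BDV_vec are made up for illustration.

```r
d <- read.table(text = "2
200
250
450", header = FALSE)

# tail() on a data frame keeps the data frame structure
BDV_df <- tail(d, nrow(d) - 1)

# Selecting a single column with [rows, col] drops to a plain vector
BDV_vec <- d[-1, 1]

is.data.frame(BDV_df)  # TRUE: this is what triggered the error
is.numeric(BDV_vec)    # TRUE: this is what fitdistr expects
str(d)                 # the values themselves were read as numbers
```

So the numbers were parsed as numeric all along; only the container passed to fitdistr was wrong.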

Multiple scatterplot figure in R

I have a slightly complicated plotting task. I am halfway there, but not quite sure how to get the rest. I have a dataset of the form below, with multiple subjects, each in either Treatgroup 0 or Treatgroup 1, each subject contributing several rows of data. Each row corresponds to a single timepoint at which there are values in columns "count1", "count2", "weirdname3", etc.
Task 1. I need to calculate "Days", which is just the visitdate - the startdate, for each row. Should be an apply type function, I guess.
Task 2. I have to make a multiplot figure with one scatterplot for each of the count variables (a plot for count1, one for count2, etc). In each scatterplot, I need to plot the value of the count (y axis) against "Days" (x-axis) and connect the dots for each subject. Subjects in Treatgroup 0 are one color, subjects in treatgroup 1 are another color. Each scatterplot should be labeled with count1, count2 etc as appropriate.
I am trying to use the base plotting function, and have taken the approach of writing a plotting function to call later. I think this can work but need some help with syntax.
#Enter example data
tC <- textConnection("
ID StartDate VisitDate Treatstarted count1 count2 count3 Treatgroup
C0098 13-Jan-07 12-Feb-10 NA 457 343 957 0
C0098 13-Jan-06 2-Jul-10 NA 467 345 56 0
C0098 13-Jan-06 7-Oct-10 NA 420 234 435 0
C0098 13-Jan-05 3-Feb-11 NA 357 243 345 0
C0098 14-Jan-06 8-Jun-11 NA 209 567 254 0
C0098 13-Jan-06 9-Jul-11 NA 223 235 54 0
C0098 13-Jan-06 12-Oct-11 NA 309 245 642 0
C0110 13-Jan-06 23-Jun-10 30-Oct-10 629 2436 45 1
C0110 13-Jan-07 30-Sep-10 30-Oct-10 461 467 453 1
C0110 13-Jan-06 15-Feb-11 30-Oct-10 270 365 234 1
C0110 13-Jan-06 22-Jun-11 30-Oct-10 236 245 23 1
C0151 13-Jan-08 2-Feb-10 30-Oct-10 199 653 456 1
C0151 13-Jan-06 24-Mar-10 3-Apr-10 936 25 654 1
C0151 13-Jan-06 7-Jul-10 3-Apr-10 1147 254 666 1
C0151 13-Jan-06 9-Mar-11 3-Apr-10 1192 254 777 1
")
data1 <- read.table(header=TRUE, tC)
close.connection(tC)
# format date
data1$VisitDate <- with(data1,as.Date(VisitDate,format="%d-%b-%y"))
# stuck: need to define days as VisitDate - StartDate for each row of dataframe (I know I need an apply family fxn here)
data1$Days <- [applyfunction of some kind ](VisitDate,ID,function(x){x-data1$StartDate})))
# Unsure here. Need to define plot function
plot_one <- function(d){
with(d, plot(Days, Count, t="n", tck=1, cex.main = 0.8, ylab = "", yaxt = 'n', xlab = "", xaxt="n", xlim=c(0,1000), ylim=c(0,1200))) # set limits
grid(lwd = 0.3, lty = 7)
with(d[d$Treatgroup == 0,], points(Days, Count1, col = 1))
with(d[d$Treatgroup == 1,], points(Days, Count1, col = 2))
}
#Create multiple plot figure
par(mfrow=c(2,2), oma = c(0.5,0.5,0.5,0.5), mar = c(0.5,0.5,0.5,0.5))
#trouble here. I need to call the column names somehow, with; plyr::d_ply(data1, ???, plot_one)
Task 1:
data1$days <- floor(as.numeric(as.POSIXlt(data1$VisitDate,format="%d-%b-%y")
-as.POSIXlt(data1$StartDate,format="%d-%b-%y")))
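The same difference can also be sketched with as.Date, which is vectorised (no apply function is needed) and always subtracts in whole days. The two rows below are copied from the example data; the data frame name visits is made up, and the %b format assumes a locale with English month abbreviations.

```r
visits <- data.frame(
  StartDate = c("13-Jan-07", "13-Jan-06"),
  VisitDate = c("12-Feb-10", "2-Jul-10"),
  stringsAsFactors = FALSE
)

# Parse both columns; %b matches abbreviated month names such as "Jan"
start <- as.Date(visits$StartDate, format = "%d-%b-%y")
visit <- as.Date(visits$VisitDate, format = "%d-%b-%y")

# Subtracting two Date vectors gives a difftime in days, row by row
visits$Days <- as.numeric(visit - start)
visits
```

Working with Date rather than POSIXlt also sidesteps daylight-saving shifts, so no floor() is needed.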
Task 2:
par(mfrow=c(3,1), oma = c(2,0.5,1,0.5), mar = c(2,0.5,1,0.5))
plot(data1$days, data1$count1, col=as.factor(data1$Treatgroup), main="count1")
plot(data1$days, data1$count2, col=as.factor(data1$Treatgroup), main="count2")
plot(data1$days, data1$count3, col=as.factor(data1$Treatgroup), main="count3")
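The question also asks for the dots to be connected per subject and for the column names not to be repeated by hand. One possible sketch in base graphics follows; the toy data values are partly made up, and plot_counts is a hypothetical helper, not part of any package.

```r
# Toy data in the same shape as data1 after Task 1 (days are made up)
toy <- data.frame(
  ID = rep(c("A", "B", "C"), each = 3),
  Treatgroup = rep(c(0, 1, 1), each = 3),
  days = rep(c(10, 100, 200), times = 3),
  count1 = c(457, 467, 420, 629, 461, 270, 199, 936, 1147),
  count2 = c(343, 345, 234, 2436, 467, 365, 653, 25, 254)
)

plot_counts <- function(d, vars) {
  # One panel per count variable, stacked vertically
  old <- par(mfrow = c(length(vars), 1), mar = c(4, 4, 2, 1))
  on.exit(par(old))
  for (v in vars) {
    # Empty frame first so the axes cover all subjects
    plot(d$days, d[[v]], type = "n", xlab = "Days", ylab = v, main = v)
    for (s in split(d, d$ID)) {
      s <- s[order(s$days), ]
      # type = "b" draws points joined by lines; colour 1 or 2 by group
      lines(s$days, s[[v]], type = "b", col = s$Treatgroup[1] + 1)
    }
  }
  invisible(vars)
}

plot_counts(toy, c("count1", "count2"))
```

Passing the column names as a character vector means a new count variable only has to be added to that vector.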
