I'm looking to create a model where my outcome variable is positive and continuous: the number of conversions from Facebook ads.
Here's a sample of the data; "Results_Total" is the outcome variable:
+-----------------+----------------------+----------------+------+--------------+----------------+--------------------+------------+---------+---------------+
| Team            | Opp_Team             | Channel_Market | Week | Thuuz_Rating | Team_EloRating | Opp_Team_EloRating | Divisional | Spend   | Results_Total |
+-----------------+----------------------+----------------+------+--------------+----------------+--------------------+------------+---------+---------------+
| Atlanta Falcons | Tampa Bay Buccaneers | Fox_In         | 1    | 46           | 1486           | 1412               | 1          | 4681.63 | 48            |
| Atlanta Falcons | Carolina Panthers    | Fox_In         | 4    | 68           | 1510           | 1604               | 1          | 5373.1  | 45            |
| Atlanta Falcons | Denver Broncos       | Fox_In         | 5    | 66           | 1541           | 1690               | 0          | 5339.04 | 15            |
| Atlanta Falcons | Seattle Seahawks     | Fox_In         | 6    | 68           | 1576           | 1654               | 0          | 6304.21 | 41            |
+-----------------+----------------------+----------------+------+--------------+----------------+--------------------+------------+---------+---------------+
I have installed the AER package and tried this code:
library(AER)
nfltobit <- tobit(
  Results_Total ~ Team + Opp_Team + Channel_Market + Week + Spend +
    Team:Spend + Opp_Team:Spend + Channel_Market:Spend + Week:Spend,
  left = 0, right = Inf, data = nfltrain
)
For some reason I get an error:
unused arguments (left = 0, right = Inf, data = nfltrain)
Perhaps I'm not approaching this correctly, or the tobit function isn't the correct method. Any ideas?
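In case it helps, that particular "unused arguments" error usually means a different tobit() is masking AER's (for example, VGAM also exports a tobit(), a family function that takes neither left/right nor data). A minimal sketch, assuming the AER version is the one intended and nfltrain is the data frame sampled above, is to namespace-qualify the call:
library(AER)
# Call AER's tobit() explicitly so a masked tobit() from another package
# cannot intercept the call; left = 0 censors the outcome at zero.
nfltobit <- AER::tobit(
  Results_Total ~ Team + Opp_Team + Channel_Market + Week + Spend +
    Team:Spend + Opp_Team:Spend + Channel_Market:Spend + Week:Spend,
  left = 0, right = Inf, data = nfltrain
)
summary(nfltobit)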
I used the function tableby from the package arsenal to calculate percentages, but by default the percentages are calculated by column. What I want is to calculate the percentages by row.
library(arsenal)
summary(tableby(
  décès ~ Sexe + `Lieu de vie` + `Nombre de principes actifs sur l'ordonnance` +
    `Clairance de la créatinine` + Immunodepression + IMC + Fièvre + Toux +
    `Perte de poids récente` + `localisation3c` +
    `Evenement indésirable lié au trt` + resistance,
  data = analytiqueDC
))
To include percentages by row, add argument cat.stats="countrowpct" to your tableby function.
Here is a complete example:
library(arsenal)
set.seed(100)
nsubj <- 90
mdat <- data.frame(
  Response = sample(c(1, 2, 3), nsubj, replace = TRUE),
  Sex = sample(c("Male", "Female"), nsubj, replace = TRUE),
  Age = round(rnorm(nsubj, mean = 40, sd = 5)),
  HtIn = round(rnorm(nsubj, mean = 65, sd = 5))
)
out <- tableby(Response ~ Sex + Age + HtIn, data=mdat, cat.stats="countrowpct")
labels(out) <- c(Age="Age (years)", HtIn="Height (inches)")
summary(out, stats.labels=c(meansd="Mean-SD", q1q3 = "Q1-Q3"), text=TRUE)
Output
| | 1 (N=25) | 2 (N=31) | 3 (N=34) | Total (N=90) | p value|
|:---------------|:---------------:|:---------------:|:---------------:|:---------------:|-------:|
|Sex | | | | | 0.232|
|- Female | 17 (34.0%) | 14 (28.0%) | 19 (38.0%) | 50 (100.0%) | |
|- Male | 8 (20.0%) | 17 (42.5%) | 15 (37.5%) | 40 (100.0%) | |
|Age (years) | | | | | 0.547|
|- Mean-SD | 40.200 (4.021) | 40.161 (3.796) | 39.265 (3.671) | 39.833 (3.796) | |
|- Range | 29.000 - 48.000 | 33.000 - 51.000 | 30.000 - 48.000 | 29.000 - 51.000 | |
|Height (inches) | | | | | 0.093|
|- Mean-SD | 63.360 (5.322) | 66.516 (4.878) | 65.000 (5.684) | 65.067 (5.402) | |
|- Range | 52.000 - 78.000 | 57.000 - 78.000 | 50.000 - 79.000 | 50.000 - 79.000 | |
I have two tables.
Table1:
+-----+--------+-------------+
| Key | region | region_name |
+-----+--------+-------------+
| ABC | NT     | NORTH       |
| ABC | ST     | SOUTH       |
| XYZ | NT     | NORTH       |
| XYZ | ST     | SOUTH       |
| DEF | ST     | SOUTH       |
+-----+--------+-------------+
Table2:
+-----+-------+------+--------+
| KEY | Sales | cost | profit |
+-----+-------+------+--------+
| ABC | 130   | 100  | 30     |
| XYZ | 120   | 95   | 25     |
| DEF | 110   | 90   | 20     |
+-----+-------+------+--------+
I want the final output to look like the table below.
+-----+-------+------+--------+--------+-------------+
| KEY | Sales | cost | profit | region | region_name |
+-----+-------+------+--------+--------+-------------+
| ABC | 130   | 100  | 30     | NT     | NORTH       |
| ABC | 130   | 100  | 30     | ST     | SOUTH       |
| XYZ | 120   | 95   | 25     | NT     | NORTH       |
| XYZ | 120   | 95   | 25     | ST     | SOUTH       |
| DEF | 110   | 90   | 20     | ST     | SOUTH       |
+-----+-------+------+--------+--------+-------------+
Thanks in advance!
We can use left_join from dplyr. Assuming Table1 is in df1 and Table2 is in df2, note that the key column is spelled Key in Table1 and KEY in Table2, so the join specification has to map the two names; taking Table2 as the left-hand table keeps the column order shown in the desired output:
library(dplyr)
left_join(df2, df1, by = c('KEY' = 'Key'))
Or with merge in base R:
merge(df2, df1, by.x = 'KEY', by.y = 'Key', all.x = TRUE)
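A reproducible sketch, assuming df1 holds Table1 and df2 holds Table2 exactly as shown above:
library(dplyr)

# Table1: one row per (Key, region) combination
df1 <- data.frame(
  Key         = c("ABC", "ABC", "XYZ", "XYZ", "DEF"),
  region      = c("NT", "ST", "NT", "ST", "ST"),
  region_name = c("NORTH", "SOUTH", "NORTH", "SOUTH", "SOUTH")
)

# Table2: one row per KEY with the sales figures
df2 <- data.frame(
  KEY    = c("ABC", "XYZ", "DEF"),
  Sales  = c(130, 120, 110),
  cost   = c(100, 95, 90),
  profit = c(30, 25, 20)
)

# Each key's sales figures are repeated for every region row of that key
left_join(df2, df1, by = c("KEY" = "Key"))
#>   KEY Sales cost profit region region_name
#> 1 ABC   130  100     30     NT       NORTH
#> 2 ABC   130  100     30     ST       SOUTH
#> 3 XYZ   120   95     25     NT       NORTH
#> 4 XYZ   120   95     25     ST       SOUTH
#> 5 DEF   110   90     20     ST       SOUTH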
I am having a hard time figuring this out in R.
This is what I would like to do: in a data frame like the one below, if both Name and Class are duplicated across two rows, add the two rows' scores together; otherwise, leave the row as it is.
+--------+-----------+-------+
| Name   | Class     | Score |
+--------+-----------+-------+
| Sara   | Sophomore | 10    |
| John   | Freshman  | 20    |
| Taylor | Sophomore | 30    |
| Tyler  | Junior    | 10    |
| Keith  | Junior    | 20    |
| Andrew | Senior    | 30    |
| Victor | Senior    | 10    |
| Nancy  | Sophomore | 20    |
| Taylor | Junior    | 30    |
| John   | Senior    | 10    |
| Victor | Freshman  | 20    |
| Sara   | Sophomore | 30    |
| John   | Freshman  | 10    |
| Taylor | Sophomore | 20    |
| John   | Senior    | 30    |
+--------+-----------+-------+
So basically, the end result should look like:
+--------+-----------+-------+
| Name   | Class     | Score |
+--------+-----------+-------+
| Sara   | Sophomore | 40    |
| John   | Freshman  | 30    |
| Taylor | Sophomore | 50    |
| Tyler  | Junior    | 10    |
| Keith  | Junior    | 20    |
| Andrew | Senior    | 30    |
| Victor | Senior    | 10    |
| Nancy  | Sophomore | 20    |
| Taylor | Junior    | 30    |
| John   | Senior    | 40    |
| Victor | Freshman  | 20    |
+--------+-----------+-------+
As you can see, if Name is the only duplicated value, nothing changes (for example, John Freshman vs. John Senior), and if Class is the only duplicated value, nothing changes either. Both columns have to match for two rows to be combined.
My attempt is below, but it is not working and I am getting the error message:
'Error in if ((experiment[i, 1] == experiment[j, 1]) & (experiment[i, 2] == : missing value where TRUE/FALSE needed'
My code:
# creating an empty data frame
experiment1<-data.frame(matrix(ncol=3, nrow=15))
for(i in 1: nrow(experiment)){
for(j in i+1: nrow(experiment)){
if((experiment[i,1] == experiment[j,1]) & (experiment[i,2] == experiment[j,2])){
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3] + experiment[j,3]}
else{
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3]}}}
Could anyone help fix my code or suggest a cleaner way to do this?
Aggregation is one of the first things explained in any basic R tutorial; I suggest working through one. That said, here are the standard approaches:
base R
aggregate(Score ~ Name + Class, data = mydf, FUN = sum)
dplyr
library(dplyr)
mydf %>% group_by(Name, Class) %>% summarize(scoreSum = sum(Score))
data.table
library(data.table)
setDT(mydf)[, .(scoreSum = sum(Score)), by = .(Name, Class)]
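For reference, a runnable sketch using the data from the question, assuming it is stored in a data frame named mydf (the name mydf is just a placeholder; the values are copied from the table above):
mydf <- data.frame(
  Name  = c("Sara", "John", "Taylor", "Tyler", "Keith", "Andrew", "Victor",
            "Nancy", "Taylor", "John", "Victor", "Sara", "John", "Taylor", "John"),
  Class = c("Sophomore", "Freshman", "Sophomore", "Junior", "Junior", "Senior",
            "Senior", "Sophomore", "Junior", "Senior", "Freshman", "Sophomore",
            "Freshman", "Sophomore", "Senior"),
  Score = c(10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30)
)

# Sum Score within each unique (Name, Class) pair; pairs that occur only
# once keep their original score, since the sum of a single value is itself.
aggregate(Score ~ Name + Class, data = mydf, FUN = sum)
The sums match the desired output above (Sara/Sophomore 40, John/Freshman 30, and so on), although aggregate returns the rows in a different order.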
I am copying a CSV file into Cassandra. I have the CSV file shown below, and the table is created as follows:
CREATE TABLE UCBAdmissions(
id int PRIMARY KEY,
admit text,
dept text,
freq int,
gender text
)
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
copy UCBAdmissions(id, admit, gender, dept, freq) from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
I see that the row order in the output differs from the order in the CSV file, as shown above.
Question: what is the difference between the first and second COPY commands? And should the table in Cassandra be created with the same column order as the CSV file?
Cassandra is designed to be distributed. To accomplish this, it takes the partition key of your table (id), hashes it with the cluster's partitioner (probably Murmur3Partitioner) to produce an integer (actually a Long), and uses that integer to assign the row to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive but not necessarily wrong. There is no straightforward way to do a SELECT * FROM table ORDER BY primaryKey ASC in Cassandra; the distributed design makes that difficult to do effectively.
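If you want to see that ordering directly, a small sketch in cqlsh (token() is the built-in function that exposes the partitioner's hash of the partition key):
SELECT token(id), id, admit, dept, freq, gender FROM UCBAdmissions;
The rows come back in the same order as your SELECT * above, sorted by the token value rather than by id.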
Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the package gmodels gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like the Hmisc table functions are very good at creating various summaries and exporting to LaTeX (ultimately what I need to do).
My questions are: 1) Can you create the table below easily with Hmisc? 2) If so, can I interact variables (two in the column dimension)? And finally, 3) can I access the p-values of the significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should give you output just like the one you displayed in your question, very SPSS-y. :)
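On question 3, CrossTable can also run the chi-square test for you; a small sketch on the same toy data (chisq is a documented argument of gmodels::CrossTable):
ct <- CrossTable(x, y, format = "SPSS", chisq = TRUE)
# The test is printed beneath the table; the call also returns a list,
# so inspect names(ct) to pull the stored pieces out programmatically.
names(ct)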
If you are coming from SPSS, you may be interested in the package Deducer (http://www.deducer.org). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can export the counts and the test to LaTeX or HTML using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)