RFM analysis - using ddply in R. Missing column - r

I am trying to use the code for RFM modelling in R from the blog here. However, the blog does not clearly explain how to group the data frame into “Buy” and “No Buy”. As a result, when I try to execute the function getPercentages, I get an error like:
object "Buy" not found.
I am trying to add a Buy column as follows:
df$Buy <- ifelse(df$Frequency > 1, 1, 0)
before executing the function.
I do not know if this is the right way to get the values.
My head for df after getDataframe is
ID Date Amount Recency Frequency Monetary
1207779 2016-06-22 2112.00 8 20 1576.7725
2455590 2016-06-26 1064.00 4 16 1074.8400
2660337 2016-06-21 1870.00 9 20 1616.1700
257997 2016-06-22 616.00 8 22 684.8968
963883 2016-06-27 703.12 3 16 626.1125
1124489 2016-06-21 594.15 9 18 752.2011

Try this:
Buy <- rep(0, nrow(dftry))
dftry <- cbind(dftry, Buy)
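As a minimal sketch of the approach from the question, the snippet below rebuilds the six rows shown above and adds a Buy column before any later function looks for it. The threshold `Frequency > 1` is the asker's own guess, not something the blog confirms, and the data frame is retyped by hand from the printed head.

```r
# Six rows retyped from the head() output in the question
df <- data.frame(
  ID        = c(1207779, 2455590, 2660337, 257997, 963883, 1124489),
  Recency   = c(8, 4, 9, 8, 3, 9),
  Frequency = c(20, 16, 20, 22, 16, 18),
  Monetary  = c(1576.7725, 1074.8400, 1616.1700, 684.8968, 626.1125, 752.2011)
)

# Assumed rule (from the question, not from the blog):
# a customer with more than one purchase is flagged as a buyer
df$Buy <- ifelse(df$Frequency > 1, 1, 0)

# Check that the column exists and how the two classes are distributed
"Buy" %in% names(df)
table(df$Buy)
```

On these six rows every customer ends up with Buy equal to 1, so a rule like this is only useful if the full data also contains customers that fall on the other side of the threshold.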

Related

Retrieve Stata variable notes in R

I have imported a Stata dta file into R using readstata13 package.
The variables have notes that contain the full text of the questions. I found the attr() function, with which I can do a few things such as extract variable names (attr(df, "names")), extract variable labels (attr(df, "var")), and label values (attr(df, "label")). However, I have not found a way to extract variable notes.
Is there a way to do so?
Below are a few lines of Stata code that produce a dta file with two variables and variable notes, which can be imported into R.
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: Mileage (mpg)
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
EDIT:
You can actually do this directly in newer versions of the readstata13 package as follows:
df = read.dta13("~/mpg_weight.dta")
notes = attr(df, "expansion.fields")
This will generate a list providing variable name, characteristic name and the contents of the Stata characteristic field.
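To pull only the notes out of such a list, something like the following sketch may help. It does not read a real file: `ef` is a hand-written stand-in, and its layout (one entry per characteristic holding variable name, characteristic name and contents, in that order) is an assumption based on the description above. Stata stores notes as characteristics named note1, note2, and so on, with note0 holding the count.

```r
# Hand-written stand-in for attr(df, "expansion.fields"); the layout is assumed
ef <- list(
  c("mpg",    "note0", "1"),
  c("mpg",    "note1", "Mileage (mpg)"),
  c("weight", "note0", "1"),
  c("weight", "note1", "Weight (lbs.)")
)

# Flatten the list into a data frame with one row per characteristic
fields <- do.call(rbind, lapply(ef, function(x) {
  data.frame(variable       = x[[1]],
             characteristic = x[[2]],
             contents       = x[[3]],
             stringsAsFactors = FALSE)
}))

# Keep note1, note2, ... and drop note0, which only stores the number of notes
notes <- fields[grepl("^note[1-9][0-9]*$", fields$characteristic), ]
notes
```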
Here's a quick workaround using your toy example:
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: this is the first note
note mpg: and this is the second
note mpg: here's a third
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
ds
local varlist `r(varlist)'
foreach var of local varlist {
    generate notes_`var' = ""
    forvalues i = 1 / ``var'[note0]' {
        replace notes_`var' = "``var'[note`i']'" in `i'
    }
}
export delimited notes_* using notes_mpg_weight.dta.csv, replace
You can then simply import everything in R as strings and go from there.

Is there any way to delete rownames in R?

I made a table named a, like this:
variable relative_importance scaled_importance percentage
1 x005 68046.078125 1.000000 0.195396
2 x004 63890.796875 0.938934 0.183464
3 x007 48253.820312 0.709134 0.138562
4 x012 43492.117188 0.639157 0.124889
5 x008 43132.035156 0.633865 0.123855
6 x013 32495.070312 0.477545 0.093310
7 x009 18466.910156 0.271388 0.053028
8 x015 10625.453125 0.156151 0.030511
9 x010 8893.750977 0.130702 0.025539
10 x014 4904.361816 0.072074 0.014083
11 x002 1812.269531 0.026633 0.005204
12 x001 1704.574585 0.025050 0.004895
13 x006 1438.692139 0.021143 0.004131
14 x011 1080.584106 0.015880 0.003103
15 x003 10.152302 0.000149 0.000029
I used this code to order the table:
setorder(a,variable)
Now I want to get only the second column:
a[2]
relative_importance
12 380.4296
11 645.4594
15 10.1440
4 8599.7715
2 10749.5752
13 263.7065
5 8434.3760
6 7443.8530
7 3602.8850
10 935.6713
14 256.7183
3 9160.4062
1 12071.1826
9 1173.0701
8 1698.0955
I want to copy "relative_importance" and paste it into Excel.
But I couldn't delete the rownames (12, 11, 15, ..., 9, 8).
Is there any way to print only "relative_importance" (print without rownames, or hide the rownames)?
Thank you :)
You could simply use writeClipboard(as.character(a$relative_importance)) and paste it into Excel. Note that writeClipboard is only available on Windows.
You could create a csv file, which you can open with Excel.
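write.table(a[2], "myfile.csv", sep = ",", row.names = FALSE, col.names = FALSE)
(write.csv ignores attempts to change col.names, so write.table with sep = "," is used here to drop the header as well.)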
Note that the file will be created in your current working directory, which you can find by running the following code: getwd().
On a different note, are you trying to get the column into Excel for further analysis? If you are, I encourage you to learn how to do that analysis in R.
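If the goal is only to see the column on screen without row names, a small sketch along these lines should do it. The three-row data frame is a cut-down copy of the table in the question; `print()` for data frames accepts `row.names = FALSE`, and `writeLines()` prints the bare values one per line.

```r
# Cut-down copy of the table from the question
a <- data.frame(
  variable = c("x005", "x004", "x007"),
  relative_importance = c(68046.078125, 63890.796875, 48253.820312),
  stringsAsFactors = FALSE
)

# Order by variable name using base R
a <- a[order(a$variable), ]

# Print the second column without row names (the header line is kept)
print(a[2], row.names = FALSE)

# Print only the values, one per line, ready to copy and paste
writeLines(as.character(a$relative_importance))
```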

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want and is leading to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
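A quick check of this version on a made-up score vector (the numbers below are invented purely for illustration) shows that Score is now an ordinary column rather than row names. Because `table()` turns the scores into factor levels, the sketch also converts Score back to numeric, which is probably what a numeric x-axis needs.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))              # columns: x, Freq
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  tbl
}

# Invented scores, only for illustration
scores <- c(1, 2, 0, 1, 2, -1, 0, 0)
pct <- scoreFun(scores)

# Score comes back as a factor; convert via character to keep the real values
pct$Score <- as.numeric(as.character(pct$Score))

str(pct)
```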

Compile all data produced by rolling regression into one

I am doing a rolling regression on a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code step by step, and it works very well:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
#use the 20 Quarters data to do regression
model<-lm(fit,data=datapool[(which(datapool$Q>=5&datapool$Q<=24)),])
#use the model to forecast the value of next quarter
pre<-predict(model,newdata=datapool[which(datapool$Q==25),])
#get the forecast error
error<-datapool[which(datapool$Q==25),]$EB -pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B(also from Q5 to Q45)
...
Firm C(also from Q5 to Q45)
The errors produced above are all marked with the "X" value in "datapool", so I can tell which firm each error comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, and came up with the following code:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
for (i in 0:20){
model<-lm(fit,data=datapool[(which(datapool$Q>=5+i&datapool$Q<=24+i)),])
pre<-predict(model,newdata=datapool[which(datapool$Q==25+i),])
error<-datapool[which(datapool$Q==25),]$EB -pre
}
The code above works and produces no errors, but I do not know how to compile all the errors produced by each regression into one data set automatically. Can anyone help me with that?
(I say again: it is a really bad idea to use the name 'error' for a vector. It is the name of a core function.) This is how I would have attempted that task, using the subset parameter and logical indexing rather than the tortured which statements:
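fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
errlist <- vector("list", 21)
for (i in 0:20) {
  # fit on the 20-quarter window Q = (5+i)..(24+i)
  model <- lm(fit, data = datapool, subset = Q >= 5 + i & Q <= 24 + i)
  # forecast the quarter right after the window (one row per firm)
  nxt <- datapool[datapool$Q == 25 + i, ]
  pre <- predict(model, newdata = nxt)
  # keep X and Q so each error can be traced back to its firm and quarter
  errlist[[i + 1]] <- data.frame(X = nxt$X, Q = nxt$Q, err = nxt$EB - pre)
}
# stack the 21 per-window results into one data frame
errset <- do.call(rbind, errlist)
errset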
No guarantees this won't error out by running out of data at the beginning or end, since you have offered neither data nor a comprehensive description of the data object.
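Since no data was supplied, here is a self-contained sketch on simulated data that shows one way to stack the forecast errors from all 21 windows into a single data frame. Everything about `datapool` below (three firms A, B, C, the random values, the coefficient 0.5) is invented for illustration; only the column names and the Q range follow the question. The training rows are subset explicitly before calling `lm()`, which avoids scoping surprises when the loop body lives inside a function.

```r
set.seed(1)

# Simulated stand-in for datapool: 3 firms, quarters 5..45 (all values made up)
datapool <- expand.grid(Q = 5:45, Firm = c("A", "B", "C"))
datapool$X <- seq_len(nrow(datapool))
for (v in c("EB1", "EB2", "EB3", "EB4")) {
  datapool[[v]] <- rnorm(nrow(datapool))
}
datapool$EB <- 0.5 * datapool$EB1 + rnorm(nrow(datapool), sd = 0.1)

fit <- EB ~ EB1 + EB2 + EB3 + EB4

# One data frame of forecast errors per rolling window
errlist <- lapply(0:20, function(i) {
  train <- datapool[datapool$Q >= 5 + i & datapool$Q <= 24 + i, ]
  nxt   <- datapool[datapool$Q == 25 + i, ]
  model <- lm(fit, data = train)
  data.frame(X = nxt$X, Firm = nxt$Firm, Q = nxt$Q,
             err = nxt$EB - unname(predict(model, newdata = nxt)))
})

# Stack all windows: one row per firm and forecast quarter
errset <- do.call(rbind, errlist)
head(errset)
```

Keeping X, Firm and Q next to each error means the combined result can be merged back onto the original data or summarised per firm later.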

How to Find difference between two values of last two dates in R program

DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between values on consecutive dates within a column. For example, for the ACT column and EMMI 12345, the difference between dates 2011/02/19 and 2011/02/12 is 21 - 21 = 0. I want to do this for the entire ACT column and store the results in a new column. Can anybody tell me how to do it?
This is the output I want:
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted within EMMI, this will work. If they are not sorted, the data must be sorted within EMMI first. I would probably sort the entire data frame on date first (saving the result of order), then run the above. If you then need the original row order back, you can run order on the saved order result to "unorder" the data frame.
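The sort, difference, and "unorder" steps described above can be sketched as follows. The data frame is retyped from DF2 in the question and then deliberately shuffled; the names `shuffled`, `ord`, `sorted` and `restored` are only illustrative.

```r
# DF2 retyped from the question
DF2 <- data.frame(
  Date = as.Date(c("2011/02/12", "2011/02/14", "2011/02/19",
                   "2011/02/23", "2011/02/23", "2011/03/03"),
                 format = "%Y/%m/%d"),
  EMMI = c(12345, 43211, 12345, 43211, 56341, 56431),
  ACT  = c(21, 22, 21, 13, 13, 18),
  NO2  = c(11, 12, 13, 12, 12, 20)
)

# Deliberately put the rows out of date order
shuffled <- DF2[c(4, 1, 6, 3, 2, 5), ]

# Sort by EMMI, then Date, remembering the permutation
ord    <- order(shuffled$EMMI, shuffled$Date)
sorted <- shuffled[ord, ]

# Difference to the previous ACT value within each EMMI (NA for the first one)
sorted$difACT <- ave(sorted$ACT, sorted$EMMI,
                     FUN = function(x) c(NA, diff(x)))

# order(ord) is the inverse permutation, so this restores the shuffled order
restored <- sorted[order(ord), ]
restored
```

Note that EMMI values 56341 and 56431 are different IDs with one row each, so both get NA here.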
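This is based on the plyr package (not tested). The NA is prepended because diff() returns one value fewer than its input; note that ddply returns the rows grouped by EMMI rather than in their original order:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))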
