Find same observations of a list/matrix in another matrix - r

For a marketing class I have to write a function that calculates the retention rate of the customers (probability that a customer still is a customer). I've come so far that I isolated the ids of the individual customers and stored them in the matrix first.transactions.data. I then split them into cohorts (group of customers by time) with split() and stored them in the list cohort.
Now comes my problem: I calculated another sub-matrix from the full data set called final.period.data, on which I will calculate the retention rate. To do that, I have to isolate the ids in final.period.data for each cohort. My instructor told me that I should create an additional column in final.period.data that shows TRUE or FALSE depending on whether the cohort's id and final.period.data's id are the same. For this I tried to use exists, but I always receive error messages. I tried the following:
final.period.data <- if(exists(cohort$'1'$id, where = final.period.data$id) final.period.data$same = TRUE)
but I always receive error messages such as: unexpected symbol or invalid first argument. I also tried to convert the list cohort into a matrix, but this didn't help either. How do I have to change the exists call, or is there a simpler way to locate cohort's ids in final.period.data?
Thank you for your help.

You can just create a function that does what you want:
funct <- function(row) {
  # row is a single row of final.period.data
  if (row[["cohort"]] == "1" && row[["id"]] == <condition2>) {
    # Return the value for the TRUE condition
  } else {
    # Return the value for the FALSE condition (it doesn't fit the two conditions)
  }
}
vector <- logical(nrow(final.period.data))  # placeholder for the new column
final.period.data <- cbind(final.period.data, same = vector)
Then use it with the apply function, row by row. The documentation for apply has more information.
But I usually do it with a for loop, first creating the new column and then adding it to the data frame.
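For the simpler route the question asks about, a membership test with %in% may be enough. This is only a sketch on made-up data: the ids, the amount and period columns, and the column name same are assumptions for illustration, not the asker's real data.

```r
# Toy stand-ins for the asker's objects (hypothetical values)
first.transactions.data <- data.frame(
  id     = c(11, 12, 17, 13, 16),
  period = c(1, 1, 1, 2, 2)
)
final.period.data <- data.frame(
  id     = c(11, 12, 13, 14, 15),
  amount = c(5, 9, 2, 7, 4)
)

# split() returns a list with one data frame per cohort, named "1", "2", ...
cohort <- split(first.transactions.data, first.transactions.data$period)

# TRUE where the id in final.period.data also appears in cohort "1"
final.period.data$same <- final.period.data$id %in% cohort[["1"]]$id
print(final.period.data)
```

%in% is vectorised, so it gives one TRUE/FALSE per row without a loop. exists() checks whether a variable name is defined, which is why it does not help for matching values.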

Related

Can I regress rows in R against a set of constants?

I have a very large set of claims data (called data2) with 1 row per enrollee and columns enrolid (enrollment id), jan16allwd,...,dec16allwd, as well as some other fields that aren't relevant to this. For each enrollee I'm looking to extract the coefficient of the regression for (allowed claims~month). I've tried this:
allowed <- c(data2$jan16allwd, data2$feb16allwd, data2$mar16allwd, data2$apr16allwd,
data2$may16allwd, data2$jun16allwd, data2$jul16allwd, data2$aug16allwd,
data2$sept16allwd, data2$oct16allwd, data2$nov16allwd, data2$dec16allwd)
months <- (1:12)
betas.allwd <- unlist(lapply(split(data2,data2$enrolid),function(chunk)
{return(coef(lm(allowed~months, data=chunk))[[2]])}))
but keep getting an error about the lengths of the datasets being different. I know it's due to the allowed fields not being split up by enrolid. How can I fix this and return the vector I need?
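One possible way around the length mismatch is to fit the regression row by row, so that each fit only sees the 12 monthly values of a single enrollee. The sketch below uses a made-up three-row data2; the numbers and the helper name month_cols are assumptions, and it relies on there being exactly one row per enrolid, as described above.

```r
# Column names as given in the question
month_cols <- c("jan16allwd", "feb16allwd", "mar16allwd", "apr16allwd",
                "may16allwd", "jun16allwd", "jul16allwd", "aug16allwd",
                "sept16allwd", "oct16allwd", "nov16allwd", "dec16allwd")
months <- 1:12

# Toy data2 (hypothetical values): one row per enrollee
allwd <- rbind(10 * months, 5 + 2 * months, rep(7, 12))
colnames(allwd) <- month_cols
data2 <- data.frame(enrolid = c(101, 102, 103), allwd)

# For each row, regress the 12 allowed amounts on the month index
# and keep the slope (second coefficient)
betas.allwd <- apply(as.matrix(data2[month_cols]), 1, function(allowed) {
  coef(lm(allowed ~ months))[[2]]
})
names(betas.allwd) <- data2$enrolid
print(betas.allwd)
```

With a very large data set, calling lm() once per row may be slow; this is meant to show the shape of the fix rather than the fastest approach.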

R code for whatsapp average word length per person

I am new to R. Currently, I have parsed messages from a Whatsapp chat group and now I am trying to visualize data for average word length per member.
I am using this code to calculate the number of words each time "Eddy" sends a message:
for(i in grep("Eddy",chatcsv[,2],fixed=TRUE)){
length(which(!is.na(chatcsv[i,4:111])))
}
This does not return any output or any error message.
My intention is to then sum up the total length and then divide by the number of times a person message. Lastly, I plan to place the average as a vector and visualize it as a bar graph.
Thank you
Your syntax is wrong. You should use:
allnames <- chatcsv[,2] # or similar
eddyindexes <- grep("Eddy", allnames, fixed=TRUE) # returns the indexes of Eddy's chats
eddyschats <- chatcsv[eddyindexes, 4:100]
# apply() is called with parentheses; MARGIN = 1 works row by row (one row per message)
eddysavgcharacters <- apply(eddyschats, 1, function(x) mean(nchar(x), na.rm = TRUE)) # average nchars of Eddy's chats
I'm thinking you are coming from a non-functional language. (Not a language that is dysfunctional, but rather one that is not a "functional language".) Your expression length(which(!is.na(chatcsv[i,4:111]))) would do nothing, because it is inside a for loop but was not assigned to any name. It just disappears. You would have needed to create a named vector (let's say res) with res <-numeric(0) before your loop and then within your loop done:
res[i] <- length(which(!is.na(chatcsv[i,4:111])))
The earlier answerer was confusing grep and grepl in his comment. The grep function returns integer values; the grepl function returns logical vectors. They can both be used for indexing.
Whether that expression would give you the basis for further efforts is not clear. It would depend on the contents of chatcsv[i,4:111]. If the contents are single words then perhaps it would succeed. If they are sentences then it would not. The length function would just return the number of non-NA values in the row-vector. Only if your prior (undescribed) operations had created a clean set of "words" in that set of columns would you be getting meaningful results.
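If the word columns really do hold one word per cell, the whole calculation can probably be done without a loop. The following is a sketch on a made-up chatcsv: the column names and values are assumptions, the sender is taken to be in column 2 and the words in column 4 onwards, and "average" is read as average number of words per message.

```r
# Toy chatcsv (hypothetical): sender in column 2, one word per cell from column 4 on
chatcsv <- data.frame(
  date = c("01/01", "01/01", "02/01"),
  name = c("Eddy", "Ann", "Eddy"),
  time = c("10:00", "10:05", "09:30"),
  w1   = c("hi", "hello", "see"),
  w2   = c("there", NA, "you"),
  w3   = c(NA, NA, "soon"),
  stringsAsFactors = FALSE
)

word_cols <- 4:ncol(chatcsv)

# Number of non-NA cells per row, i.e. words per message
words_per_msg <- rowSums(!is.na(chatcsv[, word_cols]))

# Mean words per message for each sender
avg_words <- tapply(words_per_msg, chatcsv[, 2], mean)
print(avg_words)
```

The result is a named vector with one entry per member, so barplot(avg_words) should be enough for the bar graph.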

Create new column in dataframe using if {} else {} in R

I'm trying to add a conditional column to a dataframe, but not getting the results I'm expecting.
I have a dataframe with values recorded for the column "steps" across 5-minute intervals over various days. I'm trying to impute missing values in the 'steps' column by using the mean number of steps for a given 5-minute interval on the days that do have measurements. n.b. I tried using the MICE package for this but it just crashed my computer so I opted for a more manual workaround.
As an intermediate stage, I have bound an additional column to the existing dataframe with the mean number of steps for that interval. What I want to do next is create a column that returns that mean if the raw number of steps is NA, and just uses the raw value if it is not NA. Here's my code for that part:
activityTimeAvgs$stepsImp <- if(is.na(activityTimeAvgs$steps)){
activityTimeAvgs$avgsteps
} else {
activityTimeAvgs$steps
}
What I expected to happen is that the if statement would evaluate as TRUE if 'steps' is NA and consequently give 'avgsteps'; in cases where 'steps' is not NA I would expect it to just use the raw value for 'steps'. However, the output just gives the value for 'avgsteps' in every row, which is not much use. I also get the following warning:
Warning message:
In if (is.na(activityTimeAvgs$steps)) { :
the condition has length > 1 and only the first element will be used
Any ideas where I'm going wrong?
Thanks in advance.
The if statement is not suitable for this. You need to use ifelse:
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps), activityTimeAvgs$avgsteps, activityTimeAvgs$steps)
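To see the difference on a small scale, here is a sketch with a made-up four-row activityTimeAvgs (the values are invented). if() looks at a single TRUE/FALSE, which is why the original code used only the first row's result for the whole column; as far as I know, R 4.2.0 and later turn that warning into an error. ifelse() tests every element.

```r
# Toy data (hypothetical values)
activityTimeAvgs <- data.frame(
  steps    = c(NA, 12, NA, 0),
  avgsteps = c(3.5, 10, 7.2, 1.1)
)

# Element-wise: take avgsteps where steps is NA, otherwise keep steps
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps),
                                    activityTimeAvgs$avgsteps,
                                    activityTimeAvgs$steps)
print(activityTimeAvgs)
```

Only rows 1 and 3 pick up the interval mean; rows 2 and 4 keep their recorded values, including the genuine 0.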

R, variable name missing in returned object, ddply, function(x)

When this code is run, why does the variable name result_var_all not show up in v? It's just the variable name itself I'm asking about. Otherwise the code works fine.
v<-ddply(y1,.(metric_name), result_var_all<-function(x) {
y1a<-x[match(unique(x[,2]),x[,2]),]
y1b<-(y1a$event)
y1c<-subset(x,x[,1] %in% (y1b))
var(y1c$result_value)
})
I also tried this variation, which again runs to completion, but gives the same result for every value of field = metric_name in the sample data (NA for this small data set and a numeric value in the larger data set). Why is that?
z1<-split(y1,y1[,3])
lapply(z1, function(x) {
y1a<-x[match(unique(x[,2]),x[,2]),]
y1b<-(y1a$event)
y1c<-subset(x,x[,1] %in% (y1b))
x["result_var_all"]<-var(y1c$result_value)
out<-x
return(out)})
m<-rep(c("a1","a2","a3","b1","b2","b3","c1","c2","c3"),2)
n<-rep(c(rep(letters[1],3),rep(letters[2],3),rep(letters[3],3)),2)
p<-rep(c("width","depth","count"),6)
r<-c(sample((100:200),9),sample((20:50),9))
y1<-data.frame(m,n,p,r)
colnames(y1)<-c("event","site","metric_name","result_value")
This is a module in a longer script in which the filters subset to the target record set, for about 300 metrics. Not all sites have the same metrics and some sites have multiple events.

Create a new data frame of the means of randomly selected rows - looped

Question:
I have a data.frame (hlth) that consists of 49 vectors - a mix of numeric(25:49) and factor(1:24). I am trying to randomly select 50 rows, then calculate column means only for the numeric columns (dropping the other values), and then place the random row mean(s) into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this process but the values that get returned are identical and the new means will not enter the new data.frame
Here is a few rows and columns of the data.frame(hlth)
DateIn adgadj Sex VetMedCharges pwtfc
1/01/2006 3.033310 STEER 0.00 675.1151
1/10/1992 3.388245 STEER 2540.33 640.2261
1/10/1995 3.550847 STEER 572.78 607.6200
1/10/1996 2.893707 HEIFER 549.42 425.5217
1/10/1996 3.647233 STEER 669.18 403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues, that I have detected, are:
1) Returns the same value(s) each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when assigning to beta$net.row
Any suggestions would be appreciated!!!
Just adding to my comment (and firstly pasting it):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably be netrow = ..., and the error is a scoping problem: R is trying to subset the function beta, presumably again because it can't find netRowMeans in the data.frame you've defined, so it moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable, which I think you mean to assign to s1, so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly into the data frame beta in that manner, you'll have to initialise beta with the right number of columns and rows. The column means you've passed out will be a vector of (1 x 25), so they won't fit in a single column. You would probably be better off initialising a matrix called mat or something (to avoid confusion with scoping errors masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.
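Putting those points together, a corrected version could look roughly like the sketch below. The hlth here is a small made-up stand-in (200 rows, three numeric columns), and num_cols is a hypothetical helper that picks the numeric columns by type instead of by position 25:49. Rows are indexed directly with the sampled numbers, so a row drawn twice counts twice, which differs from the %in% filter in the question.

```r
set.seed(25)  # set.seed() is a function call, so it takes parentheses

# Toy hlth (hypothetical values) with one factor and three numeric columns
hlth <- data.frame(
  Sex           = factor(sample(c("STEER", "HEIFER"), 200, replace = TRUE)),
  adgadj        = rnorm(200, mean = 3.3, sd = 0.3),
  VetMedCharges = rexp(200, rate = 1 / 500),
  pwtfc         = rnorm(200, mean = 550, sd = 100)
)

# Positions of the numeric columns (stands in for 25:49 in the real data)
num_cols <- which(vapply(hlth, is.numeric, logical(1)))

net.row <- function(n = 50) {
  netrow <- sample(nrow(hlth), size = n, replace = TRUE)  # sampled row numbers
  s1 <- hlth[netrow, num_cols]                            # keep the sampled rows
  colMeans(s1, na.rm = TRUE)                              # one mean per numeric column
}

# replicate() returns one column per iteration; transpose to 1000 rows
mat <- t(replicate(1000, net.row()))
beta <- as.data.frame(mat)
head(beta)
```

Because sample() is called inside the function, each of the 1000 iterations draws a fresh set of rows, so the means differ from row to row.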

Resources