I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion<-as.vector(unlist(tapply(DF$Grade,paste(DF$Fully Paid ,DF$ Charged Off,sep="."),FUN=function(x){x/sum(x)}))
based on an answer given to this same question in a previous post Calculate proportions within subsets of a data frame but not having luck. I am guessing because Grade is a character not a number in my data.
Based on your comments, Here is the code you should try for each column.
DF$Charged_off_proportion <- as.vector(unlist(tapply(DF$Charged_Off,DF$Grade,FUN=function(x){x/sum(x)})))
Similarly you can change the column names for other columns like
DF$Fully_Paid_proportion <- as.vector(unlist(tapply(DF$Fully_Paid,DF$Grade,FUN=function(x){x/sum(x)})))
Related
I have a data frame consisting of three variables named momentum returns(numeric),volatility (factor) and market states (factor). Volatility and market states both have two -two levels. Volatility have levels named high and low. Market states have level named positive and negative I want to make a two sorted table. I want mean of momentum returns in every case.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar_state<-r_sample_factor(30,x=c("positive","negtive"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by #r2evans if you want mean of every sorted cases you can apply following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with help of data.table package. This approach will do same task in less time.
library(data.table)
df<-as.data.table(df)
## if you want results in data frame format
df[,.(mean(mom)),by=.(vol,mar)]
## if you want in simple vector form
df[,mean(mom),by=vol,mar]
I have two data frames. One data frame is called Measurements and has 500 rows. The columns are PatientID, Value and M_Date. The other data frame is called Patients and has 80 rows and the columns are PatientID, P_Date.
Each patient ID in Patients is unique. For each row in Patients, I want to look at the set of measurements in Measurements with the same PatientID (there are maybe 6-7 per patient).
From this set of measurements, I want to identify the one with M_Date closest to P_Date. I want to append this value to Patients in a new column. How do I do this? I tried using ddplyr but can't figure out how to access two data frames at once within this function.
you probably want to install the install.packages("survival") and the neardate function within it to solve your problem.
It has a good example in the documentation
I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in initial analysis and numerical summary that there are significant issues with outlier observations within some of the specific grouping criteria. I used "group by" and looking at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the (group-by based quintile) to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for the root cause.
How do I manipulate the dataframe such that the individual observations are compared to the group-based quintile results and the parse is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis share above, you can use ifelse to create a new variable and identify which records are outliers and the ones that are not. This way you can keep the data there, but you can use this new variable to sort them out whenever you want
I have a data set with a number of survey variables. I am looking at the data of one column, but need to split it by the factors in another column. The survey asked gender, and also asked how much the person smoked. I need to compare how much males smoke vs females, but I cannot figure out how to split the data in the column based on the information in another column.
Can someone help?
I don't know if this is what you want:
#some reproduciable data
survey <- data.frame(gender=c('m','m','f','m','f','f','m','m','f'),
smoke=c('little','much','much','little','no','some','little','little','much'))
#gender VS smoke:
table(survey)
#or:
table(survey$gender,survey$smoke)
I'm having some difficulty executing a conditional operation on two dataframes. For problem illustration, I have three variables: Price, State, and Item, which are stored in a data frame (data1) with those column names. I use ddply to generate a dataframe (data2) that includes columns State and Item, and the average price(or some other function) for that State/Item combination.
What I then want to do is fill in a column in the originating data frame(i.e. a simple prediction vector), where the column's value is the mean value for a given observations combination of State and Item in data1. (e.g., if an observation in data1 has state="Arizona" and item="pen", I then want to retrieve the average price stored in data2 that corresponds to that state/item combination, and insert it into the column.)
Thank you for any help.
The plyr package comes with a great little function called join. You can use this to complete your task.
join(dat1,dat2, by=c('State','Item'))
Review ?join to see the different types of joins possible. I'm pretty sure you want a left join.