Conditional operation on two data frames (R)

I'm having some difficulty executing a conditional operation on two data frames. To illustrate the problem, I have three variables: Price, State, and Item, which are stored in a data frame (data1) with those column names. I use ddply to generate a data frame (data2) that includes the columns State and Item, and the average price (or some other function) for that State/Item combination.
What I then want to do is fill in a column in the originating data frame (i.e. a simple prediction vector), where the column's value is the mean value for a given observation's combination of State and Item in data1. (E.g., if an observation in data1 has state="Arizona" and item="pen", I want to retrieve the average price stored in data2 that corresponds to that state/item combination, and insert it into the column.)
Thank you for any help.

The plyr package comes with a great little function called join. You can use this to complete your task.
join(data1, data2, by = c("State", "Item"))
Review ?join to see the different types of joins possible. I'm pretty sure you want a left join, which is the default for join.
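If you would rather stay in base R, the same left join can be sketched with aggregate() and merge(). The toy data and the MeanPrice column name below are made up for illustration and are not taken from the question. Since merge() may reorder rows, the last line shows ave(), which fills the column in one step and keeps the original row order.

```r
# Toy data; the values are invented for illustration only.
data1 <- data.frame(
  State = c("Arizona", "Arizona", "Arizona", "Texas", "Texas"),
  Item  = c("pen", "pen", "ink", "pen", "pen"),
  Price = c(1, 2, 5, 3, 5)
)

# Mean price per State/Item combination (plays the role of data2).
data2 <- aggregate(Price ~ State + Item, data = data1, FUN = mean)
names(data2)[names(data2) == "Price"] <- "MeanPrice"  # hypothetical column name

# Left join: keep every row of data1 and attach the matching mean.
merged <- merge(data1, data2, by = c("State", "Item"), all.x = TRUE)

# One-step alternative that preserves the row order of data1.
data1$MeanPrice <- ave(data1$Price, data1$State, data1$Item, FUN = mean)
```

Either way, every observation ends up with the mean price of its own State/Item group.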

Related

Obtaining proportions within subsets of a data frame

I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion <- as.vector(unlist(tapply(DF$Grade, paste(DF$Fully_Paid, DF$Charged_Off, sep = "."), FUN = function(x) {x / sum(x)})))
based on an answer to the same question in a previous post, "Calculate proportions within subsets of a data frame", but I am not having any luck. I am guessing this is because Grade is a character, not a number, in my data.
Based on your comments, here is the code you should try for each column:
DF$Charged_off_proportion <- as.vector(unlist(tapply(DF$Charged_Off, DF$Grade, FUN = function(x) {x / sum(x)})))
Similarly, you can change the column name for the other columns, for example:
DF$Fully_Paid_proportion <- as.vector(unlist(tapply(DF$Fully_Paid, DF$Grade, FUN = function(x) {x / sum(x)})))
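One caveat: tapply() returns its results grouped by the levels of Grade, so the unlisted vector lines up with the rows of DF only when DF is already sorted by Grade. A minimal sketch of an alternative uses ave(), which returns the values in the original row order. The toy data below is made up for illustration.

```r
# Toy data, deliberately NOT sorted by Grade.
DF <- data.frame(
  Grade       = c("B", "A", "B", "A", "C"),
  Charged_Off = c(2, 1, 6, 3, 5)
)

# Share of each row within its Grade, returned in row order.
DF$Charged_off_proportion <- ave(
  DF$Charged_Off, DF$Grade,
  FUN = function(x) x / sum(x)
)
```

Within each Grade the new column sums to 1.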

How to do two sorting in r when order matters

I have a data frame consisting of three variables: momentum returns (numeric), volatility (factor), and market state (factor). Volatility and market state each have two levels: volatility has the levels high and low, and market state has the levels positive and negative. I want to make a two-way sorted table containing the mean of momentum returns for every combination.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar<-r_sample_factor(30,x=c("positive","negative"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by @r2evans, if you want the mean for every sorted case you can apply the following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want the simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with the help of the data.table package, which is typically faster on large data.
library(data.table)
df<-as.data.table(df)
## if you want the results as a table (one row per vol/mar combination)
df[, .(mean_mom = mean(mom)), by = .(vol, mar)]
## if you want the means as a plain numeric vector
df[, mean(mom), by = .(vol, mar)]$V1
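For completeness, a two-way table of means can also be sketched in one step with base R's tapply(). The data below is a small deterministic stand-in for the random sample above, so the numbers are illustrative only.

```r
# Small deterministic stand-in for the simulated data.
df <- data.frame(
  mom = 1:8,
  vol = factor(rep(c("high", "low"), each = 4)),
  mar = factor(rep(c("positive", "negative"), times = 4))
)

# Rows are volatility levels, columns are market states,
# and each cell holds the mean momentum return for that pair.
means <- with(df, tapply(mom, list(vol = vol, mar = mar), mean))
```

The result is a plain matrix, so means["high", "positive"] picks out a single cell.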

Assigning a Value to All Points in a List

I have a set of true/false data I need to prepare for a chi-squared analysis in R. Currently it's organized by time of day in several lists. What would be the best way to add a variable to each of these lists for time of day, fill in each list's points with the time they were collected, then combine them into one table?
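One possible approach, assuming the data sits in a named list with one logical vector per time of day (the list name obs and its contents are hypothetical), is to repeat each name once per observation and flatten everything into a single data frame:

```r
# Hypothetical input: one logical vector per time of day.
obs <- list(
  morning = c(TRUE, FALSE, TRUE),
  evening = c(FALSE, FALSE, TRUE, TRUE)
)

# Label every observation with the name of the list it came from.
combined <- data.frame(
  time  = rep(names(obs), lengths(obs)),
  value = unlist(obs, use.names = FALSE)
)

# Contingency table of time of day by outcome,
# which is the shape chisq.test() expects.
counts <- table(combined$time, combined$value)
```

With realistic sample sizes, counts can then be passed to chisq.test().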

Find closest datapoint to a date in another dataframe

I have two data frames. One data frame is called Measurements and has 500 rows. The columns are PatientID, Value and M_Date. The other data frame is called Patients and has 80 rows and the columns are PatientID, P_Date.
Each patient ID in Patients is unique. For each row in Patients, I want to look at the set of measurements in Measurements with the same PatientID (there are maybe 6-7 per patient).
From this set of measurements, I want to identify the one with M_Date closest to P_Date. I want to append this value to Patients in a new column. How do I do this? I tried using ddplyr but can't figure out how to access two data frames at once within this function.
You probably want the survival package (install.packages("survival")) and its neardate function to solve your problem.
There is a good example in its documentation.
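If you prefer not to add a package, here is a minimal base R sketch of the same idea. It is not neardate itself; it simply picks, for each patient, the measurement with the smallest absolute gap in days. The toy data and the ClosestValue column name are made up for illustration.

```r
# Toy data with the column names from the question.
Patients <- data.frame(
  PatientID = c(1, 2),
  P_Date    = as.Date(c("2020-01-10", "2020-03-01"))
)
Measurements <- data.frame(
  PatientID = c(1, 1, 1, 2, 2),
  Value     = c(5.1, 6.2, 7.3, 8.4, 9.5),
  M_Date    = as.Date(c("2020-01-01", "2020-01-12", "2020-02-01",
                        "2020-02-20", "2020-03-15"))
)

# For each patient, find the measurement whose date is closest to P_Date.
Patients$ClosestValue <- vapply(seq_len(nrow(Patients)), function(i) {
  m <- Measurements[Measurements$PatientID == Patients$PatientID[i], ]
  if (nrow(m) == 0) return(NA_real_)  # patient without measurements
  gap <- abs(as.numeric(m$M_Date - Patients$P_Date[i]))  # gap in days
  m$Value[which.min(gap)]
}, numeric(1))
```

With 80 patients and 500 measurements a row-by-row loop like this is fast enough; ties go to the first matching measurement.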

Create new numeric columns from 1 string column

I'm a beginner. I have a dataset taken from here which consists of people's profiles with different attributes, profession being one of them. There are 12 professions: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
I'd like to apply K-NN to that dataset, so I'd like to split the profession column into 12 new columns, and assign 1 to the person's own profession and 0 to the other 11 professions.
I tried the foreach package and for loops, unsuccessfully. I can't get foreach to work, and I don't know what to do next from the following code:
jobs <- data[,2]
jobs
for (job in jobs) {
  print(job)
  # No idea how to create the new columns here, based on if conditionals
}
What would be the best way to do this?
Thanks.
You can certainly solve the problem using a for loop, but may I suggest a solution that is more efficient in the long run: the reshape2 package (https://cran.r-project.org/web/packages/reshape2/).
I have the data from bank-full.csv read into R in the object bank. Next, the reshape2 package needs to be installed and loaded:
install.packages("reshape2")
library(reshape2)
The data can then be shaped into a format where observations are on rows and jobs on columns. An accessory id column is first added to the data:
bank$id<-1:nrow(bank)
Then columns 2 and 18 (job and id) are taken from the data frame bank and cast into the aforementioned form:
tmp<-dcast(bank[,c(2, 18)], id~job, length)
That should give a new data frame tmp, where each job has its own column. Since every id is present in the data only once, the length function that dcast uses to aggregate the data puts just zeros and ones in every column.
Last, these new columns can be added to the original data set:
bank<-cbind(bank[,-18], tmp[,-1])
Negative subscripts inside the square brackets drop the columns from the dataset, so this simultaneously lets you get rid of the id column.
Another, even more efficient way to do this is to use the function model.matrix:
bank2<-cbind(bank, model.matrix( ~ 0 + job, bank))
This should give you a data frame with each job as a new column. Note, however, that it changes the column names a bit (it adds job to the beginning of the job column names).
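If the job prefix is unwanted, it can be stripped before binding the columns. Here is a small sketch on made-up data (the three-row bank data frame is a stand-in for illustration, not the real bank-full.csv):

```r
# Toy stand-in for the bank data.
bank <- data.frame(
  age = c(30, 45, 52),
  job = factor(c("student", "retired", "student"))
)

# One 0/1 indicator column per job level; "~ 0 +" drops the intercept
# so that every level gets its own column.
dummies <- model.matrix(~ 0 + job, data = bank)

# Remove the "job" prefix that model.matrix adds to the column names.
colnames(dummies) <- sub("^job", "", colnames(dummies))

bank2 <- cbind(bank, dummies)
```

Each row of bank2 then has a 1 in exactly one of the new job columns.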
