How to merge multi level datasets and repeat country-year observations? - r

I have a survey data frame with individual-level observations. I want to merge it with another data frame of country-level variables (one row per country-year). The country-level rows should be repeated to match the individual-level rows that belong to each country-year. I tried this using the rep() function, but it didn't work. I don't know if I was clear, but I would be very grateful if anyone could help.
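A minimal sketch of one way to do this with base R's merge(), which repeats each country-year row for every matching individual, so rep() is not needed. The data frames and the column names (id, country, year, trust, gdp_pc) are made up for illustration and are not from the question.

```r
# Hypothetical individual-level survey data (one row per respondent)
individuals <- data.frame(
  id      = 1:5,
  country = c("BR", "BR", "BR", "AR", "AR"),
  year    = c(2010, 2010, 2012, 2010, 2010),
  trust   = c(3, 4, 2, 5, 1)
)

# Hypothetical country-level data (one row per country-year)
countries <- data.frame(
  country = c("BR", "BR", "AR"),
  year    = c(2010, 2012, 2010),
  gdp_pc  = c(11.3, 12.4, 10.4)
)

# Left join on country and year: every respondent is kept, and the
# country-year values are repeated for each matching respondent
merged <- merge(individuals, countries, by = c("country", "year"), all.x = TRUE)
```

Note that merge() sorts the result by the key columns by default, so the row order may differ from the original survey data. With all.x = TRUE, respondents without a matching country-year are kept and get NA in the country-level columns.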

Related

Make a new dataset that has the same variables and value types, but random observations based on a source dataset

I have five tables, each with between 20 and 30 variables and 1,500 to 2,300 observations. I want to make a new dataset similar in values to the original dataset but not the same. For example, if one variable is gender, I want the new dataset to have gender data, but the values would differ.
I'm unsure what functions to use or even how to search for some methods. My Google searches are coming up with how to make subsets of data, but I need new datasets with the same number of variables and observations with random values based on the values of the source dataset.
Any advice would be helpful.
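One simple approach, sketched here as an assumption rather than a recommended method, is to resample each column independently with replacement. This keeps the variable names, types and row count, and draws every value from the source column, but it breaks any relationship between columns. The src data frame and the resample_column helper are hypothetical.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical source table
src <- data.frame(
  gender = factor(c("f", "m", "f", "f", "m", "m")),
  age    = c(23, 35, 41, 29, 52, 38)
)

# Draw length(col) values from the column itself, with replacement.
# Indexing by sampled positions keeps the column's type (factor, numeric, ...).
resample_column <- function(col) col[sample.int(length(col), replace = TRUE)]

# Apply the helper to every column and rebuild a data frame
synthetic <- as.data.frame(lapply(src, resample_column))
```

The same call can be repeated for each of the five tables. If the relationships between variables need to be preserved, a different technique would be required.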

Obtaining proportions within subsets of a data frame

I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion <- as.vector(unlist(tapply(DF$Grade, paste(DF$Fully_Paid, DF$Charged_Off, sep = "."), FUN = function(x) { x / sum(x) })))
based on an answer to the same question in a previous post, "Calculate proportions within subsets of a data frame", but I am not having any luck. I am guessing this is because Grade is a character, not a number, in my data.
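Based on your comments, here is the code you should try for each column. It uses ave() rather than unlist(tapply(...)), because ave() returns the values in the original row order, whereas unlist(tapply(...)) returns them grouped by Grade and only lines up if DF is already sorted by Grade.
DF$Charged_off_proportion <- ave(DF$Charged_Off, DF$Grade, FUN = function(x) x / sum(x))
Similarly, you can change the column names for the other columns, for example:
DF$Fully_Paid_proportion <- ave(DF$Fully_Paid, DF$Grade, FUN = function(x) x / sum(x))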
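As a small worked example, the sketch below computes within-Grade proportions on made-up data using base R's ave(), which keeps each result on the row it belongs to even when the rows are not sorted by Grade. The values are illustrative only, and the column names assume the underscore spelling used in the answer.

```r
# Hypothetical data, deliberately not sorted by Grade
DF <- data.frame(
  Grade       = c("A", "B", "A", "B", "A"),
  Fully_Paid  = c(10, 5, 30, 15, 60),
  Charged_Off = c(1, 2, 3, 6, 4)
)

# Share of each row within its Grade group
share <- function(x) x / sum(x)

DF$Fully_Paid_proportion  <- ave(DF$Fully_Paid,  DF$Grade, FUN = share)
DF$Charged_Off_proportion <- ave(DF$Charged_Off, DF$Grade, FUN = share)
```

Within each Grade, the new columns sum to 1.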

Find closest datapoint to a date in another dataframe

I have two data frames. One data frame is called Measurements and has 500 rows. The columns are PatientID, Value and M_Date. The other data frame is called Patients and has 80 rows and the columns are PatientID, P_Date.
Each patient ID in Patients is unique. For each row in Patients, I want to look at the set of measurements in Measurements with the same PatientID (there are maybe 6-7 per patient).
From this set of measurements, I want to identify the one with M_Date closest to P_Date. I want to append this value to Patients in a new column. How do I do this? I tried using ddplyr but can't figure out how to access two data frames at once within this function.
You probably want to install the survival package with install.packages("survival") and use its neardate function to solve your problem.
It has a good example in the documentation.
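For comparison, here is a base R sketch that does not use neardate: for each patient it picks the measurement with the smallest absolute gap in days between M_Date and P_Date. The column names follow the question, while the sample values and the ClosestValue column name are made up.

```r
# Hypothetical data with the columns described in the question
Patients <- data.frame(
  PatientID = c(1, 2, 3),
  P_Date    = as.Date(c("2020-01-10", "2020-02-01", "2020-03-15"))
)
Measurements <- data.frame(
  PatientID = c(1, 1, 1, 2, 2),
  Value     = c(5.1, 6.2, 7.3, 8.4, 9.5),
  M_Date    = as.Date(c("2020-01-01", "2020-01-12", "2020-02-20",
                        "2020-01-20", "2020-02-03"))
)

# For each row of Patients, look up that patient's measurements and
# keep the Value whose date is nearest (before or after) to P_Date
Patients$ClosestValue <- vapply(seq_len(nrow(Patients)), function(i) {
  m <- Measurements[Measurements$PatientID == Patients$PatientID[i], ]
  if (nrow(m) == 0) return(NA_real_)  # patient has no measurements
  gap_days <- abs(as.numeric(m$M_Date - Patients$P_Date[i]))
  m$Value[which.min(gap_days)]
}, numeric(1))
```

If two measurements are equally close, which.min() keeps the first one in row order.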

In R, How do you separate data in a column of a data frame based on data from another column?

I have a data set with a number of survey variables. I am looking at the data of one column, but need to split it by the factors in another column. The survey asked gender, and also asked how much the person smoked. I need to compare how much males smoke vs females, but I cannot figure out how to split the data in the column based on the information in another column.
Can someone help?
I don't know if this is what you want:
# some reproducible data
survey <- data.frame(gender = c('m','m','f','m','f','f','m','m','f'),
                     smoke = c('little','much','much','little','no','some','little','little','much'))
# gender vs smoke:
table(survey)
# or:
table(survey$gender, survey$smoke)
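If the goal is to actually split the column rather than only tabulate it, a possible sketch uses split() to separate the smoke answers by gender and prop.table() to turn the counts into within-gender shares. It reuses the sample data from the answer, and the names by_gender and shares are illustrative.

```r
# Same sample data as in the answer above
survey <- data.frame(
  gender = c("m", "m", "f", "m", "f", "f", "m", "m", "f"),
  smoke  = c("little", "much", "much", "little", "no", "some",
             "little", "little", "much")
)

# A list with one vector of smoke answers per gender
by_gender <- split(survey$smoke, survey$gender)

# Row-wise proportions: share of each smoke level within each gender
shares <- prop.table(table(survey$gender, survey$smoke), margin = 1)
```

by_gender$m and by_gender$f can then be summarised or compared separately, and each row of shares sums to 1.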

Shuffle the rows of a data frame to obtain 10 shuffled data frames

I have a data frame with 12 continuous variables and one grouping categorical response factor, containing two classes (G8 and V4).
I want to shuffle the rows in the data frame 10 times, so that I get 10 different variations of the data frame. I want to use each version of the data frame to test a classifier algorithm. The code I am using is:
data(LDA.scores)
shuffle.cross.validation <- LDA.scores[sample(nrow(LDA.scores[2:13])), ]
However, when I use this code, the categorical response factor strings transform into zero values when the data frame is shuffled. This defeats the object because the response variable is the grouping factor to classify the continuous variables. Thank you if anyone has a solution.
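A minimal sketch of one way to get 10 shuffled copies, assuming the response is a factor in the first column and the 12 continuous variables follow it. The scores data frame below is simulated as a stand-in for LDA.scores, whose real contents are not shown in the question. Indexing whole rows with a permuted row index keeps every column, including the factor, intact.

```r
set.seed(42)  # only to make the sketch reproducible

# Simulated stand-in for LDA.scores: a two-class factor plus 12 numeric columns
scores <- data.frame(
  class = factor(rep(c("G8", "V4"), each = 5)),
  matrix(rnorm(10 * 12), nrow = 10)
)

# Build a list of 10 data frames, each with the rows in a new random order
shuffled <- replicate(10, scores[sample(nrow(scores)), ], simplify = FALSE)
```

Each element, for example shuffled[[1]], is a complete data frame that can be passed to the classifier. If the factor still turns into zeros, the cause is probably a later step that coerces the data, such as converting the data frame to a numeric matrix.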
