How do I merge 2 data frames in R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge using 2 columns and view the data frame "duration", the table is empty: there are no values, only column headers. The viewer says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
set pop combined1 ;
by usubjid trtag2n;
run;
In R, I have tried the following:
duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.

Per the SAS documentation, Combining SAS Data Sets, and as confirmed by the SAS guru @Tom in the comments above, SET with BY simply means you are interleaving the data sets. No merge (which, by the way, is also a SAS statement, one you do not use here) is taking place:
Interleaving uses a SET statement and a BY statement to combine
multiple data sets into one new data set. The number of observations
in the new data set is the sum of the number of observations from the
original data sets. However, the observations in the new data set are
arranged by the values of the BY variable or variables and, within
each BY group, by the order of the data sets in which they occur. You
can interleave data sets either by using a BY variable or by using an
index.
Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
Do note, however, that rbind does not allow unmatched columns between the concatenated data sets. Third-party packages do allow unmatched columns, including plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist (with fill = TRUE).
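To make the interleaving concrete, here is a minimal base-R sketch. The two data frames are made-up stand-ins that borrow the names from the question, the values are invented, and stack_fill is a hypothetical helper (not part of base R or any package) that pads unmatched columns with NA before calling rbind.

```r
# Toy stand-ins for the two data sets; all values are invented.
popr <- data.frame(
  USUBJID = c("S03", "S01", "S02"),
  TRTAG2N = c(2, 1, 1),
  stringsAsFactors = FALSE
)
droppedcol <- data.frame(
  USUBJID  = c("S02", "S01"),
  TRTAG2N  = c(1, 1),
  FUDURAG2 = c(12.5, 30.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack data frames whose columns differ,
# filling the columns a data frame lacks with NA.
stack_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]
  })
  do.call(rbind, padded)
}

# SET without BY: stack the rows.
duration <- stack_fill(popr, droppedcol)

# SET with BY: sort by the BY variables. order() leaves ties in their
# original order, so within a BY group the popr rows stay ahead of the
# droppedcol rows.
duration <- duration[order(duration$USUBJID, duration$TRTAG2N), ]
rownames(duration) <- NULL
```

The result has 3 + 2 = 5 rows, one per input observation, which matches the "sum of the number of observations" behaviour quoted above. A merge or join on the same keys would instead combine matching rows side by side.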

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do on each iteration is generate a new data frame named after NamesList[name], containing the subset of the main data frame where NameProper matches the name in the list for that iteration.
This seems like it should be simple, but I can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but it is widely discouraged by experienced users of R. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
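As a rough illustration of why the single named list is convenient, here is a small sketch on invented data. The DigiNONA below is a made-up stand-in with only two of the columns from the question, and the statistics are chosen purely for illustration.

```r
# Made-up stand-in for the asker's DigiNONA data frame.
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cy", "Bob", "Ann")),
  AScore     = c(10, 20, 30, 40, 50, 60)
)

# split() names each element after a level of the grouping factor.
by_person <- split(DigiNONA, DigiNONA$NameProper)

# One statistic per respondent, without creating a variable per person.
mean_score <- sapply(by_person, function(df) mean(df$AScore))
n_rows     <- sapply(by_person, nrow)

# A single respondent's data frame is still easy to reach by name.
ann <- by_person[["Ann"]]
```

Because sapply walks the list, adding another statistic is one more line rather than another loop over dynamically named objects.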

mutate function in R

I am trying to add a column from one dataframe to another. The data is long repeated measures data, with each ID having two rows. Both my main dataset (d) and my secondary dataset (d2) use the same column (ID) to link cases to participants. However, when I use mutate like this d <- mutate(d, x = d2$x) the column binds to the dataframe but the values are not tied to the ID. This means that data gets mixed up between participants.
Is there a way to make sure that the values are referenced by ID when I add the column?

How to analyse rows with similar IDs in PySpark?

I have a very large Dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (an approach that does not seem to translate to PySpark):
Sort the rows in ascending order.
Check whether the next row has the same ID as the current one. If yes, n = n + 1; if no, I run my analysis and save the results. Rinse and repeat for the whole length of the data frame.
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If just one column:
from pyspark.sql import functions as F
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data per ID, you can use a UDF to perform your elaborate analysis (analyse_udf here stands for a UDF you define yourself):
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))

Merging data frames into another dataframe

I'm working with R. I'm trying to make a data frame that merges three other data frames. Those three data frames have different column names and different numbers of rows (they don't have row names).
I tried originally to do:
Namenewdf <- data.frame(dataframe1, dataframe2, dataframe3)
R raised an error because the data frames have different numbers of rows.
Then I tried with the merge function but it also didn't work.
How do I merge the data frames so that the resulting data frames include the original information of the data frames used as arguments, not filling the 'void' rows from the data frames that have fewer rows?
library(rowr)
finaldataframe<-cbind.fill(dataframe1,dataframe2, dataframe3,fill = NA)
finaldataframe[is.na(finaldataframe)]<-""
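If installing rowr is not an option, a similar result can be sketched in base R by padding each data frame to the longest row count before binding the columns. The three data frames below are invented, and pad_rows is a hypothetical helper rather than a function from any package.

```r
# Made-up data frames with different column names and row counts.
dataframe1 <- data.frame(a = 1:3)
dataframe2 <- data.frame(b = c("x", "y"), stringsAsFactors = FALSE)
dataframe3 <- data.frame(c = c(1.5, 2.5, 3.5, 4.5))

# Hypothetical helper: extend a data frame to n rows; the added rows are NA.
pad_rows <- function(df, n) {
  df <- df[seq_len(n), , drop = FALSE]  # out-of-range row indices yield NA rows
  rownames(df) <- NULL
  df
}

dfs <- list(dataframe1, dataframe2, dataframe3)
n_max <- max(sapply(dfs, nrow))

# Bind the padded data frames side by side.
finaldataframe <- do.call(cbind, lapply(dfs, pad_rows, n = n_max))
```

This leaves the padding as NA. Replacing NA with "" as in the last line of the answer above turns every affected column into character, so it is probably best done only for display.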

How do I loop through multiple data frames in R to create a vector?

This is the code I am currently using to move data from multiple data frames into a time-ordered vector which I then perform analysis on and graph:
TotalLoans <- c(
sum(as.numeric(HCD2001$loans_all)), sum(as.numeric(HCD2002$loans_all)),
sum(as.numeric(HCD2003$loans_all)), sum(as.numeric(HCD2004$loans_all)),
sum(as.numeric(HCD2005$loans_all)), sum(as.numeric(HCD2006$loans_all)),
sum(as.numeric(HCD2007$loans_all)), sum(as.numeric(HCD2008$loans_all)),
sum(as.numeric(HCD2009$loans_all)), sum(as.numeric(HCD2010$loans_all)),
sum(as.numeric(HCD2011$loans_all)), sum(as.numeric(HCD2012$loans_all)),
sum(as.numeric(HCD2013$loans_all)), sum(as.numeric(HCD2014$loans_all)),
sum(as.numeric(HCD2015$loans_all)), sum(as.numeric(HCD2016$loans_all))
)
I do this four more times with similar data frames that also are similarly formatted as:
Varname$year
Is there a way to loop through these 16 data frames, select an individual column, perform a function on it, and put it into a vector? This is what I have tried so far:
AllList <- list(HCD2001, HCD2002, HCD2003, HCD2004, HCD2005, HCD2006, HCD2007, HCD2008, HCD2009, HCD2010, HCD2011, HCD2012, HCD2013, HCD2014, HCD2015, HCD2016)
TotalLoans <- lapply(AllList,
function(df){
sum(as.numeric(df$loans_all))
return(df)
}
)
However, it returns a Large List with every column from the data frames. All the other posts related to this were for modifying data frames, not creating a new vector with modified values of the data frames.
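One possible fix, sketched on invented data: the function passed to lapply ends with return(df), so each element of the result is the whole data frame rather than the sum. Returning the sum itself and using vapply gives a plain numeric vector. The three small data frames below are made-up stand-ins for the sixteen yearly ones.

```r
# Made-up stand-ins for the yearly data frames (only three years shown).
HCD2001 <- data.frame(loans_all = c("10", "20"), stringsAsFactors = FALSE)
HCD2002 <- data.frame(loans_all = c("5", "15", "25"), stringsAsFactors = FALSE)
HCD2003 <- data.frame(loans_all = c("100"), stringsAsFactors = FALSE)

# A named list keeps the year labels attached to the results.
AllList <- list(HCD2001 = HCD2001, HCD2002 = HCD2002, HCD2003 = HCD2003)

# Return the sum, not df; vapply() checks that each result is one number
# and collects the results into a numeric vector.
TotalLoans <- vapply(
  AllList,
  function(df) sum(as.numeric(df$loans_all)),
  numeric(1)
)
```

The same call works for another column by changing df$loans_all, so the four similar blocks could share one small function that takes the column name as an argument.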
