transposing multiple specific elements onto another dataframe - r

I have a challenging problem involving two large datasets that has had me stumped for a while now, and I would love to see some other approaches.
I have two dataframes that I would like to merge without creating duplicated rows. The dataframes are tx (treatment) and ae (adverse effects), each with a column named id that links the data to a particular person.
Data frame for tx:
id<-c("A","A","A","B","B")
stage<-c("peak","trough","roll","peak","trough")
start<-c("01-01-2011","01-03-2011","01-05-2011","01-01-2008","01-07-2008")
end<-c("01-02-2011","01-04-2011","01-05-2012","01-05-2008","01-08-2008")
tx<-data.frame(id,stage,start,end)
tx[] <- lapply(tx, as.character)
tx$start<-as.Date(start, format="%d-%m-%Y")
tx$end<-as.Date(end, format="%d-%m-%Y")
Data frame for ae:
id<-c("A","A","A","A","B","B","B","B")
onset<-c("05-01-2011","10-01-2011","14-02-2011","25-01-2011","11-01-2008","11-04-2008","14-07-2008","22-07-2008")
ae<-data.frame(id,onset)
ae[]<-lapply(ae, as.character)
ae$onset<-as.Date(onset, format="%d-%m-%Y")
So far I have tried merge(tx,ae) which does the job but I get duplicates and they are not aligned correctly.
What I need is for each person's ae$onset dates that fall within tx$start <= ae$onset <= tx$end to appear in new columns ae1, ae2, etc. in the tx data frame.
Ultimately I need a transpose function for specific elements that match the specified selection criteria.
Any help would be greatly appreciated as I have not been able to locate an example of this.
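One possible approach, as a minimal base R sketch rather than a definitive answer: for each row of tx, collect the matching onset dates for the same id, then spread them into columns ae1, ae2, and so on. The helper name hits is made up for illustration, and the row-by-row scan is an assumption that favours clarity over speed; for very large data a non-equi join (for example in data.table) may scale better.

```r
# Rebuild the example data from the question.
tx <- data.frame(
  id    = c("A", "A", "A", "B", "B"),
  stage = c("peak", "trough", "roll", "peak", "trough"),
  start = as.Date(c("01-01-2011", "01-03-2011", "01-05-2011",
                    "01-01-2008", "01-07-2008"), format = "%d-%m-%Y"),
  end   = as.Date(c("01-02-2011", "01-04-2011", "01-05-2012",
                    "01-05-2008", "01-08-2008"), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)
ae <- data.frame(
  id    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  onset = as.Date(c("05-01-2011", "10-01-2011", "14-02-2011", "25-01-2011",
                    "11-01-2008", "11-04-2008", "14-07-2008", "22-07-2008"),
                  format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# For each tx row, the sorted onset dates of the same person that fall
# inside [start, end]. `hits` is a hypothetical helper list.
hits <- lapply(seq_len(nrow(tx)), function(i) {
  sort(ae$onset[ae$id == tx$id[i] &
                ae$onset >= tx$start[i] &
                ae$onset <= tx$end[i]])
})

# Add one column per position (ae1, ae2, ...), padding with NA.
n_max <- max(lengths(hits))
for (k in seq_len(n_max)) {
  kth <- vapply(hits, function(h) {
    if (length(h) >= k) as.character(h[k]) else NA_character_
  }, character(1))
  tx[[paste0("ae", k)]] <- as.Date(kth)
}

print(tx)
```

The number of rows in tx is unchanged, so no duplicates are created; onset dates that fall outside every treatment interval are simply dropped.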

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do on each iteration is generate a new data frame named after NamesList[name], containing the subset of the main data frame where NameProper matches the name in the list for that iteration.
This seems like it should be simple; I just can't figure out how to get R to dynamically generate data frames with different names on each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but incorrect in the sense that it is widely deprecated by experienced users of R. Instead what should be done is to create a single list with named elements each of which contains the data from a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
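To illustrate the list-based approach on something runnable, here is a small sketch with made-up data (the values in DigiNONA below are invented; only the column names NameProper and AScore come from the question). Note that split() already names the list elements after the factor levels, so per-respondent statistics can be computed directly with sapply.

```r
# Toy stand-in for the real data frame; values are invented for illustration.
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cy", "Bob")),
  AScore     = c(1, 2, 3, 4, 5)
)

# One data frame per respondent, named by factor level.
by_name <- split(DigiNONA, DigiNONA$NameProper)

# A statistic per respondent, computed over the list.
mean_scores <- sapply(by_name, function(d) mean(d$AScore))

print(names(by_name))
print(mean_scores)
```

Keeping everything in one list means any further per-person step is another lapply or sapply call, with no need to look objects up by constructed names.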

Merging DataFrames without Duplicate Columns in R

I have three data sets that I would like to merge. My data is on companies in the SP500 and their corporate political activity. Of my datasets, one is named PAC, one is named Lobby and one is named BoardData. The datasets all have three columns in common: "ultorg", "sector", and "subind" as well as other columns unique to each dataset.
I would like to merge the three Excel documents so that there is only one of each of those columns, with all of the other variables appended to it.
I have tried doing this on my own but ran into a few problems. Specifically, I get several columns for ultorg/sector/subind (the variables the datasets have in common), and entries are repeated in places where they shouldn't be. For example, my board data only goes until 2015 but my lobbying data goes until 2000. Using the incorrect/incomplete code below, I get rows where a company's board data from 2015 is filled in for the years 2000-2015. I would like the years without a Board entry (2000-2015) to just have NA entered.
Here's the current code.
library(tidyverse)
library(janitor)
library(glue)
library(readxl)
setwd("~/Desktop/thesis")
PAC <- read_excel("PAC.xlsx")
Lobby <- read_excel("Lobby.xlsx")
BoardData <- read_excel("BoardData.xlsx")
alldata <- left_join(PAC, Lobby, by="ultorg")
alldata <- left_join(alldata, BoardData, by="ultorg")
Thank you so much for any help you are able to give me! I really appreciate it and am able to answer any questions regarding my data.
Merging by ultorg, sector, and subind will work. If the datasets also share a column that indicates the date, you should include that column in the join keys as well. The choice between full_join, left_join, and the other joins depends on your purpose. The code below is one example you could try.
BoardData %>%
full_join(PAC, by = c("ultorg", "sector", "subind")) %>%
full_join(Lobby, by = c("ultorg", "sector", "subind"))
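As a rough illustration of why a shared date column matters, here is a toy sketch. The column names year, pac_total, and board_size are assumptions made up for this example, since the question does not show the real column names. With year among the join keys, years that have no board entry get NA instead of a repeated 2015 value.

```r
library(dplyr)

# Hypothetical toy data; `year`, `pac_total`, and `board_size` are assumed names.
PAC <- tibble(ultorg = "Acme", sector = "Tech", subind = "Software",
              year = c(2014, 2015), pac_total = c(100, 150))
BoardData <- tibble(ultorg = "Acme", sector = "Tech", subind = "Software",
                    year = 2015, board_size = 9)

# Joining on all shared columns keeps a single copy of each key column.
keys <- c("ultorg", "sector", "subind", "year")
alldata <- left_join(PAC, BoardData, by = keys)

print(alldata)
```

Here the 2014 row has NA for board_size, and there is only one ultorg, sector, and subind column in the result.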

How to find and replace values of a data frame in R?

I have two data frames. One contains the region names for different regional codes. The other contains GTU data with the regional code for each record. I want to replace the regional codes in the second data frame with the region names from the first, using R. Please help!
If your data frames are called df1 and df2, then you could try
df_final <- merge(df1, df2, by="regional code")
Then df_final will contain everything (both the region names and the regional codes).
Later you can delete the regional code column if you want by using
df_final[, !(colnames(df_final) %in% c("regional code"))]
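Here is a small runnable sketch of the same merge-then-drop idea. The data and the column names regional_code, region, and value are invented for illustration, since the question does not show the real ones.

```r
# Lookup table: code -> region name (invented values).
df1 <- data.frame(regional_code = c(1, 2, 3),
                  region = c("North", "South", "East"))

# Data that refers to regions by code (invented values).
df2 <- data.frame(regional_code = c(2, 3, 2, 1),
                  value = c(10, 20, 30, 40))

# Attach the region name to every row of df2 via the shared code column.
df_final <- merge(df1, df2, by = "regional_code")

# Drop the code column now that the name is present.
df_final <- df_final[, !(colnames(df_final) %in% "regional_code")]

print(df_final)
```

Note that merge() sorts the result by the key by default, so the row order of df2 is not preserved.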

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up to get the additional information by pasting information after the max return:
max data frame
8 df2
Specifically, the names() function gave me the name of the column with the most severe episode in the summary data frame sev.scores, which also tells me which data frame to look up:
sev.scores[52:53] <- as.data.frame(cbind(row.names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)]),apply(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)],1,function(x) names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)])[which(x==max(x))])))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
Then, once you have all the data in a single data.frame, you can create a summary trivially by selecting the most severe episode for each patient:
sev_scores = all_data %>%
group_by(ID) %>%
filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the ‹dplyr› package. You can perform an equivalent analysis using different packages (e.g. ‹data.table›) or base R functions, but I strongly recommend dplyr: The resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
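To make the combined-table idea concrete, here is a minimal sketch with two invented visit tables. The column name severity and the values are assumptions for illustration, and source is a made-up name for the column that records which original data frame each row came from.

```r
library(dplyr)

# Invented stand-ins for two of the per-visit data frames.
FA.OFC1 <- data.frame(ID = c("p1", "p2"), severity = c(2, 5))
FA.OFC2 <- data.frame(ID = c("p1", "p2"), severity = c(7, 3))

# Stack them, recording the originating data frame in `source`.
all_data <- bind_rows(list(FA.OFC1 = FA.OFC1, FA.OFC2 = FA.OFC2),
                      .id = "source")

# Keep the single most severe row per patient.
sev_scores <- all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(severity)) %>%
  ungroup()

print(sev_scores)
```

Each row of sev_scores carries every column of the winning episode, including where it came from, so no second lookup by data frame name is needed.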

Appending a variable to existing data depending on rows

So I have two columns. I need to add a third column. However, this third column needs to have A for a specified number of rows at the start, and B for a specified number of rows after that.
I tried adding this data_exercise_3 ["newcolumn"] <- (1:6)
but it didn't work. Can someone tell me what I'm doing wrong please?
Looks like you're having a problem with subsetting a data frame correctly. I'd recommend reviewing this concept before you proceed much further, either via a Coursera course or on a website like this UCLA R learning module on subsetting data frames. Subsetting is a crucial component of data wrangling with R, and you'll go much faster with a solid foundation of the basics!
You can assign values to a subset of a data frame by using [row, column] notation. Since your data frame is called data_exercise_3 and the column you'd like to assign values to is called 'newcolumn', then assuming you want the first 6 rows as 'A' and the next 3 as 'B', you could write it like this:
data_exercise_3[1:6,'newcolumn'] <- 'A'
data_exercise_3[7:9,'newcolumn'] <- 'B'
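Alternatively, you can build the whole column in one step, provided the vector is as long as the data frame has rows (12 in this example):
data_exercise_3$category <- c(rep("A",6),rep("B",6))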
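For a self-contained version you can run, here is a sketch using a made-up nine-row data frame (the columns x and y are invented). It uses the times argument of rep() to produce six "A" values followed by three "B" values in one call.

```r
# Invented nine-row data frame with two existing columns.
data_exercise_3 <- data.frame(x = 1:9, y = 9:1)

# Six "A" values followed by three "B" values; the lengths must sum to nrow().
data_exercise_3$newcolumn <- rep(c("A", "B"), times = c(6, 3))

print(data_exercise_3)
```

The original attempt with 1:6 likely failed because the replacement vector must match the number of rows in the data frame.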
