I have 15 data frames containing information about patient visits for a group of patients; they are named FA.OFC1, FA.OFC2, etc. Example below:
ID      sex  date        age.yrs  important.var  etc...
xx_111  F    xx.xx.xxxx  x.x      x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I reconstructed the name of the data frame I need to look up for the additional information by pasting it together after the which.max return:
max  data.frame
8    df2
Specifically, the names() function gave me the name of the column holding the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
# the columns containing the episode severity scores
severity.cols <- c(5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50)
sev.scores[52:53] <- as.data.frame(cbind(
  row.names(sev.scores[severity.cols]),
  apply(sev.scores[severity.cols], 1,
        function(x) names(sev.scores[severity.cols])[which(x == max(x))])
))
Now, however, I would like to tell R to take the data frame name stored in that column and search that data frame for the entry in the 5th column.
So in the example above, the information about the most severe episode is stored in data frame 2 (df2), and I need to take the information from the 5th column (important.var) for that record and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from each element of a list:
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
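A runnable form of that idea might look like this (a sketch, assuming every data frame in my_dfs contains the numeric column important.var from the example above):
# For each data frame, keep the row with the largest absolute important.var:
lapply(my_dfs, function(x) x[which.max(abs(x$important.var)), ])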
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
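For instance, the concatenation step might be sketched as follows, assuming the FA.OFC1 ... FA.OFC15 naming from the question; the new source column records which table each row came from:
library(dplyr)
# mget() gathers the 15 visit tables into a named list; bind_rows() stacks them:
all_data <- bind_rows(mget(paste0("FA.OFC", 1:15)), .id = "source")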
Then, once you have all the data in a single data.frame, you can create the summary trivially by simply selecting the most severe episode for each patient:
sev_scores = all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the dplyr package. You can perform an equivalent analysis using different packages (e.g. data.table) or base R functions, but I strongly recommend dplyr: the resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
Related
I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop:
for (name in 1:length(NamesList)) {
  name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ]
}
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is, on each iteration, generate a new data frame named NamesList[name], containing the subset of the main data frame where NameProper matches the name in the list for that iteration.
This seems like it should be simple; I just can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but incorrect in the sense that it is widely deprecated by experienced R users. Instead, what should be done is to create a single list with named elements, each of which contains the data from a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames(split(DigiNONA, DigiNONA$NameProper), NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
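Per-respondent statistics then follow from a single call over that list; for example, assuming the AScore column named in the question is numeric:
# Mean AScore for each respondent, no assign() needed:
sapply(named_Dlist, function(d) mean(d$AScore, na.rm = TRUE))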
I have a very large dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (which, as far as I can tell, can't be translated directly to PySpark):
Order by ID in ascending order.
Check whether the next row has the same ID as the current one; if yes, n = n + 1; if no, run my analysis and save the results. Rinse and repeat for the whole length of the data frame.
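An equivalent grouped form of that R approach might look like this (a sketch with hypothetical names; the analysis itself is elided):
df <- df[order(df$ID), ]      # step 1: sort by ID
for (id in unique(df$ID)) {
  rows <- df[df$ID == id, ]   # all rows sharing this ID
  if (nrow(rows) >= 30) {
    # ...run the analysis on rows and save the results...
  }
}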
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If you need just one column:
from pyspark.sql import functions as F

grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data per ID, you can use a UDF to perform your elaborate analysis
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))
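Since the question only cares about groups with at least 30 rows, you could also drop the smaller groups before applying the UDF by filtering on the size of the collected list:
grouped_df = grouped_df.filter(F.size('for_analysis') >= 30)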
I am trying to download some time series data about euro swaps (EUSA10 Curncy, for example) in R using the Rblpapi package, but I am encountering the following problems:
If I try to download, for example, 2y, 5y, 10y and 30y swap rates with include.non.trading.days=FALSE, the resulting time series are for some reason of different lengths and I receive an error message about it. If, on the other hand, I set the non-trading-days option to TRUE, I get time series of similar length that can then be cleaned up using na.omit().
The format in which the data is downloaded is messy. I would like a data frame in which the first column is the date, the second column is the first security, the third column is the second security, and so forth. Instead, what I get is [date][security1][date][security2]...[date][securityN]. Any suggestions on how to solve this?
Below are a few quick lines I wrote as an example:
# Load package
library(Rblpapi)
# Connect to Bloomberg
blpConnect()
# Declaring securities
sec<-c("eusa2 curncy", "eusa5 curncy", "eusa10 curncy")
# Declaring field to be dowloaded
flds<-"PX_LAST"
data<-as.data.frame(bdh(sec,flds,start.date=as.Date("2019-08-18"),end.date=as.Date("2020-08-18"), include.non.trading.days=TRUE"))
The Rblpapi manual states that Rblpapi::bdh returns:
"A list with as many entries as there are entries in securities; each list entry contains a data.frame with one row per observation and as many columns as entries in fields. If the list is of length one, it is collapsed into a single data frame. Note that the order of securities returned is determined by the backend and may be different from the order of securities in the securities field."
So I'd suggest you rbind the data, then reshape it to get the result you want. A fast way to do this is data.table::rbindlist: it takes a list as input and returns a data.table containing all entries; if idcol=TRUE it appends an .id column showing which data.frame each row came from. This method also works when the data.frames resulting from the Rblpapi::bdh call have different numbers of rows.
# Declaring field to be downloaded
flds <- "PX_LAST"
# Loading the data from the API
l <- bdh(sec, flds, start.date = as.Date("2019-08-18"),
         end.date = as.Date("2020-08-18"), include.non.trading.days = TRUE)
# the names of the securities columns as returned by the api
securities <- paste0("eusa", c(2,5,10,15,30), ".curncy.",flds)
# row binding the resulting list
dt <- data.table::rbindlist(l, idcol = TRUE, use.names = FALSE)
# idcol = TRUE appends an id column (.id) to the resulting data.table
# use.names = FALSE because the columns of the data.frames have different names
# remaking the .id column so it holds the name of the securities column it refers to
dt[, .id := securities[.id]]
# casting to a wider data.table
data.table::dcast(dt, eusa2.curncy.date ~ .id, value.var = securities[1])
# eusa2.curncy.date is the column that defines a group of observations
# .id supplies the names of the new columns
# securities[1], i.e. eusa2.curncy.PX_LAST, is the column that contains the values
Data used
As I don't have access to a Bloomberg API endpoint, I created the following mock data, which resembles the output of bdh:
col.names <- paste0("eusa", rep(c(2, 5, 10, 15, 30), each = 2), ".curncy.", rep(c(flds, "date"), 5))
l <- rep(list(data.frame(rnorm(200), 1:200)), 5)
for (i in 1:length(l)) colnames(l[[i]]) <- col.names[(2 * i - 1):(2 * i)]
BACKGROUND:
I have two data frames that two researchers have used to manually input time data tracking how a group of participants reach a consensus in making a decision. We are doing this by logging the time of each preference statement as well as the preference itself (ranked by priority).
QUESTION:
My question is: what functions or packages can I use to show me the discrepancies between the two data tables?
EXAMPLE:
discrepancies <- show_discrepancies(myData1, myData2)
discrepancies
outputExample1: provides a data frame containing only the entries that do not match.
outputExample2: provides a combined data frame, with entries from both myData1 and myData2, in which the entries that do not have a match are highlighted red.
Either output would work, but I would prefer outputExample1 if possible.
Assuming the two data frames have the same structure, you can get outputExample1 using the following function:
show_discrepancies <- function(data1, data2) {
  data <- rbind(data1, data2)
  # keep only rows that appear exactly once, i.e. entries without a match in the other table
  data[!(duplicated(data) | duplicated(data, fromLast = TRUE)), ]
}
Also take a look at the join functions available in the dplyr package.
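For instance, dplyr::anti_join() yields outputExample1 directly; a sketch, assuming both tables share the same column names:
library(dplyr)
# Rows of myData1 with no exact match in myData2, and vice versa:
discrepancies <- bind_rows(
  anti_join(myData1, myData2, by = colnames(myData1)),
  anti_join(myData2, myData1, by = colnames(myData2))
)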
I have 10 topics. For each topic I have a results_topic_df data frame with 2 columns: index, which holds the name of another data frame, and var_name, which holds the name of a variable inside the data frame indicated by index.
What I want to do is take the corresponding original data frame (whose name is indicated by results_topic_df$index), look at the value of results_topic_df$var_name in the same row, go to that original data frame, and copy the relevant variable into a data frame named container_df.
Eventually container_df will contain only the selected variables from all the data frames that appear in results_topic_df.
I want to repeat this procedure for each one of the 10 topics.
I have tried to do this with a loop, but because my data frames' names change, I got really confused with all the combinations of assign(), paste0(), and eval(). Is there a simpler way to accomplish my goal? Thanks.
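Following the list-based advice elsewhere on this page, one sketch (assuming the original data frames live in the global environment under the names stored in results_topic_df$index):
# Gather the source data frames into a named list once:
idx  <- as.character(results_topic_df$index)
vars <- as.character(results_topic_df$var_name)
dfs  <- mget(unique(idx))
# Pull each requested variable out of its data frame by name:
selected <- Map(function(df_name, var_name) dfs[[df_name]][[var_name]], idx, vars)
# If the selected variables all have the same length, collect them:
container_df <- as.data.frame(selected)
Repeating this for each of the 10 topics is then just a matter of running the same steps on each topic's results_topic_df, with no assign() or eval() needed.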