How to analyse rows with similar IDs in PySpark?

I have a very large Dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (which, from what I can tell, can't be translated directly to PySpark):
Order the rows by ID in ascending order.
Check whether the next row has the same ID as the current one; if yes, n = n + 1; if not, I do my analysis and save the results. Rinse and repeat for the whole length of the data frame.
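For reference, a rough R sketch of the loop described above (analyse() is a placeholder for your own analysis-and-save step, not code from the question):
df <- df[order(df$ID), ]                          # order the rows by ID
n <- 1                                            # length of the current ID block
for (i in seq_len(nrow(df) - 1)) {
  if (df$ID[i + 1] == df$ID[i]) {
    n <- n + 1                                    # still inside the same ID block
  } else {
    if (n >= 30) analyse(df[(i - n + 1):i, ])     # only analyse blocks of at least 30 rows
    n <- 1                                        # reset for the next ID
  }
}
if (n >= 30) analyse(df[(nrow(df) - n + 1):nrow(df), ])  # don't forget the final ID block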

One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If you need just one column:
from pyspark.sql import functions as F
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data per ID, you can use a UDF to perform your elaborate analysis:
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))

Related

Merging Two Datasets on Matched Column in R

I'm an R beginner and I'm trying to merge two datasets and having some trouble with losing data. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System and the data looks like this
image of 10 rows of data from this set
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period.
image of 10 rows of the book ordering dataset
I've named this dataset DOA
I'm unsure how to include the data other than as an image.
(Can also provide the .csvs if needed)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
This dataset is just over 1000 rows. It is also padded because the 000 to 099 Dewey Decimal Classifications were dropping their leading 0s
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets and create a new set called DOA_Call
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head the data, the merge seems to be working properly, but 10,000 rows do not get the DOA_Call data added; they just stay in their original state. This is about 40% of my total dataset, so it is pretty substantial. My first instinct was that it was only putting DDC rows in once, but that would mean I would be missing 23,000 rows, which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best attempt with the information you provided; a short sketch follows the list below. You will need to use:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html)
the stringr library to handle some string variables (https://stringr.tidyverse.org/)
and some familiarity with the tidyverse.
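A minimal sketch under those assumptions, reusing the data frame and column names from your question (the final line is just a hypothetical sanity check):
library(dplyr)
library(stringr)

# Build the three-digit join key in both data frames.
DDC$Call_Category2      <- str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
DOA_data$Call_Category2 <- substr(DOA_data$Call_Category, 1, 3)

# Keep every ordered book and attach the matching Dewey information, if any.
DOA_Call <- left_join(DOA_data, DDC, by = "Call_Category2")

# Books whose key found no match in DDC end up with NA in the DDC columns.
sum(is.na(DOA_Call$Call_Category.y))   # Call_Category from DDC gets the ".y" suffix after the join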
Please keep in mind that the best way to ask on Stack Overflow is by providing a minimal reproducible example.

How to sample a list containing multiple dataframes using lapply in R?

I have this list of data that I created by using split on a dataframe:
dat_discharge = split(dat2,dat2$discharge_id)
I am trying to create a training and test set from this list of data by sampling in order to take into account the discharge id groups which are not at all equally distributed in the data.
I am trying to do this using lapply as I'd rather not have to individually sample each of the groups within the list.
trainlist<-lapply(dat_discharge,function(x) sample(nrow(x),0.75*nrow(x)))
trainL = dat_discharge[(dat_discharge %in% trainlist)]
testL = dat_discharge[!(dat_discharge %in% trainlist)]
I tried emulating this post (R removing items in a sublist from a list) in order to create the testing and training subsets; however, the training list is entirely empty, which I assume means that is not the correct way to do it for a list of data frames?
Is what I am looking to do possible without selecting for the individual dataframes in the list like data_frame[[1]]?
You could use map_dfr from the purrr library instead of lapply (do take into account that you need to install.packages("purrr") and library(purrr) before running the next steps, but maybe you already have it installed since it's a common package).
Then you could use the following code:
library(purrr)   # provides map_dfr and the %>% pipe

dat2$rowid <- 1:nrow(dat2)                       # unique row identifier
dat_discharge <- split(dat2, dat2$discharge_id)  # one data frame per discharge id
trainList <- dat_discharge %>% map_dfr(.f = function(x){
  sampling <- sample(1:nrow(x), round(0.75*nrow(x), 0))  # sample 75% of the row indices
  x[sampling, ]                                          # return the sampled rows
})
testL <- dat2[!(dat2$rowid %in% trainList$rowid), ]
To explain the above code: first of all, I added a unique rowid to dat2 so I know which rows I am sampling and which I am not. This is used in the last line of code to separate the test and train datasets, so that the train and test sets share no rowid.
Then I do the split to create dat_discharge, as you did.
Then, to each data frame inside the dat_discharge list, I apply the function in map_dfr. The map_dfr function works like lapply, except that it "concatenates" the outputs into a single data frame instead of putting each output in a list as lapply does, provided that the output of each iteration is a data frame with the same columns as the first one. Think of it as "okay, I got this data frame, I'm going to bind its rows to the previous result". So the result is just one big data frame.
Inside that function you can notice that I do the sampling a bit differently. I take 75% of the sequence of row numbers of the iteration's data frame; then, with that sampled sequence, I subset the iteration's data frame with x[sampling,], which yields my sampled data frame for that iteration (one of the data frames from the dat_discharge list). The map_dfr then automatically binds those sampled data frames into a single, big data frame instead of putting them in a list as lapply would.
So lastly, I just create the test set as all the rowids from dat2 that are NOT present in the training set.
Hope this serves you well :)
Do note that, if you want to sample 75% of the observations for each id, then each id should have at least 4 observations for this to make sense. Imagine if you only had 1 observation for a particular id! This code would still work (it would simply select that observation), but you really need to think about that implication when you build your statistical model.
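A quick, hypothetical sanity check for that situation (using discharge_id as in the question):
counts <- table(dat2$discharge_id)
counts[counts < 4]   # ids where a 75% split leaves little or nothing for the test set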

How do I merge 2 data frames on R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
set pop combined1 ;
by usubjid trtag2n;
run;
In R, I have tried the following:
duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.
Per the SAS documentation, Combining SAS Data Sets, and confirmed by the SAS guru, @Tom, in the comments above, set with by simply means you are interleaving the datasets. No merge is taking place (merge, by the way, is also a SAS statement, which you do not use):
Interleaving uses a SET statement and a BY statement to combine
multiple data sets into one new data set. The number of observations
in the new data set is the sum of the number of observations from the
original data sets. However, the observations in the new data set are
arranged by the values of the BY variable or variables and, within
each BY group, by the order of the data sets in which they occur. You
can interleave data sets either by using a BY variable or by using an
index.
Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
Do note, however: rbind does not allow unmatched columns between the concatenated data sets. Third-party packages that do allow unmatched columns include plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist (a short sketch follows).
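For example, a sketch of the same interleave with dplyr, which tolerates unmatched columns (reusing the names from the rbind example above):
library(dplyr)

# Stack the two data sets (columns missing from one side are filled with NA), then order the rows.
duration <- bind_rows(pop, combined1) %>%
  arrange(usubjid, trtag2n)   # mirrors the SAS BY statement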

Counting unique subsets of data efficiently

I have a relatively large dataset that I wouldn't qualify as 'big data'. It's around 3 to 5 million rows; because of the size I'm using the data.table library to do analysis.
The dataset (named df, a data.table) can essentially be broken down into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors and have very few values apiece (2 in one, 3 in another, etc...)
2 measurement variables, M_1, and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, has a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. A subset of data consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash key (using digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
Then I loop through each unique set of identifiers and then, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), hash the value / increment the hash.
Afterwards I'm taking that information and putting it back into the data frame.
# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1",..."C_m","M_1","M_2")
# Create the REPETITIONS field in the original data.table structure
df[,REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[,KEY := ""]
# Use the updateHash function to fill datasets
updateHash <- function(val){
  key <- df[ID==val, uniqueFields, with=FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- append(datasets[key], val)
  }
}
# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
  updateHash(id)
}
# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family, this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though it is not blindingly fast. I now have the capability to present some neat analysis revolving around variability in datasets to people who care about it. I don't like that it's hacky and, one way or another, I'm going to clean this up and make it better. Before I do that I want to do my final due diligence and see whether it's simply my Google-fu failing me.

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up to get the additional information by pasting information together after the max is returned:
max data frame
8 df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
idx <- c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)  # column positions used repeatedly below
sev.scores[52:53] <- as.data.frame(cbind(
  row.names(sev.scores[idx]),
  apply(sev.scores[idx], 1, function(x) names(sev.scores[idx])[which(x == max(x))])
))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
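A sketch of that first step, assuming the 15 visit data frames are gathered into a named list (the list below is a placeholder showing only the first two, and "source" is just an illustrative column name):
library(dplyr)

# Stack the visit data frames; .id records which data frame each row came from.
visit_dfs <- list(FA.OFC1 = FA.OFC1, FA.OFC2 = FA.OFC2)   # ...and so on for all 15
all_data  <- bind_rows(visit_dfs, .id = "source")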
Then, once you have all the data in a single data.frame, you can create a summary trivially by simply selecting the most severe episode for each disease type:
sev_scores = all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the ‹dplyr› package. You can perform an equivalent analysis using different packages (e.g. ‹data.table›) or base R functions, but I strongly recommend dplyr: The resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
