Adding a column with function values to Spark dataframes with SparkR - r

I am using SparkR on a project that includes R and Spark in its technology stack.
I have to create new columns with boolean values returned from validation functions. I can do this easily with Spark dataframes and a single expression like:
sdf1$result <- sdf1$value == sdf2$value
The problem is when I have to compare two dataframes of different lengths.
What is the best way to operate on the sdf1 and sdf2 dataframes with a function and assign the result to a new column of sdf1? Let's suppose that I want to generate a column with the minimum length between sdf1 and sdf2.

If your dataframes have different lengths, I logically assume that you have some column(s) that determine how the values in the two dataframes line up. You will have to perform a join between the two dataframes on those columns (see SparkR::merge / SparkR::join) and then do your comparison operation to create the new column on the resulting dataframe.
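For example, a minimal sketch of that approach, assuming both dataframes have a hypothetical key column "id" and a column "value" to compare (these names are illustrative, not from the question):
library(SparkR)
# Join the two Spark dataframes on the key column(s) that line the rows up.
joined <- merge(sdf1, sdf2, by = "id")
# SparkR::merge suffixes duplicated column names (by default "_x" and "_y");
# inspect names(joined) to confirm the exact names it produced.
joined$result <- joined$value_x == joined$value_y
head(joined)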

Related

How to sample a list containing multiple dataframes using lapply in R?

I have this list of data that I created by using split on a dataframe:
dat_discharge = split(dat2,dat2$discharge_id)
I am trying to create a training and test set from this list of data by sampling in order to take into account the discharge id groups which are not at all equally distributed in the data.
I am trying to do this using lapply as I'd rather not have to individually sample each of the groups within the list.
trainlist<-lapply(dat_discharge,function(x) sample(nrow(x),0.75*nrow(x)))
trainL = dat_discharge[(dat_discharge %in% trainlist)]
testL = dat_discharge[!(dat_discharge %in% trainlist)]
I tried emulating this post (R removing items in a sublist from a list) to create the testing and training subsets; however, the training list is entirely empty, which I assume means this is not the correct way to do it for a list of dataframes?
Is what I am looking to do possible without selecting for the individual dataframes in the list like data_frame[[1]]?
You could use map_dfr from the purrr library instead of lapply (keep in mind that you need to run install.packages("purrr") and library(purrr) before the next steps, but maybe you already have it installed since it's a common package).
Then you could use the following code:
dat2$rowid <- 1:nrow(dat2)                       # unique row identifier
dat_discharge <- split(dat2, dat2$discharge_id)  # split by discharge id, as you did
trainList <- dat_discharge %>% map_dfr(.f = function(x) {
  sampling <- sample(1:nrow(x), round(0.75 * nrow(x), 0))  # 75% of this group's row indices
  x[sampling, ]                                            # return the sampled rows
})
testL <- dat2[!(dat2$rowid %in% trainList$rowid), ]
To explain the above code: first, I added a unique rowid to dat2 so I know which rows are sampled and which are not. It is used in the last line of code to separate the train and test datasets, so that the train dataset doesn't share any rowid with the test set.
Then I do the split to create dat_discharge, as you did.
Then, to each dataframe inside the dat_discharge list, I apply the function in map_dfr. map_dfr works like lapply, except that it "concatenates" the outputs into a single dataframe instead of putting each output in a list, provided that each iteration returns a dataframe with the same columns as the first one. Think of it as: "Okay, I got this dataframe, I'm going to bind its rows to the previous result." So the result is just one big dataframe.
Inside that function you can see that I do the sampling a bit differently. I sample 75% of the row indices of the iteration's dataframe, then use that sampled sequence to subset the dataframe with x[sampling,], which yields the sampled dataframe for that iteration (one of the dataframes from the dat_discharge list). map_dfr then automatically binds those sampled dataframes into a single big dataframe instead of putting them in a list as lapply does.
Lastly, I create the test set as all the rows of dat2 whose rowid is NOT present in the train set.
Hope this serves you well :)
Do note that if you want to sample 75% of the observations for each id, then each id should have at least 4 observations for this to make sense. Imagine if you only had one observation for a particular id! The code would still work (it would simply select that observation), but you need to keep that implication in mind when you build your statistical model.
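If you prefer to stay with lapply from base R, as in your original attempt, the same idea works by binding the sampled list back into one dataframe afterwards; a minimal sketch under the same assumptions about dat2 and discharge_id:
dat2$rowid <- 1:nrow(dat2)
dat_discharge <- split(dat2, dat2$discharge_id)
# Sample 75% of the rows of each group, then stack the results into one dataframe.
train_list <- lapply(dat_discharge, function(x) x[sample(nrow(x), round(0.75 * nrow(x))), ])
trainL <- do.call(rbind, train_list)
# The test set is everything whose rowid is not in the training set.
testL <- dat2[!(dat2$rowid %in% trainL$rowid), ]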

How to get only a few columns from several data frames obtained from using the lapply function?

I have this function "get_animals" that retrieves data for several specimens of different species of animals. It works by being given a vector of species names, and it retrieves the data regarding those species (location, DNA sequences, ...). The thing is that the database I'm using can't handle a query with too many species names at once, so I'm trying to use lapply to get them one by one.
I tried this:
species_list<-as.list(as.character(unique(df$species_name)))
e<-lapply(species_list, function (x) get_animals(animal_names=x))
The thing is that the lapply returns a series of data frames with too many columns for each species name in "species_list", and what I wanted was only two columns from each data frame, and then I aimed to fuse all those data frames into a single one.
I tried to unlist the result from the lapply function:
e<-unlist(e)
But it didn't work because it just returned all the occurrences for the first column of each data frame.
Thanks in advance for any answers
If we need to subset the columns, use either the column index
lapply(species_list, function (x) get_animals(animal_names=x)[c(1, 5)])
Or column name
lapply(species_list, function (x)
get_animals(animal_names=x)[c("species_name", "location")])
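Since you also want to fuse the per-species results into a single data frame, you can row-bind the list that lapply returns; a small sketch using the hypothetical column names from the answer above:
e <- lapply(species_list, function(x)
  get_animals(animal_names = x)[c("species_name", "location")])
# Stack the per-species two-column data frames into one.
combined <- do.call(rbind, e)
head(combined)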

How do I merge 2 data frames on R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
set pop combined1 ;
by usubjid trtag2n;
run;
On R, I have tried the following
duration <- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.
Per the SAS documentation, Combining SAS Data Sets, and confirmed by the SAS guru, @Tom, in the comments above, set with by simply means you are interleaving the datasets. No merge (which, by the way, is also a SAS statement that you do not use here) is taking place:
Interleaving uses a SET statement and a BY statement to combine multiple data sets into one new data set. The number of observations in the new data set is the sum of the number of observations from the original data sets. However, the observations in the new data set are arranged by the values of the BY variable or variables and, within each BY group, by the order of the data sets in which they occur. You can interleave data sets either by using a BY variable or by using an index.
Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
Do note, however, that rbind does not allow unmatched columns between the concatenated data sets. Third-party packages that do allow unmatched columns include plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist.
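If popr and droppedcol do not share exactly the same columns, a minimal sketch using one of the packages mentioned above (dplyr::bind_rows is just one of the three options; the uppercase column names are taken from the question):
library(dplyr)
# Stack the two data frames; columns missing from one side are filled with NA.
duration <- bind_rows(popr, droppedcol)
# Order the rows by the BY variables, as the SAS step does.
duration <- duration[order(duration$USUBJID, duration$TRTAG2N), ]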

Compare two dataframes to extract the new columns

I have two dataframes. As an example:
iris1<-iris[1:3]
iris2<-iris[1:4]
I want to extract the new column by comparing the two dataframes.
I have tried using the compare function from the eponymous package but no joy; it seems that comparing rows is more common. Is there an easy way to do this?
We can use setdiff
setdiff(union(names(iris1), names(iris2)), names(iris1))
Or, if one of the datasets has more columns than the other while including all the columns of the other:
setdiff(names(iris2), names(iris1))
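If you want the new column itself rather than just its name, you can use the result of setdiff to subset the wider data frame; a small sketch with the iris1/iris2 example above:
new_cols <- setdiff(names(iris2), names(iris1))   # "Petal.Width" in this example
iris_new <- iris2[, new_cols, drop = FALSE]       # data frame containing only the new column(s)
head(iris_new)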

How to copy multiple columns to a new dataframe in R

I have a data set (df2) with 400 columns and thousands of rows. The columns all have different names but all have either 'typeP' or 'typeR' at the end of their names. They are not ordered sequentially (e.g. P,P,P,P,R,R,R,R) but randomly (P,P,R,R,R,P,R,P etc.). I want to create a new data frame with just those columns whose names have 'typeP' in them.
I'm very new to R and so far I have only managed to find the positions of those columns using: grep("typeP",colnames(df2)). Any help would be appreciated!
After we get the index, we can use that to subset the initial dataset
df3 <- df2[grep("typeP",colnames(df2))]
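Since 'typeP' appears at the end of the column names, anchoring the pattern with "$" avoids accidental matches elsewhere in a name; a small sketch (df2 as in the question):
df3 <- df2[grep("typeP$", colnames(df2))]                      # "$" anchors the match to the end
# or, equivalently, with base R's endsWith():
df3 <- df2[, endsWith(colnames(df2), "typeP"), drop = FALSE]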
