How to handle null entries in SparkR

I have a SparkSQL DataFrame.
Some entries in this data are empty, but they don't behave like NULL or NA. How could I remove them? Any ideas?
In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods.
Thanks.

SparkR Column provides a long list of useful methods including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
Id Age
1 1 21
2 2 18
3 3 30
4 4 NA
> filter(people, isNotNull(people$Age)) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
> filter(people, isNull(people$Age)) %>% head()
Id Age
1 4 NA
Please keep in mind that there is no distinction between NA and NaN in SparkR.
If you prefer operations on a whole data frame there is a set of NA functions including fillna and dropna:
> fillna(people, 99) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
4 4 99
> dropna(people) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
Both can be restricted to a subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null values a row must have (minNonNulls):
> people_with_names_local <- data.frame(
Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
4 4 NA <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
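Similarly, both functions accept a cols argument to restrict which columns are considered. A short sketch (my own example, not from the original answer, output omitted):
> # restrict the null check / the fill to specific columns
> dropna(people_with_names, cols="Name") %>% head()
> fillna(people_with_names, 99, cols="Age") %>% head()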

It is not the nicest workaround, but if you cast them to strings, the missing values are stored as "NaN" and you can then filter on that. A short example:
testFrame <- createDataFrame(sqlContext, data.frame(a=c(1,2,3),b=c(1,NA,3)))
testFrame$c <- cast(testFrame$b,"string")
resultFrame <- collect(filter(testFrame, testFrame$c!="NaN"))
resultFrame$c <- NULL
This omits the entire row where the element in column b is missing.

Related

How to change NA into 0 based on other variable / how many times it was recorded

I am still new to R and need help. I want to change the NA values in variables x1, x2, x3 to 0 based on the value of count. count specifies the number of observations, and x1, x2, x3 stand for visits to the site (replications); the value in each 'x' variable is the number of species found. However, not all sites were visited 3 times; the variable count tells us how many times each site was actually visited. I want to distinguish actual NAs from real 0s (which mean no species were found): change the NA to 0 if the site was actually visited, and keep it NA if the site was not visited. For example, from the dummy data, the 'zhask' site was visited 2 times, so the NA in x1 for zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So the table needs to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things and tried to write my own function, but it is not working:
for(i in 1:nrow(df))
{
  if( is.na(df$x1[i]) && (i < df$count[i]))
  {df$x1[i]=0}
  else
  {df$x1[i]=df$x1[i]}
}
This is the script for the dummy data frame:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!
One way would be to apply a function over all of your count columns. Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
  ifelse(idx <= count & is.na(col), 0, col)
}, df[, cols], seq_along(cols), MoreArgs=list(count=df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns together with each column's index. We also pass in the count values each time (since they are the same for all columns, they go in the MoreArgs= parameter). The result of the mapply call is then used to replace the columns with the updated values.
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1"=1, "x2"=2, "x3"=3)
df %>%
  mutate(across(starts_with("x"), ~if_else(cols[cur_column()] < count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column, which doesn't seem to be otherwise available when using across().
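As an alternative sketch (my own assumption, not part of the answer above), the index can also be recovered from cur_column() with match(), so no pre-built lookup vector is needed; here I use <= so a column whose index equals count is also treated as visited:
# cols2 is a hypothetical helper vector naming the visit columns
cols2 <- c("x1", "x2", "x3")
df %>%
  mutate(across(all_of(cols2),
                ~ if_else(match(cur_column(), cols2) <= count & is.na(.x), 0, .x)))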
But a more "tidy" way to tackle this problem would be to pivot your data to a "tidy" (long) format first. Then you can clean the data more easily and pivot back if necessary.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols=starts_with("x")) %>%
  mutate(index=readr::parse_number(name)) %>%
  mutate(value=if_else(index < count & is.na(value), 0, value)) %>%
  select(-index) %>%
  pivot_wider(names_from=name, values_from=value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA
Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
1) selecting the variables/columns and storing them in vars
2) flagging the NA cells within those variables with is.na(df[vars])
3) col(df[vars]) returns a column number for every cell, which is checked against whether it is less than or equal to df$count in the corresponding row (see the short illustration below)
4) the values meeting both of the above criteria are overwritten (<-) with 0
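To make the col() step concrete, here is a small illustration (my own sketch, not part of the original answer):
# col() returns the column number of every cell, so it can be compared
# element-wise against df$count, which is recycled down the columns
col(df[vars])
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    3
# [3,]    1    2    3
# [4,]    1    2    3
# [5,]    1    2    3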
This could be yet another solution, using purrr::pmap:
purrr::pmap is used for row-wise operations when applied to a data frame. It enables us to iterate over multiple arguments at the same time. So here c(...) refers to all corresponding elements of the selected variables (all except site) in each row.
I think the rest of the solution is pretty clear but please let me know if I need to explain more about this.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(output = pmap(df[-1], ~ {
    x <- head(c(...), -1)
    inds <- which(is.na(x))
    req <- tail(c(...), 1) - sum(!is.na(x))
    x[inds[seq_len(req)]] <- 0
    x
  })) %>%
  select(site, output, count) %>%
  unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2

Subset specific row and last row from data frame

I have a data frame which contains data relating to the scores of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
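To see how this picks out the last rows (my own illustration, assuming Data is ordered by ID as in the example):
rle(Data$ID)$lengths
# [1] 6 7 3           # run lengths: number of rows per ID
cumsum(rle(Data$ID)$lengths)
# [1]  6 13 16        # cumulative sums = row numbers of the last row for each ID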
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
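For reference, on the example data the intermediate grouped result looks like this (my own illustration, not output from the original answer):
df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]
#    ID V1
# 1:  1  5
# 2:  1  6
# 3:  2 11
# 4:  2 12
# 5:  2 13
# 6:  3 16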
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
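To make this concrete, the positions selected by that logical vector on the example data would be (my own illustration):
keep <- as.logical(ave(df$Score, df$ID,
                       FUN = function(i) abs(i) > 5 | seq_along(i) == length(i)))
which(keep)
# [1]  5  6 11 12 13 16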
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Remove duplicates making sure of NA values R

My data set (df) looks like:
ID Name Rating Score Ranking
1 abc 3 NA NA
1 abc 3 12 13
2 bcd 4 NA NA
2 bcd 4 19 20
I'm trying to remove duplicates using
df <- df[!duplicated(df[1:2]),]
which gives,
ID Name Rating Score Ranking
1 abc 3 NA NA
2 bcd 4 NA NA
but I'm trying to get,
ID Name Rating Score Ranking
1 abc 3 12 13
2 bcd 4 19 20
How do I avoid keeping the rows containing NAs when removing duplicates? Some help would be great, thanks.
First, push the NAs to the end with na.last = TRUE when ordering:
df <- df[with(df, order(ID, Name, Score, Ranking, na.last = TRUE)), ]
then remove the duplicates with the fromLast = FALSE argument:
df <- df[!duplicated(df[1:2], fromLast = FALSE), ]
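Put together on the example data, this should leave only the fully populated rows (my own sketch of the expected result, not output from the original answer):
df
#   ID Name Rating Score Ranking
# 2  1  abc      3    12      13
# 4  2  bcd      4    19      20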
Using dplyr
df <- df %>% filter(!duplicated(.[,1:2], fromLast = T))
You could just filter out the observations you don't want with which() and then use the unique() function:
a<-unique(c(which(df[,'Score']!="NA"), which(df[,'Ranking']!="NA")))
df2<-unique(df[a,])
> df2
ID Name Rating Score Ranking
2 1 abc 3 12 13
4 2 bcd 4 19 20
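A closely related variant (my own sketch, not part of the original answer) tests missingness with is.na() directly:
# keep rows where at least one of Score/Ranking is present, then drop exact duplicates
keep <- !is.na(df$Score) | !is.na(df$Ranking)
df2 <- unique(df[keep, ])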

Finding "outliers" in a group

I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to belong to the same person. However, I found that there are Pat_IDs with different ages and both sexes.
Imagine I have a data set like this:
Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)
Case_ID Pat_ID Sex Age
1 1 1 33
2 1 1 33
3 1 1 33
4 1 1 76
5 2 2 19
6 2 2 19
7 2 1 49
8 3 1 15
It was relatively easy to identify Pat_IDs with cases that differ from each other. I found these IDs by calculating an average for age and/or sex (coded as 1 and 2) with the function aggregate and then calculating the difference between the average and each case's age or sex. I would like to automatically remove/identify cases where age or sex deviates from the majority of the cases for a patient ID. In my example I would like to remove cases 4 and 7.
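For illustration, the aggregate-based check described above might look roughly like this (my own sketch, not code from the question):
# average Age and Sex per Pat_ID, then measure how far each case deviates from its patient's average
means <- aggregate(cbind(Age_mean = Age, Sex_mean = Sex) ~ Pat_ID, data = Pat_File, FUN = mean)
checked <- merge(Pat_File, means, by = "Pat_ID")
checked$age_dev <- abs(checked$Age - checked$Age_mean)
checked$sex_dev <- abs(checked$Sex - checked$Sex_mean)
# cases with a clearly non-zero age_dev or sex_dev are the suspicious ones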
You could try
library(data.table)
Using Mode from
Is there a built-in function for finding the mode?
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 5 2 19
#5: 2 6 2 19
#6: 3 8 1 15
Testing other cases,
Pat_File$Sex[6] <- 1
Pat_File$Age[4] <- 16
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 6 1 19
#5: 3 8 1 15
This method works, I believe, though I doubt it's the quickest or most efficient way.
Essentially I split the data frame by your grouping variable, found the 'mode' for the variables you're concerned about, filtered out the observations that didn't match all of the modes, and then stuck everything back together:
library(dplyr) # I used dplyr to 'filter' though you could do it another way
temp <- split(Pat_File, Pat_ID)
Mode.Sex <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Sex)); names(temp1)[temp1 == max(temp1)]})
Mode.Age <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Age)); names(temp1)[temp1 == max(temp1)]})
temp.f <- NULL
for(i in 1:length(temp)){
  temp.f[[i]] <- temp[[i]] %>% filter(Sex==Mode.Sex[[i]] & Age==Mode.Age[[i]])
}
do.call("rbind", temp.f)
# Case_ID Pat_ID Sex Age
#1 1 1 1 33
#2 2 1 1 33
#3 3 1 1 33
#4 5 2 2 19
#5 6 2 2 19
#6 8 3 1 15
Here is another approach using the sqldf package:
1) Create new dataframe (called data_groups) with unique groups based on Pat_ID, Sex, and Age
2) For each unique group, check Pat_ID against every other group and if the Pat_ID of one group matches another group, select the group with lower count and store in new vector (low_counts)
3) Take the new dataframe (data_groups) and take out the Pat_IDs that appear in the new vector (low_counts)
4) Recombine with Pat_File
Here is the code:
library(sqldf)
# Create new dataframe with unique groups based on Pat_ID, Sex, and Age
data_groups <- sqldf("SELECT *, COUNT(*) FROM Pat_File GROUP BY Pat_ID, Sex, Age")
# Create New Vector to Store Pat_IDs with Sex and Age that differ from mode
low_counts <- vector()
# Unique groups
data_groups
for(i in 1:length(data_groups[,1])){
  for(j in 1:length(data_groups[,1])){
    if(i<j){
      k <- length(low_counts)+1
      result <- data_groups[i,2]==data_groups[j,2]
      if(is.na(result)){result <- FALSE}
      if(result==TRUE){
        if(data_groups[i,5]<data_groups[j,5]){low_counts[k] <- data_groups[i,1]}
        else{low_counts[k] <- data_groups[j,1]}
      }
    }
  }
}
low_counts <- as.data.frame(low_counts)
# Take out lower counts
data_groups <- sqldf("SELECT * FROM data_groups WHERE Case_ID NOT IN (SELECT * FROM low_counts)")
Pat_File <- sqldf("SELECT Pat_File.Case_ID, Pat_File.Pat_ID, Pat_File.Sex, Pat_File.Age FROM data_groups, Pat_File WHERE data_groups.Pat_ID=Pat_File.Pat_ID AND data_groups.Sex=Pat_File.Sex AND data_groups.Age=Pat_File.Age ORDER BY Pat_File.Case_ID")
Pat_File
Which provides the following results:
Case_ID Pat_ID Sex Age
1 1 1 1 33
2 2 1 1 33
3 3 1 1 33
4 5 2 2 19
5 6 2 2 19
6 8 3 1 15

How do I remove rows from a data.frame where two specific columns have missing values?

Say I write the following code to produce a dataframe:
name <- c("Joe","John","Susie","Mack","Mo","Curly","Jim")
age <- c(1,2,3,NaN,4,5,NaN)
DOB <- c(10000, 12000, 16000, NaN, 18000, 20000, 22000)
DOB <- as.Date(DOB, origin = "1960-01-01")
trt <- c(0, 1, 1, 2, 2, 1, 1)
df <- data.frame(name, age, DOB, trt)
that looks like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
4 Mack NaN <NA> 2
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
How would I be able to remove rows where both age and DOB have missing values? For example, I'd like a new data frame (df2) to look like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
I've tried the following code, but it deleted too many rows:
df2 <- df[!(is.na(df$age)) & !(is.na(df$DOB)), ]
In SAS, I would just write
WHERE missing(age) ge 1 AND missing(DOB) ge 1 in a DATA step, but obviously R has different syntax.
Thanks in advance!
If you want to remove rows where the two columns (age and DOB) together have more than 1 NA (which, with only two columns, means both are NA), you can do for example:
df[!is.na(df$age) | !is.na(df$DOB),]
which keeps rows where at least one of the two columns is not NA, or
df[rowSums(is.na(df[2:3])) < 2L,]
which keeps rows where the number of NAs in columns 2 and 3 is less than 2 (hence, 1 or 0), or, very similarly:
df[rowSums(is.na(df[c("age", "DOB")])) < 2L,]
And of course there are other options, like what @rawr provided in the comments.
And to better understand the subsetting, check this:
rowSums(is.na(df[2:3]))
#[1] 0 0 0 2 0 0 1
rowSums(is.na(df[2:3])) < 2L
#[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE
You were pretty close
df[!(is.na(df$age) & is.na(df$DOB)), ]
or
df[!is.na(df$age) | !is.na(df$DOB), ]
Maybe this could be easier:
require(tidyverse)
df <- drop_na(df, c("age", "DOB"))
Note, though, that drop_na() removes rows with an NA in any of the listed columns, which is stricter than dropping only the rows where both age and DOB are missing.
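If the goal is strictly "drop only when both are missing", a dplyr variant closer to that (my own sketch, not from the answer above) would be:
library(dplyr)
# keep a row unless age and DOB are both missing (is.na() is TRUE for NaN as well)
df2 <- df %>% filter(!(is.na(age) & is.na(DOB)))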
