Remove duplicates while keeping the rows without NA values in R

My data set (df) looks like this:
ID Name Rating Score Ranking
1 abc 3 NA NA
1 abc 3 12 13
2 bcd 4 NA NA
2 bcd 4 19 20
I'm trying to remove duplicates using
df <- df[!duplicated(df[1:2]),]
which gives,
ID Name Rating Score Ranking
1 abc 3 NA NA
2 bcd 4 NA NA
but I'm trying to get,
ID Name Rating Score Ranking
1 abc 3 12 13
2 bcd 4 19 20
How do I avoid keeping the rows that contain NAs when removing duplicates? Some help would be great, thanks.

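First, sort the data so that the NAs come last within each ID/Name group. Note that na.last = TRUE must be an argument of order(), not of with():
df <- df[with(df, order(ID, Name, Score, Ranking, na.last = TRUE)), ]
Then remove the duplicates, keeping the first occurrence (fromLast = FALSE, which is the default):
df <- df[!duplicated(df[1:2], fromLast = FALSE), ]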
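The two steps above can be tried end to end on the sample data from the question. This is only a minimal sketch: the data frame is retyped by hand, and the names `ordered` and `deduped` are made up for illustration.

```r
# Sample data retyped from the question
df <- data.frame(
  ID      = c(1, 1, 2, 2),
  Name    = c("abc", "abc", "bcd", "bcd"),
  Rating  = c(3, 3, 4, 4),
  Score   = c(NA, 12, NA, 19),
  Ranking = c(NA, 13, NA, 20)
)

# Sort so that rows with NA in Score/Ranking come last within each ID/Name group
ordered <- df[order(df$ID, df$Name, df$Score, df$Ranking, na.last = TRUE), ]

# Keep the first row of each ID/Name pair, which is now the complete one
deduped <- ordered[!duplicated(ordered[c("ID", "Name")]), ]
deduped
```

Because the sort happens first, the result does not depend on the original row order, unlike approaches that simply keep the last duplicate.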

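Using dplyr (this keeps the last row of each ID/Name pair, so it assumes the complete row comes after the NA row, as in the sample data):
library(dplyr)
df <- df %>% filter(!duplicated(.[, 1:2], fromLast = TRUE))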

You could just filter out the observations you don't want with which() and then use the unique() function. Test for missing values with is.na() rather than comparing against the string "NA":
a <- which(!is.na(df[, 'Score']) | !is.na(df[, 'Ranking']))
df2 <- unique(df[a, ])
> df2
ID Name Rating Score Ranking
2 1 abc 3 12 13
4 2 bcd 4 19 20

Related

Accounting for NA using Pivot_longer in R

I'm trying to pivot_longer 34 columns of a data set with about 10,000 rows in R. The data was collected via survey, and each column represents a possible answer to a question. I want to pivot_longer one of the questions, which had 34 possible answers and accounts for 34 of the 107 columns. A column has a value (1) if that answer was selected, and the other 33 columns have NA.
Example subset of data frame for a question with 5 possible answers (df):
ID A B C D E
1 1 NA NA NA NA
2 NA 1 NA NA NA
3 NA NA NA NA 1
4 NA NA NA NA NA
5 NA 1 NA NA NA
I need to get to:
ID Answer
1 A
2 B
3 E
4 NA
5 B
I want to pivot_longer the results to this question, while maintaining all the other columns. The issue occurs because some people didn't answer this question, resulting in all NA's (See row 4).
I'm using the code:
dfNew <- pivot_longer(df, c(A,B,C,D,E), names_to = "Answer", values_drop_na = TRUE)
dfNew
ID Answer
1 A
2 B
3 E
5 B
This removes ID 4 from the data. Not using values_drop_na results in a row for every NA value in A:E. How do I keep ID 4 in the data set, with NA as its value for Answer?
You can use complete() to restore the missing IDs:
library(tidyr)
pivot_longer(df, A:E, names_to = "Answer", values_drop_na = TRUE) %>%
complete(ID = unique(df$ID)) %>%
dplyr::select(-value)
# A tibble: 5 x 2
# ID Answer
# <int> <chr>
#1 1 A
#2 2 B
#3 3 E
#4 4 NA
#5 5 B
You can also use max.col here:
cbind(df[1], answer = names(df)[-1][max.col(!is.na(df[-1])) *
NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)])
This might be quite difficult to understand, so here it is step by step:
1. max.col(!is.na(df[-1])) returns the column index of the non-NA value in each row, but if the row is all NA it returns an arbitrary index.
2. NA^ !rowSums(!is.na(df[-1])) returns NA for rows that are all NA and 1 for rows that have at least one non-NA value.
3. Multiplying the results of steps 1 and 2 gives NA for the all-NA rows and the column index for the rows that have a value.
max.col(!is.na(df[-1])) * NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)
#[1] 1 2 5 NA 2
4. We use these values to subset the column names of df to get the answer.
names(df[-1])[max.col(!is.na(df[-1]))*NA^!rowSums(!is.na(df[-1]), na.rm = TRUE)]
#[1] "A" "B" "E" NA "B"
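For readers who find the NA^ trick hard to follow, the same idea can be sketched with an explicit ifelse(). This is a variant, not the code from the answer: the data frame is retyped from the question, and `answered`, `first_hit`, `has_answer` and `result` are illustrative names.

```r
# Sample data retyped from the question (row 4 answered nothing)
df <- data.frame(
  ID = 1:5,
  A  = c(1, NA, NA, NA, NA),
  B  = c(NA, 1, NA, NA, 1),
  C  = NA_real_,
  D  = NA_real_,
  E  = c(NA, NA, 1, NA, NA)
)

# Logical matrix: TRUE where an answer column is filled in
answered <- !is.na(df[-1])

# Column index of the first TRUE in each row (meaningless for all-NA rows)
first_hit <- max.col(answered, ties.method = "first")

# TRUE for rows with at least one answer
has_answer <- unname(rowSums(answered) > 0)

# Use the column name where there is an answer, NA otherwise
result <- data.frame(
  ID     = df$ID,
  Answer = ifelse(has_answer, names(df)[-1][first_hit], NA)
)
result
```

Setting ties.method = "first" makes the intermediate index deterministic, although the value for all-NA rows is discarded either way.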

Create a "flag" column in a dataset based on a another table in R

I have two datasets: dataset1 and dataset2.
zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"
dataset1 <- read.table(text=zz, header=TRUE)
yy <- "id_customer order_date
1 2018-10
3 2019-06"
dataset2 <- read.table(text=yy, header=TRUE)
dataset2 is the result of a query where I have two columns: id_customer and date (format YYYY-mm).
Those correspond to customers which have a different status than the others in the source dataset (dataset1), for a specified month.
dataset1 is a list of transactions where I have id_customer, id_order and date (format YYYY-mm as well).
I want to enrich dataset1 with a "flag" column for each line set to 1 if the customer id appears in dataset2, during the corresponding month.
I have tried something as follows:
dataset$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
dataset1$date == dataset2$date,
"1", "0")
But I get a warning message that says 'longer object length is not a multiple of shorter object length'.
I understand that but cannot come up with a solution. Could someone please help?
You can add a flag to dataset2 and then use merge(), keeping all rows from dataset1. Applied to Chris' data (dt1 and dt2 from the answer below), whose output is shown here:
dt2$flag <- 1
merge(dt1, dt2, all.x = TRUE)
ID Date flag
1 1 2018-12 NA
2 1 2019-11 NA
3 2 2018-13 NA
4 2 2019-10 NA
5 2 2019-11 1
6 2 2019-12 NA
7 2 2019-12 NA
8 3 2018-12 1
9 3 2018-12 1
10 4 2018-13 1
EDIT:
This seems to work:
Illustrative data:
set.seed(100)
dt1 <- data.frame(
ID = sample(1:4, 10, replace = T),
Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
ID Date
1 2 2019-12
2 2 2019-12
3 3 2018-12
4 1 2018-12
5 2 2019-11
6 2 2019-10
7 4 2018-13
8 2 2018-13
9 3 2018-12
10 1 2019-11
dt2 <- data.frame(
ID = sample(1:4, 5, replace = T),
Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
ID Date
1 2 2019-11
2 4 2018-13
3 2 2019-13
4 4 2019-13
5 3 2018-12
SOLUTION:
The solution uses ifelse to set the flag to 1 (as specified in the OP) when a condition holds. That condition is a match between dt1 and dt2, so we use match. A complicating factor is that the match must hold on two columns at once. Therefore, we use apply with paste0 to collapse the two columns of each row into one compound string and search for matches between these strings; rows without a match get NA:
dt1$flag <- ifelse(!is.na(match(apply(dt1[,1:2], 1, paste0, collapse = " "),
apply(dt2[,1:2], 1, paste0, collapse = " "))), 1, NA)
RESULT:
dt1
ID Date flag
1 2 2019-12 NA
2 2 2019-12 NA
3 3 2018-12 1
4 1 2018-12 NA
5 2 2019-11 1
6 2 2019-10 NA
7 4 2018-13 1
8 2 2018-13 NA
9 3 2018-12 1
10 1 2019-11 NA
To check the results we can compare them with the results obtained from merge:
flagged_only <- merge(dt1, dt2)
flagged_only
ID Date
1 2 2019-11
2 3 2018-12
3 3 2018-12
4 4 2018-13
The data frame flagged_only contains exactly the same four rows as the ones flagged 1 in dt1. Voilà!
It is very easy to add a corresponding flag the data.table way:
# Load library
library(data.table)
# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)
# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]
The result looks as follows:
> dataset1
id_customer id_order order_date flag
1: 1 1 2018-10 1
2: 1 2 2018-11 0
3: 2 3 2019-05 0
4: 3 4 2019-06 1
A bit more manipulation would be needed if the datasets contained the full date/time.
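The same two-column match can also be sketched in base R by building one compound key per row and testing membership with %in%. This is a minimal sketch under the assumption that the key columns are named as in the question's dataset1 and dataset2; `key1` and `key2` are illustrative names.

```r
# Data retyped from the question
dataset1 <- data.frame(
  id_customer = c(1, 1, 2, 3),
  id_order    = 1:4,
  order_date  = c("2018-10", "2018-11", "2019-05", "2019-06")
)
dataset2 <- data.frame(
  id_customer = c(1, 3),
  order_date  = c("2018-10", "2019-06")
)

# One compound key per row: customer id and month pasted together
key1 <- paste(dataset1$id_customer, dataset1$order_date)
key2 <- paste(dataset2$id_customer, dataset2$order_date)

# %in% returns one TRUE/FALSE per row of dataset1, so no recycling occurs
dataset1$flag <- as.integer(key1 %in% key2)
dataset1
```

Unlike ==, which compares the two vectors position by position and recycles the shorter one, %in% checks each row of dataset1 against every key in dataset2, which is why the length warning from the question does not appear.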

Value of a certain column based on multiple conditions in two data frames in R

As shown above, there are two data frames, df1 and df2.
If you look at btime in df1, there are NAs.
I want to fill the btime NAs for every unique where stnseq = 1, so only the first NA of each unique will be filled.
The value I would like to fill in is in df2. The condition would be: for each unique with boardstation = 8501970, add up the values in the departure column.
I have tried the aggregate function, but I do not know how to restrict it to boardstation 8501970.
Thanks for any help.
If I understood the question correctly, then this might help.
library(dplyr)
df2 %>%
group_by(unique) %>%
summarise(departure_sum = sum(departure[boardstation==8501970])) %>%
right_join(df1, by="unique") %>%
mutate(btime = ifelse(is.na(btime) & stnseq==1, departure_sum, btime)) %>%
select(-departure_sum) %>%
data.frame()
Since the sample data is only available as an image, I made up my own data as below:
df1
unique stnseq btime
1 1 1 NA
2 1 2 NA
3 2 1 NA
4 2 2 200
df2
unique boardstation departure
1 1 8501970 1
2 1 8501970 2
3 1 123 3
4 2 8501970 4
5 2 456 5
6 3 900 6
Output is:
unique stnseq btime
1 1 1 3
2 1 2 NA
3 2 1 4
4 2 2 200
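Since the question mentions aggregate, here is a base R sketch of the same logic without dplyr. It assumes the made-up data above (retyped here); `station_rows`, `dep` and `to_fill` are illustrative names.

```r
# Made-up data retyped from the answer above
df1 <- data.frame(
  unique = c(1, 1, 2, 2),
  stnseq = c(1, 2, 1, 2),
  btime  = c(NA, NA, NA, 200)
)
df2 <- data.frame(
  unique       = c(1, 1, 1, 2, 2, 3),
  boardstation = c(8501970, 8501970, 123, 8501970, 456, 900),
  departure    = 1:6
)

# Restrict df2 to the station of interest before aggregating
station_rows <- df2[df2$boardstation == 8501970, ]

# Sum of departure per unique, for that station only
dep <- aggregate(departure ~ unique, data = station_rows, FUN = sum)

# Rows of df1 to fill: btime missing and first stop of the trip
to_fill <- is.na(df1$btime) & df1$stnseq == 1

# Look up the aggregated value for each row that needs filling
df1$btime[to_fill] <- dep$departure[match(df1$unique[to_fill], dep$unique)]
df1
```

Filtering before calling aggregate is what answers the "only boardstation 8501970" part; a unique with no rows for that station would simply stay NA.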

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches the previous one. That being said, I want the result to look like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number "1" because rows 1 and 4 had an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking a lot but so far no luck. Thanks
Here is an idea using the zoo package. We basically fill the NAs in both directions and set to NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I was able to do this with the na.locf function. Basically, I use na.locf from the zoo package to replace each NA with the latest previous non-NA value and store the result in a column. Using the same function with fromLast = TRUE, the NAs are replaced with the next non-NA value, and I store that in another column. I then compare these two columns, and wherever the results in a row do not match I replace them with NA.
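The fill-in-both-directions idea described above can also be sketched without zoo, in a way that tolerates leading and trailing NAs. The helpers `locf` and `fill_between_equal` are hypothetical names written for this illustration, not functions from any package.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only)
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # 0 before the first non-NA value
  c(NA, x[!is.na(x)])[idx + 1]   # position 1 maps leading NAs to NA
}

# Fill an NA only when the nearest non-NA values on both sides agree
fill_between_equal <- function(x) {
  fwd <- locf(x)                 # filled from the previous non-NA value
  bwd <- rev(locf(rev(x)))       # filled from the next non-NA value
  ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, x)
}

# Data from the question
filled <- fill_between_equal(c(1, NA, NA, 1, NA, NA, 3))

# A vector with leading and trailing NAs, which stay untouched
edges <- fill_between_equal(c(NA, 1, NA, 1, NA, 3, NA))
```

Checking both fills for NA before comparing them keeps the ifelse() condition free of NAs, so values at the edges fall back to the original vector.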

How to handle null entries in SparkR

I have a SparkSQL DataFrame.
Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas?
In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.
Thanks.
SparkR Column provides a long list of useful methods including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
Id Age
1 1 21
2 2 18
3 3 30
4 4 NA
> filter(people, isNotNull(people$Age)) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
> filter(people, isNull(people$Age)) %>% head()
Id Age
1 4 NA
Please keep in mind that there is no distinction between NA and NaN in SparkR.
If you prefer operations on a whole data frame there is a set of NA functions including fillna and dropna:
> fillna(people, 99) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
4 4 99
> dropna(people) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
Both can be adjusted to consider only a subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null columns a row must have to be kept:
> people_with_names_local <- data.frame(
Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
4 4 NA <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
It is not the nicest workaround, but if you cast the values to strings, the missing ones are stored as "NaN" and you can then filter them out. A short example:
testFrame <- createDataFrame(sqlContext, data.frame(a=c(1,2,3),b=c(1,NA,3)))
testFrame$c <- cast(testFrame$b,"string")
resultFrame <- collect(filter(testFrame, testFrame$c!="NaN"))
resultFrame$c <- NULL
This omits the entire row where the element in column b is missing.

Resources