Remove duplicates while keeping NA in R

I have data that looks like the following:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
print(a)
    ID score
1    A     1
2    B     2
3    C     3
4    C     3
5 <NA>     5
6 <NA>     6
I am trying to remove duplicates without R treating <NA> as duplicates to get the following:
b<-data.frame(ID=c("A","B","C",NA,NA),score=c(1,2,3,5,6),stringsAsFactors=FALSE)
print(b)
    ID score
1    A     1
2    B     2
3    C     3
4 <NA>     5
5 <NA>     6
I have tried the following:
b<-a[!duplicated(a$ID),]
library(dplyr)
b<-distinct(a,ID)
print(b)
Both treat <NA> as a duplicate ID and remove one instance, but I want to keep all instances of <NA>. Thoughts? Thank you!

A straightforward approach is to break the original data frame into two parts: rows where ID is NA and rows where it is not. Perform the distinct filter on the non-NA part, then combine the data frames back together:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
aprime<-a[!is.na(a$ID),]
aNA<-a[is.na(a$ID),]
b<-aprime[!duplicated(aprime$ID),]
b<-rbind(b, aNA)
With a little work, this can be reduced to one or two lines of code.
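For instance, the split-and-recombine logic collapses to a single subsetting expression: duplicated() flags the second NA as a duplicate, but is.na() lets us keep the NA rows anyway (a sketch using the a from above):

```r
a <- data.frame(ID = c("A", "B", "C", "C", NA, NA),
                score = c(1, 2, 3, 3, 5, 6),
                stringsAsFactors = FALSE)
# keep the first occurrence of each ID, plus every row whose ID is NA
b <- a[!duplicated(a$ID) | is.na(a$ID), ]
```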

Using dplyr:
a %>% group_by(ID, score) %>% distinct()
# A tibble: 5 x 2
# Groups: ID, score [5]
ID score
<chr> <dbl>
1 A 1
2 B 2
3 C 3
4 <NA> 5
5 <NA> 6

Found a very simple way to do this using the base duplicated() function.
b<-a[!duplicated(a$ID, incomparables = NA),]
Setting incomparables = NA tells duplicated() to treat NA values as incomparable, so they are never flagged as duplicates and all of them are kept in the result.
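To see the effect on the flags themselves, compare what duplicated() returns with and without the argument:

```r
x <- c("A", "C", "C", NA, NA)
duplicated(x)                      # the second NA is flagged as a duplicate
duplicated(x, incomparables = NA)  # NAs are never flagged, so both rows survive
```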


copy values from different columns based on conditions (r code)

I have data like the one in the picture, where there are two columns (Cday, Dday) with some missing values.
No row has values in both columns; each row has a value in one column or the other, or in neither.
I want to create a column "new" that copies the value from whichever column has a number.
Really appreciate any help!
Since no row has a value for both, you can just sum up the two existing columns. Assume your dataframe is called df.
df$new <- rowSums(df[, 2:3], na.rm = TRUE)
This will sum the rows, removing NAs, and should give you what you want. (Note: you may need to adjust the column numbering if you have more columns than what you've shown.)
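One caveat worth noting: with na.rm=TRUE, a row where both columns are NA sums to 0 rather than NA. A small self-contained sketch (the column names are assumed from the question's picture):

```r
df <- data.frame(id = 1:3, Cday = c(1, NA, NA), Dday = c(NA, 3, NA))
df$new <- rowSums(df[, 2:3], na.rm = TRUE)
# row 3, where both Cday and Dday are NA, gets 0 rather than NA
df$new
```

If an all-NA row should stay NA, the coalesce approach below preserves that.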
The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
new <- df %>% mutate(new = coalesce(Dday, Cday))
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1

Compare lists in dataframes based on personal code, shorten one lists if longer

I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare dyad_id 1 in both data frames and then shorten the longer one by deleting its last row (across all columns).
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
Example of the final dataset:
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5
Maybe we can use table to calculate the frequencies of the IDs in both data frames, assuming the same IDs appear in both. Take the element-wise minimum of the two counts using pmin, then repeat the names based on those frequencies.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
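If the goal is to actually truncate the longer data frame per dyad, the same tab can drive a split/head/recombine step (a sketch, assuming both frames are sorted by dyad id, and re-creating the question's data so it is self-contained):

```r
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A <- data.frame(dyad_id_A, fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B, fw_quantiles_b)

# per-dyad minimum row count across the two speakers
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))

# keep only the first tab[id] rows of each dyad in each frame
df_A_trim <- do.call(rbind, Map(head, split(df_A, df_A$dyad_id_A), tab))
df_B_trim <- do.call(rbind, Map(head, split(df_B, df_B$dyad_id_B), tab))
```

Both trimmed frames then have the same number of rows per dyad, matching dyad_id_AB above.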

R Order only one factor level (or column if after) to affect order long to wide (using spread)

I have a problem after changing my dataset from long to wide (using spread, from the tidyr library on the Result_Type column). I have the following example df:
Group<-c("A","A","A","B","B","B","C","C","C","D", "D")
Result_Type<-c("Final.Result", "Verification","Test", "Verification","Final.Result","Fast",
"Verification","Fast", "Final.Result", "Test", "Final.Result")
Result<-c(7,1,8,7,NA,9,10,12,17,50,11)
df<-data.frame(Group, Result_Type, Result)
df
Group Result_Type Result
1 A Final.Result 7
2 A Verification 1
3 A Test 8
4 B Verification 7
5 B Final.Result NA
6 B Fast 9
7 C Verification 10
8 C Fast 12
9 C Final.Result 17
10 D Test 50
11 D Final.Result 11
In the column Result_Type there are many possible result types, and some datasets contain Result_Types that do not occur in others. However, one level, Final.Result, does occur in every dataset.
Also: This is example data but the actual data has many different columns, and as these differ across the datasets I use, I used spread (from the tidyr library) so I don't have to give any specific column names other than my target columns.
library("tidyr")
df_spread<-spread(df, key = Result_Type, value = Result)
Group Fast Final.Result Test Verification
1 A <NA> 7 8 1
2 B 9 NA <NA> 7
3 C 12 17 <NA> 10
4 D <NA> 11 50 <NA>
What I would like is that, once I convert the dataset from long to wide, Final.Result is the first column; how the rest of the columns are arranged doesn't matter. So I would like it to be like this (without calling any of the other spread columns by name, or using order index numbers):
Group Final.Result Fast Test Verification
1 A 7 <NA> 8 1
2 B NA 9 <NA> 7
3 C 17 12 <NA> 10
4 D 11 <NA> 50 <NA>
I saw some answers indicating you can reverse the order of the spread columns, or turn off spread's ordering, but that doesn't ensure Final.Result is always the first of the spread columns.
I hope I am making myself clear, it's a little complicated to explain. If someone needs extra info I will be happy to explain more!
spread creates columns in the order of the key column's factor levels. Within the tidyverse, forcats::fct_relevel is a convenience function for rearranging factor levels. The default is that the level(s) you specify will be moved to the front.
library(dplyr)
library(tidyr)
...
levels(df$Result_Type)
#> [1] "Fast" "Final.Result" "Test" "Verification"
Calling fct_relevel will put "Final.Result" as the first level, keeping the rest of the levels in their previous order.
reordered <- df %>%
mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result"))
levels(reordered$Result_Type)
#> [1] "Final.Result" "Fast" "Test" "Verification"
Adding that into your pipeline puts Final.Result as the first column after spreading.
df %>%
mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result")) %>%
spread(key = Result_Type, value = Result)
#> Group Final.Result Fast Test Verification
#> 1 A 7 <NA> 8 1
#> 2 B NA 9 <NA> 7
#> 3 C 17 12 <NA> 10
#> 4 D 11 <NA> 50 <NA>
Created on 2018-12-14 by the reprex package (v0.2.1)
One option is to refactor Result_Type to put final.result as the first one:
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result",
                                    as.character(unique(df$Result_Type)[!unique(df$Result_Type) == "Final.Result"])))
spread(df, key = Result_Type, value = Result)
Group Final.Result Verification Test Fast
1 A 7 1 8 NA
2 B NA 7 NA 9
3 C 17 10 NA 12
4 D 11 NA 50 NA
If you'd like you can use this opportunity to also sort the rest of the columns whichever way you want.
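For instance, to put Final.Result first and sort the remaining levels alphabetically before spreading (re-creating the question's df so the sketch is self-contained):

```r
Group <- c("A","A","A","B","B","B","C","C","C","D","D")
Result_Type <- c("Final.Result","Verification","Test","Verification","Final.Result","Fast",
                 "Verification","Fast","Final.Result","Test","Final.Result")
Result <- c(7,1,8,7,NA,9,10,12,17,50,11)
df <- data.frame(Group, Result_Type, Result)

# Final.Result first, then the remaining levels in alphabetical order
lv <- unique(as.character(df$Result_Type))
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result", sort(setdiff(lv, "Final.Result"))))
levels(df$Result_Type)
```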

Use a vector to create empty columns in a data frame

I currently have an integer vector, exampleVector, and a data frame, exampleDF, and I would like to add each element of exampleVector as a new column of NA values in exampleDF. For illustration, I currently have:
exampleVector <- 1:6
exampleDF <- data.frame(First=1:4, Second=4:7,Third=7:10)
exampleDF
First Second Third
1 1 4 7
2 2 5 8
3 3 6 9
4 4 7 10
And what I would like to be able to create is
exampleDF
First Second Third 1 2 3 4 5 6
1 1 4 7 <NA> <NA> <NA> <NA> <NA> <NA>
2 2 5 8 <NA> <NA> <NA> <NA> <NA> <NA>
3 3 6 9 <NA> <NA> <NA> <NA> <NA> <NA>
4 4 7 10 <NA> <NA> <NA> <NA> <NA> <NA>
Where exampleDF[4:9] are character vectors.
I am aware that I would be able to do this using variations of the below commands:
exampleDF$"1" <- as.character(NA)
exampleDF[["1"]] <- as.character(NA)
exampleDF[c("1","2","3","4","5","6")] <- as.character(NA)
But I need something more flexible. Everything I've been able to find online has been about adding one column to multiple data frames, and to do so they suggest mapply and cbind.
My apologies if I'm missing something obvious here; I am very new to the R language and am trying to do this without a for loop, as recent interactions have led me to believe that loops are mostly considered a hack in R scripts and the apply functions are typically sufficient.
Since your exampleVector is numeric, you should convert it to characters when you use it in the assignment. Otherwise it will be interpreted as a selection based on indexes.
exampleVector <- 1:6
exampleDF <- data.frame(First=1:4, Second=4:7,Third=7:10)
exampleDF[as.character(exampleVector)] <- NA_character_
Note that there's no restriction in this setup that protects you against getting a data frame with the same name occurring several times. That might create problems later on (if you want to subset your data frame by names), so I would add a sanity check to ensure that you get unique names.
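Such a sanity check might look like this (a sketch; stopifnot aborts with an error if any new name already exists in the frame):

```r
exampleVector <- 1:6
exampleDF <- data.frame(First = 1:4, Second = 4:7, Third = 7:10)

new_names <- as.character(exampleVector)
# refuse to add a column whose name is already taken
stopifnot(!any(new_names %in% names(exampleDF)))
exampleDF[new_names] <- NA_character_
names(exampleDF)
```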

How to get na.omit with data.table to only omit NAs in each column

Let's say I have
library(data.table)
az <- data.table(a = 1:6, b = 6:1, c = 4)
az[b == 4, c := NA]
az
a b c
1: 1 6 4
2: 2 5 4
3: 3 4 NA
4: 4 3 4
5: 5 2 4
6: 6 1 4
I can get the sum of all the columns with
az[,lapply(.SD,sum)]
a b c
1: 21 21 NA
This is what I want for a and b but c is NA. This is seemingly easy enough to fix by doing
az[,lapply(na.omit(.SD),sum)]
a b c
1: 18 17 20
This is what I want for c, but I didn't want to omit the values of a and b in rows where c is NA. This is a contrived example; in my real data there could be 1000+ columns with random NAs throughout. Is there a way to get na.omit or something else to act per column instead of on the whole table, without looping through each column as a vector?
Expanding on my comment:
Many base functions allow you to decide how to treat NA. For example, sum has the argument na.rm:
az[,lapply(.SD,sum,na.rm=TRUE)]
In general, you can also use the function na.omit on each vector individually:
az[,lapply(.SD,function(x) sum(na.omit(x)))]
