Eliminate rows from a data.frame based on partially duplicate values - r

I have a relatively large data.frame with 205K observations and 54 variables. It is the result of appending three different data.frames. The original data.frames all share the columns date, time, lat and lon, but each one carries accessory information which I need to retain. The final data.frame therefore contains sets of three rows where date, time, lat and lon are exactly the same but the values of var1, var2 and so forth differ, and some are NA. A simplified version of my data.frame looks like the following:
mydf
  var1 date time var2 var3 var4 var5 var6 lat lon
1    A    1    2    3    4    5    6    7   8   9
2    B    1    2 <NA> <NA> <NA>    6    7   8   9
3 <NA>    1    2 <NA> <NA> <NA> <NA> <NA>   8   9
In particular, I would like to identify in my data.frame those sets of rows with the same date, time, lat and lon, but only retain the ones where, for instance, var1 is not NA, so that the final data.frame looks like:
  var1 date time var2 var3 var4 var5 var6 lat lon
1    A    1    2    3    4    5    6    7   8   9
2    B    1    2 <NA> <NA> <NA>    6    7   8   9
I know that I can use
distinct(mydf, ..., .keep_all = TRUE)
but I can't figure out how to use the arguments properly. Any help is greatly appreciated.
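For reference, a reproducible construction of this simplified frame, inferred from the printed values so the snippets below can be run as-is:
mydf <- data.frame(
  var1 = c("A", "B", NA),
  date = 1, time = 2,
  var2 = c(3, NA, NA), var3 = c(4, NA, NA), var4 = c(5, NA, NA),
  var5 = c(6, 6, NA), var6 = c(7, 7, NA),
  lat = 8, lon = 9
)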

With dplyr:
First flag the rows that are duplicated on the selected variables, checking both from the top and from the bottom so that every member of a duplicated set is marked. Then filter to the rows where this new variable is TRUE and var1 is not NA:
library(dplyr)
mydf %>%
  mutate(dup = duplicated(select(., date, time, lat, lon)) |
               duplicated(select(., date, time, lat, lon), fromLast = TRUE)) %>%
  filter(dup == TRUE & !is.na(var1))
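As a variant (a sketch, not part of the answer above): grouping by the four key columns expresses the same condition and also keeps rows that are not duplicated at all, in case the real data contains singleton coordinate sets:
mydf %>%
  group_by(date, time, lat, lon) %>%
  filter(n() == 1 | !is.na(var1)) %>%  # keep singletons, or non-NA var1 within duplicated sets
  ungroup()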

I have found a less direct and most likely less elegant solution to deal with the duplicate rows in my specific data.frame: first add a "datetime" column to all my data.frames, then "subtract" one data.frame from another, and finally append them to a final data.frame.
# my data.frames: df1, df2
library(plyr)   # for rbind.fill(); load before dplyr to avoid masking
library(dplyr)

df1 <- mutate(df1, datetime = paste(date, time))  # add "datetime" by concatenating "date" and "time"
df2 <- mutate(df2, datetime = paste(date, time))
df1 <- anti_join(df1, df2, by = "datetime")  # drop from df1 the rows that occur in df2, matched on "datetime"
df.f <- rbind.fill(df1, df2)                 # append the two data.frames into "df.f"
Possibly not the fastest approach, but it works fine with my data.
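For what it's worth, anti_join() also accepts a multi-column key, which would avoid building the pasted datetime column altogether; a sketch under the same assumptions:
library(dplyr)
# drop from df1 the rows whose (date, time) pair occurs in df2
df1 <- anti_join(df1, df2, by = c("date", "time"))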

Related

pivot_wider into an array instead of a table - R

I have data in long format (each row has a specific date, an ID and several variables - see code below) and I would like to build an array in R from it.
test_df <- data.frame("dates"=c(19801230,19801231,19801231,19810101), "ID"=c(101,101,102,102), "var1"=0:3, "var2"=5:8)
If I focus on a single variable only, I can always create a wide table with a row for each date and a column for each ID holding the corresponding value; but I would like to build an array out of it automatically, so as to have all variables in one object where I can work with an ordered time dimension.
In the example of test_df, I would like to obtain two tables bound together into an array, where the first table holds the values of var1 and the second those of var2, but both tables have the dates 19801230, 19801231 and 19810101 as row indices and 101 and 102 as column indices, which allows them to be bound together in an array (with NA for missing values).
I could run lapply over the ID indices or the date indices and merge the output lists into an array, but it seems complicated to make the dimensions match (different IDs are present on different dates). Do you have suggestions?
The only other close question I have seen is the other way around here, but it did not help me much.
nm1 <- grep("var", names(test_df), value = TRUE)  # names of the value columns
# xtabs() fills absent date/ID combinations with 0, which would clash with
# genuine 0 values (var1 starts at 0), so recode real zeros to a sentinel first
test_df[nm1][test_df[nm1] == 0] <- -999
out <- simplify2array(lapply(nm1, \(x) xtabs(test_df[[x]] ~ dates + ID,
                                             data = test_df[c("dates", "ID")])))
out[out == 0] <- NA    # zeros left by xtabs mark missing combinations
out[out == -999] <- 0  # restore the genuine zeros
out
-output
> out
, , 1

          ID
dates      101 102
  19801230   0  NA
  19801231   1   2
  19810101  NA   3

, , 2

          ID
dates      101 102
  19801230   5  NA
  19801231   6   7
  19810101  NA   8
Or with tidyverse
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)

test_df %>%
  pivot_longer(cols = starts_with('var')) %>%
  pivot_wider(names_from = ID, values_from = value) %>%
  {split(.[setdiff(names(.), "name")], .$name)} %>%
  map(~ .x %>%
        column_to_rownames('dates') %>%
        as.matrix) %>%
  simplify2array
-output
, , var1

         101 102
19801230   0  NA
19801231   1   2
19810101  NA   3

, , var2

         101 102
19801230   5  NA
19801231   6   7
19810101  NA   8
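Either way, the result can be sliced by its dimnames. For example, assuming the tidyverse pipeline's result has been assigned to arr (a name introduced here; its third dimension is named after the variables):
arr[, , "var2"]          # the full date-by-ID matrix for var2
arr["19801231", "102", ] # both variables for a single date/ID pair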

Removing all occurrences from a dataframe that occur in a second dataframe

I have two dataframes, and I am looking to drop from the first all the rows that match the second. I know there are similar questions out there, but the solutions didn't work for me.
dates <- rep(seq(as.Date("2004/01/01"), as.Date("2020/12/31"), "days"), each = 20)
Animal_id <- rep(1:20, times = length(unique(dates)))
df1 <- data.frame(dates = dates, id = Animal_id)

dates2 <- rep(seq(as.Date("2004/01/01"), as.Date("2020/12/31"), "days"), each = 2)
Animal_id2 <- rep(1:2, times = length(unique(dates2)))
df2 <- data.frame(dates = dates2, id = Animal_id2)
df2 <- df2[-4, ]
df2 <- df2[-6, ]

## I would like to ensure that any animal in df2 is removed from df1
df1$remove <- paste(df1$dates, df1$id, sep = "-")
df2$remove <- paste(df2$dates, df1$id, sep = "-")
dim(df1)
dim(df2)
anti_join(df1, df1, by = "remove")
I have also found the following and tried it, but it does not work:
df1[!(df1$remove %in% df2$remove), ]
I do not get any error messages; it simply does not remove the rows (the dimensions of the data do not change). My actual dataset is quite large, and I am hoping to avoid having to type out every date + ID combo I would like to filter out.
Is there a way to get R to go through and remove the matches between two dataframes when I need to do this over multiple columns (i.e. I can't just use ID, because there will be differences in dates between the two)?
If I understand you correctly, this should be the correct code (as indicated by @Waldi in the comments):
anti_join() ... returns all rows from x where there are not matching values in y, keeping just columns from x.
library(dplyr)
anti_join(df1, df2, by = "id")
output:
# A tibble: 111,780 x 2
   dates         id
   <date>     <int>
 1 2004-01-01     3
 2 2004-01-01     4
 3 2004-01-01     5
 4 2004-01-01     6
 5 2004-01-01     7
 6 2004-01-01     8
 7 2004-01-01     9
 8 2004-01-01    10
 9 2004-01-01    11
10 2004-01-01    12
# ... with 111,770 more rows
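If the goal were instead to drop only the matching date-and-animal combinations (rather than every record of the animals present in df2), the same verb accepts a compound key; a sketch of that variant:
library(dplyr)
anti_join(df1, df2, by = c("dates", "id"))  # remove only exact date/id matches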

How to keep one instance or more of the values in one column when removing duplicate rows?

I'm trying to remove rows with duplicate values in one column of a data frame, while making sure that all the existing values in that column remain represented: a value should appear more than once if its values in a second column are non-missing and not duplicated, and exactly once if the values in that second column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7),
                  Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7),
                    Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and follow the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: the value 7 in the 'Group' column should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop the NA rows in toy and keep only the unique rows. We then left-join that with the unique Group values to bring back the groups whose Class is entirely NA.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
#  Group Class
#1     1     a
#2     2     a
#3     2     b
#4     3  <NA>
#5     4  <NA>
#6     5     a
#7     5     b
#8     6     a
#9     7     a
The same logic using dplyr functions:
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group), by = "Group")
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
  group_by(Group) %>%
  # drop NA rows only in groups that also have a non-NA Class
  filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
  distinct()
Output
# A tibble: 9 x 2
# Groups:   Group [7]
  Group Class
  <dbl> <chr>
1     1 a
2     2 a
3     2 b
4     3 NA
5     4 NA
6     5 a
7     5 b
8     6 a
9     7 a
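As for the data.table attempt in the question, it only needs its result deduplicated to handle groups like 7; a sketch building on the asker's own code:
library(data.table)
toy.dt <- as.data.table(toy)
unique(toy.dt[, .(Class = if (all(is.na(Class))) NA_character_ else na.omit(Class)),
              by = Group])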

How to subset dataframe based on multiple variables in R

I have a dataframe of 286 columns and 157,355 rows. I wish to subset the rows that contain one or more of several defined factor values such as F32, F341, etc.
Once this has been completed, I wish to identify which other factor values are most common in the subset rows.
I have tried to filter for the values of interest, but an error message appears saying the data must be numeric, logical or complex; for example:
d <- a %>%
  filter_at(vars(f.41202.0.0:f.41202.0.65), all_vars('F32'))
I also tried this, but the resulting dataframe had no values present:
f <- a %>%
  rowwise() %>%
  filter(any(c(1:280) %in% c('F32', 'F320', 'F321', 'F322', 'F323',
                             'F328', 'F329', 'F330', 'F331', 'F332',
                             'F333', 'F334', 'F338', 'F339')))
The same occurred when I tried to place all the relevant values into an ICD object:
f <- b %>%
  rowwise() %>%
  filter(any(c(1:286) %in% ICD))
I would greatly appreciate any suggestions, thanks.
My data looks like this:
Row.name Var1 Var2 Var3 Var4
       1   F3   NA   NA  M87
       2   NA   NA  M87   NA
       3   NA   F3   NA  K17
       4   NA   NA   F3  M87
After subsetting the rows based on F3, it should look like this:
Row.name Var1 Var2 Var3 Var4
       1   F3   NA   NA  M87
       3   NA   F3   NA  K17
       4   NA   NA   F3  M87
so the same variable columns are retained, but the rows without F3 are removed.
Then I would hope to list the other values (other than F3) by how common they are within that subset; in this case that would be:
most common: M87
2nd most common: K17
If it helps, I am trying to identify individuals with a particular disease, and then find out which other diseases those individuals most commonly have.
Thanks for the help.
If you wish to use tidyverse, you can use filter_all to look at all of the columns, then check with any_vars whether any value is in a vector of diagnostic codes. In my example, I look for F3 and F320.
Afterwards, if you want to count up the diagnosis codes, you can reshape your data from wide to long and then count frequencies. If you wish, you can remove NA values with filter. Let me know if this is what you had in mind.
library(tidyverse)

df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87")
)

df %>%
  filter_all(any_vars(. %in% c("F3", "F320"))) %>%
  pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
  filter(!is.na(Code)) %>%
  count(Code, sort = TRUE)
After the filter, you should have:
  Var1 Var2 Var3 Var4
1   F3 <NA> <NA>  M87
2 <NA>   F3 <NA>  K17
3 <NA> <NA>   F3  M87
After pivot_longer and count:
# A tibble: 3 x 2
  Code      n
  <fct> <int>
1 F3        3
2 M87       2
3 K17       1
Side note: if you wish to filter based on only some of your variables (instead of all of them), you can use filter_at instead, such as:
filter_at(vars(starts_with("Var")), any_vars(. %in% c("F3", "F320")))
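Note that filter_all()/filter_at() and any_vars() have since been superseded; from dplyr 1.0.4 onwards the same filter can be written with if_any() (a sketch, assuming the df above):
library(dplyr)
df %>%
  filter(if_any(everything(), ~ .x %in% c("F3", "F320")))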

Merging two dataframes on one matching variable and retaining only one value for other disjoint variables

I have two dataframes I need to merge. They share all of the same columns, and I am merging on one shared variable, worker_ID. However, the other variables are often disjoint: one dataframe will have an NA where the other has a value for a given variable. How can I merge in such a way that the output retains only the non-NA value?
x = worker_ID Var_1 Var_2 Var_3
            1    33    NA    NA
            2    NA    46    NA
y = worker_ID Var_1 Var_2 Var_3
            1    NA    75    NA
            2    NA    NA    66
z <- merge(x, y, by = "worker_ID", all = TRUE)
This method does not work: instead of my desired output z, I get a dataframe with two columns for each variable (one for its value in x and another for y). My desired output is z:
z = worker_ID Var_1 Var_2 Var_3
            1    33    75    NA
            2    NA    46    66
How can I tell R to let any non-NA entries supersede NA ones?
As Ben suggested, you can use coalesce(). Based on your sample data, I did the following: for each pair of columns in the same position in x and y, I applied coalesce() and created a vector, converted the result of sapply() to a data frame, and added worker_ID at the end. Note that I used as.numeric() for Var_3; I am not sure what your data is like, but Var_3 in x may be logical rather than numeric (a column of all NA), so I made sure that Var_3 in x and Var_3 in y are both numeric.
library(tidyverse)

sapply(2:ncol(x), function(i) {
  # coalesce() keeps the first non-NA value; coerce both columns to numeric
  # in case an all-NA column was read as logical
  coalesce(as.numeric(pull(x, i)), as.numeric(pull(y, i)))
}) %>%
  as_tibble %>%
  bind_cols(worker_ID = pull(x, 1), .)
# A tibble: 2 x 4
#  worker_ID    V1    V2    V3
#      <int> <dbl> <dbl> <dbl>
#1         1    33    75    NA
#2         2    NA    46    66
UPDATE
Taking akrun's advice, the following code works well. map_dfc() loops through each column pair just as sapply() does; the good thing is that map_dfc() creates a data frame directly, so there is no need for as_tibble().
map_dfc(2:ncol(x), ~ coalesce(as.numeric(pull(x, .x)),
                              as.numeric(pull(y, .x)))) %>%
  bind_cols(worker_ID = pull(x, 1), .)
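In dplyr 1.0.0 and later there is also a purpose-built verb for this pattern: rows_patch() fills the NA fields of x with the matching values from y, keyed on worker_ID. A sketch, assuming every worker_ID in y also occurs in x:
library(dplyr)

# NA_real_ keeps the all-missing columns numeric so the column types of x and y agree
x <- data.frame(worker_ID = 1:2, Var_1 = c(33, NA), Var_2 = c(NA, 46),
                Var_3 = c(NA_real_, NA_real_))
y <- data.frame(worker_ID = 1:2, Var_1 = c(NA_real_, NA_real_), Var_2 = c(75, NA),
                Var_3 = c(NA, 66))

rows_patch(x, y, by = "worker_ID")  # NA cells in x are filled from y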
