fuzzy_full_join over multiple variables duplicates columns in R

I do a fuzzy_full_join of two tables in R, requiring multiple keys to match. Some
rows do not match. The output duplicates the key columns (id.x, id.y, time.x,
time.y). This does not happen with a non-fuzzy full join. What is the best way to
remove the duplicates? I have a solution, but it seems cumbersome.
Example:
library(dplyr)
library(fuzzyjoin)
x <- data.frame("id" = c(1,1,2,2), "time" = c(1,2,1,2), "meas1" = c(1,2,3,4))
y <- data.frame("id" = c(1,1,2,2), "time" = c(1,3,2,4), "meas2" = c(-1,-2,-3,-4))
# compare full_join output with fuzzy_full_join
full_join(x,y,by=c('id'='id','time'='time'))
fuzzy_full_join(x,y,by=c('id'='id','time'='time'),match_fun=list(`==`,`==`))
# make fuzzy_full_join output match full_join output
fuzzy_full_join(x,y,by=c('id'='id','time'='time'),match_fun=list(`==`,`==`)) %>%
mutate(id=if_else(is.na(id.x),id.y,id.x)) %>%
select(-id.x,-id.y) %>%
mutate(time=if_else(is.na(time.x),time.y,time.x)) %>%
select(-time.y,-time.x)

We can use coalesce, which returns the first non-NA value across its arguments and shortens the code.
library(dplyr)
library(fuzzyjoin)
fuzzy_full_join(x,y,by=c('id'='id','time'='time'),match_fun=list(`==`,`==`)) %>%
mutate(id=coalesce(id.x, id.y), time = coalesce(time.x, time.y)) %>%
select(-matches('\\.x$|\\.y$'))
# meas1 meas2 id time
#1 1 -1 1 1
#2 4 -3 2 2
#3 2 NA 1 2
#4 3 NA 2 1
#5 NA -2 1 3
#6 NA -4 2 4
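If many key columns are involved, the same idea can be wrapped in a small loop. The following is only a sketch in base R: `coalesce_xy` is a hypothetical helper name, and `joined` is a hand-typed stand-in for the join output rather than a real fuzzy_full_join result. It assumes numeric key columns, because `ifelse` does not preserve factors or dates.

```r
# Hypothetical helper: collapse every <name>.x / <name>.y pair into <name>.
coalesce_xy <- function(df) {
  x_cols <- grep("\\.x$", names(df), value = TRUE)
  for (x_col in x_cols) {
    base_name <- sub("\\.x$", "", x_col)
    y_col <- paste0(base_name, ".y")
    if (!y_col %in% names(df)) next
    # Keep the .x value unless it is NA, in which case fall back to .y
    df[[base_name]] <- ifelse(is.na(df[[x_col]]), df[[y_col]], df[[x_col]])
    df[[x_col]] <- NULL
    df[[y_col]] <- NULL
  }
  df
}

# Hand-typed stand-in for three rows of the join output
joined <- data.frame(
  id.x = c(1, 1, NA), time.x = c(1, 2, NA), meas1 = c(1, 2, NA),
  id.y = c(1, NA, 1), time.y = c(1, NA, 3), meas2 = c(-1, NA, -2)
)
result <- coalesce_xy(joined)
print(result)
```

The merged key columns end up at the right-hand side of the data frame, as in the coalesce output above.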

Related

Dataframe NA conversion to specific items

I have a data frame like this:
dataframe <- data.frame(ID1=c(NA,2,3,1,NA,2),ID2=c(1,2,3,1,2,2))
Now I want to replace each NA in ID1 with the value from the neighboring column ID2, like this:
dataframe <- data.frame(ID1=c(1,2,3,1,2,2),ID2=c(1,2,3,1,2,2))
I think I could use an if function, but I would like to use %>% to keep it simple.
Please show me how.
An ifelse solution
dataframe <- within(dataframe, ID1 <- ifelse(is.na(ID1),ID2,ID1))
such that
> dataframe
ID1 ID2
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 2 2
A straightforward solution is to find out NA values in ID1 and replace them with corresponding values from ID2.
inds <- is.na(dataframe$ID1)
dataframe$ID1[inds] <- dataframe$ID2[inds]
However, since you want a solution with pipes you can use coalesce from dplyr
library(dplyr)
dataframe %>% mutate(ID1 = coalesce(ID1, ID2))
# ID1 ID2
#1 1 1
#2 2 2
#3 3 3
#4 1 1
#5 2 2
#6 2 2
A dplyr (using %>%) solution:
sanitized <- dataframe %>%
mutate(ID1 = ifelse(is.na(ID1), ID2, ID1))

How to take the latest entry from a data.frame and store it in new dataframe

I have a data.frame that is full of data in which the data for some parameters repeats, and I want to use only the latest information that is stored.
Thankfully, I have an index in the files that tells me which duplicate is the current row in the data.frame.
Example for my problem is the following:
A B C D
1 1 2 3 1
2 1 2 2 2
3 3 4 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
A small explanation: columns A and B can be considered the key, and column C holds the value for that key. Column D is the index of the measurement; it does not have to start from 1 and can start from any integer (3, 6, ...), because the data is incomplete.
So at the end the output should be like:
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
Can you please help me write an R program, or point me in the right direction, that keeps every key with its latest index?
I have tried using for loops, but it didn't work.
Sincere thanks.
If you have any questions, feel free to ask.
Using duplicated and subsetting in base R, you can do
dat[!duplicated(dat[,1:2], fromLast=TRUE),]
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
duplicated returns a logical vector indicating whether a row (here the first two columns) has been duplicated. The fromLast argument initiates this process from the bottom of the data.frame.
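To make the effect of fromLast concrete, here is a small sketch that rebuilds the example table by hand and compares the two scan directions. The variable names (`keys`, `dup_from_top`, `dup_from_bottom`, `latest`) are illustrative only.

```r
# Example data typed in from the question
dat <- data.frame(
  A = c(1, 1, 3, 3, 2, 2),
  B = c(2, 2, 4, 4, 3, 1),
  C = c(3, 2, 2, 1, 2, 1),
  D = c(1, 2, 2, 3, 1, 1)
)
keys <- dat[, c("A", "B")]

# Scanning from the top flags the later copy of each key
dup_from_top <- duplicated(keys)
# Scanning from the bottom flags the earlier copy instead
dup_from_bottom <- duplicated(keys, fromLast = TRUE)

print(dup_from_top)     # FALSE  TRUE FALSE  TRUE FALSE FALSE
print(dup_from_bottom)  #  TRUE FALSE  TRUE FALSE FALSE FALSE

# Dropping the rows flagged from the bottom keeps the last row per key
latest <- dat[!dup_from_bottom, ]
print(latest)
```

This relies on the latest measurement being the last row for each key. If the rows might not be in that order, sorting first with `dat[order(dat$D), ]` would be a reasonable precaution.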
You can use dplyr verbs to group your data with group_by, then sort it with arrange. The do verb lets you operate at the group level, and tail grabs the last row of each group:
library(dplyr)
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
do(tail(.,1)) %>%
ungroup()
Thanks to Frank's suggestion, you could also use slice
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
slice(n()) %>%
ungroup()

combine datasets by the value of multiple columns

I'm trying to enter values based on the value of multiple columns from two datasets.
My main dataset (df1) lists locations with corresponding dates, and df2 lists the temperatures at all locations on every possible date. E.g.:
df1
Location Date
A 2
B 1
C 1
D 3
B 3
df2
Location Date1Temp Date2Temp Date3Temp
A -5 -4 0
B 2 0 2
C 4 4 5
D 6 3 4
I would like to create a temperature variable in df1, according to the location and date of each observation. Preferably I would like to carry this out with all Temperature data in the same dataframe, but this can be separated and added 'by date' if necessary. With the example data, I would want this to create something like this:
Location Date Temp
A 2 -4
B 1 2
C 1 4
D 3 4
B 3 2
I've been playing around with merge and ifelse, but haven't figured anything out yet.
Is this what you need?
library(reshape2)
library(magrittr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),d1t=c(-5,5,4,6),d2t=c(-4,0,4,3),d3t=c(0,2,5,4))
merge(df1,df2) %>% melt(id.vars=c("Location","Date"))
Here's how to do that with dplyr and tidyr.
Basically, you want to use gather to melt the DateXTemp columns from df2 into two columns. Then, you want to use gsub to remove the "Date" and "Temp" strings to get numbers that are comparable to what you have in df1. Since DateXTemp were initially characters, you need to transform the remaining numbers to numeric with as.numeric. I then use left_join to join the tables.
library(dplyr);library(tidyr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),Date1Temp=c(-5,5,4,6),
Date2Temp=c(-4,0,4,3),Date3Temp=c(0,2,5,4))
df2_new <- df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))
df1%>%left_join(df2_new)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
EDIT
As suggested by @Sotos, you can do that in a single pipe like so:
df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))%>%
left_join(df1,.)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
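As an alternative to reshaping, the lookup can also be sketched in base R with matrix indexing. This is a different technique from the answers above, not a rewrite of them. It assumes the temperature columns are in date order (so that Date 1 is the first column), and it uses the values from the question, where B has a Date1Temp of 2. The names `temps` and `row_idx` are illustrative.

```r
df1 <- data.frame(Location = c("A", "B", "C", "D", "B"),
                  Date = c(2, 1, 1, 3, 3))
df2 <- data.frame(Location = c("A", "B", "C", "D"),
                  Date1Temp = c(-5, 2, 4, 6),
                  Date2Temp = c(-4, 0, 4, 3),
                  Date3Temp = c(0, 2, 5, 4))

# Numeric matrix of temperatures: one row per location, one column per date
temps <- as.matrix(df2[, c("Date1Temp", "Date2Temp", "Date3Temp")])

# Row of df2 that corresponds to each observation in df1
row_idx <- match(df1$Location, df2$Location)

# A two-column index matrix picks one (row, column) cell per observation
df1$Temp <- temps[cbind(row_idx, df1$Date)]
print(df1)
```

Locations in df1 that are missing from df2 would yield NA here, because match returns NA for them.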

How to integrate set of vector in multiple data.frame into one without duplication?

I have position index vectors stored in several data.frame objects, but the column order differs from one data.frame to the next. I want to merge these data.frame objects into one common data.frame with a specific row order and without duplicated rows. Does anyone know a trick for doing this more easily? Can anyone propose a possible approach to accomplish this task?
data
v1 <- data.frame(
foo=c(1,2,3),
bar=c(1,2,2),
bleh=c(1,3,0))
v2 <- data.frame(
bar=c(1,2,3),
foo=c(1,2,0),
bleh=c(3,3,4))
v3 <- data.frame(
bleh=c(1,2,3,4),
foo=c(1,1,2,0),
bar=c(0,1,2,3))
initial output after integrating them:
initial_output <- data.frame(
foo=c(1,2,3,1,2,0,1,1,2,0),
bar=c(1,2,2,1,2,3,0,1,2,3),
bleh=c(1,3,0,3,3,4,1,2,3,4)
)
remove duplication
rmDuplicate_output <- data.frame(
foo=c(1,2,3,1,0,1,1),
bar=c(1,2,2,1,3,0,1),
bleh=c(1,3,0,3,4,1,2)
)
final desired output:
final_output <- data.frame(
foo=c(1,1,1,1,2,3,0),
bar=c(0,1,1,1,2,2,3),
bleh=c(1,1,2,3,3,0,4)
)
How can I get my final desired output easily? Is there an efficient way to do this sort of manipulation on data.frame objects? Thanks
You could also use an mget/ls combination to get your data frames programmatically (without typing individual names) and then use data.table's rbindlist function and unique method for a large efficiency gain:
library(data.table)
unique(rbindlist(mget(ls(pattern = "v\\d+")), use.names = TRUE))
# foo bar bleh
# 1: 1 1 1
# 2: 2 2 3
# 3: 3 2 0
# 4: 1 1 3
# 5: 0 3 4
# 6: 1 0 1
# 7: 1 1 2
As a side note, it is usually better to keep multiple data.frames in a single list so that you have better control over them.
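A minimal base R sketch of that list-based workflow, using the data from the question, might look like this. The names `frames`, `combined`, `deduped` and `final` are illustrative. Sorting by bar, then foo, then bleh is an assumption about the intended order; it happens to reproduce the final_output given in the question.

```r
# Keep the data frames in one named list instead of separate variables
frames <- list(
  v1 = data.frame(foo = c(1, 2, 3), bar = c(1, 2, 2), bleh = c(1, 3, 0)),
  v2 = data.frame(bar = c(1, 2, 3), foo = c(1, 2, 0), bleh = c(3, 3, 4)),
  v3 = data.frame(bleh = c(1, 2, 3, 4), foo = c(1, 1, 2, 0),
                  bar = c(0, 1, 2, 3))
)

# rbind on data frames matches columns by name, so the differing
# column order in v2 and v3 is handled
combined <- do.call(rbind, frames)

# Drop duplicated rows, then sort by bar, foo and bleh
deduped <- unique(combined)
final <- deduped[order(deduped$bar, deduped$foo, deduped$bleh), ]
rownames(final) <- NULL
print(final)
```

With a list, adding a fourth data frame only means adding one more list element; the rest of the code stays the same.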
We can use bind_rows from dplyr, remove the duplicates with distinct and arrange by 'bar'
library(dplyr)
bind_rows(v1, v2, v3) %>%
distinct %>%
arrange(bar)
# foo bar bleh
#1 1 0 1
#2 1 1 1
#3 1 1 3
#4 1 1 2
#5 2 2 3
#6 3 2 0
#7 0 3 4
Here is a solution:
# combine dataframes
df = rbind(v1, v2, v3)
# remove duplicated
df = df[! duplicated(df),]
# sort by 'bar' column
df[order(df$bar),]
foo bar bleh
7 1 0 1
1 1 1 1
4 1 1 3
8 1 1 2
2 2 2 3
3 3 2 0
6 0 3 4

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too (a matrix has no $ accessor and, without row names, must be indexed by number):
dat[as.integer(tapply(as.character(seq_len(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
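The behaviour can be seen side by side in the short sketch below. The `resample` wrapper follows the pattern suggested in the help page for sample; the other variable names are illustrative, and the seed is set only to make the run repeatable.

```r
# Wrapper that always samples from the elements of x, even when length(x) == 1
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(1)

# A single number n >= 1 is treated as "sample from 1:n"
draws_numeric <- replicate(100, sample(4, 1))

# A single character value is returned as is
draws_char <- replicate(100, sample("4", 1))

# The wrapper returns the single value itself, whatever its type
draws_safe <- replicate(100, resample(4, 1))

print(table(draws_numeric))
print(unique(draws_char))
print(unique(draws_safe))
```

Either converting the candidates to character or using a wrapper like this keeps groups with a single row from being sampled incorrectly.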
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
