Merge data frames and overwrite values in R

How do I merge 2 similar data frames but have one with greater importance?
For example:
Dataframe 1
Date   Col1  Col2
jan     2     1
feb     4     2
march   6     3
april   8     NA
Dataframe 2
Date   Col2  Col3
jan     9    10
feb     8    20
march   7    30
april   6    40
I want to merge these by Date, with dataframe 1 taking precedence but dataframe 2 filling in the blanks:
DataframeMerge
Date   Col1  Col2  Col3
jan     2     1    10
feb     4     2    20
march   6     3    30
april   8     6    40
EDIT - SOLUTION
# Columns present in both frames, other than the join key
commonNames <- names(df1)[names(df1) %in% names(df2)]
commonNames <- commonNames[commonNames != "key"]
# Full outer join; shared columns get .x (df1) and .y (df2) suffixes
dfmerge <- merge(df1, df2, by = "key", all = TRUE)
for (i in commonNames) {
  left  <- paste(i, ".x", sep = "")
  right <- paste(i, ".y", sep = "")
  # Where df1's value is NA, fill in df2's value, then drop the .y column
  dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
  dfmerge[right] <- NULL
  colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
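As a usage sketch for the example data above (my reconstruction; here the join column is Date, so by = "Date" replaces by = "key"):
# Hypothetical setup mirroring the example tables above
df1 <- data.frame(Date = c("jan", "feb", "march", "april"),
                  Col1 = c(2, 4, 6, 8), Col2 = c(1, 2, 3, NA))
df2 <- data.frame(Date = c("jan", "feb", "march", "april"),
                  Col2 = c(9, 8, 7, 6), Col3 = c(10, 20, 30, 40))
commonNames <- setdiff(intersect(names(df1), names(df2)), "Date")
dfmerge <- merge(df1, df2, by = "Date", all = TRUE)
for (i in commonNames) {
  left  <- paste0(i, ".x")
  right <- paste0(i, ".y")
  dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
  dfmerge[right] <- NULL
  names(dfmerge)[names(dfmerge) == left] <- i
}
dfmerge  # note merge() sorts rows by Date; april's Col2 (6) comes from df2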

merdat <- merge(dfrm1, dfrm2, by="Date") # seems self-documenting
# explanation for next line in text below; Col2.x comes from dfrm1, Col2.y from dfrm2
merdat$Col2.x[ is.na(merdat$Col2.x) ] <- merdat$Col2.y[ is.na(merdat$Col2.x) ]
Then just rename 'merdat$Col2.x' to 'merdat$Col2' and drop 'merdat$Col2.y'.
In reply to a request for more comments: one way to update only sections of a vector is to construct a logical vector for indexing and apply it with "[" to both sides of an assignment. Another way is to use the logical vector only on the LHS of the assignment and build the RHS with rep() so it has length sum(logical.vector). The goal in both instances is for the replacement values to match the length (and order) of the items being replaced.
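Both idioms in a small sketch (hypothetical vectors, not from the thread):
v <- c(1, NA, 3, NA, 5)
fillers <- c(20, 40, 60, 80, 100)
# Idiom 1: the same logical index on both sides of the assignment
idx <- is.na(v)
v[idx] <- fillers[idx]                  # v is now 1 20 3 40 5
# Idiom 2: logical index on the LHS only; rep() matches the lengths on the RHS
w <- c(1, NA, 3, NA, 5)
w[is.na(w)] <- rep(0, sum(is.na(w)))    # w is now 1 0 3 0 5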

Update, using the on= argument introduced in data.table v1.9.6 (which allows ad hoc joins):
setDT(df1)[df2, `:=`(Col2 = ifelse(is.na(Col2), i.Col2, Col2),
                     Col3 = i.Col3), on = "Date"][]
Here's a data.table solution. Make sure the Date column in both df1 and df2 is a factor with the desired levels (for ordering).
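One possible setup for that (an assumption; the answer does not show the original data frames):
lev <- c("jan", "feb", "march", "april")
df1 <- data.frame(Date = factor(lev, levels = lev),
                  Col1 = c(2, 4, 6, 8), Col2 = c(1, 2, 3, NA))
df2 <- data.frame(Date = factor(lev, levels = lev),
                  Col2 = c(9, 8, 7, 6), Col3 = c(10, 20, 30, 40))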
require(data.table)
dt1 <- data.table(df1, key="Date")
dt2 <- data.table(df2, key="Date")
# Col2 refers to the Col2 of dt1 and i.Col2 refers to that of dt2
dt1[dt2, `:=`(Col3 = i.Col3, Col1 = Col1,
              Col2 = ifelse(is.na(Col2), i.Col2, Col2))]
# the result is stored in dt1
> dt1
# Date Col1 Col2 Col3
# 1: jan 2 1 10
# 2: feb 4 2 20
# 3: march 6 3 30
# 4: april 8 6 40

Here is a dplyr solution. Credit to @docendo discimus. Note that in this example the second data frame takes precedence, with the first filling the blanks.
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
y x1
1 A 1
2 B 2
3 C NA
4 D 4
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
y x1
1 A 5
2 B 6
3 C 7
dplyr
left_join(df1, df2, by = "y") %>%
  transmute(y, x1 = ifelse(is.na(x1.y), x1.x, x1.y))
  y x1
1 A  5
2 B  6
3 C  7
4 D  4
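The same idea can be written with dplyr's coalesce(), which takes the first non-NA value elementwise (a sketch using the same df1 and df2):
left_join(df1, df2, by = "y") %>%
  transmute(y, x1 = coalesce(x1.y, x1.x))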

Consider this example:
> d1 <- data.frame(x=1:4, a=2:5, b=c(3,4,5,NA))
> d1
x a b
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 NA
> d2 <- data.frame(x=1:4, b=c(6,7,8,9), c=11:14)
> d2
x b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
Now use merge and within, with ifelse:
> within(merge(d1, d2, by="x"), {b <- ifelse(is.na(b.x),b.y,b.x); b.x <- NULL; b.y <- NULL})
x a c b
1 1 2 11 3
2 2 3 12 4
3 3 4 13 5
4 4 5 14 9

Related

Subsetting dataframe in grouped data

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.
Try slice() instead. (In your attempt, the pipe passes the grouped data frame to "[" as its first argument, so seq(2, nrow(my_df), 2) is read as a column index; hence the complaint about columns 4, 6, 8, 10 and 12.)
my_df %>%
  group_by(col2) %>%
  slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
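For example (a sketch of the full pipeline with the ungroup step added):
my_new_df <- my_df %>%
  group_by(col2) %>%
  slice(seq(from = 2, to = n(), by = 2)) %>%
  ungroup()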
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
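Note the rows come back in their original row order; to match the grouped ordering shown in the question, a sort can be chained on (a sketch):
data[(rowid(col2) %% 2) == 0][order(col2)]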
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C

Filter using multiple rows in two dataframes in R [duplicate]

This question already has answers here:
Matching a sequence in a larger vector
(2 answers)
Closed 4 years ago.
Data
df1
col1
1 a
2 a
3 b
4 e
df2
col1 col2
1 1 a
2 1 c
3 1 c
4 1 e
5 2 a
6 2 b
7 2 b
8 2 e
9 3 a
10 3 a
11 3 b
12 3 e
I want to filter df2 using df1. So far I have this code.
filter(df2, any(col2==df1$col1[1]))
This allows me to filter row by row.
But I want to filter by multiple rows. Not the whole df1 at once. I want to filter df2 using df1$col1[1:2]. So "a" followed by "a". I tried the following code but got this message.
filter(df2, col2==df1$col1[1] & col2==df1$col1[2])
[1] col1 col2 <0 rows> (or 0-length row.names)
Ideally output:
df2
col1 col2
1 3 a
2 3 a
3 3 b
4 3 e
You could use package Biostrings.
df1 <- data.frame(col1=c("a", "a", "b", "e"))
df2 <- data.frame(col1=c(rep(1, 4), rep(2, 4), rep(3, 4)),
col2=letters[c(1, 3, 3, 5, 1, 2, 2, 5, 1, 1, 2, 5)])
aabe <- paste0(df1$col1, collapse = "")
cand <- paste0(df2$col2, collapse = "")
# # Install the package
# source("https://bioconductor.org/biocLite.R")
# biocLite("Biostrings")
library(Biostrings)
match <- matchPattern(aabe, cand)
str(matchPattern(aabe, cand))
x1 <- match@ranges@start
x2 <- x1 + match@ranges@width - 1
> df2[x1:x2, ]
col1 col2
9 3 a
10 3 a
11 3 b
12 3 e
Using the same approach as in @jaySf's answer, you can also use gregexpr.
matchpattern <- unlist(gregexpr(pattern = paste(df1$col1, collapse = ""),
paste(df2$col2, collapse = "")))
df2[matchpattern:(matchpattern + nrow(df1) - 1),]
# col1 col2
#9 3 a
#10 3 a
#11 3 b
#12 3 e
Or stri_locate from stringi.
library(stringi)
index <- unlist(stri_locate(paste(df2$col2, collapse = ""),
fixed = paste(df1$col1, collapse = "")))
df2[index[1]:index[2],]
# col1 col2
#9 3 a
#10 3 a
#11 3 b
#12 3 e
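The underlying operation is just a rolling comparison, so a base R sketch without string packages is also possible (assuming the df1 and df2 built above; as.character() guards against factor columns):
n <- nrow(df1)
starts <- which(vapply(seq_len(nrow(df2) - n + 1), function(i)
  all(as.character(df2$col2[i:(i + n - 1)]) == as.character(df1$col1)),
  logical(1)))
df2[starts[1]:(starts[1] + n - 1), ]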

Creating an ordered id by group in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 2
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
I appreciate any help.
Some dplyr alternatives, using convenience functions row_number and n.
library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.
getanID(data = df, id.vars = "personid")
# personid date measurement .id
# 1: 1 x 23 1
# 2: 1 x 32 2
# 3: 2 y 21 1
# 4: 3 x 23 1
# 5: 3 z 23 2
# 6: 3 y 23 3
The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header=TRUE)
## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3
## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
Using data.table, and assuming you wish to order by date within the personid subset
library(data.table)
DT <- data.table(Data)
DT[,id := order(date), by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 3
## 6: 3 y 23 2
If you do not wish to order by date
DT[, id := 1:.N, by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 2
## 6: 3 y 23 3
Any of the following would also work
DT[, id := seq_along(measurement), by = personid]
DT[, id := seq_along(date), by = personid]
The equivalent commands using plyr
library(plyr)
# ordering by date
ddply(Data, .(personid), mutate, id = order(date))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))
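More recent data.table versions also provide rowid(), which produces this id directly (a sketch, not from the original answers; assumes the same Data frame):
library(data.table)
DT <- data.table(Data)
DT[, id := rowid(personid)]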
I think there's a canned command for this, but I can't remember it. So here's one way:
> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
[1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
[1] 1 1 2 2 3 4 5 6 7 8
This works because duplicated returns a logical vector. cumsum evaluates numeric vectors, so the logical gets coerced to numeric. Note, though, that this gives a running counter over the whole vector rather than an id that restarts at 1 within each group, so it only matches the desired output for particular orderings.
You can store the result to your data.frame as a new column if you want:
dat$id <- cumsum(duplicated(test))+1
Assuming your data are in a data.frame named Data, this will do the trick:
# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() calculates the number of each personid
# sequence() creates a n-length vector for each element in the input,
# and concatenates the result
Data$id <- sequence(tabulate(Data$personid))
You can use sqldf
df<-read.table(header=T,text="personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")
library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
FROM df a, df b
WHERE a.personid = b.personid AND b.ROWID <= a.ROWID
GROUP BY a.ROWID"
)
# personid date measurement count
#1 1 x 23 1
#2 1 x 32 2
#3 2 y 21 1
#4 3 x 23 1
#5 3 z 23 2
#6 3 y 23 3

What is the equivalent of the SumIf function in R

I am new to R and this site, but I searched and didn't find the answer I was looking for.
If I have the following data set "total":
names <- c("a", "b", "c", "d", "a", "b", "c", "d")
x <- cbind(x1 = 3, x2 = c(3:10))
total <- data.frame(names, x)
total
names x1 x2
1 a 3 3
2 b 3 4
3 c 3 5
4 d 3 6
5 a 3 7
6 b 3 8
7 c 3 9
8 d 3 10
How can I create a new data set that works like Excel's SUMIF function, with one row per unique name?
The answer should be a new data set "summary" that is 4 x 3.
names <- unique(names)
summary <- data.frame(names)
summary$Sumx1 <- ?????
summary$Sumx2 <- ?????
summary
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
In base R:
aggregate(. ~ names, data=total, sum)
You can use ddply from the plyr package:
library(plyr)
ddply(total, .(names), summarise, Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
You can also use data.table:
library(data.table)
DT <- as.data.table(total)
DT[ , lapply(.SD, sum), by = "names"]
names x1 x2
1: a 6 10
2: b 6 12
3: c 6 14
4: d 6 16
With the new dplyr package, you can do:
library(dplyr)
total %>%
group_by(names) %>%
summarise(Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 d 6 16
2 c 6 14
3 b 6 12
4 a 6 10
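For completeness, base R also has rowsum(), which sums numeric columns by a grouping vector (a sketch, not from the original answers):
rowsum(total[, c("x1", "x2")], group = total$names)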
