Merging data frames and combining columns into one - r

I've got the following three dataframes:
df1 <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student",3,5,"Employer"), stringsAsFactors=FALSE)
df2 <- data.frame(name=c("Anne", "Christine"),
age=c(26, 54),
height=c(175, 160),
group=c(3,5),
group2=c("Teacher",6), stringsAsFactors=FALSE)
df2 <- data.frame(name=c("Christine"),
age=c(54),
height=c(160),
group=c(5),
group2=c(6),
group3=c("Scientist"), stringsAsFactors=FALSE)
I'd like to combine them so that I get the following result:
df.all <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student", "Teacher", "Scientist", "Employer"))
At the moment I'm doing it this way:
df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
df3[,c(1,6)], all=TRUE, by="name")
row.ind <- which(df.all$group %in% c(6,5))
df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
row.ind2 <- which(df.all$group2 %in% c(6))
df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]
This isn't generalisable and it is really messy. Maybe there would be a way to use merge_all or merge_recurse for the merging step (especially as there might be more than two dataframes to be merged), but I haven't figured out how. These two don't produce the right result:
df.all <- merge_all(list(df1, df2, df3))
df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))
Is there a more general and elegant way to solve this problem?

Here is another possible approach, if I understand what you're ultimately after. (It is not clear what the numeric values in the "group" columns are, so I'm not sure this is exactly what you're looking for.)
Use Reduce() to merge your multiple data.frames.
temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
temp
# name age height group1 group2 group3
# 1 Andy 48 168 Employer <NA> <NA>
# 2 Anne 26 175 3 Teacher <NA>
# 3 Christine 54 160 5 6 Scientist
# 4 John 31 180 Student <NA> <NA>
Use reshape() to reshape your data from wide to long.
df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
df.all
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# Anne.1 Anne 26 175 1 3
# Christine.1 Christine 54 160 1 5
# John.1 John 31 180 1 Student
# Andy.2 Andy 48 168 2 <NA>
# Anne.2 Anne 26 175 2 Teacher
# Christine.2 Christine 54 160 2 6
# John.2 John 31 180 2 <NA>
# Andy.3 Andy 48 168 3 <NA>
# Anne.3 Anne 26 175 3 <NA>
# Christine.3 Christine 54 160 3 Scientist
# John.3 John 31 180 3 <NA>
Take advantage of the fact that as.numeric() will coerce characters to NA, and use na.omit() to remove all of the rows with NA values.
na.omit(df.all[is.na(as.numeric(df.all$group)), ])
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# John.1 John 31 180 1 Student
# Anne.2 Anne 26 175 2 Teacher
# Christine.3 Christine 54 160 3 Scientist
Again, this might be over-generalizing your problem--there might be NA values in other columns, for example--but it might help direct you towards a solution to your problem.

First step is to use merge_recurse with all.x = TRUE:
library(reshape)
merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
# name age height group group2 group3
# 1 Anne 26 175 3 Teacher <NA>
# 2 Christine 54 160 5 6 Scientist
# 3 John 31 180 Student <NA> <NA>
# 4 Andy 48 168 Employer <NA> <NA>
Then you can use apply to get the last non-NA group from all the "group" columns:
group.cols <- grep("group", colnames(merge.all))
merge.all <- data.frame(merge.all[-group.cols],
group = apply(merge.all[group.cols], 1,
function(x)tail(na.omit(x), 1)))
# name age height group
# 1 Anne 26 175 Teacher
# 2 Christine 54 160 Scientist
# 3 John 31 180 Student
# 4 Andy 48 168 Employer

Related

Remove inconsistent duplicate entries from data frame with Base R

I want to remove duplicate entries from a data frame that are inconsistent, the following gives a simplified example:
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Edgar 65
## 7 Edgar 55
## 8 Frank 90
## 9 George 120
## 10 George 120
## 11 George 120
## 12 Herbert 300
## 13 Iris 15
## 14 Iris 25
## 15 Iris 25
Version A)
Edgar and Iris are the same person yet the given amounts are inconsistent so I want to remove the entries:
#remove inconsistent duplicate entries
df2
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Frank 90
## 7 George 120
## 8 George 120
## 9 George 120
## 10 Herbert 300
Version B)
Another possibility would be to keep only one instance of the consistent entries:
#keep only one instance of consistent entries
df3
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 David 200
## 5 Frank 90
## 6 George 120
## 7 Herbert 300
I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.
A base solution that solves both at once. This has the side effect of requiring row name changes.
A Remove "inconsistent" values
new_df<-do.call("rbind",
Filter(function(x) all(x$amount == x$amount[1]),
split(df,df$name)))
name amount
Andy Andy 100
Bert Bert 50
Cindy.3 Cindy 30
Cindy.4 Cindy 30
David David 200
Frank Frank 90
George.9 George 120
George.10 George 120
George.11 George 120
Herbert Herbert 300
The above needs further cleaning of row names (an unwanted side effect perhaps but we deal with that below)
B Remove duplicates
new_df<-new_df[!duplicated(new_df$name),]
row.names(new_df) <- 1:nrow(new_df)
Combined result
new_df
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
The question specifically requests for a base solution. If for whatever reason someone from the future wants to use dplyr, I will leave this solution here.
Using dplyr, we can check if all values are equal to the first value of amount. If not, make them NA and delete them. Proceed with removing duplicates for what remains.
A Remove inconsistent ones
library(dplyr)
(df %>%
group_by(name) %>%
mutate(name = ifelse(!all(amount==first(amount)), NA, name)) %>%
na.omit() -> new_df)
A tibble: 10 x 2
# Groups: name [7]
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
Remove duplicates
new_df %>%
filter(!duplicated(name)) %>%
ungroup()
# A tibble: 7 x 2
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
A) First aggregate to apply the conditions, then filter the data and finally stack the result.
t <- aggregate( amount ~ name, df, function(x) c(unique(x),length(x)) )
t_m <- t[!sapply( t$amount, function(x) (length(x)>2) ),]
setNames( stack( setNames(lapply( t_m$amount, function(x)
rep(x[1],x[2]) ), t_m$name) )[,c("ind", "values")], colnames(df) )
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B) Is a bit more straightforward. Just aggregate and filter.
t <- aggregate( amount ~ name, df, unique )
t[lengths(t$amount) == 1,]
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
6 Frank 90
7 George 120
8 Herbert 300
You can use duplicate, but you need to remove all duplicate rows. (your option B).
The result can be used to filter the data frame for all rows.
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
df3
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 12 Herbert 300
df[df$name %in% df3$name, ]
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 4 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 10 George 120
#> 11 George 120
#> 12 Herbert 300
Created on 2021-12-12 by the reprex package (v2.0.1)
For the first requirement, where you need to get rid of duplicate entries, there's an in-built function in R called duplicated.
Here's the code:
df[!duplicated(df), ]
df[!duplicated(df$name),]
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
5 David 200
6 Edgar 65
8 Frank 90
9 George 120
12 Herbert 300
13 Iris 15
And for the second requirement, you'll need to do something like this:
df <- unique(df)
df <- split(df, df$name)
df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
Both versions are using Base-R. You can do the same using dplyr package in R.
Problem B is a sub-problem of problem A. To solve A we can use var() to find inconsistent values, utilizing the behavior of Filter() which always takes NAs as FALSE. To solve B we just need to get rid of duplicated rows in A applying unique().
Case A
with(df, df[!name %in% names(Filter(var, split(amount, name))), ])
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 4 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 10 George 120
# 11 George 120
# 12 Herbert 300
Case B
with(df, df[!name %in% names(Filter(var, split(amount, name))), ]) |>
unique()
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 12 Herbert 300

converting an abbreviation into a full word

I am trying to avoid writing a long nested ifelse statement in excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the countycode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
We can use a join
library(dplyr)
library(tidyr)
df2 <- df2 %>%
left_join(Abbre %>%
separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match
df2$COUNTY_NAME <- with(
df1,
COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21

R how to avoid "for" when I want to go through dataframe

give a brief example.
I have data frame data1.
name<-c("John","John","Mike","Amy".....)
nationality<-c("Canada","America","Spain","Japan".....)
data1<-data.frame(name,nationality....)
which mean the people is from different countries
each people is specialize by his name and country, and no repeat.
the second data frame is
name2<-c("John","John","Mike","John",......)
nationality2<-c("Canada","Canada","Canada".....)
score<-c(87,67,98,78,56......)
data2<-data.frame(name2,nationality2,score)
every people is promised to have 5 rows in data2, which means they have 5 scores but they are in random order.
what I want to do is to know every person's 5 scores, but I didn't care what his name is and where he is from.
the final data frame I want to have is
score1 score2 score3 score4 score5
1 89 89 87 78 90
2 ...
3 ...
every row represent one person 5 scores but I don't care who he is.
my data number is so large so I can not use for function.
what can I do?
Although there is an already accepted answer which uses base R I would like to suggest a solution which uses the convenient dcast() function for reshaping from wide to long form instead of using tapply() and repeated calls to rbind():
library(data.table) # CRAN version 1.10.4 used
dcast(setDT(data2)[setDT(data1), on = c(name2 = "name", nationality2 = "nationality")],
name2 + nationality2 ~ paste0("score", rowid(rleid(name2, nationality2))),
value.var = "score")
returns
name2 nationality2 score1 score2 score3 score4 score5
1: Amy Canada 93 91 73 8 79
2: John America 3 77 69 89 31
3: Mike Canada 76 92 46 47 75
It seems to me that's what you're asking:
data1 <- data.frame(name = c("John","Mike","Amy"),
nationality = c("America","Canada","Canada"))
data2 <- data.frame(name2 = rep(c("John","Mike","Amy","Jack","John"),each = 5),
score = sample(100,25), nationality2 =rep(c("America","Canada","Canada","Canada","Canada"),each = 5))
data3 <- merge(data2,data1,by.x=c("name2","nationality2"),by.y=c("name","nationality"))
data3$name_country <- paste(data3$name2,data3$nationality2)
all_scores_list <- tapply(data3$score,data3$name_country,c)
as.data.frame(do.call(rbind,all_scores_list))
# V1 V2 V3 V4 V5
# Amy Canada 57 69 90 81 50
# John America 4 92 75 15 2
# Mike Canada 25 86 51 20 12

Find matching intervals in data frame by range of two column values

I have a data frame of time related events.
Here is an example:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
There are essentially two different groups, ID 1 & 2. For each of those groups, there are 18 different name's. Each of those people appear in 3 different sequences, A-C. They then have active time periods during those sequences, and I mark the start/end events and calculate the duration.
I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.
Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.
I do not need to match the exact amount of shared 'active' time, I just am hoping to isolate the rows that are common.
My comforts are in using dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those are with precise and exact data points. I can't figure out the strategy with a range/interval.
Thank you!
UPDATE:
Here is the example of the desired result
Name Event Order Sequence start_event end_event duration Group
JOHN 3 A 392 429 37 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 2 A 384 407 23 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I'm thinking you'd isolate each event row for John, mark the start/end time frame and then iterate through every name and event for the remainder of the data frame to find time points that fit first within the same sequence, and then secondly against the bench-marked start/end time frame of John.
As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
jpos <- which(x$Name == "JOHN")
njpos <- which(x$Name != "JOHN")
over <- outer(jpos, njpos, function(a, b) {
overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
})
x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))
# Name EventOrder Sequence start_event end_event duration Group
# A.2 JOHN 2 A 60 112 52 ID1
# A.3 JOHN 3 A 392 429 37 ID1
# A.7 ADAM 1 A 19 75 56 ID2
# A.8 ADAM 2 A 384 407 23 ID2
# C.5 JOHN 5 C 147 226 79 ID1
# C.6 JOHN 6 C 566 611 45 ID1
# C.11 ADAM 5 C 140 205 65 ID2
# C.12 ADAM 6 C 522 599 77 ID2
Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].
This could be pretty easily mapped into dplyr language:
library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
jpos <- which(name == "JOHN")
njpos <- which(name != "JOHN")
over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
group_by(Sequence) %>%
slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
#
# Name EventOrder Sequence start_event end_event duration Group
# (fctr) (int) (fctr) (int) (int) (int) (fctr)
# 1 JOHN 2 A 60 112 52 ID1
# 2 JOHN 3 A 392 429 37 ID1
# 3 ADAM 1 A 19 75 56 ID2
# 4 ADAM 2 A 384 407 23 ID2
# 5 JOHN 5 C 147 226 79 ID1
# 6 JOHN 6 C 566 611 45 ID1
# 7 ADAM 5 C 140 205 65 ID2
# 8 ADAM 6 C 522 599 77 ID2
If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
dat <- dat[dat$Name %in% c(user1, user2),]
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
jpos <- which(x$Name == user1)
njpos <- which(x$Name == user2)
over <- outer(jpos, njpos, function(a, b) {
overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
})
x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))
}
You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:
apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use the base reshape so I probably should have shown this using reshape2 (Wickham) but forgot there was a reshape2 package. Slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)

Resources