R: merge two tables and create new columns for repeated values

Good day everyone. I am trying to merge two data frames into a new data frame that keeps each unique col1/col2 combination once and puts the repeated values into new columns.
For example, two dataframes are:
df1
col1 col2
A B
C D
df2
col1 col2 col3
A B E
A B F
C D G
C D H
C D I
Target output is
col1 col2 col3 col4 col5
A B E F
C D G H I
Hope you can help me. Thanks!

I'm not sure whether the final format you are after is actually helpful. However, the first step is a simple left or full join:
df1 <- data.frame(col1 = c("A", "C"),
col2 = c("B", "D"), stringsAsFactors = F)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
col2 = c("B", "B", "D", "D", "D"),
col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = F)
library(tidyverse)
res <- left_join(df1, df2, by = c("col1", "col2"))
res
col1 col2 col3
1 A B E
2 A B F
3 C D G
4 C D H
5 C D I
Getting the result into the desired form is a bit trickier.
First we do the same left join as above, then unite the two key columns (col1 and col2) into one so that we can group and spread by them easily.
Grouping by the united column (fuse), we number each col3 value within its group and paste "col" as a prefix so that the counter becomes a column name when spreading. The offset of 2 makes the new names start at col3.
We then spread by the counter column n and fill it with the values of col3.
Finally, we reverse the unite we did earlier.
left_join(df1, df2, by = c("col1", "col2")) %>%
unite(fuse, col1, col2) %>%
group_by(fuse) %>%
mutate(n = paste0("col", 2 + 1:n())) %>%
spread(n, col3) %>%
separate(fuse, c("col1", "col2"))
# A tibble: 2 x 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
1 A B E F NA
2 C D G H I
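For readers on a recent tidyr, where spread is superseded, the same reshaping can be sketched with pivot_wider. This is only one possible variant, not the answer's own code: it groups by the two key columns directly instead of uniting them, and the helper column name (name) and the result object (wide) are chosen here purely for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(col1 = c("A", "C"),
                  col2 = c("B", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
                  col2 = c("B", "B", "D", "D", "D"),
                  col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = FALSE)

wide <- left_join(df1, df2, by = c("col1", "col2")) %>%
  group_by(col1, col2) %>%
  # number the col3 values within each key pair: col3, col4, col5, ...
  mutate(name = paste0("col", 2 + row_number())) %>%
  ungroup() %>%
  # one row per key pair, one column per counter value
  pivot_wider(names_from = name, values_from = col3)

wide
```

Key pairs with fewer repeated values than the longest group are padded with NA, as in the spread version.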

Related

Summation of money amounts in character format by group

I have a data frame that contains the monetary transactions among individuals. The transactions can be two-way, i.e. A can transfer money to B and B can also transfer money to A. The structure of the data frame looks like below:
From To Amount
A B $100
A C $40
A D $30
B A $25
B C $70
C A $190
C D $110
I want to summarize the total amount of transactions among each pair of individuals who have transactions with each other and the results should be something like:
Individual_1 Individual_2 Sum
A B $125
A C $230
A D $30
B C $70
C D $110
I tried to utilize the grouping feature of the package dplyr but I think it does not apply to my case.
You can use pmin/pmax to sort From and To columns and sum the Amount value.
library(dplyr)
df %>%
group_by(col1 = pmin(From, To),
col2 = pmax(From, To)) %>%
summarise(Amount = sum(readr::parse_number(Amount)))
# col1 col2 Amount
# <chr> <chr> <dbl>
#1 A B 125
#2 A C 230
#3 A D 30
#4 B C 70
#5 C D 110
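To see why pmin/pmax gives a direction-free key, here is a small base R sketch on three made-up transactions (the vectors from, to and amount are illustrative, not the question's full data). pmin and pmax compare element by element, so A to B and B to A both end up with the pair (A, B).

```r
from   <- c("A", "B", "C")
to     <- c("B", "A", "A")
amount <- c("$100", "$25", "$190")

first  <- pmin(from, to)   # alphabetically smaller name of each pair
second <- pmax(from, to)   # alphabetically larger name of each pair

# strip the literal "$" and convert to numbers
value  <- as.numeric(sub("$", "", amount, fixed = TRUE))

# sum per unordered pair
totals <- tapply(value, paste(first, second), sum)
totals
```

The comparison of character vectors follows the current locale's collation order, which is worth keeping in mind if the names are not plain ASCII.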
Using the same logic in base R you can do :
aggregate(Amount~col1 + col2,
transform(df, col1 = pmin(From, To), col2 = pmax(From, To),
Amount = as.numeric(sub('$', '', Amount, fixed = TRUE))), sum)
data
df <- structure(list(From = c("A", "A", "A", "B", "B", "C", "C"), To = c("B",
"C", "D", "A", "C", "A", "D"), Amount = c("$100", "$40", "$30",
"$25", "$70", "$190", "$110")), class = "data.frame", row.names = c(NA, -7L))
A solution using the tidyverse package. You need to find a way to create a common grouping column with the right order of the individuals. dat2 is the final output.
library(tidyverse)
dat2 <- dat %>%
mutate(Amount = as.numeric(str_remove(Amount, "\\$"))) %>%
mutate(Group = map2_chr(From, To, ~str_c(sort(c(.x, .y)), collapse = "_"))) %>%
group_by(Group) %>%
summarize(Sum = sum(Amount, na.rm = TRUE)) %>%
separate(Group, into = c("Individual_1", "Individual_2"), sep = "_") %>%
mutate(Sum = str_c("$", Sum))
print(dat2)
# # A tibble: 5 x 3
# Individual_1 Individual_2 Sum
# <chr> <chr> <chr>
# 1 A B $125
# 2 A C $230
# 3 A D $30
# 4 B C $70
# 5 C D $110
Data
dat <- read.table(text = "From To Amount
A B $100
A C $40
A D $30
B A $25
B C $70
C A $190
C D $110",
header = TRUE)
A complete solution without packages, based on @RonakShah's great pmin/pmax approach. It uses list notation in aggregate (instead of formula notation), which allows the output columns to be named.
with(
transform(d, a=as.numeric(gsub("\\D", "", Amount)), b=pmin(From, To), c=pmax(From, To)),
aggregate(list(Sum=a), list(Individual_1=b, Individual_2=c), function(x)
paste0("$", sum(x))))
# Individual_1 Individual_2 Sum
# 1 A B $125
# 2 A C $230
# 3 B C $70
# 4 A D $30
# 5 C D $110
Data:
d <- structure(list(From = c("A", "A", "A", "B", "B", "C", "C"), To = c("B",
"C", "D", "A", "C", "A", "D"), Amount = c("$100", "$40", "$30",
"$25", "$70", "$190", "$110")), class = "data.frame", row.names = c(NA,
-7L))

left_join in a for loop with different columns names

I have a data.frame called a whose structure is similar to:-
a <- data.frame(X1=c("A", "B", "C", "A", "C", "D"),
X2=c("B", "C", "D", "A", "B", "A"),
X3=c("C", "D", "A", "B", "A", "B")
)
And I have another set which is:-
b <- data.frame(Xn=c("A", "B", "C", "D"),
Feature=c("some", "more", "what", "why"))
I want to add all the Features from set b to set a, such that X1, X2 and X3 have their corresponding feature column in set a. In other words, the columns in set a become:-
colnames(a) <- c("X1", "X2", "X3", "Features1", "Features2", "Features3")
How can I do this using a left_join in a for loop?
In base R, we can unlist the data frame and match it against b$Xn to get the corresponding Feature values, then cbind the result to the original data frame to get the final answer.
temp <- a
temp[] <- b$Feature[match(unlist(temp), b$Xn)]
names(temp) <- paste0('Feature', seq_along(temp))
cbind(a, temp)
# X1 X2 X3 Feature1 Feature2 Feature3
#1 A B C some more what
#2 B C D more what why
#3 C D A what why some
#4 A A B some some more
#5 C B A what more some
#6 D A B why some more
In tidyverse, we can get the data in long format, join the data and get it back to wide format.
library(dplyr)
library(tidyr)
a %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
left_join(b, by = c('value' = 'Xn')) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = Feature) %>%
select(-row) %>%
rename_all(~paste0('Feature', seq_along(.))) %>%
bind_cols(a, .)
This can be done by using mutate_all to recode all of the columns in a (note that funs() has been deprecated since dplyr 0.8.0):
library(tidyverse)
a %>%
mutate_all(funs(feat=recode(., !!!set_names(as.character(b$Feature), b$Xn))))
X1 X2 X3 X1_feat X2_feat X3_feat
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more
You can add a rename_at to get the desired names:
a %>%
mutate_all(funs(f=recode(., !!!set_names(as.character(b$Feature), b$Xn)))) %>%
rename_at(vars(matches("f")), ~gsub(".([0-9]).*", "Feature\\1", .))
X1 X2 X3 Feature1 Feature2 Feature3
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more
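Since the question asks specifically for a left_join inside a for loop, here is a minimal sketch of that route. It assumes the columns are character (hence stringsAsFactors = FALSE); the names res, key and lookup are introduced here for illustration only. On each pass the lookup table b is renamed so that its key matches the current column of a and its value column gets a numbered name.

```r
library(dplyr)

a <- data.frame(X1 = c("A", "B", "C", "A", "C", "D"),
                X2 = c("B", "C", "D", "A", "B", "A"),
                X3 = c("C", "D", "A", "B", "A", "B"),
                stringsAsFactors = FALSE)
b <- data.frame(Xn = c("A", "B", "C", "D"),
                Feature = c("some", "more", "what", "why"),
                stringsAsFactors = FALSE)

res <- a
for (i in seq_along(a)) {
  key <- names(a)[i]
  # rename b's columns to e.g. X1 / Feature1 so the join key lines up
  lookup <- setNames(b, c(key, paste0("Feature", i)))
  res <- left_join(res, lookup, by = key)
}
res
```

Because each value of Xn appears only once in b, every join keeps the six rows of a in their original order.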

Loop over list of data frames using data.table

I have 3 data frames that I'd like to run the same data.table function on. I could do this manually for each data.frame but I'd like to learn how to do it more efficiently.
Using the data.table package, I want to replace the contents of col1 with the contents of col2 only if col1 contains "a". And I want to run this code over three different dataframes. On a single data.frame, this works fine:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
library(data.table)
dt = data.table(df1)
dt[grepl(pattern = "a", x = df1$col1), col1 :=col2]
but I am lost trying to get this to run over multiple dataframes:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"))
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"))
library(data.table)
listdfs = list(df1, df2, df3)
for (i in dt[[]]) {
dt[[i]][grepl(pattern = "a", x = df[[i]]$col1), col1 := col2] }
But this obviously doesn't work because I have no clue what I'm doing with the for loop. Any guidance/teaching would be appreciated. Thanks!
If we are looping through the list, loop over the sequence of list indices and do the assignment on each element:
listdfs = list(df1, df2, df3)
lapply(listdfs, setDT) # change the `data.frame` to `data.table`
for (i in seq_along(listdfs)) { # loop over sequence
listdfs[[i]][grepl(pattern = "a", x = col1), col1 := col2]
}
This changes the elements of listdfs (each is now a data.table) as well as the objects df1, df2 and df3 themselves, because no copy was made:
df1
# col1 col2
#1: AA AA # change
#2: AA AA # change
#3: b AA
df2
# col1 col2
#1: b AA
#2: b BB
#3: BB BB # change
df3
# col1 col2
#1: b AA
#2: b AA
#3: b BB
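If modifying df1, df2 and df3 in place is not wanted, one alternative is to build a new list of data.tables and leave the originals alone. The sketch below assumes character columns (stringsAsFactors = FALSE) and uses as.data.table, which copies its input; the name listdts is made up for this example.

```r
library(data.table)

df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"),
                  stringsAsFactors = FALSE)

listdts <- lapply(list(df1, df2, df3), function(d) {
  dt <- as.data.table(d)                 # works on a copy of d
  dt[grepl("a", col1), col1 := col2]     # update matching rows by reference
  dt[]                                   # return the modified table
})

listdts[[2]]
```

The trade-off is memory: each table is copied once, which only matters for large inputs.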

R grep search patterns in multiple columns

I have a data frame like as follows:
Col1 Col2 Col3
A B C
D E F
G H I
I am trying to keep lines matching 'B' in 'Col2' OR F in 'Col3', in order to get:
Col1 Col2 Col3
A B C
D E F
I tried:
data[(grep("B",data$Col2) || grep("F",data$Col3)), ]
but it returns the entire data frame.
NOTE: it works when calling the 2 grep one at a time.
Or use a single grepl after pasting the columns together (note that this matches B or F in either column, not B in Col2 and F in Col3 specifically):
df1[with(df1, grepl("B|F", paste(Col2, Col3))),]
# Col1 Col2 Col3
#1 A B C
#2 D E F
with(df1, df1[ Col2 == 'B' | Col3 == 'F',])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Using grepl
with(df1, df1[ grepl( 'B', Col2) | grepl( 'F', Col3), ])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Data:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
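As for why the attempt in the question returns the whole data frame: grep returns the positions of the matches (here 1 and 2), and || collapses those two numbers into a single TRUE, which is then recycled over every row. The sketch below rebuilds the example data with data.frame() for brevity and shows the elementwise alternative; keep and result are illustrative names.

```r
df1 <- data.frame(Col1 = c("A", "D", "G"),
                  Col2 = c("B", "E", "H"),
                  Col3 = c("C", "F", "I"), stringsAsFactors = FALSE)

grep("B", df1$Col2)    # position of the match, not a logical vector
grep("F", df1$Col3)

# grepl gives one TRUE/FALSE per row; | combines them row by row
keep   <- grepl("B", df1$Col2) | grepl("F", df1$Col3)
result <- df1[keep, ]
result
```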
The data.table package makes this type of operation trivial due to its compact and readable syntax. Here is how you would perform the above using data.table:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
library(data.table)
DT <- data.table(df1)
DT
# Col1 Col2 Col3
# 1: A B C
# 2: D E F
# 3: G H I
DT[Col2 == 'B' | Col3 == 'F']
# Col1 Col2 Col3
# 1: A B C
# 2: D E F
data.table evaluates the filter expression within the table itself, so columns can be referenced by name without a df1$ prefix. Matching is much faster if you set keys on the data, but that is a topic for another time.

merge the rows in R with the same row name concatenating the content in the column [duplicate]

I need help merging the rows with the same name by concatenating the content of one of the columns. For example, in my data frame df, the rows with the same name match completely across the columns except in col3. I want to merge the rows with the same rowname and concatenate the contents of col3, separated by a comma, to get the result shown below. Thank you for your help.
df
rowname col1 col2 col3
pat 122 A T
bus 222 G C
pat 122 A G
result
rowname col1 col2 col3
pat 122 A T,G
bus 222 G C
Try
aggregate(col3~., df, FUN=toString)
# rowname col1 col2 col3
#1 pat 122 A T, G
#2 bus 222 G C
Or using dplyr
library(dplyr)
df %>%
group_by(across(all_of(names(df)[1:3]))) %>% # replaces the deprecated group_by_(.dots = ...); needs dplyr >= 1.0.0
summarise(col3=toString(col3))
# rowname col1 col2 col3
#1 bus 222 G C
#2 pat 122 A T, G
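Both approaches above use toString, which joins values with a comma followed by a space ("T, G"). If the exact "T,G" form from the question is needed, a small base R variation is to pass paste with collapse = "," instead. The data frame is rebuilt with data.frame() here for brevity, and res is an illustrative name.

```r
df <- data.frame(rowname = c("pat", "bus", "pat"),
                 col1 = c(122, 222, 122),
                 col2 = c("A", "G", "A"),
                 col3 = c("T", "C", "G"), stringsAsFactors = FALSE)

# extra arguments after FUN are passed on to paste()
res <- aggregate(col3 ~ ., df, FUN = paste, collapse = ",")
res
```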
data
df <- structure(list(rowname = c("pat", "bus", "pat"), col1 = c(122,
222, 122), col2 = c("A", "G", "A"), col3 = c("T", "C", "G")),
.Names = c("rowname",
"col1", "col2", "col3"), row.names = c(NA, -3L), class = "data.frame")
