For a simple 1:1 merge, for example:
data_merge <- merge(dataset1, dataset2, by.x = "name", by.y = "name")
Is there a way to check which values were successfully merged, i.e., a count or flag of the rows brought in only from dataset1, those brought in only from dataset2, and those matched in both?
The tidylog package provides that for dplyr/tidyr operations. For instance,
library(dplyr); library(tidylog)
left_join(band_members, band_instruments)
Result:
Joining, by = "name"
left_join: added one column (plays)
           > rows only in x   1
           > rows only in y  (1)
           > matched rows     2
           >                 ===
           > rows total       3
# A tibble: 3 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  NA
2 John  Beatles guitar
3 Paul  Beatles bass
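If you are working with base merge() rather than dplyr, a similar breakdown can be assembled by hand with a full outer join and two indicator columns. A minimal sketch (it assumes dataset1 and dataset2 from the question, each with a "name" column):
dataset1$in_x <- TRUE                        # mark rows coming from dataset1
dataset2$in_y <- TRUE                        # mark rows coming from dataset2
data_merge <- merge(dataset1, dataset2,
                    by.x = "name", by.y = "name", all = TRUE)
data_merge$in_x <- !is.na(data_merge$in_x)   # FALSE = row only in dataset2
data_merge$in_y <- !is.na(data_merge$in_y)   # FALSE = row only in dataset1
with(data_merge, table(in_x, in_y))          # cross-tab of match status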
I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
  id = 1:8,
  name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
  code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
  id = c("1|2|3", 4:8),
  name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
  code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could use a condition with if/else after grouping: extract the words from the 'name' column and check for any intersecting elements, create a flag that is TRUE when the number of intersecting elements is greater than 0 and the group size (n()) is greater than 1, and use this flag to paste/str_c the elements of the other columns.
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
  group_by(code = factor(code, levels = unique(code))) %>%
  mutate(flag = n() > 1 &
           (str_extract_all(name, "\\w+") %>%
              reduce(intersect) %>%
              length %>%
              is_greater_than(0))) %>%
  summarise(across(-flag, ~ if (any(flag))
    str_c(.x, collapse = "|") else as.character(.x)),
    .groups = 'drop') %>%
  select(names(df))
Output:
# A tibble: 6 × 3
  id    name                      code
  <chr> <chr>                     <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4     rhino                     7845BC
3 5     hippo                     7845BC
4 6     elephant                  6144DE
5 7     elephant bros             7845KI
6 8     last comp                 7845EG
OP's expected output:
> final_df
     id                      name   code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2     4                     rhino 7845BC
3     5                     hippo 7845BC
4     6                  elephant 6144DE
5     7             elephant bros 7845KI
6     8                 last comp 7845EG
You can use this helper function f(), and apply it to each group:
f <- function(d) {
  if (length(Reduce(intersect, strsplit(d[["name"]], " "))) > 0) {
    d <- lapply(d, paste0, collapse = "|")
  }
  return(d)
}
library(data.table)
setDT(df)[, id := as.character(id)][, f(.SD), code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp
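If you want the question's original id/name/code column order back (the grouped data.table call puts the grouping column code first), setcolorder() can be applied afterwards. A small sketch, reusing f() and df from above:
out <- setDT(df)[, id := as.character(id)][, f(.SD), code]
setcolorder(out, c("id", "name", "code"))    # restore the original column order
out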
I have a data frame that looks like this:
names           value
John123abc          1
George12894xyz      2
Mary789qwe          3
I want to rename all the values of the column "names" and keep only the name itself (not the extra numbers and characters attached to it). Imagine that the code attached to each name changes and I have 100,000 rows. I thought something like starts_with("John") = "John" might work.
Ideally I want the new data frame to look like this:
names  value
John       1
George     2
Mary       3
How can I do this in R using dplyr?
library(tidyverse)
names = c("John123abc","George12894xyz","Mary789qwe")
value = c(1,2,3)
dat = tibble(names,value)
Using stringr::str_remove you could do:
library(tidyverse)
names = c("John123abc","George12894xyz","Mary789qwe")
value = c(1,2,3)
dat = tibble(names,value)
dat |>
  mutate(names = str_remove(names, "\\d+.*$"))
#> # A tibble: 3 × 2
#> names value
#> <chr> <dbl>
#> 1 John 1
#> 2 George 2
#> 3 Mary 3
Using base R
dat$names <- trimws(dat$names, whitespace = "\\d+.*")
Output:
> dat
# A tibble: 3 × 2
names value
<chr> <dbl>
1 John 1
2 George 2
3 Mary 3
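An equivalent base R pattern, in case the trimws() whitespace trick feels obscure, is a plain sub() that removes everything from the first digit onward (a small sketch on the same dat):
dat$names <- sub("[0-9].*$", "", dat$names)   # drop the first digit and everything after it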
I'm trying to update Sam's role in the tibble below from PM to TC. For some reason, I have no idea how to get this to work, even though it seems simple. Here is my data frame.
df <- tibble(
  Name = c("Sam", "Jane", "Sam", "Sam", "James", "Mary", "Swain"),
  Role = c("PM", "TC", "PM", "PM", "RX", "TC", "TC"),
  Number = 1:7
)
Then I have this if_else statement, which is supposed to conditionally update Role to "TC" when Name == "Sam". But all it is doing is changing every row's value to TC regardless of the name. I don't know why, or how to fix it.
df$Role <- if_else(df$Name == "Sam", df$Role <- "TC", df$Role)
The problem is the embedded assignment: evaluating df$Role <- "TC" inside if_else overwrites the entire Role column, so the original values are gone before the result is assigned back. You could instead use the mutate function and refer to column names without the dollar sign and data frame name, as if they were ordinary objects in R:
library(dplyr)
df %>%
  mutate(Role = if_else(Name == 'Sam', 'TC', Role))
# A tibble: 7 × 3
Name Role Number
<chr> <chr> <int>
1 Sam TC 1
2 Jane TC 2
3 Sam TC 3
4 Sam TC 4
5 James RX 5
6 Mary TC 6
7 Swain TC 7
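For completeness, the original base R line also works once the embedded assignment is removed; this minimal sketch just returns "TC" where the condition holds instead of assigning inside if_else:
df$Role <- if_else(df$Name == "Sam", "TC", df$Role)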
I'm trying to count bigrams independently of order, so for example 'John Doe' and 'Doe John' should be counted together as 2.
I already tried some examples using text mining, such as those provided at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but couldn't find any counting that ignores the order of appearance.
library('widyr')
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)
word_pairs
It counts them separately, like this:
  item1     item2         n
  <chr>     <chr>     <dbl>
1 darcy     elizabeth   144
2 elizabeth darcy       144
It should look like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
Thanks if anyone can help me.
This code works. There is probably something more efficient out there though.
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name, 1)
df$second <- word(df$name, 2)
# Reorder alphabetically
df$a <- ifelse(df$first < df$second, df$first, df$second)
df$b <- ifelse(df$first > df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1
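The same alphabetical-reordering idea can be written a little more compactly with pmin()/pmax(), which compare character vectors element-wise. A sketch using the sample df above together with dplyr's count():
library(dplyr)
library(stringr)
df %>%
  mutate(a = pmin(word(name, 1), word(name, 2)),     # alphabetically first word
         b = pmax(word(name, 1), word(name, 2))) %>% # alphabetically second word
  count(a, b)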
Thanks guys,
I considered your suggestions and tried a similar approach:
library(dplyr)
# Function to order 2 variables alphabetically.
# I got this function from another post; I can't remember the author.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
# For each row, reorder the columns and append to a new data frame
# Thanks to Gregor
for (i in 1:nrow(dfSample)) {
  row <- c(as.character(dfSample[i, 1]), as.character(dfSample[i, 2]))
  if (!alphabetical(row[1], row[2])) {
    row <- c(row[2], row[1])
  }
  dfCreated <- rbind(dfCreated, c(row[1], row[2]), stringsAsFactors = FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
# Thanks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2
I have the following data frame
   Name   Product Unit Class
2  sushil seeds
4  sanju  Soap    46   C
5  rahul          5
7  sanju          4    E
9  sushil         20   B
10 rahul  Soap         A
What I need is a data frame without duplicate rows, built according to the conditions below:
If a row has all column values filled, eliminate its duplicate row.
If a row has some column values empty, fill the empty cells with the corresponding values from its duplicate row.
The desired result should look like this.
  Name   Product Unit Class
1 sushil seeds   20   B
2 sanju  Soap    46   C
3 rahul  Soap    5    A
Thanks in advance for the help!
Here is the df code:
Name <- c("abbas","sushil","abbas","sanju","rahul","shweta","sanju","rajiv","sushil","rahul")
Unit <- c(18," ",18,46,5,67,4,3,20," ")
Product <- c("Rice","seeds","Rice","Soap"," ","Towel"," "," "," ","Soap")
Class <- c("A"," ","A","C"," ","D","E","A","B","A")
Data <- data.frame(Name,Product,Unit,Class)
duplicate <- which(duplicated(Data))
unique <- Data[!duplicated(Data),]
NewData <- unique[unique$Name %in% unique$Name[duplicated(unique$Name)],]
In the following I am assuming that the primary ID is the Name column (and that df is the de-duplicated NewData data frame from the question).
First part (harder):
library(tidyverse)
df[df == "" | df == " "] <- NA   # blank cells (empty or single-space) become NA
df2 <- df %>%
  mutate(complete = complete.cases(df)) %>%
  group_by(Name) %>%
  mutate(any_complete = any(complete)) %>%
  filter(complete | (!complete & !any_complete)) %>%
  select(-complete, -any_complete)
Result:
# A tibble: 5 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <int> <chr>
1 sushil seeds NA NA
2 sanju Soap 46 C
3 rahul NA 5 NA
4 sushil NA 20 B
5 rahul Soap NA A
Explanation: first we replace all missing strings by actual NA's. Then, we create a column, complete, which checks whether all of the columns are complete for a given row. Next we create another column that tells us whether, for any given Name there is a complete observation. Finally, we keep only the rows which are either (i) complete or (ii) not complete, but a complete observation for that Name is missing.
Second task is simpler, but boring:
df2 %>%
  arrange(Name, Product) %>% fill(Product) %>%
  arrange(Name, Unit) %>% fill(Unit) %>%
  arrange(Name, Class) %>% fill(Class) %>%
  filter(!duplicated(Name))
Result:
# A tibble: 3 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <int> <chr>
1 rahul Soap 5 A
2 sanju Soap 46 C
3 sushil seeds 20 B
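Both steps can also be collapsed into a single grouped summarise that keeps the first non-missing value of each column per Name. A compact sketch along the same lines (it assumes the blanks were already converted to NA as above):
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")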