Assign one name per ID from similar names in R

I have 1 million observations and 4 variables (ID, NAME, COMPANY, TIPS).
My ID values are correctly mapped, but the NAME column is mixed: some rows contain a full name and some only a first name. What is certain is that the last row of each ID (e.g. 2, 3, 4) holds the full name, so I want to propagate that full name to every row of the ID, giving one ID and one correct name.
Sample data table below (dt format):
ID Name Company Tips
1 Dave AB 50
2 PAT E DAV ABC 15
2 PAT ERIN DAV(full name) AB 26
3 JIL WIRTH DFG 26
3 JIL K WIRTH EF 45
3 JILL KATH WIRTH(full name) JUI 85
4 MARIANA PO KIL 50
4 MARIANA A PO(full name) LPI 55
5 BRET LLC 52
Expected Output
ID Name Company Tips
1 Dave AB 50
2 PAT ERIN DAV ABC 15
2 PAT ERIN DAV AB 26
3 JILL KATH WIRTH DFG 26
3 JILL KATH WIRTH EF 45
3 JILL KATH WIRTH JUI 85
4 MARIANA A PO KIL 50
4 MARIANA A PO LPI 55
5 BRET LLC 52

One way would be to take the longest name for each ID. Here is a way using dplyr...
library(dplyr)

df <- df %>%
  group_by(ID) %>%
  mutate(Name2 = Name[which.max(nchar(Name))])
df
ID Name Company Tips Name2
<int> <chr> <chr> <int> <chr>
1 1 Dave AB 50 Dave
2 2 PAT E DAV ABC 15 PAT ERIN DAV
3 2 PAT ERIN DAV AB 26 PAT ERIN DAV
4 3 JIL WIRTH DFG 26 JILL KATH WIRTH
5 3 JIL K WIRTH EF 45 JILL KATH WIRTH
6 3 JILL KATH WIRTH JUI 85 JILL KATH WIRTH
7 4 MARIANA PO KIL 50 MARIANA A PO
8 4 MARIANA A PO LPI 55 MARIANA A PO
9 5 BRET LLC 52 BRET
To overwrite Name with the new values, just change Name2 to Name.
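The same longest-name idea works in base R with ave(); a minimal sketch, assuming Name is a character column (not a factor):

```r
# Keep, for every row, the longest Name seen within its ID group
df <- data.frame(ID   = c(2, 2, 3),
                 Name = c("PAT E DAV", "PAT ERIN DAV", "JIL WIRTH"),
                 stringsAsFactors = FALSE)
df$Name <- ave(df$Name, df$ID, FUN = function(x) x[which.max(nchar(x))])
df$Name
# "PAT ERIN DAV" "PAT ERIN DAV" "JIL WIRTH"
```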

A base R solution would be to sort within each ID so that the row tagged (full name) comes last, then replace every name in the group with the last one. The outer gsub then strips the (full name) tag:
gsub('\\(.*', '',
     with(df[order(df$ID,
                   gsub("[\\(\\)]", "", regmatches(df$Name, gregexpr("\\(.*?\\)", df$Name)))), ],
          ave(Name, ID, FUN = function(i) tail(i, 1))))
#[1] "Dave" "PAT ERIN DAV" "PAT ERIN DAV" "JILL KATH WIRTH" "JILL KATH WIRTH" "JILL KATH WIRTH" "MARIANA A PO" "MARIANA A PO"
#[9] "BRET"

This solution uses functions from dplyr and tidyr. It fills in Name using the last one within each ID; dt2 is the final output.
If the literal string (full name) is actually in your data frame and you want to remove it, we can use gsub with a regular expression to do that; dt3 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Create example data frames
dt <- read.table(text = "ID Name Company Tips
1 Dave AB 50
2 'PAT E DAV' ABC 15
2 'PAT ERIN DAV(full name)' AB 26
3 'JIL WIRTH' DFG 26
3 'JIL K WIRTH' EF 45
3 'JILL KATH WIRTH(full name)' JUI 85
4 'MARIANA PO' KIL 50
4 'MARIANA A PO(full name)' LPI 55
5 'BRET' LLC 52",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- dt %>%
  group_by(ID) %>%
  # Replace names that are not on the last row of each ID with NA
  mutate(Name = ifelse(row_number() != n(), NA, Name)) %>%
  # Fill NA with the name from the last row
  fill(Name, .direction = "up")

# Remove the string (full name)
dt3 <- dt2 %>% mutate(Name = gsub("\\s*\\([^\\)]+\\)", "", Name))
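Since the full name is guaranteed to sit on the last row of each ID, the NA-plus-fill() steps can also be collapsed into a single mutate() with dplyr's last(); a sketch on a trimmed version of the same data:

```r
library(dplyr)

dt <- data.frame(ID   = c(2, 2),
                 Name = c("PAT E DAV", "PAT ERIN DAV(full name)"),
                 stringsAsFactors = FALSE)

# last() picks the final Name within each ID, replacing the NA + fill() steps
dt2 <- dt %>%
  group_by(ID) %>%
  mutate(Name = last(Name)) %>%
  ungroup()
```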


Move information to new column if the first value of the cell is a four-digit number

I have a column with addresses. The data is not clean: some entries contain street and house number, others contain postcode and city. I would like to move the postcode-and-city entries to another column with R, while street and house number stay in the old place. The postcode is a 4-digit number string. I am grateful for any suggestion for a solution.
An ifelse with grepl should help:
library(dplyr)
df <- df %>%
  mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter),
         Ort = ifelse(Strasse == '', Halter, ''))
#  Line          Halter     Strasse             Ort
#1    1        1007 Abc                    1007 Abc
#2    2 1012 Long words             1012 Long words
#3    3     Enelbach 54 Enelbach 54
#4    4         Abcd 56     Abcd 56
#5    5      Engasse 21  Engasse 21
grepl('^\\d{4}', Halter) returns TRUE if it finds a 4-digit number at the start of the string else returns FALSE.
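A quick check of that pattern against the sample values (a sketch):

```r
halter <- c('1007 Abc', '1012 Long words', 'Enelbach 54', 'Abcd 56', 'Engasse 21')

# TRUE only where the string begins with four digits
grepl('^\\d{4}', halter)
# TRUE TRUE FALSE FALSE FALSE
```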
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(Line = 1:5,
                 Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
                            'Abcd 56', 'Engasse 21'))
In addition to the neat solution of @Ronak Shah, if you want to use base R:
df <- data.frame(Line = 1:5,
                 Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
                            'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
  Line          Halter     Strasse             Ort
1    1        1007 Abc                    1007 Abc
2    2 1012 Long words             1012 Long words
3    3     Enelbach 54 Enelbach 54
4    4         Abcd 56     Abcd 56
5    5      Engasse 21  Engasse 21
Another option is separate from tidyr:
library(dplyr)
library(tidyr)
df %>%
  separate(Halter, into = c("Strasse", "Ort"),
           sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
  Line     Strasse             Ort
1    1                    1007 Abc
2    2             1012 Long words
3    3 Enelbach 54
4    4     Abcd 56
5    5  Engasse 21
data
df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words",
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA,
-5L))
Swiss postal codes are made up of 4 digits:
library(dplyr)
library(stringr)
df %>%
  mutate(Strasse = str_extract(Halter, '\\d{4}\\s.+'))
Line Halter Strasse
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 <NA>
4 4 Abcd 56 <NA>
5 5 Engasse 21 <NA>
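Note that '\\d{4}\\s.+' extracts the postcode-plus-city part, which the earlier answers put in Ort rather than Strasse; the street column is then simply the rows where the extraction came back NA. A base R sketch of both columns (using NA instead of '' for non-matches):

```r
halter  <- c('1007 Abc', '1012 Long words', 'Enelbach 54', 'Abcd 56', 'Engasse 21')

# Postcode-plus-city entries start with exactly four digits and a space
ort     <- ifelse(grepl('^\\d{4}\\s', halter), halter, NA)
# Everything that is not an Ort entry is a street entry
strasse <- ifelse(is.na(ort), halter, NA)
```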

Match strings by distance between non-equal length ones

Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to match both datasets, ordering each candidate match by string distance and keeping only the relevant matches. In other words, the output should look as follows:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that might work. So far I've been experimenting with packages such as stringr and stringdist.
You can use stringdist_inner_join to join the two dataframes and use levenshteinSim to get the similarity between the two names.
library(fuzzyjoin)
library(dplyr)

stringdist_inner_join(A, B, by = 'name') %>%
  mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
  arrange(distance)
# name.x age name.y company distance
#1 Peter 35 Peter B 0.000
#2 Samantha 33 Samantha A 0.000
#3 Samantha 33 Samanta A 0.125
#4 Joe 57 Joey C 0.250
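The same distance can be reproduced in base R, since levenshteinSim is 1 minus the Levenshtein distance divided by the longer string's length, and utils::adist() computes the edit distance. A sketch for scalar inputs:

```r
# Normalized Levenshtein distance, matching 1 - RecordLinkage::levenshteinSim
lev_dist <- function(a, b) drop(adist(a, b)) / pmax(nchar(a), nchar(b))

lev_dist("Joe", "Joey")          # 0.25
lev_dist("Samantha", "Samanta")  # 0.125
```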

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on a R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some key words have more than three co-occurring words. The data that I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question and modify the classic gather(), unite(), spread() strategy: create unique identifiers within each group to avoid duplicate-identifier errors, then drop that variable at the end:
library(dplyr)
library(tidyr)
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
# Owing to Richard Telford's excellent comment, I use tibble()
# (or equivalently for our purposes,
# data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors.
# (data_frame() is now deprecated in favor of tibble().)
df <- tibble(item1, item2, n)
df %>%
  group_by(item1) %>%
  mutate(id = 1:n()) %>%
  ungroup() %>%
  gather(temp, val, item2, n) %>%
  unite(temp2, item1, temp, sep = '_') %>%
  spread(temp2, val) %>%
  select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
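gather() and spread() still work but have since been superseded; the same reshape can be sketched with tidyr's pivot_wider(), using the same unique-id trick and names_glue to build the column names:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  item1 = c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon"),
  item2 = c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger"),
  n     = c("200","83","34","34","34","300","250","77","77","122","46","46")
)

df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%   # unique id within each item1 group
  ungroup() %>%
  pivot_wider(names_from = item1,
              values_from = c(item2, n),
              names_glue = "{item1}_{.value}") %>%
  select(-id)
```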

Best method to Merge two Datasets (Maybe if function?)

I have two datasets I am working with: TestA and TestB (below is how to create them in R).
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that all the columns of TestA are filled in with the information provided by TestB, matched on Class and Section.
I tried merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE), but it adds observations relative to the original TestA. This is just a test; the datasets I am actually working with have hundreds of observations. It worked with these smaller frames, but something goes wrong with the bigger set. That's why I'd like to know if there is a merge() alternative.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
  left_join(TestB, by = c("Class", "Section")) %>% # join on just the Class and Section columns
  select(Class,
         Section,
         Instructor = Instructor.x,
         Time = Time.x,
         Date = Date.x,
         Enrollment = Enrollment.x,
         Student,
         ID) %>%
  arrange(Class, Section) # added to match your output
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
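For completeness, merge() itself also behaves once the empty duplicate columns are dropped from TestB first, which is the key point above. A base R sketch on a trimmed version of the example data:

```r
TestA <- data.frame(Instructor = c('Mr.A', 'Mr.A', 'Mr.B'),
                    Class      = c('French', 'French', 'English'),
                    Section    = c('1', '2', '3'),
                    Enrollment = c('30', '40', '24'))
TestB <- data.frame(Student = c('Frances', 'Cass', 'Fern', 'Peter'),
                    ID      = c('123', '121', '101', '456'),
                    Class   = c('French', 'French', 'French', 'English'),
                    Section = c('1', '1', '2', '3'))

# Keep only the informative columns of TestB, so merge() creates no .x/.y duplicates
keep <- c("Student", "ID", "Class", "Section")
merged <- merge(TestA, TestB[keep], by = c("Class", "Section"), all.x = TRUE)
```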

R - Converting one table to multiple tables

I have a csv file called data.csv that contains 3 tables stacked into one. I would like to separate them into 3 different data.frames when I import the file into R.
So far this is what I have got after running this code:
df <- read.csv("data.csv")
View(df)
Student
Name Score
Maria 18
Bob 25
Paul 27
Region
Country Score
Italy 65
India 99
United 88
Sub region
City Score
Paris 77
New 55
Rio 78
How can I split them in such a way that I get this result:
First:
View(StudentDataFrame)
Name Score
Maria 18
Bob 25
Paul 27
Second:
View(regionDataFrame)
Country Score
Italy 65
India 99
United 88
Third:
View(SubRegionDataFrame)
City Score
Paris 77
New 55
Rio 78
One option would be to read the file with readLines, create a grouping variable ('grp') based on the locations of 'Student', 'Region', and 'Sub region' in the lines, split on it, and read each chunk with read.table:
i1 <- trimws(lines) %in% c("Student", "Region", "Sub region")
grp <- cumsum(i1)
lst <- lapply(split(lines[!i1], grp[!i1]), function(x)
  read.table(text = x, stringsAsFactors = FALSE, header = TRUE))
lst
#$`1`
# Name Score
#1 Maria 18
#2 Bob 25
#3 Paul 27
#$`2`
# Country Score
#1 Italy 65
#2 India 99
#3 United 88
#$`3`
# City Score
#1 Paris 77
#2 New 55
#3 Rio 78
data
lines <- readLines("data.csv")
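To end up with the three named data frames from the question, the captured section titles can name the list. A self-contained sketch, with an inline character vector standing in for the readLines result (only two of the three sections shown):

```r
# Stand-in for lines <- readLines("data.csv")
lines <- c("Student", "Name Score", "Maria 18", "Bob 25", "Paul 27",
           "Region",  "Country Score", "Italy 65", "India 99", "United 88")

i1  <- trimws(lines) %in% c("Student", "Region", "Sub region")
grp <- cumsum(i1)
lst <- lapply(split(lines[!i1], grp[!i1]), function(x)
  read.table(text = x, stringsAsFactors = FALSE, header = TRUE))

# Name each table after its section header line, then pull them out
names(lst) <- trimws(lines[i1])
StudentDataFrame <- lst[["Student"]]
regionDataFrame  <- lst[["Region"]]
```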
