Create unique list of names - r

I have a list of actors:
name <- c('John Doe','Peter Gynn','Jolie Hope')
age <- c(26 , 32, 56)
postcode <- c('4011', '5600', '7700')
actors <- data.frame(name, age, postcode)
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
I also have an edge list of relations:
from <- c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope')
to <- c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
edge <- data.frame(from, to)
from to
1 John Doe John Doe
2 John Doe John Doe
3 John Doe Peter Gynn
4 Peter Gynn Jolie Hope
5 Peter Gynn Peter Gynn
6 Jolie Hope Frank Smith
First, I want to eliminate self references in my edge list i.e. rows 1,2,5 in my 'edge' dataframe.
non.self.ref <- edge[!(edge$from == edge$to),]
does not produce the desired result.
Second, edge includes a name not in the 'actor' dataframe ('Frank Smith'). I want to add 'Frank Smith' to my 'actor' dataframe, even though I do not have age or postcode data for 'Frank Smith'. For example:
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
4 Frank Smith NA NA
I would be grateful for a tidy solution!

Here is a tidyverse solution to both parts, though in general try not to ask multiple questions per question.
The first part is fairly simple. filter allows a very intuitive syntax that just specifies you want to keep rows where from isn't equal to to.
The second part is a little more complicated. First we gather up the from and to columns, so all the actors are in one column. Then we use distinct to leave us with a one-column tbl of unique actor names. Finally, we use full_join to combine the tables. A full_join keeps all rows and columns from both tables, matching on the shared name column by default, and fills in NA where there is no data (as is the case for Frank).
library(tidyverse)
actors <- tibble(
name = c('John Doe','Peter Gynn','Jolie Hope'),
age = c(26 , 32, 56),
postcode = c('4011', '5600', '7700')
)
edge <- tibble(
from = c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope'),
to = c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
)
edge %>%
filter(from != to)
#> # A tibble: 3 x 2
#> from to
#> <chr> <chr>
#> 1 John Doe Peter Gynn
#> 2 Peter Gynn Jolie Hope
#> 3 Jolie Hope Frank Smith
edge %>%
gather("to_from", "name", from, to) %>%
distinct(name) %>%
full_join(actors)
#> Joining, by = "name"
#> # A tibble: 4 x 3
#> name age postcode
#> <chr> <dbl> <chr>
#> 1 John Doe 26.0 4011
#> 2 Peter Gynn 32.0 5600
#> 3 Jolie Hope 56.0 7700
#> 4 Frank Smith NA <NA>
Created on 2018-03-02 by the reprex package (v0.2.0).
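For comparison, here is a minimal base R sketch of the same two steps, assuming the from, to and name columns are plain character vectors (as in the tibbles above). The names non_self_ref, all_names and actors_full are only illustrative. Note that merge() sorts its result by the key, so the row order differs from the full_join output.

```r
# Same data as above, kept as character columns
actors <- data.frame(
  name = c('John Doe', 'Peter Gynn', 'Jolie Hope'),
  age = c(26, 32, 56),
  postcode = c('4011', '5600', '7700'),
  stringsAsFactors = FALSE
)
edge <- data.frame(
  from = c('John Doe', 'John Doe', 'John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope'),
  to = c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith'),
  stringsAsFactors = FALSE
)

# Part 1: drop rows where an actor points at themselves
non_self_ref <- edge[edge$from != edge$to, ]

# Part 2: every name that occurs in either column, once
all_names <- data.frame(name = union(edge$from, edge$to), stringsAsFactors = FALSE)

# all = TRUE keeps names found in only one table and fills the gaps with NA
actors_full <- merge(actors, all_names, by = "name", all = TRUE)
```

Here union() is applied to the full edge list rather than the filtered one, so an actor who only ever appears in self-referencing rows would still be kept.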

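I discovered that my original attempt works once the columns are kept as character vectors instead of factors, by including stringsAsFactors = FALSE, e.g.
edge <- data.frame(from, to, stringsAsFactors = FALSE)
then:
non.self.ref <- edge[!(edge$from == edge$to),]
works! (Before R 4.0, data.frame() converted strings to factors by default. Here from and to end up with different level sets, so == fails.)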
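A small sketch of why this happens, assuming the factors are created explicitly with stringsAsFactors = TRUE so that it behaves the same on any R version (edge_f, cmp and keep are illustrative names). The two factor columns have different level sets, so == raises an error. Comparing the values as character avoids that without rebuilding the data frame.

```r
from <- c('John Doe', 'John Doe', 'John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope')
to <- c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')

# Force factor columns, as data.frame() did by default before R 4.0
edge_f <- data.frame(from, to, stringsAsFactors = TRUE)

# 'to' has the extra level 'Frank Smith', so the level sets differ
# and the comparison stops with an error instead of returning TRUE/FALSE
cmp <- try(edge_f$from == edge_f$to, silent = TRUE)
inherits(cmp, "try-error")

# Comparing the underlying strings works regardless of the levels
keep <- as.character(edge_f$from) != as.character(edge_f$to)
non.self.ref <- edge_f[keep, ]
```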

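An option with dplyr is to filter the rows by comparing 'from' and 'to' (this gives the first output; the step is not needed if we are only interested in the second output), then unlist, get the unique values, convert them to a tibble, and do a left_join with 'actors'. Note that with the filter step in place, a name that appears only in self-referencing rows would not reach the second output.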
library(dplyr)
edge %>%
filter(from != to) %>% #get the results for the first question
unlist %>%
unique %>%
tibble(name = .) %>%
left_join(actors) # second output
# A tibble: 4 x 3
# name age postcode
# <chr> <dbl> <fctr>
#1 John Doe 26.0 4011
#2 Peter Gynn 32.0 5600
#3 Jolie Hope 56.0 7700
#4 Frank Smith NA <NA>

Related

Merge dataframe with a key value that is contained within a string in a separate dataframe

employee <- c('John','Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah')
salary <- c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020')
emp <- data.frame(employee, salary)
benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')
ben <- data.frame(benefit, benefit_id)
Above we have two data frames: one contains names and a unique ID, the other contains a category and a space-separated list of unique IDs.
What is the most efficient way to merge the ben dataframe with the emp dataframe such that we get the appropriate benefit assigned to each employee?
tidyverse
library(dplyr)
library(tidyr) # for unnest()
ben %>%
mutate(benefit_id = strsplit(benefit_id, "\\s+")) %>%
unnest(benefit_id) %>%
left_join(emp, ., by = c(salary = "benefit_id"))
# employee salary benefit
# 1 John VT020 Health
# 2 Peter VT126 Health
# 3 Peter VT126 Bonus
# 4 Gynn VT027 <NA>
# 5 Jolie VT667 Health
# 6 Jolie VT667 Time
# 7 Jolie VT667 Bonus
# 8 Hope VC120 <NA>
# 9 Sue VT000 Bonus
# 10 Jane VA120 <NA>
# 11 Sarah VA020 Health
Depending on your needs, you may also prefer a different join. For instance, use a full_join if you want all pairings, where NA in employee indicates a benefit sans employee.
FYI: if you are running R before 4.0, then you might have factors in your data. To fix that, just convert the factor columns with as.character first. (This can be determined with sapply(ben, inherits, "factor").)
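As a minimal sketch of that conversion, assuming the data frame was built with factor columns (forced here with stringsAsFactors = TRUE so it reproduces on any R version; is_fct is an illustrative name):

```r
benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')

# Simulate the pre-4.0 default, where strings become factors
ben <- data.frame(benefit, benefit_id, stringsAsFactors = TRUE)

# Find the factor columns and convert only those back to character
is_fct <- sapply(ben, is.factor)
ben[is_fct] <- lapply(ben[is_fct], as.character)
```

After this, strsplit() can be applied to ben$benefit_id, since it only accepts character input.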
data.table
library(data.table)
setDT(emp)
ben_long <- setDT(ben)[, list(benefit_id = unlist(strsplit(x = benefit_id, split = " "))), by = benefit]
merge(x = emp, y = ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
salary employee benefit
1: VA020 Sarah Health
2: VA120 Jane <NA>
3: VC120 Hope <NA>
4: VT000 Sue Bonus
5: VT020 John Health
6: VT027 Gynn <NA>
7: VT126 Peter Health
8: VT126 Peter Bonus
9: VT667 Jolie Health
10: VT667 Jolie Time
11: VT667 Jolie Bonus
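base R

If you would rather avoid extra packages, the same idea (split the IDs, repeat each benefit once per ID, then merge) can be sketched in base R. This assumes the columns are character vectors. The names ids, ben_long and res are illustrative.

```r
emp <- data.frame(
  employee = c('John', 'Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah'),
  salary = c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020'),
  stringsAsFactors = FALSE
)
ben <- data.frame(
  benefit = c('Health', 'Time', 'Bonus'),
  benefit_id = c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000'),
  stringsAsFactors = FALSE
)

# One character vector of IDs per benefit
ids <- strsplit(ben$benefit_id, "\\s+")

# Long format: one row per (benefit, ID) pair
ben_long <- data.frame(
  benefit = rep(ben$benefit, lengths(ids)),
  benefit_id = unlist(ids),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps employees without any benefit (benefit is NA)
res <- merge(emp, ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
```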

Remove duplicates from ONE column not row

I am trying to remove duplicate emails in a column of my data.frame using duplicated() and distinct() in R. However, I do not need it to delete the whole row, just the duplicate email addresses in that column. Is there any way to do that using these? Or is there another way to do this?
library(tidyverse)
patient2 <- c('John Doe','Peter Gynn','Jolie Hope', "Mycroft Holmes",
"Carrie Bird", "Carrie Bird", "Marcus Quimby", "Jennifer Poe", "Donna Moon")
salary2 <- c(21000, 23400, 26800, 40000, 50000, 33000, 24000, 75000, 90000)
email2 <- c("doe#gmail.com", "gynn#gmail.com", "hope#gmail.com",
"holmes#gmail.com", "bird#gmail.com", "bird#gmail.com", "quimby#gmail.com",
"poe#gmail.com", "moon#gmail.com")
startdate2 <- as.Date(c('2010-11-1','2008-3-25','2007-3-14', '2020-7-19',
'2019-4-20', '2018-2-13', '2017-4-21', '2019-6-10', '2010-9-19'))
patient.data_2 <- data.frame(patient2, salary2, email2, startdate2)
print(patient.data_2)
patient2<fctr> salary2<dbl> email2<fctr> startdate2<date>
John Doe 21000 doe#gmail.com 2010-11-01
Peter Gynn 23400 gynn#gmail.com 2008-03-25
Jolie Hope 26800 hope#gmail.com 2007-03-14
Mycroft Holmes 40000 holmes#gmail.com 2020-07-19
Carrie Bird 50000 bird#gmail.com 2019-04-20
Carrie Bird 33000 bird#gmail.com 2018-02-13
Marcus Quimby 24000 quimby#gmail.com 2017-04-21
Jennifer Poe 75000 poe#gmail.com 2019-06-10
Donna Moon 90000 moon#gmail.com 2010-09-19
extracted <- patient.data_2[!duplicated(patient.data_2$email2), ]
extracted
All I would like to do is remove the extra duplicate email for the person Carrie Bird, not the entire row, because the date is different. I tried using duplicated() and distinct() and both removed the entire row.
You could use the duplicated function:
dat <- data.frame(a = c(1, 1, 2, 2, 3, 3, 4, 4, 4, 4))
dat$a[duplicated(dat$a)] <- NA
dat
#> a
#> 1 1
#> 2 NA
#> 3 2
#> 4 NA
#> 5 3
#> 6 NA
#> 7 4
#> 8 NA
#> 9 NA
#> 10 NA
Using dplyr
library(dplyr)
dat <- dat %>%
mutate(a = replace(a, duplicated(a), NA))
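Applied to a cut-down version of the data in the question, this could look like the sketch below. It is reduced to three rows for brevity, with stringsAsFactors = FALSE as an assumption so the email column is character. Only the repeated email is blanked; both Carrie Bird rows stay.

```r
patient.data_2 <- data.frame(
  patient2 = c("Carrie Bird", "Carrie Bird", "Marcus Quimby"),
  email2 = c("bird#gmail.com", "bird#gmail.com", "quimby#gmail.com"),
  startdate2 = as.Date(c("2019-04-20", "2018-02-13", "2017-04-21")),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each email;
# overwrite just those cells with NA and leave the rows in place
patient.data_2$email2[duplicated(patient.data_2$email2)] <- NA
```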

Add multiple new columns to the dataset, based on another dataset's elements

I have the following products list
> products
# A tibble: 311 x 1
value
<fct>
1 NA
2 Alternativ Economy
3 Ambulant Balance
4 Ambulant Economy
5 Ambulant Premium
6 Ambulant 2
7 Ambulant 3
8 Ambulant 1
9 COMPLETA
10 HOSPITAL ECO
# ... with 301 more rows
and the following df
> df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
+ salary = c(21000, 23400, 26800),
+ startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
> df
employee salary startdate
1 John Doe 21000 2010-11-01
2 Peter Gynn 23400 2008-03-25
3 Jolie Hope 26800 2007-03-14
Now, I want to add the elements of the former (i.e. products) as variables of the latter (i.e. the df). I use
cbind(df, setNames(lapply(products, function(x) x = NA), products))
but I get an error. Can you suggest another way of doing this? What is wrong with my solution? thanks in advance
Here is one solution.
df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
salary = c(21000, 23400, 26800),
startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))
#products$value <- ifelse(is.na(products$value), "not_available", as.character(products$value))
cbind(df, `colnames<-`(data.frame(matrix(ncol = nrow(products), nrow = nrow(df))), products$value))
employee salary startdate NA Alternativ Economy COMPLETA
1 John Doe 21000 2010-11-01 NA NA NA
2 Peter Gynn 23400 2008-03-25 NA NA NA
3 Jolie Hope 26800 2007-03-14 NA NA NA
I question the wisdom of having NAs as column names, so I'd uncomment that one line of code in there to replace NAs with some character string instead.
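Another way to get the same columns, as a minimal sketch: assign NA to a character vector of new column names. The small products data frame is the stand-in from above, and new_cols and the "not_available" placeholder are illustrative choices.

```r
df <- data.frame(employee = c('John Doe', 'Peter Gynn', 'Jolie Hope'),
                 salary = c(21000, 23400, 26800),
                 startdate = as.Date(c('2010-11-1', '2008-3-25', '2007-3-14')))
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))

# Column names must be character; as.character() also handles a factor column
new_cols <- as.character(products$value)

# Give the missing product a usable name instead of NA
new_cols[is.na(new_cols)] <- "not_available"

# Assigning to names that do not exist yet creates the columns, filled with NA
df[new_cols] <- NA
```

This keeps the product names exactly as given, including the space in "Alternativ Economy".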

Trim leading/trailing whitespaces from a data frame column where the column name comes as a variable

I have a dataframe where the name of the column to be trimmed of whitespace comes in as a variable, and I am not able to resolve the variable to the column so that it can be trimmed.
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employee <- c(' John Doe ',' Peter Gynn ',' Jolie Hope')
employ.data <- data.frame(employee, salary, startdate)
Here I try to trim the employee column, and I have tried dplyr:
employ.data %>% mutate(employee = trimws(employee))
which works.
However, if I say:
abc <- "employee"
and then
employ.data %>% mutate(abc= trimws(abc))
It doesn't work.
I have tried using get(abc) in this function, but this doesn't work either.
I understand I can't use abc as employ.data$abc when abc is a variable column name.
INITIAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
FINAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
You can also use str_trim from stringr in the tidyverse.
employ.data %>%
mutate(abc = str_trim(employee))
Which is:
employee salary startdate abc
1 John Doe 21000 2010-11-01 John Doe
2 Peter Gynn 23400 2008-03-25 Peter Gynn
3 Jolie Hope 26800 2007-03-14 Jolie Hope
Use mutate_at
library(dplyr)
employ.data %>% mutate_at(abc, trimws)
# employee salary startdate
#1 John Doe 21000 2010-11-01
#2 Peter Gynn 23400 2008-03-25
#3 Jolie Hope 26800 2007-03-14
Or you can directly do, if you have only one column
employ.data[[abc]] <- trimws(employ.data[[abc]])
If there are multiple columns you can use lapply
employ.data[abc] <- lapply(employ.data[abc], trimws)
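With more recent dplyr, the same thing can be written with across(). The sketch below assumes dplyr 1.0.0 or later. It also hints at why the original attempt fails: in mutate(abc = trimws(abc)), the abc on the left is taken literally as a new column name, and the abc on the right is the string "employee" from your environment, not the column. all_of() tells dplyr to treat the contents of abc as column names. The name trimmed is illustrative.

```r
library(dplyr)

employ.data <- data.frame(
  employee = c(' John Doe ', ' Peter Gynn ', ' Jolie Hope'),
  salary = c(21000, 23400, 26800),
  stringsAsFactors = FALSE
)
abc <- "employee"

# all_of(abc) selects the column(s) named in the character vector abc;
# across() applies trimws to each selected column in place
trimmed <- employ.data %>%
  mutate(across(all_of(abc), trimws))
```

Because abc can hold several names, the same call works unchanged for more than one column.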

For Each Loop to convert into numeric values [duplicate]

I have a question about the manipulation of a data frame. If I have this data frame as an example:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
location <- c('New York', 'Alabama','New York')
employ.data <- data.frame(employee, salary, startdate, location)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 New York
2 Peter Gynn 23400 2008-03-25 Alabama
3 Jolie Hope 26800 2007-03-14 New York
Now I want to transform the location into numeric values. I know that I can do something like this:
transformlocation <- function(x) {
x <- as.character(x)
if (x =='New York'){
return('1')
}else if (x=='Alabama'){
return('2')
}else if (x=='Florida'){
return('3')
}else
return('0')
}
employ.data$location <- sapply(employ.data$location, transformlocation)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 1
2 Peter Gynn 23400 2008-03-25 2
3 Jolie Hope 26800 2007-03-14 1
But in my final dataset there are hundreds of different values. For example, is it possible to work with a for each statement here?
Thanks for your help!
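If it is already a factor variable, then simply convert it to integer. The codes follow the order of the factor levels, which is alphabetical by default (so Alabama is 1 and New York is 2), i.e.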
employ.data$location <- as.integer(employ.data$location)
employ.data
# employee salary startdate location
#1 John Doe 21000 2010-11-01 2
#2 Peter Gynn 23400 2008-03-25 1
#3 Jolie Hope 26800 2007-03-14 2
Otherwise convert to factor and then integer, i.e.
employ.data$location <- as.integer(as.factor(employ.data$location))
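If you want the numbers in order of first appearance instead (New York as 1, as in the question), one possible sketch is to use match() against the unique values, or to set the factor levels explicitly. The variable names below are illustrative.

```r
location <- c('New York', 'Alabama', 'New York')

# Default factor levels are sorted alphabetically: Alabama = 1, New York = 2
alphabetical <- as.integer(factor(location))

# Position of each value among the unique values, in order of first appearance
by_appearance <- match(location, unique(location))

# Same result via a factor whose levels are given in order of appearance
by_levels <- as.integer(factor(location, levels = unique(location)))
```

Either way, no loop is needed, however many distinct locations there are.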
