I have a df with data, and a name for each row. I would like the names to be replaced by a random string/number, but with the same string, when a name appears twice or more (eg. for Adam and Camille below).
df <- data.frame("name" = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = F)
The expected output is something like this (the appearance and length of the random string are not important):
df_exp <- data.frame("name" = c("xxyz", "xxyz", "xyyz", "xyzz", "xyzz", "yyzz"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = F)
I have tried several random replacement functions in R, however each of them creates a random string for each row in data, and not an individual one for duplicates, eg. stri_rand_strings:
library(stringi)
library(magrittr)
library(tidyr)
library(dplyr)
df <- df %>%
mutate(UniqueID = do.call(paste0, Map(stri_rand_strings, n=6, length=c(2, 6),
pattern = c('[A-Z]', '[0-9]'))))
One way is with a group_by/mutate
df %>%
group_by(name) %>%
mutate(hidden = stringi::stri_rand_strings(1, length=4)) %>%
ungroup() %>%
mutate(name=hidden)
Basically we just generate one random string per group.
You could also generate a translation table first with something like
new_names <- df %>%
distinct(name) %>%
mutate(new_name = stringi::stri_rand_strings(n(), length=c(2,6)))
and then merge that to the original data. But either way I'm not sure that stri_rand_strings is guaranteed to return unique values -- they're just random values. While unlikely to be the same, it would be easier to check that they are all distinct by creating the translation table first.
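The translation-table idea above can be sketched end to end as follows. This is only an illustration under a few assumptions: the column is called food rather than "favourite food", every code is 6 characters long, and the redraw loop and the names new_names and df_anon are my own additions, not something stringi provides.

```r
library(dplyr)
library(stringi)

df <- data.frame(
  name = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"),
  food = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"),
  stringsAsFactors = FALSE
)

# One row per distinct name, each paired with a random 6-character code
new_names <- df %>%
  distinct(name) %>%
  mutate(new_name = stri_rand_strings(n(), length = 6))

# Redraw in the (unlikely) event that two names received the same code
while (anyDuplicated(new_names$new_name) > 0) {
  new_names$new_name <- stri_rand_strings(nrow(new_names), length = 6)
}

# Attach the codes to the original rows and overwrite the real names
df_anon <- df %>%
  left_join(new_names, by = "name") %>%
  mutate(name = new_name) %>%
  select(-new_name)
```

Keeping new_names around also gives you a lookup table if you ever need to map the codes back to the real names.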
I am stuck on what seems to be a doable task in R. I am merging several files and am required to change the names of columns as I go to maintain the data. There might be a better way to do that, but that is another story. In very simple terms, I have two files, dfA and dfB, as below. I need to merge the two files by "Model" in dfB against EITHER column "PART1", "PART2" or "PART3", depending on a match.
We could do the following:
Bring dfA into long format (note the use of the argument values_transform; see here: pivot_longer: values_ptypes: can't convert <integer> to <character>),
then use right_join by the appropriate columns and do some select:
library(dplyr)
library(tidyr)
dfA %>%
pivot_longer(
starts_with("PART"),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)
) %>%
right_join(dfB, by=c("val"="Model")) %>%
select(Model=val, Detail)
Model Detail
<chr> <chr>
1 A Dog
2 2 Cat
3 Z Cow
data:
dfA <- tibble(PART1 = c("A", "B", "C"),
PART2 = c("X", "Y", "Z"),
PART3 = c(1,2,3),
Detail = c("Dog", "Cat", "Cow"))
dfB <- tibble(Model = c("A", "Z", 2))
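Try this simple, though not optimal, approach:
library("dplyr")
# PART3 is numeric while Model is character, so convert it before joining
dfA_chr <- mutate(dfA, PART3 = as.character(PART3))
# Join on each PART column in turn and stack the matches;
# inner_join keeps only the rows of dfB that actually matched
dfC <- bind_rows(
  inner_join(dfB, dfA_chr, by = c("Model" = "PART1")),
  inner_join(dfB, dfA_chr, by = c("Model" = "PART2")),
  inner_join(dfB, dfA_chr, by = c("Model" = "PART3"))
) %>%
  select(Model, Detail)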
I'm looking for a more efficient way to write the following:
Read in all my Excel files
DF1 <- read_excel(DF1, sheet = "ABC", range = cell_cols(1:10) )
DF2 <- read_excel(DF2, sheet = "ABC", range = cell_cols(1:10) )
etc...
DF50 <- read_excel(DF50, sheet = "ABC", range = cell_cols(1:10) )
Add a column to each DF with a location
DF1$Location <- location1
DF2$Location <- location2
etc...
DF50$Location <- location50
Keep only columns with specified names, get rid of blank rows, and convert column CR_NUMBER to an integer
library(hablar)
DF1 <- DF1 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
DF2 <- DF2 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
etc...
DF50 <- DF50 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
You can try the following, which reads the data into a list:
library(readxl)
library(hablar)
library(dplyr)
#Get the complete path of file which has name "DF" followed by a number.
file_names <- list.files('/folder/path', pattern = 'DF\\d+', full.names = TRUE)
list_data <- lapply(seq_along(file_names), function(x) {
data <- read_excel(file_names[x], sheet = "ABC", range = cell_cols(1:10))
data %>%
mutate(Location = paste0('location', x)) %>%
select(all_of(colnames_r)) %>%
filter(!is.na(NAME)) %>%
convert(int(CR_NUMBER))
})
list_data is a list of data frames, which is usually easier to manage than 50 separate data frames in the global environment. If you still want all the data frames separately, name the list and use list2env.
names(list_data) <- paste0('DF', seq_along(list_data))
list2env(list_data, .GlobalEnv)
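If the goal is to work with all files at once, another option is to stack the list into a single data frame. The sketch below uses two small made-up data frames in place of the Excel files; the column name source is my own choice.

```r
library(dplyr)

# Toy stand-ins for the data frames read from Excel (hypothetical values)
list_data <- list(
  DF1 = data.frame(NAME = c("a", "b"), CR_NUMBER = c(1L, 2L)),
  DF2 = data.frame(NAME = "c", CR_NUMBER = 3L)
)

# Stack the list into one data frame; .id records which list element
# each row came from, using the names of the list
all_data <- bind_rows(list_data, .id = "source")
```

With everything in one data frame, later steps such as filtering or summarising only have to be written once.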
I am trying to create a for loop in R that will make a new data frame ("results") when values of a column ("AreaName2") in one data frame (df2) match the value in a column ("ISLAND") from a different data frame (df1).
If there are no matches in the first column in df2, then I want it to move on to pair a second set of columns from df2 and df1 (df2: "AreaName1" and df1: "ARCHIP"). Again, if there is a match, it should be printed in the new data frame. If, again, there is no match, then I want it to move on to a third pair of columns (df2: "Country" and df1: "COUNTRY").
If all columns in df2 are blank, then I would like to skip that row.
If there is some information in one of the columns in df2, but it doesn't match df1, I would like it to state that somehow, if that is possible.
I have made an example of df1, df2, and results:
ID <- c(1,2,3,4,5, 6)
COUNTRY <- c("country1", 'country2', 'country3','country4', 'country5', 'country6')
ARCHIP <- c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6')
ISLAND <- c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6')
df1 <- data.frame(ID, COUNTRY, ARCHIP, ISLAND)
Sciname <- c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6")
AreaName2 <- c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland')
AreaName1 <- c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6')
Country <- c("country1", "country2", "country3", 'country4', NA, 'country6')
df2 <- data.frame(Sciname, Country, AreaName1, AreaName2)
Species <- c("scientificName1","scientificName2", "scientificName3", "scientificName4", 'scientificName6')
Location <- c("someIsland1", "archipelago2", "someIsland3", 'country4', 'UNRECOGNISED')
results <- data.frame(Species, Location)
I was thinking that I need to do something along the lines of this for each column set
for (i in df2$AreaName2) {
results[[i]] <- if(df2$AreaName2 %in% df1$ISLAND)
}
But I am not sure how to make it work for each set, or how to make it run through several columns. Maybe I should make a for loop for each of the sets of columns I wish to match?
Any ideas? Thanks!
# I like to use tidyverse :)
library(tidyverse)
# First, to create our datasets - (Thank you for providing sample data!)
# I've set this up in a slightly different way, in an attempt to keep our workspace clear.
# I've also used tibble in place of data.frame, to line up with the tidyverse approach.
df1 <- tibble( ID = 1:6,
COUNTRY = c("country1", 'country2', 'country3','country4', 'country5', 'country6'),
ARCHIP = c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6'),
ISLAND = c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6'))
df2 <- tibble( Sciname = c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6"),
Country = c("country1", "country2", "country3", 'country4', NA, 'country6'),
AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6'),
AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland'))
# Rather than use a for loop, I'll use full_join to match the two tables, then filter for the conditions you're looking for.
# Merge data
join_country <- full_join(df2, df1, by = c("Country" = "COUNTRY"))
# Identify scinames with matching island names
# I use _f to signify my goal here - filtering
island_f <- join_country %>%
filter(AreaName2 == ISLAND) %>%
# Keep only relevant columns
select(Sciname, Location = AreaName2)
# Identify scinames with matching archip names
archip_f <- join_country %>%
filter(
# Exclude scinames we've identified with matching island names.
!(Sciname %in% island_f$Sciname),
AreaName1 == ARCHIP) %>%
select(Sciname, Location = AreaName1)
# Identify scinames left over (countries already matched from full_join)
country_f <- join_country %>%
filter(
# Drop df1-only rows (no Sciname) and df2 rows that have no Country to report
!is.na(Sciname), !is.na(Country),
# Exclude scinames we've identified with matching island or archip names.
!(Sciname %in% island_f$Sciname),
!(Sciname %in% archip_f$Sciname)) %>%
select(Sciname, Location = Country)
sciname_location <- bind_rows(island_f,
archip_f,
country_f) %>%
arrange(Sciname)
# Finally, to identify records that are populated but don't match at all, we can use anti_join.
records_no_match <- anti_join(df1, df2, by = c("COUNTRY" = "Country"))
You can learn more about relational data from R for Data Science, chapter 13.
Please let me know if you have any questions!
A different solution might be to prioritise the locations first, and then filter for the locations with the highest priority.
Just like Rebecca, I would opt for the tidyverse ;-)
library(tidyverse)
# Bring df2 into long format
df_long2 <- pivot_longer(df2, -Sciname) %>%
select(-name) %>%
mutate(value = replace_na(value, "UNRECOGNISED"))
# Bring df1 into long format
df_long1 <- pivot_longer(df1, -ID) %>%
select(-ID)
results <- df_long2 %>%
left_join(df_long1, by = "value") %>%
# Prioritize names
mutate(lvl = case_when(
name == "ISLAND" ~ 1,
name == "ARCHIP" ~ 2,
name == "COUNTRY" ~ 3,
is.na(name)~ 4
)) %>%
# Group by species
group_by(Sciname) %>%
# Filter for groups with lowest lvl/highest priority
filter(lvl == min(lvl)) %>%
# Drop duplicate rows
distinct() %>%
select(-name, -lvl) %>%
# Rename
rename(Species = 1,
Location = 2)
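A third option, offered only as a sketch, checks each df2 column against its df1 counterpart with %in% and keeps the first hit with coalesce(). The data below is a cut-down, made-up version of the sample data (sp1 to sp4 and countryX are placeholders). Note one assumption: a populated but unmatched island falls through to the archipelago and then the country, and "UNRECOGNISED" is used only when nothing matches.

```r
library(dplyr)

df1 <- tibble(
  COUNTRY = c("country1", "country2"),
  ARCHIP  = c("archipelago1", "archipelago2"),
  ISLAND  = c("someIsland1", "someIsland2")
)
df2 <- tibble(
  Sciname   = c("sp1", "sp2", "sp3", "sp4"),
  Country   = c("country1", "country2", NA, "countryX"),
  AreaName1 = c("archipelago1", "archipelago2", NA, NA),
  AreaName2 = c("someIsland1", NA, NA, NA)
)

results <- df2 %>%
  # Skip rows where every location column is blank
  filter(!(is.na(AreaName2) & is.na(AreaName1) & is.na(Country))) %>%
  # Take the first level that matches df1: island, then archipelago, then country
  mutate(Location = coalesce(
    if_else(AreaName2 %in% df1$ISLAND, AreaName2, NA_character_),
    if_else(AreaName1 %in% df1$ARCHIP, AreaName1, NA_character_),
    if_else(Country %in% df1$COUNTRY, Country, NA_character_),
    "UNRECOGNISED"
  )) %>%
  select(Species = Sciname, Location)
```

Because %in% never returns NA, blank cells simply count as "no match" and need no special handling inside if_else().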
Good luck!
I would like to create a new data frame from two existing data frames. They share columns called first name, last name, and email, and I want to stack the second data frame under the first one to create a list of all the emails I have. The data frames contain duplicates, which I want to keep for now so that I can eliminate them in the next step. Obviously, the code I posted below does not work. Any help?
first <- c("andrea","luis","mike","thomas")
last <- c("robinson", "trout", "rice","snell")
email <- c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
first <- c("mike","steven","mark","john", "martin")
last <- c("rice", "berry", "smalls","sale", "arnold")
email <- c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com", "ma#gmail.com")
alz <- c(1,2,NA,3,4)
der <- c(0,2,3,NA,3)
all_emails <- data.frame(first,last,email)
no_contact_emails <- data.frame(first,last,email,alz,der)
df <- merge(no_contact_emails, all_emails, all = TRUE)
df <- df$email[!duplicated(df$email) & !duplicated(df$email, fromLast = TRUE)]
The expected output is a joined dataset with all the emails except the one for mike rice, since that is the one that is duplicated.
Your reproducible example is a little confusing, so I made you a new one to see if this is what you are looking for:
df1 <- data.frame(
first = c("andrea","luis","mike","thomas"),
last = c("robinson", "trout", "rice","snell"),
email = c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
)
df2 <- data.frame(
first = c("mike","steven","mark","john", "martin"),
last = c("rice", "berry", "smalls","sale", "arnold"),
email = c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com",
"ma#gmail.com")
)
Now, there are 2 different ways you can do this, using dplyr:
library(dplyr)
df1 %>%
bind_rows(df2) %>%
distinct(first, last, .keep_all = TRUE)
Or:
df1 %>%
full_join(df2)
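Both versions above keep one copy of the duplicated row. If, as the question describes, an email that appears more than once should be dropped entirely, one possible sketch is to stack the two data frames and keep only the emails that occur exactly once. The shortened data and the name unique_only are just for illustration.

```r
library(dplyr)

df1 <- data.frame(first = c("andrea", "mike"), last = c("robinson", "rice"),
                  email = c("andrea#gmail.com", "mr#gmail.com"))
df2 <- data.frame(first = c("mike", "steven"), last = c("rice", "berry"),
                  email = c("mr#gmail.com", "st#gmail.com"))

unique_only <- bind_rows(df1, df2) %>%
  group_by(email) %>%
  # Keep an email only if it occurs exactly once across both data frames
  filter(n() == 1) %>%
  ungroup()
```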
Hope this helps!
I have an example data set with a column that reads somewhat like this:
Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee
What I'd like to do is replace it into just two factors - "Candy" and "Non-Candy". I can do this with Python/Pandas, but can't seem to figure out a dplyr based solution. Thank you!
With dplyr's mutate() and base R's replace():
dat %>%
mutate(var = replace(var, var != "Candy", "Not Candy"))
Significantly faster than the ifelse approaches.
Code to create the initial dataframe can be as below:
library(dplyr)
# as_data_frame() is deprecated; tibble() also lets us name the column directly
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
Assuming your data frame is dat and your column is var:
dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
Another solution with dplyr using case_when:
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
TRUE ~ 'Non-Candy'))
The syntax for case_when is condition ~ replacement value; see ?case_when for the documentation.
Probably less efficient than the solution using replace, but an advantage is that multiple replacements could be performed in a single command while still being nicely readable, i.e. replacing to produce three levels:
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
var == 'Water' ~ 'Water',
TRUE ~ 'Neither-Water-Nor-Candy'))
No need for dplyr. Assuming var is stored as a factor already:
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)
See ?levels.
This is much more efficient than the ifelse approach, which is bound to be slow:
library(microbenchmark)
set.seed(01239)
# resample data
# wrap in factor() so that levels() works even if dat$var is stored as character
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))
timings <- replicate(50, {
# copy data to facilitate reuse
cop <- smp
t0 <- get_nanotime()
levs <- setdiff(levels(cop$var), "Candy")
levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
t1 <- get_nanotime() - t0
cop <- smp
t0 <- get_nanotime()
cop = cop %>%
mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
t2 <- get_nanotime() - t0
cop <- smp
t0 <- get_nanotime()
cop$var <-
factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
t3 <- get_nanotime() - t0
c(levels = t1, dplyr = t2, direct = t3)
})
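x <- apply(timings, 1, median)
x[2:3]/x[1]
# dplyr direct
# 8.894303 4.962791
That is, the levels approach is roughly 9 times faster than the dplyr/ifelse approach and roughly 5 times faster than the direct factor() call.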
I didn't benchmark this, but at least in some cases with more than one condition, a combination of mutate and a list seems to provide an easy solution:
# assuming that all sweet things fall in one category
dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))
conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE,
"Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)
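# index by name (not by factor code) and flatten the list into a logical vector
dat %>% mutate(sweet = unlist(conditions[as.character(var)], use.names = FALSE))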
When you only need two values, a simple ifelse() is prettiest, I think.
Furthermore, nested ifelse() calls can reproduce the case_when solution proposed by PhJ (I do like the readability of his version, though)!
dat %>%
mutate(
var = ifelse(var == "Candy", "Candy", "Non-Candy")
)
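For more than two values, the same idea can be nested. The following is a small sketch on a shortened, made-up version of the data; the labels mirror the three-level case_when example and dat3 is just an illustrative name.

```r
library(dplyr)

dat <- data.frame(var = c("Candy", "Sanitizer", "Water", "Cake"),
                  stringsAsFactors = FALSE)

dat3 <- dat %>%
  mutate(
    # The inner ifelse() is only reached for values that are not "Candy"
    var = ifelse(var == "Candy", "Candy",
                 ifelse(var == "Water", "Water", "Neither-Water-Nor-Candy"))
  )
```

Each extra level adds another layer of nesting, which is where case_when starts to read more easily.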