Dplyr: Anonymising values up to a million rows with unique names - r

I have the following data:
library(dplyr)
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('John', 'Jane', 'Rich', 'Clive'),
surname = c('Smith', 'Jones', 'Smith', 'Jones'))
I would like to anonymise the values within the 'forename' and 'surname' variables so that the data looks like this:
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('forename1', 'forename2', 'forename3', 'forename4'),
surname = c('surname1', 'surname2', 'surname3', 'surname4'))
I could do this manually, but my data frame has millions of rows. I would like each new value to carry the number of the row it sits on, so the data on row 67, for example, would show:
d <- tibble(
region = c('all'),
forename = c('forename67'),
surname = c('surname67'))
Does anyone know how I would achieve this using dplyr if possible?
Thanks

As every row is a unique user, we can paste the row number (row_number()) onto the column name.
library(dplyr)
d %>%
mutate(forename = paste0("forename", row_number()),
surname = paste0("surname", row_number()))
# A tibble: 4 x 3
# region forename surname
# <chr> <chr> <chr>
#1 all forename1 surname1
#2 one forename2 surname2
#3 eleven forename3 surname3
#4 six forename4 surname4

An option with stringr
library(dplyr)
library(stringr)
d %>%
mutate(forename = str_c("forename", row_number()),
surname = str_c("surname", row_number()))
Or with lapply from base R
d[c('forename', 'surname')] <- lapply(c('forename', 'surname'), function(x)
paste0(x, seq_len(nrow(d))))
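The dplyr version above repeats the same expression for each column. A minimal sketch of the same idea written once for any set of columns, assuming dplyr 1.0.0 or later (for across() and cur_column()); the object name anonymised is only for illustration:

```r
library(dplyr)

d <- tibble(
  region = c("all", "one", "eleven", "six"),
  forename = c("John", "Jane", "Rich", "Clive"),
  surname = c("Smith", "Jones", "Smith", "Jones")
)

# cur_column() returns the name of the column currently being processed,
# so each value becomes "<column name><row number>"
anonymised <- d %>%
  mutate(across(c(forename, surname), ~ paste0(cur_column(), row_number())))

anonymised
```

Adding another column to anonymise only means adding its name inside c().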

Related

Replace values in dataframe based on other dataframe with column name and value

Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match, but I don't know how to mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))
We could join by 'id', then mutate across the columns named in 'column', replacing the element whose 'column' entry matches the current column name (cur_column()) with the corresponding 'new_value'.
library(dplyr)
df %>%
left_join(corrections) %>%
mutate(across(all_of(column), ~ replace(.x, match(cur_column(),
column), new_value[match(cur_column(), column)]))) %>%
select(names(df))
-output
id name score1 score2
1 1 John 8 9
2 2 Nina 6 7
It's an implementation of a feasible idea with dplyr::rows_update, though it involves functions of multiple packages. In practice I prefer a moderately parsimonious approach.
library(tidyverse)
corrections %>%
group_by(id) %>%
group_map(
~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
.keep = TRUE) %>%
reduce(rows_update, by = 'id', .init = df)
# id name score1 score2
# 1 1 John 8 9
# 2 2 Nina 6 7
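For a small corrections table, the "find the id and column, then change the value" step can also be written as a plain loop. This is only a base R sketch, not either answer's method; the name fixed is hypothetical, and it assumes each correction targets exactly one existing id and column:

```r
df <- data.frame(
  id = c(1, 2),
  name = c("John", "Ninaa"),
  score1 = c(8, 6),
  score2 = c(NA, 7),
  stringsAsFactors = FALSE
)
corrections <- data.frame(
  id = c(2, 1),
  column = c("name", "score2"),
  new_value = c("Nina", "9"),  # stored as character, as in the question
  stringsAsFactors = FALSE
)

fixed <- df
for (i in seq_len(nrow(corrections))) {
  row <- match(corrections$id[i], fixed$id)   # row holding this id
  col <- corrections$column[i]                # name of the column to fix
  # turn "9" back into a number; text such as "Nina" stays character
  value <- type.convert(corrections$new_value[i], as.is = TRUE)
  fixed[row, col] <- value
}

fixed
```

The loop touches one cell per correction, so column types other than the corrected cell are left alone.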

Adding values from lookup-table based on condition to data frame in R

I've got a data frame containing data of participants who rated images (column image_index):
Now I want to add a new column with gender-specific values of the rated image from another data frame.
Look-up table of image data:
Final data frame:
How can I accomplish this task?
Sample data:
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
complete_dataset <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19),
external_value = c(125,100,130,125)
)
You need to make a few manipulations on your data to join them together.
Pivot lookup_data longer with tidyr::pivot_longer() so the gender info is in a column to help merge on.
Use dplyr::rename() to make sure the column names are the same between the two tables.
Transform the gender column so it is just 1 letter to match the other table. Here I use stringr::str_sub(x, 1,1) which just takes the first character of a string.
Then I use left_join() to merge. Because the joining column names are already the same I don't need to specify.
Finally I just reorder and sort the data to match your expected output.
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
lookup_data %>%
pivot_longer(-index, names_to = "gender", values_to = "external_value") %>%
rename(image_index = index) %>%
mutate(gender = str_sub(gender, 1, 1)) %>%
left_join(., participants_data) %>%
drop_na(ID) %>%
select(ID, gender, image_index, external_value) %>%
arrange(ID)
#> Joining, by = c("image_index", "gender")
#> # A tibble: 4 x 4
#> ID gender image_index external_value
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 f 19 125
#> 2 2 m 2 100
#> 3 3 d 2 130
#> 4 4 f 19 125
Created on 2022-02-18 by the reprex package (v2.0.1)
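The same lookup can be done without reshaping, by indexing the lookup table with (row, column) pairs. A minimal base R sketch; the mapping vector gender_cols is an assumption about how the one-letter codes correspond to the lookup column names:

```r
participants_data <- data.frame(
  ID = c(1, 2, 3, 4),
  gender = c("f", "m", "d", "f"),
  image_index = c(19, 2, 2, 19),
  stringsAsFactors = FALSE
)
lookup_data <- data.frame(
  index = c(2, 19),
  male = c(100, 110),
  female = c(150, 125),
  diverse = c(130, 90)
)

# assumed mapping from gender code to lookup column name
gender_cols <- c(f = "female", m = "male", d = "diverse")

# row of the lookup table for each rated image
rows <- match(participants_data$image_index, lookup_data$index)
# column of the lookup table for each participant's gender
cols <- match(gender_cols[participants_data$gender], names(lookup_data))

# a two-column index matrix picks one cell per participant
participants_data$external_value <- as.matrix(lookup_data)[cbind(rows, cols)]

participants_data
```

This keeps the original row order of participants_data, so no final arrange step is needed.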

R reorder column values alphabetic

I have a data frame like this in R:
and I want to reorder the second column "Car" alphabetically, like this:
Car
Audi/BMW/VW
Audi/BMW
Audi/BMW/VW
Audi/BMW/Porsche/VW
There could be 0 to 15 cars, with "/" as the separator.
My solution is a little bit complicated (build a new data frame with this column, split it into multiple columns, reorder the rows alphabetically, paste them together, insert into the original data frame).
Do you know a better and smarter solution?
Thanks a lot
This is basically what you did but without creating new dataframe and new columns.
df$Car <- sapply(strsplit(as.character(df$Car), "/"), function(x)
paste(sort(x), collapse = "/"))
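To see how the split/sort/paste approach behaves across the "0 to 15 cars" range, here is a small self-contained sketch. The helper name sort_cars and the trailing empty string (standing in for a row with no cars) are assumptions for illustration:

```r
# hypothetical helper: sort the "/"-separated entries inside each string
sort_cars <- function(x) {
  parts <- strsplit(as.character(x), "/", fixed = TRUE)
  # vapply guarantees one string per input element, even for empty input
  vapply(parts, function(p) paste(sort(p), collapse = "/"), character(1))
}

cars <- c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche", "")
sorted <- sort_cars(cars)
sorted
```

Note that sort() follows the current locale's collation, so names that differ only in case or accents may order differently on different machines.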
We can use separate_rows to split the 'Car' column into one row per car, arrange by 'Name' and 'Car', and then paste the cars back together grouped by 'Name' and 'zipcode'.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
separate_rows(Car) %>%
arrange(Name, Car) %>%
group_by(Name, zipcode) %>%
summarise(Car = str_c(Car, collapse="/"))
# A tibble: 4 x 3
# Groups: Name [4]
# Name zipcode Car
# <chr> <dbl> <chr>
#1 Frank 3456 Audi/BMW/VW
#2 Lilly 1333 Audi/BMW/Porsche/VW
#3 Marie 1416 Audi/BMW
#4 Peter 1213 Audi/BMW/VW
data
df1 <- structure(list(Name = c("Peter", "Marie", "Frank", "Lilly"),
Car = c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche"
), zipcode = c(1213, 1416, 3456, 1333)),
class = "data.frame", row.names = c(NA,
-4L))

R fill columns n times

Hi I want to simulate a dataset like this:
City Person
1 1
1 2
1 3
2 1
2 2
2 3
Where City ID can go from 1-30 and Person ID from 1-40. I know that I can create City by the following code:
data=data.frame(City=rep(1:30,40),Person=0)
However, I cannot figure out how to assign the Person variable for each City ID without using a loop. How do I assign the Person IDs from 1-40 for each City ID? Any help will be appreciated. Thanks.
We can do this with
df1$Person <- with(df1, ave(seq_along(City), City, FUN = seq_along))
Or
df1$Person <- sequence(table(df1$City))
Also, an easier expansion would be
expand.grid(City = 1:30, Person = 1:3)
Or with tidyverse
library(tidyverse)
crossing(City = 1:30, Person = 1:3)
Or using tidyverse
library(tidyverse)
df1 %>%
group_by(City) %>%
mutate(Person = row_number())
Or using data.table
library(data.table)
setDT(df1)[, Person := seq_len(.N), by = City]
data
df1 <- data.frame(City = rep(1:2, each = 3))
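For the sizes stated in the question (30 cities, 40 persons each), the whole data set can also be built in one step with rep(). A minimal sketch; the names n_city, n_person and sim are hypothetical:

```r
n_city <- 30
n_person <- 40

sim <- data.frame(
  # each city ID is repeated once per person: 1,1,...,1,2,2,...
  City = rep(seq_len(n_city), each = n_person),
  # the person IDs 1..40 are recycled once per city
  Person = rep(seq_len(n_person), times = n_city)
)

head(sim)
```

The key difference from rep(1:30, 40) in the question is each =, which keeps all rows of a city together.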

How to compare variables and return common variables in R

I have a very simple dataset, with one column for ID numbers and one column for DOB of that individual.
Example:
x_df <- data.frame(stringsAsFactors=FALSE,
ID = c("ID-1", "ID-2", "ID-2", "ID-3", "ID-4", "ID-5"),
DOB = c("4/16/1955", "9/4/1976", "9/4/1976", "4/16/1955", "2/10/1995",
"11/29/1980")
)
I am trying to write a code in R that will compare all the DOBs and print the IDs and DOBs when the DOB is the same but the ID is different.
Any suggestions?
Let's group the data by DOB so that the IDs sharing a DOB can be compared as pairs:
library(tidyverse)
x_df %>%
group_by(DOB) %>%
mutate(idord = paste0("x", 1:n()) ) %>%
spread(idord, ID) %>%
filter(x1 != x2)
The result is
DOB x1 x2
<chr> <chr> <chr>
1 4/16/1955 ID-1 ID-3
If a DOB might be shared by more than two IDs, you can use this instead:
x_df %>%
group_by(DOB) %>%
summarise(idcount = n_distinct(ID), IDall = paste(ID, collapse = "|")) %>%
filter(idcount > 1)
This gives the number of distinct IDs and all IDs in one cell:
DOB idcount IDall
<chr> <int> <chr>
1 4/16/1955 2 ID-1|ID-3
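If one row per matching ID is preferred over a pasted cell, a base R sketch along the same lines is shown below. The names id_dob and shared are hypothetical; the approach first drops exact repeats, so a DOB that repeats only because the same ID appears twice is not reported:

```r
x_df <- data.frame(
  ID = c("ID-1", "ID-2", "ID-2", "ID-3", "ID-4", "ID-5"),
  DOB = c("4/16/1955", "9/4/1976", "9/4/1976", "4/16/1955", "2/10/1995",
          "11/29/1980"),
  stringsAsFactors = FALSE
)

# one row per distinct ID/DOB combination
id_dob <- unique(x_df)

# DOBs that still occur more than once now belong to different IDs
shared_dobs <- id_dob$DOB[duplicated(id_dob$DOB)]
shared <- id_dob[id_dob$DOB %in% shared_dobs, ]

shared
```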
