Changing Column Names Based on a Different Dataframe - r

I have a data dictionary called data_dict of this format with hundreds of rows:
Field Name       Field Label
marital_status   What is your marital status?
birthplace       What country were you born?
I have another dataframe called df in this format with hundreds of rows:
record_id   marital_status   birthplace
1           3                66
2           6                12
I am currently using df %>% map(~ table(.x, useNA = "ifany")) to summarize the results for all the columns in df. I want the Field Label column values from data_dict to appear instead of the column names from df. How could that be done without changing the column names in df?

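We may use rename with a named vector built from data_dict. The result is not assigned back to df, so the column names of df itself stay unchanged: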
library(dplyr)
library(tibble)
df %>%
  summarise(across(all_of(data_dict$"Field Name"),
                   ~ list(table(.x, useNA = "ifany")))) %>%
  rename(!!! deframe(data_dict[2:1]))
Or using map
library(purrr)
df %>%
rename(!!! deframe(data_dict[2:1])) %>%
map(~ table(.x, useNA = "ifany"))
-output
$record_id
.x
1 2
1 1
$`What is your marital status?`
.x
3 6
1 1
$`What country were you born?`
.x
12 66
1 1
data
df <- structure(list(record_id = 1:2, marital_status = c(3L, 6L), birthplace = c(66L,
12L)), class = "data.frame", row.names = c(NA, -2L))
data_dict <- structure(list(`Field Name` = c("marital_status", "birthplace"
), `Field Label` = c("What is your marital status?", "What country were you born?"
)), class = "data.frame", row.names = c(NA, -2L))
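For comparison, here is a minimal base R sketch of the same idea, assuming the df and data_dict shown above. It builds the list of tables first and then relabels only the list, so df is never touched. The names labels, tabs and hit are made up for this illustration.

```r
df <- data.frame(record_id = 1:2,
                 marital_status = c(3L, 6L),
                 birthplace = c(66L, 12L))
data_dict <- data.frame(
  `Field Name`  = c("marital_status", "birthplace"),
  `Field Label` = c("What is your marital status?", "What country were you born?"),
  check.names = FALSE
)

# Lookup vector: names are the field names, values are the labels
labels <- setNames(data_dict$`Field Label`, data_dict$`Field Name`)

# One frequency table per column of df
tabs <- lapply(df, table, useNA = "ifany")

# Relabel only the list elements that have an entry in the dictionary;
# columns without a label (record_id here) keep their original name
hit <- names(tabs) %in% names(labels)
names(tabs)[hit] <- unname(labels[names(tabs)[hit]])

names(tabs)
names(df)   # still the original column names
```

Columns missing from the dictionary are left alone here; whether that is the desired behaviour depends on the data.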

Related

How do I add row values to colnames in R

I have a dataframe and I would like to add the first row to the names of the columns
What I have:
col1   col2    col3
city   state   country
...    ...     ...
What I want:
col1_city   col2_state   col3_country
city        state        country
...         ...          ...
I can't do it manually because there are many cols in the df
I think of something like
df %>% rename_with(~ names(.) %>%
map_chr(~glue('{.x}_.[1,])))
Thanks!!
With rename_with
df %>%
rename_with(.cols = everything(),
.fn = ~paste0(colnames(df), '_', df[1,]))
Update: Here's a solution where you can pass the current data as it is created/altered within a pipe:
df |>
(\(x) (x <- x |>
rename_with(.cols = everything(),
.fn = ~paste0(colnames(x), '_', x[1,]))))()
So here you could, for example, do some filtering before the renaming or some mutating or whatever you want.
In base R, just do
names(df) <- paste0(names(df), "_", unlist(df[1,]))
-output
> df
col1_city col2_state col3_country
1 city state country
Or with the tidyverse (set_names comes from purrr, slice from dplyr)
library(dplyr)
library(purrr)
library(stringr)
df %>%
  set_names(str_c(names(.), '_', unlist(slice(., 1))))
-output
col1_city col2_state col3_country
1 city state country
data
df <- structure(list(col1 = "city", col2 = "state",
col3 = "country"), class = "data.frame", row.names = c(NA,
-1L))
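If the renaming has to happen in the middle of a pipe, the base R one-liner above can also be wrapped in a small helper. This is only a sketch: append_first_row is a made-up name, and the two-row df is invented to show that only the first row is used.

```r
# Hypothetical helper: append the first row's values to the column names
append_first_row <- function(data) {
  first_vals <- vapply(data[1, , drop = FALSE], as.character, character(1))
  stats::setNames(data, paste0(names(data), "_", first_vals))
}

# Invented example data with a second row
df <- data.frame(col1 = c("city", "x"),
                 col2 = c("state", "y"),
                 col3 = c("country", "z"))

out <- append_first_row(df)
names(out)
```

Because the helper takes the data as its argument, it can follow a filter or mutate step, e.g. df |> append_first_row().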

How to find common strings across several files

I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all these 3 data frames.
If there were only two, I could do it with match or a join, but I want to know how many are shared among df1, df2 and df3, or any combination of them.
example (if only identical strings count for duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
group_by(origin) %>%
distinct(test) %>% ## remove within-dataframe duplicates
group_by(test) %>%
summarise(replicates = n()) %>%
filter(replicates > 1)
Here is an update in case only identical strings are wanted:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
filter(duplicated(test) | duplicated(test, fromLast=TRUE))
id test
1 df1 STOMQ
2 df1 TMEIL1
3 df1 YIPF1
4 df2 STOMQ
5 df2 TMEIL1
6 df2 YIPF1
7 df3 STOMQ
8 df3 YIPF1
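If only exact matches matter, a base R sketch with intersect gives the counts directly. It uses the test columns from the question as plain vectors; sets, common_all and pair_counts are names chosen for this example.

```r
sets <- list(
  df1 = c("SNTM1", "STTTT2", "STOLA", "STOMQ", "STR2", "SUPTY1", "TBNHSG",
          "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3", "TNIL", "TREUK", "TTRK",
          "TRRFK", "UBA52", "YIPF1"),
  df2 = c("SNTLK", "STTTFSG", "STOIU", "STOMQ", "STR25", "SUPYHGS",
          "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1"),
  df3 = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ", "TRS", "BRsts", "TMHS",
          "RSEST", "TRSF", "YIPF1")
)

# Strings present in all three sets
common_all <- Reduce(intersect, sets)

# Number of shared strings for every pair of sets
pairs <- combn(names(sets), 2, simplify = FALSE)
pair_counts <- vapply(pairs,
                      function(p) length(intersect(sets[[p[1]]], sets[[p[2]]])),
                      integer(1))
names(pair_counts) <- vapply(pairs, paste, character(1), collapse = "&")

common_all    # "STOMQ" "YIPF1"
pair_counts   # df1&df2 = 3, df1&df3 = 2, df2&df3 = 2
```

intersect drops duplicates within a vector, so repeated strings inside one data frame do not inflate the counts.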
First answer:
Here is a suggestion:
First bring all dataframes into one data frame with an identifier column and arrange by the string. Now you can check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
arrange(test)
To automate the process you need some kind of string distance. Several exist, and I can't say which one is most appropriate here. One example is the Jaccard index: https://en.wikipedia.org/wiki/Jaccard_index
Here we use the Jaro-Winkler distance instead (learned here: How to group similar strings together in a database in R).
Similar strings end up with the same value in the group column.
You define what "similar" means through the threshold that the "jw" distance is compared against. Try changing it from 0.4 to 0.1 and you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
  # indices of all strings within the chosen Jaro-Winkler distance of the current one
  i <- which(stringdist(., x$test, "jw") < 0.40)
  tibble(index = i, title = x$test[i])
}, .id = "group") %>%
  distinct(index, .keep_all = TRUE) %>%
  mutate(group = as.integer(group),
         df_id = x$id[index])   # look up the source data frame by row index
group index title df_id
<int> <int> <chr> <chr>
1 1 1 BRsts df3
2 2 2 GFD df3
3 3 3 RSEST df3
4 3 31 TRS df3
5 3 32 TRSF df3
6 4 4 SNTLK df2
7 4 5 SNTLKM df3
8 4 6 SNTM1 df1
9 4 8 STOLA df1
10 4 12 STR2 df1
# ... with 27 more rows
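To get a feel for how a threshold on a string distance decides which strings count as similar, here is a small sketch that needs no extra package. Note that it swaps in the Levenshtein edit distance from utils::adist, not the Jaro-Winkler distance used above, so the numbers are on a different scale. The four strings are taken from the data; max_dist is an arbitrary cut-off chosen for the example.

```r
strings <- c("SNTLK", "SNTLKM", "STOMQ", "GFD")

# Pairwise Levenshtein distances (insertions, deletions, substitutions)
d <- utils::adist(strings)

# Treat two strings as similar if at most one edit separates them
max_dist <- 1
similar <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)

# Show the similar pairs side by side
data.frame(a = strings[similar[, "row"]],
           b = strings[similar[, "col"]])
```

Raising max_dist merges more strings into the same group, in the same way as raising the 0.40 threshold does for the Jaro-Winkler version.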

Sum characters from a string using a lookup table in R

I have a lookup dataframe (df1) like this:
col1 col2
A 71
R 156
N 114
D 115
...
and I have a data frame (df2) containing a column of strings like this:
[1] "AARA"
[2] "DDNRRRNRAAN"
[3] "RNDARANDRN"
...
I would like to create a new column in df2 that, for each string, looks up the series of corresponding numbers from df1 and sums them. So, the first row in the new column of df2 would have the value 369 (= 71 + 71 + 156 + 71). How could I go about this task?
One more tidyverse strategy
lookup <- structure(list(col1 = c("A", "R", "N", "D"), col2 = c(71L, 156L,
114L, 115L)), class = "data.frame", row.names = c(NA, -4L))
df <- structure(list(col = c("AARA", "DDNRRRNRAAN", "RNDARANDRN")),
class = "data.frame", row.names = c(NA, -3L))
library(tidyverse)
df %>%
mutate(SUM = map_dbl(str_split(col, ''), ~ sum(lookup$col2[match(.x, lookup$col1)])))
#> col SUM
#> 1 AARA 369
#> 2 DDNRRRNRAAN 1338
#> 3 RNDARANDRN 1182
Created on 2021-06-13 by the reprex package (v2.0.0)
Split the string at every character, use match to get corresponding value for each character and sum them.
df2$res <- sapply(strsplit(df2$col, ''), function(x)
sum(df1$col2[match(x, df1$col1)], na.rm = TRUE))
df2
# col res
#1 AARA 369
#2 DDNRRRNRAAN 1338
#3 RNDARANDRN 1182
Using the same logic a tidyverse option would be -
library(dplyr)
library(tidyr)
df2 %>%
mutate(row = row_number()) %>%
separate_rows(col, sep = '') %>%
left_join(df1, by = c('col' = 'col1')) %>%
group_by(row) %>%
summarise(col = paste0(col, collapse = ''),
col2 = sum(col2, na.rm = TRUE)) %>%
select(-row)
data
df1 <- structure(list(col1 = c("A", "R", "N", "D"), col2 = c(71L, 156L,
114L, 115L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(col = c("AARA", "DDNRRRNRAAN", "RNDARANDRN")),
class = "data.frame", row.names = c(NA, -3L))
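The same match-and-sum logic can also be written with a named vector as the lookup, which some find easier to read. A minimal sketch using the data above; weights and res are names invented for this example.

```r
df1 <- data.frame(col1 = c("A", "R", "N", "D"),
                  col2 = c(71L, 156L, 114L, 115L))
df2 <- data.frame(col = c("AARA", "DDNRRRNRAAN", "RNDARANDRN"))

# Named vector: index by character to get its value
weights <- setNames(df1$col2, df1$col1)

# Split each string into single characters, look each one up, and sum
df2$res <- vapply(strsplit(df2$col, ""),
                  function(chars) sum(weights[chars], na.rm = TRUE),
                  numeric(1))
df2
#           col  res
# 1        AARA  369
# 2 DDNRRRNRAAN 1338
# 3  RNDARANDRN 1182
```

Characters that are missing from the lookup give NA and are skipped because of na.rm = TRUE; drop that argument if unknown characters should make the sum NA instead.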

Split Columns in a List of Dataframes R

I have a list of data frames in which some columns contain the special character -> (arrow). I want to loop through this list, locate the columns that contain -> and split each of them into two new columns named with the suffixes _old and _new. This is a sample of the data frames:
dput(df1)
df1 <- structure(list(v1 = c("reg->joy", "ress", "mer->dls"),
t2 = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(v1 = c("me", "df", "kl"),
t2 = c("James","Jane->dlt", "Egg"),
t3 = c("James ->may","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1 = c("56->34", "df23-> ", "mkl"),
t2 = c("James","Jane", "Egg"),
d3 = c("James->","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
This is what I have tried
dfs <- list(df1,df2,df3)
for (y in 1:length(dfs)){
setDT(dfs[[y]])
df1<- lapply(names(dfs[[y]]), function(x) {
mDT <- df2[[y]][, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_old", "_new")))
}) %>% as.data.table()
}
This only splits one data frame, I need to split all of the data frames.
NOTE: The code I have splits so well on one dataframe, what I want is how to implement it on a List of data frames
EXPECTED OUTPUT
dput(df1)
df1 <- structure(list(v1_old = c("reg", "mer"),
v1_new = c("joy", "dls")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(t2_old = c("dlt"),
t2_new = c("dlt"),
t3_old = c("James"),
t3_new = c("may")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1_old = c("56", "df23 "),
v1_new = c("34", " "),
d3 = c("James"),
d3 = c(" ")),
class = "data.frame", row.names = c(NA, -3L))
Below is a solution using the tidyverse.
Select the columns in which at least one string contains an arrow:
library(tidyverse)
col_arrow_ls <- purrr::map(dfs, ~select_if(., ~any(str_detect(., "->"))))
Then split each column using tidyr::separate. Since each element of the output is a data frame, purrr::map_dfc is used to column-bind them together:
split_df_fn <- function(df1){
  names(df1) %>%
    map_dfc(~ df1 %>%
      select(all_of(.x)) %>%            # all_of() because .x is a character name
      tidyr::separate(all_of(.x),
                      into = paste0(.x, c("_old", "_new")),
                      sep = "->")
    )
}
Apply the function to the list of data frames.
purrr::map(col_arrow_ls, split_df_fn)
[[1]]
  v1_old v1_new
1    reg    joy
2   ress   <NA>
3    mer    dls
[[2]]
  t2_old t2_new t3_old t3_new
1  James   <NA> James     may
2   Jane    dlt   Jane   <NA>
3    Egg   <NA>    Egg   <NA>
[[3]]
  v1_old v1_new d3_old d3_new
1     56     34  James
2   df23          Jane   <NA>
3    mkl   <NA>    Egg   <NA>
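For reference, a base R sketch of the same split that also trims the whitespace around the arrow. split_arrow_cols is a made-up name. Unlike the expected output in the question, it keeps every row and fills the _new part with NA where a value has no arrow, so it mirrors the output above instead.

```r
# Hypothetical helper: split every column that contains "->" into _old/_new
split_arrow_cols <- function(data) {
  has_arrow <- vapply(data, function(col) any(grepl("->", col, fixed = TRUE)),
                      logical(1))
  pieces <- lapply(names(data)[has_arrow], function(nm) {
    col   <- data[[nm]]
    arrow <- grepl("->", col, fixed = TRUE)
    old   <- trimws(sub("->.*$", "", col))                     # text before the arrow
    new   <- ifelse(arrow, trimws(sub("^.*?->", "", col, perl = TRUE)),
                    NA_character_)                             # text after the arrow
    setNames(data.frame(old, new, stringsAsFactors = FALSE),
             paste0(nm, c("_old", "_new")))
  })
  do.call(cbind, pieces)   # NULL if no column contains an arrow
}

df1 <- data.frame(v1 = c("reg->joy", "ress", "mer->dls"),
                  t2 = c("James", "Jane", "Egg"))
df2 <- data.frame(v1 = c("me", "df", "kl"),
                  t2 = c("James", "Jane->dlt", "Egg"),
                  t3 = c("James ->may", "Jane", "Egg"))

res <- lapply(list(df1, df2), split_arrow_cols)
res[[1]]
```

lapply over the list is what answers the original "only one data frame is split" problem: the function is applied to each element in turn.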

How to make a left join in which the row taken from "data B" is different from the row in which the id is?

I have a data frame "A" with two columns:
the first has names of cities (unique values), the second has NA, which I want to fill with unemployment.
Data frame "B" has a column with city names, but the unemployment isn't in the same row; to be precise, it is always 1 row below.
How would you merge these two data frames, so that R looks at the first column of data frame "A", finds its match in data frame "B", and replaces the NA in the second column of data frame "A" with the value 1 row below the row in which the match is made?
Here is a summarized version of how data frames A and B would look.
names= c("Bogotá", "Medellín")
data_frame_A= as.data.frame(names, ncol=1)
colnames(data_frame_A)= "city"
data_frame_A$Unemployment = NA
data_frame_A
data frame B looks something like this
names= c("Bogotá", "life_exp","Unemployment","Medellín","life_exp","Unemployment")
data_frame_B= as.data.frame(names, ncol=1)
colnames(data_frame_B)= "city"
data_frame_B$column_20 = runif(6, 0.5, 0.8)
data_frame_B
How would you merge these two data frames then?
Here's a method that checks if each city in data_frame_B is in data_frame_A to assign rows to each city. We make a new column that has the actual city name, and then we can spread the variables out into their own columns. You can join back on to data_frame_A after this if there are columns there that you need.
library(tidyverse)
data_frame_A <- structure(list(city = structure(1:2, .Label = c("Bogotá", "Medellín"), class = "factor"), Unemployment = c(NA, NA)), row.names = c(NA, -2L), class = "data.frame")
data_frame_B <- structure(list(city = structure(c(1L, 2L, 4L, 3L, 2L, 4L), .Label = c("Bogotá", "life_exp", "Medellín", "Unemployment"), class = "factor"), column_20 = c(0.653383622108959, 0.685130500583909, 0.616564040770754, 0.731770524056628, 0.53738643436227, 0.571727990615182)), row.names = c(NA, -6L), class = "data.frame")
data_frame_B %>%
group_by(city_id = cumsum(city %in% data_frame_A$city)) %>%
mutate(city_name = first(city)) %>%
filter(city_name != city) %>%
spread(city, column_20)
#> # A tibble: 2 x 4
#> # Groups: city_id [2]
#> city_id city_name life_exp Unemployment
#> <int> <fct> <dbl> <dbl>
#> 1 1 Bogotá 0.685 0.617
#> 2 2 Medellín 0.537 0.572
Created on 2019-04-22 by the reprex package (v0.2.1)
With the random seed set as in the Note at the end, to make the data reproducible, we can use the following double left join:
library(sqldf)
sqldf("select a.city, b2.[column_20]
from [data_frame_A] as a
left join [data_frame_B] as b using(city)
left join [data_frame_B] as b2 on b2.rowid = b.rowid + 1")
giving:
city column_20
1 Bogotá 0.7364915
2 Medellín 0.7821402
Note
set.seed(123)
names= c("Bogotá", "Medellín")
data_frame_A= as.data.frame(names, ncol=1)
colnames(data_frame_A)= "city"
data_frame_A$Unemployment = NA
names= c("Bogotá", "life_exp","Unemployment","Medellín","life_exp","Unemployment")
data_frame_B= as.data.frame(names, ncol=1)
colnames(data_frame_B)= "city"
data_frame_B$column_20 = runif(6, 0.5, 0.8)
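A base R sketch of the "match, then step down a fixed number of rows" idea, for comparison. The column_20 values are made up so that the result is easy to check, and offset is an assumption: the question says the value is 1 row below the city, while in the sample layout the Unemployment row sits 2 rows below, so adjust it to the real data.

```r
data_frame_A <- data.frame(city = c("Bogotá", "Medellín"),
                           Unemployment = NA_real_)
data_frame_B <- data.frame(
  city = c("Bogotá", "life_exp", "Unemployment",
           "Medellín", "life_exp", "Unemployment"),
  column_20 = c(0.61, 0.72, 0.55, 0.68, 0.74, 0.59)   # made-up values
)

offset <- 2L   # assumption: rows between the city and its Unemployment row

# Row of each city in B, then take the value `offset` rows further down
hit <- match(data_frame_A$city, data_frame_B$city)
data_frame_A$Unemployment <- data_frame_B$column_20[hit + offset]

data_frame_A
```

Cities that are not found in B give NA from match and therefore stay NA in the result.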
