Sum characters from a string using a lookup table in R - r

I have a lookup dataframe (df1) like this:
col1 col2
A 71
R 156
N 114
D 115
...
and I have a data frame (df2) containing a column of strings like this:
[1] "AARA"
[2] "DDNRRRNRAAN"
[3] "RNDARANDRN"
...
I would like to create a new column in df2 that, for each string, looks up the series of corresponding numbers from df1 and sums them. So, the first row in the new column of df2 would have the value 369 (= 71 + 71 + 156 + 71). How could I go about this task?

One more tidyverse strategy
lookup <- structure(list(col1 = c("A", "R", "N", "D"), col2 = c(71L, 156L,
114L, 115L)), class = "data.frame", row.names = c(NA, -4L))
df <- structure(list(col = c("AARA", "DDNRRRNRAAN", "RNDARANDRN")),
class = "data.frame", row.names = c(NA, -3L))
library(tidyverse)
df %>%
mutate(SUM = map_dbl(str_split(col, ''), ~ sum(lookup$col2[match(.x, lookup$col1)])))
#> col SUM
#> 1 AARA 369
#> 2 DDNRRRNRAAN 1338
#> 3 RNDARANDRN 1182
Created on 2021-06-13 by the reprex package (v2.0.0)

Split the string at every character, use match to get corresponding value for each character and sum them.
df2$res <- sapply(strsplit(df2$col, ''), function(x)
sum(df1$col2[match(x, df1$col1)], na.rm = TRUE))
df2
# col res
#1 AARA 369
#2 DDNRRRNRAAN 1338
#3 RNDARANDRN 1182
Using the same logic a tidyverse option would be -
library(dplyr)
library(tidyr)
df2 %>%
mutate(row = row_number()) %>%
separate_rows(col, sep = '') %>%
left_join(df1, by = c('col' = 'col1')) %>%
group_by(row) %>%
summarise(col = paste0(col, collapse = ''),
col2 = sum(col2, na.rm = TRUE)) %>%
select(-row)
data
df1 <- structure(list(col1 = c("A", "R", "N", "D"), col2 = c(71L, 156L,
114L, 115L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(col = c("AARA", "DDNRRRNRAAN", "RNDARANDRN")),
class = "data.frame", row.names = c(NA, -3L))

Related

Changing Column Names Based on a Different Dataframe

I have a data dictionary called data_dict of this format with hundreds of rows:
Field Name
Field Label
marital_status
What is your marital status?
birthplace
What country were you born?
I have another dataframe called df in this format with hundreds of rows:
record_id
marital_status
birthplace
1
3
66
2
6
12
I am currently using df %>% map(~ table(.x, useNA = "ifany")) to summarize the results for all the columns in df. I want the Field Label column values from data_dict to appear instead of the column names from df. How could that be done without changing the column names in df?
We may use rename
library(dplyr)
library(tibble)
df %>%
summarise(across(all_of(data_dict$"Field Name"),
~ list(table(.x, useNA = "ifany")))) %>%
rename(!!! deframe(data_dict[2:1]))
Or using map
library(purrr)
df %>%
rename(!!! deframe(data_dict[2:1])) %>%
map(~ table(.x, useNA = "ifany"))
-output
$record_id
.x
1 2
1 1
$`What is your marital status?`
.x
3 6
1 1
$`What country were you born?`
.x
12 66
1 1
data
df <- structure(list(record_id = 1:2, marital_status = c(3L, 6L), birthplace = c(66L,
12L)), class = "data.frame", row.names = c(NA, -2L))
data_dict <- structure(list(`Field Name` = c("marital_status", "birthplace"
), `Field Label` = c("What is your marital status?", "What country were you born?"
)), class = "data.frame", row.names = c(NA, -2L))

How to find common strings across several files

I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all these 3 data frames.
If it was two, I could do it with match and join function but I want to know how many are shared between df1 and df2 and df3 or a combination.
example (if only identical strings count for duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
group_by(origin) %>%
distinct(test) %>% ## remove within-dataframe duplicates
group_by(test) %>%
summarise(replicates = n()) %>%
filter(replicates > 1)
Here is an update in case only identical strings are wished:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
filter(duplicated(test) | duplicated(test, fromLast=TRUE))
id test
1 df1 STOMQ
2 df1 TMEIL1
3 df1 YIPF1
4 df2 STOMQ
5 df2 TMEIL1
6 df2 YIPF1
7 df3 STOMQ
8 df3 YIPF1
First answer:
Here is a suggestion:
First bring all dataframes in a list of dataframes with an identifier and arrange by the the string. Now you could check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
arrange(test)
To automate the process you have to use a kind of string distance, there are some different out there and I can't tell which one is better or more appropriate. One example is Jaccard_index https://en.wikipedia.org/wiki/Jaccard_index
Here we use the Jaro-Winkler distance: Learned here: How to group similar strings together in a database in R
in the group column you could find the similar strings:
You can define what does similar mean, by changing the value of "jw". Try and change it from 0.4 to 0.1 then you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
i <- which(stringdist(., x$test, "jw") < 0.40)
tibble(index = i, title = x$test[i])
}, .id = "group") %>%
distinct(index, .keep_all = T) %>%
mutate(group = as.integer(group)) +
bind_cols(df_id = x$id)
group index title df_id
<int> <int> <chr> <chr>
1 1 1 BRsts df3
2 2 2 GFD df3
3 3 3 RSEST df3
4 3 31 TRS df2
5 3 32 TRSF df3
6 4 4 SNTLK df1
7 4 5 SNTLKM df2
8 4 6 SNTM1 df1
9 4 8 STOLA df1
10 4 12 STR2 df2
# ... with 27 more rows

Split Columns in a List of Dataframes R

I have a list of data frames which some columns have this special character ->(arrow). Now i do want to loop through this list of data frames and locate columns with this -> (arrow) then the new columns be named with a suffix _old and _new. This is a sample of data frames :
dput(df1)
df1 <- structure(list(v1 = c("reg->joy", "ress", "mer->dls"),
t2 = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(v1 = c("me", "df", "kl"),
t2 = c("James","Jane->dlt", "Egg"),
t3 = c("James ->may","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1 = c("56->34", "df23-> ", "mkl"),
t2 = c("James","Jane", "Egg"),
d3 = c("James->","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
This is what I have tried
dfs <- list(df1,df2,df3)
for (y in 1:length(dfs)){
setDT(dfs[[y]])
df1<- lapply(names(dfs[[y]]), function(x) {
mDT <- df2[[y]][, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_old", "_new")))
}) %>% as.data.table()
}
This only splits one data frame, I need to split all of the data frames.
NOTE: The code I have splits so well on one dataframe, what I want is how to implement it on a List of data frames
EXPECTED OUTPUT
dput(df1)
df1 <- structure(list(v1_old = c("reg", "mer"),
v1_new = c("joy", "dls")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(t2_old = c("dlt"),
t2_new = c("dlt"),
t3_old = c("James"),
t3_new = c("may")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1_old = c("56", "df23 "),
v1_new = c("34", " "),
d3 = c("James"),
d3 = c(" ")),
class = "data.frame", row.names = c(NA, -3L))
I add below a solution using the tidyverse.
Select the columns if one of the strings in the columns contains an arrow:
col_arrow_ls <- purrr::map(dfs, ~select_if(., ~any(str_detect(., "->"))))
Then split the function using tidyr::separate. Since each element of the output is a data frame, purrr::map_dfc is used to column-bind them together:
split_df_fn <- function(df1){
names(df1) %>%
map_dfc(~ df1 %>%
select(.x) %>%
tidyr::separate(.x,
into = paste0(.x, c("_old", "_new")),
sep = "->")
)
}
Apply the function to the list of data frames.
purrr::map(col_arrow_ls, split_df_fn)
[[1]]
v1_old v1_new
1 reg joy
2 ress <NA>
3 mer dls
[[2]]
t2_old t2_new t3_old t3_new
1 James <NA> James may
2 Jane dlt Jane <NA>
3 Egg <NA> Egg <NA>
[[3]]
v1_old v1_new d3_old d3_new
1 56 34 James
2 df23 Jane <NA>
3 mkl <NA> Egg <NA>

discard last or first group after group_by by referencing group directly

Data:
df <- data.frame(A=c(rep(letters[1],3),rep(letters[2],3),rep(letters[3],3)),
B=rnorm(9),
stringsAsFactors=F)
I don't know if there's a way to do this, but what I'd like to know is if there's way to discard the last group by directly referencing the groups after group_by(A) to get the desired output:
A B
1 a -0.4900863
2 a 1.4106594
3 a -0.2245738
4 b -0.2124955
5 b 0.6963785
6 b 0.9151825
I AM INTERESTED IN SOLUTIONS THAT DIRECTLY WORK AT THE GROUPS LEVEL
For instance, something like:
df %>% group_by(A) %>% head(.Groups,-1)
or
df %>% group_by(A) %>% Groups[1:2]
I AM NOT INTERESTED IN THE FOLLOWING KINDS OF SOLUTIONS
df %>% filter(!(A == max(A)))
df %>% filter(!(A %in% max(A)))
OR OTHER SOLUTIONS THAT DO NOT REQUIRE group_by TO WORK
I was assuming you were not supposed to be assuming that we knew in advance what the number of groups might be. Try using the labels attribute:
all_but_last <- df %>% group_by(A) %>% attr("labels") %>% head(-1)
A
1 a
2 b
... to extract desired rows
> df %>% filter(A %in% all_but_last[[1]])
A B
1 a -0.799026840
2 a -0.712402478
3 a 0.685320094
4 b 0.971492883
5 b -0.001479117
6 b -0.817766296
Helps to use dput to look at the actual contents of a "grouped_df":
dput( df %>% group_by(A) )
structure(list(A = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), B = c(-0.799026840397576, -0.712402478350695, 0.685320094252465,
0.971492883452258, -0.00147911717469651, -0.817766295631676,
-1.00112471676908, 1.88145909873596, -0.305560178617216)), .Names = c("A",
"B"), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "A", drop = TRUE, indices = list(
0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L,
labels = structure(list(
A = c("a", "b", "c")),
row.names = c(NA, -3L),
class = "data.frame",
vars = "A", drop = TRUE, .Names = "A"))
Note that the labels are a data.frame so you could have further applied unlist to the result that became all_but_last and you then would not have needed to extract its value with "[[".
Perhaps this helps
library(dplyr)
df %>%
group_by(A) %>%
group_indices(.) %in% 1:2 %>%
df[.,]
Or with data.table
library(data.table)
setDT(df)[, grp := .GRP, A][grp %in% unique(grp)[1:2]][, grp := NULL][]

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
using dplyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
Update
In response to #jazzuro's comment, Using dplyr 0.3, I am getting
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1

Resources