Concatenate data according to a partial numeric match - r

I have two data frames.
One is structured like this:
code. name.
1111 A B
1122 C D
2122 C D
2133 G H
the other is:
code_2. name.
11 F
21 G
I want to obtain a third df that, wherever the codes match, concatenates the data from the first data frame using an "OR" separator. The code value that I want to maintain is the one from the second df. It is important that the match between code values is made on the first and second digits of the code belonging to the first data frame.
code. name.
11 A B OR C D
21 C D OR G H
thank you for your suggestions!

You can use aggregate, i.e.
aggregate(name. ~ substr(code., 1, 2), df, paste, collapse = ' OR ')
# substr(code., 1, 2) name.
#1 11 A B OR C D
#2 21 C D OR G H
You can take care of the column names as usual.
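For example, one common way (assuming the result is stored in res):
res <- aggregate(name. ~ substr(code., 1, 2), df, paste, collapse = ' OR ')
names(res) <- c("code.", "name.")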

If you prefer tidyverse, you can try something like:
library(dplyr)
library(stringr)

df %>%
  group_by(code. = str_extract(as.character(code.), "^.{2}")) %>%
  summarise(name. = paste(name., collapse = " OR "))
code. name.
<chr> <chr>
1 11 A B OR C D
2 21 C D OR G H
It groups by the first two elements from "code." and then combines the "name." column based on those elements.
Or the same using sub():
df %>%
  group_by(code. = sub("^(.{2}).*", "\\1", as.character(code.))) %>%
  summarise(name. = paste(name., collapse = " OR "))
Or the same using substring():
df %>%
  group_by(code. = substring(as.character(code.), 1, 2)) %>%
  summarise(name. = paste(name., collapse = " OR "))
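If you also need to keep only the codes that actually appear in the second data frame (the question says the code value to maintain is the one of the second df), here is a hedged sketch, assuming the second data frame is called df2 and its code_2. column is numeric:
library(dplyr)

df %>%
  mutate(code_2. = as.numeric(substr(as.character(code.), 1, 2))) %>%
  inner_join(df2, by = "code_2.", suffix = c("", ".2")) %>%
  group_by(code. = code_2.) %>%
  summarise(name. = paste(name., collapse = " OR "))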


Select columns from dataframe start with number

I have a data frame where some column names start with numbers and others are plain strings, and I want to subset the columns whose names start with numbers followed by dots.
The code below works for this sample, but in my actual data frame the column AA ID gets selected as well, and I don't know the reason.
library(dplyr)

df <- data.frame(`AA ID` = c(1,2,3,4,5,6,7,8,9,10),
                 "BB" = c("AMK","KAMl","HAJ","NHS","KUL","GAF","BGA","NHU","VGY","NHU"),
                 "CC" = c("TAMAN","GHUSI","KELVIN","DEREK","LOKU","MNDHUL","JASMIN","BINNY","BURTAM","DAVID"),
                 "DD" = c(62,41,37,41,32,74,52,75,59,36),
                 "EE" = c("CA","NY","GA","DE","MN","LA","GA","VA","TM","BA"),
                 "FF" = c("ENGLISH","FRENCH","ENGLISH","FRENCH","ENGLISH","ENGLISH","SPANISH","ENGLISH","SPANISH","RUSSIAN"),
                 "GG" = c(33,44,51,51,37,58,24,67,41,75),
                 `1A` = c("","D","","NA","","D","","","D",""),
                 `2B` = c("","A","","","A","A","A","A","",""),
                 `3C` = c("","","","","","","","","",""),
                 `4D` = c("","G","G","G","G","G","G","G","",""),
                 "Concatenate" = c("","DAG","G","NAG","AG","DAG","AG","AG","D",""))
df <- df %>% rename(`1. A` = "X1A", `1. B` = "X2B", `1. C` = "X3C", `1. D` = "X4D")
Error_summary <- select(df, matches("^[0-9]*\\."))
I am also trying to add counts over those columns, like below:
df_row <- df %>%
  summarize(across(c(matches("^[0-9]*\\."), Concatenate),
                   ~ sum(!is.na(.) & . != "" & . != "NA")))
but this also selects the column AA ID, which I don't want.
Taking into account that your variables that are supposed to start with numbers will be converted to variable names starting with X, you could do:
library(tidyverse)

df %>%
  select(matches("^X[0-9]"))
which gives:
   X1..A X2..B X3..C X4..D
1
2      D     A           G
3                        G
4     NA                 G
5            A           G
6      D     A           G
7            A           G
8            A           G
9      D
10
With the same logic you can do your counts:
df %>%
  summarize(across(c(matches("^X[0-9]"), Concatenate),
                   ~ sum(!is.na(.) & . != "" & . != "NA")))
which gives
  X1..A X2..B X3..C X4..D Concatenate
1     3     5     0     7           8
Although I'm not sure if you want to exclude the "NAG" value in the Concatenate column.
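For reference, the conversion mentioned above is what make.names() does when data.frame() is called with its default check.names = TRUE; a quick illustration (not part of the original answer):
make.names(c("AA ID", "1A", "2B", "1. A"))
# [1] "AA.ID" "X1A"   "X2B"   "X1..A"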

Using mutate with a stored list of formulas over specified columns

This is a follow-up to my previous question here, which #ronak_shah was kind enough to answer. I apologize, as some of this information may be redundant to anyone who saw that post, but I figure it is best to post a new question rather than modify the previous version.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep on getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help would be much appreciated!
library(tidyverse)

## fake data
dat <- data.frame(a = 1:5,
                  b = 6:10,
                  c = 11:15)

## function
func1 <- function(x, y) {x + y}

## modification list
col_mod <- data.frame("col" = c("a", "c"),
                      "y_val" = c(3, 4),
                      stringsAsFactors = FALSE) %>%
  mutate(func = paste0("func1(", col, ",", y_val, ")"))

## desired end result
dat %>%
  mutate(a_new = func1(a, 3),
         c_new = func1(c, 4))

## attempting to generate new columns based on #ronak_shah's answer to my previous
## question but fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
                                        dat[col_mod$col], col_mod$func)
We can use pmap from purrr: transmute each new column based on the name from 'col' (..1), the function from 'func' (..3), and the 'y_val' (..2); assign (:=) the value to a new column by building its name with paste (or str_c); and then bind the resulting columns to the original dataset.
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
  bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to parse the function call as it is, use parse_expr and eval, i.e. without changing the func column; it remains as func1(a,3) and func1(c,4):
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   eval(rlang::parse_expr(..3)))) %>%
  bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map
dat[paste0(col_mod$col, '_new')] <- do.call(Map,
  c(f = function(x, y, z) eval(parse(text = z), envir = dat),
    unname(col_mod)))
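This also shows why the original attempt failed: match.fun() expects a bare function name such as "func1", not a full call string like "func1(a,3)". A minimal base-R sketch along those lines (assuming the func column holds only the function name, as set above):
col_mod$func <- "func1"
dat[paste0(col_mod$col, "_new")] <- Map(
  function(col, y, f) match.fun(f)(dat[[col]], y),
  col_mod$col, col_mod$y_val, col_mod$func)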

How to loop through the columns in data frame

I have the following vector: h = c("a","b","c","d","e")
I would like to create a dataset that looks like this using the lag() function:
pr <- data.frame(your_github = h,
                 review_this1 = lag(h),
                 review_this2 = lag(h, 2))
However, when I use lag the following happens:
col2 = c(NA,"a","b","c","d") and col3 = c(NA,NA,"a","b","c")
but I need to get an outcome similar to data.frame(col1 = c("a","b","c","d","e"), col2 = c("b","c","d","e","a"), col3 = c("c","d","e","a","b")), where the values in col2 and col3 are looped (i.e. each column is the 1st one shifted by one position, with the values that fall off one end wrapping around to the other).
Something like this?
library(dplyr)
h <- c("a","b","c","d","e")
pr <- data.frame(your_github = h,
                 review_this1 = ifelse(is.na(lead(h)), h[1], lead(h)),
                 review_this2 = ifelse(is.na(lead(h, 2)), h[2:1], lead(h, 2)))
pr
# your_github review_this1 review_this2
#1 a b c
#2 b c d
#3 c d e
#4 d e a
#5 e a b
With base R you can achieve this with head and tail:
h <- letters[1:5]
pr <- data.frame(your_github = h,
                 review_this1 = c(tail(h, -1), head(h, 1)),
                 review_this2 = c(tail(h, -2), head(h, 2)))
print(pr)
Output:
your_github review_this1 review_this2
1 a b c
2 b c d
3 c d e
4 d e a
5 e a b
The idea is to take the end of the vector h with tail (dropping its first one or two values) and concatenate it with those same leading values taken by head, so that each column (vector) of the data frame ends up with the same length as h.
If you want to cycle the vector in the other direction, with the last value becoming the first, just reverse the signs in tail and head (e.g. c(tail(h, 1), head(h, -1))).
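The same idea can be wrapped in a small helper; a sketch (the name rotate is mine, not from the answer):
rotate <- function(x, n = 1) {
  n <- n %% length(x)            # shifts longer than the vector wrap around
  if (n == 0) return(x)
  c(tail(x, -n), head(x, n))     # drop the first n values and append them at the end
}

pr <- data.frame(your_github = h,
                 review_this1 = rotate(h, 1),
                 review_this2 = rotate(h, 2))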

R, dplyr: Collect unique values for a column, mutate a label based on set intersection

I am working with a large data set, but let us take a toy example to demonstrate what I am trying to achieve. I am using R and dplyr.
I have a table:
id attribute correct
1 a a
1 b a
1 c a
2 d e
2 e e
3 d f
From the above, I want to create two columns, attribute_set and label. To clarify, I want:
id attribute_set correct label
1 a, b, c a 1
2 d, e e 1
3 d f 0
attribute_set should be a collection (any data structure) that has all of the attributes for an id. label should be 1 if the correct value is in attribute_set and 0 otherwise.
Presently, I create attribute_set like so:
design_mat1 <- design_mat %>%
  group_by(id) %>%
  mutate(attribute_set = paste(unique(attribute), collapse = "|")) %>%
  select(-attribute)
I generate label like so:
design_mat2b <- design_mat2 %>%
  group_by(id) %>%
  mutate(label = ifelse(correct %in% attribute_set, 1, 0))
However, my label works only when there is one element in attribute_set. I think I have to strsplit on | or make attribute_set use some other data structure. I have been unable to figure out what alternative data structure to use nor have I been able to get a strsplit on | solution to work. Any hints/solutions are appreciated.
After grouping by 'id', we can use summarise to paste the unique elements of 'attribute' together, take the first value of 'correct', and set 'label' to 1 when any element of 'attribute' matches 'correct'.
library(dplyr)
design_mat %>%
group_by(id) %>%
summarise(attribute_set = toString(unique(attribute)),
correct = first(correct),
label = +(any(correct %in% attribute)))
# A tibble: 3 x 4
# id attribute_set correct label
# <int> <chr> <chr> <int>
#1 1 a, b, c a 1
#2 2 d, e e 1
#3 3 d f 0
Or include 'correct' in the group_by as well and then summarise only 'attribute_set' and 'label'.
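A hedged sketch of that variant (assuming 'correct' is constant within each 'id', as in the example data):
design_mat %>%
  group_by(id, correct) %>%
  summarise(attribute_set = toString(unique(attribute)),
            label = +(any(correct %in% attribute)),
            .groups = "drop")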

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use something like unique, but ignoring order, so I can find which rows contain the same set of values in their first 3 rank columns.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between Year and 1 called Weighting, and then when counting these unique combinations I would include each row's weighting. This isn't as important because all weightings will be small positive integer values, so I can potentially duplicate the rows earlier to account for weighting, and then tabulate unique pairs.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
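As a quick illustration of those two steps with an arbitrary, made-up triple:
sort(c("D", "B", "A"))                         # "A" "B" "D"
paste0(sort(c("D", "B", "A")), collapse = "")  # "ABD"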
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
# 2 1 2
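Regarding the Weighting column mentioned in the question, a hedged sketch could sum the weights per combination instead of counting rows (the weights below are made up purely for illustration):
df$Weighting <- c(1, 2, 1, 3, 1)   # hypothetical weights, one per row of df
tapply(df$Weighting, combos, sum)
# ABC ACD BCD
#   3   1   4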
This is the same solution as #Alex_A but using tidyverse functions:
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
                           ~ paste0(sort(c(...)), collapse = "")))
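If the counts are needed as well, a sketch building on the same idea is to keep the combined key as a column and tally it with count():
df %>%
  mutate(ID = pmap_chr(select(df, num_range("", 1:3)),
                       ~ paste0(sort(c(...)), collapse = ""))) %>%
  count(ID)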
