How to number unique pairs X,Y - r

Ok, so I have the following data.frame:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
df<- data.frame(v1,v2)
I want to add an extra variable that numbers the pairs of v1 and v2 values. The trick is that the numbering must ignore the order within a pair, so for example (456,981) and (981,456) should both be numbered 1.
So the outcome would be something like this:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
v3<-c(1,2,1,3,2,3)
df<- data.frame(v1,v2,v3)

You can sort rowwise and use match, i.e.
v1 <- do.call(paste, data.frame(t(apply(df, 1, sort))))
match(v1, unique(v1))
#[1] 1 2 1 3 2 3
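For just two numeric columns, the rowwise sort can probably be replaced by pmin and pmax, which avoids apply. This is only a minimal base R sketch of the same idea, assuming the df from the question; the names key and v3 are chosen here for illustration.

```r
# Data from the question
v1 <- c(456, 234, 981, 776, 112, 998)
v2 <- c(981, 112, 456, 998, 234, 776)
df <- data.frame(v1, v2)

# Build an order-independent key: smaller value first, larger value second
key <- paste(pmin(df$v1, df$v2), pmax(df$v1, df$v2))

# Number each key by its first appearance
df$v3 <- match(key, unique(key))
df$v3
#[1] 1 2 1 3 2 3
```

Unlike the apply version, this does not overwrite the v1 vector in the workspace.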

How about this, using dplyr? Basically you sort the two columns within each row. I am not sure whether it is more efficient, and it obviously takes a lot more lines.
library(dplyr)
df <- data.frame(v1,v2)
# Sort by v1 and v2 elements by row
df.new <- df %>%
mutate(z1 = pmin(v1,v2),
z2 = pmax(v1,v2))
# Build a distinct coding table
df.codes <- df.new %>%
distinct(z1, z2) %>%
mutate(v3 = 1:n())
# Join it back together
df.new %>%
left_join(df.codes, by = c("z1", "z2")) %>%
select(v1, v2, v3)


R - Regular Expressions (Regex) with a list of Data Frames (only first match)

So, I'm the happy owner of a list of 17,246 data frames and need to extract three pieces of data from each of them:
To whom the job was given.
The standard code that describes what kind of job it is (Ex. "00" inside this "12-00.07").
The date on which it was assigned.
Each data frame contains data about just one worker.
But the data is entered differently in each one. The worker entry always starts with “Worker:” followed by a name or number identification.
So, I can find that entry with a regular expression that targets “Worker:”.
I can also target the first string that matches a date pattern: “dd/dd/dd”.
The desired output is a df with 3 columns (“Worker”, “Code”, “Date”) and then unite all dfs into one.
In order to achieve this end, I find myself with three problems:
a) The information is presented in no particular order (I cannot subset specific rows).
b) The intended worker and code are substrings inside other characters.
c) More than one date is presented in each df and I only want the first match. All other dates are misleading.
The input is this:
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
The desired outcome is this:
library(dplyr)
n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")
n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")
df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)
PDF_list <- list(df1, df2)
one_df <- bind_rows(PDF_list)
So far, I've managed to write this poor excuse of a code. It doesn’t select the substrings and it cheats to get the desired date:
library(tidyverse)
library(tidyr)
library(stringr)
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
for(num in 1:length(PDF_list)){
worker <- filter(PDF_list[[num]], grepl("Worker:\\s*?(\\w.+)", text))
code <- filter(PDF_list[[num]], grepl("-(\\d{2}).+", text))
date <- filter(PDF_list[[num]], grepl("^\\d{2}/\\d{2}.+", text))
if(nrow(date) > 1){
date <- date[1,1]
}
t_list <- cbind(worker, code, date)
names(t_list) <- c("name", "code", "date")
PDF_list[[num]] <- t_list
}
rm(worker, code, date, t_list)
one_df <- bind_rows(PDF_list)
View(one_df)
Any help? Thanks!
A method using tidyverse
Loop over the list with map and arrange the rows of each data frame so that the row containing 'Worker:' becomes the top row.
Bind the list elements into a single dataset with the _dfr suffix in map (map_dfr), creating a grouping index by specifying .id.
Group by the 'grp' column.
Use summarise to take the first 'date', matched by the pattern of two digits, /, two digits, /, two digits, anchored from the start (^) to the end ($) of the strings in the 'text' column.
The first element becomes 'name' after removing the substring 'Worker:' and any following spaces with str_remove.
Similarly, extract 'code' from the rows made up only of digits separated by - or ., capturing the middle group of digits with str_replace.
library(dplyr)
library(stringr)
library(purrr)
PDF_list %>%
map_dfr(~ .x %>%
arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>%
group_by(grp) %>%
summarise(date = first(text[str_detect(text, "^\\d{2}/\\d{2}/\\d{2}$")]),
name = str_remove(first(text), "Worker:\\s*"),
code = str_replace(text[str_detect(text, '^\\d+-(\\d+)[.-]\\d+$')],
"^\\d+-(\\d+)[.-]\\d+$", "\\1"), .groups = 'drop') %>%
select(name, code, date)
Output:
# A tibble: 4 x 3
name code date
<chr> <chr> <chr>
1 Joseph 00 06/01/21
2 Joseph 11 06/01/21
3 3rd 08 05/01/21
4 3rd 57 05/01/21
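The same steps can also be sketched in base R, which may help to see what each regular expression does on its own. This is an illustrative sketch, not the answer's method: extract_job is a hypothetical helper name, and it assumes every data frame has exactly one 'Worker:' row and at least one date row.

```r
# Input from the question
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless", "99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
PDF_list <- list(data.frame(text = v1, stringsAsFactors = FALSE),
                 data.frame(text = v2, stringsAsFactors = FALSE))

# Hypothetical helper: pull name, codes and first date out of one data frame
extract_job <- function(d) {
  txt <- d$text
  code_pat <- "^\\d+-(\\d+)[.-]\\d+$"
  # Name: strip the 'Worker:' prefix from the first row that has it
  name <- sub("^Worker:\\s*", "", txt[grepl("^Worker:", txt)][1], perl = TRUE)
  # Date: keep only the first string shaped like dd/dd/dd
  date <- txt[grepl("^\\d{2}/\\d{2}/\\d{2}$", txt, perl = TRUE)][1]
  # Code: middle group of digits from every code-shaped string
  code <- sub(code_pat, "\\1", txt[grepl(code_pat, txt, perl = TRUE)], perl = TRUE)
  # name and date are recycled to one row per code
  data.frame(name = name, code = code, date = date, stringsAsFactors = FALSE)
}

one_df <- do.call(rbind, lapply(PDF_list, extract_job))
one_df
```

Because the date pattern is anchored with ^ and $, a string such as "99-08-70" is treated as a code and never as a date.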

R Subsetting text from a comma separated column in a data-frame

I have a data.frame with a column that looks like that:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to somehow split this column into two columns, one containing all the values with F and one containing all the other values, resulting in a df that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try this tidyverse approach. You can separate the rows at each , and then create a group according to the pattern, so that you can reshape to wide and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit to split at the commas. Then use grep to find the indexes of the elements containing F, select or deselect them by multiplying the indexes by 1 or -1, and paste each selection back together.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
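The el() call above takes only the first row, so a data frame with several rows would need a per-row version. A minimal sketch of that, assuming codes are matched by their first letter; the second row of d is made-up data added only to show the multi-row case.

```r
# First row is from the question; the second row is invented for illustration
d <- data.frame(diagnosis = c("F.31.2,A.43.2,R.45.2,F.43.1", "A.10.1,F.20.0"),
                stringsAsFactors = FALSE)

# One character vector of codes per row
parts <- strsplit(d$diagnosis, ",", fixed = TRUE)

# Paste the codes starting with F, and the remaining codes, back together per row
d$F <- vapply(parts, function(p) paste(p[startsWith(p, "F")], collapse = ","),
              character(1))
d$other <- vapply(parts, function(p) paste(p[!startsWith(p, "F")], collapse = ","),
                  character(1))
d[, c("F", "other")]
```

Using startsWith rather than grep("F", ...) is a design choice here: it avoids matching an F that appears later inside a code.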

Data cleaning in R: grouping by number and then by name

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group data that have the same number together and then followed by the same name together (A,a,.A. are considered as the same name for my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.
Assuming what you have is a data frame, data <- data.frame(x,y), and not the matrix that cbind generates, you could combine the different spellings into one level using fct_collapse and then arrange the data by this new column (z) and the x value.
library(dplyr)
library(forcats)
data %>%
mutate(z = fct_collapse(y,
"A" = c('A', '.A.', 'a'),
"B" = c('B', 'b'))) %>%
arrange(z, x) %>%
select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all punctuation from the y column, convert the values to upper or lower case, and then arrange.
data %>%
mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
arrange(z, x) %>%
select(-z) -> result
result
library(dplyr)
data %>%
as.data.frame() %>%
group_by(x, y) %>%
summarise(records = n()) %>%
arrange(x, y)
According to your question it's just a matter of ordering data.
result <- data[order(data$x, data$y),]
or, considering that you want to collate A, a and .A.:
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]
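Note that the desired output in the question groups by name first and by number second, and that data$x only works on a data frame, not on the matrix that cbind returns. A minimal base R sketch under those two assumptions; key is a helper name introduced here.

```r
# Same values as the question, but stored as a data frame instead of a matrix
x <- c(1, 2, 3, 4, 1, 7, 1)
y <- c("A", "b", "a", "F", "A", ".A.", "B")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Normalised name: drop everything that is not a letter, then upper-case
key <- toupper(gsub("[^A-Za-z]", "", data$y))

# Order by normalised name first, then by number within each name
result <- data[order(key, data$x), ]
result$x
#[1] 1 1 3 7 1 2 4
result$y
#[1] "A"   "A"   "a"   ".A." "B"   "b"   "F"
```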

Concatenating two text columns in dplyr

My data looks like this:
round <- c(rep("A", 3), rep("B", 3))
experiment <- rep(c("V1", "V2", "V3"), 2)
results <- rnorm(mean = 10, n = 6)
df <- data.frame(round, experiment, results)
> df
round experiment results
1 A V1 9.782025
2 A V2 8.973996
3 A V3 9.271109
4 B V1 9.374961
5 B V2 8.313307
6 B V3 10.837787
I have a different dataset that will be merged with this one where each combo of round and experiment is a unique row value, ie, "A_V1". So what I really want is a variable name that concatenates the two columns together. However, this is tougher to do in dplyr than I expected. I tried:
name_mix <- paste0(df$round, "_", df$experiment)
new_df <- df %>%
mutate(name = name_mix) %>%
select(name, results)
But I got the error, Column name must be length 1 (the group size), not 6. I also tried the simple base-R approach of cbind(df, name_mix) but received a similar error telling me that df and name_mix were of different sizes. What am I doing wrong?
You can use the unite function from tidyr
require(tidyverse)
df %>%
unite(round_experiment, c("round", "experiment"))
round_experiment results
1 A_V1 8.797624
2 A_V2 9.721078
3 A_V3 10.519000
4 B_V1 9.714066
5 B_V2 9.952211
6 B_V3 9.642900
This should do the trick if you are looking for a new variable
library(tidyverse)
round <- c(rep("A", 3), rep("B", 3))
experiment <- rep(c("V1", "V2", "V3"), 2)
results <- rnorm(mean = 10, n = 6)
df <- data.frame(round, experiment, results)
df
df <- df %>% mutate(
name = paste(round, experiment, sep = "_")
)
You could also try this:
library(tidyr)
library(dplyr)
df = df %>%
unite(combined, round, experiment, sep = "_", remove = FALSE)
The output will be:
combined round experiment results
A_V1 A V1 10.152329
A_V2 A V2 10.863128
A_V3 A V3 10.975773
B_V1 B V1 9.964696
B_V2 B V2 9.876675
B_V3 B V3 9.252936
This will retain your original columns.
Another solution could be to use the stri_join function in the stringi package.
library(stringi)
df$new = stri_join(df$round,df$experiment,sep="_")
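Since the stated goal is to merge with another dataset keyed on values like "A_V1", the whole round trip can be sketched in base R. The data frame other and its note column are hypothetical stand-ins for that second dataset, which the question does not show.

```r
# Data as in the question (results are random, so values will differ)
round <- c(rep("A", 3), rep("B", 3))
experiment <- rep(c("V1", "V2", "V3"), 2)
results <- rnorm(mean = 10, n = 6)
df <- data.frame(round, experiment, results, stringsAsFactors = FALSE)

# Build the combined key column directly on the data frame
df$name <- paste(df$round, df$experiment, sep = "_")

# Hypothetical second dataset keyed on the combined name
other <- data.frame(name = c("A_V1", "B_V3"),
                    note = c("first", "last"),
                    stringsAsFactors = FALSE)

# Inner join on the shared key
merged <- merge(df, other, by = "name")
merged
```

merge keeps only the rows whose name appears in both data frames; all.x = TRUE would keep every row of df instead.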

Making new column with multiple elements after group_by

I'm trying to make a new column as described below. The d's actually correspond to dates, and the V2 values are events on the given dates. I need to collect the events for each date. V3 is a single column whose row entries are the concatenated events. Thanks in advance. My attempt does not work.
df = V1 V2
d1 U
d2 M
d1 T
d1 Q
d2 P
desired resulting df
df.1 = V1 V3
d1 U,T,Q
d2 M,P
df.1 <- df %>% group_by(., V1) %>%
mutate(., V3 = c(distinct(., V2))) %>%
as.data.frame
The above code results in the following error; ignore the 15 and 1s--they're specific to my actual code
Error: incompatible size (15), expecting 1 (the group size) or 1
You can use aggregate like this:
df.1 <- aggregate(V2~V1,paste,collapse=",",data=df)
# V1 V2
# 1 d1 U,T,Q
# 2 d2 M,P
A whole vector cannot be stored as a single element of a regular data frame column. So instead of using c(), you can use paste to concatenate the elements into a single string.
df.1 <- df %>% group_by(V1) %>% mutate(V3 = paste(unique(V2), collapse = ",")) %>% select(V1, V3) %>% unique() %>% as.data.frame()
Still with dplyr, you can try:
df %>% group_by(V1) %>% summarize(V3 = paste(unique(V2), collapse=", "))
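The question gives df only as a printed table, so here is a runnable reconstruction together with a base R equivalent using tapply. This is a sketch: the data frame is rebuilt by hand from the table, and events is a name introduced for illustration.

```r
# Data frame rebuilt by hand from the table in the question
df <- data.frame(V1 = c("d1", "d2", "d1", "d1", "d2"),
                 V2 = c("U", "M", "T", "Q", "P"),
                 stringsAsFactors = FALSE)

# One comma-separated string of distinct events per date
events <- tapply(df$V2, df$V1, function(v) paste(unique(v), collapse = ","))

# Turn the named result back into a two-column data frame
df.1 <- data.frame(V1 = names(events), V3 = as.vector(events),
                   stringsAsFactors = FALSE)
df.1
#   V1    V3
# 1 d1 U,T,Q
# 2 d2   M,P
```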
