dplyr select column based on string match - r

I want to order the columns of a data frame by string matches.
library(dplyr)
data <- data.frame(start_a = 1,
                   start_f = 3,
                   end_a = 5,
                   end_f = 7,
                   middle_a = 9,
                   middle_f = 11)
For example, I want to select start_f, start_a, middle_f, middle_a, end_f, end_a.
I am attempting to do so with data %>% select(matches("(start|middle|end)_(f|a)")), hoping that the order I have typed within matches() is the order in which the columns are selected.
Desired output would be data[c(2,1,6,5,4,3)]

You can construct the column names in the order that you want with outer().
order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
cols <- c(t(outer(order1, order2, paste, sep = '_')))
cols
#[1] "start_f" "start_a" "middle_f" "middle_a" "end_f" "end_a"
data[cols]
# start_f start_a middle_f middle_a end_f end_a
#1 3 1 11 9 7 5
If not all combinations of order1 and order2 are present in the data, we can use any_of(), which selects only the columns present in data without raising an error.
library(dplyr)
data %>% select(any_of(cols))
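For instance, a quick hedged illustration of the difference (the extra name "start_z" is made up and not in data):
cols2 <- c(cols, "start_z")        # "start_z" does not exist in data
# data %>% select(all_of(cols2))   # all_of() would error on the missing column
data %>% select(any_of(cols2))     # any_of() silently keeps only the existing columns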
To select based on patterns in the names:
order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
pattern <- c(t(outer(order1, order2, function(x, y) sprintf('^%s_%s.*', x, y))))
pattern
#[1] "^start_f.*" "^start_a.*" "^middle_f.*" "^middle_a.*" "^end_f.*" "^end_a.*"
cols <- names(data)
data[sapply(pattern, function(x) grep(x, cols))]
# start_f start_a middle_f middle_a end_f end_a
#1 3 1 11 9 7 5
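If you prefer to stay inside select(), a hedged alternative is to pass one matches() per pattern, since select() returns columns in the order of its arguments (only within a single helper does the data's original column order win):
library(dplyr)
data %>%
  select(matches("^start_f"), matches("^start_a"),
         matches("^middle_f"), matches("^middle_a"),
         matches("^end_f"), matches("^end_a"))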

Related

Replacing parts of value (starts_with) in dataframe with values from a different dataframe in R

I want to replace part of the value in df1 with a value from df2. If df1$col1 starts with the four digits in df2$col1, replace those four digits (keeping the rest) with df2$col2. The same applies to df1$col2. Example: 16122567 starts with 1612, which maps to 5059, giving 50592567. I have tried different kinds of starts_with, loops, for(i in ..), mutate, etc. Any ideas? (I'm new to R.)
df1
      col1     col2
1 16122567 89992567
2 17236945 16126548
3 95781657 19995670
4 16126972 56972541

df2
  col1 col2
1 1612 5059
2 1723 5044
3 8999 5094
4 1999 9053
Here is one way with dplyr. We overwrite col1 with its first 4 characters, left_join with df2, and replace NAs in col2.y with the first four characters of col2.x. Finally, we use substr<- to replace the values at that position.
library(dplyr)
df3 <- df1 %>%
  mutate(col1 = substr(col1, 1, 4)) %>%
  left_join(df2 %>% mutate(col1 = as.character(col1)), by = 'col1') %>%
  mutate(col2.y = ifelse(is.na(col2.y), substr(col2.x, 1, 4), col2.y),
         col2.x = as.character(col2.x))
substr(df3$col2.x, 1, 4) <- df3$col2.y
df3
# col1 col2.x col2.y
#1 1612 50592567 5059
#2 1723 50446548 5044
#3 9578 19995670 1999
#4 1612 50592541 5059
Here is another approach using base R. We can create a function to do the check and manipulate the text and then apply that function to any column we want to modify.
# the data
df1 <- data.frame(col1 = c(16122567, 17236945, 95781657, 16126972),
                  col2 = c(89992567, 16126548, 19995670, 56972541))
df2 <- data.frame(col1 = c(1612, 1723, 8999, 1999),
                  col2 = c(5059, 5044, 5094, 9053))
# a function to do the check and create the chimera strings
check_and_paste <- function(check1, check2, replacement) {
  res <- c()
  for (i in seq_along(check1)) {
    four_digits <- substr(check1[i], 1, 4)
    if (four_digits %in% check2) {
      res[i] <- paste(replacement[which(four_digits == check2)],
                      substr(check1[i], 5, 8),
                      sep = "")
    } else {
      res[i] <- check1[i]
    }
  }
  return(as.numeric(as.character(res)))  # to return numbers
}
# apply to the first column
new_col1 <- check_and_paste(
  check1 = df1$col1,
  check2 = df2$col1,
  replacement = df2$col2
)
# and the second
new_col2 <- check_and_paste(
  check1 = df1$col2,
  check2 = df2$col1,
  replacement = df2$col2
)
# the new data frame
data.frame(new_col1, new_col2)
# new_col1 new_col2
#1 50592567 50942567
#2 50446945 50596548
#3 95781657 90535670
#4 50596972 56972541
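As a hedged, fully vectorised sketch of the same idea, match() on the first four digits avoids the explicit loop (the helper name replace_prefix is mine, not something from the question):
replace_prefix <- function(x, from, to) {
  x   <- as.character(x)
  idx <- match(substr(x, 1, 4), as.character(from))   # which prefix, if any, matches
  hit <- !is.na(idx)
  x[hit] <- paste0(to[idx[hit]], substr(x[hit], 5, nchar(x[hit])))
  as.numeric(x)
}
data.frame(new_col1 = replace_prefix(df1$col1, df2$col1, df2$col2),
           new_col2 = replace_prefix(df1$col2, df2$col1, df2$col2))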

Pattern Searching in R

I have two data frames as below. DF1 is slightly messy (as you can see below) and has multiple values from DF2 combined into one column.
DF1
SRNo. Value
1 1ABCD2EFGH3IJKL
2 1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV
3 3ABCD4EFGH5IJKL
4 3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL
5 7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH
DF2
SRNo. Value
1 1ABCD2EFGH3IJKL
2 3ABCD4EFGH5IJKL
3 6PQRS7TUVW8XYZA
4 5FGHI9XUZX1RATP
5 9AGTY6UGFW0AAUU
6 6TEYD7RARA8MHAT
7 9IUYT7HGFD3LKJH
I want to do a lookup using the Value column in both data sets. Here is what I am trying to accomplish.
i) For rows 1 & 3 in DF1 it is a simple lookup in DF2. I expect the code to return those looked-up values.
ii) For row #2 in DF1, only the first part of the string matches a value in DF2. I expect the code to return only that first part.
iii) For row #4 in DF1, both parts of the string match values in DF2. In this case I want the first matching part to be retained.
iv) For row #5, the second part of the string matches a value in DF2. I expect the code to return the second part of the string.
I have around 47,000 rows in the first dataset and over 300,000 in the second, and of course there are other columns in both datasets. I have tried this in multiple ways using str_split/str_match but could not accomplish what I want. Every suggestion is appreciated. The rest of my code is in R.
Thank You
The first step is to tidyr::separate() your DF1 at "/". Then I used dplyr::case_when() with %in% to check whether the first split value appears in DF2$Value; if it doesn't, the second is checked. I used dplyr::mutate() to append the result to DF1 as dat.
library(dplyr)
library(tidyr)
DF1 <- data.frame("SRNo." = 1:5,
                  Value = c("1ABCD2EFGH3IJKL", "1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV", "3ABCD4EFGH5IJKL",
                            "3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL", "7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH"),
                  stringsAsFactors = FALSE) %>% tbl_df()
DF2 <- data.frame("SRNo." = 1:7,
                  Value = c("1ABCD2EFGH3IJKL", "3ABCD4EFGH5IJKL", "6PQRS7TUVW8XYZA", "5FGHI9XUZX1RATP",
                            "9AGTY6UGFW0AAUU", "6TEYD7RARA8MHAT", "9IUYT7HGFD3LKJH"),
                  stringsAsFactors = FALSE) %>% tbl_df()
DF1 %>%
  separate(Value, c("Value1", "Value2"), sep = "/") %>%
  mutate(dat = case_when(
    Value1 %in% DF2$Value ~ Value1,
    Value2 %in% DF2$Value ~ Value2,
    TRUE ~ NA_character_
  ))
# # A tibble: 5 x 4
# SRNo. Value1 Value2 dat
# <int> <chr> <chr> <chr>
# 1 1 1ABCD2EFGH3IJKL NA 1ABCD2EFGH3IJKL
# 2 2 1ABCD2EFGH3IJKL 7MLPO0OKMN8MNBV 1ABCD2EFGH3IJKL
# 3 3 3ABCD4EFGH5IJKL NA 3ABCD4EFGH5IJKL
# 4 4 3ABCD4EFGH5IJKL 1ABCD2EFGH3IJKL 3ABCD4EFGH5IJKL
# 5 5 7MLPO0OKMN8MNBV 9IUYT7HGFD3LKJH 9IUYT7HGFD3LKJH
A data.table solution:
df1 <- read.table(text="SRNo. Value
1 1ABCD2EFGH3IJKL
2 1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV
3 3ABCD4EFGH5IJKL
4 3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL
5 7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH", header = T, stringsAsFactors = F)
df2 <- read.table( text = "SRNo. Value
1 1ABCD2EFGH3IJKL
2 3ABCD4EFGH5IJKL
3 6PQRS7TUVW8XYZA
4 5FGHI9XUZX1RATP
5 9AGTY6UGFW0AAUU
6 6TEYD7RARA8MHAT
7 9IUYT7HGFD3LKJH", header = T, stringsAsFactors = F )
library( data.table )
setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)]
setDT(df2)
resultv1 <- df2[ df1, on = c( Value = "Value1"), nomatch = 0L ]
resultv2 <- df2[ df1, on = c( Value = "Value2"), nomatch = 0L ]
result <- rbindlist( list( resultv1, resultv2 ) )[!duplicated( i.SRNo.)]
Benchmarking it against the solution from #Paul shows similar runtimes (~2.5 milliseconds), but data.table sometimes surprises me on larger data sets.
If memory becomes an issue, you can do it all in one go:
rbindlist( list( setDT(df2)[ setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)],
on = c( Value = "Value1"), nomatch = 0L ],
setDT(df2)[ setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)],
on = c( Value = "Value2"), nomatch = 0L ] ) )[!duplicated( i.SRNo.)]
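If Value can ever contain more than two "/"-separated parts, a hedged tidyr sketch that avoids hard-coding Value1/Value2 is to go long and keep the first match per row (this assumes SRNo. uniquely identifies rows of DF1; rows with no matching part are dropped here rather than returned as NA):
library(dplyr)
library(tidyr)
DF1 %>%
  separate_rows(Value, sep = "/") %>%   # one row per part
  semi_join(DF2, by = "Value") %>%      # keep only parts present in DF2
  group_by(SRNo.) %>%
  slice(1) %>%                          # first matching part per original row
  ungroup()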

Select data frame values row-wise using a variable of column names

Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")
My favourite old matrix-indexing will take care of this. Just pass a 2-column matrix with the respective row/column index:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
One can use mapply to pass the row number of dframe and the corresponding column name for each row, returning that specific column value. The solution using mapply can be written as:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x, y) dframe[x, y], 1:nrow(dframe), colname)
#[1] 1 5 3
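A hedged purrr equivalent of the mapply call (same idea, type-stable output):
library(purrr)
map2_dbl(seq_len(nrow(dframe)), colname, ~ dframe[.x, .y])
#[1] 1 5 3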
The next option may not be very intuitive, but if someone wants a solution in a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
  rownames_to_column() %>%
  left_join(dframe %>% rownames_to_column() %>%
              gather(colname, value, -rowname),
            by = c("rowname", "colname")) %>%
  select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3
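If only the bare vector c(1, 5, 3) is wanted (as in the question), a hedged compact variant of the same join ends with pull() instead of select():
library(tidyverse)
tibble(rowname = as.character(seq_len(nrow(dframe))), colname = colname) %>%
  left_join(dframe %>% rownames_to_column() %>% gather(colname, value, -rowname),
            by = c("rowname", "colname")) %>%
  pull(value)
#[1] 1 5 3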

Remove NAs in row, and move the cell on the right where the NA was located in R, also unique values

OK, so I have a data frame in R like this
ID <- c(1, 2, 3)
c1 <- c( 1, 1, NA)
c2 <- c(NA, NA, 5)
c3 <- c(NA, NA, NA)
c4 <- c(2, NA, 5)
c5 <- c(5, 7, 3)
df <- data.frame(ID, c1, c2, c3, c4, c5)
So, this is what I'm looking for
1. Treat every row as a vector
2. Be able to remove all NAs in every row/vector
3. In a given row there can't be repeated values (except for ID vs a number in another cell)
4. I'm looking to "cut" this row/vector. I don't need 5 values, just 2.
I'm doing this for a MAP@k metric, and the order of the numbers matters (the one on the left is more important than the next one), so it's important to keep the order.
This is the output that I'm looking for
ID <- c(1, 2, 3)
c1 <- c(1, 1, 5)
c2 <- c(2, 7, 3)
df2 <- data.frame(ID, c1, c2)
Thank you for your help
We loop through the rows of 'df' (using apply with MARGIN = 1), remove the NA elements (!is.na(x)) and get the unique values. If the lengths are not all the same, the output will be a list ('lst'). We use lengths to get the length of each list element and take the min of it; based on that we subset the list elements and cbind with the first column 'ID'.
lst <- apply(df[-1], 1, function(x) unique(x[!is.na(x)]))
dfN <- cbind(df[1], do.call(rbind, lapply(lst, function(x) x[seq(min(lengths(lst)))])))
colnames(dfN)[-1] <- paste0("c", colnames(dfN)[-1])
dfN
# ID c1 c2
#1 1 1 2
#2 2 1 7
#3 3 5 3
NOTE: If the number of unique elements is the same in each row (after removing the NAs), the output will be a matrix. Just transpose the output and cbind it with the first column.
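A minimal hedged sketch of that matrix case (with the example df above the counts differ, so the guard simply does nothing here):
m <- apply(df[-1], 1, function(x) unique(x[!is.na(x)]))  # a matrix only when every row has the same count
if (is.matrix(m)) {
  dfN <- cbind(df[1], t(m))
  colnames(dfN)[-1] <- paste0("c", seq_len(ncol(dfN) - 1))
}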
Or another option is data.table which should be very efficient.
library(data.table)
dM <- melt(setDT(df), id.var = "ID", na.rm = TRUE)[,
        .(value = unique(value), n = seq(uniqueN(value))), ID]
dcast(dM[dM[, n1 := min(tabulate(ID))][, .I[1:.N <= n1], ID]$V1],
      ID ~ paste0("c", n), value.var = "value")
# ID c1 c2
#1: 1 1 2
#2: 2 1 7
#3: 3 5 3
Ugly, but it should be efficient (it chewed through 3M records in about 20 seconds and 300K in under 2 seconds):
sel <- !is.na(df[-1])
tmp <- unique(data.frame(ID=df$ID[row(df[-1])[sel]], c=df[-1][sel]))
tmp$time <- ave(tmp$ID, tmp$ID, FUN=seq_along)
reshape(tmp[tmp$time <= 2,], idvar="ID", direction="wide", sep="")
# ID c1 c2
#1 1 1 2
#2 2 1 7
#3 3 5 3
Based on akrun's data.table idea, I translated the data.table code to dplyr/tidyr (it is easier for me to read, that's all). Here is the code:
library(dplyr)
library(tidyr)
df_tidy <- df %>%
  gather(importance, val, c1:c5) %>%
  na.omit %>%
  arrange(ID, importance) %>%
  group_by(ID) %>%
  distinct(val, .keep_all = TRUE) %>%   # keep the first occurrence of each value per ID
  mutate(place = seq_len(n())) %>%
  filter(place <= 2) %>%
  mutate(place = paste("c", place, sep = "")) %>%
  select(-importance) %>%
  spread(place, val)
Thank you akrun and thelatemail !

R - co-locate columns with the same name after merge

Situation
I have two data frames, df1 and df2, with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And I am performing a merge:
#library(dplyr) # not needed for merge
joined_df <- merge(x=df1, y=df2, c("id"),all=TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to steer clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: also, I would like to maintain the original order of column headings, so sorting alphabetically is not an option (on my actual data, that is; I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols[!dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you a data frame like this:
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
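A hedged alternative for the general case is to derive the interleaved order straight from the merged names, relying on merge()'s ".x"/".y" suffixes and on order() being stable:
other <- setdiff(names(joined_df), "id")
base  <- sub("\\.(x|y)$", "", other)    # strip the merge suffixes
joined_df <- joined_df[c("id", other[order(match(base, names(df1)))])]
# id val1.x val1.y val2.x val2.y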
