R grep search patterns in multiple columns

I have a data frame as follows:
Col1 Col2 Col3
A B C
D E F
G H I
I am trying to keep the rows matching 'B' in 'Col2' OR 'F' in 'Col3', in order to get:
Col1 Col2 Col3
A B C
D E F
I tried:
data[(grep("B",data$Col2) || grep("F",data$Col3)), ]
but it returns the entire data frame.
NOTE: it works when I call the two greps one at a time.

Or use a single grepl after pasting the columns together:
df1[with(df1, grepl("B|F", paste(Col2, Col3))),]
# Col1 Col2 Col3
#1 A B C
#2 D E F
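One caveat with the pasted version: the pattern no longer knows which column a match came from, so an 'F' in Col2 would match as well. A small sketch with made-up data (the data frame and object names below are illustrative, not from the question):

```r
# Hypothetical data where 'F' sits in Col2 rather than Col3
demo <- data.frame(Col2 = c("B", "F", "H"),
                   Col3 = c("C", "X", "I"),
                   stringsAsFactors = FALSE)

# Pasted approach: matches B or F anywhere in either column
pasted <- grepl("B|F", paste(demo$Col2, demo$Col3))

# Per-column approach: B only in Col2, F only in Col3
per_column <- grepl("B", demo$Col2) | grepl("F", demo$Col3)

pasted      # TRUE TRUE FALSE
per_column  # TRUE FALSE FALSE
```

If the column matters, the per-column form is probably the safer choice.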

with(df1, df1[ Col2 == 'B' | Col3 == 'F',])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Using grepl
with(df1, df1[ grepl( 'B', Col2) | grepl( 'F', Col3), ])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Data:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
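As for why the original attempt returned everything: grep returns the positions of the matches, and || works on single values rather than element by element. A minimal sketch of the difference, rebuilding the example data (the object names are only for illustration):

```r
demo <- data.frame(Col1 = c("A", "D", "G"),
                   Col2 = c("B", "E", "H"),
                   Col3 = c("C", "F", "I"),
                   stringsAsFactors = FALSE)

idx_b <- grep("B", demo$Col2)  # 1L: position of the match, not a logical mask
idx_f <- grep("F", demo$Col3)  # 2L

# Both indices are non-zero, so || collapses them to a single TRUE,
# and demo[TRUE, ] recycles that TRUE over every row.
scalar_or <- idx_b || idx_f

# grepl returns one logical per row, and | combines them element-wise.
keep <- grepl("B", demo$Col2) | grepl("F", demo$Col3)
result <- demo[keep, ]
```

Note that from R 4.3.0 onward, || raises an error when either side has length greater than one, so the original form would fail outright if one of the greps matched several rows.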

The data.table package makes this type of operation trivial due to its compact and readable syntax. Here is how you would perform the above using data.table:
> df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
+ ), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
+ ), row.names = c(NA, -3L), class = "data.frame")
> library(data.table)
> DT <- data.table(df1)
> DT
Col1 Col2 Col3
1: A B C
2: D E F
3: G H I
> DT[Col2 == 'B' | Col3 == 'F']
Col1 Col2 Col3
1: A B C
2: D E F
>
data.table performs its matching operations with with=TRUE by default. Note that the matching is much faster if you set keys on the data, but that is a topic for another time.

Related

R: Replace column depending on match of two other columns

Let's assume there are two columns in each of two huge data frames (of different lengths), like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match in the first columns, X should be replaced with the data from column 2 of df1. If the first column of df2 contains elements that are not in the first column of df1 (F, D), X should be replaced with 0.
Since the data frames are huge, a loop within a loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank you in advance.
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
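To see what match is doing in the steps above, it may help to look at the intermediate index on plain vectors. This is only a sketch; the names lookup, keys, pos and vals are made up for illustration:

```r
# Unique key/value pairs, as in 'df3'
lookup <- data.frame(Col1 = c("A", "B", "C"),
                     Col2 = c(1, 4, 7),
                     stringsAsFactors = FALSE)

# First column of 'df2'
keys <- c("C", "D", "C", "F", "A", "B", "B")

# Position of each key in lookup$Col1; NA where the key is absent
pos <- match(keys, lookup$Col1)

# Indexing with NA yields NA, which is then replaced by 0
vals <- lookup$Col2[pos]
vals[is.na(vals)] <- 0

pos   # 3 NA 3 NA 1 2 2
vals  # 7 0 7 0 1 4 4
```

Because match is vectorised, there is no explicit loop over either data frame.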
Or this can be done with data.table by joining on 'Col1': first remove the placeholder Col2 from the second data set, then assign Col2 from 'df3' in the join. Rows of 'df2' without a match are left as NA and can be set to 0 as above.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))

R two tables merge and create new column for repeat values

Good day everyone. I am trying to merge two data frames into a new data frame that keeps the unique columns and creates new columns for the repeated values.
For example, two dataframes are:
df1
col1 col2
A B
C D
df2
col1 col2 col3
A B E
A B F
C D G
C D H
C D I
Target output is
col1 col2 col3 col4 col5
A B E F
C D G H I
Hope you can help me. Thanks!
I'm not sure whether the final format you are after is helpful. However, the first step is a simple left or full join:
df1 <- data.frame(col1 = c("A", "C"),
col2 = c("B", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
col2 = c("B", "B", "D", "D", "D"),
col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = FALSE)
library(tidyverse)
res <- left_join(df1, df2, by = c("col1", "col2"))
res
col1 col2 col3
1 A B E
2 A B F
3 C D G
4 C D H
5 C D I
Getting a result in the desired form is a bit trickier.
First we do the same left join as above, then unite the two columns (col1 and col2) so that we can group and spread by them easily.
Grouping by the united column (fuse), we number each col3 value within its group and paste "col" as a prefix, so that the counter becomes a column name when spreading.
We then spread by the counter column n and fill it with the values of col3.
Finally, we reverse the unite we did earlier.
left_join(df1, df2, by = c("col1", "col2")) %>%
unite(fuse, col1, col2) %>%
group_by(fuse) %>%
mutate(n = paste0("col", 2 + 1:n())) %>%
spread(n, col3) %>%
separate(fuse, c("col1", "col2"))
# A tibble: 2 x 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
1 A B E F NA
2 C D G H I
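The same idea (a within-group counter, then a long-to-wide reshape) can be sketched in base R without the tidyverse. This is an alternative under the assumption that only the rows of df2 are needed; note that the resulting columns are named col3.1, col3.2, col3.3 rather than col3, col4, col5, and the object names long and wide are made up here:

```r
long <- data.frame(col1 = c("A", "A", "C", "C", "C"),
                   col2 = c("B", "B", "D", "D", "D"),
                   col3 = c("E", "F", "G", "H", "I"),
                   stringsAsFactors = FALSE)

# Counter 1, 2, ... within each (col1, col2) group
long$n <- ave(seq_along(long$col3), long$col1, long$col2, FUN = seq_along)

# One row per (col1, col2); one col3.<n> column per counter value
wide <- reshape(long, idvar = c("col1", "col2"),
                timevar = "n", direction = "wide")
wide
```

Groups with fewer values than the largest group are padded with NA, as in the tidyverse output above.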

Loop over list of data frames using data.table

I have 3 data frames that I'd like to run the same data.table function on. I could do this manually for each data.frame but I'd like to learn how to do it more efficiently.
Using the data.table package, I want to replace the contents of col1 with the contents of col2 only if col1 contains "a". And I want to run this code over three different dataframes. On a single data.frame, this works fine:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
library(data.table)
dt = data.table(df1)
dt[grepl(pattern = "a", x = df1$col1), col1 :=col2]
but I am lost trying to get this to run over multiple dataframes:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"))
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"))
library(data.table)
listdfs = list(df1, df2, df3)
for (i in dt[[]]) {
dt[[i]][grepl(pattern = "a", x = df[[i]]$col1), col1 := col2] }
But this obviously doesn't work because I have no clue what I'm doing with the for loop. Any guidance/teaching would be appreciated. Thanks!
If we are looping through the list, loop over the sequence along the list and do the assignment for each element:
listdfs = list(df1, df2, df3)
lapply(listdfs, setDT) # change the `data.frame` to `data.table`
for (i in seq_along(listdfs)) { # loop over sequence
listdfs[[i]][grepl(pattern = "a", x = col1), col1 := col2]
}
This changes the elements (the data.tables) within listdfs as well as the objects 'df1', 'df2', 'df3' themselves, as we didn't create any copies:
df1
# col1 col2
#1: AA AA # change
#2: AA AA # change
#3: b AA
df2
# col1 col2
#1: b AA
#2: b BB
#3: BB BB # change
df3
# col1 col2
#1: b AA
#2: b AA
#3: b BB
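If modifying the original objects by reference is not wanted, one possible alternative is a plain base R lapply that returns modified copies. This is a sketch rather than a data.table solution, and the helper name replace_a is hypothetical:

```r
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: copy col2 into col1 wherever col1 contains "a"
replace_a <- function(d) {
  hit <- grepl("a", d$col1)
  d$col1[hit] <- d$col2[hit]
  d
}

# Returns a named list of modified copies; df1, df2, df3 stay untouched
listdfs <- lapply(list(df1 = df1, df2 = df2, df3 = df3), replace_a)
listdfs$df2
```

The trade-off is that copies are made, which the by-reference data.table loop avoids.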

If string exists, replace cell with NA

I have a data frame as follows:
Col1 Col2 Col3 Col4 Col5
U N=>A {N A} NA
V {L E=>e E e}
X M=>P {M P} NA
Y {Z Q=>p Q p}
How do I do the following?
Replace any cells that contain => with NA.
Remove { and } from the data frame.
Final output to look like is this:
Col1 Col2 Col3 Col4 Col5
U NA N A NA
V L NA E e
X NA M P NA
Y Z NA Q p
We can loop over the columns, use grepl to find the elements that contain =>, replace them with NA, and then remove the braces with gsub:
df1[] <- lapply(df1, function(x) gsub("[{}]+", "", replace(x, grepl("=>", x), NA)))
df1
# Col1 Col2 Col3 Col4 Col5
#1 U <NA> N A <NA>
#2 V L <NA> E e
#3 X <NA> M P <NA>
#4 Y Z <NA> Q p
data
df1 <- structure(list(Col1 = c("U", "V", "X", "Y"), Col2 = c("N=>A",
"{L", "M=>P", "{Z"), Col3 = c("{N", "E=>e", "{M", "Q=>p"), Col4 = c("A}",
"E", "P}", "Q"), Col5 = c(NA, "e}", NA, "p}")), .Names = c("Col1",
"Col2", "Col3", "Col4", "Col5"), class = "data.frame", row.names = c(NA,
-4L))
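The per-column function can be tried on a single vector to see each step. A small sketch; the vector x and the intermediate names are made up for illustration:

```r
x <- c("N=>A", "{L", "A}", NA)

# Step 1: blank out every element containing "=>"
step1 <- replace(x, grepl("=>", x), NA)

# Step 2: strip braces; gsub leaves NA elements as NA
step2 <- gsub("[{}]+", "", step1)

step1  # NA "{L" "A}" NA
step2  # NA "L" "A" NA
```

Assigning with df1[] <- lapply(...) rather than df1 <- lapply(...) keeps the result a data frame instead of a list.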

R paste0 2 columns if not NA

I would like to paste0 two columns if the element in one column is not NA. If one element of one column is NA, then keep only the element of the other column.
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"), col2 = c(1, NA, 3)), .Names = c("col1", "col2"),
class = "data.frame",row.names = c(NA, -3L))
# col1 col2
# 1 A 1
# 2 B NA
# 3 C 3
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"),col2 = c(1, NA, 3), col3 = c("A|1", "B", "C|3")),
.Names = c("col1", "col2", "col3"), row.names = c(NA,-3L),
class = "data.frame")
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3
You can also do it with regular expressions:
df$col3 <- sub("NA\\||\\|NA", "", with(df, paste0(col1, "|", col2)))
That is, paste them the regular way and then replace any "NA|" or "|NA" with "". Note that | needs to be "double escaped" because it means "OR" in regexps; that is why the strange pattern NA\\||\\|NA actually means "NA|" OR "|NA".
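One thing to watch for with this pattern: it also fires on real values that happen to end in "NA" before the separator. A hedged sketch with made-up strings, including an anchored variant (the anchors are a suggestion here, not part of the answer above):

```r
# Pasted values; "DNA|x" is a hypothetical real value, not a missing one
x <- c("A|1", "B|NA", "NA|3", "DNA|x")

# Unanchored pattern from the answer: also eats the "NA|" inside "DNA|x"
loose <- sub("NA\\||\\|NA", "", x)

# Anchored variant: only a leading "NA|" or a trailing "|NA" is removed
strict <- sub("^NA\\||\\|NA$", "", x)

loose   # "A|1" "B" "3" "Dx"
strict  # "A|1" "B" "3" "DNA|x"
```

Neither pattern can tell a genuine NA from the literal string "NA", so the ifelse approaches are probably safer when that distinction matters.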
As @Roland says, this is easy using ifelse (just translate the mental logic into a series of nested ifelse statements):
x <- transform(x,col3=ifelse(is.na(col1),as.character(col2),
ifelse(is.na(col2),as.character(col1),
paste0(col1,"|",col2))))
Update: as.character is needed in some cases.
Try:
> df$col1 = as.character(df$col1)
> df$col3 = with(df, ifelse(is.na(col1),col2, ifelse(is.na(col2), col1, paste0(col1,'|',col2))))
> df
col1 col2 col3
1 A 1 A|1
2 B NA B
3 C 3 C|3
You could also do:
library(stringr)
df$col3 <- apply(df, 1, function(x)
paste(str_trim(x[!is.na(x)]), collapse="|"))
df
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3
