R - dplyr joining two df with conditions for rows (char)

I am still a beginner with Stack Overflow and dplyr, which is perhaps why I couldn't find a similar question.
Problem:
I have two data frames.
df1 contains a variable ("a") whose entries I want to compare with the entries of a variable in df2 ("c"). Both variables are character vectors.
If there is a match between the two data frames, I want to add the corresponding string from df1 ("birne" etc.) to a new column ("new") in df2.
However, the length of each entry differs between the two variables, so perhaps str_detect or ends_with would be helpful.
##DFs
df1 <-data.frame("a"= c("055","022","010","0105","0777","077"), "b"= c("birne", "apfel", "banane","traube","blaubeere","kiwi"))
df2 <-data.frame("c"= c("GX00000055","GX0000022","GX00000010","GX00000105","GX0000777","GX0000077"))
## I want
df2_newcolumn<-data.frame("c"= c("GX00000055","GX0000022","GX00000010","GX00000105","GX0000777","GX0000077"), "new"=c("birne", "apfel","NA","NA","blaubeere","NA"))
I thought I could get it using left_join and filter in combination with ends_with, grepl or str_detect. However, I struggled to find the correct combination and order of commands.

I cannot reproduce your desired output (why are there NAs in there?), but a regex join might be what you need:
library(tidyverse)
library(fuzzyjoin)
df2 %>%
regex_left_join(df1 %>% mutate(regex = paste0(a, "$")), by = c(c = "regex")) %>%
# c a b regex
# 1 GX00000055 055 birne 055$
# 2 GX0000022 022 apfel 022$
# 3 GX00000010 010 banane 010$
# 4 GX00000105 0105 traube 0105$
# 5 GX0000777 0777 blaubeere 0777$
# 6 GX0000077 077 kiwi 077$
select(c,b)
# c b
# 1 GX00000055 birne
# 2 GX0000022 apfel
# 3 GX00000010 banane
# 4 GX00000105 traube
# 5 GX0000777 blaubeere
# 6 GX0000077 kiwi
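If fuzzyjoin is not available, the same suffix match can be sketched in base R with endsWith(). This is only a minimal sketch and not the method from the answer above: match_suffix is a hypothetical helper, and returning NA when no suffix (or more than one) matches is an assumption about how ambiguous cases should be treated.

```r
df1 <- data.frame(a = c("055", "022", "010", "0105", "0777", "077"),
                  b = c("birne", "apfel", "banane", "traube", "blaubeere", "kiwi"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(c = c("GX00000055", "GX0000022", "GX00000010",
                        "GX00000105", "GX0000777", "GX0000077"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: return the label whose suffix ends `id`,
# or NA when no suffix (or more than one) matches.
match_suffix <- function(id, suffixes, labels) {
  hit <- which(endsWith(id, suffixes))
  if (length(hit) == 1) labels[hit] else NA_character_
}

# Apply the helper to every id in df2$c
df2$new <- vapply(df2$c, match_suffix, character(1),
                  suffixes = df1$a, labels = df1$b, USE.NAMES = FALSE)
df2
```

With the example data every id ends in exactly one of the suffixes, so this gives the same labels as the regex join.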

Related

renaming a subset of contiguous columns in a data.frame/tibble based on name-indexing in R

I want to select a subset of consecutive columns by column name and rename them with a character vector.
example data:
data<-data.frame(foo=1:4, bar=10:13, zoo_1=letters[1:4], zoo_2=letters[5:8])
foo bar zoo_1 zoo_2
1 1 10 a e
2 2 11 b f
3 3 12 c g
4 4 13 d h
colnames to be replaced: 'bar', 'zoo_1', 'zoo_2'
new names:
new_names<-c('a', 'b', 'c')
I wanted to use some sort of : operator to select, for instance, the columns with the names bar to zoo_2
I found some weird solutions:
#1
names(data)[which(names(data)=='bar'):which(names(data)=='zoo_2')]<-new_names
and
#2
my_rename<-function(x,y,z){
# replace the names listed in y with the names in z
names(x)[match(y, names(x))]<-z
names(x)
}
names(data)<-my_rename(data, c('bar', 'zoo_1', 'zoo_2'), new_names)
Solution #2 is bad because it requires spelling out all names to be replaced.
Solution #1 allows me to select the names in a 'bar':'zoo_2' style, but it is quite verbose and may be confusing to others. I am most interested in a substitute for this (which:which) hack.
Any ideas?
We can use rename_at
library(dplyr)
data <- data %>%
rename_at(vars(bar:zoo_2), ~ new_names)
names(data)
#[1] "foo" "a" "b" "c"
However, as of dplyr 1.0.0, rename_at has been superseded by rename_with (shown here applied to the original data, before the renaming above):
data2<-data %>%
rename_with(.cols=bar:zoo_2, ~ new_names)
> identical(data, data2)
[1] TRUE
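A self-contained sketch of the rename_with approach, rebuilt from the question's example data so that the original column names are still present. It assumes dplyr 1.0.0 or later is installed; the object name renamed is only illustrative.

```r
library(dplyr)

# Rebuild the example data so the original column names are present
data <- data.frame(foo = 1:4, bar = 10:13,
                   zoo_1 = letters[1:4], zoo_2 = letters[5:8])
new_names <- c("a", "b", "c")

# bar:zoo_2 selects the contiguous range by name; the function ignores
# the old names and returns the replacements, so the lengths must match
renamed <- data %>%
  rename_with(~ new_names, .cols = bar:zoo_2)

names(renamed)
# [1] "foo" "a"   "b"   "c"
```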

Filter rows where any values in a vector are contained in a column

I have a dataset with a single column that contains multiple ICD-10 codes separated by spaces, e.g.
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, eg goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector, but does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, but I only know how to do that for a single partial string, not for all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?
One trick is to combine the goodcodes into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes A10 and A101; without \\b(...)\\b, then grepl("A10", "A101") will be a false-positive. See
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)
With subset from base R: loop over the 'goodcodes' vector, use each code as the pattern in grepl, and Reduce the list of logical vectors into a single logical vector to subset the rows.
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
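The partial-match caveat raised for the combined regular expression applies to this loop as well, since a bare grepl(x, Codes) also matches inside longer codes. A minimal sketch of the difference, assuming a hypothetical extra code "R06" in goodcodes and a hypothetical helper any_match:

```r
dat <- data.frame(Identifier = 1:3,
                  Codes = c("A14 R17", "R069 D136 B08", "C11 K71 V91"),
                  stringsAsFactors = FALSE)
# "R06" is a hypothetical extra code, added only to expose the partial match
goodcodes <- c("C11", "A14", "R17", "O80", "R06")

# Hypothetical helper: TRUE where any of the patterns matches
any_match <- function(patterns, x) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x)))
}

loose  <- any_match(goodcodes, dat$Codes)                        # "R06" hits "R069"
strict <- any_match(paste0("\\b", goodcodes, "\\b"), dat$Codes)  # whole codes only

loose
# [1] TRUE TRUE TRUE
strict
# [1]  TRUE FALSE  TRUE
```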
data
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, which renames the columns by reference without creating a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
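Applied to the use case in the question, the paste0 approach could look like the sketch below. The two small data frames pos and neg are made up for illustration and stand in for the positive and negative document-term matrices.

```r
# Hypothetical stand-ins for the two converted document-term matrices
pos <- data.frame(hello = c(1, 0), great = c(2, 1))
neg <- data.frame(hello = c(0, 1), bad = c(1, 3))

# Tag every column with the field it came from
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# The shared word "hello" now yields two distinct columns
combined <- cbind(pos, neg)
names(combined)
# [1] "positive_hello" "positive_great" "negative_hello" "negative_bad"
```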

dplyr or base R: how to select (or delete) rows that have identical values in column 1 and column 2 while keeping the column 3 values

Given a data.frame object, using {dplyr} or base R:
How can I select (or delete) rows that have identical values in column 1 and column 2, while keeping the values of column 3?
I have no idea how to do this (perhaps with the distinct function?).
test <- data.frame(column1 = c("paris","moscou", "rennes"),
column2 = c("paris", "lima", "rennes"),
column3 =c(12,56,78))
> print (test)
column1 column2 column3
1 paris paris 12
2 moscou lima 56
3 rennes rennes 78
Example:
line 1: paris paris
line 3: rennes rennes
library(dplyr)
test2 <- test %>%
filter(column1 == column2)
print (test2)
Error: level sets of factors are different
We can use subset from base R
subset(test, as.character(column1) == as.character(column2))
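In dplyr, use filter to retrieve specific rows and select to retrieve specific columns.
If the string columns are factors (the default for data.frame before R 4.0.0), convert them with as.character before comparing, because factors with different level sets cannot be compared directly: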
library(dplyr)
test %>%
filter(as.character(column1) == as.character(column2))
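Since the question asks about both selecting and deleting such rows, here is a minimal base R sketch of the two cases. stringsAsFactors = TRUE is set explicitly to mimic the factor columns that trigger the error; the names same, kept and dropped are illustrative.

```r
test <- data.frame(column1 = c("paris", "moscou", "rennes"),
                   column2 = c("paris", "lima", "rennes"),
                   column3 = c(12, 56, 78),
                   stringsAsFactors = TRUE)  # mimic the factor columns

# Compare as character so differing factor level sets do not matter
same <- as.character(test$column1) == as.character(test$column2)

kept    <- test[same, ]   # rows where column1 equals column2
dropped <- test[!same, ]  # rows where they differ

kept$column3
# [1] 12 78
dropped$column3
# [1] 56
```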

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
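The difference between the two calls can be checked on a small stand-in for the question's data (the values below are made up):

```r
# Small stand-in for the data in the question
Data <- data.frame(chr = c("chr1", "chr1"),
                   genome = c("hg19_refGene", "hg19_refGene"),
                   region = c("CDS", "exon"),
                   stringsAsFactors = FALSE)

v <- Data[, -(2:3)]                # one column left: collapses to a vector
d <- Data[, -(2:3), drop = FALSE]  # stays a one-column data.frame

is.data.frame(v)
# [1] FALSE
is.data.frame(d)
# [1] TRUE
```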
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = FALSE]
Including drop = FALSE ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
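When the names to drop are held in a character vector, any_of() can be used in the same way. This is a sketch assuming dplyr 1.0.0 or later; cols_to_drop is a hypothetical vector, and its last entry deliberately names a column that does not exist to show that any_of() ignores it.

```r
library(dplyr)

# Hypothetical list of columns to drop; the last name does not exist
cols_to_drop <- c("height", "mass", "not_a_column")

# any_of() silently skips names that are absent from the data
trimmed <- starwars %>%
  select(-any_of(cols_to_drop))

setdiff(names(starwars), names(trimmed))
# [1] "height" "mass"
```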
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
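One caveat with the -which() idiom is worth a quick sketch: when the name is not present, which() returns an empty vector, and negative indexing with an empty vector selects no columns at all. The column names below are made up for illustration.

```r
df <- data.frame(keep1 = 1:2, removeCol = 3:4, keep2 = 5:6)

# Works as intended when the column exists
ok <- df[, -which(names(df) == "removeCol")]
names(ok)
# [1] "keep1" "keep2"

# If the name is absent, which() returns integer(0), and indexing
# with an empty vector drops every column
oops <- df[, -which(names(df) == "missingCol")]
ncol(oops)
# [1] 0

# A logical mask leaves the data untouched in that case
safe <- df[, names(df) != "missingCol", drop = FALSE]
ncol(safe)
# [1] 3
```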
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
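A pattern-based removal could be sketched as follows. The column names and the "^tmp_" pattern are hypothetical; grepl() is used rather than grep() because it returns a logical mask, so a pattern that matches nothing leaves the data frame unchanged.

```r
df <- data.frame(id = 1:2, tmp_a = 3:4, tmp_b = 5:6, value = 7:8)

# Keep every column whose name does not start with "tmp_"
df_clean <- df[, !grepl("^tmp_", names(df)), drop = FALSE]

names(df_clean)
# [1] "id"    "value"
```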
