I want to replace the strings in column ID of df2 with the column genus of df1 based on the matching string in column species in df1. Any tips appreciated, especially dplyr. Maybe left_join?
> df1
genus species
1 Orthobunyavirus Variola virus
2 Alphatorquevirus Torque teno virus 6
3 Yatapoxvirus Yaba-like disease virus
.
> df2
ID
1 Variola virus
2 Torque teno virus 6
3 Yaba-like disease virus
.
Desired output:
ID
1 Orthobunyavirus
2 Alphatorquevirus
3 Yatapoxvirus
> dput(df1)
structure(list(genus = c("Orthobunyavirus", "Alphatorquevirus",
"Yatapoxvirus"), species = c("Variola virus", "Torque teno virus 6",
"Yaba-like disease virus")), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(ID = c("Variola virus", "Torque teno virus 6",
"Yaba-like disease virus")), class = "data.frame", row.names = c(NA,
-3L))
You could simply use match
df2$ID <- df1$genus[match(df2$ID, df1$species)]
df2
#> ID
#> 1 Orthobunyavirus
#> 2 Alphatorquevirus
#> 3 Yatapoxvirus
Note that
df2$ID <- df1$genus[match(df2$ID, df1$species)]
overwrites the ID column, so the original df2 data is lost. If you want to keep df2 unchanged,
df3 <- data.frame(ID = df1$genus[match(df2$ID, df1$species)])
creates a third data frame with the results instead.
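Since the question mentions left_join, here is a minimal dplyr sketch of the same lookup. It assumes dplyr is installed and that each species appears only once in df1 (duplicate species would multiply rows); the name df3 is only illustrative.

```r
library(dplyr)

df1 <- data.frame(
  genus   = c("Orthobunyavirus", "Alphatorquevirus", "Yatapoxvirus"),
  species = c("Variola virus", "Torque teno virus 6", "Yaba-like disease virus")
)
df2 <- data.frame(
  ID = c("Variola virus", "Torque teno virus 6", "Yaba-like disease virus")
)

# Match df2$ID against df1$species, then keep only genus under the name ID
df3 <- df2 %>%
  left_join(df1, by = c("ID" = "species")) %>%
  transmute(ID = genus)

df3
```

As with match(), an ID that has no matching species ends up as NA rather than being dropped.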
I have a tibble and want to select only those columns that contain at least one value that matches a regular expression. It took me a while to figure out how to do this, so I'm sharing my solution here.
My use case: I want to select only those columns that include media filenames, from a tibble like the one below. Importantly, I don't know ahead of time what columns the tibble consists of, and whether or not there are any columns that include media filenames.
condition  picture   sound      video      description
A          cat.png   meow.mp3   cat.mp4    A cat
A          dog.png   woof.mp3   dog.mp4    A dog
B          NA        NA         NA         NA
B          bird.png  tjirp.mp3  tjirp.mp4  A bird
R code to reproduce tibble:
dat = structure(list(condition = c("A", "A", "B", "B"), picture = c("cat.png",
"dog.png", NA, "bird.png"), sound = c("meow.mp3", "woof.mp3",
NA, "tjirp.mp3"), video = c("cat.mp4", "dog.mp4", NA, "tjirp.mp4"
), description = c("A cat", "A dog", NA, "A bird")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Solution:
> dat %>% select_if(~any(grepl("\\.png|\\.mp3|\\.mp4", .)))
# A tibble: 4 x 3
picture sound video
<chr> <chr> <chr>
1 cat.png meow.mp3 cat.mp4
2 dog.png woof.mp3 dog.mp4
3 NA NA NA
4 bird.png tjirp.mp3 tjirp.mp4
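select_if() has been superseded since dplyr 1.0.0, so on a recent dplyr the same idea can be written with select(where(...)). The sketch below is one way to do that, not the only one; the anchored pattern (extension at the end of the string) and the name media_cols are my own choices, not part of the original solution.

```r
library(dplyr)

dat <- tibble(
  condition   = c("A", "A", "B", "B"),
  picture     = c("cat.png", "dog.png", NA, "bird.png"),
  sound       = c("meow.mp3", "woof.mp3", NA, "tjirp.mp3"),
  video       = c("cat.mp4", "dog.mp4", NA, "tjirp.mp4"),
  description = c("A cat", "A dog", NA, "A bird")
)

# Keep a column if at least one of its values ends in a media extension.
# grepl() returns FALSE for NA, so missing values never count as a match.
media_cols <- dat %>%
  select(where(~ any(grepl("\\.(png|mp3|mp4)$", .x))))

names(media_cols)
```

If no column matches, the result is a tibble with zero columns, which is usually what you want when the columns are not known ahead of time.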
I'm having some trouble when I try to merge two data frames. Here is an example:
Number <- c("1", "2", "3")
Letter <- factor(c("a", "b", "c"))
map <- data.frame(Number, Letter, row.names = c("Belgium", "Italy", "Senegal"))
This is my first data frame called "map", it looks like this:
Number Letter
Belgium 1 a
Italy 2 b
Senegal 3 c
And if I try to select by row and column I don't have any problem:
map["Belgium", "Number"]
[1] "1"
Here I have my second data frame called "calendar":
Month <- c("January", "February", "March")
calendar <- data.frame(Month, row.names = c("Belgium", "Italy", "Senegal"))
It looks like this:
Month
Belgium January
Italy February
Senegal March
The problem comes when I try to merge both data frames:
map.amp = merge(map, calendar, by = 0)
Row.names Number Letter Month
1 Belgium 1 a January
2 Italy 2 b February
3 Senegal 3 c March
Now, when I try to select a cell using rows and columns, the outcome is always NA
map.amp["Italy", "Month"]
[1] NA
map.amp["Belgium", "Number"]
[1] NA
How can I merge both data frames so I can keep using that kind of select function?
You have to re-set the row names:
row.names(map.amp) <- map.amp$Row.names
If you want to keep using those row names you have to set the Row.names column back to row names. tibble::column_to_rownames is a nice option for this:
map.amp <- merge(map, calendar, by = 0) %>% tibble::column_to_rownames(var = "Row.names")
map.amp[map.amp$Row.names == 'Italy', 'Month']
This works on the merged data frame as is, because merge() stores the old row names in an ordinary column called Row.names.
You could use the answer in the comment by @thelatemail. Or use
subset(map.amp, Row.names == 'Italy')[['Month']] # first get the matching rows, then narrow to the named column
or
subset(map.amp, Row.names =='Italy', 'Month') # third argument is for column selection
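Putting the pieces together, here is a base R sketch that merges on row names and then restores them, so that indexing by country name keeps working. It assumes the row names are unique in both data frames; dropping the Row.names column afterwards is optional.

```r
map <- data.frame(
  Number = c("1", "2", "3"),
  Letter = factor(c("a", "b", "c")),
  row.names = c("Belgium", "Italy", "Senegal")
)
calendar <- data.frame(
  Month = c("January", "February", "March"),
  row.names = c("Belgium", "Italy", "Senegal")
)

# by = "row.names" is equivalent to by = 0: the keys land in a Row.names column
map.amp <- merge(map, calendar, by = "row.names")

# Move that column back into the row names, then drop it
rownames(map.amp) <- as.character(map.amp$Row.names)
map.amp$Row.names <- NULL

map.amp["Italy", "Month"]
map.amp["Belgium", "Number"]
```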
I need to remove the trailing letter "R" and the replicate number that follows it from every value in the column strain, and store the result in a new column, mutant. I would prefer dplyr so I can pipe the results forward.
For example
print(df)
strain measurement
1 CK522R1 75
2 CN344attBR1 50
3 GL065R13 32
4 GL078R100 27
Desired Output
strain measurement mutant
1 CK522R1 75 CK522
2 CN344attBR1 50 CN344attB
3 GL065R13 32 GL065
4 GL078R100 27 GL078
Reproducible Data
structure(list(strain = structure(1:4, .Label = c("CK522R1",
"CN344attBR1", "GL065R13", "GL078R100"), class = "factor"), measurement = c(75,
50, 32, 27)), class = "data.frame", row.names = c(NA, -4L))
From d.b's comment:
library(dplyr)
df %>%
  mutate(mutant = sub("R\\d+$", "", strain),
         replicate = regmatches(strain, regexpr("R\\d+$", strain)))
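If only the mutant column is needed, the replicate part can be left out. The sketch below rebuilds the example data with a character strain column (an assumption; the original is a factor, which sub() also accepts) and shows what the pattern does: "R\\d+$" matches an "R" followed by one or more digits at the very end of the string, so an "R" earlier in the name is left alone.

```r
library(dplyr)

df <- data.frame(
  strain      = c("CK522R1", "CN344attBR1", "GL065R13", "GL078R100"),
  measurement = c(75, 50, 32, 27),
  stringsAsFactors = FALSE
)

# Strip the trailing "R<digits>" replicate suffix into a new column
df <- df %>%
  mutate(mutant = sub("R\\d+$", "", strain))

df
```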
I am quite new to R - have worked on this all day but am out of ideas.
I have a dataframe with long descriptions in one column, eg:
df:
ID Name Description
1 A ABC DEF
2 B ARS XUY
3 C ASD
And I have a vector of search terms:
ABC
ARS
XUY
DE
I would like to go through each row in the dataframe and search the Description for any of the search terms. I then want all matches to be concatenated in a new column in the dataframe, e.g.:
ID Name Description Matches
1 A ABC DEF ABC
2 B ARS XUY ARS;XUY
3 C ASD
I would want to search ~100k rows with 1000 search terms.
Does anyone have any ideas? I was able to get a matrix with sapply and grepl, but I'd rather have a concatenated solution.
One option using strsplit and %in% instead of regex:
df$Matches <- sapply(strsplit(as.character(df$Description), '\\s'),
function(x){paste(search[search %in% x], collapse = ';')})
df
# ID Name Description Matches
# 1 1 A ABC DEF ABC
# 2 2 B ARS XUY ARS;XUY
# 3 3 C ASD
data:
search <- c("ABC", "ARS", "XUY", "DE")
df <- structure(list(ID = 1:3, Name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), Description = structure(1:3, .Label = c("ABC DEF",
"ARS XUY", "ASD"), class = "factor"), Matches = c("ABC", "ARS;XUY",
"")), .Names = c("ID", "Name", "Description", "Matches"), row.names = c(NA,
-3L), class = "data.frame")
Another option, which I tried to use in the comments, is to use the stringr package. There are two potential downsides to this approach: 1) it uses regex, and 2) it returns the search term matched instead of the value found.
library(stringr)
df = data.frame(Name=LETTERS[1:3],
Description=c("ABC DEF", "ARS XUY", "ASD"),
stringsAsFactors=FALSE)
search_terms = c("ABC", "ARS", "XUY", "DE")
regex = paste(search_terms, collapse="|")
df$Matches = sapply(str_extract_all(df$Description, regex), function(x) paste(x, collapse=";"))
df
# Name Description Matches
# (chr) (chr) (chr)
# 1 A ABC DEF ABC;DE
# 2 B ARS XUY ARS;XUY
# 3 C ASD
With that being said, I think Alistaire's solution is the better approach since it doesn't use regex.
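If you do want to stay with stringr, the partial match on "DE" can be avoided by wrapping the alternation in word boundaries. This is a sketch under the assumption that the search terms contain no regex metacharacters (otherwise they would need escaping first).

```r
library(stringr)

df <- data.frame(
  Name = LETTERS[1:3],
  Description = c("ABC DEF", "ARS XUY", "ASD"),
  stringsAsFactors = FALSE
)
search_terms <- c("ABC", "ARS", "XUY", "DE")

# \\b on both sides means a term only matches as a whole word,
# so "DE" no longer matches inside "DEF"
regex <- paste0("\\b(?:", paste(search_terms, collapse = "|"), ")\\b")

# One character vector of hits per row, collapsed to a single string
df$Matches <- vapply(
  str_extract_all(df$Description, regex),
  paste, character(1), collapse = ";"
)

df
```

Rows without any hit get an empty string, because paste() on an empty vector with collapse returns "".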
Here's an alternative:
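df <- data.frame(ID = c(1L, 2L, 3L), Name = c('A', 'B', 'C'),
                 Description = c('ABC DEF', 'ARS XUY', 'ASD'), stringsAsFactors = FALSE)
st <- c('ABC', 'ARS', 'XUY', 'DE')
# one grepl() call per whole-word pattern gives a rows-by-terms logical matrix;
# each row of that matrix then selects the terms that matched
df$Matches <- apply(sapply(paste0('\\b', st, '\\b'), grepl, df$Description), 1L,
                    function(m) paste(st[m], collapse = ';'))
df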
## ID Name Description Matches
## 1 1 A ABC DEF ABC
## 2 2 B ARS XUY ARS;XUY
## 3 3 C ASD
My sample data set looks like this:
Country Population Capital Area
A 210000210 Sydney/Landon 10000000
B 420000000 Landon 42100000
C 500000 Italy42/Rome1 9200000
D 520000100 Dubai/Vienna21A 720000
How can I delete every row whose Capital column contains the pattern /? I looked at the linked question "R: Delete rows based on different values following a certain pattern", but it did not help.
You can try grepl
df[!grepl('[/]', df$Capital),]
# Country Population Capital Area
#2 B 420000000 Landon 42100000
library(dplyr)
library(stringr)
df2 <- df %>%
  filter(!str_detect(Capital, "/")) # "/" is not a regex metacharacter, so no escaping is needed
# Country Population Capital Area
# 1 B 420000000 Landon 42100000
Data
df <- structure(list(Country = c("A", "B", "C", "D"), Population = c(210000210L,420000000L, 500000L, 520000100L),
Capital = c("Sydney/Landon", "Landon", "Italy42/Rome1", "Dubai/Vienna21A"),
Area = c(10000000L, 42100000L, 9200000L, 720000L)), class = "data.frame", row.names = c(NA,-4L))
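Because the pattern here is a literal character rather than a real regular expression, a fixed-string match is enough. The following base R sketch uses grepl(..., fixed = TRUE); the names keep and df_clean are only illustrative.

```r
df <- data.frame(
  Country    = c("A", "B", "C", "D"),
  Population = c(210000210L, 420000000L, 500000L, 520000100L),
  Capital    = c("Sydney/Landon", "Landon", "Italy42/Rome1", "Dubai/Vienna21A"),
  Area       = c(10000000L, 42100000L, 9200000L, 720000L),
  stringsAsFactors = FALSE
)

# TRUE for rows whose Capital has no literal "/" in it
keep <- !grepl("/", df$Capital, fixed = TRUE)

# Subset the rows, keeping all columns
df_clean <- df[keep, ]

df_clean
```

Note that a row with NA in Capital would be kept by this approach, since grepl() returns FALSE for NA.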