I have a data.frame like this:
Client Product
1 VV_Brazil_Jul
2 VV_Brazil_Mar
5 VV_US_Jul
1 VV_JP_Apr
3 VV_CH_May
6 VV_Brazil_Aug
I would like to delete all rows with "Brazil".
You can do this using the grepl function and the ! to find the cases that are not matched:
# Create a dataframe where some cases have the product with Brazil as part of the value
df <- structure(list(Client = c(1L, 2L, 5L, 1L, 3L, 6L),
Product = c("VV_Brazil_Jul", "VV_Brazil_Mar", "VV_US_Jul", "VV_JP_Apr", "VV_CH_May", "VV_Brazil_Aug")),
row.names = c(NA, -6L), class = c("data.table", "data.frame"))
# Display the original dataframe in the Console
df
# Limit the dataframe to cases which do not have Brazil as part of the product
df <- df[!grepl("Brazil", df$Product, ignore.case = TRUE),]
# Display the revised dataframe in the Console
df
You can do the same thing with the tidyverse collection:
dplyr::slice(df, -stringr::str_which(df$Product, "Brazil"))
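One caveat with approaches that drop rows by negative position: if the pattern matches nothing, the vector of positions is empty, and negating an empty vector still selects zero rows. The base R sketch below illustrates this with `grep()`; the pattern "Canada" and the three-row data frame are made up for the example, and the same concern may apply to other negative-position approaches. Filtering with a logical vector avoids the problem.

```r
# Small made-up data frame for illustration
df <- data.frame(
  Client  = c(1L, 5L, 1L),
  Product = c("VV_Brazil_Jul", "VV_US_Jul", "VV_JP_Apr"),
  stringsAsFactors = FALSE
)

# Negative positions: grep() finds no "Canada", so it returns integer(0).
# -integer(0) is still integer(0), which selects zero rows.
dropped_all <- df[-grep("Canada", df$Product, fixed = TRUE), ]

# Logical filter: every row is kept when nothing matches.
kept_all <- df[!grepl("Canada", df$Product, fixed = TRUE), ]

nrow(dropped_all)
nrow(kept_all)
```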
I have the following data frame with a character variable that represents the lanes blocked on a highway. Can I replace this vector with a similar vector that holds the number of lanes instead of the letter codes?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))
One way would be to count the number of "|" characters in each string. We subtract 1 because each string contains one more "|" than it has lanes.
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would be to count the words in each string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))
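One edge case to keep in mind with the base R versions: when a string has no match at all, `gregexpr()` returns a single `-1` for it, so `lengths()` reports 1 rather than 0. Below is a minimal sketch of a variant that goes through `regmatches()` to avoid this. The empty string in the sample vector is an assumption added to show the difference; it does not occur in the question's data.

```r
# Two strings from the question plus a made-up empty one
lanes <- c("|RS|RS|ML|", "|RS|", "")

# Counting the match positions directly: the empty string yields -1,
# which still has length 1
naive <- lengths(gregexpr("\\w+", lanes))

# Extracting the matches first: no match gives character(0), length 0
safe <- lengths(regmatches(lanes, gregexpr("\\w+", lanes)))

naive
safe
```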
Similar to Ronak's solution you could also do:
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr::str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.
Not as succinct as @Ronak Shah's, but another method in base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))
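To see why the empty strings have to be filtered out before counting, it may help to look at what `strsplit()` returns for a single value. This is only an illustrative sketch on one string from the question. `fixed = TRUE` is used because `|` is a regex metacharacter (alternation).

```r
# Split one value on the literal "|"
parts <- strsplit("|RS|RS|ML|", "|", fixed = TRUE)[[1]]

# The leading "|" produces an empty first element;
# the trailing "|" does not add one at the end.
parts

# Count only the non-empty pieces
n_lanes <- sum(parts != "")
n_lanes
```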
I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that only works when the unwanted values are NA, which mine are not. I looked at the unite function, but according to the cheat sheet documentation it doesn't do what I'm looking for.
My next try was filter_at and the other filter functions from the dplyr package.
But that leaves only 150 data points, while I start with 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string is never greater than 0, which seems to be true, but it removes more than intended.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides an insight into the data. There are more columns, like description, etc., so the two rows with 190 at the start are not duplicates.
We can replace all the values that start with "#" with NA and then use coalesce to select the non-NA value between HK and HK1.
library(dplyr)
df %>%
mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
mutate(HK = coalesce(HK, HK1)) %>%
select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
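For comparison, the same selection can be sketched in base R without going through NA. This assumes, as in the sample data, that exactly one of the two columns holds a number in every row; a row with two hash codes would end up as NA. The name `result` is made up for the example.

```r
# Sample data from the question, stored as factors like the dput above
df <- data.frame(
  HK  = c("190", "190", "178", "#SP8390", "#SP2240"),
  HK1 = c("#SP0839", "#SP0340", "#SP2949", "177", "212"),
  stringsAsFactors = TRUE
)

# Work on character vectors, not factor codes
hk  <- as.character(df$HK)
hk1 <- as.character(df$HK1)

# Where HK holds a hash code, take HK1; otherwise keep HK
result <- data.frame(HK = as.numeric(ifelse(startsWith(hk, "#"), hk1, hk)))
result
```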
I have data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (by number of letters), find the shorter names that are similar to it, and assign them all to one group,
then move on to the next longest name and assign its matches to another group,
until no names are left.
At first I calculate the length of each label:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings have letters similar to this long string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations, but the order of the letters should be the same (from left to right); for example, it can be Afgh but cannot be fAhg.
So we have only two other strings that are similar to this one:
Afghanestan
Afghanestankabol
This is because they should match exactly, without even one letter that differs from the longest string, to be assigned to the same group.
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not completely belong to another name (it only partially shares a name with one or another). It would have to be part of a bigger name.
I tried to use this one, Find similar strings and reconcile them within one dataframe, which did not help me at all.
I found something else which may help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
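If the groups are meant to follow the prefix rule described in the question rather than a distance threshold, a direct prefix check is another option. This is a minimal sketch and not the approach used in the answer above. It assumes labels are compared case-insensitively after trimming whitespace, and that each group is identified by the shortest label that is a prefix of its members. The helper name `root_of` is made up for the example.

```r
# Labels from the question, in the row order of df
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# Normalise: drop surrounding whitespace, ignore case
clean <- tolower(trimws(labels))

# Hypothetical helper: the shortest candidate that is a prefix of x.
# Every label is a prefix of itself, so there is always a result.
root_of <- function(x, candidates) {
  prefixes <- candidates[startsWith(x, candidates)]
  prefixes[which.min(nchar(prefixes))]
}

roots <- vapply(clean, root_of, character(1),
                candidates = clean, USE.NAMES = FALSE)

# Number the groups in order of first appearance
group <- match(roots, unique(roots))
data.frame(label = trimws(labels), group = group)
```

With this rule, indiaAfghanestan stays on its own because no shorter label is a prefix of it.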
I have a data frame with a column that contains some elements that are lists. I would like to find out which rows of the data frame contain a keyword in that column.
The data frame, df, looks a bit like this
idstr tag
1 wl
2 other.to
3 other.from
4 c("wl","other.to")
5 wl
6 other.wl
7 c("ll","other.to")
The goal is to assign all of the rows with 'wl' in their tag to a new data frame. In this example, I would want a new data frame that looks like:
idstr tag
1 wl
4 c("wl","other.to")
5 wl
I tried something like this
df_wl <- df[which(is.element('wl',df$tag)),]
but this only returns the first element of the data frame (whether or not it contains 'wl'). I think the trouble lies in iterating through the rows and implementing the "is.element" function. Here are two implementations of the function and its results:
is.element('wl',df$tag[[4]]) # TRUE
is.element('wl',df$tag[4]) # FALSE
How do you suggest I iterate through the data frame to assign df_wl its proper values?
PS: Here's the dput:
structure(list(idstr = 1:7, tag = structure(c(6L, 5L, 4L, 2L, 6L, 3L, 1L), .Label = c("c(\"ll\",\"other.to\")", "c(\"wl\",\"other.to\")", "other.wl", "other.from", "other.to", "wl"), class = "factor")), .Names = c("idstr", "tag"), row.names = c(NA, -7L), class = "data.frame")
Based on your dput data, this may work. The regular expression (^wl$)|(\"wl\") matches a value that is exactly wl, or any occurrence of "wl" (wrapped in double quotes).
df[grepl("(^wl$)|(\"wl\")", df$tag),]
# idstr tag
# 1 1 wl
# 4 4 c("wl","other.to")
# 5 5 wl
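The regular expression works because, in the dput, the tags are stored as text. If the column were a genuine list column, as the question describes, a per-row membership test would be an alternative. The sketch below rebuilds the data under that assumption; it is not the structure shown in the dput.

```r
# Assumed structure: tag is a real list column, one character vector per row
df <- data.frame(idstr = 1:7)
df$tag <- list("wl", "other.to", "other.from", c("wl", "other.to"),
               "wl", "other.wl", c("ll", "other.to"))

# Test each row's vector for an element that is exactly "wl"
has_wl <- vapply(df$tag, function(x) "wl" %in% x, logical(1))

df_wl <- df[has_wl, ]
df_wl$idstr
```

Because `%in%` compares whole elements, "other.wl" in row 6 is not picked up.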
I have two data frames sharing the same row IDs but with different columns.
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure with chrom and coord combined on the x axis (so 3 points) and, for each x value, 2 boxplots side by side corresponding to the two data frames.
What is the best way of doing this? Should I merge the two data frames somehow in order to get only one, and loop over the boxplot rendering by 3 columns?
Any idea on how this can be done?
The problem is that the two data frames have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the data frames in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried to do
I merged the chrom and coord columns together to create a single ID
I used reshape to melt the data frames
I merged the 2 melted data frames into a single one
the head looks like this
I have two variables, A2 and A4, corresponding to the 2 data frames
Then I created a boxplot using this:
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
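Then I would combine and reshape the data to make it easier to plot. Here's what I'd do:
library(reshape2) # assumed source of melt()
library(ggplot2) # needed for the plot further down
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# stack both melted tables, tagging each row with its source
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
# x-axis factor with levels sorted by chrom, then coord
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))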
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chrom/coord pairs sorted in the correct order.
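The reason for setting the levels explicitly is that `factor()` sorts its levels as text by default, which does not always agree with numeric chromosome order. A small base R sketch of the difference follows; the extra position on chromosome 2 is hypothetical and was added only to make the effect visible.

```r
# Positions from the example plus a made-up one on chromosome 2
pos <- data.frame(
  chrom = c(11L, 10L, 10L, 2L),
  coord = c(104972190L, 38894616L, 3178881L, 500L)
)
key <- paste(pos$chrom, pos$coord, sep = ":")

# Default levels are sorted as text, so "2:500" comes after the "1x:" keys
default_levels <- levels(factor(key))

# Sorting by the numeric columns first gives the genomic order
sorted <- pos[order(pos$chrom, pos$coord), ]
genomic_levels <- unique(paste(sorted$chrom, sorted$coord, sep = ":"))

default_levels
genomic_levels
```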
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like