Filter dataset with %in% with pattern - r

I’m using filter to my dataset to select certain values from column:
%>%
filter(col1 %in% c(“value1”, “value2"))
How ever I don’t understand how to filter values in column with pattern without fully writing it. For example I also want all values which start with “value3” (“value33”, “value34”,....) along with “value1” and “value2”. Can I add grepl to that vector?

You can use regular expressions to do that:
df %>%
filter(str_detect('^value[1-3]'))

If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value
dd %>% filter(stringr::str_starts(col1, "value"))

Here are few options in base R :
Using grepl :
subset(df, grepl('^value', b))
# a b
#1 1 value3
#3 3 value123
#4 4 value12
Similar option with grep which returns index of match.
df[grep('^value', df$b),]
However, a faster option would be to use startsWith
subset(df, startsWith(b, "value"))
All of this would select rows where column b starts with "value".
data
df <- data.frame(a = 1:5, b = c('value3', 'abcd', 'value123', 'value12', 'def'),
stringsAsFactors = FALSE)

Related

Use an external list to remove data from rows

I have a data frame
df <- data.frame(
A = c(4, 2, 7),
B = c(3, 3, 5),
C = c("Expert,Foo", "Bar,Wild", "Zap")
)
and a second one which I would like to use as index to remove rows which contain the specific values
mylist <- data.frame(rtext = c("Foo","Bar"))
So I tried this:
subset(df, C %in% mylist$rtext)
How can I remove the specific rows?
As it is a partial match, we can use grep. We paste the elements of 'myList' column 'rtext' into a single string with delimiter | which implies OR, then get a logical index with grepl on the 'C' column of 'df', negate (!) to change TRUE to FALSE and FALSE to TRUE to subset the rows that are not in the 'rtext' of 'mylist'
subset(df, !grepl(paste(mylist$rtext, collapse="|"), C))
# A B C
#3 7 5 Zap
Using str_detect from stringr
df[!stringr::str_detect(df$C,paste(mylist$rtext,collapse = '|')),]
A B C
3 7 5 Zap
If you need the 100% match , which means Foooo will not be removed ,check with dplyr and tidyr re-format your df 1st , since str_detect and grepl are partial match , if you have word like Expert,Foott it will still show as match with Foo
library(tidyr)
library(dplyr)
df$id=seq.int(nrow(df))
df1=df %>%
transform(C = strsplit(C, ",")) %>%
unnest(C)
df[!df$id%in%df1$id[df1$C%in%mylist$rtext],]

Remove special characters from entire dataframe in R

Question:
How can you use R to remove all special characters from a dataframe, quickly and efficiently?
Progress:
This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the entire dataframe.
Problem:
My dataframe consists of 100+ columns of integers, string, etc. When I try to run the gsub on the dataframe, it doesn't return the output I desire. Instead, I get what's shown in image 3.
df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)
df - This is the original dataframe:
dfa - This is gsub applied to column b. Good!
df_all - This is gsub applied to the entire dataframe. Bad!
Summary:
Is there a way to gsub an entire dataframe? Else, should an apply function be used instead?
Here is a possible solution using dplyr:
# Example data
bla <- data.frame(a = c(1,2,3),
b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate_all from dplyr
bla %>%
mutate_all(funs(gsub("[[:punct:]]", "", .)))
a b c
1 1 fefa
2 2 fes
3 3 gDEEwfseges gdgd
Update:
mutate_all has been superseded, and funs is deprecated as of dplyr 0.8.0. Here is an updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1,2,3),
b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate_all from dplyr
df %>%
mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
Another solution is to convert the data frame to a matrix first then run the gsub and then convert back to a data frame as follows:
as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
I like Ryan's answer using dplyr. As mutate_all and funs are now deprecated, here is my suggested updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1,2,3),
b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use across() from dplyr
df %>%
mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
a b c
1 1 fefa
2 2 fes
3 3 gDEEwfseges gdgd

R dplyr: mutate a column for specific row range

I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think OP is trying to just replace certain values in a data frame using indexing. As the comment above noted the example dataframe's column is comprised of a factor variable, which makes replacing the value behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = TRUE and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the row index as i and assign (:=) 'elisa' to the 'names'. As the OP mentioned about large dataset, using the := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elisa']
df
# names
#1: george
#2: elisa
#3: elisa
#4: tony

adding values in certain columns of a data frame in R

I'm new to R. In a data frame, I wanted to create a new column #21 that is equal to the sum of column #1 to #20,row by row.
I knew I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this if using column names not numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
With leading zeros and 3 digits, try:
df$Col21 = rowSums(df[,sprintf("Col%03d", 1:20, sep="")])
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
# e.g.
summarise_each(df, funs(sum), starts_with("Col"))
# or
rowSums(select(df, contains("8")))

Remove Rows From Data Frame where a Row matches a String

I do I remove all rows in a dataframe where a certain row meets a string match criteria?
For example:
A,B,C
4,3,Foo
2,3,Bar
7,5,Zap
How would I return a dataframe that excludes all rows where C = Foo:
A,B,C
2,3,Bar
7,5,Zap
Just use the == with the negation symbol (!). If dtfm is the name of your data.frame:
dtfm[!dtfm$C == "Foo", ]
Or, to move the negation in the comparison:
dtfm[dtfm$C != "Foo", ]
Or, even shorter using subset():
subset(dtfm, C!="Foo")
You can use the dplyr package to easily remove those particular rows.
library(dplyr)
df <- filter(df, C != "Foo")
I had a column(A) in a data frame with 3 values in it (yes, no, unknown). I wanted to filter only those rows which had a value "yes" for which this is the code, hope this will help you guys as well --
df <- df [(!(df$A=="no") & !(df$A=="unknown")),]
if you wish to using dplyr, for to remove row "Foo":
df %>%
filter(!C=="Foo")
I know this has been answered but here is another option:
library (dplyr)
df %>% filter(!c=="foo)
OR
df[!df$c=="foo", ]
If your exclusion conditions are stored in another data frame you could use rows_delete:
library(dplyr)
removal_df <- data.frame(C = "Foo")
df %>%
rows_delete(removal_df, by = "C")
A B C
1 2 3 Bar
2 7 5 Zap
This is also handy if you have multiple exclusion conditions so you do not have to write out a long filter statement.
Note: rows_delete is only available if you have dplyr >= 1.0.0

Resources