How do I delete a row with a pattern in R? - r

I have a dataframe where I want to delete all rows with specific pattern. I am confused with compiling a regular expression.
Data:
structure(list(id = 1:5, email = c("1#gmail.com", "2#gmail.com",
"3#gmail.com", "4#pattern.com", "5#pattern.com")), class = "data.frame", row.names = c(NA,
-5L))
What I am trying to do is:
data <- data %>%
filter(email != ".+#pattern.com")
But something is wrong with my regex. What is the most effective way to compose a regular expression for such patterns? What is the proper regex pattern for my sample case?

This uses grepl to perform a regex comparison
library(dplyr)
data %>%
filter(!grepl("#pattern.com$", email))
id email
1 1 1#gmail.com
2 2 2#gmail.com
3 3 3#gmail.com
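A minimal base R sketch, assuming the sample data from the question, of why the original `!=` filter kept every row: `!=` compares literal strings, while `grepl()` interprets its first argument as a regular expression. The escaped dot (`\\.`) is an optional tightening and is not part of the answer above.

```r
data <- data.frame(
  id = 1:5,
  email = c("1#gmail.com", "2#gmail.com", "3#gmail.com",
            "4#pattern.com", "5#pattern.com"),
  stringsAsFactors = FALSE
)

# `!=` is a literal comparison: no email equals the text ".+#pattern.com",
# so every row passes the filter and nothing is removed.
literal <- data[data$email != ".+#pattern.com", ]

# grepl() performs regex matching; "\\." matches a literal dot and "$"
# anchors the match to the end of the string.
kept <- data[!grepl("#pattern\\.com$", data$email), ]
kept$id
```

An unescaped `.` matches any character, so the pattern in the answer also works on this data; escaping only matters if a value such as "x#patternXcom" could occur.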

In base R you can remove the rows in which the pattern #pattern.com is detected by the function grepl in the email column:
data[-which(grepl("#pattern.com", data$email)),]
id email
1 1 1#gmail.com
2 2 2#gmail.com
3 3 3#gmail.com
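One caveat with negative indexing through `which()`, shown as a small sketch on hypothetical data in which no email matches: `which()` then returns an empty integer vector, and indexing with `-integer(0)` selects zero rows instead of all of them. Negating the logical vector avoids this.

```r
# Hypothetical data with no "#pattern.com" addresses at all
data <- data.frame(
  id = 1:3,
  email = c("1#gmail.com", "2#gmail.com", "3#gmail.com"),
  stringsAsFactors = FALSE
)

hits <- grepl("#pattern.com", data$email)   # all FALSE here

# which(hits) is integer(0); -integer(0) is still integer(0), so no rows are kept
n_which <- nrow(data[-which(hits), ])

# Logical negation keeps every non-matching row, including when nothing matches
n_logical <- nrow(data[!hits, ])
```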
Data:
data <- structure(list(id = 1:5, email = c("1#gmail.com", "2#gmail.com",
"3#gmail.com", "4#pattern.com", "5#pattern.com")), class = "data.frame", row.names = c(NA,
-5L))

Related

Convert dates into numbers

I need your support while working with dates.
While importing an .xls file, the column of dates was correctly converted into numbers by R. Unfortunately, some dates are still there in the format dd/mm/yyyy, d/mm/yyyy, or dd/mm/yy. This probably results from different settings on different operating systems, but I don't know. Is there a way to manage this?
Thank you in advance
my_data <- read_excel("my_file.xls")
born_date
18520
30859
16/04/1972
26612
30291
24435
11/02/1964
26/09/1971
18427
23688
Original_dates
14/9/1950
26/6/1984
16/04/1972
9/11/1972
6/12/1982
24/11/1966
11/02/1964
26/09/1971
13/6/1950
Here is one way how we could solve it:
First we define the numeric values only by excluding those containing the string /.
Then we use excel_numeric_to_date function from janitor package.
Finally with coalesce we combine both:
library(dplyr)
library(stringr) # provides str_detect()
library(janitor)
library(lubridate)
df %>%
mutate(x = ifelse(str_detect(born_date, '\\/'), NA_real_, born_date),
x = excel_numeric_to_date(as.numeric(as.character(x)), date_system = "modern"),
born_date = dmy(born_date)) %>%
mutate(born_date = coalesce(born_date, x), .keep="unused")
born_date
1 1950-09-14
2 1984-06-26
3 1972-04-16
4 1972-11-09
5 1982-12-06
6 1966-11-24
7 1964-02-11
8 1971-09-26
9 1950-06-13
10 1964-11-07
data:
df <- structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
1) This translates the two types of dates. Each returns an NA for those elements not of that type. Then we use coalesce to combine them. This only needs dplyr and no warnings are produced.
library(dplyr)
my_data %>%
mutate(born_date = coalesce(
as.Date(born_date, "%d/%m/%Y"),
as.Date(as.numeric(ifelse(grepl("/",born_date), NA, born_date)), "1899-12-30"))
)
## born_date
## 1 1950-09-14
## 2 1984-06-26
## 3 1972-04-16
## 4 1972-11-09
## 5 1982-12-06
## 6 1966-11-24
## 7 1964-02-11
## 8 1971-09-26
## 9 1950-06-13
## 10 1964-11-07
2) Here is a base R version.
my_data |>
transform(born_date = pmin(na.rm = TRUE,
as.Date(born_date, "%d/%m/%Y"),
as.Date(as.numeric(ifelse(grepl("/",born_date), NA, born_date)), "1899-12-30"))
)
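A short sketch of why `pmin(na.rm = TRUE, ...)` can stand in for `coalesce` here: when exactly one of the two parsed vectors is non-missing at each position, the parallel minimum with `na.rm = TRUE` simply returns that non-missing value. The vectors below are made-up illustrations, not the question's data.

```r
# Each position has a value in exactly one of the two vectors
a <- c(1, NA, 3)
b <- c(NA, 2, NA)
combined_num <- pmin(a, b, na.rm = TRUE)

# The same idea with Date vectors, as in the answer
d1 <- as.Date(c("1972-04-16", NA))
d2 <- as.Date(c(NA, "1950-09-14"))
combined_date <- pmin(d1, d2, na.rm = TRUE)
format(combined_date)
```

If both vectors held a value at the same position, `pmin` would return the smaller one rather than the first, so this shortcut relies on the two parses being mutually exclusive.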
Note
The input in reproducible form.
my_data <-
structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
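The question also mentions dates written as dd/mm/yy, which the `%d/%m/%Y` format in the answers above does not target. The following base R sketch, with a hypothetical helper name `parse_mixed_dates`, picks the conversion per element based on the shape of the string. It assumes the numbers are Excel serial dates with origin 1899-12-30, as in the answers.

```r
# Hypothetical helper: convert a character vector mixing Excel serial
# numbers, dd/mm/yyyy and dd/mm/yy strings into Date values.
parse_mixed_dates <- function(x) {
  out <- rep(as.Date(NA), length(x))

  is_serial <- grepl("^[0-9]+$", x)
  is_long   <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)
  is_short  <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", x)

  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out[is_long]   <- as.Date(x[is_long], format = "%d/%m/%Y")
  # Caution: with %y, R reads 00 to 68 as 20xx and 69 to 99 as 19xx
  out[is_short]  <- as.Date(x[is_short], format = "%d/%m/%y")
  out
}

res <- parse_mixed_dates(c("18520", "16/04/1972", "9/11/1972", "16/04/72"))
format(res)
```

For birth dates before 1969 written with two-digit years, the century would need to be corrected separately, since `%y` alone maps them to 20xx.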

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
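A small self-contained sketch of this approach on made-up data (the column values are shortened stand-ins for the question's data): `sapply` builds a logical matrix with one column per variable, and `is.na<-` sets the flagged cells to NA. The final numeric conversion is an optional extra step, not part of the answer.

```r
df <- data.frame(
  Sample = c("Standard1", "Standard2", "Control1"),
  EGF    = c("6.717", "Invalid(N/A)", "167.83"),
  FGF2   = c("Invalid(2002.97)", "44.328", "215.58"),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell starts with "Invalid"
hits <- sapply(df, function(x) grepl("^Invalid", x))

# Overwrite the flagged cells with NA
is.na(df) <- hits

# Optional: once the text markers are gone, a column can be made numeric
df$EGF <- as.numeric(df$EGF)
```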
The str_replace function is meant for editing the matched part of a character string rather than for replacing the whole value. In your attempt the pattern "invalid" is also lowercase, while the data contain "Invalid", and matching is case-sensitive, so nothing is matched. Finally, the across function is targeting all of the columns, including the numeric one. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA

to find count of distinct values across two columns in r

I have two columns, both of character data type.
One column has plain strings and the other has strings wrapped in quotes.
I want to compare both columns and find the number of distinct names across the data frame.
string f.string.name
john NA
bravo NA
NA "john"
NA "hulk"
Here the count should be 2, as john is common.
Somehow I am not able to remove the quotes from the second column. Not sure why.
Thanks
The main problem I'm seeing is the NA values.
First, let's get rid of the quotes you mention.
dat$f.string.name <- gsub('["]', '', dat$f.string.name)
Now, count the number of distinct values.
i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])
DATA
dat <-
structure(list(string = c("john", "bravo", NA, NA), f.string.name = c(NA,
NA, "\"john\"", "\"hulk\"")), .Names = c("string", "f.string.name"
), class = "data.frame", row.names = c(NA, -4L))
library(stringr)
# `dat` is the data frame defined under DATA
table(str_replace_all(unlist(dat), '["]', ''))
# bravo hulk john
# 1 1 2
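The question can be read two ways: the number of distinct names overall, or the names that appear in both columns. A base R sketch covering both readings, assuming the `dat` structure from the question:

```r
dat <- data.frame(
  string = c("john", "bravo", NA, NA),
  f.string.name = c(NA, NA, "\"john\"", "\"hulk\""),
  stringsAsFactors = FALSE
)

# Drop NAs from the first column
a <- dat$string[!is.na(dat$string)]

# Strip the literal double quotes from the second column, then drop NAs
b <- gsub('"', "", dat$f.string.name, fixed = TRUE)
b <- b[!is.na(b)]

# Reading 1: distinct names across both columns (john, bravo, hulk)
n_distinct_names <- length(unique(c(a, b)))

# Reading 2: names that occur in both columns (john)
common_names <- intersect(a, b)
```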

match text vectors from two data frames and return sum of third vector

I have two data frames.
First one called : sentence
structure(list(Text = c("This is a pen", "this is a sword", "pen is mightier than a sword"
)), .Names = "Text", row.names = c(NA, -3L), class = "data.frame")
which looks like:
Text
1 This is a pen
2 this is a sword
3 pen is mightier than a sword
Second one called : words
structure(list(wordvec = c("pen", "sword"), value = c(1, 2)), .Names = c("wordvec",
"value"), row.names = c(NA, -2L), class = "data.frame")
which looks like:
wordvec value
1 pen 1
2 sword 2
I have to search for the words from wordvec in each sentence and, if they are present, return the sum of their values.
Desired output is as follows:
Text Value
1 This is a pen 1
2 this is a sword 2
3 pen is mightier than a sword 3
I first tried extracting the words present in sentence$Text matching with words$wordvec and made a vector. This I successfully did.
library(stringi)
sentence$words <- sapply(stri_extract_all(sentence[[1]],regex='(#?)\\w+'),function(x) paste(x[x %in% words[[1]]],collapse=','))
As a next step I tried to get the sum of the values of the words present and to create a vector sentence$value. I tried the following code:
sentence$value <- sum(words$value)[match(sentence$words, words$wordvec)]
We paste the 'wordvec' as a single string, then extract the words from the 'Text' column that matches the pattern in a list, match with the 'wordvec' vector to get the position, based on that we get the corresponding 'value' from the 'words' and then we do the sum.
library(stringr)
sapply(str_extract_all(sentence$Text,
paste0('\\b(',paste(words$wordvec, collapse='|'), ')\\b')),
function(x) sum(words$value[match(x, words$wordvec)]))
#[1] 1 2 3
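For readers without stringr, the same steps can be sketched in base R with `gregexpr` and `regmatches`. This is an equivalent under the assumption that the words contain no regex metacharacters; it uses the data from the question.

```r
sentence <- data.frame(
  Text = c("This is a pen", "this is a sword", "pen is mightier than a sword"),
  stringsAsFactors = FALSE
)
words <- data.frame(wordvec = c("pen", "sword"), value = c(1, 2),
                    stringsAsFactors = FALSE)

# One alternation pattern with word boundaries: \b(pen|sword)\b
pattern <- paste0("\\b(", paste(words$wordvec, collapse = "|"), ")\\b")

# List with the matched words for each sentence
found <- regmatches(sentence$Text, gregexpr(pattern, sentence$Text, perl = TRUE))

# Look up each matched word's value and add them up per sentence
Value <- sapply(found, function(x) sum(words$value[match(x, words$wordvec)]))
result <- data.frame(Text = sentence$Text, Value = Value)
```

A sentence with no matching word yields an empty match vector, so its sum is 0.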
Another option is using strsplit after converting the 'sentence' data.frame to data.table (setDT(sentence,..)), match the vector of split words with 'wordvec', get the corresponding 'value' and do the sum.
library(data.table)
setDT(sentence, keep.rownames=TRUE)[,
sum(words$value[match(strsplit(Text, '\\s')[[1]],
words$wordvec, nomatch=0)]), by = rn]$V1
#[1] 1 2 3
Here is another simple solution using the for loop. However performance might be an issue. Your dataframe:
sentence<-structure(list(Text = c("This is a pen", "this is a sword", "pen is mightier than a sword"
)), .Names = "Text", row.names = c(NA, -3L), class = "data.frame")
words<-structure(list(wordvec = c("pen", "sword"), value = c(1, 2)), .Names = c("wordvec",
"value"), row.names = c(NA, -2L), class = "data.frame")
Create a data frame of zeros with one row per sentence; it will collect the weighted counts for each word in wordvec.
a<-data.frame(matrix(0, ncol=1, nrow=nrow(sentence)))
Now, using a for loop, go through every word in words and count how often it occurs in each sentence with str_count from stringr. With cbind you can store each word's count, multiplied by its value, as a new column of a for later use.
library(stringr)
for (i in seq_len(nrow(words)))
a<-cbind(a,data.frame(count=str_count(sentence$Text,words$wordvec[i]))*words$value[i])
Now just add the sum of the rows by using rowSums
data.frame(Text=sentence$Text,Value=rowSums(a))
and you will get:
Text Value
1 This is a pen 1
2 this is a sword 2
3 pen is mightier than a sword 3
Try it :)
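One thing to watch with plain pattern counting: a word is also counted when it occurs inside a longer word. The sketch below uses a hypothetical base R helper, `count_matches`, and a made-up sentence to show the difference that word boundaries make.

```r
# Hypothetical helper: number of regex matches per element of `text`
count_matches <- function(text, pattern) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

txt <- "a pencil and a pen"

# "pen" also matches inside "pencil"
n_plain <- count_matches(txt, "pen")

# \b restricts the match to the whole word
n_bounded <- count_matches(txt, "\\bpen\\b")
```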

What is the functional form of the assignment operator, [<-?

Is there a functional form of the assignment operator? I would like to be able to call assignment with lapply, and if that's a bad idea I'm curious anyways.
Edit:
This is a toy example, and obviously there are better ways to go about doing this:
Let's say I have a list of data.frames, dat, each corresponding to one run of an experiment. I would like to be able to add a new column, "subject", and give it a sham-name. The way I was thinking of it was something like
lapply(1:3, function(x) assign(data.frame = dat[[x]], column="subject", value=x))
The output could either be a list of modified data frames, or the modification could be purely a side effect.
dput of the starting list:
list(structure(list(V1 = c(-1.16664504687199, -0.429499924318301, 2.15470735901367, -0.287839633854442, -0.850578353982526, 0.211636723222015, -0.184714165752958, -0.773553182015158, 0.801811848828454, 1.39420292299319 ), V2 = c(-0.00828185523886259, -0.0215669898046275, 0.743065397283645, -0.0268464140141802, 0.168027242784788, -0.602901928341917, 0.0740511186398372, 0.180307494696194, 0.131160421341309, -0.924995634374182)), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = "data.frame"), structure(list( V1 = c(1.81912921386885, 1.17011641727415, 0.692247839769473, 0.0323050362633069, 1.35816977313292, -0.437475434344363, -0.270255715332778, 0.96140963297774, 0.914691132220417, -1.8014509598977), V2 = c(1.45082316226241, 2.05135744606495, -0.787250759618171, 0.288104852581324, -0.376868533959846, 0.531872044490353, -0.750375220117567, -0.459592764008714, 0.991667163481123, 1.31280356980115)), .Names = c("V1", "V2" ), row.names = c(NA, -10L), class = "data.frame"), structure(list( V1 = c(0.528912899341174, 0.464615157920766, -0.184211714281637, 0.526909095449027, -0.371529800682086, -0.483772861751781, -2.02134822661341, -1.30841566046747, -0.738493559993166, -0.221463545903242), V2 = c(-1.44732101816006, -0.161730785376045, 1.06294520132753, 1.22680614207705, -0.721565979363022, -0.438309438404104, -0.0243401435910825, 0.624227513999603, 0.276605218579759, -0.965640602482051)), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = "data.frame"))
Maybe I don't get it but as stated in "The Art of R programming":
Any assignment statement in which the left side is not just an
identifier (meaning a variable name) is considered a replacement
function.
and so in fact you can always translate this:
names(x) <- c("a","b","ab")
to this:
x <- "names<-"(x,value=c("a","b","ab"))
The general rule is just "function_name<-"(<object>, value = c(...))
Edit to the comment:
It works with the " too:
> x <- c(1:3)
> x
[1] 1 2 3
> names(x) <- c("a","b","ab")
> x
a b ab
1 2 3
> x
a b ab
1 2 3
> x <- c(1:3)
> x
[1] 1 2 3
> x <- "names<-"(x,value=c("a","b","ab"))
> x
a b ab
1 2 3
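Applied to the question's toy example, the same rule covers the indexing operators: `[<-` and `[[<-` are ordinary functions that return the modified object. A minimal sketch, using small made-up data frames in place of the question's list:

```r
# `[<-` called as a function returns the modified vector
x <- c(1, 2, 3)
y <- `[<-`(x, 2, value = 10)   # equivalent to: y <- x; y[2] <- 10

# Stand-in for the question's list of data frames
dat <- list(data.frame(V1 = 1:2), data.frame(V1 = 3:4), data.frame(V1 = 5:6))

# `[[<-` adds the "subject" column and returns the new data frame,
# so lapply collects a list of modified copies
dat2 <- lapply(seq_along(dat), function(i) {
  `[[<-`(dat[[i]], "subject", value = i)
})
names(dat2[[1]])
```

The result is a new list; to keep the change, assign it back with `dat <- dat2`.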
There is the assign function. I don't see any problems with using it but you have to be aware of what environment you want to assign to. See the help ?assign for syntax.
Read this chapter carefully to understand the ins and outs of environments in detail. http://adv-r.had.co.nz/Environments.html
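A brief sketch of `assign` with an explicit environment, so that the side effect lands somewhere predictable. The variable names (`subject_1` and so on) and the environment `env` are made up for illustration.

```r
# A dedicated environment to hold the created variables
env <- new.env()

# assign() creates a variable from a name given as a string
for (i in 1:3) {
  assign(paste0("subject_", i), i, envir = env)
}

created <- sort(ls(env))
second <- get("subject_2", envir = env)
```

Without the `envir` argument, `assign` writes into the environment it is called from, which inside `lapply` is the anonymous function's own environment and is discarded when the function returns.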
