How to extract specific words from a string in R?

How can I extract "7-9", "2-5" and "2-8", and then put them in a new column called event_time?
event_details
2.9(S) 7-9 street【Train】#2097
2.1(S) 2-5 street【Train】#2012
2.2(S) 2-8A TBC【Train】#202

You haven't really shared the logic for extracting the numbers, but based on the limited data you have shared, we can do:
df$new_col <- sub('.*(\\d+-\\d+).*', '\\1', df$event_details)
df
# event_details new_col
#1 2.9(S) 7-9 street【Train】 7-9
#2 2.1(S) 2-5 street【Train】 2-5
#3 2.2(S) 2-8A TBC【Train】 2-8
Or the same using str_extract from the stringr package:
df$new_col <- stringr::str_extract(df$event_details, "\\d+-\\d+")
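One caveat with the sub() version, sketched here on hypothetical input that is not in the question: because the leading .* is greedy, a multi-digit first number can lose its leading digits, whereas extracting the match directly does not. The vector x and the base-R regmatches()/regexpr() pair are used only for illustration, so that the sketch runs without stringr.

```r
# Hypothetical input with a two-digit range (not from the question's data)
x <- "1.0(S) 12-15 street"

# The greedy ".*" swallows as much as it can, leaving only "2-15" for the group
greedy <- sub(".*(\\d+-\\d+).*", "\\1", x)

# Extracting the first match directly keeps the whole number
direct <- regmatches(x, regexpr("\\d+-\\d+", x))

greedy  # "2-15"
direct  # "12-15"
```

Note that regmatches() drops elements that have no match, so for filling a data frame column str_extract (which returns NA for non-matches) is usually the more convenient choice.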
data
df <- structure(list(event_details = structure(c(3L, 1L, 2L),
.Label = c("2.1(S) 2-5 street【Train】",
"2.2(S) 2-8A TBC【Train】", "2.9(S) 7-9 street【Train】"), class =
"factor")), class = "data.frame", row.names = c(NA, -3L))

Related

R-How to get specific string value in data frame header [duplicate]

I have a data frame df. How can I return the part of a string that comes after the first occurrence of some character or number?
My data frame:
ID Name Name.FirstName Name.Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
My expected data frame:
ID Name FirstName Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
I need to remove everything up to and including the first dot (.), in the data frame header only.
dput
structure(list(ID = c(1, 2), Name = structure(c(2L, 1L), .Label = c("Jhon",
"rosy"), class = "factor"), Name.FirstName = structure(c(2L,
1L), .Label = c("peter", "ton"), class = "factor"), Name.Last.Name = structure(c(2L,
1L), .Label = c("N", "P"), class = "factor"), Age = c(23, 22)), .Names = c("ID",
"Name", "Name.FirstName", "Name.Last.Name", "Age"), row.names = c(NA,
-2L), class = "data.frame")
We can use sub to match zero or more characters that are not a dot ([^.]*) from the start (^) of the column name, followed by a literal dot, and replace the match with an empty string (""). Because . is a regex metacharacter that matches any character, we escape it (\\.) to get the literal dot.
names(df) <- sub("^[^.]*\\.", "", names(df))
names(df)
#[1] "ID" "Name" "FirstName" "Last.Name" "Age"
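To see why the escape matters, here is a small sketch on a plain character vector. The names nms, stripped and unescaped are introduced for illustration only.

```r
# Column names taken from the question
nms <- c("ID", "Name", "Name.FirstName", "Name.Last.Name", "Age")

# sub() replaces only the first match, and the pattern is anchored at the start,
# so only the prefix up to and including the first dot is removed
stripped <- sub("^[^.]*\\.", "", nms)
stripped
# "ID" "Name" "FirstName" "Last.Name" "Age"

# Without the escape, "." matches any character, so a name with no dot at all
# is matched in full and wiped out
unescaped <- sub("^[^.]*.", "", "ID")
unescaped
# ""
```

Names without a dot are left untouched by the escaped pattern because sub() returns its input unchanged when nothing matches.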

Compare dataframe column to another dataframe column

I have a dataframe column containing page paths (let's call it A):
pagePath
/text/other_text/123-string1-4571/text.html
/text/other_text/string2/15-some_other_txet.html
/text/other_text/25189-string3/45112-text.html
/text/other_text/text/string4/5418874-some_other_txet.html
/text/other_text/string5/text/some_other_txet-4157/text.html
/text/other_text/123-text-4571/text.html
/text/other_text/125-text-471/text.html
And I have another data frame column of strings, let's call it B (the two data frames are different and they don't have the same number of rows).
Here's an example of my column in dataframe B:
names
string1
string11
string4
string3
string2
string10
string5
string100
What I want to do is check whether my page paths (A) contain strings from my other data frame (B).
I had difficulties because my two data frames don't have the same length and the data are unordered.
EXPECTED OUTPUT
I want to have this output as a result:
pagePath names exist
/text/other_text/123-string1-4571/text.html string1 TRUE
/text/other_text/string2/15-some_other_txet.html string2 TRUE
/text/other_text/25189-string3/45112-text.html string3 TRUE
/text/other_text/text/string4/5418874-some_other_txet.html string4 TRUE
/text/other_text/string5/text/some_other_txet-4157/text.html string5 TRUE
/text/other_text/123-text-4571/text.html NA FALSE
/text/other_text/125-text-471/text.html NA FALSE
If my question needs more clarification, please mention this.
We can generate the exist column with grepl()
# Collapse B$names into one string with "|"
onestring <- paste(B$names, collapse = "|")
# Generate new column
A$exist <- grepl(onestring, A$pagePath)
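A self-contained sketch of the same idea, using a shortened copy of the question's data (the reduced A and B below are assumptions made so the example runs on its own):

```r
# Reduced versions of the question's two data frames
A <- data.frame(pagePath = c(
  "/text/other_text/123-string1-4571/text.html",
  "/text/other_text/string2/15-some_other_txet.html",
  "/text/other_text/123-text-4571/text.html"
), stringsAsFactors = FALSE)
B <- data.frame(names = c("string1", "string11", "string2"),
                stringsAsFactors = FALSE)

# "string1|string11|string2": a regex alternation that matches any of the names
onestring <- paste(B$names, collapse = "|")

# TRUE for every path that contains at least one of the names
A$exist <- grepl(onestring, A$pagePath)
A$exist
# TRUE TRUE FALSE
```

Because the names are pasted into a regular expression, any name containing regex metacharacters (such as . or +) would need escaping first; the sample names are plain alphanumerics, so that does not arise here.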
Not as nice, since it uses a for loop, but it also fills in the matching name:
names <- rep(NA, length(A$pagePath))
exist <- rep(FALSE, length(A$pagePath))
for (name in B$names) {
  # rows of A whose path contains the current name
  hits <- grep(name, A$pagePath)
  names[hits] <- name
  exist[hits] <- TRUE
}
We can use str_extract_all from the stringr package. Rows without a match come back as character(0) rather than NA, so we have to convert them afterwards:
library(stringr)
df$names <- as.character(str_extract_all(df$pagePath, "string[0-9]+"))
df$exist <- df$names %in% df1$names
df[df=="character(0)"] <- NA
df
# pagePath names exist
#1 /text/other_text/123-string1-4571/text.html string1 TRUE
#2 /text/other_text/string2/15-some_other_txet.html string2 TRUE
#3 /text/other_text/25189-string3/45112-text.html string3 TRUE
#4 /text/other_text/text/string4/5418874-some_other_txet.html string4 TRUE
#5 /text/other_text/string5/text/some_other_txet-4157/text.html string5 TRUE
#6 /text/other_text/123-text-4571/text.html <NA> FALSE
#7 /text/other_text/125-text-471/text.html <NA> FALSE
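For comparison, a base-R sketch of the same extract-then-look-up approach that yields NA directly for paths without a match. It assumes, as the answer above does, that the names of interest follow the pattern string[0-9]+; the vectors paths, lookup and found are illustrative names, not part of the original data.

```r
# A few paths and lookup names borrowed from the question
paths <- c(
  "/text/other_text/123-string1-4571/text.html",
  "/text/other_text/123-text-4571/text.html",
  "/text/other_text/string5/text/some_other_txet-4157/text.html"
)
lookup <- c("string1", "string11", "string5")

# regexpr() returns -1 where the pattern is not found
m <- regexpr("string[0-9]+", paths)

# Start with NA everywhere, then fill in only the rows that matched
found <- rep(NA_character_, length(paths))
found[m != -1] <- regmatches(paths, m)

# NA is not in the lookup vector, so unmatched rows become FALSE
exist <- found %in% lookup

found  # "string1" NA "string5"
exist  # TRUE FALSE TRUE
```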
DATA
dput(df)
structure(list(pagePath = structure(c(1L, 5L, 4L, 7L, 6L, 2L,
3L), .Label = c("/text/other_text/123-string1-4571/text.html",
"/text/other_text/123-text-4571/text.html", "/text/other_text/125-text-471/text.html",
"/text/other_text/25189-string3/45112-text.html", "/text/other_text/string2/15-some_other_txet.html",
"/text/other_text/string5/text/some_other_txet-4157/text.html",
"/text/other_text/text/string4/5418874-some_other_txet.html"), class = "factor")), .Names = "pagePath", class = "data.frame", row.names = c(NA,
-7L))
dput(df1)
structure(list(names = structure(c(1L, 4L, 7L, 6L, 5L, 2L, 8L,
3L), .Label = c("string1", "string10", "string100", "string11",
"string2", "string3", "string4", "string5"), class = "factor")), .Names = "names", class = "data.frame", row.names = c(NA,
-8L))
Here is one way using apply, assuming column 1 holds the page path and column 2 the name to look for:
df$exist <- apply( df,1,function(x){as.logical(grepl(x[2],x[1]))} )

Converting factors into numeric format with signs in R

Suppose I have a data frame (df) like this, where every element is a factor:
df
---
+100.5
+120.2
-30.0
+75.0
-600.3
How can I convert df into a numeric df using R? I will be very glad for any help. Thanks a lot.
Converting factors to numeric values can be tricky: calling as.numeric() directly on a factor returns its internal integer codes rather than the values shown, so the factor usually has to be converted to character first and then to numeric.
This should work:
df_n <- as.data.frame(as.numeric(as.character(df[,1])))
colnames(df_n) <- "df_n"
head(df_n)
# df_n
#1 100.5
#2 120.2
#3 -30.0
#4 75.0
#5 -600.3
class(df_n[,1])
#[1] "numeric"
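The difference between the two conversions can be seen in a short sketch. The factor f below is rebuilt from the values in the question; the names codes, values and values2 are illustrative.

```r
# The question's values, signs included, stored as a factor
f <- factor(c("+100.5", "+120.2", "-30.0", "+75.0", "-600.3"))

# Direct conversion yields the internal level codes, not the numbers
codes <- as.numeric(f)

# Going through character parses the labels; a leading "+" is accepted
values <- as.numeric(as.character(f))

# Equivalent alternative: convert each level once, then index by the factor
values2 <- as.numeric(levels(f))[f]

values
# 100.5  120.2  -30.0   75.0 -600.3
```

The codes depend on how the levels happen to be sorted, which is why they bear no relation to the numeric values.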
data
df <- structure(list(df = structure(c(4L, 5L, 2L, 3L, 1L),
.Label = c("-600.3", "-30", "75", "100.5", "120.2"),
class = "factor")), .Names = "df",
row.names = c(NA, -5L), class = "data.frame")
Hope this helps.

How to match column values in two dataframes and make rownames with the matching corresponding column values

I have a data frame called mydf. I want to match its current column against the key.genomloc column of another object called secondf, extract the corresponding key.wesmut.genom values, and use them as row names, as shown in the result.
This is what I have tried, but does not work as desired:
current <- secondf[,"key.genomloc"]
replacement <- secondf[,"key.wesmut.genom"]
v <- mydf[,"current"] %in% current
w <- current %in% mydf[,"current"]
rownames(mydf)<-mydf[,"current"]
rownames(mydf)[v] <- replacement[w]
Data:
mydf <-structure(list(current = structure(c(5L, 2L), .Label = c("chr1:115256529:T:C",
"chr1:115256530:G:T", "chr1:115258744:C:A", "chr1:115258744:C:T",
"chr1:115258747:C:T", "chr11:32417945:T:C", "chr12:25398284:C:A",
"chr12:25398284:C:T", "chr13:28592640:A:C", "chr13:28592641:T:A",
"chr13:28592642:C:A", "chr13:28592642:C:G", "chr15:90631838:C:T",
"chr15:90631934:C:T", "chr2:209113112:C:T", "chr2:209113113:G:A",
"chr2:209113113:G:C", "chr2:209113113:G:T", "chr2:25457242:C:T",
"chr2:25457243:G:A", "chr2:25457243:G:T", "chr4:55599320:G:T"
), class = "factor"), `index` = c(1451738, 1451718)), .Names = c("current",
"index"), row.names = 1:2, class = "data.frame")
secondf<-structure(c("WES:FLT3:p.D835H", "WES:FLT3:p.D835N", "WES:FLT3:p.D835Y",
"WES:FLT3:p.D835A", "WES:FLT3:p.D835V", "chr1:115256530:G:T",
"chr13:28592642:C:T", "chr13:28592642:C:A", "chr1:115258747:C:T",
"chr13:28592641:T:A"), .Dim = c(5L, 2L), .Dimnames = list(NULL,
c("key.wesmut.genom", "key.genomloc")))
Result
rowname current index
WES:FLT3:p.D835A chr1:115258747:C:T 1451738
WES:FLT3:p.D835H chr1:115256530:G:T 1451718
We can use match
mydf$rowname <- secondf[,1][match(mydf$current,secondf[,2])]
mydf[c(3,1:2)]
# rowname current index
#1 WES:FLT3:p.D835A chr1:115258747:C:T 1451738
#2 WES:FLT3:p.D835H chr1:115256530:G:T 1451718
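A step-by-step sketch of what match() does here, with the lookup columns written out as plain vectors copied from the question's data. Setting actual row names, as the question asks, is an assumption about the desired end result; the answer above stores the labels in a regular column instead.

```r
# Lookup table from secondf, split into its two columns
key.wesmut.genom <- c("WES:FLT3:p.D835H", "WES:FLT3:p.D835N", "WES:FLT3:p.D835Y",
                      "WES:FLT3:p.D835A", "WES:FLT3:p.D835V")
key.genomloc <- c("chr1:115256530:G:T", "chr13:28592642:C:T", "chr13:28592642:C:A",
                  "chr1:115258747:C:T", "chr13:28592641:T:A")

mydf <- data.frame(current = c("chr1:115258747:C:T", "chr1:115256530:G:T"),
                   index = c(1451738, 1451718),
                   stringsAsFactors = FALSE)

# For each value of current, the position of its first match in key.genomloc
idx <- match(mydf$current, key.genomloc)   # 4 1

# Use those positions to pull the corresponding labels
labels <- key.wesmut.genom[idx]

# Row names must be unique and non-missing, so this step only works
# when every value of current has a distinct match
rownames(mydf) <- labels
mydf
```

If some values of current had no match, idx would contain NA and the row-name assignment would fail, in which case keeping the labels in a column, as in the answer, is the safer option.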

Remove thousand's separator [duplicate]

I imported an Excel file and got a data frame like this
structure(list(A = structure(1:3, .Label = c("1.100", "2.300",
"5.400"), class = "factor"), B = structure(c(3L, 2L, 1L), .Label = c("1.000.000",
"500", "7.800"), class = "factor"), C = structure(1:3, .Label = c("200",
"3.100", "4.500"), class = "factor")), .Names = c("A", "B", "C"
), row.names = c(NA, -3L), class = "data.frame")
I would now like to convert these chars to numeric or even integer. However, the dot character (.) is not a decimal sign but a "thousand's separator" (it's German).
How would I convert the data frame properly?
I tried this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"))
df3 <- as.data.frame(data.matrix(df2))
However, apply seems to convert each column to a list of factors. Can I maybe prevent apply from doing so?
You can use this (here df is the data frame from the question):
sapply(df, function(v) {as.numeric(gsub("\\.","", as.character(v)))})
Which gives :
A B C
[1,] 1100 7800 200
[2,] 2300 500 3100
[3,] 5400 1000000 4500
This will give you a matrix object, but you can wrap it into data.frame() if you wish.
Note that the columns in your original data are not characters but factors.
Edit: Alternatively, instead of wrapping it with data.frame(), you can do this to get the result directly as a data.frame:
# the as.character(.) is just in case it's loaded as a factor
df[] <- lapply(df, function(x) as.numeric(gsub("\\.", "", as.character(x))))
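A self-contained sketch of the in-place version, with the question's data rebuilt as factor columns. Using fixed = TRUE instead of the escaped pattern "\\." is a small variation introduced here, not part of the answer; it treats the dot as a literal character.

```r
# The question's data, with columns stored as factors
df1 <- data.frame(
  A = c("1.100", "2.300", "5.400"),
  B = c("7.800", "500", "1.000.000"),
  C = c("200", "3.100", "4.500"),
  stringsAsFactors = TRUE
)

# Strip every dot from each column, then convert to numeric;
# assigning into df1[] keeps the data frame structure and column names
df1[] <- lapply(df1, function(x) {
  as.numeric(gsub(".", "", as.character(x), fixed = TRUE))
})

str(df1)
```

This assumes that every dot in the data is a thousands separator; a column that also contained decimal commas would need an extra replacement step before the conversion.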
I think I just found another solution:
It's necessary to use stringsAsFactors = FALSE.
Like this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"), stringsAsFactors = FALSE)
df3 <- as.data.frame(data.matrix(df2))
