R: How to get a specific string value in a data frame header [duplicate]

I have a data frame df. How can I return the part of each column name that comes after the first dot?
My data frame:
ID Name Name.FirstName Name.Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
My expected data frame:
ID Name FirstName Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
I need to remove everything up to and including the first dot (.), in the data frame header only.
dput output:
structure(list(ID = c(1, 2), Name = structure(c(2L, 1L), .Label = c("Jhon",
"rosy"), class = "factor"), Name.FirstName = structure(c(2L,
1L), .Label = c("peter", "ton"), class = "factor"), Name.Last.Name = structure(c(2L,
1L), .Label = c("N", "P"), class = "factor"), Age = c(23, 22)), .Names = c("ID",
"Name", "Name.FirstName", "Name.Last.Name", "Age"), row.names = c(NA,
-2L), class = "data.frame")

We can use sub to match zero or more characters that are not a . ([^.]*) from the start (^) of the column name, followed by a literal ., and replace the match with an empty string (""). Because . is a metacharacter that matches any character, we escape it (\\.) to get its literal meaning:
names(df) <- sub("^[^.]*\\.", "", names(df))
names(df)
#[1] "ID" "Name" "FirstName" "Last.Name" "Age"
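To illustrate why the escape matters, here is a minimal sketch that runs the same replacement with and without the backslashes. The vector cols is just the header from the question copied by hand, and the variable names stripped and wrong are made up for this example.

```r
# Column names copied from the question's data frame
cols <- c("ID", "Name", "Name.FirstName", "Name.Last.Name", "Age")

# Escaped dot: only names that actually contain a "." are changed
stripped <- sub("^[^.]*\\.", "", cols)
print(stripped)

# Unescaped dot: "." matches any character, so names without a dot
# are matched in full and replaced by an empty string
wrong <- sub("^[^.]*.", "", cols)
print(wrong)
```

Because sub only replaces the first match, Name.Last.Name keeps its second dot and becomes Last.Name.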

Related

How to extract specific words from a string in R?

How can I extract "7-9", "2-5" and "2-8", then put them in a new column named event_time?
event_details
2.9(S) 7-9 street【Train】#2097
2.1(S) 2-5 street【Train】#2012
2.2(S) 2-8A TBC【Train】#202
You haven't shared the logic for extracting the numbers, but based on the limited data you have shared we can do:
df$new_col <- sub('.*(\\d+-\\d+).*', '\\1', df$event_details)
df
# event_details new_col
#1 2.9(S) 7-9 street【Train】 7-9
#2 2.1(S) 2-5 street【Train】 2-5
#3 2.2(S) 2-8A TBC【Train】 2-8
Or the same using str_extract from stringr:
df$new_col <- stringr::str_extract(df$event_details, "\\d+-\\d+")
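One caveat with the sub() pattern above: the leading .* is greedy, so if the number before the hyphen has more than one digit, only its last digit is captured. The sketch below shows this on a made-up second string ("12-15" is not in the original data) and uses base R's regexpr with regmatches as an alternative that returns the first full match, similar to str_extract.

```r
# First string is from the question; the second is a hypothetical multi-digit case
x <- c("2.9(S) 7-9 street", "2.1(S) 12-15 street")

# Greedy .* swallows the leading "1" of "12-15"
greedy <- sub(".*(\\d+-\\d+).*", "\\1", x)
print(greedy)

# regexpr finds the first match of the pattern; regmatches extracts it
first_match <- regmatches(x, regexpr("\\d+-\\d+", x))
print(first_match)
```

Note that regmatches drops elements with no match, so this form is only a drop-in replacement when every string contains the pattern.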
Data:
df <- structure(list(event_details = structure(c(3L, 1L, 2L),
.Label = c("2.1(S) 2-5 street【Train】",
"2.2(S) 2-8A TBC【Train】", "2.9(S) 7-9 street【Train】"), class =
"factor")), class = "data.frame", row.names = c(NA, -3L))

Replace value in a column with 'blank' if it matches the value in another column [duplicate]

I have the following data set as a data frame in R:
article_number 1st_cutoff_date 2nd_cutoff_date
abc 12/01/2019 01/14/2020
def 02/10/2020 02/10/2020
Where 1st_cutoff_date == 2nd_cutoff_date, I want to replace 2nd_cutoff_date with a blank value " ". So for the second row, 'def', 2nd_cutoff_date would become " ".
The data frame consists of factors and contains NAs. I have converted it to character and tried the following:
AAR_FTW_Final_w_LL[AAR_FTW_Final_w_LL$`1st_Booking_Deadline` == AAR_FTW_Final_w_LL$`2nd_Booking_Deadline`, c("2nd_Booking_Deadline")] <- " "
and
ind<- AAR_FTW_Final_w_LL$`1st_Booking_Deadline` == AAR_FTW_Final_w_LL[`2nd_Booking_Deadlilne`]
AAR_FTW_Final_w_LL[ind, c("2nd_Booking_Deadline")] <- " "
Both return the error:
Error in AAR_FTW_Final_w_LL$`1st_Booking_Deadline` :
$ operator is invalid for atomic vectors
I have tried replacing the $ with [], but then I get an error saying that one of the columns is missing. Is there an easier way to do this?
Convert the factors to characters:
df[] <- lapply(df, as.character)
Then use replace:
transform(df, `2nd_cutoff_date` = replace(`2nd_cutoff_date`,
`1st_cutoff_date` == `2nd_cutoff_date`, ''))
# article_number X1st_cutoff_date X2nd_cutoff_date
#1 abc 12/01/2019 01/14/2020
#2 def 02/10/2020
transform prefixes the column names with X because names that start with a digit are not syntactically valid in R.
Another approach, after you convert the data to characters, would be:
df$`2nd_cutoff_date`[df$`1st_cutoff_date` == df$`2nd_cutoff_date`] <- ""
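Since the question mentions NAs, here is a small sketch of the same replacement on data that contains one. The data frame and its column names (first_cutoff, second_cutoff) are invented for this example and use syntactic names to avoid backticks. Wrapping the comparison in which() is one way to drop the NA results of == explicitly, so only rows where both dates are present and equal are blanked.

```r
# Hypothetical data with an NA in the first date column
df <- data.frame(
  article_number = c("abc", "def", "ghi"),
  first_cutoff   = c("12/01/2019", "02/10/2020", NA),
  second_cutoff  = c("01/14/2020", "02/10/2020", "03/01/2020"),
  stringsAsFactors = FALSE
)

# which() keeps only the positions where the comparison is TRUE,
# silently dropping the NA produced by comparing with a missing value
same <- which(df$first_cutoff == df$second_cutoff)
df$second_cutoff[same] <- ""
print(df)
```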
Data:
df <- structure(list(article_number = structure(1:2, .Label = c("abc",
"def"), class = "factor"), `1st_cutoff_date` = structure(2:1,
.Label = c("02/10/2020", "12/01/2019"), class = "factor"),
`2nd_cutoff_date` = structure(1:2, .Label = c("01/14/2020",
"02/10/2020"), class = "factor")), class = "data.frame", row.names = c(NA, -2L))

Finding which words from a list are present in a data frame column using grepl in R

I have a dataframe df:
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")
I also have a list of words:
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")
I checked whether the words in wordlist are present in the text column using lapply, unlist, and grepl:
library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]
The problem is that I want to find only exact matches of the words in wordlist within the text column.
grepl also reports partial matches; for example, AudiA6 in text is matched by the word Audi in wordlist. My data frame is also very large, and grepl takes a long time to run. Please recommend another approach if possible. I want something like this:
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement",
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
You can use str_extract_all from stringr after adding word boundaries (\\b) around each word you want to extract, so that only whole-word matches are considered. The words also need to be collapsed with "|", which means "or" in a regular expression:
sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange" "engine,replacement,BMW" "brand" "Volkswagen,company,BMW"
If you want to put it in your data.table:
df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
# page text match
#1: 12 ToyotaCorolla is offering new car exchange offers exchange
#2: 6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3: 9 I just bought a brand new AudiA6 brand
#4: 65 Volkswagen is the parent company of BMW Volkswagen,company,BMW
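For readers without stringr, the same word-boundary idea can be sketched in base R with gregexpr and regmatches. This is only an illustration on two of the sentences and a shortened word list from the question; perl = TRUE is used here as an assumption to get PCRE handling of \\b, and the names pattern, hits, and matched are made up for the example.

```r
# Two sentences and a shortened word list taken from the question
text <- c("I just bought a brand new AudiA6",
          "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "Volkswagen", "company", "brand")

# Build one alternation wrapped in word boundaries: \b(Audi|BMW|...)\b
pattern <- paste0("\\b(", paste(wordlist, collapse = "|"), ")\\b")

# gregexpr locates every match per string; regmatches extracts them
hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))

# Collapse each set of matches into one comma-separated string
matched <- vapply(hits, paste, character(1), collapse = ",")
print(matched)
```

Audi does not match inside AudiA6 because the character after it is a letter, so there is no word boundary at that position.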

How to match column values in two dataframes and make rownames with the matching corresponding column values

I have a data frame called mydf. I want to match its current column against the key.genomloc column of a second object called secondf, extract the corresponding key.wesmut.genom values, and use them as row names, as shown in the result.
This is what I have tried, but it does not work as desired:
current <- secondf[,"key.genomloc"]
replacement <- secondf[,"key.wesmut.genom"]
v <- mydf[,"current"] %in% current
w <- current %in% mydf[,"current"]
rownames(mydf)<-mydf[,"current"]
rownames(mydf)[v] <- replacement[w]
Data:
mydf <-structure(list(current = structure(c(5L, 2L), .Label = c("chr1:115256529:T:C",
"chr1:115256530:G:T", "chr1:115258744:C:A", "chr1:115258744:C:T",
"chr1:115258747:C:T", "chr11:32417945:T:C", "chr12:25398284:C:A",
"chr12:25398284:C:T", "chr13:28592640:A:C", "chr13:28592641:T:A",
"chr13:28592642:C:A", "chr13:28592642:C:G", "chr15:90631838:C:T",
"chr15:90631934:C:T", "chr2:209113112:C:T", "chr2:209113113:G:A",
"chr2:209113113:G:C", "chr2:209113113:G:T", "chr2:25457242:C:T",
"chr2:25457243:G:A", "chr2:25457243:G:T", "chr4:55599320:G:T"
), class = "factor"), `index` = c(1451738, 1451718)), .Names = c("current",
"index"), row.names = 1:2, class = "data.frame")
secondf<-structure(c("WES:FLT3:p.D835H", "WES:FLT3:p.D835N", "WES:FLT3:p.D835Y",
"WES:FLT3:p.D835A", "WES:FLT3:p.D835V", "chr1:115256530:G:T",
"chr13:28592642:C:T", "chr13:28592642:C:A", "chr1:115258747:C:T",
"chr13:28592641:T:A"), .Dim = c(5L, 2L), .Dimnames = list(NULL,
c("key.wesmut.genom", "key.genomloc")))
Result
rowname current index
WES:FLT3:p.D835A chr1:115258747:C:T 1451738
WES:FLT3:p.D835H chr1:115256530:G:T 1451718
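We can use match, which returns, for each value of mydf$current, the position of its first occurrence in the key.genomloc column of secondf. Indexing the key.wesmut.genom column by those positions gives the new names: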
mydf$rowname <- secondf[,1][match(mydf$current,secondf[,2])]
mydf[c(3,1:2)]
# rowname current index
#1 WES:FLT3:p.D835A chr1:115258747:C:T 1451738
#2 WES:FLT3:p.D835H chr1:115256530:G:T 1451718
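If the values should become actual row names rather than a column, a sketch along the following lines may help. It uses a cut-down copy of secondf plus one extra, hypothetical row in mydf (the third current value and its index are invented) to show what happens when there is no match: match returns NA there, and the example falls back to the original current value, which is an assumption about the desired behaviour.

```r
# Cut-down lookup matrix with the same column names as secondf
secondf <- matrix(
  c("WES:FLT3:p.D835H", "WES:FLT3:p.D835A",
    "chr1:115256530:G:T", "chr1:115258747:C:T"),
  ncol = 2,
  dimnames = list(NULL, c("key.wesmut.genom", "key.genomloc"))
)

# First two rows follow the question; the third is a made-up unmatched row
mydf <- data.frame(
  current = c("chr1:115258747:C:T", "chr1:115256530:G:T", "chr4:55599320:G:T"),
  index   = c(1451738, 1451718, 1451700),
  stringsAsFactors = FALSE
)

# Position of each current value in the lookup column (NA if absent)
idx <- match(mydf$current, secondf[, "key.genomloc"])

# Take the replacement name where found, otherwise keep the current value
new_names <- secondf[idx, "key.wesmut.genom"]
new_names[is.na(idx)] <- mydf$current[is.na(idx)]
rownames(mydf) <- new_names
print(mydf)
```

Row names must be unique, so this only works when no two rows map to the same replacement.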

Remove thousand's separator [duplicate]

I imported an Excel file and got a data frame like this:
structure(list(A = structure(1:3, .Label = c("1.100", "2.300",
"5.400"), class = "factor"), B = structure(c(3L, 2L, 1L), .Label = c("1.000.000",
"500", "7.800"), class = "factor"), C = structure(1:3, .Label = c("200",
"3.100", "4.500"), class = "factor")), .Names = c("A", "B", "C"
), row.names = c(NA, -3L), class = "data.frame")
I would now like to convert these values to numeric or even integer. However, the dot character (.) is not a decimal sign but a thousands separator (the numbers use German formatting).
How would I convert the data frame properly?
I tried this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"))
df3 <- as.data.frame(data.matrix(df2))
However, apply seems to convert each column to a list of factors. Can I prevent apply from doing so?
You can use this :
sapply(df, function(v) {as.numeric(gsub("\\.","", as.character(v)))})
Which gives :
A B C
[1,] 1100 7800 200
[2,] 2300 500 3100
[3,] 5400 1000000 4500
This gives you a matrix, but you can wrap it in data.frame() if you wish.
Note that the columns in your original data are not characters but factors.
Edit: Alternatively, instead of wrapping it with data.frame(), you can do this to get the result directly as a data.frame:
# the as.character(.) is just in case it's loaded as a factor
df[] <- lapply(df, function(x) as.numeric(gsub("\\.", "", as.character(x))))
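As a self-contained check of the lapply approach, the sketch below rebuilds the question's data by hand as factor columns and applies the conversion. It uses fixed = TRUE instead of escaping the dot, which is an alternative way (not the one used above) to treat "." as a literal character.

```r
# Data from the question, rebuilt by hand as factor columns
df1 <- data.frame(
  A = c("1.100", "2.300", "5.400"),
  B = c("7.800", "500", "1.000.000"),
  C = c("200", "3.100", "4.500"),
  stringsAsFactors = TRUE
)

# Convert each factor to character, drop every literal ".", then to numeric;
# assigning into df1[] keeps the data.frame structure and column names
df1[] <- lapply(df1, function(x) {
  as.numeric(gsub(".", "", as.character(x), fixed = TRUE))
})
str(df1)
```

The as.character() step matters for factors: calling as.numeric() directly on a factor would return its internal level codes rather than the displayed values.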
I think I just found another solution:
It's necessary to use stringsAsFactors = FALSE.
Like this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"), stringsAsFactors = FALSE)
df3 <- as.data.frame(data.matrix(df2))
