Remove specific data within a string in R

I'm new to R. I have this data frame and I'm trying to delete all the information from this column except the gene symbols, which always come second within the string.
Best regards!
I tried the gsub() function, but it only deleted the specific element I passed to it. I'm wondering whether I can use it to keep only the gene symbol (which always comes second in the string) and delete everything else.

If your data is consistently in the format shown in the image (where the gene ID is always the third "word" of the string), then the word() function from the stringr package can extract the data you want.
library(stringr)
dat = data.frame(gene_assignment = rep(c('idnumbers // geneID // Other stuff'),10))
dat$geneID = word(dat$gene_assignment, 3)
Note that this makes the following assumptions:
Your data is always in the format where there are some id numbers, followed by " // ", followed by the gene ID, followed by a space, and then anything else
Neither the ID numbers in the front nor the gene ID ever contain a space in them
These assumptions are necessary because word() uses spaces to determine when each word starts and ends.
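If the IDs or symbols might contain spaces, one alternative is to split on the " // " separator itself rather than on spaces. The following is a minimal base R sketch; the sample values are made up for illustration, and it assumes every string has at least two fields separated by " // ".

```r
# Hypothetical sample data in the "id // symbol // description" layout
gene_assignment <- c("NM_001 // BRCA1 // breast cancer 1",
                     "NM_002 // TP53 // tumor protein p53")

# Split each string on the literal separator " // " (fixed = TRUE disables regex)
fields <- strsplit(gene_assignment, " // ", fixed = TRUE)

# Keep the second field of every string
gene_symbol <- vapply(fields, function(x) x[2], character(1))
print(gene_symbol)
```

Because the split happens on the full separator, spaces inside the description or the ID do not shift the position of the symbol.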

Related

Replacing Content of a column with part of that column's content

I'd like to replace the content of a column in a data frame with only a specific word in that column.
The column always looks like this:
Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')
Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')
I'd like to extract the city name (in this case Würzburg or Iphofen) into a new column, or replace the entire row with the name of the town. There are many different towns so having a gsub-command for every city name will be tough.
Is there a way to maybe just use a gsub and tell Rstudio to replace whatever it finds inside the first two ' '?
Might it be possible to tell it, "give me the word after name=' until the next '"?
I'm very new to using R so I'm kind of out of ideas.
Thanks a lot for any help!
I know of the gsub command, but I don't think it will be the most appropriate in this case.
Yes, with a regular expression you can do exactly that:
string <- "Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')"
city <- gsub(".*name='(.*?)'.*", "\\1", string)
The regular expression says "match any characters followed by name=', then capture any characters until the next ' and then match any additional characters". Then you replace all of that with just the captured characters ("\\1").
The parentheses mean "capture this part", and the value becomes "\\1". (You can do multiple captures, with subsequent captures being \\2, \\3, etc.)
Note the question mark in (.*?). This means "match as little as possible while still satisfying the rest of the regex". Without the question mark, the regular expression matches "greedily" and the capture runs on to the last ' in the line instead of stopping after the city, since that also satisfies the regular expression.
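To see the difference between lazy and greedy capture on a whole column, here is a small sketch. The data frame, its column name and the second row (Kitzingen) are made up for illustration; perl = TRUE is used for the lazy version only to make the quantifier behaviour explicit. The negated character class [^']* is another common way to stop at the next quote without relying on laziness.

```r
# Hypothetical data frame; the second row is invented for the example
df <- data.frame(place = c(
  "Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')",
  "Place(fullName='Kitzingen, Germany', name='Kitzingen', type='city', country='Germany', countryCode='DE')"
), stringsAsFactors = FALSE)

# Lazy capture: stop at the first ' after name='
df$city_lazy <- sub(".*name='(.*?)'.*", "\\1", df$place, perl = TRUE)

# Negated character class: capture anything that is not a quote
df$city_class <- sub(".*name='([^']*)'.*", "\\1", df$place)

# Greedy capture for comparison: runs on to the last ' in the line
greedy <- sub(".*name='(.*)'.*", "\\1", df$place[1])

print(df[, c("city_lazy", "city_class")])
print(greedy)
```

sub() is enough here because each string is replaced as a whole; gsub() gives the same result for this pattern.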
More about regular expression (specific to R) can be found here

Cleaning a column with break spaces that obtain last, first name so I can filter it from my data frame

I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try to filter them I get most of the names except for a few, even though I can clearly see their names in the original Excel file. I think it has to do with some sort of special characters or spacing in the name column. I am confused about how to fix this.
I have tried applying Excel's CLEAN() function to the column. I have tried building an Alteryx flow to clean the data. None of these steps have helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line has a red box with a white dot between the comma and the first name. I got that character by copying the name from the data frame into Notepad and then pasting it into R. This version works and returns what I want. The second line uses a standard space and doesn't return what I want. How can I fix this without having to copy each name from the data frame through Notepad into R?
Expected results is that I get the rows that are attached to what ever name I filter by.
I was able to find the answer. It turns out the space is actually a non-breaking space (U+00A0) rather than the normal space (U+0020). The non-breaking space is not part of the American Standard Code for Information Interchange (ASCII), so the == comparison in filter() couldn't match the names that contained it. I fixed this by substituting a normal space for the non-breaking space in the column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) # replace each non-breaking space with a regular space in the column of interest
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter by any name!
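For anyone hitting a similar problem, a quick way to check whether a column contains non-breaking spaces is sketched below. The vector names_raw is a made-up stand-in for the real column.

```r
# Hypothetical stand-in for the column: first entry uses U+00A0, second a normal space
names_raw <- c("Spencer,\u00A0Amy", "Spencer, Amy")

# Flag the entries that contain a non-breaking space
has_nbsp <- grepl("\u00A0", names_raw, fixed = TRUE)
print(has_nbsp)

# Inspect the code points of a suspicious entry; 160 is U+00A0, 32 is a normal space
print(utf8ToInt(names_raw[1]))

# Replace the non-breaking spaces so both entries compare equal
names_clean <- gsub("\u00A0", " ", names_raw, fixed = TRUE)
print(names_clean == "Spencer, Amy")
```

Looking at the code points with utf8ToInt() avoids the round trip through a text editor to see which character is really there.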
Thanks everyone!

How can I remove the first x characters of a column name from 200+ columns when the prefix is not the same number of characters in each column

How can I remove a specific number of characters from the start of 200+ column names? For example, given "Q1: GOING OUT?" and "Q5: STATE, PROVINCE, COUNTY, ETC", I just want to remove the "Q1: " and the "Q5: ". I have looked around but haven't been able to find a solution where I don't have to rename them manually. Are there any functions for this, or ways to do it through tidyverse? I have only been using R for 2 months.
I don't really have anything to show. I have considered using for loops and possibly gsub or case_when, but don't really understand how to use them properly.
#probably not correctly written but tried to do it anyways
for ( x in x(0:length) and _:(length(CandyData)-1){
front -> substring(0:3)
back -> substring(4:length(CandyData))
print <- back
}
I don't really have any errors because I haven't been able to make it work properly.
Try this:
col_all <- c("Q1:GOING OUT?", "Q2:STATE", "Q100:PROVINCE", "Q200:COUNTRY", "Q299:ID") # Example names. If you already have a data frame, get them with col_all <- names(df)
for (i in seq_along(col_all)) { # iterate over the column names
  colname <- col_all[i] # current column name
  colon_pos <- regexpr(":", colname, fixed = TRUE)[1] # position of the first ":" (-1 if there is none)
  if (colon_pos > 0) {
    col_all[i] <- substr(colname, colon_pos + 1, nchar(colname)) # keep everything after the first ":"
  }
}
names(df) <- col_all # assign the cleaned names back to the data frame
The H 1 answer is still the best: sub() or gsub() functions will do the work. And do not fear the regex, it is a powerful tool in data management.
Here is the gsub version:
names(df) <- gsub("^.*:","",names(df))
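It works this way: for each name, the pattern matches everything from the start of the string up to ":" (the last one, if a name contains several, because .* is greedy) and replaces the match, including the ":", with an empty string.
Remember to upvote H 1's solution in the comments.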
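A variant worth considering, sketched here under the assumption that the prefix is everything up to the first colon plus any following spaces: using [^:]* instead of .* stops at the first colon, so a colon later in the name is left alone. The third name below is invented to show the difference.

```r
# Two names from the question plus one hypothetical name with a second colon
col_names <- c("Q1: GOING OUT?",
               "Q5: STATE, PROVINCE, COUNTY, ETC",
               "Q9: TIME: MORNING")

# Remove everything up to the first ":" and any spaces right after it
cleaned <- sub("^[^:]*:[[:space:]]*", "", col_names)
print(cleaned)

# The greedy pattern removes everything up to the last ":" instead
greedy <- sub("^.*:[[:space:]]*", "", col_names)
print(greedy)
```

With a data frame, the same call can be applied as names(df) <- sub("^[^:]*:[[:space:]]*", "", names(df)).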

Create a function in R to extract character from string by using position? The positions of characters are figured out based on pattern condition

I want to create a function that extracts characters from strings using substring, but I'm having trouble finding the end position at which to cut.
I have a string stored as a log line like this:
string = ("{\"country\":\"UNITED STATES\",\"country`_`code\":\"US\"}")
My idea is to identify the position of each field name in the log and cut out the characters that follow it:
start_position = as.numeric(str_locate(string,'\"country\":\"')[,2])
end_position = ??????
country = substring(string,start_position,end_position)
The marker for the end of the part I want to cut out is the "," that follows it. FOR EXAMPLE: \"country\":\"UNITED STATES\",
Could you tell me a way to get the position of the "," that follows a specific pattern? I intend to create a function later to extract characters based on the recognized pattern. In this example, the patterns are "country" and "country code".
Instead of using substring, have a look at strsplit, which splits a string according to a pattern.
string = ("{\"country\":\"UNITED STATES\",\"country`_`code\":\"US\"}")
strsplit(string,",")[[1]][1]
[1] "{\"country\":\"UNITED STATES\""
You can change the pattern to any regex you like.
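If the goal is a reusable function that returns the value for a given field name, a capture group can do the locating and cutting in one step. The sketch below is an illustration only: extract_field is a hypothetical helper, the sample line uses a plain "country_code" key as an assumption, and the helper assumes keys contain no regex metacharacters and values contain no double quotes.

```r
# Hypothetical helper: return the quoted value that follows "key": in each string
extract_field <- function(x, key) {
  # Build a pattern like "country":"([^"]*)" with the value in a capture group
  pattern <- paste0("\"", key, "\":\"([^\"]*)\"")
  matches <- regmatches(x, regexec(pattern, x))
  # Each match holds the full match first, then the captured value
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

# Assumed sample log line for illustration
log_line <- "{\"country\":\"UNITED STATES\",\"country_code\":\"US\"}"

country <- extract_field(log_line, "country")
code <- extract_field(log_line, "country_code")
missing <- extract_field(log_line, "city")
print(c(country, code, missing))
```

Because the pattern includes the closing quote and colon after the key, asking for "country" does not accidentally match "country_code".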

How do I extract a section number and the text after it?

I have a question.
My text file contains lines such as:
1.1        Description.
This is the description.
1.1.1      Quality Assurance
Random sentence.
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1        Description
1.1.1      Quality Assurance
1.6.1    Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1        Description.
1.1.1      Quality Assurance
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line is returned. Your regular expression matches numbers at the beginning of the line, so all the lines that begin with numbers get selected.
It seems that your goal is not to select the whole line, but only the part up to a line break or a period.
So you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and stops at the first period (which is included in the match) or at the end of the line. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting only the matching portion from each line, not the whole line in which the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which extracts those matches:
txt1 <- readLines("text1.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
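If the trailing period should be left out of the result, as in the desired output of the question, one possible variant is to capture only characters that are not periods after the section number. This is a sketch on an in-memory copy of the sample lines, and it assumes heading titles never contain a period themselves.

```r
# In-memory stand-in for readLines("text1.txt")
txt1 <- c("1.1        Description.",
          "This is the description.",
          "1.1.1      Quality Assurance",
          "Random sentence.",
          "1.6.1    Quality Control. Quality Control is the responsibility of the contractor.")

# Section number, one or more spaces, then everything up to (not including) the first period
pattern <- "^[0-9]+(\\.[0-9]+)* +[^.]+"

# regexpr finds the match positions; regmatches keeps only the matched text
headings <- regmatches(txt1, regexpr(pattern, txt1))
print(headings)
```

Lines that do not start with a section number produce no match, so regmatches() drops them from the result.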

Resources