I have five columns named organoleptic.1., organoleptic.2., organoleptic.3. and so forth in a data frame called "df". I want to rename them to organoleptic1, organoleptic2, organoleptic3, etc. That is, I want to remove the two dots surrounding the number. I did it using the names function:
names(df)[names(df) == "organoleptic.1."] <- "organoleptic1"
names(df)[names(df) == "organoleptic.2."] <- "organoleptic2"
names(df)[names(df) == "organoleptic.3."] <- "organoleptic3"
names(df)[names(df) == "organoleptic.4."] <- "organoleptic4"
names(df)[names(df) == "organoleptic.5."] <- "organoleptic5"
However, I would like to do it just typing one line of code. Is it possible to do that using regular expressions or any other trick? Many thx!
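You can use gsub with fixed=TRUE, so that the dot is treated as a literal character rather than as the regex wildcard. Note that this removes every dot in every column name. (Edit: changed sub to gsub, since sub would remove only the first dot in each name.)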
colnames(df) <- gsub('.', '', colnames(df), fixed=TRUE)
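If other columns in the frame contain dots that should be kept, a narrower pattern may be safer. The following is only a sketch: the example frame and the column other.col are made up for illustration, and the pattern assumes the number always sits between two dots at the end of the name.

```r
# Hypothetical example frame; only the organoleptic columns should change.
df <- data.frame(organoleptic.1. = 1, organoleptic.2. = 2, other.col = 3)

# Match ".<digits>." at the end of a name and keep only the digits.
names(df) <- sub("\\.(\\d+)\\.$", "\\1", names(df))

names(df)
```

Names that do not match the pattern, such as other.col, are left as they are.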
I want to manipulate the names of all the columns in a dataframe with this function that I wrote:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
}
That said, when actually called, it doesn't do anything (no error just nothing).
What's the problem here?
I suspect the problem is that I'm only assigning things and not returning anything. But in this case I don't want to return a value just change the column names.
The only parameter here is df and I'm calling the names() function multiple times. Shouldn't this work? Any help is appreciated!
Two things here:
R generally does not modify objects by side effect. You can pass a data.frame into a function, but the first time the function changes anything about it, the df inside the function is copied into a new object that goes away when the function is done. The original frame is untouched. Some functions in R do operate by side effect, but most do not, so you cannot make changes inside a function and assume they will take effect outside it. Instead, you need to assign the result back to the frame, as in:
mydata <- clean_names(mydata)
When there is no literal return(.) statement in a function, R will return the last expression (often invisibly). You will often see functions end with the desired object (df here) without using the literal return function; that function is useful in some circumstances but usually not needed.
The value of an assignment is returned invisibly. You can see what is really happening by capturing the return value in a new variable or, as a shortcut, by wrapping the call in parentheses: (clean_names(mydata)). I expect the output of that function to be a vector of strings.
Why? Because the last expression is a reassignment of names. The RHS of that assignment is producing a character vector, and that is passed to the `names<-` function on the LHS, and that value (the vector of strings) is then used as the return value of the function.
The resolution here is to add df (or return(df) if you must) to the end of your function, as in:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
df
}
With both changes in place (ending the function with df and reassigning the result), you should get the renamed data frame back.
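Both points can be checked with a small experiment. This is a sketch only: the two helper functions and the mydata frame are invented for the demonstration and use a single tolower step in place of the full clean_names body.

```r
# Ends with an assignment: the function's value is the assigned names.
rename_no_return <- function(df) {
  names(df) <- tolower(names(df))
}

# Ends with df: the function's value is the modified copy.
rename_with_return <- function(df) {
  names(df) <- tolower(names(df))
  df
}

mydata <- data.frame(Col.A = 1:2, Col.B = 3:4)  # hypothetical data

res1 <- rename_no_return(mydata)    # a character vector, not a data frame
res2 <- rename_with_return(mydata)  # a data frame with lower-case names

original_names <- names(mydata)     # unchanged by either call

# Only reassignment changes the original object.
mydata <- rename_with_return(mydata)
```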
From the names documentation:
For names<-, the updated object. (Note that the value of names(x) <- value is that of the assignment, value, not the return value from the left-hand side.)
Therefore you should try:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
return(df)
}
Given a data.frame like 'df', I would like to find the exact phrase "keratinization [GO:0031424]" in each cell of the column 'bio_process'. Afterwards, I want to create a new vector with the 'ID' of the observations where the match occurred.
ID <- c("Q9BYP8", "Q17RH7", "Q6L8G8", "Q9BYR4")
bio_process <- c("keratinization [GO:0031424]", "NA", "keratinization [GO:0031424]", "aging [GO:0007568]; hair cycle [GO:0042633]; keratinization [GO:0031424]")
df <- as.data.frame(cbind(ID,bio_process))
To achieve this, I used a for loop with %in% inside it, like this:
n <- 4
ids <- vector(mode = "character", length = n)
for (i in 1:n) {
if ("keratinization [GO:0031424]" %in% df$bio_process[i]) {
ids[i] <- df$ID[i]
}
}
As a result I would like the content of 'ids' vector to be like this one below.
"Q9BYP8" "Q6L8G8" "Q9BYR4"
However, %in% does not work for the cells where 'keratinization [GO:0031424]' is not the only content.
Any ideas? Thank you
You can use grepl in base R:
df$ID[grepl("keratinization \\[GO:0031424\\]",df$bio_process)]
[1] Q9BYP8 Q6L8G8 Q9BYR4
Note that I had to escape the [ and ] characters with \\, as square brackets have a special meaning in regex.
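As an alternative to escaping, grepl also accepts fixed = TRUE, which matches the pattern as a literal string. A minimal sketch using the data from the question (stringsAsFactors = FALSE is added here as an assumption, so that ID stays a character vector on older R versions):

```r
ID <- c("Q9BYP8", "Q17RH7", "Q6L8G8", "Q9BYR4")
bio_process <- c("keratinization [GO:0031424]", "NA",
                 "keratinization [GO:0031424]",
                 "aging [GO:0007568]; hair cycle [GO:0042633]; keratinization [GO:0031424]")
df <- data.frame(ID, bio_process, stringsAsFactors = FALSE)

# fixed = TRUE: the brackets are matched literally, no escaping needed.
ids <- df$ID[grepl("keratinization [GO:0031424]", df$bio_process, fixed = TRUE)]
ids
```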
I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
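To see what the pattern does on its own, it can be tried on a plain character vector. This is just an illustrative sketch: the vector x and the frame d are made up, and the last lines assume you want to apply the same pattern to the column names rather than to the cell values.

```r
x <- c("pmm_StartTimev4_E2_C19_1",   # has "Start" and ends in _<number>
       "pmm_StartTimev4_E2_E2_C1",   # has "Start" but ends in _C1
       "complete_E1_C12_1",          # ends in _<number> but has no "Start"
       "pmm_StartTo_v4_E2_C19_12")   # multi-digit suffix

res <- sub("(.*Start.*)_\\d+$", "\\1", x)
res

# The same call works on column names (hypothetical frame d).
d <- data.frame(pmm_StartTimev4_E2_C19_1 = 1, delivery_C1_C12_3 = 2)
names(d) <- sub("(.*Start.*)_\\d+$", "\\1", names(d))
```

Only the first and last elements of x change, because they are the only ones that both contain "Start" and end in an underscore followed by digits.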
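doit <- function(x){
  x <- as.character(x)
  # Only touch values that contain "Start"
  if(grepl("Start",x)){
    # Strip a trailing underscore followed by one or more digits
    x <- sub("_[0-9]+$","",x)
  }
  return(x)
}
# Apply cell by cell (MARGIN = c(1,2)); the result is a character matrix
apply(dfbefore,c(1,2),doit)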
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
I'm trying to subset a big table across a number of columns, so all the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown."
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
mysubset <- subset(mysubset, paste("State_",i," != Unknown",sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in current dplyr releases, so this applies to older versions):
library(dplyr)
mysubset <- data
for(i in 2009:2016)
{
mysubset <- mysubset %>%
filter_(paste("State_",i," != \"Unknown\"", sep = ""))
}
To add to Matt's answer, you could also do it like this:
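cols <- paste0("State_", 2009:2016)
# Row indices of every cell equal to "Unknown" in the State_ columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# Keep the rows that were not flagged; this also works when no row matches
mysubset <- mysubset[!(seq_len(nrow(mysubset)) %in% inds), ]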
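The loop from the question can also be written in base R without building code as a string. The sketch below is an illustration under assumptions: the small data frame is a made-up stand-in for the real table and has only two of the year columns, so the loop runs over 2009:2010.

```r
# Hypothetical stand-in for the real table.
data <- data.frame(
  id = 1:4,
  State_2009 = c("NY", "Unknown", "CA", "TX"),
  State_2010 = c("NY", "NJ", "Unknown", "TX"),
  stringsAsFactors = FALSE
)

keep <- rep(TRUE, nrow(data))
for (i in 2009:2010) {
  col <- paste0("State_", i)
  # data[[col]] looks the column up by the name stored in the string.
  # %in% returns FALSE for NA, so rows with NA are kept, not turned into NA.
  keep <- keep & !(data[[col]] %in% "Unknown")
}
mysubset <- data[keep, ]
mysubset
```

The point is that [[ accepts a column name held in a variable, which is what the paste call in the question was trying to achieve.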
I am working on a large dataset, with some rows with NAs and others with blanks:
df <- data.frame(ID = c(1:7),
home_pc = c("","CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB","MP9 7GH","KN4 5GH"),
start_pc = c(NA,"Home", "FC5 7YH","Home", "CB3 5TH", "BV6 5PB",NA),
end_pc = c(NA,"CB5 4FG","Home","","Home","",NA))
How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? I have in the past used:
df<- df[-which(is.na(df$start_pc)), ]
... to remove the NAs - is there a similar command to remove the blanks?
df[!(is.na(df$start_pc) | df$start_pc==""), ]
It is the same construct: simply test for empty strings rather than NA. Try this:
df <- df[-which(df$start_pc == ""), ]
In fact, looking at your code, you don't need which at all; use negation instead. This is also safer, because -which(...) drops every row when nothing matches. So you can simplify it to:
df <- df[!(df$start_pc == ""), ]
df <- df[!is.na(df$start_pc), ]
And, of course, you can combine these two statements as follows:
df <- df[!(df$start_pc == "" | is.na(df$start_pc)), ]
And simplify it even further with with:
df <- with(df, df[!(start_pc == "" | is.na(start_pc)), ])
You can also test for non-zero string length using nzchar.
df <- with(df, df[nzchar(start_pc) & !is.na(start_pc), ])
Disclaimer: I didn't test any of this code. Please let me know if there are syntax errors anywhere
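A quick check of the indexing behaviour discussed above, as a sketch on a made-up three-row frame x in which start_pc has an NA but no blank:

```r
x <- data.frame(ID = 1:3,
                start_pc = c("Home", "FC5 7YH", NA),
                stringsAsFactors = FALSE)

# No blanks, so which() returns integer(0); negating it is still
# integer(0), and indexing with that selects no rows at all.
dropped <- x[-which(x$start_pc == ""), ]
nrow(dropped)

# A logical test that handles NA explicitly keeps the two valid rows.
kept <- x[!is.na(x$start_pc) & x$start_pc != "", ]
nrow(kept)
```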
An elegant solution with dplyr would be:
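library(dplyr)
df %>%
  # recode empty strings "" as NA in every character column
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # remove rows that contain any NA
  na.omit()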
An alternative is to remove the rows with blanks in a single variable (VAR here is a placeholder for the column name):
df <- subset(df, VAR != "")
An easy approach would be making all the blank cells NA and only keeping complete cases. You might also look for na.omit examples. It is a widely discussed topic.
df[df==""]<-NA
df<-df[complete.cases(df),]
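Since the question only asks about start_pc and end_pc, the same idea can be limited to those two columns so that a blank home_pc alone does not drop a row. A minimal sketch on the question's data (stringsAsFactors = FALSE is an assumption added so the comparison with "" works on character columns in older R versions; the names cols, is_missing and res are made up):

```r
df <- data.frame(ID = c(1:7),
                 home_pc = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc = c(NA, "CB5 4FG", "Home", "", "Home", "", NA),
                 stringsAsFactors = FALSE)

cols <- c("start_pc", "end_pc")

# TRUE wherever a cell in the chosen columns is NA or an empty string.
is_missing <- is.na(df[cols]) | df[cols] == ""

# Keep the rows with no missing or blank value in those columns.
res <- df[rowSums(is_missing) == 0, ]
res
```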