How to group column names and add suffixes to them? - r

I kindly appreciate if someone could help me with the task described below.
I have R dataframe with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.uppen.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives my duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
How can I append then with _N, where N should be assigned as showed above? I am searching for an automated approach because my real data frame contains hundreds of columns.

An option is to remove the substring at the end and wrrap with make.unique
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Or another option is to use the sub output as a grouping variable and then append the sequence at the end
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
dat
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")

Related

R change column names, retain part of colnames

I have a df with colnames thus:
Resampled.Band.1..raster.bsq...404.502014.Nanometers.
...
Resampled.Band.74..raster.bsq...950.851990.Nanometers.
I want them like this:
950.851990_nm
With:
orig_names <- names(df)
new_name <- gsub("Resampled.Band.", "", orig_names)
and
new_name <- gsub(".Nanometers.", "_nm", new_name)
names(all_roi_rfl) <- new_name
I achieve part of what I want: to change the first and last parts of the colnames:
1..raster.bsq...404.502014_nm
I could repeat this to clean the colnames up most of the way.
But how do I deal with the part of the colnames that varies itself, the band number?
Extract the values that you want using regex and replace the column names.
x <- c('Resampled.Band.1..raster.bsq...404.502014.Nanometers.',
'Resampled.Band.74..raster.bsq...950.851990.Nanometers.')
sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', x)
#[1] "404.502014_nm" "950.851990_nm"
This extracts number that occur between "raster.bsq" and "Nanometers" and appends "_nm" to extracted value.
In your case to replace column names it would be :
names(all_roi_rfl) <- sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', names(all_roi_rfl))
A similar answer to Ronak's but using gsub instead.
First generate a dataframe...
df <-
data.frame(
Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(1, 2, 1, 2),
Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c('a', 'b', 'c', NA))
using gsub identify the string before and after the piece you want to extract
colnames(df) <- gsub(".*raster.bsq...(.+).Nanometers.", "\\1_nm", colnames(df))

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"), stringsAsFactors = F)
doIt <- function(df){
listAsDF <- function(l) {
x <- stack(setNames(l, temp$value))
names(x) <- c("ID", "value")
return(x)
}
Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
#1 cases when - present
temp <- df[grep("\\-", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
Base <- rbind(Base, temp)
#2 cases when / present
temp <- df[grep("\\/", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
Base <- rbind(Base, temp)
return(Base)
}
Then you can mergge the df2 and df1:
merge(doIt(df1), df2, by = "ID", all.x = T)
Hope this helps!
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
One option could consist in extracting the format of ID you want to keep. And then do your merge.
You can format your ID column as follow :
library(stringr)
library(dplyr)
If you want only the digits before any symbols
Table1 %>% mutate(ID = str_extract("[0-9]*"))
If you want to keep the first sequence of 5 digits
Table1 %>% mutate(ID = str_extract("[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package

strsplit intermediate pattern in first column in a data frame

I have a data frame and I would like to split the first column into two columns but the separate pattern is similar to others and I only want to split the pattern located on number 4.
data frame:
TCGA-TS-A7P1-01A-41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A-11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A-11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A-11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A-11D-A34E-05 0.8882558
desired output:
TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A 11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A 11D-A34E-05 0.8882558
I tried:
sapply(strsplit(as.character(df$ID), "-"), '[', 1:4)
However, it is not the desired output above that I want. Thank you very much.
It seems all the elements of your first column are of the same length so one simple way could be:
df <- data.frame(col1 = c("TCGA-TS-A7P1-01A-41D-A39S-05","TCGA-NQ-A57I-01A-11D-A34E-05","TCGA-3H-AB3O-01A-11D-A39S-05"),
col2 = c(0.8637304,0.7812147,0.8963944), stringsAsFactors = FALSE)
df$col1bis <- substr(df$col1,18,28)
df$col1 <- substr(df$col1,1,16)
Then I reaggange the order of the columns:
df <- df[, c(1,3,2)]
resulting in:
> df
col1 col1bis col2
1 TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
2 TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
3 TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
I tried this one and it worked well.
df <- cbind(df[,1],df)
df[,1] <- substr(df[,1],1,16)
df[,2] <- substr(df[,2],18,28)

finding similar strings in each row of two different data frame

I would like to check two data set. one data has many columns (this example has two columns df1) and one data has one column (df2)
At first, I want to check the first column of df1 each row with all part of df2 if any similar part is found, then the row number of df1 and df2 is written
for example
Column 1 of df1 has two similar part of the row to df2 Q9Y6Q9 in row 3 of df1 with Q9Y6Q9 in row 4 of df2 . so the output is 3-4 , the same for others
Maybe you should normalize your data first. For instance, you could do:
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
This function can applied to each column of df1 as well as the lookup table df2:
s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")
With this normalized data, it is easy to generate the output you are looking for:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(s[found], pos, sep="-"))
#change the last line to "return(as.character(pos))" to get only the result as in the comment
}
process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4" "3-15" "7-16"
The output is not exactly the same as in the question, but this may be due to manual lookup errors.
In order to iterate over all columns of df1, you could use lapply:
res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)
Now, res is a list indexed by the column names of df1:
res[["sample_1"]]
# [1] "4" "1" "4"

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I will stick with packages, here is a solution
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
process
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in 1:length(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT
Try creating an index of the rows with the target symbol with the column headers. Then split the data on that index. The call cumsum(ind1)[!ind1] first creates an id rows by coercing the logical vector into numeric, then eliminates the rows with the column headers.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
You can use plyr to accomplish this if you are able to assigned a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
if you then view 't' I believe this is what you were looking for in your original post

Resources