strsplit intermediate pattern in first column in a data frame - r

I have a data frame and would like to split its first column into two columns. The separator (-) occurs several times in each value, and I only want to split at its fourth occurrence.
data frame:
TCGA-TS-A7P1-01A-41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A-11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A-11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A-11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A-11D-A34E-05 0.8882558
desired output:
TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A 11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A 11D-A34E-05 0.8882558
I tried:
sapply(strsplit(as.character(df$ID), "-"), '[', 1:4)
However, this does not give the desired output shown above. Thank you very much.

It seems all the elements of your first column are of the same length so one simple way could be:
df <- data.frame(col1 = c("TCGA-TS-A7P1-01A-41D-A39S-05","TCGA-NQ-A57I-01A-11D-A34E-05","TCGA-3H-AB3O-01A-11D-A39S-05"),
col2 = c(0.8637304,0.7812147,0.8963944), stringsAsFactors = FALSE)
df$col1bis <- substr(df$col1,18,28)
df$col1 <- substr(df$col1,1,16)
Then I rearrange the order of the columns:
df <- df[, c(1,3,2)]
resulting in:
> df
col1 col1bis col2
1 TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
2 TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
3 TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
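If the IDs were not all the same length, the fixed character positions above would no longer line up. As a hedged alternative, here is a minimal sketch that splits at the fourth hyphen with a regular expression instead. The pattern and the reuse of the column name col1bis are illustrative assumptions, not part of the original answer.

```r
# Toy data in the same shape as the question
df <- data.frame(
  col1 = c("TCGA-TS-A7P1-01A-41D-A39S-05", "TCGA-NQ-A57I-01A-11D-A34E-05"),
  col2 = c(0.8637304, 0.7812147),
  stringsAsFactors = FALSE
)

# Group 1: the first four hyphen-separated fields
# Group 2: everything after the fourth hyphen
pattern <- "^((?:[^-]+-){3}[^-]+)-(.*)$"

# Extract the second part first, because col1 is overwritten afterwards
df$col1bis <- sub(pattern, "\\2", df$col1, perl = TRUE)
df$col1    <- sub(pattern, "\\1", df$col1, perl = TRUE)

# Put the new column right after the first one
df <- df[, c("col1", "col1bis", "col2")]
df
```

Values with fewer than five fields do not match the pattern and are left unchanged by sub, so they would need separate handling.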

I tried this one and it worked well.
df <- cbind(df[,1],df)
df[,1] <- substr(df[,1],1,16)
df[,2] <- substr(df[,2],18,28)

Related

How to group column names and add suffixes to them?

I would appreciate it if someone could help me with the task described below.
I have an R data frame with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.upper.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives me duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
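How can I append _N to them, where N is assigned as shown above? I am looking for an automated approach because my real data frame contains hundreds of columns.
One option is to remove the substring at the end and wrap the result with make.unique. Note that make.unique leaves the first occurrence unchanged and appends .1, .2, and so on to later ones, so the names differ slightly from the requested ones: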
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Another option is to use the sub output as a grouping variable and then append the within-group sequence number at the end:
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
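The same grouping idea can be written so that the counters stay integers throughout. This is only a sketch of a variant, assuming the names follow the pattern from the question; the helper vectors base, n, size and new_names are hypothetical names introduced for illustration.

```r
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]",
        "cols_width.min.(1,55]", "cols_width.min.(2,15]",
        "cols_width.upper.(1,15]")

# Strip the trailing ".(a,b]" interval to get the base name
base <- sub("\\.\\(.*", "", v1)

# Running index within each group of identical base names
n <- ave(seq_along(base), base, FUN = seq_along)

# Size of the group each name belongs to
size <- ave(seq_along(base), base, FUN = length)

# Only names that occur more than once get a "_N" suffix
new_names <- ifelse(size > 1, paste(base, n, sep = "_"), base)
new_names
```

To apply this to a data frame, v1 would be colnames(df) and the result would be assigned back with colnames(df) <- new_names.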
data
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")

Replace values in column with matching column in different DF

I have two data frames:
DF <- data.frame(A=letters[1:5],B=1:5)
DF_2 <- data.frame(match_col = c("a","a","c"))
I want to replace the values in DF$A with the corresponding values from DF_2$match_col, so that the result looks like this:
final_df <- data.frame(A=c("a","a","c","d","e"),B=1:5)
Your question is not very clear. I am not sure whether DF_2 is supposed to have a column B. I assume you forgot to include it, since such a column is needed to perform the matching.
Please see below:
DF <- data.frame(A=letters[1:5],B=1:5)
DF_2 <- data.frame(match_col = c("a","a","c"))
DF_2$B <- 1:3
DF$A <- as.character(DF$A)
DF_2$match_col <- as.character(DF_2$match_col)
for (id in seq_len(nrow(DF_2))) {
  DF$A[DF$B %in% DF_2$B[id]] <- DF_2$match_col[id]
}
DF
Here my DF matches your final_df, so I presume my assumption is right.
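The loop above can also be expressed without iterating over rows. The following is a minimal sketch under the same assumption as the answer, namely that DF_2 carries a key column B that is not present in the question; it uses match to find the target rows.

```r
DF   <- data.frame(A = letters[1:5], B = 1:5, stringsAsFactors = FALSE)
# B in DF_2 is an assumed key column, as in the answer above
DF_2 <- data.frame(match_col = c("a", "a", "c"), B = 1:3,
                   stringsAsFactors = FALSE)

# For each row of DF_2, the position of its key in DF (NA if absent)
idx <- match(DF_2$B, DF$B)
ok  <- !is.na(idx)

# Overwrite only the rows of DF whose key was found
DF$A[idx[ok]] <- DF_2$match_col[ok]
DF
```

If a key occurred several times in DF, match would only hit its first occurrence, whereas the %in% loop updates all of them.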

Exclude redundant rows containing different strings

I would like to exclude rows from a data-frame which contain mirrored info. This is my input:
dfin <- 'info
c1-10-20-c2-40-50
c2-1-2-c4-20-25
c4-20-25-c2-1-2
c2-40-50-c1-10-20'
dfin <- read.table(text=dfin, header=T)
In the above example you can see that rows 1 and 4, and rows 2 and 3, contain the same information in 'mirrored' order. In my context it does not matter whether I have c1-10-20-c2-40-50 or c2-40-50-c1-10-20, so I would like to filter out one row of each pair (either one). There are never more than two redundant rows. Moreover, in my actual data set these 'mirrored' rows are scattered and do not follow a pattern. My expected output:
dfout <- 'info
c1-10-20-c2-40-50
c2-1-2-c4-20-25'
dfout <- read.table(text=dfout, header=T)
We can split the 'info' column on "-", sort the pieces of each row, and pass the result to duplicated, which returns a logical vector that is then used to subset the rows.
dfN <- dfin[!duplicated(lapply(strsplit(as.character(dfin$info), "-"), sort)),, drop=FALSE]
all.equal(dfN, dfout, check.attributes=FALSE)
#[1] TRUE
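Sorting all six pieces treats any reordering of the tokens as a duplicate, not only swapped halves. If that is too permissive, here is a sketch that keeps each half intact and only ignores the order of the two halves. It assumes every row consists of exactly two blocks with the same number of fields; canonical_key is a hypothetical helper name.

```r
dfin <- data.frame(
  info = c("c1-10-20-c2-40-50", "c2-1-2-c4-20-25",
           "c4-20-25-c2-1-2", "c2-40-50-c1-10-20"),
  stringsAsFactors = FALSE
)

# Build an order-independent key from the two halves of one string
canonical_key <- function(s) {
  parts <- strsplit(s, "-", fixed = TRUE)[[1]]
  half  <- length(parts) / 2
  blocks <- c(paste(parts[seq_len(half)], collapse = "-"),
              paste(parts[half + seq_len(half)], collapse = "-"))
  paste(sort(blocks), collapse = "|")
}

keys <- vapply(dfin$info, canonical_key, character(1), USE.NAMES = FALSE)

# Keep the first row of each mirrored pair, in original order
dfN <- dfin[!duplicated(keys), , drop = FALSE]
dfN
```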
Here is an approach which does not keep the original order:
dfin <- 'info-info-info-info-info-info
c1-10-20-c2-40-50
c2-1-2-c4-20-25
c4-20-25-c2-1-2
c2-40-50-c1-10-20'
df <- read.table(text=dfin, header=T, sep = "-", strip.white = T)
dfout<-as.data.frame(unique(t(apply(df, 1, sort))))
I extended your column name to make it work.

finding similar strings in each row of two different data frame

I would like to compare two data sets. One has many columns (df1; this example has two) and the other has a single column (df2).
First, I want to compare each row of the first column of df1 with every entry of df2. If a matching part is found, the row numbers in df1 and df2 should be written out.
For example:
Column 1 of df1 shares parts with df2: Q9Y6Q9 in row 3 of df1 matches Q9Y6Q9 in row 4 of df2, so the output is 3-4, and likewise for the others.
Maybe you should normalize your data first. For instance, you could do:
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
This function can be applied to each column of df1 as well as to the lookup table df2:
s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")
With this normalized data, it is easy to generate the output you are looking for:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(s[found], pos, sep="-"))
#change the last line to "return(as.character(pos))" to get only the result as in the comment
}
process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4" "3-15" "7-16"
The output is not exactly the same as in the question, but this may be due to manual lookup errors.
In order to iterate over all columns of df1, you could use lapply:
res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)
Now, res is a list indexed by the column names of df1 (the output below comes from the as.character(pos) variant of process):
res[["sample_1"]]
# [1] "4" "1" "4"
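Because the question's df1 and df2 are not shown, here is a toy run of the same normalize-then-lookup idea. The IDs (P1, P2, ...) and the column name ids are made up for illustration, and normalize is a slightly condensed variant of the function above that uses strsplit and lengths, not the answer's exact code.

```r
# Map every ID to the row it came from: names = IDs, values = row numbers
normalize <- function(x, delim) {
  x <- gsub("[()]", "", as.character(x))            # drop parentheses
  pieces <- strsplit(x, delim, fixed = TRUE)         # one vector per row
  setNames(rep(seq_along(pieces), lengths(pieces)), unlist(pieces))
}

# Hypothetical data
df1 <- data.frame(sample_1 = c("P1;P2", "P3", "P4;P5"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ids = c("P9,P3", "P5", "P1"),
                  stringsAsFactors = FALSE)

s1     <- normalize(df1$sample_1, ";")
lookup <- normalize(df2$ids, ",")

# Look up each df1 ID in df2; IDs without a match give NA
hits  <- lookup[names(s1)]
found <- !is.na(hits)

# "row in df1"-"row in df2" for every shared ID
res <- paste(s1[found], hits[found], sep = "-")
res
# [1] "1-3" "2-1" "3-2"
```

If an ID occurs in several rows of df2, indexing by name returns only its first occurrence, so repeated IDs would need extra handling.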

R: control auto-created column names in call to rbind()

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
Set the argument stringsAsFactors to FALSE, which stores the values as characters:
df <- data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors = FALSE)
df2 <- rbind(df, c('Horse', 'Dog', 'Cat'))
df2
first second third
1 A B C
2 Horse Dog Cat
sapply(df2, class)
first second third
"character" "character" "character"
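Later, if you want to use factors, you could convert the columns like this (as.data.frame called on an existing data frame does not convert its character columns, so each column is converted explicitly):
df2[] <- lapply(df2, factor)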
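Coming back to the original question of controlling the column names: one way to avoid relying on auto-generated names is to turn every new row into a one-row data frame that carries the target names before binding it. The following is a minimal sketch assuming all columns are character; add_row is a hypothetical helper, not a base R function.

```r
# Empty data frame that already has the desired names and types
df <- data.frame(A = character(), B = character(), C = character(),
                 stringsAsFactors = FALSE)

# Hypothetical helper: wrap the values in a one-row data frame that
# reuses the names of d, then bind it to d
add_row <- function(d, values) {
  new_row <- as.data.frame(as.list(values), stringsAsFactors = FALSE)
  names(new_row) <- names(d)
  rbind(d, new_row)
}

df <- add_row(df, c("A", "B", "C"))
df <- add_row(df, c("P", "D", "Q"))
df
```

Growing a data frame row by row copies it on every call, so for many rows it is usually cheaper to collect the rows in a list and bind them once at the end.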
