R: change column names, retain part of colnames

I have a df with colnames thus:
Resampled.Band.1..raster.bsq...404.502014.Nanometers.
...
Resampled.Band.74..raster.bsq...950.851990.Nanometers.
I want them like this:
950.851990_nm
With:
orig_names <- names(all_roi_rfl)
new_name <- gsub("Resampled.Band.", "", orig_names)
and
new_name <- gsub(".Nanometers.", "_nm", new_name)
names(all_roi_rfl) <- new_name
I achieve part of what I want: to change the first and last parts of the colnames:
1..raster.bsq...404.502014_nm
I could repeat this to clean the colnames up most of the way.
But how do I deal with the part of the colnames that varies itself, the band number?

Extract the values that you want using regex and replace the column names.
x <- c('Resampled.Band.1..raster.bsq...404.502014.Nanometers.',
'Resampled.Band.74..raster.bsq...950.851990.Nanometers.')
sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', x)
#[1] "404.502014_nm" "950.851990_nm"
This extracts the number that occurs between "raster.bsq" and "Nanometers" and appends "_nm" to the extracted value.
In your case, to replace the column names, it would be:
names(all_roi_rfl) <- sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', names(all_roi_rfl))
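As a minimal sketch of the same idea on a whole data frame, it may help to check that every name matches the pattern before overwriting anything. The contents of all_roi_rfl below are made up for illustration, and the ^ and $ anchors plus the escaped dot in raster\\.bsq are small additions to the pattern above, not part of the original answer.

```r
# Hypothetical two-column data frame with names in the original style.
all_roi_rfl <- data.frame(
  Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(0.1, 0.2),
  Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c(0.3, 0.4)
)

# Capture the wavelength (digits, a dot, digits) between the run of dots
# after "raster.bsq" and the trailing ".Nanometers.".
pattern <- "^.*raster\\.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.$"

# sub() returns non-matching names unchanged, so stop early if any name
# does not fit the expected layout.
stopifnot(all(grepl(pattern, names(all_roi_rfl))))

names(all_roi_rfl) <- sub(pattern, "\\1_nm", names(all_roi_rfl))
names(all_roi_rfl)
```

The band number never needs special handling because the leading .* swallows everything up to "raster.bsq", whatever the band is.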

A similar answer to Ronak's but using gsub instead.
First generate a dataframe...
df <-
data.frame(
Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(1, 2, 1, 2),
Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c('a', 'b', 'c', NA))
Using gsub, identify the strings before and after the piece you want to extract:
colnames(df) <- gsub(".*raster.bsq...(.+).Nanometers.", "\\1_nm", colnames(df))
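One thing to keep in mind after renaming, shown here as a small sketch on the same example data: names such as 404.502014_nm start with a digit, so they are not syntactic R names and need [[ ]] or backticks when accessed. The escaped dots in the pattern are an optional tightening, not part of the answer above.

```r
df <- data.frame(
  Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(1, 2, 1, 2),
  Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c("a", "b", "c", NA)
)

# Same idea as above, with the literal dots escaped.
colnames(df) <- gsub(".*raster\\.bsq\\.+(.+)\\.Nanometers\\.", "\\1_nm", colnames(df))

# Non-syntactic names: use [[ ]] with a string, or backticks with $.
first_band <- df[["404.502014_nm"]]
same_band <- df$`404.502014_nm`
```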

Related

How to group column names and add suffixes to them?

I would kindly appreciate it if someone could help me with the task described below.
I have an R data frame with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.upper.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives me duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
How can I append _N to them, where N is assigned as shown above? I am searching for an automated approach because my real data frame contains hundreds of columns.
An option is to remove the substring at the end and wrap the result with make.unique
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Or another option is to use the sub output as a grouping variable and then append the sequence at the end
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
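For comparison, here is a self-contained sketch of both options. Note that make.unique leaves the first occurrence unchanged and appends .1, .2, ... to later ones, so it does not reproduce the requested _1/_2 suffixes exactly. The second part is a variant of the ave approach that computes the position within each group and the group size separately; the names base, idx, n and v2_numbered are chosen here for illustration.

```r
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
        "cols_width.min.(2,15]", "cols_width.upper.(1,15]")

# Strip the trailing ".(a,b]" interval label.
base <- sub("\\.\\(.*", "", v1)

# Option 1: make.unique keeps the first duplicate as-is.
v2_unique <- make.unique(base)
# "id" "cols_len.max" "cols_len.max.1" "cols_width.min" "cols_width.min.1" "cols_width.upper"

# Option 2: number every member of a duplicated group.
idx <- ave(seq_along(base), base, FUN = seq_along)  # position within group
n   <- ave(seq_along(base), base, FUN = length)     # size of the group
v2_numbered <- ifelse(n > 1, paste(base, idx, sep = "_"), base)
# "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
```

Keeping the group size as its own vector avoids mixing integers and empty strings inside a single ave call.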
data
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")

Using similar variable names in R, split/subset a large dataframe into multiple smaller ones

I have a dataset with more than 300 variables in the following manner:
create example data:
id <- c('a','b','c', 'd', 'e', 'f')
type <- c(1,2,3,1,2,3)
x_97 <- c(1,2,3,4,5,6)
y_97 <- c('q','w','r','t', 'y', 'i')
z_97 <- c(80,90,70,50,60,40)
x_98 <- c(7,8,9,4,5,6)
y_98 <- c('y', 'i', 'r','t','q','w')
x_99 <- c(4,5,5,6,1,2)
z_99 <- c(20,10,40,50,20,50)
w_99 <- c(8,9,7,4,5,NA)
my.data <- data.frame(id, type, x_97, y_97, z_97, x_98, y_98, x_99, z_99)
Please note: _97, _98, _99 are years 1997, 1998 and 1999.
expected outcome:
I want to split this big data frame into 3 smaller data frames by year on the basis of id and type.
initial thoughts:
I am creating a list:
my.list <- c("_97", "_98", "_99")
And now I want to write something like this:
newdata97 <- subset(my.data, all variables with the 1st object of my.list)
newdata98 <- subset(my.data, all variables with the 2nd object of my.list)
and so on.
question
I am not sure how to achieve the newdata frames as above. Can anyone please help?
Moreover, I think there must be a more elegant solution to this with something from apply family. Any idea?
Thank you very much for your help.
We can loop through 'my.list', use grep to find the column names that match each substring in 'my.list', and cbind them with the first two columns to create a list of data.frames
lst1 <- lapply(my.list, function(x) cbind(my.data[1:2],
my.data[grep(x, names(my.data))]))
If one of the columns among 'x', 'y', 'z' is missing for a year, we can add it and fill it with NA
lst1 <- lapply(lst1, function(x) {nm1 <- setdiff(paste0(c('x', 'y',
'z'), substring(names(x)[3], 2)), names(x)[-(1:2)]); x[nm1] <- NA; x})
Or, instead of creating the columns later, create the NA columns in 'my.data' first
my.data[setdiff(paste0(rep(c("x_", "y_", "z_"), each = 3),
97:99), names(my.data)[-(1:2)])] <- NA
and then use grep as above to create the list of data.frames
Or another option is to split based on the substring of the column names
lst1 <- lapply(split.default(my.data[-(1:2)],
sub(".*_", "", names(my.data)[-(1:2)])), function(x) cbind(my.data[1:2], x))
It is better to keep it as a list, but if we need individual data.frames in the global env, then name the list elements and use list2env (not recommended though)
names(lst1) <- paste0("newdata", substring(my.list, 2))
list2env(lst1, envir = .GlobalEnv)
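Putting the split.default option together as a runnable sketch on the example data (the names key_cols, year_cols and by_year are chosen here for illustration):

```r
my.data <- data.frame(
  id   = c("a", "b", "c", "d", "e", "f"),
  type = c(1, 2, 3, 1, 2, 3),
  x_97 = c(1, 2, 3, 4, 5, 6),
  y_97 = c("q", "w", "r", "t", "y", "i"),
  z_97 = c(80, 90, 70, 50, 60, 40),
  x_98 = c(7, 8, 9, 4, 5, 6),
  y_98 = c("y", "i", "r", "t", "q", "w"),
  x_99 = c(4, 5, 5, 6, 1, 2),
  z_99 = c(20, 10, 40, 50, 20, 50)
)

key_cols  <- c("id", "type")
year_cols <- setdiff(names(my.data), key_cols)

# The year is whatever follows the last underscore in the column name.
year <- sub(".*_", "", year_cols)

# split.default splits the data frame by column; each piece then gets
# the key columns put back in front.
by_year <- lapply(
  split.default(my.data[year_cols], year),
  function(part) cbind(my.data[key_cols], part)
)

names(by_year)           # "97" "98" "99"
names(by_year[["98"]])   # "id" "type" "x_98" "y_98"
```

Individual years are then available as by_year[["97"]] and so on, without creating separate objects in the global environment.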

Extract rows from data frame which have matches from vector, but matches must be all the way at the end of string in value

I have a data frame like the following:
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values = rnorm(3, 8, 3)
df <- data.frame(sampleid, values)
I also have a vector like the following:
matches <- c("632_CSF_CD8+", "632_CSF")
I want to extract the rows of this data frame whose sampleid value ends with one of the matches. From this example, you can see why the end of the string is important, as I have two samples which contain "632_CSF", but they are distinct samples. If I chose to change matches to only:
matches <- c("632_CSF")
then I want only the third row of the data frame to be returned, because it is the only one where the match occurs at the end of the sampleid.
How can this be achieved?
Thanks!
Just use $ in your pattern to indicate that it occurs at the end of the string.
grep("632_CSF$", sampleid, value=TRUE)
[1] "control_sdlkfjd_2632_CSF"
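To get the rows of the data frame rather than just the matching strings, the same anchored pattern can be used with grepl. This is a small sketch with fixed, made-up values in place of rnorm so that it is reproducible; note that a literal "+" has to be escaped, because it is a regex quantifier.

```r
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+",
              "control_sdlkfjd_2632_CSF_CD8+",
              "control_sdlkfjd_2632_CSF")
# Hypothetical values, fixed for reproducibility.
df <- data.frame(sampleid, values = c(8.1, 10.7, 7.0))

# "$" anchors the match to the end of the string.
hits_csf <- df[grepl("632_CSF$", df$sampleid), ]

# "\\+" matches a literal plus sign.
hits_cd8 <- df[grepl("632_CSF_CD8\\+$", df$sampleid), ]
```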
You can do this with stringr and some manipulation.
First, escape the regex metacharacters in each match; this is done with the quotemeta function.
The next step is to append $ to ensure the match is at the end of the string, and then join all the matches into one pattern with the regex OR operator, |.
The joined pattern is then used with str_detect to get a logical index.
library(stringr)
# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')
ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE TRUE TRUE
df[ind, ]
# sampleid values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3 control_sdlkfjd_2632_CSF 7.001628
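If the matches are literal suffixes rather than patterns, base R's endsWith() (available since R 3.3.0) may be a simpler alternative, since it does no regex interpretation and so needs no escaping. This is a different technique from the stringr approach above, sketched here with made-up values:

```r
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+",
              "control_sdlkfjd_2632_CSF_CD8+",
              "control_sdlkfjd_2632_CSF")
# Hypothetical values, fixed for reproducibility.
df <- data.frame(sampleid, values = c(8.1, 10.7, 7.0))
matches <- c("632_CSF_CD8+", "632_CSF")

# endsWith() needs character input (the column may be a factor in older R).
ids <- as.character(df$sampleid)

# One logical vector per suffix, combined with elementwise OR.
keep <- Reduce(`|`, lapply(matches, function(m) endsWith(ids, m)))

df[keep, ]
```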
I suggest making your dataset more regular.
library(tidyverse)
df_regular <- df %>%
separate(
sampleid,
into = c("patient_type",
"test_number",
"patient_group",
"patient_id"),
extra = "merge") %>%
mutate(patient_id = str_pad(patient_id, 9, side = c("left"), pad = "0"))
df_regular
df_regular %>%
filter(patient_group %in% "2632" & patient_id %in% "000000CSF")

Provide column and row names for data.frame in 1 line

I have a vector of row names (x) and I want to name my two columns "A" and "B".
I want to do it in one line of code - data.frame(row.names = x, "A", "B").
Please advise what I am doing wrong. Should I use multiple lines of code for this?
I am not quite sure what you are after. But you can re-name row and column names as below using dimnames - this can be extended to multidimensional arrays as well.
df <- data.frame(A=c(1:3), B=c(4:6))
dimnames(df)[[1]] <- row_names_vector
dimnames(df)[[2]] <- col_names_vector
Another option is
rownames(df) <- row_names_vector
colnames(df) <- col_names_vector
One line
dimnames(df) <- list(row_names_vector, col_names_vector)
Example
row_names_vector <- letters[1:3]
col_names_vector <- letters[1:2]
dimnames(df) <- list(row_names_vector, col_names_vector)
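If the goal is to create the data frame and name everything in a single call, two possible one-liners are sketched below. In data.frame(), column names come from the argument names (A = ..., B = ...), not from bare strings, which is likely why the attempt in the question did not work. The row names in x and the column contents are made up for illustration.

```r
x <- c("r1", "r2", "r3")  # hypothetical row names

# Columns are named by the argument names; row.names takes the vector.
df1 <- data.frame(A = 1:3, B = 4:6, row.names = x)

# An empty (all-NA) frame with both sets of names, via a matrix.
df2 <- as.data.frame(matrix(NA, nrow = length(x), ncol = 2,
                            dimnames = list(x, c("A", "B"))))
```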

R: control auto-created column names in call to rbind()

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
  X.A. X.B. X.C.
1    A    B    C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
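Another way to sidestep the auto-generated names, sketched here with hypothetical column names, is to collect the rows in a list, bind them once, and name the columns explicitly at the end:

```r
# Collect rows as plain character vectors.
rows <- list(c("A", "B", "C"),
             c("P", "D", "Q"))

# Bind all rows in one step; the result is a 2 x 3 character matrix.
m <- do.call(rbind, rows)

# Convert once and set the column names explicitly
# ("first", "second", "third" are placeholders).
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")
df
```

Binding once at the end also avoids growing a data frame row by row, which tends to be slow for many rows.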
Set the "stringsAsFactors" argument to FALSE, which stores the values as characters:
df <- data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors = FALSE)
df2 <- rbind(df, c('Horse', 'Dog', 'Cat'))
df2
  first second third
1     A      B     C
2 Horse    Dog   Cat
sapply(df2, class)
      first      second       third
"character" "character" "character"
Later, if you want to use factors, you could convert the columns like this:
df2[] <- lapply(df2, factor)
