Replace characters in column names gsub - r

I am reading in a bunch of CSVs that have stuff like "sales - thousands" in the title and come into R as "sales...thousands". I'd like to use a regular expression (or other simple method) to clean these up.
I can't figure out why this doesn't work:
#mock data
a <- data.frame(this.is.fine = letters[1:5],
                this...one...isnt = LETTERS[1:5])
#column names
colnames(a)
# [1] "this.is.fine" "this...one...isnt"
#function to collapse repeated periods
colClean <- function(x){
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
}
#run function
colClean(a)
#names go unaffected
colnames(a)
# [1] "this.is.fine" "this...one...isnt"
but this code does:
#direct change to names
colnames(a) <- gsub("\\.\\.+", ".", colnames(a))
#new names
colnames(a)
# [1] "this.is.fine" "this.one.isnt"
Note that I'm fine leaving one period between words when that occurs.
Thank you.

names(a) <- gsub(x = names(a), pattern = "\\.", replacement = "#")
You can use the gsub function to replace every . with another character, such as #.

Rich Scriven had the answer:
Define
colClean <- function(x){ colnames(x) <- gsub("\\.\\.+", ".", colnames(x)); x }
and then do
a <- colClean(a)
to update a.
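A minimal sketch of why the original version has no visible effect: R functions work on a copy of their argument, and because the last expression in colClean is an assignment, the function invisibly returns the assigned value (the new names) rather than the data frame. The name colCleanFixed is hypothetical and only used here to keep the two versions apart.

```r
a <- data.frame(this.is.fine = letters[1:5],
                this...one...isnt = LETTERS[1:5])

# Original version: the last expression is an assignment, so the function
# invisibly returns the assigned value (the new names), not the data frame.
colClean <- function(x) {
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
}
res <- colClean(a)
is.data.frame(res)   # FALSE: res is the character vector of cleaned names
colnames(a)          # unchanged: only the local copy x was renamed

# Returning x hands the modified copy back to the caller.
colCleanFixed <- function(x) {
  colnames(x) <- gsub("\\.\\.+", ".", colnames(x))
  x
}
a <- colCleanFixed(a)
colnames(a)
```

The caller still has to reassign the result (a <- colCleanFixed(a)), since nothing inside the function can change a itself.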

Related

How to add a character when saving files with lapply

I have the following list
L = list.files(".", ".txt")
which is
a.txt
b.txt
c.txt
and I want to apply some code to all files in that list, but I want to save the data frames under the same name plus a suffix to indicate that they have been modified. For example:
a_modified.txt
b_modified.txt
c_modified.txt
I'm currently using this code:
datalist = lapply(L, function(x) {
  DF = read.csv(x, sep = ",")
  DF$X = gsub("[:.:][[:digit:]]{1,3}","", DF$X)
  colnames(DF)[colnames(DF)=="X"] <- "ID"
  DF <- merge(DF, genes ,by="ID")
  write.csv(DF, x)
  return(DF)
})
I tried using
write.csv(DF, x+"_modified")
which was obviously wrong, as write.csv does not accept this exact operation.
Any ideas?
R has no + operator for character strings, so we need paste0 instead of +:
write.csv(DF, paste0(sub("\\.txt", "", x), "_modified.csv"))
Or this can be done within sub itself:
write.csv(DF, sub("\\.txt", "_modified.csv", x))
NOTE: the initial datasets were .txt files, while the names built above end in .csv.
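If the output should keep the .txt extension, as in the question, the same idea can be sketched as follows. The vector L is a stand-in for the result of list.files, and anchoring the pattern with $ is a suggestion rather than part of the original answer; it makes sure only a trailing ".txt" is replaced.

```r
L <- c("a.txt", "b.txt", "c.txt")   # stand-in for list.files(".", ".txt")

# Anchor the pattern with $ so only a trailing ".txt" is replaced.
out1 <- sub("\\.txt$", "_modified.txt", L)

# Equivalent with paste0 and tools::file_path_sans_ext (tools ships with R).
out2 <- paste0(tools::file_path_sans_ext(L), "_modified.txt")

out1
# [1] "a_modified.txt" "b_modified.txt" "c_modified.txt"
identical(out1, out2)
# [1] TRUE
```

Inside the lapply call, write.csv(DF, sub("\\.txt$", "_modified.txt", x)) would then write each result next to its input file.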

read tables and assign to a string in a loop in R

I'm sure there is a trivial answer to this, but I can't seem to find the right code. I have a list of files and a list of strings, and I would like to assign the contents of each file to a data frame named after the corresponding string. Then I would like to perform other operations on the data frames within the same loop. I also need to keep each data frame for downstream work. Here is my code:
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
fc_files <- c('fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt','fc21_g21_full_annot_uniq.txt')
# make dataframes
for (file in fc_files)
{ fc_n <- 1
g_n <- 1
print(file);
# THE BIT THAT DOESN'T WORK
assign(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T));
# HERE I EXPECT THE TOP OF MY DF TO BE PRINTED BUT IT ISN'T
head(data_fc14);
# I TRY THIS INSTEAD
do.call("<-",list(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T)))
# I TRY TO PRINT THE DF AGAIN BUT STILL NO LUCK
head(paste("data", fc_samples[fc_n], sep='_'))
# FIRST DOWNSTREAM THING I WOULD LIKE TO DO,
# WON'T WORK UNTIL I SOLVE THE DF ASSIGNMENT ISSUE
# names(paste("data", fc_samples[fc_n], sep='_'))[names(paste("data", fc_samples[fc_n], sep='_'))==c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc','DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc',
# 'NOVEL_fc')] <- c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ','GENE','AFFECTS','dbSNP','NOVEL')
# ITERATE TO THE NEXT FILE
fc_n <- fc_n+1
}
I tried solutions from here and here but it didn't help.
If anyone has an elegant solution to this then that would be great! Thanks in advance!
Fixing your code:
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
# Make dummy example files
fc_files <- file.path("example-data", c(
'fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt',
'fc21_g21_full_annot_uniq.txt'))
set.seed(123) ; dummy_df <-
setNames(
as.data.frame(replicate(12, rnorm(7))),
c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc',
'DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc','NOVEL_fc')
)
if (!dir.exists("./example-data")) dir.create("example-data")
invisible({
lapply(fc_files, write.table, x = dummy_df, sep = "\t")
})
# "fc_n <- 1" should be outside the loop:
fc_n <- 1
for (file in fc_files) {
  g_n <- 1
  assign(paste("data", fc_samples[fc_n], sep='_'),
         read.table(file,sep = "\t", header=T))
  # Copy data to be able to change its names
  f <- get(paste("data", fc_samples[fc_n], sep='_'))
  names(f)[names(f) == c('SAMPLE_fc','CHROM_fc','START_fc',
                         'REF_fc','ALT_fc','REGION_fc',
                         'DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc',
                         'dbSNP_fc','NOVEL_fc')] <-
    c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ',
      'GENE','AFFECTS','dbSNP','NOVEL')
  # Assign it back, now that names have been changed
  assign(paste("data", fc_samples[fc_n], sep='_'), f)
  fc_n <- fc_n+1
}
A "more elegant" way:
Using assign() is not considered best practice; it is usually better to work with lists.
That said, I occasionally use it myself, as there are sometimes good reasons to.
# For the '%>%' pipe
library(magrittr)
data <-
samples %>%
grep(pattern = "fc", value = TRUE) %>%
setNames(nm = .) %>%
lapply(grep, x = fc_files, value = TRUE) %>%
lapply(read.table, sep = "\t", header = TRUE) %>%
lapply(function(f) setNames(f, sub("_fc", "", names(f))))
identical(data_fc14, data$fc14)
# [1] TRUE
identical(data_fc18, data$fc18)
# [1] TRUE
identical(data_fc21, data$fc21)
# [1] TRUE
# Clean up
print(unlink("example-data", recursive = TRUE))
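For readers who prefer to avoid the pipe, the list-based idea can also be sketched in base R alone. The temporary directory, the shortened file names, and the two-column dummy table are hypothetical stand-ins for the real files; only the pattern (a named list filled by lapply, with the _fc suffix stripped from every column name) reflects the approach above.

```r
# Hypothetical stand-in files written to a temporary directory.
tmp <- tempfile("example-data")
dir.create(tmp)
fc_samples <- c("fc14", "fc18", "fc21")
fc_files <- file.path(tmp, paste0(fc_samples, "_full_annot_uniq.txt"))
dummy <- data.frame(SAMPLE_fc = 1:3, CHROM_fc = c("1", "2", "X"))
for (f in fc_files) write.table(dummy, f, sep = "\t", row.names = FALSE)

# One named list instead of several assign()ed variables.
data_list <- lapply(setNames(fc_files, fc_samples), function(f) {
  df <- read.table(f, sep = "\t", header = TRUE)
  names(df) <- sub("_fc$", "", names(df))   # strip the suffix from every column
  df
})

names(data_list)        # the sample names become the list names
names(data_list$fc14)   # column names without the _fc suffix

# Clean up the temporary files
unlink(tmp, recursive = TRUE)
```

Each data frame stays available for downstream work as data_list$fc14, data_list[["fc18"]], and so on, and further steps can be applied to all of them with another lapply.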
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
fc_files <- c('fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt','fc21_g21_full_annot_uniq.txt')
g_files <- c('g14_full_annot_uniq.txt','g18_full_annot_uniq.txt','g21_full_annot_uniq.txt')
# make dataframes
df_names <- c("data_fc14","data_fc18","data_fc21")
fc_n <- 1
for (file in fc_files)
{
  assign(df_names[fc_n], read.table(file,sep = "\t", header=T)); #WORKS
  #do.call("<-",list(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T))); #ALSO WORKS
  print(head(df_names[fc_n]))
  print(head(eval(as.symbol(df_names[fc_n]))))
  df <- eval(as.symbol(df_names[fc_n]))
  names(df)[names(df) == c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc','DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc',
                           'NOVEL_fc')] <- c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ','GENE','AFFECTS','dbSNP','NOVEL')
  assign(df_names[fc_n], df)
  print(head(eval(as.symbol(df_names[fc_n]))))
  print(file);
  fc_n <- fc_n+1
}
Thanks to all who helped. In the end I solved it using the advice from "apom", as it is the most intuitive for more novice R users.

R: Reading number with comma and negative number with negative sign at the end

I am having problems reading a CSV file that has a number in a strange format. I would like to read the value as a number into R.
I am reading the CSV file using read.csv normally into a DF.
The problem is that one of the columns is read as a factor variable.
Example:
CSV file:
713,78-;713,78;577,41-;577,41;123,82-;123,82
After I read it into a dataframe the result is:
[1] 713,78- 713,78 577,41- 577,41 123,82- 123,82
6 Levels: 713,78- 713,78 577,41- 577,41 123,82- 123,82
In the case illustrated above, I would like the following output:
[1] -713.78 713.78 -577.41 577.41 -123.82 123.82
where the column would be of class numeric.
It should work in general:
fixData <- function(x)
{
  x <- gsub(',', '.', x)
  x[grep('-$', x)] <- paste0('-', x[grep('-$', x)])
  x <- as.numeric(sub('-$', '', x))
  return(x)
}
myData <- read.csv2(file, stringsAsFactors = F)
fixedData <- sapply(myData, fixData)
That's an ugly number format.
This should get it to what you want it to be.
x <- factor(c("713,78-", "713,78", "577,41-", "577,41", "123,82-", "123,82"))
scalar <- ifelse(grepl("-", x), -1, 1)
x <- as.character(x)
x <- gsub(",", ".", x)
x <- gsub("-", "", x)
x <- as.numeric(x) * scalar
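Both approaches can be folded into one small helper. This is only a sketch: the function name parse_trailing_minus is hypothetical, and it assumes every value has at most one decimal comma and an optional trailing minus sign, as in the example data.

```r
# Hypothetical helper: move a trailing minus to the front, then switch the
# decimal comma to a point before converting to numeric.
parse_trailing_minus <- function(x) {
  x <- as.character(x)                 # factors must become text first
  x <- sub("^(.*)-$", "-\\1", x)       # "713,78-" -> "-713,78"
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

x <- factor(c("713,78-", "713,78", "577,41-", "577,41", "123,82-", "123,82"))
parse_trailing_minus(x)
# [1] -713.78  713.78 -577.41  577.41 -123.82  123.82
```

Converting the factor with as.character first matters: calling as.numeric directly on a factor returns its internal level codes, not the values shown.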

How to convert special value in R data-frame to NA?

I have a data-frame containing # as a missing value in multiple columns. How can I convert all such #s to NAs?
is.na(dat) <- dat == "#"
will do the trick (where dat is the name of your data frame).
You can do this a few ways. One is to re-read the file with the na.strings argument set to "#":
read.table(file, na.strings = "#")
Another would be to just change the values in the data frame df with:
df[df == "#"] <- NA
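A small worked example of the in-place replacement, using a made-up data frame dat. One thing the replacement alone does not do is change column types: a numeric column that contained "#" will have been read as text, so the sketch re-runs type.convert afterwards (an extra step assumed here, not part of the answers above).

```r
# Made-up data: "#" marks a missing value, so score was read as text.
dat <- data.frame(id = c("a", "#", "c"),
                  score = c("1.5", "2", "#"),
                  stringsAsFactors = FALSE)

dat[dat == "#"] <- NA                 # replace every "#" cell with NA

# Columns are still character; let R re-guess their types.
dat[] <- lapply(dat, type.convert, as.is = TRUE)

str(dat)
```

Re-reading the file with na.strings = "#" avoids this second step, because the columns are typed correctly on import.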
I have written a function makemeNA that is part of my "SOfun" package.
The function looks like this (in case you don't want to get the package just for this function):
makemeNA <- function (mydf, NAStrings, fixed = TRUE) {
  if (!isTRUE(fixed)) {
    mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
    NAStrings <- ""
  }
  mydf[] <- lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings))
  mydf
}
Usage would be:
makemeNA(df, "#")
Get the package with:
library(devtools)
install_github("mrdwab/SOfun")
