match part of names in data.frame to new column - r

I have a data.frame with one column, Sample.Names. The names encode where the samples come from, for instance c(RT4_4, RT3_6, RT4_2, RT3_9, RT5_5): RTx is the name of the site, followed by a number.
I want a new column that tells me which site each sample is from. If Sample.Names contains RT4 -> df$Site == RT4
I don't know if there is a function that lets you look at part of a name. My idea was
df$Site <- with(df, ifelse(df$Sample.Name %in% "RT4","RT4", ifelse(df$Sample.Name %in% "RT3","RT3","RT5")))
this doesn't work

You can use sub:
df$Site <- sub("_.+", "", df$Sample.Name)
The pattern "_.+" removes the underscore and everything after it, so this also works when the number has several digits.
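As a small sketch of why the original attempt fails and what sub does instead (the data frame below is made up from the sample names in the question): %in% compares whole values, so "RT4_4" never equals "RT4", whereas sub and grepl operate on parts of each string.

```r
# Toy data built from the names given in the question
df <- data.frame(Sample.Name = c("RT4_4", "RT3_6", "RT4_2", "RT3_9", "RT5_5"),
                 stringsAsFactors = FALSE)

# %in% tests whole-value equality, so no element matches "RT4"
whole_match <- df$Sample.Name %in% "RT4"

# sub() drops the underscore and everything after it, leaving the site
df$Site <- sub("_.+", "", df$Sample.Name)

# grepl() is the partial-match test the nested ifelse() was reaching for
is_rt4 <- grepl("^RT4_", df$Sample.Name)

print(df)
```

If only a yes/no test per site is needed, grepl is enough; sub has the advantage that new sites such as RT6 are handled without touching the code.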

Related

Extracting Multiple Dataframes from a folder and using them in multiple functions

I am new to R and need help working out a scenario.
For the first part of the problem:
I have a folder containing multiple SAS files. The folder's path is given in an Excel file, which I read as below:
Spec <- read_excel("file", sheet = "first")
Then I list the datasets in the folder with the code below. (There are 3 datasets in the folder, "aa.sas7bdat", "bb.sas7bdat" and "cc.sas7bdat"; it can be any number depending on the folder, but this example uses 3.)
Path <- Spec$`Source Data Path`[1]  # the path is in the first row of the first column of the Excel file
Files <- list.files(path = Path, pattern = "\\.sas7bdat$", full.names = FALSE)
Then I loop over the files to collect the dataset names and read each file for further processing (as explained later):
Final_List <- NULL
read_files <- list()  # named list that will hold the dataframes
for (y in Files) {
  # strip the extension and upper-case the name, e.g. "aa.sas7bdat" -> "AA"
  List <- toupper(unlist(strsplit(y, split = ".sas7bdat", fixed = TRUE)))
  Final_List <- c(Final_List, List)
  read_files[[List]] <- read_sas(file.path(Path, y))
}
print(Final_List)
The loop yields the three names "AA", "BB", "CC" in Final_List (with the corresponding dataframes in read_files), and we now need to access these dataframes from another function.
Now for the second part.
All the dataframes need to be filtered dynamically, based on one value of one column of a single dataframe.
Let's say the user's input is the value "Male" from column "Gender" of dataframe "AA" (it can be any value from any column of any dataframe, as this needs to be DYNAMIC).
Column_Name_Value <- "MALE"  # from the column "Gender", selecting only male values
Dataframe_Name <- "AA"
I have created 3 functions to help in filtering
1st Function to filter the unique values of "Gender" (or any column name) from dataframe "AA"
Unique_Value_Fun <- function (dataframe, value) {
Unique_Value1 <- dataframe %>% distinct({{value}}) %>% filter({{value}} != "")
Unique_Value <- unlist(Unique_Value1)
return(Unique_Value)
}
Unique_Value <- Unique_Value_Fun({{Dataframe_Name}}, UQ(sym(Column_Name_Value)))
Now we have dataframe AA with only "Male" values. All the dataframes share a common ID column, so once the first dataframe is filtered by the unique value, only the matching IDs remain in it. We now need to filter every other dataset in the folder to those same IDs. The catch is that the other dataframes do not share the same column names or values; the only column they all have in common is "ID", so that is what must be used to filter the rest.
The code below contains the remaining 2 functions:
for (z in Unique_Value) {
print(z)
First_dataframe_Fun <- function (dataframe, value) {
dataframe %>% filter({{value}} == {{z}})
}
First_dataframe <- First_dataframe_Fun({{Dataframe_Name}}, UQ(sym(Column_Name_Value)))
If I plug the dataframe names directly into the code, it works well (i.e., it is hardcoded, not dynamic):
AA <- First_dataframe
BB <- BB %>% filter(ID %in% First_dataframe$ID)  # filter by the IDs of the first dataframe
CC <- CC %>% filter(ID %in% First_dataframe$ID)
Based on the IDs of the first dataframe, the rest of the dataframes need to be filtered dynamically. To make it dynamic I tried an if condition inside a for loop, but it didn't work out.
Please suggest a logic or similar code that makes this dynamic: if I provide any value from any dataframe, it should filter the rest of the datasets by the matching IDs (there can be any number of dataframes; this example uses only 3).
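One possible way to make this dynamic, sketched in base R under some assumptions: the dataframes are kept in a named list (like read_files above) rather than as separate objects, the column to filter on is passed as a string, and every dataframe has a column called "ID". The function name filter_all_by_id and the toy data are hypothetical, not part of the question.

```r
# Toy stand-ins for the SAS datasets; only the ID column is shared
read_files <- list(
  AA = data.frame(ID = 1:4, Gender = c("MALE", "FEMALE", "MALE", "FEMALE")),
  BB = data.frame(ID = c(1, 2, 3, 5), Score = c(10, 20, 30, 40)),
  CC = data.frame(ID = c(3, 4, 6), Visit = c("a", "b", "c"))
)

# Hypothetical helper: filter one dataframe by column/value, then restrict
# every dataframe in the list to the IDs that survived
filter_all_by_id <- function(dfs, df_name, col_name, value, id_col = "ID") {
  first <- dfs[[df_name]]
  keep <- first[[col_name]] %in% value      # rows of the chosen dataframe to keep
  ids <- unique(first[[id_col]][keep])      # IDs of those rows
  lapply(dfs, function(d) d[d[[id_col]] %in% ids, , drop = FALSE])
}

filtered <- filter_all_by_id(read_files, "AA", "Gender", "MALE")
print(filtered)
```

Because the dataframe name, column name and value are all plain strings, they can come straight from user input, and the list can hold any number of dataframes.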

Modifying an object referenced by "get()" in R

Apologies if this has been asked before. It's at the limit of my understanding of R, so I'm not even sure of the correct language in which to couch the query (hence, my inability to identify duplicate questions).
In my environment, I have an unknown number of objects (dataframes), each of which has an unknown number of columns that have meaningful names but with nonsense endings, which make it hard to reference them. The meaningful parts of the column names are usually followed by a double period and some further text. I want to automate finding and removing the meaningless suffixes. All the objects I want to modify have ".dat" in their names. Here's my attempt at an example:
# create some objects in my environment
a <- "a string, not of interest to me"
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)
# find the dataframes that I want to manipulate
dfs <- ls(pattern = "\\.dat")  # escape the dot so it matches a literal "."
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# and, the bit that doesn't work: change the problematic column names to their shorter alternatives
names(get(df))[problem.cols] <<- parts
return(0)
})
If I run this line by line, it does everything I want, up to and including names(get(df))[problem.cols], which it knows are the names of the columns in the dataframe I'm trying to alter. However, it won't assign the altered names to that, yielding the error message: Error in get(*tmp*) : invalid first argument.
I'm open to alternative approaches to achieve my desired end-point. However, I'm also intrigued by why this doesn't work and how, more generally, it's possible to alter an object referenced using "get()". Thanks in advance for any advice - and apologies if this is so naive it's been a waste of your time just reading it.
FWIW, I can see the similarity to this question but I can't adapt the answer to my needs.
Actually, I eventually made the link to using the "assign" function. This seems to work (so I've posted it here, in case it helps anyone else) - but I'd still be interested in alternative solutions:
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# change the problematic column names to their shorter alternatives
nms[problem.cols] <- parts
names(dat) <- nms
assign(df, dat, envir = .GlobalEnv)
return(0)
})
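An alternative sketch that avoids get() and assign() inside the loop: collect the dataframes into a named list with mget(), clean the names there, and (optionally) write them back with list2env(). As far as I can tell, the original line fails because names(get(df)) <- value asks R to assign into the result of a function call, which would need a replacement function `get<-`, and base R does not provide one. The single sub() call below assumes that everything from the first ".." onward is the unwanted suffix.

```r
# Same example objects as in the question
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)

# Collect the dataframes whose names end in ".dat" into a named list
dfs <- mget(ls(pattern = "\\.dat$"))

# Strip everything from the first ".." onward in each column name
cleaned <- lapply(dfs, function(dat) {
  names(dat) <- sub("\\.\\..*$", "", names(dat))
  dat
})

# Optional: overwrite the original objects with the cleaned versions
invisible(list2env(cleaned, envir = environment()))

print(names(b.dat))
print(names(c.dat))
```

Keeping the dataframes in the list (and skipping the list2env step) is often easier to work with when the number of objects is unknown.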

Change complicated strings in R with gsub or stringr

I have a column of a data frame that has thousands of complicated sample names like this
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
I am trying, with no success, to change the sample names into the following
16.3R1, 16.3R2, 2.3R1, 2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub but did not get the desired names.
You can use sub to extract the parts:
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ matches one or more digits. The parts of the pattern enclosed in () are called capture groups. Here we capture one or more digits (group 1), then an underscore followed by one or more digits (group 2), and finally "R" followed by one or more digits (group 3). The captured values are referred to with backreferences: \\1 is the first group, \\2 the second, and so on.
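To see what each capture group picks up, here is a small sketch using regexec() and regmatches() from base R with the same pattern. The variable names sample_names and groups are only for illustration.

```r
sample_names <- c("16_3_S16_R1_001", "16_3_S16_R2_001",
                  "2_3_S2_R1_001", "2_3_S2_R2_001")

# regexec() returns the full match plus one entry per capture group
m <- regmatches(sample_names,
                regexec("(\\d+)_(\\d+)_.*(R\\d+).*", sample_names))

# One row per sample: column 1 is the full match, columns 2-4 are groups 1-3
groups <- do.call(rbind, m)
print(groups)

# Reassembling the groups by hand gives the same result as the sub() call
rebuilt <- paste0(groups[, 2], ".", groups[, 3], groups[, 4])
print(rebuilt)
```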
Alternatively, if you split each string in sample on "_", you need only the 1st, 2nd and 4th parts:
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))
Here is one way you could do it.
It helps to put the names in a data frame with a named column, which is what I did below; I called the column "cats".
trial <- data.frame( "cats" = character(0))
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The column needs to be a factor so that its levels can be renamed, so convert it with as.factor():
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà

Order df columns according to a target vector (but the names match only partially)

I have a data.frame (PC) that looks like this:
http://i.stack.imgur.com/NWJKe.png
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
http://i.stack.imgur.com/vQ48u.png
I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.
PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")
I started by changing the names in the age matrix to replace the '-' by '.':
new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID
Then, I ordered the rows of the age matrix (identified by SUBJID) by age:
sort.age <- with(age, age[order(AGE) , ])
sort.age <- na.omit(sort.age)
I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns of the PC matrix).
age.id <- sort.age$SUBJID
But then I am blocked since the names on the PC matrix and the age matrix are not the same... Could someone please help me?
Thank you very much in advance!
Svalf
It would have been better to show the example without using an image. Suppose there are two character vectors,
str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1')
str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
representing the column names of 'PC' and the 'SUBJID' column of the 'age' dataset (after replacing the - with . and sorting by age). We remove the suffix by matching a . followed by 4 digits (\\.\\d{4}) and everything after it up to the end of the string (.*$), and replacing the match with ''. Matching the shortened names against str2 then gives the order:
str1N <- sub('\\.\\d{4}.*$', '', str1)
str1[order(match(str1N, str2))]
#[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
#[3] "GTEX.PFPP.0007.SM.2D8W1"
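Applied to the data frame itself, the same idea might look like the sketch below. The toy PC data frame and its non-GTEX column "Name" are assumptions standing in for the real data shown in the images; age.id is the sorted SUBJID vector from the question.

```r
# Toy stand-in for PC: one annotation column plus GTEX sample columns
PC <- data.frame(
  Name = c("g1", "g2"),
  GTEX.PFPP.0007.SM.2D8W1 = c(1, 2),
  GTEX.N7MS.0007.SM.2D7W1 = c(3, 4),
  GTEX.N7MS.0008.SM.4E3J1 = c(5, 6)
)
age.id <- c("GTEX.N7MS", "GTEX.PFPP")   # subject IDs already sorted by age

# Which columns are GTEX samples, and which subject does each belong to?
is_gtex <- grepl("^GTEX\\.", names(PC))
subject <- sub("\\.\\d{4}.*$", "", names(PC)[is_gtex])

# Order the GTEX columns by the position of their subject in age.id
gtex_sorted <- names(PC)[is_gtex][order(match(subject, age.id))]

# Keep the non-GTEX columns first, followed by the sorted GTEX columns
PC_sorted <- PC[c(names(PC)[!is_gtex], gtex_sorted)]
print(names(PC_sorted))
```

Columns whose subject is missing from age.id get NA from match() and are placed last by order().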

automatic column prefix with cbind and just one column

I have some trouble with a script that uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast that column to a data frame.
Is there a way to get around this behaviour?
In my example, it works fine for columns starting with a but not for b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))
A solution would be to create an intermediate data frame with the log values and rename its columns:
logb = log(df[grep('^b', names(df))])
colnames(logb) = paste0('log.',names(logb))
cbind(df, logb)
What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns).
The column names only fail to get the prefix when just one column is selected.
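To get the same prefix regardless of how many columns match, one option is to set the names explicitly instead of relying on cbind's automatic prefixing. This is a sketch along the lines of the intermediate-data-frame answer; the helper name cbind_log and its prefix argument are made up for illustration.

```r
df <- data.frame(a1 = c(1, 2, 3), a2 = c(3, 4, 5), b1 = c(6, 7, 8))

# Hypothetical helper: log-transform the columns matching `pattern` and
# append them with an explicit prefix, whether one or many columns match
cbind_log <- function(data, pattern, prefix = "log.") {
  selected <- data[grep(pattern, names(data))]   # stays a data frame
  logged <- log(selected)
  names(logged) <- paste0(prefix, names(selected))
  cbind(data, logged)
}

res_a <- cbind_log(df, "^a")   # two matching columns
res_b <- cbind_log(df, "^b")   # a single matching column
print(names(res_a))
print(names(res_b))
```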
