Question regarding changing the row and column titles of a big matrix - r

I have a big matrix with complex column names. I'm supposed to replace each name by splitting it into the parts separated by "_".
A sample name:
d__Bacteria.p__Firmicutes.c__Clostridia.o__Lachnospirales.f__Lachnospiraceae.g__Tuzzerella.__
My target is to extract only the family name of each group, the one ending in "aceae" (the sixth name), from every column name and use it in place of the big complex name.
May I ask you to help me?
I made vectors of the column and row names, used library(stringr), and ran
strsplit(colname_matrix, "_")
I now have a list of split names, but I do not know how to drop the rest and keep only the names ending in "aceae", then apply that to all the row and column names.
The matrix is symmetric.

x <- "d__Bacteria.p__Firmicutes.c__Clostridia.o__Lachnospirales.f__Lachnospiraceae.g__Tuzzerella.__"
library(stringr)
str_extract(x, "(?<=f__)[^.g]+")
Base R, if you do not want the "aceae" suffix:
sub(".*\\.f__", "", sub("aceae.*", "", x))
or
y <- str_split(x, "__")
y[[1]][str_detect(y[[1]], "aceae")]
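To apply this to every row and column name at once (the matrix is symmetric, so the same vector serves for both), a single vectorized sub() call is enough. A minimal sketch, assuming every name contains an f__ segment like the sample above:

```r
# One sample name; a real matrix would have many such column names
nm <- "d__Bacteria.p__Firmicutes.c__Clostridia.o__Lachnospirales.f__Lachnospiraceae.g__Tuzzerella.__"

# Capture everything between "f__" and the next "." (the family name)
fam <- sub(".*f__([^.]+)\\..*", "\\1", nm)
fam  # "Lachnospiraceae"

# For a symmetric matrix m, rename both dimensions in one go:
# rownames(m) <- colnames(m) <- sub(".*f__([^.]+)\\..*", "\\1", colnames(m))
```

If a name happens to lack an f__ segment, sub() leaves it unchanged, so it is worth checking the result with grepl("aceae", fam) before assigning it back.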

Related

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble extracting the last letter from the sbp.id column and using the result to add a new column called "sex" to the data frame below. I initially tried grepl to separate the rows ending in F from the ones ending in M, but couldn't figure out how to use that to create a new column containing just M or F, depending on the last letter of each row in sbp.id.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last character, works irrespective of whether the other characters are letters or digits, and even if the elements all have different lengths:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because all of the functions are vectorized by default.
Using regex you can extract the last part from sbp.id
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Or another way would be to remove all the numbers.
df$sex <- sub('\\d+', '', df$sbp.id)
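Putting the three approaches together on the sample data (the data frame below is reconstructed from the excerpt in the question):

```r
df <- data.frame(sbp.id = c("125F", "13000M", "13120M", "13260M", "13480M"),
                 newID  = c(125, 13000, 13120, 13260, 13480))

# Last character, regardless of what precedes it
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
df$sex  # "F" "M" "M" "M" "M"

# The two regex versions give the same result
identical(df$sex, sub('.*([MF])$', '\\1', df$sbp.id))  # TRUE
identical(df$sex, sub('\\d+', '', df$sbp.id))          # TRUE
```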

Separating out 6 numerical values from a column in R where there are several delimiters

I have a csv with 2 columns but it should be 7. The first column is a numerical ID. The second column has the other six numerical values. However, there are several different delimiters between them. They all follow the same pattern: a numerical value, a dash ("-") or a colon (":"), eight spaces, and then the next numerical value, until the final numerical value, with nothing after it. It starts with a dash and alternates with a colon. For example:
28.3- 7.1: 62.3- 1.8: 0.5- 196
Some of these cells have missing values denoted by a single period ("."). Example:
24- .: 58.2- .: .- 174
I'm using R but I can't figure out how to accomplish this. I know it probably requires dplyr or tidyverse but I can't find what to do where there are different delimiters and spaces.
So far, I've only successfully loaded the csv and used "str()" to determine that the column with these six values is a factor.
(The original post included screenshots of the raw .csv, the result of read.csv, and the result of read.csv with a tab delimiter, as suggested in the comments.)
I would try just to sort out that first column, if it is the only one, by doing the following:
library(dplyr)
CBC_delim <- read.table('CBC.csv', sep = "\t", header = FALSE)
head(CBC_delim)
then split that first column into two, keeping both elements:
CBC_delim <- CBC_delim %>%
  mutate(column1 = as.character(column1)) %>%  # your column names will be different, maybe just V1
  mutate(col2 = sapply(strsplit(column1, ","), `[`, 1),
         col3 = sapply(strsplit(column1, ","), `[`, 2))
That should leave you with some basic tidy-up, such as deleting the original column1; you can check your column names with colnames(CBC_delim).
But also see:
how-to-read-data-with-different-separators
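Since the question's delimiters are a dash or a colon followed by spaces (not commas), here is a hedged sketch that splits on that pattern directly, using the two example strings from the question:

```r
rows <- c("28.3- 7.1: 62.3- 1.8: 0.5- 196",
          "24- .: 58.2- .: .- 174")

# Split each string on "-" or ":" followed by one or more spaces
parts <- strsplit(rows, "[-:]\\s+")

# "." marks a missing value; as.numeric() turns it into NA (with a warning)
vals <- lapply(parts, function(p) suppressWarnings(as.numeric(p)))
vals[[1]]  # 28.3  7.1 62.3  1.8  0.5 196.0
vals[[2]]  # 24.0    NA 58.2   NA   NA 174.0
```

Each element comes out with six values, so do.call(rbind, vals) would rebuild the six numeric columns alongside the ID.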

Rename dataframe columns by string matching in R

I am looping through a series of ids, loading 2 csvs for each, and applying some analysis to them. I need to rename the columns of one of the 2 csvs to match the row values of the other. I need to do this inside the loop in order to apply it to the csvs for every id.
I have tried renaming the columns like this:
names(LCC_diff)[2:length(LCC_diff)] <- c("Bare.areas" = "Bare areas",
                                         "Tree." = "Tree ", "Urban.areas" = "Urban areas",
                                         "Water.bodies" = "Water bodies")
where LCC_diff is a data frame and the first value in each pair is the original column name and the second is the name that I want to assign to that column. But this just replaces the column names in order and does not match them.
This is a problem because not all column names need to be replaced, and the csvs for different ids have these columns in different orders.
How do I match the original column names to the strings that I want to use to replace them?
Try renaming them first; it should be much easier if they have the same names.
library(stringr)
str_replace_all(c("Tree ","Bare areas")," ",".")
[1] "Tree." "Bare.areas"
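To rename by matching rather than by position, a named lookup vector works regardless of column order, and columns absent from the lookup are left alone. A minimal sketch, with LCC_diff made up for illustration (the question does not show its contents):

```r
LCC_diff <- data.frame(id = 1:2, Bare.areas = c(0.1, 0.2), Water.bodies = c(0.3, 0.4))

# old name -> new name
lookup <- c("Bare.areas"   = "Bare areas",
            "Tree."        = "Tree ",
            "Urban.areas"  = "Urban areas",
            "Water.bodies" = "Water bodies")

# Replace only the names that appear in the lookup; the rest are untouched
hit <- names(LCC_diff) %in% names(lookup)
names(LCC_diff)[hit] <- lookup[names(LCC_diff)[hit]]
names(LCC_diff)  # "id" "Bare areas" "Water bodies"
```

Because the match is by name, the same two lines work inside the loop even when different ids have the columns in different orders or are missing some of them.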

R - Subset based on column name

My data frame has over 120 columns (variables) and I would like to create subsets based on column names.
For example, I would like to create a subset where the column name includes the string "mood". Is this possible?
I generally use
SubData <- myData[,grep("whatIWant", colnames(myData))]
I know very well that the "," is not necessary, and that colnames could be replaced by names, but that would not work with matrices, and I hate to change the formalism when changing objects.
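For example, with made-up column names (the question does not list the real ones):

```r
myData <- data.frame(mood.am = c(3, 4), mood.pm = c(5, 2), age = c(30, 41))

# Keep only the columns whose name contains "mood"
SubData <- myData[, grep("mood", colnames(myData))]
colnames(SubData)  # "mood.am" "mood.pm"
```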

Identifying dataframe columns by their first few characters

I have a dataframe in which the column names begin with certain characters:
> colnames(df)
[1] "p.crossfencing" "p.livestockdrinking" "v.livestocktrail"
[5] "v.landclearing" "v.grazelivestock" "v.useequipment"
Etc...
I'd like to select columns based on the first few characters (for example, the column names that begin with "v."). Basically, I'm trying to do the same thing that ls(pattern="") does for objects, but for column names within a dataframe.
EDIT: Answer by Thomas below put me on the right path. I needed to use:
j[grep("^v.",j)]
where j <- colnames(df).
Are you looking for df[,grep("^v.",names(df))]?
You could also write something as below:
df[, (grep(x = colnames(df), pattern = "^v."))]
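One caveat with "^v.": the dot is unescaped, so it matches any character (a column named "vx" would also match). Base R's startsWith() does a literal prefix test and sidesteps that; a small sketch using the column names from the question:

```r
df <- data.frame(p.crossfencing = 1, v.livestocktrail = 2, v.landclearing = 3)

# Literal prefix match, no regex metacharacters to worry about
df[, startsWith(names(df), "v."), drop = FALSE]

# Regex equivalent, with the dot escaped
df[, grep("^v\\.", names(df))]
```

drop = FALSE keeps the result a data frame even when only one column matches.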
