I have a list of barcodes with the format: AAACCTGAGCGTCAAG-1
The letters can be A, C, G or T and the number after the dash can be 1 - 16.
barcode = c('AAACCTGAGCGTCAAG-1',
'AAACCTGAGTACCGGA-1',
'AAACCTGCAGCTGCTG-1',
'AAACCTGCATCACGAT-3',
'AAACCTGCATTGGGCC-5',
'AAACCTGGTATAGTAG-10',
'AAACCTGGTCGCGTGT-1',
'AAACCTGGTTTCCACC-16',
'AAACCTGTCATGCATG-14',
'AAACCTGTCGCAGGCT-15',
'AAACGGGAGAACTCGG-1')
cluster = c(6,3,6,16,17,11,14,18,9,8,14)
df <- data.frame(Barcode = barcode, Cluster = cluster)
I need to subset this dataframe based on the -# at the end of the barcode. I have been using the following, and the problem is that it works for every number except 1.
> df[grep("([ACGT]-10){1}", df$Barcode),]
Barcode Cluster
6 AAACCTGGTATAGTAG-10 11
When I use the following, it will include all the barcodes that end in -1, as well as -10, -11, -12, -13, -14, -15 and -16.
> df[grep("([ACGT]-1){1}", df$Barcode),]
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
6 AAACCTGGTATAGTAG-10 11
7 AAACCTGGTCGCGTGT-1 14
8 AAACCTGGTTTCCACC-16 18
9 AAACCTGTCATGCATG-14 9
10 AAACCTGTCGCAGGCT-15 8
11 AAACGGGAGAACTCGG-1 14
>
Is there a regex that will include barcodes ending in -1, but exclude all other barcodes that end in numbers from 10 - 16?
I want to subset the dataframe so that I only get this:
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
>
Thanks!
How about:
df[grep("-1$", df$Barcode),]
This matches -1 at the end of the string: the $ anchor requires the 1 to be the final character, so barcodes ending in -10 through -16, which have another digit after the 1, do not match.
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
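A more general variant (just a sketch, with a hypothetical Sample column added for illustration) is to extract the number after the dash once and subset on it, which sidesteps the regex issue for any sample number:
# Pull the sample number after the dash into its own column, then filter on it
df$Sample <- as.integer(sub(".*-", "", df$Barcode))
df[df$Sample == 1, ]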
I think you can just use df[grep("([ACGT]-1$){1}", df$Barcode),]
You can just use a $ to specify the end of the string. See more information on "pattern" use here: http://www.jdatalab.com/data_science_and_data_mining/2017/03/20/regular-expression-R.html
The title is vague, but let me explain:
I have a non-vectorized function that outputs a 15-row table of volume estimates for a tree. Each row is a different measurement unit or portion of the input tree. I have a Tables argument to help the user decide what units and measurement protocol they're looking to find, but in 99% of use case scenarios, the output for a single tree's volume estimate is a tibble with more than one row.
I've removed ~20 other arguments from the function for demonstration's sake. DBH is a tree's diameter at breast height. Vol column is arbitrary.
Est1 <- TreeVol(Tables = "All", DBH = 7)
Est1
# A tibble: 15 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 1. Total_Above_Ground_Cubic_Volume 7 2
2 2. Gross_Inter_1/4inch_Vol 7 4
3 3. Net_Scribner_Vol 7 6
4 4. Gross_Merchantable_Vol 7 8
5 5. Net_Merchantable_Vol 7 10
6 6. Merchantable_Vol 7 12
7 7. Gross_SecondaryProduct_Vol 7 14
8 8. Net_SecondaryProduct_Vol 7 16
9 9. SecondaryProduct 7 18
10 10. Gross_Inter_1/4inch_Vol 7 20
11 11. Net_Inter_1/4inch_Vol 7 22
12 12. Gross_Scribner_SecondaryProduct 7 24
13 13. Net_Scribner_SecondaryProduct 7 26
14 14. Stump_Volume 7 28
15 15. Tip_Volume 7 30
the user can utilize the Tables argument as so:
Est2 <- TreeVol(Tables = "Scribner_BF", DBH = 7)
# A tibble: 3 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
The problem arises in that I'd like to write a vectorized version of this function that can calculate the volume for an entire .csv of tree inventory data. Ideally, I'd like the multi-row outputs that relate to a single tree to output as one long tibble, with each 15-row default output filtered by what the user passes to the Tables argument as so:
Est3 <- VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26))
# A tibble: 9 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
4 3. Net_Scribner_Vol 21 18
5 12. Gross_Scribner_SecondaryProduct 21 72
6 13. Net_Scribner_SecondaryProduct 21 76
7 3. Net_Scribner_Vol 26 8
8 12. Gross_Scribner_SecondaryProduct 26 78
9 13. Net_Scribner_SecondaryProduct 26 84
To achieve this, I wrote a for() loop that acts as the heart of the vectorized function. I've heard from multiple people that it's very inefficient (and I agree), but it illustrates, in principle, what I'd like to achieve. Nothing I've found on this topic suggests a better approach for a vectorized function like mine.
The general setup for the loop looks like this:
for(i in 1:length(DBH)){
  Output <- VectorizedTreeVol(Tables = Tables[[i]], DBH = DBH[[i]]) %>%
    purrr::reduce(dplyr::full_join, by = NULL) %>%
    suppressWarnings()
}
and in functions where the non-vectorized output is always a single row, the heart of its respective vectorized function doesn't need to be encased in a for() loop and looks like this:
Output <- OtherVectorizedFunction(Tables = Tables, DBH = DBH) %>%
  purrr::reduce(dplyr::full_join, by = ColumnNames) %>% # ColumnNames is a vector of the output's column names
  suppressWarnings()
This specific call to reduce() has worked pretty well when I've used it to vectorize the other functions in the project, but I'm open to suggestions regarding how to join the output tables. I've been stuck on this dilemma for a few months now, and any help regarding how to achieve what this for() loop is striving for in theory would be awesome. Is having a vectorized function that outputs a tibble like Est3 even possible? Any feedback/comments are much appreciated.
Given this function:
TreeVol <- function(DBH) {
data.frame(Tables = c("Tree_Vol", "Intercapillary_transfusion", "Woodiness"),
Vol = c(DBH^2, sqrt(DBH) + 3, sin(DBH)),
DBH)
}
We can map over the DBH values with purrr::map() and then bind_rows() the results into a single data.frame.
library(dplyr)  # provides %>% and bind_rows()

VecTreeVol <- function(DBH) {
  DBH %>%
    purrr::map(TreeVol) %>%
    bind_rows()
}
Result
> VecTreeVol(DBH = 1:3)
Tables Vol DBH
1 Tree_Vol 1.0000000 1
2 Intercapillary_transfusion 4.0000000 1
3 Woodiness 0.8414710 1
4 Tree_Vol 4.0000000 2
5 Intercapillary_transfusion 4.4142136 2
6 Woodiness 0.9092974 2
7 Tree_Vol 9.0000000 3
8 Intercapillary_transfusion 4.7320508 3
9 Woodiness 0.1411200 3
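If the real TreeVol() also takes the Tables argument and does its own filtering, the same idea extends directly; here is a sketch (assuming TreeVol() returns a tibble and that Tables is a single string applied to every tree):
library(purrr)
library(dplyr)

# Sketch: apply TreeVol() to each DBH value and row-bind the per-tree tibbles
VectorizedTreeVol <- function(Tables, DBH) {
  map_dfr(DBH, ~ TreeVol(Tables = Tables, DBH = .x))
}

# VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26))
# would then return one long tibble like Est3 above.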
powers = c(c(1:10), seq(from = 12, to=20, by=2));
While going through WGCNA, I came across this code, which I am not able to understand. Can anybody explain the meaning of this piece of code?
The code will create a vector of numbers stored in powers.
Specifically: 1:10 creates the numbers 1 2 3 4 5 6 7 8 9 10 (can read as 1 through 10) and seq(from = 12, to = 20, by = 2) creates a sequence of every other number from 12 to 20, i.e. 12 14 16 18 20.
powers will contain the following 15 numbers: 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
I am not familiar with the WGCNA package or with how powers is later passed to a function, but this is what powers contains.
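For reference, the inner c() around 1:10 is redundant; building the vector and printing it confirms the 15 values:
powers <- c(1:10, seq(from = 12, to = 20, by = 2))  # same result, without the inner c()
powers
# [1]  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20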
I have a dataset of customer ages and I want to make a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs are in this signature?
binCounts(x, idxs = NULL, bx, right = FALSE)
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about binCounts or even which package it is in, but here is a plain base R approach:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit = c(37, 46, 55, 64, 73, 82, 91, 100)
Labels = paste(head(lowerlimit, -1) + 1, lowerlimit[-1], sep = "-") # add one to get 38, 47, etc.
group = cut(Ages, lowerlimit, Labels) # determine which group each age belongs to
tab = table(group)                    # form a frequency table
as.data.frame(tab)                    # turn the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
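To answer the binCounts part directly: a sketch, assuming binCounts() from the matrixStats package, where bx gives the bin boundaries and right = TRUE makes the bins right-closed like cut() above:
library(matrixStats)
bx <- 0:7 * 9 + 37                      # 37 46 55 64 73 82 91 100
binCounts(Ages, bx = bx, right = TRUE)  # should reproduce the counts above: 3 7 7 14 10 6 3
# idxs can stay NULL; it only restricts the calculation to a subset of x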
I'm fairly new here and also fairly new to R so apologies if anything is unclear.
Basically, I have a csv table of numbers for each person, 1 number for each week for 38 weeks.
For example, Anthony has number 6 in week 1, 12 in week 2 and so on, these numbers are fairly random and range from 1-20.
I have taken the numbers from the table and saved them into a string, hence Anthony's string when printed would look like
"6 12 18 7 17 4 16 11 20 15 3 5 19 10 8 9 1 14 13 19 11 16 18 4 17 7 6 12 14 1 10 13 20 15 3 5 8 9"
What I'm trying to do is count the number of times numbers between 1 and 10 occur in runs of 3 consecutive values, then in runs of 4, and possibly 5.
For example, in this string 8, 9 and 1 occur consecutively and then 3, 5, 8 and 9 occur consecutively, so the number of occurrences is 2.
I've tried using str_count from the stringr package and also tried a few different functions located here - Count the number of overlapping substrings within a string
I can't seem to find a method/function to get this to output what I want (a simple count of the number of occurrences).
If anyone could provide any insight/help it would be greatly appreciated.
It would be easier to keep these as numbers. Here I use scan() to turn your string into a numeric vector, test whether each number is less than 10, and then call rle() on the logical result to calculate run lengths.
x <- "6 12 18 7 17 4 16 11 20 15 3 5 19 10 8 9 1 14 13 19 11 16 18 4 17 7 6 12 14 1 10 13 20 15 3 5 8 9"
rr <- rle(scan(text=x)<10)
Now I can mangle this into a data.frame and see which runs were longer than 2
subset(as.data.frame(unclass(rr)), values==T & lengths>2)
# lengths values
# 9 3 TRUE
# 17 4 TRUE
So we can see that we had a run of 3 and a run of 4.
I could clean this up by defining a function to turn the rle into a data.frame more easily and track the starting indexes
as.data.frame.rle <- function(x) {
  data.frame(unclass(x), start = head(cumsum(c(0, x$lengths)) + 1, -1))
}
and can then run
subset(as.data.frame(rle(scan(text=x)<10)), values==T & lengths>2)
# lengths values start
# 9 3 TRUE 15
# 17 4 TRUE 35
so we can see those runs start at positions 15 and 35.
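If all you need is the count itself, a small helper built on the same rle() idea works (count_runs is just a hypothetical name, using the same < 10 test as above):
# Count runs of at least n consecutive values below 10
count_runs <- function(values, n) {
  r <- rle(values < 10)
  sum(r$values & r$lengths >= n)
}

vals <- scan(text = x)
count_runs(vals, 3)  # 2 runs of length 3 or more
count_runs(vals, 4)  # 1 run of length 4 or more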
In this hypothetical scenario, I have performed 5 different analyses on 13 chemicals, resulting in a score assigned to each chemical within each analysis. I have created a table as follows:
---- Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
Chem_1 3.524797844 4.477695034 4.524797844 4.524797844 4.096698498
Chem_2 2.827511555 3.827511555 3.248136118 3.827511555 3.234398548
Chem_3 2.682144761 3.474646298 3.017780505 3.682144761 3.236152242
Chem_4 2.134137304 2.596921333 2.95181339 2.649076603 2.472875191
Chem_5 2.367736454 3.027814219 2.743137896 3.271122346 2.796607809
Chem_6 2.293110565 2.917318708 2.724156207 3.293110565 2.530967343
Chem_7 2.475709113 3.105794018 2.708222528 3.475709113 3.088819908
Chem_8 2.013451822 2.259454085 2.683273938 2.723554966 2.400976121
Chem_9 2.345123123 3.050074893 2.682845391 3.291851228 2.700844104
Chem_10 2.327658894 2.848729452 2.580415233 3.327658894 2.881490893
Chem_11 2.411243882 2.98131398 2.554456095 3.411243882 3.109205453
Chem_12 2.340778276 2.576860244 2.549707035 3.340778276 3.236545826
Chem_13 2.394698249 2.90682524 2.542599327 3.394698249 3.12936843
I would like to create columns corresponding to each analysis which contain the rank position for each chemical. For instance, under Analysis1, Chem_1 would have value "1", Chem_2 would have value "2", Chem_3 would have value "3", Chem_7 would have value "4", Chem_11 would have value "5", and so on.
We can use dense_rank from dplyr
library(dplyr)
df %>%
mutate_each(funs(dense_rank(-.)))
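Note that mutate_each() and funs() are deprecated in current dplyr; a sketch of the across() equivalent (assuming the score columns are the numeric ones) would be:
library(dplyr)
df %>%
  mutate(across(where(is.numeric), ~ dense_rank(-.x)))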
In base R, we can do
df[] <- lapply(-df, rank, ties.method="min")
In data.table, we can use
library(data.table)
setDT(df)[, lapply(-.SD, frank, ties.method="dense")]
To avoid the copies created by multiplying with -, as @Arun mentioned in the comments
lapply(.SD, frankv, order=-1L, ties.method="dense")
You can also do this in base R. Note that rank(), not order(), is what you need here: order() returns the positions of the sorted values, which is a different permutation from the ranks:
cbind("..." = df[,1], data.frame(do.call(cbind,
  lapply(df[,-1], function(x) rank(-x, ties.method = "min")))))
... Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
1 Chem_1 1 1 1 1 1
2 Chem_2 2 2 2 2 4
3 Chem_3 3 3 3 3 3
4 Chem_4 12 11 4 13 12
5 Chem_5 7 6 5 11 9
6 Chem_6 11 8 6 9 11
7 Chem_7 4 4 7 4 7
8 Chem_8 13 13 8 12 13
9 Chem_9 8 5 9 10 10
10 Chem_10 10 10 10 8 8
11 Chem_11 5 7 11 5 6
12 Chem_12 9 12 12 7 2
13 Chem_13 6 9 13 6 5
If I'm not mistaken, you want to have the column-wise rank of your table. Here is my solution:
m = data.matrix(df)  # converts the data frame to a matrix; convert your data accordingly
apply(m, 2, function(x) rank(x))   # ascending
apply(m, 2, function(x) rank(-x))  # descending
However, I believe you could solve it by yourself with the help of the answers to this question
Get rank of matrix entries?