How to count specific occurrences of strings within chr data into a new column - r

I have a data.table similar to the one below, with character data held in the Description column. I need to count the number of times certain strings of characters occur in each value of that column, and store the total in the Occurrences column.
In the table below, that means counting the number of times "A18" or "A19" appears.
ID  Date        Description             Occurrences
1   2020-01-01  A1901,A1804,A2008,AB06  2
2   2020-01-14  A1402,A1805,A1902       2
3   2018-02-25  A1702                   0
I'm very new to R and data.table, so I haven't tried much. I've searched, but only found how to count occurrences of whole strings, not matches within them.

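Use str_count from stringr, which counts the number of regex matches in each string:
library(stringr)
library(dplyr)
# "A18|A19" matches either "A18" or "A19" anywhere in Description
df <- df %>%
  mutate(Occurrences2 = str_count(Description, "A18|A19"))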
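If installing stringr is not an option, the same count can be sketched in base R. This is only an illustration under assumptions: the data frame below is rebuilt by hand from the question's table, and the column name Occurrences is taken from the desired output.

```r
# Hypothetical reconstruction of the question's data
df <- data.frame(
  ID = 1:3,
  Description = c("A1901,A1804,A2008,AB06", "A1402,A1805,A1902", "A1702"),
  stringsAsFactors = FALSE
)

# gregexpr() locates every match of the pattern in each string,
# regmatches() extracts them, and lengths() counts the matches per row
matches <- regmatches(df$Description, gregexpr("A18|A19", df$Description))
df$Occurrences <- lengths(matches)
df$Occurrences
```

Note that the pattern matches anywhere in the string. If a code such as "BA1801" could occur and should not be counted, a stricter pattern like "(^|,)A1[89]" might be a better fit.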

Related

How to tally elements in a cell

(I am using r Studio)
I am doing a literature review in which I record gene variants along with the ID of each paper the variant was reported in. I want to be able to count the number of papers for each variant, like a tally.
For example, the first row of the PMID column lists 4 papers, so I want the output for that specific cell to be 4, for the next cell below to be 5, and for the one below that to be 3.
If anyone could help, that'd be greatly appreciated!
Dataframe "gene" column "Pmid"
You could use strsplit and lengths
df <- data.frame(PMID = c("258,234,212", "234,235,256,265"))
df$counts <- lengths(strsplit(df$PMID, ","))
df
#-----
PMID counts
1 258,234,212 3
2 234,235,256,265 4
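One edge case worth checking is a cell that is empty or missing. The sketch below uses a made-up PMID vector, and treating a missing cell as zero papers is an assumption rather than something stated in the question.

```r
# Hypothetical PMID column with an empty cell and a missing value
pmid <- c("258,234,212", "234,235,256,265", "", NA)

# strsplit() yields character(0) for "", so lengths() gives 0 there,
# but an NA cell comes back as a single NA element and counts as 1
counts <- lengths(strsplit(pmid, ","))

# Assumption: a missing cell should mean zero papers
counts[is.na(pmid)] <- 0L
counts
```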

How do I write back results of a count query to a column in R?

I would like to count the instances of an Employee ID in a column and write the results back to a new column in my dataframe. So far I am able to count the instances and display the results in the RStudio console, but I'm not sure how to write the results back. Here is what I have tested successfully:
ids<-BAR$`Employee ID`
counts<-data.frame(table(ids))
counts
And here are the returned results:
1 00000018 1
2 00000179 1
3 00001045 1
4 00002729 1
5 00003095 2
6 00003100 1
Thanks!
If we need to create a column, use add_count
library(dplyr)
BAR1 <- BAR %>%
add_count(`Employee ID`)
table returns the summarised output. If we want to create a column in the original data, index the table by each row's ID:
BAR$n <- as.integer(table(ids)[as.character(BAR$`Employee ID`)])
If you use a data.table you will be able to do this quickly, especially with larger datasets, using .N to count number of occurrences per grouping variable given in by.
# Load data.table
library(data.table)
# Convert data to a data.table
setDT(BAR)
# Count and assign counts per level of Employee ID
BAR[, count := .N, by = "Employee ID"]
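For comparison, the same per-row count can be sketched in base R with ave(). The BAR data frame below is a hypothetical stand-in built from a few of the IDs shown in the question, and the column name n is chosen only for illustration.

```r
# Hypothetical stand-in for the BAR data frame
BAR <- data.frame(
  `Employee ID` = c("00000018", "00003095", "00000179", "00003095"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# ave() applies length() within each ID group and returns one value per row
BAR$n <- ave(seq_len(nrow(BAR)), BAR$`Employee ID`, FUN = length)
BAR
```

Unlike table(), which returns one row per distinct ID, this keeps the original row order and row count, so the result can be assigned straight to a new column.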

How to subset the first column (rownames) in R [duplicate]

I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with, it has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each column is a different tissue sample, thus showing the gene expression in each tissue sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
My thought process would be to subset the first column (gene names) and then perform order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
Appears to create a list of gene names in R studio's viewer but I cannot perform any manipulation on it.
I am unable to correctly locate the first column in the data.frame, since it does not have a column header, and I also have the same problem when doing the same thing with row 1 (sample names) as well.
I'm a complete novice at R and this is part of an assignment I'm working on, it seems I'm missing something fundamental but I can not figure out what.
Cheers guys
Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
"If row.names is not specified and the header line has one less entry than the number of columns, the first column is taken to be the row names."
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df as a column, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column; see for example this question.
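Putting this together for the assignment task, a minimal sketch might look like the following. The data frame, its column names, and the three gene names are made up for illustration; the real dataset has 7065 genes.

```r
# Hypothetical expression table: gene names are row names, not a column
df <- data.frame(
  TissueA = c(5.1, 2.3, 7.8),
  TissueB = c(4.9, 2.0, 8.1),
  row.names = c("TP53", "AACS", "BRCA1")
)

# Sort the gene names and print the first 20 (only 3 exist here)
genes <- sort(rownames(df))
head(genes, 20)

# Reorder the whole data frame by gene name
df_sorted <- df[order(rownames(df)), ]
```

Sorting the row names directly avoids creating an extra column, while order() is the better choice when the expression values need to stay attached to their genes.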
If you are asking what I think you are asking, you just need to build a one-column data frame from the subset and give it a "header", as you call it. Here it is named V1, the first variable of your new data frame.
genes <- data.frame(V1 = df[, 1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

String subset using a pattern and range of string length in R

I have a data set that contains a column with strings made up of 4 letters (A,T,C,G); these strings range from 2-1991 characters long. I would like to subset all rows where the strings match a particular pattern. For example, I would like to create a new dataframe that subsets all rows where there are 0-10 consecutive Ts in column 17.
Please let me know if you require additional information and thank you for your time!
You could filter out all rows where you find 11 consecutive Ts. This removes rows with exactly 11 consecutive Ts as well as rows with longer runs, leaving only rows whose runs are 10 Ts or fewer.
## Example vector
v = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG")
v[!grepl("T{11}",v)]
[1] "TTTTTTTTTTACAGATAT" "TTTACACAC"
Edit to also include cases where you want to look for 11-20 consecutive Ts
If you want to select rows that have between 11 and 20 Ts, you could use a negative lookbehind and a negative lookahead, to search for a stretch of between 11 and 20 Ts that is neither preceded nor followed by a T.
## Second example vector:
v2 = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG","ACTTTTTTTTTTTTTTTTTTTTTGCGCA")
v2[grepl("(?<!T)T{11,20}(?!T)", v2, perl = TRUE)]
[1] "TTTTTTTTTTTTTACAGAT" "TTTTTTTTTTTACAG"
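An alternative to encoding the range in the regex is to compute the longest run of Ts per string and filter on that number. The sketch below is not the answer's method: longest_t_run is a hypothetical helper, and the test strings are built with strrep() so that the run lengths are unambiguous.

```r
# Build test strings with strrep() so the run lengths are explicit
v <- c(
  paste0(strrep("T", 10), "ACAGATAT"),
  "TTTACACAC",
  paste0(strrep("T", 13), "ACAGAT"),
  paste0("AC", strrep("T", 21), "GCGCA")
)

# Hypothetical helper: length of the longest run of consecutive Ts
longest_t_run <- function(s) {
  runs <- regmatches(s, gregexpr("T+", s))[[1]]
  if (length(runs) == 0) 0L else max(nchar(runs))
}

run_len <- vapply(v, longest_t_run, integer(1), USE.NAMES = FALSE)

short_runs <- v[run_len <= 10]                  # at most 10 consecutive Ts
mid_runs   <- v[run_len >= 11 & run_len <= 20]  # longest run is 11 to 20 Ts
```

The two approaches differ slightly: the lookaround regex keeps a string if any run falls in the 11 to 20 range, even when a longer run exists elsewhere in it, whereas this version filters on the longest run only.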

Expand Row with Multiple Observations into Individual Rows

Just wondering if there is a way to expand rows which have multiple observations, into rows of unique observations using R? I have data in an excel spreadsheet with the variable headings: Lease, Line, Bay, Date, Predators, Food.Index, DD, MM, YY.
On some dates, multiple predators (from 1 to 4) have been recorded in the same row. Other days just have 0. On a day where 4 predators have been recorded, I would like to transform the data to show four unique observations (instead of one row with 4 recorded under "Predators").
I have 1669 rows of data and multiple rows need to be expanded
Example of Data set
Many thanks for your help in advance.
Assuming you have your data in a data.frame, df, one possible solution would be
df.expanded <- df[rep(row.names(df), df$Predators), ]
EDIT: If you also want to keep the rows with 0 predators, you can use pmax to always return at least one:
df.expanded <- df[rep(row.names(df), pmax(df$Predators, 1)),]
Here the pmax(df$Predators, 1) will return the elementwise maximum of df$Predators and 1 so that it returns a new vector where each element is at least 1 but takes the value of df$Predators if that number is greater than 1.
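A small worked example may make the expansion easier to follow. The data below is invented (only the Date and Predators column names come from the question), and resetting Predators to 1 on each expanded row is an assumption about what "unique observations" should look like.

```r
# Hypothetical subset of the spreadsheet columns
df <- data.frame(
  Date = c("2015-01-01", "2015-01-02", "2015-01-03"),
  Predators = c(0, 2, 4)
)

# Repeat each row index Predators times, keeping zero-predator rows once
idx <- rep(seq_len(nrow(df)), pmax(df$Predators, 1))
df.expanded <- df[idx, ]

# Assumption: each expanded row should represent a single sighting
df.expanded$Predators <- pmin(df.expanded$Predators, 1)
rownames(df.expanded) <- NULL
df.expanded
```

Here the three input rows become seven: one row for the day with no predators, two for the day with 2, and four for the day with 4.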
