Identifying, grouping unique entries in data frame (R) - r

I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know whether I have 2, 3, etc. duplicates, and which IDs they correspond to (it is important that each ID always stays with its corresponding sequence).
Maybe something like:
for data$duplicates = TRUE... "add number in data$grouping
corresponding to the set of duplicates."
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"), seq = c("AAGTCA","AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")

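In R versions before 4.0, data.frame() converts strings to factors by default, so in the example df$seq is already a factor and we can just use the level number. This is what you get when a factor is coerced to an integer.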
df$grouping = as.integer(df$seq)
df
#     ID     seq grouping
# 1 seq1  AAGTCA        1
# 2 seq2   AGTCA        3
# 3 seq3 AGCCTCA        2
# 4 seq4   AGTCA        3
# 5 seq5 AGTCAGG        4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical; you can change this by passing the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
# AAGTCA AGCCTCA   AGTCA AGTCAGG
#      1       1       2       1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = TRUE)
#   AGTCA  AAGTCA AGCCTCA AGTCAGG
#       2       1       1       1
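Putting the pieces together, here is a minimal sketch that numbers the groups in order of first occurrence (which matches the output asked for in the question), records how many rows share each sequence, and lists the IDs per group. The column name n_in_group and the object ids_by_group are made up for illustration, and match() is used here as an equivalent of the factor(..., levels = unique(...)) approach.

```r
# Example data from the question, kept as plain character strings
df <- data.frame(
  ID  = c("seq1", "seq2", "seq3", "seq4", "seq5"),
  seq = c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG"),
  stringsAsFactors = FALSE
)

# Group number = position of each sequence among the unique sequences,
# so groups are numbered in order of first appearance
df$grouping <- match(df$seq, unique(df$seq))

# Number of rows that share each row's sequence (hypothetical column name)
df$n_in_group <- ave(seq_along(df$seq), df$seq, FUN = length)

# IDs belonging to each group, as a named list
ids_by_group <- split(df$ID, df$grouping)

print(df)
print(ids_by_group)
```

Because the new columns are added to the same data frame, each ID stays on the same row as its sequence.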

Related

R: how to map a set of integers to another set of integers

I have a data set where each individual has a unique person ID. I'm interested in mapping these ID numbers to another, more manageable set of integer IDs.
ID <- c(59970013552, 51730213552, 1233923, 2949394, 9999999999)
Essentially, I'd like to map these IDs to a new_ID, where
> new_ID
[1] 1 2 3 4 5
The reason I'm doing this is that my analysis requires as.integer(ID), and R will coerce large integers into NA. I have tried using as.integer64 from the bit64 package, but the class integer64 is not compatible with my analysis.
I've also thought to just do ID - min(ID) + 1 to get around having huge ID numbers. But this also doesn't work, because some of my larger IDs are so large that even if I subtract the min(ID) value, as.integer(ID) will still coerce them to NA.
This should be a duplicate, but I couldn't find a relevant answer, hence posting one.
We can use match
match(ID, unique(ID))
#[1] 1 2 3 4 5
Or convert ID to a factor whose levels are the unique IDs in order of first occurrence, then coerce to integer
as.integer(factor(ID, levels = unique(ID)))
#[1] 1 2 3 4 5
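As a small illustration of why match works here, the sketch below adds a repeated ID to the example vector (an assumption, since the IDs in the question are all distinct) and also builds a lookup table for getting back to the original IDs. The name id_lookup is hypothetical.

```r
# Large IDs stored as doubles; the last entry repeats the third ID
ID <- c(59970013552, 51730213552, 1233923, 2949394, 9999999999, 1233923)

# Position of each ID among the unique IDs, in order of first occurrence
new_ID <- match(ID, unique(ID))

# Lookup table to translate new_ID back to the original ID
id_lookup <- data.frame(new_ID = seq_along(unique(ID)), ID = unique(ID))

print(new_ID)
print(id_lookup)
```

The result of match is already of type integer, so no further as.integer call is needed, and repeated IDs receive the same new_ID.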

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
  for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
    if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
      if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
        Count <- Count + 1
        respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
        assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
      else {
        assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
    }
  }
}
DrugPicker(DRUGID.df)
Here I first tried to make a list to hold each new DRUGIDx value (respDRUGID), a counter (Count) for the total number of unique DRUGID values, and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and, if the value is not NA, check whether DRUGID_1 is in the list respDRUGID. If it is not, it should create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1. If the value of DRUGID_1 is already in respDRUGID, it should just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and the required result format, here is an approach using the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
# Reshape to long format (one row per patient and drug mention), dropping NAs
gather(drug_df, key = "column", value = "DRUGID", -patient, na.rm = TRUE) %>%
  arrange(patient, DRUGID) %>%
  group_by(patient) %>%
  summarize(DRUGIDs = paste(DRUGID, collapse = ","))
#   patient DRUGIDs
#    <fctr>   <chr>
# 1       A   1,2,3
# 2       B       2
# 3       C     1,2
# 4       D   1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
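For reference, one way to get the 0/1 indicator columns described in the question directly, without first pasting the IDs into a string, might look like the base R sketch below. It is not the method from the linked post. It reuses the guessed toy data from the answer above, and the r. prefix follows the naming in the question; id_cols, ids, drugs, indicators and result are illustrative names.

```r
# Guessed toy data in the same layout as the answer above
drug_df <- read.csv(text = "patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2")

# All DRUGID_* columns as one matrix (empty fields are read as NA)
id_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
ids <- as.matrix(drug_df[id_cols])

# Every drug code that occurs anywhere in those columns
drugs <- sort(unique(na.omit(as.vector(ids))))

# One 0/1 column per drug: 1 if the code appears in any DRUGID_* column of the row
indicators <- sapply(drugs, function(d) as.integer(rowSums(ids == d, na.rm = TRUE) > 0))
colnames(indicators) <- paste0("r.", drugs)

# Same observations-by-variables layout as the input, with the indicators appended
result <- cbind(drug_df, indicators)
print(result)
```

Rows stay in their original order, so the result can still be aligned or merged with other per-patient variables such as survey weights.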

R require cell counts for number of occurrences of regex pattern over entire data frame

I'm working in R and I have a data frame containing epigenetic information. I have 300,000 rows containing genomic locations and 15 columns each of which identifies a transcription factor motif that may or may not occur at each locus.
I'm trying to use regular expressions to count how many times each transcription factor occurs at each genomic locus. Individual motifs can occur > 15 times at any one locus, so I'd like the output to be a matrix/data frame containing motif counts for each individual cell of the data frame.
A typical single occurrence of a motif in a cell could be:
2212(AATTGCCCCACA,-,0.00)
Whereas if there were multiple occurrences of a motif, these would exist in the cell as a continuous string each entry separated by a comma, for example for two entries:
144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)
Here is some toy data:
df <-data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
stringsAsFactors = F)
I'd be looking for the output in the following format:
NAMES  TFM1  TFM2
LOC_A     2     2
LOC_B     1     1
LOC_C     0     1
LOC_D     0     0
If possible, I'd like to avoid for loops, but if loops are required so be it. To get row counts for this data frame I used the following code, kindly recommended by @akrun:
df$MotifCount <- Reduce(`+`, lapply(df[-1],
function(x) lengths(str_extract_all(x, "\\d+\\("))))
Notice that the unique identifier for the motifs used here is "\\d+\\(" to pick up the number and opening bracket at the start of each motif identification string. This would have to be included in any solution code. Something similar which worked across the whole data frame to provide individual cell counts would be ideal.
Many Thanks
We don't need the Reduce part
library(stringr)
# Count the matches per cell, column by column, and keep the NAMES column
data.frame(c(df[1], lapply(df[-1], function(x) lengths(str_extract_all(x, "\\d+\\(")))))
#   NAMES TFM1 TFM2
# 1 LOC_A    2    2
# 2 LOC_B    1    1
# 3 LOC_C    0    1
# 4 LOC_D    0    0
This will also work:
cbind.data.frame(df[1],sapply(lapply(df[-1], function(x) str_extract_all(x, "\\d+\\(")), function(x) lapply(x, length)))
#   NAMES TFM1 TFM2
# 1 LOC_A    2    2
# 2 LOC_B    1    1
# 3 LOC_C    0    1
# 4 LOC_D    0    0
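If stringr is not available, the same per-cell counts can probably be obtained with base R alone. The sketch below swaps str_extract_all for gregexpr plus regmatches, keeping the "\\d+\\(" pattern from the question; count_motifs and counts are names made up for this example.

```r
# Toy data from the question
df <- data.frame(
  NAMES = c("LOC_A", "LOC_B", "LOC_C", "LOC_D"),
  TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
  TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)",
           "7(TGTGAGTCAC,+,0.00)", "0"),
  stringsAsFactors = FALSE
)

# Number of "digits followed by an opening bracket" matches in each string
count_motifs <- function(x) {
  lengths(regmatches(x, gregexpr("\\d+\\(", x, perl = TRUE)))
}

# Replace every motif column by its per-cell counts, leaving NAMES untouched
counts <- df
counts[-1] <- lapply(df[-1], count_motifs)
print(counts)
```

Cells without any match (such as "0") yield an empty match list and therefore a count of 0.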

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth".
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly, e.g.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
  AnimalId LactationNo
1        A           1
2        B           2
3        E           2
4        A           2
5        E           2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
                  totaldata[totaldata$LactationNo == 2,],
                  by.x="AnimalId",
                  by.y="AnimalId")[,"AnimalId"]
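The merge above returns only the matching AnimalId values. If the full rows (including the other columns) are needed, a possible alternative is to find the IDs present in both lactations and then subset the original data frame. This sketch uses the guessed example data from this answer; ids1, ids2 and both are illustrative names.

```r
# Guessed example data; the real totaldata has 16 further columns
totaldata <- data.frame(
  AnimalId    = c("A", "B", "E", "A", "E"),
  LactationNo = c(1, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Animals with at least one reading in each lactation
ids1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids2 <- totaldata$AnimalId[totaldata$LactationNo == 2]
both <- intersect(ids1, ids2)

# Keep every original row (all columns) for those animals and lactations
lactboth <- totaldata[totaldata$AnimalId %in% both &
                        totaldata$LactationNo %in% c(1, 2), ]
print(lactboth)
```

Because the rows are taken straight from totaldata, all other columns are preserved and no rows are duplicated when an animal has several readings per lactation.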

Problems with using subset in r

I need to subset my data frame, but I do not know what condition to use.
df2<-subset(df, condition )
A part of the dataframe, `df`:
state value
a 1
b 2
c 3
a 1
b 4
c 5
I count the sum of the value column for each state using: table(df$state)
I need to create a data frame that shows just the rows for states where the sum of the value column is bigger than a given value x.
If x is 3, the new data frame should contain just the rows whose "state" column equals b or c.
What should I replace "condition" with? How can I use table(df$state) in the condition?
It is not entirely clear what you are trying to do.
table(df$state) counts the occurrences of each state in your data, not the sum of the variable "value" for each "state". You should instead use something like this (here the data frame is called dat):
vv <- tapply(dat$value,dat$state,sum)
vv
a b c 
2 6 8 
Now you can use the result within subset to keep the rows of states whose summed value is bigger than a given value x. For example, with x equal to 3:
subset(dat,state %in% names(vv)[vv>3])
or without using subset (more efficient):
dat[dat$state %in% names(vv)[vv>3],]
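A self-contained version of the above, using the data from the question, could look like the following sketch. It also shows ave as an alternative that attaches the per-state sum to every row; state_sum and df2_alt are names made up here.

```r
# Data from the question
dat <- data.frame(
  state = c("a", "b", "c", "a", "b", "c"),
  value = c(1, 2, 3, 1, 4, 5),
  stringsAsFactors = FALSE
)
x <- 3

# Sum of value per state, then keep rows whose state exceeds the threshold
vv <- tapply(dat$value, dat$state, sum)
df2 <- dat[dat$state %in% names(vv)[vv > x], ]

# Alternative: per-row copy of the state's sum, compared directly against x
state_sum <- ave(dat$value, dat$state, FUN = sum)
df2_alt <- dat[state_sum > x, ]

print(df2)
```

Both versions keep the rows for states b and c, whose sums are 6 and 8.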

Resources