Calling groups of string digits using grepl - r

I'm using the grepl function to sort through survey data: each row is a different survey respondent, and each number in the "ANI_type" string represents a different type of animal. I need to sort the rows by animal type. More specifically, I need to group some of the digits in the strings into categories. For example, digits 6,7,8,9,10,11 all need to be flagged in the animals$pock column. How would I go about that using the grepl function?
> animals$dogs <- as.numeric(grepl("\\b1\\b", animals$ANI_type))
> animals
ANI_type dogs cats repamp
1 1,2,5,12,13,14,15,16,18,19,27 1 1 0
2 2 0 1 0
3 20,21,22,23,26 0 0 0
4 20,21,22,23 0 0 0
5 13 0 0 0
6 2 0 1 0
7 20,21,22 0 0 0
8 20,21,22,23 0 0 0
9 20,21,22 0 0 0
10 5,20,21,22,27 0 0 0
11 1,2,20,21,22 1 1 0
12 5,18,20,21,22,23,26 0 0 0
13 20,21 0 0 0
14 21 0 0 0
15 20,21 0 0 0
16 20,21,26 0 0 0
17 2 0 1 0
18 1,2 1 1 0
19 2 0 1 0
20 3,4 0 0 1
The expected output is the columns (dogs, cats, repamp) above. These were easy to do because each uses only one digit; what I'm having trouble with is grouping several digits into one column.

A tidyverse solution uses mutate() and if_else() from the dplyr package together with grepl(), for example:
library(dplyr)
animals <- animals %>%
  mutate(dogs = if_else(grepl("\\b1\\b|\\b22\\b", ANI_type), 1, 0),
         cats = if_else(grepl("\\b2\\b|\\b31\\b", ANI_type), 1, 0))
In this case, you'd list all the potential codes for each animal in the pattern, separated by |, which acts as an OR (alternation) operator in a regular expression. The \\b word boundaries keep a code such as 1 from matching inside 12 or 21.
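For the group the question asks about (codes 6 to 11 in a pock column), the same idea can be sketched by building the pattern from a vector of codes. This is a minimal sketch on a small made-up data frame; pock_codes and pattern are illustrative names, not from the original post.

```r
# Small made-up sample in the same shape as the question's data
animals <- data.frame(
  ANI_type = c("1,2,5,12", "6,20", "16,21", "3,11", "10"),
  stringsAsFactors = FALSE
)

# Codes that belong to the "pock" group, as listed in the question
pock_codes <- 6:11

# Builds "\\b(6|7|8|9|10|11)\\b": any one of the codes as a whole number
pattern <- paste0("\\b(", paste(pock_codes, collapse = "|"), ")\\b")

# 1 if the respondent has at least one code from the group, otherwise 0
animals$pock <- as.numeric(grepl(pattern, animals$ANI_type))
animals
```

Because of the word boundaries, "16,21" is not flagged even though it contains the digit 6.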

Related

How to compare data frame with a factor as a variable?

I have a data frame (see below).
How do I compare each Volume where Purchase == 1 with the Volume of the previous row where Purchase == 1, and create a factor variable V1 as shown in Picture 2?
For example, df[5,"V1"] == 1 because df[5,"Volume"] > df[3,"Volume"], and so on.
How can I do this without loops, in a vectorized way, so that it stays fast with millions of rows?
I tried subsetting and then comparing, but the result has fewer rows than df, so I cannot assign it back to the data frame as a factor variable.
Picture 2
Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0
With data.table:
library(data.table)
data <- data.table(read.table(text=' Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0', header=T))
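data[, V1 := 0L]
# j is evaluated only on the Purchase == 1 rows, so shift() returns the previous purchase's Volume
data[Purchase == 1, V1 := as.integer(Volume > shift(Volume, fill = Inf))]
data[, V1 := as.factor(V1)]
Here, I filter the data to the rows where Purchase == 1 and fetch the previous purchase's Volume with the shift function.
Then I compare Volume with that previous Volume and assign 1 if it is larger. The first purchase has no predecessor; without fill, shift() would return NA there, so fill = Inf makes that comparison FALSE and the row keeps 0.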
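The same comparison can also be sketched in base R without any package. This is only an illustration on a short made-up sample with the question's column names; idx is a helper name introduced here, and the sketch assumes at least one row has Purchase == 1.

```r
# Short made-up sample with the same columns as the question
df <- data.frame(
  Volume   = c(3.9567, 3.9711, 3.9720, 3.9864, 3.9888, 3.9870, 3.9867),
  Purchase = c(0, 0, 1, 0, 1, 0, 1)
)

# Positions of the purchase rows
idx <- which(df$Purchase == 1)

# Default 0 everywhere; on purchase rows, 1 if Volume rose since the previous purchase.
# The leading FALSE covers the first purchase, which has nothing to compare against.
df$V1 <- 0L
df$V1[idx] <- as.integer(c(FALSE, diff(df$Volume[idx]) > 0))
df$V1 <- factor(df$V1)
df
```

Assigning through idx writes the result back at the original row positions, which avoids the row-count mismatch described in the question.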

How to count frequency of string value (from a few ID in the same column)? [R]

I want to clean my table, but since I'm still new to R, what I can do is pretty limited. The table is long, around 100,000 rows, so doing this manually is impossible. Please help.
Suppose I have a very long table. Each row has a "Publication.Code" and a "Date". The code is unique, while the date can be duplicated. Each row also has a list of names in the column "Type".
Publication.Code Date Type
1 AC00069535742 2009-04-16 E62D 21/15;E60R 7/06;E60R 21/06;E62D 25/14
2 BB000069535652 2008-10-30 F06Q 10/
3 FV000069434701 2007-04-05 E30B 15/;E30B 15/16
4 RG000069534443 2006-07-06 E62D 21/15;E62D 25/14;T60T 7/06;E60R 21/06
5 MV000069333663 2006-02-23 H04N 1/1;G01J 3/51
6 KK000069533634 2006-02-23 H12N 9/1;H12N 15/54;H12P 9/
7 NQ000069534198 2006-02-16 H12N 15/54;H12N 15/7;H12N 1/21;H12N 9/1
I want to add new columns based on the first four characters of each name in the column "Type" (that is, E60R, E62D, F06Q, E30B, T60T, H04N, G01J, H12N) and count how often each occurs in that row's list, as below:
Publication.Code Date E60R E62D F06Q E30B T60T H04N G01J H12N
1 AC00069535742 2009-04-16 2 2 1 0 0 0 0 0
2 BB000069535652 2008-10-30 0 0 1 0 0 0 0 0
3 FV000069434701 2007-04-05 0 0 0 2 0 0 0 0
4 RG000069534443 2006-07-06 1 2 0 0 1 0 0 0
5 MV000069333663 2006-02-23 0 0 0 0 0 1 1 0
6 KK000069533634 2006-02-23 0 0 0 0 0 0 0 3
7 NQ000069534198 2006-02-16 0 0 0 0 0 0 0 4
After that, I would like to sum that up by year, maybe by:
Year E60R E62D F06Q E30B T60T H04N G01J H12N
1 2009 2 2 1 0 0 0 0 0
2 2008 0 0 1 0 0 0 0 0
3 2007 0 0 0 2 0 0 0 0
4 2006 1 2 0 0 1 1 1 7
& also the cumulative sum of each column:
Year E60R E62D F06Q E30B T60T H04N G01J H12N
1 2009 2 2 1 0 0 0 0 0
2 2008 2 2 2 0 0 0 0 0
3 2007 2 2 2 2 0 0 0 0
4 2006 2 4 2 2 1 1 1 7
I understand that I can use dplyr to mutate the column and count the frequency by year, but I'm not sure how to extract just certain values from the column. I'd really appreciate any help.
If you put your Types into vector myTypes, this should work for the first part of your problem
require(plyr)
require(stringr)
df<-read.table(header = TRUE, sep=",", text="
Publication.Code, Date, Type
AC00069535742, 2009-04-16, E62D 21/15;E60R 7/06;E60R 21/06;E62D 25/14
BB000069535652, 2008-10-30, F06Q 10/")
myTypes <- c("E60R", "E62D", "F06Q", "E30B", "T60T", "H04N", "G01J", "H12N")
res <- adply(df, .margins = 1, .fun = function(x) setNames(str_count(x$Type, pattern = myTypes), myTypes))
res$Type <- NULL
This will solve the second part:
library(lubridate)
res$Date <- ymd(res$Date)
# sum the per-type counts within each year; columns 1:2 are Publication.Code and Date
res2 <- ddply(res, .(year(Date)), function(x) colSums(x[, -(1:2)]))
To calculate cumulative values for each column, use cumsum with colwise:
names(res2)[1] <- "year"
cbind(year = res2$year, colwise(cumsum, myTypes)(res2))
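For comparison, the three steps (per-row counts, yearly sums, cumulative sums) can be sketched in base R alone. This is a minimal sketch on three made-up rows modeled on the question's table, with a shortened myTypes; counts, yearly and cumulative are names introduced here. Years come out in ascending order.

```r
# Three made-up rows modeled on the question's table
df <- data.frame(
  Publication.Code = c("AC00069535742", "BB000069535652", "KK000069533634"),
  Date = c("2009-04-16", "2008-10-30", "2006-02-23"),
  Type = c("E62D 21/15;E60R 7/06;E60R 21/06;E62D 25/14",
           "F06Q 10/",
           "H12N 9/1;H12N 15/54;H12P 9/"),
  stringsAsFactors = FALSE
)
myTypes <- c("E60R", "E62D", "F06Q", "H12N")

# One column per code: number of literal occurrences of the code in each row's Type
counts <- vapply(
  myTypes,
  function(code) lengths(regmatches(df$Type, gregexpr(code, df$Type, fixed = TRUE))),
  integer(nrow(df))
)

# Sum the counts within each year (the first four characters of Date)
yearly <- rowsum(counts, group = substr(df$Date, 1, 4))

# Running total down the years for every code
cumulative <- apply(yearly, 2, cumsum)
```

cbind(df[c("Publication.Code", "Date")], counts) would give the wide per-publication table from the first part.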

filter specific rownames with spaces

My rownames consist of 6 strings separated by spaces. I would like to keep the rows that have 0 as the third string. I am not sure how to do this, since the strings are not defined as columns.
hsa-miR-143-3p TGAGAAGAAGCACTGTAGCTCTT 6AT u-TT 0 0 0 1
hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTA 1GT u-A 0 u-G 1 0
hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTG 1GT 0 0 0 54 24
hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC 11 17
hsa-miR-887-3p GTGAACGGGCGCCATCCCGAGGCTT 0 0 0 d-CTT 1 8
hsa-miR-10a-5p ACCCGGTAGATCCGAATTTGTG 5GT 0 d-T 0 7 11
out:
hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC 11 17
hsa-miR-887-3p GTGAACGGGCGCCATCCCGAGGCTT 0 0 0 d-CTT 1 8
Assuming df holds your data frame, you could try
idx <- which(read.table(text=rownames(df))$V3=="0")
df[idx, ]
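An alternative that avoids parsing the row names as a table is to split them on spaces and pick the third piece. This is a minimal sketch on a made-up three-row data frame; count1, count2, third and kept are names introduced here for illustration.

```r
# Made-up sample: two data columns, row names made of six space-separated strings
df <- data.frame(
  count1 = c(0, 1, 11),
  count2 = c(1, 0, 17),
  row.names = c(
    "hsa-miR-143-3p TGAGAAGAAGCACTGTAGCTCTT 6AT u-TT 0 0",
    "hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTA 1GT u-A 0 u-G",
    "hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC"
  )
)

# Split each row name on single spaces and take the third element
third <- vapply(strsplit(rownames(df), " ", fixed = TRUE), `[`, character(1), 3)

# Keep only the rows whose third string is exactly "0"
kept <- df[third == "0", , drop = FALSE]
kept
```

Comparing against the string "0" rather than the number keeps values such as "6AT" from being coerced.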

Re-expanding a compressed dataframe to include zero values in missing rows

Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant. In a real example, the ID numbers are just row numbers created by R. In a real example, the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
  All_Pos %>%
  left_join(Nonzero, by = "Pos") %>%
  # positions missing from Nonzero come back as NA; turn those into zeros
  mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate() call converts NA values to zeros. If you only know the largest position MaxPos, you can construct it using
All_Pos <- data.frame(Pos = seq_len(MaxPos))
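For an arbitrary range of Pos, the re-expansion can also be wrapped in a small base R helper that looks up each position with match(). This is a sketch only; expand_positions and its from/to arguments are hypothetical names introduced here, and the three-row input is a cut-down copy of the compressed data frame a.

```r
# Cut-down copy of the compressed data frame from the question
a <- data.frame(
  Pos    = c(39026, 39028, 39030),
  Watson = c(2, 0, 0),
  Crick  = c(1, 4, 1),
  Total  = c(3, 4, 1)
)

# Hypothetical helper: rebuild every position in from:to, filling gaps with 0
expand_positions <- function(compressed, from, to) {
  full <- data.frame(Pos = seq(from, to))
  # Row of `compressed` holding each position, NA where the position is absent
  hit <- match(full$Pos, compressed$Pos)
  for (col in setdiff(names(compressed), "Pos")) {
    values <- compressed[[col]][hit]
    values[is.na(values)] <- 0
    full[[col]] <- values
  }
  full
}

restored <- expand_positions(a, from = 39023, to = 39032)
restored
```

The range is passed in explicitly because the compressed data alone cannot say where the original table started or ended.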

r sum several columns by another column

I have a 39-column data frame (with upward of 100,000 rows) whose last eleven columns look like this (the rest of the columns do not concern my question):
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) count the number of rows in which the value of column UPU, UPP or UPM is 1, grouped by each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin)
b) sum each row of the columns UPU, UPP, UPM by the first three columns
I came up with this incredibly cumbersome way of doing this:
UtoPFrac<-seq(6)
UtoPTotEvents<-seq(6)
for (j in 1:3){
y<-df[,28+j]
for (i in 1:3){
UtoPFrac<-cbind(UtoPFrac,tapply(df[which(is.na(y)==FALSE),33+i],y[which(is.na(y)==FALSE)], function(x) length(which(x==1))))
}
}
UtoPFrac<-UtoPFrac[,2:10]
UtoPEvents<-cbind(rowSums(UtoPFrac[,1:3]),rowSums(UtoPFrac[,4:6]),rowSums(UtoPFrac[,7:9]))
I am certain there is a more elegant way of doing this, probably by using aggregate() or ddply(), but I was unable to get it working.
I would appreciate any help doing this more efficiently.
Thanks in advance
Not tested:
library(plyr)
# one output row per combination of the three bin columns
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize,
      UPUl = sum(UPU == 1, na.rm = TRUE),  # rows where UPU is 1
      UPPl = sum(UPP == 1, na.rm = TRUE),
      UPMl = sum(UPM == 1, na.rm = TRUE),
      mysum = sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code.
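If the counts are wanted for each bin column separately (as the loop in the question computes them) rather than for each combination of the three, a base R sketch with rowsum() might look like this. The data are the six rows shown in the question, restricted to the relevant columns; bins, events, is_event, counts and totals are names introduced here, and the sketch assumes the bin columns contain no NA.

```r
# The six rows shown in the question, restricted to the relevant columns
df <- data.frame(
  H3K27me3_gross_bin = c(3, 6, 6, 1, 2, 2),
  H3K4me3_gross_bin  = c(3, 1, 1, 2, 5, 6),
  H3K4me1_gross_bin  = c(6, 5, 1, 1, 6, 2),
  UPU = c(0, 0, 1, 0, 0, 0),
  UPP = c(0, 1, 0, 0, 0, 0),
  UPM = c(0, 0, 0, 0, 0, 0)
)
bins   <- c("H3K27me3_gross_bin", "H3K4me3_gross_bin", "H3K4me1_gross_bin")
events <- c("UPU", "UPP", "UPM")

# 0/1 matrix marking where each event column equals 1
is_event <- (as.matrix(df[events]) == 1) + 0L

# a) for every bin column, count the event rows per bin value
counts <- lapply(setNames(bins, bins), function(b) rowsum(is_event, group = df[[b]]))

# b) total events (UPU + UPP + UPM) per bin value
totals <- lapply(counts, rowSums)
```

Each element of counts is a matrix with one row per bin value and one column per event, so cbind() over the list would give a wide layout similar to UtoPFrac.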
