Not-equal to character in R

What is the command for printing the strings that do not match a specific character pattern? From the data below I would like to print the number of rows where the t5 column does not start with d-. (In this example that is all the rows)
I tried
dim(df[df$t5 !="d-",])
df:
name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
6 seq_10002_x17 17 hsa-miR-10a-5p 23 44 5GT 0 d-T 0 TATATACC TGTGTAAG miRNA 1
19 seq_100091_x3 3 hsa-miR-142-3p 54 74 0 u-CA d-TG 0 AGGGTGTA TGGATGAG miRNA 1
20 seq_100092_x1 1 hsa-miR-142-3p 54 74 0 u-CT d-TG 0 AGGGTGTA TGGATGAG miRNA 1
23 seq_100108_x5 5 hsa-miR-10a-5p 23 44 4NC 0 d-T 0 TATATACC TGTGTAAG miRNA 1
26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
28 seq_100121_x1 1 hsa-miR-192-5p 25 45 1CT u-CT d-C d-A GGCTCTGA AGCCAGTG miRNA 1

df1 <- df[!grepl("^d-",df[,8]),]
nrow(df1)
print(df1)

There is one row in your data that has a t5 entry that does not start with "d-". To find this row, you could try:
df[!grepl("^(d-)",df$t5),]
# name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
#26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
If you only want to know the row number, you can get it with rownames()
> rownames(df[!grepl("^(d-)",df$t5),])
#[1] "26"
or with which(),
> which(!grepl("^(d-)",df$t5))
#[1] 5
depending on whether you want the row number counting from the top of your data frame or the row number according to the value on the left.
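
As a small illustration of why the original attempt returned every row, here is a sketch on a toy vector standing in for the t5 column (the vector t5 and the name not_d are made up for this example). The != operator compares whole strings, whereas startsWith() (base R, 3.3.0 or later) tests only the prefix and needs no regular expression:

```r
# Toy stand-in for the t5 column shown in the question
t5 <- c("d-T", "d-TG", "d-TG", "d-T", "u-G", "d-C")

# != tests whole-string equality: no entry is exactly "d-", so all rows pass
sum(t5 != "d-")      # 6

# startsWith() tests the prefix only
not_d <- !startsWith(t5, "d-")
sum(not_d)           # 1, the number of rows that do not start with "d-"
which(not_d)         # 5, the position of that row
```

On the real data frame the count would be sum(!startsWith(as.character(df$t5), "d-")), assuming t5 contains no missing values.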

Related

Multiple columns in one random effect GLMER

I'm trying to estimate the variance in an infectivity trait of animals in different herds. Each herd contains a fixed number of offspring from 5 different sires.
Example of data:
Herd  S  C DeltaT  I sire1 I1 sire2 I2 sire3 I3 sire4 I4 sire5 I5
   1 20  0     14  1    13  0    26  0    46  0    71  0    91  1
   1  1  0     14  5    13  1    26  0    46  2    71  1    91  1
  18  4  0     14 13     2  5    52  4    84  2    87  2    98  0
  19 11  3     14 27     2  6    13  7    18  3    46  5    85  6
Herd is the herd name. S is the number of susceptible animals in the herd, and C is the number of cases in the time interval. DeltaT is the length of the time interval. sire# is the ID of a sire in the herd, and I# is the number of infected offspring of the corresponding sire#. Sire IDs are shared across columns: the "13" in column sire1 of the first two rows refers to the same sire as the "13" in sire2 of the last row. Including these 5 sires in a single random effect in glmer (lme4) is giving me trouble.
I tried:
glmer(data = GLMM_Data,
cbind(C, S-C) ~ (1 | Herd) + (1| (I1 | sire1) + (I2 | sire2) + (I3 | sire3) + (I4 | sire4) + (I5 | sire5)),
offset = log(GLMM_Data$I/nherds * GLMM_Data$DeltaT),
family = binomial(link="cloglog"))
This gave errors. So any help on combining these 10 columns in a single random factor would be more than welcome. Thanks in advance.
p.s. I know my offset, family and the left side of the formula are working since the analysis of susceptibility is working
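
To make the data structure concrete, here is a sketch of one possible preparatory step. It only does the reshaping and is not a model fit: it stacks the five sire/I pairs so that a sire ID is treated as the same sire whichever column it appeared in. It uses the first and last rows of the example table; the names dat, long, obs, infected and W are made up for illustration. Whether and how such a matrix can be passed to glmer is left open here.

```r
# First and last rows of the example table (values from the question)
dat <- data.frame(
  Herd  = c(1, 19),
  sire1 = c(13, 2),  I1 = c(0, 6),
  sire2 = c(26, 13), I2 = c(0, 7),
  sire3 = c(46, 18), I3 = c(0, 3),
  sire4 = c(71, 46), I4 = c(0, 5),
  sire5 = c(91, 85), I5 = c(1, 6)
)

# Stack the five sire/I pairs into one long table, remembering the source row
long <- do.call(rbind, lapply(1:5, function(k) {
  data.frame(obs      = seq_len(nrow(dat)),
             sire     = dat[[paste0("sire", k)]],
             infected = dat[[paste0("I", k)]])
}))

# One row per observation, one column per distinct sire ID
W <- xtabs(infected ~ obs + sire, data = long)
W
```

Sire 13 sits in sire1 in the first row and in sire2 in the last row, yet it ends up in a single column of W.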

Merge columnwise from file_list

I have 96 files in file_list
file_list <- list.files(pattern = "*.mirna")
They all have the same columns, but the number of rows varies. Example file:
> head(test1)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 TGGAGTGTGATAATGGTGTTT seq_100003_x4 4 hsa-miR-122-5p 15 35 11TC 0 0 g GCTGTGGA TTTGTGTC miRNA
2 TGTAAACATCCCCGACCGGAAGCT seq_100045_x4 4 hsa-miR-30d-5p 6 29 17CT 0 0 CT TTGTTGTA GAAGCTGT miRNA
3 CTAGACTGAAGCTCCTTGAAAA seq_100048_x4 4 hsa-miR-151a-3p 47 65 0 I-AAA 0 gg CCTACTAG GAGGACAG miRNA
4 AGGCGGAGACTTGGGCAATTGC seq_100059_x4 4 hsa-miR-25-5p 14 35 0 0 0 C TGAGAGGC ATTGCTGG miRNA
5 AAACCGTTACCATTACTGAAT seq_100067_x4 4 hsa-miR-451a 17 35 0 I-AT 0 gtt AAGGAAAC AGTTTAGT miRNA
6 TGAGGTAGTAGCTTGTGCTGTT seq_10007_x24 24 hsa-let-7i-5p 6 27 12CT 0 0 0 TGGCTGAG TGTTGGTC miRNA
precursor ambiguity
1 hsa-mir-122 1
2 hsa-mir-30d 1
3 hsa-mir-151a 1
4 hsa-mir-25 1
5 hsa-mir-451a 1
6 hsa-let-7i 1
second file
> head(test2)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 ATTGCACTTGTCCTGGCCTGT seq_1000013_x1 1 hsa-miR-92a-3p 49 69 14TC 0 t 0 AAAGTATT CTGTGGAA miRNA
2 AAACCGTTACTATTACTGAGA seq_1000094_x1 1 hsa-miR-451a 17 36 11TC I-A 0 tt AAGGAAAC AGTTTAGT miRNA
3 TGAGGTAGCAGATTGTATAGTC seq_1000169_x1 1 hsa-let-7f-5p 8 28 9CT I-C 0 t GGGATGAG AGTTTTAG miRNA
4 TGGGTCTTTGCGGGCGAGAT seq_100019_x12 12 hsa-miR-193a-5p 21 40 0 0 0 ga GGGCTGGG ATGAGGGT miRNA
5 TGAGGTAGTAGATTGTATAGTG seq_100035_x12 12 hsa-let-7f-5p 8 28 0 I-G 0 t GGGATGAG AGTTTTAG miRNA
6 TGAAGTAGTAGGTTGTGTGGTAT seq_1000437_x1 1 hsa-let-7b-5p 6 26 4AG I-AT 0 t GGGGTGAG GGTTTCAG miRNA
precursor ambiguity
1 hsa-mir-92a-2 1
2 hsa-mir-451a 1
3 hsa-let-7f-2 1
4 hsa-mir-193a 1
5 hsa-let-7f-2 1
6 hsa-let-7b 1
I would like to create a unique ID consisting of the columns mir and seq:
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT
Then I would like to merge all 96 files based on this ID and take the column freq from each file.
ID freq_file1 freq_file2 ...
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT 4 12
If an ID is not present in a specific file, the freq should be NA.
We can use Reduce with merge on a list of data.frames.
lst <- lapply(mget(ls(pattern="test\\d+")),
function(x) subset(transform(x, ID=paste(precursor,
seq)), select=c("ID", "freq")))
Reduce(function(...) merge(..., by = "ID"), lst)
NOTE: In the above, I assumed that the "test1", "test2" objects are already created in the global environment by reading the files in 'file_list'. If not, we can directly read the files into a list instead of creating additional data.frame objects i.e.
library(data.table)
lst <- lapply(file_list, function(x)
fread(x, select=c("precursor", "seq", "freq"))[,
list(ID=paste(precursor, seq), freq=freq)])
Reduce(function(x,y) x[y, on = "ID"], lst)
Or instead of fread (from data.table) use read.csv/read.table and use merge as before on 'lst'
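
A variant sketch, under some assumptions: it builds the ID from mir and seq with "_" as the question asks, gives each file its own freq column name before merging, and passes all = TRUE so that an ID missing from a file shows up as NA. The two small data frames stand in for the real files; the second row of test2 is hypothetical, chosen to reproduce the 4 and 12 in the desired output, and the names dfs, file1 and file2 are made up.

```r
# Tiny stand-ins for two of the files
test1 <- data.frame(seq  = c("TGGAGTGTGATAATGGTGTTT", "AAACCGTTACCATTACTGAAT"),
                    mir  = c("hsa-miR-122-5p", "hsa-miR-451a"),
                    freq = c(4, 4))
test2 <- data.frame(seq  = c("ATTGCACTTGTCCTGGCCTGT", "TGGAGTGTGATAATGGTGTTT"),
                    mir  = c("hsa-miR-92a-3p", "hsa-miR-122-5p"),
                    freq = c(1, 12))   # second row is hypothetical
dfs <- list(file1 = test1, file2 = test2)

# Keep only ID and freq, naming the freq column after its file
lst <- lapply(names(dfs), function(nm) {
  x <- dfs[[nm]]
  out <- data.frame(ID = paste(x$mir, x$seq, sep = "_"), freq = x$freq)
  names(out)[2] <- paste0("freq_", nm)
  out
})

# all = TRUE keeps IDs that are missing from some files and fills them with NA
merged <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), lst)
merged
```

Renaming the freq columns up front also avoids the repeated .x/.y suffixes that merge() would otherwise generate across many files.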

Print variable names in table() with 2 binary variables in R

I'm sure I'll kick myself for not being able to figure this out, but when you have a table with 2 variables (i.e. cross-tab) and both are binary or otherwise have the same levels, how can you make R show which variable is displayed row-wise and which is column-wise?
For example:
> table(tc$tr, tc$fall_term)
0 1
0 1569 538
1 0 408
is a little confusing because it's not immediately obvious which is which. Of course, I checked out ?table but I don't see an option to do this, at least not a logical switch that doesn't require me to already know which is which.
I tried ftable but had the same problem.
The output I want would be something like this:
> table(tc$tr, tc$fall_term)
tr tr
0 1
fallterm 0 1569 538
fallterm 1 0 408
or
> table(tc$tr, tc$fall_term)
fallterm fallterm
0 1
tr 0 1569 538
tr 1 0 408
You can use the dnn option :
table(df$tr,df$fall_term) # impossible to tell the difference
0 1
0 18 33
1 15 34
table(df$tr,df$fall_term,dnn=c('tr','fall_term')) # you have the names
fall_term
tr 0 1
0 18 33
1 15 34
Note that if df contains only these two columns, in this order, you can write table(df$tr,df$fall_term,dnn=colnames(df)) instead of typing the names by hand.
Check out dimnames, and in particular their names. I’m using another example here since I don’t have your data:
x = HairEyeColor[, , Sex = 'Male']
names(dimnames(x))
# [1] "Hair" "Eye"
names(dimnames(x)) = c('Something', 'Else')
x
# Else
# Something Brown Blue Hazel Green
# Black 32 11 10 3
# Brown 53 50 25 15
# Red 10 10 7 7
# Blond 3 30 5 8
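
For completeness, a sketch of three base R calls that label the dimensions without spelling out dnn. The data frame tc below is a made-up stand-in for the asker's data:

```r
# Hypothetical stand-in for the asker's data
tc <- data.frame(tr        = c(0, 0, 1, 1, 0),
                 fall_term = c(0, 1, 1, 1, 0))

# Bare symbols are used as dimension names
tab1 <- with(tc, table(tr, fall_term))

# A data frame passes its column names along
tab2 <- table(tc[, c("tr", "fall_term")])

# The formula interface names the dimensions after the variables
tab3 <- xtabs(~ tr + fall_term, data = tc)

names(dimnames(tab1))   # "tr" "fall_term"
```

The unlabelled output in the question comes from passing tc$tr and tc$fall_term, which are expressions rather than plain symbols, so table() has no name to pick up.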

subset all columns in a data frame less than a certain value in R

I have a dataframe that contains 7 p-value variables.
I can't post it because it is private data but it looks like this:
>df
o m l c a aa ep
1.11E-09 4.43E-05 0.000001602 4.02E-88 1.10E-43 7.31E-05 0.00022168
8.57E-07 0.0005479 0.0001402 2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368 1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05 0.0006228 0.0007368 4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05 0.002307 0.0008453 4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values from 0-1.
I would like to subset the entire data frame by extracting all the values in each column less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths. E.g. c has 290 values less than 0.009, and o has 300, aa has 500 etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this I just end up with a very small number of rows, with every column the same length, which isn't what I want: I want all the values in each column that meet the criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
o# m# l# c# a# aa# ep#
0.00-0.000001 545 58 85 78 85 45 785
0.00001-000.1 54 77 57 57 74 56 58
0.001-0.002 54 7 5 5 98 7 5 865
An attempt:
sapply(df,function(x) table(cut(x[x<0.009],c(0,0.000001,0.001,0.002,Inf))) )
# o m l c a aa ep
#(0,1e-06] 2 0 0 5 5 0 0
#(1e-06,0.001] 3 4 5 0 0 5 4
#(0.001,0.002] 0 0 0 0 0 0 1
#(0.002,Inf] 0 1 0 0 0 0 0
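
A slightly adjusted sketch of the same idea, using only the o and m columns from the question. Two changes are assumptions on my part rather than requirements: the last break is set to the 0.009 cut-off instead of Inf, and include.lowest = TRUE is added so that a p-value of exactly 0 is counted in the first bin rather than turned into NA.

```r
# The o and m columns from the question
df <- data.frame(o = c(1.11e-09, 8.57e-07, 1.112e-05, 7.31e-05, 8.17e-05),
                 m = c(4.43e-05, 0.0005479, 0.0005479, 0.0006228, 0.002307))

breaks <- c(0, 1e-06, 0.001, 0.002, 0.009)

# Each column is filtered and binned on its own, so differing lengths are fine;
# every table has the same bins, so sapply() returns a bins-by-variables matrix
counts <- sapply(df, function(x) {
  table(cut(x[x < 0.009], breaks, include.lowest = TRUE))
})
counts
```

as.data.frame(counts) turns the matrix into a data frame with the bin labels as row names.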

For-loop working but not as expected

I'm writing a loop to compute three columns: the min, the max and the mean of a measurement over several plots. I am working with forest inventories containing several measurements at thousands of plots.
What I want to do is to compute the min, max and mean of the basal area (a measurement) for each species (153 species total) at specific plots that differ between the species.
First, I have to select, for each species, all the plots matching the criterion from purs80 (a plot is "pure" when at least 80 % of the plot is composed of a single species).
head(purs80[,1:10])
02 03 04 05 06 07 08S 09 10 11
27 0.000000 0.000000 0 0 0 0 0 0 0.00000 0
41 0.000000 0.000000 0 0 0 0 0 0 0.00000 0
47 6.369376 8.824162 0 0 0 0 0 0 84.80646 0
54 0.000000 100.000000 0 0 0 0 0 0 0.00000 0
83 100.000000 0.000000 0 0 0 0 0 0 0.00000 0
101 0.000000 0.000000 0 0 0 0 0 100 0.00000 0
#list of all the purs plots by species
listplotspur80<-apply(purs80, 2,function(v) which(v > 80))
This works. listplotspur80 is a list of 153 elements, each holding the plots where the criterion is met. Here is the head of its summary, followed by its last element.
head(summary( listplotspur80))
Length Class Mode
02 "1422" "-none-" "numeric"
03 "1479" "-none-" "numeric"
04 " 50" "-none-" "numeric"
05 "1836" "-none-" "numeric"
06 " 689" "-none-" "numeric"
07 " 51" "-none-" "numeric"
So you can see that the number of plots varies from one element of the list to the next.
> listplotspur80[[153]]
22455 505927 516264 524860 545205 639576
1345 15389 15738 16029 16711 19410
This gives me the plotIDs as names, which I can extract with the function names(), as below:
> names(listplotspur80[[153]])
[1] "22455" "505927" "516264" "524860" "545205" "639576"
Now that I can extract the list of plots for each species, I need to associate each plot with its basal area (BA) value; these are stored in a data frame called BA.
> head(BA)
BA plotID
19 41.72365 19
23 13.37109 23
27 55.92989 27
41 25.50725 41
45 34.86734 45
47 30.63582 47
> dim(BA)
[1] 44065 2
So, from this list holding the plots for each species, and the data frame BA holding the BA of each plot, I want to calculate the min, max and mean over these plots for every species and store the results in a new data frame.
#Create a loop that does the job!
outG80<-matrix(nrow=153, ncol=3, NA)
for (i in 1:153 ){
outG80[i,1]<-min(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
For each species, I select the rows matching the plotIDs from the list and apply the function to all the corresponding BA values (column 1 of BA).
outG80[i,2]<-max(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
outG80[i,3]<-mean(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
}
outG80<-as.data.frame(outG80)
names(outG80)<-c("Gmin","Gmax","Gmean")
outG80
So the loop works and I get a data frame as I want, BUT the results are wrong and I can't find why. The min and max are the same, whereas I know that, for example, the first species has 1422 different plots with different BA values.
Gmin Gmax Gmean
1 33.23970 33.23970 33.23970
2 29.89472 29.89472 29.89472
3 13.90947 43.33606 28.62277
4 17.91288 17.91288 17.91288
5 Inf -Inf NaN
6 11.42602 11.42602 11.42602
If you have any idea of the mistake in my loop please let me know.
Thanks a lot for your help.
I tried to write some code you could use to replicate the problem, but I end up with huge data frames. Sorry for the inconvenience.
Here's what a small reproducible data set might look like:
set.seed(5)
BA <- data.frame(BA=round(runif(5,0,10),1), plotID=11:15)
purs80 <- matrix(sample(c(0,90), 4*6, prob=c(0.8, 0.2), replace=TRUE), ncol=6)
colnames(purs80) <- paste("sp", 1:ncol(purs80), sep="")
rownames(purs80) <- sample(BA$plotID)[1:4]
In this case, I would first get the BA values in the same order as in the purs80 data frame and then get the min, max, and mean within the apply function.
ordered.BA <- BA$BA[match(rownames(purs80), BA$plotID)]
out <- t(apply(purs80, 2, function(v) {
use <- ordered.BA[which(v > 80)]
if(length(use)==0) c(Gmin=NA, Gmax=NA, Gmean=NA)
else c(Gmin=min(use), Gmax=max(use), Gmean=mean(use))
}))
Here's the data and results:
> BA
BA plotID
1 2.0 11
2 6.9 12
3 9.2 13
4 2.8 14
5 1.0 15
> purs80
sp1 sp2 sp3 sp4 sp5 sp6
15 0 0 0 90 0 0
12 0 0 0 0 0 0
11 90 0 0 90 0 90
13 90 0 0 90 0 0
> out
Gmin Gmax Gmean
sp1 2 9.2 5.600000
sp2 NA NA NA
sp3 NA NA NA
sp4 1 9.2 4.066667
sp5 NA NA NA
sp6 2 2.0 2.000000
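
As for why the original loop gives identical min and max values, a likely cause is the == comparison: it compares the two vectors element by element, recycling the shorter one, rather than testing membership. The sketch below reuses the small BA data frame from above; the vector ids is a made-up stand-in for names(listplotspur80[[i]]).

```r
BA <- data.frame(BA = c(2.0, 6.9, 9.2, 2.8, 1.0), plotID = 11:15)
ids <- c("11", "13")   # hypothetical stand-in for names(listplotspur80[[i]])

# == pairs plotID with the recycled ids (11, 13, 11, 13, 11), so only
# position-wise coincidences match; R also warns about the mismatched
# lengths, which is silenced here
wrong <- suppressWarnings(BA$BA[BA$plotID == as.numeric(ids)])
wrong    # 2 (a single value, so min, max and mean all coincide)

# %in% asks, for each plotID, whether it occurs anywhere in ids
sel <- BA$BA[BA$plotID %in% as.numeric(ids)]
stats <- c(Gmin = min(sel), Gmax = max(sel), Gmean = mean(sel))
stats    # Gmin 2.0, Gmax 9.2, Gmean 5.6
```

Replacing == with %in% in the three lines of the original loop should therefore be enough, although the match()/apply() version above avoids the loop altogether.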
