I'm working on a loop to compute three columns: the min, the max and the mean of a measurement over several plots. I am working with forest inventories comprising several measurements at thousands of plots.
What I want to do is compute the min, max and mean of the basal area (one of the measurements) for each species (153 species total), at specific plots that differ between the species.
First, I have to select, for each species, all the plots in purs80 matching the criterion (a plot is "pure" when at least 80 % of its composition comes from a single species).
head(purs80[,1:10])
02 03 04 05 06 07 08S 09 10 11
27 0.000000 0.000000 0 0 0 0 0 0 0.00000 0
41 0.000000 0.000000 0 0 0 0 0 0 0.00000 0
47 6.369376 8.824162 0 0 0 0 0 0 84.80646 0
54 0.000000 100.000000 0 0 0 0 0 0 0.00000 0
83 100.000000 0.000000 0 0 0 0 0 0 0.00000 0
101 0.000000 0.000000 0 0 0 0 0 100 0.00000 0
#list of all the purs plots by species
listplotspur80<-apply(purs80, 2,function(v) which(v > 80))
This is working. listplotspur80 is a list of 153 elements, each of them containing the numbers of the plots where the criterion is met. Here is the head of a summary of it, as well as the last element.
head(summary( listplotspur80))
Length Class Mode
02 "1422" "-none-" "numeric"
03 "1479" "-none-" "numeric"
04 " 50" "-none-" "numeric"
05 "1836" "-none-" "numeric"
06 " 689" "-none-" "numeric"
07 " 51" "-none-" "numeric"
So you can see that the number of elements varies from one element of the list to another.
> listplotspur80[[153]]
22455 505927 516264 524860 545205 639576
1345 15389 15738 16029 16711 19410
This gives me the plot IDs as names, which I can extract with the names() function, as below:
> names(listplotspur80[[153]])
[1] "22455" "505927" "516264" "524860" "545205" "639576"
Now that I'm able to extract the list of plots for each species, I need to associate to each plot its basal area value, which is stored in a data frame called BA.
> head(BA)
BA plotID
19 41.72365 19
23 13.37109 23
27 55.92989 27
41 25.50725 41
45 34.86734 45
47 30.63582 47
> dim(BA)
[1] 44065 2
So, from this list giving the plots for each species, and from the data frame BA giving the BA of each plot, I want to calculate the min, max and mean over these plots for every species and store the result in a new data frame.
#Create a loop that does the job!
outG80 <- matrix(nrow=153, ncol=3, NA)
for (i in 1:153) {
  outG80[i,1] <- min(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
  outG80[i,2] <- max(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
  outG80[i,3] <- mean(BA[which(BA$plotID==as.numeric(names(listplotspur80[[i]]))),1])
}
For each species, I am selecting the rows corresponding to the plot IDs I have in the list, and I am applying the function to all the corresponding BA values (column 1 of BA).
outG80<-as.data.frame(outG80)
names(outG80)<-c("Gmin","Gmax","Gmean")
outG80
So the loop works and I am able to get a data frame shaped as I want... BUT the results are just not right and I can't find why. See: the min and max are the same, whereas I know that I have, for example, 1422 different plots for the first species, with different values for the BA.
Gmin Gmax Gmean
1 33.23970 33.23970 33.23970
2 29.89472 29.89472 29.89472
3 13.90947 43.33606 28.62277
4 17.91288 17.91288 17.91288
5 Inf -Inf NaN
6 11.42602 11.42602 11.42602
If you have any idea of the mistake in my loop, please let me know.
Thanks a lot for your help.
I have been trying to write some code that you could use to replicate the problem, but I end up with huge data frames. Sorry for the inconvenience.
Here's what a small reproducible data set might look like:
set.seed(5)
BA <- data.frame(BA=round(runif(5,0,10),1), plotID=11:15)
purs80 <- matrix(sample(c(0,90), 4*6, prob=c(0.8, 0.2), replace=TRUE), ncol=6)
colnames(purs80) <- paste("sp", 1:ncol(purs80), sep="")
rownames(purs80) <- sample(BA$plotID)[1:4]
In this case, I would first get the BA values in the same order as in the purs80 data frame and then get the min, max, and mean within the apply function.
ordered.BA <- BA$BA[match(rownames(purs80), BA$plotID)]
out <- t(apply(purs80, 2, function(v) {
use <- ordered.BA[which(v > 80)]
if(length(use)==0) c(Gmin=NA, Gmax=NA, Gmean=NA)
else c(Gmin=min(use), Gmax=max(use), Gmean=mean(use))
}))
Here's the data and results:
> BA
BA plotID
1 2.0 11
2 6.9 12
3 9.2 13
4 2.8 14
5 1.0 15
> purs80
sp1 sp2 sp3 sp4 sp5 sp6
15 0 0 0 90 0 0
12 0 0 0 0 0 0
11 90 0 0 90 0 90
13 90 0 0 90 0 0
> out
Gmin Gmax Gmean
sp1 2 9.2 5.600000
sp2 NA NA NA
sp3 NA NA NA
sp4 1 9.2 4.066667
sp5 NA NA NA
sp6 2 2.0 2.000000
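For the record, the mistake in the original loop is the `==` comparison: `BA$plotID == as.numeric(names(listplotspur80[[i]]))` compares two vectors of different lengths element by element, recycling the shorter one, instead of testing membership. That is why most species end up with only one accidental match (so min = max = mean), and a species with no match at all gives Inf/-Inf/NaN. Replacing `==` with `%in%` fixes the loop itself. A self-contained sketch using the small data set from above (rebuilt with fixed values so it does not depend on the RNG):

```r
# The small example data, hard-coded to match the printed BA and purs80
BA <- data.frame(BA = c(2.0, 6.9, 9.2, 2.8, 1.0), plotID = 11:15)
purs80 <- matrix(0, nrow = 4, ncol = 6,
                 dimnames = list(c("15", "12", "11", "13"), paste0("sp", 1:6)))
purs80["11", "sp1"] <- 90
purs80["13", "sp1"] <- 90
purs80[c("15", "11", "13"), "sp4"] <- 90
purs80["11", "sp6"] <- 90
listplotspur80 <- apply(purs80, 2, function(v) which(v > 80))

# The corrected loop: %in% tests membership, so ALL plots of species i match
n.sp <- ncol(purs80)
outG80 <- matrix(NA, nrow = n.sp, ncol = 3)
for (i in 1:n.sp) {
  ids  <- as.numeric(names(listplotspur80[[i]]))  # plot IDs for species i
  vals <- BA$BA[BA$plotID %in% ids]               # all matching BA values
  if (length(vals) > 0)                           # species with no pure plots stay NA
    outG80[i, ] <- c(min(vals), max(vals), mean(vals))
}
outG80 <- as.data.frame(outG80)
names(outG80) <- c("Gmin", "Gmax", "Gmean")
```

This reproduces the same numbers as the apply() version above, and the `if (length(vals) > 0)` guard leaves empty species as NA instead of Inf/-Inf/NaN.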
Related
I've begun using R recently, so this might be simple to solve. I actually have two problems, but I believe they're connected.
I have a simple dataset (.csv file with 3 columns and 7 rows) and I'm trying to create a table out of it and plot a bar graph with the values of the two numerical columns.
Grupo de idade;Freq. Relativa Homens;Freq. Relativa Mulheres
16 a 19;0,411;0,415
20 a 24;0,787;0,701
25 a 34;0,922;0,745
35 a 44;0,923;0,755
45 a 54;0,882;0,760
55 a 64;0,696;0,583
65 ou mais;0,205;0,126
df = read.csv(filename, header = TRUE, sep = ";")
tab = table(df)
sd = cbind(df$Freq.Homens, df$Freq.Mulheres)
barplot(sd, beside = TRUE)
So first my table ends up looking like this, with the values as headers:
Freq..Relativa.Homens
Grupo.de.idade 0,205 0,411 0,696 0,787 0,882 0,922 0,923
16 a 19 0 0 0 0 0 0 0
20 a 24 0 0 0 0 0 0 0
25 a 34 0 0 0 0 0 0 0
35 a 44 0 0 0 0 0 0 0
45 a 54 0 0 0 0 1 0 0
55 a 64 0 0 0 0 0 0 0
65 ou mais 0 0 0 0 0 0 0
And my graph is plotted with integer values like 2, 4, and 6. I noticed that this happens because of the cbind function, but without it I can't plot anything.
First: R thinks Anglo-American, i.e. it assumes the decimal mark is ".".
The decimal mark in your data is ",". You have to tell this to R by adding the argument `dec = ","`, i.e.
df = read.csv(filename, header = TRUE, sep = ";", dec = ",")
Otherwise R interprets the numbers as characters or strings.
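As an aside, base R also ships read.csv2(), whose defaults (sep = ";", dec = ",") are exactly this file's convention, so no extra arguments are needed. A quick sketch with two of the rows inlined via the text argument:

```r
# read.csv2() defaults to sep = ";" and dec = ",", the usual
# continental-European CSV convention
df <- read.csv2(text = "Grupo de idade;Freq. Relativa Homens;Freq. Relativa Mulheres
16 a 19;0,411;0,415
20 a 24;0,787;0,701")

is.numeric(df$Freq..Relativa.Homens)  # TRUE: parsed as numbers, not strings
```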
table makes a contingency table of two variables. This, however, only makes sense for categorical variables, e.g. the number of observations by age and sex.
You have only one categorical variable (Grupo.de.idade) and two continuous variables.
R does its best to make sense of this and simply interprets the values of the continuous variables as categories, which makes no sense: e.g. there is 1 observation in your data set with "Grupo de idade" = "16 a 19" and a value of "0,411" for "Freq. Relativa Homens". That's what table is telling you.
Moreover, your data is already in table format, so if you want to have a look at your data, simply type df at the console:
df
#> Grupo.de.idade Freq..Relativa.Homens Freq..Relativa.Mulheres
#> 1 16 a 19 0.411 0.415
#> 2 20 a 24 0.787 0.701
#> 3 25 a 34 0.922 0.745
#> 4 35 a 44 0.923 0.755
#> 5 45 a 54 0.882 0.760
#> 6 55 a 64 0.696 0.583
#> 7 65 ou mais 0.205 0.126
The easiest way to make a simple barplot is like this:
barplot(Freq..Relativa.Homens ~ Grupo.de.idade, data = df)
On the left of the "~", put the variable to plot; on the right, the grouping variable. Furthermore, you have to tell R the name of the dataset.
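If you still want the two series side by side, as the original cbind()/barplot() attempt intended, give barplot() a matrix with one row per series and one column per age group. A minimal sketch with two of the rows hard-coded:

```r
# Two rows of the data, hard-coded for a self-contained example
df <- data.frame(Grupo.de.idade = c("16 a 19", "20 a 24"),
                 Homens         = c(0.411, 0.787),
                 Mulheres       = c(0.415, 0.701))

# One row per series, one column per age group
m <- t(as.matrix(df[, c("Homens", "Mulheres")]))
colnames(m) <- df$Grupo.de.idade

# beside = TRUE draws grouped (not stacked) bars, one group per age class
barplot(m, beside = TRUE, legend.text = rownames(m))
```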
However, instead of a trial-and-error approach to R, I recommend working through the introductory chapters of one of the free tutorials or textbooks available on the internet, like The Pirate's Guide to R.
Created on 2020-03-27 by the reprex package (v0.3.0)
library(boot)
install.packages("AMORE")
library(AMORE)
l.data=nrow(melanoma)
set.seed(5)
idxTrain<-sample(1:l.data,100)
idxTest<-setdiff(1:l.data,idxTrain)
set.seed(3)
net<-newff(n.neurons=c(6,6,3),
learning.rate.global=0.02,
momentum.global=0.5,
hidden.layer="sigmoid",
output.layer="purelin",
method="ADAPTgdwm",
error.criterium="LMS")
result<-train(net,
melanoma[idxTrain,-2],
melanoma$status,
error.criterium="LMS",
report=TRUE,
show.step=10,
n.shows=800)
The problem is that train() fails with the error "target - non-conformable arrays".
I know that the problem is with melanoma$status, but I have no idea how to alter the data accordingly. Any ideas? A couple of rows of the data (in case you don't have the boot package):
melanoma:
time status sex age year thickness ulcer
1 10 3 1 76 1972 6.76 1
2 30 3 1 56 1968 0.65 0
3 35 2 1 41 1977 1.34 0
4 99 3 0 71 1968 2.90 0
5 185 1 1 52 1965 12.08 1
Your target variable should first be restricted to the training indices. Moreover, the target should have a number of columns equal to the number of classes, i.e. one-hot encoding. Something like this:
net<-newff(n.neurons=c(6,6,3),
learning.rate.global=0.02,
momentum.global=0.5,
hidden.layer="sigmoid",
output.layer="purelin",
method="ADAPTgdwm",
error.criterium="LMS")
Target = matrix(data=0, nrow=length(idxTrain), ncol=3)  # one column per class
status_mat = matrix(nrow=length(idxTrain), ncol=2)
status_mat[,1] = c(1:length(idxTrain))                  # observation (row) index
status_mat[,2] = melanoma$status[idxTrain]              # class label, 1..3
# column-major linear index: set a single 1 in each row of Target
Target[(status_mat[,2]-1)*length(idxTrain)+status_mat[,1]] = 1
result<-train(net,
melanoma[idxTrain,-2],
Target,
error.criterium="LMS",
report=TRUE,
show.step=10,
n.shows=800)
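The status_mat construction above computes a column-major linear index by hand; R can do the same with a two-column (row, column) index matrix, which reads more directly. A minimal sketch with a made-up status vector standing in for melanoma$status[idxTrain]:

```r
# Hypothetical class labels (1..3), standing in for melanoma$status[idxTrain]
status <- c(3, 3, 2, 3, 1)

# One row per observation, one column per class
Target <- matrix(0, nrow = length(status), ncol = 3)

# A two-column (row, col) matrix index sets exactly one 1 per row
Target[cbind(seq_along(status), status)] <- 1
```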
I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type
I wanted to get the number of Hits per type and quarter for all of the data. Given that the Hits are either 0 or 1, a simple sum() using tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For some of the types that have no occurrences in a given quarter, tapply returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA with tapply using just sum() I get values I expect:
sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you, depending on exactly what you are looking for. If you are just interested in the number of positive Hits per Type and Quarter and don't need a record of combinations with no Hits, you can get the answer with
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
dataHit[["Type"]] <- factor(data[["Type"]])
dataHit[["Quarter"]] <- factor(data[["Quarter"]])
table(dataHit[["Type"]], dataHit[["Quarter"]])
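A third option, if your R is at least 3.4.0: tapply() itself gained a default argument, so you can keep the original call and have the empty type/quarter cells filled with 0 instead of NA. A toy sketch with made-up vectors standing in for the real columns:

```r
# Made-up stand-ins for the real columns; RAIN never occurs in quarter 2
syno.h         <- c(1, 0, 1, 1)
syno.wrng      <- c("RAIN", "RAIN", "SNOW", "SNOW")
quarter.number <- c(1, 1, 1, 2)

# Cells with no observations get 0 instead of NA (requires R >= 3.4.0)
tab <- tapply(syno.h, list(syno.wrng, quarter.number), sum, default = 0)
```

This also explains the original symptom: tapply never calls FUN for an empty cell, it just fills in the default, which used to be hard-wired to NA.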
What is the command for printing the rows where a column does not match a specific pattern? From the data below, I would like to print the number of rows where the t5 column does not start with d-. (In this example, that is just one of the rows.)
I tried
dim(df[df$t5 !="d-",])
df:
name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
6 seq_10002_x17 17 hsa-miR-10a-5p 23 44 5GT 0 d-T 0 TATATACC TGTGTAAG miRNA 1
19 seq_100091_x3 3 hsa-miR-142-3p 54 74 0 u-CA d-TG 0 AGGGTGTA TGGATGAG miRNA 1
20 seq_100092_x1 1 hsa-miR-142-3p 54 74 0 u-CT d-TG 0 AGGGTGTA TGGATGAG miRNA 1
23 seq_100108_x5 5 hsa-miR-10a-5p 23 44 4NC 0 d-T 0 TATATACC TGTGTAAG miRNA 1
26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
28 seq_100121_x1 1 hsa-miR-192-5p 25 45 1CT u-CT d-C d-A GGCTCTGA AGCCAGTG miRNA 1
df1 <- df[!grepl("^d-",df[,8]),]
nrow(df1)
print(df1)
There is one row in your data that has a t5 entry that does not start with "d-". To find this row, you could try:
df[!grepl("^(d-)",df$t5),]
# name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
#26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
If you only want to know the row number, you can get it with rownames()
> rownames(df[!grepl("^(d-)",df$t5),])
#[1] "26"
or with which(),
> which(!grepl("^(d-)",df$t5))
#[1] 5
depending on whether you want the position counting from the top of your data frame or the row name shown on the left.
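Because the pattern here is a fixed prefix rather than a real regular expression, startsWith() (in base R since 3.3.0) does the same job without any regex escaping. A toy sketch with the t5 values from the sample rows:

```r
# t5 values copied from the sample data above
t5 <- c("d-T", "d-TG", "d-TG", "d-T", "u-G", "d-C")

# Literal prefix test; TRUE where the value does NOT start with "d-"
which(!startsWith(t5, "d-"))
```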
I have a data frame that contains 7 p-value variables.
I can't post it because it is private data, but it looks like this:
>df
o m l c a aa ep
1.11E-09 4.43E-05 0.000001602 4.02E-88 1.10E-43 7.31E-05 0.00022168
8.57E-07 0.0005479 0.0001402 2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368 1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05 0.0006228 0.0007368 4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05 0.002307 0.0008453 4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values from 0 to 1.
I would like to subset the entire data frame by extracting, from each column, all the values less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths: e.g. c has 290 values less than 0.009, o has 300, aa has 500, etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this I just end up with a very small number of rows, which isn't what I want; I want, for each column, all the values that fit the subset criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
                 o#   m#  l#  c#  a#  aa#  ep#
0-0.00001       545   58  85  78  85   45  785
0.00001-0.001    54   77  57  57  74   56   58
0.001-0.002      54    7   5   5  98   75  865
An attempt:
sapply(df,function(x) table(cut(x[x<0.009],c(0,0.000001,0.001,0.002,Inf))) )
# o m l c a aa ep
#(0,1e-06] 2 0 0 5 5 0 0
#(1e-06,0.001] 3 4 5 0 0 5 4
#(0.001,0.002] 0 0 0 0 0 0 1
#(0.002,Inf] 0 1 0 0 0 0 0
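One point worth spelling out: the columns hold different numbers of values below 0.009, but cut() always returns a factor with the same fixed set of bin levels, so every table() has the same length and sapply() can simplify everything into one matrix. A self-contained check on made-up p-values:

```r
# Two hypothetical p-value columns; "o" has two values < 0.009, "m" only one
df <- data.frame(o = c(1e-9, 1e-4, 0.5),
                 m = c(5e-4, 0.2, 0.9))

breaks <- c(0, 0.000001, 0.001, 0.002, Inf)
# cut() yields the same 4 levels for every column, so the result is a matrix
res <- sapply(df, function(x) table(cut(x[x < 0.009], breaks)))
```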