Ordination plot with vectors coded as centroids in R

In the vegan package, I tried to make an ordination plot with species as objects and environmental variables as vectors. However, the environmental variables are treated as factors and plotted as centroids instead of vectors. Strangely, every distinct cell value in the data frame is seen as its own environmental factor level, so I think the data frame is not structured correctly. When I plot the ordination without environmental variables, I don't get any problems.
summary(gutter.dca)
environfit = envfit(gutter.dca,gutterenv)
> head(environfit)
$vectors
NULL
$factors
Centroids:
DCA1 DCA2
vocht0,246435845 -0.2185 -1.0601
vocht0,249249249 0.1932 -1.1339
vocht0,251497006 0.0331 -2.0888
vocht0,264735265 -0.3353 -1.3403
vocht0,26911315 -0.0017 -0.9498
vocht0,272369715 -1.0733 0.0021
Species data frame:
head(gutter)
Acer.campestre Acer.pseudoplantanus Adoxa.moschatellina Aegopodium.podagraria Ajuga.reptans
Q1-1 0 0 5 0 0
Q1-2 0 70 15 20 0
Q1-3 0 15 0 0 0
Q1-4 0 3 0 0 0
Q2-1 0 3 0 0 0
Q2-2 1 0 0 0 0
Environmental variables data frame:
head(gutterenv)
vocht Ph.H2O ph.KCl mg.NO3.kg.soil mg.NH4.N.kg.soil litter.depth..cm.
1 0,26911315 7,41 6,686 2,811031105 4,674304351 7,5
2 0,246435845 7,225 6,349 2,567981088 6,735395066 6,5
3 0,264735265 7,001 6,491 2,336821354 8,400116244 5,1
4 0,325123153 6,732 5,444 2,518858082 7,684506342 8,25
5 0,446875 6,87 7,45 2,443686352 9,886923756 4
6 0,548476454 8,1 7,05 3,144954614 11,3179919 3
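The factor levels printed above (vocht0,246435845 and so on) suggest the columns of gutterenv were read in as factors, because the numbers use decimal commas; envfit() fits factors as centroids and only numeric variables as vectors. A minimal sketch of a fix, assuming the cells only need their commas converted to decimal points (alternatively, re-read the file with dec = ","):
gutterenv[] <- lapply(gutterenv, function(x)
  as.numeric(gsub(",", ".", as.character(x))))  # "0,269..." -> 0.269...
environfit <- envfit(gutter.dca, gutterenv)  # numeric columns now fit as vectors
plot(gutter.dca)
plot(environfit)  # adds arrows for the environmental variables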

Apply a function with if inside to a dataframe to take a value in a list in R

Hello everybody and thank you in advance for any help.
I imported a txt file named "project" into R. The resulting data frame, called "data", consists of 12 columns with information on 999 households.
head(data)
im iw r am af a1c a2c a3c a4c a5c a6c a7c
1 0.00 20064.970 5984.282 0 38 0 0 0 0 0 0 0
2 15395.61 7397.191 0.000 42 30 1 0 0 0 0 0 0
3 16536.74 18380.770 0.000 33 28 1 0 0 0 0 0 0
4 20251.87 14042.250 0.000 38 38 1 1 0 0 0 0 0
5 17967.04 12693.240 0.000 24 39 1 0 0 0 0 0 0
6 12686.43 21170.450 0.000 62 42 0 0 0 0 0 0 0
im=male income
iw=female income
r=rent
am=male age
af=female age
a1c, a2c, ..., a7c take the value 1 when there is a child under 18 in the household,
and the value 0 when there is not.
Now I have to calculate the taxed income separately for male and female for each household, based on some criteria. So I am trying to create one function that calculates two numbers, and then to apply this function to my data frame and get a list with these numbers back.
Specifically, I want something like this:
fact<-function(im,iw,r,am,af,a1c,a2c,a3c,a4c,a5c,a6c,a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
How can I fix this function so that I can apply it to my data frame?
Can a function return two values?
How can I apply the function?
Then I tried this:
fact<-function(im=data$im,iw=data$iw,r=data$r,am=data$am,af=data$af,a1c=data$a1c,a2c=data$a2c,a3c=data$a3c,a4c=data$a4c,a5c=data$a5c,a6c=data$a6c,a7c=data$a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
fact(data[1,])
but I get this error: Error in fact(data[1, ]) : object 'mti' not found
When I try the function with only "fti", it runs, but gives wrong results.
Besides the need to return multiple values using c(mti, fti), your function doesn't set a default value in case none of the conditions in the function is TRUE, so mti is never created.
Add mti <- NA (and likewise fti <- NA) at the start of your function, so NA is the default value.
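A minimal sketch of a corrected version, under two assumptions about apparent typos in the question: the doubled a5c term is meant to appear once, and the am<=67 / af<=67 comparisons in the fourth condition of each block are meant to be >=67:
fact <- function(im, iw, r, am, af, a1c, a2c, a3c, a4c, a5c, a6c, a7c) {
  mti <- NA_real_  # defaults, so both values exist even when no condition matches
  fti <- NA_real_
  kids <- 500 * (a1c + a2c + a3c + a4c + a5c + a6c + a7c)
  if (am > 0 && am < 67 && af > 0)  mti <- im - r / 2 - kids
  if (am > 0 && am < 67 && af == 0) mti <- im - r - kids
  if (am >= 67 && af > 0)           mti <- im - 1000 - r / 2 - kids
  if (am >= 67 && af == 0)          mti <- im - 1000 - r - kids
  if (af > 0 && af < 67 && am > 0)  fti <- iw - r / 2 - kids
  if (af > 0 && af < 67 && am == 0) fti <- iw - r - kids
  if (af >= 67 && am > 0)           fti <- iw - 1000 - r / 2 - kids
  if (af >= 67 && am == 0)          fti <- iw - 1000 - r - kids
  c(mti = mti, fti = fti)  # a function returns one object; a named vector carries both
}
# Row-wise application: do.call() matches each row's columns to the argument names.
taxed <- t(apply(data, 1, function(row) do.call(fact, as.list(row))))
head(taxed)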

gstat in R - Variogram cutoff distance is not working at larger specified distances with large gridded datasets

I am attempting to compute variograms in R with the gstat package on biomass data across management areas. The biomass data is a raster dataset with a 3.5 ft (1.0668 m) resolution. The SpatialPointsDataFrame I am passing to the variogram function is 18.6 Mb (814,223 elements). (I have also tried a SpatialPixelsDataFrame, but it does not like the 1.0668 m pixel size.) When I run the code:
v = variogram(ras.grid1@data[[1]] ~ 1, data = ras.grid1)
and look at the output "v", I get distance values that are much larger than the management area (and much larger than 1/3 of its diagonal length).
When I run the variogram function on smaller management units (40 ha), it gives me the results I would expect (this uses a SpatialPointsDataFrame of 7.9 Mb with 344,259 elements).
If I hard-code a smaller cutoff of 200 m on the initial, larger raster dataset, it again provides the distance values I expect. If I increase the cutoff to, say, 600 m, it once more produces distance values much larger than the specified 600 m cutoff; 300 m also gives unexpected results. For example:
####variogram computation with 200m cutoff....It works
v = variogram(ras.grid1@data[[1]] ~ 1, data = ras.grid1, cutoff = 200)
v
np dist gamma dir.hor dir.ver id
1 195954282 8.874169 4990.504 0 0 var1
2 572500880 20.621626 5627.534 0 0 var1
3 958185761 33.701344 5996.423 0 0 var1
4 1288501796 46.920392 6264.396 0 0 var1
5 1652274803 60.198360 6472.187 0 0 var1
6 1947750363 73.502011 6642.960 0 0 var1
7 2282469596 86.807781 6802.124 0 0 var1
8 2551355646 100.131946 6942.277 0 0 var1
9 2849678492 113.441335 7049.838 0 0 var1
10 3093057361 126.751400 7149.102 0 0 var1
11 3375989515 140.081110 7240.848 0 0 var1
12 3585116223 153.418095 7322.990 0 0 var1
13 3821495516 166.721460 7394.616 0 0 var1
14 4036375072 180.053643 7443.040 0 0 var1
15 4235205167 193.389119 7476.061 0 0 var1
####variogram computation with 600m cutoff....It returns unexpected
####distance values
v2 = variogram(ras.grid1@data[[1]] ~ 1, data = ras.grid1, cutoff = 600)
v2
np dist gamma dir.hor dir.ver id
1 1726640923 26.54691 5759.951 0 0 var1
2 593559666 510.62232 53413.914 0 0 var1
3 3388536438 229.26702 15737.659 0 0 var1
4 1464228507 966.36789 49726.788 0 0 var1
5 3503141163 623.13559 25680.965 0 0 var1
6 878031648 3454.21122 117680.266 0 0 var1
7 2233138601 1761.91799 50996.719 0 0 var1
8 3266098834 1484.40162 37369.451 0 0 var1
9 4056578316 1420.49358 31556.527 0 0 var1
10 254561085 26030.66780 517601.669 0 0 var1
11 562144107 13256.59985 239163.649 0 0 var1
12 557621435 14631.84504 243476.857 0 0 var1
13 385648032 22771.12890 352898.971 0 0 var1
14 4285655256 2163.11091 31213.201 0 0 var1
15 3744542323 2575.19496 34709.529 0 0 var1
Also, if I scale the data up to 3 m resolution, I again get the expected distance values.
I am not sure whether the large size of the raster dataset is causing the issue and what I am trying to do is simply not possible, or whether I am doing something wrong, or whether there is another way.
Thank you for the help and interest.
After exploring this in more detail, it does seem to be the size of the SpatialPointsDataFrame causing the issue. On my machine, keeping the size under 10 Mb seemed to do the trick. To reduce the size of the SpatialPointsDataFrame, I sampled the original raster using:
ras.grid<-ras.grid[sample(1:length(ras.grid), 350000),]
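For anyone repeating this, a sketch of that subsampling step with a seed and a size check (assuming ras.grid is the SpatialPointsDataFrame and sp/gstat are loaded):
library(sp)
library(gstat)
set.seed(1)  # make the subsample reproducible
ras.sub <- ras.grid[sample(length(ras.grid), 350000), ]
print(object.size(ras.sub), units = "Mb")  # aim to stay under ~10 Mb
v <- variogram(ras.sub@data[[1]] ~ 1, data = ras.sub, cutoff = 600)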

R text mining: how to segment a document into phrases, not terms

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But, much as in Chinese, English also has certain fixed phrases, such as "semantic distance" or "machine learning": if you segment them into single words, they take on totally different meanings. I want to know how to segment a document into phrases rather than words (terms).
You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations; these are probably the multi-word expressions you are referring to in English. To remove collocations containing stop words, first tokenise the text, then remove the stop words, leaving a "pad" in place to prevent false adjacencies in the results (two words that were not actually adjacent before the stop words between them were removed).
require(quanteda)
pres_tokens <-
    tokens(data_corpus_inaugural) %>%
    tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
    tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
# collocation count count_nested length lambda z
# 1 united states 157 0 2 7.893307 41.19459
# 2 let us 97 0 2 6.291128 36.15520
# 3 fellow citizens 78 0 2 7.963336 32.93813
# 4 american people 40 0 2 4.426552 23.45052
# 5 years ago 26 0 2 7.896626 23.26935
# 6 federal government 32 0 2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
Using this "compounded" token set, we can now turn the tokens into a document-feature matrix where the features are a mixture of original terms (those not found in a collocation) and collocations. As can be seen below, "united" occurs both alone and as part of the collocation "united_states".
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
# features
# docs united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
# 1789-Washington 4 2 0 0 0 0 0 0 0 0
# 1793-Washington 1 0 0 0 0 0 0 0 0 0
# 1797-Adams 3 9 0 0 0 0 0 0 0 0
# 1801-Jefferson 0 0 0 0 0 0 0 0 0 0
# 1805-Jefferson 1 4 0 0 0 0 0 0 0 0
If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:
# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens_of of_the the_senate senate_and and_of the_house
## 1789-Washington 1 20 1 1 2 2
## 1797-Adams 0 29 0 0 2 0
## 1793-Washington 0 4 0 0 1 0
## 1801-Jefferson 0 28 0 0 3 0
## 1805-Jefferson 0 17 0 0 1 0
## 1809-Madison 0 20 0 0 2 0
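As a side note, in current quanteda releases (3.x) dfm() no longer takes an ngrams argument; a rough equivalent forms the bigrams at the tokens stage (a sketch, assuming a recent quanteda):
library(quanteda)
bigram_dfm <- tokens(data_corpus_inaugural) %>%  # form bigrams while tokenising
    tokens_ngrams(n = 2) %>%
    dfm()
head(featnames(bigram_dfm))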

How to remove the csep/non-numeric error while generating heat map in R

I have data that looks like this:
Cluster_Combined Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_6 Cluster_10
G-protein coupled receptor signaling pathway (15) 2 6 0 4 3 1 0
GTP catabolic process (69) 1 0 0 0 2 0 0
activin receptor signaling pathway (17) 0 2 0 0 0 0 0
acute inflammatory response (7) 2 1 0 0 1 0 0
acute-phase response (8) 5 2 1 0 2 0 0
aging (5) 2 1 2 0 1 0 1
I want to create a heat map based on the values above, where the columns refer to the cluster names and the rows to the ontology terms.
Now I have the code below:
library(gplots);
dat <- read.table("http://dpaste.com/1505883/plain/",sep="\t",header=T);
hmcols <- rev(redgreen(2750));
heatmap.2(as.matrix(dat),scale="row",cols=hmcols,trace="none",dendrogram="none",keysize=1);
Although it does generate the plot, it gave me the following error:
Error in csep + 0.5 : non-numeric argument to binary operator
Furthermore, I cannot see the red-green effect in the plot.
How can I remove the error?
There is no cols= argument to heatmap.2(...); try col=hmcols instead:
heatmap.2(as.matrix(dat),scale="row",col=hmcols,trace="none",dendrogram="none",keysize=1)
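The odd error message likely comes from R's partial argument matching: cols= is a unique prefix of heatmap.2's colsep= argument, so the colour vector is taken as column-separator positions and the internal csep + 0.5 arithmetic fails on a character vector. A minimal illustration of the mechanism, using a hypothetical function f:
f <- function(col = "red", colsep = NULL) colsep
f(cols = 1:3)  # "cols" partially matches "colsep", so this returns 1:3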

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG #s as column names. The matrix is populated with 0s and 1s for presence/absence.
For example:
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->'pn'
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to collapse the matrix by combining every seven rows into a weekly matrix that sums the 0s and 1s from the daily matrix; the same site-fidelity script would then work on a weekly basis. The problem is that I'm a newbie, and I've had trouble finding an answer on how to collapse the daily matrix into a weekly one. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
require(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7, length.out=nrow(xdt))] ##fixed (length.out guards against a partial final week)
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
I can help you better preserve row names if you provide a reproducible example (see How to make a great R reproducible example?).
Edit:
Also, it's very atypical to use the right assignment -> as you do above.
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
                    breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0
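If you want the weekly presence counts the question asked about, rather than proportions, the same call with FUN=sum works:
aggregate(sites, by=list(week=weeks.factor), FUN=sum)  # weekly counts per tag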
