I am somewhat new to R, so forgive my basic questions.
I perform a CCA on a full dataset (358 sites, 40 abiotic parameters, 100 species observation).
library(vegan)
env <- read.table("env.txt", header = TRUE, sep = "\t", dec = ",")
otu <- read.table(otu.txt", header = TRUE, sep = "\t", dec = ",")
cca <- cca(otu~., data=env)
cca.plot <- plot(cca, choices=c(1,2))
vif.cca(cca)
ccared <- cca(formula = otu ~EnvPar1,2,n, data = env)
ccared.plot <- plot(ccared, choices=c(1,2))
orditorp(ccared.plot, display="sites")
This works without using sample names in the first columns (initially, the first column containing numeric samples names got interpreted as a variable, so i used tables without that information. When i add site names to the plot via orditorp, it gives "row.name=n" in the plot.)
I want to use my sample names, however. I tried row.names=1 on both tables with sample name information:
envnames <- read.table("envwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
otunames <- read.table("otuwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
, and any combination of env/otu/envnames/otunames. cca worked out well in any case, but any plot command yielded
plot.ccarownames <- plot(cca(ccarownames, choices=c(1,2)))
Error in rowSums(X) : 'x' must be numeric
My second problem is connected to that: The 358 sites are grouped into 6 groups (4x60,2x59). The complete matrix has this information inferred as an extra column.
Since i couldnt work out the row name problem, i am even more stuck with nominal data, anyhow.
The original matrix contains a first column (sample names, numeric, but can be easily transformed to nominal) and second one (group identity, nominal), followed by biological observations.
What i would like to have:
A CCA containing all six groups that is coloring sites per group.
A CCA containing only data for one group (without manual
construction of individual input tables)
CCA plots that are using my original sample names.
Any help is appreciated! Really, i am stuck with it since yesterday morning :/
I'm using cca() from vegan myself and I have some of your own problems, however I've been able to at least solve your original "row names" problem. I'm doing a CCA analysis on data from 41 soils, with 334 species and 39 environmental factors.
In my case I used
rownames(MyDataSet) <- MyDataSet$ObservationNamesColumn
(I used default names such as MyDataSet for the sake of example here)
However I still had environmental factors which weren't numerical (such as soil texture). You could try checking for non numerical factors in case you have a mistake in your original dataset or an abiotic factor which is not interpreted as numerical for any other reason. To do this you can either use the command str(MyDataSet) which tells you the nature of each of your variable, or lapply(MyDataSet, class) which also tells you the same but in a different output.
In case you have abiotic factors which are not numerical (again, such as texture) and you want to remove them, you can do so by creating a whole new dataset using only the numerical variables (you will still keep your observation names as they were defined as row names), this is rather easy to do and can be done using something similar to this:
MyDataSet.num <- MyDataSet[,sapply(MyDataSet, is.numeric)]
This creates a new data set which has the same rows as the original but only columns (variables) with numeric values. You should be able then to continue your work using this new data set.
I am very new to both R programming and statistics (I'm a microbiologist) but I hope this helps!
Related
Recently I have switched from STATA to R.
In STATA, you have something called value label. Using the command encode for example allows you to turn a string variable into a numeric, with a string label attached to each number. Since string variables contain names (which repeat themselves most of the time), using value labels allows you to save a lot of space when dealing with large dataset.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that could attach labels to my values vector is sjlabelled. It does the attachment but when I’m trying to merge attached numeric vector to another dataframe, the labels seems to “fall of”.
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector to string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”
Thanks!
P.S. Sorry about the inelegant code, as I said before, I'm pretty new in R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
I adjusted your initial test data a little bit. I was confused by so many strings and am unsure whether they are necessary for this issue. Let me know, if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
# create test data
id = seq(1:20)
variable1 = sample(30:35, 20, replace=TRUE)
variable2 = sample(36:40, 20, replace=TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R
I have a data.table in R that I'm trying to write out to a .txt file, and then input back into R.
It's sizeable table of 6.5M observations and 20 variables, so I want to use fread().
When I use
write.table(data, file = "data.txt")
a table of about 2.2GB is written in data.txt. In manually inspecting it, I can see that there are column names, that it's separated by " ", and that there are quotes on character variables. So everything should be fine.
However,
data <- fread("data.txt")
returns a data.table of 6.5M observations and 1 variable. OK, maybe for some reason fread() isn't automatically understanding the separator string:
data <- fread("data.txt", sep = " ")
All the data is in the proper variables now, but
R has added an unnecessary row-number column
in one (only one) of my columns all NAs have been replaced by 9218868437227407266
All variable names are missing
Maybe fread() isn't recognizing the header, somehow.
data <- fread("data.txt", sep = " ", header = T)
Now my first set of observations is my column names. Not very useful.
I'm completely baffled. Does anyone understand what's happening here?
EDIT:
row.names = F solved the names problem, thanks Ananda Mahto.
Ran
datasub <- data[runif(1000,1,6497651), ]
write.table(datasub, file = "datasub.txt", row.names = F)
fread("datasub.txt")
fread() seems to work fine for the smaller dataset.
EDIT:
Here is the subset of data I created above:
https://github.com/cbcoursera1/ExploratoryDataAnalysisProject2/blob/master/datasub.txt
This data comes from the National Emissions Inventory (NEI) and is made available by the EPA. More information is available here:
http://www.epa.gov/ttn/chief/eiinformation.html
EDIT:
I can no longer reproduce this issue. It may be that row.names = F solved the issue, or possibly restarting R/clearing my environment/something random fixed the problem.
I'm running through a large dataset chunk by chunk, updating a list of linear models as I go using the biglm function. The issue occurs when a particular chunk does not contain all the factors that I have in my linear model, and I get this error:
Error in update.biglm(model, new) : model matrices incompatible
The description of update.biglm mentions that factor levels must be the same across all chunks. I could probably come up with a workaround to avoid this, but there must be a better way. This pdf, on the 'biglm' page, mentions that "Factors must have their full set of levels
specified (not necessarily present in the data chunk)". So I think there is some way to specify all the possible levels so that I can update a model with not all the factors present, but I can't figure out how to do it.
Here's an example piece of code to illustrate my problem:
df = data.frame(a = rnorm(12),b = as.factor(rep(1:4,each = 3)),c = rep(0:1,6))
model = biglm(a~b+c,data = df
df.new = data.frame(a = rnorm(6),b = as.factor(rep(1:2,each = 3)),c =rep(0:1, 3))
model.new = update(model,df.new)
Thanks for any advice you have.
I came across this problem also. Are the variables in your large data frame specified as factors before breaking them into chunks? Also, is the data set formatted as a data frame?
large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)
If this is the case, then all of the factor levels should be preserved in the factor variables even after breaking the data frame into chunks. This will ensure that biglm creates the proper design matrix from the first call, and that all subsequent updates will be compatible.
If you have different data frames from the start, (as you illustrate in your example), perhaps you should merge them into one before breaking down into chunks. Continuing from your example:
df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]
model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2) # this is now compatible
I have a script that is working perfectly except that in my R cbind operation, adjacent to the numerical value that I require in the first row, is an 'X'.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least sqaures regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara
Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv and about the header argument. As you will find, header = TRUE is default in read.csv. Thus, if the csv file you read lacks header, read.csv will still 'assume' that the file has a header, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row is numbers only. read.csv will default treat this row as a header. make.names takes values in the first row (here numbers 0.0190228056162596, 0.000701081415172666), and spits out the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666. Which is not what you want.
Thus, you need to explicitly set header = FALSE to avoid that read.csvconvert the first row to (valid) variable names.
For next time, please provide a minimal, self contained example. Check these links for general ideas, and how to do it in R: here, here, here, and here
I have two csv files. One containing measurements at several points and one containing the description of the single points. It has about a 100 different points and 10000's of measurements but for simplification let's assume there are only two points and measurements.
data.csv:
point1,point2,date
25,80,11.06.2013
26,70,10.06.2013
description.csv:
point,name,description
point1,tempA,Temperature in room A
point2,humidA,Humidity in room A
Now I read both of the csv's into dataframes. Then I change the column names in the dataframe to make it more readable.
options(stringsAsFactors=F)
DataSource <- read.csv("data.csv")
DataDescription <- read.csv("description.csv")
for (name.source in names(DataSource))
{
count = 1
for (name.target in DataDescription$point)
{
if (name.source == name.target)
{
names(DataSource)[names(DataSource)==name.source] <- DataDescription[count,'name']
}
count = count + 1
}
}
So, my questions now are: Is there a way to do this without the loops? And would you change the names for readability as I did or not? If not, why?
The trick with replacements is sometimes to match the indexing on both sides of hte assignment:
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
DataDescription$name[match(DataDescription$point, names(DataSource))]
#> DataSource
tempA humidA date
1 25 80 11.06.2013
2 26 70 10.06.2013
Earlier effort :
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
gsub(" ", "_", DataDescription$description)[
match(DataDescription$point, names(DataSource))]
#> DataSource
Temperature_in_room_A Humidity_in_room_A date
1 25 80 11.06.2013
2 26 70 10.06.2013
Notice that I did not put non-syntactic names on that dataframe. To do so would have been a disservice. Anando Mahto's comment is well considered. I would not want to do this unless it were are the very end of data-processing or a side excursion on the way to a plotting effort. In that case I might not substitute the underscores. In the case where you wanted plotting lables there might be a further need for insertion of "\n" to fold the text within space constraints.
ok, I ordered the columns in the first one and the rows in the second one to work around the problem with the same order of the points. Now the description only need to have the same points as the data source. Here is my final code:
# set options to get strings right
options(stringsAsFactors=F)
# read in original data
DataOriginal <- read.csv("data.csv", sep = ";")
DataDescriptionOriginal <- read.csv("description.csv", sep = ";")
# sort the data
DataOrdered <- DataOriginal[,order(names(DataOriginal))]
DataDescriptionOrdered <- DataDescriptionOriginal[order(DataDescriptionOriginal$points),]
# copy data into final dataframe and replace names
Data <- DataOrdered
names(Data)[match(DataDescriptionOrdered$points, names(Data))] <- gsub(" ", "_", DataDescriptionOrdered$description)[match(DataDescriptionOrdered$points, names(Data))]
Thx a lot to everyone contributing to find a good solution for me!