I'm trying to use the tip.disparity function in the geiger package in R.
My data:
Family Length Wing Tail
Alced 2.21416 1.88129 1.66744
Brachypt 2.36734 2.02373 2.03335
Bucco 2.23563 1.91364 1.80675
When I use the function "name.check" to check that the names in my data match those on my tree, it returns
$data.not.tree
[1] "1" "10" "11" "12" "2" etc
This shows that it is referring to the names by number. I've tried converting to a character vector, etc.
I've tried running it with
data.names=NULL
I'm looking simply to edit my data frame so that the package matches the names to those in my tree (tree is newick format)
Hope this is clearer
Thanks
I believe the clue is in the documentation (?name.check):
data.names: names of the tips in the order of the data; if this is not
given, names will be taken from the names or rownames of the
object data
If you want the program to return the names of the taxa that are included in the data frame but not present in the tree, you either need to assign the corresponding names as row names of your data frame, or specify them separately in the data.names argument. Note that the default row names of a data frame are the character equivalent of the row number, exactly what you're seeing above ...
edit based on additional information above:
R can't guess (or doesn't want to) that the names are contained in the Family element of your data frame. Try:
name.check(tree,traitdata,data.names=as.character(traitdata$Family))
Probably better in the long run to do:
rownames(traitdata) <- as.character(traitdata$Family)
traitdata <- subset(traitdata,select=-Family)
name.check(tree,traitdata)
Because you don't want to have Family included in your data set of traits -- it's an identifier, not a trait ...
If you look at the structure of the example data given in the package:
data(geospiza)
geospiza.data
you can see that the taxon names are included as row names, not as a column in the data frame itself ...
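As a rough illustration of why the numeric names appear and how moving Family into the row names removes them, here is a minimal base-R sketch. It does not call geiger at all: the tip_labels vector is a hypothetical stand-in for tree$tip.label, and setdiff only imitates the "in data but not in tree" part of the check.

```r
# Toy trait table using the three rows shown in the question
traitdata <- data.frame(
  Family = c("Alced", "Brachypt", "Bucco"),
  Length = c(2.21416, 2.36734, 2.23563),
  Wing   = c(1.88129, 2.02373, 1.91364),
  Tail   = c(1.66744, 2.03335, 1.80675),
  stringsAsFactors = FALSE
)

# Hypothetical tip labels; with a real tree these would come from tree$tip.label
tip_labels <- c("Alced", "Brachypt", "Bucco")

# Default row names are the row numbers as strings, so none match the tips
missing_before <- setdiff(rownames(traitdata), tip_labels)

# Move the identifier into the row names and drop the column
rownames(traitdata) <- traitdata$Family
traitdata <- subset(traitdata, select = -Family)

# Now every row name matches a tip label
missing_after <- setdiff(rownames(traitdata), tip_labels)
```

After the reassignment, traitdata holds only the three trait columns, which is the same layout as geospiza.data.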
PS it's not as nice an interface as StackOverflow, but there's a very friendly and active R-for-phylogeny mailing list at r-sig-phylo#r-projects.org ...
Task
I am attempting to use better functionality (loop or vector) to break a larger list down into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list with the letter B ... the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
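for(i in 1:26) {
  # keep the rows whose uniqueID starts with the i-th letter, in either case
  stake <- subset(df, grepl(paste0('^[',letters[i],LETTERS[i],']'), uniqueID))
  # pairwise edit distances within this letter's group
  stakeDist <- adist(stake$uniqueID, ignore.case=TRUE)
  write.table(stakeDist, paste0("stake_",LETTERS[i],".csv"), quote=TRUE, sep=',')
}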
Using a combination of paste0, and the builtin letters and LETTERS this creates your grep expression.
Using subset, the correct IDs are extracted
paste0 will also create a unique filename for write.table().
And it is all tied together using a for()-loop
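A possible alternative to the explicit loop, sketched here under assumptions: the uniqueID values below are made up for illustration, close_pairs is a hypothetical helper, and the threshold of 2 edits is arbitrary. The idea is to split the names by first character in one call (non-letters land in a pooled "other" group) and then list the near-duplicate pairs directly in R instead of highlighting them in Excel.

```r
# Hypothetical stakeholder names standing in for uniqueID
uniqueID <- c("Company A", "Companyy A", "Acme Ltd",
              "Beta Corp", "Beta Corpp", "3M")

# Group key: upper-cased first character, with non-letters pooled as "other"
first  <- toupper(substr(uniqueID, 1, 1))
key    <- ifelse(first %in% LETTERS, first, "other")
groups <- split(uniqueID, key)

# For one group, list the pairs whose edit distance is within max_dist
close_pairs <- function(x, max_dist = 2) {
  d <- adist(x, ignore.case = TRUE)
  # upper.tri() keeps each pair once and skips the zero diagonal
  idx <- which(d > 0 & d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = x[idx[, 1]], b = x[idx[, 2]], dist = d[idx],
             stringsAsFactors = FALSE)
}

# One table of candidate misspellings across all groups
candidates <- do.call(rbind, lapply(groups, close_pairs))
```

The candidates table still needs a human decision about which spelling is correct, but it narrows the manual review to the flagged pairs.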
I have a list, e.g. mylist=c("A","B","C"), and I wish to use list elements to extract factors of a data frame in R.
If MyDataFrame has a column name "A", I can extract the column/factor as MyDataFrame$A. However,
MyDataFrame$mylist[1]
fails. At first I thought that this was because mylist[1] is "A" whereas I need $A without the quotes. However, using
MyDataFrame$as.name(mylist[1])
fails as well, presumably because R looks for the string as.name(mylist[1]) as a factor name rather than processing the function (the error it gives is "attempt to apply non-function"). Setting x=as.name(mylist[1]) and then using MyDataFrame$x runs into the same problem of x not being treated as a variable.
Is there a straightforward way to do this, as I need to loop over a long list of column names in order to call the factors of interest.
Try this rather than $:
MyDataFrame[,mylist[1]]
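To show how this extends to looping over many column names, here is a small sketch. The contents of MyDataFrame are invented for the example; the double-bracket form [[ ]] is used as an equivalent way to pull a single column by a name held in a variable.

```r
# Hypothetical data frame with columns named after the entries of mylist
MyDataFrame <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))
mylist <- c("A", "B", "C")

# [[ ]] accepts a character string, so the name can come from a variable
first_col <- MyDataFrame[[mylist[1]]]

# Loop over all the names, here computing the mean of each column
col_means <- sapply(mylist, function(nm) mean(MyDataFrame[[nm]]))
```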
This first example shows a large list with the elements: "3.13" "3.3" "3.47" from a split dataframe based on the VDD1 col:
>Data_Char_VDD1 <- split(Data_Char, Data_Char$VDD1)
Looking up names of elements in the large list "Data_Char_VDD1" would look like this:
>names(Data_Char_VDD1)
[1] "3.13" "3.3" "3.47"
I want to look up names in several lists; the number of lists will vary from run to run. Let's say it is 4 this time.
I am trying to do something like this, which should create 4 variables called VDD1..4 containing their respective VDD combinations:
for(i in 1:length(Configuration$VDDlist[!is.na(Configuration$VDDlist)])){
assign(paste0("VDD",i), names(paste0("Data_Char_VDD",i)))
}
Resulting in 4 empty variables.
Debugging shows that my method of getting names from lists where the names are constructed using paste0 does not work:
>i <- 2
>names(paste0("Data_Char_VDD",i))
NULL
How can I construct the object names so that the names() function can use them?
Try :
names(eval(parse(text=paste0("Data_Char_VDD",i))))
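A lighter-weight option that may read more clearly is get(), which looks up an object by the string holding its name; mget() does the same for several names at once. The two lists below are hypothetical stand-ins for the split data frames in the question.

```r
# Hypothetical stand-ins for the split data frames in the question
Data_Char_VDD1 <- list("3.13" = 1, "3.3" = 2, "3.47" = 3)
Data_Char_VDD2 <- list("1.8" = 1, "1.9" = 2)

# get() returns the object whose name is given as a string
i <- 2
vdd_names <- names(get(paste0("Data_Char_VDD", i)))

# mget() fetches several objects at once into a named list
all_lists <- mget(paste0("Data_Char_VDD", 1:2))
all_names <- lapply(all_lists, names)
```

Keeping the results in one list such as all_names also avoids creating VDD1, VDD2, ... with assign().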
I am working with a long list of data frames.
Here is a simple hypothetical example of a data frame:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
I am trying to retrieve a specified column of the data frame using paste function.
get(paste("DFrame","$","ColTwo",sep=""))
The get function returns the following error, when trying to retrieve a specified column:
Error in get(paste("DFrame", "$", "ColTwo", sep = "")) :object 'DFrame$ColTwo' not found
When I enter the constructed name of the data frame DFrame$ColTwo it returns the desired output of the second column.
If I reconstruct an example without the '$' sign then I get the desired answer from the get function. For example the code yields 2:
Ans <- 2
get(paste("An","s",sep=""))
[1] 2
I am looking for the same desired outcome, but struggling to get past the error that the object could not be found.
I also attempted using the following format, but the quotation in the column name breaks the paste function:
paste("DFrame","[,"ColTwo"]",sep="")
Thank you very much for the input,
Kind regards
You can do that using the following syntax:
get("DFrame")[,"ColTwo"]
You can use paste() in both of these strings, for example:
get(paste("D", "Frame", sep=""))[,paste("Col", "Two", sep="")]
Edit: Despite someone downvoting this answer without leaving a comment, this does exactly what the original poster asked for. If you feel that it does not or is in some way dangerous, I would encourage you to leave a comment.
Stop trying to use paste and get entirely.
The whole point of having a list (of data frames, say) is that you can reference them using names:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
#A list of data frames
l <- list(DFrame,DFrame)
#The data frames in the list can have names
names(l) <- c("DF1",'DF2')
# Now you just use `[[`
> l[["DF1"]][["ColOne"]]
[1] 1 0
> l[["DF1"]][["ColTwo"]]
[1] Yes No
Levels: No Yes
If you have to, you can use paste to construct the indices passed inside [[.
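For instance, a minimal sketch of building both indices with paste0 (the pieces "DF", 1, "Col" and "One" are arbitrary and only serve the illustration):

```r
DFrame <- data.frame(ColOne = c(1, 0), ColTwo = c("Yes", "No"))
l <- list(DF1 = DFrame, DF2 = DFrame)

# Build the list name and the column name from pieces, then index with [[
df_name  <- paste0("DF", 1)
col_name <- paste0("Col", "One")
result   <- l[[df_name]][[col_name]]
```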
I'm really new to R. I have the following:
library(stringr)
data <-read.table("C:/dataAnalysis/dataset_317_1.txt")
d<-data[5]
set<-str_count(c("corn", "cornmeal", "corn on the cob", "meal"), "setosa")
ver<-str_count(d, "I.versicolor")
vir<-str_count(d, "I.virginica.")
arr<-c(set,ver,vir)
arr
R says:
> ver<-str_count(data, "I.versicolor")
Error: String must be an atomic vector
My file is table of tab delimited data, with string in the fifth column. How can I make the data from my table that I read into an atomic vector and make R happy?
If the data you want to analyze is in the fifth column, your code for defining "d" is incorrect.
d <- data[[5]]
or
d <- data[,5]
will work correctly.
data[5] keeps the data frame structure, while data[[5]] or data[,5] outputs only the vector.
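The difference can be checked directly. This sketch uses the built-in iris data as an assumed stand-in for the file in the question (its species labels are "setosa", "versicolor" and "virginica", without the "I." prefix), and counts matches with base R rather than stringr.

```r
# iris ships with R; its fifth column holds the species
d_df  <- iris[5]      # single bracket: still a data frame with one column
d_vec <- iris[[5]]    # double bracket: the column itself (a factor here)

is.data.frame(d_df)   # TRUE
is.data.frame(d_vec)  # FALSE

# Count the rows per species on the extracted vector
counts <- table(as.character(d_vec))
```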
Adding on to the above answer, avoid using code like data[5] unless you want to preserve the original class of the data set, whether it is an array, a list, a matrix or a data frame. If you want to understand more about subsetting, there's an excellent ebook called "R Fundamentals & Graphics" which will be a good desk reference for you.