Turning table data into "atomic vector" in R

I'm really new to R. I have the following:
library(stringr)
data <-read.table("C:/dataAnalysis/dataset_317_1.txt")
d<-data[5]
set<-str_count(c("corn", "cornmeal", "corn on the cob", "meal"), "setosa")
ver<-str_count(d, "I.versicolor")
vir<-str_count(d, "I.virginica.")
arr<-c(set,ver,vir)
arr
R says:
> ver<-str_count(data, "I.versicolor")
Error: String must be an atomic vector
My file is a table of tab-delimited data, with strings in the fifth column. How can I turn the data I read from the table into an atomic vector and make R happy?

If the data you want to analyze is in the fifth column, your code for defining "d" is incorrect.
d <- data[[5]]
or
d <- data[,5]
will work correctly.
data[5] keeps the data frame structure, while data[[5]] or data[,5] outputs only the vector.
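To see the difference for yourself, here is a minimal sketch. The data frame below is a made-up stand-in for the file (its values and column names are assumptions), and it counts matches with base R's grepl() rather than stringr so that it runs without extra packages.

```r
# Hypothetical stand-in for the table read from the file
data <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9, V4 = 10:12,
                   V5 = c("I.setosa", "I.versicolor", "I.versicolor"),
                   stringsAsFactors = FALSE)

as_df  <- data[5]    # single bracket: still a data frame with one column
as_vec <- data[[5]]  # double bracket: the column itself, an atomic vector

print(is.data.frame(as_df))  # TRUE
print(is.atomic(as_vec))     # TRUE

# Count the rows whose fifth column contains the literal text "I.versicolor"
n_versicolor <- sum(grepl("I.versicolor", as_vec, fixed = TRUE))
print(n_versicolor)
```

With the vector form, stringr functions such as str_count() should accept the input as well.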

Adding on to the above answer, avoid code like data[5] unless you want to keep the container's class: single-bracket indexing of a list or a data frame returns another list or data frame, not the vector inside it. If you want to understand more about subsetting, there's an excellent ebook called "R Fundamentals & Graphics" which will be a good desk reference for you.

Related

How to select only numbers from a dataframe in R using which()

I have a large dataframe in R and am trying to do some stats tests on certain columns, but the non-programmers who made the csv file added a bunch of text notes that I need to ignore.
For example a column might have values: 12,20,40,missing,64,32,no input,45,10
How do I only select the numbers using the which statement?
I failed miserably trying:
my_data_frame$Column.Title[which(is.numeric(my_data_frame$Column.Title))]
What do I change in the which function to only select the numbers and ignore the text? Thanks!
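is.numeric() tests the whole column at once and returns a single TRUE or FALSE, so which() has nothing to pick individual elements from. You can use the built-in as.numeric() converter instead to do something like this: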
x <- my_data_frame$Column.Title
xn <- as.numeric(as.character(x))  # go through character in case the column was read as a factor
which(!is.na(xn))
This won't distinguish between NAs created by failed coercion and pre-existing (numeric) NA values.
If there's a small enough variety of "missing" values you could read the data in with read.csv(..., na.strings=c("NA","missing","no input"))
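A small self-contained sketch of this approach, using the example values from the question as a character vector. suppressWarnings() is only there to hide the expected "NAs introduced by coercion" warning.

```r
# Example column from the question, as it would look after being read as text
x <- c("12", "20", "40", "missing", "64", "32", "no input", "45", "10")

# Text that cannot be parsed as a number becomes NA
xn <- suppressWarnings(as.numeric(x))

# Positions of the entries that really are numbers
keep <- which(!is.na(xn))
print(keep)      # 1 2 3 5 6 8 9

# The numeric values themselves, ready for stats tests
print(xn[keep])  # 12 20 40 64 32 45 10
```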

Creating Sub Lists from A to Z from a Master List

Task
I am attempting to use better functionality (loop or vector) to parse down a larger list into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list with the letter B ... the possible 27th list contains all remaining entries that start with either numbers or other characters).
I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and move my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for (i in 1:26) {
  # Build a pattern such as "^[aA]" for the i-th letter
  pattern <- paste0("^[", letters[i], LETTERS[i], "]")
  # Keep the IDs that start with that letter
  stake <- grep(pattern, df$uniqueID, value = TRUE)
  stakeDist <- adist(stake, ignore.case = TRUE)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote = TRUE, sep = ",")
}
Using a combination of paste0 and the built-in letters and LETTERS, this creates your grep expression.
Using grep with value = TRUE, the matching IDs are extracted.
paste0 also creates a unique filename for write.table().
And it is all tied together using a for() loop.
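As a possible alternative to looping over letters, the grouping can be done in one step with split(), and the close matches can be listed in R instead of being highlighted in Excel. This is only a sketch: the sample names, the helper close_pairs and the max_dist threshold of 2 are assumptions made for illustration, not part of the original code.

```r
# Hypothetical sample names standing in for uniqueID
uniqueID <- c("Company A", "Companyy A", "Acme", "Beta Corp", "Beta Corpp", "3M")

# Group by upper-cased first character; non-letters go to an "other" group
first <- toupper(substr(uniqueID, 1, 1))
first[!first %in% LETTERS] <- "other"
groups <- split(uniqueID, first)

# For one group, list the pairs whose edit distance is between 1 and max_dist
close_pairs <- function(x, max_dist = 2) {
  d <- adist(x, ignore.case = TRUE)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(d >= 1 & d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = x[idx[, "row"]], b = x[idx[, "col"]], dist = d[idx],
             stringsAsFactors = FALSE)
}

# One table of candidate misspellings across all groups
pairs <- do.call(rbind, lapply(groups, close_pairs))
print(pairs)
```

The resulting table is a review list of likely duplicates. Which spelling is the correct one still has to be decided by a person or by an additional rule.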

Paste function to construct existing data frame name and evaluate in R

I am working with a long list of data frames.
Here is a simple hypothetical example of a data frame:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
I am trying to retrieve a specified column of the data frame using paste function.
get(paste("DFrame","$","ColTwo",sep=""))
The get function returns the following error, when trying to retrieve a specified column:
Error in get(paste("DFrame", "$", "ColTwo", sep = "")) : object 'DFrame$ColTwo' not found
When I enter the constructed name of the data frame DFrame$ColTwo it returns the desired output of the second column.
If I reconstruct an example without the '$' sign then I get the desired answer from the get function. For example the code yields 2:
Ans <- 2
get(paste("An","s",sep=""))
[1] 2
I am looking for the same desired outcome, but struggling to get past the error that the object could not be found.
I also attempted using the following format, but the quotation in the column name breaks the paste function:
paste("DFrame","[,"ColTwo"]",sep="")
Thank you very much for the input,
Kind regards
You can do that using the following syntax:
get("DFrame")[,"ColTwo"]
You can use paste() in both of these strings, for example:
get(paste("D", "Frame", sep=""))[,paste("Col", "Two", sep="")]
Edit: Despite someone downvoting this answer without leaving a comment, this does exactly what the original poster asked for. If you feel that it does not or is in some way dangerous, I would encourage you to leave a comment.
Stop trying to use paste and get entirely.
The whole point of having a list (of data frames, say) is that you can reference them using names:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
#A list of data frames
l <- list(DFrame,DFrame)
#The data frames in the list can have names
names(l) <- c("DF1",'DF2')
# Now you just use `[[`
> l[["DF1"]][["ColOne"]]
[1] 1 0
> l[["DF1"]][["ColTwo"]]
[1] Yes No
Levels: No Yes
If you have to, you can use paste to construct the indices passed inside [[.
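A short sketch of that last point, building both the list name and the column name with paste0(). The variable names df_name and col_name are made up for illustration, and stringsAsFactors = FALSE is set so the column comes back as plain character strings.

```r
DFrame <- data.frame(ColOne = c(1, 0), ColTwo = c("Yes", "No"),
                     stringsAsFactors = FALSE)

# A named list of data frames
l <- list(DF1 = DFrame, DF2 = DFrame)

# Construct the names as strings, then index with [[ ]]
df_name  <- paste0("DF", 1)       # "DF1"
col_name <- paste0("Col", "Two")  # "ColTwo"

result <- l[[df_name]][[col_name]]
print(result)  # "Yes" "No"
```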

Extracting out numbers in a list from R

I am reading this from a CSV file, and I need to write a function that produces a final data frame. For a particular entry, I have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
I can change it to
x2<-as.character(x)
which gives me
x2
[1] "{2,4,5,11,12}"
How do I extract 2,4,5,11,12 out (a vector of 5 elements)?
I have tried various ways, like gsub, but to no avail.
Can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
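Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values for that row in the table. The values are still character strings at this point, so apply as.numeric to each element if you need numbers.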
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")
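For a self-contained illustration that skips the CSV file, here is a sketch on two of the strings shown in the question. It strips the braces, splits on commas and converts each piece to a number; the variable names inner and nums are made up for the example.

```r
# Two entries as they look after as.character()
x <- c("{2,4,5,11,12}", "{1,2,3,4,5,6,7,12,17}")

# Remove the surrounding braces
inner <- gsub("[{}]", "", x)

# Split on commas and convert each piece to a number
nums <- lapply(strsplit(inner, ",", fixed = TRUE), as.numeric)

print(nums[[1]])          # 2 4 5 11 12
print(length(nums[[1]]))  # 5
```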

Referring to row names as numbers in analysis (geiger package)

I'm trying to run the tip.disparity function in the geiger package in R.
My data:
Family Length Wing Tail
Alced 2.21416 1.88129 1.66744
Brachypt 2.36734 2.02373 2.03335
Bucco 2.23563 1.91364 1.80675
When I use the function "name.check" to check that the names in my data match those on my tree, it returns
$data.not.tree
[1] "1" "10" "11" "12" "2" etc
showing that it is referring to the names by number. I've tried converting to a character vector, etc.
I've tried running it with
data.names=NULL
I'm looking simply to edit my data frame so that the package matches the names to those in my tree (tree is newick format)
Hope this is clearer
Thanks
I believe the clue is in the documentation (?name.check):
data.names: names of the tips in the order of the data; if this is not
given, names will be taken from the names or rownames of the
object data
If you want the program to return the names of the taxa that are included in the data frame but not present in the tree, you either need to assign the corresponding names as row names of your data frame, or specify them separately in the data.names argument. Note that the default row names of a data frame are the character equivalent of the row number, exactly what you're seeing above ...
edit based on additional information above:
R can't guess (or doesn't want to) that the names are contained in the Family element of your data frame. Try:
name.check(tree,traitdata,data.names=as.character(traitdata$Family))
Probably better in the long run to do:
rownames(traitdata) <- as.character(traitdata$Family)
traitdata <- subset(traitdata,select=-Family)
name.check(tree,traitdata)
Because you don't want to have Family included in your data set of traits -- it's an identifier, not a trait ...
If you look at the structure of the example data given in the package:
data(geospiza)
geospiza.data
you can see that the taxon names are included as row names, not as a column in the data frame itself ...
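The row-name step can be tried without geiger or a tree. The sketch below rebuilds the three rows from the question as a data frame (the object name traitdata is taken from the answer above) and moves Family into the row names using base R only.

```r
# The three rows shown in the question
traitdata <- data.frame(
  Family = c("Alced", "Brachypt", "Bucco"),
  Length = c(2.21416, 2.36734, 2.23563),
  Wing   = c(1.88129, 2.02373, 1.91364),
  Tail   = c(1.66744, 2.03335, 1.80675)
)

# Default row names are just the row numbers as text
print(rownames(traitdata))  # "1" "2" "3"

# Use the family names as row names, then drop the identifier column
rownames(traitdata) <- as.character(traitdata$Family)
traitdata <- subset(traitdata, select = -Family)

print(traitdata)
```

After this, the data frame holds only trait columns, and its row names are what a name-matching check would compare against the tip labels.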
PS it's not as nice an interface as StackOverflow, but there's a very friendly and active R-for-phylogeny mailing list at r-sig-phylo@r-project.org ...

Resources