What is the difference between data and data.frame in R?

I know a data.frame is a 2-D table whose columns can have different types. I think data is another type of data structure in R, which can hold multiple data.frames.
In RStudio, I now have two data objects, dcd and pdb.
I was trying to understand their properties:
> dcd
Total Frames#: 101
Total XYZs#: 19851, (Atoms#: 6617)
[1] 65.59 84.65 90.92 <...> 59.76 55.48 83.68 [2004951]
+ attr: Matrix DIM = 101 x 19851
> class(dcd)
[1] "xyz" "matrix"
> dcd$xyz
Error in dcd$xyz : $ operator is invalid for atomic vectors
> pdb
Call: read.pdb(file = pdbfile)
Total Models#: 1
Total Atoms#: 6598, XYZs#: 19794 Chains#: 2 (values: L H)
Protein Atoms#: 6598 (residues/Calpha atoms#: 442)
Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
Non-protein/nucleic Atoms#: 0 (residues: 0)
Non-protein/nucleic resid values: [ none ]
Protein sequence:
DIQMTQSPSSLSASVGDRVTITCKASQNVRTVVAWYQQKPGKAPKTLIYLASNRHTGVPS
RFSGSGSGTDFTLTISSLQPEDFATYFCLQHWSYPLTFGQGTKVEIKRTVAAPSVFIFPP
SDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLT
LSKADYEKHKVYACEVTHQGLSSPVTKSFNRGECEVQLVESGGGL...<cut>...TSAA
+ attr: atom, xyz, calpha, call
> class(pdb)
[1] "pdb" "sse"
> pdb$xyz
Total Frames#: 1
Total XYZs#: 19794, (Atoms#: 6598)
[1] 24.33 14.711 -3.854 <...> -34.374 -6.315 14.986 [19794]
+ attr: Matrix DIM = 1 x 19794
My questions are:
Is dcd similar to a matrix with 101 rows and 19851 columns?
class(dcd) outputs "xyz" and "matrix", does it mean the dcd belongs to both "xyz" and "matrix" types at the same time?
How can I create a data object like pdb which includes multiple data.frames?
e.g. if I have
students <- data.frame(c("Cedric","Fred","George"),c(3,2,2))
names(students) <- c("name", "year")
teachers <- data.frame(c("John","Alice","Mike"),c(6,9,5))
names(teachers) <- c("name", "year")
how can I combine students and teachers into a data called people, so that I can use people$students or people$teachers?

If you're asking how to create a dataframe named people, so you can access the names of the people using people$students or people$teachers, then the code to achieve that is:
people <- data.frame(students = students$name, teachers = teachers$name)
people$students
people would be a data frame with two columns, students and teachers, each holding the three names.
If you want a list, you can create a list object like the following:
people2 <- as.list(c("students" = students, "teachers" = teachers))
people2$students.name
# returns [1] Cedric Fred George
And people2 would be a flat list of four elements: students.name, students.year, teachers.name and teachers.year.
When you print the list, the $ (dollar sign) in front of each element name tells you how to access it. If you wanted teachers.name, then print(people2$teachers.name) will do that for you.
As for your other questions:
Is dcd similar to a matrix with 101 rows and 19851 columns?
You can verify the dimensions of a matrix-like object using dim(), ncol() or nrow(). In your case, yes: it has 101 rows and 19851 columns.
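class(dcd) outputs "xyz" and "matrix", does it mean the dcd belongs to both "xyz" and "matrix" types at the same time?
Put simply, yes: you can think of dcd as inheriting from the matrix class as well as from xyz. class() returns the whole chain, and when a generic function such as print() is called, R looks for an "xyz" method first and falls back to "matrix" after that. You may want to read about classes and inheritance in R.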
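Here is a small sketch of what a two-element class vector means in practice. The toy matrix and the print.xyz method are made up for illustration; this is not the code of the package that produced dcd, only an assumption about how such an object could behave.

```r
# Toy stand-in for dcd: a plain numeric matrix tagged with an extra S3 class.
m <- matrix(as.numeric(1:6), nrow = 2)
class(m) <- c("xyz", "matrix")

# Hypothetical print method; S3 dispatch tries "xyz" before "matrix".
print.xyz <- function(x, ...) {
  cat("Total Frames#:", nrow(x), "\n")
  invisible(x)
}

print(m)               # dispatches to print.xyz
inherits(m, "xyz")     # TRUE
inherits(m, "matrix")  # TRUE
dim(m)                 # still a 2 x 3 matrix underneath

# The object is an atomic vector with attributes, so `$` is not defined for it.
dollar_fails <- inherits(try(m$xyz, silent = TRUE), "try-error")
dollar_fails           # TRUE
```

This is also why dcd$xyz failed in the question while pdb$xyz worked: pdb is a list with named components, dcd is a matrix.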
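How can I create a data object like pdb which includes multiple data.frames?
Look at my code above, but note that people2 <- as.list(c("students" = students, "teachers" = teachers)) flattens the two data frames into a single list of their columns. To keep each data frame intact, so that people$students and people$teachers each return a data frame, call list() directly: people <- list(students = students, teachers = teachers).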
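As a minimal sketch of the difference between list() and c() here, using the students and teachers data frames from the question (the variable name flat is only for illustration):

```r
students <- data.frame(name = c("Cedric", "Fred", "George"), year = c(3, 2, 2))
teachers <- data.frame(name = c("John", "Alice", "Mike"), year = c(6, 9, 5))

# list() keeps each data frame whole, so `$` returns a data frame.
people <- list(students = students, teachers = teachers)
names(people)            # "students" "teachers"
class(people$students)   # "data.frame"
people$teachers$year     # 6 9 5

# c() splices the columns of both data frames into one flat list.
flat <- c(students = students, teachers = teachers)
names(flat)              # "students.name" "students.year" "teachers.name" "teachers.year"
```

A list built this way is the same kind of container as pdb in the question: a list whose named components can themselves be data frames, matrices or anything else.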

Related

Using a function on a column from tree file class Phylo

I have a phylogenetic tree with many tips and internal nodes. I have a list of node ids from the tree. These are part of a separate table. I want to add a new column to the table, children. To get the descendants (nodes and tips), I am using phangorn::Descendants(tree, NODEID, type = 'all'). I can add length to get the number of descendants. For example,
phangorn::Descendants(tree, 12514, type = 'all')
[1] 12515 12517 12516 5345 5346 5347 5343 5344
length(phangorn::Descendants(tree, 12514, type = 'all'))
[1] 8
I would like to very simply take the column in my dataframe 'nodes', and use the function above length(phangorn::Descendants(tree, 12514, type = 'all')) to create a new column in the dataframe based off the input nodes.
Here is an example:
tests <- data.frame(nodes=c(12551, 12514, 12519))
length(phangorn::Descendants(tree, 12519, type = 'all'))
[1] 2
length(phangorn::Descendants(tree, 12514, type = 'all'))
[1] 8
length(phangorn::Descendants(tree, 12551, type = 'all'))
[1] 2
tests$children <- length(phangorn::Descendants(tree, tests$nodes, type = 'all'))
tests
nodes children
1 12551 3
2 12514 3
3 12519 3
As shown above, the number of children is the length of the data.frame and not the actual number of children calculated above. It should be:
tests
nodes children
1 12551 2
2 12514 8
3 12519 2
If you have any tips or idea on how I can have this behave as expected, that would be great. I have a feeling I have to use apply() or I need to index inside before using the length() function. Thank you in advance.
You're super close! Here's one quick solution using sapply! There are more alternatives but this one seems to follow the structure of your question!
Generating some data
library(ape)
ntips <- 10
tree <- rtree(ntips)
targetNodes <- data.frame(nodes=seq(ntips+1, ntips+tree$Nnode))
Note that I'm storing all the relevant nodes in the targetNodes object. This is equivalent to the following object in your question:
tests <- data.frame(nodes=c(12551, 12514, 12519))
Using sapply
Now, let's use sapply to repeat the same operation across all the relevant nodes in targetNodes:
targetNodes$children <- sapply(targetNodes$nodes, function(x) {
  length(phangorn::Descendants(tree, x, type = 'all'))
})
I'm saving the output of our sapply function by creating a new column in targetNodes.
Good luck!
You were even closer: using lengths instead of length should work.
tests$children <- lengths(phangorn::Descendants(tree, tests$nodes, type = 'all'))
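To see why lengths() helps, here is a sketch that does not need a tree at all. It assumes, as the output in the question suggests, that Descendants() returns a list with one vector per node when given several nodes; the list desc below is a made-up stand-in for that result.

```r
# Stand-in for the result of Descendants(tree, tests$nodes, type = 'all'):
# one integer vector of descendants per requested node.
desc <- list(
  c(5L, 6L),
  c(12515L, 12517L, 12516L, 5345L, 5346L, 5347L, 5343L, 5344L),
  c(7L, 8L)
)

length(desc)    # 3: the number of list elements, recycled into every row
lengths(desc)   # 2 8 2: the length of each element

tests <- data.frame(nodes = c(12551, 12514, 12519))
tests$children <- lengths(desc)
tests
```

length() answers "how many nodes did I ask about", while lengths() answers "how many descendants does each node have", which is what the new column needs.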

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss), this question is about getting the benefits of SPSS-style labelling when the data originate from CSV sources. The aim is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of the original file format.
Whilst CSV does a reasonable job of storing data, it is hopeless at describing what the data mean. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard-code the metadata. But for larger projects it makes more sense to store this metadata in a second CSV file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
So the common elements are the column names, which are the key to both files.
The first column of the metadata file describes the role of each row for the data file, so:
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
Missing describes how to treat missing data
vallable describes the label for a factor level, starting at one and going up to as many labels as there are.
Right! Here's the code that works:
#Libraries
library(expss)
library(data.table)
library(magrittr)
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
Here I copy the rdt data.table into tdt, simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
The labels have been applied as whole vectors to each column entry, except where there is only one value in the vector ("Checked" for varfour and varfive).
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery:
this code works just fine on a single column when I don't use a for loop variable
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't produce a vector as you expect. Unlike a regular data.frame, a data.table doesn't drop dimensions when you select a single column this way, and as.vector does nothing to data.frames/data.tables, so tlabels remains a data.table. This line needs to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]). Once tlabels is a plain vector, the next line should also use length(tlabels) rather than lengths(tlabels), because lengths() would return one value per element.
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"
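For completeness, here is a base R sketch of the factor step once the labels are a plain character vector. The vectors labels_col and codes are made-up stand-ins for one metadata column and one data column; no data.table is needed to see the effect.

```r
# Stand-in for mdt[[col]][4:18]: value labels padded with NA.
labels_col <- c("One", "Two", "Three", NA, NA)

# na.omit() drops the NAs; as.character() strips the na.action attribute.
tlabels <- as.character(na.omit(labels_col))
validrange <- seq_along(tlabels)   # 1 2 3

# Stand-in for one column of the data file.
codes <- c(1, 1, 1, 2, 3, 2)

f <- factor(codes, levels = validrange, labels = tlabels)
levels(f)          # "One" "Two" "Three"
as.character(f)    # "One" "One" "One" "Two" "Three" "Two"
```

With labels of length three, each code maps to a single label instead of the whole c("One", "Two", "Three") string seen in the question.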

Convert DNAStringSet to a list of elements in R? (Error in seq[[1]][["seq"]] : subscript out of bounds in R)

I have a bed file which contains DNA sequence information as follows:
track name="194" description="194 methylation (sites)" color=0,60,120 useScore=1
chr1 15864 15866 FALSE 894 +
chr1 534241 534243 FALSE 921 -
chr1 710096 710098 FALSE 729 +
chr1 714176 714178 FALSE 12 -
chr1 720864 720866 FALSE 988 -
I loaded the bed file in R and named the matrix dataSet.
I used the following code to get the sequences:
mydataSet_Test1<-dataSet[,1:3]
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
chr<-as.matrix(as.character(mydataSet_Test1[,1]))
#50
start<-as.matrix(as.integer(as.character(mydataSet_Test1[,2]))-50)
end<-as.matrix(as.integer(as.character(mydataSet_Test1[,3]))+50)
Seqs50_Test1<-getSeq(genome,chr,start=start,end=end)
Now, Seqs50_Test1 is a large DNAStringSet.
I now want to load the BioSeqClass package in R to do a homolog reduction on my sequences.
I want to use the hr() function, which, according to the package manual, is like this:
Description
Filter homolog sequences by sequence similarity.
hr(seq, method, identity, cdhit.path)
Arguments
seq a list with one element for each protein/gene sequence. The elements are in two parts, one the description ("desc") and the second is a character string of the biological sequence ("seq").
identity a numeric value ranged from 0 to 1. It is used as a maximum identity cutoff among input sequences.
method a string for the method of homolog redunction. This must be one of the strings "cdhit" or "aligndis".
My question is: how can I convert my DNAStringSet to the list of elements that hr() wants? I tried using the list() function, but when I run hr() it gives me the error: Error in seq[[1]][["seq"]] : subscript out of bounds
FULL CODE:
mydataSet = dataSet[,1:3]
library(BSgenome.Hsapiens.UCSC.hg19)
genome = BSgenome.Hsapiens.UCSC.hg19
chr = as.matrix(as.character(mydataSet[,1]))
start = as.matrix(as.integer(as.character(mydataSet[,2]))-200)
end = as.matrix(as.integer(as.character(mydataSet[,3]))+200)
Seqs = getSeq(genome,chr,start=start,end=end)
writeXStringSet(Seqs, "C:\\Users\\JL009\\Desktop\\Seqs.fasta", append=FALSE, format = "fasta")
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("BioSeqClass")
library(BioSeqClass)
library(Biostrings)
seq = as.character(readAAStringSet("C:\\Users\\JL009\\Desktop\\Seqs.fasta"))
reducSeqs = hr(seq, method="aligndis", identity=0.4)
Try something like this then:
library(BioSeqClass)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
gr = GRanges(seqnames="chr1",IRanges(start=seq(10e6,11e6,length.out=10),width=50))
S = getSeq(BSgenome.Hsapiens.UCSC.hg19,gr)
names(S) = paste("seq",1:length(S))
input = lapply(seq_along(S),function(i)list(desc=names(S)[i],seq=as.character(S[[i]])))
hr(input,method="aligndis",identity=0.5)
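The shape of that input list can be sketched in base R without any Bioconductor packages. The sequences below are made up, and the desc/seq layout follows the hr() argument description quoted in the question; whether hr() accepts it is not checked here.

```r
# Made-up sequences standing in for as.character(<DNAStringSet>).
seqs <- c(seq1 = "ACGTACGT", seq2 = "TTGACCA", seq3 = "GGGCATT")

# A plain character vector has no "seq" component inside each element,
# which reproduces the kind of error reported in the question.
bad <- try(seqs[[1]][["seq"]], silent = TRUE)
inherits(bad, "try-error")   # TRUE

# One list per sequence, each with a description and the sequence string.
input <- lapply(seq_along(seqs), function(i) {
  list(desc = names(seqs)[i], seq = seqs[[i]])
})

input[[1]][["seq"]]   # "ACGTACGT"
input[[2]]$desc       # "seq2"
```

The important point is that every element of input is itself a list with the names desc and seq, so seq[[1]][["seq"]] has something to find.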

How to write rownames into a spreadsheet with the googlesheets package in R?

I would like to write a data frame to a Google spreadsheet with the googlesheets package, but the row names aren't written in the first column.
My data frame looks like this :
> str(stats)
'data.frame': 4 obs. of 2 variables:
$ Offensive: num 194.7 87 62.3 10.6
$ Defensive: num 396.28 51.87 19.55 9.19
> stats
Offensive Defensive
Annualized Return 194.784261 396.278385
Annualized Standard Deviation 87.04125 51.872826
Worst Drawdown 22.26618 9.546208
Annualized Sharpe Ratio (Rf=0%) 1.61126 0.9193734
I load the library as recommended in the documentation, create a spreadsheet and a worksheet, then write the data with the gs_edit_cells command:
> install.packages("googlesheets")
> library("googlesheets")
> suppressPackageStartupMessages(library("dplyr"))
> mySpreadsheet <- gs_new("mySpreadsheet")
> mySpreadsheet <- mySpreadsheet %>% gs_ws_new("Stats")
> mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats, trim = TRUE)
Everything goes well, but googlesheets doesn't create a column with the row names. Only two columns are created with their data (Offensive and Defensive).
I have tried converting the data frame into a matrix, but the result is the same.
Any idea how I could achieve this?
Thank you
It doesn't look like there is a row names argument for gs_edit_cells(). If you just want the row names to show up in the first column of the sheet, you could try:
stats$Rnames = rownames(stats) ## add a column equal to the row names
stats = stats[,c("Rnames","Offensive", "Defensive")] ## reorder so names are first (assign the result back)
# names(stats) = c("","Offensive", "Defensive") ## optional, if you want the names column to not have a "name"
From here just pass stats to the functions from the googlesheets package just like you did before.

How to store a "complex" data structure in R (not "complex numbers")

I need to train, store, and use a list/array/whatever of several ksvm SVM models, so that once I get a set of sensor readings, I can call predict() on each of the models in turn. I want to store these models and metadata about them in some sort of data structure, but I'm not very familiar with R, and getting a handle on its data structures has been a challenge. My familiarity is with C++, C, and C#.
I envision some sort of array or list that contains both the ksvm models as well as the metadata about them. (The metadata is necessary, among other things, for knowing how to select & organize the input data presented to each model when I call predict() on it.)
The data I want to store in this data structure includes the following for each entry of the data structure:
The ksvm model itself
A character string saying who trained the model & when they trained it
An array of numbers indicating which sensors' data should be presented to this model
A single number between 1 and 100 that represents how much I, the trainer, trust this model
Some "other stuff"
So in tinkering with how to do this, I tried the following....
First I tried what I thought would be really simple & crude, hoping to build on it later if this worked: A (list of (list of different data types))...
>
> uname = Sys.getenv("USERNAME", unset="UNKNOWN_USER")
> cname = Sys.getenv("COMPUTERNAME", unset="UNKNOWN_COMPUTER")
> trainedAt = paste("Trained at", Sys.time(), "by", uname, "on", cname)
> trainedAt
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
> sensorsToUse = c(12,14,15,16,24,26)
> sensorsToUse
[1] 12 14 15 16 24 26
> trustFactor = 88
>
> TestModels = list()
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
> TestModels
[[1]]
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
>
...wha? What did it think I was trying to replace? I was just trying to populate element 1 of TestModels. Later I would add an element [2], [3], etc... but this didn't work and I don't know why. Maybe I need to define TestModels as a list of lists right up front...
> TestModels = list(list())
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Hmm. That no workie either. Let's try something else...
> TestModels = list(list())
> TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Drat. Still no workie.
Please clue me in on how I can do this. And I'd really like to be able to access the fields of my data structure by name, perhaps something along the lines of...
> print(TestModels[1]["TrainedAt"])
Thank you very much!
You were very close. To avoid the warning, you shouldn't use
TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
but instead
TestModels[[1]] = list(trainedAt, sensorsToUse, trustFactor)
To access a single list element you use [[ ]]. Using [ ] on a list returns a list containing the selected elements. The warning is shown because TestModels[1] refers to a one-element slice of the list, while the right-hand side is a list of 3 elements: R stores only the first one and warns about the rest. If you do want to assign with single brackets, wrap the value in another list so that the lengths match:
TestModels[2] = list(list(trainedAt, sensorsToUse, trustFactor)) # one slot, one (list) value, so no warning
To understand list subsetting better, take a look at this:
item1 <- list("a", 1:10, c(T, F, T))
item2 <- list("b", 11:20, c(F, F, F))
mylist <- list(item1=item1, item2=item2)
mylist[1] #This returns a list containing the item 1.
#$item1 #Note the item name of the container list
#$item1[[1]]
#[1] "a"
#
#$item1[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#$item1[[3]]
#[1] TRUE FALSE TRUE
#
mylist[[1]] #This returns item1
#[[1]] #Note this is the same as item1
#[1] "a"
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#[[3]]
#[1] TRUE FALSE TRUE
To access the list items by name, just name them when creating the list:
mylist <- list(var1 = "a", var2 = 1:10, var3 = c(T, F, T))
mylist$var1 #Or mylist[["var1"]]
# [1] "a"
You can nest these operators as you suggested. So you could use
containerlist <- list(mylist)
containerlist[[1]]$var1
#[1] "a"
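Putting this together for the model registry in the question, here is one possible sketch. The helper make_entry, the field names and the lm() fit used as a stand-in for a ksvm model are all assumptions for illustration; any fitted model object could be stored in the same way.

```r
# Hypothetical helper: bundles a model with its metadata in a named list.
make_entry <- function(model, trainedAt, sensorsToUse, trustFactor) {
  list(model = model, trainedAt = trainedAt,
       sensorsToUse = sensorsToUse, trustFactor = trustFactor)
}

# Stand-in model; a ksvm object would go in the same slot.
fit <- lm(dist ~ speed, data = cars)

TestModels <- list()
TestModels[[1]] <- make_entry(fit, "trained by alice (made-up)", c(12, 14, 15, 16, 24, 26), 88)
TestModels[[2]] <- make_entry(fit, "trained by bob (made-up)", c(3, 7), 42)

TestModels[[1]]$trainedAt        # access fields by name
TestModels[[2]]$sensorsToUse     # 3 7

# Pull one field out of every entry.
trust <- sapply(TestModels, function(m) m$trustFactor)
trust                            # 88 42

# Call predict() on a stored model.
p <- predict(TestModels[[1]]$model, newdata = data.frame(speed = 10))
```

Each entry is an ordinary named list, so new metadata fields can be added later without changing how the existing ones are accessed.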

Resources