What is option_description used for in the build_dict function of the dataMeta package in R?

I have a dataset of some 100,000 tweets with their sentiment scores attached. The original dataset has just two columns: one for the tweets and one for their sentiment scores.
I am trying to build a data dictionary for it using the dataMeta package. Here is the code that I have written so far:
library(dataMeta)
library(knitr)  # provides kable()
# Data dictionary
var_desc <- c("Sentiment Score 0 for Negative sentences and 4 for Positive sentences", "The tweets collected")
var_type <- c(0, 1)
# Create the linker data frame
linker <- build_linker(tweets_train, variable_description = var_desc, variable_type = var_type)
linker
# Build the data dictionary
dict <- build_dict(my.data = tweets_train, linker = linker, option_description = NULL, prompt_varopts = FALSE)
kable(dict, format = "html", caption = "Data dictionary for the Training dataset")
My problem is with the Variable Options column of the data dictionary. I have provided the variable names and descriptions, but that column seems to print all 100,000 tweets, which I want to avoid. Is it possible to set that column up manually too? Would option_description in the build_dict function be of any help here?
I tried to find information about it online, but to no avail. Here is the link that I have followed so far:
https://cran.r-project.org/web/packages/dataMeta/vignettes/dataMeta_Vignette.html
This is the first time I am trying to build a data dictionary and hence the struggle. Any suggestions would be extremely appreciated. Thanks in advance.
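As a fallback while the dataMeta options are unclear, the dictionary can also be assembled by hand. The following is only a minimal base R sketch, not dataMeta's implementation: the stand-in data frame, its column names, the helper summarise_options and the list_values flags are all hypothetical. It lists the distinct values for the score column and only a count for the free-text column.

```r
# Hypothetical stand-in for tweets_train (column names are assumptions)
tweets_train <- data.frame(
  sentiment = c(0, 4, 4, 0),
  text = c("bad day", "great news", "love it", "so tired"),
  stringsAsFactors = FALSE
)

var_desc <- c("Sentiment Score 0 for Negative sentences and 4 for Positive sentences",
              "The tweets collected")

# TRUE = list every distinct value, FALSE = only report how many there are
list_values <- c(TRUE, FALSE)

# Hypothetical helper: build the "options" text for one column
summarise_options <- function(x, list_all) {
  if (list_all) {
    paste(sort(unique(x)), collapse = ", ")
  } else {
    paste(length(unique(x)), "distinct values")
  }
}

manual_dict <- data.frame(
  variable_name = names(tweets_train),
  variable_description = var_desc,
  variable_options = unname(mapply(summarise_options, tweets_train, list_values)),
  stringsAsFactors = FALSE
)
manual_dict
```

The resulting data frame can be passed to kable() in the same way as the output of build_dict.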

Related

R - [DESeq2] - Making DESeq Dataset object from csv of already normalized counts

I'm trying to use DESeq2's plotPCA function in a meta-analysis of data.
Most of the files I have received are raw counts, pre-normalization. I run DESeq2 to normalize them and then run plotPCA.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?
Consensus in vignettes and other comments seems to be that objects can only be constructed from matrices of integers.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, this is what I ended up doing. Note that the "names" file is organized like the metadata file used as colData when building a DESeq object with DESeqDataSetFromMatrix, but I changed the name of the design column from "conditions" to "group" so it would match the output of plotPCA. The result should look identical.
library(ggfortify)
data<-read.csv('COUNTS.csv',sep = ",", header = TRUE, row.names = 1)
names<-read.csv("NAMES.csv")
PCA<-prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size=3)
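One caveat when comparing against DESeq2 plots: as far as I know, plotPCA keeps only the most variable genes before calling prcomp (its ntop argument, 500 by default), whereas the workaround above uses every row. A minimal base R sketch of that selection step follows; the matrix is simulated and its dimensions are assumptions, not the poster's data.

```r
set.seed(1)
# Hypothetical normalized matrix: 1000 genes (rows) x 6 samples (columns)
counts <- matrix(rnorm(6000, mean = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

ntop <- 500  # number of most variable genes to keep

# Per-gene variance, then the indices of the ntop most variable genes
rv <- apply(counts, 1, var)
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]

# PCA on samples (rows after transposing), restricted to the selected genes
pca <- prcomp(t(counts[select, ]))

# Share of variance per component, usable for axis labels
percent_var <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * percent_var[1:2], 1)
```

The pca object can be handed to autoplot() in place of PCA if matching that gene selection matters.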

Retain SPSS value labels when working with data

I am analysing student level data from PISA 2015. The data is available in SPSS format here
I can load the data into R using the read_sav function in the haven package. I need to be able to edit the data in R and then save/export the data in SPSS format with the original value labels that are included in the SPSS download intact. The code I have used is:
library(haven)
student<-read_sav("CY6_MS_CMB_STU_QQQ.sav",user_na = T)
student2<-data.frame(student)
#some edits to data
write_sav(student2,"testdata1.sav")
When my colleague (who works in SPSS) tries to open the "testdata1.sav" the value labels are missing. I've read through the haven documentation and can't seem to find a solution for this. I have also tried read/write.spss in the foreign package but have issues loading in the dataset.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful for your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female," you will have a df with values that are all Males and Females. It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
library(sjlabelled)
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <- student
write_spss(student2, "testdata1.sav")
I have not tried this, but I hope it works. The sjlabelled package handles non-ASCII characters such as German umlauts well.
Keep in mind, though, that R stores the labels as attributes. These attributes can be lost during some data transformations (subsetting the data, for example). Once lost in R, they will of course not show up in SPSS. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student) #after data transformations and before export to spss
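To see why the labels disappear, here is a minimal base R sketch using a plain numeric vector with a hand-made "labels" attribute (an assumption for illustration; real haven or sjlabelled columns carry extra classes that may behave differently).

```r
# A plain vector with value labels stored as an attribute
scores <- c(0, 1, 1, 0)
attr(scores, "labels") <- c(Male = 0, Female = 1)

# Subsetting with [ drops custom attributes on a plain vector
subset_scores <- scores[scores == 1]
labels_lost <- is.null(attr(subset_scores, "labels"))
labels_lost

# Restoring by hand: copy the attribute back from the original
restored <- subset_scores
attr(restored, "labels") <- attr(scores, "labels")
attributes(restored)
```

Copying the attribute back by hand for every column gets tedious on a large data frame, which is where a helper such as copy_labels comes in.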
I think you need to recover the value labels in the data frame after importing the dataset into R, and then write that data frame to a .sav file.
# load libraries
library(haven)
library(dplyr)  # provides %>% and mutate_at()
# load dataset
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)
# identify all haven-labelled columns
factor_variable <- names(student)[sapply(student, is.labelled)]
# convert all haven-labelled variables into factors
student2 <- student %>%
  mutate_at(factor_variable, as_factor)
# write dataset
write_sav(student2, "testdata1.sav")

Changing the name of a dataframe with an = sign in it

My question is regarding changing the name of a dataframe that I imported using the quantmod package. I ran the following lines,
library(quantmod)
data <- getSymbols("GBP=x", from = "2013-01-01", to = "2017-06-01", src="yahoo")
Which then saved the data as GBP=x
I now want to change the name of this dataframe to something called "GBP".
I keep getting values and not a dataframe.
GBP GBP=x
When I run GBP <- as.data.frame('GBP=x') I just get a dataframe with the value of GBP=x - 1 observation of 1 variable.
Any help is much appreciated
(Alternatively, if you can suggest a way to download FX data with quantmod and store it under a more convenient name, that would do the trick too.)
If I understand the documentation correctly,
data <- getSymbols("GBP=x", from = "2013-01-01", to = "2017-06-01", src="yahoo",auto.assign=FALSE)
will result in the FX data being stored in data.
Also, in case you have trouble finding the ` key, it's on the top left of most keyboards. It's used generally in R to enclose troublesome characters.
You need to use '`':
GBP = `GBP=X`
# remove the original dataframe from your workspace
rm(`GBP=X`)
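The same renaming can be tried without downloading anything. This is a minimal base R sketch in which a small hypothetical data frame stands in for the object that getSymbols creates; it shows both the backtick form and get() with the name as a string.

```r
# Create an object with a non-syntactic name (stand-in for the downloaded data)
assign("GBP=X", data.frame(close = c(1.29, 1.30)))

# Option 1: backticks let you refer to the object directly
GBP <- `GBP=X`

# Option 2: get() looks the object up by its name given as a string
GBP2 <- get("GBP=X")

# Remove the awkwardly named original
rm(`GBP=X`)

str(GBP)
```

Quoting the name with ordinary quotes, as in as.data.frame('GBP=x'), only passes the string itself, which is why the question ended up with a one-cell data frame.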

Error implementing anc.clim in the Phyloclim R Package

I am attempting to use the anc.clim function in phyloclim, but am stuck on an error I don't know how to fix.
I have three items in my workspace:
etopo is a 50X14 double matrix with the first column corresponding to 50 bins of an environmental variable. Each subsequent column is labeled with a taxon name.
targetTree is an object of class phylo containing 13 taxa with tip labels corresponding with the taxa in etopo (generated by reading in a .tre file from MrBayes using read.nexus)
prunedPosteriorTrees is an object of class multiphylo containing 1000 phylogenetic trees with 13 taxa with tip labels corresponding with the taxa in etopo (generated by reading in a .t file from MrBayes using read.nexus)
I have confirmed that the taxa in all three match using geiger's treedata function.
When I go to implement anc.clim with these data, the following occurs:
> climateReconstruction <- anc.clim(targetTree, posterior = prunedPosteriorTrees, pno = etopo, n = 2)
Error in noi(old, clades, monophyletic = TRUE) :
tips are not numbered consecutively. Type '?fixTips' for help.
When I type ?fixtips, or ??fixTips for that matter, no documentation is found. I have also searched the web, and the package documentation, to no avail. Has anyone had experience with this error? What do I do?
I have solved this problem. For the aid of others:
targetTree needs to be an ultrametric tree, such as one that would result from a BEAST analysis. MrBayes trees are not ultrametric. The same is true for the trees in prunedPosteriorTrees.
fixTips no longer exists. It has been replaced with fixNodes. Using this function solves the error.

Transform a matrix txt file in spectra data for ChemoSpec package

I want to use ChemoSpec with mass spectra of about 60,000 data points each.
I already have them in one txt file as a matrix (X + 90 samples = 91 columns; 60,000 rows).
How can I turn this file into spectra data without exporting each sample again as a separate csv file (which takes quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the described file of 91 columns and 60,000 rows may be somewhat slow and tedious in R, but it could be done with a standalone application, whether an existing utility or an ad hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but its run time should be comparable to that of getManyCsv(), and it avoids having to create and manage the many files, so overall it should be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. The rest of the data matrix then goes into the data slot. Next, manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
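The first two steps described above (splitting the X column from the sample columns) can be sketched in base R as follows. The tab-separated layout, the column names and the tiny inline data are assumptions for illustration; the orientation with samples in rows follows the check length(freq) == ncol(data) used in the answer further down.

```r
# Hypothetical file content: first column X (mass), then one column per sample
txt <- "X\tS1\tS2\n100\t1.0\t2.0\n101\t1.5\t2.5\n102\t0.5\t3.0"

# For a real file, replace text = txt with the file path
raw <- read.delim(text = txt)

# The X column becomes the frequency (here: mass) axis
freq <- raw[[1]]

# The remaining columns become the data matrix, one row per sample
data <- t(as.matrix(raw[, -1]))

dim(data)
```

The remaining list entries (names, groups, colors and so on) still have to be filled in before the object passes chkSpectra.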
Many years have passed and I am not sure if anybody is still interested in this topic, but I had the same problem and wrote a little workaround that converts my data to class 'Spectra' by extracting the information from the data itself:
#Assumption:
# Data is stored as a numeric data.frame with column names presenting samples
# and row names including domain axis
dataframe2Spectra <- function(Spectrum_df,
                              freq = as.numeric(rownames(Spectrum_df)),
                              data = as.matrix(t(Spectrum_df)),
                              names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
                              groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
                              colors = rainbow(dim(Spectrum_df)[2]),
                              sym = 1:dim(Spectrum_df)[2],
                              alt.sym = letters[1:dim(Spectrum_df)[2]],
                              unit = c("a.u.", "Domain"),
                              desc = "Some signal. Describe it with 'desc'") {
  # build an empty named list with the entries a Spectra object needs
  features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features
  Spectrum_chem$freq <- freq
  Spectrum_chem$data <- data
  Spectrum_chem$names <- names
  Spectrum_chem$groups <- groups
  Spectrum_chem$colors <- colors
  Spectrum_chem$sym <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit <- unit
  Spectrum_chem$desc <- desc
  # important step
  class(Spectrum_chem) <- "Spectra"
  # some warnings
  if (length(freq) != dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
  if (length(names) > dim(data)[1]) print("Too many names")
  if (length(names) < dim(data)[1]) print("Too few names")
  if (length(groups) > dim(data)[1]) print("Too many groups")
  if (length(groups) < dim(data)[1]) print("Too few groups")
  if (length(colors) > dim(data)[1]) print("Too many colors")
  if (length(colors) < dim(data)[1]) print("Too few colors")
  if (!is.matrix(data) || !is.numeric(data)) print("'data' is not a matrix or it's not numeric")
  return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)
