"Duplicated rows" error in AnnotationForge function makeOrgPackage - r

I'm creating an organism package using the AnnotationForge package, specifically the function makeOrgPackage. I've been following this vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/MakingNewOrganismPackages.html
When I call the function:
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
version="0.1",
maintainer="myname <email#university.edu>",
author="myname <email#university.edu>",
outputDir = ".",
tax_id="208963",
genus="Pseudomonas",
species="aeruginosa",
goTable="go")
I receive this error:
Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows
The "..." refers to the set of dataframes containing the annotation data. I've ensured that these dataframes are in the exact same structure as the example in the vignette. In the "gene_info" and "chromosome" dfs, I deleted all duplicated rows.
The "go" df has repeated values in the "GID" (gene ID) column, but all GO values are unique, and I've checked that no duplicate rows exist. For example:
GID GO EVIDENCE
1 PA14_00010 GO:0005524 ISM
2 PA14_00010 GO:0006270 ISM
3 PA14_00010 GO:0006275 ISM
4 PA14_00010 GO:0043565 ISM
5 PA14_00010 GO:0003677 ISM
6 PA14_00010 GO:0003688 ISM
7 PA14_00020 GO:0003677 ISM
8 PA14_00020 GO:0006260 ISM
The same goes for the sample finch data provided by the vignette; repeated GIDs, but unique GO numbers. Frustratingly, when I run the makeOrgPackage function for the sample data in the vignette, there are no errors. What am I missing here?
Full script:
# Load in GO-annotated PA14 file, downloaded from Pseudomonas.com
PA14file <- read.csv("../data/GO_annotations/GO_PA14.csv")
> colnames(PA14file)
[1] "LocusTag" "GeneName" "ProductDescription"
[4] "StrainName" "Accession" "GOTerm"
[7] "Namespace" "GOEvidenceCode" "EvidenceOntologyECOCode"
[10] "EvidenceOntologyTerm" "SimilarToBindsTo" "PMID"
[13] "chrom"
# PA14 only has 1 chromosome, so create a new column and populate it with 1s.
PA14file$chrom <- '1'
# Create gene_info df, remove duplicate rows
PA14Sym <- PA14file[,c("LocusTag", "GeneName", "ProductDescription")]
PA14Sym <- PA14Sym[PA14Sym[,"GeneName"]!="-",]
PA14Sym <- PA14Sym[PA14Sym[,"ProductDescription"]!="-",]
colnames(PA14Sym) <- c("GID","SYMBOL","GENENAME")
PA14Sym <- PA14Sym[!duplicated(PA14Sym), ]
# Create chromosome df, remove duplicate rows
PA14Chr <- PA14file[,c("LocusTag", "chrom")]
PA14Chr <- PA14Chr[PA14Chr[,"chrom"]!="-",]
colnames(PA14Chr) <- c("GID","CHROMOSOME")
PA14Chr %>% distinct(GID, .keep_all = TRUE)
PA14Chr <- PA14Chr[!duplicated(PA14Chr), ]
# Create go df
PA14Go <- PA14file[,c("LocusTag", "Accession", "GOEvidenceCode")]
PA14Go <- PA14Go[PA14Go[,"GOEvidenceCode"]!="",]
colnames(PA14Go) <- c("GID","GO","EVIDENCE")
# Call the function
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
version="0.1",
maintainer="myname <email#university.edu>",
author="myname <email#university.edu>",
outputDir = ".",
tax_id="208963",
genus="Pseudomonas",
species="aeruginosa",
goTable="go")

I ran into this problem today as well, and the function worked correctly as soon as I switched to dplyr's distinct(). (My call is the same as yours.)
Try appending %>% dplyr::distinct() to each data-frame creation step, or call dplyr::distinct() on each variable after all other operations to remove the duplicates. Note that in your script, gene_info and chromosome are filtered with !duplicated() but PA14Go never gets a deduplication step, so the go table is the most likely source of the duplicated rows.
In your case:
library(dplyr)
PA14Sym <- dplyr::distinct(PA14Sym)
PA14Chr <- dplyr::distinct(PA14Chr)
PA14Go <- dplyr::distinct(PA14Go)
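As a quick sanity check before calling makeOrgPackage, you can count fully duplicated rows in each input (a minimal sketch using only the three data frames built above; all three counts should be zero):
# Count fully duplicated rows in each annotation data frame
sapply(list(gene_info = PA14Sym, chromosome = PA14Chr, go = PA14Go),
       function(df) sum(duplicated(df)))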
Hope this helps.


How can I copy and rename a bunch of variables at once?

I have created some variables. I would like to duplicate these so that they exist twice, once with the name you see below, and once with Ireland_ in front of their name, i.e.,
c_PFS_Folfox = 307.81 would become:
Ireland_c_PFS_Folfox = 307.81
I initially define these as follows:
# 1. Cost of treatment in this country
c_PFS_Folfox <- 307.81
c_PFS_Bevacizumab <- 2580.38
c_OS_Folfiri <- 326.02
administration_cost <- 365.00
# 2. Cost of treating the AE conditional on it occurring
c_AE1 <- 2835.89
c_AE2 <- 1458.80
c_AE3 <- 409.03
# 3. Willingness to pay threshold
n_wtp = 45000
Then I put them together to rename all at once:
kk <- data.frame(c_PFS_Folfox, c_PFS_Bevacizumab, c_OS_Folfiri, administration_cost, c_AE1, c_AE2, c_AE3, n_wtp)
colnames(kk) <- paste("Ireland", kk, sep="_")
kk
Ireland_307.81 Ireland_2580.38 Ireland_326.02 Ireland_365 Ireland_2835.89 Ireland_1458.8
1 307.8 2580 326 365 2836 1459
Ireland_409.03 Ireland_45000
1 409 45000
Obviously this isn't the output I intended. These also don't exist as new variables in the environment.
What can I do?
If we want to create objects with Ireland_ as a prefix, we can either use
list2env(setNames(kk, paste0("Ireland_", names(kk))), .GlobalEnv)
Once we have created the objects in the global env, we can remove the originals:
> rm(list = names(kk))
> ls()
[1] "Ireland_administration_cost" "Ireland_c_AE1" "Ireland_c_AE2" "Ireland_c_AE3" "Ireland_c_OS_Folfiri"
[6] "Ireland_c_PFS_Bevacizumab" "Ireland_c_PFS_Folfox" "Ireland_n_wtp" "kk"
or with %=% from collapse
library(collapse)
paste("Ireland", colnames(kk), sep="_") %=% kk
Checking:
> Ireland_administration_cost
[1] 365
> Ireland_c_PFS_Folfox
[1] 307.81
First put all your variable names in a vector, then use sapply to iterate over the vector, assigning each existing variable to a new one with the prefix "Ireland_".
your_var <- c("c_PFS_Folfox", "c_PFS_Bevacizumab", "c_OS_Folfiri",
"administration_cost", "c_AE1", "c_AE2", "c_AE3", "n_wtp")
sapply(your_var, \(x) assign(paste0("Ireland_", x), get(x), envir = globalenv()))
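To confirm the copies now exist in the global environment (a quick check, assuming only the code above):
ls(pattern = "^Ireland_")                                 # list the newly created objects
mget(paste0("Ireland_", your_var), envir = globalenv())   # fetch them all as a named list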

Find differences between 2 dataframes with different lengths

I have two dataframes, each with the two columns c("price", "size") but with different lengths.
Each price must be linked to its size; they are two lists of trade orders. I have to find the differences between the two dataframes, knowing that each database can contain orders the other doesn't have, and vice versa. I would like one output with the differences, or two outputs; it doesn't matter. But I need the row numbers in the output so I can locate the differences in the series.
Here is some sample data:
> out
price size
1: 36024.86 0.01431022
2: 36272.00 0.00138692
3: 36272.00 0.00277305
4: 36292.57 0.05420000
5: 36292.07 0.00403948
---
923598: 35053.89 0.30904890
923599: 35072.76 0.00232000
923600: 35065.60 0.00273000
923601: 35049.36 0.01760000
923602: 35037.23 0.00100000
>bit
price size
1: 37279.89 0.01340020
2: 37250.84 0.00930000
3: 37250.32 0.44284049
4: 37240.00 0.00056491
5: 37215.03 0.99891906
---
923806: 35053.89 0.30904890
923807: 35072.76 0.00232000
923808: 35065.60 0.00273000
923809: 35049.36 0.01760000
923810: 35037.23 0.00100000
For example, I need to know if the first row of the database out is in the database bit.
I've tried several functions. With comparedf():
summary(comparedf(bit, out, by = c("price","size")))
but I get this error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, :
I've tried compare_df():
compareout = compare_df(out, bit, c("price","size"))
but I know the results are wrong: I get only 23 results, and I know there are at least 200 differences.
I've also tried match() and which(), but they don't give the results I'm looking for.
If you have any other methods, I will take them.
Perhaps you could just do an inner_join on out and bit by price and size? But first create an id variable in both data.frames:
library(dplyr)
out$id <- 1:nrow(out)
bit$id <- 1:nrow(bit)
joined <- inner_join(bit, out, by = c("price", "size"))
Now we can check which id from out and bit are not present in joined table:
id_from_bit_not_included_in_out <- bit$id[!bit$id %in% joined$id.x]
id_from_out_not_included_in_bit <- out$id[!out$id %in% joined$id.y]
These ids are the rows not included in the other table: id_from_bit_not_included_in_out contains rows present in bit but not in out, and id_from_out_not_included_in_bit contains rows present in out but not in bit.
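An alternative sketch using dplyr's anti_join(), which directly returns the rows of one table that have no (price, size) match in the other (same data as above; if you need the original row numbers, keep the id columns created earlier):
only_in_out <- anti_join(out, bit, by = c("price", "size")) # rows of out absent from bit
only_in_bit <- anti_join(bit, out, by = c("price", "size")) # rows of bit absent from out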
First attempt here. It will be difficult to do a very clean job with this data, though.
The data I used:
out <- read.table(text = "price size
36024.86 0.01431022
36272.00 0.00138692
36272.00 0.00277305
36292.57 0.05420000
36292.07 0.00403948
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
bit <- read.table(text = "price size
37279.89 0.01340020
37250.84 0.00930000
37250.32 0.44284049
37240.00 0.00056491
37215.03 0.99891906
37240.00 0.00056491
37215.03 0.99891906
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
Assuming purely that row 1 of out should match row 1 of bit, a simple solution could be:
library(dplyr)
df <- cbind(distinct(out), distinct(bit))
names(df) <- make.unique(names(df))
However, judging from the data you have provided, I am not sure this is the way to go (there are big differences in the first few rows), so maybe try sorting the data first:
df <- cbind(distinct(out[order(out$price, out$size),]), distinct(bit[order(bit$price, bit$size),]))
names(df) <- make.unique(names(df))

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss), this question is about getting the benefits of SPSS-style labeling when the data originates from CSV sources. The aim is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file-format origin.
While CSV does a reasonable job of storing data, it is hopeless at conveying meaning. This inevitably means variable labels, factor levels, and factor labels have to come from somewhere else. In most short examples (e.g. in documentation) it is practical to simply hard-code the metadata, but for larger projects it makes more sense to store it in a second CSV file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
So the common elements are the column names, which are the key to both files.
The first column of the metadata file describes the role of each row for the data file, so:
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
Missing describes how to treat missing data
vallable describes the label for a factor level, starting at one and going up to as many labels as there are
Right! Here's the code that works:
# Libraries
library(expss)
library(data.table)
library(magrittr)
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt data.table simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.)
tdt <- copy(rdt)
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK, this is where I fall down. Failure from here:
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table: the labels have been applied as whole vectors to each column entry, except where there is only one value in the vector ("Checked" for varfour and varfive).
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery: this code works just fine on a single column when I don't use a for-loop variable.
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't produce a vector as you expect. Unlike a regular data.frame, a data.table doesn't drop dimensions when you supply a single column in the index, and as.vector() does nothing to data.frames/data.tables, so tlabels remains a data.table. The line needs to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"

How to write rownames into a spreadsheet with the googlesheets package in R?

I would like to write a data frame to a Google spreadsheet with the googlesheets package, but the row names aren't written in the first column.
My data frame looks like this:
> str(stats)
'data.frame': 4 obs. of 2 variables:
$ Offensive: num 194.7 87 62.3 10.6
$ Defensive: num 396.28 51.87 19.55 9.19
> stats
Offensive Defensive
Annualized Return 194.784261 396.278385
Annualized Standard Deviation 87.04125 51.872826
Worst Drawdown 22.26618 9.546208
Annualized Sharpe Ratio (Rf=0%) 1.61126 0.9193734
I load the library as recommended in the documentation, create the spreadsheet and worksheet, then write the data with the gs_edit_cells command:
> install.packages("googlesheets")
> library("googlesheets")
> suppressPackageStartupMessages(library("dplyr"))
> mySpreadsheet <- gs_new("mySpreadsheet")
> mySpreadsheet <- mySpreadsheet %>% gs_ws_new("Stats")
> mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats, trim = TRUE)
Everything goes well, but googlesheets doesn't create a column with the row names. Only two columns are created, with their data (Offensive and Defensive).
I have tried converting the data frame into a matrix, but the result is the same.
Any idea how I could achieve this?
Thank you
Doesn't look like there is a row-names argument for gs_edit_cells(). If you just want the row names to show up in the first column of the sheet, you could try:
stats$Rnames <- rownames(stats) ## add a column equal to the row names
stats <- stats[, c("Rnames", "Offensive", "Defensive")] ## reorder so the names come first
# names(stats) <- c("", "Offensive", "Defensive") # optional, if you don't want the names column to have a header
From here, just pass stats to the functions from the googlesheets package like you did before.
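An equivalent shortcut, if you have the tibble package available (a suggested alternative, not part of googlesheets itself):
library(tibble)
stats <- rownames_to_column(stats, var = "Rnames") # moves the row names into a new first column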

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
     NYT WSJ
2011   2   3
2012  10   7
I found this tutorial for the NYT, but it is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F")
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, which is not relevant for you.
I'm rewiring RNYTimes right now. Install it from GitHub; you need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.,
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You can get word counts for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
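To build the per-year table the question asks for, one approach is to pull the publication date out of each doc and tabulate by year (a sketch, assuming each doc carries a pub_date field as in the v2 Article Search API):
dates <- sapply(moocs$response$docs, "[[", "pub_date") # e.g. "2013-04-30T00:00:00Z"
years <- substr(dates, 1, 4)
as.data.frame(table(years)) # article counts per year for the query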
