"Duplicated rows" error in AnnotationForge function makeOrgPackage - r
I'm creating an organism package using the AnnotationForge package, specifically the function makeOrgPackage. I've been following this vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/MakingNewOrganismPackages.html
When I call the function:
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
               version="0.1",
               maintainer="myname <email@university.edu>",
               author="myname <email@university.edu>",
               outputDir = ".",
               tax_id="208963",
               genus="Pseudomonas",
               species="aeruginosa",
               goTable="go")
I receive this error:
Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows
The "..." refers to the set of dataframes containing the annotation data. I've ensured that these dataframes are in the exact same structure as the example in the vignette. In the "gene_info" and "chromosome" dfs, I deleted all duplicated rows.
The "go" df has repeated values in the "GID" (gene ID) column, but all GO values are unique, and I've checked that no duplicate rows exist. For example:
         GID         GO EVIDENCE
1 PA14_00010 GO:0005524      ISM
2 PA14_00010 GO:0006270      ISM
3 PA14_00010 GO:0006275      ISM
4 PA14_00010 GO:0043565      ISM
5 PA14_00010 GO:0003677      ISM
6 PA14_00010 GO:0003688      ISM
7 PA14_00020 GO:0003677      ISM
8 PA14_00020 GO:0006260      ISM
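A direct way to run that duplicate check (a minimal sketch in base R, not part of the original post):

anyDuplicated(PA14Go) # returns 0 when no complete row is duplicated, else the index of the first duplicate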
The same goes for the sample finch data provided with the vignette: repeated GIDs, but unique GO numbers. Frustratingly, when I run makeOrgPackage on the vignette's sample data, there are no errors. What am I missing here?
Full script:
# Load in the GO-annotated PA14 file, downloaded from Pseudomonas.com
PA14file <- read.csv("../data/GO_annotations/GO_PA14.csv")
> colnames(PA14file)
[1] "LocusTag" "GeneName" "ProductDescription"
[4] "StrainName" "Accession" "GOTerm"
[7] "Namespace" "GOEvidenceCode" "EvidenceOntologyECOCode"
[10] "EvidenceOntologyTerm" "SimilarToBindsTo" "PMID"
[13] "chrom"
# PA14 only has 1 chromosome, so create a new column and populate it with 1s.
PA14file$chrom <- '1'
# Create gene_info df, remove duplicate rows
PA14Sym <- PA14file[,c("LocusTag", "GeneName", "ProductDescription")]
PA14Sym <- PA14Sym[PA14Sym[,"GeneName"]!="-",]
PA14Sym <- PA14Sym[PA14Sym[,"ProductDescription"]!="-",]
colnames(PA14Sym) <- c("GID","SYMBOL","GENENAME")
PA14Sym <- PA14Sym[!duplicated(PA14Sym), ]
# Create chromosome df, remove duplicate rows
PA14Chr <- PA14file[,c("LocusTag", "chrom")]
PA14Chr <- PA14Chr[PA14Chr[,"chrom"]!="-",]
colnames(PA14Chr) <- c("GID","CHROMOSOME")
PA14Chr %>% distinct(GID, .keep_all = TRUE) # NB: result not assigned, so this line has no effect
PA14Chr <- PA14Chr[!duplicated(PA14Chr), ]
# Create go df
PA14Go <- PA14file[,c("LocusTag", "Accession", "GOEvidenceCode")]
PA14Go <- PA14Go[PA14Go[,"GOEvidenceCode"]!="",]
colnames(PA14Go) <- c("GID","GO","EVIDENCE")
# Call the function
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
               version="0.1",
               maintainer="myname <email@university.edu>",
               author="myname <email@university.edu>",
               outputDir = ".",
               tax_id="208963",
               genus="Pseudomonas",
               species="aeruginosa",
               goTable="go")
I ran into the same issue today, and the function worked correctly as soon as I switched to dplyr's distinct(). (My makeOrgPackage() call is the same as yours.)
Try appending %>% dplyr::distinct() to the end of each data-frame creation step, or call dplyr::distinct() on each variable after all other operations, to remove the remaining duplicates. Note that your script deduplicates PA14Sym and PA14Chr but never PA14Go: once PA14file is subset to three columns, rows that differed only in the dropped columns (e.g. Namespace or PMID) can become fully duplicated.
In your case:
library(dplyr)
PA14Sym <- dplyr::distinct(PA14Sym)
PA14Chr <- dplyr::distinct(PA14Chr)
PA14Go <- dplyr::distinct(PA14Go)
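To verify all three inputs before calling makeOrgPackage(), a quick count of the duplicated rows remaining in each data frame (a minimal sketch using the objects above):

sapply(list(gene_info = PA14Sym, chromosome = PA14Chr, go = PA14Go),
       function(df) sum(duplicated(df))) # all three counts should be 0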
Hope this helps.
Related
How can I copy and rename a bunch of variables at once?
I have created some variables. I would like to duplicate these so that they exist twice, once with the name you see below, and once with Ireland_ in front of their name, i.e., c_PFS_Folfox = 307.81 would become: Ireland_c_PFS_Folfox = 307.81. I initially define these as follows:

# 1. Cost of treatment in this country
c_PFS_Folfox <- 307.81
c_PFS_Bevacizumab <- 2580.38
c_OS_Folfiri <- 326.02
administration_cost <- 365.00

# 2. Cost of treating the AE conditional on it occurring
c_AE1 <- 2835.89
c_AE2 <- 1458.80
c_AE3 <- 409.03

# 3. Willingness to pay threshold
n_wtp = 45000

Then I put them together to rename all at once:

kk <- data.frame(c_PFS_Folfox, c_PFS_Bevacizumab, c_OS_Folfiri,
                 administration_cost, c_AE1, c_AE2, c_AE3, n_wtp)
colnames(kk) <- paste("Ireland", kk, sep="_")
kk

  Ireland_307.81 Ireland_2580.38 Ireland_326.02 Ireland_365 Ireland_2835.89 Ireland_1458.8
1          307.8            2580            326         365            2836           1459
  Ireland_409.03 Ireland_45000
1            409         45000

Obviously this isn't the output I intended. These also don't exist as new variables in the environment. What can I do?
If we want to create objects with Ireland_ as prefix, either use

list2env(setNames(kk, paste0("Ireland_", names(kk))), .GlobalEnv)

Once we created the objects in the global env, we may remove the original objects:

> rm(list = names(kk))
> ls()
[1] "Ireland_administration_cost" "Ireland_c_AE1"               "Ireland_c_AE2"               "Ireland_c_AE3"               "Ireland_c_OS_Folfiri"
[6] "Ireland_c_PFS_Bevacizumab"   "Ireland_c_PFS_Folfox"        "Ireland_n_wtp"               "kk"

or with %=% from collapse:

library(collapse)
paste("Ireland", colnames(kk), sep="_") %=% kk

Checking:

> Ireland_administration_cost
[1] 365
> Ireland_c_PFS_Folfox
[1] 307.81
First put all your variables in a vector, then use sapply to iterate over the vector and assign each existing variable to a new variable with the prefix "Ireland_":

your_var <- c("c_PFS_Folfox", "c_PFS_Bevacizumab", "c_OS_Folfiri",
              "administration_cost", "c_AE1", "c_AE2", "c_AE3", "n_wtp")
sapply(your_var, \(x) assign(paste0("Ireland_", x), get(x), envir = globalenv()))
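For what it's worth, the original attempt failed because paste("Ireland", kk, sep="_") pastes the column values of kk rather than its names; the minimal fix for that single line would be:

colnames(kk) <- paste("Ireland", names(kk), sep="_") # paste the names, not the values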
Find differences between 2 dataframes with different lengths
I have two dataframes, each with the two columns c("price", "size"), and with different lengths. Each price must be linked to its size. They are two lists of trade orders. I have to discover the differences between the two dataframes, knowing that each one can contain orders the other doesn't have, and vice versa. I would like one output with the differences, or two outputs; it doesn't matter. But I need the row number in the output to find where the differences are in the series. Here is sample data:

> out
            price       size
     1: 36024.86 0.01431022
     2: 36272.00 0.00138692
     3: 36272.00 0.00277305
     4: 36292.57 0.05420000
     5: 36292.07 0.00403948
    ---
923598: 35053.89 0.30904890
923599: 35072.76 0.00232000
923600: 35065.60 0.00273000
923601: 35049.36 0.01760000
923602: 35037.23 0.00100000

> bit
            price       size
     1: 37279.89 0.01340020
     2: 37250.84 0.00930000
     3: 37250.32 0.44284049
     4: 37240.00 0.00056491
     5: 37215.03 0.99891906
    ---
923806: 35053.89 0.30904890
923807: 35072.76 0.00232000
923808: 35065.60 0.00273000
923809: 35049.36 0.01760000
923810: 35037.23 0.00100000

For example, I need to know if the first row of the dataframe out is in the dataframe bit. I've tried many functions. With comparedf():

summary(comparedf(bit, out, by = c("price", "size")))

I got this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :

I've tried compare_df():

compareout <- compare_df(out, bit, c("price", "size"))

But I know the results are wrong: I get only 23 results, and I know there are at least 200 differences. I've also tried match() and which(), but they don't give the results I'm looking for. If you have any other methods, I will take them.
Perhaps you could just do an inner_join on out and bit by price and size? But first make an id variable for both data.frames:

library(dplyr)

out$id <- 1:nrow(out)
bit$id <- 1:nrow(bit)
joined <- inner_join(bit, out, by = c("price", "size"))

Now we can check which ids from out and bit are not present in the joined table:

id_from_bit_not_included_in_out <- bit$id[!bit$id %in% joined$id.x]
id_from_out_not_included_in_bit <- out$id[!out$id %in% joined$id.y]

These ids are the rows not included in out or bit: id_from_bit_not_included_in_out contains rows present in bit but not in out, and id_from_out_not_included_in_bit contains rows present in out but not in bit.
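A more direct route (a sketch, not part of the original answer, using the same out and bit) is dplyr::anti_join(), which returns the rows of its first argument that have no match in the second:

library(dplyr)
only_in_out <- anti_join(out, bit, by = c("price", "size")) # orders in out missing from bit
only_in_bit <- anti_join(bit, out, by = c("price", "size")) # orders in bit missing from out

If the id columns were added as above, they travel along with the rows, which answers the "I need the row number" part of the question.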
First attempt here. It will be difficult to do a very clean job with this data tho. The data I used:

out <- read.table(text = "price size
36024.86 0.01431022
36272.00 0.00138692
36272.00 0.00277305
36292.57 0.05420000
36292.07 0.00403948
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = TRUE)

bit <- read.table(text = "price size
37279.89 0.01340020
37250.84 0.00930000
37250.32 0.44284049
37240.00 0.00056491
37215.03 0.99891906
37240.00 0.00056491
37215.03 0.99891906
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = TRUE)

Assuming purely that row 1 of out should match with row 1 of bit, a simple solution could be:

df <- cbind(distinct(out), distinct(bit))
names(df) <- make.unique(names(df))

However, judging from the data you have provided, I am not sure if this is the way to go (big differences in the first few rows), so maybe try sorting the data first:

df <- cbind(distinct(out[order(out$price, out$size),]),
            distinct(bit[order(bit$price, bit$size),]))
names(df) <- make.unique(names(df))
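With the aligned df above, one could then flag the positions where the two order books disagree: a sketch that rests on the same assumption, namely that the deduplicated tables line up row for row and have equal length:

mismatch <- df$price != df$price.1 | df$size != df$size.1 # ".1" columns come from make.unique()
which(mismatch) # row numbers where the aligned orders differ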
In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?
This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss), this question is part of an attempt to get the benefits of SPSS-style labelling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.

Whilst CSV does a reasonable job of storing data, it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard-code the metadata in. But for larger projects it makes more sense to store this metadata in a second CSV file.

Example data file:

ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,

Example metadata file:

Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,

So the common elements are the column names, which are the key to both files. The first column of the metadata file describes the role of the row for the data file:

varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
Missing describes how to treat missing data
vallable describes the label for a factor level, starting at one, on up to as many labels as there are

Right! Here's the code that works:

# Libraries
library(expss)
library(data.table)
library(magrittr)

readcsvdata <- function(dfile)
{
  # TESTED - Working
  print("OK Lets read some comma separated values")
  rdata <- fread(file = dfile, sep = ",", quote = "\"", header = TRUE,
                 stringsAsFactors = FALSE,
                 na.strings = getOption("datatable.na.strings", ""))
  return(rdata)
}

rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"

mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)

names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt), names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])

(Here I copy the rdt data.table simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.)
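The copy step itself isn't shown in the post; given the sentence above and the use of tdt throughout the code below, it was presumably something like this one-liner (an assumption on my part, not part of the original question):

tdt <- data.table::copy(rdt) # working copy of the raw data, so rdt stays untouched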
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
  expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}

OK, this is where I fall down. Failure from here:

factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
  print(is.na(mdt[4, ..col])) # print first row of value labels (as test)
  if (is.na(mdt[4, ..col]))
    factorcols <- factorcols[factorcols != col] # if not a factor column, remove it from the factorcols list and don't try to factor it
  else { # if it is a factor column, factorise
    print(paste("working on", col)) # I have had a lot of problems with unrecognised ..col variables
    tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get the list of labels from the metadata column
    validrange <- seq(1, lengths(tlabels), 1) # range of valid values is 1 to the length of the labels list
    print(as.character(tlabels)) # for testing
    print(validrange) # for testing
    tdt[[col]] <- factor(tdt[[col]], levels = validrange,
                         ordered = is.ordered(validrange),
                         labels = as.character(tlabels))
    # expss::val_lab(tdt[, ..col]) <- tlabels
    tlabels = c() # flush loop variable
    validrange = c() # flush loop variable
  }
}

So the problem is revealed when we check the data table tdt: the labels have been applied as whole vectors to each column entry, except where there is only one value in the vector ("Checked" for varfour and varfive):

tdt
 id       (int)          1
 varone   (fctr)         c("One", "Two", "Three") 1  (should be "One" 1)
 vartwo   (S3: labelled) 34
 varthree (fctr)         c("No", "Yes") 1            (should be "No" 1)
 varfour  (fctr)         NA
 varfive  (fctr)         Checked

And a mystery: this code works just fine on a single column when I don't use a for loop variable:

# test using column name
tlabels <- c("one", "two", "three")
validrange <- c(1, 2, 3)
factor(tdt[, varone], levels = validrange, ordered = is.ordered(validrange), labels = tlabels)
It seems the issue is in the line

tlabels <- as.vector(na.omit(mdt[4:18, ..col]))

It doesn't make a vector as you expect. Contrary to the usual data.frame, a data.table doesn't drop dimensions when you provide a single column in the index. And as.vector does nothing with data.frames/data.tables, so tlabels remains a data.table. This line needs to be rewritten as

tlabels <- na.omit(mdt[[col]][4:18])

Example:

library(data.table)
mdt = as.data.table(mtcars)
col = "am"

tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is a data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame':   4 obs. of  1 variable:
#  $ am: num  1 0 0 0
#  - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"

tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
#  num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"
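More generally (a side note, not part of the original answer), any of these idioms returns an atomic vector from a data.table column, which is what factor() and as.character() expect:

library(data.table)
dt <- as.data.table(mtcars)
v1 <- dt[["am"]] # double-bracket extraction by name
v2 <- dt$am      # dollar extraction
v3 <- dt[, am]   # unquoted column name in data.table's j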
How to write rownames into a spreadsheet with the googlesheets package in R?
I would like to write a data frame to a Google spreadsheet with the googlesheets package, but the rownames aren't written in the first column. My data frame looks like this:

> str(stats)
'data.frame':  4 obs. of  2 variables:
 $ Offensive: num  194.7 87 62.3 10.6
 $ Defensive: num  396.28 51.87 19.55 9.19
> stats
                                 Offensive  Defensive
Annualized Return               194.784261 396.278385
Annualized Standard Deviation     87.04125  51.872826
Worst Drawdown                    22.26618   9.546208
Annualized Sharpe Ratio (Rf=0%)    1.61126  0.9193734

I load the library as recommended in the documentation, create the spreadsheet & worksheet, then write the data with the gs_edit_cells command:

> install.packages("googlesheets")
> library("googlesheets")
> suppressPackageStartupMessages(library("dplyr"))
> mySpreadsheet <- gs_new("mySpreadsheet")
> mySpreadsheet <- mySpreadsheet %>% gs_ws_new("Stats")
> mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats, trim = TRUE)

Everything goes well, but googlesheets doesn't create a column with the rownames. Only two columns are created with their data (Offensive and Defensive). I have tried to convert the data frame into a matrix, but still the same. Any idea how I could achieve this? Thank you
It doesn't look like there is a row-names argument for gs_edit_cells(). If you just want the row names to show up in the first column of the sheet, you could try:

stats$Rnames <- rownames(stats) ## add a column equal to the row names
stats <- stats[, c("Rnames", "Offensive", "Defensive")] ## reorder so the names come first (note the assignment, or the reorder is lost)
# names(stats) <- c("", "Offensive", "Defensive") ## optional, if you want the names column to not have a "name"

From here, just pass stats to the functions from the googlesheets package like you did before.
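Alternatively (a suggestion of mine, not in the original answer), tibble::rownames_to_column() does the same move in one step:

library(tibble)
stats <- rownames_to_column(stats, var = "Rnames") # moves the row names into a "Rnames" column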
Scrape number of articles on a topic per year from NYT and WSJ?
I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:

     NYT WSJ
2011   2   3
2012  10   7

I found this tutorial for the NYT but it is not working for me :_(. When I get to line 30 I get this error:

> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
  length of 'dimnames' [1] not equal to array extent

Any help would be much appreciated. Thanks!

PS: This is my code that is not working (a NYT API key is needed: http://developer.nytimes.com/apps/register)

# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)

### set parameters ###
api <- "API key goes here" ###### <<< API key goes here!!
q <- "MOOCs"   # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10 - 1)

# read first set of data in
uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q,
             "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data)            # tokenize
dat <- unlist(res$results)           # convert the dates to a vector

# read in the rest via loop
for (i in 2:length(os)) {
  # concatenate URL for each offset
  uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q,
               "&offset=", os[i], "&fields=date&api-key=", api, sep="")
  raw.data <- readLines(uri, warn="F")
  res <- fromJSON(raw.data)
  dat <- append(dat, unlist(res$results)) # append
}

# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))

# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days

# compare dates from the counts dataframe with the whole date range,
# assign 0 where there is no count, otherwise take the count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")),
                cts$Freq, 0)

plot(freqs, type="l", xaxt="n", main=paste("Search term(s):", q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes

There is a RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but that's not relevant for you. I'm rewiring RNYTimes right now... Install from GitHub. You need to install devtools first to get install_github:

install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")

Then try your search with that, e.g.:

library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")

This gives you the number of articles found:

moocs$response$meta$hits
[1] 121

You could get word counts for each article by:

as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
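To turn that into per-year counts like the table in the question, one could pull publication dates the same way word_count is pulled above. This is a sketch resting on one assumption: that each element of moocs$response$docs carries a pub_date field (which the v2 Article Search responses include; it is not shown in the original answer):

dates <- sapply(moocs$response$docs, "[[", "pub_date") # assumed field, e.g. "2013-05-02T00:00:00Z"
years <- substr(dates, 1, 4)                           # first four characters are the year
table(years)                                           # article counts per year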