R: M3C library - Duplicate row.names error message - r

I am trying to run consensus clustering using M3C library in R. My dataset contains 451 samples and ~2500 genes. The row names are the ENTREZ IDs (numeric values) of the genes. I have crosschecked the dataset using "any(duplicated(colnames(MyData)))" command to make sure that there is no duplicate entries in the row names. I ran the following command to perform the consensus clustering using M3C library:
res <- M3C(MyData, cores=8, seed = 123, des = annotation, removeplots = TRUE, analysistype = 'chi', doanalysis = TRUE, variable = 'class')
I am getting the following error:
Warning message:
"non-unique values when setting 'row.names': "
Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
Traceback:
1. M3C(MyData, cores = 8, seed = 123, des = meta, removeplots = TRUE,
. analysistype = "chi", doanalysis = TRUE, variable = "class")
2. M3Creal(as.matrix(mydata), maxK = maxK, reps = repsreal, pItem = 0.8,
. pFeature = 1, clusterAlg = clusteralg, distance = distance,
. title = "/home/christopher/Desktop/", printres = printres,
. showheatmaps = showheatmaps, printheatmaps = printheatmaps,
. des = des, x1 = pacx1, x2 = pacx2, seed = seed, removeplots = removeplots,
. silent = silent, doanalysis = doanalysis, analysistype = analysistype,
. variable = variable, fsize = fsize, method = method)
3. `row.names<-`(`*tmp*`, value = newerdes$ID)
4. `row.names<-.data.frame`(`*tmp*`, value = newerdes$ID)
5. `.rowNamesDF<-`(x, value = value)
6. stop("duplicate 'row.names' are not allowed")
Can anyone please help me to resolve the issue?
Thanks

I ran the equivalent of the following using M3C:
df_wide_matrix # my expression matrix
any(duplicated(colnames(df_wide_matrix))) # result = FALSE
M3C::M3C(df_wide_matrix, iters=2, repsref=2, repsreal=2, clusteralg="hc", objective="PAC")
I ran into the exact same error message with M3C, in addition to:
In addition: Warning message:
non-unique values when setting 'row.names': ‘ABCDEF’, ‘ABCDGH’
I assumed the issue is caused by the fact the first four characters of each of these features are equal. I therefore temporarily changed their respective names prior to running M3C:
dup_ids <- which(colnames(dissADJ) %in% c("ABCDEF", "ABCDGH"))
colnames(dissADJ)[dup_ids] <- c("A", "B")
M3C::M3C(df_wide_matrix, iters=2, repsref=2, repsreal=2, clusteralg="hc", objective="PAC")
M3C then runs correctly. Not an ideal solution but worked for me - I've posted it as an issue: https://github.com/crj32/M3C/issues/6.

Related

Consistent error message while running grouping analysis in 'plspm' package

I am looking for some help in resolving an error using the partial least squares path modeling package ('plspm').
I can get results running a basic PLS-PM analysis but run into issues when using the grouping function, receiving the error message:
Error in if (w_dif < specs$tol || iter == specs$maxiter) break : missing value where TRUE/FALSE needed
I have no missing values and all variables have the proper classification. Elsewhere I read that there is a problem with processing observations with the exact same values across all variables, I have deleted those and still face this issue. I seem to be facing the issue only when I run the groups using the "bootstrap" method as well.
farmwood = read.csv("farmwood_groups(distance).csv", header = TRUE) %>%
slice(-c(119:123))
Control = c(0,0,0,0,0,0)
Normative = c(0,0,0,0,0,0)
B_beliefs = c(0,0,0,0,0,0)
P_control = c(1,0,0,0,0,0)
S_norm = c(0,1,0,0,0,0)
Behavior = c(0,0,1,1,1,0)
farmwood_path = rbind(Control, Normative, B_beliefs, P_control, S_norm, Behavior)
colnames(farmwood_path) = rownames(farmwood_path)
farmwood_blocks = list(14:18,20:23,8:13,24:27,19,4:7)
farmwood_modes = rep("A", 6)
farmwood_pls = plspm(farmwood, farmwood_path, farmwood_blocks, modes = farmwood_modes)
ames(farmwood)[names(farmwood) == "QB3"] <- "Distance"
farmwood$Distance <- as.factor(farmwood$Distance)
distance_boot = plspm.groups(farmwood_pls, farmwood$Distance, method = "bootstrap")
distance_perm = plspm.groups(farmwood_pls, farmwood$Distance, method = "permutation")
The data is contained here:
https://www.dropbox.com/s/8vewuupywpi1jkt/farmwood_groups%28distance%29.csv?dl=0
Any help would be appreciated. Thank you in advance

Making a function that builds a dataframe

I'm trying to make a function that basically builds a dataframe and returns it. This new dataframe is made of columns taken from another dataframe that I have, called metadata.. in addetion to some additional data that I want to control, by passing the TRUE or FALSE values when calling the function.
Here is what I did:
make_data = function(metric, use_additions = FALSE){
data = data.frame(my_metric = metadata[['metric']], gender = metadata$Gender ,
age = as.numeric(metadata$Age) , use_additions = t(additional_data))
data = data %>% dplyr::select(my_metric, everything())
return(data)
}
data = make_data(CR, FALSE)
I want to pass different metric values each time, and all other features stay the same. So here for example I called the function with metric as CR which is the name of the column I want in the metadata. The argument I want to control is use_additions, sometines I want to add it and sometimes I don't.
metadata and additional_data have the exact same row names and the same rows number. It's just adding the data or not.
I get this error(s):
Error in data.frame(metric = metadata[["metric"]], gender = metadata$Gender, :
arguments imply differing number of rows: 0, 1523
In addition: Warning message:
In data.frame(metric = metadata[["metric"]], gender = metadata$Gender, :
Error in data.frame(my_metric = metadata[["metric"]], gender = metadata$Gender, :
arguments imply differing number of rows: 0, 1523
I've tried several ways to do this, with '' and without, using the $, but non of these worked. So for example when I type metric = metadata[[metric]] I get this:
Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :
object 'CR' not found
make_data = function(colname, use_additions = FALSE){
data = data.frame(my_metric = metadata[colname], gender = metadata$Gender ,
age = as.numeric(metadata$Age))
if (use_additions) data$use_additions=additional_data
return(data)
}
data = make_data(“CR”, FALSE)

Problem with for loop when downloading species occurrence data

I want to download the occurrence data from gbif website and I use the following R script. When I run the script, I got an error with the following message "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0)". It would be highly appreciated if anyone could help me with this.
My data: data
My R script:
flist<-read_excel("Mekong fish.xlsx",sheet="Sheet1")
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# create for loop to produce ggplot2 graphs
for (i in seq_along(name_list)) {
# create plot for each Occurrence in df
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=T, limit= Occ[[i]]$meta[4],return ="data")
datfish[[i]]<-as.data.frame(fname[[i]]$data)
}
I got a different error:
Expecting logical in D1424 / R1424C4: got 'in Lao'Expecting logical in D1426 / R1426C4: got 'in China'Expecting logical in D1467 / R1467C4: got 'only Cambodia'Expecting logical in D1469 / R1469C4: got 'only in VN'Expecting logical in D1473 / R1473C4: got 'only in China'Expecting logical in D1486 / R1486C4: got 'only in Malaysia'Expecting logical in D1488 / R1488C4: got 'only 1 point in VN'
I think the problem is caused in some fields in the 4th column. I don't have the right packages installed to run your code. But I got a different error (package missing) once i dropped the fourth column.
flist<-read_excel("~/Downloads/Mekong fish.xlsx",sheet="Sheet1")
flist <=subset(flist, select = -4)
...
EDIT:
This worked for me. read_excel assigned column 4 the type boolean. When I explicitly set it to text it worked.
library(readxl)
library(rgbif)
library(raster)
flist<-read_excel("~/Downloads/Mekong fish.xlsx",
sheet="Sheet1",
col_types = c("numeric", "text", "numeric", "text"))
flist
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# create for loop to produce ggplot2 graphs
for (i in seq_along(name_list[1:2])) {
message(i)
# # create plot for each Occurrence in df
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
message(Occ[[i]])
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=T, limit= Occ[[i]]$meta[4],return ="data")
message(fname[[i]])
datfish[[i]]<-as.data.frame(fname[[i]]$data)
message(datfish[[i]])
}
> 1
> list(offset = 0, limit = 2, endOfRecords = FALSE, count = >15)list(list(name = c("Animalia", "Chordata", "Actinopterygii",
> "Cypriniformes", "Cyprinidae", "Aaptosyax", "Aaptosyax grypus"), key = > > c("1", "44", "204", "1153", "7336", "2363805", "2363806"),
> etc...

Clustering by M3C package : Error in `[.data.frame`(df, neworder2) : undefined columns selected

I had a similar problem to what posted here. To resolve the issue, followed the answer by #Jack Gisby there. Now a new error showed up:
Working on TCGA data , I am getting the same error (first error):
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
running duplicated() on each relevant field returned FALSE.
Her is the second error (just after trimming identifiers to not start with a common string like "TCGA-"):
Error in `[.data.frame`(df, neworder2) : undefined columns selected
> traceback()
5: stop("undefined columns selected")
4: `[.data.frame`(df, neworder2)
3: df[neworder2]
2: M3Creal(as.matrix(mydata), maxK = maxK, reps = repsreal, pItem = pItem,
pFeature = 1, clusterAlg = clusteralg, distance = distance,
title = "/home/christopher/Desktop/", des = des, lthick = lthick,
dotsize = dotsize, x1 = pacx1, x2 = pacx2, seed = seed, removeplots = removeplots,
silent = silent, fsize = fsize, method = method, objective = objective)
1: M3C(pro.vst, des = clin, removeplots = FALSE, iters = 25, objective = "PAC",
fsize = 8, lthick = 1, dotsize = 1.25)
I've added to an opened issue on the M3C GitHub.
I got the same error as Hamid Ghaedi while running M3C. I managed to track it down to the following line of code (line 476 on the M3C.R file):
df <- data.frame(m_matrix)
Many of my sample names (column names) started with a number and the data.frame() function added an "X" to the beginning of each name that started with a number ("1" becomes "X1"). This caused a mismatch with the names listed in neworder2.
To get around this problem, I changed all of my sample names to start with a letter and M3C is now running correctly.
Edit: This workaround can be easily applied by using the data.frame() function on your input dataset before running M3C.

Difficulty in downloading TCGA data

I am trying to download the TCGA data but I am getting this error:
Error in summarizeMaf(maf = maf, anno = clinicalData, chatty =
verbose): Tumor_Sample_Barcode column not found in provided clinical
data. Rename column containing sample names to Tumor_Sample_Barcode if
necessary.
This is my code:
library("TCGAbiolinks")
library("tidyverse")
library(maftools)
query <- GDCquery( project = "TCGA-LIHC",
data.category = "Clinical",
file.type = "xml",
legacy = FALSE)
GDCdownload(query,directory = ".")
clinical <- GDCprepare_clinic(query, clinical.info = "patient",directory = ".")
#getting the survival time of event data
survival_data <- as_tibble(clinical[,c("days_to_last_followup","days_to_death","vital_status","bcr_patient_barcode","patient_id")])
survival_data <- filter(survival_data,!is.na(days_to_last_followup)|!is.na(days_to_death)) #not both NA
survival_data <- filter(survival_data,!is.na(days_to_last_followup)|days_to_last_followup>0 &is.na(days_to_death)|days_to_death > 0 ) #ensuring positive values
survival_data <- survival_data[!duplicated(survival_data$patient_id),] #ensuring no duplicates
dim(survival_data) #should be 371
maf <- GDCquery_Maf("LIHC", pipelines = "muse")
#maf <- GDCquery_Maf("LIHC", pipelines = "somaticsniper")
#clin <- GDCquery_clinic("TCGA-LIHC","clinical")
#print(clin )
laml = read.maf(
maf,
clinicalData = clinical,
removeDuplicatedVariants = TRUE,
useAll = TRUE,
gisticAllLesionsFile = NULL,
gisticAmpGenesFile = NULL,
gisticDelGenesFile = NULL,
gisticScoresFile = NULL,
cnLevel = "all",
cnTable = NULL,
isTCGA = TRUE,
vc_nonSyn = NULL,
verbose = TRUE
)
You should have: a) loaded with library(maftools) and b) included what was printed out before that error message:
-Validating
-Silent variants: 18306
-Summarizing
--Possible FLAGS among top ten genes:
TTN
MUC16
OBSCN
FLG
-Processing clinical data
Available fields in provided annotations..
[1] "bcr_patient_barcode" "additional_studies"
[3] "tissue_source_site" "patient_id"
# snipped remaining 78 column names
Notice that the first column is not named "Tumor_Sample_Barcode", so you need to follow the helpful error message directions and rename the appropriate column which appears to be the first one:
ns. After doing so I get:
-Validating
-Silent variants: 18306
-Summarizing
--Possible FLAGS among top ten genes:
TTN
MUC16
OBSCN
FLG
-Processing clinical data
-Finished in 1.911s elapsed (2.470s cpu)

Resources