Duplicate Row Name Error for FAMD Visualization

I'm trying to run fviz_famd_ind() in R and keep getting an error. It works on the wine dataset provided in the package, but not on my cleaned version of IBM's Telco.Customer.Churn data set.
I created the FAMD object from the cleaned data set, called dfcfamd1. I've verified that there are no duplicate row or column names by running any(duplicated(rownames())) on both Telco.Customer.Churn and dfcfamd1; both return FALSE.
fviz_famd_ind(dfcfamd1)
> Error in `.rowNamesDF<-`(x, value = value) :
> duplicate 'row.names' are not allowed
> In addition: Warning message:
> non-unique values when setting 'row.names': ‘No’, ‘Yes’
Sample Data below
head(Telco.Customer.Churn)
customerID gender SeniorCitizen Partner Dependents tenure
1 7590-VHVEG Female 0 Yes No 1
2 5575-GNVDE Male 0 No No 34
3 3668-QPYBK Male 0 No No 2
PhoneService MultipleLines InternetService OnlineSecurity
1 No No DSL No
2 Yes No DSL Yes
3 Yes Yes Fiber optic No
OnlineBackup DeviceProtection TechSupport StreamingTV
1 Yes No No No
2 No No No No
3 No Yes No Yes
StreamingMovies Contract PaperlessBilling PaymentMethod
1 No Month-to-month Yes Electronic check
2 No One year No Mailed check
3 No Month-to-month Yes Mailed check
MonthlyCharges TotalCharges Churn
1 29.85 29.85 No
2 56.95 1889.50 No
3 53.85 108.15 Yes
The call should produce a plot, which it does for the package data, but not for my data.
When I try to make the row names unique, I get a coercion error.
rownames(dfcfamd1) = make.names(names, unique=TRUE)
> Error in as.character(names) :
> cannot coerce type 'builtin' to vector of type 'character'

The issue is that names is a function, so this call passes the function itself to make.names():
rownames(dfcfamd1) = make.names(names, unique=TRUE)
Instead, pass the existing row names:
row.names(dfcfamd1) = make.names(row.names(dfcfamd1), unique=TRUE)
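
As a small illustration of what make.names(..., unique = TRUE) does with repeated names, here is a sketch on made-up data (the data frame toy and the vector nm are assumptions for illustration, not part of the question's data):

```r
# Toy data frame; `toy` and `nm` are hypothetical names used only here
toy <- data.frame(x = 1:3)
nm <- c("No", "Yes", "No")   # contains a duplicate

# Assigning nm directly as row names would fail because of the duplicate,
# so make the names unique first
unique_nm <- make.names(nm, unique = TRUE)
row.names(toy) <- unique_nm

print(row.names(toy))
```

This only shows how the row names themselves can be made unique; whether it resolves the plotting error depends on where the duplicated labels come from.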

Try:
fviz_pca_ind(dfcfamd1)
PS: I ran into the same problem. It was solved simply by using fviz_pca_ind() instead of fviz_famd_ind(), as the two functions take data with similar structures.

It seems that fviz_famd_ind cannot handle identical values (such as "Yes" and "No") that appear in several categorical columns.
One way to solve this is to rename the values so that they are unique across columns:
library(dplyr)  # provides recode_factor()
# Define factors
cols <- c("Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling", "Churn")
dfcfamd1[cols] <- lapply(dfcfamd1[cols], factor)
rm(cols)
# Rename the factor levels
# Do this for every column until only unique values remain.
dfcfamd1$Partner <- recode_factor(dfcfamd1$Partner, "Yes" = "yesPartner", "No" = "noPartner")
#[...]
dfcfamd1$Churn <- recode_factor(dfcfamd1$Churn, "Yes" = "yesChurn", "No" = "noChurn")
# Run the function on dfcfamd1
fviz_famd_ind(dfcfamd1)
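
Recoding every column by hand gets tedious. A minimal base R sketch of the same idea, assuming it is acceptable to prefix each level with its column name (prefix_levels and the toy data are hypothetical and not part of any package):

```r
# Hypothetical helper: prefix every factor level with its column name,
# so that "Yes"/"No" become unique across columns
prefix_levels <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  for (nm in names(df)[is_fac]) {
    levels(df[[nm]]) <- paste(nm, levels(df[[nm]]), sep = "_")
  }
  df
}

# Toy stand-in for the Telco data (assumed structure)
toy <- data.frame(
  Partner = factor(c("Yes", "No")),
  Churn   = factor(c("No", "Yes")),
  tenure  = c(1, 34)
)

toy2 <- prefix_levels(toy)
print(lapply(toy2[c("Partner", "Churn")], levels))
```

Numeric columns are left untouched. The FAMD fit would presumably need to be recomputed on the recoded data before plotting.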

How to convert dataframe from Character to Numeric

I know this question may be a duplicate, but I tried all the solutions in:
How to convert entire dataframe to numeric while preserving decimals?
https://statisticsglobe.com/convert-data-frame-column-to-numeric-in-r
but none of them worked.
I imported the Excel data from my computer manually:
File > import data > excel, and I set the data type to numeric.
I checked my data using
View(Old_data)
and it is indeed of type numeric.
head(Old_data)
QC_G.F9_01_4768 QC_G.F9_01_4765
M95T834 70027.02 69578.19
M97T834 95774.14 81479.30
M105T541 75686.39 68455.65
M109T834 72093.07 70942.65
M111T834_2 77502.98 77527.54
M114T834 68132.06 70296.73
M121T834 52233.05 56074.64
M125T834 44559.99 35831.79
M128T834 59257.48 59574.73
M135T834 105136.55 105274.98
But then I transposed the data (rows into columns and columns into rows) in R:
New_data <- as.data.frame(t(Old_data))
When I checked the new data using:
View(New_data)
I found that my columns are of type character and not numeric.
I tried to convert New_data to numeric:
New_data_B <- as.numeric(New_data)
I checked the dimensions using
dim(New_data_B)
17 1091
Here's an example of my data:
New_data_B
#> Name MT95T843 MT95T756
#> 1 QC_G.F9_01_4768 70027.02132 95774.13597
#> 2 QC_G.F9_01_4765 69578.18634 81479.29575
#> 3 QC_G.F9_01_4762 69578.18634 87021.95427
#> 4 QC_G.F9_01_4759 68231.14338 95558.76738
#> 5 QC_G.F9_01_4756 64874.12936 96780.77245
#> 6 QC_G.F9_01_4753 63866.65780 91854.35304
#> 7 CtrF01R5_G.D1_01_4757 66954.38799 128861.36163
#> 8 CtrF01R4_G.D5_01_4763 97352.55229 101353.25927
#> 9 CtrF01R3_G.C8_01_4754 61311.78576 7603.60896
#> 10 CtrF01R2_G.D3_01_4760 85768.36117 109461.75445
#> 11 CtrF01R1_G.C9_01_4755 85302.81947 104253.84537
#> 12 BtiF01R5_G.D7_01_4766 61252.42545 115683.73755
#> 13 BtiF01R4_G.D6_01_4764 81873.96379 112164.14229
#> 14 BtiF01R3_G.D2_01_4758 84981.21914 0.00000
#> 15 BtiF01R2_G.D4_01_4761 36629.02462 124806.49101
#> 16 BtiF01R1_G.D8_01_4767 0.00000 109927.26425
#> 17 rt 13.90181 13.90586
I also converted my data to a CSV file and imported it:
Old_data <- as.data.frame(read.csv("data.csv" , sep="," , header=TRUE,stringsAsFactors=FALSE))
And also using :
#install.packages("readxl")
library("readxl")
Old_data <- read_excel("data.xlsx")
I tried the solution suggested by Mr sveer:
New_data <- cbind(Name=Old_data[1,],as.data.frame(t(Old_data[-1,])))
It gives this result:
head(New_data)
When I tried
View(New_data)
Name.QC_G.F9_01_4768 Name.QC_G.F9_01_4765
70027.02 69578.19
95774.14 81479.30
75686.39 68455.65
72093.07 70942.65
77502.98 77527.54
68132.06 70296.73
52233.05 56074.64
4559.99 35831.79
59257.48 59574.73
105136.55 105274.98
It deletes the row names!
I'm just confused by this problem. I think it arises because I converted rows into columns and columns into rows.
Please ask if anything needs clarification; I can also send the data to someone so they can try.
Thank you very much.
Reason why you get character type and not numeric:
Transposing a data frame converts it to a matrix. A matrix can hold only a single type, so when the columns have mixed classes (for example, a character name column next to numeric columns), every value is coerced to character.
Solution:
I am still not sure about the structure of your data. It is always a good idea to add a reproducible example; if the data is large, you could also use pastebin or just reproduce it as described.
I assume that when you load the data via File > import data > excel, the first column is called "Name".
To get your desired output (especially the row names), you could try:
setNames(as.data.frame(t(Old_data[,-1])),Old_data[[1]]) -> df
If you want to turn the row names into a column:
tibble::rownames_to_column(df, "Name")
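
To see the coercion in isolation, here is a small base R sketch on made-up data. The two-row Old_data below is an assumption about the layout (a character Name column followed by numeric sample columns) and borrows a few values from the question:

```r
# Assumed layout: first column holds feature names, the rest are numeric
Old_data <- data.frame(
  Name = c("M95T834", "M97T834"),
  QC_A = c(70027.02, 95774.14),
  QC_B = c(69578.19, 81479.30),
  stringsAsFactors = FALSE
)

# Transposing everything: the character column forces a character matrix
bad <- as.data.frame(t(Old_data))
print(sapply(bad, class))

# Transposing only the numeric part keeps the values numeric;
# the Name column becomes the new column names
good <- setNames(as.data.frame(t(Old_data[, -1])), Old_data[[1]])
print(sapply(good, class))
print(rownames(good))
```

Dropping the name column before calling t() is what keeps the matrix numeric; the old column names survive as row names of the result.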

goseq package in R "missing value where TRUE/FALSE needed" error

I am attempting to run a GO Analysis in R (I have never done this analysis, so I am trying different packages), and I am struggling to find the problem with my code in the goseq package.
I start with this code which produces a list of the differentially expressed gene names:
de.genes <- rownames(res)[ which(res$padj < fdr.threshold & !is.na(res$padj)) ]
Then I try to run this code, based on page 7 of the vignette (https://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf):
pwf <- nullp(de.genes, "hg38","geneSymbol")
but I get the following error:
Can't find hg38/geneSymbol length data in genLenDataBase...
Found the annotation package, TxDb.Hsapiens.UCSC.hg38.knownGene
Trying to get the gene lengths from it.
Error in if (matched_frac == 0) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In grep(txdbPattern, installedPackages):argument 'pattern' has length > 1 and only the first element will be used
I found this forum: https://support.bioconductor.org/p/38580/ that says I need an "indicator variable" but I do not know what this is.
Any help with this error would be greatly appreciated, or if you know of any other GO packages that are easy to learn. Thanks!
You can check the supported databases; hg38 is listed, but without gene length data (see the FALSE in the last row):
library(org.Hs.eg.db)
library(goseq)
# Assumed definition: the original answer does not show how `supported` was created
supported <- supportedOrganisms()
supported[grep("hg38|hg19",supported$Genome),]
   Genome         Id  Id Description Lengths in geneLeneDataBase GO Annotation Available
4    hg19  knownGene  Entrez Gene ID                        TRUE                    TRUE
36   hg19    ensGene Ensembl gene ID                        TRUE                    TRUE
81   hg19 geneSymbol     Gene Symbol                        TRUE                    TRUE
98   hg38                                                  FALSE                    TRUE
You can get a rough idea of what the result looks like by using hg19; some genes will be missing or unmatched, but it should be OK. You need a binary vector, and it should be named, for example:
set.seed(111)
allgenes = keys(org.Hs.eg.db,keytype="SYMBOL")
de.genes = rbinom(100,1,0.3)
names(de.genes) = sample(allgenes,100)
It looks like this:
GALNT5 TPRKB CD48 OR52R1 LOC105372708 LOC112163649
0 1 0 0 0 0
LOC105369203 LOC110121115 LOC105377654 LOC105371502 LOC101929964 HPC14
0 0 0 0 0 0
IGHD4-17 LOC101927993 HINT1 BCC3 RPL18P3 LOC108281192
0 0 0 0 0 1
RNU6-793P JUN
0 0
This will be ok:
res = nullp(de.genes,"hg19","geneSymbol")
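
The "indicator variable" mentioned in the question is this kind of named 0/1 vector. A minimal base R sketch of building it from a list of all tested genes and a list of differentially expressed ones, rather than at random (the gene symbols are placeholders, and the names all.genes, de.names and gene.vector are hypothetical; with results like those in the question, all.genes would presumably come from rownames(res)):

```r
# Placeholder inputs: every gene that was tested, and the subset called DE
all.genes <- c("JUN", "HINT1", "CD48", "TPRKB", "GALNT5")
de.names  <- c("JUN", "TPRKB")

# 1 = differentially expressed, 0 = not; names carry the gene identifiers
gene.vector <- as.integer(all.genes %in% de.names)
names(gene.vector) <- all.genes

print(gene.vector)
```

A vector of this shape, rather than a plain character vector of gene names, is what would be passed as the first argument of nullp().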

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
library(RCurl)  # provides getURL()
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-pps$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
The error message occurs when the unary minus is applied to something that is not numeric, such as a chr column. A possible workaround for character columns is to negate xtfrm() of the column instead of the column itself, like so:
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(-xtfrm(column_a), column_b)),]
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
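In this block, emps$EV doesn't exist: emps is built by aggregate(EM~player, ...), so its columns are player and EM, and -emps$EV applies the minus sign to NULL.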
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)
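
A self-contained sketch of why the typo produces this particular message, using a made-up three-row emps (the values are invented for illustration): a missing column comes back as NULL, and negating NULL is what raises the error.

```r
# Made-up stand-in for the aggregated data frame
emps <- data.frame(player = c("A", "B", "C"), EM = c(0.1, -0.2, 0.3))

# A column that does not exist is returned as NULL, without any warning
print(is.null(emps$EV))

# Negating NULL fails; capture the message instead of stopping the script
msg <- tryCatch(order(-emps$EV), error = function(e) conditionMessage(e))
print(msg)

# Sorting on the column that does exist; decreasing = TRUE avoids the minus sign
sorted <- emps[order(emps$EM, decreasing = TRUE), ]
print(sorted)
```

Using decreasing = TRUE is a design choice here: it also works for non-numeric columns, where a unary minus would fail.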

Change value column a if column b contains conditional string

This issue is giving me a lot of trouble, even though it should be easy to fix. I have a dataset with the columns id and poster. I want to change the poster's value if the id value contains a certain string. See data below:
test_df
id poster
143537222999_2054 Kevin
143115551234_2049 Dave
14334_5334 Eric
1456322_4334 Mandy
143115551234_445633 Patrick
143115551234_4321 Lars
143537222999_56743 Iris
I would like to get
test_df
id poster
143537222999_2054 User
143115551234_2049 User
14334_5334 Eric
1456322_4334 Mandy
143115551234_445633 User
143115551234_4321 User
143537222999_56743 User
Both columns are of type character. I would like to change the poster's value to "User" if the id value contains "143537222999" or "143115551234". I have tried the following code:
Match within/which
test_df <- within(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
This code gave me no errors, but it didn't change any of the values in the poster column. When I replace within with which, I get the error:
test_df <- which(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
Error in which(test_df, poster[match("143115551234", test_df$id) | :
argument to 'which' is not logical
Match different variant
test_df <- test_df[match(id, test_df, "143115551234") | match(id, test_df, "143537222999"), test_df$poster] <- 'User'
This code gives me the error:
Error in `[<-.data.frame`(`*tmp*`, match(id, test_df, "143115551234") | :
missing values are not allowed in subscripted assignments of data frames
In addition: Warning messages:
1: In match(id, test_df, "143115551234") :
NAs introduced by coercion to integer range
2: In match(id, test_df, "143537222999") :
NAs introduced by coercion to integer range
After looking up this error, I found out that integers in R are 32-bit and that the maximum value of an integer is 2147483647. I'm not sure why I'm getting this error, because R states that my column is a character.
> lapply(test_df, class)
$poster
[1] "character"
$id
[1] "character"
Grepl
test_df[grepl("143115551234", id | "143537222999", id), poster := "User"]
This code raises the error:
Error in `:=`(poster, "User") : could not find function ":="
I'm not sure what the best way to fix this is; I have tried multiple variations and keep running into different errors.
I have tried multiple answers from earlier questions on here, but I still can't fix some of the errors.
Use grepl with ifelse:
df$poster <- ifelse(grepl("143537222999|143115551234", df$id), "User", df$poster)
You may try this using grepl.
df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
So every row whose id matches either pattern gets "User" in the poster column:
> df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
> df
id poster
1 143537222999_2054 User
2 143115551234_2049 User
3 14334_5334 Eric
4 1456322_4334 Mandy
5 143115551234_445633 User
6 143115551234_4321 User
7 143537222999_56743 User
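
For reference, a self-contained version that rebuilds the question's data and applies the same grepl() idea. It also shows why the match() attempts changed nothing: match() compares whole strings, so a partial id never matches. The variable hit is a hypothetical name used only in this sketch.

```r
# Rebuild the data from the question
test_df <- data.frame(
  id = c("143537222999_2054", "143115551234_2049", "14334_5334",
         "1456322_4334", "143115551234_445633", "143115551234_4321",
         "143537222999_56743"),
  poster = c("Kevin", "Dave", "Eric", "Mandy", "Patrick", "Lars", "Iris"),
  stringsAsFactors = FALSE
)

# match() looks for exact equality, so a substring is never found
print(match("143115551234", test_df$id))

# grepl() does pattern matching; "|" inside the pattern means "or"
hit <- grepl("143537222999|143115551234", test_df$id)
test_df$poster[hit] <- "User"

print(test_df)
```

If the strings should only count when they start the id, the pattern could be anchored, for example "^(143537222999|143115551234)_".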

Cannot coerce class ....to a data.frame error

R subject
I get a "cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame" error when trying to save the summary to a file. I have tried to fix that with as.data.frame, with no success.
Code:
library(plyr)
library(pastecs)
data <- read.table("C:\\Users\\Ron\\Desktop\\dataset.txt", header=F, col.name="A")
data.tp=turnpoints(data$A)
print(data.tp)
Turning points for: data$A
nbr observations : 5990
nbr ex-aequos : 51
nbr turning points: 413 (first point is a pit)
E(p) = 3992 Var(p) = 1064.567 (theoretical)
data.sum=summary(data.tp)
print(data.sum)
point type proba info
1 11 pit 7.232437e-15 46.97444
2 21 peak 7.594058e-14 43.58212
3 30 pit 3.479857e-27 87.89303
4 51 peak 5.200612e-29 93.95723
5 62 pit 7.594058e-14 43.58212
6 70 peak 6.213321e-14 43.87163
7 81 pit 6.276081e-16 50.50099
8 91 peak 5.534016e-23 73.93602
.....................................
write.table(data.sum, file = "C:\\Users\\Ron\\Desktop\\datasetTurnP.txt")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame
In addition: Warning messages:
1: package ‘plyr’ was built under R version 3.0.1
2: package ‘pastecs’ was built under R version 3.0.1
How can I save these summary results to a text file?
Thank you.
Look at the Value section of:
?pastecs::summary.turnpoints
It should be clear that this will not be a set of lists all of which have the same length. Hence the error message. So rather than asking for the impossible, tell us what you wanted to save.
It's actually not impossible, just not possible with write.table, since the object is not a data frame. The dump function lets you write an ASCII representation (a structure(...) call) of that summary object. Note that dump() takes the object's name as a character string:
dump("data.sum", file="dump_data_sum.asc")
The file can then be read back with source().
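
A self-contained sketch of the dump()/source() round trip on a stand-in object, since the real summary requires pastecs and the original data file. The object obj and its class "toy_summary" are invented for illustration and only mimic a classed list that is not a data frame:

```r
# Stand-in for a classed summary object (hypothetical structure)
obj <- structure(
  list(point = c(11, 21), type = c("pit", "peak")),
  class = "toy_summary"
)

# dump() expects the NAME of the object as a character string
path <- tempfile(fileext = ".R")
dump("obj", file = path)

# The file contains R code that recreates the object
cat(readLines(path), sep = "\n")

# Read it back in a private environment so the original is not overwritten
restored <- local({
  source(path, local = TRUE)
  obj
})

print(identical(restored, obj))
```

The dumped file is plain text, so it can be opened in an editor, but it is R code rather than a table. If a tabular text file is the goal, the relevant components would have to be pulled out into a data frame first.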
