mlogit duplicate 'row.names' are not allowed - r

New to R and want to use mlogit function.
However after putting my data into a data frame and run
x <- mlogit.data(mlogit, choice="PlacedN", shape="long", alt.var="RaceID")
I get duplicate 'row.names' are not allowed
I can upload my file if needed I've spent days trying to get this to work, so any help will be appreciated

You may want to put "RaceID" into the alt.levels argument instead of alt.var. From the mlogit.data help file:
alt.levels
the name of the alternatives: if null, for a wide data.frame, they are guessed from the variable names and the choice variable (both should be the same), for a long data.frame, they are guessed from the alt.var argument.
Give this a try.
library(mlogit)
m <- read.csv("mlogit.csv")
mlogd <- mlogit.data(m, choice="PlacedN", shape="long", alt.levels="RaceID")
head(mlogd)
# RaceID PlacedN RSP TrA JoA aDS bDS mDS aDH bDH mDH LDH MR eMR
# 1.RaceID 20119552 TRUE 3.00 13 12 0 0 0 0 0 0 0 0 131
# 2.RaceID 20119552 FALSE 4.00 23 26 91 94 94 139 153 145 153 150 150
# 3.RaceID 20119552 FALSE 0.83 15 15 99 127 99 150 153 150 153 159 159
# 4.RaceID 20119552 FALSE 18.00 21 15 0 0 0 0 0 0 0 0 131
# 5.RaceID 20119552 FALSE 16.00 16 12 92 127 92 134 135 134 135 136 136
# 6.RaceID 20119617 TRUE 2.50 12 10 0 0 0 0 0 0 0 0 152

Related

Cannot read .data file under R?

Good morning,
I need to read the following .data file : https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
For this , I tried without success :
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data", open="r" ,encoding="UTF-16LE")
data <- read.table(f, dec=",", header=F)
Thank you a lot for help!
I would try to use the coatless/ucidata package to access the data.
https://github.com/coatless/ucidata
Here you can see how the package loads in the data file and processing:
https://github.com/coatless/ucidata/blob/master/data-raw/heart_disease_build.R
If you wish to try out the package, you will need devtools installed. Here is what you can try:
# install.packages("devtools")
devtools::install_github("coatless/ucidata")
# load data
data("heart_disease_cl", package = "ucidata")
# show beginning rows of data
head(heart_disease_cl)
Output
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
1 63 Male typical angina 145 233 1 probable/definite hypertrophy 150 No 2.3 downsloping 0 fixed defect 0
2 67 Male asymptomatic 160 286 0 probable/definite hypertrophy 108 Yes 1.5 flat 3 normal 2
3 67 Male asymptomatic 120 229 0 probable/definite hypertrophy 129 Yes 2.6 flat 2 reversable defect 1
4 37 Male non-anginal pain 130 250 0 normal 187 No 3.5 downsloping 0 normal 0
5 41 Female atypical angina 130 204 0 probable/definite hypertrophy 172 No 1.4 upsloping 0 normal 0
6 56 Male atypical angina 120 236 0 normal 178 No 0.8 upsloping 0 normal 0
I had found another solution with RCurl :
library (RCurl)
download <- getURL("http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
data <- read.csv (text = download)
head(data)
#Output :
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine
1 75 0 582 0 20 1 265000 1.9
2 55 0 7861 0 38 0 263358 1.1
3 65 0 146 0 20 0 162000 1.3
4 50 1 111 0 20 0 210000 1.9
5 65 1 160 1 20 0 327000 2.7
6 90 1 47 0 40 1 204000 2.1
serum_sodium sex smoking time DEATH_EVENT
1 130 1 0 4 1
2 136 1 0 6 1
3 129 1 1 7 1
4 137 1 0 7 1
5 116 0 0 8 1
6 132 1 1 8 1

counting the number of times a value appears in a column in relation to other columns in r

I am new to r and I have a dataframe very close to the one below and I would love to find a general way that tells me how many times plus 1, the number "0" appears for each country (intro4) and id.
Intro4 number id
221 TAN 0 19
222 TAN 0 73
223 TAN 0 73
224 TOG 0 37
225 TOG 0 58
226 UGA 0 96
227 UGA 0 112
228 UGA 0 96
229 ZAM 0 40
230 ZAM 0 99
231 ZAM 0 139
I can do it by hand by it is a big data frame and would take forever, count () gives me the frequency but doesn't divide it between different countries. I have found a way to do it but I will have to select and filter for each individual county (intro4) and add 1 to the result. I was wondering if there was any quicker way to fo it. The code I have tried was this one:
projects <- finalr %>% select (Intro4,number,id)
projects1<-projects %>% filter (str_detect (number, "0"))
projects2<-projects1 %>%arrange (Intro4)
projects3<-sum(projects2$Intro4 == "TAN", na.rm = TRUE)
projects4<-sum(projects2$Intro4=="UGA",na.rm=TRUE)
I would be extremely grateful for any help, thank you :)
You can also do it as followed:
library(dplyr)
dat <- read.table(header = T, text =
"Intro4 number id
TAN 0 19
TAN 0 73
TAN 0 73
TOG 0 37
TOG 0 58
UGA 0 96
UGA 0 112
UGA 0 96
ZAM 0 40
ZAM 0 99
ZAM 0 139", stringsAsFactors = F)
dat %>% group_by(Intro4, id, number) %>% tally()
Which produces:
Intro4 id number n
<chr> <int> <int> <int>
1 TAN 19 0 1
2 TAN 73 0 2
3 TOG 37 0 1
4 TOG 58 0 1
5 UGA 96 0 2
6 UGA 112 0 1
7 ZAM 40 0 1
8 ZAM 99 0 1
9 ZAM 139 0 1
Assuming number can be anything like 0, 1, 2 etc. one can count occurrence of 0 by sum(number==0). A solution using dplyr can be as:
library(dplyr)
df %>% group_by(Intro4, id) %>%
summarise(count = sum(number==0))
# # A tibble: 9 x 3
# # Groups: Intro4 [?]
# Intro4 id count
# <chr> <int> <int>
# 1 TAN 19 1
# 2 TAN 73 2
# 3 TOG 37 1
# 4 TOG 58 1
# 5 UGA 96 2
# 6 UGA 112 1
# 7 ZAM 40 1
# 8 ZAM 99 1
# 9 ZAM 139 1
Data:
df <- read.table(text="
Intro4 number id
221 TAN 0 19
222 TAN 0 73
223 TAN 0 73
224 TOG 0 37
225 TOG 0 58
226 UGA 0 96
227 UGA 0 112
228 UGA 0 96
229 ZAM 0 40
230 ZAM 0 99
231 ZAM 0 139",
header = TRUE, stringsAsFactors = FALSE)

Correlation between variables and components using pls package

Using the data.frame below (Source: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_PLSR_Software_Comparison.pdf)
Data
df <- read.table(text = c("
diesel twodoors sportsstyle wheelbase length width height curbweight enginesize horsepower horse_per_weight conscity price symboling
0 1 0 97 172 66 56 2209 109 85 0.0385 8.7 7975 2
0 0 0 100 177 66 54 2337 109 102 0.0436 9.8 13950 2
0 0 0 116 203 72 57 3740 234 155 0.0414 14.7 34184 -1
0 1 1 103 184 68 52 3016 171 161 0.0534 12.4 15998 3
0 0 0 101 177 65 54 2765 164 121 0.0438 11.2 21105 0
0 1 0 90 169 65 52 2756 194 207 0.0751 13.8 34028 3
1 0 0 105 175 66 54 2700 134 72 0.0267 7.6 18344 0
0 0 0 108 187 68 57 3020 120 97 0.0321 12.4 11900 0
0 0 1 94 157 64 51 1967 90 68 0.0346 7.6 6229 1
0 1 0 95 169 64 53 2265 98 112 0.0494 9.0 9298 1
1 0 0 96 166 64 53 2275 110 56 0.0246 6.9 7898 0
0 1 0 100 177 66 53 2507 136 110 0.0439 12.4 15250 2
0 1 1 94 157 64 51 1876 90 68 0.0362 6.4 5572 1
0 0 0 95 170 64 54 2024 97 69 0.0341 7.6 7349 1
0 1 1 95 171 66 52 2823 152 154 0.0546 12.4 16500 1
0 0 0 103 175 65 60 2535 122 88 0.0347 9.8 8921 -1
0 0 0 113 200 70 53 4066 258 176 0.0433 15.7 32250 0
0 0 0 95 165 64 55 1938 97 69 0.0356 7.6 6849 1
1 0 0 97 172 66 56 2319 97 68 0.0293 6.4 9495 2
0 0 0 97 172 66 56 2275 109 85 0.0374 8.7 8495 2"), header = T)
and this
Code
library(plsdepot)
df.plsdepot = plsreg1(df[, 1:11], df[, 14, drop = FALSE], comps = 3)
data<-df.plsdepot$cor.xyt
data<-as.data.frame(data)
I got this data.frame of the correlation between variables and components
data
# t1 t2 t3
#diesel -0.23513860 -0.38154681 0.439221649
#twodoors 0.71849247 0.45622386 0.055982798
#sportsstyle 0.51909329 -0.02381952 -0.672617464
#wheelbase -0.86843937 0.34114664 -0.254589548
#length -0.75311884 0.62404991 -0.085596033
#width -0.67444970 0.62282146 -0.158675019
#height -0.67228557 -0.14675385 0.317166599
#curbweight -0.59305898 0.73532560 -0.241983833
#enginesize -0.39475651 0.82353941 -0.252270394
#horsepower 0.04843256 0.96637015 -0.148407288
#horse_per_weight 0.50515322 0.81502376 -0.006045151
#symboling 0.64900253 0.23673633 0.346902434
and I managed to plot them as below
library(plsdepot)
df.plsdepot = plsreg1(df[, 1:11], df[, 14, drop = FALSE], comps = 3)
plot(df.plsdepot, comps = c(1, 2))
I had to use pls package instead of plsdepot. I need to get the correlations between variables and components and plot them
Using pls, I managed to plot the correlation between variables and components as below
library(pls)
Y <- as.matrix(df[,14])
X <- as.matrix(df[,1:11])
df.pls <- mvr(Y ~ X, ncomp = 3, method = "oscorespls", scale = T)
plot(df.pls, "correlation")
However, I couldn't find a way to get these values (correlation between variables and components) and convert them to data.frame using pls package.
Any help how can I get these correlation values using pls package will be highly appreciated?
Thanks to Bjørn-Helge Mevik (the maintainer of pls package), for his answer below
==========================================================================
If you look at the corrplot code:
> corrplot
function (object, comps = 1:2, labels, radii = c(sqrt(1/2), 1),
identify = FALSE, type = "p", xlab, ylab, ...) {
nComps <- length(comps)
if (nComps < 2)
stop("At least two components must be selected.")
if (is.matrix(object)) {
cl <- object[, comps, drop = FALSE]
varlab <- colnames(cl)
}
else {
S <- scores(object)[, comps, drop = FALSE]
if (is.null(S))
stop("`", deparse(substitute(object)), "' has no scores.")
cl <- cor(model.matrix(object), S)
varlab <- compnames(object, comps, explvar = TRUE)
}
you will see that it basically does
S <- scores(object)[, comps, drop = FALSE]
cl <- cor(model.matrix(object), S)
to calculate the correlation loadings. Using df.pls in place of object should give you a matrix of correlation loadings.
S <- scores(df.pls)[, comps= 1:2, drop = FALSE]
cl <- cor(model.matrix(df.pls), S)
df.cor <- as.data.frame(cl)
df.cor
# Comp 1 Comp 2
#diesel -0.23513860 -0.38154681
#twodoors 0.71849247 0.45622386
#sportsstyle 0.51909329 -0.02381952
#wheelbase -0.86843937 0.34114664
#length -0.75311884 0.62404991
#width -0.67444970 0.62282146
#height -0.67228557 -0.14675385
#curbweight -0.59305898 0.73532560
#enginesize -0.39475651 0.82353941
#horsepower 0.04843256 0.96637015
#horse_per_weight 0.50515322 0.81502376

Removing 0 Vaules from Subgroups in R studio

Hello I am new to R and am trying to take out row with an ==0 value .
I really am new and might be making simple mistake but I can't seem to figure it out
This is what i've tried
Simplecount <- na.omit[,Simple$Counts >=1,]
object of type 'closure' is not subsettable
The data table below is called Simple
row.names Time INT Counts
19 234 13703 4 0
20 235 13803 4 0
21 236 13903 4 0
22 237 14104 5 1
23 238 14204 5 0
61 276 18403 6 0
62 277 18503 7 1
63 278 18604 7 0
64 279 18704 7 0
You have a misplaced comma in your code.
Try:
> simple[simple$Counts >= 1, ]
row.names Time INT Counts
22 237 14104 5 1
62 277 18503 7 1
Or, in this particular case, even the following would work:
simple[as.logical(simple$Counts), ]

How to remove rows with 0 values using R

Hi am using a matrix of gene expression, frag counts to calculate differentially expressed genes. I would like to know how to remove the rows which have values as 0. Then my data set will be compact and less spurious results will be given for the downstream analysis I do using this matrix.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
As of now I only want to remove those rows where all the frag count columns are 0 if in any row some values are 0 and others are non zero I would like to keep that row intact as you can see my example above.
Please let me know how to do this.
df[apply(df[,-1], 1, function(x) !all(x==0)),]
A lot of options to do this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
my preferred option is using rowwise()
library(tidyverse)
df <- df %>%
rowwise() %>%
filter(sum(c(col1,col2,col3)) != 0)

Resources