How to combine multiple rows and check if variables match - r

I am quite new to R, and I was wondering if there is a simple solution for my situation.
I have a data set in which each row appears three times (3 duplicates):
A
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
ASEC.r tr_in tr_out
2406 22234 0 0
2406.1 22234 0 0
2406.2 22234 0 0
4037 20547 0 0
4037.1 20547 0 0
4037.2 20547 0 0
And another data set that looks like this. The second data set consists of columns that are a subset of the columns in the first data set.
B
BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
1 150 0 0 0 0 0 0 0 0 0 151 0
2 150 426 422 205 0 0 0 0 0 0 151 201
3 150 4201 4203 239 0 0 0 0 0 0 151 201
Is there a way I can cbind these two data sets?
I tried replicating the second data set and using cbind(A, B), but the result was shown only as a "Large matrix" that I couldn't inspect.
Is there a way I can compare the first data set with the second to check whether they match?
That is why I tried to column-bind them, but is there a simpler solution?
========Edited=========
What I would like to create is
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
ASEC.r tr_in tr_out BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out match
2406 22234 0 0 150 0 0 0 0 0 0 0 0 0 151 0 1
2406.1 22234 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
2406.2 22234 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
4037 20547 0 0 150 0 0 0 0 0 0 0 0 0 151 0 1
4037.1 20547 0 0 150 426 422 205 0 0 0 0 0 0 151 201 0
4037.2 20547 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
So when I compare data sets A and B, I want to add a new column to A that is 1 if the rows match and 0 if they don't.

I think there might be holes in the logic here, but I'll state some assumptions:
nrow(A) is always an integer-multiple of nrow(B); this means that A[1,] pairs only with B[1,], A[2,] with B[2,], ..., A[4,] with B[1,], A[5,] with B[2,], etc.
the test of comparison is "equality of in-common columns"
If those are true, then
incommon <- intersect(colnames(A), colnames(B))
incommon
# [1] "BSTN" "ASTN" "ASTN1" "BSTN2" "ASTN2" "BSTN3" "ASTN3" "BSTN4" "ASTN4" "BSTN5"
# [11] "tr_in" "tr_out"
Bplus <- do.call(rbind.data.frame,
                 c(replicate(nrow(A) / nrow(B), B, simplify = FALSE),
                   list(stringsAsFactors = FALSE)))
Bplus
# BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
# 1 150 0 0 0 0 0 0 0 0 0 151 0
# 2 150 426 422 205 0 0 0 0 0 0 151 201
# 3 150 4201 4203 239 0 0 0 0 0 0 151 201
# 11 150 0 0 0 0 0 0 0 0 0 151 0
# 21 150 426 422 205 0 0 0 0 0 0 151 201
# 31 150 4201 4203 239 0 0 0 0 0 0 151 201
A$match <- +(rowSums(A[,incommon] == Bplus[,incommon]) == length(incommon))
A
# CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2
# 2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# 4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# 4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out match
# 2406 0 0 0 0 0 0 0 0 151 22234 0 0 1
# 2406.1 0 0 0 0 0 0 0 0 151 22234 0 0 0
# 2406.2 0 0 0 0 0 0 0 0 151 22234 0 0 0
# 4037 0 0 0 0 0 0 0 0 151 20547 0 0 1
# 4037.1 0 0 0 0 0 0 0 0 151 20547 0 0 0
# 4037.2 0 0 0 0 0 0 0 0 151 20547 0 0 0
The use of +(...) is a trick to convert logical to integer 0 and 1. It is just as easy to keep $match as a logical field by removing that portion of the assignment. (I only used it because you had that in your intended output. I prefer logical for my own use, since 0/1 implies ordinality and perhaps that values other than 0 and 1 can occur. In a declarative sense, logical clearly states that you expect only FALSE and TRUE, and possibly NA when the result is indeterminate.)
Also, the rowSums(...) == length(incommon) checks that all of the in-common fields are identical. Another way to calculate it is
apply(A[,incommon] == Bplus[,incommon], 1, all)
which might be more intuitive and/or declarative. The choice between them is mostly a matter of preference and a little a matter of performance; the rowSums method is slightly faster than the apply method.
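To see the equivalence concretely, here is a tiny standalone illustration (a toy logical matrix, not the real A/Bplus comparison) showing that `rowSums(m) == ncol(m)` and `apply(m, 1, all)` agree:

```r
# Toy logical matrix standing in for A[, incommon] == Bplus[, incommon]
m <- matrix(c(TRUE,  TRUE,
              TRUE,  FALSE,
              FALSE, FALSE),
            nrow = 3, byrow = TRUE)

rs <- rowSums(m) == ncol(m)  # a row is all-TRUE iff its sum equals ncol
ap <- apply(m, 1, all)       # the same condition, tested row by row

identical(rs, ap)
# [1] TRUE
```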
Data
A <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0")
B <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
1 150 0 0 0 0 0 0 0 0 0 151 0
2 150 426 422 205 0 0 0 0 0 0 151 201
3 150 4201 4203 239 0 0 0 0 0 0 151 201")

Is there a way I can compare the first data set and the second data set to check if they match?
Code:
library('data.table')
col_nm <- names(df2)[names(df2) %in% names(df1)]
setDT(df1)[df2, on = col_nm, nomatch = 0]
Output:
# CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out trips
# 1: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 2: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 3: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 4: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
# 5: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
# 6: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
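Note that this join keeps only the rows of df1 that have a match in df2. If instead you want every df1 row kept, with a 0/1 indicator as in the edited question, one hedged base-R sketch is a left merge against df2 with a marker column. The toy data below are illustrative stand-ins, and this matches rows by the values of the shared key columns (not positionally, as the replicate-and-compare answer does), which assumes that is the comparison you want:

```r
# Toy stand-ins for df1/df2; only the shared key columns matter here
df1 <- data.frame(BSTN = c(150, 150, 150), ASTN1 = c(0, 426, 999))
df2 <- data.frame(BSTN = c(150, 150),      ASTN1 = c(0, 426))

key <- intersect(names(df1), names(df2))
lookup <- unique(df2[key])   # one row per distinct key combination in df2
lookup$match <- 1L

out <- merge(df1, lookup, by = key, all.x = TRUE)  # left join keeps all df1 rows
out$match[is.na(out$match)] <- 0L                  # no df2 partner -> 0
```

Here two of the three df1 rows find a df2 partner, so `sum(out$match)` is 2. Be aware that `merge()` may reorder rows by the key columns.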
Data:
df1 <- read.table(text = 'CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out trips
1 150 0 0 0 0 0 0 0 0 0 151 0 1143
2 150 426 422 205 0 0 0 0 0 0 151 201 2
3 150 4201 4203 239 0 0 0 0 0 0 151 201 2', header = TRUE, stringsAsFactors = FALSE)

Related

Rstudio error: the table must the same classes in the same order

I keep getting this error: the table must the same classes in the same order
when implementing KNN and confusion matrix to get the accuracy
df_train <- df_trimmed_n[1:10,]
df_test <- df_trimmed_n[11:20,]
df_train_labels <- df_trimmed[1:10, 1]
df_test_labels <- df_trimmed[11:20, 1]
library(class)
library(caret)
df_knn<-knn(df_train,df_test,cl=df_train_labels,k=10)
confusionMatrix(table(df_knn,df_test_labels))
Error in confusionMatrix.table(table(df_knn, df_test_labels)) :
the table must the same classes in the same order
> print(df_knn)
[1] 28134 5138 4820 3846 1216 1885 1885 22021 5138 15294
Levels: 106 1216 1885 3846 4820 5138 15294 22021 22445 28134
> print(df_test_labels)
[1] 33262 6459 5067 7395 22720 1217 3739 84 16219 17819
> table(df_knn,df_test_labels)
df_test_labels
df_knn 84 1217 3739 5067 6459 7395 16219 17819 22720 33262
106 0 0 0 0 0 0 0 0 0 0
1216 0 0 0 0 0 0 0 0 1 0
1885 0 1 1 0 0 0 0 0 0 0
3846 0 0 0 0 0 1 0 0 0 0
4820 0 0 0 1 0 0 0 0 0 0
5138 0 0 0 0 1 0 1 0 0 0
15294 0 0 0 0 0 0 0 1 0 0
22021 1 0 0 0 0 0 0 0 0 0
22445 0 0 0 0 0 0 0 0 0 0
28134 0 0 0 0 0 0 0 0 0 1
Even though both the kNN predictions and the test labels have the same length (10 rows each), I'm not sure what is wrong with the classes and their order.
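A likely cause: the table above is not square, because `df_knn` and `df_test_labels` carry entirely different value sets, so `confusionMatrix()` refuses it. A minimal base-R sketch (toy labels, not the original data) of the usual fix, refactoring both vectors over the union of their levels before tabulating:

```r
pred  <- factor(c("28134", "5138"), levels = c("28134", "5138", "106"))
truth <- c("33262", "5138")          # plain vector with different values

lv     <- sort(union(levels(pred), unique(truth)))
pred2  <- factor(pred,  levels = lv)
truth2 <- factor(truth, levels = lv)

tab <- table(pred2, truth2)  # now square, same levels in the same order
```

With aligned levels `confusionMatrix(tab)` can run; note, though, that predictions sharing almost no values with the truth labels (as in the post) usually signals a deeper problem, e.g. the label vector not lining up with the feature rows.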

R: LS Means Analysis produces NAs?

I am running a linear model regression analysis script, and when I run emmeans (LS means) on my model I get a whole lot of NAs; I am not sure why. Here is what I have run:
setwd("C:/Users/wkmus/Desktop/R-Stuff")
### yeild-twt
ASM_Data<-read.csv("ASM_FIELD_18_SUMM_wm.csv",header=TRUE, na.strings=".")
head(ASM_Data)
str(ASM_Data)
####"NA" values in table are labeled as "." colored orange
ASM_Data$REP <- as.factor(ASM_Data$REP)
head(ASM_Data$REP)
ASM_Data$ENTRY_NO <-as.factor(ASM_Data$ENTRY_NO)
head(ASM_Data$ENTRY_NO)
ASM_Data$RANGE<-as.factor(ASM_Data$RANGE)
head(ASM_Data$RANGE)
ASM_Data$PLOT_ID<-as.factor(ASM_Data$PLOT_ID)
head(ASM_Data$PLOT_ID)
ASM_Data$PLOT<-as.factor(ASM_Data$PLOT)
head(ASM_Data$PLOT)
ASM_Data$ROW<-as.factor(ASM_Data$ROW)
head(ASM_Data$ROW)
ASM_Data$REP <- as.numeric(as.character(ASM_Data$REP))
head(ASM_Data$REP)
ASM_Data$TWT_g.li <- as.numeric(as.character(ASM_Data$TWT_g.li))
ASM_Data$Yield_kg.ha <- as.numeric(as.character(ASM_Data$Yield_kg.ha))
ASM_Data$PhysMat_Julian <- as.numeric(as.character(ASM_Data$PhysMat_Julian))
ASM_Data$flowering <- as.numeric(as.character(ASM_Data$flowering))
ASM_Data$height <- as.numeric(as.character(ASM_Data$height))
ASM_Data$CLEAN.WT <- as.numeric(as.character(ASM_Data$CLEAN.WT))
ASM_Data$GRAV.TEST.WEIGHT <-as.numeric(as.character(ASM_Data$GRAV.TEST.WEIGHT))
str(ASM_Data)
library(lme4)
#library(lsmeans)
library(emmeans)
Here is the data frame:
> str(ASM_Data)
'data.frame': 270 obs. of 20 variables:
$ TRIAL_ID : Factor w/ 1 level "18ASM_OvOv": 1 1 1 1 1 1 1 1 1 1 ...
$ PLOT_ID : Factor w/ 270 levels "18ASM_OvOv_002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLOT : Factor w/ 270 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ROW : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ RANGE : Factor w/ 15 levels "1","2","3","4",..: 2 3 4 5 6 7 8 9 10 12 ...
$ REP : num 1 1 1 1 1 1 1 1 1 1 ...
$ MP : int 1 1 1 1 1 1 1 1 1 1 ...
$ SUB.PLOT : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 2 2 2 2 2 3 ...
$ ENTRY_NO : Factor w/ 139 levels "840","850","851",..: 116 82 87 134 77 120 34 62 48 136 ...
$ height : num 74 70 73 80 70 73 75 68 65 68 ...
$ flowering : num 133 133 134 134 133 131 133 137 134 132 ...
$ CLEAN.WT : num 1072 929 952 1149 1014 ...
$ GRAV.TEST.WEIGHT : num 349 309 332 340 325 ...
$ TWT_g.li : num 699 618 663 681 650 684 673 641 585 646 ...
$ Yield_kg.ha : num 2073 1797 1841 2222 1961 ...
$ Chaff.Color : Factor w/ 3 levels "Bronze","Mixed",..: 1 3 3 1 1 1 1 3 1 3 ...
$ CHAFF_COLOR_SCALE: int 2 1 1 2 2 2 2 1 2 1 ...
$ PhysMat : Factor w/ 3 levels "6/12/2018","6/13/2018",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PhysMat_Julian : num 163 163 163 163 163 163 163 163 163 163 ...
$ PEDIGREE : Factor w/ 1 level "OVERLEY/OVERLAND": 1 1 1 1 1 1 1 1 1 1 ...
This is the head of ASM Data:
head(ASM_Data)
`TRIAL_ID PLOT_ID PLOT ROW RANGE REP MP SUB.PLOT ENTRY_NO height flowering CLEAN.WT GRAV.TEST.WEIGHT TWT_g.li`
1 18ASM_OvOv 18ASM_OvOv_002 2 1 2 1 1 A 965 74 133 1071.5 349.37 699
2 18ASM_OvOv 18ASM_OvOv_003 3 1 3 1 1 A 931 70 133 928.8 309.13 618
3 18ASM_OvOv 18ASM_OvOv_004 4 1 4 1 1 A 936 73 134 951.8 331.70 663
4 18ASM_OvOv 18ASM_OvOv_005 5 1 5 1 1 A 983 80 134 1148.6 340.47 681
5 18ASM_OvOv 18ASM_OvOv_006 6 1 6 1 1 B 926 70 133 1014.0 324.95 650
6 18ASM_OvOv 18ASM_OvOv_007 7 1 7 1 1 B 969 73 131 1076.6 342.09 684
Yield_kg.ha Chaff.Color CHAFF_COLOR_SCALE PhysMat PhysMat_Julian PEDIGREE
1 2073 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
2 1797 White 1 6/12/2018 163 OVERLEY/OVERLAND
3 1841 White 1 6/12/2018 163 OVERLEY/OVERLAND
4 2222 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
5 1961 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
6 2082 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
I am looking at a linear model dealing with test weight.
This is what I ran:
ASM_Data$TWT_g.li <- as.numeric(as.character((ASM_Data$TWT_g.li)))
head(ASM_Data$TWT_g.li)
ASM_YIELD_1 <- lm(TWT_g.li~ENTRY_NO + REP + SUB.BLOCK, data=ASM_Data)
anova(ASM_YIELD_1)
summary(ASM_YIELD_1)
emmeans(ASM_YIELD_1, "ENTRY_NO") ###########ADJ. MEANS
I get an output for anova
anova(ASM_YIELD_1)
Analysis of Variance Table
Response: TWT_g.li
Df Sum Sq Mean Sq F value Pr(>F)
ENTRY_NO 138 217949 1579 7.0339 < 2e-16 ***
REP 1 66410 66410 295.7683 < 2e-16 ***
SUB.BLOCK 4 1917 479 2.1348 0.08035 .
Residuals 125 28067 225
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but for emmeans I get something like this:
ENTRY_NO emmean SE df asymp.LCL asymp.UCL
840 nonEst NA NA NA NA
850 nonEst NA NA NA NA
851 nonEst NA NA NA NA
852 nonEst NA NA NA NA
853 nonEst NA NA NA NA
854 nonEst NA NA NA NA
855 nonEst NA NA NA NA
857 nonEst NA NA NA NA
858 nonEst NA NA NA NA
859 nonEst NA NA NA NA
I do have missing values in my data, indicated by a "." (per na.strings), but that's the only thing I can think of that might be off.
When I run with(ASM_Data, table(ENTRY_NO, REP, SUB.BLOCK))
this is what I have:
with(ASM_Data, table(ENTRY_NO,REP,SUB.BLOCK))
, , SUB.BLOCK = A
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 0 0
862 0 0
863 1 0
864 0 0
865 1 0
866 1 0
867 0 0
868 0 0
869 1 0
870 1 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 1 0
880 0 0
881 0 0
882 0 0
883 0 0
884 0 0
885 1 0
886 0 0
887 1 0
888 1 0
889 1 0
890 0 0
891 1 0
892 0 0
893 0 0
894 0 0
895 0 0
896 1 0
897 0 0
898 0 0
899 0 0
900 1 0
901 1 0
902 0 0
903 0 0
904 1 0
905 1 0
906 0 0
907 1 0
908 1 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 0
914 0 0
915 0 0
916 1 0
917 0 0
918 0 0
919 1 0
920 0 0
921 0 0
922 0 0
923 1 0
924 0 0
925 0 0
926 0 0
927 1 0
928 0 0
929 0 0
930 0 0
931 1 0
932 0 0
933 0 0
934 0 0
935 0 0
936 1 0
937 0 0
938 1 0
939 1 0
940 0 0
941 1 0
942 0 0
943 1 0
944 0 0
945 0 0
946 0 0
947 0 0
948 1 0
949 0 0
950 1 0
951 0 0
952 0 0
953 0 0
954 0 0
955 1 0
956 1 0
957 1 0
958 1 0
959 0 0
960 0 0
961 0 0
962 0 0
963 0 0
964 0 0
965 1 0
966 0 0
967 1 0
968 0 0
969 0 0
970 1 0
971 0 0
972 0 0
973 0 0
974 1 0
975 0 0
976 0 0
977 0 0
978 1 0
979 0 0
980 0 0
981 0 0
982 0 0
983 1 0
984 1 0
985 0 0
986 1 0
987 3 0
988 0 0
, , SUB.BLOCK = B
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 1 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 1 0
862 0 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 1 0
872 0 0
873 0 0
874 0 0
875 0 0
876 1 0
877 1 0
878 1 0
879 0 0
880 1 0
881 0 0
882 1 0
883 1 0
884 1 0
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 1 0
891 0 0
892 1 0
893 1 0
894 1 0
895 1 0
896 0 0
897 1 0
898 0 0
899 0 0
900 0 0
901 0 0
902 1 0
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 1 0
910 0 0
911 1 0
912 0 0
913 1 0
914 0 0
915 0 0
916 0 0
917 0 0
918 0 0
919 0 0
920 1 0
921 1 0
922 0 0
923 0 0
924 0 0
925 1 0
926 1 0
927 0 0
928 0 0
929 0 0
930 1 0
931 0 0
932 1 0
933 0 0
934 1 0
935 0 0
936 0 0
937 1 0
938 0 0
939 0 0
940 1 0
941 0 0
942 0 0
943 0 0
944 0 0
945 1 0
946 0 0
947 1 0
948 0 0
949 0 0
950 0 0
951 1 0
952 0 0
953 0 0
954 1 0
955 0 0
956 0 0
957 0 0
958 0 0
959 1 0
960 0 0
961 0 0
962 1 0
963 0 0
964 0 0
965 0 0
966 0 0
967 0 0
968 0 0
969 1 0
970 0 0
971 0 0
972 0 0
973 0 0
974 0 0
975 0 0
976 1 0
977 1 0
978 0 0
979 0 0
980 0 0
981 1 0
982 1 0
983 0 0
984 0 0
985 3 0
986 0 0
987 1 0
988 1 0
, , SUB.BLOCK = C
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 1 0
858 0 0
859 1 0
860 0 0
861 0 0
862 1 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 1 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 0 0
880 0 0
881 1 0
882 0 0
883 0 0
884 0 0
885 0 0
886 1 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 0
893 0 0
894 0 0
895 0 0
896 0 0
897 0 0
898 1 0
899 1 0
900 0 0
901 0 0
902 0 0
903 1 0
904 0 0
905 0 0
906 1 0
907 0 0
908 0 0
909 0 0
910 1 0
911 0 0
912 1 0
913 0 0
914 1 0
915 1 0
916 0 0
917 1 0
918 1 0
919 0 0
920 0 0
921 0 0
922 1 0
923 0 0
924 1 0
925 0 0
926 0 0
927 0 0
928 1 0
929 1 0
930 0 0
931 0 0
932 0 0
933 1 0
934 0 0
935 1 0
936 0 0
937 0 0
938 0 0
939 0 0
940 0 0
941 0 0
942 1 0
943 0 0
944 1 0
945 0 0
946 1 0
947 0 0
948 0 0
949 1 0
950 0 0
951 0 0
952 1 0
953 1 0
954 0 0
955 0 0
956 0 0
957 0 0
958 0 0
959 0 0
960 1 0
961 1 0
962 0 0
963 1 0
964 1 0
965 0 0
966 1 0
967 0 0
968 1 0
969 0 0
970 0 0
971 1 0
972 1 0
973 1 0
974 0 0
975 1 0
976 0 0
977 0 0
978 1 0
979 2 0
980 0 0
981 0 0
982 0 0
983 0 0
984 0 0
985 1 0
986 3 0
987 0 0
988 0 0
, , SUB.BLOCK = D
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 1
853 0 0
854 0 0
855 0 0
857 0 0
858 0 1
859 0 0
860 0 1
861 0 0
862 0 0
863 0 0
864 0 1
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 1
876 0 0
877 0 0
878 0 1
879 0 0
880 0 1
881 0 1
882 0 1
883 0 1
884 0 1
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 1
893 0 0
894 0 0
895 0 0
896 0 0
897 0 1
898 0 0
899 0 1
900 0 0
901 0 0
902 0 1
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 1
914 0 1
915 0 1
916 0 0
917 0 1
918 0 1
919 0 0
920 0 0
921 0 1
922 0 1
923 0 0
924 0 0
925 0 0
926 0 0
927 0 0
928 0 0
929 0 1
930 0 1
931 0 0
932 0 0
Can someone please give me an idea of what is going wrong?
Thanks !
I've been able to create a situation like this. Consider this dataset:
> junk
trt rep blk y
1 A 1 1 -1.17415687
2 B 1 1 -0.20084854
3 C 1 1 0.64797806
4 A 1 2 -1.69371434
5 B 1 2 -0.35835442
6 C 1 2 1.35718782
7 A 1 3 0.20510482
8 B 1 3 1.00857651
9 C 1 3 -0.20553167
10 A 2 4 0.31261523
11 B 2 4 0.47989115
12 C 2 4 1.27574085
13 A 2 5 -0.79209520
14 B 2 5 1.07151315
15 C 2 5 -0.04222769
16 A 2 6 -0.80571767
17 B 2 6 0.80442988
18 C 2 6 1.73526561
This has 6 complete blocks, separately labeled with 3 blocks per rep. Not obvious, but true, is that rep is a numeric variable having values 1 and 2, while blk is a factor having 6 levels 1 -- 6:
> sapply(junk, class)
trt rep blk y
"factor" "numeric" "factor" "numeric"
With this complete dataset, I have no problem obtaining EMMs for modeling situations parallel to what was used in the original posting. However, if I use only a subset of these data, it is different. Consider:
> samp
[1] 1 2 3 5 8 11 13 15 16
> junk.lm = lm(y ~ trt + rep + blk, data = junk, subset = samp)
> emmeans(junk.lm, "trt")
trt emmean SE df asymp.LCL asymp.UCL
A nonEst NA NA NA NA
B nonEst NA NA NA NA
C nonEst NA NA NA NA
Results are averaged over the levels of: blk
Confidence level used: 0.95
Again, recall that rep is numeric in this model. If instead, I make rep a factor:
> junk.lmf = lm(y ~ trt + factor(rep) + blk, data = junk, subset = samp)
> emmeans(junk.lmf, "trt")
NOTE: A nesting structure was detected in the fitted model:
blk %in% rep
If this is incorrect, re-run or update with `nesting` specified
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk, rep
Confidence level used: 0.95
We get non-NA estimates, in part because it is able to detect the fact that blk is nested in rep, and thus performs the EMM computations separately in each rep. Note in the annotations in this last output that averaging is done over the 2 reps and 6 blocks; whereas in junk.lm averaging is done only over blocks, while rep, a numeric variable, is set at its average. Compare the reference grids for the two models:
> ref_grid(junk.lm)
'emmGrid' object with variables:
trt = A, B, C
rep = 1.4444
blk = 1, 2, 3, 4, 5, 6
> ref_grid(junk.lmf)
'emmGrid' object with variables:
trt = A, B, C
rep = 1, 2
blk = 1, 2, 3, 4, 5, 6
Nesting structure: blk %in% rep
An additional option is to avoid the nesting issue by simply omitting rep from the model:
> junk.lm.norep = lm(y ~ trt + blk, data = junk, subset = samp)
> emmeans(junk.lm.norep, "trt")
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk
Confidence level used: 0.95
Note that exactly the same results are produced. The reason is the levels of blk already predict the levels of rep, so there is no need for it to be in the model.
In summary:
The situation is due in part to the fact that there are missing data
and in part because rep was in the model as a numeric predictor rather than a factor.
In your situation, I suggest re-fitting the model with factor(REP) instead of REP as a numeric predictor. This may be enough to produce estimates.
If, indeed, as in my example, the SUB.BLOCK levels predict the REP levels, just leave REP out of the model altogether.
EMMs are obtained by averaging predictions over 2 reps and 5 blocks (or maybe more?). Look at
coef(ASM_YIELD_1)
If any of the rep or block effects are NA, then you can’t estimate all of the rep or block effects, and that makes the average of them non-estimable.
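As a self-contained illustration of that point (toy data, not ASM_Data): when one blocking factor is completely determined by another, lm() aliases one of the effects to NA, and any average involving that effect becomes non-estimable:

```r
set.seed(1)
d <- data.frame(y   = rnorm(6),
                trt = factor(rep(c("A", "B"), 3)),
                blk = factor(rep(1:3, each = 2)))
d$rep2 <- factor(ifelse(d$blk == "1", 1, 2))  # rep2 is a function of blk

# rep2's dummy column equals blk2 + blk3, so the design is rank-deficient
fit <- lm(y ~ trt + rep2 + blk, data = d)
any(is.na(coef(fit)))
# [1] TRUE  (the aliased block effect shows up as NA)
```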
You can see exactly which factor combinations are non-estimable by doing:
summary(ref_grid(ASM_YIELD_1))
addendum
Here is a reformatting of the tables I requested in comments:
ENTRY ---------- BLOCK -------------
NO A B C D
840 0 0 0 0 0 0 0 0
850 0 0 0 0 0 0 0 0
851 0 0 0 0 0 0 0 0
852 0 0 0 0 0 0 0 1
853 0 0 1 0 0 0 0 0
854 0 0 0 0 0 0 0 0
855 0 0 0 0 0 0 0 0
857 0 0 0 0 1 0 0 0
858 0 0 0 0 0 0 0 1
859 0 0 0 0 1 0 0 0
... etc ...
This is extremely sparse data. I think there are two more blocks not shown. But I see very few instances where a given ENTRY_NO is observed in more than one rep or block. So I think it is seriously over-fitting to try to account for rep or block effects in this model.
MAYBE omitting REP from the model will make it work. MAYBE re-fitting the model with factor(REP) in place of REP will enable emmeans to detect a nesting structure. Otherwise, there's some really subtle dependence in the blocking structure and treatments, and I don't know what to suggest.

Error Messages in R for MaxEnt model

I am having some trouble running MaxEnt in R. I keep getting two error messages:
1) Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1, 2, 0
and
2) Warning message:
In mean.default(sapply(e, function(x) { :
argument is not numeric or logical: returning NA
I'm somewhat new to R and am not sure what these error messages mean. The columns all have the same number of rows and there is no non-numeric data. Any help would be greatly appreciated. I have included the data and the R script below. Thanks for your time!
#MaxEnt (Maximum Entropy) Modeling is a species distribution model and machine-learning technique
#The package "dismo" calls in the java program.
#The package "maxnet" is an R-based Maxent implementation that uses glmnet with regularization to approximate the fit of maxent.jar
#The packages "raster" and "sp" are also required to run the model.
library(dismo)
library(maxnet)
library(raster)
library(sp)
#Load the Data
KCStatic = read.csv(file="D:/KCs3.csv", row.names="ID")
#Need to partition 'training dataset' and 'validation dataset'
KCwoT.Tng<-subset(KCStatic, Valid==0)
KCwoT.Val<-subset(KCStatic, Valid==1)
pvtest<-KCwoT.Val[KCwoT.Val[,1] == 1, 2:7]
avtest<-KCwoT.Val[KCwoT.Val[,1] == 0, 2:7]
#MaxEnt Model
KCwoT.ME.x<-KCwoT.Tng[,2:7]
KCwoT.ME.p<-KCwoT.Tng[,1]
KCwoT.ME<-maxent(KCwoT.ME.x,KCwoT.ME.p)
KCwoT.ME
KCwoT.ME.testp <- predict(KCwoT.ME, pvtest)
KCwoT.ME.testa <- predict(KCwoT.ME, avtest)
KCwoT.ME.eval <- evaluate(p=KCwoT.ME.testp, a=KCwoT.ME.testa)
KCwoT.ME.eval
#10 fold x validation:
xVal<- subset(KCwoT.Tng, select = c(ag, cent, jor, kliv, traw, silt))
k <- 10
group <- kfold(xVal, k)
e <- list()
for (i in 1:k) {
train <- xVal[group != i,]
test <- xVal[group == i,]
trainx<- train[,2:length(xVal)]
trainp<- train[,1]
me <- maxent(trainx,trainp)
testxp<-test[test[,1] == 1,2:length(xVal)]
testxa<-test[test[,1] == 0,2:length(xVal)]
testp <- predict(me, testxp)
testa <- predict(me, testxa)
e[[i]] <- evaluate(p=testp, a=testa)
}
mean(sapply( e, function(x){slot(x, 'auc')} ))
median(sapply( e, function(x){slot(x, 'auc')} ))
mean(sapply( e, function(x){slot(x, 'cor')} ))
median(sapply( e, function(x){slot(x, 'cor')} ))
KCwoT.ME.eval@TPR[match(threshold(KCwoT.ME.eval)[2], KCwoT.ME.eval@t)]
KCwoT.ME.eval@TNR[match(threshold(KCwoT.ME.eval)[2], KCwoT.ME.eval@t)]
KCwoT.ME.eval@CCR[match(threshold(KCwoT.ME.eval)[2], KCwoT.ME.eval@t)]
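On error 1 ("arguments imply differing number of rows"): one plausible trigger is that a kfold split leaves a fold with no presence (or no absence) rows, so a zero-length prediction vector gets combined with non-empty ones inside data.frame(). A base-R sketch of how that error arises (illustrative values, not the real predictions):

```r
p <- c(0.91)      # predictions at presence points (length 1)
a <- numeric(0)   # a fold ended up with zero absence points

res <- try(data.frame(pres = p, abs = a), silent = TRUE)
inherits(res, "try-error")
# [1] TRUE  ("arguments imply differing number of rows: 1, 0")
```

Separately, it may be worth double-checking the cross-validation setup: `xVal` is built from only the six predictor columns (ag, cent, jor, kliv, traw, silt), yet `train[, 1]` is later used as the presence vector, so the loop is treating `ag` rather than `s3_Pa` as presence/absence.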
s3_Pa ag cent jor kliv traw silt ID Valid
0 2712.814007 5944.620219 7664 7545 0 0 3174 0
0 2732.815985 5646.817407 7527 7516 0 0 3221 0
0 5383.266705 5383.266705 7970 7970 0 0 3230 0
0 4857.46024 6259.344198 7726 5781 0 0 3300 0
0 8352.51324 8352.51324 7198 7198 0 0 3356 0
0 49378.96152 16984.2808 12172 7890 0 0 3415 0
0 4319.437464 4319.437464 6297 6297 0 0 3461 0
0 9444.516272 9444.516272 7394 7394 0 0 3552 0
0 3589.880714 3265.163078 6131 5188 0 0 3605 0
0 28749.74389 28749.74389 6466 6466 0 0 3653 0
0 6959.890193 4073.928736 5213 5412 0 0 3764 0
0 5118.247173 3272.811018 4705 4705 0 0 3829 0
0 2559.80507 3984.939677 4965 5422 0 0 3857 0
0 5189.140315 5189.140315 4864 4864 0 0 3903 0
0 2243.265175 2513.775258 5407 5285 0 0 3942 0
0 2798.840052 2798.840052 5284 5284 0 0 3943 0
0 3049.900227 3049.900227 5044 5044 0 0 4034 0
0 5314.032326 5314.032326 5336 5336 0 0 4049 0
0 2416.851993 2483.681392 4529 4204 0 0 4093 0
0 2527.316522 2527.316522 4898 4898 0 0 4199 0
0 2281.407824 2281.407824 4848 4848 0 0 4216 0
0 2257.802423 2257.802423 4873 4873 0 0 4285 0
0 2678.746278 2678.746278 5137 5137 0 0 4360 0
0 2319.915204 2319.915204 5138 5138 0 0 4362 0
0 3516.384174 3516.384174 4725 4725 0 0 4557 0
0 2218.583063 2218.583063 4464 4464 0 0 4583 0
0 7433.978369 6832.621571 3963 5527 0 0 4586 0
0 2604.437581 2604.437581 4565 4565 0 0 4630 0
0 2372.751422 3930.504765 4787 5560 0 0 4739 0
0 2516.984733 7087.776431 4219 5885 0 0 4818 0
0 2437.56414 2437.56414 4596 4596 0 0 4825 0
0 2440.659167 2440.659167 4556 4556 0 0 4933 0
0 2416.905821 2416.905821 4540 4540 0 0 4942 0
0 2428.521085 2428.521085 4867 4867 0 0 5121 0
0 2463.594566 2463.594566 5125 5125 0 0 5196 0
0 2487.539855 2487.539855 4803 4803 0 0 5249 0
0 3302.718352 3302.718352 4958 4958 0 0 5252 0
0 2605.906908 2605.906908 4846 4846 0 0 5332 0
0 2577.784698 2577.784698 5463 5463 0 0 5402 0
0 2764.191937 8861.087669 4848 6376 0 0 5494 0
0 2566.989482 2566.989482 4938 4938 0 0 5565 0
0 2592.787269 2592.787269 4889 4889 0 0 5626 0
0 2757.964558 2757.964558 5077 5077 0 0 5693 0
0 2620.543732 2620.543732 5216 5216 0 0 5769 0
0 5309.434576 5309.434576 5867 5867 0 0 5908 0
1 5921.125287 6881.922922 7736 7707 0 0 3217 0
1 2774.199514 21747.29759 7669 9197 0 0 3280 0
1 15495.78183 15495.78183 7826 7826 0 0 3307 0
1 2548.237657 4019.296229 7503 7421 0 0 3310 0
1 7666.402192 7666.402192 7501 7501 0 0 3342 0
1 4891.62472 4891.62472 7384 7384 0 0 3350 0
1 5042.36752 4343.180161 7456 7344 0 0 3373 0
1 5193.293844 5772.4049 7390 6359 0 0 3377 0
1 3534.172197 16711.2551 7446 7646 0 0 3423 0
1 14070.79994 14070.79994 7601 7601 0 0 3443 0
1 3255.725951 3255.725951 7345 7345 0 0 3450 0
1 4786.893258 4786.893258 6125 6125 0 0 3493 0
1 40210.85968 4484.216909 12105 5479 0 0 3517 0
1 4333.860544 4333.860544 7262 7262 0 0 3535 0
1 7795.332317 7795.332317 6679 6679 0 0 3542 0
1 3364.563525 3364.563525 6608 6608 0 0 3545 0
1 3303.389553 3303.389553 6879 6879 0 0 3547 0
1 3619.497747 3619.497747 6561 6561 0 0 3551 0
1 5450.874516 3356.834908 6633 5425 0 0 3570 0
1 2725.057799 2725.057799 6024 6024 0 0 3583 0
1 5691.763254 5691.763254 6377 6377 0 0 3602 0
1 3169.96849 3169.96849 5673 5673 0 0 3645 0
1 3275.840301 3250.607347 5876 5165 0 0 3660 0
1 2889.723967 2889.723967 6250 6250 0 0 3662 0
1 7669.776341 7669.776341 6345 6345 0 0 3686 0
1 3834.198391 2710.905485 5632 5238 0 0 3687 0
1 2460.824512 2626.740765 5489 5011 0 0 3714 0
1 2486.475314 5285.960072 5944 5571 0 0 3743 0
1 3044.274943 3780.675875 6001 5272 0 0 3779 0
1 2330.782119 2330.782119 5254 5254 0 0 3798 0
1 6918.421155 4441.631393 5837 5400 0 0 3807 0
1 2283.770553 2283.770553 5352 5352 0 0 3843 0
1 4017.235221 4017.235221 5268 5268 0 0 3897 0
1 7856.113523 7856.113523 5280 5280 0 0 3936 0
1 5723.619574 5723.619574 5414 5414 0 0 3985 0
1 3204.713159 2870.945921 5145 5082 0 0 3988 0
1 4486.528634 2810.032147 4970 4900 0 0 3992 0
1 4014.434161 2728.881238 5213 5069 0 0 4005 0
1 2752.918679 2752.918679 4704 4704 0 0 4007 0
1 2277.264207 2277.264207 4998 4998 0 0 4019 0
1 5711.280538 3392.648953 5123 4870 0 0 4020 0
1 3203.948015 3203.948015 4714 4714 0 0 4067 0
1 2767.113359 2767.113359 4886 4886 0 0 4091 0
1 2865.961261 2865.961261 4892 4892 0 0 4110 0
1 2911.739735 2911.739735 4780 4780 0 0 4116 0
1 2361.708077 2361.708077 4724 4724 0 0 4117 0
1 2286.082622 2360.683427 5355 5185 0 0 4118 0
1 2331.814226 2331.814226 5037 5037 0 0 4137 0
1 2308.986958 2308.986958 4682 4682 0 0 4140 0
1 2300.537289 2300.537289 4852 4852 0 0 4177 0
1 2321.675121 2321.675121 4638 4638 0 0 4238 0
1 2237.444686 2237.444686 5043 5043 0 0 4239 0
1 2390.690086 2390.690086 4395 4395 0 0 4241 0
1 2229.52996 2229.52996 5109 5109 0 0 4279 0
1 2520.728579 2520.728579 5119 5119 0 0 4283 0
1 2258.03747 2258.03747 5071 5071 0 0 4284 0
1 2278.607183 2278.607183 4785 4785 0 0 4298 0
1 2505.662096 2505.662096 4083 4083 0 0 4320 0
1 2225.789635 2225.789635 4880 4880 0 0 4331 0
1 2183.306088 2183.306088 4425 4425 0 0 4525 0
1 2263.787317 2263.787317 4964 4964 0 0 4540 0
1 2245.845341 2245.845341 4750 4750 0 0 4640 0
1 2283.423493 2283.423493 4806 4806 0 0 4662 0
1 2266.360563 2266.360563 4765 4765 0 0 4721 0
1 2260.095621 2260.095621 5038 5038 0 0 4732 0
1 2301.432888 2301.432888 4854 4854 0 0 4736 0
1 2329.630779 2329.630779 5708 5708 0 0 4762 0
1 2454.336191 3935.49151 4318 5562 0 0 4772 0
1 2361.297036 3624.034037 4331 5232 0 0 4779 0
1 2323.874056 2323.874056 4757 4757 0 0 4790 0
1 2382.420745 2382.420745 5349 5349 0 0 4798 0
1 2352.659926 2352.659926 4452 4452 0.0000846 0.020143027 4799 0
1 2409.321197 2409.321197 4475 4475 0 0 4815 0
1 2360.118466 2360.118466 4369 4369 0 0 4819 0
1 2339.601538 2339.601538 5518 5518 0 0 4861 0
1 2358.67411 2358.67411 4574 4574 0 0 4880 0
1 2386.410926 3718.035094 4393 5278 0 0 4894 0
1 2528.234053 2528.234053 4703 4703 0 0 4910 0
1 2291.851083 2291.851083 5101 5101 0 0 4925 0
1 2459.766511 2459.766511 4814 4814 0 0 4973 0
1 2490.775395 2490.775395 5044 5044 0 0 5025 0
1 2514.079723 5099.787095 4459 5380 0 0 5032 0
1 2427.873473 2427.873473 4754 4754 0 0 5037 0
1 2380.611838 2380.611838 5461 5461 0 0 5055 0
1 2511.565392 2511.565392 5059 5059 0 0 5056 0
1 3622.514274 3622.514274 4827 4827 0 0 5109 0
1 2475.631468 2475.631468 4908 4908 0 0 5118 0
1 2492.200769 2492.200769 4822 4822 0 0 5143 0
1 2509.438134 2509.438134 4628 4628 0 0 5185 0
1 2556.501335 2556.501335 4737 4737 0 0 5309 0
1 2548.650994 2548.650994 4802 4802 0 0 5316 0
1 2530.378885 2530.378885 4952 4952 0 0 5363 0
1 2528.110558 2551.811713 4629 4785 0 0 5392 0
1 2590.935464 2590.935464 4645 4645 0.000394667 0.093930809 5480 0
1 2631.168824 2958.393588 5187 5380 0 0 5521 0
1 2581.504472 2581.504472 5062 5062 0 0 5531 0
1 2575.585115 2575.585115 5185 5185 0 0 5538 0
1 2551.676567 2551.676567 4892 4892 0 0 5542 0
1 2569.698254 2569.698254 5053 5053 0 0 5557 0
1 2624.237765 2624.237765 4912 4912 0 0 5604 0
1 2614.385919 2614.385919 5301 5301 0 0 5640 0
1 2598.787723 2598.787723 5364 5364 0 0 5642 0
1 2578.060432 2578.060432 5090 5090 0 0 5656 0
1 4001.119207 5895.989889 4925 5693 0 0 5662 0
1 2623.749151 2623.749151 5440 5440 0 0 5673 0
1 2644.030557 2644.030557 5377 5377 0 0 5710 0
1 2669.872842 2669.872842 5177 5177 0 0 5734 0
1 3646.204271 3646.204271 5193 5193 0 0 5794 0
1 2618.429271 2618.429271 5035 5035 0 0 5815 0
1 2690.323195 2690.323195 4821 4821 0 0 5818 0
1 2633.516256 2633.516256 4956 4956 0 0 5883 0
1 2701.966232 2701.966232 5470 5470 0 0 5898 1
1 2946.4581 6141.938828 5043 5935 0 0 5919 1
1 2658.347761 2658.347761 5162 5162 0 0 5938 1
0 2424.833017 2424.833017 5726 5726 0 0 3737 1
0 2644.544075 2644.544075 4857 4857 0 0 3799 1
1 2410.138972 2280.102484 4816 4905 0 0 3974 1
1 2968.445375 2968.445375 4705 4705 0 0 4006 1
1 2267.857714 2267.857714 5088 5088 0 0 4330 1
1 2293.989337 2293.989337 5007 5007 0 0 4612 1
1 2308.364875 2281.548644 4560 4720 0 0 4922 1
1 2492.057156 2492.057156 4737 4737 0 0 5089 1
1 2566.653701 2566.653701 4989 4989 0 0 5478 1

How can I extract distinct elements from a matrix's rows in R?

I have a matrix, and I want to derive another matrix from it with the duplicated elements removed from each row.
This is the input matrix:
head(Data_Achat2)
ID_Achat 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
1 1349 433 405 451 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 4890 405 405 416 416 388 464 416 388 392 405 393 405 433 453 392 416 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 7881 405 384 390 395 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 8081 442 405 405 475 464 405 442 405 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 9465 457 417 416 391 441 441 392 441 401 441 432 388 395 466 464 399 475 466 464 481 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 10626 432 390 433 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In other words, for the second row I would like to get:
2 4890 405 416 388 464 392 393 433 453
Each row of the new matrix should contain only the distinct elements of the corresponding input row, and the result should still be a matrix (with 0 used to pad the missing values).
I would apply, row-wise, a function that retains only the m unique values and then pads that vector to length N with zeros, by appending N - m zeros to the unique values:
N <- ncol(Data_Achat2)
t(apply(Data_Achat2, 1, function(x) {
  uniques <- unique(x)
  c(uniques, rep(0, N - length(uniques)))
}))
Which results in:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] --- [,36] [,37]
1 1349 433 405 451 0 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
2 4890 405 416 388 464 392 393 433 453 0 0 0 0 0 0 0 0 --- 0 0
3 7881 405 384 390 395 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
4 8081 442 405 475 464 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
5 9465 457 417 416 391 441 392 401 432 388 395 466 464 399 475 481 0 --- 0 0
6 10626 432 390 433 0 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
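The same row-wise idea can also be written with duplicated() instead of unique(). A minimal self-contained sketch, using a small stand-in matrix since Data_Achat2 itself is not reproduced here:

```r
# Small stand-in for Data_Achat2 (the real data is not reproduced here)
m <- rbind(c(1349, 433, 405, 451, 0, 0),
           c(4890, 405, 405, 416, 416, 388))
N <- ncol(m)

dedup <- t(apply(m, 1, function(x) {
  u <- x[!duplicated(x)]       # keeps the first occurrence of each value
  c(u, rep(0, N - length(u)))  # pad with zeros back to N columns
}))
dedup
```

One subtlety of the zero-padding: if a row contains a real 0 among its values, only the first 0 is kept, so padding zeros and data zeros become indistinguishable in the result.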

Remove duplicate header lines in dataframe

My raw data contain numeric values, with the header lines repeated every 20 lines.
I would like to remove the repeated header lines in R. I know this is quite easy with a sed command, but I want the R script to handle every step of tidying the data.
> raw <- read.delim("./vmstat_archiveadm_s.txt")
> head(raw)
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100097600 97779056 285 426 53 0 0 0 367 86 6 0 0 1206 7711 2630 1 0 99
0 0 0 96908192 94414488 7 31 0 0 0 0 0 120 0 0 0 2782 5775 5042 2 0 97
0 0 0 96889840 94397152 0 1 0 0 0 0 0 122 0 0 0 2737 5591 4958 2 0 97
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100065744 97745448 282 422 52 0 0 0 363 89 6 0 0 1233 7690 2665 1 0 99
0 0 0 96725312 94222040 7 31 0 0 0 0 0 604 69 0 0 5269 5703 7910 2 1 97
0 0 0 96668624 94170784 0 0 0 0 0 0 0 155 53 0 0 3047 5505 5317 2 0 97
0 0 0 96595104 94086816 0 0 0 0 0 0 0 174 0 0 0 2879 5567 5068 2 0 97
1 0 0 96521376 94025504 0 0 0 0 0 0 0 121 0 0 0 2812 5471 5105 2 0 97
0 0 0 96503256 93994896 0 0 0 0 0 0 0 121 0 0 0 2731 5621 4981 2 0 97
(...)
Try this, where df is the data frame:
x <- seq(6, 100, 21)
df <- df[-x, ]
seq() generates a sequence of numbers from 6 to 100 in steps of 21, which in this case is:
6 27 48 69 90
Remove those rows from the data frame with:
df[-x, ]
EDIT:
To do this for the entire data frame, replace 100 with the number of rows, i.e.:
seq(6, nrow(df), 21)
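If the headers do not recur at a perfectly regular interval, a pattern-based filter is more robust than fixed offsets. A minimal sketch with readLines(), using an inline stand-in for the vmstat text (the file name and layout are taken from the question):

```r
# Stand-in for the raw vmstat text; in practice this would be
# readLines("./vmstat_archiveadm_s.txt")
txt <- c(" kthr      memory            page",
         " r b w   swap  free  re  mf pi po",
         "0 0 0 100097600 97779056 285 426 53 0",
         "0 0 0 96908192 94414488 7 31 0 0",
         " kthr      memory            page",
         " r b w   swap  free  re  mf pi po",
         "0 0 0 96725312 94222040 7 31 0 0")

# Keep only the lines that start with a digit (the data rows); this drops
# every header pair wherever it occurs, including the first one, so the
# column names would need to be supplied separately.
data_lines <- grep("^\\s*[0-9]", txt, value = TRUE)
```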
Instead of post-processing the output in R, I would clean it at the source:
$ vmstat 1 | egrep -v '^ kthr|^ r'
0 0 0 154831904 153906536 215 471 0 0 0 0 526 33 32 0 0 1834 14171 5253 0 0 99
1 0 0 154805632 153354296 9 32 0 0 0 0 0 0 0 0 0 1463 610 739 0 0 100
1 0 0 154805632 153354696 0 4 0 0 0 0 0 0 0 0 0 1408 425 634 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1341 381 658 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1299 353 610 0 0 100
1 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1319 375 638 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1308 367 614 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1336 395 650 0 0 100
1 0 0 154805632 153354640 0 0 0 0 0 0 0 44 44 0 0 1594 378 878 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 66 65 0 0 1763 382 1015 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1312 411 645 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1342 390 647 0 0 100