How can I extract distinct element from matrix's rows in R? - r

I have a matrix as shown and I want to extract from it an other matrix where without any duplicated element in each row.
This is the input matrix
head(Data_Achat2)
ID_Achat 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
1 1349 433 405 451 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 4890 405 405 416 416 388 464 416 388 392 405 393 405 433 453 392 416 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 7881 405 384 390 395 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 8081 442 405 405 475 464 405 442 405 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 9465 457 417 416 391 441 441 392 441 401 441 432 388 395 466 464 399 475 466 464 481 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 10626 432 390 433 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In other word I want to get for example for the second row like this:
2 4890 405 416 388 464 388 392 393 433 453
Then, each row of the new matrix has only distincts element of the input one and all of results is in matrix (which include also 0 values for missing values).

I would row-wise apply a function that only retains the m unique values and then "pad" that vector to a length N with zeros, by adding N - m zeros to the unique values:
N <- ncol(Data_Achat2)
t(apply(Data_Achat2, 1, function(x){
uniques <- unique(x)
return(c(uniques, rep(0, N-length(uniques))))
}))
Which results in:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] --- [,36] [,37]
1 1349 433 405 451 0 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
2 4890 405 416 388 464 392 393 433 453 0 0 0 0 0 0 0 0 --- 0 0
3 7881 405 384 390 395 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
4 8081 442 405 475 464 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0
5 9465 457 417 416 391 441 392 401 432 388 395 466 464 399 475 481 0 --- 0 0
6 10626 432 390 433 0 0 0 0 0 0 0 0 0 0 0 0 0 --- 0 0

Related

Rstudio error: the table must the same classes in the same order

I keep getting this error: the table must the same classes in the same order
when implementing KNN and confusion matrix to get the accuracy
df_train <- df_trimmed_n[1:10,]
df_test <- df_trimmed_n[11:20,]
df_train_labels <- df_trimmed[1:10, 1]
df_test_labels <- df_trimmed[11:20, 1]
library(class)
library(caret)
df_knn<-knn(df_train,df_test,cl=df_train_labels,k=10)
confusionMatrix(table(df_knn,df_test_labels))
Error in confusionMatrix.table(table(df_knn, df_test_labels)) :
the table must the same classes in the same order
> print(df_knn)
[1] 28134 5138 4820 3846 1216 1885 1885 22021 5138 15294
Levels: 106 1216 1885 3846 4820 5138 15294 22021 22445 28134
> print(df_test_labels)
[1] 33262 6459 5067 7395 22720 1217 3739 84 16219 17819
> table(df_knn,df_test_labels)
df_test_labels
df_knn 84 1217 3739 5067 6459 7395 16219 17819 22720 33262
106 0 0 0 0 0 0 0 0 0 0
1216 0 0 0 0 0 0 0 0 1 0
1885 0 1 1 0 0 0 0 0 0 0
3846 0 0 0 0 0 1 0 0 0 0
4820 0 0 0 1 0 0 0 0 0 0
5138 0 0 0 0 1 0 1 0 0 0
15294 0 0 0 0 0 0 0 1 0 0
22021 1 0 0 0 0 0 0 0 0 0
22445 0 0 0 0 0 0 0 0 0 0
28134 0 0 0 0 0 0 0 0 0 1
evn though both knn and test dataset have the same number of rows=10 but i'm not too sure what is wrong with the same classes and order?

R: LS Means Analysis produces NAs?

I am running an linear model regression analysis script and I am running emmeans (ls means) on my model but I am getting a whole of NA's not sure why... Here is what I have run:
setwd("C:/Users/wkmus/Desktop/R-Stuff")
### yeild-twt
ASM_Data<-read.csv("ASM_FIELD_18_SUMM_wm.csv",header=TRUE, na.strings=".")
head(ASM_Data)
str(ASM_Data)
####"NA" values in table are labeled as "." colored orange
ASM_Data$REP <- as.factor(ASM_Data$REP)
head(ASM_Data$REP)
ASM_Data$ENTRY_NO <-as.factor(ASM_Data$ENTRY_NO)
head(ASM_Data$ENTRY_NO)
ASM_Data$RANGE<-as.factor(ASM_Data$RANGE)
head(ASM_Data$RANGE)
ASM_Data$PLOT_ID<-as.factor(ASM_Data$PLOT_ID)
head(ASM_Data$PLOT_ID)
ASM_Data$PLOT<-as.factor(ASM_Data$PLOT)
head(ASM_Data$PLOT)
ASM_Data$ROW<-as.factor(ASM_Data$ROW)
head(ASM_Data$ROW)
ASM_Data$REP <- as.numeric(as.character(ASM_Data$REP))
head(ASM_Data$REP)
ASM_Data$TWT_g.li <- as.numeric(as.character(ASM_Data$TWT_g.li))
ASM_Data$Yield_kg.ha <- as.numeric(as.character(ASM_Data$Yield_kg.ha))
ASM_Data$PhysMat_Julian <- as.numeric(as.character(ASM_Data$PhysMat_Julian))
ASM_Data$flowering <- as.numeric(as.character(ASM_Data$flowering))
ASM_Data$height <- as.numeric(as.character(ASM_Data$height))
ASM_Data$CLEAN.WT <- as.numeric(as.character(ASM_Data$CLEAN.WT))
ASM_Data$GRAV.TEST.WEIGHT <-as.numeric(as.character(ASM_Data$GRAV.TEST.WEIGHT))
str(ASM_Data)
library(lme4)
#library(lsmeans)
library(emmeans)
Here is the data frame:
> str(ASM_Data)
'data.frame': 270 obs. of 20 variables:
$ TRIAL_ID : Factor w/ 1 level "18ASM_OvOv": 1 1 1 1 1 1 1 1 1 1 ...
$ PLOT_ID : Factor w/ 270 levels "18ASM_OvOv_002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLOT : Factor w/ 270 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ROW : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ RANGE : Factor w/ 15 levels "1","2","3","4",..: 2 3 4 5 6 7 8 9 10 12 ...
$ REP : num 1 1 1 1 1 1 1 1 1 1 ...
$ MP : int 1 1 1 1 1 1 1 1 1 1 ...
$ SUB.PLOT : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 2 2 2 2 2 3 ...
$ ENTRY_NO : Factor w/ 139 levels "840","850","851",..: 116 82 87 134 77 120 34 62 48 136 ...
$ height : num 74 70 73 80 70 73 75 68 65 68 ...
$ flowering : num 133 133 134 134 133 131 133 137 134 132 ...
$ CLEAN.WT : num 1072 929 952 1149 1014 ...
$ GRAV.TEST.WEIGHT : num 349 309 332 340 325 ...
$ TWT_g.li : num 699 618 663 681 650 684 673 641 585 646 ...
$ Yield_kg.ha : num 2073 1797 1841 2222 1961 ...
$ Chaff.Color : Factor w/ 3 levels "Bronze","Mixed",..: 1 3 3 1 1 1 1 3 1 3 ...
$ CHAFF_COLOR_SCALE: int 2 1 1 2 2 2 2 1 2 1 ...
$ PhysMat : Factor w/ 3 levels "6/12/2018","6/13/2018",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PhysMat_Julian : num 163 163 163 163 163 163 163 163 163 163 ...
$ PEDIGREE : Factor w/ 1 level "OVERLEY/OVERLAND": 1 1 1 1 1 1 1 1 1 1 ...
This is the head of ASM Data:
head(ASM_Data)
`TRIAL_ID PLOT_ID PLOT ROW RANGE REP MP SUB.PLOT ENTRY_NO height flowering CLEAN.WT GRAV.TEST.WEIGHT TWT_g.li`
1 18ASM_OvOv 18ASM_OvOv_002 2 1 2 1 1 A 965 74 133 1071.5 349.37 699
2 18ASM_OvOv 18ASM_OvOv_003 3 1 3 1 1 A 931 70 133 928.8 309.13 618
3 18ASM_OvOv 18ASM_OvOv_004 4 1 4 1 1 A 936 73 134 951.8 331.70 663
4 18ASM_OvOv 18ASM_OvOv_005 5 1 5 1 1 A 983 80 134 1148.6 340.47 681
5 18ASM_OvOv 18ASM_OvOv_006 6 1 6 1 1 B 926 70 133 1014.0 324.95 650
6 18ASM_OvOv 18ASM_OvOv_007 7 1 7 1 1 B 969 73 131 1076.6 342.09 684
Yield_kg.ha Chaff.Color CHAFF_COLOR_SCALE PhysMat PhysMat_Julian PEDIGREE
1 2073 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
2 1797 White 1 6/12/2018 163 OVERLEY/OVERLAND
3 1841 White 1 6/12/2018 163 OVERLEY/OVERLAND
4 2222 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
5 1961 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
6 2082 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
I am looking at a linear model dealing with test weight.
This is what I ran:
ASM_Data$TWT_g.li <- as.numeric(as.character((ASM_Data$TWT_g.li)))
head(ASM_Data$TWT_g.li)
ASM_YIELD_1 <- lm(TWT_g.li~ENTRY_NO + REP + SUB.BLOCK, data=ASM_Data)
anova(ASM_YIELD_1)
summary(ASM_YIELD_1)
emmeans(ASM_YIELD_1, "ENTRY_NO") ###########ADJ. MEANS
I get an output for anova
anova(ASM_YIELD_1)
Analysis of Variance Table
Response: TWT_g.li
Df Sum Sq Mean Sq F value Pr(>F)
ENTRY_NO 138 217949 1579 7.0339 < 2e-16 ***
REP 1 66410 66410 295.7683 < 2e-16 ***
SUB.BLOCK 4 1917 479 2.1348 0.08035 .
Residuals 125 28067 225
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but for emmeans I get something like this:
ENTRY_NO emmean SE df asymp.LCL asymp.UCL
840 nonEst NA NA NA NA
850 nonEst NA NA NA NA
851 nonEst NA NA NA NA
852 nonEst NA NA NA NA
853 nonEst NA NA NA NA
854 nonEst NA NA NA NA
855 nonEst NA NA NA NA
857 nonEst NA NA NA NA
858 nonEst NA NA NA NA
859 nonEst NA NA NA NA
I do have outliers in my data which is indicated by a "." in my data but that's the only thing I can think of which is off.
When I run with(ASM_Data, table(ENTRY_NO, REP, SUB.BLOCK))
this is what I have:
with(ASM_Data, table(ENTRY_NO,REP,SUB.BLOCK))
, , SUB.BLOCK = A
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 0 0
862 0 0
863 1 0
864 0 0
865 1 0
866 1 0
867 0 0
868 0 0
869 1 0
870 1 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 1 0
880 0 0
881 0 0
882 0 0
883 0 0
884 0 0
885 1 0
886 0 0
887 1 0
888 1 0
889 1 0
890 0 0
891 1 0
892 0 0
893 0 0
894 0 0
895 0 0
896 1 0
897 0 0
898 0 0
899 0 0
900 1 0
901 1 0
902 0 0
903 0 0
904 1 0
905 1 0
906 0 0
907 1 0
908 1 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 0
914 0 0
915 0 0
916 1 0
917 0 0
918 0 0
919 1 0
920 0 0
921 0 0
922 0 0
923 1 0
924 0 0
925 0 0
926 0 0
927 1 0
928 0 0
929 0 0
930 0 0
931 1 0
932 0 0
933 0 0
934 0 0
935 0 0
936 1 0
937 0 0
938 1 0
939 1 0
940 0 0
941 1 0
942 0 0
943 1 0
944 0 0
945 0 0
946 0 0
947 0 0
948 1 0
949 0 0
950 1 0
951 0 0
952 0 0
953 0 0
954 0 0
955 1 0
956 1 0
957 1 0
958 1 0
959 0 0
960 0 0
961 0 0
962 0 0
963 0 0
964 0 0
965 1 0
966 0 0
967 1 0
968 0 0
969 0 0
970 1 0
971 0 0
972 0 0
973 0 0
974 1 0
975 0 0
976 0 0
977 0 0
978 1 0
979 0 0
980 0 0
981 0 0
982 0 0
983 1 0
984 1 0
985 0 0
986 1 0
987 3 0
988 0 0
, , SUB.BLOCK = B
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 1 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 1 0
862 0 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 1 0
872 0 0
873 0 0
874 0 0
875 0 0
876 1 0
877 1 0
878 1 0
879 0 0
880 1 0
881 0 0
882 1 0
883 1 0
884 1 0
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 1 0
891 0 0
892 1 0
893 1 0
894 1 0
895 1 0
896 0 0
897 1 0
898 0 0
899 0 0
900 0 0
901 0 0
902 1 0
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 1 0
910 0 0
911 1 0
912 0 0
913 1 0
914 0 0
915 0 0
916 0 0
917 0 0
918 0 0
919 0 0
920 1 0
921 1 0
922 0 0
923 0 0
924 0 0
925 1 0
926 1 0
927 0 0
928 0 0
929 0 0
930 1 0
931 0 0
932 1 0
933 0 0
934 1 0
935 0 0
936 0 0
937 1 0
938 0 0
939 0 0
940 1 0
941 0 0
942 0 0
943 0 0
944 0 0
945 1 0
946 0 0
947 1 0
948 0 0
949 0 0
950 0 0
951 1 0
952 0 0
953 0 0
954 1 0
955 0 0
956 0 0
957 0 0
958 0 0
959 1 0
960 0 0
961 0 0
962 1 0
963 0 0
964 0 0
965 0 0
966 0 0
967 0 0
968 0 0
969 1 0
970 0 0
971 0 0
972 0 0
973 0 0
974 0 0
975 0 0
976 1 0
977 1 0
978 0 0
979 0 0
980 0 0
981 1 0
982 1 0
983 0 0
984 0 0
985 3 0
986 0 0
987 1 0
988 1 0
, , SUB.BLOCK = C
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 1 0
858 0 0
859 1 0
860 0 0
861 0 0
862 1 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 1 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 0 0
880 0 0
881 1 0
882 0 0
883 0 0
884 0 0
885 0 0
886 1 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 0
893 0 0
894 0 0
895 0 0
896 0 0
897 0 0
898 1 0
899 1 0
900 0 0
901 0 0
902 0 0
903 1 0
904 0 0
905 0 0
906 1 0
907 0 0
908 0 0
909 0 0
910 1 0
911 0 0
912 1 0
913 0 0
914 1 0
915 1 0
916 0 0
917 1 0
918 1 0
919 0 0
920 0 0
921 0 0
922 1 0
923 0 0
924 1 0
925 0 0
926 0 0
927 0 0
928 1 0
929 1 0
930 0 0
931 0 0
932 0 0
933 1 0
934 0 0
935 1 0
936 0 0
937 0 0
938 0 0
939 0 0
940 0 0
941 0 0
942 1 0
943 0 0
944 1 0
945 0 0
946 1 0
947 0 0
948 0 0
949 1 0
950 0 0
951 0 0
952 1 0
953 1 0
954 0 0
955 0 0
956 0 0
957 0 0
958 0 0
959 0 0
960 1 0
961 1 0
962 0 0
963 1 0
964 1 0
965 0 0
966 1 0
967 0 0
968 1 0
969 0 0
970 0 0
971 1 0
972 1 0
973 1 0
974 0 0
975 1 0
976 0 0
977 0 0
978 1 0
979 2 0
980 0 0
981 0 0
982 0 0
983 0 0
984 0 0
985 1 0
986 3 0
987 0 0
988 0 0
, , SUB.BLOCK = D
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 1
853 0 0
854 0 0
855 0 0
857 0 0
858 0 1
859 0 0
860 0 1
861 0 0
862 0 0
863 0 0
864 0 1
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 1
876 0 0
877 0 0
878 0 1
879 0 0
880 0 1
881 0 1
882 0 1
883 0 1
884 0 1
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 1
893 0 0
894 0 0
895 0 0
896 0 0
897 0 1
898 0 0
899 0 1
900 0 0
901 0 0
902 0 1
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 1
914 0 1
915 0 1
916 0 0
917 0 1
918 0 1
919 0 0
920 0 0
921 0 1
922 0 1
923 0 0
924 0 0
925 0 0
926 0 0
927 0 0
928 0 0
929 0 1
930 0 1
931 0 0
932 0 0
Can someone please give me an idea of what is going wrong??
Thanks !
I've been able to create a situation like this. Consider this dataset:
> junk
trt rep blk y
1 A 1 1 -1.17415687
2 B 1 1 -0.20084854
3 C 1 1 0.64797806
4 A 1 2 -1.69371434
5 B 1 2 -0.35835442
6 C 1 2 1.35718782
7 A 1 3 0.20510482
8 B 1 3 1.00857651
9 C 1 3 -0.20553167
10 A 2 4 0.31261523
11 B 2 4 0.47989115
12 C 2 4 1.27574085
13 A 2 5 -0.79209520
14 B 2 5 1.07151315
15 C 2 5 -0.04222769
16 A 2 6 -0.80571767
17 B 2 6 0.80442988
18 C 2 6 1.73526561
This has 6 complete blocks, separately labeled with 3 blocks per rep. Not obvious, but true, is that rep is a numeric variable having values 1 and 2, while blk is a factor having 6 levels 1 -- 6:
> sapply(junk, class)
trt rep blk y
"factor" "numeric" "factor" "numeric"
With this complete dataset, I have no problem obtaining EMMs for modeling situations parallel to what was used in the original posting. However, if I use only a subset of these data, it is different. Consider:
> samp
[1] 1 2 3 5 8 11 13 15 16
> junk.lm = lm(y ~ trt + rep + blk, data = junk, subset = samp)
> emmeans(junk.lm, "trt")
trt emmean SE df asymp.LCL asymp.UCL
A nonEst NA NA NA NA
B nonEst NA NA NA NA
C nonEst NA NA NA NA
Results are averaged over the levels of: blk
Confidence level used: 0.95
Again, recall that rep is numeric in this model. If instead, I make rep a factor:
> junk.lmf = lm(y ~ trt + factor(rep) + blk, data = junk, subset = samp)
> emmeans(junk.lmf, "trt")
NOTE: A nesting structure was detected in the fitted model:
blk %in% rep
If this is incorrect, re-run or update with `nesting` specified
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk, rep
Confidence level used: 0.95
We get non-NA estimates, in part because it is able to detect the fact that blk is nested in rep, and thus performs the EMM computations separately in each rep. Note in the annotations in this last output that averaging is done over the 2 reps and 6 blocks; whereas in fiber.lm averaging is done only over blocks, while rep, a numeric variable, is set at its average. Compare the reference grids for the two models:
> ref_grid(junk.lm)
'emmGrid' object with variables:
trt = A, B, C
rep = 1.4444
blk = 1, 2, 3, 4, 5, 6
> ref_grid(junk.lmf)
'emmGrid' object with variables:
trt = A, B, C
rep = 1, 2
blk = 1, 2, 3, 4, 5, 6
Nesting structure: blk %in% rep
An additional option is to avoid the nesting issue by simply omitting rep from the model:
> junk.lm.norep = lm(y ~ trt + blk, data = junk, subset = samp)
> emmeans(junk.lm.norep, "trt")
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk
Confidence level used: 0.95
Note that exactly the same results are produced. The reason is the levels of blk already predict the levels of rep, so there is no need for it to be in the model.
In summary:
The situation is due in part to the fact that there are missing data
and in part because rep was in the model as a numeric predictor rather than a factor.
In your situation, I suggest re-fitting the model with factor(REP) instead of REPas a numeric predictor. This may be enough to produce estimates.
If, indeed, as in my example, the SUB.BLOCK levels predict the REP levels, just leave REP out of the model altogether.
EMMs are obtained by averaging predictions over 2 reps and 5 blocks (or maybe more?). Look at
coef(ASM_YIELD_1)
If any of the rep or block effects are NA, then you can’t estimate all of the rep or block effects, and that makes the average of them non-estimable.
You can see exactly which factor combinations are non-estimable by doing:
summary(ref_grid(ASM_YIELD_1))
addendum
Here is a reformatting of the tables I requested in comments:
ENTRY ---------- BLOCK -------------
NO A B C D
840 0 0 0 0 0 0 0 0
850 0 0 0 0 0 0 0 0
851 0 0 0 0 0 0 0 0
852 0 0 0 0 0 0 0 1
853 0 0 1 0 0 0 0 0
854 0 0 0 0 0 0 0 0
855 0 0 0 0 0 0 0 0
857 0 0 0 0 1 0 0 0
858 0 0 0 0 0 0 0 1
859 0 0 0 0 1 0 0 0
... etc ...
This is extremely sparse data. I think there are two more blocks not shown. But I see very few instances where a given ENTRY_NO is observed in more than one rep or block. So I think it is seriously over-fitting to try to account for rep or block effects in this model.
MAYBE omitting REP from the model will make it work. MAYBE re-fitting the model with factor(REP) in place of REP will enable emmeans to detect a nesting structure. Otherwise, there's some really subtle dependence in the blocking structure and treatments, and I don't know what to suggest.

Remove duplicate header lines in dataframe

My raw data contains numeric values with a recall of the headers every 20 lines.
I wish to remove the repeated header lines with R. I know it's quite easy with sed command but I wish the R script to handle all steps of tidying data.
> raw <- read.delim("./vmstat_archiveadm_s.txt")
> head(raw)
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100097600 97779056 285 426 53 0 0 0 367 86 6 0 0 1206 7711 2630 1 0 99
0 0 0 96908192 94414488 7 31 0 0 0 0 0 120 0 0 0 2782 5775 5042 2 0 97
0 0 0 96889840 94397152 0 1 0 0 0 0 0 122 0 0 0 2737 5591 4958 2 0 97
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100065744 97745448 282 422 52 0 0 0 363 89 6 0 0 1233 7690 2665 1 0 99
0 0 0 96725312 94222040 7 31 0 0 0 0 0 604 69 0 0 5269 5703 7910 2 1 97
0 0 0 96668624 94170784 0 0 0 0 0 0 0 155 53 0 0 3047 5505 5317 2 0 97
0 0 0 96595104 94086816 0 0 0 0 0 0 0 174 0 0 0 2879 5567 5068 2 0 97
1 0 0 96521376 94025504 0 0 0 0 0 0 0 121 0 0 0 2812 5471 5105 2 0 97
0 0 0 96503256 93994896 0 0 0 0 0 0 0 121 0 0 0 2731 5621 4981 2 0 97
(...)
Try this :
where df is the dataframe
x = seq(6,100,21)
df = df[-x,]
Sequence will generate a string of numbers from 6 till 100 at an interval of 21.
Therefore, in this case :
6 27 48 69 90
Remove them from the dataframe by
df[-x,]
EDIT:
To do this for the entire dataframe, replace 100 with number of rows. i.e
seq(6,nrow(df),21)
Instead of processing the output in R I will clean it at the generation level:
$ vmstat 1 | egrep -v '^ kthr|^ r'
0 0 0 154831904 153906536 215 471 0 0 0 0 526 33 32 0 0 1834 14171 5253 0 0 99
1 0 0 154805632 153354296 9 32 0 0 0 0 0 0 0 0 0 1463 610 739 0 0 100
1 0 0 154805632 153354696 0 4 0 0 0 0 0 0 0 0 0 1408 425 634 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1341 381 658 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1299 353 610 0 0 100
1 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1319 375 638 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1308 367 614 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1336 395 650 0 0 100
1 0 0 154805632 153354640 0 0 0 0 0 0 0 44 44 0 0 1594 378 878 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 66 65 0 0 1763 382 1015 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1312 411 645 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1342 390 647 0 0 100

Sum of longest string of non-zero values

I have a dataframe containing the daily rainfall values at 76 stations from 1964-2013. Each row is a different month for a particular station. Here is a snippet of the dataframe-
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0
...
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USW00093129 2013 10 31 0 0 0 0 0 0 0 0 43 15 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41 3 8 0
USW00093129 2013 11 30 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 79 18 20 0 0 0 0 0 0 0 Inf
USW00093129 2013 12 31 0 0 175 33 0 0 3 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 0 0 0 0 0 0 0 0 0
I am trying to find the length of the longest stretch of non-zero rainfall values for each row and the total rainfall in that stretch. The easiest way to find the length of the longest stretch would be to convert the dataframe to 0s and 1s, use rle and apply max(y$lengths[y$values!=0]) along each row. But how do I find the sum of the values?
Thanks for helping out, in advance!
Not exactly a one-liner, but this works :
df <- read.table(header=TRUE,stringsAsFactors=FALSE,check.names=FALSE,text=
"Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0")
res <- lapply(1:nrow(df), function(r){
monthDays <- df[r,'Days']
rain <- as.numeric(df[r,(1:monthDays) + 4])
enc <- rle(rain > 0)
if(all(!enc$values))
return(c(0,0))
len <- enc$lengths
len[!enc$values] <- 0
max.idx <- which.max(len)
lastIdx <- cumsum(enc$lengths)[max.idx]
firstIdx <- lastIdx - enc$lengths[max.idx] + 1
tot <- sum(rain[firstIdx:lastIdx])
stretch <- lastIdx - firstIdx + 1
return(c(stretch,tot))
})
columnsToAdd <- do.call(rbind,res)
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
Result :
# We print the result without months values for better readability
> df2[,-(5:35)]
Station Year Month Days StretchLen StretchRain
1 USC00020750 1964 1 31 3 110
2 USC00020750 1964 2 29 1 48
3 USC00020750 1964 3 31 4 328
4 USC00020750 1964 4 30 4 127
5 USC00020750 1964 5 31 2 59
6 USC00020750 1964 6 30 1 38
7 USC00020750 1964 7 31 3 210
8 USC00020750 1964 8 31 3 175
9 USC00020750 1964 9 30 2 66
10 USC00020750 1964 10 31 0 0
11 USC00020750 1964 11 30 2 130
12 USC00020750 1964 12 31 2 127
BTW, if you want to stick with apply, it would be like this :
columnsToAdd <-
t(apply(df[,-(1:3)],MARGIN=1,function(r){
monthDays <- r[1]
rain <- as.numeric(r[-1])
enc <- rle(rain > 0)
if(all(!enc$values))
return(c(0,0))
len <- enc$lengths
len[!enc$values] <- 0
max.idx <- which.max(len)
lastIdx <- cumsum(enc$lengths)[max.idx]
firstIdx <- lastIdx - enc$lengths[max.idx] + 1
tot <- sum(rain[firstIdx:lastIdx])
stretch <- lastIdx - firstIdx + 1
return(c(stretch,tot))
}))
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
I don't like using apply on data.frame's since it has been created for matrices and so it coerces the columns to the same type before calling the function (hence if you work on columns of different types you need to be careful).
Here's another solution with dplyr/tidyr
data %>%
gather(day, rain, -Station, -Year, -Month, -Days) %>%
arrange(Station, Year, Month, day) %>%
group_by(Station, Year, Month) %>%
mutate(previous_rain = lag(rain)) %>%
filter(!(rain %in% c(0, Inf))) %>%
mutate(storm = cumsum(previous_rain %in% c(0, NA))) %>%
group_by(Station, Year, Month, storm) %>%
summarize(total_rain = sum(rain),
number_of_days = n(),
start_day = first(day),
end_day = last(day)) %>%
arrange(desc(number_of_days)) %>%
slice(1)
Here's another take at it, where I've used the rle() function to find run lengths. It's protracted but primarily to make it clear what is happening - you could shorten it easily.
raindf <-
tmp <- read.table(textConnection(" Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0"), header = TRUE)
rainfall <- unlist(as.data.frame(t(raindf[1:3, -c(1:4)])), use.names = FALSE)
rainfall <- rainfall[!is.infinite(rainfall)]
rainfall[rainfall > 0] <- 1
rainyruns <- rle(rainfall)
rainyrunsDf <- data.frame(lengths = rainyruns$lengths, values = rainyruns$values)
rainyrunsDf <- subset(rainyrunsDf, values != 0)
rainyrunsDf <- rainyrunsDf[order(rainyrunsDf$lengths, decreasing = TRUE), ]
rainyrunsDf[1,1]
## [1] 4

Making a CSV file into an RData file

Please bear with an R newbie here. I'm trying to follow along with a tutorial published on the wonderful flowingdata.com site by using my own data to replace the .Rdata file included in the tutorial. The Rdata file, "unisexCnts.RData", contains unisex names and the number of times used for different years:
head(unisexCnts)
1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 312 273 274 263 0 273 0 255 235 195 222 0 195 0 193 225 204 196 177 156
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 190 0 0 325 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 243 219 214
Alva 177 132 159 178 145 138 131 119 119 119 127 97 107 97 83 76 83 90 84 81 58 68
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 1264 0 0 0 0 0 0 0 1579 2145 2488 0 0
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 595 664
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 561 565 556 643 747 722 0 742 0 0
Alva 54 57 53 54 59 40 62 0 48 0 28 0 34 0 0 0 0 0 0 0 0 26
Amari 0 0 0 0 0 0 11 0 0 0 0 0 16 0 22 0 32 0 0 0 0 0
Angel 2561 2690 2779 0 0 3004 3108 3113 3187 2924 3100 3341 3229 3101 3532 3889 4066 4520 0 0 0 0
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Addison 778 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 0 19 0 14 0 0 0 0 0 24 0 0 0 0 0
Amari 0 0 0 0 0 0 1181 1397 1333 1299 1265 1550 1780 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run it through the str() function I get the follwoing:
str(unisexCnts)
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
..$ : chr [1:83] "1930" "1931" "1932" "1933" ...
My data is in a csv file ,called "boysnames.csv":
,2013,2012,2011,2010,2009,2008
Jack,764,831,840,935,1068,1151
James,746,773,796,746,711,737
Daniel,678,683,711,792,842,828
Conor,610,639,709,726,776,857
I am trying to overwrite the unisexCnts.RData with the contents of my boysnames.csv. So to restructure and get my csv ready to be saved, I did:
Step1.
unisexCnts<-data <- read.csv("boysnames.csv", stringsAsFactors=FALSE, header=TRUE, check.names = FALSE)
Step2.
unisexCnts<-as.matrix(unisexCnts)
Step3.
save(file="unisexCnts.RData") ##save as Rdata file, overwriting the original unisexCnts.RData in the dir
However I get the following after steps 1 & 2 which doesn't match the structure of the original, any ideas/pointers?
> str(unisexCnts)
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "" "2013" "2012" "2011" ...
When you load a .csv file you can specify the column that should become the row names of the uploaded data using the command "row.names"
I recreated your data quickly and uploaded it using the following code:
read.csv('test.csv', stringsAsFactors = F,head = T, row.names = 1)
This saves you having to do this work after uploading the data. This gives you the data structure you are looking for as well:
unisexCnts = read.csv('test.csv', stringsAsFactors = F,head = T, row.names = 1)
unisexCnts = as.matrix(unisexCnts)
str(unisexCnts)
int [1:4, 1:6] 764 746 678 610 831 773 683 639 840 796 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:4] "Jack" "James" "Dan" "Conor"
..$ : chr [1:6] "X2013" "X2012" "X2011" "X2010" ...
However I get the following after steps 1 & 2 which doesn't match the
structure of the original, any ideas/pointers?
In the original unisexCnts the names are specified as row names. That's why the the first attribute is
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
To replicate that in your example. You can set the names as rownames by specifying
rownames(unisexCnts) <- ListorOrVectorofNamesHere
This will make the output match.
The reason this line:
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
doens't match this line
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
is the same. You have the names included in the actual matrix itself. In a matrix you can only have data of the same type . By including character data in the matrix (the names) you are converting the whole matrix itself into character/strings.
in summary
remove the name vector from the matrix and use it as row names and the str() of your two objects will match.

Resources