finger_index finger_state data_source x y eventMask Tap group
0 PalmOn PTP 1155 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1155 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1156 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1161 240 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1181 234 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 NoObject PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1190 231 0 0 a
1 PalmOn PTP 12 177 0 0 a
2 NoObject PTP 0 0 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
NA NA NA NA NA 2048 1 b
...and so on with this pattern.
Desired output:
finger_index finger_state data_source x y eventMask Tap group
0 PalmOn PTP 1155 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1155 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1156 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
0 PalmOn PTP 1161 240 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
NA NA NA NA NA 2048 1 b
I have this group of data that I generated; rows with finger_index 0 to 4 belong to the same set.
0 PalmOn PTP 1155 241 0 0 a
1 PalmOn PTP 13 178 0 0 a
2 FingerOn PTP 255 671 0 0 a
3 NoObject PTP 0 0 0 0 a
4 NoObject PTP 0 0 0 0 a
I want to choose the last 5 sets of rows, but I have to ignore any set that does not contain "FingerOn".
I also have a few placeholder rows that I call group b. I created groups a and b because I wanted to use this code:
tapFilter <- DT[, if (group[1] != "a") .SD else tail(.SD, 25),
                by = .(group, rleid(group))][, -2]
However, I can't use this as-is, because I also have to skip any set that does not contain FingerOn.
I think the easiest approach uses data.table. How can I achieve this task?
Thanks!
I am not sure I exactly understand what you want, but take a look at this solution. Replace the dataset name and the "time_stamp" grouping variable accordingly.
library(dplyr)
df_filtered <- df %>%
  group_by(time_stamp) %>%
  filter(any(finger_state == "FingerOn")) %>%
  ungroup() %>%
  tail(n = 25)
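Since the question asks for a data.table approach, here is a rough equivalent of the dplyr answer. It relies on the same assumption that each set of rows shares an identifier such as time_stamp; the toy data below is made up for illustration:

```r
library(data.table)

# made-up toy data: three sets, identified by time_stamp
dt <- data.table(
  time_stamp   = rep(1:3, each = 2),
  finger_state = c("FingerOn", "NoObject",
                   "PalmOn",   "NoObject",   # set 2 has no FingerOn -> dropped
                   "FingerOn", "PalmOn")
)

# drop every set without a "FingerOn" row, then keep the last 25 rows overall
kept      <- dt[, if (any(finger_state == "FingerOn")) .SD, by = time_stamp]
tapFilter <- tail(kept, 25)
```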
I keep getting this error: the table must the same classes in the same order
when implementing KNN and confusion matrix to get the accuracy
df_train <- df_trimmed_n[1:10,]
df_test <- df_trimmed_n[11:20,]
df_train_labels <- df_trimmed[1:10, 1]
df_test_labels <- df_trimmed[11:20, 1]
library(class)
library(caret)
df_knn<-knn(df_train,df_test,cl=df_train_labels,k=10)
confusionMatrix(table(df_knn,df_test_labels))
Error in confusionMatrix.table(table(df_knn, df_test_labels)) :
the table must the same classes in the same order
> print(df_knn)
[1] 28134 5138 4820 3846 1216 1885 1885 22021 5138 15294
Levels: 106 1216 1885 3846 4820 5138 15294 22021 22445 28134
> print(df_test_labels)
[1] 33262 6459 5067 7395 22720 1217 3739 84 16219 17819
> table(df_knn,df_test_labels)
df_test_labels
df_knn 84 1217 3739 5067 6459 7395 16219 17819 22720 33262
106 0 0 0 0 0 0 0 0 0 0
1216 0 0 0 0 0 0 0 0 1 0
1885 0 1 1 0 0 0 0 0 0 0
3846 0 0 0 0 0 1 0 0 0 0
4820 0 0 0 1 0 0 0 0 0 0
5138 0 0 0 0 1 0 1 0 0 0
15294 0 0 0 0 0 0 0 1 0 0
22021 1 0 0 0 0 0 0 0 0 0
22445 0 0 0 0 0 0 0 0 0 0
28134 0 0 0 0 0 0 0 0 0 1
Even though both the knn output and the test dataset have the same number of rows (10), I'm not sure what is wrong with the classes and their order.
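That error typically appears when the two vectors passed to table() carry different factor level sets, which is the case above: df_knn and df_test_labels share no labels at all. A hedged sketch of the usual fix, coercing both vectors to factors over one shared level set (the values below are toy stand-ins, not the real data):

```r
# toy stand-ins for df_knn (predictions) and df_test_labels (truth)
pred  <- c(28134, 5138, 4820)
truth <- c(33262, 6459, 5067)

# give both vectors one shared, identically ordered level set
lvls <- sort(union(pred, truth))
tab  <- table(factor(pred,  levels = lvls),
              factor(truth, levels = lvls))
dim(tab)   # square table, so caret::confusionMatrix(tab) can accept it
```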
I am quite new to R. I was wondering if there could be a simple solution for my situation.
I have this kind of data set, in which each row is duplicated three times.
A
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
ASEC.r tr_in tr_out
2406 22234 0 0
2406.1 22234 0 0
2406.2 22234 0 0
4037 20547 0 0
4037.1 20547 0 0
4037.2 20547 0 0
And another data set that looks like this. The second data set's columns are a subset of the columns in the first data set.
B
BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
1 150 0 0 0 0 0 0 0 0 0 151 0
2 150 426 422 205 0 0 0 0 0 0 151 201
3 150 4201 4203 239 0 0 0 0 0 0 151 201
Is there a way I can cbind these two data sets?
I tried duplicating the second data set and using cbind(A, B), but the result was displayed only as a "Large matrix" that I can't inspect.
Is there a way I can compare the first data set with the second to check whether they match?
That is why I tried to column-bind them, but is there a simpler solution?
========Edited=========
What I would like to create is
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151
ASEC.r tr_in tr_out BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out match
2406 22234 0 0 150 0 0 0 0 0 0 0 0 0 151 0 1
2406.1 22234 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
2406.2 22234 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
4037 20547 0 0 150 0 0 0 0 0 0 0 0 0 151 0 1
4037.1 20547 0 0 150 426 422 205 0 0 0 0 0 0 151 201 0
4037.2 20547 0 0 150 4201 4203 239 0 0 0 0 0 0 151 201 0
So when I compare data sets A and B, I want to add a new column to A showing 1 if the rows match and 0 if they don't.
I think there might be holes in the logic here, but I'll state some assumptions:
nrow(A) is always an integer-multiple of nrow(B); this means that A[1,] pairs only with B[1,], A[2,] with B[2,], ..., A[4,] with B[1,], A[5,] with B[2,], etc.
the test of comparison is "equality of in-common columns"
If those are true, then
incommon <- intersect(colnames(A), colnames(B))
incommon
# [1] "BSTN" "ASTN" "ASTN1" "BSTN2" "ASTN2" "BSTN3" "ASTN3" "BSTN4" "ASTN4" "BSTN5"
# [11] "tr_in" "tr_out"
Bplus <- do.call(rbind.data.frame,
                 c(replicate(nrow(A) / nrow(B), B, simplify = FALSE),
                   list(stringsAsFactors = FALSE)))
Bplus
# BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
# 1 150 0 0 0 0 0 0 0 0 0 151 0
# 2 150 426 422 205 0 0 0 0 0 0 151 201
# 3 150 4201 4203 239 0 0 0 0 0 0 151 201
# 11 150 0 0 0 0 0 0 0 0 0 151 0
# 21 150 426 422 205 0 0 0 0 0 0 151 201
# 31 150 4201 4203 239 0 0 0 0 0 0 151 201
A$match <- +(rowSums(A[,incommon] == Bplus[,incommon]) == length(incommon))
A
# CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2
# 2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0
# 4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# 4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# 4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0
# BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out match
# 2406 0 0 0 0 0 0 0 0 151 22234 0 0 1
# 2406.1 0 0 0 0 0 0 0 0 151 22234 0 0 0
# 2406.2 0 0 0 0 0 0 0 0 151 22234 0 0 0
# 4037 0 0 0 0 0 0 0 0 151 20547 0 0 1
# 4037.1 0 0 0 0 0 0 0 0 151 20547 0 0 0
# 4037.2 0 0 0 0 0 0 0 0 151 20547 0 0 0
The use of +(...) is a trick to convert logical to integer 0 and 1. It is just as easy to keep $match as a logical field by removing that portion of the assignment. (I only used it because you had that in your intended output. I prefer logical for my own use, since 1 or 1L implies ordinality and perhaps that there can be values other than 0 and 1. In a declarative sense, logical clearly states that you expect only FALSE and TRUE, and possibly NA when it is indeterminate.)
Also, the rowSums(...) == length(incommon) checks that all of the in-common fields are identical. Another way to calculate it is
apply(A[,incommon] == Bplus[,incommon], 1, all)
which might be more intuitive and/or declarative. The choice of which to use is a lot based on preference and a little on performance ... the rowSums method is slightly faster than the apply method.
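A toy illustration of both points, the rowSums/apply equivalence and the +(...) coercion, using made-up values:

```r
# row 1 is all-TRUE, row 2 is not (matrix() fills column-major)
m <- matrix(c(TRUE, TRUE, TRUE, FALSE), nrow = 2)

# two equivalent "is every column TRUE in this row?" tests
via_rowsums <- rowSums(m) == ncol(m)
via_apply   <- apply(m, 1, all)

# unary + coerces logical to integer 0/1
plus_trick <- +via_rowsums
```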
Data
A <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0")
B <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out
1 150 0 0 0 0 0 0 0 0 0 151 0
2 150 426 422 205 0 0 0 0 0 0 151 201
3 150 4201 4203 239 0 0 0 0 0 0 151 201")
Is there a way I can compare the first data set and the second data set to check if they match?
Code:
library('data.table')
col_nm <- names(df2)[names(df2) %in% names(df1)]
setDT(df1)[df2, on = col_nm, nomatch = 0]
Output:
# CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out trips
# 1: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 2: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 3: 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0 1143
# 4: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
# 5: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
# 6: 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0 1143
Data:
df1 <- read.table(text = 'CARDID BSTN ASTN USERTYPE INVDIST INVTIME BSEC TRNID BSTN.r ASTN1 BSTN2 TRNID2 ASTN2 BSTN3 TRNID3 ASTN3 BSTN4 TRNID4 ASTN4 BSTN5 TRNID5 ASTN.r ASEC.r tr_in tr_out
2406 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.1 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
2406.2 5786 150 151 6 1100 340 21996 1672 150 0 0 0 0 0 0 0 0 0 0 0 0 151 22234 0 0
4037 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.1 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0
4037.2 9737 150 151 6 1100 320 20368 2191 150 0 0 0 0 0 0 0 0 0 0 0 0 151 20547 0 0', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'BSTN tr_in ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN tr_out trips
1 150 0 0 0 0 0 0 0 0 0 151 0 1143
2 150 426 422 205 0 0 0 0 0 0 151 201 2
3 150 4201 4203 239 0 0 0 0 0 0 151 201 2', header = TRUE, stringsAsFactors = FALSE)
I am running a linear model regression analysis script with emmeans (LS means) on my model, but I am getting a whole lot of NAs and I am not sure why. Here is what I have run:
setwd("C:/Users/wkmus/Desktop/R-Stuff")
### yeild-twt
ASM_Data<-read.csv("ASM_FIELD_18_SUMM_wm.csv",header=TRUE, na.strings=".")
head(ASM_Data)
str(ASM_Data)
####"NA" values in table are labeled as "." colored orange
ASM_Data$REP <- as.factor(ASM_Data$REP)
head(ASM_Data$REP)
ASM_Data$ENTRY_NO <-as.factor(ASM_Data$ENTRY_NO)
head(ASM_Data$ENTRY_NO)
ASM_Data$RANGE<-as.factor(ASM_Data$RANGE)
head(ASM_Data$RANGE)
ASM_Data$PLOT_ID<-as.factor(ASM_Data$PLOT_ID)
head(ASM_Data$PLOT_ID)
ASM_Data$PLOT<-as.factor(ASM_Data$PLOT)
head(ASM_Data$PLOT)
ASM_Data$ROW<-as.factor(ASM_Data$ROW)
head(ASM_Data$ROW)
ASM_Data$REP <- as.numeric(as.character(ASM_Data$REP))
head(ASM_Data$REP)
ASM_Data$TWT_g.li <- as.numeric(as.character(ASM_Data$TWT_g.li))
ASM_Data$Yield_kg.ha <- as.numeric(as.character(ASM_Data$Yield_kg.ha))
ASM_Data$PhysMat_Julian <- as.numeric(as.character(ASM_Data$PhysMat_Julian))
ASM_Data$flowering <- as.numeric(as.character(ASM_Data$flowering))
ASM_Data$height <- as.numeric(as.character(ASM_Data$height))
ASM_Data$CLEAN.WT <- as.numeric(as.character(ASM_Data$CLEAN.WT))
ASM_Data$GRAV.TEST.WEIGHT <-as.numeric(as.character(ASM_Data$GRAV.TEST.WEIGHT))
str(ASM_Data)
library(lme4)
#library(lsmeans)
library(emmeans)
Here is the data frame:
> str(ASM_Data)
'data.frame': 270 obs. of 20 variables:
$ TRIAL_ID : Factor w/ 1 level "18ASM_OvOv": 1 1 1 1 1 1 1 1 1 1 ...
$ PLOT_ID : Factor w/ 270 levels "18ASM_OvOv_002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLOT : Factor w/ 270 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ROW : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ RANGE : Factor w/ 15 levels "1","2","3","4",..: 2 3 4 5 6 7 8 9 10 12 ...
$ REP : num 1 1 1 1 1 1 1 1 1 1 ...
$ MP : int 1 1 1 1 1 1 1 1 1 1 ...
$ SUB.PLOT : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 2 2 2 2 2 3 ...
$ ENTRY_NO : Factor w/ 139 levels "840","850","851",..: 116 82 87 134 77 120 34 62 48 136 ...
$ height : num 74 70 73 80 70 73 75 68 65 68 ...
$ flowering : num 133 133 134 134 133 131 133 137 134 132 ...
$ CLEAN.WT : num 1072 929 952 1149 1014 ...
$ GRAV.TEST.WEIGHT : num 349 309 332 340 325 ...
$ TWT_g.li : num 699 618 663 681 650 684 673 641 585 646 ...
$ Yield_kg.ha : num 2073 1797 1841 2222 1961 ...
$ Chaff.Color : Factor w/ 3 levels "Bronze","Mixed",..: 1 3 3 1 1 1 1 3 1 3 ...
$ CHAFF_COLOR_SCALE: int 2 1 1 2 2 2 2 1 2 1 ...
$ PhysMat : Factor w/ 3 levels "6/12/2018","6/13/2018",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PhysMat_Julian : num 163 163 163 163 163 163 163 163 163 163 ...
$ PEDIGREE : Factor w/ 1 level "OVERLEY/OVERLAND": 1 1 1 1 1 1 1 1 1 1 ...
This is the head of ASM Data:
head(ASM_Data)
TRIAL_ID PLOT_ID PLOT ROW RANGE REP MP SUB.PLOT ENTRY_NO height flowering CLEAN.WT GRAV.TEST.WEIGHT TWT_g.li
1 18ASM_OvOv 18ASM_OvOv_002 2 1 2 1 1 A 965 74 133 1071.5 349.37 699
2 18ASM_OvOv 18ASM_OvOv_003 3 1 3 1 1 A 931 70 133 928.8 309.13 618
3 18ASM_OvOv 18ASM_OvOv_004 4 1 4 1 1 A 936 73 134 951.8 331.70 663
4 18ASM_OvOv 18ASM_OvOv_005 5 1 5 1 1 A 983 80 134 1148.6 340.47 681
5 18ASM_OvOv 18ASM_OvOv_006 6 1 6 1 1 B 926 70 133 1014.0 324.95 650
6 18ASM_OvOv 18ASM_OvOv_007 7 1 7 1 1 B 969 73 131 1076.6 342.09 684
Yield_kg.ha Chaff.Color CHAFF_COLOR_SCALE PhysMat PhysMat_Julian PEDIGREE
1 2073 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
2 1797 White 1 6/12/2018 163 OVERLEY/OVERLAND
3 1841 White 1 6/12/2018 163 OVERLEY/OVERLAND
4 2222 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
5 1961 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
6 2082 Bronze 2 6/12/2018 163 OVERLEY/OVERLAND
I am looking at a linear model dealing with test weight.
This is what I ran:
ASM_Data$TWT_g.li <- as.numeric(as.character((ASM_Data$TWT_g.li)))
head(ASM_Data$TWT_g.li)
ASM_YIELD_1 <- lm(TWT_g.li~ENTRY_NO + REP + SUB.BLOCK, data=ASM_Data)
anova(ASM_YIELD_1)
summary(ASM_YIELD_1)
emmeans(ASM_YIELD_1, "ENTRY_NO") ###########ADJ. MEANS
I get an output for anova
anova(ASM_YIELD_1)
Analysis of Variance Table
Response: TWT_g.li
Df Sum Sq Mean Sq F value Pr(>F)
ENTRY_NO 138 217949 1579 7.0339 < 2e-16 ***
REP 1 66410 66410 295.7683 < 2e-16 ***
SUB.BLOCK 4 1917 479 2.1348 0.08035 .
Residuals 125 28067 225
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but for emmeans I get something like this:
ENTRY_NO emmean SE df asymp.LCL asymp.UCL
840 nonEst NA NA NA NA
850 nonEst NA NA NA NA
851 nonEst NA NA NA NA
852 nonEst NA NA NA NA
853 nonEst NA NA NA NA
854 nonEst NA NA NA NA
855 nonEst NA NA NA NA
857 nonEst NA NA NA NA
858 nonEst NA NA NA NA
859 nonEst NA NA NA NA
I do have outliers in my data, which are indicated by a "." (read in as NA), but that's the only thing I can think of that might be off.
When I run with(ASM_Data, table(ENTRY_NO, REP, SUB.BLOCK))
this is what I have:
with(ASM_Data, table(ENTRY_NO,REP,SUB.BLOCK))
, , SUB.BLOCK = A
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 0 0
862 0 0
863 1 0
864 0 0
865 1 0
866 1 0
867 0 0
868 0 0
869 1 0
870 1 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 1 0
880 0 0
881 0 0
882 0 0
883 0 0
884 0 0
885 1 0
886 0 0
887 1 0
888 1 0
889 1 0
890 0 0
891 1 0
892 0 0
893 0 0
894 0 0
895 0 0
896 1 0
897 0 0
898 0 0
899 0 0
900 1 0
901 1 0
902 0 0
903 0 0
904 1 0
905 1 0
906 0 0
907 1 0
908 1 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 0
914 0 0
915 0 0
916 1 0
917 0 0
918 0 0
919 1 0
920 0 0
921 0 0
922 0 0
923 1 0
924 0 0
925 0 0
926 0 0
927 1 0
928 0 0
929 0 0
930 0 0
931 1 0
932 0 0
933 0 0
934 0 0
935 0 0
936 1 0
937 0 0
938 1 0
939 1 0
940 0 0
941 1 0
942 0 0
943 1 0
944 0 0
945 0 0
946 0 0
947 0 0
948 1 0
949 0 0
950 1 0
951 0 0
952 0 0
953 0 0
954 0 0
955 1 0
956 1 0
957 1 0
958 1 0
959 0 0
960 0 0
961 0 0
962 0 0
963 0 0
964 0 0
965 1 0
966 0 0
967 1 0
968 0 0
969 0 0
970 1 0
971 0 0
972 0 0
973 0 0
974 1 0
975 0 0
976 0 0
977 0 0
978 1 0
979 0 0
980 0 0
981 0 0
982 0 0
983 1 0
984 1 0
985 0 0
986 1 0
987 3 0
988 0 0
, , SUB.BLOCK = B
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 1 0
854 0 0
855 0 0
857 0 0
858 0 0
859 0 0
860 0 0
861 1 0
862 0 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 1 0
872 0 0
873 0 0
874 0 0
875 0 0
876 1 0
877 1 0
878 1 0
879 0 0
880 1 0
881 0 0
882 1 0
883 1 0
884 1 0
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 1 0
891 0 0
892 1 0
893 1 0
894 1 0
895 1 0
896 0 0
897 1 0
898 0 0
899 0 0
900 0 0
901 0 0
902 1 0
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 1 0
910 0 0
911 1 0
912 0 0
913 1 0
914 0 0
915 0 0
916 0 0
917 0 0
918 0 0
919 0 0
920 1 0
921 1 0
922 0 0
923 0 0
924 0 0
925 1 0
926 1 0
927 0 0
928 0 0
929 0 0
930 1 0
931 0 0
932 1 0
933 0 0
934 1 0
935 0 0
936 0 0
937 1 0
938 0 0
939 0 0
940 1 0
941 0 0
942 0 0
943 0 0
944 0 0
945 1 0
946 0 0
947 1 0
948 0 0
949 0 0
950 0 0
951 1 0
952 0 0
953 0 0
954 1 0
955 0 0
956 0 0
957 0 0
958 0 0
959 1 0
960 0 0
961 0 0
962 1 0
963 0 0
964 0 0
965 0 0
966 0 0
967 0 0
968 0 0
969 1 0
970 0 0
971 0 0
972 0 0
973 0 0
974 0 0
975 0 0
976 1 0
977 1 0
978 0 0
979 0 0
980 0 0
981 1 0
982 1 0
983 0 0
984 0 0
985 3 0
986 0 0
987 1 0
988 1 0
, , SUB.BLOCK = C
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 0
853 0 0
854 0 0
855 0 0
857 1 0
858 0 0
859 1 0
860 0 0
861 0 0
862 1 0
863 0 0
864 0 0
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 1 0
873 0 0
874 0 0
875 0 0
876 0 0
877 0 0
878 0 0
879 0 0
880 0 0
881 1 0
882 0 0
883 0 0
884 0 0
885 0 0
886 1 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 0
893 0 0
894 0 0
895 0 0
896 0 0
897 0 0
898 1 0
899 1 0
900 0 0
901 0 0
902 0 0
903 1 0
904 0 0
905 0 0
906 1 0
907 0 0
908 0 0
909 0 0
910 1 0
911 0 0
912 1 0
913 0 0
914 1 0
915 1 0
916 0 0
917 1 0
918 1 0
919 0 0
920 0 0
921 0 0
922 1 0
923 0 0
924 1 0
925 0 0
926 0 0
927 0 0
928 1 0
929 1 0
930 0 0
931 0 0
932 0 0
933 1 0
934 0 0
935 1 0
936 0 0
937 0 0
938 0 0
939 0 0
940 0 0
941 0 0
942 1 0
943 0 0
944 1 0
945 0 0
946 1 0
947 0 0
948 0 0
949 1 0
950 0 0
951 0 0
952 1 0
953 1 0
954 0 0
955 0 0
956 0 0
957 0 0
958 0 0
959 0 0
960 1 0
961 1 0
962 0 0
963 1 0
964 1 0
965 0 0
966 1 0
967 0 0
968 1 0
969 0 0
970 0 0
971 1 0
972 1 0
973 1 0
974 0 0
975 1 0
976 0 0
977 0 0
978 1 0
979 2 0
980 0 0
981 0 0
982 0 0
983 0 0
984 0 0
985 1 0
986 3 0
987 0 0
988 0 0
, , SUB.BLOCK = D
REP
ENTRY_NO 1 2
840 0 0
850 0 0
851 0 0
852 0 1
853 0 0
854 0 0
855 0 0
857 0 0
858 0 1
859 0 0
860 0 1
861 0 0
862 0 0
863 0 0
864 0 1
865 0 0
866 0 0
867 0 0
868 0 0
869 0 0
870 0 0
871 0 0
872 0 0
873 0 0
874 0 0
875 0 1
876 0 0
877 0 0
878 0 1
879 0 0
880 0 1
881 0 1
882 0 1
883 0 1
884 0 1
885 0 0
886 0 0
887 0 0
888 0 0
889 0 0
890 0 0
891 0 0
892 0 1
893 0 0
894 0 0
895 0 0
896 0 0
897 0 1
898 0 0
899 0 1
900 0 0
901 0 0
902 0 1
903 0 0
904 0 0
905 0 0
906 0 0
907 0 0
908 0 0
909 0 0
910 0 0
911 0 0
912 0 0
913 0 1
914 0 1
915 0 1
916 0 0
917 0 1
918 0 1
919 0 0
920 0 0
921 0 1
922 0 1
923 0 0
924 0 0
925 0 0
926 0 0
927 0 0
928 0 0
929 0 1
930 0 1
931 0 0
932 0 0
Can someone please give me an idea of what is going wrong?
Thanks !
I've been able to create a situation like this. Consider this dataset:
> junk
trt rep blk y
1 A 1 1 -1.17415687
2 B 1 1 -0.20084854
3 C 1 1 0.64797806
4 A 1 2 -1.69371434
5 B 1 2 -0.35835442
6 C 1 2 1.35718782
7 A 1 3 0.20510482
8 B 1 3 1.00857651
9 C 1 3 -0.20553167
10 A 2 4 0.31261523
11 B 2 4 0.47989115
12 C 2 4 1.27574085
13 A 2 5 -0.79209520
14 B 2 5 1.07151315
15 C 2 5 -0.04222769
16 A 2 6 -0.80571767
17 B 2 6 0.80442988
18 C 2 6 1.73526561
This has 6 complete blocks, separately labeled, with 3 blocks per rep. Not obvious, but true, is that rep is a numeric variable taking values 1 and 2, while blk is a factor with six levels, 1 through 6:
> sapply(junk, class)
trt rep blk y
"factor" "numeric" "factor" "numeric"
With this complete dataset, I have no problem obtaining EMMs for modeling situations parallel to what was used in the original posting. However, if I use only a subset of these data, it is different. Consider:
> samp
[1] 1 2 3 5 8 11 13 15 16
> junk.lm = lm(y ~ trt + rep + blk, data = junk, subset = samp)
> emmeans(junk.lm, "trt")
trt emmean SE df asymp.LCL asymp.UCL
A nonEst NA NA NA NA
B nonEst NA NA NA NA
C nonEst NA NA NA NA
Results are averaged over the levels of: blk
Confidence level used: 0.95
Again, recall that rep is numeric in this model. If instead, I make rep a factor:
> junk.lmf = lm(y ~ trt + factor(rep) + blk, data = junk, subset = samp)
> emmeans(junk.lmf, "trt")
NOTE: A nesting structure was detected in the fitted model:
blk %in% rep
If this is incorrect, re-run or update with `nesting` specified
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk, rep
Confidence level used: 0.95
We get non-NA estimates, in part because emmeans is able to detect that blk is nested in rep, and thus performs the EMM computations separately in each rep. Note in the annotations in this last output that averaging is done over the 2 reps and 6 blocks; whereas in junk.lm averaging is done only over blocks, while rep, a numeric variable, is set at its average. Compare the reference grids for the two models:
> ref_grid(junk.lm)
'emmGrid' object with variables:
trt = A, B, C
rep = 1.4444
blk = 1, 2, 3, 4, 5, 6
> ref_grid(junk.lmf)
'emmGrid' object with variables:
trt = A, B, C
rep = 1, 2
blk = 1, 2, 3, 4, 5, 6
Nesting structure: blk %in% rep
An additional option is to avoid the nesting issue by simply omitting rep from the model:
> junk.lm.norep = lm(y ~ trt + blk, data = junk, subset = samp)
> emmeans(junk.lm.norep, "trt")
trt emmean SE df lower.CL upper.CL
A -0.6262635 0.4707099 1 -6.607200 5.354673
B 0.0789780 0.3546191 1 -4.426885 4.584841
C 0.6597377 0.5191092 1 -5.936170 7.255646
Results are averaged over the levels of: blk
Confidence level used: 0.95
Note that exactly the same results are produced. The reason is that the levels of blk already predict the levels of rep, so there is no need for rep to be in the model.
In summary:
The situation is due in part to the fact that there are missing data
and in part because rep was in the model as a numeric predictor rather than a factor.
In your situation, I suggest re-fitting the model with factor(REP) instead of REP as a numeric predictor. This may be enough to produce estimates.
If, indeed, as in my example, the SUB.BLOCK levels predict the REP levels, just leave REP out of the model altogether.
EMMs are obtained by averaging predictions over 2 reps and 5 blocks (or maybe more?). Look at
coef(ASM_YIELD_1)
If any of the rep or block effects are NA, then you can’t estimate all of the rep or block effects, and that makes the average of them non-estimable.
You can see exactly which factor combinations are non-estimable by doing:
summary(ref_grid(ASM_YIELD_1))
addendum
Here is a reformatting of the tables I requested in comments:
ENTRY ---------- BLOCK -------------
NO A B C D
840 0 0 0 0 0 0 0 0
850 0 0 0 0 0 0 0 0
851 0 0 0 0 0 0 0 0
852 0 0 0 0 0 0 0 1
853 0 0 1 0 0 0 0 0
854 0 0 0 0 0 0 0 0
855 0 0 0 0 0 0 0 0
857 0 0 0 0 1 0 0 0
858 0 0 0 0 0 0 0 1
859 0 0 0 0 1 0 0 0
... etc ...
This is extremely sparse data. I think there are two more blocks not shown. But I see very few instances where a given ENTRY_NO is observed in more than one rep or block. So I think it is seriously over-fitting to try to account for rep or block effects in this model.
MAYBE omitting REP from the model will make it work. MAYBE re-fitting the model with factor(REP) in place of REP will enable emmeans to detect a nesting structure. Otherwise, there's some really subtle dependence in the blocking structure and treatments, and I don't know what to suggest.
My raw data contains numeric values, with the headers repeated every 20 lines.
I wish to remove the repeated header lines in R. I know it's quite easy with a sed command, but I want the R script to handle all the tidying steps.
> raw <- read.delim("./vmstat_archiveadm_s.txt")
> head(raw)
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100097600 97779056 285 426 53 0 0 0 367 86 6 0 0 1206 7711 2630 1 0 99
0 0 0 96908192 94414488 7 31 0 0 0 0 0 120 0 0 0 2782 5775 5042 2 0 97
0 0 0 96889840 94397152 0 1 0 0 0 0 0 122 0 0 0 2737 5591 4958 2 0 97
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100065744 97745448 282 422 52 0 0 0 363 89 6 0 0 1233 7690 2665 1 0 99
0 0 0 96725312 94222040 7 31 0 0 0 0 0 604 69 0 0 5269 5703 7910 2 1 97
0 0 0 96668624 94170784 0 0 0 0 0 0 0 155 53 0 0 3047 5505 5317 2 0 97
0 0 0 96595104 94086816 0 0 0 0 0 0 0 174 0 0 0 2879 5567 5068 2 0 97
1 0 0 96521376 94025504 0 0 0 0 0 0 0 121 0 0 0 2812 5471 5105 2 0 97
0 0 0 96503256 93994896 0 0 0 0 0 0 0 121 0 0 0 2731 5621 4981 2 0 97
(...)
Try this, where df is the data frame:
x <- seq(6, 100, 21)
df <- df[-x, ]
seq() generates a sequence of numbers from 6 to 100 in steps of 21; in this case:
6 27 48 69 90
Remove those rows from the data frame with df[-x, ].
EDIT:
To do this for the entire data frame, replace 100 with the number of rows, i.e.
x <- seq(6, nrow(df), 21)
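A pattern-based alternative sidesteps row counting entirely by dropping any line that looks like a header before parsing. This is a sketch; the header-matching regex is an assumption based on the sample output, and the vector below stands in for readLines("vmstat_archiveadm_s.txt"):

```r
# toy stand-in for readLines("vmstat_archiveadm_s.txt")
lines <- c(" kthr      memory            page",
           " r b w   swap      free  re  mf",
           "0 0 0 100097600 97779056 285 426",
           " kthr      memory            page",
           " r b w   swap      free  re  mf",
           "1 0 0 96908192 94414488 7 31")

# drop every line that starts like either of the two header lines
is_header <- grepl("^\\s*kthr|^\\s*r\\s+b\\s+w", lines)
raw <- read.table(text = lines[!is_header])
```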
Instead of processing the output in R, I would clean it at the generation level:
$ vmstat 1 | egrep -v '^ kthr|^ r'
0 0 0 154831904 153906536 215 471 0 0 0 0 526 33 32 0 0 1834 14171 5253 0 0 99
1 0 0 154805632 153354296 9 32 0 0 0 0 0 0 0 0 0 1463 610 739 0 0 100
1 0 0 154805632 153354696 0 4 0 0 0 0 0 0 0 0 0 1408 425 634 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1341 381 658 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1299 353 610 0 0 100
1 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1319 375 638 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1308 367 614 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1336 395 650 0 0 100
1 0 0 154805632 153354640 0 0 0 0 0 0 0 44 44 0 0 1594 378 878 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 66 65 0 0 1763 382 1015 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1312 411 645 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1342 390 647 0 0 100
Please bear with an R newbie here. I'm trying to follow along with a tutorial published on the wonderful flowingdata.com site by using my own data in place of the .RData file included in the tutorial. The file, "unisexCnts.RData", contains unisex names and the number of times each was used in different years:
head(unisexCnts)
1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 312 273 274 263 0 273 0 255 235 195 222 0 195 0 193 225 204 196 177 156
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 190 0 0 325 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 243 219 214
Alva 177 132 159 178 145 138 131 119 119 119 127 97 107 97 83 76 83 90 84 81 58 68
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 1264 0 0 0 0 0 0 0 1579 2145 2488 0 0
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 595 664
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 561 565 556 643 747 722 0 742 0 0
Alva 54 57 53 54 59 40 62 0 48 0 28 0 34 0 0 0 0 0 0 0 0 26
Amari 0 0 0 0 0 0 11 0 0 0 0 0 16 0 22 0 32 0 0 0 0 0
Angel 2561 2690 2779 0 0 3004 3108 3113 3187 2924 3100 3341 3229 3101 3532 3889 4066 4520 0 0 0 0
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Addison 778 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 0 19 0 14 0 0 0 0 0 24 0 0 0 0 0
Amari 0 0 0 0 0 0 1181 1397 1333 1299 1265 1550 1780 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run it through the str() function I get the following:
str(unisexCnts)
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
..$ : chr [1:83] "1930" "1931" "1932" "1933" ...
My data is in a csv file, called "boysnames.csv":
,2013,2012,2011,2010,2009,2008
Jack,764,831,840,935,1068,1151
James,746,773,796,746,711,737
Daniel,678,683,711,792,842,828
Conor,610,639,709,726,776,857
I am trying to overwrite the unisexCnts.RData with the contents of my boysnames.csv. So to restructure and get my csv ready to be saved, I did:
Step1.
unisexCnts <- read.csv("boysnames.csv", stringsAsFactors = FALSE, header = TRUE, check.names = FALSE)
Step2.
unisexCnts<-as.matrix(unisexCnts)
Step3.
save(unisexCnts, file = "unisexCnts.RData") ## save as .RData, overwriting the original unisexCnts.RData in the dir
However, I get the following after steps 1 and 2, which doesn't match the structure of the original. Any ideas/pointers?
> str(unisexCnts)
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "" "2013" "2012" "2011" ...
When you read a .csv file you can specify the column that should become the row names using the row.names argument.
I recreated your data quickly and read it in using the following code:
read.csv('test.csv', stringsAsFactors = F,head = T, row.names = 1)
This saves you from having to restructure the data after loading it, and gives you the structure you are looking for as well:
unisexCnts = read.csv('test.csv', stringsAsFactors = F,head = T, row.names = 1)
unisexCnts = as.matrix(unisexCnts)
str(unisexCnts)
int [1:4, 1:6] 764 746 678 610 831 773 683 639 840 796 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:4] "Jack" "James" "Dan" "Conor"
..$ : chr [1:6] "X2013" "X2012" "X2011" "X2010" ...
However I get the following after steps 1 & 2 which doesn't match the
structure of the original, any ideas/pointers?
In the original unisexCnts the names are specified as row names. That's why the first attribute is
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
To replicate that in your example, you can set the names as row names by specifying
rownames(unisexCnts) <- yourVectorOfNames
This will make the output match.
The reason this line:
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
doesn't match this line
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
is the same. You have the names included in the matrix itself. A matrix can only hold data of a single type, so by including character data (the names) you convert the whole matrix to character strings.
In summary: remove the name vector from the matrix, use it as row names, and the str() of your two objects will match.
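A minimal sketch of the coercion being described, with made-up values:

```r
# mixing character and numeric columns in one matrix coerces everything to character
m_bad <- cbind(name = c("Jack", "James"), y2013 = c(764, 746))
typeof(m_bad)    # "character"

# keep the names as row names instead, and the matrix stays numeric
df     <- data.frame(y2013 = c(764, 746), row.names = c("Jack", "James"))
m_good <- as.matrix(df)
typeof(m_good)   # "double"
```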