aov won't return p-values in R

I have a strange problem with ANOVA summary results from summary(aov()).
Here is the problem. I have a dataset with 6 columns. Here is a sample of the dataset:
Panelist Prod.ID Overall Appearance Flavor Texture
1 196 9 9 9 9
1 239 7 9 6 7
1 354 9 8 8 7
1 427 3 8 2 3
1 577 8 9 7 9
1 638 7 9 7 8
1 772 6 4 3 3
1 852 9 8 9 8
2 196 8 8 7 8
2 239 7 7 7 7
2 354 6 5 6 4
2 427 6 7 4 6
2 577 3 6 3 5
2 638 4 4 5 2
2 772 6 2 6 7
2 852 7 6 7 6
3 196 7 9 7 8
3 239 8 9 8 8
3 354 8 8 7 8
3 427 7 8 6 8
3 577 8 9 8 8
3 638 8 9 8 7
3 772 5 8 8 8
3 852 8 9 8 8
The first two columns are the factors and the rest are the response variables. Panelist and Prod.ID are treated by summary() as continuous variables, so I converted them to factors with as.factor().
After that conversion I ran the ANOVA with the model Overall ~ Panelist * Prod.ID, but the summary results only show this:
> summary(aov(Overall ~ Prod.ID * Panelist, data = paneElements))
Df Sum Sq Mean Sq
Prod.ID 7 189.6 27.085
Panelist 160 1252.9 7.830
Prod.ID:Panelist 1116 3116.1 2.792
I can't find any reason why the F-values and p-values are missing.
Any help would be much appreciated.
Thanks a lot.

You have only one observation for each combination of Prod.ID and Panelist (at least in your sample data), so the full interaction model has as many parameters as there are observations. That leaves zero residual degrees of freedom, so there is no residual mean square to use as the denominator of the F-test, which is most likely why no F- or p-values are reported.
For example, when I add an extra observation for Prod.ID 196 for just one level of Panelist, I get F and p values reported in the output.
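A minimal sketch of the effect, using made-up data rather than the OP's: one observation per Panelist x Prod.ID cell saturates the interaction model, so summary() has nothing to test against; dropping the interaction (or adding replication) brings the F- and p-values back.
# hypothetical data: exactly one observation per cell
d <- expand.grid(Panelist = factor(1:3), Prod.ID = factor(c(196, 239, 354)))
d$Overall <- c(9, 7, 8, 7, 8, 6, 9, 8, 7)
summary(aov(Overall ~ Prod.ID * Panelist, data = d))  # zero residual df, so no F or p values
summary(aov(Overall ~ Prod.ID + Panelist, data = d))  # additive model leaves residual df; F and p appear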

Related

Pairing People in R with Loop

I am trying to create pen-pal pairs in R. The problem is that I can't figure out how to loop it so that once I pair one person, that person and their pair are removed from the pool, and the loop continues until everyone has a pair.
I have already rated the criteria for pairing people and computed a score for every person measuring how well they would pair with each other person. I then added the two scores of each pair together to get a sense of how good the pair is overall (not perfect, but good enough for these purposes). I then found each person's ideal match and ordered these matches from the most picky person to the least picky (basically from the person with the lowest best-pair score to the one with the highest). I also found each person's 2nd-8th best matches (there will probably be about 300 people in the data).
A test of the best-matches is below:
indexed_fake apply.fin_fake..1..max. X1 X2 X3 X4 X5 X6 X7 X8
14 14 151 3 9 8 4 10 12 2 6
4 4 177 9 5 8 7 11 3 10 12
9 9 177 4 11 3 6 10 7 12 5
5 5 179 7 4 11 3 12 10 8 5
10 10 179 12 10 2 9 3 5 6 4
13 13 182 8 1 12 11 10 5 3 2
1 1 185 7 1 3 8 6 13 2 11
7 7 185 1 12 5 7 4 6 9 11
3 3 187 12 3 8 5 9 1 2 10
8 8 190 8 12 13 3 4 11 1 6
2 2 191 6 12 11 10 3 4 5 1
6 6 191 2 11 7 1 6 9 10 8
11 11 193 12 6 9 5 2 8 11 4
12 12 193 11 3 8 7 12 10 2 5
Columns X1-X8 are the 8 best pairs for the people listed in the first column. With this example every person would ideally get paired with someone in their top 8, maximizing the overall pair compatibility as another user mentioned. Every person would get one pair.
Any help is appreciated!
This is not a specific answer, but it's easier to write in this space. You have a classic assignment optimization problem. These problems can be solved using packages in R. You have to assign preference weights to your feasible pairings. So for example 14-3 could be assigned 8 points, 14-9 7 points, 14-8 6 points, ..., 14-6 1 point. Note that 3-14 would be assigned no points, because while 14 likes 3, 3 does not like 14. The preference score for any x-y, y-x pairing could be the weight for the x-y preference plus the weight of the y-x preference.
The optimization model would then choose the weighted pairs that maximize the total satisfaction across all of the pairings.
If you have 300 people, I can't think of an alternative algorithm that could be implemented as simply.
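As a rough illustration only (a greedy heuristic, not the optimal assignment an optimization package would give), here is a base R sketch. It assumes a symmetric matrix scores in which entry (x, y) is x's preference weight for y plus y's weight for x, and it repeatedly pairs the highest-scoring remaining couple.
# greedy pairing on a hypothetical symmetric score matrix (n people, n even)
greedy_pairs <- function(scores) {
  diag(scores) <- -Inf                      # nobody can be paired with themselves
  remaining <- seq_len(nrow(scores))
  pairs <- list()
  while (length(remaining) >= 2) {
    sub <- scores[remaining, remaining, drop = FALSE]
    best <- which(sub == max(sub), arr.ind = TRUE)[1, ]   # best remaining pair
    pairs[[length(pairs) + 1]] <- remaining[best]
    remaining <- setdiff(remaining, remaining[best])
  }
  pairs
}
set.seed(1)
w <- matrix(runif(14 * 14), 14, 14)           # hypothetical one-way preference weights
greedy_pairs(w + t(w))                        # symmetrize: x's weight for y plus y's for x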

Cumulative function for a specific range of values

I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that holds a value for each age. I want to generate a cumulative population count such that the values are the total population for ages 1 and above, 2 and above, and so on. That is, the resulting vector should be (203, 180, ... and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
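# reverse so the running total accumulates from the highest age downward, then reverse back to the original order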
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x, y); the locations are integer values corresponding to grid IDs)
- grid (containing all grid IDs (x, y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e. basically a SUMIF of the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? I have tried different things like merge, sapply, etc., but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e. for Store1 it is the sum of TOT_P for the grid fields with X = 2-4 and Y = 5-7.
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (df.pop is assumed to already hold the grid table copied from the OP; construction not shown)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
  inner_join(df.pop, by = 'k') %>%
  mutate(x.diff = abs(StoreX - GridX), y.diff = abs(StoreY - GridY)) %>%
  filter(x.diff <= 1, y.diff <= 1) %>%
  group_by(StoreName) %>%
  summarise(StoreX = max(StoreX), StoreY = max(StoreY), tot.pop = sum(TOT_P))
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
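For comparison, a base R sketch of the same logic (assuming the df.store and df.pop data frames built above), using sapply instead of a cross join:
# for each store, sum TOT_P over all grid cells within +/- 1 of the store's coordinates
df.store$SUM_P <- sapply(seq_len(nrow(df.store)), function(i) {
  sum(df.pop$TOT_P[abs(df.pop$GridX - df.store$StoreX[i]) <= 1 &
                     abs(df.pop$GridY - df.store$StoreY[i]) <= 1])
})
df.store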

Error styles[[id]] : Indexing out of bounds in "riverplot" package

I am struggling to create a Sankey diagram using the "riverplot" package. I did not manage to create a minimal toy example, so I have to include the riverplot object created by makeRiver() here. makeRiver() did not throw any errors, so I thought it would work, but it does not. I hope one of you has an idea.
This is the riverplot object I am trying to plot:
$edges
ID N1 N2 Value
102 102 2 10 3
106 106 6 10 2
111 111 2 11 7
115 115 6 11 2
119 119 1 12 1
120 120 2 12 72
121 121 3 12 4
125 125 7 12 7
127 127 9 12 4
129 129 2 13 14
134 134 7 13 2
136 136 9 13 1
145 145 9 14 1
147 147 2 15 4
152 152 7 15 1
154 154 9 15 1
156 156 2 16 1
165 165 2 17 69
166 166 3 17 3
167 167 4 17 1
168 168 5 17 1
169 169 6 17 2
170 170 7 17 7
171 171 8 17 1
172 172 9 17 8
$nodes
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
$styles
list()
attr(,"class")
[1] "list" "riverplot"
Calling riverplot(river) ("river" being the variable I saved the object in), I get the following output (sorry that the error message is in German; it says "indexing out of bounds"):
[1] "calculating positions"
[1] 21.9
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
[1] "done"
[1] "drawing edges"
Fehler in styles[[id]] : Indizierung außerhalb der Grenzen
I THINK I traced the problem to the function riverplot:::getattr, but I am not sure about that. Any help?
In case anyone is interested in the solution to the problem I described above: I used numeric IDs for nodes (1, 2, 3, ...) and edges (101, 102, ...).
makeRiver() checks if IDs are duplicated among nodes and edges and throws an error if that happens. However, it does NOT check if the IDs are purely numeric, which is apparently the source of the error.
I now added an "E" at the beginning of the edge IDs (E1, E2, ...) and an "N" at the beginning of node IDs (N1, N2, ...). It works now!
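A minimal sketch of that fix, assuming the node and edge data frames are called nodes and edges and have the columns shown in the question (ID, N1, N2, Value for edges; ID, labels, x for nodes):
library(riverplot)
# make the IDs non-numeric by prefixing them, then rebuild and plot the object
nodes$ID <- paste0("N", nodes$ID)
edges$N1 <- paste0("N", edges$N1)
edges$N2 <- paste0("N", edges$N2)
edges$ID <- paste0("E", edges$ID)
river <- makeRiver(nodes, edges)
riverplot(river)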

Using R to generate a time series of averages from a very large dataset without using for loops

I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame which contains the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data go all the way back to 1901). I did this successfully, but I feel like my solution is somewhat ad hoc and does not take advantage of the specific capabilities of R. Here is my solution:
#citations by category
citescat <- data.frame("chem"  = integer(37),
                       "comp"  = integer(37),
                       "drugs" = integer(37),
                       "ee"    = integer(37),
                       "mech"  = integer(37),
                       "other" = integer(37),
                       "year"  = 1970:2006)
for (i in 1:37) {
  for (j in 1:6) {
    citescat[i, j] <- mean(p$allcites[p$appyear == (i + 1969) & p$cat == j], na.rm = TRUE)
  }
}
I am wondering if there is a simple way to do this without using the nested for loops which would make it easy to make small tweaks to it. It is hard for me to pin down exactly what I am looking for other than this, but my code just looks ugly to me and I suspect that there are better ways to do this in R.
Joran is right - here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is on a simplified dataset:
library(plyr)
library(reshape2)  # dcast() comes from reshape2
p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = TRUE))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat)
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.
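For reference, a base R sketch of the same computation on the real data, restricted to 1970-2006 (assuming the allcites, appyear, and cat columns described in the question):
# mean citations per year and category, then spread categories into columns
cites <- aggregate(allcites ~ appyear + cat,
                   data = subset(p, appyear %in% 1970:2006),
                   FUN = mean, na.rm = TRUE)
citescat <- reshape(cites, idvar = "appyear", timevar = "cat", direction = "wide")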
