Sample function in R does not produce a uniformly distributed sample

I am creating a survey. There are 31 possible questions, and I would like each respondent to answer a subset of 3, administered in a random order. Participants should not answer the same question twice.
I have created a table with a participant index and a column for the question indices of the 1st, 2nd and 3rd questions.
Using the code below, index 31 is under-represented in my sample.
I think I am using the sample function incorrectly. Could someone please help me?
SgPassCode <- data.frame(PassCode=rep(0,10000), QIndex1=rep(0,10000),
QIndex2=rep(0,10000), QIndex3=rep(0,10000))
set.seed(123)
for (n in 1:10000){
temp <- sample(31,3,FALSE)
SgPassCode[n,1] <- n
SgPassCode[n,-1] <- temp
}
d <- c(SgPassCode[,2],SgPassCode[,3],SgPassCode[,4])
hist(d)

The issue is with hist and the way it picks its bins, not sample. Proof is the output of table:
table(d)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1003 967 938 958 989 969 988 956 983 990 921 1001 982 1016 1013 959
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 907 918 918 991 931 945 998 1017 1029 980 959 886 947 987 954
If you want hist to "work", give each index its own bin: hist(d, breaks = 0:31) will do it (as will many other break specifications).
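To see the binning effect directly, here is a minimal sketch that compares the default breaks with one bin per index. It regenerates the sample with replicate() instead of the loop above (an assumption made for brevity), and it assumes the default Sturges rule yields width-2 bins on data like this, so that 31 ends up alone in the last bin.

```r
set.seed(123)
# 10000 respondents x 3 distinct question indices, flattened to one vector
d <- as.vector(replicate(10000, sample(31, 3, replace = FALSE)))

# plot = FALSE returns the bin definitions and counts without drawing
default_bins <- hist(d, plot = FALSE)                # default (Sturges) breaks
unit_bins    <- hist(d, breaks = 0:31, plot = FALSE) # one bin per index

print(default_bins$breaks)           # inspect which indices share a bin
print(tail(default_bins$counts, 1))  # last default bin: only index 31 falls here
print(range(unit_bins$counts))       # per-index counts are all of similar size
```

With breaks = 0:31 every bin is a single index, so the bar heights match table(d).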


How to extract the most common words in a dataset (multiple rows) in R?

I'm using a data set where there are 2 columns: 'type' and 'posts'. Under 'type' there are 16 different personality types (multiple entries, there are ~8000 rows), and the 'posts' contain different words. I wish to be able to choose one 'type' and be able to find the most commonly used words from all the entries of this specific 'type'.
I've been trying to create separate data frames with only one 'type' and all of its 'posts' but I am not sure where to go from here.
For reference, I am aiming to create a final summary table of only one of each of the 16 'type's and most common words from 'posts'.
For the data set: https://www.kaggle.com/datasnaek/mbti-type
When using:
most_common_words <- mbti_train %>%
tidyr::separate_rows(posts, sep = "\\s+") %>%
group_by(type) %>%
count(type, posts) %>%
top_n(20, n)
This is outputted:
type posts n
1 ENFJ dont 1106
2 ENFJ feel 1104
3 ENFJ friend 713
4 ENFJ get 924
5 ENFJ go 714
6 ENFJ I'm 2003
7 ENFJ I've 689
8 ENFJ know 1001
9 ENFJ like 1762
10 ENFJ love 775
11 ENFJ make 664
12 ENFJ one 875
13 ENFJ peopl 1183
14 ENFJ person 698
15 ENFJ realli 887
16 ENFJ thing 816
17 ENFJ think 1400
18 ENFJ time 762
19 ENFJ want 632
20 ENFJ would 779
21 ENFP dont 3847
22 ENFP feel 3183
23 ENFP friend 2290
24 ENFP get 3229
25 ENFP go 2479
26 ENFP I'm 6779
27 ENFP I've 2324
28 ENFP know 3325
29 ENFP like 6478
30 ENFP love 2857
31 ENFP make 2239
32 ENFP one 3175
33 ENFP peopl 3675
34 ENFP realli 3074
35 ENFP say 2205
36 ENFP thing 2874
37 ENFP think 4613
38 ENFP time 2672
39 ENFP want 2121
40 ENFP would 2602
As you can see, the posts are the same 20 words even though the type is different.
We can split the sentences into words using separate_rows, count their occurrences, and then select the top 20 words for each type using top_n.
library(dplyr)
mbti_train %>%
tidyr::separate_rows(posts, sep = "\\s+") %>%
count(type, posts) %>%
group_by(type) %>%
top_n(20, n)
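As a quick check of the pipeline, here is a small sketch on made-up data. The data frame toy is a hypothetical stand-in for mbti_train, and slice_max() is swapped in for top_n() (it is the newer dplyr verb for the same job); both choices are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the Kaggle data set
toy <- data.frame(
  type  = c("ENFJ", "ENFJ", "INTP"),
  posts = c("like think like", "think like", "logic logic code"),
  stringsAsFactors = FALSE
)

top_words <- toy %>%
  separate_rows(posts, sep = "\\s+") %>%  # one row per word
  count(type, posts) %>%                  # word frequency within each type
  group_by(type) %>%                      # rank within type, not overall
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_words)
```

Replacing n = 1 with n = 20 gives the top 20 words per type.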

How to use NSE inside fct_reorder() in ggplot2

I would like to know how to use NSE (Non-Standard Evaluation) expression in fct_reorder() in ggplot2 to replicate charts for different data frames.
This is an example of data frame that I use to draw a chart:
travel_time_br30 travel_time_br30_int time_reduction shift not_shift total
1 0-30 0 10 2780 3268 6048
2 0-30 0 20 2779 3269 6048
3 0-30 0 30 2984 3064 6048
4 0-30 0 40 3211 2837 6048
5 30-60 30 10 2139 2007 4146
6 30-60 30 20 2159 1987 4146
7 30-60 30 30 2363 1783 4146
8 30-60 30 40 2478 1668 4146
9 60-90 60 10 764 658 1422
10 60-90 60 20 721 701 1422
11 60-90 60 30 782 640 1422
12 60-90 60 40 801 621 1422
13 90-120 90 10 296 224 520
14 90-120 90 20 302 218 520
15 90-120 90 30 317 203 520
16 90-120 90 40 314 206 520
17 120-150 120 10 12 10 22
18 120-150 120 20 10 12 22
19 120-150 120 30 10 12 22
20 120-150 120 40 13 9 22
21 150-180 150 10 35 21 56
22 150-180 150 20 40 16 56
23 150-180 150 30 40 16 56
24 150-180 150 40 35 21 56
share
1 45.96561
2 45.94907
3 49.33862
4 53.09193
5 51.59190
6 52.07429
7 56.99469
8 59.76845
9 53.72714
10 50.70323
11 54.99297
12 56.32911
13 56.92308
14 58.07692
15 60.96154
16 60.38462
17 54.54545
18 45.45455
19 45.45455
20 59.09091
21 62.50000
22 71.42857
23 71.42857
24 62.50000
These are the scripts to draw a chart from above data frame:
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"
test %>% ggplot(.,aes_(x=as.name(x.var),y=as.name("share"),group=as.name(g.var))) +
geom_line(size=1.4, aes(
color=fct_reorder(travel_time_br30,order(travel_time_br30_int))))
As I have several data frames that have different fields, such as access_time_br30 and access_time_br30_int instead of travel_time_br30 and travel_time_br30_int, I set two variables (g.var and go.var) so that I can easily replicate multiple charts with the same script.
As I need to reorder the factor groups numerically, in particular ordering travel_time_br30 by travel_time_br30_int, I am using fct_reorder() inside ggplot(., aes_(...)). However, if I use aes_ with fct_reorder() in geom_line(), as in the following script, it returns an error saying Error: `f` must be a factor (or character vector).
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
fct_reorder() does not seem to have an NSE version like fct_reorder_().
Is it impossible to use aes_ and fct_reorder() together, or are there other solutions?
Based on my novice working knowledge of tidy-eval, you could reorder your factor in mutate() before passing the data into ggplot() and achieve your result.
Sorry, I couldn't easily read in your table above because of the line wrapping, so I made a new example based on mtcars that I think captures your intent (let me know if it doesn't).
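library(dplyr)    # mutate(), mutate_at(), %>%
library(forcats)  # fct_reorder()
library(ggplot2)
library(rlang)    # sym(), quo_name(), !!
mtcars2 <- mutate(mtcars,
gear_int = 6 - gear,
gear_intrev = rev(gear_int)) %>%
mutate_at(vars(cyl, gear), as.factor)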
gg_reorder <- function(data, col_var, col_order) {
eq_var <- sym(col_var) # sym is flexible and my novice preference
eq_ord <- sym(col_order)
data %>% mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord) ) %>%
ggplot(aes_(~mpg, ~hp, color = eq_var)) +
geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns Edwin Thoen wrote up a bunch of common cases.
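For comparison, here is a sketch of the same idea applied to the asker's column names, using plain [[ ]] indexing for the reorder and the .data pronoun inside aes(). The function name plot_share, the cut-down data frame, and the choice of time_reduction as the x variable are assumptions for illustration (x.var is not defined in the question).

```r
library(ggplot2)
library(forcats)

# Hypothetical cut-down version of the asker's data frame
test <- data.frame(
  travel_time_br30     = rep(c("0-30", "30-60", "120-150"), each = 2),
  travel_time_br30_int = rep(c(0, 30, 120), each = 2),
  time_reduction       = rep(c(10, 20), times = 3),
  share                = c(46.0, 45.9, 51.6, 52.1, 54.5, 45.5)
)

plot_share <- function(data, x.var, g.var, go.var) {
  # Reorder the grouping column by its numeric companion before plotting
  data[[g.var]] <- fct_reorder(data[[g.var]], data[[go.var]])
  ggplot(data, aes(x = .data[[x.var]], y = .data[["share"]],
                   group = .data[[g.var]], color = .data[[g.var]])) +
    geom_line()
}

p <- plot_share(test, "time_reduction",
                "travel_time_br30", "travel_time_br30_int")

# Levels now follow the numeric order rather than the alphabetical one
print(levels(fct_reorder(test$travel_time_br30, test$travel_time_br30_int)))
```

Because the column names arrive as strings, the same function can be reused for access_time_br30 and access_time_br30_int by changing only the arguments.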

Nested IF Else in R - SAT/ACT test

I have the following data set
df <- data.frame(student=c(1,2,3,4,5,6,7,8,9), sat=c(365,0,545,630,385,410,0,655,0), act=c(28,20,0,0,16,17,35,29,21))
student sat act
1 365 28
2 0 20
3 545 0
4 630 0
5 385 16
6 410 17
7 0 35
8 655 29
9 0 21
and I'd like to create a new field with the following conditions
If there is an SAT score > 0 use SAT score
If SAT=0, then convert the ACT to an SAT score using the rubric here. (When there was a range in the SAT score, I just used the median.)
ACT SAT
8 200
9 210
10 220
11 225
12 250
13 285
14 325
15 360
16 385
17 410
18 440
19 465
20 485
21 505
22 525
23 545
24 560
25 575
26 595
27 615
28 635
29 655
30 675
31 700
32 725
33 750
34 775
35 790
36 800
This is one heck of an ifelse statement. I've tried this:
df$newgrade=-ifelse(ACT=8,200, ifelse (ACT=9,210, ifelse(ACT=10,220, ifelse (ACT=11,225, ACT=12,250, ifelse(ACT=13,285, ifelse (ACT=14,325, ACT=15,D, ifelse(ACT=16,C, ifelse (ACT=17,B, ACT=18,D, ifelse(ACT=19,C, ifelse (ACT=20,B, ACT=21,D, ifelse(ACT=22,C, ifelse (ACT=23,B, ACT=24,D, ifelse(ACT=25,C, ifelse (ACT=26,B, ACT=27,D, ifelse(ACT=28,C, ifelse (ACT=29,B, ACT=30,D, ifelse(ACT=31,C, ifelse (ACT=32,B, ACT=33,D, ifelse(ACT=34,C, ifelse (ACT=35,B, ACT=36,D))))))))))))))))))))
I tried to follow the example at the bottom of this page but it didn't work.
Does anyone have any ideas on how best to achieve this new field?
Thank you for any assistance you may bring.
Let conversion be the lookup table (with columns ACT and SAT) that you want to use to convert values when df$sat == 0. You can then do something like this:
df$newgrade<-ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat)
EDIT: If you also want df$newgrade to be 0 when both df$sat == 0 and df$act == 0, you can wrap it in another ifelse:
df$newgrade<-ifelse(df$sat == 0 & df$act == 0, 0, ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat))
or run df[is.na(df)] <- 0 after creating the column df$newgrade, because in those cases (df$sat == 0 and df$act == 0) match() finds no ACT score of 0 in the table and you get NAs.
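Put together, the lookup approach can be sketched as follows. The conversion data frame is built here from the ACT/SAT table in the question; its name and column names follow the answer, and the NA handling at the end is the second variant described above.

```r
df <- data.frame(student = 1:9,
                 sat = c(365, 0, 545, 630, 385, 410, 0, 655, 0),
                 act = c(28, 20, 0, 0, 16, 17, 35, 29, 21))

# Lookup table copied from the question: ACT scores 8..36 and their SAT equivalents
conversion <- data.frame(
  ACT = 8:36,
  SAT = c(200, 210, 220, 225, 250, 285, 325, 360, 385, 410, 440, 465, 485,
          505, 525, 545, 560, 575, 595, 615, 635, 655, 675, 700, 725, 750,
          775, 790, 800)
)

# Use the SAT score when present, otherwise look up the ACT equivalent
df$newgrade <- ifelse(df$sat == 0,
                      conversion$SAT[match(df$act, conversion$ACT)],
                      df$sat)

# ACT scores missing from the table (e.g. 0) produce NA; set those to 0
df$newgrade[is.na(df$newgrade)] <- 0

print(df)
```

match() returns the row position of each ACT score in the table, so a single vectorised lookup replaces the whole chain of nested ifelse() calls.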

Sorting a matrix by a vector with correct numeration

I am trying to sort a matrix by a vector, and it is only partially working. I have a matrix g (two columns, id and nobs) that I sort by the vector id.
My code is:
g[order(id),]
The sorting is OK however I end up with this result :
id nobs
6 30 932
5 29 711
4 28 475
3 27 338
2 26 586
1 25 463
And I am looking for output like this:
id nobs
1 30 932
2 29 711
3 28 475
4 27 338
5 26 586
6 25 463
What is the first column, numbered 1 to 6, and how do I change it?
R 3.2.1, Windows 10
The first number of each line is just the name of the row. If you want/need to fix it, you can just use the following (after the ordering):
m <- g[order(id),]
rownames(m) <- 1:nrow(g)
and it should look the way you want it.
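A small self-contained sketch of the same fix, assuming g is a data frame (the printed output in the question looks like one) and that the sort is descending by id; the values are copied from the question.

```r
g <- data.frame(id = 25:30, nobs = c(463, 586, 338, 475, 711, 932))

# Sorting keeps each row's original name attached to it
m <- g[order(g$id, decreasing = TRUE), ]
print(rownames(m))  # "6" "5" "4" "3" "2" "1"

# For a data frame, assigning NULL restores the default 1..n row names
rownames(m) <- NULL
print(m)
```

On a data frame, rownames(m) <- NULL has the same effect as rownames(m) <- 1:nrow(m); on a plain matrix it removes the row names instead.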

Error in sort.list(y) while using strata() in R

When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team,size=n.h,method="srswor"),
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
How can I get this stratified sample? The variable 'Team' is of type factor.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a character vector of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (wor), but the requested sample sizes are bigger than the available data in some strata. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
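The capping behaviour described above (return every row when a stratum is smaller than the requested size) can be sketched in a few lines of base R. This is not the stratified function from the Gist, only an illustration of the idea; the name stratified_sketch is hypothetical, and the data is a three-column subset of the example data in the question.

```r
zz <- "Team Player Salary
ANA molinjo0 335000
ANA percitr0 7833333
ARI bautida0 4000000
ARI estalbo0 550000
ARI finlest0 7000000
ARI koplomi0 330000
ARI sparkst0 500000
ARI villaos0 325000
ARI webbbr01 335000
ATL francju0 750000
ATL hamptmi0 14625000
ATL marreel0 3000000
ATL ortizru0 6200000
BAL surhobj0 800000"
data <- read.table(text = zz, header = TRUE)

# Hypothetical helper: sample up to `size` rows per stratum, without replacement
stratified_sketch <- function(df, group, size) {
  rows_by_group <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(rows_by_group, function(idx) {
    k <- min(size, length(idx))        # cap at the stratum size
    idx[sample.int(length(idx), k)]    # sample.int avoids sample()'s length-1 quirk
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
output <- stratified_sketch(data, "Team", 5)
print(table(output$Team))
```

Strata with fewer than 5 rows (ANA, ATL, BAL here) are returned whole, so no sampling with replacement is needed.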
