I have the following data set
df <- data.frame(student=c(1,2,3,4,5,6,7,8,9), sat=c(365,0,545,630,385,410,0,655,0), act=c(28,20,0,0,16,17,35,29,21))
student sat act
1 365 28
2 0 20
3 545 0
4 630 0
5 385 16
6 410 17
7 0 35
8 655 29
9 0 21
and I'd like to create a new field with the following conditions
If there is an SAT score > 0 use SAT score
If SAT=0, then convert the ACT to an SAT score using the rubric here. (When there was a range in the SAT score, I just used the median.)
ACT SAT
8 200
9 210
10 220
11 225
12 250
13 285
14 325
15 360
16 385
17 410
18 440
19 465
20 485
21 505
22 525
23 545
24 560
25 575
26 595
27 615
28 635
29 655
30 675
31 700
32 725
33 750
34 775
35 790
36 800
This is one heck of an ifelse statement. I've tried this:
df$newgrade=-ifelse(ACT=8,200, ifelse (ACT=9,210, ifelse(ACT=10,220, ifelse (ACT=11,225, ACT=12,250, ifelse(ACT=13,285, ifelse (ACT=14,325, ACT=15,D, ifelse(ACT=16,C, ifelse (ACT=17,B, ACT=18,D, ifelse(ACT=19,C, ifelse (ACT=20,B, ACT=21,D, ifelse(ACT=22,C, ifelse (ACT=23,B, ACT=24,D, ifelse(ACT=25,C, ifelse (ACT=26,B, ACT=27,D, ifelse(ACT=28,C, ifelse (ACT=29,B, ACT=30,D, ifelse(ACT=31,C, ifelse (ACT=32,B, ACT=33,D, ifelse(ACT=34,C, ifelse (ACT=35,B, ACT=36,D))))))))))))))))))))
I tried to follow the example at the bottom of this page but it didn't work.
Does anyone have any ideas on how best to achieve this new field?
Thank you for any assistance you may bring.
Let's call conversion the table you want to use to convert values when df$sat==0. You can do something like this:
df$newgrade<-ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat)
EDIT: If you want to add another condition, so that df$newgrade is 0 when both df$sat == 0 and df$act == 0, you can nest another ifelse:
df$newgrade<-ifelse(df$sat == 0 & df$act == 0, 0, ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat))
Alternatively, run df[is.na(df)] <- 0 after creating the column df$newgrade, because in those cases (df$sat == 0 and df$act == 0) match() finds no ACT of 0 in the table and you'll get NAs.
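For completeness, here is a minimal sketch of the whole lookup, assuming the conversion table is built as a data frame with columns ACT and SAT taken from the rubric in the question (the name conversion and its column names follow the answer above; how you actually load the table is up to you):

```r
# Data from the question
df <- data.frame(student = 1:9,
                 sat = c(365, 0, 545, 630, 385, 410, 0, 655, 0),
                 act = c(28, 20, 0, 0, 16, 17, 35, 29, 21))

# Conversion table typed in from the rubric (ACT 8 to 36)
conversion <- data.frame(
  ACT = 8:36,
  SAT = c(200, 210, 220, 225, 250, 285, 325, 360, 385, 410, 440, 465, 485,
          505, 525, 545, 560, 575, 595, 615, 635, 655, 675, 700, 725, 750,
          775, 790, 800))

# Keep the SAT score when there is one, otherwise look up the ACT equivalent
df$newgrade <- ifelse(df$sat == 0,
                      conversion$SAT[match(df$act, conversion$ACT)],
                      df$sat)
df$newgrade
```

match() returns the row position of each ACT value in the table, so one vectorised lookup replaces the whole chain of nested ifelse calls.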
I would like to know how to use a non-standard evaluation (NSE) expression with fct_reorder() inside ggplot2 code, so that I can replicate charts for different data frames.
This is an example of a data frame that I use to draw a chart:
travel_time_br30 travel_time_br30_int time_reduction shift not_shift total
1 0-30 0 10 2780 3268 6048
2 0-30 0 20 2779 3269 6048
3 0-30 0 30 2984 3064 6048
4 0-30 0 40 3211 2837 6048
5 30-60 30 10 2139 2007 4146
6 30-60 30 20 2159 1987 4146
7 30-60 30 30 2363 1783 4146
8 30-60 30 40 2478 1668 4146
9 60-90 60 10 764 658 1422
10 60-90 60 20 721 701 1422
11 60-90 60 30 782 640 1422
12 60-90 60 40 801 621 1422
13 90-120 90 10 296 224 520
14 90-120 90 20 302 218 520
15 90-120 90 30 317 203 520
16 90-120 90 40 314 206 520
17 120-150 120 10 12 10 22
18 120-150 120 20 10 12 22
19 120-150 120 30 10 12 22
20 120-150 120 40 13 9 22
21 150-180 150 10 35 21 56
22 150-180 150 20 40 16 56
23 150-180 150 30 40 16 56
24 150-180 150 40 35 21 56
share
1 45.96561
2 45.94907
3 49.33862
4 53.09193
5 51.59190
6 52.07429
7 56.99469
8 59.76845
9 53.72714
10 50.70323
11 54.99297
12 56.32911
13 56.92308
14 58.07692
15 60.96154
16 60.38462
17 54.54545
18 45.45455
19 45.45455
20 59.09091
21 62.50000
22 71.42857
23 71.42857
24 62.50000
This is the script that draws a chart from the data frame above:
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"
test %>% ggplot(.,aes_(x=as.name(x.var),y=as.name("share"),group=as.name(g.var))) +
geom_line(size=1.4, aes(
color=fct_reorder(travel_time_br30,order(travel_time_br30_int))))
As I have several data frames that have different fields, such as access_time_br30 and access_time_br30_int instead of travel_time_br30 and travel_time_br30_int, I set two variables (g.var and go.var) so that I can easily replicate multiple charts with the same script.
As I need to reorder the factor groups numerically, in particular to order travel_time_br30 by travel_time_br30_int, I am using the fct_reorder() function inside ggplot(., aes_(...)). However, if I use aes_ with fct_reorder() in geom_line(), as in the following script, it returns the error "Error: f must be a factor (or character vector)".
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
fct_reorder() does not seem to have an NSE version like fct_reorder_().
Is it impossible to use both aes_ and fct_reorder() in the same script, or are there any other solutions?
Based on my novice working knowledge of tidy-eval, you could transform your factor order in mutate() before passing the data into ggplot() and achieve your result.
Sorry, I couldn't easily read in your table above because of the line wrapping, so I made a new example based on mtcars that I think captures your intent. (Let me know if it doesn't.)
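library(dplyr)    # mutate(), mutate_at(), %>%
library(forcats)  # fct_reorder()
library(ggplot2)
library(rlang)    # sym(), quo_name()
mtcars2 <- mutate(mtcars,
                  gear_int = 6 - gear,
                  gear_intrev = rev(gear_int)) %>%
  mutate_at(vars(cyl, gear), as.factor)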
gg_reorder <- function(data, col_var, col_order) {
eq_var <- sym(col_var) # sym is flexible and my novice preference
eq_ord <- sym(col_order)
data %>% mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord) ) %>%
ggplot(aes_(~mpg, ~hp, color = eq_var)) +
geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns Edwin Thoen wrote up a bunch of common cases.
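The same reorder-before-plotting idea can also be sketched in base R, without tidy-eval, when the column names arrive as strings. This is only an illustration: reorder_by is a hypothetical helper name, the small test data frame is a cut-down stand-in for the one in the question, and stats::reorder() is swapped in for fct_reorder().

```r
# Hypothetical helper: reorder a grouping column by a numeric column,
# with both column names supplied as strings.
reorder_by <- function(data, group_col, order_col) {
  # reorder() sorts the factor levels by the mean of order_col per level
  data[[group_col]] <- reorder(factor(data[[group_col]]), data[[order_col]])
  data
}

# Cut-down stand-in for the question's data frame
test <- data.frame(
  travel_time_br30 = rep(c("0-30", "30-60", "60-90",
                           "90-120", "120-150", "150-180"), each = 2),
  travel_time_br30_int = rep(c(0, 30, 60, 90, 120, 150), each = 2)
)

g.var  <- "travel_time_br30"
go.var <- "travel_time_br30_int"

levels(factor(test[[g.var]]))   # default: alphabetical order
test2 <- reorder_by(test, g.var, go.var)
levels(test2[[g.var]])          # ordered by the numeric column
```

Once the column is reordered in the data, it can be mapped to color by name as before, and no reordering call is needed inside the aesthetic mapping.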
I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following. I called the data set foo. To build the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). I guess this asks you to do less typing. Since the dates are stored as character, I extracted the year using substr(). If you have a date object, you need to do something else. Once you have the year, you group the data by it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then I converted the output to wide format with spread() and did the rest of the calculation as you demonstrated in your code. The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this helps.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
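The answer mentions that substr() only suits character dates. As a hedged sketch of the Date-object case, the snippet below rebuilds the six rows from the question's dput() output as a plain data frame (schools is my own name for it), takes the year with format(), and builds the same Charter_open key in base R:

```r
# Six rows rebuilt from the dput() in the question (dates are days since 1970-01-01)
schools <- data.frame(
  OpenDate   = as.Date(c(12324, 12297, 12240, 12299, 12634, 12310),
                       origin = "1970-01-01"),
  ClosedDate = as.Date(c(NA, 15176, NA, NA, NA, NA), origin = "1970-01-01"),
  Charter    = c("Y", "Y", "Y", "Y", "Y", "Y")
)

# Year from a Date column: format() instead of substr()
schools$year <- as.integer(format(schools$OpenDate, "%Y"))

# Same grouping key as in the answer: Charter flag plus 1 = open, 0 = closed
schools$group <- paste(schools$Charter,
                       as.numeric(is.na(schools$ClosedDate)), sep = "_")

# Count schools per opening year and group
counts <- table(schools$year, schools$group)
counts
```

With only six rows this is just a shape check; on the full data set the columns of counts correspond to the N_0, N_1, Y_0 and Y_1 columns in the output above.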
I need to distribute some days across the year.
I have 213 activities and 247 days. I need to plan these activities so that they cover as much of the period as possible.
I subtract the activities from the total days, which gives 34 in this case, and then divide the total days by that result: 247/34 = 7.26...
From this number I know that roughly every seventh day should have no activity.
To code this part I am doing the following,
where day is the loop variable of a "for" that iterates over a data frame of dates, and integer is the integer part of 7.26, in this case 7:
if(day%%integer==0) {
aditional <- 0
} else {
aditional <- 1
}
#
if(day%%7==0) {
aditional <- 0
} else {
aditional <- 1
}
The result will be:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The days without activity (7 and 14) are shown in bold.
This works, but it is not as precise as I want.
I know I need to use the decimal part of the result (the .26 in 7.26...), but I don't know how to do it.
Can you help me, please?
Thanks, and sorry for my English.
Make these 34 days the non-activity days:
round((247/34) * seq(34))
giving:
[1] 7 15 22 29 36 44 51 58 65 73 80 87 94 102 109 116 124 131 138
[20] 145 153 160 167 174 182 189 196 203 211 218 225 232 240 247
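As a small sketch of how this could plug into the planning step, the snippet below derives the activity days as everything that is not a non-activity day. The variable names are my own, not from the question:

```r
total_days <- 247
activities <- 213
free_days  <- total_days - activities   # 34 days without activity

# Spread the free days evenly, keeping the fractional part of 247/34
off_days <- round((total_days / free_days) * seq_len(free_days))

# Every remaining day gets one activity
activity_days <- setdiff(seq_len(total_days), off_days)

length(off_days)
length(activity_days)
```

Because the step 247/34 is kept as a fraction and only rounded at the end, the gaps between free days alternate between 7 and 8 instead of always being 7.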
Hello, I am new to R and am trying to remove the rows that have a Counts value of 0.
I am really new and might be making a simple mistake, but I can't seem to figure it out.
This is what I've tried:
Simplecount <- na.omit[,Simple$Counts >=1,]
object of type 'closure' is not subsettable
The data table below is called Simple
row.names Time INT Counts
19 234 13703 4 0
20 235 13803 4 0
21 236 13903 4 0
22 237 14104 5 1
23 238 14204 5 0
61 276 18403 6 0
62 277 18503 7 1
63 278 18604 7 0
64 279 18704 7 0
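You have a misplaced comma in your code, and na.omit is a function, so indexing it with [ is what triggers the "object of type 'closure' is not subsettable" error.
Try:
> Simple[Simple$Counts >= 1, ]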
row.names Time INT Counts
22 237 14104 5 1
62 277 18503 7 1
Or, in this particular case, even the following would work:
Simple[as.logical(Simple$Counts), ]
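Here is a self-contained sketch using the rows shown in the question, so the subsetting can be tried directly. The data frame is retyped by hand and the row.names column is left out for brevity:

```r
# Rows retyped from the question
Simple <- data.frame(
  Time   = c(13703, 13803, 13903, 14104, 14204, 18403, 18503, 18604, 18704),
  INT    = c(4, 4, 4, 5, 5, 6, 7, 7, 7),
  Counts = c(0, 0, 0, 1, 0, 0, 1, 0, 0)
)

# Logical index goes before the comma; nothing after it keeps all columns
Simplecount <- Simple[Simple$Counts >= 1, ]

# Equivalent with subset(), which also drops rows where Counts is NA
Simplecount2 <- subset(Simple, Counts != 0)

Simplecount
```

If Counts can contain NA, the bracket form returns rows of NAs for those entries, so subset() or wrapping the condition in which() may be the safer choice.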
I am creating a survey. There are 31 possible questions, and I would like each respondent to answer a subset of 3. I would like them to be administered in a random order. Participants should not answer the same question twice.
I have created a table matrix with a participant index, and a column for the question indices for the 1st, 2nd and 3rd questions.
Using the code below, index 31 is under-represented in my sample.
I think I am using the sample function incorrectly. I was hoping someone could please help me?
SgPassCode <- data.frame(PassCode=rep(0,10000), QIndex1=rep(0,10000),
QIndex2=rep(0,10000), QIndex3=rep(0,10000))
set.seed(123)
for (n in 1:10000){
temp <- sample(31,3,FALSE)
SgPassCode[n,1] <- n
SgPassCode[n,-1] <- temp
}
d <- c(SgPassCode[,2],SgPassCode[,3],SgPassCode[,4])
hist(d)
The issue is with hist and the way it picks its bins, not sample. Proof is the output of table:
table(d)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1003 967 938 958 989 969 988 956 983 990 921 1001 982 1016 1013 959
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 907 918 918 991 931 945 998 1017 1029 980 959 886 947 987 954
If you want hist to "work", hist(d, breaks = 0:31) (and certainly a lot of other things) will work.
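To check this without a plot, the sketch below draws the question indices without the loop and compares the histogram counts with a plain tabulation. Here replicate() is swapped in for the for loop, so the draws will not match the table above, and the matrix name q is my own. With the default breaks, hist() most likely bins the integers in pairs and leaves 31 alone in the last bin, which would explain why it looked under-represented.

```r
set.seed(123)

# One row per participant, three distinct question indices per row
q <- t(replicate(10000, sample(31, 3, replace = FALSE)))

# No participant gets the same question twice
no_repeats <- all(apply(q, 1, anyDuplicated) == 0)

d <- as.vector(q)

# One bin per integer: (0,1], (1,2], ..., (30,31]
h <- hist(d, breaks = 0:31, plot = FALSE)

# Bin counts now agree with a direct count of each index
counts_match <- all(h$counts == tabulate(d, nbins = 31))

no_repeats
counts_match
```

Since sample() is called with replace = FALSE, uniqueness within a participant is guaranteed by construction, and the per-index counts should all land near 30000/31.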