Removing 0 Values from Subgroups in R studio - r

Hello, I am new to R and am trying to take out the rows where Counts is 0.
I might be making a simple mistake, but I can't seem to figure it out.
This is what I've tried:
Simplecount <- na.omit[,Simple$Counts >=1,]
object of type 'closure' is not subsettable
The data table below is called Simple
row.names Time INT Counts
19 234 13703 4 0
20 235 13803 4 0
21 236 13903 4 0
22 237 14104 5 1
23 238 14204 5 0
61 276 18403 6 0
62 277 18503 7 1
63 278 18604 7 0
64 279 18704 7 0

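The error comes from na.omit[...]: na.omit is a function (a "closure"), and square brackets cannot be used to subset a function. You also have a misplaced comma before the condition. Subset the data frame itself, with the row condition placed before the comma.
Try:
> Simple[Simple$Counts >= 1, ]
row.names Time INT Counts
22 237 14104 5 1
62 277 18503 7 1
Or, in this particular case, even the following would work, because as.logical() turns 0 into FALSE and any non-zero count into TRUE:
Simple[as.logical(Simple$Counts), ]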
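A minimal sketch of the fix, rebuilding only the Time, INT and Counts columns of the table from the question (the extra row.names column is left out here for brevity, which is an assumption about what matters for the example):

```r
# Rebuild part of the question's table (Time, INT, Counts only)
Simple <- data.frame(
  Time   = c(13703, 13803, 13903, 14104, 14204, 18403, 18503, 18604, 18704),
  INT    = c(4, 4, 4, 5, 5, 6, 7, 7, 7),
  Counts = c(0, 0, 0, 1, 0, 0, 1, 0, 0)
)

# Keep rows with Counts >= 1: the condition goes BEFORE the comma (rows),
# and nothing after the comma means "all columns"
Simplecount <- Simple[Simple$Counts >= 1, ]

# na.omit() is a function, so it is called with parentheses, not indexed with []
Simplecount <- na.omit(Simplecount)
print(Simplecount)
```

If the data can contain NA in Counts, calling na.omit() on the result as above drops the all-NA rows that logical subsetting would otherwise produce.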

Related

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
Then I subtract the columns from each other to get totals.
# count number open during each year
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood everything except pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following; I called the data set foo.
To create the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). This should require less typing. Since the dates in the file are character strings, I extracted the year using substr(). If you have a Date object, you need to do something else, such as format(OpenDate, "%Y"). Once you have the year, you group the data by it and count how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then I converted the output to wide format with spread() and did the rest of the calculation as you demonstrated in your code.
The final output differs from what you have in your question, but it was identical to the one I obtained by running your code. I hope this helps.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
  select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
  filter(NCESDist == "0622710" & !is.na(Charter))
# Label each school as <Charter>_<1 = open, 0 = closed>, then count per opening year
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
       year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
  count(year, group) %>%
  spread(key = group, value = n, fill = 0) %>%
  mutate(net_chart = Y_1 - Y_0,
         net_pub = N_1 - N_0,
         cum_chart = cumsum(net_chart),
         cum_pub = cumsum(net_pub),
         total = cum_chart + cum_pub,
         pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
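The same counting idea can also be sketched in base R, without dplyr or tidyr. This is not the code from the answer above, only an illustration on the six rows from the question's dput() (all of them charter schools, so the public-school columns stay at zero). The column names Y_open, Y_closed, N_open and N_closed are made up for this example.

```r
# The six rows from the dput() in the question
schools <- data.frame(
  CDSCode = c("19647330100289", "19647330100297", "19647330100669",
              "19647330100677", "19647330100743", "19647330100750"),
  OpenDate = structure(c(12324, 12297, 12240, 12299, 12634, 12310), class = "Date"),
  ClosedDate = structure(c(NA, 15176, NA, NA, NA, NA), class = "Date"),
  Charter = c("Y", "Y", "Y", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Opening year and open/closed status for every school
schools$year   <- as.integer(format(schools$OpenDate, "%Y"))
schools$status <- ifelse(is.na(schools$ClosedDate), "open", "closed")

# One group label per school; fixed levels keep empty groups as zero columns
schools$group <- factor(paste(schools$Charter, schools$status, sep = "_"),
                        levels = c("Y_open", "Y_closed", "N_open", "N_closed"))

# Cross-tabulate year by group (this replaces the four ifelse() columns + aggregate())
counts <- as.data.frame.matrix(table(schools$year, schools$group))
counts$year <- as.integer(rownames(counts))

# Net and running totals, following the arithmetic used in the question
counts$cum_chart <- cumsum(counts$Y_open - counts$Y_closed)
counts$cum_pub   <- cumsum(counts$N_open - counts$N_closed)
counts$total     <- counts$cum_chart + counts$cum_pub
print(counts)
```

Like the original code, this attributes a closure to the year the school opened rather than the year it closed, so it reproduces the question's arithmetic rather than correcting it.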

Nested IF Else in R - SAT/ACT test

I have the following data set
df <- data.frame(student=c(1,2,3,4,5,6,7,8,9), sat=c(365,0,545,630,385,410,0,655,0), act=c(28,20,0,0,16,17,35,29,21))
student sat act
1 365 28
2 0 20
3 545 0
4 630 0
5 385 16
6 410 17
7 0 35
8 655 29
9 0 21
and I'd like to create a new field with the following conditions
If there is an SAT score > 0 use SAT score
If SAT=0, then convert the ACT to an SAT score using the rubric here. (When there was a range in the SAT score, I just used the median.)
ACT SAT
8 200
9 210
10 220
11 225
12 250
13 285
14 325
15 360
16 385
17 410
18 440
19 465
20 485
21 505
22 525
23 545
24 560
25 575
26 595
27 615
28 635
29 655
30 675
31 700
32 725
33 750
34 775
35 790
36 800
This is one heck of an ifelse statement. I've tried this:
df$newgrade=-ifelse(ACT=8,200, ifelse (ACT=9,210, ifelse(ACT=10,220, ifelse (ACT=11,225, ACT=12,250, ifelse(ACT=13,285, ifelse (ACT=14,325, ACT=15,D, ifelse(ACT=16,C, ifelse (ACT=17,B, ACT=18,D, ifelse(ACT=19,C, ifelse (ACT=20,B, ACT=21,D, ifelse(ACT=22,C, ifelse (ACT=23,B, ACT=24,D, ifelse(ACT=25,C, ifelse (ACT=26,B, ACT=27,D, ifelse(ACT=28,C, ifelse (ACT=29,B, ACT=30,D, ifelse(ACT=31,C, ifelse (ACT=32,B, ACT=33,D, ifelse(ACT=34,C, ifelse (ACT=35,B, ACT=36,D))))))))))))))))))))
I tried to follow the example at the bottom of this page but it didn't work.
Does anyone have any ideas on how best to achieve this new field?
Thank you for any assistance you may bring.
Let's call conversion the table you want to use to convert values when df$sat == 0. You can do something like this:
df$newgrade <- ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat)
EDIT: If you want to add another condition, so that df$newgrade is 0 when both df$sat == 0 and df$act == 0, you can include another ifelse:
df$newgrade <- ifelse(df$sat == 0 & df$act == 0, 0, ifelse(df$sat == 0, conversion$SAT[match(df$act, conversion$ACT)], df$sat))
or use df[is.na(df)] <- 0 after creating the column df$newgrade, because in those cases (df$sat == 0 and df$act == 0) match() finds no ACT score of 0 in the table and you'll have NAs.
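For completeness, here is a self-contained sketch of the lookup approach using the data and rubric from the question. The conversion data frame is built by hand from the ACT/SAT table above; its name and column names follow the answer.

```r
# Data from the question
df <- data.frame(
  student = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  sat     = c(365, 0, 545, 630, 385, 410, 0, 655, 0),
  act     = c(28, 20, 0, 0, 16, 17, 35, 29, 21)
)

# ACT -> SAT rubric from the question, one row per ACT score 8..36
conversion <- data.frame(
  ACT = 8:36,
  SAT = c(200, 210, 220, 225, 250, 285, 325, 360, 385, 410, 440, 465, 485,
          505, 525, 545, 560, 575, 595, 615, 635, 655, 675, 700, 725, 750,
          775, 790, 800)
)

# match() returns, for each ACT score, its row position in the rubric
# (NA when the score is not in the table, e.g. act == 0)
converted <- conversion$SAT[match(df$act, conversion$ACT)]

# Use the SAT score when present, otherwise the converted ACT score
df$newgrade <- ifelse(df$sat > 0, df$sat, converted)
print(df)
```

The lookup replaces the whole chain of nested ifelse() calls, and changing the rubric only means editing the conversion table.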

Trying to get confidence/prediction intervals with `predict.lm` in R, but I keep getting an error regarding my dichotomous variable

I have a dataset that looks like this:
time size type
1 22 151 0
2 31 92 0
3 26 175 0
4 35 31 0
5 27 104 0
6 5 277 0
7 17 210 0
8 24 120 0
9 9 290 0
10 21 238 0
11 33 164 1
12 20 272 1
13 16 295 1
14 43 68 1
15 36 85 1
16 26 224 1
17 25 166 1
18 18 305 1
19 35 124 1
20 19 246 1
My aim is simple: to run a regression and get confidence/prediction intervals.
I run my regression like so:
fit.lm1<-lm(time~size+type,data=project3)
I then want to get 95% confidence intervals and 95% prediction intervals for the mean time when size is equal to 200. I want CIs/PIs for type = 0 and type = 1. My code is:
new_val <- data.frame(size= c(200,200),type=c(1,0))
CI<-predict(fit.lm1,newdata=new_val,interval="confidence")
PI<-predict(fit.lm1,newdata=new_val,interval="prediction")
I get the following errors:
Error: variable 'type' was fitted with type "factor" but type "numeric" was supplied
In addition: Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable 'type' is not a factor
I'm not quite sure how to interpret this. R shows that type is a "factor" with two levels, so I don't know what's wrong.
Any help would be great! Thank you.
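One plausible reading of the error, offered as a sketch rather than a confirmed diagnosis: the model was fitted with type stored as a factor, while new_val supplies type as plain numbers. Assuming type was converted with factor() before fitting (the question does not show that step), building new_val with a factor that has the same levels lets predict() run:

```r
# Data from the question; type is assumed to have been converted to a factor
project3 <- data.frame(
  time = c(22, 31, 26, 35, 27, 5, 17, 24, 9, 21,
           33, 20, 16, 43, 36, 26, 25, 18, 35, 19),
  size = c(151, 92, 175, 31, 104, 277, 210, 120, 290, 238,
           164, 272, 295, 68, 85, 224, 166, 305, 124, 246),
  type = factor(rep(c(0, 1), each = 10))
)

fit.lm1 <- lm(time ~ size + type, data = project3)

# New data must use the same class as the fitted variable:
# a factor with the same levels, not a numeric vector
new_val <- data.frame(
  size = c(200, 200),
  type = factor(c(1, 0), levels = levels(project3$type))
)

# 95% is the default level for both interval types
CI <- predict(fit.lm1, newdata = new_val, interval = "confidence")
PI <- predict(fit.lm1, newdata = new_val, interval = "prediction")
print(CI)
print(PI)
```

The alternative is to leave type numeric in both the fitting data and new_val; what matters is that the two agree.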

mlogit duplicate 'row.names' are not allowed

I am new to R and want to use the mlogit function.
However, after putting my data into a data frame and running
x <- mlogit.data(mlogit, choice="PlacedN", shape="long", alt.var="RaceID")
I get the error: duplicate 'row.names' are not allowed
I can upload my file if needed. I've spent days trying to get this to work, so any help will be appreciated.
You may want to put "RaceID" into the alt.levels argument instead of alt.var. From the mlogit.data help file:
alt.levels
the name of the alternatives: if null, for a wide data.frame, they are guessed from the variable names and the choice variable (both should be the same), for a long data.frame, they are guessed from the alt.var argument.
Give this a try.
library(mlogit)
m <- read.csv("mlogit.csv")
mlogd <- mlogit.data(m, choice="PlacedN", shape="long", alt.levels="RaceID")
head(mlogd)
# RaceID PlacedN RSP TrA JoA aDS bDS mDS aDH bDH mDH LDH MR eMR
# 1.RaceID 20119552 TRUE 3.00 13 12 0 0 0 0 0 0 0 0 131
# 2.RaceID 20119552 FALSE 4.00 23 26 91 94 94 139 153 145 153 150 150
# 3.RaceID 20119552 FALSE 0.83 15 15 99 127 99 150 153 150 153 159 159
# 4.RaceID 20119552 FALSE 18.00 21 15 0 0 0 0 0 0 0 0 131
# 5.RaceID 20119552 FALSE 16.00 16 12 92 127 92 134 135 134 135 136 136
# 6.RaceID 20119617 TRUE 2.50 12 10 0 0 0 0 0 0 0 0 152

How to remove rows with 0 values using R

Hi, I am using a matrix of gene expression fragment counts to calculate differentially expressed genes. I would like to know how to remove the rows whose values are all 0. That would make my data set more compact and give fewer spurious results in the downstream analysis I do using this matrix.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
For now I only want to remove those rows where all the frag count columns are 0. If a row has some values that are 0 and others that are non-zero, I would like to keep that row intact, as you can see in my example above.
Please let me know how to do this.
df[apply(df[,-1], 1, function(x) !all(x==0)),]
A lot of options for doing this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is using rowwise() (replace col1, col2, col3 with your own count columns):
library(tidyverse)
df <- df %>%
  rowwise() %>%
  filter(sum(c(col1, col2, col3)) != 0)
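As a base R alternative to the approaches above, here is a short sketch on the example data from the question. It uses rowSums() on a logical matrix, which avoids both apply() and any package dependency; the names keep and filtered are chosen for this illustration.

```r
# Example data from the question
df <- data.frame(
  gene   = sprintf("XLOC_%06d", 1:11),
  ZPT.1  = c(3516, 342, 2000, 143, 0, 0, 0, 0, 0, 7, 63),
  ZPT.0  = c(626, 82, 361, 30, 0, 0, 0, 0, 0, 1, 10),
  ZPT.2  = c(1277, 185, 867, 67, 0, 0, 0, 0, 0, 5, 19),
  ZPT.3  = c(770, 72, 438, 37, 0, 0, 0, 0, 0, 3, 15),
  PDGT.1 = c(4309, 835, 454, 90, 0, 0, 1, 0, 0, 0, 92),
  PDGT.0 = c(9030, 1095, 687, 236, 0, 0, 3, 0, 0, 1, 228),
  stringsAsFactors = FALSE
)

# df[, -1] != 0 is a TRUE/FALSE matrix over the count columns;
# a row is kept when at least one of its entries is non-zero
keep <- rowSums(df[, -1] != 0) > 0
filtered <- df[keep, ]
print(filtered)
```

Counting non-zero entries, rather than summing the values themselves, also stays correct if the matrix ever contains negative numbers that cancel out.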
