I'm trying to create new variables with mutate in dplyr and I can't understand my error. I've tried everything and have not stumbled on this issue in the past.
I have a large data set of over a million observations; I've only provided the first 20 observations here.
This is what my data looks like:
data1 <- read.table(header=TRUE, text="IDnr visit time year end event survival
7 1 04/09/06 2006 31/12/06 0 118
7 2 04/09/06 2007 31/12/07 0 483
7 3 04/09/06 2008 31/12/08 0 849
7 4 04/09/06 2009 31/12/09 0 1214
7 5 04/09/06 2010 31/12/10 0 1579
7 6 04/09/06 2011 31/12/11 0 1944
20 1 24/10/03 2003 31/12/03 0 68
20 2 24/10/03 2004 31/12/04 0 434
20 3 24/10/03 2005 31/12/05 0 799
20 4 24/10/03 2006 31/12/06 0 1164
20 5 24/10/03 2007 31/12/07 0 1529
20 6 24/10/03 2008 31/12/08 0 1895
20 7 24/10/03 2009 31/12/09 0 2260
20 8 24/10/03 2010 31/12/10 0 2625
20 9 24/10/03 2011 31/12/11 0 2990
87 1 17/01/06 2006 31/12/06 0 348
87 2 17/01/06 2007 31/12/07 0 713
87 3 17/01/06 2008 31/12/08 0 1079
87 4 17/01/06 2009 31/12/09 0 1444
87 5 17/01/06 2010 31/12/10 0 1809")
I should say that the date and time variables do not actually have this format in my dataset; they are coded as POSIXct with the format "%Y-%m-%d". They somehow get reformatted when I paste them into Stack Overflow and apply the code formatting.
Okay, the problem is that I'm trying to create new survival time variables in the same dataset. One is for a Cox regression model with start and stop times (survival is the stop time, and the new start variable should be called survcox).
I'm also trying to do a Poisson regression where the offset variable (i.e. the survival time variable) should be called survpois. This is the code I'm trying to use:
data2 <- data1 %>%
  group_by(IDnr) %>%
  mutate(survcox = ifelse(visit==1, 0, lag(survival)),
         year_aar = substr(data1$year, 1,4),
         first_day = as.POSIXct(paste0(year_aar, "-01-01-")),
         survpois = as.numeric(data1$end - first_day)+1) %>%
  mutate(survpois = ifelse(year_aar > first_day, as.numeric(end - year_aar),
                           survpois)) %>%
  ungroup()
I receive an error in this step!
Error: incompatible size (1345000), expecting 6 (the group size) or 1
I have no idea why I get this error, what it means, or why my code doesn't work.
All the help I can get is appreciated, thanks in advance!
It's because you reference the variable as data1$year, which doesn't work on grouped data (the same applies to data1$end).
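A minimal sketch of what goes wrong, using a made-up data frame d (not the asker's data): the d$x form passes the full column into every group, while the bare column name is evaluated per group.

```r
library(dplyr)

d <- data.frame(g = c(1, 1, 2), x = 1:3)

# Referring to the column as d$x inside a grouped mutate() feeds the
# full length-3 vector into each group (sizes 2 and 1), which errors:
# d %>% group_by(g) %>% mutate(y = d$x + 1)

# Using the bare column name lets dplyr evaluate it within each group:
res <- d %>% group_by(g) %>% mutate(y = x + 1)
```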
I teased apart your code and found a few issues. One was the thing I mentioned in the comment above. The second was the class of end: if the data you provided is representative, end is a factor, so it needs to be converted to a date object before doing arithmetic on it. The third was year_aar > first_day: first_day is a date object whereas year_aar is character, and what you actually want to compare is the inclusion date (time) against first_day. Given those, I modified your code.
data1 %>%
  group_by(IDnr) %>%
  mutate(survcox = ifelse(visit == 1, 0, lag(survival)),
         year_aar = substr(year, 1, 4),
         first_day = as.POSIXct(paste0(year_aar, "-01-01")),
         survpois = as.numeric(as.POSIXct(end, format = "%d/%m/%y") - first_day) + 1) %>%
  mutate(survpois = ifelse(as.POSIXct(time, format = "%d/%m/%y") > first_day,
                           as.numeric(as.POSIXct(end, format = "%d/%m/%y") -
                                      as.POSIXct(time, format = "%d/%m/%y")),
                           survpois)) %>%
  ungroup()
Here is a bit of the outcome.
# IDnr visit time year end event survival survcox year_aar first_day survpois
#1 7 1 04/09/06 2006 31/12/06 0 118 0 2006 2006-01-01 118
#2 7 2 04/09/06 2007 31/12/07 0 483 118 2007 2007-01-01 365
#3 7 3 04/09/06 2008 31/12/08 0 849 483 2008 2008-01-01 366
#4 7 4 04/09/06 2009 31/12/09 0 1214 849 2009 2009-01-01 365
#5 7 5 04/09/06 2010 31/12/10 0 1579 1214 2010 2010-01-01 365
I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter == "Y" & is.na(pubschls$ClosedDate), 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter == "N" & is.na(pubschls$ClosedDate), 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter == "Y" & !is.na(pubschls$ClosedDate), 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter == "N" & !is.na(pubschls$ClosedDate), 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
Then I subtract the columns from each other to get totals.
# count number open during each year
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following. I called the data set foo. When creating the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). I guess this would ask you to do less typing. Since the dates are stored as character, I extracted the year using substr(); if you have a date object, you would need to do something else. Once you have year, you group the data by it and calculate how many schools exist for each type using count(). This part is the equivalent of your aggregate() code. Then I converted the output to a wide-format data frame with spread() and did the rest of the calculation as you demonstrated in your code. The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
       year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
Sorry for the very specific question, but I have a file as such:
Adj Year man mt wm wmt by bytl gr grtl
3 careless 1802 0 126 0 54 0 13 0 51
4 careless 1803 0 166 0 72 0 1 0 18
5 careless 1804 0 167 0 58 0 2 0 25
6 careless 1805 0 117 0 5 0 5 0 7
7 careless 1806 0 408 0 88 0 15 0 27
8 careless 1807 0 214 0 71 0 9 0 32
...
560 mean 1939 21 5988 8 1961 0 1152 0 1512
561 mean 1940 20 5810 6 1965 1 914 0 1444
562 mean 1941 10 6062 4 2097 5 964 0 1550
563 mean 1942 8 5352 2 1660 2 947 2 1506
564 mean 1943 14 5145 5 1614 1 878 4 1196
565 mean 1944 42 5630 6 1939 1 902 0 1583
566 mean 1945 17 6140 7 2192 4 1004 0 1906
Now I have to look up specific values (e.g. [careless, 1804, man] or [mean, 1944, wmt]).
I have no clue how to do that; one possibility would be to split the data.frame and create an array, if I'm correct. But I'd love a simpler solution.
Thank you in advance!
Subsetting for specific values in the Adj and Year columns and selecting the man column will give you the required output.
df[df$Adj == "careless" & df$Year == 1804, "man"]
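The same pattern works for any combination of Adj, Year, and value column. A small self-contained sketch rebuilding two rows of the table above, with a dplyr equivalent for comparison:

```r
library(dplyr)

# Two rows from the table above, rebuilt as a data frame:
df <- data.frame(Adj  = c("careless", "mean"),
                 Year = c(1804, 1944),
                 man  = c(0, 42),
                 wmt  = c(58, 1939))

# Base R subsetting, as in the answer:
base_val <- df[df$Adj == "mean" & df$Year == 1944, "wmt"]   # 1939

# dplyr equivalent:
dplyr_val <- df %>% filter(Adj == "mean", Year == 1944) %>% pull(wmt)
```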
I have two dataframes, one of failed firms, and one of non-failed firms.
They both consist of rows of observations of firms; the variables in these rows include the industry of the firm, the year when the financial information was recorded, the size of the firm's total assets, and others.
I want to match each failed firm with one non-failed firm of the same industry and total asset size and year of financial information recorded.
I am happy to throw away observations with no match. If one failed firm matches multiple nonfailed firms, I am happy to just randomly choose one.
Currently, my code looks like this:
merge(cessdurc1[cessdurc1$afcyear=="2007",], cessdura[cessdura$afcyear=="2007",], by=c("ssic", "total_assets"), all.x=TRUE, all.y=FALSE)
Which does not work because the columns chosen need to be unique.
My data looks like this:
>head(alivefirms)
failed within year total_assets afcyear ssic
1 0 9e+07 2007 20
2 0 7e+06 2007 43
3 0 7e+05 2007 46
4 0 1e+07 2007 82
5 0 1e+08 2007 93
6 0 1e+06 2007 11
> head(failedfirms)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 1 5000 2007 73
105 1 400 2007 30
127 1 4000 2007 18
133 1 2000 2007 70
154 1 10000 2007 41
I want the output to match failed firms to alive firms who have the same SSIC & Total Assets & Afcyear, so something that looks like this
> head(wantedoutput)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 0 20000 2007 41
105 1 400 2007 30
127 0 400 2007 30
133 1 2000 2007 70
154 0 2000 2007 70
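One way to sketch this kind of matching with dplyr (using hypothetical numbers, since no exact asset matches exist between the sample rows shown): join on the three keys, then keep one random alive firm per failed firm with slice_sample(). Failed firms with no match simply drop out of the inner join.

```r
library(dplyr)

# Hypothetical mini data sets in the shape shown above (not the real firms):
failedfirms <- data.frame(failed = 1,
                          total_assets = c(20000, 400, 2000),
                          afcyear = 2007,
                          ssic = c(41, 30, 70))
alivefirms <- data.frame(failed = 0,
                         total_assets = c(20000, 20000, 400, 2000, 5000),
                         afcyear = 2007,
                         ssic = c(41, 41, 30, 70, 73))

set.seed(42)  # reproducible pick when a failed firm matches several alive firms
matched <- failedfirms %>%
  select(ssic, total_assets, afcyear) %>%
  inner_join(alivefirms, by = c("ssic", "total_assets", "afcyear")) %>%
  group_by(ssic, total_assets, afcyear) %>%
  slice_sample(n = 1) %>%   # keep one random alive match per failed firm
  ungroup()
```

Stacking failedfirms and matched with bind_rows() then gives a matched case-control set like the wanted output.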
I'm trying to run an ARIMA on a temporal dataset that is in a .csv file. Here is my code so far:
Oil_all <- read.delim("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv", sep="\t", header=TRUE, stringsAsFactors=FALSE)
Oil_all
The file looks like:
year.mbbl
1 1880,30
2 1890,77
3 1900,149
4 1905,215
5 1910,328
6 1915,432
7 1920,689
8 1925,1069
9 1930,1412
10 1935,1655
11 1940,2150
12 1945,2595
13 1950,3803
14 1955,5626
15 1960,7674
16 1962,8882
17 1964,10310
18 1966,12016
19 1968,14104
20 1970,16690
21 1972,18584
22 1974,20389
23 1976,20188
24 1978,21922
25 1980,21732
26 1982,19403
27 1984,19608
Code:
apply(Oil_all,1,function(x) sum(is.na(x)))
Results:
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run ARIMA:
library(forecast)
auto.arima(Oil_all,xreg=year)
This is the error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In data.matrix(data) : NAs introduced by coercion
So, I was able to read in the data set and it prints. However, when I check for missing values with the apply function, I see all 0s, so I know something's wrong, and that's probably why I'm getting the error. I'm just not sure what the error means or how to fix it in the code.
Any advice?
If I understood your question right, it should be like this:
Oil_all <- read.csv("myfolder/myfile.csv",header=TRUE)
## I don't have your source data, so I tried to reproduce it with the data you printed
Oil_all
year value
1 1880 30
2 1890 77
3 1900 149
4 1905 215
5 1910 328
6 1915 432
7 1920 689
8 1925 1069
9 1930 1412
10 1935 1655
11 1940 2150
12 1945 2595
13 1950 3803
14 1955 5626
15 1960 7674
16 1962 8882
17 1964 10310
18 1966 12016
19 1968 14104
20 1970 16690
21 1972 18584
22 1974 20389
23 1976 20188
24 1978 21922
25 1980 21732
26 1982 19403
27 1984 19608
library(forecast)
auto.arima(Oil_all$value,xreg=Oil_all$year)
Series: Oil_all$value
ARIMA(3,0,0) with non-zero mean
Coefficients:
ar1 ar2 ar3 intercept Oil_all$year
1.2877 0.0902 -0.4619 -271708.4 144.2727
s.e. 0.1972 0.3897 0.2275 107344.4 55.2108
sigma^2 estimated as 642315: log likelihood=-221.07
AIC=454.15 AICc=458.35 BIC=461.92
Your import should be
Oil_all <- read.csv("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv")
That is why your data looks weird: you are importing a CSV as if it were a tab-delimited file. Sorry, I do not have the reputation to comment. I did the same as Nemesi and it worked.
I'm trying to calculate incidence (with Poisson regression) for a rare type of cancer. My dataset is quite large, consisting of 25,000 observations; I have only included the first 20 rows.
The nrcase variable identifies each individual; as you can see, an individual can have a number of observations, depending on how many times they have visited the clinic. The variable visit is the index of each observation for a unique individual, and maxvisit is the total number of observations for that individual.
Start is when the individual was observed for the first time ever in the dataset, and done is the last observed date for each year the patient is in the dataset. I haven't included the censoring variable in this subset (if the patient hasn't suffered an event or quits the study for some reason, the censoring date is 2011-12-31).
Survival is the number of days the patient has lived since the inclusion date (start).
Event indicates whether the patient suffered an event (which no patient has in the subset I have provided).
This is what the dataset looks like:
first <- read.table(header = TRUE, text ="nrcase visit maxvisit done start survival event
7 1 6 31/12/06 04/09/06 118 0
7 2 6 31/12/07 04/09/06 483 0
7 3 6 31/12/08 04/09/06 849 0
7 4 6 31/12/09 04/09/06 1214 0
7 5 6 31/12/10 04/09/06 1579 0
7 6 6 31/12/11 04/09/06 1944 0
20 1 9 31/12/03 24/10/03 68 0
20 2 9 31/12/04 24/10/03 434 0
20 3 9 31/12/05 24/10/03 799 0
20 4 9 31/12/06 24/10/03 1164 0
20 5 9 31/12/07 24/10/03 1529 0
20 6 9 31/12/08 24/10/03 1895 0
20 7 9 31/12/09 24/10/03 2260 0
20 8 9 31/12/10 24/10/03 2625 0
20 9 9 31/12/11 24/10/03 2990 0
87 1 6 31/12/06 17/01/06 348 0
87 2 6 31/12/07 17/01/06 713 0
87 3 6 31/12/08 17/01/06 1079 0
87 4 6 31/12/09 17/01/06 1444 0
87 5 6 31/12/10 17/01/06 1809 0")
This is how i want the dataset to look like:
make <- read.table(header=TRUE, text="nrcase visit maxvisit done start survival event startstop
7 1 6 31/12/06 04/09/06 118 0 118
7 2 6 31/12/07 04/09/06 483 0 365
7 3 6 31/12/08 04/09/06 849 0 365
7 4 6 31/12/09 04/09/06 1214 0 365
7 5 6 31/12/10 04/09/06 1579 0 365
7 6 6 31/12/11 04/09/06 1944 0 365
20 1 9 31/12/03 24/10/03 68 0 68
20 2 9 31/12/04 24/10/03 434 0 365
20 3 9 31/12/05 24/10/03 799 0 365
20 4 9 31/12/06 24/10/03 1164 0 365
20 5 9 31/12/07 24/10/03 1529 0 365
20 6 9 31/12/08 24/10/03 1895 0 365
20 7 9 31/12/09 24/10/03 2260 0 365
20 8 9 31/12/10 24/10/03 2625 0 365
20 9 9 31/12/11 24/10/03 2990 0 233
87 1 6 31/12/06 17/01/06 348 0 348
87 2 6 31/12/07 17/01/06 713 0 365
87 3 6 31/12/08 17/01/06 1079 0 365
87 4 6 31/12/09 17/01/06 1444 0 365
87 5 6 31/12/10 17/01/06 1809 0 105")
As you can see, I want to create a new variable called startstop, which is the total number of days the patient contributes to each yearly observation row.
startstop will later serve as my offset variable in the glm (Poisson) model.
I appreciate all the help I can get!
I hope this does what you need. I've used lubridate and dplyr because they make things easier, but the same results could be achieved in base R.
There's no need to retain year_done or first_jan_done; they can be dropped with %>% select(-year_done, -first_jan_done), but I left them in to make the process clearer.
require(dplyr)
require(lubridate)
make <- first %>%
mutate(start = dmy(start), done = dmy(done),
year_done = year(done), first_jan_done = dmy(paste0("01/01/",year_done)),
days_in_year = as.numeric(done - first_jan_done)+1
) %>% # Need to deal with those observations where patients entered study part way into year
mutate(days_in_year = ifelse(start > first_jan_done, as.numeric(done - start),
days_in_year))
make
nrcase visit maxvisit done start survival event year_done first_jan_done days_in_year
1 7 1 6 2006-12-31 2006-09-04 118 0 2006 2006-01-01 118
2 7 2 6 2007-12-31 2006-09-04 483 0 2007 2007-01-01 365
3 7 3 6 2008-12-31 2006-09-04 849 0 2008 2008-01-01 366
4 7 4 6 2009-12-31 2006-09-04 1214 0 2009 2009-01-01 365
5 7 5 6 2010-12-31 2006-09-04 1579 0 2010 2010-01-01 365
6 7 6 6 2011-12-31 2006-09-04 1944 0 2011 2011-01-01 365
7 20 1 9 2003-12-31 2003-10-24 68 0 2003 2003-01-01 68
8 20 2 9 2004-12-31 2003-10-24 434 0 2004 2004-01-01 366
9 20 3 9 2005-12-31 2003-10-24 799 0 2005 2005-01-01 365
10 20 4 9 2006-12-31 2003-10-24 1164 0 2006 2006-01-01 365
11 20 5 9 2007-12-31 2003-10-24 1529 0 2007 2007-01-01 365
12 20 6 9 2008-12-31 2003-10-24 1895 0 2008 2008-01-01 366
13 20 7 9 2009-12-31 2003-10-24 2260 0 2009 2009-01-01 365
14 20 8 9 2010-12-31 2003-10-24 2625 0 2010 2010-01-01 365
15 20 9 9 2011-12-31 2003-10-24 2990 0 2011 2011-01-01 365
16 87 1 6 2006-12-31 2006-01-17 348 0 2006 2006-01-01 348
17 87 2 6 2007-12-31 2006-01-17 713 0 2007 2007-01-01 365
18 87 3 6 2008-12-31 2006-01-17 1079 0 2008 2008-01-01 366
19 87 4 6 2009-12-31 2006-01-17 1444 0 2009 2009-01-01 365
20 87 5 6 2010-12-31 2006-01-17 1809 0 2010 2010-01-01 365