Subsetting odd rows in R using seq

I hope this is not too much of a newbie question.
I am trying to subset rows from the GDP UK dataset that can be downloaded from here:
http://www.ons.gov.uk/ons/site-information/using-the-website/time-series/index.html
The data frame looks more or less like this:
X ABMI
1 1948 283297
2 1949 293855
3 1950 304395
....
300 2013 Q2 381318
301 2013 Q3 384533
302 2013 Q4 387138
303 2014 Q1 390235
For my analysis I only need the data for the years 2004-2013, and I am interested in one result per year, so I want to take every fourth row of the dataset between rows 263 and 303.
On the basis of the following websites:
https://stat.ethz.ch/pipermail/r-help/2008-June/165634.html
(plus a few that I cannot quote due to the link limit)
I tried the following, each time getting some error message:
> GDPUKodd <- seq(GDPUKsubset[263:302,], by = 4)
Error in seq.default(GDPUKsubset[263:302, ], by = 4) :
'from' must be of length 1
> OddGDPUK <- GDPUK[seq(263, 302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(263, 302, by = 4)) :
undefined columns selected
> OddGDPUKprim <- GDPUK[seq(263:302), by = 4]
Error in `[.data.frame`(GDPUK, seq(263:302), by = 4) :
unused argument (by = 4)
> OddGDPUK <- GDPUK[seq(from=263, to=302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(from = 263, to = 302, by = 4)) :
undefined columns selected
> OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to=GDPUK[302,] by = 4)]
Error: unexpected symbol in "OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to"
> GDPUK[seq(1,nrows(GDPUK),by=4),]
Error in seq.default(1, nrows(GDPUK), by = 4) :
could not find function "nrows"
To put a long story short: help!

Instead of trying to extract data based on row ids, you can use the subset function with appropriate filters based on the values.
For example if your data frame has a year column with values 1948...2014 and a quarter column with values Q1..Q4, then you can get the right subset with:
subset(data, year >= 2004 & year <= 2013 & quarter == 'Q1')
UPDATE
I see your source data is dirty, with no proper year and quarter columns. You can clean it like this:
x <- read.csv('http://www.ons.gov.uk/ons/datasets-and-tables/downloads/csv.csv?dataset=pgdp&cdid=ABMI')
x$ABMI <- as.numeric(as.character(x$ABMI))
x$year <- as.numeric(gsub('[^0-9].*', '', x$X))
x$quarter <- gsub('[0-9]{4} (Q[1-4])', '\\1', x$X)
subset(x, year >= 2004 & year <= 2013 & quarter == 'Q1')
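To see what the two gsub() calls do without downloading anything, here is a minimal sketch on a few rows typed in by hand from the sample shown in the question. The real file may contain other label formats, so treat this as an illustration only. It filters on Q4 rather than Q1 simply because this small sample has no Q1 row inside 2004-2013.

```r
# A few period labels and values copied from the sample in the question
x <- data.frame(
  X    = c("1948", "2013 Q2", "2013 Q3", "2013 Q4", "2014 Q1"),
  ABMI = c(283297, 381318, 384533, 387138, 390235),
  stringsAsFactors = FALSE
)

# Drop everything from the first non-digit onwards, leaving the year
x$year <- as.numeric(gsub("[^0-9].*", "", x$X))

# Keep only the quarter label; purely annual rows are left unchanged
x$quarter <- gsub("[0-9]{4} (Q[1-4])", "\\1", x$X)

# One row per year: the Q4 value for each year in the range
q4 <- subset(x, year >= 2004 & year <= 2013 & quarter == "Q4")
```

Note that annual rows such as "1948" keep the year in the quarter column, so the quarter filter also excludes them.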

Your code GDPUK[seq(1,nrows(GDPUK),by=4),] actually works quite well for these purposes. The only thing you need to change is to replace nrows with nrow.
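The row-versus-column distinction is what trips up several of the attempts in the question: with a single index and no comma, a data frame selects columns, which is where "undefined columns selected" comes from. A minimal sketch on a made-up 12-row data frame (the values are placeholders, not real GDP figures):

```r
# Toy stand-in for GDPUK; the numbers are placeholders
GDPUK <- data.frame(X = paste("row", 1:12), ABMI = (1:12) * 100)

# Every fourth row, starting at row 1; note the trailing comma
every_fourth <- GDPUK[seq(1, nrow(GDPUK), by = 4), ]

# The same idea restricted to a range of rows (here rows 2 to 10)
ranged <- GDPUK[seq(2, 10, by = 4), ]
```

For the range in the question the index would be seq(263, 302, by = 4), assuming those row numbers really line up with the quarter you want.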

Related

A way to get Column Names as Row Names?

My goal is to plot a map with each point representing the year of the highest measured value. So for that I need the year as one value and the Station Name as Row Name.
I get to the point where I have the year of the maximum value for each station, but I don't know how to get the station name as the row name.
My example is the following:
set.seed(123)
df1<-data.frame(replicate(6,sample(0:200,2500,rep=TRUE)))
date_df1<-seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto<-cbind(date_df1, df1)
test_sto$date_df1<-as.Date(test_sto$date_df1)
test_sto<-test_sto%>% dplyr::mutate( year = lubridate::year(date_df1),
month = lubridate::month(date_df1),
day = lubridate::day(date_df1))
This is my data frame. I then applied the following steps.
To count the values above the threshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should. The next step is the following:
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
idx<- which.max(test_year[,j])
Value[j] <- test_year[,1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
  x
1 1996
2 1996
3 1998
4 1996
5 1999
6 1999
But instead of 1, 2, 3, 4, 5, ... I need my column names there (X1, X2, X3, etc.):
   x
X1 1996
X2 1996
X3 1998
X4 1996
X5 1999
X6 1999
This is the point where I'm struggling.
I tried it with the following step:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But I'm just getting a message saying:
In max(ncol(2:7)) :
no non-missing arguments to max; returning -Inf
Maybe someone knows a work around! Thanks in advance!
'test_test' is just a vector: a one-dimensional object characterized by its length, which has no row.names attribute. It can, however, carry a names attribute:
names(test_test) <- colnames(test_year)[-1]
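As an alternative to the loop, the names can be carried along from the start. The following is only a sketch on a small made-up table shaped like test_year (a year column followed by one count column per station); the column name Group.1 and the counts are assumptions for illustration.

```r
# Toy stand-in for the aggregated table: first column is the year,
# the remaining columns are per-station counts (values are made up)
test_year <- data.frame(Group.1 = 1995:1997,
                        X1 = c(3, 9, 4),
                        X2 = c(7, 2, 5))

# For each station column, pick the year in which the count is largest;
# sapply() keeps the column names as the names of the result
max_year <- sapply(test_year[-1],
                   function(counts) test_year[[1]][which.max(counts)])

# Wrapping the named vector in a data frame turns the names into row names
result <- data.frame(x = max_year)
```

Here max_year is a named vector with one element per station, and result has the station names as its row names.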

Formatting life-tables to use in survival analysis

I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below shows my problem using the example from relsurv but changing the life-table data. I've just used two years and two ages in the life-table data below, the actual data is much larger but gives the same error. The error is 'invalid ratetable argument' but I've formatted it as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
for (s in 1:2){
rtab[ , s, y-min(years)+1] = -1 * log(1-subset(raw, Year==y&sex==s)$qx) / 365.24 # probability of death in next year, transformed to hazard (see ratetables help)
}
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable without verbose first, because the verbose version will lie about valid rate tables: it can report a problem even when the plain call returns TRUE.
The rest of this comment is about this lie.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
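The arithmetic behind the claim that 1 can never come out can be checked directly. This sketch takes the expression exactly as quoted above (I have not checked it against any particular version of the package) and evaluates it for three example values of the factor attribute.

```r
# Expression quoted above from is.ratetable(), applied to three example values
fac  <- c(0, 1, 2)
type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0)
type
# fac = 0 gives 2, fac = 1 gives 1 + 4 = 5, fac = 2 gives 4

# Which of the computed values fall in the accepted set 1:4?
in_valid_set <- type %in% 1:4
```

Because fac == 1 also satisfies fac > 0, the first and third terms always fire together, so the result for a factor dimension is 5 rather than 1.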

Subset is not working

I have a dataset called "x" that produces the following records when I do head(x, 1)
VRTG_ID_NR EEG_VRTG_CAT_V GEBR_IDENT_KEUR PL_KEURING NAAM_VRTG_AANB DAT_RESULT_KEUR TYD_RESULT_KEUR KL_CODE_EU_1 KL_CODE_EU_2
1 VF1JA04N522215749 M1 NULL NULL NULL 20090527 906 6 NULL
RES_CODE_KEUR KENT_LAND_OORS LAND_HERK YEAR
1 GDK ME-QT 761 D 2009
I get the following when I show the classes of the relevant column
$YEAR
[1] "numeric"
I now want to create a subset where I only see data from the years 2009 and 2010. So I tried:
x_subset <- x[x$YEAR >=2009 & <= 2011]
That however gives me the following error:
data frame with 0 columns and 992287 rows
While actually I want an overview with a subset of the records between 2009 and 2011...
If YEAR is a factor variable, first convert it to a numeric:
x$YEAR <- as.numeric(as.character(x$YEAR))
I think you are missing a comma:
x_subset <- x[x$YEAR >=2009 & x$YEAR <= 2011,]
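A minimal sketch of both points on a made-up data frame (the ID column and the YEAR values are placeholders): the condition goes before the comma so that it selects rows, and a factor has to go through as.character() before as.numeric() if you want the original year values.

```r
# Toy data frame; ID and the YEAR values are made up for illustration
x <- data.frame(ID = c("a", "b", "c", "d", "e"),
                YEAR = c(2008, 2009, 2010, 2011, 2012))

# The logical condition before the comma selects rows;
# leaving the part after the comma empty keeps all columns
x_subset <- x[x$YEAR >= 2009 & x$YEAR <= 2011, ]

# Factor pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("2009", "2011"))
codes  <- as.numeric(f)                # 1 2
values <- as.numeric(as.character(f))  # 2009 2011
```

Without the comma, x[condition] treats the logical vector as a column selector, which is why the attempt in the question came back with 0 columns.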

How to know which line (date) is missing in a text file in R?

I have a text file dat that contains daily values for the two years 2008-2009. The total number of lines is 730, but it should be 731 (because 2008 has 366 days), so one date (line) is missing. How can I find out which date is missing?
there should be one line (row) per day
The file:
head(dat)
Year day valu
61322 2008 1 0.301
61346 2008 2 0.285
61370 2008 3 0.272
61394 2008 4 0.253
Try:
# Assumes day is the day of the year (1 = 1 January) and restarts each year
dfDate = with(dat, as.Date(paste0(Year, "-01-01")) + day - 1)
yearDates = seq(as.Date("2008-01-01"),as.Date("2009-12-31"), by="days")
yearDates[!yearDates %in% dfDate]
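The approach can be tried end to end on simulated data. This sketch builds a full 2008-2009 calendar, removes one day on purpose, and then recovers it. It assumes, as the sample in the question suggests but does not confirm, that the day column is the day of the year and restarts at 1 each January.

```r
# Build a complete two-year calendar, then drop one day to simulate the gap
all_days <- seq(as.Date("2008-01-01"), as.Date("2009-12-31"), by = "day")
kept <- all_days[-100]  # the 100th day is removed on purpose

# Toy version of dat: Year plus day-of-year (assumed to restart each January)
dat <- data.frame(Year = as.integer(format(kept, "%Y")),
                  day  = as.integer(format(kept, "%j")))

# Rebuild real dates from Year and day-of-year, then compare with the calendar
dfDate <- as.Date(paste0(dat$Year, "-01-01")) + dat$day - 1
missing_date <- all_days[!all_days %in% dfDate]
```

If the day column in the real file instead counts continuously across both years, the date reconstruction line would need to change accordingly.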
This is surprisingly complicated. But maybe I’ve made a logical error here, and there’s a much more straightforward solution.
First, some helpers:
days_in_year = function (year)
1 : (if (is_leap_year(year)) 366 else 365)
is_leap_year = function (year)
year %% 4 == 0 && (year %% 100 != 0 || year %% 400 == 0)
Now we can generate a full list of days for each year, and see whether these are all present in your data.frame:
years = c(2008, 2009)
years = setNames(years, years)
full_years = lapply(years, days_in_year)
missing_days = lapply(years, function (y) which(is.na(match(full_years[[as.character(y)]], subset(dat, Year == y)$day))))
You can count the fields in the file with count.fields()
txt <- "Year day valu
61322 2008 1 0.301

61370 2008 3 0.272
61394 2008 4 0.25"
We can set the starting line to 2 with skip = 1 so that the header row won't appear in the result, and blank.lines.skip = FALSE to get back any blank rows (shown as zero). You can spot any other discrepancies by taking the difference from 4.
(cf <- count.fields(textConnection(txt), skip = 1, blank.lines.skip = FALSE))
# [1] 4 0 4 4
which(cf == 0)
# [1] 2
So now you can deduce that the missing date may be on the second line. In your case, running count.fields() on the file should tell you where the missing line is.
count.fields("file.dat", skip = 1, blank.lines.skip = FALSE)
There are also other useful arguments
> args(count.fields)
function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
comment.char = "#")

Apply Function for Rows of a Dataframe where the Function uses Row Elements

I'm trying to apply a function to each element of a data frame. A simple example of such a data frame would be:
> accts
ACCOUNT DATE
1 2008-03-01
2 2009-06-17
3 2008-07-02
4 2009-03-15
What I need to do is look at each row of this data frame and go find that account in a larger data frame, like the one shown below:
> trans
ACCOUNT_NUM TRAN_DATE
1 2008-02-02
2 2008-04-02
3 2008-03-16
3 2009-08-22
3 2008-05-05
6 2010-11-03
7 2008-09-18
4 2009-10-14
4 2009-01-15
10 2011-07-06
For each row in the 'accts' data frame I need to get the record in the 'trans' data frame corresponding to that account which also has the 'TRAN_DATE' that occurred nearest the 'DATE' but prior to it. I tried to use the apply function:
tranDateVector <- apply(accts, 2, getTranDate)
getTranDate <- function(x)
{
tranDate <- subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
dataDiff <- x[2] - tranDate
tranDate <- unique(date[which(dateDiff == min(dateDiff))])
return(tranDate)
}
accts <- cbind(accts, tranDateVector)
When I run my mini-example I get the following error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
However, when I run my full-blown version I get a different error, which I have realized is coming from this line:
subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
If I set x to be the third row of my 'accts' data frame, so:
x
ACCOUNT DATE
3 3 2008-07-02
and run the 'subset' line of code I get the following error, which corresponds to the error I get on my regular code:
> subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
Error in eval(expr, envir, enclos) :
dims [product 1] do not match the length of object [10]
In addition: Warning message:
In eval(expr, envir, enclos) :
Incompatible methods ("Ops.Date", "Ops.data.frame") for "<"
Thanks for your help.
(Information below was added after the answer to the above was provided b/c I realized there was a complication)
There are additional constraints on the function that I just realized need to be considered, and these cause the problem to become a bit more complicated. In the 'accts' data frame there are two different statuses:
> accts <- data.frame(
+ ACCOUNT = 1:4,
+ DATE = as.Date(c("2008-03-01", "2009-06-17",
+ "2008-07-02", "2009-03-15")),
+ STATUS = c("new", "old", "new", "old"))
In the 'accts' frame a record can be classified as either old or new. If the account is 'new' then it needs to meet the conditions specified earlier, but it must also only be matched with records in 'trans' flagged as 'revised'. Likewise, 'old' accounts can only be compared to the 'orig' records of trans:
> trans <- data.frame(
+ ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
+ TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
+ "2008-03-16", "2009-08-22",
+ "2008-05-05", "2010-11-03",
+ "2008-09-18", "2009-10-14",
+ "2009-01-15", "2011-07-06")),
+ BALANCE = c("orig", "orig", "orig", "orig", "revised", "orig", "revised", "revised", "revised", "orig"))
I tried to implement your code to fit this situation as follows:
library(plyr)
adply(accts, 1, transform,
TRAN_DATE = {
if(STATUS == "old")
{
data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
TRAN_DATE < DATE & BALANCE == "orig")
}else{
data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
TRAN_DATE < DATE & BALANCE == "revised")
}
tail(data$TRAN_DATE, 1) })
I get the following error from this code:
Error in data.frame(list(ACCOUNT = 1L, DATE = 13939, STATUS = 1L), BALANCE = list( :
arguments imply differing number of rows: 1, 0
My apologies for not specifying this requirement in my initial post, I didn't realize it would cause a problem.
Because your data mixes types (numbers, dates), I'd stay away from using apply, as it will coerce each row into a single type. Instead I'd recommend plyr's adply function, which preserves all types because each row is processed as a data.frame. It also has the advantage that fields can still be accessed by column name, which usually leads to much more readable code, as I will let you judge.
Your data:
accts <- data.frame(
ACCOUNT = 1:4,
DATE = as.Date(c("2008-03-01", "2009-06-17",
"2008-07-02", "2009-03-15")))
trans <- data.frame(
ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
"2008-03-16", "2009-08-22",
"2008-05-05", "2010-11-03",
"2008-09-18", "2009-10-14",
"2009-01-15", "2011-07-06")))
A solution using adply:
library(plyr)
adply(accts, 1, transform,
TRAN_DATE = { data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
TRAN_DATE < DATE)
tail(data$TRAN_DATE, 1) })
# ACCOUNT DATE TRAN_DATE
# 1 1 2008-03-01 2008-02-02
# 2 2 2009-06-17 2008-04-02
# 3 3 2008-07-02 2008-05-05
# 4 4 2009-03-15 2009-01-15
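For the later version of the question with STATUS and BALANCE, the "differing number of rows: 1, 0" error appears to come from accounts that have no qualifying transaction, so that tail() returns a zero-length result. The following base R sketch is one possible way around that, not a drop-in fix for the adply call: latest_before is a hypothetical helper name, it returns NA when nothing matches, and it uses max() rather than tail(..., 1), so it does not depend on the row order of trans.

```r
accts <- data.frame(
  ACCOUNT = 1:4,
  DATE = as.Date(c("2008-03-01", "2009-06-17", "2008-07-02", "2009-03-15")),
  STATUS = c("new", "old", "new", "old"))
trans <- data.frame(
  ACCOUNT_NUM = c(1, 2, 3, 3, 3, 6, 7, 4, 4, 10),
  TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02", "2008-03-16", "2009-08-22",
                        "2008-05-05", "2010-11-03", "2008-09-18", "2009-10-14",
                        "2009-01-15", "2011-07-06")),
  BALANCE = c("orig", "orig", "orig", "orig", "revised",
              "orig", "revised", "revised", "revised", "orig"))

# Hypothetical helper: for row i of accts, return the latest matching
# TRAN_DATE as a number of days, or NA when no transaction qualifies
latest_before <- function(i) {
  wanted <- if (accts$STATUS[i] == "new") "revised" else "orig"
  d <- trans$TRAN_DATE[trans$ACCOUNT_NUM == accts$ACCOUNT[i] &
                       trans$TRAN_DATE < accts$DATE[i] &
                       trans$BALANCE == wanted]
  if (length(d) == 0) NA_real_ else as.numeric(max(d))
}

days <- vapply(seq_len(nrow(accts)), latest_before, numeric(1))
accts$TRAN_DATE <- as.Date(days, origin = "1970-01-01")
```

With the sample data, accounts 1 and 4 end up with NA because their only earlier transactions carry the other BALANCE flag, while accounts 2 and 3 get 2008-04-02 and 2008-05-05.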

Resources