Easy way to subset data into bins - R

I have a data frame as seen below with over 1000 rows. I would like to subset the data into bins by 1m intervals (0-1m, 1-2m, etc.). Is there an easy way to do this without finding the minimum depth and using the subset command multiple times to place the data into the appropriate bins?
  Temp..ºC. Depth..m. Light  time       date
1     17.31     -14.8   255 09:08 2012-06-19
2     16.83     -21.5   255 09:13 2012-06-19
3     17.15     -20.2   255 09:17 2012-06-19
4     17.31     -18.8   255 09:22 2012-06-19
5     17.78     -13.4   255 09:27 2012-06-19
6     17.78      -5.4   255 09:32 2012-06-19

Assuming that the name of your data frame is df, do the following:
split(df, findInterval(df$Depth..m., floor(min(df$Depth..m.)):0))
You will then get a list where each element is a data frame containing the rows that have Depth..m. within a particular 1 m interval.
Note, however, that empty bins will be dropped. If you want to keep them, use cut instead of findInterval. The reason is that findInterval returns an integer vector, so split cannot know what the full set of valid bins is; it only sees the values that actually occur and discards the rest. cut, on the other hand, returns a factor in which every valid bin is defined as a level.
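For example, a minimal sketch of the cut-based version (assuming the same df as above):

# include.lowest keeps a depth that falls exactly on the deepest break
breaks <- floor(min(df$Depth..m.)):0
bins <- cut(df$Depth..m., breaks = breaks, include.lowest = TRUE)
split(df, bins)  # empty bins appear as zero-row data frames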

Related

Graph in SAS coming up with strange values using gchart function

I am trying to create a graph from a table I've made. I want to graph the values for month against the numbers in the Scheduled column. Unfortunately, it is displaying the months as values like .75, 2.25 and 4.75 instead of the actual month numbers, and I don't know why.
I have tried changing the type of graph, the sumvar, and the axes and their values, but none of this has helped. It worked at one point but then simply stopped, and I cannot figure out why.
month scheduled Arr_num
1     SKED        7573
1     UNSK        1882
2     SKED        6635
2     UNSK        1642
3     SKED         817
3     UNSK         208
4     SKED        9494
4     UNSK        2376
5     SKED        1900
5     UNSK         551
6     SKED        9864
6     UNSK        3319
7     SKED        9770
7     UNSK        4145
pattern1 value=solid color=CXc01933;
pattern2 value=solid color=CX003366;
axis1 label=(angle=90 'Amount of Wheelchair Requests');
axis2 label=('Month') order=(0 to 12 by 1);

proc gchart data=Overall_Arr;
  vbar month / type=sum sumvar=Arr_num subgroup=scheduled
               raxis=axis1 maxis=axis2 autoref clipref;
run;
quit;  /* GCHART runs in interactive mode and needs QUIT to terminate */
This is the table, and above is the code to make the graph. I am expecting a graph with two different colored bar segments, signifying the scheduled and the unscheduled numbers. Before I put the order= option on the second axis it would output a graph, but with strange numbers for the month, like .75 or 4.25, instead of 1, 2, 3, etc. Now it outputs no bars at all, presumably because it is still trying to use those fractional midpoints even though I've restricted the axis to whole numbers. Any help would be appreciated.
Alright, I actually think I figured it out: the problem was that MONTH is also the name of a built-in SAS function, so renaming my variable let it be treated as an ordinary variable. (Another common cause of fractional midpoints like these is charting a numeric variable without the DISCRETE option on the VBAR statement; without it, GCHART treats the variable as continuous and picks its own midpoints.)

Function to identify changes done previously

BACKGROUND
I have a list of 16 data frames. One of them looks like this; all the others have the same format. The DateTime column is of class Date, while the Value column is a time series object.
> head(train_data[[1]])
DateTime Value
739 2009-07-31 49.9
740 2009-08-31 53.5
741 2009-09-30 54.4
742 2009-10-31 56.0
743 2009-11-30 54.4
744 2009-12-31 55.3
I am forecasting the Value column across all the data frames in this list. The following line of code transforms the data before it is fed into the UCM model:
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000 , Value/100000 , Value ))
The transform is used to scale down large values because UCM has some issues rounding off large values (I don't know why, though); I picked that up from user #KRC in this link.
One data frame was affected because it had large values, which were divided by 100,000. All the other data frames remained unaffected.
> head(train_data[[5]])
DateTime Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
I only got to know this because I manually checked every data frame in the list.
PROBLEM
Is there a function that can call out the data frames affected by the condition I inserted? It should identify the affected data frames and put them into a list.
If I can do this, then I can apply the inverse transformation to the values and recover the actual values. This way I can give correct forecasts with minimal human intervention.
I hope I am clear in specifying the problem. Thank you.
Simply check whether any of the values in a data frame are too high:
has_too_high_values <- function(df) any(df$Value > 50000)
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
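A minimal sketch of how this ties together, using only the objects defined above:

affected <- Filter(has_too_high_values, train_data)
names(affected)  # which of the 16 data frames were rescaled
# Any forecast made on the rescaled series can then be un-scaled again,
# e.g. a hypothetical forecast_value * 100000 inverts the Value/100000 transform.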

How to optimize for loops in extremely large dataframe

I have a data frame "x" with 5.9 million rows and three columns: idnumber (integer), compdate (integer) and judge (character), representing individual cases completed in an administrative court. The data was imported from a Stata dataset, and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record I take a subset of the data for that particular judge, then subset again to the cases decided in the 30-day window, and then assign the number of rows in that subset to the caseload variable for the case at issue:
for (i in 1:length(x$idnumber)) {
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It works, but it takes extremely long to complete. How can I optimize it or do this more easily? Sorry, I'm very new to R and to programming; I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n <- 6e6  # cases
judges <- apply(combn(LETTERS, 3), 2, paste0, collapse = '')  # about 2600 judges
set.seed(1)
x <- data.frame(idnumber = 1:n,
                judge = sample(judges, n, replace = TRUE),
                compdate = Sys.Date() + round(runif(n, 1, 120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort by judge, then by date.
x <- x[order(x$judge, x$compdate), ]
# A little rolling-window function: for each (sorted) date, count the
# values that fall within the trailing window.
rolling.window <- function(y, window = 30) seq_along(y) - findInterval(y - window, y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
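To see what the little function does, here is a toy check on a sorted vector (my own example, not part of the original answer); each entry counts the values in the trailing half-open window (y - window, y]:

rolling.window(c(1, 5, 40, 50, 69, 71))
# [1] 1 2 1 2 3 3   e.g. for 69, the window (39, 69] contains 40, 50 and 69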
I don't have much experience with rolling calculations, but:
1) Calculate this per day, not per case (it will be the same for all cases completed on the same day).
2) Calculate a cumulative sum of the number of cases, and then take the difference between the current value of that sum and its value just over 30 days ago (or min{daysAgo : daysAgo > 30}, since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[, compdate := as.integer(compdate)]
setkey(DT, judge, compdate)
# count cases for each day
ldt <- DT[, .N, by = 'judge,compdate']
# cumulative sum of counts
ldt[, nrun := cumsum(N), by = judge]
# see how far to look back
ldt[, lookbk := sapply(1:.N, function(i) {
  z <- compdate[i] - compdate[i:1]
  older <- which(z > 30)
  if (length(older)) min(older) - 1L else NA_integer_
}), by = judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[, wload := sapply(1:.N, function(i)
  nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
)]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
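If you need the caseload back on the case-level table rather than per day, an update join does it; this line uses current data.table join syntax and is not part of the original answer:

# Copy each day's wload onto every case completed that day.
DT[ldt, on = c('judge', 'compdate'), wload := i.wload]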

Merging in R based on dates

I'm using getSymbols to import stock data from Yahoo to R.
When I store it in a data frame, it's in the following format.
IDEA.BO.Open IDEA.BO.High IDEA.BO.Low IDEA.BO.Close IDEA.BO.Volume
2007-03-09 92.40 94.25 84.00 85.55 63599400
2007-03-12 85.55 89.95 85.55 87.40 12490900
2007-03-13 88.50 91.25 86.20 89.85 16785000
2007-03-14 87.05 90.85 86.60 87.75 7763800
2007-03-15 90.00 94.00 88.80 91.45 14808200
2007-03-16 92.40 93.65 91.25 92.40 6365600
Now the date column has no name.
I want to import data for two stocks and merge their closing prices (between any given range of rows) on the basis of dates. The problem is that the date column is not being recognized.
I want my final result to be like this.
IDEA.BO.Close BHARTIARTL.BO.Close
2007-03-12 123 333
2007-03-13 456 645
2007-03-14 789 999
I tried the following:
> c <- merge(Cl(IDEA.BO),Cl(BHARTIARTL.BO))
> c['2013-08/']
IDEA.BO.Close BHARTIARTL.BO.Close
2013-08-06 NA 323.40
2013-08-07 NA 326.80
2013-08-08 157.90 337.40
2013-08-09 157.90 337.40
The same data in Excel looks like this:
8/6/2013 156.75 8/6/2013 323.4
8/7/2013 153.1 8/7/2013 326.8
8/8/2013 157.9 8/8/2013 337.4
8/9/2013 157.9 8/9/2013 337.4
I don't understand the reason for the NA values in R, or how to obtain merged data free of NA values.
You need to do more reading about xts and zoo data structures. They are matrices with ordered time indices. When you convert them to data.frames they become lists with a 'rownames' attribute, which print.data.frame displays with no header. The list elements are given names based on the naming of the matrix columns. (I do understand Joshua's visible annoyance at this question, since he has posted many SO examples of how to use xts objects.)
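As for the NAs: merge() on xts objects aligns the series on the union of their indexes by default, so any date present in only one series gets NA in the other. A minimal sketch with made-up numbers (not the question's data):

library(xts)
a <- xts(c(156.75, 157.90), as.Date(c('2013-08-06', '2013-08-08')))
b <- xts(c(323.40, 326.80, 337.40), as.Date(c('2013-08-06', '2013-08-07', '2013-08-08')))
merge(a, b)                  # outer join: a is NA on 2013-08-07
merge(a, b, join = 'inner')  # intersection of dates only, no NAs
na.omit(merge(a, b))         # or drop incomplete rows afterwards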

Data dictionary packing in R

I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:
1) Creates a text file which:
a. Summarises the data frame by listing the number of variables by class, the number of observations, the number of complete observations, etc.
b. For each variable, summarises the key facts: mean, min, max, mode, number of missing observations, etc.
2) Creates a pdf containing a histogram for each numeric or integer variable and a bar chart for each categorical (attribute) variable.
The basic idea is to create a data dictionary of a data frame with one function.
My question is: is there a package which already does this? And if not, do people think this would be a useful function?
Thanks
There are a variety of describe functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:
" This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."
And an example of the output:
Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")]
2 Variables 5325006 Observations
----------------------------------------------------------------------------------
CHOLEST
n missing unique Mean .05 .10 .25 .50 .75 .90
4410307 914699 689 199.4 141 152 172 196 223 250
.95
268
lowest : 0 10 19 20 31, highest: 1102 1204 1213 1219 1234
----------------------------------------------------------------------------------
HDL
n missing unique Mean .05 .10 .25 .50 .75 .90
4410298 914708 258 54.2 32 36 43 52 63 75
.95
83
lowest : -11.0 0.0 0.2 1.0 2.0, highest: 241.0 243.0 248.0 272.0 275.0
----------------------------------------------------------------------------------
Furthermore, on your point about histograms: the Hmisc::latex method for a describe object will produce histograms interleaved in the output illustrated above. (You do need a functioning LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output either on Harrell's website or in the Amazon "Look Inside" preview of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
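If you do end up rolling your own, here is a minimal sketch of the one-shot function the question describes (the file names, layout and choice of statistics are illustrative, not from any package):

# Writes a plain-text summary and a PDF of per-variable plots.
data_dictionary <- function(df, txt = 'dictionary.txt', pdf_file = 'dictionary.pdf') {
  # 1) Text summary: dimensions, variables by class, per-variable stats.
  sink(txt)
  cat('Observations:', nrow(df),
      ' Complete observations:', sum(complete.cases(df)), '\n')
  print(table(vapply(df, function(v) class(v)[1], character(1))))
  print(summary(df))  # min/max/mean/NA counts per variable
  sink()
  # 2) PDF: histogram for numeric variables, bar chart for everything else.
  pdf(pdf_file)
  for (nm in names(df)) {
    v <- df[[nm]]
    if (is.numeric(v)) hist(v, main = nm, xlab = nm)
    else barplot(table(v), main = nm)
  }
  dev.off()
}
data_dictionary(mtcars)  # example call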
