Checking multiple value ranges in R - r

I have a column of 17000 values that I would like to classify into 48 groups by their ranges (classifying SIC codes into Fama French industries).
df$SIC
[1] 5080 4911 7359 2834 3674 6324 2810 4512 4400 6331 3728 3350 2911 2085 7340 6311 6199 6321 2771 3844 2870 3823 2836 3825
The only way I can think of to do this is to write a bunch of if-then statements and place them all in a for loop. However, this will take forever to run.
for (i in seq_len(nrow(df))) {
# assign to row i only, not to the whole column
if (df$SIC[i] >= 100 && df$SIC[i] <= 299) { df$FF_IND[i] <- "AGRI" }
}
## and so on for all groups
Do you know of a less taxing way to perform this task?
Many thanks!

Something like:
cut(df$SIC,breaks=c(100,299,...),labels=c("AGRI",...))
A more thorough solution (which I don't have time for right now) would extract the table found via http://boards.fool.com/famafrench-industry-codes-26799316.aspx (downloading http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Siccodes49.zip and extracting the table) and finding the breakpoints programmatically.
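As a minimal sketch of the lookup-table idea, assuming the ranges have already been collected into a data frame: only the AGRI range (100 to 299) comes from the question, while the other two rows, the labels GROUP_B and GROUP_C, and the helper name classify_sic are placeholders invented for illustration, not real Fama-French definitions. It uses findInterval rather than cut so that gaps between ranges come out as NA.

```r
# Illustrative lookup table. Only the AGRI row comes from the question;
# the other rows are placeholders, NOT real Fama-French ranges.
ranges <- data.frame(
  lo    = c(100, 2000, 4900),
  hi    = c(299, 2099, 4999),
  label = c("AGRI", "GROUP_B", "GROUP_C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised range lookup, no loop over rows.
classify_sic <- function(sic, ranges) {
  ranges <- ranges[order(ranges$lo), ]
  # Index of the last range whose lower bound is <= sic (0 if none)
  idx <- findInterval(sic, ranges$lo)
  out <- rep(NA_character_, length(sic))
  ok <- !is.na(idx) & idx > 0
  # Keep only codes that also fall below the matching upper bound
  ok[ok] <- sic[ok] <= ranges$hi[idx[ok]]
  out[ok] <- ranges$label[idx[ok]]
  out
}

df <- data.frame(SIC = c(150, 2085, 4911, 5080))
df$FF_IND <- classify_sic(df$SIC, ranges)
```

Codes that fall in no range are left as NA, which makes unclassified rows easy to spot.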

Related

Removing Duplicates From a Dataframe in R

I am trying to clean up a data set of student results for processing, and I'm having trouble completely removing duplicates. I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data using one of the duplicates is:
id period desc
632 1507 1101 90714 Research a contemporary biological issue
633 1507 1101 6317 Explain the process of speciation
634 1507 1101 8931 Describe gene expression
14448 1507 1201 8931 Describe gene expression
14449 1507 1201 6317 Explain the process of speciation
14450 1507 1201 90714 Research a contemporary biological issue
25884 1507 1301 6317 Explain the process of speciation
25885 1507 1301 8931 Describe gene expression
25886 1507 1301 90714 Research a contemporary biological issue
The first 2 digits of reg_period are the year they sat the paper. As can be seen, I want to keep the rows where id is 1507 and reg_period is 1101. So far, an example of my code to get the values I want to trim is:
unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])
However, there are a couple of problems I am then running into. This only works because the data is ordered by id and reg_period, and this isn't guaranteed in future. Plus, I don't know how to take this list of duplicate entries and select the rows that are not in it, because %in% doesn't seem to work with it and a loop with rbind runs out of memory.
What's the best way to handle this?
I would probably use dplyr. Calling your data df:
library(dplyr)
# keep, for each id, only the rows from the earliest period
result = df %>% group_by(id) %>%
filter(period == min(period))
If you prefer base, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:
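# sort by id and period so the earliest period comes first within each id
id_pd = df[order(df$id, df$period), c("id", "period")]
# keep the first (earliest) id/period pair per id
id_pd = id_pd[!duplicated(id_pd$id), ]
# inner join on the shared columns id and period
result = merge(df, id_pd)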
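For comparison, here is a small self-contained sketch of the same "earliest period per id" filter using base R's ave, which is a different technique from the merge above. The sample rows are loosely based on the question; the second id (2000) and its rows are made up to show that the order of the rows does not matter.

```r
# Toy data: id 1507 follows the question, id 2000 is invented.
# The rows are deliberately not sorted by id or period.
df <- data.frame(
  id     = c(1507, 1507, 1507, 1507, 2000, 2000),
  period = c(1201, 1101, 1101, 1301, 1301, 1201),
  desc   = c("8931 Describe gene expression",
             "90714 Research a contemporary biological issue",
             "6317 Explain the process of speciation",
             "6317 Explain the process of speciation",
             "8931 Describe gene expression",
             "8931 Describe gene expression"),
  stringsAsFactors = FALSE
)

# For every row, the minimum period within that row's id
first_period <- ave(df$period, df$id, FUN = min)

# Keep only rows belonging to each student's first attempt
result <- df[df$period == first_period, ]
```

Because ave returns a vector aligned with the original rows, no sorting or joining is needed.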
Try this, it works for me with your data:
dd <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
print (dd)
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id","period")]), ]
print (dd)
Output:
id period desc
1 1507 1101 90714 Research a contemporary biological issue
4 1507 1201 8931 Describe gene expression
7 1507 1301 6317 Explain the process of speciation

Summing values for a month in R

please see data sample as follows:
3326 2015-03-03 Wm Eu Apple 2L 60
3327 2015-03-03 Tp Euro 2 Layer 420
3328 2015-03-03 Tpe 3-Layer 80
3329 2015-03-03 14/3 Bgs 145
3330 2015-03-04 T/P 196
3331 2015-03-04 Wm Eu Apple 2L 1,260
3332 2015-03-04 Tp Euro 2 Layer 360
3333 2015-03-04 14/3 Bgs 1,355
Currently, graphing this data creates a really horrible graph because the number of cartons changes so rapidly from day to day. It would make more sense to sum the cartons by month so that each data point represents a sum for that month rather than an individual day. The current range of the data is 11/01/2008-04/01/2015.
This is the code that I am using to graph (which may or may not be relevant for this):
ggvis(myfile, ~Shipment.Date, ~ctns) %>%
layer_lines()
Shipment.Date is column 2 in the data set and ctns is the 4th column.
I don't know much about R and have given it a few tries with some code that I found here, but I don't think I have found a problem similar enough to match the code. My idea is to create a new table, sum Act. Ctns for the month, save that as the new table, and graph from there.
Thanks for any assistance! :)
Do you need this:
data.aggregated<-aggregate(list(new.value=data$value),
by=list(date.time=cut(data$date.time, breaks="1 month")),
FUN=function(x) sum(x))
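Applied to the column names mentioned in the question (Shipment.Date and ctns), the same aggregate call might look like the sketch below. It assumes the dates are stored as text in year-month-day form and that the carton counts were read as text because of the thousands separators (as in "1,260"). The April row is an invented value added so that there are two months to sum.

```r
# Toy data in the shape described in the question; the last row is made up.
myfile <- data.frame(
  Shipment.Date = c("2015-03-03", "2015-03-04", "2015-03-04", "2015-04-01"),
  ctns          = c("60", "1,260", "360", "100"),
  stringsAsFactors = FALSE
)

# Convert the text columns to proper types
myfile$Shipment.Date <- as.Date(myfile$Shipment.Date)
myfile$ctns <- as.numeric(gsub(",", "", myfile$ctns))  # drop thousands separators

# One row per calendar month, with the cartons summed
monthly <- aggregate(
  list(ctns = myfile$ctns),
  by = list(month = cut(myfile$Shipment.Date, breaks = "month")),
  FUN = sum
)
```

The month column is a factor labelled by the first day of each month, so it may need converting back with as.Date(as.character(monthly$month)) before plotting.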

Merging in R based on dates

I'm using getSymbols to import stock data from Yahoo to R.
When I store it in a data frame, it's in the following format.
IDEA.BO.Open IDEA.BO.High IDEA.BO.Low IDEA.BO.Close IDEA.BO.Volume
2007-03-09 92.40 94.25 84.00 85.55 63599400
2007-03-12 85.55 89.95 85.55 87.40 12490900
2007-03-13 88.50 91.25 86.20 89.85 16785000
2007-03-14 87.05 90.85 86.60 87.75 7763800
2007-03-15 90.00 94.00 88.80 91.45 14808200
2007-03-16 92.40 93.65 91.25 92.40 6365600
Now the date column has no name.
I want to import data for 2 stocks and merge their closing prices (between any arbitrary set of rows) on the basis of dates. The problem is that the date column is not being recognized.
I want my final result to be like this.
IDEA.BO.Close BHARTIARTL.BO.Close
2007-03-12 123 333
2007-03-13 456 645
2007-03-14 789 999
I tried the following:
> c <- merge(Cl(IDEA.BO),Cl(BHARTIARTL.BO))
> c['2013-08/']
IDEA.BO.Close BHARTIARTL.BO.Close
2013-08-06 NA 323.40
2013-08-07 NA 326.80
2013-08-08 157.90 337.40
2013-08-09 157.90 337.40
The same data on excel looks like this:
8/6/2013 156.75 8/6/2013 323.4
8/7/2013 153.1 8/7/2013 326.8
8/8/2013 157.9 8/8/2013 337.4
8/9/2013 157.9 8/9/2013 337.4
I don't understand the reason for the NA values in R, or how to obtain merged data free of NA values.
You need to do more reading about xts and zoo data structures. They are matrices with indices that are ordered. When you convert them to data.frames they become lists with a 'rownames' attribute, which gets displayed by print.data.frame with no header. The list elements are given names based on the naming of the matrix columns. (I do understand Joshua's visible annoyance at this question since he has posted many SO examples of how to use xts-objects.)
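To illustrate the point about row names, here is a base R sketch that does not use xts at all. It assumes the two series have already been converted to data frames whose dates sit in the row names, and it reuses the values printed in the question. The Date column and the object names idea and bharti are introduced here for illustration only.

```r
# Data frames as they might look after conversion: dates live in the row names
idea <- data.frame(
  IDEA.BO.Close = c(157.90, 157.90),
  row.names = c("2013-08-08", "2013-08-09")
)
bharti <- data.frame(
  BHARTIARTL.BO.Close = c(323.40, 326.80, 337.40, 337.40),
  row.names = c("2013-08-06", "2013-08-07", "2013-08-08", "2013-08-09")
)

# Move the row names into a real, named Date column
idea$Date   <- as.Date(rownames(idea))
bharti$Date <- as.Date(rownames(bharti))

# merge() on data frames is an inner join by default, so dates missing
# from either series are dropped instead of being filled with NA
merged <- merge(idea, bharti, by = "Date")
```

This only removes the NA rows from the result. It does not explain why one source is missing prices for some dates.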

easy way to subset data into bins

I have a data frame as seen below with over 1000 rows. I would like to subset the data into bins by 1m intervals (0-1m, 1-2m, etc.). Is there an easy way to do this without finding the minimum depth and using the subset command multiple times to place the data into the appropriate bins?
Temp..ºC. Depth..m. Light time date
1 17.31 -14.8 255 09:08 2012-06-19
2 16.83 -21.5 255 09:13 2012-06-19
3 17.15 -20.2 255 09:17 2012-06-19
4 17.31 -18.8 255 09:22 2012-06-19
5 17.78 -13.4 255 09:27 2012-06-19
6 17.78 -5.4 255 09:32 2012-06-19
Assuming that the name of your data frame is df, do the following:
split(df, findInterval(df$Depth..m., floor(min(df$Depth..m.)):0))
You will then get a list where each element is a data frame containing the rows that have Depth..m. within a particular 1 m interval.
Notice however that empty bins will be removed. If you want to keep them you can use cut instead of findInterval. The reason is that findInterval returns an integer vector, making it impossible for split to know what the set of valid bins is. It only knows the values it has seen and discards the rest. cut on the other hand returns a factor, which has all valid bins defined as levels.
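A short sketch of the difference, using the six depths from the question (the data frame here holds only the temperature and depth columns, under the simplified name Temp). right = FALSE is chosen so that cut uses left-closed intervals, matching how findInterval assigns values.

```r
df <- data.frame(
  Temp      = c(17.31, 16.83, 17.15, 17.31, 17.78, 17.78),
  Depth..m. = c(-14.8, -21.5, -20.2, -18.8, -13.4, -5.4)
)

# 1 m breakpoints from the deepest whole metre up to the surface
breaks <- floor(min(df$Depth..m.)):0

# findInterval: integer codes, so split() only sees the bins that occur
by_seen <- split(df, findInterval(df$Depth..m., breaks))

# cut: a factor with one level per interval, so empty bins are kept
bins   <- cut(df$Depth..m., breaks = breaks, right = FALSE)
by_all <- split(df, bins)

length(by_seen)  # number of non-empty bins
length(by_all)   # number of all 1 m bins between the deepest depth and 0
```

With right = FALSE a depth of exactly 0 would fall outside the last interval and become NA, so the breaks may need extending if such values occur.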

Read text file with many 2D datasets in it using R

I have a data file which I'd like to read into R which is something like the following:
STARTOFDATA 2011-06-23 35
143 6456 23 646 123.53A 864.95 23B
343 634 24 545 65.3 235.2 94C
...
524 542 45 245.4 24 245A 45B
STARTOFDATA 2011-06-24 84
245 6532 24.4 624.2 542 23B 35A
241 4532 13.5 235.12 534.23 54 32B
etc...
As you can see, it's basically a 2D dataset (each of the columns between the header lines is a different variable) which is stored for a number of dates, specified by the STARTOFDATA lines, which split up the different days. The number at the end of the header line is the number of lines of data before the next header line. The A's, B's and C's etc are quality control information which can basically just be discarded - probably just as a gsub on the text I get from the file.
My question is: how should I go about reading this into R? Ideally I'd like to be able to read either the whole file, or a specified date (or date range). I should probably point out that the file is over 200,000 lines long!
I've done some thinking and researching about this, but can't seem to work out a sensible way to do it.
As far as I can see it, there are two questions:
How to read the file: Is there a way to move a pointer around within a file in R? Some other languages I've worked with have had that ability, in which case I could read the first line, read the date, see if I want that date or not, then if not skip the number of lines listed at the end of the header (preferably without reading them!) and read the next header line. I can't see anything in the documentation about a function that would let me do that without actually reading in the lines. It seems that if I create a connection object manually then that will keep track of where I am in the file, and I can use repeated calls to readLines (in a loop) to read in chunks of the file, discarding them once read if they're not needed.
How to store the data: Ideally I want to store the 2D dataset for each date in a dataframe, then I can continue to do any analysis on them fairly easily. However, how should I store loads of these 2D datasets? I'm thinking of a list of data-frames, but is that the best way to do it (in terms of being able to index the list sensibly)?
Any ideas or comments would be much appreciated.
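The connection-based idea described above could be sketched roughly as follows. The function name read_date is made up, the sample file is written to a temporary path, and the line counts in the headers are set to 2 so that they match the shortened sample. Note that skipped lines are still read by readLines and then discarded, since the sketch does not seek within the file.

```r
# Hypothetical helper: return the data frame for one date, or NULL if absent.
read_date <- function(path, wanted) {
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    header <- readLines(con, n = 1)
    if (length(header) == 0) return(NULL)       # end of file reached
    parts <- strsplit(header, " +")[[1]]        # "STARTOFDATA", date, line count
    body  <- readLines(con, n = as.integer(parts[3]))
    if (parts[2] == wanted) {
      # Strip the quality-control letters, then parse the numeric columns
      return(read.table(text = gsub("[A-Z]", "", body)))
    }
  }
}

# Small sample file in the format from the question
path <- tempfile(fileext = ".txt")
writeLines(c(
  "STARTOFDATA 2011-06-23 2",
  "143 6456 23 646 123.53A 864.95 23B",
  "343 634 24 545 65.3 235.2 94C",
  "STARTOFDATA 2011-06-24 2",
  "245 6532 24.4 624.2 542 23B 35A",
  "241 4532 13.5 235.12 534.23 54 32B"
), path)

day2 <- read_date(path, "2011-06-24")
```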
Use readLines to read your data as a character vector and then manipulate this vector. Here is some code that splits your sample data into a list of blocks:
Use readLines to read the data:
x <- readLines(textConnection(
"STARTOFDATA 2011-06-23 35
143 6456 23 646 123.53A 864.95 23B
343 634 24 545 42 65.3 235.2 94C
...
524 542 45 245.4 24 542.54 245A 45B
STARTOFDATA 2011-06-24 84
245 6532 24.4 624.2 542 23B 35A
241 4532 13.5 235.12 534.23 54
etc..."))
Determine the positions of STARTOFDATA, then split into a list of blocks:
positions <- c(grep("STARTOFDATA", x), length(x)+1)
lapply(head(seq_along(positions), -1),
function(i)x[positions[i]:(positions[i+1]-1)])
[[1]]
[1] "STARTOFDATA 2011-06-23 35"
[2] "143 6456 23 646 123.53A 864.95 23B"
[3] "343 634 24 545 42 65.3 235.2 94C"
[4] "..."
[5] "524 542 45 245.4 24 542.54 245A 45B"
[[2]]
[1] "STARTOFDATA 2011-06-24 84"
[2] "245 6532 24.4 624.2 542 23B 35A"
[3] "241 4532 13.5 235.12 534.23 54"
[4] "etc..."
Now each block of data is an element in a list, and you can process each one as required using a second lapply().
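One way that second pass might look, as a sketch: it assumes every data line within a block has the same number of fields and that the quality-control flags are single capital letters. The helper name parse_block is made up, and the sample is shortened to complete lines only.

```r
x <- c(
  "STARTOFDATA 2011-06-23 2",
  "143 6456 23 646 123.53A 864.95 23B",
  "343 634 24 545 65.3 235.2 94C",
  "STARTOFDATA 2011-06-24 2",
  "245 6532 24.4 624.2 542 23B 35A",
  "241 4532 13.5 235.12 534.23 54 32B"
)

# Split into blocks exactly as in the answer above
positions <- c(grep("^STARTOFDATA", x), length(x) + 1)
blocks <- lapply(head(seq_along(positions), -1),
                 function(i) x[positions[i]:(positions[i + 1] - 1)])

# Hypothetical helper: drop the header line, strip QC letters, parse numbers
parse_block <- function(block) {
  body <- gsub("[A-Z]", "", block[-1])
  read.table(text = body)
}

datasets <- lapply(blocks, parse_block)

# Name each data frame by the date in its header for easy indexing
names(datasets) <- sapply(blocks, function(b) strsplit(b[1], " +")[[1]][2])

datasets[["2011-06-24"]]
```

A named list of data frames keeps each day separate while still allowing lookup by date or iteration with lapply.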

Resources