I'm new to the R language and I'm having some difficulty calculating the returns in my dataset for every Identification.
I have a very large dataset of monthly observations grouped like so:
Code  Subset  Identification  Names  Times   Value  %
100   1001    10011           .....  201012  10     40
100   1001    10012           .....  201012  11     60
100   1002    10021           .....  201012  7      30
100   1002    10022           .....  201012  13     70
.....
100   1001    10011           .....  201301  11     45
100   1001    10012           .....  201301  15     55
100   1002    10021           .....  201301  9      33
100   1002    10022           .....  201301  17     67
I need to write a function that calculates the monthly rate of return for every Identification. Then I need to aggregate the calculated values up to the "Subset" level, as a mean weighted by "%".
I've converted the Times vector to a year-month format (i.e. "%Y-%m") in this way:
as.yearmon(as.character(Data$Times), format = "%Y%m")
and I've tried to calculate the returns for every Identification using split and sapply, like this:
xm <- split(Data, Identification)
Retxm <- sapply(1:length(xm), function(x) returns(Value))
The output I got using the code above looks like this:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] 1.605198e-03 1.605198e-03 1.605198e-03 1.605198e-03
[3,] -1.190902e-02 -1.190902e-02 -1.190902e-02 -1.190902e-02
[4,] 3.318032e-03 3.318032e-03 3.318032e-03 3.318032e-03
The output is not very clear; I would like to have the Times on the rows and the Identification values in the header.
Thank you so much!
Here's a minimal dataset which is similar:
set.seed(1)
df1 <- data.frame(id = sample(c("10011", "10012", "10013"), 6, replace = TRUE),
                  d1 = rep(c(201012, 201101), each = 3),
                  v1 = ceiling(20 * runif(6)))
As to your first question, you can't format an object as Date in base R unless you specify the day in addition to the month and year.
To handle dates which are specified by month & year you could use:
library(zoo)
df1$d1 <- as.yearmon(as.character(df1$d1), format="%Y%m")
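(If you do need a base-R Date object instead, the usual workaround is to paste a nominal day onto the year-month value before parsing; a quick sketch:)
# base-R alternative, applied to the original numeric d1 values
# (i.e. before the yearmon conversion above): append a nominal day
as.Date(paste0(201012, "01"), format = "%Y%m%d")  # "2010-12-01"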
As to the second part of the question, it's unclear to me what sort of calculation you're trying to perform. Following your approach, you can indeed split the data.frame and do something with each element, e.g. get the sum of the v1 column:
l1 <- split(df1, df1$id)
sapply(1:length(l1), function(i) sum(l1[[i]]$v1))
Edit: My JavaScript isn't working so I can't add a comment. It's still not clear what you're trying to do; it would be better if you could spell it out with a working example. Try editing your original question if you're able to do so.
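For what the question literally asks, here is a minimal base-R sketch. It assumes that "returns" means the simple month-over-month change in Value, that the "%" weights sum to 100 within each Subset and month (as in the sample data), and that the weight column has been renamed Weight, since "%" is an awkward column name in R:
library(zoo)
# simple return: (V_t - V_{t-1}) / V_{t-1}; NA for each series' first month
simple_returns <- function(v) c(NA, diff(v) / head(v, -1))
Data$Times <- as.yearmon(as.character(Data$Times), format = "%Y%m")
Data <- Data[order(Data$Identification, Data$Times), ]
Data$Ret <- ave(Data$Value, Data$Identification, FUN = simple_returns)
# returns with Times on the rows and Identification in the header
with(Data, tapply(Ret, list(Times, Identification), FUN = sum))
# Weight-weighted mean return per Subset and month
sapply(split(Data, list(Data$Subset, Data$Times)),
       function(d) weighted.mean(d$Ret, d$Weight))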
Related
I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site     weather  depth_ft  depth_m  vis_ft  vis_m  coral_safety  coral_deep  rate
alice    rain     95        NA       50      NA     2             4           9
alice    over     NA        25       NA      25     2             4           9
steps    clear    NA        27       NA      25     2             4           9
steps    <NA>     30        NA       20      1      4             9           NA
andrea1  clear    60        NA       60      NA     2             4           5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough to take a list of names (or at least not with my current knowledge of R, which is growing, but still in its infancy). Is there another command I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data, you can reference it like this:
splitdat[["alice"]]
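From there you can also operate on all sites at once; for example, a quick way to count the records per site:
# number of rows recorded for each site
sapply(splitdat, nrow)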
I would use the plyr package.
library(plyr)
ll <- dlply(reefdata, .variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one-shot solutions.
If you want a step-by-step procedure with a loop (which is frowned upon by many R users, but which I find helpful for understanding what's going on), try this:
# create vector with site names, assuming reefdata$site is a factor
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list( NULL )
# collect data per site into the list
for( i in 1:length( sites ) )
{
# subset
dive <- reefdata[ reefdata$site == sites[ i ] , ]
# add resulting data.frame to the list
dives[[ i ]] <- dive
# name the list element
names( dives )[ i ] <- sites[ i ]
}
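However you build it, the result is a named list, so you can sanity-check it the same way for the split(), dlply(), and loop versions:
names(dives)            # one element per site
nrow(dives[["alice"]])  # number of records for site "alice"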
Elementary question:
I'm trying to subset a vector of a data frame based on a vector of dates that correspond with the vector that I wish to subset. Consider the following data frame as an example:
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 NA
2 6/12/10 5:01:00 40 1 NA
3 6/12/10 5:02:00 50 1 NA
4 6/13/10 5:03:00 10 2 NA
5 6/13/10 5:04:00 20 2 NA
6 6/13/10 5:05:00 30 2 NA
I want to fill the rightmost column with the sum of the Axis1 values for each day. Basically, data[1:3, 5] should equal 110, and data[4:6, 5] should equal 60.
I know there are many ways to do this that are smarter/faster/better than what I'm attempting to do (e.g., my date variable is a factor split into "levels" that I don't know how to access), but I'm trying to build my skills from the ground up, and want to figure out how to:
Take a subset of data$Axis1 that will only grab the values for the 1st day
Take a subset of the values of data$Axis1 that will only grab the values for the 2nd day
Sum the values for each day, and place them in column 5, overwriting the "NA"
I successfully used a similar approach to auto-fill the "Day" column, which was originally full of NA values (code below). But I'm getting stuck as I think about how to a) subset with dates, and b) sum while subsetting.
Thanks in advance for your help - also, let me know if my question could be clearer/I'm violating cardinal stackoverflow rules. I'm very new to R and the coding community in general; I appreciate your help!
dates <- c("6/12/10", "6/13/10")
counts <- c(1:2)
x <- nrow(data)
for (i in 1:x) {
  for (j in seq_along(dates)) {   # was 1:12; dates only has 2 entries here
    if (data[i, 1] == dates[j]) {
      data[i, 4] <- counts[j]
    }
  }
}
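For comparison, the same Day fill can be done without loops: match() returns the position of each Date within the dates vector, and indexing counts with it reproduces the double loop (a sketch, assuming the data frame is called data as above):
# vectorised equivalent of the double loop above
data$Day <- counts[match(data$Date, dates)]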
Using ave():
transform(dat, Sum.A1.Daily = ave(dat$Axis1, dat$Date, FUN = sum))
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 110
2 6/12/10 5:01:00 40 1 110
3 6/12/10 5:02:00 50 1 110
4 6/13/10 5:03:00 10 2 60
5 6/13/10 5:04:00 20 2 60
6 6/13/10 5:05:00 30 2 60
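For reference, ave() applies FUN within each group and recycles the result back to the full length of the input, which is why it slots straight into transform(). A tiny illustration with the Axis1 values from the question:
# group sums, repeated once per row of the group
ave(c(20, 40, 50, 10, 20, 30), rep(c("6/12/10", "6/13/10"), each = 3), FUN = sum)
# [1] 110 110 110  60  60  60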
Another way would be to use data.table:
# Let's say df is your dataset
library(data.table)
dt <- as.data.table(df)
dt[, Sum.A1.Daily := sum(Axis1), by = Date]  # := adds the column by reference
I have panel data with "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I store the value obtained by forecasting from the previous 5 years. If there are fewer than 5 previous values available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h = 1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[, xp := rollapply(.SD$x, 5, timeseries, align = "right", fill = NA), by = "entity"]
with:
library(forecast)  # provides auto.arima() and forecast()
timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Could it come from the "5" width parameter in rollapply, combined with certain entities in the dataset not having enough data?
Thanks again for your time and help.
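That guess is plausible: the traceback's seq.default(start.at, NROW(data), by = by) call fails with exactly this message when NROW(data) is smaller than the window's starting offset, i.e. when an entity has fewer than 5 rows. If that diagnosis is right, a guarded sketch (reusing the timeseries() function above) would be:
library(data.table)
library(zoo)
DT <- data.table(mydata)
DT <- DT[order(entity, year)]
# roll only where the entity has at least 5 observations;
# shorter entities get NA throughout
DT[, xp := if (.N >= 5) rollapply(x, 5, timeseries, align = "right", fill = NA)
           else NA_real_,
   by = entity]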
I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two sets of points, one with 41 points and the other with 110,357,407, and the values in the rows are the distances between the two sets (the distance from each point in list 1 to every single point in list 2). The first column is an index of the points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to reshape this matrix into just 3 columns: the first would be like the first column of the matrix (the 110,357,407 row indices), the second would identify which of the 41 points each distance belongs to, and the third would be the distance between the two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances from every back point to the first pres point are listed, pres changes to 2, and eventually works its way up to 41.)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little clearer about what I need, take these two smaller datasets:
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
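As a sanity check, here is the same idea run on the small 5 x 3 example from the question above (the column names V2:V4 are assumptions for the toy data):
df <- data.frame(Back = 1:5,
                 V2 = c(3427, 3432, 3486, 3449, 3483),
                 V3 = c(3444, 3486, 3479, 3438, 3486),
                 V4 = c(3451, 3476, 3486, 3484, 3486))
d <- unlist(df[-1])
newdf <- cbind(expand.grid(Back = seq_len(nrow(df)),
                           Pres = seq_len(ncol(df) - 1)),
               Dist = d)
head(newdf, 6)  # Back runs 1..5 for Pres = 1, then Pres moves on to 2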
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
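For instance, a rough sketch of that chunked approach, streaming one distance column at a time to disk so only one column's worth of rows is in memory at once (the file name is an assumption, and the index is assumed to sit in column 1):
# write the long format column-by-column, appending after the first chunk
for (j in 2:ncol(df)) {
  chunk <- data.frame(Back = df[[1]], Pres = j - 1, Dist = df[[j]])
  write.table(chunk, "distances_long.csv", sep = ",",
              row.names = FALSE,
              col.names = (j == 2),  # header only on the first chunk
              append = (j > 2))
}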
I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) of the covariates are irrelevant. I want to concatenate the values of the x covariates to uniquely identify the observations and eliminate the duplicates.
set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
"1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
"1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))
test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)
'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,
head(test)
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.2070657 -0.9111954
2 1 1/1/2010 400 0.2774292 -0.8371717
3 1 1/1/2010 400 1.0844412 2.4158352
4 1 1/2/2012 400 -2.3456977 0.1340882
5 2 1/2/2009 600 0.4291247 -0.4906859
6 2 1/2/2004 700 0.5060559 -0.4405479
Observations 1 and 4 are not duplicates; observations 2 and 3 are. The new dataset I want would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:
subset(test, !duplicated(c(UNIT,DATE,OUT1)))
Which unfortunately does not do the trick:
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.20706575 -0.9111954
5 2 1/2/2009 600 0.42912469 -0.4906859
8 3 1/2/2005 800 -0.54663186 -0.6937202
11 4 1/1/2011 900 -0.47719270 -1.0236557
14 5 1/1/2008 100 0.06445882 1.1022975
15 6 1/1/2012 100 0.95949406 -0.4755931
Although it does ignore the irrelevant variables (JUNK1, JUNK2), the technique is too greedy. The new dataset should contain three observations for unit 1, because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?
You can pass a whole data.frame to duplicated. (Your call fails because c(UNIT, DATE, OUT1) flattens the three columns into one long vector, so duplicated() never sees the row-wise combinations.)
In your case, you want to pass the first 3 columns of test:
test2 <- test[!duplicated(test[,1:3]),]
If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique
library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)
For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table
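In more recent data.table versions you do not need to set a key at all; unique() accepts a by argument directly (a sketch):
library(data.table)
DT <- as.data.table(test)
DTU <- unique(DT, by = c("UNIT", "DATE", "OUT1"))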
Thanks! Looks like we can do:
test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]
and it delivers the goods as well. So we can just use the column names rather than 1:3, and the order doesn't matter.
You can use distinct() from the dplyr package:
library(dplyr)
test %>%
distinct(UNIT, DATE, OUT1)
Or without the %>% pipe:
distinct(test, UNIT, DATE, OUT1)
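Note that distinct() as written keeps only the three named columns. If you want the remaining columns (JUNK1, JUNK2) to survive, with one row kept per unique combination, add .keep_all:
test %>%
  distinct(UNIT, DATE, OUT1, .keep_all = TRUE)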