Cumulative sum in R based on two columns

R newbie here; I've tried to figure this out on the basis of earlier questions, but didn't have much success. I have data that looks roughly like the following:
Name Date Value
A 2014-09-11 1.23
A 2014-12-11 4.56
A 2014-03-01 7.89
A 2014-06-05 0.12
B 2014-09-25 9.87
B 2014-12-21 6.54
B 2014-11-12 3.21
I'm looking to perform the following task on a data frame: add an index column that counts the cumulative occurrences of each value in the column Name (which contains strings, not factors). Then, for each Name, replace all elements of Value at cumulative index k or larger with the element at index k-1 for that Name.
So for k=4, the result would be:
Name Date Value
A 2014-09-11 1.23
A 2014-12-11 4.56
A 2014-03-01 7.89
A 2014-06-05 7.89
B 2014-09-25 9.87
B 2014-12-21 6.54
B 2014-11-12 3.21
Any hints on how to do this in idiomatic R? Looping over the frame would probably work, but I'm trying to learn to do it the way it was intended, and to pick up some R skills along the way.

I think that you are looking for this:
require("data.table")
A = data.table(
Name = c("A","A","A","A","B","B","B"),
Date = c("2014-09-11", "2014-12-11", "2014-03-01", "2014-06-05", "2014-09-25", "2014-12-21", "2014-11-12"),
Value = c(1.23, 4.56, 7.89, 0.12, 9.87, 6.54,3.21))
A[,IX:=seq(1,.N),by="Name"]
Edit: (Since you corrected the question, I update my answer.)
# Keep the first b values of x and repeat the b-th value for the rest.
# (Assumes each group has at least b rows.)
func <- function(x, b) c(x[seq_len(b)], rep(x[b], length(x) - b))
k <- 4
A[, Value := func(Value, k - 1), by = "Name"]
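For comparison, a dplyr version of the same per-group logic might look like the following sketch (my own variant, not part of the original answer; run it on the original A, before the data.table update above):
library(dplyr)
A %>%
  group_by(Name) %>%
  mutate(IX = row_number(),                                 # per-Name occurrence index
         Value = ifelse(IX >= k, Value[k - 1], Value)) %>%  # cap at the (k-1)-th entry
  ungroup()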

Related

Can anyone explain this R language ts function

myts <- ts(mydata[,2:4], start = c(1981, 1), frequency = 4)
The above line of code is clear to me except the part ts(mydata[,2:4]). I know that ts(mydata[,2]) tells R to read from column 2; however, what does 2:4 stand for?
The data looks like this:
usnim
1 2002-01-01 4.08
2 2002-04-01 4.10
3 2002-07-01 4.06
4 2002-10-01 4.04
Could you provide an example of when to use 2:4?
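For what it's worth, 2:4 is simply the integer sequence c(2, 3, 4), so mydata[, 2:4] selects columns 2 through 4 at once, and ts() then builds one series per column. A minimal sketch (the extra columns x and y are made up for illustration; only usnim comes from the question):
# 2:4 expands to c(2, 3, 4), so mydata[, 2:4] picks three columns at once.
mydata <- data.frame(date  = c("2002-01-01", "2002-04-01", "2002-07-01", "2002-10-01"),
                     usnim = c(4.08, 4.10, 4.06, 4.04),
                     x     = c(1.1, 1.2, 1.3, 1.4),
                     y     = c(2.1, 2.2, 2.3, 2.4))
# With several columns, ts() returns a multivariate time series ("mts"),
# here quarterly (frequency = 4) starting in 1981 Q1.
myts <- ts(mydata[, 2:4], start = c(1981, 1), frequency = 4)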

Data aggregation by week or by 3 days

Here is an example of my data:
Date Prec aggregated by week (output)
1/1/1950 3.11E+00 4.08E+00
1/2/1950 3.25E+00 9.64E+00
1/3/1950 4.81E+00 1.15E+01
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00
I have a long time series of precipitation data and I want to aggregate it as follows (the desired output is in the third column above; I calculated it in Excel).
If I aggregate by week:
output in 1st cell = average prec over days 1 to 7
output in 2nd cell = average prec over days 8 to 14
output in 3rd cell = average prec over days 15 to 21
If I aggregate by 3 days:
output in 1st cell = average over days 1 to 3
output in 2nd cell = average over days 4 to 6
I will provide the function with the prec values and the time step as input. I tried loops, lubridate, POSIXct, and some other functions, but I can't produce the output shown in the third column.
One piece of code I came up with ran without error, but the output is not correct (dat is my data set):
library(zoo)
library(xts)  # apply.weekly() comes from xts
tt <- as.POSIXct(paste(dat$Date), format = "%m/%d/%Y")  # convert the date format
datZoo <- zoo(dat[, -c(1, 3)], tt)  # drop the Date and Excel-output columns
weekly <- apply.weekly(datZoo, mean)  # note: calendar weeks, not fixed 7-day blocks
prec_NLCD <- data.frame(weekly)
I would also like to write this in the form of a function. Your suggestions will be helpful.
Assuming the data shown reproducibly in the Note at the end, create the weekly means, zm, and then merge them with z.
(It would seem to make more sense to merge the means at the point that they are calculated, i.e. merge(z, zm) in place of the line marked ##, but for consistency with the output shown in the question they are put at the head of the data below.)
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
zm <- rollapplyr(z, 7, by = 7, mean)  # non-overlapping 7-day means
merge(z, zm = zoo(coredata(zm), head(time(z), length(zm)))) ##
giving:
z zm
1950-01-01 3.11 4.081429
1950-01-02 3.25 9.642857
1950-01-03 4.81 11.517143
1950-01-04 7.07 NA
1950-01-05 4.25 NA
1950-01-06 3.11 NA
1950-01-07 2.97 NA
1950-01-08 2.83 NA
1950-01-09 2.72 NA
1950-01-10 2.72 NA
1950-01-11 2.60 NA
1950-01-12 2.83 NA
1950-01-13 17.00 NA
1950-01-14 36.80 NA
1950-01-15 42.40 NA
1950-01-16 17.00 NA
1950-01-17 7.07 NA
1950-01-18 3.96 NA
1950-01-19 3.54 NA
1950-01-20 3.40 NA
1950-01-21 3.25 NA
Note:
Lines <- "Date Prec
1/1/1950 3.11E+00
1/2/1950 3.25E+00
1/3/1950 4.81E+00
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00"
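The same rollapplyr pattern handles the 3-day case from the question: width and by just become 3. A minimal sketch of the requested function (agg_prec is a hypothetical name, not from the original answer):
# Non-overlapping k-day means of a zoo series: width = k together with
# by = k makes rollapplyr average each consecutive block of k observations.
agg_prec <- function(z, k) rollapplyr(z, k, by = k, mean)
agg_prec(z, 3)  # 3-day means
agg_prec(z, 7)  # weekly means, identical to zm above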

R log-transformation on dataframe

I have a dataframe (df) with the values (V) of different stocks at different dates (t). I would like to get a new df with the profitability for each time period.
Profitability is: ln(Vi_t / Vi_t-1)
where:
ln is the natural logarithm
Vi_t is the value of stock i at date t
Vi_t-1 is the value of the same stock at the previous date
This is the output of df[1:3, 1:10]
date SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
1 01/08/88 1507.5 3.63 4.98 159.20 15.62 14.64 4.01 4.59 11.33
2 01/09/88 1467.4 3.69 4.97 161.55 15.69 14.40 4.06 4.87 11.05
3 01/10/88 1538.0 3.27 5.47 173.72 16.02 14.72 4.14 5.05 11.94
Specifically, instead of 1467.4 at [2, "SMI"] I want the profitability which is ln(1467.4/1507.5) and the same for all the rest of the values in the dataframe.
As I am new to R, I am stuck. I was thinking of using something like mapply and writing the transformation function myself.
Any help is highly appreciated.
This will compute the profitabilities (assuming the data is in a data.frame called d):
(d2 <- log(d[-1, -1] / d[-nrow(d), -1]))  # each row divided by the previous row
#          SMI        Bond          ABB     ADDECO      Credit      Holcim     Nestle   Novartis       Roche
# 2 -0.02696052  0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
# 3  0.04699074 -0.12083647  0.095858776 0.07263012 0.020814375  0.02197891 0.01951281 0.03629431  0.07746368
Then, you can add in the dates, if you want:
d2$date <- d$date[-1]
Alternatively, since ln(Vt / Vt-1) = ln(Vt) - ln(Vt-1), you could use an apply-based approach built on diff(log(x)):
(d2 <- apply(d[-1], 2, function(x) diff(log(x))))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#[1,] -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#[2,] 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
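As a quick sanity check (a sketch; d2a and d2b are hypothetical names), the two routes give the same numbers:
d2a <- log(d[-1, -1] / d[-nrow(d), -1])            # ratio route
d2b <- apply(d[-1], 2, function(x) diff(log(x)))   # diff(log()) route
all.equal(as.matrix(d2a), d2b, check.attributes = FALSE)  # TRUE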

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, taking the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, reading the data in, should not be an issue if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, you would delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to a data.frame, set the colnames to the first row, and then remove the first row, as sketched below. It might be possible to skip the last step through some combination of arguments to data.frame, but I don't know what they are right now.
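A sketch of that transpose idea for a single id (assuming myDF has been read in as above; the #797 block is just the example from the question):
block <- t(myDF[myDF$V2 == "#797", c("V1", "V3")])  # 2 x 4 character matrix
wide  <- as.data.frame(block[-1, , drop = FALSE])   # keep only the value row
colnames(wide) <- block[1, ]                        # V1 labels become names
wide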

R- replace values in a matrix with the average value of its group?

I am new-ish to R and have what should be a simple enough question to answer; any help would be greatly appreciated.
The situation is that I have a tab-delimited data matrix (data matrix.txt) like the one below, with group information in the last column.
sampleA sampleB sampleC Group
obs11 23.2 52.5 -86.3 1
obs12 -86.3 32.5 -84.7 1
obs41 -76.2 35.8 -16.3 2
obs74 23.2 32.5 -86.8 2
obs82 -86.2 52.8 -83.2 3
obs38 -36.2 59.5 -74.3 3
I would like to replace the values in each group with the average value for that group.
How can a group average, rather than a row or column average, be calculated in R?
And how can I use this value to replace the original values? Is the replace() function usable in this situation, or is it only for replacing two known values?
Thanks in advance.
The ddply function from the plyr package should do the trick.
library(plyr)
dat <- as.data.frame(matrix(runif(80), ncol = 4))  # simulated data
dat$group <- sample(letters[1:4], size = 20, replace = TRUE)
head(dat)
ddply(.data = dat, .variables = .(group), colwise(mean))
Result
group V1 V2 V3 V4
1 a 0.4741673 0.7669612 0.5043857 0.5039938
2 b 0.3648794 0.5776748 0.4033758 0.5748613
3 c 0.1450466 0.5399372 0.2440170 0.5124578
4 d 0.4249183 0.3252093 0.5467726 0.4416924
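To actually overwrite each value with its group mean, as the question asks, base R's ave() is handy; a minimal sketch on the same simulated data (my addition, not part of the original answer):
# ave(x, g) returns a vector the same length as x in which every element
# is replaced by the mean of its group, so assignment keeps the shape.
dat[1:4] <- lapply(dat[1:4], function(x) ave(x, dat$group))
head(dat)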
