R log-transformation on dataframe

I have a dataframe (df) with the values (V) of different stocks at different dates (t). I would like to get a new df with the profitability for each time period.
Profitability is: ln(V_{i,t} / V_{i,t-1})
where:
ln is the natural logarithm
V_{i,t} is the value of stock i at date t
V_{i,t-1} is the value of the same stock at the previous date
This is the output of df[1:3, 1:10]
date SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
1 01/08/88 1507.5 3.63 4.98 159.20 15.62 14.64 4.01 4.59 11.33
2 01/09/88 1467.4 3.69 4.97 161.55 15.69 14.40 4.06 4.87 11.05
3 01/10/88 1538.0 3.27 5.47 173.72 16.02 14.72 4.14 5.05 11.94
Specifically, instead of 1467.4 at [2, "SMI"] I want the profitability which is ln(1467.4/1507.5) and the same for all the rest of the values in the dataframe.
As I am new to R, I am stuck. I was thinking of using something like mapply and writing the transformation function myself.
Any help is highly appreciated.

This will compute the profitabilities, dividing each row by the row before it and taking logs (assuming the data is in a data.frame called d):
(d2 <- log(d[-1, -1] / d[-nrow(d), -1]))
#          SMI        Bond          ABB     ADDECO      Credit      Holcim     Nestle   Novartis       Roche
#2 -0.02696052  0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#3  0.04699074 -0.12083647  0.095858776 0.07263012 0.020814375  0.02197891 0.01951281 0.03629431  0.07746368
Then, you can add in the dates, if you want:
d2$date <- d$date[-1]
Alternatively, you could use an apply based approach:
(d2 <- apply(d[-1], 2, function(x) diff(log(x))))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#[1,] -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#[2,] 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
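Note that apply() returns a plain matrix here; if you want a data.frame with the dates attached, a minimal sketch (assuming d as above):
d2 <- as.data.frame(apply(d[-1], 2, function(x) diff(log(x))))
d2$date <- d$date[-1]  # each return is stamped with the later of its two dates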

Related

Subset/extract the highest values of groups in one column based on values of groups in other columns

I'm looking at oxygen concentrations in relation to bottom trawling at different depths in inner Danish waters for the last 40 years.
I have a data frame (Oxy) with four columns: ID, Date, Depth and Oxygen. The Oxygen has been measured throughout many years (Date), at many different locations (ID) and at many different Depths down the water column, spanning from 0-50 meters.
I would like to create a data frame with the Oxygen for the last 4 meters of Depth (from the bottom up to 4 meters above it) for each station and the corresponding date. The measurements are not taken at every whole meter but at varying depths. The depths at which Oxygen has been measured are not the same for each ID: one ID may be sampled at 0.2, 0.4, 0.6 meters etc. and another at 0.67, 1.3, 1.55 meters etc. The maximum Depth also varies by ID, so the deepest measurement is at 30 meters for one station and 46 meters for another.
I have about 4 million rows, so this is just an output of my data:
ID Date Depth Oxygen
------ ---------- ----- ------
957001 2002-01-14 1.20 12.10
967503 2002-01-28 2.00 11.60
957001 2002-01-22 25.00 7.80
965206 2002-01-28 5.40 11.70
953001 2002-01-31 23.60 10.30
941101 2002-01-22 8.67 12.00
940201 2002-01-17 5.00 11.70
965404 2002-01-30 38.80 9.40
952003 2002-01-08 23.40 6.30
957101 2002-01-15 6.00 11.60
I have been searching on Google for an answer but can't seem to find the right one. I can extract the highest value, or the top 5 highest values, using arrange(), group_by() and slice(). However, that won't work for my data frame because the measurement intervals vary in depth, and the rule needs to work for all IDs and Dates.
I imagine that it could be something like: take the highest value and then keep the values that are within 4 of that highest value.
So, I need to end up with all the deepest (last 4 meters for Depth) measurements for Oxygen dependent on ID and Date.
It would look something like this:
ID Date Depth Oxygen
------ ---------- ----- ------
957001 2002-01-14 30.20 2.10
967503 2002-01-28 28.00 1.60
957001 2002-01-22 29.00 7.80
965206 2002-01-28 30.40 5.70
953001 2002-01-31 23.60 10.30
941101 2002-01-22 28.67 7.00
940201 2002-01-17 30.00 8.70
965404 2002-01-30 38.80 9.40
952003 2002-01-08 23.40 6.30
957101 2002-01-15 46.00 1.60
Just as you said: filter to Depth greater than or equal to max(Depth) - 4 within each ID. Using dplyr:
library(dplyr)
oxy %>%
  group_by(ID) %>%
  filter(Depth >= max(Depth) - 4) %>%
  ungroup()
# A tibble: 9 × 4
ID Date Depth Oxygen
<dbl> <date> <dbl> <dbl>
1 967503 2002-01-28 2 11.6
2 957001 2002-01-22 25 7.8
3 965206 2002-01-28 5.4 11.7
4 953001 2002-01-31 23.6 10.3
5 941101 2002-01-22 8.67 12
6 940201 2002-01-17 5 11.7
7 965404 2002-01-30 38.8 9.4
8 952003 2002-01-08 23.4 6.3
9 957101 2002-01-15 6 11.6
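If the same station can be sampled on several dates and each ID/Date cast should be treated separately (the question asks for the result "dependent on ID and Date"), grouping by both columns is presumably what you want:
oxy %>%
  group_by(ID, Date) %>%
  filter(Depth >= max(Depth) - 4) %>%
  ungroup()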

Data aggregation by week or by 3 days

Here is an example of my data:
Date Prec aggregated by week (output)
1/1/1950 3.11E+00 4.08E+00
1/2/1950 3.25E+00 9.64E+00
1/3/1950 4.81E+00 1.15E+01
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00
I have a long precipitation time series and I want to aggregate it as follows (the output is in the third column; I calculated it in Excel):
If I aggregate by week:
output in 1st cell = average prec over days 1 to 7.
output in 2nd cell = average prec over days 8 to 14.
output in 3rd cell = average prec over days 15 to 21.
If I aggregate by 3 days:
output in 1st cell = average over days 1 to 3.
output in 2nd cell = average over days 4 to 6.
I will provide the function with the "prec" and "time step" inputs. I tried loops, lubridate, POSIXct, and some other functions, but I can't reproduce the output shown in the third column.
One piece of code I came up with ran without error, but the output is not correct.
Where dat is my data set.
library(zoo)
library(xts)  # apply.weekly comes from xts
tt <- as.POSIXct(paste(dat$Date), format = "%m/%d/%Y")  # convert the date format
datZoo <- zoo(dat[, -c(1, 3)], tt)
weekly <- apply.weekly(datZoo, mean)
prec_NLCD <- data.frame(weekly)
Also, I wanted to write this in the form of a function. Your suggestions will be helpful.
Assuming the data shown reproducibly in the Note at the end, create the weekly means, zm, and then merge them with z. Note that rollapplyr(z, 7, by = 7, mean) averages consecutive blocks of 7 rows starting from the first observation, matching your Excel output, whereas apply.weekly cuts at calendar week boundaries, which is likely why your attempt did not match.
(It would seem to make more sense to merge the means at the points where they are calculated, i.e. merge(z, zm) in place of the line marked ##, but for consistency with the output shown in the question they are placed at the head of the data below.)
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
zm <- rollapplyr(z, 7, by = 7, mean)
merge(z, zm = zoo(coredata(zm), head(time(z), length(zm)))) ##
giving:
z zm
1950-01-01 3.11 4.081429
1950-01-02 3.25 9.642857
1950-01-03 4.81 11.517143
1950-01-04 7.07 NA
1950-01-05 4.25 NA
1950-01-06 3.11 NA
1950-01-07 2.97 NA
1950-01-08 2.83 NA
1950-01-09 2.72 NA
1950-01-10 2.72 NA
1950-01-11 2.60 NA
1950-01-12 2.83 NA
1950-01-13 17.00 NA
1950-01-14 36.80 NA
1950-01-15 42.40 NA
1950-01-16 17.00 NA
1950-01-17 7.07 NA
1950-01-18 3.96 NA
1950-01-19 3.54 NA
1950-01-20 3.40 NA
1950-01-21 3.25 NA
Note:
Lines <- "Date Prec
1/1/1950 3.11E+00
1/2/1950 3.25E+00
1/3/1950 4.81E+00
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00"
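Since the question also asks for 3-day aggregation and a reusable function, a minimal sketch under the same assumptions (z as read above); only the window width changes:
agg_mean <- function(z, step) rollapplyr(z, step, by = step, mean)
agg_mean(z, 7)  # weekly means, as above
agg_mean(z, 3)  # means of days 1-3, 4-6, ...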

R inference from one matrix to a data frame

I think this may be a very simple and easy question, but since I'm new to R, I hope someone can give me some outlines of how to solve it step by step. Thanks!
So the question is: I have an (n x 2) matrix (say m) where the first column holds the index of a row in another data frame (say d) and the second column holds a p value.
What I want to do is: if the p value in some row r of m is less than 0.05, plot the row of d indicated by the first column of row r of m.
The data is somewhat like what I draw below:
m:
ind p_value
2 0.02
23 0.03
56 0.12
64 0.54
105 0.04
d:
gene_id s1 s2 s3 s4 ... sn
IDH1 0.23 3.01 0 0.54 ... 4.02
IDH2 0.67 0 8.02 10.54 ... 0.72
...
so IDH2 corresponds to the first line of m, whose index column is 2
This one-liner works:
toplot <- d[m[m[, "p_value"] < 0.05, "ind"], ]
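A small worked sketch of that indexing, using the values from the question (d is assumed to be the gene-expression data frame; the matplot call is just one illustrative way to plot the selected rows):
m <- cbind(ind = c(2, 23, 56, 64, 105), p_value = c(0.02, 0.03, 0.12, 0.54, 0.04))
sig <- m[m[, "p_value"] < 0.05, "ind"]  # 2, 23, 105: the rows of d with p < 0.05
toplot <- d[sig, ]
matplot(t(toplot[, -1]), type = "l")    # one line per selected gene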

cumulative sum in r based on two columns

R newbie here; I have tried to figure this out from earlier questions, but without much success. I have data that looks roughly like the following:
Name Date Value
A 2014-09-11 1.23
A 2014-12-11 4.56
A 2014-03-01 7.89
A 2014-06-05 0.12
B 2014-09-25 9.87
B 2014-12-21 6.54
B 2014-11-12 3.21
I'm looking to perform the following task on a data-frame: Add an index column that counts the cumulative occurrences of the column Name (which contains strings, not factors). For each "Name" replace all elements at cumulative index k or larger with the element at index k-1 for the given Name.
So for k=4, the result would be:
Name Date Value
A 2014-09-11 1.23
A 2014-12-11 4.56
A 2014-03-01 7.89
A 2014-06-05 7.89
B 2014-09-25 9.87
B 2014-12-21 6.54
B 2014-11-12 3.21
Any hints at how to do this in idiomatic R; looping over the frame will probably work, but I'm trying to learn to do this the way it was intended, to pick up some R skills on the go as well.
I think that you are looking for this:
require("data.table")
A = data.table(
Name = c("A","A","A","A","B","B","B"),
Date = c("2014-09-11", "2014-12-11", "2014-03-01", "2014-06-05", "2014-09-25", "2014-12-21", "2014-11-12"),
Value = c(1.23, 4.56, 7.89, 0.12, 9.87, 6.54,3.21))
A[,IX:=seq(1,.N),by="Name"]
Edit: since you corrected the question, I have updated my answer.
func <- function(x, b) c(x[seq(1, b)], rep(x[b], length(x) - b))
k <- 4
A[, Value := func(Value, k - 1), by = "Name"]
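For comparison, the same replacement sketched with dplyr (assuming A and k as built above, before the := assignment overwrites Value, and that every group has at least k - 1 rows):
library(dplyr)
A %>%
  group_by(Name) %>%
  mutate(Value = if_else(row_number() >= k, Value[k - 1], Value)) %>%
  ungroup()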

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, and take the first column as the column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first step, reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to a data.frame, set the colnames to the first row, and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame, but I don't know what they are right now.
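A rough sketch of that transpose idea for a single id, using the #797 rows from the question (this assumes myDF from read.csv above and that the four rows appear in the order shown; note the values end up as character strings):
one <- myDF[myDF$V2 == "#797", ]
wide <- as.data.frame(t(one$V3), stringsAsFactors = FALSE)
colnames(wide) <- one$V1
wide
#             name at_rank to_center predicted
# 1 Stachy, Poland       1      4.70      4.70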
