Subsetting only positive values of a specific column in a list - r

I have the following code to get a list of options data and create a new list containing only the puts data (only_puts_list):
library(quantmod)
Symbols <- c("AA", "AAL", "AAOI", "ABBV", "ABC", "ABNB")
Options.20221111 <- lapply(Symbols, getOptionChain)
names(Options.20221111) <- Symbols
only_puts_list <- lapply(Options.20221111, function(x) x$puts)
I'd now like to subset only_puts_list and create a new list (i.e. new_list1) that keeps only the rows with a positive value in the ChgPct column.
I guess lapply should work, but how do I apply it so that only rows with positive values in the specific column ChgPct are kept?

We could use subset while looping over the list with lapply:
new_list1 <- lapply(only_puts_list, subset, subset = ChgPct > 0)
If we check the output, most of the list elements returned have 0 rows, as there were no positive observations in 'ChgPct'. We can use Filter to keep only those elements having any rows:
new_list1_sub <- Filter(nrow, new_list1)
-output
new_list1_sub
$ABBV
ContractID ConractSize Currency Expiration Strike Last Chg ChgPct Bid Ask Vol OI LastTradeTime IV ITM
31 ABBV221202P00155000 REGULAR USD 2022-12-02 155.0 0.66 0.1100000 20.00000 0.56 0.66 70 480 2022-11-29 13:10:43 0.2690503 FALSE
32 ABBV221202P00157500 REGULAR USD 2022-12-02 157.5 1.49 0.2400000 19.20000 1.41 1.51 544 383 2022-11-29 13:17:43 0.2627027 FALSE
33 ABBV221202P00160000 REGULAR USD 2022-12-02 160.0 3.05 0.4300001 16.41222 2.79 2.99 34 308 2022-11-29 12:07:54 0.2692944 TRUE
34 ABBV221202P00162500 REGULAR USD 2022-12-02 162.5 4.95 1.6499999 50.00000 4.80 5.05 6 28 2022-11-29 13:26:10 0.3017648 TRUE
$ABC
ContractID ConractSize Currency Expiration Strike Last Chg ChgPct Bid Ask Vol OI LastTradeTime IV ITM
18 ABC221202P00165000 REGULAR USD 2022-12-02 165 1.05 0.1999999 23.5294 0.6 0.8 3 111 2022-11-29 09:51:47 0.2710034 FALSE
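For reference, the subsetting and the empty-element drop can also be done in one pass with an anonymous function in base R (a minimal equivalent sketch; the !is.na() guard additionally skips contracts with a missing ChgPct):
new_list1_sub <- Filter(nrow, lapply(only_puts_list,
    function(x) x[!is.na(x$ChgPct) & x$ChgPct > 0, ]))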

Related

Dataframe modification consisting of multiple steps

I have these two datasets that I am trying to use for linear regression. One contains daily average values (independent variables) measured at a weather station:
date ST5_mean ST1_mean ST0_mean ST10_mean Snowheight Precipitation
1 2014-10-08 11.136713 10.980278 11.333995 11.622550 0.23680556 118
2 2014-10-09 9.255580 8.727486 8.796319 11.635243 0.00000000 124
3 2014-10-10 10.297521 9.441427 9.376736 12.879920 0.00000000 108
4 2014-10-11 9.080031 9.172347 9.389281 9.372538 0.01041667 152
5 2014-10-12 10.059455 9.428875 9.392774 11.866694 0.00000000 425
.
.
.
242 2015-06-06 12.946955 11.979896 11.50326 14.060399 0.00000000 470
243 2015-06-07 12.918128 11.737031 11.17246 13.691757 0.00000000 407
244 2015-06-08 12.214410 11.779344 11.50781 12.370771 0.00000000 100
245 2015-06-09 11.271517 10.942083 10.79751 11.324122 0.00000000 19
246 2015-06-10 8.597696 9.730661 10.20789 8.181455 0.01180556 481
The second one is basically logger data (dependent variable), which may have several measurements per day or none (logger dataset) (data table jpeg). I need to modify the logger data so that it is consistent with the station data and I can run a regression on them, which means there should be 1 row per day.
Logger measurements (the "Distance" column) that happened on the same day need to be summed up so that a single value per day is obtained; if there are, for example, 3 measurements for 1.2.2014, there should be a value of 2.355 (3 x 0.785). Additionally, I need to create a row for every day of the period to match the sample size of the station data. A day for which the logger has no measurements should have a value of 0.
I need to perform these modifications for numerous datasets, so I need to figure out code that does this in an automatic/semi-automatic manner. Manually adding data would be absurd, as the datasets have up to a few thousand rows. Unfortunately, I couldn't come up with anything meaningful over the last few days. Any help is appreciated.
I hope I managed to explain the problem here. Let me know if you need more clarification. Thanks in advance!
P.S. I managed the first part, where I aggregate by date and obtain the daily sums; however, I am still stuck at creating a row for every day in the given time period and assigning 0 to the "distance" variable. This is what I have so far:
startTime <- as.Date("2014-10-08")
endTime <- as.Date("2015-06-10")
start_end <- c(startTime, endTime)
logger1 <- read.csv("124106_106.csv", header = TRUE, sep = ",")
logger1$date <- as.Date(logger1$Date, "%d.%m.%Y")
logger1_sum <- aggregate(logger1$Distance, by = list(logger1$date), FUN = sum, na.rm = TRUE)
names(logger1_sum) <- c("date", "distance")
head(logger1_sum, 5)
date distance
1 2014-10-02 1.570
2 2014-10-03 3.140
3 2014-10-08 3.925
4 2014-10-23 9.420
5 2014-10-24 3.925
tail(logger1_sum, 5)
date distance
45 2015-05-26 1.570
46 2015-05-27 1.570
47 2015-05-28 1.570
48 2015-06-10 0.785
49 2015-07-06 1.570
I think this should do the job. I use the data.table package, which makes joins easy and fast.
For brevity, I do not reproduce your data, so the code starts as if the logger and station data.frames were already in the environment. The code does the following: it sums the columns Distance and AccuDist (assuming those two are the important ones) by the column date, which is the one correctly formatted as class Date.
Then, I set the merging keys with the function setkey(). If you want to read more about how joins work and how to perform them with data.table, please refer to this link. If you instead want to know more about data.table in general, you can refer to the official website here.
I then define the data.table final, which comes out of a right outer join; this way, all the observations (i.e., rows) in the object station are retained.
library(data.table)
# this converts the two data.frames to data.tables by reference
setDT(logger)
setDT(station)
# sum Distance and AccuDist by date
logger_summed <- logger[ , .( sum_Distance = sum(Distance),
sum_AccuDist = sum(AccuDist)), by = date]
> head(logger_summed)
## date sum_Distance sum_AccuDist
## 1: 2014-10-02 1.570 2.355
## 2: 2014-10-03 3.140 14.130
## 3: 2014-10-08 3.925 35.325
## 4: 2014-10-23 9.420 164.850
## 5: 2014-10-24 3.925 102.050
## 6: 2014-10-25 2.355 70.650
setkey( logger_summed, date )
setkey( station, date )
final <- logger_summed[ station ]
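# days with no logger measurement come out of the join as NA; set them to 0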
final[ is.na(sum_Distance), `:=` ( sum_Distance = 0, sum_AccuDist = 0) ]
> final
## date sum_Distance sum_AccuDist ST5_mean ST1_mean ST0_mean ST10_mean Snowheight Precipitation
## 1: 2014-10-08 3.925 35.325 11.136713 10.980278 11.333995 11.622550 0.23680556 118
## 2: 2014-10-09 0.000 0.000 9.255580 8.727486 8.796319 11.635243 0.00000000 124
## 3: 2014-10-10 0.000 0.000 10.297521 9.441427 9.376736 12.879920 0.00000000 108
## 4: 2014-10-11 0.000 0.000 9.080031 9.172347 9.389281 9.372538 0.01041667 152
## 5: 2014-10-12 0.000 0.000 10.059455 9.428875 9.392774 11.866694 0.00000000 425
## ---
## 242: 2015-06-06 0.000 0.000 12.946955 11.979896 11.503257 14.060399 0.00000000 470
## 243: 2015-06-07 0.000 0.000 12.918128 11.737031 11.172462 13.691757 0.00000000 407
## 244: 2015-06-08 0.000 0.000 12.214410 11.779344 11.507812 12.370771 0.00000000 100
## 245: 2015-06-09 0.000 0.000 11.271517 10.942083 10.797510 11.324122 0.00000000 19
## 246: 2015-06-10 0.785 115.395 8.597696 9.730661 10.207893 8.181455 0.01180556 481
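For comparison, the same complete-the-dates-and-fill-with-zero logic can be sketched in base R with seq() and merge() (assuming the logger1_sum and station data.frames from the question):
# one row per day over the study period
all_days <- data.frame(date = seq(as.Date("2014-10-08"),
                                  as.Date("2015-06-10"), by = "day"))
# the left join keeps every day; days absent from the logger come back as NA
merged <- merge(all_days, logger1_sum, by = "date", all.x = TRUE)
merged$distance[is.na(merged$distance)] <- 0
# finally attach the station variables
final_base <- merge(merged, station, by = "date")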
Does this help?

Issue with calculating row means in a data.table for selected columns in R

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below. My code below does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]
OK, so you're doing a couple of things wrong. First, rowMeans can't evaluate a character vector; if you want to select columns with one, you must use .SD and pass the character vector to .SDcols. Second, you're trying to combine a row-wise aggregation with grouping, which I don't think makes much sense. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would destroy your original data (if you want to add a new column, use := to add it by reference).
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This means: create these new columns as the row means of my subset of the data (.SD), where .SDcols controls which columns .SD covers.
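As a self-contained check, here is that approach run on the table from the question (values copied from the post):
library(data.table)
Table <- data.table(
  LP      = c(215, 0.1, 0, 51),
  GMweek1 = c(45, 50, 45, 22.1), GMweek2 = c(50, 61, 24, 54), GMweek3 = c(60, 24, 35, 13),
  PMweek1 = c(11, 12, 22, 35),   PMweek2 = c(0.4, 0.8, 20, 16), PMweek3 = c(10.2, 80, 15.4, 2.2)
)
sel_cols_GM <- c("GMweek1", "GMweek2", "GMweek3")
sel_cols_PM <- c("PMweek1", "PMweek2", "PMweek3")
# := adds the averages by reference without touching the original columns
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]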

R: identifying the first value in a data frame and creating a new variable by adding/subtracting it from all values in a new column

I know this question may have already been answered elsewhere, and apologies for repeating it if so, but I haven't found a workable answer yet.
I have 17 subjects each with two variables as below:
Time (s) OD
130 41.48
130.5 41.41
131 39.6
131.5 39.18
132 39.41
132.5 37.91
133 37.95
133.5 37.15
134 35.5
134.5 36.01
135 35.01
I would like R to identify the first value in column 2 (OD) of my dataframe and create a new column (OD_adjusted) by adding or subtracting it (depending on whether the first value is positive or negative) from all values in column 2, so it would look like this:
Time (s) OD OD_adjusted
130 41.48 0
130.5 41.41 -0.07
131 39.6 -1.88
131.5 39.18 -2.3
132 39.41 -2.07
132.5 37.91 -3.57
133 37.95 -3.53
133.5 37.15 -4.33
134 35.5 -5.98
134.5 36.01 -5.47
135 35.01 -6.47
The first value in column 2 is 41.48, therefore I want to subtract this value from all data points in column 2 to create a new third column (OD_adjusted).
I can use OD_adjusted <- ((df$OD) - 41.48); however, I would like to automate the process using a function, and this is where I am stuck:
AUC_OD <- function(df){
  return_value_1 = df %>%
    arrange(OD) %>%
    filter(OD [1,2) %>%
    slice_(1)
  colnames(return_value_1)[3] <- "OD_adjusted"
  if (nrow(return_value_1) > 0 ) { subtract
    (return_value_1 [1,2] #into new row
  else add
    (return_value_1 [1,2] #into new row
}
We can get the first element of 'OD' and subtract it from the column:
library(dplyr)
df1 %>%
  mutate(OD_adjusted = OD - OD[1])
Or using base R:
df1$OD_adjusted <- with(df1, OD - OD[1])
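Since the question asks for a function, either version can be wrapped up; a minimal sketch (it only assumes the data frame has an OD column):
adjust_OD <- function(df) {
  # subtracting the first value works whether it is positive or negative
  df$OD_adjusted <- df$OD - df$OD[1]
  df
}
df1 <- adjust_OD(df1)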

Error message in treeclim

I'm attempting to use the package treeclim to analyze my tree-ring growth data and climate data. I measured the widths in CooRecorder, grouped them into series in CDENDRO, and read them into RStudio using the dplR read.rwl function. However, I keep getting an error message reading:
"Error in dcc(Plot92.crn, Site92PRISM, selection = -6:9, method = "response", :
Overlapping time span of chrono and climate records is smaller than number of parameters! Consider adapting the number of parameters to a maximum of 100."
I have 100 years of monthly climate data that looks like below:
# head(Site92PRISM)
year month ppt tmax tmin tmean vpdmin..hPa. vpdmax..hPa. site
1 1915 01 0.97 26.1 12.3 19.2 0.97 2.32 92
2 1915 02 1.20 31.5 16.2 23.9 1.03 3.30 92
3 1915 03 2.51 36.0 17.0 26.5 0.97 4.69 92
4 1915 04 3.45 48.9 26.3 37.6 1.14 8.13 92
5 1915 05 3.95 44.6 29.1 36.9 0.94 5.58 92
6 1915 06 6.64 51.0 31.5 41.3 1.04 7.93 92
And my chronology, made in dplR, looks like below:
#head(Plot92.crn)
CAMstd samp.depth
1840 0.7180693 1
1841 0.3175528 1
1842 0.5729651 1
1843 0.9785082 1
1844 0.7676334 1
1845 0.3633687 1
Where am I going wrong? Both files contain data from 1915-2015.
I posted a similar question to the author in the Google forum of the package (i.e. https://groups.google.com/forum/#!forum/treeclim).
What you need to make sure of is that the number of parameters (n_param) is less than or equal to the sample size of your dendrochronological data. By 'number of parameters' I mean the total number of columns in the climatic variable matrices.
For instance, in the following analysis:
resp <- dcc(chrono = my_chrono,
climate = list(precip, temp),
boot = 'stationary')
You need to make sure that the following is TRUE:
length(unique(rownames(my_chrono))) >= (ncol(precip)-1) + (ncol(temp)-1)
It's ncol(precip)-1 and not ncol(precip) because the first column of the matrix is YEAR. Also note that in my example the years in my_chrono are the same years as in precip and temp, which doesn't have to be the case to run the function (it will automatically take the common years).
Finally, if the previous line of code gives you FALSE, you can reduce the number of parameters with the argument selection, like this:
resp <- dcc(chrono = my_chrono,
climate = list(precip, temp),
selection = .range(6:12,'prec') + .range(6:12, 'temp'),
var_names = c('prec', 'temp'),
boot = 'stationary')
Because the dcc function automatically takes all months from previous June to current September (i.e. .range(-6:9)), you may need to reduce that range.
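As a rough sanity check against the data in the question (a sketch using the asker's objects; whether treeclim also counts the constant site column may differ):
n_months <- length(-6:9)            # previous June through current September = 16 months
n_vars   <- ncol(Site92PRISM) - 2   # climate variables after dropping year and month
n_months * n_vars                   # 16 * 7 = 112 parameters, above the ~100 years of overlap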

list unique values for each column in a data frame

Suppose you have a very large input file in csv format, and you want to know the different values that occur in each column. How would you do that?
ex.
column1 column2 column3 column4
----------------------------------------
value11 value12 value13 value14
value21 value22 value23 value24
...
valueN1 valueN2 valueN3 valueN4
So I want my output to be something like:
column1 has these values: value11, value21, ... valueN1. But I don't need to see recurrences of the same value; I just need this to get an idea of what my data is all about.
Let dat be your data frame after reading in the csv file; you can do:
ulst <- lapply(dat, unique)
If you further want to know the number of unique values in each column, do:
k <- lengths(ulst)
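For instance, on the built-in chickwts data (standing in for a data frame read from csv):
dat <- chickwts             # in practice: dat <- read.csv("yourfile.csv")
ulst <- lapply(dat, unique)
ulst$feed                   # the six distinct feed types
lengths(ulst)               # unique values per column: weight 66, feed 6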
I find the describe() function from the Hmisc package very handy for getting an overview of a dataset, e.g.:
Hmisc::describe(chickwts)
chickwts
2 Variables 71 Observations
----------------------------------------------------------------------------------------------------------------
weight
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
71 0 66 1 261.3 90.26 140.5 153.0 204.5 258.0 323.5 359.0
.95
385.0
lowest : 108 124 136 140 141, highest: 380 390 392 404 423
----------------------------------------------------------------------------------------------------------------
feed
n missing distinct
71 0 6
Value casein horsebean linseed meatmeal soybean sunflower
Frequency 12 10 12 11 14 12
Proportion 0.169 0.141 0.169 0.155 0.197 0.169
----------------------------------------------------------------------------------------------------------------
