Substituting the results of a calculation - r

I'm munging data. Specifically, I've opened this PDF http://pubs.acs.org/doi/suppl/10.1021/ja105035r/suppl_file/ja105035r_si_001.pdf and scraped the data from Table S4,
1a 1b 1a 1b
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
9(R) 4.16 3.89 4.12b 4.18b
9(S) 4.16 3.92 4.12b 4.18b
10 1.19 0.91 1.21 1.25
pasted it into Notepad, and saved it as a .txt file.
s4 <- read.table("s4.txt", header=TRUE, stringsAsFactors=FALSE)
gives,
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
In order to use the data I need to convert it all to numeric and remove the letters. Thanks to this link, R regex gsub separate letters and numbers, I can use the following code,
gsub("[[:alpha:]]", "", s4[,3])
to get rid of the extraneous letters.
What I want to do now, and the point of the question, is to replace the ranges,
"7.22-7.35" "7.22-7.35" "7.22-7.35"
with their means,
"7.29"
Could I use gsub for this, or would I need to strsplit across the hyphen, combine into a vector, and return the mean?

You can do this with a single regex in strsplit, which removes the letters and splits the ranges in one step:
s4[] <- lapply(s4, function(x) {
  if (is.numeric(x)) x   # numeric columns pass through untouched
  else sapply(strsplit(as.character(x), "-|[[:alpha:]]"),  # split on hyphens and letters
              function(y) mean(as.numeric(y)))             # mean of the resulting pieces
})
The result:
> s4
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.090 4.750
2 2.47 2.74 2.770 2.800
4 1.14 1.38 1.120 1.020
6 7.43 7.35 7.285 7.305
7 7.38 7.34 7.285 7.305
8 7.23 7.20 7.285 7.305
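As a quick illustration of what the combined pattern does to a single entry:
strsplit("7.22-7.35a", "-|[[:alpha:]]")
# [[1]]
# [1] "7.22" "7.35"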

Here's an approach that seems to work right on the sample data:
df[] <- lapply(df, function(col){
col <- gsub("([[:alpha:]])","", col)
col <- ifelse(grepl("-", col), mean(as.numeric(unlist(strsplit(col[grepl("-", col)], "-")))), col)
as.numeric(col)
})
> df
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
Disclaimer: it only works right if the ranges in each column are all the same (as in the sample data).
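If the ranges within a column could differ, a per-element variant removes that limitation. A minimal sketch, assuming "-" only ever appears as a range separator (no negative values):
df[] <- lapply(df, function(col) {
  col <- gsub("[[:alpha:]]", "", col)   # strip the footnote letters first
  vapply(strsplit(col, "-"),            # split each entry on its own hyphen
         function(y) mean(as.numeric(y)),
         numeric(1))                    # single values are their own mean
})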

Something like this:
mean(as.numeric(unlist(strsplit("7.22-7.35", "-"))))
should work (and corresponds to what you had in mind, I guess), or you can do:
eval(parse(text = paste0("mean(c(", gsub("-", ",", "7.22-7.35"), "))")))
but I'm not sure this is simpler...
To apply it to a vector:
vec <- c("7.22-7.35", "7.22-7.35")
1st solution: sapply(vec, function(x) mean(as.numeric(unlist(strsplit(x, "-")))))
2nd solution: sapply(vec, function(x) eval(parse(text = paste0("mean(c(", gsub("-", ",", x), "))"))))
In both cases, you'll get:
7.22-7.35 7.22-7.35
7.285 7.285

Also,
library(gsubfn)
indx <- !sapply(s4, is.numeric)
s4[indx] <- lapply(s4[indx], function(x)
sapply(strapply(x, '([0-9.]+)', ~as.numeric(x)), mean))
s4
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
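If you'd rather avoid the gsubfn dependency, the same number extraction is possible in base R with gregexpr/regmatches (a sketch of the equivalent idea):
s4[indx] <- lapply(s4[indx], function(x)
  sapply(regmatches(x, gregexpr("[0-9.]+", x)),  # pull out every numeric token
         function(m) mean(as.numeric(m))))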


R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NAs because the company did not exist yet (until later dates) or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
library(data.table)
library(magrittr)   # provides %>% and not()
example <- fread(file = "example.csv", key = "date")
example_select <- example[, lapply(.SD, function(x) not(sum(is.na(x)) > 0))] %>%
  as.logical(.)
example[, ..example_select]
Is there better (fewer lines) code to do the same? Thank you!
Try:
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
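For comparison, the same logical-vector trick works on a plain data.frame in base R (a minimal sketch, assuming example has been read in as above):
df <- as.data.frame(example)
df[, colSums(is.na(df)) == 0]   # keep only columns with zero NAs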
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column)))
    return(NULL)
  else
    return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter:
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15

Cross-tabulating data with a function

I have data in three columns:
set.seed(42)
N <- 1000
XYp <- as.data.frame(matrix(cbind(round(runif(N) * 100),
                                  round(runif(N) * 1000 + 1000),
                                  round(runif(N), 2)),
                            N, 3))
colnames(XYp) <- c('X', 'Y', 'p')
Now I would like to cross-tabulate the data based on deciles in two dimensions:
colX_deciles <- quantile(XYp[,'X'], probs = seq(0, 1, 1/10))
colY_deciles <- quantile(XYp[,'Y'], probs = seq(0, 1, 1/10))
XYp['X_decile'] <- findInterval(XYp[,'X'],colX_deciles,all.inside = TRUE)
XYp['Y_decile'] <- findInterval(XYp[,'Y'],colY_deciles,all.inside = TRUE)
> colX_deciles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0.0 9.9 18.0 29.0 39.0 48.0 57.0 69.0 79.2 91.0 100.0
> colY_deciles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1000.0 1088.0 1180.0 1279.4 1392.0 1502.5 1602.4 1711.3 1805.2 1902.0 2000.0
I have figured out that it is possible to calculate the sum of elements in column p using xtabs:
> xtabs(p ~ X_decile + Y_decile, XYp)
Y_decile
X_decile 1 2 3 4 5 6 7 8 9 10
1 2.57 8.74 5.51 5.74 4.40 1.77 5.79 3.43 4.66 3.80
2 6.43 4.25 7.29 5.41 3.08 4.43 8.70 2.62 3.37 4.45
3 1.99 2.80 7.54 2.56 5.02 4.30 7.99 2.03 4.91 6.28
4 4.53 4.90 8.04 3.49 2.25 2.87 7.47 5.41 3.54 9.28
5 2.32 5.82 7.18 4.58 5.39 2.26 0.59 9.61 5.91 5.37
6 7.70 5.50 6.45 7.83 4.65 8.45 1.70 6.40 4.88 4.32
7 7.05 3.87 3.54 3.79 6.15 5.55 6.31 2.31 3.42 6.14
8 4.43 4.50 3.04 3.62 9.92 5.66 3.75 7.01 4.92 7.08
9 3.67 5.56 3.56 7.92 5.05 5.00 3.64 6.74 5.85 3.26
10 5.75 3.17 9.50 5.44 3.64 6.13 3.18 5.93 6.18 3.71
But how can I elegantly apply an arbitrary function, for example mean(p), to the cross-tabulated elements, in the following manner?
> xtabs(mean(p) ~ X_decile + Y_decile, XYp)
Error in model.frame.default(formula = mean(p) ~ X_decile + Y_decile, :
variable lengths differ (found for 'X_decile')
As a bonus, the values of colX_deciles[1:10] and colY_deciles[1:10] could be set as row names and column names, respectively.
I would suggest immersing the aggregate call inside xtabs:
xtabs(p ~ X_decile + Y_decile, aggregate(p ~ X_decile + Y_decile, XYp, mean))
Y_decile
X_decile 1 2 3 4 5 6 7 8 9 10
1 0.4283333 0.5826667 0.5009091 0.4100000 0.5500000 0.2950000 0.5263636 0.4900000 0.3584615 0.4222222
2 0.5358333 0.5312500 0.6627273 0.4918182 0.3850000 0.5537500 0.5800000 0.4366667 0.4814286 0.4450000
3 0.3980000 0.3500000 0.5800000 0.5120000 0.4183333 0.3583333 0.4205263 0.3383333 0.5455556 0.5233333
4 0.4118182 0.3769231 0.6700000 0.5816667 0.5625000 0.3587500 0.6225000 0.3864286 0.5900000 0.7138462
5 0.4640000 0.4476923 0.6527273 0.5088889 0.4900000 0.4520000 0.1966667 0.6006250 0.4925000 0.5370000
6 0.4812500 0.6111111 0.7166667 0.5592857 0.5166667 0.6035714 0.3400000 0.5818182 0.5422222 0.6171429
7 0.5035714 0.5528571 0.4425000 0.5414286 0.5125000 0.3964286 0.4853846 0.5775000 0.4275000 0.4723077
8 0.4430000 0.4090909 0.6080000 0.5171429 0.6200000 0.5660000 0.4687500 0.5392308 0.3784615 0.5446154
9 0.4077778 0.6177778 0.5085714 0.7200000 0.4208333 0.5000000 0.4550000 0.5616667 0.5318182 0.3622222
10 0.5227273 0.4528571 0.6785714 0.3885714 0.3640000 0.4715385 0.5300000 0.5390909 0.6866667 0.5300000
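On the bonus request: tapply computes the same per-cell means directly and lets you attach the decile boundaries as dimnames. A minimal sketch, reusing colX_deciles and colY_deciles from the question:
tab <- tapply(XYp$p, list(XYp$X_decile, XYp$Y_decile), mean)  # mean of p per decile cell
dimnames(tab) <- list(as.character(colX_deciles[1:10]),
                      as.character(colY_deciles[1:10]))
tab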

How can a variable name be used as a string in a for loop?

I am aware that:
deparse(substitute(x))
allows us to get the variable name "x" as a string, but in the context of a for-loop index it returns the name of the index itself (e.g. "i"), not the value that the index currently refers to.
My code contains 3 examples:
Histogram <- function(n){
l <- list()
m <- list()
s <- list()
df <- data.frame(l, m, s, stringsAsFactors = F) #Dataframe to store output
for(i in names(n)){
CS <- normalmixEM(n$i) #Generates Histogram
plot(CS, which = 2, main2 = paste(deparse(substitute(i)), "Colony Size (mm)"))
df <- cbind(CS$lambda, CS$mu, CS$sigma)
dev.copy(png, paste(deparse(substitute(i)),".png")) #Export histogram
dev.off()
}
write.csv(df, file = paste(deparse(substitute(n)), ".csv"))
}
Where I would ideally like:
The plot main title to contain the name of the variable that is in the current loop (i)
The label of the exported plot to contain the same variable (i)
The final exported .csv file name to be the name of the dataframe put into the function (n)
My input data is in this format:
>head(my_values)
X1A X2A X3A X4A X5A X6A X7A X8A X9A X10A X11A X12A X13A X14A X15A X16A X17A X18A X19A X20A X21A X22A X23A
1 2.11 4.58 4.39 5.43 5.73 4.96 3.89 1.65 5.56 3.72 4.20 2.81 4.80 3.95 4.31 3.84 1.63 4.65 2.11 1.90 3.78 6.80 5.51
2 0.55 3.32 4.58 4.67 5.75 4.39 2.39 1.96 4.01 2.85 3.46 2.51 2.28 3.70 4.25 3.30 4.37 4.04 2.79 1.84 3.93 5.18 4.27
3 1.96 3.84 4.14 6.42 5.47 6.17 5.37 1.88 5.85 2.75 3.74 2.47 4.84 2.49 2.56 3.87 4.84 3.19 2.26 2.83 5.03 4.56 5.94
Is there a way to recover the variable name that the loop index refers to, rather than the literal string "i"?
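A minimal illustration of the behavior described above, using a hypothetical data frame: because i iterates over names(n), i is already a character string and can be pasted directly, while deparse(substitute(i)) only ever yields the literal name "i".
n <- data.frame(X1A = 1:3, X2A = 4:6)   # hypothetical stand-in for my_values
for (i in names(n)) {
  print(i)                      # "X1A", then "X2A" -- i is already the name
  print(deparse(substitute(i))) # "i" both times
}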

subset xts or data.frame to just one particular day every year

I am new to quantmod. It has many ways to subset dates, but I need to subset to a specific day of the year, e.g. 12/24 of every year, out of a data set spanning many years, and quantmod does not seem to have this function. Is there a way to do that?
Example:
getSymbols('AMD',src='google')
and you get data starting from 2007 and I want to subset it to a dataframe with just
2007-12-24 ...
2008-12-24 ...
2016-12-26 ...
#and so on.
You can try something like this:
getSymbols('AMD',src='google')
# .indexmon() is zero-based, so 11 = December; .indexmday() gives the day of the month
AMD[.indexmon(AMD) == 11 & .indexmday(AMD) == 24]
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
#2007-12-24 7.78 7.88 7.68 7.77 9193719
#2008-12-24 1.98 2.03 1.97 1.99 2912312
#2009-12-24 9.79 9.95 9.78 9.91 11331966
#2012-12-24 2.54 2.57 2.47 2.48 9625363
#2013-12-24 3.77 3.80 3.75 3.77 5798855
#2014-12-24 2.63 2.70 2.63 2.65 4624005
#2015-12-24 2.88 3.00 2.86 2.92 11900888
Just to add to LyzandeR's answer, you could also convert the data to a tibble and use lubridate:
library(tidyverse)
library(lubridate)
library(quantmod)
getSymbols('AMD',src='google')
AMD %>% as_tibble() %>% rownames_to_column("date") %>%
filter(month(date) == 12, day(date) == 24)
date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-12-24 7.78 7.88 7.68 7.77 9193719
2 2008-12-24 1.98 2.03 1.97 1.99 2912312
3 2009-12-24 9.79 9.95 9.78 9.91 11331966
4 2012-12-24 2.54 2.57 2.47 2.48 9625363
5 2013-12-24 3.77 3.80 3.75 3.77 5798855
6 2014-12-24 2.63 2.70 2.63 2.65 4624005
7 2015-12-24 2.88 3.00 2.86 2.92 11900888
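The same filter can also be expressed directly on the xts index with base R's format(), no extra packages needed (a minimal sketch, reusing the AMD object from above):
AMD[format(index(AMD), "%m-%d") == "12-24"]   # keep rows whose index falls on Dec 24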

R: returning row value when certain number of columns reach certain value

Given the following table, I want to find, for each row, how many of the columns reach a certain value:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing the percentage of columns V1-V10 that reach the value of interest (5 in this example), row by row:
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(seq_len(nrow(mat)), rowSums(mat > 5) / ncol(mat) * 100)
As long as it's always about 0s and 1s with ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case...). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
Ok, so now I believe you want to get the percentage of values in each row that meet some threshold criteria. You give the example > 5. One solution of many is using apply:
apply(df, 1, function(x) sum(x > 5) / length(x) * 100)
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
#Thomas' solution will be faster for large data.frames because it converts to a matrix first, and these are faster to operate on.
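An equivalent vectorized form, assuming df is all-numeric, averages the logical matrix directly (a sketch):
rowMeans(df > 5) * 100   # fraction of TRUEs per row, scaled to percent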
