I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns contain NAs because the company did not exist yet (until later dates) or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
library(data.table)
library(magrittr)  # provides %>% and not()

example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x)) > 0))
] %>%
  as.logical(.)
example[, ..example_select]
Is there better (shorter) code to do the same? Thank you!
Try:
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
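Newer versions of data.table (1.12.0+, if I recall correctly) also let .SDcols take a predicate function, which keeps everything inside one data.table call; a minimal sketch:
example[, .SD, .SDcols = function(x) !anyNA(x)]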
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column))) {
    return(NULL)
  } else {
    return(column)
  }
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
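Since a tibble is a list of columns, the map2_dfc() step can arguably be compressed with purrr::discard (a sketch, reusing the same data object):
data %>% discard(anyNA)
discard() drops every column for which the predicate returns TRUE, so this should leave exactly the NA-free columns.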
Using Filter():
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
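Since base R already provides anyNA(), the predicate can be shortened with Negate() (a minor variation on the same idea):
Filter(Negate(anyNA), fread('example.csv'))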
I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My dataframe is an xts object.
In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because the value changes each time, remove the next 4 because they didn't change, and so on. The final data frame would look like the smaller one below the original.
Any help is appreciated!
Original Dataframe
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
Final Dataframe
2007-01-31 2.72 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-11-30 5.75 4.76 4
2008-04-30 6.86 4.91 1
2008-07-31 8.00 5.13 4
Here's another solution using run-length encoding, rle().
# lengths of each run of repeated values in V4
lens <- rle(df$V4)$lengths
# cumsum(lens) - lens + 1 is the starting index of each run
df[cumsum(lens) - lens + 1, ]
Output:
V1 V2 V3 V4
1 2007-01-31 2.72 4.75 2
4 2007-04-30 2.74 4.75 3
5 2007-05-31 2.46 4.75 2
6 2007-06-30 2.98 4.75 3
11 2007-11-30 5.75 4.76 4
16 2008-04-30 6.86 4.91 1
19 2008-07-31 8.00 5.13 4
You can use data.table::shift to filter out the unchanged rows, then rbind the first row back on:
library(data.table)
rbind(setDT(dt)[1], dt[v3 != shift(v3)])
Or an equivalent approach using dplyr
library(dplyr)
bind_rows(dt[1, ], filter(dt, v3 != lag(v3)))
Output:
date v1 v2 v3
<IDat> <num> <num> <int>
1: 2007-01-31 2.72 4.75 2
2: 2007-04-30 2.74 4.75 3
3: 2007-05-31 2.46 4.75 2
4: 2007-06-30 2.98 4.75 3
5: 2007-11-30 5.75 4.76 4
6: 2008-04-30 6.86 4.91 1
7: 2008-07-31 8.00 5.13 4
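If you'd rather avoid the rbind, the first row can be kept inside the filter itself, since shift() returns NA there (a sketch on the same dt):
dt[is.na(shift(v3)) | v3 != shift(v3)]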
DATA
x <- "
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
"
df <- read.table(textConnection(x) , header = F)
and use these two lines:
df$V5 <- c(1, diff(df$V4))
df[abs(df$V5) > 0, ][1:4]
#> V1 V2 V3 V4
#> 1 2007-01-31 2.72 4.75 2
#> 4 2007-04-30 2.74 4.75 3
#> 5 2007-05-31 2.46 4.75 2
#> 6 2007-06-30 2.98 4.75 3
#> 11 2007-11-30 5.75 4.76 4
#> 16 2008-04-30 6.86 4.91 1
#> 19 2008-07-31 8.00 5.13 4
Created on 2022-06-12 by the reprex package (v2.0.1)
I've got a tibble that I'm struggling to turn into a tsibble.
# A tibble: 13 x 8
year `Administration, E~ `All Staff` `Ambulance staff` `Healthcare Assi~ `Medical and De~ `Nursing, Midwife~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2009 3.97 5.08 7.16 6.94 1.36 6.19
2 2010 4.12 5.07 6.89 7.02 1.41 6.02
3 2011 4.06 5.03 6.69 7.06 1.36 6.02
4 2012 4.40 5.40 7.79 7.48 1.52 6.44
5 2013 4.28 5.35 8.19 7.46 1.48 6.44
6 2014 4.45 5.56 8.87 7.82 1.53 6.67
7 2015 4.30 5.29 6.86 7.54 1.44 6.30
8 2016 4.21 5.15 7.56 7.15 1.66 6.17
9 2017 4.33 5.13 7.32 7.20 1.69 6.04
10 2018 4.58 5.30 7.96 7.00 1.73 6.38
11 2019 4.71 5.52 7.66 7.96 1.94 6.65
12 2020 4.69 5.98 7.49 8.37 2.11 7.56
13 2021 4.19 5.72 9.62 8.47 1.71 7.29
# ... with 1 more variable: Scientific, Therapeutic and Technical staff <dbl>
How would I turn this into a tsibble so that I can plot graphs with ggplot2?
When trying as_tsibble()
absence_ts <- as_tsibble(absence, key = absence$All Staff, index = absence$year)
it comes up with the following error:
Error: Must subset columns with a valid subscript vector. x Can't convert from <double> to <integer> due to loss of precision.
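A likely fix, though untested against this exact data: as_tsibble() expects bare column names for key and index, not vectors, and the year column above is stored as <chr>, so it needs converting first. A sketch:
library(dplyr)
library(tsibble)
absence_ts <- absence %>%
  mutate(year = as.integer(year)) %>%
  as_tsibble(index = year)
With one row per year and no duplicated years, no key should be needed.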
I have data in three columns:
set.seed(42)
N = 1000
XYp = as.data.frame(matrix(cbind(round(runif(N) * 100),
                                 round(runif(N) * 1000 + 1000),
                                 round(runif(N), 2)),
                           N, 3))
colnames(XYp) <- c('X','Y','p')
Now I would like to cross-tabulate the data based on deciles in 2 dimensions:
colX_deciles = quantile(data[,'X'], probs=seq(0,1,1/10))
colY_deciles = quantile(data[,'Y'], probs=seq(0,1,1/10))
XYp['X_decile'] <- findInterval(XYp[,'X'],colX_deciles,all.inside = TRUE)
XYp['Y_decile'] <- findInterval(XYp[,'Y'],colY_deciles,all.inside = TRUE)
> colX_deciles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0.0 9.9 18.0 29.0 39.0 48.0 57.0 69.0 79.2 91.0 100.0
> colY_deciles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1000.0 1088.0 1180.0 1279.4 1392.0 1502.5 1602.4 1711.3 1805.2 1902.0 2000.0
I have figured out that it is possible to calculate the sum of elements in column p using xtabs:
> xtabs(p ~ X_decile + Y_decile, XYp)
Y_decile
X_decile 1 2 3 4 5 6 7 8 9 10
1 2.57 8.74 5.51 5.74 4.40 1.77 5.79 3.43 4.66 3.80
2 6.43 4.25 7.29 5.41 3.08 4.43 8.70 2.62 3.37 4.45
3 1.99 2.80 7.54 2.56 5.02 4.30 7.99 2.03 4.91 6.28
4 4.53 4.90 8.04 3.49 2.25 2.87 7.47 5.41 3.54 9.28
5 2.32 5.82 7.18 4.58 5.39 2.26 0.59 9.61 5.91 5.37
6 7.70 5.50 6.45 7.83 4.65 8.45 1.70 6.40 4.88 4.32
7 7.05 3.87 3.54 3.79 6.15 5.55 6.31 2.31 3.42 6.14
8 4.43 4.50 3.04 3.62 9.92 5.66 3.75 7.01 4.92 7.08
9 3.67 5.56 3.56 7.92 5.05 5.00 3.64 6.74 5.85 3.26
10 5.75 3.17 9.50 5.44 3.64 6.13 3.18 5.93 6.18 3.71
But how can I elegantly apply an arbitrary function to the cross-tabulated elements, for example the mean of p, in the following manner?
> xtabs(mean(p) ~ X_decile + Y_decile, XYp)
Error in model.frame.default(formula = mean(p) ~ X_decile + Y_decile, :
variable lengths differ (found for 'X_decile')
As a bonus, the values of colX_deciles[1:10] and colY_deciles[1:10] could be set as row names and column names, respectively.
I assume you want to use the XYp object throughout (in a couple of places you used data).
I would suggest nesting the aggregate() call inside xtabs():
xtabs(p ~ X_decile + Y_decile, aggregate(p ~ X_decile + Y_decile, XYp, mean))
Y_decile
X_decile 1 2 3 4 5 6 7 8 9 10
1 0.4283333 0.5826667 0.5009091 0.4100000 0.5500000 0.2950000 0.5263636 0.4900000 0.3584615 0.4222222
2 0.5358333 0.5312500 0.6627273 0.4918182 0.3850000 0.5537500 0.5800000 0.4366667 0.4814286 0.4450000
3 0.3980000 0.3500000 0.5800000 0.5120000 0.4183333 0.3583333 0.4205263 0.3383333 0.5455556 0.5233333
4 0.4118182 0.3769231 0.6700000 0.5816667 0.5625000 0.3587500 0.6225000 0.3864286 0.5900000 0.7138462
5 0.4640000 0.4476923 0.6527273 0.5088889 0.4900000 0.4520000 0.1966667 0.6006250 0.4925000 0.5370000
6 0.4812500 0.6111111 0.7166667 0.5592857 0.5166667 0.6035714 0.3400000 0.5818182 0.5422222 0.6171429
7 0.5035714 0.5528571 0.4425000 0.5414286 0.5125000 0.3964286 0.4853846 0.5775000 0.4275000 0.4723077
8 0.4430000 0.4090909 0.6080000 0.5171429 0.6200000 0.5660000 0.4687500 0.5392308 0.3784615 0.5446154
9 0.4077778 0.6177778 0.5085714 0.7200000 0.4208333 0.5000000 0.4550000 0.5616667 0.5318182 0.3622222
10 0.5227273 0.4528571 0.6785714 0.3885714 0.3640000 0.4715385 0.5300000 0.5390909 0.6866667 0.5300000
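For the bonus, the decile boundaries can be attached afterwards via dimnames(), which coerces the numeric values to character (a sketch, reusing colX_deciles and colY_deciles from the question):
tab <- xtabs(p ~ X_decile + Y_decile, aggregate(p ~ X_decile + Y_decile, XYp, mean))
dimnames(tab) <- list(X_decile = colX_deciles[1:10], Y_decile = colY_deciles[1:10])
tab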
I am new to quantmod. It has many ways to subset dates, but I need to subset to a specific day of the year, i.e. 12/24 of every year, out of a data set spanning many years, and quantmod does not seem to have this function. Is there a way to do that?
Example:
getSymbols('AMD',src='google')
and you get data starting from 2007; I want to subset it to a data frame with just
2007-12-24 ...
2008-12-24 ...
2016-12-26 ...
#and so on.
You can try something like this:
getSymbols('AMD',src='google')
# .indexmon() == 11 is December (months are zero-based) and .indexmday() == 24 is every 24th
AMD[.indexmon(AMD) == 11 & .indexmday(AMD) == 24]
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
#2007-12-24 7.78 7.88 7.68 7.77 9193719
#2008-12-24 1.98 2.03 1.97 1.99 2912312
#2009-12-24 9.79 9.95 9.78 9.91 11331966
#2012-12-24 2.54 2.57 2.47 2.48 9625363
#2013-12-24 3.77 3.80 3.75 3.77 5798855
#2014-12-24 2.63 2.70 2.63 2.65 4624005
#2015-12-24 2.88 3.00 2.86 2.92 11900888
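A plain xts alternative, if the .index* helpers feel opaque (a sketch of the same idea, formatting the index as month-day and comparing):
AMD[format(index(AMD), "%m-%d") == "12-24"]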
Just to add to LyzandeR's answer, you could also convert the data to a tibble and use lubridate:
library(tidyverse)
library(lubridate)
library(quantmod)
getSymbols('AMD',src='google')
AMD %>%
  as_tibble() %>%
  rownames_to_column("date") %>%
  filter(month(date) == 12, day(date) == 24)
date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-12-24 7.78 7.88 7.68 7.77 9193719
2 2008-12-24 1.98 2.03 1.97 1.99 2912312
3 2009-12-24 9.79 9.95 9.78 9.91 11331966
4 2012-12-24 2.54 2.57 2.47 2.48 9625363
5 2013-12-24 3.77 3.80 3.75 3.77 5798855
6 2014-12-24 2.63 2.70 2.63 2.65 4624005
7 2015-12-24 2.88 3.00 2.86 2.92 11900888
How can I return, for each row of the following table, the percentage of columns that reach a certain value?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing, for each row, the percentage of the columns V1-V10 that reach the value of interest (5 in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(seq_len(nrow(mat)), rowSums(mat > 5) / ncol(mat) * 100)
As long as it's always 0s and 1s across ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
  sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
OK, so now I believe you want to get the percentage of values in each row that meet some threshold criterion. You give the example > 5. One solution of many is using apply:
apply(df, 1, function(x) sum(x > 5) / length(x) * 100)
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
@Thomas' solution will be faster for large data.frames because it converts to a matrix first, and matrices are faster to operate on.
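Along the same lines, a fully vectorised variant: df > 5 coerces the data frame to a logical matrix, so rowMeans() gives the fraction directly.
rowMeans(df > 5) * 100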