Calculating table in R with uneven length - r

I have two tables of data in R:
a = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 2 0 0 0 2
2 3 0 0 10 3
3 4 0 51 25 0
4 5 19 129 14 0
5 6 60 137 1 0
6 7 31 62 15 5
7 8 7 11 7 0
and
b = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 1 0 0 1 266
2 2 1 0 47 335
3 3 1 26 415 142
4 4 3 965 508 5
5 5 145 2535 103 0
6 6 939 2239 15 6
7 7 420 613 86 34
8 8 46 84 36 16
I would like to calculate b/a by matching the duration. I thought of something like ifelse(), but it does not work. Can someone please help me?
Thanks a lot

Match the rows of b to those of a by duration (in my example, y to x), then do the math.
x <- data.frame(duration = 2:8, v = rnorm(7))
y <- data.frame(duration = 8:1, v = rnorm(8))
# positions in y of each duration in x (NA where y has no such duration)
m <- match(x$duration, y$duration)
ym <- y[m[!is.na(m)], ]
ym$v / x$v  # b/a, i.e. y over x
Note that this does not work when x contains durations that are not in y.
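An alternative that avoids the bookkeeping with match() is an inner join on the duration column. The sketch below uses only the first bin, (-10,0], from the question's two tables; the names `both`, `n_a`, `n_b` and `ratio` are chosen here for illustration and are not from the original post.

```r
# Counts for the (-10,0] bin, taken from the question's tables a and b.
a <- data.frame(Duration = 2:8, n = c(0, 0, 0, 19, 60, 31, 7))
b <- data.frame(Duration = 1:8, n = c(0, 1, 1, 3, 145, 939, 420, 46))

# merge() keeps only the durations present in both tables (inner join)
# and sorts the result by the key column.
both <- merge(a, b, by = "Duration", suffixes = c("_a", "_b"))

# b/a for the matched durations; a zero in a gives Inf or NaN.
both$ratio <- both$n_b / both$n_a
both
```

Because the join drops unmatched durations on either side, it does not matter which table is longer.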

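Do you want something like the following?
# drop the first column (this assumes the row numbers were read in as a column)
a <- a[-1]
b <- b[-1]
# sort both tables by duration so that the rows line up
a <- a[order(a$Duration),]
b <- b[order(b$Duration),]
# keep only the durations present in both tables, then divide element-wise
durations <- intersect(a$Duration, b$Duration)
b[b$Duration %in% durations,] / a[a$Duration %in% durations,]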
Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
2 1 Inf NaN Inf 167.50000
3 1 Inf Inf 41.500000 47.33333
4 1 Inf 18.921569 20.320000 Inf
5 1 7.631579 19.651163 7.357143 NaN
6 1 15.650000 16.343066 15.000000 Inf
7 1 13.548387 9.887097 5.733333 6.80000
8 1 6.571429 7.636364 5.142857 Inf
You may want to replace the NaN and Inf values with something else.
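One way to do that replacement, shown as a minimal sketch on the duration-5 row of the two tables (the vector names `cnt_a`, `cnt_b` and `ratio` are made up for this example): is.finite() is FALSE for Inf, -Inf, NaN and NA, so a single logical index covers every case.

```r
# Counts for Duration 5 across the four bins, from tables a and b.
cnt_a <- c(19, 129, 14, 0)
cnt_b <- c(145, 2535, 103, 0)

ratio <- cnt_b / cnt_a            # the last element is 0/0, i.e. NaN

# Replace everything that is not a finite number with NA.
ratio[!is.finite(ratio)] <- NA
ratio
```

For a whole data frame the same idea can be applied column by column, for example with `res[] <- lapply(res, function(col) { col[!is.finite(col)] <- NA; col })`, where `res` stands for the result of the division.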

Related

Consecutive Positive or Negative calculation from data frame and filter results using R

I have the following dataset and am looking to write code that identifies which stocks have been positive or negative on consecutive days. The data has only the first 3 columns; the last 2 columns were calculated manually in Excel to show the expected results.
This is only a sample. I will have data for 200+ stocks over a few years, and not every stock trades every day.
In the end, I want to extract which stocks have, say, 3, 4 or 5 consecutive positive or negative daily changes.
Stocks  Date        Close Price  Change for day  Positive/Negative Count
A       11/11/2020  11
B       11/11/2020  50
C       11/11/2020  164
A       11/12/2020  19           8               1
B       11/12/2020  62           12              1
C       11/12/2020  125          -39             -1
A       11/13/2020  7            -12             -1
B       11/13/2020  63           1               2
C       11/13/2020  165          40              1
A       11/16/2020  17           10              1
B       11/16/2020  70           7               3
C       11/16/2020  170          5               2
A       11/17/2020  24           7               2
B       11/17/2020  52           -18             -1
C       11/17/2020  165          -5              -1
A       11/18/2020  31           7               3
B       11/18/2020  61           9               1
C       11/18/2020  157          -8              -2
The difficulty is to have a function that makes the cumulative sum, both positive and negative, resetting the count when the sign changes, and starting the count with the first value. I managed to make one, but it is not terribly efficient and will probably get slow on a bigger dataset. I suspect there is a way to do better, if only with a simple for loop in C or C++.
library(tidyverse)
df <- read.table(text="Stocks Date Close_Price Change_for_day Positive/Negative_Count
A 11/11/2020 11 NA 0
B 11/11/2020 50 NA 0
C 11/11/2020 164 NA 0
A 11/12/2020 19 8 1
B 11/12/2020 62 12 1
C 11/12/2020 125 -39 -1
A 11/13/2020 7 -12 -1
B 11/13/2020 63 1 2
C 11/13/2020 165 40 1
A 11/16/2020 17 10 1
B 11/16/2020 70 7 3
C 11/16/2020 170 5 2
A 11/17/2020 24 7 2
B 11/17/2020 52 -18 -1
C 11/17/2020 165 -5 -1
A 11/18/2020 31 7 3
B 11/18/2020 61 9 1
C 11/18/2020 157 -8 -2",
header = TRUE) %>%
select(1:3) %>%
as_tibble()
# this formulation could be faster on data with longer stretches
nb_days_cons2 <- function(x){
  n <- length(x)
  if(n < 2) return(x)  # nothing to accumulate
  out <- integer(n)
  y <- rle(x)
  cur_pos <- 1
  for(i in seq_along(y$lengths)){
    # fill one run at a time: 1, 2, 3, ... times the sign of the run
    out[(cur_pos):(cur_pos+y$lengths[i]-1)] <- cumsum(rep(y$values[i], y$lengths[i]))
    cur_pos <- cur_pos + y$lengths[i]
  }
  out
}
# this formulation was faster on some tests, and would be easier to rewrite in C
nb_days_cons <- function(x){
  n <- length(x)
  if(n < 2) return(x)  # also avoids looping over 2:1 when n is 1
  out <- integer(n)
  out[1] <- x[1]
  for(i in 2:n){
    if(x[i] == x[i-1]){
      out[i] <- out[i-1] + x[i]  # same sign: extend the stretch
    } else{
      out[i] <- x[i]             # sign changed: restart the count
    }
  }
  out
}
Once we have that function, the dplyr part is quite classic.
df %>%
group_by(Stocks) %>%
arrange(Date) %>% # make sure of order
mutate(change = c(0, diff(Close_Price)),
stretch_duration = nb_days_cons(sign(change))) %>%
arrange(Stocks)
#> # A tibble: 18 x 5
#> # Groups: Stocks [3]
#> Stocks Date Close_Price change stretch_duration
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 A 11/11/2020 11 0 0
#> 2 A 11/12/2020 19 8 1
#> 3 A 11/13/2020 7 -12 -1
#> 4 A 11/16/2020 17 10 1
#> 5 A 11/17/2020 24 7 2
#> 6 A 11/18/2020 31 7 3
#> 7 B 11/11/2020 50 0 0
#> 8 B 11/12/2020 62 12 1
#> 9 B 11/13/2020 63 1 2
#> 10 B 11/16/2020 70 7 3
#> 11 B 11/17/2020 52 -18 -1
#> 12 B 11/18/2020 61 9 1
#> 13 C 11/11/2020 164 0 0
#> 14 C 11/12/2020 125 -39 -1
#> 15 C 11/13/2020 165 40 1
#> 16 C 11/16/2020 170 5 2
#> 17 C 11/17/2020 165 -5 -1
#> 18 C 11/18/2020 157 -8 -2
Created on 2020-11-19 by the reprex package (v0.3.0)
Of course, the final arrange() is just for easy visualization, and you can remove the columns you don't need anymore with select().
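As a possible loop-free variant, the same signed count can be built from rle() and sequence(), and the final extraction step from the question (stocks with a run of at least 3 days) becomes a simple filter. This is a sketch in base R rather than the answer's own method; `signed_streak`, `prices`, `min_run` and `hits` are names invented here. It also parses the dates with as.Date(), on the assumption that the real data spans more than one month, where sorting the "mm/dd/yyyy" strings as text would give the wrong order.

```r
# Signed run counter: 1, 2, 3, ... for consecutive gains, -1, -2, ... for losses.
signed_streak <- function(s) {
  r <- rle(s)
  # position within each run, multiplied by the sign of that run
  sequence(r$lengths) * rep(r$values, r$lengths)
}

prices <- data.frame(
  Stocks = rep(c("A", "B", "C"), times = 6),
  Date = rep(c("11/11/2020", "11/12/2020", "11/13/2020",
               "11/16/2020", "11/17/2020", "11/18/2020"), each = 3),
  Close_Price = c(11, 50, 164, 19, 62, 125, 7, 63, 165,
                  17, 70, 170, 24, 52, 165, 31, 61, 157),
  stringsAsFactors = FALSE
)

# Real dates sort chronologically; the raw strings would sort alphabetically.
prices$Date <- as.Date(prices$Date, format = "%m/%d/%Y")
prices <- prices[order(prices$Stocks, prices$Date), ]

# Per-stock daily change (0 on the first day) and signed streak length.
prices$change <- ave(prices$Close_Price, prices$Stocks,
                     FUN = function(p) c(0, diff(p)))
prices$streak <- ave(sign(prices$change), prices$Stocks, FUN = signed_streak)

# Rows where a stock has moved in the same direction for min_run days or more.
min_run <- 3
hits <- prices[abs(prices$streak) >= min_run, c("Stocks", "Date", "streak")]
hits
```

On the sample data this picks out stock A on 11/18/2020 and stock B on 11/16/2020, which agrees with the rows where the count reaches 3 in the output above.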

Using aggregate in a dataframe with NA without dropping rows [duplicate]

I am using aggregate to get the means of several variables by a specific category (cy), but there are a few NAs in my data frame. I am using aggregate rather than ddply because, as I understand it, it handles NAs much like na.rm = TRUE does. The problem is that it drops every row containing an NA before computing the means, so the means are slightly off.
Dataframe:
bt cy cl pf ne YH YI
1 1 H 1 95 70.0 20 20
2 2 H 1 25 70.0 46 50
3 1 H 1 0 70.0 40 45
4 2 H 1 95 59.9 40 40
5 2 H 1 75 59.9 36 57
6 2 H 1 5 70.0 35 43
7 1 H 1 50 59.9 20 36
8 2 H 1 95 59.9 40 42
9 3 H 1 95 49.5 17 48
10 2 H 1 5 70.0 42 42
11 2 H 1 95 49.5 19 30
12 3 H 1 25 49.5 33 51
13 1 H 1 75 49.5 5 26
14 1 H 1 5 70.0 35 37
15 1 H 1 5 59.9 20 40
16 2 H 1 95 49.5 29 53
17 2 H 1 75 70.0 41 41
18 2 H 1 0 70.0 10 10
19 2 H 1 95 49.5 25 32
20 1 H 1 95 59.9 10 11
21 2 H 1 0 29.5 20 28
22 1 H 1 95 29.5 11 27
23 2 H 1 25 59.9 26 26
24 1 H 1 5 70.0 30 30
25 3 H 1 25 29.5 20 30
26 3 H 1 50 70.0 5 5
27 1 H 1 0 59.9 3 10
28 1 K 1 5 49.5 25 29
29 2 K 1 0 49.5 30 32
30 1 K 1 95 49.5 13 24
31 1 K 1 0 39.5 13 13
32 2 M 1 NA 70.0 45 50
33 3 M 1 25 59.9 3 34
The full dataframe has 74 rows, and there are NA's peppered throughout all but two columns (cy and cl).
My code looks like this:
meancnty<-(aggregate(cbind(pf,ne,YH,YI)~cy, data = newChart, FUN=mean))
I double-checked in Excel, and the means this function produces are for a dataset of N=69, that is, after removing all rows containing NAs. Is there any way to tell R to ignore the NAs rather than remove the rows, other than taking the mean of each variable by county separately (I have a lot of variables to summarize by many different categories)?
Thank you
Using dplyr:
library(dplyr)
df %>%
  group_by(cy) %>%
  summarize_all(mean, na.rm = TRUE)
# cy bt cl pf ne YH YI
# 1 H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2 K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3 M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4 O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571
I think this will work:
meancnty <- aggregate(with(newChart, cbind(pf, ne, YH, YI)),
                      by = list(newChart$cy), FUN = mean, na.rm = TRUE)
I used the following test data:
> q<- data.frame(y = sample(c(0,1), 10, replace=T), a = runif(10, 1, 100), b=runif(10, 20,30))
> q$a[c(2, 5, 7)]<- NA
> q$b[c(1, 3, 4)]<- NA
> q
y a b
1 0 86.87961 NA
2 0 NA 22.39432
3 0 89.38810 NA
4 0 12.96266 NA
5 1 NA 22.07757
6 0 73.96121 24.13154
7 0 NA 22.31431
8 1 62.77095 21.46395
9 0 55.28476 23.14393
10 0 14.01912 28.08305
Using your code from above, I get:
> aggregate(cbind(a,b)~y, data=q, mean, na.rm=T)
y a b
1 0 47.75503 25.11951
2 1 62.77095 21.46395
which is wrong, i.e. it deletes all rows with any NAs and then takes the mean.
This however gave the right result:
> aggregate(with(q, cbind(a, b)), by = list(q$y), mean, na.rm=T)
Group.1 a b
1 0 55.41591 24.01343
2 1 62.77095 21.77076
It did na.rm=T by column first, and then took the average by group.
Unfortunately, I have no idea why that is, but my guess is that it has to do with the class of y.
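A likely explanation lies in the formula method's na.action argument rather than in the class of y: in aggregate(), the formula interface defaults to na.action = na.omit, so incomplete rows are dropped before FUN ever sees them, whereas the by= interface passes the columns through untouched. The sketch below, on a small data frame with rounded values loosely modelled on the test data above, suggests that passing na.action = na.pass makes the formula call agree with the by= call. The names `by_formula` and `by_list` are illustrative.

```r
q <- data.frame(
  y = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0),
  a = c(86.9, NA, 89.4, 13.0, NA, 74.0, NA, 62.8, 55.3, 14.0),
  b = c(NA, 22.4, NA, NA, 22.1, 24.1, 22.3, 21.5, 23.1, 28.1)
)

# Formula interface, but keep rows with NAs and let mean() skip them per column.
by_formula <- aggregate(cbind(a, b) ~ y, data = q, FUN = mean,
                        na.rm = TRUE, na.action = na.pass)

# by= interface from the answer above, for comparison.
by_list <- aggregate(with(q, cbind(a, b)), by = list(y = q$y),
                     FUN = mean, na.rm = TRUE)

by_formula
by_list
```

With na.pass each column's mean uses every non-missing value in its group, which is what the question asked for.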

Using do() with names of list elements

I am trying to take the names of list elements and use do() to apply a function over them all, then bind them in a single data frame.
require(XML)
require(magrittr)
require(dplyr) # group_by() and do() come from dplyr
url <- "http://gd2.mlb.com/components/game/mlb/year_2016/month_05/day_21/gid_2016_05_21_milmlb_nynmlb_1/boxscore.xml"
box <- xmlParse(url)
xml_data <- xmlToList(box)
end <- length(xml_data[[2]]) - 1
x <- seq(1:end)
away_pitchers_names <- paste0("xml_data[[2]][", x, "]")
away_pitchers_names <- as.data.frame(away_pitchers_names)
names(away_pitchers_names) <- "elements"
away_pitchers_names$elements %<>% as.character()
listTodf <- function(x) {
df <- as.data.frame(x)
tdf <- as.data.frame(t(df))
row.names(tdf) <- NULL
tdf
}
test <- away_pitchers_names %>% group_by(elements) %>% do(listTodf(.$elements))
When I run the listTodf function on a list element it works fine:
listTodf(xml_data[[2]][1])
id name name_display_first_last pos out bf er r h so hr bb np s w l sv bs hld s_ip s_h s_r s_er s_bb
1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3 0 0 0 36.0 41 24 23 15
s_so game_score era
1 25 45 5.75
But when I try to loop through the names of the elements with the do() function I get the following:
Warning message:
In rbind_all(out[[1]]) : Unequal factor levels: coercing to character
And here is the output:
> test
Source: local data frame [5 x 2]
Groups: elements [5]
elements V1
(chr) (chr)
1 xml_data[[2]][1] xml_data[[2]][1]
2 xml_data[[2]][2] xml_data[[2]][2]
3 xml_data[[2]][3] xml_data[[2]][3]
4 xml_data[[2]][4] xml_data[[2]][4]
5 xml_data[[2]][5] xml_data[[2]][5]
I am sure it is something extremely simple, but I can't figure out where things are getting tripped up.
To evaluate the strings, eval(parse(text = ...)) can be used:
library(dplyr)
lapply(away_pitchers_names$elements,
function(x) as.data.frame.list(eval(parse(text=x))[[1]], stringsAsFactors=FALSE)) %>%
bind_rows()
# id name name_display_first_last pos out bf er r h so hr bb np s w l
#1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3
#2 430641 Boyer Blaine Boyer P 2 4 0 0 2 0 0 0 8 7 1 0
#3 448614 Torres, C Carlos Torres P 3 4 0 0 0 1 0 2 21 11 0 1
#4 592804 Thornburg Tyler Thornburg P 3 3 0 0 0 1 0 0 14 8 2 1
#5 518468 Blazek Michael Blazek P 1 5 1 1 2 0 0 2 23 10 1 1
# sv bs hld s_ip s_h s_r s_er s_bb s_so game_score era loss note
#1 0 0 0 36.0 41 24 23 15 25 45 5.75 <NA> <NA>
#2 0 1 0 21.1 22 4 4 5 7 48 1.69 <NA> <NA>
#3 0 0 2 22.1 22 9 9 14 21 52 3.63 <NA> <NA>
#4 1 2 8 18.2 13 8 8 7 29 54 3.86 <NA> <NA>
#5 0 1 8 21.1 23 6 6 14 18 41 2.53 true (L, 1-1)
However, it is easier and faster to just do
lapply(xml_data[[2]][1:5], function(x)
as.data.frame.list(x, stringsAsFactors=FALSE)) %>%
bind_rows()
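The reason do() returned the labels unchanged is that `.$elements` is just a character string such as "xml_data[[2]][1]"; it names a list element but is not the element. The sketch below illustrates the difference on a hypothetical two-element list (`pitchers`, with three fields copied from the output above) using base R only. Note the assumption that every element has the same fields; when they differ, as in the real box score, bind_rows() is the more forgiving choice.

```r
# A stand-in for xml_data[[2]]: a list whose elements are lists of fields.
pitchers <- list(
  pitcher = list(id = "605200", name = "Davies", era = "5.75"),
  pitcher = list(id = "430641", name = "Boyer", era = "1.69")
)

# A label is only text; converting it yields a one-cell data frame of text.
label <- "pitchers[1]"
from_label <- as.data.frame(label, stringsAsFactors = FALSE)

# Iterating over the list itself gives one proper row per element.
rows <- lapply(pitchers, function(p) as.data.frame(p, stringsAsFactors = FALSE))
result <- do.call(rbind, unname(rows))
rownames(result) <- NULL
result
```

Working on the list directly also avoids eval(parse()), which is generally harder to debug.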

How to remove rows based on distance from an average of column and max of another column

Consider this toy data frame. I would like to create a new data frame that keeps only the rows that are below the average of "birds" and that are also below the maximum of "wolfs" and the two top values after it. So for this data frame I should get only rows 543, 608, 987, 225, 988 and 556.
I used these two lines of code for the first constraint but couldn't find a solution for the second one.
df$filt<-ifelse(df$birds<mean(df$birds),1,0)
df1<-df[which(df$filt==1),]
How can I create the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)
subset(df, birds < mean(birds) & wolfs < sort(unique(wolfs), decreasing = TRUE)[3])
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1
Here is a solution, although some of the constraints may not be clear to me, because it seems to match a different row compared with your desired output.
avbi <- mean(df$birds)
ttw <- sort(df$wolfs, decreasing = T)[3]
df[df$birds < avbi & df$wolfs < ttw , ]
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
or with dplyr
df %>% filter(birds < avbi & wolfs < ttw)
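The two answers compute the cutoff slightly differently: one takes the third-largest distinct value (with unique()), the other the third element of the sorted column. On the toy data both give 7, so the results coincide, but they can diverge when the largest values are tied. A small sketch with a made-up vector `w` (not from the question):

```r
# Hypothetical wolf counts where the maximum appears twice.
w <- c(9, 9, 8, 7, 3)

# Third-largest distinct value: 9, 8, 7 -> 7
third_distinct <- sort(unique(w), decreasing = TRUE)[3]

# Third element of the sorted vector: 9, 9, 8 -> 8
third_raw <- sort(w, decreasing = TRUE)[3]

c(third_distinct = third_distinct, third_raw = third_raw)
```

Which one is appropriate depends on whether "the two top values after the maximum" is meant in terms of distinct values or of rows.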

Replace last NA of a segment of NAs in a column with last valid value

Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
NAs which have to be replaced are identified by which and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command, but it replaces all NAs. How can I replace only the NAs at position x in a multi-column df?
EDIT: Here is a link to my original dataset: link
And that's how far I got:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step would be something like pat.cols[z] = a[z], which doesn't work.
This is what the result should look like:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in col 5 is 25,261. This value replaces the NA at Time 0/Col 5.
You can change it so that x records all the NA values and use the first and last from that to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
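For the multi-column example, use the same idea in an sapply call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))          # positions of the NAs in this column
  x[rev(y)[1]] <- x[y[1] - 1]   # last NA gets the value just before the first NA
  x
}))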
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
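The code above assumes a single block of NAs per column. If a column can contain several separate blocks, one possible generalisation is to locate each block with rle() and fill only its last position. This is a sketch under that assumption, and `fill_last_na` is a name invented here.

```r
# Replace the last NA of every NA run with the last valid value before the run.
# Runs at the very start of the vector have no preceding value and are left alone.
fill_last_na <- function(x) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)            # last index of each run
  starts <- ends - r$lengths + 1       # first index of each run
  na_runs <- which(r$values & starts > 1)
  x[ends[na_runs]] <- x[starts[na_runs] - 1]
  x
}

pat <- c(33, 48, 7, 8, NA, NA, NA, 1, 6, NA, NA, 32)
fill_last_na(pat)
```

For a data frame with a Time column first, it could be applied to the remaining columns with `df[-1] <- lapply(df[-1], fill_last_na)`.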
