Sorting and aggregating in R

I used the aggregate function in R to bring down my data entries from 90k to 1800.
a=test$ID
b=test$Date
c=test$Value
d=test$Value1
sumA=aggregate(c, by=list(Date=b,Id=a), FUN=sum)
sumB=aggregate(d, by=list(Date=b,Id=a), FUN=sum)
final <- sumA                 # copy Date, Id and the summed Value
final[3] <- sumA[3]/sumB[3]   # third column becomes sum(Value)/sum(Value1)
Now I have data for 20 different dates in a month, with close to 90 different IDs each day, so there are around 1800 entries in the final table.
My question is that I want to aggregate further and find the maximum value of final[3] for each date, so that I am left with just 20 values.
In simple terms:
There are 20 days.
Each day has 90 values for 90 IDs.
I want to find the maximum of these 90 values for each day.
So in the end I would be left with just 20 values for 20 days.
Now the aggregate function is not working here with the function 'max' instead of 'sum'.
Date ID Value Value1
1 A 20 10
1 A 25 5
1 B 50 5
1 B 50 5
1 C 25 25
1 C 35 5
2 A 30 10
2 A 25 45
2 B 40 10
2 B 40 30
This is the data.
By using the aggregate function I got the final table:
Date ID x
1 A 45/15=3
1 B 100/10=10
1 C 60/30=2
2 A 55/55=1
2 B 80/40=2
Now I want the maximum value for dates 1 and 2, that's it:
Date max-Value
1 10
2 2

This is a one-step process using data.table. data.table is an enhanced version of data.frame and works really well; it keeps the data.frame class, so it behaves just like a data.frame.
Step 0: Convert the data.frame to a data.table and set the key:
library(data.table)
setDT(test)
setkey(test,Date,ID)
Step 1: Do the computation:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
Here is an explanation of the steps:
The first part creates what you call the final table in your question:
test[,sum(Value)/sum(Value1),by=key(test)]
# Date ID V1
# 1: 1 A 3
# 2: 1 B 10
# 3: 1 C 2
# 4: 2 A 1
# 5: 2 B 2
This result is then passed to the second [...] block, which takes the max by Date:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
# Date V1
# 1: 1 10
# 2: 2 2
Hope this helps.
data.table is a very well documented package; it is worth reading more about it.
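As a small variation (a sketch, not part of the original answer), the computed columns can be given explicit names instead of the default V1, which makes the chained expression easier to read; the names ratio and max_ratio are just illustrative:
test[, .(ratio = sum(Value)/sum(Value1)), by = .(Date, ID)][, .(max_ratio = max(ratio)), by = Date]
#    Date max_ratio
# 1:    1        10
# 2:    2         2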

Maybe this helps.
test <- structure(list(Date = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), ID = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
Value = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L
), Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L
)), .Names = c("Date", "ID", "Value", "Value1"), class = "data.frame", row.names = c(NA,
-10L))
res1 <- aggregate(. ~ID+Date, data=test, FUN=sum)
res1 <- transform(res1, x=Value/Value1)
res1
# ID Date Value Value1 x
#1 A 1 45 15 3
#2 B 1 100 10 10
#3 C 1 60 30 2
#4 A 2 55 55 1
#5 B 2 80 40 2
aggregate(. ~Date, data=res1[,-c(1,3:4)], FUN=max)
# Date x
# 1 1 10
# 2 2 2
First I ran aggregate with two grouping variables (ID and Date) on the two value columns, using the formula interface . ~ ID + Date.
Then I created a new variable x, i.e. Value/Value1, with transform.
Finally I ran aggregate again with one grouping variable (Date), keeping only the Date and x columns.
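Not part of the original answer, but the same steps can be condensed a little if you prefer; this sketch aggregates both value columns in one call and folds the ratio into transform():
res2 <- transform(aggregate(cbind(Value, Value1) ~ ID + Date, data = test, FUN = sum),
                  x = Value / Value1)
aggregate(x ~ Date, data = res2, FUN = max)
#   Date  x
# 1    1 10
# 2    2  2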

Related

Find consecutive integers in a window of an array greater than threshold value group-wise in R

How can I get, within each group, the index of the sample whose previous samples were consecutive in a window or range and were greater than a fixed threshold?
In the example below, I need to find, group-wise, the time at which there have been 3 consecutive samples whose speed exceeds the threshold (speed >= 35), in a window that starts after the first 3 elements (which are ignored) and runs to the end of the array.
speed_threshold = 35
Group Time Speed
1 5 25 # Ignore first 3 elements
1 10 23 # Ignore first 3 elements
1 15 21 # Ignore first 3 elements
1 20 33 # Speed < 35
1 25 40 # Speed > 35
1 30 42 # Speed > 35
1 35 52 # Speed > 35
1 40 48 # <--- Return time = 40 as answer for Group 1 !
1 45 52
2 5 48 # Ignore first 3 elements
2 10 42 # Ignore first 3 elements
2 15 39 # Ignore first 3 elements
2 20 36 # Speed > 35
2 25 38 # Speed > 35
2 30 46 # Speed > 35
2 35 53 # <--- Return time = 35 as answer for Group 2 !
3 5 45 # Ignore first 3 elements
3 10 58 # <--- Return time = NA as answer for group 3 !
The solution I have tried, using data.table, is as follows:
df[, {above <- Speed[-(1:3)] > speed_thresh
ends <- which(above & rowid(rleid(above)) == 3)
.(Return_Time = Time[ends[1]+ 1])}
, Group]
The above solution removes the first three elements from the entire array, not the first three elements in each group. How can I ignore the first three elements in each group and then find the consecutive values exceeding the threshold?
Thanks in advance!
Note
Lines <- "Group Time Speed
1 5 25 # Ignore first 3 elements
1 10 23 # Ignore first 3 elements
1 15 21 # Ignore first 3 elements
1 20 33 # Speed < 35
1 25 40 # Speed > 35
1 30 42 # Speed > 35
1 35 52 # Speed > 35
1 40 48 # <--- Return time = 40 as answer for Group 1 !
1 45 52
2 5 48 # Ignore first 3 elements
2 10 42 # Ignore first 3 elements
2 15 39 # Ignore first 3 elements
2 20 36 # Speed > 35
2 25 38 # Speed > 35
2 30 46 # Speed > 35
2 35 53 # <--- Return time = 35 as answer for Group 2 !
3 5 45 # Ignore first 3 elements
3 10 58 # <--- Return time = NA as answer for group 3 !"
df <- read.table(text = Lines, header = TRUE)
Here is a tidyverse solution...
library(dplyr)
speed_threshold <- 35
df %>% group_by(Group) %>%
  mutate(ind = cumsum(Speed >= speed_threshold),    # running count of rows at/above the threshold
         ind = ind - lag(ind, 3, default = 0),      # count within the last 3 rows
         ind = lag(ind, default = 0) == 3) %>%      # mark the row right after a run of 3
  summarise(Return_Time = Time[max(7, which(ind)[1])])  # has to be at least the 7th value
# A tibble: 3 x 2
Group Return_Time
<int> <int>
1 1 40
2 2 35
3 3 NA
An alternative with data.table, keeping only the window that starts at the 4th row of each group:
library(data.table)
setDT(df)
speed_thresh <- 35
df[, {
window <- 4:.N
above <- Speed[window] >= speed_thresh
ends <- which(above & rowid(rleid(above)) == 3)
.(Return_Time = Time[window][ends[1] + 1])
}
, Group]
#> Group Return_Time
#> 1: 1 40
#> 2: 2 35
#> 3: 3 NA
Let the ith value of roll be TRUE if the last 3 of 6 values ending at index i all exceed 35. Then find the first TRUE in each group, add 1 and index that into Time.
library(data.table)
library(zoo)
roll <- function(x) rollapplyr(x, 6, function(x) all(tail(x, 3) > 35), fill = FALSE)
DT[, list(Time = Time[which(roll(Speed))[1] + 1]), by = Group]
giving
Group Time
1: 1 40
2: 2 35
3: 3 NA
Note
DF <- structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L), Time = c(5L, 10L, 15L, 20L,
25L, 30L, 35L, 40L, 45L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 5L,
10L), Speed = c(25L, 23L, 21L, 33L, 40L, 42L, 52L, 48L, 52L,
48L, 42L, 39L, 36L, 38L, 46L, 53L, 45L, 58L)), row.names = c(NA,
-18L), class = "data.frame")
library(data.table)
DT <- as.data.table(DF)
Note 2
The poster indicated that they have outdated versions of R and of all packages and are restricted from installing new ones, so use this base R version of roll instead.
roll <- function(x) {
  f <- function(i) if (i < 6) FALSE else all(x[seq(to = i, length = 3)] > 35)
  sapply(seq_along(x), f)
}
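If data.table is also unavailable, the grouping step can be done in base R as well; here is a sketch under that assumption, using split() on the df from the question's Note together with the base R roll() above:
# apply roll() per group without data.table
res <- lapply(split(df, df$Group), function(g) {
  hit <- which(roll(g$Speed))[1]                      # first index where the condition holds
  data.frame(Group = g$Group[1], Time = g$Time[hit + 1])
})
do.call(rbind, res)
#   Group Time
# 1     1   40
# 2     2   35
# 3     3   NA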

Removing rows when some values match and some do not [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
ID Amount Previous
1 10 15
1 10 13
2 20 18
2 20 24
3 5 7
3 5 6
I want to remove the duplicate rows from the above data frame, where ID and Amount match but the values in the Previous column do not. When deciding which row to keep, I'd like to take the one where the Previous value is higher.
This would look like:
ID Amount Previous
1 10 15
2 20 24
3 5 7
An option is distinct on the columns 'ID' and 'Amount' (after arranging the dataset), specifying .keep_all = TRUE to keep all the other columns that correspond to the distinct elements in those columns:
library(dplyr)
df1 %>%
arrange(ID, Amount, desc(Previous)) %>%
distinct(ID, Amount, .keep_all = TRUE)
# ID Amount Previous
#1 1 10 15
#2 2 20 24
#3 3 5 7
Or use duplicated from base R on the 'ID' and 'Amount' columns to create a logical vector and subset the rows of the dataset with it:
df2 <- df1[with(df1, order(ID, Amount, -Previous)),]
df2[!duplicated(df2[c('ID', 'Amount')]),]
# ID Amount Previous
#1 1 10 15
#3 2 20 24
#5 3 5 7
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Amount = c(10L,
10L, 20L, 20L, 5L, 5L), Previous = c(15L, 13L, 18L, 24L, 7L,
6L)), class = "data.frame", row.names = c(NA, -6L))
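Another dplyr route, sketched here on the assumption that dplyr >= 1.0.0 is available (slice_max() was added in that release): take the row with the largest Previous within each ID/Amount group.
library(dplyr)
df1 %>%
  group_by(ID, Amount) %>%
  slice_max(Previous, n = 1, with_ties = FALSE) %>%   # keep only the row with the highest Previous
  ungroup()
# returns the same three rows as the expected output above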

Create a new data frame column that is a combination of other columns

I have 3 columns a, b, c and I want to combine them into a new column with the help of the column mode, as follows:
if mode = 1, take the data from a
if mode = 2, take the data from b
if mode = 3, take the data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index ('mode') are cbinded to create a two-column matrix that extracts the corresponding values from the subset of the dataset:
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
Another base R solution works by converting "mode" to letters and then extracting the values of the matching columns:
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else():
library(dplyr)
df1 %>%
  mutate(combine = if_else(mode == 1, a,
                           if_else(mode == 2, b, c)))
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)

R replace the column name by the dataframe name with a loop

I am very new to programming in R, but I am trying to replace a column name with the dataframe's name using a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe representing a cryptocurrency expressed in USD, and every dataframe has 2 columns: Date and Close (closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") with the name of the dataframe ("BTC.USD").
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
aux <- get(i)
(colnames(aux)[2] =df_list)
assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
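To fix the original for loop directly, only one change is needed: assign the single name i to the column, not the whole df_list vector. A sketch, assuming the 25 data frames sit in the global environment as in the question:
df_list <- ls(pattern = "USD")
for (i in df_list) {
  aux <- get(i)
  colnames(aux)[2] <- i   # use the current data frame's own name
  assign(i, aux)
}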

Renaming column headers after splitting data frames with split function

This is a very basic question, but I am struggling to solve it. I have a master data frame that I have split into multiple data frames, based on unique values in a particular column. This was achieved by creating a list of data frames, and then saving each data frame as a separate csv file using the lapply function (see the code below).
Example code:
split_df <- split(df, df$ID)
u_ID <- unique(df$ID)
names(split_df) <- paste(u_ID)
lapply(names(split_df), function(x) write.csv(split_df[x], file= paste0(x, '_ID.csv')))
The issue is that the column headers in the output csv files are different from those in the master data frame, i.e. in the example below, where the data frame is split by unique ID values, the ID name has been added to each column header in the split data frames. I would like to end up with the same column headers in my output data frames as in my master data frame.
Example data:
ID Count Sp
1 A 23 1
2 A 34 2
3 B 4 2
4 A 4 1
5 C 22 1
6 B 67 1
7 B 51 2
8 A 11 1
9 C 38 1
10 B 59 2
dput:
structure(list(ID = c("A", "A", "B", "A", "C", "B", "B", "A",
"C", "B"), Count = c(23L, 34L, 4L, 4L, 22L, 67L, 51L, 11L, 38L,
59L), Sp = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L)), .Names = c("ID",
"Count", "Sp"), class = "data.frame", row.names = c(NA, -10L))
Example output data frames (csv files) using above code:
$A
A.ID A.Count A.Sp
1 A 23 1
2 A 34 2
4 A 4 1
8 A 11 1
$B
B.ID B.Count B.Sp
3 B 4 2
6 B 67 1
7 B 51 2
10 B 59 2
$C
C.ID C.Count C.Sp
5 C 22 1
9 C 38 1
For this example, I would like to end up with output csv files containing the column headers ID, Count and Sp. Any solutions would be greatly appreciated!
Simply use this:
lapply(names(split_df), function(x) write.csv(split_df[[x]], file= paste0(x, '_ID.csv')))
Note the double square brackets in split_df[[x]]. With single brackets, split_df[x] is a one-element list, and write.csv prefixes each column name with the list element's name, which is where A.ID, A.Count, A.Sp come from. Double brackets extract the data frame itself, so the original headers ID, Count and Sp are kept.
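As an alternative sketch (not from the original answer), you can also iterate over the list and its names together with Map(), which avoids indexing by name altogether:
invisible(Map(function(d, nm) write.csv(d, file = paste0(nm, '_ID.csv')),
              split_df, names(split_df)))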
