Objective: Calculate stochastics with three different values for nFastK
for all variables, using the TTR::stoch function and the tidyquant package.
Topic 1: Error message
The snippet below works, but throws an error when the option bounded = FALSE
is added (see the commented line in the code). What is the reason for the error?
rm(list = ls())
library(tidyverse)   # gather/spread, mutate, str_c, as_factor, ...
library(tidyquant)
library(lubridate)

my_data <- tibble(
  Symbol = as_factor(c(rep("a", 100), rep("b", 100))),
  Date   = rep(ymd("2017-11-14") + 7 * (0:99), 2),  # weekly
  major  = c(10000 + sample(-800:300, 100), 8000 + sample(-100:900, 100)),
  v1     = sample(-1000:1000, 200) / 100,
  v2     = sample(-100:1200, 200) / 100
)
my_final <- my_data %>%
  gather(-Date, -Symbol, key = "kkeys", value = "wwords") %>%
  mutate(kkeys = as_factor(kkeys)) %>%
  group_by(Symbol, kkeys) %>%
  tq_mutate(
    # tq_mutate args
    select     = wwords,
    mutate_fun = stoch,
    # args to mutate_fun
    nFastK = 10
    # , bounded = FALSE # <- uncomment this line for the error!
  ) %>%
  select(-wwords, -fastD, -stoch) %>%
  mutate(fastK = round(fastK, digits = 2)) %>%
  spread(kkeys, fastK)
Topic 2: Functional programming on this issue.
A for loop runs the pipeline above for three values of nFastK, then renames
the new columns and right-joins them to the final table, like so.
This is just a brief illustration of my original code:
my_periods <- c(5, 10, 20)
my_vars <- my_data %>% select(-Date, -Symbol) %>% colnames()
my_final <- my_data

for (i in seq_along(my_periods)) {
  # Create unique column names
  my_vars_to <- str_c(my_vars, "_pk", my_periods[i])
  my_final <-
    my_data %>%
    # Do all of the above from Topic 1, plus this:
    rename_at(vars(my_vars), ~ my_vars_to) %>%
    right_join(my_final, by = c("Symbol", "Date"))
}
This loop works and gets me what I want. Since I am still on a steep learning curve, I have two questions:
Question 1: According to Wickham (with exercise solutions provided by Arnold), preallocation is faster than growing an object. How would this code need to be written to preallocate the memory, compared to repeatedly calling right_join()? Or is this an acceptable solution? I looked at https://jrnold.github.io/r4ds-exercise-solutions/iteration.html
Question 2: After reading a few tutorials, purrr::map()
appears to be the appropriate replacement for the for loop.
Even after reading tutorials and questions here, I can't get my head around how to write it properly. Would you please provide an example or point me in the direction of more reading?
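This is roughly the shape I imagine (a minimal sketch; the helper name add_stoch() is mine and simply wraps the Topic 1 pipeline plus the renaming step, so please correct me if the idea is wrong). map() builds one renamed table per nFastK value, which also avoids growing my_final inside the loop (Question 1), and reduce() performs the right-joins in a single pass:

library(tidyverse)
library(tidyquant)

# Hypothetical helper: the Topic 1 pipeline for a single nFastK value,
# plus the renaming step from the loop above.
add_stoch <- function(data, period) {
  my_vars <- data %>% select(-Date, -Symbol) %>% colnames()
  data %>%
    gather(-Date, -Symbol, key = "kkeys", value = "wwords") %>%
    mutate(kkeys = as_factor(kkeys)) %>%
    group_by(Symbol, kkeys) %>%
    tq_mutate(select = wwords, mutate_fun = stoch, nFastK = period) %>%
    select(-wwords, -fastD, -stoch) %>%
    mutate(fastK = round(fastK, digits = 2)) %>%
    spread(kkeys, fastK) %>%
    rename_at(vars(my_vars), ~ str_c(my_vars, "_pk", period))
}

# One table per period, then join them all onto my_data in a single pass,
# mirroring the right_join() accumulation of the for loop.
my_final <- my_periods %>%
  map(~ add_stoch(my_data, .x)) %>%
  reduce(~ right_join(.y, .x, by = c("Symbol", "Date")), .init = my_data)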
Finally:
Thank you for all the help via examples, vignettes and other posts. This is probably one of the most active, helpful and knowledgeable communities I have ever come across. As a new user of R, I appreciate the many examples on Stack Overflow and other websites. This is my first post. Thanks, A.
Related
This relates to another problem I posted, but I did not quite ask the right question. If anyone can help with this, it would really be appreciated.
I have a DF with several players' answers to 100 questions in a quiz (example data frame below with 10 questions and 10 players; it is not the real data, which is not really from a quiz, but the principle is the same).
My goal is to create a function that will check when a player has answered 3 questions incorrectly in a row at any point during their answers, and then change their following answers to the string "disc". I would also like to be able to change the parameter, so it could be 4 or 5 consecutive incorrect answers, etc. In the df: 1 = correct, 0 = incorrect, and 2 = unanswered. Unanswered counts as incorrect, but I do not want to recode it as 0.
df=data.frame(playerID=numeric(),
q1=numeric(),
q2=numeric(),
q3=numeric(),
q4=numeric(),
q5=numeric(),
q6=numeric(),
q7=numeric(),
q8=numeric(),
q9=numeric(),
q10=numeric())
set.seed(1)
for (i in 1:10) {
  list_i = c(i, sample(0:2, 1), sample(0:2, 1), sample(0:2, 1), sample(0:2, 1),
             sample(0:2, 1), sample(0:2, 1), sample(0:2, 1), sample(0:2, 1),
             sample(0:2, 1), sample(0:2, 1))
  df[i, ] = list_i
}
So, in this DF, for example, playerID = 3, 8 and 9 should have their answers = "disc" from q4 onwards, whereas playerID = 5 should have "disc" from q8 onwards. So any time there are 3 consecutive incorrect answers (including values of 2), the following answers should change to "disc".
I presume the syntax would be a for loop with an if statement inside using mutate or similar.
One possible solution using mutate and across:
library(dplyr)

df %>%
  ungroup() %>%
  mutate(
    # Mutate across all question columns
    across(
      starts_with("q"),
      function(col) {
        # Index of the current column within the data
        col_i <- which(names(cur_data()) == cur_column())
        # q1 has no previous questions, so leave it unchanged
        if (col_i <= 2) return(col)
        # Columns of the previous questions (column 1 is playerID)
        previous_cols <- 2:(col_i - 1)
        # Encode previous answers as a string of 0/1 flags (1 = incorrect, i.e. 0 or 2)
        previous_qs <- select(cur_data(), all_of(previous_cols)) %>%
          mutate(across(everything(), ~ as.numeric(.x %in% c(0, 2)))) %>%
          tidyr::unite("str", sep = "") %>%
          pull(str)
        # Check for three successive incorrect answers at some previous point
        results <- grepl(pattern = "111", previous_qs)
        # For those players, overwrite the current value with 'disc'
        col[results] <- "disc"
        col
      }
    )
  )
Are you looking for something like this?
library(tidyverse)

n <- 100

f <- function(v, cap, new_value) {
  df <-
    data.frame(v = v) |>
    mutate(
      b = cumsum(v),
      v_new = ifelse(b > cap, new_value, v)
    )
  return(df$v_new)
}

# apply the function to a vector
v <- runif(n)
v_new <- f(v, 5, "disc")

# apply the function in a data frame with mutate
df <-
  data.frame(a = runif(n))

df |>
  mutate(
    b = f(a, 5, "disc")
  )
My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
filter(x == "something complex") %>%
omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve just this one problem; I want to understand whether there is an underlying method that works for multiple cases.
library(tidyverse)

set.seed(123)

event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1, 2, 3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A", "B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1, 0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)

filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time & (last(!!time) - !!time) >= lag_time)
  return(filtered)
}

# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)

# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1, 3, "B",
# 1, 4, "B",
# 1, 5, "B",
I am using the 'anomalize' package for anomaly detection. My data consists of three columns: the date, an agent (this is where the different levels come from), and the number of schedules that agent had on a particular day. I can run the anomaly detection just fine when I remove the 'agent' column and sum the number of schedules by day using this code:
library(tidyverse)
library(anomalize)

df <- scheds %>%
  group_by(date) %>%
  summarise(
    new_scheds = sum(new_scheds)
  )

df_ts <- df %>%
  rownames_to_column() %>%
  as_tibble() %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
  select(-one_of('rowname'))

df_ts <- df_ts[order(df_ts$date), ]

########## TS Decomp ###############
df_ts %>%
  time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()
But I cannot figure out how to do this same type of thing for each agent individually without manually typing everything out and using filter(). I have tried the following loop with no luck:
agents <- levels(ts_agents$agent)
results <- matrix(NA, length(agents))
for(i in 1:length(agents)){
ts_agents %>%
time_decompose(new_scheds)[i] %>%
anomalize(remainder)[i] %>%
time_recompose()[i] %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)[i] }
but I get the following error:
'Error in time_decompose(new_scheds) : object 'new_scheds' not found'
Any tips or pointers would be greatly appreciated!
The reason for your error is that the pipe operator %>% does not do what you expect when the right-hand side is a subsetting expression like time_decompose(new_scheds)[i]: the piped data is inserted into the subsetting call [ rather than into time_decompose(), so time_decompose(new_scheds) is evaluated on its own and new_scheds cannot be found.
If you enclose the call in braces and use . to refer to the piped input, you will avoid this error:
for(i in 1:length(agents)){
ts_agents %>% {
time_decompose(., new_scheds)[i]
} %>% {
...
This fixes the immediate problem of the error, but I'm not sure how well the subsetting will work. It may be that you need filter() inside the loop, or even group_by(ts_agents, agent) with no loop at all; a sketch of the grouped approach is below. (If you provide a full reproducible example including data, it will be easier to help.)
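A minimal sketch of that grouped approach (assuming ts_agents has the columns agent, date and new_scheds; untested without your data):

library(dplyr)
library(anomalize)

ts_agents %>%
  group_by(agent) %>%   # one decomposition per agent
  time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)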
I'm trying to plot tweet topics by year from this file:
https://www.mediafire.com/file/64lzbt46v01jbe1/cleaned.xlsx/file
Getting the topics works fine, but when I try to plot them by year I run into a dimensions problem:
library(readxl)
library(tm)
library(topicmodels)  # provides LDA() and posterior()

tweets <- read_xlsx("C:/cleaned.xlsx")
mytextdata <- tweets$textdata

# Convert to tm corpus and use its API
corpus <- Corpus(VectorSource(mytextdata))  # Create corpus object
dtm <- DocumentTermMatrix(corpus)

# keep only documents that contain at least one term
ui <- unique(dtm$i)
dtm.new <- dtm[ui, ]

k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control = list(alpha = 0.1, seed = 77), k = k)
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
dim(theta)
theta has 4857 rows, but I have 4876 dates in my cleaned.xlsx file, and I need them to be the same to run this aggregate function
topic_proportion_per_decade <- aggregate(theta, by = list(decade = textdata$decade), mean)
from here
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I think the problem is that the cleaned.xlsx file is not clean enough, and that is why theta is missing some rows.
But in fact I really don't know why theta is missing some rows.
I also don't know how to clean the file better, if that is the problem; the file looks good to me. There are some rows that contain only numbers or non-English words, but I would prefer to keep them.
The problem is that ui = unique(dtm$i) removes several documents (I don't know why you do this, so I won't comment on that part). So your theta doesn't have the same number of rows as the data. We can solve this by only keeping the rows which are still in theta:
library("dplyr")
library("reshape2")
library("ggplot2")
tweets_clean <- tweets %>%
mutate(id = rownames(.)) %>%
filter(id %in% rownames(theta)) %>% # keep only rows still in theta
cbind(theta) %>% # now we can attach the topics to the data.frame
mutate(year = format(date, "%Y")) # make year variable
I then used dplyr functions to make the aggregation, since I think it makes for easier to read code:
tweets_clean_yearly <- tweets_clean %>%
  group_by(year) %>%
  summarise_at(vars(as.character(1:7)), funs(mean)) %>%
  melt(id.vars = "year")
Then we can plot this:
ggplot(tweets_clean_yearly, aes(x = year, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  ylab("proportion")
Note: I tested whether theta and tweets really contained the same documents with:
tweets_clean <- tweets %>%
  mutate(id = rownames(.)) %>%
  filter(id %in% rownames(theta))

all.equal(tweets_clean$id, rownames(theta))
I'm writing a function to remove duplicate observations of undirected relationships between firms when both parties report the relationship. For my ~1.3 million observation dataset, the function collapse_undirected below either crashes R during the mutate that creates hash, or throws the error
"translateCharUTF8 must be called on CHARSXP"
during the mutate that creates dup.
The goal of this function is to uniquely identify each pair of related firms by ordering and concatenating their IDs, and then dropping duplicate hashes reported in the same time period.
The data set I'm using is licensed, so I can't provide it, but the "translateCharUTF8" error is reproducible with the randomly generated data included below. It occurs more frequently with larger sets; I'd say it's common at around 9000 observations. I've also included a slow version of the function that runs without incident, which further leads me to believe that the problem occurs in the first mutate.
The function with the error:
collapse_undirected <- function(data, dir) {
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>%
    filter(REL_TYPE == dir) %>%
    group_by(SOURCE, TARGET) %>%
    mutate(hash = paste(min(SOURCE, TARGET), max(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup)
  bind_rows(out, obs)
}
The slow workaround:
jank_undir <- function(data, dir) {
  obs <- data %>% filter(REL_TYPE == dir)
  out <- data %>% filter(REL_TYPE != dir)
  obs$hash <- NA
  for (i in 1:nrow(obs)) {
    obs$hash[i] <- paste(min(obs$SOURCE[i], obs$TARGET[i]),
                         max(obs$SOURCE[i], obs$TARGET[i]))
  }
  obs %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup) %>%
    bind_rows(out)
}
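In case it is useful context, here is a vectorized sketch of the same hashing idea (only an illustration, not the function that produced the error): it builds the unordered-pair key with pmin()/pmax(), so no group_by(SOURCE, TARGET) is needed before the character mutate.

library(dplyr)

collapse_undirected_vec <- function(data, dir) {
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>%
    filter(REL_TYPE == dir) %>%
    # pmin/pmax work element-wise on character vectors, so the pair key
    # can be built without grouping by SOURCE and TARGET
    mutate(hash = paste(pmin(SOURCE, TARGET), pmax(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    ungroup() %>%
    filter(!dup) %>%
    select(-hash, -dup)
  bind_rows(out, obs)
}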
Here's a convenience function to randomly generate test data:
reroll <- function(n) {
  test_data <- data_frame(1:n)
  test_data$SOURCE <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$TARGET <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$REL_TYPE <- "DUMMY"
  test_data$START <- sample(1:2870, size = n, replace = TRUE)
  test_data$END <- sample(1:2781, size = n, replace = TRUE)
  test_data
}
And, varying with the random draw, this should demonstrate the error:
library(dplyr)
test_data <- reroll(9000)
test_cleaned <- test_data %>% jank_undir("DUMMY")
test_cleaned <- test_data %>% collapse_undirected("DUMMY")
I'd greatly appreciate any insight into why this is happening. The slow version is fast enough for now, but I anticipate needing it for larger datasets. The R crashes occurred with the main dataset in both my Windows and Linux R sessions, though they seem less frequent on Linux. My dplyr version is 0.7.2.
Thank you,