I am new to programming and am attempting to create a prediction model for multiple articles.
Unfortunately, using Excel or similar software is not possible for this task, so I have installed RStudio to solve this problem. My goal is to make an 18-month prediction for each article in my dataset using an ARIMA model.
However, I am currently facing an issue with the format of my data frame. Specifically, I am unsure of how my CSV should be structured to be read by my code.
I have attached an image of my current dataset in CSV format: https://i.stack.imgur.com/AQJx1.png
Here is my dput(sales_data):
structure(list(X.Article.1.Article.2.Article.3 = c("janv-19;42;49;55", "f\xe9vr-19;56;58;38", "mars-19;55;59;76")), class = "data.frame", row.names = c(NA, -3L))
Here is the code I have put together so far with the help of blogs and websites:
library(forecast)
library(reshape2)
sales_data <- read.csv("sales_data.csv", header = TRUE)
sales_data_long <- reshape2::melt(sales_data, id.vars = "Code Article")
for(i in 1:nrow(sales_data_long)) {
sales_data_article <- subset(sales_data_long, sales_data_long$`Code Article` == sales_data_long[i,"Code Article"])
sales_ts <- ts(sales_data_article$value, start = c(2010,6), frequency = 12)
arima_fit <- auto.arima(sales_ts)
arima_forecast <- forecast(arima_fit, h = 18)
print(arima_forecast)
print(paste("Article:", sales_data_long[i, "Code Article"]))
}
With this code, RStudio gives me the following error: "Error: id variables not found in data: Code Article"
Currently, I am not interested in generating any plots or outputs. My main focus is on identifying the appropriate format for my data.
Do I need to modify my CSV file and separate each column using "," or ";"? Or, can I keep my data in its current format and make adjustments in the code instead?
I added the dput output, as jrcalabrese requested.
I swapped reshape2 for its replacement, tidyr, and used pivot_longer.
This no longer gives the error that reshape2::melt raised.
The structure of the CSV doesn't matter much. Your structure was fine.
Hope this helps! :-)
library(tidyr)
sales_data <- structure(list(var1 = c("Article 1", "Article 2", "Article 3"),
`janv-19` = c(42, 56, 55),
`fev-19` = c(49, 58, 59),
`mars-19` = c(55, 38, 76)),
row.names = c(NA, 3L), class = "data.frame")
sales_data_long <- sales_data |> pivot_longer(!var1,
names_to = "month",
values_to = "count")
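On the separator question: the dput in the question shows a single column named X.Article.1.Article.2.Article.3, which suggests the file is semicolon-separated but was read with the default comma separator. Here is a minimal sketch of reading it with sep = ";" instead. The inline CSV text, the Month header and the 2019 start date are assumptions made for illustration, not the actual file.

```r
# Hypothetical CSV content mirroring the dput in the question (semicolon-separated).
# The "Month" header is an assumption; the real file may have an empty first header.
csv_text <- "Month;Article 1;Article 2;Article 3
janv-19;42;49;55
fevr-19;56;58;38
mars-19;55;59;76"

# sep = ";" splits the columns; check.names = FALSE keeps "Article 1" as-is
sales_data <- read.csv(text = csv_text, sep = ";", header = TRUE,
                       check.names = FALSE)

# One monthly time series per article column (start date assumed from "janv-19")
article_cols <- setdiff(names(sales_data), "Month")
series <- lapply(sales_data[article_cols],
                 function(v) ts(v, start = c(2019, 1), frequency = 12))

# With the forecast package installed and enough history, each series could
# then be passed to forecast::auto.arima() and forecast::forecast(fit, h = 18).
str(series)
```

With a real file, read.csv("sales_data.csv", sep = ";") or read.csv2() would play the same role as the text = argument here.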
This question kinda builds on questions I asked here and here, but it's finally coming together and I think I know what the problem is; I just need help kicking it over the goal line. TL;DR at the bottom.
The overall goal as simply put as possible:
I have a dataframe that comes from an API pull of a Redcap database. It has a few columns of information about various studies.
I'd like to go through that dataframe line by line, and push it into a different website called Oncore, through an API.
In the first question linked above (here again), I took a much simpler dataframe... took one column from that dataframe (the number), used it to do an API pull from Oncore where it would download from Oncore, copy one variable it downloaded over to a different spot, and push it back in. It would do this over and over, once per row. Then it would return a simple dataframe of the row number and the api status code returned.
Now I want to get a bit more complicated: instead of just pulling a number from one column, I want to copy over a bunch of variables from my original dataframe and upload them.
The idea is for sample studies input into Redcap to be pushed into Oncore.
What I've tried:
I have this dataframe from the redcap api pull:
testprotocols<-structure(list(protocol_no = c("LS-P-Joe's API", "JoeTest3"),
nct_number = c(654321, 543210), library = structure(c(2L,
2L), levels = c("General Research", "Oncology"), class = "factor"),
organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute",
"General Research"), class = "factor"), title = c("Testing to see if basic stuff came through",
"Testing Oncology Projects for API"), department = structure(c(2L,
2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"
), class = "factor"), protocol_type = structure(2:1, levels = c("Basic Science",
"Other"), class = "factor"), protocolid = 1:2), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I have used this code to try and push the data into Oncore:
##This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"
## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.character(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
call2 <- paste(base,endpoint, protocol, sep="")
httpResponse_put <- PUT(
call2,
add_headers(authorization = token),
body=results, encode = "json",
verbose()
)
# save stats
protocolid <- append(protocolid, protocol)
protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
get_codes <- append(get_codes, status_code(httpResponse_get))
put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
allresults <- tibble('protocolNo'=unlist(protocol_no),'protocolnb'=unlist(protocolnb),'library_names'=unlist(library_names), 'get_codes'=unlist(get_codes), 'put_codes'=unlist(put_codes) )
When I get to the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
I get this error:
When I do traceback() I get this:
When I step through the loop line by line I realized that in this chunk of code:
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.character(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
Where I had envisioned it downloading ONE sample study and replacing aspects of it with variables from ONE row of my beginning dataframe, it's actually trying to paste everything in the column in there. I.e., results$nctNo is "654321 543210" instead of just "654321" from the first row.
TL;DR version:
I need my purrr loop to take one row at a time instead of my entire column, and I think if I do that, it'll all magically work.
Within UpdateAccountNumbers(), you are referring to entire columns of the testprotocols frame when you do things like results$nctNo <- testprotocols$nct_number.
Instead, at the top of UpdateAccountNumbers(), you could do something like tp = testprotocols[testprotocols$protocol_no == protocol, ], and then refer to tp instead of testprotocols when you assign values to results.
Note that your purrr::walk() call already passes just one value of protocol at a time to UpdateAccountNumbers().
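A minimal sketch of that one-row subsetting, with the network calls left out so it runs on its own. The helper name build_payload and the trimmed-down testprotocols frame are made up for illustration. In the real function the GET/PUT calls would stay where they are and only the testprotocols$... references would change to tp$....

```r
# Trimmed-down copy of the question's data frame (only three columns kept)
testprotocols <- data.frame(
  protocol_no = c("LS-P-Joe's API", "JoeTest3"),
  nct_number  = c(654321, 543210),
  title       = c("Testing to see if basic stuff came through",
                  "Testing Oncology Projects for API"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fills a payload from the ONE row matching `protocol`.
# `results` stands in for the list parsed from the GET response.
build_payload <- function(protocol, results = list()) {
  tp <- testprotocols[testprotocols$protocol_no == protocol, ]
  results$protocolNo <- tp$protocol_no
  results$title      <- tp$title
  results$nctNo      <- tp$nct_number
  results
}

# One payload per protocol number, each holding single values
payloads <- lapply(testprotocols$protocol_no, build_payload)
str(payloads[[1]])
```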
structure(list(trip_count = 1:10, pickup_longitude = c(-73.964096,
-73.989037, -73.934998, -73.93409, -73.998222, -74.004478, -73.994881,
-73.955917, -73.993607, -73.948265), pickup_latitude = c(40.764141,
40.760208, 40.746693, 40.715908, 40.750809, 40.741501, 40.74033,
40.776054, 40.758625, 40.778515)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
So I am trying to recode the longitude and latitude data into ZIP codes of NYC.
With revgeocode from ggmap I already get some results:
res <- revgeocode(c(test$pickup_longitude[1],test$pickup_latitude[1]), output = "all")
However, I can apply this procedure to just one location at a time. Is there a way to get the results for every location at once, like
res <- revgeocode(c(test$pickup_longitude,test$pickup_latitude), output = "all")
How can I extract the ZIP code from the ggmap results so that I can create a new column in the data frame with the corresponding ZIP code? Somehow the ZIP code is stored in a very strange way (see picture). How can I access this information in the console?
The solution does not have to use ggmap; maybe there is a different approach to getting the ZIP codes from the longitude and latitude data?
Thanks!
I don't have an API key, but try something like the following:
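library(dplyr)  # provides %>%, rowwise() and mutate()
# Stand-in for revgeocode(): returns a nested list with a fake ZIP code in the 8th slot
my_func <- function(longlat, output){
  list(list(list(c(vector("list", 7), list(12345L)))))
}
test %>%
  rowwise() %>%
  mutate(zip = list(my_func(c(pickup_longitude, pickup_latitude), output = "all"))) %>%
  # Drill into the nested list, then pick out the 8th element (the fake ZIP code)
  mutate(zip = zip[[1]][[1]],
         zip = zip[[8]])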
Replace my_func with revgeocode. You'll have to figure out exactly how to pick the ZIP code out of the output, but you can try something like the above.
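For the "pick out the ZIP code" step, searching the nested list by type may be more robust than fixed positions like [[8]]. The sketch below assumes the response follows the usual geocoding layout (results, then address_components, each with long_name and types). The mock object and the helper name extract_zip are made up for illustration, and the real ggmap output may be shaped differently.

```r
# Mock of a reverse-geocoding response (ASSUMED shape, not real API output)
mock_res <- list(results = list(list(address_components = list(
  list(long_name = "New York", types = list("locality", "political")),
  list(long_name = "10065",    types = list("postal_code"))
))))

# Hypothetical helper: returns the first component tagged "postal_code",
# or NA if none is found
extract_zip <- function(res) {
  for (result in res$results) {
    for (comp in result$address_components) {
      if ("postal_code" %in% unlist(comp$types)) {
        return(comp$long_name)
      }
    }
  }
  NA_character_
}

extract_zip(mock_res)
```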
I have a data frame, imported from Excel with library(readxl). It contains long columns of data, each with its own column title. Now I need to store specific values in new variables. I stored the column titles in the vector "titles" and want to extract values from a specific row, e.g. row 151, and store each in a new variable.
I tried the code below. I am really new to R; I tried a lot and failed...
example <- data.frame(c('N 1','N 2'), c(50, 60), c(70, 80))
titles <- c('N 1', 'N 2')
for (i in titles) {
(paste("nkorrigiert",i)) <- as.numeric(example[[paste(i)]][3])
}
dput(head(example))
and get this
Error in (paste("nkorrigiert", i)) <- as.numeric(example[[paste(i)]][3]) :
target of assignment expands to non-language object
> dput(head(example))
structure(list(c..N.1....N.2.. = structure(1:2, .Label = c("N 1",
"N 2"), class = "factor"), c.50..60. = c(50, 60), c.70..80. = c(70,
80)), .Names = c("c..N.1....N.2..", "c.50..60.", "c.70..80."),
row.names = 1:2, class = "data.frame")
What am I doing wrong?
You can use the assign command
example <- data.frame(c('N 1','N 2'), c(50, 60), c(70, 80))
titles <- c('N 1', 'N 2')
for (i in titles) {
assign(paste("nkorrigiert",i), as.numeric(example[[paste(i)]][3]))
}
dput(head(example))
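R does not understand that you want to create a new variable: the left-hand side of <- has to refer to a variable name, and paste() only returns a character string.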
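A small sketch of both options: assign() with get() to read the value back, and a named vector as an alternative that is often easier to calculate with. The data frame here is a made-up stand-in with columns actually named "N 1" and "N 2", and row 3 is used in place of row 151.

```r
# Hypothetical stand-in data: columns are really named "N 1" and "N 2"
example <- data.frame(`N 1` = c(50, 60, 70), `N 2` = c(80, 90, 100),
                      check.names = FALSE)
titles <- c("N 1", "N 2")

# Option 1: create one variable per title with assign(), read it with get()
for (i in titles) {
  assign(paste("nkorrigiert", i), example[3, i])
}
get("nkorrigiert N 1")

# Option 2: keep the values together in one named vector
nkorrigiert <- vapply(example[titles], function(col) col[3], numeric(1))
nkorrigiert["N 2"]   # look up by column title
nkorrigiert * 2      # calculations apply to every value at once
```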
With your help and the post suggested by @lmo I was able to solve it. Thank you guys! :D
Now I have my first code almost running, yay! With this line inside the for loop it was possible:
as.numeric(assign(paste("nkorrigiert",i), example[3, i]))
Now I just need to find out how to calculate with these values stored in variable names in a for loop! :D
All the best, Sebastian
I used R dump() to create a data.txt file as specified by the latest JAGS manual, but I keep running into this error:
Reading data file data.txt
syntax error, unexpected LIST, expecting DOUBLE or NA or ASINTEGER or 'c'
The data.txt produced by dump(), from which I have removed the "L" assigned by R:
M <- 4
N <- 2
x <- structure(list(Var1 = c(0, 1, 0, 1), Var2 = c(0, 0, 1, 1)), .Names = c("Var1",
"Var2"), out.attrs = structure(list(dim = c(2, 2), dimnames = structure(list(
Var1 = c("Var1=0", "Var1=1"), Var2 = c("Var2=0", "Var2=1"
)), .Names = c("Var1", "Var2"))), .Names = c("dim", "dimnames"
)), class = "data.frame", row.names = c(NA, -4))
counts <- c(377558, 1001, 2000, 2000)
total <- 382559
If I remove x, the data will import correctly, but obviously that is not what I want. The strangest part is that if I use the rjags and R2jags packages instead, the whole thing works fine. Does anyone know how to format this data to work in JAGS?
As Martyn said over on the JAGS forum, a list (or data.frame) is not allowed in JAGS. You need to convert this to an array or matrix before using dump.
By the way, if you need to call JAGS externally then you might also want to check out the runjags package (on CRAN) which does a lot of the automation of creating files to call JAGS (try run.jags(..., method='interruptible', keep.jags.files='my_folder') for example). You will still need to convert your data frame to a matrix first though.
Matt
What seemed to fix this issue for me was a simple command per Martyn's suggestion on the JAGS board:
x <- as.matrix(x)
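Put together, the conversion and the dump might look like the sketch below. Building x with expand.grid() is a guess based on the out.attrs in the question, and dropping the dimnames with unname() is a cautious assumption rather than something the JAGS manual is known to require. Whether JAGS accepts the resulting text may also depend on the R and JAGS versions.

```r
M <- 4
N <- 2
# Assumed origin of x: a 4 x 2 grid of 0/1 values stored as a data frame
x <- expand.grid(Var1 = c(0, 1), Var2 = c(0, 1))
counts <- c(377558, 1001, 2000, 2000)
total <- 382559

# Convert the data frame to a plain numeric matrix before dumping
x <- unname(as.matrix(x))

# Write all objects in R dump format to a temporary file and inspect it
out_file <- tempfile(fileext = ".txt")
dump(c("M", "N", "x", "counts", "total"), file = out_file)
cat(readLines(out_file), sep = "\n")
```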
I'm pulling data from the Google Analytics API, processing it locally, then knitting an .Rmd file into text, tables, and visualisations. As part of the knitting/tabling process, I'm doing some basic formatting (e.g. rounding off percentages and adding % signs).
For this question, I have toPercent(), which works fine if used like this:
toPercent <- function(percentData){
percentData <- round(percentData, 2)
percentData <- mapply(toString, percentData)
percentData <- paste(percentData, "%", sep="")
}
devices <- toPercent(devices$avgSessionDuration)
However, manually setting the function for every table is time-intensive. I created the percentCheck() to look for columns that matched my criteria:
percentCheck <- function(data){
data[,grep("rate|percent", names(data), ignore.case=TRUE)] <- toPercent(data[,grep("rate|percent", names(data), ignore.case=TRUE)])
}
devices <- percentCheck(devices)
But I know this doesn't work on a dataset with multiple matches (e.g. a column for exitRate and a column for bounceRate).
Q1: Have I written toPercent() in a way that won't return multiple values to one entry?
Q2: How can I structure percentCheck() to map over the dataset and only apply toPercent() if the column name includes a given string?
Version/Packages:
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
library(rga)
library(knitr)
library(stargazer)
Data:
> dput(devices)
structure(list(deviceCategory = c("desktop", "mobile", "tablet"
), sessions = c(817, 38, 1540), avgSessionDuration = c(153.424888853179,
101.942758538617, 110.270988142292), bounceRate = c(39.0192297391397,
50.2915625371891, 50.1343873517787), exitRate = c(25.3257456030279,
32.0236280487805, 29.0991902834008)), .Names = c("deviceCategory",
"sessions", "avgSessionDuration", "bounceRate", "exitRate"), row.names = c(NA,
-3L), class = "data.frame")
How about this modification:
percentCheck <- function(data){
idx <- grepl("rate|percent", names(data), ignore.case=TRUE)
data[idx] <- lapply(data[idx], function(x) paste0(sprintf("%.2f", round(x,2)), "%"))
return(data)
}
Here, I first used grepl to create an index of the columns that meet the specified criteria. This index is then used with lapply to apply a function to each of those columns. The function is similar to your toPercent function, only a bit more compact.
Now you can apply it to your whole data set in one go:
percentCheck(devices)
#   deviceCategory sessions avgSessionDuration bounceRate exitRate
# 1        desktop      817           153.4249     39.02%   25.33%
# 2         mobile       38           101.9428     50.29%   32.02%
# 3         tablet     1540           110.2710     50.13%   29.10%
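On Q1, one way to keep a separate formatting helper is to make it vectorised, so that it returns one string per input value. This is only a sketch: the names to_percent and percent_check are made up here, and the data frame is a cut-down copy of devices from the question.

```r
# Vectorised formatter: one "xx.xx%" string per input value
to_percent <- function(x, digits = 2) {
  sprintf(paste0("%.", digits, "f%%"), x)
}

# Apply to_percent() only to columns whose names match the pattern
percent_check <- function(data, pattern = "rate|percent") {
  idx <- grepl(pattern, names(data), ignore.case = TRUE)
  data[idx] <- lapply(data[idx], to_percent)
  data
}

# Cut-down copy of the question's data
devices <- data.frame(
  deviceCategory = c("desktop", "mobile", "tablet"),
  bounceRate = c(39.0192297391397, 50.2915625371891, 50.1343873517787),
  exitRate   = c(25.3257456030279, 32.0236280487805, 29.0991902834008),
  stringsAsFactors = FALSE
)

out <- percent_check(devices)
out
```

Because the pattern is an argument, the same function can be reused for tables whose percentage columns follow a different naming scheme.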