Redefining a Dataframe for Regression Analysis in R

I have a dataframe with timestamps of several transportations from A to B, plus information about the material (volume, weight, etc.).
I recreated the important parts of the raw Excel sheet I use.
My first step is to calculate the time each transport took by simply subtracting the dates, as I only need daily precision. I put all the times in a numeric vector to make further calculations and plots easier.
BUT:
I'd like to perform a regression analysis on it. I know how to create an lm.
My problem: due to several NAs, my numeric vector of "transport days" is shorter than the columns in the dataframe.
How can I merge the columns from the dataframe with my numeric vector so that the transport times match up with the materials again?

Are you looking for something like
library(dplyr)
df %>%
  mutate(diff = as.numeric(t4 - t1))
You then have a time-difference column while the volume column is still in the df. You can also tell lm() how to deal with NAs, so you don't need to drop them (I don't think you were doing that anyway).
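For completeness, a minimal sketch of such a model, assuming the material column is literally named volume (the real column names are not shown in the question):
# rows with NA in diff or volume are dropped inside lm() itself,
# so the remaining values stay aligned with their materials
fit <- lm(diff ~ volume, data = df, na.action = na.omit)
summary(fit)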

Related

R calculate averages by group with uneven categorical data

I want to calculate averages for categorical data. My data is in a long format, and I do not understand why I am not succeeding.
Here is an example (imagine it as individual participants, indicated by id, picking different options, in this example m_ex):
id <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
m_ex <- c("a", "b", "c", "b", "a", "b", "c", "b", "a", "a", "c")
df <- data.frame(id, m_ex)
print(df)
I want to calculate averages for m_ex. That is, the average number of times specific m_ex are picked. I am trying to achieve this with dplyr, but I do not quite understand how to proceed with the ids having different lengths. What would I have to divide by then? And is it a problem that the ids do not have equal lengths?
I really appreciate any help you can provide.
I have tried using dplyr, grouping by id and summarizing the results, without much success. In particular, I would like to understand what I am currently missing.
I get something like this, but how do I get the averages?
(example picture of the intermediate output: https://i.stack.imgur.com/7nxze.jpg)
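One way to read "averages" here is the mean number of times each option is picked per id; a dplyr sketch under that assumption:
library(dplyr)

df %>%
  count(id, m_ex) %>%             # how often each id picked each option
  group_by(m_ex) %>%
  summarize(avg_picks = mean(n))  # average picks per id, for each option
Note that ids that never pick a given option contribute no row to count(); if zeros should enter the average, tidyr::complete(id, m_ex, fill = list(n = 0)) can fill them in before summarizing.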

More efficient way to get measurements from different items in R

I'm currently computing accuracy measurements for 80k different items. I need to calculate the measurements independently for each item, but this is taking too long, so I want to find a faster way to do it.
Here's my R code with its comments:
work_file contains 4 variables: item_id, Dates, demand, and forecast.
My code:
output <- 0
uniques <- unique(work_file$item_id)
for (i in uniques) {
  # filter every unique item
  temporal <- work_file %>% filter(item_id == i)
  # calculate the accuracy measure for each item
  x <- temporal$demand
  x1 <- temporal$forecast
  item_error <- c(i, accuracy(x1, x))
  output <- rbind(output, item_error)
}
For ~80k unique items this takes hours.
Any suggestions?
R is a vectorized language, so one can usually avoid explicit loops. Also, rbind-ing within a loop is especially slow, since the output data structure is constantly deleted and recreated on each iteration.
Provided the accuracy() function can accept vector inputs, this should work (without sample data to test, there is always some doubt):
answer <- work_file %>%
  group_by(item_id) %>%
  summarize(acc = accuracy(forecast, demand))
Here dplyr's group_by() collects the rows for each item_id, and summarize() then passes each group's forecast and demand vectors to the accuracy function.
Consider also data.table methods, which would be efficient:
library(data.table)
setDT(work_file)[, .(acc = accuracy(forecast, demand)), item_id]
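One caveat: if accuracy() is the one from the forecast package, it returns a matrix of several measures rather than a single number, which does not fit neatly into one summarize column. A sketch that computes a single measure per group directly, assuming MAPE is the measure of interest:
library(dplyr)

answer <- work_file %>%
  group_by(item_id) %>%
  # mean absolute percentage error, computed directly per item
  summarize(mape = mean(abs((demand - forecast) / demand)) * 100)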

Merging Two Datasets on Matched Column in R

I'm an R beginner and I'm trying to merge two datasets and having some trouble with losing data. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System and the data looks like this
image of 10 rows of data from this set
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period.
image of 10 rows of the book ordering dataset
I've named this dataset DOA
I'm unsure how to include the data other than as an image.
(I can also provide the .csv files if needed.)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
This dataset is just over 1,000 rows. The values are padded because the 000 to 099 Dewey Decimal classifications were dropping their leading 0s.
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets and create a new set called DOA_Call
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head() the data, the merge seems to be working properly, but 10,000 rows do not get the DDC data added; they just stay in their original state. This is about 40% of my total dataset, so it is pretty substantial. My first instinct was that it was only putting DDC rows in once, but that would mean I would be missing 23,000 rows, which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best attempt with the information you provided. You will need:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html),
the stringr library to handle some variables (https://stringr.tidyverse.org/),
and some familiarity with the tidyverse.
Please keep in mind that the best way to ask on Stack Overflow is by providing a minimal reproducible example.
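A minimal sketch of the join, assuming the column names from the question (Call_Category in both frames) and building the key as described above:
library(dplyr)
library(stringr)

# build the three-digit key in both frames
DDC$Call_Category2 <- str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
DOA_data$Call_Category2 <- substr(DOA_data$Call_Category, 1, 3)

# keep every ordered book and attach the matching DDC row via the shared key
DOA_Call <- left_join(DOA_data, DDC, by = "Call_Category2")
To troubleshoot the unmatched 40%, anti_join(DOA_data, DDC, by = "Call_Category2") lists exactly the rows whose keys find no partner, which usually shows whether the keys need more cleaning. Joining explicitly by = "Call_Category2" also avoids merge()'s default of matching on every shared column name, a common source of silently failed matches.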

Normalizing data frame while holding a categorical column out

Very new to R.
I am trying to normalize multiple variables in a matrix, except the last column, which holds a categorical factor variable (in this case good/notgood).
Is there any way to normalize the data without affecting the categorical column? I have tried normalizing while keeping the categorical column out, but I can't seem to add it back again.
minimum <- apply(mywines[, -12], 2, min)
maximum <- apply(mywines[, -12], 2, max)
mywinesNorm <- scale(mywines[, -12], center = minimum, scale = maximum - minimum)
I still need the 12th column to build supervised models.
The short version is that you can simply reattach the column using cbind. However, it is a little more complicated than that: scale returns a matrix, not a data frame, and to mix numbers and factors you need a data.frame. So before the cbind, you will want to convert the scaled matrix back to a data.frame.
mywinesNorm <- cbind(as.data.frame(mywinesNorm), mywines[, 12])
A different approach would be to change the data in place (note the -12: you scale every column except the factor):
mywines[, -12] <- scale(mywines[, -12], center = minimum, scale = maximum - minimum)
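A sketch of an alternative that never leaves the data.frame, so the factor column is untouched and no cbind is needed (assuming column 12 is the only non-numeric column):
# min-max scale each numeric column in place; column 12 stays a factor
mywines[-12] <- lapply(mywines[-12], function(x) (x - min(x)) / (max(x) - min(x)))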

Time period between dates in R

I have a data frame with an Id column and a Date column.
Essentially, I would like to create a third column (Diff) that calculates the difference between dates, preferably grouped by ID.
I have constructed a large POSIXlt vector with the following code:
c_time <- as.POSIXlt(df$Gf_Date)
a <- difftime(c_time[1:(length(c_time)-1)], c_time[2:length(c_time)], units = "weeks")
However, when I try to cbind this onto my data.frame, it errors with
"arguments imply differing number of rows"
since a is one element shorter than the original data.frame.
Any help would be greatly appreciated.
Since the difference can only be taken between two subsequent dates, it is undefined for the first entry. Therefore a reasonable choice would be to set that first value to NA.
This might work:
c_time <- as.POSIXlt(df$Gf_Date)
a <- c(NA, `units<-`(diff(c_time), "weeks"))
cbind(df,diff.dates=a)
(Hat tip to @thelatemail for a valuable suggestion that simplified the definition of a.)
PS: Note that the differences in a may have a different sign compared to your original approach. Depending on the convention that you prefer, you can use a <- -a to convert between the two.
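The question also asks for the differences grouped by ID, which the code above does not do. A dplyr sketch, assuming the columns are named Id and Gf_Date:
library(dplyr)

df %>%
  arrange(Id, Gf_Date) %>%
  group_by(Id) %>%
  mutate(Diff = as.numeric(difftime(Gf_Date, lag(Gf_Date), units = "weeks"))) %>%
  ungroup()
Here lag() makes the first Diff within each Id an NA, matching the convention above.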
