Look up master data in R

I have a problem with the following topic.
Doing this in Excel is too much work, so now I want to automate it in R.
I have different models of washing machines.
For every model, I have a data.frame with all required components. For one model, as an example:
Component = c("A","B","C","D","E","F","G","H","I","J")
Quantity = c(1,1,1,2,4,1,1,1,2,3)  # number needed per washing machine
Model.A= data.frame(Component,Quantity)
As the second piece of information, I have a data.frame with all components used by all models, together with the current stock of each component.
Component = c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z")
Stock = c(100,102,103,105,1800,500,600,400,50,80,700,900,600,520,35,65,78,95,92,50,36,34,96,74,5,76)
Comp.Stock = data.frame(Component,Stock)
The third and last piece of information is the weekly production plan. I have 4 weekly production plans, i.e. a plan for 1 month. I have a data.frame with the models of washing machines that will be produced in the next 4 weeks, along with their quantities.
pr.Models= c("MODEL.A","MODEL.B","MODEL.C","MODEL.D")
Quantity= c(15000,1000,18000,16000,5000)  # as posted: 5 quantities for 4 models; data.frame() needs equal lengths
Production= data.frame(pr.Models,Quantity)
My problem now is how to combine these pieces of information so that I can compare the models being produced (the last data.frame) with the components: first with the components used by each model on its own, and then with the data.frame that holds all components and their stock.
The aim is to get a report and a warning if the component stock is not large enough to produce the models in the production plan.
Hint: many of the same components are used by different models.
Hopefully you understand what I mean and can help me with this problem.
Thank you =)
EDIT:
I cannot follow all your steps.
Maybe this idea is also good, but I need a hint on how to do it:
Maybe it is possible to merge every produced model (Production) with its used components, taking into account the production quantity and the number needed per washing machine.
My preferred output is to automatically get a data.frame for every produced model with the needed components.
In the next step, it should be possible to merge this data with Comp.Stock to see how many of each component are needed and compare this with the stock.
Do you have any ideas along these lines?
Maybe I am too stupid for the presented way... I really need an automatic way because there are more than 4k different components and more than 180 different models of washing machines.
Thank you
the Comp.Stock additionally with all used models and their quantity (Production)

You need to have the model name as a column in the first data.frame (to match Production)
Model.A$pr.Models <- 'MODEL.A'
Then you can merge. Note that there are two "Quantity" columns, and you don't want to merge by those:
merged <- merge(merge(Model.A, Comp.Stock),Production, by='pr.Models')
Extra is how many you will have on-hand after production:
transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)
## pr.Models Component Quantity.x Stock Quantity.y Needed Extra
## 1 MODEL.A A 1 100 15000 15000 -14900
## 2 MODEL.A B 1 102 15000 15000 -14898
## 3 MODEL.A C 1 103 15000 15000 -14897
## 4 MODEL.A D 2 105 15000 30000 -29895
## 5 MODEL.A E 4 1800 15000 60000 -58200
## 6 MODEL.A F 1 500 15000 15000 -14500
## 7 MODEL.A G 1 600 15000 15000 -14400
## 8 MODEL.A H 1 400 15000 15000 -14600
## 9 MODEL.A I 2 50 15000 30000 -29950
## 10 MODEL.A J 3 80 15000 45000 -44920
If Extra is negative, you'll need more parts. You're seriously deficient.
transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)$Extra < 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Not enough of any part.
As a function:
Not.Enough.Parts <- function(Model, Comp.Stock, Production) {
  # Derive the model name from the name of the argument, e.g. Model.A -> "MODEL.A"
  Model$pr.Models <- toupper(deparse(substitute(Model)))
  merged <- merge(merge(Model, Comp.Stock), Production, by='pr.Models')
  extra <- transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)
  # TRUE means the stock of that component is too small
  retval <- extra$Extra < 0
  names(retval) <- extra$Component
  return(retval)
}
Not.Enough.Parts(Model.A, Comp.Stock, Production)
## A B C D E F G H I J
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
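For the 180 models and 4k components mentioned in the edit, one option is to stack all model data.frames into one long table and do the lookup once, instead of calling a function per model. Here is a minimal sketch, assuming the per-model data.frames are kept in a named list. The list `bom`, its two small models, the column name `Number` and the stock figures are made up for illustration:

```r
# Hypothetical bills of materials, one data.frame per model (illustrative data)
bom <- list(
  MODEL.A = data.frame(Component = c("A", "B", "C"), Number = c(1, 2, 1)),
  MODEL.B = data.frame(Component = c("B", "C", "D"), Number = c(1, 1, 3))
)
Comp.Stock <- data.frame(Component = c("A", "B", "C", "D"),
                         Stock = c(100, 200, 500, 100))
Production <- data.frame(pr.Models = c("MODEL.A", "MODEL.B"),
                         Quantity = c(100, 50))

# Stack all models into one long table with the model name as a column
bom_long <- do.call(rbind, Map(function(d, m) cbind(pr.Models = m, d),
                               bom, names(bom)))

# Components needed per model = number per machine * machines planned
need <- merge(bom_long, Production, by = "pr.Models")
need$Needed <- need$Number * need$Quantity

# Total need per component across all models, compared with the stock
total <- aggregate(Needed ~ Component, data = need, FUN = sum)
check <- merge(total, Comp.Stock, by = "Component")
check$Shortfall <- pmax(check$Needed - check$Stock, 0)

short <- check[check$Shortfall > 0, ]
if (nrow(short) > 0) {
  warning("Insufficient stock for: ",
          paste(short$Component, collapse = ", "), call. = FALSE)
}
short
```

Because shared components are summed before the comparison, a part used by several models is checked against the stock only once. If a data.frame per model is still wanted, `split(need, need$pr.Models)` gives one.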

Related

R: Apriori Algorithm does not find any association rules

I generated a dataset holding two distinct columns: an ID column associated to a customer and another column associated to his/her active products:
head(df_itemList)
ID PRD_LISTE
1 1 A,B,C
3 2 C,D
4 3 A,B
5 4 A,B,C,D,E
7 5 B,A,D
8 6 A,C,D
I only selected customers that own more than one product. In total I have 589.454 rows and there are 16 different products.
Next, I wrote the data.frame to a CSV file like this:
df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)
Then, I converted the csv-file into a basket format in order to apply the apriori algorithm as implemented in the arules-package.
library(arules)
txn <- read.transactions(file="Basket_List_13-08-2020.csv",
rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)
The summary-function yields the following output:
summary(txn)
transactions as itemMatrix in sparse format with
589455 rows (elements/itemsets/transactions) and
1737 columns (items) and a density of 0.0005757052
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
element (itemset/transaction) length distribution:
sizes
1
589455
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D
includes extended transaction information - examples:
transactionID
1
2 1
3 3
Now, I tried to run the apriori-algorithm:
basket_rules <- apriori(txn, parameter = list(sup = 1e-15,
conf = 1e-15, minlen = 2, target="rules"))
This is the output:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.01 0.1 1 none FALSE TRUE 5 1e-15 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.04s].
Even with a ridiculously low support and confidence, no rules are generated...
summary(basket_rules)
set of 0 rules
Is this really because of my dataset? Or was there a mistake in my code?
Your summary shows that the data is not read in correctly:
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
Looks like "A,C" is read as an item, but it should be two items "A" and "C". The separating character does not work. I assume that could be because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (headers) using skip = 1 when you read the transactions.
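On what the file should look like: as far as I understand the basket format, each line is one transaction and the items on that line are separated by sep, so the file should have no header, no row names and no quotes around the item list. write.csv quotes character fields by default, which would explain why A,C arrives as a single item. A minimal sketch with made-up baskets (the arules call only runs if the package is installed):

```r
# Illustrative baskets (hypothetical data): one vector of products per customer
baskets <- list(c("A", "B", "C"), c("C", "D"), c("A", "B"))

# One line per customer, items separated by commas,
# with no header, no row names and no quotes
lines <- vapply(baskets, paste, character(1), collapse = ",")
path <- tempfile(fileext = ".csv")
writeLines(lines, path)
readLines(path)

# Read the file as transactions; each comma-separated value becomes one item
if (requireNamespace("arules", quietly = TRUE)) {
  txn <- arules::read.transactions(path, format = "basket", sep = ",")
  print(arules::itemLabels(txn))
}
```

With a file written this way there is no header line to skip and no ID column, so skip and cols are not needed.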
@Michael I am quite positive now that there is something wrong with the .csv file I am reading in. Since others have experienced similar problems, my guess is that this is a common cause of the error. Can you please describe what the .csv file should look like when read in?
When typing in data <- read.csv("file.csv", header = TRUE, sep = ",") I get the following data.frame:
X Prd
1 A
2 A,B
3 B,A
4 B
5 C
Is it correct that - if there are multiple products for a customer X - these products are all written in a single column? Or should be written in different columns?
Furthermore, when writing txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1, skip=1) and summary(txn) I see the following problem:
most frequent items:
A B C A,B B,A
1256 1235 456 235 125
(numbers are chosen randomly)
So the read.transactions function differentiates between A,B and B,A... So I am guessing there is something wrong with the .csv file.

Select rows from data frame based on condition by group

I have a data frame which is a list of lists - a list of storms, each point of a storm is a row. One column is whether each point in the storm is over land. I'm able to work out which storms have made landfall, however I do not know how to select only those storms, i.e. create a new data frame of only those storms that have made landfall.
This code tells me whether a storm has made landfall (grouping by ID, it sums the inregion column (1 or 0), and if the sum is greater than 0 the storm has made landfall):
land_tracks <- all_tracks[, sum(inregion) > 0, by = ID]
Gives me:
ID V1
1: 1987051906_15933 TRUE
2: 1987060118_16870 TRUE
3: 1987061306_18015 TRUE
4: 1987062100_18878 TRUE
5: 1987062918_19507 FALSE
6: 1987070512_20168 TRUE
7: 1987070812_20341 TRUE
8: 1987071218_20635 TRUE
9: 1987071412_20762 TRUE
10: 1987071606_20881 TRUE
How do I use this to go through all_tracks to find all the rows which match the ID where V1 == TRUE?
I regularly have the issue that land_tracks has 41 rows, all_tracks has 1879 rows, and R raises an issue about recycling.
Maybe you can do something like an INNER JOIN between those two tables:
merge(all_tracks,land_tracks[which(land_tracks$V1== TRUE)], by = 'ID')
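A base R alternative, as a minimal sketch with made-up data: ave() returns the per-storm sum repeated on every row of that storm, so the logical vector has the same length as all_tracks and nothing gets recycled. The IDs and values below are illustrative, not taken from the question:

```r
# Illustrative data (hypothetical storms; inregion is 1 over land, 0 otherwise)
all_tracks <- data.frame(
  ID = c("s1", "s1", "s2", "s2", "s3"),
  inregion = c(0, 1, 0, 0, 1)
)

# Per-row flag: TRUE if the storm this row belongs to has any point over land
made_landfall <- ave(all_tracks$inregion, all_tracks$ID, FUN = sum) > 0

# Keep every row of the storms that made landfall
land_only <- all_tracks[made_landfall, ]
land_only
```

If all_tracks is a data.table, the same logical vector can be used as all_tracks[made_landfall].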

aggregating the data in different columns based on condition in r

order_id customer_id extension_of quantity cost duration
1 123 srujan NA 1 100 30
2 456 teja NA 1 100 30
3 789 srujan 123 1 100 30
I have sample data which contains order information. What I need to do is summarize the data(sum of cost etc) if value of order_id and extension_of columns matches.
I take it you want to add rows #1 and #3 as #1 is like the parent order and #3 is its extension. Here's one way you can do it (assuming dft is your data frame):
library(purrr)
dft$parent_id <- map2_dbl(dft$order_id, dft$extension_of, function(order_id, extension_of) if(is.na(extension_of)) order_id else extension_of)
aggregate(cost~parent_id, data=dft, FUN=sum)
This cannot handle recursion (i.e. an extension that is itself extended), but it does sum up one level.
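For the one-level case, the same idea can be sketched in base R without purrr, using the three sample rows from the question. Summing quantity and duration alongside cost is an assumption about what "sum of cost etc" means:

```r
# The sample rows from the question
dft <- data.frame(
  order_id = c(123, 456, 789),
  customer_id = c("srujan", "teja", "srujan"),
  extension_of = c(NA, NA, 123),
  quantity = c(1, 1, 1),
  cost = c(100, 100, 100),
  duration = c(30, 30, 30)
)

# An order without a parent is its own parent; an extension points to its parent
dft$parent_id <- ifelse(is.na(dft$extension_of), dft$order_id, dft$extension_of)

# Sum several columns at once per parent order
totals <- aggregate(cbind(quantity, cost, duration) ~ parent_id,
                    data = dft, FUN = sum)
totals
```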
If you need several levels, you can go for something like this:
library(purrr)
find_root <- function(order_id, extension_of){
  if(is.na(extension_of)){
    return(order_id)
  }else{
    parent_ext <- dft$extension_of[dft$order_id==extension_of]
    return(find_root(extension_of, parent_ext))
  }
}
dft$parent_id <- map2_dbl(dft$order_id, dft$extension_of, find_root)
aggregate(cost~parent_id, data=dft, FUN=sum) # same as before
(Note that it looks up the parent's parent, if necessary.) One can certainly optimize for performance if really needed; the principle will stay the same, however.

R accessing variable column names for subsetting

The following works and does what I want it to do:
dat<-subset(data,NLI.1 %in% NLI)
However, I may need to subset via a different column (i.e. NLI.2 and NLI.3). I've tried
NLI_col<-"NLI.1"
NLI_col<-subset(data,select=NLI_col)
dat<-subset(data,NLI_col %in% NLI)
Unsurprisingly this doesn't work. How do I use NLI_col to achieve the result from the code that does work?
It was requested that I give an example of what data looks like. Here:
NLI.1<-c(NA,NA,NA,NA,NA,1,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,2,2,1,2,2,2)
NLI.2<-c(NA,NA,NA,NA,NA,NA,2,2,2,NA,NA,2,2,2,2,2,2,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,1,2,2,2,2,2,2,2)
NLI.3<-c(NA,35,40,NA,10,NA,31,NA,14,NA,NA,15,17,NA,NA,16,10,15,14,39,17,35,14,14,22,10,15,0,34,23,13,35,32,2,14,10,14,10,10,10,40,10,13,13,10,10,10,13,13,25,10,35,NA,13,NA,10,40,0,0,20,40,10,14,40,10,10,10,10,13,10,8,NA,NA,14,NA,10,28,10,10,15,15,16,10,10,35,16,NA,NA,NA,NA,30,19,14,30,10,10,8,10,21,10,10,35,15,34,10,39,NA,10,10,6,16,10,10,10,10,34,10)
other<-c(NA,NA,511,NA,NA,NA,NA,NA,849,NA,NA,NA,NA,1324,1181,832,1005,166,204,1253,529,317,294,NA,514,801,534,1319,272,315,572,96,666,236,842,980,290,843,904,528,27,366,540,560,659,107,63,20,1184,1052,214,46,139,310,872,891,651,687,434,1115,1289,455,764,938,1188,105,757,719,1236,982,710,NA,NA,632,NA,546,747,941,1257,99,133,61,249,NA,NA,1080,NA,645,19,107,486,1198,276,777,738,1073,539,1096,686,505,104,5,55,553,1023,1333,NA,NA,969,691,1227,1059,358,991,1019,NA,1216)
data<-cbind(NLI.1,NLI.2,NLI.3,other)
NLI<-c(10,13)
With this, after sub-setting I should get all the rows with tens and thirteens in data$NLI.3 if NLI_col <- "NLI.3"
Since this is relatively trivial I am guessing this is a duplicate question (my apologies), but the hours drag on and I still can't find a solution.
Seems like you are unnecessarily using subset. Try this:
NLI_col <- 'NLI.3'
head(data[,NLI_col] %in% NLI)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
head(data[data[,NLI_col] %in% NLI, ])
## NLI.1 NLI.2 NLI.3 other
## 5 NA NA 10 NA
## 17 2 2 10 1005
## 26 2 2 10 801
## 31 2 2 13 572
## 36 2 2 10 980
## 38 2 2 10 843
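If data is a data.frame rather than the matrix that cbind creates, the column can also be picked by name with [[ ]]. A minimal sketch on a small made-up data.frame (the four rows below are illustrative, not the question's data):

```r
# Small illustrative data.frame with the same column names as in the question
data <- data.frame(
  NLI.1 = c(NA, 2, 2, 1),
  NLI.3 = c(10, 35, 13, NA),
  other = c(NA, 511, 849, 20)
)
NLI <- c(10, 13)

# The column to filter on is held in a variable
NLI_col <- "NLI.3"

# [[ ]] extracts the column named by the string; %in% treats NA as no match
dat <- data[data[[NLI_col]] %in% NLI, ]
dat
```

Switching to another column only requires changing the string in NLI_col.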
I'm not sure I am following the question exactly. Are you asking to just subset the rows of NLI.3 that contain a 10 or a 13? Is it more complicated than that?
If you just want to get those rows....
df[ which(df$NLI.3==10 | df$NLI.3==13 ),]
Assuming your data is in a dataframe. Also, I changed the name of the dataframe from 'data' to 'df' - calling it 'data' can lead to issues.

How to compare two consecutive rows with a reference value in R?

I have a data frame of vehicle trajectories. Here's a snapshot:
> head(df)
vehicle frame globalx class velocity lane
1 2 43 6451214 2 37.76 2
2 2 44 6451217 2 37.90 2
3 2 45 6451220 2 38.05 2
4 2 46 6451223 2 38.18 2
5 2 47 6451225 2 38.32 2
6 2 48 6451228 2 38.44 2
where, vehicle= vehicle id (repeats because same vehicle is observed in several time frames), frame= frame id of time frames in which it was observed, globalx = x coordinate of the front center of the vehicle, class=type of vehicle (1=motorcycle, 2=car, 3=truck), velocity=speed of vehicles in feet per second, lane= lane number (there are 6 lanes). I think following illustration will better explain the problem:
The 'frame' represents one tenth of a second i.e. one frame is 0.1 seconds long. At frame 't' the vehicle has globalx coordinate x(t) and at frame 't-1' (0.1 seconds before) it was x(t-1). The reference location is 'U'(globalx=6451179.1116) and I simply want a new column in df called 'u' which has 'yes' in the row where globalx of the vehicle was greater than reference coordinate at 'U' AND the previous consecutive globalx coordinate of this vehicle was less than reference coordinate at 'U'. This means that if df has 100 vehicles then there will be 100 'yes' in 'u' column because every vehicle will meet the above criteria once. I have tried to do this by running the function with ifelse and also tried to do the same using a for loop but it doesn't work for me. The output should have one new column:
vehicle frame globalx class velocity lane u
I tried using ifelse inside for loop and a function but it doesn't work for me.
I assume the data frame is sorted primarily by vehicle and secondarily by globalx. If it is not, you can sort it with:
idx <- with(df,order(vehicle,globalx))
df <- df[idx,]
Now, you can perform it with the following vectorized operations:
# example reference line
U <- 6451220
# adding the extra column
samecar <- duplicated(df[,"vehicle"])
passU <- c(FALSE,diff(sign(df[,"globalx"]-U+1e-10))>0)
df[,"u"] <- ifelse(samecar & passU,"yes","no")
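A variant of the same idea that compares each row with the previous row of the same vehicle explicitly, as a minimal sketch. The six rows and the reference value U are made up for illustration, and ordering by frame is an assumption:

```r
# Illustrative trajectories for two vehicles (hypothetical data)
df <- data.frame(
  vehicle = c(2, 2, 2, 3, 3, 3),
  frame = c(43, 44, 45, 43, 44, 45),
  globalx = c(6451214, 6451217, 6451220, 6451214, 6451217, 6451220)
)
U <- 6451216  # example reference line

# Order by vehicle and time frame
df <- df[order(df$vehicle, df$frame), ]

# Previous globalx within each vehicle; NA for a vehicle's first row
prev_x <- ave(df$globalx, df$vehicle, FUN = function(x) c(NA, head(x, -1)))

# "yes" where the vehicle was before U one frame earlier and is past U now
df$u <- ifelse(!is.na(prev_x) & prev_x < U & df$globalx > U, "yes", "no")
df
```

The first row of each vehicle can never be "yes" here because it has no previous position to compare against.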
Here is my solution:
First create dummy data based on the data you provided (I saved it to data.txt on my desktop), then duplicate it so that there are two cars with identical data but different vehicle IDs:
library(plyr)
df <- read.table("~/Desktop/data.txt",header=T)
df.B <- df; df.B$vehicle = 3 #For demonstration
df <- rbind(df,df.B); rm(df.B)
Then we can build a function to process:
mvt <- function(xref=NULL,...,data=df){
  if(!is.numeric(xref)) #Input must be numeric
    stop("xref must be numeric",call.=FALSE)
  xref = xref[1]
  ##Split on vehicle and process.
  ddply(data,"vehicle",function(d){
    L = nrow(d) #Number of Rows
    d$u = FALSE #Default to Not crossing
    #One or more rows can be checked.
    if(L == 1)
      d$u = (d$globalx > xref)
    else if(L > 1){
      ix <- which(d$globalx[2:L] > xref & d$globalx[1:(L-1)] <= xref)
      if(length(ix) > 0)
        d$u[ix + 1] = TRUE
    }
    #done
    return(d)
  })
}
Which can be used in the following manner:
mvt(6451216)
mvt(6451217)
