K Means Clustering - IDs instead of indices in R

I cluster product IDs on amount of sales and profit of sales to identify the product IDs I need to focus on.
The code below takes column 2 (amount of sales) and column 3 (profit of sales) as input for kmeans. With the current labeling, row 1 is product 1, row 2 is product 2, etc. I want the labels to be product IDs (which are in data_nz[,1]) instead of row indices.
k2 <- kmeans(data_nz[,2:3], centers = 3, nstart = 1000)
When I output the data points in my clusters (excluding cluster 2, because those are the ones I don't care about):
k2$cluster[k2$cluster != 2]
I get the row indices and the cluster number, but what I want is the product ID and the cluster number.
Example of my dataset below: Product_ID, amount_of_sales, profit_of_sales
Can someone point me in the right direction?

You already have an ordered vector of product IDs in data_nz[, 1], which lines up with the vector of cluster numbers (k2$cluster). You can look at them side by side like this:
data.frame(product_id = data_nz[[1]],
           cluster = k2$cluster)
If you want to drop certain rows you can:
data.frame(product_id = data_nz[[1]],
           cluster = k2$cluster)[k2$cluster != 2, ]
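Alternatively (an untested sketch, assuming data_nz[[1]] holds the product IDs in the same row order as the clustering input), you can attach the IDs to the cluster vector as names, so your original subsetting call prints product IDs instead of row indices:
names(k2$cluster) <- data_nz[[1]]
k2$cluster[k2$cluster != 2] # now labelled by product ID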

Related

Higher/lower than column in R

I have got a data frame with a column called WTP filled with numbers from 5-30. In this data frame, I now want to add some new columns with price steps of 5 (i.e., buys at 5, buys at 10, ..., buys at 30). My goal is to fill these columns with a 1 if the WTP is higher than the price and a 0 if the WTP is lower than the price. What kind of function do I need to use to do this?
Thanks in advance!
df$buysat10 <- ifelse(df$WTP >= 10, 1, 0)
Then repeat with buysatX and >= X for the other cutoffs.
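If you'd rather not type that out six times, a small loop can build all the columns in one pass (a sketch assuming your data frame is called df, as above):
for (p in seq(5, 30, by = 5)) {
  df[[paste0("buysat", p)]] <- ifelse(df$WTP >= p, 1, 0)
}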

Knapsack problem in R: How to use loop in R to check for each item instead of whole column

I've been taking a "Simulation" course at my university and now have a task to solve a knapsack problem.
Basically I have 3 transporters (g1, g2, g3), each of which can carry a different maximum weight; on top of that, each item can be put into a transporter up to 5 times. My idea is to use a while loop so it runs as long as the total weight is below the maximum weight, and the program picks the item with the highest worth-to-weight relation first, as long as its pick index doesn't equal 0.
But here I got stuck: it seems the loop takes all items of the column at once per iteration, and it also doesn't stop when the value exceeds the maximum (it returns weight = 748 at the end).
Hence: how do I get the loop to check item by item instead of the whole column, and how can I ensure that it doesn't exceed the weight limit?
Thanks a lot for your help in advance!!
ItemIndex <- 1:10 #numbering the items (referenced in the cbind below)
WeightIndex <- c(2, 3, 4, 1, 7, 5, 8, 15, 9, 11) #indexing the weight of each item
WorthIndex <- c(3, 4, 5, 1, 3, 7, 3, 21, 11, 10) #indexing the worth of each item
Relation <- WorthIndex/WeightIndex #creates a relation between worth and weight of each item
PickIndex <- rep(5, 10) #each item can be picked 5 times
Items <- cbind(ItemIndex, WeightIndex, WorthIndex, Relation, PickIndex) #binding the indexes into a matrix
g1 <- 300 #max weight of carrier 1
g2 <- 250 #max weight of carrier 2
g3 <- 200 #max weight of carrier 3
ItemsSorted <- Items[order(Items[,4], decreasing = TRUE),] #sorts the items by the highest worth-to-weight relation
ItemsSorted1 <- ItemsSorted #creates a copied matrix for the first loop
TWE1 <- 0 #total weight of carrier 1
TWO1 <- 0 #total worth of carrier 1
print(ItemsSorted1) #bug check
while (TWE1 <= g1){ #while the condition is true, the program should execute a for loop
  if(ItemsSorted1[,5] != 0) #in theory the loop should go top to bottom, taking the items with the highest worth/weight relation first, and only take items whose pick index doesn't equal 0
  {
    TWE1 <- TWE1 + ItemsSorted1[,2] #the weight of the taken item gets added to the total weight
    TWO1 <- TWO1 + ItemsSorted1[,3] #the worth of the taken item gets added to the total worth
    ItemsSorted1[,5] <- ItemsSorted1[,5] - 1 #deducts 1 from the pick index as the item has been taken
    TWE1 <- sum(TWE1) #turns the vector into one number
    TWO1 <- sum(TWO1) #turns the vector into one number
    print(TWE1) #bug check
  }
}
print(paste("Total weight C1",TWE1))
print(paste("Total worth C1",TWO1))
print(ItemsSorted1)```
Skimming your code, it looks like you're missing a way to index which row you're referencing in the ItemsSorted1 matrix. Your comment after the while statement says you're expecting the code to run a for loop, but as written, it doesn't have a variable to increment.
You might be able to do something like the following to get your expected behavior:
i <- 1 # create a row-id variable to keep track of the position in ItemsSorted1
while (TWE1 <= g1){
  if(ItemsSorted1[i,5] != 0){ # go item by item through the sorted list
    TWE1 <- TWE1 + ItemsSorted1[i,2] #the weight of the taken item gets added to the total weight
    TWO1 <- TWO1 + ItemsSorted1[i,3] #the worth of the taken item gets added to the total worth
    ItemsSorted1[i,5] <- ItemsSorted1[i,5] - 1 #deducts 1 from the pick index as the item has been taken
    print(TWE1) #bug check
  } else if (i < nrow(ItemsSorted1)){
    i <- i + 1
  } else {
    break
  }
}
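Note that this still checks the weight limit only at the top of the while loop, after an item has already been added, so the total can overshoot g1 by one item. A minimal sketch (reusing the same variables) of a greedy fill that tests the limit before taking an item:
i <- 1
while (i <= nrow(ItemsSorted1)) {
  # take the current item only if picks remain AND it still fits in the carrier
  if (ItemsSorted1[i,5] > 0 && TWE1 + ItemsSorted1[i,2] <= g1) {
    TWE1 <- TWE1 + ItemsSorted1[i,2]
    TWO1 <- TWO1 + ItemsSorted1[i,3]
    ItemsSorted1[i,5] <- ItemsSorted1[i,5] - 1
  } else {
    i <- i + 1 # item exhausted or too heavy; move on to the next-best item
  }
}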

Grouping conditional linked values within a data.table

I have a data.table with 3 input columns as follows and a fourth column representing my target output:
require(data.table)
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak = c(9,15,15,23,27,27,31,39,49,49,50),
                   Valid_reversal = c(T,T,F,F,T,F,T,F,T,F,F),
                   Target_output = c(5,5,13,5,19,23,19,19,39,42,39))
I'm not sure if this is completely necessary, but I'll try to explain the dataset to hopefully make it easier to see what I'm trying to do. This is a little hard to explain in writing, so please bear with me!
The "Created" column represents the row number location of a price 'peak' (i.e. reversal point) in a time-series of financial data that I'm analysing. The "Next_peak" column represents the corresponding row number (in the original data set) of the next peak which exceeds the peak for that row. e.g. looking at row 1, the "Next_peak" value is 9, corresponding to the same row location as the "Created" level on row 2 of this summarised table. This means that the second peak exceeds the first peak. Conversely, in row 2 where the second peak's data is stored, the "Next peak" value of 15 suggests that it isn't until the 4th peak (i.e. corresponding to the '15' value in the "Created" column) that the second peak's price level is exceeded.
Lastly, the "Valid_reversal" column denotes whether the "Created" and "Next_peak" levels are within a predefined threshold. For example, the "T" in the first row says that the peaks at rows 5 and 9 ("Next_peak") meet this criterion. If I then go to the row whose "Created" value is 9, there is also a "T", meaning that its "Next_peak" value of 15 also meets the criterion. However, when I go to the 4th row, where Created = 15, there is an "F": the next peak does not meet the criterion.
What I'm trying to do is link these 'chains' of valid reversal points and then return the original starting "Created" value. i.e. I want rows 1, 2 and 4 to have a value of '5', indicating that the peaks for these rows were all within a predefined threshold of the original peak at row 5 of the original data set.
Conversely, row 3 should simply return 13, as there were no valid reversals at the "Next_peak" value of 15 relative to the peak formed at row 13.
I can create the desired output with the following code; however, it's not a workable solution, as the number of steps could easily exceed 3 with my actual data sets, where more than 3 peaks can be 'linked' to the same reversal point.
I could do this with a 'for' loop, but I'm wondering if there is a better way, preferably one as vectorised as possible, since the actual data set I'm using contains millions of rows.
Here's my current approach:
Test[Valid_reversal == T, Step0 := Next_peak]
Test[, Step1 := sapply(seq_len(.N), function(x) ifelse(any(!(Created[x] %in% Step0[seq_len(x)])),
                                                       Created[x], NA))]
Test[, Step2 := unlist(ifelse(is.na(Step1),
                              lapply(.I, function(x) Step1[which.max(Step0[seq_len(x-1)] == Created[x])]),
                              Step1))]
Test[, Step3 := unlist(ifelse(is.na(Step2),
                              lapply(.I, function(x) Step2[which.max(Step0[seq_len(x-1)] == Created[x])]),
                              Step2))]
As you can see, while this data set only needs 3 iterations, the number of steps in the approach that I've taken is not definable in advance (as far as I can see). Therefore, to implement this approach, I'd have to repeat Step 2 until all values had been calculated, potentially via a 'while' loop. I'm struggling a little to work out how to do this.
Please let me know if you have any thoughts on how to address this in a more efficient way.
Thanks in advance,
Phil
Edit: I didn't mention above that the "Next_peak" values aren't necessarily monotonically increasing. In the example above that property meant that nafill could be used; however, as the following example / sample output shows, it wouldn't give the correct output in the instance below:
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak = c(27,15,15,19,23,27,42,39,42,49,50),
                   Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
                   Target_output = c(5,9,13,9,9,23,5,31,39,5,5))
Not sure if I understand your requirements correctly, but if so, you can use nafill after Step 1:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#your steps 2, 3, ...
Test[Valid_reversal | is.na(out), out := nafill(out, "locf")]
Edit for the new example: you can use igraph to find the chains:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#steps 2, 3, ...
library(igraph)
g <- graph_from_data_frame(Test[Valid_reversal | is.na(out)])
DT <- setDT(stack(clusters(g)$membership), key="ind")[,
ind := as.numeric(levels(ind))[ind]][,
root := min(ind), values]
Test[Valid_reversal | is.na(out), out := DT[.SD, on=.(ind=Created), root]]
Just for completeness, here is a while loop version:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#step 2, 3, ...
while(Test[, any(is.na(out))]) {
Test[is.na(out), out := Test[.SD, on=.(Next_peak=Created), mult="first", x.out]]
}
Test
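Whichever variant you use, you can sanity-check the result against the expected column with a one-liner:
Test[, all(out == Target_output)]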

Excel or R: writing code to automate filtering of non-oscillatory changes in data

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which genes in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyse whether a gene is cycling over the data set, I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set (in the picture, the top section is an example spreadsheet with a numerical value at each time point, and the bottom section is an example of what the time comparisons looked like across the data).
I then imposed an advanced filter with multiple AND / OR criteria that followed this logic (source: Jeeped):
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set (in the photograph, A on the left is an example of the original filter, and B shows the filter sliding across the data).
The filters produced multiple sheets of genes that fit my criteria. To figure out how many unique genes met the criteria, I merged Column A (Gene_IDs) of all the sheets and removed duplicates to produce a list of unique gene IDs.
The process took me nearly 3 hours due to the size of each spreadsheet (37 columns, 27,000 rows before filtering). Can this process be expedited? And if so, can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R, but a common one that is easy to reason about is to combine the individual tests element-wise with the | ("OR") operator, so a row passes if any of its columns meets the test, and then combine that with an & ("AND") for the test on column a. (Note that R's any() function would collapse everything into a single TRUE/FALSE rather than working row by row, so the vectorised | is what you want here.) There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
  a = 1:100,
  b = 1:100,
  c = 51:150,
  d = 101:200,
  value = rep("a", 100)
)
df[df$a > 2 & (df$b > 5 | df$c > 5 | df$d > 5), "value"] <- "Test Passed!"
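To scale that idea up to the actual criteria in the question (column A >= 2.0 AND any later column <= 0.5) across all rows at once, something along these lines might work; the ratios matrix and gene_ids vector below are made-up stand-ins for the real comparison sheet:
set.seed(1)
ratios <- matrix(runif(27000 * 36, 0, 3), nrow = 27000) # fake fold-change data
gene_ids <- paste0("gene_", seq_len(nrow(ratios)))      # fake gene IDs

# keep rows where the first column is >= 2.0 and at least one later column is <= 0.5
keep <- ratios[, 1] >= 2.0 & rowSums(ratios[, -1] <= 0.5) > 0
cycling_genes <- unique(gene_ids[keep])
This should run in seconds even at the full 27,000 x 37 size, compared with hours of manual filtering.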

What is the fastest way to lookup a large number of values using R?

I have a list of over 1,000,000 numbers. I have a lookup table that has a range of numbers and a category. For example, 0-200 is category A, 201-650 is category B (the ranges are not of equal length).
I need to simply iterate over the list of 1,000,000 numbers and get a list of the 1,000,000 corresponding categories.
EDIT:
For example, the first few elements of my list are 100, 125.5, 807.5, 345.2, and it should return something like 1, 1, 8, 4 as categories. The logic for the mapping is implemented in a function, categoryLookup(cd), and I'm using the following command to get the categories:
cats <- sapply(list.cd, categoryLookup)
However, while this seems to work quickly on lists of size up to 10,000, it is taking a lot of time for the whole list.
What is the fastest way to do the same? Is there any form of indexing that can help speed up the process?
The numbers:
numbers <- sample(1:1000000)
groups:
groups <- sort(rep(letters, 40000))
lookup:
categories <- groups[numbers]
EDIT:
If you don't yet have the vector of "groups" you can create it first.
Assume you have a data frame with the range info:
ranges <- data.frame(group = c("A", "B", "C"),
                     start = c(1, 300001, 600001),
                     end = c(300000, 600000, 1000000)
)
ranges
  group  start   end
1     A      1 3e+05
2     B 300001 6e+05
3     C 600001 1e+06
# if groups are sorted and don't overlap:
groups <- rep(ranges$group, (ranges$end-ranges$start)+1)
Then continue as before
categories <- groups[numbers]
EDIT: as @jbaums said, you will have to add +1 to (ranges$end - ranges$start), and the starting coordinate should be 1 rather than 0 (both already edited in the example above).
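If the values aren't clean integer indices (the question's examples include 125.5), base R's findInterval is a vectorised alternative that avoids building the full groups vector; a sketch using the same ranges table:
values <- c(100, 125.5, 807.5, 345.2)
bins <- findInterval(values, ranges$start) # index of the range each value falls into
categories <- ranges$group[bins]
This assumes the range starts are sorted in increasing order and the ranges have no gaps.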
