I have the following data structure:
Scope,Metric ID,Item ID,System,Color
TRUE,A1,123,A,Red
FALSE,A1,123,B,Red
FALSE,B1,234,C,Red
TRUE,B1,234,A,Red
FALSE,B1,415,A,Red
I'd like to group by Scope, filter on TRUE and get the unique list of Items, then count these Items and subtract from a total unique count for the Color = Red.
So, in the example above, I have 3 unique items for Color = Red and I have 2 unique items with Scope = TRUE, so the result should say 3 - 2 = 1.
Because of the data structure, simple filtering won't help. I realize I need to use a complex LOD syntax, but after having tried them for a few hours, I find them rather confusing.
Does anyone have an idea how to write an LOD expression to give me the desired count? Thanks!
Did you try using three calculated fields like this, taking a count distinct of the first two?
1:
if [Color]='Red' then [Item ID] end
2:
if [Scope]='TRUE' then [Item ID] end
3:
Subtract the count distinct of field 2 from the count distinct of field 1, i.e. COUNTD([1]) - COUNTD([2]).
For the sample data this gives 3 - 2 = 1.
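To check the arithmetic outside Tableau, here is a minimal Python sketch of the same logic on the sample rows from the question. The function name `unscoped_item_count` and the tuple layout are assumptions made for illustration. Unlike calculated field 2 above, the sketch also restricts the scoped set to the chosen color, which is a guess at the intent.

```python
# Sample rows from the question: (Scope, Metric ID, Item ID, System, Color)
ROWS = [
    (True, "A1", 123, "A", "Red"),
    (False, "A1", 123, "B", "Red"),
    (False, "B1", 234, "C", "Red"),
    (True, "B1", 234, "A", "Red"),
    (False, "B1", 415, "A", "Red"),
]


def unscoped_item_count(rows, color):
    """Distinct items of `color` minus distinct items of `color` whose Scope is TRUE."""
    # Sets deduplicate Item IDs, which mirrors a count distinct.
    all_items = {item for _, _, item, _, c in rows if c == color}
    scoped_items = {item for scope, _, item, _, c in rows if c == color and scope}
    return len(all_items) - len(scoped_items)


print(unscoped_item_count(ROWS, "Red"))  # 3 - 2 = 1
```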
I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column: 1 (which means the student was accepted) and 0 (which means the student was not accepted). I want to find the percentage of accepted students.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0 (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided 44,224 by 119,390. This is fine and gets me the value I was looking for, but I would really like to know how to do this with R code, since I'm sure there is a way that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need to take the column mean.
mean_accepted <- mean(college$accepted)
You could first sum the column, then divide by the total number of values in the column:
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)
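The same idea, averaging a condition to get a proportion, can be sketched in Python as follows. The function name `percent_accepted` and the four-element sample list are hypothetical. Missing values are represented as `None` and dropped, similar in spirit to `na.rm = TRUE`.

```python
def percent_accepted(accepted):
    """Percentage of non-missing entries equal to 1."""
    # Drop missing entries first, as na.rm = TRUE would.
    values = [v for v in accepted if v is not None]
    if not values:
        raise ValueError("no non-missing values to average")
    # Counting the entries that meet the condition and dividing by the
    # total is the same as taking the mean of a logical vector.
    hits = sum(1 for v in values if v == 1)
    return 100 * hits / len(values)


print(percent_accepted([1, 0, 0, 1, None]))  # 50.0
```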
I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
user_id 1 2
1: N/A 300 user_id154
2: N/A user_id301 user_id125040
3: N/A 302 user_id2
For instance, I want to obtain the following
**user_id**
user_id154
user_id301
user_id2
Please bear in mind that I am new to this kind of data formatting in R (most of the work I do does not involve cleaning JSON files) and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes or my boss will consider it too slow.
Hopefully it is understandable
I'm sure someone will provide a more elegant solution, but this does the trick:
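dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
This first combines the columns named 1 and 2 row by row (the backticks are needed so that R reads them as column names rather than as the numbers 1 and 2), then for each row looks for the first user_id in the combined value.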
(Requires the stringr package)
For every row in your table, grep the first value that contains "user_id" and put the result into the user_id column.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
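As a language-neutral illustration of the row-wise search, here is a small Python sketch that returns the first cell in each row containing a user_id token. The function name `first_user_id` and the in-memory list of tuples are assumptions. The rows are the three sample rows from the question.

```python
import re

# A user_id token: the literal prefix followed by at least one digit.
USER_ID = re.compile(r"user_id[0-9]+")


def first_user_id(row):
    """Return the first cell in `row` that contains a user_id token, else None."""
    for cell in row:
        text = str(cell)
        if USER_ID.search(text):
            return text  # whole cell, like grep(..., value = TRUE)[1]
    return None


rows = [
    ("N/A", "300", "user_id154"),
    ("N/A", "user_id301", "user_id125040"),
    ("N/A", "302", "user_id2"),
]
print([first_user_id(r) for r in rows])
# ['user_id154', 'user_id301', 'user_id2']
```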
I'm new here and diving into R, and I'm encountering a problem while trying to solve a knapsack problem.
For optimization purposes I wrote a dynamic program in R, however, now that I am at the point of returning the items, which I succeeded in, I only get the binary numbers saying whether the item has been selected or not (1 = yes). Like this:
Select
[1] 1 0 0 1
However, now I would like the Select function to return the names of values instead of these binary values. Underneath I created an example of what my problem looks like.
This would be the data and a related data frame.
items <- c("Glasses","gloves","shoes")
grams <- c(4,2,3)
value <- c(100,20,50)
data <- data.frame(items,grams,value)
Now, I created various functions, with the final one clarifying whether a product has been selected by 1 (yes) or 0 (no). Like above. However, I would really like for it to return the related name of the item. Is there a manner to go around this by linking back to the dataframe created?
So that, in case all products are selected, instead of
Select
[1] 1 1 1
it would say
Select
[1] Glasses gloves shoes
I believe I would have to create a new function. But as I mentioned, is there a good way to refer back to the data frame to take related values from another column in the data frame in case of a 1 (yes)?
I really hope my question is more clear now and someone can direct me in the right direction.
Best, Berber
Let's say your binary vector is
idx <- c(1, 0, 1)
Then
items[as.logical(idx)]
will give you the names of the selected items, and
items[!as.logical(idx)]
will give you the names of the unselected items.
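The same mask-based lookup can be sketched in Python. The function name `selected_names` is hypothetical, and the length check is an added assumption so that a selection vector of the wrong length fails loudly.

```python
items = ["Glasses", "gloves", "shoes"]


def selected_names(names, flags):
    """Return the names whose matching 0/1 flag is 1."""
    if len(names) != len(flags):
        raise ValueError("names and flags must have the same length")
    # Pair each name with its flag and keep the flagged ones.
    return [name for name, flag in zip(names, flags) if flag == 1]


print(selected_names(items, [1, 0, 1]))  # ['Glasses', 'shoes']
print(selected_names(items, [1, 1, 1]))  # ['Glasses', 'gloves', 'shoes']
```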
I have a vector of binary variables which state whether a product is on promotion in the period. I'm trying to work out how to calculate the duration of each promotion and the duration between promotions.
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
So in other words: if promo.flag is same as previous period then running.total + 1, else running.total is reset to 1
I've tried playing with apply functions and cumsum but can't manage to get the conditional reset of running total working :-(
The output I need is:
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
rolling.sum = c(1,2,1,1,1,2,1,2,3,1,1,2,0)
Can anybody shed any light on how to achieve this in R?
It sounds like you need run length encoding (via the rle command in base R).
unlist(sapply(rle(promo.flag)$lengths,seq))
Gives you a vector 1 2 1 1 1 2 1 2 3 1 1 2 1. Not sure what you're going for with the zero at the end, but I assume it's a terminal condition and easy to change after the fact.
This works because rle() returns a list with two components, lengths and values, where lengths holds how many times each value is repeated in a row. seq, when fed a single integer, gives you a sequence from 1 to that number. sapply then calls seq on each number in rle()$lengths in turn, generating a list of the mini sequences, and unlist turns that list into a vector.
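The run-length idea can also be sketched in Python, where itertools.groupby plays the role of rle(). The function name `run_positions` is an assumption. The input is the promo.flag vector from the question.

```python
from itertools import groupby


def run_positions(flags):
    """For each element, its 1-based position within its run of equal values."""
    positions = []
    # groupby yields one group per run of consecutive equal values.
    for _, run in groupby(flags):
        run_length = len(list(run))
        positions.extend(range(1, run_length + 1))
    return positions


promo_flag = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
print(run_positions(promo_flag))
# [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2, 1]
```

As with the R one-liner, the last element comes out as 1 rather than the 0 shown in the question.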
I want the total count of repeated values in a DataTable.
As shown in the image, the total count should be 2, because the only repeated values are 1 and 2.
I tried as given below:
DataView dv = new DataView(dtTemp);
int iRowCount = dv.ToTable(true, "Column1").Rows.Count;
but it returns 3 which is incorrect.
Does anyone know how to do it?
I don't want to use a loop, because the data table sometimes contains 4000-5000 rows, and looping over them would take much longer to get the total count.
Thanks.