R/Power-Query: Replace value with matching - r

I am trying to clean up some data in a huge dataset.
One column holds values for the sales amount. Example values could look like this:
Clean Data:
Sales Potential
230
120
300
However, at some points something like this appears:
Dirty Data
0, 0, 0, 0, 0
4, 0, 0, 0
0, 0, 480
0, 200, 0
In the first case of the dirty data, the cell should contain only a zero: 0
In all other cases, if there is a non-zero number, I would like to extract it and either replace the cell with this value or add a new cleaned column.
So the dirty data cleaned up:
Cleaned Data:
0
4
480
200
My approach was to use regular expressions in R, as I am loading the data into Power BI using Power Query.
I tried to find a pattern that extracts the value I am looking for and places it in a new column. However, my results do not look right.
Is there maybe a much simpler approach to achieve this in R?
Code so far:
library(stringr)
OutputRegEx <- data.frame(MyDataset)
Splitter = function(x) substr(str_extract(x,'[1-9]'),1,7)
OutputRegEx[["RegExAuswertung"]] <- apply(OutputRegEx[43],1, function(x) Splitter(x) )

In Power Query, insert a custom column with the formula below:
=List.Max(List.Transform(Text.Split(Text.From([Sales Potential]),","), each Number.FromText(_)))
The formula splits the text on commas into a list, converts each list item from text to a number, then takes the maximum of the list.

This R solution seems to do what you want:
SalesPotential <- c("0, 0, 0, 0, 0", "4, 0, 0, 0","0, 0, 480","0, 200, 0")
library(stringr)
str_extract(gsub(",", "", SalesPotential), "(?=(0\\s){4})\\d+|[1-9]+(0{1,})?")
[1] "0" "4" "480" "200"
This solution first removes the commas with gsub(",", "", SalesPotential) and passes the edited vector to str_extract. The regex then defines two patterns: one for values that contain no numbers other than 0, and another for values that start with non-0 digits and may have one or more 0s at the end.
If you want to have clean numbers, convert to numeric:
as.numeric(str_extract(gsub(",", "", SalesPotential), "(?=(0\\s){4})\\d+|[1-9]+(0{1,})?"))
[1] 0 4 480 200
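Note that the second pattern stops at the first digit that follows a run of zeros, so a value such as 105 would come back as "10". A minimal sketch of an alternative that avoids the regex, assuming each cell holds non-negative numbers with at most one non-zero entry: split on the commas, convert to numeric, and keep the largest value (the same idea as the List.Max formula in the Power Query answer). The helper name extract_sales and the extra "230" test value are hypothetical.

```r
SalesPotential <- c("0, 0, 0, 0, 0", "4, 0, 0, 0", "0, 0, 480", "0, 200, 0", "230")

# Hypothetical helper: split each cell on commas, convert the pieces
# to numbers, and keep the largest one.
extract_sales <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts, function(p) max(as.numeric(trimws(p))), numeric(1))
}

extract_sales(SalesPotential)
## [1]   0   4 480 200 230
```

Already clean cells such as "230" pass through unchanged because they split into a single piece.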

You can achieve the desired result in Power Query itself, either with the M formula language or through the GUI.
Let me describe the simplest approach.
If I understand correctly, the column has some clean numbers and some comma-delimited numbers.
So what you do is:
Split the column by comma at each occurrence.
You will get n+1 columns if the maximum number of commas in any cell is n.
Now create a conditional column that checks these columns for a number greater than zero and returns it.
By doing so, you will get the non-zero number in that calculated column for dirty data and the same number for the clean data.
After that, you can delete all the comma-delimited columns and keep only the conditional column.
The formula should be as follows:
if delcol1 <> 0 then delcol1 else if delcol2 <> 0 then delcol2 else if ... else if delcoln <> 0 then delcoln else 0
This is the easiest way out of the problem that I can think of.
However, there are other ways to get the same answer.

Related

DAX - formula recursion

I am struggling to recreate the following Excel logic in DAX:
IF(OR(IFERROR(--OFFSET(F2,,,-4)>0,)),"",D3)
Given the index column and the Val column: the index starts at 0, and Val is an integer that keeps increasing, except where it is 0. The preserveval column is calculated as follows.
Logic:
1. From top to bottom, when Val = 0, preserveval = 0.
2. When Val > 0, preserveval = Val, and the following four consecutive values are set to blank.
3. Return to step 1.
Examples I created in excel:
Any tips & solutions would be much appreciated. Thank you for your time.
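To make the intended logic concrete, here is a minimal sketch in R rather than DAX. It is only an illustration of the rule described above, not a DAX solution, and it assumes that rows inside the four-row gap stay blank even when Val is 0 there (which is how the Excel formula behaves). The function name preserve_val and the sample input are hypothetical.

```r
# Hypothetical illustration of the preserveval rule, with NA as "blank".
preserve_val <- function(val, gap = 4) {
  out <- rep(NA_real_, length(val))
  skip <- 0                      # rows still to be left blank
  for (i in seq_along(val)) {
    if (skip > 0) {              # inside the gap after a positive value
      skip <- skip - 1
      next
    }
    out[i] <- val[i]             # keep Val (0 or positive)
    if (val[i] > 0) skip <- gap  # blank out the next `gap` rows
  }
  out
}

preserve_val(c(0, 1, 2, 3, 4, 5, 6, 7))
## [1]  0  1 NA NA NA NA  6 NA
```

The running skip counter is what makes this awkward in DAX: each row depends on the results of the rows above it.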

Is there a way to create a data frame from two vectors to find all of the possible combinations between the vectors? (In r)

I have two identical vectors, S and T.
S <- seq(from = 0, to = 80, by = 2)
T is exactly the same. I am trying to create a data frame in which column one holds the S values (0 through 80) and column two holds the T values (0 through 80). I want row 1 to be 0, 0; row 2 to be 0, 2; row 3 to be 0, 4; and so on, so that row 42 would be 2, 0. I believe this is possible with a for loop, but I am struggling to make it work. Any advice would greatly help. I understand that there would be close to, if not over, 1000 rows, but I feel like there is a simple way to accomplish this.
Don't label variables T or t in R. T is a popular abbreviation of TRUE, and t is a function (transpose).
expand.grid() is probably what you're looking for.
S <- seq(from = 0, to=80, by=2)
TT <- S
expand.grid(S,TT)
Yes, it's big.
dim(expand.grid(S,TT))
[1] 1681 2
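One detail worth checking: expand.grid() varies its first argument fastest, so expand.grid(S, TT) gives row 2 as 2, 0 rather than the 0, 2 the question asks for. A minimal sketch that should reproduce the requested ordering, assuming it is acceptable to build the grid with the arguments swapped and then reorder the columns (the name grid is hypothetical):

```r
S <- seq(from = 0, to = 80, by = 2)
TT <- S

# TT is listed first so that it varies fastest; the columns are then
# put back in S, TT order.
grid <- expand.grid(TT = TT, S = S)[, c("S", "TT")]

head(grid, 3)
##   S TT
## 1 0  0
## 2 0  2
## 3 0  4
```

With 41 values in each vector, row 42 is the first row where S moves on to 2.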

R: how to set an if statement condition to only be triggered if whole column is equal to a value?

So let's say I have a list of data frames. Within each data frame, there is a column on which I want to base a new dummy column. This is how it works. For simplicity, let's just use vectors instead of a data frame in the example.
vect<-c(0, 0, 100, 100, 0, 0)
In this case, the dummy column created would be as follows:
dummy_vect<- c(0, 0, 0, 0, 1, 1)
The dummy is 1 only at the indexes after the last non-zero value in vect. I have the code written to do this and it works without any issues. The big issue I'm running into occurs in the rare instance when all of vect is 0s:
vect<-c(0,0,0,0,0,0)
For the context of the problem, when this case occurs, I need the dummy columns to be 1 at every instance.
How would I translate this into code? So if every value in vect is 0, return all 1s in the dummy column, else just do the code I've written that works for other cases. Any help is greatly appreciated! It might be something simple and I'm just really over thinking it, but I don't know how to set the if condition up properly at all
Take absolute values, reverse the input and take the cumulative sum. Finally change the 0 values to TRUE, reverse and convert to numeric.
vect <- c(0, 0, 100, 100, 0, 0)
+rev(cumsum(rev(abs(vect))) == 0)
## [1] 0 0 0 0 1 1
+rev(cumsum(rev(abs(0*vect))) == 0) # 0*vect is all 0 input
## [1] 1 1 1 1 1 1
Just found a condition for an if statement that looks as though it is working.
if (all(df$x == 0)) {
  df$dummy_col <- 1
} else {
  # The code that does the process for all other cases...
}

finding a y value whose x value is closest to zero in R vectors

I have two columns of numbers. The first column is called ddd and the second post. You can import my data into RStudio this way:
id <- "0B5V8AyEFBTmXM1VIYUYxSG5tSjQ"
Points <- read.csv(paste0("https://docs.google.com/uc?id=",id,"&export=download"))
My question is: first, how can I find the value of post where ddd is 0, and second, if no ddd is exactly 0, how can I find the value of post where ddd is closest to 0? (So I need R to do both checks for me.)
I have used the following R code, which doesn't work:
Points$post[Points$ddd == 0]
If you have a Points data frame with two columns, post and ddd, zero or near zero can be found with which.min(abs(Points$ddd)), which returns the index, so Points$post[which.min(abs(Points$ddd))] should get you there.
Note that which.min returns only the first match, so you will miss rows if several ddd values are zero or tie for the minimum.
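A minimal sketch of both variants on made-up data (the values below are hypothetical, not the asker's file), assuming you want every tied row rather than just the first one:

```r
# Hypothetical data: 0.5 and -0.5 are equally close to zero.
Points <- data.frame(ddd  = c(-3, 0.5, -0.5, 2),
                     post = c(10, 20, 30, 40))

# First closest row only
first_hit <- Points$post[which.min(abs(Points$ddd))]

# All rows that tie for the smallest absolute ddd
dist <- abs(Points$ddd)
all_hits <- Points$post[dist == min(dist)]

first_hit
## [1] 20
all_hits
## [1] 20 30
```

If ddd contains an exact 0, both versions return the matching post value(s), so no separate check for zero is needed.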

R: Create binary data from a data frame

I need some advice on the following problem:
I have a data frame with two columns, one containing the date, the other the frequency of an event.
Now I want to add a third column to this data frame, which should contain binary data: 1 for days with a frequency of 100 or higher, 0 for the lower ones.
Does anyone have an idea how to do this in a smart way (I'm afraid of writing it by hand ;-)? Thanks in advance for your answer!
data$newcol = as.integer(data$freq >= 100)
alternatively
data$newcol = ifelse(data$freq >= 100, 1, 0)
alternatively
data$newcol = 0
data$newcol[data$freq >= 100] = 1
df$freq.gt.100 = as.integer(df$freq >= 100)
The bit inside brackets evaluates to TRUE or FALSE which can be converted to 1 or 0 via as.integer.
There's nothing to be "afraid" of: you can test the right-hand side of the expression on its own to check it works and only when you are happy with this do you add it as a new column to the original data.
EDIT: didn't see the above answer as I was creating this one and had a call to take!
