How to assign weights to strings in WHERE clause? - teradata

The table is
id  col1    col2
1   former  good
2   future  fair
3   now     bad
4   former  good
...
GOAL : I need to SELECT only those rows that have a cumulative score higher than 0.8
1) If col1 = 'former' THEN the row gets 0.2 points; if 'now' THEN 0.7; if 'future' THEN 0.3
2) If col2 = 'good' THEN the row gets 0.8 points; if 'bad' THEN 0.1; if 'fair' THEN 0.5
Therefore I need to assign numeric values in the WHERE clause. I want to avoid changing values in the SELECT because the user should see the labels ('good', 'now', etc.), not the numbers.
How can I do this?
SELECT *
FROM mytable
WHERE ?

Use a CASE to assign a weight based on your logic:
WHERE
CASE col1
WHEN 'former' THEN 0.2
WHEN 'now' THEN 0.7
WHEN 'future' THEN 0.3
ELSE 0
END +
CASE col2
WHEN 'good' THEN 0.8
WHEN 'bad' THEN 0.1
WHEN 'fair' THEN 0.5
ELSE 0
END > 0.8

If col1 and col2 held numeric scores directly, this would be as simple as SELECT * FROM myTable WHERE col1 + col2 > 0.8.
But provide us the real structure of the table.
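For reference, the weight-and-filter rule from the question can be checked outside SQL as a plain lookup. This is a minimal Python sketch, purely illustrative; the dictionaries simply mirror the CASE branches:

```python
# Weight tables mirroring the CASE branches in the answer above
COL1_W = {"former": 0.2, "now": 0.7, "future": 0.3}
COL2_W = {"good": 0.8, "bad": 0.1, "fair": 0.5}

rows = [(1, "former", "good"), (2, "future", "fair"),
        (3, "now", "bad"), (4, "former", "good")]

# Keep rows whose combined weight exceeds 0.8 (unknown labels score 0,
# like the ELSE 0 branches)
selected = [r for r in rows
            if COL1_W.get(r[1], 0) + COL2_W.get(r[2], 0) > 0.8]
print([r[0] for r in selected])
```

Only rows 1 and 4 (0.2 + 0.8 = 1.0) clear the 0.8 threshold; the labels themselves never appear as numbers in the output, matching the requirement in the question.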

Related

Kusto calculation with Loop

I am trying to make an iterative calculation, but it seems that it is not possible. Does someone have any clue whether there is a workaround?
What my table looks like:
Column1  Column2  Todo
A        B        0.5
A        C        -0.3
A        C        -0.3
What I want to see:
Column1  Column2  Todo  Calculated
A        B        0.5   1.0
A        C        -0.3  0.7
A        C        -0.3  0.6
The starting variable is 0.5, so it adds 0.5 in the first row. In the second row it subtracts from the result of the first row. If the calculation falls below zero, the result has to be set to 0.0.
Would be great to have help here.
Thanks in advance
Based on what I understood, it looks like you're trying to get the cumulative sum of a column. You could use row_cumsum() after setting the right sort order:
let T = datatable(Column1:string, Column2:string, Todo:double)
[
"A", "B", 0.5,
"A", "C", -0.3,
"A", "C", -0.3,
];
T
| sort by Column1
| serialize Calculated = row_cumsum(Todo) + 0.5
I didn't get how the last row in your example ended up as 0.7 - 0.3 = 0.6? Shouldn't it be 0.4?
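For reference, the reset-at-zero rule described in the question is not expressible with a plain row_cumsum(), since each step depends on the clamped result of the previous one. A minimal iterative sketch (in Python, purely to illustrate the intended logic):

```python
def clamped_cumsum(todo, start=0.5):
    # Running total over Todo, reset to 0.0 whenever it would go
    # negative, as described in the question
    out, total = [], start
    for t in todo:
        total = max(total + t, 0.0)
        out.append(total)
    return out

print(clamped_cumsum([0.5, -0.3, -0.3]))
```

Note the last value comes out near 0.4, not 0.6, which is consistent with the remark about the example above.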

Insert condition (column A value) into function based on a value in a different column (B)

This is a follow-up to a question I previously asked (Replace only certain values in column based on multiple conditions). For context I'm including some of the same information.
I have a large dataframe that contains many columns, but the relevant ones are: ID (this is number assigned to subject), Time (time at which this subject's measurement was taken) and Concentration. A very simplified example would be:
df <- data.frame( ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
Concentration=c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
"XXX",0.6,0.1,0.1,"XXX"),
Time=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5))
I would like to replace only the "XXX" values in column Concentration based on the following conditions:
when the value in column Time is less than or equal to timeX, "XXX" should become 0
when the value in column Time is greater than timeX, "XXX" should be replaced with the word "Missing", unless two consecutive "XXX" values appear for a single subject (ID) for Time > timeX; then the first consecutive "XXX" should be replaced with 0.05 and the second (or all following "XXX" values, if there are more) should be replaced with the word "Missing".
It's very important that the IDs are somehow separated here, because there could be an "XXX" as the final Concentration of one ID and as the first Concentration of the next ID, and I do not want that to be read as two consecutive "XXX" values for a single ID.
The solution I have, for when we assume timeX=3 is:
require(tidyverse)
df <- tibble(df) %>%
  mutate(Concentration = as.character(Concentration),
         Concentration_Original = Concentration) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Time <= 3, "0", Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Concentration == lead(Concentration),
                                "0.05", ifelse(Concentration == 'XXX',
                                               "Missing", Concentration))) %>%
  replace_na(list(Concentration = "Missing")) %>%
  ungroup()
To make the code more flexible and more importantly so that it doesn't require the user to manually check what the time cut off point should be and then manually insert it, I've been trying to make the code more automatic.
I would like to replace Time <= 3 with the following condition for timeX:
timeX is the value in column Time, for that specific subject ID, at which the value in column Concentration is highest. So basically the condition should be that timeX is the Time at which the Concentration achieves its maximum value.
For example: For ID 1 in my df, the highest concentration would be 0.7 and that concentration is achieved at Time = 3 so the value 3 should be inserted as timeX value.
Here are some thoughts/suggestions that might be helpful.
First, if you wish to look at the maximum value of Concentration, I would not have this column be of character type. Instead, I would make it numeric and use NA for missing values. The first mutate sets that up.
After grouping, you can use mutate and case_when for your various situations. You can access the Time of maximum concentration through:
Time[which(Concentration == max(Concentration, na.rm = TRUE))]
(removing the missing values).
If the Concentration is missing, and Time is less than the Time of maximum concentration, then change it to 0.
In the second case, if the lead (i.e., the subsequent row) is also missing, then change it to .05.
Otherwise, do not change Concentration.
Depending on further analyses and presentation, you can use "Missing" as a text label for missing data.
Edit: Based on the OP's comment, it appears that only the first "XXX" after the max time should be replaced with .05 for concentration, and all the following "XXX" values treated as missing. To achieve this, add:
!is.na(lag(Concentration, default = 0))
as a condition for determining whether a value should be .05. The logic is: if the previous row's value is not NA, but the following value is NA, after the max time, then change to .05.
Here is the modified code:
library(tidyverse)
df %>%
  mutate(Concentration = ifelse(Concentration == "XXX", NA_character_, Concentration),
         Concentration = as.numeric(Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration_New = case_when(
    is.na(Concentration) & Time < first(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) ~ 0,
    is.na(Concentration) & Time > last(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) &
      is.na(lead(Concentration, default = 0)) & !is.na(lag(Concentration, default = 0)) ~ .05,
    TRUE ~ Concentration
  ))
Output
ID Concentration Time Concentration_New
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 0
2 1 0.3 2 0.3
3 1 0.7 3 0.7
4 1 0.6 4 0.6
5 1 NA 5 NA
6 2 NA 1 0
7 2 0.8 2 0.8
8 2 0.3 3 0.3
9 2 NA 4 0.05
10 2 NA 5 NA
11 3 NA 1 0
12 3 0.6 2 0.6
13 3 0.1 3 0.1
14 3 0.1 4 0.1
15 3 NA 5 NA
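For reference, the replacement rules can be checked against a plain iterative sketch. This is Python, illustrative only; recode and its arguments are hypothetical names, with None standing in for "XXX" and tmax for the per-subject time of maximum concentration:

```python
def recode(times, conc, tmax):
    # conc uses None where the data had "XXX"; tmax is the Time at which
    # the maximum Concentration occurs for this subject
    out = []
    for i, (t, c) in enumerate(zip(times, conc)):
        if c is not None:
            out.append(c)                 # observed value: keep as-is
        elif t <= tmax:
            out.append(0.0)               # missing at or before tmax -> 0
        elif (i + 1 < len(conc) and conc[i + 1] is None
              and (i == 0 or conc[i - 1] is not None)):
            out.append(0.05)              # first of >= 2 consecutive missing
        else:
            out.append("Missing")         # any other missing after tmax
    return out

# Subject 2 from the example: max concentration 0.8 at Time 2
print(recode([1, 2, 3, 4, 5], [None, 0.8, 0.3, None, None], 2))
```

This reproduces the 0 / 0.05 / "Missing" pattern of the worked output for subject 2 (with the final value as "Missing", per the wording in the question, where the case_when above leaves it NA).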

How to allow user inputs in a min-max range in R?

Good afternoon.
My question is very simple.
I want to retrieve n user inputs (a vector of length n). Only values between 0 and 1 are accepted.
I know how to retrieve values with the scan function, but I don't know how to force users to enter only values in the [min, max] interval.
Thank you for your help!
Code :
x <- scan(,n=3)
One way can be using a while loop:
stayInLoop <- TRUE
N <- 3  # number of elements in the vector
while (stayInLoop) {
  print("Please insert x")
  x <- scan(, n = N)  # readLines(, n = N)
  if (any(x < 0) | any(x > 1)) {
    print("Re-enter the values, as valid values can be between 0 and 1")
    x <- scan(, n = N)
  }
  stayInLoop <- any(x < 0) | any(x > 1)  # keep looping while any value is out of range
}
[1] "Please insert x"
1: 1
2: 2
3: 3
Read 3 items
[1] "Re-enter the values, as valid values can be between 0 and 1"
1: 0.2
2: 0.2
3: 0.4
Read 3 items
> x
[1] 0.2 0.2 0.4
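The validate-and-reprompt pattern is language-agnostic. A minimal Python sketch of the same loop, with an attempts list standing in for the interactive scan() input shown in the transcript:

```python
def read_until_valid(attempts, n=3, lo=0.0, hi=1.0):
    # 'attempts' simulates successive user entries (scan() calls in R);
    # keep consuming entries until every value falls inside [lo, hi]
    for vec in attempts:
        if len(vec) == n and all(lo <= v <= hi for v in vec):
            return vec
    raise ValueError("no valid input supplied")

# Same session as the transcript: first entry rejected, second accepted
print(read_until_valid([[1, 2, 3], [0.2, 0.2, 0.4]]))
```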

How to retrieve column in sequence on SQLite using R

I use sensor data that provides wavelengths from 100-300 nm in steps of 1, i.e.: 100, 101, 102, 103, ..., 300. I keep the data in SQLite using R, in a table named data
> data
obs 100 101 102 103 104 ... 300
1 0.1 0.1 0.9 0.1 0.2 0.5
2 0.8 1.0 0.9 0.0 1.0 0.4
3 0.7 0.8 0.3 0.8 0.5 0.2
4 0.7 0.1 0.2 0.4 0.7 0.6
5 0.9 0.4 0.6 0.6 0.6 0.4
6 0.7 0.1 0.6 0.7 0.9 0.9
I am interested in retrieving only every fourth column, starting at 100. That means: 100, 104, 108, ...
I tried sqldf("select 100, 104, 108, ... from data"), but spelling out every column by hand does not seem efficient. Can someone help using R? Thanks!
You can use paste() inside of sqldf to make things like this easier. So the basic idea would be:
sqldf(paste("select",
            paste0("`", seq(100, 300, 4), "`", collapse = ", "),
            "from data"))
Columns or tables with numeric names typically need to be surrounded with backticks. So that's why I've adjusted the statement to find `100`, instead of 100.
The full statement (simplified above) looks like this:
[1] "select `100`, `104`, `108`, `112`, `116`, `120`, `124`, `128`, `132`, `136`,
`140`, `144`, `148`, `152`, `156`, `160`, `164`, `168`, `172`, `176`,
`180`, `184`, `188`, `192`, `196`, `200`, `204`, `208`, `212`, `216`,
`220`, `224`, `228`, `232`, `236`, `240`, `244`, `248`, `252`, `256`,
`260`, `264`, `268`, `272`, `276`, `280`, `284`, `288`, `292`, `296`,
`300` from data"
sqldf loads the gsubfn package which provides fn$ for dealing with string interpolation. fn$ can preface any function invocation so, for example, use fn$sqldf("... $var ...") and then $var is replaced with its value.
Note that select 100 selects the number 100 and not the column named 100 so we use select [100] instead.
cn <- toString(sprintf("[%d]", seq(100, 300, 4))) # "[100], [104], ..."
fn$sqldf("select $cn from data")
or if we want to create the SQL statement in a variable and then run it:
sql <- fn$identity("select $cn from data")
sqldf(sql)
Note that this is pretty easy to do in straight R as well:
data[paste(seq(100, 300, 4))]
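The trick of generating the SELECT list in the host language works outside R, too. A minimal Python sketch of the same string-building step (using standard double-quoted SQL identifiers, which SQLite also accepts, alongside backticks and brackets):

```python
# Build the SELECT list for every fourth wavelength column, 100 through 300
cols = ", ".join('"%d"' % c for c in range(100, 301, 4))
sql = "select %s from data" % cols
print(sql[:60] + " ...")
```

The generated statement covers all 51 columns from "100" to "300" without writing any of them out by hand.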

How can I normalize column values in a data frame for all rows that share the same ID given in another column?

I have a dataframe that looks like this
ID value
1 0.5
1 0.6
1 0.7
2 0.5
2 0.5
2 0.5
and I would like to add a column with normalization for values of the same ID like this: norm = value/max(values with same ID)
ID value norm
1 0.5 0.5/0.7
1 0.6 0.6/0.7
1 0.7 1
2 0.5 1
2 0.3 0.3/0.5
2 0.5 1
Is there an easy way to do this in R without first sorting and then looping?
Cheers
A solution using basic R tools:
data$norm <- with(data, value / ave(value, ID, FUN = max))
Function ave is pretty useful, and you may want to read ?ave.
# Create an example data frame
dt <- read.csv(text = "ID, value
1, 0.5
1, 0.6
1, 0.7
2, 0.5
2, 0.5
2, 0.5")
# Load package
library(tidyverse)
# Create a new data frame with a column showing normalization
dt2 <- dt %>%
  # Group by ID so the following command works within each group
  group_by(ID) %>%
  # Create the new column norm:
  # norm equals each value divided by the maximum value of its ID group
  mutate(norm = value/max(value))
We can use data.table
library(data.table)
setDT(dt)[, norm := value/max(value), ID]
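For comparison, all three R one-liners above do the same two-pass computation under the hood: find each group's maximum, then divide. A plain Python sketch of that logic, illustrative only:

```python
from collections import defaultdict

rows = [(1, 0.5), (1, 0.6), (1, 0.7), (2, 0.5), (2, 0.5), (2, 0.5)]

# First pass: maximum value per ID
gmax = defaultdict(float)
for id_, v in rows:
    gmax[id_] = max(gmax[id_], v)

# Second pass: divide each value by its group's maximum
norm = [v / gmax[id_] for id_, v in rows]
print(norm)
```

No sorting or explicit per-row looping over groups is needed, which is exactly what ave, group_by/mutate, and data.table's := provide in R.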
