I want to assign the value corresponding to a key in a dictionary as the replacement for a column value in PySpark

I am trying to write this function in PySpark so that the values stored in mapping_table get assigned to the values in cat_columns. But it's not working. I have tried many ways. Can you please help me understand where I am going wrong?
def apply_hierarchy_masking(dataframes, mapping, categorical_columns, apply_mask_flag=0):
    if apply_mask_flag == 1:
        for i, dataframe in zip(range(len(dataframes)), dataframes):
            for colmn in cat_cols_list:
                if colmn in mapping_table:
                    #dataframe = dataframe.withColumn(colmn, mapping[colmn][colmn])
                    # Create a udf for the mapping function
                    dataframe = dataframe.withColumn(colmn, dataframe[colmn].cast(T.StringType()))
                    mapping_udf = F.udf(lambda x: mapping[colmn].get(x), T.StringType())
                    # Use the udf to apply the mapping function to the column
                    dataframe = dataframe.withColumn(colmn, mapping_udf(col(colmn)))
                    # # Define a conditional expression that maps values to their masked equivalents
                    # mapping_expr = F.when(dataframe[colmn].isin(list(mapping[colmn])), mapping[colmn][colmn]).otherwise(dataframe[colmn])
                    # # Apply the conditional expression to the column using withColumn
                    # dataframe = dataframe.withColumn(colmn, mapping_expr)
                    # # Define the mapping function using udf
                    # mapping_udf = F.udf(lambda x: mapping[x][x])
                    # # Apply the mapping function to the column using withColumn
                    # dataframe = dataframe.withColumn(colmn, mapping_udf(colmn))
                else:
                    print(f"Not applying masking for column: {colmn}")
            # null_count = dataframe.select([col(c).isNull().alias(c) for c in dataframe.columns]).sum().asDict()
            # print(f"Dataframe {i} null value counts: \n{null_count}")
            # print(f"Dataframe {i} shape before dropping null values: {dataframe.count()}, {len(dataframe.columns)}")
            # # Drop the rows with null values
            # dataframe = dataframe.dropna()
            # print(f"Dataframe {i} shape after dropping null values: {dataframe.count()}, {len(dataframe.columns)}")
            # dataframes[i] = dataframe
    else:
        print("Not applying hierarchy masking to the given data")
    return dataframes
I have tried different approaches and left them commented out in this code, but none of them work. The udf does not raise an error; it just returns the original values.

If I understood correctly, you need to replace a column's value with the dictionary value whenever the column value is present as a key in the dictionary.
You can create a DataFrame out of that dictionary (named mapping_dict here so that it does not shadow the built-in dict):
from pyspark.sql import functions as F
mapping_dict = {"value_1": "value_x", "value_2": "value_y"}
dict_df = spark.createDataFrame([(k, v) for k, v in mapping_dict.items()], ["key", "value"])
and then perform a join like this (to replace value of column_a):
df.alias("df1")\
.join(F.broadcast(dict_df.alias("df2")), F.col("column_a")==F.col("key"), "left")\
.withColumn("column_a", F.coalesce(F.col("df2.value"), F.col("df1.column_a")))\
.show()
Input DF:
Dict DF:
Output (you can drop the intermediate columns; I kept them for clarity):
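The rule that the left join plus coalesce implements can be modelled without Spark. The following is only a plain-Python sketch of that rule, not the Spark code itself; the sample values and the helper name replace_if_mapped are assumptions made for illustration.

```python
# Plain-Python model of the "left join + coalesce" rule (no Spark involved).
# The sample data and the helper name are hypothetical.
mapping = {"value_1": "value_x", "value_2": "value_y"}
column_a = ["value_1", "value_3", "value_2", None]


def replace_if_mapped(value, mapping):
    """Return the mapped value when `value` is a key, else `value` unchanged."""
    return mapping.get(value, value)


replaced = [replace_if_mapped(v, mapping) for v in column_a]
print(replaced)  # ['value_x', 'value_3', 'value_y', None]
```

Values without a matching key (including nulls) pass through unchanged, which is the behaviour the coalesce in the join gives you.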

Related

How to define a certain number of nested for-loops (based on input length in R Shiny app)?

Here is the context: I work on an R Shiny web application. The user uploads a dataframe, then selects a number n of columns with a selectInput. The number of selected columns can vary from one to six.
Based on this number of columns, I would like to generate the appropriate number of nested for-loops automatically. At the moment, I use if() conditions that test each possible number of selected columns.
I want to iterate over each unique value of each selected column. That makes my code very long:
my_columns = input$colnames #The user selects column names
if(length(my_columns) == 1){
for(var1 in unique(mydataframe[,my_columns[1]])){
...
}
}
if(length(my_columns) == 2){
for(var1 in unique(mydataframe[,my_columns[1]])){
for(var2 in unique(mydataframe[,my_columns[2]])){
...
}
}
}
if(length(my_columns) == 3){
for(var1 in unique(mydataframe[,my_columns[1]])){
for(var2 in unique(mydataframe[,my_columns[2]])){
for(var3 in unique(mydataframe[,my_columns[3]])){
...
}
}
}
}
and so on ...
Is there a solution to avoid this ?
Thank you
Correct me if I am mistaken, but you seem to compute something that needs to cover all possible value combinations of the selected columns.
R does not need nested for-loops for this case:
my_columns <- data.frame(
"A" = c(1,2,3),
"B" = c(11,12,13),
"C" = c(21,22,23))
# find all unique values per column
list_uniques <- lapply(seq_along(my_columns),
function(x){unique(my_columns[[x]])}
)
# find out all possible combinations of the given values
# the output is a dataframe
all_combinations <- expand.grid(list_uniques)
# Now you can iterate over the frame and do something with them
# example rowsums
rowSums(all_combinations) # vectorized functions like this are faster
# example custom function
apply(all_combinations,
MARGIN = 1, # iterate rowwise
# you can now use your own function
# the input i is a row as a named vector
FUN = function(i){paste(i,collapse = " and ")})
# This function will output:
# "1 and 11 and 21" "2 and 11 and 21" ....
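For comparison, the same "all combinations of unique values" idea can be sketched in Python with itertools.product. This is an analogue of the expand.grid approach, not a translation of the Shiny app; the table dictionary and the function name all_combinations are assumptions. Note that product varies the last column fastest, whereas expand.grid varies the first column fastest.

```python
from itertools import product

# Hypothetical sample table: column name -> values (mirrors the A/B/C frame above).
table = {"A": [1, 2, 3], "B": [11, 12, 13], "C": [21, 22, 23]}


def all_combinations(table, selected):
    """Return one tuple per combination of unique values of the selected columns."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    uniques = [list(dict.fromkeys(table[name])) for name in selected]
    return list(product(*uniques))


combos = all_combinations(table, ["A", "B", "C"])
print(len(combos))  # 27
print(combos[0])    # (1, 11, 21)

# One label per combination, like the paste() example
labels = [" and ".join(map(str, combo)) for combo in combos]
print(labels[1])    # 1 and 11 and 22
```

The number of selected columns can be anything from one upwards; no nesting depth is hard-coded.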

Grab data from table by coordinates as input

I have a matrix with 99 rows called r010, r020, ... r990 and 99 cols called c010, c020, ... c990.
Additionally I have defined variables for every row and every column, like r010.
This variable contains all data within the row r010.
My input data can contain values like "{r010, c070} + {r020, c070} == {r030, c090}", so it is important to address specific cells of the matrix.
I would like R to go into the matrix at row 1 (r010) and col 7 (c070) if the input is {r010, c070}.
So I need a function which detects the "," and returns the value in that cell of the matrix (row on the left side of the "," and col on the right).
This works as follows:
s <- "{r390, c010} == {r400, c010} + {r410, c010} + {r420, c010}"
pat <- "\\d+(?>\\d)\\B"
pat2 <- "\\{r..., c...\\}"
getCell=function(data,string){
y=regmatches(string,gregexpr(pat,string,perl = T))
data[do.call(rbind,lapply(y,as.numeric))]
}
pos<-regmatches(s,gregexpr(pat2,s))
unlist(pos)->pos # convert to character
getCell(table,pos)#Will give the values in (39,1),(40,1),(41,1),(42,1)
Now I need to put the results back into the original formula s so that it can be evaluated with eval(parse(text=s)).
This could work with gsubfn, but I don't fully understand it. Running
b <- gsubfn(pat, getCell, s); b
leads to:
Error in structure(.External(.C_dotTcl, ...), class = "tclObj") :
[tcl] couldn't compile regular expression pattern: quantifier operand invalid.
Can anyone help?
In base R, we can use regmatches to extract the digits that follow r or c, up to but not including the last 0. Then we transform them into numeric, which automatically drops the leading zero. We then use those numbers to extract the required values from the data:
fun=function(data,x){
y=regmatches(x,gregexpr("\\d+(?>\\d)\\B",x,perl = T))
data[do.call(rbind,lapply(y,as.numeric))]
}
pos=c("{r010, c070}","{r0200, c0700}","{r0990, c0990}")
fun(data,pos)#Will give you values in (1,7),(20,70),(99,99)
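The parse-then-substitute step the question asks about can also be sketched in Python with the standard re module. This is an illustration under assumptions, not the R solution: the names CELL, cell_indices and substitute are hypothetical, the matrix is a made-up list of rows, and the labels are assumed to count in tens (r010 is row 1).

```python
import re

# Matches references such as {r010, c070}; the two groups hold the digits.
CELL = re.compile(r"\{r(\d+),\s*c(\d+)\}")


def cell_indices(formula):
    """Return 1-based (row, col) pairs for every {rNNN, cNNN} reference."""
    # Labels count in tens (assumption), so dropping the last digit gives the index
    return [(int(r) // 10, int(c) // 10) for r, c in CELL.findall(formula)]


def substitute(formula, matrix):
    """Replace each cell reference with the matching value from `matrix`."""
    def lookup(match):
        row = int(match.group(1)) // 10
        col = int(match.group(2)) // 10
        return str(matrix[row - 1][col - 1])  # list of rows is 0-based
    return CELL.sub(lookup, formula)


s = "{r390, c010} == {r400, c010} + {r410, c010} + {r420, c010}"
print(cell_indices(s))  # [(39, 1), (40, 1), (41, 1), (42, 1)]

# Hypothetical 99 x 99 matrix whose cell (r, c) holds r * 100 + c
matrix = [[r * 100 + c for c in range(1, 100)] for r in range(1, 100)]
print(substitute("{r010, c070} + {r020, c070}", matrix))  # 107 + 207
```

Passing a function as the replacement lets the substitution happen in one pass, which is the role gsubfn was meant to play in the question.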

R - Use names in a list to feed named objects to a loop?

I have a data frame of some 90 financial symbols (will use 3 for simplicity)
> View(syM)
symbol
1 APPL
2 YAHOO
3 IBM
I created a function that gets JSON data for these symbols and produce an output. Basically:
nX <- function(x) {
#get data for "x", format it, and store it in "nX"
nX <- x
return(nX)
}
I used a loop to get the data and store the zoo series named after each symbol accordingly.
for (i in 1:nrow(syM)) {
assign(x = paste0(syM[i,]),
value = nX(x = syM[i,]))
Sys.sleep(time = 1)
}
Which results in:
[1] "APPL" "YAHOO" "IBM"
Each is a zoo series with 5 columns of data.
Further, I want to get some plotting done to each series and output the result, preferably using a for loop or something better.
yN <- function(y) {
#plot "y" series, columns 2 and 3, and store it in "yN"
yN <- y[,2:3]
return(yN)
}
Following a similar logic to my previous loop I tried:
for (i in 1:nrow(syM)) {
assign(x = paste0(pairS[i,],".plot"),
value = yN(y = paste0(syM[i,])))
}
But so far the data is not being sent to the function, only the name of the symbol, so I naturally get:
y[,2:3] : incorrect number of dimensions
I have also tried:
for (i in 1:nrow(syM)) {
assign(x = paste0(syM[i,],".plot"),
value = yN(y = ls(pattern = paste0(syM[i,]))))
}
With similar results. When I input the name of the series manually it does save the plot of the first symbol as "APPL.Plot".
assign(paste0(syM[1,], ".Plot"),
value = yN(y = APPL))
Consider lapply with setNames to create a named list of nX returned objects:
nX_list <- setNames(lapply(syM$symbol, nX), syM$symbol)
# OUTPUT ZOO OBJECTS BY NAMED INDEX
nX_list$APPL
nX_list$YAHOO
nX_list$IBM
# CREATE SEPARATE OBJECTS FROM LIST
# BUT NO NEED TO FLOOD GLOBAL ENVIR W/ 90 OBJECTS, JUST USE 1 LIST
list2env(nX_list, envir=.GlobalEnv)
For the plot function, first retrieve the object by its string name inside the function (index the list, or use get if you created separate objects), then similarly run lapply with setNames:
yN <- function(y) {
#plot "y" series, columns 2 and 3, and store it in "yN"
yobj <- nX_list[[y]] # IF USING ABOVE LIST
# yobj <- get(y) # IF USING SEPARATE OBJECTS (use instead of the line above)
yN <- yobj[,2:3]
return(yN)
}
plot_list <- setNames(lapply(syM$symbol, yN), paste0(syM$symbol, ".plot"))
# OUTPUT PLOTS BY NAMED INDEX
plot_list$APPL.plot
plot_list$YAHOO.plot
plot_list$IBM.plot
# CREATE SEPARATE OBJECTS FROM LIST
# BUT NO NEED TO FLOOD GLOBAL ENVIR W/ 90 OBJECTS, JUST USE 1 LIST
list2env(plot_list, envir=.GlobalEnv)
As you note, you're calling yN with a character argument in:
for (i in 1:nrow(syM)) {
assign(x = paste0(pairS[i,],".plot"),
value = yN(y = paste0(syM[i,])))
}
paste0(syM[i,]) is going to resolve to a character and not the zoo object it appears you're trying to reference. Instead, use something like get():
for (i in 1:nrow(syM)) {
assign(x = paste0(pairS[i,],".plot"),
value = yN(y = get(paste0(syM[i,]))))
}
Or perhaps just store your zoo objects in a list in the first place and then operate on all elements of the list with something like lapply()...
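The "keep everything in one named container" advice looks like this in Python, using a dictionary keyed by symbol. It is only a sketch: fetch_series and plot_columns are hypothetical stand-ins for nX and yN, and the rows they produce are invented sample data.

```python
def fetch_series(symbol):
    """Hypothetical stand-in for nX: three rows of 5 columns for a symbol."""
    return [[symbol, day, day * 2, day * 3, day * 4] for day in range(1, 4)]


def plot_columns(series):
    """Hypothetical stand-in for yN: keep columns 2 and 3 of each row."""
    return [row[1:3] for row in series]


symbols = ["APPL", "YAHOO", "IBM"]

# One dictionary instead of one global variable per symbol
series_by_symbol = {sym: fetch_series(sym) for sym in symbols}

# The second step receives the data itself, not just the symbol's name
plots = {f"{sym}.plot": plot_columns(data) for sym, data in series_by_symbol.items()}

print(sorted(plots))      # ['APPL.plot', 'IBM.plot', 'YAHOO.plot']
print(plots["IBM.plot"])  # [[1, 2], [2, 4], [3, 6]]
```

Because the second step iterates over the stored objects, there is no need to look anything up by its string name.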

Accessing ... function arguments by (string) name inside the function in R?

I'm trying to write a function with dynamic arguments (i.e. the function argument names are not determined beforehand). Inside the function, I can generate a list of possible argument names as strings and try to extract the function argument with the corresponding name (if given). I tried using match.arg, but that does not work.
As a (massively stripped-down) example, consider the following attempt:
# Override column in the dataframe. Dots arguments can be any
# of the column names of the data.frame.
dataframe.override = function(frame, ...) {
for (n in names(frame)) {
# Check whether this col name was given as an argument to the function
if (!missing(n)) {
vl = match.arg(n);
# DO something with that value and assign it as a column:
newval = vl
frame[,n] = newval
}
}
frame
}
AA = data.frame(a = 1:5, b = 6:10, c = 11:15)
dataframe.override(AA, b = c(5,6,6,6,6)) # Should override column b
Unfortunately, the match.arg apparently does not work:
Error in match.arg(n) : 'arg' should be one of
So, my question is: Inside a function, how can I check whether the function was called with a given argument and extract its value, given the argument name as a string?
Thanks,
Reinhold
PS: In reality, the "Do something..." part is quite complicated, so simply assigning the vector to the dataframe column directly without such a function is not an option.
You probably want to review the chapter on Non Standard Evaluation in Advanced-R. I also think Hadley's answer to a related question might be useful.
So: let's start from that other answer. The most idiomatic way to get the arguments to a function is like this:
get_arguments <- function(...){
match.call(expand.dots = FALSE)$`...`
}
That provides a list of the arguments with names:
> get_arguments(one, test=2, three=3)
[[1]]
one
$test
[1] 2
$three
[1] 3
You could simply call names() on the result to get the names.
Note that if you want the values as strings you'll need to use deparse, e.g.
deparse(get_arguments(one, test=2, three=3)[[2]])
[1] "2"
P.S. Instead of looping through all columns, you might want to use intersect or setdiff, e.g.
dataframe.override = function(frame, ...) {
args = list(...) # evaluated values of the dots arguments
columns = names(match.call(expand.dots = FALSE)$`...`)
matching.cols <- intersect(names(frame), columns)
for (n in matching.cols) {
# n is a column name that was given as an argument to the function
vl = args[[n]]
# DO something with that value and assign it as a column:
newval = vl
frame[,n] = newval
}
frame
}
P.P.S: I'm assuming there's a reason you're not using dplyr::mutate for this.
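For readers who know Python, the same "arguments named after columns" pattern corresponds to keyword arguments collected with **kwargs. The sketch below is an analogue under assumptions, not the R answer: the frame is modelled as a dictionary of column lists, override_columns is a hypothetical name, and raising on unknown column names is a design choice of this sketch (the R version silently ignores them).

```python
def override_columns(frame, **overrides):
    """Return a copy of `frame` (dict of column lists) with named columns replaced."""
    unknown = set(overrides) - set(frame)
    if unknown:
        # Design choice of this sketch: reject names that are not columns
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    result = dict(frame)  # shallow copy, the input is left untouched
    for name, values in overrides.items():
        # "Do something" with the value would go here
        result[name] = list(values)
    return result


frame = {"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10], "c": [11, 12, 13, 14, 15]}
updated = override_columns(frame, b=[5, 6, 6, 6, 6])
print(updated["b"])  # [5, 6, 6, 6, 6]
print(frame["b"])    # [6, 7, 8, 9, 10]
```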

Vector-version / Vectorizing a for which equals loop in R

I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 rows of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
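The idea behind the tapply trick, grouping all positions by value in a single pass and then looking up only the values in X, can be sketched in Python as follows. This is an analogue, not the R code; the function names and the tiny sample vectors are assumptions, and positions are 1-based to match R.

```python
from collections import defaultdict


def indices_by_value(values):
    """Map each distinct value to the list of 1-based positions where it occurs."""
    positions = defaultdict(list)
    for i, v in enumerate(values, start=1):
        positions[v].append(i)
    return positions


def match_indices(values, targets):
    """For each target, return its positions in `values` (empty list if absent)."""
    positions = indices_by_value(values)  # single pass over the data
    return {t: positions.get(t, []) for t in targets}


v = [4, 3, 4, 1, 3, 4]  # hypothetical stand-in for dat.fram[,3]
X = [4, 1, 7]           # hypothetical lookup values
print(match_indices(v, X))  # {4: [1, 3, 6], 1: [4], 7: []}
```

The data is scanned once instead of once per element of X, which is where the speed-up over the original loop comes from.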
