I have two arrays in Julia, X = Array{Float64,2} and Y = Array{Float64,2}. I'd like to perform a vlookup as per Excel functionality. I can't seem to find something like this.
The following code returns the first match from a details matrix, using the related record from a master matrix.
function vlook(master, detail, val)
val = master[findfirst(x->x==val,master[:,2]),1]
return detail[findfirst(x->x==val,detail[:,1]),2]
end
julia> vlook(a,b,103)
1005
A more general approach is to use DataFrames.jl, which is designed for working with tabular data.
VLOOKUP is a popular function amongst Excel users, and has signature:
VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)
I've never much liked that last argument range_lookup. First it's not clear to me what "range_lookup" is intended to mean and second it's an optional argument defaulting to the much-less-likely-to-be-what-you-want value of TRUE for approximate matching, rather than FALSE for exact matching.
So in my attempt to write VLOOKUP equivalents in Julia I've dropped the range_lookup argument and added another argument keycol_index_num to allow for searching of other than the first column of table_array.
WARNING
I'm very new to Julia, so there may be some howlers in the code below, but it seems to work for me on Julia 0.6.4. Also, as already commented, using DataFrames might be a better solution for looking up values in an array-like structure.
#=
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Procedures: vlookup and vlookup_withfailsafe
Purpose : Inspired by Excel VLOOKUP. Searches a column of table_array for
lookup_values and returns the corresponding elements from another column of
table_array.
Arguments:
lookup_values: a value or array of values to be searched for inside
column keycol_index_num of table_array.
table_array: An array with two dimensions.
failsafe: a single value. The return contains this value whenever an element
of lookup_values is not found.
col_index_num: the number of the column of table_array from which values
are returned.
keycol_index_num: the number of the column of table_array to be searched.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=#
vlookup = function(lookup_values, table_array::AbstractArray, col_index_num::Int = 2, keycol_index_num::Int = 1)
    if ndims(table_array)!=2
        error("table_array must have 2 dimensions")
    end
    if isa(lookup_values,AbstractArray)
        indexes = indexin(lookup_values,table_array[:,keycol_index_num])
        # elementwise comparison: `indexes==0` compares the whole array to 0 and is always false
        if(any(indexes .== 0))
            error("at least one element of lookup_values not found in column $keycol_index_num of table_array")
        end
        return(table_array[indexes,col_index_num])
    else
        index = indexin([lookup_values],table_array[:,keycol_index_num])[1]
        if(index==0)
            error("lookup_values not found in column $keycol_index_num of table_array")
        end
        return(table_array[index,col_index_num])
    end
end
vlookup_withfailsafe = function(lookup_values, table_array::AbstractArray, failsafe, col_index_num::Int = 2, keycol_index_num::Int = 1)
    if ndims(table_array)!=2
        error("table_array must have 2 dimensions")
    end
    if !isa(failsafe,eltype(table_array))
        error("failsafe must be of the same type as the elements of table_array")
    end
    if isa(lookup_values,AbstractArray)
        indexes = indexin(lookup_values,table_array[:,keycol_index_num])
        Result = Array{eltype(table_array)}(size(lookup_values))
        for i in 1:length(lookup_values)
            if(indexes[i]==0)
                Result[i] = failsafe
            else
                Result[i] = table_array[indexes[i],col_index_num]
            end
        end
        return(Result)
    else
        index = indexin([lookup_values],table_array[:,keycol_index_num])[1]
        if index == 0
            return(failsafe)
        else
            return(table_array[index,col_index_num])
        end
    end
end
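For comparison, the same exact-match lookup with a fallback value can be sketched in Python. This is only an illustration under assumptions, not a port of the Julia code above: the table is taken to be a list of rows, the function name and parameter names are made up for the example, and column numbers are 0-based as is usual in Python.

```python
def vlookup(lookup_value, table, col_index=1, key_col=0, failsafe=None):
    """Return table[row][col_index] for the first row whose key_col equals
    lookup_value, or failsafe when no row matches (hypothetical helper)."""
    for row in table:
        if row[key_col] == lookup_value:
            return row[col_index]
    return failsafe


# Small made-up table: key in column 0, value in column 1.
table = [
    [101, "apple"],
    [102, "banana"],
    [103, "cherry"],
]

found = vlookup(102, table)                      # exact match on the key column
missing = vlookup(999, table, failsafe="n/a")    # no match, so the fallback is returned
```

Returning a fallback instead of raising mirrors vlookup_withfailsafe; raising an error on a miss would mirror vlookup.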
I have a customer who sends electronic payments but doesn't bother to specify which invoices they cover. I'm left guessing which ones, and I would rather not try every single combination manually. I need some sort of pseudo-code that I can then adapt, but I'm not sure I can come up with a good algorithm myself. I'm familiar with PHP, Bash, and Python, but I can adapt.
I would need an array with the following numbers: [357.15, 223.73, 106.99, 89.96, 312.39, 120.00]. Those are the amounts of the invoices. Then I would need to find a sum of any combination of two or more of those numbers that adds up to 596.57. Once found the program would need to tell me exactly which numbers it used to reach the sum so I can then know which invoices got paid.
This is very similar to the Subset Sum problem and can be solved using a similar approach to the typical brute-force method used for that problem. I have to do this often enough that I keep a simple template of this algorithm handy for when I need it. What is posted below is a slightly modified version [1].
This has no restrictions on whether the values are integer or float. The basic idea is to iterate over the list of input values and keep a running list of every subset that sums to less than the target value (since there might be a later value in the inputs that will yield the target). It could be modified to handle negative values as well by removing the rule that only keeps candidate subsets if they sum to less than the target. In that case, you'd keep all subsets, and then search through them at the end.
import copy
def find_subsets(base_values, target):
    possible_matches = [[0, []]]  # [[known_attainable_value, [list, of, components]], [...], ...]
    matches = []  # we'll return ALL subsets that sum to `target`
    for base_value in base_values:
        temp = copy.deepcopy(possible_matches)  # Can't modify in loop, so use a copy
        for possible_match in possible_matches:
            new_val = possible_match[0] + base_value
            if new_val <= target:
                new_possible_match = [new_val, possible_match[1]]
                new_possible_match[1].append(base_value)
                temp.append(new_possible_match)
                if new_val == target:
                    matches.append(new_possible_match[1])
        possible_matches = temp
    return matches
find_subsets([list, of input, values], target_sum)
This is a very inefficient algorithm and it will blow up quickly as the size of the input grows. The Subset Sum problem is NP-Complete, so you are not likely to find a generalized solution that will work in all cases and is efficient.
1: The way lists are being used here is kludgy. If the goal was to simply find any match, the nested lists could be replaced with a dictionary, and we could exit right away once a match is found. But doing that will cause intermediate subsets that sum to the same value to also map to the same dictionary slot, so only one subset with that sum is kept. Since we need to report all matching subsets (because the values represent checks and are presumably not fungible even if the dollar amounts are equal), a dictionary won't work.
You can use itertools.combinations(t,r) to list all combinations of r elements in array t.
So we loop on the possible values of r, then on the results of itertools.combinations:
import itertools
def find_sum(t, obj):
    t = [x for x in t if x <= obj]  # filter out elements which are too big
    for r in range(1, len(t)+1):  # loop on number of elements
        for subt in itertools.combinations(t, r):  # loop on combinations of r elements
            if sum(subt) == obj:
                return subt
    return None
find_sum([1,2,3,4], 6)
# (2, 4)
find_sum([1,2,3,4], 10)
# (1, 2, 3, 4)
find_sum([1,2,3,4], 11)
# None
find_sum([35715, 22373, 10699, 8996, 31239, 12000], 59657)
# None
Rounding errors:
The code above is meant to be used with integers rather than floats.
To use it with floats, replace the test sum(subt) == obj with the more forgiving test abs(sum(subt) - obj) < 0.01.
Relevant documentation:
itertools.combinations
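As an alternative to a tolerance test, the amounts can be converted to integer cents before searching, so that exact equality is safe. The sketch below assumes amounts with at most two decimal places; the function name and the min_items parameter are made up for the example, and it returns every matching combination rather than only the first.

```python
from itertools import combinations


def find_invoice_sets(amounts, target, min_items=2):
    """Return every combination of at least min_items amounts that sums to
    target, comparing in integer cents to avoid float rounding issues."""
    cents = [round(a * 100) for a in amounts]
    target_cents = round(target * 100)
    matches = []
    for r in range(min_items, len(cents) + 1):
        # Combine positions rather than values so equal amounts stay distinct.
        for idx in combinations(range(len(cents)), r):
            if sum(cents[i] for i in idx) == target_cents:
                matches.append([amounts[i] for i in idx])
    return matches


# 0.1 + 0.2 == 0.3 is False with floats, but the cents comparison finds it.
small = find_invoice_sets([0.1, 0.2, 0.3], 0.3)

# The invoice amounts from the question.
invoices = find_invoice_sets(
    [357.15, 223.73, 106.99, 89.96, 312.39, 120.00], 596.57
)
```

Like the versions above, this is still a brute-force search, so it is only practical for short lists.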
I have tried to get multiple outputs of the function I made
ratio_marker_out_2 = function(marker_gene, cluster_id){
  marker_gene = list(row.names(FindMarkers(glioblastoma, ident.1 = cluster_id)))
  for (gene in marker_gene){
    all_cells_all_markers = glioblastoma@assays$RNA@counts[gene,]
    selected_cells_all_marker = all_cells_all_markers[cluster_id!=Idents(glioblastoma)]
    gene_count_out_cluster = glioblastoma@assays$RNA@counts[,cluster_id!=Idents(glioblastoma)]
    ratio_out = sum(selected_cells_all_marker)/sum(gene_count_out_cluster)
  }
  return(ratio_out)
}
Here, the length of marker_gene is in the hundreds; let's say it is 100. I want to get ratio_out for each gene in marker_gene. However, when I run this function, I only get one output instead of a list of 100 ratio_out values. Could anyone please help me fix it?
The output I got for
ratio_marker_out_2(marker_gene, 0)
is [1] 0.5354895. Please see the picture below.
The cause may be the built-in sum function.
By default, it returns a number. So when you do:
ratio_out = sum(selected_cells_all_marker)/sum(gene_count_out_cluster)
you're actually dividing two numerics.
So if you want to return a list of values rather than a single number, drop the sum around the numerator (depending on what you are calculating):
ratio_out = (selected_cells_all_marker)/sum(gene_count_out_cluster)
I have solved this issue using
all_cells_all_markers[marker_gene, cluster_id!=Idents(glioblastoma)]
ratio_out = (selected_cells_all_marker)/sum(gene_count_out_cluster).
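The underlying pattern (one ratio per gene instead of a single overwritten value) can be sketched in plain Python. This is an analogy under assumptions, not Seurat code: the count matrix is stood in for by a made-up dictionary mapping each gene to its per-cell counts, and cell_clusters is a made-up list giving the cluster of each cell.

```python
def ratios_outside_cluster(counts, cell_clusters, cluster_id):
    """Return {gene: share of all out-of-cluster counts that belong to gene}.

    counts: dict mapping gene name -> list of per-cell counts (hypothetical)
    cell_clusters: list with the cluster id of each cell (hypothetical)
    """
    # Positions of the cells that are NOT in the chosen cluster.
    outside = [i for i, c in enumerate(cell_clusters) if c != cluster_id]
    # Denominator: all counts, over all genes, in those cells.
    total_outside = sum(row[i] for row in counts.values() for i in outside)
    # One entry per gene, instead of overwriting a single variable in a loop.
    return {
        gene: sum(row[i] for i in outside) / total_outside
        for gene, row in counts.items()
    }


counts = {"GENE_A": [1, 2, 3], "GENE_B": [4, 5, 6]}
cell_clusters = [0, 1, 1]
ratios = ratios_outside_cluster(counts, cell_clusters, cluster_id=0)
```

The key point is that each iteration stores its result under its own key, so nothing is lost when the loop moves on to the next gene.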
I've got a dataframe of research participants whose IDs are stored in the following format "0000.000".
Where the first four digits are their family ID number, and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID)<5,
                               paste(StudyID,".000",sep=""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i)<5:
        df.iloc[row,3] = i + ".000"
    else: df.iloc[row,3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code is doing the trick. One other way of doing what you want that I could think of is the following:
# Start by creating a mask that marks the rows you want to change
mask = [len(i)<5 for i in df.StudyID]
# Change the values of the dataframe where the mask is True
df.loc[mask, "StudyID"] += ".000"
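The per-value rule can also be pulled out into a small helper and applied with a list comprehension, which touches each ID independently. This is a minimal sketch on a plain list of strings; the helper name is made up for the example, and it checks for the separator rather than the length, which is an assumption about what "missing the suffix" means.

```python
def add_missing_suffix(study_id, suffix=".000"):
    """Append suffix only when the ID has no decimal part yet (hypothetical helper)."""
    return study_id if "." in study_id else study_id + suffix


ids = ["0000", "0001.001", "0123", "0456.000"]
fixed = [add_missing_suffix(i) for i in ids]
```

With a dataframe, the resulting list can be assigned back to the column in one step, for example df["StudyID"] = [add_missing_suffix(i) for i in df["StudyID"]].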
I think by length(StudyID), you meant nchar(StudyID), as @akrun pointed out.
You can do it in the dplyr way in python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep = "."):
    """
    This function loops through the original ID column and splits it into
    multiple parts (multi-IDs) on the defined separator.
    By default, the function assumes the unique ID is formatted like a decimal number.
    The new multi-IDs are appended into a new list.
    If the original ID was formatted like an integer, rather than a decimal,
    the function assumes the latter half of the ID to be "000".
    """
    # Take a copy of the dataframe to modify
    new_df = data.copy()
    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # iterate through the IDs, split and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        i = i.split(sep)
        Family_ID.append(i[0])
        if len(i)==1:
            Family_Index.append("000")
        else:
            Family_Index.append(i[1])
    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID = Family_ID,
                         Family_Index = Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])
I am trying to sum the values of a specific field in the structure below, but it does not seem to work: I get an error saying that zero or one value was expected but two or more were found.
<v4:CalculateResponse xmlns:v4="http://services.xx.net/mm/va">
<v4:CalculateResponseSizeType>
<v4:CalculateCCs>
<v4:Container>
<v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
<v4:NetPredictedWeight>50</v4:NetPredictedWeight>
<v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
<v4:TypeOfWeightUsed>P</v4:TypeOfWeightUsed>
</v4:Container>
<v4:Container>
<v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
<v4:NetPredictedWeight>50</v4:NetPredictedWeight>
<v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
<v4:TypeOfWeightUsed>B</v4:TypeOfWeightUsed>
</v4:Container>
<v4:Container>
<v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
<v4:NetPredictedWeight>50</v4:NetPredictedWeight>
<v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
<v4:TypeOfWeightUsed>B</v4:TypeOfWeightUsed>
</v4:Container>
<v4:Container>
<v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
<v4:NetPredictedWeight>50</v4:NetPredictedWeight>
<v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
<v4:TypeOfWeightUsed>P</v4:TypeOfWeightUsed>
</v4:Container>
</v4:CalculateCCs>
</v4:CalculateResponseSizeType>
<v4:Status>P</v4:Status>
<v4:StatusCode>1000</v4:StatusCode>
</v4:CalculateResponse>
I have tried summing these values using the function below, but it looks like it is only expecting one value.
<Weight>
{
sum(
data($calculateResponse1/*:CalculateResponseSizeType/*:CalculateCCs/*:Container[data(*:TypeOfWeightUsed) = "B"]/*:GrossBookedWeight),
data($calculateResponse1/*:CalculateResponseSizeType/*:CalculateCCs/*:Container[data(*:TypeOfWeightUsed) = "P"]/*:GrossPredictedWeight)
)
}
</Weight>
The calculation is simple: if TypeOfWeightUsed = P, I want to use the GrossPredictedWeight element value, and if TypeOfWeightUsed = B, I want to use GrossBookedWeight.
There can be multiple containers in a structure.
Please suggest what is wrong with the above syntax.
You can use a FLWOR expression with an if-else construct to get all the numbers needed for sum():
<Weight>
{
sum(
for $c in $calculateResponse1/*:CalculateResponseSizeType/*:CalculateCCs/*:Container
return
if($c/*:TypeOfWeightUsed = "B") then $c/*:GrossBookedWeight
else $c/*:GrossPredictedWeight
)
}
</Weight>
Output:
<Weight>170.4</Weight>
When the sum() function has two arguments, the second argument provides a value to be used as the result when the first argument is an empty sequence. (This is a clumsy way of dealing with the fact that without static type checking, the sum() function cannot distinguish an empty sequence of doubles from an empty sequence of durations, and you don't really want an integer-zero result when you are summing durations).
You have called the function with two arguments, but I think you want both sequences to be regarded as inputs to be summed. Just add another pair of parentheses to make it a single argument: replace sum(x, y) by sum((x, y)).
The reason you got an error is that the second argument, if supplied, must be a singleton value, not a sequence.
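For readers who want to cross-check the result outside XQuery, the per-container choice (GrossBookedWeight when TypeOfWeightUsed is B, GrossPredictedWeight otherwise) can be sketched with Python's standard xml.etree.ElementTree. The sample document is a shortened two-container version of the one in the question, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping taken from the xmlns:v4 declaration in the question.
NS = {"v4": "http://services.xx.net/mm/va"}

SAMPLE = """<v4:CalculateResponse xmlns:v4="http://services.xx.net/mm/va">
  <v4:CalculateResponseSizeType>
    <v4:CalculateCCs>
      <v4:Container>
        <v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
        <v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
        <v4:TypeOfWeightUsed>P</v4:TypeOfWeightUsed>
      </v4:Container>
      <v4:Container>
        <v4:GrossBookedWeight>31.6</v4:GrossBookedWeight>
        <v4:GrossPredictedWeight>53.6</v4:GrossPredictedWeight>
        <v4:TypeOfWeightUsed>B</v4:TypeOfWeightUsed>
      </v4:Container>
    </v4:CalculateCCs>
  </v4:CalculateResponseSizeType>
</v4:CalculateResponse>"""


def total_weight(xml_text):
    """Sum one weight per Container, chosen by its TypeOfWeightUsed value."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for container in root.iterfind(".//v4:Container", NS):
        kind = container.findtext("v4:TypeOfWeightUsed", namespaces=NS)
        tag = "v4:GrossBookedWeight" if kind == "B" else "v4:GrossPredictedWeight"
        total += float(container.findtext(tag, namespaces=NS))
    return total


print(round(total_weight(SAMPLE), 1))
```

Looping over containers and picking one element per container matches the FLWOR answer; the sum((x, y)) form gives the same total because each container contributes to exactly one of the two filtered sequences.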
First of all thank you for reading my post.
I would like to ask how can I replicate R subset mechanism in excel-vba?
Here is my r function:
Subdeck2 = deck2[(deck2[,3]>=10 & deck2[,4]<=30),]
The code uses r to create a data.frame object called Subdeck2 which is a subset of a data.frame object called deck2 that contain the rows of deck2 that have a third column value of more than or equal to ten, and a fourth column value of less than or equal to thirty.
I would like to replicate this in excel-vba, producing a worksheet that is a subset of the worksheet with the source data. I think the array naming in Excel is very helpful for referencing the rows and columns.
In r, it tends to get confusing when I have to do this repeatedly, because I have to remember the row and column numbers that I have already input.
I only need to do this one particular thing in Excel. I already bought a book about VBA programming, but it's about 1000 pages long and I can't seem to find the word subset in there.
Any suggestions on how to do this, or where I can learn to do this, will be very appreciated. Thanks!
Here is an example - nowhere near as concise as your r function though.
The method is commented - but basically, it iterates the rows of the source range and checks each row against the criteria, collecting the matching rows with Union. It then copies the collected rows to the output range. Copy is used rather than assigning .Value because the collected rows are usually non-contiguous, and .Value and .Rows.Count of a multi-area range only cover its first area.
Option Explicit
Sub FilterLikeRSubset()
    Dim rngData As Range
    Dim rngRow As Range
    Dim rngFilter As Range
    Dim rngOutput As Range
    'get data
    Set rngData = ThisWorkbook.Worksheets("Sheet1").Range("A1:D5")
    'iterate rows in data
    For Each rngRow In rngData.Rows
        'test row criteria
        If rngRow.Cells(1, 3) >= 10 And rngRow.Cells(1, 4) <= 30 Then
            'success
            If rngFilter Is Nothing Then
                Set rngFilter = rngRow
            Else
                Set rngFilter = Union(rngFilter, rngRow)
            End If
        End If
    Next rngRow
    'nothing matched, so there is nothing to output
    If rngFilter Is Nothing Then Exit Sub
    'set top-left cell for output
    Set rngOutput = ThisWorkbook.Worksheets("Sheet1").Range("A10")
    'output: Copy pastes the collected rows as one contiguous block
    rngFilter.Copy Destination:=rngOutput
End Sub
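The filtering logic itself is independent of VBA. As a minimal sketch, here it is in Python on a list of rows: keep the rows whose third column is at least 10 and whose fourth column is at most 30. The data and the function name are made up for the example, and columns are 0-based, so the third and fourth columns are indexes 2 and 3.

```python
def subset_rows(rows, min_col3=10, max_col4=30):
    """Keep rows whose third value is >= min_col3 and fourth value is <= max_col4."""
    return [row for row in rows if row[2] >= min_col3 and row[3] <= max_col4]


# Made-up stand-in for deck2: four columns per row.
deck2 = [
    ["a", 1, 12, 25],
    ["b", 2, 8, 10],    # fails the third-column test
    ["c", 3, 15, 40],   # fails the fourth-column test
    ["d", 4, 10, 30],   # boundary values pass both tests
]
subdeck2 = subset_rows(deck2)
```

This mirrors deck2[(deck2[,3]>=10 & deck2[,4]<=30),] row for row, including the inclusive boundaries.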