How can I replicate R's subset mechanism in Excel VBA?

First of all, thank you for reading my post.
I would like to ask how I can replicate R's subset mechanism in Excel VBA.
Here is my R code:
Subdeck2 = deck2[(deck2[,3]>=10 & deck2[,4]<=30),]
The code uses R to create a data.frame object called Subdeck2, which is a subset of a data.frame object called deck2. It contains the rows of deck2 that have a third-column value greater than or equal to ten and a fourth-column value less than or equal to thirty.
I would like to replicate this in Excel VBA, producing a worksheet that is a subset of the worksheet with the source data. I think the array naming in Excel is very helpful for referencing the rows and columns.
In R, it tends to get confusing when I have to do this repeatedly, because I have to remember the row and column numbers that I have already input.
I only need to do this one particular thing in Excel. I already bought a book about VBA programming, but it's about 1000 pages long and I can't seem to find the word "subset" in there.
Any suggestions on how to do this, or where I can learn to do this, would be much appreciated. Thanks!

Here is an example, though it is nowhere near as concise as your R one-liner.
The method is commented, but basically it iterates over the rows of the source range and tests each row against the criteria, collecting the matching rows into a single range. It then copies the collected rows to the output range, which starts at A10.
Option Explicit

Sub FilterLikeRSubset()

    Dim rngData As Range
    Dim rngRow As Range
    Dim rngFilter As Range
    Dim rngOutput As Range

    'get data
    Set rngData = ThisWorkbook.Worksheets("Sheet1").Range("A1:D5")

    'iterate rows in data
    For Each rngRow In rngData.Rows
        'test row criteria
        If rngRow.Cells(1, 3).Value >= 10 And rngRow.Cells(1, 4).Value <= 30 Then
            'success: add the row to the filtered range
            If rngFilter Is Nothing Then
                Set rngFilter = rngRow
            Else
                Set rngFilter = Union(rngFilter, rngRow)
            End If
        End If
    Next rngRow

    'nothing matched, so there is nothing to output
    If rngFilter Is Nothing Then Exit Sub

    'top-left cell of the output
    Set rngOutput = ThisWorkbook.Worksheets("Sheet1").Range("A10")

    'output: Copy handles a non-contiguous Union, whereas .Value and
    '.Rows.Count only see the first area of such a range
    rngFilter.Copy Destination:=rngOutput

End Sub

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the format "0000.000", where the first four digits are the family ID number and the final three digits are the individual's index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID)<5,
                               paste(StudyID,".000",sep=""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i)<5:
        df.iloc[row,3] = i + ".000"
    else: df.iloc[row,3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code does the trick. Another way of doing what you want is the following:
# Start by creating a mask that marks the rows you want to change
mask = [len(i)<5 for i in df.StudyID]
# Change the values of the dataframe on the mask
# (.loc on the frame itself avoids chained assignment through df.StudyID.iloc)
df.loc[mask, "StudyID"] += ".000"
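The mask idea above can also be written without a Python-level loop. This is only a sketch: the sample IDs are made up, and it assumes the column already holds strings, as described in the question.

```python
import pandas as pd

# Hypothetical sample data; the real frame has more columns.
df = pd.DataFrame({"StudyID": ["0000", "0001.001", "0002", "0003.002"]})

# Boolean Series: True where the ID is missing its ".xxx" suffix.
needs_suffix = df["StudyID"].str.len() < 5

# Series.mask replaces the values where the condition is True
# and leaves every other value untouched.
df["StudyID"] = df["StudyID"].mask(needs_suffix, df["StudyID"] + ".000")

print(df["StudyID"].tolist())
# ['0000.000', '0001.001', '0002.000', '0003.002']
```

Only the rows flagged by needs_suffix change, which is the "one value at a time" behaviour the question asks for.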
I think by length(StudyID), you meant nchar(StudyID), as @akrun pointed out.
You can do it the dplyr way in Python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep = "."):
    """
    This function loops through the original ID column and splits it into
    multiple parts (multi-IDs) on the defined separator.
    By default, the function assumes the unique ID is formatted like a decimal number.
    The new multi-IDs are appended to new lists.
    If the original ID was formatted like an integer, rather than a decimal,
    the function assumes the latter half of the ID to be "000".
    """
    # Take a copy of the dataframe to modify
    new_df = data.copy()
    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # iterate through the IDs, split and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        i = i.split(sep)
        Family_ID.append(i[0])
        if len(i)==1:
            Family_Index.append("000")
        else:
            Family_Index.append(i[1])
    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID = Family_ID,
                         Family_Index = Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])
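As a small illustration of that merge, here is a sketch with two made-up frames. The age and score columns and all the values are hypothetical; only the Family_ID and Family_Index key names come from the function above.

```python
import pandas as pd

# Hypothetical frames that already carry the two-part ID as strings,
# so leading zeros such as "0001" and "000" are preserved.
df1 = pd.DataFrame({
    "Family_ID": ["0001", "0001", "0002"],
    "Family_Index": ["000", "001", "000"],
    "age": [34, 8, 41],
})
df2 = pd.DataFrame({
    "Family_ID": ["0001", "0002", "0003"],
    "Family_Index": ["001", "000", "000"],
    "score": [0.7, 0.4, 0.9],
})

# Inner join: keep only the rows whose (Family_ID, Family_Index) pair
# appears in both frames.
merged = pd.merge(df1, df2, how="inner", on=["Family_ID", "Family_Index"])

print(merged)
```

Only the pairs ("0001", "001") and ("0002", "000") occur in both frames, so the result has two rows.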

Can't complete cases of a data.frame

I don't need help with the exercise itself, but I do need help with an error that I can't fix.
This is the exercise:
In R the more appropriate indicator for missing data is “NA” (not available). Therefore, replace each occurrence of “?” with “NA”.
a. For this exercise, create an R data frame for the mammographic data using only datapoints that have no missing values. This can be done using the complete.cases function which inputs a data frame and returns a Boolean vector v, where v[i] equals TRUE iff the i-th data-frame sample is complete (meaning it does not possess an NA). For example, if the data-frame is stored in mammogram.frame, then mammogram2.frame = mammogram.frame[complete.cases(mammogram.frame),] creates a new data frame called mammogram2.frame that has all the complete mammogram data samples.
So I coded that:
mammogram = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data",
                       sep=",",
                       col.names=c("Birads","Age","Shape","Margin","Density","Severity"),
                       fill=TRUE,
                       strip.white=TRUE)
#Replace N/A by -1
mammogram2.frame = mammogram.frame[complete.cases(mammogram.frame),]
#Display data frame
mammogram2
However I get this error:
> mammogram2.frame = mammogram.frame[complete.cases(mammogram.frame),]
Error: object 'mammogram.frame' not found
I can't find any solution on the internet. I have tried a lot of things, but the missing values are still '?'.
Thanks
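For readers following the Python examples elsewhere on this page, the two steps the exercise describes (turn "?" into a missing value, then keep only the complete rows) can be sketched in pandas. The sample rows are made up and pandas stands in for R, so this illustrates the logic rather than repairing the R session above. In that session, note that the data was read into mammogram, while the failing line refers to mammogram.frame.

```python
import pandas as pd

# Made-up rows in the shape of the mammographic data; "?" marks a missing value.
mammogram = pd.DataFrame({
    "Birads": ["5", "4", "5", "?"],
    "Age": ["67", "43", "?", "28"],
    "Severity": ["1", "1", "1", "0"],
})

# Step 1: blank out every "?" cell so it becomes a proper missing value (NaN).
mammogram = mammogram.mask(mammogram == "?")

# Step 2: Boolean vector, True where a row has no missing value.
# This plays the role that complete.cases plays in R.
complete = mammogram.notna().all(axis=1)

# Keep only the complete rows.
mammogram2 = mammogram[complete]

print(complete.tolist())  # [True, True, False, False]
print(len(mammogram2))    # 2
```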

Dynamically assign variable names for vectors in R?

I'm new to R and I am trying to create variables referencing vectors within a for loop, where the index of the loop is appended to the variable name. However, the code below, where I try to insert the new vectors into the appropriate place in the larger data frame, is not working, and I've tried many variations of get(), as.vector(), eval(), etc. in the data frame construction function.
I want num_incorrect.8 and num_incorrect.9 to be vectors with a value of 0 and then be inserted into mytable.
cols_to_update <- c(8,9)
for (i in cols_to_update)
{
  #column name of insertion point
  insertion_point <- paste("num_correct",".",i,sep="")
  #create the num_incorrect col -- as a vector of 0s
  assign(paste("num_incorrect",".",i,sep=""), c(0))
  #index of insertion point
  thespot <- which(names(mytable)==insertion_point)
  #insert the num_incorrect vector and rebuild mytable
  mytable <- data.frame(mytable[1:thespot], as.vector(paste("num_incorrect",".",i,sep="")), mytable[(thespot+1):ncol(mytable)])
  #update values
  mytable[paste("num_incorrect",".",i,sep="")] <- mytable[paste("num_tries",".",i,sep="")] - mytable[paste("num_correct",".",i,sep="")]
}
When I look at how the column insertion went, it looks like this:
[626] "num_correct.8"
[627] "as.vector.paste..num_incorrect........i..sep........2"
...
[734] "num_correct.9"
[735] "as.vector.paste..num_incorrect........i..sep........3"
Basically, it looks like it's taking my commands as literal text. The last line of code works as expected and creates new columns at the end of the data frame (since the line before it didn't insert the column into the proper place):
[1224] "num_incorrect.8"
[1225] "num_incorrect.9"
I am kind of out of ideas, so if someone could please give me an explanation of what's wrong and why, and how to fix it, I would appreciate it. Thanks!
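The mistake is in the second-to-last line of code in your loop (not counting comments), where you create the vector and add it to your data frame. There, as.vector(paste(...)) evaluates to the character string "num_incorrect.8", not to the variable of that name, and because the argument is unnamed, data.frame() builds the column name from the text of the expression.
You just need to add the vector and update the name. You can remove the assign call, as it does not create a column-length vector; it just assigns the value 0 to the variable.
Replace that line with the code below and it should work.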
#insert the vector at the desired location
mytable <- data.frame(mytable[1:thespot], newCol = vector(mode='numeric',length = nrow(mytable)), mytable[(thespot+1):ncol(mytable)])
#update the name of new location
names(mytable)[thespot + 1] = paste("num_incorrect",".",i,sep="")
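For comparison with the Python threads on this page, the same "insert a new column right after a named column" step can be sketched in pandas. The table contents are hypothetical; only the column naming scheme is taken from the question.

```python
import pandas as pd

# Hypothetical table using the column naming scheme from the question.
mytable = pd.DataFrame({
    "num_tries.8": [3, 5],
    "num_correct.8": [2, 5],
    "num_tries.9": [4, 1],
    "num_correct.9": [1, 0],
})

for i in [8, 9]:
    # Position immediately after the num_correct.<i> column.
    spot = mytable.columns.get_loc(f"num_correct.{i}") + 1
    # DataFrame.insert adds the named column at that position, in place,
    # so no variable has to be created from a string first.
    mytable.insert(
        spot,
        f"num_incorrect.{i}",
        mytable[f"num_tries.{i}"] - mytable[f"num_correct.{i}"],
    )

print(list(mytable.columns))
```

The column name is passed as an ordinary string argument, which sidesteps the assign()/get() problem entirely.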

Excel vlookup in Julia

I have two arrays in Julia, X = Array{Float64,2} and Y = Array{Float64,2}. I'd like to perform a vlookup as per Excel functionality. I can't seem to find something like this.
The following code returns the first match from a detail matrix, using the related record from a master matrix.
function vlook(master, detail, val)
    val = master[findfirst(x->x==val,master[:,2]),1]
    return detail[findfirst(x->x==val,detail[:,1]),2]
end
julia> vlook(a,b,103)
1005
A more general approach is to use DataFrames.jl for working with tabular data.
VLOOKUP is a popular function amongst Excel users, and has signature:
VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)
I've never much liked that last argument range_lookup. First it's not clear to me what "range_lookup" is intended to mean and second it's an optional argument defaulting to the much-less-likely-to-be-what-you-want value of TRUE for approximate matching, rather than FALSE for exact matching.
So in my attempt to write VLOOKUP equivalents in Julia I've dropped the range_lookup argument and added another argument keycol_index_num to allow for searching of other than the first column of table_array.
WARNING
I'm very new to Julia, so there may be some howlers in the code below. But it seems to work for me. Julia 0.6.4. Also, and as already commented, using DataFrames might be a better solution for looking up values in an array-like structure.
#=
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Procedures: vlookup and vlookup_withfailsafe
Purpose   : Inspired by Excel VLOOKUP. Searches a column of table_array for
            lookup_values and returns the corresponding elements from another column of
            table_array.
Arguments:
    lookup_values: a value or array of values to be searched for inside
        column keycol_index_num of table_array.
    table_array: An array with two dimensions.
    failsafe: a single value. The return contains this value whenever an element
        of lookup_values is not found.
    col_index_num: the number of the column of table_array from which values
        are returned.
    keycol_index_num: the number of the column of table_array to be searched.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=#
vlookup = function(lookup_values, table_array::AbstractArray, col_index_num::Int = 2, keycol_index_num::Int = 1)
    if ndims(table_array)!=2
        error("table_array must have 2 dimensions")
    end
    if isa(lookup_values,AbstractArray)
        indexes = indexin(lookup_values,table_array[:,keycol_index_num])
        # in Julia 0.6, indexin returns 0 for elements that are not found;
        # the comparison must be elementwise (.==) to detect them
        if(any(indexes.==0))
            error("at least one element of lookup_values not found in column $keycol_index_num of table_array")
        end
        return(table_array[indexes,col_index_num])
    else
        index = indexin([lookup_values],table_array[:,keycol_index_num])[1]
        if(index==0)
            error("lookup_values not found in column $keycol_index_num of table_array")
        end
        return(table_array[index,col_index_num])
    end
end

vlookup_withfailsafe = function(lookup_values, table_array::AbstractArray, failsafe, col_index_num::Int = 2, keycol_index_num::Int = 1)
    if ndims(table_array)!=2
        error("table_array must have 2 dimensions")
    end
    if !isa(failsafe,eltype(table_array))
        error("failsafe must be of the same type as the elements of table_array")
    end
    if isa(lookup_values,AbstractArray)
        indexes = indexin(lookup_values,table_array[:,keycol_index_num])
        Result = Array{eltype(table_array)}(size(lookup_values))
        for i in 1:length(lookup_values)
            if(indexes[i]==0)
                Result[i] = failsafe
            else
                Result[i] = table_array[indexes[i],col_index_num]
            end
        end
        return(Result)
    else
        index = indexin([lookup_values],table_array[:,keycol_index_num])[1]
        if index == 0
            return(failsafe)
        else
            return(table_array[index,col_index_num])
        end
    end
end
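For readers more at home in Python, the exact-match lookup described above (no range_lookup argument, a selectable key column, and an optional failsafe) can be sketched as follows. This is a rough analogue under stated assumptions, not a port of the Julia code: the table is a plain list of rows, the sample data is made up, and the _MISSING sentinel is a helper introduced here for illustration.

```python
# Hypothetical sentinel: lets callers pass failsafe=None as a real value.
_MISSING = object()


def vlookup(lookup_value, table, col_index_num=2, keycol_index_num=1,
            failsafe=_MISSING):
    """Exact-match lookup over a list of rows, using 1-based column numbers."""
    for row in table:
        # Compare against the key column; return from the first matching row.
        if row[keycol_index_num - 1] == lookup_value:
            return row[col_index_num - 1]
    # No match: raise unless the caller supplied a failsafe value.
    if failsafe is _MISSING:
        raise KeyError(f"{lookup_value!r} not found in column {keycol_index_num}")
    return failsafe


# Made-up sample table: id, name, price.
table = [
    [101, "apple", 1.5],
    [102, "pear", 2.0],
    [103, "plum", 0.75],
]

print(vlookup(103, table))                  # plum
print(vlookup("pear", table, 3, 2))         # 2.0
print(vlookup(999, table, failsafe="n/a"))  # n/a
```

Folding the failsafe into one function with a sentinel default is a design choice made for brevity here; the Julia answer keeps two separate functions instead.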

data table and data frame operations

I am using some R code that uses a data table class, instead of a data frame class.
How would I do the following operation in R without having to transform map.dt to a map.df?
map.dt = data.table(chr = c("chr1","chr1","chr1","chr2"), ref = c(1,0,3200,3641), pat = c(1,3020,3022, 3642), mat = c(1,0,3021,0))
parent = "mat"
chrom = "chr1"
map.df<-as.data.frame(map.dt);
parent.block.starts<-map.df[map.df$chr == chrom & map.df[,parent] > 0,parent];
Note: parent needs to be set dynamically; it's an input from the user. In this example I chose "mat", but it could be any of the columns.
Note1: parent.block.starts should be a vector of integers.
Note2: map.dt is a data table where the column names are c("chr","ref","pat","mat").
The problem is that in data tables I cannot access a given column by name, or at least I couldn't figure out how.
Please let me know if you have some suggestions!
Thanks!
It's a little unclear what the end goal is here, especially without sample data, but if you want to access columns by character name there are two ways to do this:
Columns = c("A", "B")
# .. means "look up one level"
dt[,..Columns]
dt[,get("A")]
dt[,list(get("A"), get("B"))]
But if you find yourself needing to use this technique often, you're probably using data.table poorly.
EDIT
Based on your edit, this line will return the same result, without having to do any as.data.frame conversion:
> map.dt[chr==chrom & get(parent) > 0, get(parent)]
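For comparison, the same "filter by one column, return another column chosen at run time" operation can be sketched in pandas, using the sample values from the question. The names map_df and parent_block_starts are just Python-friendly spellings of the question's variables.

```python
import pandas as pd

# Sample data copied from the question's map.dt definition.
map_df = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr1", "chr2"],
    "ref": [1, 0, 3200, 3641],
    "pat": [1, 3020, 3022, 3642],
    "mat": [1, 0, 3021, 0],
})

parent = "mat"   # column name supplied at run time
chrom = "chr1"

# Row filter on two conditions, then select the dynamically named column.
parent_block_starts = map_df.loc[
    (map_df["chr"] == chrom) & (map_df[parent] > 0), parent
].tolist()

print(parent_block_starts)  # [1, 3021]
```

Because pandas indexes columns by string, a variable holding the column name can be used directly, with no equivalent of get() needed.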
