Define empty or uninitialized columns in Julia IndexedTables ndsparse

I'm defining an NDSparse table with two columns, "state" and "action", that will be filled later. How should I define these columns upon initialization?
using Pkg
Pkg.add("IndexedTables")
using IndexedTables
start_period = 1 # number of the first period to consider
num_agents = 100 # number of individuals to simulate
periods = [start_period, start_period + 3]
# https://github.com/JuliaData/IndexedTables.jl
df = ndsparse((Identifier = [1, num_agents], Period = periods), (state = [???], action = [???]))
print(df)
Furthermore, while "action" is an integer, "state" is supposed to be an integer array of variable length.
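A minimal sketch of one way to initialize this: give every column a zero-length but typed vector, so rows can be appended later. This assumes ndsparse accepts empty typed vectors; the element types (Int for action, Vector{Int} for state) are taken from the question.
using IndexedTables
# Empty but typed columns: the table starts with no rows, and the element
# types already encode that action is an Int while state is a
# variable-length integer array.
df = ndsparse(
    (Identifier = Int[], Period = Int[]),      # empty index columns
    (state = Vector{Int}[], action = Int[])    # empty value columns
)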

Related

How to get the difference between two rows in a vector layer field

So the field is from a vector layer attribute table. I want to subtract each row value in the field named "Distance" from the previous one, and use the result for other calculations. So essentially I want: row 3 in column 4 minus row 2 in column 4 (same column, different rows). My code is shown below:
fn = 'C:/PLUGINS1/sample/checking.shp'
layer = iface.addVectorLayer(fn, '', 'ogr')
layer = iface.activeLayer()
idx = layer.fields().indexFromName('Distance')
with edit(layer):
    for f in layer.getFeatures():
        dist1 = float(row[2], column[4]) # since row 1 contains the field name
        dist2 = float(row[3], column[4])
        final = abs(dist2 - dist1)
An error appears. I am stuck here.
This really works:
fn = 'C:/PLUGINS1/sample/checking.shp'
layer = iface.addVectorLayer(fn, '', 'ogr') # '' means empty layer name
for i in range(0, 1):
    feat1 = layer.getFeature(i)
    ab = feat1[0] # this is the first column and first row value
for i in range(0, 2):
    feat2 = layer.getFeature(i)
    dk = feat2[0] # this is the first column and second row value
lenggok = (dk - ab) # note: this is the difference between both rows
print(lenggok)
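More generally, a sketch that walks the features once and subtracts each "Distance" value from the previous one; the field name and its numeric type are assumptions taken from the question:
prev = None
for feat in layer.getFeatures():
    dist = float(feat['Distance'])  # field name assumed from the question
    if prev is not None:
        print(abs(dist - prev))     # difference between consecutive rows
    prev = dist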

Dataframe in R, different numbers of rows and columns

I am working with an Excel document, which I import into R as a list. This list consists of multiple dataframe shapes. For instance, when I type
data_list <- import_list("my_doc.xlsx")
I obtain a list with 3 shapes of dataframes: either 1×30, 30×31, or 0×1. As one can imagine, the 0×1 ones are scalar values.
After this, I make a consolidated dataframe as follows:
my_data <- ldply(data_list, data.frame)
my_data <- t(my_data)
colnames(my_data) <- my_data[1, ]
my_data <- my_data[-1, ]
my_data1 <- matrix(as.numeric(unlist(my_data)), nrow = nrow(my_data))
my_data1 <- data.frame(my_data1)
I now obtain a single dataframe, entitled my_data1, with variables appropriately named. However, I lose all scalar variables. Intuitively, one way to go about it would be to identify all the scalars and turn each into a vector that repeats its value and has the same length (i.e. 30) as the other variables. At the moment, they simply disappear.
Any help is much appreciated!
An example of the data structure is as follows: a is the scalar, and b represents an example of the 1×30 variable, truncated here after period 5 (periods 6 to 30 continue in the same pattern).
a <- structure(list(`24` = logical(0)), row.names = character(0), class = "data.frame")
b <- structure(list(period1 = 1, period2 = 2,
                    period3 = 3, period4 = 4,
                    period5 = 5), row.names = 1L, class = "data.frame")
One issue here is that a is stored as logical(0). How can I change this?
Try using dplyr::bind_rows, which keeps the column from the 0×1 dataframe and adds it to the final dataframe filled with NAs.
result <- dplyr::bind_rows(a, b)
result
#   24 period1 period2 period3 period4 period5
# 1 NA       1       2       3       4       5
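If you instead want the recycling behaviour described in the question, a scalar repeated across all 30 rows, a sketch with hypothetical names would be:
scalar_value <- 24  # hypothetical scalar taken from a 0x1 sheet
# broadcast it across every row of the consolidated dataframe
my_data1$scalar_col <- rep(scalar_value, nrow(my_data1))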

Recycling error while using stringdist and data.table in R

I am trying to perform approximate string matching for a data.table containing author names against a dictionary of "first" names. I have also set a high threshold, say above 0.9, to improve the quality of matching.
However, I get an error message given below:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
This warning occurs even if I round the similarity score down to 4 digits using signif(similarity_score, 4).
Some more information about the input data and approach:
The author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is a letters-only representation of the corresponding Author (e.g. if Author = Jack123, then Author_Corrected = Jack).
The Author_Corrected column can contain variations of a proper first name, e.g. Jackk instead of Jack, and I would like to populate the corresponding gender in a column of author_corrected_df called Gender_Dict.
Another data.table called first_names_dict contains 'name' (i.e. the first name) and 'gender' (0 for female, 1 for male, 2 for ties).
I would like to find the most relevant match for each row's "Author_Corrected" with respect to the 'name' in first_names_dict and populate the corresponding gender (one of 0, 1, 2).
To make the string matching more stringent, I use a threshold of 0.9720; later in the code (not shown below), the non-matched values are represented as NA.
The first_names_dict and the author_corrected_df can be accessed from the link below:
https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7
for (ijk in 1:nrow(author_corrected_df)) {
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                            method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1, 4) >= 0.9720) {
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                                    method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}
I would appreciate help in locating the error, and also any pointers to a faster way to perform this sort of matching (the latter is second priority). Thanks in advance.
Following previous comments, here I select the gender most present in your selection:
for (ijk in 1:nrow(author_corrected_df)) {
  sims <- stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                    method = "jw", p = 0.1, nthread = getOption("sd_num_thread"))
  max_sim1 <- max(sims, na.rm = TRUE)
  if (signif(max_sim1, 4) >= 0.9720) {
    row_idx1 <- which.max(sims)
    # gender is a factor; convert to character
    gender <- as.character(first_names_dict$gender[row_idx1])
    # take the (first) gender most present in the selection
    df_count <- as.data.frame(table(gender))
    value <- as.character(df_count$gender[which.max(df_count$Freq)])
    # assigning a single character value to the data frame
    author_corrected_df$Gender_Dict[ijk] <- value
  }
}
Hope this helps :)
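On the speed question, a vectorized sketch using stringdist::amatch could replace the loop entirely; this assumes a Jaro-Winkler similarity of at least 0.9720 is equivalent to a distance of at most 0.028:
library(stringdist)
# one approximate dictionary lookup per author, NA where nothing is close enough
idx <- amatch(author_corrected_df$Author_Corrected, first_names_dict$name,
              method = "jw", p = 0.1, maxDist = 1 - 0.9720)
author_corrected_df$Gender_Dict <- first_names_dict$gender[idx]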

Data Extraction in matrix form in R

I need to make a matrix by extracting certain information from a file called flowdata. I need to extract the meanValue of every row where inputGroup == 5. As the column names I need the corresponding information in the "name" column, and as row names I need the information written in the cell 4 rows above the inputGroup == 5 row [i-4], in this case "Methanol, at plant".
Here is the link to the dropbox that has the data
https://www.dropbox.com/s/x2knuqq1odbt5zg/flowdata01.txt?dl=0
Here is the R code I have:
flowdata <- flowdata01
in.up = length(unique(flowdata$name)) # number of unit processes
in.p = length(unique(flowdata$.attrs[which(flowdata$metadata == 'name')])) # number of inputs/outputs
input.mat = matrix(0, in.p, in.up) # empty matrix
colnames(input.mat) = unique(flowdata$name) # up names
rownames(input.mat) = unique(flowdata$attrs[which(flowdata$metadata == 'name')]) # inputs/outputs names
for (i in 1:nrow(flowdata)){ # for every row in flowdata
  if (flowdata$metadata[i] == "inputGroup" && flowdata$attrs[i] == 5){ # if it is an inputGroup 5
    col.name = flowdata$name[i] # up name
    row.name = flowdata$attrs[i-4] # i/o name 4 cells above
    value = as.numeric(flowdata$attrs[i-5]) # value 5 cells above
    input.mat[row.name, col.name] = value}}
input.mat = input.mat[-which(rowSums(input.mat) == 0), ] # if the row is empty, then the flow was an input or output of no interest
When I run the above R code, I get this error message:
Error in `[<-`(`*tmp*`, row.name, col.name, value = 6397) :
subscript out of bounds
This is how the matrix should look.
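One hedged guess at the subscript error: the code mixes flowdata$.attrs and flowdata$attrs, and whichever of those columns does not exist returns NULL, so the matrix dimnames are never set and indexing by name fails. A defensive sketch of the fill loop under that assumption (column names taken from the question):
attrs <- flowdata$.attrs  # pick one consistent column name
for (i in 1:nrow(flowdata)) {
  if (flowdata$metadata[i] == "inputGroup" && attrs[i] == 5) {
    col.name <- flowdata$name[i]          # unit-process name
    row.name <- attrs[i - 4]              # i/o name 4 cells above
    value <- as.numeric(attrs[i - 5])     # meanValue 5 cells above
    # only assign when both labels exist in the matrix
    if (row.name %in% rownames(input.mat) && col.name %in% colnames(input.mat)) {
      input.mat[row.name, col.name] <- value
    }
  }
}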

Finding sequences in rows in R based on the rep function on a certain column

I'm trying to find a sequence of 0's in each row, where the required length of the sequence comes from the value in a certain column. Below is my best attempt so far, which throws an error. I tried using an apply loop but failed, and I don't really want to use a for loop unless I have to, as my true dataset is about 800,000 rows. I have spent a few hours looking for solutions with no luck. I have also attached the desired output.
library(data.table)
TEST_DF <- data.table(INDEX = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS = c(4,4,2,2))
IN_FUN <- function(x, y)
{
  x <- rle(x)
  if (max(as.numeric(x$lengths[x$values == 0])) >= y)
  {
    "Y"
  }
  else
  {
    "N"
  }
}
TEST_DF$DEFINITION <- apply(TEST_DF[, c(2:5), with = FALSE], 1,
                            FUN = IN_FUN(TEST_DF[, c(2:5), with = FALSE], TEST_DF$DAYS))
DESIRED <- data.table(P_ID = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS = c(4,4,2,2),
                      DEFINITION = c("Y","N","Y","N"),
                      INDEX = c(2,NA,4,NA))
For the first row I want to see if four 0's are within COL_1 to COL_4, four 0's within row 2, and two 0's within rows 3 and 4. Basically the number of consecutive 0's to look for is given by the value in the DAYS column. So since four 0's are within row 1, DEFINITION gets a value of "Y"; row 2 gets "N" since there are only three 0's; row 3 gets "Y" since it contains two consecutive 0's, etc.
Also, if possible, when the DEFINITION column has a value of "Y", I'd like the column index of the first occurrence of the desired sequence, e.g. in row 1 the first 0 of the four we're looking for is in COL_1, which is table column 2, so INDEX gets a value of 2; row 2 gets NA since DEFINITION is "N", etc.
Feel free to make any edits to make it clearer for other users and let me know if you need better information.
Cheers in advance :)
EDIT:
Below is a slightly extended data table. Let me know if this is sufficient.
TEST_DF <- data.table(P_ID = c(1,2,3,4,5,6,7,8,10),
                      COL_1 = c(0,0,0,0,0,0,0,5,90),
                      COL_2 = c(0,0,0,0,0,0,3,78,6),
                      COL_3 = c(0,0,0,0,0,0,7,5,0),
                      COL_4 = c(0,0,0,0,0,5,0,2,0),
                      COL_5 = c(0,0,0,0,0,7,2,0,0),
                      COL_6 = c(0,0,0,0,0,9,0,0,5),
                      COL_7 = c(0,0,0,0,0,1,0,0,6),
                      COL_8 = c(0,0,0,0,0,0,0,1,8),
                      COL_9 = c(0,0,0,0,0,1,6,1,0),
                      COL_10 = c(0,0,0,0,0,0,7,1,0),
                      COL_11 = c(0,0,0,0,0,0,8,3,0),
                      COL_12 = c(0,0,0,0,0,0,9,6,7),
                      DAYS = c(10,8,12,4,5,4,3,4,7))
Here the DEFINITION column would be c(1,1,1,1,1,0,1,0,0), where 1 is "Y" and 0 is "N" (either encoding is fine), and the INDEX column should be c(2,2,2,2,2,NA,7,NA,NA).
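For comparison, a minimal per-row sketch that builds on the rle() idea already in the question; the split/mapply plumbing is my own assumption, not taken from any of the answers below:
check_row <- function(vals, days) {
  r <- rle(vals == 0)                        # runs of zeros in this row
  hit <- which(r$values & r$lengths >= days)[1]
  if (is.na(hit)) return(c(def = 0, idx = NA_integer_))
  # first 0 of the run is COL_(offset+1), i.e. table column offset+2
  c(def = 1, idx = sum(r$lengths[seq_len(hit - 1)]) + 2L)
}
# split the value columns into one vector per row, pair each with its DAYS
res <- t(mapply(check_row,
                split(as.matrix(TEST_DF[, 2:13]), seq_len(nrow(TEST_DF))),
                TEST_DF$DAYS))
TEST_DF$DEFINITION <- res[, "def"]
TEST_DF$INDEX <- res[, "idx"]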
I was able to do this with some math trickery. I created a binary matrix where an element is 1 if it was originally 0, and 0 otherwise. Then, for each row, I set the nth element equal to (the (n-1)th element + the nth element) times the nth element. In this transformed matrix, the value of an element equals the number of consecutive elements up to and including it that were originally 0.
m <- as.matrix(TEST_DF[, 2:(ncol(TEST_DF)-1L)])
m[m == 1] <- 2  # protect original 1s before recoding
m[m == 0] <- 1  # original 0s become 1
m[m != 1] <- 0  # everything else becomes 0
for (i in 2:ncol(m)) {
  m[,i] = (m[,i-1] + m[,i]) * m[,i]  # running count of consecutive zeros
}
# note the use of with=FALSE -- this forces ncol to be evaluated
# outside of TEST_DF, leading the result to be used as a
# column number instead of just evaluating to a scalar
m <- as.matrix(cbind(m, Days = TEST_DF[, ncol(TEST_DF), with = FALSE]))
indx <- apply(m[, -ncol(m)] >= m[, ncol(m)], 1, function(x) match(TRUE, x))
TEST_DF$DEFINITION <- ifelse(is.na(indx), 0, 1)
TEST_DF$INDEX <- indx - TEST_DF$DAYS + 2
Note: I stole some stuff from this post
I think I understand this better now that the question has been edited. This has loops, so it might not be optimal speed-wise, but the set statement should help with that; it still retains some of the speed-up that data.table provides.
# Combine all column values into one giant string
TEST_DF[, COL_STRING := paste(COL_1, COL_2, COL_3, COL_4, COL_5, COL_6, COL_7, COL_8, COL_9, COL_10, COL_11, COL_12, sep = ",")]
TEST_DF[, COL_STRING := paste0(COL_STRING, ",")]
# Using the DAYS variable, create the string to be searched for
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "FIND", value = paste(rep("0,", TEST_DF[i]$DAYS), sep = "", collapse = ""))
# Find where the pattern starts. A value of -1 means it does not exist
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "INDEX", value = regexpr(TEST_DF[i]$FIND, TEST_DF[i]$COL_STRING, fixed = TRUE)[1])
# Define DEFINITION
TEST_DF[, DEFINITION := 1*(INDEX != -1)]
# Convert the character position into a column position by counting commas
require(stringr)
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "INDEX", value = str_count(substr(TEST_DF[i]$COL_STRING, 1, TEST_DF[i]$INDEX), ","))
# Clean up variables
TEST_DF[, INDEX := INDEX + DEFINITION*2L]
TEST_DF[INDEX == 0L, INDEX := NA_integer_]
You might explore the IRanges package. I just defined the test dataset as a data.frame, since I am not familiar with data.table, and then expanded it to your dataset size of 800,000 rows:
TEST_DF <- TEST_DF[sample(nrow(TEST_DF), 800000, replace=TRUE),]
Then, we put IRanges to work:
library(IRanges)
m <- t(as.matrix(TEST_DF[,2:13]))
l <- relist(Rle(m), PartitioningByWidth(rep(nrow(m), ncol(m))))
r <- ranges(l)
validRuns <- width(r) >= TEST_DF$DAYS
TEST_DF$DEFINITION <- sum(validRuns) > 0
TEST_DF$INDEX <- drop(phead(start(r)[validRuns], 1)) + 1L
The first step simplifies the table to a matrix, so we can transpose and get things in the right layout for a light-weight partitioning (PartitioningByWidth) of the data into a type of list. The data are converted into a run-length encoding (Rle) along the way, which finds the runs of zeros in each row. We can extract the ranges representing the runs and then compute on them more efficiently than we might on the split Rle directly. We find the runs that meet or exceed the DAYS and record which groups (rows) have at least one such run. Finally, we find the start of the valid runs, take the first start for each group with phead, and drop so that those with no runs become NA.
For 800,000 rows, this takes about 4 seconds. If that's not fast enough, we can work on optimization.
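A tiny illustration of the Rle/ranges step on a single row, assuming ranges() accepts a plain Rle the same way it accepts the RleList above:
library(IRanges)
x <- Rle(c(0, 0, 0, 5, 0, 0))  # one row with two runs of zeros
r <- ranges(x)                 # one range per run
width(r)                       # 3 1 2
start(r)                       # 1 4 5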
