Data Extraction in matrix form in R - r

I need to build a matrix by extracting certain information from a data set called flowdata. Specifically, I need the meanValue of every entry where inputGroup == 5. The column names should come from the corresponding entries in the "name" column, and the row names should come from the cell four rows above the "inputGroup == 5" row (i.e. row i-4), which in this case is "Methanol, at plant".
Here is the link to the dropbox that has the data
https://www.dropbox.com/s/x2knuqq1odbt5zg/flowdata01.txt?dl=0
Here is the R code I have:
flowdata <-flowdata01
in.up = length(unique(flowdata$name)) # number of unit processes
in.p = length(unique(flowdata$.attrs[which(flowdata$metadata=='name')])) # number of inputs/outputs
input.mat = matrix(0, in.p,in.up) #empty matrix
colnames(input.mat) = unique(flowdata$name) # up names
rownames(input.mat) = unique(flowdata$attrs[which(flowdata$metadata=='name')]) # inputs/outputs names
for (i in 1:nrow(flowdata)){ # for every row in flowdata
  if (flowdata$metadata[i]=="inputGroup" && flowdata$attrs[i] == 5){ # if it is an inputGroup 5
    col.name = flowdata$name[i] # up name
    row.name = flowdata$attrs[i-4] # i/o name 4 cells above
    value = as.numeric(flowdata$attrs[i-5]) # value 5 cells above
    input.mat[row.name,col.name] = value
  }
}
input.mat = input.mat[-which(rowSums(input.mat)==0),] # if the row is empty, then the flow was an input or output of no interest
When I run the above R code, I get this error message:
Error in `[<-`(`*tmp*`, row.name, col.name, value = 6397) :
subscript out of bounds
This is what the matrix should look like.
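For readers hitting the same message, here is a minimal sketch of how "subscript out of bounds" can arise when assigning into a matrix by name. It uses a hypothetical toy matrix (the names in_a, up_1 and so on are placeholders, not values from flowdata01), and it shows only one possible cause, not a diagnosis of the code above.

```r
# Hypothetical toy matrix; the dimnames are placeholders, not values from flowdata01.
m <- matrix(0, nrow = 2, ncol = 2,
            dimnames = list(c("in_a", "in_b"), c("up_1", "up_2")))

# Assigning to a row name that is not in rownames(m) raises the error.
msg <- tryCatch({
  m["in_c", "up_1"] <- 1
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# A defensive variant: only assign when both names are known to the matrix.
row_name <- "in_c"
col_name <- "up_1"
if (row_name %in% rownames(m) && col_name %in% colnames(m)) {
  m[row_name, col_name] <- 1
}
```

If this is what happens in the loop, printing the row.name and col.name values that are missing from rownames(input.mat) and colnames(input.mat) should show which rows of flowdata are responsible.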


How to get the difference between two rows in a vector layer field

The field comes from a vector layer attribute table. For the field named "Distance", I want to subtract each row's value from the value in the next row, so that I can use the result in other calculations. In other words: row 3 of column 4 minus row 2 of column 4 (same column, consecutive rows). My code is shown below:
fn = 'C:/PLUGINS1/sample/checking.shp'
layer = iface.addVectorLayer(fn, '', 'ogr')
layer = iface.activeLayer()
idx = layer.fields().indexFromName('Distance')
with edit(layer):
    for f in layer.getFeatures():
        dist1 = float(row[2], column [4]) # since row 1 contains the field name
        dist2 = float(row[3], column [4])
        final = abs(dist2 - dist1)
An error appears, and I am stuck here.
This really works:
fn = 'C:/PLUGINS1/sample/checking.shp'
layer = iface.addVectorLayer(fn, '', 'ogr') # '' means empty layer name
for i in range(0, 1):
    feat1 = layer.getFeature(i)
    ab = feat1[0] # this is the first column and first row value
for i in range(0, 2):
    feat2 = layer.getFeature(i)
    dk = feat2[0] # this is the first column and second row value
lenggok = (dk - ab) # Note this is the difference between both rows
print(lenggok)

Recycling error while using stringdist and data.table in R

I am trying to perform approximate string matching on a data.table containing author names, based on a dictionary of "first" names. I have also set a high threshold, above 0.9, to improve the quality of the matching.
However, I get the warning message below:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
This warning occurs even if I round the similarity score to 4 significant digits using signif(similarity_score,4).
Some more information about the input data and approach:
author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is the alphabetic part of the corresponding Author (e.g. if Author = Jack123, then Author_Corrected = Jack).
The Author_Corrected column can contain variations of a proper first name, e.g. Jackk instead of Jack, and I would like to populate the corresponding gender in a column of author_corrected_df called Gender_Dict.
Another data.table called first_names_dict contains the 'name' (i.e. first name) and gender (0 for female, 1 for male, 2 for ties).
For each row, I would like to find the closest match to "Author_Corrected" among the 'name' values in first_names_dict and populate the corresponding gender (one of 0, 1, 2).
To make the string matching more stringent, I use a threshold of 0.9720. Values that do not reach it are left unmatched and are represented as NA later in the code (not shown below).
The first_names_dict and the author_corrected_df can be accessed from the link below:
https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7
for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}
During execution I get the following warning message:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
I would appreciate help in finding where the error lies and, as a second priority, in finding a faster way to perform this sort of matching.
Thanks in advance.
Following the previous comments, here I select the gender that occurs most often in your selection:
for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    # Analysis of factor gender
    gender <- as.character( first_names_dict$gender[row_idx1] )
    # Take the (first) gender that occurs most often in the selection
    df_count <- as.data.frame( table(gender) )
    ref <- as.character ( df_count$gender[which.max(df_count$Freq)] )
    value <- unique ( gender[which(gender == ref)] )
    # Assign the single character value to the data frame
    author_corrected_df$Gender_Dict[ijk] <- value
  }
}
Hope this helps :)
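One possible reading of the warning, offered as an assumption rather than a confirmed diagnosis: if the Gender_Dict column does not exist before the loop starts, the first indexed assignment builds a short vector whose length equals the current row index, and that short vector is then recycled over all rows. The base-R sketch below uses hypothetical stand-in vectors (grown, prealloc, and the row count n_rows are illustrative names, not part of the question) to show the growth and how pre-allocating avoids it.

```r
# 'grown' stands in for a column that has not been created yet (NULL).
grown <- NULL
grown[6] <- 1L                 # suppose the first match happens at row 6
print(length(grown))           # the vector now has 6 elements, the first 5 are NA

# Pre-allocating one NA per row keeps the length fixed at the row count.
n_rows <- 10                   # hypothetical row count
prealloc <- rep(NA_integer_, n_rows)
prealloc[6] <- 1L
print(length(prealloc))
```

In the data.table from the question, the equivalent step would be to create the column before the loop, for example with author_corrected_df[, Gender_Dict := NA_integer_], assuming gender is stored as an integer.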

Using the apply function over each column for adjusting of data.frame

I want to collapse columns 14:18 into a single column, "Type". For each observation, the new column should hold the name of whichever of the 5 columns contains a 1 (only one of them can be true). This is my best attempt at doing this in R, and I am beyond frustrated.
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
carSubset <- apply(carSubset[,types],
                   2,
                   function(each_obs){
                     hit_index <- which(each_obs == 1)
                     carSubset$Type <- types[hit_index]
                   })
head(carSubset) # output:
            1             2             3             4             5
      "sedan"       "coupe" "convertible" "convertible" "convertible"
This is what I wanted. However, I also wanted the rest of my data.frame to come along with it: I just wanted to add the new column "Type", but I cannot even access it with the following line of code.
head(carSubset$Type) # output: Error in carSubset$Type : $ operator is invalid for atomic vectors
How can I add a new column dynamically while keeping it aligned with the observations it was derived from?
I actually figured it out! Probably not the best way to do it, but hey, it works.
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
head(carSubset[,types])
# into 1 column named Type, holding the corresponding column name
# apply over rows (MARGIN = 1): for each observation, find which type column equals 1
newSubset <- apply(carSubset[,types],
                   1,
                   function(obs){
                     hit_index <- which(obs == 1)
                     types[hit_index]
                   })
newSubset
# add the result as a plain vector column, keeping the rest of the data.frame
carSubset$Type <- newSubset
head(carSubset[, !(names(carSubset) %in% types)])
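As an alternative to apply, the same collapse can be sketched with base R's max.col(), which returns the index of the largest value in each row. The example below uses a small hypothetical data frame (toy, with made-up Price values) in place of carSubset, and assumes exactly one indicator column is set per row.

```r
# Hypothetical one-hot data standing in for the type columns of carSubset.
types <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
toy <- data.frame(
  Price       = c(100, 200, 300),
  convertible = c(0, 0, 1),
  coupe       = c(0, 1, 0),
  hatchback   = c(0, 0, 0),
  sedan       = c(1, 0, 0),
  wagon       = c(0, 0, 0)
)

# max.col() gives, for each row, the index of the column holding the maximum,
# which is the column containing the 1 when exactly one indicator is set.
toy$Type <- types[max.col(as.matrix(toy[, types]), ties.method = "first")]

# Drop the indicator columns and keep everything else, including Type.
toy <- toy[, !(names(toy) %in% types)]
print(toy)
```

Because the result is assigned as a new column rather than replacing the data frame, the other columns stay attached to their rows.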

Error in data table: Item has no length? - R

I have an R script containing a function that I received in an answer to this question: R: For loop nested in for loop.
The script has been working fine on the first part of my data set. I am now trying to use it on another part which, as far as I can tell, has exactly the same format as the first, but for some reason I get an error. I cannot figure out what causes it.
This is the script I am using:
require(data.table)
MappingTable_Calibrated = read.csv2(file.choose(), header=TRUE)
head(MappingTable_Calibrated)
# The data is sorted primarily by Scaffold number in ascending order, and secondarily by Cal_Startgen in ascending order.
MappingTable_Calibratedord = MappingTable_Calibrated[order(MappingTable_Calibrated$Scaffold, MappingTable_Calibrated$Cal_Startgen),]
head(MappingTable_Calibratedord)
dt <- data.table(MappingTable_Calibratedord, key = "Scaffold,Cal_Startgen")
head(dt)
# The following function creates pairs of loci for each scaffold.
# The function is a modified version of a function retrieved from http://www.stackoverflow.com
fn = function(dtIn, id){
  # Creates the object dtHead containing all lines of dtIn except the last one
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  # The names of dtHead are appended with _a. paste0 is short for: paste(x, sep="")
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  # Creates the object dtTail containing all lines of dtIn except the first one
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  # The names of dtTail are appended with _b.
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  # dtHead and dtTail are combined. Scaffold is defined as id. The blank column "Pairwise_Distance" is added to the table.
  cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0)
}
# The function is run on the data. .SDcols defines the columns to be included in the output.
output = dt[, fn(.SD, Scaffold), by = Scaffold, .SDcols = c("Name", "Startpos", "Endpos", "Rev", "Startgen", "Endgen", "Cal_Startgen", "Cal_Endgen", "Length")]
output = as.data.frame(output[, with = FALSE])
But when trying to create "output" I get the following error:
Error in data.table(..., key = key(..1)) : Item 1 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 2 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
dt looks like this:
         Name Length Startpos Endpos Scaffold Startgen Endgen Rev Match Cal_Startgen Cal_Endgen
1: Locus_7173    144        0    144       34   101196 101340   1     1       101196     101340
2:  Locus_133    110        0    110       34   223659 223776   1     1       223659     223776
3: Locus_2746    161        0     89       65   101415 101504   1     1       101415     101576
A full dput of "dt" can be found here: https://www.dropbox.com/sh/3j4i04s2rg6b63h/AADkWG3OcsutTiSsyTl8L2Vda?dl=0
Start by tracking down the data that cause the error:
fn = function(dtIn, id){
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  # return NULL instead of stopping, then drop into the debugger for the failing group
  r <- tryCatch(cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0), error = function(e) NULL)
  if(is.null(r)) browser()
  r
}
Then you can see you are trying to cbind elements of different nrow/length:
Browse[1]> dtHead
Empty data.table (0 rows) of 9 cols: Name_a,Startpos_a,Endpos_a,Rev_a,Startgen_a,Endgen_a...
Browse[1]> dtTail
Empty data.table (0 rows) of 9 cols: Name_b,Startpos_b,Endpos_b,Rev_b,Startgen_b,Endgen_b...
Browse[1]> id
[1] 76
Browse[1]> 0
[1] 0
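This is not allowed: a scaffold with a single row leaves dtHead and dtTail with 0 rows, while id and 0 each have length 1.
I recommend adding an if(nrow(...)) check or something similar, and then adding the columns id = integer() and Pairwise_Distance = numeric() for the nrow = 0 cases.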
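The idea of giving every piece the same (possibly zero) number of rows can be sketched in base R. The helper pair_rows() below is a hypothetical analogue of fn() written for a plain data.frame, not the data.table code from the question, and the two small input frames are made-up examples.

```r
# Hypothetical base-R analogue of fn(): pairs each row with the next one.
pair_rows <- function(df_in, id) {
  n_pairs <- max(nrow(df_in) - 1, 0)   # a one-row group yields zero pairs
  df_head <- head(df_in, n = n_pairs)
  df_tail <- tail(df_in, n = n_pairs)
  names(df_head) <- paste0(names(df_head), "_a")
  names(df_tail) <- paste0(names(df_tail), "_b")
  # rep(..., n_pairs) gives zero-length columns when there is nothing to pair,
  # so every piece has the same number of rows.
  data.frame(df_head, df_tail,
             Scaffold = rep(id, n_pairs),
             Pairwise_Distance = rep(0, n_pairs))
}

one_row  <- data.frame(Name = "Locus_2746", Startpos = 0)
two_rows <- data.frame(Name = c("Locus_7173", "Locus_133"), Startpos = c(0, 0))

print(nrow(pair_rows(one_row, 65)))   # single-locus scaffold: no pairs
print(nrow(pair_rows(two_rows, 34)))  # two loci: one pair
```

A single-locus scaffold then contributes an empty result with the expected columns instead of raising an error, which is the behaviour the recommendation above aims for.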

R: Extracting data from a data frame for analysis

I am trying to extract data from a data frame for analysis.
heightweight <- function(person, health) {
  ## Read in data
  data <- read.csv("heightweight.csv", header = TRUE,
                   colClasses = "character")
  ## Check that the outcomes are valid
  measure = c("height", "weight")
  if(health %in% measure == FALSE){
    stop("Valid inputs are height and weight")
  }
  ## Truncate the data matrix to only what columns are needed
  data <- data[c(1, 5, 7)]
  ## Rename columns
  names(data)[1] <- "Name"
  names(data)[2] <- "Height"
  names(data)[3] <- "Weight"
  ## Convert numeric columns to numeric
  data[, 2] <- as.numeric(data[, 3])
  data[, 3] <- as.numeric(data[, 4])
  ## Convert NAs to 0 after coercion
  data[is.na(data)] <- 0
  ## Check that the name is valid
  name <- data[, 1]
  name <- unique(name)
  if(person %in% name == FALSE){
    stop("Invalid person")
  }
  ## Return person with lowest height or weight
  list <- data[data$name == person & data[health],]
  outcomes <- list[, health]
  minumum <- which.min(outcomes)
  ## Min Rate
  minimum[rowNum, ]$name
}
The problem occurs at this line:
list <- data[data$name == person & data[health],]
That is, when I run heightweight("Bob", "weight"), I get the following message:
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
I have Googled this message and checked out some threads here but can't determine what the problem is.
Unless I'm missing something, if you only need the lowest weight or height for a given name, the last three lines of code are a bit redundant.
Here's a simple way to get the minimum health measurement for a given person:
min(data[data$name==person, "height"])
The first part selects only the rows of data that correspond to that person, it acts as a row index. The second part, after the comma, selects only the desired variable (column). Once you have selected the desired data, you look for the minimum in that subset of the data.
An example to illustrate the result:
data<-data.frame(name=as.character(c(rep("carlos",2),rep("marta",3),rep("johny",2),"sara")))
set.seed(1)
data$height <- rnorm(8,68,3)
data$weight <- rnorm(8,160,10)
The corresponding data frame:
    name   height   weight
1 carlos 66.12064 165.7578
2 carlos 68.55093 156.9461
3  marta 65.49311 175.1178
4  marta 72.78584 163.8984
5  marta 68.98852 153.7876
6  johny 65.53859 137.8530
7  johny 69.46229 171.2493
8   sara 70.21497 159.5507
Let's say we want the minimum weight for marta:
person <- "marta"
health <- "weight"
The minimum "weight" for "marta" is,
min(data[data$name==person,health])
which gives the desired result:
[1] 153.7876
Here is a simplified analogue of your function:
heightweight <- function(person,health) {
  data.set <- data.frame(names=rep(letters[1:5],each=3),height=171:185,weight=seq(95,81,by=-1))
  d1 <- data.set[data.set$names == person,]
  d2 <- d1[d1[,health]==min(d1[,health]),]
  d2[,c('names',health)]
}
The first line produces a sample data set. The second line selects all records for a given person. The third line finds the record with the minimum value of health, and the last line returns the name together with that value.
heightweight('b','height')
#   names height
# 4     b    174
