Split comma- and pound-separated strings into different columns in R


I have a data frame, one column of which contains comma- and pound-separated strings.
data$col1
col1
1: 3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]
2: 3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
3: 4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0]
4: 2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
5: 3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
6: 5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
I want to extract the elements of each string and split them into different columns.
col1 col2 col3 EVS1 ... EVS12
3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1] 3 Tier_III_Uncertain 1 1
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0] 3 Tier_III_Uncertain 0 0
4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0] 4 Tier_III_Uncertain 0 0
2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] 2 Tier_IV_benign 0 0
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0] 3 Tier_III_Uncertain 0 0
5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1] 5 Tier_III_Uncertain 0 1

read.table(text=gsub("[^A-Za-z_0-9-]", " ", data$col1),
col.names = c(paste0('col', 2:4), paste0('EVS', 1:12)))[-3]
col2 col3 EVS1 EVS2 EVS3 EVS4 EVS5 EVS6 EVS7 EVS8 EVS9 EVS10 EVS11 EVS12
1 3 Tier_III_Uncertain 1 0 0 1 0 0 0 0 0 -1 1 1
2 3 Tier_III_Uncertain 0 0 0 1 0 0 0 0 0 1 1 0
3 4 Tier_III_Uncertain 0 0 0 1 0 0 0 0 2 0 1 0
4 2 Tier_IV_benign 0 0 0 1 0 0 0 0 0 0 1 0
5 3 Tier_III_Uncertain 0 0 0 1 0 0 0 0 1 0 1 0
6 5 Tier_III_Uncertain 0 0 1 1 0 0 0 0 1 0 1 1

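Assuming DT is the data.table shown reproducibly in the Note at the end, replace EVS= and every character that is not a letter, digit, underscore or minus sign with a space (a plain \\W would also strip the minus sign from -1). Then read the result using fread and set the names. Finally, cbind DT to it.
DT2 <- fread(text = gsub("EVS=|[^A-Za-z0-9_-]", " ", DT$col1))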
names(DT2) <- c("col2", "col3", paste0("EVS", 1:(ncol(DT2)-2)))
cbind(DT, DT2)
Note
library(data.table)
L <- "3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0]
2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]"
DT <- data.table(col1 = trimws(readLines(textConnection(L))))
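For comparison, here is a minimal base R sketch that pulls each piece out explicitly instead of blanking characters, so the sign of -1 is never at risk. The names x, inside, evs and out are illustrative only, and the sketch assumes every string follows the pattern number#label EVS=[comma-separated integers].

```r
# Two sample strings in the same layout as data$col1
x <- c("3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]",
       "2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]")

# Number before the pound sign
col2 <- as.integer(sub("#.*", "", x))

# Label between the pound sign and the first space
col3 <- sub("^[^#]*#([^ ]+).*", "\\1", x)

# Text between the square brackets, split on commas
inside <- sub(".*\\[(.*)\\].*", "\\1", x)
evs <- do.call(rbind, lapply(strsplit(inside, ",[ ]*"), as.integer))
colnames(evs) <- paste0("EVS", seq_len(ncol(evs)))

out <- data.frame(col2, col3, evs)
print(out)
```

This is more verbose than the one-liners above, but each column is produced by its own pattern, which makes it easier to adapt if the layout changes.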

Related

Counting elements inside a matrix

I'm generating random matrices filled with zero and ones. The dimension of them might be different for each simulation.
An example matrix below
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 1 0 0
[2,] 0 1 1 0 0 0 0 0 0 0
[3,] 0 0 0 0 1 0 0 0 0 1
[4,] 0 1 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 1 0 0 0 0 1
[6,] 1 0 1 0 0 0 1 1 1 0
[7,] 0 0 0 0 0 0 1 1 0 0
[8,] 0 0 0 0 0 0 0 0 0 0
[9,] 0 0 1 0 0 1 0 0 1 1
[10,] 0 0 0 0 0 0 0 1 0 0
Dput version.
structure(c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0), .Dim = c(10L,
10L))
I would like to compute two things:
the number of clusters formed by ones (by cluster we mean a set of adjacent ones, where the elements on the diagonal are not adjacent),
the number of ones within each cluster.
I think I managed to solve the first point with this function
library(raster)
count_clusters <- function(grid) {
attr(clump(raster(grid), direc=4), 'data')@max
}
This function would return 14 for the matrix above which is correct.
Unfortunately I don't know how to solve the second task. The needed function should return the following output: c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 5).
I would appreciate any hints or tips.

To compute the number of ones within each cluster:
library(raster)
library(dplyr)
grid <- structure(c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0), .Dim = c(10L,
10L))
x <- clump(raster(grid), direc=4)
Get the values from the RasterLayer slot @data@values:
vals <- x@data@values
Create a data frame with the values:
dt <- tibble(cluster = vals)
Remove NA values, group by cluster and count:
result <- dt %>%
  filter(!is.na(cluster)) %>%
  group_by(cluster) %>%
  tally()
result$n
[1] 1 2 1 1 1 1 1 1 1 5 1 1 2 1
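If installing raster is not an option, the same counts can be obtained with a plain flood fill. The following is a minimal base R sketch, not the method used above: cluster_sizes is a hypothetical helper name, and it assumes a numeric 0/1 matrix with 4-neighbour adjacency (no diagonals), as in the question.

```r
# The example matrix from the question, entered row by row
grid <- matrix(c(
  0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
  0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
  0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
  1, 0, 1, 0, 0, 0, 1, 1, 1, 0,
  0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  nrow = 10, byrow = TRUE)

# Hypothetical helper: sizes of 4-connected clusters of ones, sorted ascending
cluster_sizes <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  seen <- matrix(FALSE, nr, nc)
  sizes <- integer(0)
  for (start in which(m == 1)) {
    if (seen[start]) next
    stack <- start              # cells waiting to be visited (linear indices)
    seen[start] <- TRUE
    size <- 0L
    while (length(stack) > 0) {
      cell <- stack[length(stack)]
      stack <- stack[-length(stack)]
      size <- size + 1L
      i <- (cell - 1L) %% nr + 1L    # row of the current cell
      j <- (cell - 1L) %/% nr + 1L   # column of the current cell
      nb <- rbind(c(i - 1L, j), c(i + 1L, j), c(i, j - 1L), c(i, j + 1L))
      ok <- nb[, 1] >= 1 & nb[, 1] <= nr & nb[, 2] >= 1 & nb[, 2] <= nc
      for (k in which(ok)) {
        idx <- (nb[k, 2] - 1L) * nr + nb[k, 1]
        if (m[idx] == 1 && !seen[idx]) {
          seen[idx] <- TRUE
          stack <- c(stack, idx)
        }
      }
    }
    sizes <- c(sizes, size)
  }
  sort(sizes)
}

sizes <- cluster_sizes(grid)
print(length(sizes))  # number of clusters
print(sizes)          # ones per cluster
```

The loop is slower than clump for large matrices, but it has no dependencies and returns both the cluster count and the sizes in one pass.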

Create a new variable based on other columns values

I have a panel-data data frame, something like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
I want to generate a new dummy variable that takes the value 1 if the row contains a 1 in any of the three status columns, and 0 otherwise. It should end up like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"Final_status" = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
Can anyone help me achieve this?

We can use if_any on the columns whose names start with 'Status' (selected via starts_with) to check for any 1 in a row. It returns TRUE if there is one and FALSE otherwise, which is coerced to binary with as.integer or the unary +.
library(dplyr)
df %>%
mutate(Final_status = +(if_any(starts_with('Status'), ~ . ==1)))
-output
id Status_2014 Status_2015 Status_2016 Final_status
1 1 1 0 0 1
2 1 1 0 0 1
3 1 1 0 0 1
4 1 1 0 0 1
5 2 0 1 0 1
6 2 0 1 0 1
7 2 0 1 0 1
8 2 0 1 0 1
9 3 0 0 0 0
10 3 0 0 0 0
11 3 0 0 0 0
12 3 0 0 0 0
Or using rowSums from base R
df$Final_status <- +(rowSums(df[-1] > 0) > 0)

You can write an if condition that defines the variable as 1 or 0; the most straightforward way to apply it is probably a dplyr pipe.
I don't have the dplyr syntax in my head (it has been too long since I used it), but dplyr is what you want.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
best greetings
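The if-condition idea can be sketched without dplyr using the vectorised ifelse from base R. This is only one possible reading of the suggestion above; status_cols and any_one are illustrative names, and the data frame is a compact rebuild of the one in the question.

```r
# Same values as the question's df, built with rep() for brevity
df <- data.frame(id = rep(1:3, each = 4),
                 Status_2014 = rep(c(1, 0, 0), each = 4),
                 Status_2015 = rep(c(0, 1, 0), each = 4),
                 Status_2016 = 0)

# Columns to inspect: everything whose name starts with "Status_"
status_cols <- grep("^Status_", names(df), value = TRUE)

# TRUE for rows holding at least one 1 in the status columns
any_one <- unname(rowSums(df[status_cols] == 1) > 0)

# The "if condition": 1 when the row has a 1 somewhere, else 0
df$Final_status <- ifelse(any_one, 1, 0)
print(df)
```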

separate long vectors into individual numbers in a data frame

I have a data frame of 1000 vectors which are all similar to this 001010.... etc.
I'm trying to create a data frame where each vector is a column and each row is a single number from the vector.
So my first vector would be:
vector1
0
0
1
0
1
0
...
This is what I've tried so far but I haven't gotten it working yet.
text <- data_frame()
for (i in 1:length(text_vector_data)){
for (digit in i){
text_df <- rbind(digit, text)}
}
The output of str(text_vector_data) is
tibble [2,225 × 1] (S3: tbl_df/tbl/data.frame)
$ wordcountvec: chr [1:2225] "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ ...

Maybe you can try strsplit like below
> data.frame(setNames(strsplit(v, ""), paste0("V", seq_along(v))))
V1 V2 V3
1 0 1 0
2 0 0 0
3 1 1 0
4 0 1 1
5 1 0 0
6 0 0 1
Dummy Data
v <- c("001010", "101100", "000101")

Another option is read.fwf
read.fwf(textConnection(v), widths = rep(1, nchar(v[1])))
# V1 V2 V3 V4 V5 V6
#1 0 0 1 0 1 0
#2 1 0 1 1 0 0
#3 0 0 0 1 0 1
and to return the transpose
as.data.frame(t(read.fwf(textConnection(v), widths = rep(1, nchar(v[1])))))
data
v <- c("001010", "101100", "000101")
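Note that the str() output in the question shows strings of the form "[0, 0, 0, ...]" rather than "001010". A minimal sketch for that layout follows, assuming every string holds the same number of comma-separated integers; the three short strings are made-up stand-ins for the truncated real data.

```r
# Made-up stand-ins for the wordcountvec column
wordcountvec <- c("[0, 0, 1, 0]", "[1, 0, 1, 1]", "[0, 2, 0, 1]")

# Drop the square brackets, then split on commas
cleaned <- gsub("\\[|\\]", "", wordcountvec)
digits <- lapply(strsplit(cleaned, ",[ ]*"), as.integer)

# One column per original string, one row per number
names(digits) <- paste0("vector", seq_along(digits))
text_df <- as.data.frame(digits)
print(text_df)
```

If the strings differ in length, as.data.frame will fail, and the shorter vectors would need padding with NA first.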

For the given combination in a data frame, calculate the frequency of occurrence of that combination in another data frame in R

I have a data frame that holds various combinations, as follows:
structure(list(`Q1` = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0), `Q2` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `Q3` = c(0, 1, 0, 0, 0, 1, 1, 0, 0,
0), `Q4` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `Q5` = c(0, 0, 1, 0,
0, 1, 0, 1, 1, 0), `Q6` = c(1, 1, 0, 1, 1, 0, 0, 1, 1, 1), `Q7` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `Q8` = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
1), `Q9` = c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0), `Q10` = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), `Q11` = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
`Q12` = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 1)), row.names = c(NA,
-10L), class = "data.frame")
I also have a base data frame with different combinations and a weightage for each combination.
structure(list(Q1 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1), Q2 = c(0,
1, 1, 0, 0, 0, 0, 0, 0, 0), Q3 = c(1, 0, 0, 1, 0, 0, 0, 0, 0,
0), Q4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Q5 = c(1, 0, 1, 0,
0, 0, 1, 0, 0, 1), Q6 = c(1, 1, 1, 0, 1, 0, 0, 1, 0, 1), Q7 = c(0,
0, 1, 1, 1, 0, 0, 0, 0, 0), Q8 = c(1, 0, 1, 0, 0, 1, 0, 0, 0,
0), Q9 = c(1, 0, 0, 0, 0, 0, 0, 1, 1, 0), Q10 = c(0, 0, 1, 0,
0, 1, 0, 0, 0, 0), Q11 = c(0, 0, 1, 0, 0, 1, 0, 0, 0, 0), Q12 = c(1,
0, 0, 0, 1, 0, 1, 0, 0, 0), RatingBinary = c(1, 1, 0, 1, 0, 1,
0, 1, 1, 1)), row.names = c(NA, 10L), class = "data.frame")
The problem: for each combination of 1's in the first data frame (i.e. Q6, Q9, Q12 in the 1st row; Q3, Q6, Q12 in the 2nd row), I need the number of rows in the base data frame that satisfy it.
For example: in the combination data frame (1st df), Q6, Q9 and Q12 have the binary value 1 in the 1st row. I need to count the rows of the base data in which this combination (Q6, Q9 and Q12 all equal to 1) occurs, and how many of those rows have RatingBinary 0 and how many have 1.
How can I get this implemented in R? Can anyone suggest a suitable solution for this scenario?

Here's an algorithmic approach.
Let's call a set in the first data frame a combo set; this is the set of three questions marked 1 in a given row. Let's also call a set in the base data a base set; this is the set of questions marked 1 in a given base row, and we want to know whether a given combo set is contained in it.
The approach is essentially to iterate through each combo set and find matches over all base sets. Sets seem to only be in threes, so I take advantage of that by hard coding a sum == 3 rather than doing an agnostic match. We store matches in a structure I call pair. A match is indicated by a 1. I define pair(x,y) where x is the row number of the combo data set and y is the row number of base dataset.
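# df: combination data frame, df2: base data frame
pair <- matrix(nrow = nrow(df), ncol = nrow(df2))
for(i in 1:nrow(df)) {
  ind <- which(df[i,] == 1)        # columns set to 1 in combo set i
  for(j in 1:nrow(df2)) {
    if(sum(df2[j, ind]) == 3){     # all three questions are 1 in base set j
      pair[i,j] <- 1
    } else {
      pair[i,j] <- 0
    }
  }
}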
The pair object is:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 0 0 0 0 0 0 0 0 0
[2,] 1 0 0 0 0 0 0 0 0 0
[3,] 1 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0
[6,] 1 0 0 0 0 0 0 0 0 0
[7,] 1 0 0 0 0 0 0 0 0 0
[8,] 1 0 0 0 0 0 0 0 0 0
[9,] 1 0 0 0 0 0 0 0 0 0
[10,] 1 0 0 0 0 0 0 0 0 0
This means that only the first base set produced any matches: it contains every combo set except combo sets 4 and 5. Because only one base row matches, the answer to your second question about the number of rows that have RatingBinary 0 or 1 becomes trivial: it is just the RatingBinary of that base row.
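The same matching can be sketched without the hard-coded 3 by comparing a cross-product against the size of each combo set, and the RatingBinary counts then follow from one more matrix product. The data frames combo and base below are small made-up examples, not the data from the question, and the names shared, pair and counts are illustrative.

```r
# Made-up toy data: two combo sets, three base rows
combo <- data.frame(Q1 = c(1, 0), Q2 = c(1, 1), Q3 = c(0, 1))
base <- data.frame(Q1 = c(1, 1, 0), Q2 = c(1, 1, 1), Q3 = c(0, 1, 1),
                   RatingBinary = c(1, 0, 0))

q <- names(combo)

# shared[i, j]: number of questions that are 1 in both combo row i and base row j
shared <- as.matrix(combo) %*% t(as.matrix(base[q]))

# A combo set matches a base row when every one of its questions is shared
pair <- (shared == rowSums(combo)) * 1

# Per combo set: matching base rows with RatingBinary 0 and with RatingBinary 1
counts <- data.frame(
  rating0 = as.vector(pair %*% as.numeric(base$RatingBinary == 0)),
  rating1 = as.vector(pair %*% as.numeric(base$RatingBinary == 1)))
print(pair)
print(counts)
```

Because the comparison uses rowSums(combo), combo sets of any size are handled, at the cost of building the full combo-by-base matrix in memory.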

How to find bounding boxes of objects in raster?

I have a binary raster consisting of objects (1) and background (0). How can I find bounding boxes of objects? Each object should have its own bounding box.
Input:
library("raster")
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
)
ras = raster(mat)
I expect this result:
result = raster(matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 0, 0, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
))

Here is an approach
Example data
library(raster)
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE )
ras <- raster(mat)
Solution
f <- function(r) {
  x <- reclassify(r, cbind(0, NA))          # background becomes NA
  y <- rasterToPolygons(x, dissolve=TRUE)
  z <- disaggregate(y)                      # one polygon per object
  e <- sapply(1:length(z), function(i) extent(z[i,]))
  p <- spPolygons(e)
  r <- rasterize(p, r)
  d <- boundaries(r)
  reclassify(d, cbind(NA, 0))
}
r <- f(ras)
as.matrix(r)
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 0 0 0
#[2,] 0 1 1 1 1 0
#[3,] 0 1 1 1 1 0
#[4,] 0 0 0 0 0 0
#[5,] 0 1 1 1 1 0
#[6,] 0 1 0 0 1 0
#[7,] 0 1 1 1 1 0
#[8,] 0 0 0 0 0 0
It is of course possible that the bounding boxes of objects overlap, in which case there is no solution, I suppose.
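When the objects are already labelled in a plain matrix, the outlines can also be drawn with base R alone. The sketch below is an illustration under that assumption, not part of the raster solution: labels is a hand-labelled copy of the example (0 for background, 1 and 2 for the two objects), and box_outlines is a hypothetical helper name. As noted above, overlapping boxes are not handled.

```r
# Hand-labelled version of the example: 0 = background, 1 and 2 = objects
labels <- matrix(
  c(0, 0, 0, 0, 0, 0,
    0, 1, 1, 1, 0, 0,
    0, 0, 1, 1, 1, 0,
    0, 0, 0, 0, 0, 0,
    0, 0, 2, 2, 0, 0,
    0, 2, 2, 2, 2, 0,
    0, 0, 2, 2, 0, 0,
    0, 0, 0, 0, 0, 0),
  ncol = 6, nrow = 8, byrow = TRUE)

# Hypothetical helper: draw the outline of each object's bounding box
box_outlines <- function(labels) {
  out <- matrix(0, nrow(labels), ncol(labels))
  for (id in setdiff(unique(as.vector(labels)), 0)) {
    cells <- which(labels == id, arr.ind = TRUE)
    rr <- range(cells[, "row"])
    cc <- range(cells[, "col"])
    out[rr[1]:rr[2], cc[1]:cc[2]] <- 1        # fill the whole box
    if (diff(rr) > 1 && diff(cc) > 1) {       # hollow out the interior, if any
      out[(rr[1] + 1):(rr[2] - 1), (cc[1] + 1):(cc[2] - 1)] <- 0
    }
  }
  out
}

boxes <- box_outlines(labels)
print(boxes)
```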
