Reading and counting consecutive points in R
I am having trouble reading coordinates of a 2D space from a data.table like the following and extracting different properties from it:
DT <- data.table(
A = c(rep("aa",2),rep("bb",2)),
B = c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time =c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0")
)
The "Time" and "Intensity" columns refer to the x and y values of a 2D space. The "Low" and "High" columns refer to boundaries on the x-axis ("Time").
Now I would like to check different properties of the y ("Intensity") dimension strictly within (< >) those borders:
1. The highest number of consecutive points > 0 (row 1: 1, row 2: 2, ...)
2. The total number of points > 0 (row 1: 1, row 2: 3, ...)
3. The highest number of consecutive points > baseline, where the baseline is the Intensity value at the Low or High boundary, whichever is lower (so for row 3 it would be 12, for the others 0): (row 3: 4; for all other rows it is the same as in 1.)
So the output should be a table like this:
DT <- data.table(
A =c(rep("aa",2),rep("bb",2)),
B =c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time = c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0"),
First = c(1,2,7,0),
Second= c(1,3,7,0),
Third = c(1,2,4,0)
)
Does anyone have an idea how this task could be handled? I have been trying with data.table so far, but if someone knows a better package for such tasks I would be happy with that too.
Thank you very much in advance!
Yasel
Here is one method with base R. We split the 'Time' and 'Intensity' columns by "," into lists, then loop through the corresponding elements of those lists together with the elements of the 'High' and 'Low' columns. For each row we extract the 'Intensity' values whose 'Time' value lies between 'Low' and 'High', check whether they are greater than 0 (and also, conditionally, whether they exceed the baseline derived from the 'Low'/'High' boundary values), and use rle to find the length of the longest run of such elements. Finally we build a data.frame per row, rbind the contents of the list, and cbind the result to the original dataset.
newCols <- do.call(rbind, Map(function(u, v, x, y) {
  u1 <- as.numeric(u)
  v1 <- as.numeric(v)
  v2 <- v1[u1 > x & u1 < y]
  i1 <- with(rle(v2 > 0), pmax(max(lengths[values]), 0))
  i2 <- sum(v2 > 0)
  lb <- match(x, u1)
  ub <- match(y, u1)
  v3 <- as.numeric(v[(lb + 1):(ub - 1)])
  i3 <- with(rle(v3 > min(as.numeric(v[c(lb, ub)]))),
             pmax(max(lengths[values]), 0))
  data.frame(First = i1, Second = i2, Third = i3)
},
strsplit(DT$Time, ","), strsplit(DT$Intensity, ","), DT$Low, DT$High))
cbind(DT, newCols)
# A B Low High Time Intensity First Second Third
#1: aa H 0 8 0,1,2,3,4,5,6,7,8,9,10 0,0,0,0,561464,0,0,0,0,0,0 1 1 1
#2: aa H 3 10 0,1,2,3,4,5,6,7,8,9,10 0,0,0,6548,5464,5616,0,0,0,68716,0 2 3 2
#3: bb Na 1 9 0,1,2,3,4,5,6,7,8,9,10 5658,12,6548,6541,8,5646854,54565,56465,546,65,0 7 7 4
#4: bb Na 1 8 0,1,2,3,4,5,6,7,8,9,10 0,561464,0,0,0,0,0,0,0,0,0 0 0 0
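Since the question asked about data.table specifically, the same logic can also be phrased per row with `:=` and a row-wise `by`. This is a sketch, not the answer's method: the helper `longest_run` is a name I introduce here, and the logic (strict `< >` borders, baseline = lower of the two boundary intensities) follows the question's rules.

```r
library(data.table)

DT <- data.table(
  A = c(rep("aa", 2), rep("bb", 2)),
  B = c(rep("H", 2), rep("Na", 2)),
  Low = c(0, 3, 1, 1),
  High = c(8, 10, 9, 8),
  Time = rep("0,1,2,3,4,5,6,7,8,9,10", 4),
  Intensity = c("0,0,0,0,561464,0,0,0,0,0,0",
                "0,0,0,6548,5464,5616,0,0,0,68716,0",
                "5658,12,6548,6541,8,5646854,54565,56465,546,65,0",
                "0,561464,0,0,0,0,0,0,0,0,0")
)

# Longest run of TRUE values in a logical vector (0 if there is none)
longest_run <- function(flag) {
  r <- rle(flag)
  if (any(r$values)) max(r$lengths[r$values]) else 0
}

DT[, c("First", "Second", "Third") := {
  tm  <- as.numeric(strsplit(Time, ",")[[1]])
  int <- as.numeric(strsplit(Intensity, ",")[[1]])
  inside   <- int[tm > Low & tm < High]          # strictly within the borders
  baseline <- min(int[match(c(Low, High), tm)])  # lower boundary intensity
  list(longest_run(inside > 0),
       sum(inside > 0),
       longest_run(inside > baseline))
}, by = 1:nrow(DT)]
```

This reproduces the First/Second/Third columns from the expected output above; the per-row `by` keeps each group down to a single row so the string columns can be split group by group.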
Related
How to create a row that is calculated from another row automatically, like in Excel?
Does anyone know how to have a row in R that is calculated from another row automatically? For example, in Excel I would make a column C where C1 = B2/B1, C2 = B3/B2, ..., Cn = Bn+1/Bn, but in Excel we only need to do one calculation and then drag it down. How do we do it in R?
In R you work with columns as vectors, so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:

df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]

In R you usually name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ], and so on.

EDIT 1: There are multiple ways - and certainly more elegant ways to get it done - but for understanding I kept it in simple base R.

Example dataset for demonstration:

df <- data.frame("B1" = c(1, 2, 3), "B2" = c(2, 4, 6), "B3" = c(4, 8, 12))

Column calculation (note the parentheses in 1:(ncol(df) - 1); without them, 1:ncol(df)-1 evaluates as (1:ncol(df)) - 1 and the loop would start at 0):

for (i in 1:(ncol(df) - 1)) {
  col_name <- paste0("C", i)
  df[col_name] <- df[, i + 1]/df[, i]
}

Output:

  B1 B2 B3 C1 C2
1  1  2  4  2  2
2  2  4  8  2  2
3  3  6 12  2  2

So you iterate through the available columns B1/B2/B3, dynamically create a column name in every iteration based on the number of the current iteration, and then calculate the respective column contents.

EDIT 2: Row-wise, as you apparently actually meant it, works similarly:

a <- c(10, 15, 20, 1)
df <- data.frame(a)
for (i in 1:nrow(df)) {
  df$b[i] <- df$a[i + 1]/df$a[i]
}

Output:

   a        b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4  1       NA
You can do this just using vectors, without a for loop:

a <- c(10, 15, 20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)

   a        b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4  1 0.000000

Explanation: In the example data, df$a is the vector 10 15 20 1. df$a[-1] is the same vector with its first element removed, 15 20 1. Using c() to add a new element to the end, so that the vector has the same length as before, gives c(df$a[-1], 0), which is 15 20 1 0. What we want for column b is this vector divided by the original df$a. So:

df$b <- c(df$a[-1], 0) / df$a
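A small variation on the same vectorized idea: padding with NA instead of 0 keeps the last row's "no successor" case visible (matching the loop version's output) rather than turning it into a zero. Same assumed df as in the question.

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# Ratio of each element to the next one; NA for the last row,
# which has no successor.
df$b <- c(df$a[-1], NA) / df$a
df
#    a        b
# 1 10 1.500000
# 2 15 1.333333
# 3 20 0.050000
# 4  1       NA
```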
Limiting Duplication of Specified Columns
I'm trying to find a way to add some constraints to a linear programme to force the solution to have a certain level of uniqueness. I'll try to explain what I mean here. Take the example below: the linear programme returns the maximum possible Score for a combination of 2 males and 1 female. Looking at the Team/Grade/Rep columns, however, we can see that there is a lot of duplication from row to row; in fact, Shana and Jason are identical.

Name <- c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex <- c("F","M","M","F","F","M")
Score <- c(25,50,36,40,39,62)
Team <- c("A","A","A","B","B","B")
Grade <- c(1,2,1,2,1,2)
Rep <- c("C","D","C","D","D","D")
df <- data.frame(Name, Sex, Score, Team, Grade, Rep)
df

   Name Sex Score Team Grade Rep
1  Jane   F    25    A     1   C
2  Brad   M    50    A     2   D
3 Harry   M    36    A     1   C
4 Shana   F    40    B     2   D
5 Debra   F    39    B     1   D
6 Jason   M    62    B     2   D

library(Rglpk)
num <- length(df$Name)
obj <- df$Score
var.types <- rep("B", num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"))
direction <- c("==", "==")
rhs <- c(2, 1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,
                      types = var.types, max = TRUE)
df[sol$solution == 1, ]

   Name Sex Score Team Grade Rep
2  Brad   M    50    A     2   D
4 Shana   F    40    B     2   D
6 Jason   M    62    B     2   D

What I am trying to work out is how to limit the level of duplication across those last three columns. For example, I would like no more than 2 columns to be the same across any two rows. This would mean that either the Shana row or the Jason row would be replaced in the model with an alternative. I'm not sure if this is something that can be easily added into the Rglpk model? Appreciate any help that can be offered.
It sounds like you're asking how to prevent a pair of individuals who are "too similar" from being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each such pair, limiting your solution to include no more than one of those two people. For instance, if we use your rule of having no more than 2 columns the same, we can easily identify all pairs that we want to block:

pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[, c("Team", "Grade", "Rep")], function(x) {
  x[pairs[, 1]] == x[pairs[, 2]]
})) >= 3, ])
#      [,1] [,2]
# [1,]    1    3
# [2,]    4    6

We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:

library(Rglpk)
num <- length(df$Name)
obj <- df$Score
var.types <- rep("B", num)
matrix <- rbind(as.numeric(df$Sex == "M"),
                as.numeric(df$Sex == "F"),
                outer(blocked[, 1], seq_len(num), "==") +
                outer(blocked[, 2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs <- c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,
                      types = var.types, max = TRUE)
df[sol$solution == 1, ]
#    Name Sex Score Team Grade Rep
# 2  Brad   M    50    A     2   D
# 5 Debra   F    39    B     1   D
# 6 Jason   M    62    B     2   D

The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is compute every pair that needs to be blocked.
For each group of rows having the same last 3 columns, we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group, then the constraint would look like this:

a'x <= 1

To do that, split the row numbers by the last 3 columns into a list of vectors s, each of whose components is a vector of row numbers for rows having the same last 3 columns. Only keep those components having more than 1 row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there were 2 rows in each of the groups, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.

The data in the question only has groups of size 2, but in general if there were k rows in some group, one would need choose(k, 2) constraint rows to ensure that only one of the k were chosen if this were done pairwise, whereas the approach here only requires one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45, so this uses 1 constraint in place of 45.

Finally, rbind excl to matrix, giving matrix2, and adjust the other Rglpk_solve_LP arguments accordingly, giving:

nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <- t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2, dir = direction2,
                       rhs = rhs2, types = "B", max = TRUE)
df[sol2$solution == 1, ]

giving:

   Name Sex Score Team Grade Rep
2  Brad   M    50    A     2   D
5 Debra   F    39    B     1   D
6 Jason   M    62    B     2   D
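For readers without Rglpk installed, the constraint-building part of this answer runs on its own. A standalone sketch (base R only) that rebuilds just the columns involved and shows what the excl matrix actually looks like:

```r
Name  <- c("Jane","Brad","Harry","Shana","Debra","Jason")
Team  <- c("A","A","A","B","B","B")
Grade <- c(1,2,1,2,1,2)
Rep   <- c("C","D","C","D","D","D")
df <- data.frame(Name, Team, Grade, Rep)

nr <- nrow(df)
s  <- split(1:nr, df[2:4])   # group row numbers by Team/Grade/Rep
s1 <- s[lengths(s) > 1]      # keep only groups that contain duplicates
excl <- t(sapply(s1, "%in%", x = 1:nr)) + 0  # one 0/1 constraint row per group
excl
```

Each row of excl is the indicator vector a from the a'x <= 1 constraint: one row marks Jane/Harry (rows 1 and 3), the other marks Shana/Jason (rows 4 and 6).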
Divide rows by its number of columns without an NA value
I have a dataframe containing a single NA in the first row of column b:

a <- c(16.54868281, 47.64097026, 51.0773201)
b <- c(NA, 39.40217391, 13.04347826)
c <- c(17.80821918, 42.92237443, 36.75799087)
d <- c(22.90809328, 56.37860082, 61.04252401)
data <- data.frame(cbind(a, b, c, d))
data

         a        b        c        d
1 16.54868       NA 17.80822 22.90809
2 47.64097 39.40217 42.92237 56.37860
3 51.07732 13.04348 36.75799 61.04252

Here, I am trying to acquire an average score for each row. But, because of the NA, the first row returns NA for its mean.

safety <- data.frame((data$a + data$b + data$c + data$d) / 4)
names(safety)[1] <- "safety"
safety

    safety
1       NA
2 46.58603
3 40.48033

To resolve this, I have replaced the NA with 0. Unfortunately, the missing value is then treated as a number, and I am still dividing every row by 4. Therefore, I am getting a wrong mean for the first row.

a <- c(16.54868281, 47.64097026, 51.0773201)
b <- c(NA, 39.40217391, 13.04347826)
c <- c(17.80821918, 42.92237443, 36.75799087)
d <- c(22.90809328, 56.37860082, 61.04252401)
data <- data.frame(cbind(a, b, c, d))
data[is.na(data)] <- 0
safety <- data.frame((data$a + data$b + data$c + data$d) / 4)
names(safety)[1] <- "safety"
safety

    safety
1 14.31625
2 46.58603
3 40.48033

I need the first row to read 19.08833 instead of 14.31625. Is there a function in R that allows me to divide each row by the number of columns that actually enter its equation? I could probably find a long way around this issue, but as the dataset grows bigger, my primitive shortcomings would soon meet their end.
Use rowMeans with na.rm = TRUE: rowMeans(data, na.rm = TRUE) # [1] 19.08833 46.58603 40.48033
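rowMeans with na.rm = TRUE divides each row by its count of non-missing values, which is exactly the per-row denominator the question asks for. A small sketch reproducing that by hand with the question's data (`manual` is a name introduced here for illustration):

```r
a <- c(16.54868281, 47.64097026, 51.0773201)
b <- c(NA, 39.40217391, 13.04347826)
c <- c(17.80821918, 42.92237443, 36.75799087)
d <- c(22.90809328, 56.37860082, 61.04252401)
data <- data.frame(a, b, c, d)

# Sum of the non-NA entries divided by how many there are, per row
manual <- rowSums(data, na.rm = TRUE) / rowSums(!is.na(data))
all.equal(manual, rowMeans(data, na.rm = TRUE))
# [1] TRUE
```

The first row comes out as 19.08833, as desired, because its denominator is 3, not 4.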
Flag rows in matrix that contain the same set of values
I have a matrix of integers

m <- rbind(c(1,2), c(3,6), c(5,1), c(2,1), c(6,3))

and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the same unique (let's say integer) value to all rows containing the same set of integers. For the above example, the desired output would be:

flag <- c(1, 2, 3, 1, 2)

So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag. The solution should work for any number of columns.

The only thing I could come up with is the following approach ...

FlagSymmetric <- function(x) {
  vec_sim <- rep(NA, nrow(x)) # object containing flags
  ind_ord <- ncol(x)
  counter <- 1
  for (i in 1:nrow(x)) {
    if (is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
      vec_sim[i] <- counter  # ... and give the next free flag
      for (j in (i + 1):nrow(x)) {
        if ((i + 1) > nrow(x)) next # in case of tiny matrices
        ind <- x[j, ] %in% x[i, ]
        if (sum(ind) == ind_ord) vec_sim[j] <- counter # if the same, assign flag
      }
      counter <- counter + 1
    }
  }
  return(vec_sim)
}

... which does what I want:

> FlagSymmetric(m)
[1] 1 2 3 1 2

If n = nrow(m), this needs on the order of 1/2 n^2 operations. Of course, I could make it much quicker by writing it in C++, but that only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows. I guess there must be a smarter way of doing this.

EDIT: Additional, more general example (sorting each row and pasting it into a character string without a separator is not possible):

m2 <- rbind(c(1,112), c(11,12), c(12,11), c(112,1), c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
Assuming you only have numeric data in your matrix, first convert the matrix to a dataframe:

m <- data.frame(m)

We can sort every row and paste the values together, then convert them to factor and then to numeric to get a unique number for every combination:

m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
#  X1 X2 flag
#1  1  2    1
#2  3  6    3
#3  5  1    2
#4  2  1    1
#5  6  3    3

EDIT

The above solution does not work for every combination, as explained in the new example. To differentiate between the numbers, as @d.b commented, we can use any non-empty collapse argument. For the updated example:

as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
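If the factor detour feels heavy, the same flags can also be produced by matching each sorted-row key against its first occurrence. A sketch using the m2 example; `sorted`, `keys`, and `flag` are names introduced here, and the "-" separator plays the same role as the non-empty collapse argument noted in the EDIT:

```r
m2 <- rbind(c(1,112), c(11,12), c(12,11), c(112,1), c(6,3))

# Sort within each row so that element order no longer matters,
# then map each distinct sorted row to the index of its first occurrence.
sorted <- t(apply(m2, 1, sort))
keys   <- do.call(paste, c(as.data.frame(sorted), sep = "-"))
flag   <- match(keys, unique(keys))
flag
# [1] 1 2 2 1 3
```

Note that flags are numbered in order of first appearance, which matches the desired flag2 here; the factor-based answer numbers them in sorted key order instead, which may differ on other inputs.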
R: Compute number of rows in data frame that have 0 colSums for specific columns using a function
I have a data frame with n rows and m columns, where m > 30. My first column is an age variable and the rest are medical conditions that are either on or off (binary). Now I would like to compute the number of observations where none of the medical conditions is switched on, i.e. the number of healthy patients. I thought I could use the rowSums function to count observations wherever the row sum is zero (of course excluding the age variable), but I tried some functions and did not succeed. Here is an example of how it could work, but always involving a lot of AND / OR statements, which is not practical. I was looking for a non-loop solution.

example <- as.data.frame(matrix(data = c(40,1,1,1,36,1,0,1,56,0,0,1,43,0,0,0),
                                nrow = 4, ncol = 4, byrow = TRUE,
                                dimnames = list(c("row1","row2","row3","row4"),
                                                c("Age","x","y","z"))))

Two impractical alternatives to arrive at the desired outcome:

nrow(subset(example, x==0 & y==0 & z==0))
table(example$x==0 & example$y==0 & example$z==0)

What I actually wanted is something like this:

nrow(example[rowSums(example[,2:ncol(example)])==0])
You can use

apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0))
##  row1  row2  row3  row4
## FALSE FALSE FALSE  TRUE

Here you are applying FUN to every row of example[, -1]. It gives you a logical vector indicating which rows satisfy the condition that all of the variables in that row are equal to 0; you get this by using the all function inside your FUN argument. You can use this result to get the rows containing only healthy patients, or those containing at least one unhealthy condition:

example[apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)), ]
##      Age x y z
## row4  43 0 0 0

example[!apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)), ]
##      Age x y z
## row1  40 1 1 1
## row2  36 1 0 1
## row3  56 0 0 1

And you can get the number of healthy rows, or otherwise, as below:

# healthy rows
sum(apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)))
## [1] 1

# rows with at least one unhealthy condition
sum(!apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)))
## [1] 3
You just want the total number of observations/rows that satisfy this condition, right? Then you can use:

nrow(example[example$x==0 & example$y==0 & example$z==0, ])

Otherwise, if you want to use rowSums, this will work:

nrow(example[rowSums(example[, 2:4]) == 0, ])
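Putting the rowSums idea together with the question's example data (assuming, as stated, that the condition columns are binary 0/1, so a zero row sum means every condition is off). The `healthy` vector is a name introduced here for illustration; counting with sum() avoids the subset entirely:

```r
example <- as.data.frame(matrix(
  c(40,1,1,1, 36,1,0,1, 56,0,0,1, 43,0,0,0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("row1","row2","row3","row4"), c("Age","x","y","z"))
))

# Healthy patients: all condition columns (everything but Age) are 0
healthy <- rowSums(example[, -1]) == 0
sum(healthy)
# [1] 1
```

This scales to m > 30 condition columns without naming any of them, which was the asker's concern.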