There might be a simple solution to this but I'm struggling.
I have code as follows:
for (i in 1:nrow(df)) {
  x[i] <- df[i,]$X4
  if (length(unique(df[1:i,]$X4)) == length(unique(df$X4))) {
    collect <- data.frame(df[1,]$X1, df[1,]$X2, df[i+1,]$X3)
    break
  }
}
The loop breaks once the if condition length(unique(df[1:i,]$X4)) == length(unique(df$X4)) is met. However, I want to restart the same check from iteration i+1 and keep checking until the condition is met again, and so on until the end of my data frame.
My sample data is as follows:
1 930000 1000000 E2-A
1 1890000 2110000 E2-A
1 2120000 2330000 D
1 2340000 3350000 E2-B
1 3365000 3405000 B
1 5695000 5810000 E2-A
1 6305000 6405000 E2-B
1 6425000 6465000 E1-A
1 6780000 6960000 E2-B
1 7100000 7270000 D
1 7730000 7810000 D
1 8030000 8380000 E2-A
1 8970000 9170000 E1-A
1 9345000 9555000 E1-B
1 9845000 9930000 E1-A
1 10000000 10100000 E1-B
1 10430000 10560000 E3
1 11720000 11780000 B
1 11900000 11960000 C
1 12185000 12270000 E1-A
1 12450000 12680000 A #break point of loop because if(length(unique(df[1:i,]$X4)) == length(unique(df$X4)))
1 13990000 14290000 B
1 15250000 15355000 E2-B
1 15475000 15600000 D
1 15655000 15755000 A
1 15920000 16080000 E2-A
1 16120000 16280000 C
1 16400000 16570000 E1-B
1 17280000 17380000 E1-B
1 17450000 17735000 A
1 17760000 17820000 E1-B
1 17825000 17935000 A
1 18925000 19150000 E1-A
1 19220000 19410000 C
1 19680000 19980000 C
1 20230000 20820000 E3 #the if condition is met again after the break, but using break exits the loop
1 20845000 20970000 E2-A
1 21580000 21695000 D
1 21700000 21920000 E2-A
1 22430000 22750000 B
1 22740000 22980000 A
1 23300000 23515000 C
1 23870000 23965000 A
1 24525000 24720000 E2-B
1 25010000 25160000 D
1 25170000 25430000 B
1 25930000 26130000 A
1 26220000 26330000 E2-B
1 26435000 26485000 C
My expected output is:
1 930000 12680000
1 13990000 20820000
But what I get so far is:
1 930000 12680000
How do I do so?
# I saved the data you provided into a file and read them back into R session
df <- read.table("df.txt",quote="#")
# It looks like you are using X1, X2, and so on in your code example
# so I renamed the column names
names(df) <- c("X1","X2","X3","X4")
# check the structure of the data frame
str(df)
# 'data.frame': 49 obs. of 4 variables:
# $ X1: int 1 1 1 1 1 1 1 1 1 1 ...
# $ X2: int 930000 1890000 2120000 2340000 3365000 5695000 6305000 6425000 6780000 7100000 ...
# $ X3: int 1000000 2110000 2330000 3350000 3405000 5810000 6405000 6465000 6960000 7270000 ...
# $ X4: Factor w/ 9 levels "A","B","C","D",..: 7 7 4 8 2 7 8 5 8 4 ...
result <- list()
start = 1
j = 0
# number of unique values in the 4th column
n.unique <- length(unique(df$X4))
for ( i in seq(nrow(df) )) {
  if(length(unique(df[start : i,"X4" ])) == n.unique ){
    j = j+1
    result[[j]] <- c( df[start, 2], df[i, 3])
    start = i + 1
  }
}
result
# [[1]]
# [1] 930000 12680000
# [[2]]
# [1] 13990000 20820000
# If a data frame is needed:
do.call(rbind.data.frame, result)
#c.930000L..13990000L. c.12680000L..20820000L.
#1 930000 12680000
#2 13990000 20820000
#If matrix is OK
matrix(unlist(result, use.names=F), ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] 930000 12680000
#[2,] 13990000 20820000
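The restart logic above can also be wrapped in a small function. This is only a sketch: the name find_segments and the toy data frame are made up for illustration, and it tracks the values seen so far instead of calling unique() on a growing slice.

```r
# Hypothetical helper: split df into consecutive segments, each ending at the
# first row where every distinct value of X4 has been seen since the segment began.
find_segments <- function(df) {
  n_unique <- length(unique(df$X4))
  starts <- integer(0)
  ends <- integer(0)
  start <- 1L
  seen <- character(0)
  for (i in seq_len(nrow(df))) {
    seen <- union(seen, as.character(df$X4[i]))
    if (length(seen) == n_unique) {
      starts <- c(starts, start)
      ends <- c(ends, i)
      start <- i + 1L          # restart the check on the next row
      seen <- character(0)
    }
  }
  data.frame(X1 = df$X1[starts], X2 = df$X2[starts], X3 = df$X3[ends])
}

# Toy data (not the data from the question)
toy <- data.frame(
  X1 = 1,
  X2 = c(10, 20, 30, 40, 50, 60, 70),
  X3 = c(15, 25, 35, 45, 55, 65, 75),
  X4 = c("A", "B", "A", "C", "C", "B", "A")
)
segments <- find_segments(toy)
segments
```

A trailing run of rows that never sees every value is dropped, which matches the expected output in the question.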
ulist<- unique(df$X4)
uniq <- length(unique(df$X4))
brk <- 0
rn <- length(df$X1)
coor <- NULL
while (brk < rn) {
  found <- rep(0,uniq)
  coor.s <- 0
  coor.e <- 0
  coor.s <- df$X2[brk+1]
  for (i in (brk+1):rn) {
    for (j in 1:uniq) {
      if(df$X4[i] == ulist[j]) {found[j]<-1}
    }
    if (sum(found)==uniq) {coor.e <- df$X3[i];brk=i;break}
  }
  if(sum(found)<uniq) {
    break
  } else {
    coor <- rbind(coor,c(coor.s,coor.e))
  }
}
collect.df <- as.data.frame(coor)
I want to extract specific characters from a character vector in a data frame and return a new data frame. The information I want to extract is the auditor's remarks on a specific company's income and balance sheet. My problem is that the remarks are stored as single strings, each containing several different remarks. For instance:
vec = c("A C G H D E"). Since "A" %in% vec won't return TRUE, I have to use strsplit to break up each character vector in the data frame, hence "A" %in% unlist(strsplit(dat[i, 2], " ")). This returns TRUE.
Here is a MWE:
dat <- data.frame(orgnr = c(1, 2, 3, 4), rat = as.character(c("A B C")))
dat$rat <- as.character(dat$rat)
dat[2, 2] <- as.character(c("A F H L H"))
dat[3, 2] <- as.character(c("H X L O"))
dat[4, 2] <- as.character(c("X Y Z A B C"))
Now, to extract information about every single letter in the rat column, I've tried several approaches from similar problems, such as Roland's answer to a similar question (How to split a character vector into data frame?):
DF <- data.frame(do.call(rbind, strsplit(dat$rat, " ", fixed = TRUE)))
X1 X2 X3 X4 X5 X6
1 A B C A B C
2 A F H L H A
3 H X L O H X
4 X Y Z A B C
This returns the following warning:
Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 2)
It would be a desirable approach since it's fast, but I can't use DF since it recycles.
Is there a way to insert NA instead of the recycling because of the different length of the vectors?
So far I've found a solution to the problem by using for loops in combination with ifelse statements. However, with 3 million observations this approach takes far too long!
dat$A <- 0
for(i in seq(1, nrow(dat), 1)) {
  dat[i, 3] <- ifelse("A" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
dat$B <- 0
for(i in seq(1, nrow(dat), 1)) {
  dat[i, 4] <- ifelse("B" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
This gives the results I want:
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1
I've searched through most of the relevant questions I could find here on StackOverflow. This one is really close to my problem: How to convert a list consisting of vector of different lengths to a usable data frame in R?, but I don't know how to implement strsplit with that approach.
We can use a for loop with grepl to achieve this task. The + 0 converts the column from TRUE or FALSE to 1 or 0.
for (col in c("A", "B")){
  dat[[col]] <- grepl(col, dat$rat) + 0
}
dat
# orgnr rat A B
# 1 1 A B C 1 1
# 2 2 A F H L H 1 0
# 3 3 H X L O 0 0
# 4 4 X Y Z A B C 1 1
If performance is an issue, try this data.table approach.
# Convert to data.table
library(data.table)
setDT(dat)
# Create a helper function
dummy_fun <- function(col, vec){
  grepl(col, vec) + 0
}
# Apply the function to A and B
dat[, c("A", "B") := lapply(c("A", "B"), dummy_fun, vec = rat)]
dat
# orgnr rat A B
# 1: 1 A B C 1 1
# 2: 2 A F H L H 1 0
# 3: 3 H X L O 0 0
# 4: 4 X Y Z A B C 1 1
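One caveat worth sketching: grepl matches substrings, so it is safe here only because every remark is a single letter. If the codes could be longer (say "A" and "AB", which is an assumption, not something in the question), matching whole tokens avoids false hits. The helper name has_token below is made up for illustration.

```r
dat <- data.frame(orgnr = 1:4,
                  rat = c("A B C", "A F H L H", "H X L O", "X Y Z A B C"),
                  stringsAsFactors = FALSE)

# Split every string once, then test membership per row
tokens <- strsplit(dat$rat, " ", fixed = TRUE)

# Hypothetical helper: 1 if `code` is one of the row's tokens, else 0
has_token <- function(code) {
  vapply(tokens, function(tk) code %in% tk, logical(1)) + 0L
}

for (code in c("A", "B")) {
  dat[[code]] <- has_token(code)
}
dat
```

Splitting once up front also avoids calling strsplit again for every row and every letter, which is where the original loop spends most of its time.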
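Using base R:
a=strsplit(dat$rat," ")
b=lapply(a,function(x)table(factor(x,sort(unique(unlist(a))))))
cbind(dat,do.call(rbind,b))
orgnr rat A B C F H L O X Y Z
1 1 A B C 1 1 1 0 0 0 0 0 0 0
2 2 A F H L H 1 0 0 1 2 1 0 0 0 0
3 3 H X L O 0 0 0 0 1 1 1 1 0 0
4 4 X Y Z A B C 1 1 1 0 0 0 0 1 1 1
From here you can just call those columns that you want:
cbind(dat,do.call(rbind,b))[c("orgnr","rat","A","B")]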
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1
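On the question of getting NA instead of recycling: one possible sketch is to pad every split vector to the longest length before binding. Assigning to length() pads a vector with NA; the variable names below are illustrative.

```r
rat <- c("A B C", "A F H L H", "H X L O", "X Y Z A B C")
parts <- strsplit(rat, " ", fixed = TRUE)
max_len <- max(lengths(parts))

# Pad each vector with NA up to max_len, then bind the rows
padded <- t(vapply(parts,
                   function(p) { length(p) <- max_len; p },
                   character(max_len)))
DF <- as.data.frame(padded, stringsAsFactors = FALSE)
DF
```

Shorter rows end in NA rather than repeating their first elements, so the recycling warning does not appear.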
I have a data frame that has two types of value. I'd like to slice it into groups.
Each group must satisfy two conditions:
Condition 1: max cumulative value of w <= 75
Condition 2: max cumulative value of n <= 15
If either cumulative sum would exceed its maximum, both cumulative sums should reset
and start over again.
id<- sample(1:33)
w <- c(2,1,32,5,1,1,12,1,2,32,32,32,1,3,2,12,1,1,1,1,1,1,5,3,5,1,1,1,2,7,2,32,1)
n <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df <- data.frame(id, w, n)
The expected result (made manually):
w cumsum_w n cumsum_n group
2 2 1 1 1
1 3 1 2 1
32 35 1 3 1
5 40 1 4 1
1 41 1 5 1
1 42 1 6 1
12 54 1 7 1
1 55 1 8 1
2 57 1 9 1
32 32 1 1 2
32 64 1 2 2
32 32 1 1 3
1 33 1 2 3
3 36 1 3 3
2 38 1 4 3
12 50 1 5 3
1 51 1 6 3
1 52 1 7 3
1 53 1 8 3
1 54 1 9 3
1 55 1 10 3
1 56 1 11 3
5 61 1 12 3
3 64 1 13 3
5 69 1 14 3
1 70 1 15 3
1 1 1 1 4
1 2 1 2 4
2 4 1 3 4
7 11 1 4 4
2 13 1 5 4
32 45 1 6 4
1 46 1 7 4
I tried a few methods:
Method 1
chunk(df, chunk.size = 75, n.chunks = 15)
Error in chunk(df, chunk.size = 75, n.chunks = 15) :
You must provide exactly one of 'chunk.size', 'n.chunks' or 'props'
Method 2
cumsum_with_reset_group <- function(w, n, threshold_w, threshold_n) {
  cumsum_w <- 0
  cumsum_n <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(w)) {
    cumsum_w <- cumsum_w + w[i]
    cumsum_n <- cumsum_n + n[i]
    if (cumsum_w > threshold_w | cumsum_n > threshold_n) {
      group <- group + 1
      cumsum_w <- cumsum_w + w[i]
      cumsum_n <- cumsum_n + n[i]
    }
    result = c(result, group)
  }
  return (result)
}
# cumsum with reset
cumsum_w_with_reset <- function(w, threshold_w) {
  cumsum_w <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(w)) {
    cumsum_w <- cumsum_w + w[i]
    if (cumsum_w > threshold_w) {
      group <- group + 1
      cumsum_w <- w[i]
    }
    result = c(result, cumsum_w)
  }
  return (result)
}
# cumsum with reset
cumsum_n_with_reset <- function(n, threshold_n) {
  cumsum_n <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(n)) {
    cumsum_n <- cumsum_n + n[i]
    if (cumsum_n > threshold_n | cumsum_w > threshold_w) {
      group <- group + 1
      cumsum_n <- n[i]
    }
    result = c(result, cumsum_n)
  }
  return (result)
}
# use functions above as window functions inside mutate statement
y<-df %>% group_by() %>%
  mutate(
    cumsum_w = cumsum_w_with_reset(w, 75),
    cumsum_n =cumsum_n_with_reset(n, 15),
    group = cumsum_with_reset_group(w, n, 75, 15)
  ) %>%
  ungroup()
Error in mutate_impl(.data, dots) :
  Evaluation error: object 'cumsum_w' not found
Here is a hack, which is done by repeated subsetting and binding. As such, this will be very slow with large data sets. This takes the whole data set as an input.
cumsumdf <- function(df){
  cumsum_75 <- function(x) {cumsum(x) %/% 76}
  cumsum_15 <- function(x) {cumsum(x) %/% 16}
  cumsum_w75 <- function(x) {cumsum(x) %% 76}
  cumsum_n15 <- function(x) {cumsum(x) %% 16}
  m <- nrow(df)
  df$grp <- 0
  df <- df %>%
    group_by(grp) %>%
    mutate(cumsum_w = numeric(m), cumsum_n = numeric(m))
  n = 0
  df2 <- df[0,]
  while(nrow(df) >0 ){
    df$cumsum_w = cumsum_75(df$w)
    df$cumsum_n = cumsum_15(df$n)
    n <- n + 1
    df1 <- df[df$cumsum_n == 0 & df$cumsum_w == 0,]
    df <- df[df$cumsum_n != 0 | df$cumsum_w != 0,]
    df1$grp <- n
    df1 <- df1 %>% group_by(grp) %>%
      mutate(cumsum_w = cumsum_w75(w), cumsum_n = cumsum_n15(n))
    df2 <- rbind(df2,df1)
  }
  df2
}
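A single pass that tracks both running sums together may be simpler than repeated subsetting. The sketch below is one possible reading of the rule, not the answer's method: a new group starts whenever adding the current row would push either sum past its limit. The function name cumsum_reset and its argument names are made up for illustration.

```r
# Hypothetical helper: one loop, both cumulative sums reset together
cumsum_reset <- function(w, n, max_w = 75, max_n = 15) {
  len <- length(w)
  group <- integer(len)
  cs_w <- numeric(len)
  cs_n <- numeric(len)
  g <- 1L
  sw <- 0
  sn <- 0
  for (i in seq_len(len)) {
    # Start a new group if this row would exceed either limit
    if (i > 1 && (sw + w[i] > max_w || sn + n[i] > max_n)) {
      g <- g + 1L
      sw <- 0
      sn <- 0
    }
    sw <- sw + w[i]
    sn <- sn + n[i]
    group[i] <- g
    cs_w[i] <- sw
    cs_n[i] <- sn
  }
  data.frame(w, cumsum_w = cs_w, n, cumsum_n = cs_n, group)
}

w <- c(2,1,32,5,1,1,12,1,2,32,32,32,1,3,2,12,1,1,1,1,1,1,5,3,5,1,1,1,2,7,2,32,1)
n <- rep(1, length(w))
res <- cumsum_reset(w, n)
table(res$group)
```

On the question's data this yields four groups of 9, 2, 15 and 7 rows. A single row whose own w or n exceeds a limit would still get a group to itself, so the limits are only guaranteed when every individual value fits.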
Here's a problem I couldn't fully solve.
Suppose that we have the following code:
## A data frame named a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
## 1st function: computes all pairwise combinations of the colnames of a; the output is a character vector named items2
items2 <- c()
countI <- 1
while(countI <= ncol(a)){
  for(i in countI){
    countJ <- countI + 1
    while(countJ <= ncol(a)){
      for(j in countJ){
        items2 <- c(items2, paste(colnames(a[i]), colnames(a[j]), collapse = '', sep = ""))
      }
      countJ <- countJ + 1
    }
  }
  countI <- countI + 1
}
And here's the code I'm trying to fix (the output is a numeric vector called count_1):
## 2nd function
colnames(a) <- NULL ## just to simplify the calculation
count_1 <- numeric(ncol(a)*2)
countI <- 1
while(countI <= ncol(a)){
  for(i in countI){
    countJ <- countI + 1
    while(countJ <= ncol(a)){
      for(j in countJ){
        s <- a[, i]
        p <- a[, j]
        count_1[i*2] <- as.integer(s[i] == p[j] & s[i] == 1)
      }
      countJ <- countJ + 1
    }
  }
  countI <- countI + 1
}
But when I run this code in the RStudio console, I get an unexpected result:
[1] 0 0 0 0 0 1 0 1 0 0
However, I am expecting the following result:
[1] 1 2 2 2 1 1 1 1 2 1
You can visit the following URL, where an image on Dropbox gives a detailed explanation.
I'll try to explain a little more.
I posted the first function only as an example of the kind of output I'm looking for.
What I'm trying to get from the second function is, for each pair of columns, the number of rows in which both columns equal 1. For a pair such as AB, the counter starts at 0 and is incremented by 1 for every row where both columns are 1. This is repeated for every pair of columns (AB, AC, AD, AE, BC, BD, BE, CD, CE, and DE); the number of pairs is n!/(2!(n-2)!). For example, if I have the following data frame:
a =
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
Then, the number of occurrences of the number 1 for each row by combining the two first columns is as follows: (Note that I put colnames(a) <- NULL just to facilitate the work and be more clear)
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
### Example 1: #####################################################
so from here I put (for columns A and B (AB))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 1 0 1 0 0
Then I count the rows where both vectors equal 1, i.e. a[, i] == 1 && a[, j] == 1 && a[, i] == a[, j]; for this example the resulting count is [1] 1
### Example 2: #####################################################
From here I put (for columns A and D (AD))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 0 0 1 1 0
Then I count the rows where both vectors equal 1, i.e. a[, i] == 1 && a[, j] == 1 && a[, i] == a[, j]; for this example the resulting count is [1] 2
And so on,
I'll have a numeric vector named count_1 equal to:
[1] 1 2 2 2 1 1 1 1 2 1
where each element of count_1 corresponds to one pair of columns (without the names of the data frame)
1 2 2 2 1 1 1 1 2 1
Not clear what you're after at all.
As to the first code chunk, that is some ugly R coding involving a whole bunch of unnecessary while/for loops.
You can get the same result items2 in one single line.
items2 <- sort(toupper(unlist(sapply(1:4, function(i)
sapply(5:(i+1), function(j)
paste(letters[i], letters[j], sep = ""))))));
# [1] "AB" "AC" "AD" "AE" "BC" "BD" "BE" "CD" "CE" "DE"
As to the second code chunk, please explain what you're trying to calculate. It's likely that these while/for loops are as unnecessary as in the first case.
Note that this is based on a as defined at the beginning of your post. Your expected output is based on a different a, that you changed further down the post.
There is no need for a for/while loop, both "functions" can be written in two one-liners.
# Your sample dataframe a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
# Function 1
items2 <- toupper(unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
paste(letters[i], letters[j], sep = "")))));
# Function 2
count_1 <- unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
sum(a[, i] + a[, j] == 2))));
# Add names and sort
names(count_1) <- items2;
count_1 <- count_1[order(names(count_1))];
# Output
# 1 2 2 2 1 1 1 2 1 1
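As a further sketch, base R's combn can enumerate the column pairs directly, which removes the nested sapply calls and the need to sort afterwards. The variable names pair_names and pair_counts are illustrative; a is the data frame from the top of the question.

```r
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0),
                D = c(0,0,1,1,0), E = c(0,1,1,0,1))

# All pairs of column names, pasted together: "AB", "AC", ...
pair_names <- combn(names(a), 2, paste, collapse = "")

# For each pair of column indices, count rows where both columns are 1
pair_counts <- combn(ncol(a), 2,
                     function(idx) sum(a[[idx[1]]] == 1 & a[[idx[2]]] == 1))
names(pair_counts) <- pair_names
pair_counts
```

combn already produces the pairs in the order AB, AC, ..., DE, so the names and counts line up without extra sorting.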
I would like to add a counter column to a data frame based on sets of identical rows. To do this, I used the data.table package. In my case, rows should be compared on the combination of columns "z" AND ("x" OR "y").
I tested:
DF[ , Index := .GRP, by = c("x","y","z") ]
but the result is the combination of "z" AND "x" AND "y".
How can I have the combination of "z" AND ("x" OR "y") ?
Here is a data example:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
DF <- data.table(DF)
I would like to have this output:
> DF
x y z Index
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4
The new group starts if the value for z is changing or the values both for x and y are changing.
Try this example.
DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                 y = c(1,3,2,8,8,4,4,6,0),
                 z = c("M","M","M","F","F","M","M","F","F"))
# Function to check whether each value differs from the previous value
is.not.eq.with.lag <- function(x) c(T, tail(x, -1) != head(x, -1))
DF[, x1 := is.not.eq.with.lag(x)]
DF[, y1 := is.not.eq.with.lag(y)]
DF[, z1 := is.not.eq.with.lag(z)]
DF[, Index := cumsum(z1 | (x1 & y1))]
I know a lot of people warn against for loops in R, but in this instance I think it is a very direct way of approaching the problem. Plus, the result isn't growing in size, so performance isn't a big concern. The for loop approach would be:
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
  if (i == 1){
    dt$grp[i] = 1
  } else {
    if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
      dt$grp[i] = dt$grp[i-1]
    } else {
      dt$grp[i] = dt$grp[i-1] + 1
    }
  }
}
Trying this on OP's original problem, the result is:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
dt <- data.table(DF)
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
  if (i == 1){
    dt$grp[i] = 1
  } else {
    if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
      dt$grp[i] = dt$grp[i-1]
    } else {
      dt$grp[i] = dt$grp[i-1] + 1
    }
  }
}
dt
x y z grp
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4
Trying this on the data.table in @Frank's comment gives the expected result as well:
dt<-data.table(x = c("b", "a", "a"), y = c(1, 1, 2), z = c("F", "F", "F"))
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
  if (i == 1){
    dt$grp[i] = 1
  } else {
    if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
      dt$grp[i] = dt$grp[i-1]
    } else {
      dt$grp[i] = dt$grp[i-1] + 1
    }
  }
}
dt
x y z grp
1: b 1 F 1
2: a 1 F 1
3: a 2 F 1
EDITED TO ADD: This solution is in some ways a more verbose version of the one advocated by djhurio above. I think it shows what is happening a bit more so I'll leave it.
I think this is a task easier to do if it is broken down a little bit. The below code creates TWO indices at first, one for changes in x (nested in z) and one for changes in y (nested in z). We then find the first row from each of these indices. Taking the cumulative sum of the case where both FIRST.x and FIRST.y is true should give your desired index.
dt_example <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
y = c(1,3,2,8,8,4,4,6,0),
z = c("M","M","M","F","F","M","M","F","F"))
dt_example[,Index_x := .GRP,by = c("z","x")]
dt_example[,Index_y := .GRP,by = c("z","y")]
dt_example[,FIRST.x := !duplicated(Index_x)]
dt_example[,FIRST.y := !duplicated(Index_y)]
dt_example[,Index := cumsum(FIRST.x & FIRST.y)]
x y z Index_x Index_y FIRST.x FIRST.y Index
1: a 1 M 1 1 TRUE TRUE 1
2: a 3 M 1 2 FALSE TRUE 1
3: a 2 M 1 3 FALSE TRUE 1
4: b 8 F 2 4 TRUE TRUE 2
5: c 8 F 3 4 TRUE FALSE 2
6: d 4 M 4 5 TRUE TRUE 3
7: e 4 M 5 5 TRUE FALSE 3
8: f 6 F 6 6 TRUE TRUE 4
9: f 0 F 6 7 FALSE TRUE 4
This approach looks for changes in x & z | y & z. The extra columns are left in the data.table to show the calculations.
DF[, c("Ix", "Iy", "Iz", "dx", "dy", "min.change", "Index") :=
  #Create index of values based on consecutive order
  list(ix <- rleid(x), iy <- rleid(y), iz <- rleid(z),
       #Determine if combinations of x+z OR y+z change
       ix1 <- c(0, diff(rleid(ix+iz))),
       iy1 <- c(0, diff(rleid(iy+iz))),
       #Either combination is constant (no change)?
       change <- pmin(ix1, iy1),
       #New index based on change
       cumsum(change) + 1
  )]
DF
x y z Ix Iy Iz dx dy min.change Index
1: a 1 M 1 1 1 0 0 0 1
2: a 3 M 1 2 1 0 1 0 1
3: a 2 M 1 3 1 0 1 0 1
4: b 8 F 2 4 2 1 1 1 2
5: c 8 F 3 4 2 1 0 0 2
6: d 4 M 4 5 3 1 1 1 3
7: e 4 M 5 5 3 1 0 0 3
8: f 6 F 6 6 4 1 1 1 4
9: f 0 F 6 7 4 0 1 0 4
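The rule from the question (a new group starts when z changes, or when x and y both change) can also be checked in base R without helper columns. This is a sketch with made-up helper names (changed, group_index), run on the question's data and on the three-row example from the comments.

```r
# TRUE where a value differs from the one before it (first element is TRUE)
changed <- function(v) c(TRUE, v[-1] != v[-length(v)])

# New group when z changes, or when x and y both change
group_index <- function(x, y, z) {
  cumsum(changed(z) | (changed(x) & changed(y)))
}

idx_op <- group_index(
  x = c("a","a","a","b","c","d","e","f","f"),
  y = c(1,3,2,8,8,4,4,6,0),
  z = c("M","M","M","F","F","M","M","F","F")
)
idx_frank <- group_index(
  x = c("b", "a", "a"),
  y = c(1, 1, 2),
  z = c("F", "F", "F")
)
idx_op
idx_frank
```

Because only adjacent rows are compared, this treats the data as ordered, just as the loop-based answer does.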
Suppose I have m vectors: a_1 = (a_{11}...a_{1n}) ... a_m = (a_{m1}...a_{mn})
I want a new vector b of length mn such that
b = (a_{11}...a_{m1} a_{12}...a_{m2}...a_{1n}...a_{mn})
I can think of a for loop, for example:
> a <- c(1, 1, 1); b <- c(2, 2, 2); c <- c(3, 3, 3)
> x <- c()
> for (i in 1:3) {x <- c(x, c(a[i], b[i], c[i]))}
> x
[1] 1 2 3 1 2 3 1 2 3
Is there a better way?
Or using mapply...
c( mapply( c , a , b , c ) )
[1] 1 2 3 1 2 3 1 2 3
Or fill a matrix by row, one vector per row, and read it back column by column:
c(matrix(c(a, b, c), ncol=length(a), byrow=TRUE))
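A shorter route to the same interleaving, sketched here with made-up vector names a1, a2, a3: rbind stacks the vectors as rows, and c() flattens the resulting matrix column by column, which is exactly the order b asks for. For vectors held in a list, do.call(rbind, ...) does the same.

```r
a1 <- c(1, 2, 3, 4)
a2 <- c(10, 20, 30, 40)
a3 <- c(100, 200, 300, 400)

# Rows are the vectors; flattening reads down each column in turn
b <- c(rbind(a1, a2, a3))
b

# Same idea when the vectors are stored in a list
vecs <- list(a1, a2, a3)
b_list <- c(do.call(rbind, vecs))
```

This assumes all vectors have the same length n; with unequal lengths rbind would recycle the shorter ones.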