Reducing Row Sequences In R Lengths - r

I'm looking for a nice way to count the longest number of consecutive reductions in a row in a data.table (package version 1.9.2) in R. I am horribly lost and any help is much appreciated. For the example I am trying to do, a reduction is where a value is less than or equal to the previous value (<=).
Below is a toy sample of the data I am dealing with. I have also put down my best attempt so far, which to be honest went horribly wrong and returned an error. My attempt also uses 2 for loops, which I'm not hugely keen on since I have been advised apply loops are more often used in R. I have tried searching this site and the web for a similar solution but haven't had any luck. The number of rows I actually have in my full data table is just over 1 million and the number of columns I have is 17.
library(data.table)
TEST_DF <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4))
TEST_DF$COUNT <- as.numeric(0)
for( i in 1:NROW(TEST_DF))
{
for (j in 1:(NCOL(TEST_DF) - 1))
{
TEST_DF$COUNT[j] <- if (TEST_DF[i, j, with = FALSE] >=
TEST_DF[i, j + 1, with = FALSE])
{
TEST_DF$COUNT[j] + 2
}
}
}
DESIRED <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4),
COUNT = c(4,2,1,0))
The desired output appears at the bottom of the code. As all four "COL" columns appear in the longest reduction sequence, the COUNT column for the first row would get a value of 4. In the second row, there is a reduction in the first two columns and the last two but none in between, so the COUNT would get a value of 2 for this. In the third row, there is a reduction from COL_3 to COL_4, so COUNT would get a value of 1 for this row. In any row where there is no reduction, such as the last, there would be a value of 0 for the COUNT.
Let me know if any further clarification or information is needed.
Thank you so much in advance.

You can use the functions diff() and rle() to build a function to extract the run lengths. Then use apply() across the rows of your data:
foo <- function(x) {
# TRUE wherever a value is <= the one before it; the first comparison is repeated at the front
runs <- rle(c(x[2] <= x[1], diff(x) <= 0))
if (all(runs$values == 0)) 0 else max(runs$lengths[runs$values == 1])
}
apply(TEST_DF, 1, foo)
[1] 4 2 1 0
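To see what the function is doing, it may help to trace a single row by hand. The sketch below uses the second row of the toy data (2, 0, 1, 0); the names steps, runs and longest are only for illustration and are not part of the answer above.

```r
x <- c(2, 0, 1, 0)                      # second row of the toy data

# diff(x) has one fewer element than x, so the first comparison is
# repeated at the front; this is why a run that starts in COL_1 is
# counted one higher than the number of comparisons it contains.
steps <- c(x[2] <= x[1], diff(x) <= 0)
steps
# [1]  TRUE  TRUE FALSE  TRUE

runs <- rle(steps)
runs$lengths
# [1] 2 1 1
runs$values
# [1]  TRUE FALSE  TRUE

# Longest run of TRUE, or 0 when the row never decreases
longest <- if (any(runs$values)) max(runs$lengths[runs$values]) else 0
longest
# [1] 2
```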

I used apply with one for loop to accomplish what you're looking for. The apply acts on each row, and the for loop compares consecutive columns.
COUNT <- rep(0,4)
for (i in 1:(ncol(TEST_DF)-1)) {
COUNT<-COUNT+apply(TEST_DF,1,function(x) ifelse(x[i]>=x[i+1],1,0))
}
This produces: 3, 2, 1, 0, as there are 3 reductions in the first row. The last column has nothing to compare to, so there can only be three comparisons. I'm not sure why you want it to be 4?
If you want count to be part of your original table:
TEST_DF$COUNT<-COUNT
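If only the total number of reductions per row is wanted (not the longest consecutive run), the same numbers can probably be had without any loop by comparing the matrix with a copy of itself shifted by one column. This is a sketch on a plain data.frame; toy, m and n_reductions are made-up names.

```r
toy <- data.frame(COL_1 = c(5, 2, 3, 1), COL_2 = c(1, 0, 4, 2),
                  COL_3 = c(0, 1, 6, 3), COL_4 = c(0, 0, 0, 4))
m <- as.matrix(toy)

# Columns 1..(k-1) compared element-wise with columns 2..k
n_reductions <- rowSums(m[, -ncol(m), drop = FALSE] >= m[, -1, drop = FALSE])
unname(n_reductions)
# [1] 3 2 1 0
```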

Related

How to exclude a range of data points by index from a dataframe in R [duplicate]

I have a data frame named "mydata" that looks like this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete row 2,4,6. For example, like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so called boolean vector, aka logical:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to @mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
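myData$A[myData$A > 4] <- NA
where column A is assigned NA (R's "not available" marker for missing values) where A exceeds 4. Note that there is no comma in the index here, because myData$A is a plain vector rather than a data frame.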
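A small sketch of assignment through a logical index, on a made-up four-row data frame. It also shows one thing worth keeping in mind afterwards: comparisons against NA give NA, so later filters on that column may need which() or is.na().

```r
myData <- data.frame(A = c(5, 3, 6, 2), B = c(1, 2, 3, 4))

# myData$A is a plain vector, so the index takes no comma
myData$A[myData$A > 4] <- NA
myData$A
# [1] NA  3 NA  2

# NA > 2 is NA, and an NA in a logical row index yields an all-NA row;
# which() keeps only the positions that are definitely TRUE
kept <- myData[which(myData$A > 2), ]
nrow(kept)
# [1] 1
```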
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
Create an id column in your data frame, or use any existing column that identifies the row. Deleting by index is not reliable.
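To make the reordering problem concrete, here is a minimal sketch with an invented id column. The same seven rows are shuffled, and then rows are dropped once by position and once by id.

```r
myData <- data.frame(id = 1:7, A = 5, B = 4)

# Pretend the raw data was reordered at some later point
shuffled <- myData[c(7, 1, 3, 2, 6, 4, 5), ]

by_position <- shuffled[-c(2, 4, 6), ]                       # silently drops ids 1, 2 and 4
by_id       <- shuffled[!(shuffled$id %in% c(2, 4, 6)), ]    # still drops ids 2, 4 and 6

sort(by_position$id)
# [1] 3 5 6 7
sort(by_id$id)
# [1] 1 3 5 7
```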
Use subset function to create new frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out (which() returns positions, so they are dropped with - rather than !):
mydata[-which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[-which(1:nrow(mydata) %% 2 != 1) , ]
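A short sketch of how negation interacts with which(), using a made-up 7-row data frame. The point is that ! belongs with logical vectors and - with integer positions; mixing them up does not raise an error, it just selects the wrong rows.

```r
mydata <- data.frame(A = rep(5, 7), B = rep(4, 7))
even <- 1:nrow(mydata) %% 2 == 0      # logical vector, one entry per row

nrow(mydata[!even, ])                 # negate the logical vector
# [1] 4
nrow(mydata[-which(even), ])          # drop the integer positions 2, 4, 6
# [1] 4

# !c(2, 4, 6) coerces the positions to logical and gives FALSE FALSE FALSE,
# so no row is selected at all
nrow(mydata[!which(even), ])
# [1] 0
```

One caveat with the - form: if which() finds nothing, -integer(0) selects zero rows, so the logical form is the safer default.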
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
.
.
slice(-c(2, 4, 6)) %>%
.
.
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data - No need to manage a new data.frame.
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
nr <- nrow(x)
if (nr < row_index) {
print('row_index exceeds number of rows')
} else if (row_index == 1)
{
return(x[2:nr, ])
} else if (row_index == nr) {
return(x[1:(nr - 1), ])
} else {
return (x[c(1:(row_index - 1), (row_index + 1):nr), ])
}
}
Its main flaw is that the row_index argument doesn't follow the R pattern of being a vector of values. There may be other problems as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements on this would be very welcome!
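In response to the request for improvements, one possible vectorised variant is sketched below. The name remove_rows_by_index and the choice to stop() on out-of-range indices are my own assumptions, not something from the answer above.

```r
remove_rows_by_index <- function(x, row_index) {
  stopifnot(is.data.frame(x), is.numeric(row_index))
  # x[-integer(0), ] would select zero rows, so handle "nothing to remove" first
  if (length(row_index) == 0) return(x)
  if (any(row_index < 1 | row_index > nrow(x))) {
    stop("row_index must lie between 1 and nrow(x)")
  }
  # Negative indexing already copes with the first row, the last row,
  # and several rows at once
  x[-row_index, , drop = FALSE]
}

df <- data.frame(A = 1:7, B = letters[1:7])
remove_rows_by_index(df, c(2, 4, 6))$A
# [1] 1 3 5 7
```

Raising an error instead of printing a message means a bad index cannot be mistaken for a valid result further down a script.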
To identify by a name:
Call out the unique ID and identify the location in your data frame (DF).
Mark to delete. If the unique ID applies to multiple rows, all these rows will be removed.
Code:
Rows<-which(grepl("unique ID", DF$Column))
DF2<-DF[-c(Rows),]
DF2
Another approach when working with Unique IDs is to subset data:
*This came from an actual report where I wanted to remove the chemical standard
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name.
The ! is important for excluding

Finding sequences in rows in R based on the rep function on a certain column

I'm trying to find a sequence of 0's in a row based on the rep function of a certain column. Below is my best attempt so far which throws an error. I tried using an apply loop but failed miserably and I don't really want to use a for loop unless I have to as my true dataset is about 800,000 rows. I have tried looking up solutions but can't seem to find anything and have spent a few hours at this and had no luck. I have also attached the desired output.
library(data.table)
TEST_DF <- data.table(INDEX = c(1,2,3,4),
COL_1 = c(0,0,0,0),
COL_2 = c(0,0,2,5),
COL_3 = c(0,0,0,0),
COL_4 = c(0,2,0,1),
DAYS = c(4,4,2,2))
IN_FUN <- function(x, y)
{
x <- rle(x)
if( max(as.numeric(x$lengths[x$values == 0])) >= y )
{
"Y"
}
else
{
"N"
}
}
TEST_DF$DEFINITION <- apply(TEST_DF[, c(2:5), with = FALSE], 1,
FUN = IN_FUN(TEST_DF[, c(2:5), with = FALSE], TEST_DF$DAYS))
DESIRED <- TEST_DF <- data.table(P_ID = c(1,2,3,4),
COL_1 = c(0,0,0,0),
COL_2 = c(0,0,2,5),
COL_3 = c(0,0,0,0),
COL_4 = c(0,2,0,1),
DAYS = c(4,4,2,2),
DEFINITION = c("Y","N","Y","N"),
INDEX = c(2,NA,4,NA))
For the first row I want to see if four 0's are within COL_1 to COL_4, four 0's within row 2 and two 0's within rows 3 and 4. Basically the number of 0's is given by the value in the DAYS column. So since four 0's are within row 1, DEFINITION gets a value of "Y", row 2 gets a value of "N" since there are only three 0's, row 3 should get a value of "Y" since there are two consecutive 0's, etc.
Also, if possible, if the DEFINITION column has a value of "Y" in it, then it should return the column index of the first occurrence of the desired sequence, e.g. in row 1 since the first occurrence of a 0 in the 4 0's we're looking for is in COL_1 then we should get a value of 2 for the INDEX column and row 2 get a NA since DEFINITION is "N", etc.
Feel free to make any edits to make it clearer for other users and let me know if you need better information.
Cheers in advance :)
EDIT:
Below is a slightly extended data table. Let me know if this is sufficient.
TEST_DF <- data.table(P_ID = c(1,2,3,4,5,6,7,8,10),
COL_1 = c(0,0,0,0,0,0,0,5,90),
COL_2 = c(0,0,0,0,0,0,3,78,6),
COL_3 = c(0,0,0,0,0,0,7,5,0),
COL_4 = c(0,0,0,0,0,5,0,2,0),
COL_5 = c(0,0,0,0,0,7,2,0,0),
COL_6 = c(0,0,0,0,0,9,0,0,5),
COL_7 = c(0,0,0,0,0,1,0,0,6),
COL_8 = c(0,0,0,0,0,0,0,1,8),
COL_9 = c(0,0,0,0,0,1,6,1,0),
COL_10 = c(0,0,0,0,0,0,7,1,0),
COL_11 = c(0,0,0,0,0,0,8,3,0),
COL_12 = c(0,0,0,0,0,0,9,6,7),
DAYS = c(10,8,12,4,5,4,3,4,7))
Where the DEFINITION column for the rows would be c(1,1,1,1,1,0,1,0,0) where 1 is "Y" and 0 is "N". Either is ok.
For the INDEX column in the new edit the values should be c(2,2,2,2,2,NA,7,NA,NA)
Was able to do this with some math trickery. I created a binary matrix where an element is 1 if it was originally 0 and 0 otherwise. Then, for each row I set the nth element in the row equal to the ((n-1)th element + the nth element) times the nth element. In this transformed matrix, the value of an element is equal to the number of consecutive prior elements which were 0 (including this element).
m<-as.matrix(TEST_DF[, 2:(ncol(TEST_DF)-1L)])
m[m==1]<-2
m[m==0]<-1
m[m!=1]<-0
for(i in 2:ncol(m)){
m[,i]=(m[,i-1]+m[,i])*m[,i]
}
# note the use of with=FALSE -- this forces ncol to be evaluated
# outside of TEST_DF, leading the result to be used as a
# column number instead of just evaluating to a scalar
m<-as.matrix(cbind(m, Days=TEST_DF[,ncol(TEST_DF),with=FALSE]))
indx<-apply(m[,-ncol(m)] >= m[,ncol(m)],1,function(x) match(TRUE,x) )
TEST_DF$DEFINITION<-ifelse(is.na(indx),0,1)
TEST_DF$INDEX<-indx-TEST_DF$DAYS+2
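The running-count recurrence can be traced on a single row. The sketch below uses the COL_1 to COL_12 values for P_ID 7 from the extended table, with DAYS equal to 3; run, end_pos and index are names made up for the illustration.

```r
x    <- c(0, 3, 7, 0, 2, 0, 0, 0, 6, 7, 8, 9)  # COL_1..COL_12 for P_ID 7
days <- 3

run <- as.integer(x == 0)                       # 1 where the value is zero
for (i in 2:length(run)) {
  run[i] <- (run[i - 1] + run[i]) * run[i]      # extend the run, or reset to 0
}
run
# [1] 1 0 0 1 0 1 2 3 0 0 0 0

end_pos <- match(TRUE, run >= days)             # first position where the run is long enough
index   <- end_pos - days + 2                   # +1 back to the run's start, +1 for the P_ID column
index
# [1] 7
```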
Note: I stole some stuff from this post
I think I understand this better now that the question has been edited some. This has loops so it might not be optimal speed-wise, but the set statement should help with this. It still has some of the speed-up that data.table provides.
#Combined all column values in giant string
TEST_DF[ , COL_STRING := paste(COL_1,COL_2,COL_3,COL_4,COL_5,COL_6,COL_7,COL_8,COL_9,COL_10,COL_11,COL_12,sep=",")]
TEST_DF[ , COL_STRING := paste0(COL_STRING,",")]
#Using the Days variable, create a string to be searched
for (i in 1:nrow(TEST_DF))
set(TEST_DF,i=i,j="FIND",value=paste(rep("0,",TEST_DF[i]$DAYS),sep="",collapse=""))
#Find where pattern starts. A negative 1 value means it does not exist
for (i in 1:nrow(TEST_DF))
set(TEST_DF,i=i,j="INDEX",value=regexpr(TEST_DF[i]$FIND,TEST_DF[i]$COL_STRING,fixed=TRUE)[1])
#Define DEFINITION
TEST_DF[ , DEFINITION := 1*(INDEX != -1)]
#Convert the character position of the match into a count of the columns before it (by counting commas)
require(stringr)
for (i in 1:nrow(TEST_DF))
set(TEST_DF,i=i,j="INDEX",value=str_count(substr(TEST_DF[i]$COL_STRING,1,TEST_DF[i]$INDEX),","))
#Clean up variables
TEST_DF[ , INDEX := INDEX + DEFINITION*2L]
TEST_DF[INDEX==0L, INDEX := NA_integer_]
You might explore the IRanges package. I just defined the test dataset as a data.frame, since I am not familiar with data.table. I then expanded it to your dataset size of 800000:
TEST_DF <- TEST_DF[sample(nrow(TEST_DF), 800000, replace=TRUE),]
Then, we put IRanges to work:
library(IRanges)
m <- t(as.matrix(TEST_DF[,2:13]))
l <- relist(Rle(m), PartitioningByWidth(rep(nrow(m), ncol(m))))
r <- ranges(l)
validRuns <- width(r) >= TEST_DF$DAYS
TEST_DF$DEFINITION <- sum(validRuns) > 0
TEST_DF$INDEX <- drop(phead(start(r)[validRuns], 1)) + 1L
The first step simplifies the table to a matrix, so we can transpose and get things in the right layout for a light-weight partitioning (PartitioningByWidth) of the data into a type of list. The data are converted into a run-length encoding (Rle) along the way, which finds the runs of zeros in each row. We can extract the ranges representing the runs and then compute on them more efficiently than we might on the split Rle directly. We find the runs that meet or exceed the DAYS and record which groups (rows) have at least one such run. Finally, we find the start of the valid runs, take the first start for each group with phead, and drop so that those with no runs become NA.
For 800,000 rows, this takes about 4 seconds. If that's not fast enough, we can work on optimization.

Is it possible to swap columns around in a data frame using R?

I have four variables in a data frame and would like to swap the 4 columns around from
"dam" "piglet" "fdate" "ssire"
to
"piglet" "ssire" "dam" "tdate"
Is there any way I can do the swapping using R?
Any help would be very much appreciated.
Baz
dfrm <- dfrm[c("piglet", "ssire", "dam", "tdate")]
OR:
dfrm <- dfrm[ , c("piglet", "ssire", "dam", "tdate")]
d <- data.frame(a=1:3, b=11:13, c=21:23)
d
# a b c
#1 1 11 21
#2 2 12 22
#3 3 13 23
d2 <- d[,c("b", "c", "a")]
d2
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3
or you can do same thing using index:
d3 <- d[,c(2, 3, 1)]
d3
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3
To summarise the other posts, there are three ways of changing the column order, and two ways of specifying the indexing in each method.
Given a sample data frame
dfr <- data.frame(
dam = 1:5,
piglet = runif(5),
fdate = letters[1:5],
ssire = rnorm(5)
)
Kohske's answer: You can use standard matrix-like indexing using column numbers
dfr[, c(2, 4, 1, 3)]
or using column names
dfr[, c("piglet", "ssire", "dam", "fdate")]
DWin & Gavin's answer: Data frames allow you to omit the row argument when specifying the index.
dfr[c(2, 4, 1, 3)]
dfr[c("piglet", "ssire", "dam", "fdate")]
PaulHurleyuk's answer: You can also use subset.
subset(dfr, select = c(2, 4, 1, 3))
subset(dfr, select = c("piglet", "ssire", "dam", "fdate"))
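As a quick check that these approaches agree, here is a sketch on a deterministic version of the sample data frame (fixed values instead of runif and rnorm, which is my substitution so the result is reproducible).

```r
dfr <- data.frame(dam = 1:5, piglet = 6:10,
                  fdate = letters[1:5], ssire = 11:15)

by_number <- dfr[, c(2, 4, 1, 3)]
by_name   <- dfr[c("piglet", "ssire", "dam", "fdate")]
by_subset <- subset(dfr, select = c(piglet, ssire, dam, fdate))

names(by_number)
# [1] "piglet" "ssire"  "dam"    "fdate"

# All three give the same columns in the same order
identical(names(by_number), names(by_name))
identical(names(by_name), names(by_subset))
```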
You can use subset's 'select' argument;
#Assume df contains "dam" "piglet" "fdate" "ssire"
newdf<-subset(df, select=c("piglet", "ssire", "dam", "tdate"))
I noticed that this is almost an 8-year-old question. But for people who are starting to learn R and might stumble upon this question, like I did, you can now use the much more flexible select() function from the dplyr package to accomplish the swapping operation as follows.
# Install and load the dplyr package
install.packages("dplyr")
library("dplyr")
# Override the existing data frame with the desired column order
df <- select(df, piglet, ssire, dam, tdate)
This approach has the following advantages:
You will have to type less as the select() does not require variable names to be enclosed within quotes.
In case your data frame has more than 4 variables, you can utilize select helper functions such as starts_with(), ends_with(), etc. to select multiple columns without having to name each column and rearrange them with much ease.
Relevance Note: In response to some users (myself included) that would like to swap columns without having to specify every column, I wrote this answer up.
TL;DR: This answer provides a one-liner for numerical indices and, at the end, a function that swaps exactly two columns given by name or by number. Neither uses imports, and both correctly swap any two columns in a data frame of any size. A function that reassigns an arbitrary number of columns, which may cause unavoidable superfluous swaps if not used carefully, is also made available (read more & get functions in Summary section).
Preliminary Solution
Suppose you have some huge (or not) data frame, DF, and you only know the indices of the two columns you want to swap, say 1 < n < m < length(DF). (Also important is that your columns are not adjacent, i.e. |n-m| > 1 which is very likely to be the case in our "huge" data frame but not necessarily for smaller ones; work-arounds for all degenerate cases are provided at the end).
Because it is huge, there are a ton of columns and you don't want to have to specify every other column by hand, or it isn't huge and you're just lazy (that is, someone with fine taste in coding); either way, this one-liner will do the trick:
DF <- DF[ c( 1:(n-1), m, (n+1):(m-1), n, (m+1):length(DF) ) ]
Each piece works like this:
1:(n-1) # This keeps every column before column `n` in place
m # This places column `m` where column `n` was
(n+1):(m-1) # This keeps every column between the two in place
n # This places column `n` where column `m` was
(m+1):length(DF) # This keeps every column after column `m` in place
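Plugging in concrete numbers may make the pieces easier to follow. This sketch assumes a six-column data frame with n = 2 and m = 5, which satisfies the restrictions stated above; the column names are invented.

```r
DF <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)
n <- 2
m <- 5

idx <- c(1:(n - 1), m, (n + 1):(m - 1), n, (m + 1):length(DF))
idx
# [1] 1 5 3 4 2 6

names(DF[idx])
# [1] "a" "e" "c" "d" "b" "f"
```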
Generalizing for Degenerates
Because of how the : operator works, i.e. allowing "backwards-ranges" like this,
> 10:0
[1] 10 9 8 7 6 5 4 3 2 1 0
we have to be careful about our choices and placements of n and m, hence our previous restrictions. For instance, n < m doesn't lose us any generality (one of the columns has to be before the other one if they are different), however, it means we do need to be careful about which goes where in our line of code. We can make it so that we don't have to check this condition with the following modification:
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
We have replaced every instance of n and m with min(n,m) and max(n,m) respectively, meaning that the correct ordering for our code will be preserved even in the case that n > m.
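A small sketch of the min/max version in action. swap_interior is a hypothetical wrapper I added only so the one-liner can be called twice; it still assumes two non-adjacent interior columns.

```r
DF <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# Hypothetical wrapper around the min/max one-liner, for illustration only
swap_interior <- function(DF, n, m) {
  lo <- min(n, m)
  hi <- max(n, m)
  DF[c(1:(lo - 1), hi, (lo + 1):(hi - 1), lo, (hi + 1):length(DF))]
}

names(swap_interior(DF, 2, 5))
# [1] "a" "e" "c" "d" "b" "f"
names(swap_interior(DF, 5, 2))   # argument order no longer matters
# [1] "a" "e" "c" "d" "b" "f"

# The other restrictions still apply: with lo == 1 the first piece
# becomes 1:0, which counts backwards instead of being empty
1:0
# [1] 1 0
```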
In the cases where min(n,m) == 1, where max(n,m) == length(DF), where both hold at the same time, and where |n-m| == 1, we will make some less aesthetic modifications involving if/else so that we can forget about checking for them. If you know that one of these cases always applies (i.e. you are always swapping some interior column with the first column, swapping some interior column with the last column, swapping the first and last columns, or swapping two adjacent columns), you can actually express these actions more succinctly, because they usually just require omitting parts from our restricted case:
# Swapping not the last column with the first column
# We just got rid of 1:(min(n,m)-1) because it would be invalid and not what we meant
# since min(n,m) == 1
# Now we just stick the other column right at the front
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
# Also equivalent since we know min(n,m) == 1, for the leftover index i
DF <- DF[ c( i, 2:(i-1), 1, (i+1):length(DF) ) ]
# Swapping not the first column with the last column
# Similarly, we just got rid of (max(n,m)+1):length(DF) because it would also be invalid
# and not what we meant since max(n,m) == length(DF)
# Now we just stick the other column right at the end
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent since we know max(n,m) == length(DF), for the leftover index, say i
DF <- DF[ c( 1:(i-1), length(DF), (i+1):(length(DF)-1), i ) ]
# Swapping the first column with the last column
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent (for if you don't actually know the length beforehand, as assumed
# elsewhere)
DF <- DF[ c( length(DF), 2:(length(DF)-1), 1 ) ]
# Swapping two interior adjacent columns
# Here we drop the explicit swap on either side of our middle column segment
# This is actually enough because then the middle segment becomes a backwards range
# because we know that `min(n,m) + 1 = max(n,m)`
# The range is just an ordering of the two adjacent indices from largest to smallest
DF <- DF[ c( 1:(min(n,m)-1), (min(n,m)+1):(max(n,m)-1), (max(n,m)+1):length(DF) )]
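The special cases above all trace back to how R's colon operator behaves: a:b counts downward when a > b instead of returning an empty range. The following sketch (with a made-up three-column data frame) shows both sides of that behaviour, the useful one for adjacent columns and the harmful one when min(n,m) == 1:

```r
n <- 2
m <- 3   # adjacent to n

# For adjacent columns the "middle segment" is a two-element backwards range
middle <- (min(n,m)+1):(max(n,m)-1)
print(middle)

# When min(n,m) == 1 the leading range 1:(min(n,m)-1) is 1:0, i.e. c(1, 0), not empty
leading <- 1:(1-1)
print(leading)

# A zero index is silently dropped, so column 1 would be picked up a second time
DF <- data.frame(a = 1, b = 2, c = 3)   # hypothetical data
print(colnames(DF[leading]))

# seq_len() is a base R way to get a range that really is empty at length 0
print(seq_len(0))
```

This is why the leading and trailing segments have to be omitted by hand in the edge cases, while the adjacent case works by dropping the explicit swap.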
"But!", I hear you saying, "What if more than one of these cases occur simultaneously, like it did in the third version in the block above!?". Right, coding up versions for each case is an enormous waste of time if one wants to be able to "swap columns" in the most general sense.
Swapping any Two Columns
It will be easiest to generalize our code to cover all of the cases at the same time, because they all employ essentially the same strategy. We will use if/else to keep our code a one-liner:
DF <- DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ]
That's totally unreadable and probably pretty unfriendly to anyone who might try to understand or recreate your code (including yourself), so better to box it up in a function.
# A function that swaps the `n` column and `m` column in the data frame DF
swap <- function(DF, n, m)
{
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
A more robust version that can also swap on column names and has semi-explanatory comments:
# Returns data frame object with columns `n` and `m` swapped
# `n` and `m` can be column names, numerical indices, or a heterogeneous pair of both
swap <- function(DF, n, m)
{
# First, if either n or m is a column name, we want to get its index
# We assume that if they aren't column names, they are indices (integers)
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
# Make sure each index is actually valid
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
# Now that both are indices, we want to make sure that n != m,
# because if they are equal, we don't need to do anything
# (checking this before the conversion would miss a name and an index
# that refer to the same column)
if (n==m) return(DF)
# Also, for readability, lets go ahead and set which column is earlier, and which is later
earlier <- min(n,m)
later <- max(n,m)
# This constructs the first third of the indices
# These are the columns that, if any, come before the earlier column you are swapping
firstThird <- if ( earlier==1 ) c() else 1:(earlier-1)
# This constructs the last third of the indices
# These are the columns, if any, that come after the later column you are swapping
lastThird <- if ( later==length(DF) ) c() else (later+1):length(DF)
# This checks if the columns to be swapped are adjacent and then constructs the
# secondThird accordingly
if ( earlier+1 == later )
{
# Here, the second third is a backwards range: the two column indices ordered from later to earlier
secondThird <- (earlier+1):(later-1)
}
else
{
# Here; the second third is a list of
# the later column you want to swap
# the columns in between
# and then the earlier column you want to swap
secondThird <- c( later, (earlier+1):(later-1), earlier)
}
# Now we assemble our indices and return our permutation of DF
return (DF[ c( firstThird, secondThird, lastThird ) ])
}
And, for ease of reuse at a lower cost in space, a comment-less version that checks index validity and can handle column names, i.e. does everything in pretty close to the smallest space it can (yes, you could vectorize the two checks that get performed using ifelse(...), but then you'd have to unpack the vector back into n,m or change how the final line is written):
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
Permutations (or How to Do Specifically What the Question Asked and More!)
With our swap function in tow, we can try to actually do what the original question asked. The easiest way to do this is to build a function that utilizes the really cool power that comes with a choice of heterogeneous arguments. First, create a mapping from each column name to the position it should end up in:
mapping <- data.frame( "piglet" = 1, "ssire" = 2, "dam" = 3, "tdate" = 4)
In the case of the original question, these are all of the columns in our original data frame, but we will build a function where this doesn't have to be the case:
# A function that takes two data frames, one with actual data: DF, and the other with a
# rearrangement of the columns: R
# R must be structured so that colnames(R) is a subset of colnames(DF)
# Alternatively, R can be structured so that 1 <= as.integer(colnames(R)) <= length(DF)
# Further, 1 <= R$column <= length(DF), and length(R$column) == 1
# These structural requirements on R are not checked
# This is for brevity and because most likely R has been created specifically for use with
# this function
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[[col]])
}
return (DF)
}
Wait, that's it? Yup. This will swap every column name to the appropriate placement. The power for such simplicity comes from swap taking heterogeneous arguments meaning we can specify the moving column name that we want to put somewhere, and so long as we only ever try to put one column in each position (which we should), once we put that column where it belongs, it won't move again. This means that even though it seems like later swaps could undo previous placements, the heterogeneous arguments make certain that won't happen, and so additionally, the order of the columns in our mapping also doesn't matter. This is a really nice quality because it means that we aren't kicking this whole "organizing the data" issue down the road too much. You only have to be able to determine which placement you want to send each column you want to move to.
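To see this in action, here is a small self-contained sketch. The swap function is the compact version from above; the data frame, its starting column order and its values are made up for illustration, and rearrange uses R[[col]] so that a bare number is passed to swap (a small assumption on my part rather than the exact call shown above).

```r
# Compact swap() as given earlier in this answer
swap <- function(DF, n, m)
{
  n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
  m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
  if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
  if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
  return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}

# rearrange() with R[[col]] so that swap() receives a plain number
rearrange <- function(DF, R)
{
  for (col in colnames(R))
  {
    DF <- swap(DF, col, R[[col]])
  }
  return (DF)
}

# Hypothetical data: same column names as the mapping, in a scrambled order
DF <- data.frame(tdate = "2014-05-01", piglet = 101, dam = 7, ssire = 3)

mapping  <- data.frame(piglet = 1, ssire = 2, dam = 3, tdate = 4)
# The same target positions, listed in a different order
mapping2 <- data.frame(tdate = 4, dam = 3, ssire = 2, piglet = 1)

result  <- rearrange(DF, mapping)
result2 <- rearrange(DF, mapping2)

print(colnames(result))
print(colnames(result2))
```

Both calls end with the columns in the order piglet, ssire, dam, tdate, which illustrates that the order of the entries in a complete mapping does not matter.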
Ok, ok, there is a catch. If you don't reassign the entire data frame when you do this, then you have superfluous swaps that occur, meaning that if you re-arrange over a subset of columns that isn't "closed", i.e. not every column name has an index that is represented in the rearrangement, then other columns that you didn't explicitly say to move may get moved to other places they don't exactly belong. This can be handled by creating your mapping very carefully, or simply using numerical indices mapping to other numerical indices. In the latter case, this doesn't solve the issue, but it makes more explicit what swaps are taking place and in what order so planning the rearrangement is more explicit and thus less likely to lead to problematic superfluous swaps.
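A short sketch of that catch, under some assumptions: swap_cols and rearrange_cols below are hypothetical, shortened stand-ins that exchange two entries of an index vector (a different technique from the one-liner above, but meant to give the same result for a single swap), and the four-column data frame is made up.

```r
# Hypothetical stand-in for swap(): exchange two entries of an index vector
swap_cols <- function(DF, n, m) {
  if (is.character(n)) n <- match(n, colnames(DF))
  if (is.character(m)) m <- match(m, colnames(DF))
  idx <- seq_along(DF)
  idx[c(n, m)] <- idx[c(m, n)]
  DF[idx]
}

# Hypothetical stand-in for rearrange()
rearrange_cols <- function(DF, R) {
  for (col in colnames(R)) {
    DF <- swap_cols(DF, col, R[[col]])
  }
  DF
}

DF <- data.frame(a = 1, b = 2, c = 3, d = 4)   # made-up data

# A mapping that is not "closed": positions 1 and 2 currently hold a and b,
# which are not mentioned in the mapping
partial <- data.frame(d = 1, c = 2)

result <- rearrange_cols(DF, partial)
print(colnames(result))
```

Columns d and c land where the mapping says, but a and b are pushed to positions 4 and 3, so their relative order is reversed even though nothing asked for that.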
Summary
You can use the swap function that we built to successfully swap exactly two columns or the rearrange function with a "rearrangement" data frame specifying where to send each column name you want to move. In the case of the rearrange function, if any of the placements chosen for each column name are not already occupied by one of the specified columns (i.e. not in colnames(R)), then superfluous swaps can and are very likely to occur (The only instance they won't is when every superfluous swap has a partner superfluous swap that undoes it before the end. This is, as stated, very unlikely to happen by accident, but the mapping can be structured to accomplish this outcome in practice).
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[[col]])
}
return (DF)
}
I quickly wrote a function that takes a data frame v and the column indexes a and b which you want to swap.
swappy = function(v,a,b){ # where v is a dataframe, a and b are the columns indexes to swap
name = deparse(substitute(v))
helpy = v[,a]
v[,a] = v[,b]
v[,b] = helpy
name1 = colnames(v)[a]
name2 = colnames(v)[b]
colnames(v)[a] = name2
colnames(v)[b] = name1
assign(name,value = v , envir =.GlobalEnv)
}
I was using the function by Khôra Willis, which is helpful. But I encountered an error. I tried to make corrections. Here is R code that finally works. The arguments n and m could either be column names or column numbers in data frame DF.
library(dplyr) # provides %>%, select() and all_of()
swap <- function(DF, n, m)
{
if (class(n)=="character") n <- which(colnames(DF)==n)
if (class(m)=="character") m <- which(colnames(DF)==m)
p <- NCOL(DF)
if (!(1<=n & n<=p)) stop("`n` represents invalid index!")
if (!(1<=m & m<=p)) stop("`m` represents invalid index!")
index <- 1:p
index[n] <- m; index[m] <- n
DF0 <- DF %>% select(all_of(index))
return(DF0)
}

Resources