Related
I have a data frame named "mydata" that looks like this this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete row 2,4,6. For example, like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so called boolean vector, aka logical:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to #mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4,] <- NA
where column A is assigned NA (not a number) where A exceeds 4.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
Create id column in your data frame or use any column name to identify the row. Using index is not fair to delete.
Use subset function to create new frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out:
mydata[!which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[!which(1:nrow(mydata) %% 2 != 1) , ]
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
.
.
slice(-c(2, 4, 6)) %>%
.
.
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data - No need to manage a new data.frame.
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
nr <- nrow(x)
if (nr < row_index) {
print('row_index exceeds number of rows')
} else if (row_index == 1)
{
return(x[2:nr, ])
} else if (row_index == nr) {
return(x[1:(nr - 1), ])
} else {
return (x[c(1:(row_index - 1), (row_index + 1):nr), ])
}
}
It's main flaw is it the row_index argument doesn't follow the R pattern of being a vector of values. There may be other problems as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements on this would be very welcome!
To identify by a name:
Call out the unique ID and identify the location in your data frame (DF).
Mark to delete. If the unique ID applies to multiple rows, all these rows will be removed.
Code:
Rows<-which(grepl("unique ID", DF$Column))
DF2<-DF[-c(Rows),]
DF2
Another approach when working with Unique IDs is to subset data:
*This came from an actual report where I wanted to remove the chemical standard
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name.
The ! is important for excluding
Given two data tables (tbl_A and tbl_B), I would like to select all the rows in the tbl_A which have matching rows in tbl_B, and I would like the code to be expressive. If the %in% operator were defined for data.tables, something like this would be be ideal:
subset <- tbl_A[tbl_A %in% tbl_B]
I can think of many ways to accomplish what I want such as:
# double negation (set differences)
subset <- tbl_A[!tbl_A[!tbl_B,1,keyby=a]]
# nomatch with keyby and this annoying `[,V1:=NULL]` bit
subset <- tbl_B[,1,keyby=.(a=x)][,V1:=NULL][tbl_A,nomatch=0L]
# nomatch with !duplicated() and setnames()
subset <- tbl_B[!duplicated(tbl_B),.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# nomatch with !unique() and setnames()
subset <- unique(tbl_B)[,.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# use of a temporary variable (Thanks #Frank)
subset <- tbl_A[, found := FALSE][tbl_B, found := TRUE][(found)][,found:=NULL][]
but each expression is difficult to read and it's not obvious at first glance what the code is doing. Is there a more idiomatic / expressive way of accomplishing this task?
For purposes of example, here are some toy data.tables:
# toy tables
tbl_A <- data.table(a=letters[1:5],
b=1:5,
c=rnorm(5))
tbl_B <- data.table(x=letters[3:7],
y=13:17,
z=rnorm(5))
# both tables might have multiple rows with the same key fields.
tbl_A <- rbind(tbl_A,tbl_A)
tbl_B <- rbind(tbl_B,tbl_B)
setkey(tbl_A,a)
setkey(tbl_B,x)
and an expected result containing the rows in tbl_A which match at least one row in tbl_B:
a b c
1: c 3 -0.5403072
2: c 3 -0.5403072
3: d 4 -1.3353621
4: d 4 -1.3353621
5: e 5 1.1811730
6: e 5 1.1811730
Adding 2 more options
tbl_A[fintersect(tbl_A[,.(a)], tbl_B[,.(a=x)])]
and
tbl_A[unique(tbl_A[tbl_B, nomatch=0L, which=TRUE])]
I'm not sure how expressive it is (apologies if not) but this seems to work:
tbl_A[,.(a,b,c,any(a == tbl_B[,x])), by = a][V4==TRUE,.(a,b,c)]
I'm sure it can be improved - I only found out about any() yesterday and still testing it :)
I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Not that this not a very robust solution. It depends on your exact data structure here (repeating 'c', 'd' pattern) but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
I have a data frame named "mydata" that looks like this this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete row 2,4,6. For example, like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so called boolean vector, aka logical:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to #mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4,] <- NA
where column A is assigned NA (not a number) where A exceeds 4.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
Create id column in your data frame or use any column name to identify the row. Using index is not fair to delete.
Use subset function to create new frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out:
mydata[!which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[!which(1:nrow(mydata) %% 2 != 1) , ]
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
.
.
slice(-c(2, 4, 6)) %>%
.
.
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data - No need to manage a new data.frame.
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
nr <- nrow(x)
if (nr < row_index) {
print('row_index exceeds number of rows')
} else if (row_index == 1)
{
return(x[2:nr, ])
} else if (row_index == nr) {
return(x[1:(nr - 1), ])
} else {
return (x[c(1:(row_index - 1), (row_index + 1):nr), ])
}
}
It's main flaw is it the row_index argument doesn't follow the R pattern of being a vector of values. There may be other problems as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements on this would be very welcome!
To identify by a name:
Call out the unique ID and identify the location in your data frame (DF).
Mark to delete. If the unique ID applies to multiple rows, all these rows will be removed.
Code:
Rows<-which(grepl("unique ID", DF$Column))
DF2<-DF[-c(Rows),]
DF2
Another approach when working with Unique IDs is to subset data:
*This came from an actual report where I wanted to remove the chemical standard
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name.
The ! is important for excluding
I have a data frame named "mydata" that looks like this this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete row 2,4,6. For example, like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so called boolean vector, aka logical:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to #mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4,] <- NA
where column A is assigned NA (not a number) where A exceeds 4.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
Create id column in your data frame or use any column name to identify the row. Using index is not fair to delete.
Use subset function to create new frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out:
mydata[!which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[!which(1:nrow(mydata) %% 2 != 1) , ]
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
.
.
slice(-c(2, 4, 6)) %>%
.
.
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data - No need to manage a new data.frame.
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
nr <- nrow(x)
if (nr < row_index) {
print('row_index exceeds number of rows')
} else if (row_index == 1)
{
return(x[2:nr, ])
} else if (row_index == nr) {
return(x[1:(nr - 1), ])
} else {
return (x[c(1:(row_index - 1), (row_index + 1):nr), ])
}
}
It's main flaw is it the row_index argument doesn't follow the R pattern of being a vector of values. There may be other problems as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements on this would be very welcome!
To identify by a name:
Call out the unique ID and identify the location in your data frame (DF).
Mark to delete. If the unique ID applies to multiple rows, all these rows will be removed.
Code:
Rows<-which(grepl("unique ID", DF$Column))
DF2<-DF[-c(Rows),]
DF2
Another approach when working with Unique IDs is to subset data:
*This came from an actual report where I wanted to remove the chemical standard
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name.
The ! is important for excluding