Operations on elements of column vectors in R

I have a column vector containing 1's, and another numeric column. Example:
day_eq day
1      1
1      5
1      3
1      2
I now want to say:
If an element from day is smaller than its corresponding element in day_eq,
make invalid (a column vector element) = 5.
This is my code:
for (i in 1:nrow(setin)) {
  if (setin[[i, "day"]] < setin[[i, "day_eq"]]) {
    setin[[i, "valid"]] = 0
    setin[[i, "invalid_code"]] = 5
  }
}
It isn't working. It keeps saying:
Error in if (setin[[i, "day"]] < setin[[i, "day_eq"]]) { :
missing value where TRUE/FALSE needed
or
In if (test.ID1$day_eq > test.ID1$day) { :
the condition has length > 1 and only the first element will be used
Where test.ID1 is the set name.
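For context, a likely cause (an assumption, since the full data isn't shown): the first error appears when the comparison yields NA for some row, and the second when if() is handed a whole vector rather than a single TRUE/FALSE. A minimal illustration:
if (NA < 1) {}             # Error: missing value where TRUE/FALSE needed
if (c(1, 5) < c(2, 2)) {}  # Warning (an error since R 4.2): the condition has length > 1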

You don't need a loop for that. I'm not sure exactly what you're doing, but ifelse should be able to help you:
setin$valid <- ifelse(setin$day < setin$day_eq, 0, NA)
setin$invalid_code <- ifelse(setin$day < setin$day_eq, 5, NA)

Your data is:
day_eq <- c(1, 1, 1, 1)
day <- c(1, 5, 3, 2)
setin <- data.frame(day_eq, day)
The solution using dplyr is:
library(dplyr)
setin %>% mutate(invalid = ifelse(day < day_eq, 5, 0))
I used setin as the set name; since you also use test.ID1, just substitute that name if needed.
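For reference, running the snippet above on the sample data: no day value is below its day_eq (day_eq is all 1s), so every row gets invalid = 0:
  day_eq day invalid
1      1   1       0
2      1   5       0
3      1   3       0
4      1   2       0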

Related

How to exclude a range of data points by index from a dataframe in R [duplicate]

I have a data frame named "mydata" that looks like this:
   A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete rows 2, 4 and 6. For example, like this:
   A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you want to drop those rows entirely; otherwise, R just prints the result.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so-called boolean vector, also known as a logical vector:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to @mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4] <- NA
where column A is assigned NA (not available) wherever A exceeds 4. Note there is no comma here: we are indexing the column vector itself, not the data frame.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criterion that can be specified, and you can use one of the many subsetting tools in R to exclude cases based on that rule.
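For instance (a hypothetical rule, purely to illustrate the idea): exclude cases that fail a completeness check instead of naming row numbers:
keep <- complete.cases(myData)  # hypothetical exclusion rule: no missing values
newdata <- myData[keep, ]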
Create an id column in your data frame, or use any existing column that identifies each row; deleting by raw index is fragile.
Use the subset function to create the new frame:
updated_myData <- subset(myData, id != 6)
print(updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print(updated_myData)
By simplified sequence:
mydata[-(1:3 * 2), ]
By sequence:
mydata[seq(1, nrow(mydata), by = 2), ]
By negative sequence:
mydata[-seq(2, nrow(mydata), by = 2), ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1), ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0), ]
Or if you want to subset by filtering even numbers out:
mydata[-which(1:nrow(mydata) %% 2 == 0), ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[-which(1:nrow(mydata) %% 2 != 1), ]
(Note the - rather than ! in front of which(): applying ! to the integer indices that which() returns would produce an all-FALSE logical vector and select no rows at all.)
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
  .
  .
  slice(-c(2, 4, 6)) %>%
  .
  .
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data; no need to manage a new data.frame:
employee.data <- subset(employee.data, name != "Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
  nr <- nrow(x)
  if (nr < row_index) {
    print('row_index exceeds number of rows')
  } else if (row_index == 1) {
    return(x[2:nr, ])
  } else if (row_index == nr) {
    return(x[1:(nr - 1), ])
  } else {
    return(x[c(1:(row_index - 1), (row_index + 1):nr), ])
  }
}
Its main flaw is that the row_index argument doesn't follow the R convention of accepting a vector of values. There may be other problems, as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements would be very welcome!
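One possible improvement (a sketch in the same spirit, not battle-tested): accept a vector of indices and lean on R's negative indexing, which removes the need for the case analysis:
removeRowsByIndex <- function(x, row_index) {
  # Validate: every index must refer to an existing row
  stopifnot(all(row_index >= 1), all(row_index <= nrow(x)))
  x[-row_index, , drop = FALSE]
}
removeRowsByIndex(myData, c(2, 4, 6))  # drops rows 2, 4 and 6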
To identify by a name:
Call out the unique ID and find its location in your data frame (DF).
Mark it for deletion. If the unique ID applies to multiple rows, all of those rows will be removed.
Code:
Rows <- which(grepl("unique ID", DF$Column))
DF2 <- DF[-Rows, ]
DF2
Another approach when working with unique IDs is to subset the data:
*This came from an actual report where I wanted to remove the chemical standard.
Chem.Report <- subset(Chem.Report, Chem_ID != "Standard")
Chem_ID is the column name.
The != is what does the excluding.
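One general caveat about subset() worth knowing here (standard documented behavior, not specific to this report): it silently drops rows where the condition evaluates to NA, whereas bracket indexing would keep them as rows of NAs. If Chem_ID can contain NAs and you want to keep those rows, test for that explicitly:
Chem.Report <- subset(Chem.Report, Chem_ID != "Standard" | is.na(Chem_ID))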

Referencing Variable Row Number in R

I am new to R and would appreciate your help with the following question:
I have code that runs through all of the values (x) in a column of a dataset called m, comparing them one by one to a fixed value via a for loop. I'd like for x to be compared to my fixed value (0.17) ONLY IF the cell in m[(SAME ROW AS x), "reference_column_name"] contains a certain string.
The goal is to end up with a new column in m of values 0, 1, 2, or 3, based on the comparison of x with the cell from the reference column in the same row as x. Something like this:
new_column
0
2
2
3
1
1
2
0
3
How do I refer to the row of x (as my variable is changing as the for loop continues)?
With what can I replace "(SAME ROW AS x)"?
This is my code:
m$new_column <- 0 # I start by assigning everything the value 0.
for (x in m$current_column) {
  if (grepl("string", m[(SAME ROW AS x), "reference_column_name"], fixed = TRUE) == TRUE) {
    if (is.na(x)) {
      m$new_column <- 0
    } else if (x <= 0.17) {
      m$new_column <- 1
    } else if (x > 0.17) {
      m$new_column <- 2
    }
  } else {
    m$new_column <- 3
  }
}
I have changed all of the variable and column names to make reading this question easier - I am aware that names should be shorter.
Thanks for your help!
As per my understanding of your question, here is my solution:
m$new_column <- ifelse(grepl("string", m$ref_column),
                       ifelse(is.na(m$x), 0,
                              ifelse(m$x <= 0.17, 1, 2)),
                       3)
This code first checks, row by row, for the string in the reference column. Where it doesn't find it, the result is 3. Where it finds it, it goes into the second ifelse block: there it first checks for NA and assigns 0, otherwise it moves into the third ifelse block, which finally checks whether your "x" column is 0.17 or less and assigns 1, else 2.
Hope this helps
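A more readable alternative to nested ifelse calls (a sketch assuming the dplyr package is available; column names as used in the answer above) is case_when, whose conditions are checked in order so the first match wins:
library(dplyr)
m <- m %>%
  mutate(new_column = case_when(
    !grepl("string", ref_column) ~ 3,
    is.na(x)                     ~ 0,
    x <= 0.17                    ~ 1,
    TRUE                         ~ 2
  ))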
Could use a series of properly indexed assignments:
dat <- data.frame(x = runif(20),
                  ref_col = sample(c("string", "not string"), 20, replace = TRUE))
dat$new_col[dat$x > 0.17 & dat$ref_col=="string"] <- 2
dat$new_col[dat$x <= 0.17 & dat$ref_col=="string"] <- 1
dat$new_col[ is.na(dat$x)] <- 0
dat$new_col[ dat$ref_col != "string"] <- 3
dat
I didn't have any NAs in my x's, but I predict they would have been properly assigned.
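For what it's worth, that prediction rests on a documented subassignment rule: with a length-one right-hand side, NA entries in a logical subscript select nothing, so the NA rows are simply skipped until the is.na() line fills them in. A minimal demonstration:
v <- c(10, 20, 30)
v[c(TRUE, NA, FALSE)] <- 0  # only v[1] changes; the NA position is skipped
v                           # 0 20 30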

Trimming NAs based on column subset - a more elegant solution?

A New Year's quandary for the stackoverflow community, which has been quite a help through the posts and answers I've read in the past (this is my first question). I've found a workaround, but I'm wondering what other approaches/solutions might be suggested.
I am attempting to remove trailing NA's from a large data.frame, but those NA's are only found in a few of the columns of the data.frame and I would like to retain all columns in the output. Here is a representative data subset.
df = data.frame(var1 = rep("A", 8),
                var2 = c("a", "b", "c", "d", "e", "f", "g", "h"),
                var3 = c(0, 1, NA, 2, 3, NA, NA, NA),
                var4 = c(0, 0, NA, 4, 5, NA, NA, NA),
                var5 = c(0, 0, NA, 0, 2, 4, NA, NA))
Goals of the process:
Trim trailing NAs based on NA presence in var3, var4 and var5
Retain all columns in the final output
Only remove trailing NAs (i.e. row 3 remains in the record as a placeholder)
Only trim if all three columns have an NA (i.e. rows 7 and 8, but not row 6)
Based on these goals, the solution should remove the last two rows of df:
df.output = df[-c(7,8),]
The behaviour of na.trim (in the zoo package) is ideal, as sides = "right" limits removal to NA's at the end of the data.frame, and my workaround involved altering the na.trim.default function to include a subset term.
Any suggestions? Many thanks for any help.
EDIT: Just to complete this question, below is the function I created from the na.trim.default code. It works, but as noted, it does require loading the zoo package.
na.trim.multiplecols <- function(object, colrange,
                                 sides = c("both", "left", "right"),
                                 is.na = c("any", "all"), ...) {
  is.na <- match.arg(is.na)
  nisna <- if (is.na == "any" || length(dim(object[, colrange])) < 1) {
    complete.cases(object[, colrange])
  } else {
    rowSums(!is.na(object[, colrange])) > 0
  }
  idx <- switch(match.arg(sides),
                left  = cumsum(nisna) > 0,
                right = rev(cumsum(rev(nisna) > 0) > 0),
                both  = (cumsum(nisna) > 0) & rev(cumsum(rev(nisna)) > 0))
  if (length(dim(object)) < 2)
    object[idx]
  else
    object[idx, , drop = FALSE]
}
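For reference, calling it per the stated goals (a quick check against the example df: sides = "right" and is.na = "all" over columns 3:5) should drop rows 7 and 8 while keeping the interior all-NA row 3:
na.trim.multiplecols(df, colrange = 3:5, sides = "right", is.na = "all")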
Something based on max(which(!is.na())) will work. We use this to find the largest index of non-missing data from the columns of interest.
Using your df
ind <- max(max(which(!is.na(df$var3))),
           max(which(!is.na(df$var4))),
           max(which(!is.na(df$var5))))
df[1:ind, ]
  var1 var2 var3 var4 var5
1    A    a    0    0    0
2    A    b    1    0    0
3    A    c   NA   NA   NA
4    A    d    2    4    0
5    A    e    3    5    2
6    A    f   NA   NA    4
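The same cut-off index can be computed more compactly (a sketch with the same logic: the last row having at least one non-NA among the three columns):
ind <- max(which(rowSums(!is.na(df[, 3:5])) > 0))
df[1:ind, ]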
Edit: First solution using base rle and apply
t <- rle(apply(as.matrix(df[, 3:5]), 1, function(x) all(is.na(x))))
r <- ifelse(t$values[length(t$values)] == TRUE, t$lengths[length(t$lengths)], 0)
if (r > 0) head(df, -r) else df # guard: head(df, 0) would return zero rows when there is nothing to trim
Second solution using Rle from package IRanges:
require(IRanges)
t <- min(sapply(df[, 3:5], function(x) {
  o <- Rle(x)
  val <- runValue(o)
  if (is.na(val[length(val)])) {
    len <- runLength(o)
    out <- len[length(len)]
  } else {
    out <- 0
  }
}))
if (t > 0) head(df, -t) else df # same guard as above for the no-trailing-NA case

Selectively replacing columns in R with their delta values

I've got data being read into a data frame in R, by column. Some of the columns will increase in value; for those columns only, I want to replace each value (n) with its difference from the previous value in that column. For example, looking at an individual column, I want
c(1,2,5,7,8)
to be replaced by
c(1,3,2,1)
which are the differences between successive elements
However, it's getting really late in the day, and I think my brain has just stopped working. Here's my code at present:
col1 <- c(1,2,3,4,NA,2,3,1) # This column rises and falls, so we want to ignore it
col2 <- c(1,2,3,5,NA,5,6,7) # Note: this column always rises in value, so we want to replace it with deltas
col3 <- c(5,4,6,7,NA,9,3,5) # This column rises and falls, so we want to ignore it
d <- cbind(col1, col2, col3)
d
fix_data <- function(data) {
  # Iterate through each column...
  for (column in data[, 1:dim(data)[2]]) {
    lastvalue <- 0
    # Now walk through each value in the column,
    # checking to see if the column consistently rises in value
    for (value in column) {
      if (is.na(value) == FALSE) { # Need to ignore NAs
        if (value >= lastvalue) {
          alwaysIncrementing <- TRUE
        } else {
          alwaysIncrementing <- FALSE
          break
        }
      }
    }
    if (alwaysIncrementing) {
      print(paste("Column", column, "always increments"))
    }
    # If a column is always incrementing, alwaysIncrementing will now be TRUE.
    # In this case, I want to replace each element in the column with the delta between successive
    # elements. The size of the column shrinks by 1 in doing this, so just prepend a copy of
    # the 1st element to the start of the list to ensure the column length remains the same
    if (alwaysIncrementing) {
      print(paste("This is an incrementing column:", colnames(column)))
      column <- c(column[1], diff(column, lag = 1))
    }
  }
  data
}
fix_data(d)
d
If you copy/paste this code into RGui, you'll see that it doesn't do anything to the supplied data frame.
Besides losing my mind, what am I doing wrong??
Thanks in advance
Without addressing the code in any detail: you're assigning values to column, which is a local variable within the loop (i.e. there is no relationship between column and data in that context). You need to assign those values back into the appropriate columns of data.
Also, data is local to your function, so you need to assign the function's result back to your object after running it.
Incidentally, you can use diff to see if any value is incrementing rather than looping over every value:
idx <- apply(d, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
d[,idx] <- blah
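Putting both fixes together, a minimal corrected sketch (same increment test as above; the first element is kept in place so the column length doesn't change, as the comments in the question intended):
fix_data <- function(data) {
  # Columns whose non-NA values never decrease
  incr <- apply(data, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
  for (j in which(incr)) {
    # Replace the column with its first value followed by successive deltas
    data[, j] <- c(data[1, j], diff(data[, j]))
  }
  data
}
d <- fix_data(d)  # reassign, since the function returns a modified copy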
diff calculates the difference between consecutive values in a vector. You can apply it to each column in a dataframe using, e.g.
dfr <- data.frame(x = c(1,2,5,7,8), y = (1:5)^2)
as.data.frame(lapply(dfr, diff))
  x y
1 1 3
2 3 5
3 2 7
4 1 9
EDIT: I just noticed a few more things. You are using a matrix, not a data frame (as you stated in the question). For your matrix 'd', you can use
d_diff <- apply(d, 2, diff)
#Find columns that are (strictly) increasing
incr <- apply(d_diff, 2, function(x) all(x > 0, na.rm=TRUE))
#Replace values in the appropriate columns
d[2:nrow(d),incr] <- d_diff[,incr]
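If that's run on the example d, col1 and col3 should be left untouched while col2 (from row 2 onward) is replaced by its deltas, giving something like:
     col1 col2 col3
[1,]    1    1    5
[2,]    2    1    4
[3,]    3    1    6
[4,]    4    2    7
[5,]   NA   NA   NA
[6,]    2   NA    9
[7,]    3    1    3
[8,]    1    1    5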
