Invert sign of even numbered rows in r dataframe - r

I have a data frame with 10 items and I want to negate the even numbered rows. I came up with this monstrosity:
change_even <- data.frame(val=runif(10))
change_even$val[row( as.matrix(change_even[,'val']) ) %% 2 == 0 ] <- -change_even$val[row( as.matrix(change_even[,'val']) ) %% 2 == 0 ]
Is there a better way?

You can simply use recycling:
change_even$val*c(1,-1)
#[1] 0.1077468 -0.5418167 0.8319609 -0.7230043 0.6649786 -0.7232669
#[7] 0.2677659 -0.4035824 0.6880934 -0.5600653
(values are not reproducible since seed was not set; however the alternating sign can be seen clearly).
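To make the recycling idea reproducible, here is a minimal sketch with a fixed seed (the seed value and the helper variable names are arbitrary choices, not from the original post). It assigns the result back to the column and uses rep_len() so the sign vector has exactly one entry per row, which avoids the recycling warning R gives when the number of rows is odd.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
change_even <- data.frame(val = runif(10))
original <- change_even$val                    # keep a copy to compare against

# +1 for odd rows, -1 for even rows, exactly one sign per row
signs <- rep_len(c(1, -1), nrow(change_even))
change_even$val <- change_even$val * signs

even_rows <- seq(2, nrow(change_even), by = 2)
all(change_even$val[even_rows] == -original[even_rows])   # TRUE
all(change_even$val[-even_rows] == original[-even_rows])  # TRUE
```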

You can simply do,
change_even[c(FALSE,TRUE),] <- change_even[c(FALSE,TRUE),]*(-1)

With data.table, you can do something similar to the data.frame approach. See also: Selecting multiple odd or even columns/rows for dataframe in R
library(data.table)
change_even <- data.table(val=runif(10))
even_indexes<-seq(2,nrow(change_even),2)
change_even <- change_even[even_indexes,val:=val*-1]

Use the remainder operator to find the even numbered rows, then simply negate
change_even <- data.frame(val=runif(10))
change_even[seq(nrow(change_even)) %% 2 != 1,] = -change_even[seq(nrow(change_even)) %% 2 != 1,]

This is what I came up with:
change_even$val = change_even$val * c(rep(-1,nrow(change_even))^((row(change_even)+1)))

Another one:
(-1)^(0:(nrow(change_even)-1))*change_even$val

Related

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. Here df is the data.frame in question. It will also be faster than looping over rows, which is generally best avoided when possible.
library(dplyr)
library(stringr)
df %>%
# all_vars() keeps a row only if the condition holds in every column
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can wrap it in a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), a vector of zero length. The comparison integer(0) >= 1 then yields logical(0) rather than TRUE or FALSE, and if() cannot evaluate a condition of length zero, hence the error.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
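As a small sketch of that fix, the snippet below compares the zero-length result of grep with the single logical value that any(grepl(...)) produces. The example vector is made up, and fixed = TRUE is an added choice here so that the dot is matched literally.

```r
row_values <- c("1234", "abc", "x.DERIVED")    # made-up stand-in for one row of x

no_match <- grep(".DERIVED", "1234abcdefg", fixed = TRUE)
length(no_match)                               # 0, so `no_match >= 1` is logical(0)

# Both of these always yield a single TRUE or FALSE, which is what if() needs
has_hit_length <- length(grep(".DERIVED", row_values, fixed = TRUE)) != 0
has_hit_any    <- any(grepl(".DERIVED", row_values, fixed = TRUE))

if (has_hit_any) print("row contains .DERIVED")
```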
There's something else you haven't noticed yet, which is that you're removing rows/columns inside a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was originally the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to contain only digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
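The question asks to drop a row when any of its cells contains ".DERIVED", whereas the examples above only inspect column a. A possible base R sketch that checks every column is shown here; the function name drop_derived is hypothetical, and the data frame repeats the made-up test data so that the block runs on its own.

```r
test <- data.frame(a = c("data", "data.DERIVED", "data", "data", "data.DERIVED"),
                   b = c(1, 2, 3, 4, 5),
                   c = c("A", "B", "C", "D", "E"),
                   d = c(2, 5, 6, 8, 9),
                   stringsAsFactors = FALSE)

drop_derived <- function(x) {
  # One logical vector per column: TRUE where the cell contains ".DERIVED"
  hits <- lapply(x, function(col) grepl(".DERIVED", as.character(col), fixed = TRUE))
  # Combine the per-column results: TRUE if any cell in the row matched
  row_has_hit <- Reduce(`|`, hits)
  keep_cols <- vapply(x, is.numeric, logical(1))
  # drop = FALSE keeps a data.frame even when only one numeric column remains
  x[!row_has_hit, keep_cols, drop = FALSE]
}

test2 <- drop_derived(test)
test2   # rows 1, 3 and 4 of columns b and d
```

Using drop = FALSE sidesteps the vector-instead-of-data.frame issue mentioned in the comment above.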

For loop for multiple indices

I know that in R for loops should be avoided and vectorized operations should be used instead.
I want to solve this with a for loop and then try to use the apply family, then also in Rcpp.
I load a dataset containing one column of passwords (alphanumeric).
Once loaded (a sample, for speed), I want to create new columns with values (0,1) based on some conditions: "contains_lower_chars", "contains_numbers" and so on.
Here is what I tried, but it doesn't work: every column I create ends up with the same value.
library(tidyverse)
set.seed(123)
# load dataset from url, skip the first 16 rows
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
sample_frac(.001) %>%
rename(password = V1)
patterns = c("[a-z]","[A-Z]","[0-9]+")
df$has_lower <- 0
df$has_upper <- 0
df$has_numeric <- 0
for(i in 1:nrow(df)){
for(j in patterns){
n <- ifelse(grepl(j, df$password[i]),1,0)
}
df$has_lower[i] <- n
df$has_upper[i] <- n
df$has_numeric[i] <- n
}
Output I have in mind is:
password has_lower has_upper has_numeric
Bigmaccas 1 1 0
0127515559 0 0 1
dbqky73p 1 0 1
We can simplify things if we just name your pattern vector. For example
patterns = c(has_lower="[a-z]",
has_upper="[A-Z]",
has_numeric="[0-9]+")
for(pattern in names(patterns)) {
df[, pattern] = as.numeric(grepl(patterns[pattern], df$password))
}
Basically we just loop through each of the names, grab the regular expression corresponding to that name, then do the matching and add the column.
A data frame is above all a list.
So, you can simply do:
df[c("has_lower", "has_upper", "has_numeric")] <-
lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)
Use + 0L instead of + 0 if you want integers instead of doubles (I would recommend doing neither and keeping logicals).
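For a self-contained illustration of the lapply approach, the sketch below uses the three passwords from the expected output in the question instead of downloading the file. Passing perl = TRUE is an added choice here, because the meaning of ranges such as [a-z] can depend on the locale with R's default regex engine.

```r
df <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                 stringsAsFactors = FALSE)

patterns <- c(has_lower = "[a-z]", has_upper = "[A-Z]", has_numeric = "[0-9]")

# grepl is vectorized over the passwords, so one call per pattern is enough
df[names(patterns)] <- lapply(patterns, function(p) {
  as.integer(grepl(p, df$password, perl = TRUE))
})

df
```

Naming the pattern vector means the column names and the regular expressions stay paired in one place.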
First, you need to update has_lower, has_upper and has_numeric inside the j loop; otherwise n only holds the result for the last pattern and all three columns get the same value. To do so, loop over the column names together with the patterns:
names <- c("has_lower","has_upper","has_numeric")
for(i in 1:nrow(df)){
for(j in 1:length(patterns)){
df[i,(names[j])] <- as.numeric(grepl(patterns[j], df$password[i]))
}
}
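A quicker, more compact alternative uses lapply and the fact that grepl is already vectorized. Note that := is data.table syntax, so df must first be converted to a data.table:
library(data.table)
setDT(df)
df[, c("has_lower","has_upper","has_numeric"):=lapply(patterns, function(x) grepl(x,df$password))]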
Note (nothing to do with your question):
I advise you to use the fread function from data.table to read your dataset, since it is quite large.
df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16)%>%
sample_frac(.001) %>%
rename(password = V1)

Calculate count of number of switch in vector

I have a vector in which I have to count how many times the data switched from 0 to 100 and back to 0. An example is given below.
Input
X1<-c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)
So the output should be 3, counting each time the value started at 0, stayed at 100 for some time and went back to 0. My requirement is to count how many times this switch has occurred. I am aware of rle, but that only gives me the lengths.
Thanks in advance for the help.
This looks sufficient
sum(X1[-1] != X1[-length(X1)]) / 2
Assumptions are that
You only have two unique values in X1
The last element of X1 equals the first element, that is, it switches back to original state in the end.
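To see why dividing by 2 works under these assumptions, a short check on the example vector follows. It counts every change of value in two ways (comparing neighbours, and counting runs with rle) and then halves the total; the variable names are illustrative.

```r
X1 <- c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)

# Each TRUE marks a position where the value differs from its predecessor
changes <- sum(X1[-1] != X1[-length(X1)])

# The same count from run-length encoding: k runs are separated by k - 1 changes
runs <- rle(X1)
changes_rle <- length(runs$values) - 1

changes        # 6
changes_rle    # 6
changes / 2    # 3 complete switches
```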
You can do something like,
sum(diff(X1) == 100)
#[1] 3
#Or
min(sum(diff(X1) == 100), sum(diff(X1) == -100))
#[1] 3
You could run rle and then iterate through three elements of values at a time to see if the required condition has been met.
with(rle(X1),
sum(sapply(3:length(lengths), function(i)
values[i-2] == 0 & values[i-1] == 100 & values[i] == 0)))
#[1] 2
More generally, for counting switches among n states (numeric or character):
count_switches_groups <- function(seq.input){
COUNT <- 0
transition = rep("no switch",length(seq.input))
for (i in 2:length(seq.input)) {
if (seq.input[i] != seq.input[i - 1]) {
COUNT <- COUNT + 1
transition[i] <- paste0("from ",seq.input[i - 1]," to ",seq.input[i])
}
}
total_switches <- COUNT
state_transitions <- transition[transition != "no switch"]
occurances <- as.data.frame(table(state_transitions))
return_list <- list(total_switches,occurances)
names(return_list) <- c("total_transitions","unique_switches")
return(return_list)
}
count_switches_groups(X1)
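sum((np.diff(x)==100)|(np.diff(x)==-100)) # Python with NumPy, where x is the array
I think this would be the answer; it worked for me. Note that this is Python rather than R, and that it counts every change in either direction, so for the example above it gives 6 and needs halving to count complete switches.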

How to filter a data frame depending on how many chars in each row [duplicate]

I have a dataframe m and I want to remove all the rows where the f_name column has an entry longer than 3 characters. I assume I can use something similar to
m <- m[-grep("nchar(m$f_name)>3", m$f_name]
To reword your question slightly, you want to retain rows where entries in f_name have length of 3 or less. So how about:
subset(m, nchar(as.character(f_name)) <= 3)
Try this:
m[!nchar(as.character(m$f_name)) > 3, ]
For those looking for a tidyverse approach, you can use dplyr::filter, keeping the rows with 3 characters or fewer:
m %>% dplyr::filter(nchar(as.character(f_name)) <= 3)
The obligatory data.table solution:
setDT(m)
m[ nchar(f_name) <= 3 ]
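A small base R sketch with made-up data shows the filter in action. The as.character() call matters when f_name is a factor, since nchar() refuses factors; the column values here are invented for illustration.

```r
m <- data.frame(f_name = c("Al", "Bob", "Carol", "Dave"),
                age = c(31, 45, 28, 52),
                stringsAsFactors = TRUE)       # f_name is deliberately a factor

# Keep rows whose name has at most 3 characters
short_names <- m[nchar(as.character(m$f_name)) <= 3, , drop = FALSE]
short_names
```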

Deleting many, specific rows in R [duplicate]

I have a data frame named "mydata" that looks like this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete row 2,4,6. For example, like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so-called boolean vector, aka logical:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to @mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4] <- NA
where column A is assigned NA (not available, R's missing value) wherever A exceeds 4.
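One caveat worth sketching: when the column contains missing values, a logical index and which() behave differently. The data below is made up for illustration.

```r
myData <- data.frame(A = c(5, NA, 3, 6), B = 1:4)

by_logical <- myData[myData$A > 4, ]         # NA in the condition yields a row of NAs
by_which   <- myData[which(myData$A > 4), ]  # which() silently drops the NA

nrow(by_logical)   # 3
nrow(by_which)     # 2

# Handling NA explicitly keeps the assignment unambiguous
myData$A[!is.na(myData$A) & myData$A > 4] <- NA
myData$A           # NA NA 3 NA
```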
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
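The difference between the two strategies can be sketched with a made-up data frame that has an id column and is then reordered, standing in for raw data that changed after the row numbers were noted.

```r
myData <- data.frame(id = 1:7, A = 5, B = 4)
reordered <- myData[c(7, 1, 3, 2, 6, 5, 4), ]   # pretend the raw data was re-sorted

by_position <- reordered[-c(2, 4, 6), ]         # removes whatever sits in rows 2, 4, 6
by_id       <- reordered[!(reordered$id %in% c(2, 4, 6)), ]

by_position$id   # 7 3 6 4  (ids 4 and 6 survive, 1 and 5 are lost)
by_id$id         # 7 1 3 5
```

Deleting by position raises no error here, which is exactly why the mistake is easy to miss.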
Create an id column in your data frame, or use any existing column that identifies each row. Deleting by row index is not reliable.
Use the subset function to create a new data frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out (note the minus sign: !which(...) would negate the indices themselves and return no rows):
mydata[-which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[-which(1:nrow(mydata) %% 2 != 1) , ]
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
# ... earlier steps of the pipeline ...
slice(-c(2, 4, 6))
# ... add %>% and any later steps here ...
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data - No need to manage a new data.frame.
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
nr <- nrow(x)
if (nr < row_index) {
print('row_index exceeds number of rows')
} else if (row_index == 1)
{
return(x[2:nr, ])
} else if (row_index == nr) {
return(x[1:(nr - 1), ])
} else {
return (x[c(1:(row_index - 1), (row_index + 1):nr), ])
}
}
Its main flaw is that the row_index argument doesn't follow the R pattern of being a vector of values. There may be other problems, as I only spent a couple of minutes writing and testing it and have only started using R in the last few weeks. Any comments and improvements on this would be very welcome!
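One possible way to address that flaw is to lean on negative indexing, which already accepts a vector. The sketch below is not the original author's function; remove_rows is a hypothetical name and the bounds check is an added assumption about the desired behaviour.

```r
remove_rows <- function(x, row_index) {
  # Nothing to remove: return x unchanged (x[-integer(0), ] would drop every row)
  if (length(row_index) == 0) return(x)
  if (any(row_index < 1 | row_index > nrow(x))) {
    stop("row_index must lie between 1 and nrow(x)")
  }
  # drop = FALSE keeps a data.frame even if x has a single column
  x[-row_index, , drop = FALSE]
}

d <- data.frame(A = 1:7, B = letters[1:7])
remove_rows(d, c(2, 4, 6))$A   # 1 3 5 7
```

Raising an error with stop() rather than printing a message means a caller cannot mistake the message for a result.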
To identify by a name:
Call out the unique ID and identify the location in your data frame (DF).
Mark to delete. If the unique ID applies to multiple rows, all these rows will be removed.
Code:
Rows<-which(grepl("unique ID", DF$Column))
DF2<-DF[-c(Rows),]
DF2
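A caveat with the -which(...) pattern: if nothing matches, which() returns an empty vector and the negative index then selects no rows at all. A short sketch with made-up data, comparing it with a logical index:

```r
DF <- data.frame(Column = c("a", "b", "c"), stringsAsFactors = FALSE)

Rows <- which(grepl("unique ID", DF$Column))   # integer(0): no matches
via_which   <- DF[-Rows, , drop = FALSE]
via_logical <- DF[!grepl("unique ID", DF$Column), , drop = FALSE]

nrow(via_which)     # 0, every row is gone
nrow(via_logical)   # 3, nothing was removed
```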
Another approach when working with Unique IDs is to subset data:
*This came from an actual report where I wanted to remove the chemical standard
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name.
The ! is important for excluding
