R: Iterative deletion of rows with group criteria

I'm trying to delete rows iteratively if they meet both of these criteria:
the slope column is < 0
the row has the max of Lfd within its Ring group
Ring <- c(1, 1, 1, 1, 2, 2, 2, 2)
Lfd <- c(1:4, 1:4)
slope <- c(2, 2, -1, -2, 2, -1, 2, -2)
test <- data.frame(Ring, Lfd, slope)
Ring Lfd slope
1 1 1 2
2 1 2 2
3 1 3 -1
4 1 4 -2
5 2 1 2
6 2 2 -1
7 2 3 2
8 2 4 -2
After first iteration they should look like
Ring Lfd slope
1 1 1 2
2 1 2 2
3 1 3 -1
5 2 1 2
6 2 2 -1
7 2 3 2
And after second like
Ring Lfd slope
1 1 1 2
2 1 2 2
5 2 1 2
6 2 2 -1
7 2 3 2
I already tried without iteration:
test_out <- test %>%
group_by(Ring) %>%
filter(Lfd != which.max(Lfd) & (slope > 0)) %>%
ungroup
And with iteration:
del.high.neg <- function(x) {
success <- FALSE
while (!success) {
test_out <- test %>%
group_by(Ring) %>%
filter(Lfd == which.max(Lfd)) %>%
select(Ring, Lfd, slope) %>%
ungroup
Index <- test_out[test_out$slope < 0, ]
test_out <- test_out[!(test_out$Ring %in% Index),]
success <- Index == NULL
}
return(x)
}

I think this is what you want - it will delete every negative row from the end of the data, until it hits your first positive value:
library(dplyr)
test %>% group_by(Ring) %>%
mutate(row = row_number()) %>%
filter(row <= max(which(slope > 0)))
Source: local data frame [5 x 4]
Groups: Ring [2]
Ring Lfd slope row
(dbl) (int) (dbl) (int)
1 1 1 2 1
2 1 2 2 2
3 2 1 2 1
4 2 2 -1 2
5 2 3 2 3
You can append select(-row) if you'd like the row column removed too.
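For comparison, the iterative deletion described in the question can also be written literally as a loop in base R. This is only a minimal sketch: the helper name drop_trailing_negatives is made up for illustration, and it assumes Lfd values are unique within each Ring.

```r
# Sample data from the question
test <- data.frame(
  Ring  = c(1, 1, 1, 1, 2, 2, 2, 2),
  Lfd   = c(1:4, 1:4),
  slope = c(2, 2, -1, -2, 2, -1, 2, -2)
)

# Hypothetical helper: per Ring, keep dropping the row with the largest Lfd
# for as long as that row has a negative slope
drop_trailing_negatives <- function(d) {
  repeat {
    # TRUE for the row holding the maximum Lfd within its Ring
    is_last <- d$Lfd == ave(d$Lfd, d$Ring, FUN = max)
    drop <- is_last & d$slope < 0
    if (!any(drop)) break          # nothing left to delete: stop iterating
    d <- d[!drop, , drop = FALSE]
  }
  d
}

result <- drop_trailing_negatives(test)
result
```

Each pass removes at most one row per Ring, so the intermediate states match the "after first iteration" and "after second iteration" tables in the question.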

I think you are saying that you want to delete all the rows that have a negative slope and have Lfd that is greater than or equal to the row with the maximum value of Lfd and a non-negative slope. If you want to do that within Ring, you can use the following:
library(plyr)
testmax <- ddply(test,.(Ring),summarize,maxLfd = max(Lfd[slope>=0]))
test1 <- merge(test,testmax)
test_out <- test1[!(test1$Lfd>=test1$maxLfd & test1$slope<0),-4]
test_out
# Ring Lfd slope
# 1 1 1 2
# 2 1 2 2
# 5 2 1 2
# 6 2 2 -1
# 7 2 3 2

Related

Get the first non-null value from selected cells in a row

Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null (non-zero) value from the 5 cells liv to liv5. For the first row (liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0) the result should be 6. This calculation should be repeated for each row in my data frame.
I know how to do this in Python, but not in R.
Any help is highly appreciated!
One option with dplyr could be:
library(dplyr)
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x != 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
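One caveat with the row-wise approaches: if a row contains no non-zero value at all, an expression like head(x[x != 0], 1) returns a zero-length vector, and apply then returns a list instead of a plain vector. A minimal sketch of one way to guard against that is shown below. The helper name first_nonzero and the small three-row data frame are made up for illustration; they are not the asker's data.

```r
# Illustrative data (not from the question): row 2 has no non-zero value
df <- data.frame(
  MD   = c(100, 200, 300),
  liv  = c(0, 0, 1),
  liv2 = c(6, 0, 0),
  liv3 = c(1, 0, 1)
)

# Hypothetical helper: first non-zero element of x, or NA if there is none
first_nonzero <- function(x) {
  hit <- which(x != 0)
  if (length(hit) == 0) NA_real_ else unname(x[hit[1]])
}

# Apply row-wise over every column except MD
df$liv6 <- apply(df[-1], 1, first_nonzero)
df
```

Because the helper always returns exactly one value, liv6 stays an ordinary numeric column.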
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr::detect:
first_nonzero <- apply(df %>% dplyr::select(liv:liv5), 1, function(x) {
purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This should be very efficient, since we "scan" by column and the data presumably have far fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
i = !ans
ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4
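Another column-oriented variant, offered here only as a sketch and not taken from the answers above, uses max.col to find the position of the first non-zero column in each row and then pulls the values out with matrix indexing.

```r
# Data from the question
df <- data.frame(
  MD   = c(100, 200, 300, 400, 500),
  liv  = c(0, 0, 1, 3, 4),
  liv2 = c(6, 2, 0, 4, 5),
  liv3 = c(1, 1, 1, 1, 1),
  liv4 = c(1, 0, 0, 3, 5),
  liv5 = c(0, 2, 7, 9, 10)
)

m <- as.matrix(df[, -1])  # drop MD, keep liv..liv5

# Column index of the first TRUE (non-zero) entry in each row
first_col <- max.col(m != 0, ties.method = "first")

# Matrix indexing with (row, column) pairs extracts one value per row
liv6 <- m[cbind(seq_len(nrow(m)), first_col)]
liv6
```

Note that for a row with no non-zero value, max.col falls back to the first column, so the result for that row would be 0.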

How to apply a function to a data frame for multiple inputs and create columns with the outputs using dplyr?

Given the following data
data_in <- data.frame(X1 = c(1, 3, 5, 2, 6),
X2 = c(2, 4, 5, 1, 8),
X3 = c(3, 2, 4, 1, 4))
I wrote a function, which takes the data frame, a value (here called distance) and a string (to add a column name) to count the number of values being smaller or equal to the input value.
custom_function <- function(some_data_frame, distance, name) {
some_data_frame %>%
mutate(!!name := rowSums(. <= distance, na.rm = TRUE)) %>%
return()
}
I can apply the function to the data as follows:
data_in %>%
custom_function(., 5, "some_name")
What I would like now is to use a vector of distances and create a column for each distance using my custom function. Let's say for c(1, 3, 5), I would like to get three columns automatically rather than by hardcoding (applying the function manually three times).
There is an easy way to do that with mapply (using the same distances as in @Sotos's answer):
(dst <- c(5, 3, 1, 6, 7, 8))
# [1] 5 3 1 6 7 8
(cnm <- paste('some_name', dst, sep = '_'))
# [1] "some_name_5" "some_name_3" "some_name_1" "some_name_6" "some_name_7" "some_name_8"
data_in[, cnm] <- mapply(function(d) rowSums(data_in <= d, na.rm = TRUE), d = dst)
data_in
# X1 X2 X3 some_name_5 some_name_3 some_name_1 some_name_6 some_name_7 some_name_8
# 1 1 2 3 3 3 1 3 3 3
# 2 3 4 2 3 2 0 3 3 3
# 3 5 5 4 3 0 0 3 3 3
# 4 2 1 1 3 3 2 3 3 3
# 5 6 8 4 1 0 0 2 2 3
You can obtain the same results within tidyverse using purrr::map2:
cbind(
data_in,
purrr::map2(dst, cnm, ~custom_function(data_in, .x, .y))
)
# X1 X2 X3 some_name_5 some_name_3 some_name_1 some_name_6 some_name_7 some_name_8
# 1 1 2 3 3 3 1 3 3 3
# 2 3 4 2 3 2 0 3 3 3
# 3 5 5 4 3 0 0 3 3 3
# 4 2 1 1 3 3 2 3 3 3
# 5 6 8 4 1 0 0 2 2 3
With custom_function() defined as:
custom_function <- function(some_data_frame, distance, name) {
some_data_frame %>%
transmute(!!name := rowSums(. <= distance, na.rm = TRUE))
}
You can use sapply to loop through your vector and cbind at the end, i.e.
cbind.data.frame(data_in,
do.call(cbind.data.frame, sapply(c(5, 3, 1, 6, 7, 8), function(i)
custom_function(data_in, i, paste0('some_name_', i))[ncol(data_in) + 1])))
which gives,
X1 X2 X3 some_name_5 some_name_3 some_name_1 some_name_6 some_name_7 some_name_8
1 1 2 3 3 3 1 3 3 3
2 3 4 2 3 2 0 3 3 3
3 5 5 4 3 0 0 3 3 3
4 2 1 1 3 3 2 3 3 3
5 6 8 4 1 0 0 2 2 3
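All of these approaches count over every column of data_in, so re-running them after the new columns have been added would also count the added columns. A minimal base R sketch that pins the input columns first, using the distances c(1, 3, 5) from the question, could look like this (the names value_cols, counts and data_out are made up for illustration):

```r
data_in <- data.frame(X1 = c(1, 3, 5, 2, 6),
                      X2 = c(2, 4, 5, 1, 8),
                      X3 = c(3, 2, 4, 1, 4))

dst <- c(1, 3, 5)
value_cols <- c("X1", "X2", "X3")  # columns to count over (fixed up front)

# One column of counts per distance: number of values <= d in each row
counts <- sapply(dst, function(d) rowSums(data_in[value_cols] <= d, na.rm = TRUE))
colnames(counts) <- paste0("some_name_", dst)

data_out <- cbind(data_in, counts)
data_out
```

Keeping the result in a separate object (data_out) leaves data_in untouched, so the step can be repeated with a different vector of distances.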

Create dummy-column based on another columns

Let's say I have this dataset
example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
I want to create a new variable d. I want d to be 1 when at least one of the variables a, b, c contains the value 1, 2 or 3, and 0 otherwise.
d should look like this:
d <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1)
Thanks in advance.
You can use rowSums to get a logical vector of 1, 2 or 3 appearing in each row and wrap it in as.integer to convert to 0 and 1, i.e.
as.integer(rowSums(example == 1 | example == 2 | example == 3) > 0)
#[1] 1 1 1 0 0 1 1 1 1 1
This works for any number of variables:
example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
x <- c(1, 2, 3)
# test each column for membership first, then OR the logical vectors together
as.integer(Reduce(`|`, lapply(example, `%in%`, x)))
With the dplyr package:
library(dplyr)
x <- 1:3
example %>% mutate(d = as.integer(a %in% x | b %in% x | c %in% x))
Two other possibilities which work with any number of columns:
#option 1
example$d <- +(rowSums(sapply(example, `%in%`, 1:3)) > 0)
#option 2
library(matrixStats)
example$d <- rowMaxs(+(sapply(example, `%in%`, 1:3)))
which both give:
> example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1
You can do this using apply (although it is a little slow).
Logic: any checks whether a 1, 2 or 3 is present; apply runs this check on each row. Finally, the logical result is converted to numeric by adding +0 (you may use as.numeric here if you want to be more explicit).
d <- apply(example,1 ,function(x)any(x==1|x==2|x==3))+0
If you want to run the logic on only some of the columns, you can do this:
d <- apply(example[,c("a","b","c")], 1, function(x)any(x==1|x==2|x==3))+0
This gives you control over which columns to include or ignore, based on your needs.
Output:
> d
[1] 1 1 1 0 0 1 1 1 1 1
A general solution:
x <- 1:3
example %>%
sapply(function(i) i %in% x) %>% apply(1, any) %>% as.integer
#[1] 1 1 1 0 0 1 1 1 1 1
Try this method, which checks whether any column contains at least one element present in x.
x<-c(1,2,3)
example$d<-as.numeric(example$a %in% x | example$b %in% x | example$c %in% x)
example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1
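The membership test can also be wrapped in a small helper that takes the lookup values and an optional subset of columns. This is a sketch only; the function name flag_rows and its arguments are hypothetical.

```r
example <- data.frame(a = 1:10, b = 10:1, c = 1:5)

# Hypothetical helper: 1 if any of the chosen columns holds one of `values`
flag_rows <- function(data, values, cols = names(data)) {
  # One logical vector per column: is the cell one of `values`?
  hits <- lapply(data[cols], function(col) col %in% values)
  # OR the per-column vectors together, then convert TRUE/FALSE to 1/0
  as.integer(Reduce(`|`, hits))
}

d_all  <- flag_rows(example, 1:3)              # all columns
d_only <- flag_rows(example, 4:5, cols = "c")  # restrict to column c
example$d <- d_all
```

Testing membership per column before combining means the helper does not depend on which values are being looked up, and it also works when only a single column is selected.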

counting number of same values and printing them in R

I have a vector with repeated numbers. I want to count the number of repeated numbers and print the output.
This is my input:
deg <- c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df <- data.frame(table(deg))
This is my output:
deg Freq
1 1 2
2 2 6
3 3 1
4 4 2
5 5 1
Here in my output I want to print the data frame from 0 to 5, where 0 is the starting element and 5 is the max element in the vector. The output I want to get is:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
Someone please help with this!!!
If we're starting from df we can just unpack the data, add zero as a factor level, then re-tabulate:
f <- with(df, factor(rep(deg, Freq), levels = union(0, levels(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
If we're starting with the vector deg, it's easier. We can just add zero as a factor level then tabulate:
f <- factor(deg, levels = union(0, sort(unique(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
Try this:
df <- data.frame(deg=seq(0,max(deg)),
Freq=sapply(seq(0,max(deg)),function(x) length(which(deg==x))))
Output:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
You can add a row to df:
#convert deg from factor back to numeric
df$deg = as.numeric(as.character(df$deg))
# add 0 deg with 0 freq if it doesn't exist already in df
if (!any(df$deg == 0)) {
df = rbind(df, c(0,0))
# sort df by deg
df = df[order(df$deg),]
}
Try this
rbind(data.frame(deg=0, Freq=0)[!(c(0) %in% deg)], as.data.frame(table(deg)))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
The expand_df function below can help you get the desired output
deg = c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df = as.data.frame(table(deg))
expand_df = function(df){
upd_list = 0: max(as.numeric(as.character(df[,1])))
upd_df = as.data.frame(upd_list)
merged_df = merge(upd_df, df,all.x=TRUE,by.x=colnames(upd_df)[1], by.y=colnames(df)[1])
merged_df[,2] = ifelse(is.na(merged_df[,2]),0,merged_df[,2])
merged_df
}
expand_df(df)
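A compact variant of the factor-level idea, sketched here under the assumption that deg only holds non-negative integers, is to declare every value from 0 to max(deg) as a level before tabulating:

```r
deg <- c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)

# Levels 0..max(deg): any value in that range that never occurs gets count 0
counts <- as.data.frame(table(deg = factor(deg, levels = 0:max(deg))))
counts
```

Unlike adding only 0 as an extra level, this also produces zero rows for intermediate values that happen to be missing from the vector.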

how to divide my dataset by the number of times a value appears in R

I have a dataset and need to know on average how many times the number 1 appears, the number 0 appears, and the number -1 appears. But it is not a traditional average. Let me explain:
this is part of my dataset:
position
1
1
1
0
0
-1
0
-1
-1
-1
-1
-1
1
1
So if I subset the number of times each number appears by vectors I would have:
position '1' position '-1' position '0'
X1 X2 X1 X2 X1 X2
1 1 -1 -1 0 0
1 1 -1 0
1 -1
-1
-1
This way I can find the average for 1 as: (X1+X2)/2 where 2 is the number of vectors that appear. This depends and can be any number given by the number of consecutive times a number appears.
This is a little confusing but I hope you understand my point. I have been thinking how to do this but can't find a way.
Thank you very much!
rle is the way to go, as mentioned by @KonradRudolph. Then you can use split to get a proper format:
with(rle(position), split(lengths, values))
# $`-1`
# [1] 1 5
#
# $`0`
# [1] 2 1
#
# $`1`
# [1] 3 2
And, to do the averaging, tapply would work
with(rle(position), tapply(lengths, values, FUN=mean))
# -1 0 1
# 3.0 1.5 2.5
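To see what rle contributes here, it can help to look at its two components separately. The sketch below uses the position values listed in the question; the variable names runs and avg_run are made up for illustration.

```r
position <- c(1, 1, 1, 0, 0, -1, 0, -1, -1, -1, -1, -1, 1, 1)

# rle() collapses the vector into runs of identical consecutive values
runs <- rle(position)
runs$values   # the value of each run, in order of appearance
runs$lengths  # how many times that value repeats in the run

# Average run length per value: group the lengths by the run values
avg_run <- tapply(runs$lengths, runs$values, mean)
avg_run
```

Each entry of runs$lengths corresponds to one of the "vectors" (X1, X2, ...) in the question, so the mean of the lengths per value is exactly the requested average.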
You could also use dplyr with diff (here data is assumed to be a data frame containing the position column):
library(dplyr)
data %>% mutate(group = c(0, cumsum(diff(position)!=0))) %>%
group_by(position) %>%
summarise(mean = n()/length(unique(group)))
Source: local data frame [3 x 2]
position mean
(int) (dbl)
1 -1 3.0
2 0 1.5
3 1 2.5
A bit verbose, but this shows how all of this comes together:
library(dplyr)
position <- c(1, 1, 1, 0, 0, -1, 0, -1, -1, -1, -1, -1, 1, 1)
rle_pos <- rle(position)
df <- tibble(position_code = rle_pos$values,  # tibble() replaces the deprecated data_frame()
length = rle_pos$lengths)
df
# Source: local data frame [6 x 2]
#
# position_code length
# (dbl) (int)
# 1 1 3
# 2 0 2
# 3 -1 1
# 4 0 1
# 5 -1 5
# 6 1 2
df %>%
group_by(position_code) %>%
summarise(count = n(),
sum_lengths = sum(length)) %>%
mutate(average = sum_lengths / count)
# Source: local data frame [3 x 4]
#
# position_code count sum_lengths average
# (dbl) (int) (int) (dbl)
# 1 -1 2 6 3.0
# 2 0 2 3 1.5
# 3 1 2 5 2.5
