Creating a new column based on rows being equal in R

Here is a simple question about creating a new column when rows that are duplicated in one column must also be compared on a different column. Specifically, for rows that share the same value in column "pairs", create a new column "new" that indicates whether their values in column "y" are equal or unequal (1 if unequal, 0 if equal, as in the example below).
In the actual data frame I have even more conditions for other columns but my main issue is with making these conditions dependent on the rows being the same in the "pairs" column.
Many thanks!
pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1

Assuming values are always paired, i.e., there are only two rows in each group:
DF <- read.table(text="pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1", header=TRUE)
library(plyr)
#for integers:
ddply(DF, .(pairs), transform, new1 = 1*(diff(y) != 0L))
#for numerics:
ddply(DF, .(pairs), transform, new1 = 1*(abs(diff(y)) > .Machine$double.eps ^ 0.5))
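
If the groups are not guaranteed to contain exactly two rows, a base R variant may be easier to adapt. The following is a minimal sketch, not the answerer's method: it assumes the same example data and flags a group whenever it holds more than one distinct y value, using ave(). The column name new1 is reused from above purely for illustration.

```r
# Same example data as above.
DF <- read.table(text = "pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1", header = TRUE)

# For each group of 'pairs', return 1 for every row when the group holds
# more than one distinct y value, and 0 when all y values are equal.
DF$new1 <- ave(DF$y, DF$pairs,
               FUN = function(v) as.integer(length(unique(v)) > 1))
DF
```

Note that unique() compares values exactly, so for floating-point data a tolerance-based check like the one in the numeric ddply call above is probably still the safer choice.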

Related

Is there an R function to count the values greater than 0 in each row?

Is there an R function to count the values greater than 0 in a row?
test <- data.frame(a=c("a","y"),b=c(0,"5"),c=c(2,"0"))
test
a b c
1 a 0 2
2 y 5 0
I need to get the following, because the first row contains one value greater than 0 and the second row also contains one value greater than 0. The first column must be excluded because it contains only characters.
test
a b c d
1 a 0 2 1
2 y 5 0 1
We can convert the column types with type.convert, select the numeric columns, check which values are greater than 0, take the row-wise sum of the resulting logical matrix, and add it as a new column to the 'test' dataset.
library(tidyverse)
library(magrittr)
type.convert(test, as.is = TRUE) %>%
select_if(is.numeric) %>%
is_greater_than(0) %>%
rowSums %>%
bind_cols(test, d = .)
# a b c d
#1 a 0 2 1
#2 y 5 0 1
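
The same steps can also be sketched in base R without any packages. This is an illustrative alternative under the assumption that the columns were created as character vectors as in the question; the helper names converted and is_num are hypothetical.

```r
# Example data from the question; b and c end up as character columns
# because numbers and strings are mixed inside c().
test <- data.frame(a = c("a", "y"), b = c(0, "5"), c = c(2, "0"),
                   stringsAsFactors = FALSE)

# Convert each column to its most natural type (b and c become integer).
converted <- type.convert(test, as.is = TRUE)

# Identify the numeric columns, then count the values > 0 in each row.
is_num <- vapply(converted, is.numeric, logical(1))
test$d <- rowSums(converted[is_num] > 0)
test
```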

Sum a group of columns by row count

I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
d1
a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
d2
a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from "How to sum rows based on multiple conditions - R?", "How to select every xth row from table", "Remove last N rows in data frame with the arbitrary number of rows", "sum two columns in R", and "Merging multiple rows into single row". I'm all out of ideas. Any help would be greatly appreciated.
Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,0),
                d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply the sum to all columns
library(dplyr)
d %>%
  group_by(group) %>%
  summarise_all(sum)
Returns:
# A tibble: 2 x 5
group a b c d
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 0 2
2 B 1 0 2 0
Overview
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 rows each. To do that, use rep() to repeat each element of the sequence 1:2 three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary package ----
library(tidyverse)
# load necessary data ----
df1 <-
read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
df1 %>%
# split df1 into two data frames
# based on three consecutive rows
split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
# for each data frame, apply the sum() function to all the columns
map(.f = ~ .x %>% summarize_all(.funs = sum)) %>%
# collapse data frames together
bind_rows()
# view results -----
df2
# a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
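
Both answers hard-code the number of groups. When the number of rows is not known in advance, the grouping index can be derived from the row number instead. The sketch below is one possible base R approach, assuming every block has the same size; chunk_size and group are hypothetical names, and the last group will simply be shorter if the row count is not a multiple of chunk_size.

```r
# Example data from the question.
d1 <- read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)

chunk_size <- 3  # would be 60 for seconds-to-minutes aggregation

# Rows 1-3 get group 0, rows 4-6 get group 1, and so on.
group <- (seq_len(nrow(d1)) - 1) %/% chunk_size

# rowsum() adds up every column within each group, so the column
# names do not need to be known ahead of time.
d2 <- rowsum(d1, group)
d2
```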

Add index to runs of positive or negative values of certain length

I have a data frame which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
B is 0 by default, but once 5 consecutive data points are all positive or all negative, it should be 1. This applies only to consecutive values: if a positive run is interrupted by a negative number, the count starts again.
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried to convert the whole data frame to a list (and loop over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df <- data.frame(Value = sample(-9:9, 1000, replace = TRUE))
# sign of each value: -1, 0 or 1 (named sgn to avoid masking sign())
sgn <- sign(df$Value)
library(zoo)
rolling <- rollmean(sgn, k = 5, fill = 0, align = "right")
df$B <- as.numeric(abs(rolling) == 1)
I generated 1000 random values, both positive and negative.
Extract the sign of the values. This will be -1 for negative, 1 for positive and 0 for 0.
Calculate the right-aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all five consecutive values are positive or negative (respectively).
Take the absolute value and compare it against 1. This gives a logical vector that is converted to 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
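
If installing zoo is not an option, the same rolling-window idea can be sketched with stats::filter() from base R. This is an assumption-labelled alternative rather than the answerer's code: it sums the signs over a trailing window of 5 instead of averaging them, so the comparison is against 5 and stays in exact integer arithmetic. The names value, window_sum and B are illustrative.

```r
# Example values from the question's expected output.
value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)

# Trailing (sides = 1) sum of the signs over the current and the
# four preceding values; the first four positions are NA.
window_sum <- as.numeric(stats::filter(sign(value), rep(1, 5), sides = 1))

# A window sum of +5 or -5 means all five values share one non-zero sign.
B <- as.integer(!is.na(window_sum) & abs(window_sum) == 5)
B
```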
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
if (length(unique(sign(dat$Value[(x-4):x])))==1){
dat$new_col[x] <- 1
} else {
dat$new_col[x] <- 0
}
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
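
The run-based idea can also be written with rle(), which may be easier to read than the cumsum/diff idiom. This is a minimal sketch under the assumption that zeros should never be flagged; min_run, pos_in_run and run_sign are hypothetical names.

```r
# Example values from the question's expected output.
value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
min_run <- 5

# Run-length encode the signs: consecutive equal signs form one run.
runs <- rle(sign(value))

# Position of each element inside its run (1, 2, 3, ...) and the sign of that run.
pos_in_run <- sequence(runs$lengths)
run_sign <- rep(runs$values, runs$lengths)

# Flag elements from the fifth position of a positive or negative run onward.
B2 <- as.integer(pos_in_run >= min_run & run_sign != 0)
B2
```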

R Sort one column ascending, all others descending (based on column order)

I have an ordered table, similar to as follows:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by Size (ascending), then by each other column in priority sequence (descending), i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except Size".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
If the column names are known, use order like this. No packages are used.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column use k <- ncol(df) instead or if it is not necessarily last but always called Size then use k <- match("Size", names(df)) instead.
Note: This is not needed for the example shown in the question, but if the columns were not numeric they could not be negated. A more general solution is to replace the first line above with the following, where xtfrm is an R function that converts objects to numeric values that sort in the expected order.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
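
The xtfrm approach can be combined with do.call so that no column names other than Size need to be typed. The following is a sketch under that assumption; keys and sorted are hypothetical names, and unname() is used so that column names cannot collide with named arguments of order().

```r
# Example data from the question.
df <- read.table(text = "A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1", header = TRUE)

# Locate the ascending key by name.
k <- match("Size", names(df))

# First key: Size (ascending). Remaining keys: every other column,
# converted with xtfrm() and negated so that it sorts descending.
keys <- c(list(df[[k]]), lapply(df[-k], function(col) -xtfrm(col)))

o <- do.call(order, unname(keys))
sorted <- df[o, ]
sorted
```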
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For a larger number of columns, arrange_ can be used (note that arrange_ is deprecated in current dplyr releases):
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))

Matching values in a column of one data frame with subsets of a column in another data frame

I am trying to match the values in a column of one data frame with the values in a column of a second data frame. The tricky part is that I would like to do the matching using subsets of the second data frame (designated by a distinct column in the second data frame from the one that is being matched). This is different from the commonly posted problem of trying to subset based on matching between data frames.
My problem is the opposite - I want to match data frames based on subsets. To be specific, I would like to match subsets of the column in the second data frame with the entire column of the first data frame, and then create new columns in the first data frame that show whether or not a match has been made for each subset.
These subsets can have varying number of rows. Using the two dummy data frames below...
DF1 <- data.frame(number=1:10)
DF2 <- data.frame(category = rep(c("A","B","C"), c(5,7,3)),
number = sample(10, size=15, replace=T))
...the objective would be to create three new columns (DF1$A, DF1$B, and DF1$C) that show whether the values in DF1$number match the values in DF2$number for each of the respective subsets of DF2$category. Ideally the rows in these new columns would show a '1' if a match has been made and a '0' if not. With the dummy data above I would end up with DF1 having 4 columns (DF1$number, DF1$A, DF1$B, and DF1$C) of 10 rows each.
Note that in my actual second data frame I have a huge number of categories, so I don't want to have to type them out individually for whatever operation is needed to accomplish this objective. I hope that makes sense! Sorry if I'm missing something obvious and thanks very much for any help you might be able to provide.
This should work:
sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)
A B C
[1,] 0 0 1
[2,] 1 1 0
[3,] 1 1 1
[4,] 0 1 0
[5,] 0 0 1
[6,] 0 1 0
[7,] 1 1 0
[8,] 1 0 0
[9,] 1 0 0
[10,] 0 1 0
You can add this back to DF1 like:
data.frame(
DF1,
sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)
)
number A B C
1 1 0 0 1
2 2 1 1 0
3 3 1 1 1
4 4 0 1 0
5 5 0 0 1
6 6 0 1 0
7 7 1 1 0
8 8 1 0 0
9 9 1 0 0
10 10 0 1 0
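
For readers who prefer an explicit loop, the same result can be sketched by adding one column per category directly to DF1. The DF2 values below are hypothetical fixed numbers (the question uses sample() without a seed), chosen so that the flags agree with the output printed above; ctg is an illustrative loop variable.

```r
DF1 <- data.frame(number = 1:10)

# Hypothetical fixed data standing in for the random sample in the question.
DF2 <- data.frame(category = rep(c("A", "B", "C"), c(5, 7, 3)),
                  number = c(2, 3, 7, 8, 9,
                             2, 3, 4, 6, 7, 10, 3,
                             1, 3, 5))

# One new 0/1 column per category, without typing the category names.
for (ctg in as.character(unique(DF2$category))) {
  DF1[[ctg]] <- as.integer(DF1$number %in% DF2$number[DF2$category == ctg])
}
DF1
```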
