Creating a New Column that Compares Two Other Columns in R

My data looks like:
x y
2 5
3 6
4 2
6 1
7 10
12 16
145 1
I want to output the smaller of the two numbers in each row into a new column, which would look like:
x y z
2 5 2
3 6 3
4 2 2
6 1 1
7 10 7
12 16 12
145 1 1
None of the data will be equal so you don't need to worry about that.
x <- c(2,3,4,6,7,12,145)
y <- c(5,6,2,1,10,16,1)
df <- data.frame(x,y)

Using case_when from the tidyverse:
library(dplyr) # provides %>%, mutate and case_when
x <- c(2,3,4,6,7,12,145)
y <- c(5,6,2,1,10,16,1)
df <- data.frame(x,y)
# z takes x where x is smaller, otherwise y
df <- df %>%
  mutate(z = case_when(
    x < y ~ x,
    TRUE ~ y
  ))
df

You can use pmin to get the row-wise minimum of the x and y columns.
df$z <- pmin(df$x, df$y)
df
# x y z
#1 2 5 2
#2 3 6 3
#3 4 2 2
#4 6 1 1
#5 7 10 7
#6 12 16 12
#7 145 1 1
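For readers comparing the two answers, here is a minimal base-R sketch suggesting that pmin gives the same result as an explicit element-wise comparison. The column names z_pmin and z_ifelse and the vector with_na are made up for illustration, and the na.rm part assumes real data might contain missing values, which the question's sample does not.

```r
x <- c(2, 3, 4, 6, 7, 12, 145)
y <- c(5, 6, 2, 1, 10, 16, 1)
df <- data.frame(x, y)

# Row-wise (parallel) minimum of the two columns
df$z_pmin <- pmin(df$x, df$y)

# Equivalent explicit comparison, vectorised with ifelse
df$z_ifelse <- ifelse(df$x < df$y, df$x, df$y)

# With a missing value, pmin returns NA unless na.rm = TRUE is set
# (with_na is a hypothetical example, not part of the question's data)
with_na <- pmin(c(1, NA), c(2, 5), na.rm = TRUE)
```

pmin is usually the shorter choice, and it extends to more than two columns by passing further arguments.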

Related

Filter based on matching condition in R [duplicate]

I'm trying to execute a command to only keep rows where the 'ID' is the same in column Y as it is in column X. In other words, keep the row if the 'ID' in column Y matches the ID in column X.
edit: here's the code that is close but not quite there. What I need is to add a condition to the Y column. So it should keep rows where the ID in column X equals the ID in column Y when column Y = '34'.
data %>%
filter(ID %in% X == ID %in% Y)
You can use join or just do something like this:
df <- data.frame(x = 1:13, y = c(1:5,7:14))
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
11 11 12
12 12 13
13 13 14
rows_to_select <- which(df$x==df$y,TRUE)
df[rows_to_select,]
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
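One reason to wrap the comparison in which, sketched here under the assumption that the columns may contain NA (the example data above does not): which drops comparisons that evaluate to NA, while plain logical indexing keeps them as all-NA rows. The data frame df_na is invented for illustration.

```r
# Hypothetical data with one missing value in x
df_na <- data.frame(x = c(1, 2, NA, 4), y = c(1, 3, 3, 4))

# which() keeps only positions where the comparison is TRUE
by_which <- df_na[which(df_na$x == df_na$y), ]

# Logical indexing turns the NA comparison into an extra all-NA row
by_logical <- df_na[df_na$x == df_na$y, ]
```

Here by_which has two rows and by_logical has three, the additional one consisting entirely of NA.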
You can use the 'which' function in base R. Example:
set.seed(7) # create toy dataframe
x1 <- sample(1:2, 10, replace = TRUE)
x2 <- sample(1:2, 10, replace = TRUE)
df <- data.frame(x1, x2)
df
x1 x2
1 2 2
2 2 1
3 2 1
4 1 1
5 2 1
6 2 2
7 1 2
8 1 1
9 1 1
10 1 2
keep <- which(df$x1 == df$x2) # only this line
keep
1 4 6 8 9
df2 <- df[keep , ] # and this line required for the reduced dataframe
df2
x1 x2
1 2 2
4 1 1
6 2 2
8 1 1
9 1 1
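The edit to the question asks for an extra condition on Y. A minimal base-R sketch of one way to combine both conditions follows; the sample data frame and the assumption that X and Y hold numeric IDs are mine, since the question does not show its data.

```r
# Hypothetical data: X and Y hold numeric IDs
data <- data.frame(X = c(34, 34, 12, 34), Y = c(34, 12, 12, 34))

# Keep rows where X equals Y and Y is 34
kept <- data[which(data$X == data$Y & data$Y == 34), ]
```

If the IDs are stored as character strings, the comparison would be against "34" instead.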

Add column to each data frame within list with function rowSums and range of columns

The following might serve as a small example of the real list.
a <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
b <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
c <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
l <- list(a,b,c)
For every data frame, I want to add the row sums of the second through last columns as a new column.
I tried:
lapply(l, function(x) rowSums(x[2:ncol(x)]))
which returns the correct sums, but doesn't add them to the data frames.
I also tried:
lapply(l, transform, sum = y + z)
which gives me the correct results but is not flexible enough, because I don't always know how many columns each data frame has or what they are named. The only thing I know is that I have to sum from the second column to the last. I tried to combine these two approaches but can't figure out exactly how to do it.
Thanks
Try this. You can index the columns negatively to exclude the first variable, so the row sums are computed over however many additional variables there are. Here is the code:
#Compute rowsums
l1 <- lapply(l,function(x) {x$RowSum<-rowSums(x[,-1],na.rm=T);return(x)})
Output:
l1
[[1]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Here's how to combine your attempts. I used data[-1] instead of data[2:ncol(data)] because it seems simpler, but either should work.
lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
Unfortunately, transform will be confused if the name of the argument to your anonymous function is the same as a column name: data[-1] needs to look at the data frame, not a particular column. (I originally used function(x) instead of function(data), and this caused an error because there is a column named x. From this perspective, Duck's answer is a little safer.)
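The name clash described above can be reproduced with a small sketch. The one-element list l_demo and the result names clash and safe are invented for illustration; the error is caught with tryCatch so the snippet runs to the end.

```r
# Hypothetical one-element list with a column named x
l_demo <- list(data.frame(x = c("A", "A"), y = c(1, 2), z = c(3, 4)))

# Argument named x: inside transform, x[-1] resolves to the column x,
# so rowSums receives a character vector and fails
clash <- tryCatch(
  lapply(l_demo, function(x) transform(x, sum = rowSums(x[-1]))),
  error = function(e) "error"
)

# Argument named data: data[-1] resolves to the data frame as intended
safe <- lapply(l_demo, function(data) transform(data, sum = rowSums(data[-1])))
```

Choosing an argument name that cannot match a column name avoids the problem, as does assigning the new column directly without transform.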
Does this work:
> add_col <- function(df){
+ df[(ncol(df)+1)] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
With sum as column name:
> add_col <- function(df){
+ df['sum'] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Using the tidyverse:
library(tidyverse)
map(l, ~.x %>% mutate(Sum := apply(.x[-1], 1, sum)))
#> [[1]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[2]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[3]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
Created on 2020-09-30 by the reprex package (v0.3.0)
We can use map with mutate
library(purrr)
library(dplyr)
map(l, ~ .x %>%
mutate(sum = rowSums(select(., -1))))
Or with c_across
map(l, ~ .x %>%
rowwise() %>%
mutate(sum = sum(c_across(-1), na.rm = TRUE)) %>%
ungroup)
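The question's main concern was data frames with an unknown number of columns. As a rough check, here is a base-R sketch applying the negative-index approach to a list whose elements differ in width; the list l2, its column names, and the helper add_row_sum are all made up for illustration.

```r
# Hypothetical list: data frames with different numbers of numeric columns
l2 <- list(
  data.frame(id = c("A", "B"), y = c(1, 2)),
  data.frame(id = c("A", "B"), p = c(1, 2), q = c(10, 20), r = c(100, NA))
)

# Sum everything except the first column; d[-1] stays a data frame
# even when only one column remains, so rowSums still works
add_row_sum <- function(d) {
  d$RowSum <- rowSums(d[-1], na.rm = TRUE)
  d
}

res <- lapply(l2, add_row_sum)
```

Note that with na.rm = TRUE the missing value in r is simply skipped, so the second row of the second data frame sums to 22.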

Combination of vectors' values with the same positional indices in R

Assuming we have the following simplified vectors (in reality, they contain many more values):
n <- c(1,2)
x <- c(4,5,6)
y <- c(7,8,9)
#to get all possible combinations, we can use expand.grid
df <- expand.grid(n=n,
                  x=x,
                  y=y
)
> df
n x y
1 4 7
2 4 7
1 5 7
2 5 7
1 6 7
2 6 7
1 4 8
2 4 8
1 5 8
2 5 8
1 6 8
2 6 8
1 4 9
2 4 9
1 5 9
2 5 9
1 6 9
2 6 9
However, I would like vectors x, y to have the combination where only elements with the same index values are considered, i.e. (x1, y1), (x2, y2), (x3, y3) but NOT (x1,y2), (x1,y3), etc.
while vector n is still used as usual (all its elements are 'paired' with the outcome of x and y combination).
In other words, I would like to get the following df:  
n x y
1 4 7
2 4 7
1 5 8
2 5 8
1 6 9
2 6 9
If the n vector had 3 elements, i.e. n <- c(1, 2, 3), then we would have:
n x y
1 4 7
2 4 7
3 4 7
1 5 8
2 5 8
3 5 8
1 6 9
2 6 9
3 6 9
You could combine the pairs that need to stay together into a list and then use that list in expand.grid:
expand.grid(n, Map(c, x, y)) %>% tidyr::unnest_wider(Var2)
Or we can also use crossing using the same logic.
library(tidyverse)
crossing(n, x = map2(x, y, c)) %>%
unnest_wider(x) %>%
rename_at(-1, ~c("x", "y"))
# n x y
# <dbl> <dbl> <dbl>
#1 1 4 7
#2 1 5 8
#3 1 6 9
#4 2 4 7
#5 2 5 8
#6 2 6 9
We can create a function to do this
f1 <- function(vec1, vec2, n) {
d1 <- data.frame(x = vec1, y = vec2)
d2 <- transform(d1[rep(seq_len(nrow(d1)), each = length(n)), ], n = n)
row.names(d2) <- NULL
d2[c('n', 'x', 'y')]
}
f1(x, y, n = 1:2)
# n x y
#1 1 4 7
#2 2 4 7
#3 1 5 8
#4 2 5 8
#5 1 6 9
#6 2 6 9
f1(x, y, n = 1:3)
# n x y
#1 1 4 7
#2 2 4 7
#3 3 4 7
#4 1 5 8
#5 2 5 8
#6 3 5 8
#7 1 6 9
#8 2 6 9
#9 3 6 9
Or in tidyverse
library(dplyr)
library(tidyr)
tibble(x, y) %>%
uncount(length(n)) %>%
mutate(n = rep(n, length.out = n())) %>%
select(n, x, y)
# A tibble: 9 x 3
# n x y
# <int> <dbl> <dbl>
#1 1 4 7
#2 2 4 7
#3 3 4 7
#4 1 5 8
#5 2 5 8
#6 3 5 8
#7 1 6 9
#8 2 6 9
#9 3 6 9
Or create a tibble first and then use that with crossing
tibble(x, y) %>%
crossing(n)
data
n <- 1:3
Here's a tidyverse solution, using purrr::map_df:
library(tidyverse)
map_df(n, ~tibble(n=.x, x, y))
n x y
<dbl> <dbl> <dbl>
1 1 4 7
2 1 5 8
3 1 6 9
4 2 4 7
5 2 5 8
6 2 6 9
If you need the values sorted exactly like your example output, add %>% arrange(x, y) to the output of map_df.
One option would be to paste x and y together, then use expand.grid and split the columns again with the separate function from the tidyr package.
library(dplyr) #for pipe
library(tidyr) #for separate
n <- c(1,2)
x <- c(4,5,6)
y <- c(7,8,9)
z <- paste(x, y, sep = "-")
expand.grid(n = n, xy = z) %>%
separate(xy, sep = "-", into = c("x", "y")) %>%
mutate(x = as.numeric(x), y = as.numeric(y)) %>%
as_tibble()
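A dependency-free alternative, sketched here as one possible approach rather than something taken from the answers above: cross n with the positions of x and y instead of their values, then look the values up by position. The names idx, i, and res are made up for illustration.

```r
n <- c(1, 2)
x <- c(4, 5, 6)
y <- c(7, 8, 9)

# Cross n with the index 1..length(x) rather than with x and y themselves
idx <- expand.grid(n = n, i = seq_along(x))

# Look up x and y by the shared index so they always stay paired
res <- data.frame(n = idx$n, x = x[idx$i], y = y[idx$i])
```

Because expand.grid varies its first argument fastest, the rows come out in the order requested in the question, with n cycling within each (x, y) pair.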

Selecting top N rows for each group based on value in column

I have a data frame like the one below:
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df
x y z
1 3 a 2
2 2 a 2
3 1 a 2
4 8 b 1
5 7 b 1
6 11 c 3
7 10 c 3
8 9 c 3
9 7 c 3
10 5 c 3
11 4 c 3
I want to select the top n rows for each group in column y, where n is given in column z.
So the output should be like :
output:
x y z
1 3 a 2
2 2 a 2
3 8 b 1
4 11 c 3
5 10 c 3
6 9 c 3
A solution with base R:
# df is split according to y, then we keep only the top "z" value (after ordering x)
# and rbind everything back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
EDIT:
A much more direct way (still in base R), provided in a comment by @mt1022:
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
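To see why the ave one-liner works, it may help to look at the intermediate vector it builds. The sketch below adds an explicit sort, on the assumption that real data might not already be ordered by x within each group as the sample is; the names pos and top are made up for illustration.

```r
x <- c(3, 2, 1, 8, 7, 11, 10, 9, 7, 5, 4)
y <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c")
z <- c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
df <- data.frame(x, y, z)

# Sort so the largest x comes first within each group of y
df <- df[order(df$y, -df$x), ]

# Position of each row within its group: 1, 2, 3, 1, 2, 1, 2, ...
pos <- ave(seq_len(nrow(df)), df$y, FUN = seq_along)

# Keep a row when its within-group position does not exceed z
top <- df[pos <= df$z, ]
```

The comparison pos <= df$z is element-wise, so each row is checked against the z value of its own group.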
One approach with data.table:
library(data.table)
setDT(df)
df[,.(inc=seq_len(.N)<=z,x,z),by=.(y)][inc==T ,-2]
# y x z
#1: a 3 2
#2: a 2 2
#3: b 8 1
#4: c 11 3
#5: c 10 3
#6: c 9 3
A solution with dplyr that uses do:
df %>%
group_by(y) %>%
do(head(.,as.numeric(unique(.$z))))
I'm posting the solution I was looking for using dplyr. It is based on @HNSKD:
library(dplyr)
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df %>% group_by(y) %>% slice(1:2)
Which returns the first two elements for each y:
# A tibble: 6 x 3
# Groups: y [3]
x y z
<dbl> <fct> <dbl>
1 3 a 2
2 2 a 2
3 8 b 1
4 7 b 1
5 11 c 3
6 10 c 3
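Note that slice(1:2) takes a fixed two rows per group, so group b keeps two rows rather than the one requested by z. A possible dplyr variant that reads the count from z is sketched below; it is not taken from the answers above, and the name top_n_by_z is made up for illustration.

```r
library(dplyr)

x <- c(3, 2, 1, 8, 7, 11, 10, 9, 7, 5, 4)
y <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c")
z <- c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
df <- data.frame(x, y, z)

top_n_by_z <- df %>%
  group_by(y) %>%
  # Largest x first within each group
  arrange(desc(x), .by_group = TRUE) %>%
  # row_number() restarts at 1 in every group, so this keeps the first z rows
  filter(row_number() <= z) %>%
  ungroup()
```

This keeps two rows for a, one for b, and three for c, matching the output requested in the question.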

repeatedly applying ave for computing group means in a data frame

The following code computes the group means of x and y separately, by group. Suppose I have many variables for which I need to repeat the same operation.
How would you suggest obtaining the same result with a single command? (I suppose tapply is needed, but I am not really sure.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following @Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
ave.dat
# x y
# [1,] 3 4
# [2,] 3 4
# [3,] 3 4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
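Staying with ave, as in the question, one way to avoid repeating the call per variable is to lapply over the columns and assign all results at once. This is a minimal sketch, not one of the answers above; the vector vars and the m_ prefix follow the question's naming but are otherwise my choice.

```r
x <- seq(1, 11, by = 2)
y <- seq(2, 12, by = 2)
group <- rep(1:2, each = 3)
dat <- data.frame(group, x, y)

# Every column except the grouping variable
vars <- setdiff(names(dat), "group")

# ave(column, dat$group) returns the group mean repeated for each row;
# lapply runs it per column and the results are added as m_<name>
dat[paste0("m_", vars)] <- lapply(dat[vars], ave, dat$group)
```

Because the new names are built with paste0, this works for variable names of any length without a regular expression.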
