match multiple columns and create/update selected multiple columns [duplicate] - r

This question already has answers here:
Substitute DT1.x with DT2.y when DT1.x and DT2.x match in R [duplicate]
(1 answer)
merge data.frames based on year and fill in missing values
(4 answers)
Closed 5 years ago.
I would like to update the data frame d_sub with two new columns, x and y (excluding column xy), by matching on the common columns (treatment, replicate) in the parent data frame d.
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, xy=x*y, treatment=treatment, replicate=replicate)
d_sub <- d[sample(nrow(d),6),4:5]
d_sub
# treatment replicate
# 32 b 2
# 11 h 7
# 9 h 3
# 20 e 3
# 10 b 5
# 7 d 3
Unlike a plain merge or the other methods mentioned here, I only need to extract a few columns, as shown in the expected output below:
# treatment replicate x y
# 32 b 2 2 8.998847
# 11 h 7 1 5.082928
# 9 h 3 2 7.050445
# 20 e 3 10 10.145350
# 10 b 5 10 7.941056
# 7 d 3 7 6.814287
Note the exclusion of the xy column in the output here! In my original problem, the parent data frame has thousands of columns that I do not need in the output, and only a few that I do. I am especially looking for methods other than merge, to find out whether I can achieve this in a memory-efficient way.

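I guess it has been asked here before, but what you are looking for is merge, with d restricted to the key columns plus the columns you want to keep (so xy is left out):
merge(d_sub, d[c("treatment", "replicate", "x", "y")], by=c("treatment", "replicate"))
or, to overwrite d_sub with the result:
d_sub <- merge(d_sub, d[c("treatment", "replicate", "x", "y")], by=c("treatment", "replicate"))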
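For the non-merge route the question asks about, one option is a lookup with base R's match on a combined key. The sketch below uses a small made-up data frame rather than the question's random data, and it assumes each (treatment, replicate) pair identifies the row you want: match returns only the first hit, so duplicated pairs in d would silently resolve to the first matching row.

```r
# Toy stand-ins for d and d_sub (values are illustrative assumptions)
d <- data.frame(
  x = 1:4,
  y = c(2.5, 3.5, 4.5, 5.5),
  treatment = c("a", "b", "a", "b"),
  replicate = c(1, 1, 2, 2)
)
d$xy <- d$x * d$y
d_sub <- d[c(4, 1), c("treatment", "replicate")]

# Build one key per row from the two matching columns
key_d   <- paste(d$treatment, d$replicate, sep = "|")
key_sub <- paste(d_sub$treatment, d_sub$replicate, sep = "|")

# Position in d of the first row matching each row of d_sub
idx <- match(key_sub, key_d)

# Copy over only the wanted columns; xy is never touched
wanted <- c("x", "y")
d_sub[wanted] <- d[idx, wanted]
d_sub
```

Only the key vectors and the selected columns are materialised here, which is the point when d has thousands of unneeded columns. The row order and row names of d_sub are preserved, unlike with merge.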

Related

How to Use Select() function in a Data Frame by passing the Column name in a variable in R [duplicate]

This question already has answers here:
R dplyr subset with missing columns
(3 answers)
Closed 1 year ago.
Purpose
Can I select columns with dplyr on the condition that the column name appears in an external vector? I have found some posts that explain how to subset a data frame using a vector of names, but none that cover the case where some of the names in the vector do not exist in the data frame.
Example dataset
library(tidyverse)
library(tibble)
library(data.table)
col_names <- c('a', 'b', 'e')
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
a <- sample(1:20, 1000, replace=T)
set.seed(10003)
b <- sample(letters, 1000, replace=T)
set.seed(10004)
c <- sample(letters, 1000, replace=T)
data <-
data.frame(a, b, c)
# I would like to choose a, b that are in col_names vector.
We could use any_of with select
library(dplyr)
data %>%
select(any_of(col_names))
-output
a b
1 1 e
2 4 e
3 13 f
4 8 m
5 10 z
6 3 y
...
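To see what any_of adds over a strict selection, the sketch below compares it with all_of on a small toy data frame (the toy data is an assumption for illustration, not the question's data). It requires a dplyr version that exports any_of and all_of (1.0.0 or later).

```r
library(dplyr)

# Toy data, smaller than the question's data
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c("p", "q", "r"))
col_names <- c("a", "b", "e")   # "e" does not exist in toy

# any_of() keeps the names that exist and silently skips the rest
picked <- toy %>% select(any_of(col_names))
names(picked)

# all_of() is the strict counterpart: it raises an error because "e" is missing
strict <- tryCatch(
  toy %>% select(all_of(col_names)),
  error = function(e) "error: missing column"
)
strict
```

Which one to use depends on whether a missing column should be tolerated or treated as a mistake worth stopping for.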
Here is one way to solve your problem:
data[names(data) %in% col_names]
# a b
# 1 1 e
# 2 4 e
# 3 13 f
# 4 8 m
# 5 10 z
# 6 3 y
# ...
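We may also use matches. Note that matches treats each element of col_names as a regular expression, so 'a' would also select any other column whose name contains an "a"; with the columns a, b and c used here the result is the same:
library(dplyr)
data %>%
select(matches(col_names))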
Output:
a b
<int> <chr>
1 1 e
2 4 e
3 13 f
4 8 m
5 10 z
6 3 y
7 19 g
8 7 f
9 12 f
10 15 k
# … with 990 more rows
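Because matches works on regular expressions rather than exact names, it can pick up more columns than intended when names overlap. The sketch below shows this on a toy data frame with an extra column ab (an assumption added for illustration), and one way to force whole-name matching by anchoring each pattern. Anchoring like this assumes the names contain no regex metacharacters.

```r
library(dplyr)

# Toy data with overlapping column names
toy <- data.frame(a = 1:2, ab = 3:4, b = 5:6, c = 7:8)
col_names <- c("a", "b", "e")

# Unanchored patterns: "a" and "b" both match inside "ab"
loose <- names(select(toy, matches(col_names)))
loose

# Anchor each pattern so that only whole names match
anchored <- paste0("^", col_names, "$")
exact <- names(select(toy, matches(anchored)))
exact
```

When exact names are what you have, any_of avoids the issue entirely.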

Using a loop to create multiple dataframes in R based on columns criteria [duplicate]

This question already has answers here:
Split dataframe using two columns of data and apply common transformation on list of resulting dataframes
(3 answers)
Closed 4 years ago.
Suppose I have a dataframe with 3 columns. I would like to create separate sub-dataframes for each of the unique combinations of a few columns.
For example, suppose we have just 3 columns,
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
I would like to get a separate dataframe for each of the unique combinations of Column 'a' and 'b'
I started with using unique to get a list of the unique combinations as the following,
factors <- unique(df[,c('a','b')])
a b
1 1 a
2 5 a
3 2 f
4 3 d
5 4 f
6 5 c
7 3 a
8 2 r
10 3 c
But I am not sure what to do next.
The code below is for illustration purposes. Ideally this will be done through a loop that uses each of the rows in factors to create the dataframes.
df_1_a <- df %>% filter(a==1, b=='a')
a b c
1 1 a 0.2
2 1 a 0.9
df_3_a <- df %>% filter(a==3, b=='a')
a b c
1 3 a 0.112
.
.
.
This is kind of dirty, and I'm not sure it answers your question, but try this:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
d <- paste0(a,b)
df <- data.frame(a,b,c,d)
df_splited <- split(df,df$d)
You obtain a list of dataframes, one for each unique combination of a and b.
You can use split after you get the unique combinations you are after.
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c,stringsAsFactors = FALSE)
fx <- unique(df[,c('a','b')])
fx_list <- split(fx,rownames(fx))
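split can also group the original df by both columns at once, without building a pasted helper column, by passing a list of grouping vectors. This is a sketch using the question's data; drop = TRUE discards combinations of a and b that never occur together.

```r
a <- c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3)
b <- c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c")
c <- c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
df <- data.frame(a, b, c)

# One data frame per observed (a, b) combination;
# list names are built as "<a>.<b>", for example "1.a"
parts <- split(df, list(df$a, df$b), drop = TRUE)

length(parts)    # number of unique combinations
parts[["1.a"]]   # the rows where a == 1 and b == "a"
```

Keeping the pieces in a named list is usually easier to work with than creating separate variables such as df_1_a in a loop, since the list can be processed with lapply.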

Vector to dataframe with variable row length [duplicate]

This question already has answers here:
Convert Rows into Columns by matching string in R
(3 answers)
Closed 4 years ago.
Given a vector, I want to convert it to a dataframe using a 'key' value which is randomly distributed throughout the vector at the start of what is to be a row. In this case, "z" would be the first value in each column.
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")
The resultant data should look like:
# using magrittr; the pipe must end the line, and t() is base R's transpose
data.frame(x1 = c("z","a","b","c", NA), x2 = c("z","a","b","c", NA), x3 = c("z","a","b","c","d")) %>%
t()
One solution would be to find the largest distance between 'keys' in the vector and then interject blank values at the end of 'sections' that are smaller than the longest 'section' so you could use matrix()
What would be the best way to do this?
plyr::ldply(split(vd, cumsum(vd == "z")), rbind)[-1]
(copied from here)
result:
1 2 3 4 5
1 z a b c <NA>
2 z a b c <NA>
3 z a b c d
We can use cumsum to identify groups then split them. Then we append the vectors and format them as a data.frame.
x <- split(vd,cumsum("z"==vd))
maxl <- max(lengths(x))
as.data.frame(lapply(x,function(y) c(y,rep(NA,maxl-length(y)))))
# X1 X2 X3
# 1 z z z
# 2 a a a
# 3 b b b
# 4 c c c
# 5 <NA> <NA> d
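A variation on the same idea, sketched in base R only: pad each group with `length<-`, which extends a vector with NA, then bind the groups as rows so that each "z" starts a new row. The column names V1 to V5 are whatever as.data.frame assigns by default.

```r
vd <- c("z", "a", "b", "c", "z", "a", "b", "c", "z", "a", "b", "c", "d")

# Start a new group at every "z"
groups <- split(vd, cumsum(vd == "z"))

# Pad every group to the longest length; `length<-` fills with NA
max_len <- max(lengths(groups))
padded <- lapply(groups, `length<-`, max_len)

# One row per group
result <- as.data.frame(do.call(rbind, padded))
result
```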

R loop to find minimal and maximal value in dataframe [duplicate]

This question already has answers here:
How to get summary statistics by group
(14 answers)
Closed 5 years ago.
I'm working in R.
I have a dataframe df with three columns. The structure looks like this:
df <- data.frame(c(11:15,4:7,21:24), c(rep("A",9),rep("B",4)), c(rep("X",5),rep("Y",4),rep("X",4)))
colnames(df) <- c("pos","name","name2")
Example:
pos name name2
11 A X
12 A X
13 A X
14 A X
15 A X
4 A Y
5 A Y
6 A Y
7 A Y
21 B X
22 B X
23 B X
24 B X
From this dataframe, I want to create a new one (df_new) that looks like this
name name2 pos_min pos_max
A X 11 15
A Y 4 7
B X 21 24
So for every unique combination of name & name2 (in this case: A-X, A-Y and B-X), I want to put the minimal and maximal value of df$pos in two new columns.
Can anybody help me to achieve this?
This can be solved using the dplyr package:
library(dplyr)
df_new <- df %>%
group_by(name, name2) %>%
summarise(pos_min = min(pos),
pos_max = max(pos))
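If dplyr is not available, the same summary can be sketched in base R with aggregate and merge. This is an alternative to the answer above, not part of it; the suffixes argument is what turns the two pos columns into pos_min and pos_max.

```r
df <- data.frame(
  pos   = c(11:15, 4:7, 21:24),
  name  = c(rep("A", 9), rep("B", 4)),
  name2 = c(rep("X", 5), rep("Y", 4), rep("X", 4))
)

# Minimum and maximum of pos for each (name, name2) combination
df_min <- aggregate(pos ~ name + name2, data = df, FUN = min)
df_max <- aggregate(pos ~ name + name2, data = df, FUN = max)

# Join the two summaries; the shared column pos gets the suffixes
df_new <- merge(df_min, df_max, by = c("name", "name2"),
                suffixes = c("_min", "_max"))
df_new
```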
