Compare rows of a dataset with another dataset in R


I have dataset1 with 1400 rows and 25 columns, and dataset2 with 400 rows and 5 columns. Both datasets have a column called ID. As a small example, they look like this:
dataset1:
ID c1 c2 c3 c4
12 m n 5 1/2/2015
5 c x 4 2/3/2015
45 g t 47 4/23/2015
45 j t 3 1/1/2016
61 t y 12 7/3/2015
3 r n 18 3/3/2015
dataset2:
ID a1 a2
45 1 1/1/2015
3 5 2/2/2016
12 12 4/29/2016
(As you can see, the IDs in dataset2 are a subset of the IDs in dataset1.)
What I want: for each row of dataset1, if the value in column ID equals a value in column ID of dataset2, copy the corresponding a2 value from that row of dataset2 into a new column of dataset1, as below:
ID c1 c2 c3 c4 c5
12 m n 5 1/2/2015 4/29/2016
5 c x 4 2/3/2015 NA
45 g t 47 4/23/2015 1/1/2015
45 j t 3 1/1/2016 1/1/2015
61 t y 12 7/3/2015 NA
3 r n 18 3/3/2015 2/2/2016
I appreciate your help.

As #42 mentioned, you can use match.
This is an example with match:
# match the ID of df1 with that of df2
# then returns the index of df2 that
# matches df1
# then subset the a2 column using the above index
# then store in a new column in df1
df1$c5 <- df2$a2[match(df1$ID, df2$ID)]
With the question's data, the output of the above code is:
> df1
  ID c1 c2 c3        c4        c5
1 12  m  n  5  1/2/2015 4/29/2016
2  5  c  x  4  2/3/2015      <NA>
3 45  g  t 47 4/23/2015  1/1/2015
4 45  j  t  3  1/1/2016  1/1/2015
5 61  t  y 12  7/3/2015      <NA>
6  3  r  n 18  3/3/2015  2/2/2016
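To make the mechanics of match explicit, here is a minimal sketch using a cut-down copy of the question's data (only the ID, c1 and a2 columns are kept; the intermediate name idx is introduced for illustration). It shows that match returns the position of the first matching ID in df2, and NA where there is no match, which is why unmatched rows end up as NA.

```r
# Cut-down copy of the question's data; strings kept as character.
df1 <- data.frame(ID = c(12, 5, 45, 45, 61, 3),
                  c1 = c("m", "c", "g", "j", "t", "r"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(45, 3, 12),
                  a2 = c("1/1/2015", "2/2/2016", "4/29/2016"),
                  stringsAsFactors = FALSE)

# For each df1$ID, match() gives the row position of the first equal ID
# in df2, or NA when df2 has no such ID.
idx <- match(df1$ID, df2$ID)
idx
# 3 NA 1 1 NA 2

# Indexing with NA yields NA, so unmatched rows are filled automatically.
df1$c5 <- df2$a2[idx]
df1
```

Because match only reports the first hit, this approach assumes ID is unique in df2; duplicated IDs in df1 are fine.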

din's answer is perfect. Another way to think about it is to merge the two data frames.
Data Preparation
ex_data1 <- data.frame(ID = c(12, 5, 45, 45, 61, 3),
                       c1 = c("m", "c", "g", "j", "t", "r"),
                       c2 = c("n", "x", "t", "t", "y", "n"),
                       c3 = c(5, 4, 47, 3, 12, 18),
                       c4 = c("1/2/2015", "2/3/2015", "4/23/2015",
                              "1/1/2016", "7/3/2015", "3/3/2015"),
                       stringsAsFactors = FALSE)
ex_data2 <- data.frame(ID = c(45, 3, 12),
                       a1 = c(1, 5, 12),
                       a2 = c("1/1/2015", "2/2/2016", "4/29/2016"),
                       stringsAsFactors = FALSE)
Solution 1: Merge the data using base R
ex_data3 <- ex_data2[, c("ID", "a2")]
names(ex_data3) <- c("ID", "c5")
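# all.x = TRUE keeps every row of ex_data1 (a left join);
# note that merge() sorts the result by ID
m_data <- merge(ex_data1, ex_data3, by = "ID", all.x = TRUE)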
Solution 2: Merge the data using dplyr
library(dplyr)
m_data <- ex_data1 %>%
left_join(ex_data2, by = "ID") %>%
select(-a1, c5 = a2)
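One practical difference between the two solutions is row order: merge sorts the result by the key, while left_join keeps the order of the left table. If the original order matters with base R, a common workaround is to carry a temporary row index through the merge. The sketch below assumes a cut-down version of the data above, and row_id is a hypothetical helper column, not part of the original answer.

```r
ex_data1 <- data.frame(ID = c(12, 5, 45, 45, 61, 3),
                       c1 = c("m", "c", "g", "j", "t", "r"),
                       stringsAsFactors = FALSE)
ex_data2 <- data.frame(ID = c(45, 3, 12),
                       a2 = c("1/1/2015", "2/2/2016", "4/29/2016"),
                       stringsAsFactors = FALSE)

# Remember the original row positions (row_id is a hypothetical helper column).
ex_data1$row_id <- seq_len(nrow(ex_data1))

# Left join: keep all rows of ex_data1; the result comes back sorted by ID.
m <- merge(ex_data1, ex_data2, by = "ID", all.x = TRUE)

# Restore the original order and drop the helper column.
m <- m[order(m$row_id), ]
m$row_id <- NULL
rownames(m) <- NULL
m
```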

Related

Dplyr: How to match a value from multiple columns?

I have a dataset with n columns plus an additional column holding a column number. I want to add another column, Value, that returns, row by row, the value from the column with that number.
Col 1   …   Col 14   …   Col n   Number of column   Value
a1      …   a14      …   an      14                 a14
b1      …   b14      …   bn      8                  b8
c1      …   c14      …   cn      1                  c1
Such an operation can be done with a for loop, but how can it be done in dplyr? Thank you!

Base R option -
df$Value <- df[cbind(1:nrow(df), df$n)]
df
# col1 col2 col3 n Value
#1 1 6 11 1 1
#2 2 7 12 2 7
#3 3 8 13 3 13
#4 4 9 14 3 14
#5 5 10 15 2 10
In dplyr -
library(dplyr)
df %>% rowwise() %>% mutate(Value = c_across(everything())[n]) %>% ungroup()
data
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, n = c(1, 2, 3, 3, 2))
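The base R option relies on matrix indexing: indexing with a two-column matrix treats each row of that matrix as a (row, column) pair and returns one element per pair. The sketch below spells this out on the same sample data; the names idx and vals are introduced here for illustration, and the value columns are converted to a matrix explicitly so that the n column is not part of the lookup.

```r
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, n = c(1, 2, 3, 3, 2))

# One (row, column) pair per row of df.
idx <- cbind(row = seq_len(nrow(df)), col = df$n)

# Indexing a matrix with a two-column matrix picks one cell per pair.
vals <- as.matrix(df[c("col1", "col2", "col3")])[idx]
vals
# 1 7 13 14 10
```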

In R is there a way to recode the columns from one data frame with values from another data frame?

I am still relatively new to working in R and I am not sure how to approach this problem. Any help or advice is greatly appreciated!!!
The problem I have is that I am working with two data frames and I need to recode the first data frame with values from the second. The first data frame (df1) contains the data from the respondents to a survey and the other data frame(df2) is the data dictionary for df1.
The data looks like this:
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b1","b2","b3","c1","c2","c3"))
So far I can manually recode df1 to get the expected output by doing this:
df1 <- within(df1,{
a[a==1] <- "a1"
a[a==2] <- "a2"
a[a==3] <- "a3"
b[b==4] <- "b4"
b[b==5] <- "b5"
b[b==6] <- "b6"
c[c==7] <- "c7"
c[c==8] <- "c8"
c[c==9] <- "c9"
})
However my real dataset has about 42 columns that need to be recoded and that method is a little time intensive. Is there another way in R for me to recode the values in df1 with the values in df2?
Thanks!

Just need to transform the shape a bit.
library(data.table)
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b4","b5","b6","c7","c8","c9"),stringsAsFactors = FALSE)
setDT(df1)
setDT(df2)
df1[,ID:=.I]
ldf1 <- melt(df1,measure.vars = c("a","b","c"),variable.name = "columnIndicator",value.name = "df1_value")
ldf1[df2,"new_value":=i.new_value,on=.(columnIndicator,df1_value)]
ldf1
#> ID columnIndicator df1_value new_value
#> 1: 1 a 1 a1
#> 2: 2 a 2 a2
#> 3: 3 a 3 a3
#> 4: 1 b 4 b4
#> 5: 2 b 5 b5
#> 6: 3 b 6 b6
#> 7: 1 c 7 c7
#> 8: 2 c 8 c8
#> 9: 3 c 9 c9
dcast(ldf1,ID~columnIndicator,value.var = "new_value")
#> ID a b c
#> 1: 1 a1 b4 c7
#> 2: 2 a2 b5 c8
#> 3: 3 a3 b6 c9
Created on 2020-04-18 by the reprex package (v0.3.0)

In base R, we can unlist df1, match it against df1_value, and take the corresponding new_value.
df1[] <- df2$new_value[match(unlist(df1), df2$df1_value)]
df1
# a b c
#1 a1 b1 c1
#2 a2 b2 c2
#3 a3 b3 c3
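The one-liner above matches on df1_value only, so it works when every code is unique across columns, as in the example. If the same code can appear in several columns with different labels, the lookup also has to respect columnIndicator. A minimal sketch of that variant follows; the data are made up so that both columns share the codes 1 to 3, and recode_col is a hypothetical helper name.

```r
# Made-up data in which columns a and b use the same codes.
df1 <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))
df2 <- data.frame(columnIndicator = c("a", "a", "a", "b", "b", "b"),
                  df1_value = c(1, 2, 3, 1, 2, 3),
                  new_value = c("a1", "a2", "a3", "b1", "b2", "b3"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: recode one column using only its own dictionary rows.
recode_col <- function(values, col_name) {
  dict <- df2[df2$columnIndicator == col_name, ]
  dict$new_value[match(values, dict$df1_value)]
}

# Apply the helper to every column together with its name.
df1[] <- Map(recode_col, df1, names(df1))
df1
```

Values without a dictionary entry come back as NA, which makes gaps in the data dictionary easy to spot.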

Is this what you are looking for???
library(dplyr)
library(tidyr)  # gather() comes from tidyr, not dplyr
df3 <- df1 %>% gather(key = "key", value = "value")
df3 %>% inner_join(df2, by = c("key" = "columnIndicator", "value" = "df1_value"))
Output
key value new_value
1 a 1 a1
2 a 2 a2
3 a 3 a3
4 b 4 b1
5 b 5 b2
6 b 6 b3
7 c 7 c1
8 c 8 c2
9 c 9 c3

How to subtract a column from the other columns in a data frame

I have a data frame that consists of 1000 rows and 156 columns. I'm trying to subtract the first column from the next 38 columns, then subtract column 39 from the next 38, and so on, but I can't find a way to do it. I'm only using ncdf4 and nothing else. Something like this:
C1 C2 C3 C4 C5 C6 C7 C8
1 2 3 4 5 6 4 5
3 4 6 5 4 3 2 7
And I'd like it to be
C1 C2 C3 C4 C5 C6 C7 C8
0 1 2 3 4 5 3 4
0 1 3 2 1 0 -1 4
The logic would be
First 38 columns - First column
Columns 39:77 - Column 39
and so on.

Solved it by simply doing
{
z[,1:38] <- z[,1:38]-z[,1]
z[,39:77] <-z[,39:77]-z[,39]
z[,78:118] <-z[,78:118]-z[,78]
z[,119:156] <-z[,119:156]-z[,119]
}
Where z is the dataframe. Might not be the nicest way but it did the trick

You can also do the following without an explicit loop:
# sample data frame: 2 rows, 156 columns
df <- data.frame(matrix(seq_len(312), ncol = 156))
# first column of each block (1 and 39 as in the question; adjust as needed)
starts <- c(1, 39, 78, 119)
grp <- cumsum(seq_along(df) %in% starts)
# split the data frame into a list of data frames, one per block of columns
blocks <- split.default(df, grp)
# subtract the first column of each block from every column in that block
res <- do.call(cbind, unname(lapply(blocks, function(b) b - b[[1]])))
colnames(res) <- paste0('C', seq_along(res))
print(res[, 1:5])
  C1 C2 C3 C4 C5
1  0  2  4  6  8
2  0  2  4  6  8

Here is a user-defined function. You can add else if branches as desired.
mydiff <- function(df) {
  mydiff <- df
  for (i in 1:ncol(df)) {
    if (i <= 38) {
      mydiff[, i] <- df[, i] - df[, 1]
    }
    else if (i %in% c(39:77)) {
      mydiff[, i] <- df[, i] - df[, 39]
    }
  }
  mydiff
}
mydiff(df1)
Output:
C1 C2 C3 C4 C5 C6 C7 C8
0 1 2 3 4 5 3 4
0 1 3 2 1 0 -1 4
Benchmark:
system.time(result<-as.tibble(iris2) %>%
select_if(is.numeric) %>%
mydiff())
Result:
user system elapsed
0.02 0.00 0.01

You should consider using the tidyverse for this; loading a package adds little overhead and can make your life much easier.
library(tidyverse)
df %>%
  mutate_at(.vars = vars(num_range(prefix = 'C', 1:38)), .funs = function(x) x - .$C1) %>%
  mutate_at(.vars = vars(num_range(prefix = 'C', 39:77)), .funs = function(x) x - .$C39)
C1 C2 C3 C4 C38 C39 C40 C41 C42 C77
1 0 1 2 3 4 0 1 2 3 4
2 0 0 3 2 4 0 0 3 2 4
Data
df <-
data.frame(
C1 = c(1, 3),
C2 = c(2, 3),
C3 = c(3, 6),
C4 = c(4, 5),
C38 = c(5, 7),
C39 = c(1, 3),
C40 = c(2, 3),
C41 = c(3, 6),
C42 = c(4, 5),
C77 = c(5, 7)
)

How to merge two dataframes with replacement/creation of rows depending on existence in first df?

I have two dataframes df1 and df2, I am looking for the simplest operation to get df3.
I want to replace rows in df1 with rows from df2 if the id matches (so rbind.fill is not a solution), and append rows from df2 where the id does not exist in df1, but only for the columns that exist in df2.
I guess I could use several joins and antijoins and then merge but I wonder if there already exists a function for that operation.
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
df1
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 14 19 24
5 15 20 25
df2
id c1 c2
4 1 5
5 2 6
6 3 7
7 4 8
df3
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 1 5 24
5 2 6 25
6 3 7 NULL
7 4 8 NULL

We can use {powerjoin}: make a full join and resolve the conflicting columns with coalesce_yx, which takes the value from df2 when it is present and falls back to df1 otherwise (essentially dplyr::coalesce(y, x)):
library(powerjoin)
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
power_full_join(df1, df2, by = "id", conflict = coalesce_yx)
# id c1 c2 c3
# 1 1 11 16 21
# 2 2 12 17 22
# 3 3 13 18 23
# 4 4 1 5 24
# 5 5 2 6 25
# 6 6 3 7 NA
# 7 7 4 8 NA

I ended up with:
library(dplyr)
special_combine <- function(df1, df2){
  # columns shared with df2, and id plus the columns only df1 has
  df1_int <- df1[, colnames(df1) %in% colnames(df2)]
  df1_ext <- df1[, c("id", colnames(df1)[!colnames(df1) %in% colnames(df2)])]
  # stack both, keep the last (df2) row for each duplicated id
  df3 <- bind_rows(df1_int, df2)
  df3 <- df3[!duplicated(df3$id, fromLast=TRUE), ] %>%
    dplyr::left_join(df1_ext, by="id") %>%
    dplyr::arrange(id)
  df3
}
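On the question of whether a ready-made function exists: dplyr (1.0.0 and later) provides rows_upsert, which updates rows whose key is already present and inserts the rest. A minimal sketch on the question's data follows, assuming dplyr is installed and that df2 only contains columns that also exist in df1, as it does here.

```r
library(dplyr)

df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)

# Update the rows of df1 whose id appears in df2 and append the other df2 rows.
# Columns missing from df2 (here c3) keep their df1 value, or are NA for new rows.
df3 <- rows_upsert(df1, df2, by = "id")
df3
```

Updated rows stay in their original position and new ids are appended at the end, so no extra arrange step is needed for this example.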

Run call columns data of another dataframe, row by row

This is my first data frame:
set.seed(11)
df1 <- as.data.frame(matrix(rbinom(9*9, 1, 0.5), ncol=9, nrow =9))
colnames(df1) <- paste(rep(c("a","b","c"), each=3), rep(c(1,2,3), 3), sep = "")
This is my second data frame:
factor.1 <- paste(rep(c("a","b"), each=3), rep(c(1,2,3), 2), sep = "")
factor.2 <- rep(paste(rep("c", 3), c(1,2,3), sep = ""), 2)
df2 <- as.data.frame(cbind(factor.1,factor.2))
For each row of df2, I want to compute the sum of the df1 column named in factor.1 and store it in the second data frame. I use dplyr:
fun1 <- function(x){sum(df1[, x])}
df2 %>% mutate(value = fun1(factor.1))
But what I get is this,
factor.1 factor.2 value
1 a1 c1 22
2 a2 c2 22
3 a3 c3 22
4 b1 c1 22
5 b2 c2 22
6 b3 c3 22
But What I want is this,
factor.1 factor.2 value
1 a1 c1 4
2 a2 c2 4
3 a3 c3 4
4 b1 c1 1
5 b2 c2 4
6 b3 c3 5

Is this what you are looking for ?
df2 %>% mutate(value = sapply(factor.1, fun1) )
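The reason the original call returns the same number in every row is that mutate passes the whole factor.1 column to fun1 at once, so sum adds up all the selected columns together and the single result is recycled. The sketch below illustrates this with small made-up stand-ins for df1 and df2 (not the question's random data), and also shows colSums as a vectorised alternative to sapply.

```r
# Small deterministic stand-ins for the question's df1 and df2 (assumed data).
df1 <- data.frame(a1 = c(1, 0, 1), a2 = c(1, 1, 1), b1 = c(0, 0, 1))
df2 <- data.frame(factor.1 = c("a1", "a2", "b1"), stringsAsFactors = FALSE)

fun1 <- function(x) sum(df1[, x])

# Called on the whole column, fun1 sums all three df1 columns at once.
total <- fun1(df2$factor.1)

# Called once per name, it returns one sum per row of df2.
per_row <- sapply(df2$factor.1, fun1, USE.NAMES = FALSE)

# colSums() gives the same per-column sums without an explicit loop.
per_row2 <- unname(colSums(df1[df2$factor.1]))

df2$value <- per_row
df2
```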
