I have a data frame with some errors:
T item V1 V2
1 a 2 .1
2 a 5 .8
1 b 1 .7
2 b 2 .2
I have another data frame with corrections for some items, concerning V1 only:
T item V1
1 a 2
2 a 6
How do I get the final data frame? Should I use merge or rbind? Note: the actual data frames are big.
One option is a data.table join on 'T' and 'item', assigning 'V1' the corresponding 'V1' value from the second dataset (i.V1):
library(data.table)
setDT(df1)[df2, V1 := i.V1, on = .(T, item)]
df1
# T item V1 V2
#1: 1 a 2 0.1
#2: 2 a 6 0.8
#3: 1 b 1 0.7
#4: 2 b 2 0.2
data
df1 <- structure(list(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b",
"b"), V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L)),
class = "data.frame", row.names = c(NA,
-2L))
This should work -
library(dplyr)
df1 %>%
left_join(df2, by = c("T", "item")) %>%
mutate(
V1 = coalesce(as.numeric(V1.y), as.numeric(V1.x))
) %>%
select(-V1.x, -V1.y)
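If you would rather avoid extra packages, the same update can be sketched in base R with match(). This is only a sketch: it assumes each (T, item) pair appears at most once in df2, and that the tab character used to build the composite key never occurs in item.

```r
# Base R sketch of an "update join": overwrite V1 in df1 wherever df2
# carries a correction for the same (T, item) pair.
df1 <- data.frame(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b", "b"),
                  V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L),
                  stringsAsFactors = FALSE)

# Composite keys (assumption: "\t" never appears in item)
key1 <- paste(df1$T, df1$item, sep = "\t")
key2 <- paste(df2$T, df2$item, sep = "\t")

idx <- match(key1, key2)   # position of each df1 row in df2, NA if no correction
hit <- !is.na(idx)         # rows of df1 that have a correction
df1$V1[hit] <- df2$V1[idx[hit]]
df1
```

Rows without a match in df2 keep their original V1, so nothing is dropped or duplicated.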
Related
Good evening,
I have a two columns tab separated .txt file, as the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows that have the same value in the column "number", putting the corresponding "letter" values into a single comma-separated value.
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
How do I merge these two data frames?
df1:
Name Type Price
A 1 NA
B 2 2.5
C 3 2.0
df2:
Name Type Price
A 1 1.5
D 2 2.5
E 3 2.0
As you can see, both data frames have the same column names and share one row with the same value in "Name" (A), but df1 is missing the price for that row whereas df2 has it. I want to merge them so that rows with the same "Name" are combined, giving this output:
Name Type Price
A 1 1.5
B 2 2.5
C 3 2.0
D 2 2.5
E 3 2.0
We could do a full_join on df1 and df2 by Name, then use coalesce on Type and Price to take the first non-NA value from each pair of columns.
library(dplyr)
full_join(df1, df2, by = 'Name') %>%
mutate(Type = coalesce(Type.x, Type.y),
Price = coalesce(Price.x, Price.y)) %>%
select(names(df1))
# Name Type Price
#1 A 1 1.5
#2 B 2 2.5
#3 C 3 2.0
#4 D 2 2.5
#5 E 3 2.0
And similar in base R :
transform(merge(df1, df2, by = 'Name', all = TRUE),
Price = ifelse(is.na(Price.x), Price.y, Price.x),
Type = ifelse(is.na(Type.x), Type.y, Type.x))[names(df1)]
data
df1 <- structure(list(Name = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Type = 1:3, Price = c(NA, 2.5, 2)),
class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(Name = structure(1:3, .Label = c("A", "D", "E"
), class = "factor"), Type = 1:3, Price = c(1.5, 2.5, 2)),
class = "data.frame", row.names = c(NA, -3L))
Seems like you want to rbind the data frames together, then remove rows with NA values for Price, and order by Name.
library(data.table)
setDT(rbind(df1, df2))[!is.na(Price)][order(Name)]
# Name Type Price
# 1: A 1 1.5
# 2: B 2 2.5
# 3: C 3 2.0
# 4: D 2 2.5
# 5: E 3 2.0
Here is a base R solution using merge + complete.cases
dfout <- subset(u <- merge(df1,df2,all= TRUE),complete.cases(u))
which yields
> dfout
Name Type Price
1 A 1 1.5
3 B 2 2.5
4 C 3 2.0
5 D 2 2.5
6 E 3 2.0
DATA
df1 <- structure(list(Name = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Type = 1:3, Price = c(NA, 2.5, 2)),
class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(Name = structure(1:3, .Label = c("A", "D", "E"
), class = "factor"), Type = 1:3, Price = c(1.5, 2.5, 2)),
class = "data.frame", row.names = c(NA, -3L))
I would like to reduce the following data frame so that rows with a repeated id are combined into one row holding the mean of the repeated observations.
id V1 V2
AA 21.76410 1
BB 25.57568 0
BB 20.91222 0
CC 21.71828 1
CC 22.89878 1
FF 22.20535 0
structure(list(id = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("AA",
"BB", "CC", "FF"), class = "factor"), V1 = c(21.7640981693372,
25.575675904744, 20.9122208946358, 21.7182828011676, 22.8987775530191,
22.2053520672232), V2 = c(1, 0, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA,
-6L))
After reduction by mean, it should look like this:
id V1 V2
AA 21.76410 1
BB 23.24395 0 # mean reduction for BB in V1 and V2
CC 22.30853 1 # same as above
FF 22.20535 0
structure(list(id = structure(1:4, .Label = c("AA", "BB", "CC",
"FF"), class = "factor"), V1 = c(21.7641, 23.24395, 22.30853,
22.20535), V2 = c(1, 0, 1, 0)), class = "data.frame", row.names = c(NA,
-4L))
How can I do that in R?
Any package function or custom function code you want to share with me would be of great help.
Thanks.
With base R, it can be done with aggregate
aggregate(.~ id, df1, mean)
Using dplyr:
library(dplyr)
df %>%
group_by(id) %>%
mutate(V1 = ifelse(n() > 1, mean(V1), V1)) %>%
unique()
# A tibble: 4 x 3
# Groups: id [4]
# id V1 V2
#<fct> <dbl> <dbl>
#1 AA 21.8 1
#2 BB 23.2 0
#3 CC 22.3 1
#4 FF 22.2 0
Another way of using aggregate from base R
dfout <- aggregate(df[-1],df[1],FUN = mean)
such that
> dfout
id V1 V2
1 AA 21.76410 1
2 BB 23.24395 0
3 CC 22.30853 1
4 FF 22.20535 0
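As a quick sanity check on the expected values, the per-id means of V1 can be computed directly with tapply. This is just a sketch that re-enters the rounded values printed in the question, so the results agree with the table to about five decimals.

```r
# Rounded values as printed in the question
df <- data.frame(id = c("AA", "BB", "BB", "CC", "CC", "FF"),
                 V1 = c(21.76410, 25.57568, 20.91222, 21.71828, 22.89878, 22.20535),
                 V2 = c(1, 0, 0, 1, 1, 0),
                 stringsAsFactors = FALSE)

# Mean of V1 within each id; the result is a named vector, one entry per id
means <- tapply(df$V1, df$id, mean)
means[["BB"]]   # (25.57568 + 20.91222) / 2
means[["CC"]]   # (21.71828 + 22.89878) / 2
```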
To start I will ignore the use of lists and show what I want using two df's.
I have df1
ID v1 Join_ID
1 100 1
2 110 2
3 150 3
And df2
Join_ID Type v2
1 a 80
1 b 90
2 a 70
2 b 60
3 a 50
3 b 40
I want the df.join to be:
ID v1 Join_ID a_v2 b_v2
1 100 1 80 90
2 110 2 70 60
3 150 3 50 40
I have tried:
df.merged <- merge(df1, df2, by="Join_ID")
df.wide <- dcast(melt(df.merged, id.vars=c("ID", "type")), ID~variable+type)
But this repeats all the variables in df1 for each type: v1_a v1_b
On top of this I have two lists
list.1
df1_a
df1_b
df1_c
list.2
df2_a
df2_b
df2_c
And I want the df1_a in list 1 to join with the df2_a in list 2
We can do this by mapping over the corresponding list elements and doing the join for each pair
library(tidyverse)
map2(list.1, list.2, ~
.y %>%
mutate(Type = paste0(Type, "_v2")) %>%
spread(Type, v2) %>%
inner_join(.x, by = 'Join_ID'))
data
df1 <- structure(list(ID = 1:3, v1 = c(100L, 110L, 150L), Join_ID = 1:3),
.Names = c("ID",
"v1", "Join_ID"), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(Join_ID = c(1L, 1L, 2L, 2L, 3L, 3L), Type = c("a",
"b", "a", "b", "a", "b"), v2 = c(80L, 90L, 70L, 60L, 50L, 40L
)), .Names = c("Join_ID", "Type", "v2"), class = "data.frame", row.names = c(NA,
-6L))
list.1 <- list(df1_a = df1, df1_b = df1, df1_c = df1)
list.2 <- list(df2_a = df2, df2_b = df2, df2_c = df2)
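A package-free variant of the same idea might look like the sketch below, using reshape() for the long-to-wide step and Map() to pair up the list elements. Note that reshape() names the new columns v2.a and v2.b rather than a_v2 and b_v2, and the helper name join_pair is made up for illustration.

```r
df1 <- data.frame(ID = 1:3, v1 = c(100L, 110L, 150L), Join_ID = 1:3)
df2 <- data.frame(Join_ID = c(1L, 1L, 2L, 2L, 3L, 3L),
                  Type = c("a", "b", "a", "b", "a", "b"),
                  v2 = c(80L, 90L, 70L, 60L, 50L, 40L),
                  stringsAsFactors = FALSE)
list.1 <- list(df1_a = df1, df1_b = df1, df1_c = df1)
list.2 <- list(df2_a = df2, df2_b = df2, df2_c = df2)

# Hypothetical helper: widen y (one row per Join_ID), then join it to x
join_pair <- function(x, y) {
  wide <- reshape(y, idvar = "Join_ID", timevar = "Type", direction = "wide")
  merge(x, wide, by = "Join_ID")
}

# Map() walks both lists in parallel: list.1[[1]] with list.2[[1]], and so on
out <- Map(join_pair, list.1, list.2)
out$df1_a
```

Because df2 is widened before the join, the columns of df1 are not repeated per Type.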
Some replies to your request:
1. the reshaping of df2
2. the join with different column names
library(reshape2)
library(dplyr)
df1=data.frame(id=c(1,2,3), v1=c(100,110,150))
df2=data.frame(Join_ID=c(1,1,2,2,3,3),Type=c("a","b","a","b","a","b"),v2=c(80,90,70,60,50,40))
# value.var names the column that fills the reshaped cells
cast_df2=dcast(df2, Join_ID ~ Type, value.var="v2")
# full_join comes from dplyr; no suffix is needed because no non-key column names clash
mergedData <- full_join(df1, cast_df2, by=c("id"="Join_ID"))
I have a dataframe of 3 columns
A B 1
A B 1
A C 1
B A 1
I want to aggregate it such that it considers combinations A-B and B-A to be the same, resulting in
A B 3
A C 1
How do I go about this?
Use pmin and pmax on the first two columns and then do the group-by-count:
library(dplyr);
df %>% group_by(G1 = pmin(V1, V2), G2 = pmax(V1, V2)) %>% summarise(Count = sum(V3))
Source: local data frame [2 x 3]
Groups: G1 [?]
G1 G2 Count
(chr) (chr) (int)
1 A B 3
2 A C 1
Corresponding data.table solution would be:
library(data.table)
setDT(df)
df[, .(Count = sum(V3)), .(G1 = pmin(V1, V2), G2 = pmax(V1, V2))]
G1 G2 Count
1: A B 3
2: A C 1
Data:
structure(list(V1 = c("A", "A", "A", "B"), V2 = c("B", "B", "C",
"A"), V3 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2", "V3"), row.names = c(NA,
-4L), class = "data.frame")
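The same pmin/pmax trick also works without any packages. The sketch below assumes V1 and V2 are character columns (pmin and pmax do not work on factors) and uses aggregate for the grouped sum.

```r
df <- data.frame(V1 = c("A", "A", "A", "B"),
                 V2 = c("B", "B", "C", "A"),
                 V3 = c(1L, 1L, 1L, 1L),
                 stringsAsFactors = FALSE)

# Order each pair alphabetically so that A-B and B-A get the same key
df$G1 <- pmin(df$V1, df$V2)
df$G2 <- pmax(df$V1, df$V2)

# Sum V3 within each unordered pair
res <- aggregate(V3 ~ G1 + G2, data = df, FUN = sum)
res
```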