R - Merge two data frames with one differing column

Suppose I have two data frames
df1 = data.frame(id = c(1,1,1), stat = c("B", "A", "C"), value = c(10,11,12))
df2 = data.frame(id = c(2,2,2), stat = c("B", "A", "C"), value = c(20,21, 22))
Basically the first column identifies the data frame, the second column is some statistic I want to keep track of, and the last column is the value of that statistic. Can I easily merge the data frames so that I get
stat id value
B 1 10
B 2 20
A 1 11
A 2 21
C 1 12
C 2 22
I'd like to preserve the order of the stat column even though it's not alphabetical.

For the original version of the question (two rows per data frame), you could do
(r <- rbind(df1, df2))[c(2,1,3)][order(r$stat, decreasing = TRUE),]
# stat id value
# 1 B 1 10
# 3 B 2 20
# 2 A 1 11
# 4 A 2 21
In response to the edited question, you could use
f <- function(i) rbind(df1[i,], df2[i,])
do.call(rbind, lapply(1:nrow(df1), f))[c(2,1,3)]
# stat id value
# 1 B 1 10
# 2 B 2 20
# 22 A 1 11
# 21 A 2 21
# 3 C 1 12
# 31 C 2 22
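The same ordering can also be reached without interleaving row by row. The following is a minimal sketch, assuming the desired order of stat is the order in which it appears in df1; the names `both` and `stat_rank` are illustrative and not part of the answer above.

```r
df1 <- data.frame(id = c(1, 1, 1), stat = c("B", "A", "C"),
                  value = c(10, 11, 12), stringsAsFactors = FALSE)
df2 <- data.frame(id = c(2, 2, 2), stat = c("B", "A", "C"),
                  value = c(20, 21, 22), stringsAsFactors = FALSE)

# Stack the two data frames, then rank each row by where its stat
# first appears in df1 (assumption: df1 defines the desired order).
both <- rbind(df1, df2)
stat_rank <- match(both$stat, df1$stat)

# Sort by that rank, break ties by id, and put stat first.
both <- both[order(stat_rank, both$id), c("stat", "id", "value")]
rownames(both) <- NULL
both
```

Because rows are matched on the value of stat rather than on row position, this variant should not depend on both data frames listing the statistics in the same row order.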

Related

Find time-lag between groups in a data.frame

Let's suppose I want to estimate the time lag between two groups within a data.frame.
Here an example of my data:
df_1 = data.frame(time = c(1,3,5,6,8,11,15,16,18,20), group = 'a') # create group 'a' data
df_2 = data.frame(time = c(2,7,10,13,19,25), group = 'b') # create group 'b' data
df = rbind(df_1, df_2) # merge groups
df = df[with(df, order(time)), ] # order by time
rownames(df) = NULL #remove row names
> df
time group
1 1 a
2 2 b
3 3 a
4 5 a
5 6 a
6 7 b
7 8 a
8 10 b
9 11 a
10 13 b
11 15 a
12 16 a
13 18 a
14 19 b
15 20 a
16 25 b
Now I need to subtract, from each group b time, the group a time that immediately precedes it,
i.e. 2-1, 7-6, 10-8, 13-11, 19-18 and 25-20.
# Expected output
> out
[1] 1 1 2 2 1 5
How can I achieve this?
We can find the indices of the b rows and subtract from each the time value at its previous index (this assumes every b row is directly preceded by an a row, as in this data).
inds <- which(df$group == "b")
df$time[inds] - df$time[inds - 1]
#[1] 1 1 2 2 1 5
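If a b row might not be directly preceded by an a row (for example two b rows in a sequence), the index arithmetic above no longer picks an a time. A hedged sketch of one alternative uses base R's findInterval() to look up the latest a time at or before each b time; the names `a_times`, `b_times`, `last_a` and `lag_ab` are illustrative.

```r
df <- data.frame(
  time  = c(1, 2, 3, 5, 6, 7, 8, 10, 11, 13, 15, 16, 18, 19, 20, 25),
  group = c("a", "b", "a", "a", "a", "b", "a", "b",
            "a", "b", "a", "a", "a", "b", "a", "b")
)

a_times <- sort(df$time[df$group == "a"])
b_times <- df$time[df$group == "b"]

# For each b time, the position in a_times of the last a time <= it
# (0 means no a observation has occurred yet).
last_a <- findInterval(b_times, a_times)
last_a[last_a == 0] <- NA

# Lag between each b time and its most recent a time; NA if there is none.
lag_ab <- b_times - a_times[last_a]
lag_ab
```

On the example data this gives the same six lags as the expected output. Note that an a observation at exactly the same time as a b observation would count as the preceding one and give a lag of 0.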
Here's a tidyverse solution. First use transmute to add a column that, for each group b row, holds its time minus the time in the preceding row (and 0 otherwise). Then filter to keep just the results, and convert to a vector with deframe.
library(tidyverse)
df %>%
transmute(result = if_else(group == "b", time - lag(time), 0)) %>%
filter(result != 0) %>%
deframe()
result:
[1] 1 1 2 2 1 5

R: Conditionally replacing values based on column pre-fixes and suffixes

I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
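Another attempt, which should essentially be a single assignment operation, using @alistaire's data again. Note that pmax keeps the larger value when both foo and bar have one, so it can overwrite a non-missing value in foo if the value in bar is larger: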
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
First find the columns common to the two data frames and sort them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for :
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
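The per-column replacement described above can be wrapped in a small helper so that each shared column is filled from its own counterpart in bar. This is only a sketch in base R under stated assumptions: `patch_na` and its arguments are hypothetical names, and the id column is assumed to be unique in the source data frame.

```r
# Hypothetical helper: fill NAs in `target` from `source`, matched on `by`.
patch_na <- function(target, source, by = "id") {
  # Columns present in both data frames, excluding the key.
  shared <- setdiff(intersect(names(target), names(source)), by)
  # Row of `source` that corresponds to each row of `target` (NA if none).
  pos <- match(target[[by]], source[[by]])
  for (v in shared) {
    # Only touch cells that are missing AND have a matching source row.
    fill <- is.na(target[[v]]) & !is.na(pos)
    target[[v]][fill] <- source[[v]][pos[fill]]
  }
  target
}

foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

patched <- patch_na(foo, bar)
patched
```

Columns that exist only in foo (z here) are left untouched, and rows of foo without a match in bar keep their NA.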
If you don't mind verbose base R approaches, then you can easily accomplish this using merge() and a careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
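You can iterate dplyr::coalesce over the intersection of the non-grouping columns. It's not elegant, but it should scale reasonably well (note that invoke_map() is deprecated as of purrr 1.0.0, so newer installations may emit a deprecation warning):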
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6

Grouping data by name R

id value
1 expsubs 29
2 expsubs 32
3 expsubs 27
4 expsubs 36
5 expsubs 29
6 expsubs 24
I'm new to R.
I have data that I sorted in Excel and tried to import into R.
I want to sort or group my data by the names in my "id" column so that I can run an ANOVA on it. I can't figure out how to get R to recognize my id column as the group name for each value. Thanks!
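In this situation you can use the dplyr package:
library(dplyr)
tab <- data.frame(x = c("A", "B", "C", "C"), y = 1:4)
by_x <- group_by(tab, x)
by_x
This code groups your data by the x column for later per-group operations. It does not reorder the rows; use arrange() if you also want them sorted.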
Use order:
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
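Neither sorting nor grouping is strictly needed for the ANOVA itself: aov() only needs the id column to be a factor. The sketch below is illustrative only. The first six values come from the question, while the second group (`ctrlsubs`) and its values are made up so that there is something to compare against.

```r
dat <- data.frame(
  id    = rep(c("expsubs", "ctrlsubs"), each = 6),
  value = c(29, 32, 27, 36, 29, 24,   # values shown in the question
            31, 35, 30, 38, 33, 28),  # made-up values for a hypothetical second group
  stringsAsFactors = FALSE
)

# Tell R that id is a grouping variable rather than free text.
dat$id <- factor(dat$id)

# One-way ANOVA of value by group; row order does not matter here.
fit <- aov(value ~ id, data = dat)
summary(fit)
```

With real data, replace `dat` with the imported data frame and check levels(dat$id) to confirm the group names were read as expected.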

How do you combine two columns into a new column in a dataframe made of two or more different csv files?

I have several csv files all named with dates and for all of them I want to create a new column in each file that contains data from two other columns placed together. Then, I want to combine them into one big dataframe and choose only two of those columns to keep. Here's an example:
Say I have two dataframes:
a b c a b c
x 1 2 3 x 3 2 1
y 2 3 1 y 2 1 3
Then I want to create a new column d in each of them:
a b c d a b c d
x 1 2 3 13 x 3 2 1 31
y 2 3 1 21 y 2 1 3 23
Then I want to combine them like this:
a b c d
x 1 2 3 13
y 2 3 1 21
x 3 2 1 31
y 2 1 3 23
Then keep two of the columns a and d and delete the other two columns b and c:
a d
x 1 13
y 2 21
x 3 31
y 2 23
Here is my current code (It doesn't work when I try to combine two of the columns or when I try to only keep two of the columns):
f <- list.files(pattern="201\\d{5}\\.csv") # reading in all the files
mydata <- sapply(f, read.csv, simplify=FALSE) # assigning them to a dataframe
do.call(rbind,mydata) # combining all of those dataframes into one
mydata$Data <- paste(mydata$LAST_UPDATE_DT,mydata$px_last) # combining two of the columns into a new column named "Data"
c('X','Data') %in% names(mydata) # keeping two of the columns while deleting the rest
The object mydata is a list of data frames. You can change the data frames in the list with lapply:
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
file1 <- "a b
x 2 3"
file2 <- "a b
x 3 1"
mydata <- lapply(c(file1, file2), function(x) read.table(text = x, header =TRUE))
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
# [[1]]
# a b c
# x 2 3 23
#
# [[2]]
# a b c
# x 3 1 31
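Putting the steps from the question together (add the column in each data frame, combine, keep two columns) might look like the sketch below. It assumes the new column d is a and c pasted together, as in the question's example, and it reads from text strings instead of the CSV files so that it is self-contained; `combined` and `result` are illustrative names.

```r
file1 <- "a b c
x 1 2 3
y 2 3 1"
file2 <- "a b c
x 3 2 1
y 2 1 3"

# Read every "file" into a list of data frames.
mydata <- lapply(c(file1, file2), function(txt) read.table(text = txt, header = TRUE))

# Add column d to each data frame in the list.
mydata <- lapply(mydata, function(x) {
  x$d <- paste0(x$a, x$c)
  x
})

# Combine into one data frame and ASSIGN the result (do.call alone discards it).
combined <- do.call(rbind, mydata)

# Keep only columns a and d.
result <- combined[c("a", "d")]
result
```

With the real files, the first lapply would call read.csv on the names returned by list.files(), and the pasted columns would be LAST_UPDATE_DT and px_last.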
You can use rbind(data1, data2)[, c(1, 4)] for that (columns a and d are the first and fourth). I assume that you have already created column d in each data frame.
data1<-structure(list(a = 1:2, b = 2:3, c = c(3L, 1L), d = c(13L, 21L
)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"), class = "data.frame")
> data1
a b c d
x 1 2 3 13
y 2 3 1 21
data2<-structure(list(a = c(3L, 2L), b = c(2L, 1L), c = c(1L, 3L), d = c(31L,
23L)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"
), class = "data.frame")
> data2
a b c d
x 3 2 1 31
y 2 1 3 23
data3<-rbind(data1,data2)
> data3
a b c d
x 1 2 3 13
y 2 3 1 21
x1 3 2 1 31
y1 2 1 3 23
finaldata<-data3[,c("a","d")]
> finaldata
a d
x 1 13
y 2 21
x1 3 31
y1 2 23

Merge data frames and overwrite values

How do I merge 2 similar data frames but have one with greater importance?
For example:
Dataframe 1
Date Col1 Col2
jan 2 1
feb 4 2
march 6 3
april 8 NA
Dataframe 2
Date Col2 Col3
jan 9 10
feb 8 20
march 7 30
april 6 40
Merge these by Date, with dataframe 1 taking precedence but dataframe 2 filling the blanks:
DataframeMerge
Date Col1 Col2 Col3
jan 2 1 10
feb 4 2 20
march 6 3 30
april 8 6 40
EDIT - SOLUTION
# "key" is the name of the join column ("Date" in the example above)
commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
commonNames <- commonNames[commonNames != "key"]
dfmerge <- merge(df1, df2, by = "key", all = TRUE)
for (i in commonNames) {
  left <- paste(i, ".x", sep = "")
  right <- paste(i, ".y", sep = "")
  # fill NAs in the df1 copy from the df2 copy, then drop the df2 copy
  dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
  dfmerge[right] <- NULL
  colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
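As a check, the same idea can be run end to end on the data from the question. This is a sketch rather than the poster's exact code: it uses "Date" as the key, and the final reordering step is an addition so that the rows come back in the order of dataframe 1 (merge() does not guarantee that order).

```r
df1 <- data.frame(Date = c("jan", "feb", "march", "april"),
                  Col1 = c(2, 4, 6, 8), Col2 = c(1, 2, 3, NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date = c("jan", "feb", "march", "april"),
                  Col2 = c(9, 8, 7, 6), Col3 = c(10, 20, 30, 40),
                  stringsAsFactors = FALSE)

key <- "Date"
common <- setdiff(intersect(names(df1), names(df2)), key)
m <- merge(df1, df2, by = key, all = TRUE, sort = FALSE)

for (nm in common) {
  left  <- paste0(nm, ".x")   # copy coming from df1 (takes precedence)
  right <- paste0(nm, ".y")   # copy coming from df2 (fills the blanks)
  fill <- is.na(m[[left]])
  m[[left]][fill] <- m[[right]][fill]
  m[[right]] <- NULL
  names(m)[names(m) == left] <- nm
}

# Restore the row order of df1 (an addition, not in the original solution).
m <- m[match(df1$Date, m$Date), ]
rownames(m) <- NULL
m
```

Only the april row of Col2 is taken from dataframe 2, because it is the only blank in dataframe 1.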
merdat <- merge(dfrm1, dfrm2, by = "Date") # seems self-documenting
# explanation for next line in text below.
merdat$Col2.x[ is.na(merdat$Col2.x) ] <- merdat$Col2.y[ is.na(merdat$Col2.x) ]
Then just rename 'merdat$Col2.x' to 'merdat$Col2' and drop 'merdat$Col2.y'. (Col2.x is the copy from dataframe 1, which takes precedence; Col2.y only fills its blanks.)
In reply to a request for more comments: One way to update only sections of a vector is to construct a logical vector for indexing and apply it using "[" to both sides of an assignment. Another way is to devise a logical vector that is only on the LHS of an assignment but then make a vector using rep() that has the same length as sum(logical.vector). The goal in both instances is to have the same length (and order) for assignment as the items being replaced.
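Both indexing patterns can be seen on small vectors. This is an illustrative sketch with made-up values; `x`, `y`, `z`, `miss` and `miss_z` are not names from the answer.

```r
# Pattern 1: the same logical index on both sides of the assignment.
x <- c(1, 2, 3, NA)
y <- c(9, 8, 7, 6)
miss <- is.na(x)
x[miss] <- y[miss]          # one replacement value per missing slot, in order

# Pattern 2: logical index on the left only; the right-hand side is built
# with rep() to have exactly sum(miss_z) elements.
z <- c(1, NA, 3, NA)
miss_z <- is.na(z)
z[miss_z] <- rep(0, sum(miss_z))

x
z
```

In the first pattern the replacement values keep their pairing with the positions being filled, which is what the merge-based answer relies on.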
Update using the on= argument from data.table v1.9.6 (which allows ad hoc joins):
setDT(df1)[df2, `:=`(Col2 = ifelse(is.na(Col2), i.Col2, Col2),
Col3 = i.Col3), on="Date"][]
Here's a data.table solution. Make sure your df1 and df2's Date column is factor with desired levels (for ordering)
require(data.table)
dt1 <- data.table(df1, key="Date")
dt2 <- data.table(df2, key="Date")
# Col2 refers to the Col2 of dt1 and i.Col2 refers to that of dt2
dt1[dt2, `:=`(Col3 = Col3, Col1 = Col1,
Col2 = ifelse(is.na(Col2), i.Col2, Col2))]
# the result is stored in dt1
> dt1
# Date Col1 Col2 Col3
# 1: jan 2 1 10
# 2: feb 4 2 20
# 3: march 6 3 30
# 4: april 8 6 40
Here is a dplyr solution. Credit to @docendo discimus. Note that in this example the values from df2 take precedence wherever they exist.
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
y x1
1 A 1
2 B 2
3 C NA
4 D 4
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
y x1
1 A 5
2 B 6
3 C 7
library(dplyr)
left_join(df1, df2, by="y") %>%
transmute(y, x1 = ifelse(is.na(x1.y), x1.x, x1.y))
y x1
1 A 5
2 B 6
3 C 7
4 D 4
Consider this example:
> d1 <- data.frame(x=1:4, a=2:5, b=c(3,4,5,NA))
> d1
x a b
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 NA
> d2 <- data.frame(x=1:4, b=c(6,7,8,9), c=11:14)
> d2
x b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
Now use merge and within, with ifelse:
> within(merge(d1, d2, by="x"), {b <- ifelse(is.na(b.x),b.y,b.x); b.x <- NULL; b.y <- NULL})
x a c b
1 1 2 11 3
2 2 3 12 4
3 3 4 13 5
4 4 5 14 9
