Merging multiple (and different) datasets - r

I'd like to merge multiple (around ten) datasets in R. Quite a few of the datasets are different from each other, so I don't need to match them by row name or anything. I'd just like to paste them side by side, on a single dataframe so I can export them into a single sheet. For instance, I have the following two datasets:
Month  Engagement  Test
Jan    51          1
Feb    123         2

Variable  Engagement
Hot       412
Cold      4124
Warm      4fd4
I'd simply like to put them side by side (as in left and right) in a single data frame for exporting purposes, like this:
Month  Engagement  Test  Variable  Engagement
Jan    51          1     Hot       412
Feb    123         2     Cold      4124
NA     NA          NA    Warm      4fd4
Is there any way to accomplish this? It might seem like a strange request, but do let me know if I should provide any more info! Thank you so much.

Put the data frames in a list and find the maximum number of rows across the list. Then index every data frame with rows 1 to that maximum: data frames with fewer rows are padded with NA rows, so they can all be bound column-wise with cbind.
data <- list(df1, df2)
n <- seq_len(max(sapply(data, nrow)))
result <- do.call(cbind, lapply(data, `[`, n, ))
result
# Month Engagement Test Variable Engagement
#1 Jan 51 1 Hot 412
#2 Feb 123 2 Cold 4124
#NA <NA> NA NA Warm 4fd4
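As a self-contained sketch of the same padding idea, the snippet below rebuilds the two example tables (the values are copied from the question; `pad_rows`, `n_max` and the output file name are illustrative names, not part of the original answer) and then writes the combined frame to a single CSV sheet.

```r
# Hypothetical copies of the two example tables from the question
df1 <- data.frame(Month = c("Jan", "Feb"), Engagement = c(51, 123), Test = 1:2,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Variable = c("Hot", "Cold", "Warm"),
                  Engagement = c("412", "4124", "4fd4"),
                  stringsAsFactors = FALSE)

# pad_rows() is an illustrative helper: indexing past the last row yields NA rows
pad_rows <- function(d, n) d[seq_len(n), , drop = FALSE]

data <- list(df1, df2)
n_max <- max(vapply(data, nrow, integer(1)))
result <- do.call(cbind, lapply(data, pad_rows, n = n_max))
rownames(result) <- NULL

# Export to one sheet; na = "" leaves the padded cells empty (placeholder file name)
write.csv(result, file.path(tempdir(), "side_by_side.csv"), row.names = FALSE, na = "")
```

Because cbind keeps duplicate column names such as Engagement, it is probably safer to refer to columns by position in the combined frame, or to rename them before binding.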

Index both data frames, merge by the index, then drop the index:
library(dplyr) # provides %>% and select()
df1 <- read.csv("Book1.csv", header = TRUE, na.strings = "")
df2 <- read.csv("Book2.csv", header = TRUE, na.strings = "")
# Assign index to the dataframe
rownames(df1) <- 1:nrow(df1)
rownames(df2) <- 1:nrow(df2)
# Merge by index (by = 0 means "by row names"), then drop the index column:
merged <- merge(df1, df2, by = 0, all = TRUE) %>%
  select(-1)
merged
Output:
Month Engagement Test Variable Engagement
1 Jan 51 1 Hot 412
2 Feb 123 2 Cold 4124
3 <NA> NA NA Warm 4fd4
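One caveat, as far as I can tell: merging on row names stores them in a character column, so with ten or more rows the sorted result may come out in the order 1, 10, 11, 2, and so on. A minimal sketch that avoids this by merging on an explicit integer index follows; the column name `idx` and the toy data are assumptions made for illustration.

```r
# Toy data with more than nine rows, to exercise the sort order
a <- data.frame(Month = month.abb[1:11], Engagement = 1:11,
                stringsAsFactors = FALSE)
b <- data.frame(Variable = c("Hot", "Cold", "Warm"), Score = c(412, 4124, 7),
                stringsAsFactors = FALSE)

# Explicit integer index (hypothetical column name), sorted numerically by merge()
a$idx <- seq_len(nrow(a))
b$idx <- seq_len(nrow(b))

merged <- merge(a, b, by = "idx", all = TRUE)
merged$idx <- NULL  # drop the helper column again
```

If the inputs share a column name such as Engagement, merge() distinguishes them with the suffixes .x and .y.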

Related

Selecting the first non 0 value in a row [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have a large data frame of data from across months and I want to select the first number that is not NA in each row. For instance, ID 895 would correspond to the value in Feb15, 687.
ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17
454 NA 977 NA 146
It would be helpful to store them in a variable so I could perform further calculations by month.
apply(tempdat[,32:43],1, function(x) head(which(x>0),1))
This data frame contains thousands of rows, so is it possible to have all the numbers returned for each month stored in their own new variables, or in one new data frame by month?
In this case:
AggJan15 = 451
AggFeb15 = 687
AggMar15 = 0
AggApr15 = 625
The two answers below are based on different assumptions on what the question is saying.
1) In this answer we assume you want the first non-NA in each row. First find the column index of the first non-NA in each row using max.col, giving ix. Then create an output data frame whose first column is ID, whose second is the first non-NA month for that row, and whose third is the value in that month. The next line sets month to NA for any row that has no non-NA value; it is not needed if you know that every row has at least one non-NA. Note that we convert month/year to class yearmon so that they sort properly.
library(zoo)
DF1 <- DF[-1]
ix <- max.col(!is.na(DF1), "first")
out <- data.frame(ID = DF$ID,
month = as.yearmon(names(DF1)[ix], "%b%y"),
value = DF1[cbind(1:nrow(DF1), ix)])
out$month[is.na(out$value)] <- NA
## ID month value
## 1 100 Apr 2015 625
## 2 113 Jan 2015 451
## 3 895 Feb 2015 687
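For comparison, here is a base-R sketch of the same row-wise lookup without zoo. It returns the value itself rather than its position, which is where the apply() attempt in the question falls short. The data frame is retyped from the example, and `first_value` and `first_month` are illustrative names.

```r
DF <- data.frame(ID    = c(100, 113, 895),
                 Jan15 = c(NA, 451, NA),
                 Feb15 = c(NA, 586, 687),
                 Mar15 = c(NA, NA, 313),
                 Apr15 = c(625, NA, 17))

# For each row, keep the non-NA entries and take the first one
first_value <- unname(apply(DF[-1], 1, function(x) x[!is.na(x)][1]))
first_month <- unname(apply(DF[-1], 1, function(x) names(x)[!is.na(x)][1]))

out <- data.frame(ID = DF$ID, month = first_month, value = first_value,
                  stringsAsFactors = FALSE)
```

A row that is entirely NA yields NA for both month and value, so no extra clean-up line is needed in this variant.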
In a comment the poster says they want the sum by month so in that case we first sum by month giving ag and then we merge that with all months within the range to fill it out. The third line can be omitted if it is OK to have absent months filled in with NA; otherwise, use it and they will be filled with 0.
ag <- aggregate(value ~ month, out, sum)
m <- merge(ag, seq(min(ag$month), max(ag$month), 1/12), by = 1, all = TRUE)
m$value[is.na(m$value)] <- 0
## month value
## 1 Jan 2015 451
## 2 Feb 2015 687
## 3 Mar 2015 0
## 4 Apr 2015 625
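The fill-with-zero step can also be sketched in base R by treating month as a factor whose levels list every month in the range. This assumes the month labels are known up front; the `month_levels` vector and the small `out` frame below are retyped for illustration.

```r
month_levels <- c("Jan15", "Feb15", "Mar15", "Apr15")

# First non-NA month and value per row, as produced in the step above
out <- data.frame(month = factor(c("Apr15", "Jan15", "Feb15"), levels = month_levels),
                  value = c(625, 451, 687))

# tapply() leaves NA for factor levels with no rows; replace those with 0
totals <- tapply(out$value, out$month, sum)
totals[is.na(totals)] <- 0
```

Each element of totals can then be read by name, for example totals["Mar15"].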
2) Originally I thought you wanted the first non-NA in each column and this answer addresses that.
Assuming DF is as shown reproducibly in the Note at the end use na.locf specifying reverse order and take the first row.
library(zoo)
Agg <- na.locf(DF[-1], fromLast = TRUE)[1, ]
Agg
## Jan15 Feb15 Mar15 Apr15
## 1 451 586 313 625
Agg$Jan15
## [1] 451
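A base-R sketch of the same column-wise lookup, without na.locf, could use sapply over the month columns. The data frame is retyped from the Note and `agg` is an illustrative name.

```r
DF <- data.frame(ID    = c(100, 113, 895),
                 Jan15 = c(NA, 451, NA),
                 Feb15 = c(NA, 586, 687),
                 Mar15 = c(NA, NA, 313),
                 Apr15 = c(625, NA, 17))

# For each month column, drop the NAs and keep the first remaining value
agg <- sapply(DF[-1], function(x) x[!is.na(x)][1])
```

The result is a named numeric vector, so agg["Jan15"] plays the role of Agg$Jan15.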
Note
Lines <- "ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17 "
DF <- read.table(text = Lines, header = TRUE, comment.char = "-")

Extract the values from the dataframes created in a loop for further analysis (I am not sure, how to sum up the question in one line)

My raw dataset has multiple product Id, monthly sales and corresponding date arranged in a matrix format. I wish to create individual dataframes for each product_id along with the sales value and dates. For this, I am using a for loop.
base is the base dataset.
x is the variable that contains the unique product_id and the corresponding no of observation points.
for(i in 1:nrow(x)){
n <- paste("df", x$vars[i], sep = "")
assign(n, base[base[,1] == x$vars[i],])
print(n)}
This is a part of the output:
[1] "df25"
[1] "df28"
[1] "df35"
[1] "df37"
[1] "df39"
So all the dataframe names are saved in n. This, I think is a string vector.
When I write df25 outside the loop, I get the dataframe I want:
> df25
# A tibble: 49 x 3
ID date Sales
<dbl> <date> <dbl>
1 25 2014-01-01 0
2 25 2014-02-01 0
3 25 2014-03-01 0
4 25 2014-04-01 0
5 25 2014-05-01 0
6 25 2014-06-01 0
7 25 2014-07-01 0
8 25 2014-08-01 0
9 25 2014-09-01 0
10 25 2014-10-01 0
# ... with 39 more rows
Now, I want to use each of these dataframes separately to perform a forecast analysis. To do this, I need to get to the values in the individual dataframes. This is what I have tried:
for(i in 1:4) {print(paste0("df", x$vars[i]))}
[1] "df2"
[1] "df3"
[1] "df5"
[1] "df14"
But I am unable to refer to individual dataframes.
I am looking for help on how to access these dataframes and their values for further analysis. Since there are more than 200 products, I am looking for a function that can deal with all the dataframes.
First, I wish to convert it to a TS, using year and month values from the date variable and then use ets or forecast, etc.
SAMPLE DATASET:
set.seed(354)
df <- data.frame(Product_Id = rep(1:10, each = 50),
Date = seq(from = as.Date("2010/1/1"), to = as.Date("2014/2/1") , by = "month"),
Sales = rnorm(100, mean = 50, sd= 20))
df <- df[-c(251:256, 301:312) ,]
As always, any suggestion would be highly appreciated.
I think this is one way to get an access to the individual dataframes. If there is a better method, please let me know:
(Var <- get(paste0("df",x$vars[i])))
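An alternative that avoids assign() and get() altogether is to keep the per-product data frames in a named list built with split(), and then lapply() over that list. The sketch below uses a complete version of the sample data and assumes each product has one observation per month with no gaps, which is not true of the sample once rows are removed. `by_product`, `to_ts` and `series` are illustrative names.

```r
set.seed(354)
dates <- seq(from = as.Date("2010-01-01"), to = as.Date("2014-02-01"), by = "month")
df <- data.frame(Product_Id = rep(1:10, each = length(dates)),
                 Date = rep(dates, times = 10),
                 Sales = rnorm(10 * length(dates), mean = 50, sd = 20))

# One data frame per product, named by product id ("1", "2", ...)
by_product <- split(df, df$Product_Id)

# Convert one product's rows to a monthly ts (assumes contiguous months)
to_ts <- function(d) {
  d <- d[order(d$Date), ]
  first <- d$Date[1]
  ts(d$Sales,
     start = c(as.integer(format(first, "%Y")), as.integer(format(first, "%m"))),
     frequency = 12)
}

series <- lapply(by_product, to_ts)
```

An individual series is then available as series[["3"]], and a forecasting function can be applied to every product with another lapply() call.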

Programmatically Finding, Correcting IDs in Dataframes with Different Column and Row Lengths

I have two data frames of differing lengths and widths. Both contain panel data on sites across several years, with each site having a unique ID code. However, these unique ID codes were altered for some sites between data frames. For example:
Year <- c(2006,2006,2006,2006)
Name <- as.character(c("A","B","C","D.B"))
Qtr.2 <- as.numeric(c(14,32,62,40))
Code <- as.character(c(123,456,789,101))
DF1 <- data.frame(Year,Name,Qtr.2,Code,stringsAsFactors = FALSE)
Year2 <- c(2007,2007,2007,2007,2007,2007)
Name2 <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3 <- as.numeric(c(14,32,62,11,40,20))
Code2 <- as.character(c("W33","456","789","121","W133","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2 <- data.frame(Year2,Name2,Qtr.3,Code2,Type,stringsAsFactors = FALSE)
> DF1
Year Name Qtr.2 Code
1 2006 A 14 123
2 2006 B 32 456
3 2006 C 62 789
4 2006 D.B 40 101
> DF2
Year2 Name2 Qtr.3 Code2 Type
1 2007 A 14 W33 Blue
2 2007 B 32 456 Red
3 2007 C 62 789 Red
4 2007 E 11 121 Green
5 2007 D.B 40 W133 Blue
6 2007 D.A 20 W111 Red
Here, site “A's” code has changed from “123” in DF1 to “W33” in DF2.
I am having trouble programmatically finding and converting the altered ID codes to match their prior ID code. In other words, I want to match names from DF1 to DF2, and replace "Code2" in DF2 with "Code" from DF1 when a matching name is discovered. My approach thus far has involved a rather convoluted padding and for loop process. However, I feel this must be a semiregular wrangling problem and there must be a simpler approach.
Ideally, my second DF would look as follows:
Year2_fixed <- c(2007,2007,2007,2007,2007,2007)
Name2_fixed <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3_fixed <- as.numeric(c(14,32,62,11,40,20))
Code2_fixed <- as.character(c("123","456","789","121","101","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2_fixed <-data.frame(Year2_fixed,Name2_fixed,Qtr.3_fixed,Code2_fixed,Type,stringsAsFactors = FALSE)
> DF2_fixed
Year2_fixed Name2_fixed Qtr.3_fixed Code2_fixed Type
1 2007 A 14 123 Blue
2 2007 B 32 456 Red
3 2007 C 62 789 Red
4 2007 E 11 121 Green
5 2007 D.B 40 101 Blue
6 2007 D.A 20 W111 Red
I have done some looking but I haven't found a clear answer on SO that gets at this problem. It is possible I am not asking the question clearly enough in searches. Please point it out if it is out there, or let me know if I can clarify my question.
A few last points: I want to be able to perform an inner_join BY the code, preserving those observations that appear in both sets. I am providing a toy example, but, as is often the case, the true problem is too large to manually check these names.
Edit
As pointed out by others, stringsAsFactors = FALSE has been added to prevent errors.
Try using the match command:
DF2 <- within(DF2, {
ind <- match(Name2, DF1$Name)
new_code <- DF1$Code[ind]
Code_fixed <- ifelse(is.na(ind), as.character(Code2), as.character(new_code))
rm(ind, new_code)
})
DF2
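As a compact variation on the match idea, the sketch below overwrites Code2 in place instead of adding a Code_fixed column, and then performs the inner join by code that the question mentions, using base merge() here as an assumption in place of inner_join. The data is the toy example from the question.

```r
DF1 <- data.frame(Year = rep(2006, 4),
                  Name = c("A", "B", "C", "D.B"),
                  Qtr.2 = c(14, 32, 62, 40),
                  Code = c("123", "456", "789", "101"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Year2 = rep(2007, 6),
                  Name2 = c("A", "B", "C", "E", "D.B", "D.A"),
                  Qtr.3 = c(14, 32, 62, 11, 40, 20),
                  Code2 = c("W33", "456", "789", "121", "W133", "W111"),
                  Type = c("Blue", "Red", "Red", "Green", "Blue", "Red"),
                  stringsAsFactors = FALSE)

# Position of each DF2 name in DF1 (NA when the site is new)
ind <- match(DF2$Name2, DF1$Name)

# Keep the existing code for new sites, otherwise take the earlier code
DF2$Code2 <- ifelse(is.na(ind), DF2$Code2, DF1$Code[ind])

# Inner join on the repaired code: only sites present in both sets remain
joined <- merge(DF1, DF2, by.x = "Code", by.y = "Code2")
```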
A solution is to use dplyr::coalesce along with left_join to get the desired result.
library(dplyr)
DF2 %>% left_join(select(DF1, Name, Code), by=c("Name2" = "Name")) %>%
mutate(Code2 = coalesce(Code, Code2)) %>%
select(-Code)
# Year2 Name2 Qtr.3 Code2 Type
# 1 2007 A 14 123 Blue
# 2 2007 B 32 456 Red
# 3 2007 C 62 789 Red
# 4 2007 E 11 121 Green
# 5 2007 D.B 40 101 Blue
# 6 2007 D.A 20 W111 Red
Note: stringsAsFactors = FALSE has been added in OP's code to create data.frames, otherwise it would generate unnecessary warnings.
Data:
Year <- c(2006,2006,2006,2006)
Name <- as.character(c("A","B","C","D.B"))
Qtr.2 <- as.numeric(c(14,32,62,40))
Code <- as.character(c(123,456,789,101))
DF1 <- data.frame(Year,Name,Qtr.2,Code, stringsAsFactors = FALSE)
Year2 <- c(2007,2007,2007,2007,2007,2007)
Name2 <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3 <- as.numeric(c(14,32,62,11,40,20))
Code2 <- as.character(c("W33","456","789","121","W133","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2 <- data.frame(Year2,Name2,Qtr.3,Code2,Type, stringsAsFactors = FALSE)

Rbind and merge in R

So I have this big list of dataframes, and some of them have matching columns and others do not. I want to rbind the ones with matching columns and merge the others that do not have matching columns (based on variables Year, Country). However, I don't want to go through all of the dataframes by hand to see which ones have matching columns and which do not.
Now I was thinking that it would look something along the lines of this:
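library(readstata13) # provides read.dta13
myfiles <- list.files(pattern = "\\.dta$") # pattern is a regular expression, not a glob
dflist <- lapply(myfiles, read.dta13)
for (i in seq_along(dflist)) {
  # pseudocode:
  # if colnames match: put them in a list and rbindlist
  # else: put them in another list and merge
}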
Apart from not knowing how to do this in R exactly, I'm starting to think this wouldn't work after all.
To illustrate consider 6 dataframes:
Dataframe 1: Dataframe 2:
Country Sector Emp Country Sector Emp
Belg A 35 NL B 31
Aus B 12 CH D 45
Eng E 18 RU D 12
Dataframe 3: Dataframe 4:
Country Flow PE Country Flow PE
NL 6 13 ... ... ...
HU 4 11 ... ...
LU 3 21 ...
Dataframe 5: dataframe 6:
Country Year Exp Country Year Imp
GER 02 44 BE 00 34
GER 03 34 BE 01 23
GER 04 21 BE 02 41
In this case I would want rbind(dataframe 1, dataframe 2) and rbind(dataframe 3, dataframe 4), and I would like to merge dataframe 5 and 6 based on the variables Country and Year. So my output would be several rbinded/merged dataframes.
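rbind will fail if the columns are not the same. As suggested, you can use merge, or left_join from the dplyr package.
Maybe this will work for a list of more than two data frames: Reduce(left_join, dflist). (do.call(left_join, dflist) would pass the third data frame as the by argument, so it only suits a list of exactly two.)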
For data frames with the same columns you can use a union or union-all operation (for example dplyr::union() and dplyr::union_all()).
A union removes duplicate rows; if you need to keep duplicate entries, use union all.
Use union or union all for data frames 1 and 2, and again for data frames 3 and 4. For data frames 5 and 6, use
merge(x= dataframe5, y=dataframe6, by=c("Country", "Year"), all=TRUE)
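To avoid checking the column names by hand, one possible sketch is to group the list by a column-name signature, rbind within each group, and then merge whichever stacked results carry the key columns. The toy frames below stand in for data frames 1, 2, 5 and 6 from the question; `signature`, `stacked`, `keys` and `has_keys` are illustrative names, and the approach assumes matching frames have exactly the same set of column names.

```r
d1 <- data.frame(Country = c("Belg", "Aus", "Eng"), Sector = c("A", "B", "E"),
                 Emp = c(35, 12, 18), stringsAsFactors = FALSE)
d2 <- data.frame(Country = c("NL", "CH", "RU"), Sector = c("B", "D", "D"),
                 Emp = c(31, 45, 12), stringsAsFactors = FALSE)
d5 <- data.frame(Country = "GER", Year = c("02", "03", "04"),
                 Exp = c(44, 34, 21), stringsAsFactors = FALSE)
d6 <- data.frame(Country = "BE", Year = c("00", "01", "02"),
                 Imp = c(34, 23, 41), stringsAsFactors = FALSE)
dflist <- list(d1, d2, d5, d6)

# Signature = sorted column names; frames with equal signatures can be row-bound
signature <- vapply(dflist, function(d) paste(sort(names(d)), collapse = "|"),
                    character(1))
stacked <- lapply(split(dflist, signature), function(g) do.call(rbind, g))

# Merge the stacked frames that contain both key columns
keys <- c("Country", "Year")
has_keys <- vapply(stacked, function(d) all(keys %in% names(d)), logical(1))
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE),
                 stacked[has_keys])
```

Frames without the key columns stay in stacked as separate row-bound results, which matches the "several rbinded/merged dataframes" outcome described in the question.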

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, each have an ID, a date, and the score on a test. In each data frame, each person (ID) has taken the test on multiple dates. Comparing the two data frames, some people are listed in df1 but not in df2, and vice versa, while some are listed in both, and their dates can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is if any of the IDs and scores from df1 and df2 are within 7 days (I can do this with a subtracted dates column), I want to combine that row.
In essence, for every ID there will be one row with both scores written separately if taken within 7 days, and if not it will make two separate rows, one with score from df1 and one from df2 along with all the other scores that might not be listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using RDBMS sometimes...) so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1, Date2 = as.Date(NA), Score2 = NA_real_) # typed NAs so Date2 stays a Date column
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
# see if there are any rows of test1 you could join test2 to
join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
# if so, join it to the first one (see discussion)
if(length(join_rows)>0){
df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
} # if not, add a new row of just the test2
else df3 <- rbind.fill(df3,df2[j,])
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
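A loop-free sketch of the same result in base R is to merge on ID, keep only the pairs less than 7 days apart, and then append the rows from each side that did not pair up. It rests on the same at-most-one-to-one assumption as the loop. The data is retyped from the example, and `pairs`, `key`, `only1` and `only2` are illustrative names.

```r
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = as.Date(c("2014-05-12", "2014-05-01", "2014-07-03",
                                    "2014-08-05", "2014-05-22", "2014-05-30")),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = as.Date(c("2014-05-30", "2014-06-22", "2014-07-02",
                                    "2014-08-20", "2014-05-22", "2014-05-30")),
                  Score2 = c(40, 20, 10, 60, 30, 80))

# All same-ID combinations, then keep only pairs less than 7 days apart
pairs <- merge(df1, df2, by = "ID")
pairs <- pairs[abs(as.numeric(pairs$Date1 - pairs$Date2)) < 7, ]

# Rows of each input that did not take part in any pair
key <- function(id, date) paste(id, date)
only1 <- df1[!(key(df1$ID, df1$Date1) %in% key(pairs$ID, pairs$Date1)), ]
only2 <- df2[!(key(df2$ID, df2$Date2) %in% key(pairs$ID, pairs$Date2)), ]

# Typed NA placeholders for the missing side
only1$Date2 <- rep(as.Date(NA), nrow(only1)); only1$Score2 <- rep(NA_real_, nrow(only1))
only2$Date1 <- rep(as.Date(NA), nrow(only2)); only2$Score1 <- rep(NA_real_, nrow(only2))

result <- rbind(pairs, only1[names(pairs)], only2[names(pairs)])
result <- result[order(result$ID, result$Date1, result$Date2), ]
rownames(result) <- NULL
```

The merge on ID builds every same-ID combination before filtering, so memory use grows with the number of tests per person. That should be fine for data like the example but may matter for very long histories.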
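Use a full outer join with an absolute value limit on the date difference. (A full outer join B keeps all rows of A and B.) For example:
library(sqldf)
# Assumption: FULL OUTER JOIN needs SQLite 3.39.0 or later behind sqldf; a bare "outer join" is not valid SQLite syntax
sqldf("select a.*, b.* from df1 a full outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7")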
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format = "%Y%m%d"), and likewise for df2$Date2.
