So I have this big list of dataframes, and some of them have matching columns and others do not. I want to rbind the ones with matching columns and merge the others that do not have matching columns (based on variables Year, Country). However, I don't want to go through all of the dataframes by hand to see which ones have matching columns and which do not.
Now I was thinking that it would look something along the lines of this:
myfiles <- list.files(pattern = "*.dta")
dflist <- lapply(myfiles, read.dta13)   # read.dta13() is from the readstata13 package
for (i in 1:length(dflist)) {
  # if the column names of dflist[[i]] match those of an existing group:
  #   put it in that list and rbindlist() them
  # else:
  #   put it in another list and merge on Year and Country
}
Apart from not knowing how to do this in R exactly, I'm starting to think this wouldn't work after all.
To illustrate consider 6 dataframes:
Dataframe 1:            Dataframe 2:
Country Sector Emp      Country Sector Emp
Belg    A      35       NL      B      31
Aus     B      12       CH      D      45
Eng     E      18       RU      D      12

Dataframe 3:            Dataframe 4:
Country Flow PE         Country Flow PE
NL      6    13         ...     ...  ...
HU      4    11
LU      3    21

Dataframe 5:            Dataframe 6:
Country Year Exp        Country Year Imp
GER     02   44         BE      00   34
GER     03   34         BE      01   23
GER     04   21         BE      02   41
In this case I would want to rbind(dataframe 1, dataframe 2) and rbind(dataframe 3, dataframe 4), and I would like to merge dataframes 5 and 6 on the variables Country and Year. So my output would be several rbinded/merged dataframes.
rbind will fail if the columns are not the same. As suggested, you can use merge() or left_join() from the dplyr package.
Maybe this will work: Reduce(left_join, dflist). (Note that do.call(left_join, dflist) would fail for more than two data frames, because left_join() only joins two tables at a time.)
For data frames with the same columns you can use a union or union-all operation.
A union removes duplicate rows; if you need to keep duplicate entries, use a union-all (e.g. dplyr::union_all()).
For data frames 1 and 2, and for data frames 3 and 4, use a union or union-all operation. For data frames 5 and 6, use
merge(x= dataframe5, y=dataframe6, by=c("Country", "Year"), all=TRUE)
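To avoid checking each data frame by hand, you could also group the list by column-name signature, stack each group, and then merge only the stacked results that actually contain the key columns. A minimal sketch, assuming dflist from the question and data.table for rbindlist() (the variable names are illustrative):

library(data.table)   # for rbindlist()

# group the data frames by their (sorted) column names, then stack each group
sig <- sapply(dflist, function(d) paste(sort(names(d)), collapse = "|"))
stacked <- lapply(split(dflist, sig), rbindlist)

# among the stacked results, merge only those that contain Year and Country
has_keys <- sapply(stacked, function(d) all(c("Year", "Country") %in% names(d)))
merged <- Reduce(function(x, y) merge(x, y, by = c("Year", "Country"), all = TRUE),
                 stacked[has_keys])

The elements of stacked[!has_keys] remain as the plain rbinded groups that had nothing to merge on.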
I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column, which is currently in numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the Date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A, 2004). To solve this issue, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping someone could help me out with a loop or a package suggestion with which I can keep only the row with the newer/later observation within a year, whenever this occurs, also for application to larger data sets. I believe this involves a couple of conditional commands which I am having difficulty putting together currently. A loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeps the "larger" one) would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference to pick the later observation, and mutating a new column holding only the year, since you want the latest observation in each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, you guarantee that the row dropped for each unique combination of ID and year is the older one. You can reverse the arrangement to keep the oldest occurrence instead.
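For larger data sets, the same keep-the-latest logic can also be written with data.table. A sketch, assuming Data as constructed in the question (the numeric YYYYMMDD dates order correctly as plain numbers, so which.max picks the latest):

library(data.table)

DT <- as.data.table(Data)
DT[, year := substr(Date_column, 1, 4)]
# within each ID/year pair, keep the row with the largest (latest) date
DT[, .SD[which.max(Date_column)], by = .(ID_column, year)]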
I'd like to merge multiple (around ten) datasets in R. Quite a few of the datasets are different from each other, so I don't need to match them by row name or anything. I'd just like to paste them side by side in a single dataframe so I can export them into a single sheet. For instance, I have the following two datasets:
Month Engagement Test
Jan   51         1
Feb   123        2

Variable Engagement
Hot      412
Cold     4124
Warm     4fd4
I'd simply like to put them side by side (as in left and right) in a single data frame for exporting purposes, like this:
Month Engagement Test Variable Engagement
Jan   51         1    Hot      412
Feb   123        2    Cold     4124
NA    NA         NA   Warm     4fd4
Is there any way to accomplish this? It might seem like a strange request, but do let me know if I should provide any more info! Thank you so much.
Put the data frames in a list and find the maximum number of rows across the list. Then subset each data frame to that many rows; a data frame with fewer rows gets padded with NAs.
data <- list(df1, df2)
n <- seq_len(max(sapply(data, nrow)))
result <- do.call(cbind, lapply(data, `[`, n, ))
result
# Month Engagement Test Variable Engagement
#1 Jan 51 1 Hot 412
#2 Feb 123 2 Cold 4124
#NA <NA> NA NA Warm 4fd4
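If you need this for more than two data frames, the same idea generalizes to a small helper (cbind_fill is a hypothetical name, not from an existing package):

# pad every data frame to the tallest one with NA rows, then bind the columns
cbind_fill <- function(...) {
  dfs <- list(...)
  n <- max(sapply(dfs, nrow))
  do.call(cbind, lapply(dfs, function(d) d[seq_len(n), , drop = FALSE]))
}

cbind_fill(df1, df2)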
Index both data frames, then merge by the index and drop it:
library(dplyr)   # for %>% and select()

df1 <- read.csv("Book1.csv", header = TRUE, na.strings = "")
df2 <- read.csv("Book2.csv", header = TRUE, na.strings = "")
# Assign an index (row names) to each dataframe
rownames(df1) <- 1:nrow(df1)
rownames(df2) <- 1:nrow(df2)
# Merge by the index (by = 0 merges on row names), then drop the index column:
merged <- merge(df1, df2, by = 0, all = TRUE) %>%
  select(-1)
merged
Output:
Month Engagement Test Variable Engagement
1 Jan 51 1 Hot 412
2 Feb 123 2 Cold 4124
3 <NA> NA NA Warm 4fd4
I have a large dataframe A similar to the following and a second one, B, containing only lat/lon values.
What I am trying to do is to subset dataframe A based on the unique combinations of lat/lon from dataframe B.
So far, I have tried the following, but it does not work.
How should I change my code in order to effectively do this?
head(A)
vals time lon lat mo year
1 5 1978-11-01 100 32 01 1988
2 3 1978-11-02 100 45 02 1988
3 3 1978-11-03 100 45 01 1998
4 9 1978-11-04 100 50 05 1998
5 1 1978-11-05 100 60 05 1998
6 4 1978-11-06 100 32 05 1998
A_subset <- subset(A, A[, "lon"] %in% B$lon | A[, "lat"] %in% B$lat)
Consider running expand.grid on data frame B for all combinations of its unique coordinates, then merging with data frame A:
B_all_combns <- expand.grid(lon = unique(B$lon), lat = unique(B$lat))
A_subset <- merge(A, B_all_combns, by=c("lon", "lat"))
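Note that expand.grid() produces every crossing of B's unique longitudes and latitudes. If you instead want only the (lon, lat) pairs that actually occur together in B, a merge on the de-duplicated pairs should do it (a sketch):

# keep only rows of A whose exact (lon, lat) pair appears in B
A_subset <- merge(A, unique(B[, c("lon", "lat")]), by = c("lon", "lat"))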
When you use a dplyr join function like full_join, columns with identical names are duplicated and given suffixes like "col.x", "col.y", "col.x.x", etc. when they are not used to join the tables.
library(dplyr)
data1<-data.frame(
Code=c(2,1,18,5),
Country=c("Canada", "USA", "Brazil", "Iran"),
x=c(50,29,40,29))
data2<-data.frame(
Code=c(2,40,18),
Country=c("Canada","Japan","Brazil"),
y=c(22,30,94))
data3<-data.frame(
Code=c(25,14,52),
Country=c("China","Japan","Australia"),
z=c(22,30,94))
data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3))
This results in "Country", "Country.x", and "Country.y" columns.
Is there a way to combine the three columns into one, such that if a row has NA for a "Country", it takes the value from "Country.x" or "Country.y"?
I attempted a solution based on this similar question, but it gives me a warning and returns only values from the top three rows.
data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3)) %>%
mutate(Country=coalesce(Country.x,Country.y,Country)) %>%
select(-Country.x, -Country.y)
This returns the warning "invalid factor level, NA generated".
Any ideas?
You could use my package safejoin to make a full join and deal with the conflicts using dplyr::coalesce.
First we'll have to rename the value columns of the tables so they share the same name.
library(dplyr)
data1 <- rename_at(data1,3, ~"value")
data2 <- rename_at(data2,3, ~"value")
data3 <- rename_at(data3,3, ~"value")
Then we can join
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
data1 %>%
safe_full_join(data2, by = c("Code","Country"), conflict = coalesce) %>%
safe_full_join(data3, by = c("Code","Country"), conflict = coalesce)
# Code Country value
# 1 2 Canada 50
# 2 1 USA 29
# 3 18 Brazil 40
# 4 5 Iran 29
# 5 40 Japan 30
# 6 25 China 22
# 7 14 Japan 30
# 8 52 Australia 94
You get some warnings because you're joining factor columns with different levels; add the parameter check = "" to silence them.
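Alternatively, staying in plain dplyr: the "invalid factor level" warning in the question's own attempt comes from coalescing factors with different levels, so converting the columns to character first should work (a sketch of that fix):

library(dplyr)

data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3)) %>%
  mutate(Country = coalesce(as.character(Country.x),
                            as.character(Country.y),
                            as.character(Country))) %>%
  select(-Country.x, -Country.y)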
Okay, so I have two different data frames (df1 and df2) which, to simplify it, have an ID, a date, and the score on a test. In each data frame each person (ID) has taken the test on multiple dates. Between the two data frames, some of the people are listed in df1 but not in df2, and vice versa, but some are listed in both, and their test dates can overlap differently.
I want to combine all the data into one frame, but the tricky part is that if an ID's scores from df1 and df2 fall within 7 days of each other (I can check this with a subtracted-dates column), I want to combine those rows into one.
In essence, for every ID there will be one row with both scores written separately if the tests were taken within 7 days of each other; if not, there will be two separate rows, one with the score from df1 and one with the score from df2, along with all the other scores that are not listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1  20140512        50
1  20140501        30
1  20140703        50     20140702        10
1  20140805        20
1                         20140530        40
1                         20140622        20
1                         20140820        60
3  20140522        70
3  20140530        10
2                         20140522        30
2                         20140530        80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using RDBMS sometimes...) so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1,Date2=NA, Score2=NA)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
# see if there are any rows of test1 you could join test2 to
join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
# if so, join it to the first one (see discussion)
if(length(join_rows)>0){
df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
} # if not, add a new row of just the test2
else df3 <- rbind.fill(df3,df2[j,])
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
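For what it's worth, the fuzzyjoin package covers this kind of inexact matching directly. A minimal sketch, assuming Date1 and Date2 have already been converted to Date class as above (each entry of match_fun pairs up with the corresponding column pair in by):

library(fuzzyjoin)

fuzzy_full_join(
  df1, df2,
  by = c("ID" = "ID", "Date1" = "Date2"),
  match_fun = list(`==`, function(x, y) abs(as.numeric(x - y)) < 7)
)

Unmatched rows come back with NAs, much like the workaround above; note the ID column is returned as ID.x and ID.y and would need to be coalesced afterwards.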
Use an outer join with an absolute-value limit on the date difference. (A outer join B keeps all rows of both A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format = "%Y%m%d"), etc.