R: Combine duplicate columns after dplyr join

When you use a dplyr join function like full_join, columns with identical names are duplicated and given suffixes like "col.x", "col.y", "col.x.x", etc. when they are not used to join the tables.
library(dplyr)
data1 <- data.frame(
  Code = c(2, 1, 18, 5),
  Country = c("Canada", "USA", "Brazil", "Iran"),
  x = c(50, 29, 40, 29))
data2 <- data.frame(
  Code = c(2, 40, 18),
  Country = c("Canada", "Japan", "Brazil"),
  y = c(22, 30, 94))
data3 <- data.frame(
  Code = c(25, 14, 52),
  Country = c("China", "Japan", "Australia"),
  z = c(22, 30, 94))
data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3))
This results in "Country", "Country.x", and "Country.y" columns.
Is there a way to combine the three columns into one, such that if a row has NA for a "Country", it takes the value from "Country.x" or "Country.y"?
I attempted a solution based on this similar question, but it gives me a warning and returns only values from the top three rows.
data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3)) %>%
  mutate(Country = coalesce(Country.x, Country.y, Country)) %>%
  select(-Country.x, -Country.y)
This returns the warning "invalid factor level, NA generated".
Any ideas?

You could use my package safejoin, make a full join and deal with the conflicts using dplyr::coalesce.
First we have to rename the value columns so that they have the same name in every table.
library(dplyr)
data1 <- rename_at(data1,3, ~"value")
data2 <- rename_at(data2,3, ~"value")
data3 <- rename_at(data3,3, ~"value")
Then we can join
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
data1 %>%
  safe_full_join(data2, by = c("Code","Country"), conflict = coalesce) %>%
  safe_full_join(data3, by = c("Code","Country"), conflict = coalesce)
# Code Country value
# 1 2 Canada 50
# 2 1 USA 29
# 3 18 Brazil 40
# 4 5 Iran 29
# 5 40 Japan 30
# 6 25 China 22
# 7 14 Japan 30
# 8 52 Australia 94
You get some warnings because you're joining factor columns with different levels; add the parameter check="" to suppress them.
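If you would rather not add a dependency, the coalesce() attempt from the question also works once the Country columns are character vectors instead of factors. A minimal sketch, assuming the three tables are rebuilt with stringsAsFactors = FALSE (which is the default since R 4.0.0):

```r
library(dplyr)

# Assumption: same tables as in the question, but with Country kept as character
data1 <- data.frame(Code = c(2, 1, 18, 5),
                    Country = c("Canada", "USA", "Brazil", "Iran"),
                    x = c(50, 29, 40, 29), stringsAsFactors = FALSE)
data2 <- data.frame(Code = c(2, 40, 18),
                    Country = c("Canada", "Japan", "Brazil"),
                    y = c(22, 30, 94), stringsAsFactors = FALSE)
data3 <- data.frame(Code = c(25, 14, 52),
                    Country = c("China", "Japan", "Australia"),
                    z = c(22, 30, 94), stringsAsFactors = FALSE)

# Join on Code only, then take the first non-missing Country per row
data4 <- Reduce(function(...) full_join(..., by = "Code"),
                list(data1, data2, data3)) %>%
  mutate(Country = coalesce(Country.x, Country.y, Country)) %>%
  select(-Country.x, -Country.y)

print(data4)
```

With character columns there are no factor levels to reconcile, so coalesce() should no longer produce the "invalid factor level" warning.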

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on stack overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got the output that I didn't expect.
This time, I want to do something slightly different from my previous question:
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May 2022. If more than one date falls in a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, I want to keep that row and leave the value as NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into the year-month-week format, then join the two tables:
library(dplyr); library(lubridate)
df2$dt = ymd(df2$date)
# Week of the month, counting days 1-7 as week 1, days 8-14 as week 2, and so on
df2$wk = (day(df2$dt) - 1) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
  left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
              select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
You can build on a previous answer by taking its function for counting the week of the month and using it to generate a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
  df2 %>%
  mutate(
    date = lubridate::parse_date_time(
      x = date,
      orders = "%Y%m%d"),
    # format() zero-pads the month, so this also works for October to December
    year_month_week = paste0(
      format(date, "%Y%m"),
      monthweeks.Date(date)),
    year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
  df2 %>%
  arrange(year_month_week) %>%
  distinct(year_month_week, .keep_all = TRUE)
# Join dataframes
df1 <-
  left_join(
    df1,
    df2,
    by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
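Week-of-month arithmetic is easy to get off by one, so it can help to check a formula on boundary days before building the join key. A small base R sketch, assuming that week 1 means days 1 to 7 (the question does not define this explicitly); the variable names are only illustrative:

```r
days <- c(1, 6, 7, 8, 14, 15)

week_ceiling <- ceiling(days / 7)      # 1 1 1 2 2 3
week_shifted <- (days - 1) %/% 7 + 1   # 1 1 1 2 2 3
week_intdiv  <- days %/% 7 + 1         # 1 1 2 2 3 3 (days 7 and 14 move up a week)

# Building the key with format() keeps the month zero-padded
d <- as.Date(c("20220503", "20221012"), format = "%Y%m%d")
key <- as.numeric(paste0(format(d, "%Y%m"),
                         ceiling(as.numeric(format(d, "%d")) / 7)))
print(key)  # 2022051 2022102
```

The first two formulas agree on every day of the month; the third differs only on days that are exact multiples of 7.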

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of the columns and create a new dataframe with a column representing the name of the district and another one representing the total number. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
x<-data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
c_tot<-rbind(c_tot, x)
next
}
print(c_tot)
This works properly for the district names, but for every district it returns only the number of cases of the 8th district. If you have any suggestions, even involving other functions, that would be great. Thank you.
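A likely explanation for the repeated value: covid_bln$n does not use the loop variable n. The $ operator looks for a column literally called "n" and, for data frames, falls back to partial matching, which here would hit neukoelln (the 8th district), assuming it is the only column whose name starts with "n". Using [[n]] looks up the column named by the value of n instead. A minimal sketch on a made-up stand-in for covid_bln (the toy data frame and its numbers are invented for illustration):

```r
# Toy stand-in for covid_bln; the values are made up
toy <- data.frame(
  id = 1:3,
  mitte = c(1, 2, 3),
  neukoelln = c(10, 20, 30),
  pankow = c(100, 200, 300)
)

# toy$n partially matches "neukoelln", whatever n holds in the loop
wrong <- suppressWarnings(sum(toy$n))

c_tot <- data.frame(district = character(), number = numeric())
for (n in colnames(toy)[2:4]) {
  # toy[[n]] selects the column whose name is stored in n
  x <- data.frame(district = n, number = sum(toy[[n]]))
  c_tot <- rbind(c_tot, x)
}
print(c_tot)
```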
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using the tidyverse.
The final result is ordered alphabetically by district.
c_tot <- covid_bln %>%
  select(mitte:reinickendorf) %>%
  gather(district, number, mitte:reinickendorf) %>%
  group_by(district) %>%
  summarise(number = sum(number))
The result is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788

How to extract same column twice from r dataframe

I have a dataframe with 10 columns and have created a plot which allows the user to input four different values. I use the select() function to extract the columns for the plot. It works well in shiny; however, when the user selects the same input value twice, the column is not selected twice. An example is given below.
names <- c("kamal", "vimal", "shamal")
age <- c(45,23,35)
weight <- c(50,34,42)
data <- data.frame(names, age, weight)
data_select <- data %>% select(names,age,names)
print(data_select)
Please guide me to extract same column twice, if selected by the user.
You could use [ from base R:
data[c("names", "age", "names")]
names age names.1
1 kamal 45 kamal
2 vimal 23 vimal
3 shamal 35 shamal
Or data.table:
library(data.table)
data.table(data)[, .(names, age, names)]
names age names
1: kamal 45 kamal
2: vimal 23 vimal
3: shamal 35 shamal
Using dplyr you could do:
# Note: sapply() simplifies to a character matrix here, so age ends up as text
data_select <- data %>%
  {sapply(c("names", "age", "names"), function(x) pull(., x))} %>%
  as.data.frame()
# names age names
# 1 kamal 45 kamal
# 2 vimal 23 vimal
# 3 shamal 35 shamal
dplyr::select does not allow selecting the same column more than once. You could use base R subsetting here instead.
cols_to_select <- c('names','age','names')
data[cols_to_select]
# names age names.1
#1 kamal 45 kamal
#2 vimal 23 vimal
#3 shamal 35 shamal
Note, however, that this makes the repeated column name unique by appending a suffix (names.1).
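If the repeated column really has to keep its original name, one option is to reset the names after subsetting. A minimal base R sketch using the example data from the question; data_select is just an illustrative name, and duplicate column names can confuse later code that refers to columns by name:

```r
names <- c("kamal", "vimal", "shamal")
age <- c(45, 23, 35)
weight <- c(50, 34, 42)
data <- data.frame(names, age, weight)

cols_to_select <- c("names", "age", "names")

# Subsetting makes the names unique (names.1); setNames() puts the requested names back
data_select <- setNames(data[cols_to_select], cols_to_select)

names(data_select)  # "names" "age" "names"
```

Unlike an approach that goes through a matrix, this keeps age numeric.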

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame into the lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
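Well, it seems that you are trying to do several operations in one question: combining the date columns, melting the data, renaming some columns and sorting.
This gives output close to what you expect. Note that unite() uses "_" as its default separator, so the dates read 2016_07; pass sep = "-" to get 2016-07.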
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
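A base R version is indeed possible. A minimal sketch, assuming df holds the four columns from the question with month stored as a character string (the name long is only illustrative):

```r
df <- data.frame(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)

long <- data.frame(
  # each year-month appears twice: once for income, once for expense
  month = rep(paste(df$year, df$month, sep = "-"), each = 2),
  # rbind() stacks income over negated expense; as.vector() reads the matrix
  # column by column, which interleaves the two values for each month
  income_expense = as.vector(rbind(df$income, -df$expense))
)
print(long)
```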
Here's a solution using only two packages, dplyr and tidyr.
First, your dataset:
# tibble() replaces the deprecated data_frame()
df <- dplyr::tibble(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
  dplyr::mutate(
    month = paste0(year, "-", month)
  ) %>%
  tidyr::gather(
    key = direction, # your name for the new column containing classification 'key'
    value = income_expense, # your name for the new column containing values
    income:expense # which columns you're acting on
  ) %>%
  dplyr::mutate(income_expense =
    ifelse(direction == 'expense', -income_expense, income_expense)
  )
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
  dplyr::select(-year, -direction) %>%
  dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
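gather() has since been superseded in tidyr by pivot_longer(). A sketch of the same reshaping with the newer function, assuming a tidyr version that provides pivot_longer() (1.0.0 or later); the name long is only illustrative:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)

long <- df %>%
  # build the year-month label and flip the sign of expenses up front
  mutate(month = paste0(year, "-", month), expense = -expense) %>%
  select(-year) %>%
  # pivot_longer() emits the values row by row, so income and expense
  # for the same month stay next to each other
  pivot_longer(c(income, expense),
               names_to = "direction",
               values_to = "income_expense") %>%
  select(-direction)

print(long)
```

Because the rows come out grouped by month already, no separate arrange() step is needed for this input.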

Rbind and merge in R

So I have this big list of dataframes, and some of them have matching columns and others do not. I want to rbind the ones with matching columns and merge the others that do not have matching columns (based on variables Year, Country). However, I don't want to go through all of the dataframes by hand to see which ones have matching columns and which do not.
Now I was thinking that it would look something along the lines of this:
myfiles = list.files(pattern="*.dta")
dflist <- lapply(myfiles, read.dta13)
for (i in 1:length(dflist)){
if colnames match
put them in list and rbindlist.
else put them in another list and merge.
}
Apart from not knowing how to do this in R exactly, I'm starting to think this wouldn't work after all.
To illustrate consider 6 dataframes:
Dataframe 1: Dataframe 2:
Country Sector Emp Country Sector Emp
Belg A 35 NL B 31
Aus B 12 CH D 45
Eng E 18 RU D 12
Dataframe 3: Dataframe 4:
Country Flow PE Country Flow PE
NL 6 13 ... ... ...
HU 4 11 ... ...
LU 3 21 ...
Dataframe 5: dataframe 6:
Country Year Exp Country Year Imp
GER 02 44 BE 00 34
GER 03 34 BE 01 23
GER 04 21 BE 02 41
In this case I would want to rbind(dataframe 1, dataframe 2) and rbind(dataframe 3, dataframe 4), and I would like to merge dataframe 5 and 6 based on the variables Country and Year. So my output would be several rbinded/merged dataframes.
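rbind will fail if the columns are not the same. As suggested, you can use merge or left_join from the dplyr package.
Note that do.call(left_join, dflist) only works when the list holds exactly two data frames; for a longer list, fold the join over it with Reduce(left_join, dflist).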
For data frames with the same columns, you could use a union or union-all operation (for example dplyr::union() and dplyr::union_all()).
union removes duplicate rows; if you need to keep duplicate entries, use union_all.
For data frames 1 and 2, and for data frames 3 and 4, use union or union_all. For data frames 5 and 6, use
merge(x= dataframe5, y=dataframe6, by=c("Country", "Year"), all=TRUE)
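To avoid checking the column names by hand, the data frames can be grouped by their set of column names first. A rough base R sketch on toy data; the data frames below stand in for the list read from the .dta files, and signature, stacked and merged are names invented for this illustration:

```r
# Toy stand-ins for the data frames in dflist
dflist <- list(
  data.frame(Country = c("Belg", "Aus"), Sector = c("A", "B"), Emp = c(35, 12)),
  data.frame(Country = c("NL", "CH"), Sector = c("B", "D"), Emp = c(31, 45)),
  data.frame(Country = "GER", Year = 2:3, Exp = c(44, 34)),
  data.frame(Country = "GER", Year = 2:3, Imp = c(40, 30))
)

# One signature per data frame, built from its sorted column names
signature <- vapply(dflist,
                    function(d) paste(sort(names(d)), collapse = "|"),
                    character(1))

# rbind every group of data frames that share a signature
stacked <- lapply(split(dflist, signature),
                  function(group) do.call(rbind, group))

# Merge the stacked tables that contain both join keys
has_keys <- vapply(stacked,
                   function(d) all(c("Country", "Year") %in% names(d)),
                   logical(1))
merged <- Reduce(function(x, y) merge(x, y, by = c("Country", "Year"), all = TRUE),
                 stacked[has_keys])

print(stacked[!has_keys])
print(merged)
```

This treats two data frames as matching only when their column names are identical; whether near-matches should also be stacked depends on the data.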

Resources