Create a new column from conditions - R

I have a data frame with information on some countries and states, like this:
data.frame("state1"= c(NA,NA,"Beijing","Beijing","Schleswig-Holstein","Moskva",NA,"Moskva",NA,"Berlin"),
"country1"=c("Spain","Spain","China","China","Germany","Russia","Germany","Russia","Germany","Germany"),
"state2"= c(NA,NA,"Beijing",NA,NA,NA,"Moskva",NA,NA,NA),
"country2"=c("Germany","Germany","China","Germany","","Ukraine","Russia","Germany","Ukraine","" ),
"state3"= c(NA,NA,NA,NA,"Schleswig-Holstein",NA,NA,NA,NA,"Berlin"),
"country3"=c("Spain","Spain","Germany","Germany","Germany","Germany","Germany","Germany","Germany","Germany"))
Now I would like to create a new column containing the German state (the result should look like below): when at least one of the three state variables is a German state, assign it to the new variable.
data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))
Please help a beginner with setting up the condition.
Thank you in advance!

Using dplyr::mutate() with case_when() works, although I suspect there is a more efficient way using across().
library(dplyr)
df %>%
mutate(GE_state = case_when(country1 == "Germany" & !is.na(state1) ~ state1,
country2 == "Germany" & !is.na(state2) ~ state2,
country3 == "Germany" & !is.na(state3) ~ state3,
TRUE ~ NA_character_))
#> state1 country1 state2 country2 state3 country3
#> 1 <NA> Spain <NA> Germany <NA> Spain
#> 2 <NA> Spain <NA> Germany <NA> Spain
#> 3 Beijing China Beijing China <NA> Germany
#> 4 Beijing China <NA> Germany <NA> Germany
#> 5 Schleswig-Holstein Germany <NA> Schleswig-Holstein Germany
#> 6 Moskva Russia <NA> Ukraine <NA> Germany
#> 7 <NA> Germany Moskva Russia <NA> Germany
#> 8 Moskva Russia <NA> Germany <NA> Germany
#> 9 <NA> Germany <NA> Ukraine <NA> Germany
#> 10 Berlin Germany <NA> Berlin Germany
#> GE_state
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 Schleswig-Holstein
#> 6 <NA>
#> 7 <NA>
#> 8 <NA>
#> 9 <NA>
#> 10 Berlin
Created on 2021-03-31 by the reprex package (v1.0.0)
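Not a replacement for the approach above, just a sketch of the more generic idea it hints at, using rowwise() with c_across() rather than across(). It assumes the columns keep the stateN/countryN naming and are plain character vectors (R >= 4.0), and it scans the pairs left to right for the first German state:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(GE_state = {
    states <- c_across(starts_with("state"))       # state1, state2, state3
    countries <- c_across(starts_with("country"))  # country1, country2, country3
    hit <- which(countries == "Germany" & !is.na(states))
    if (length(hit) > 0) states[hit[1]] else NA_character_
  }) %>%
  ungroup()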

I think you want cbind() here:
df1 <- cbind(df1, df2)
Data:
df1 <- <your first data frame>
df2 <- data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))

Related

Drop lines in a long dataset by group, based on some condition

I have this df:
library(lubridate)
Date <- c("2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
"2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
"2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04")
Date <- as_date(Date)
Country <- c("USA", "USA", "USA", "USA",
"Mexico", "Mexico", "Mexico", "Mexico",
"Japan", "Japan", "Japan","Japan")
Value_A <- c(0,40,0,0,25,29,34,0,20,25,27,0)
df<- data.frame(Date, Country, Value_A)
view(df)
Date Country Value_A
<date> <chr> <dbl>
1 2020-10-01 USA 0
2 2020-10-02 USA 40
3 2020-10-03 USA 0
4 2020-10-04 USA 0
5 2020-10-01 Mexico 25
6 2020-10-02 Mexico 29
7 2020-10-03 Mexico 34
8 2020-10-04 Mexico 0
9 2020-10-01 Japan 20
10 2020-10-02 Japan 25
11 2020-10-03 Japan 27
12 2020-10-04 Japan 0
I'm trying to drop the rows containing zeros, but only if these zeros are in the last two rows of each group of the Country column. So the result would be:
Date Country Value_A
<date> <chr> <dbl>
1 2020-10-01 USA 0
2 2020-10-02 USA 40
5 2020-10-01 Mexico 25
6 2020-10-02 Mexico 29
7 2020-10-03 Mexico 34
9 2020-10-01 Japan 20
10 2020-10-02 Japan 25
11 2020-10-03 Japan 27
I appreciate it if someone can help :)
We can use the tidyverse to get the result with a few manipulations: group_by() Country, sort descending by Date, generate row numbers, and then filter on the condition you described:
library(tidyverse)
df %>%
group_by(Country) %>%
arrange(desc(Date)) %>%
mutate(rn = row_number()) %>%
filter(!(Value_A == 0 & rn <= 2))
# Date Country Value_A rn
# 1 2020-10-03 Mexico 34 2
# 2 2020-10-03 Japan 27 2
# 3 2020-10-02 USA 40 3
# 4 2020-10-02 Mexico 29 3
# 5 2020-10-02 Japan 25 3
# 6 2020-10-01 USA 0 4
# 7 2020-10-01 Mexico 25 4
# 8 2020-10-01 Japan 20 4
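Note that the result above is sorted by descending Date and still carries the helper column rn; if the original row order of the expected output is wanted, the same pipe can be extended like this (optional sketch):
df %>%
  mutate(row_id = row_number()) %>%  # remember the original row order
  group_by(Country) %>%
  arrange(desc(Date)) %>%
  mutate(rn = row_number()) %>%
  filter(!(Value_A == 0 & rn <= 2)) %>%
  ungroup() %>%
  arrange(row_id) %>%                # restore the original order
  select(-rn, -row_id)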
Another method would be to use rank(desc(Date))
library(tidyverse)
df %>%
group_by(Country) %>%
mutate(rank_date = rank(desc(Date))) %>%
filter(!(rank_date <= 2 & Value_A == 0))
# Date Country Value_A rank_date
# 1 2020-10-01 USA 0 4
# 2 2020-10-02 USA 40 3
# 3 2020-10-01 Mexico 25 4
# 4 2020-10-02 Mexico 29 3
# 5 2020-10-03 Mexico 34 2
# 6 2020-10-01 Japan 20 4
# 7 2020-10-02 Japan 25 3
# 8 2020-10-03 Japan 27 2
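Either approach can also be written without a helper column: once the rows are date-ordered within each group, the last two rows are simply those with row_number() > n() - 2. A minimal sketch (not from the answers above):
library(dplyr)
df %>%
  group_by(Country) %>%
  arrange(Date, .by_group = TRUE) %>%
  filter(!(row_number() > n() - 2 & Value_A == 0)) %>%
  ungroup()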

Make time-period observations into annual observations in R

I have a dataset (df1) on hundreds of national crises, where each observation is a crisis event at the country level with a start and an end date. I also have the date when the crisis was announced (yyyy-mm-dd format), and a bunch of other crisis characteristics.
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
eventID country start end announcement x1 x2
1 ALB 1994 1996 1994-11-01 6 a
2 ALB 1998 1999 1998-03-01 2 q
3 ARG 1998 1999 1998-07-01 8 k
4 ARG 1991 1993 1992-01-01 7 b
I need to make df2, a panel of countries with annual observations from the earliest "start" year to the latest "end" year. I want to have a dummy variable, "crisis", that equals 1 for the years between "start" and "end" in df1, and 0 otherwise. I want "announcement" to contain the announcement date in df1 for the year with an announcement, and "NA" otherwise. I would like the extra crisis characteristics, x1 and x2, to show up for crisis years to which they correspond, and "NA" otherwise.
I also need observations for each country for years in which no country has a crisis (in df2: 1997).
df2 <- data.frame(cbind(year=c(1991,1992,1993,1994,1995,1996,1997,1998,1999,1991,1992,1993,1994,1995,1996,1997,1998,1999), country=c("ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG"), crisis=c(0,0,0,1,1,1,0,1,1,1,1,1,0,0,0,0,1,1), announcement=c(NA,NA,NA,"1994-11-01",NA,NA,NA,"1998-03-01",NA,NA,"1992-01-01",NA,NA,NA,NA,NA,"1998-07-01",NA), x1=c(NA,NA,NA,6,6,6,NA,2,2,8,8,8,NA,NA,NA,NA,7,7), x2=c(NA,NA,NA,"a","a","a",NA,"q","q","k","k","k",NA,NA,NA,NA,"b","b")))
year country crisis announcement x1 x2
1991 ALB 0 NA NA NA
1992 ALB 0 NA NA NA
1993 ALB 0 NA NA NA
1994 ALB 1 1994-11-01 6 a
1995 ALB 1 NA 6 a
1996 ALB 1 NA 6 a
1997 ALB 0 NA NA NA
1998 ALB 1 1998-03-01 2 q
1999 ALB 1 NA 2 q
1991 ARG 1 NA 8 k
1992 ARG 1 1992-01-01 8 k
1993 ARG 1 NA 8 k
1994 ARG 0 NA NA NA
1995 ARG 0 NA NA NA
1996 ARG 0 NA NA NA
1997 ARG 0 NA NA NA
1998 ARG 1 1998-07-01 7 b
1999 ARG 1 NA 7 b
I would love any suggestions! I'm stumped as to how to replicate the observations for each year while only including the x1 and x2 values when my new "crisis" dummy = 1.
Thanks!
Making use of dplyr and tidyr this could be achieved like so:
library(dplyr)
library(tidyr)
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
df1 %>%
mutate(year = factor(start, levels = min(start):max(end))) %>%
complete(year, country) %>%
mutate(year = as.numeric(as.character(year))) %>%
arrange(country, year) %>%
group_by(country) %>%
fill(eventID, end, x1, x2) %>%
ungroup() %>%
mutate(across(c(eventID, end, x1, x2), ~ ifelse(end < year, NA, .)),
crisis = as.numeric(!is.na(eventID)))
#> # A tibble: 18 x 9
#> year country eventID start end announcement x1 x2 crisis
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1991 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 2 1992 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 3 1993 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 4 1994 ALB 1 1994 1996 1994-11-01 6 a 1
#> 5 1995 ALB 1 <NA> 1996 <NA> 6 a 1
#> 6 1996 ALB 1 <NA> 1996 <NA> 6 a 1
#> 7 1997 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 8 1998 ALB 2 1998 1999 1998-03-01 2 q 1
#> 9 1999 ALB 2 <NA> 1999 <NA> 2 q 1
#> 10 1991 ARG 4 1991 1993 1992-01-01 7 b 1
#> 11 1992 ARG 4 <NA> 1993 <NA> 7 b 1
#> 12 1993 ARG 4 <NA> 1993 <NA> 7 b 1
#> 13 1994 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 14 1995 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 15 1996 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 16 1997 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 17 1998 ARG 3 1998 1999 1998-07-01 8 k 1
#> 18 1999 ARG 3 <NA> 1999 <NA> 8 k 1
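A rough alternative sketch of the same expansion, building the year-by-country grid explicitly and joining each crisis event onto every year it spans. Note that df1 is all character because of data.frame(cbind(...)), so start/end are converted to numeric first; the object names below (df1_num, grid, crisis_years, panel) are introduced here for illustration:
library(dplyr)
library(tidyr)

df1_num <- df1 %>% mutate(across(c(start, end), as.numeric))

# one row per country and year over the full range, crisis or not
grid <- expand_grid(country = unique(df1_num$country),
                    year = min(df1_num$start):max(df1_num$end))

# expand each crisis event to one row per year it covers;
# keep the announcement only in the year it falls in
crisis_years <- df1_num %>%
  mutate(year = purrr::map2(start, end, seq)) %>%
  unnest(year) %>%
  mutate(crisis = 1,
         announcement = if_else(substr(announcement, 1, 4) == as.character(year),
                                announcement, NA_character_))

panel <- grid %>%
  left_join(crisis_years, by = c("country", "year")) %>%
  mutate(crisis = replace_na(crisis, 0)) %>%
  select(year, country, crisis, announcement, x1, x2) %>%
  arrange(country, year)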

How to find rolling top 3 values in a column by group?

A data frame has 3 columns
-----------------------------------------
| Id | Country | Date |
-----------------------------------------
The 3 columns record the travel history of the person.
Three more columns need to be created, representing the rolling top 3 countries that this person (ID) has travelled to most often before the date on that row.
(If two countries are tied, the most recently travelled country takes precedence.)
mydata <- data.frame(ID = c('A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2'),
Country = c('Japan', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Japan', 'France', 'UK', 'Spain', 'Spain', 'UK', 'UK', 'Brazil'),
Date = as.Date(c('2010/01/02', '2010/04/18', '2011/03/22', '2011/11/23', '2012/05/09', '2012/09/11', '2014/01/06', '2015/12/11', '2010/04/03', '2010/05/11', '2011/05/01', '2012/03/01', '2013/01/03', '2014/01/04')))
# final data should look like below
#ID Country Date Pref1 Pref2 Pref3
#A1B1 Japan 2010-01-02 NA NA NA
#A1B1 USA 2010-04-18 Japan NA NA
#A1B1 USA 2011-03-22 USA Japan NA
#A1B1 USA 2011-11-23 USA Japan NA
#A1B1 Germany 2012-05-09 USA Japan NA
#A1B1 Germany 2012-09-11 USA Germany Japan
#A1B1 Japan 2014-01-06 USA Germany Japan
#A1B1 France 2015-12-11 USA Japan Germany
#A2B2 UK 2010-04-03 NA NA NA
#A2B2 Spain 2010-05-11 UK NA NA
#A2B2 Spain 2011-05-01 Spain UK NA
#A2B2 UK 2012-03-01 Spain UK NA
#A2B2 UK 2013-01-03 UK Spain NA
#A2B2 Brazil 2014-01-04 UK Spain NA
Q. How to create the last 3 columns for rolling top 3 countries in counts by ID?
Here is a way that, at each row, takes the top 3 countries by count of previous visits for each ID.
library(dplyr)
mydata %>%
group_by(ID) %>%
mutate(data = purrr::map(row_number(), ~{
un_country <- Country[seq_len(.x - 1)]
if(.x == 1) un_country <- NA
else un_country <- names(sort(table(un_country), decreasing = TRUE))[1:3]
data.frame(t(un_country[1:3]))
})) %>%
tidyr::unnest_wider(data)
# ID Country Date X1 X2 X3
# <chr> <chr> <date> <chr> <chr> <chr>
# 1 A1B1 Japan 2010-01-02 NA NA NA
# 2 A1B1 USA 2010-04-18 Japan NA NA
# 3 A1B1 USA 2011-03-22 Japan USA NA
# 4 A1B1 USA 2011-11-23 USA Japan NA
# 5 A1B1 Germany 2011-05-09 USA Japan NA
# 6 A1B1 Germany 2012-09-11 USA Germany Japan
# 7 A1B1 Japan 2014-01-06 USA Germany Japan
# 8 A1B1 France 2015-12-11 USA Germany Japan
# 9 A2B2 UK 2010-04-03 NA NA NA
#10 A2B2 Spain 2010-05-11 UK NA NA
#11 A2B2 Spain 2011-05-01 Spain UK NA
#12 A2B2 UK 2012-03-01 Spain UK NA
#13 A2B2 UK 2013-01-03 Spain UK NA
#14 A2B2 Brazil 2014-01-04 UK Spain NA
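The sketch below is a variation of the same map() idea that also applies the tie-breaking rule from the question (equal counts go to the most recently visited country). It assumes character columns (R >= 4.0) and has not been checked against every edge case:
library(dplyr)
library(purrr)
library(tidyr)

mydata %>%
  group_by(ID) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(prefs = map(row_number(), function(i) {
    if (i == 1) {
      return(tibble(Pref1 = NA_character_, Pref2 = NA_character_, Pref3 = NA_character_))
    }
    # visits strictly before the current row
    hist <- tibble(Country = Country[seq_len(i - 1)], Date = Date[seq_len(i - 1)])
    top <- hist %>%
      group_by(Country) %>%
      summarise(n = n(), last = max(Date), .groups = "drop") %>%
      arrange(desc(n), desc(last)) %>%  # count first, then most recent visit
      pull(Country)
    tibble(Pref1 = top[1], Pref2 = top[2], Pref3 = top[3])
  })) %>%
  unnest(prefs) %>%
  ungroup()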
I think this does it. I've included the mydata here as I think there was a typo in one of the dates.
mydata <- data.frame(ID = c('A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2'),
Country = c('Japan', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Japan', 'France', 'UK', 'Spain', 'Spain', 'UK', 'UK', 'Brazil'),
Date = as.Date(c('2010/01/02', '2010/04/18', '2011/03/22', '2011/11/23', '2012/05/09', '2012/09/11', '2014/01/06', '2015/12/11', '2010/04/03', '2010/05/11', '2011/05/01', '2012/03/01', '2013/01/03', '2014/01/04')))
library(data.table)
setDT(mydata)
mydata[order(Date), `:=`(num_v = seq_len(.N), last_v = Date), .(ID, Country)]
x <- mydata[
mydata[, CJ(Country = unique(Country), Date = unique(Date)), ID],
on=c('ID', 'Country', 'Date'), roll=Inf]
x[, `:=`(num_v = shift(num_v), last_v = shift(last_v)), .(ID, Country)]
x[is.na(num_v), Country := NA]
y <- x[,
.SD[order(-num_v, -last_v)][1:3, .(Pref = paste0('Pref',1:3), Country)],
.(ID, Date)]
dcast(y, ID+Date~Pref, value.var = 'Country')
#> ID Date Pref1 Pref2 Pref3
#> 1: A1B1 2010-01-02 <NA> <NA> <NA>
#> 2: A1B1 2010-04-18 Japan <NA> <NA>
#> 3: A1B1 2011-03-22 USA Japan <NA>
#> 4: A1B1 2011-11-23 USA Japan <NA>
#> 5: A1B1 2012-05-09 USA Japan <NA>
#> 6: A1B1 2012-09-11 USA Germany Japan
#> 7: A1B1 2014-01-06 USA Germany Japan
#> 8: A1B1 2015-12-11 USA Japan Germany
#> 9: A2B2 2010-04-03 <NA> <NA> <NA>
#> 10: A2B2 2010-05-11 UK <NA> <NA>
#> 11: A2B2 2011-05-01 Spain UK <NA>
#> 12: A2B2 2012-03-01 Spain UK <NA>
#> 13: A2B2 2013-01-03 UK Spain <NA>
#> 14: A2B2 2014-01-04 UK Spain <NA>
You can join back on the Country from the original mydata if you need it.
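For example, a possible sketch of that join (res is just a name introduced here; note that mydata now also carries the helper columns num_v and last_v added earlier):
res <- dcast(y, ID + Date ~ Pref, value.var = 'Country')
mydata[res, on = c('ID', 'Date')]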
This isn't a super clean answer. Hopefully it helps get you close.
library(readr)
df <- readr::read_table(
"ID Country Date
A1B1 Japan 2010-01-02
A1B1 USA 2010-04-18
A1B1 USA 2011-03-22
A1B1 USA 2011-11-23
A1B1 Germany 2012-05-09
A1B1 Germany 2012-09-11
A1B1 Japan 2014-01-06
A1B1 France 2015-12-11
A2B2 UK 2010-04-03
A2B2 Spain 2010-05-11
A2B2 Spain 2011-05-01
A2B2 UK 2012-03-01
A3B2 UK 2013-01-03
A3B2 Brazil 2014-01-04")
df
library(tidyverse)
rankings <- df %>%
group_by(ID, Country) %>%
summarise(obs = n(),
last_dt = max(Date)) %>%
arrange(ID,-obs, desc(last_dt)) %>%
mutate(rank = 1:n()) %>% print() %>%
filter(rank <= 3) %>%
pivot_wider(
names_from = rank,
values_from = Country,
names_prefix = "rank_",
id_cols = ID
) %>% print()
#> `summarise()` regrouping output by 'ID' (override with `.groups` argument)
#> # A tibble: 8 x 5
#> # Groups: ID [3]
#> ID Country obs last_dt rank
#> <chr> <chr> <int> <date> <int>
#> 1 A1B1 USA 3 2011-11-23 1
#> 2 A1B1 Japan 2 2014-01-06 2
#> 3 A1B1 Germany 2 2012-09-11 3
#> 4 A1B1 France 1 2015-12-11 4
#> 5 A2B2 UK 2 2012-03-01 1
#> 6 A2B2 Spain 2 2011-05-01 2
#> 7 A3B2 Brazil 1 2014-01-04 1
#> 8 A3B2 UK 1 2013-01-03 2
#> # A tibble: 3 x 4
#> # Groups: ID [3]
#> ID rank_1 rank_2 rank_3
#> <chr> <chr> <chr> <chr>
#> 1 A1B1 USA Japan Germany
#> 2 A2B2 UK Spain <NA>
#> 3 A3B2 Brazil UK <NA>
df %>% left_join(rankings, by = "ID")
#> # A tibble: 14 x 6
#> ID Country Date rank_1 rank_2 rank_3
#> <chr> <chr> <date> <chr> <chr> <chr>
#> 1 A1B1 Japan 2010-01-02 USA Japan Germany
#> 2 A1B1 USA 2010-04-18 USA Japan Germany
#> 3 A1B1 USA 2011-03-22 USA Japan Germany
#> 4 A1B1 USA 2011-11-23 USA Japan Germany
#> 5 A1B1 Germany 2012-05-09 USA Japan Germany
#> 6 A1B1 Germany 2012-09-11 USA Japan Germany
#> 7 A1B1 Japan 2014-01-06 USA Japan Germany
#> 8 A1B1 France 2015-12-11 USA Japan Germany
#> 9 A2B2 UK 2010-04-03 UK Spain <NA>
#> 10 A2B2 Spain 2010-05-11 UK Spain <NA>
#> 11 A2B2 Spain 2011-05-01 UK Spain <NA>
#> 12 A2B2 UK 2012-03-01 UK Spain <NA>
#> 13 A3B2 UK 2013-01-03 Brazil UK <NA>
#> 14 A3B2 Brazil 2014-01-04 Brazil UK <NA>
Created on 2020-08-29 by the reprex package (v0.3.0)
Here's a messy Base R solution:
rlln_rnk_df <- do.call("rbind", lapply(split(mydata, mydata$ID), function(x){
y <- do.call("rbind", lapply(seq_len(nrow(x)), function(i){
tmp <- x[x$Date <= x$Date[i],]
tmp1 <- cbind(head(tmp[order(tmp$Date, decreasing = TRUE),], 1),
rnk = t(names(sort(table(tmp$Country), decreasing = TRUE))))
tmp1 <- setNames(tmp1, c(names(tmp), paste0("rnk.", 1:(ncol(tmp1) - ncol(tmp)))))
tmp1[,setdiff(paste0("rnk.", 1:(length(unique(mydata$Country)))), names(tmp1))] <- NA_character_
tmp1
}
)
)
z <- y[order(y$Date),]
cbind(ID = z$ID, Country = z$Country, Date = z$Date,
z[match(z$Date, z$Date[2:nrow(z)]), (grep("rnk", names(z), value = TRUE))])
}
)
)
df_clean <- data.frame(rlln_rnk_df[, colSums(is.na(rlln_rnk_df)) < nrow(rlln_rnk_df)],
row.names = NULL)

Finding duplicates by columns in R

I want to find duplicates across the columns of each row and keep only the unique values. Please help me with this.
I am sharing a sample dataset below. Hope this helps.
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
TableTest <- data.frame(X,Y,Z,S)
TableTest
Input dataset
X Y Z S
1 1 India India India
2 2 India India France
3 3 Philippines Netherlands Netherlands
4 4 Netherlands France France
5 5 France France India
Expected Output
X Y Z S
1 1 India NA NA
2 2 India France NA
3 3 Philippines Netherlands NA
4 4 Netherlands France NA
5 5 France India NA
Please help.
TableTest[, -1] <- as.data.frame(t(apply(TableTest[, -1], 1, function(a) {
  a <- replace(a, duplicated(a), NA_character_)
  a[order(is.na(a))]
})))
TableTest
# X Y Z S
# 1 1 India <NA> <NA>
# 2 2 India France <NA>
# 3 3 Philippines Netherlands <NA>
# 4 4 Netherlands France <NA>
# 5 5 France India <NA>
Another base R option
TableTest[-1] <- do.call(rbind,lapply(apply(TableTest[-1],1,unique),`length<-`,ncol(TableTest)-1))
or a simpler version (thanks to @Onyambu for the advice in the comments)
TableTest[-1] <- t(apply(TableTest[-1], 1, function(x)`length<-`(unique(x),ncol(TableTest[-1]))))
which gives
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
My solution:
TableTest[2:4] <- as.data.frame(t(apply(TableTest[2:4], 1, function(x) {
xo <- ifelse(!duplicated(x), x, NA_character_)
if (any(is.na(xo))) xo <- xo[!is.na(xo)]
length(xo) <- ncol(TableTest) - 1
xo
})))
Output
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
I don't think you can do it by only using data.frames, because you're moving values across columns. But here's one way to do it using matrices:
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
output <- apply(cbind(Y,Z,S), 1, function(row) {
rm_dup <- unique(row)
return(c(rm_dup, rep(NA_character_,
3 - length(rm_dup))))
})
t(output)
[,1] [,2] [,3]
[1,] "India" NA NA
[2,] "India" "France" NA
[3,] "Philippines" "Netherlands" NA
[4,] "Netherlands" "France" NA
[5,] "France" "India" NA

How can I transform a long-format data frame to a wide-format one with multiple values within a cell in R?

Here is the original df:
area sector item
1 East A <NA>
2 South A Baidu
3 South A Tencent
4 West A <NA>
5 North A <NA>
6 East B Microsoft
7 East B Google
8 East B Facebook
9 South B <NA>
10 West B <NA>
11 North B <NA>
12 East C <NA>
13 South C <NA>
14 West C <NA>
15 North C Alibaba
16 East D <NA>
17 South D <NA>
18 West D Amazon
19 North D <NA>
20 East E <NA>
21 South E <NA>
22 West E <NA>
23 North E <NA>
How can I transform the above df to the following one? Some cells in the transformed df have multiple items from the original df.
Sector East South West North
1 A <NA> "Baidu, Tencent" <NA> <NA>
2 B "Microsoft, Google, Facebook" <NA> <NA> <NA>
3 C <NA> <NA> <NA> "Alibaba"
4 D <NA> <NA> "Amazon" <NA>
5 E <NA> <NA> <NA> <NA>
A quick solution could be to use the toString function while transforming from long to wide with the reshape2 package:
reshape2::dcast(df, sector ~ area, toString)
#Using item as value column: use value.var to override.
# sector East North South West
# 1 A <NA> <NA> Baidu, Tencent <NA>
# 2 B Microsoft, Google, Facebook <NA> <NA> <NA>
# 3 C <NA> Alibaba <NA> <NA>
# 4 D <NA> <NA> <NA> Amazon
# 5 E <NA> <NA> <NA> <NA>
This is almost a dupe of this, but most of the solutions there won't work for this case; it can still give you some ideas, though.
And just for fun, here is a base solution:
reshape(aggregate(item ~ area + sector, data = df, paste, collapse = ","),
idvar = "sector", timevar = "area", direction = "wide")
sector item.East item.North item.South item.West
1 A <NA> <NA> Baidu,Tencent <NA>
5 B Microsoft,Google,Facebook <NA> <NA> <NA>
9 C <NA> Alibaba <NA> <NA>
13 D <NA> <NA> <NA> Amazon
17 E <NA> <NA> <NA> <NA>
Here is an option with dplyr/tidyr
library(dplyr)
library(tidyr)
df %>%
group_by(area, sector) %>%
summarise(item = toString(item)) %>%
spread(area, item)
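With more recent tidyr, the spread() step can also be written with pivot_wider(), using values_fn to collapse the multiple items per cell; a minimal sketch (assuming the data frame is the df used above, and turning all-NA cells back into NA rather than an empty string):
library(dplyr)
library(tidyr)

df %>%
  pivot_wider(names_from = area, values_from = item,
              values_fn = function(x) {
                x <- x[!is.na(x)]
                if (length(x) == 0) NA_character_ else toString(x)
              })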
