My aim is to create a table that summarizes the countries featured in my sample. The table should have only two rows: a first row with one column per region, and a second row with the names of the countries located in the respective region.
To give you an example, this is what my data.frame XYZ looks like:
    wvs5red2.s003names  wvs5red2.regiondummies
21  "Hong Kong"         Asian Tigers
45  "South Korea"       Asian Tigers
49  "Taiwan"            Asian Tigers
66  "China"             East Asia & Pacific
80  "Indonesia"         East Asia & Pacific
86  "Malaysia"          East Asia & Pacific
My aim is to obtain a table that looks similar to this:
region     Asian Tigers                    East Asia & Pacific
countries  Hong Kong, South Korea, Taiwan  China, Indonesia, etc.
Do you have any idea how to obtain such a table? It took me hours searching for something similar.
Simplest way is tapply:
XYZ <- structure(list(
names = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("China", "Hong Kong", "Indonesia", "Malaysia", "South Korea", "Taiwan"), class = "factor"),
region = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Asian Tigers", "East Asia & Pacific"), class = "factor")),
.Names = c("names", "region"), row.names = c(NA, -6L), class = "data.frame")
tapply(XYZ$names, XYZ$region, paste, collapse=", ")
# Asian Tigers East Asia & Pacific
# "Hong Kong, South Korea, Taiwan" "China, Indonesia, Malaysia"
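If the two-row layout from the question is wanted, the collapsed strings can be arranged as a one-row matrix with the regions as column names. This is only a sketch in base R; the object names `by_region` and `tab` are illustrative and not from the original answer.

```r
# Example data mirroring the question (plain character columns).
XYZ <- data.frame(
  names  = c("Hong Kong", "South Korea", "Taiwan",
             "China", "Indonesia", "Malaysia"),
  region = rep(c("Asian Tigers", "East Asia & Pacific"), each = 3),
  stringsAsFactors = FALSE
)

# One collapsed string per region; sapply() returns a named character vector.
by_region <- sapply(split(XYZ$names, XYZ$region), paste, collapse = ", ")

# Arrange the result as a single row labelled "countries",
# with the region names as column headers.
tab <- matrix(by_region, nrow = 1,
              dimnames = list("countries", names(by_region)))
tab
```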
Recreate the data:
dat <- data.frame(
country = c("Hong Kong", "South Korea", "Taiwan", "China", "Indonesia", "Malaysia"),
region = c(rep("Asian Tigers", 3), rep("East Asia & Pacific", 3))
)
dat
country region
1 Hong Kong Asian Tigers
2 South Korea Asian Tigers
3 Taiwan Asian Tigers
4 China East Asia & Pacific
5 Indonesia East Asia & Pacific
6 Malaysia East Asia & Pacific
Use ddply in package plyr combined with paste to summarise the data:
library(plyr)
ddply(dat, .(region), function(x)paste(x$country, collapse= ","))
region V1
1 Asian Tigers Hong Kong,South Korea,Taiwan
2 East Asia & Pacific China,Indonesia,Malaysia
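If loading plyr is not an option, roughly the same summary can probably be obtained with base R's aggregate(), which forwards extra arguments such as collapse to FUN. A minimal sketch, assuming the same dat as above; `out` is just an illustrative name.

```r
# Recreate the example data as plain character columns.
dat <- data.frame(
  country = c("Hong Kong", "South Korea", "Taiwan",
              "China", "Indonesia", "Malaysia"),
  region  = rep(c("Asian Tigers", "East Asia & Pacific"), each = 3),
  stringsAsFactors = FALSE
)

# Collapse the countries within each region; `collapse` is passed on to paste().
out <- aggregate(country ~ region, data = dat, FUN = paste, collapse = ", ")
out
```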
First create data:
> country<-c("Hong Kong","Taiwan","China","Indonesia")
> region<-rep(c("Asian Tigers","East Asia & Pacific"),each=2)
> df<-data.frame(country=country,region=region)
Then run through the column region and gather all the countries. We could use tapply, but I will use dlply from package plyr, since it retains list names.
> library(plyr)
> ll<-dlply(df,~region,function(d)paste(d$country,collapse=","))
> ll
$`Asian Tigers`
[1] "Hong Kong,Taiwan"
$`East Asia & Pacific`
[1] "China,Indonesia"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
region
1 Asian Tigers
2 East Asia & Pacific
Now convert the list to the data.frame using do.call. Since we need nice names we need to pass argument check.names=FALSE:
> ll$check.names <- FALSE
> do.call("data.frame",ll)
Asian Tigers East Asia & Pacific
1 Hong Kong,Taiwan China,Indonesia
I have a dataframe like this:
structure(list(from = c("China", "China", "Canada", "Canada",
"USA", "China", "Trinidad and Tobago", "China", "USA", "USA"),
to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
"Rep. of Korea", "Canada", "Japan"), weight = c(4766781396,
4039683737, 3419468319, 3216051707, 2535151299, 2513604035,
2303474559, 2096033823, 2091906420, 2066357443)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
from = c("Canada", "China", "China", "China", "Trinidad and Tobago",
"USA", "USA"), to = c("USA", "Japan", "Rep. of Korea", "USA",
"USA", "Canada", "Japan"), .rows = structure(list(3:4, 1:2,
8L, 6L, 7L, 9L, c(5L, 10L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), .drop = TRUE))
I would like to compute the absolute value of the difference in the weight column, grouped by from and to.
I'm trying the function aggregate(), but it seems to work for means and sums and not for differences. For example (df is the name of my dataframe):
aggregate(weight~from+to, data = df, FUN=mean)
which produces:
from to weight
1 USA Canada 2091906420
2 China Japan 4403232567
3 USA Japan 2300754371
4 China Rep. of Korea 2096033823
5 Canada USA 3317760013
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
EDIT. The desired result is instead
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
As we can see, the country pairs that appear twice in the columns from and to are collapsed into a single row, with the difference between their weights in the column weight. E.g.,
from to weight
China Japan 4766781396
China Japan 4039683737
become
from to weight
China Japan 727097659
because
> 4766781396-4039683737
[1] 727097659
The difference should be positive (and this is why I wrote "the absolute value of difference of the weights").
Country pairs that appear in only one row of the dataframe df remain unchanged, e.g.
from to weight
7 Trinidad and Tobago USA 2303474559
Assuming at most 2 values per group and that the order of the difference is not important
aggregate(weight~from+to, data=df, FUN=function(x){
abs(ifelse(length(x)==1,x,diff(x)))
})
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
Is the following what you are looking for?
f <- function(x) abs(x[2] - x[1])
aggregate(weight ~ from + to, data = df, FUN = f)
#> from to weight
#> 1 USA Canada NA
#> 2 China Japan 727097659
#> 3 USA Japan 468793856
#> 4 China Rep. of Korea NA
#> 5 Canada USA 203416612
#> 6 China USA NA
#> 7 Trinidad and Tobago USA NA
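The output above returns NA for pairs that occur only once, whereas the question asks for those rows to keep their weight. One possible adjustment, sketched here under the assumption that for groups with more than two rows "difference" may be read as maximum minus minimum, is to branch on the group size. The helper name `spread_or_self` is hypothetical.

```r
# Ungrouped copy of the example data from the question.
df <- data.frame(
  from = c("China", "China", "Canada", "Canada", "USA", "China",
           "Trinidad and Tobago", "China", "USA", "USA"),
  to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
         "Rep. of Korea", "Canada", "Japan"),
  weight = c(4766781396, 4039683737, 3419468319, 3216051707, 2535151299,
             2513604035, 2303474559, 2096033823, 2091906420, 2066357443),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a single value is returned unchanged; otherwise return
# max - min, which equals the absolute difference when there are two values.
spread_or_self <- function(x) if (length(x) > 1) diff(range(x)) else x

res <- aggregate(weight ~ from + to, data = df, FUN = spread_or_self)
res
```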
I have a dataframe that in an entirely simplistic representation looks like this:
structure(list(Plant = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area = c("North", "North", "North", "North", "South", "South",
"South", "South", "South", "South", "South", "South"), dups = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-12L))
For any row of plant, I want to replace it with specific possible combinations of plant and area that are within another data frame. They are not ALL possible combinations, but just the ones that truly exist.
The possible combinations of the variables look like this:
structure(list(nam = c("rose", "rose", "rose", "rose", "cactus",
"cactus"), area = c("North", "South", "East", "West", "South",
"Northwest")), class = "data.frame", row.names = c(NA, -6L))
The final dataset should look like:
structure(list(Plant2 = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area2 = c("North", "South", "East", "West", "North", "South",
"East", "West", "South", "Northwest", "South", "Northwest")), class = "data.frame", row.names = c(NA,
-12L))
This is how I started. I created a variable for how many combinations were potentially possible and added it to the dataframe with a join. And then I got super stuck because, try as I might, I can't change the Area variable properly. I thought I could basically paste all the combinations of the variables with the same dups value, but I can't refer to the other dataframe from within dplyr. This is a very simplified version of the data; there are many other combinations, so this is not something I want to do by subsetting the data, etc.
dups<-combos %>% group_by(nam) %>% mutate(dups=n())
colnames(dups)<-c("Plant","Area","dups")
df<-left_join(df,dups)
df<-df %>% uncount(dups, .remove=FALSE)
The information you have provided is not enough to produce a final dataframe like that since each combination of Plant and dups in df can be mapped to multiple values in combos. For instance, each "rose" and "4" could be matched against the first four rows in combos. However, it seems that you simply want Area2 to repeat itself until the values fill up all possible entries for each group of Plant and dups. If so, you can try
library(dplyr)
combos <- combos %>% group_by(nam) %>% mutate(dups = n())
df %>%
group_by(Plant, dups) %>%
mutate(Area2 = rep(
combos$area[combos$nam == Plant[[1L]] & combos$dups == dups[[1L]]],
length.out = n()
))
Output
# A tibble: 12 x 4
# Groups: Plant, dups [2]
Plant Area dups Area2
<chr> <chr> <int> <chr>
1 rose North 4 North
2 rose North 4 South
3 rose North 4 East
4 rose North 4 West
5 rose South 4 North
6 rose South 4 South
7 rose South 4 East
8 rose South 4 West
9 cactus South 2 South
10 cactus South 2 Northwest
11 cactus South 2 South
12 cactus South 2 Northwest
You can use expand.grid to create a dataframe with all possible combinations (note that this also includes pairs that do not occur in combos):
expand.grid(Plant = unique(combos$nam), Area = unique(combos$area))
Plant Area
1 rose North
2 cactus North
3 rose South
4 cactus South
5 rose East
6 cactus East
7 rose West
8 cactus West
9 rose Northwest
10 cactus Northwest
This snippet should do what you want, if I've understood correctly. Here, d1 and d2 are your first and second data frames. I don't think that computing dups as you have is necessary for this task, but maybe I've misunderstood your intention.
library("dplyr")
l <- split(d2$area, d2$nam)
d1 %>%
group_by(Plant) %>%
mutate(Area = rep_len(l[[Plant[1L]]], n())) %>%
ungroup() %>%
select(-dups)
# A tibble: 12 × 2
Plant Area
<chr> <chr>
1 rose North
2 rose South
3 rose East
4 rose West
5 rose North
6 rose South
7 rose East
8 rose West
9 cactus South
10 cactus Northwest
11 cactus South
12 cactus Northwest
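For comparison, the same recycling idea can be sketched in base R with split() and ave(), without dplyr. This assumes d1 and d2 are the two data frames from the question and that every Plant in d1 has at least one entry in d2; `lookup` and `Area2` are illustrative names.

```r
# The two data frames from the question (dups is not needed here).
d1 <- data.frame(
  Plant = rep(c("rose", "cactus"), times = c(8, 4)),
  Area  = c(rep("North", 4), rep("South", 8)),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  nam  = c("rose", "rose", "rose", "rose", "cactus", "cactus"),
  area = c("North", "South", "East", "West", "South", "Northwest"),
  stringsAsFactors = FALSE
)

# Named list: plant name -> vector of areas that actually exist for it.
lookup <- split(d2$area, d2$nam)

# Within each Plant group, recycle that plant's areas to the group length.
d1$Area2 <- ave(d1$Plant, d1$Plant,
                FUN = function(p) rep_len(lookup[[p[1]]], length(p)))
d1
```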
The code below does what I want for a simple table. The mapping that takes place in the statement with on works perfectly. But I also have a situation where a row contains multiple countries that may need to be assigned to multiple regions, and storing the result in the region column is more challenging.
library(data.table)
testDT <- data.table(country = c("Algeria", "Egypt", "United States", "Brazil"))
testDTcomplicated <- data.table(country = c("Algeria, Ghana, Sri Lanka", "Egypt", "United States, Argentina", "Brazil"))
regionLookup <- data.table(countrylookup = c("Algeria", "Argentina", "Egypt", "United States", "Brazil", "Ghana", "Sri Lanka"), regionVal = c("Africa", "South America", "Africa", "North America", "South America", "Africa", "Asia"))
testDT[regionLookup, region := regionVal, on = c(country = "countrylookup")]
> testDT
country region
1: Algeria Africa
2: Egypt Africa
3: United States North America
4: Brazil South America
I'd like to have testDTcomplicated look like the following:
> testDTcomplicated
country region
1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
2: Egypt Africa
3: United States, Argentina North America, South America
4: Brazil South America
You could split the data on comma and get each country in a separate row, join the data with regionLookup and collapse them again in one value in a comma-separated string.
library(data.table)
testDTcomplicated[, row := seq_len(.N)]
new <- splitstackshape::cSplit(testDTcomplicated, 'country', ',',
direction = 'long')[regionLookup, region := regionVal,
on = c(country = "countrylookup")]
new <- new[, lapply(.SD, toString), row][,row:=NULL]
new
# country region
#1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
#2: Egypt Africa
#3: United States, Argentina North America, South America
#4: Brazil South America
The same logic can be implemented in dplyr as:
library(dplyr)
testDTcomplicated %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(country, sep = ", ") %>%
left_join(regionLookup, by = c("country" = "countrylookup")) %>%
group_by(row) %>%
summarise(across(everything(), toString)) %>%
select(-row)
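The split, look up, and collapse steps can also be sketched in base R with a named lookup vector. This is an illustration only, assuming countries are separated by a comma plus optional whitespace and that every country appears in the lookup table; `lookup`, `parts`, and `region` are hypothetical names.

```r
country <- c("Algeria, Ghana, Sri Lanka", "Egypt",
             "United States, Argentina", "Brazil")

regionLookup <- data.frame(
  countrylookup = c("Algeria", "Argentina", "Egypt", "United States",
                    "Brazil", "Ghana", "Sri Lanka"),
  regionVal = c("Africa", "South America", "Africa", "North America",
                "South America", "Africa", "Asia"),
  stringsAsFactors = FALSE
)

# Named vector: country name -> region.
lookup <- setNames(regionLookup$regionVal, regionLookup$countrylookup)

# Split each entry on commas, look up each piece, and collapse again.
parts  <- strsplit(country, ",\\s*")
region <- vapply(parts, function(p) toString(lookup[p]), character(1))

data.frame(country, region)
```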
Is there an R function to check whether a word from my list is present in a string and, if so, return another value?
Address
10 Sydney, South East
11 Mumbai, North West
12 London, Central Town
.
City Country
Mumbai India
Sydney Australia
London Britain
Output:
Address Country
10 Sydney, South East Australia
11 Mumbai, North West India
12 London, Central Town Britain
Sample code -
influencer %>%
mutate(AC.Name = AC_Village$AC.Name[match(AC_Village$Town,
str_extract(Complete.Address,paste(AC_Village$Town, collapse="|")))])
One option is to extract the city name from the 'Address' column using the 'City' values of the second dataset, match it against 'City', and get the corresponding 'Country':
library(tidyverse)
df1 %>%
mutate(Country = df2$Country[match(str_extract(Address,
paste(df2$City, collapse="|")), df2$City)])
# Address Country
#1 10 Sydney, South East Australia
#2 11 Mumbai, North West India
#3 12 London, Central Town Britain
data
df1 <- structure(list(Address = c("10 Sydney, South East", "11 Mumbai, North West",
"12 London, Central Town")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(City = c("Mumbai", "Sydney", "London"), Country = c("India",
"Australia", "Britain")), class = "data.frame", row.names = c(NA,
-3L))
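A base R sketch of the same extract-then-match idea is shown below, with one extra address (made up for illustration) that contains no known city, to show that it ends up as NA. It assumes the city names contain no regular-expression metacharacters; `pattern`, `m`, and `city` are illustrative names.

```r
df1 <- data.frame(
  Address = c("10 Sydney, South East", "11 Mumbai, North West",
              "12 London, Central Town", "13 Nowhere, Far Away"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  City    = c("Mumbai", "Sydney", "London"),
  Country = c("India", "Australia", "Britain"),
  stringsAsFactors = FALSE
)

# Alternation pattern of all known cities.
pattern <- paste(df2$City, collapse = "|")

# Position of the first city found in each address (-1 if none).
m <- regexpr(pattern, df1$Address)

# Extracted city per address, NA where nothing matched.
city <- rep(NA_character_, nrow(df1))
city[m > 0] <- regmatches(df1$Address, m)

# Look up each extracted city in df2$City and take the matching country.
df1$Country <- df2$Country[match(city, df2$City)]
df1
```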
Subset of data frame:
country1 country2
Japan Japan
Netherlands <NA>
<NA> <NA>
Brazil Brazil
Russian Federation <NA>
<NA> <NA>
<NA> United States of America
Germany Germany
Ukraine <NA>
Japan Japan
<NA> Russian Federation
<NA> United States of America
France France
New Zealand New Zealand
Japan <NA>
I have two character vectors, country1 and country2, which I would like to merge together into a new column. No observations in my dataset have different countries. However, some pairs have duplicated values which I would like only to display once. There is also the issue of the NAs, which I want to omit in the merged column, where each value in the new column only has the country string. A few observations have NAs in both of my columns, which I just want to leave as NA in the new column. I'm wondering what the best way to tackle this would be.
I've made a minor modification to the function in the top-voted answer here to a similar question, changing the comma separator to an empty string.
However, this leaves the repeating issue unsolved:
country1 country2 merge
Japan Japan JapanJapan
Netherlands <NA> Netherlands
<NA> <NA> <NA>
Brazil Brazil BrazilBrazil
Russian Federation <NA> Russian Federation
<NA> <NA> <NA>
<NA> United States of America United States of America
Germany Germany GermanyGermany
Ukraine <NA> Ukraine
Japan Japan JapanJapan
<NA> Russian Federation Russian Federation
<NA> United States of America United States of America
France France FranceFrance
New Zealand New Zealand New ZealandNew Zealand
Japan <NA> Japan
Since you specified dplyr, here's a one-liner with it:
df <- dplyr::mutate(df, merge = dplyr::if_else(is.na(country1), country2, country1))
Data
country1 <- c("Japan", "Netherlands", NA, "Brazil", "Russian Federation", NA, NA, "Germany", "Ukraine", "Japan", NA, NA, "France", "New Zealand", "Japan")
country2 <- c("Japan", NA, NA, "Brazil", NA, NA, "United States of America", "Germany", NA, "Japan", "Russian Federation", "United States of America", "France", "New Zealand", NA)
df <- data.frame(country1, country2, stringsAsFactors = FALSE)
Since you said you have character vectors, then:
library(tidyverse)
coalesce(country1,country2)
[1] "Japan" "Netherlands" NA
[4] "Brazil" "Russian Federation" NA
[7] "United States of America" "Germany" "Ukraine"
[10] "Japan" "Russian Federation" "United States of America"
[13] "France" "New Zealand" "Japan"
If it's a dataframe, just do coalesce(!!!df)
You can also just replace the NA values in the first column with the values from the second:
df$country1[is.na(df$country1)] <- df$country2[is.na(df$country1)]
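If the original columns should stay untouched, a non-destructive base R variant is to write the result to a new column with ifelse(). A small sketch on the first few rows of the example data; the column name `merged` is just an illustration.

```r
country1 <- c("Japan", "Netherlands", NA, "Brazil", "Russian Federation", NA, NA)
country2 <- c("Japan", NA, NA, "Brazil", NA, NA, "United States of America")
df <- data.frame(country1, country2, stringsAsFactors = FALSE)

# Take country1 where present, otherwise fall back to country2.
# Rows where both are NA stay NA.
df$merged <- ifelse(is.na(df$country1), df$country2, df$country1)
df
```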