I have a dataframe like this:
structure(list(from = c("China", "China", "Canada", "Canada",
"USA", "China", "Trinidad and Tobago", "China", "USA", "USA"),
to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
"Rep. of Korea", "Canada", "Japan"), weight = c(4766781396,
4039683737, 3419468319, 3216051707, 2535151299, 2513604035,
2303474559, 2096033823, 2091906420, 2066357443)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
from = c("Canada", "China", "China", "China", "Trinidad and Tobago",
"USA", "USA"), to = c("USA", "Japan", "Rep. of Korea", "USA",
"USA", "Canada", "Japan"), .rows = structure(list(3:4, 1:2,
8L, 6L, 7L, 9L, c(5L, 10L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), .drop = TRUE))
I would like to compute the absolute value of the difference in the weight column, grouped by from and to.
I'm trying the function aggregate(), but it seems to work for means and sums, not for differences. For example (df is the name of my dataframe):
aggregate(weight~from+to, data = df, FUN=mean)
which produces:
from to weight
1 USA Canada 2091906420
2 China Japan 4403232567
3 USA Japan 2300754371
4 China Rep. of Korea 2096033823
5 Canada USA 3317760013
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
EDIT. The desired result is instead
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
As we can see, the country pairs that appear twice in the from and to columns collapse into a single row, with the difference between the weights in the weight column. E.g.,
from to weight
China Japan 4766781396
China Japan 4039683737
become
from to weight
China Japan 727097659
because
> 4766781396-4039683737
[1] 727097659
The difference should be positive (which is why I wrote "the absolute value of the difference of the weights").
The pairs of countries that appear in just one row of the dataframe df remain unchanged, e.g.
from to weight
7 Trinidad and Tobago USA 2303474559
Assuming at most 2 values per group and that the order of the difference is not important
aggregate(weight ~ from + to, data = df, FUN = function(x) {
  abs(ifelse(length(x) == 1, x, diff(x)))
})
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
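For comparison, a dplyr sketch should give the same result under the same at-most-two-rows-per-group assumption (df is already grouped by from and to, as the dput shows):
library(dplyr)
# summarise() collapses each from/to group: single-row groups keep their
# weight, two-row groups get the absolute difference.
df %>%
  summarise(weight = if (n() == 1) weight else abs(diff(weight)), .groups = "drop")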
Is the following what you are looking for? (Note that groups with only a single observation come out as NA here, since x[2] does not exist for them.)
f <- function(x) abs(x[2] - x[1])
aggregate(weight ~ from + to, data = df, FUN = f)
#> from to weight
#> 1 USA Canada NA
#> 2 China Japan 727097659
#> 3 USA Japan 468793856
#> 4 China Rep. of Korea NA
#> 5 Canada USA 203416612
#> 6 China USA NA
#> 7 Trinidad and Tobago USA NA
I have this dataset below
Country Sales
France 12000
Germany 2400
Italy 1000
Belgium 500
Could you please help with code to add a common 'word' to the Country column? I have tried my best. Here is the intended output I want. Thanks.
Country Sales
France - Europe 12000
Germany - Europe 2400
Italy - Europe 1000
Belgium - Europe 500
Thanks for your help.
country_data <-
  data.frame(
    country = c(
      "France",
      "Germany",
      "Italy",
      "Belgium"
    ),
    sales = c(
      12000,
      2400,
      1000,
      500
    )
  )

country_data_2 <-
  country_data |>
  dplyr::mutate(
    continent = dplyr::case_when(
      country %in% c("France", "Germany", "Italy", "Belgium") ~ "Europe",
      country %in% c("Egypt", "South Africa", "Morocco") ~ "Africa",
      country %in% c("Canada", "Mexico", "United States") ~ "North America"
      # ...
    )
  ) |>
  dplyr::transmute(
    country = paste(
      country,
      continent,
      sep = " - "
    ),
    sales = sales
  )
country_data_2
#> country sales
#> 1 France - Europe 12000
#> 2 Germany - Europe 2400
#> 3 Italy - Europe 1000
#> 4 Belgium - Europe 500
Created on 2022-11-08 with reprex v2.0.2
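Since every country in this particular sample gets the same suffix, a simpler base R sketch using the country_data frame defined above should also do:
# Append the same suffix to every country; no lookup table is needed when
# the whole sample shares one continent.
country_data$country <- paste(country_data$country, "Europe", sep = " - ")
country_data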
I am trying to automatically match two dataframes with a (for?) loop.
> df_key
country_election keyword1 keyword2 keyword3 keyword4 keyword5
1 France Paris Rome Madrid London Marseille
2 Spain Valencia Berlin Manchester Zurich Milan
> df_country
city country
1 Paris France
2 Rome Italy
3 Madrid Spain
4 London United Kingdom
5 Marseille France
6 Valencia Spain
7 Berlin Germany
8 Manchester United Kingdom
9 Zurich Switzerland
10 Milan Italy
In this example I would like to match every keyword in df_key with df_country to add country columns.
country_election keyword1 country_1 keyword2 country_2 keyword3 country_3
1 France Paris France Rome Italy Madrid Spain
2 Spain Valencia Spain Berlin Germany Manchester United Kingdom
Finally, I'd also like to have a series of dummy variables checking whether country_i is equal to country_election. Thanks a lot for your help.
df_key <- structure(list(country_election = c("France", "Spain"), keyword1 = c("Paris", "Valencia"),
keyword2 = c("Rome ", "Berlin"), keyword3 = c("Madrid", "Manchester"), keyword4 = c("London", "Zurich"),
keyword5 = c("Marseille", "Milan")), class = "data.frame", row.names = c(NA, -2L))
df_country <- structure(list(city = c("Paris", "Rome", "Madrid", "London", "Marseille", "Valencia",
"Berlin", "Manchester", "Zurich", "Milan"), country = c("France", "Italy", "Spain", "United Kingdom",
"France", "Spain", "Germany", "United Kingdom", "Switzerland", "Italy")),
class = "data.frame", row.names = c(NA, -10L))
You can match the city names, extract the country and create new columns. If the column order is important, extract the numeric part from the column names and order the data.
cols <- sub('keyword', 'country', names(df_key[-1]))
df_key[cols] <- df_country$country[match(as.matrix(df_key[-1]), df_country$city)]
df_key[order(as.numeric(sub('\\D+', '', names(df_key))), na.last = FALSE)]
# country_election keyword1 country1 keyword2 country2 keyword3
#1 France Paris France Rome Italy Madrid
#2 Spain Valencia Spain Berlin Germany Manchester
# country3 keyword4 country4 keyword5 country5
#1 Spain London United Kingdom Marseille France
#2 United Kingdom Zurich Switzerland Milan Italy
Update
Thanks to @jazzurro for his answer. It made me realize that the duplicates may just complicate things. I hope that keeping only unique values per row simplifies the task.*
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "Sweden"),
                 CTR2 = c("England", "China", "China", "England", NA),
                 CTR3 = c("USA", "USA", "USA", "USA", NA),
                 CTR4 = c(NA, NA, NA, NA, NA),
                 CTR5 = c(NA, NA, NA, NA, NA),
                 CTR6 = c(NA, NA, NA, NA, NA))
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England China USA
2 England China USA
3 England China USA
4 China England USA
5 Sweden
It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
A combination/co-occurrence should be counted as 1.
Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
Aspired Result
China England USA Sweden
China 4 4 4 0
England 4 4 4 0
USA 4 4 4 0
Sweden 0 0 0 0
*I've used the code from here to remove all non-unique observations.
Original Post
Assume I have a data set with a low two-digit number of columns (some NA/empty) and more than 100,000 rows, represented by the following example dataframe:
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "England"),
                 CTR2 = c("England", "China", "China", "England", NA),
                 CTR3 = c("England", "China", "China", "England", NA),
                 CTR4 = c("China", "USA", "USA", "China", NA),
                 CTR5 = c("USA", "England", "USA", "USA", NA),
                 CTR6 = c("England", "China", "USA", "England", NA))
df
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England England England China USA England
2 England China China USA England China
3 England China China USA USA USA
4 China England England China USA England
5 England
and I want to count the co-occurrences by ID/row to get a co-occurrence matrix that counts each combination only once per ID/row, meaning that no value over 1 is allocated to a combination within a row (i.e. assign a value of 1 for the existence of a co-occurrence, independent of in-row frequency and order, and a value of 0 where there is no co-occurrence/combination in that ID/row):
1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0
Another important aspect concerns the counting of observations that appear only once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination, even though not with themselves), so that the combination USA-USA also gets a value of 1 assigned.
1 England England England China USA England
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1
Because the count for a combination should not exceed 1 per row/ID, this results in:
China England USA
China 1 1 1
England 1 1 1
USA 1 1 1
This should lead to the following result for the example dataframe, where a value of 4 is assigned to each combination because each combination occurs in four rows and each country is part of a combination in the original dataframe:
China England USA
China 4 4 4
England 4 4 4
USA 4 4 4
So there are five conditions for counting:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
A combination should be counted as 1.
Observations occurring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
I've tried to implement this using dplyr, data.table, base aggregate and plyr, adjusting code from [1], [2], [3], [4], [5] and [6], but since I don't care about order within a row and I also don't want to sum up all combinations within a row, I haven't obtained the desired result so far.
I'm a novice in R. Any help is very much appreciated.
DATA
I modified your data so that it can represent your actual situation.
# ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#1: 1 England England England China USA England
#2: 2 England China China USA England China
#3: 3 England China China USA USA USA
#4: 4 China England England China USA England
#5: 5 Sweden <NA> <NA> <NA> <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
# Transform the data to long format. Remove rows whose value has zero characters (i.e., "") or is NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, you need to change the diagonal values. If a value is at most one,
# change it to zero. Otherwise, keep the original value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))
#value
#value China England Sweden USA
#China 4 4 0 4
#England 4 4 0 4
#Sweden 0 0 0 0
#USA 4 4 0 4
Here is an option using base::table:
#get pairwise combinations of countries within each row
pairsDF <- as.data.frame(do.call(rbind,
                                 by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))
#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")
#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]
output:
V2
V1 China England Sweden USA
China 4 4 0 4
England 4 4 0 4
Sweden 0 0 0 0
USA 4 4 0 4
data:
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "Sweden"),
                 CTR2 = c("China", "China", "China", "England", NA),
                 CTR3 = c("USA", "USA", "USA", "USA", NA),
                 CTR4 = c(NA, NA, NA, NA, NA),
                 CTR5 = c(NA, NA, NA, NA, NA),
                 CTR6 = c(NA, NA, NA, NA, NA))
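For comparison, a compact base R sketch of the same idea should also work on the df defined just above (the long and keep names are only illustrative):
# Reshape to long format, keep one row per ID/country pair, drop IDs that
# contribute only a single country (condition 1), then cross-tabulate.
long <- data.frame(ID = rep(df$ID, times = ncol(df) - 1),
                   country = unlist(df[-1], use.names = FALSE))
long <- unique(na.omit(long))
long$country <- factor(long$country)   # keeps Sweden as a zero row/column
keep <- long$ID %in% long$ID[duplicated(long$ID)]
crossprod(table(long$ID[keep], long$country[keep]))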
Subset of data frame:
country1 country2
Japan Japan
Netherlands <NA>
<NA> <NA>
Brazil Brazil
Russian Federation <NA>
<NA> <NA>
<NA> United States of America
Germany Germany
Ukraine <NA>
Japan Japan
<NA> Russian Federation
<NA> United States of America
France France
New Zealand New Zealand
Japan <NA>
I have two character vectors, country1 and country2, which I would like to merge into a new column. No observation has two different countries, but some pairs have duplicated values that I would like to display only once. There is also the issue of the NAs, which I want to omit in the merged column, so that each value in the new column is just the country string. A few observations have NAs in both columns; these should simply stay NA in the new column. I'm wondering what the best way to tackle this would be.
I've made a minor modification to the function in the top-voted answer to a similar question here, changing the comma separator to nothing.
However, this leaves the repeating issue unsolved:
country1 country2 merge
Japan Japan JapanJapan
Netherlands <NA> Netherlands
<NA> <NA> <NA>
Brazil Brazil BrazilBrazil
Russian Federation <NA> Russian Federation
<NA> <NA> <NA>
<NA> United States of America United States of America
Germany Germany GermanyGermany
Ukraine <NA> Ukraine
Japan Japan JapanJapan
<NA> Russian Federation Russian Federation
<NA> United States of America United States of America
France France FranceFrance
New Zealand New Zealand New ZealandNew Zealand
Japan <NA> Japan
Since you specified dplyr, here's a one-liner with it:
df <- dplyr::mutate(df, merge = dplyr::if_else(is.na(country1), country2, country1))
Data
country1 <- c("Japan", "Netherlands", NA, "Brazil", "Russian Federation", NA, NA, "Germany", "Ukraine", "Japan", NA, NA, "France", "New Zealand", "Japan")
country2 <- c("Japan", NA, NA, "Brazil", NA, NA, "United States of America", "Germany", NA, "Japan", "Russian Federation", "United States of America", "France", "New Zealand", NA)
df <- data.frame(country1, country2, stringsAsFactors = F)
Since you said you have character vectors, then:
library(tidyverse)
coalesce(country1,country2)
[1] "Japan" "Netherlands" NA
[4] "Brazil" "Russian Federation" NA
[7] "United States of America" "Germany" "Ukraine"
[10] "Japan" "Russian Federation" "United States of America"
[13] "France" "New Zealand" "Japan"
If it's a data frame, just do coalesce(!!!df).
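For example, with the df built in the Data block above, a quick sketch of the data-frame form:
# coalesce() accepts spliced arguments, so the two columns can be passed in
# one go.
df$merge <- dplyr::coalesce(!!!df[c("country1", "country2")])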
You can also just replace the NA values in the 1st column with values from the 2nd:
df$country1[is.na(df$country1)] <- df$country2[is.na(df$country1)]
My aim is to create a table that summarizes the countries featured in my sample. This table should have only two rows: a first row with a column for each region and a second row with the names of the countries located in the respective region.
To give you an example, this is what my data.frame XYZ looks like:
   wvs5red2.s003names  wvs5red2.regiondummies
21 Hong Kong           Asian Tigers
45 South Korea         Asian Tigers
49 Taiwan              Asian Tigers
66 China               East Asia & Pacific
80 Indonesia           East Asia & Pacific
86 Malaysia            East Asia & Pacific
My aim is to obtain a table that looks similar to this:
region      Asian Tigers                      East Asia & Pacific
countries   Hong Kong, South Korea, Taiwan    China, Indonesia, etc.
Do you have any idea how to obtain such a table? It took me hours searching for something similar.
The simplest way is tapply:
XYZ <- structure(list(
names = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("China", "Hong Kong", "Indonesia", "Malaysia", "South Korea", "Taiwan"), class = "factor"),
region = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Asian Tigers", "East Asia & Pacific"), class = "factor")),
.Names = c("names", "region"), row.names = c(NA, -6L), class = "data.frame")
tapply(XYZ$names, XYZ$region, paste, collapse=", ")
# Asian Tigers East Asia & Pacific
# "Hong Kong, South Korea, Taiwan" "China, Indonesia, Malaysia"
Recreate the data:
dat <- data.frame(
country = c("Hong Kong", "South Korea", "Taiwan", "China", "Indonesia", "Malaysia"),
region = c(rep("Asian Tigers", 3), rep("East Asia & Pacific", 3))
)
dat
country region
1 Hong Kong Asian Tigers
2 South Korea Asian Tigers
3 Taiwan Asian Tigers
4 China East Asia & Pacific
5 Indonesia East Asia & Pacific
6 Malaysia East Asia & Pacific
Use ddply in package plyr combined with paste to summarise the data:
library(plyr)
ddply(dat, .(region), function(x)paste(x$country, collapse= ","))
region V1
1 Asian Tigers Hong Kong,South Korea,Taiwan
2 East Asia & Pacific China,Indonesia,Malaysia
First create data:
> country<-c("Hong Kong","Taiwan","China","Indonesia")
> region<-rep(c("Asian Tigers","East Asia & Pacific"),each=2)
> df<-data.frame(country=country,region=region)
Then run through column region and gather all the countries. We can use tapply, but I will use dlply from package plyr, since it retains list names.
> ll<-dlply(df,~region,function(d)paste(d$country,collapse=","))
> ll
$`Asian Tigers`
[1] "Hong Kong,Taiwan"
$`East Asia & Pacific`
[1] "China,Indonesia"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
region
1 Asian Tigers
2 East Asia & Pacific
Now convert the list to a data.frame using do.call. Since we want nice names, we need to pass the argument check.names = FALSE:
> ll$check.names <- FALSE
> do.call("data.frame",ll)
Asian Tigers East Asia & Pacific
1 Hong Kong,Taiwan China,Indonesia
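For completeness, a dplyr/tidyr sketch of the same reshaping should also work on the df created just above:
library(dplyr)
library(tidyr)
# Collapse the countries per region, then pivot so each region becomes its
# own column, leaving a single row of comma-separated country lists.
df %>%
  group_by(region) %>%
  summarise(countries = paste(country, collapse = ", ")) %>%
  pivot_wider(names_from = region, values_from = countries)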