I am trying to automatically match two dataframes with a (for?) loop.
> df_key
country_election keyword1 keyword2 keyword3 keyword4 keyword5
1 France Paris Rome Madrid London Marseille
2 Spain Valencia Berlin Manchester Zurich Milan
> df_country
city country
1 Paris France
2 Rome Italy
3 Madrid Spain
4 London United Kingdom
5 Marseille France
6 Valencia Spain
7 Berlin Germany
8 Manchester United Kingdom
9 Zurich Switzerland
10 Milan Italy
In this example I would like to match every keyword in df_key with df_country to add country columns.
country_election keyword1 country_1 keyword2 country_2 keyword3 country_3
1 France Paris France Rome Italy Madrid Spain
2 Spain Valencia Spain Berlin Germany Manchester United Kingdom
Finally, I'd also like to have a series of dummy variables checking whether country_i is equal to country_election. Thanks a lot for your help.
df_key <- structure(list(country_election = c("France", "Spain"), keyword1 = c("Paris", "Valencia"),
keyword2 = c("Rome", "Berlin"), keyword3 = c("Madrid", "Manchester"), keyword4 = c("London", "Zurich"),
keyword5 = c("Marseille", "Milan")), class = "data.frame", row.names = c(NA, -2L))
df_country <- structure(list(city = c("Paris", "Rome", "Madrid", "London", "Marseille", "Valencia",
"Berlin", "Manchester", "Zurich", "Milan"), country = c("France", "Italy", "Spain", "United Kingdom",
"France", "Spain", "Germany", "United Kingdom", "Switzerland", "Italy")),
class = "data.frame", row.names = c(NA, -10L))
You can match the city names, extract the country and create new columns. If the column order is important, extract the numeric part from it and order the data.
cols <- sub('keyword', 'country', names(df_key[-1]))
df_key[cols] <- df_country$country[match(as.matrix(df_key[-1]), df_country$city)]
df_key[order(as.numeric(sub('\\D+', '', names(df_key))), na.last = FALSE)]
# country_election keyword1 country1 keyword2 country2 keyword3
#1 France Paris France Rome Italy Madrid
#2 Spain Valencia Spain Berlin Germany Manchester
# country3 keyword4 country4 keyword5 country5
#1 Spain London United Kingdom Marseille France
#2 United Kingdom Zurich Switzerland Milan Italy
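The dummy variables requested in the question are not covered by the code above. A minimal sketch of one way to add them with a for loop, using a shortened copy of the question's data; the same1, same2, ... column names are an illustrative assumption, not something from the original post:

```r
# Shortened copy of the question's data (three keywords only)
df_key <- data.frame(
  country_election = c("France", "Spain"),
  keyword1 = c("Paris", "Valencia"),
  keyword2 = c("Rome", "Berlin"),
  keyword3 = c("Madrid", "Manchester")
)
df_country <- data.frame(
  city = c("Paris", "Rome", "Madrid", "Valencia", "Berlin", "Manchester"),
  country = c("France", "Italy", "Spain", "Spain", "Germany", "United Kingdom")
)

# Collect the keyword columns once, before any new columns are added
kw <- grep("^keyword", names(df_key), value = TRUE)

for (k in kw) {
  i <- sub("keyword", "", k)  # numeric suffix, e.g. "1"
  # Look up the country of each city in this keyword column
  ctry <- df_country$country[match(df_key[[k]], df_country$city)]
  df_key[[paste0("country", i)]] <- ctry
  # Hypothetical dummy column: 1 if the country equals country_election, else 0
  df_key[[paste0("same", i)]] <- as.integer(ctry == df_key$country_election)
}
```

A city that is missing from df_country gives NA in both the country column and the dummy column, which may be preferable to silently coding it as 0.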
I have a dataframe like this:
structure(list(from = c("China", "China", "Canada", "Canada",
"USA", "China", "Trinidad and Tobago", "China", "USA", "USA"),
to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
"Rep. of Korea", "Canada", "Japan"), weight = c(4766781396,
4039683737, 3419468319, 3216051707, 2535151299, 2513604035,
2303474559, 2096033823, 2091906420, 2066357443)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
from = c("Canada", "China", "China", "China", "Trinidad and Tobago",
"USA", "USA"), to = c("USA", "Japan", "Rep. of Korea", "USA",
"USA", "Canada", "Japan"), .rows = structure(list(3:4, 1:2,
8L, 6L, 7L, 9L, c(5L, 10L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), .drop = TRUE))
I would like to compute the absolute value of the difference in the weight column, grouped by from and to.
I'm trying with the function aggregate() but it seems to work for means and sums and not for difference. For example (df is the name of my dataframe):
aggregate(weight~from+to, data = df, FUN=mean)
which produces:
from to weight
1 USA Canada 2091906420
2 China Japan 4403232567
3 USA Japan 2300754371
4 China Rep. of Korea 2096033823
5 Canada USA 3317760013
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
EDIT. The desired result is instead
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
As we can see, the country pairs that appear twice in the columns from and to are collapsed into a single row whose weight is the difference between the two weights. E.g.,
from to weight
China Japan 4766781396
China Japan 4039683737
become
from to weight
China Japan 727097659
because
> 4766781396-4039683737
[1] 727097659
The difference should be positive (and this is why I wrote "the absolute value of difference of the weights").
The pairs of countries that appear in only one row of the dataframe df remain the same, e.g.
from to weight
7 Trinidad and Tobago USA 2303474559
Assuming at most 2 values per group and that the order of the difference is not important:
aggregate(weight ~ from + to, data = df, FUN = function(x) {
  abs(if (length(x) == 1) x else diff(x))
})
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
Is the following what you are looking for?
f <- function(x) abs(x[2] - x[1])
aggregate(weight ~ from + to, data = df, FUN = f)
#> from to weight
#> 1 USA Canada NA
#> 2 China Japan 727097659
#> 3 USA Japan 468793856
#> 4 China Rep. of Korea NA
#> 5 Canada USA 203416612
#> 6 China USA NA
#> 7 Trinidad and Tobago USA NA
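The NA rows in this output come from groups with a single row, where x[2] does not exist. A minimal variant that keeps the single value instead, assuming at most two rows per group; the helper name abs_diff and the ungrouped copy of the data are illustrative:

```r
# Ungrouped copy of the question's data
df <- data.frame(
  from = c("China", "China", "Canada", "Canada", "USA", "China",
           "Trinidad and Tobago", "China", "USA", "USA"),
  to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
         "Rep. of Korea", "Canada", "Japan"),
  weight = c(4766781396, 4039683737, 3419468319, 3216051707, 2535151299,
             2513604035, 2303474559, 2096033823, 2091906420, 2066357443)
)

# Hypothetical helper: return the lone value for a single-row group,
# otherwise the absolute difference of the first two values
abs_diff <- function(x) if (length(x) == 1) x else abs(x[2] - x[1])

res <- aggregate(weight ~ from + to, data = df, FUN = abs_diff)
res
```

With this helper, single-row pairs such as Trinidad and Tobago to USA keep their original weight, and two-row pairs such as China to Japan collapse to the difference.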
I am trying to learn how to "count by multiple groups" in R using the dplyr library. I generated some data, and now I want to count the number of people for each combination of city and country.
Can someone please tell me if the code I have written is correct?
library(dplyr)
Data_I_Have <- data.frame(
"Country" = c("USA", "USA", "USA", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "FRANCE", "UK"),
"City" = c("Chicago", "Chicago", "Boston", "Madrid", "Madrid", "Madrid", "Barcelona", "Barcelona", "NA", "Paris", "London"),
"Person" = c("John", "John", "Jim", "Jeff", "Joseph", "Jason", "Justin", "Jake", "Joe", "Jaccob", "Jon")
)
summary = Data_I_Have %>%
dplyr::group_by(Country, City)%>%
dplyr::summarise(COUNT = n())
summary = data.frame(summary)
Suppose I wanted to count the number of distinct names instead. Is this code correct?
summary_2 = Data_I_Have %>%
dplyr::group_by(Country,City)%>%
dplyr::summarise(UNIQUE_COUNT = n())
Is this correct as well?
Thanks
Try this:
library(dplyr)
#Code
Data_I_Have %>%
dplyr::group_by(Country,City)%>%
dplyr::summarise(UNIQUE_COUNT = n_distinct(Person))
With n() you will get the number of observations per group:
# A tibble: 7 x 3
# Groups: Country [4]
Country City UNIQUE_COUNT
<chr> <chr> <int>
1 FRANCE Paris 1
2 SPAIN Barcelona 2
3 SPAIN Madrid 3
4 SPAIN NA 1
5 UK London 1
6 USA Boston 1
7 USA Chicago 2
Whereas, with n_distinct() you will get the number of unique observations:
# A tibble: 7 x 3
# Groups: Country [4]
Country City UNIQUE_COUNT
<chr> <chr> <int>
1 FRANCE Paris 1
2 SPAIN Barcelona 2
3 SPAIN Madrid 3
4 SPAIN NA 1
5 UK London 1
6 USA Boston 1
7 USA Chicago 1
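For comparison, the same two counts can be sketched in base R with aggregate(), without dplyr. This is only an illustration of the difference between counting rows and counting distinct values; the result names n_obs and n_unique are assumptions:

```r
Data_I_Have <- data.frame(
  Country = c("USA", "USA", "USA", "SPAIN", "SPAIN", "SPAIN", "SPAIN",
              "SPAIN", "SPAIN", "FRANCE", "UK"),
  City = c("Chicago", "Chicago", "Boston", "Madrid", "Madrid", "Madrid",
           "Barcelona", "Barcelona", "NA", "Paris", "London"),
  Person = c("John", "John", "Jim", "Jeff", "Joseph", "Jason", "Justin",
             "Jake", "Joe", "Jaccob", "Jon")
)

# Number of rows per Country/City combination (what n() gives)
n_obs <- aggregate(Person ~ Country + City, data = Data_I_Have, FUN = length)

# Number of distinct names per combination (what n_distinct() gives)
n_unique <- aggregate(Person ~ Country + City, data = Data_I_Have,
                      FUN = function(x) length(unique(x)))
```

The two results differ only for Chicago, where "John" appears twice.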
Is this what you're looking for?
library(tidyverse)
Data_I_Have <- data.frame(
"Country" = c("USA", "USA", "USA", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "SPAIN", "FRANCE", "UK"),
"City" = c("Chicago", "Chicago", "Boston", "Madrid", "Madrid", "Madrid", "Barcelona", "Barcelona", "NA", "Paris", "London"),
"Person" = c("John", "John", "Jim", "Jeff", "Joseph", "Jason", "Justin", "Jake", "Joe", "Jaccob", "Jon")
)
Data_I_Have
#> Country City Person
#> 1 USA Chicago John
#> 2 USA Chicago John
#> 3 USA Boston Jim
#> 4 SPAIN Madrid Jeff
#> 5 SPAIN Madrid Joseph
#> 6 SPAIN Madrid Jason
#> 7 SPAIN Barcelona Justin
#> 8 SPAIN Barcelona Jake
#> 9 SPAIN NA Joe
#> 10 FRANCE Paris Jaccob
#> 11 UK London Jon
Data_I_Have %>%
distinct(Country, City, Person) %>%
group_by(Country, City) %>%
summarise(n_unique_names=length(Person))
#> `summarise()` regrouping output by 'Country' (override with `.groups` argument)
#> # A tibble: 7 x 3
#> # Groups: Country [4]
#> Country City n_unique_names
#> <chr> <chr> <int>
#> 1 FRANCE Paris 1
#> 2 SPAIN Barcelona 2
#> 3 SPAIN Madrid 3
#> 4 SPAIN NA 1
#> 5 UK London 1
#> 6 USA Boston 1
#> 7 USA Chicago 1
Created on 2020-12-02 by the reprex package (v0.3.0)
Is there an R function to check whether a word from my list is present in a string and, if so, return another value?
Address
10 Sydney, South East
11 Mumbai, North West
12 London, Central Town
.
City Country
Mumbai India
Sydney Australia
London Britain
Output:
Address Country
10 Sydney, South East Australia
11 Mumbai, North West India
12 London, Central Town Britain
Sample code -
influencer %>%
mutate(AC.Name = AC_Village$AC.Name[match(str_extract(Complete.Address,
paste(AC_Village$Town, collapse="|")), AC_Village$Town)])
One option is to extract from the 'Address' column whichever 'City' of the second dataset it contains, match that against the 'City' column, and take the corresponding 'Country'. Note the argument order in match(): the extracted city comes first and the lookup column second.
library(tidyverse)
df1 %>%
mutate(Country = df2$Country[match(str_extract(Address,
paste(df2$City, collapse="|")), df2$City)])
# Address Country
#1 10 Sydney, South East Australia
#2 11 Mumbai, North West India
#3 12 London, Central Town Britain
data
df1 <- structure(list(Address = c("10 Sydney, South East", "11 Mumbai, North West",
"12 London, Central Town")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(City = c("Mumbai", "Sydney", "London"), Country = c("India",
"Australia", "Britain")), class = "data.frame", row.names = c(NA,
-3L))
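The same lookup can be sketched in base R, which also makes it easy to see what happens to an address that contains none of the listed cities. The fourth address below is an invented example added only to show the NA case:

```r
df1 <- data.frame(Address = c("10 Sydney, South East", "11 Mumbai, North West",
                              "12 London, Central Town", "13 Paris, Old Town"))
df2 <- data.frame(City = c("Mumbai", "Sydney", "London"),
                  Country = c("India", "Australia", "Britain"))

# Build one alternation pattern from all known cities
pattern <- paste(df2$City, collapse = "|")

# regexpr() returns -1 where nothing matched; regmatches() returns only the hits
m <- regexpr(pattern, df1$Address)
city <- rep(NA_character_, nrow(df1))
city[m > 0] <- regmatches(df1$Address, m)

# Look up each extracted city in df2; unmatched addresses stay NA
df1$Country <- df2$Country[match(city, df2$City)]
df1
```

Because the pattern is a plain alternation, a city name that occurs inside a longer word would also match; adding word boundaries (\\b) around the pattern is one possible refinement.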
Subset of data frame:
country1 country2
Japan Japan
Netherlands <NA>
<NA> <NA>
Brazil Brazil
Russian Federation <NA>
<NA> <NA>
<NA> United States of America
Germany Germany
Ukraine <NA>
Japan Japan
<NA> Russian Federation
<NA> United States of America
France France
New Zealand New Zealand
Japan <NA>
I have two character vectors, country1 and country2, which I would like to merge together into a new column. No observations in my dataset have different countries. However, some pairs have duplicated values which I would like only to display once. There is also the issue of the NAs, which I want to omit in the merged column, where each value in the new column only has the country string. A few observations have NAs in both of my columns, which I just want to leave as NA in the new column. I'm wondering what the best way to tackle this would be.
I've made a minor modification to the function in the top-voted answer to a similar question here, changing the separator from a comma to nothing.
However, this leaves the repeating issue unsolved:
country1 country2 merge
Japan Japan JapanJapan
Netherlands <NA> Netherlands
<NA> <NA> <NA>
Brazil Brazil BrazilBrazil
Russian Federation <NA> Russian Federation
<NA> <NA> <NA>
<NA> United States of America United States of America
Germany Germany GermanyGermany
Ukraine <NA> Ukraine
Japan Japan JapanJapan
<NA> Russian Federation Russian Federation
<NA> United States of America United States of America
France France FranceFrance
New Zealand New Zealand New ZealandNew Zealand
Japan <NA> Japan
Since you specified dplyr, here's a one-liner with it:
df <- dplyr::mutate(df, merge = dplyr::if_else(is.na(country1), country2, country1))
Data
country1 <- c("Japan", "Netherlands", NA, "Brazil", "Russian Federation", NA, NA, "Germany", "Ukraine", "Japan", NA, NA, "France", "New Zealand", "Japan")
country2 <- c("Japan", NA, NA, "Brazil", NA, NA, "United States of America", "Germany", NA, "Japan", "Russian Federation", "United States of America", "France", "New Zealand", NA)
df <- data.frame(country1, country2, stringsAsFactors = FALSE)
Since you said you have character vectors, then:
library(tidyverse)
coalesce(country1,country2)
[1] "Japan" "Netherlands" NA
[4] "Brazil" "Russian Federation" NA
[7] "United States of America" "Germany" "Ukraine"
[10] "Japan" "Russian Federation" "United States of America"
[13] "France" "New Zealand" "Japan"
If it's a data frame, just do coalesce(!!!df).
You can also just replace the NA values in the 1st column with values from the 2nd:
df$country1[is.na(df$country1)] <- df$country2[is.na(df$country1)]
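If country1 should stay untouched, the same idea can write into a new column instead. A minimal base R sketch on a shortened copy of the data, assuming (as the question states) that the two columns never hold different countries in the same row:

```r
country1 <- c("Japan", "Netherlands", NA, NA, "Germany")
country2 <- c("Japan", NA, NA, "United States of America", "Germany")
df <- data.frame(country1, country2, stringsAsFactors = FALSE)

# Take country1 where it is present, otherwise fall back to country2;
# rows where both are NA remain NA
df$merge <- ifelse(is.na(df$country1), df$country2, df$country1)
df
```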
I would like to replace the value in column Name.x with the value from Name.y wherever Name.y is not NA (i.e. not an empty row).
Name.x Name.y
US NA
Germany NA
Germany France
Canada NA
Italy Morocco
Austria Belgium
Result:
Name.x
US
Germany
France
Canada
Morocco
Belgium
Example data:
a <- data.frame("Name.x" = c("US", "Germany","Germany", "Canada", "Italy", "Austria"), "Name.y" = c(NA, NA, "France", NA, "Morocco", "Belgium"))
Solution:
a$Name.x <- ifelse(is.na(a$Name.y), as.character(a$Name.x), as.character(a$Name.y))
Try something like this:
Your data.frame
db1<-data.frame(Name.x=c("US","Germany"),
Name.y=c(NA,"France"))
db1
Name.x Name.y
US <NA>
Germany France
Column names to analyze/substitute
coldb1_NA<-"Name.y"
coldb_toSub<-"Name.x"
Substitution
db2<-db1
db2[,coldb_toSub]<-as.character(db1[,coldb_toSub])
db2[!is.na(db1[,coldb1_NA]),coldb_toSub]<-as.character(db1[!is.na(db1[,coldb1_NA]),coldb1_NA])
Your output
db2
Name.x Name.y
1 US <NA>
2 France France