I've got a df with country-level data entered in 2003.
Several rows of data belong to a country named 'Federal Republic of Yugoslavia'.
These are two separate countries today and I want to duplicate these rows of data so that I can rename each set of rows to its respective modern country name.
data.frame(Country = "Yugoslavia", Chickens = 567)
Using this minimal example, how do I create this dataframe?
data.frame(Country = c("Serbia", "Montenegro"), Chickens = 567)
You can do this in one tidyverse pipe:
library(tidyverse)
df2 <- df %>%
mutate(Country = if_else(Country == "Yugoslavia", "Serbia", as.character(Country))) %>%
bind_rows(df) %>%
mutate(Country = if_else(Country == "Yugoslavia", "Montenegro", as.character(Country)))
You could also use mutate_if instead of the if_else statements.
Country Chickens
1 Serbia 567
2 Montenegro 567
Before R 4.0, data.frame turned character columns into factors by default. The substitution above coerces the column to character.
If you want to preserve the factor class then just add:
%>% mutate(Country = as.factor(Country))
... at the end.
You can do something like this:
data2<-data[data$country=="Yugoslavia", ]
levels(data2$country)[levels(data2$country)=="Yugoslavia"]<-"Serbia"
levels(data$country)[levels(data$country)=="Yugoslavia"]<-"Montenegro"
rbind(data,data2)
You can write a function which returns the duplicated and renamed rows like:
fun <- function(y) {
if(y[["Country"]] == "Yugoslavia") rbind(replace(y, "Country", "Serbia")
, replace(y, "Country", "Montenegro"))
else y
}
do.call("rbind", apply(x, 1, fun))
# Country Chickens
#[1,] "Italy" " 2"
#[2,] "Serbia" "567"
#[3,] "Montenegro" "567"
#[4,] "Austria" " 3"
Or if order does not matter:
rbind(x[x$Country != "Yugoslavia",]
, replace(x[x$Country == "Yugoslavia",], "Country", "Serbia")
, replace(x[x$Country == "Yugoslavia",], "Country", "Montenegro"))
# Country Chickens
#1 Italy 2
#3 Austria 3
#2 Serbia 567
#21 Montenegro 567
Data:
x <- data.frame(Country = c("Italy","Yugoslavia","Austria"), Chickens = c(2,567,3))
x
# Country Chickens
#1 Italy 2
#2 Yugoslavia 567
#3 Austria 3
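If the original row order should be kept without going through apply (which turns the result into a character matrix), one possible sketch is to repeat the row indices and then rename the copies. The names successors, is_yu, times and out are illustrative only and not from the answers above.

```r
# Sample data, same as in the answer above
x <- data.frame(Country = c("Italy", "Yugoslavia", "Austria"),
                Chickens = c(2, 567, 3),
                stringsAsFactors = FALSE)

# Hypothetical helper vector: the modern names that replace the old one
successors <- c("Serbia", "Montenegro")

# Repeat each Yugoslavia row once per successor, every other row once
is_yu <- x$Country == "Yugoslavia"
times <- ifelse(is_yu, length(successors), 1)
out <- x[rep(seq_len(nrow(x)), times), ]

# The copies of each row are adjacent, so the successor names
# can be assigned in order
out$Country[out$Country == "Yugoslavia"] <- rep(successors, sum(is_yu))
rownames(out) <- NULL
out
```

This keeps Chickens numeric and leaves the other rows where they were.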
Noob question, but how would I create a separate variable that is formed from specific attributes of other variables? For example, I'm trying to find Asian countries in the "region" variable that have a "democracy" variable score of "3." I want to create a variable called "asia3" that selects those Asian countries with a democracy score of 3.
The which function should do what you need.
asia3 <- your_data[ which(your_data$region == 'Asia' & your_data$democracy == 3), ]
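One reason to wrap the condition in which is missing values: a logical index containing NA returns all-NA rows, while which drops them. A small sketch with a made-up data frame d (not the asker's data) shows the difference.

```r
# Made-up data with missing values in both columns
d <- data.frame(region = c("Asia", NA, "Asia"),
                democracy = c(3, 3, NA),
                stringsAsFactors = FALSE)

# The condition evaluates to NA wherever it cannot be decided
keep <- d$region == "Asia" & d$democracy == 3

plain   <- d[keep, ]         # NA entries in keep produce rows full of NA
cleaned <- d[which(keep), ]  # which() keeps only the positions that are TRUE

nrow(plain)
nrow(cleaned)
```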
In base R, you can create a new variable based on a condition using an ifelse statement, then assign to a new variable called asia3.
df$asia3 <- ifelse(df$region == "Asia" & df$democracy == 3, "yes", "no")
region democracy asia3
1 Asia 3 yes
2 Australia 3 no
3 Asia 2 no
4 Europe 1 no
Or if you only need a logical output, then you do not need the ifelse:
df$asia3 <- df$region == "Asia" & df$democracy == 3
region democracy asia3
1 Asia 3 TRUE
2 Australia 3 FALSE
3 Asia 2 FALSE
4 Europe 1 FALSE
Or with the tidyverse:
library(tidyverse)
df %>%
mutate(asia3 = region == "Asia" & democracy == 3)
However, if you only want to keep the rows that meet those conditions, then you can:
#dplyr
df %>%
filter(region == "Asia" & democracy == 3)
#base R
df[df$region=='Asia' & df$democracy == 3, ]
# region democracy
#1 Asia 3
Data
df <-
structure(list(
region = c("Asia", "Australia", "Asia", "Europe"),
democracy = c(3, 3, 2, 1)
),
class = "data.frame",
row.names = c(NA,-4L))
I'm creating a summary table that groups my records by country of destination:
SummarybyLocation <- PSTNRecords %>%
group_by(Destination) %>%
summarize(
Calls = n(),
Minutes = sum(durationMinutes),
MaxDuration = max(durationMinutes),
AverageDuration = mean(durationMinutes),
Charges = sum(charge),
Fees = sum(connectionCharge)
)
SummarybyLocation
The resulting table is as follows:
I realized that the Destination values are inconsistent (for example, "France" and "FR" both refer to the same area), and I also have a "North America" that I presume covers the USA and Canada.
I was wondering if there's a way of creating custom groups for these values, so that the aggregation would make more sense. I tried to use the countrycode package to add an iso2c column, but that doesn't resolve the problem of managing other area aggregations like "North America".
I would really appreciate some suggestions on how to handle this.
Thanks in advance!
Here is one way to clean up the data, using a very minimal example. First, I get a list of country names with their 2- and 3-letter abbreviations and put it into a data frame, countries. Then I left_join countries to df on the 2-letter code, which in this case matches FR. I repeat the left_join with the 3-letter code, which has no matches in this case. Next, I coalesce the two new columns, Country.x and Country.y, into one. Then I use case_when to apply a series of if-else conditions. First, if Country is not NA, I replace Destination with the full country name. This is where you can add other conditions if you have other items (e.g., Europe) that also need fixing. Next, I replace North America with "United States-Canada-Mexico". Finally, I remove the columns that start with "Country".
library(XML)
library(RCurl)
library(rlist)
library(tidyverse)
theurl <-
getURL("https://www.iban.com/country-codes",
.opts = list(ssl.verifypeer = FALSE))
countries <- readHTMLTable(theurl)
countries <-
list.clean(countries, fun = is.null, recursive = FALSE)[[1]]
df %>%
left_join(.,
countries %>% select(Country, `Alpha-2 code`),
by = c("Destination" = "Alpha-2 code")) %>%
left_join(.,
countries %>% select(Country, `Alpha-3 code`),
by = c("Destination" = "Alpha-3 code")) %>%
mutate(
Country = coalesce(Country.x, Country.y),
Destination = case_when(!is.na(Country) ~ Country,
Destination == "North America" ~ "United States-Canada-Mexico",
TRUE ~ Destination
)) %>%
select(-c(starts_with("Country")))
Output
Destination durationMinutes charge connectionCharge
1 France 6.57 0.00 0
2 France 3.34 1.94 0
3 United States 234.40 3.00 0
4 United States-Canada-Mexico 23.40 2.00 0
However, if you have a lot of different variations, then you probably just want to create a simple dataframe with the substitutions, as then you can just do one left_join.
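A minimal sketch of that substitution-table idea, written in base R with match(), which acts like a left join on a single key. The lookup table and its column names raw and clean are hypothetical; with dplyr the same table could be passed to left_join instead.

```r
# Small sample in the shape of the data used above
df <- data.frame(
  Destination = c("FR", "France", "United States", "North America"),
  durationMinutes = c(6.57, 3.34, 234.4, 23.4),
  stringsAsFactors = FALSE
)

# Hypothetical substitution table: one row per value that needs fixing
lookup <- data.frame(
  raw   = c("FR", "North America"),
  clean = c("France", "United States-Canada-Mexico"),
  stringsAsFactors = FALSE
)

# Find each Destination in the table; unmatched values give NA
idx <- match(df$Destination, lookup$raw)

# Replace matched values, keep everything else unchanged
df$Destination <- ifelse(is.na(idx), df$Destination, lookup$clean[idx])

# The summary now groups "FR" and "France" together
totals <- aggregate(durationMinutes ~ Destination, data = df, FUN = sum)
totals
```

New variants only need a new row in the lookup table, not another branch in the code.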
Another option is to also add in a Continent column, which you could get from countrycode.
library(countrycode)
countrycode(sourcevar = df$Destination,
origin = "country.name",
destination = "continent")
[1] NA "Europe" "Americas" NA
Data
df <- structure(list(Destination = c("FR", "France", "United States",
"North America"), durationMinutes = c(6.57, 3.34, 234.4, 23.4
), charge = c(0, 1.94, 3, 2), connectionCharge = c(0, 0, 0, 0
)), class = "data.frame", row.names = c(NA, -4L))
The {countrycode} package can deal with custom names/codes easily...
library(tidyverse)
library(countrycode)
PSTNRecords <- tibble::tribble(
~Destination, ~durationMinutes, ~charge, ~connectionCharge,
"FR", 1, 2.5, 0.3,
"France", 1, 2.5, 0.3,
"United States", 1, 2.5, 0.3,
"USA", 1, 2.5, 0.3,
"North America", 1, 2.5, 0.3
)
# see what special codes/country names you have to deal with
iso3cs <- countrycode(PSTNRecords$Destination, "country.name", "iso3c", warn = FALSE)
unique(PSTNRecords$Destination[is.na(iso3cs)])
#> [1] "FR" "North America"
# decide how to deal with them
custom_matches <- c("FR" = "FRA", "North America" = "USA")
# use your custom codes
PSTNRecords %>%
mutate(iso3c = countrycode(Destination, "country.name", "iso3c", custom_match = custom_matches))
#> # A tibble: 5 × 5
#> Destination durationMinutes charge connectionCharge iso3c
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 FR 1 2.5 0.3 FRA
#> 2 France 1 2.5 0.3 FRA
#> 3 United States 1 2.5 0.3 USA
#> 4 USA 1 2.5 0.3 USA
#> 5 North America 1 2.5 0.3 USA
I want to extract part of the data from one column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use separate() from tidyr, which is part of the tidyverse.
Note that I use NA rather than NULL inside the mobile vector.
I also set stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
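The note about using NA instead of NULL can be checked directly: c() silently drops NULL, so the vector ends up shorter than the other columns, while NA keeps its position. A short sketch (the variable names with_null and with_na are illustrative):

```r
# NULL entries vanish when combined with c()
with_null <- c(NULL, "+3579514862", NULL, "+5554848123")

# NA entries are kept as placeholders for the missing numbers
with_na <- c(NA, "+3579514862", NA, "+5554848123")

length(with_null)  # shorter than the 4 rows of the data frame
length(with_na)    # one entry per row
```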
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
I have a table that looks like this:
Further down in the table, the countries in Target.Country are repeated in Source.Country, so the same combinations appear again but with different numbers, sums and means. When the combinations are the same, is it possible to sum the remaining columns together and add an additional column with the average?
For example:
Source.Country Target.Country number sum_intensity mean_intensity
North Korea South Korea 26492 10674.9 0.402
South Korea North Korea 34912 53848.3 1.542
To be:
Source.Country Target.Country number sum_intensity mean_intensity Average
North Korea South Korea 61404 64523.2 1.944 1.05
Any help would be great!
A similar solution to what @Axeman proposed in the comments:
library(purrr)
library(dplyr)
df=data.frame(Source.Country=c('North Korea', 'South Korea'),
Target.Country=c('South Korea', 'North Korea'),
number=c(26492, 34912),
sum_intensity=c(10674.9, 53848.3),
mean_intensity=c(0.402, 1.542))
df %>% mutate(grp = purrr::map2_chr(Source.Country, Target.Country, ~paste(sort(c(as.character(.x), as.character(.y))), collapse=' '))) %>%
group_by(grp) %>%
summarise(number = sum(number),
sum_intensity = sum(sum_intensity),
mean_intensity = sum(mean_intensity),
average = sum_intensity/number)
# # A tibble: 1 x 5
# grp number sum_intensity mean_intensity average
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 North Korea South Korea 61404. 64523. 1.94 1.05
A few minor tweaks:
it does require collapse in the paste command
needs as.character to prevent the country names from being coerced into integers
mean_intensity can't be used as an output in the summary, then as an input, but an average of averages doesn't make much sense when number is unbalanced anyway. I just recalculated the average from the sums
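To illustrate the last point with the two rows from the question: averaging the two per-row means weights both rows equally, whereas dividing the summed intensity by the summed number weights each row by its count. A quick sketch in base R (the names mean_of_means and pooled_mean are illustrative):

```r
# Values from the two example rows
number <- c(26492, 34912)
sum_intensity <- c(10674.9, 53848.3)

# Unweighted average of the two row means (about 0.97)
mean_of_means <- mean(sum_intensity / number)

# Average recalculated from the sums (about 1.05)
pooled_mean <- sum(sum_intensity) / sum(number)

c(mean_of_means = mean_of_means, pooled_mean = pooled_mean)
```

The second value matches the average column in the output above.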
I increased the dataframe to check if the code was working properly
df1 <- data.frame(
  Source.Country = c("North Korea", "South Korea", "Canada", "South Korea"),
  Target.Country = c("South Korea", "North Korea", "South Korea", "Canada"),
  number = c(26492, 34912, 26492, 34912),
  sum_intensity = c(10674.9, 53848.3, 10674.9, 53848.3),
  mean_intensity = c(0.402, 1.542, 0.402, 1.542),
  stringsAsFactors = FALSE)
# Order-independent key: the sorted pair of country names
df1$Countries <- apply(df1[, c("Source.Country", "Target.Country")], 1,
  function(x) paste(sort(x), collapse = " "))
# Aggregate each numeric column by country pair; swap FUN = mean for FUN = sum to get totals
m1 <- aggregate(number ~ Countries, data = df1, FUN = mean)
m2 <- aggregate(sum_intensity ~ Countries, data = df1, FUN = mean)
m3 <- aggregate(mean_intensity ~ Countries, data = df1, FUN = mean)
# The shared Countries column is the merge key
mvtab <- merge(m1, m2)
mtab2 <- merge(mvtab, m3)
This seems like it should be a fairly simple problem, but I can't seem to find a straightforward solution.
I have a character list that looks like this:
my_info <- c("Fruits",
"North America",
"Apples",
"Michigan",
"Europe",
"Pomegranates",
"Greece",
"Oranges",
"Italy",
"Vegetables",
"North America",
"Potatoes",
"Idaho",
"Avocados",
"California",
"Europe",
"Artichokes",
"Italy",
"Meats",
"North America",
"Beef",
"Illinois")
I want to parse this character vector into a data frame that looks like this:
[Image: screenshot of R console showing the desired data frame]
The food types and the region lists will always remain the same, but the foods and their locations are subject to change.
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
I was thinking I needed to use something like str_split, but use the food_types and regions as some sort of a delimiter? But I am not sure how to proceed. The character vector does have an order to it.
Thank you.
One solution is to first convert your my_info vector into a matrix using ncol = 4, which splits the vector into the rows of a matrix/data frame.
Then you can apply the food_type and region rules and swap any food_type or region value that ends up in another column.
Note: the OP should check the data; every 4 elements do not seem to make up a complete row as described by the OP.
df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))
names(df) <- c("Foodtype", "Region", "Food", "Location")
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
t(apply(df,1,function(x){
for(i in seq_along(x)){
#One can think of writing a swap function here.
if(x[i] %in% region ){
temp = x[i]
x[i] = x[2]
x[2] = temp
}
#Swap any food_type wrongly placed in other column
if(x[i] %in% food_type ){
temp = x[i]
x[i] = x[1]
x[1] = temp
}
}
x
}))
# Foodtype Region Food Location
# [1,] "Fruits" "North America" "Apples" "Michigan"
# [2,] "Pomegranates" "Europe" "Greece" "Oranges"
# [3,] "Vegetables" "North America" "Italy" "Potatoes"
# [4,] "Idaho" "Europe" "California" "Avocados"
# [5,] "Meats" "North America" "Artichokes" "Italy"
# [6,] "Fruits" "North America" "Beef" "Illinois"
#
I have a long solution, but it should work as long as food and location are always in the same order.
First, create a few data frames with dplyr.
library(dplyr)
# tibble() replaces the deprecated data_frame()
info <- tibble(my_info = my_info)
region <- tibble(region_id = region, region = region)
food_type <- tibble(food_type_id = food_type, food_type = food_type)
Next, create a data frame that joins all of these together, fill the missing values with tidyr, and remove the rows we do not need. The most important trick is the last one: creating a cols column based on the assumption that the order is always the same!
library(tidyr)
df <- info %>%
left_join(food_type, by = c("my_info" = "food_type_id")) %>%
left_join(region, by = c("my_info" = "region_id")) %>%
fill(food_type) %>%
group_by(food_type) %>%
fill(region) %>%
filter(!is.na(region) & !(my_info == region)) %>%
ungroup %>%
mutate(cols = rep(c("food", "location"), group_size(.)/2 ))
This returns:
# A tibble: 14 x 4
my_info food_type region cols
<chr> <chr> <chr> <chr>
1 Apples Fruits North America food
2 Michigan Fruits North America location
3 Pomegranates Fruits Europe food
4 Greece Fruits Europe location
5 Oranges Fruits Europe food
6 Italy Fruits Europe location
7 Beef Meats North America food
8 Illinois Meats North America location
9 Potatoes Vegetables North America food
10 Idaho Vegetables North America location
11 Avocados Vegetables North America food
12 California Vegetables North America location
13 Artichokes Vegetables Europe food
14 Italy Vegetables Europe location
Next use tidyr to spread the cols into food and location columns.
df <- df %>%
group_by(food_type, region, cols) %>%
mutate(ind = row_number()) %>%
spread(cols, my_info) %>%
select(-ind)
# A tibble: 7 x 4
# Groups: food_type, region [5]
food_type region food location
<chr> <chr> <chr> <chr>
1 Fruits Europe Pomegranates Greece
2 Fruits Europe Oranges Italy
3 Fruits North America Apples Michigan
4 Meats North America Beef Illinois
5 Vegetables Europe Artichokes Italy
6 Vegetables North America Potatoes Idaho
7 Vegetables North America Avocados California
This can all be done in one go, just remove the intermediate step of creating a data.frame.
Here are three alternatives. All of them use na.locf0 from zoo and the cn vector only shown in the first.
1) Let cn be a vector the same length as my_info that gives, for each element of my_info, the number of the output column it belongs to. Let cdef be an output column definition vector, 1:4, with the output column names as its names. Then, for each output column, create a vector the same length as my_info that holds the elements belonging to that column and NA everywhere else. Use na.locf0 to fill in the NA values and keep the elements at the positions where cn is 4.
library(zoo)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])
giving:
> m
food_type region food location
[1,] "Fruits" "North America" "Apples" "Michigan"
[2,] "Fruits" "Europe" "Pomegranates" "Greece"
[3,] "Fruits" "Europe" "Oranges" "Italy"
[4,] "Vegetables" "North America" "Potatoes" "Idaho"
[5,] "Vegetables" "North America" "Avocados" "California"
[6,] "Vegetables" "Europe" "Artichokes" "Italy"
[7,] "Meats" "North America" "Beef" "Illinois"
We have created a character matrix since the output is entirely character, but if you want a data frame anyway, use:
as.data.frame(m, stringsAsFactors = FALSE)
2) Alternately, we can create m from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix mm of NAs, using na.locf to fill in the NAs and taking those rows corresponding to column 4.
n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")
identical(m2, m) # test
## [1] TRUE
3) A third alternative for creating m from cn is to construct the matrix column by column like this:
m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3],
region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3],
food = my_info[cn == 3],
location = my_info[cn == 4])
identical(m, m3) # test
## [1] TRUE
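For readers who prefer to avoid the zoo dependency, the same forward-fill idea can be sketched in base R. The helper fill_down is hypothetical (it is not part of zoo or of the answers above), the sample vector is a shortened version of my_info, and the sketch assumes that foods and locations strictly alternate.

```r
food_type <- c("Fruits", "Vegetables")
region <- c("North America", "Europe")

# Shortened sample in the same layout as my_info
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho")

# Hypothetical helper: carry the last non-NA value forward
fill_down <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))  # position of last non-NA so far
  idx[idx == 0] <- NA                      # leading NAs stay NA
  v[idx]
}

# Forward-filled food type and region for every position
type_col <- fill_down(ifelse(my_info %in% food_type, my_info, NA))
region_col <- fill_down(ifelse(my_info %in% region, my_info, NA))

# Everything that is neither a type nor a region is a food or a location,
# assumed to alternate: food, location, food, location, ...
is_item <- !(my_info %in% c(food_type, region))
items <- my_info[is_item]
is_food <- seq_along(items) %% 2 == 1

out <- data.frame(
  food_type = type_col[is_item][is_food],
  region = region_col[is_item][is_food],
  food = items[is_food],
  location = items[!is_food],
  stringsAsFactors = FALSE
)
out
```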