This seems like it should be a fairly simple problem, but I can't seem to find a straightforward solution.
I have a character vector that looks like this:
my_info <- c("Fruits",
"North America",
"Apples",
"Michigan",
"Europe",
"Pomegranates",
"Greece",
"Oranges",
"Italy",
"Vegetables",
"North America",
"Potatoes",
"Idaho",
"Avocados",
"California",
"Europe",
"Artichokes",
"Italy",
"Meats",
"North America",
"Beef",
"Illinois")
I want to parse this character vector into a data frame that looks like this:
[Image: screenshot of R console showing the desired data frame]
The food types and the region lists will always remain the same, but the foods and their locations are subject to change.
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
I was thinking I needed to use something like str_split, but use the food_types and regions as some sort of a delimiter? But I am not sure how to proceed. The character vector does have an order to it.
Thank you.
One solution is to first convert the my_info vector into a matrix with ncol = 4, which splits the vector into rows of four elements.
Next, apply the rule for food_type and region: swap any food_type or region value that has landed in another column back into its own column.
Note: I ask the OP to check the data. With the description provided, every four elements do not form a complete row.
df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))
names(df) <- c("Foodtype", "Region", "Food", "Location")
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
t(apply(df,1,function(x){
for(i in seq_along(x)){
#One can think of writing a swap function here.
if(x[i] %in% region ){
temp = x[i]
x[i] = x[2]
x[2] = temp
}
#Swap any food_type wrongly placed in other column
if(x[i] %in% food_type ){
temp = x[i]
x[i] = x[1]
x[1] = temp
}
}
x
}))
# Foodtype Region Food Location
# [1,] "Fruits" "North America" "Apples" "Michigan"
# [2,] "Pomegranates" "Europe" "Greece" "Oranges"
# [3,] "Vegetables" "North America" "Italy" "Potatoes"
# [4,] "Idaho" "Europe" "California" "Avocados"
# [5,] "Meats" "North America" "Artichokes" "Italy"
# [6,] "Fruits" "North America" "Beef" "Illinois"
#
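The note about incomplete rows can be checked directly. A minimal sketch, using only the my_info vector from the question: the vector has 22 elements, which is not a multiple of 4, so matrix() has to recycle values from the start of the vector to fill the last row. That would explain the repeated "Fruits" and "North America" in row 6 of the output above.

```r
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")

n_elements <- length(my_info)   # number of elements in the input vector
leftover <- n_elements %% 4     # elements left over after filling rows of 4

# TRUE only when the vector divides evenly into 4-column rows
fits_evenly <- leftover == 0
```

Because fits_evenly is FALSE here, a fixed-width reshape cannot line the values up with their columns, so approaches that use food_type and region as markers are likely a better fit for this input.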
I have a long solution, but it should work as long as food and location are always in the same order.
First, create a few data frames with dplyr.
library(dplyr)
# tibble() replaces the deprecated data_frame()
info <- tibble(my_info = my_info)
region <- tibble(region_id = region, region = region)
food_type <- tibble(food_type_id = food_type, food_type)
Next, create a data.frame that joins all of these together, fill the missing values with tidyr, and remove the rows we do not need. The most important trick is the last one: creating a cols column based on the assumption that the order is always the same!
library(tidyr)
df <- info %>%
left_join(food_type, by = c("my_info" = "food_type_id")) %>%
left_join(region, by = c("my_info" = "region_id")) %>%
fill(food_type) %>%
group_by(food_type) %>%
fill(region) %>%
filter(!is.na(region) & !(my_info == region)) %>%
ungroup %>%
mutate(cols = rep(c("food", "location"), group_size(.)/2 ))
This returns:
# A tibble: 14 x 4
my_info food_type region cols
<chr> <chr> <chr> <chr>
1 Apples Fruits North America food
2 Michigan Fruits North America location
3 Pomegranates Fruits Europe food
4 Greece Fruits Europe location
5 Oranges Fruits Europe food
6 Italy Fruits Europe location
7 Beef Meats North America food
8 Illinois Meats North America location
9 Potatoes Vegetables North America food
10 Idaho Vegetables North America location
11 Avocados Vegetables North America food
12 California Vegetables North America location
13 Artichokes Vegetables Europe food
14 Italy Vegetables Europe location
Next use tidyr to spread the cols into food and location columns.
df <- df %>%
group_by(food_type, region, cols) %>%
mutate(ind = row_number()) %>%
spread(cols, my_info) %>%
select(-ind)
# A tibble: 7 x 4
# Groups: food_type, region [5]
food_type region food location
<chr> <chr> <chr> <chr>
1 Fruits Europe Pomegranates Greece
2 Fruits Europe Oranges Italy
3 Fruits North America Apples Michigan
4 Meats North America Beef Illinois
5 Vegetables Europe Artichokes Italy
6 Vegetables North America Potatoes Idaho
7 Vegetables North America Avocados California
This can all be done in one go by removing the intermediate step of creating a data.frame.
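In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The reshaping step might be sketched with it as follows. This is not the answer's code: the toy tibble long is copied by hand from the Fruits rows of the long-format output above so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Toy long-format input, copied from the Fruits rows of the tibble above
long <- tibble(
  my_info   = c("Apples", "Michigan", "Pomegranates", "Greece", "Oranges", "Italy"),
  food_type = "Fruits",
  region    = c("North America", "North America", "Europe", "Europe", "Europe", "Europe"),
  cols      = rep(c("food", "location"), 3)
)

wide <- long %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>%   # pairs the n-th food with the n-th location
  ungroup() %>%
  pivot_wider(names_from = cols, values_from = my_info) %>%
  select(-ind)
```

The ind column plays the same role as in the spread() version: without it, two foods in the same food_type and region would collide in a single cell.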
Here are three alternatives. All of them use na.locf0 from zoo and the cn vector only shown in the first.
1) Let cn be a vector the same length as my_info which identifies the output column number that each element of my_info belongs to. Let cdef be an output column definition vector of 1:4 with the output column names as its names. Then, for each output column, create a vector the same length as my_info that holds the my_info values belonging to that column and NA for all other elements. Use na.locf0 to fill in the NA values and keep the elements at the positions where cn is 4.
library(zoo)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])
giving:
> m
food_type region food location
[1,] "Fruits" "North America" "Apples" "Michigan"
[2,] "Fruits" "Europe" "Pomegranates" "Greece"
[3,] "Fruits" "Europe" "Oranges" "Italy"
[4,] "Vegetables" "North America" "Potatoes" "Idaho"
[5,] "Vegetables" "North America" "Avocados" "California"
[6,] "Vegetables" "Europe" "Artichokes" "Italy"
[7,] "Meats" "North America" "Beef" "Illinois"
We have created character matrix output since the output is entirely character but if you want a data frame anyways then use:
as.data.frame(m, stringsAsFactors = FALSE)
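For readers who would rather not depend on zoo, the fill-forward step in alternative 1 can be sketched in base R. This is an assumption-laden stand-in, not zoo's implementation: fill_forward is a hypothetical helper written for this example, and m_base is a made-up name for the result.

```r
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

# Column number for each element: 1 = food_type, 2 = region, 3 = food, 4 = location
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4

# Hypothetical helper: carry the last non-NA value forward, keep leading NAs
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # index of the most recent non-NA value (0 if none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 holds NA for elements with no value so far
}

cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m_base <- sapply(cdef, function(i) fill_forward(ifelse(cn == i, my_info, NA))[cn == 4])
```

The structure mirrors the zoo version line for line, so swapping fill_forward back to na.locf0 should give the same matrix.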
2) Alternately, we can create m from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix mm of NAs, using na.locf to fill in the NAs and taking those rows corresponding to column 4.
n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")
identical(m2, m) # test
## [1] TRUE
3) A third alternative for creating m from cn is to construct the matrix column by column like this:
m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3],
region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3],
food = my_info[cn == 3],
location = my_info[cn == 4])
identical(m, m3) # test
## [1] TRUE
Noob question, but how would I create a separate variable that is formed from specific attributes of other variables? For example, I'm trying to find Asian countries in the "region" variable that have a "democracy" variable score of "3." I want to create a variable called "asia3" that selects those Asian countries with a democracy score of 3.
The which function should do what you need. Column names are case-sensitive, so use them exactly as they appear in your data.
asia3 <- your_data[ which(your_data$region=='Asia' & your_data$democracy == 3), ]
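One practical difference between wrapping the condition in which() and using the bare logical vector shows up when the columns contain NA. A minimal sketch with made-up data (the values in your_data below are assumptions for illustration, not the asker's data):

```r
# Made-up example data containing missing values
your_data <- data.frame(
  region = c("Asia", "Asia", NA, "Europe"),
  democracy = c(3, 2, 3, NA),
  stringsAsFactors = FALSE
)

# The comparison yields NA wherever region is missing and democracy is 3
keep <- your_data$region == "Asia" & your_data$democracy == 3

with_which <- your_data[which(keep), ]   # which() drops the NA positions
without_which <- your_data[keep, ]       # an NA index produces an all-NA row
```

Here with_which has one row, while without_which has two because the NA index adds a row of missing values.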
In base R, you can create a new variable based on a condition using an ifelse statement, then assign to a new variable called asia3.
df$asia3 <- ifelse(df$region == "Asia" & df$democracy == 3, "yes", "no")
region democracy asia3
1 Asia 3 yes
2 Australia 3 no
3 Asia 2 no
4 Europe 1 no
Or if you only need a logical output, then you do not need the ifelse:
df$asia3 <- df$region == "Asia" & df$democracy == 3
region democracy asia3
1 Asia 3 TRUE
2 Australia 3 FALSE
3 Asia 2 FALSE
4 Europe 1 FALSE
or with tidyverse
library(tidyverse)
df %>%
mutate(asia3 = region == "Asia" & democracy == 3)
However, if you only want to keep the rows that meet those conditions, then you can:
#dplyr
df %>%
filter(region == "Asia" & democracy == 3)
#base R
df[df$region=='Asia' & df$democracy == 3, ]
# region democracy
#1 Asia 3
Data
df <-
structure(list(
region = c("Asia", "Australia", "Asia", "Europe"),
democracy = c(3, 3, 2, 1)
),
class = "data.frame",
row.names = c(NA,-4L))
I've got a df with country-level data entered in 2003.
Several rows of data belong to a country named 'Federal Republic of Yugoslavia'.
These are two separate countries today and I want to duplicate these rows of data so that I can rename each set of rows to its respective modern country name.
data.frame(Country = "Yugoslavia", Chickens = 567)
Using this minimal example, how do I create this dataframe?
data.frame(Country = c("Serbia", "Montenegro"), Chickens = 567)
You can do this in one tidyverse pipe:
library(tidyverse)
df2 <- df %>%
mutate(Country = if_else(Country == "Yugoslavia", "Serbia", as.character(Country))) %>%
bind_rows(df) %>%
mutate(Country = if_else(Country == "Yugoslavia", "Montenegro", as.character(Country)))
You could also use mutate_if instead of the if_else statements.
Country Chickens
1 Serbia 567
2 Montenegro 567
Before R 4.0.0, data.frame turned character columns into factors by default. The substitution above coerces the column to character.
If you want to preserve the factor class then just add:
%>% mutate(Country = as.factor(Country))
... at the end.
You can do something like this:
data2<-data[data$country=="Yugoslavia", ]
levels(data2$country)[levels(data2$country)=="Yugoslavia"]<-"Serbia"
levels(data$country)[levels(data$country)=="Yugoslavia"]<-"Montenegro"
rbind(data,data2)
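The levels() approach only works when the column is a factor. A minimal runnable sketch, assuming a made-up data frame named data with a factor column country (the names follow the snippet above; the values are borrowed from the example data used elsewhere in this thread):

```r
# Made-up example data; country is explicitly a factor
data <- data.frame(
  country = factor(c("Italy", "Yugoslavia", "Austria")),
  Chickens = c(2, 567, 3)
)

# Copy the Yugoslavia rows (note the comma: rows are selected, all columns kept)
data2 <- data[data$country == "Yugoslavia", ]

# Rename the level in the copy and in the original
levels(data2$country)[levels(data2$country) == "Yugoslavia"] <- "Serbia"
levels(data$country)[levels(data$country) == "Yugoslavia"] <- "Montenegro"

result <- rbind(data, data2)
```

If country is a character column instead, plain assignment such as data2$country[data2$country == "Yugoslavia"] <- "Serbia" would be the equivalent step.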
You can write a function which returns the duplicated and renamed rows like:
fun <- function(y) {
if(y[["Country"]] == "Yugoslavia") rbind(replace(y, "Country", "Serbia")
, replace(y, "Country", "Montenegro"))
else y
}
do.call("rbind", apply(x, 1, fun))
# Country Chickens
#[1,] "Italy" " 2"
#[2,] "Serbia" "567"
#[3,] "Montenegro" "567"
#[4,] "Austria" " 3"
Or if order does not matter:
rbind(x[x$Country != "Yugoslavia",]
, replace(x[x$Country == "Yugoslavia",], "Country", "Serbia")
, replace(x[x$Country == "Yugoslavia",], "Country", "Montenegro"))
# Country Chickens
#1 Italy 2
#3 Austria 3
#2 Serbia 567
#21 Montenegro 567
Data:
x <- data.frame(Country = c("Italy","Yugoslavia","Austria"), Chickens = c(2,567,3))
x
# Country Chickens
#1 Italy 2
#2 Yugoslavia 567
#3 Austria 3
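If the original row order should be preserved without a row-wise apply(), one possible sketch is to repeat the matching rows by index and then rename them. This is an alternative idea rather than part of the answer above; times and out are names made up for this example, and Country is assumed to be a character column.

```r
x <- data.frame(Country = c("Italy", "Yugoslavia", "Austria"),
                Chickens = c(2, 567, 3),
                stringsAsFactors = FALSE)

# Repeat each Yugoslavia row twice, every other row once
times <- ifelse(x$Country == "Yugoslavia", 2, 1)
out <- x[rep(seq_len(nrow(x)), times), ]

# The duplicated rows sit next to each other, so the two names alternate
out$Country[out$Country == "Yugoslavia"] <- c("Serbia", "Montenegro")
rownames(out) <- NULL
```

Unlike the apply() version, this keeps Chickens numeric, because the data frame is never converted to a character matrix.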
I have a table that looks like this:
Further down in the table, the countries in Target.Country are repeated in Source.Country, so the same combinations appear again but with different numbers, sums and means. When the combinations are the same, is it possible to sum the remaining columns together and add an additional column with the average?
For example:
Source.Country Target.Country number sum_intensity mean_intensity
North Korea South Korea 26492 10674.9 0.402
South Korea North Korea 34912 53848.3 1.542
To be:
Source.Country Target.Country number sum_intensity mean_intensity Average
North Korea South Korea 61404 64523.2 1.944 1.05
Any help would be great!
A similar solution to what @Axeman proposed in the comments:
library(purrr)
library(dplyr)
df=data.frame(Source.Country=c('North Korea', 'South Korea'),
Target.Country=c('South Korea', 'North Korea'),
number=c(26492, 34912),
sum_intensity=c(10674.9, 53848.3),
mean_intensity=c(0.402, 1.542))
df %>% mutate(grp = purrr::map2_chr(Source.Country, Target.Country, ~paste(sort(c(as.character(.x), as.character(.y))), collapse=' '))) %>%
group_by(grp) %>%
summarise(number = sum(number),
sum_intensity = sum(sum_intensity),
mean_intensity = sum(mean_intensity),
average = sum_intensity/number)
# # A tibble: 1 x 5
# grp number sum_intensity mean_intensity average
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 North Korea South Korea 61404. 64523. 1.94 1.05
A few minor tweaks:
The paste command does require the collapse argument.
as.character is needed to prevent the country names (stored as factors) from being coerced into integers.
mean_intensity can't be used as an output in the summary and then as an input, but an average of averages doesn't make much sense when number is unbalanced anyway, so I just recalculated the average from the sums.
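The grouping key is the core of this answer: both orderings of a country pair must map to the same string. As a hedged alternative to the row-wise sort() and paste(), the same key can be sketched with the vectorised pmin() and pmax(), which also accept character vectors. The vectors src and tgt below are made-up names holding the two columns from the example.

```r
src <- c("North Korea", "South Korea")
tgt <- c("South Korea", "North Korea")

# pmin/pmax pick the alphabetically first/last name of each pair,
# so the key is the same whichever way round the pair was recorded
key <- paste(pmin(src, tgt), pmax(src, tgt))
```

With factor columns, as.character() would still be needed first, for the same reason as in the answer.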
I increased the dataframe to check if the code was working properly
df1 <- data.frame(
Source.Country = c("North Korea", "South Korea", "Canada", "South Korea"),
Target.Country = c("South Korea", "North Korea", "South Korea", "Canada"),
number = c(26492, 34912, 26492, 34912),
sum_intensity = c(10674.9, 53848.3, 10674.9, 53848.3),
mean_intensity = c(0.402, 1.542, 0.402, 1.542),
stringsAsFactors = FALSE)
# Order-independent key: sort each pair of country names, then paste them together
df1$Countries <- apply(df1[, c("Source.Country", "Target.Country")], 1, function(x)
paste(sort(x), collapse=" "))
# Sum the numeric columns within each country pair, as the question asks
mtab2 <- aggregate(cbind(number, sum_intensity, mean_intensity) ~ Countries, data=df1, FUN=sum)
# Recompute the average from the summed columns
mtab2$Average <- mtab2$sum_intensity / mtab2$number
I have a column in a dataframe which includes 30 different countries. I want to group these countries into 5 new values.
For example,
I have
China
Japan
US
Canada
....
Aggregate to new variables:
Asia
Asia
North America
North America
....
One solution I am thinking about is using nested ifelse. However it seems that I need 4 or 5 nested ifelse to get what I need. I don't think that's a good way. I want to know other efficient solutions.
One option is to use a key/value dataset. The countrycode_data dataset from the countrycode package can serve this purpose. We match the 'country.name' column in 'countrycode_data' against the example data column ('Col1'); where there is no match, match returns NA. In the OP's example, 'US' returns NA because the 'country.name' is 'United States'. We can fall back on the abbreviated form in the 'cowc' column, but the abbreviation there is 'USA', so we find it with grep. I would suggest using grep for all NA elements in 'indx'. 'indx' can then be used to return 'region' from 'countrycode_data'.
library(countrycode)
indx <- match(df1$Col1, countrycode_data$country.name)
pat <- paste0('^',paste(df1$Col1[is.na(indx)], collapse='|'))
indx[is.na(indx)] <- grep(pat, countrycode_data$cowc)
countrycode_data$region[indx]
#[1] "Eastern Asia" "Eastern Asia" "Northern America" "Northern America"
NOTE: This returns a region that is somewhat more specific than the general 'Asia'.
If we use the 'continent' column,
countrycode_data$continent[indx]
#[1] "Asia" "Asia" "Americas" "Americas"
data
df1 <- structure(list(Col1 = c("China", "Japan", "US", "Canada")),
.Names = "Col1", class = "data.frame", row.names = c(NA, -4L))
Another approach is to use the recode function from the car package:
library(car)
dat$Region <- recode(dat$Country, "c('China', 'Japan') = 'Asia'; c('US','Canada') = 'North America'")
Country Region
1 China Asia
2 Japan Asia
3 US North America
4 Canada North America
There are only 30 countries, so you can make a few vectors as shown below, create a new column, and replace values according to the vectors. Matching with %in% is case-sensitive, so spell the names exactly as they appear in the data.
asia <- c("India", "China")
NorthAmerica <- c("US", "Canada")
df$continent <- df$countries
df$continent <- with(df, replace(continent, countries%in%asia,"Asia"))
df$continent <- with(df, replace(continent, countries%in%NorthAmerica,"North America"))
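The same idea can be condensed into a single named lookup vector, which avoids one replace() call per group. A minimal sketch, assuming a made-up data frame df with a character column countries; lookup is a hypothetical name, and only the four countries from the question are filled in.

```r
# Names are the countries, values are the groups they belong to
lookup <- c(China = "Asia", Japan = "Asia",
            US = "North America", Canada = "North America")

df <- data.frame(countries = c("China", "Japan", "US", "Canada"),
                 stringsAsFactors = FALSE)

# Indexing by name returns the group for each country;
# countries missing from lookup would come back as NA
df$continent <- unname(lookup[df$countries])
```

Extending this to 30 countries only means adding entries to lookup, and any country left out shows up as NA rather than silently keeping its original value.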
'continent' is a built-in destination code of the countrycode package. You can pass a vector of country names and get a vector of continent names back with...
library(countrycode)
countries <- c('China', 'Japan', 'US', 'Canada')
countrycode(countries, 'country.name', 'continent')
returns...
[1] "Asia" "Asia" "Americas" "Americas"
When using Veera's and Jay's approaches, make sure the column is a character vector rather than a factor so that new values can be assigned. Assigning a value that is not an existing level of a factor produces NA:
df$continent <- as.character(df$countries)
This question is related to: Searching a data.frame in R
I want to search for multiple patterns, e.g. 'america' and 'united', in
all fields
in a given field
How can this be done? The case needs to be ignored.
Data:
ddf
id country area
1 1 United States of America North America
2 2 United Kingdom Europe
3 3 United Arab Emirates Arab
4 4 Saudi Arabia Arab
5 5 Brazil South America
ddf = structure(list(id = 1:5, country = c("United States of America",
"United Kingdom", "United Arab Emirates", "Saudi Arabia", "Brazil"
), area = c("North America", "Europe", "Arab", "Arab", "South America"
)), .Names = c("id", "country", "area"), class = "data.frame", row.names = c(NA,
-5L))
EDIT: To clarify, I have to search with AND and not OR. In this example, only 'United States of America' (row number 1) should come. If I search for 'brazil' and 'america', row number 5 should come (i.e. different search strings can be in different columns).
This actually fails for the "brazil" & "america" case, but it was a useful test-bed for diagnosing the logical problems:
hasAm <- sapply( ddf, grepl, patt="america", ignore.case=TRUE)
ddf[ rowSums(hasAm) > 0 , ]
#----------
id country area
1 1 United States of America North America
5 5 Brazil South America
#---------
hasUn <- sapply( ddf, grepl, patt="united", ignore.case=TRUE)
#---------
ddf[ rowSums( hasAm & hasUn) > 0 , ]
#-----------
id country area
1 1 United States of America North America
This edited version generalizes that strategy although it requires entering the selection criteria as a formula. I needed to first collapse each matrix so that summing across the cbind()-ed values didn't pick up multiple hits on a single term. So I have two rowSums, the outer one being done on m-column matrices where m is the number of terms in the formula, and the inner one being done on n-column matrices where n is the number of columns in the data-argument:
dfsel <- function(form, data) {
vars = all.vars(form)
selmatx <- lapply( vars, function(v)
sapply (data, grepl, patt=v, ignore.case=TRUE))
data[ rowSums( do.call(cbind,
lapply(selmatx,
function(L) {rowSums(L) > 0}) ) ) == length(vars)
, ] }
Demonstration:
> res <- dfsel( ~ united + america , ddf)
> res
id country area
1 1 United States of America North America
> res <- dfsel( ~ brazil + america , ddf)
> res
id country area
5 5 Brazil South America
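The same two-level logic (any column per term, then all terms per row) can also be sketched with the search terms passed as a character vector instead of a formula, which allows terms that are not valid R names. search_all is a hypothetical helper written for this example, not part of the answer above.

```r
ddf <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where every pattern matches in at least one column
search_all <- function(data, patterns) {
  hits <- sapply(patterns, function(p) {
    # One logical column per data column; TRUE if the row matches p anywhere
    rowSums(sapply(data, grepl, pattern = p, ignore.case = TRUE)) > 0
  })
  data[rowSums(hits) == length(patterns), , drop = FALSE]
}

res_us <- search_all(ddf, c("united", "america"))
res_br <- search_all(ddf, c("brazil", "america"))
```

To restrict the search to a given field, the helper can be applied to a one-column subset, for example search_all(ddf["country"], c("united", "america")), and the resulting row names used to index ddf.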
Dumb way of solving it. Interested in other answers.
pattern<-c('America','United')
ddf1<-NULL
for (i in 1:length(pattern)){
new<-ddf[grep(paste0(pattern[i]),ddf$country),]
ddf1<-rbind(ddf1,new)
}
Going on the logic that no country in the world has "America" before "United" in its name, you could do
> f <- lapply(ddf, grep, pattern = "(united)(.*)(america)", ignore.case = TRUE)
> ddf[unique(unlist(f)), ]
# id country area
# 1 1 United States of America North America
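The pattern above depends on "united" appearing before "america". Within a single field, an order-independent AND can be sketched with lookahead assertions, which need perl = TRUE in grepl(). This is a swapped-in technique, not the answer's method, and the country vector below simply repeats the example data.

```r
country <- c("United States of America", "United Kingdom",
             "United Arab Emirates", "Saudi Arabia", "Brazil")

# Each (?=.*term) lookahead must succeed from the start of the string,
# so both terms are required but may occur in either order
both <- grepl("^(?=.*united)(?=.*america)", country,
              perl = TRUE, ignore.case = TRUE)

matched <- country[both]
```

This only covers the single-field case; terms that sit in different columns, such as "brazil" and "america", still need a per-column approach.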