R: Create or modify a dataframe using dplyr

I am very new to programming and am learning how to use dplyr. I am wondering how to solve this problem:
I have this dataframe:
countries <- c("USA","Canada","Denmark","Albania", "Turkey","France", "Italy")
values <- c(1, 1, 3, 3,7,8,9)
old_df <- data.frame(countries, values, stringsAsFactors = FALSE)
I want to combine the rows of my dataset that share the same value, to obtain this:
countries <- c("USA , Canada","Denmark , Albania", "Turkey","France", "Italy")
values <- c(1,3,7,8,9)
new_df <- data.frame(countries, values, stringsAsFactors = FALSE)
Because I am using dplyr, I think that the best way to solve my problem could be:
library(dplyr)
new_df <- group_by(values) %>%
transmute(countries = countries) %>%
ungroup
Thank you in advance for any clue about how to solve this.

library(dplyr)
old_df %>%
group_by(values) %>%
summarise(countries = paste0(countries, collapse = ", "))
# # A tibble: 5 x 2
# values countries
# <dbl> <chr>
# 1 1 USA, Canada
# 2 3 Denmark, Albania
# 3 7 Turkey
# 4 8 France
# 5 9 Italy
The point here is that for each unique value in values you want to combine some of your rows, so you need to use summarise (i.e. you want to end up with one row per distinct value of values).
You can use summarise(countries = paste0(sort(countries), collapse = ", ")) if you want to apply an alphabetical order when you combine countries.
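For comparison, here is a minimal base R sketch of the same idea, in case dplyr is not available. It swaps summarise for aggregate (a different function, not the approach above) and rebuilds old_df from the question so that it runs on its own. The name combined is just a placeholder.

```r
# Rebuild the example data from the question
countries <- c("USA", "Canada", "Denmark", "Albania", "Turkey", "France", "Italy")
values <- c(1, 1, 3, 3, 7, 8, 9)
old_df <- data.frame(countries, values, stringsAsFactors = FALSE)

# One row per distinct value; countries sharing a value are pasted together.
# collapse = ", " is handed on to paste through aggregate's "..." argument.
combined <- aggregate(countries ~ values, data = old_df, FUN = paste, collapse = ", ")
print(combined)
```

The result has the grouping column first and the pasted countries second, which is the same layout as the summarise output.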

Related

Assign a conditional value to new created column

My Data frame looks like this
Now, I want to add a new column which assigns one (!) specific value to each country. That means, there is only one value for Australia, one for Canada etc. for every year.
It should look like this:
Year Country R Ineq Adv NEW_COL
2018 Australia R1 Ineq1 1 x_Australia
2019 Australia R2 Ineq2 1 x_Australia
1972 Canada R1 Ineq1 1 x_Canada
...
Is there a smart way to do this?
Appreciate any help!
You can use merge.
x = data.frame(country = c("AUS","CAN","AUS","USA"),
val1 = c(1:4))
y = data.frame(country = c("AUS","CAN","USA"),
val2 = c("a","b","c"))
merge(x,y)
country val1 val2
1 AUS 1 a
2 AUS 3 a
3 CAN 2 b
4 USA 4 c
You just manually create the (probably significantly smaller!) reference table that then gets duplicated in the original table in the merge. As you can see, my 3 row table (with a,b,c) is correctly duplicated up to the original (4 row) table such that every AUS gets "a".
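One thing to watch with this approach: by default merge keeps only the rows whose country appears in both tables. A small sketch, assuming you would rather keep every row of the original table, is to pass all.x = TRUE. The extra "NZL" row is made up here purely to show the effect.

```r
# Original table, with one country ("NZL", hypothetical) missing from the reference table
x <- data.frame(country = c("AUS", "CAN", "AUS", "USA", "NZL"), val1 = 1:5)

# Small manually created reference table: one value per country
y <- data.frame(country = c("AUS", "CAN", "USA"), val2 = c("a", "b", "c"))

# all.x = TRUE keeps every row of x; countries without a match get NA in val2
merged <- merge(x, y, by = "country", all.x = TRUE)
print(merged)
```

Without all.x = TRUE the "NZL" row would silently disappear from the result.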
You may use mutate and case_when from the package dplyr:
library(dplyr)
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2))
data <- mutate(data,
newcol = case_when(
country == "CAN" ~ 1,
country == "AUS" ~ 2))
print(data)
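If the list of countries gets long, writing one case_when line per country becomes tedious. A possible base R alternative (not the case_when approach itself) is a named lookup vector; the name lookup is only illustrative.

```r
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2), stringsAsFactors = FALSE)

# One entry per country: the name is the country, the element is its value
lookup <- c(CAN = 1, AUS = 2)

# Index the lookup by country name; as.character() guards against factor columns,
# which would otherwise be used as integer positions
data$newcol <- unname(lookup[as.character(data$country)])
print(data)
```

Countries that are missing from lookup come back as NA, which mirrors what case_when does when no condition matches.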
You can use mutate and group_indices:
library(dplyr)
Sample data:
sample.df <- data.frame(Year = sample(1971:2019, 10, replace = TRUE),
Country = sample(c("AUS", "Can", "UK", "US"), 10, replace = TRUE))
Create new variable called ID, and assign unique ID to each Country group:
sample.df <- sample.df %>%
mutate(ID = group_indices(., Country))
If you want it to appear as x_Country, you can use paste (as commented):
sample.df <- sample.df %>%
mutate(ID = paste(group_indices(., Country), Country, sep = "_"))
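As far as I know, newer dplyr releases (1.0.0 onward) deprecate calling group_indices with grouping columns like this, in favour of group_by plus cur_group_id, so it may be worth checking the documentation for your version. A base R sketch that gives the same kind of ID without any package is below; the small data frame is made up so the result is reproducible.

```r
# Fixed example data (hypothetical values) instead of sample(), so the output is stable
sample.df <- data.frame(Year = c(1999, 2005, 2010, 1984),
                        Country = c("US", "AUS", "US", "Can"),
                        stringsAsFactors = FALSE)

# factor() sorts the distinct countries; as.integer() turns each level into its position
sample.df$ID <- as.integer(factor(sample.df$Country))

# Optional label combining the ID and the country name
sample.df$label <- paste(sample.df$ID, sample.df$Country, sep = "_")
print(sample.df)
```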

extract identically named vectors from nested lists, where the list names vary? Using purrr?

I have to work with some data that is in recursive lists like this (simplified reproducible example below):
groups
#> $group1
#> $group1$countries
#> [1] "USA" "JPN"
#>
#>
#> $group2
#> $group2$countries
#> [1] "AUS" "GBR"
Code for data input below:
chars <- c("USA", "JPN")
chars2 <- c("AUS", "GBR")
group1 <- list(countries = chars)
group2 <- list(countries = chars2)
groups <- list(group1 = group1, group2 = group2)
groups
I'm trying to work out how to extract the vectors that are in the lists, without manually having to write a line of code for each group. The code below works, but my example has a large number of groups (and the number of groups will change), so it would be great to work out how to extract all of the vectors in a more efficient manner. This is the brute force way, that works:
countries1 <- groups$group1$countries
countries2 <- groups$group2$countries
In the example, the bottom level vector I'm trying to extract is always called countries, but the lists they're contained in change name, varying only by numbering.
Would there be an easy purrr solution? Or tidyverse solution? Or other solution?
Add some additional cases to your list
groups[["group3"]] <- list()
groups[["group4"]] <- list(foo = letters[1:2])
groups[["group5"]] <- list(foo = letters[1:2], countries = LETTERS[1:2])
Here's a function that maps any list to just the element named "countries"; it returns NULL if there is no such element
fun = function(x)
x[["countries"]]
Map your original list to contain just the elements you're interested in
interesting <- Map(fun, groups)
Then transform these into a data.frame using a combination of unlist() and rep()
df <- data.frame(
country = unlist(interesting, use.names = FALSE),
name = rep(names(interesting), lengths(interesting))
)
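Putting the steps so far together, here is one runnable sketch. It uses lapply instead of Map (for a single list they behave the same here) and rebuilds a shortened version of groups, including an empty group and a group without countries, to show that those simply drop out.

```r
groups <- list(
  group1 = list(countries = c("USA", "JPN")),
  group2 = list(countries = c("AUS", "GBR")),
  group3 = list(),                       # no elements at all
  group4 = list(foo = letters[1:2])      # no "countries" element
)

# Keep only the "countries" element of each group (NULL where it is missing)
interesting <- lapply(groups, function(x) x[["countries"]])

# lengths() is 0 for the NULL entries, so rep() repeats their names zero times
df <- data.frame(
  country = unlist(interesting, use.names = FALSE),
  name = rep(names(interesting), lengths(interesting)),
  stringsAsFactors = FALSE
)
print(df)
```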
Alternatively, use tidy syntax, e.g.,
interesting %>%
tibble(group = names(.), value = .) %>%
unnest("value")
The output is
# A tibble: 6 x 2
group value
<chr> <chr>
1 group1 USA
2 group1 JPN
3 group2 AUS
4 group2 GBR
5 group5 A
6 group5 B
If there are additional problems parsing individual elements of groups, then modify fun, e.g.,
fun = function(x)
as.character(x[["countries"]])
This will put the output in a list which will handle any number of groups
countries <- unlist(groups, recursive = FALSE)
# \\D+ (rather than \\w+) keeps multi-digit group numbers such as group10 intact
names(countries) <- sub("^\\D+(\\d+)\\.(\\w+)", "\\2\\1", names(countries), perl = TRUE)
> countries
$countries1
[1] "USA" "JPN"
$countries2
[1] "AUS" "GBR"
You can simply transform your nested list to a data.frame and then unnest the country column.
library(dplyr)
library(tidyr)
groups %>%
tibble(group = names(groups),
country = .) %>%
unnest(country) %>%
unnest(country)
#> # A tibble: 4 x 2
#> group country
#> <chr> <chr>
#> 1 group1 USA
#> 2 group1 JPN
#> 3 group2 AUS
#> 4 group2 GBR
Created on 2020-01-15 by the reprex package (v0.3.0)
Since the countries are hidden 2 layers deep, you have to run unnest twice. Otherwise I think this is straightforward.
If you actually want to have each vector as an object in your global environment, a combination of purrr::map2/walk and list2env will work. In order to make this work, we have to give the country entries in the list individual names first, otherwise list2env just overwrites the same object over and over again.
library(purrr)
groups <-
map2(groups, seq_along(groups), ~setNames(.x, paste0(names(.x), .y)))
walk(groups, ~list2env(. , envir = .GlobalEnv))
This would create the exact same results you are describing in your question. I am not sure though, if it is the best solution for a smooth workflow, since I don't know where you are going with this.
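If you would rather not depend on purrr, the same renaming idea can be sketched in base R. This version writes into a fresh environment (env, a name chosen here for illustration) instead of the global environment, so nothing in your workspace gets overwritten by accident; pass .GlobalEnv instead if you really want global objects.

```r
groups <- list(
  group1 = list(countries = c("USA", "JPN")),
  group2 = list(countries = c("AUS", "GBR"))
)

# Pull out the countries vector of every group
countries <- lapply(groups, `[[`, "countries")

# Give every vector its own name: countries1, countries2, ...
names(countries) <- paste0("countries", seq_along(countries))

# Create one object per vector in a separate environment
env <- new.env()
list2env(countries, envir = env)

print(ls(env))
print(get("countries1", envir = env))
```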

Subset the original dataframe with different combinations of 2 factor variables

I have a dataset with 11 columns and 18350 observations, which has the variables company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region. The combinations look like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R. No other combinations are possible.
Any other approach would be highly appreciated.
Thanks in Advance
I used the split function to get a list:
p<-split(tsog1,list(tsog1$company),drop=TRUE)
Now I have a list of dataframes and I can't convert each element of that list into an individual dataframe.
I tried using loops too, but can't get a uniquely named dataframe.
v<-c(1:9)
p<-levels(tsog1$company)
for (x in v)
{
x.tsog1<-subset(tsog1,tsog1$company==p[x])
}
You can create a column for the region company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3 ) %>%
mutate(x = round(rnorm((54*3)),2)) %>%
select(-dummy) %>% as_tibble()
# Create the column to split, and split.
df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
What to do once you have the list of data frames depends on your next steps. If you want to save them, for example, you can use walk or lapply.
For saving:
df_list <- df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
iwalk(df_list,function(df, nm){
write_csv(df, paste0(nm,'.csv'))
})
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
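The split call from the question can also do this directly in base R, without creating a helper column: pass both variables as a list and set drop = TRUE so that combinations which never occur are left out. A minimal sketch, with a tiny made-up stand-in for tsog1 (the column x and its values are hypothetical):

```r
tsog1 <- data.frame(
  company = c("company0", "company0", "company1", "company2"),
  region  = c("region1", "region7", "region5", "region0"),
  x       = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# One data frame per existing company/region combination;
# names are built as "<company>.<region>"
pieces <- split(tsog1, list(tsog1$company, tsog1$region), drop = TRUE)

print(names(pieces))
print(pieces[["company0.region7"]])
```

Keeping the pieces in one named list like this is usually easier to work with than creating many separately named dataframes.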

Doing a ranged lookup with multiple variables in a matrix in R

I feel like I have a bit of a complicated problem (or at least for me it is!).
I have a table of prices which will need to be read from a csv which will look exactly like this:
V1 <- c("","Destination","Spain","Spain","Spain","Portugal","Portugal","Portugal","Italy","Italy","Italy")
V2 <- c("","Min_Duration",rep(c(1,3,6),3))
V3 <- c("","Max_Duration",rep(c(2,5,10),3))
V4 <- c("Full-board","Level_1",runif(9,100,200))
V5 <- c("Full-board","Level_2",runif(9,201,500))
V6 <- c("Full-board","Level_3",runif(9,501,1000))
V7 <- c("Half-board","Level_1",runif(9,100,200))
V8 <- c("Half-board","Level_2",runif(9,201,500))
V9 <- c("Half-board","Level_3",runif(9,501,1000))
Lookup_matrix <- as.data.frame(cbind(V1,V2,V3,V4,V5,V6,V7,V8))
The prices in the above table will of course come out a bit strange as they're completely random - but we can ignore that...
I also have a table like this:
Destination <- c("Spain", "Italy", "Portugal")
Duration <- c(2,4,8)
Level <- c(1,3,3)
Board <- c("Half-board","Half-board","Full-board")
Price <- "Empty"
Price_matrix <- as.data.frame(cbind(Destination,Duration,Level,Board,Price))
My question is - how do I populate the 'Price' column of the price matrix with the corresponding prices that can be found in the lookup matrix? Please note that the duration variable of the price matrix will have to fit into a range found between the 'Min_Duration' and 'Max_Duration' columns in the lookup matrix.
In Excel I would use an Index,Match formula. But I'm stumped with R.
Thanks in advance,
Dan
Here is a tidyverse possibility
First, please note that I rename your input objects; both Price_matrix and Lookup_matrix are data.frames (not matrices).
df1 <- Price_matrix
df2 <- Lookup_matrix
Next we need to fix the column names of df2 = Lookup_matrix.
# Fix column names
colnames(df2) <- gsub("^_", "", apply(df2[1:2, ], 2, paste0, collapse = "_"))
df2 <- df2[-(1:2), ]
We now basically do a left join of df1 and df2; in order for df2 to be in a suitable format we reshape the data from wide to long, extract the Board and Level for every Price value, and expand entries from Min_Duration to Max_Duration. Then we join by Destination, Duration, Level and Board.
Note that in your example, Lookup_matrix has no Half-board Level_3 column (V9 is left out of the cbind call), so there is no Half-board, Level = 3 entry for Italy; we therefore get Price = NA for this entry.
library(tidyverse)
left_join(
df1 %>%
mutate_if(is.factor, as.character) %>%
select(-Price),
df2 %>%
mutate_if(is.factor, as.character) %>%
gather(key, Price, -Destination, -Min_Duration, -Max_Duration) %>%
separate(key, into = c("Board", "Level"), sep = "_", extra = "merge") %>%
mutate(Level = sub("Level_", "", Level)) %>%
rowwise() %>%
mutate(Duration = list(seq(as.numeric(Min_Duration), as.numeric(Max_Duration)))) %>%
unnest() %>%
select(-Min_Duration, -Max_Duration) %>%
mutate(Duration = as.character(Duration)))
#Joining, by = c("Destination", "Duration", "Level", "Board")
# Destination Duration Level Board Price
#1 Spain 2 1 Half-board 119.010942545719
#2 Italy 4 3 Half-board <NA>
#3 Portugal 8 3 Full-board 764.536124917446
Using data.table:
library(data.table)
nms = trimws(do.call(paste, transpose(Lookup_matrix[1:2, ])))# column names
cat(do.call(paste, c(collapse="\n", Lookup_matrix[-(1:2), ])), file = "mm.csv")
# Rewrite the data in the correct format. You do not have to.
# Just doing Lookup_matrix1 = setNames(Lookup_matrix[-(1:2),],nms) is enough
# but it will not have rectified the column classes.
Lookup_matrix1 = fread("mm.csv", col.names = nms)
melt(Lookup_matrix1, 1:3)[,
c("Board", "Level") := .(sub("[.]", "-", sub("\\.Leve.*", "", variable)), sub("\\D+", "", variable))][
Price_matrix[, -5], on=c("Destination", "Board", "Level", "Min_Duration <= Duration", "Max_Duration >= Duration")]
Destination Min_Duration Max_Duration variable value Board Level
1: Spain 2 2 Half.board.Level_1 105.2304 Half-board 1
2: Italy 4 4 <NA> NA Half-board 3
3: Portugal 8 8 Full.board.Level_3 536.5132 Full-board 3
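To make the range condition itself easier to see, here is a stripped-down base R sketch of an Excel-style ranged lookup. It assumes the lookup table has already been cleaned into one row per destination and duration band; the prices, the single Price column, and the helper name find_price are all made up for illustration and are not taken from the answers above.

```r
# Tidy lookup table: one row per destination and duration band (hypothetical prices)
lookup <- data.frame(
  Destination  = rep(c("Spain", "Italy"), each = 3),
  Min_Duration = rep(c(1, 3, 6), 2),
  Max_Duration = rep(c(2, 5, 10), 2),
  Price        = c(110, 120, 130, 210, 220, 230),
  stringsAsFactors = FALSE
)

queries <- data.frame(
  Destination = c("Spain", "Italy", "Italy"),
  Duration    = c(2, 4, 12),
  stringsAsFactors = FALSE
)

# Return the price of the first band that contains the duration, or NA if none does
find_price <- function(dest, dur) {
  hit <- lookup$Destination == dest &
    lookup$Min_Duration <= dur & dur <= lookup$Max_Duration
  if (any(hit)) lookup$Price[which(hit)[1]] else NA_real_
}

queries$Price <- mapply(find_price, queries$Destination, queries$Duration,
                        USE.NAMES = FALSE)
print(queries)
```

This loops over the queries one by one, so for large tables the join-based answers above will likely scale better.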

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset looking like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the employment values of the rows with naics '423620' and '423720'. (So there are two conditions: 1. the same county code, 2. those two naics codes.) The sum should go in the first row ('423620'), and the second row ('423720') should be removed.
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
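The same "recode only where both codes exist, then sum" logic can also be sketched in base R, in case you want to check the result without dplyr. This swaps group_by/summarise for tapply and aggregate; the names both, has_both, merge_rows and result are placeholders.

```r
county <- c("1001", "1001", "1001", "1202", "1202", "1303", "1303")
naics <- c("423620", "423630", "423720", "423620", "423720", "423550", "423720")
employment <- c(5, 6, 5, 5, 5, 6, 5)
data <- data.frame(county, naics, employment, stringsAsFactors = FALSE)

# For each county: does it contain both naics codes?
both <- tapply(data$naics, data$county,
               function(v) all(c("423620", "423720") %in% v))
has_both <- as.vector(both[data$county])

# Recode 423720 to 423620 only in counties that have both codes
merge_rows <- data$naics == "423720" & has_both
data$naics[merge_rows] <- "423620"

# Sum employment per county and naics code, then sort for readability
result <- aggregate(employment ~ county + naics, data = data, FUN = sum)
result <- result[order(result$county, result$naics), ]
print(result)
```

County 1303 has only '423720', so its row is left untouched, which matches the desired dataset in the question.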
