Separate a column into multiple columns in the desired way - r

I can separate a column into multiple columns (using ", " as the separator).
The idea is to reverse the order of the words (separated by ", ") and then separate them into multiple columns. Example of reversing: "CA, SF" becomes "SF, CA".
Below is an example:
library(tidyverse)
# sample example
tbl <- tibble(
letter = c("US, CA, SF","NYC", "Florida, Miami")
)
# desired result
tbl_desired <- tibble(
country = c("US", NA, NA),
state = c("CA", NA, "Florida"),
city = c("SF", "NYC", "Miami")
)
# please edit it to get the desired result
tbl %>%
# please add line to reverse the string
mutate() %>%
separate(letter, into = c("country", "state", "city"), sep = ", ")

separate has a fill argument that controls where the NA values go when a row has too few pieces. The default is "warn", but it can be set to either "right" or "left". Here, the missing values should be filled in from the "left":
library(tidyr)
separate(tbl, letter, into = c("country", "state", "city"),
sep = ", ", fill = "left")
-output
# A tibble: 3 × 3
country state city
<chr> <chr> <chr>
1 US CA SF
2 <NA> <NA> NYC
3 <NA> Florida Miami
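To make the effect of fill = "left" concrete, here is a minimal base R sketch of the same padding, assuming no row has more pieces than there are target columns. The helper name pad_left is hypothetical and is not part of tidyr.

```r
# Hypothetical base R sketch of what fill = "left" does;
# pad_left is an illustrative helper, not a tidyr function.
letter <- c("US, CA, SF", "NYC", "Florida, Miami")
cols <- c("country", "state", "city")

pad_left <- function(parts, n) {
  # Assumes length(parts) <= n; longer inputs would need truncating first
  c(rep(NA_character_, n - length(parts)), parts)
}

parts <- strsplit(letter, ", ", fixed = TRUE)
mat <- do.call(rbind, lapply(parts, pad_left, n = length(cols)))
res <- as.data.frame(mat, stringsAsFactors = FALSE)
names(res) <- cols
res
```

Because the NA values are added in front of the existing pieces, the last piece always lands in city, which is why no explicit reversing step is needed.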

Related

Order legend in a R Plotly Bubblemap following factor order

I'm working on a Bubble map where I generated two columns, one for a color id (column Color) and one for a text referring to the id (column Class). This is a classification of my individuals (Color always belongs to Class).
Class is a factor following a certain order that I made with:
COME1039$Class <- factor(COME1039$Class, levels = c('moins de 100 000 F.CFP',
'entre 100 000 et 5 millions F.CFP',
'entre 5 millions et 1 milliard F.CFP',
'entre 1 milliard et 20 milliards F.CFP',
'plus de 20 milliards F.CFP'))
This is my code
g <- list(
scope = 'world',
visible = F,
showland = TRUE,
landcolor = toRGB("#EAECEE"),
showcountries = T,
countrycolor = toRGB("#D6DBDF"),
showocean = T,
oceancolor = toRGB("#808B96")
)
COM.g1 <- plot_geo(data = COME1039,
sizes = c(1, 700))
COM.g1 <- COM.g1 %>% add_markers(
x = ~LONGITUDE,
y = ~LATITUDE,
name = ~Class,
size = ~`Poids Imports`,
color = ~Color,
colors=c(ispfPalette[c(1,2,3,7,6)]),
text=sprintf("<b>%s</b> <br>Poids imports: %s tonnes<br>Valeur imports: %s millions de F.CFP",
COME1039$NomISO,
formatC(COME1039$`Poids Imports`/1000,
small.interval = ",",
digits = 1,
big.mark = " ",
decimal.mark = ",",
format = "f"),
formatC(COME1039$`Valeur Imports`/1000000,
small.interval = ",",
digits = 1,
big.mark = " ",
decimal.mark = ",",
format = "f")),
hovertemplate = "%{text}<extra></extra>"
)
COM.g1 <- COM.g1%>% layout(geo=g)
COM.g1 <- COM.g1%>% layout(dragmode=F)
COM.g1 <- COM.g1 %>% layout(showlegend=T)
COM.g1 <- COM.g1 %>% layout(legend = list(title=list(text='Valeurs des importations<br>'),
orientation = "h",
itemsizing='constant',
x=0,
y=0)) %>% hide_colorbar()
COM.g1
Unfortunately my data are too big to be added here, but this is the output I get:
As you can see, the order of the legend does not follow the factor levels. How can I get that order? If data are needed to help, I will try to limit their size.
Many thanks!
Plotly is going to alphabetize your legend and you have to 'make' it listen. The order of the traces in your plot is the order in which the items appear in your legend. So if you rearrange the traces in the object, you'll rearrange the legend.
I don't have your data, so I used some data from rnaturalearth.
First I created a plot, using plot_geo. Then I used plotly_build() to make sure I had the trace order in the Plotly object. I used lapply to investigate the current order of the traces. Then I created a new order, rearranged the traces, and plotted it again.
The initial plot and build.
library(tidyverse)
library(plotly)
library(rnaturalearth)
canada <- ne_states(country = "Canada", returnclass = "sf")
x = plot_geo(canada, sizes = c(1, 700)) %>%
add_markers(x = ~longitude, y = ~latitude,
name = ~name, color = ~name)
x <- plotly_build(x) # capture all elements of the object
Now for the investigation; this is more so you can see how this all comes together.
# what order are they in?
invisible(
  lapply(seq_along(x$x$data),
         function(i) {
           # print the trace index next to the trace (legend) name
           message(i, " ", x$x$data[[i]]$name)
         })
)
# 1 Alberta
# 2 British Columbia
# 3 Manitoba
# 4 New Brunswick
# 5 Newfoundland and Labrador
# 6 Northwest Territories
# 7 Nova Scotia
# 8 Nunavut
# 9 Ontario
# 10 Prince Edward Island
# 11 Québec
# 12 Saskatchewan
# 13 Yukon
In your question, you show that you made the legend element a factor. That's what I've done as well with this data.
can2 = canada %>%
mutate(name = ordered(name,
levels = c("Manitoba", "New Brunswick",
"Newfoundland and Labrador",
"Northwest Territories",
"Alberta", "British Columbia",
"Nova Scotia", "Nunavut",
"Ontario", "Prince Edward Island",
"Québec", "Saskatchewan", "Yukon")))
I used the data to reorder the traces in my Plotly object. The code below creates a vector: it starts with the levels and their position in the desired order (1:13), then sorts the rows alphabetically by level so they match the current trace order in the Plotly object.
The output of this set of function calls is a vector of numbers (i.e., 5, 6, 1, etc.). Since I have 13 names, I used 1:13. You could also make it dynamic with 1:length(levels(can2$name)).
# capture order
df1 = data.frame(who = levels(can2$name), ord = 1:13) %>%
arrange(who) %>% select(ord) %>% unlist()
Now all that's left is to rearrange the object traces and visualize it.
x$x$data = x$x$data[order(c(df1))] # reorder the traces
x # visualize
Originally:
With reordered traces:
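The reordering step can also be checked without plotly. The sketch below is an assumption-laden stand-in: it uses a plain list in place of x$x$data and only four of the names, and it computes the index with match() rather than the arrange/order route above. Both routes should give the same permutation.

```r
# Stand-in for x$x$data: one list element per trace, in alphabetical order
trace_names <- c("Alberta", "British Columbia", "Manitoba", "New Brunswick")
traces <- lapply(trace_names, function(n) list(name = n))

# Desired legend order (the factor levels)
desired <- c("Manitoba", "New Brunswick", "Alberta", "British Columbia")

# Position of each desired level among the current traces
idx <- match(desired, trace_names)

# Same index via the approach in the answer: rank per alphabetical name, then order()
ranks <- seq_along(desired)[order(desired)]
idx_answer <- order(ranks)

reordered <- traces[idx]
new_names <- vapply(reordered, function(t) t$name, character(1))
new_names
```

With the real object the last step would be x$x$data <- x$x$data[idx], assuming every trace name appears exactly once in the factor levels.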

How to draw bar chart in R, based on Levels?

I will show my data first, to make the question easier to understand:
amount city agent address
1 Madras Vinod 45/BA
2 Kalkta Bola 56/AS
3 Mumbai Pavan 44/AA
4 Tasha Barez 58/SD
5 Tasha Khan 22/AW
6 Madras Baaz 56/QE
7 Mumbai Neer 99/CC
8 Mumbai Bazan 97/DF
I am learning R. In a scenario, I want to calculate the total numbers of amount in a specific city and then draw a bar chart for that, showing all cities. Considering the data above, I want something like this:
amount city
7 Madras
2 Kalkta
18 Mumbai
9 Tasha
After some searching I found that the aggregate function can help, but I ran into an error saying the lengths are not the same.
Would you please tell me how I can achieve this?
base R
res <- do.call(rbind,
by(dat, dat$city, FUN = function(z) data.frame(city = z$city[1], amount = sum(z$amount)))
)
barplot(res$amount, names.arg=res$city)
tidyverse
library(dplyr)
res <- dat %>%
group_by(city) %>%
summarize(amount = sum(amount))
barplot(res$amount, names.arg=res$city)
Data
dat <- structure(list(amount = 1:8, city = c("Madras", "Kalkta", "Mumbai", "Tasha", "Tasha", "Madras", "Mumbai", "Mumbai"), agent = c("Vinod", "Bola", "Pavan", "Barez", "Khan", "Baaz", "Neer", "Bazan"), address = c("45/BA", "56/AS", "44/AA", "58/SD", "22/AW", "56/QE", "99/CC", "97/DF")), class = "data.frame", row.names = c(NA, -8L))
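Since the question mentions aggregate, here is a small sketch of that route using the formula interface. The "length" error typically appears when the vectors passed through by = do not have the same length as the data, and the formula form sidesteps that by taking both columns from the same data frame. Only the two relevant columns are recreated here.

```r
# Minimal copy of the relevant columns from the question
dat <- data.frame(
  amount = 1:8,
  city = c("Madras", "Kalkta", "Mumbai", "Tasha",
           "Tasha", "Madras", "Mumbai", "Mumbai"),
  stringsAsFactors = FALSE
)

# Sum amount within each city; both columns come from dat, so lengths always agree
res <- aggregate(amount ~ city, data = dat, FUN = sum)
res
```

The result has one row per city and can be passed to barplot(res$amount, names.arg = res$city) exactly as in the answers above.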
Another way to do it using the tidyverse
library(tidyverse)
amount <- c(1,2,3,4,5,6,7,8)
city <- c("Madras", "Kalkta", "Mumbai", "Tasha", "Tasha", "Madras", "Mumbai",
          "Mumbai")
df <- tibble(amount = amount, city = city)
df %>%
  group_by(city) %>%
  summarise(amount = sum(amount, na.rm = TRUE)) %>%  # total per city
  ggplot(aes(x = city, y = amount)) +
  geom_col() +
  geom_label(aes(label = amount)) +
  theme_bw()

Order by name if repeat

I would like to sort the bars in descending order by value; if a value is repeated, the city names must appear in alphabetical order.
library(plotly)
city <- c("Paris", "New York", "Rio", "Salvador", "Curitiba", "Natal")
value <- c(10,20,30,10,10,10)
data <- data.frame(city, value, stringsAsFactors = FALSE)
data$city <- factor(data$city, levels = unique(data$city)[order(data$value, decreasing = FALSE)])
fig <- plot_ly(y = data$city, x = data$value, type = "bar", orientation = 'h')
This can be achieved with the order function on the data frame: it sorts by the value column first (the minus sign makes it decreasing) and then by city name.
data_ordered <- data[order(-data$value, data$city),]
data_ordered
city value
3 Rio 30
2 New York 20
5 Curitiba 10
6 Natal 10
1 Paris 10
4 Salvador 10
data_ordered$city <- factor(data_ordered$city, levels = data_ordered$city)
plot_ly(y = data_ordered$city, x = data_ordered$value, type = "bar", orientation = 'h') %>%
layout(yaxis = list(autorange = "reversed"))
Using the tidyverse, I suggest this:
library(tidyverse)
city <- c("Paris", "New York", "Rio", "Salvador", "Curitiba", "Natal")
value <- c(10,20,30,10,10,10)
data <- data.frame(city, value)
db <- as_tibble(data)
db %>%
ggplot(aes(x = reorder(city, -value), y=value))+
geom_col()
The reorder function in the definition of x does what you want, and alphabetical order is respected for ties.
To draw the bars horizontally, add coord_flip() at the end.
Switch -value to value if you want to reverse the order:
library(tidyverse)
city <- c("Paris", "New York", "Rio", "Salvador", "Curitiba", "Natal", "Zoo", "Aaa")
value <- c(10,20,30,10,10,10,10,10)
data <- data.frame(city, value)
db <- as_tibble(data)
db %>%
ggplot(aes(x = reorder(city, value), y=value))+
geom_col() +
coord_flip()
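A quick way to check the tie-breaking claim is to compare the levels that reorder produces with an explicit order(-value, city) sort. This is only a sketch on the example data. As far as I know the alphabetical handling of ties in reorder is not a documented guarantee, so the explicit order() form is the safer choice if the tie order matters.

```r
city <- c("Paris", "New York", "Rio", "Salvador", "Curitiba", "Natal")
value <- c(10, 20, 30, 10, 10, 10)

# Explicit tie-break: descending value first, then city name
explicit <- city[order(-value, city)]

# Levels produced by reorder(); ties fall back to the alphabetical level order here
via_reorder <- levels(reorder(city, -value))

explicit
via_reorder
identical(explicit, via_reorder)
```

If the two ever disagree on other data, factor(city, levels = explicit) can be used in place of reorder(city, -value).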

How to gsub for matching strings and simultaneously remove non-matching strings?

I have a dataframe with a column of strings that I want to further label into the following categories: city, country, and continent. I used gsub to replace all the cities with "City," all the countries with "Country," and all the continents with "Continent."
#This is what I have
dataframe
Color Letter Words
red A Paris,Asia,parrot,Antarctica,North America,cat,lizard
blue A Panama,New York,Africa,dog,Tokyo,Washington DC,fish
red B Copenhagen,bird,USA,Japan,Chicago,Mexico,insect
blue B Israel,Antarctica,horse,South America,North America,turtle,Brazil
#This is what I want
dataframe
Color Letter New
red A City,Continent
blue A Country,City,Continent
red B City,Country
blue B Country,Continent
#This is the code I have so far
dataframe$New <- NA
#groups all the cities
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City", x)})
#groups all the countries
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Panama|USA|Japan|Mexico|Israel|Brazil", "Country", x)})
#groups all the continents
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Asia|Antarctica|Africa|North America|South America", "Continent", x)})
dataframe$Words <- NULL
How do I prevent dataframe$New from being overwritten each time, and how do I delete the extra words (i.e. fish, horse, cat)?
The above data is an example based on a very large dataset. In the dataset the Words column has many repeats. See below for some sample rows from dataframe$Words:
Words
Panama,Paris
Panama,Israel,cat
Panama,Paris,horse,
Panama,Asia
Panama
Panama,Chicago
Israel,Chicago
Israel,lizard,Paris
Israel,Panama,horse,Africa
Consider pasting several ifelse calls checking for specific strings:
dataframe$New <- paste(ifelse(grepl("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", dataframe$Words), "City", "N/A"),
ifelse(grepl("Panama|USA|Japan|Mexico|Israel|Brazil", dataframe$Words), "Country", "N/A"),
ifelse(grepl("Asia|Antarctica|Africa|North America|South America", dataframe$Words), "Continent", "N/A"),
sep=",")
dataframe$New <- gsub("N/A,|,N/A", "", dataframe$New)
dataframe
# Color Letter Words New
# 1 red A Paris,Asia,parrot,Antarctica,North America,cat,lizard City,Continent
# 2 blue A Panama,New York,Africa,dog,Tokyo,Washington DC,fish City,Country,Continent
# 3 red B Copenhagen,bird,USA,Japan,Chicago,Mexico,insect City,Country
# 4 blue B Israel,Antarctica,horse,South America,North America,turtle,Brazil Country,Continent
Or a DRYer version with do.call + lapply:
strs <- list(c("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City"),
c("Panama|USA|Japan|Mexico|Israel|Brazil", "Country"),
c("Asia|Antarctica|Africa|North America|South America", "Continent"))
dataframe$New2 <- do.call(paste,
                          c(lapply(strs, function(s) ifelse(grepl(s[1], dataframe$Words), s[2], "N/A")),
                            list(sep=",")))
dataframe$New2 <- gsub("N/A,|,N/A", "", dataframe$New2)
It may be better to create a key/value list and then extract the elements after replacement by matching on the keys:
library(gsubfn)
# key val list
lst1 <- list(Paris = "City", `New York` = "City", Tokyo = "City",
`Washington DC` = "City",
Copenhagen = "City", Chicago = "City", Panama = "Country",
USA = "Country", Japan = "Country", Mexico = "Country", Israel = "Country",
Brazil = "Country", Asia = "Continent", Antarctica = "Continent",
Africa = "Continent", `North America` = "Continent",
`South America` = "Continent")
Extract the matching values with strapply into a list, loop over the list with sapply, and paste the unique strings that are 'City', 'Continent' or 'Country':
nm1 <- c("City", "Continent", "Country")
df1$New <- sapply(strapply(df1$Words, "([^,]+)", lst1), function(x)
paste(unique(x[x %in% nm1]), collapse=","))
df1$New
#[1] "City,Continent" "Country,City,Continent"
#[3] "City,Country" "Country,Continent"
data
df1 <- structure(list(Color = c("red", "blue", "red", "blue"), Letter = c("A",
"A", "B", "B"), Words = c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
"Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
"Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
"Israel,Antarctica,horse,South America,North America,turtle,Brazil"
)), class = "data.frame", row.names = c(NA, -4L))
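The same key/value idea can be sketched in base R without gsubfn, assuming the words are always separated by a bare comma. The names lookup and label_words are hypothetical. Words that are not in the lookup come back as NA and are dropped, which also removes the extra words such as fish or horse.

```r
# Hypothetical lookup: names are the words, values are the labels
lookup <- c(Paris = "City", `New York` = "City", Tokyo = "City",
            `Washington DC` = "City", Copenhagen = "City", Chicago = "City",
            Panama = "Country", USA = "Country", Japan = "Country",
            Mexico = "Country", Israel = "Country", Brazil = "Country",
            Asia = "Continent", Antarctica = "Continent", Africa = "Continent",
            `North America` = "Continent", `South America` = "Continent")

words <- c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
           "Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
           "Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
           "Israel,Antarctica,horse,South America,North America,turtle,Brazil")

label_words <- function(w) {
  hit <- lookup[w]                                   # NA for words not in the lookup
  paste(unique(hit[!is.na(hit)]), collapse = ",")   # keep first occurrence of each label
}

new <- vapply(strsplit(words, ",", fixed = TRUE), label_words, character(1))
new
```

The labels appear in the order in which they first occur in each string, which matches the desired output in the question.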

Splitting a dataframe column where new column values depend upon original data

I often work with dataframes that have columns with character string values that need to be separated. This results from a "select multiple" option in the data entry programme (which I cannot change unfortunately). I have tried tidyr::separate but that does not order the results properly. An example:
library(tidyr)
df = data.frame(
x = 1:3,
sick = c(NA, "malaria", "diarrhoea malaria"))
df <- df %>%
separate(sick, c("diarrhoea", "cough", "malaria"),
sep = " ", fill = "right", remove = FALSE)
But I want the result to look like this:
df2 = data.frame(
x = 1:3,
sick = c(NA, "malaria", "diarrhoea malaria"),
diarrhoea = c(NA, NA, "diarrhoea"),
cough = c(NA, NA, NA),
malaria = c(NA, "malaria", "malaria"))
Any help in the right direction would be much appreciated.
We can try with separate_rows and dcast
library(tidyr)
library(reshape2)
library(dplyr)
separate_rows(df, sick) %>%
mutate(sick = factor(sick, levels = c("diarrhoea", "cough", "malaria")), sick1 = sick) %>%
dcast(., x~sick, value.var = "sick1", drop=FALSE) %>%
bind_cols(., df[2]) %>%
select(x, sick, diarrhoea, cough, malaria)
# x sick diarrhoea cough malaria
#1 1 <NA> <NA> <NA> <NA>
#2 2 malaria <NA> <NA> malaria
#3 3 diarrhoea malaria diarrhoea <NA> malaria
Or another option is using cSplit from splitstackshape with dcast from data.table
library(splitstackshape)
dcast(cSplit(df, "sick", " ", "long")[, sick:= factor(sick, levels =
c("diarrhoea", "cough", "malaria"))], x~sick, value.var = "sick", drop = FALSE)[,
sick := df$sick][]
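For comparison, here is a minimal base R sketch that builds one column per condition by testing membership in the split string. It assumes the set of possible conditions is known in advance and that values are separated by a single space. The names conditions and tokens are illustrative.

```r
df <- data.frame(
  x = 1:3,
  sick = c(NA, "malaria", "diarrhoea malaria"),
  stringsAsFactors = FALSE
)

# Assumed full list of options from the "select multiple" field
conditions <- c("diarrhoea", "cough", "malaria")

# One character vector of reported conditions per row (NA stays NA)
tokens <- strsplit(as.character(df$sick), " ", fixed = TRUE)

for (cond in conditions) {
  present <- vapply(tokens, function(t) cond %in% t, logical(1))
  # Keep the condition name where it was reported, NA otherwise
  df[[cond]] <- ifelse(present, cond, NA_character_)
}
df
```

Each new column depends only on whether its own name occurs in sick, so the position of a value within the original string no longer matters.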
