I have a dataframe that looks like this:
+------------+
|site |
+------------+
|JPN Tokyo |
|AUS Sydney |
|CHN Beijing |
I'd like to duplicate each existing row, with the 2nd and 3rd characters of the copy changed to lowercase, so that the dataframe becomes:
+------------+
|site |
+------------+
|JPN Tokyo |
|Jpn Tokyo |
|AUS Sydney |
|Aus Sydney |
|CHN Beijing |
|Chn Beijing |
Would anyone have an idea how to do that?
We expand the rows with uncount, flag the second copy of each row with duplicated on 'site', and convert the substring to lower case using sub within case_when:
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df1 %>%
uncount(2) %>%
mutate(site = case_when(duplicated(site)
~ sub("^(.)(\\w+)", "\\1\\L\\2", site, perl = TRUE),
TRUE ~ site))
-output
df1
# A tibble: 6 x 1
site
<chr>
1 JPN Tokyo
2 Jpn Tokyo
3 AUS Sydney
4 Aus Sydney
5 CHN Beijing
6 Chn Beijing
data
df1 <- structure(list(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"
)), class = "data.frame", row.names = c(NA, -3L))
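To see what the replacement string does on its own: with perl = TRUE, \\L lower-cases the text that follows it in the replacement (here the second capture group), while the first captured character is kept as is. A minimal check on a plain character vector, using base R only (the names x and lowered are just for illustration):

```r
# \\1 puts back the first character unchanged; \\L lower-cases what is
# inserted after it, i.e. the remaining word characters of the first word.
# This only works with perl = TRUE.
x <- c("JPN Tokyo", "AUS Sydney", "CHN Beijing")
lowered <- sub("^(.)(\\w+)", "\\1\\L\\2", x, perl = TRUE)
lowered
# expected values: "Jpn Tokyo", "Aus Sydney", "Chn Beijing"
```

Note that the pattern lower-cases the whole rest of the first word, which equals "2nd and 3rd character" only because the codes here are three letters long.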
Edit: @AnilGoyal suggested using map_dfr, which reduces the call to a single line.
library(tidyverse)
data <-
tribble(
~site,
'JPN Tokyo',
'AUS Sydney',
'CHN Beijing' )
#option1
map_dfr(data$site, ~list(sites = c(.x, str_to_title(.x))))
#> # A tibble: 6 x 1
#> sites
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
#option2
map(data$site, ~rbind(.x, str_to_title(.x))) %>%
reduce(rbind) %>%
tibble(site = .)
#> # A tibble: 6 x 1
#> site[,1]
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
Created on 2021-06-08 by the reprex package (v2.0.0)
You can use substr to replace the characters at specific positions.
df1 <- df
substr(df1$site, 2, 3) <- tolower(substr(df1$site, 2, 3))
df1
# site
#1 Jpn Tokyo
#2 Aus Sydney
#3 Chn Beijing
res <- rbind(df1, df)
res[order(res$site), , drop = FALSE]
# site
#2 Aus Sydney
#5 AUS Sydney
#3 Chn Beijing
#6 CHN Beijing
#1 Jpn Tokyo
#4 JPN Tokyo
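Sorting by site gives an alphabetical order that depends on the locale. If the original row order should be kept, with each lower-cased copy right after its original, one option is to order by the original row index instead. A minimal sketch in base R, assuming df is the three-row data frame from the question (the names lower, res and interleaved are hypothetical):

```r
df <- data.frame(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"))

# Copy the data and lower-case characters 2 to 3 of each value
lower <- df
substr(lower$site, 2, 3) <- tolower(substr(lower$site, 2, 3))

# Stack originals on top of the copies, then order by original row index.
# Ties keep their input order, so each original precedes its copy.
res <- rbind(df, lower)
interleaved <- res[order(rep(seq_len(nrow(df)), 2)), , drop = FALSE]
rownames(interleaved) <- NULL
interleaved$site
# expected values: "JPN Tokyo", "Jpn Tokyo", "AUS Sydney", "Aus Sydney",
#                  "CHN Beijing", "Chn Beijing"
```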
I am trying to obtain the 10 largest investors in a country, but I get confusing results when using arrange in dplyr compared with order in base R.
head(fdi_partner)
gives the following result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code is working fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
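arrange() (like most dplyr verbs) uses data masking, so it expects bare, unquoted column names. A quoted string such as "Number of projects" is treated as a constant value rather than a column, so the row order does not change. If your variable name has a space in it, you must wrap it in backticks: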
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
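Since the question asks for the largest values first, the backticked column can be wrapped in desc(). If the column name is only available as a string, the .data pronoun from dplyr looks it up in the data frame rather than treating it as a constant. A short sketch on the same test data (by_backticks, by_string and col are hypothetical names):

```r
library(dplyr)

test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)

# Backticks plus desc() for a largest-first ordering
by_backticks <- test %>%
  arrange(desc(`My variable`))

# When the name is held in a string, index the .data pronoun with it
col <- "My variable"  # hypothetical variable holding the column name
by_string <- test %>%
  arrange(desc(.data[[col]]))

by_backticks$`My variable`
# expected values: 3, 2, 1
```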
I have this kind of dataset
Name Year Subject State
Jack 2003 Math/ Sci/ Music MA/ AB/ XY
Sam 2004 Math/ PE CA/ AB
Nicole 2005 Math/ Life Sci/ Geography NY/ DE/ FG
This is what I want as output:
Name Year Subject State
Jack 2003 Math MA
Jack 2003 Sci AB
Jack 2003 Music XY
Sam 2004 Math CA
Sam 2004 PE AB
Nicole 2005 Math NY
Nicole 2005 Life Sci DE
Nicole 2005 Geography FG
Keep in mind that the first element of 'Subject' corresponds to the first element of 'State', and so on. I need to keep this correspondence. I think I have to use something like 'pivot_longer', but I do not use R every day and I'm not skilled enough. Thanks in advance :)
Thanks!!
Name <- c("Jack", "Sam", "Nicole")
Year = c(2003, 2004, 2005)
Subject = c("Math/ Sci/ Music", "Math/ PE", "Math/ Life Sci/ Geography")
State = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG")
library(tidyr)
df <- data.frame(Name, Year, Subject, State)
df %>% separate_rows(Subject, State, sep = "/ ")
#> # A tibble: 8 × 4
#> Name Year Subject State
#> <chr> <dbl> <chr> <chr>
#> 1 Jack 2003 Math MA
#> 2 Jack 2003 Sci AB
#> 3 Jack 2003 Music XY
#> 4 Sam 2004 Math CA
#> 5 Sam 2004 PE AB
#> 6 Nicole 2005 Math NY
#> 7 Nicole 2005 Life Sci DE
#> 8 Nicole 2005 Geography FG
Created on 2022-01-27 by the reprex package (v2.0.1)
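In recent tidyr releases (1.3.0 and later, as far as I know) separate_rows is marked as superseded in favour of separate_longer_delim. A sketch of the same split with that function, assuming a tidyr version that provides it (the name long is hypothetical):

```r
library(tidyr)

df <- data.frame(
  Name    = c("Jack", "Sam", "Nicole"),
  Year    = c(2003, 2004, 2005),
  Subject = c("Math/ Sci/ Music", "Math/ PE", "Math/ Life Sci/ Geography"),
  State   = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG")
)

# Both columns are split in parallel, so the i-th Subject stays paired
# with the i-th State of the same row
long <- separate_longer_delim(df, c(Subject, State), delim = "/ ")
long
```

Because the delimiter is the literal string "/ ", the space inside "Life Sci" is left alone.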
Alternatively, using pivot_wider after pivot_longer:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
Name = c("Jack", "Sam", "Nicole"),
Year = c(2003L, 2004L, 2005L),
Subject = c("Math/ Sci/ Music","Math/ PE",
"Math/ Life Sci/ Geography"),
State = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG")
)
df %>%
pivot_longer(cols = 3:4) %>%
pivot_wider(id_cols = c(Name, Year), values_fn = \(x) str_split(x, "/ ")) %>%
unnest(everything())
#> # A tibble: 8 × 4
#> Name Year Subject State
#> <chr> <int> <chr> <chr>
#> 1 Jack 2003 Math MA
#> 2 Jack 2003 Sci AB
#> 3 Jack 2003 Music XY
#> 4 Sam 2004 Math CA
#> 5 Sam 2004 PE AB
#> 6 Nicole 2005 Math NY
#> 7 Nicole 2005 Life Sci DE
#> 8 Nicole 2005 Geography FG
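For completeness, the same reshaping can be sketched without any packages: split both columns with strsplit, check that each row yields the same number of pieces in both, and repeat the remaining columns accordingly. The names subj, state, idx and long are hypothetical:

```r
df <- data.frame(
  Name    = c("Jack", "Sam", "Nicole"),
  Year    = c(2003L, 2004L, 2005L),
  Subject = c("Math/ Sci/ Music", "Math/ PE", "Math/ Life Sci/ Geography"),
  State   = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG"),
  stringsAsFactors = FALSE
)

# Split on the literal delimiter "/ "
subj  <- strsplit(df$Subject, "/ ", fixed = TRUE)
state <- strsplit(df$State, "/ ", fixed = TRUE)

# The pairing only makes sense if both columns have the same number of pieces
stopifnot(lengths(subj) == lengths(state))

# Repeat each original row once per piece
idx <- rep(seq_len(nrow(df)), lengths(subj))
long <- data.frame(
  Name    = df$Name[idx],
  Year    = df$Year[idx],
  Subject = unlist(subj),
  State   = unlist(state)
)
long
```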
I have a data frame similar to:
df<-as.data.frame(cbind(rep("Canada",6),
c(rep("Alberta",3), rep("Manitoba",2),rep("Unknown_province",1)),
c("Edmonton", "Unknown_city","Unknown_city","Brandon","Unknown_city","Unknown_city")))
colnames(df)<- c("Country","Province","City")
I would like to substitute all entries that contain "Unknown" with NA.
I have tried using grepl, but it removes all entries for that variable if one entry matches, I would like to only replace individual cells.
df[grepl("Unknown", df, ignore.case=TRUE)] <- NA
df1 <- df # This ensures that we can refer back to df in case there is an issue
Then you could use any of the following:
is.na(df1) <- array(grepl('Unknown', as.matrix(df1)), dim(df1))
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
or even:
df1[] <- sub("Unknown.*", NA, as.matrix(df1), ignore.case = TRUE)
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
Note that grepl and sub are both vectorized, so there is no need to use the *apply family or for loops.
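As to why the attempt in the question wipes out whole columns: when grepl is given a data frame, it coerces it with as.character, which turns each column into a single string, so the result has one value per column rather than one per cell. A small sketch of the difference, rebuilding the question's data with data.frame (per_column, per_cell and cleaned are hypothetical names):

```r
df <- data.frame(
  Country  = rep("Canada", 6),
  Province = c(rep("Alberta", 3), rep("Manitoba", 2), "Unknown_province"),
  City     = c("Edmonton", "Unknown_city", "Unknown_city",
               "Brandon", "Unknown_city", "Unknown_city"),
  stringsAsFactors = FALSE
)

# One logical value per column: indexing with this selects entire columns
per_column <- grepl("Unknown", df)
length(per_column)

# One logical value per cell: a 6 x 3 matrix that can index individual cells
per_cell <- sapply(df, grepl, pattern = "Unknown")
dim(per_cell)

cleaned <- df
cleaned[per_cell] <- NA
cleaned
```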
Here is one possible way to solve your problem:
df[] <- lapply(df, function(x) ifelse(grepl("Unknown", x, TRUE), NA, x))
df
# Country Province City
# 1 Canada Alberta Edmonton
# 2 Canada Alberta <NA>
# 3 Canada Alberta <NA>
# 4 Canada Manitoba Brandon
# 5 Canada Manitoba <NA>
# 6 Canada <NA> <NA>
Using dplyr
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(),
~ case_when(str_detect(., 'Unknown', negate = TRUE) ~ .)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
I like to use replace() in cases like this, where values in a vector are either replaced or left as is depending on a condition:
library(dplyr)
library(stringr)
df%>%mutate(across(everything(), ~replace(.x, str_detect(.x, 'Unknown'), NA)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
df[]<- lapply(df, gsub, pattern = "Unknown", replacement = NA, fixed = TRUE)
I have a huge database (more than 65M rows) and I noticed that some cells are misplaced. As an example, let's say I have this:
library("tidyverse")
DATA <- tribble(
~SURNAME,~NAME,~STATE,~COUNTRY,
'Smith','Emma','California','USA',
'Johnson','Oliia','Texas','USA',
'Williams','James','USA','California',
'Jones','Noah','Pennsylvania','USA',
'Williams','Liam','Illinois','USA',
'Brown','Sophia','USA','Louisiana',
'Daves','Evelyn','USA','Oregon',
'Miller','Jacob','New Mexico','USA',
'Williams','Lucas','Connecticut','USA',
'Daves','John','California','USA',
'Jones','Carl','USA','Illinois'
)
=====
> DATA
# A tibble: 11 x 4
SURNAME NAME STATE COUNTRY
<chr> <chr> <chr> <chr>
1 Smith Emma California USA
2 Johnson Oliia Texas USA
3 Williams James USA California
4 Jones Noah Pennsylvania USA
5 Williams Liam Illinois USA
6 Brown Sophia USA Louisiana
7 Daves Evelyn USA Oregon
8 Miller Jacob New Mexico USA
9 Williams Lucas Connecticut USA
10 Daves John California USA
11 Jones Carl USA Illinois
As you can see, the Country and State values are swapped in some rows. How can I efficiently swap them back?
Kind Regards,
Luiz.
Using data.table and the in-built state.name vector:
library(data.table)
setDT(DATA) # convert DATA to a data.table by reference
DATA[COUNTRY %in% state.name, `:=`(COUNTRY = STATE, STATE = COUNTRY)]
DATA
# SURNAME NAME STATE COUNTRY
# 1: Smith Emma California USA
# 2: Johnson Oliia Texas USA
# 3: Williams James California USA
# 4: Jones Noah Pennsylvania USA
# 5: Williams Liam Illinois USA
# 6: Brown Sophia Louisiana USA
# 7: Daves Evelyn Oregon USA
# 8: Miller Jacob New Mexico USA
# 9: Williams Lucas Connecticut USA
# 10: Daves John California USA
# 11: Jones Carl Illinois USA
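The same idea, a logical index of the misplaced rows followed by swapping the two columns for those rows only, can also be sketched in base R. This uses a shortened, hypothetical version of the example data and the built-in state.name vector; whether it is fast enough for 65M rows would need to be measured:

```r
DATA <- data.frame(
  SURNAME = c("Smith", "Williams", "Brown", "Miller"),
  NAME    = c("Emma", "James", "Sophia", "Jacob"),
  STATE   = c("California", "USA", "USA", "New Mexico"),
  COUNTRY = c("USA", "California", "Louisiana", "USA"),
  stringsAsFactors = FALSE
)

# Rows where a US state name ended up in the COUNTRY column
swap <- DATA$COUNTRY %in% state.name

# Assignment is positional: STATE receives COUNTRY and vice versa,
# and only for the flagged rows
DATA[swap, c("STATE", "COUNTRY")] <- DATA[swap, c("COUNTRY", "STATE")]
DATA
```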
Check this solution (it assumes that COUNTRY column is in ISO3 format e.g. MEX, CAN):
DATA %>%
mutate(
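# the pattern is anchored so that only values consisting of exactly three capital letters count as a country code
COUNTRY_TMP = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), COUNTRY, STATE),
STATE = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), STATE, COUNTRY),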
COUNTRY = COUNTRY_TMP
) %>%
select(-COUNTRY_TMP)
Assuming all country names follow the ISO3 format, we can use the countrycode package (install it first if needed). This package contains a data frame called codelist with a column iso3c that holds the ISO3 country codes. We can use it as follows to swap the values back.
library(tidyverse)
library(countrycode)
DATA2 <- DATA %>%
mutate(STATE2 = ifelse(STATE %in% codelist$iso3c &
!COUNTRY %in% codelist$iso3c, COUNTRY, STATE),
COUNTRY2 = ifelse(!STATE %in% codelist$iso3c &
COUNTRY %in% codelist$iso3c, COUNTRY, STATE)) %>%
select(-STATE, -COUNTRY) %>%
rename(STATE = STATE2, COUNTRY = COUNTRY2)
DATA2
# # A tibble: 11 x 4
# SURNAME NAME STATE COUNTRY
# <chr> <chr> <chr> <chr>
# 1 Smith Emma California USA
# 2 Johnson Oliia Texas USA
# 3 Williams James California USA
# 4 Jones Noah Pennsylvania USA
# 5 Williams Liam Illinois USA
# 6 Brown Sophia Louisiana USA
# 7 Daves Evelyn Oregon USA
# 8 Miller Jacob New Mexico USA
# 9 Williams Lucas Connecticut USA
# 10 Daves John California USA
# 11 Jones Carl Illinois USA
I have a dataframe as follows:
df <- tibble::tribble(~home, ~visitor, ~hcountry, ~vcountry,
"Milan", "Manchester", "ITA", "ENG",
"LIVERPOOL", "MILAN", "ENG", "ITA",
"Real Madrid", "Juventus", "SPA", "ITA")
#> # A tibble: 3 x 4
#> home visitor hcountry vcountry
#> <chr> <chr> <chr> <chr>
#> 1 Milan Manchester ITA ENG
#> 2 LIVERPOOL MILAN ENG ITA
#> 3 Real Madrid Juventus SPA ITA
and would like to get only the Italian teams, i.e. Milan, Milan, Juventus. How is this possible without using loops?
First off, I recommend a basic R tutorial to familiarise yourself with basic R data operations like subsetting etc. See for example R for Beginners on CRAN.
In your case you can do:
df[df$hcountry == "ITA" | df$vcountry == "ITA", ]
# home visitor hcountry vcountry
#1 Milan Manchester ITA ENG
#2 LIVERPOOL MILAN ENG ITA
#3 Real Madrid Juventus SPA ITA
Or
subset(df, hcountry == "ITA" | vcountry == "ITA")
Sample data
df <- read.table(text =
"home visitor hcountry vcountry
Milan Manchester ITA ENG
LIVERPOOL MILAN ENG ITA
'Real Madrid' Juventus SPA ITA", header = TRUE)
Alternatively you could try stacking home and visitor countries to find unique values
library(dplyr)
library(tidyr)
df %>% gather(key1, country, -c(home, visitor)) %>%
gather(key2, team, -c(key1, country)) %>%
mutate_at(vars(key1, key2), substr, start=1, stop=1) %>%
filter(key1==key2) %>% select(-key1, -key2) %>%
mutate(team=tools::toTitleCase(tolower(team))) %>%
filter(country=="ITA") %>%
distinct()
#> # A tibble: 2 x 2
#> country team
#> <chr> <chr>
#> 1 ITA Milan
#> 2 ITA Juventus
Remove last distinct() if you want to see Milan value duplicated
We can use filter from dplyr
library(dplyr)
df %>%
filter(hcountry == "ITA" | vcountry == "ITA")
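The row filters above return whole matches. If only the team names are wanted, as the question asks, one loop-free option is to index each team column by its own country column and concatenate the results. A minimal base R sketch (teams is a hypothetical name):

```r
df <- data.frame(
  home     = c("Milan", "LIVERPOOL", "Real Madrid"),
  visitor  = c("Manchester", "MILAN", "Juventus"),
  hcountry = c("ITA", "ENG", "SPA"),
  vcountry = c("ENG", "ITA", "ITA"),
  stringsAsFactors = FALSE
)

# Home teams from Italy, followed by visiting teams from Italy
teams <- c(df$home[df$hcountry == "ITA"], df$visitor[df$vcountry == "ITA"])
teams
# expected values: "Milan", "MILAN", "Juventus"
```

The names keep their original casing, so "Milan" and "MILAN" would still need to be normalised before calling unique() on them.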