How to convert string in value to attributes and values? - r

I have 3 million observations with the attribute "other_tags". The values of "other_tags" have to be converted into new attributes and values.
dput()
structure(list(osm_id = c(105093, 107975, 373652), other_tags = structure(c(2L,
3L, 1L), .Label = c("\"addr:city\"=>\"Neuenegg\",\"addr:street\"=>\"Stuberweg\",\"building\"=>\"school\",\"building:levels\"=>\"2\"",
"\"building\"=>\"commercial\",\"name\"=>\"Pollahof\",\"type\"=>\"multipolygon\"",
"\"building\"=>\"yes\",\"amenity\"=>\"sport\",\"type\"=>\"multipolygon\""
), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Here is a subsample of the data:
osm_id other_tags
105093 "building"=>"commercial","name"=>"Pollahof","type"=>"multipolygon"
107975 "building"=>"yes","amenity"=>"sport","type"=>"multipolygon"
373652 "addr:city"=>"Neuenegg","addr:street"=>"Stuberweg","building"=>"school","building:levels"=>"2"
This is the desired data format: Make new attributes (only for building and amenity) and add the value.
osm_id building amenity
105093 commercial
107975 yes sport
373652 school
Thx for your help!

Not that difficult.
other_tags is a factor column, so we have to use as.character on it.
Split each string at ',' with strsplit and keep the result in an intermediate list, say s, where all the tags are separated.
Store these tags in a separate row for each tag in a new data frame, say df2.
Use separate() from tidyr to break the attribute name and value into two separate columns, this time with sep = "=>".
Remove the extra quotation marks with str_remove_all.
Optionally filter the dataset.
Use pivot_wider to get the desired format.
library(tidyverse)
s <- strsplit(as.character(df$other_tags), split = ",")
df2 <- data.frame(osm_id = rep(df$osm_id, sapply(s, length)), other_tags = unlist(s))
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA
If, however, filter is not used, every tag becomes a column:
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 8
osm_id building name type amenity `addr:city` `addr:street` `building:levels`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 105093 commercial Pollahof multipolygon NA NA NA NA
2 107975 yes NA multipolygon sport NA NA NA
3 373652 school NA NA NA Neuenegg Stuberweg 2
The same steps in a single pipe:
df %>% mutate(other_tags = as.character(other_tags),
other_tags = str_split(other_tags, ",")) %>%
unnest(other_tags) %>%
mutate(other_tags = str_remove_all(other_tags, '"')) %>%
separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA

We can use (g)sub and str_extract as well as lookaround (in just two lines of code):
library(stringr)
df$building <- str_extract(gsub('"','', df$other_tags),'(?<=building=>)\\w+(?=,|$)')
df$amenity <- str_extract(gsub('"','', df$other_tags),'(?<=amenity=>)\\w+(?=,|$)')
If for some reason you want to remove column other_tags:
df$other_tags <- NULL
Result:
df
osm_id building amenity
1 105093 commercial <NA>
2 107975 yes sport
3 373652 school <NA>
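The same extraction can be sketched in base R without stringr. This is only an illustration: `extract_tag` is a hypothetical helper name (not from any package), and it assumes that values contain no commas or embedded double quotes and that the key contains no regex metacharacters.

```r
# Hypothetical helper: pull the value of one key out of a string of
# "key"=>"value" pairs. Assumes values contain no commas or double quotes.
extract_tag <- function(x, key) {
  # Factor -> character, then drop all double quotes.
  x <- gsub('"', "", as.character(x), fixed = TRUE)
  # The key must start a tag: either at the start of the string or after a comma.
  pattern <- paste0("(^|,)", key, "=>([^,]*)")
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, group 1, group 2) or character(0) if nothing matched.
  vapply(hits, function(h) if (length(h) == 3) h[3] else NA_character_, character(1))
}

df <- data.frame(
  osm_id = c(105093, 107975, 373652),
  other_tags = c(
    '"building"=>"commercial","name"=>"Pollahof","type"=>"multipolygon"',
    '"building"=>"yes","amenity"=>"sport","type"=>"multipolygon"',
    '"addr:city"=>"Neuenegg","addr:street"=>"Stuberweg","building"=>"school","building:levels"=>"2"'
  ),
  stringsAsFactors = FALSE
)

df$building <- extract_tag(df$other_tags, "building")
df$amenity  <- extract_tag(df$other_tags, "amenity")
```

Because the pattern anchors the key at the start of a tag, "building" does not accidentally match "building:levels", and a tag at the very end of the string is still found.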


R dataframe Removing duplicates / choosing which duplicate to remove

I have a dataframe that has duplicates based on their identifying ID, but some of the columns are different. I'd like to keep the rows (or the duplicates) that have the extra bit of info. The structure of the df is as such.
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name<- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line<- c("", "AMRS", "", "Therapy", "")
treatment<-c("", "MH", "", "MH", "")
df <- data.frame (id, Plan_name, treatment, service_line)
As you can see, the id column has duplicates, but I'd like to keep the second duplicate, where there is more info in treatment and service_line.
I have tried using
df[duplicated(df[,c(1,3)]),]
but it doesn't work as an empty df is returned. Any suggestions?
Maybe you want something like this:
First we replace all blanks with NA, then we arrange by treatment within each id, and finally slice() the first row from each group:
library(dplyr)
df %>%
mutate(across(-c(id, Plan_name),~ifelse(.=="", NA, .))) %>%
group_by(id) %>%
arrange(treatment, .by_group = TRUE) %>%
slice(1)
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 2121421 montpellier NA NA
2 21354315 angers MH Therapy
3 3235453 strasbourg MH AMRS
Try with
library(dplyr)
df %>%
filter(if_all(treatment:service_line, ~ .x != ""))
-output
id Plan_name treatment service_line
1 3235453 strasbourg MH AMRS
2 21354315 angers MH Therapy
If we need ids with blanks and not duplicated as well
df %>%
group_by(id) %>%
filter(n() == 1|if_all(treatment:service_line, ~ .x != "")) %>%
ungroup
-output
# A tibble: 3 × 4
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 3235453 strasbourg "MH" "AMRS"
2 21354315 angers "MH" "Therapy"
3 2121421 montpellier "" ""
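The attempt in the question returns an empty data frame because duplicated() is applied to the (id, treatment) pairs, and no such pair occurs twice. A base R sketch of one way around this follows. It assumes that "more info" means "more non-empty fields"; `n_filled`, `df_sorted` and `result` are illustrative names.

```r
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name <- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line <- c("", "AMRS", "", "Therapy", "")
treatment <- c("", "MH", "", "MH", "")
df <- data.frame(id, Plan_name, treatment, service_line, stringsAsFactors = FALSE)

# No (id, treatment) pair is repeated, so nothing is flagged as a duplicate.
any(duplicated(df[, c(1, 3)]))

# Count the non-empty detail fields per row, put the fullest rows first,
# then keep only the first row seen for each id.
n_filled <- rowSums(df[, c("treatment", "service_line")] != "")
df_sorted <- df[order(-n_filled), ]
result <- df_sorted[!duplicated(df_sorted$id), ]
```

Ids that occur only once, such as 2121421, are kept even though their detail fields are empty.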

Extract values from another dataframe if the value of one column partially matches in R

Sorry, I didn't make my question clear.
My aim: if dt$id %in% df$id (as a partial match), extract df$score and add it as a new column in dt.
I have a dataframe like this:
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
And I have another dataframe like
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
I want to mutate a new column "score".
I want to get output like the one below:
id city score
AR01 AM 2587/2/885
ERS02 Bis 901/3371
KVC CHB NA
or
id city score score2 score3
AR01 AM 2587 2 885
ERS02 Bis 901 3371 NA
KVC CHB NA NA NA
I tried to use ifelse to achieve this but always got an error.
Can anyone provide ideas? Thank you.
A simple left_join (after mutating the id values in df) is required:
library(dplyr)
library(stringr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/"))
# A tibble: 3 × 3
id city score
<chr> <chr> <chr>
1 AR01 AM 2587/2/885
2 ERS02 NA 901/2587/3371
3 QR01 Bis 3372/2
For the second solution you can use separate:
library(dplyr)
library(stringr)
library(tidyr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/")) %>%
separate(score,
into = paste("score", 1:3),
sep = "/" )
# A tibble: 3 × 5
id city `score 1` `score 2` `score 3`
<chr> <chr> <chr> <chr> <chr>
1 AR01 AM 2587 2 885
2 ERS02 NA 901 2587 3371
3 QR01 Bis 3372 2 NA
You could create groups by extracting everything before the . using sub to group_by on and merge the rows with paste separated with / and right_join them by id like this:
library(tibble)
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
library(dplyr)
df %>%
mutate(id = sub('\\..*', "", id)) %>%
group_by(id) %>%
mutate(score = paste(score, collapse = '/')) %>%
distinct(id, .keep_all = TRUE) %>%
ungroup() %>%
right_join(., dt, by = 'id')
#> # A tibble: 3 × 3
#> score id city
#> <chr> <chr> <chr>
#> 1 2587/2/885 AR01 AM
#> 2 3372/2 QR01 Bis
#> 3 <NA> KVC CHB
Created on 2022-10-01 with reprex v2.0.2
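For comparison, the same idea (strip the suffix, collapse the scores per id, then join) can be sketched in base R only. This is a minimal sketch using the sample data from the question; `scores` and `result` are illustrative names.

```r
df <- data.frame(
  score = c(2587, 2, 885, 901, 2587, 3371, 3372, 2),
  id = c("AR01.0", "AR01.1", "AR01.12", "ERS02.00", "ERS02.01", "ERS02.02", "QR01", "QR01.03"),
  stringsAsFactors = FALSE
)
dt <- data.frame(
  id = c("AR01", "QR01", "KVC"),
  city = c("AM", "Bis", "CHB"),
  stringsAsFactors = FALSE
)

# Drop everything from the first dot onward so the ids match those in dt.
df$id <- sub("\\..*", "", df$id)

# Collapse all scores that share an id into one "/"-separated string.
scores <- aggregate(score ~ id, data = df, FUN = paste, collapse = "/")

# Keep every row of dt; ids without any score get NA.
result <- merge(dt, scores, by = "id", all.x = TRUE)
```

merge() sorts the result by id, so the row order differs from dt; ids that appear only in df (ERS02 here) are dropped because of all.x = TRUE.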

Fixing column names with unnest_wider in R

I am having a problem in R and seek your help!
I have a tibble that looks like this (unfortunately, I can't figure out how to write the code to create the table here). My table looks exactly like this to the viewer, e.g., you can see the letter "c" in the table.
person zip_code
Laura c("11001", "28720", "32948", "10309")
Mel c("80239", "23909")
Jake c("20930", "23929", "13909")
In short, my "zip_code" column contains rows of character vectors, each of which contain multiple ZIP codes.
I would like to separate the column "zip_code" into multiple columns, each containing one zip code (e.g., "zip_code_1", "zip_code_2", etc.). To do so, I have been using unnest_wider:
unnest_wider(zip_code, names_sep="_")
However, whenever I do this, the names of the new columns generated by unnest_wider come out wrong. Instead of being "zip_code_1", "zip_code_2", "zip_code_3", the new names are "zip_code_1[,1]", "zip_code_2[,1]", and "zip_code_3[,1]". Basically, each column name has a "[,1]" afterward.
I have not repeated these column names anywhere, so I have no idea why they look like this.
I cannot manually rename them with:
dplyr::rename(zip_code_1=`zip_code_[,1]`)
If I do this, I get an error message.
Any help fixing these names is greatly appreciated! Thank you!
With the OP's data, it is a case of matrix column, thus if we convert to a vector (doesn't have dim attributes), the names_sep should work
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(zip = map(zip, c)) %>%
unnest_wider(zip, names_sep = "_")
# A tibble: 3 × 4
zip_1 zip_2 zip_3 zip_4
<chr> <chr> <chr> <chr>
1 10010 10019 10010 10019
2 10019 10032 10019 10032
3 11787 11375 11787 11375
Or, as @IceCreamToucan mentioned, the transform option in unnest_wider makes it more concise:
unnest_wider(df1, zip, names_sep = '_', transform = c)
# A tibble: 3 × 4
zip_1 zip_2 zip_3 zip_4
<chr> <chr> <chr> <chr>
1 10010 10019 10010 10019
2 10019 10032 10019 10032
3 11787 11375 11787 11375
data
df1 <- structure(list(zip = list(structure(c("10010", "10019", "10010",
"10019"), dim = c(4L, 1L)), structure(c("10019", "10032", "10019",
"10032"), dim = c(4L, 1L)), structure(c("11787", "11375", "11787",
"11375"), dim = c(4L, 1L)))), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
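The effect of c() on a one-column matrix can be checked in isolation. The following small base R sketch uses one element of the sample data; `m` and `v` are illustrative names.

```r
# One list element from the sample data: a 4 x 1 character matrix.
m <- structure(c("10010", "10019", "10010", "10019"), dim = c(4L, 1L))

# c() drops the dim attribute and returns a plain character vector.
v <- c(m)

is.matrix(m)  # TRUE
dim(v)        # NULL
```

Only the dim attribute is lost; the values and their order stay the same.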
library(dplyr)
library(tidyr) # unnest, pivot_wider
dat %>%
mutate(
# because your sample data didn't have an ID-column
person = LETTERS[row_number()],
# it's better to work with list-columns of strings, not matrices
zip_code = lapply(zip_code, c)
) %>%
unnest(zip_code) %>%
group_by(person) %>%
mutate(rn = paste0("zip", row_number())) %>%
pivot_wider(id_cols = person, names_from = "rn", values_from = "zip_code") %>%
ungroup()
# # A tibble: 6 x 5
# person zip1 zip2 zip3 zip4
# <chr> <chr> <chr> <chr> <chr>
# 1 A 11374 11374 NA NA
# 2 B 10023 10023 NA NA
# 3 C 10028 10028 NA NA
# 4 D 11210 12498 11210 12498
# 5 E 10301 10301 NA NA
# 6 F 12524 10605 12524 10605
Data
dat <- structure(list(zip_code = list(structure(c("11374", "11374"), .Dim = 2:1), structure(c("10023", "10023"), .Dim = 2:1), structure(c("10028", "10028"), .Dim = 2:1), structure(c("11210", "12498", "11210", "12498"), .Dim = c(4L, 1L)), structure(c("10301", "10301"), .Dim = 2:1), structure(c("12524", "10605", "12524", "10605"), .Dim = c(4L, 1L)))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

Fill cells dataframe based on multiple conditions

How to fill cells based on multiple conditions?
There are a lot of players (columns) in this game, but I only included 2 for the sake of this example. I want to loop over a lot of players.
Every row represents a game round.
Conditions:
IF player00[i] score = 0 &
IF lossallowed00[i] = "no"
THEN Fill flag00[i] with "FLAG"
df <-data.frame(
player001 = c(1,0,3),
player002 = c(1,0,5),
lossallowed001 = c("no", "yes", "no"),
lossallowed002 = c("no", "no", "yes"),
flag001 = NA,
flag002 = NA
)
#desired output:
#player001 player002 lossallowed001 lossallowed002 flag001 flag002
# 1 1 no no NA NA
# 0 0 yes no NA FLAG
# 3 5 no yes NA NA
If you use a method of reshaping to long format, splitting out the IDs based on the pattern of column names being variables made of letters and IDs being made of numbers, you can do the operation all at once in a couple lines and reshape back to wide. Using regex means you're not bound by either the number of players or the names of columns. I added an ID column for the games to differentiate rows; you could drop it afterward.
The reshaping itself is covered pretty extensively already (Reshaping multiple sets of measurement columns (wide format) into single columns (long format) for example) but is useful for problems that need to scale like this.
library(dplyr)
df %>%
tibble::rowid_to_column(var = "game") %>%
tidyr::pivot_longer(-game, names_to = c(".value", "num"),
names_pattern = "(^[a-z]+)(\\d+$)") %>%
mutate(flag = ifelse(player == 0 & lossallowed == "no", "FLAG", NA_character_)) %>%
tidyr::pivot_wider(id_cols = game, names_from = num, values_from = player:flag,
names_glue = "{.value}{num}")
#> # A tibble: 3 × 7
#> game player001 player002 lossallowed001 lossallowed002 flag001 flag002
#> <int> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 1 1 no no <NA> <NA>
#> 2 2 0 0 yes no <NA> FLAG
#> 3 3 3 5 no yes <NA> <NA>
A possible solution:
library(tidyverse)
df <-data.frame(player001 = c(1,0,3), player002 = c(1,0,5),lossallowed001 = c("no", "yes", "no"), loseallowed002 = c("no", "no", "yes"),flag001 = NA, flag002 = NA)
df %>%
rownames_to_column("id") %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(cols = -id) %>%
group_by(str_extract(name, "\\d{3}$"), id) %>%
mutate(value = if_else(row_number() == 3 & first(value) == "0" &
nth(value, 2) == "no", "FLAG", value)) %>%
ungroup %>% select(name, value) %>%
pivot_wider(names_from = name, values_from = value, values_fn = list) %>%
unnest(cols = everything()) %>% type.convert(as.is = TRUE)
#> # A tibble: 3 × 6
#> player001 player002 lossallowed001 loseallowed002 flag001 flag002
#> <int> <int> <chr> <chr> <lgl> <chr>
#> 1 1 1 no no NA <NA>
#> 2 0 0 yes no NA FLAG
#> 3 3 5 no yes NA <NA>
You can do this. First reshape the data, and then add the column. Use bind_cols if you want the data to be merged back.
library(purrr)
library(dplyr)
map(set_names(paste0("00", 1:2)), ~ select(df, ends_with(.x))) %>%
map(., ~ mutate(., newcol = ifelse(.[[1]] == 0 & .[[2]] == "no", "FLAG", NA)))
$`001`
player001 lossallowed001 flag001 newcol
1 1 no NA NA
2 0 yes NA NA
3 3 no NA NA
$`002`
player002 loseallowed002 flag002 newcol
1 1 no NA <NA>
2 0 no NA FLAG
3 5 yes NA <NA>
Here's a solution in the tidyverse. While I arrived at this solution independently, this is likely a duplicate of @camille's solution here, which was posted shortly before mine.
library(tidyverse)
# ...
# Code to generate 'df'.
# ...
df %>%
# Index the matches.
mutate(match_id = row_number()) %>%
# Pivot to get a row for each player {001, 002, ...} and match.
pivot_longer(
# Target columns whose names end with a separate suffix of 3+ digits.
matches("^(.*\\D)(\\d{3,})$"),
names_pattern = "^(.*\\D)(\\d{3,})$",
# Index the players by their suffixes; and give each the following three columns:
# 'player' (score), 'lossallowed', and 'flag'.
names_to = c(".value", "player_id")
) %>%
# Flag the appropriate cases.
mutate(
flag = if_else(player == 0 & lossallowed == "no", "FLAG", NA_character_)
) %>%
# Return to original, wide format.
pivot_wider(
names_from = player_id,
values_from = !c(match_id, player_id),
names_glue = "{.value}{player_id}"
) %>%
arrange(match_id) %>% select(!match_id)
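If reshaping feels heavy for this task, the literal "loop over players" from the question can also be sketched in base R. The sketch assumes that every player column is named player followed by digits and that matching lossallowed and flag columns exist for each suffix; `suffixes`, `score` and `allowed` are illustrative names.

```r
df <- data.frame(
  player001 = c(1, 0, 3),
  player002 = c(1, 0, 5),
  lossallowed001 = c("no", "yes", "no"),
  lossallowed002 = c("no", "no", "yes"),
  flag001 = NA,
  flag002 = NA,
  stringsAsFactors = FALSE
)

# Recover the player suffixes ("001", "002", ...) from the column names.
suffixes <- sub("^player", "", grep("^player[0-9]+$", names(df), value = TRUE))

for (s in suffixes) {
  score   <- df[[paste0("player", s)]]
  allowed <- df[[paste0("lossallowed", s)]]
  # Flag rounds where the score is 0 although a loss was not allowed.
  df[[paste0("flag", s)]] <- ifelse(score == 0 & allowed == "no", "FLAG", NA_character_)
}
```

Since the suffixes are read from the column names, the loop does not need to know the number of players in advance.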

create list from characters in R tibble

I have a tibble with a character column. The character in each row is a set of words like this: "type:mytype,variable:myvariable,variable:myothervariable:asubvariableofthisothervariable". Things like that. I want to either convert this into columns in my tibble (a column "type", a column "variable", and so on; but then I don't really know what to do with my 3rd level words), or convert it to a column list x, so that x has a structure of sublists: x$type, x$variable, x$variable$myothervariable.
I'm not sure which approach is best, and I also don't know how to implement the two approaches that I suggest here. I should add that I have at most 3 levels, and more 1st-level words than "type" and "variable".
Small Reproducible Example:
library(tibble)
df <- tibble(
  id = 1:3,
  keywords = c(
    "type:novel,genre:humor:black,year:2010",
    "type:dictionary,language:english,type:bilingual,otherlang:french",
    "type:essay,topic:philosophy:purposeoflife,year:2005"
  )
)
# expected would be in idea 1:
colnames(df)
# n, keywords, type, genre, year,
# language, otherlang, topic
# on idea 2:
colnames(df)
# n, keywords, keywords.as.list
We can use separate_rows from tidyr to split the 'keywords' column by ,, then with cSplit, split the column 'keywords' into multiple columns at :, reshape to 'long' format with pivot_longer and then reshape back to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
library(splitstackshape)
df %>%
separate_rows(keywords, sep=",") %>%
cSplit("keywords", ":") %>%
pivot_longer(cols = keywords_2:keywords_3, values_drop_na = TRUE) %>%
select(-name) %>%
mutate(rn = rowid(id, keywords_1)) %>%
pivot_wider(names_from = keywords_1, values_from = value) %>%
select(-rn) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 6 x 7
# id type genre year language otherlang topic
# <int> <chr> <chr> <int> <chr> <chr> <chr>
#1 1 novel humor 2010 <NA> <NA> <NA>
#2 1 <NA> black NA <NA> <NA> <NA>
#3 2 dictionary <NA> NA english french <NA>
#4 2 bilingual <NA> NA <NA> <NA> <NA>
#5 3 essay <NA> 2005 <NA> <NA> philosophy
#6 3 <NA> <NA> NA <NA> <NA> purposeoflife
data
df <- structure(list(id = 1:3, keywords = c("type:novel,genre:humor:black,year:2010",
"type:dictionary,language:english,type:bilingual,otherlang:french",
"type:essay,topic:philosophy:purposeoflife,year:2005")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
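The second idea from the question (a list keyed by the first-level word) is not covered above, so here is a rough base R sketch of it. `parse_keywords` is a hypothetical helper name, and as a simplifying assumption anything below the first level is kept as a colon-joined string (for example "humor:black") instead of a nested sublist.

```r
keywords <- c(
  "type:novel,genre:humor:black,year:2010",
  "type:dictionary,language:english,type:bilingual,otherlang:french",
  "type:essay,topic:philosophy:purposeoflife,year:2005"
)

# Hypothetical helper: turn one keyword string into a named list,
# keyed by the first-level word.
parse_keywords <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  keys  <- sub(":.*$", "", parts)      # text before the first colon
  rest  <- sub("^[^:]*:", "", parts)   # text after the first colon
  split(rest, keys)                    # repeated keys end up in one vector
}

keywords_as_list <- lapply(keywords, parse_keywords)
keywords_as_list[[2]]$type   # "dictionary" "bilingual"
```

The result can be stored as a list-column, for example with df$keywords_as_list <- keywords_as_list. Note that split() returns the keys in sorted order, not in order of appearance.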
