I have a tibble with a character column. Each row holds a comma-separated set of keywords like this: "type:mytype,variable:myvariable,variable:myothervariable:asubvariableofthisothervariable". I want to either convert this into columns in my tibble (a column "type", a column "variable", and so on; but then I don't really know what to do with my 3rd-level words), or convert it to a list column x with a structure of sublists: x$type, x$variable, x$variable$myothervariable.
I'm not sure which approach is best, and I don't know how to implement either of the two approaches I suggest here. I have at most 3 levels, and more 1st-level words than "type" and "variable".
Small Reproducible Example:
library(tibble)
df <- tibble(
  id = 1:3,
  keywords = c(
    "type:novel,genre:humor:black,year:2010",
    "type:dictionary,language:english,type:bilingual,otherlang:french",
    "type:essay,topic:philosophy:purposeoflife,year:2005"
  )
)
# expected would be in idea 1:
colnames(df)
# id, keywords, type, genre, year,
# language, otherlang, topic
# on idea 2:
colnames(df)
# id, keywords, keywords.as.list
We can use separate_rows from tidyr to split the 'keywords' column at each comma, then split it into multiple columns at ':' with cSplit (from splitstackshape), reshape to 'long' format with pivot_longer, and finally reshape back to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
library(data.table)
library(splitstackshape)
df %>%
separate_rows(keywords, sep=",") %>%
cSplit("keywords", ":") %>%
pivot_longer(cols = keywords_2:keywords_3, values_drop_na = TRUE) %>%
select(-name) %>%
mutate(rn = rowid(id, keywords_1)) %>%
pivot_wider(names_from = keywords_1, values_from = value) %>%
select(-rn) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 6 x 7
# id type genre year language otherlang topic
# <int> <chr> <chr> <int> <chr> <chr> <chr>
#1 1 novel humor 2010 <NA> <NA> <NA>
#2 1 <NA> black NA <NA> <NA> <NA>
#3 2 dictionary <NA> NA english french <NA>
#4 2 bilingual <NA> NA <NA> <NA> <NA>
#5 3 essay <NA> 2005 <NA> <NA> philosophy
#6 3 <NA> <NA> NA <NA> <NA> purposeoflife
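If one row per id is preferred over the extra rows above, a tidyr-only variation is possible. The following is a minimal sketch and not part of the original answer: the column names key, level2 and level3 and the ":" and ";" joining separators are illustrative assumptions, and it needs a tidyr version in which unite() has the na.rm argument.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  id = 1:3,
  keywords = c(
    "type:novel,genre:humor:black,year:2010",
    "type:dictionary,language:english,type:bilingual,otherlang:french",
    "type:essay,topic:philosophy:purposeoflife,year:2005"
  )
)

# One row per keyword, with up to three ":"-separated levels in their own columns
long <- df %>%
  separate_rows(keywords, sep = ",") %>%
  separate(keywords, into = c("key", "level2", "level3"),
           sep = ":", fill = "right")

# Glue levels 2 and 3 back together, collapse repeated keys, then widen
wide <- long %>%
  unite("value", level2, level3, sep = ":", na.rm = TRUE) %>%
  group_by(id, key) %>%
  summarise(value = paste(value, collapse = ";"), .groups = "drop") %>%
  pivot_wider(names_from = key, values_from = value)

wide
```

With this choice, the 3rd-level word stays attached to its 2nd-level word (for example "humor:black"), and a key that occurs twice for the same id is kept in one cell (for example "dictionary;bilingual"). Whether that is preferable to the multi-row output depends on how the columns are used afterwards.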
data
df <- structure(list(id = 1:3, keywords = c("type:novel,genre:humor:black,year:2010",
"type:dictionary,language:english,type:bilingual,otherlang:french",
"type:essay,topic:philosophy:purposeoflife,year:2005")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
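For the second idea in the question (a list column), a base R sketch could look like the following. The helper parse_keywords() and the column name keywords_list are hypothetical names chosen for illustration, and the sketch stores everything after the 1st-level word as a character vector instead of building deeper named sublists.

```r
# Hypothetical helper: turn one keyword string into a named list
parse_keywords <- function(x) {
  # Split at "," into entries, then split each entry at ":" into its levels
  parts <- strsplit(strsplit(x, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)
  keys <- vapply(parts, function(p) p[1], character(1))
  values <- lapply(parts, function(p) p[-1])
  # Group the values by 1st-level word, keeping first-appearance order
  split(values, factor(keys, levels = unique(keys)))
}

df <- data.frame(id = 1:3)
df$keywords <- c(
  "type:novel,genre:humor:black,year:2010",
  "type:dictionary,language:english,type:bilingual,otherlang:french",
  "type:essay,topic:philosophy:purposeoflife,year:2005"
)

# One parsed list per row, stored as a list column
df$keywords_list <- lapply(df$keywords, parse_keywords)

df$keywords_list[[1]]$genre  # a list holding c("humor", "black")
df$keywords_list[[2]]$type   # a list holding "dictionary" and "bilingual"
```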
I have 3 million observations with the attribute "other_tags". The values of "other_tags" have to be converted into new attributes and values.
dput():
df <- structure(list(osm_id = c(105093, 107975, 373652), other_tags = structure(c(2L,
3L, 1L), .Label = c("\"addr:city\"=>\"Neuenegg\",\"addr:street\"=>\"Stuberweg\",\"building\"=>\"school\",\"building:levels\"=>\"2\"",
"\"building\"=>\"commercial\",\"name\"=>\"Pollahof\",\"type\"=>\"multipolygon\"",
"\"building\"=>\"yes\",\"amenity\"=>\"sport\",\"type\"=>\"multipolygon\""
), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Here is a subsample of the data:
osm_id other_tags
105093 "building"=>"commercial","name"=>"Pollahof","type"=>"multipolygon"
107975 "building"=>"yes","amenity"=>"sport","type"=>"multipolygon"
373652 "addr:city"=>"Neuenegg","addr:street"=>"Stuberweg","building"=>"school","building:levels"=>"2"
This is the desired data format: Make new attributes (only for building and amenity) and add the value.
osm_id building amenity
105093 commercial
107975 yes sport
373652 school
Thx for your help!
It's not that difficult:
other_tags is a factor column, so we have to use as.character on it
split each string at ',' with strsplit and keep the result in an intermediate list, say s, in which all attributes are separated
store these attributes in a new data frame, say df2, with a separate row for each attribute
use separate() from tidyr to break attribute name and value into two separate columns; the separator sep is => this time
remove the extra quotation marks with str_remove_all
optionally filter the dataset
pivot_wider into the desired format.
library(tidyverse)
s <- strsplit(as.character(df$other_tags), split = ",")
df2 <- data.frame(osm_id = rep(df$osm_id, sapply(s, length)), other_tags = unlist(s))
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA
If, however, filter is not used:
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 8
osm_id building name type amenity `addr:city` `addr:street` `building:levels`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 105093 commercial Pollahof multipolygon NA NA NA NA
2 107975 yes NA multipolygon sport NA NA NA
3 373652 school NA NA NA Neuenegg Stuberweg 2
A single pipe syntax
df %>% mutate(other_tags = as.character(other_tags),
other_tags = str_split(other_tags, ",")) %>%
unnest(other_tags) %>%
mutate(other_tags = str_remove_all(other_tags, '"')) %>%
separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA
We can use (g)sub and str_extract as well as lookaround (in just two lines of code):
library(stringr)
df$building <- str_extract(gsub('"','', df$other_tags),'(?<=building=>)\\w+(?=,)')
df$amenity <- str_extract(gsub('"','', df$other_tags),'(?<=amenity=>)\\w+(?=,)')
If for some reason you want to remove column other_tags:
df$other_tags <- NULL
Result:
df
osm_id building amenity
1 105093 commercial <NA>
2 107975 yes sport
3 373652 school <NA>
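One caveat with the two patterns above: the trailing lookahead (?=,) only matches when the tag is followed by a comma, so a building or amenity tag at the very end of a string would come back as NA, and \\w+ stops at the first non-word character. A possible variant, sketched here on made-up tag strings (the vector tags is an assumption for illustration), keeps the quotes and matches up to the closing quote instead.

```r
library(stringr)

# Made-up tag strings; in the second one "building" is the last tag
tags <- c(
  '"building"=>"commercial","name"=>"Pollahof"',
  '"amenity"=>"sport","building"=>"yes"',
  '"addr:city"=>"Neuenegg","building:levels"=>"2"'
)

# Take everything between the quotes that follow "building"=>
building <- str_extract(tags, '(?<="building"=>")[^"]*')
building
```

Because the key is matched together with its closing quote, "building:levels" is not mistaken for "building", and the third string yields NA.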
Some data
example_df <- data.frame(
url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
numbs = 1:5
)
lookup_df <- data.frame(
string = c('blog', 'subscription', 'UK'),
group = c('blog', 'subs', 'UK')
)
library(dplyr)
library(fuzzyjoin)
data_combined <- example_df %>%
fuzzy_left_join(lookup_df, by = c("url" = "string"),
match_fun = `%in%`)
data_combined
url numbs string group
1 blog/blah 1 <NA> <NA>
2 blog/?utm_medium=foo 2 <NA> <NA>
3 blah 3 <NA> <NA>
4 subscription/apples 4 <NA> <NA>
5 UK/something 5 <NA> <NA>
I expected data_combined to have values for string and group wherever there's a match based on match_fun. Instead they are all NA.
For example, the first value of string in lookup_df is 'blog'. Since this is %in% the first value of url in example_df, I expected a match, with 'blog' in both the string and group fields.
If we want to do a partial match with the word before the / in the 'url' with the 'string' column in 'lookup_df', we could extract that substring as a new column and then do a regex_left_join
library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
mutate(string = str_remove(url, "\\/.*")) %>%
regex_left_join(lookup_df, by = 'string') %>%
select(url, numbs, group)
-output
# url numbs group
#1 blog/blah 1 blog
#2 blog/?utm_medium=foo 2 blog
#3 blah 3 <NA>
#4 subscription/apples 4 subs
#5 UK/something 5 UK
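As background on why the attempt in the question returned only NA: %in% tests whether a value is exactly equal to one of the candidates, not whether one string is contained in another. A small base R illustration, using plain vectors copied from the example data:

```r
url <- c("blog/blah", "blog/?utm_medium=foo", "blah",
         "subscription/apples", "UK/something")
string <- c("blog", "subscription", "UK")

# %in% asks whether each url is *equal* to one of the lookup strings
exact <- url %in% string
exact

# Substring matching is a different test
contains_blog <- grepl("blog", url, fixed = TRUE)
contains_blog

# The part before the first "/", which is what the join above matches on
prefix <- sub("/.*", "", url)
prefix
```

None of the urls equals a lookup string, so exact is FALSE everywhere, while the extracted prefix lines up with the 'string' column of lookup_df.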
I'm trying to spread two-column data into a format that will contain some NA values.
dataframe:
df <- data.frame(Names = c("TXT","LSL","TXT","TXT","TXT","USL","LSL"), Values = c("apple",-2,"orange","banana","pear",10,-1),stringsAsFactors = F)
If a row has TXT, the following rows with LSL or USL belong to that row.
For example:
in the first row, Names is TXT and Values is apple; the next row is LSL, so its value is apple's LSL, and since there is no USL, that will be NA until the next TXT.
If a TXT is followed by another TXT, then the LSL and USL values for that row will be NA.
trying to create this:
I tried using spread with row numbers as unique identifier but that's not what I want:
df %>% group_by(Names) %>% mutate(row = row_number()) %>% spread(key = Names,value = Values)
I guess I need to create following full table with NAs then spread but couldn't figure out how.
We can expand the dataset with complete after creating a grouping index based on the occurrence of 'TXT'
library(dplyr)
library(tidyr)
df %>%
group_by(grp = cumsum(Names == 'TXT')) %>%
complete(Names = unique(.$Names)) %>%
ungroup %>%
spread(Names, Values) %>%
select(TXT, LSL, USL)
# A tibble: 4 x 3
# TXT LSL USL
# <chr> <chr> <chr>
#1 apple -2 <NA>
#2 orange <NA> <NA>
#3 banana <NA> <NA>
#4 pear -1 10
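The grouping index is the key step, so it may help to look at it on its own. A short base R illustration on the example data (the object names grp and blocks are just for this sketch):

```r
df <- data.frame(
  Names  = c("TXT", "LSL", "TXT", "TXT", "TXT", "USL", "LSL"),
  Values = c("apple", -2, "orange", "banana", "pear", 10, -1),
  stringsAsFactors = FALSE
)

# TRUE at every TXT row; the running sum then numbers the blocks
grp <- cumsum(df$Names == "TXT")
grp
# 1 1 2 3 4 4 4

# Rows that share a block number end up in the same output row
blocks <- split(df$Names, grp)
blocks
```

Each TXT starts a new block, and any LSL or USL rows that follow inherit the number of the most recent TXT.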
In data.table, we can use dcast :
library(data.table)
dcast(setDT(df), cumsum(Names == 'TXT')~Names, value.var = 'Values')[, -1]
# LSL TXT USL
#1: -2 apple <NA>
#2: <NA> orange <NA>
#3: <NA> banana <NA>
#4: -1 pear 10
I have a large data frame and I would like to split a column into many columns based on two conditions: the caret character ^ and the letter following IMM-. Based on the data below, Column1 would be split into columns named IMM-A, IMM-B, IMM-C, and IMM-W. I tried the separate function, but it only works if you specify the column names, and because my data is not uniform I don't always know what the column names should be.
SampleId Column1
1 IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
The expected output would be;
SampleId  IMM-A                    IMM-B                    IMM-C                    IMM-W
1         IMM-A*010306+IMM-A*0209  IMM-B*6900+IMM-B*779999  IMM-C*1212+IMM-C*3333
2         IMM-A*010306+IMM-A*0209                           IMM-C*6900+IMM-C*779999  IMM-W*1212+IMM-W*3333
3                                  IMM-B*010306+IMM-B*0209  IMM-C*6900+IMM-C*779999  IMM-W*1212+IMM-W*3333
Not clear about the expected output. Based on the description, we may need
library(tidyverse)
map(strsplit(df$Column1, "[*+^]"), ~
stack(setNames(as.list(.x[c(FALSE, TRUE)]), .x[c(TRUE, FALSE)])) %>%
group_by(ind) %>%
mutate(rn = row_number()) %>%
spread(ind, values)) %>%
set_names(df$SampleId) %>%
bind_rows(.id = 'SampleId') %>%
select(-rn)
# A tibble: 6 x 5
# SampleId `IMM-A` `IMM-B` `IMM-C` `IMM-W`
# <chr> <chr> <chr> <chr> <chr>
#1 1 010306 6900 1212 <NA>
#2 1 0209 779999 3333 <NA>
#3 2 010306 <NA> 6900 1212
#4 2 0209 <NA> 779999 3333
#5 3 <NA> 010306 6900 1212
#6 3 <NA> 0209 779999 3333
Update
Based on the OP's expected output, we expand the data by splitting the 'Column1' at the ^ delimiter, then separate the 'Column1' into 'colA', 'colB' at the delimiter *, remove the 'colB' and spread to 'wide' format
df %>%
separate_rows(Column1, sep = "\\^") %>%
separate(Column1, into = c("colA", "colB"), remove = FALSE, sep="[*]") %>%
select(-colB) %>%
spread(colA, Column1, fill = "")
#  SampleId IMM-A                   IMM-B                   IMM-C                   IMM-W
#1        1 IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
#2        2 IMM-A*010306+IMM-A*0209                         IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
#3        3                         IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
data
df <- structure(list(SampleId = 1:3, Column1 =
c("IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
"IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
"IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
)), class = "data.frame", row.names = c(NA, -3L))
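The same reshaping can also be sketched with pivot_wider, which replaces spread in current tidyr. This is an alternative under stated assumptions and not the answer's own code: the helper column locus is a name made up here, and the column prefix is taken with sub() instead of separate(), since each piece contains more than one * and separate() into two columns would discard the extra pieces with a warning.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  SampleId = 1:3,
  Column1 = c(
    "IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
    "IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
    "IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
  ),
  stringsAsFactors = FALSE
)

res <- df %>%
  # one row per "^"-delimited piece
  separate_rows(Column1, sep = "\\^") %>%
  # everything before the first "*" becomes the future column name
  mutate(locus = sub("\\*.*", "", Column1)) %>%
  # widen, filling combinations that do not occur with an empty string
  pivot_wider(names_from = locus, values_from = Column1, values_fill = "")

res
```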
I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.
We can use summarise_at and subset the first non-NA element
library(dplyr)
d %>%
group_by(grp_id) %>%
summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
# grp_id a b c d e f
# <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 1 NA No No <NA> <NA> No
#2 2 NA Yes <NA> Yes <NA> No
#3 3 NA <NA> <NA> No No No
In the example dataset, columns 'a' to 'f' are all factors with some having only 'No' levels. If it needs to be standardized with all the columns having the same levels, then we may need to call the factor with levels specified as c('Yes', 'No') in the summarise_at i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))
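Note that funs() has since been deprecated in dplyr. Assuming dplyr 1.0 or later, the same idea can be sketched with across(); the columns are selected by name here (a:f) and stringsAsFactors = FALSE is set explicitly, so the result holds character columns instead of factors.

```r
library(dplyr)

d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

res <- d %>%
  group_by(grp_id) %>%
  # first non-NA value per column and group; NA if the group has none
  summarise(across(a:f, ~ .x[!is.na(.x)][1]))

res
```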
We can use aggregate. No packages are used.
YN <- function(x) c(na.omit(as.character(x)), NA)[1]
aggregate(d[3:8], d["grp_id"], YN)
giving:
## grp_id a b c d e f
## 1 1 <NA> No No <NA> <NA> No
## 2 2 <NA> Yes <NA> Yes <NA> No
## 3 3 <NA> <NA> <NA> No No No
The above gives character columns. If you prefer factor columns then use this:
YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)
Note: Other alternate implementations of YN are:
YN <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]
library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]
You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)
In fact, a tidyr-only solution for your problem is very doable.
d %>%
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
na.omit() %>%
select(-id) %>%
spread(col, value, fill=NA, drop=FALSE)
The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.
Here's a description of how it works:
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formatted "tidy" data frame. The entries in the col column are the letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.
na.omit() %>%
This removes all the NA values from the long data.
select(-id) %>%
This eliminates the id column.
spread(col, value, fill=NA, drop=FALSE)
This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.
I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)
mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%
anywhere between the gather() and spread() functions to ensure factor output.