I have a dataframe with 13 columns that has data separated by a "^" sign. What I'm trying to come up with is some code that would read each column and parse out the data in between the "^" into its own column.
I can do this on a single column, but applying the same function to every column has proved tricky.
Here is what I do for a single column:
# df = original dataset
# split the first column on the '^' symbol; the output is a list
df2 <- strsplit(as.character(df$`Col1`), "\\^")
# turn the list into a data frame again
df3 <- as.data.frame(do.call(rbind, df2), stringsAsFactors = FALSE)
This gives me one dataframe with the text-to-columns output of 1 column. The problem is I have 12 other columns.
Original df example:
col1 col2 col3
baby^monkey cow^pig^sheep tree^root^grass^man
Desired Output:
Col1_1 Col1_2 Col2_1 Col2_2 Col2_3 Col3_1 Col3_2 Col3_3 Col3_4
baby monkey cow pig sheep tree root grass man
With a few functions from dplyr and tidyr, you can reshape the data into a long format, separate the strings by ^ into individual rows, make row numbers along the column groups, and spread back into wide shape.
library(tidyr)
library(dplyr)
df <- read.table(text = "col1 col2 col3
baby^monkey cow^pig^sheep tree^root^grass^man",
header = T, stringsAsFactors = F)
df %>%
gather(key, value) %>%
separate_rows(value, sep = "\\^") %>%
group_by(key) %>%
mutate(row = row_number()) %>%
unite(key, key, row) %>%
spread(key, value)
#> # A tibble: 1 x 9
#> col1_1 col1_2 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3 col3_4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 baby monkey cow pig sheep tree root grass man
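For comparison, the single-column strsplit approach from the question can also be applied to every column in base R. This is only a minimal sketch: the helper name split_column is made up for illustration, and it assumes that every row of a given column contains the same number of "^"-separated pieces (otherwise rbind would recycle values).

```r
df <- data.frame(
  col1 = "baby^monkey",
  col2 = "cow^pig^sheep",
  col3 = "tree^root^grass^man",
  stringsAsFactors = FALSE
)

# split_column is a hypothetical helper: it splits one column on "^"
# and names the pieces <column>_1, <column>_2, ...
split_column <- function(column, name) {
  pieces <- strsplit(as.character(column), "^", fixed = TRUE)
  out <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
  names(out) <- paste(name, seq_along(out), sep = "_")
  out
}

# apply the helper to every column; unname() stops cbind from
# prefixing the new names with the original column name a second time
result <- do.call(cbind, unname(Map(split_column, df, names(df))))
names(result)
```

Using fixed = TRUE avoids having to escape the "^", which is otherwise a regex anchor.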
I have a dataframe that has duplicates based on their identifying ID, but some of the columns are different. I'd like to keep the rows (or the duplicates) that have the extra bit of info. The structure of the df is as such.
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name<- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line<- c("", "AMRS", "", "Therapy", "")
treatment<-c("", "MH", "", "MH", "")
df <- data.frame (id, Plan_name, treatment, service_line)
As you can see, the id column has duplicates, but for each duplicated id I'd like to keep the row that has more info in treatment and service_line.
I have tried using
df[duplicated(df[,c(1,3)]),]
but it doesn't work as an empty df is returned. Any suggestions?
Maybe you want something like this:
First we replace all blanks with NA, then we arrange by treatment within each id (NA values sort last), and finally slice() the first row of each group:
library(dplyr)
df %>%
mutate(across(-c(id, Plan_name), ~ifelse(. == "", NA, .))) %>%
group_by(id) %>%
arrange(treatment, .by_group = TRUE) %>%
slice(1)
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 2121421 montpellier NA NA
2 21354315 angers MH Therapy
3 3235453 strasbourg MH AMRS
Try with
library(dplyr)
df %>%
filter(if_all(treatment:service_line, ~ .x != ""))
-output
id Plan_name treatment service_line
1 3235453 strasbourg MH AMRS
2 21354315 angers MH Therapy
If we also need to keep the ids that are not duplicated, even when their fields are blank:
df %>%
group_by(id) %>%
filter(n() == 1|if_all(treatment:service_line, ~ .x != "")) %>%
ungroup
-output
# A tibble: 3 × 4
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 3235453 strasbourg "MH" "AMRS"
2 21354315 angers "MH" "Therapy"
3 2121421 montpellier "" ""
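The attempt in the question returns an empty data frame because duplicated() was applied to id and treatment together, and no (id, treatment) pair repeats. As a rough base R sketch of the same idea as the answers above (the variable names filled, sorted and result are just for illustration): count the non-empty fields per row, sort so the most complete row comes first within each id, and keep the first row per id.

```r
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name <- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line <- c("", "AMRS", "", "Therapy", "")
treatment <- c("", "MH", "", "MH", "")
df <- data.frame(id, Plan_name, treatment, service_line,
                 stringsAsFactors = FALSE)

# number of non-empty info fields in each row
filled <- rowSums(df[, c("treatment", "service_line")] != "")

# within each id, put the row with the most filled fields first
sorted <- df[order(df$id, -filled), ]

# duplicated() marks every row after the first one per id, so negate it
result <- sorted[!duplicated(sorted$id), ]
result
```

Ties (two rows with the same number of filled fields) keep whichever row came first in the original data.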
I have a large data frame, and I'm trying to remove the character v from its variable names.
df <- tibble(q_ve5 = 1:2,
q_f_1v = 3:4,
q_vf_2 = 3:4,
q_e6 = 5:6,
q_ev8 = 5:6)
I tried this. It seems my regular expression pattern is not correct
df %>%
rename_all(~ str_remove(., "\\v\\d+$"))
My desired col names:
q_e5 q_f_1 q_f_2 q_e6 q_e8
If only the 'v' needs to be removed, the one-or-more-digits part (\\d+) anchored at the end ($) is not needed, because the expected output also removes the 'v' from the first column, 'q_ve5', where it is not followed by a digit.
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_remove(., "v"), everything())
-output
# A tibble: 2 × 5
q_e5 q_f_1 q_f_2 q_e6 q_e8
<int> <int> <int> <int> <int>
1 1 3 3 5 5
2 2 4 4 6 6
Or without any packages
names(df) <- sub("v", "", names(df))
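One thing to keep in mind: sub() (like str_remove()) only removes the first match in each name. That is enough for the names in the question, but if a name could contain more than one v, gsub() would be needed. A small sketch, where the last name is a made-up example that is not in the original data:

```r
# the first five names are from the question; "q_vv9" is hypothetical
nms <- c("q_ve5", "q_f_1v", "q_vf_2", "q_e6", "q_ev8", "q_vv9")

first_only <- sub("v", "", nms, fixed = TRUE)   # removes the first v only
all_v <- gsub("v", "", nms, fixed = TRUE)       # removes every v

first_only
all_v
```

The stringr equivalent of gsub() here would be str_remove_all().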
I have 3 million observations with the attribute "other_tags". The values of "other_tags" have to be converted into new attributes and values.
dput()
structure(list(osm_id = c(105093, 107975, 373652), other_tags = structure(c(2L,
3L, 1L), .Label = c("\"addr:city\"=>\"Neuenegg\",\"addr:street\"=>\"Stuberweg\",\"building\"=>\"school\",\"building:levels\"=>\"2\"",
"\"building\"=>\"commercial\",\"name\"=>\"Pollahof\",\"type\"=>\"multipolygon\"",
"\"building\"=>\"yes\",\"amenity\"=>\"sport\",\"type\"=>\"multipolygon\""
), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Here is a subsample of the data:
osm_id other_tags
105093 "building"=>"commercial","name"=>"Pollahof","type"=>"multipolygon"
107975 "building"=>"yes","amenity"=>"sport","type"=>"multipolygon"
373652 "addr:city"=>"Neuenegg","addr:street"=>"Stuberweg","building"=>"school","building:levels"=>"2"
This is the desired data format: Make new attributes (only for building and amenity) and add the value.
osm_id building amenity
105093 commercial
107975 yes sport
373652 school
Thx for your help!
Not that difficult.
other_tags is a factor column, so we have to use as.character on it.
Split each string on ',' with strsplit and store the result in an intermediate list, say s, where all the tags are separated.
Store these tags in a separate row for each attribute in a new data frame, say df2.
Use separate() from tidyr to break the attribute name and value into two separate columns; the separator sep is => this time.
Remove the extra quotation marks with str_remove_all.
Optionally filter the dataset.
Use pivot_wider to get the desired format.
library(tidyverse)
s <- strsplit(as.character(df$other_tags), split = ",")
df2 <- data.frame(osm_id = rep(df$osm_id, sapply(s, length)), other_tags = unlist(s))
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA
If, however, filter is not used:
df2 %>% separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
mutate(across(starts_with("Col"), ~str_remove_all(., '"'))) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 8
osm_id building name type amenity `addr:city` `addr:street` `building:levels`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 105093 commercial Pollahof multipolygon NA NA NA NA
2 107975 yes NA multipolygon sport NA NA NA
3 373652 school NA NA NA Neuenegg Stuberweg 2
A single pipe syntax
df %>% mutate(other_tags = as.character(other_tags),
other_tags = str_split(other_tags, ",")) %>%
unnest(other_tags) %>%
mutate(other_tags = str_remove_all(other_tags, '"')) %>%
separate(other_tags, into = c("Col1", "Col2"), sep = "=>") %>%
filter(Col1 %in% c("amenity", "building")) %>%
pivot_wider(id_cols = osm_id, names_from = Col1, values_from = Col2)
# A tibble: 3 x 3
osm_id building amenity
<dbl> <chr> <chr>
1 105093 commercial NA
2 107975 yes sport
3 373652 school NA
We can use (g)sub and str_extract as well as lookaround (in just two lines of code):
library(stringr)
df$building <- str_extract(gsub('"','', df$other_tags),'(?<=building=>)\\w+(?=,)')
df$amenity <- str_extract(gsub('"','', df$other_tags),'(?<=amenity=>)\\w+(?=,)')
If for some reason you want to remove column other_tags:
df$other_tags <- NULL
Result:
df
osm_id building amenity
1 105093 commercial <NA>
2 107975 yes sport
3 373652 school <NA>
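Note that the lookahead (?=,) in the patterns above requires a comma after the value, so a tag that happens to be the last one in the string would come back as NA. A possible variant without that restriction, sketched in base R. The helper name extract_tag is made up, the fourth input string is a hypothetical row added to show the end-of-string case, and the sketch assumes that values contain no commas and that the key has no regex metacharacters:

```r
# extract_tag is a hypothetical helper: returns the value of one key per string
extract_tag <- function(x, key) {
  x <- gsub('"', "", as.character(x), fixed = TRUE)
  # the key must start the string or follow a comma; the value runs to the next comma
  pattern <- paste0("(?:^|,)", key, "=>([^,]*)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits,
         function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1))
}

tags <- c(
  '"building"=>"commercial","name"=>"Pollahof","type"=>"multipolygon"',
  '"building"=>"yes","amenity"=>"sport","type"=>"multipolygon"',
  '"addr:city"=>"Neuenegg","addr:street"=>"Stuberweg","building"=>"school","building:levels"=>"2"',
  '"type"=>"multipolygon","building"=>"house"'  # hypothetical: building is the last tag
)

building <- extract_tag(tags, "building")
amenity <- extract_tag(tags, "amenity")
```

Because the key has to be followed directly by =>, "building" does not accidentally pick up "building:levels".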
I see a lot of examples of how to count values for one column. I can't find a solution for counting for several columns.
I have data like
city col1 col2 col3 col4
I want to group by city and count unique values in col1, col2, col3...
aggregate(. ~ city, hh2, function(x) length(unique(x)))
I can count using aggregate, but it replaces city names with numbers and it's unclear how to revert it.
Here's an approach using dplyr::across, which is a handy way to calculate across multiple columns:
my_data <- data.frame(
city = c(rep("A", 3), rep("B", 3)),
col1 = 1:6,
col2 = 0,
col3 = c(1:3, 4, 4, 4),
col4 = 1:2
)
library(dplyr)
my_data %>%
group_by(city) %>%
summarize(across(col1:col4, n_distinct))
# A tibble: 2 x 5
city col1 col2 col3 col4
* <chr> <int> <int> <int> <int>
1 A 3 1 3 2
2 B 3 1 1 2
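If you would rather stay with aggregate, one option that may avoid the issue is the non-formula interface, where the grouping column is passed separately through by. A minimal sketch on the same sample data (I have not seen the original hh2 data, so this is only checked against my_data):

```r
my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2
)

# count distinct values in every column except city, grouped by city
result <- aggregate(
  my_data[-1],
  by = list(city = my_data$city),
  FUN = function(x) length(unique(x))
)
result
```

Naming the element of the by list (city = ...) is what keeps the grouping column labelled city in the result.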
Looks to me like tidy data is what you're after. Here's an example with the tidyverse and subset of the mpg data set in ggplot2.
library(tidyverse)
data <- mpg[c("model", 'cty', 'hwy')]
head(data) #to see the initial data layout.
data %>%
pivot_longer(cols = c('cty', 'hwy'), names_to = 'cat', values_to = 'values') %>%
group_by(model, cat) %>%
summarise(avg = mean(values))
Consider the following sample dataframe:
> df
id name time
1 1 b 10
2 1 b 12
3 1 a 0
4 2 a 5
5 2 b 11
6 2 a 9
7 2 b 7
8 1 a 15
9 2 b 1
10 1 a 3
df = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L),
name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"
), time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)), .Names = c("id",
"name", "time"), row.names = c(NA, -10L), class = "data.frame")
I need to identify and record all sequences seq <- c("a","b"), where "a" precedes "b" based on "time" column, for each id. No other names between "a" and "b" are permitted. Real sequence length is at least 5.
The expected result for the sample data is
a b
1 3 10
2 5 7
3 9 11
There is a similar question, "Finding rows in R dataframe where a column value follows a sequence". However, it is not clear to me how to deal with the "id" column in my case. Is there a way to solve the problem using "dplyr"?
library(dplyr); library(tidyr)
# sort data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>%
# get logical vector indicating rows of a followed by b and mark each pair as unique
# by cumsum
mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>%
# subset rows where conditions are met
filter(ab | lag(ab)) %>%
# reshape your data frame to wide format
select(-ab) %>% spread(name, time)
#Source: local data frame [3 x 4]
#Groups: id [2]
# id g a b
#* <int> <int> <int> <int>
#1 1 1 3 10
#2 2 1 5 7
#3 2 2 9 11
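For the two-element case, the same adjacency check can also be sketched in base R: sort by id and time, then compare each row with the next one. The extra comparison of ids is what keeps a pair from straddling two different ids. The variable names idx and result are just for illustration.

```r
df <- data.frame(
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L),
  name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"),
  time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L),
  stringsAsFactors = FALSE
)

# sort by id and time so that consecutive rows are consecutive events
df <- df[order(df$id, df$time), ]
n <- nrow(df)

# row i is "a", row i + 1 is "b", and both rows belong to the same id
idx <- which(df$name[-n] == "a" & df$name[-1] == "b" & df$id[-n] == df$id[-1])

result <- data.frame(a = df$time[idx], b = df$time[idx + 1])
result
```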
If the sequence is longer than two, you will need to check multiple lags. One option is to use the shift function from data.table (which accepts a vector of lag/lead steps) combined with Reduce. For example, to check for the pattern abb:
library(dplyr); library(tidyr); library(data.table)
pattern = c("a", "b", "b")
len_pattern = length(pattern)
df %>% arrange(id, time) %>% group_by(id) %>%
# same logic as before but use Reduce function to check multiple lags condition
mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)),
g = cumsum(ab)) %>%
# use reduce or to subset sequence rows having the same length as the pattern
filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>%
# make unique names
group_by(g, .add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>%
# pivoting the table to wide format
select(-ab) %>% spread(name, time)
#Source: local data frame [1 x 5]
#Groups: id, g [1]
# id g a_1 b_2 b_3
#* <int> <int> <int> <int> <int>
#1 1 1 3 10 12
It's somewhat convoluted, but how about a rolling join?
library(data.table)
setorder(setDT(df), id, time)
df[ name == "b" ][
df[, if(name == "a") .(time = last(time)), by=.(id, name, r = rleid(id,name))],
on = .(id, time),
roll = -Inf,
nomatch = 0,
.(a = i.time, b = x.time)
]
a b
1: 3 10
2: 5 7
3: 9 11
You can use an ifelse in filter with lag and lead, and then tidyr::spread to reshape to wide:
library(tidyverse)
df %>% arrange(id, time) %>% group_by(id) %>%
filter(ifelse(name == 'b', # if name is b...
lag(name) == 'a', # is the previous name a?
lead(name) == 'b')) %>% # else if name is not b, is next name b?
ungroup() %>% mutate(i = rep(seq(n() / 2), each = 2)) %>% # create indices to spread by
spread(name, time) %>% select(a, b) # spread to wide and clean up
## # A tibble: 3 × 2
## a b
## * <int> <int>
## 1 3 10
## 2 5 7
## 3 9 11
Based on the comment below, here's a version that uses gregexpr to find the first index of a matched pattern, which while more complicated, scales more easily to longer patterns like "aabb":
df %>% group_by(pattern = 'aabb', id) %>% # add pattern as column, group
arrange(time) %>%
# collapse each group to a string for name and a list column for time
summarise(name = paste(name, collapse = ''), time = list(time)) %>%
# group and add list-column of start indices for each match
rowwise() %>% mutate(i = gregexpr(pattern, name)) %>%
unnest(i, .drop = FALSE) %>% # expand, keeping other list columns
filter(i != -1) %>% # chop out rows with no match from gregexpr
rowwise() %>% # regroup
# subset with sequence from index through pattern length
mutate(time = list(time[i + 0:(nchar(pattern) - 1)]),
pattern = strsplit(pattern, '')) %>% # expand pattern to list column
rownames_to_column('match') %>% # add rownames as match index column
unnest(pattern, time) %>% # expand matches in parallel
# paste sequence onto each letter (important for spreading if repeated letters)
group_by(match) %>% mutate(pattern = paste0(pattern, seq(n()))) %>%
spread(pattern, time) # spread to wide form
## Source: local data frame [1 x 8]
## Groups: match [1]
##
## match id name i a1 a2 b3 b4
## * <chr> <int> <chr> <int> <int> <int> <int> <int>
## 1 1 1 aabba 1 0 3 10 12
Note that if the pattern doesn't happen to be in alphabetical order, the resulting columns will not be ordered by their indices. Since indices are preserved, though, you can sort with something like select(1:4, parse_number(names(.)[-1:-4]) + 4).
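The core of the gregexpr idea, stripped of the reshaping, can be sketched for a single id in base R. The function name find_pattern is made up for illustration, and the sketch assumes single-character names, rows already sorted by time, and non-overlapping matches (which is what gregexpr returns):

```r
# find_pattern is a hypothetical helper: returns the times of each match
find_pattern <- function(name, time, pattern) {
  s <- paste(name, collapse = "")             # e.g. "aabba"
  starts <- gregexpr(pattern, s, fixed = TRUE)[[1]]
  starts <- starts[starts != -1]              # gregexpr returns -1 for no match
  # one vector of times per match, covering the whole pattern length
  lapply(starts, function(i) time[i + seq_len(nchar(pattern)) - 1])
}

# id 1 from the sample data, sorted by time
res <- find_pattern(c("a", "a", "b", "b", "a"), c(0, 3, 10, 12, 15), "aabb")
res

# no match gives an empty list
none <- find_pattern(c("b", "a"), c(1, 5), "aabb")
```

To cover all ids, this could be applied to each group, for example with split() and lapply().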