applying str_split to a column in dataframe

I have the following df named i:
structure(list(price = c(11772, 14790, 2990, 1499, 21980, 27999
), fuel = c("diesel", "petrol", "petrol", "diesel", "diesel",
"petrol"), gearbox = c("manual", "manual", "manual", "manual",
"automatic", "manual"), colour = c("white", "purple", "yellow",
"silver", "red", "rising blue metalli"), engine_size = c(1685,
1199, 998, 1753, 2179, 1984), mileage = c(18839, 7649, 45058,
126000, 31891, 100), year = c("2013 hyundai ix35", "2016 citroen citroen ds3 cabrio",
"2007 peugeot 107 hatchback", "2007 ford ford focus hatchback", "2012 jaguar xf saloon",
"2016 volkswagen scirocco coupe"), doors = c(5, 2, 3, 5, 4, 3
)), .Names = c("price", "fuel", "gearbox", "colour", "engine_size",
"mileage", "year", "doors"), row.names = c(NA, 6L), class = "data.frame")
Some of the words in the 'year' column are duplicated, and I would like to remove them. As a first step, I would like to split the character string in this column into separate words.
I was able to do this for a single string, but when I try to apply it to the whole data frame it gives an error:
unlist(str_split( "2013 hyunday ix35", "[[:blank:]]"))
[1] "2013" "hyunday" "ix35"
for( k in 1:nrow(i))
+ i[k,7]<-unlist(str_split( i[k, 7], "[[:blank:]]"))
Error in [<-.data.frame(*tmp*, k, 7, value = c("2013", "hyundai", :
replacement has 3 rows, data has 1

We can split on one or more whitespace characters (\\s+), then loop over the resulting list with sapply() and paste the unique elements of each piece back together:
i$year <- sapply(strsplit(i$year, "\\s+"), function(x) paste(unique(x), collapse=' '))
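The loop in the question fails because splitting a string yields several pieces per row, and those cannot be assigned into a single cell. A minimal base R sketch of the split-then-deduplicate idea, assuming a made-up vector `x` instead of the original data frame:

```r
# Hypothetical input: two strings, one with a repeated word
x <- c("2016 citroen citroen ds3 cabrio", "2013 hyundai ix35")

# strsplit() returns a list with one character vector per input string
pieces <- strsplit(x, "\\s+")

# Drop repeated words within each string, then glue the rest back together
deduped <- vapply(pieces, function(w) paste(unique(w), collapse = " "), character(1))
deduped
```

vapply() is used here instead of sapply() only so that the result is guaranteed to be a character vector of the same length as the input.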

With dplyr and stringr (plus purrr to iterate over the list that str_split() returns), you could do this:
library(dplyr)
i %>%
mutate(newyear = purrr::map_chr(
stringr::str_split(year, pattern = "[[:blank:]]"),
~ paste(unique(.x), collapse = " ")
))
#> price fuel gearbox colour engine_size mileage
#> 1 11772 diesel manual white 1685 18839
#> 2 14790 petrol manual purple 1199 7649
#> 3 2990 petrol manual yellow 998 45058
#> 4 1499 diesel manual silver 1753 126000
#> 5 21980 diesel automatic red 2179 31891
#> 6 27999 petrol manual rising blue metalli 1984 100
#> year doors newyear
#> 1 2013 hyundai ix35 5 2013 hyundai ix35
#> 2 2016 citroen citroen ds3 cabrio 2 2016 citroen ds3 cabrio
#> 3 2007 peugeot 107 hatchback 3 2007 peugeot 107 hatchback
#> 4 2007 ford ford focus hatchback 5 2007 ford focus hatchback
#> 5 2012 jaguar xf saloon 4 2012 jaguar xf saloon
#> 6 2016 volkswagen scirocco coupe 3 2016 volkswagen scirocco coupe

Related

Transform data to long with grouped columns

For this week's TidyTuesday challenge, I am unable to group the column names in R, which I previously did with the pivot_longer() function from tidyr. Here is my code; I do not understand why it throws an error and does not give what I want.
library(tidyverse)
tuesdata <- tidytuesdayR::tt_load(2023, week = 7)
age_gaps <- tuesdata$age_gaps
df_long <- age_gaps %>%
pivot_longer(cols= actor_1_name:actor_2_name, names_to = "actornumber", values_to = "actorname") %>%
pivot_longer(cols= character_1_gender:character_2_gender, names_to = "gendernumber", values_to = "gender") %>%
pivot_longer(cols= actor_1_age:actor_2_age, names_to = "agenumber", values_to = "age") %>%
select(movie_name, release_year, director, age_difference, actorname, gender, age)
As seen from the code, the initial data has 1155 rows, and after this quick data wrangling I expect to get 1155 x 2 = 2310 rows, because I would like to merge the columns on actor names with their related information such as age and birthdate. Yet the code does not give the expected outcome; I am wondering why, and how I can solve this problem. Thank you in advance for your attention.
Example data (first 6 rows)
age_gaps <- structure(list(movie_name = c("Harold and Maude", "Venus", "The Quiet American",
"The Big Lebowski", "Beginners", "Poison Ivy"), release_year = c(1971,
2006, 2002, 1998, 2010, 1992), director = c("Hal Ashby", "Roger Michell",
"Phillip Noyce", "Joel Coen", "Mike Mills", "Katt Shea"), age_difference = c(52,
50, 49, 45, 43, 42), couple_number = c(1, 1, 1, 1, 1, 1), actor_1_name = c("Ruth Gordon",
"Peter O'Toole", "Michael Caine", "David Huddleston", "Christopher Plummer",
"Tom Skerritt"), actor_2_name = c("Bud Cort", "Jodie Whittaker",
"Do Thi Hai Yen", "Tara Reid", "Goran Visnjic", "Drew Barrymore"
), character_1_gender = c("woman", "man", "man", "man", "man",
"man"), character_2_gender = c("man", "woman", "woman", "woman",
"man", "woman"), actor_1_birthdate = structure(c(-26725, -13666,
-13442, -14351, -14629, -13278), class = "Date"), actor_2_birthdate = structure(c(-7948,
4536, 4656, 2137, 982, 1878), class = "Date"), actor_1_age = c(75,
74, 69, 68, 81, 59), actor_2_age = c(23, 24, 20, 23, 38, 17)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
age_gaps %>%
pivot_longer(actor_1_name:actor_2_age,
names_prefix = "(actor|character)_",
names_to = c("actor", ".value"),
names_sep = '_')
# A tibble: 12 × 10
movie_name release_year director age_difference couple_number actor name gender birthdate age
<chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
1 Harold and Maude 1971 Hal Ashby 52 1 1 Ruth Gordon woman 1896-10-30 75
2 Harold and Maude 1971 Hal Ashby 52 1 2 Bud Cort man 1948-03-29 23
3 Venus 2006 Roger Michell 50 1 1 Peter O'Toole man 1932-08-02 74
4 Venus 2006 Roger Michell 50 1 2 Jodie Whittaker woman 1982-06-03 24
5 The Quiet American 2002 Phillip Noyce 49 1 1 Michael Caine man 1933-03-14 69
6 The Quiet American 2002 Phillip Noyce 49 1 2 Do Thi Hai Yen woman 1982-10-01 20
7 The Big Lebowski 1998 Joel Coen 45 1 1 David Huddleston man 1930-09-17 68
8 The Big Lebowski 1998 Joel Coen 45 1 2 Tara Reid woman 1975-11-08 23
9 Beginners 2010 Mike Mills 43 1 1 Christopher Plummer man 1929-12-13 81
10 Beginners 2010 Mike Mills 43 1 2 Goran Visnjic man 1972-09-09 38
11 Poison Ivy 1992 Katt Shea 42 1 1 Tom Skerritt man 1933-08-25 59
12 Poison Ivy 1992 Katt Shea 42 1 2 Drew Barrymore woman 1975-02-22 17
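To see what ".value" does in isolation, here is a small sketch on a made-up table (the `toy` data frame and its `p1_`/`p2_` column names are assumptions for illustration, not part of the TidyTuesday data). The part of each column name before the separator becomes a key column, and the part after it becomes the name of an output column:

```r
library(tidyr)

# Hypothetical two-row table with paired columns for person 1 and person 2
toy <- data.frame(
  id     = c("a", "b"),
  p1_age = c(30, 40),
  p2_age = c(25, 35),
  p1_job = c("cook", "pilot"),
  p2_job = c("nurse", "clerk")
)

# "p1_age" splits at "_" into person = "p1" and the output column "age"
long <- pivot_longer(
  toy,
  cols      = -id,
  names_to  = c("person", ".value"),
  names_sep = "_"
)
long
```

Each input row produces one output row per person, so the two-row table becomes four rows, which is the same doubling the question expects (1155 to 2310).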

Selecting variable column names for further IRR calculation in R

I have a table of cash flows for various projects over time (years) and want to calculate the IRR for each project. I can't seem to select the appropriate columns, which vary, for each project. The table structure is as follows:
structure(list(`Portfolio Company` = c("Ventures II", "Pal III",
"River Fund II", "Ventures III"),
minc = c(2007, 2008, 2008, 2012),
maxc = c(2021, 2021, 2021, 2020),
num_pers = c(14, 13, 13, 8),
`2007` = c(-660000, NA, NA, NA),
`2008` = c(-525000, -954219, -1427182.55, NA),
`2009` = c(-351991.03, -626798, -1694353.41, NA),
`2010` = c(-299717.06, -243248, -1193954, NA),
`2011` = c(-239257.08, 465738, -288309, NA),
`2012` = c(-9057.31000000001, -369011, 128509.63, -480000),
`2013` = c(-237233.9, -131111, 53718, -411734.58),
`2014` = c(-106181.76, -271181, 887640, -600000),
`2015` = c(-84760.51, 441808, 906289, -900000),
`2016` = c(2770719.21, -377799, 166110, -150000),
`2017` = c(157820.08, -12147, 1425198, -255000),
`2018` = c(204424.36,-1626110, 361270, -180000),
`2019` = c(563463.62, 119577, 531555, 3300402.62),
`2020` = c(96247.29, 7057926, 2247027, 36111.6),
`2021` = c(614848.68, 1277996, 258289, NA)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L),
groups = structure(list(`Portfolio Company` =c("Ventures II","Ventures III","Pal III", "River Fund II"),
.rows = structure(list(1L, 4L, 2L, 3L),
ptype = integer(0),
class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L), .drop = TRUE))
Each project (Portfolio Company) has a different start and end date which is captured by the minc and maxc columns. I would like to use the text in minc and maxc to select from minc:maxc for each project to perform the IRR calculation. I get a variety of errors including: object maxc not found, incorrect arg ... Have tried about 20 combinations of !!sym, as.String (from NLP package) ... none works.
This is the code that created the table and the problematic select code:
sum_fund_CF <- funds %>% group_by(`TX_YR`, `Portfolio Company`) %>%
summarise(CF=sum(if_else(is.na(Proceeds),0,Proceeds)-if_else(is.na(Investment),0,Investment))) %>% ungroup() #organizes source data and calculates cash flows
sum_fund_CF <- sum_fund_CF %>%
group_by(`Portfolio Company`) %>% mutate(minc=min(`TX_YR`),maxc=max(`TX_YR`),num_pers=maxc-minc) %>%
pivot_wider(names_from = TX_YR, values_from = `CF`) #creates the table and finds first year and last year of cash flow, and num of periods between them
sum_fund_CF %>% group_by(`Portfolio Company`)%>% select(!!sym(as.String(maxc))):!!sym(as.String(max))) #want to select appropriate columns for each record to do the IRR analysis ... IRR() ... need a string of cash flows and no NA.
I'm sure it's something simple, but this has me perplexed. Thanks !

You can modify your definition of IRR accordingly. I followed this article on how to calculate IRR using the jrvFinance package.
The filter function from the dplyr package is used after group_by, to select the years indicated by the minc and maxc columns.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(jrvFinance)
data <- structure(list(`Portfolio Company` = c("Ventures II", "Pal III",
"River Fund II", "Ventures III"),
minc = c(2007, 2008, 2008, 2012),
maxc = c(2021, 2021, 2021, 2020),
num_pers = c(14, 13, 13, 8),
`2007` = c(-660000, NA, NA, NA),
`2008` = c(-525000, -954219, -1427182.55, NA),
`2009` = c(-351991.03, -626798, -1694353.41, NA),
`2010` = c(-299717.06, -243248, -1193954, NA),
`2011` = c(-239257.08, 465738, -288309, NA),
`2012` = c(-9057.31000000001, -369011, 128509.63, -480000),
`2013` = c(-237233.9, -131111, 53718, -411734.58),
`2014` = c(-106181.76, -271181, 887640, -600000),
`2015` = c(-84760.51, 441808, 906289, -900000),
`2016` = c(2770719.21, -377799, 166110, -150000),
`2017` = c(157820.08, -12147, 1425198, -255000),
`2018` = c(204424.36,-1626110, 361270, -180000),
`2019` = c(563463.62, 119577, 531555, 3300402.62),
`2020` = c(96247.29, 7057926, 2247027, 36111.6),
`2021` = c(614848.68, 1277996, 258289, NA)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L),
groups = structure(list(`Portfolio Company` =c("Ventures II","Ventures III","Pal III", "River Fund II"),
.rows = structure(list(1L, 4L, 2L, 3L),
ptype = integer(0),
class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L), .drop = TRUE))
clean_data <- data %>%
clean_names() %>%
ungroup() %>%
pivot_longer(cols = -1:-4,
names_to = "year",
values_to = "cashflow") %>%
mutate(year = str_replace(year, "x", ""),
year = as.numeric(year))
clean_data %>%
print(n = 20)
#> # A tibble: 60 x 6
#> portfolio_company minc maxc num_pers year cashflow
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ventures II 2007 2021 14 2007 -660000
#> 2 Ventures II 2007 2021 14 2008 -525000
#> 3 Ventures II 2007 2021 14 2009 -351991.
#> 4 Ventures II 2007 2021 14 2010 -299717.
#> 5 Ventures II 2007 2021 14 2011 -239257.
#> 6 Ventures II 2007 2021 14 2012 -9057.
#> 7 Ventures II 2007 2021 14 2013 -237234.
#> 8 Ventures II 2007 2021 14 2014 -106182.
#> 9 Ventures II 2007 2021 14 2015 -84761.
#> 10 Ventures II 2007 2021 14 2016 2770719.
#> 11 Ventures II 2007 2021 14 2017 157820.
#> 12 Ventures II 2007 2021 14 2018 204424.
#> 13 Ventures II 2007 2021 14 2019 563464.
#> 14 Ventures II 2007 2021 14 2020 96247.
#> 15 Ventures II 2007 2021 14 2021 614849.
#> 16 Pal III 2008 2021 13 2007 NA
#> 17 Pal III 2008 2021 13 2008 -954219
#> 18 Pal III 2008 2021 13 2009 -626798
#> 19 Pal III 2008 2021 13 2010 -243248
#> 20 Pal III 2008 2021 13 2011 465738
#> # ... with 40 more rows
clean_data %>%
group_by(portfolio_company) %>%
filter(between(year, min(minc), max(maxc))) %>%
summarise(irr = irr(cashflow,
cf.freq = 1))
#> # A tibble: 4 x 2
#> portfolio_company irr
#> <chr> <dbl>
#> 1 Pal III 0.111
#> 2 River Fund II 0.0510
#> 3 Ventures II 0.0729
#> 4 Ventures III 0.0251
Created on 2022-01-04 by the reprex package (v2.0.1)

Another way to do it, using jrvFinance::irr():
library(jrvFinance)
library(tidyverse)
df %>%
rowwise() %>%
summarise(irr = irr(na.omit(c_across(matches('^\\d')))), .groups = 'drop')
#> # A tibble: 4 × 2
#> `Portfolio Company` irr
#> <chr> <dbl>
#> 1 Ventures II 0.0729
#> 2 Pal III 0.111
#> 3 River Fund II 0.0510
#> 4 Ventures III 0.0251
Created on 2022-01-04 by the reprex package (v2.0.1)
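For readers who want to see what an IRR calculation does without an extra package, here is a rough base R sketch. It is not how jrvFinance implements irr(); the helper names `npv` and `irr_sketch` and the search interval are assumptions made for illustration. It drops NA values and treats the remaining cash flows as consecutive yearly periods, which fits data where the NAs only occur before the first or after the last cash flow:

```r
# Net present value of yearly cash flows at a given rate
npv <- function(rate, cashflow) {
  sum(cashflow / (1 + rate)^(seq_along(cashflow) - 1))
}

# IRR is the rate at which NPV equals zero; the search interval is an assumption
irr_sketch <- function(cashflow, interval = c(-0.99, 10)) {
  cashflow <- cashflow[!is.na(cashflow)]  # drop years with no cash flow
  uniroot(npv, interval = interval, cashflow = cashflow, tol = 1e-10)$root
}

# Invest 100, then receive 60 in each of the next two years
irr_sketch(c(NA, -100, 60, 60))
```

uniroot() needs the NPV to change sign across the interval, so a cash-flow series that is all positive or all negative will raise an error rather than return a rate.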

Splitting full address column in multiple columns

I have a dataframe with the following column structure (over 1000+ rows total):
addressfull
POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map
POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map
POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map
POINT(2.915206183 24.315583523)||DEF_32||--||13||map
POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map
structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
The column contains a location, street, house number, zip code, city, and country. I want to split the column addressfull with R into multiple columns, for example:
street house number zip city country
molengraaf 20 1689 GL Utrecht Netherlands
winkellaan 67 5788 BG Amsterdam Netherlands
vermeerstraat 18 0932 DC Rotterdam Netherlands
na na na na na
Zandhorstlaan 122 0823 GT Ochtrup Germany
I have read the tidyr and stringr documentation. I can see a pattern for splitting (by ")", by "||" from position x, and by ","), but I can't figure out the correct code to split the column into multiple columns.
Can someone help me?

You could brute force it using sub for a base R approach:
df$street <- sub("^(\\S+)\\s+.*$", "\\1", df$adressfull)
df$`house number` <- sub("^\\S+\\s+(\\d+).*$", "\\1", df$adressfull)
df$zip <- sub("^\\S+\\s+\\d+,\\s*(\\d+\\s+[A-Z]+).*$", "\\1", df$adressfull)
df$city <- sub("^.*?(\\S+),\\s*\\S+$", "\\1", df$adressfull)
df$country <- sub("^.*,\\s*(\\S+)$", "\\1", df$adressfull)
df
adressfull street house number zip
1 molengraaf 20, 1689 GL Utrecht, Netherlands molengraaf 20 1689 GL
city country
1 Utrecht Netherlands
Data:
df <- data.frame(adressfull=c("molengraaf 20, 1689 GL Utrecht, Netherlands"),
stringsAsFactors=FALSE)
This assumes that we have already isolated just the address text. To do that, consider:
text <- "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map"
addresfull <- unlist(strsplit(text, "\\|\\|"))[3]
addresfull
[1] "molengraaf 20, 1689 GL Utrecht, Netherlands"
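The same isolation step can be applied to the whole column at once. A minimal sketch, assuming the address is always the third "||"-separated field and that "--" marks a missing address (the two sample rows are copied from the question):

```r
# Two rows copied from the question; one has "--" instead of an address
addressfull <- c(
  "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
  "POINT(2.915206183 24.315583523)||DEF_32||--||13||map"
)

# fixed = TRUE treats "||" literally, so no regex escaping is needed
parts <- strsplit(as.character(addressfull), "||", fixed = TRUE)

# Keep the third field of every row and turn the "--" placeholder into NA
address <- vapply(parts, function(p) p[3], character(1))
address[address == "--"] <- NA
address
```

The as.character() call matters when the column is a factor, as in the dput() output in the question, because strsplit() only accepts character input.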

A stringr solution is this:
library(stringr)
addresssplit <- data.frame(
street = str_extract(addressfull$addressfull, "(?<=DEF_\\d{2}\\|\\|)\\w+\\b"),
number = str_extract(addressfull$addressfull, "\\d{1,}(?=,)"),
zip = str_extract(addressfull$addressfull, "(?<=\\s)\\d{4}\\s[A-Z]{2}"),
city = str_extract(addressfull$addressfull, "(?<=\\d{4}\\s[A-Z]{2}\\s)\\w+"),
country = str_extract(addressfull$addressfull, "(?<=[a-z]\\b,\\s)\\w+\\b")
)
RESULT:
addresssplit
street number zip city country
1 molengraaf 20 1689 GL Utrecht Netherlands
2 winkellaan 67 5788 BG Amsterdam Netherlands
3 vermeerstraat 18 0932 DC Rotterdam Netherlands
4 <NA> <NA> <NA> <NA> <NA>
5 Zandhorstlaan 122 0823 GT Ochtrup Germany
DATA:
addressfull <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))

This would be a tidyverse approach to the problem:
library(tidyverse)
df <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
df %>% separate(addressfull, sep = "\\|\\|", into = c("Coords", "DEF", "ADDRESS"),extra = "drop") %>%
select(ADDRESS) %>%
separate(ADDRESS, sep = ",", into = c("street", "city", "country")) %>%
separate(street, sep = "(?= \\d)", into = c("street", "house_number")) %>%
separate(city, sep = "(?<=[A-Z][A-Z])", into = c("zip", "city"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> street house_number zip city country
#> 1 molengraaf 20 1689 GL Utrecht Netherlands
#> 2 winkellaan 67 5788 BG Amsterdam Netherlands
#> 3 vermeerstraat 18 0932 DC Rotterdam Netherlands
#> 4 -- <NA> <NA> <NA> <NA>
#> 5 Zandhorstlaan 122 0823 GT Ochtrup Germany
Created on 2020-02-27 by the reprex package (v0.3.0)

Create new variables from disorganized data that has variables in multiple columns

I have a data frame with the following structure:
var1 var2 var3
año: 2005 km: 128000 marca: chevrolet
año: 2019 marca: hyundai km: 50000
marca: toyota año: 2012 km: 340000
I need to create new variables where the corresponding information is assigned
año marca km
2005 chevrolet 128000
2019 hyundai 50000
2012 toyota 340000
I'd love it if someone could help me with a loop for this purpose.

library(tidyverse)
df <- tibble::tribble(
~var1, ~var2, ~var3,
"ano: 2005", "km: 128000", "marca: chevrolet",
"ano: 2019", "marca: hyundai", "km: 50000",
"marca: toyota", "ano: 2012", "km: 340000"
)
df %>%
  mutate(row = row_number()) %>%  # remember which row each cell came from
  pivot_longer(-row, values_to = "values") %>%
  separate(values, into = c("column", "value"), sep = ": ") %>%
  select(-name) %>%
  pivot_wider(names_from = column, values_from = value) %>%
  select(-row)
#> # A tibble: 3 x 3
#> ano km marca
#> <chr> <chr> <chr>
#> 1 2005 128000 chevrolet
#> 2 2019 50000 hyundai
#> 3 2012 340000 toyota

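Here is a base R solution. For each row it looks up where each name in pat occurs among that row's keys and picks the matching value:
pat <- c("ano","marca","km")
dfout <- setNames(data.frame(t(apply(df,
1,
function(v) trimws(gsub(".*:","",v))[match(pat,gsub(":.*","",v))]))),pat)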
such that
> dfout
ano marca km
1 2005 chevrolet 128000
2 2019 hyundai 50000
3 2012 toyota 340000
DATA
df <- structure(list(var1 = c("ano: 2005", "ano: 2019", "marca: toyota"
), var2 = c("km: 128000", "marca: hyundai", "ano: 2012"), var3 = c("marca: chevrolet",
"km: 50000", "km: 340000")), class = "data.frame", row.names = c(NA,
-3L))

One way to solve it using purrr, dplyr and tidyr could be:
map_dfr(.x = split.default(df, 1:length(df)),
~ .x %>%
mutate(rowid = row_number()) %>%
separate(1, sep = ": ", into = c("column", "variable"))) %>%
pivot_wider(names_from = "column", values_from = "variable")
rowid ano marca km
<int> <chr> <chr> <chr>
1 1 2005 chevrolet 128000
2 2 2019 hyundai 50000
3 3 2012 toyota 340000

how to use an element of a character vector as a symbol argument to a function using non-standard evaluation := operator

I'm trying to write a function that accepts a character vector of variable names as symbolic arguments.
Here is some data taken from the "fertility" dataset in the questionr package. The important thing is that it includes some columns of labelled data.
library(tidyverse)
library(labelled)
df <- structure(list(id_woman = structure(c(391, 1643, 85, 881, 1981,
1072, 1978, 1607, 738), label = "Woman Id",
format.spss = "F8.0"),
weight = structure(c(1.80315, 1.80315, 1.80315, 1.80315,
1.80315, 0.997934, 0.997934, 0.997934, 0.192455),
label = "Sample weight", format.spss = "F8.2"),
residency = structure(c(2, 2, 2, 2, 2, 2, 2, 2, 2),
label = "Urban / rural residency",
labels = c(urban = 1, rural = 2),
class = "haven_labelled"),
region = structure(c(4, 4, 4, 4, 4, 3, 3, 3, 3), label = "Region",
labels = c(North = 1, East = 2, South = 3, West = 4),
class = "haven_labelled")),
row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
This function simply takes a variable name and converts it from labelled data to a factor.
my.func <- function(var){
df %>%
mutate({{var}} := to_factor({{var}}))
}
Both of these lines work.
my.func(residency)
my.func("residency")
They return this:
id_woman weight residency region
<dbl> <dbl> <fct> <dbl+lbl>
1 391 1.80 rural 4 [West]
2 1643 1.80 rural 4 [West]
3 85 1.80 rural 4 [West]
4 881 1.80 rural 4 [West]
5 1981 1.80 rural 4 [West]
6 1072 0.998 rural 3 [South]
7 1978 0.998 rural 3 [South]
8 1607 0.998 rural 3 [South]
9 738 0.192 rural 3 [South]
The trouble comes if I try to provide the variable name as part of a vector, like this:
var.names <- c("residency", "region")
my.func(var.names[1])
Error: The LHS of `:=` must be a string or a symbol
Call `rlang::last_error()` to see a backtrace
I tried this, but it also failed.
my.func(rlang::sym(var.names[1]))
Error: The LHS of `:=` must be a string or a symbol
Call `rlang::last_error()` to see a backtrace

In this case, the string has to be converted to a symbol and then unquoted with !!. Unquoting the bare string alone would pass the literal "residency" to to_factor() instead of the column.
my.func(!!rlang::sym(var.names[1]))
# A tibble: 9 x 4
# id_woman weight residency region
# <dbl> <dbl> <fct> <dbl+lbl>
#1 391 1.80 rural 4 [West]
#2 1643 1.80 rural 4 [West]
#3 85 1.80 rural 4 [West]
#4 881 1.80 rural 4 [West]
#5 1981 1.80 rural 4 [West]
#6 1072 0.998 rural 3 [South]
#7 1978 0.998 rural 3 [South]
#8 1607 0.998 rural 3 [South]
#9 738 0.192 rural 3 [South]
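When the column names are only ever available as strings, one alternative that avoids unquoting altogether is across() with all_of(). A sketch under stated assumptions: the `toy` data frame and the `convert_cols` helper are made up, and as.character() stands in for labelled::to_factor() so that the example runs with dplyr alone:

```r
library(dplyr)

# Hypothetical data; as.character() stands in for labelled::to_factor()
toy <- data.frame(id = 1:3, residency = c(2, 2, 1), region = c(4, 3, 3))
var.names <- c("residency", "region")

convert_cols <- function(data, vars) {
  # all_of() selects the columns named in the character vector `vars`
  mutate(data, across(all_of(vars), as.character))
}

out <- convert_cols(toy, var.names[1])
str(out)
```

Passing the whole vector, as in convert_cols(toy, var.names), would convert every listed column in one call.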
