How can I write a function that is iterable? - r

I need to modify a function (below) that will apply row-wise with dplyr::mutate to remove any '_' characters and capitalise the first letter of each word.
My function
simple_cap <- function(x) {
s <- strsplit(x, "_")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
My data
df <- read.table(text = c('
location obs
1 australia 12454.
2 new_south_wales 3931.
3 victoria 3244.
4 queensland 2477.
5 south_australia 834.
6 western_australia 1335.
7 tasmania 246.'), stringsAsFactors = F)
The dplyr::mutate call:
df %>% mutate(
location = simple_cap(location)
)
The output
location obs
1 Australia 12454
2 Australia 3931
3 Australia 3244
4 Australia 2477
5 Australia 834
6 Australia 1335
7 Australia 246
How can I change my function so that it can be used to iterate over the values in df$location rather than replacing them all with the output from the first element?

1) With gsub
We can use gsub to match a lower case letter ([a-z]) that is either at the start of the string (^) or (|) follows an underscore (_), capture it as a group ((...)), and replace it with the backreference converted to upper case (\\U).
Wrap this in another gsub to replace each _ with " ".
df %>%
mutate(location = gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", location, perl = TRUE)))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
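To make the two substitutions easier to follow, here is a minimal sketch that runs them one at a time on a small vector. The vector x and the names step1 and step2 are made up for illustration and are not part of the original answer.

```r
x <- c("new_south_wales", "tasmania")

# Step 1: upper-case the first letter of the string and every letter
# that directly follows an underscore (perl = TRUE enables \\U)
step1 <- gsub("(^|_)([a-z])", "\\1\\U\\2", x, perl = TRUE)
step1
# [1] "New_South_Wales" "Tasmania"

# Step 2: replace the remaining underscores with spaces
step2 <- gsub("_", " ", step1)
step2
# [1] "New South Wales" "Tasmania"
```

Because gsub works on the whole vector at once, neither step needs an explicit loop.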
2) With stringi
Or another option is stri_trans_totitle from stringi
library(stringi)
df %>%
mutate(location = stri_trans_totitle(stri_replace_all_fixed(location, "_", " ")))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
3) Using OP's modified function
The strsplit output is a list. The OP's code keeps only the first element by extracting [[1]], but here we have a list of length 7. So, one option is to loop over the list with map from purrr (or with lapply/sapply from base R) and paste the substrings together for each element
simple_cap <- function(x) {
s <- strsplit(x, "_")
purrr::map_chr(s, ~
paste(toupper(substring(.x, 1,1)), substring(.x, 2),
sep="", collapse=" "))
}
df %>%
mutate(location = simple_cap(location))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246
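To see why the original function collapsed everything to "Australia", it may help to look at what strsplit returns for a vector. This is a small sketch; the vector x and the name parts are made up for illustration.

```r
x <- c("australia", "new_south_wales", "victoria")

# strsplit returns a list with one element per input string
parts <- strsplit(x, "_")
length(parts)
# [1] 3

# [[1]] keeps only the pieces of the first string ...
parts[[1]]
# [1] "australia"

# ... and silently drops the pieces of all the others
parts[[2]]
# [1] "new"   "south" "wales"
```

Inside mutate, the single string built from parts[[1]] is then recycled down the whole column.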
4) OP's modified function with sapply
simple_cap <- function(x) {
s <- strsplit(x, "_")
sapply(s, function(.s)
paste(toupper(substring(.s, 1,1)), substring(.s, 2),
sep="", collapse=" "))
}
5) No external packages
But, this can be done without using any external package
df$location <- gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", df$location, perl = TRUE))

There is a str_to_title function in stringr which capitalises the first character of each word, and with gsub we replace every "_" (underscore) with " " (blank space).
library(stringr)
library(dplyr)
df %>%
mutate(location = str_to_title(gsub("_", " ", location)))
# location obs
#1 Australia 12454
#2 New South Wales 3931
#3 Victoria 3244
#4 Queensland 2477
#5 South Australia 834
#6 Western Australia 1335
#7 Tasmania 246

Ronak Shah and akrun have solved your specific problem. Here's the general solution to your title question (how do I write a function that is iterable).
In the parlance of R, you want a vectorized function -- a function that accepts a vector input and returns a vector output. There are two ways to do this.
1) Make sure each step in your function can accept a vector input and return a vector output. akrun's 4th answer identifies the step in your code that prevents it from doing this, s <- strsplit(x, "_")[[1]].
2) Turn a non-vectorized function into a vectorized one with Vectorize. Option 1 is more efficient, but sometimes it's not possible. This is clearly an example where it is possible, but to show you how this works, let's vectorize your function with Vectorize:
simple_cap <- function(x) {
s <- strsplit(x, "_")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
simple_cap_v <- Vectorize(simple_cap, USE.NAMES = FALSE)
simple_cap(df$location)
# [1] "Australia"
simple_cap_v(df$location)
# [1] "Australia" "New South Wales" "Victoria" "Queensland"
# [5] "South Australia" "Western Australia" "Tasmania"
df %>% mutate(
location = simple_cap_v(location)
)
# location obs
# 1 Australia 12454
# 2 New South Wales 3931
# 3 Victoria 3244
# 4 Queensland 2477
# 5 South Australia 834
# 6 Western Australia 1335
# 7 Tasmania 246
Vectorize returns a function that is a wrapper to mapply. Effectively, a call to simple_cap_v(x) is now mapply(simple_cap, x, USE.NAMES = FALSE)
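A quick way to check that equivalence is to run the three forms side by side. This is only a sketch on a made-up two-element vector x; the vapply line is an extra base R alternative added for comparison and is not part of the answer above.

```r
simple_cap <- function(x) {
  s <- strsplit(x, "_")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

x <- c("australia", "new_south_wales")

# The wrapper built by Vectorize ...
via_vectorize <- Vectorize(simple_cap, USE.NAMES = FALSE)(x)
# ... the mapply call it stands for ...
via_mapply <- mapply(simple_cap, x, USE.NAMES = FALSE)
# ... and vapply, which additionally checks each result is one string
via_vapply <- vapply(x, simple_cap, character(1), USE.NAMES = FALSE)

identical(via_vectorize, via_mapply)
# [1] TRUE
identical(via_mapply, via_vapply)
# [1] TRUE
```

All three still call simple_cap once per element, so they are convenient rather than fast.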

Related

R - conditional pattern matching using grepl

I have two data frames, like so:
name <- c("joe", "kim", "kerry", "david")
name2 <- c("kim", "david", "joe", "kerry")
school <- c("cambridge", "south carolina", "vermont binghamton", "delaware")
school2 <- c("south carolina", "delaware", "cambridge magdalene", "vermont")
df1 <- data.frame(name, school)
df2 <- data.frame(name2, school2)
What I would like to do is the following:
Search df2$name2 for a match in df1$name.
If a match is found, compare df2$school2 to df1$school from the matching row.
If no match is found for df2$school2 in df1$school, return FALSE in column df2$perfect.match
So for example, since "joe" in df2 matches "joe" in df1, there's a match. However, since the values for "school" in both aren't the same, there would be a column in df2 with a value of FALSE in the third row. Same for the 4th row in df2.
I have tried using grep and grepl. I figure grepl would be best, since it returns a logical value. What I tried was:
df2$perfect.match <- ifelse(grepl(paste(df2$name2, collapse = "|"),
df1$name, fixed = F) & grepl(paste(df2$school2, collapse = "|"), df1$school, fixed = F), "", "FALSE")
however, all I get is this:
name2 school2 perfect.match
1 kim south carolina FALSE
2 david delaware
3 joe cambridge magdalene
4 kerry vermont
When my desired result is:
df2
name2 school2 perfect.match
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
If possible, something speedy would be best. The real data frames are quite large. Thanks.
UPDATE:
I would also like to be able to force the rows that are FALSE to have the same value for df2$school2 as their corresponding name match in df1$school, like so:
name2 school2
1 kim south carolina
2 david delaware
3 joe cambridge
4 kerry vermont binghamton
You can just do...
df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)
df2
name2 school2 perfect.match
1 kim south carolina TRUE
2 david delaware TRUE
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
We can use match and %in%. grepl wouldn't be right here since this is exact matching and not pattern matching.
df2$perfect_match <- df2$school2 %in% df1$school[match(df2$name2, df1$name)]
df2
# name2 school2 perfect_match
#1 kim south carolina TRUE
#2 david delaware TRUE
#3 joe cambridge magdalene FALSE
#4 kerry vermont FALSE
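One caveat: %in% tests membership in the whole vector of matched schools, not row by row, so on other data a school that belongs to a different name could still count as a match. A minimal sketch of a strictly row-wise comparison, which also covers the UPDATE in the question, could look like this. The names idx and fix are made up for illustration.

```r
df1 <- data.frame(
  name   = c("joe", "kim", "kerry", "david"),
  school = c("cambridge", "south carolina", "vermont binghamton", "delaware"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name2   = c("kim", "david", "joe", "kerry"),
  school2 = c("south carolina", "delaware", "cambridge magdalene", "vermont"),
  stringsAsFactors = FALSE
)

# Row of df1 holding each df2 name (NA when the name is absent)
idx <- match(df2$name2, df1$name)

# Compare each school2 with the school on its own matched row
df2$perfect_match <- !is.na(idx) & df2$school2 == df1$school[idx]
df2$perfect_match
# [1]  TRUE  TRUE FALSE FALSE

# UPDATE: overwrite mismatching schools with the value from df1
fix <- !is.na(idx) & !df2$perfect_match
df2$school2[fix] <- df1$school[idx[fix]]
df2$school2
# [1] "south carolina"     "delaware"           "cambridge"
# [4] "vermont binghamton"
```

match and == are both vectorized, so this should scale reasonably to large data frames.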
Slightly faster than pasting the columns together:
matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school
microbenchmark::microbenchmark(
v1 = {matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school
},
v2 = {df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)}
)
Using dplyr, you can do:
dfX <- df1 %>%
bind_rows(.,df2) %>%
group_by(name) %>%
distinct(school) %>%
count(name, name = "perfect.matched") %>%
left_join(df2,.,by = 'name') %>%
mutate(., perfect.matched = ifelse(perfect.matched ==1,"","FALSE"))
And to get the following output:
> dfX
name school perfect.matched
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE

Creating a loop to add labels to columns: library(Hmisc)

I have a dataset which looks something like this:
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
I have another dataset which looks something like this:
Indicator Code Indicator Name
P Power
H Happiness
I would like to add info in the second column of the second dataset (Power, Happiness) as a label to the abbreviation used in the first dataset with a loop, but I don't know exactly how to write the loop.
This is how far I got:
library(Hmisc)
for i in df2[,1]{
if (df1[,i] == df2[i,]){
label(df1[,i]) <- df2[i,2]
}}
But this merely checks whether names are the same and does not search for it.
Could anyone guide further?
Desired output:
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
If you specifically want to use a loop, this approach gives the output you describe:
df <- data.frame(Year = c(2000, 2001, 2000, 2001),
Country = c("France", "France", "UK","UK"),
Matchcode = c("0001", "0002", "0003", "0004"),
P = c(1213, 1234, 1726, 6433),
H = c(1872, 2345, 2234, 9082))
lookup <- data.frame(code = c ("P", "H"),
label = c("Power", "Happiness"),
stringsAsFactors = FALSE)
for (i in 1:length(colnames(df))) {
if(!is.na(match(colnames(df), lookup$code)[i])) {
Hmisc::label(df[[i]]) <- lookup$label[(match(colnames(df), lookup$code))[i]]
}
}
This works:
Hmisc::label(df[4])
# P
# "Power"
It also checks out in the RStudio viewer.
Like several of the other answerers and commenters, I had originally thought you wanted to append the "label = " text to the column names. For anyone wanting that, this is the (loop) code.
for (i in 1:length(colnames(df))) {
if(!is.na(match(colnames(df), lookup$code)[i])) {
colnames(df)[i] <- paste0(colnames(df)[i],
"(label=",
lookup$label[(match(colnames(df), lookup$code))[i]],
")")
}
}
It's not clear to me at all what you're trying to do with Hmisc::label but I think you're misinterpreting the role & function of Hmisc::label.
Consider the following:
Let's construct a sample data.frame consisting of 2 rows and 2 columns.
df <- setNames(data.frame(matrix(0, ncol = 2, nrow = 2)), c("a", "b"))
df
# a b
#1 0 0
#2 0 0
We extract the column names. Note that cn is a character vector.
cn <- colnames(df)
cn
#[1] "a" "b"
We now set a Hmisc::label for cn.
label(cn) <- "label for cn"
cn
#label for cn
#[1] "a" "b"
We inspect the attributes of cn
attributes(cn)
#$label
#[1] "label for cn"
#
#$class
#[1] "labelled" "character"
We now assign cn to the column names of df.
colnames(df) <- cn
df
# a b
#1 0 0
#2 0 0
Note how the label attribute is not stored as part of the column names.
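The attributes(cn) output above shows the label living in a "label" attribute on the object itself. Assuming that attribute is what matters (an assumption, not checked against the Hmisc source here), the same idea can be sketched with base R's attr by attaching the label to the column vector instead of to the column names. The label text is made up for illustration.

```r
df <- setNames(data.frame(matrix(0, ncol = 2, nrow = 2)), c("a", "b"))

# Attach the label to the column itself, not to the names vector
attr(df$a, "label") <- "label for a"

# The label travels with the column ...
attr(df$a, "label")
# [1] "label for a"

# ... while the column names stay untouched
names(df)
# [1] "a" "b"
```

Note that Hmisc::label<- also adds the "labelled" class, which this base R sketch does not.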
Here's a dplyr solution:
# example datasets
df = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
library(dplyr)
data.frame(original_names = names(df)) %>% # get original names
left_join(df2, by=c("original_names"="IndicatorName")) %>% # join names that should be updated
mutate(new_names = ifelse(is.na(IndicatorCode), original_names, paste0(original_names,"(label=",IndicatorCode,")"))) %>% # if there is a match update the name
pull(new_names) -> list_new_names # get column of new names and store it in a vector
# update names
names(df) = list_new_names
# check new names
df
# Year Country Matchcode P(label=Power) H(label=Happiness)
# 1 2000 France 1 1213 1872
# 2 2001 France 2 1234 2345
# 3 2000 UK 3 1726 2234
# 4 2001 UK 4 6433 9082
This would work. Find the corresponding text using %in%, and use paste0 to generate the label.
colnames(df1)[4:5] <- paste0(colnames(df1)[4:5], '(label=', df2$V2[colnames(df1)[4:5] %in% df2$V1], ')')
df1
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 1 1213 1872
2 2001 France 2 1234 2345
3 2000 UK 3 1726 2234
4 2001 UK 4 6433 9082
Data used
df1 <- read.table(text="Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082", header=T, stringsAsFactors=F)
df2 <- read.table(text="
P Power
H Happiness", header=F, stringsAsFactors=F)
If you still want to stick with Hmisc, you can modify the 'print' function to handle the extra information provided by the labels, or rather (and less harmfully) tell R that your data has to be printed using the labels. You can achieve this by creating a new data frame class for which the print function behaves differently.
The 'print' trick is not necessary with RStudio, which natively uses the labels together with the column names.
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082 ", header=T)
df2 = read.table(text = "
var lab
P Power
H Happiness", header=T, stringsAsFactors=FALSE)
## Set the labels of the columns in df1 accordingly to df2
library(Hmisc)
for (i in 1:ncol(df1)) {
lab <- df2[df2$var==colnames(df1)[i],2]
if (length(lab) != 0) label(df1[[i]]) <- lab
}
# A 'print' function dedicated to 'truc' objects
# Mainly it is the code from the original 'print' except for dimnames[[2L]]
print.truc <- function (x, ..., digits = NULL, quote = FALSE, right = TRUE,
row.names = TRUE)
{
n <- length(row.names(x))
if (length(x) == 0L) {
cat(sprintf(ngettext(n, "data frame with 0 columns and %d row",
"data frame with 0 columns and %d rows"), n), "\n",
sep = "")
}
else if (n == 0L) {
print.default(names(x), quote = FALSE)
cat(gettext("<0 rows> (or 0-length row.names)\n"))
}
else {
m <- as.matrix(format.data.frame(x, digits = digits,
na.encode = FALSE))
if (!isTRUE(row.names))
dimnames(m)[[1L]] <- if (isFALSE(row.names))
rep.int("", n)
else row.names
dimnames(m)[[2L]] <- purrr::map(1:ncol(x),
function(i) {
z <- attributes(x[[i]])$label
if (length(z)!=0) z else colnames(x)[i]
})
print(m, ..., quote = quote, right = right)
}
invisible(x)
}
# Says that 'df1' is an 'enhanced' data frame
class(df1) <- c("truc",class(df1))
# Print as enhanced
print(df1)
# Year Country Matchcode Power Happiness
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
# Print using standard way
print(as.data.frame(df1))
# Year Country Matchcode P H
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
There is no need for a loop with Hmisc; you can do this in one line using the option self = FALSE in the label command.
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
I.e.:
library(Hmisc, warn.conflicts = FALSE, quietly = TRUE)
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
sapply(df1, label)
#> Year Country Matchcode P H
#> "" "" "" "Power" "Happiness"
Created on 2020-09-14 by the reprex package (v0.3.0)

Regular Expressions to Unmerge row entries

I have an example data set given by
df <- data.frame(
country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column has mixed cases.
the text in the URL replaces spaces with hyphens.
the url removes special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
df %>% mutate( capital = str_replace(capital, "#", ""),
country = str_replace(country, capital,"")
)
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
You can do:
transform(df,capital=sub(".*[A-Z]\\S+([A-Z])","\\1",country))
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
gsub("-", " ", .) %>%
gsub(".+try/(.+)/.+", "\\1", .) %>%
gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)
df$capital <- df$country %>%
gsub("[()]", " ", .) %>%
gsub(" +", " ", .) %>%
gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)
df$country <- df$country2
df$country2 <- NULL
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/0
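Another way to use the url column, sketched here under stated assumptions, is to count how many letters and digits the URL slug contains and cut the country string after the same number of letters and digits. This keeps the original casing and the brackets around UK. The helper name split_by_slug is hypothetical, and the sketch assumes the slug only drops spaces and punctuation (no accented or otherwise transliterated characters).

```r
# Hypothetical helper: split one "CountryCapital" string using its URL slug
split_by_slug <- function(country, url) {
  # Slug between "country/" and the next "/", e.g. "england-uk"
  slug <- sub(".*country/([^/]+)/.*", "\\1", url)
  # Number of letters/digits that belong to the country name
  n <- nchar(gsub("[^a-z0-9]", "", slug))
  chars <- strsplit(country, "")[[1]]
  seen <- cumsum(grepl("[A-Za-z0-9]", chars))
  # Positions where exactly n letters/digits have been consumed; taking the
  # last one keeps trailing punctuation such as ")" with the country
  pos <- which(seen == n)
  if (length(pos) == 0) {
    return(c(country = country, capital = NA_character_))
  }
  end <- max(pos)
  c(country = substring(country, 1, end),
    capital = substring(country, end + 1))
}

df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid",
              "United States of AmericaWashington DC",
              "HaitiPort-au-Prince", "country66city"),
  url = c("/country/germany/01", "/country/england-uk/02",
          "/country/spain/03", "country/united-states-of-america/04",
          "country/haiti/05", "country/country6/06"),
  stringsAsFactors = FALSE
)

# One row of the result matrix per input row
parts <- do.call(rbind, Map(split_by_slug, df$country, df$url))
df$capital <- unname(parts[, "capital"])
df$country <- unname(parts[, "country"])
df[c("country", "capital")]
```

A capital that itself begins with punctuation would be split in the wrong place, so the result is worth spot-checking on real data.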

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing in this way. You first get score differences and winners. When the result indicates w, home is the winner. So you do not have to look into scores at all. Once you add the score difference and winner, you can subset your data by subsetting data with max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result,..), and the row index is just 1:nrow(football). Subset the "football" dataset with columns 'home', 'away' and cbind it with an additional column 'draw' with NAs so that the 'd' elements in "result" change to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
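Since the linked file may not be reachable, here is a small self-contained sketch of the row/column indexing idea. The first three rows copy the scores shown in the outputs above; the fourth row (TeamA vs TeamB) is made up so that a draw is present, and the names toy, candidates, pick and best are hypothetical.

```r
toy <- data.frame(
  home       = c("Cameroon", "Spain", "Germany", "TeamA"),
  away       = c("Croatia", "Netherlands", "Portugal", "TeamB"),
  home_score = c(0, 1, 4, 1),
  away_score = c(4, 5, 0, 1),
  result     = c("l", "l", "w", "d"),
  stringsAsFactors = FALSE
)

toy$score_diff <- abs(toy$home_score - toy$away_score)

# Columns to pick from: 1 = home wins, 2 = away wins, 3 = draw (NA)
candidates <- cbind(toy[c("home", "away")], draw = NA)

# Two-column index matrix: (row number, column chosen by the result)
pick <- cbind(seq_len(nrow(toy)), match(toy$result, c("w", "l", "d")))
toy$winner <- candidates[pick]

# New data frame with just the team name and the goal differential
best <- toy[toy$score_diff == max(toy$score_diff), c("winner", "score_diff")]
best
```

Indexing with a two-column matrix selects one cell per row, which is what replaces the nested ifelse calls.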

Transform data.frame of lists to data.frame

I would like to transform this data.frame
Loc Time(h)
Paris, Luxembourg 10,15
Paris, Lyon, Berlin 9,12,11
to this
Loc Time(h)
Paris 10
Luxembourg 15
Paris 9
Lyon 12
Berlin 11
You could use Ananda Mahto's cSplit function, provided you have data.table installed.
If dat is your data,
devtools::source_gist(11380733)
cSplit(dat, c("Loc", "Time"), direction = "long")
# Loc Time
# 1: Paris 10
# 2: Luxembourg 15
# 3: Paris 9
# 4: Lyon 12
# 5: Berlin 11
Assuming each entry in your data frame is a character string of the form shown above, you could do the following:
# notice the space in ", " for the first line
newLoc <- strsplit(df$Loc, ", ", fixed = TRUE)
# and the lack thereof in the second
newTime <- strsplit(df$`Time(h)`, ",", fixed = TRUE)
The rows hold different numbers of entries, so each result is a list, and we need to flatten both:
newLoc <- unlist(newLoc)
newTime <- unlist(newTime)
Then combine them back into a data frame (check.names = FALSE keeps the Time(h) column name as is):
data.frame(Loc = newLoc, `Time(h)` = newTime, check.names = FALSE)
