Add new column if range of columns contains string in R

I have a dataframe like the one below. I would like to add 2 columns:
ContainsANZ: indicates whether any of the columns F0 to F3 contain 'Australia' or 'New Zealand', ignoring NA values
AllANZ: indicates whether all non-NA columns contain 'Australia' or 'New Zealand'
The starting dataframe is:
dfContainsANZ
Col.A Col.B Col.C F0 F1 F2 F3
1 data 0 xxx Australia Singapore <NA> <NA>
2 data 1 yyy United States United States United States <NA>
3 data 0 zzz Australia Australia Australia Australia
4 data 0 ooo Hong Kong London Australia <NA>
5 data 1 xxx New Zealand <NA> <NA> <NA>
The end result should look like this:
df
Col.A Col.B Col.C F0 F1 F2 F3 ContainsANZ AllANZ
1 data 0 xxx Australia Singapore <NA> <NA> Australia undefined
2 data 1 yyy United States United States United States <NA> undefined undefined
3 data 0 zzz Australia Australia Australia Australia Australia Australia
4 data 0 ooo Hong Kong London Australia <NA> Australia undefined
5 data 1 xxx New Zealand <NA> <NA> <NA> New Zealand New Zealand
I'm using dplyr (a dplyr solution is preferred) and have come up with the code below, which doesn't work and is very repetitive. Is there a better way to write this so that I don't have to copy the F0|F1|F2... rules over and over? My real data set has more columns. Are the NAs interfering with the code?
df <- df %>%
  mutate(ANZFlag =
    ifelse(
      F0 == 'Australia' |
        F1 == 'Australia' |
        F2 == 'Australia' |
        F3 == 'Australia',
      'Australia',
      ifelse(
        F0 == 'New Zealand' |
          F1 == 'New Zealand' |
          F2 == 'New Zealand' |
          F3 == 'New Zealand',
        'New Zealand', 'undefined'
      )
    )
  )

Still some typing, but I think this gets at the essence you're looking for:
library(dplyr)
df <- read.table(text='Col.A,Col.B,Col.C,F0,F1,F2,F3
data,0,xxx,Australia,Singapore,NA,NA
data,1,yyy,"United States","United States","United States",NA
data,0,zzz,Australia,Australia,Australia,Australia
data,0,ooo,"Hong Kong",London,Australia,NA
data,1,xxx,"New Zealand",NA,NA,NA', header=TRUE, sep=",", stringsAsFactors=FALSE)
down_under <- function(x) {
  mtch <- c("Australia", "New Zealand")
  cols <- unlist(x)[c("F0", "F1", "F2", "F3")]
  bind_cols(x, data_frame(ContainsANZ = any(mtch %in% cols, na.rm = TRUE),
                          AllANZ = all(as.vector(na.omit(cols)) %in% mtch)))
}
rowwise(df) %>% do(down_under(.))
## Source: local data frame [5 x 9]
## Groups: <by row>
##
## Col.A Col.B Col.C F0 F1 F2 F3 ContainsANZ AllANZ
## (chr) (int) (chr) (chr) (chr) (chr) (chr) (lgl) (lgl)
## 1 data 0 xxx Australia Singapore NA NA TRUE FALSE
## 2 data 1 yyy United States United States United States NA FALSE FALSE
## 3 data 0 zzz Australia Australia Australia Australia TRUE TRUE
## 4 data 0 ooo Hong Kong London Australia NA TRUE FALSE
## 5 data 1 xxx New Zealand NA NA NA TRUE TRUE
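With more recent dplyr (1.0+), rowwise() together with c_across() avoids the do() call entirely. A minimal sketch of the same logic (logical flags, as in the answer above), assuming the country columns are F0:F3 (use starts_with("F") if there are more):
library(dplyr)
anz <- c("Australia", "New Zealand")
df %>%
  rowwise() %>%
  mutate(
    ContainsANZ = any(na.omit(c_across(F0:F3)) %in% anz),  # at least one non-NA value is ANZ
    AllANZ      = all(na.omit(c_across(F0:F3)) %in% anz)   # every non-NA value is ANZ
  ) %>%
  ungroup()
If the character labels from the question ('Australia', 'New Zealand', 'undefined') are wanted instead of logicals, a follow-up case_when() on these flags can produce them. Note that a row whose F columns are all NA would get AllANZ = TRUE here, since all() of an empty vector is TRUE.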


Replace all partial string entries with NA

I have a data frame similar to:
df <- as.data.frame(cbind(rep("Canada", 6),
                          c(rep("Alberta", 3), rep("Manitoba", 2), rep("Unknown_province", 1)),
                          c("Edmonton", "Unknown_city", "Unknown_city", "Brandon", "Unknown_city", "Unknown_city")))
colnames(df) <- c("Country", "Province", "City")
I would like to substitute all entries that contain "Unknown" with NA.
I have tried using grepl, but it removes all entries for that variable if any entry matches; I would like to replace only the individual cells.
df[grepl("Unknown", df, ignore.case=TRUE)] <- NA
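For context on why this misbehaves: grepl() has no data-frame method, so it coerces df with as.character(), which produces one deparsed string per column. The result is therefore one logical per column, and indexing df with it replaces whole columns rather than individual cells. A quick check, reusing the df above:
g <- grepl("Unknown", df, ignore.case = TRUE)
length(g)   # equals ncol(df): one logical per column, not one per cell
df[g]       # selects entire columns, which is why the assignment blanked them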
df1 <- df # This is to ensure that we can refer back to df in case there is an issue
Then you could use any of the following:
is.na(df1) <- array(grepl('Unknown', as.matrix(df1)), dim(df1))
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
or even:
df1[] <- sub("Unknown.*", NA, as.matrix(df1), ignore.case = TRUE)
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
Note that grepl and even sub are vectorized, hence there is no need to use the *apply family or for loops.
Here is one possible way to solve your problem:
df[] <- lapply(df, function(x) ifelse(grepl("Unknown", x, TRUE), NA, x))
df
# Country Province City
# 1 Canada Alberta Edmonton
# 2 Canada Alberta <NA>
# 3 Canada Alberta <NA>
# 4 Canada Manitoba Brandon
# 5 Canada Manitoba <NA>
# 6 Canada <NA> <NA>
Using dplyr
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(),
~ case_when(str_detect(., 'Unknown', negate = TRUE) ~ .)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
I like to use replace() in such cases, where values in a vector are replaced or left as is depending on a condition:
library(dplyr)
library(stringr)
df %>% mutate(across(everything(), ~ replace(.x, str_detect(.x, 'Unknown'), NA)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
df[]<- lapply(df, gsub, pattern = "Unknown", replacement = NA, fixed = TRUE)

Can you use a dataframe to assist with "find and replace" in R

I am trying to clean some Census data where all the states are given a FIPS code instead of the state abbreviation. I want to run something that goes through the column with the FIPS codes and converts them to the state abbreviation: find all the 1's and convert them to AL, all the 2's to AK, and so on. I know I can do this with ifelse statements, but I was wondering if there is a more efficient way that avoids writing 51 ifelse statements. Thank you all for your assistance.
Here's a try. I'll use data from https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html for valid FIPS codes, and make a fake "bad data" frame.
FIPS <- read.table("https://www2.census.gov/geo/docs/reference/state.txt",
sep = "|", header = TRUE, colClasses = "character")
head(FIPS)
# STATE STUSAB STATE_NAME STATENS
# 1 01 AL Alabama 01779775
# 2 02 AK Alaska 01785533
# 3 04 AZ Arizona 01779777
# 4 05 AR Arkansas 00068085
# 5 06 CA California 01779778
# 6 08 CO Colorado 01779779
baddata <- data.frame(stateabbr = c("AL", "AK", "22"))
baddata
# stateabbr
# 1 AL
# 2 AK
# 3 22
Base R
fixeddata <- merge(baddata, FIPS[,c("STATE", "STUSAB")],
by.x = "stateabbr", by.y = "STATE", all.x = TRUE)
fixeddata
# stateabbr STUSAB
# 1 22 LA
# 2 AK <NA>
# 3 AL <NA>
fixeddata$stateabbr <- ifelse(is.na(fixeddata$STUSAB), fixeddata$stateabbr, fixeddata$STUSAB)
fixeddata$STUSAB <- NULL
fixeddata
# stateabbr
# 1 LA
# 2 AK
# 3 AL
dplyr
library(dplyr)
left_join(baddata, FIPS[,c("STATE", "STUSAB")], by = c("stateabbr" = "STATE")) %>%
mutate(stateabbr = coalesce(STUSAB, stateabbr)) %>%
select(-STUSAB)
# stateabbr
# 1 AL
# 2 AK
# 3 LA
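Another base R option, not from the original answers, is a match() lookup against the reference table; it keeps the rows in their original order. A minimal sketch, reusing the FIPS and baddata objects above and assuming stateabbr is a character column:
# position of each entry in the FIPS code column (NA when the entry is not a code)
idx <- match(baddata$stateabbr, FIPS$STATE)
# replace matched entries with the abbreviation, leave everything else untouched
baddata$stateabbr <- ifelse(is.na(idx), baddata$stateabbr, FIPS$STUSAB[idx])
baddata
# stateabbr
# 1 AL
# 2 AK
# 3 LA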

Creating a loop to add labels to columns: library(Hmisc)

I have a dataset which looks something like this:
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
I have another dataset which looks something like this:
Indicator Code Indicator Name
P Power
H Happiness
I would like to use a loop to add the info in the second column of the second dataset (Power, Happiness) as a label on the corresponding abbreviation in the first dataset, but I don't know exactly how to write the loop.
This is how far I got:
library(Hmisc)
for (i in df2[,1]) {
  if (df1[,i] == df2[i,]) {
    label(df1[,i]) <- df2[i,2]
  }
}
But this merely checks whether the names are the same and does not actually look them up.
Could anyone guide me further?
Desired output:
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
If you specifically want to use a loop, this approach gives the output you describe:
df <- data.frame(Year = c(2000, 2001, 2000, 2001),
                 Country = c("France", "France", "UK", "UK"),
                 Matchcode = c("0001", "0002", "0003", "0004"),
                 P = c(1213, 1234, 1726, 6433),
                 H = c(1872, 2345, 2234, 9082))
lookup <- data.frame(code = c("P", "H"),
                     label = c("Power", "Happiness"),
                     stringsAsFactors = FALSE)
for (i in 1:length(colnames(df))) {
  if (!is.na(match(colnames(df), lookup$code)[i])) {
    Hmisc::label(df[[i]]) <- lookup$label[(match(colnames(df), lookup$code))[i]]
  }
}
This works:
Hmisc::label(df[4])
# P
# "Power"
It also checks out in the RStudio viewer.
Like several of the other answerers and commenters, I had originally thought you wanted to append the "label = " text to the column names. For anyone wanting that, here is the loop code:
for (i in 1:length(colnames(df))) {
  if (!is.na(match(colnames(df), lookup$code)[i])) {
    colnames(df)[i] <- paste0(colnames(df)[i],
                              "(label=",
                              lookup$label[(match(colnames(df), lookup$code))[i]],
                              ")")
  }
}
It's not clear to me what you're trying to do here, but I think you're misinterpreting the role and function of Hmisc::label.
Consider the following:
Let's construct a sample data.frame consisting of 2 rows and 2 columns.
df <- setNames(data.frame(matrix(0, ncol = 2, nrow = 2)), c("a", "b"))
df
# a b
#1 0 0
#2 0 0
We extract the column names. Note that cn is a character vector.
cn <- colnames(df)
cn
#[1] "a" "b"
We now set a Hmisc::label for cn.
label(cn) <- "label for cn"
cn
#label for cn
#[1] "a" "b"
We inspect the attributes of cn
attributes(cn)
#$label
#[1] "label for cn"
#
#$class
#[1] "labelled" "character"
We now assign cn to the column names of df.
colnames(df) <- cn
df
# a b
#1 0 0
#2 0 0
Note how the label attribute is not stored as part of the column names.
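So if the goal is a label, it has to be attached to the column (the vector) itself rather than to colnames(); a minimal illustration, reusing the toy df from above:
# the label is an attribute of the column, set with Hmisc's label<-
label(df$a) <- "label for column a"
label(df$a)
#[1] "label for column a"
This is exactly what the loop-based answers above do with label(df[[i]]) <- ...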
Here's a dplyr solution:
# example datasets
df = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
library(dplyr)
data.frame(original_names = names(df)) %>%                        # get original names
  left_join(df2, by = c("original_names" = "IndicatorName")) %>%  # join names that should be updated
  mutate(new_names = ifelse(is.na(IndicatorCode),
                            original_names,
                            paste0(original_names, "(label=", IndicatorCode, ")"))) %>%  # if there is a match, update the name
  pull(new_names) -> list_new_names                               # get the new names and store them in a vector
# update names
names(df) = list_new_names
# check new names
df
# Year Country Matchcode P(label=Power) H(label=Happiness)
# 1 2000 France 1 1213 1872
# 2 2001 France 2 1234 2345
# 3 2000 UK 3 1726 2234
# 4 2001 UK 4 6433 9082
This would work. Find the corresponding text using %in%, and use paste0 to generate the label.
colnames(df1)[4:5] <- paste0(colnames(df1)[4:5], '(label=', df2$V2[colnames(df1)[4:5] %in% df2$V1], ')')
df1
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 1 1213 1872
2 2001 France 2 1234 2345
3 2000 UK 3 1726 2234
4 2001 UK 4 6433 9082
Data used
df1 <- read.table(text="Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082", header=T, stringsAsFactors=F)
df2 <- read.table(text="
P Power
H Happiness", header=F, stringsAsFactors=F)
If you still want to stick with Hmisc, you can modify the 'print' function to handle the extra information provided by the labels, or rather (and less harmful) tell R that your data has to be printed using the labels. You can achieve this by creating a new data frame class for which the print function behaves differently.
The 'print' trick is not necessary with RStudio, which natively uses the labels together with the column names.
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082 ", header=T)
df2 = read.table(text = "
var lab
P Power
H Happiness", header=T, stringsAsFactors=FALSE)
## Set the labels of the columns in df1 according to df2
library(Hmisc)
for (i in 1:ncol(df1)) {
  lab <- df2[df2$var == colnames(df1)[i], 2]
  if (length(lab) != 0) label(df1[[i]]) <- lab
}
# A 'print' function dedicated to 'truc' objects
# Mainly it is the code from the original 'print' except for dimnames[[2L]]
print.truc <- function (x, ..., digits = NULL, quote = FALSE, right = TRUE,
                        row.names = TRUE)
{
  n <- length(row.names(x))
  if (length(x) == 0L) {
    cat(sprintf(ngettext(n, "data frame with 0 columns and %d row",
                         "data frame with 0 columns and %d rows"), n), "\n",
        sep = "")
  }
  else if (n == 0L) {
    print.default(names(x), quote = FALSE)
    cat(gettext("<0 rows> (or 0-length row.names)\n"))
  }
  else {
    m <- as.matrix(format.data.frame(x, digits = digits,
                                     na.encode = FALSE))
    if (!isTRUE(row.names))
      dimnames(m)[[1L]] <- if (isFALSE(row.names))
        rep.int("", n)
      else row.names
    dimnames(m)[[2L]] <- purrr::map(1:ncol(x),
                                    function(i) {
                                      z <- attributes(x[[i]])$label
                                      if (length(z) != 0) z else colnames(x)[i]
                                    })
    print(m, ..., quote = quote, right = right)
  }
  invisible(x)
}
# Says that 'df1' is an 'enhanced' data frame
class(df1) <- c("truc",class(df1))
# Print as enhanced
print(df1)
# Year Country Matchcode Power Happiness
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
# Print using standard way
print(as.data.frame(df1))
# Year Country Matchcode P H
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
No need for a loop with Hmisc; you can do this in one line using the option self = FALSE in the label command.
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
I.e.:
library(Hmisc, warn.conflicts = FALSE, quietly = TRUE)
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
sapply(df1, label)
#> Year Country Matchcode P H
#> "" "" "" "Power" "Happiness"
Created on 2020-09-14 by the reprex package (v0.3.0)

Create count per item by year/decade

I have data in a data.table that is as follows:
> x<-df[sample(nrow(df), 10),]
> x
Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and Exporter. So, in the above example the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods suggested by the post here, but both of them seem to just give me counts of the total number of Importers/Exporters per year (or decade, as I am more interested in that).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/exporter? It does not matter if the summaries for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by reference with :=, then melt the data from 'wide' to 'long' format by specifying the measure columns, and reshape it back to 'wide' with dcast, using length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
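For completeness, the same melt-count-reshape idea looks roughly like this in the tidyverse (a sketch, assuming Date is a Date or IDate column):
library(dplyr)
library(tidyr)
x %>%
  mutate(Decade = as.integer(format(as.Date(Date), "%Y")) %/% 10 * 10) %>%
  pivot_longer(c(Importer, Exporter), names_to = "Role", values_to = "Country") %>%
  count(Decade, Country, Role) %>%
  pivot_wider(names_from = Role, values_from = n, values_fill = 0)
Here count() plus pivot_wider() plays the same role as melt()/dcast() with length as the aggregate.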
I think with() will work with aggregate() in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1

formula in R name string

I have a data frame of the following form:
country company hitid
1 Switzerland CH1 <NA>
2 Switzerland CH2 <NA>
3 Switzerland CH3 <NA>
4 Sweden SU1 <NA>
5 Sweden SU2 <NA>
6 Sweden SU3 <NA>
In the hitid column, I would like to automatically fill in the results of a loop I have run before. The results are given in the form d$COUNTRY$hitid, where for each country I have another hitid that I would like to fill in.
EDIT:
my loop output is of the following form:
$Switzerland
HITTypeId HITId Valid
1 1010 123 TRUE
$Sweden
HITTypeId HITId Valid
1 1010 456 TRUE
Is there any way to use a formula inside of a name string, so that I could construct something like:
hitid=d$"formula to look up country"$hitid
Or are there any ideas on how to set this problem up more elegantly?
Basically I just want to extract the HITId for each country out of the loop output and into the existing data file.
Here is a plyr solution:
library(plyr)
ddply(dat, .(country), transform,
      hitid = d[[unique(country)]]$hitid)
where I assume that:
d <- list(Switzerland=list(hitid=1),
Sweden=list(hitid=2))
This makes some assumptions about your data, i.e., that DF$country is a character column and that d is a list.
DF <- read.table(text=" country company hitid
1 Switzerland CH1 <NA>
2 Switzerland CH2 <NA>
3 Switzerland CH3 <NA>
4 Sweden SU1 <NA>
5 Sweden SU2 <NA>
6 Sweden SU3 <NA>",header=TRUE,stringsAsFactors=FALSE)
d <- list(Switzerland=list(hitid=123),Sweden=list(hitid=456))
fun <- function(x) d[[x]][["hitid"]]
DF$hitid <- sapply(DF$country,fun)
# country company hitid
# 1 Switzerland CH1 123
# 2 Switzerland CH2 123
# 3 Switzerland CH3 123
# 4 Sweden SU1 456
# 5 Sweden SU2 456
# 6 Sweden SU3 456
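Since the EDIT shows the loop output is really a named list of one-row data frames with a HITId column, the same lookup works against that structure; a sketch, assuming d matches the EDIT and DF is the data frame built above:
d <- list(
  Switzerland = data.frame(HITTypeId = 1010, HITId = 123, Valid = TRUE),
  Sweden      = data.frame(HITTypeId = 1010, HITId = 456, Valid = TRUE)
)
# look up each row's country in d and take its HITId
DF$hitid <- vapply(DF$country, function(cn) d[[cn]]$HITId, numeric(1))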
