paste0 multiple lists of different lengths without looping - r

I am paste0'ing a bunch of variables into a definitive url list
id <- 1:10
animal <- c("dog", "cat", "fish")
base <- "www.google.com/"
urls <- paste0(base, "id=", id, "search=", animal)
The output looks like:
[1] "www.google.com/id=1search=dog" "www.google.com/id=2search=cat" "www.google.com/id=3search=fish"
[4] "www.google.com/id=4search=dog" "www.google.com/id=5search=cat" "www.google.com/id=6search=fish"
[7] "www.google.com/id=7search=dog" "www.google.com/id=8search=cat" "www.google.com/id=9search=fish"
[10] "www.google.com/id=10search=dog"
But I actually want the ids and animals to be repeated in sequence like:
[1] "www.google.com/id=1search=dog" "www.google.com/id=2search=dog" "www.google.com/id=3search=dog"
[4] "www.google.com/id=4search=dog" "www.google.com/id=5search=dog" "www.google.com/id=6search=dog"
[7] "www.google.com/id=7search=dog" "www.google.com/id=8search=dog" "www.google.com/id=9search=dog"
[10] "www.google.com/id=10search=dog" "www.google.com/id=1search=cat" ...

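paste0 recycles the shorter vector, which is why the animals cycle. To get every id for each animal, repeat each animal once per id with rep inside paste0 or sprintf: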
sprintf('%sid=%dsearch=%s', base, id, rep(animal,each=length(id)))
Or
paste0(base, 'id=',id, 'search=', rep(animal,each=length(id)))
Or, as @MrFlick suggested, we can use expand.grid to get all the combinations of 'id' and 'animal'. The first argument varies fastest, so 'id' goes first to match the requested order:
with(expand.grid(i=id, a=animal), paste0(base, "id=", i, "search=", a))
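Another way to see the same idea, offered as a hedged sketch rather than as part of the original answers: outer() builds every id/animal pair, and flattening the resulting matrix column by column gives the ids cycling within each animal. The names url_matrix and urls are used only for illustration.

```r
id <- 1:10
animal <- c("dog", "cat", "fish")
base <- "www.google.com/"

# outer() calls the function once on fully expanded vectors:
# rows follow id, columns follow animal
url_matrix <- outer(id, animal, function(i, a) paste0(base, "id=", i, "search=", a))

# as.vector() reads the matrix column by column, so all ids for "dog"
# come first, then all ids for "cat", then all ids for "fish"
urls <- as.vector(url_matrix)

length(urls)
urls[c(1, 10, 11)]
```

The result has one URL per id/animal pair, in the same order as the rep(animal, each = length(id)) approach.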

Related

how to sort list.files() in correct date order?

Using a plain list.files() in the working directory returns the file list, but the numeric order is messed up.
f <- list.files(pattern="*.nc")
f
# [1] "te1971-1.nc" "te1971-10.nc" "te1971-11.nc" "te1971-12.nc"
# [5] "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [9] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc"
where the number after "-" describes the month number.
I used the following to try to sort it
myFiles <- paste("te", i, "-", c(1:12), ".nc", sep = "")
mixedsort(myFiles)
it returns ordered files but in reverse:
[1] "te1971-12.nc" "te1971-11.nc" "te1971-10.nc" "te1971-9.nc"
[5] "te1971-8.nc" "te1971-7.nc" "te1971-6.nc" "te1971-5.nc"
[9] "te1971-4.nc" "te1971-3.nc" "te1971-2.nc" "te1971-1.nc"
How do I fix this?
The issue is that the values get alphabetically sorted.
You could use gsub to capture the year and month as groups (the parts in parentheses), append "-1" as the first day of the month to the result, coerce it with as.Date and order by that.
x[order(as.Date(gsub('.*(\\d{4})-(\\d{,2}).*', '\\1-\\2-1', x)))]
# [1] "te1971-1.nc" "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [6] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc" "te1971-10.nc"
# [11] "te1971-11.nc" "te1971-12.nc"
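If the Date conversion feels indirect, a variation (a sketch, not the answerer's method) is to pull the year and month out as integers and order on those. It assumes every name has the form te<year>-<month>.nc. The names year, month and sorted are introduced only for illustration, and x is repeated here so the snippet runs on its own.

```r
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
       "te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc",
       "te1971-6.nc", "te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

# Capture the four-digit year and the month number as integers
year  <- as.integer(sub("^te(\\d{4})-(\\d+)\\.nc$", "\\1", x))
month <- as.integer(sub("^te(\\d{4})-(\\d+)\\.nc$", "\\2", x))

# order() sorts by year first, then by month within each year
sorted <- x[order(year, month)]
sorted
```

Because the comparison is numeric, month 10 sorts after month 9 instead of after month 1.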
Data:
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
"te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc", "te1971-6.nc",
"te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

How to create and save subset dataframes for sequence of year-month

I would like to filter from a dataframe observations for a given year-month and then save it as a separate dataframe and name it with the respective year-month.
I would be grateful if someone could suggest more efficient code than the code below. Also, this code is not filtering the observations correctly.
data <- data.frame(year = c(rep(2012,12),rep(2013,12),rep(2014,12),rep(2015,12),rep(2016,12)),
month = rep(1:12,5),
info = seq(60)*100)
years <- 2012:2016
months <- 1:12
for(year in years){
for(month in months){
data_sel <- data %>%
filter(year==year & month==month)
if(month<10){
month_alt <- paste0("0",month) # months 1-9 should show up as 01-09
}
Newname <- paste0(year,month_alt,'_','data_sel')
assign(Newname, data_sel)
}
}
The output I am looking to get is below (separate objects containing data from a given year-month):
> ls()
[1] "201201_data_sel" "201202_data_sel" "201203_data_sel" "201204_data_sel"
[5] "201205_data_sel" "201206_data_sel" "201207_data_sel" "201208_data_sel"
[9] "201209_data_sel" "201301_data_sel" "201302_data_sel" "201303_data_sel"
[13] "201304_data_sel" "201305_data_sel" "201306_data_sel" "201307_data_sel"
[17] "201308_data_sel" "201309_data_sel" "201401_data_sel" "201402_data_sel"
[21] "201403_data_sel" "201404_data_sel" "201405_data_sel" "201406_data_sel"
[25] "201407_data_sel" "201408_data_sel" "201409_data_sel" "201501_data_sel"
[29] "201502_data_sel" "201503_data_sel" "201504_data_sel" "201505_data_sel"
[33] "201506_data_sel" "201507_data_sel" "201508_data_sel" "201509_data_sel"
[37] "201601_data_sel" "201602_data_sel" "201603_data_sel" "201604_data_sel"
[41] "201605_data_sel" "201606_data_sel" "201607_data_sel" "201608_data_sel"
[45] "201609_data_sel" "data" "data_sel" "month"
[49] "month_alt" "months" "Newname" "year"
[53] "years"
You could do:
library(dplyr)
g <- data %>%
mutate(month = sprintf("%02d", month)) %>%
group_by(year, month)
setNames(group_split(g), with(group_keys(g), paste0("data_sel_", year, month))) %>%
list2env(envir = .GlobalEnv)
Object names that start with a digit are not syntactically valid in R (they can only be used with backticks), so "data_sel_" comes first in paste0.
As mentioned in the comments, it might be better not to pipe to list2env and instead keep the output as a list with named elements.
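A minimal sketch of the list-based alternative, using base R's split() in place of dplyr (my substitution, not the answerer's code). The names key and by_month are hypothetical. It also sidesteps the filtering problem in the question: inside filter(), year == year compares the column with itself, so every row is kept.

```r
data <- data.frame(year  = rep(2012:2016, each = 12),
                   month = rep(1:12, 5),
                   info  = seq(60) * 100)

# Build one label per row, e.g. "data_sel_201203" for March 2012
key <- sprintf("data_sel_%d%02d", data$year, data$month)

# split() returns a named list with one data frame per year-month
by_month <- split(data, key)

length(by_month)
by_month[["data_sel_201203"]]
```

Individual subsets are then available by name, and the whole collection can be processed with lapply() without filling the global environment.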

create groups in a list based on values in r and rbind files in lists

Please see part of the list below. The list actually spans from 19800101 to 20161231. First, I want to create groups based on year, i.e. put elements 19800101-19801231 into one group called 1980, and so on. Then I would like to rbind the small files in each group into one big file, e.g. rbind 19800101 to 19801231 into a single file 1980.
Any ideas? Thanks!
[1] "19800101.csv" "19800102.csv" "19800103.csv" "19800104.csv" "19800105.csv" "19800106.csv" "19800107.csv"
[8] "19800108.csv" "19800109.csv" "19800110.csv" "19800111.csv" "19800112.csv" "19800113.csv" "19800114.csv"
[15] "19800115.csv" "19800116.csv" "19800117.csv" "19800118.csv" "19800119.csv" "19800120.csv" "19800121.csv"
[22] "19800122.csv" "19800123.csv" "19800124.csv" "19800125.csv" "19800126.csv" "19800127.csv" "19800128.csv"
[29] "19800129.csv" "19800130.csv" "19800131.csv" "19800201.csv" "19800202.csv" "19800203.csv" "19800204.csv"
[36] "19800205.csv" "19800206.csv" "19800207.csv" "19800208.csv" "19800209.csv" "19800210.csv" "19800211.csv"
[43] "19800212.csv" "19800213.csv" "19800214.csv" "19800215.csv" "19800216.csv" "19800217.csv" "19800218.csv"
[50] "19800219.csv" "19800220.csv" "19800221.csv" "19800222.csv" "19800223.csv" "19800224.csv" "19800225.csv"
[57] "19800226.csv" "19800227.csv" "19800228.csv" "19800229.csv" "19800301.csv" "19800302.csv" "19800303.csv"
[64] "19800304.csv" "19800305.csv" "19800306.csv" "19800307.csv" "19800308.csv" "19800309.csv" "19800310.csv"
[71] "19800311.csv" "19800312.csv" "19800313.csv" "19800314.csv" "19800315.csv" "19800316.csv" "19800317.csv"
[78] "19800318.csv" "19800319.csv" "19800320.csv" "19800321.csv" "19800322.csv" "19800323.csv" "19800324.csv"
[85] "19800325.csv" "19800326.csv" "19800327.csv" "19800328.csv" "19800329.csv" "19800330.csv" "19800331.csv"
Assuming the file names are stored in a character vector v1, we can use substr to split them by their first 4 characters into a list of vectors.
lst1 <- split(v1, as.integer(substr(v1, 1, 4)))
The list elements can be accessed by [[ or $
lst1$`1980`
lst1[["1980"]]
Then, we can read over the list and rbind the datasets
lst2 <- lapply(lst1, function(x) do.call(rbind, lapply(x, read.csv)))
If we need to write it to csv without keeping it in a list
library(readr)
for(nm in names(lst1)) {
tmp <- data.frame()
for(i in seq_along(lst1[[nm]])) {
tmp <- rbind(tmp, read_csv(lst1[[nm]][i]))
}
write_csv(tmp, paste0(nm, ".csv"))
rm(tmp)
}
Or with tidyverse
library(purrr)
library(readr)
library(dplyr)
lst2 <- map(lst1, ~ map_dfr(.x, read_csv))
If it is to create a grouping column, use
df1 <- data.frame(v1)
df1$grp <- substr(df1$v1, 1, 4)
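To try the split-then-rbind idea without the real files, here is a small self-contained sketch. The three file names, the temporary directory and the column names are all made up for the demonstration; only split(), substr(), lapply() and read.csv() come from the answer above.

```r
# Hypothetical setup: write three tiny CSV files into a temporary directory
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
v1 <- c("19800101.csv", "19800102.csv", "19810101.csv")
for (f in v1) {
  write.csv(data.frame(file = f, value = 1),
            file.path(demo_dir, f), row.names = FALSE)
}

# Group the file names by year (the first four characters)
lst1 <- split(v1, substr(v1, 1, 4))

# Read every file in a group and stack the rows into one data frame per year
lst2 <- lapply(lst1, function(files) {
  do.call(rbind, lapply(file.path(demo_dir, files), read.csv))
})

sapply(lst2, nrow)
```

Each element of lst2 is named after its year, so lst2[["1980"]] holds the combined rows of the two 1980 files.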

Partial Match String and full replacement over multiple vectors

I would like to efficiently replace all partially matching strings in a single column by supplying a vector of strings that is used both as the search patterns and as the replacements. That is, each element of df below is partially matched against the elements of vec_string, and where a match is found the entire string is replaced with the matching element of vec_string, e.g. turning 'subscriber manager' into 'manager'. Adding more strings to vec_string should extend the search across the whole of df.
I have started the function, but can't seem to finish it off by replacing the elements of df with those of vec_string. I'd appreciate your help.
df <- c(
'solicitor'
,'subscriber manager'
,'licensed conveyancer'
,'paralegal'
,'property assistant'
,'secretary'
,'conveyancing paralegal'
,'licensee'
,'conveyancer'
,'principal'
,'assistant'
,'senior conveyancer'
,'law clerk'
,'lawyer'
,'legal practice director'
,'legal secretary'
,'personal assistant'
,'legal assistant'
,'conveyancing clerk')
vec_string <- c('manager','law')
#function to search and replace
replace_func <-
function(vec,str_vec) {
repl_str <- list()
for(i in 1:length(str_vec)) {
repl_str[[i]] <- grep(str_vec[i],unique(tolower(vec)))
}
names(repl_str) <- str_vec
return(repl_str)
}
replace_func(df,vec_string)
$`manager`
[1] 2
$law
[1] 13 14
As you can see, the function returns a named list with the indices of the elements to which the replacement will be applied.
This should do the trick
res = sapply(df,function(x){
match = which(sapply(vec_string,function(y) grepl(y,x)))
if (length(match)){x=vec_string[match[1]]}else{x}
})
res
[1] "solicitor" "manager" "licensed conveyancer"
[4] "paralegal" "property assistant" "secretary"
[7] "conveyancing paralegal" "licensee" "conveyancer"
[10] "principal" "assistant" "senior conveyancer"
[13] "law" "law" "legal practice director"
[16] "legal secretary" "personal assistant" "legal assistant"
[19] "conveyancing clerk"
We compare each element of df with each element of vec_string. If there is a match, the matching element of vec_string is returned; otherwise the original string is kept. Note that if more than one element of vec_string matches, only the first match is used.
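The same rule can also be written with one vectorised grepl() call per pattern. This is a sketch under two assumptions of mine: the patterns are literal substrings (hence fixed = TRUE), and a shortened df is enough to show the behaviour.

```r
df <- c("solicitor", "subscriber manager", "law clerk", "lawyer", "legal secretary")
vec_string <- c("manager", "law")

res <- df
# Go through the patterns in reverse so that, when several patterns match
# the same element, the earliest entry in vec_string is written last and wins
for (pat in rev(vec_string)) {
  # fixed = TRUE treats pat as a literal substring, not a regular expression;
  # matching is done against the original df, not the partly replaced res
  res[grepl(pat, df, fixed = TRUE)] <- pat
}
res
```

Elements that match no pattern, such as "legal secretary", are left unchanged.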

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all five variables at once. Instead of using just "clt.list" as the argument in the command above, I'd like to use all the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames,get_my_substr)
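lapply acts like a loop. You can create the vector of variable names with
my_vecnames <- ls(pattern="\\.list$")
In the environment shown in the question this pattern also matches "commands.list", "dirs.list" and "vars.list", so list the five names explicitly if those should be left out.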
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
replicate(
sample(10,1),
paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
)
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the names of the variables that you want to extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}
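Both answers come down to looking objects up by name. A compact sketch of that idea uses mget(), which returns a named list of the requested objects. The two short vectors below are stand-ins for the real lists, and vecs and models are names chosen only for this example.

```r
# Stand-in data: two short vectors in place of the real (huge) lists
clt.list  <- c("clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
               "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc")
hurs.list <- c("hurs_Amon_CCSM4_historical_r1i1p1_185001-200512.nc",
               "hurs_Amon_CCSM4_historical_r1i1p1_185001-200512.nc")

var.names <- c("clt.list", "hurs.list")

# mget() fetches several objects by name and keeps the names on the result
vecs <- mget(var.names)

# Extract the text between "Amon_" and "_historical", dropping duplicates
models <- lapply(vecs, function(v) {
  sort(unique(sub(".*?Amon_(.*?)_historical.*", "\\1", v)))
})
models
```

The result is a list with one element per variable, so models$clt.list and models$hurs.list can be compared or combined afterwards.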
