Choose data frames by dynamic names - r

I have several frames called y2010, y2011, y2012, ... and a frame called z.
"firstcolumn" contains fitting names.
I want to match the content of each frame (y2010, y2011, y2012, ...) by left_join to z within a loop.
for(i in 2010:2017) {
z<-left_join(z, paste0("y", i) , by="firstcolumn")
}
But I cannot choose the frames y2010, y2011, y2012, ... by paste0.
How can I proceed?

Use get:
z <- left_join(z, get(paste0("y", i)), by="firstcolumn")

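To avoid a for-loop, you can use mget to put them in a list and lapply to join each one to z. Note that this returns a list with one joined data frame per year, not a single cumulative result:
library(dplyr)
lapply(mget(ls(pattern = '^y[0-9]+$')), function(i) left_join(z, i, by = 'firstcolumn'))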

It sounds like you might also want to look at Reduce:
Reduce(function(x, y) left_join(x, y, by = "firstcolumn"),
mget(c("z", paste0("y", 2010:2017))))
It's always better to provide some sample data along with expected output. Here's some sample data:
ls() ## Just to show I'm starting with nothing in my workspace
# character(0)
set.seed(1)
list2env(setNames(replicate(9, data.frame(firstcolumn = sample(letters[1:5], 3), data = sample(10, 3, TRUE), stringsAsFactors = FALSE), FALSE), c("z", paste0("y", 2010:2017))), .GlobalEnv)
ls()
# [1] "y2010" "y2011" "y2012" "y2013" "y2014" "y2015" "y2016" "y2017" "z"
Here's a comparison of using Reduce with using your for loop:
library(dplyr)
Reduce(function(x, y) left_join(x, y, by = "firstcolumn"),
mget(c("z", paste0("y", 2010:2017))))
# firstcolumn data.x data.y data.x.x data.y.y data.x.x.x data.y.y.y data.x.x.x.x
# 1 b 10 2 8 3 4 7 NA
# 2 e 3 1 NA NA 9 9 NA
# 3 d 9 NA 5 7 NA NA 5
# data.y.y.y.y data
# 1 5 3
# 2 NA NA
# 3 8 9
usingGet <- function() {
for(i in 2010:2017) {
z <- left_join(z, get(paste0("y", i)) , by="firstcolumn")
}
z
}
usingGet()
# firstcolumn data.x data.y data.x.x data.y.y data.x.x.x data.y.y.y data.x.x.x.x
# 1 b 10 2 8 3 4 7 NA
# 2 e 3 1 NA NA 9 9 NA
# 3 d 9 NA 5 7 NA NA 5
# data.y.y.y.y data
# 1 5 3
# 2 NA NA
# 3 8 9
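For what it's worth, here is a minimal self-contained sketch showing that the get loop and the Reduce/mget fold give the same result. The toy frames and the helper left_merge are made up for illustration, and base merge(..., all.x = TRUE) stands in for left_join so that no package is needed (note that merge sorts rows by the key by default, unlike left_join). Giving each yearly frame a uniquely named value column also avoids the data.x.x.x suffixes seen above.

```r
# Hypothetical toy data: each yearly frame has a uniquely named value column
z     <- data.frame(firstcolumn = c("a", "b", "c"), stringsAsFactors = FALSE)
y2010 <- data.frame(firstcolumn = c("a", "b"), v2010 = c(1, 2),
                    stringsAsFactors = FALSE)
y2011 <- data.frame(firstcolumn = c("b", "c"), v2011 = c(3, 4),
                    stringsAsFactors = FALSE)

frame_names <- paste0("y", 2010:2011)

# Base-R stand-in for left_join(): keep every row of x (assumed helper name)
left_merge <- function(x, y) merge(x, y, by = "firstcolumn", all.x = TRUE)

# Approach 1: fold over a list built with mget(), starting from z
via_reduce <- Reduce(left_merge, mget(frame_names), z)

# Approach 2: explicit loop, looking up each frame by name with get()
via_loop <- z
for (nm in frame_names) {
  via_loop <- left_merge(via_loop, get(nm))
}

identical(via_reduce, via_loop)
```

Both approaches perform the same sequence of joins, so which one to use is mostly a matter of taste.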

Related

Merge more than 2 dataframes together depending on column value in R

I would like to merge a dataframe with multiple dataframes depending on a value in two columns. I know I can merge two dataframes based on an element in a column using the merge() function, but I don't know how to do it when there are more than 2 dataframes.
For example, take this as the primary dataframe:
yr.col <- c(rep("2018",3), rep("2017",4), rep("2016",5))
mnth.col <- sample.int(4,12, replace = TRUE)
lon <- c(paste(1:12,"x"))
lat <- c(paste(1:12,"y"))
df <- data.frame(yr.col,lon,lat)
These are the other dataframes, which have the temperature for the set of lon and lat in different years.
tmp_18 <- sample.int(8,12,replace = TRUE)
tmp_17 <- sample.int(8,12,replace = TRUE)
tmp_16 <- sample.int(8,12,replace = TRUE)
env_18 <- data.frame(tmp_18,lon,lat)
env_17 <- data.frame(tmp_17, lon, lat)
env_16 <- data.frame(tmp_16, lon, lat)
Aim: I want to merge df with either env_18 env_17 or env_16 depending on df$yr.col
Expected result: A dataframe with a new column called tmp where the number from the correct env datasets are found
Previously tried:
1)
if (df$yr.col=="2018"){
df.new$tmp <- merge(df,env_18, by=c("lon", "lat"))
df.new$tmp.yr <- "2018"
}else if (df$yr.col=="2017"){
df.new$tmp <- merge(df, env_17, by=c("lon", "lat"))
df.new$tmp.yr <- "2017"
} else {
df.new$tmp <- merge(df, env_16, by=c("lon", "lat"))
df.new$tmp.yr <- "2016"}
produces this warning:
Warning message:
In if (df$yr.col == "2018") { :
the condition has length > 1 and only the first element will be used
It only takes the first dataframe env_18 and merges that with df
I have also tried 2)
df.new <- ifelse(df$yr.col=="2018", merge(df, env_18, by=c("lon", "lat")),
ifelse(df$yr.col=="2017", merge(df, env_17, by=c("lon", "lat")),
ifelse(df$yr.col=="2016", merge(df, env_16, by=c("lon", "lat")), "NA")))
df.new <- data.frame(matrix(unlist(df.new), nrow=length(df.new)))
but this does not give the desired outcome.
Is there some magic way to do this that I have not considered, or have I made an error? Perhaps a for-loop or function?
Thank you so much for your help in advance! I really appreciate it :))
You can use dplyr and purrr for that. I could have used inner_join, but decided to keep merge as in the original post.
library(dplyr)
library(purrr)
map2_dfr(list(env_16, env_17, env_18),
2016:2018,
function(x,y){merge(df %>% filter(yr.col == y), x, by=c("lon", "lat"))})
Output
lon lat yr.col tmp_16 tmp_17 tmp_18
1 10 x 10 y 2016 1 NA NA
2 11 x 11 y 2016 8 NA NA
3 12 x 12 y 2016 7 NA NA
4 8 x 8 y 2016 7 NA NA
5 9 x 9 y 2016 2 NA NA
6 4 x 4 y 2017 NA 5 NA
7 5 x 5 y 2017 NA 4 NA
8 6 x 6 y 2017 NA 8 NA
9 7 x 7 y 2017 NA 7 NA
10 1 x 1 y 2018 NA NA 6
11 2 x 2 y 2018 NA NA 2
12 3 x 3 y 2018 NA NA 1
You can also create one column from the tmp columns and drop the rest. Assuming the result above has been assigned to res:
res$tmp <- coalesce(res$tmp_16, res$tmp_17, res$tmp_18)
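If you would rather avoid the extra packages, the same idea can be sketched in base R: subset df by year, merge each subset with the matching env frame, and stack the pieces. The miniature data and the names envs, pieces and result are made up for illustration, and the sketch assumes the temperature is the first column of each env frame. Renaming that column to tmp before merging means no coalesce step is needed afterwards.

```r
# Hypothetical miniature of the question's data
df <- data.frame(yr.col = c("2018", "2017", "2016"),
                 lon = c("1 x", "2 x", "3 x"),
                 lat = c("1 y", "2 y", "3 y"),
                 stringsAsFactors = FALSE)
env_18 <- data.frame(tmp_18 = c(11, 12, 13), lon = df$lon, lat = df$lat,
                     stringsAsFactors = FALSE)
env_17 <- data.frame(tmp_17 = c(21, 22, 23), lon = df$lon, lat = df$lat,
                     stringsAsFactors = FALSE)
env_16 <- data.frame(tmp_16 = c(31, 32, 33), lon = df$lon, lat = df$lat,
                     stringsAsFactors = FALSE)

# Key each env frame by the year it belongs to
envs <- list("2018" = env_18, "2017" = env_17, "2016" = env_16)

pieces <- lapply(names(envs), function(yr) {
  e <- envs[[yr]]
  names(e)[1] <- "tmp"  # assumes the temperature is the first column
  # merge only the rows of df that belong to this year
  merge(df[df$yr.col == yr, ], e, by = c("lon", "lat"))
})

# Stack the per-year results into one data frame
result <- do.call(rbind, pieces)
result
```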

Is there a way to recode an SPSS function in R to create a single new variable?

Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y<- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
I wrote a function whose syntax is very similar to the SPSS RECODE command:
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
The function also works for other examples. This is how it is defined:
recodeR <- function(vec_in, ..., else_do){
l <- list(...)
# extract the "from" values
from_vec <- unlist(lapply(l, function(x) x[1:(length(x)-1)]))
# extract the "to" values
to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x)-1)))
# plyr is required for mapvalues
require(plyr)
# recode the variable
vec_out <- mapvalues(vec_in, from_vec, to_vec)
# if "missing" is given, every value not listed in a "from" set becomes NA
# (including unlisted values that lie between the smallest and largest "from" values).
# Otherwise unlisted values stay the same
if(else_do == "missing"){
vec_out[!(vec_in %in% from_vec)] <- NA
}
# return resulting vector
return(vec_out)}
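If you just need this one recode and no packages, a lookup with match is another option. This is only a sketch: the names from, to and variable2 are chosen for illustration. Because match returns NA for any value that is not listed, the ELSE = SYSMIS part comes for free.

```r
variable1 <- -1:11

# One "to" value for each "from" value, in the same order
from <- 1:10
to   <- c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3)

# match() gives the position of each value in 'from', or NA if it is not listed;
# indexing 'to' with NA yields NA, which plays the role of SYSMIS
variable2 <- to[match(variable1, from)]
variable2
# [1] NA NA  1  1  2  2  2  2  2  2  3  3 NA
```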

Place on specific positions of dataframes specific values

DF <- data.frame(x1=c(NA,7,7,8,NA), x2=c(1,4,NA,NA,4)) # a data frame with NA
WhereAreMissingValues <- which(is.na(DF), arr.ind=TRUE) # find the position of the missing values
Modes <- apply(DF, 2, function(x) {which(tabulate(x) == max(tabulate(x)))}) # find the modes of each column
DF
WhereAreMissingValues
Modes
I would like to replace the NAs in each column of DF with that column's mode.
Any help would be appreciated.
Map provides a one-line solution here:
data.frame(Map(function(u,v){u[is.na(u)]=v;u},DF, Modes))
# x1 x2
#1 7 1
#2 7 4
#3 7 4
#4 8 4
#5 7 4
Here's how I would do this.
First I'll define a helper function
Myfunc <- function(x) as.numeric(names(sort(-table(x)))[1L])
Then just use lapply over the data set
DF[] <- lapply(DF, function(x){x[is.na(x)] <- Myfunc(x) ; x})
DF
# x1 x2
# 1 7 1
# 2 7 4
# 3 7 4
# 4 8 4
# 5 7 4
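A possible variation, in case the columns are not all numeric: the helper below (mode_of is a made-up name) finds the most frequent non-missing value without converting through names, so it should also work for character columns. In case of a tie it simply takes the value that appears first.

```r
DF <- data.frame(x1 = c(NA, 7, 7, 8, NA), x2 = c(1, 4, NA, NA, 4))

# Most frequent non-NA value of x; ties resolve to the first value seen
mode_of <- function(x) {
  ux <- unique(x[!is.na(x)])
  # match() maps each value to its index in ux; tabulate() ignores the NAs
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill the NAs of every column with that column's mode
DF[] <- lapply(DF, function(col) {
  col[is.na(col)] <- mode_of(col)
  col
})
DF
```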

Build a vector/frame by combining regmatches results

I have a regular expression that parses a bunch of text, and when doing regmatches(myText,myRegex) it returns a list which looks like:
[[1]]
[1] "a=1" "b=3" "a=9" "c=2" "b=4"
...
I'd like to build a data.frame or table - whatever suits best - to finally have something like:
a b c
1 3 2
9 4 ...
Is it possible to make this in a simple fashion? What are your suggestions?
Thanks in advance.
It's not entirely clear what the general case is here, but this works on the data provided.
Assuming this input:
x <- c("a=1", "b=3", "a=9", "c=2", "b=4")
split the values by the names producing s and massage into a data.frame:
s <- split(as.numeric(sub(".*=", "", x)), sub("=.*", "", x))
as.data.frame(do.call(cbind, lapply(s, ts)))
giving:
a b c
1 1 3 2
2 9 4 NA
No packages needed.
You can either use base R methods
d1 <- read.table(text=gsub("[[:punct:]]", " " , unlist(lst)))
d2 <- transform(d1, indx=ave(seq_along(V1), V1, FUN=seq_along))
res <- reshape(d2, timevar='V1', idvar='indx', direction='wide')[,-1]
colnames(res) <- gsub(".*\\.", "", colnames(res))
res
# a b c
#1 1 3 2
#3 9 4 2
#6 4 5 NA
#9 9 NA NA
Or using dcast from reshape2 on d2
library(reshape2)
dcast(d2,indx~V1, value.var='V2')[,-1]
# a b c
#1 1 3 2
#2 9 4 2
#3 4 5 NA
#4 9 NA NA
data
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))
Using rex may make this type of extraction task a little simpler.
x <- c("a=1", "b=3", "a=9", "c=2", "b=4", "a=2")
First extract the names and values from the strings.
library(rex)
matches <- re_matches(x,
rex(
capture(name="name", letter),
"=",
capture(name="value", digit)
))
#> name value
#>1 a 1
#>2 b 3
#>3 a 9
#>4 c 2
#>5 b 4
#>6 a 2
Then tally the groups using split().
groups <- split(as.numeric(matches$value), matches$name)
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4
#>
#>$c
#>[1] 2
If we try to convert the output of split() directly to a data.frame, the groups with fewer members will either have their members recycled or cause an error about differing numbers of rows, rather than being padded with NA, so instead explicitly fill with NA.
largest_group <- max(sapply(groups, length))
#>[1] 3
groups <- lapply(groups, function(group) {
if (length(group) < largest_group) {
group[largest_group] <- NA
}
group
})
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4 NA
#>
#>$c
#>[1] 2 NA NA
Finally we can create the data.frame
do.call('data.frame', groups)
#> a b c
#>1 1 3 2
#>2 9 4 NA
#>3 2 NA NA
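The padding step can also be written more compactly in base R with the replacement function `length<-`, which extends a vector with NA. A rough sketch using the input from the question (the names keys, vals, padded and res are just for illustration):

```r
x <- c("a=1", "b=3", "a=9", "c=2", "b=4")

# Split each "key=value" string into its two parts
keys <- sub("=.*", "", x)
vals <- as.numeric(sub(".*=", "", x))

# Group the values by key
groups <- split(vals, keys)

# Pad every group with NA up to the length of the largest group
n <- max(lengths(groups))
padded <- lapply(groups, `length<-`, n)

res <- as.data.frame(padded)
res
```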
Here's an approach using tools from my "splitstackshape" package:
library(splitstackshape)
dcast.data.table( ## Makes the long data wide
getanID( ## Adds an ID variable for dcast
## create a single column data.table and split it by the "="
cSplit(as.data.table(unlist(lst)), "V1", "="), "V1_1"),
.id ~ V1_1, value.var = "V1_2")
# .id a b c
# 1: 1 1 3 2
# 2: 2 9 4 2
# 3: 3 4 5 NA
# 4: 4 9 NA NA
This uses @akrun's sample data:
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))

retain improper name with indexing

I need to name the columns of a data.frame with duplicate names. Inside data.frame you can use check.names = FALSE to do the naughty name deed, but if you then index the result, you lose the naughty names. I want to retain those names. Below is an example with the output I get and the output I'd like to get:
x <- data.frame(b= 4:6, a =6:8, a =6:8, check.names = FALSE)
x[, -1]
I get:
a a.1
1 6 6
2 7 7
3 8 8
I'd like:
a a
1 6 6
2 7 7
3 8 8
How about this:
subdf <- function(df, ii) {
do.call("data.frame", c(as.list(df)[ii], check.names=FALSE))
}
subdf(x, -1)
# a a
# 1 6 6
# 2 7 7
# 3 8 8
subdf(x, 2:3)
# a a
# 1 6 6
# 2 7 7
# 3 8 8
Here's an ugly solution
> tmp <- data.frame(b=4:6, a=6:8, a=6:8, check.names=FALSE)
> setNames(tmp[, -1], names(tmp)[-1])
a a
1 6 6
2 7 7
3 8 8
Looking at the code for [.data.frame gives this as part of the code
if (anyDuplicated(cols))
names(y) <- make.unique(cols)
and I couldn't see anything in the code that would allow one to skip that check. So it looks like we'll just have to write our own function. It's not very safe though and I'm sure a much better version could be created...
dropCols <- function(x, cols){
nm <- colnames(x)
x <- x[, -cols]
colnames(x) <- nm[-cols]
x
}
x <- data.frame(b= 4:6, a =6:8, a =6:8, check.names = FALSE)
#x[, -1]
dropCols(x, 1)
# a a
#1 6 6
#2 7 7
#3 8 8
Per Dirk's tongue-in-cheek comment:
safe.data.frame <- function(dat, index) {
colnam <-colnames(dat)[index]
dat2 <- dat[, index]
colnames(dat2) <- colnam
dat2
}
safe.data.frame(x, -1)
I was hoping for something better :)
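One small refinement of the functions above, offered only as a sketch (keep_names is a made-up name): passing drop = FALSE keeps the result a data.frame even when a single column is selected, so restoring the names does not fail in that case.

```r
# Index columns while keeping the original, possibly duplicated, names
keep_names <- function(dat, index) {
  out <- dat[, index, drop = FALSE]  # stays a data.frame even for one column
  names(out) <- names(dat)[index]    # undo the make.unique() renaming
  out
}

x <- data.frame(b = 4:6, a = 6:8, a = 6:8, check.names = FALSE)

names(keep_names(x, -1))
# [1] "a" "a"
names(keep_names(x, 2))
# [1] "a"
```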
