Making multiple named data frames with loop - r

In the process of learning. Didn't ask my first question well, so I'm trying again and doing my best to be more clear.
I'm trying to create a series of data frames for a reproducible question for my larger issue. I would like to make 4 data frames, each named differently by the year. Eventually I will merge these four data frames to explain where I am encountering my issue.
Here is the most recent solution. This runs, but it creates a list of four data frames rather than four separate data frames in the global environment.
datafrom <- list()
years <- c(2006,2008,2010,2012)
for (i in 1:length(years)) {
UniqueID <- 1:10 # <- Not all numeric - Kept as character vector
Name <- LETTERS[seq( from = 1, to = 10 )]
Entity_Type <- factor(c("This","That"))
Data1 <- rnorm(10)
Data2 <- rnorm(10)
Data3 <- rnorm(10)
Data4 <- rnorm(10)
Year <- years[i]
datafrom[[i]] <- data.frame(UniqueID, Name, Entity_Type, Data1, Data2, Data3, Data4, Year)
}
I would like 4 separate data frames, each named datafrom2006, datafrom2008, etc.
Many thanks in advance for your patience with my learning.

I'll demonstrate a few (of many) techniques here, and I'll call them (1) brute force, (2) list-based, and (3) single long-form data.frame.
I'll add to the example the use of a function that you want to apply to each data.frame. Though contrived, it helps make the point:
## some constants used throughout
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
interestingPart <- x[ , grepl('^Data', colnames(x)) ]
sapply(interestingPart, mean)
}
Brute Force
Yes, you can create multiple like-named and same-structure data.frames from a loop, though it is typically frowned upon by many experienced (R?) programmers:
set.seed(42)
for (yr in years) {
tmpdf <- data.frame(UniqueID=as.character(1:n),
Name=LETTERS[1:n],
Entity_Type=factor(c('this', 'that')),
Data1=rnorm(n),
Data2=rnorm(n),
Data3=rnorm(n),
Data4=rnorm(n),
Year=yr)
assign(sprintf('datafrom%s', yr), tmpdf)
}
rm(yr, tmpdf)
ls()
## [1] "datafrom2006" "datafrom2008" "datafrom2010" "datafrom2012" "myfunc"
## [6] "n" "years"
head(datafrom2006, n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
In order to see the results for each data.frame, one would typically (though not always) do something like this:
myfunc(datafrom2006)
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
myfunc(datafrom2008)
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
myfunc(datafrom2010)
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
myfunc(datafrom2012)
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
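If the separately named data.frames already exist, they can still be gathered into a list after the fact. This is only a minimal sketch, assuming the objects follow the datafrom<year> naming pattern. The environment env and the toy frames are made up for illustration, so the example does not touch your global environment:

```r
# Hypothetical stand-in for the workspace that holds datafrom2006, datafrom2008, ...
env <- new.env()
for (yr in c(2006, 2008)) {
  assign(sprintf("datafrom%s", yr), data.frame(Data1 = c(1, 3), Year = yr), envir = env)
}

# Collect every object whose name matches the pattern into one named list
frame_names <- ls(env, pattern = "^datafrom[0-9]{4}$")
datafrom <- mget(frame_names, envir = env)

names(datafrom)
sapply(datafrom, nrow)
```

When working in the global environment instead, dropping the envir arguments (and calling ls(pattern = ...) and mget(frame_names)) should behave the same way.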
List-Based
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
data.frame(UniqueID=as.character(1:n),
Name=LETTERS[1:n],
Entity_Type=factor(c('this', 'that')),
Data1=rnorm(n),
Data2=rnorm(n),
Data3=rnorm(n),
Data4=rnorm(n),
Year=yr)
}, simplify=FALSE)
str(datafrom)
## List of 4
## $ 2006:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
## ..$ Name : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
## ..$ Entity_Type: Factor w/ 2 levels "that","this": 2 1 2 1 2 1 2 1 2 1
## ..$ Data1 : num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
## ..$ Data2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
## ..$ Data3 : num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
## ..$ Data4 : num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
## ..$ Year : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1
## $ 2008:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
#### ...snip...
head(datafrom[[1]], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
head(datafrom[['2008']], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 0.2059986 0.32192527 -0.3672346 -1.04311894 2008
## 2 2 B that -0.3610573 -0.78383894 0.1852306 -0.09018639 2008
With this, you can test your function on just one data.frame:
myfunc(datafrom[[1]])
myfunc(datafrom[['2010']])
and then run the function on all of them very simply:
lapply(datafrom, myfunc)
## $`2006`
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
## $`2008`
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
## $`2010`
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
## $`2012`
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
Long-form Data
If instead you keep all of the data in the same data.frame, using your already-defined column of Year, you can still segment it for exploring individual years:
longdf <- do.call('rbind.data.frame', datafrom)
rownames(longdf) <- NULL
longdf[c(1,11,21,31),]
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.45545012 2006
## 11 1 A this 0.2059986 0.3219253 -0.3672346 -1.04311894 2008
## 21 1 A this 1.5127070 1.3921164 1.2009654 -0.02509255 2010
## 31 1 A this -1.4936251 0.5676206 -0.0861073 -0.04069848 2012
Simple subsets:
subset(longdf, Year == 2006), though subset has its pros and cons.
by(longdf, longdf$Year, myfunc)
If using library(dplyr), try longdf %>% filter(Year == 2010) %>% myfunc()
(Side note: when trying to plot aggregate data, it's often easier when the data is in this form, especially when using ggplot2-like layering and aesthetics.)
Rationale Against "Brute Force"
In answer to your comment question, when making different variables with the same structure, it is easy to deduce that you will be doing the same thing to each of them, in turn or immediately-consecutively. As a general programming principle, many try to generalize what they do so that if it can be done once, it can be done an arbitrary number of times without (heavily) adjusting the code. For instance, compare what was necessary in applying myfunc in the two examples above.
Further, if you later want to aggregate the results from your calls to myfunc, it is more laborious in the "brute force" example (as you must capture each return and combine manually), whereas the other two techniques can use simpler summarizing functions (e.g., another lapply, or perhaps Reduce or Filter).
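To make the aggregation point concrete, here is a small sketch of combining the per-year results from the list-based approach into one matrix. It rebuilds a reduced version of the example (5 rows and two Data columns, which are my simplification), and summary_matrix is just an illustrative name:

```r
set.seed(42)
years <- c(2006, 2008, 2010, 2012)

# Reduced version of the list-based example: one small data.frame per year
datafrom <- lapply(setNames(years, years), function(yr) {
  data.frame(Data1 = rnorm(5), Data2 = rnorm(5), Year = yr)
})

myfunc <- function(x) {
  interestingPart <- x[, grepl("^Data", colnames(x)), drop = FALSE]
  sapply(interestingPart, mean)
}

# One call applies myfunc to every year; rbind stacks the named vectors
results <- lapply(datafrom, myfunc)
summary_matrix <- do.call(rbind, results)
summary_matrix
```

The row names of summary_matrix come from the list names, so each row stays labelled with its year without any manual bookkeeping.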

Related

How do I gsub the complete time string behind #

(This is my first question; if I need to improve anything about it, please let me know!)
I am analysing a large observational dataset. The start and stop time of each observation have been recorded, so I was able to calculate the duration. But there is a note column which includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from the total duration.
My problem is that one column includes several kinds of notes: not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "#HH:MM").
I only want to look at time periods in the format HH:MM-HH:MM, and I want to exclude all event times labeled "#HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this:
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
I tried several ways of using grep or gsub to indicate either which parts to keep or which to leave out, but I failed. The closest I got was R dropping "#HH" but keeping ":MM". For this I used
gsub("#([[:digit:]]|[_])*", "", df$timepoints)
, as found for a similar problem just with words here: remove all words that start with "#" from a string
The aim is to get (e.g.):
id timepoints
3990 "7:16-7:23, 7:25-7:43"
or
id timepoints
3990 "7:16-7:23", "7:25-7:43"
If possible, separated by comma or directly separated into different columns, so I can extract the times and subtract them from my total observation time.
Any help would be greatly appreciated!
How about matching the strings you're interested in instead?
With base:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23, 7:25-7:43
2 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
With tidyverse (in a long format for easy calculations!):
library(stringr)
library(dplyr)
library(tidyr)
df |>
group_by(id) |>
mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
unnest_longer(new_timepoints) |>
ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23
2 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:25-7:43
3 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
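Since the end goal is to subtract the pauses from the total duration, here is a rough base R sketch of turning the extracted "HH:MM-HH:MM" strings into minutes per id. The helper names to_minutes and pause_minutes are my own, and it assumes no pause crosses midnight:

```r
# Convert "H:MM" strings to minutes since midnight
to_minutes <- function(hm) {
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Sum the lengths (in minutes) of a vector of "H:MM-H:MM" intervals
pause_minutes <- function(intervals) {
  if (length(intervals) == 0) return(0)
  ends <- strsplit(intervals, "-", fixed = TRUE)
  start <- to_minutes(vapply(ends, function(e) e[1], character(1)))
  stop <- to_minutes(vapply(ends, function(e) e[2], character(1)))
  sum(stop - start)
}

timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,",
                "#6:19,,7:25-7:43,#7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")

# Same pattern as above: keep only the HH:MM-HH:MM intervals
pauses <- regmatches(timepoints,
                     gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", timepoints))
total_pause <- vapply(pauses, pause_minutes, numeric(1))
total_pause
## [1] 25 18 17
```

The resulting vector lines up with the rows of df, so it can be stored as a new column and subtracted from the observation duration.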
You can do something like this:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s[grepl("^\\d",s)]
})
}
and then apply that function to the timepoints column
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest(timepoints)
Output:
id timepoints
<chr> <chr>
1 3990 7:16-7:23
2 3990 7:25-7:43
3 3989 7:25-7:43
4 3004 7:30-7:39
5 3004 7:45-7:48
6 3004 7:49-7:54
You could also use unnest_wider() to get these as columns; for that I would adjust my f() to include the names of the timepoints:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s = s[grepl("^\\d",s)]
setNames(s, paste0("tp", seq_along(s)))
})
}
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest_wider(timepoints)
Output:
id tp1 tp2 tp3
<chr> <chr> <chr> <chr>
1 3990 7:16-7:23 7:25-7:43 NA
2 3989 7:25-7:43 NA NA
3 3004 7:30-7:39 7:45-7:48 7:49-7:54
Setting the data with the package data.table
library(data.table)
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.table(id, timepoints)
Note that I saved it as a data.table
Splitting the timepoints by comma and storing the value in the new_time column.
df[,new_time:=strsplit(timepoints, ",")]
Removing the string values that contain #
df[,new_time:=lapply(new_time, function(x) x[!grepl("[#]", x)])]
Because the timepoints column has several commas in a row, the split leaves empty strings (""), so I remove them
df[,new_time:=lapply(new_time, function(x) x[!stringi::stri_isempty(x)])]
Now the new_time column looks like this
df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"
[[2]]
[1] "7:25-7:43"
[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"
If you want the new_time column to hold a single string per row
df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43" "7:25-7:43" "7:30-7:39, 7:45-7:48, 7:49-7:54"
1) list: Split by comma and then grep out the components with a dash. No packages are used. This gives a list of character vectors as the timepoints column.
df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","),
grep, pattern = "-", value = TRUE)
df2
## id timepoints
## 1 3990 7:16-7:23, 7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ id : chr "3990" "3989" "3004"
## $ timepoints:List of 3
## ..$ : chr "7:16-7:23" "7:25-7:43"
## ..$ : chr "7:25-7:43"
## ..$ : chr "7:30-7:39" "7:45-7:48" "7:49-7:54"
2) character: If you want a comma-separated character string in each row, add this:
transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
## id timepoints
## 1 3990 7:16-7:23,7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54
3) long form: Or, if you prefer long form, use this:
long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
## id timepoints
## 1 3990 7:16-7:23
## 2 3990 7:25-7:43
## 3 3989 7:25-7:43
## 4 3004 7:30-7:39
## 5 3004 7:45-7:48
## 6 3004 7:49-7:54
4) wide form: Or a wide form matrix:
nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
## seq
## id 1 2 3
## 3990 "7:16-7:23" "7:25-7:43" NA
## 3989 "7:25-7:43" NA NA
## 3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"

Changing class from character to integer but retaining all the data inside

q7 <- dbGetQuery(conn,
"SELECT TailNum AS TailNum, AVG(ontime.DepDelay) AS avg_delay, ontime.Year AS Year, planes.Year AS yearmade
FROM planes JOIN ontime USING(tailnum)
WHERE ontime.Cancelled = 0 AND planes.Year != '' AND planes.Year != 'None' AND ontime.Diverted = 0 AND ontime.DepDelay > 0
GROUP BY TailNum
ORDER BY avg_delay")
Code that I have tried:
q7 <- data.frame(
yearmade = q7.yearmade, stringsAsFactors = FALSE)
Hi! Basically, I would like to create a new column that holds Year minus yearmade. Before I could do that, I found that yearmade, which I pull from another table into this data frame, is stored as character. Is there any way to change its type but retain the original data inside?
First use as.numeric() to change yearmade into a numeric variable. Then you can simply compute the difference between Year and yearmade.
I believe this will work for you.
set.seed(1)
Year <- 2000:2022
yearmade <- sample(c('2000', '1999', '1998'), length(Year), replace = TRUE)
TailNum <- sample(c('N3738B', 'N3737C', 'N37342'), length(Year), replace = TRUE)
avg_delay <- 1:length(Year)
q7 <- data.frame(TailNum, avg_delay, Year, yearmade)
# compute difference and add to data frame
q7$year_diff <- q7$Year - as.numeric(q7$yearmade)
This retains the original data, but introduces a new column year_diff.
> str(q7)
'data.frame': 23 obs. of 5 variables:
$ TailNum : chr "N3738B" "N3738B" "N3737C" "N3738B" ...
$ avg_delay: int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
$ yearmade : chr "2000" "1998" "2000" "1999" ...
$ year_diff: num 0 3 2 4 4 7 8 8 9 11 ...
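One thing worth checking before converting: any value that is not a clean number turns into NA. The query above already filters out '' and 'None', but here is a small sketch (with made-up values) of how you could spot leftovers of that kind:

```r
# Hypothetical values, including the kinds of non-numeric entries the query filters out
yearmade <- c("1998", "2001", "None", "")

# Non-numeric strings become NA; as.integer() would normally warn about "None"
converted <- suppressWarnings(as.integer(yearmade))

# Flag the entries that could not be converted
failed <- is.na(converted) & !is.na(yearmade)
yearmade[failed]
## [1] "None" ""
```

If failed is all FALSE for your real column, the conversion kept every value.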

How can I convert a time shown as a character into time value with R?

I have a dataset called marathon and I have tried to use lubridate and chron to convert the characters of marathon$Official.Time into time values in order to work on them. I would like the times to be shown in minutes (meaning that 2 hours are shown as 120 minutes).
'data.frame': 5616 obs. of 11 variables:
$ Overall.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Category.Position: int 1 1 2 2 3 4 3 4 5 5 ...
$ Category : chr "MMS" "MMI" "MMI" "MMS" ...
$ Race.No : int 21080 14 2 21077 18 21 21078 21090 21084 12 ...
$ Country : chr "Kenya" "Kenya" "Ethiopia" "Kenya" ...
$ Official.Time : chr "2:12:12" "2:12:14" "2:12:20" "2:12:29" ...
I tried with:
library(lubridate)
times(marathon$Official.Time)
Or
library(chron)
chron(times=marathon$Official.Time)
as.difftime(marathon$Official.Time, units = "mins")
But I only get NA
You were almost there with difftime (which requires two times and gives you the difference). Instead, use as.difftime (which requires one "difference" - ie marathon time) and specify the format as hours:minutes:seconds.
> as.difftime("2:12:12", format="%H:%M:%S", units="mins")
Time difference of 132.2 mins
> as.numeric(as.difftime("2:12:12", format="%H:%M:%S", units="mins"))
[1] 132.2
No extra packages needed.
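The same call should work on a whole column at once, since as.difftime accepts a character vector. A short sketch using the first few times from the question (official_time stands in for marathon$Official.Time):

```r
# Stand-in for marathon$Official.Time
official_time <- c("2:12:12", "2:12:14", "2:12:20", "2:12:29")

# Parse as hours:minutes:seconds and express the result in minutes
elapsed <- as.difftime(official_time, format = "%H:%M:%S", units = "mins")
minutes <- as.numeric(elapsed)
round(minutes, 2)
## [1] 132.20 132.23 132.33 132.48
```

Keep in mind that %H only covers 0 to 23, so this sketch assumes every finishing time is under 24 hours.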
NOTE: @mathematical.coffee's solution is ++gd better than these.
Pretty straightforward to kick it out manually:
library(stringi)
library(purrr)
df <- data.frame(Official.Time=c("2:12:12","2:12:14","2:12:20","2:12:29"),
stringsAsFactors=FALSE)
map(df$Official.Time, function(x) {
stri_split_fixed(x, ":")[[1]] %>%
as.numeric() %>%
`*`(c(60, 1, 1/60)) %>%
sum()
}) -> df$minutes
df
## Official.Time minutes
## 1 2:12:12 132.2
## 2 2:12:14 132.2333
## 3 2:12:20 132.3333
## 4 2:12:29 132.4833
You can also do it with just base R operations and w/o "piping":
df$minutes <- sapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, USE.NAMES=FALSE)
If "stuck" with base R then I'd probably actually do:
vapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, double(1), USE.NAMES=FALSE)
to ensure type safety.
But, chron can also be used:
library(chron)
60 * 24 * as.numeric(times(df$Official.Time))
NOTE that lubridate has no times() function.

Paste / NoQuote - Not Working as Expected

I have a Data Frame c1 as below:
str(c1)
#'data.frame': 2312 obs. of 6 variables:
# $ dt : Date, format: "2014-04-01" "2014-04-01" "2014-04-01" ...
# $ base : Factor w/ 2 levels "AA","AB": 1 1 1 2 2 2 2 1 1 1 ...
# $ curr : Factor w/ 5 levels "BA","BB","BC",..: 2 3 5 1 2 3 4 2 3 5 ...
# $ trans: int 72 176 4365 234 144 352 16762 61 160 4276 ...
# $ amt : num 2.18e+09 5.55e+09 9.99e+09 3.75e+08 4.37e+09 ...
# $ rate : num 1.11e-04 1.22e-02 1.26 3.94 5.65e+03 ...
d = "c1"
d
# [1] "c1"
Now when I use d instead of the actual data frame name, it does not work correctly:
i <- sapply( c1, is.factor)
i
# dt base curr trans amt rate
#FALSE TRUE TRUE FALSE FALSE FALSE
Correct!
i <- sapply( paste(d), is.factor)
i
# c1
#FALSE
Incorrect
i <- sapply( noquote(d), is.factor)
i
# c1
#FALSE
Incorrect
Is there a way to fix this?
Edit -
c1[i] <- lapply(c1[i], as.character)
Works
get(d)[i] <- lapply( get(d)[i], as.character)
Fails
for (j in 1:length(i)) { ifelse(is.factor(get(d)[j]),get(d)[i] <- as.character(get(d)[i])) }
Fails
Can get() be used in every place, or are there only 3/4 ways to use get()?
Thanks Again
If I understand correctly, you're looking for
xy <- data.frame(a = runif(3), b = letters[1:3])
sapply(get("xy"), is.factor)
Mind you, this is bad practice. If you're making up variable names on the fly, you should consider using other objects, like a list, to store the data.frame(s).
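To show what the list suggestion could look like: sapply(paste(d), is.factor) only ever sees the string "c1", never the data frame, which is why it returns FALSE. With a named list the string becomes an index, and assignment works too (unlike get(d)[i] <- ...). A minimal sketch with a made-up two-column c1; the list name frames is my own:

```r
# Hypothetical miniature version of c1, stored in a named list
frames <- list(
  c1 = data.frame(base = factor(c("AA", "AB")), trans = c(72L, 176L))
)

d <- "c1"

# The string selects the data.frame, so sapply sees its columns
i <- sapply(frames[[d]], is.factor)
i

# Replacement works through the list index as well
frames[[d]][i] <- lapply(frames[[d]][i], as.character)
str(frames[[d]])
```

The same two lines work unchanged for any other name stored in d, as long as that data.frame lives in the list.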
This works for now, although it's exceptionally hard to make sense of.
.eval <- function(evaltext,envir=sys.frame()) {
## evaluate a string as R code
eval(parse(text=evaltext), envir=envir)
}
.eval(paste( "i = sapply(",noquote(d),",is.factor)",sep=""))
.eval(paste( noquote(d),"[i] <- lapply(",noquote(d),"[i], as.character)",sep=""))
I am still looking for better alternatives. This is so bad that I cannot accept this as answer :-(
Thanks, Manish

a complex merge in R to flag unmatched observations?

I'm trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x. But not in the pure sense because I know that x contains more IDs than y but I don't know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What's more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn't. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck, and it looks like I should give up on merge as well. Perhaps it's better to write a loop or function to do something along these lines:
for every observation in x
id2 = which(id2) corresponds to id-month-year
flag = 1 if length of above is == 1, 0 otherwise
etc.
Hopefully this all makes sense. I'd be very grateful for any help or guidance.
If you are looking for which things in x$id are in y$id2, then you can use
x$id %in% y$id2
to get a logical vector returning matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame
x$match.y <- x$id %in% y$id2
to see what rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could do something like
y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
to filter out elements that appear more than once in y$id2. You can also add this to x:
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
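Here is how that plays out on a small made-up example, including the 1/0 flag asked for in the question. The data and the column names flag and flag_unique are only for illustration:

```r
# Toy data: x has repeated ids, y has one duplicated id and one id missing from x
x <- data.frame(id = c(1:3, 1:5))
y <- data.frame(id2 = c(1, 2, 2, 200))

# 1 if the id appears anywhere in y, 0 otherwise
x$flag <- as.integer(x$id %in% y$id2)

# 1 only if the id appears in y exactly once
dup_ids <- y$id2[duplicated(y$id2)]
x$flag_unique <- as.integer(x$id %in% y$id2 & !(x$id %in% dup_ids))

# ids in y that have no partner in x
unmatched_y <- y$id2[!(y$id2 %in% x$id)]

x
unmatched_y
## [1] 200
```

Here id 2 gets flag 1 but flag_unique 0, because it appears twice in y.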
The reason your merge failed was that you gave it two different structures for x and y (x is a character matrix and y is a numeric matrix). Using cbind when data.frame should be chosen is a common strategy for failure.
> str(x)
chr [1:110, 1:4] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "id" "X1" "month" "year"
> str(y)
num [1:11, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "id2" "Y1"
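The coercion is easy to reproduce on a smaller case. In this sketch (the object names as_matrix and as_frame are mine), cbind turns everything into character because a matrix can hold only one type, while data.frame keeps each column's type:

```r
id <- 1:3
month <- c("Jul", "Aug", "Sep")

# A matrix has a single type, so the numbers are converted to strings
as_matrix <- cbind(id, month)
typeof(as_matrix)
## [1] "character"

# A data.frame stores each column separately, so id stays an integer
as_frame <- data.frame(id, month)
class(as_frame$id)
## [1] "integer"
```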
If you used the data.frame function (since dataframes are what merge is supposed to be working with) it would have succeeded:
> x <- data.frame(id, X1, month, year); y <- data.frame(id2,Y1)
> str( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
'data.frame': 111 obs. of 5 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ X1 : num 1.5063 2.5035 0.7889 -0.4907 -0.0446 ...
$ month: Factor w/ 10 levels "Apr","Aug","Dec",..: 6 6 2 2 10 10 9 9 8 8 ...
$ year : Factor w/ 5 levels "2001","2002",..: 3 3 4 4 5 5 1 1 2 2 ...
$ Y1 : num 1.449 1.449 -0.134 -0.134 -0.828 ...
> tail( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
id X1 month year Y1
106 96 -0.3869157 Dec 2004 NA
107 97 0.6373009 Jan 2005 NA
108 98 -0.7735626 Feb 2006 NA
109 99 -1.3537915 Mar 2001 NA
110 100 0.2626190 Apr 2002 NA
111 200 NA <NA> <NA> -1.509818
If you have duplicates in your 'x' argument, then you should get duplicates in the result. It's then your responsibility to use !duplicated in whatever manner you deem appropriate (either before or after the merge), but you cannot expect merge to be making decisions like that for you.
