matching only if target 'cell' is NA - r

I have the following two data.frames opcat and polity.
opcat <- data.frame(country = rep(LETTERS[1:5]), date.ratification = c(2003,2004,2005,NA,NA), date.accession = c(NA,NA,NA,2000,2006))
opcat
polity <- data.frame(year = rep((2000:2007), 7), country = rep(LETTERS[1:7],8), polity.score = sample(10, 56, replace=TRUE))
polity <- polity[order(polity$country, polity$year),]
polity
I want to insert the polity.score from the data.frame polity into the data.frame opcat for the year in which a country 'ratified' (= date.ratification) or 'acceded' (= date.accession).
for ratification
opcat$polity.score <- polity$polity.score[match(interaction(opcat$country, opcat$date.ratification), interaction(polity$country, polity$year))]
opcat
country date.ratification date.accession polity.score
1 A 2003 NA 10
2 B 2004 NA 2
3 C 2005 NA 10
4 D NA 2000 NA
5 E NA 2006 NA
for accession
opcat$polity.score <- polity$polity.score[match(interaction(opcat$country, opcat$date.accession), interaction(polity$country, polity$year))]
opcat
country date.ratification date.accession polity.score
1 A 2003 NA NA
2 B 2004 NA NA
3 C 2005 NA NA
4 D NA 2000 9
5 E NA 2006 7
A country has either a date for ratification or for accession (not both). Since the matching for accession would fill the results for ratification with NA, I tried the following modification:
opcat$polity.score[is.na(opcat$date.ratification)] <- polity$polity.score[match(interaction(opcat$country, opcat$date.accession), interaction(polity$country, polity$year))]
opcat
But this doesn't work. I get the error message "number of items to replace is not a multiple of replacement length". How can I match the scores into the same variable without overwriting them?
The final result should be
country date.ratification date.accession polity.score
1 A 2003 NA 10
2 B 2004 NA 2
3 C 2005 NA 10
4 D NA 2000 9
5 E NA 2006 7
I would think that this shouldn't be that difficult.
Many thanks.

Add the year to opcat, and perform a standard merge:
opcat$year <- with(opcat, ifelse(is.na(date.ratification), date.accession, date.ratification))
merge(opcat,polity)
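The original attempt fails because the left-hand side selects only the rows where date.ratification is NA (two rows), while the right-hand side still returns one value per row of opcat (five values). A minimal sketch of a match-based variant that keeps the row order of opcat is shown here. The deterministic polity.score values and the helper names event.year and idx are assumptions made for illustration, not part of the original data.

```r
# Toy data in the shape of the question; scores are made deterministic
# (1..56) purely so the result is reproducible.
opcat <- data.frame(country = LETTERS[1:5],
                    date.ratification = c(2003, 2004, 2005, NA, NA),
                    date.accession = c(NA, NA, NA, 2000, 2006))
polity <- data.frame(year = rep(2000:2007, 7),
                     country = rep(LETTERS[1:7], each = 8),
                     polity.score = seq_len(56))

# One event year per country: ratification if present, otherwise accession.
event.year <- ifelse(is.na(opcat$date.ratification),
                     opcat$date.accession,
                     opcat$date.ratification)

# Match on a combined country/year key, then pull the score.
idx <- match(paste(opcat$country, event.year),
             paste(polity$country, polity$year))
opcat$polity.score <- polity$polity.score[idx]
opcat
```

Unlike merge(), this does not reorder the rows or add a year column to opcat.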

Related

Can I iterate through a vector of variable names in an R For Loop?

Background:
I'm building a complex function which, in part, creates groups of heterogeneous size and assigns homogenous values to members of each group. To make writing the larger, more complex function simpler, I built this simple function which will allow me to do what I described and it works exactly as I need it to. To demonstrate its effects:
## Chunky4Loop.R | v1.0 - 2022.01.01
# A function that allows us to take a small number of values from an array of any size
# and paste them into a dataframe in chunks, which may or may not be equal in size
Chunky4Loop <-function(chunk, # Value noting how many chunks to iterate through. The value of chunk should be equal to the length of the arrays Quantity and Input
quantity, # Vectors noting the size of, or how many rows are contained within, each chunk
input) # The values that each chunk should take
{LoopTracker <- 0
output <- NA
for (i in 1:chunk){
if (i == 1){
output[1:quantity[i]] <- rep(input[i], quantity[i])
LoopTracker <- LoopTracker + quantity[i]
}
if (i > 1){
output[(LoopTracker + 1):(LoopTracker + quantity[i])] <- rep(input[i], quantity[i])
LoopTracker <- LoopTracker + quantity[i]
}
}
rm(LoopTracker, i)
return(output)
}
nGroups <- 3
nPeople <- c(7,3,1)
GroupName <- c("A","B","C")
rows <- 1:sum(nPeople)
cols <- c("Group", "ATK", "DMG")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# Group ATK DMG
#1 NA NA NA
#2 NA NA NA
#3 NA NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA NA NA
#7 NA NA NA
#8 NA NA NA
#9 NA NA NA
#10 NA NA NA
#11 NA NA NA
df$Group <- Chunky4Loop(chunk = nGroups,
quantity = nPeople,
input = GroupName)
df
# Group ATK DMG
#1 A NA NA
#2 A NA NA
#3 A NA NA
#4 A NA NA
#5 A NA NA
#6 A NA NA
#7 A NA NA
#8 B NA NA
#9 B NA NA
#10 B NA NA
#11 C NA NA
However, I have to run it over many variables, which is doable. I'd like to simplify it even further, though, by creating a vector of input variables, output columns, and then entering these two meta-variables into a For Loop like this:
# I specify the original values contained within the arrays
ATK <- c(7,3,1)
DMG <- c(7,3,1)
# I create an array of the input variables I just specified ...
inputs <- c('ATK', 'DMG')
# Then I create an array of columns to output to ...
outputs <- c('df$ATK', 'df$DMG')
# Then I run the Chunky4Loop function to keep this all compacted
for (i in 1:length(inputs)){
outputs[i] <- Chunky4Loop(chunk = nGroups,
quantity = nPeople,
input = inputs[i])
} # added missing close-curly-brace
I would ideally get this:
Group ATK DMG
1 A 7 7
2 A 7 7
3 A 7 7
4 A 7 7
5 A 7 7
6 A 7 7
7 A 7 7
8 B 3 3
9 B 3 3
10 B 3 3
11 C 1 1
But R will not recognize these inputs or outputs and will produce the message:
In outputs[i] <- Chunky4Loop(chunk = nGroups, quantity = nPeople, :
number of items to replace is not a multiple of replacement length
Can I make this work?
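One way this could work, sketched under the assumption that the variables named in inputs exist in the calling environment: look each one up with get() and assign into the data frame with df[[name]], rather than storing 'df$ATK' as a string. The warning arises because outputs[i] is a single element of a character vector, while the function returns 11 values; in addition, input = inputs[i] passes the string "ATK" rather than the vector ATK. The sketch uses rep(input, times = quantity) in place of Chunky4Loop, which should produce the same chunks for this use; that substitution is mine and not part of the original code.

```r
nPeople   <- c(7, 3, 1)
GroupName <- c("A", "B", "C")
ATK <- c(7, 3, 1)
DMG <- c(7, 3, 1)

# Build the frame with the Group column already expanded.
df <- data.frame(Group = rep(GroupName, times = nPeople),
                 ATK = NA, DMG = NA)

# Names of the input variables; each is written to the column
# with the same name.
inputs <- c("ATK", "DMG")

for (nm in inputs) {
  # get(nm) fetches the vector called "ATK" or "DMG";
  # df[[nm]] assigns to the column of that name.
  df[[nm]] <- rep(get(nm), times = nPeople)
}
df
```

If the input and output names differ, a named character vector mapping column names to variable names could be iterated in the same way.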

Complex join of longitudinal tables in R

I have ~16 .txt files that I need to turn into one, wide flat file. For each new file, time has passed, and some new variables are added. What I would like to do is append those new columns to the right side of the first table, joining by an identification variable. This gets complicated quickly, so here is an MRE:
library(dplyr)
id <- as.character(1:6)
first <- c("jeff", "jimmy", "andrew", "taj", "karl-anthony", "jamal")
last <- c("teague", "butler", "wiggins", "gibson", "towns", "crawford")
set.seed(1839)
a <- c(1:4, NA, NA)
b <- c(1:4, NA, NA)
c <- c(11:13, NA, 14, NA)
d <- c(11:13, NA, 14, NA)
e <- c(21, 22, NA, 24, NA, 26)
f <- c(21, 22, NA, 24, NA, 26)
Simulating the three different files:
df_1 <- data.frame(
id = id[c(1:3,5)],
first = first[c(1:3,5)],
last = last[c(1:3,5)],
a = a[c(1:3,5)],
b = b[c(1:3,5)]
)
df_2 <- data.frame(
id = id[c(1:3,5)],
first = first[c(1:3,5)],
last = last[c(1:3,5)],
c = c[c(1:3,5)],
d = d[c(1:3,5)]
)
df_3 <- data.frame(
id = id[c(1,2,4,6)],
first = first[c(1,2,4,6)],
last = last[c(1,2,4,6)],
e = e[c(1,2,4,6)],
f = f[c(1,2,4,6)]
)
df_goal <- data.frame(id, first, last, a, b, c, d, e, f)
df_goal is what I want, and here is what it looks like:
> df_goal
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 4 taj gibson 4 4 NA NA 24 24
5 5 karl-anthony towns NA NA 14 14 NA NA
6 6 jamal crawford NA NA NA NA 26 26
Note that these are very big files, and the columns are not always in the right order, so I cannot just say to join by keeping the first three columns.
If I do a full_join on all, I get the names repeated every time:
df_all <- df_1 %>%
full_join(df_2, by = "id") %>%
full_join(df_3, by = "id")
> df_all
id first.x last.x a b first.y last.y c d first last e f
1 1 jeff teague 1 1 jeff teague 11 11 jeff teague 21 21
2 2 jimmy butler 2 2 jimmy butler 12 12 jimmy butler 22 22
3 3 andrew wiggins 3 3 andrew wiggins 13 13 <NA> <NA> NA NA
4 5 karl-anthony towns NA NA karl-anthony towns 14 14 <NA> <NA> NA NA
5 4 <NA> <NA> NA NA <NA> <NA> NA NA taj gibson 24 24
6 6 <NA> <NA> NA NA <NA> <NA> NA NA jamal crawford 26 26
Next, I wrote a for loop that takes each data frame, selects (a) the id column and (b) the columns whose names have not yet appeared in the combined data frame, and then (c) does a full_join:
dfs <- c("df_2", "df_3")
df_all1 <- df_1
for (i in dfs) {
df_all1 <- get(i)[!names(get(i)) %in% names(df_all1)[-1]] %>%
full_join(df_all1, .)
}
> df_all1
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 <NA> <NA> NA NA NA NA 24 24
6 6 <NA> <NA> NA NA NA NA 26 26
Note that this means the cases that did not appear in the first file are missing the names (these represent key demographic variables in my data). I also tried going through row-by-row and doing a column join if the id was already present, and then doing a bind_rows if it was not. This code threw an error:
df_all2 <- df_1
for (i in dfs) {
for (k in 1:nrow(get(i))) {
if (get(i)[k, "id"] %in% df_all2$id) {
df_all2 <- get(i)[k, !names(get(i)) %in% names(df_all2)[-1]] %>%
left_join(df_all2, ., by = "id")
} else {
df_all2 <- bind_rows(
df_all2,
get(i)[k, !names(get(i)) %in% names(df_all2)[-1]]
)
}
}
}
There has got to be a way to do a join with only select columns, but fill in missing information if necessary. Again, I am working with lots of files with lots of columns, so I cannot assume that I know the position of any columns; it has to be done by the column names.
I have also thought about just including a new variable that is the date of the file, stacking them all on top of one another ("long" format), and then using tidyr::spread and tidyr::gather, but I haven't found a solution yet.
I am not wedded to the tidyverse (base or data.table would be great, even some way to do a SQL join in R) or even R; I am open to a Python solution using pandas, as well.
Short version: how do I join new columns to an existing data set by identification number, while also filling in the not-new columns for cases that are new and therefore have no values there yet?
Possible solution, per Psidom:
df_all1 <- df_1
for (i in dfs) {
df_all1 <- get(i) %>%
full_join(
df_all1, .,
by = names(get(i))[names(get(i)) %in% names(df_all1)]
)
}
df_all1
Maybe a more efficient way to do this, though?
You can use melt once you have the full_join result df_all:
library(data.table)
df <- melt(setDT(df_all),
measure.vars = patterns("^first", "^last"))
df <- unique(df[,-c("id", "variable")])
df[!is.na(df$value1),]
a b c d e f value1 value2
1: 1 1 11 11 21 21 jeff teague
2: 2 2 12 12 22 22 jimmy butler
3: 3 3 13 13 NA NA andrew wiggins
4: NA NA 14 14 NA NA karl-anthony towns
5: NA NA NA NA 24 24 taj gibson
6: NA NA NA NA 26 26 jamal crawford
The simplest solution using dplyr is to omit the by parameter in the calls to full_join().
library(dplyr)
df_1 %>%
full_join(df_2) %>%
full_join(df_3)
Joining, by = c("id", "first", "last")
Joining, by = c("id", "first", "last")
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 taj gibson NA NA NA NA 24 24
6 6 jamal crawford NA NA NA NA 26 26
Warning messages:
1: Column id joining factors with different levels, coercing to character vector
2: Column first joining factors with different levels, coercing to character vector
3: Column last joining factors with different levels, coercing to character vector
The documentation of the by parameter in ?full_join says: If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables.
So this is equivalent to explicitly passing by = c("id", "first", "last") as proposed by Psidom.
If there are many data frames to join, the code below may save a lot of typing:
Reduce(full_join, list(df_1, df_2, df_3))
The result (including messages) is the same as above.
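Since the question mentions that a base R solution would also be welcome, here is a rough equivalent using merge() with all = TRUE, which also joins on every column name the two tables share. The data frames are cut-down stand-ins for the ones in the question (one value column each), so treat this as a sketch rather than a drop-in replacement.

```r
# Reduced versions of df_1, df_2, df_3 from the question.
df_1 <- data.frame(id = c("1", "2", "3", "5"),
                   first = c("jeff", "jimmy", "andrew", "karl-anthony"),
                   last = c("teague", "butler", "wiggins", "towns"),
                   a = c(1, 2, 3, NA))
df_2 <- data.frame(id = c("1", "2", "3", "5"),
                   first = c("jeff", "jimmy", "andrew", "karl-anthony"),
                   last = c("teague", "butler", "wiggins", "towns"),
                   c = c(11, 12, 13, 14))
df_3 <- data.frame(id = c("1", "2", "4", "6"),
                   first = c("jeff", "jimmy", "taj", "jamal"),
                   last = c("teague", "butler", "gibson", "crawford"),
                   e = c(21, 22, 24, 26))

# merge() defaults to joining on all shared column names;
# all = TRUE makes it a full outer join.
df_all <- Reduce(function(x, y) merge(x, y, all = TRUE),
                 list(df_1, df_2, df_3))
df_all
```

Because id, first, and last are all used as keys, cases that first appear in a later file keep their names.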

merge messy dataframes r

I have 2 data frames
df1=data.frame(Col1=c('2','4','CN','CANADA',NA),Col2=c('s1','s2','s3','s4','s5'))
> df1
Col1 Col2
1 2 s1
2 4 s2
3 CN s3
4 CANADA s4
5 <NA> s5
df2=data.frame(index=1:5,code=c('AB','CA','US','CN','UK'),name=c('ALBERTA','CANADA','USA','CHINA','UK'),REGION=c('NA','NA','NA','FE','EU'))
> df2
index code name REGION
1 1 AB ALBERTA NA
2 2 CA CANADA NA
3 3 US USA NA
4 4 CN CHINA FE
5 5 UK UK EU
I want
df3=data.frame(df1,code=c('CA','CN','CN','CA',NA),name=c('CANADA','CHINA','CHINA','CANADA',NA),REGION=c('NA','FE','FE','NA',NA))
Col1 Col2 code name REGION
1 2 s1 CA CANADA NA
2 4 s2 CN CHINA FE
3 CN s3 CN CHINA FE
4 CANADA s4 CA CANADA NA
5 <NA> s5 <NA> <NA> <NA>
I have tried indexing by value:
df1$code=df2[df2$index[df1$Col1],2]
which fills it in incorrectly, and merging twice
m1=merge(df1,df2,by.x='Col1',by.y='index',all.x=TRUE)
m2=merge(m1,df2,by.x='Col1',by.y='name',all.x=1)
I am sure I am missing something here. Thanks for your help
Maybe not a very nice solution but it works for this example:
ind <- sapply(df1$Col1, function(x)which(df2[,c("index", "code", "name")] == as.character(x),arr.ind = T)[1])
cbind(df1, df2[ind,])
Col1 Col2 index code name REGION
2 2 s1 2 CA CANADA NA
4 4 s2 4 CN CHINA FE
4.1 CN s3 4 CN CHINA FE
2.1 CANADA s4 2 CA CANADA NA
NA <NA> s5 NA <NA> <NA> <NA>
As far as I understand the problem, Col1 of df1 contains mixed information. So my approach would be to separate the different data types. Then it should be easy to merge correctly.
chr <- as.character(df1$Col1)
index_df1 <- chr
index_df1[!grepl("^[0-9]*$", chr)] <- NA
index_df1 <- as.numeric(index_df1)
code_df1 <- chr
code_df1[!grepl("^[A-Z]{2}$", chr)] <- NA
name_df1 <- chr
name_df1[!grepl("^[A-Z]{3,}$", chr)] <- NA
df1 <- data.frame(df1, index_df1, code_df1, name_df1)
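To finish the idea, each of the three possible keys can be looked up separately with match() and the first hit used to index df2. This is a sketch on the example data only; the helper names by_index, by_code, by_name, and row are mine, and it assumes a value never matches different rows of df2 under different keys.

```r
df1 <- data.frame(Col1 = c("2", "4", "CN", "CANADA", NA),
                  Col2 = c("s1", "s2", "s3", "s4", "s5"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(index = 1:5,
                  code = c("AB", "CA", "US", "CN", "UK"),
                  name = c("ALBERTA", "CANADA", "USA", "CHINA", "UK"),
                  REGION = c("NA", "NA", "NA", "FE", "EU"),
                  stringsAsFactors = FALSE)

chr <- as.character(df1$Col1)

# Row of df2 found via each key; non-numeric strings become NA
# (the coercion warning is expected and silenced).
by_index <- match(suppressWarnings(as.integer(chr)), df2$index)
by_code  <- match(chr, df2$code)
by_name  <- match(chr, df2$name)

# Take the first key that produced a match.
row <- ifelse(!is.na(by_index), by_index,
              ifelse(!is.na(by_code), by_code, by_name))

df3 <- cbind(df1, df2[row, c("code", "name", "REGION")])
rownames(df3) <- NULL
df3
```

Rows of df1 with no match under any key, including the NA row, end up with NA in the added columns.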

filling missing values time series data in R

I am trying to expand yearly values in my panel data to year-quarter values, that is, to repeat each yearly value for every quarter of that year.
For example, I am looking to get the repeated values of income for year-quarter 2000Q1, 2000Q2, 2000Q3, 2000Q4, 2001Q1, ... , 2001Q4. So the data frame would be id, year-quarter, income.
I use a two-step approach but have some issues to handle. If the yearly starting value is missing, then the quarterly values for that year need to be missing (NA) too.
Case 1:
annual_data <- data.frame(
person=c(1, 1, 1, 2, 2,2),
year=c(2010, 2011, 2012, 2010, 2011, 2012),
income=c(4, 10, 13, 1, NA, 30)
)
Case 2:
annual_data <- data.frame(
person=c(1, 1, 1, 2, 2,2),
year=c(2010, 2011, 2012, 2010, 2011, 2012),
income=c(4, 10, 13, NA, NA, 30)
)
In the first step, I expand the data to quarterly as was mentioned:
interpolating in R yearly time series data with quarterly values
So use a function such as:
expand <- function(x) {
years <- min(x$year):max(x$year)
quarters <- 1:4
grid <- expand.grid(quarter=quarters, year=years)
x$quarter <- 1
merged <- grid %>% left_join(x, by=c('year', 'quarter'))
merged$person <- x$person[1]
return(merged)
}
Then I used zoo::na.locf in dplyr::mutate:
quarterlydata <- annual_data %>% group_by(person) %>% do(expand(.))
testdata <- quarterlydata %>% group_by(person) %>% mutate(ynew=zoo::na.locf(income))
but I haven't had much luck, as it carries the previous non-missing value forward into all missing values. That is,
Case 1: it carries every value forward, so the income of 1 for person 2 gets copied to both 2010 and 2011, when it should be copied to 2010 only and 2011 should stay NA.
For case 2: I get
Error: incompatible size (%d), expecting %d (the group size) or 1.
Any thoughts on what I am missing?
For case 1 you are missing year in your group_by. With the code that you have, na.locf runs over all years of each person, so values are carried forward across year boundaries. Grouping by year as well restricts the fill to quarters of the same year:
testdata <- quarterlydata %>%
group_by(person, year) %>%
mutate(ynew=zoo::na.locf(income, na.rm=FALSE))
With the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year
quarter year person income ynew
1 4 2012 1 NA 13
2 1 2010 2 1 1
3 2 2010 2 NA 1
4 3 2010 2 NA 1
5 4 2010 2 NA 1
6 1 2011 2 NA NA
7 2 2011 2 NA NA
8 3 2011 2 NA NA
9 4 2011 2 NA NA
10 1 2012 2 30 30
11 2 2012 2 NA 30
12 3 2012 2 NA 30
13 4 2012 2 NA 30
For case 2, as you might already infer from the code above, you must set na.rm to FALSE; otherwise na.locf drops the leading NAs that it cannot fill, and the result is shorter than the group.
So using exactly the same code for case 2 we will have the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year
quarter year person income ynew
1 4 2012 1 NA 13
2 1 2010 2 NA NA
3 2 2010 2 NA NA
4 3 2010 2 NA NA
5 4 2010 2 NA NA
6 1 2011 2 NA NA
7 2 2011 2 NA NA
8 3 2011 2 NA NA
9 4 2011 2 NA NA
10 1 2012 2 30 30
11 2 2012 2 NA 30
12 3 2012 2 NA 30
13 4 2012 2 NA 30
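Since the goal is only to repeat each yearly value across the four quarters of its own year, the fill step can be avoided altogether. The following base R sketch is an alternative technique, not the approach used above: it cross-joins the annual data with the four quarters, relying on merge() returning the Cartesian product when the two tables share no column names. The object names quarters and quarterly are assumptions for illustration.

```r
# Case 2 data from the question.
annual_data <- data.frame(
  person = c(1, 1, 1, 2, 2, 2),
  year   = c(2010, 2011, 2012, 2010, 2011, 2012),
  income = c(4, 10, 13, NA, NA, 30)
)

quarters <- data.frame(quarter = 1:4)

# No shared column names, so merge() pairs every annual row
# with every quarter.
quarterly <- merge(annual_data, quarters)
quarterly <- quarterly[order(quarterly$person,
                             quarterly$year,
                             quarterly$quarter), ]
rownames(quarterly) <- NULL
quarterly
```

A year whose income is NA simply yields four NA quarters, so no value can leak in from an earlier year.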

Is there already a function to subtract different variables in subsequent quarters?

I have an unbalanced quarterly panel data set with missing values. I want to subtract variable A2 from A1 in subsequent quarters. Note that I do not want to get differences of A2, but to subtract DIFFERENT variables from each other. Differences should be calculated separately for every uid. Also, year changes such as Q4 1999 to Q1 2000 are meant to count as subsequent quarters.
I am really not sure whether I should concatenate my time index here, since packages like zoo only take one index. But that's not the problem here. Here is some example data:
x <- structure(list(uid = c(1, 1, 1, 2, 2, 3, 3, 3), tndx = c(1999.4,
2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3), A1 = c(2,
2, 2, 10, 11, 1, 1, 1), A2 = c(3, 3, 3, 14, 14, 2, 100, 2)), .Names = c("uid",
"tndx", "A1", "A2"), row.names = c(NA, -8L), class = "data.frame")
# which results in
uid tndx A1 A2
1 1 1999.4 2 3
2 1 2000.1 2 3
3 1 2000.2 2 3
4 2 1999.4 10 14
5 2 2000.1 11 14
6 3 2000.1 1 2
7 3 2000.2 1 100
8 3 2000.3 1 2
If you prefer a separated index, use this example:
# Thx Andrie!
x2 <- data.frame(x, colsplit(x$tndx, "\\.", names=c("year", "qtr")))
Is there a good way to solve this with reshape2, plyr or even base or would you rather write a custom function?
Note, it is also possible that some uid occurs only once. Obviously you can't calculate a lagged difference then. Still I need to check for that and create an NA then.
We split it on the uid using by and within the function that operates on each set of rows for a single uid, we create a zoo object, z, using yearqtr class for the index. Then we merge the time series with an empty series having all the desired quarters including any missing intermediate quarters giving zm and perform the computation giving zz. Finally we convert to the data.frame form on the way out:
library(zoo)
to.yearqtr <- function(x) as.yearqtr(trunc(x) + (10*(x-trunc(x))-1)/4)
DF <- do.call("rbind", by(x, x$uid, function(x) {
# columns of x are: uid tndx A1 A2
z <- zoo(x[c("A1", "A2")], to.yearqtr(x$tndx))
zm <- merge(z, zoo(, seq(start(z), end(z), 1/4)))
zz <- with(zm, cbind(zm, `A1 - A2 lag` = A1 - lag(A2, -1)))
if (ncol(zz) <= ncol(z)) zz$`A1 - A2 lag` <- NA # append NA if col not added
data.frame(uid = x[1, 1], tndx = time(zz), coredata(zz), check.names = FALSE)
}))
which gives this:
> DF
uid tndx A1 A2 result A1 - A2 lagged
1.1 1 1999 Q4 2 3 NA NA
1.2 1 2000 Q1 2 2 NA -1
1.3 1 2000 Q2 2 3 NA 0
2.1 2 1999 Q4 2 4 NA NA
2.2 2 2000 Q1 NA NA NA NA
2.3 2 2000 Q2 NA NA NA NA
2.4 2 2000 Q3 NA NA NA NA
2.5 2 2000 Q4 NA NA NA NA
2.6 2 2001 Q1 3 4 NA NA
3.1 3 2000 Q1 1 2 NA NA
3.2 3 2000 Q2 1 NA NA -1
3.3 3 2000 Q3 1 2 NA NA
EDIT: Completely re-did the solution based on further discussion. Note that this not only adds an extra column but it also converts the index to "yearqtr" class and adds the extra missing rows.
EDIT: Some minor simplifications in the by function.
I wasn't entirely clear what you wanted because you didn't include a "right answer". If you want to subtract one lagged variable from another unlagged variable, you can do that with indexing that is offset. (You do need to pad the result if you want it to be put back into the data frame.)
x$A1lagA2 <- ave(x[, c("A1", "A2")], x$uid, FUN=function(z) {
with(z, c(NA, A1[2:NROW(z)] -A2[1:(NROW(z)-1)]) ) } )[[1]]
x
uid tndx A1 A2 A1lagA2
1 1 1999.4 2 3 NA
2 1 2000.1 2 3 -1
3 1 2000.2 2 3 -1
4 2 1999.4 10 14 NA
5 2 2000.1 11 14 -3
6 3 2000.1 1 2 NA
7 3 2000.2 1 100 -1
8 3 2000.3 1 2 -99
You do get annoying duplicate extra columns with ave() when its argument is multicolumn, but I just took the first one.
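The question notes that a uid may occur only once, in which case indexing such as A1[2:NROW(z)] no longer behaves as intended. A sketch of a variant that tolerates single-row groups uses negative indexing and runs ave() over row numbers. It assumes consecutive rows within a uid are consecutive quarters (gaps are not detected), and the extra uid 4 row is made up for illustration.

```r
x <- data.frame(
  uid  = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  tndx = c(1999.4, 2000.1, 2000.2, 1999.4, 2000.1,
           2000.1, 2000.2, 2000.3, 2000.1),
  A1   = c(2, 2, 2, 10, 11, 1, 1, 1, 5),
  A2   = c(3, 3, 3, 14, 14, 2, 100, 2, 6)
)

# ave() passes each group's row numbers to FUN; A1[-1] and
# A2[-last] are both empty for a one-row group, leaving just NA.
x$A1lagA2 <- ave(seq_len(nrow(x)), x$uid, FUN = function(i) {
  c(NA, x$A1[i][-1] - x$A2[i][-length(i)])
})
x
```

Working on row numbers also avoids the duplicated columns that ave() returns for a multicolumn argument.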
