Merge unequal length data.frames by id in R

Sample data
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002,2003))
merge() doesn't give the desired result here: merge(x, y, by = "id") produces every pairwise combination within each id, i.e. duplicates.
For the sample data above a simple cbind(x, y) works, and this is what I'm after: pairing each year with the corresponding id.
The problem arises when the two data.frames do not match, i.e. the data.frame containing the variable year is shorter. Something like this:
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002))
So I need to pair the two data.frames; the unmatched rows of data.frame x could become NA so that I can remove them.
Desired output for the shorter sample data:
id year dna
1 1 2001 232
2 1 2002 424
3 1 2003 5345
4 2 2005 45345
5 2 2006 45
6 7 2000 345
7 7 2001 4543
8 7 2002 345345

You should add a record number to each id so you can work with merge:
x <- transform(x, rec = ave(id, id, FUN = seq_along))
y <- transform(y, rec = ave(id, id, FUN = seq_along))
merge(x, y, c("id", "rec"))
# id rec dna year
# 1 1 1 232 2001
# 2 1 2 424 2002
# 3 1 3 5345 2003
# 4 2 1 45345 2005
# 5 2 2 45 2006
# 6 7 1 345 2000
# 7 7 2 4543 2001
# 8 7 3 345345 2002
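For the second, unequal sample, the default inner merge simply drops the unmatched row; if you first want to see the unmatched rows as NA (as described in the question), an outer merge does that. A sketch:

```r
# Unequal sample data: y is one row shorter than x
x <- data.frame(id  = c(1,1,1,2,2,7,7,7,7),
                dna = c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id   = c(1,1,1,2,2,7,7,7),
                year = c(2001,2002,2003,2005,2006,2000,2001,2002))

# Same trick: number the records within each id
x <- transform(x, rec = ave(id, id, FUN = seq_along))
y <- transform(y, rec = ave(id, id, FUN = seq_along))

# all.x = TRUE keeps the unmatched row of x (id 7, rec 4) with year = NA ...
m <- merge(x, y, by = c("id", "rec"), all.x = TRUE)

# ... which can then be dropped explicitly
m <- m[!is.na(m$year), ]
```

The default inner merge (all = FALSE) drops the unmatched row directly, so the NA step is only needed if you want to inspect the unmatched rows first.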

Trying to keep values of a column based on the unique values of two other columns

I want to keep only the 2 largest values in a column of a df according to the unique pair of values in two other columns. E.g., I have this df:
df <- data.frame(ID    = c(1,1,1,2,2,3,4,4,4,5),
                 YEAR  = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
                 WAGES = c(100,98,60,120,80,300,50,40,30,500))
I want to drop the 3rd and 9th rows, i.e. keep only the two largest WAGES values for each ID. The df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
  group_by(ID) %>%
  top_n(n = 2, wt = WAGES)
## A tibble: 8 x 3
## Groups: ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2002 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
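Note that in dplyr 1.0.0 and later, top_n() is superseded by slice_max(); a sketch with the sample data restated so it runs standalone:

```r
library(dplyr)

df <- data.frame(ID    = c(1,1,1,2,2,3,4,4,4,5),
                 YEAR  = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
                 WAGES = c(100,98,60,120,80,300,50,40,30,500))

# Keep the two largest WAGES per ID (ties are kept by default)
df %>%
  group_by(ID) %>%
  slice_max(WAGES, n = 2)
```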
If I understood your question correctly, using base R:
for (i in 1:2) {
  max_row <- which.max(df$WAGES)
  df <- df[-c(max_row), ]
}
df
# ID YEAR WAGES
# 1 1 2002 100
# 2 1 2002 98
# 3 1 2003 60
# 4 2 2002 120
# 5 2 2003 80
# 7 4 2010 50
# 8 4 2011 40
# 9 4 2012 30
Note the - and the , in df <- df[-c(max_row), ].
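The loop above removes the two overall largest values; if you instead want the per-ID behaviour from the question (keep the two largest WAGES within each ID), a base-R sketch using ave() and rank():

```r
df <- data.frame(ID    = c(1,1,1,2,2,3,4,4,4,5),
                 YEAR  = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
                 WAGES = c(100,98,60,120,80,300,50,40,30,500))

# Rank wages within each ID (largest first) and keep ranks 1 and 2;
# note rank() averages ties, so exact duplicates may need a tie-break rule
keep <- ave(-df$WAGES, df$ID, FUN = rank) <= 2
df[keep, ]   # drops rows 3 and 9, as requested
```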

Remove both rows that duplicate in R

I'm trying to remove all rows that contain a duplicated value. In the example below I want to remove both rows that have a 2 and all three rows that have a 6 in the x column. I have tried xy[!duplicated(xy$x), ], but this still keeps the first row of each duplicate group, and I want neither row.
x <- c(1,2,2,4,5,6,6,6)
y <- c(1888,1999,2000,2001,2004,2005,2010,2011)
xy <- as.data.frame(cbind(x,y))
xy
x y
1 1 1888
2 2 1999
3 2 2000
4 4 2001
5 5 2004
6 6 2005
7 6 2010
8 6 2011
What I want is
x y
1 1888
4 2001
5 2004
Any help is appreciated. I need to avoid specifying the value to get rid of since I am dealing with a dataframe with thousands of records.
We can do:
xy[! xy$x %in% unique(xy[duplicated(xy$x), "x"]), ]
# x y
#1 1 1888
#4 4 2001
#5 5 2004
as
unique(xy[duplicated(xy$x), "x"])
gives the values of x that are duplicated. Then we can just filter those out.
You can count the occurrences and keep only the singletons:
xy[1==ave(xy$x,xy$x,FUN=length),]
x y
1 1 1888
4 4 2001
5 5 2004
Or like this:
xy[xy$x %in% names(which(table(xy$x)==1)),]
x y
1 1 1888
4 4 2001
5 5 2004

How to join without losing information?

I have several data frames with the following structure:
january    february   march      april
Id  A  B   Id  A  B   Id  A  B   Id  A  B
 1  4  4    1  2  3    3  9  7    1  4  3
 2  3  5    2  2  7    2  2  4    4  6  2
 3  6  8    4  9  9               2  3  5
 4  7  8
I would like to combine them into one single data frame which contains NA for the missing IDs and their corresponding attributes. The result might look like:
Id janA janB febA febB marA marB aprA aprB
1 4 4 2 3 NA NA 4 3
2 3 5 2 7 2 4 3 5
3 6 8 NA NA 9 7 NA NA
4 7 8 9 9 NA NA 6 2
Given some data:
ID<-c(1,2,3,4)
A<-c(4,3,6,7)
B<-c(4,5,8,8)
jan<-data.frame(ID,A,B)
ID<-c(1,2,4)
A<-c(2,2,9)
B<-c(3,7,9)
feb<-data.frame(ID,A,B)
ID<-c(3,2)
A<-c(9,2)
B<-c(7,4)
mar<-data.frame(ID,A,B)
ID<-c(1,4,2)
A<-c(4,6,3)
B<-c(6,2,5)
apr<-data.frame(ID,A,B)
What I have tried:
test <- rbind(jan, feb,mar,apr)
test <- rbind.fill(jan, feb, mar, apr)  # rbind.fill() is from the plyr package
You can use merge within Reduce.
First, let's prepare a list with the data and change the column names to janA, janB, febA, ...
list_df <- list(
  jan = jan,
  feb = feb,
  mar = mar,
  apr = apr
)
list_df <- lapply(names(list_df), function(name_month) {
  df_month <- list_df[[name_month]]
  names(df_month)[-1] <- paste0(name_month, names(df_month)[-1])
  df_month
})
Reduce will merge all of them.
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), list_df)
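Run end-to-end on the generated data, the result contains NA wherever a month is missing an ID (note aprB for ID 1 is 6 here, since the generated apr data differs slightly from the table in the question):

```r
# Same generated data as above, restated so this runs standalone
jan <- data.frame(ID = c(1, 2, 3, 4), A = c(4, 3, 6, 7), B = c(4, 5, 8, 8))
feb <- data.frame(ID = c(1, 2, 4),    A = c(2, 2, 9),    B = c(3, 7, 9))
mar <- data.frame(ID = c(3, 2),       A = c(9, 2),       B = c(7, 4))
apr <- data.frame(ID = c(1, 4, 2),    A = c(4, 6, 3),    B = c(6, 2, 5))

# Prefix the A/B columns with the month name, then outer-merge everything
list_df <- list(jan = jan, feb = feb, mar = mar, apr = apr)
list_df <- lapply(names(list_df), function(name_month) {
  df_month <- list_df[[name_month]]
  names(df_month)[-1] <- paste0(name_month, names(df_month)[-1])
  df_month
})
res <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), list_df)
res
#   ID janA janB febA febB marA marB aprA aprB
# 1  1    4    4    2    3   NA   NA    4    6
# 2  2    3    5    2    7    2    4    3    5
# 3  3    6    8   NA   NA    9    7   NA   NA
# 4  4    7    8    9    9   NA   NA    6    2
```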

Lapply in a dataframe over different variables using filters

I'm trying to calculate several new variables in my dataframe. Take initial values for example:
Say I have:
Dataset <- data.frame(time=rep(c(1990:1992),2),
geo=c(rep("AT",3),rep("DE",3)),var1=c(1:6), var2=c(7:12))
time geo var1 var2
1 1990 AT 1 7
2 1991 AT 2 8
3 1992 AT 3 9
4 1990 DE 4 10
5 1991 DE 5 11
6 1992 DE 6 12
And I want:
time geo var1 var2 var1_1990 var1_1991 var2_1990 var2_1991
1 1990 AT 1 7 1 2 7 8
2 1991 AT 2 8 1 2 7 8
3 1992 AT 3 9 1 2 7 8
4 1990 DE 4 10 4 5 10 11
5 1991 DE 5 11 4 5 10 11
6 1992 DE 6 12 4 5 10 11
So both time and the variable are changing for the new variables. Here is my attempt:
intitialyears <- c(1990,1991)
intitialvars <- c("var1", "var2")
# ideally, I want code where I only have to change these two vectors
# and where it's possible to change their dimensions
for (i in initialyears){
lapply(initialvars,function(x){
rep(Dataset[time==i,x],each=length(unique(Dataset$time)))
})}
Which runs without error but yields nothing. I would like to assign the variable names in the example (eg. "var1_1990") and immediately make the new variables part of the dataframe. I would also like to avoid the for loop but I don't know how to wrap two lapply's around this function. Should I rather have the function use two arguments? Is the problem that the apply function does not carry the results into my environment? I've been stuck here for a while so I would be grateful for any help!
p.s.: I have the solution to do this combination by combination without apply and the likes but I'm trying to get away from copy and paste:
Dataset$var1_1990 <- c(rep(Dataset$var1[which(Dataset$time==1990)],
each=length(unique(Dataset$time))))
This can be done with subset(), reshape(), and merge():
merge(Dataset,reshape(subset(Dataset,time%in%c(1990,1991)),dir='w',idvar='geo',sep='_'));
## geo time var1 var2 var1_1990 var2_1990 var1_1991 var2_1991
## 1 AT 1990 1 7 1 7 2 8
## 2 AT 1991 2 8 1 7 2 8
## 3 AT 1992 3 9 1 7 2 8
## 4 DE 1990 4 10 4 10 5 11
## 5 DE 1991 5 11 4 10 5 11
## 6 DE 1992 6 12 4 10 5 11
The column order isn't exactly what you have in your question, but you can fix that up after-the-fact with an index operation, if necessary.
Here's a data.table method:
require(data.table)
dt <- as.data.table(Dataset)
in_cols = c("var1", "var2")
out_cols = do.call("paste", c(CJ(in_cols, unique(dt$time)), sep="_"))
dt[, (out_cols) := unlist(lapply(.SD, as.list), FALSE), by=geo, .SDcols=in_cols]
# time geo var1 var2 var1_1990 var1_1991 var1_1992 var2_1990 var2_1991 var2_1992
# 1: 1990 AT 1 7 1 2 3 7 8 9
# 2: 1991 AT 2 8 1 2 3 7 8 9
# 3: 1992 AT 3 9 1 2 3 7 8 9
# 4: 1990 DE 4 10 4 5 6 10 11 12
# 5: 1991 DE 5 11 4 5 6 10 11 12
# 6: 1992 DE 6 12 4 5 6 10 11 12
This assumes that the time variable is identical (and in the same order) for each geo value.
With dplyr and tidyr, and using a custom function, try the following:
Data
Dataset <- data.frame(time=rep(c(1990:1992),2),
geo=c(rep("AT",3),rep("DE",3)),var1=c(1:6), var2=c(7:12))
Code
library(dplyr); library(tidyr)
intitialyears <- c(1990,1991)
intitialvars <- c("var1", "var2")
#create this function
myTranForm <- function(dataSet, varName, years){
  temp <- dataSet %>% select(time, geo, eval(parse(text=varName))) %>%
    filter(time %in% years) %>%
    mutate(time = paste(varName, time, sep = "_"))
  names(temp)[names(temp) %in% varName] <- "someRandomStringForVariableName"
  temp <- temp %>% spread(time, someRandomStringForVariableName)
  return(temp)
}
#Then lapply on intitialvars using the custom function
DatasetList <- lapply(intitialvars, function(x) myTranForm(Dataset, x, intitialyears))
#and loop over the data frames in the list
for (i in 1:length(intitialvars)) {
  Dataset <- left_join(Dataset, DatasetList[[i]])
}
Dataset

An increasing counter for occurrence of new values in R

I am trying to make a counter which increases for each new value in another vector. E.g. I have several individuals that are observed over several weeks, and I want to know how many distinct weeks each has been observed. I'll end up with a table like this:
Id year Week Weeks observed
1 2006 10 1
1 2006 10 1
1 2006 11 2
1 2006 11 2
1 2006 12 3
1 2006 13 4
1 2007 1 5
1 2007 2 6
1 2007 3 7
1 2007 4 8
1 2007 5 9
1 2007 6 10
2 2006 10 1
2 2006 10 1
2 2006 11 2
2 2006 11 2
2 2006 12 3
2 2006 13 4
2 2007 1 5
2 2007 2 6
2 2007 3 7
2 2007 4 8
2 2007 5 9
2 2007 6 10
Assuming you have your data in a data.frame called dat, you could use tapply and convert Phase to a factor then strip it of its levels to use the underlying integer values:
dat$newcounter <- unlist(tapply(dat$Phase, dat$Id,
                                function(x) unclass(as.factor(x))))
Obligatory data.table answer:
library(data.table)
dt<-as.data.table(dat)
dt[, newcounter := unclass(as.factor(Phase)), by = Id]
EDIT
To account for the newly phrased question, here is a possibility using data.table.
dt <- as.data.table(dat[, -4]) # Create data.table
setkeyv(dt, c("Id", "year", "Week")) # Create key for data.table
dt2 <- unique(dt) # Get only unique rows by key
dt3 <- dt2[, Weeks.observed := seq_len(.N), by = "Id"] # Create new variable
dt[dt3] # Merge data.tables back together
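The answers above index a Phase column that doesn't appear in the question's table; with the columns actually shown (Id, year, Week), a base-R sketch that counts each new (year, Week) combination within Id (an abbreviated version of the sample data is assumed):

```r
# Abbreviated sample: two individuals, same observation pattern
dat <- data.frame(Id   = rep(1:2, each = 6),
                  year = rep(c(2006, 2006, 2006, 2007, 2007, 2007), 2),
                  Week = rep(c(10, 10, 11, 1, 2, 2), 2))

# Build a year-week key, then count the cumulative number of distinct
# keys seen so far within each Id (rows must be in chronological
# order within each Id)
key <- paste(dat$year, dat$Week)
dat$Weeks.observed <- ave(seq_along(key), dat$Id,
                          FUN = function(i) cumsum(!duplicated(key[i])))
```

The key combines year and Week so that week 10 of 2006 and week 10 of 2007 count as different observation weeks.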
