Match a Variable to Another Dataset Based on Multiple Overlapping Variables - r

I have two data sets with some overlapping variables. One dataset is basically a subset of the other but needs an additional variable added based on some of the overlapping variables. For example:
varA <- c(rep(c("a","b"), each=5))
blah <- c(11:20)
varB <- c(1:10)
speed <- rnorm(10)
dataset1 <- data.frame(varA,blah,varB,speed)
varA.2 <- c("a","a","b","b")
varB.2 <- c(2,10,11,7)
speed.2 <- rep(NA, 4)
dataset2 <- data.frame(varA.2, varB.2, speed.2)
dataset2
I would like the "speed.2" variable to contain the speed values for the lines where varA and varB are matching between the two sets.
I've tried something with "merge" but am having issues.
Thank you!

Maybe:
colnames(dataset2) <- gsub("\\..*","", colnames(dataset2))
library(dplyr)
left_join(dataset2[,-3],dataset1[,-2])
# Joining by: c("varA", "varB")
# varA varB speed
#1 a 2 -1.3243815
#2 a 10 NA
#3 b 11 NA
#4 b 7 -0.6026936
Or without changing the column names.
merge(dataset1[,-2],dataset2[,-3], by.x=c("varA","varB"), by.y=c("varA.2", "varB.2"), all.y=TRUE)
# varA varB speed
# 1 a 2 -0.6797753
# 2 a 10 NA
# 3 b 7 -2.1838454
# 4 b 11 NA
The speed values differ between the two outputs because the example data were generated without set.seed().
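For a reproducible comparison, here is a minimal base R sketch that fixes the seed (the value 42 is an arbitrary assumption, not from the question) and runs both a merge lookup and a paste/match lookup on the same data. The names merged and idx are only illustrative.

```r
set.seed(42)  # arbitrary seed (an assumption) so the speeds are reproducible

dataset1 <- data.frame(varA  = rep(c("a", "b"), each = 5),
                       blah  = 11:20,
                       varB  = 1:10,
                       speed = rnorm(10))
dataset2 <- data.frame(varA.2 = c("a", "a", "b", "b"),
                       varB.2 = c(2, 10, 11, 7))

# Approach 1: merge on both keys, keeping every row of dataset2
merged <- merge(dataset2, dataset1[, c("varA", "varB", "speed")],
                by.x = c("varA.2", "varB.2"), by.y = c("varA", "varB"),
                all.x = TRUE)

# Approach 2: build a combined key and look it up with match()
idx <- match(paste(dataset2$varA.2, dataset2$varB.2),
             paste(dataset1$varA, dataset1$varB))
dataset2$speed.2 <- dataset1$speed[idx]

merged
dataset2
```

Both lookups give NA for the pairs (a, 10) and (b, 11), which do not occur in dataset1.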

You can use the 'match' function to find "where varA and varB are matching":
dataset2$speed.2 = dataset1[match(paste(dataset2$varA.2,dataset2$varB.2),
paste(dataset1$varA, dataset1$varB)),]$speed
dataset2
varA.2 varB.2 speed.2
1 a 2 0.3917783
2 a 10 NA
3 b 11 NA
4 b 7 1.3265439

Related

How to fill in R data.frame with named vectors of different lengths?

I need to fill in R data.frame (or data.table) using named vectors as rows. The problem is that named vectors to be used as rows usually do not have all the variables. In other words, usually named vector has smaller length than the number of columns. Names of variables in the vectors coincide with column names of the dataframe:
df <- data.frame(matrix(NA, 2, 3))
colnames(df) <- c("A", "B", "C")
obs1 <- c(A=2, B=4)
obs2 <- c(A=3, C=10)
I want df as follows:
> df
A B C
1 2 4 NA
2 3 NA 10
So I want to fill in the first two rows with obs1 and obs2 respectively. When I try to do it, I get an error:
> df[1,] <- obs1
Error in `[<-.data.frame`(`*tmp*`, 1, , value = c(A = 2, B = 4)) :
replacement has 2 items, need 3
I suspect that a similar question has already been asked, but I could not find it. Does anybody know how to do this using data.frame or data.table?
We also need to select the columns, based on the names of 'obs1' and 'obs2':
df[1, names(obs1)] <- obs1
df[2, names(obs2)] <- obs2
Output:
> df
A B C
1 2 4 NA
2 3 NA 10
Indexing with df[1,] selects the first row with all of its columns, i.e. a length of 3, whereas 'obs1' and 'obs2' each have a length of only 2, hence the length error.
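With more than two observations, the same name-based assignment can be put in a loop. A minimal sketch, assuming the named vectors are collected in a list (obs_list is a hypothetical name, not from the question):

```r
# Template with the full set of columns, filled with NA
df <- data.frame(matrix(NA_real_, nrow = 2, ncol = 3))
colnames(df) <- c("A", "B", "C")

# Hypothetical list holding one named vector per row
obs_list <- list(c(A = 2, B = 4), c(A = 3, C = 10))

for (i in seq_along(obs_list)) {
  obs <- obs_list[[i]]
  df[i, names(obs)] <- obs   # assign only the columns this vector names
}
df
```

Columns that a vector does not name keep the NA from the template.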
Also, creating a template dataset to fill is not really needed: bind_rows automatically fills with NA the columns that are not present in a given vector.
library(dplyr)
bind_rows(obs1, obs2)
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
A solution with data.table:
library(data.table)
obs1 <- data.table(t(obs1))
obs2 <- data.table(t(obs2))
df <- rbindlist(list(obs1,obs2),fill=TRUE)
df
Output:
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10

merging on multiple columns R

I'm surprised if this isn't a duplicate, but I couldn't find the answer anywhere else.
I have two data frames, data1 and data2, that differ in one column, but the rest of the columns are the same. I would like to merge them on a unique identifying column, id. However, in the event an ID from data2 does not have a match in data1, I want the entry in data2 to be appended at the bottom, similar to plyr::rbind.fill() rather than renaming all the corresponding columns in data2 as column1.x and column1.y. I realize this isn't the clearest explanation, maybe I shouldn't be working on a Saturday. Here is code to create the two dataframes, and the desired output:
spp1 <- c('A','B','C')
spp2 <- c('B','C','D')
trait.1 <- rep(1.1,length(spp1))
trait.2 <- rep(2.0,length(spp2))
id_1 <- c(1,2,3)
id_2 <- c(2,9,7)
data1 <- data.frame(spp1,trait.1,id_1)
data2 <- data.frame(spp2,trait.2,id_2)
colnames(data1) <- c('spp','trait.1','id')
colnames(data2) <- c('spp','trait.2','id')
Desired output:
spp trait.1 trait.2 id
1 A 1.1 NA 1
2 B 1.1 2 2
3 C 1.1 NA 3
4 C NA 2 9
5 D NA 2 7
Try this:
library(dplyr)
full_join(data1, data2, by = c("id", "spp"))
Output:
spp trait.1 id trait.2
1 A 1.1 1 NA
2 B 1.1 2 2
3 C 1.1 3 NA
4 C NA 9 2
5 D NA 7 2
Alternatively, also merge would work:
merge(data1, data2, by = c("id", "spp"), all = TRUE)
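To see why both id and spp are listed in by, here is a small base R sketch comparing a merge on id alone with a merge on both columns. The names by_id and by_both are only illustrative.

```r
data1 <- data.frame(spp = c("A", "B", "C"), trait.1 = 1.1, id = c(1, 2, 3))
data2 <- data.frame(spp = c("B", "C", "D"), trait.2 = 2.0, id = c(2, 9, 7))

# Joining on id only: spp exists in both inputs, so it is split into spp.x / spp.y
by_id <- merge(data1, data2, by = "id", all = TRUE)

# Joining on id and spp: both are treated as keys, so a single spp column remains
by_both <- merge(data1, data2, by = c("id", "spp"), all = TRUE)

names(by_id)
names(by_both)
```

Every column that the two data frames share and that should not be duplicated needs to be part of the key.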

Combine 2 tables with common variables but no common observations

I would like to combine two datasets (tables) that have only some (not all) variables in common and no observations in common. In other words, I want to append dataset1 to dataset2, adding the column names of dataset2, with the empty fields of the table filled in with NA.
So I tried the following function:
matchcol = function(x,y){
y = y[,match(colnames(x),colnames(y))]
colnames(y)=colnames(x)
return(y)
}
sum =matchcol(dataset1,dataset2)
data = rbind(dataset1,dataset2)
But I get: "Error: NA columns indexes not supported."
What can I do? What can I change in my code?
Thx!!
rbind requires both data frames to have the same column names, but bind_rows from the dplyr package does not. Try this:
library(dplyr)
data <- bind_rows(dataset1, dataset2)
Example:
dataset1 <- data.frame(a= 1:5,b=6:10)
dataset2 <- data.frame(a= 11:15,c=16:20)
data <- bind_rows(dataset1,dataset2)
# a b c
# 1 1 6 NA
# 2 2 7 NA
# 3 3 8 NA
# 4 4 9 NA
# 5 5 10 NA
# 6 11 NA 16
# 7 12 NA 17
# 8 13 NA 18
# 9 14 NA 19
# 10 15 NA 20
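As for the original function: match() returns NA for every column of x that is absent from y, and indexing columns with NA is an error. If dplyr is not available, a rough base R sketch of the same idea is to add the missing columns as NA on each side before calling rbind. The helper name bind_fill is hypothetical, not an existing function.

```r
# Hypothetical helper: row-bind two data frames, filling absent columns with NA
bind_fill <- function(x, y) {
  all_cols <- union(names(x), names(y))
  for (col in setdiff(all_cols, names(x))) x[[col]] <- NA  # columns only in y
  for (col in setdiff(all_cols, names(y))) y[[col]] <- NA  # columns only in x
  rbind(x[all_cols], y[all_cols])                          # same columns, same order
}

dataset1 <- data.frame(a = 1:5, b = 6:10)
dataset2 <- data.frame(a = 11:15, c = 16:20)
res <- bind_fill(dataset1, dataset2)
res
```

This sketch assumes both inputs have at least one row and that shared columns have compatible types.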
If I understand your question right, it looks like dplyr::full_join is good for that:
library(dplyr)
dataset1 <- data.frame(Var_A = 1:10, Var_B = 100:109)
dataset2 <- data.frame(Var_A = 11:20, Var_C = 200:209)
dataset_new <- full_join(dataset1, dataset2)
dataset_new
This will automatically join the two datasets by their common column names and add all other columns from both datasets. Empty fields are filled with NA.
Does that work for you?

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that, but I am very new to it. I was looking at the intersect command, but I am not sure whether it is going to work. I haven't included any code because I'm really lost here.
I also need that the order of the first columns (which are names) in the first dataset stays the same, but with the new values from the second column of the second dataset.
Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=TRUE)
df.merge <- df.merge[, -2]
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
Note that merge sorts the result by the key column by default (sort=TRUE), so it will not necessarily keep the order of the first data frame. You can keep the order strictly by adding an order column, df1$order <- 1:nrow(df1), before merging and sorting on that column afterwards.
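A minimal sketch of the order-column idea, assuming a df1 whose names are deliberately not in alphabetical order (the values here are made up for illustration):

```r
# Hypothetical data: df1 is intentionally not sorted by name
df1 <- data.frame(name = c("ab59614", "ab23242", "ab47490", "ab35366"),
                  X = c(1, 2, 3, 4))
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789))

df1$order <- seq_len(nrow(df1))            # remember the original row order
m <- merge(df1, df2, by = "name", all.x = TRUE)
m <- m[order(m$order), c("name", "X.y")]   # restore the order, keep the new values
m
```

The result lists the names in the order they had in df1, each with its value from df2.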
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
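The bite comes from using positions in df2 to index df1: match(df1$name1, df2$name2) returns row numbers of df2 (here 6 to 10), and df1 has only five rows, so the right-hand side is all NA. The first test only worked because the positions happened to coincide. A sketch of a safer pattern, with an extra name "z" added as an assumption to show how unmatched rows are skipped:

```r
df1 <- data.frame(name1 = c(letters[6:10], "z"),      # "z" is absent from df2
                  valuecol1 = c(2, 4, 6, 8, 10, 99))
df2 <- data.frame(name2 = letters[1:10], valuecol2 = 10:1)

idx <- match(df1$name1, df2$name2)  # row of df2 for each row of df1, NA if absent
ok  <- !is.na(idx)                  # rows of df1 that have a partner in df2

# left side: positions in df2; right side: the corresponding rows of df1
df2$valuecol2[idx[ok]] <- df1$valuecol1[ok]
df2
```

Rows a to e keep their old values, f to j take the values from df1, and the unmatched "z" is ignored.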
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in dt.1 with value from dt.2 where id's match.
# data table joins - very efficient
# the join result has 3 columns: id, value (from dt.2) and i.value (from dt.1)
# (older data.table versions named the last column value.1)
dt.1 <- dt.2[dt.1, nomatch=NA]
dt.1[is.na(value), value := i.value] # keep the original value where there is no match
dt.1[, i.value := NULL] # get rid of extra column
NB: The result is ordered by id, which should be OK since dt.1 is sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Subsetting with variable selection range

I have to make a set of selections that vary by day on this dataset (dat), which is composed of species (sp), day (day, in POSIXct) and area (ar):
sp day ar
A 1-Jan-00 2
B 1-Jan-00 6
C 2-Jan-00 2
A 2-Jan-00 1
D 2-Jan-00 4
E 2-Jan-00 12
F 3-Jan-00 8
A 4-Jan-00 3
G 4-Jan-00 2
B 4-Jan-00 1
I need to subset where species "A" occurs. However, the areas to be selected will vary by day, given by this matrix (dat.ar):
day ar.select
1-Jan-00 (1,6)
2-Jan-00 (1,12)
3-Jan-00 (4,8)
4-Jan-00 (3,12)
More specifically, for areas where species "A" occurs, on 1-jan-00, I need only areas 1 and 6. For 2-jan-00, areas 1 and 12, and so on.
As an example, the desired output on this example for this selection is given below:
sp day ar
A 2-Jan-00 1
A 4-Jan-00 3
I haven't had much success writing a for loop, as I am still learning the semantics of R. In summary, I have a rough idea of what must be done but am still struggling with the language. Here is a sketch of where I think this should go:
dat1 = with(dat,sapply(day[sp=="A" & dat.ar$day.s[i] ],
function(x) ar == (ar[sp=="A" & day == x]==dat.ar$ar.select[j])
final=dat[rowSums(dat1) > 0, ]
I believe I need a for loop that goes through dat.ar, specifying the areas to be selected in dat. But despite my efforts, I haven't gotten anywhere near a working loop. I am not even sure whether combining sapply and a for loop is the right way to go about this.
In case someone wishes to reproduce the problem:
sp=c("A","B","C","A","D","E","F","A","G","B")
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "3-Jan-00", "4-Jan-00", "4-Jan-00", "4-Jan-00")
day=as.POSIXct(day, format="%d-%b-%y")
ar=c(2,6,2,1,4,12,8,3,2,1)
dat= as.data.frame(cbind(sp, day, ar))
day.s=c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s=as.POSIXct(day.s, format="%d-%b-%y")
a.s=c(1,1,4,3)
a.e=c(6,12,8,12)
ar.select=paste(a.s, a.e, sep=",")
dat.ar=cbind(day.s, ar.select)
Any help is much appreciated.
You could merge your table of conditions to the original dataset and filter them conditionally. Consider a1 and a2 like your sp and day values, and obs to be like your ar value.
library(data.table)
dataset <- data.table(
a1 = c("A","B","C","B","A","A","A","A"),
a2 = c("P","Q","Q","Q","R","R","P","Q"),
obs = c(3,2,3,4,2,4,8,0)
)
constraints <- data.table(
a1 = c("A","B","C","A","B","C","A","B","C"),
a2 = c("P","P","P","Q","Q","Q","R","R","R"),
lower = c(1,2,3,4,3,2,3,2,5),
upper = c(6,4,5,7,5,6,5,3,7)
)
checkingdataset <- merge(dataset,constraints, by = c("a1","a2"), all.x = TRUE)
checkingdataset[obs <= upper & obs >= lower, obs.keep := TRUE]
# a1 a2 obs lower upper obs.keep
#1: A P 3 1 6 TRUE
#2: A P 8 1 6 NA
#3: A Q 0 4 7 NA
#4: A R 2 3 5 NA
#5: A R 4 3 5 TRUE
#6: B Q 2 3 5 NA
#7: B Q 4 3 5 TRUE
#8: C Q 3 2 6 TRUE
First, I would not use as.data.frame(cbind(...)) to make your data.frames. Second, I would create dat.ar in much the same structure that you have created dat. Third, I would then just use merge to get the result you are looking for.
dat <- data.frame(sp=c("A","B","C","A","D","E","F","A","G","B"),
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "2-Jan-00", "3-Jan-00", "4-Jan-00",
"4-Jan-00", "4-Jan-00"),
ar=c(2,6,2,1,4,12,8,3,2,1))
dat$day <- as.POSIXct(dat$day, format="%d-%b-%y")
day.s <- c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s <- as.POSIXct(day.s, format="%d-%b-%y")
a.s <- c(1,1,4,3)
a.e <- c(6,12,8,12)
ar.select <- paste(a.s, a.e, sep=",")
dat.ar <- data.frame(sp = "A", day = day.s, ar = ar.select)
dat.ar <- cbind(dat.ar[-3],
read.csv(text = as.character(dat.ar$ar), header = FALSE))
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
dat.ar
# sp day variable ar
# 1 A 2000-01-01 V1 1
# 2 A 2000-01-02 V1 1
# 3 A 2000-01-03 V1 4
# 4 A 2000-01-04 V1 3
# 5 A 2000-01-01 V2 6
# 6 A 2000-01-02 V2 12
# 7 A 2000-01-03 V2 8
# 8 A 2000-01-04 V2 12
merge(dat, dat.ar)
# sp day ar variable
# 1 A 2000-01-02 1 V1
# 2 A 2000-01-04 3 V1
Of course, I would just suggest that you make your dat.ar object in a more friendly manner to begin with. Why paste values together if you are going to separate them out later anyway? ;)
dat.ar <- data.frame(sp = "A",
day = c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00"),
a.s = c(1,1,4,3), a.e = c(6,12,8,12))
dat.ar$day <- as.POSIXct(dat.ar$day, format="%d-%b-%y")
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
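If reshape2 is not available, the wide form of dat.ar can also be used directly: merge on sp and day, then keep the rows whose area equals either selected value. This is a base R sketch only; it assumes ISO date strings with as.Date (to avoid locale-dependent month parsing) instead of the POSIXct days above, and the names m and final are illustrative.

```r
days <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04"))

dat <- data.frame(sp  = c("A", "B", "C", "A", "D", "E", "F", "A", "G", "B"),
                  day = days[c(1, 1, 2, 2, 2, 2, 3, 4, 4, 4)],
                  ar  = c(2, 6, 2, 1, 4, 12, 8, 3, 2, 1))

# One row per day with the two areas to select for species "A"
dat.ar <- data.frame(sp = "A", day = days,
                     a.s = c(1, 1, 4, 3), a.e = c(6, 12, 8, 12))

m <- merge(dat, dat.ar, by = c("sp", "day"))   # inner join keeps species "A" only
final <- m[m$ar == m$a.s | m$ar == m$a.e, c("sp", "day", "ar")]
final
```

For the example data this keeps the two species "A" rows with areas 1 and 3, matching the desired output in the question.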
