How to match 1 column to 2 columns? - r

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
3 NA 3
NA 1 5
5 NA 1
NA 4 2
NA 2 4
What I want to do is match column "number" from df1 to EITHER "number_a" or "number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!

Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
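To make the idea concrete, here is a minimal sketch that rebuilds the question's data and applies the ifelse step before the usual match lookup. It assumes the empty cells of number_a and number_b are NA and that every row of df2 has exactly one of the two filled in; the data frames are typed in by hand from the question.

```r
# Data from the question; blank cells are assumed to be NA.
df1 <- data.frame(number = 1:5, value = NA)
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value    = c(3, 5, 1, 2, 4))

# Collapse the two key columns into one: take number_a unless it is NA.
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# Single-column lookup, exactly as in the question.
df1$value <- df2$value[match(df1$number, df2$number)]
df1
```

If a row could have both number_a and number_b filled in with different numbers, this would silently prefer number_a, so it is worth checking that assumption on real data.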

I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid (the other is NA),
# combine them into a new column 'number' and drop the original two columns
df2 <- transform(df2, number = ifelse(is.na(df2$number_a), df2$number_b,
df2$number_a))[-c(1:2)]
# Merge the two data frames on 'number'; only that column is taken from df1
# so that its empty 'value' column does not clash with the one in df2
df <- merge(df1["number"], df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1
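An alternative that leaves df2 untouched is to do the two lookups from the question in sequence and let the second one fill only the gaps left by the first. This is a sketch under the same assumption as above (blank cells are NA); the name unmatched is just a helper introduced here.

```r
# Data from the question; blank cells are assumed to be NA.
df1 <- data.frame(number = 1:5)
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value    = c(3, 5, 1, 2, 4))

# First try number_a; positions with no match come back as NA.
idx <- match(df1$number, df2$number_a)

# Fall back to number_b only where number_a gave no match.
unmatched <- is.na(idx)
idx[unmatched] <- match(df1$number[unmatched], df2$number_b)

df1$value <- df2$value[idx]
df1
```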

Related

combine datasets by the value of multiple columns

I'm trying to enter values based on the value of multiple columns from two datasets.
I have my main dataset (df1), with lists of a location and corresponding dates and df2 consists of a list of temperatures at all locations on every possible date. Eg:
df1
Location Date
A 2
B 1
C 1
D 3
B 3
df2
Location Date1Temp Date2Temp Date3Temp
A -5 -4 0
B 2 0 2
C 4 4 5
D 6 3 4
I would like to create a temperature variable in df1, according to the location and date of each observation. Preferably I would like to carry this out with all Temperature data in the same dataframe, but this can be separated and added 'by date' if necessary. With the example data, I would want this to create something like this:
Location Date Temp
A 2 -4
B 1 2
C 1 4
D 3 4
B 3 2
I've been playing around with merge and ifelse, but haven't figured anything out yet.
Is this what you need?
library(reshape2)
library(magrittr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),d1t=c(-5,5,4,6),d2t=c(-4,0,4,3),d3t=c(0,2,5,4))
merge(df1,df2) %>% melt(id.vars=c("Location","Date"))
Here's how to do that with dplyr and tidyr.
Basically, you want to use gather to melt the DateXTemp columns from df2 into two columns. Then, you want to use gsub to remove the "Date" and "Temp" strings to get numbers that are comparable to what you have in df1. Since DateXTemp were initially characters, you need to transform the remaining numbers to numeric with as.numeric. I then use left_join to join the tables.
library(dplyr);library(tidyr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),Date1Temp=c(-5,5,4,6),
Date2Temp=c(-4,0,4,3),Date3Temp=c(0,2,5,4))
df2_new <- df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))
df1%>%left_join(df2_new)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
EDIT
As suggested by @Sotos, you can do that in a single pipe like so:
df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))%>%
left_join(df1,.)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
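For comparison, the same lookup can be sketched in base R without reshaping, by indexing a matrix of temperatures with (row, column) pairs. This assumes the temperature columns of df2 are stored in date order, so that Date k corresponds to the k-th temperature column; the values are the ones from the question's table.

```r
df1 <- data.frame(Location = c("A", "B", "C", "D", "B"),
                  Date = c(2, 1, 1, 3, 3))
df2 <- data.frame(Location = c("A", "B", "C", "D"),
                  Date1Temp = c(-5, 2, 4, 6),
                  Date2Temp = c(-4, 0, 4, 3),
                  Date3Temp = c(0, 2, 5, 4))

# Numeric matrix holding only the temperatures (drop the Location column).
temps <- as.matrix(df2[, -1])

# Row of df2 that each observation in df1 refers to.
rows <- match(df1$Location, df2$Location)

# A two-column index matrix picks one cell per observation.
df1$Temp <- temps[cbind(rows, df1$Date)]
df1
```

If the date columns were not contiguous or not in order, the column index would need to be derived from the column names instead, as the gsub step in the answer above does.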

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
df_m[, id := .GRP, by = .(gsub("(.*).","\\1", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.
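A base R variant of the same idea groups the columns by their name prefix instead of by position. This is only a sketch: it assumes the column names follow the pattern prefix.number (L1.1, L1.2, ...), and the names prefix, out, cols and ref are introduced here for illustration.

```r
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
                  stringsAsFactors = FALSE, header = TRUE)

# Strip the trailing ".<number>" to get the group of each data column.
data_cols <- names(df1)[-1]
prefix <- sub("\\.[0-9]+$", "", data_cols)

out <- df1
for (p in unique(prefix)) {
  cols <- data_cols[prefix == p]
  # Reference table of unique values across all columns of this group.
  ref <- unique(unlist(df1[cols], use.names = FALSE))
  # Recode every column of the group against the shared reference.
  out[cols] <- lapply(df1[cols], match, table = ref)
}
out
```

Because the reference table is built per prefix, the IDs restart at 1 for each group, as in the loop answer above.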

Dropping common rows in two different dataframes

I am a beginner with R. I have two data frames, df-1 and df-2, shown below. I want to combine them and drop the rows they have in common, so that only the rows whose ID appears in just one of the data frames remain.
What I want to end up with is df-3.
A merge is not appropriate because I don't need common rows.
df-1
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 B67 200302034466 1 20031204 3 1
3 C15 200302034455 1 20031223 3 1
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-2
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 K99 200402034466 1 20041204 2 3
3 Z75 200502034455 2 20021222 1 6
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-3
ID NUMBER FORM DATE CD AD
1 B67 200302034466 1 20031204 3 1
2 C15 200302034455 1 20031223 3 1
3 K99 200402034466 1 20041204 2 3
4 Z75 200502034455 2 20021222 1 6
Use rbind to combine df1 and df2 and then select the unique rows:
df3 <- unique(rbind(df1,df2))
Can you just use unique on df3 to keep only unique rows? Or, in one line,
df3 <- unique(merge(df1, df2))
Also, avoid using brackets when naming variables - df(1) looks like "apply function df to 1"
If I'm interpreting your question correctly you want a dataframe with records that are present in only one of the original dataframes.
With dplyr:
library(dplyr)
df1_anti <- anti_join(df1, df2)
df2_anti <- anti_join(df2, df1)
df3 <- bind_rows(df1_anti, df2_anti)
df1_anti contains rows present in df1 but not in df2.
df2_anti contains rows present in df2 but not in df1.
df3 is the union of the two anti-joins, i.e. the rows that appear in only one of the two data frames.
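The same result can be sketched in base R by stacking the two data frames and dropping every row that occurs more than once. The example below uses a cut-down, hand-typed version of the question's data (only the ID and AD columns) and assumes neither data frame contains duplicate rows of its own, since those would be dropped too.

```r
df1 <- data.frame(ID = c("A15", "B67", "C15", "D67"),
                  AD = c(7, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("A15", "K99", "Z75", "D67"),
                  AD = c(7, 3, 6, 1),
                  stringsAsFactors = FALSE)

both <- rbind(df1, df2)

# Flag a row if an identical row exists anywhere else in the stack:
# duplicated() marks later copies, fromLast = TRUE marks earlier ones.
is_common <- duplicated(both) | duplicated(both, fromLast = TRUE)

df3 <- both[!is_common, ]
df3
```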

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
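If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ accessor and usually no row names, so select the ID column by name and convert the sampled indices back to integers:
dat[as.integer(tapply(as.character(seq(nrow(dat))),dat[,"ID"],FUN=sample,1)),]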
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
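The difference is easy to check side by side. In this small sketch the seed is arbitrary and the variable names are made up for the demonstration: a single number is expanded to 1:x before sampling, while a single character string is returned unchanged.

```r
set.seed(1)

# A length-one numeric x is interpreted as "sample from 1:x".
draws_numeric <- replicate(20, sample(4, 1))

# A length-one character x has nothing to expand, so it is returned as is.
draws_character <- replicate(20, sample("4", 1))

range(draws_numeric)      # values fall anywhere in 1..4
unique(draws_character)   # only the string "4"
```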
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Merging two data frames with different sizes and missing values

I'm having a problem merging two data frames in R.
The first one consists of 103731 obs of 6 variables. The variable that I have to use to merge has 77111 unique values and the rest are NAs with a value of 0. The second one contains the frequency of those variables plus the frequency of the NAs so a frame of 77112 obs for 2 variables.
The resulting frame I need to get is the first one joined with the frequency for the merging variable, so a df of 103731 obs with the frequency for each value of the merging variable (so with duplicates if freq > 1 and also for each NA (or 0)).
Can anybody help me?
The result I'm getting now contains a data frame of 1 894 919 obs and I used:
tot = merge(df1, df2, by = "mergingVar", all= F, sort = F);
Also I played a lot with 'all=' and none of the variations gave the right df.
Why don't you just take the frequency table of your first table?
a <- data.frame(a = c(NA, NA, 2,2,3,3,3))
data.frame(table(a, useNA = 'ifany'))
a Freq
1 2 2
2 3 3
3 <NA> 2
Or use mutate from plyr (which needs the package loaded first):
library(plyr)
ddply(a, .(a), mutate, freq = length(a))
a freq
1 2 2
2 2 2
3 3 3
4 3 3
5 3 3
6 NA 2
7 NA 2
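If adding a package is not an option, a base R sketch of the same per-row frequency is shown below. It keeps the original row order and row count, and counts NA as its own group; the column name mergingVar is taken from the question, while key and counts are names introduced here.

```r
df1 <- data.frame(mergingVar = c(NA, NA, 2, 2, 3, 3, 3))

# Turn the key into a factor and make NA an explicit level,
# so that missing values get a group code of their own.
key <- addNA(factor(df1$mergingVar))

# Number of rows per level, in level order.
counts <- tabulate(key, nbins = nlevels(key))

# Look up each row's count through its level code; no merge is involved,
# so the number of rows cannot grow.
df1$freq <- counts[as.integer(key)]
df1
```

Because nothing is joined, this avoids the row blow-up that merge produces when the key is repeated on both sides.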

Resources