combine datasets by the value of multiple columns - r

I'm trying to enter values based on the value of multiple columns from two datasets.
I have my main dataset (df1), with lists of a location and corresponding dates and df2 consists of a list of temperatures at all locations on every possible date. Eg:
df1
Location Date
A 2
B 1
C 1
D 3
B 3
df2
Location Date1Temp Date2Temp Date3Temp
A -5 -4 0
B 2 0 2
C 4 4 5
D 6 3 4
I would like to create a temperature variable in df1, according to the location and date of each observation. Preferably I would like to carry this out with all Temperature data in the same dataframe, but this can be separated and added 'by date' if necessary. With the example data, I would want this to create something like this:
Location Date Temp
A 2 -4
B 1 2
C 1 4
D 3 4
B 3 2
I've been playing around with merge and ifelse, but haven't figured anything out yet.

Is this what you need?
library(reshape2)
library(magrittr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),d1t=c(-5,5,4,6),d2t=c(-4,0,4,3),d3t=c(0,2,5,4))
merge(df1,df2) %>% melt(id.vars=c("Location","Date"))

Here's how to do that with dplyr and tidyr.
Basically, you want to use gather to melt the DateXTemp columns from df2 into two columns. Then, you want to use gsub to remove the "Date" and "Temp" strings to get numbers that are comparable to what you have in df1. Since DateXTemp were initially characters, you need to transform the remaining numbers to numeric with as.numeric. I then use left_join to join the tables.
library(dplyr);library(tidyr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),Date1Temp=c(-5,5,4,6),
Date2Temp=c(-4,0,4,3),Date3Temp=c(0,2,5,4))
df2_new <- df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))
df1%>%left_join(df2_new)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
EDIT
As suggested by @Sotos, you can do this in a single pipeline:
df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))%>%
left_join(df1,.)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
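For comparison, here is a minimal base R sketch that skips the reshape and looks each temperature up directly by (row, column) position. It uses the question's example values and assumes the DateXTemp columns in df2 are ordered so that column k holds date k; the helper names temps and row_idx are only illustrative.

```r
df1 <- data.frame(Location = c("A", "B", "C", "D", "B"),
                  Date = c(2, 1, 1, 3, 3))
df2 <- data.frame(Location = c("A", "B", "C", "D"),
                  Date1Temp = c(-5, 2, 4, 6),
                  Date2Temp = c(-4, 0, 4, 3),
                  Date3Temp = c(0, 2, 5, 4))

# Temperatures as a matrix: one row per location, one column per date
# (assumption: column k corresponds to date k)
temps <- as.matrix(df2[, -1])

# Row of df2 that matches each location in df1
row_idx <- match(df1$Location, df2$Location)

# A two-column index matrix picks one (row, column) cell per observation
df1$Temp <- temps[cbind(row_idx, df1$Date)]
df1
```

This only works while Date can be used directly as a column position; if the dates are not 1, 2, 3, ... the join-based answers above are the safer choice.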

Related

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
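If the paired columns are not guaranteed to sit next to each other, a variation on the loop above can group them by name instead of by position. This is only a sketch: it assumes every column other than IID is named prefix.suffix (L1.1, L1.2, ...), and the names prefixes, out and cols are illustrative.

```r
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 38V1 38V1 48V1 52V1
2 36V1 38V2 50V1 48Y1
3 37Y1 36V1 50V2 48V1
4 38V2 36V2 52V1 50V2",
                  header = TRUE, stringsAsFactors = FALSE)

value_cols <- names(df1)[-1]
# Strip the trailing ".1" / ".2" to get the group each column belongs to
prefixes <- sub("\\.[^.]+$", "", value_cols)

out <- df1
for (p in unique(prefixes)) {
  cols <- value_cols[prefixes == p]
  # Unique values across all columns of this group, in order of appearance
  refs <- unique(unlist(df1[cols]))
  # Replace each value by its position in the shared reference vector
  out[cols] <- lapply(df1[cols], match, table = refs)
}
out
```

Because refs is built once per group, the same value gets the same ID in every column of that group, however many columns the group has.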
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# group by column-pair prefix ("L1", "L2") and value; .GRP numbers the groups
df_m[, id := .GRP, by = .(sub("\\.\\d+$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

Reordering the variables

After merging two data sets, I have a data frame with 300 variables: some end in .x, some end in .y, and some have neither suffix. How can I move all variables that do not end in .x or .y to the first 100 columns of the data set? I also want columns 101 onwards to be arranged like day.x, day.y, city.x, city.y, number.x, number.y, etc. That is, variables with the same name (say city) but different suffixes should be next to each other.
For example:
city.y<- c(1,2,3,5,5,7,7,NA,NA,3,4,5)
B<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
number.x<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
day.x<-c(1,3,4,5,6,7,8,1,NA,3,5,3)
Z<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
day.y<-c(4,5,6,7,8,7,8,1,2,3,5,NA)
number.y<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
school.x<-c("a","b","b","c","n","f","h","NA","F","G","z","h")
S<-c(5,2,3,4,5,6,5,NA,NA,5,6,6)
school.y<-c("a","b","b","c","m","g","h","NA","NA","G","H","T")
city.x<- c(1,2,3,7,5,8,7,5,6,7,5,1)
df<- data.frame(city.y,B,number.x,day.x,Z,day.y,number.y,school.x,S,school.y,city.x)
I want to reorder the variables in this format: B,S,Z,city.x,city.y,number.x,number.y,day.x,day.y and ...
Add one column to create more general use case:
df$ZZZZZ = 1:6
Then, load the dplyr package (for the chaining operator %>% and the select function):
library(dplyr)
Reordering the columns by their sorted names will get each sub-grouping of columns in the right relative order:
df = df[, sort(names(df))]
Now use a regular expression -matches("\\.[xy]$") to capture all the columns without ".x" or ".y" at the end and put those columns at the beginning. Then put all the other columns after them.
df = df %>% select(-matches("\\.[xy]$"), everything())
df
B S Z ZZZZZ city.x city.y day.x day.y number.x number.y school.x school.y
1 3 5 1 1 1 1 1 4 1 3 a a
2 4 2 2 2 2 2 3 5 2 4 b b
...
11 NA 6 5 5 5 4 5 5 5 NA z H
12 6 6 6 6 1 5 3 NA 6 6 h T
If you like, you can also set your own suffixes in the merge function (rather than the default ".x" and ".y") like this:
merge(df1, df2, by="col", suffixes=c("_df1", "_df2"))
If you do that, you'll of course also need to adjust the regular expression that reorders the columns.
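As a rough illustration of that adjustment, here is a base R sketch for the "_df1" / "_df2" suffixes used in the merge call above. The data frame merged and its columns are made up for the example; only the pattern "_df[12]$" matters.

```r
# Hypothetical result of a merge with suffixes = c("_df1", "_df2")
merged <- data.frame(id = 1:3,
                     day_df2 = 4:6,
                     city_df1 = 7:9,
                     day_df1 = 1:3,
                     city_df2 = 2:4)

# TRUE for columns that carry one of the custom suffixes
has_suffix <- grepl("_df[12]$", names(merged))

# Unsuffixed columns first, then suffixed columns sorted by name
reordered <- merged[, c(names(merged)[!has_suffix],
                        sort(names(merged)[has_suffix]))]
names(reordered)
```

Sorting by name places city_df1 next to city_df2 and day_df1 next to day_df2, because the shared base name comes before the suffix.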
This should do it
extCols <- grepl("\\.", colnames(df))
df[, c(colnames(df)[(!extCols)],
sort(colnames(df)[extCols]))]

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
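If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ accessor and (by default) no row names, so index the ID column by name and convert the sampled row numbers back to numeric:
dat[as.numeric(tapply(as.character(seq_len(nrow(dat))), dat[, "ID"], FUN=sample, 1)), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it: a single number n is treated as 1:n. E.g.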
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
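One way to sidestep that pitfall without converting to character is to sample a position within each group rather than the row numbers themselves. The following is a sketch on the question's example data; the names rows_by_id and picked are illustrative, and set.seed is there only to make the run repeatable.

```r
set.seed(42)  # only for reproducibility of the example
df <- data.frame(ID = c(1, 2, 2, 3, 4, 4),
                 Value = c(10, 5, 8, 15, 7, 9))

# Row numbers grouped by ID, e.g. ID 2 -> rows 2 and 3
rows_by_id <- split(seq_len(nrow(df)), df$ID)

# sample.int(length(i), 1) picks a position inside the group, so a group
# holding the single row number 4 cannot be mistaken for 1:4
picked <- vapply(rows_by_id,
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))

result <- df[picked, ]
result
```

Each ID appears exactly once in result, and IDs with a single row always return that row.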
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

How to combine two data frames using dplyr or other packages?

I have two data frames:
df1 = data.frame(index=c(0,3,4),n1=c(1,2,3))
df1
# index n1
# 1 0 1
# 2 3 2
# 3 4 3
df2 = data.frame(index=c(1,2,3),n2=c(4,5,6))
df2
# index n2
# 1 1 4
# 2 2 5
# 3 3 6
I want to join these to:
index n
1 0 1
2 1 4
3 2 5
4 3 8 (index 3 in two df, so add 2 and 6 in each df)
5 4 3
6 5 0 (index 5 not exists in either df, so set 0)
7 6 0 (index 6 not exists in either df, so set 0)
The given data frames are just part of large dataset. Can I do it using dplyr or other packages in R?
Using data.table (efficient for bigger datasets). I am not renaming the columns: with positional binding, rbindlist takes the column names from the first data set in the list, so the second column ends up named n. Once the data sets are bound, group by the index column (by=index) and sum the n column (list(n=sum(n))).
library(data.table)
rbindlist(list(data.frame(index=0:6,n=0), df1, df2), use.names=FALSE)[, list(n=sum(n)), by=index]
index n
#1: 0 1
#2: 1 4
#3: 2 5
#4: 3 8
#5: 4 3
#6: 5 0
#7: 6 0
Or using dplyr. Here, the column names of all the datasets should be the same, so I am changing them before binding the datasets with bind_rows (called rbind_list in older dplyr versions). If the names are different, there will be a separate column for each name. After binding the datasets, group by index, then use summarise to sum column n.
library(dplyr)
nm1 <- c("index", "n")
colnames(df1) <- colnames(df2) <- nm1
bind_rows(df1, df2, data.frame(index=0:6, n=0)) %>%
group_by(index) %>%
summarise(n=sum(n))
This is something you could do with the base functions aggregate and rbind
df1 = data.frame(index=c(0,3,4),n=c(1,2,3))
df2 = data.frame(index=c(1,2,3),n=c(4,5,6))
aggregate(n~index, rbind(df1, df2, data.frame(index=0:6, n=0)), sum)
which returns
index n
1 0 1
2 1 4
3 2 5
4 3 8
5 4 3
6 5 0
7 6 0
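If you would rather keep n1 and n2 as separate columns until the last step, a merge-based sketch in base R is another option. It assumes the full range of indexes (0 to 6 here) is known in advance; the names full and combined are illustrative.

```r
df1 <- data.frame(index = c(0, 3, 4), n1 = c(1, 2, 3))
df2 <- data.frame(index = c(1, 2, 3), n2 = c(4, 5, 6))

# Every index that should appear in the result (assumption: 0 to 6)
full <- data.frame(index = 0:6)

# Left-join both data frames onto the full index; missing entries become NA
combined <- merge(merge(full, df1, by = "index", all.x = TRUE),
                  df2, by = "index", all.x = TRUE)

# Row-wise sum, treating NA as 0
combined$n <- rowSums(combined[, c("n1", "n2")], na.rm = TRUE)
combined[, c("index", "n")]
```

Indexes present in neither data frame end up with n = 0, because rowSums with na.rm = TRUE returns 0 for a row that is all NA.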
How about
names(df1) <- c("index", "n") # set colnames of df1 to target
df3 <- rbind(df1, setNames(df2, names(df1))) # set colnames of df2 and bind the rows
df <- dplyr::arrange(df3, index) # sort by index
Cheers.

How to match 1 column to 2 columns?

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
3 NA 3
NA 1 5
5 NA 1
NA 4 2
NA 2 4
What I want to do is match column "number" from df1 to EITHER "number_a" or "number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!
Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
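To show how the combined column then feeds the lookup the question started with, here is a minimal sketch. It assumes the empty cells in the question's df2 are NA, and it rebuilds both data frames from the values implied by the question's match results.

```r
df1 <- data.frame(number = 1:5, value = NA)
# Assumption: blank cells in the question's df2 are NA
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value = c(3, 5, 1, 2, 4))

# Take number_b where number_a is missing, otherwise number_a
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# A single match against the combined column fills every row of df1
df1$value <- df2$value[match(df1$number, df2$number)]
df1
```

If a row could have both number_a and number_b filled in with different numbers, this would silently prefer number_a, so it is worth checking that assumption on the real data first.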
I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid
# Combine into new column 'number' and delete the original columns
df2 <- transform(df2, number = ifelse(is.na(number_a), number_b, number_a))[-c(1:2)]
# Combine the two data frames by the column 'number'
# (only 'number' is taken from df1, so the result has a single 'value' column)
df <- merge(df1["number"], df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1
