How to create ID (by) before merging in R?

I have two dataframes, df.o and df.m, as defined below. I need to find which observation in df.o (the dimension table) corresponds to which observation in df.m (the fact table) based on two criteria: 1) df.o$Var1 == df.m$Var1 and 2) df.o$date1 < df.m$date2 < df.o$date3, so that the correct value of df.o$oID ends up in df.m$oID (the correct values are entered manually in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID = 1:4,
                   Var1 = c("a", "a", "b", "c"),
                   date3 = c(2015, 2011, 2014, 2015),
                   date1 = c(2013, 2009, 2012, 2013),
                   stringsAsFactors = FALSE)
df.m <- data.frame(mID = 1:3,
                   Var1 = c("a", "a", "b"),
                   date2 = c(2014, 2010, 2013),
                   oID = NA,
                   CORRECToID = c(1, 2, 3),
                   points = c(5, 10, 15),
                   stringsAsFactors = FALSE)
I have tried various combinations like the code below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problems I keep encountering are the errors "replacement has a different number of rows than data" and "longer object length is not a multiple of shorter object length".

What you are looking for is called an "overlap join"; you could try the data.table::foverlaps function to achieve this.
The idea is simple:
Create the columns to overlap on (add an additional date column to df.m so it has an interval start and end)
Key both tables by these columns
Run foverlaps and select the columns you want back
library(data.table)
# give df.m a zero-width interval [date2, date4] to overlap on
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
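As an aside, newer versions of data.table (1.9.8+) also support non-equi joins, which can express the same interval lookup without the helper date4 column. A minimal sketch, assuming each row of df.m falls inside at most one df.o interval:
library(data.table)
setDT(df.o); setDT(df.m)
# for each row of df.m, look up the df.o row with the same Var1
# whose (date1, date3) interval strictly contains date2
df.m[, oID := df.o[df.m, on = .(Var1, date1 < date2, date3 > date2), x.oID]]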

Related

How to return the range of values shared between two data frames in R?

I have several data frames that share the same column names: an ID, a from and a to giving the start and end of a range, and a group label indicating which data frame each row comes from.
What I want is to find which parts of the from/to ranges of one data frame are included in the ranges of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but no luck. The logic would be: if between() returns TRUE, return the maximum of the two from values and the minimum of the two to values across the data frames.
Here are example data frames and the results I want to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(1,500,1000,1,500,1000,1,500,1000),
                to = c(400,900,1400,400,900,1400,400,900,1400),
                group = rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(300,1200,1900,1400,2800,3700,1300,2500,3500),
                to = c(500,1500,2000,2500,3000,3900,1400,2800,3900),
                group = rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),
                      from = c(300,500,1200,1400,1300),
                      to = c(400,500,1400,1400,1400),
                      group = rep("a, b",5))
I tried the function below, which returns values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
  if (length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and that the ranges within the same group (a or b) do not overlap. This is my solution. I have called your mutated 'from' and 'to' 'point_1' and 'point_2' for clarity.
You can bind the two data frames and compare the from column with the previous row's end, lag(to), to see whether the current range starts before the previous one ends. You also compare the previous lag(to) with the current to column to see whether the previous range extends past the current one.
Important: these operations do not distinguish whether the two rows being compared come from the same group (a or b). Therefore, filtering out the NAs in point_1 (the new mutated 'from' column) removes the wrongly mutated values.
Also, note that I assume that a range in 'a' cannot overlap two rows in 'b'. That doesn't happen in your 'results' table, but you should check it in your data frames.
library(dplyr)

res = rbind(a, b) %>%   # bind by rows
  arrange(ID, from) %>% # arrange by ID and starting point (from)
  group_by(ID) %>%      # perform the following operations grouped by ID
  # Here is the trick: if the ranges for the same ID and group (e.g. 1, a) do
  # not overlap, the mutated point_1 will be NA.
  mutate(point_1 = ifelse(from <= lag(to), from, NA),
         point_2 = ifelse(lag(to) >= to, to, lag(to)),
         groups = paste(lag(group), group, sep = ',')) %>%
  filter(!is.na(point_1)) %>%            # remove NAs in point_1
  select(ID, point_1, point_2, groups)   # get the result data frame
If you play a bit with the code, leaving out the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
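For comparison, the same result can be produced with the overlap-join approach from the first answer above. A minimal sketch with data.table::foverlaps, assuming touching endpoints (like 500-900 vs 300-500) should count as overlapping, as they do in the expected results:
library(data.table)
setkey(setDT(a), ID, from, to)
setkey(setDT(b), ID, from, to)
# keep only b rows that overlap an a row with the same ID,
# then take the intersection of each pair of ranges
ov <- foverlaps(b, a, nomatch = 0L)
ov[, .(ID, from = pmax(from, i.from), to = pmin(to, i.to), group = "a, b")]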

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with the table
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?
You need to use match because the data.frames have different numbers of rows:
data[data_min[match(data$cond, data_min$cond), ]$min < data$val, ]
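To see why your original attempt gives a 7 but not b 4: == recycles the shorter vector, so data$cond (length 5) is compared against data_min$cond (length 3) recycled, which is also where the "longer object length" warning comes from. A quick illustration:
data$cond == data_min$cond
# [1]  TRUE  TRUE FALSE  TRUE FALSE   (plus a recycling warning)
# data_min$cond is recycled to c("a","b","c","a","b"), so each row of data
# is compared against an essentially arbitrary row of data_min
Only row 4 happens to line up correctly, which is why only a 7 survives.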
You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4
A solution using dplyr. We can perform a join first and then filter on the condition that val exceeds min.
library(dplyr)
data2 <- data %>%
  left_join(data_min, by = "cond") %>%
  filter(val > min) %>%
  select(-min)
data2
cond val
1 b 4
2 a 7
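A data.table non-equi join (available in data.table 1.9.8+) can also express the lookup and the filter in one step; a minimal sketch:
library(data.table)
# for each row of data_min, pull the rows of data with the same cond
# and a val above that condition's minimum
setDT(data)[data_min, on = .(cond, val > min), .(cond, val = x.val), nomatch = 0L]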

merging multiple dataframes with duplicate rows in R

I'm relatively new to R for this kind of thing; I searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
  if (!exists("MergedData")){
    MergedData <- read.csv(file, skip=5)[ , c("id", "name", "value")]
    colnames(MergedData) <- c("id", "name", file)
  } else {
    temp_data <- read.csv(file, skip=5)[ , c("id", "name", "value")]
    colnames(temp_data) <- c("id", "name", file)
    MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
    rm(temp_data)
  }
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)

df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
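For instance, if you do keep the long format (using the stacked df from before the dcast step), per-group summaries are straightforward; a small illustrative sketch (mean_value is a made-up name):
df %>%
  group_by(id, name) %>%
  summarise(mean_value = mean(value))  # average of each id/name pair across files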
Addendum: dealing with the "Aggregation function missing: defaulting to length" warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to be mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id = rep(1:2, 2),
                name = rep(c("A","B"), 2),
                source.file = rep(c("001","002"), each=2),
                value = 11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
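Since the original question asked to take the value for the first entry and ignore further duplicates, a small variant of option (b) (my addition, not from the original answer) is an aggregation function that keeps only the first value as it appears in the stacked data:
dcast(df, id + name ~ source.file, value.var="value",
      fun.aggregate=function(x) x[1])
  id name 001 002
1  1    A  11  13
2  2    B  12  14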
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>%
  group_by(id, name, source.file) %>%
  mutate(counter = 1:n())
As you can see, counter has the value 1 where there is only one row for a given combination of id, name, and source.file, but values 1 and 2 for the one case where two rows share the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).

Generate fixed length random id by year as character

I would like to create a random id with a fixed length of 8.
Here is sample data:
x <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4),
                year = c(2001,2001,2001,2010,2010,2002,2002,2002,2002,2005,2005),
                x = seq(0, 0.1, 0.01))
My attempt:
x$new.id <- ave(x$id, x$year, FUN = function(x) rnorm(x,90000000,100000))
The randomly generated new.id should be the same for all rows with a given id and year.
There must be a simple solution, yet I cannot find one. Thanks.
EDIT: Or otherwise, how do I create a new 8-digit id for a given number of rows?
Desired output (the column new.id should be of class character):
id year new.id
1 1 2001 89957391
2 1 2001 89957391
3 1 2001 89957391
4 2 2010 90331214
5 2 2010 90331214
6 3 2002 89995435
7 3 2002 89995435
8 3 2002 89995435
9 3 2002 89995435
10 4 2005 90058279
11 4 2005 90058279
You were pretty close with your coding approach (using ave in that manner), though if you want to generate only one value per group, you should pass 1 to rnorm's n parameter.
The biggest problem, as I see it, is that you want to generate a random number of class integer (and then convert it to character class), while rnorm returns a double by definition.
So you could potentially do this (using round, floor or ceiling):
transform(x, new.id = ave(id,
                          year,
                          FUN = function(x) as.character(round(rnorm(1, 9e7, 1e5)))))
But it seems to me that a more appropriate way would be to use sample instead:
indx <- 1e7:(1e8 - 1)
transform(x, new.id = ave(id, year, FUN = function(x) as.character(sample(indx, 1))))
Edit: Now that I think about it a little more, it is possible that for a large enough data set you will get duplicated new.ids, because you are independently calling the sample function each time. It seems to me that the best solution is to first create a data set with one new index per id, generated by a single sample call, and then merge it back into the original data set. This operation is best done using the data.table package (because of its efficient joins and the ability to add a single column while joining); something like the following should work:
library(data.table)
y <- data.table(id = unique(x$id),
                new.id = as.character(sample(indx, length(unique(x$id)))))
setkey(setDT(x), id) ; setkey(y, id)
x[y, new.id := i.new.id]
This will update your original data set by reference (without the need for <- assignment). You can convert back to a data.frame (if you wish) by simply calling setDF(x).
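For completeness, the same one-draw-per-id idea works in base R with a named lookup vector; a minimal sketch (my variant, assuming as above that each id maps to a single year):
ids <- unique(x$id)
# sample.int draws 8-digit numbers without materializing the full 1e7:(1e8-1) vector
lookup <- setNames(as.character(sample.int(9e7, length(ids)) + (1e7 - 1L)), ids)
x$new.id <- unname(lookup[as.character(x$id)])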

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns. Column 1 (col1) contains the characters a to z and col2 has a value associated with each character.
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1, so: aa ab ac ad etc., where the first letter is in col1 and the second letter is in col2.
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1.
The first step I want to take is to transfer values from DF1$col2 to DF2$col3 (the values in DF2$col3 should correspond to the values in DF2$col1), but that's where I'm stuck. I currently have:
for(j in 1:length(DF2$col1))
{
  ## this part is to use the characters in DF2$col1 as an input
  ## to yield the output for DF2$col3
  input = c(DF2$col1)[j]
  ## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
  g = DF1[(DF1$col2 == input), "pred"]
  ## This is so that the values will fill in DF2$col3
  DF2$col3 = g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, the problem is in your last line, where you assign the value of g to the whole col3 column. You should use the j index there as well, like:
for(j in 1:length(DF2$col1))
{
  DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work out, then please also post some sample data so we can help in more detail (I do not know, but can only guess, what "pred" might be).
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
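To make match()'s behavior concrete, a tiny illustration (my addition):
match(c("b", "a", "z"), letters)
# [1]  2  1 26
# indexing DF1$col2 with these positions pulls the corresponding values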
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17
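As a quick sanity check (an addition, not part of the original answer), the recycled column agrees with the match()-based approach above, because expand.grid varies col1 fastest:
all(df2$col3 == df1$col2[match(df2$col1, df1$col1)])
# [1] TRUE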
