I have the following dataframes:
df1 <- data.frame(ProjectID=c(10,11,12,13),
Value1=c(101.25,102.85,102.95,103.15),
Value2=c(103.58,104.27,104.68,106.01))
df2 <- data.frame(ProjectID=c(10,10,11,11,11,12,13,13),
Value3=c(98.32,102.58,99.66,103.47,105.63,105.18,102.02,104.98))
I would like to create the following column df1$Value4, which pulls from df2$Value3 if the following conditions are met:
The ProjectIDs must match in df1 & df2
df2$Value3 must be in between df1$Value1 & df1$Value2
If the above 2 conditions are not met, input ""
I'm interested in using loops and if statements to accomplish this if possible. Any help is most appreciated.
The output should look like this:
df1 <- data.frame(ProjectID=c(10,11,12,13),
Value1=c(101.25,102.85,102.95,103.15),
Value2=c(103.58,104.27,104.68,106.01),
Value4=c(102.58,103.47,"",104.98))
This will merge the two data.frame and then remove the rows where Value3 is not between Value1 and Value2. The second merge will add back rows from df1 that do not satisfy the previous condition. And finally the last command will rename the column.
df3 <- merge(df1, df2)
df3 <- df3[df3$Value1 < df3$Value3 & df3$Value3 < df3$Value2, ]
df3 <- merge(df1, df3, all.x = TRUE)
colnames(df3)[colnames(df3) == "Value3"] <- "Value4"
df3
ProjectID Value1 Value2 Value4
1 10 101.25 103.58 102.58
2 11 102.85 104.27 103.47
3 12 102.95 104.68 NA
4 13 103.15 106.01 104.98
Doing it by loops and logic statements makes the code a bit long. I am sure dplyr statements could shorten this up. Additionally, I am not sure what you plan on doing with the output, but R will convert the Value4 field to a character data type because of the "". If you wish to do any kind of data manipulation afterwards, I would suggest using NAs instead of "". To do this, just replace the "" with NA in the code below. Anyway, the code you are looking for is:
df1$Value4 <- ""
for (i in 1:nrow(df1)) {
match_df2 <- df2$Value3[df2$ProjectID == df1$ProjectID[i]]
btwn <- c(df1$Value1[i], df1$Value2[i])
btwn <- sort(btwn)
match_v12 <- c()
for (j in 1:length(match_df2)) {
if (match_df2[j] >= btwn[1] & match_df2[j] <= btwn[2]) {
match_v12 <- rbind(match_v12, match_df2[j])
}
}
if (length(match_v12) == 0) {
df1$Value4[i] <- ""
} else {
df1$Value4[i] <- max(match_v12)
}
}
First create the empty Value4 field in df1 and populate it with an empty character string. The first loop statement will loop through each projectID in df1 and determine the matching location of ProjectIDs in df2. Those matching locations are stored in match_df2. Next, Value1 and Value2 are put into a vector called btwn to allow for sorting. In the example you gave, Value1 is always less than Value2, but I am not sure if that is always the case.
The next for loop checks to see if the matched Value3 values are in between Value1 and Value2. If Value3 is in between, it adds Value3 to a vector called match_v12. If multiple matches are found for a single ProjectID, than I assumed the max of the matched Value3s. You can change this to anything you like, I just put something down. Finally, if no matches are found, produce "" (This last part is redundant, but overall, not bad code).
Hope this helps
Related
In R, I want to get unique, closest matches for the rows in a data.table which are identified by unique ids based on values in two columns. Here, I provide a toy example and the code I'm using to achieve this.
dt <- data.table(id = letters,
value_1 = as.integer(runif(26,1,20)),
value_2 = as.integer(runif(26,1,10)))
pairs <- data.table()
while(nrow(dt) >= 2){
k <- dt[c(1)]
m <- dt[-1]
t <- m[k, roll = "nearest",on = .(value_1,value_2)]
pairs <- rbind(pairs,t)
dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[,-c(2,3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
id i.id
1 NA a
2 NA b
3 m c
4 v d
5 y e
6 i f
...
Is there a way to do this without the loop. I intend to implement this on a data.table with more than 20 million observations? Clearly, using a loop is extremely inefficient. I was wondering if the roll join command can be run on a copy of the main data.table by introducing an exception condition -- so as not to match the same ids with each other. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest",on = .(value_1,value_2)]
Without the exception, this command merely generates matches of ids with themselves. Also, this does not ensure unique matches.
Thanks!
I am comparing two csv files using R/Rstudio and I would like to compare them line by line, but in a specific order based on their columns. If my data looks like:
first <-read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
D123, 12000, another, 2.0, NA")
second csv:
second <- read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
E456, 45678, third, 2.0, ")
I'd like to have a for loop that looks something like:
for line in csv1:
if number exists in csv2:
if csv1$name == csv2$name:
if csv1$description == csv$description:
if csv1$manufacturer == csv2$manufacturer:
break
else:
add line to csv called changed, append a value for "changed" column to manufacturer
else:
add line to csv called changed, append a value for "changed" column to description
and so on
so that the output then looks like:
name number description version manufacturer changed
A123 12345 first piece 1.0 fakemanufacturer number
B107 00001 second 1.0 abcde parts no change
C204 20000 third newmanufacturer number, manufacturer
D123 12000 another 2.0 removed
E456 45678 third 2.0 added
and if at any point in this loop something doesn't match, I'd like to know where the mismatch was. The lines can match by number OR description. for example, given the 2 lines above, I would be able to tell that number changed between the two csv files. Thanks in advance for any help!!
It should be something like this, but as you have provided no data to test it I cannot vouch for my code:
cmpDF <- function(DF1, DF2){
DF2 <- DF2[DF2$number %in% DF1$number,] #keep only the rows of DF2 that are
#also in DF1
retChar <- character(nrow(DF1))
names(retChar) <- DF1$number #call the retChar vector with the number
# to be able to update it later
DF1 <- DF1[DF1$number %in% DF2$number,]#keep only the rows of DF1 that are
#also in DF2
# sort rows to make sure that equal rows have the same row number:
DF1 <- DF1[order(DF1$number),]
DF2 <- DF2[order(DF2$number),]
equals <- DF1 == DF2
identical <- rowSums(DF1 == DF2) == ncol(DF1) #here all elements are the same
retChar[as.character(DF1$number[identical])] <- "no change"
for(i in 1:ncol(DF1)){
if(colnames(DF1)[i] == "number") next
different <- !equals[,i]
retChar[as.character(DF1$number[different])] <- ifelse(nchar(retChar[as.character(DF1$number[different])]),
paste0(retChar[as.character(DF1$number[different])], colnames(DF1)[i], sep = ", "),
colnames(DF1)[i])
}
retChar[nchar(retChar) == 0] <- "number not in DF2"
return(retChar)
}
I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
colnames(WaFramesCosts)),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
colnames(WaFramesCosts)),drop=FALSE]
})
Note, in that last example: notice the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). Defensive programming, if you think it at all possible that your filtering will return a single-column, make sure you do not inadvertently convert to a vector by appending ,drop=FALSE to your bracket-subsetting.
Based on your description, this is an example of using library dplyr to achieve combining a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns (Providing your data in a reproducible example would be better)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202
I have a dataframe with registration numbers in one column and correct registration number in another
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b)
I wish to update row 1 as it was entered incorrectly, row 2 is blank so I would like to update the field. Row 3 has a different number, so I wish to keep this number, but make a new entry for this row (in another program, I just need to know that it needs to be inserted).
How do I produce this dataframe?
c <- c("update", "update", "insert")
df2 <- data.frame (a,b,c)
I have tried grepl and str_detect and also considered regex expressions with the grepl - ie check if the 4 number combination in column a is in column b but as yet have been unsuccessful
You can do this in this way:
df <- data.frame(a,b,stringsAsFactors = F)
for (i in seq(1,nrow(df))){
if (df$a[i] == '' || length(agrep(df$a[i],df$b[i])) > 0)
df$c[i] <- 'update'
else
df$c[i] <- 'insert'
}
df
## a b c
##1 0c1234 Oc1234 update
##2 Oc5678 update
##3 2468O Oc9123 insert
You can do something like this:
df$c <- ifelse(a == '', 'update', 'insert')
Your output will be as follows (desired df2 in your question):
a b c
1 0c1234 Oc1234 insert
2 Oc5678 update
3 2468O Oc9123 insert
This will only work, of course, if your original data frame has 'transactions' in proper order.
I have two R data frame with differing dimensions. However but data frames have an id column
df1:
nrow(df1)=22308
c1 c2 c3 pattern1.match
ENSMUSG00000000001_at 10.175115 10.175423 10.109524 0
ENSMUSG00000000003_at 2.133651 2.144733 2.106649 0
ENSMUSG00000000028_at 5.713781 5.714827 5.701983 0
df2:
Genes Pattern.Count
ENSMUSG00000000276 ENSMUSG00000000276_at 1
ENSMUSG00000000876 ENSMUSG00000000876_at 1
ENSMUSG00000001065 ENSMUSG00000001065_at 1
ENSMUSG00000001098 ENSMUSG00000001098_at 1
nrow(df2)=425
I would like to loop through df2, and find all genes that have pattern.count=1 and check it in df1$pattern1.match column.
Basically I would like to overwrite the fields GENES AND pattern1.match with the df2$Genes and df2$Pattern.Count. All the elements from df2$Pattern.Count are equal to one.
I wrote this function, but R freezes while looping through all these rows.
idcol <- ncol(df1)
return.frame.matches <- function(df1, df2, idcol) {
for (i in 1:nrow(df1)) {
for (j in 1:nrow(df2))
if(df1[i, 1] == df2[j, 1]) {
df1[i, idcol] = 1
break
}
}
return (df1)
}
Is there another way of doing that without almost killing the computer?
I'm not sure I get exactly what you are doing, but the following should at least get you closer.
The first column of df1 doesn't seem to have a name, are they rownames?
If so,
df1$Genes <- rownames(df1)
Then you could then do a merge to create a new dataframe with the genes you require:
merge(df1,subset(df2,Pattern.Count==1))
Note they are matching on the common column Genes. I'm not sure what you want to do with the pattern1.match column, but a subset on the df1 part of merge can incorporate conditions on that.
Edit
Going by the extra information in the comments,
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
should achieve what you are looking for.
Your sample data is not enough to play around with, but here is what I would start with:
dfm <- merge( df1, df2, by = idcol, all = TRUE )
dfm_pc <- subset( dfm, Pattern.Count == 1 )
I took the "idcol" from your code, don't see it in the data.