Line by line csv compare using if statements in R

I am comparing two csv files using R/Rstudio and I would like to compare them line by line, but in a specific order based on their columns. If my data looks like:
first <-read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
D123, 12000, another, 2.0, NA")
second csv:
second <- read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
E456, 45678, third, 2.0, ")
I'd like to have a for loop that looks something like:
for line in csv1:
    if number exists in csv2:
        if csv1$name == csv2$name:
            if csv1$description == csv2$description:
                if csv1$manufacturer == csv2$manufacturer:
                    break
                else:
                    add line to csv called changed, append a value for "changed" column to manufacturer
            else:
                add line to csv called changed, append a value for "changed" column to description
and so on
so that the output then looks like:
name  number  description  version  manufacturer      changed
A123  12345   first piece  1.0      fakemanufacturer  number
B107  00001   second       1.0      abcde parts       no change
C204  20000   third                 newmanufacturer   number, manufacturer
D123  12000   another      2.0                        removed
E456  45678   third        2.0                        added
and if at any point in this loop something doesn't match, I'd like to know where the mismatch was. The lines can match by number OR description. For example, given the 2 lines above, I would be able to tell that number changed between the two csv files. Thanks in advance for any help!!

It should be something like this, but as you have provided no data to test it I cannot vouch for my code:
cmpDF <- function(DF1, DF2){
  # one result entry per row of DF1, named by number so it can be updated later
  retChar <- character(nrow(DF1))
  names(retChar) <- DF1$number
  # keep only the rows whose number occurs in both data frames
  DF2 <- DF2[DF2$number %in% DF1$number,]
  DF1 <- DF1[DF1$number %in% DF2$number,]
  # sort rows to make sure that equal rows have the same row number:
  DF1 <- DF1[order(DF1$number),]
  DF2 <- DF2[order(DF2$number),]
  # cell-wise comparison; NA in both cells counts as equal, NA in only one as different
  equals <- (DF1 == DF2) | (is.na(DF1) & is.na(DF2))
  equals[is.na(equals)] <- FALSE
  identical <- rowSums(equals) == ncol(DF1) #here all elements are the same
  retChar[as.character(DF1$number[identical])] <- "no change"
  for(i in 1:ncol(DF1)){
    if(colnames(DF1)[i] == "number") next
    different <- !equals[,i]
    key <- as.character(DF1$number[different])
    # append the column name, comma-separated if something is already recorded
    # (paste(), not paste0(): paste0() has no sep argument)
    retChar[key] <- ifelse(nchar(retChar[key]) > 0,
                           paste(retChar[key], colnames(DF1)[i], sep = ", "),
                           colnames(DF1)[i])
  }
  retChar[nchar(retChar) == 0] <- "number not in DF2"
  return(retChar)
}
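For the "added" and "removed" rows that the question also asks about, a full outer merge on the key is one possible starting point. The following is only a sketch under stated assumptions: compare_by_key is a hypothetical helper name, the key column is assumed to be unique in each data frame, and the small old/new data frames are made up for illustration rather than taken from the question.

```r
# Hypothetical helper (not part of the answer above): classify each key as
# "added", "removed", "no change", or list the columns that differ.
compare_by_key <- function(old, new, key = "number") {
  cols <- setdiff(intersect(names(old), names(new)), key)
  both <- merge(old, new, by = key, all = TRUE, suffixes = c(".old", ".new"))
  in_old <- both[[key]] %in% old[[key]]
  in_new <- both[[key]] %in% new[[key]]
  changed <- vapply(seq_len(nrow(both)), function(i) {
    if (!in_new[i]) return("removed")
    if (!in_old[i]) return("added")
    differs <- vapply(cols, function(col) {
      a <- both[[paste0(col, ".old")]][i]
      b <- both[[paste0(col, ".new")]][i]
      # NA on both sides counts as equal; NA on one side counts as a change
      !identical(is.na(a), is.na(b)) || (!is.na(a) && a != b)
    }, logical(1))
    if (any(differs)) paste(cols[differs], collapse = ", ") else "no change"
  }, character(1))
  data.frame(both[key], changed = changed, stringsAsFactors = FALSE)
}

# Made-up example data
old <- data.frame(number = c("12345", "00001", "12000"),
                  name = c("A123", "B107", "D123"),
                  manufacturer = c("fakemanufacturer", "abcde parts", NA),
                  stringsAsFactors = FALSE)
new <- data.frame(number = c("12345", "00001", "45678"),
                  name = c("A123", "B108", "E456"),
                  manufacturer = c("fakemanufacturer", "abcde parts", NA),
                  stringsAsFactors = FALSE)
res <- compare_by_key(old, new)
```

Matching by description instead of number would only need a different key argument, again assuming the key values are unique.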

Related

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
                           colnames(WaFramesCosts[[i]])),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
                                colnames(WaFramesCosts[[i]])),drop=FALSE]
})
Note, in that last example, the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). As a matter of defensive programming, if you think it at all possible that your filtering will return a single column, make sure you do not inadvertently convert to a vector by appending ,drop=FALSE to your bracket-subsetting.
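To make the pattern and drop=FALSE points concrete, here is a small sketch. The column names in cols and the two data frames in frames are hypothetical stand-ins, since the real WaFramesCosts was not posted.

```r
# Hypothetical column names, used only to illustrate the three patterns
cols <- c("Cost_Center", "Total_Cost", "Department", "Domestic_Min")
contains <- grep("(Cost|Dom)",   cols, value = TRUE)  # contains "Cost" or "Dom"
starts   <- grep("(^Cost|Dom)",  cols, value = TRUE)  # starts with "Cost" or contains "Dom"
ends     <- grep("(Cost|ment$)", cols, value = TRUE)  # contains "Cost" or ends with "ment"

# Toy stand-in for WaFramesCosts: the second frame lacks "Department"
frames <- list(
  a = data.frame(Cost_Center = 1:2, Department = c("x", "y"), Other = 3:4),
  b = data.frame(Cost_Center = 5:6, Other = 7:8)
)
wanted <- c("Cost_Center", "Department")
subsets <- lapply(frames, function(df) {
  # intersect() keeps only the wanted columns that exist in this frame;
  # drop = FALSE keeps a data.frame even when a single column is left
  df[, intersect(wanted, colnames(df)), drop = FALSE]
})
```

Here subsets$b has one column and is still a data.frame, which is what the drop=FALSE advice is protecting.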
Based on your description, this is an example of using the dplyr package to combine a list of data frames and keep a given set of columns. It doesn't require all data frames to have identical columns. (Providing your data in a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202
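If dplyr is not available, roughly the same result for this test data can be sketched in base R. This is a swapped-in technique, not the answer's own method, and unlike bind_rows it assumes every data frame in the list actually contains all of the selected columns.

```r
# same test data as above
df1 <- read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
dfs <- list(df1, df2)
cols <- c("c1", "c3")
# select the columns in each data frame first, then stack the pieces
result_base <- do.call(rbind, lapply(dfs, function(d) d[cols]))
```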

Relating dataframes with If statements and loops

I have the following dataframes:
df1 <- data.frame(ProjectID=c(10,11,12,13),
Value1=c(101.25,102.85,102.95,103.15),
Value2=c(103.58,104.27,104.68,106.01))
df2 <- data.frame(ProjectID=c(10,10,11,11,11,12,13,13),
Value3=c(98.32,102.58,99.66,103.47,105.63,105.18,102.02,104.98))
I would like to create the following column df1$Value4, which pulls from df2$Value3 if the following conditions are met:
The ProjectIDs must match in df1 & df2
df2$Value3 must be in between df1$Value1 & df1$Value2
If the above 2 conditions are not met, input ""
I'm interested in using loops and if statements to accomplish this if possible. Any help is most appreciated.
The output should look like this:
df1 <- data.frame(ProjectID=c(10,11,12,13),
Value1=c(101.25,102.85,102.95,103.15),
Value2=c(103.58,104.27,104.68,106.01),
Value4=c(102.58,103.47,"",104.98))
This will merge the two data frames and then remove the rows where Value3 is not between Value1 and Value2. The second merge will add back the rows from df1 that do not satisfy the previous condition. Finally, the last command will rename the column.
df3 <- merge(df1, df2)
df3 <- df3[df3$Value1 < df3$Value3 & df3$Value3 < df3$Value2, ]
df3 <- merge(df1, df3, all.x = TRUE)
colnames(df3)[colnames(df3) == "Value3"] <- "Value4"
df3
ProjectID Value1 Value2 Value4
1 10 101.25 103.58 102.58
2 11 102.85 104.27 103.47
3 12 102.95 104.68 NA
4 13 103.15 106.01 104.98
Doing it by loops and logic statements makes the code a bit long. I am sure dplyr statements could shorten this up. Additionally, I am not sure what you plan on doing with the output, but R will convert the Value4 field to a character data type because of the "". If you wish to do any kind of data manipulation afterwards, I would suggest using NAs instead of "". To do this, just replace the "" with NA in the code below. Anyway, the code you are looking for is:
df1$Value4 <- ""
for (i in seq_len(nrow(df1))) {
  match_df2 <- df2$Value3[df2$ProjectID == df1$ProjectID[i]]
  btwn <- c(df1$Value1[i], df1$Value2[i])
  btwn <- sort(btwn)
  match_v12 <- c()
  for (j in seq_along(match_df2)) {  # seq_along() is safe when there are no matches
    if (match_df2[j] >= btwn[1] && match_df2[j] <= btwn[2]) {
      match_v12 <- c(match_v12, match_df2[j])
    }
  }
  if (length(match_v12) == 0) {
    df1$Value4[i] <- ""
  } else {
    df1$Value4[i] <- max(match_v12)
  }
}
First create the empty Value4 field in df1 and populate it with an empty character string. The first loop statement will loop through each projectID in df1 and determine the matching location of ProjectIDs in df2. Those matching locations are stored in match_df2. Next, Value1 and Value2 are put into a vector called btwn to allow for sorting. In the example you gave, Value1 is always less than Value2, but I am not sure if that is always the case.
The next for loop checks whether the matched Value3 values are in between Value1 and Value2. If a Value3 is in between, it is added to a vector called match_v12. If multiple matches are found for a single ProjectID, then I take the max of the matched Value3s. You can change this to anything you like; I just put something down. Finally, if no matches are found, produce "" (this last part is redundant, because Value4 already starts out as "").
Hope this helps
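If NA is acceptable in place of "", the same per-row logic can be sketched without the inner loop. This is only an illustration, not a replacement for either answer: it keeps the same arbitrary choice of taking the max when several Value3 values qualify, and it keeps Value4 numeric.

```r
df1 <- data.frame(ProjectID = c(10, 11, 12, 13),
                  Value1 = c(101.25, 102.85, 102.95, 103.15),
                  Value2 = c(103.58, 104.27, 104.68, 106.01))
df2 <- data.frame(ProjectID = c(10, 10, 11, 11, 11, 12, 13, 13),
                  Value3 = c(98.32, 102.58, 99.66, 103.47, 105.63, 105.18, 102.02, 104.98))

df1$Value4 <- vapply(seq_len(nrow(df1)), function(i) {
  # all Value3 entries for this project
  candidates <- df2$Value3[df2$ProjectID == df1$ProjectID[i]]
  lo <- min(df1$Value1[i], df1$Value2[i])
  hi <- max(df1$Value1[i], df1$Value2[i])
  hits <- candidates[candidates >= lo & candidates <= hi]
  # NA when nothing falls inside the interval, otherwise the largest hit
  if (length(hits) == 0) NA_real_ else max(hits)
}, numeric(1))
```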

Selecting rows with partial match/mismatch in 2 columns

I am looking how to select rows in R which have partial matches or mismatches in two columns. My dataset (as an example) looks like this:
df = data.frame(plot1 = c("ABX_15", "BHE_05", "ABX_15"),
plot2 = c("AB6_15", "JKS_05", "JKS_05"),
value = c(0.4, 0.45, 0.34))
I want to create subsets containing only "matched" pairs of plot1 and plot2 for _05 and _15. So that would be either the first row or the second row in the example. I also need to select only rows which have a mismatch in plot1 and plot2; that would be row number three. Match and mismatch refer only to the second part of the plot name.
I've found solutions for partial selecting and for selecting certain rows according to columns but I could not combine both.
I am expecting 3 subsets of the dataset: one with matching _05, another with matching _15, and one with mismatches.
Another solution is using sub to strip everything before (and including) the underscore from the two variables and then compare those sub statements with == to create a logical index vector:
idx <- sub('.*\\_', '', df$plot1) == sub('.*\\_', '', df$plot2)
Now you can subset df with that vector. df[idx,] gives:
plot1 plot2 value
1 ABX_15 AB6_15 0.40
2 BHE_05 JKS_05 0.45
To get the mismatches, you can use df[!idx,]:
plot1 plot2 value
3 ABX_15 JKS_05 0.34
Per the update of your requirements, you can create indexes for matching on 15 or 05 as follows:
idx15 <- sub('.*\\_', '', df$plot1) == '15' & sub('.*\\_', '', df$plot2) == '15'
idx05 <- sub('.*\\_', '', df$plot1) == '05' & sub('.*\\_', '', df$plot2) == '05'
These can then be used to subset df as shown above (e.g. df[idx15,]). To get the mismatches: df[!idx05 & !idx15,] (or use the method from above).
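The three requested subsets can also be produced in one step by turning the comparison into a grouping label and calling split(). A minimal sketch, where the label names match_05, match_15 and mismatch are my own choice:

```r
df <- data.frame(plot1 = c("ABX_15", "BHE_05", "ABX_15"),
                 plot2 = c("AB6_15", "JKS_05", "JKS_05"),
                 value = c(0.4, 0.45, 0.34),
                 stringsAsFactors = FALSE)
# suffix after the last underscore
s1 <- sub(".*_", "", df$plot1)
s2 <- sub(".*_", "", df$plot2)
# label each row by its matched suffix, or as a mismatch
group <- ifelse(s1 == s2, paste0("match_", s1), "mismatch")
subsets <- split(df, group)
```

subsets is then a named list of data frames, one per label, so subsets$mismatch holds the third row.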
First split the names by the pattern _. I'm using here the function str_split from the stringr package (load it with library(stringr)). The result is a list. You can now extract the second part of the name. After unlisting you can add the result to your dataframe df:
df$p1 <- unlist(lapply(str_split(df$plot1, "_"), "[", 2))
df$p2 <- unlist(lapply(str_split(df$plot2, "_"), "[", 2))
For a base R solution you can use the strsplit function instead. Note that you have to make a character vector out of it.
unlist(lapply(strsplit(as.character(df$plot1), "_"), "[", 2))
and the result:
df[df$p1 == df$p2, ]
plot1 plot2 value p1 p2
1 ABX_15 AB6_15 0.40 15 15
2 BHE_05 JKS_05 0.45 05 05
For the mismatch use:
df[df$p1 != df$p2, ]
plot1 plot2 value p1 p2
3 ABX_15 JKS_05 0.34 15 05

check if column contains part of another column in r

I have a dataframe with registration numbers in one column and correct registration number in another
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b)
I wish to update row 1 as it was entered incorrectly, row 2 is blank so I would like to update the field. Row 3 has a different number, so I wish to keep this number, but make a new entry for this row (in another program, I just need to know that it needs to be inserted).
How do I produce this dataframe?
c <- c("update", "update", "insert")
df2 <- data.frame (a,b,c)
I have tried grepl and str_detect and also considered regex expressions with the grepl - ie check if the 4 number combination in column a is in column b but as yet have been unsuccessful
You can do this in this way:
df <- data.frame(a,b,stringsAsFactors = F)
for (i in seq(1,nrow(df))){
if (df$a[i] == '' || length(agrep(df$a[i],df$b[i])) > 0)
df$c[i] <- 'update'
else
df$c[i] <- 'insert'
}
df
## a b c
##1 0c1234 Oc1234 update
##2 Oc5678 update
##3 2468O Oc9123 insert
You can do something like this:
df$c <- ifelse(a == '', 'update', 'insert')
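Your output will be as follows. Note that this check only looks for blank values, so row 1 comes out as insert rather than the update shown in the desired df2 in your question: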
a b c
1 0c1234 Oc1234 insert
2 Oc5678 update
3 2468O Oc9123 insert
This will only work, of course, if your original data frame has 'transactions' in proper order.
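A vectorised variant that combines the blank check with an approximate comparison can be sketched with adist(), which computes edit distances. This swaps in a different function from the agrep used in the loop above, and the threshold of one edit is my assumption about what counts as "entered incorrectly".

```r
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b, stringsAsFactors = FALSE)
# adist() returns a matrix of edit distances; the diagonal pairs a[i] with b[i]
d <- diag(adist(df$a, df$b))
# blank entries and near-misses (at most one edit, an assumed threshold) are updates
df$c <- ifelse(df$a == "" | d <= 1, "update", "insert")
```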

Crafty ways to make super efficient R vector processing?

I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run", so I figured it'd be a good opportunity to spend the time I would otherwise be running my program on making a super efficient one :P
Basically, I have a input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value, and the third column of the value line. I have implemented it using an ultra slow for loop that uses strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode="character", length=length(vallines))
for (i in vallines) {
line_ra <- strsplit(data[i], "\\s+")[[1]]
... and so on using a bunch of str splits and pastes to reformat
out[i] <- paste(node, val, value, sep="\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't split correctly is the names of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of each line containing a node and vallines are the numbers of each line containing a val. The return vector should have the same number of elements as vallines. The goal is to find the maximum nodelines that is less than the line number of vallines for each vallines. Any thoughts?
I suggest using data.table package - it has very fast string split function tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, IsNode := substr(data, 1, 1) == '#']
dt[, NodeId := cumsum(IsNode)]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
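The same cumsum idea also works in base R and answers the EDIT in the question: the running count of node lines tells each value line which node it belongs to. The sketch below assumes that node names always look like NODE<digits>_length_<digits> followed by an underscore, and that value lines have at least three whitespace-separated fields. The sample vector is typed in by hand from the question.

```r
data <- c("#NODE1_length_17_2309482.2394832.2",
          "val1 5 18",
          "val2 6 21",
          "#NODE2_length_1298_23948349.23984.2",
          "val1 2 293")
is_node <- startsWith(data, "#")
# running count of node lines = index of the most recent node for every line
node_id <- cumsum(is_node)
# assumed name pattern: keep "NODEn_length_m" and drop the rest of the line
node_name <- sub("^#(NODE[0-9]+_length_[0-9]+)_.*$", "\\1", data[is_node])
fields <- strsplit(data[!is_node], "\\s+")
out <- paste(node_name[node_id[!is_node]],
             vapply(fields, `[`, character(1), 1),
             vapply(fields, `[`, character(1), 3),
             sep = "\t")
```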
