I have a dataframe with registration numbers in one column and correct registration number in another
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b)
I wish to update row 1 as it was entered incorrectly, row 2 is blank so I would like to update the field. Row 3 has a different number, so I wish to keep this number, but make a new entry for this row (in another program, I just need to know that it needs to be inserted).
How do I produce this dataframe?
c <- c("update", "update", "insert")
df2 <- data.frame (a,b,c)
I have tried grepl and str_detect and also considered regex expressions with the grepl - ie check if the 4 number combination in column a is in column b but as yet have been unsuccessful
You can do this in this way:
df <- data.frame(a,b,stringsAsFactors = F)
for (i in seq(1,nrow(df))){
if (df$a[i] == '' || length(agrep(df$a[i],df$b[i])) > 0)
df$c[i] <- 'update'
else
df$c[i] <- 'insert'
}
df
## a b c
##1 0c1234 Oc1234 update
##2 Oc5678 update
##3 2468O Oc9123 insert
You can do something like this:
df$c <- ifelse(a == '', 'update', 'insert')
Your output will be as follows (desired df2 in your question):
a b c
1 0c1234 Oc1234 insert
2 Oc5678 update
3 2468O Oc9123 insert
This will only work, of course, if your original data frame has 'transactions' in proper order.
Related
In R, I want to get unique, closest matches for the rows in a data.table which are identified by unique ids based on values in two columns. Here, I provide a toy example and the code I'm using to achieve this.
dt <- data.table(id = letters,
value_1 = as.integer(runif(26,1,20)),
value_2 = as.integer(runif(26,1,10)))
pairs <- data.table()
while(nrow(dt) >= 2){
k <- dt[c(1)]
m <- dt[-1]
t <- m[k, roll = "nearest",on = .(value_1,value_2)]
pairs <- rbind(pairs,t)
dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[,-c(2,3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
id i.id
1 NA a
2 NA b
3 m c
4 v d
5 y e
6 i f
...
Is there a way to do this without the loop. I intend to implement this on a data.table with more than 20 million observations? Clearly, using a loop is extremely inefficient. I was wondering if the roll join command can be run on a copy of the main data.table by introducing an exception condition -- so as not to match the same ids with each other. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest",on = .(value_1,value_2)]
Without the exception, this command merely generates matches of ids with themselves. Also, this does not ensure unique matches.
Thanks!
I would like to make a new column in my data frame by using a conditional statement that would say "If Column_y contains Column_x then 1 else 0"
For example:
Event Name Winner Loser New Column
1 James James,Bob John,Steve 1
1 Bob James,Bob John,Steve 1
1 John James,Bob John,Steve 0
1 Steve James,Bob John,Steve 0
I want to have New Column<- "If Winner contains Name then 1 else 0"
Keep in mind this is for 100,000 rows and probably 700 unique names. When I try things like
df$NewColumn<-ifelse(grepl(df$Name,df$Winner)==TRUE,1,0)
or variations I get the "pattern has a length > 1" error.
I think you just want to compare the Name column against the Winner column:
df$NewColumn <- ifelse(df$Name == df$Winner, 1, 0)
Note that because df$Name == df$Winner is actually a boolean expression, you might also be able to simplify to:
df$NewColumn <- df$Name == df$Winner
In your example, exact string matching works. But I am assuming it does not hold true for your entire data.
Implementing the contains condition would be something like this:
library(dplyr)
library(purrr)
df = df %>%
dplyr::mutate(NewColumn = purrr::map2_dbl(.x=Winner,.y=Name,~ifelse(grepl(.y,.x),1,0)))
Adding an alternate solution with stringr:
df = df %>%
dplyr::mutate(NewColumn=ifelse(str_detect(Winner,Name),1,0))
Let me know if this works.
P.S.: str_detect is faster.
I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
colnames(WaFramesCosts)),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts)), function(i) {
WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
colnames(WaFramesCosts)),drop=FALSE]
})
Note, in that last example: notice the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). Defensive programming, if you think it at all possible that your filtering will return a single-column, make sure you do not inadvertently convert to a vector by appending ,drop=FALSE to your bracket-subsetting.
Based on your description, this is an example of using library dplyr to achieve combining a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns (Providing your data in a reproducible example would be better)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202
I have an example filter table as below and a big source data table. I need to do the merge using these two tables. If no column in the filter table contains ALL, use three columns to do the the merge (using Tran=1001, Acct=1 & Co=a to do the inner join with the data table).If one of them, ie Tran has ALL, use the remaining two columns to do the merge (using Acct=3 & Co=c to do the join). If two of them, ie Tran and Acct, have All, use the remaining one column to do the merge (using Co=b to do the join).
The real question is the number of columns is uncertain.
Can anyone help me with this?
Tran Acct Co
1001 1 a
1002 ALL ALL
ALL ALL b
ALL 4 ALL
1003 2 ALL
ALL 3 c
1004 ALL d
You're going to have to write a series of conditional statements using if, elseif and else. I'll use the %in% operator to check for this. The %in% operator returns a series of boolean values. The easiest way is to show through example:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 3, 4, 5, 6)
> x %in% y
[1] FALSE TRUE TRUE TRUE TRUE
Notice that it returns FALSE for the first value as the value of 1 in x is not in y. You can do the same for the "ALL" value in your data set. I assume you are going row by row as you seemed to imply in your question. Let me know if you need to check the whole column first (you can use the any function for that case). Here is an example of your first condition:
# Assume that df is your data.frame of data.
for (i in 1:length(df$Tran)) {
if (!("All" %in% df$Tran[i]) & !("ALL" %in% df$Acct[i]) & !("All" %in% df$Co[i])) {
# Do your merge here
}
if ( [Put your next condition here] ) {
# Do the appropriate merge for that condition
}
...
Note that I used the "!" operator to get the inverse of whatever %in% returns because you want it to be the case where ALL is NOT in the row. I realize now that you could have just done All != df$Tran[1] since you are going row by row, but %in% might be more useful if you end up going for the whole column.
Hope this helps!
Editing in a new method now that it's more clear what the need is. So we have to find the number of "ALL" values in each row and then merge a certain way depending on the number of them. There are a lot of methods, but here's one I like:
> test <- data.frame(a = "ALL", b = 2, c = "ALL", d = 3, e = "ALL")
> test
a b c d e
1 ALL 2 ALL 3 ALL
> table(test[1, ] == "ALL")["TRUE"]
TRUE
3
Basically, I'm looking at the first row, and getting the number that return TRUE when asked if it contains the string "ALL". From here you can set conditionals on this number. To automate over the entire data frame, throw it in a for loop and set "1" equal to "i" or whatever you sequence variable is.
To get which rows have "ALL" in it (which in converse would tell which rows do not have "ALL" in it as well), you can use grep on each row. Here's a short example:
> # Initializing a sample data frame.
> df <- data.frame(a = "1", b = "ALL", c = "ALL", d = "5", e = "ALL")
> print(df)
a b c d e
1 1 ALL ALL 5 ALL
>
> # Finding the column numbers that have "ALL" in it using grep.
> places <- grep("ALL", df[1, ])
> print(places)
[1] 2 3 5
>
> # Each number corresponds to the order of the columns in the data frame and can be returned as such.
> nameCols <- names(df)[places]
> print(nameCols)
[1] "b" "c" "e"
>
> # Likewise, you can find what columns did not have "ALL" in it by doing the opposite.
> nameColsNOT <- names(df)[-places]
> print(nameColsNOT)
[1] "a" "d"
Iterate this method through a loop for each row in your data frame and use the conditional method I outlined above. Please note that this requires your columns to all be of "character" class, which I assume is the case already.
How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "Lastname", "fac")]
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind=function(df){
temp1=names(df)
temp2=seq(1,length(temp1))
temp3=data.frame(temp1,temp2)
names(temp3)=c("VAR","COL")
return(temp3)
rm(temp1,temp2,temp3)
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind, I never use namesind but thought it was a better name originally) Then if you want insert your column in say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11],newcolumn, bob[,12:20]
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)