R: merging on a partial match

There are a lot of answers about this, but I didn't find one that covers the problem I am dealing with.
I have 2 dataframes:
df1:
df2:
setA <- read.table("df1.txt",sep="\t", header=TRUE)
setB <- read.table("df2.txt",sep="\t", header=TRUE)
So, I want the matching rows by column value:
library(data.table)
setC <-merge(setA, setB, by.x = "name", by.y = "name", all.x = FALSE)
And I get this output:
df3:
Because in df1 I also have the value 1, but separated with a ";". How can I get the desired output?
Thanks!!

In future please apply the function dput(df1) and dput(df2) and copy and paste the output from the console into your question.
Base R solution to two part question:
# First unstack the 1;7 row into two separate rows:
name_split <- strsplit(df1$name, ";")
# If the values of last vector uniquely identify each row in the dataframe:
df_ro <- data.frame(name = unlist(name_split),
                    last = rep(df1$last, sapply(name_split, length)),
                    stringsAsFactors = FALSE)
# Left join to achieve the same result as first solution
# without specifically naming each vector:
df1_ro <- merge(df1[,names(df1) != "name"], df_ro, by = "last", all.x = TRUE)
# Then perform an inner join preventing a name space collision:
df3 <- merge(df1_ro, setNames(df2, paste0(names(df2), ".x")),
             by.x = "name", by.y = "name.x")
# If you wanted to perform an inner join on all intersecting columns instead
# (this returns no rows here, because the values in last and colour differ):
df3 <- merge(df1_ro, df2, by = intersect(names(df1_ro), names(df2)))
Data:
df1 <- data.frame(name = c("1;7", "3", "4", "5"),
                  last = c("p", "q", "r", "s"),
                  colour = c("a", "s", "d", "f"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
                  last = c("a", "b", "c", "d"),
                  colour = c("p", "q", "r", "s"), stringsAsFactors = FALSE)

In the end I solved it with this Python script:
# For each line of IndexFile.txt, write the first line of File.txt whose
# 4th tab-separated field contains the index line's 4th field.
with open('IndexFile.txt', 'r') as f, open('File.txt', 'r') as g:
    tabla1 = f.readlines()
    tabla2 = g.readlines()
with open('NewFile.txt', 'w') as co:
    for ln in tabla1:
        B = ln.split('\t')[3]
        for ln2 in tabla2:
            if B in ln2.split('\t')[3]:
                print(ln2)
                co.write(ln2)
                break

Related

Overriding data.table key order causes incorrect merge results

In the following example I use a dplyr::arrange on a data.table with a key. This overrides the sort on that column:
library(data.table)
library(dplyr)  # for arrange() and %>%
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setkey(x, "a")
# lose order on datatable key
x <- dplyr::arrange(x, b)
y <- data.table(a = sample(1000:1100), f = c(letters, NA), g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")
res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)
# merge results are inconsistent
identical(res, res2)
I can see that if I ordered with x <- x[order(b)], I would maintain the sort on the key and the results would be consistent.
I am not sure why I cannot use dplyr::arrange and what relationship the sort key has with the merge. Any insight would be appreciated.
The problem is that dplyr::arrange(x, b) reorders the rows but does not remove the sorted attribute from your data.table, whereas x <- x[order(b)] and setorder(x, "b") do remove it.
The data.table way would be to use setorder in the first place, e.g.:
library(data.table)
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setorder(x, "b", "a", na.last=TRUE)
Joins returning wrong results on data.tables that have a key but are not actually sorted by it is a known bug (see also #5361 in the data.table bug tracker).
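To make the point about the sorted attribute concrete, here is a minimal sketch with toy data (the three-row table is made up for illustration, not the asker's data). It only relies on key(), setkey() and setorder() from data.table, and shows that setorder() drops the key when it reorders the rows, after which you can set the key again before a keyed merge.

```r
library(data.table)

# Toy data (hypothetical): three rows, keyed on column a
x <- data.table(a = c(3L, 1L, 2L), b = c("B", "A", "C"))
setkey(x, a)
key_before <- key(x)   # the key is recorded in the "sorted" attribute

# setorder() physically reorders the rows by b, so data.table drops the key
setorder(x, b)
key_after <- key(x)

# Setting the key again makes the attribute and the row order agree
setkey(x, a)
```

If a data.table has been reordered by something that does not know about keys, calling setkey() again (or clearing it with setkey(x, NULL)) before merging is one way to stay safe.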

Partial string match between columns for multiple dataframes

I have a list of dataframes (df1, df2, df3) whose columns I would like to match against another dataframe (df), substituting strings only where there is a match. The match should be a partial match on a string passed to the function: here it should apply only to fields containing the string "TEXT", and should work on cases like TEXT123 and TEXTabc. I did not get very far myself...
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
list<-c(df1, df2, df3)
Example for df1:
partial_match <- function(column_A$df1, column_B, TEXT, df) {
df1_new <-df1
df1_new[, column_B] <- ifelse(grepl("TEXT.*", df1[, column_A]),
df[, column_B] - nchar(TEXT),
df[, column_B])
df1_new
}
Outcome for df1:
name    column_A column_B
TEXT333 1        11
b       2        b
c       3        c
Here's one approach using a for loop. You were close! Note that I changed your reference dataframe name to dfs to avoid confusion with list().
Do you think you might encounter a situation where you might match multiple times in the same dataframe? If so, what I show below won't work without a couple more lines.
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
# loop over all dataframes in your list
for (i in seq_along(dfs)) {
  # get the name that contains "TEXT" (a plain pattern; "*TEXT*" is glob syntax, not a regex)
  val <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  # use the name to update the value from the reference df
  dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
}
Updated answer that can account for multiple matches in the same df:
for (i in seq_along(dfs)) {
  # all names in this dataframe that contain "TEXT"
  vals <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  for (val in vals) {
    dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
  }
}
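As a loop-free variant of the same idea, here is a sketch that wraps the update in a small helper and applies it with lapply(). The helper name update_matches and the name ref for the reference dataframe are made up for this example. It uses grepl(fixed = TRUE) for the partial match and match() for the lookup, and it leaves a row untouched when its name is missing from the reference.

```r
df1 <- data.frame(name = c("TEXT333", "b", "c"), column_A = 1:3, stringsAsFactors = FALSE)
df2 <- data.frame(name = c("b", "TEXT345", "d"), column_A = 4:6, stringsAsFactors = FALSE)
df3 <- data.frame(name = c("c", "TEXT123", "a"), column_A = 7:9, stringsAsFactors = FALSE)
dfs <- list(df1, df2, df3)
ref <- data.frame(name = c("TEXT333", "TEXT123", "a", "TEXT345", "k", "l", "b", "c", "f"),
                  column_B = 11:19, stringsAsFactors = FALSE)

# Hypothetical helper: replace column_A with ref$column_B wherever
# the name contains `text` and also appears in ref$name
update_matches <- function(d, ref, text) {
  hit <- grepl(text, d$name, fixed = TRUE)   # partial match on the literal string
  idx <- match(d$name[hit], ref$name)        # position of each hit in the reference
  found <- !is.na(idx)                       # skip hits that are absent from ref
  d$column_A[which(hit)[found]] <- ref$column_B[idx[found]]
  d
}

dfs_new <- lapply(dfs, update_matches, ref = ref, text = "TEXT")
```

Because match() is vectorised, several matching rows in the same dataframe are handled without an inner loop.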

How to duplicate rows by condition and replace content in R

I would like to duplicate rows of my data frame by testing a condition and then changing the contents of some variables.
My original data frame is this:
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
                 code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
                 `2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
id decision code  2nd_decision 2nd_code varx
x  partial  AAA20 total        BBB50    a
y  refusal  AAA61 partial      BBB89    v
w  total    AAA77                       p
I would like to test whether 2nd_decision is "partial" or "total" and, if so, duplicate the row and replace the contents of the variables "decision" and "code" with those of "2nd_decision" and "2nd_code". In the duplicated rows I no longer want to show the content of "2nd_decision" and "2nd_code", and the rest of my data frame should stay as it was, like this:
id decision code  2nd_decision 2nd_code varx
x  partial  AAA20 total        BBB50    a
y  refusal  AAA61 partial      BBB89    v
w  total    AAA77                       p
x  total    BBB50                       a
y  partial  BBB89                       v
Thank you in advance
Is this what you want?
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
                 code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
                 `2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
# data.frame() uses check.names = TRUE by default, so `2nd_decision` becomes X2nd_decision
add_rows <- unique(df[, c("id", "X2nd_decision", "X2nd_code", "varx")])
colnames(add_rows) <- c("id", "decision", "code", "varx")
# keep only the rows that actually have a second decision
add_rows <- add_rows[!is.na(add_rows$decision), ]
library(plyr)
df_final <- rbind.fill(df, add_rows)
df_final
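If you would rather not depend on plyr, the same result can be sketched in base R. Note that the column names decision2 and code2 below are my own renaming of `2nd_decision` and `2nd_code` (to sidestep names that start with a digit), so treat them as an assumption rather than the asker's schema.

```r
df <- data.frame(id = c("x", "y", "w"),
                 decision = c("partial", "refusal", "total"),
                 code = c("AAA20", "AAA61", "AAA77"),
                 decision2 = c("total", "partial", NA),   # renamed from `2nd_decision`
                 code2 = c("BBB50", "BBB89", NA),         # renamed from `2nd_code`
                 varx = c("a", "v", "p"),
                 stringsAsFactors = FALSE)

# rows whose second decision is "partial" or "total" (%in% treats NA as FALSE)
keep <- df$decision2 %in% c("partial", "total")

# duplicated rows: second decision/code move into decision/code,
# and the second-decision columns are blanked out
extra <- data.frame(id = df$id[keep],
                    decision = df$decision2[keep],
                    code = df$code2[keep],
                    decision2 = NA_character_,
                    code2 = NA_character_,
                    varx = df$varx[keep],
                    stringsAsFactors = FALSE)

result <- rbind(df, extra)
```

Since extra has exactly the same columns as df, plain rbind() is enough and rbind.fill() is not needed.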
You can use mutate in combination with an ifelse statement.
Let's recreate your data first.
df <- data.frame(id = c("x", "y", "w", "x", "y"),
                 decision = c("partial", "refusal", "total", "total", "partial"),
                 code = c("AAA20", "AAA61", "AAA77", "BBB50", "BBB89"),
                 decision2 = c("total", "partial", NA, NA, NA),
                 varx = c("a", "v", "p", "a", "v"))
And here is the code to test the second decision and remove the unwanted variable.
library(tidyverse)
dfnew <- df %>%
  mutate(code = ifelse(decision2 == "total", "BBB50",
                ifelse(decision2 == "partial", "BBB89", NA))) %>%
  select(-decision2)

How do I update data frame fields in R?

I'm looking to update fields in one data table with information from another data table like this:
dt1$name <- dt2$name where dt1$id = dt2$id
In SQL pseudo code : update dt1 set name = dt2.name where dt1.id = dt2.id
As you can see I'm very new to R so every little helps.
Update
I think it was my fault - what I really want is to update a telephone number if the usernames from both dataframes match.
So how can I compare names and update a field if both names match?
Please help :)
dt1$phone<- dt2$phone where dt1$name = dt2$name
Joran's answer assumes dt1 and dt2 can be matched by position.
If it's not the case, you may need to merge() first:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(7, 3), name = c("f", "g"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "id", all.x = TRUE)
dt1$name <- ifelse( ! is.na(dt1$name.y), dt1$name.y, dt1$name.x)
dt1
Edit per your update:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), phone = c("123-123", "456-456", NA), stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"), phone = c(NA, "000-000", "789-789"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "name", all.x = TRUE)
dt1$new_phone <- ifelse( ! is.na(dt1$phone.y), dt1$phone.y, dt1$phone.x)
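If you want to update dt1 in place without the extra phone.x / phone.y columns that merge() creates, a lookup with match() is another option. This is only a sketch on the same toy data as above. It assumes names are unique in dt2 and keeps the old number whenever dt2 has no number for that name.

```r
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                  phone = c("123-123", "456-456", NA), stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"),
                  phone = c(NA, "000-000", "789-789"), stringsAsFactors = FALSE)

# for each row of dt1, the position of its name in dt2 (NA if absent)
idx <- match(dt1$name, dt2$name)

# candidate replacement numbers, NA where there is no match or no number
new_phone <- dt2$phone[idx]

# overwrite only where dt2 actually supplies a number
use <- !is.na(new_phone)
dt1$phone[use] <- new_phone[use]
```

Unlike merge(), this keeps the row order and the columns of dt1 unchanged.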
Try:
dt1$name <- ifelse(dt1$id == dt2$id, dt2$name, dt1$name)
Alternatively, maybe:
dt1$name[dt1$id == dt2$id] <- dt2$name[dt1$id == dt2$id]
If you're more comfortable working in SQL, you can use the sqldf package:
dt1 <- data.frame(id = c(1, 2, 3),
name = c("A", "B", "C"),
stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(2, 3, 4),
name = c("X", "Y", "Z"),
stringsAsFactors = FALSE)
library(sqldf)
sqldf("SELECT dt1.id,
CASE WHEN dt2.name IS NULL THEN dt1.name ELSE dt2.name END name
FROM dt1
LEFT JOIN dt2
ON dt1.id = dt2.id")
But computationally it's about 150 times slower than joran's solution, and quite a bit slower in human time as well. Still, if you are ever in a bind and just need to do something that is easy to express in SQL, it's an option.

How to concatenate multiple columns with a comma between them

I have the following data frame in R:
ID COL.1 COL.2 COL.3 COL.4
1  a     b
2  v     b     b
3  x     a     n     h
4  t
I am new to R and I don't understand how to work with the data frame in order to end up with the vector below. Another problem is that I have more than 100 columns.
stream <- c("1,a,b","2,v,b,b","3,x,a,n,h","4,t")
Try this:
Reduce(function(...)paste(...,sep=","), df)
Where df is your data.frame
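This might be what you're looking for, even though it's not elegant.
my_df <- data.frame(ID = seq(1, 4, by = 1),
                    COL.1 = c("a", "v", "x", "t"),
                    # use NA, not NULL: c() drops NULL, so the column lengths would differ
                    COL.2 = c("b", "b", "a", NA),
                    COL.3 = c(NA, "b", "n", NA),
                    COL.4 = c(NA, NA, "h", NA))
# substring(..., 3) drops the leading "ID," (single-digit IDs only);
# leave that call out if you want to keep the ID in the result
stream <- substring(paste(my_df$ID,
                          my_df$COL.1,
                          my_df$COL.2,
                          my_df$COL.3,
                          my_df$COL.4,
                          sep = ","), 3)
# strip the "NA" placeholders that paste() leaves behind
stream <- gsub(",NA", "", stream)
stream <- gsub("NA,", "", stream)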
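Since you mention having more than 100 columns, here is a sketch that does not name the columns at all. It assumes the empty cells are NA or empty strings (my assumption about how your data was read in) and builds each row's string from whatever values are present, keeping the ID.

```r
my_df <- data.frame(ID = 1:4,
                    COL.1 = c("a", "v", "x", "t"),
                    COL.2 = c("b", "b", "a", NA),
                    COL.3 = c(NA, "b", "n", NA),
                    COL.4 = c(NA, NA, "h", NA),
                    stringsAsFactors = FALSE)

# convert every column to character once, whatever its type
vals <- lapply(my_df, as.character)

# for each row, collect the values that are neither NA nor "" and join them
stream <- vapply(seq_len(nrow(my_df)), function(i) {
  row <- vapply(vals, `[`, character(1), i)
  paste(row[!is.na(row) & nzchar(row)], collapse = ",")
}, character(1))
```

This gives one string per row in the same order as the data frame, and it works unchanged however many COL.* columns there are.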

Resources