How do I update data frame fields in R?

I'm looking to update fields in one data table with information from another data table like this:
dt1$name <- dt2$name where dt1$id = dt2$id
In SQL pseudo code : update dt1 set name = dt2.name where dt1.id = dt2.id
As you can see I'm very new to R, so every little bit helps.
Update
I think it was my fault: what I really want is to update a telephone number where the usernames in both data frames match.
So how can I compare names and update a field when both names match?
Please help :)
dt1$phone<- dt2$phone where dt1$name = dt2$name

Joran's answer assumes dt1 and dt2 can be matched by position.
If that's not the case, you may need to merge() first:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(7, 3), name = c("f", "g"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "id", all.x = TRUE)
dt1$name <- ifelse(!is.na(dt1$name.y), dt1$name.y, dt1$name.x)
dt1
Edit, per your update:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                  phone = c("123-123", "456-456", NA), stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"),
                  phone = c(NA, "000-000", "789-789"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "name", all.x = TRUE)
dt1$new_phone <- ifelse(!is.na(dt1$phone.y), dt1$phone.y, dt1$phone.x)
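If you'd rather not create the merged .x/.y columns at all, here is a merge-free sketch using base R's match() for the same name-keyed update (it assumes dt2$name has no duplicates):

```r
dt1 <- data.frame(name = c("a", "b", "c"),
                  phone = c("123-123", "456-456", NA),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"),
                  phone = c(NA, "000-000", "789-789"),
                  stringsAsFactors = FALSE)

idx <- match(dt1$name, dt2$name)             # row of dt2 for each dt1 name (NA if absent)
hit <- !is.na(idx) & !is.na(dt2$phone[idx])  # only overwrite where dt2 actually has a phone
dt1$phone[hit] <- dt2$phone[idx[hit]]
```

This updates dt1 in place and leaves non-matching rows untouched, which is closer to the SQL UPDATE semantics in the question.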

Try:
dt1$name <- ifelse(dt1$id == dt2$id, dt2$name, dt1$name)
Alternatively, maybe:
dt1$name[dt1$id == dt2$id] <- dt2$name[dt1$id == dt2$id]

If you're more comfortable working in SQL, you can use the sqldf package:
dt1 <- data.frame(id = c(1, 2, 3),
                  name = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(2, 3, 4),
                  name = c("X", "Y", "Z"),
                  stringsAsFactors = FALSE)
library(sqldf)
sqldf("SELECT dt1.id,
              CASE WHEN dt2.name IS NULL THEN dt1.name ELSE dt2.name END name
       FROM dt1
       LEFT JOIN dt2
         ON dt1.id = dt2.id")
But, computationally, it's about 150 times slower than joran's solution, and quite a bit slower in human time as well. However, if you're ever in a bind and just need to do something you can do easily in SQL, it's an option.
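Since the question talks about data tables: if dt1 and dt2 really are data.tables, data.table's update join does the SQL UPDATE directly, modifying dt1 by reference (a sketch with the same toy data):

```r
library(data.table)

dt1 <- data.table(id = c(1, 2, 3), name = c("A", "B", "C"))
dt2 <- data.table(id = c(2, 3, 4), name = c("X", "Y", "Z"))

# i.name refers to dt2's name column; rows of dt1 with no match keep their value
dt1[dt2, name := i.name, on = "id"]
dt1
```

Rows of dt2 that have no partner in dt1 (here id = 4) are simply ignored, matching the LEFT JOIN semantics above.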

Related

Is it possible to use tidyselect helpers with the cols_only() function?

I have a .csv file like this (except that the real .csv file has many more columns):
library(tidyverse)
tibble(id1 = c("a", "b"),
       id2 = c("c", "d"),
       data1 = c(1, 2),
       data2 = c(3, 4),
       data1s = c(5, 6),
       data2s = c(7, 8)) %>%
  write_csv("df.csv")
I only want id1, id2, data1, and data2.
I can do this:
df <- read_csv("df.csv",
               col_names = TRUE,
               cols_only(id1 = col_character(),
                         id2 = col_character(),
                         data1 = col_integer(),
                         data2 = col_integer()))
But, as mentioned above, my real dataset has many more columns, so I'd like to use tidyselect helpers to only read in specified columns and ensure specified formats.
I tried this:
df2 <- read_csv("df.csv",
                col_names = TRUE,
                cols_only(starts_with("id") = col_character(),
                          starts_with("data") & !ends_with("s") = col_integer()))
But the error message indicates that there's a problem with the syntax. Is it possible to use tidyselect helpers in this way?
My proposal goes around the houses somewhat, but it does let you customise the read spec on a 'rules' basis rather than naming each column explicitly:
library(tidyverse)
tibble(id1 = c("a", "b"),
       id2 = c("c", "d"),
       data1 = c(1, 2),
       data2 = c(3, 4),
       data1s = c(5, 6),
       data2s = c(7, 8)) %>%
  write_csv("df.csv")
# Read only 1 row to build a spec from with a minimal read; really just to get the column names
df_spec <- spec(read_csv("df.csv",
                         col_names = TRUE,
                         n_max = 1))

# Alter the spec with base R functions startsWith / endsWith etc.
df_spec$cols <- imap(df_spec$cols, ~ {
  if (startsWith(.y, "id")) {
    col_character()
  } else if (startsWith(.y, "data") && !endsWith(.y, "s")) {
    col_integer()
  } else {
    col_skip()
  }
})

df <- read_csv("df.csv",
               col_types = df_spec$cols)
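An alternative sketch, if you can afford to parse every column first: read everything as character, then use the tidyselect helpers in select() and convert types with across() (dplyr >= 1.0.0). Less memory-friendly than a col_skip() spec, since unwanted columns are read and then dropped:

```r
library(tidyverse)

# Recreate the example file from the question
tibble(id1 = c("a", "b"), id2 = c("c", "d"),
       data1 = c(1, 2), data2 = c(3, 4),
       data1s = c(5, 6), data2s = c(7, 8)) %>%
  write_csv("df.csv")

df <- read_csv("df.csv", col_types = cols(.default = col_character())) %>%
  select(starts_with("id"), starts_with("data") & !ends_with("s")) %>%
  mutate(across(starts_with("data"), as.integer))
```

The tidyselect expressions are exactly the ones the question tried to pass to cols_only(); they are legal here because select() supports that mini-language.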

Overriding data.table key order causes incorrect merge results

In the following example I use a dplyr::arrange on a data.table with a key. This overrides the sort on that column:
x <- data.table(a = sample(1000:1100),
                b = sample(c("A", NA, "B", "C", "D"), replace = TRUE),
                c = letters)
setkey(x, "a")

# lose order on the data.table key
x <- dplyr::arrange(x, b)
y <- data.table(a = sample(1000:1100), f = c(letters, NA),
                g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")
res <- merge(x, y, by = c("a"), all.x = TRUE)

# try the merge with the key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)

# merge results are inconsistent
identical(res, res2)
I can see that if I ordered with x <- x[order(b)], I would maintain the sort on the key and the results would be consistent.
I am not sure why I cannot use dplyr::arrange and what relationship the sort key has with the merge. Any insight would be appreciated.
The problem is that dplyr::arrange(x, b) does not remove the sorted attribute from your data.table, unlike x <- x[order(b)] or setorder(x, "b").
The data.table way would be to use setorder in the first place e.g.
library(data.table)
x <- data.table(a = sample(1000:1100),
                b = sample(c("A", NA, "B", "C", "D"), replace = TRUE),
                c = letters)
setorder(x, "b", "a", na.last=TRUE)
Wrong join results on data.tables that carry a key but are no longer actually sorted by it are a known bug (see also #5361 in the data.table bug tracker).
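Until that bug is fixed, a defensive sketch is to drop the key explicitly with setkey(x, NULL) after any operation that may have disturbed the order, so merge() falls back to an ordinary unkeyed join instead of trusting a stale sort:

```r
library(data.table)

x <- data.table(a = c(3L, 1L, 2L), b = c("p", "q", "r"))
setkey(x, a)    # x is now sorted and keyed by 'a'
key(x)          # "a"

setkey(x, NULL) # drop the key; the rows keep their order, but the
key(x)          # sorted-key assumption is gone -> NULL
```

This is cheaper than the as.data.frame() round trip in the question and makes the intent explicit.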

Specify nesting columns by using character vector in tidyr::complete

How can I define the columns I want to use for nesting in the tidyr::complete function?
Neither one_of() nor as.name() works.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
char_vec <- c("item_id", "item_name")
df %>% complete(group, nesting(char_vec))
Error: `by` can't contain join column `char_vec` which is missing from RHS
Run `rlang::last_error()` to see where the error occurred.
An up-to-date solution with dplyr 1.0.6 is !!!syms():
library(dplyr)
library(tidyr)

df %>%
  complete(group, nesting(!!!syms(char_vec)))
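A self-contained sketch of that call with the question's data (syms() comes from rlang and is re-exported by dplyr; splicing the symbols with !!! makes nesting() see the actual columns):

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
char_vec <- c("item_id", "item_name")

# every group is paired with every observed (item_id, item_name) combination:
# 2 groups x 2 distinct combinations = 4 rows
out <- df %>% complete(group, nesting(!!!syms(char_vec)))
```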
Ok, I figured it out.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
char_vec <- c("item_id", "item_name")
df %>% complete(group, nesting(!!as.symbol(char_vec)))

R merging Partial match

There are a lot of answers about this, but I didn't find one that handles the problem I have.
I have 2 dataframes:
df1:
df2:
setA <- read.table("df1.txt",sep="\t", header=TRUE)
setB <- read.table("df2.txt",sep="\t", header=TRUE)
So, I want the matching rows by column value:
library(data.table)
setC <- merge(setA, setB, by.x = "name", by.y = "name", all.x = FALSE)
And I get this output:
df3:
Because in df1 I also have the value 1, but separated with a ";". How can I get the desired output?
Thanks!!
In future, please apply the function dput(df1) and dput(df2) and copy and paste the output from the console into your question.
Base R solution to two part question:
# First unstack the "1;7" row into two separate rows:
name_split <- strsplit(df1$name, ";")

# If the values of the 'last' vector uniquely identify each row in the data frame:
df_ro <- data.frame(name = unlist(name_split),
                    last = rep(df1$last, sapply(name_split, length)),
                    stringsAsFactors = FALSE)

# Left join to achieve the same result as the first solution
# without specifically naming each vector:
df1_ro <- merge(df1[, names(df1) != "name"], df_ro, by = "last", all.x = TRUE)

# Then perform an inner join, preventing a namespace collision:
df3 <- merge(df1_ro, setNames(df2, paste0(names(df2), ".x")),
             by.x = "name", by.y = "name.x")

# If you wanted an inner join on all intersecting columns (returning no rows
# here, because the values in 'last' and 'colour' differ):
df3 <- merge(df1_ro, df2, by = intersect(names(df1_ro), names(df2)))
Data:
df1 <- data.frame(name = c("1;7", "3", "4", "5"),
                  last = c("p", "q", "r", "s"),
                  colour = c("a", "s", "d", "f"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
                  last = c("a", "b", "c", "d"),
                  colour = c("p", "q", "r", "s"),
                  stringsAsFactors = FALSE)
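For what it's worth, tidyr::separate_rows() collapses the whole unstacking step into one call (a sketch using the Data above; the plain merge() is an inner join, so only names present in both frames survive):

```r
library(tidyr)

df1 <- data.frame(name = c("1;7", "3", "4", "5"),
                  last = c("p", "q", "r", "s"),
                  colour = c("a", "s", "d", "f"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
                  last = c("a", "b", "c", "d"),
                  colour = c("p", "q", "r", "s"), stringsAsFactors = FALSE)

setA <- separate_rows(df1, name, sep = ";")  # "1;7" becomes two rows
setC <- merge(setA, df2, by = "name")        # inner join on name: matches 1, 3, 4
```

The clashing 'last' and 'colour' columns get the usual .x/.y suffixes from merge().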
In the end I solved it with this Python script:
co = open('NewFile.txt', 'w')
f = open('IndexFile.txt', 'r')
g = open('File.txt', 'r')
tabla1 = f.readlines()
tabla2 = g.readlines()
for ln in tabla1:
    B = ln.split('\t')[3]
    for k, ln2 in enumerate(tabla2):
        if B in ln2.split('\t')[3]:
            xx = ln2
            print(xx)
            co.write(xx)
            break
co.close()

R: Melt with data.table into two columns

I would like to convert a data.table from wide to long format.
A "normal" melt works fine, but I would like to melt my data into two different columns.
I found information about this here:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html (Point 3)
I tried this with my code, but somehow it is not working, and so far I haven't been able to figure out the problem.
It would be great if you could explain the mistake in the following example.
Thanks!
# Create fake data
a <- c("l1", "l2", "l2", "l3")
b <- c(10, 10, 20, 10)
c <- c(30, 30, 30, 30)
d <- c(40.2, 32.1, 24.1, 33.0)
e <- c(1, 2, 3, 4)
f <- c(1.1, 1.2, 1.3, 1.5)
df <- data.frame(a, b, c, d, e, f)
colnames(df) <- c("fac_a", "fac_b", "fac_c", "m1", "m2.1", "m2.2")

# install.packages("data.table")
require(data.table)
TB <- setDT(df)

# Standard melt - works
TB.m1 <- melt(TB, id.vars = c("fac_a", "fac_b", "fac_c"),
              measure.vars = c(4:ncol(TB)))

# Melt into two columns
colA <- paste("m1", 4, sep = "")
colB <- paste("m2", 5:ncol(TB), sep = "")
DT <- melt(TB, measure = list(colA, colB), value.name = c("a", "b"))
# Not working; error: Error in melt.data.table(TB, measure = list(colA, colB),
# value.name = c("a", : One or more values in 'measure.vars' is invalid.
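The "invalid" values are the constructed names themselves: paste("m1", 4, sep = "") is "m14" and paste("m2", 5:ncol(TB), sep = "") is c("m25", "m26"), none of which are columns of TB. A sketch of the intended call, taking the real names by position instead (recent data.table versions pad the shorter measure group with NA):

```r
library(data.table)

TB <- data.table(fac_a = c("l1", "l2", "l2", "l3"),
                 fac_b = c(10, 10, 20, 10),
                 fac_c = c(30, 30, 30, 30),
                 m1 = c(40.2, 32.1, 24.1, 33.0),
                 m2.1 = 1:4,
                 m2.2 = c(1.1, 1.2, 1.3, 1.5))

colA <- names(TB)[4]           # "m1"
colB <- names(TB)[5:ncol(TB)]  # "m2.1" "m2.2"
DT <- melt(TB, measure.vars = list(colA, colB), value.name = c("a", "b"))
```

Equivalently, measure.vars = patterns("^m1$", "^m2") selects the same groups by regular expression, as shown in the reshape vignette the question links to.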
