I have the following task: replace values of variable V1 in dataframe A with values of the same variable in dataframe B. First, I simulate the dataframes:
set.seed(123)
A<-data.frame(id1=sample(1:10,10),id2=sample(1:10,10),V1=rnorm(10),V2=rnorm(10))
###create dataframe B
B<-A[sample(1:10,5),1:3]
###change values to be updated in df A
B$V1<-rnorm(5)
###create a row which is not in A, to make it more interesting
B<-rbind(B,c(11,12,rnorm(1)))
Now I provide a non-optimal solution which I wish to make cleaner:
library(dplyr)
temp<-left_join(A,B,by=c("id1","id2"))
temp[!is.na(temp$V1.y),"V1.x"]<-temp[!is.na(temp$V1.y),"V1.y"]
A<-temp[,setdiff(colnames(temp),"V1.y")]
colnames(A)[colnames(A) %in% "V1.x"]<-"V1"
It would be desirable to avoid creating temporary objects and to modify df A directly. The solution should also scale to replacing values in more than one column of A. I am thinking of something like
A[expression1,desired_cols]<-B[expression2,desired_cols]
where expression1 and expression2 are intended to match indexes in both data frames and desired_cols holds the names of the columns to be replaced.
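For concreteness, a minimal base R sketch of that pattern using match() (assuming the (id1, id2) pairs are unique in A; desired_cols is just an illustrative name):
# match B's keys against A's keys; unmatched rows of B give NA
idx <- match(paste(B$id1, B$id2), paste(A$id1, A$id2))
keep <- !is.na(idx)                 # drop B rows absent from A (e.g. the (11, 12) row)
desired_cols <- "V1"                # could hold several column names
A[idx[keep], desired_cols] <- B[keep, desired_cols]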
We can use a join from data.table and update the columns of 'A' with the corresponding i.-prefixed columns of the second dataset ('B'):
library(data.table)
setDT(A)[B, V1 := i.V1, on = .(id1, id2)]
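Two properties worth noting, sketched below: an update join modifies A in place and never appends rows, so the unmatched (11, 12) row of B is silently ignored.
nrow(A)                              # still 10 -- B's unmatched row was not added
A[B, on = .(id1, id2), nomatch = 0]  # the rows of A that had a match in B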
If we are replacing multiple columns, first make note of the columns to replace:
nm1 <- names(A)[3:4]
nm2 <- paste0("i.", nm1)
setDT(A)[B, (nm1) := mget(nm2), on = .(id1, id2)]
Or, if we use left_join, then coalesce would be better:
library(dplyr)
left_join(A, B, by = c('id1', 'id2')) %>%
transmute(id1, id2, V1 = coalesce(V1.y, V1.x), V2)
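Should the dplyr route also need to scale to several columns, one possible sketch (assuming dplyr >= 1.0 for across(); the ".new" suffix and desired_cols are names chosen here for illustration):
library(dplyr)
desired_cols <- "V1"  # shared columns to update; extend as needed
left_join(A, B, by = c("id1", "id2"), suffix = c("", ".new")) %>%
  mutate(across(all_of(desired_cols),
                ~ coalesce(get(paste0(cur_column(), ".new")), .x))) %>%
  select(-ends_with(".new"))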
Related
I have df1 which I would like to merge with df2 based on a common field, id.
id is always in the form 21_2342_A_C (i.e. num_num_char_char). I want to merge df2 into df1 even if the last two "_"-separated fields in id are switched.
So, if ID in df1 is 21_2342_A_C, then I want it to match if the entry in df2 is either 21_2342_A_C or 21_2342_C_A.
Is this possible using data.table? I've developed a cumbersome way involving creating two different columns and doing two different joins, but I was hoping there'd be a more elegant solution. I'll also happily take a non data.table solution.
This also involves creating two additional columns, but only one merge:
library(data.table)
library(stringr)
library(purrr)

dt <- data.table(
id = c("21_2342_A_C", "21_2342_C_A", "21_2342_A_B")
)
1 - Extract the number and character parts of id
2 - Sort the character part
3 - Merge where the number and character parts are the same
4 - Remove self-merges and/or duplicated merges (if row i is merged to row j, then row j is merged to row i)
dt[, row_id := seq_len(.N)]
dt[, (c("id1", "id2")) := transpose(str_extract_all(dt$id, "([0-9]{2}_[0-9]{4})|([A-Z]_[A-Z])"))]
dt[, id2 := map_chr(str_split(id2, "_"), ~str_c(sort(.x), collapse = ""))]
res <- dt[dt, on = .(id1, id2)][row_id < i.row_id]
res[, c("row_id", "id1", "id2", "i.row_id") := NULL]
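On the example data above, only the first two ids form a genuine swapped pair, so res should reduce to (output sketched by hand, worth verifying):
res
#             id        i.id
# 1: 21_2342_A_C 21_2342_C_A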
I also could not figure out how to do it without an intermediate id.
Here is my take:
library(data.table)

df1 <- data.table(V1 = "hello", id = "21_2342_A_C")
df2 <- data.table(V1 = c("world1", "world2"), id = c("21_2342_A_C", "21_2342_C_A"))
sort_id <- function(x)
{
x <- unlist(tstrsplit(x, "_"))
return(paste0(c(x[1:2], sort(x[3:4])), collapse= "_"))
}
df1[, id2 := sort_id(id), by = id]
df2[, id2 := sort_id(id), by = id]

merge(df1, df2, by = "id2")
This is the question:
** df1 ==> You have one dataset:
with df1$IDs from 1:100 (each ID appears twice, once per df1$visit), a column called df1$weight and a column called df1$height.
** df2 ==> another dataset:
with df2$IDs from 1:50 (each ID appears twice, once per df2$visit), and a column called df2$weight.
And you want to create a THIRD dataset where you will have:
exactly the same dataset as in df1, but for those IDs that are present in df2 you replace df1$weight with df2$weight, obviously taking visit into account.
How would you do that?
Thanks!
We may do a join:
library(data.table)
df3 <- copy(df1)
setDT(df3)[df2, weight := i.weight, on = .(IDs, visit)]
If there is more than one column, we may do:
setDT(df3)[df2, c('weight1', 'weight2') := .(i.weight1, i.weight2), on = .(IDs, visit)]
If there are many columns, create a vector of those column names:
nm1 <- names(df2)[1:5] # suppose the first five columns are wanted
nm2 <- paste0("i.", nm1) # the corresponding i.-prefixed names from the second data
setDT(df3)[df2, (nm1) := mget(nm2), on = .(IDs, visit)]
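For a dplyr alternative, rows_update() (available since dplyr 1.0.0) performs the same keyed update; a sketch, assuming (IDs, visit) uniquely identifies rows and every key in df2 also exists in df1:
library(dplyr)
df3 <- rows_update(df1, df2, by = c("IDs", "visit"))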
I have two dataframes (A and B). B contains new values and A contains outdated values.
Each of these dataframes have one column representing the key and another one representing the value.
I want to add rows from B to A and then remove rows with duplicated keys from A (i.e., update A with the new values that are in B). Order doesn't really matter; I think it is easier in the other order: cleaning duplicates and then appending.
At the moment, I have this script:
A <- bind_rows(B, A)
A <- A[!duplicated(A),]
The issue I have is that it doesn't clean rows because they are not real duplicates (value is different).
How could I handle this?
This is just a hunch because there's no example data provided, but I suspect a merge is a much safer approach than a row-bind:
Solution with data.table
library(data.table)
1 - Rename variables to prepare for a merge
setnames(A, old="value", new="value_A")
setnames(B, old="value", new="value_B")
2 - Merge, be sure to use the all arg
dt <- merge(A, B, by="key", all=TRUE)
3 - Use some rule for the update - for example: use value_B unless it's missing, in which case use value_A
dt[ , value := value_B]
dt[is.na(value), value := value_A]
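With a recent data.table, the two assignments in step 3 can be collapsed using fcoalesce (a small variant, not in the original steps):
dt[, value := fcoalesce(value_B, value_A)]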
Solution with Base R
names(A) <- c("key", "value_A")
names(B) <- c("key", "value_B")
df <- merge(A, B, by="key", all=TRUE)
df$value <- df$value_B
df[is.na(df$value), "value"] <- df[is.na(df$value), "value_A"]
Solution with dplyr/tidyverse
library(dplyr)
df <- full_join(A, B, by="key") %>%
mutate(value = ifelse(is.na(value_B), value_A, value_B))
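coalesce() expresses the same fallback rule a bit more directly than ifelse() (a small variant on the above):
library(dplyr)
df <- full_join(A, B, by = "key") %>%
  mutate(value = coalesce(value_B, value_A))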
Example Data
set.seed(1234)
A <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
B <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which rows within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level named in ID, not within the whole dataframe df2. I'm looking for a way to select the rows for each ID according to the right index (i.e., their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be combined with rbind at the end. Each subset of the data (associated with each value of ID) is indexed by pos and the "pos" column is dropped, all with normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on the ID column, then selects the pos-th row for each of the rows of df.
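On the example data this should give (sketched by hand, worth verifying):
#    ID obs
# 1:  A   1
# 2:  B   8
# 3:  C  12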
I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I want to simply count the number of instances of SMS.Type and Reply.Date where there is no null. So in the toy example above, I would get 2 for SMS.Type and 1 for Reply.Date.
I then want to add these to the data frame as total counts (I'm aware they will be duplicated across the rows of the original dataset, but that's OK).
I have been playing around with aggregate and the count function, but to no avail:
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do the following (I've added a real NA to your original data).
I'm also not sure if you're really looking for length(unique()) or just length?
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you could also use the uniqueN function.
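A sketch of that variant (uniqueN is data.table's faster counterpart of length(unique(...))):
setDT(df)[, paste0(cols, ".count") :=
            lapply(.SD, function(x) uniqueN(na.omit(x))),
          .SDcols = cols,
          by = ID]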
Explanation
This is a general solution which will work on any number of desired columns. All you need to do is put the column names into cols.
- lapply(.SD, ...) calls the given function over the columns specified in .SDcols = cols
- paste0(cols, ".count") creates the new column names by appending ".count" to the names specified in cols
- := performs assignment by reference, meaning it updates the newly created columns with the output of lapply(.SD, ...) in place
- the by argument specifies the grouping columns
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))
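For the empty-string-to-NA conversion both answers assume, a minimal base R sketch (assuming plain character columns):
df[df == ""] <- NA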