Replace values in df1 with values from another dataset (df2) - r

This is the question:
** df1 ==> You have one dataset:
with df1$IDs from 1:100 (each ID appears twice, once per df1$visit), a column called df1$weight and a column called df1$height
** df2 ==> another dataset:
with df2$IDs from 1:50 (each ID appears twice, once per df2$visit), and a column called df2$weight.
And you want to create a THIRD dataset where you will have:
exactly the same dataset as in df1, but for those IDs that are present in df2 you replace df1$weight with df2$weight, obviously taking visit into account.
How would you do that?
Thanks!

We may do a join. Because each ID appears once per visit, join on both IDs and visit so that each row picks up the weight from the matching visit
library(data.table)
df3 <- copy(df1)
setDT(df3)[df2, weight := i.weight, on = .(IDs, visit)]
If it is more than one column, we may do
setDT(df3)[df2, c('weight1', 'weight2') := .(i.weight1, i.weight2), on = .(IDs, visit)]
If there are many columns, create a vector of those column names
nm1 <- names(df2)[1:5] # suppose the first five column names are wanted
nm2 <- paste0("i.", nm1) # the corresponding column names from the second data
setDT(df3)[df2, (nm1) := mget(nm2), on = .(IDs, visit)]
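To make the update join concrete, here is a minimal sketch on made-up toy data (4 IDs instead of 100, 2 instead of 50; all values are invented for illustration). It assumes both datasets have a visit column and joins on IDs and visit together, so each visit gets its own replacement weight.

```r
library(data.table)

# Hypothetical toy data: 4 IDs in df1, 2 of them also in df2, two visits each
df1 <- data.frame(IDs    = rep(1:4, each = 2),
                  visit  = rep(1:2, times = 4),
                  weight = c(60, 61, 70, 71, 80, 81, 90, 91),
                  height = rep(c(160, 170, 180, 190), each = 2))
df2 <- data.table(IDs    = rep(1:2, each = 2),
                  visit  = rep(1:2, times = 2),
                  weight = c(65, 66, 75, 76))

df3 <- copy(df1)                       # deep copy, so df1 itself is left untouched
setDT(df3)[df2, weight := i.weight, on = .(IDs, visit)]

df3$weight
# 65 66 75 76 80 81 90 91   (IDs 1 and 2 replaced, IDs 3 and 4 kept)
```

Rows of df1 with no match in df2 keep their original weight, which is what the question asks for.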

Related

replace values in dataframe based on indexes on second dataframe R

I have the following task: replace the values of variable V1 in dataframe A with the values of the same variable in dataframe B. First I simulate the dataframes:
set.seed(123)
A<-data.frame(id1=sample(1:10,10),id2=sample(1:10,10),V1=rnorm(10),V2=rnorm(10))
###create dataframe B
B<-A[sample(1:10,5),1:3]
###change values to be updated in df A
B$V1<-rnorm(5)
###create a row which is not in A, to make it more interesting
B<-rbind(B,c(11,12,rnorm(1)))
Now I provide a non-optimal solution which I wish to make cleaner
library(dplyr)
temp<-left_join(A,B,by=c("id1","id2"))
temp[!is.na(temp$V1.y),"V1.x"]<-temp[!is.na(temp$V1.y),"V1.y"]
A<-temp[,setdiff(colnames(temp),"V1.y")]
colnames(A)[colnames(A) %in% "V1.x"]<-"V1"
It would be desirable to avoid creating temporary objects and to modify df A directly. Also, the solution should scale to replacing values in more than one column of A. I am thinking of something like
A[expression1,desired_cols]<-B[expression2,desired_cols]
where expression1 and expression2 are intended to match the row indexes in both data frames and desired_cols are the names of the columns to be replaced
We can use a join from data.table and update the columns of 'A' with the corresponding i. column of the second dataset ('B')
library(data.table)
setDT(A)[B, V1 := i.V1, on = .(id1, id2)]
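If we are replacing multiple columns, make a note of the columns to replace (they must be present in both 'A' and 'B'; in the simulated data 'B' only carries V1, so this assumes a 'B' that also has V2)
nm1 <- names(A)[3:4] # "V1" and "V2"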
nm2 <- paste0("i.", nm1)
setDT(A)[B, (nm1) := mget(nm2), on = .(id1, id2)]
Or if we use left_join, then coalesce would be better
library(dplyr)
left_join(A, B, by = c('id1', 'id2')) %>%
transmute(id1, id2, V1 = coalesce(V1.y, V1.x), V2)
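For the A[expression1, desired_cols] <- B[expression2, desired_cols] form sketched in the question, a package-free alternative is to build a combined key and use match(). This is not taken from the answer above, only a minimal base R sketch on invented toy data; the "_" separator assumes the id values never contain that character themselves.

```r
# Hypothetical toy data; the last row of B has no counterpart in A
A <- data.frame(id1 = 1:5, id2 = 5:1,
                V1 = c(0.1, 0.2, 0.3, 0.4, 0.5),
                V2 = c(1, 2, 3, 4, 5))
B <- data.frame(id1 = c(2, 4, 11), id2 = c(4, 2, 12), V1 = c(100, 200, 300))

cols <- "V1"                               # columns to overwrite; must exist in A and B

keyA <- paste(A$id1, A$id2, sep = "_")     # one combined key per row of A
keyB <- paste(B$id1, B$id2, sep = "_")     # one combined key per row of B

idx <- match(keyA, keyB)                   # row of B for each row of A, NA if absent
hit <- !is.na(idx)                         # rows of A that have a match in B

A[hit, cols] <- B[idx[hit], cols]          # overwrite in place, no temporary join result

A$V1
# 0.1 100.0 0.3 200.0 0.5
```

Setting cols to a vector of several names replaces all of them in one assignment, provided each of those columns exists in both data frames.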

Overwrite a column in one dataframe with one in another dataframe if the value for the entries in the second dataframe is among the top 50% in R

I want to combine the 'gr' columns from dataframes A and B, using the top 50% of entries based on the 'value' column in dataframe B.
Essentially, I want to overwrite the 'gr' variable in dataframe A with the one in dataframe B if the value in dataframe B is in the top 50%.
Importantly, the top-to-bottom order of the 'sample' column must remain the same.
Here is some example data:
dataframe_A <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("1","2","3","3","2","1","4","4","4","3","2","5","1","4","3","4","5","5","1","2","2","3","4","5","1","2","2","3","4","4","5","5","2","1","3","5","3","2","2"))
dataframe_B <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("5","3","3","5","5","5","5","3","5","3","3","3","3","3","3","3","2","1","2","1","1","1","2","2","2","1","2","1","1","1","2","1","1","4","4","4","4","4","4"),
value = c("20.06915","20.06915","19.53556","19.39911","19.06339","18.35938","18.34701","17.85767","17.60714","17.30706","17.08515","16.91452","16.72728","16.46812","15.85850","15.42839","14.92798","14.65943","14.53258","14.33954","14.33583","14.23938","14.19658","14.12557","14.03669","13.89811","13.78137","13.75599","13.51798","13.41058","13.17932","13.11952","12.67316","12.57049","11.88663","11.08443","10.75299","10.61885","10.40393"))
Thanks in advance!
Cheers
Using data.table:
library(data.table)
setDT(dataframe_A)
setDT(dataframe_B)
dataframe_B[, value := as.numeric(value)]
dataframe_A[dataframe_B[value > median(value)], on = "sample", gr := i.gr]
Using base R:
dataframe_B$value <- as.numeric(dataframe_B$value)
dfb2update <- dataframe_B[with(dataframe_B, value > median(value)), c("sample", "gr")]
rowsdfa2update <- which(dataframe_A$sample %in% dfb2update$sample)
dataframe_A$gr[rowsdfa2update] <- dfb2update[match(dataframe_A$sample[rowsdfa2update], dfb2update$sample), "gr"]
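The same idea on a tiny invented example makes it easier to check that only the above-median rows are overwritten and that the row order of A is untouched. This is a sketch using match() and ifelse(), a slightly different formulation from the base R answer above; the sample names and values are made up.

```r
# Hypothetical four-row example; value is stored as character, as in the question
A <- data.frame(sample = c("s1", "s2", "s3", "s4"),
                gr     = c("1", "2", "3", "4"),
                stringsAsFactors = FALSE)
B <- data.frame(sample = c("s1", "s2", "s3", "s4"),
                gr     = c("9", "8", "7", "6"),
                value  = c("10.5", "3.2", "8.1", "1.0"),
                stringsAsFactors = FALSE)

B$value <- as.numeric(B$value)            # compare numbers, not strings
top <- B[B$value > median(B$value), ]     # median is 5.65, so s1 and s3 qualify

idx  <- match(A$sample, top$sample)       # position in 'top' for each row of A, else NA
A$gr <- ifelse(is.na(idx), A$gr, top$gr[idx])

A$gr
# "9" "2" "7" "4"
```

Because the assignment is driven by the rows of A, the top-to-bottom order of the sample column cannot change.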

Select row by level of a factor

I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which row within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole dataframe df2. I'm looking for a way to select the rows for each ID according to the right index (that is, their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be combined with rbind at the end. Each subset of the data (one per value of ID) is reduced to the row given by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
#   ID pos obs
# 1  A   1   1
# 2  B   3   8
# 3  C   2  12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on the ID column, then selects the row at position pos within each matched group, one group per row of df.
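The within-group row number that all of these answers rely on can also be used directly as a logical filter, without merging or adding a column to df2. A minimal base R sketch on the question's data (the helper names within_id and wanted are made up here):

```r
df  <- data.frame(ID = c("A", "B", "C"), pos = c(1, 3, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(c("A", "B", "C"), each = 5), obs = 1:15,
                  stringsAsFactors = FALSE)

# Row number of each observation within its own ID group: 1..5, 1..5, 1..5
within_id <- ave(seq_len(nrow(df2)), df2$ID, FUN = seq_along)

# The position wanted for each row's ID, looked up from df
wanted <- df$pos[match(df2$ID, df$ID)]

df3 <- df2[within_id == wanted, ]
df3$obs
# 1 8 12
```

IDs in df2 that do not appear in df get NA for wanted; those rows would then show up as all-NA rows in the subset, so wrap the comparison in which() if that can happen in your data.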

Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I simply want to count the number of instances of SMS.Type and Reply.Date that are not null. So for the toy example above, I would get 2 for SMS.Type and 1 for Reply.Date.
I then want to add these to the data frame as total counts (I'm aware they will be duplicated across the rows of the original dataset, but that's OK).
I have been playing around with the aggregate and count functions, but to no avail
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do the following (I've added a real NA to your original data).
I'm also not sure whether you are really looking for length(unique()) or just length.
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you could also use the uniqueN function
Explanation
This is a general solution which will work on any number of desired columns. All you need to do is put the column names into cols.
lapply(.SD, ...) calls a function over each of the columns specified in .SDcols = cols
paste0(cols, ".count") creates the new column names by appending ".count" to the names in cols
:= performs assignment by reference, meaning it fills the newly created columns in place with the output of lapply(.SD, ...)
the by argument specifies the grouping columns
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))
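The step of converting empty strings to NA is not shown above, so here is a minimal base R sketch of the whole flow on the question's data. It uses ave() to compute per-ID counts, which is a swapped-in alternative to the dplyr call and, unlike it, groups by ID.

```r
df <- data.frame(ID         = c(1045937900, 1045937900),
                 SMS.Type   = c("DF1", "WCB14"),
                 SMS.Date   = c("12/02/2015 19:51", "13/02/2015 08:38"),
                 Reply.Date = c("", "13/02/2015 09:52"),
                 stringsAsFactors = FALSE)

# Treat empty strings as missing values
df$Reply.Date[df$Reply.Date == ""] <- NA

# Count the non-missing entries per ID and repeat the total on every row of that ID
df$SMS.Type.count   <- ave(as.integer(!is.na(df$SMS.Type)),   df$ID, FUN = sum)
df$Reply.Date.count <- ave(as.integer(!is.na(df$Reply.Date)), df$ID, FUN = sum)

df[, c("SMS.Type.count", "Reply.Date.count")]
# SMS.Type.count is 2 and Reply.Date.count is 1 on both rows
```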

Subset columns based on list of column names and bring the column before it

I have a larger dataset following the same order: a unique date column, data, a unique date column, data, etc. I am trying to subset not just the data columns by name but also the date column that goes with each one. The code below selects columns based on a list of names, which is part of what I want, but any ideas on how I can also grab the column immediately before each selected column?
Looking to end up with a DF containing Date1, Fire, Date3, Earth columns (using just the NameList).
Here is my reproducible code:
Cnames <- c("Date1","Fire","Date2","Water","Date3","Earth")
MAINDF <- data.frame(replicate(6,runif(120,-0.03,0.03)))
colnames(MAINDF) <- Cnames
NameList <- c("Fire","Earth")
NewDF <- MAINDF[,colnames(MAINDF) %in% NameList]
How about
NameList <- c("Fire","Earth")
idx <- match(NameList, names(MAINDF))
idx <- sort(c(idx-1, idx))
NewDF <- MAINDF[,idx]
Here we use match() to find the index of each desired column, and then subtract 1 from each index to grab the column before it
Use which to get the column numbers from the names, and then it's just simple arithmetic:
col.num <- which(colnames(MAINDF) %in% NameList)
NewDF <- MAINDF[,sort(c(col.num, col.num - 1))]
Produces
Date1 Fire Date3 Earth
1 -0.010908003 0.007700453 -0.022778726 -0.016413307
2 0.022300509 0.021341360 0.014204445 -0.004492150
3 -0.021544992 0.014187158 -0.015174048 -0.000495121
4 -0.010600955 -0.006960160 -0.024535954 -0.024210771
5 -0.004694499 0.007198620 0.005543146 -0.021676692
6 -0.010623787 0.015977135 -0.027741109 -0.021102651
...
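Both answers assume every name in NameList exists and is never the first column. A minimal sketch of the match() approach with two small guards added (the guards are an addition here, not part of either answer):

```r
Cnames <- c("Date1", "Fire", "Date2", "Water", "Date3", "Earth")
MAINDF <- data.frame(replicate(6, runif(5, -0.03, 0.03)))
colnames(MAINDF) <- Cnames
NameList <- c("Fire", "Earth")

idx <- match(NameList, names(MAINDF))   # positions of the requested columns
idx <- idx[!is.na(idx)]                 # guard: ignore names that are not present
idx <- sort(unique(c(idx - 1, idx)))    # add the column just before each one
idx <- idx[idx >= 1]                    # guard: a requested first column has no predecessor

NewDF <- MAINDF[, idx, drop = FALSE]
names(NewDF)
# "Date1" "Fire" "Date3" "Earth"
```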
