For lack of a better word, how do I apply a "patch" to an R data.frame? Suppose I have a master database with firm and outlet columns and an ownership shares variable that is 1 or 0 in this example, but could be any percentage.
// master
firm outlet shares.pre
1 five 1 0
2 one 1 1
3 red 1 0
4 yellow 1 0
5 five 2 0
6 one 2 0
// many more
I want to let firm "one" sell outlet "1" to firm "red". I have this transaction in another data.frame:
// delta
firm outlet shares.delta
1 one 1 -1
2 red 1 1
What is the most efficient way in R to apply this "patch" or transaction to my master database? The end result should look like this:
// preferably master, NOT a copy
firm outlet shares.post
1 five 1 0
2 one 1 0 <--- was 1
3 red 1 1 <--- was 0
4 yellow 1 0
5 five 2 0
6 one 2 0
// many more
I am not particular about keeping the suffixes pre, post or delta. If they were all named shares that would be fine too, I simply want to "add" these data frames.
UPDATE: my current approach is this
update <- (master$firm %in% delta$firm) & (master$outlet %in% delta$outlet)
master[update,]$shares <- master[update,]$shares + delta$shares
Yes, I'm aware that it does a vector scan to create the Boolean update vector, and that the subsetting is also not very efficient. But what I like least about it is that I have to write out the matching columns.
Another way, using data.table. Assuming you've loaded your data into the data.frames df1 and df2:
require(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1, firm, outlet)
setkey(dt2, firm, outlet)
dt1 <- dt2[dt1]
dt1[is.na(dt1)] <- 0
dt1[, shares.post := shares.delta + shares.pre]
# firm outlet shares.delta shares.pre shares.post
# 1: five 1 0 0 0
# 2: five 2 0 0 0
# 3: one 1 -1 1 0
# 4: one 2 0 0 0
# 5: red 1 1 0 1
# 6: yellow 1 0 0 0
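If the goal is to change master itself rather than build a joined copy, data.table's update-on-join form may be a closer fit. The following is a minimal sketch, assuming both tables call the column shares (the question says that would be fine) and that each firm/outlet pair appears at most once in master; the toy data is retyped from the question.

```r
library(data.table)

# Toy data retyped from the question (column simply named "shares").
master <- data.table(
  firm   = c("five", "one", "red", "yellow", "five", "one"),
  outlet = c(1, 1, 1, 1, 2, 2),
  shares = c(0, 1, 0, 0, 0, 0)
)
delta <- data.table(
  firm   = c("one", "red"),
  outlet = c(1, 1),
  shares = c(-1, 1)
)

# Update join: for each row of delta, find the matching row of master
# and add the delta to it. `:=` modifies master by reference.
master[delta, on = .(firm, outlet), shares := shares + i.shares]
master
```

Inside j, shares refers to the column in master and i.shares to the column of the same name in delta. Rows of master without a match in delta are left untouched.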
I'd give a more precise answer if you had provided a reproducible example, but here's one way:
Call your first data.frame dat and your second chg
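Then you could merge the two, using all.x = TRUE so that master rows without a matching transaction are kept:
dat <- merge(dat, chg, all.x = TRUE)
Unmatched rows get NA for shares.delta, so set those to 0 and just add:
dat$shares.delta[is.na(dat$shares.delta)] <- 0
dat$shares <- with(dat, shares.pre + shares.delta)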
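For a base R route that modifies the master table directly instead of merging, one option is to build a composite key and use match. This is only a sketch under assumptions: both tables name the column shares, every firm/outlet pair in delta exists exactly once in master, and the helper key() and the variable by are made up for illustration.

```r
master <- data.frame(
  firm   = c("five", "one", "red", "yellow", "five", "one"),
  outlet = c(1, 1, 1, 1, 2, 2),
  shares = c(0, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)
delta <- data.frame(
  firm   = c("one", "red"),
  outlet = c(1, 1),
  shares = c(-1, 1),
  stringsAsFactors = FALSE
)

# Matching columns are derived, not written out: everything the two
# tables share except the value column.
by <- setdiff(intersect(names(master), names(delta)), "shares")

# Hypothetical helper: one string per row built from the matching columns.
key <- function(d) do.call(paste, c(as.list(d[by]), sep = "\r"))

idx <- match(key(delta), key(master))  # row of master hit by each delta row
stopifnot(!anyNA(idx))                 # assumption: every delta row has a match
master$shares[idx] <- master$shares[idx] + delta$shares
master
```

Because by is computed from the column names, adding another matching column to both tables would not require changing the code.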
Related
I have the following dataset:
df =
id Time A
1 3 0
1 5 1
1 6 1
2 8 0
2 9 0
2 12 1
I want to do two things: i) have a starting time of -1 across all ids, and ii) split the time into two columns; start and end while preserving the time at which the individual got the observation A (setting end as the reference point). The final result should look something like this:
df =
id start end A
1 -1 0 0
1 0 2 1
1 2 3 1
2 -1 0 0
2 0 1 0
2 1 4 1
This does the trick with this set. I wasn't 100% sure on the question from the description so tried to go off what I could see here. For future reference, please try pasting in dput(df) as the input data :)
df <- data.frame(id=c(rep(1,3),rep(2,3)),
Time=c(3,5,6,8,9,12),
A=c(0,1,1,0,0,1))
library(data.table)
dt <- as.data.table(df)
# diff(Time) finds the interval between points
# cumsum then adds this diff together to take in to account the previous time
# gaps
dt[, end := cumsum(c(0, diff(Time))), by=id]
# start is then just a shifted version of end, with the initial start filled as -1
dt[, start := shift(end, n=1, fill=-1), by=id]
out <- as.data.frame(dt)
out
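The same result can be reached without data.table. A rough base R sketch, relying on the fact that cumsum(c(0, diff(Time))) is just Time minus the first Time within each id, might look like this:

```r
df <- data.frame(id   = c(rep(1, 3), rep(2, 3)),
                 Time = c(3, 5, 6, 8, 9, 12),
                 A    = c(0, 1, 1, 0, 0, 1))

# end: time elapsed since the first observation of each id
df$end <- ave(df$Time, df$id, FUN = function(t) t - t[1])

# start: previous end within the id, with -1 for the first row
df$start <- ave(df$end, df$id, FUN = function(e) c(-1, head(e, -1)))

out <- df[, c("id", "start", "end", "A")]
out
```

ave applies the function per group and returns the values in the original row order, so the rows do not need to be grouped together, although they are assumed to be sorted by Time within each id.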
I'm working on a market basket analysis project and have data that contains a variable that represents an ID, and another that contains only one item in a basket. There are ~50K unique items across users, but I have created a simple table below for illustration.
library(data.table)
dt = data.table(ID = c("1","1","1","1","2","2","2"),
product = c("Soda","Beer","Chips","Apple","Soda","Water","Juice"))
dt # original data looks like this
I am then using the dcast function to transform each product into its own column with binary values, indicating that they were part of the order.
dcast.data.table(dt, ID ~ product, function(x) 1, fill = 0)
As I mentioned, I cannot use this method on the entire dataset due to memory limitations (since this would create 50K columns, one for each product). So, I am trying to find a solution where dcast only creates "product" columns for the items contained within ID == 1 (meaning the columns "Juice" and "Water" would be excluded). Also, I am working with a fairly large dataset of 34MM observations, so I am looking for an efficient solution where the data.table API can be leveraged, and I am specifically trying to avoid looping through products. I hope this question is clear. Thank you.
This works:
dcast(dt, ID + product ~ factor(product, levels=product[ID==1]), fun.agg=length)
Using 'product' as value column. Use 'value.var' to override
ID product Soda Beer Chips Apple NA
1: 1 Apple 0 0 0 1 0
2: 1 Beer 0 1 0 0 0
3: 1 Chips 0 0 1 0 0
4: 1 Soda 1 0 0 0 0
5: 2 Juice 0 0 0 0 1
6: 2 Soda 1 0 0 0 0
7: 2 Water 0 0 0 0 1
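The output above still has one row per ID/product pair and an NA column for products outside ID 1. If one row per ID is wanted, a possible variation (a sketch, not tried on the 34MM-row data) is to filter to the products bought by ID 1 before casting; keep and wide are names made up here:

```r
library(data.table)

dt <- data.table(ID = c("1", "1", "1", "1", "2", "2", "2"),
                 product = c("Soda", "Beer", "Chips", "Apple",
                             "Soda", "Water", "Juice"))

# Products that occur for ID 1; only these become columns.
keep <- dt[ID == "1", unique(product)]

# Drop every other product first, so dcast never sees the remaining items.
wide <- dcast(dt[product %in% keep], ID ~ product,
              fun.aggregate = length, value.var = "product")
wide
```

IDs that bought none of the kept products drop out of the result entirely, so they would have to be added back if all-zero rows are needed.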
I have an ordered table, similar to the following:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by SIZE (Ascending), then by each other column in priority sequence (Descending) - i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except SIZE".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
If the column names are known, use order like this. No packages are needed.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column use k <- ncol(df) instead or if it is not necessarily last but always called Size then use k <- match("Size", names(df)) instead.
Note: Negation works here because the columns in the question are numeric. Non-numeric columns cannot be negated, so a more general solution is to replace the first line above with the following, where xtfrm is an R function that converts an object to a numeric vector that sorts in the same order as the original.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
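Putting the two ideas together (unknown column names plus xtfrm), a sketch that never names A, B or C could look like the following. It assumes only that the size column is called Size; others and keys are illustrative names.

```r
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)

# Every column except Size, in their existing left-to-right priority.
others <- setdiff(names(df), "Size")

# Size ascending first, then each remaining column descending.
keys <- c(list(df$Size), lapply(df[others], function(col) -xtfrm(col)))

# unname() so that no column name is mistaken for an argument of order().
o <- do.call(order, unname(keys))
sorted <- df[o, ]
sorted
```

The row names of sorted show which original rows ended up where, which makes it easy to compare against the expected result in the question.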
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For a larger number of columns, arrange_ can be used:
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))
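arrange_ has since been deprecated in dplyr. In recent versions (assuming dplyr 1.0.0 or later, where across() is available), something like the following sketch should cover "all columns except Size" without building strings:

```r
library(dplyr)

df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)

# Size ascending, then every other column descending, left to right.
sorted <- arrange(df, Size, across(-Size, desc))
sorted
```

The selection -Size uses tidyselect syntax, so the remaining columns are picked up in their existing order whatever they are called.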
I've been trying to solve this issue for too long now. I have binary insect outbreak data in annual time series format for 300+ years (rows) and 70+ trees (columns).
I'd like to conditionally fill a dataframe / matrix / data table of the same dimensions with cumulative sums, and have it reset to 0 at the end of each outbreak period. I've found a wealth of similar questions / answers that I just can't seem to translate to my issue.
I'll have a snippet of a dataframe, e.g., that looks like this:
t1 t2 t3 t4 t5
2000 1 0 0 1 0
2001 1 0 0 0 1
2002 1 1 0 0 1
2003 0 1 0 1 1
2004 1 1 1 1 1
And I want to create a new df that looks like this:
t1 t2 t3 t4 t5
2000 1 0 0 1 0
2001 2 0 0 0 1
2002 3 1 0 0 2
2003 0 2 0 1 3
2004 1 3 1 2 4
I feel I've gotten close with both data.table and rle, although I've also been going in tons of circles (pretty sure I did it for a single column once, but now can't remember what I did, or why I couldn't get it to work in a loop for all columns...).
I've gotten the following methods to work only to some extent: usually for just a single column, or by adding one df on top of a shifted df, so that a single column might look like 0 1 2 2 1 0 instead of 0 1 2 3 4 0. Some attempts, if this helps, have been variations on code looking like this:
setDT(dt)[, new := t1 + shift(t1, fill = 0)]
apply(
rle( matrix)$lengths
, 2, seq)
rle( matrix[,1])$lengths
for( i in 1:dim(dt)[1]) {
  for( j in 1:dim(dt)[2]) {
    cols <- names(dt) # tried in place of .SD with negative results
    if( dt[i,j] == 1) {
      dt[, new := .SD + shift(.SD, 1L, fill = 0, type = "lag", give.names = TRUE)]
    } else { dt }
  }
}
Some of the main SO sources I've used include these pages: data.table, dplyr, rle
Let me know if I'm missing any important info (I'm new!). & thank you so much for any help!
We can use rle with sequence from base R
df2 <- df1 #create a copy of df1
#loop through the columns of 'df2', apply the `rle`, get the 'sequence'
#of 'lengths' and multiply with the column values.
df2[] <- lapply(df2, function(x) sequence(rle(x)$lengths)*x)
df2
# t1 t2 t3 t4 t5
#2000 1 0 0 1 0
#2001 2 0 0 0 1
#2002 3 1 0 0 2
#2003 0 2 0 1 3
#2004 1 3 1 2 4
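To see why this works, it may help to walk through a single column by hand. The sketch below uses the values of t1 from the question; r, run_pos and out are just illustrative names.

```r
x <- c(1, 1, 1, 0, 1)       # column t1, years 2000 to 2004

r <- rle(x)                 # runs: lengths 3 1 1, values 1 0 1
run_pos <- sequence(r$lengths)
run_pos                     # position inside each run: 1 2 3 1 1

out <- run_pos * x          # multiplying by x zeroes the non-outbreak years
out                         # 1 2 3 0 1
```

The multiplication is what resets the count: positions inside runs of 0 are multiplied by 0, while positions inside runs of 1 are kept as they are.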
You can use data.table combined with the ave function to calculate the cumsum of each column grouped by the rleid of the column itself:
library(data.table)
setDT(dt)[, names(dt) := lapply(.SD, function(col) ave(col, rleid(col), FUN = cumsum))][]
# t1 t2 t3 t4 t5
#1: 1 0 0 1 0
#2: 2 0 0 0 1
#3: 3 1 0 0 2
#4: 0 2 0 1 3
#5: 1 3 1 2 4
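A variation on the same grouping idea that needs neither rle nor rleid is to start a new group at every 0 and take the cumulative sum within each group. This is a base R sketch assuming the columns contain only 0 and 1; df1 is rebuilt here from the snippet in the question.

```r
df1 <- data.frame(t1 = c(1, 1, 1, 0, 1),
                  t2 = c(0, 0, 1, 1, 1),
                  t3 = c(0, 0, 0, 0, 1),
                  t4 = c(1, 0, 0, 1, 1),
                  t5 = c(0, 1, 1, 1, 1),
                  row.names = 2000:2004)

df2 <- df1
# cumsum(x == 0) increases at every 0, so each 0 opens a new group;
# the cumulative sum inside a group then restarts from that 0.
df2[] <- lapply(df1, function(x) ave(x, cumsum(x == 0), FUN = cumsum))
df2
```

Assigning into df2[] keeps the data frame structure and the year row names while replacing every column.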
I need some help with data manipulation in R. I have a long code which does this as a series of steps, but I am looking for a shorter way to do it.
Here is a data frame which has two columns - the first one is an ID and the other has pipe delimited data in it as shown below:
ID DATA
1 a
2 a|b
3 b|c
4 d|e
I need to convert this to this form:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I am hoping there is a simpler way to do this than the lengthy code I have.
Thanks in advance for your help.
This works on the supplied data. First read in your data:
pipdat <- read.table(stdin(),header=TRUE,stringsAsFactors=FALSE)
ID DATA
1 a
2 a|b
3 b|c
4 d|e
# leave a blank line at the end so it stops reading
Now here goes:
nr <- dim(pipdat)[1]
chrs <- strsplit(pipdat[,2],"[|]")
af <- unique(unlist(chrs))
whichlet <- function(a,fac) as.numeric(fac %in% a)
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
(That can be done in fewer lines, but it's handy to see what some of those steps do)
It produces:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I guessed from your post that you wanted ID as row names; if you need it to be a column of data that last line needs to be different.
I'd have used sapply instead of lapply, but you end up with the transpose of the desired matrix. That works if you replace the last line with:
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(ID=1:nr,af)
res
but it might be slower.
---
If you don't follow the line
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
just break it up from the innermost function outward:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
matrix(vres,byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
---
If you need ID as a column of data instead of row names, one way to do it is:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
cbind(ID=1:nr,matrix(vres,byrow=TRUE,nr=nr,dimnames=list(1:nr,af)))
or you could do
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(1:nr,af)
cbind(ID=1:nr,res)
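Another possible route, shown here only as a sketch, is to reshape the split values into long form and let table do the counting. The data frame is rebuilt inline instead of being read from stdin, and parts, long, tab and res are illustrative names.

```r
pipdat <- data.frame(ID = 1:4,
                     DATA = c("a", "a|b", "b|c", "d|e"),
                     stringsAsFactors = FALSE)

# Split on a literal pipe; fixed = TRUE avoids regex escaping.
parts <- strsplit(pipdat$DATA, "|", fixed = TRUE)

# One row per (ID, item) pair.
long <- data.frame(ID   = rep(pipdat$ID, lengths(parts)),
                   item = unlist(parts))

# Cross-tabulate and turn the table back into a data frame with an ID column.
tab <- table(long$ID, long$item)
res <- cbind(ID = as.integer(rownames(tab)), as.data.frame.matrix(tab))
res
```

table counts occurrences, so an item repeated within one DATA string would give a value greater than 1 rather than a strict 0/1 indicator.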