The output from a loop is not correct (R)

I have two tables FDate and Task as follows:
FDate
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B
4: 3 130 C
5: 4 180 <NA>
6: 5 200 A
7: 5 200 D
8: 6 230 <NA>
Task
Date Task
1 NA A
2 NA B
3 NA C
4 6 D
I want to copy the Task name for matching Dates from table Task into table FDate. This is the code I tried:
for (i in 1:nrow(Task)) {
  FDate$Task[FDate$Date %in% Task$Date[i]] <- Task$Task[i]
}
This is the output
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B
4: 3 130 C
5: 4 180 <NA>
6: 5 200 A
7: 5 200 D
8: 6 230 4
I expect the output to be D, not 4. I can't figure out what is wrong.

The issue is that the column is a factor, which gets coerced to its integer storage code during the assignment. Convert it to character before looping:
FDate$Task <- as.character(FDate$Task)
Task$Task <- as.character(Task$Task)
Better would be to use stringsAsFactors = FALSE, either while reading (read.csv/read.table) or when creating the data with data.frame. In R versions before 4.0.0, the default in both cases is stringsAsFactors = TRUE, which can create issues like this one.
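To see the coercion concretely, here is a minimal sketch (toy vectors, not the OP's data) of why a factor assignment can surface as a number:

```r
# A factor stores its labels as integer codes under the hood
f <- factor("D", levels = c("A", "B", "C", "D"))
as.integer(f)    # 4 -- the storage code that can leak into the output
as.character(f)  # "D" -- the label we actually want to assign
```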
Also, this can be done with a join (assuming the datasets are data.tables):
library(data.table)
FDate[na.omit(df2), Task := i.Task, on = .(Date)]
FDate
# Date Cycle Task
#1: 1 90 D
#2: 2 100 A
#3: 3 130 B
#4: 3 130 C
#5: 4 180 <NA>
#6: 5 200 A
#7: 5 200 D
#8: 6 230 D
NOTE: I changed the second data.table's identifier to 'df2' instead of 'Task', as there is a column 'Task' in each dataset.

Related

Find non-overlapping values from two tables in R

I have two tables as follows:
library(data.table)
Input <- data.table("Date" = 1:10, "Cycle" = c(90,100,130,180,200,230,250,260,300,320))
Date Cycle
1: 1 90
2: 2 100
3: 3 130
4: 4 180
5: 5 200
6: 6 230
7: 7 250
8: 8 260
9: 9 300
10: 10 320
FDate <- data.table("Date" = 1:9,
                    "Cycle" = c(90,100,130,180,200,230,250,260,300),
                    "Task" = c("D","A","B,C",NA,"A,D","D","C","D","A,C,D"))
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B,C
4: 4 180 <NA>
5: 5 200 A,D
6: 6 230 D
7: 7 250 C
8: 8 260 D
9: 9 300 A,C,D
I just want to have an output table with the non-overlapping Date and corresponding Cycle.
I tried with setdiff but it doesn't work. I expect my output to look like this:
Date Cycle
10 320
When I tried setdiff(FDate$Date, Input$Date),
it returned integer(0).
We can use fsetdiff from data.table by including only the common columns in both datasets
fsetdiff(Input, FDate[ , names(Input), with = FALSE])
# Date Cycle
#1: 10 320
Or an anti-join, as @Frank mentioned:
Input[!FDate, on=.(Date)]
# Date Cycle
#1: 10 320
In the OP's code,
setdiff(FDate$Date, Input$Date)
the first argument is the 'Date' column from 'FDate'. All of the elements in that column are also in the master data 'Input$Date', so it returns integer(0). If we do the reverse, it would return 10.
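As a quick sketch of this direction issue with plain setdiff (toy vectors standing in for the Date columns):

```r
# setdiff(x, y) keeps the elements of x that are absent from y,
# so the argument order matters
setdiff(1:9, 1:10)  # integer(0): every element of 1:9 is also in 1:10
setdiff(1:10, 1:9)  # 10: the only Date missing from the smaller table
```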

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum columns starting with an 'x', and sum columns starting with an 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, eg using grepl().
Could you please advise me how to do so? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by @Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is another option: grep("^x", ...) returns the column names starting with x, mget fetches those columns' data, and after unlisting the result you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
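In more recent data.table versions (1.12.0+), the pattern matching can also move into .SDcols via patterns(), which avoids mget entirely. A sketch under that assumption, joining the two per-group sums on desc:

```r
library(data.table)

dt <- data.table(x1 = 1:10, x2 = 1:10, y1 = 10:1, y2 = 10:1,
                 desc = c("a","a","a","b","b","b","b","b","c","c"))

# Sum all x* columns per group, then all y* columns, and join on desc
sx <- dt[, .(Sumx = sum(unlist(.SD))), by = desc, .SDcols = patterns("^x")]
sy <- dt[, .(Sumy = sum(unlist(.SD))), by = desc, .SDcols = patterns("^y")]
res <- sx[sy, on = "desc"]
res
#    desc Sumx Sumy
# 1:    a   12   54
# 2:    b   60   50
# 3:    c   38    6
```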

Create Time Based User Sessions in R

I have a dataset which consists of three columns: user, action and time, which is a log of user actions. The data looks like this:
user action time
1: 618663 34 1407160424
2: 617608 33 1407160425
3: 89514 34 1407160425
4: 71160 33 1407160425
5: 443464 32 1407160426
---
996: 146038 8 1407161349
997: 528997 9 1407161350
998: 804302 8 1407161351
999: 308922 8 1407161351
1000: 803763 8 1407161352
I want to separate sessions for each user based on action times. Actions performed within a certain period (for example, one hour) are assumed to belong to one session.
The simple solution is to use a for loop and compare action times for each user, but that's not efficient and my data is very large.
Is there any method I can use to overcome this problem?
I can group by user, but separating each user's actions into different sessions is difficult for me :-)
Try
library(data.table)
dt <- rbind(
data.table(user=1, action=1:10, time=c(1,5,10,11,15,20,22:25)),
data.table(user=2, action=1:5, time=c(1,3,10,11,12))
)
dt[, session := cumsum(c(TRUE, !(diff(time) <= 2))), by = user][]
# user action time session
# 1: 1 1 1 1
# 2: 1 2 5 2
# 3: 1 3 10 3
# 4: 1 4 11 3
# 5: 1 5 15 4
# 6: 1 6 20 5
# 7: 1 7 22 5
# 8: 1 8 23 5
# 9: 1 9 24 5
# 10: 1 10 25 5
# 11: 2 1 1 1
# 12: 2 2 3 1
# 13: 2 3 10 2
# 14: 2 4 11 2
# 15: 2 5 12 2
I used a difference of <= 2 to group actions into the same session; any gap larger than 2 starts a new session.
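The core of the trick is just base R; a small sketch with the first user's times and a gap of 2:

```r
time <- c(1, 5, 10, 11, 15, 20, 22, 23, 24, 25)
gap <- 2
# TRUE starts the series at session 1; every later gap > 2
# increments the running session counter
session <- cumsum(c(TRUE, diff(time) > gap))
session
# [1] 1 2 3 3 4 5 5 5 5 5
```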

How to normalize multiple-values-column in a data table in R

I have a data.table as below:
order products value
1000 A|B 10
2000 B|C 20
3000 A|C 30
4000 B|C|D 5
5000 C|D 15
And I need to split the products column and transform/normalize it to be used like this:
order prod.seq prod.name value
1000 1 A 10
1000 2 B 10
2000 1 B 20
2000 2 C 20
3000 1 A 30
3000 2 C 30
4000 1 B 5
4000 2 C 5
4000 3 D 5
5000 1 C 15
5000 2 D 15
I guess I could do it with a custom for loop, but I'd like to know a more advanced way to do it using apply/ddply-style methods. Any suggestions?
First, convert to a character/string:
DT[,products:=as.character(products)]
Then you can split the string:
DT[,{
x = strsplit(products,"\\|")[[1]]
list( prod.seq = seq_along(x), prod_name = x )
}, by=.(order,value)]
which gives
order value prod.seq prod_name
1: 1000 10 1 A
2: 1000 10 2 B
3: 2000 20 1 B
4: 2000 20 2 C
5: 3000 30 1 A
6: 3000 30 2 C
7: 4000 5 1 B
8: 4000 5 2 C
9: 4000 5 3 D
10: 5000 15 1 C
11: 5000 15 2 D
Here is another option:
library(splitstackshape)
out = cSplit(dat, "products", "|", direction = "long")
out[, prod.seq := seq_len(.N), by = value]
#> out
# order products value prod.seq
# 1: 1000 A 10 1
# 2: 1000 B 10 2
# 3: 2000 B 20 1
# 4: 2000 C 20 2
# 5: 3000 A 30 1
# 6: 3000 C 30 2
# 7: 4000 B 5 1
# 8: 4000 C 5 2
# 9: 4000 D 5 3
#10: 5000 C 15 1
#11: 5000 D 15 2
After the cSplit step, using ddply:
library(plyr)
ddply(out, .(value), mutate, prod.seq = seq_len(length(order)))
using dplyr
library(dplyr)
out %>% group_by(value) %>% mutate(prod.seq = row_number(order))
using lapply
rbindlist(lapply(split(out, out$value),
function(x){x$prod.seq = seq_len(length(x$order));x}))
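For comparison, the same normalization can be sketched in base R with strsplit plus rep/sequence, no extra packages (toy data mirroring part of the question):

```r
d <- data.frame(order = c(1000, 2000, 4000),
                products = c("A|B", "B|C", "B|C|D"),
                value = c(10, 20, 5),
                stringsAsFactors = FALSE)

parts <- strsplit(d$products, "|", fixed = TRUE)
n <- lengths(parts)  # number of products per order

out <- data.frame(order = rep(d$order, n),
                  prod.seq = sequence(n),   # 1..n within each order
                  prod.name = unlist(parts),
                  value = rep(d$value, n),
                  stringsAsFactors = FALSE)
out
#   order prod.seq prod.name value
# 1  1000        1         A    10
# 2  1000        2         B    10
# 3  2000        1         B    20
# 4  2000        2         C    20
# 5  4000        1         B     5
# 6  4000        2         C     5
# 7  4000        3         D     5
```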

Generating recursive ID by multi-variate group using data.table in R

I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like to get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3
This seems to return what you want
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we divide the data up by region and then give a unique ID to each market within each region via the factor() function, extracting the underlying numeric index.
From data.table 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties.method = "dense"), by = region]
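Both answers boil down to a dense rank within each region; the same idea in base R (toy vector, one region's markets) is match() against the sorted unique values:

```r
market <- c(5, 6, 7, 7)  # markets observed within one region
# Dense rank: the position of each value among the sorted unique values
match(market, sort(unique(market)))
# [1] 1 2 3 3
```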
