I have this data frame of retweets:
set.seed(28100)
df <- data.frame(user_id = sample(1:8, 10, replace = TRUE),
timestamp = sample(1:1000, 10),
retweet = sample(999:1002, 10, replace=TRUE))
df <- df[with(df, order(retweet, -timestamp)),]
df
# user_id timestamp retweet
# 6 8 513 999
# 9 7 339 999
# 3 3 977 1000
# 2 3 395 1000
# 5 2 333 1000
# 4 5 793 1001
# 1 3 873 1002
# 8 2 638 1002
# 7 4 223 1002
# 10 6 72 1002
There is a unique id for each retweet. For each row I want to assign a rank to the user according to the inverse order of the chain of retweets. The rank should estimate the influence of each user: the longer the chain, the higher the points for the early twitterer. In other words, I want to rank-order each retweet chain by timestamp and assign higher points to those who retweeted it earlier. If two users have posted the same retweet at the same time, they should be assigned the same ranking.
In terms of df, the desired ranking and its aggregation by user would be:
df$ranking <- c(1,2, 1,2,3, 1, 1,2,3,4)
aggregate(ranking~user_id, data=df, sum)
# user_id ranking
# 1 2 5
# 2 3 4
# 3 4 3
# 4 5 1
# 5 6 4
# 6 7 2
# 7 8 1
Using data.table:
library(data.table)
setDT(df)[order(-timestamp), ranking2 := seq_len(.N), by = retweet]
df[, sum(ranking2), keyby = user_id]
# user_id V1
# 1: 2 5
# 2: 3 4
# 3: 4 3
# 4: 5 1
# 5: 6 4
# 6: 7 2
# 7: 8 1
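The question also asks that users who post the same retweet at the same time get the same rank; seq_len() above breaks ties by row order, but data.table::frank() can express them. A sketch (ranking3 is just an illustrative new column; assumption: tied users should share the lower of the tied ranks):
# Same idea as ranking2, but ties on timestamp get the same rank
setDT(df)[, ranking3 := frank(-timestamp, ties.method = "min"), by = retweet]
With ties.method = "dense" the ranks would stay consecutive after a tie instead of skipping a value.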
Let's say I have the following data frame, called example:
ID amount_ID
3 1
3 2
3 3
3 4
3 5
9 1
9 2
9 3
9 4
3 1
3 2
3 3
As you can see, when a new ID starts, amount_ID starts counting from 1 again.
How can I add a timespan column that rescales amount_ID to a range of 100, so that the last amount_ID in each run equals 100? It should look like this:
ID amount_ID timespan
3 1 20 (because 1*100/5=20)
3 2 40 (because 2*100/5=40)
3 3 60
3 4 80
3 5 100 (because 5*100/5=100)
9 1 25 (because 1*100/4=25)
9 2 50
9 3 75
9 4 100
3 1 33.33 (because 1*100/3=33.33)
3 2 66.67
3 3 100
I added some calculations so you can understand what I am trying to calculate. IMPORTANT: there are multiple rows with the same ID.
Thanks!
Multiple ways to do this:
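For reproducibility, here is a minimal construction of the example data; the code below assumes it is stored in a data.frame called df, matching the table above:
df <- data.frame(
  ID        = c(rep(3, 5), rep(9, 4), rep(3, 3)),
  amount_ID = c(1:5, 1:4, 1:3)
)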
You can divide amount_ID by the maximum value of amount_ID within each run of consecutive ID values (rleid() builds an id for each such run).
library(data.table)
setDT(df)[, timespan := amount_ID/max(amount_ID) * 100, rleid(ID)]
df
# ID amount_ID timespan
# 1: 3 1 20.0
# 2: 3 2 40.0
# 3: 3 3 60.0
# 4: 3 4 80.0
# 5: 3 5 100.0
# 6: 9 1 25.0
# 7: 9 2 50.0
# 8: 9 3 75.0
# 9: 9 4 100.0
#10: 3 1 33.3
#11: 3 2 66.7
#12: 3 3 100.0
Divide by number of rows in each group.
setDT(df)[, timespan := amount_ID/.N * 100, rleid(ID)]
Divide by last value of amount_ID
setDT(df)[, timespan := amount_ID/last(amount_ID) * 100, rleid(ID)]
We can use the same approaches in dplyr as well (rleid() still comes from data.table).
library(dplyr)
Divide by max
df %>%
  group_by(gr = rleid(ID)) %>%
  mutate(timespan = amount_ID/max(amount_ID) * 100)
Divide by number of rows
df %>%
  group_by(gr = rleid(ID)) %>%
  mutate(timespan = amount_ID/n() * 100)
Divide by last value
df %>%
  group_by(gr = rleid(ID)) %>%
  mutate(timespan = amount_ID/last(amount_ID) * 100)
We can do this in base R
with(df, 100 * amount_ID / ave(amount_ID,
     with(rle(ID), rep(seq_along(values), lengths)), FUN = max))
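If you prefer to avoid data.table's rleid() inside the dplyr pipeline, the run id can also be built with cumsum() on ID changes; a sketch of the same max-based calculation:
library(dplyr)
df %>%
  # a new group starts whenever ID differs from the previous row
  group_by(gr = cumsum(ID != lag(ID, default = first(ID)))) %>%
  mutate(timespan = amount_ID / max(amount_ID) * 100) %>%
  ungroup()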
I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks, where for weeks < 26 the window is exactly the number of available weeks.
Is there anything similar, but instead of a rolling sum, a function to identify the most common value? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the "desired" column as the fourth column):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[,desired:=frollapply(sales,26,median,adaptive=TRUE)] #This only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
       fifelse(is.na(x <- roll_medianr(sales, 26L)),
               c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
               x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
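If you would rather not add RcppRoll, the same adaptive median can be computed directly (more slowly) with sapply over an expanding-then-rolling window; a sketch using the same data object (med_sales3 is just an illustrative column name):
# window is the last 26 rows, or all available rows before week 26
data[, med_sales3 := sapply(seq_len(.N),
                            function(i) median(sales[max(1L, i - 25L):i]))]
Like the RcppRoll answer above, this runs over the whole sales column; add by = product if the window should reset for each product.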
I am working with the following two simple data sets:
(myData <- data.frame(ID=c(1:7, 5), Sum=c(10, 20, 30, 40, 50, 60, 700, 200)))
# ID Sum
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 700
# 8 5 200
and
(myMap <- data.frame(ID=c(1:5, 7), Name=c("a", "b", "c", "d", "e", "g")))
# ID Name
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 7 g
I map the data onto the map this way:
myData$Name<-myMap$Name[match(myData$ID, myMap$ID)]
However, since there is no map entry for ID == 6, the output is:
ID Sum Name
1 1 10 a
2 2 20 b
3 3 30 c
4 4 40 d
5 5 50 e
6 6 60 <NA>
7 7 700 g
8 5 200 e
What I am trying to do now: in records where Name is NA, Name should become the ID.
My attempts:
myData$Dummy<-ifelse( is.na(myData$Name),myData$ID, myData$Name)
or
for (i in 1:length(myData$Name)) {
  if (is.na(myData$Name[i])) {
    x <- myData$ID[i]
    # print(x)
    myData$Name[i] <- as.factor(x)
    print(myData$Name[i])
  }
}
are wrong. Could you please give me a hint?
It's the fact that the column you think is character is really a factor. Either use stringsAsFactors=FALSE when creating the data frame or you'll need to account for it when manipulating the data. I've provided dplyr + piping and base R solutions below. Note the use of left_join (dplyr) or merge (base) vs your subset & matching:
library(dplyr)
myData <- read.csv(text="ID;Sum
1;10
2;20
3;30
4;40
5;50
6;60
7;700
5;200", sep=";")
myMap <- read.csv(text="ID;Name
1;a
2;b
3;c
4;d
5;e
7;g", sep=";")
# dplyr -------------------------------------------------------------------
myData %>%
  left_join(myMap) %>%
  mutate(Name = as.character(Name),
         Name = ifelse(is.na(Name), ID, Name)) -> dplyr_myData
## Joining by: "ID"
dplyr_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 6 60 6
## 7 7 700 g
## 8 5 200 e
# base --------------------------------------------------------------------
base_myData <- merge(myData, myMap, all.x=TRUE)
base_myData$Name <- as.character(base_myData$Name)
base_myData$Name <- ifelse(is.na(base_myData$Name),
base_myData$ID, base_myData$Name)
base_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 5 200 e
## 7 6 60 6
## 8 7 700 g
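A small variant of the dplyr version, as a sketch: with character columns (stringsAsFactors = FALSE, or R >= 4.0 where that is the default), coalesce() expresses the NA fallback directly:
library(dplyr)
myData %>%
  left_join(myMap, by = "ID") %>%
  # coalesce needs matching types, so cast both to character
  mutate(Name = coalesce(as.character(Name), as.character(ID)))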
An option using data.table
library(data.table)#1.9.5+
setkey(setDT(myData), ID)[myMap, Name:=i.Name][is.na(Name),
Name:= as.character(ID)]
# ID Sum Name
#1: 1 10 a
#2: 2 20 b
#3: 3 30 c
#4: 4 40 d
#5: 5 50 e
#6: 5 200 e
#7: 6 60 6
#8: 7 700 g
NOTE: As commented by #Arun, in the devel version v1.9.5, we can also set the key as an argument inside setDT, i.e. setDT(myData, key='ID')
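In current data.table versions the setkey() step can also be skipped by joining with on=; a sketch equivalent to the answer above (starting again from the original myData and myMap):
library(data.table)
setDT(myData)[myMap, on = "ID", Name := i.Name][is.na(Name), Name := as.character(ID)]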
I really need your help with a problem that may seem easy for you to solve.
Currently I am working on a project which involves some panel regressions. I have several large CSV files (up to 12 million entries per sheet) formatted as in the picture attached, where the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.
In order to use the plm()-function I need all these files to convert to the following data structure:
ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...
I am really struggling with this transformation and am quite frustrated right now, i.e. where do I get the identifier and the time index from?
I would really appreciate it if you could provide me with information on how to solve this problem.
If my question is not clear to you, just ask.
Best regards and thanks in advance
Suppose your wide data look as follows:
mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-4L))
> mydata
V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23 3
4 13 24 34
The following code can be used for your data without changing anything; for illustration, I used just the data above. I used the base R reshape function:
long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
                times = names(mydata), timevar = "id",
                varying = list(names(mydata)), v.names = "value",
                new.row.names = 1:(dim(mydata)[2] * dim(mydata)[1]),
                direction = "long")
> long
id value time
1 V1 10 1
2 V1 11 2
3 V1 12 3
4 V1 13 4
5 V2 21 1
6 V2 22 2
7 V2 23 3
8 V2 24 4
9 V3 31 1
10 V3 32 2
11 V3 3 3
12 V3 34 4
long$id <- substr(long$id, 2, 4)  # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
id time value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 13
5 2 1 21
6 2 2 22
7 2 3 23
8 2 4 24
9 3 1 31
10 3 2 32
11 3 3 3
12 3 4 34
Here is an alternative: Use Stacked from my "splitstackshape" package.
Here it is applied on #Metrics's sample data:
# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata),
id.vars="id", var.stubs="V", sep = "V")
# id .time_1 V
# 1: 1 1 10
# 2: 1 2 21
# 3: 1 3 31
# 4: 2 1 11
# 5: 2 2 22
# 6: 2 3 32
# 7: 3 1 12
# 8: 3 2 23
# 9: 3 3 3
# 10: 4 1 13
# 11: 4 2 24
# 12: 4 3 34
It would be very fast if your data are large. Here are the speeds for the 12MB dataset you linked to. The sorting is different but the data are the same.
It still isn't faster than stack though (but at some point, stack starts to slow down).
See the system.times below:
reshape()
system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
times = names(x), timevar = "id",
varying = list(names(x)),
v.names="value",
new.row.names = 1:prod(dim(x)),
direction = "long"))
# user system elapsed
# 53.11 0.00 53.11
head(out)
# id value time
# 1 V1 0.003808635 1
# 2 V1 -0.018807416 2
# 3 V1 0.008875447 3
# 4 V1 0.001148695 4
# 5 V1 -0.019365004 5
# 6 V1 0.012436560 6
Stacked()
system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x),
id.vars="id", var.stubs="V",
sep = "V"))
# user system elapsed
# 0.30 0.00 0.29
out2
# id .time_1 V
# 1: 1 1 0.003808635
# 2: 1 10 -0.014184635
# 3: 1 100 -0.013341843
# 4: 1 101 0.006784138
# 5: 1 102 0.006463707
# ---
# 963868: 2317 95 0.009569451
# 963869: 2317 96 0.002497771
# 963870: 2317 97 0.009202519
# 963871: 2317 98 0.017007545
# 963872: 2317 99 -0.002495842
stack()
system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
# user system elapsed
# 0.09 0.00 0.09
head(out3)
# id values ind
# 1 1 0.003808635 V1
# 2 2 -0.018807416 V1
# 3 3 0.008875447 V1
# 4 4 0.001148695 V1
# 5 5 -0.019365004 V1
# 6 6 0.012436560 V1
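For completeness, the same wide-to-long reshape can be sketched with data.table::melt(), which is also very fast on large data (assuming the wide data frame x from the timings above):
library(data.table)
out4 <- melt(as.data.table(x, keep.rownames = "time"),
             id.vars = "time", variable.name = "id", value.name = "value")
# rownames come through as character; make time and id numeric, as in myout above
out4[, `:=`(time = as.integer(time), id = as.integer(sub("^V", "", id)))]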
[Updated: tried to clarify and simplify; corrected sample code and data.]
I have a set of measurements taken over a period of days. The range of numbers that can be captured in any measurement is 1-25 (in real life, beyond the test set, the range could be as high as 100 or as low as 20).
I'd like a way to tally how many events have passed since a specific number last occurred, regardless of which measurement column it appeared in. The count should reset after the number matches, as shown below.
V1, V2, Vn are the values captured.
Match1, Match2, Matchn are the "count since the number was last encountered" columns.
Note: the Matchn count is incremented regardless of which Vx column n is encountered in.
Any help is much appreciated.
This is somewhat related to my earlier post here.
Sample input
library(data.table)
t <- data.table(
Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
V1 = c(4, 2, 3, 1,7,22,35,3,29,36),
V2 = c(2, 5, 12, 4,8,2,38,50,4,1)
)
Code for creating the sample output:
t$match1 <- c(1,2,3,4,1,2,3,4,5,1)
t$match2 <- c(1,1,2,3,4,5,1,2,3,4)
t$match3 <- c(1,2,3,1,2,3,4,5,1,2)
> t
Date V1 V2 match1 match2 match3
1: 2013-05-01 4 2 1 1 1
2: 2013-05-02 2 5 2 1 2
3: 2013-05-03 3 12 3 2 3
4: 2013-05-04 1 4 4 3 1
5: 2013-05-05 7 8 1 4 2
6: 2013-05-06 22 2 2 5 3
7: 2013-05-07 35 38 3 1 4
8: 2013-05-08 3 50 4 2 5
9: 2013-05-09 29 4 5 3 1
10: 2013-05-10 36 1 1 4 2
I think the OP has a bunch of typos in it; as far as I understand, you want this:
t <- data.table(
Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
V1 = c(4, 2, 3, 1,7,22,35,52,29,36),
V2 = c(2, 5, 2, 4,8,47,38,50,4,1)
)
t[, inclusive.match.1 := 1:.N, by = cumsum(V1 == 1 | V2 == 1)]
t[, exclusive.match.1 := 1:.N, by = rev(cumsum(rev(V1 == 1 | V2 == 1)))]
t
# Date V1 V2 inclusive.match.1 exclusive.match.1
# 1: 2013-05-01 4 2 1 1
# 2: 2013-05-02 2 5 2 2
# 3: 2013-05-03 3 2 3 3
# 4: 2013-05-04 1 4 1 4
# 5: 2013-05-05 7 8 2 1
# 6: 2013-05-06 22 47 3 2
# 7: 2013-05-07 35 38 4 3
# 8: 2013-05-08 52 50 5 4
# 9: 2013-05-09 29 4 6 5
#10: 2013-05-10 36 1 1 6
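The rev(cumsum(rev(...))) grouping makes the matching row the last row of its group, so the count restarts on the following row. The same pattern generalizes to any target value; for example, a sketch counting rows since a 2 was last seen in either column (match.2 is just an illustrative name):
t[, match.2 := seq_len(.N), by = rev(cumsum(rev(V1 == 2 | V2 == 2)))]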