Data Transformation in R for Panel Regression

I need your help with a problem that may be easy for you to solve.
I'm currently working on a project that involves panel regressions. I have several large CSV files (up to 12 million entries per sheet) formatted as in the sample data shown below, where the columns (V1, V2, ...) are individuals and the rows (1, 2, 3, ...) are time identifiers.
In order to use the plm() function, I need to convert all these files to the following data structure:
ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...
I'm really struggling with this transformation; in particular, where do I get the identifier and the time index from?
I would really appreciate any information on how to solve this problem.
If my question is not clear to you, just ask.
Best regards and thanks in advance.
The sample data look as follows:

mydata <- structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L)),
                    .Names = c("V1", "V2", "V3"), class = "data.frame",
                    row.names = c(NA, -4L))
> mydata
V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23 3
4 13 24 34
The following code can be used on your data without changing anything; for illustration, I used just the data above. It relies on the base R reshape() function:
long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
                times = names(mydata), timevar = "id",
                varying = list(names(mydata)), v.names = "value",
                new.row.names = 1:(ncol(mydata) * nrow(mydata)),
                direction = "long")
> long
id value time
1 V1 10 1
2 V1 11 2
3 V1 12 3
4 V1 13 4
5 V2 21 1
6 V2 22 2
7 V2 23 3
8 V2 24 4
9 V3 31 1
10 V3 32 2
11 V3 3 3
12 V3 34 4
long$id <- substr(long$id, 2, 4)  # end position 4 allows up to three digits, covering your 416 variables
myout <- long[, c(1, 3, 2)]
> myout
id time value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 13
5 2 1 21
6 2 2 22
7 2 3 23
8 2 4 24
9 3 1 31
10 3 2 32
11 3 3 3
12 3 4 34
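Once the data are in this long format, they can go straight into plm. Here is a minimal sketch (assuming the plm package; the regressor x1 is purely illustrative, since your real explanatory variables live in the other files):
# install.packages("plm")
library(plm)

# Declare the panel structure: individual = id, time = time
pdat <- pdata.frame(myout, index = c("id", "time"))

# Illustrative fixed-effects call; replace x1 with a real regressor
# once the other files are merged in:
# mod <- plm(value ~ x1, data = pdat, model = "within")
# summary(mod)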

Here is an alternative: Use Stacked from my "splitstackshape" package.
Here it is applied to @Metrics's sample data:
# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata),
        id.vars = "id", var.stubs = "V", sep = "V")
# id .time_1 V
# 1: 1 1 10
# 2: 1 2 21
# 3: 1 3 31
# 4: 2 1 11
# 5: 2 2 22
# 6: 2 3 32
# 7: 3 1 12
# 8: 3 2 23
# 9: 3 3 3
# 10: 4 1 13
# 11: 4 2 24
# 12: 4 3 34
It should be very fast if your data are large. Here are the timings for the 12 MB dataset you linked to; the sorting is different, but the data are the same.
It still isn't faster than stack(), though (but at some point, stack() starts to slow down).
See the system.time() results below:
reshape()
system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
                           times = names(x), timevar = "id",
                           varying = list(names(x)),
                           v.names = "value",
                           new.row.names = 1:prod(dim(x)),
                           direction = "long"))
# user system elapsed
# 53.11 0.00 53.11
head(out)
# id value time
# 1 V1 0.003808635 1
# 2 V1 -0.018807416 2
# 3 V1 0.008875447 3
# 4 V1 0.001148695 4
# 5 V1 -0.019365004 5
# 6 V1 0.012436560 6
Stacked()
system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x),
                            id.vars = "id", var.stubs = "V",
                            sep = "V"))
# user system elapsed
# 0.30 0.00 0.29
out2
# id .time_1 V
# 1: 1 1 0.003808635
# 2: 1 10 -0.014184635
# 3: 1 100 -0.013341843
# 4: 1 101 0.006784138
# 5: 1 102 0.006463707
# ---
# 963868: 2317 95 0.009569451
# 963869: 2317 96 0.002497771
# 963870: 2317 97 0.009202519
# 963871: 2317 98 0.017007545
# 963872: 2317 99 -0.002495842
stack()
system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
# user system elapsed
# 0.09 0.00 0.09
head(out3)
# id values ind
# 1 1 0.003808635 V1
# 2 2 -0.018807416 V1
# 3 3 0.008875447 V1
# 4 4 0.001148695 V1
# 5 5 -0.019365004 V1
# 6 6 0.012436560 V1
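Note that stack() labels its columns values/ind, and the id column bound on above is really the time index. A small sketch of the bookkeeping needed to match the id/time/value layout of the other answers (assuming the column names follow the V<number> pattern):
out3 <- cbind(time = 1:nrow(x), stack(x))
out3$id <- as.integer(sub("^V", "", out3$ind))  # "V12" -> 12
out3 <- out3[, c("id", "time", "values")]
head(out3)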

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each set of rows sharing a sample_id, I would like to add a unique identifier, as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
               FUN = function(z) {
                 if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z)
               })
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
  group_by(sample_id) %>%
  mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-")
               else as.character(sample_id)) %>%
  ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the base R version, nothing will change here: base R allows you to be a little sloppy. But if we rely on that and need the new field to always be character, then in the rare circumstance where all rows have unique sample_id values, the column will remain integer. dplyr is much more careful in guarding against this: if you run the tidyverse code without as.character, you'll see an error.
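A quick demonstration of that corner case, on hypothetical toy data where every sample_id is unique:
u <- data.frame(sample_id = 5:7)  # no duplicates at all
u$id2 <- ave(u$sample_id, u$sample_id,
             FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else z)
class(u$id2)  # "integer": paste() never ran, so nothing forced a character conversion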
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
  dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for rows that share a sample_id, maybe you can try make.unique, as below,
transform(
  df,
  sample_id = ave(as.character(sample_id), sample_id, FUN = function(x) make.unique(x, sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2

Replace values in set of columns based on condition

I have a dataframe like this
id v1 v2 v3 v4 v5 pos
1 11 12 11 10 10 3
2 17 11 22 40 23 4
1 11 22 50 10 10 2
I would like to change its values based on a condition related to pos to get:
id v1 v2 v3 v4 v5 pos
1 11 12 12 12 12 3
2 17 11 22 22 22 4
1 11 11 11 11 11 2
So basically, each value from column v[pos] onward is replaced by the value in v[pos - 1]; the variable pos defines where the replacement starts.
Thanks!
An approach using matrix indexing, which should be efficient in running time.
It is not super efficient in terms of memory, however, because it makes a logical matrix the same size as the input data:
vars <- paste0("v", 1:5)
nv <- dat[vars][cbind(seq_len(nrow(dat)), dat$pos - 1)]
ow <- col(dat[vars]) >= dat$pos
dat[vars][ow] <- nv[row(ow)[ow]]
# id v1 v2 v3 v4 v5 pos
#1 1 11 12 12 12 12 3
#2 2 17 11 22 22 22 4
#3 1 11 11 11 11 11 2
Explanation:
Get the variables of interest:
vars <- paste0("v", 1:5)
Get the new value to write into each row (the value in column pos - 1):
nv <- dat[vars][cbind(seq_len(nrow(dat)), dat$pos - 1)]
Make a logical matrix of the cells to overwrite:
ow <- col(dat[vars]) >= dat$pos
Overwrite those cells, using the row identifier to pick the appropriate value:
dat[vars][ow] <- nv[row(ow)[ow]]
Quick comparative timing using a larger dataset:
dat <- dat[rep(1:3,1e6),]
# indexing
# user system elapsed
# 1.36 0.31 1.68
# apply
# user system elapsed
# 77.30 0.83 78.41
# gather/spread
# user system elapsed
# 293.43 3.64 299.10
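For reference, the indexing figure above can be reproduced along these lines (a sketch, reusing the 3-million-row dat created above; the apply and gather/spread answers were wrapped in system.time() the same way):
system.time({
  vars <- paste0("v", 1:5)
  nv <- dat[vars][cbind(seq_len(nrow(dat)), dat$pos - 1)]
  ow <- col(dat[vars]) >= dat$pos
  dat[vars][ow] <- nv[row(ow)[ow]]
})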
Here is one idea with gather and spread.
library(tidyverse)
dat2 <- dat %>%
  rowid_to_column() %>%
  gather(v, value, starts_with("v")) %>%
  group_by(rowid) %>%
  mutate(value = ifelse(row_number() >= (pos - 1), nth(value, (pos - 1)[[1]]), value)) %>%
  spread(v, value) %>%
  ungroup() %>%
  select(names(dat))
dat2
# # A tibble: 3 x 7
# id v1 v2 v3 v4 v5 pos
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 11 12 12 12 12 3
# 2 2 17 11 22 22 22 4
# 3 1 11 11 11 11 11 2
DATA
dat <- read.table(text = "id v1 v2 v3 v4 v5 pos
1 11 12 11 10 10 3
2 17 11 22 40 23 4
1 11 22 50 10 10 2",
header = TRUE)
Using apply from base R (df holds the same data as dat above; unname keeps apply from reusing the first row's names, and the rep() count is ncol(df) - x["pos"] - 1 so that all five v columns are kept):
data.frame(t(apply(df, 1, function(x)
  unname(c(x[1:x["pos"]], rep(x[x["pos"]], ncol(df) - x["pos"] - 1), x["pos"])))))
# X1 X2 X3 X4 X5 X6 X7
#1  1 11 12 12 12 12  3
#2  2 17 11 22 22 22  4
#3  1 11 11 11 11 11  2

R: Ordering one column conditionally on another and partial order value

I have this dataframe of retweets
set.seed(28100)
df <- data.frame(user_id = sample(1:8, 10, replace = TRUE),
                 timestamp = sample(1:1000, 10),
                 retweet = sample(999:1002, 10, replace = TRUE))
df <- df[with(df, order(retweet, -timestamp)), ]
df
df
# user_id timestamp retweet
# 6 8 513 999
# 9 7 339 999
# 3 3 977 1000
# 2 3 395 1000
# 5 2 333 1000
# 4 5 793 1001
# 1 3 873 1002
# 8 2 638 1002
# 7 4 223 1002
# 10 6 72 1002
There is a unique id for each retweet. For each row I want to assign a rank to the user according to the inverse time order of the retweet chain. The rank should estimate the influence of each user: the longer the chain, the higher the points for the early twitterer. In other words, I want to rank-order each retweet chain based on the timestamp and assign higher points to those who retweeted it earlier. If two users have posted the same retweet at the same time, they should be assigned the same ranking.
Or in df
df$ranking <- c(1,2, 1,2,3, 1, 1,2,3,4)
aggregate(ranking~user_id, data=df, sum)
# user_id ranking
# 1 2 5
# 2 3 4
# 3 4 3
# 4 5 1
# 5 6 4
# 6 7 2
# 7 8 1
Using data.table:
library(data.table)
setDT(df)[order(-timestamp), ranking2 := seq_len(.N), by = retweet]
df[, sum(ranking2), keyby = user_id]
# user_id V1
# 1: 2 5
# 2: 3 4
# 3: 4 3
# 4: 5 1
# 5: 6 4
# 6: 7 2
# 7: 8 1
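One caveat: seq_len(.N) hands out distinct ranks even when two timestamps tie, whereas the question asks tied retweeters to share a rank. A possible tweak using data.table's frank() (a sketch; ties.method = "dense" gives equal, consecutive ranks):
setDT(df)[, ranking3 := frank(-timestamp, ties.method = "dense"), by = retweet]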

Restructuring team- to individual-level data in R (while retaining team-level information)

My current data looks like this:
Person Team
10 100
11 100
12 100
10 200
11 200
14 200
15 200
I want to infer who knew one another, based on what teams they were on together. I also want a count of how many times a dyad was on a team together, and I want to keep track of the team identification codes that link each pair of people. In other words, I want to create a data set that looks like this:
Person1 Person2 Count Team1 Team2 Team3
10 11 2 100 200 NA
10 12 1 100 NA NA
11 12 1 100 NA NA
10 14 1 200 NA NA
10 15 1 200 NA NA
11 14 1 200 NA NA
11 15 1 200 NA NA
The resulting data set captures the relationships that can be inferred based on the teams that were outlined in the original data set. The "Count" variable reflects the number of instances that a pair of people was on a team together. The "Team1", "Team2", and "Team3" variables list the team ID(s) that link each pair of people to one another. It doesn't make a difference which person/team ID is listed first versus second. Teams range in size from 2 members to 8 members.
Here's a "data.table" solution that seems to get to where you want to get (albeit with quite a mouthful of code):
library(data.table)
dcast.data.table(
  dcast.data.table(
    as.data.table(d)[, combn(Person, 2), by = Team][
      , ind := paste0("Person", c(1, 2))][
      , time := sequence(.N), by = list(Team, ind)],
    time + Team ~ ind, value.var = "V1")[
      , c("count", "time") := list(.N, sequence(.N)), by = list(Person1, Person2)],
  Person1 + Person2 + count ~ time, value.var = "Team")
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA
Update: Step-by-step version of the above
To understand what's happening above, here's a step-by-step approach:
## The following would be a long data.table with 4 columns:
## Team, V1, ind, and time
step1 <- as.data.table(d)[
  , combn(Person, 2), by = Team][
  , ind := paste0("Person", c(1, 2))][
  , time := sequence(.N), by = list(Team, ind)]
head(step1)
# Team V1 ind time
# 1: 100 10 Person1 1
# 2: 100 11 Person2 1
# 3: 100 10 Person1 2
# 4: 100 12 Person2 2
# 5: 100 11 Person1 3
# 6: 100 12 Person2 3
## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
# time Team Person1 Person2
# 1: 1 100 10 11
# 2: 1 200 10 11
# 3: 2 100 10 12
# 4: 2 200 10 14
# 5: 3 100 11 12
# 6: 3 200 10 15
# 7: 4 200 11 14
# 8: 5 200 11 15
# 9: 6 200 14 15
## Create a "count" column and a "time" column,
## grouped by "Person1" and "Person2".
## Count is for the count column.
## Time is for going to a wide format
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)),
               by = list(Person1, Person2)]
step3
# time Team Person1 Person2 count
# 1: 1 100 10 11 2
# 2: 2 200 10 11 2
# 3: 1 100 10 12 1
# 4: 1 200 10 14 1
# 5: 1 100 11 12 1
# 6: 1 200 10 15 1
# 7: 1 200 11 14 1
# 8: 1 200 11 15 1
# 9: 1 200 14 15 1
## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time,
                        value.var = "Team")
out
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA
Following @Gregor and using Gregor's data, I tried to add team columns. I could not produce exactly what you requested, but this may still be useful. Using full_join from the dev version of dplyr (dplyr 0.4), I did the following. I created a data frame of all pairwise combinations of Person within each team using combn, and saved it as the object a. Then I split a by team and used full_join. In this way I tried to create team columns, at least for teams 100 and 200. I used rename to change the column names and select to put the columns in your order.
library(dplyr)
a <- group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame()
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y) %>%
  select(Person1, Person2, Team1, Team2)
# Person1 Person2 Team1 Team2
#1 10 11 100 200
#2 10 12 100 NA
#3 11 12 100 NA
#4 10 14 NA 200
#5 10 15 NA 200
#6 11 14 NA 200
#7 11 15 NA 200
#8 14 15 NA 200
EDIT
I am sure there are better ways of doing this. But, this is the closest I can do. I tried to add the count using another join in this version.
a <- group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame()
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
  full_join(count(a, X1, X2), by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y, Count = n) %>%
  select(Person1, Person2, Count, Team1, Team2)
# Person1 Person2 Count Team1 Team2
#1 10 11 2 100 200
#2 10 12 1 100 NA
#3 11 12 1 100 NA
#4 10 14 1 NA 200
#5 10 15 1 NA 200
#6 11 14 1 NA 200
#7 11 15 1 NA 200
#8 14 15 1 NA 200
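A more general reshaping of the pair list a, handling any number of shared teams in one pass, might look like the sketch below (written in newer tidyr idiom than the answer above; pivot_wider needs tidyr >= 1.0):
library(dplyr)
library(tidyr)

a %>%
  group_by(X1, X2) %>%
  mutate(Count = n(),                               # teams shared by this pair
         slot  = paste0("Team", row_number())) %>%  # Team1, Team2, ...
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = Team) %>%
  rename(Person1 = X1, Person2 = X2)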
The counts are easy to obtain with a self-join, which I think is easiest to do using sqldf. (Note that I probably think sqldf is easiest because I'm not too good with data.table.) Editing to include @G. Grothendieck's suggestion:
# your data
dd <- structure(list(Person = c(10L, 11L, 12L, 10L, 11L, 14L, 15L),
                     Team = c(100L, 100L, 100L, 200L, 200L, 200L, 200L)),
                .Names = c("Person", "Team"), class = "data.frame",
                row.names = c(NA, -7L))
library(sqldf)
dyads <- sqldf("select dd1.Person Person1, dd2.Person Person2
                     , count(*) Count
                     , group_concat(dd1.Team) Teams
                from dd dd1
                inner join dd dd2
                  on dd1.Team = dd2.Team and dd1.Person < dd2.Person
                group by dd1.Person, dd2.Person")
Person1 Person2 Count Teams
1 10 11 2 100,200
2 10 12 1 100
3 10 14 1 200
4 10 15 1 200
5 11 12 1 100
6 11 14 1 200
7 11 15 1 200
8 14 15 1 200
We can then split the string to get the columns you want.
library(stringr)
cbind(dyads, apply(str_split_fixed(dyads$Teams, ",",
                                   n = max(str_count(dyads$Teams, pattern = ",")) + 1),
                   MARGIN = 2, FUN = as.numeric))
Person1 Person2 Count Teams 1 2
1 10 11 2 100,200 100 200
2 10 12 1 100 100 NA
3 10 14 1 200 200 NA
4 10 15 1 200 200 NA
5 11 12 1 100 100 NA
6 11 14 1 200 200 NA
7 11 15 1 200 200 NA
8 14 15 1 200 200 NA
I'll leave the column renaming to you.
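For the record, that renaming could be done like this (a sketch; the number of team columns depends on the widest dyad):
out <- cbind(dyads, apply(str_split_fixed(dyads$Teams, ",",
                                          n = max(str_count(dyads$Teams, ",")) + 1),
                          MARGIN = 2, FUN = as.numeric))
names(out)[-(1:4)] <- paste0("Team", seq_len(ncol(out) - 4))
out$Teams <- NULL  # drop the concatenated helper column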
Here is a general solution:
library(dplyr)
library(reshape2)
find.friends <- function(d, n = 2) {
  d$exist <- TRUE
  z <- dcast(d, Person ~ Team, value.var = 'exist')
  # Person 100 200
  # 1 10 TRUE TRUE
  # 2 11 TRUE TRUE
  # 3 12 TRUE NA
  # 4 14 NA TRUE
  # 5 15 NA TRUE
  pairs.per.team <- sapply(
    sort(unique(d$Team)),
    function(team) {
      non.na <- !is.na(z[, team])
      if (sum(non.na) < n) return()
      combns <- t(combn(z$Person[non.na], n))
      cbind(combns, team)
    }
  )
  df <- as.data.frame(do.call(rbind, pairs.per.team))
  if (nrow(df) == 0) return()
  persons <- sprintf('Person%i', 1:n)
  colnames(df)[1:n] <- persons
  # Person1 Person2 team
  # 1 10 11 100
  # 2 10 12 100
  # 3 11 12 100
  # 4 10 11 200
  # 5 10 14 200
  # 6 10 15 200
  # 7 11 14 200
  # 8 11 15 200
  # 9 14 15 200
  # Personally, I find the data frame above most suitable for further analysis.
  # The following code is needed only to make the output compatible with the author's one.
  df2 <- df %>%
    grouped_df(as.list(persons)) %>%
    mutate(i.team = paste0('team', seq_along(team)))
  # Person1 Person2 team i.team
  # 1 10 11 100 team1
  # 2 10 12 100 team1
  # 3 11 12 100 team1
  # 4 10 11 200 team2
  # 5 10 14 200 team1
  # 6 10 15 200 team1
  # 7 11 14 200 team1
  # 8 11 15 200 team1
  # 9 14 15 200 team1
  # count the number of teams per pair
  df2.count <- df %>%
    grouped_df(as.list(persons)) %>%
    summarize(cnt = length(team))
  # reshape the data
  df3 <- dcast(df2,
               as.formula(sprintf('%s~i.team', paste(persons, collapse = '+'))),
               value.var = 'team')
  df3$count <- df2.count$cnt
  df3
}
Your data are:
d <- structure(list(Person = c("10", "11", "12", "10", "11", "14",
"15"), Team = c("100", "100", "100", "200", "200", "200", "200"
)), .Names = c("Person", "Team"), row.names = c(NA, -7L), class = "data.frame")
By using
find.friends(d,n=2)
you should obtain the desired output.
By changing n, you can also find triads, tetrads etc.
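For example, to look for triads instead of pairs (teams with fewer than n members are skipped by the sum(non.na) < n guard in the function):
find.friends(d, n = 3)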

R data.table: Count events since last occurance (multiple, inclusive/exclusive)

[Updated: tried to clarify and simplify; corrected sample code and data.]
I have a set of measurements taken over a period of days. The range of numbers that can appear in any measurement is 1-25 (in real life, given the test set, the range could be as high as 100 or as low as 20).
I'd like a way to tally how many events have passed since a specific number last occurred, regardless of which measurement column it appeared in, and I'd like the count to reset after each match, as shown below.
V1, V2, ..., Vn are the captured values.
match1, match2, ..., matchn are the "count since last encountered" columns.
Note: the matchn count is incremented regardless of which Vx column n is encountered in.
Any help is much appreciated.
This is somewhat related to my earlier post here.
Sample input
library(data.table)
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5",
                   "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 3, 29, 36),
  V2 = c(2, 5, 12, 4, 8, 2, 38, 50, 4, 1)
)
Code for creating the sample output:
t$match1 <- c(1,2,3,4,1,2,3,4,5,1)
t$match2 <- c(1,1,2,3,4,5,1,2,3,4)
t$match3 <- c(1,2,3,1,2,3,4,5,1,2)
> t
Date V1 V2 match1 match2 match3
1: 2013-05-01 4 2 1 1 1
2: 2013-05-02 2 5 2 1 2
3: 2013-05-03 3 12 3 2 3
4: 2013-05-04 1 4 4 3 1
5: 2013-05-05 7 8 1 4 2
6: 2013-05-06 22 2 2 5 3
7: 2013-05-07 35 38 3 1 4
8: 2013-05-08 3 50 4 2 5
9: 2013-05-09 29 4 5 3 1
10: 2013-05-10 36 1 1 4 2
I think the OP has a bunch of typos in it; as far as I understand, you want this:
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5",
                   "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 52, 29, 36),
  V2 = c(2, 5, 2, 4, 8, 47, 38, 50, 4, 1)
)
t[, inclusive.match.1 := 1:.N, by = cumsum(V1 == 1 | V2 == 1)]
t[, exclusive.match.1 := 1:.N, by = rev(cumsum(rev(V1 == 1 | V2 == 1)))]
t
# Date V1 V2 inclusive.match.1 exclusive.match.1
# 1: 2013-05-01 4 2 1 1
# 2: 2013-05-02 2 5 2 2
# 3: 2013-05-03 3 2 3 3
# 4: 2013-05-04 1 4 1 4
# 5: 2013-05-05 7 8 2 1
# 6: 2013-05-06 22 47 3 2
# 7: 2013-05-07 35 38 4 3
# 8: 2013-05-08 52 50 5 4
# 9: 2013-05-09 29 4 6 5
#10: 2013-05-10 36 1 1 6
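The same cumsum() trick generalizes to any target number. A sketch of a helper (the name count_since and its signature are hypothetical) that adds inclusive and exclusive counters for a given value across any set of measurement columns:
library(data.table)

count_since <- function(DT, value, cols = c("V1", "V2")) {
  # TRUE on rows where `value` appears in any of the watched columns
  hit <- Reduce(`|`, lapply(cols, function(cl) DT[[cl]] == value))
  inc <- paste0("inclusive.match.", value)
  exc <- paste0("exclusive.match.", value)
  # inclusive: the match row itself restarts the counter at 1
  DT[, (inc) := seq_len(.N), by = cumsum(hit)]
  # exclusive: the counter restarts on the row after the match
  DT[, (exc) := seq_len(.N), by = rev(cumsum(rev(hit)))]
  DT[]
}

count_since(t, 1)  # reproduces the two columns added above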
