Adding additional observations to panel data in R

I am trying to add additional years to my panel data and am wondering whether there is a quick way of doing it. Keep in mind that my real data has T=6 and i=4000.
# Here is my input
data = data.frame(time=c(30,40,50,30,40,50,30,40,50),
id=c(1,1,1,2,2,2,3,3,3),
d=c(1,4,7,8,14,2,41,11,61))
# declare panel data: individual and time index (pdata.frame comes from the plm package)
library(plm)
pd = pdata.frame(data, c("id","time"), drop.index=FALSE)
#this is what I want out...
data.out = data.frame(time=c(30,40,50,60,30,40,50,60,30,40,50,60),
id=c(1,1,1,1,2,2,2,2,3,3,3,3),
d=c(1,4,7,8,9,14,2,41,50,11,61,70))
# declare panel data individ and time
pd.data.out = pdata.frame(data.out, c("id","time"), drop.index=FALSE)

I am not quite sure what you are doing but this might help:
data = data.frame(time=c(30,40,50,30,40,50,30,40,50),
id=c(1,1,1,2,2,2,3,3,3),
d=c(1,4,7,8,14,2,41,11,61))
newdata = data.frame(time=c(60,60,60),
id=c(1,2,3),
d=c(9,50,70))
combodata = rbind(data,newdata)
data.out = combodata[order(combodata$id,combodata$time), ]
rownames(data.out) = NULL
to produce
> data.out
time id d
1 30 1 1
2 40 1 4
3 50 1 7
4 60 1 9
5 30 2 8
6 40 2 14
7 50 2 2
8 60 2 50
9 30 3 41
10 40 3 11
11 50 3 61
12 60 3 70
and I think this is what you want for time and id, though d is marginally different. If the rows do not need to be ordered then the last three lines of the code can be condensed to
data.out = rbind(data,newdata)

Got it... just create a new data frame with every time/id combination and merge the original data into it.
time = rep(c(unique(as.numeric(as.character(pd$time))),max(as.numeric(as.character(pd$time)))+10), length(unique(pd$id)))
# each id must be repeated once per time point: the existing periods plus the new one
id = rep( unique(pd$id), each=length(unique(pd$time))+1)
data2 = data.frame(time, id)
data.out = merge(data2, pd, all.x=TRUE)
data.out = data.out[with(data.out, order(id,time) ), ]
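For a larger panel (such as T=6, i=4000), the same idea can be sketched in base R without plm: build every id/time combination with expand.grid and left-join the observed rows. This is only a sketch, assuming that the new period lies 10 time units after the last observed one and that its d values are not yet known, so they come out as NA.

```r
data <- data.frame(time = rep(c(30, 40, 50), times = 3),
                   id   = rep(1:3, each = 3),
                   d    = c(1, 4, 7, 8, 14, 2, 41, 11, 61))

# all time points: the observed ones plus one assumed new period (max + 10)
all_times <- c(sort(unique(data$time)), max(data$time) + 10)

# one row per id/time combination
grid <- expand.grid(time = all_times, id = unique(data$id))

# left join: keeps every grid row; d is NA where nothing was observed
data.out <- merge(grid, data, by = c("id", "time"), all.x = TRUE)
data.out <- data.out[order(data.out$id, data.out$time), c("time", "id", "d")]
rownames(data.out) <- NULL
```

The NA rows can then be filled in once the new observations are available, or the result can be passed to pdata.frame as before.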


R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
library(tidyverse)  # tibble, dplyr, purrr and tidyr are all used below
set.seed(1)
n_customers <- 10
customer_df <-
tibble(id = c(1:n_customers),
score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
47.0,
53.0,
57.0,
60.5,
63.0,
65.5,
67.3,
69.7,
71.7,
74.0,
76.3,
79.0,
82.5,
85.5,
100.00
),
rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
customer_df %>%
mutate(rating = map(.x = score,
.f = ~max(select(filter(rating_map, .x < max),rating))
)
) %>%
unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100k in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
Thanks!
customer_df$rating <- length(rating_map$max) -
cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes 1/20th of a second on 1M rows, which sounds like a >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
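To see what right = FALSE changes, here is a small sketch using only the first four breaks of the rating map. The scores are made up for illustration; 53 sits exactly on a break, which is the case where the two settings disagree.

```r
breaks <- c(47, 53, 57, 60.5)
scores <- c(50, 53, 56, 60)

# right = FALSE: intervals are [47,53), [53,57), [57,60.5)
left_closed <- cut(scores, breaks = breaks, labels = FALSE, right = FALSE)

# right = TRUE (the default): intervals are (47,53], (53,57], (57,60.5]
right_closed <- cut(scores, breaks = breaks, labels = FALSE, right = TRUE)

print(left_closed)   # 1 2 2 3
print(right_closed)  # 1 1 2 3
```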
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
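As a quick sanity check, the cut and findInterval versions can be compared directly. The sketch below copies the breaks from the rating map and uses every integer score from 50 to 80. The two approaches agree on that range but differ at the edges, so it is worth deciding what should happen to scores below 47 or equal to 100 in the real data.

```r
rating_max <- c(47, 53, 57, 60.5, 63, 65.5, 67.3, 69.7, 71.7, 74,
                76.3, 79, 82.5, 85.5, 100)
score <- 50:80

by_cut  <- length(rating_max) -
  cut(score, breaks = rating_max, labels = FALSE, right = FALSE)
by_find <- length(rating_max) - findInterval(score, rating_max)

all(by_cut == by_find)  # the two agree on this range

# edge cases outside [47, 100): cut() returns NA, findInterval() does not
edge <- c(40, 100)
edge_cut  <- length(rating_max) -
  cut(edge, breaks = rating_max, labels = FALSE, right = FALSE)
edge_find <- length(rating_max) - findInterval(edge, rating_max)
```

Here edge_cut is NA for both values, while edge_find gives 15 for a score of 40 and 0 for a score of 100.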

Wide to long with many different columns

I have used pivot_longer before, but this time I have a much more complex wide data frame and I cannot sort it out. The example code below gives a reproducible data frame. I haven't dealt with anything like this before, so I'm not sure whether it is even correct to put this type of data frame into long format.
df <- data.frame(
ID = as.numeric(c("7","8","10","11","13","15","16")),
AGE = as.character(c("45 – 54","25 – 34","25 – 34","25 – 34","25 – 34","18 – 24","35 – 44")),
GENDER = as.character(c("Female","Female","Male","Female","Other","Male","Female")),
SD = as.numeric(c("3","0","0","0","3","2","0")),
GAMING = as.numeric(c("0","0","0","0","2","2","0")),
HW = as.numeric(c("2","2","0","2","2","2","2")),
R1_1 = as.numeric(c("10","34","69","53","79","55","28")),
M1_1 = as.numeric(c("65","32","64","53","87","55","27")),
P1_1 = as.numeric(c("65","38","67","54","88","44","26")),
R1_2 = as.numeric(c("15","57","37","54","75","91","37")),
M1_2 = as.numeric(c("90","26","42","56","74","90","37")),
P1_2 = as.numeric(c("90","44","33","54","79","95","37")),
R1_3 = as.numeric(c("5","47","80","27","61","19","57")),
M1_3 = as.numeric(c("30","71","80","34","71","15","57")),
P1_3 = as.numeric(c("30","36","81","35","62","8","56")),
R2_1 = as.numeric(c("10","39","75","31","71","80","59")),
M2_1 = as.numeric(c("90","51","74","15","70","75","61")),
P2_1 = as.numeric(c("90","52","35","34","69","83","60")),
R2_2 = as.numeric(c("10","45","31","54","39","95","77")),
M2_2 = as.numeric(c("60","70","40","78","5","97","75")),
P2_2 = as.numeric(c("60","40","41","58","9","97","76")),
R2_3 = as.numeric(c("5","38","78","45","25","16","22")),
M2_3 = as.numeric(c("30","34","84","62","33","52","20")),
P2_3 = as.numeric(c("30","34","82","45","32","16","22")),
R3_1 = as.numeric(c("10","40","41","42","62","89","41")),
M3_1 = as.numeric(c("90","67","37","40","27","89","42")),
P3_1 = as.numeric(c("90","34","51","44","38","84","43")),
R3_2 = as.numeric(c("10","37","20","54","8","93","69")),
M3_2 = as.numeric(c("60","38","21","62","5","95","71")),
P3_2 = as.numeric(c("60","38","23","65","14","92","69")),
R3_3 = as.numeric(c("5","30","62","11","60","32","52")),
M3_3 = as.numeric(c("30","67","34","55","45","25","45")),
P3_3 = as.numeric(c("30","28","41","24","53","23","52")),
R1_4 = as.numeric(c("10","40","61","17","39","72","25")),
M1_4 = as.numeric(c("45","20","63","25","62","70","23")),
P1_4 = as.numeric(c("45","52","56","16","26","72","27")),
R2_4 = as.numeric(c("5","21","70","33","80","68","30")),
M2_4 = as.numeric(c("35","21","69","27","85","69","23")),
P2_4 = as.numeric(c("35","32","34","25","79","63","29")),
R3_4 = as.numeric(c("10","29","68","21","8","71","41")),
M3_4 = as.numeric(c("50","37","66","28","33","65","41")),
P3_4 = as.numeric(c("50","38","47","28","24","71","41"))
)
I would like to rearrange it into long format, where the new column names are extracted from the old ones such that, for example, in R1_1:
R is the name of the column containing the value previously stored in R1_1
1 (the first character after 'R' in R1_1) is the value used in column Speed
1 (the last character of 'R1_1') is the value used in column Sound
Basically, each row corresponds to 1 question answered by 1 person, and each question was answered through 3 different ratings (R, M, P).
Thank you!
If I understood you correctly, the following should work:
library(tidyr)  # provides pivot_longer(), pivot_wider() and re-exports %>%
df %>%
pivot_longer(
cols = matches('[RMP]\\d_\\d'),
names_to = c('RMP', 'Speed', 'Sound'),
values_to = 'Data',
names_pattern = '([RMP])(\\d)_(\\d)'
) %>%
pivot_wider(names_from = RMP, values_from = Data)
This assumes that both “speed” and “sound” are single-digit values. If there’s the possibility of multiple digits, the occurrences of \\d in the patterns above need to be replaced by \\d+.
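To illustrate how the capture groups in the pattern split a column name, here is a small sketch that uses base R's regexec as a stand-in for what names_pattern does. The column names are taken from the question, except for "M12_3", which is a made-up name to show the multi-digit variant.

```r
pattern <- "([RMP])(\\d)_(\\d)"
cols <- c("R1_1", "M2_3", "P3_4")

# regexec() returns the full match followed by each capture group
parts <- regmatches(cols, regexec(pattern, cols))
pieces <- do.call(rbind, parts)
colnames(pieces) <- c("column", "RMP", "Speed", "Sound")
print(pieces)

# multi-digit variant: \\d+ matches one or more digits
multi <- regmatches("M12_3", regexec("([RMP])(\\d+)_(\\d+)", "M12_3"))[[1]]
print(multi)
```

Each capture group lines up with one entry of names_to, which is why three groups are needed for 'RMP', 'Speed' and 'Sound'.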
A solution using our good ol' workhorse reshape. First we grep the names with a "Wd_d" pattern, as well as their suffixes "d_d", for later use in reshape. The varying columns are passed as a list with one element per prefix: a plain index vector such as 7:42 is split by v.names in alphabetical order (M, P, R), which would label the R, M and P columns incorrectly.
nm <- names(df[grep("_\\d", names(df))])
times <- unique(substr(nm, 2, 4))
prefixes <- unique(substr(nm, 1, 1)) ## "R" "M" "P"
varying <- lapply(prefixes, function(p) nm[substr(nm, 1, 1) == p])
res <- reshape(df, idvar="ID", varying=varying, v.names=prefixes,
times=times, direction="long")
This gets us close to the result; we just need to strsplit the newly created "time" variable at the "_" and cbind the pieces to the data.
res <- cbind(res, setNames(type.convert(do.call(rbind.data.frame,
strsplit(res$time, "_")), as.is=TRUE),
c("Speed", "Sound")))
res <- res[order(res$AGE), ] ## some ordering
Result
head(res)
# ID AGE GENDER SD GAMING HW time R M P Speed Sound
# 15.1_1 15 18 – 24 Male 2 2 2 1_1 55 55 44 1 1
# 15.1_2 15 18 – 24 Male 2 2 2 1_2 91 90 95 1 2
# 15.1_3 15 18 – 24 Male 2 2 2 1_3 19 15 8 1 3
# 15.2_1 15 18 – 24 Male 2 2 2 2_1 80 75 83 2 1
# 15.2_2 15 18 – 24 Male 2 2 2 2_2 95 97 97 2 2
# 15.2_3 15 18 – 24 Male 2 2 2 2_3 16 52 16 2 3

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see whether the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
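The loop above can also be written without iteration. The following is a sketch in base R using outer() to compare every time against every start/stop pair at once, with the data from the question typed in by hand. The column names Between, Match and Near, and the tol padding of 2 seconds, are assumptions made for illustration. Note that outer() builds a full times-by-intervals matrix, so this may need too much memory for very large inputs.

```r
truth <- data.frame(ID    = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID    = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

# hit[i, j] is TRUE when possible time i lies inside truth interval j
hit <- outer(possible$Times, truth$Start, ">=") &
       outer(possible$Times, truth$Stop,  "<=")

possible$Between <- as.integer(rowSums(hit) > 0)  # time falls in any interval
truth$Match      <- as.integer(colSums(hit) > 0)  # interval contains any time

# variant: also accept times within `tol` seconds of the start or stop
tol <- 2
near <- outer(possible$Times, truth$Start - tol, ">=") &
        outer(possible$Times, truth$Stop + tol,  "<=")
possible$Near <- as.integer(rowSums(near) > 0)
```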

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),exp2=sort(x = unique(y),decreasing = F))
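The small numbers in the question's edit are most likely the factor's internal level codes rather than the ids themselves. A minimal sketch with made-up ids in the same style shows the difference between as.numeric() on a factor and as.numeric(as.character()):

```r
sid <- factor(c("5402", "25403", "25485", "5402", "25403"))

levels(sid)  # levels are sorted as text, so "5402" comes last

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(unique(sid))

# converting through character first recovers the actual ids
values <- as.numeric(as.character(unique(sid)))
sorted_ids <- sort(values)

print(codes)       # 3 1 2
print(sorted_ids)  # 5402 25403 25485
```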

Grouping R variables based on sub-groups

I have data formatted as
PERSON_A PERSON_B MEET LEAVE
Each row describes when PERSON_A met PERSON_B at time MEET and when they said "bye" to each other at time LEAVE. Times are expressed in seconds, and a small part of the data is at http://pastie.org/2825794 (simple.dat).
What I need is to count the number of meetings, grouped by day. At the moment I have code that works, but it is not pretty. I'd like help transforming it into code that reflects the grouping I'm trying to do, e.g. using ddply. My main aim is to learn from this case; there are probably many mistakes in this code regarding good practices in R.
library(plyr)
data = read.table("simple.dat", stringsAsFactors=FALSE)
names(data)=c('PERSON_A','PERSON_B','MEET','LEAVE')
attach(data)
min_interval = min(MEET)
max_interval = max(LEAVE)
interval = max_interval - min_interval
day = 86400
number_of_days = floor(interval/day)
g = data.frame(MEETINGS=c(0:number_of_days)) # just to store the result
g[,1] = 0
start_offset = min_interval # start of the first day
for (interval in c(0:number_of_days)) {
end_offset = start_offset + day
meetings = (length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_A) + length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_B))
g[interval+1, ] = meetings
start_offset = end_offset # start next day
}
g
This code iterates over the days (intervals of 86400 seconds) and stores the number of meetings in the data frame g. The correct output (shown below) of this code when executed on the linked dataset gives the number of meetings for each line (day).
MEETINGS
1 38
2 10
3 16
4 18
5 24
6 6
7 4
8 10
9 28
10 14
11 22
12 2
13 .. 44 0 # I simplified the output here
45 2
Anyway, I know that I could use ddply to get the number of meetings for each pair of nodes:
contacts <- ddply(data, .(PERSON_A, PERSON_B), summarise
, CONTACTS = length(c(PERSON_A, PERSON_B)) /2
)
but there is a huge hill for me between this and the result I need.
As an end note, I read How to make a great R reproducible example? and tried my best :)
Thanks,
try this:
> d2 <- transform(data, m = floor(MEET/86400) + 1, l = floor(LEAVE/86400) + 1)
> d3 <- subset(d2, m == l)
> table(d3$m) * 2
1 2 3 4 5 6 7 8 9 10 11 12 45
38 10 16 18 24 6 4 10 28 14 22 2 2
floor(x/(60*60*24)) is a quick way to convert seconds into days.
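Since simple.dat is not reproduced here, the following sketch uses a few made-up meetings to show the same idea end to end. Unlike the snippet above, it measures days from the first MEET (as the loop in the question does) and uses factor levels so that days without meetings show up as 0. The factor of 2 mirrors the question's code, which counts both PERSON_A and PERSON_B for each meeting.

```r
day <- 86400
meetings <- data.frame(PERSON_A = c(1, 1, 2, 3, 4),
                       PERSON_B = c(2, 3, 3, 4, 5),
                       MEET     = c(100, 5000, 90000, 86000, 260000),
                       LEAVE    = c(200, 6000, 91000, 87000, 260100))

offset <- min(meetings$MEET)  # day 1 starts at the first meeting
m <- floor((meetings$MEET  - offset) / day) + 1  # day on which the meeting starts
l <- floor((meetings$LEAVE - offset) / day) + 1  # day on which it ends

same_day <- m == l  # keep only meetings that start and end on the same day

# fixing the levels keeps empty days in the table with a count of 0
counts <- table(factor(m[same_day], levels = seq_len(max(l)))) * 2
print(counts)
```

With this toy data the fourth meeting spans two days and is dropped, and day 3 appears with a count of 0.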
