Merging time-series and time-series cross-section (panel) dataframes in R

I am having an issue merging 2 dataframes in R - one is a time series cross-section (i.e. a panel) and the other is a simple time series. Suppose I have two dataframes, df1 and df2, that I would like to merge. The panel dataframe df1 is given by
id year var1
1 80 3
1 81 5
1 82 7
1 83 9
2 80 5
2 81 5
2 82 7
2 83 5
3 80 9
3 81 9
3 82 7
3 83 3
while the time series dataframe df2 is given by
year var2
80 10
81 15
82 17
83 19
I would like to merge df1 and df2 into a third dataframe df, while preserving the time series cross-section row ordering of df1. However, when I use the command
df <- merge(df1, df2, by="year")
the new dataframe clusters the observations by year.
year id var1 var2
80 1 3 10
80 2 5 10
80 3 9 10
81 1 5 15
81 2 5 15
81 3 9 15
82 1 7 17
82 2 7 17
82 3 7 17
83 1 9 19
83 2 5 19
83 3 3 19
Does anyone know how I can make the row ordering in df the same as in df1? Thanks in advance!
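One possible way to keep df1's row ordering, sketched here as a suggestion rather than a quoted answer: add var2 by looking up each row's year with match(), or use dplyr's left_join(), which preserves the row order of its first argument.
# base R: match() leaves the rows of df1 exactly as they are
df <- df1
df$var2 <- df2$var2[match(df1$year, df2$year)]
# dplyr alternative (left_join keeps the order of df1)
# library(dplyr)
# df <- left_join(df1, df2, by = "year")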

Related

Creating a new dataset for each combination of rows in groups

I'm trying to create a dataset for each combination of rows from separate groups. Ideally, one row from each group would be selected and there would be a dataset for every combination. I have a dataset that looks similar in structure to the sample below:
Name Group Stat1 Stat2
1 1 a 63 38
2 2 a 33 62
3 3 b 3 66
4 4 b 57 67
5 5 c 42 69
6 6 c 47 14
7 7 c 16 10
8 8 d 21 46
9 9 d 72 1
Trying to get the end result of the first dataset to look like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 8 d 21 46
With the second dataset looking like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 9 d 72 1
Until every combination has been exhausted. I've tried strategies using apply functions and combn but cannot seem to get the result I want. This does not seem too challenging to me conceptually, so I'm not sure what I'm missing.
Any help would be greatly appreciated! Thanks in advance!
Lots of ways to approach this. A simple solution is to just generate all 4-row combos, then subset to those with all distinct Group values. I named your data df and assumed Name is a unique row id. If that's not true, you could replace df$Name with 1:nrow(df).
# All 4 row combos of row ids
combs <- combn(df$Name, 4)
# Match group labels to row ids
g <- matrix(df$Group[combs], nrow = 4)
# 4 row combs filtered to all distinct group vals
combs <- combs[,apply(g, 2, function(i) all(!duplicated(i)))]
# For each 4 row combo, extract rows from the dataframe
final_list <- apply(combs, 2, function(i) df[i,])
final_list[1:3]
[[1]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
8 8 d 21 46
[[2]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
9 9 d 72 1
[[3]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
6 6 c 47 14
8 8 d 21 46
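As a sketch of an alternative (assuming the same df, and that every combination should contain exactly one row per group), you can split the row indices by Group and take their Cartesian product with expand.grid(); this also generalises to any number of groups:
# one vector of row indices per group
idx_by_group <- split(seq_len(nrow(df)), df$Group)
# one row of idx_combos per combination: one index drawn from each group
idx_combos <- expand.grid(idx_by_group)
# extract the corresponding rows of df for each combination
final_list <- apply(idx_combos, 1, function(i) df[i, ])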

R: Creating a vector with certain values from another vector

So I have a csv file with column headers ID, Score, and Age.
So in R I have,
data <- read.csv(file.choose(), header=T)
attach(data)
I would like to create two new vectors with people's scores whose ages are below 70 and above 70 years old. I thought there was a nice, quick way to do this but I can't find it anywhere. Thanks for any help.
Example of what data looks like
ID, Score, Age
1, 20, 77
2, 32, 65
.... etc
And I am trying to make two vectors: one with the scores of all people younger than 70 and one with the scores of all people older than 70.
Assuming data looks like this:
Score Age
1 1 29
2 5 39
3 8 40
4 3 89
5 5 31
6 6 23
7 7 75
8 3 3
9 2 23
10 6 54
.. . ..
you can use
df_old <- data[data$Age >= 70,]
df_young <- data[data$Age < 70,]
which gives you
> df_old
Score Age
4 3 89
7 7 75
11 7 97
13 3 101
16 5 89
18 5 89
19 4 96
20 3 97
21 8 75
and
> df_young
Score Age
1 1 29
2 5 39
3 8 40
5 5 31
6 6 23
8 3 3
9 2 23
10 6 54
12 4 23
14 2 23
15 4 45
17 7 53
PS: if you only want the scores and not the age, you could use this:
df_old <- data[data$Age >= 70, "Score"]
df_young <- data[data$Age < 70, "Score"]
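For completeness, a small sketch (using the same data object) that produces both score vectors in a single call with split():
scores <- split(data$Score, data$Age >= 70)
old_scores <- scores[["TRUE"]]    # ages 70 and above
young_scores <- scores[["FALSE"]] # ages below 70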

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that it only keeps rows where a, b, and c are all greater than the corresponding values in the second data frame for that row's class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank, this can be done with a non-equi join.
# load data.table and coerce both data frames in place
library(data.table)
# non-equi join to find which rows of df1 fulfill the conditions in df2
tmp <- setDT(df1)[setDT(df2), on = .(class, a > a, b > b, c > c),
                  nomatch = 0L, which = TRUE]
# return subset in original row order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to #Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
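For comparison, a base-R sketch of the same filter (assuming df1 and df2 as built in the question): look up each row's class thresholds in df2 with match() and compare columnwise, which avoids an explicit join.
i <- match(df1$class, df2$class)
keep <- df1$a > df2$a[i] & df1$b > df2$b[i] & df1$c > df2$c[i]
df1[keep, ]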

R: how to use expand.grid to generate combinations based on group

I am trying to get all combinations of values per group. I want to prevent combination of values between different groups.
To create all combinations of values (no matter which group a value belongs to), I can use:
expand.grid(value, value)
The expected result should be a subset of the result of the previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#expected result is a subset of allComb.
#Note: the first column shows the row number from allComb.
#Blank lines separate the per-group combinations and are shown only for clarity.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5
34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9
78 2 2
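One possible approach, sketched here as a suggestion rather than a quoted answer: run expand.grid() separately within each group via split(), then bind the per-group results together (row names will differ from the allComb numbering above):
per_group <- lapply(split(base$value, base$group),
                    function(v) expand.grid(Var1 = v, Var2 = v))
result <- do.call(rbind, per_group)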

Mutate dataframes by matching ids in r [duplicate]

This question already has answers here: Simultaneously merge multiple data.frames in a list (9 answers)
I have three data frames:
df1:
id score1
1 50
2 23
3 40
4 68
5 82
6 38
df2:
id score2
1 33
2 23
4 64
5 12
6 32
df3:
id score3
1 50
2 23
3 40
4 68
5 82
I want to merge the three scores into one dataframe like this, using NA to denote missing values:
id score1 score2 score3
1 50 33 50
2 23 23 23
3 40 NA 40
4 68 64 68
5 82 12 82
6 38 32 NA
Or like this, deleting the NA values:
id score1 score2 score3
1 50 33 50
2 23 23 23
4 68 64 68
5 82 12 82
However, mutate (in dplyr) does not accept columns of different lengths, so I cannot use it here. How can I do this?
You can try
Reduce(function(...) merge(..., by='id'), list(df1, df2, df3))
# id score1 score2 score3
#1 1 50 33 50
#2 2 23 23 23
#3 4 68 64 68
#4 5 82 12 82
If you have many dataset object names with pattern 'df' followed by number
Reduce(function(...) merge(..., by='id'), mget(paste0('df',1:3)))
Or, instead of paste0('df', 1:3), you can use ls(pattern='df\\d+'), as commented by #DavidArenburg.
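If the first desired output (the one that keeps the NA rows) is wanted instead, the same Reduce() pattern can be combined with a full outer merge; this is a sketch, not part of the quoted answer:
Reduce(function(...) merge(..., by='id', all=TRUE), list(df1, df2, df3))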
