I have two data frames with different numbers of rows and different columns, such as:
a (12981 lines and 3 columns)
Year Month Day
1980 1 1
1980 1 2
1980 1 3
1980 1 4
1980 1 5
...
1980 1 31
1980 2 1
1980 2 2
1980 2 3
1980 2 4
1980 2 5
...
b (426 lines and 3 columns)
Year Month Value
1980 1 356
1980 2 389
1980 3 378
1980 4 450
1980 5 500
...
1981 2 450
I want to add the "Value" column (from b) to a to get something like this:
a_withValues (12981 lines with 4 columns)
Year Month Day Value
1980 1 1 356
1980 1 2 356
1980 1 3 356
1980 1 4 356
1980 1 5 356
...
1980 1 31 356
1980 2 1 389
1980 2 2 389
1980 2 3 389
1980 2 4 389
1980 2 5 389
...
In other words, if a$Year and a$Month are equal to b$Year and b$Month, I want to add (as a new column in a) the corresponding value from b$Value.
There is a base R solution to this: use the function merge. By default it joins on the columns whose names appear in both data frames, so in your case it will work out of the box:
a <- expand.grid(year=1980, month=1:2, day=1:30)
b <- data.frame(year=1980, month=1:2, value=c(356,389))
a_with_b <- merge(a,b)
Here:
> head(a)
year month day
1 1980 1 1
2 1980 2 1
3 1980 1 2
4 1980 2 2
5 1980 1 3
6 1980 2 3
> head(b)
year month value
1 1980 1 356
2 1980 2 389
> head(a_with_b)
year month day value
1 1980 1 1 356
2 1980 1 8 356
3 1980 1 2 356
4 1980 1 9 356
5 1980 1 3 356
6 1980 1 10 356
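If some months in a have no counterpart in b, the default merge silently drops those rows. A minimal sketch, assuming the column names Year, Month, Day and Value from the question and made-up toy data, that names the key columns explicitly and keeps every row of a:

```r
# Toy data with the question's column names (values are illustrative).
a <- data.frame(Year = 1980,
                Month = rep(1:3, each = 2),
                Day = rep(1:2, times = 3))
b <- data.frame(Year = 1980, Month = 1:2, Value = c(356, 389))

# all.x = TRUE keeps rows of a that have no match in b (Value becomes NA).
a_with_values <- merge(a, b, by = c("Year", "Month"), all.x = TRUE)

# merge() may reorder rows, so sort explicitly afterwards.
a_with_values <- a_with_values[order(a_with_values$Year,
                                     a_with_values$Month,
                                     a_with_values$Day), ]
a_with_values
```

Here month 3 is missing from b, so its two rows are kept with Value set to NA instead of disappearing.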
What you are looking for is a join of the two data.frames (at least to my understanding). A join matches the key columns of the two tables and then adds the values from one as extra columns of the other.
You can merge the two datasets like this, using data.table:
library(data.table)
dt1 <- data.table(Year = 1980,
Month = 1:3,
Day = 1)
dt1
# Year Month Day
# 1: 1980 1 1
# 2: 1980 2 1
# 3: 1980 3 1
dt2 <- data.table(Year = 1980,
Month = 1:3,
Value = runif(3, 100, 1000))
dt2
# Year Month Value
# 1: 1980 1 389.7436
# 2: 1980 2 902.0029
# 3: 1980 3 663.6313
merge(dt1, dt2, by = c("Year", "Month"), all.x = TRUE)[order(Year, Month)]
# Year Month Day Value
# 1: 1980 1 1 389.7436
# 2: 1980 2 1 902.0029
# 3: 1980 3 1 663.6313
If you just want to create another column in one data.table (note that data.tables are similar to data.frames in many respects) without any matching, you can do it like this, provided both tables have the same number of rows in the same order:
dt1$Value <- dt2$Value
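Positional assignment like this breaks down when the tables differ in length, as they do in the question. A sketch of a lookup that keeps the row order of the larger table, using base R match on a pasted Year/Month key (the toy data and the names key_a and key_b are illustrative assumptions, not from the original post):

```r
# Toy data with the question's column names (values are illustrative).
a <- data.frame(Year = 1980,
                Month = rep(1:2, each = 3),
                Day = rep(1:3, times = 2))
b <- data.frame(Year = 1980, Month = 1:2, Value = c(356, 389))

# Build one key per row, then find each row of a in b.
key_a <- paste(a$Year, a$Month)
key_b <- paste(b$Year, b$Month)

# match() returns, for every row of a, the matching row index in b
# (or NA if there is none), so a keeps its original row order.
a$Value <- b$Value[match(key_a, key_b)]
a
```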
I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the persons by family in a new column, so that persons 27101 and 27102 (siblings) are in group/family 1, 42101 and 42102 are in group 2, 117103, 117104 and 117105 are in group 3, and so on.
Person "4102" has no siblings and should be NA in the new column.
Two or more persons are always siblings if their IDs are no more than 6 apart.
I have a far larger dataset with over 3000 rows. What is the most efficient way to do this?
You can use round with digits = -1 (or digits = -2 if a family can have more than 10 id_pers values). If you want the group IDs to be consecutive integers starting from 1, you can use cur_group_id:
library(dplyr)
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
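Neither grouping above marks persons without siblings as NA, which the question asked for. A rough sketch, under the assumption that the stated rule can be read as "a new family starts whenever the gap to the previous sorted ID exceeds 6" (this chains IDs together, so a run such as 10, 15, 20 would form a single family):

```r
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
              56102, 73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001,
                2000, 1994, 1999, 1978, 1986, 1998, 1999)
)

# Sort by ID, then start a new group whenever the gap exceeds 6.
data <- data[order(data$id_pers), ]
grp <- cumsum(c(TRUE, diff(data$id_pers) > 6))

# Groups with a single member have no siblings: set them to NA.
grp[ave(grp, grp, FUN = length) == 1] <- NA

# Renumber the remaining families 1, 2, 3, ... in order of appearance.
data$fam_gp <- match(grp, unique(grp[!is.na(grp)]))
data
```

With this numbering 28101 and 28102 also form a family (group 2), because their IDs are only 1 apart. Everything is vectorised, so a few thousand rows should not be a problem.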
set.seed(123)
df <- data.frame(loc.id = rep(c(1:3), each = 4*10),
year = rep(rep(c(1980:1983), each = 10), times = 3),
day = rep(1:10, times = 3*4),
x = sample(123:200, 4*3*10, replace = T))
I want to add one more column, x.mv, which is the 3-day moving average of x for each loc.id and year combination:
df %>% group_by(loc.id,year) %>% mutate(x.mv = zoo::rollmean(x, 3, fill = NA, align = "right"))
loc.id year day x x.mv
<int> <int> <int> <int> <dbl>
1 1 1980 1 145 NA
2 1 1980 2 184 NA
3 1 1980 3 154 161
4 1 1980 4 191 176.
5 1 1980 5 196 180.
6 1 1980 6 126 171
7 1 1980 7 164 162
8 1 1980 8 192 161.
9 1 1980 9 166 174
10 1 1980 10 158 172
What I want to do is to replace the NAs in the x.mv column with x. I tried this:
df %>% group_by(loc.id,year) %>% mutate(x.mv = zoo::rollmean(x, 3, fill = x[1:2], align = "right"))
loc.id year day x x.mv
<int> <int> <int> <int> <dbl>
1 1 1980 1 145 145
2 1 1980 2 184 145
3 1 1980 3 154 161
4 1 1980 4 191 176.
5 1 1980 5 196 180.
6 1 1980 6 126 171
7 1 1980 7 164 162
8 1 1980 8 192 161.
9 1 1980 9 166 174
10 1 1980 10 158 172
But this fills the NAs with the first value of x rather than the corresponding value of x. How do I fix it?
Skip the fill argument and pad manually:
df %>%
group_by(loc.id,year) %>%
mutate(x.mv = c(x[1:2],zoo::rollmean(x, 3, align = "right"))) %>%
ungroup()
# # A tibble: 120 x 5
# loc.id year day x x.mv
# <int> <int> <int> <int> <dbl>
# 1 1 1980 1 145 145.0000
# 2 1 1980 2 184 184.0000
# 3 1 1980 3 154 161.0000
# 4 1 1980 4 191 176.3333
# 5 1 1980 5 196 180.3333
# 6 1 1980 6 126 171.0000
# 7 1 1980 7 164 162.0000
# 8 1 1980 8 192 160.6667
# 9 1 1980 9 166 174.0000
# 10 1 1980 10 158 172.0000
# # ... with 110 more rows
You might want to use dplyr::cummean(x[1:2]) instead of x[1:2] so that the second value is already an average, or, in this case, follow @g-grothendieck's suggestion in the comments and rewrite your mutate call as mutate(x.mv = zoo::rollapplyr(x, 3, mean, partial = TRUE)).
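For comparison, the same padding idea can be sketched in base R without zoo. The helper name pad_rollmean is hypothetical, and the sketch assumes x has no missing values of its own and at least k elements per group:

```r
# Hypothetical helper: right-aligned moving average of width k that
# falls back to the raw value where the window is still incomplete.
pad_rollmean <- function(x, k = 3) {
  # sides = 1 uses only the current and past values; the first k - 1
  # results are NA because the window is not yet full.
  ma <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
  ifelse(is.na(ma), x, ma)
}

x <- c(145, 184, 154, 191, 196)
pad_rollmean(x)

# Per group, e.g. with the df from the question:
# df$x.mv <- ave(df$x, df$loc.id, df$year, FUN = pad_rollmean)
```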
I am completely new to R, and I have spent some time googling for a solution to my problem without finding an adequate answer, so I hope that asking here will help.
I need to merge two data sets of different sizes (one contains annual data, df_f, and the other monthly data, df_m). The smaller df_f should be merged into the larger df_m so that the rows of df_f are matched conditionally with the rows of df_m.
Here is a descriptive example of my problem (with some very basic reproducible numbers):
first dataset
a <- c(1990)
b <- c(1980:1981)
c <- c(1994:1995)
aa <- rep("A", 1)
bb <- rep("B", 2)
cc <- rep("C", 2)
df1 <- data.frame(comp=factor(c(aa, bb, cc)))
df2 <- data.frame(year=factor(c(a, b, c)))
other.columns <- rep("other_columns", nrow(df1))
df_f <- cbind(df1, df2, other.columns ) # first dataset
second dataset
z <- c(10:12)
x <- c(7:12)
xx <- c(1:9)
v <- c(2:9)
w <- rep(1990, length(z))
e <- rep(1980, length(x))
ee <- rep (1981, length(xx))
r <- rep(1995, length(v))
t <- rep("A", length(z))
y <- rep("B", length(x) + length(xx))
u <- rep("C", length(v))
df3 <- data.frame(month=factor(c(z, x, xx, v)))
df4 <- data.frame(year=factor(c(w, e, ee, r)))
df5 <- data.frame(comp=factor(c(t, y, u)))
df_m <- cbind(df5, df4, df3) # second dataset
Output:
> df_m
comp year month
1 A 1990 10
2 A 1990 11
3 A 1990 12
4 B 1980 7
5 B 1980 8
6 B 1980 9
7 B 1980 10
8 B 1980 11
9 B 1980 12
10 B 1981 1
11 B 1981 2
12 B 1981 3
13 B 1981 4
14 B 1981 5
15 B 1981 6
16 B 1981 7
17 B 1981 8
18 B 1981 9
19 C 1995 2
20 C 1995 3
21 C 1995 4
22 C 1995 5
23 C 1995 6
24 C 1995 7
25 C 1995 8
26 C 1995 9
> df_f
comp year other.columns
1 A 1990 other_columns
2 B 1980 other_columns
3 B 1981 other_columns
4 C 1994 other_columns
5 C 1995 other_columns
I want the rows from df_f placed into df_m (that is, stored in new columns of df_m) according to conditions on comp, year, and month. Comp (company) always needs to match, but matching the year depends on month: if month is >6, the year is matched directly between the datasets; if month is <7, year - 1 (in df_m) is matched with year (in df_f). Note that a single row in df_f should be placed into several rows of df_m according to these conditions.
The wanted output clarifies the problem and the goal:
Wanted output:
comp year month comp year other.columns
1 A 1990 10 A 1990 other_columns
2 A 1990 11 A 1990 other_columns
3 A 1990 12 A 1990 other_columns
4 B 1980 7 B 1980 other_columns
5 B 1980 8 B 1980 other_columns
6 B 1980 9 B 1980 other_columns
7 B 1980 10 B 1980 other_columns
8 B 1980 11 B 1980 other_columns
9 B 1980 12 B 1980 other_columns
10 B 1981 1 B 1980 other_columns
11 B 1981 2 B 1980 other_columns
12 B 1981 3 B 1980 other_columns
13 B 1981 4 B 1980 other_columns
14 B 1981 5 B 1980 other_columns
15 B 1981 6 B 1980 other_columns
16 B 1981 7 B 1981 other_columns
17 B 1981 8 B 1981 other_columns
18 B 1981 9 B 1981 other_columns
19 C 1995 2 C 1994 other_columns
20 C 1995 3 C 1994 other_columns
21 C 1995 4 C 1994 other_columns
22 C 1995 5 C 1994 other_columns
23 C 1995 6 C 1994 other_columns
24 C 1995 7 C 1995 other_columns
25 C 1995 8 C 1995 other_columns
26 C 1995 9 C 1995 other_columns
Thank you very much in advance! I hope the question is clear enough, it was somewhat difficult to explain it at least.
The basic idea to solve your problem is to add an extra column with the year that should be used for matching. I will use the package dplyr for this and the other manipulation steps.
Before the tables can be combined, the factor columns that hold numbers must be converted to numeric:
library(dplyr)
df_m <- mutate(df_m, year = as.numeric(as.character(year)),
month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))
The reason is that you want to be able to do numerical comparison with the month (month > 6) and subtract one from the year. You cannot do this with a factor.
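The detour through as.character matters. A small illustration (the vector below is made up) of what happens when a factor is passed to as.numeric directly:

```r
year <- factor(c(1990, 1980, 1981))

# as.numeric() on a factor returns the internal level codes,
# not the numbers that the labels spell out.
codes <- as.numeric(year)

# Converting to character first recovers the original values.
years <- as.numeric(as.character(year))

codes  # 3 1 2
years  # 1990 1980 1981
```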
Then I add the column to be used for matching:
df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))
And in the last step, I join the two tables:
df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))
The argument by determines which columns of the two data frames should be matched. The output agrees with your result:
## comp year month match_year other.columns
## 1 A 1990 10 1990 other_columns
## 2 A 1990 11 1990 other_columns
## 3 A 1990 12 1990 other_columns
## 4 B 1980 7 1980 other_columns
## 5 B 1980 8 1980 other_columns
## 6 B 1980 9 1980 other_columns
## 7 B 1980 10 1980 other_columns
## 8 B 1980 11 1980 other_columns
## 9 B 1980 12 1980 other_columns
## 10 B 1981 1 1980 other_columns
## 11 B 1981 2 1980 other_columns
## 12 B 1981 3 1980 other_columns
## 13 B 1981 4 1980 other_columns
## 14 B 1981 5 1980 other_columns
## 15 B 1981 6 1980 other_columns
## 16 B 1981 7 1981 other_columns
## 17 B 1981 8 1981 other_columns
## 18 B 1981 9 1981 other_columns
## 19 C 1995 2 1994 other_columns
## 20 C 1995 3 1994 other_columns
## 21 C 1995 4 1994 other_columns
## 22 C 1995 5 1994 other_columns
## 23 C 1995 6 1994 other_columns
## 24 C 1995 7 1995 other_columns
## 25 C 1995 8 1995 other_columns
## 26 C 1995 9 1995 other_columns
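The same matching can be sketched in base R with merge, in case dplyr is not available. The reduced toy tables and the info column below are assumptions for illustration only:

```r
# Reduced versions of df_m and df_f with plain (non-factor) columns.
df_m <- data.frame(comp = c("B", "B", "B", "C"),
                   year = c(1980, 1981, 1981, 1995),
                   month = c(12, 3, 7, 2))
df_f <- data.frame(comp = c("B", "B", "C", "C"),
                   year = c(1980, 1981, 1994, 1995),
                   info = c("B80", "B81", "C94", "C95"))

# Months 7-12 match the same year, months 1-6 match the previous year.
df_m$match_year <- ifelse(df_m$month >= 7, df_m$year, df_m$year - 1)

# by.x / by.y pair match_year in df_m with year in df_f;
# all.x = TRUE keeps monthly rows even if no annual row exists.
res <- merge(df_m, df_f,
             by.x = c("comp", "match_year"),
             by.y = c("comp", "year"),
             all.x = TRUE)

# merge() does not guarantee the original row order, so sort explicitly.
res <- res[order(res$comp, res$year, res$month), ]
res
```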
I have data organized by country-year, with an ID for a dyadic relationship. I want to organize this by dyad-year.
Here is how my data is organized:
dyadic_id country_codes year
1 1 200 1990
2 1 20 1990
3 1 200 1991
4 1 20 1991
5 2 300 1990
6 2 10 1990
7 3 100 1990
8 3 10 1990
9 4 500 1991
10 4 200 1991
Here's how I want my data to be organized:
dyadic_id_want country_codes_1 country_codes_2 year_want
1 1 200 20 1990
2 1 200 20 1991
3 2 300 10 1990
4 3 100 10 1990
5 4 500 200 1991
Here is reproducible code:
dyadic_id<-c(1,1,1,1,2,2,3,3,4,4)
country_codes<-c(200,20,200,20,300,10,100,10,500,200)
year<-c(1990,1990,1991,1991,1990,1990,1990,1990,1991,1991)
mydf<-as.data.frame(cbind(dyadic_id,country_codes,year))
I want mydf to look like df_i_want
dyadic_id_want<-c(1,1,2,3,4)
country_codes_1<-c(200,200,300,100,500)
country_codes_2<-c(20,20,10,10,200)
year_want<-c(1990,1991,1990,1990,1991)
my_df_i_want<-as.data.frame(cbind(dyadic_id_want,country_codes_1,country_codes_2,year_want))
We can reshape from 'long' to 'wide' using different methods. Two are described below.
Using 'data.table', we convert the 'data.frame', to 'data.table' (setDT(mydf)), create a sequence column ('ind'), grouped by 'dyadic_id' and 'year'. Then, we convert the dataset from 'long' to 'wide' format using dcast.
library(data.table)
setDT(mydf)[, ind:= 1:.N, by = .(dyadic_id, year)]
dcast(mydf, dyadic_id+year~ paste('country_codes', ind, sep='_'), value.var='country_codes')
# dyadic_id year country_codes_1 country_codes_2
#1: 1 1990 200 20
#2: 1 1991 200 20
#3: 2 1990 300 10
#4: 3 1990 100 10
#5: 4 1991 500 200
Or, using dplyr/tidyr, we do the same thing: group by 'dyadic_id' and 'year', create an 'ind' column (mutate(...)), and use spread from tidyr to reshape to 'wide' format.
library(dplyr)
library(tidyr)
mydf %>%
group_by(dyadic_id, year) %>%
mutate(ind= paste0('country_codes', row_number())) %>%
spread(ind, country_codes)
# dyadic_id year country_codes1 country_codes2
# (dbl) (dbl) (dbl) (dbl)
#1 1 1990 200 20
#2 1 1991 200 20
#3 2 1990 300 10
#4 3 1990 100 10
#5 4 1991 500 200
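If neither package is at hand, roughly the same result can be sketched with base R's reshape. As in the answers above, 'ind' is a helper column numbering the countries within each dyad-year:

```r
mydf <- data.frame(
  dyadic_id = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4),
  country_codes = c(200, 20, 200, 20, 300, 10, 100, 10, 500, 200),
  year = c(1990, 1990, 1991, 1991, 1990, 1990, 1990, 1990, 1991, 1991)
)

# Number the rows 1, 2, ... within each dyadic_id/year combination.
mydf$ind <- ave(mydf$country_codes, mydf$dyadic_id, mydf$year,
                FUN = seq_along)

# One row per dyad-year, one country_codes_<ind> column per position.
wide <- reshape(mydf,
                idvar = c("dyadic_id", "year"),
                timevar = "ind",
                direction = "wide",
                sep = "_")
wide
```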
I have a data frame like this
V1 V2 V3 V4 month year
1 -1 9 1 1 1989
1 -1 9 1 1 1989
4 -1 9 1 2 1989
3 2 7 3 1 1990
4 4 8 2 2 1990
3 6 9 2 2 1990
4 7 0 2 2 1990
5 8 4 2 2 1990
where the first 4 columns give the value of the quantity A in cells 1, 2, 3 and 4, and the last two columns give the month and the year. What I would like to do is calculate the monthly average of A for every cell, so that I end up with a list like this:
V1
1989
<A>jan <A>feb ..
1 4
1990
<A>jan <A>feb ..
3 4
V2
V3
Many thanks
I was still hoping for something a little bit more precise in your question as to what your desired output is exactly, but since you haven't updated that part, I'll provide two options.
Option 1
aggregate seems to be a pretty straightforward tool for this task, particularly if sticking with a "wide" format would be fine for your needs.
aggregate(. ~ year + month, mydf, mean)
# year month V1 V2 V3 V4
# 1 1989 1 1 -1.00 9.00 1
# 2 1990 1 3 2.00 7.00 3
# 3 1989 2 4 -1.00 9.00 1
# 4 1990 2 4 6.25 5.25 2
Option 2
If you prefer your data in a "long" format, you should explore the "reshape2" package which can handle the reshaping and aggregating in just a few steps.
library(reshape2)
mydfL <- melt(mydf, id.vars = c("year", "month"))
## The next step is purely cosmetic...
mydfL$month <- factor(month.abb[mydfL$month], month.abb, ordered = TRUE)
head(mydfL)
# year month variable value
# 1 1989 Jan V1 1
# 2 1989 Jan V1 1
# 3 1989 Feb V1 4
# 4 1990 Jan V1 3
# 5 1990 Feb V1 4
# 6 1990 Feb V1 3
## This is the actual aggregation and reshaping step...
out <- dcast(mydfL, variable + year ~ month,
value.var = "value", fun.aggregate = mean)
out
# variable year Jan Feb
# 1 V1 1989 1 4.00
# 2 V1 1990 3 4.00
# 3 V2 1989 -1 -1.00
# 4 V2 1990 2 6.25
# 5 V3 1989 9 9.00
# 6 V3 1990 7 5.25
# 7 V4 1989 1 1.00
# 8 V4 1990 3 2.00
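If a literal list with one year-by-month table per cell is closer to what the question sketches, a base R version with tapply might look like this. The construction of mydf is reproduced from the question's data, and the names cells and monthly_means are illustrative:

```r
mydf <- data.frame(
  V1 = c(1, 1, 4, 3, 4, 3, 4, 5),
  V2 = c(-1, -1, -1, 2, 4, 6, 7, 8),
  V3 = c(9, 9, 9, 7, 8, 9, 0, 4),
  V4 = c(1, 1, 1, 3, 2, 2, 2, 2),
  month = c(1, 1, 2, 1, 2, 2, 2, 2),
  year = c(1989, 1989, 1989, 1990, 1990, 1990, 1990, 1990)
)

cells <- c("V1", "V2", "V3", "V4")

# For every cell column, average the values within each year/month
# combination; the result is a named list of year-by-month matrices.
monthly_means <- lapply(mydf[cells], function(v) {
  tapply(v, list(year = mydf$year, month = mydf$month), mean)
})

monthly_means$V2
```

Each list element can then be indexed by year and month, for example monthly_means$V2["1990", "2"].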