Merging complementary rows of a dataframe with R

I have a data frame like this:
0 weekday day month year hour basal bolus carb period.h
1 Tuesday 01 03 2016 0.0 0.25 NA NA 0
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
4 Tuesday 01 03 2016 12.0 0.30 NA NA 12
5 Tuesday 01 03 2016 17.0 0.50 NA NA 17
6 Tuesday 01 03 2016 17.6 NA NA 33 17
7 Tuesday 01 03 2016 17.6 NA 1.35 NA 17
8 Tuesday 01 03 2016 18.6 NA NA 44 18
9 Tuesday 01 03 2016 18.6 NA 1.80 NA 18
10 Tuesday 01 03 2016 18.9 NA NA 17 18
11 Tuesday 01 03 2016 18.9 NA 0.70 NA 18
12 Tuesday 01 03 2016 22.0 0.40 NA NA 22
13 Wednesday 02 03 2016 0.0 0.25 NA NA 0
14 Wednesday 02 03 2016 9.7 NA NA 39 9
15 Wednesday 02 03 2016 9.7 NA 2.65 NA 9
16 Wednesday 02 03 2016 11.2 NA NA 13 11
17 Wednesday 02 03 2016 11.2 NA 0.30 NA 11
18 Wednesday 02 03 2016 12.0 0.30 NA NA 12
19 Wednesday 02 03 2016 12.0 NA NA 16 12
20 Wednesday 02 03 2016 12.0 NA 0.65 NA 12
If you look at lines 2 and 3, you will notice that they correspond to exactly the same day and time: in line 2 only "carb" is not NA, and in line 3 only "bolus" is not NA (these are diabetes data).
I want to merge such lines into a single one:
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
->
2 Tuesday 01 03 2016 10.9 NA 4.15 67 10
I could of course write a brute-force double loop over the lines, but I am looking for a cleverer and faster way.

You can group your data frame by the common identifier columns (here weekday, day, month, year, hour and period.h), then sort each of the remaining columns and take its first element. By default, sort() removes NAs from the vector being sorted, so within each group you end up with a non-NA element for every column that has one. If all elements of a column are NA within a group, sort(col)[1] returns NA:
library(dplyr)
df %>%
group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(sort(.)[1]))
# weekday day month year hour period.h basal bolus carb
# <fctr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
# 1 Tuesday 1 3 2016 0.0 0 0.25 NA NA
# 2 Tuesday 1 3 2016 10.9 10 NA 4.15 67
# 3 Tuesday 1 3 2016 12.0 12 0.30 NA NA
# 4 Tuesday 1 3 2016 17.0 17 0.50 NA NA
# 5 Tuesday 1 3 2016 17.6 17 NA 1.35 33
# 6 Tuesday 1 3 2016 18.6 18 NA 1.80 44
# 7 Tuesday 1 3 2016 18.9 18 NA 0.70 17
# 8 Tuesday 1 3 2016 22.0 22 0.40 NA NA
# 9 Wednesday 2 3 2016 0.0 0 0.25 NA NA
# 10 Wednesday 2 3 2016 9.7 9 NA 2.65 39
# 11 Wednesday 2 3 2016 11.2 11 NA 0.30 13
# 12 Wednesday 2 3 2016 12.0 12 0.30 0.65 16
Instead of sort(), a more appropriate function here may be na.omit(), which keeps the first non-NA value in row order rather than the smallest one:
df %>% group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(na.omit(.)[1]))
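Since funs() is deprecated in recent dplyr versions, the same idea can be sketched with across(). This is only a minimal sketch, not the original answer's code: the three-row df below is a hand-typed subset of the question's data (rows 1 to 3), and first_non_na is a hypothetical helper name introduced here for illustration.

```r
library(dplyr)

# Hand-typed subset of the question's data (rows 1-3), for illustration only
df <- data.frame(
  weekday  = c("Tuesday", "Tuesday", "Tuesday"),
  day      = c(1, 1, 1),
  month    = c(3, 3, 3),
  year     = c(2016, 2016, 2016),
  hour     = c(0.0, 10.9, 10.9),
  basal    = c(0.25, NA, NA),
  bolus    = c(NA, NA, 4.15),
  carb     = c(NA, 67, NA),
  period.h = c(0, 10, 10)
)

# Hypothetical helper: first non-NA value of x, or NA if every value is NA
first_non_na <- function(x) x[!is.na(x)][1]

merged <- df %>%
  group_by(weekday, day, month, year, hour, period.h) %>%
  # across(everything()) covers only the non-grouping columns here
  summarise(across(everything(), first_non_na), .groups = "drop")

print(merged)
```

The two rows for hour 10.9 collapse into one row that carries both the bolus and the carb value, while the hour 0.0 row is left unchanged.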

Related

Long to wide: compacting rows

It is my first post.
I'm a beginner with R.
I have a df like this:
date value
2018-01-01 123
2018-02-01 12
2018-03-01 23
...
2019-01-01 3
2019-02-01 21
2019-03-01 2
...
2020-01-01 31
2020-02-01 23
2020-03-01 32
...
I want to transform it into:
year ene feb mar ...
2018 123 12 23 ...
2019 3 21 2 ...
2020 31 23 32 ...
I tried:
df <- mutate (df,year=year(as.Date(date)), month=month(as.Date(date), label=TRUE,abbr=TRUE))
I got:
date value month year
2018-01-01 123 ene 2018
2018-02-01 12 feb 2018
2018-03-01 23 mar 2018
...
2019-01-01 3 ene 2019
2019-02-01 21 feb 2019
2019-03-01 2 mar 2019
...
2020-01-01 31 ene 2020
2020-02-01 23 feb 2020
2020-03-01 32 mar 2020
...
Then I ran:
pivot_wider(df, names_from="month", values_from=value)
I got:
date year ene feb mar ...
2018-01-01 2018 123 NA NA ...
2018-02-01 2018 NA 12 NA ...
2018-03-01 2018 NA NA 23 ...
...
2019-01-01 2019 3 NA NA ...
2019-02-01 2019 NA 21 NA ...
2019-03-01 2019 NA NA 2 ...
...
2020-01-01 2020 31 NA NA ...
2020-02-01 2020 NA 23 NA ...
2020-03-01 2020 NA NA 32 ...
I need to "compress" the rows upwards, grouping by "year", but I don't know how to do it.
I'm close to a solution, but I can't find it.
Thanks in advance!

This should do it:
library(tidyverse)
library(lubridate)
df %>%
mutate(month = lubridate::month(as.Date(date), label=TRUE, abbr=TRUE),
year = lubridate::year(as.Date(date))) %>%
select(value, month, year) %>%
pivot_wider(id_cols = year, names_from = month, values_from = value)
Which returns:
# A tibble: 3 × 4
year Jan Feb Mar
<dbl> <int> <int> <int>
1 2018 123 12 23
2 2019 3 21 2
3 2020 31 23 32
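The reason the rows did not collapse in the question is that pivot_wider() treats every column not named in names_from or values_from as an identifier, and date is unique per row. A minimal sketch of that point, assuming a hand-typed four-row subset of the question's data and using month numbers instead of locale-dependent labels (both choices are made here for illustration):

```r
library(dplyr)
library(tidyr)

# Hand-typed subset of the question's data, for illustration only
df <- data.frame(
  date  = c("2018-01-01", "2018-02-01", "2019-01-01", "2019-02-01"),
  value = c(123, 12, 3, 21)
)

wide <- df %>%
  mutate(year  = as.integer(format(as.Date(date), "%Y")),
         month = format(as.Date(date), "%m")) %>%   # "01", "02", ...
  select(-date) %>%   # drop the unique column so that year is the only identifier
  pivot_wider(names_from = month, values_from = value)

print(wide)
```

Dropping date with select() has the same effect as passing id_cols = year: only year is left to identify a row, so each year ends up on a single line.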

merge of 2 data frames based on several columns defining 1 variable in r

I have 2 data frames. The codes are: year, pd, treatm and rep.
The variables are LAI in the first data frame, and cimer, himv and nőv in the second.
I would like to add the variable LAI to the other variables/columns.
I am not sure how to get the LAI data into the correct order, since each observation is defined by 4 codes.
Could you help me solve this problem, please?
Thank you very much!
Data frames are:
> sample1
year treatm pd rep LAI
1 2020 1 A 1 2.58
2 2020 1 A 2 2.08
3 2020 1 A 3 2.48
4 2020 1 A 4 2.98
5 2020 2 A 1 3.34
6 2020 2 A 2 3.11
7 2020 2 A 3 3.20
8 2020 2 A 4 2.56
9 2020 1 B 1 2.14
10 2020 1 B 2 2.17
11 2020 1 B 3 2.24
12 2020 1 B 4 2.29
13 2020 2 B 1 3.41
14 2020 2 B 2 3.12
15 2020 2 B 3 2.81
16 2020 2 B 4 2.63
17 2021 1 A 1 2.15
18 2021 1 A 2 2.25
19 2021 1 A 3 2.52
20 2021 1 A 4 2.57
21 2021 2 A 1 2.95
22 2021 2 A 2 2.82
23 2021 2 A 3 3.11
24 2021 2 A 4 3.04
25 2021 1 B 1 3.25
26 2021 1 B 2 2.33
27 2021 1 B 3 2.75
28 2021 1 B 4 3.09
29 2021 2 B 1 3.18
30 2021 2 B 2 2.75
31 2021 2 B 3 3.21
32 2021 2 B 4 3.57
> sample2
year.pd.treatm.rep.cimer.himv.nőv
1 2020,A,1,1,92,93,94
2 2020,A,2,1,91,92,93
3 2020,B,1,1,72,73,75
4 2020,B,2,1,73,74,75
5 2020,A,1,2,95,96,100
6 2020,A,2,2,90,91,94
7 2020,B,1,2,74,76,78
8 2020,B,2,2,71,72,74
9 2020,A,1,3,94,95,96
10 2020,A,2,3,92,93,96
11 2020,B,1,3,76,77,77
12 2020,B,2,3,74,75,76
13 2020,A,1,4,90,91,97
14 2020,A,2,4,90,91,94
15 2020,B,1,4,74,75,NA
16 2020,B,2,4,73,75,NA
17 2021,A,1,1,92,93,94
18 2021,A,2,1,91,92,93
19 2021,B,1,1,72,73,75
20 2021,B,2,1,73,74,75
21 2021,A,1,2,95,96,100
22 2021,A,2,2,90,91,94
23 2021,B,1,2,74,76,78
24 2021,B,2,2,71,72,74
25 2021,A,1,3,94,95,96
26 2021,A,2,3,92,93,96
27 2021,B,1,3,76,77,77
28 2021,B,2,3,74,75,76
29 2021,A,1,4,90,91,97
30 2021,A,2,4,90,91,94
31 2021,B,1,4,74,75,NA
32 2021,B,2,4,73,75,NA

You can use inner_join from dplyr:
library(tidyverse)
inner_join(sample2,sample1, by=c("year","pd", "treatm", "rep"))
Output (first six lines)
year pd treatm rep cimer himv nov LAI
1: 2020 A 1 1 92 93 94 2.58
2: 2020 A 2 1 91 92 93 3.34
3: 2020 B 1 1 72 73 75 2.14
4: 2020 B 2 1 73 74 75 3.41
5: 2020 A 1 2 95 96 100 2.08
6: 2020 A 2 2 90 91 94 3.11
You can also use data.table
sample2[sample1, on=.(year,pd,treatm,rep)]
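For comparison, the same join can be sketched in base R with merge(), which needs no extra packages. This is a swapped-in alternative, not the answer's code. It assumes sample2 has already been split into separate columns (the printout in the question suggests it may have been read as a single column), and it reproduces only two rows and a subset of the columns by hand.

```r
# Hand-typed rows taken from the question's printouts, for illustration only
sample1 <- data.frame(
  year = c(2020, 2020), treatm = c(1, 2), pd = c("A", "A"), rep = c(1, 1),
  LAI = c(2.58, 3.34)
)
sample2 <- data.frame(
  year = c(2020, 2020), pd = c("A", "A"), treatm = c(1, 2), rep = c(1, 1),
  cimer = c(92, 91), himv = c(93, 92)
)

# Rows are matched on all four codes, so the row order in either input does not matter
joined <- merge(sample2, sample1, by = c("year", "pd", "treatm", "rep"))

print(joined)
```

Because the match is made on the key columns rather than on row position, LAI lands next to the right cimer and himv values even though the two data frames are sorted differently.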

Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.
Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if some fields contain NA (which does happen quite often)?
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm    Year  Cow  Duck  Chicken  Sheep  Horse
Farm 1  2020  22   12    100      30     25
Farm 1  2020  NA   12    120      20     20
Farm 1  2019  16   6     80       10     16
Farm 1  2019  12   NA    50       NA     11
Farm 1  2018  8    NA    NA       16     NA
Farm 1  2018  NA   NA    10       13     12
Farm 2  2020  31   28    27       10     14
Farm 2  2020  NA   13    31       20     NA
Farm 2  2019  3    31    NA       20     43
Farm 2  2019  20   50    43       17     42
Farm 2  2018  39   33    NA       48     10
Farm 2  2018  34   20    28       12     12
Farm 3  2020  27   NA    37       30     42
Farm 3  2020  50   9     NA       NA     NA
Farm 3  2019  NA   19    NA       20     16
Farm 3  2019  NA   2     NA       NA     7
Farm 3  2018  NA   NA    5        27     NA
Farm 3  2018  NA   7     43       49     42
If I were to use aggregate(.~Farms + Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.
Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e.
2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 + 20 + 15) / (3) = 15.
The ideal output will be as follow:
Farm    Year  Cow     Duck      Chicken  Sheep  Horse
Farm 1  2020  22 [1]  12        110      25     22.5
Farm 1  2019  14      6         65       10     13.5
Farm 1  2018  8       N.A. [2]  10       14.5   12
Farm 2  2020  31      20.5      29       15     14
Farm 2  2019  11.5    40.5      43       18.5   42.5
Farm 2  2018  36.5    26.5      28       30     11
Farm 3  2020  ...     ...       ...      ...    ...
Farm 3  2019  ...     ...       ...      ...    ...
Farm 3  2018  ...     ...       ...      ...    ...
[1] avg = 22/1 as one entry is NA
[2] N.A. as it's all NA

Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.
library(dplyr)
Using aggregate
Data %>%
aggregate(.~Year+Farms,., FUN=mean, na.rm=T, na.action=NULL) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
group_by(Year, Farms) %>%
summarize(MeanCow = mean(Cow, na.rm=T),
MeanDuck = mean(Duck, na.rm=T),
MeanChicken = mean(Chicken, na.rm=T),
MeanSheep = mean(Sheep, na.rm=T),
MeanHorse = mean(Horse, na.rm=T)) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Output for both:
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0
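The formula interface of aggregate() defaults to na.action = na.omit, which drops every row containing an NA before FUN is ever called. A minimal base R sketch of keeping those rows instead with na.action = na.pass, using only the Cow and Duck columns from the question. The helper name mean_or_na is hypothetical and introduced here for illustration.

```r
# Data copied from the question, restricted to two animal columns
Farms <- c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year  <- rep(c(2020, 2020, 2019, 2019, 2018, 2018), 3)
Cow   <- c(22, NA, 16, 12, 8, NA, 31, NA, 3, 20, 39, 34, 27, 50, NA, NA, NA, NA)
Duck  <- c(12, 12, 6, NA, NA, NA, 28, 13, 31, 50, 33, 20, NA, 9, 19, 2, NA, 7)
Data  <- data.frame(Farms, Year, Cow, Duck)

# Hypothetical helper: mean of the non-NA values, or NA when nothing is left
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# na.pass keeps rows with NA so that the helper decides how to handle them
res <- aggregate(cbind(Cow, Duck) ~ Farms + Year, data = Data,
                 FUN = mean_or_na, na.action = na.pass)

print(res[order(res$Farms, -res$Year), ])
```

Returning NA directly from the helper avoids the NaN that mean(x, na.rm = TRUE) produces for an all-NA group, so no replacement step is needed afterwards.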

warning: non-list contrasts argument ignored

I am running a gamm using the 'mgcv' package in R:
additive.model.saturated <- gamm(log.titer ~ condition +
Age_month_selective + Season.2 +
s(capture.month, bs = "cc", k = 12) +
s(capture.year, bs = "ps", k = 5),
random=list(Animal.ID=~1), data = data)
However, I keep getting the warning message below. I can not seem to figure out why I am getting this warning and how to adjust my analysis to resolve any mistakes the warning is suggesting I need to fix.
Warning message:
In model.matrix.default(~b$groups[[n.levels - i + 1]] - 1, contrasts.arg = c("contr.treatment", :
non-list contrasts argument ignored
A summary and subset of the data is included below:
#summary:
'data.frame': 1263 obs. of 6 variables:
$ log.titer : num 0 0 0 0 0 ...
$ condition : num 5 3.5 3.75 3.25 4 3.5 3.25 2.5 3.25 2.75 ...
$ Age_month_selective: int 39 57 63 68 75 83 27 44 39 51 ...
$ Season.2 : Factor w/ 2 levels "dry","wet": 1 2 1 2 1 2 1 2 1 1 ...
$ capture.month : int 6 12 6 11 6 2 6 11 6 6 ...
$ capture.year : int 2008 2009 2010 2010 2011 2012 2008 2009 2009 2010 ...
#data subset
log.titer condition Age_month_selective Season.2 capture.month capture.year Animal.ID
1 0.000000 5.00 39 dry 6 2008 B1
2 0.000000 3.50 57 wet 12 2009 B1
3 0.000000 3.75 63 dry 6 2010 B1
4 0.000000 3.25 68 wet 11 2010 B1
5 0.000000 4.00 75 dry 6 2011 B1
6 1.447158 3.50 83 wet 2 2012 B1
7 1.334454 3.25 27 dry 6 2008 B10
8 0.000000 2.50 44 wet 11 2009 B10
9 0.000000 3.25 39 dry 6 2009 B10
10 0.000000 2.75 51 dry 6 2010 B10
11 0.000000 2.50 56 wet 11 2010 B10
12 0.000000 2.00 63 dry 6 2011 B10
13 0.000000 2.50 71 wet 2 2012 B10
14 0.000000 4.50 63 dry 6 2008 B11
15 1.363612 3.75 80 wet 11 2009 B11
16 1.365488 4.75 76 dry 7 2009 B11
17 0.000000 3.75 87 dry 6 2010 B11
18 0.000000 4.00 95 wet 2 2011 B11
19 1.447158 3.25 99 dry 6 2011 B11
20 0.000000 4.75 51 dry 6 2008 B12
21 0.000000 4.25 68 wet 11 2009 B12
22 0.000000 4.25 68 wet 11 2009 B12
23 0.000000 3.50 75 dry 6 2010 B12
24 0.000000 3.75 80 wet 11 2010 B12
25 1.414973 2.00 92 wet 11 2011 B12

R Creating new data.table with specified rows of a single column from an old data.table

I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?

Here we use the built-in ChickWeight data set, where "Chick" and "Diet" play the role of your "Lat" and "Long", and "Time" plays the role of your date:
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
For your data, you will likely need a formula such as Lat + Long ~ Month + Day.
In the future, please make your question reproducible as I did here by using a built-in data set.
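Applied to the column names from the question, that formula could look like the sketch below. It is only an illustration: the four-row dt is typed by hand, the 1 October temperatures are rounded from the question, and the 2 October temperatures are made up so that there is a second day to spread.

```r
library(data.table)

# Tiny stand-in for the question's table (day 2 values are invented)
dt <- data.table(
  Month = c(10, 10, 10, 10),
  Day   = c(1, 1, 2, 2),
  Lat   = c(80.0, 77.5, 80.0, 77.5),
  Long  = c(180, 180, 180, 180),
  Temperature = c(-6.38, -6.19, -6.10, -5.90)
)

# One row per location, one column per Month_Day combination
wide <- dcast(dt, Lat + Long ~ Month + Day, value.var = "Temperature")

print(wide)
```

Naming value.var explicitly avoids relying on dcast guessing which column holds the values.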

First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
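In current tidyr, spread() is superseded by pivot_wider(), so the same reshaping can be sketched as follows. The assumptions are the same as in the answer above (year fixed at 2014). The data frame is a hand-typed stand-in, the 1 October temperatures are rounded from the question, and the 31 December temperatures are made up for illustration.

```r
library(tidyr)

# Tiny stand-in for the question's table (31 December values are invented)
df <- data.frame(
  Month = c(10L, 10L, 12L, 12L),
  Day   = c(1L, 1L, 31L, 31L),
  Lat   = c(80.0, 77.5, 80.0, 77.5),
  Long  = c(180, 180, 180, 180),
  Temperature = c(-6.38, -6.19, 1.5, 2.5)
)

# Build a proper Date column, assuming the year 2014
df$date <- as.Date(sprintf("2014-%02d-%02d", df$Month, df$Day))

# Keep only the location, the date and the value, then spread dates into columns
wide <- pivot_wider(df[, c("Lat", "Long", "date", "Temperature")],
                    names_from = date, values_from = Temperature)

print(wide)
```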
