I'm struggling with a reshaping problem. I got hold of a complex dataset in long format, which I need in wide format for analysis. The conversion itself was easy. However, the dataset contains redundancy after the conversion because of how the data were entered. Here is an MWE of the problem I am facing:
id <- c("ana","ana","ana", "brad","ana","brad","brad","brad", "matt", "matt", "matt")
hour <- c(0, 0, 24, 0, 48, 24, NA, 72, 0 , 24, 48 )
assessment <- c("memory", "memory", "attention", "verbal", "attention", "memory", "attention","attention", "memory", "attention", "attention")
value <- c(0.000,NA,0.895,0.000,15.000, 3, 5, NA,2, 4,5 )
mydata<-data.frame(id, hour, assessment, value)
Results in:
> mydata
id hour assessment value
1 ana 0 memory 0.000
2 ana 0 memory NA
3 ana 24 attention 0.895
4 brad 0 verbal 0.000
5 ana 48 attention 15.000
6 brad 24 memory 3.000
7 brad NA attention 5.000
8 brad 72 attention NA
9 matt 0 memory 2.000
10 matt 24 attention 4.000
11 matt 48 attention 5.000
and after:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
mutate(i1=row_number()) %>%
spread(assessment, value)
gets to:
Source: local data frame [11 x 6]
Groups: id [3]
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 0 2 NA NA NA
3 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 NA NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
8 brad NA 3 5.000 NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Note that ana has two entries for hour 0 and memory, and brad has one entry with hour 0 and another with a missing hour. That missing hour should be treated as zero as well; it was a typing error by whoever collected the data.
The table below shows how ana's and brad's entries should be. Repetitions for the same id and hour (including NA) should be collapsed/merged (look at lines 1 and 5 below).
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 5.000 NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Question:
How do I reduce the duplicates for each subject+hour in such a dataset, so that it will look like the previous table?
One option is to replace the NA with 0, get the distinct rows, and then proceed as in the OP's code:
mydata %>%
mutate_at(vars(hour, value), funs(replace(., is.na(.), 0))) %>%
arrange(id, hour, desc(value)) %>%
distinct() %>%
group_by(id, hour, assessment) %>%
spread(assessment, value)
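For readers on current dplyr/tidyr (where funs() is deprecated and spread() has been superseded by pivot_wider()), here is a minimal sketch of a variation on the same idea. It is not the code above: it assumes that only a missing hour should become 0, and that within each id/hour/assessment the first non-missing value is the one to keep (NA if there is none). The name wide is just for illustration.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(
  id = c("ana", "ana", "ana", "brad", "ana", "brad", "brad", "brad",
         "matt", "matt", "matt"),
  hour = c(0, 0, 24, 0, 48, 24, NA, 72, 0, 24, 48),
  assessment = c("memory", "memory", "attention", "verbal", "attention",
                 "memory", "attention", "attention", "memory",
                 "attention", "attention"),
  value = c(0.000, NA, 0.895, 0.000, 15.000, 3, 5, NA, 2, 4, 5)
)

wide <- mydata %>%
  # assumption: a missing hour is a typing error and means hour 0
  mutate(hour = replace_na(hour, 0)) %>%
  group_by(id, hour, assessment) %>%
  # keep the first non-missing value; indexing an empty vector with [1] gives NA
  summarise(value = value[!is.na(value)][1], .groups = "drop") %>%
  pivot_wider(names_from = assessment, values_from = value)

wide
```

This gives one row per id and hour, with brad's attention score of 5 merged into his hour 0 row.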
I'd like to convert chemical formulas to a data frame containing columns for 1) the mineral name, 2) the chemical formula and 3) a set of columns, one for each element extracted from the formula. I am given the first two columns and I can extract the element counts from each formula using CHNOSZ::makeup(). However, I'm not familiar with working with lists and not sure how to rbind() the lists back into a data frame that contains everything I'm looking for (i.e. see 1-3 above).
Here is what I have so far - appreciate any help (including a link to a good tutorial on how to convert data from nested lists into dataframes).
library(tidyverse)
library(CHNOSZ)
formulas <- structure(list(Mineral = c("Abelsonite", "Abernathyite", "Abhurite",
"Abswurmbachite", "Acanthite", "Acetamide"), Composition = c("C31H32N4Ni",
"K(UO2)(AsO4)4(H2O)", "Sn3O(OH)2Cl2", "CuMn6(SiO4)O8", "Ag2S",
"CH3CONH2")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
test <- formulas %>%
select(Composition) %>%
map(CHNOSZ::makeup) %>%
flatten
test2 <- do.call(rbind,test)
> test2
As H K O U
[1,] 31 32 4 1 31
[2,] 4 2 1 19 1
[3,] 2 2 3 3 2
[4,] 1 6 12 1 1
[5,] 2 1 2 1 2
[6,] 2 5 1 1 2
which is not right.
You could do something like this
library(tidyverse)
library(CHNOSZ)
test <- formulas %>%
mutate(res = map(Composition, ~stack(makeup(.x)))) %>%
unnest(cols = res) %>%
spread(ind, values)
## A tibble: 6 x 17
# Mineral Composition C H N Ni As K O U Cl
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Abelso… C31H32N4Ni 31 32 4 1 NA NA NA NA NA
#2 Aberna… K(UO2)(AsO… NA 2 NA NA 4 1 19 1 NA
#3 Abhuri… Sn3O(OH)2C… NA 2 NA NA NA NA 3 NA 2
#4 Abswur… CuMn6(SiO4… NA NA NA NA NA NA 12 NA NA
#5 Acanth… Ag2S NA NA NA NA NA NA NA NA NA
#6 Acetam… CH3CONH2 2 5 1 NA NA NA 1 NA NA
## … with 6 more variables: Sn <dbl>, Cu <dbl>, Mn <dbl>, Si <dbl>, Ag <dbl>,
## S <dbl>
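To see why the plain do.call(rbind, ...) in the question goes wrong, here is a small base R sketch. The list counts is hand-written to stand in for what makeup() returns (named vectors of element counts for Ag2S and CH3CONH2); it is an assumption for illustration, not output copied from CHNOSZ. rbind() on vectors with different lengths and names recycles values and keeps only the first vector's names, so each vector has to be aligned to a common set of elements first.

```r
# Hand-written stand-ins for makeup() results (assumed shape: named numeric vectors)
counts <- list(
  Acanthite = c(Ag = 2, S = 1),
  Acetamide = c(C = 2, H = 5, N = 1, O = 1)
)

# All element names that occur in any formula
elements <- unique(unlist(lapply(counts, names)))

# Align every vector to the full element set; absent elements become NA
aligned <- lapply(counts, function(x) setNames(x[elements], elements))

# Now rbind() is safe: every row has the same length and the same names
wide <- do.call(rbind, aligned)
wide
```

The stack()/unnest()/spread() pipeline above does the same alignment by going through long format.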
If 2017 is NA and the 2015 and 2016 columns have values, I want to assign their average to 2017 in the same row.
Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288
...
Here is what I tried; it didn't work:
ind <- which(is.na(df), arr.ind=TRUE)
df[ind] <- rowMeans(df, na.rm = TRUE)[ind[,1]]
Similarly, if the 2015 and 2017 columns have values and 2016 is NA, I want to assign their average to 2016 in the same row. Any help would be appreciated!
Disclaimer: I'm not entirely clear on what your expected output is. My solution below is based on the assumption that you want to replace NA values with either the mean of all values for every year or with the mean value of all values for every Index.
Here is a tidyverse option: first gather from wide to long, then replace NAs with the mean value per year, and finally spread back from long to wide.
library(tidyverse)
df %>%
gather(year, value, -Index) %>%
group_by(year) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
spread(year, value)
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 115507293. 6355698. 10107023.
#2 2 13000000. 223472356. 186197372.
#3 4 115507293. 223472356. 186197372.
#4 5 115507293. 223472356. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 115507293. 223472356. 32032920.
#7 8 115507293. 223472356. 20000000.
#8 9 2521880. 5061370. 7044288.
Note that here we replace NAs with mean value per year. If instead you want to replace NAs with the mean value per Index value, simply replace group_by(year) with group_by(Index):
df %>%
gather(year, value, -Index) %>%
group_by(Index) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
spread(year, value)
## A tibble: 8 x 4
## Groups: Index [8]
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 8231360. 6355698. 10107023.
#2 2 13000000. 13000000. 13000000.
#3 4 NaN NaN NaN
#4 5 8000000. 8000000. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 32032920. 32032920. 32032920.
#7 8 20000000. 20000000. 20000000.
#8 9 2521880. 5061370. 7044288.
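The per-Index replacement can also be sketched in base R without reshaping. One likely reason the attempt in the question misbehaves, assuming Index is an ordinary column of df, is that rowMeans(df, ...) averages the Index value together with the yearly values. The sketch below restricts the mean to the year columns; the three rows are copied from the question and the names years, m and row_mean are mine.

```r
# Three rows from the question; check.names = FALSE keeps "2015" etc. as names
df <- data.frame(
  Index  = c(1, 2, 4),
  `2015` = c(NA, 13000000, NA),
  `2016` = c(6355698, 73050000, NA),
  `2017` = c(10107023, NA, NA),
  check.names = FALSE
)

years <- c("2015", "2016", "2017")
m <- as.matrix(df[years])                # year columns only, Index excluded

row_mean <- rowMeans(m, na.rm = TRUE)    # NaN for rows that are entirely NA
ind <- which(is.na(m), arr.ind = TRUE)   # (row, col) positions of the NAs
m[ind] <- row_mean[ind[, 1]]             # fill each NA with its own row's mean

df[years] <- m
df
```

Rows with no values at all end up as NaN, as in the output above.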
Update
To only replace NAs in column 2017 with the row average based on the 2015,2016 values you can do
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
df %>%
mutate(`2017` = ifelse(is.na(`2017`), 0.5 * (`2015` + `2016`), `2017`))
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <int> <int> <dbl>
#1 1 NA 6355698 10107023.
#2 2 13000000 73050000 43025000.
#3 4 NA NA NA
#4 5 10500000 NA 8000000.
#5 6 331000000 659000000 1040000000.
#6 7 55500000 NA 32032920.
#7 8 NA NA 20000000.
#8 9 2521880 5061370 7044288.
Sample data
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 NA NA
4 NA NA NA
5 NA NA 8000000
6 331000000 659000000 1040000000
7 NA NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
I have a data frame, a reproducible example is as follows:
structure(list(subscriberid = c(1177460837L, 1177460837L, 1177460837L,
1146526049L, 1146526049L, 1146526049L), variable = c("3134",
"4550", "4550", "5160", "2530", "2530"), value = c(1, 2, 2, 1,
2, 2), gender = c(2, 2, 2, 1, 2, 2), cwe = c(NA, 50L, 50L, NA,
30L, 30L), hw = c(NA, 48L, 48L, NA, 26L, 26L), resp = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), cna = c(3L, 1L, 1L, 3L, 1L, 1L)), .Names = c("subscriberid",
"variable", "value", "gender", "cwe", "hw", "resp", "cna"), row.names = c(4L,
5L, 6L, 9L, 10L, 11L), class = "data.frame")
The actual data frame looks like this:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 48 NA 1
6 1177460837 4550 2 2 50 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 26 NA 1
11 1146526049 2530 2 2 30 26 NA 1
In the above df, rows 5 and 6 are exactly the same. From row 5 I want to remove 48, and from row 6 I want to remove 50. Essentially, I want to retain only one age per row and set the other to NA. I tried using a for loop, but that sets the column I refer to to NA in both rows.
for (i in 1:nrow(test)) {
test$hw[i] <- ifelse(!is.na(test$cwe[i]) & !is.na(test$hw[i]), NA, test$hw[i])
}
I am trying to set an if condition to identify if both the rows are same, then I want to iteratively remove one of the values from the first row and remove the other from the second.
The desired output is as follows:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
You can use a combination of which() and duplicated() to find the duplicated rows.
Because each pair of rows needs two different changes, make them on a copy of the data frame so that duplicated() still sees the original values. Note that this will only work if the identical rows are always consecutive.
dfNA <- df
dfNA$hw[which(duplicated(df))-1] <- NA
dfNA$cwe[which(duplicated(df))] <- NA
dfNA
# subscriberid variable value gender cwe hw resp cna
#4 1177460837 3134 1 2 NA NA NA 3
#5 1177460837 4550 2 2 50 NA NA 1
#6 1177460837 4550 2 2 NA 48 NA 1
#9 1146526049 5160 1 1 NA NA NA 3
#10 1146526049 2530 2 2 30 NA NA 1
#11 1146526049 2530 2 2 NA 26 NA 1
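If the identical rows are not guaranteed to be consecutive, one possible variation is to mark the earlier copy with duplicated(..., fromLast = TRUE) instead of subtracting 1 from the index. This is only a sketch under the assumption that duplicates come in pairs; the data frame below keeps just four of the question's columns, and is_later, is_earlier and out are names I made up.

```r
df <- data.frame(
  subscriberid = c(1177460837, 1177460837, 1177460837,
                   1146526049, 1146526049, 1146526049),
  variable = c("3134", "4550", "4550", "5160", "2530", "2530"),
  cwe = c(NA, 50L, 50L, NA, 30L, 30L),
  hw  = c(NA, 48L, 48L, NA, 26L, 26L)
)

# Both flags are computed on the untouched data before anything is changed
is_later   <- duplicated(df)                    # second copy of each pair
is_earlier <- duplicated(df, fromLast = TRUE)   # first copy of each pair

out <- df
out$hw[is_earlier] <- NA   # first copy keeps cwe only
out$cwe[is_later]  <- NA   # second copy keeps hw only
out
```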
A possible solution :
# create a logical vector indicating if current row is identical to previous one
# N.B.: do.call("paste",c(DF,sep="\r")) is used internally by "duplicated.data.frame" function
rowStrings <- do.call("paste", c(DF, sep = "\r"))
currRowIsEqualToPrev <- rowStrings[-1] == rowStrings[-length(rowStrings)]
# set first row hw = NA and second identical row cwe = NA
DF[c(currRowIsEqualToPrev,FALSE),'hw'] <- NA
DF[c(FALSE,currRowIsEqualToPrev),'cwe'] <- NA
> DF
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
Using lead and lag from dplyr package:
library(dplyr)
df1 %>%
group_by(subscriberid, variable) %>%
mutate(cwe = if_else(lead(cwe) == cwe, cwe, NA_integer_),
hw = if_else(lag(hw) == hw, hw, NA_integer_)) %>%
ungroup()
# # A tibble: 6 x 8
# subscriberid variable value gender cwe hw resp cna
# <int> <int> <int> <int> <int> <int> <lgl> <int>
# 1 1177460837 3134 1 2 NA NA NA 3
# 2 1177460837 4550 2 2 50 NA NA 1
# 3 1177460837 4550 2 2 NA 48 NA 1
# 4 1146526049 5160 1 1 NA NA NA 3
# 5 1146526049 2530 2 2 30 NA NA 1
# 6 1146526049 2530 2 2 NA 26 NA 1
I took a shot at it. This relies on using group_by from dplyr to find duplicate rows. This method assumes that rows can reliably be identified as identical by using the subscriberid, variable, value, gender, resp, and cna columns alone.
Because it is operating within groups only, it will work even if a preceding non-identical row contains the same value for cwe (I did check this, but I would also confirm it for yourself if I were you).
library(dplyr)
ndf <- df %>%
group_by(subscriberid, variable, value, gender, resp, cna) %>%
mutate(cwe = na_if(cwe, lag(cwe)),
hw = na_if(hw, lead(hw))) %>%
ungroup()
Output:
# A tibble: 6 x 8
subscriberid variable value gender cwe hw resp cna
<int> <chr> <dbl> <dbl> <int> <int> <int> <int>
1 1177460837 3134 1. 2. NA NA NA 3
2 1177460837 4550 2. 2. 50 NA NA 1
3 1177460837 4550 2. 2. NA 48 NA 1
4 1146526049 5160 1. 1. NA NA NA 3
5 1146526049 2530 2. 2. 30 NA NA 1
6 1146526049 2530 2. 2. NA 26 NA 1
For each person, there are two types of visits, and for each visit there is a date record. The dataset looks like this:
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p,type,date)
So now the dataset looks like this.
> d
p type date
1 1 15 2014-02-03
2 1 20 2014-02-04
3 1 20 2014-02-06
4 2 15 2014-01-28
5 2 20 2014-02-03
6 2 15 2014-03-03
7 2 20 2014-03-13
8 3 20 2014-04-03
9 3 15 2014-04-09
10 3 15 2014-12-03
Now, I'd like to create three new columns:
1. An indicator of whether a type 20 visit happens within 7 days after the type 15 visit: 1 if yes, otherwise 0. (For example, for p 2, this value should be 1 in line 4 and 0 in line 6.)
2. The date of the first type 20 visit that happened within 7 days after the type 15 visit. If there is no type 20 visit within 7 days after the type 15 visit, keep it blank. (For example, for p 1, the value should be 2014-02-04 instead of 2014-02-06.)
3. The number of days between the type 15 visit and that first type 20 visit. If there is no type 20 visit within 7 days after the type 15 visit, keep it blank. (For example, the value in line 1 should be 1.)
I'm a super newbie in R, and basically have no idea of what to do. I tried a for loop within group, but it never works.
group_by(p)%>%
for(i in i:length(date)){
*if(type[i]== 15 && date[i]+7 >= date[i+1:length(date)]){
indicator = 1
first_date =
days =* #Have no idea how to check in this part
} else {
indicator = 0
first_date = NA
days = NA
}
The expected output is as below.
p type date ind first_date days
1 1 15 2014-02-03 1 2014-02-04 1 # = 2014-02-04 - 2014-02-03
2 1 20 2014-02-04 NA <NA> NA
3 1 20 2014-02-06 NA <NA> NA
4 2 15 2014-01-28 1 2014-02-03 6 # = 2014-02-03 - 2014-01-28
5 2 20 2014-02-03 NA <NA> NA
6 2 15 2014-03-03 0 <NA> NA # since (2014-03-13 - 2014-03-03) > 7
7 2 20 2014-03-13 NA <NA> NA
8 3 20 2014-04-03 NA <NA> NA #I don't care about the value for type 20 lines
9 3 15 2014-04-09 0 <NA> NA
10 3 15 2014-12-03 0 <NA> NA
I then came up with a new idea: what if we group the records by p and by each type 15 visit? Then we can subtract dates within groups to get days, and the rest will be easy.
I found one way of doing this:
d[,group:= cumsum(type ==15)]
However, this starts a new group at every type 15 record regardless of p. How do I add p as another grouping condition?
I took a stab at this. There's one caveat though: My answer assumes that after a type 15 visit occurs, the next visit within 7 days will be a type_20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered, and only the second type 15 visit matters:
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
mutate(rownum = 1:n()) %>%
spread(type, date, sep="_") %>%
group_by(p) %>%
mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
mutate(days = first_date - type_15) %>%
gather("type", "date", type_15, type_20) %>%
filter(!is.na(date)) %>%
arrange(p, date) %>%
select(p, type, date, ind, first_date, days)
# p type date ind first_date days
# <dbl> <chr> <date> <dbl> <date> <time>
#1 1 type_15 2014-02-03 1 2014-02-04 1 days
#2 1 type_20 2014-02-04 NA <NA> NA days
#3 1 type_20 2014-02-06 NA <NA> NA days
#4 2 type_15 2014-01-28 1 2014-02-03 6 days
#5 2 type_20 2014-02-03 NA <NA> NA days
#6 2 type_15 2014-03-03 0 <NA> NA days
#7 2 type_20 2014-03-13 NA <NA> NA days
#8 3 type_20 2014-04-03 NA <NA> NA days
#9 3 type_15 2014-04-09 0 <NA> NA days
#10 3 type_15 2014-12-03 0 <NA> NA days
Let me try to explain what I'm doing:
First the type and date columns are spread so that the dates of each type appear in separate columns (this makes it easier to compare the dates of the two different types). Next come a couple of mutates. The first three apply the conditions outlined in the question, as follows: if lead(type_20) - type_15 <= 7, that means there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1, else we mark it as 0. After this, if ind is NA, we assume no type 20 visit was found, so we also mark it as 0. In the third mutate we mark the lines where type_15 is NA (i.e. the type 20 lines) as NA.
The next three mutate lines add the columns outlined in 2 and 3 in the question.
Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.
I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.
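If the caveat above is a concern (the next visit after a type 15 visit is not always the relevant type 20 visit), a more literal version of the three rules can be sketched in base R: for every type 15 row, look up all type 20 visits of the same person inside the window and take the earliest. This is not the pipeline above, just an illustration. The helper first_type20 and the variable window are names I made up, and I assume "within 7 days after" means strictly later than the type 15 date and at most 7 days later.

```r
d <- data.frame(
  p    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
  date = as.Date(c("2014-02-03", "2014-02-04", "2014-02-06", "2014-01-28",
                   "2014-02-03", "2014-03-03", "2014-03-13", "2014-04-03",
                   "2014-04-09", "2014-12-03"))
)
window <- 7

# Earliest type 20 visit of the same person inside the window after row i,
# or NA if there is none (hypothetical helper, reads the global d)
first_type20 <- function(i) {
  hit <- d$p == d$p[i] & d$type == 20 &
    d$date > d$date[i] & d$date <= d$date[i] + window
  if (any(hit)) min(d$date[hit]) else as.Date(NA)
}

is15 <- d$type == 15
d$first_date <- as.Date(NA)
d$first_date[is15] <- do.call(c, lapply(which(is15), first_type20))

# 1/0 on type 15 rows, NA on type 20 rows
d$ind  <- ifelse(is15, as.integer(!is.na(d$first_date)), NA_integer_)
d$days <- as.integer(d$first_date - d$date)
d
```

The row-by-row lookup is quadratic in the number of rows, so it is meant for clarity rather than for large data.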
If you're willing to use some functions from the purrr package and to use some custom functions, here is another option...
Packages you'll need
library(dplyr)
library(purrr)
Set up data (as per question)
p <-c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date.factor(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(cbind(p,type,date))
d$date = as.Date(date)
Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.
# Function to manage ind
ind_manager <- function(type, date, dates_20) {
if (type == 20)
return (NA_integer_)
checks <- map_lgl(dates_20, between, date, date + 7)
return (as.integer(any(checks)))
}
# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
if (is.na(ind) || ind != 1)
return (NA_character_)
dates_20 <- dates_20[order(dates_20)]
as.character(dates_20[which.max(date < dates_20)])
}
Save a vector of dates where type == 20 to be used as comparisons
dates_20 <- d$date[d$type == 20]
The final mutate() call
# mutate() call to create variables
d %>%
mutate(
ind = map2_int(type, date, ind_manager, dates_20),
first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
days = as.integer(first_date - date)
)
#> p type date ind first_date days
#> 1 1 15 2014-02-03 1 2014-02-04 1
#> 2 1 20 2014-02-04 NA <NA> NA
#> 3 1 20 2014-02-06 NA <NA> NA
#> 4 2 15 2014-01-28 1 2014-02-03 6
#> 5 2 20 2014-02-03 NA <NA> NA
#> 6 2 15 2014-03-03 0 <NA> NA
#> 7 2 20 2014-03-13 NA <NA> NA
#> 8 3 20 2014-04-03 NA <NA> NA
#> 9 3 15 2014-04-09 0 <NA> NA
#> 10 3 15 2014-12-03 0 <NA> NA
Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.
The first step is to define the pieces:
d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
date = structure(c(16104, 16105, 16107, 16098, 16104, 16132, 16142, 16163, 16169, 16407), class = "Date")),
.Names = c("p", "type", "date"),
row.names = c(NA, -10L), class = "data.frame")
id <- with(d, {
id <- ave(type, p, FUN = function(x) cumsum(x == 15))
factor(paste0(p, id), unique(paste0(p, id)))
})
sp <- split(d, id)
So, sp is a list of data frames to which we will apply a function. Each piece belongs to a single unique p and contains at most one type == 15 row (plus however many type == 20 rows follow).
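To make the grouping id less opaque, here is the counter on its own: a running count of type 15 visits that restarts for every p. The only change from the code above is a separator in paste(), which I added on the assumption that ids such as p = 1, visit 12 and p = 11, visit 2 should not collide; visit and pieces are illustrative names.

```r
p    <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
type <- c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15)

# cumsum(type == 15) computed separately within each p
visit <- ave(type, p, FUN = function(x) cumsum(x == 15))
visit

# One id per person and type 15 visit; the separator keeps ids unambiguous
id <- paste(p, visit, sep = "_")

# Row indices per piece, in order of first appearance
pieces <- split(seq_along(p), factor(id, levels = unique(id)))
pieces
```

A visit value of 0 marks type 20 rows that come before a person's first type 15 visit, which is why some pieces contain no type 15 row at all.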
The first two pieces are
sp[1:2]
# $`11`
# p type date
# 1 1 15 2014-02-03
# 2 1 20 2014-02-04
# 3 1 20 2014-02-06
#
# $`21`
# p type date
# 4 2 15 2014-01-28
# 5 2 20 2014-02-03
And we can apply the function below on each one
first_date(sp[[1]])
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
first_date(sp[[2]])
# p type date ind first_date days
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
Or all at once with a loop
(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA
# 7 2 20 2014-03-13 NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA
You can take advantage of the arguments, like window, or any others you add without changing much of the function, for example, to change the window
(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)
# p type date ind first_date days ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA 1 2014-03-13 10
# 7 2 20 2014-03-13 NA <NA> NA NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA 0 <NA> NA
first_date <- function(data, window = 7) {
nr <- nrow(data)
## check at least one type 15 and > 1 row
ty15 <- data$type == 15
dt15 <- data$date[ty15]
if (!any(ty15) | nr == 1L)
return(cbind(data, ind = ifelse(any(ty15), 0, NA),
first_date = NA, days = NA))
## first date vector
dts <- rep(min(data$date[!ty15]), nr)
dts[!ty15] <- NA
## days from the type 15 date
days <- as.numeric(data$date[!ty15] - min(dt15))
days <- c(days, rep(NA, nr - length(days)))
## convert to NA if criteria not met
to_na <- days > window | is.na(dts)
days[to_na] <- dts[to_na] <- NA
## ind vector -- 1 or 0 if type 15, NA otherwise
ind <- rep(NA, nr)
ind[ty15] <- as.integer(!is.na(dts[ty15]))
## combine
cbind(data, ind = ind, first_date = dts, days = days)
}
I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
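For readers on current tidyr, roughly the same pivot can be sketched with pivot_wider(), followed by complete() to get the 12 rows per household asked for in the question. The rows below are modelled on the question's table, with one extra made-up income row so that the summing is visible; monthly is an illustrative name, and filling empty cells with 0 is my assumption about what is wanted.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  household = c("household1", "household1", "household1", "household1",
                "household2"),
  date  = as.Date(c("1999-05-10", "1999-05-25", "1999-05-30", "1999-10-12",
                    "1999-02-02")),
  value = c(100, 200, 50, 300, 400),   # the 50 is a made-up second income in May
  type  = c("income", "water", "income", "energy", "income")
)

monthly <- df %>%
  mutate(month = as.integer(format(date, "%m"))) %>%
  select(-date) %>%
  # one column per type; several values in the same cell are summed
  pivot_wider(names_from = type, values_from = value,
              values_fn = sum, values_fill = 0) %>%
  # add the household/month combinations that never occur in the data
  complete(household, month = 1:12,
           fill = list(income = 0, water = 0, energy = 0))

monthly
```

If the data span several years, year would need to be kept as an extra id column, as in the cast() call above.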
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
dfr <- df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
dfr
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.