The problem starts from the difficulty of explaining it.
I have a data set with a time dimension. My ID variables change name over time, which makes it difficult to calculate, for example, percentage changes over time by ID variable.
ID YR Value
01 2004 100
02 2005 50
03 2005 50
04 2005 10
I need to calculate the percentage change in Value over time by ID. The problem is that in 2005 ID 01 is split into three IDs (02, 03, 04), so one has to aggregate the values of those three IDs in 2005 to get the value that corresponds to ID 01 in 2005. The percentage change for ID 01 is NOT 50/100 but rather sum(50, 50, 10)/100.
I have a data.frame containing only the IDs, which matches the changes over time. It looks like this:
x2004 x2005
01 01
01 02
01 03
I used group_by from dplyr to create a matching between the IDs in the two years:
group_by(x2004) %>%
summarize(onetomany = paste(sort(unique(x2005)),collapse=", "))
Which gave me a data.frame of the form
cv2004 onetomany
1 1 1, 2, 3
Here I can see which IDs belong to the same group, but this is where I got stuck on the percentage calculation.
I understand that the problem itself is not easy to follow. It is a common problem in trade statistics: commodity codes change name over time but not content, and one has to keep track of the changes to get a picture of how trade develops over time by commodity. Any suggestion is appreciated.
library(dplyr)
df <- data.frame("ID" = c("01", "02", "03", "04"),
                 "YR" = c(2004, 2005, 2005, 2005),
                 "Value" = c(100, 50, 50, 10))
# Total value per year, ignoring the ID codes
df %>% group_by(YR) %>% summarise(sum = sum(Value))
# A tibble: 2 x 2
YR sum
<dbl> <dbl>
1 2004 100
2 2005 110
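The summary above totals by year only. To keep the result per ID, one option is to map each 2005 code back to its 2004 parent before aggregating. The following is a minimal base R sketch, assuming a crosswalk that follows the question's description (01 in 2004 becomes 02, 03 and 04 in 2005); the names crosswalk, parent, totals and wide are hypothetical.

```r
df <- data.frame(ID = c("01", "02", "03", "04"),
                 YR = c(2004, 2005, 2005, 2005),
                 Value = c(100, 50, 50, 10),
                 stringsAsFactors = FALSE)

# Hypothetical crosswalk: each 2005 code and the 2004 code it came from
crosswalk <- data.frame(x2004 = c("01", "01", "01"),
                        x2005 = c("02", "03", "04"),
                        stringsAsFactors = FALSE)

# Map every ID to its 2004 parent; IDs not in the crosswalk keep their own code
idx <- match(df$ID, crosswalk$x2005)
df$parent <- ifelse(is.na(idx), df$ID, crosswalk$x2004[idx])

# Sum values per parent and year, then put the years side by side
totals <- aggregate(Value ~ parent + YR, data = df, FUN = sum)
wide <- reshape(totals, idvar = "parent", timevar = "YR", direction = "wide")

# Percentage change from 2004 to 2005 per parent ID
wide$pct_change <- 100 * (wide$Value.2005 / wide$Value.2004 - 1)
wide
```

With many years, the same idea would need one crosswalk per code revision, applied in sequence.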
I'm unsure how to define age ranges to calculate an incidence ratio in a cohort dataset. More specifically, my data comprises individuals who entered a specific cohort between 2008 and 2018. This dataset was used as a reference and merged, by id number, with hospitalization information from another data source.
My data looks like this below
id = c(1:5)
year_of_entry = c(2008, 2009, 2011, 2015, 2016)
age_of_entry = c(8,10,40,20,30)
year_birth = c(2000, 1999, 1971, 1995, 1986)
hospitalization_year = c(2009, NA, 2015, 2017, NA)
age_hospitalization = c(9, NA, 44, 22, NA)
data = data.frame(
id = id,
'Age of Entry' = age_of_entry,
'Year of Birth' = year_birth,
'Hospitalization Year' = hospitalization_year,
'Age of Hospitalization' = age_hospitalization
)
>data
id Age.of.Entry Year.of.Birth Hospitalization.Year Age.of.Hospitalization
1 8 2000 2009 9
2 10 1999 NA NA
3 40 1971 2015 44
4 20 1995 2017 22
5 30 1986 NA NA
My next step is to run a linear regression to study determinants of admission by age group (i.e. 0-10; 11-20; 21-30; 31-40; 41-50), but I'm not sure which criteria to use to create these age groups, given that people entered the cohort in different periods and at different ages, and were admitted at different points in time. Additionally, as you can see in the example above, my dataset also contains some individuals who have not been admitted.
Can anyone help me to solve that?
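As a starting point only, here is a minimal sketch of how the listed age bands could be built with cut(). It assumes age at entry is the grouping variable; whether age at entry or age at hospitalization is the right choice is a study-design decision that the question leaves open. The names breaks, labels, age_group and admitted are hypothetical.

```r
age_of_entry <- c(8, 10, 40, 20, 30)
hospitalization_year <- c(2009, NA, 2015, 2017, NA)

# Band edges for 0-10, 11-20, 21-30, 31-40, 41-50 (right-closed intervals)
breaks <- c(0, 10, 20, 30, 40, 50)
labels <- c("0-10", "11-20", "21-30", "31-40", "41-50")

# include.lowest = TRUE keeps age 0 inside the first band
age_group <- cut(age_of_entry, breaks = breaks, labels = labels,
                 include.lowest = TRUE)

# Individuals without a hospitalization year are treated as not admitted
admitted <- !is.na(hospitalization_year)

table(age_group, admitted)
```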
I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.
The data I want to extract:
ID          monthly value
1 - 1997    Product of Monthly Values in Year 1997
1 - 1998    Product of Monthly Values in Year 1998
1 - 1999    Product of Monthly Values in Year 1999
...         ...
2 - 1997    Product of Monthly Values in Year 1997
2 - 1998    Product of Monthly Values in Year 1998
2 - 1999    Product of Monthly Values in Year 1999
...         ...
The dataset which is the source:
ID          monthly value
1 - 1997    Monthly Value 1 in Year 1997
1 - 1997    Monthly Value 2 in Year 1997
1 - 1997    Monthly Value 3 in Year 1997
...         ...
2 - 1997    Monthly Value 1 in Year 1997
2 - 1997    Monthly Value 2 in Year 1997
2 - 1997    Monthly Value 3 in Year 1997
...         ...
I have written a for loop, which takes about 0.74s for 10 IDs, which is way too slow. It would take about 15 hours to run through the whole data. The for loop multiplies all monthly values for a given ID and stores the result in a separate data frame.
for (i in 1:nrow(yearlyreturns)){
yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
}
I don't know how to get from here to a vectorised function, which takes less time.
Is this possible to do in R?
Something like this:
library(dplyr)
library(stringr)  # provides str_replace()
df %>%
mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>%
group_by(ID, monthly_value) %>%
summarise()
ID monthly_value
<chr> <chr>
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997
data:
structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997",
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997",
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997",
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997",
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA,
-6L))
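Based on the for loop code, this may be done with a grouped aggregation followed by a join
library(data.table)
# Aggregate the monthly rows per ID first, then join the result onto yearlyreturns
agg <- setDT(monthlyreturns)[, .(yret = prod(change) - 1, monthcount = .N), by = ID]
setDT(yearlyreturns)[agg, c("yret", "monthcount") := .(i.yret, i.monthcount), on = .(ID)]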
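The for loop in the question rescans monthlyreturns once per ID, whereas a grouped aggregation makes a single pass over the data. A minimal base R sketch of that idea follows; the change values are made up for illustration, and only the column names ID, change, yret and monthcount come from the question.

```r
# Toy monthly data; the change values are hypothetical
monthlyreturns <- data.frame(
  ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", "2 - 1997"),
  change = c(1.10, 0.90, 1.05, 1.02, 1.03),
  stringsAsFactors = FALSE
)

# One grouped pass per statistic instead of one subset per ID
yret <- tapply(monthlyreturns$change, monthlyreturns$ID, prod) - 1
monthcount <- tapply(monthlyreturns$change, monthlyreturns$ID, length)

yearlyreturns <- data.frame(
  ID = names(yret),
  yret = as.vector(yret),
  monthcount = as.vector(monthcount),
  stringsAsFactors = FALSE
)
yearlyreturns
```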
In addition to the excellent previous answers, here's a link to an earlier post comparing 10 common ways to calculate means by group. data.table-based solutions are definitely the way to go, especially for datasets with millions of rows. Unless you're writing to individual output files, I'm not sure why this would take hours rather than minutes.
I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this:
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of information and I can not do it manually.
Can anybody help me? Thanks a lot.
Using strsplit you can do it like this:
Logic: strsplit splits each date at the dashes and returns a list with one element per date, each holding three parts (year, month and day). To bind all of these elements with rbind in one go, we use do.call, which row-binds the list elements into 3 rows. Since the outcome is a matrix, we convert it to a data frame and use setNames to give the columns new names. The final cbind attaches the original precipitation values to this 3x3 data frame.
cbind(setNames(data.frame(do.call('rbind', strsplit(df$date, '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
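For readers who find the one-liner dense, here is the same logic unrolled step by step. This is only a sketch; the intermediate names parts, mat, ymd and out are hypothetical.

```r
df <- data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"),
                 Precipitation = c(20, 22, 23),
                 stringsAsFactors = FALSE)

# 1. Split each date at the dashes: a list of three character vectors
parts <- strsplit(df$date, "-")

# 2. Row-bind the list elements into a 3 x 3 character matrix
mat <- do.call(rbind, parts)

# 3. Convert to a data frame and name the columns
ymd <- setNames(data.frame(mat, stringsAsFactors = FALSE),
                c("year", "month", "day"))

# 4. Attach the precipitation values
out <- cbind(ymd, pp = df$Precipitation)
out
```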
This returns integer values for year, month, and day. If you really need them as characters padded with 0 you can use formatC(x, width = 2, flag = "0") on the result.
library(clock)
library(dplyr)
df <- data.frame(
date = c("2011-03-24", "2011-02-03", "2011-01-02"),
pp = c(20, 22, 23)
)
df %>%
mutate(
date = as.Date(date),
year = get_year(date),
month = get_month(date),
day = get_day(date)
)
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
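If the zero-padded character form is needed, the formatC call mentioned above can be applied to the integer parts. A minimal sketch in base R, so that it runs without clock installed; the names month_chr and day_chr are hypothetical.

```r
dates <- as.Date(c("2011-03-24", "2011-02-03", "2011-01-02"))

# Integer month and day, comparable to what get_month() / get_day() return
month <- as.integer(format(dates, "%m"))
day <- as.integer(format(dates, "%d"))

# Pad to two characters with leading zeros
month_chr <- formatC(month, width = 2, flag = "0")
day_chr <- formatC(day, width = 2, flag = "0")

data.frame(date = dates, month = month_chr, day = day_chr)
```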
Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code, so I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
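Edit: note that the output for all the new columns is character. We can tell this without using str() because of the leading zeros in the new columns. To get integers instead, we can convert just those columns with something like out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)), or append that mutate call to the end of our existing pipeline. (Converting every column with mutate_all(as.integer) would also hit code, whose 16-digit values do not fit in an integer and would become NA.)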
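The same steps can also be sketched in base R, without the tidyverse, using as.Date() in place of readr::parse_date(). The vector names date, day and dayofyear are chosen to mirror the question.

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6 to 13 hold the date as yyyymmdd
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

# Pull day of month and day of year out of the Date, as integers
day <- as.integer(format(date, "%d"))
dayofyear <- as.integer(format(date, "%j"))

data.frame(code, date, day, dayofyear)
```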
I want to split a column into multiple columns by matching patterns.
test <- data.frame("id" = c("Albertson's Inc.","Albertson's Inc."), "V3" = c("Reiterates FY 2004, Significant Developments, 2 June 2004, 53 words, (English)(Document MULTI00020050122e06201fkk)","EBITDA Hits Four Year Low, Stock Diagnostics, 16:00 GMT, 9 June 2004, 245 words, (English)(Document STODIA0020040609e0690006g)"), stringsAsFactors = F)
So far, the code I'm using to get the desired result is:
library(stringr)
df <- as.data.frame(str_match(test$V3, "^(.*)GMT,(.*),(.*)words,(.*)Document (.*)$")[,-1], stringsAsFactors = F)
I'm having two issues with the above code.
First, it does not return results when GMT is missing. Second, I want the "id" column in the output df as well. Any suggestions, or a different approach I should use, would be welcome. Thanks to all the moderators and programmers for such a helpful forum.
Not 100% sure about your "GMT" problem. Here is my try:
Your example data:
test <- data.frame("id" = c("Albertson's Inc.","Albertson's Inc."), "V3" = c("Reiterates FY 2004, Significant Developments, 2 June 2004, 53 words, (English)(Document MULTI00020050122e06201fkk)","EBITDA Hits Four Year Low, Stock Diagnostics, 16:00 GMT, 9 June 2004, 245 words, (English)(Document STODIA0020040609e0690006g)"), stringsAsFactors = F)
code:
library(tidyverse)
test$V3 %>% map(~str_split(.,",(?!\\s*\\d{1,2}:\\d{1,2})|(?<=\\))(?=\\()") %>% unlist %>% trimws) %>%
do.call(rbind,.) %>%
cbind(test["id"],.)
result:
# id 1 2 3 4 5 6
# 1 Albertson's Inc. Reiterates FY 2004 Significant Developments 2 June 2004 53 words (English) (Document MULTI00020050122e06201fkk)
# 2 Albertson's Inc. EBITDA Hits Four Year Low Stock Diagnostics, 16:00 GMT 9 June 2004 245 words (English) (Document STODIA0020040609e0690006g)
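Another option for the missing-GMT case is to keep a single match pattern, as in the question, and make the time stamp optional. The following is a base R sketch, not the approach used in the answer above; the pattern and the column names title, date, words, language and document are assumptions based on the two example rows only.

```r
test <- data.frame(
  id = c("Albertson's Inc.", "Albertson's Inc."),
  V3 = c("Reiterates FY 2004, Significant Developments, 2 June 2004, 53 words, (English)(Document MULTI00020050122e06201fkk)",
         "EBITDA Hits Four Year Low, Stock Diagnostics, 16:00 GMT, 9 June 2004, 245 words, (English)(Document STODIA0020040609e0690006g)"),
  stringsAsFactors = FALSE
)

pattern <- paste0(
  "^(.*?)",                      # title, matched lazily
  "(?:, \\d{1,2}:\\d{2} GMT)?",  # optional time stamp, not captured
  ", (\\d{1,2} \\w+ \\d{4})",    # date such as "2 June 2004"
  ", (\\d+) words",              # word count
  ", \\((.*?)\\)",               # language in parentheses
  "\\(Document (.*)\\)$"         # document code
)

# regexec returns the full match plus one entry per capture group;
# rows that do not match would need separate handling before rbind
m <- regmatches(test$V3, regexec(pattern, test$V3, perl = TRUE))
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("title", "date", "words", "language", "document")

# Keep the id column next to the extracted pieces
out <- cbind(test["id"], as.data.frame(parts, stringsAsFactors = FALSE))
out
```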