I would like to convert a column (or create a new one) from year plus day-of-year to m/d/y. Originally I had year and day-of-year as two separate columns, but I concatenated (pasted) them together because I thought the year needed to accompany the day of year to handle leap years. I am not opposed to using an additional package such as date.
Here is my data:
dat <- structure(list(doy = c(320, 350, 309, 310, 328, 321, 301, 338,
304, 304, 308), year = structure(1:11, .Label = c("2000", "2001",
"2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009",
"2010"), class = "factor"), conc = c("2000-320", "2001-350",
"2002-309", "2003-310", "2004-328", "2005-321", "2006-301", "2007-338",
"2008-304", "2009-304", "2010-308")), row.names = c(NA, -11L), class = "data.frame", .Names = c("doy",
"year", "conc"))
It looks like this:
doy year conc
1 320 2000 2000-320
2 350 2001 2001-350
3 309 2002 2002-309
4 310 2003 2003-310
5 328 2004 2004-328
6 321 2005 2005-321
7 301 2006 2006-301
8 338 2007 2007-338
9 304 2008 2008-304
10 304 2009 2009-304
11 308 2010 2010-308
No additional packages necessary:
within(dat, dtime <- as.POSIXct(conc, format='%Y-%j'))
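If the goal is literally an m/d/y string rather than a date-time object, you can wrap the result in format() (a small sketch building on the line above; dtime and mdy are just illustrative column names):
dat$dtime <- as.POSIXct(dat$conc, format = "%Y-%j")
dat$mdy <- format(dat$dtime, "%m/%d/%Y")  # e.g. "11/15/2000" for "2000-320"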
Something like this works.
as.Date(paste(as.character(dat$year), "-01-01", sep = "")) + dat$doy - 1
Just adds the day of the year (minus one) to Jan 1 of the year.
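Assigning the result to a column and spot-checking a leap year (a quick sketch with the dat above; the date column name is made up):
dat$date <- as.Date(paste(dat$year, "-01-01", sep = "")) + dat$doy - 1
# leap years are handled automatically: day 320 of 2000 (a leap year) is
# 2000-11-15, while day 320 of 2001 is 2001-11-16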
I have thousands of rows of data which look like this
df <- data.frame(
thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))
in which I want to count each country category's contribution to each thing (represented by thing_code) per year. The categories I want to count are:
Vietnam (local country in this example)
SEAsian (all other southeast asian countries except Vietnam)
Non-local (other countries except Vietnam and SEAsian)
I want to be able to come up with something like this:
# thing_code year location freq percentage
# X123 2001 Vietnam 2 1
# Y123 2004 Vietnam 1 0.25
# Y123 2004 Non-local 2 0.5
# Y123 2004 SEAsian 1 0.25
# Z123 2004 Non-local 1 0.25
# Z123 2004 Vietnam 2 0.5
# Z123 2004 SEAsian 1 0.25
# A456 2007 Vietnam 2 0.4
# A456 2007 Non-local 3 0.6
freq is a count of rows in each of the above categories, and percentage is each category's share of the total for that thing_code and year.
So far, my code looks like
library(dplyr)
library(stringr)
Vietnam <- df %>% filter(str_detect(country, "Vietnam"))
thing_code_year <- subset(Vietnam, select=c(thing_code, year))
freq <- table(thing_code_year)
frequency <- as.data.frame(freq)
frequency <- frequency %>% filter(Freq!=0)
but this only gives me the number for Vietnam and will probably take me a long time to obtain those for other categories.
This should give your desired output. You can use case_when to create a new variable that specifies the location using the logic you described above. Next you group_by the code, year, and newly created location to calculate the frequency of each category in location (Vietnam, SEAsian, Non-local). Then you group_by code and year to calculate the percentage/proportion of each category in location.
library(dplyr)
df <- data.frame(
thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))
SEAsian <- c("Vietnam", "Singapore", "Cambodia")
df %>%
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian[SEAsian != "Vietnam"] ~ "SEAsian",
    !country %in% SEAsian ~ "Non-local"
  )) %>%
  group_by(thing_code, year, location) %>%
  summarise(freq = n()) %>%
  group_by(thing_code, year) %>%
  mutate(percentage = freq/sum(freq))
Output:
thing_code year location freq percentage
<fct> <fct> <chr> <int> <dbl>
1 A456 2007 Non-local 3 0.6
2 A456 2007 Vietnam 2 0.4
3 X123 2001 Vietnam 2 1
4 Y123 2004 Non-local 2 0.5
5 Y123 2004 SEAsian 1 0.25
6 Y123 2004 Vietnam 1 0.25
7 Z123 2004 Non-local 1 0.25
8 Z123 2004 SEAsian 1 0.25
9 Z123 2004 Vietnam 2 0.5
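The same summary can also be reached with count(), which collapses the group_by/summarise step (a sketch using the same SEAsian vector and location logic as above; the name argument needs a reasonably recent dplyr):
df %>%
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian[SEAsian != "Vietnam"] ~ "SEAsian",
    TRUE ~ "Non-local"
  )) %>%
  count(thing_code, year, location, name = "freq") %>%
  group_by(thing_code, year) %>%
  mutate(percentage = freq / sum(freq))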
I have a data frame given by the following
DF <- structure(list(ID = c(1, 129, 169, 1087), `Collab Years Patents` = c(NA,
"2011, 2011, 2011", "2010", "2006, 2006"), `Collab Years Publications` = c("2011",
"2015, 2016, 2016", "2010", NA), ECP = c("2011", "2011", "2010",
"2006")), .Names = c("ID", "Collab Years Patents", "Collab Years Publications",
"ECP"), row.names = c(1L, 107L, 136L, 859L), class = "data.frame")
The column ECP is the minimum year across the two collaboration columns (each of which can contain several years). I need an output that says which column the ECP came from. For example, a solution could add another column to the frame above with the elements:
structure(list(ID = c(1, 129, 169, 1087), `Collab Years Patents` = c(NA,
"2011, 2011, 2011", "2010", "2006, 2006"), `Collab Years Publications` = c("2011",
"2015, 2016, 2016", "2010", NA), ECP = c("2011", "2011", "2010",
"2006"), identifier = c("Publications", "Patents", "Both", "Patents"
)), .Names = c("ID", "Collab Years Patents", "Collab Years Publications",
"ECP", "identifier"), row.names = c(1L, 107L, 136L, 859L), class = "data.frame")
Here is an option using str_detect. Loop through the collaboration columns (sapply(DF[2:3], ...)) and use str_detect to check which of the columns contains the value of 'ECP'. Multiply by col() to convert the TRUE values to the column index, replace the NA elements with 0, get the column names corresponding to the maximum column index, strip the prefix part of the column names with sub, and then, for rows that have 'ECP' in both columns (both entries of 'm1' greater than 0), set the created vector 'v1' to "Both".
library(stringr)
# which collaboration column contains the ECP year, scaled by the column index
m1 <- col(DF[2:3]) * sapply(DF[2:3], function(x) str_detect(x, DF$ECP))
m1[is.na(m1)] <- 0                     # NA cells count as "no match"
# keep the last word of the matching column name ("Patents"/"Publications")
v1 <- sub(".*\\s(\\w+)$", "\\1", names(DF)[2:3][max.col(m1)])
v1[rowSums(m1 > 0) == 2] <- "Both"     # rows where both columns match
DF$identifier <- v1
DF$identifier
#[1] "Publications" "Patents" "Both" "Patents"
Using tidyverse (dplyr and purrr):
library(tidyverse)
DF %>%
  mutate_at(2:3, strsplit, ", ") %>%
  transmute(identifier = pmap(.[2:4], ~ c("Publications", "Patents", "Both")[
    2 * (..3 %in% .x) + (..3 %in% .y)])) %>%
  bind_cols(DF, .)
# ID Collab Years Patents Collab Years Publications ECP identifier
# 1 1 <NA> 2011 2011 Publications
# 2 129 2011, 2011, 2011 2015, 2016, 2016 2011 Patents
# 3 169 2010 2010 2010 Both
# 4 1087 2006, 2006 <NA> 2006 Patents
First question on here, so hopefully I've done this correctly!
I have a large dataset, the following is a small sample:
id <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6)
year <- c("2010", "2011", "2012", "2014", "2012", "2013", "2011", "2012", "2013", "2010", "2011", "2012", "2013", "2008", "2009", "2011")
value <- c(100, 33, 80, 90, 80, 100, 100, 90, 80, 90, 80, 100, 100, 90, 80, 99)
df <- data.frame(id, year, value)
df
For each id I want to return the values of two successive years so that I can compare the value in year n to year n+1. Where there are not two successive years then don't return anything for that id.
The output should be as follows:
id <- c(1, 1, 2, 3, 3, 4, 4, 4, 5)
year <- c("2010", "2011", "2012", "2011", "2012", "2010", "2011", "2012", "2008")
yvalue <- c(100, 33, 80, 100, 90, 90, 80, 100, 90)
yearadd1 <- c("2011", "2012", "2013", "2012", "2013", "2011", "2012", "2013", "2009")
valueadd1 <- c(33, 80, 100, 90, 80, 80, 100, 100, 80)
df <- data.frame(id, year, yvalue, yearadd1, valueadd1)
df
How do I get R to give me this output?
The main difficulty I face is that for id = 1 the first pair of successive years is 2010 and 2011, whereas for id = 5 it is 2008 and 2009, so I can't define what the first year is, as it varies by id.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)). Grouped by 'id', loop through the columns 'year' and 'value', get the lead observation of each with shift, assign (:=) them to new columns, and remove the NA rows (na.omit). Then, within each 'id', get the row ids (.I) of the first row and of the rows where the difference from the previous 'yearadd1' value is 1, and extract those rows.
library(data.table)
nm1 <- names(df)[2:3]
dt <- na.omit(setDT(df)[, paste0(nm1, "add1") := lapply(.SD, shift, type = "lead"),
by = id, .SDcols = nm1])
dt[dt[, .I[c(TRUE, diff(as.numeric(as.character(yearadd1)))==1)], id]$V1]
# id year value yearadd1 valueadd1
#1: 1 2010 100 2011 33
#2: 1 2011 33 2012 80
#3: 2 2012 80 2013 100
#4: 3 2011 100 2012 90
#5: 3 2012 90 2013 80
#6: 4 2010 90 2011 80
#7: 4 2011 80 2012 100
#8: 4 2012 100 2013 100
#9: 5 2008 90 2009 80
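If the only requirement is that the lead year be exactly one more than the current year, a direct comparison of the two columns avoids the diff() bookkeeping (a sketch reusing dt from above; the as.character() calls guard against 'year' being a factor):
dt[as.numeric(as.character(year)) + 1 == as.numeric(as.character(yearadd1))]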
Imagine I have the following data:
Year Month State ppo
2011 Jan CA 220
2011 Feb CA 250
2012 Jan CA 230
2011 Jan WA 200
2011 Feb WA 210
I need to calculate the mean for each state for the year, so the output would look something like this:
Year Month State ppo annualAvg
2011 Jan CA 220 235
2011 Feb CA 250 235
2012 Jan CA 230 230
2011 Jan WA 200 205
2011 Feb WA 210 205
where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.
Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?
Thanks for the help!
Try this:
library(data.table)
setDT(df)
df[, annualAvg := mean(ppo), by = .(Year, State)]
Base R: df$ppoAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
Using dplyr with group_by %>% mutate to add a column:
library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))
#Source: local data frame [5 x 5]
#Groups: Year, State [3]
# Year Month State ppo annualAvg
# (int) (fctr) (fctr) (int) (dbl)
#1 2011 Jan CA 220 235
#2 2011 Feb CA 250 235
#3 2012 Jan CA 230 230
#4 2011 Jan WA 200 205
#5 2011 Feb WA 210 205
Using data.table:
library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]
df
# Year Month State ppo annualAvg
#1: 2011 Jan CA 220 235
#2: 2011 Feb CA 250 235
#3: 2012 Jan CA 230 230
#4: 2011 Jan WA 200 205
#5: 2011 Feb WA 210 205
Data:
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"),
    State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA",
    "WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L,
    210L)), .Names = c("Year", "Month", "State", "ppo"),
    class = "data.frame", row.names = c(NA, -5L))
I have this df:
structure(list(YEAR = c("2007", "2007", "2007", "2008", "2008",
"2008", "2008", "2008", "2008", "2008"), MONTH = c("12", "10",
"11", "01", "03", "05", "06", "08", "09", "10"), TOTAL = c(85055988L,
21567576L, 82763640L, 91007916L, 93936288L, 99646750L, 90091044L,
98811936L, 96888876L, 100909236L)), .Names = c("YEAR", "MONTH",
"TOTAL"), row.names = c("24801", "33863", "34055", "24973", "25046",
"25295", "25384", "25541", "25861", "27319"), class = "data.frame")
I would like to organize this data frame as follows:
YEAR JAN FEB MARCH .... DEC
where the TOTAL value for each month goes into the corresponding cell.
Any ideas how I could easily accomplish this in R?
dcast and xtabs are among the options to consider:
xtabs(TOTAL ~ YEAR + MONTH, df)
# MONTH
# YEAR 01 03 05 06 08 09 10 11 12
# 2007 0 0 0 0 0 0 21567576 82763640 85055988
# 2008 91007916 93936288 99646750 90091044 98811936 96888876 100909236 0 0
library(reshape2)
dcast(df, YEAR ~ MONTH, value.var="TOTAL", fun.aggregate=sum)
# YEAR 01 03 05 06 08 09 10 11 12
# 1 2007 0 0 0 0 0 0 21567576 82763640 85055988
# 2 2008 91007916 93936288 99646750 90091044 98811936 96888876 100909236 0 0
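With a recent tidyr, pivot_wider() is another option (a sketch; values_fill fills the missing year/month combinations with 0 and names_sort orders the month columns, both of which need a fairly recent tidyr):
library(tidyr)
pivot_wider(df, names_from = MONTH, values_from = TOTAL,
            values_fill = 0L, names_sort = TRUE)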