Consider this set of dates, x:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
x <- strftime(x, '%Y')
The following is a distribution of the years of those dates:
> table(x)
x
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994
4 4 3 3 6 4 3 4 5 12 1 1 1 2
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
9 4 2 1 4 4 2 1 4 1 4 3 4 3
2010
1
Now say I want to group them by decade. For this, I use the cut function:
> table(cut(x, seq(1980, 2010, 10)))
Error in cut.default(x, seq(1980, 2010, 10)) : 'x' must be numeric
Ok, so let's force x to numeric:
> table(cut(as.numeric(x), seq(1980, 2010, 10)))
(1.98e+03,1.99e+03] (1.99e+03,2e+03] (2e+03,2.01e+03]
45 28 23
Now, as you can see, the row.names of that table are in scientific format. How do I force them to not be in scientific notation? I've tried wrapping that whole command above inside format, formatC and prettyNum, but all those do is format the frequencies.
Thanks joran for pointing the path to the answer. I'll elaborate it here for the record:
Changing cut's dig.lab parameter from the default 3 to 4 solved this particular mockup as well as my real problem:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4))
(1980,1990] (1990,2000] (2000,2010]
45 28 23
By the way, in order for 1980 to be counted one should include the include.lowest argument:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4, include.lowest = T))
[1980,1990] (1990,2000] (2000,2010]
49 28 23
Now it sums to 100! :)
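Note that the (1980,1990] bins above are closed on the right, so the year 1990 is counted with the 1980s. As a minimal sketch (the years vector below is made up for illustration and is not the sampled data), right = FALSE gives calendar decades instead, and integer division is another way to arrive at the same grouping:

```r
# Illustrative years only; not the sampled data from the question
years <- c(1980, 1985, 1989, 1990, 1999, 2000, 2010)

# Left-closed bins: 1990 falls in [1990,2000) rather than (1980,1990]
decade_cut <- cut(years, seq(1980, 2020, 10), right = FALSE, dig.lab = 4)
table(decade_cut)

# Same grouping via integer division, labelled by the first year of the decade
decade_div <- (years %/% 10) * 10
table(decade_div)
```

The upper break is extended to 2020 here so that 2010 gets its own bin; whether that is wanted depends on how the decades should be defined.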
This doesn't exactly answer the question you asked, but shows you a possible alternative: use the fact that there is a cut.Date method:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
out <- table(cut(x, "10 years"))
out
#
# 1980-01-01 1990-01-01 2000-01-01 2010-01-01
# 48 25 26 1
Here, we also get what I would consider the "correct" values for each bin.
As a crude justification of my statement about "correct" values, consider the values we get when we manually calculate based on table:
y <- strftime(x, '%Y')
Tab <- table(y)
Tab
# y
# 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994 1995 1996
# 4 4 3 3 6 4 3 4 5 12 1 1 1 2 9 4
# 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2010
# 2 1 4 4 2 1 4 1 4 3 4 3 1
sum(Tab[grepl("198", names(Tab))])
# [1] 48
sum(Tab[grepl("199", names(Tab))])
# [1] 25
sum(Tab[grepl("200", names(Tab))])
# [1] 26
sum(Tab[grepl("201", names(Tab))])
# [1] 1
I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the persons by family in a new column, so that persons 27101 and 27102 (siblings) are in group/family 1, 42101 and 42102 are in group 2, 117103, 117104 and 117105 are in group 3, and so on.
Person "4102" has no siblings and should be a NA in the new column.
It is always the case that 2 or more persons are siblings if the ID's are not further apart than a maximum of 6 numbers.
I have a far larger dataset with over 3000 rows. How could I do it the most efficient way?
You can use round with digits = -1 to group the ids by tens (or digits = -2 if a family can have more than 10 members). If you want the group ids to be integers starting from 1, you can use cur_group_id:
library(dplyr)
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 3
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
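Neither answer above marks persons without siblings as NA, which the question asked for. A minimal base R sketch of one way to do that follows. It assumes the ids are sorted in ascending order and reads "not further apart than 6" as a gap of at most 6 between consecutive ids; the names gap_group, group_size and famgroup are made up for illustration:

```r
id_pers <- c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
             56102, 73102, 74103, 103104, 117103, 117104, 117105)

# Start a new group whenever the gap to the previous id exceeds 6
# (assumes id_pers is sorted ascending)
gap_group <- cumsum(c(TRUE, diff(id_pers) > 6))

# Number of persons in each group, repeated for every member
group_size <- ave(id_pers, gap_group, FUN = length)

# Singletons become NA; the remaining groups are renumbered 1, 2, 3, ...
famgroup <- ifelse(group_size > 1, gap_group, NA)
famgroup <- match(famgroup, unique(famgroup[!is.na(famgroup)]))

data.frame(id_pers, famgroup)
```

With this data, 27101/27102 end up in group 1, 28101/28102 in group 2, 42101/42102 in group 3 and the three 1171xx ids in group 4, while the remaining ids get NA.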
I have a list of URLs and I want to extract parts of each string and save them in other variables. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014
You could use str_match to capture numbers after 'owgr' and 'f' :
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to dataframe, change class to numeric and assign column names.
setNames(type.convert(data.frame(
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from last part of sample. We can get the last part with basename.
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
transmute(week = as.integer(str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)")), year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014
You could use {unglue} :
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014
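For completeness, the same extraction can be sketched in base R without extra packages, using strcapture from utils. The urls vector is a two-element subset of the sample data, and the column names in proto are my own choice:

```r
urls <- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")

# The two capture groups are converted to the column types given in proto
res <- strcapture("owgr(\\d+)f(\\d+)", urls,
                  proto = data.frame(week = integer(), year = integer()))
res
```

Because proto declares integer columns, the leading zero in "01" is dropped and both columns come back as integers rather than strings.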
I would like to rescale multiple variables at once. Each of the variables should be rescaled to lie between 0 and 10.
My dataset looks something like this
df<-structure(list(Year = 1985:2012, r_mean_dp_C_EU_PTA = c(0.166685371371432,0, 0.340384674048008, 0.255663634111618, 0.137833312888481, 0.215940736735375,0.695926742038269, 1.12488458324014, 1.50426967770413, 1.96800275204271,
1.84220420613839, 2.55081439923073, 2.83958315572122, 3.02471358081631, 2.76227596053162, 5.13672466755955, 6.22501740311663, 6.04685020876299,
5.48990293535953, 5.74245144436088, 6.87554176822673, 5.35866756802216,6.21821261660873, 7.39740372167956, 7.37052059919359, 8.4053331043966,
7.88284279150424, 10),
r_mean_dp_C_US_PTA = c(0, 0.0243131684738152, 0.0295348762350131, 1.24572619158458, 1.20624633452509, 1.57418568231032,1.45479246796848, 2.38700784566208, 2.62865525326503, 2.26401361870534,2.67319203680329, 2.64440548764366, 3.10459526464658, 3.05231530072328,
3.32660416229216, 4.14909239351474, 3.76404440984403, 3.79766644256544,4.55279786294561, 5.57506946922008, 6.83412605593388, 8.07241989452914,9.10370786838265, 9.51564633960853, 8.64357423479438, 9.10723202296861,10, 9.06442082870898),
r_mean_dp_C_eu_esr_sum = c(0.0267071299038037,0, 0.0481033555876806, 0.039231355183461, 0.0255363040160583,0.0284158726695472, 0.234715155525714, 0.544954230234254, 0.683338138878583, 0.828929653572072, 0.950656658215744, 1.21492080702167, 1.30147631753441, 1.36122263965133, 1.33106989847101, 1.7848396827464, 2.19247065377408, 2.1506217173316, 4.91794342139369, 4.83398913690854, 7.28545175419305,5.42827409024432, 7.34375238832023, 8.91410171271897, 8.98533852868884, 9.17361943843028, 9.21421152468197, 10)), row.names = c(NA, -28L
),
class = c("data.table", "data.frame"))
I have tried to use the scales package, but it does not work. The version that refers to the columns by name fails:
library(scales)
vars<-names(df[,2:4])
tst<-setDT(df)[, (vars):=lapply((vars), function(x) rescale(x,to = c(0,10)))]
Using position identifiers sets all the variable values to 5 which is not what I am looking for.
tst<-setDT(df)[, 2:4:=lapply(2:4, function(x) rescale(x,to = c(0,10)))]
tst
# Year r_mean_dp_C_EU_PTA r_mean_dp_C_US_PTA r_mean_dp_C_eu_esr_sum
# 1: 1985 5 5 5
# 2: 1986 5 5 5
# 3: 1987 5 5 5
# 4: 1988 5 5 5
# 5: 1989 5 5 5
# 6: 1990 5 5 5
# 7: 1991 5 5 5
# 8: 1992 5 5 5
# 9: 1993 5 5 5
# 10: 1994 5 5 5
# 11: 1995 5 5 5
# 12: 1996 5 5 5
# 13: 1997 5 5 5
# 14: 1998 5 5 5
# 15: 1999 5 5 5
# 16: 2000 5 5 5
# 17: 2001 5 5 5
# 18: 2002 5 5 5
# 19: 2003 5 5 5
# 20: 2004 5 5 5
# 21: 2005 5 5 5
# 22: 2006 5 5 5
# 23: 2007 5 5 5
# 24: 2008 5 5 5
# 25: 2009 5 5 5
# 26: 2010 5 5 5
# 27: 2011 5 5 5
# 28: 2012 5 5 5
Does anyone know what I am doing wrong?
Thanks a lot in advance for your help
We can use .SDcols.
To apply by names
library(data.table)
df[, (vars):= lapply(.SD, scales::rescale, to = c(0, 10)), .SDcols = vars]
To apply by position
df[, 2:4 := lapply(.SD, scales::rescale, to = c(0, 10)), .SDcols = 2:4]
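The likely reason the original attempts went wrong is that lapply(vars, ...) and lapply(2:4, ...) hand the function the column names or positions themselves, not the column contents; a single number has no range to rescale, which would explain the constant 5 (the midpoint of 0 and 10). A minimal base R sketch of the same idea as .SDcols, iterating over the columns themselves, is below. The helper rescale_0_10 and the toy data are made up for illustration and only approximate what scales::rescale does:

```r
# Hypothetical helper: linear rescale of a numeric vector to [0, 10]
rescale_0_10 <- function(x) (x - min(x)) / (max(x) - min(x)) * 10

# Toy data, not the data from the question
toy <- data.frame(Year = 1985:1988, a = c(0, 2, 4, 8), b = c(1, 3, 2, 5))
cols <- c("a", "b")

# lapply over toy[cols] passes each column's values to the helper
toy[cols] <- lapply(toy[cols], rescale_0_10)
toy
```

Here a becomes 0, 2.5, 5, 10 and b becomes 0, 5, 2.5, 10, while Year is left untouched.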
I am a bit confused what the exact output needs to be, as in this example everything is between 0 and 10.
Did you try to use dplyr?
library(dplyr)
library(scales)
tst <- df %>%
mutate_at(vars, function(x) rescale(x,to = c(0,10)) )
resulted in:
Year r_mean_dp_C_EU_PTA r_mean_dp_C_US_PTA r_mean_dp_C_eu_esr_sum
1 1985 0.1515322 0.00000000 0.02670713
2 1986 0.0000000 0.02431317 0.00000000
3 1987 0.3094406 0.02953488 0.04810336
4 1988 0.2324215 1.24572619 0.03923136
5 1989 0.1253030 1.20624633 0.02553630
6 1990 0.1963098 1.57418568 0.02841587
7 1991 0.6326607 1.45479247 0.23471516
8 1992 1.0226223 2.38700785 0.54495423
9 1993 1.3675179 2.62865525 0.68333814
10 1994 1.7890934 2.26401362 0.82892965
11 1995 1.6747311 2.67319204 0.95065666
12 1996 2.3189222 2.64440549 1.21492081
13 1997 2.5814392 3.10459526 1.30147632
14 1998 2.7497396 3.05231530 1.36122264
15 1999 2.5111600 3.32660416 1.33106990
16 2000 4.6697497 4.14909239 1.78483968
17 2001 5.6591067 3.76404441 2.19247065
18 2002 5.4971366 3.79766644 2.15062172
19 2003 4.9908209 4.55279786 4.91794342
20 2004 5.2204104 5.57506947 4.83398914
21 2005 6.2504925 6.83412606 7.28545175
22 2006 4.8715160 8.07241989 5.42827409
23 2007 5.6529206 9.10370787 7.34375239
24 2008 6.7249125 9.51564634 8.91410171
25 2009 6.7004733 8.64357423 8.98533853
26 2010 7.6412119 9.10723202 9.17361944
27 2011 7.1662207 10.00000000 9.21421152
28 2012 10.0000000 9.06442083 10.00000000
Is this what you want?
My table looks like this:
# Year Month WaterYear
# 1993 3
# 2000 4
# 2013 10
# 2015 6
# 2000 7
# 2008 12
# 2008 9
# 2012 10
# 2000 11
# 2000 12
I am trying to update this table by computing WaterYear equals Year+1 where months range between October and December.
I am working in R and hoping to find the easiest way to make it work.
A simple ifelse call will do the trick.
From your data.
# Create data
Year <- c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000)
Month <- c(3, 4, 10, 6, 7, 12, 9 ,10, 11, 12)
WaterYear <- rep("",length(Year))
dat <- data.frame(Year, Month, WaterYear)
# If month is greater or equal to 10 change it to Year +1,
# otherwise keep it as it is
dat$WaterYear <- ifelse(dat$Month >= 10, dat$Year + 1, dat$WaterYear)
Results in
Year Month WaterYear
1993 3
2000 4
2013 10 2014
2015 6
2000 7
2008 12 2009
2008 9
2012 10 2013
2000 11 2001
2000 12 2001
We can also do
i1 <- dat$Month >=10
dat$WaterYear[i1] <- dat$Year[i1] + 1
dat
# Year Month WaterYear
#1 1993 3
#2 2000 4
#3 2013 10 2014
#4 2015 6
#5 2000 7
#6 2008 12 2009
#7 2008 9
#8 2012 10 2013
#9 2000 11 2001
#10 2000 12 2001
Or using data.table, convert the 'data.frame' to 'data.table' (setDT(dat)), specify the logical condition in 'i' (Month >= 10), and assign (:=) the 'Year' + 1 to 'WaterYear'
library(data.table)
setDT(dat)[Month >=10, WaterYear := as.character(Year + 1)]
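If WaterYear should be numeric for every row rather than blank outside October to December, a short sketch is to add the logical condition directly, since TRUE counts as 1 and FALSE as 0. This assumes the water year simply equals Year for the other months, which the question does not state; the wy data frame is a small made-up subset:

```r
# Small illustrative subset of the data
wy <- data.frame(Year = c(1993, 2013, 2008, 2008), Month = c(3, 10, 12, 9))

# Month >= 10 is TRUE (1) for October to December, FALSE (0) otherwise
wy$WaterYear <- wy$Year + (wy$Month >= 10)
wy
```

This gives 1993, 2014, 2009 and 2008 for the four rows, and keeps the column numeric instead of mixing numbers with empty strings.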
Maybe it's something basic, but I couldn't find the answer.
I have
Id Year V1
1 2009 33
1 2010 67
1 2011 38
2 2009 45
3 2009 65
3 2010 74
4 2009 47
4 2010 51
4 2011 14
I need to select only the rows whose Id appears in all three years 2009, 2010 and 2011.
Id Year V1
1 2009 33
1 2010 67
1 2011 38
4 2009 47
4 2010 51
4 2011 14
I tried
d1_3 <- subset(d1, Year==2009 |Year==2010 |Year==2011 )
but it doesn't work.
Can anyone suggest how I can do this in R?
I think ave could be useful here. I call your original data frame 'df'. For each Id, check if 2009-2011 is present in Year (2009:2011 %in% x). This gives a logical vector, which can be summed. Test if the sum equals 3 (if all Years are present, the sum is 3), which results in a new logical vector, which is used to subset rows of the data frame.
# ave() returns numeric 0/1 here (Year is numeric), so compare with 1 to get a logical index
df[ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3) == 1, ]
# Id Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
Another way of using ave
DF
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 4 2 2009 45
## 5 3 2009 65
## 6 3 2010 74
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
This should do the job :)
library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]
I think an approach using ave is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table.
Again, just throwing this out there to provide some options to use different aspects of the language.
d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)
# medium way
library(data.table)
d2 <- as.data.table(d1)
d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
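Several of the approaches above count rows per ID, which assumes each ID has at most one row per year. As a hedged alternative sketch in base R, tapply can test directly whether every wanted year is present for an ID; the names wanted, has_all and res are made up for illustration:

```r
d1 <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1 = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

wanted <- 2009:2011

# One TRUE/FALSE per ID: are all wanted years present among its rows?
has_all <- tapply(d1$Year, d1$ID, function(y) all(wanted %in% y))

# Keep the rows of the IDs that passed the test
res <- d1[as.character(d1$ID) %in% names(has_all)[has_all], ]
res
```

This keeps the rows for IDs 1 and 4, and it would still work if an ID had duplicate rows for a year or rows for years outside 2009 to 2011 (those extra rows would be kept as well).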