I would like to rescale multiple variables at once. Each variable should be rescaled to lie between 0 and 10.
My dataset looks something like this:
df<-structure(list(Year = 1985:2012, r_mean_dp_C_EU_PTA = c(0.166685371371432,0, 0.340384674048008, 0.255663634111618, 0.137833312888481, 0.215940736735375,0.695926742038269, 1.12488458324014, 1.50426967770413, 1.96800275204271,
1.84220420613839, 2.55081439923073, 2.83958315572122, 3.02471358081631, 2.76227596053162, 5.13672466755955, 6.22501740311663, 6.04685020876299,
5.48990293535953, 5.74245144436088, 6.87554176822673, 5.35866756802216,6.21821261660873, 7.39740372167956, 7.37052059919359, 8.4053331043966,
7.88284279150424, 10),
r_mean_dp_C_US_PTA = c(0, 0.0243131684738152, 0.0295348762350131, 1.24572619158458, 1.20624633452509, 1.57418568231032,1.45479246796848, 2.38700784566208, 2.62865525326503, 2.26401361870534,2.67319203680329, 2.64440548764366, 3.10459526464658, 3.05231530072328,
3.32660416229216, 4.14909239351474, 3.76404440984403, 3.79766644256544,4.55279786294561, 5.57506946922008, 6.83412605593388, 8.07241989452914,9.10370786838265, 9.51564633960853, 8.64357423479438, 9.10723202296861,10, 9.06442082870898),
r_mean_dp_C_eu_esr_sum = c(0.0267071299038037,0, 0.0481033555876806, 0.039231355183461, 0.0255363040160583,0.0284158726695472, 0.234715155525714, 0.544954230234254, 0.683338138878583, 0.828929653572072, 0.950656658215744, 1.21492080702167, 1.30147631753441, 1.36122263965133, 1.33106989847101, 1.7848396827464, 2.19247065377408, 2.1506217173316, 4.91794342139369, 4.83398913690854, 7.28545175419305,5.42827409024432, 7.34375238832023, 8.91410171271897, 8.98533852868884, 9.17361943843028, 9.21421152468197, 10)), row.names = c(NA, -28L
),
class = c("data.table", "data.frame"))
I have tried to use the scales package, but it does not work.
The version with name identifiers fails:
library(scales)
vars<-names(df[,2:4])
tst<-setDT(df)[, (vars):=lapply((vars), function(x) rescale(x,to = c(0,10)))]
Using position identifiers runs, but it sets all the variable values to 5, which is not what I am looking for:
tst<-setDT(df)[, 2:4:=lapply(2:4, function(x) rescale(x,to = c(0,10)))]
tst
# Year r_mean_dp_C_EU_PTA r_mean_dp_C_US_PTA r_mean_dp_C_eu_esr_sum
# 1: 1985 5 5 5
# 2: 1986 5 5 5
# 3: 1987 5 5 5
# 4: 1988 5 5 5
# 5: 1989 5 5 5
# 6: 1990 5 5 5
# 7: 1991 5 5 5
# 8: 1992 5 5 5
# 9: 1993 5 5 5
# 10: 1994 5 5 5
# 11: 1995 5 5 5
# 12: 1996 5 5 5
# 13: 1997 5 5 5
# 14: 1998 5 5 5
# 15: 1999 5 5 5
# 16: 2000 5 5 5
# 17: 2001 5 5 5
# 18: 2002 5 5 5
# 19: 2003 5 5 5
# 20: 2004 5 5 5
# 21: 2005 5 5 5
# 22: 2006 5 5 5
# 23: 2007 5 5 5
# 24: 2008 5 5 5
# 25: 2009 5 5 5
# 26: 2010 5 5 5
# 27: 2011 5 5 5
# 28: 2012 5 5 5
Does anyone know what I am doing wrong?
Thanks a lot in advance for your help.
We can use .SDcols. (The name-based attempt fails because lapply(vars, ...) loops over the column names themselves, so rescale() receives character strings; the position-based attempt returns all 5s because lapply(2:4, ...) rescales the scalars 2, 3 and 4, and rescale() maps a zero-range input to the midpoint of to, i.e. 5.)
To apply by name:
library(data.table)
df[, (vars) := lapply(.SD, scales::rescale, to = c(0, 10)), .SDcols = vars]
To apply by position:
df[, 2:4 := lapply(.SD, scales::rescale, to = c(0, 10)), .SDcols = 2:4]
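A quick sanity check (my addition, not part of the original answer): each rescaled column should now span exactly 0 to 10.
# ranges of the rescaled columns; `vars` is the character vector from the question
sapply(df[, vars, with = FALSE], range)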
I am a bit confused about what the exact output needs to be, as in this example everything is already between 0 and 10.
Did you try to use dplyr?
tst <- df %>%
  mutate_at(vars, function(x) rescale(x, to = c(0, 10)))
resulted in:
Year r_mean_dp_C_EU_PTA r_mean_dp_C_US_PTA r_mean_dp_C_eu_esr_sum
1 1985 0.1515322 0.00000000 0.02670713
2 1986 0.0000000 0.02431317 0.00000000
3 1987 0.3094406 0.02953488 0.04810336
4 1988 0.2324215 1.24572619 0.03923136
5 1989 0.1253030 1.20624633 0.02553630
6 1990 0.1963098 1.57418568 0.02841587
7 1991 0.6326607 1.45479247 0.23471516
8 1992 1.0226223 2.38700785 0.54495423
9 1993 1.3675179 2.62865525 0.68333814
10 1994 1.7890934 2.26401362 0.82892965
11 1995 1.6747311 2.67319204 0.95065666
12 1996 2.3189222 2.64440549 1.21492081
13 1997 2.5814392 3.10459526 1.30147632
14 1998 2.7497396 3.05231530 1.36122264
15 1999 2.5111600 3.32660416 1.33106990
16 2000 4.6697497 4.14909239 1.78483968
17 2001 5.6591067 3.76404441 2.19247065
18 2002 5.4971366 3.79766644 2.15062172
19 2003 4.9908209 4.55279786 4.91794342
20 2004 5.2204104 5.57506947 4.83398914
21 2005 6.2504925 6.83412606 7.28545175
22 2006 4.8715160 8.07241989 5.42827409
23 2007 5.6529206 9.10370787 7.34375239
24 2008 6.7249125 9.51564634 8.91410171
25 2009 6.7004733 8.64357423 8.98533853
26 2010 7.6412119 9.10723202 9.17361944
27 2011 7.1662207 10.00000000 9.21421152
28 2012 10.0000000 9.06442083 10.00000000
Is this what you want?
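As a side note, mutate_at() is superseded in recent dplyr releases; the same operation with across() would look like this (my addition, reusing the vars character vector from the question):
library(dplyr)
library(scales)
# rescale each named column to the 0-10 range
tst <- df %>%
  mutate(across(all_of(vars), ~ rescale(.x, to = c(0, 10))))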
Related
I'm trying to calculate the compound annual growth rate (CAGR) of my data (snippet shown below). Does anyone know the best way to do this, or whether there is a function that does part of the job?
Data (only worried about the preds column here; the others can be ignored):
year month timestep ymin ymax preds date
1 1998 1 1 17.84037 18.58553 18.21295 1998-01-01
2 1998 2 2 17.05009 17.70642 17.37826 1998-02-01
3 1998 3 3 16.97067 17.61320 17.29193 1998-03-01
4 1998 4 4 18.38551 19.00838 18.69695 1998-04-01
5 1998 5 5 21.39082 21.97338 21.68210 1998-05-01
6 1998 6 6 24.77679 25.35464 25.06571 1998-06-01
7 1998 7 7 27.27057 27.82818 27.54938 1998-07-01
8 1998 8 8 28.24703 28.76702 28.50702 1998-08-01
9 1998 9 9 27.72370 28.24619 27.98494 1998-09-01
10 1998 10 10 25.83783 26.33969 26.08876 1998-10-01
11 1998 11 11 22.94968 23.42268 23.18618 1998-11-01
12 1998 12 12 19.50499 20.05466 19.77982 1998-12-01
13 1999 1 13 17.98323 18.50530 18.24426 1999-01-01
14 1999 2 14 17.20124 17.61746 17.40935 1999-02-01
15 1999 3 15 17.11064 17.53492 17.32278 1999-03-01
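For reference, the standard formula is CAGR = (end/start)^(1/years) - 1. A minimal sketch of one way to apply it to the preds column (my addition; it assumes the snippet above is stored in a data frame named dat and that at least two distinct years are present):
# average preds per year, then compare the first and last year
yearly <- aggregate(preds ~ year, data = dat, FUN = mean)
n_years <- max(yearly$year) - min(yearly$year)
start <- yearly$preds[which.min(yearly$year)]
end <- yearly$preds[which.max(yearly$year)]
(end / start)^(1 / n_years) - 1  # compound annual growth rate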
I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the different persons by family in a new column, so that persons 27101 and 27102 (siblings) are group/family 1, 42101 and 42102 are in group 2, 117103, 117104 and 117105 are in group 3, and so on.
Person 4102 has no siblings and should be NA in the new column.
Two or more persons are always siblings if their IDs are no more than 6 apart.
I have a far larger dataset with over 3000 rows. What is the most efficient way to do this?
You can use round with digits = -1 (or -2 if a family's IDs can span more than 10 consecutive numbers). If you want the group ids to be consecutive integers starting at 1, you can use cur_group_id():
library(dplyr)
data %>%
  group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
  mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
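Neither answer returns NA for persons without siblings, as the question asks. One possible extension of the trunc() approach (a sketch of my own, using dplyr's add_count() to spot singleton families):
library(dplyr)
data %>%
  mutate(fam = trunc(id_pers / 1000)) %>%
  add_count(fam) %>%  # n = family size
  mutate(famgroup = ifelse(n > 1, match(fam, unique(fam[n > 1])), NA_integer_)) %>%
  select(-fam, -n)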
I have a list of data frames that I need to combine into a single one.
year<-1990:2000
v1<-1:11
v2<-20:30
df1<-data.frame(year,v1)
df2<-data.frame(year,v2)
ldf<-list(df1,df2)
I now want to combine this list into a single data frame and get
> head(df)
year v1 v2
1 1990 1 20
2 1991 2 21
3 1992 3 22
4 1993 4 23
Note that my question is different from a similar question, where the solution was df <- ldply(ldf, data.frame).
What I am essentially looking for is a more automatic way of doing this: df <- merge(df1, df2, by = "year")
With a larger number of list elements, a convenient option is reduce() with one of the join functions:
library(tidyverse)
ldf %>%
reduce(inner_join, by = "year")
# year v1 v2
#1 1990 1 20
#2 1991 2 21
#3 1992 3 22
#4 1993 4 23
#5 1994 5 24
#6 1995 6 25
#7 1996 7 26
#8 1997 8 27
#9 1998 9 28
#10 1999 10 29
#11 2000 11 30
Is there anything wrong with:
df <- merge(ldf[[1]], ldf[[2]], by="year")
Or for a long list:
df1 <- ldf[[1]]
for (x in 2:length(ldf)) {
  df1 <- merge(df1, ldf[[x]])
}
# year v1 v2
# 1 1990 1 20
# 2 1991 2 21
# 3 1992 3 22
# 4 1993 4 23
# 5 1994 5 24
# 6 1995 6 25
# 7 1996 7 26
# 8 1997 8 27
# 9 1998 9 28
# 10 1999 10 29
# 11 2000 11 30
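The same loop can also be written as a one-liner with base R's Reduce(), the base equivalent of the purrr reduce() answer above:
# merge() defaults to joining on all shared columns, here "year"
df <- Reduce(function(x, y) merge(x, y, by = "year"), ldf)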
I have a simple table with three columns ("Year", "Target", "Value"), and I would like to create a new column ("Resp") containing, for each row, the first subsequent "Year" in which "Value" is higher than that row's "Target".
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,3,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestions on how I can solve this problem?
In addition to the "Resp" column, I want to create a new one ("Black.Y") containing the "Year" of the minimum "Value" in the years strictly between the current "Year" and "Resp" (or through the end of the data when "Resp" is NA).
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestions on how I can solve this problem?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
The resulting table is:
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2015
# 3 2012 2 2 2013
# 4 2013 7 3 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
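The code above only builds Resp. For the additional Black.Y column, one possible base R sketch (my addition, reusing the idx vector from the answer; it assumes the rule is: the Year of the minimum Value strictly between each row and its Resp row, or through the end of the data when Resp is NA, with NA for empty spans):
n <- nrow(db)
db$Black.Y <- sapply(seq_len(n), function(i) {
  end <- if (is.na(idx[i])) n + 1 else idx[i]        # exclusive upper bound
  span <- which(seq_len(n) > i & seq_len(n) < end)   # rows strictly in between
  if (length(span) == 0) NA else db$Year[span[which.min(db$Value[span])]]
})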
I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of the variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I also need to delete all subsequent observations of the same product following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr, and I have also been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 flags a missing value (NA):
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
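As a side note (my addition), the two-step 0/1 indicator above can be built in one step:
# is.na() is TRUE exactly where assets is missing
mydf$missing <- as.integer(is.na(mydf$assets))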
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[, nas := cumsum(is.na(assets)), by = "productreference"][nas == 0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can also do it using base R and a for loop. This code is a bit longer than the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first NA in assets and drop that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)) {
  s1 <- mydf[mydf$productreference == i, ]
  # last row to keep: everything before the first NA,
  # or the whole subset if assets has no NA
  last <- ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets))) - 1)
  mydf2 <- rbind(mydf2, s1[seq_len(last), ])  # seq_len(0) drops groups starting with NA
}
mydf2