R: how to "split" on combined subsets?

Let's say I have a factor with a bunch of levels, and I have grouped some of those levels as represented by the grps variable.
I can use "split" to split my data frame, but is it possible to split the data frame so that the combined levels represented by grps are in the same split?
set.seed(42)
fooLevels <- c(115,119,156,120,158,219)
foo <- fooLevels[round(runif(20, min=1, max=6))]
doo <- rnorm(20)
df <- data.frame(foo,doo)
grps <- c("{115}","{119}","{156}{120}{158}{219}")
splits <- split(df, f = df$foo)
I'd like the output to look like:
>splits
$`{115}`
foo doo
8 115 0.08983289
9 115 -2.99309008
12 115 0.18523056
$`{119}`
foo doo
2 119 -0.7838389
7 119 0.6792888
13 119 0.5818237
$`{156}{120}{158}{219}`
foo doo
1 120 0.3219253
4 120 0.6428993
6 120 0.2765507
11 120 -0.3672346
18 120 1.0385061
20 120 0.7208782
3 156 1.5757275
10 156 0.2848830
17 156 0.3358481
5 158 0.08976065
16 158 1.30254263
19 158 0.92072857
14 219 1.3997368
15 219 -0.7272921
The order of the rows in the list(data.frame) is of no consequence.

You can set the names of the list and change the expression in strsplit to whatever works for you.
lapply(
strsplit(
grps,
'}\\{|\\{|}'
),
function(x) {
df[df$foo %in% x,]
}
)
[[1]]
[1] foo doo
<0 rows> (or 0-length row.names)
[[2]]
foo doo
3 119 -1.388861
8 119 -2.656455
14 119 1.214675
18 119 -1.763163
[[3]]
foo doo
1 219 1.3048697
2 219 2.2866454
4 158 -0.2787888
5 120 -0.1333213
6 120 0.6359504
7 158 -0.2842529
9 120 -2.4404669
10 158 1.3201133
11 156 -0.3066386
12 158 -1.7813084
13 219 -0.1719174
15 156 1.8951935
16 219 -0.4304691
17 219 -0.2572694
19 156 0.4600974
20 120 -0.6399949
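To get a list named after grps directly, one option is to build a lookup from each level to its group label and hand that to split(). This is only a sketch of that idea: the names members and lookup are illustrative, and it assumes every level is a run of digits inside braces, as in the question.

```r
set.seed(42)
fooLevels <- c(115, 119, 156, 120, 158, 219)
foo <- fooLevels[round(runif(20, min = 1, max = 6))]
df <- data.frame(foo, doo = rnorm(20))
grps <- c("{115}", "{119}", "{156}{120}{158}{219}")

# Extract the digit runs from each group label (one character vector per group)
members <- regmatches(grps, gregexpr("[0-9]+", grps))

# Named vector mapping each level (as a string) to its group label
lookup <- setNames(rep(grps, lengths(members)), unlist(members))

# Split on the group label; fixing the levels keeps the order of grps
# and keeps groups that happen to have no rows
splits <- split(df, factor(lookup[as.character(df$foo)], levels = grps))
```

Because the factor levels are fixed to grps, a group with no matching rows still appears in the result as an empty data frame.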

If your grps object does not already exist, you could do something like this:
x = split(df, df$foo)
# names(x) are character strings, so convert them before comparing with a number
y = Reduce(`rbind`, x[as.numeric(names(x)) > 120])
o = c(x[as.numeric(names(x)) <= 120], setNames(list(y), paste(unique(y$foo), collapse = ' ')))
#> o
#$`119`
# foo doo
#3 119 -1.388861
#8 119 -2.656455
#14 119 1.214675
#18 119 -1.763163
#$`120`
# foo doo
#5 120 -0.1333213
#6 120 0.6359504
#9 120 -2.4404669
#20 120 -0.6399949
#$`156 158 219`
# foo doo
#11 156 -0.3066386
#15 156 1.8951935
#19 156 0.4600974
#4 158 -0.2787888
#7 158 -0.2842529
#10 158 1.3201133
#12 158 -1.7813084
#1 219 1.3048697
#2 219 2.2866454
#13 219 -0.1719174
#16 219 -0.4304691
#17 219 -0.2572694

Related

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
mutate_at(
.vars = c('var_a', 'var_b'),
.funs = list( new = function(x) x/(.[,paste0('number_a', names(x))]) ))
I did not get an error, but I got a wrong result. I think the problem is that I don't understand what 'x' is. Is it one of the strings in .vars? Is it a column in My_data? Something else?
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly, provided the columns are correctly ordered, meaning "var_a" is the first column in the "var" group, "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
The function across() has replaced scope variants such as mutate_at(), summarize_at() and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
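The approaches above rely on the "var" and "number" columns appearing in matching order. If that cannot be guaranteed, one possible alternative is to pair the columns by name inside across() using cur_column(). This is a sketch assuming dplyr 1.0.0 or later and the var_/number_ naming from the question; result is an illustrative name.

```r
library(dplyr)

My_data <- data.frame(
  var_a = 101:110,
  var_b = 201:210,
  number_a = 1:10,
  number_b = 21:30
)

result <- My_data %>%
  mutate(across(
    starts_with("var_"),
    # cur_column() is the name of the column being processed, e.g. "var_a";
    # swap the prefix to look up its partner, e.g. "number_a"
    ~ .x / My_data[[sub("^var_", "number_", cur_column())]],
    .names = "{.col}_new"
  ))
```

Here the partner column is looked up in the original My_data, which is fine for an ungrouped data frame; with grouped data the lookup would need to respect the groups.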

R function to paste information from different columns?

I have a data frame with 3 different identifications and sometimes they overlap. I want to create a new column, with only one of those ids, in an order of preference (id1>id2>id3).
Ex.:
id1 id2 id3
12 145 8763
45 836 5766
13 768 9374
836 5766
12 145
9282
567
45 836 5766
and I want to have:
id1 id2 id3 id.new
12 145 8763 12
45 836 5766 45
13 768 9374 13
836 5766 836
9282 9282
567 567
I have tried the ifelse, which, and grep functions, but I can't make it work.
Ex. of my try:
df$id1 <- ifelse(df$id1 == "", paste(df$2), (ifelse(df$id1)))
I am able to do this in Excel, but I am switching to R because it is more reliable and reproducible :) In Excel I would use:
=if(A1="",B1,(if(B1="",C1,B1)),A1)
Using coalesce from the dplyr package, we can try:
library(dplyr)
df$id.new <- coalesce(df$id1, df$id2, df$id3)
df
id1 id2 id3 id.new
1 12 145 8763 12
2 45 836 5766 45
3 13 768 9374 13
4 NA 836 5766 836
5 12 145 NA 12
6 NA NA 9282 9282
7 NA 567 NA 567
8 45 836 5766 45
Data:
df <- data.frame(id1=c(12,45,13,NA,12,NA,NA,45),
id2=c(145,836,768,836,145,NA,567,836),
id3=c(8763,5766,9374,5766,NA,9282,NA,5766))
In base R, you can apply which.min to each row of is.na(df) to find the first non-missing column, and use the resulting (row, column) index matrix for subsetting. Thanks to @tim-biegeleisen for the dataset.
df$id.new <- df[cbind(1:nrow(df), apply(is.na(df), 1, which.min))]
df
# id1 id2 id3 id.new
#1 12 145 8763 12
#2 45 836 5766 45
#3 13 768 9374 13
#4 NA 836 5766 836
#5 12 145 NA 12
#6 NA NA 9282 9282
#7 NA 567 NA 567
#8 45 836 5766 45
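Both answers assume the missing ids are NA. The attempt in the question compares against "", so if the ids were read in as character with empty strings, they would need converting first. A base R sketch of that case, using a small made-up data frame called raw (not the asker's data):

```r
# Hypothetical input where missing ids are empty strings rather than NA
raw <- data.frame(
  id1 = c("12", "", ""),
  id2 = c("145", "836", ""),
  id3 = c("8763", "5766", "9282"),
  stringsAsFactors = FALSE
)

# Turn empty strings into NA so the usual NA tools apply
raw[raw == ""] <- NA

# Walk the columns left to right, keeping the first non-NA value per row
raw$id.new <- Reduce(function(a, b) ifelse(is.na(a), b, a),
                     raw[c("id1", "id2", "id3")])
```

The Reduce() call plays the same role as the nested IF in the Excel formula: take id1 unless it is missing, then id2, then id3.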

Create dataset based on condition

I have a dataset new with variables a, b and c:
a b c
hdjfh 434 876
sdfdsf 34 98
gfdsdfdsf 534 672
rsdfdsf 65 87
gsdfdsf 67 54
vbvnn 98 09
gkhjgfk 100 768
rknfg 78 3546
I want to create two datasets: new1 should contain the records that satisfy the condition b > 110 or c > 110, and new2 should contain the records that do not.
If you want to assign the two data sets to new variables, you can do this:
df <- data.frame(
a=c('hdjfh','sdfdsf','gfdsdfdsf','rsdfdsf','gsdfdsf','vbvnn','gkhjgfk','rknfg'),
b=c(434L,34L,534L,65L,67L,98L,100L,78L),
c=c(876L,98L,672L,87L,54L,9L,768L,3546L),
stringsAsFactors=FALSE);
cond <- df$b>110|df$c>110;
new1 <- df[cond,];
new2 <- df[!cond,];
new1;
## a b c
## 1 hdjfh 434 876
## 3 gfdsdfdsf 534 672
## 7 gkhjgfk 100 768
## 8 rknfg 78 3546
new2;
## a b c
## 2 sdfdsf 34 98
## 4 rsdfdsf 65 87
## 5 gsdfdsf 67 54
## 6 vbvnn 98 9
Another option is to use split() to get a list:
split(df,df$b>110|df$c>110);
## $`FALSE`
## a b c
## 2 sdfdsf 34 98
## 4 rsdfdsf 65 87
## 5 gsdfdsf 67 54
## 6 vbvnn 98 9
##
## $`TRUE`
## a b c
## 1 hdjfh 434 876
## 3 gfdsdfdsf 534 672
## 7 gkhjgfk 100 768
## 8 rknfg 78 3546
##
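If the list elements should carry the names new1 and new2 instead of TRUE and FALSE, the condition can be turned into a labelled factor before splitting. A small sketch, where parts is an illustrative name:

```r
df <- data.frame(
  a = c("hdjfh", "sdfdsf", "gfdsdfdsf", "rsdfdsf", "gsdfdsf", "vbvnn", "gkhjgfk", "rknfg"),
  b = c(434L, 34L, 534L, 65L, 67L, 98L, 100L, 78L),
  c = c(876L, 98L, 672L, 87L, 54L, 9L, 768L, 3546L),
  stringsAsFactors = FALSE
)

cond <- df$b > 110 | df$c > 110

# Label TRUE as "new1" and FALSE as "new2", then split on the labels
parts <- split(df, factor(cond, levels = c(TRUE, FALSE), labels = c("new1", "new2")))

parts$new1  # rows that satisfy the condition
parts$new2  # rows that do not
```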

need to 'reshape' dataframe

dataset:
zip acs.pop napps pperct cgrp zgrp perc
1: 12007 97 2 2.0618557 2 1 25.000000
2: 12007 97 2 2.0618557 NA 2 50.000000
3: 12007 97 2 2.0618557 1 1 25.000000
4: 12008 485 2 0.4123711 2 1 33.333333
5: 12008 485 2 0.4123711 4 1 33.333333
6: 12008 485 2 0.4123711 NA 1 33.333333
7: 12009 7327 187 2.5522042 4 76 26.206897
8: 12009 7327 187 2.5522042 1 41 14.137931
9: 12009 7327 187 2.5522042 2 23 7.931034
10: 12009 7327 187 2.5522042 NA 103 35.517241
11: 12009 7327 187 2.5522042 3 47 16.206897
12: 12010 28802 580 2.0137490 NA 275 32.163743
13: 12010 28802 580 2.0137490 4 122 14.269006
14: 12010 28802 580 2.0137490 1 269 31.461988
15: 12010 28802 580 2.0137490 2 96 11.228070
16: 12010 28802 580 2.0137490 3 93 10.877193
17: 12018 7608 126 1.6561514 3 30 16.129032
18: 12018 7608 126 1.6561514 NA 60 32.258065
19: 12018 7608 126 1.6561514 2 14 7.526882
20: 12018 7608 126 1.6561514 4 57 30.645161
21: 12018 7608 126 1.6561514 1 25 13.440860
22: 12019 14841 144 0.9702850 NA 62 30.097087
23: 12019 14841 144 0.9702850 4 73 35.436893
24: 12019 14841 144 0.9702850 3 30 14.563107
25: 12019 14841 144 0.9702850 1 23 11.165049
26: 12019 14841 144 0.9702850 2 18 8.737864
27: 12020 31403 343 1.0922523 3 76 14.960630
28: 12020 31403 343 1.0922523 1 88 17.322835
29: 12020 31403 343 1.0922523 2 38 7.480315
30: 12020 31403 343 1.0922523 4 141 27.755906
31: 12020 31403 343 1.0922523 NA 165 32.480315
32: 12022 1002 5 0.4990020 NA 4 44.444444
33: 12022 1002 5 0.4990020 4 2 22.222222
34: 12022 1002 5 0.4990020 3 1 11.111111
35: 12022 1002 5 0.4990020 1 1 11.111111
I know the reshape2 or reshape package can handle this, but I'm not sure how. I need the final output to look like this:
zip acs.pop napps pperct zgrp4 zgrp3 zgrp2 zgrp1 perc4 perc3 perc2 perc1
12009 7327 187 2.5522042 76 47 23 41 26.206897 16.206897 7.931034 14.137931
zip is the id
acs.pop, napps, pperct will be the same for each zip group
zgrp4…zgrp1 are the values of zgrp for each value of cgrp
perc4…perc1 are the values of perc for each value of cgrp
We can try dcast from the devel version of data.table, which can take multiple value.var columns. In this case, 'zgrp' and 'perc' are the value columns. Using the grouping variables, we create a sequence variable ('ind') and then use dcast to convert from 'long' to 'wide' format.
Instructions to install the devel version are here
library(data.table)#v1.9.5
setDT(df1)[, ind:= 1:.N, .(zip, acs.pop, napps, pperct)]
dcast(df1, zip+acs.pop + napps+pperct~ind, value.var=c('zgrp', 'perc'))
# zip acs.pop napps pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp 1_perc
#1: 12007 97 2 2.0618557 1 2 1 NA NA 25.00000
#2: 12008 485 2 0.4123711 1 1 1 NA NA 33.33333
#3: 12009 7327 187 2.5522042 76 41 23 103 47 26.20690
#4: 12010 28802 580 2.0137490 275 122 269 96 93 32.16374
#5: 12018 7608 126 1.6561514 30 60 14 57 25 16.12903
#6: 12019 14841 144 0.9702850 62 73 30 23 18 30.09709
#7: 12020 31403 343 1.0922523 76 88 38 141 165 14.96063
#8: 12022 1002 5 0.4990020 4 2 1 1 NA 44.44444
# 2_perc 3_perc 4_perc 5_perc
#1: 50.00000 25.000000 NA NA
#2: 33.33333 33.333333 NA NA
#3: 14.13793 7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807 7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505 8.737864
#7: 17.32284 7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111 NA
Or we can use ave/reshape from base R
df2 <- transform(df1, ind=ave(seq_along(zip), zip,
acs.pop, napps, pperct, FUN=seq_along))
reshape(df2, idvar=c('zip', 'acs.pop', 'napps', 'pperct'),
timevar='ind', direction='wide')
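The requested output names the new columns after cgrp (zgrp4 … zgrp1) rather than after a row counter. A possible base R variation is to drop the rows with a missing cgrp and use cgrp itself as timevar. The sketch below uses only the zip 12009 rows from the question; wide is an illustrative name.

```r
# Rows for zip 12009, copied from the question's dataset
df1 <- data.frame(
  zip = 12009, acs.pop = 7327, napps = 187, pperct = 2.5522042,
  cgrp = c(4, 1, 2, NA, 3),
  zgrp = c(76, 41, 23, 103, 47),
  perc = c(26.206897, 14.137931, 7.931034, 35.517241, 16.206897)
)

# Drop rows with no cgrp, then spread zgrp and perc across the cgrp values;
# sep = "" yields names such as zgrp4 and perc4
wide <- reshape(df1[!is.na(df1$cgrp), ],
                idvar = c("zip", "acs.pop", "napps", "pperct"),
                timevar = "cgrp", direction = "wide", sep = "")
```

The columns come out in the order the cgrp values first appear, so they may need reordering to match the layout in the question exactly.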
This is a good use for spread() in tidyr.
library(dplyr)
library(tidyr)
df2 <- df %>% filter(!is.na(cgrp)) %>% # if cgrp is missing I don't know where to put the obs
gather(Var, Val,6:7) %>% # one row per measure (zgrp OR perc) observed
group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
unite(Var1,Var,cgrp) %>% # identify which obs for which measure
spread(Var1, Val) # make columns for zgrp_1, _2, etc., perc_1, _2, etc.
Example output:
> df2[df2$zip==12009,]
Source: local data frame [1 x 12]
zip acs.pop napps pperct perc_1 perc_2 perc_3 perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009 7327 187 2.552204 14.13793 7.931034 16.2069 26.2069 41 23 47 76
Thanks to @akrun for the assist.

Shift up rows in R

This is a simple example of what my data looks like.
Suppose I have the following data:
>x
Year a b c
1962 1 2 3
1963 4 5 6
. . . .
. . . .
2001 7 8 9
I need to form a time series from x with 7 columns containing the following variables:
Year a lag(a) b lag(b) c lag(c)
What I did is the following:
> x<-ts(x) # converting x to a time series
> x<-cbind(x,x[,-1]) # adding the same variables to the time series without repeating the year column
> x
Year a b c a b c
1962 1 2 3 1 2 3
1963 4 5 6 4 5 6
. . . . . . .
. . . . . . .
2001 7 8 9 7 8 9
I need to shift the last three columns up so that they give the lags of a, b and c; then I will rearrange them.
Here's an approach using dplyr
df <- data.frame(
a=1:10,
b=21:30,
c=31:40)
library(dplyr)
df %>% mutate_each(funs(lead(.,1))) %>% cbind(df, .)
# a b c a b c
#1 1 21 31 2 22 32
#2 2 22 32 3 23 33
#3 3 23 33 4 24 34
#4 4 24 34 5 25 35
#5 5 25 35 6 26 36
#6 6 26 36 7 27 37
#7 7 27 37 8 28 38
#8 8 28 38 9 29 39
#9 9 29 39 10 30 40
#10 10 30 40 NA NA NA
You can change the names afterwards using colnames(df) <- c("a", "b", ...)
As @nrussel noted in his answer, what you described is a leading variable. If you want a lagging variable, you can change the lead in my answer to lag.
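mutate_each() and funs() are deprecated in recent dplyr releases. A roughly equivalent sketch with across(), assuming dplyr 1.0.0 or later, also names the new columns in the same step (the lead_ prefix and the name shifted are just illustrative choices):

```r
library(dplyr)

df <- data.frame(
  a = 1:10,
  b = 21:30,
  c = 31:40
)

# Add a shifted-up copy of every column; swap lead() for lag() to shift down instead
shifted <- df %>%
  mutate(across(everything(), ~ lead(.x, 1), .names = "lead_{.col}"))
```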
X <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts <- ts(X)
Xts[1:(nrow(Xts)-1),c(4,5,6)] <- Xts[2:nrow(Xts),c(4,5,6)]
Xts[nrow(Xts),c(4,5,6)] <- c(NA,NA,NA)
> head(Xts)
a b c laga lagb lagc
[1,] 1 2 3 2 4 6
[2,] 2 4 6 3 6 9
[3,] 3 6 9 4 8 12
[4,] 4 8 12 5 10 15
[5,] 5 10 15 6 12 18
[6,] 6 12 18 7 14 21
##
> tail(Xts)
a b c laga lagb lagc
[95,] 95 190 285 96 192 288
[96,] 96 192 288 97 194 291
[97,] 97 194 291 98 196 294
[98,] 98 196 294 99 198 297
[99,] 99 198 297 100 200 300
[100,] 100 200 300 NA NA NA
I'm not sure whether by "shift up" you literally mean shifting the rows up one place as above (that gives leading values, not lagged ones), but here's the other direction ("true" lagged values):
X2 <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts2 <- ts(X2)
Xts2[2:nrow(Xts2),c(4,5,6)] <- Xts2[1:(nrow(Xts2)-1),c(4,5,6)]
Xts2[1,c(4,5,6)] <- c(NA,NA,NA)
##
> head(Xts2)
a b c laga lagb lagc
[1,] 1 2 3 NA NA NA
[2,] 2 4 6 1 2 3
[3,] 3 6 9 2 4 6
[4,] 4 8 12 3 6 9
[5,] 5 10 15 4 8 12
[6,] 6 12 18 5 10 15
##
> tail(Xts2)
a b c laga lagb lagc
[95,] 95 190 285 94 188 282
[96,] 96 192 288 95 190 285
[97,] 97 194 291 96 192 288
[98,] 98 196 294 97 194 291
[99,] 99 198 297 98 196 294
[100,] 100 200 300 99 198 297
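For the layout asked for in the question (Year a lag(a) b lag(b) c lag(c)), the same shifting can be sketched in base R on a data frame before any conversion to ts. The data below is a small stand-in for the question's x, and lag1, vars, lags and out are illustrative names:

```r
# Stand-in data with the same column layout as the question
x <- data.frame(Year = 1962:1966, a = 1:5, b = 6:10, c = 11:15)

# Shift a vector down by one position, padding the start with NA
lag1 <- function(v) c(NA, head(v, -1))

vars <- c("a", "b", "c")
lags <- as.data.frame(lapply(x[vars], lag1))
names(lags) <- paste0("lag_", vars)

# Interleave each variable with its lag: a, lag_a, b, lag_b, c, lag_c
ordered_cols <- c("Year", as.vector(rbind(vars, names(lags))))
out <- cbind(x, lags)[ordered_cols]
```

To shift up instead (leading values), the helper would become c(tail(v, -1), NA).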
