Dynamic subsetting depending on values in R

I have a data frame with the following structure:
Id Flag value1 value2
123 1 10 3.4
124 1 5 1.2
125 0 19 8.4
126 1 8 1.2
127 0 17 6.5
128 2 1 -6.5
I need to separate the data frame into 'n' subsets based only on the name of a column, where 'n' is the number of distinct values in that column. I would expect the following:
dataframe1
Id Flag value1 value2
123 1 10 3.4
124 1 5 1.2
126 1 8 1.2
dataframe2
Id Flag value1 value2
125 0 19 8.4
127 0 17 6.5
dataframe3
Id Flag value1 value2
128 2 1 -6.5
Since this is going inside a function, I only know the name of the column and the distinct values it can take. I've tried:
dataFrame$column==value
but I would need to do this for every value, and the number of distinct values varies depending on the column.
Thanks in advance

Here, split is your friend.
splitbycol <- function(df, colname) {
  split(df, df[[colname]])   # one data frame per distinct value of colname
}
splitbycol(df, "Flag")
## $`0`
## Id Flag value1 value2
## 3 125 0 19 8.4
## 5 127 0 17 6.5
##
## $`1`
## Id Flag value1 value2
## 1 123 1 10 3.4
## 2 124 1 5 1.2
## 4 126 1 8 1.2
##
## $`2`
## Id Flag value1 value2
## 6 128 2 1 -6.5
Then, if you'd like to make each of the data frames a separate "variable", call e.g.
subdf <- splitbycol(df, "Flag")
for (i in seq_along(subdf)) {
  assign(paste0("df", i), subdf[[i]])   # creates df1, df2, df3
}
df1
## Id Flag value1 value2
## 3 125 0 19 8.4
## 5 127 0 17 6.5
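Since split() names the list elements after the distinct values (here "0", "1", "2"), you can also index the result directly instead of creating separate variables:
subdf[["0"]]   # the Flag == 0 subset
subdf[["2"]]   # the Flag == 2 subset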

Another approach, avoiding a for loop:
> List <- split(df, df$Flag)                           # split by Flag
> names(List) <- paste0("dataframe", seq_along(List))  # name them (seq_along is safer than 1:length)
> list2env(List, envir=.GlobalEnv)                     # assign each list element into the global environment
> dataframe1
# Id Flag value1 value2
#3 125 0 19 8.4
#5 127 0 17 6.5
> dataframe2
# Id Flag value1 value2
#1 123 1 10 3.4
#2 124 1 5 1.2
#4 126 1 8 1.2
> dataframe3
# Id Flag value1 value2
# 6 128 2 1 -6.5
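Combining both ideas into the function the question asks for, where only the column name is known (a sketch; split_to_env is a hypothetical name):
split_to_env <- function(df, colname, prefix = "dataframe") {
  List <- split(df, df[[colname]])               # one data frame per distinct value
  names(List) <- paste0(prefix, seq_along(List))
  list2env(List, envir = .GlobalEnv)             # creates dataframe1, dataframe2, ...
}
split_to_env(df, "Flag")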

Related

Inexact joining data based on greater equal condition

I have some values in df:
# A tibble: 7 × 1
   var1
  <dbl>
1     0
2    10
3    20
4   210
5   230
6   266
7   267
that I would like to compare to a second data frame called value_lookup:
# A tibble: 4 × 2
   var1 value
  <dbl> <dbl>
1     0     0
2   200    10
3   230    20
4   260    30
In particular, I would like to join based on >=, meaning that a value greater than or equal to the number in var1 gets the corresponding value. E.g. take the number 210 from the original data frame: since it is >= 200 and < 230, it gets a value of 10.
Here is the expected output:
   var1 value
1     0     0
2    10     0
3    20     0
4   210    10
5   230    20
6   266    30
7   267    30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.
library(tibble)
value_lookup <- tibble(var1 = c(0, 200, 230, 260),
                       value = c(0, 10, 20, 30))
df <- tibble(var1 = c(0, 10, 20, 210, 230, 266, 267))
library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
)
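For completeness, a sketch of how the fuzzyjoin attempt could be finished: matching on >= alone returns one row for every threshold a value clears, so the largest matching threshold has to be kept afterwards (var1.x/var1.y are the suffixes fuzzy_left_join gives the duplicated column name):
library(dplyr)
fuzzyjoin::fuzzy_left_join(df, value_lookup, by = "var1",
                           match_fun = list(`>=`)) %>%
  group_by(var1.x) %>%
  slice_max(var1.y) %>%                  # keep the largest threshold each value clears
  ungroup() %>%
  select(var1 = var1.x, value)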
An option is also findInterval (it assumes value_lookup$var1 is sorted ascending, as it is here):
df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = T, which matches each var1 in df to the same or closest preceding value in the lookup:
library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = T]
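This should reproduce the expected output above, with the lookup value carried forward onto each row of df:
#    var1 value
# 1:    0     0
# 2:   10     0
# 3:   20     0
# 4:  210    10
# 5:  230    20
# 6:  266    30
# 7:  267    30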
You can use cut (the factor it returns serves as an integer index into value_lookup$value):
df$value <- value_lookup$value[cut(df$var1,
                                   c(value_lookup$var1, Inf),
                                   right=F)]
# # A tibble: 7 x 2
# var1 value
# <dbl> <dbl>
# 1 0 0
# 2 10 0
# 3 20 0
# 4 210 10
# 5 230 20
# 6 266 30
# 7 267 30

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame(matrix(nrow = 200, ncol = 5))
colnames(df) <- c("group", "trial", "x", "y", "hour")
df$group <- rep(c("A", "B", "C", "D"), each = 50)
df$trial <- rep(c(rep(1, times = 25), rep(2, times = 25)), times = 4)
df[, 3:4] <- runif(400, 0, 50)
df$hour <- rep(1:25, times = 8)
library(plyr)
ddply(.data = df, .variables = c("group", "trial"), .fun = function(x) {
  i <- which(df$x > 30 & df$y > 30)[1:2]
  if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code finds the first two matching rows of the whole dataset and repeats those row positions for every group × trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like there to be rows of NA if a second occurrence isn't present in a group × trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, by] I am selecting rows which match your criteria in i, grouping by the given variables in by, and selecting the first two rows (head) of each group-specific subset of the data, .SD. All of this inside the brackets is data.table-specific, and only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far, we can do the following:
ddply(.data = df, .variables = c("group", "trial"), .fun = function(df_temp) {
  i <- which(df_temp$x > 30 & df_temp$y > 30)[1:2]
  df_temp[i, ]
})
Some explanation
One problem with the code that you provided is that you used df inside of ddply. You defined .fun = function(x), but you didn't look for cases of x > 30 & y > 30 in x; you looked in df. Further, your code indexes x with i, but i was computed from df. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]: if only one row meets your condition you will get a row of NAs anyway, because which(df_temp$x > 30 & df_temp$y > 30)[1:2] pads the result with NA.
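A quick base R illustration of that last point: indexing a data frame with an NA index yields a row of NAs, independent of plyr:
mtcars[c(1, NA), ]   # second returned row is all NA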
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8

flag rows in groups with multiple conditions

I looked here and elsewhere, but I cannot find something that does exactly what I'm looking to accomplish using R.
I have data similar to below, where col1 is a unique ID, col2 is a group ID, and col3 is a status code. For every set of rows sharing a group ID, I need to set Flag to 1 if any of those rows has a specific status code (X in this case), otherwise 0.
ID GroupID Status Flag
1 100 A 1
2 100 X 1
3 102 A 0
4 102 B 0
5 103 B 1
6 103 X 1
7 104 X 1
8 104 X 1
9 105 A 0
10 105 C 0
I have tried writing an ifelse where groupID == groupID and status == X then 1 else 0, but that doesn't work. The pattern of Status is random. In this example the GroupIDs come exclusively in pairs, but I don't want to assume that in the code, because I have other instances with 3 or more rows per GroupID.
It would be helpful if this were open ended, i.e. I could add other conditions if necessary: for each matching group ID, flag where Status == X or some other condition, etc.
Thank you!
Group-based operations like this are easy to do with the dplyr package.
The data:
library(dplyr)
txt <- 'ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C '
df <- read.table(text = txt, header = T)
Once we have the data frame, we establish dplyr groups with the group_by function. The mutate command will then be applied per each group, creating a new column entry for each row.
df.new <- df %>%
group_by(GroupID) %>%
mutate(Flag = as.numeric(any(Status == 'X')))
# A tibble: 10 x 4
# Groups: GroupID [5]
ID GroupID Status Flag
<int> <int> <fct> <dbl>
1 1 100 A 1
2 2 100 X 1
3 3 102 A 0
4 4 102 B 0
5 5 103 B 1
6 6 103 X 1
7 7 104 X 1
8 8 104 X 1
9 9 105 A 0
10 10 105 C 0
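To keep this open ended as requested, the any() condition extends naturally to several status codes (a sketch; 'Y' stands in for whatever additional code you need):
df.new <- df %>%
  group_by(GroupID) %>%
  mutate(Flag = as.numeric(any(Status %in% c('X', 'Y'))))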
From base R:
ave(df$Status == 'X', df$GroupID, FUN = any)
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
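The result is logical; to store it as the requested 0/1 column, coerce it with unary +, e.g.:
df$Flag <- +ave(df$Status == 'X', df$GroupID, FUN = any)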
Data.table way:
library(data.table)
setDT(df)
df[, flag := sum(Status == "X") > 0, by = GroupID]
An alternative using data.table
library(data.table)
dt <- read.table(stringsAsFactors = FALSE,text = "ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C", header=T)
setDT(dt)[, .(ID, Status, Flag = ifelse("X" %in% Status, 1, 0)), by = GroupID]
# returns:
GroupID ID Status Flag
1: 100 1 A 1
2: 100 2 X 1
3: 102 3 A 0
4: 102 4 B 0
5: 103 5 B 1
6: 103 6 X 1
7: 104 7 X 1
8: 104 8 X 1
9: 105 9 A 0
10: 105 10 C 0
A base R option with rowsum:
i1 <- with(df1, rowsum(+(Status == "X"), group = GroupID) > 0)  # TRUE for groups containing an "X"
transform(df1, Flag = +(GroupID %in% row.names(i1)[i1]))        # 1 if the row's group is among those
Or using table:
df1$Flag <- +(with(df1, GroupID %in% names(which(table(GroupID,
                                                       Status == "X")[, 2] > 0))))

Generating the dataframe objects that are produced during jackknife sampling

This post has been edited to more accurately describe the situation. I am utilising a form of jackknife sampling for my work. The jackknifed data will be used for calibration of a model, and the unused data will be used for validation.
Rather than perform the analysis immediately, I want to save the jackknifed samples as dataframes, as well as the data which was removed for each sample...
It's hard to explain, so I will use an example to illustrate:
The aim in the example is to create the datasets 4 times. Each time there should be 2 datasets: one of length 9 (the calibration one), and one of length 3 (the validation one).
df <- data.frame(value1 = 1:(3*4),
                 value2 = seq(from = 1000, by = 50, length.out = 3*4),
                 tosplit = rep(1:4, each = 3))
df    # df represents the dataframe in its entirety
dfs <- split(df, df$tosplit)    # df is now split into 4 equal parts of 3
#####
> #Replicate 1
> r1_3parts <- do.call("rbind", dfs[1:3])
> r1_1parts <- do.call("rbind", dfs[4])
>
> r1_3parts
value1 value2 tosplit
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
> r1_1parts
value1 value2 tosplit
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
>
> #Replicate 2
> r2_3parts <- do.call("rbind", dfs[2:4])
> r2_1parts <- do.call("rbind", dfs[1])
>
> r2_3parts
value1 value2 tosplit
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
> r2_1parts
value1 value2 tosplit
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
>
> #Replicate 3
> r3_3parts <- do.call("rbind", dfs[c(3:4, 1)])
> r3_1parts <- do.call("rbind", dfs[2])
>
> r3_3parts
value1 value2 tosplit
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
> r3_1parts
value1 value2 tosplit
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
>
>
> #Replicate 4
> r4_3parts <- do.call("rbind", dfs[c(4, 1:2)])
> r4_1parts <- do.call("rbind", dfs[3])
>
> r4_3parts
value1 value2 tosplit
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
> r4_1parts
value1 value2 tosplit
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
>
This doesn't appear to be an option in the packages I can find; they default to just producing the statistics for you. What I want is to see the sample datasets themselves, and also to specify their relative sizes. Is this possible in an existing package, or if not, is there a suitable way to do this in a more automated fashion?
Without a random component, this doesn't really strike me as a bootstrap. It seems you are pursuing a variation on permutation.
The data frame can be split with a fairly simple function.
df <-
data.frame(value1 = 1:(3*4),
value2 = seq(from = 1000, by = 50, length.out = 3*4),
tosplit = rep(1:4, each = 3))
split_into_two <- function(data, split_var, split_val){
  split <- data[[split_var]] %in% split_val   # TRUE = calibration rows, FALSE = validation rows
  split(data, split)
}
split_into_two(df, "tosplit", 1:3)
To get the four permutations you describe, we can use lapply:
lapply(list(1:3, 2:4, c(3:4, 1), c(4, 1:2)),
       function(x) split_into_two(df, "tosplit", x))
This saves a great deal of copy-paste.
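Because split() on a logical vector returns a list with elements named FALSE and TRUE, each replicate can then be unpacked directly (a sketch; reps is a hypothetical name):
reps <- lapply(list(1:3, 2:4, c(3:4, 1), c(4, 1:2)),
               function(x) split_into_two(df, "tosplit", x))
reps[[1]]$`TRUE`    # the 9-row calibration set (tosplit in 1:3)
reps[[1]]$`FALSE`   # the 3-row validation set (tosplit == 4)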

rotating variable in ddply

I am trying to get means from columns of a data frame grouped by the unique values of another column: here, the mean of column b and of column c for each unique value in column a. I thought .(a) would make ddply calculate per unique value of a (it does return the unique values of a), but it just gives a single mean of the whole column b or c.
df2 <- data.frame(a = seq(1:5), b = c(1:10), c = c(11:20))
simVars <- c("b", "c")
for (var in simVars) {
  print(var)
  dat = ddply(df2, .(a), summarize, mean_val = mean(df2[[var]]))  ## my script
  assign(var, dat)
}
c
a mean_val
1 15.5
2 15.5
3 15.5
4 15.5
5 15.5
How can I have it take an average for the column based on the unique value from column a?
thanks
You don't need a loop. Just calculate the means of b and c within a single call to ddply and the means will be calculated separately for each value of a. And, as #Gregor said, you don't need to re-specify the data frame name inside mean():
ddply(df2, .(a), summarise,
      mean_b = mean(b),
      mean_c = mean(c))
a mean_b mean_c
1 1 3.5 13.5
2 2 4.5 14.5
3 3 5.5 15.5
4 4 6.5 16.5
5 5 7.5 17.5
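For reference, the original loop can also be repaired by computing the mean inside a per-group function instead of referencing the full df2 (a sketch):
library(plyr)
for (var in simVars) {
  dat <- ddply(df2, .(a), function(d) data.frame(mean_val = mean(d[[var]])))
  assign(var, dat)
}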
UPDATE: To get separate data frames for each column of means:
# Add a few additional columns to the data frame
df2 = data.frame(a=seq(1:5),b=c(1:10), c=c(11:20), d=c(21:30), e=c(31:40))
# New data frame with means by each level of column a
library(dplyr)
dfmeans = df2 %>%
group_by(a) %>%
summarise_each(funs(mean))
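Note that summarise_each() and funs() are deprecated in current dplyr; the equivalent with the modern API would be:
dfmeans = df2 %>%
  group_by(a) %>%
  summarise(across(everything(), mean))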
# Separate each column of means into a separate data frame and store it in a list:
means.list = lapply(names(dfmeans)[-1], function(x) {
  cbind(dfmeans[, "a"], dfmeans[, x])
})
means.list
[[1]]
a b
1 1 3.5
2 2 4.5
3 3 5.5
4 4 6.5
5 5 7.5
[[2]]
a c
1 1 13.5
2 2 14.5
3 3 15.5
4 4 16.5
5 5 17.5
[[3]]
a d
1 1 23.5
2 2 24.5
3 3 25.5
4 4 26.5
5 5 27.5
[[4]]
a e
1 1 33.5
2 2 34.5
3 3 35.5
4 4 36.5
5 5 37.5
