Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because all values in column D equal one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has to be a simple command to do this, but I can't think of one. If it makes any difference, the NAs must be retained.
Additional information: my real data frame has 20 columns in total, 10 like L and J and another 10 like K and D. I need a function that recognizes the correspondence between these two groups and deletes columns accordingly.
Thank you in advance!

Okay, assuming the correspondence is based on column position, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the columns that should be kept
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1

I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d) , since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
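As a rough sketch of the same idea applied to the asker's original DF, the variant below compares each flag to 1 directly, so offsetting values cannot fool the test and the all-NA row does not change the count. It assumes the first half of the columns holds the values and the second half holds the matching flags in the same order; the names flags, keep and result are only illustrative.

```r
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J = c(3,4,5,6,NA,3,6,4,3,6),
                 K = c(0,1,1,0,NA,1,1,1,1,1), D = c(1,1,1,1,NA,1,1,1,1,1))

n <- ncol(DF) / 2                 # assumption: values first, flags second
flags <- DF[(n + 1):(2 * n)]      # here: columns K and D

# Keep a pair only if at least one non-missing flag differs from 1
keep <- which(colSums(flags != 1, na.rm = TRUE) > 0)

# Select the kept value columns and their flag columns; NA rows are retained
result <- DF[, c(keep, keep + n), drop = FALSE]
result
```

With this data only the L/K pair survives, because every non-missing value of D is 1.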

Related

Looping over various lists of data frames to create a variable in each data frame

I have four lists, each of which contains 12 data frames. Something like this:
for (i in 1:10) {
  assign(paste0("df", i), data.frame(c=c(1,2,3), d=c(1,2,3),
                                     e=c(1,2,3), f=c(1,2,3),
                                     g=c(1,2,3), h=c(1,2,3),
                                     i=c(1,2,3), j=c(1,2,3),
                                     k=c(1,2,3), l=c(1,2,3)))
}
for (i in 1:4) {
  assign(paste0("list_", i), lapply(ls(pattern="df"), get))
}
rm(list=ls(pattern="df"))
Each list corresponds to a year and each of its elements (the data frames) correspond to a month. Conveniently, the position of each element of the list (of each data frame) is equivalent to its month. So, in the first list, the first data frame corresponds to January 2020, the second to February 2020, and so on. In the second list, the first data frame corresponds to January 2021, the second to February 2021, and so on.
What I need to do is to create a new variable that indicates the month of each data frame.
I have been trying different things, including this:
for(j in 20:21) {
for(i in 1:12) {
assign(get(paste0("df_20", j))[[i]],
get(paste0("df_20", j))[[i]] %>%
mutate(month=i)) ## 1 is added to the month because it starts from February
}
}
But nothing works. The problem seems to be the left-hand side of the assignment. When I use the get() function, R returns an error ("Error in assign(get(paste0("mies_20", j))[[1]], get(paste0("mies_20", :
invalid first argument"). If I leave get() out, paste0("df_20", j)[[i]] does not recognize the "[[i]]".
Any ideas?
get() each list, iterate over its dataframes using lapply(), then assign back to the environment. It’s also best to use something other than i for your iteration variable, since there’s also a column i in your dataframes.
library(dplyr)
for (j in 1:4) {
list_j <- get(paste0("list_", j))
list_j <- lapply(
seq_along(list_j),
\(mnth) mutate(list_j[[mnth]], month = mnth)
)
assign(paste0("list_", j), list_j)
}
list_1
[[1]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 1
3 3 3 3 3 3 3 3 3 3 3 1
[[2]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 2
2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 2
[[3]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 3
2 2 2 2 2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3 3 3 3 3
[[4]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 4
2 2 2 2 2 2 2 2 2 2 2 4
3 3 3 3 3 3 3 3 3 3 3 4
[[5]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 5
2 2 2 2 2 2 2 2 2 2 2 5
3 3 3 3 3 3 3 3 3 3 3 5
[[6]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 6
2 2 2 2 2 2 2 2 2 2 2 6
3 3 3 3 3 3 3 3 3 3 3 6
[[7]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 7
2 2 2 2 2 2 2 2 2 2 2 7
3 3 3 3 3 3 3 3 3 3 3 7
[[8]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 8
2 2 2 2 2 2 2 2 2 2 2 8
3 3 3 3 3 3 3 3 3 3 3 8
[[9]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 9
2 2 2 2 2 2 2 2 2 2 2 9
3 3 3 3 3 3 3 3 3 3 3 9
[[10]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 10
2 2 2 2 2 2 2 2 2 2 2 10
3 3 3 3 3 3 3 3 3 3 3 10
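If the yearly lists can be kept inside one named list instead of as separate list_1, list_2, ... objects, get() and assign() are not needed at all. The following is only a sketch under that assumption, using base R; make_month(), add_month() and years are hypothetical names and the data frames are placeholders.

```r
# Placeholder monthly data frame (hypothetical stand-in for the real data)
make_month <- function() data.frame(c = 1:3, d = 1:3)

# One named list of years, each holding 12 monthly data frames
years <- list(
  `2020` = replicate(12, make_month(), simplify = FALSE),
  `2021` = replicate(12, make_month(), simplify = FALSE)
)

# Add a month column equal to each data frame's position in its list
add_month <- function(dfs) {
  Map(function(dat, mnth) {
    dat$month <- mnth
    dat
  }, dfs, seq_along(dfs))
}

years <- lapply(years, add_month)
years[["2020"]][[3]]
```

Because the month is taken from the list position, this relies on the same ordering convention described in the question.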

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
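To see why cummax works here: within each id it returns the running maximum, so once a 1 has appeared every later value is at least 1. A minimal base R check on the question's dummy vector (the name filled is only illustrative):

```r
id    <- rep(1:5, each = 3)
dummy <- c(0,0,1, 1,0,0, 0,1,0, 0,0,1, 0,1,0)

# Running maximum within each id: stays at 1 after the first 1 appears
filled <- ave(dummy, id, FUN = cummax)
data.frame(id, dummy, filled)
```

This assumes the rows are already ordered by time within each id, as they are in the example.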

R: Creating multiple resampled dataset based on multiple factors

I need to create multiple (several thousand) resampled datasets from a large database. I have three categorical variables: Site (S), Transect (T), and Quadrat (Q). The response variable is Value (V), which is the result of the particular S, T, and Q combination. Quadrats are located along each transect at each site. I pasted an abbreviated dataset below.
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0
The idea would be that for a given site, the resampled dataset would contain ## of quads from transect 1 to n, where ## would be the number of quadrats (Q) per transect (T) per site (S). I am not trying to resample the dataset based on S, T, and Q. I would like to be able to resample a user-defined number of rows, based on the conditions I define. For example, if I chose to resample based on 2 quadrats (Q) per transect (T) per site (S), I envision the resampled dataset looking like the example below.
S T Q V
A 1 1 8
A 1 3 0
A 2 1 0
A 2 2 15
A 3 2 25
A 3 3 0
B 1 2 1
B 1 3 0
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
C 1 1 0
C 1 3 0
C 2 1 45
C 2 3 0
C 3 2 1
C 3 3 0
Please let me know if that doesn't make sense and I'll revise until it does. Thanks for any assistance!
Consider using by() to slice the data frame by the Site and Transect factors and then sample random rows from each slice:
set.seed(444)
quads <- 2
# BUILD LIST OF SUBSETTED RANDOM SAMPLED DATAFRAMES
df_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), quads),])
# STACK ALL DATAFRAMES INTO ONE FINAL DF
sample_df <- do.call(rbind, df_list)
# SORT DATAFRAME BY S AND T
sample_df <- with(sample_df, sample_df[order(S, T),])
# RESET ROW NAMES
row.names(sample_df) <- NULL
sample_df
# S T Q V
# 1 A 1 1 8
# 2 A 1 3 0
# 3 A 2 2 15
# 4 A 2 1 0
# 5 A 3 1 0
# 6 A 3 3 0
# 7 B 1 2 1
# 8 B 1 1 0
# 9 B 2 3 2
# 10 B 2 1 33
# 11 B 3 1 0
# 12 B 3 2 207
# 13 C 1 1 0
# 14 C 1 2 1
# 15 C 2 1 45
# 16 C 2 3 0
# 17 C 3 3 0
# 18 C 3 2 1
Data
txt = '
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0'
df = read.table(text=txt, header=TRUE)
To build many randomly resampled data frames, draw a vector of quads values and run it through lapply:
max_quads <- 3
quads <- replicate(1000, sample(1:max_quads, 1))
df_list <- lapply(quads, function(q) {
by_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), q),])
sample_df <- do.call(rbind, by_list)
sample_df <- with(sample_df, sample_df[order(S, T),])
row.names(sample_df) <- NULL
return(sample_df)
})
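If the number of quadrats per transect should stay fixed across all resamples (for example always 2), one possible sketch wraps the sampling step in a function and repeats it with replicate(). The function name resample_once, the toy data built with expand.grid, and the placeholder V values are assumptions for illustration, not the asker's real data.

```r
# Toy data with the same S/T/Q layout; V holds placeholder values only
df <- expand.grid(Q = 1:3, T = 1:3, S = c("A", "B", "C"))
df$V <- seq_len(nrow(df))

# Draw `quads` random rows from every Site x Transect combination
resample_once <- function(dat, quads) {
  groups <- split(dat, list(dat$S, dat$T), drop = TRUE)
  picked <- lapply(groups, function(g) g[sample(nrow(g), quads), ])
  out <- do.call(rbind, picked)
  out <- out[order(out$S, out$T, out$Q), ]
  row.names(out) <- NULL
  out
}

set.seed(1)
# Five resampled datasets; raise the count as needed
resamples <- replicate(5, resample_once(df, quads = 2), simplify = FALSE)
head(resamples[[1]])
```

Note that sample() fails if quads exceeds the number of rows in a group, so a real dataset with uneven transects would need a check for that.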

How to calculate recency in R

I have the following data:
set.seed(20)
round<-rep(1:10,2)
part<-rep(1:2, c(10,10))
game<-rep(rep(1:2,c(5,5)),2)
pay1<-sample(1:10,20,replace=TRUE)
pay2<-sample(1:10,20,replace=TRUE)
pay3<-sample(1:10,20,replace=TRUE)
decs<-sample(1:3,20,replace=TRUE)
previous_max<-c(0,1,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,0)
gamematrix<-cbind(part,game,round,pay1,pay2,pay3,decs,previous_max )
gamematrix<-data.frame(gamematrix)
Here is the output:
part game round pay1 pay2 pay3 decs previous_max
1 1 1 1 9 5 6 2 0
2 1 1 2 8 1 1 1 1
3 1 1 3 3 5 5 3 0
4 1 1 4 6 1 5 1 0
5 1 1 5 10 3 8 3 0
6 1 2 6 10 1 5 1 0
7 1 2 7 1 10 7 3 0
8 1 2 8 1 10 8 2 1
9 1 2 9 4 1 5 1 0
10 1 2 10 4 7 7 2 0
11 2 1 1 8 4 1 1 0
12 2 1 2 8 5 5 2 0
13 2 1 3 1 9 3 1 1
14 2 1 4 8 2 10 2 1
15 2 1 5 2 6 2 3 1
16 2 2 6 5 5 6 2 0
17 2 2 7 4 5 1 2 0
18 2 2 8 2 10 5 2 1
19 2 2 9 3 7 3 2 1
20 2 2 10 9 3 1 1 0
How can I calculate a new indicator variable "previous_max", which records whether, in a given round, the participant chose the option that had the maximal payoff in the previous round of the same game?
So I want something like the following:
Participant (part) 1:
In the first round of each game, previous_max is "0" (there is no previous round). In round 2, previous_max = "1", because in round 1 the maximal payoff was max(pay1,pay2,pay3)=max(9,5,6)=9, which is pay1, and in round 2 the participant's decision (decs) was 1.
In round 3, previous_max = 0, because the maximal value in round 2 was 8 (which is "pay1"), but the participant chose "3" (which is pay3).
Here's a solution using dplyr and purrr::map.
I would have preferred to use group_by rather than split, but max.col ignores groups and I don't know of a dplyr equivalent.
The output is slightly different, but I think that is because of mistakes in your expected values; please explain if not and I'll update my answer.
library(purrr)
library(dplyr)
gamematrix %>%
split(.$part) %>%
map(~ .x %>% mutate(
prev_max = as.integer(
decs ==
c(0,max.col(.[c("pay1","pay2","pay3")])[-n()]) # the number of the max columns, offset by one
))) %>%
bind_rows
#    part game round pay1 pay2 pay3 decs prev_max
# 1 1 1 1 9 5 6 2 0
# 2 1 1 2 8 1 1 1 1
# 3 1 1 3 3 5 5 3 0
# 4 1 1 4 6 1 5 1 0
# 5 1 1 5 10 3 8 3 0
# 6 1 2 6 10 1 5 1 1
# 7 1 2 7 1 10 7 3 0
# 8 1 2 8 1 10 8 2 1
# 9 1 2 9 4 1 5 1 0
# 10 1 2 10 4 7 7 2 0
# 11 2 1 1 8 4 1 1 0
# 12 2 1 2 8 5 5 2 0
# 13 2 1 3 1 9 3 1 1
# 14 2 1 4 8 2 10 2 1
# 15 2 1 5 2 6 2 3 1
# 16 2 2 6 5 5 6 2 1
# 17 2 2 7 4 5 1 2 0
# 18 2 2 8 2 10 5 2 1
# 19 2 2 9 3 7 3 2 1
# 20 2 2 10 9 3 1 1 0
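The output above differs from the question's previous_max in the first round of game 2 (rows 6 and 16), because the data are split by part only. If the first round of every game should be 0, as the question states, one possible base R sketch groups by part and game together. It also passes ties.method = "first" to max.col, since the default breaks ties at random. The helper prev_max_flag and the small data frame d (the first seven rows of the question's printed data) are illustrative.

```r
# First seven rows of the question's printed data (participant 1)
d <- data.frame(part = 1, game = c(1,1,1,1,1,2,2), round = 1:7,
                pay1 = c(9,8,3,6,10,10,1),
                pay2 = c(5,1,5,1,3,1,10),
                pay3 = c(6,1,5,5,8,5,7),
                decs = c(2,1,3,1,3,1,3))

# 1 if the decision equals the best-paying option of the previous row
prev_max_flag <- function(dat) {
  best <- max.col(as.matrix(dat[c("pay1", "pay2", "pay3")]),
                  ties.method = "first")
  prev_best <- c(NA, best[-length(best)])   # shift down by one round
  as.integer(!is.na(prev_best) & dat$decs == prev_best)
}

# Apply the helper separately within each part x game block
grp <- interaction(d$part, d$game, drop = TRUE)
d$prev_max <- NA_integer_
for (rows in split(seq_len(nrow(d)), grp)) {
  d$prev_max[rows] <- prev_max_flag(d[rows, ])
}
d
```

How ties between payoffs should be resolved is not specified in the question, so choosing the first maximal column is an assumption.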

Create a block column based on id and the value of another column in R

Given the following first two columns (id and time_diff), I want to generate the 'block' column:
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed as the difference between the previous time and the current row's time within the same id. I want to create a block id that auto-increments whenever a new id, or a time_diff > 10 within the same id, is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine: correcting it is easy and can be done at several different steps, so I will leave it to you :)
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
mutate(group = cumsum(time_diff > 10 |!duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
mutate(
b = as.integer(factor(id)) - lag(as.integer(factor(id))),
more10 = time_diff > 10,
increment = pmax(b, more10, na.rm = TRUE),
increment = ifelse(row_number() == 1, 1, increment),
block = cumsum(increment)
) %>%
select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block= c(1)
for(i in 2:nrow(df))
block[i] = ifelse(df$time_diff[i]>10 || df$id[i]!=df$id[i-1],
block[i-1]+1,
block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
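For comparison, the same rule can be written without a loop and with the NA handling made explicit. This is only a sketch in base R, assuming (as stated in the question) that the data are sorted by id; new_id and big_gap are illustrative names.

```r
test <- data.frame(
  id        = c(rep("a", 6), rep("b", 7), rep("c", 3)),
  time_diff = c(NA,1,1,1,3,3, NA,11,1,1,1,12,1, NA,4,7)
)

# TRUE on the first row of each id (valid because the data are sorted by id)
new_id  <- !duplicated(test$id)
# TRUE where the gap exceeds 10; a missing time_diff counts as no gap
big_gap <- !is.na(test$time_diff) & test$time_diff > 10

# Each TRUE starts a new block; the running count is the block id
test$block <- cumsum(new_id | big_gap)
test
```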
