Subset a Dataframe by column vector of a different Dataframe - r

Suppose I have a data frame like this:
dfA<-data.frame(A=c(letters[1:3]),B=c(letters[4:6]),C=c(letters[7:9]))
> dfA
A B C
1 a d g
2 b e h
3 c f i
and another one like this:
dfB<-data.frame(replicate(12,sample(0:5,5,rep=T)))
colnames(dfB)<-sample(letters[1:9],12,rep=T)
> dfB
a a d d g e i c i a g h
1 0 3 3 2 2 1 2 4 1 2 4 0
2 2 2 3 0 0 0 4 4 1 5 2 1
3 4 5 0 3 2 4 3 5 1 4 2 3
4 0 1 0 4 4 3 2 2 1 2 3 1
5 4 0 2 1 2 4 0 5 5 0 5 1
How could I select all columns from dfB whose names are contained in column A of dfA?
I am quite new to R and have searched this forum a lot, but couldn't find the exact answer.
I tried something like this: sub<-subset(dfB, !colnames(dfB) %in% dfA$A), with unsatisfying results so far.
The output I'd want to get would be:
> sub
a a c a
1 0 3 4 2
2 2 2 4 5
3 4 5 5 4
4 0 1 2 2
5 4 0 5 0
Can anyone help?

As akrun pointed out in the comments,
subset(dfB, select=colnames(dfB) %in% dfA$A)
works perfectly.
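For reference, the same selection can be written with plain bracket indexing (a minimal base-R sketch):
# keep only the columns of dfB whose names appear in dfA$A;
# drop=FALSE keeps a data frame even if only one column matches
sub <- dfB[, colnames(dfB) %in% dfA$A, drop=FALSE]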

Related

Replace a cell with NA according to value in another cell in R

I have a dataset from which I made a reproducible example:
set.seed(1)
Data <- data.frame(
  A = sample(0:5),
  B = sample(0:5),
  C = sample(0:5),
  D = sample(0:5),
  corr_A.B = sample(0:5),
  corr_A.C = sample(0:5),
  corr_A.D = sample(0:5))
> Data
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 4 0 1
5 0 4 2 1 2 3 3
6 4 1 0 5 3 4 5
I would like to check, for each of the columns B, C and D, whether a cell is equal to 0; if so, I would like to replace, on the same row, the corresponding corr_A column with NA. For instance, since Data$B[4] is equal to 0, I would like Data$corr_A.B[4] to be replaced by NA.
I am looking to obtain the following result:
> Data
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 NA 0 NA
5 0 4 2 1 2 3 3
6 4 1 0 5 3 NA 5
I have tried different ways, using for loops, but I am struggling a lot. Also, the dataset I am working on has many other columns that do not need to be checked for this condition, so I would like to be able to specifically designate in which columns I am looking for 0 values.
Would someone be kind enough to give it a try? Many thanks!
A one-liner using the replacement function is.na<-:
is.na(Data[5:7]) <- Data[2:4] == 0
Data
# A B C D corr_A.B corr_A.C corr_A.D
#1 1 5 4 2 1 2 4
#2 5 3 1 3 5 5 0
#3 2 2 3 4 0 1 2
#4 3 0 5 0 NA 0 NA
#5 0 4 2 1 2 3 3
#6 4 1 0 5 3 NA 5
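The same effect can be had with direct logical-matrix assignment, since Data[2:4] == 0 is a logical matrix with the same shape as Data[5:7] (an equivalent sketch):
# TRUE cells in the mask mark where the corr_A columns should become NA
Data[5:7][Data[2:4] == 0] <- NA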
For a base R solution, we can just use ifelse here:
Data$corr_A.B <- ifelse(Data$B == 0, NA, Data$corr_A.B)
Data$corr_A.C <- ifelse(Data$C == 0, NA, Data$corr_A.C)
Data$corr_A.D <- ifelse(Data$D == 0, NA, Data$corr_A.D)
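To avoid repeating the pattern once per column, the same ifelse logic can be driven by a named map of check/target pairs (a sketch; col_map is just an illustrative name):
# map each column to be checked to the corr_A column it controls
col_map <- c(B = "corr_A.B", C = "corr_A.C", D = "corr_A.D")
for (chk in names(col_map)) {
  Data[[col_map[[chk]]]] <- ifelse(Data[[chk]] == 0, NA, Data[[col_map[[chk]]]])
}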
df <- data.frame(A = c(1,5,2,3,0,4),
                 B = c(5,3,2,0,4,1),
                 C = c(4,1,3,5,2,0),
                 D = c(2,3,4,0,1,5),
                 corr_A.B = c(1,5,0,4,2,3),
                 corr_A.C = c(2,5,1,0,3,4),
                 corr_A.D = c(4,0,2,1,3,5))
library(dplyr)
df %>% mutate(corr_A.B = case_when(B == 0 ~ NA_real_,
                                   TRUE ~ corr_A.B),
              corr_A.C = case_when(C == 0 ~ NA_real_,
                                   TRUE ~ corr_A.C),
              corr_A.D = case_when(D == 0 ~ NA_real_,
                                   TRUE ~ corr_A.D))
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 NA 0 NA
5 0 4 2 1 2 3 3
6 4 1 0 5 3 NA 5
A base, one-liner, vectorized, but convoluted solution:
Data[t(t(which(Data[,2:4]==0,arr.ind=TRUE))+c(0,4))]<-NA
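Unpacked, it builds the (row, column) index matrix of the zeros and shifts the column part by 4 so it points at the corr_A columns (a commented sketch of the same steps):
hits <- which(Data[, 2:4] == 0, arr.ind = TRUE)  # (row, col) pairs of the zeros; col runs 1..3
hits[, "col"] <- hits[, "col"] + 4               # shift 1..3 to 5..7, the corr_A columns
Data[hits] <- NA                                 # matrix indexing assigns NA cell by cell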
Using mapply() over the paired check/target columns. You could do:
Data[c("corr_A.B","corr_A.C","corr_A.D")] <- mapply(function(chk, corr) {
  ifelse(chk == 0, NA, corr)
}, Data[c("B","C","D")], Data[c("corr_A.B","corr_A.C","corr_A.D")])

R: Creating multiple resampled dataset based on multiple factors

I need to create multiple (several thousand) resampled datasets from a large database. I have three categorical variables: Site (S), Transect (T), and Quadrat (Q). The response variable is Value (V), which is the result of the particular S, T, and Q combination. Quadrats are nested along each transect at each site. I pasted an abbreviated dataset below.
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0
The idea would be that, for a given site, the resampled dataset would contain ## quads from transects 1 to n, where ## would be the number of quadrats (Q) per transect (T) per site (S). I am not trying to resample the dataset based on S, T, and Q. I would like to be able to resample a user-defined number of rows, based on the conditions I define. For example, if I chose to resample based on 2 quadrats (Q) per transect (T) per site (S), I envision the resampled dataset looking like the example below.
S T Q V
A 1 1 8
A 1 3 0
A 2 1 0
A 2 2 15
A 3 2 25
A 3 3 0
B 1 2 1
B 1 3 0
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
C 1 1 0
C 1 3 0
C 2 1 45
C 2 3 0
C 3 2 1
C 3 3 0
Please let me know if that doesn't make sense and I'll revise until it does. Thanks for any assistance!
Consider by to slice the data frame by the Site and Transect factors and then sample random rows:
set.seed(444)
quads <- 2
# BUILD LIST OF SUBSETTED RANDOM SAMPLED DATAFRAMES
df_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), quads),])
# STACK ALL DATAFRAMES INTO ONE FINAL DF
sample_df <- do.call(rbind, df_list)
# SORT DATAFRAME BY S AND T
sample_df <- with(sample_df, sample_df[order(S, T),])
# RESET ROW NAMES
row.names(sample_df) <- NULL
sample_df
# S T Q V
# 1 A 1 1 8
# 2 A 1 3 0
# 3 A 2 2 15
# 4 A 2 1 0
# 5 A 3 1 0
# 6 A 3 3 0
# 7 B 1 2 1
# 8 B 1 1 0
# 9 B 2 3 2
# 10 B 2 1 33
# 11 B 3 1 0
# 12 B 3 2 207
# 13 C 1 1 0
# 14 C 1 2 1
# 15 C 2 1 45
# 16 C 2 3 0
# 17 C 3 3 0
# 18 C 3 2 1
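One caution: sample(nrow(df), quads) errors if any S/T group has fewer rows than quads. A defensive variant caps the draw at the group size (a sketch):
df_list <- by(df, df[c("S", "T")], FUN=function(g)
  g[sample(nrow(g), min(quads, nrow(g))), ])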
Data
txt = '
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0'
df = read.table(text=txt, header=TRUE)
To build many randomly generated data frames, simply extend quads out to a vector and run it through lapply:
max_quads <- 3
quads <- replicate(1000, sample(1:max_quads, 1))
df_list <- lapply(quads, function(q) {
  by_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), q),])
  sample_df <- do.call(rbind, by_list)
  sample_df <- with(sample_df, sample_df[order(S, T),])
  row.names(sample_df) <- NULL
  return(sample_df)
})
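Each element of df_list is then one complete resampled data frame, e.g.:
head(df_list[[1]])  # inspect the first resampled dataset
length(df_list)     # 1000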

Creating a Rolling Wall Count Variable in R

I have a dataset with around 21k observations, and a categorical variable for each observation with options A, B, and C. I'm looking to create an experience variable for countries that have previously taken option C in prior observations (case t-1, to put it simply). I've been told this is called a rolling wall count. I haven't been able to figure out how to go about this or what package is best to use. Any suggestions would be super helpful!
dispute=c("1","1","1","2","2","2","2","3","3","3")
partner=c("1","2","3","1","2","3","4","2","1","3")
position=c("A","C","C","B","C","A","C","B","C","C")
Currently my data looks something like this:
Dispute Partner Position
1 1 A
1 2 C
1 3 C
2 1 B
2 2 C
2 3 A
2 4 C
3 2 B
3 1 C
3 3 C
Ideally, I would create a variable that cumulatively counts when each unique observation takes on the value C (generating an "experience" count for each unique "partner"):
Dispute Partner Position Experience
1 1 A NA
1 2 C 1
1 3 C 1
2 1 B NA
2 2 C 2
2 3 A NA
2 4 C 1
3 2 B NA
3 1 C 1
3 3 C 2
With data.table (the trailing *(position=="C") factor zeroes the counter on non-C rows):
library(data.table)
setDT(df)[, experience:=cumsum(position=="C")*(position=="C"), by=partner]
dispute partner position experience
1: 1 1 A 0
2: 1 2 C 1
3: 1 3 C 1
4: 2 1 B 0
5: 2 2 C 2
6: 2 3 A 0
7: 2 4 C 1
8: 3 2 B 0
9: 3 1 C 1
10: 3 3 C 2
With dplyr
library(dplyr)
df %>%
group_by(partner) %>%
mutate(experience=cumsum(position=="C")*(position=="C"))
dispute partner position experience
1 1 1 A 0
2 1 2 C 1
3 1 3 C 1
4 2 1 B 0
5 2 2 C 2
6 2 3 A 0
7 2 4 C 1
8 3 2 B 0
9 3 1 C 1
10 3 3 C 2
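Both versions leave 0 rather than NA on non-C rows; if the NA form from the expected output is wanted, one extra step converts them (a sketch, run after the experience column exists on df):
df$experience[df$experience == 0] <- NA
# or, inside the dplyr pipe: mutate(experience = na_if(experience, 0))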
data
df <- data.frame(dispute = c("1","1","1","2","2","2","2","3","3","3"),
                 partner = c("1","2","3","1","2","3","4","2","1","3"),
                 position = c("A","C","C","B","C","A","C","B","C","C"))

Extracting event rows from a data frame

I have this data frame:
df <- read.table(header=TRUE, text='
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1')
I want to extract the rows that have a new event in the value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=2, so I would take the first row at TIME=0 only and discard the second row. However, in the third row the value for var=3 has changed to zero, so I also have to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set and I want to extract the rows at which a new event occurs in the value of every var. The data frame has five variables, numbered 3, 4, 5, 6, and 7; the above is an example for 2 of them (variable numbers 3 and 4).
This does it using dplyr
library(dplyr)
df %>%
group_by(ID, var) %>%
mutate(tf = ifelse(value==lag(value), 1, 0)) %>%
filter(is.na(tf) | tf==0) %>%
select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
Basically, I created an extra variable that returns a '1' when the value is the same as in the preceding row within groups of unique ID/var combinations. We then get rid of this variable before returning the output.
Base solution:
df[with(df, abs(ave(value,ID,FUN=function(x) c(1,diff(x)) ))) > 0,]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
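Note this ave call groups by ID only; it happens to work here because value changes at every var boundary. Grouping by both keys is safer (a sketch):
df[with(df, abs(ave(value, ID, var, FUN=function(x) c(1, diff(x))))) > 0, ]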
From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach as #thelatemail
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , ID]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))

Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because column D has values that are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do this; I just can't think of one. And if it makes any difference, the NAs must be retained.
Additional helpful information: in my real data frame there are a total of 20 columns, so there are 10 columns like L and J, and another 10 that are like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly if necessary.
Thank you in advance!
Okay, assuming the column-number-based correspondence, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the column that should be left.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d) , since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
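A sketch of that suggested test, in the same indexing scheme as above (note that, unlike the colMeans version, the colSums form also assumes no NAs in the indicator block, since na.rm=TRUE would shrink the sum below NROW(d)):
idx <- which(colSums(d[(n+1):(2*n)], na.rm=TRUE) != NROW(d))
d[, c(idx, idx + n)]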
