Subsetting a column in one data frame using two columns in another data frame in R

I searched SO for a similar problem but couldn't find one.
I have two data frames. I want to subset one column in one data frame using two columns in another data frame.
The data frames are as follows.
df1 <- data.frame(x = c(22,23,22,34,21),
y = c(1,4,2,3,2))
df1
x y
1 22 1
2 23 4
3 22 2
4 34 3
5 21 2
df2 <- data.frame(a = c("John", "Matt", "foo","boo"),
b = c(4, NA, NA,2),
c = c(3, NA, 3, 3))
df2
a b c
1 John 4 3
2 Matt NA NA
3 foo NA 3
4 boo 2 3
I want to subset df1 by column y, using columns b and c from data frame df2, with a vectorized operation.
The output should be in list form, as follows:
df1
df1[1]
x y
2 23 4
4 34 3
df1[2]
df1[3]
x y
4 34 3
df1[4]
x y
3 22 2
4 34 3
5 21 2

You can try something like this:
dfnew <- vector("list", nrow(df2))
for (i in seq_len(nrow(df2))) {
  # values of b and c in row i, as a plain vector
  keys <- unlist(df2[i, 2:3])
  dfnew[[i]] <- df1[df1$y %in% keys, ]
}
Result:
dfnew
[[1]]
x y
2 23 4
4 34 3
[[2]]
[1] x y
<0 rows> (or 0-length row.names)
[[3]]
x y
4 34 3
[[4]]
x y
3 22 2
4 34 3
5 21 2

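We can use lapply. The result is named by df2$a, and split returns the groups in sorted name order rather than in the row order of df2:
lapply(split(df2[-1], as.character(df2$a)), function(x) df1[df1$y %in% unlist(x),])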
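If the row order of df2 should be preserved, a minimal base R sketch (not taken from the answers above) could iterate over row indices and name the result afterwards. Dropping the NA keys explicitly is an assumption about the desired behaviour; here it gives the same result as leaving them in, because df1$y contains no NA.

```r
df1 <- data.frame(x = c(22, 23, 22, 34, 21), y = c(1, 4, 2, 3, 2))
df2 <- data.frame(a = c("John", "Matt", "foo", "boo"),
                  b = c(4, NA, NA, 2),
                  c = c(3, NA, 3, 3))

# One subset of df1 per row of df2, in the row order of df2
res <- lapply(seq_len(nrow(df2)), function(i) {
  keys <- unlist(df2[i, c("b", "c")])   # values of b and c for this row
  df1[df1$y %in% keys[!is.na(keys)], ]  # rows of df1 whose y matches a key
})
names(res) <- as.character(df2$a)
```

res$Matt is then an empty data frame, matching the empty second element in the expected output.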

Related

I have a list of data frames and a character vector. I want to rename the second column of each data frame by iterating through the vector. How do I do this?

I have a list of dataframes. Each of these dataframes has the same number of columns and rows, and has a similar data structure:
df.list <- list(data.frame1, data.frame2, data.frame3)
I have a vector of characters:
charvec <- c("a","b","c")
I want to replace the column name of the second column in each data frame by iterating through the above character vector. For example, the first data frame's second column should be "a". The second data frame's second column should be "b".
[[1]]
col1 a
1 1 2
2 2 3
[[2]]
col1 b
1 1 2
2 2 3
A reproducible example:
charvec <- c("a", "b", "c")
df_list <- list(
  df1 = data.frame(x = seq_len(3), y = seq_len(3)),
  df2 = data.frame(x = seq_len(4), y = seq_len(4)),
  df3 = data.frame(x = seq_len(5), y = seq_len(5))
)
for (i in seq_along(df_list)) {
  names(df_list[[i]])[2] <- charvec[i]
}
> df_list
$df1
x a
1 1 1
2 2 2
3 3 3
$df2
x b
1 1 1
2 2 2
3 3 3
4 4 4
$df3
x c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
You can also use map2 from purrr. Thanks to @ismirsehregal for the example data.
library(purrr)
map2(
df_list,
charvec,
\(x, y) {
names(x)[2] <- y
x
}
)
Output
$df1
x a
1 1 1
2 2 2
3 3 3
$df2
x b
1 1 1
2 2 2
3 3 3
4 4 4
$df3
x c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
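For comparison, the same pairwise renaming can be sketched in base R with Map, which walks the list and the character vector in parallel. The names renamed, d and nm are only illustrative.

```r
charvec <- c("a", "b", "c")
df_list <- list(
  df1 = data.frame(x = seq_len(3), y = seq_len(3)),
  df2 = data.frame(x = seq_len(4), y = seq_len(4)),
  df3 = data.frame(x = seq_len(5), y = seq_len(5))
)

# Map pairs the i-th data frame with the i-th new name
renamed <- Map(function(d, nm) {
  names(d)[2] <- nm  # rename the second column only
  d                  # return the modified data frame
}, df_list, charvec)
```

Map keeps the names of its first list argument, so the result is still named df1, df2 and df3.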

R - how to select elements from sublists of a list by their name

I have a list of lists that looks like this:
list(list("A[1]" = data.frame(W = 1:5),
"A[2]" = data.frame(X = 6:10),
B = data.frame(Y = 11:15),
C = data.frame(Z = 16:20)),
list("A[1]" = data.frame(W = 21:25),
"A[2]" = data.frame(X = 26:30),
B = data.frame(Y = 31:35),
C = data.frame(Z = 36:40)),
list("A[1]" = data.frame(W = 41:45),
"A[2]" = data.frame(X = 46:50),
B = data.frame(Y = 51:55),
C = data.frame(Z = 56:60))) -> dflist
I need my output to also be a list of lists of length 3, where each sublist retains the elements whose names start with A[ and drops the others.
Based on some previous questions, I am trying to use this:
dflist %>%
map(keep, names(.) %in% "A[")
but that gives the following error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
Trying to select a single element, for example just A[1] like this:
dflist %>%
map(keep, names(.) %in% "A[1]")
also doesn't work. How can I achieve the desired output?
I think you want:
purrr::map(dflist, ~.[stringr::str_starts(names(.), "A\\[")])
What this does is:
For each sublist (purrr::map)
Select all elements of that sublist (.[], where . is the sublist)
Whose names start with A[ (stringr::str_starts(names(.), "A\\["))
You got the top-level map right, since you want to modify the sublists. However, map(keep, names(.) %in% "A[") has some issues:
names(.) %in% "A[" should be a function or a formula (starting with ~). As written, the pipe replaces . with the whole dflist rather than with each sublist, which is why the lengths do not match
purrr::keep applies a filtering function to each element of the sublist, namely to the data frames directly, so the function never "sees" the name of each data frame. keep only helps here if it is given a logical vector computed from the names instead of a predicate function
Anyway this produces:
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
If we want to use keep, use
library(dplyr)
library(purrr)
library(stringr)
map(dflist, ~ keep(.x, str_detect(names(.x), fixed("A["))))
Here is a base R solution (the ^ anchors the pattern so that only names starting with A[ match):
lapply(dflist, function(x) x[grep("^A\\[", names(x))])
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
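If a regular expression feels heavy for a fixed prefix, a small base R sketch with startsWith avoids escaping the bracket altogether. The list below is a shortened, two-element version of dflist, and a_only is just an illustrative name.

```r
dflist <- list(
  list("A[1]" = data.frame(W = 1:5), "A[2]" = data.frame(X = 6:10),
       B = data.frame(Y = 11:15), C = data.frame(Z = 16:20)),
  list("A[1]" = data.frame(W = 21:25), "A[2]" = data.frame(X = 26:30),
       B = data.frame(Y = 31:35), C = data.frame(Z = 36:40))
)

# startsWith() compares literal prefixes, so "[" needs no escaping
a_only <- lapply(dflist, function(s) s[startsWith(names(s), "A[")])
```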

understanding apply and outer function in R

Suppose I have data that looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID's value of A changes over the period given by B (which runs from 1 to 4), its rows go into data frame K; if it doesn't change, they go into data frame L.
So in this data set, K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if/else statements, it can be solved like the following:
K <- L <- df[0, ]
for (id in unique(df$ID)) {
  sub <- df[df$ID == id, ]
  m <- 0  # number of changes in A within this ID
  for (j in seq_len(nrow(sub) - 1)) {
    if (sub$A[j] != sub$A[j + 1]) m <- m + 1
  }
  if (m == 0) L <- rbind(L, sub) else K <- rbind(K, sub)
}
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Can someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check, within each ID (i.e. each cycle of B), whether A has any variance, and then convert the result to a logical using !. This works by converting A to a numeric value, assuming it is a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead. Then split accordingly. With base R we can use ave for such a task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(indx = !var(as.numeric(A))) %>%
  split(., .$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In plainer English: go through df$A grouped by df$ID, and apply the function to df$A within each group (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
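To get from those per-ID labels to the two data frames, one possible sketch is to look up each row's label by its ID and split on it. The data frame is rebuilt from the question's table, with A assumed to be a character column; label and groups are illustrative names.

```r
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10),
  stringsAsFactors = FALSE
)

# One label per ID: "K" if A takes more than one value, otherwise "L"
label <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up each row's label through its ID, then split the rows on it
groups <- split(df, as.vector(label[as.character(df$ID)]))
K <- groups$K
L <- groups$L
```

The lookup works because tapply names its result by the grouping values, so indexing with as.character(df$ID) returns one label per row.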
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. By double negating (!!) the output, we convert the '0' counts to FALSE and all other frequencies to TRUE. The rowSums then count the distinct 'A' values per 'ID', and comparing them with 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx' to get a TRUE/FALSE value for each row, and split the dataset by that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need the individual datasets in the global environment, change the names of the list elements to the object names we want and use list2env (not recommended, though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

Creating a dataframe grouping observations according to labels

I have an x vector with categorical variables and a y vector of numerical variables, both of the same length.
I need to create a data-frame in which all numerical observations in y are separated into groups by a categorical label in x so the end result would look something like:
x obs1 obs2 obs3
a 1 3 5
b 6 7 8
c 3 4 6
Now both aggregate and tapply require a FUN specification but I don't want to do operations on the variables.
x= {random sampling from letters of the alphabet}
y= {random numbers}
Remember, everything is a function in R. So things like c() are just function calls.
x <- rep(letters[1:3], each=3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)
foo <- tapply(y, x, c)
# > foo
# $a
# [1] 1 3 5
# $b
# [1] 6 7 8
# $c
# [1] 3 4 6
Then you can use this silly pattern to get the data.frame you're looking for:
do.call(rbind, foo)
# [,1] [,2] [,3]
# a 1 3 5
# b 6 7 8
# c 3 4 6
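To go from that matrix to a data frame shaped like the one in the question, a possible follow-up sketch is shown below. It uses split, which gives the same per-group vectors as tapply(y, x, c) here; the obs1, obs2, ... column names come from the question, and mat and wide are illustrative names.

```r
x <- rep(letters[1:3], each = 3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)

# One row per label; this needs every group to have the same length
mat <- do.call(rbind, split(y, x))
colnames(mat) <- paste0("obs", seq_len(ncol(mat)))

# Move the labels from the row names into a regular column
wide <- data.frame(x = rownames(mat), mat, row.names = NULL)
```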
I am not clear about something from your example: is it possible for there to be different numbers of y-values for each category in x? For example, would you consider basic data like this:
> x <- c(rep(c("a", "b", "c"), 3), "c", "c")
> y <- sample(1:20, 11)
> df <- data.frame(x, y)
> df
x y
1 a 16
2 b 4
3 c 9
4 a 2
5 b 12
6 c 17
7 a 7
8 b 10
9 c 11
10 c 1
11 c 8
Here there are more values for category c. This is not entirely what you are looking for, but it might be a start:
> library(reshape2)
> dcast(df, x ~ y)
Using y as value column: use value.var to override.
x 1 2 4 7 8 9 10 11 12 16 17
1 a NA 2 NA 7 NA NA NA NA NA 16 NA
2 b NA NA 4 NA NA NA 10 NA 12 NA NA
3 c 1 NA NA NA 8 9 NA 11 NA NA 17
The values for each of the categories appear on the right rows... the NAs are a nuisance though. How would you want the data to appear in this case? Something like
1 a 2 7 16
2 b 4 10 12
3 c 1 8 9 11 17
This will not work, of course, because each row must have the same number of columns, so you would end up with NAs for the last two elements in the top two rows.
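If a rectangular result with trailing NAs is acceptable, the padding described above can be sketched in base R by extending each group to the longest length. The y values are the ones printed in the example data; n_max, padded and wide_na are illustrative names.

```r
x <- c(rep(c("a", "b", "c"), 3), "c", "c")
y <- c(16, 4, 9, 2, 12, 17, 7, 10, 11, 1, 8)

groups <- split(y, x)
n_max <- max(lengths(groups))

# Assigning a longer length pads a vector with NA at the end
padded <- t(sapply(groups, function(v) {
  length(v) <- n_max
  v
}))
colnames(padded) <- paste0("obs", seq_len(n_max))
wide_na <- data.frame(x = rownames(padded), padded, row.names = NULL)
```

Rows a and b end with two NAs each, while row c fills all five columns.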
However, I suspect that a list would probably be the best solution in this case anyway, in which case, consider this:
> dl <- split(y, x)
> dl[["a"]]
[1] 16 2 7
> dl$b
[1] 4 12 10
> dl[["c"]]
[1] 9 17 11 1 8
You can then operate on the elements of this list. As with all things R, there are a variety of ways to do this. For example, to get the output as a list:
> lapply(dl, sum)
$a
[1] 25
$b
[1] 26
$c
[1] 46
Or with output as a vector
> sapply(dl, sum)
a b c
25 26 46
Or, alternatively, to get the output as a data frame:
> library(plyr)
> ldply(dl, sum)
.id V1
1 a 25
2 b 26
3 c 46
These mechanisms afford a far greater degree of generality than functions like rowSum() since you can apply essentially arbitrary functions to each of the elements in the original list.

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
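The single-step version hinted at above might look like the following sketch, which assigns all the new columns at once. It uses the data frame m from the question; rank_normal and cols are illustrative names, and treating every column except id as a column to transform is an assumption.

```r
m <- data.frame(id = 1:6,
                w = c(2, 18, 1, 52, 5, 3),
                y = c(5, 5, 25, 25, 5, 3),
                z = c(8, 98, 5, 8, 4, 5))

# Rank-based normal transformation from the question
rank_normal <- function(x) {
  qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
}

# Assumption: every column except id should be transformed
cols <- setdiff(names(m), "id")
m[paste0(cols, "_n")] <- lapply(m[cols], rank_normal)
```

Because cols is computed from the column names, the same two lines work unchanged when the data frame has more letter columns.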
