In R, extract rows based on strings in different columns - r

Sorry if the solution to my problem is already out there, and I overlooked it. There are a lot of similar topics which all helped me understand the basics of what I'm trying to do, but did not quite solve my exact problem.
I have a data frame df:
> type = c("A","A","A","A","A","A","B","B","B","B","B","B")
> place = c("x","y","z","x","y","z","x","y","z","x","y","z")
> value = c(1:12)
>
> df=data.frame(type,place,value)
> df
type place value
1 A x 1
2 A y 2
3 A z 3
4 A x 4
5 A y 5
6 A z 6
7 B x 7
8 B y 8
9 B z 9
10 B x 10
11 B y 11
12 B z 12
>
(my real data has 3 different values in type and 10 in place, if that makes a difference)
I want to extract rows based on the strings in columns m and n.
E.g. I want to extract all rows that contain A in type and x and z in place, or all rows with A and B in type and y in place.
This works perfectly with subset, but I want to run my scripts on different combinations of extracted rows, and adjusting the subset command every time isn't very effective.
I thought of using a vector containing as elements what to get from type and place, respectively.
I tried:
v=c("A","x","z")
df.extract <- df[df$type&df$place %in% v]
but this returns an error.
I'm a total beginner with R and programming, so please bear with me.

You could try
df[df$type=='A' & df$place %in% c('x','y'),]
# type place value
#1 A x 1
#2 A y 2
#4 A x 4
#5 A y 5
For the second case
df[df$type %in% c('A', 'B') & df$place=='y',]
Update
Suppose, you have many columns and needs to subset the dataset based on values from many columns. For example.
set.seed(24)
df1 <- cbind(df, df[sample(1:nrow(df)),], df[sample(1:nrow(df)),])
colnames(df1) <- paste0(c('type', 'place', 'value'), rep(1:3, each=3))
row.names(df1) <- NULL
You can create a list of the values from the columns of interest
v1 <- setNames(list('A', 'x', c('A', 'B'),
'x', 'B', 'z'), paste0(c('type', 'place'), rep(1:3, each=2)))
and then use Reduce
df1[Reduce(`&`,Map(`%in%`, df1[names(v1)], v1)),]

you can make a function extract :
extract<-function(df,type,place){
df[df$type %in% type & df$place %in% place,]
}
that will work for the different subsets you want to do :
df.extract<-extract(df=df,type="A",place=c("x","y")) # or just extract(df,"A",c("x","y"))
> df.extract
type place value
1 A x 1
2 A y 2
4 A x 4
5 A y 5
df.extract<-extract(df=df,type=c("A","B"),place="y") # or just extract(df,c("A","B"),"y")
> df.extract
type place value
2 A y 2
5 A y 5
8 B y 8
11 B y 11

Related

Vector to dataframe with variable row length [duplicate]

This question already has answers here:
Convert Rows into Columns by matching string in R
(3 answers)
Closed 4 years ago.
Given a vector, I want to convert it to a dataframe using a 'key' value which is randomly distributed throughout the vector at the start of what is to be a row. In this case, "z" would be the first value in each column.
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")
The resultant data should look like:
#using magrittr
data.frame(x1 = c("z","a","b","c", NA), x2 = c("z","a","b","c", NA), x3 = c("z","a","b","c","d"))
%>% transpose()
One solution would be to find the largest distance between 'keys' in the vector and then interject blank values at the end of 'sections' that are smaller than the longest 'section' so you could use matrix()
What would be the best way to do this?
plyr::ldply(split(vd, cumsum(vd == "z")), rbind)[-1]
(copied from here)
result:
1 2 3 4 5
1 z a b c <NA>
2 z a b c <NA>
3 z a b c d
We can use cumsum to identify groups then split them. Then we append the vectors and format them as a data.frame.
x <- split(vd,cumsum("z"==vd))
maxl <- max(lengths(x))
as.data.frame(lapply(x,function(y) c(y,rep(NA,maxl-length(y)))))
# X1 X2 X3
# 1 z z z
# 2 a a a
# 3 b b b
# 4 c c c
# 5 <NA> <NA> d

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a sepcific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it really just takes all columns per group if there's less than nc, and samples random nc columns if there are more than nc columns.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('.[[:digit:]]','',colnames(res))

What is the most effective way to sort dataframe and add special id? [duplicate]

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to #mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as #mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.numeric(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %.%
group_by(x) %.%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this where x is the column by which grouping is to be done and y is any numeric column. if there are no numeric columns use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))

Sampling elements in data frame [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I'm trying to do resampling of the elements of a data frame. I'm open to use other data structures if recommended, but my understanding is that a DF would be better for combining strings, numbers, etc.
Let's say my input is this data frame:
16 x y z 2
11 a b c 1
.........
And I'd like to build as output another data structure (I take, another df) like this:
16 x y z
16 x y z
11 a b c
.........
I guess my main issue is the way to append the content, which is on columns df[,1:4].
Thanks in advance, p.
It's unclear from your description, but your desired output implies that you want to duplicate columns 1:4 according to column 5, this should do the job
df[rep(seq_len(nrow(df)), df[, 5]), -5]
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
Assuming you're starting with something like:
mydf
# V1 V2 V3 V4 V5
# 1 16 x y z 2
# 2 11 a b c 1
Then, you can just use expandRows from my "splitstackshape" package, like this:
library(splitstackshape)
expandRows(mydf, count = "V5")
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
By default, the function assumes that you are expanding your dataset based on an existing column, but you can just as easily add a numeric vector as the count argument, and set count.is.col = FALSE.
If you want to sample with replacement n rows from df data frame:
df[sample(nrow(df), n, replace=TRUE), ]

Create indicator

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to #mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as #mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.numeric(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %.%
group_by(x) %.%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this where x is the column by which grouping is to be done and y is any numeric column. if there are no numeric columns use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))

Resources