Extract rows whose ids appear in a list [duplicate] - r

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 3 years ago.
I have a data frame containing id and other variables, and also a list which has some ids. Now I want to extract the rows of the data frame whose ids are in the list.
data frame
id value time
1 12 1.0
1 14 1.6
4 18 2.0
6 9 3.6
3 11 4.2
5 12 0.8
list
1,3,4
Result
id value time
1 12 1.0
1 14 1.6
3 11 4.2
4 18 2.0

As @Sotos explained, this can be done with %in% as follows:
Data[Data$id %in% list,]
# id value time
# 1: 1 12 1.0
# 2: 1 14 1.6
# 3: 4 18 2.0
# 4: 3 11 4.2
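Note that %in% keeps the rows in their original order, while the Result in the question is sorted by id. A minimal sketch of how to get that ordering, assuming the data frame is rebuilt from the question and the ids are stored in a plain vector (here called ids, a name chosen only for this example):

```r
# Data frame from the question
Data <- data.frame(
  id    = c(1, 1, 4, 6, 3, 5),
  value = c(12, 14, 18, 9, 11, 12),
  time  = c(1.0, 1.6, 2.0, 3.6, 4.2, 0.8)
)
ids <- c(1, 3, 4)  # hypothetical name for the list of ids

# Keep only the rows whose id is in ids (original row order is preserved)
res <- Data[Data$id %in% ids, ]

# Sort by id to match the Result shown in the question
res <- res[order(res$id), ]
res
```

Naming the vector ids rather than list avoids masking the base function list().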


Add a column that iterates/ counts every time a sequence resets

I have a dataframe, with a column that increases with every row, and periodically (though not regularly) resets back to 1.
I'd like to track/ count these resets in separate column. This for-loop example does exactly what I want, but is incredibly slow when applied to large datasets. Is there a better/ quicker/ more R way to do this same operation:
ColA<-seq(1,20)
ColB<-rep(seq(1,5),4)
DF<-data.frame(ColA, ColB)
DF$ColC<-NA
DF[1,'ColC']<-1
#Removing row 15 and changing row 5 to 0.1 per comments in answer
DF<-DF[-15,]
DF[5,2]<-0.1
for(i in seq(1,nrow(DF)-1)){
  print(i)
  MyRow<-DF[i+1,]
  if(MyRow$ColB < DF[i,'ColB']){
    DF[i+1,"ColC"]<-DF[i,"ColC"] +1
  }else{
    DF[i+1,"ColC"]<-DF[i,"ColC"]
  }
}
No need for a loop here. We can just use the vectorized cumsum. This ought to be faster:
DF$ColC<-cumsum(DF$ColB==1)
DF
To handle varying reset values that are always lower than the previous value, use cumsum(ColB < lag(ColB)) with dplyr:
library(dplyr)
DF %>% mutate(ColC = cumsum(ColB < lag(ColB, default = Inf)))
ColA ColB ColC
1 1 1.0 1
2 2 2.0 1
3 3 3.0 1
4 4 4.0 1
5 5 0.1 2
6 6 1.0 2
7 7 2.0 2
8 8 3.0 2
9 9 4.0 2
10 10 5.0 2
11 11 1.0 3
12 12 2.0 3
13 13 3.0 3
14 14 4.0 3
16 16 1.0 4
17 17 2.0 4
18 18 3.0 4
19 19 4.0 4
20 20 5.0 4
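If you would rather not load dplyr, the same idea can be sketched in base R with diff(): a reset is any row whose ColB is lower than the previous row's. This is only an illustration of the technique above, rebuilt on the question's example data:

```r
# Rebuild the example data from the question
ColA <- seq(1, 20)
ColB <- rep(seq(1, 5), 4)
DF <- data.frame(ColA, ColB)
DF <- DF[-15, ]   # drop row 15
DF[5, 2] <- 0.1   # make one reset land on a value other than 1

# diff() is negative wherever the sequence drops; the first row
# always starts group 1, hence the leading TRUE
DF$ColC <- cumsum(c(TRUE, diff(DF$ColB) < 0))
DF
```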

(RStudio) How to convert a data frame to a matrix?

I have a dataset in data frame format, but I need to convert it to a matrix for a recommender system.
My data format:
col1 col2 col3
1 name 1 5.9
2 name 1 7.9
3 name 1 10
4 name 1 9
5 name 1 8.4
1 name 2 6
2 name 2 8.5
3 name 2 10
4 name 2 9.3
This is what I want:
name 1 name 2
1 5.9 6
2 7.9 8.5
3 10 10
4 9 9.3
5 8.4 NA (missing value, autofill "NA")
For the data you shared, the following base R solution works (as long as your data frame is called df and has the columns Hotel_Name and Hotel_Rating):
do.call(cbind, lapply(split(df$Hotel_Rating, df$Hotel_Name), `[`,
                      seq(max(table(df$Hotel_Name)))))
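To see what that one-liner does, here is a small self-contained sketch. The column names Hotel_Name and Hotel_Rating come from the answer's code; the values are the ones shown in the question, and the data frame is rebuilt by hand as an assumption about its layout:

```r
# Sample data shaped like the question's (layout assumed)
df <- data.frame(
  Hotel_Name   = c(rep("name 1", 5), rep("name 2", 4)),
  Hotel_Rating = c(5.9, 7.9, 10, 9, 8.4, 6, 8.5, 10, 9.3)
)

# 1. split() gives one vector of ratings per hotel
ratings <- split(df$Hotel_Rating, df$Hotel_Name)

# 2. Indexing each vector with 1:n (n = longest group) pads short groups with NA
n <- max(table(df$Hotel_Name))
padded <- lapply(ratings, `[`, seq_len(n))

# 3. cbind() the equal-length vectors into a matrix, one column per hotel
m <- do.call(cbind, padded)
m
```

Indexing a vector past its length returns NA, which is what fills in the missing fifth rating for "name 2".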

subsetting a dataframe by a condition in R [duplicate]

This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 3 years ago.
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I get after subsetting does not show the correct row numbers. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition<- df == 4
df[condition]
How can I subset the data so that I get back a dataset that shows the correct row numbers for subject 4?
You can also use the subset function:
subset(df,df$V1==4)
I've managed to find a solution since posting:
newdf <- subset(df, V1 == 4)
However, I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
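Another option, sketched here on data rebuilt from the question, is plain logical row indexing. With a single-column data frame, drop = FALSE is needed so the result stays a data frame and keeps the original row names:

```r
# Data rebuilt from the question: rows 1-15 are subject 2, rows 16-24 are subject 4
df <- data.frame(V1 = c(rep(2, 15), rep(4, 9)))

# Index rows with a logical vector; drop = FALSE stops R from
# collapsing the one-column result into a bare vector
newdf <- df[df$V1 == 4, , drop = FALSE]
newdf

# which() returns just the matching row positions
which(df$V1 == 4)
```

Indexing with a logical matrix, as in df[df == 4], returns a bare vector instead, so the row information is lost.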

How to run a loop on different sections of the same data.frame [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
Suppose I have a data frame with 2 variables which I'm trying to run some basic summary stats on. I would like to run a loop to give me the difference between minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values for 'number' so subsetting and running individually is not a realistic option. Data looks like this:
df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))
number seconds
1 1 1
2 1 4
3 1 8
4 2 1
5 2 5
6 2 11
7 2 23
8 3 1
9 3 8
10 4 1
11 4 9
12 4 11
13 4 24
14 4 44
15 4 112
16 5 1
17 5 34
18 5 55
19 5 109
My current code only returns the difference between the minimum and maximum seconds for the entire data frame:
ZZ <- unique(df$number)
for (i in ZZ){
Y <- max(df$seconds) - min(df$seconds)
}
Since you have a lot of data, performance should matter, and you should use a data.table instead of a data.frame:
library(data.table)
dt <- as.data.table(df)
dt[, .(spread = (max(seconds) - min(seconds))), by=.(number)]
number spread
1: 1 7
2: 2 22
3: 3 7
4: 4 111
5: 5 108
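For comparison, the same per-group spread can be sketched in base R with tapply, one of the grouping functions named in the linked duplicate. This is an alternative to the data.table version above, not a replacement for it, and it will likely be slower on very large data:

```r
# Data frame from the question
df <- data.frame(
  number  = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5),
  seconds = c(1, 4, 8, 1, 5, 11, 23, 1, 8, 1, 9, 11, 24, 44, 112, 1, 34, 55, 109)
)

# Apply max - min to the seconds of each distinct number
spread <- tapply(df$seconds, df$number, function(x) max(x) - min(x))
spread
```

The original loop returns a single value because max(df$seconds) and min(df$seconds) never use i, so every iteration works on the whole column.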

Creating a new column in a data frame whose entries depend on multiple columns in another data frame

I want to make new column in my data set with the values determined by values in another data set, but it's not as simple as the values in one column being a function of the values in the other. Here's an example:
>df1
chromosome position
1 1 1
2 1 2
3 1 4
4 1 5
5 1 7
6 1 12
7 1 13
8 1 15
9 1 21
10 1 23
11 1 24
12 2 1
13 2 5
14 2 7
15 2 8
16 2 12
17 2 15
18 2 18
19 2 21
20 2 22
and
>df2
chromosome segment_start segment_end segment.number
1 1 1 5 1.1
2 1 6 20 1.2
3 1 21 25 1.3
4 2 1 7 2.1
5 2 8 16 2.2
6 2 18 22 2.3
I want to make a new column in df1 called 'segment', and the value in segment is to be determined by which segment (as determined by 'segment_start', 'segment_end', and 'chromosome' from df2) the value in 'position' belongs to. For example, in df1, row 7, position=13, and chromosome=1. Because 13 is between 6 and 20, the entry in my hypothetical 'segment' column would be 1.2, from row 2 of df2, because 13 falls between segment_start and segment_end from that row (6 and 20, respectively), and the 'chromosome' value from df1 row 7 is 1, just as 'chromosome' in df2 row 2 is 1.
Each row in df1 belongs to one of the segments described in df2; that is, it lies on the same chromosome as one of the segments, and its 'position' is >=segment_start and <=segment_end. And I want to get that information into df1, so it says what segment each position belongs to.
I was thinking of using an if function, and started with:
if(df1$position>=df2$segment_start & df1$position<=df2$segment_end & df1$chromosome==df2$chromosome) df1$segment<-df2$segment.number
But I am not sure that way will be feasible. If nothing else, maybe the code can help illustrate what I'm trying to do. Basically, I want to match each row by its position and chromosome to a segment in df2. Thanks.
This appears to be a rolling join. You can use data.table for this:
require(data.table)
DT1 <- data.table(df1, key = c('chromosome','position'))
DT2 <- data.table(df2, key = c('chromosome','segment_start'))
# DT2[DT1, roll=TRUE] performs the join you want, but the result
# keeps the column names of DT2, which is why the columns are
# renamed and subset here
DT2[DT1, roll=TRUE][ ,list(chromosome,position = segment_start,segment.number)]
# chromosome position segment.number
# 1: 1 1 1.1
# 2: 1 2 1.1
# 3: 1 4 1.1
# 4: 1 5 1.1
# 5: 1 7 1.2
# 6: 1 12 1.2
# 7: 1 13 1.2
# 8: 1 15 1.2
# 9: 1 21 1.3
# 10: 1 23 1.3
# 11: 1 24 1.3
# 12: 2 1 2.1
# 13: 2 5 2.1
# 14: 2 7 2.1
# 15: 2 8 2.2
# 16: 2 12 2.2
# 17: 2 15 2.2
# 18: 2 18 2.3
# 19: 2 21 2.3
# 20: 2 22 2.3
You really need to check out the GenomicRanges package from Bioconductor. It provides the data structures that are appropriate for your use case.
First, we create the GRanges objects:
library(GenomicRanges)
gr1 <- with(df1, GRanges(chromosome, IRanges(position, width=1L)))
gr2 <- with(df2, GRanges(chromosome, IRanges(segment_start, segment_end),
segment.number=segment.number))
Then we find the overlaps and do the merge:
hits <- findOverlaps(gr1, gr2)
gr1$segment[queryHits(hits)] <- gr2$segment.number[subjectHits(hits)]
I'm going to assume that the regions in df2 are non-overlapping, continuous and complete (not missing any positions from df1). I seem to do this differently every time I try, so here's my latest idea.
First, make sure chromosome is a factor in both data sets:
df1$chromosome<-factor(df1$chromosome)
df2$chromosome<-factor(df2$chromosome)
Now I want to unwrap chr/pos into a single overall position. I'll do that with
ends<-with(df2, tapply(segment_end, chromosome, max))
offset<-head(c(0,cumsum(ends)),-1)
names(offset)<-names(ends)
This assigns unique position values to all positions across all chromosomes and it tracks the offset to the beginning of each chromosome in this new system. Now we will build a translation function from the data in df2
seglookup <- approxfun(with(df2, offset[chromosome]+segment_start), 1:nrow(df2),
method="constant", rule=2)
We use approxfun to find the right interval in the genetic position space for each segment. Now we use this function on df1
segid <- with(df1, seglookup(offset[chromosome]+position))
Now we have the correct ID for each position. We can verify this by merging the data and looking at the results
cbind(df1, df2[segid,-1])
chromosome position segment_start segment_end segment.number
1 1 1 1 5 1.1
2 1 2 1 5 1.1
3 1 4 1 5 1.1
4 1 5 1 5 1.1
5 1 7 6 20 1.2
6 1 12 6 20 1.2
7 1 13 6 20 1.2
8 1 15 6 20 1.2
9 1 21 21 25 1.3
10 1 23 21 25 1.3
11 1 24 21 25 1.3
12 2 1 1 7 2.1
13 2 5 1 7 2.1
14 2 7 1 7 2.1
15 2 8 8 16 2.2
16 2 12 8 16 2.2
17 2 15 8 16 2.2
18 2 18 18 22 2.3
19 2 21 18 22 2.3
20 2 22 18 22 2.3
So it looks like we did alright.
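As a cross-check, the same offset idea can be sketched with findInterval() in place of approxfun(). This is only an illustration under the same assumptions as above (segments sorted by chromosome and start, non-overlapping, with no gaps that contain positions), and it indexes offset by chromosome name rather than by factor code:

```r
# Data rebuilt from the question
df1 <- data.frame(
  chromosome = c(rep(1, 11), rep(2, 9)),
  position   = c(1, 2, 4, 5, 7, 12, 13, 15, 21, 23, 24,
                 1, 5, 7, 8, 12, 15, 18, 21, 22)
)
df2 <- data.frame(
  chromosome     = c(1, 1, 1, 2, 2, 2),
  segment_start  = c(1, 6, 21, 1, 8, 18),
  segment_end    = c(5, 20, 25, 7, 16, 22),
  segment.number = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
)

# Offset of each chromosome in one combined coordinate system
ends <- tapply(df2$segment_end, df2$chromosome, max)
offset <- head(c(0, cumsum(ends)), -1)
names(offset) <- names(ends)

# Segment starts in the combined system (must be increasing)
starts <- offset[as.character(df2$chromosome)] + df2$segment_start

# findInterval() returns, for each position, the index of the last start <= position
segid <- findInterval(offset[as.character(df1$chromosome)] + df1$position, starts)
df1$segment <- df2$segment.number[segid]
df1
```

Like the approxfun() version, this never checks segment_end, so a position that falls in a gap between segments would be assigned to the preceding segment.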
