This question already has an answer here:
The rules of subsetting
(1 answer)
Closed 8 years ago.
Good day.
I have a data set that I read from a txt file:
> MyData
Xdat Ydat
1 1 12
2 2 23
3 3 34
4 4 45
5 5 56
6 6 67
7 7 78
I need to extract the rows where the second column (Ydat) is greater than 40.
Resulting in
MyData2
Xdat Ydat
4 4 45
5 5 56
6 6 67
7 7 78
Simple subsetting will do it:
MyData[which(MyData[,2]>40),]
As @DavidArenburg points out, this also works fine:
MyData[(MyData[,2]>40),]
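One behavioral difference between the two forms is worth knowing, even though this particular data has no missing values: which() silently drops NA comparisons, while plain logical indexing propagates them as NA rows. A minimal sketch with hypothetical data (not the original MyData):

```r
# Hypothetical data with an NA in Ydat, to show the difference:
MyData <- data.frame(Xdat = 1:4, Ydat = c(12, NA, 45, 56))

# which() drops the NA position, returning only rows that are definitely TRUE:
MyData[which(MyData[, 2] > 40), ]    # rows 3 and 4

# Plain logical indexing keeps the NA comparison as an all-NA row:
MyData[MyData[, 2] > 40, ]           # an NA row, then rows 3 and 4
```

With clean data the two are interchangeable; with NAs present, which() is usually what you want.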
This question already has answers here:
How to deal with nonstandard column names (white space, punctuation, starts with numbers)
(3 answers)
Remove rows in R matrix where all data is NA [duplicate]
(2 answers)
Closed 1 year ago.
The data is like
example<-matrix(NA,40,7)
colnames(example)=c("1month","2month","3month","4month","5month","6month","7month")
example[,1]<-rep(c(1,3,6,2,4,98,5,3,NA),len=40)
example[,2]<-rep(c(2,7,NA,8,2,NA,3,NA),len=40)
example[,3]<-rep(c(5,3,2,NA),len=40)
example[,4]<-rep(c(NA,91,98,52,35,NA),len=40)
example[,5]<-rep(c(3,NA),len=40)
example[,6]<-rep(c(98,NA,NA,123),len=40)
example[,7]<-rep(c(3,51,NA,NA,4,NA,5,NA),len=40)
example<-as.data.frame(example)
I want to remove the NA values in each column. I can do it with the drop_na() function, but !is.na() doesn't work.
example %>% select('1month') %>% drop_na('1month')        # this works
example %>% select('1month') %>% filter(!is.na('1month')) # this doesn't work; its result is shown below
I wonder why this doesn't work, and whether there is any way to use != or !is.na() here.
Thank you for your help. Sincerely.
1month
1 1
2 3
3 6
4 2
5 4
6 98
7 5
8 3
9 NA
10 1
11 3
12 6
13 2
14 4
15 98
16 5
17 3
18 NA
19 1
20 3
21 6
22 2
23 4
24 98
25 5
26 3
27 NA
28 1
29 3
30 6
31 2
32 4
33 98
34 5
35 3
36 NA
37 1
38 3
39 6
40 2
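The reason is that '1month' in quotes is a character string, not a column reference: is.na("1month") is a single FALSE, so !is.na('1month') is one TRUE that filter() recycles to every row, and nothing is dropped. drop_na('1month') works because it uses tidyselect, where a string does select a column. To make filter() see the nonstandard column name, use backticks or the .data pronoun. A sketch rebuilding just the first column from the question's data:

```r
library(dplyr)

# Rebuild the first column from the question (check.names = FALSE keeps "1month"):
example <- data.frame(`1month` = rep(c(1, 3, 6, 2, 4, 98, 5, 3, NA), len = 40),
                      check.names = FALSE)

# Backticks refer to the column itself, not a string literal:
kept  <- example %>% filter(!is.na(`1month`))

# The .data pronoun accepts the name as a string:
kept2 <- example %>% filter(!is.na(.data[["1month"]]))

nrow(kept)   # 36 -- the four NA rows are gone
```

The same backtick trick makes comparisons like `1month` != 98 work as well.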
This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 3 years ago.
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I'm given after subsetting does not show the correct row numbers. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are actually for subject 2.
I've tried the usual methods, such as
condition <- df == 4
df[condition]
How can I subset the data so that I get back a data set showing the correct row numbers for subject 4?
You can also use the subset function:
subset(df, V1 == 4)
I've managed to find a solution since posting:
newdf <- subset(df, V1 == 4)
However, I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
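Another base R option, shown as a sketch with data reconstructed from the question (assumed, not the original file): the problem with df == 4 is that it produces a logical matrix, and df[condition] with a matrix index returns a bare vector of values, discarding the row names. Subsetting the rows with a logical vector on the column keeps them:

```r
# Reconstruction of the question's data (assumed shape):
df <- data.frame(V1 = c(rep(2, 15), rep(4, 9)))

# Row-wise logical subsetting preserves the original row names 16..24;
# drop = FALSE keeps the one-column result as a data frame:
df[df$V1 == 4, , drop = FALSE]
```

subset(df, V1 == 4) preserves row names for the same reason: both subset rows rather than individual cells.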
This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
Suppose I have a data frame with two variables on which I'm trying to run some basic summary stats. I would like to run a loop that gives me the difference between the minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values of 'number', so subsetting and running each group individually is not a realistic option. The data looks like this:
df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))
number seconds
1 1 1
2 1 4
3 1 8
4 2 1
5 2 5
6 2 11
7 2 23
8 3 1
9 3 8
10 4 1
11 4 9
12 4 11
13 4 24
14 4 44
15 4 112
16 5 1
17 5 34
18 5 55
19 5 109
My current code only returns the difference between the minimum and maximum seconds for the entire data frame:
ZZ <- unique(df$number)
for (i in ZZ){
Y <- max(df$seconds) - min(df$seconds)
}
Since you have a lot of data, performance should matter, and you should use a data.table instead of a data.frame:
library(data.table)
dt <- as.data.table(df)
dt[, .(spread = (max(seconds) - min(seconds))), by=.(number)]
number spread
1: 1 7
2: 2 22
3: 3 7
4: 4 111
5: 5 108
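If you'd rather stay in base R, the same per-group spread can be computed without an explicit loop using aggregate or tapply; a sketch with the df from the question:

```r
df <- data.frame(number  = c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
                 seconds = c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))

# aggregate applies the function within each group and returns a data frame:
aggregate(seconds ~ number, data = df, FUN = function(x) max(x) - min(x))

# tapply returns a named vector keyed by group:
spread <- tapply(df$seconds, df$number, function(x) max(x) - min(x))
spread
#   1   2   3   4   5
#   7  22   7 111 108
```

For very large data the data.table version above will be faster, but these avoid the extra dependency.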
This question already has answers here:
Order data frame by two columns in R
(2 answers)
Closed 8 years ago.
I would like to order all lines based on two column values in R. This is my input:
chr start no
4 85 non1
4 23 non2
6 10 non2
8 25 non2
22 56 non4
2 15 non1
This is my expected output:
chr start no
2 15 non1
4 23 non2
4 85 non1
6 10 non2
8 25 non2
22 56 non4
Thank You. Cheers.
The order function accepts a variable number of input vectors, ordering by the first, then the second, and so on:
BED=read.table(text=
"chr start no
4 85 non1
4 23 non2
6 10 non2
8 25 non2
22 56 non4
2 15 non1", header=T)
BED[order(BED$chr, BED$start),]
chr start no
6 2 15 non1
2 4 23 non2
1 4 85 non1
3 6 10 non2
4 8 25 non2
5 22 56 non4
While you can certainly use order from the base package, for working with data frames I'd highly recommend using the plyr package.
chr <- c(4,4,6,8,22,2)
start <- c(85, 23, 10, 25, 56, 15)
no <- c("non1", "non2", "non2", "non2", "non4", "non1")
myframe <- data.frame(chr, start, no)
creates your data frame. Here chr is already numeric; if it were read in as a character column, you would convert it first with:
myframe$chr <- as.numeric(myframe$chr)
(A factor column would need as.numeric(as.character(myframe$chr)) instead, since as.numeric on a factor returns the level codes.)
and then getting the arranged version is very easy:
library(plyr)
arrangedFrame <- arrange(myframe, chr, start)
print(arrangedFrame)
chr start no
1 2 15 non1
2 4 23 non2
3 4 85 non1
4 6 10 non2
5 8 25 non2
6 22 56 non4
There are also many easily modified options in arrange that make different reorderings simpler than using order. And while I haven't used it much yet, I know Hadley released dplyr not long ago, which offers even more functionality and which I'd encourage you to check out.
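For reference, a dplyr sketch of the same sort, rebuilding the frame from the question's data; dplyr's arrange works like plyr's, and desc() flips the direction of a single column:

```r
library(dplyr)

BED <- data.frame(chr   = c(4, 4, 6, 8, 22, 2),
                  start = c(85, 23, 10, 25, 56, 15),
                  no    = c("non1", "non2", "non2", "non2", "non4", "non1"))

# Ascending by chr, then by start within each chr:
arrange(BED, chr, start)

# Descending chr, ascending start:
arrange(BED, desc(chr), start)
```

Note that arrange (in both plyr and dplyr) resets the row names, whereas BED[order(...), ] keeps the original ones; pick whichever behavior you need.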