Subsetting data frames in R

I'm new to R and learning about subsetting. I have a table and I'm trying to get the size of a subset of the table. My issue is that when I try two different ways I get two different answers. For a table "dat" where I'm trying to select all rows where RMS is 5 and BDS is 2:
dim(dat[(dat$RMS==5) & (dat$BDS==2),])
gives me a different answer than
dim(subset(dat,(dat$RMS==5) & (dat$BDS==2)))
The second one is correct. Could someone explain why these are different and why the first one gives me the wrong answer?
Thanks

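The most likely reason is that the two methods treat NA values differently. With [, an NA in the logical index produces a row filled with NA, whereas subset() treats NA as FALSE and drops the row. If you remove the rows containing NA from the data frame, you should get the same result from both: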
dat_clean = na.omit(dat)
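To see the difference concretely, here is a minimal sketch using a small made-up data frame (the values are an assumption for illustration, not the asker's data). Wrapping the condition in which() is one way to make [ behave like subset(), because which() keeps only the positions that are TRUE.

```r
# Hypothetical data with a few missing values
dat <- data.frame(RMS = c(5, 5, NA, 3), BDS = c(2, NA, 2, 2))

keep <- (dat$RMS == 5) & (dat$BDS == 2)   # TRUE NA NA FALSE

dim(dat[keep, ])                          # 3 2: each NA in the index adds an all-NA row
dim(subset(dat, RMS == 5 & BDS == 2))     # 1 2: subset() treats NA as FALSE
dim(dat[which(keep), ])                   # 1 2: which() drops the NA positions
```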

Works for me.....
> x = c(1,1,2,2,3,3)
> y = c(4,4,5,5,6,6)
>
> X = data.frame(x,y)
>
> dim(X[X$x==1 & X$y==4,])
[1] 2 2
>
> (X[X$x==1 & X$y==4,])
x y
1 1 4
2 1 4
> dim(subset(X,(X$x==1) & (X$y==4)))
[1] 2 2
> subset(X,(X$x==1) & (X$y==4))
x y
1 1 4
2 1 4

Related

R combine different variables of different length to one to generate a control group

I am using a bootstrap to create a new sample that serves as a control group, controlling for age and gender. Each variable has a different length, between two and eight values.
control_40_50_HRV_female2 <- abs(parametric_bootstrap_boot2(control_40_50_HRV_female_SDNN))
control_50_60_HRV_male2 <- abs(parametric_bootstrap_boot2(control_50_60_HRV_male_SDNN))
control_50_60_HRV_female2 <-abs(parametric_bootstrap_boot2(control_50_60_HRV_female_SDNN))
control_60_70_HRV_male2 <- abs(parametric_bootstrap_boot2(control_60_70_HRV_male_SDNN))
control_60_70_HRV_female2 <-abs(parametric_bootstrap_boot2(control_60_70_HRV_female_SDNN))
control_70_80_HRV_male2 <-abs(parametric_bootstrap_boot2(control_70_80_HRV_male_SDNN))
How can I put them all in one group (one variable) so I can start doing a t-test? I hope this is clear.
This is sample output for the variables:
> control_40_50_HRV_female2
[1] 29.08388 102.49869
> control_50_60_HRV_male2
[1] 36.81686 127.47986 13.40681
> control_50_60_HRV_female2
[1] 25.50313
> control_60_70_HRV_male2
[1] 39.93050 140.75967 13.89545 316.45988 158.91477
> control_60_70_HRV_female2
[1] 26.27908 106.40483
when I run this command
out <- stack(mget(ls(pattern = '^control_\\d{2}_\\d{2}_\\w+_')))[2:1]
dim(out)
I get this, which is a list of all the variables I have created since the beginning of the script:
> dim(out)
[1] 96683 2
> head(out)
ind values
1 control_40_50_HRV_female_BPM 63.48
2 control_40_50_HRV_female_BPM 52.67
3 control_40_50_HRV_female_BPM 88.92
4 control_40_50_HRV_female_BPM 69.04
5 control_40_50_HRV_female_BPM 53.46
6 control_40_50_HRV_female_BPM 63.64
whereas I only need these variables:
control_40_50_HRV_female2, control_50_60_HRV_male2 .... control_70_80_HRV_male2
We can use mget to get the values of the 'control' objects into a list and then stack it to a two column data.frame
out <- stack(mget(ls(pattern = '^control_\\d{2}_\\d{2}_\\w+_')))[2:1]
head(out)
# ind values
#1 control_40_50_HRV_female2 29.08388
#2 control_40_50_HRV_female2 102.49869
#3 control_50_60_HRV_female2 25.50313
#4 control_50_60_HRV_male2 36.81686
#5 control_50_60_HRV_male2 127.47986
#6 control_50_60_HRV_male2 13.40681
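If other objects in the workspace also start with control_ (such as the ..._BPM vectors shown in the question), the pattern above will pick them up as well. One possible way around that, assuming the bootstrapped objects are exactly the ones whose names end in male2 or female2, is to anchor the pattern at the end of the name. The sketch below uses a few of the values from the question plus one extra object that should be excluded.

```r
# A few of the objects from the question
control_40_50_HRV_female2 <- c(29.08388, 102.49869)
control_50_60_HRV_male2   <- c(36.81686, 127.47986, 13.40681)
# An unrelated object that the looser pattern would also match
control_40_50_HRV_female_BPM <- c(63.48, 52.67)

# Anchor at both ends so only names ending in male2/female2 match
nms <- ls(pattern = "^control_\\d{2}_\\d{2}_HRV_(male|female)2$")
out <- stack(mget(nms))[2:1]
out
```

The resulting data frame has one row per value, with ind identifying the source object, so out$values can be passed on to t.test().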

How to compare two variable columns with each other in R?

I'm new to R and need help! I have many variables including Response and RightResponse.
I need to compare those two columns, and create a new column that can show whether there is a match or miss between each of the value pairs.
Thanks.
Perhaps something like this?
library(magrittr)
library(dplyr)
> res <- data.frame(Response=c(1,4,4,3,3,6,3),RightResponse=c(1,2,4,3,3,6,5))
> res <- res %>% mutate("CorrectOrNot" = ifelse(Response == RightResponse, "Correct","Incorrect"))
> res
  Response RightResponse CorrectOrNot
1        1             1      Correct
2        4             2    Incorrect
3        4             4      Correct
4        3             3      Correct
5        3             3      Correct
6        6             6      Correct
7        3             5    Incorrect
Basically the mutate function has created a new column containing the results of a comparison between Response and RightResponse.
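For comparison, the same column can be created without any packages. This is only a base R sketch of the same comparison, using the example data from this answer.

```r
res <- data.frame(Response      = c(1, 4, 4, 3, 3, 6, 3),
                  RightResponse = c(1, 2, 4, 3, 3, 6, 5))

# Element-wise comparison of the two columns, mapped to labels
res$CorrectOrNot <- ifelse(res$Response == res$RightResponse,
                           "Correct", "Incorrect")
res
```

If either column can contain NA, the comparison yields NA for that row, so the new column will be NA there as well.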
Hope this helps!

How to find which interval/range a variable falls under in R

I have a data frame
> data.frame(Col1=seq(0,24,by=4),x=rnorm(7),y=rnorm(7,50))
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
3 8 -1.884417429 49.80308
4 12 1.692774467 50.45827
5 16 -0.907602775 51.14937
6 20 0.166186536 49.17502
7 24 0.420263825 49.56720
and a variable
t=2
and want to find the subset of the data under which it falls (rows 1 and 2 in this example), and then calculate the ratio in variables x and y, ie
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
then obtain, based on value t, (t-0)/(4-0), and then use that ratio to calculate the position in x and y
I found a find function in MATLAB (Find which interval a point B is located in Matlab) and wonder if there is a similar function in R.
Specifically, is there a way to determine which interval a variable falls under? And once I find that interval, a way to extract the subset of data?
I can only think of %in% operator currently,
> t %in% df$Col1
[1] FALSE
For more clarity, I have tried
> z=NULL
> for(i in 1:(nrow(df)-1)){
+ z[[i]]=df$Col1[i]:df$Col1[i+1]
+ }
> w=NULL
> for(i in 1:length(z)){
+ w=c(w,t %in% z[[i]])
+ }
> v=which(w==1)
> df[v:(v+1),]
Col1 x y
1 0 1.076101 50.17514
2 4 1.971503 47.81647
>
and now hope there may be a more concise answer, as my real data is >1M rows.
Try using the code below and see whether it will give you the expected results:
dataframe=data.frame(Col1=seq(0,24,by=4),x=rnorm(7),y=rnorm(7,50))
funfun=function(x){v=findInterval(x,dataframe$Col1);c(v,v+1)}
dataframe[funfun(2),]
Col1 x y
1 0 0.831266 50.28246
2 4 1.751892 48.78810
dataframe[funfun(10),]
Col1 x y
3 8 0.2624929 48.33945
4 12 -0.2243066 51.11304
If this helps please let us know. thank you
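The question also asks for the ratio (t-0)/(4-0) and the corresponding position in x and y. A rough sketch of that step, building on findInterval, might look like the following. It uses the values printed in the question; the name t_val is an assumption chosen to avoid masking the base function t(), and the code assumes the value lies strictly below the last breakpoint (otherwise i + 1 runs past the end of the data frame).

```r
df <- data.frame(
  Col1 = seq(0, 24, by = 4),
  x = c(-0.107046196, -0.001515573, -1.884417429, 1.692774467,
        -0.907602775, 0.166186536, 0.420263825),
  y = c(49.96748, 50.02819, 49.80308, 50.45827,
        51.14937, 49.17502, 49.56720)
)
t_val <- 2

i <- findInterval(t_val, df$Col1)     # row of the lower bound of the interval
ratio <- (t_val - df$Col1[i]) / (df$Col1[i + 1] - df$Col1[i])

# Linear interpolation between the two bounding rows
x_t <- df$x[i] + ratio * (df$x[i + 1] - df$x[i])
y_t <- df$y[i] + ratio * (df$y[i + 1] - df$y[i])
c(ratio = ratio, x = x_t, y = y_t)
```

findInterval is vectorised and uses a binary search on the sorted breakpoints, so the same lines work when t_val is a long vector. For plain linear interpolation, approx(df$Col1, df$x, xout = t_val)$y should give the same x value.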

Edit column under condition

I have a table:
id <- c(1,1,2,2,2,2,2,3,3,4,4,5,5,5)
dist <- c(0,1,1,0,2,15,0,4,4,0,5,5,16,2)
data <- data.frame(id, dist )
I would like to edit the column id when dist is greater than a certain value (let's say 10). I want to add 1 to id, from that row onward, whenever data$dist > 10.
The final output would be:
data$id_new <- c(1,1,2,2,2,3,3,4,4,5,5,6,7,7)
Is it possible to do this with an if statement inside a loop? I tried a loop but have not succeeded yet.
Maybe using cumsum:
data$new_id <- data$id + cumsum(data$dist > 10)
Explanation:
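data$dist > 10 is a logical vector, and cumsum() counts TRUE as 1 and FALSE as 0. So cumsum(data$dist > 10) gives, for each row, a running count of how many values of dist have exceeded 10 so far. Adding that count to id shifts every id from that row onward. You can see how this works by taking the expression apart in R and checking each piece.
We can also use duplicated together with the > comparison, incrementing at each new id or whenever dist exceeds 10: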
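Taking the expression apart on the example data might look like this (a sketch; the intermediate names over and bump are only for illustration):

```r
id   <- c(1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5)
dist <- c(0, 1, 1, 0, 2, 15, 0, 4, 4, 0, 5, 5, 16, 2)
data <- data.frame(id, dist)

over <- data$dist > 10   # TRUE only at rows 6 and 13
bump <- cumsum(over)     # 0 up to row 5, 1 from row 6, 2 from row 13
data$new_id <- data$id + bump
data$new_id
# [1] 1 1 2 2 2 3 3 4 4 5 5 6 7 7
```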
with(data, cumsum(dist > 10| !duplicated(id)))
#[1] 1 1 2 2 2 3 3 4 4 5 5 6 7 7

Subsetting data conditional on first instance in R

data:
row A B
1 1 1
2 1 1
3 1 2
4 1 3
5 1 1
6 1 2
7 1 3
Hi all! What I'm trying to do (example above) is to sum those values in column A, but only when column B = 1 (so starting with a simple subset line - below).
sum(data$A[data$B==1])
However, I only want to do this the first time that condition occurs until the values switch. If that condition re-occurs later in the column (row 5 in the example), I'm not interested in it!
I'd really appreciate your help in this (I suspect simple) problem!
Using data.table for syntax elegance, you can use rle to get this done
library(data.table)
DT <- data.table(data)
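DT[ , B1 := {
  bb <- rle(B == 1)          # runs of TRUE/FALSE for B == 1
  r <- bb$values
  r[r] <- seq_len(sum(r))    # number the TRUE runs 1, 2, ...; FALSE runs become 0
  bb$values <- r
  inverse.rle(bb)            # expand back to one value per row
} ]
DT[B1 == 1, sum(A)]          # the column is A; names are case-sensitive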
# [1] 2
Here's a rather elaborate way of doing that:
data$counter = cumsum(data$B == 1)
sum(data$A[(data$counter >= 1:nrow(data) - sum(data$counter == 0)) &
(data$counter != 0)])
Another way:
idx <- which(data$B == 1)
sum(data$A[idx[idx == (seq_along(idx) + idx[1] - 1)]])
# [1] 2
# or alternatively
sum(data$A[idx[idx == seq(idx[1], length.out = length(idx))]])
# [1] 2
The idea: first get all indices where B is 1. For the data above that is c(1,2,5). Starting from the first index (here 1), build a sequence of consecutive numbers of the same length, c(1,2,3), and compare it with the indices element by element. The two agree only while the indices are consecutive. Once there is a gap, every later position mismatches as well. So the positions where they match are exactly the first run of consecutive rows, which is what you want.
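The same first-run logic can also be written in base R with rle, without data.table. This is only a sketch, assuming the data frame from the question and that at least one row has B equal to 1 (otherwise first_true is NA and the indexing fails).

```r
data <- data.frame(A = rep(1, 7), B = c(1, 1, 2, 3, 1, 2, 3))

runs <- rle(data$B == 1)                 # run lengths of TRUE/FALSE
first_true <- which(runs$values)[1]      # which run is the first TRUE run
end   <- cumsum(runs$lengths)[first_true]
start <- end - runs$lengths[first_true] + 1

total <- sum(data$A[start:end])          # sum A over the first run only
total
# [1] 2
```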
