How to subset in RStudio properly?

I have created the following data frame:
age <- c(21,35,829,2)
sex <- c("m","f","m","c")
height <- c(181,173,171,166)
weight <- c(69,58,75,60)
dat <- as.data.frame(cbind(age,sex,height,weight), stringsAsFactors = FALSE)
dat$age <- as.numeric(age)
dat
I now want to select only the rows of students who are older than 20 or younger than 80.
Why does this work : dat[dat$age<20| dat$age>80,] ; subset(dat, age < 20 | age > 80)
But this does not: dat[dat$age>20| dat$age<80,] ; subset(dat, age > 20 | age < 80)
I can select the rows that are NOT inside the interval from 20 to 80, but not those that actually are inside it.
What is the mistake?
Thanks in advance.

Because your condition allows practically every possible age. You are combining the two conditions with the | (OR) operator, so every row that satisfies at least one of them is selected by your filter. Every age in your data.frame is either greater than 20 or, if it is not, certainly less than 80.
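As a quick check of this reasoning, the sketch below evaluates the OR condition on the ages from the question. The vector name age is taken from the question; the object name or_condition is only illustrative.

```r
# Ages from the question's data frame
age <- c(21, 35, 829, 2)

# Each element only needs to pass ONE of the two tests
or_condition <- age > 20 | age < 80

# 829 passes "> 20" and 2 passes "< 80", so no row is filtered out
print(or_condition)
print(all(or_condition))
```

Since the condition is TRUE for every element, indexing with it returns the whole data frame.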
If you want to select every row with an age between 20 and 80, change the logical operator to & (AND), so that both conditions must hold at the same time:
dat[dat$age>20 & dat$age<80,]
subset(dat, age > 20 & age < 80)
This results in:
age sex height weight
1 21 m 181 69
2 35 f 173 58
Now, if you want to select all the rows that are outside this interval, you can negate the condition with the ! operator, as suggested by @r2evans in the comments. It would look something like this:
dat[!(dat$age > 20 & dat$age < 80),]
subset(dat, !(age > 20 & age < 80))
This results in:
age sex height weight
3 829 m 171 75
4 2 c 166 60
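The negated condition can also be written without !, by flipping each comparison and swapping & for | (De Morgan's law). The sketch below, using the ages from the question and illustrative object names, checks that both forms select the same rows.

```r
age <- c(21, 35, 829, 2)

# Negation of "inside the open interval (20, 80)"
outside_negated <- !(age > 20 & age < 80)

# Equivalent form: flip each comparison and replace & with |
outside_direct <- age <= 20 | age >= 80

print(outside_negated)
print(identical(outside_negated, outside_direct))
```

Note that the flipped comparisons are <= and >=, so the boundary values 20 and 80 count as outside the interval in both forms.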

Why not use dplyr filter?
library(dplyr)
df_age <- dat %>%
dplyr::filter(age > 20
, age < 80)

Related

R subset dataframe, skip part of interval

I'm trying to subset my total data (including all the other variables) to an interval of zipcodes EXCLUDING a certain part of that interval. I'm quite new to R and can't get it to work. (Zipcode = postnr)
I have over 100 000 zipcodes (postnr) and want all values for individuals in zipcodes 10 000 - 12 999 and 15 600 - 16 800 in my dataset.
Attempt 1
Datan <- subset(Data2, Data2$postnr >= 10000 & Data2$postnr <= 16880)
Datant <- subset(Datan, Datan$postnr >= 15600 & Datan$postnr < 13000)
Datan returns 31 3000 obs. of 26 variables and Datant returns 0 obs. of 26 variables.
Attempt 2
attach(Data2)
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) & between(postnr, 15600, 16880))
Data5 returns 0 observations.
I have thousands of values for all my variables inside those intervals. What am I doing wrong?
Think about AND versus OR and you have it. As it is, you're really close!
Can a number be between 1 and 2 and 3 and 5? Nope. But if I said, can a number be between 1 and 2 or 3 and 5? Yup.
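The same point can be checked on a few made-up zipcodes. The postnr values below are hypothetical test values, and the interval bounds are the ones used in the question's second attempt.

```r
# Hypothetical zipcodes around the interval boundaries
postnr <- c(9999, 10000, 12999, 13000, 15600, 16880, 17000)

in_first  <- postnr >= 10000 & postnr <= 12999
in_second <- postnr >= 15600 & postnr <= 16880

# AND: no zipcode can lie in two disjoint intervals at once
n_both <- sum(in_first & in_second)

# OR: a zipcode only has to lie in one of them
either <- postnr[in_first | in_second]

print(n_both)
print(either)
```

Combining the two interval tests with & can never match anything, which is why both attempts in the question returned 0 observations.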
Updated
For subset:
Datan <- subset(Data2, (postnr >= 10000 & postnr < 13000) |
                       (postnr >= 15600 & postnr <= 16800))
Where the vertical pipe | means 'or'.
For dplyr:
(I assume it's dplyr with filter.) You don't need to attach the data; filter() looks up the variable names in Data2 because it is passed in through the pipe.
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) |
between(postnr, 15600, 16880))
I have no data, so I can not properly test this, but the following should work.
Note the or operator (|) to specify two different conditions.
library(data.table)
dt <- as.data.table(Data2)
dt[(postnr >= 10000 & postnr < 13000) | (postnr >= 15600 & postnr <= 16880), ]

How do I subset a data frame based on the values in another data frame?

I have a dataframe where the columns represent patients of various ages, and another dataframe with the values of those ages. I want to subset the data such that patients only below the age of 50 are displayed
> dat
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at 179.86300 106.495000 265.58600 301.24300 218.50900 224.61000
31308_at 559.07800 411.483000 481.17600 570.73300 333.53900 370.07900
31309_r_at 20.76970 30.641500 50.21530 42.68920 27.10590 21.57620
31310_at 154.19100 224.446000 188.82300 177.86300 233.46300 120.90800
31311_at 956.79700 648.310000 933.65600 1016.41000 762.01300 1040.29000
And the annotation file with the ages of the patients
> ann
Gender Age
GSM27015 M 26
GSM27016 M 26
GSM27018 M 29
GSM27021 M 37
GSM27023 M 40
GSM27024 M 42
GSM27025 M 45
GSM27027 M 52
GSM27028 M 53
Here's something else to consider.
You could transpose your data, so that patients are rows and not columns. As it looks like you have age and gender in your column names, you can also make these additional columns as well.
dat_new <- cbind(do.call(rbind, strsplit(colnames(dat), '\\.')), as.data.frame(t(dat)))
colnames(dat_new)[1:3] <- c("id", "age", "gender")
rownames(dat_new) <- NULL
This is what it would look like:
id age gender 31307_at 31308_at 31309_r_at 31310_at 31311_at
1 GSM27015 26 M 179.863 559.078 20.7697 154.191 956.797
2 GSM27016 26 M 106.495 411.483 30.6415 224.446 648.310
3 GSM27018 29 M 265.586 481.176 50.2153 188.823 933.656
4 GSM27021 37 M 301.243 570.733 42.6892 177.863 1016.410
5 GSM27023 40 M 218.509 333.539 27.1059 233.463 762.013
6 GSM27024 42 M 224.610 370.079 21.5762 120.908 1040.290
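Then, if you wish to subset based on age (e.g., <= 50 years), convert age to numeric first (it is text after strsplit) and then filter:
dat_new$age <- as.numeric(as.character(dat_new$age))
dat_new[dat_new$age <= 50, ]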
Perhaps try
dat[as.numeric(gsub(".*?\\.(\\d+)\\..*","\\1",names(dat)))<50]
Does this work:
> library(dplyr)
> data
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M GSM27024.52.M
31307_at 179.8630 106.4950 265.5860 301.2430 218.5090 224.6100 331.230
31308_at 559.0780 411.4830 481.1760 570.7330 333.5390 370.0790 370.079
31309_r_at 20.7697 30.6415 50.2153 42.6892 27.1059 21.5762 98998.000
31310_at 154.1910 224.4460 188.8230 177.8630 233.4630 120.9080 120.908
31311_at 956.7970 648.3100 933.6560 1016.4100 762.0130 1040.2900 1000.290
> data %>% select_if(as.numeric(gsub('GSM\\d{5}\\.(\\d{2})..','\\1',names(data))) < 50)
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at 179.8630 106.4950 265.5860 301.2430 218.5090 224.6100
31308_at 559.0780 411.4830 481.1760 570.7330 333.5390 370.0790
31309_r_at 20.7697 30.6415 50.2153 42.6892 27.1059 21.5762
31310_at 154.1910 224.4460 188.8230 177.8630 233.4630 120.9080
31311_at 956.7970 648.3100 933.6560 1016.4100 762.0130 1040.2900
>
So I added one more column to your data "GSM27024.52.M" and in the select output, it wasn't selected.
An option with parse_number
library(stringr)
dat[readr::parse_number(str_remove(names(dat), "^[^.]+\\.")) < 50]
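The answers above all read the age out of the column names. A base R sketch that instead looks the ages up in the annotation data frame is given below. It assumes that the row names of ann are the sample IDs, as in the printout in the question, and that the ID is the part of each column name before the first dot. The small dat and ann objects are stand-ins, and the column GSM27027.52.M is hypothetical (only its ID and age appear in ann).

```r
# Stand-in expression data: two probes, three samples (third column is hypothetical)
dat <- data.frame(GSM27015.26.M = c(179.863, 559.078),
                  GSM27024.42.M = c(224.610, 370.079),
                  GSM27027.52.M = c(331.230, 370.079),
                  row.names = c("31307_at", "31308_at"))

# Stand-in annotation table, sample IDs as row names
ann <- data.frame(Gender = c("M", "M", "M"),
                  Age = c(26, 42, 52),
                  row.names = c("GSM27015", "GSM27024", "GSM27027"))

# Sample ID = everything before the first dot in the column name
ids <- sub("\\..*$", "", names(dat))

# Look up each column's age in ann; samples missing from ann give NA
ages <- ann$Age[match(ids, rownames(ann))]

# Keep columns with a known age below 50
keep <- !is.na(ages) & ages < 50
dat_under_50 <- dat[, keep, drop = FALSE]

print(names(dat_under_50))
```

This keeps the annotation table as the single source of the ages, so it still works if the column names do not contain the age.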

Matching intervals with values in another table in R

Currently, I have one df and a price table.
df:                     price table:
Order Number   wgt      wgt_intvl   price
------------   ---      ---------   -----
01             22       0-15        50
02             5        15-25       75
03             35       25-50       135
What I'd like is to match the weight from the df into an interval of the price table in R. For example, the first order (Order Number 01) corresponds with a price of 75. Therefore, I want to add a column in the first df, say df$cost that corresponds with the appropriate price according to wgt_intvl in the price table.
The way I see to do it is with an if-else statement, but this is highly inefficient and I was wondering if there is a better way to do it. In reality these tables are much longer - there is no logical "buildup" in price or weight interval. I have 15 weight intervals in this table. My current solution would look like this:
If(wgt < 15){
df$cost <- 50
} else if (wgt > 15 & wgt < 25){
df$cost <- 75
} else if(wgt > 25 & wgt < 50){
df$cost <- 135
}
This times fifteen, using the corresponding prices of the price table. I'd love a more efficient solution. Thanks in advance!
Using the data shown reproducibly in the Note at the end, form the vector of cutpoints (i.e. the first number in each interval) and then use findInterval to find the interval corresponding to the weight.
cutpoints <- as.numeric(sub("-.*", "", dfprice$wgt_intvl))
transform(dfmain, price = dfprice$price[findInterval(wgt, cutpoints)])
giving:
Order wgt price
1 01 22 75
2 02 5 50
3 03 35 135
4 04 25 135
Note
dfmain <- data.frame(Order = c("01", "02", "03", "04"), wgt = c(22, 5, 35, 25),
stringsAsFactors = FALSE)
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
price = c(50, 75, 135), stringsAsFactors = FALSE)
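One thing to keep in mind with this approach: findInterval only looks at the lower cutpoints, so a weight beyond the last interval (say 60) would still be assigned the last price. A minimal sketch that also checks the upper bound is shown below. The helper lookup_price and the decision to return NA for weights outside the table are assumptions, not part of the answer above.

```r
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
                      price = c(50, 75, 135), stringsAsFactors = FALSE)

# Split "lower-upper" strings into numeric bounds
bounds <- do.call(rbind, strsplit(dfprice$wgt_intvl, "-"))
lower <- as.numeric(bounds[, 1])
upper <- as.numeric(bounds[, 2])

# Hypothetical helper: price for each weight, NA when outside the table.
# Intervals are treated as left-closed, i.e. [lower, upper).
lookup_price <- function(wgt) {
  idx <- findInterval(wgt, lower)          # 0 when below the first cutpoint
  idx[idx == 0 | wgt >= max(upper)] <- NA  # outside the price table
  dfprice$price[idx]
}

prices <- lookup_price(c(22, 5, 35, 25, 60, -1))
print(prices)
```

With these assumptions a weight of exactly 25 falls into the 25-50 interval, matching the output shown above, while 60 and -1 get NA.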
Instead of an if-statement you could use a vectorised case_when operation (conditions are checked in order, and >= avoids a gap at exactly 15):
library(dplyr)
df %>%
  mutate(cost = case_when(
    wgt < 15 ~ 50,
    wgt >= 15 & wgt < 25 ~ 75,
    TRUE ~ 135))
Alternatively you could use cut() to transform wgt to wgt_intvl and match via left_join().
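A base R sketch of that idea follows, using match() in place of left_join() so that no extra package is needed. The breaks vector is typed by hand from the wgt_intvl strings, and the intervals are assumed to be left-closed.

```r
dfmain <- data.frame(Order = c("01", "02", "03", "04"), wgt = c(22, 5, 35, 25),
                     stringsAsFactors = FALSE)
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
                      price = c(50, 75, 135), stringsAsFactors = FALSE)

# Assumption: breaks copied by hand from the interval labels
breaks <- c(0, 15, 25, 50)

# Label each weight with its interval, [lower, upper)
dfmain$wgt_intvl <- as.character(
  cut(dfmain$wgt, breaks = breaks, labels = dfprice$wgt_intvl, right = FALSE)
)

# Look up the price for each label (the join step)
dfmain$price <- dfprice$price[match(dfmain$wgt_intvl, dfprice$wgt_intvl)]

print(dfmain)
```

Weights outside all intervals get NA from cut() and therefore NA as price, which makes gaps in the price table easy to spot.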

Create group id column using dplyr if numbers between certain values [duplicate]

I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:
data$agegrp(data$age >= 40 & data$age <= 49) <- 3
data$agegrp(data$age >= 30 & data$age <= 39) <- 2
data$agegrp(data$age >= 20 & data$age <= 29) <- 1
the above code is not working under survival package. It's giving me:
invalid function in complex assignment
Can you point me where the error is? data is the dataframe I am using.
I would use findInterval() here:
First, make up some sample data
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43
Use findInterval() to categorize your "ages" vector.
findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
Alternatively, as recommended in the comments, cut() is also useful here:
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
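As for the error message in the question: data$agegrp(...) uses parentheses, which R reads as a function call, whereas subsetting needs square brackets. The sketch below shows the bracket version on the sample ages from above. Initialising the column with NA first is an added step so that the assignments have a vector of the right length to fill.

```r
set.seed(1)
data <- data.frame(age = floor(runif(20, min = 20, max = 50)))

# Create the column first, then fill it group by group
data$agegrp <- NA_integer_
data$agegrp[data$age >= 20 & data$age <= 29] <- 1L
data$agegrp[data$age >= 30 & data$age <= 39] <- 2L
data$agegrp[data$age >= 40 & data$age <= 49] <- 3L

# Same grouping as the findInterval() call above
same_as_findInterval <- identical(data$agegrp,
                                  findInterval(data$age, c(20, 30, 40)))
print(same_as_findInterval)
```

Ages outside 20 to 49 would simply stay NA with this approach.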
We can use dplyr:
library(dplyr)
data <- data %>% mutate(agegroup = case_when(age >= 40 & age <= 49 ~ '3',
age >= 30 & age <= 39 ~ '2',
age >= 20 & age <= 29 ~ '1')) # end function
Compared to other approaches, dplyr is easier to write and interpret.
This answer provides two ways to solve the problem using the data.table package, which would greatly improve the speed of the process. This is crucial if one is working with large data sets.
1st Approach: an adaptation of the previous answer but now using data.table + including labels:
library(data.table)
agebreaks <- c(0,1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,500)
agelabels <- c("0-1","1-4","5-9","10-14","15-19","20-24","25-29","30-34",
"35-39","40-44","45-49","50-54","55-59","60-64","65-69",
"70-74","75-79","80-84","85+")
setDT(data)[ , agegroups := cut(age,
breaks = agebreaks,
right = FALSE,
labels = agelabels)]
2nd Approach: This is a more wordy method, but it also makes it more clear what exactly falls within each age group:
setDT(data)[age <1, agegroup := "0-1"]
data[age >0 & age <5, agegroup := "1-4"]
data[age >4 & age <10, agegroup := "5-9"]
data[age >9 & age <15, agegroup := "10-14"]
data[age >14 & age <20, agegroup := "15-19"]
data[age >19 & age <25, agegroup := "20-24"]
data[age >24 & age <30, agegroup := "25-29"]
data[age >29 & age <35, agegroup := "30-34"]
data[age >34 & age <40, agegroup := "35-39"]
data[age >39 & age <45, agegroup := "40-44"]
data[age >44 & age <50, agegroup := "45-49"]
data[age >49 & age <55, agegroup := "50-54"]
data[age >54 & age <60, agegroup := "55-59"]
data[age >59 & age <65, agegroup := "60-64"]
data[age >64 & age <70, agegroup := "65-69"]
data[age >69 & age <75, agegroup := "70-74"]
data[age >74 & age <80, agegroup := "75-79"]
data[age >79 & age <85, agegroup := "80-84"]
data[age >84, agegroup := "85+"]
Although the two approaches should give the same result, I prefer the 1st one for two reasons: (a) it is shorter to write, and (b) the age groups are ordered in the correct way, which is crucial when it comes to visualizing the data.
Let's say that your ages were stored in the dataframe column labeled age. Your dataframe is df, and you want a new column age_grouping containing the "bucket" that your ages fall in.
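In this example, suppose that your ages range from 0 to 100 and you want to group them every 10 years. The following code accomplishes this by storing the intervals in a new age grouping column. By default the intervals are right-closed, such as (0,10], so an age of exactly 0 becomes NA unless include.lowest = TRUE is set: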
df$age_grouping <- cut(df$age, seq(0, 100, 10))
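To see how cut() treats the interval boundaries, the sketch below compares the default call with a left-closed variant on a few hand-picked ages. The ages vector and the object names are illustrative only.

```r
ages <- c(0, 10, 25, 100)

# Default: right-closed intervals (0,10], (10,20], ...; 0 is not covered
default_groups <- cut(ages, seq(0, 100, 10))

# Left-closed intervals [0,10), [10,20), ...; include.lowest closes the
# last interval on the right so that 100 is still covered
left_closed <- cut(ages, seq(0, 100, 10), right = FALSE, include.lowest = TRUE)

print(as.character(default_groups))
print(as.character(left_closed))
```

Which variant fits depends on whether an age of exactly 10 should count as part of the first or the second group.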

Go through an R dataframe and increase the salary at certain conditions

I have this sample of my dataframe (df):
age salary
1 25 20000
2 35 22000
3 31 23500
4 24 19200
5 27 27900
6 32 31010
I want to increase the salary by 11% for people who are aged above 30 and their salary is not the maximum salary in the table. I wrote this loop:
for(row in df){
if (row$age > 30 & row$salary != max(df$salary)){
row$salary = row$salary * 0.11
}
}
but I get less than the salaries posted rather than an increase.
Would really appreciate any help.
Here's one way without an explicit ifelse. It should be among the faster ways to do this:
df$new_salary <- with(df, salary + 0.11*salary*(age > 30)*(salary != max(salary)))
There are two reasons you are seeing problems: for (row in df) iterates over the columns of the data frame rather than its rows, and multiplying by 0.11 gives 11% of the salary instead of an 11% increase (which needs a factor of 1.11). Use vectorised indexing instead:
k <- max(df$salary)
df[df$age>30 & df$salary!=k, 'salary'] <- df[df$age>30 & df$salary!=k, 'salary'] * 1.11
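For reference, here is a self-contained sketch of the vectorised update on the sample data from the question. The logical vector raise is an illustrative name. Computing it once before assigning also avoids re-evaluating the maximum after some salaries have changed.

```r
df <- data.frame(age = c(25, 35, 31, 24, 27, 32),
                 salary = c(20000, 22000, 23500, 19200, 27900, 31010))

# Rows that qualify: older than 30 and not earning the maximum salary
raise <- df$age > 30 & df$salary != max(df$salary)

# An 11% increase means multiplying by 1.11, not 0.11
df$salary[raise] <- df$salary[raise] * 1.11

print(df)
```

Only rows 2 and 3 are changed here: row 6 is older than 30 but already has the maximum salary.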

Resources