subsetting a dataframe by existing object - r

I have a predefined object grade <- "G3". I would like to extract the 3 from this object and use it to subset a data frame, keeping only the rows for grade 3.
Here is some example data:
id <- c(1,2,3,4,5)
grade <- c(3,3,4,4,5)
score <- c(10,5,10,5,10)
data <- data.frame("id"=id,"grade"=grade, "score"=score)
> data
id grade score
1 1 3 10
2 2 3 5
3 3 4 10
4 4 4 5
5 5 5 10
I would like to get something like this:
> data
id grade score
1 1 3 10
2 2 3 5
Thanks!

With the tidyverse, we can use !! to force evaluation of the 'grade' object from the global environment instead of the 'grade' column of 'data', remove the 'G' with str_remove, and compare with ==. This assumes grade still holds "G3" in the global environment.
library(dplyr)
library(stringr)
data %>%
filter(grade == str_remove(!!grade, "G"))
# id grade score
#1 1 3 10
#2 2 3 5

You can use filter, but you would likely want to rename the object so it doesn't clash with the column name.
library(dplyr)
Grade <- "G3"
data <- data.frame("id"=id,"grade"=grade, "score"=score) %>%
filter(paste0("G", grade) == Grade)

You can use readr's parse_number to extract digits from a string with a minimum of fuss, and then subset with the result:
library(readr)
data[data$grade == parse_number(grade),]
Or, in base R, use gsub to replace every non-digit character with "" (sub would only replace the first one):
data[data$grade == gsub("[^0-9]", "", grade),]
Or if the only other character in your string is always "G" then:
data[data$grade == sub("G", "", grade),]
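The base R variants above can be put together in one self-contained sketch. It assumes the predefined string is stored under a separate name (`target` is a hypothetical name, chosen so it does not clash with the `grade` vector used to build the data frame) and that the string contains a single run of digits.

```r
# Example data from the question
data <- data.frame(id = 1:5,
                   grade = c(3, 3, 4, 4, 5),
                   score = c(10, 5, 10, 5, 10))

# Hypothetical name for the predefined string, kept distinct from the column
target <- "G3"

# Strip every non-digit character and convert what is left to a number
target_num <- as.numeric(gsub("[^0-9]", "", target))

# Keep only the rows whose grade matches
result <- data[data$grade == target_num, ]
print(result)
```

Converting to a number first makes the comparison numeric on both sides, rather than relying on R coercing the column to character.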

Related

Apply filter to the table function

I'm looking for a way to execute a simple task faster than I am currently able to.
I want to use the table function in R on part of a dataframe. Of course it would be possible to first use subset and then table, but this is a bit tedious. (In my case, during a first inspection of the data, I want to check the frequency of NAs on individual variables in a multi-national survey for each of the 25 participating countries. So I'd need to create 25 subsets, make the table, and then remove the subsets again because I don't need them anymore.)
Here is some example data:
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
And this is the workaround solution.
df1 <- subset(df, a == 1)
table(df1$b)
df2 <- subset(df, a == 2)
table(df2$b)
rm(df1, df2)
Is there a simpler way?
Also, I feel like I am spamming with ultra-basic questions like these. If anyone has a suggestion on how I could have found the answer directly I'd be happy to hear it. Other than trying some code myself, I googled terms like 'r apply filter to table', 'r filter table function', 'r table subset dataframe', etc.
Assuming 99 is your NA code, the purrr package gives a quick way to see how many such values there are in each column:
library(purrr)
df |>
map_df(~sum(. == 99))
a b
<int> <int>
1 0 3
Can you provide an example of the structure of the original data (multi-national survey)?
You could probably answer your question with much tidier code using the dplyr package, with functions such as
library(dplyr)
survey_data %>%
select(column1, column2, country, etc) %>% # placeholder names; choose your desired columns
group_by(country) %>%
summarise(across(everything(), ~ sum(is.na(.x)))) # replaces summarise_all(funs(...)); funs() is deprecated
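The same per-country count can be sketched in base R without any packages. Everything in the example is an assumption made for illustration: the data frame `survey`, its columns `country`, `q1` and `q2`, and the use of 99 as the missing-value code.

```r
# Hypothetical survey data: 'country', 'q1' and 'q2' are made-up names
survey <- data.frame(
  country = c("AT", "AT", "AT", "BE", "BE", "BE"),
  q1      = c(1, 99, 99, 2, 3, 99),
  q2      = c(99, 1, 2, 2, 2, 2)
)

# Assumption: 99 is the missing-value code, so recode it to NA first
vars <- c("q1", "q2")
survey[vars] <- lapply(survey[vars], function(x) replace(x, x == 99, NA))

# Count the NAs of each variable within each country
# (rows of the result are countries, columns are variables)
na_counts <- sapply(survey[vars], function(x) tapply(is.na(x), survey$country, sum))
print(na_counts)
```

This gives one small matrix for all countries at once, so no temporary subsets have to be created and removed.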
You could split on your a variable and use lapply to apply table to each element of the resulting list, like this:
lapply(split(df, df$a), \(x) table(x))
#> $`1`
#> b
#> a 1 2 3 99
#> 1 1 1 1 2
#>
#> $`2`
#> b
#> a 1 2 3 99
#> 2 2 1 1 1
Created on 2023-02-18 with reprex v2.0.2
Just use it in an lapply.
alv <- unique(df$a)
lapply(alv, \(x) table(subset(df, a == x, b))) |> setNames(alv)
# $`1`
# b
# 1 2 3 99
# 1 1 1 2
#
# $`2`
# b
# 1 2 3 99
# 2 1 1 1
However, it might be better to code 99 (and probably others) as NA,
df[] <- lapply(df, \(x) replace(x, x %in% c(99), NA))
and count the NAs in b for each individual a.
with(df, tapply(b, a, \(x) sum(is.na(x))))
# 1 2
# 2 1
Just use table() on the whole data frame and pull out the parts you want afterwards. The a and b values become the dimnames of the two-way table, so index into it with character strings. For example,
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
full <- table(df$a, df$b)
full["1",] # corresponds to subset a == 1
#> 1 2 3 99
#> 1 1 1 2
full["2",] # corresponds to subset a == 2
#> 1 2 3 99
#> 2 1 1 1
full[, "99"] # corresponds to subset b == 99
#> 1 2
#> 2 1
Created on 2023-02-18 with reprex v2.0.2

Subsetting whole clusters from a dataframe

In my data.frame below, how can I keep every whole study (cluster) that contains at least one outcome larger than 1?
My desired output is shown below. I tried subset(h, outcome > 1) but that doesn't give my desired output.
h = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3"
h = read.table(text = h, header = TRUE)
DESIRED OUTPUT:
"
study outcome
a 1
a 2
a 1
c 3
c 3"
Modify the subset call in two steps:
1. Extract the 'study' values where outcome > 1, i.e. study[outcome > 1].
2. Use %in% on 'study' against those values to build the final logical expression in subset.
subset(h, study %in% study[outcome > 1])
-output
study outcome
1 a 1
2 a 2
3 a 1
6 c 3
7 c 3
If we want to limit the result to the first 'n' studies that have an 'outcome' greater than 1, get the unique 'study' values from the first expression, use head to take the first 'n' of them, and use %in% to create the logical expression
n <- 3
subset(h, study %in% head(unique(study[outcome > 1]), n))
Or can be done with a group by approach with any
library(dplyr)
h %>%
group_by(study) %>%
filter(any(outcome > 1)) %>%
ungroup()
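The grouped `any` idea can also be sketched in base R with `ave`, which applies a function within each group and returns a result of the same length as its input. This is an alternative illustration rather than part of the answer above; `keep` and `result` are names introduced here.

```r
h <- read.table(text = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3", header = TRUE)

# For every row, flag whether ANY row of the same study has outcome > 1
keep <- ave(h$outcome > 1, h$study, FUN = any)

# Keep the rows belonging to the flagged studies
result <- h[keep, ]
print(result)
```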

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename the columns whose names include "open" to "a" and the columns whose names include "close" to "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefixes (here "df.") vary, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to @akrun's insightful suggestion, we can do it in one go by passing character vectors as the pattern and replacement arguments of str_replace, so both replacements are carried out at once. Each argument can be a character vector of length one or longer; if longer, the lengths must correspond. Note that the patterns are matched against the column names element by element, so this relies on the columns appearing in the same order as the patterns. As a side note, the documentation says of the replacement argument:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
Another base R option using gsub + match + setNames
setNames(
df,
c("a", "b")[match(
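sub(".*(open|close)$", "\\1", names(df)), # keep only the trailing "open"/"close"; "[^open|close]" is a character class, not an alternation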
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3
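Since the prefixes vary between data frames, it may help to see suffix matching applied to several of them at once. The following is a minimal base R sketch; the function name `rename_open_close`, the data frames `x` and `y`, and the `spx.` prefix are all hypothetical.

```r
# Rename any column ending in "open" to "a" and any column ending in
# "close" to "b", whatever the prefix and whatever the column order
rename_open_close <- function(dat) {
  lookup <- c(open = "a", close = "b")
  for (suffix in names(lookup)) {
    hit <- grepl(paste0(suffix, "$"), names(dat))
    names(dat)[hit] <- lookup[[suffix]]
  }
  dat
}

# Hypothetical data frames with different prefixes and column orders
x <- data.frame(df.open = c(1, 4, 5), df.close = c(2, 8, 3))
y <- data.frame(spx.close = c(7, 8), spx.open = c(5, 6))

renamed <- lapply(list(x, y), rename_open_close)
print(lapply(renamed, names))
```

Because each column is matched by its own suffix, the result does not depend on the columns appearing in a particular order.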

Paste string values from df column into a function

I have a dataset in R organized like so:
x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2
Then, I have a function like
fun <- function(number, df1, string, df2){NormC <- as.numeric(df1[string, "normc"])
df2$NormC <- rep(NormC)}
How can I iterate through my df and insert each value of "x" into the function?
I think the problem is that this part of the function (which takes 4 arguments) is structured like so: NormC <- as.numeric(df[string, "normc"])
As explained by @duckmayr, you don't need to iterate through column x. Here is an example that creates a new variable.
df <- read.table(text = " x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2", header = TRUE)
fun <- function(string){paste0(string, "X")} # example
# option 1
df$new.col1 <- fun(df$x) # see duckmayr's comment
# option 2
library(data.table)
setDT(df)[, new.col2 := fun(x)]
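If the aim of the original function is to look up a `normc` value for each product, row-name indexing is itself vectorised, so the whole column can be passed in one call. The sketch below rests on assumptions: `norm_tbl` is a hypothetical lookup table whose row names are product IDs, and its `normc` values are invented.

```r
df <- data.frame(x = c("PRODUCT10000", "PRODUCT10001", "PRODUCT10002"),
                 freq = c(6, 20, 11))

# Hypothetical lookup table: row names are product IDs, 'normc' values are invented
norm_tbl <- data.frame(normc = c(0.5, 1.2, 0.8),
                       row.names = c("PRODUCT10001", "PRODUCT10000", "PRODUCT10002"))

# Index with the whole column at once; no loop over df$x is needed
df$NormC <- as.numeric(norm_tbl[as.character(df$x), "normc"])
print(df)
```

The lookup is done by name, so the rows of the lookup table do not need to be in the same order as `df`.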

Subset a dataframe using a string of column names

I need to subset a data frame (df) by a string of column names that I have created, but I'm not sure how to pass it to a subset.
For example, colstoKeep is a character string:
"col1", "col2", "col3", "col4"
How do I pass this to a subset operation such as
df<- df[colstoKeep]
I'm sure this is easy, but the above doesn't work.
df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
colsToKeep <- "\"A\", \"C\""
If I understand your question correctly, your colsToKeep variable is a string as given above. In order to extract the variables, you will have to convert that into a vector. If I've used the right format, you can do that with the following code.
library(magrittr)
colsToKeepVector <-
strsplit(colsToKeep, ",") %>%
unlist() %>%
trimws() %>%
gsub("\"", "", .)
df[colsToKeepVector]
However, if you started with a vector and collapsed it to a string (with paste(..., collapse = ", ")?), I would strongly advise against doing that; keep the vector and index with it directly.
(Edited to match the string format in the question)
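For comparison, the same parsing can be sketched in base R alone, without the pipe. It assumes the string has exactly the format shown above (double-quoted names separated by commas); `cols` and `result` are names introduced for this example.

```r
df <- data.frame(A = 1:5, B = 1:5, C = 1:5)

# A single string, formatted as in the question
colsToKeep <- "\"A\", \"C\""

# Split on commas, drop the embedded double quotes, trim surrounding whitespace
cols <- trimws(gsub("\"", "", strsplit(colsToKeep, ",", fixed = TRUE)[[1]]))

# 'cols' is now an ordinary character vector and can index the data frame
result <- df[cols]
print(result)
```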
df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
cols_to_keep <- c("A","C")
df[,cols_to_keep]
A C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
