How to subset dataframe based on multiple variables in R - r

I have a dataframe of 286 columns and 157355 rows. I wish to subset rows that contain one or more of several defined factor variables such as F32, F341 etc.
Once this has been completed, I wish to identify which other factor variables are most common in the subset rows.
I have tried to filter for values of interest, but an error message appears saying the data must be numeric, logical or complex. For example:
d<- a %>%
filter_at(vars(f.41202.0.0:f.41202.0.65), all_vars('F32'))
I also tried this, but the resulting dataframe had no values present;
f <- a %>%
rowwise() %>%
filter(any(c(1:280) %in% c('F32', 'F320', 'F321', 'F322', 'F323',
'F328', 'F329', 'F330', 'F331', 'F332',
'F333', 'F334', 'F338', 'F339')))
the same occurred when I tried to place all relevant variables into an ICD object;
f <- b %>%
rowwise() %>%
filter(any(c(1:286) %in% ICD))
I would greatly appreciate any suggestions, thanks
my data looks like this (sorry I can't find a way to format it better on this page);
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
2 NA NA M87 NA
3 NA F3 NA K17
4 NA NA F3 M87
After sub-setting rows based on F3 it should look like this;
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
3 NA F3 NA K17
4 NA NA F3 M87
so the same variable columns are retained, but rows without F3 are removed
then I would hope to list the other variables (other than F3) based on how common they are within that subset, in this case that would be
most common: M87
2nd most common: K17
If it helps, I am trying to identify individuals with a particular disease, then I will try to find out which other diseases those individuals most commonly have
thanks for the help

If you wish to use tidyverse, you can use filter_all to look at all of the columns. Then, check if any_vars are in a vector of diagnostic codes. In my example, I look at F3 and F320.
Afterwards, if you want to count up the number of diagnosis codes, you could reshape your data from wide to long, and then count frequencies. If you wish, you can remove NA by filter. Let me know if this is what you had in mind.
df <- data.frame(
Var1 = c("F3", NA, NA, NA),
Var2 = c(NA, NA, "F3", NA),
Var3 = c(NA, "M87", NA, "F3"),
Var4 = c("M87", NA, "K17", "M87")
)
library(tidyverse)
df %>%
filter_all(any_vars(. %in% c("F3", "F320"))) %>%
pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
filter(!is.na(Code)) %>%
count(Code, sort = TRUE)
After the filter, you should have:
Var1 Var2 Var3 Var4
1 F3 <NA> <NA> M87
2 <NA> F3 <NA> K17
3 <NA> <NA> F3 M87
After pivot_longer and count:
# A tibble: 3 x 2
Code n
<fct> <int>
1 F3 3
2 M87 2
3 K17 1
Side note: if you wish to filter based on only some of your variables (instead of selecting all variables), you can use filter_at instead, such as:
filter_at(vars(starts_with("Var")), any_vars(. %in% c("F3", "F320")))
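In dplyr 1.0.4 and later, filter_all() and filter_at() are superseded by if_any() and if_all(). The following is a minimal sketch of the same filter-then-count idea in that style, using the toy data from this answer; the object names icd, subset_rows and other_codes are illustrative and not from the original posts. It also shows why the attempt in the question returned nothing: c(1:280) %in% ICD compares the integers 1 to 280 with the codes, not the column values.

```r
library(dplyr)
library(tidyr)

# Toy data from the answer above (kept as character, not factor)
df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87"),
  stringsAsFactors = FALSE
)

# Codes of interest (illustrative vector)
icd <- c("F3", "F320")

# Keep rows where at least one Var column holds one of the codes
subset_rows <- df %>%
  filter(if_any(starts_with("Var"), ~ .x %in% icd))

# Count the remaining codes in those rows, excluding NA and the codes of interest
other_codes <- subset_rows %>%
  pivot_longer(everything(), names_to = "Var", values_to = "Code") %>%
  filter(!is.na(Code), !Code %in% icd) %>%
  count(Code, sort = TRUE)

other_codes
```

With real data, starts_with("Var") would be replaced by whatever selects the diagnosis columns, for example a range such as f.41202.0.0:f.41202.0.65.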

Related

Merging rows in a dataframe R with duplicate id's

I have a question considering merging rows in a dataframe:
I have seen a couple of questions regarding merging rows, however I have a hard time understanding them and applying them to my situation:
I have a dataframe with a structure like this:
person_id test_date serial_number freezer_number test_1 test_2 test_3 test_4
x 01/01/2010 c d positive NA NA NA
x 05/01/2010 a b NA positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
I want to merge the rows so that the data of the other columns remain intact (mainly the test
date), however I want the rows of the test number and the person_id to merge so that the same individual is in 1 row with multiple tests.
This would be the ideal output:
person_id test_date serial_number freezer_number test_date2 test_1 test_2 test_3 test_4
x 01/01/2010 c d 05/01/2010 positive positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
How do I go about this? I have tried the "aggregate()" functions before, however this is very unclear to me.
Any help is appreciated, I can give more information to clarify my current code and data frame!
You could use summarize_all, grouped by person_id. For each person_id, this keeps the first non-NA value in every column.
I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).
library(dplyr)
library(tidyr) # pivot_wider() comes from tidyr
df1 <- df %>%
group_by(person_id) %>%
mutate(id = seq_along(person_id)) %>%
pivot_wider(names_from = id,
values_from = test_date,
names_prefix = "test_date") %>%
summarize_all(list(~ .[!is.na(.)][1]))
Output
> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA
What you're trying to do is reshaping the data from long format (one row per test) to wide format (one row per person, tests are in separate columns). This can be done in many ways, for example with tidyr::pivot_wider().
However there's a complicating factor - your dataset is not quite in long format because there are already multiple columns per test result. So you first need to fix that.
# Load libraries
library(tidyr)
library(dplyr)
library(stringr)
# Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)
df2 <- df %>%
# Add a column indicating test number
group_by(person_id) %>%
mutate(test_number = row_number(),
# Gather the test results into a single column
test_result = paste0(test_1, test_2, test_3, test_4) %>%
str_remove_all("NA")) %>%
select(-(test_1:test_4)) %>%
# Reshape from long to wide
pivot_wider(names_from = test_number,
values_from = c(test_date, serial_number,
freezer_number, test_result)) %>%
# Reorder the columns
relocate(ends_with("1"), .before = ends_with("2"))
df2

Count number of non-blank columns and assign to individual respondents

I am trying to get a total number of friends that will become the denominator in a later step.
example data:
set.seed(24) ## for sake of reproducibility
n <- 5
data <- data.frame(id=1:n,
Q1= c("same", "diff", NA, NA, NA),
Q2= c("diff", "diff", "same", "diff", NA),
Q3= c("same", "diff", NA ,NA, "diff"),
Q4= c("diff", "same", NA, NA, NA))
I first need to create a column that contains a numeric count of how many columns each participant responded to (either "same" or "diff", not counting NAs/blanks). I have tried the following:
friendship <- total.friends <- rowSums(c(data$Q1, data$Q2, data$Q3, data$Q4)), != "")
friendship <- total.friends <-rowSums(!is.na(c(data$Q1, data$Q2, data$Q3, data$Q4)))
Neither is effective, likely because my data is not numeric. The first did count the cells but did not group by id as I require. Is there any function I can use to count the populated cells? How can I edit this to count cells populated only with "diff" so that I can then start on the second step (making the proportion)?
You could use apply over the rows:
library(dplyr) # provides the %>% pipe used below
data2 <- apply(data[, -1], MARGIN = 1, function(x) length(x[!is.na(x)]))
result <- as.data.frame(cbind(data[, 1], data2)) %>% setNames(c("id", "number"))
result will then hold the number of non-NA entries for each id.
data2 is a count of the non-NA values for each id. apply with MARGIN = 1 takes each row of the data frame and applies a function to it. The function here is length(x[!is.na(x)]): x[!is.na(x)] drops the NA entries in the row, and length() then counts how many entries remain.
The result of apply is a vector with one element per row. Since there is one row per id, this amounts to computing the count for each id.
Lastly, the result line adds the id column back so that each count is clearly identified rather than being a bare column of results.
Hope this works for you :)
Here's a regex solution with grep:
data$count <- apply(data, 1, function(x) length(grep("[a-z]", x, value = T)))
Here using length you count the number of times grep finds a lower-case letter in any row cell.
Result:
data
id Q1 Q2 Q3 Q4 count
1 1 same diff same diff 4
2 2 diff diff diff same 4
3 3 <NA> same <NA> <NA> 1
4 4 <NA> diff <NA> <NA> 1
5 5 <NA> <NA> diff <NA> 1
You can also accomplish this using c_across and rowwise from the dplyr library:
library(dplyr)
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(Q1:Q4)))) %>%
dplyr::ungroup()
Note: alternatively you can use starts_with("Q") inside of c_across to do this across all columns that start with "Q" (shown below).
To count a specific response, or to compute other variables that depend on a newly created variable (such as a proportion), you can do it all in the same mutate statement:
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(starts_with("Q")))),
Diff = sum(c_across(starts_with("Q")) == "diff", na.rm = T),
Prop = Diff / Total) %>%
dplyr::ungroup()
id Q1 Q2 Q3 Q4 Total Diff Prop
<int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1 1 same diff same diff 4 2 0.5
2 2 diff diff diff same 4 3 0.75
3 3 NA same NA NA 1 0 0
4 4 NA diff NA NA 1 1 1
5 5 NA NA diff NA 1 1 1
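The rowSums() attempts in the question fail because c() flattens the four columns into one long vector, so the row structure is lost. A minimal base R sketch that keeps the columns together as a data frame, so that rowSums() works row by row (q_cols and the new column names total, diff and prop are illustrative choices):

```r
# Example data from the question
data <- data.frame(
  id = 1:5,
  Q1 = c("same", "diff", NA, NA, NA),
  Q2 = c("diff", "diff", "same", "diff", NA),
  Q3 = c("same", "diff", NA, NA, "diff"),
  Q4 = c("diff", "same", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Columns holding the responses (illustrative helper vector)
q_cols <- c("Q1", "Q2", "Q3", "Q4")

# Number of answered (non-NA) questions per respondent
data$total <- rowSums(!is.na(data[q_cols]))

# Number of "diff" answers per respondent; NA comparisons are ignored
data$diff <- rowSums(data[q_cols] == "diff", na.rm = TRUE)

# Proportion of "diff" among answered questions
data$prop <- data$diff / data$total

data
```

This avoids apply() and rowwise() altogether, which may matter on larger data sets since rowSums() is vectorised.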

Eliminate row from data.frame based on partially duplicate values

I have a relatively large data.frame with 205K observations and 54 variables. This data.frame is the result of appending three different data.frames. The original data.frames all have the columns date, time, lat and lon, but each data.frame carries accessory information which i need to retain. In the final data.frame I have therefore sets of three rows where date, time, lat, lon, are exactly the same but the values of var1, var2 and so forth are different and some are NA. A simplified version of my data.frame could look like the following:
mydf
var1 date time var2 var3 var4 var5 var6 lat lon
1 A 1 2 3 4 5 6 7 8 9
2 B 1 2 <NA> <NA> <NA> 6 7 8 9
3 <NA> 1 2 <NA> <NA> <NA> <NA> <NA> 8 9
In particular, I would like to highlight in my data.frame those sets of rows with the same date, time, lat and lon, but only retain the ones where, for instance, var1 is not NA, so that the final data.frame looks like:
var1 date time var2 var3 var4 var5 var6 lat lon
1 A 1 2 3 4 5 6 7 8 9
2 B 1 2 <NA> <NA> <NA> 6 7 8 9
I know that I can use
distinct(mydf, ..., .keep_all = TRUE)
but I can't figure out how to use the arguments properly. Any help is greatly appreciated.
With dplyr:
First identify duplicated rows from top to bottom to get all the duplicated rows with the selected variables. Then filter where this new variable is TRUE and where var1 is other than NA:
library(dplyr)
mydf %>% mutate(dup = duplicated(select(., date, time, lat, lon)) |
duplicated(select(., date, time, lat, lon), fromLast = TRUE)) %>%
filter(dup == TRUE & !is.na(var1))
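The same selection can also be sketched with group_by() and n() instead of two duplicated() calls. This is an alternative formulation, not the answerer's code; the toy data below is modelled on the question, with a fourth, non-duplicated row added as an assumption to show what happens to singletons.

```r
library(dplyr)

# Toy data modelled on the question; the fourth row is an added,
# non-duplicated record (an assumption for illustration)
mydf <- data.frame(
  var1 = c("A", "B", NA, "C"),
  date = c(1, 1, 1, 5),
  time = c(2, 2, 2, 6),
  var2 = c(3, NA, NA, 4),
  lat  = c(8, 8, 8, 7),
  lon  = c(9, 9, 9, 7),
  stringsAsFactors = FALSE
)

# Within each date/time/lat/lon set that has more than one row,
# keep only the rows where var1 is not NA
kept <- mydf %>%
  group_by(date, time, lat, lon) %>%
  filter(n() > 1, !is.na(var1)) %>%
  ungroup()

kept
```

Like the duplicated() version, this drops rows that have no duplicate at all. If those should be retained, the condition could be changed to filter(n() == 1 | !is.na(var1)).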
I have found a less direct and most likely less elegant solution to deal with the duplicate rows in my specific data.frame: first add a "datetime" column to all my data.frames, then "subtract" one data.frame from another, and finally append them to a final data.frame.
#my data.frames
#df1
#df2
library(dplyr)
df1<- mutate(df1, datetime = paste(df1$date, df1$time)) #add a column "datetime" by concatenating the columns "date" and "time"
df2<- mutate(df2, datetime = paste(df2$date, df2$time)) #add a column "datetime" by concatenating the columns "date" and "time"
df1<-anti_join(df1, df2, by ="datetime") #delete from df1 those rows that occur in df2 based on the column "datetime"
df.f<-plyr::rbind.fill(df1, df2) #append the two data.frames in "df.f" (rbind.fill comes from the plyr package)
Possibly not the fastest approach but it works fine with my data.

Merging two dataframes on one matching variable and retaining only one value for other disjoint variables

I have two dataframes I need to merge. The dataframes share all of the same columns. I am merging based on one shared variable, worker_ID. However, the other variables are often disjoint: one dataframe will have an "NA" and the other will have another value for a given variable. How can I merge in such a way that the output only retains the non-NA value?
x = worker_ID Var_1 Var_2 Var_3
1 33 NA NA
2 NA 46 NA
y = worker_ID Var_1 Var_2 Var_3
1 NA 75 NA
2 NA NA 66
z <- merge(x,y,by="worker_ID", all = TRUE)
This method does not work because instead of my desired output, z, I get a dataframe with two columns for each variable (one for the value of the variable in x and another for y). My desired output is z.
z = worker_ID Var_1 Var_2 Var_3
1 33 75 NA
2 NA 46 66
How can I tell R to let any non-NA entries supersede NA ones?
As Ben suggested, you can use coalesce(). Based on your sample data, I did the following. For each pair of columns in the same position in x and y, I used coalesce() and created a vector. I converted the result of sapply() to a data frame and finally added the ID column (named work_ID in the code). Note that I used as.numeric() for Var_3. I am not sure what your data looks like, but Var_3 in x may be logical rather than numeric, so I made sure that Var_3 in x and Var_3 in y are both numeric.
library(tidyverse)
sapply(2:ncol(x), function(whatever){
coalesce(as.numeric(pull(x, whatever)),
as.numeric(pull(y, whatever))) -> foo
return(foo)
}) %>%
as_tibble %>%
bind_cols(work_ID = pull(x, 1), .)
# A tibble: 2 x 4
# work_ID V1 V2 V3
# <int> <dbl> <dbl> <dbl>
#1 1 33 75 NA
#2 2 NA 46 66
UPDATE
Taking akrun's advice, I think the following code works well. map_dfc() loops through each column pair just as sapply() does. The good thing is that map_dfc() creates a data frame; no need to use as_tibble().
map_dfc(2:ncol(x), ~ coalesce(as.numeric(pull(x, .x)),
as.numeric(pull(y, .x)))) %>%
bind_cols(work_ID = pull(x, 1), .)
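Both versions above pair the rows of x and y by position, so they rely on the two data frames listing the same workers in the same order. A minimal base R sketch that matches on worker_ID first and then fills NAs from the other side; the sample data is retyped from the question with y deliberately reordered, and the names value_cols and merged are illustrative.

```r
# Sample data from the question; y is given in a different row order
# to show that matching happens on worker_ID, not on position
x <- data.frame(worker_ID = c(1L, 2L),
                Var_1 = c(33, NA),
                Var_2 = c(NA, 46),
                Var_3 = c(NA_real_, NA_real_))
y <- data.frame(worker_ID = c(2L, 1L),
                Var_1 = c(NA_real_, NA_real_),
                Var_2 = c(NA, 75),
                Var_3 = c(66, NA))

# All shared columns except the key
value_cols <- setdiff(names(x), "worker_ID")

# merge() adds the suffixes .x and .y to the non-key columns by default
merged <- merge(x, y, by = "worker_ID", all = TRUE)

# For each variable, take the value from x unless it is NA, then fall back to y
z <- merged["worker_ID"]
for (v in value_cols) {
  from_x <- merged[[paste0(v, ".x")]]
  from_y <- merged[[paste0(v, ".y")]]
  z[[v]] <- ifelse(is.na(from_x), from_y, from_x)
}

z
```

If both data frames hold a non-NA value for the same cell, this sketch keeps the one from x; whether that is the right rule depends on the data.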

Summarize data frame to return non-NA values along subsets

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.
I need to remove NAs from a data frame along data subsets and compress the remaining non-NA values into one row for each data subset.
Example:
#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values
dat <- as.data.frame(cbind(a, b, c, d))
> desired output
a b c d
1 B A 1
2 C A <NA>
Rules of thumb:
1) Need to remove NA values from each column
2) Loop along data subsets (column "a" in example above)
3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values
Ideas:
lapply or dplyr is probably helpful to loop along all columns
na.omit is likely helpful, if the subsetting column that has entries for all rows can be ignored (something like as.data.frame(lapply(dat.admin, na.omit))). Issue in returning lapply output to data frame if some subsets don't return any non-NA values
x[which.min(is.na(x))] effectively accomplishes this if laboriously applied to each individual column
Any help is appreciated to put the final pieces together! Thank you!
One solution could be achieved using dplyr::summarise_all. The data needs to be grouped by a. (funs() is deprecated in current dplyr, so the function is written as a formula here.)
library(dplyr)
dat %>%
group_by(a) %>%
summarise_all(~ .[which.min(is.na(.))])
# # A tibble: 2 x 4
# a b c d
# <fctr> <fctr> <fctr> <fctr>
# 1 1 B A 1
# 2 2 C A <NA>
Solution with data.table and na.omit
library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)
I think the merge statement can be improved
Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:
library(dplyr)
library(tidyr)
dat %>%
filter_at(vars(b:c), any_vars(!is.na(.))) %>%
group_by(a) %>%
fill(b) %>%
fill(c) %>%
filter_at(vars(b:c), all_vars(!is.na(.)))
# A tibble: 2 x 4
# Groups: a [2]
a b c d
<fctr> <fctr> <fctr> <fctr>
1 1 B A 1
2 2 C A NA
You could also use just dplyr (summarise_each() and funs() are deprecated, so summarise_all() with a formula is used here):
dat %>%
group_by(a) %>%
summarise_all(~ first(.[!is.na(.)]))
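In dplyr 1.0.0 and later, the same idea can be written with across(). A minimal sketch, assuming the example data is built with data.frame() rather than as.data.frame(cbind(...)), so that column d stays numeric instead of being coerced to text; the name out is illustrative.

```r
library(dplyr)

# Example data from the question, built with data.frame() so types are preserved
dat <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),                # subsetting variable
  b = c(NA, NA, "B", NA, NA, "C"),
  c = c("A", NA, NA, "A", NA, NA),
  d = c(NA, NA, 1, NA, NA, NA),
  stringsAsFactors = FALSE
)

# For every non-grouping column, take the first non-NA value per group;
# indexing an empty vector with [1] yields NA, which covers all-NA groups
out <- dat %>%
  group_by(a) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

out
```

Within a grouped summarise(), across(everything()) skips the grouping column a automatically.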

Resources