I have two data frames. They are
x <- data.frame(sulfur = c(NA, 5, 7, NA, NA), nitrate = c(NA, NA, NA, 3, 7))
y <- data.frame(sulfur = c(NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 6, 7))
I want a new data frame which should be like
z <- data.frame(sulfur = c(NA, 5, 7, NA, NA, NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 3, 7, NA, NA, NA, 6, 7))
I am trying to join the columns of the two data frames into a single data frame. How do I do it?
Try this:
df<-data.frame(Sulfur=c(NA,5,7,NA,NA), Nitrate = c(NA,NA,NA,3,7))
df2<-data.frame(Sulfur=c(NA,3,7,9,NA), Nitrate = c(NA,NA,NA,6,7))
df3<-(rbind(df,df2))
>df3
Sulfur Nitrate
1 NA NA
2 5 NA
3 7 NA
4 NA 3
5 NA 7
6 NA NA
7 3 NA
8 7 NA
9 9 6
10 NA 7
> lst <- list(df, df2)
> class(lst)
[1] "list"
> dplyr::bind_rows(lst)
> do.call(rbind, lst)
You can use either of the above calls on a list, so they also work for multiple (more than two) data frames.
Other options include placing the datasets in a list and then using rbindlist from data.table
library(data.table)
rbindlist(list(x,y))
or we can use bind_rows from dplyr.
library(dplyr)
bind_rows(x,y)
NOTE: The above two functions can be applied to more than 2 datasets.
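For readers who want to stay in base R, here is a minimal sketch of the same list-based idea with three data frames. The third data frame `w` is made up for illustration, and the sketch assumes all data frames share the same column names, which rbind requires.

```r
x <- data.frame(sulfur = c(NA, 5, 7, NA, NA), nitrate = c(NA, NA, NA, 3, 7))
y <- data.frame(sulfur = c(NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 6, 7))
w <- data.frame(sulfur = c(1, NA), nitrate = c(NA, 2))  # hypothetical third data frame

# Put the data frames in a list and stack them row-wise in one call
frames <- list(x, y, w)
z <- do.call(rbind, frames)

# Reset the row names so they run 1..nrow(z)
rownames(z) <- NULL
dim(z)
```

Because the data frames are passed as a list, the same call works unchanged however many of them there are.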
I hope someone can help me with this query. I have a large data set and am going to run analyses on a set of participants, provided they meet certain criteria. In this case, the criterion is that each participant provided at least 1 answer to Measure 1 items AND at least 1 answer to Measure 2 items (there are three items for Measure 1 and three items for Measure 2). As such, if they provide answers to all three Measure 1 items but none to Measure 2 items, they are removed from the data set. The same applies if they provide two answers to one of the measures but no answer to the items belonging to the other measure. Consider the example below:
df <- data.frame(tester_ID = c("A1", "A2", "A3", "A4", "A5", "A6",
"A7", "A1", "A2", "A3", "A4", "A5", "A6", "A7"),
Phase = c("Phase1", "Phase1", "Phase1", "Phase1", "Phase1",
"Phase1", "Phase1", "Phase2",
"Phase2", "Phase2", "Phase2", "Phase2", "Phase2",
"Phase2"),
Item1Measure1 = c(5, NA, 3, 4, 4, 1, 4, 4, 5, NA, NA, NA, NA, NA),
Item2Measure1 = c(5, 3, NA, NA, 4, 1, NA, 4, 5, NA, NA, 3, NA, 1),
Item3Measure1 = c(NA, NA, NA, NA, 4, 1, NA, 4, 5, 1, 3, 5, NA, NA),
Item1Measure2 = c(NA, NA, NA, NA, NA, 1, NA, 4, 5, NA,NA, NA,NA,NA),
Item2Measure2 = c(5, NA, NA, 4, 4, 1, 4, NA, 5, 2, 4, 1, 2, 4),
Item3Measure2 = c(5, NA, 3, 4, 4, 1, 4, NA, 5, NA, NA, NA, NA, NA))
Created on 2022-06-05 by the reprex package (v2.0.1)
I am hoping to create a condition whereby only participants that provided AT LEAST one answer to a Measure1 item AND AT LEAST one answer to a Measure2 item are considered. For instance, the Tester_ID named A2, in Phase one, did not reply to any items for Measure 2, so that tester would be excluded in the new data set. The same applies to Tester_ID A6, in Phase 2, as that tester only provided answers to Measure 2 items but none to Measure 1 items. The remaining 12 rows would meet the criterion of at least one answer per Measure.
Any help would be greatly appreciated.
We may use if_any: loop over the 'Measure1' columns and check for non-NA elements (complete.cases), then (&) loop separately over the 'Measure2' columns and do the same. Each if_any call returns a single TRUE/FALSE per row, and the combined condition is TRUE only if both are TRUE, i.e. if there is at least one non-NA value in both sets of columns
library(dplyr)
df %>%
filter(if_any(ends_with('Measure1'), complete.cases ) &
if_any(ends_with('Measure2'), complete.cases))
-output
tester_ID Phase Item1Measure1 Item2Measure1 Item3Measure1 Item1Measure2 Item2Measure2 Item3Measure2
1 A1 Phase1 5 5 NA NA 5 5
2 A3 Phase1 3 NA NA NA NA 3
3 A4 Phase1 4 NA NA NA 4 4
4 A5 Phase1 4 4 4 NA 4 4
5 A6 Phase1 1 1 1 1 1 1
6 A7 Phase1 4 NA NA NA 4 4
7 A1 Phase2 4 4 4 4 NA NA
8 A2 Phase2 5 5 5 5 5 5
9 A3 Phase2 NA NA 1 NA 2 NA
10 A4 Phase2 NA NA 3 NA 4 NA
11 A5 Phase2 NA 3 5 NA 1 NA
12 A7 Phase2 NA 1 NA NA 4 NA
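The same "at least one answer per measure" rule can also be sketched in base R with rowSums. The toy data frame below is hypothetical (it is not the data from the question) and only reuses the question's column naming scheme.

```r
# Toy data (hypothetical), with the same column naming scheme as the question
toy <- data.frame(
  tester_ID     = c("A1", "A2", "A3", "A4"),
  Item1Measure1 = c(5, NA, 3, NA),
  Item2Measure1 = c(NA, 3, NA, NA),
  Item1Measure2 = c(NA, NA, 2, 4),
  Item2Measure2 = c(1, NA, NA, 4)
)

# Logical masks selecting the columns of each measure by name suffix
m1 <- grepl("Measure1$", names(toy))
m2 <- grepl("Measure2$", names(toy))

# Count non-NA answers per row within each measure; keep rows with >= 1 in both
keep <- rowSums(!is.na(toy[m1])) > 0 & rowSums(!is.na(toy[m2])) > 0
toy[keep, ]
```

Here A2 is dropped because it has no Measure2 answer and A4 because it has no Measure1 answer, which leaves A1 and A3.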
I have the following data frame:
data <- structure(list(Date = structure(c(-17897, -17896, -17895, -17894,
-17893, -17892, -17891, -17890, -17889, -17888, -17887, -17887,
-17886, -17885, -17884, -17883, -17882, -17881, -17880, -17879,
-17878, -17877, -17876, -17875, -17874, -17873, -17872, -17871,
-17870, -17869, -17868, -17867, -17866, -17865, -17864), class = "Date"),
duration = c(NA, NA, NA, 5, NA, NA, NA, 5, NA, NA, 1, 1,
NA, NA, 3, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4, NA, NA, 4, NA, NA), name = c(NA, NA, NA, "Date_beg",
NA, NA, NA, "Date_end", NA, NA, "Date_beg", "Date_end", NA,
NA, "Date_beg", NA, "Date_end", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "Date_beg", NA, NA, "Date_end", NA, NA
)), row.names = c(NA, -35L), class = c("tbl_df", "tbl", "data.frame"
))
And looks like:
Date duration name
<date> <dbl> <chr>
1 1921-01-01 NA NA
2 1921-01-02 NA NA
3 1921-01-03 NA NA
4 1921-01-04 5 Date_beg
5 1921-01-05 NA NA
6 1921-01-06 NA NA
7 1921-01-07 NA NA
8 1921-01-08 5 Date_end
9 1921-01-09 NA NA
10 1921-01-10 NA NA
...
I want to replace the NA values in column name that are between rows with Date_beg and Date_end with the word "event".
I have tried this:
data %<>% mutate(name = ifelse(((lag(name) == 'Date_beg')|(lag(name) == 'event')) &
But only the first row after Date_beg changes. It is quite easy with a for-loop, but I wanted to use a more R-like method.
There is probably a better way using data.table::nafill, but as you're using tidyverse functions, I would do it by creating an extra event column using tidyr::fill and then pulling it through to the name column where name is NA:
library(dplyr)
library(tidyr)
data %>%
mutate(
events = ifelse(
fill(data, name)$name == "Date_beg",
"event",
NA),
name = coalesce(name, events)
) %>%
select(-events)
You can do it by looking at the indices where there have been more "Date_beg" than "Date_end" so far (lag here is dplyr::lag, so dplyr needs to be installed):
data$name[dplyr::lag(cumsum(data$name == "Date_beg" & !is.na(data$name))) -
cumsum(data$name == "Date_end" & !is.na(data$name)) >0] <- "event"
print(data, n=20)
# # A tibble: 35 x 3
# Date duration name
# <date> <dbl> <chr>
# 1 1921-01-01 NA NA
# 2 1921-01-02 NA NA
# 3 1921-01-03 NA NA
# 4 1921-01-04 5 Date_beg
# 5 1921-01-05 NA event
# 6 1921-01-06 NA event
# 7 1921-01-07 NA event
# 8 1921-01-08 5 Date_end
# 9 1921-01-09 NA NA
# 10 1921-01-10 NA NA
# 11 1921-01-11 1 Date_beg
# 12 1921-01-11 1 Date_end
# 13 1921-01-12 NA NA
# 14 1921-01-13 NA NA
# 15 1921-01-14 3 Date_beg
# 16 1921-01-15 NA event
# 17 1921-01-16 3 Date_end
# 18 1921-01-17 NA NA
# 19 1921-01-18 NA NA
# 20 1921-01-19 NA NA
# # ... with 15 more rows
Lagging the first index by one is required so that you don't overwrite the "Date_beg" at the start of each run.
Another dplyr approach using the cumsum function.
If the value in the name column is NA, it adds 0 to the cumulative sum; otherwise it adds 1. Therefore the rows from a Date_beg up to (but not including) the matching Date_end always have an odd ref value, and the rows from a Date_end up to the next Date_beg always have an even one. Then replace the values that are odd in the ref column AND NA in the name column with "event".
library(dplyr)
data %>%
mutate(ref = cumsum(ifelse(is.na(name), 0, 1)),
name = ifelse(ref %% 2 == 1 & is.na(name), "event", name)) %>%
select(-ref)
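To see why the odd/even trick works, here is a small base R sketch on a made-up vector (not the data from the question). It assumes the markers strictly alternate, Date_beg then Date_end, as they do in the question.

```r
# Hypothetical marker column: two begin/end pairs, the second with nothing between
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA, "Date_beg", "Date_end", NA)

# Running count of markers seen so far; NA rows add 0, marker rows add 1
ref <- cumsum(!is.na(name))

# Odd count means we are after a Date_beg and before its Date_end;
# only the NA rows in that stretch are overwritten
inside <- ref %% 2 == 1 & is.na(name)
name[inside] <- "event"
name
```

Only positions 3 and 4 become "event"; the NA at position 6 sits after a Date_end (even count) and stays NA.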
Consider the following:
library(data.table)
DataTableA <- data.table(v1 = c(1, 2, NA, 6, 3, NA),
v2 = c(NA, 4, NA, NA, 1, 2),
v3 = c(3, 3, NA, 4, 2, NA),
v4 = c(2, NA, 3, NA, 3, NA),
v5 = c(1, NA, NA, NA, 3, 4))
DataTableA
## v1 v2 v3 v4 v5
## 1: 1 NA 3 2 1
## 2: 2 4 3 NA NA
## 3: NA NA NA 3 NA
## 4: 6 NA 4 NA NA
## 5: 3 1 2 3 3
## 6: NA 2 NA NA 4
varnames <- c("v2", "v4", "v5")
What is the best way of getting the rows of DataTableA where at least one of the variables named in varnames is not NA, without explicitly referring to the variable names?
I know I could do
DataTableA[!is.na(v2) | !is.na(v4) | !is.na(v5)]
but I want to avoid writing out the variable names.
Something that works is
DataTableA[apply(!is.na(DataTableA[, ..varnames]), 1, any)]
but I'm wondering if there's a better way. If there's not, that's OK of course. I don't have any problem with using apply as above, but what I've seen of data.table so far makes me think there might be a simpler approach.
This question is similar, but more complex.
Thanks for any help you can give.
We can specify the 'varnames' in .SDcols, loop over the .SD (Subset of Data.table), check each column for non-NA values and Reduce the results with `|`
DataTableA[DataTableA[, Reduce(`|`, lapply(.SD, function(x) !is.na(x))), .SDcols = varnames]]
Or with rowSums
DataTableA[DataTableA[, rowSums(!is.na(.SD)) > 0, .SDcols = varnames]]
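As a sanity check on the row logic, the same two ideas can be sketched in base R on a plain data.frame copy of the example. The names `dfA`, `keep_rowsums` and `keep_reduce` are just for illustration.

```r
# Plain data.frame copy of the example data from the question
dfA <- data.frame(v1 = c(1, 2, NA, 6, 3, NA),
                  v2 = c(NA, 4, NA, NA, 1, 2),
                  v3 = c(3, 3, NA, 4, 2, NA),
                  v4 = c(2, NA, 3, NA, 3, NA),
                  v5 = c(1, NA, NA, NA, 3, 4))
varnames <- c("v2", "v4", "v5")

# Variant 1: count non-NA values per row across the selected columns
keep_rowsums <- rowSums(!is.na(dfA[varnames])) > 0

# Variant 2: OR together one "is not NA" vector per selected column
keep_reduce <- Reduce(`|`, lapply(dfA[varnames], function(col) !is.na(col)))

dfA[keep_rowsums, ]
```

Both variants keep rows 1, 2, 3, 5 and 6; row 4 is dropped because v2, v4 and v5 are all NA there.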
I have a list of tibbles like the following:
list(A = structure(list(
ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
g1 = c(0, 1, 2, NA, NA, NA, NA, NA, NA),
g2 = c(NA, NA, NA, 3, 4, 5, NA, NA, NA),
g3 = c(NA, NA, NA, NA, NA, NA, 6, 7, 8)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")),
B = structure(list(ID = c(1, 2, 1, 2, 1, 2),
g1 = c(10, 11, NA, NA, NA, NA),
g2 = c(NA, NA, 12,13, NA, NA),
g3 = c(NA, NA, NA, NA, 14, 15)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
)
Each element looks like this:
ID g1 g2 g3
<dbl> <dbl> <dbl> <dbl>
1 0 NA NA
2 1 NA NA
3 2 NA NA
1 NA 3 NA
2 NA 4 NA
3 NA 5 NA
1 NA NA 6
2 NA NA 7
3 NA NA 8
The g* columns are created dynamically, during previous mutates, and their number can vary, but it will be the same across all list elements.
Every g* column has only certain non-NA elements (as many as the unique IDs).
I would like to shift the g* columns so that their non-NA elements end up in the top rows.
I can do it for a single column by
num.shifts<- rle(is.na(myList[[1]]$g1))$lengths[1]
shift(myList[[1]]$g2,-num.shifts)
but how can I do it for all the g* columns, for all list elements, when I don't know in advance the number of g* columns?
Ideally, I would like a tidyverse solution, but not a requirement...
Thanks!
We can loop over the list (called lst1 here) with map, and use mutate_at to go over the columns whose names match 'g' followed by digits, reordering each one so that the non-NA elements come first
library(dplyr)
library(purrr)
map(lst1, ~
.x %>%
mutate_at(vars(matches('^g\\d+')), ~ .[order(is.na(.))]))
In base R, we can do
lapply(lst1, function(x) {i1 <- grepl("^g\\d+$", names(x))
x[i1] <- lapply(x[i1], function(y) y[order(is.na(y))])
x})
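The reordering step used in both answers can be seen in isolation on a single column. This is a small base R sketch with the g2 values from list element A; `idx` is just an illustrative name.

```r
# One g* column: the non-NA block sits in the middle
g2 <- c(NA, NA, NA, 3, 4, 5, NA, NA, NA)

# is.na() gives FALSE for values and TRUE for NAs; ordering on it puts
# FALSE (the values) first and keeps ties in their original order
idx <- order(is.na(g2))

g2[idx]
```

The values 3, 4, 5 move to the top in their original order and the six NAs follow.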
My zoo (time series) data set looks like below and goes on for hundreds of rows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
NA NA NA NA 1 1 1 NA NA NA 3 3 3 NA NA 1 1
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
This variable is part of a larger zoo data set. The general pattern of this variable is a series of 1's, then NAs, then 3's, then NAs, and repeat the pattern again starting with a series of 1's. There is no regular pattern of the number of NAs.
I am trying to (i) fill the NAs between the 1's and 3's with 2, (ii) fill the NAs between the 3's and subsequent 1's with 4, and (iii) fill the NAs in the first four observations with 4 following the general pattern. When done, the values will be a series of 1, 2, 3, and 4 without a pattern of the quantity for each of the four values.
I have spent hours trying ifelse and for loops without success. (Relatively newbie with this part of R.)
I previously did this task in Stata but can't figure out the code in R to fill the NAs. The Stata code to fill the NAs is:
replace cycle4I = 2 if missing(cycle4I) & (cycle4I[_n-1] == 1 | cycle4I[_n-1] == 2) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 3)
replace cycle4I = 4 if missing(cycle4I) & (cycle4I[_n-1] == 3 | cycle4I[_n-1] == 4) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 1)
Here is a straightforward way:
library(zoo)
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
x <- cycle4I
x[1] <- 3
x <- is.na(x) + na.locf(x)
x[1] <- 4
Which gives:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
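The trick is that every NA should become "the last observed value plus one". Below is a base R sketch of the same idea on a plain numeric vector, without zoo. The carry-forward via cummax is a hand-rolled stand-in for na.locf, and the names `seeded`, `last_seen`, `carried` and `filled` are illustrative only.

```r
v <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)

# Seed the first element with 3 so the leading NAs are treated as "after a 3"
seeded <- v
seeded[1] <- 3

# For each position, the index of the most recent non-NA value
last_seen <- cummax(seq_along(seeded) * !is.na(seeded))

# Carry the last observed value forward, then add 1 wherever the value was missing
carried <- seeded[last_seen]
filled <- carried + is.na(seeded)

# The seed itself was a placeholder, so set it to 4 like the other leading NAs
filled[1] <- 4
filled
```

On this vector the result matches the zoo output shown above: 4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1.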
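Here is one way with dplyr. Note that the first four NAs have no earlier value to carry forward, so they stay NA and would still need to be set to 4 separately.
library(dplyr)
library(zoo)
tibble(cycle4I = c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)) %>%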
mutate(final =
cycle4I %>%
lag %>%
na.locf(na.rm = FALSE) %>%
`+`(1) %>%
ifelse(is.na(cycle4I),
., cycle4I) )