Fill multiple columns in an R dataframe [duplicate] - r

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Closed 2 years ago.
I have a dataframe called flu that is a count of cases (n) by group per week.
flu <- structure(list(isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5), group = c("fluA",
"fluB", "fluA", "fluB", "fluA", "fluB", "fluA", "fluA", "fluB"
), n = c(5, 6, 3, 5, 12, 14, 6, 23, 25)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(isoweek = structure(list(), class = c("collector_double",
"collector")), group = structure(list(), class = c("collector_character",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
In the data set, weeks with zero cases are simply missing rather than recorded as NA, so there are no NA values to work with.
I have found a fix for this that fills in the missing week/group combinations with zeros.
flu %>% complete(isoweek, nesting(group), fill = list(n = 0))
My problem is that this only works for the weeks already present in the data. For example, if no cases are reported at weeks 6, 7, 8, etc., I have no rows at all for those weeks.
How can I extend this fill process so the data frame also covers isoweeks 6 to 10 (for example), with a fluA and a fluB row for each week and a zero value for each isoweek/group pair?

You can expand multiple columns in complete(). Say you need data up to week 8; you can do:
tidyr::complete(flu, isoweek = 1:8, group, fill = list(n = 0))
# A tibble: 16 x 3
# isoweek group n
# <dbl> <chr> <dbl>
# 1 1 fluA 5
# 2 1 fluB 6
# 3 2 fluA 3
# 4 2 fluB 5
# 5 3 fluA 12
# 6 3 fluB 14
# 7 4 fluA 6
# 8 4 fluB 0
# 9 5 fluA 23
#10 5 fluB 25
#11 6 fluA 0
#12 6 fluB 0
#13 7 fluA 0
#14 7 fluB 0
#15 8 fluA 0
#16 8 fluB 0
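If you would rather not hard-code the sequence, here is a hedged sketch using tidyr::full_seq() (the last_week value below is an illustrative assumption, not something from the question):
# Sketch: build the full week sequence from 1 up to a chosen last week,
# then complete every isoweek/group pair, filling missing counts with 0.
library(tidyr)
last_week <- 10  # hypothetical upper bound
flu %>%
  complete(isoweek = full_seq(c(isoweek, last_week), period = 1),
           group, fill = list(n = 0))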

Related

Replacing values of a column in R dataframe

I have a data frame named C0001 with 3671 observations of 31 variables. I want to apply a check on each value of one variable named Y. If the value of that variable is greater than 30, then replace it with 30 otherwise keep the existing value. I wrote the following in R but it gives me an error:
C0001 <- read.csv("C0001.csv")
C0001$Y<- ifelse(C0001$Y > 30, 30, C0001$Y)
Error in ans[npos] <- rep(no, length.out = len)[npos] :
replacement has length zero
In addition: Warning message:
In rep(no, length.out = len) : 'x' is NULL so the result will be NULL
Could someone help me with what mistake I am making here? Is there some other way to do the same operation without using ifelse?
Thank you
Try replacing read.csv() with read_csv(), and also check your working directory. The read_csv() function imports data into R as a tibble, while read.csv() imports a regular old R data frame. The error indicates that C0001$Y is either NULL or a length-0 vector, which usually means the column name doesn't match (for example y versus Y); check names(C0001) to confirm the column exists.
library(readr)
C0001 <- read_csv("C:/Users/Desktop//C0001.csv")
C0001
# A tibble: 6 x 3
x y z
<dbl> <dbl> <dbl>
1 2 40 4
2 3 12 5
3 45 12 6
4 1 50 7
5 1 50 30
6 1 0 0
C0001$y<- ifelse(C0001$y > 30, 30, C0001$y)
C0001
# A tibble: 6 x 3
x y z
<dbl> <dbl> <dbl>
1 2 30 4
2 3 12 5
3 45 12 6
4 1 30 7
5 1 30 30
6 1 0 0
Data sample:
structure(list(x = c(2, 3, 45, 1, 1, 1), y = c(30, 12, 12, 30,
30, 0), z = c(4, 5, 6, 7, 30, 0)), row.names = c(NA, -6L), spec = structure(list(
cols = list(x = structure(list(), class = c("collector_double",
"collector")), y = structure(list(), class = c("collector_double",
"collector")), z = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Use vectorized replacement like this:
C0001$Y[C0001$Y > 30] <- 30
This assigns 30 only to the elements that exceed 30 and leaves the rest unchanged, without using ifelse().
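If the intent is simply to cap Y at 30, pmin() is another vectorised option (a minimal sketch, assuming Y is numeric; NA values stay NA):
# Sketch: cap every value of Y at 30; NAs are left as NA.
C0001$Y <- pmin(C0001$Y, 30)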

Using case_when with dplyr across

I'm trying to translate a mutate_at() to a mutate() using dplyr's new "across" function and a bit stumped.
In a nutshell, I need to compare the values in a series of columns to a "baseline" column. When the values in the columns are higher than the baseline, I need to use the baseline value. When the values in the columns are lower than or equal to the baseline, I need to keep the value. Here's an example dataset (my actual dataset is much larger):
test <- structure(list(baseline = c(5, 7, 8, 4, 9, 1, 0, 46, 47), bob = c(7,
11, 34, 9, 6, 8, 3, 49, 12), sally = c(3, 5, 2, 2, 6, 1, 3, 4,
56), rita = c(6, 4, 6, 7, 6, 0, 3, 11, 3)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(baseline = structure(list(), class = c("collector_double",
"collector")), bob = structure(list(), class = c("collector_double",
"collector")), sally = structure(list(), class = c("collector_double",
"collector")), rita = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
My current code uses mutate_at() and works fine:
trial1 <- test %>%
mutate_at(
vars('bob','sally', 'rita'),
funs(case_when(
. > baseline ~ baseline,
. <= baseline ~ .)))
But when I try to update it to reflect across() from dplyr 1.0, I keep getting an error. Here is my attempt:
trial2 <- test %>%
mutate(across(c(bob, sally, rita),
case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
And here is the error:
error: Problem with mutate() input ..1.
x . > baseline ~ baseline, . <= baseline ~ . must be length 36 or one, not 9, 4.
ℹ Input ..1 is across(...)
Any ideas what I might be doing wrong? Does case_when() work with across?
We can use ~ to specify an anonymous function (a purrr-style lambda):
library(dplyr)
test %>%
mutate(across(c(bob, sally, rita),
~ case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
According to ?across, the fns argument can be any of the following:
Functions to apply to each of the selected columns. Possible values are:
NULL, to return the columns untransformed.
A function, e.g. mean.
A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
Also, instead of case_when(), we can make use of pmin():
test %>%
mutate(across(c(bob, sally, rita), ~ pmin(baseline, .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
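As a side note, if you are on R 4.1 or later, the base anonymous-function syntax can replace the purrr-style lambda inside across() (a sketch of the same pmin() approach):
# Sketch (R >= 4.1): \(x) is base R's shorthand for function(x)
test %>%
  mutate(across(c(bob, sally, rita), \(x) pmin(baseline, x)))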

Generate a self-reference key within the table using R mutate in a dataframe

I have an input table with 3 columns (Person_Id, Visit_Id (a unique id for each visit and each person) and Purpose) as shown below. I would like to generate a new column which provides the immediately preceding visit of the person (e.g. if a person has a hospital visit with Visit_Id = 2, then I would like another column called "Preceding_visit_id" which will be 1; similarly, if Visit_Id = 5, the preceding visit id will be 4). Is there a way to do this in an elegant manner using the mutate() function?
Input Table
Output Table
As you can see, the 'Preceding_visit_id' column refers to the person's previous visit, defined using the Visit_Id column.
Please note that this is a transformation for one of the columns in a huge program, so anything elegant would be helpful.
The dput() output is here:
structure(list(Person_Id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3), Visit_Id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14), Purpose = c("checkup", "checkup", "checkup", "checkup",
"checkup", "checkup", "checkup", "checkup", "checkup", "checkup",
"checkup", "checkup", "checkup", "checkup"), Preceding_visit_id = c(NA,
1, 2, 3, 4, NA, 6, 7, 8, 9, 10, NA, 12, 12)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec =
structure(list(
cols = list(Person_Id = structure(list(), class = c("collector_double",
"collector")), Visit_Id = structure(list(), class = c("collector_double",
"collector")), Purpose = structure(list(), class =
c("collector_character",
"collector")), Preceding_visit_id = structure(list(), class =
c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))'''
The Person_Id fields in your examples don't match.
I'm not sure if this is what you're after, but from your dput() I have created an input data frame by removing the last column:
df_input <- df_output %>%
select(-Preceding_visit_id)
Then done this:
df_input %>%
group_by(Person_Id) %>%
mutate(Preceding_visit_id = lag(Visit_Id))
And the output is this:
# A tibble: 14 x 4
# Groups: Person_Id [3]
Person_Id Visit_Id Purpose Preceding_visit_id
<dbl> <dbl> <chr> <dbl>
1 1 1 checkup NA
2 1 2 checkup 1
3 1 3 checkup 2
4 1 4 checkup 3
5 1 5 checkup 4
6 2 6 checkup NA
7 2 7 checkup 6
8 2 8 checkup 7
9 2 9 checkup 8
10 2 10 checkup 9
11 2 11 checkup 10
12 3 12 checkup NA
13 3 13 checkup 12
14 3 14 checkup 13
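For completeness, a base R sketch of the same grouped shift (assuming df_input as built above), with ave() applying a shift-by-one within each Person_Id:
# Sketch: within each Person_Id, drop the last Visit_Id and prepend NA,
# which yields the previous visit for every row.
df_input$Preceding_visit_id <- ave(df_input$Visit_Id, df_input$Person_Id,
                                   FUN = function(x) c(NA, head(x, -1)))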

Joining two dataframes to remove NaN values in the first dataframe

I would like to merge two dataframe columns.
I have df1, which has a specific column (df$col1). This column has rows 1-100; certain rows have NA values (let's say rows 10, 15, 20, 50, 69).
Dataframe 2 has rows 10,15,20,50,69.
Is it possible to merge df2 into df$col1 such that only the NA values in df$col1 are filled by df2, depending on the index number for each dataset?
I tried this but instead got a dataframe that did not look anything like what I want:
merge(brfss2$pa1min_,df,by.x=1,by.y=1,all.x=TRUE,all.y=TRUE)
Here are the two dataframes
Dataframe1:
1 NA
2 110
3 NA
4 35
5 NA
6 120
7 280
8 30
9 240
10 260
11 322
12 NA
Dataframe 2:
1 2127.6
3 1403.0
5 198.0
12 112.8
A different method: I imported your data and gave the columns names:
df <- structure(list(col1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), col2 = c(NA, 110, NA, 35, NA, 120, 280, 30, 240, 260, 322,
NA)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-12L), spec = structure(list(cols = list(col1 = structure(list(), class = c("collector_double",
"collector")), col2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
df2 <- structure(list(col1 = c(1, 3, 5, 12), col2 = c(2127.6, 1403,
198, 112.8)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), spec = structure(list(cols = list(
col1 = structure(list(), class = c("collector_double", "collector"
)), col2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
Using the tidyverse, you can merge and then add a new column that takes whichever value is not NA:
library(tidyverse)
df %>%
merge(df2, by = "col1", all.x = TRUE) %>%
mutate(new_col = if_else(is.na(col2.x), col2.y, col2.x)) %>%
select(new_col)
new_col
1 2127.6
2 110.0
3 1403.0
4 35.0
5 198.0
6 120.0
7 280.0
8 30.0
9 240.0
10 260.0
11 322.0
12 112.8
I wrote the package safejoin, which solves this very succinctly:
# devtools::install_github("moodymudskipper/safejoin")
safe_left_join(df1,df2, by = "col1", conflict = dplyr::coalesce)
# # A tibble: 12 x 2
# col1 col2
# <dbl> <dbl>
# 1 1 2128.
# 2 2 110
# 3 3 1403
# 4 4 35
# 5 5 198
# 6 6 120
# 7 7 280
# 8 8 30
# 9 9 240
# 10 10 260
# 11 11 322
# 12 12 113.
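If you prefer to stay within dplyr itself, a sketch using rows_patch() (assuming dplyr >= 1.0.0 and the df/df2 objects from the earlier answer) fills only the NA values in df from the matching rows of df2:
library(dplyr)
# Sketch: patch NA values in df$col2 with df2$col2, matched by col1.
rows_patch(df, df2, by = "col1")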

Aggregate by group AND add column to data frame in R [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
For a sample dataframe:
df1 <- structure(list(place = c("a", "a", "b", "b", "b", "b", "c", "c",
"c", "d", "d"), animal = c("cat", "bear", "cat", "bear", "pig",
"goat", "cat", "bear", "goat", "goat", "bear"), number = c(5,
6, 7, 4, 5, 6, 8, 5, 3, 7, 4)), .Names = c("place", "animal",
"number"), row.names = c(NA, -11L), spec = structure(list(cols = structure(list(
place = structure(list(), class = c("collector_character",
"collector")), animal = structure(list(), class = c("collector_character",
"collector")), number = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("place", "animal", "number")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df",
"tbl", "data.frame"))
I want to create a variable 'sum' which sums the 'number' column by 'place' (regardless of animal), and adds it to the dataframe.
The command below:
df1$sum <- aggregate(df1$number, by=list(Category=df1$place), FUN=sum)
... does the sum but can't be assigned as a new column, because aggregate() returns only one row per place while the data frame has 11 rows (hence why we get this error):
Error in `$<-.data.frame`(`*tmp*`, sum, value = list(Category = c("a", :
replacement has 4 rows, data has 11
Any ideas how I add this extra column onto my dataframe?
Since you have a tibble, first a dplyr solution. Next a base R version.
using dplyr:
df1 %>%
group_by(place) %>%
mutate(sum_num = sum(number))
# A tibble: 11 x 4
# Groups: place [4]
place animal number sum_num
<chr> <chr> <dbl> <dbl>
1 a cat 5 11
2 a bear 6 11
3 b cat 7 22
4 b bear 4 22
5 b pig 5 22
6 b goat 6 22
7 c cat 8 16
8 c bear 5 16
9 c goat 3 16
10 d goat 7 11
11 d bear 4 11
using base R:
df1$sum_num <- ave(df1$number, df1$place, FUN = sum)
# A tibble: 11 x 4
place animal number sum_num
<chr> <chr> <dbl> <dbl>
1 a cat 5 11
2 a bear 6 11
3 b cat 7 22
4 b bear 4 22
5 b pig 5 22
6 b goat 6 22
7 c cat 8 16
8 c bear 5 16
9 c goat 3 16
10 d goat 7 11
11 d bear 4 11
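If you want to stay with aggregate(), the usual pattern is to summarise first and then merge the result back onto the original data frame (a sketch; note that merge() may reorder the rows):
# Sketch: one row per place with the summed number, then join back to df1.
sums <- aggregate(number ~ place, data = df1, FUN = sum)
names(sums)[2] <- "sum_num"
df1_with_sums <- merge(df1, sums, by = "place")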
