How to add multiple columns to a tibble? - r

I want to add multiple empty columns to a tibble.
The names of the new columns are stored in 'columnsToAdd'
> columnsToAdd
[1] "column1" "column2" "column3" "column4" "column5"
When I run the following code lines, ...
library(dplyr)
someTibble <- tibble(name = paste("Name", 1:10))
columnsToAdd <- paste("column", 1:5, sep = "")
someTibble %>%
tibble::add_column(columnsToAdd = NA)
... I get this result, ...
# A tibble: 10 x 2
name columnsToAdd
<chr> <lgl>
1 Name 1 NA
2 Name 2 NA
3 Name 3 NA
4 Name 4 NA
5 Name 5 NA
6 Name 6 NA
7 Name 7 NA
8 Name 8 NA
9 Name 9 NA
10 Name 10 NA
... instead, I want to get the following result:
# A tibble: 10 x 6
name column1 column2 column3 column4 column5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name 1 NA NA NA NA NA
2 Name 2 NA NA NA NA NA
3 Name 3 NA NA NA NA NA
4 Name 4 NA NA NA NA NA
5 Name 5 NA NA NA NA NA
6 Name 6 NA NA NA NA NA
7 Name 7 NA NA NA NA NA
8 Name 8 NA NA NA NA NA
9 Name 9 NA NA NA NA NA
10 Name 10 NA NA NA NA NA

Just assign to them as if they already existed. This works for tibbles as well as for base data frames and data.table tables.
For example:
> someTibble <- tibble(name = paste("Name", 1:10))
> columnsToAdd <- paste("column", 1:3,sep="")
That gives me one column:
> head(someTibble)
# A tibble: 6 x 1
name
<chr>
1 Name 1
2 Name 2
3 Name 3
4 Name 4
5 Name 5
6 Name 6
Then I do:
> someTibble[,columnsToAdd]=NA
and magically the columns appear:
> head(someTibble)
# A tibble: 6 x 4
name column1 column2 column3
<chr> <lgl> <lgl> <lgl>
1 Name 1 NA NA NA
2 Name 2 NA NA NA
3 Name 3 NA NA NA
4 Name 4 NA NA NA
5 Name 5 NA NA NA
6 Name 6 NA NA NA
Note that this is not really magic: it is standard base R behaviour that predates R itself.
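The desired output in the question shows <dbl> columns, whereas assigning a bare NA produces logical columns. A minimal sketch, assuming the same someTibble and columnsToAdd as above, is to assign a typed missing value instead:

```r
library(tibble)

someTibble <- tibble(name = paste("Name", 1:10))
columnsToAdd <- paste0("column", 1:5)

# NA_real_ is the double-typed missing value, so the new columns are <dbl>
someTibble[, columnsToAdd] <- NA_real_

# inspect the storage type of each added column
sapply(someTibble[columnsToAdd], typeof)
```

NA_integer_ and NA_character_ can be used in the same way if a different column type is wanted.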

We can build a named list of NA values, one per new column, and splice it into add_column() with !!! (the "triple bang" operator):
columnsToAdd <- paste("column", 1:5, sep = "")
someTibble %>%
tibble::add_column(!!!rlang::set_names(as.list(rep(NA, length(columnsToAdd))), nm = columnsToAdd))
# A tibble: 6 x 6
name column1 column2 column3 column4 column5
<chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 Name 1 NA NA NA NA NA
2 Name 2 NA NA NA NA NA
3 Name 3 NA NA NA NA NA
4 Name 4 NA NA NA NA NA
5 Name 5 NA NA NA NA NA
6 Name 6 NA NA NA NA NA
However, I think @MrGumble may have missed something. From the documentation of add_column {tibble}:
... Name-value pairs, passed on to tibble(). All values must have one element for each row in the data frame, or be of length 1. These arguments are passed on to tibble(), and therefore also support unquote via !! and unquote-splice via !!!.
and here is an example from tibble {tibble}
tibble(!!! list(x = rlang::quo(1:10), y = quote(x * 2)))

What you are seeing in tibble::add_column(columnsToAdd = NA) is a consequence of the way tidyverse functions such as those in dplyr and tidyr capture their arguments (tidy evaluation, also called quasiquotation). If you check the definition:
> args(add_column)
function (.data, ..., .before = NULL, .after = NULL)
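you'll see that the new columns arrive through ... as name-value pairs. The name on the left of = is used literally as the column name; it is not looked up as a variable, so columnsToAdd = NA creates a single column called columnsToAdd.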
An entirely different approach is to create a matrix (or data.frame, whatever tickles your fancy), and smack it onto the side of someTibble:
extra <- matrix(NA_real_, nrow=nrow(someTibble), ncol=length(columnsToAdd), dimnames=list(NULL, columnsToAdd))
dplyr::bind_cols(someTibble, as.data.frame(extra))
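To illustrate the point about argument names being taken literally, here is a small sketch. It uses dplyr::mutate() rather than add_column(), purely because mutate() is a familiar place to see !! combined with :=. The variable newName is hypothetical and not part of the original question.

```r
library(dplyr)

someTibble <- tibble(name = paste("Name", 1:10))
newName <- "column1"  # hypothetical variable holding the desired column name

# the argument name is used as-is: this creates a column called "newName"
literal <- mutate(someTibble, newName = NA_real_)

# !! on the left of := evaluates newName first, creating a column "column1"
unquoted <- mutate(someTibble, !!newName := NA_real_)

names(literal)
names(unquoted)
```

This handles only one name at a time, which is why the answers above splice a whole named list with !!! or fall back to base assignment.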

Related

Is there a cleaner way to return a data point than this: SchIndxRead %>% select(,.DormList) %>% filter(SchIndxRead$.College.Lookup=="MIAD")?

I'd like to be able to select out data from my data.frame simply and elegantly, but I'm new to R.
This worked:
SchIndxRead %>% select(,.DormList) %>% filter(SchIndxRead$.College.Lookup=="MIAD")
I tried using this:
SchIndxRead[SchIndxRead$.College.Lookup=='MIAD',".DormList"]
And expected just "Two50Two"
but got this result:
> [1] "Two50Two" NA NA NA NA
> [6] NA NA NA NA NA
> [11] NA NA NA NA NA
> [16] NA NA NA NA NA
> [21] NA NA NA NA NA
Your column .College.Lookup probably has NA values, such that the expression SchIndxRead$.College.Lookup=="MIAD" returns TRUE's and FALSE's, but also NA's.
When you try to subset a variable with a vector that contains NA's, the result will also have NA's:
set.seed(10)
df = tibble(a = 1:10, b = sample(c(0, 1, NA), 10, TRUE))
> df
# A tibble: 10 × 2
a b
<int> <dbl>
1 1 NA
2 2 0
3 3 1
4 4 NA
5 5 1
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
> df$b == 1
[1] NA FALSE TRUE NA TRUE NA NA NA NA NA
> df[df$b == 1, "a"]
# A tibble: 9 × 1
a
<int>
1 NA
2 3
3 NA
4 5
5 NA
6 NA
7 NA
8 NA
9 NA
That's why there were NA's in your second attempt.
But dplyr::filter() "ignores" NA's: it keeps only the rows where the condition is TRUE and drops the rows where it is FALSE or NA. That's why there weren't NA's in your first attempt.
Two hints to improve your code:
It would've been better to change the order of select and filter:
SchIndxRead %>% filter(.College.Lookup == "MIAD") %>% select(.DormList)
This way you don't have to add the SchIndxRead$ later.
You might prefer using pull():
SchIndxRead %>% filter(.College.Lookup == "MIAD") %>% pull(.DormList)
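If you prefer to stay with base subsetting, the NA rows can also be avoided by turning the logical condition into positions with which(), or by using %in%, which never returns NA. A minimal sketch on made-up data (the data frame below is an assumption, not the asker's SchIndxRead):

```r
# toy data with missing values in the column being tested
df <- data.frame(a = 1:5, b = c(NA, 0, 1, NA, 1))

# plain logical subsetting: NA in the condition produces NA in the result
with_na <- df[df$b == 1, "a"]

# which() keeps only the positions where the condition is TRUE
with_which <- df[which(df$b == 1), "a"]

# %in% returns FALSE instead of NA for missing values
with_in <- df[df$b %in% 1, "a"]

with_na
with_which
with_in
```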

count occurrences across columns and match to ID column

I have a df of 100+ columns and not all are filled
> head(othertopics,20)
# A tibble: 20 x 118
Q6 Q10.1 Q10.2 Q10.3 Q10.4 Q10.5 Q10.6 Q10.7 Q10.8 Q10.9 Q10.10 Q10.11 Q10.12 Q10.13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 294 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 103 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 87 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
7 136 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
9 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
10 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
11 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
12 19 NA NA NA 4 NA NA NA NA NA NA NA NA NA
13 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
14 108 NA NA NA NA NA NA NA NA NA NA NA NA NA
Q6 is an ID.
Across Q10.1 to Q10.117 there are different values assigned to each ID (see row 12).
Using unlist, I managed to get the frequency with which each value is mentioned across the 117 columns, but I need to match the values to their respective IDs.
So basically I need to match the ID column against the 117 columns and get the frequency of each column.
othertopics<-data.frame(table(unlist(TableTopic2[,22:138])))
Var1 Freq
10 1
100 4
101 1
102 12
103 7
104 21
105 36
106 1
So, for example, variable 105 appeared 36 times across 17 distinct IDs in column Q6 (I counted this number in Excel).
So far I only have the first half of my solution, as I still need to know which IDs are associated with each variable (i.e. the 17 values I counted).
Also note that each variable column contains the number of its variable; for example, column Q10.105 is for variable 105, which has a frequency of 36.
I hope I was able to make this clear.
Thanks!
This question is not particularly clear, but I'll do my best. I think the way to tidy this data is to pivot all of the non-id columns to one column (I call it 'col_name') and then have another column with all of the values (mostly NA's; I call it 'numbered_var' for numbered variable). Then, you can aggregate based on the numbered_variable column.
This example is obviously not reproducible, so I constructed a simplified version of your data (I think):
library(dplyr)
library(tidyr)
df <- tibble(
id = 1:5,
Q1 = c(NA_integer_, 10L, NA_integer_, 10L, NA_integer_),
Q2 = c(NA_integer_, NA_integer_, 11L, NA_integer_, 11)
)
It looks like this:
# A tibble: 5 × 3
id Q1 Q2
<int> <int> <dbl>
1 1 NA NA
2 2 10 NA
3 3 NA 11
4 4 10 NA
5 5 NA 11
Next, I use tidyr::pivot_longer() to put the column names containing Q into a column, with their associated value in another column:
df <- pivot_longer(
df,
cols = contains("Q"), # you will want to use this, but first remove the Q from the id column name in your data
names_to = "col_name",
values_to = "numbered_var"
)
This makes the data long:
# A tibble: 10 × 3
id col_name numbered_var
<int> <chr> <dbl>
1 1 Q1 NA
2 1 Q2 NA
3 2 Q1 10
4 2 Q2 NA
5 3 Q1 NA
6 3 Q2 11
7 4 Q1 10
8 4 Q2 NA
9 5 Q1 NA
10 5 Q2 11
With your data you should still end up with three columns, but each id would repeat once for every pivoted column, just as each id repeats twice for the two columns here.
Next, I would group by the variables, which seem to be of interest, and list the unique id's that have the variables in a new column:
df <- group_by(df, numbered_var)
df <- summarize(
df,
var_appearances = n(),
ids = list(unique(id))
)
Now, the data frame looks like this:
# A tibble: 3 × 3
numbered_var var_appearances ids
<dbl> <int> <list>
1 10 2 <int [2]>
2 11 2 <int [2]>
3 NA 6 <int [5]>
That ids column is a list-column with a vector of ids:
print(df$ids)
[[1]]
[1] 2 4
[[2]]
[1] 3 5
[[3]]
[1] 1 2 3 4 5
I'm not sure this is exactly what you'll be able to do, but hopefully it sets you in the right direction.
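As a variation on the steps above, the NA group can be dropped while pivoting, and the ids can be collapsed into a readable string instead of a list-column. This is only a sketch on the same simplified data; the column names n_ids and ids are illustrative choices, not something from the original question.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  id = 1:5,
  Q1 = c(NA, 10L, NA, 10L, NA),
  Q2 = c(NA, NA, 11L, NA, 11L)
)

# values_drop_na = TRUE discards the empty cells during the pivot
long <- pivot_longer(
  df,
  cols = starts_with("Q"),
  names_to = "col_name",
  values_to = "numbered_var",
  values_drop_na = TRUE
)

# one row per variable: how often it appears and which ids mention it
per_var <- long %>%
  group_by(numbered_var) %>%
  summarize(
    var_appearances = n(),
    n_ids = n_distinct(id),
    ids = paste(sort(unique(id)), collapse = ", ")
  )

per_var
```

Keeping both var_appearances and n_ids makes it possible to tell apart a variable mentioned many times by a few ids from one mentioned once by many ids.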

Append several tables into one with one different column in R

I have a data frame from which I made several tables. I would like to combine all of the tables into one data frame in order to export it to Excel. The only issue is that the first variable is different in each table, so bind_rows will not work.
Dummy sample data:
df1 = data.frame(Id = c(11:16), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1))
df2 = data.frame(HH_size = c(1:6 ), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1) )
let's say I made these tables
df11<- df1 %>%
dplyr::group_by(date) %>%
count(Id) %>%
tidyr::spread(date,n)
df22<- df2 %>%
dplyr::group_by(date) %>%
count(HH_size) %>%
tidyr::spread(date,n)
df11
Id `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<int> <int> <int> <int> <int> <int> <int>
1 11 1 NA NA NA NA NA
2 12 NA 1 NA NA NA NA
3 13 NA NA 1 NA NA NA
4 14 NA NA NA 1 NA NA
5 15 NA NA NA NA 1 NA
6 16 NA NA NA NA NA 1
This will not work
list <- c("df11" , "df22")
list %>% map_df(bind_rows)
Error: Argument 1 must have names
here is my desired output:
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
Id 11 1 NA NA NA NA NA
Id 12 NA 1 NA NA NA NA
Id 13 NA NA 1 NA NA NA
Id 14 NA NA NA 1 NA NA
Id 15 NA NA NA NA 1 NA
Id 16 NA NA NA NA NA 1
HH_size 1 1 NA NA NA NA NA
HH_size 2 NA 1 NA NA NA NA
HH_size 3 NA NA 1 NA NA NA
HH_size 4 NA NA NA 1 NA NA
HH_size 5 NA NA NA NA 1 NA
HH_size 6 NA NA NA NA NA 1
This will serve your purpose. The steps are:
In dplyr/magrittr, . refers to the result passed in by the previous pipe, so names(.)[1] extracts the name of the first column, which is stored in a new column named label.
The first column itself is then copied into a new column named cat using .x[[1]], the first column of each element being iterated over. I think you could also use . instead of .x here, since the value just before the pipe is .x.
The original first column is dropped with select(-1).
The columns are rearranged so that label and cat come first.
map_df(list(df11, df22), ~.x %>%
mutate(label = names(.)[1],
cat = .x[[1]]) %>%
select(-1) %>%
select(label, cat, everything()))
# A tibble: 12 x 8
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<chr> <int> <int> <int> <int> <int> <int> <int>
1 Id 11 1 NA NA NA NA NA
2 Id 12 NA 1 NA NA NA NA
3 Id 13 NA NA 1 NA NA NA
4 Id 14 NA NA NA 1 NA NA
5 Id 15 NA NA NA NA 1 NA
6 Id 16 NA NA NA NA NA 1
7 HH_size 1 1 NA NA NA NA NA
8 HH_size 2 NA 1 NA NA NA NA
9 HH_size 3 NA NA 1 NA NA NA
10 HH_size 4 NA NA NA 1 NA NA
11 HH_size 5 NA NA NA NA 1 NA
12 HH_size 6 NA NA NA NA NA 1
Put all the data frames in a list and then you can do:
library(tidyverse)
list_df <- lst(df1, df2)
map_df(list_df, ~{
col <- names(.x)[1]
.x %>%
count(.data[[col]], date) %>%
pivot_wider(names_from = date, values_from = n) %>%
mutate(label = col) %>%
rename_with(~'cat', 1)
})
# cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06` label
# <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11 1 NA NA NA NA NA Id
# 2 12 NA 1 NA NA NA NA Id
# 3 13 NA NA 1 NA NA NA Id
# 4 14 NA NA NA 1 NA NA Id
# 5 15 NA NA NA NA 1 NA Id
# 6 16 NA NA NA NA NA 1 Id
# 7 1 1 NA NA NA NA NA HH_size
# 8 2 NA 1 NA NA NA NA HH_size
# 9 3 NA NA 1 NA NA NA HH_size
#10 4 NA NA NA 1 NA NA HH_size
#11 5 NA NA NA NA 1 NA HH_size
#12 6 NA NA NA NA NA 1 HH_size
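The relabelling steps can also be written as a small named function, which may be easier to read than an inline formula. This is a sketch under assumptions: t1 and t2 are hand-built stand-ins for the wide tables df11 and df22, and relabel() is a hypothetical helper name.

```r
library(dplyr)

# stand-ins for the already-spread tables (two dates only, for brevity)
t1 <- tibble(Id = 11:12, `2015-01-01` = c(1L, NA), `2015-01-02` = c(NA, 1L))
t2 <- tibble(HH_size = 1:2, `2015-01-01` = c(1L, NA), `2015-01-02` = c(NA, 1L))

# hypothetical helper: move the first column's name into `label`
# and give the first column the common name `cat`
relabel <- function(tbl) {
  label <- names(tbl)[1]
  names(tbl)[1] <- "cat"
  tbl$label <- label
  tbl[, c("label", "cat", setdiff(names(tbl), c("label", "cat")))]
}

# once every table has the same column names, bind_rows() can stack them
combined <- bind_rows(lapply(list(t1, t2), relabel))
combined
```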

Running a forloop over a header in R

I am trying to switch from Stata to R and need help with a for loop.
Context:
I have data (a survey questionnaire) with 5 blocks of 10 questions each. B1B2 is the 2nd question of the first block. My rows are people (who can only be in 1 block each), so each person has values for their own block and NAs for the variables in the other blocks (e.g. a person in the 3rd block will have observations for B3B1-10 and NA for B1B1-10, B2B1-10, etc.). I am trying to combine all the blocks into B1-10. Here's a header of my data:
B1B1 B1B2 B1B3 B1B4 B1B5 B1B6 B1B7 B1B8 B1B9 B1B10 B2B1 B2B2 B2B3 B2B4 B2B5 B2B6 B2B7 B2B8 B2B9 B2B10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA 1 2 2 2 2 1 2 1 1 2
2 NA NA NA NA NA NA NA NA NA NA 1 1 1 2 2 2 2 1 1 1
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA 2 2 2 2 1 2 2 1 1 1
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I got it working for 1 instance using the unite function:
data %>% unite("B1", B1B1,B2B1,B3B1,B4B1,B5B1, na.rm = TRUE, remove = FALSE) -> data
I want to loop this from B1 to B10 as such
for (i in (1:10)){ data %>% unite("paste0("B",i)", paste0("B1B",i),paste0("B2B",i),paste0("B3B",i),paste0("B4B",i),paste0("B5B",i), na.rm = TRUE, remove = FALSE) -> data}
but I'm getting an unexpected symbol error. I think I misunderstand how for loops work in R, and any explanation of why my code doesn't run would be greatly appreciated.
Here is my working Stata code if it helps:
forvalues i=1(1)10{
gen b`i'=B1B`i' if B1B`i' != .
replace b`i'=B2B`i' if B2B`i' != .
replace b`i'=B3B`i' if B3B`i' != .
replace b`i'=B4B`i' if B4B`i' != .
replace b`i'=B5B`i' if B5B`i' != .
}
Here is an idea: split the data frame into a list with one element per question, then map over the elements of that list.
Example data: Three Blocks with 2 Questions
df <- data.frame(B1B1 = c(1,2, rep(NA, 4)),
B1B2 = c(3,4, rep(NA, 4)),
B2B1 = c(NA,NA,5,6,NA,NA),
B2B2 = c(NA,NA,7,8,NA,NA),
B3B1 = c(rep(NA,4), 1,2),
B3B2 = c(rep(NA,4), 3,4))
B1B1 B1B2 B2B1 B2B2 B3B1 B3B2
1 1 3 NA NA NA NA
2 2 4 NA NA NA NA
3 NA NA 5 7 NA NA
4 NA NA 6 8 NA NA
5 NA NA NA NA 1 3
6 NA NA NA NA 2 4
Code:
library(tidyverse)
split.default(df, str_extract(names(df), "..$")) %>%
map_df(~ coalesce(!!! .x))
Result:
# A tibble: 6 x 2
B1 B2
<dbl> <dbl>
1 1 3
2 2 4
3 5 7
4 6 8
5 1 3
6 2 4
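For readers who want to keep the for loop from the question, here is a hedged sketch. The loop in the question fails to parse because of the nested double quotes: the string ends at the second quote, so "paste0(" is followed by the bare symbol B, hence "unexpected symbol". Building the column names as strings and indexing with [[ ]] avoids the problem. Note also that the "..$" pattern above keeps only the last two characters of each name, so a question number such as 10 would need a different pattern. The toy data below (three blocks, questions 1 and 10) is an assumption for illustration.

```r
library(dplyr)

# toy data: 3 blocks, questions 1 and 10 only
df <- data.frame(
  B1B1  = c(1, NA, NA), B2B1  = c(NA, 5, NA), B3B1  = c(NA, NA, 9),
  B1B10 = c(3, NA, NA), B2B10 = c(NA, 7, NA), B3B10 = c(NA, NA, 2)
)

for (i in c(1, 10)) {
  # names of the same question across all blocks, e.g. "B1B1", "B2B1", "B3B1"
  block_cols <- paste0("B", 1:3, "B", i)
  # coalesce() takes the first non-missing value across the block columns
  df[[paste0("B", i)]] <- do.call(coalesce, unname(as.list(df[block_cols])))
}

df[c("B1", "B10")]
```

This mirrors the Stata gen/replace sequence: each person has a value in only one block, so the first non-missing value is the one to keep.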

How to extract values of existing variable and paste them in top rows of dataframe (using R)

Probably there's a very easy solution to this but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new which is the exact description of what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope that this is self-explanatory. What I need is to take the values of "value" from the rows where id is NA (i.e. the last five rows) and paste them into the first five rows (i.e. where id is not NA) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I'd basically need to cut out the bottom half and paste it as new variable(s) in the top section of the dataframe. Hope this makes sense.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
In case you have more rows with a non-NA id compared to NA id you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
This removes the leading run of NA's from value and pads the end with NA's so that the result has as many elements as the data frame has rows. The id's are not considered here.
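The `length<-` call may look cryptic, so here is a small sketch of what it does, using a plain vector with the same values as the value column. Lengthening a vector with `length<-` pads it with NA, which is equivalent to appending the NA's by hand:

```r
value <- c(rep(NA, 5), 7, NA, 4, 1, 9)

# position of the first non-missing element
first <- which(!is.na(value))[1]

# everything from the first non-missing element to the end
tail_part <- value[first:length(value)]

# explicit version: append as many NA's as were cut off at the front
padded <- c(tail_part, rep(NA, first - 1))

# `length<-` version: extending the vector fills the new slots with NA
extended <- `length<-`(tail_part, length(value))

identical(padded, extended)
```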
