I am trying to switch from Stata to R and need help with a for loop.
Context:
I have survey questionnaire data with 5 blocks of 10 questions each; B1B2 is the 2nd question of the first block. My rows are people, each of whom can only be in one block, so every person has values for their own block and NAs for the variables of the other blocks (e.g. a person in the 3rd block has observations for B3B1-B3B10 and NA for B1B1-B1B10, B2B1-B2B10, etc.). I am trying to combine all the blocks into B1-B10. Here's the head of my data:
B1B1 B1B2 B1B3 B1B4 B1B5 B1B6 B1B7 B1B8 B1B9 B1B10 B2B1 B2B2 B2B3 B2B4 B2B5 B2B6 B2B7 B2B8 B2B9 B2B10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA 1 2 2 2 2 1 2 1 1 2
2 NA NA NA NA NA NA NA NA NA NA 1 1 1 2 2 2 2 1 1 1
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA 2 2 2 2 1 2 2 1 1 1
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I got it working for 1 instance using the unite function:
data %>% unite("B1", B1B1,B2B1,B3B1,B4B1,B5B1, na.rm = TRUE, remove = FALSE) -> data
I want to loop this from B1 to B10 as such
for (i in (1:10)){ data %>% unite("paste0("B",i)", paste0("B1B",i),paste0("B2B",i),paste0("B3B",i),paste0("B4B",i),paste0("B5B",i), na.rm = TRUE, remove = FALSE) -> data}
but I'm getting an "unexpected symbol" error. I think I misunderstand how for loops work in R, and any explanation of why my code doesn't run would be greatly appreciated.
Here is my working Stata code, if it helps:
forvalues i=1(1)10{
gen b`i'=B1B`i' if B1B`i' != .
replace b`i'=B2B`i' if B2B`i' != .
replace b`i'=B3B`i' if B3B`i' != .
replace b`i'=B4B`i' if B4B`i' != .
replace b`i'=B5B`i' if B5B`i' != .
}
Here is an idea: split the data frame into a list with one element per question, then map over each element of the list.
Example data: Three Blocks with 2 Questions
df <- data.frame(B1B1 = c(1,2, rep(NA, 4)),
B1B2 = c(3,4, rep(NA, 4)),
B2B1 = c(NA,NA,5,6,NA,NA),
B2B2 = c(NA,NA,7,8,NA,NA),
B3B1 = c(rep(NA,4), 1,2),
B3B2 = c(rep(NA,4), 3,4))
B1B1 B1B2 B2B1 B2B2 B3B1 B3B2
1 1 3 NA NA NA NA
2 2 4 NA NA NA NA
3 NA NA 5 7 NA NA
4 NA NA 6 8 NA NA
5 NA NA NA NA 1 3
6 NA NA NA NA 2 4
Code:
library(tidyverse)
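# "B\\d+$" keeps the trailing question id (B1 ... B10); "..$" would label question 10 as "10" instead of "B10"
split.default(df, str_extract(names(df), "B\\d+$")) %>%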
map_df(~ coalesce(!!! .x))
Result:
# A tibble: 6 x 2
B1 B2
<dbl> <dbl>
1 1 3
2 2 4
3 5 7
4 6 8
5 1 3
6 2 4
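As for why the loop in the question fails: inside "paste0("B",i)" the inner double quotes end the string early, so R reads the string "paste0(" followed by the bare symbol B, hence the unexpected symbol error. Below is a minimal base R sketch that mirrors the Stata gen/replace logic on the example data above. The names new_col and src are only illustrative, and the block and question counts (3 and 2) would be 5 and 10 for the real data.

```r
# Example data from above: three blocks with two questions each
df <- data.frame(B1B1 = c(1, 2, rep(NA, 4)),
                 B1B2 = c(3, 4, rep(NA, 4)),
                 B2B1 = c(NA, NA, 5, 6, NA, NA),
                 B2B2 = c(NA, NA, 7, 8, NA, NA),
                 B3B1 = c(rep(NA, 4), 1, 2),
                 B3B2 = c(rep(NA, 4), 3, 4))

for (i in 1:2) {                      # loop over questions
  new_col <- paste0("B", i)           # build the name as a string, no nested quotes
  df[[new_col]] <- NA_real_           # like Stata's gen
  for (b in 1:3) {                    # loop over blocks
    src <- df[[paste0("B", b, "B", i)]]
    # like Stata's replace ... if src != .
    df[[new_col]] <- ifelse(is.na(src), df[[new_col]], src)
  }
}

df[c("B1", "B2")]
```

As in the Stata version, a later block overwrites an earlier one if both happen to be non-missing.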
Sample data:
df<-data.frame(Country = c("FR", "FR", "US", "US", "US", "US", "AU", "UK", "UK", "UK"),
Name = c("Jean","Jean","Rose","Rose","Rose","Rose","Liam","Mark","Mark","Mark"),
A=c(2,NA,NA,1,3,NA,1,2,NA,NA),
B=c(2,5,NA,1,NA,2,1,NA,3,NA),
C=c(2,NA,4,1,NA,NA,NA,NA,NA,NA),
D=c(NA,3,NA,NA,4,4,1,2,4,4))
Input:
Country Name A B C D
1 FR Jean 2 2 2 NA
2 FR Jean NA 5 NA 3
3 US Rose NA NA 4 NA
4 US Rose 1 1 1 NA
6 US Rose 3 NA NA 4
7 US Rose NA 2 NA 4
8 AU Liam 1 1 NA 1
9 UK Mark 2 NA NA 2
10 UK Mark NA 3 NA 4
11 UK Mark NA NA NA 4
Desired output:
Country Name A B C D A B C D A B C D A B C D
1 FR Jean 2 2 2 NA NA 5 NA 3
2 US Rose NA NA 4 NA 1 1 1 NA 3 NA NA 4 NA 2 NA 4
3 AU Liam 1 1 NA 1
4 UK Mark 2 NA NA 2 NA 3 NA 4 NA NA NA 4
As you can see from the data, the goal is:
if Country and Name are identical in consecutive rows, move the data in columns A-D of those rows into new A-D columns.
The actual table I have contains more than 11 rows. In subsequent rows (after row 11), the Country and Name data may be repeated 1, 2, 3, ..., n times. How do I write a condition such that, as long as the row below is identical, the data in ABCD is automatically moved into new ABCD columns?
You won't be able to have two columns with the same name. I don't know if it will help you, but you could do this:
Code
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = A:D) %>%
group_by(Country,name) %>%
mutate(name = paste0(name,row_number())) %>%
pivot_wider(names_from = name,values_from = value)
Output
# A tibble: 4 x 18
# Groups: Country [4]
Country Name A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 FR Jean 2 2 2 NA NA 5 NA 3 NA NA NA NA NA NA
2 US Rose NA NA 4 NA 1 1 1 NA 3 NA NA 4 NA 2
3 AU Liam 1 1 NA 1 NA NA NA NA NA NA NA NA NA NA
4 UK Mark 2 NA NA 2 NA 3 NA 4 NA NA NA 4 NA NA
# ... with 2 more variables: C4 <dbl>, D4 <dbl>
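If the same Country can occur with more than one Name, it may be safer to number the repeats within both keys. The following is only a sketch of that variant, not the code that produced the output above: the helper column occurrence is a made-up name, and names_vary needs tidyr 1.2.0 or later.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Country = c("FR", "FR", "US", "US", "US", "US", "AU", "UK", "UK", "UK"),
                 Name = c("Jean", "Jean", "Rose", "Rose", "Rose", "Rose", "Liam", "Mark", "Mark", "Mark"),
                 A = c(2, NA, NA, 1, 3, NA, 1, 2, NA, NA),
                 B = c(2, 5, NA, 1, NA, 2, 1, NA, 3, NA),
                 C = c(2, NA, 4, 1, NA, NA, NA, NA, NA, NA),
                 D = c(NA, 3, NA, NA, 4, 4, 1, 2, 4, 4))

wide <- df %>%
  group_by(Country, Name) %>%
  mutate(occurrence = row_number()) %>%   # 1, 2, ... within each Country/Name pair
  ungroup() %>%
  pivot_wider(names_from = occurrence,
              values_from = A:D,
              names_sep = "",             # A1 rather than A_1
              names_vary = "slowest")     # order columns as A1 B1 C1 D1 A2 ...

wide
```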
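It is possible to have columns with the same name, but it is not recommended.
# split the A-D columns by the Country/Name combination, dropping unused combinations
. <- split(df[-1:-2], df[1:2], TRUE)
# flatten each group row by row into one vector
. <- lapply(., \(x) c(t(x)))
# pad every vector with NA to the longest length and stack them into a matrix
. <- do.call(rbind, lapply(., `length<-`, max(lengths(.))))
# repeat the original column names across the widened matrix
colnames(.) <- rep(names(df)[-1:-2], length.out=ncol(.))
# attach the matching Country and Name columns
cbind(df[match(row.names(.), interaction(df[1:2], drop=TRUE)), 1:2], .)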
# Country Name A B C D A B C D A B C D A B C D
#1 FR Jean 2 2 2 NA NA 5 NA 3 NA NA NA NA NA NA NA NA
#7 AU Liam 1 1 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA
#8 UK Mark 2 NA NA 2 NA 3 NA 4 NA NA NA 4 NA NA NA NA
#3 US Rose NA NA 4 NA 1 1 1 NA 3 NA NA 4 NA 2 NA 4
I have a df of 100+ columns, and not all of them are filled:
> head(othertopics,20)
# A tibble: 20 x 118
Q6 Q10.1 Q10.2 Q10.3 Q10.4 Q10.5 Q10.6 Q10.7 Q10.8 Q10.9 Q10.10 Q10.11 Q10.12 Q10.13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 294 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 103 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 87 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
7 136 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
9 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
10 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
11 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
12 19 NA NA NA 4 NA NA NA NA NA NA NA NA NA
13 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
14 108 NA NA NA NA NA NA NA NA NA NA NA NA NA
Q6 is an ID.
Across Q10.1 to Q10.117, different values are assigned to each ID (see row 12).
Using unlist I managed to get the frequency of each value across the 117 columns, but I need to match the values to their respective IDs.
So basically I need to match the ID column against the 117 columns and get the frequency for each column.
othertopics<-data.frame(table(unlist(TableTopic2[,22:138])))
Var1 Freq
10 1
100 4
101 1
102 12
103 7
104 21
105 36
106 1
So, for example, variable 105 appeared 36 times across 17 values of ID in column Q6 (a number I counted in Excel).
So far I only have the first half of my solution, since I still need to know which IDs are associated with each variable (i.e. the 17 values I counted).
Also note that each variable column contains the number of its variable; for example, column Q10.105 is for variable 105, which has a frequency of 36.
I hope I was able to make it clear.
Thanks!
This question is not particularly clear, but I'll do my best. I think the way to tidy this data is to pivot all of the non-id columns into one column (I call it 'col_name') and then have another column with all of the values (mostly NAs; I call it 'numbered_var' for numbered variable). Then you can aggregate based on the numbered_var column.
This example is obviously not reproducible, so I constructed a simplified version of your data (I think):
library(dplyr)
library(tidyr)
df <- tibble(
id = 1:5,
Q1 = c(NA_integer_, 10L, NA_integer_, 10L, NA_integer_),
Q2 = c(NA_integer_, NA_integer_, 11L, NA_integer_, 11)
)
It looks like this:
# A tibble: 5 × 3
id Q1 Q2
<int> <int> <dbl>
1 1 NA NA
2 2 10 NA
3 3 NA 11
4 4 10 NA
5 5 NA 11
Next, I use tidyr::pivot_longer() to put the column names containing Q into a column, with their associated value in another column:
df <- pivot_longer(
df,
cols = contains("Q"), # this would also match an id column named Q6; in your data, starts_with("Q10.") would select only the question columns
names_to = "col_name",
values_to = "numbered_var"
)
This makes the data long:
# A tibble: 10 × 3
id col_name numbered_var
<int> <chr> <dbl>
1 1 Q1 NA
2 1 Q2 NA
3 2 Q1 10
4 2 Q2 NA
5 3 Q1 NA
6 3 Q2 11
7 4 Q1 10
8 4 Q2 NA
9 5 Q1 NA
10 5 Q2 11
With your data you should still end up with three columns, but each id would repeat once per question column, just as each id repeats twice for the two columns here.
Next, I would group by the variables, which seem to be of interest, and list the unique id's that have the variables in a new column:
df <- group_by(df, numbered_var)
df <- summarize(
df,
var_appearances = n(),
ids = list(unique(id))
)
Now, the data frame looks like this:
# A tibble: 3 × 3
numbered_var var_appearances ids
<dbl> <int> <list>
1 10 2 <int [2]>
2 11 2 <int [2]>
3 NA 6 <int [5]>
That ids column is a list-column with a vector of ids:
print(df$ids)
[[1]]
[1] 2 4
[[2]]
[1] 3 5
[[3]]
[1] 1 2 3 4 5
I'm not sure this is exactly what you'll be able to do, but hopefully it sets you in the right direction.
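Since the NA group is probably not of interest, the whole pipeline can also drop the missing values before grouping. Here is a sketch on the same toy data; collapsing the ids into a comma-separated string is just one assumed way of making them easy to read or export, not something the question asked for.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  id = 1:5,
  Q1 = c(NA, 10L, NA, 10L, NA),
  Q2 = c(NA, NA, 11L, NA, 11L)
)

result <- df %>%
  pivot_longer(cols = starts_with("Q"),
               names_to = "col_name",
               values_to = "numbered_var") %>%
  filter(!is.na(numbered_var)) %>%             # drop the empty cells
  group_by(numbered_var) %>%
  summarize(
    var_appearances = n(),                     # how often the value occurs
    ids = paste(unique(id), collapse = ", ")   # which ids it occurs for
  )

result
```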
After making several tables, I would like to combine them all into one data frame in order to export it to Excel. The only issue is that the first variable is different in each table, so bind_rows will not work.
Dummy sample data:
df1 = data.frame(Id = c(11:16), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1))
df2 = data.frame(HH_size = c(1:6 ), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1) )
Let's say I made these tables:
df11<- df1 %>%
dplyr::group_by(date) %>%
count(Id) %>%
tidyr::spread(date,n)
df22<- df2 %>%
dplyr::group_by(date) %>%
count(HH_size) %>%
tidyr::spread(date,n)
df11
Id `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<int> <int> <int> <int> <int> <int> <int>
1 11 1 NA NA NA NA NA
2 12 NA 1 NA NA NA NA
3 13 NA NA 1 NA NA NA
4 14 NA NA NA 1 NA NA
5 15 NA NA NA NA 1 NA
6 16 NA NA NA NA NA 1
This will not work
list <- c("df11" , "df22")
list %>% map_df(bind_rows)
Error: Argument 1 must have names
here is my desired output:
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
Id 11 1 NA NA NA NA NA
Id 12 NA 1 NA NA NA NA
Id 13 NA NA 1 NA NA NA
Id 14 NA NA NA 1 NA NA
Id 15 NA NA NA NA 1 NA
Id 16 NA NA NA NA NA 1
HH_size 1 1 NA NA NA NA NA
HH_size 2 NA 1 NA NA NA NA
HH_size 3 NA NA 1 NA NA NA
HH_size 4 NA NA NA 1 NA NA
HH_size 5 NA NA NA NA 1 NA
HH_size 6 NA NA NA NA NA 1
This will serve your purpose:
. in dplyr/magrittr means the result up to the previous pipe, so names(.)[1] extracts the name of the first column, which is stored in a new column named label.
The first column itself is needed back as cat, so I create a column cat from .x[[1]], the first column of each data frame passed in. I think you could also use . instead of .x, since the value just before the pipe is .x.
Drop the first column.
Rearrange the columns as desired.
map_df(list(df11, df22), ~.x %>%
mutate(label = names(.)[1],
cat = .x[[1]]) %>%
select(-1) %>%
select(label, cat, everything()))
# A tibble: 12 x 8
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<chr> <int> <int> <int> <int> <int> <int> <int>
1 Id 11 1 NA NA NA NA NA
2 Id 12 NA 1 NA NA NA NA
3 Id 13 NA NA 1 NA NA NA
4 Id 14 NA NA NA 1 NA NA
5 Id 15 NA NA NA NA 1 NA
6 Id 16 NA NA NA NA NA 1
7 HH_size 1 1 NA NA NA NA NA
8 HH_size 2 NA 1 NA NA NA NA
9 HH_size 3 NA NA 1 NA NA NA
10 HH_size 4 NA NA NA 1 NA NA
11 HH_size 5 NA NA NA NA 1 NA
12 HH_size 6 NA NA NA NA NA 1
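The attempt in the question fails because c("df11", "df22") is a character vector holding the names of the tables, not the tables themselves. If the names are all you have, mget() can look the objects up. This is a rough base R sketch with two small stand-in tables; the count column n and the object name tables are assumptions for illustration only.

```r
# Small stand-ins for the real count tables
df11 <- data.frame(Id = 11:12, n = c(1L, 1L))
df22 <- data.frame(HH_size = 1:2, n = c(1L, 1L))

# Turn the character vector of names into a named list of data frames
tables <- mget(c("df11", "df22"))

# Move each table's first column into label/cat, then stack the tables
combined <- do.call(rbind, lapply(tables, function(tbl) {
  data.frame(label = names(tbl)[1], cat = tbl[[1]], tbl[-1])
}))

combined
```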
Put all the data frames in a list and then you can do:
library(tidyverse)
list_df <- lst(df1, df2)
map_df(list_df, ~{
col <- names(.x)[1]
.x %>%
count(.data[[col]], date) %>%
pivot_wider(names_from = date, values_from = n) %>%
mutate(label = col) %>%
rename_with(~'cat', 1)
})
# cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06` label
# <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11 1 NA NA NA NA NA Id
# 2 12 NA 1 NA NA NA NA Id
# 3 13 NA NA 1 NA NA NA Id
# 4 14 NA NA NA 1 NA NA Id
# 5 15 NA NA NA NA 1 NA Id
# 6 16 NA NA NA NA NA 1 Id
# 7 1 1 NA NA NA NA NA HH_size
# 8 2 NA 1 NA NA NA NA HH_size
# 9 3 NA NA 1 NA NA NA HH_size
#10 4 NA NA NA 1 NA NA HH_size
#11 5 NA NA NA NA 1 NA HH_size
#12 6 NA NA NA NA NA 1 HH_size
I want to add multiple empty columns to a tibble.
The names of the new columns are stored in 'columnsToAdd'
> columnsToAdd
[1] "column1" "column2" "column3" "column4" "column5"
When I run the following code lines, ...
library(dplyr)
someTibble <- tibble(name = paste("Name", 1:10))
columnsToAdd <- paste("column", 1:5, sep = "")
someTibble %>%
tibble::add_column(columnsToAdd = NA)
... I get this result, ...
# A tibble: 10 x 2
name columnsToAdd
<chr> <lgl>
1 Name 1 NA
2 Name 2 NA
3 Name 3 NA
4 Name 4 NA
5 Name 5 NA
6 Name 6 NA
7 Name 7 NA
8 Name 8 NA
9 Name 9 NA
10 Name 10 NA
... instead, I want to get the following result:
# A tibble: 10 x 6
name column1 column2 column3 column4 column5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name 1 NA NA NA NA NA
2 Name 2 NA NA NA NA NA
3 Name 3 NA NA NA NA NA
4 Name 4 NA NA NA NA NA
5 Name 5 NA NA NA NA NA
6 Name 6 NA NA NA NA NA
7 Name 7 NA NA NA NA NA
8 Name 8 NA NA NA NA NA
9 Name 9 NA NA NA NA NA
10 Name 10 NA NA NA NA NA
Just create them as columns. This works for base data frames and data.table tables:
Eg:
> someTibble <- tibble(name = paste("Name", 1:10))
> columnsToAdd <- paste("column", 1:3,sep="")
That gives me one column:
> head(someTibble)
# A tibble: 6 x 1
name
<chr>
1 Name 1
2 Name 2
3 Name 3
4 Name 4
5 Name 5
6 Name 6
Then I do:
> someTibble[,columnsToAdd]=NA
and magically the columns appear:
> head(someTibble)
# A tibble: 6 x 4
name column1 column2 column3
<chr> <lgl> <lgl> <lgl>
1 Name 1 NA NA NA
2 Name 2 NA NA NA
3 Name 3 NA NA NA
4 Name 4 NA NA NA
5 Name 5 NA NA NA
6 Name 6 NA NA NA
Note this is not really magic; it has been standard base R behaviour since before R was born.
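If the new columns should be numeric, as in the desired output in the question (<dbl> rather than <lgl>), one option is to assign NA_real_ instead of NA. A minimal sketch on a plain data frame follows; the object name someData is made up for illustration.

```r
someData <- data.frame(name = paste("Name", 1:10))
columnsToAdd <- paste0("column", 1:3)

# Assigning to column names that do not exist yet creates them;
# NA_real_ makes the new columns double rather than logical
someData[, columnsToAdd] <- NA_real_

str(someData)
```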
We can pass the column names as a named list and then splice it in with !!! ("triple bang"):
columnsToAdd <- paste("column", 1:5, sep = "")
someTibble %>%
tibble::add_column(!!!purrr::set_names(as.list(rep(NA, length(columnsToAdd))),nm=columnsToAdd))
# A tibble: 6 x 6
name column1 column2 column3 column4 column5
<chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 Name 1 NA NA NA NA NA
2 Name 2 NA NA NA NA NA
3 Name 3 NA NA NA NA NA
4 Name 4 NA NA NA NA NA
5 Name 5 NA NA NA NA NA
6 Name 6 NA NA NA NA NA
However, I think @MrGumble may have missed something; from the documentation of add_column {tibble}:
... Name-value pairs, passed on to tibble(). All values must have one element for each row in the data frame, or be of length 1. These arguments are passed on to tibble(), and therefore also support unquote via !! and unquote-splice via !!!.
and here is an example from tibble {tibble}
tibble(!!! list(x = rlang::quo(1:10), y = quote(x * 2)))
What you are seeing in tibble::add_column(columnsToAdd = NA) is the quasi-something evaluation that dplyr and tidyr introduced. If you check the definition:
> args(add_column)
function (.data, ..., .before = NULL, .after = NULL)
you'll see that it doesn't expect the contents of a variable. The name to the left of = is taken literally as the new column's name, without quotation, so the value stored in columnsToAdd is never looked up.
An entirely different approach is to create a matrix (or data.frame, whatever tickles your fancy), and smack it onto the side of someTibble:
extra <- matrix(NA_real_, nrow=nrow(someTibble), ncol=length(columnsToAdd), dimnames=list(NULL, columnsToAdd))
dplyr::bind_cols(someTibble, as.data.frame(extra))
Probably there's a very easy solution to this but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new which is the exact description of what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope that this is self-explanatory. What I need is to take the values of "value" from the rows where id is NA (i.e. the last five rows) and paste them into the first five rows (i.e. where id is not NA) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I'd basically need to cut out the bottom half and paste it as new variable(s) in the top section of the dataframe. Hope this makes sense.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
In case you have more rows with a non-NA id than rows with an NA id, you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
This removes the leading run of NAs from value and appends the same number of NAs at the end. The id column is not considered here.