I have two dataframes like the following:
df1 <- data.frame(ID = c(1:4),
Year = 2001,
a_Var1 = c("A","B","C","D"),
a_Var2 = c("T","F","F","T"))
df2 <- data.frame(ID = c(1:4),
Year = 2002,
b_Var1 = c("E","F","G","H"))
The desired end product is
df_combined <- data.frame(ID = c(1,1,2,2,3,3,4,4),
Year = c(2001,2002,2001,2002,2001,2002,2001,2002),
Var1 = c("A","E","B","F","C","G","D","H"),
Var2 = c("T",NA,"F",NA,"F",NA,"T",NA))
The question is how to 'rbind' in such a way that the prefix a_ or b_ is removed and Var1, Var2, etc. become the new columns.
I tried plyr's rbind.fill, but that doesn't solve the problem.
Here is one option: place the datasets in a list, strip the prefix (everything up to and including the first _) from the column names, bind the rows together, and arrange by 'ID'.
library(tidyverse)
map_df(list(df1, df2), ~ .x %>%
rename_all(~ str_remove(.x, "^[^_]+_"))) %>%
arrange(ID)
# ID Year Var1 Var2
#1 1 2001 A T
#2 1 2002 E <NA>
#3 2 2001 B F
#4 2 2002 F <NA>
#5 3 2001 C F
#6 3 2002 G <NA>
#7 4 2001 D T
#8 4 2002 H <NA>
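For readers without the tidyverse, the same idea can be sketched in base R. This is only an illustration, not the answerer's code: the helper names strip_prefix and pad_cols are made up here, and stringsAsFactors = FALSE is set explicitly so the example behaves the same on older R versions.

```r
df1 <- data.frame(ID = 1:4, Year = 2001,
                  a_Var1 = c("A", "B", "C", "D"),
                  a_Var2 = c("T", "F", "F", "T"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:4, Year = 2002,
                  b_Var1 = c("E", "F", "G", "H"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: drop everything up to and including the first "_"
strip_prefix <- function(d) {
  names(d) <- sub("^[^_]+_", "", names(d))
  d
}

dfs <- lapply(list(df1, df2), strip_prefix)
all_cols <- unique(unlist(lapply(dfs, names)))

# Hypothetical helper: add any missing columns as NA so rbind() accepts the pieces
pad_cols <- function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
}

df_combined <- do.call(rbind, lapply(dfs, pad_cols))
df_combined <- df_combined[order(df_combined$ID, df_combined$Year), ]
rownames(df_combined) <- NULL
```

The regular expression only removes the first prefix, so columns without an underscore (ID, Year) are left untouched.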
I'm trying to replace ids with their respective values. The problem is that each id maps to a different value depending on the type column, like this:
>df
type id
1 q1 1
2 q1 2
3 q2 1
4 q2 3
5 q3 1
6 q3 2
Here are the ids for each type, with their values:
>q1
id value
1 1 yes
2 2 no
>q2
id value
1 1 one hour
2 2 two hours
3 3 more than two hours
>q3
id value
1 1 blue
2 2 yellow
I've tried something like this:
df <- left_join(subset(df, type %in% c("q1")), q1, by = "id")
But it removes the other values.
I'd like to know how to do this with a one-liner (or something close), because there are more than 20 lookup tables with type descriptions.
Any ideas on how to do it?
This is the df I'm expecting:
>df
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
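As a rough sketch of the CSV idea, assuming the lookup is stored with columns type, id and value: the CSV contents are inlined with read.csv(text = ...) so the example is self-contained, and base merge() stands in for left_join. The variable name csv_text is made up for illustration.

```r
# Inlined stand-in for a lookup file on disk (hypothetical contents)
csv_text <- "type,id,value
q1,1,yes
q1,2,no
q2,1,one hour
q2,2,two hours
q2,3,more than two hours
q3,1,blue
q3,2,yellow"

value_lookup <- read.csv(text = csv_text, stringsAsFactors = FALSE)

df <- data.frame(type = c("q1", "q1", "q2", "q2", "q3", "q3"),
                 id = c(1, 2, 1, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Join on both keys; all.x = TRUE keeps every row of df, like a left join
result <- merge(df, value_lookup, by = c("type", "id"), all.x = TRUE)
```

With a real file, read.csv("path/to/lookup.csv") would replace the text = argument.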
tempList = split(df, df$type)
do.call(rbind,
lapply(names(tempList), function(nm)
merge(tempList[[nm]], get(nm))))
# id type value
#1 1 q1 yes
#2 2 q1 no
#3 1 q2 one hour
#4 3 q2 more than two hours
#5 1 q3 blue
#6 2 q3 yellow
Collect the 'q\d+' data.frame objects into a list with mget, bind them into a single data.frame with bind_rows (using .id to create the 'type' column from the object names), and right_join the result with 'df'.
library(tidyverse)
mget(paste0("q", 1:3)) %>%
bind_rows(.id = 'type') %>%
right_join(df)
# type id value
#1 q1 1 yes
#2 q1 2 no
#3 q2 1 one hour
#4 q2 3 more than two hours
#5 q3 1 blue
#6 q3 2 yellow
You can do it by a series of left joins:
df1 = left_join(df, q1, by='id') %>% filter(type=="q1")
> df1
type id value
1 q1 1 yes
2 q1 2 no
df2 = left_join(df, q2, by='id') %>% filter(type=="q2")
> df2
type id value
1 q2 1 one hour
2 q2 3 more than two hours
df3 = left_join(df, q3, by='id') %>% filter(type=="q3")
> df3
type id value
1 q3 1 blue
2 q3 2 yellow
> rbind(df1,df2,df3)
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
One liner would be:
rbind(left_join(df, q1, by='id') %>% filter(type=="q1"),
left_join(df, q2, by='id') %>% filter(type=="q2"),
left_join(df, q3, by='id') %>% filter(type=="q3"))
If you have more types, you should probably loop through the type names, running left_join for each one and accumulating the results with bind_rows:
vecQs = c(paste("q", seq(1,3,1),sep="")) #Types of variables q1, q2 ...
result = tibble()
#Execute left_join for the types and store it in result.
for(i in vecQs) {
result = bind_rows(result, left_join(df,eval(as.symbol(i)) , by='id') %>% filter(type==!!i))
}
This will give:
> result
# A tibble: 6 x 3
type id value
<chr> <int> <chr>
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
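A possible variation on the loop, assuming the lookup tables can be kept in a named list instead of as separate global objects (the list name lookups is made up for this sketch). It avoids eval(as.symbol()) and growing result inside the loop, and uses base merge() rather than left_join:

```r
df <- data.frame(type = c("q1", "q1", "q2", "q2", "q3", "q3"),
                 id = c(1, 2, 1, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Hypothetical container: one lookup table per type, named after the type
lookups <- list(
  q1 = data.frame(id = 1:2, value = c("yes", "no"), stringsAsFactors = FALSE),
  q2 = data.frame(id = 1:3,
                  value = c("one hour", "two hours", "more than two hours"),
                  stringsAsFactors = FALSE),
  q3 = data.frame(id = 1:2, value = c("blue", "yellow"), stringsAsFactors = FALSE)
)

# Join each type's rows against its own lookup table, then stack the pieces
pieces <- lapply(names(lookups), function(nm) {
  merge(df[df$type == nm, ], lookups[[nm]], by = "id", all.x = TRUE)
})
result <- do.call(rbind, pieces)[, c("type", "id", "value")]
```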
I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
Right now the output shows that rows 1, 3 and 5 did not answer at least one question, and it identifies the column with the missing value. I would like it to show me the same thing but with ID numbers 12, 14 and 16. It would be a bonus to have the column names (e.g. Q1, Q2, Q3) in the output instead of the column numbers.
We can use apply to get, row by row, the names of the columns that are NA, collapse them into a comma-separated string, and attach that to a new dataframe along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
An equivalent would be:
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply-type operations and continue from which(..., arr.ind = TRUE) (written below as which(..., T)), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert the data to long format using tidyr::gather, filter for the rows where the answer is NA, and then summarise the unanswered questions per ID using toString:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # keep rows with at least one NA
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
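Building on the same which(..., arr.ind = TRUE) call, the row and column indices can be translated into ID values and column names before tabulating. This is a sketch rather than any answerer's code, and the names idx, missing and tab are chosen here for illustration:

```r
df <- data.frame(ID = 12:16,
                 Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA),
                 Q3 = c(NA, "a", "a", "a", "b"))

# Matrix with one row per missing cell and columns named "row" and "col"
idx <- which(is.na(df), arr.ind = TRUE)

# Translate positions into the ID value and the question name
missing <- data.frame(ID = df$ID[idx[, "row"]],
                      question = names(df)[idx[, "col"]])

# Cross-tabulate: one row per ID, one column per unanswered question
tab <- table(missing)
tab
```

IDs with no missing answers (13 and 15 here) do not appear in the table, which matches the asker's request for IDs 12, 14 and 16.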
I have the following sets of data:
df1 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2011,2012,2012), variable_1= c(1,3,5,7))
df2 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2012,2012,2013), variable_2= c(2,4,6,8))
df3 <- data.frame( country = c("A", "C","C"), year = c(2011,2011,2013), variable_3= c(9,9,9))
I want to reshape them into a panel data model, so I can get the following result:
df4 <- data.frame( country = c("A","A","A","B","B","B","C","C","C"), year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013), variable_1 = c(1,5,NA,3,7,NA,NA,NA,NA), variable_2 = c(2,6,NA,NA,4,8,NA,NA,NA), variable_3 = c(9,NA,NA,NA,NA,NA,9,NA,9) )
I have searched for this info, but the topics I found (Reshaping panel data) didn't help me.
Any ideas on how to do that? My real data sets have thousands of lines ("countries"), several variables, years and NAs, so please take that into account.
Try
library(tidyr)
library(dplyr)
Reduce(full_join, list(df1, df2, df3)) %>%
complete(country, year)
Which gives:
#Source: local data frame [9 x 5]
#
# country year variable_1 variable_2 variable_3
# (chr) (dbl) (dbl) (dbl) (dbl)
#1 A 2011 1 2 9
#2 A 2012 5 6 NA
#3 A 2013 NA NA NA
#4 B 2011 3 NA NA
#5 B 2012 7 4 NA
#6 B 2013 NA 8 NA
#7 C 2011 NA NA 9
#8 C 2012 NA NA NA
#9 C 2013 NA NA 9
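The two steps in that answer are a full outer join across all tables, followed by filling in the country/year combinations that never occur. A base R sketch of the same two steps, assuming merge() and expand.grid() are acceptable substitutes for full_join and complete (the names panel and grid are made up here):

```r
df1 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2011, 2012, 2012),
                  variable_1 = c(1, 3, 5, 7), stringsAsFactors = FALSE)
df2 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2012, 2012, 2013),
                  variable_2 = c(2, 4, 6, 8), stringsAsFactors = FALSE)
df3 <- data.frame(country = c("A", "C", "C"), year = c(2011, 2011, 2013),
                  variable_3 = c(9, 9, 9), stringsAsFactors = FALSE)

# Step 1: full outer join of all tables on country and year
panel <- Reduce(function(x, y) merge(x, y, by = c("country", "year"), all = TRUE),
                list(df1, df2, df3))

# Step 2: build every country/year combination and left-join the data onto it
grid <- expand.grid(country = unique(panel$country),
                    year = unique(panel$year),
                    stringsAsFactors = FALSE)
panel <- merge(grid, panel, by = c("country", "year"), all.x = TRUE)
panel <- panel[order(panel$country, panel$year), ]
rownames(panel) <- NULL
```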
I have two data frames, each with the same two columns: county codes and frequencies. They aren't identical, but some of the county code values show up in both data frames. Like this:
"county_code","freq"
"01011",2
"01051",1
"01073",9
"01077",1
"county_code","freq"
"01011",4
"01056",2
"01073",1
"01088",6
I want to merge them into a new data frame, such that if a county code appears in both data frames, their respective frequencies are added together. If the county code just appears in one or the other of the data frames, I want to add it (and its frequency) to the new data frame unchanged. The result should look like this:
"county_code","freq"
"01011",6
"01051",1
"01056",2
"01073",10
"01077",1
"01088",6
The result doesn't have to be ordered. I tried to use reshape for this, but I wasn't sure that was the right approach. Thoughts?
Combine the two data frames with rbind, then use aggregate to collapse multiple rows with the same county_code:
aggregate(freq~county_code, rbind(d1, d2) , FUN=sum)
## county_code freq
## 1 1011 6
## 2 1051 1
## 3 1073 10
## 4 1077 1
## 5 1056 2
## 6 1088 6
(Using the definitions in MrFlick's answer.)
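Note that the sample data used in these answers drop the leading zeros from the county codes. If the data come from CSV files like the ones in the question, one way to keep the codes intact is to read that column as character. A sketch, with the CSV contents inlined via read.csv(text = ...) as a stand-in for real files:

```r
# colClasses keeps county_code as text, so "01011" does not become 1011
d1 <- read.csv(text = '"county_code","freq"
"01011",2
"01051",1
"01073",9
"01077",1', colClasses = c(county_code = "character"))

d2 <- read.csv(text = '"county_code","freq"
"01011",4
"01056",2
"01073",1
"01088",6', colClasses = c(county_code = "character"))

# Stack both tables, then sum freq within each county_code
res <- aggregate(freq ~ county_code, rbind(d1, d2), FUN = sum)
res
```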
Using base functions, you can do a merge() then transform(). Here are your sample input data.frames:
d1 <- data.frame(
county_code = c("1011", "1051", "1073", "1077"),
freq = c(2L, 1L, 9L, 1L)
)
d2 <- data.frame(
county_code = c("1011", "1056", "1073", "1088"),
freq = c(4L, 2L, 1L, 6L)
)
then you would just do
transform(merge(d1, d2, by="county_code", all=T),
freq = rowSums(cbind(freq.x, freq.y), na.rm=T),
freq.x = NULL, freq.y = NULL
)
to get
county_code freq
1 1011 6
2 1051 1
3 1056 2
4 1073 10
5 1077 1
6 1088 6
Here is one way. I used rbind(), merge() and dplyr.
# sample data
country <- c("01011", "01051", "01073", "01077")
value <- c(2,1,9,1)
foo <- data.frame(country, value, stringsAsFactors=F)
country <- c("01011","01056","01073","01088")
value <- c(4,2,1,6)
foo2 <- data.frame(country, value, stringsAsFactors=F)
library(dplyr)
# bind_rows() replaces the deprecated rbind_list()
ana <- group_by(bind_rows(foo, foo2), country) %>%
summarize(count = sum(value))
ana
country count
1 01011 6
2 01051 1
3 01056 2
4 01073 10
5 01077 1
6 01088 6
The other idea I had was the following.
ana2 <- merge(foo, foo2, all = TRUE, by = "country")
country value.x value.y
1 01011 2 4
2 01051 1 NA
3 01056 NA 2
4 01073 9 1
5 01077 1 NA
6 01088 NA 6
bob2 <- ana2 %>%
rowwise() %>%
mutate(count = sum(value.x,value.y, na.rm = TRUE))
country value.x value.y count
1 01011 2 4 6
2 01051 1 NA 1
3 01056 NA 2 2
4 01073 9 1 10
5 01077 1 NA 1
6 01088 NA 6 6