Fill missing values using last or previous observation R - r

Suppose I have the following table:
ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor
and I have another table:
ID Name Country
1 A NA
2 NA Bel
3 NA Bel
4 NA Bel
the result I want to get is:
ID Name Country
1 A Nor
2 B Bel
3 C Bel
4 D Bel
Basically I would like to create a third table that gives priority to the second table but fills its missing fields from the first table, matched on ID. Any help on how to do this in base R would be much appreciated.

You can get the logical vector representing the locations of the NA values using is.na(df2).
You can then set the NA elements of df2 to be the corresponding elements in df.
df <- data.frame(
ID = 1:4,
Name = LETTERS[1:4],
Country = "Nor",
stringsAsFactors = F)
df2 <- data.frame(
ID = 1:4,
Name = c("A", NA, NA, NA),
Country = c(NA, "Bel", "Bel", "Bel"),
stringsAsFactors = F)
df2[is.na(df2)] <- df[is.na(df2)]
df2
#> ID Name Country
#> 1 1 A Nor
#> 2 2 B Bel
#> 3 3 C Bel
#> 4 4 D Bel
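The replacement above pairs the two data frames by row position. If the rows of df2 could arrive in a different order, one possible variation (a sketch, not part of the original answer) is to align df to df2 with match() on ID before filling. The shuffled df2 below is made-up illustration data.

```r
# Reference table: complete values for every ID
df <- data.frame(
  ID = 1:4,
  Name = LETTERS[1:4],
  Country = "Nor",
  stringsAsFactors = FALSE)

# Priority table, deliberately shuffled (hypothetical data for illustration)
df2 <- data.frame(
  ID = c(3, 1, 4, 2),
  Name = c(NA, "A", NA, NA),
  Country = c("Bel", NA, "Bel", "Bel"),
  stringsAsFactors = FALSE)

# Reorder df so that row i of 'aligned' has the same ID as row i of df2
aligned <- df[match(df2$ID, df$ID), ]

# Fill only the NA cells of df2 from the aligned reference rows
df2[is.na(df2)] <- aligned[is.na(df2)]

# Sort by ID for display
df2 <- df2[order(df2$ID), ]
df2
```

This assumes every ID in df2 also exists in df; an ID without a match would give an all-NA row in aligned, so its gaps would stay NA.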

You can try a tidyverse solution
library(tidyverse)
d1 %>%
left_join(d2, by="ID") %>%
mutate(Country=case_when(
is.na(Country.y) ~ as.character(Country.x), # fall back to d1 when d2 has no country
TRUE ~ as.character(Country.y) # otherwise keep the country from d2
)) %>%
select(ID, Name=Name.x, Country)
ID Name Country
1 1 A Nor
2 2 B Bel
3 3 C Bel
4 4 D Bel
The case_when part is easily and freely expandable.
Data
d1 <- read.table(text="ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor", header=T)
d2 <- read.table(text="ID Name Country
1 A NA
2 NA Bel
3 NA Bel
4 NA Bel", header=T)
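As a possible shorter variant of the case_when approach above, dplyr::coalesce() returns the first non-missing value at each position, which matches the "prefer d2, fall back to d1" rule. This is a sketch using the same d1 and d2 values, rebuilt here with data.frame() so the block is self-contained; the name res is just for illustration.

```r
library(dplyr)

d1 <- data.frame(ID = 1:4, Name = c("A", "B", "C", "D"),
                 Country = "Nor", stringsAsFactors = FALSE)
d2 <- data.frame(ID = 1:4, Name = c("A", NA, NA, NA),
                 Country = c(NA, "Bel", "Bel", "Bel"),
                 stringsAsFactors = FALSE)

res <- d1 %>%
  left_join(d2, by = "ID", suffix = c(".x", ".y")) %>%
  # take the value from d2 (.y) when present, otherwise the one from d1 (.x)
  transmute(ID,
            Name = coalesce(Name.y, Name.x),
            Country = coalesce(Country.y, Country.x))
res
```

coalesce() needs both inputs to have compatible types, so factor columns would have to be converted with as.character() first.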

Supposing the row order is strictly the same, df1 and df2 have the same size, and df1 has all the names defined (if not, you need to go through a left_join). It is not base R, but dplyr is a must-have ;)
df3 <- dplyr::mutate(df1, Country = ifelse(is.na(df2$Country), Country, df2$Country))
This takes df1 as the baseline (so as to keep the Name column) and replaces its Country column with the value from df2 unless that value is NA.
(If you have already loaded dplyr, drop the dplyr:: prefix.)
with df1
ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor
and df2
ID Name Country
1 A NA
2 NA Bel
3 NA Bel
4 NA Bel
PS: I voted for @Paul's base solution ... very neat.

Related

r - 'rbind' dataframes with different prefix in column names

I have two dataframes like the following:
df1 <- data.frame(ID = c(1:4),
Year = 2001,
a_Var1 = c("A","B","C","D"),
a_Var2 = c("T","F","F","T"))
df2 <- data.frame(ID = c(1:4),
Year = 2002,
b_Var1 = c("E","F","G","H"))
The desired end product is
df_combined <- data.frame(ID = c(1,1,2,2,3,3,4,4),
Year = c(2001,2002,2001,2002,2001,2002,2001,2002),
Var1 = c("A","E","B","F","C","G","D","H"),
Var2 = c("T",NA,"F",NA,"F",NA,"T",NA))
Question is how to 'rbind' in such a way that the prefix a_ or b_ is removed and Var1, Var2, etc become the new columns.
Tried plyr's rbind.fill but that doesn't solve the problem.
Here is one option. Place the datasets in a list, rename by removing the prefix part including the _ and arrange by 'ID'
library(tidyverse)
map_df(list(df1, df2), ~ .x %>%
rename_all(~ str_remove(.x, "^[^_]+_"))) %>%
arrange(ID)
# ID Year Var1 Var2
#1 1 2001 A T
#2 1 2002 E <NA>
#3 2 2001 B F
#4 2 2002 F <NA>
#5 3 2001 C F
#6 3 2002 G <NA>
#7 4 2001 D T
#8 4 2002 H <NA>
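For comparison, the same idea can be sketched in base R: strip the prefix with sub(), add any column a data frame lacks as NA, then rbind. The helper names strip_prefix and add_missing are made up for this illustration.

```r
df1 <- data.frame(ID = 1:4, Year = 2001,
                  a_Var1 = c("A", "B", "C", "D"),
                  a_Var2 = c("T", "F", "F", "T"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:4, Year = 2002,
                  b_Var1 = c("E", "F", "G", "H"),
                  stringsAsFactors = FALSE)

# Remove everything up to and including the first underscore in each name
strip_prefix <- function(d) {
  names(d) <- sub("^[^_]+_", "", names(d))
  d
}
dfs <- lapply(list(df1, df2), strip_prefix)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(dfs, names)))

# Add columns that a data frame lacks (filled with NA) and fix the column order
add_missing <- function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
}

combined <- do.call(rbind, lapply(dfs, add_missing))
combined <- combined[order(combined$ID, combined$Year), ]
rownames(combined) <- NULL
combined
```

The regular expression is the one used in the answer above, so names without an underscore (ID, Year) are left unchanged.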

Left join with multiple conditions in R

I'm trying to replace ids for their respective values. The problem is that each id has a different value according to the previous column type, like this:
>df
type id
1 q1 1
2 q1 2
3 q2 1
4 q2 3
5 q3 1
6 q3 2
Here are the ids for each type with their values:
>q1
id value
1 1 yes
2 2 no
>q2
id value
1 1 one hour
2 2 two hours
3 3 more than two hours
>q3
id value
1 1 blue
2 2 yellow
I've tried something like this:
df <- left_join(subset(df, type %in% c("q1")), q1, by = "id")
But it removes the other values.
I'd like to know how to do this in a one-liner (or close to it), because there are more than 20 vectors with type descriptions.
Any ideas on how to do it?
This is the df i'm expecting:
>df
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
tempList = split(df, df$type)
do.call(rbind,
lapply(names(tempList), function(nm)
merge(tempList[[nm]], get(nm))))
# id type value
#1 1 q1 yes
#2 2 q1 no
#3 1 q2 one hour
#4 3 q2 more than two hours
#5 1 q3 blue
#6 2 q3 yellow
Get the values of 'q\d+' data.frame object identifiers in a list, bind them together into a single data.frame with bind_rows while creating the 'type' column as the identifier name and right_join with the dataset object 'df'
library(tidyverse)
mget(paste0("q", 1:3)) %>%
bind_rows(.id = 'type') %>%
right_join(df)
# type id value
#1 q1 1 yes
#2 q1 2 no
#3 q2 1 one hour
#4 q2 3 more than two hours
#5 q3 1 blue
#6 q3 2 yellow
You can do it by a series of left joins:
df1 = left_join(df, q1, by='id') %>% filter(type=="q1")
> df1
type id value
1 q1 1 yes
2 q1 2 no
df2 = left_join(df, q2, by='id') %>% filter(type=="q2")
> df2
type id value
1 q2 1 one hour
2 q2 3 more than two hours
df3 = left_join(df, q3, by='id') %>% filter(type=="q3")
> df3
type id value
1 q3 1 blue
2 q3 2 yellow
> rbind(df1,df2,df3)
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
One liner would be:
rbind(left_join(df, q1, by='id') %>% filter(type=="q1"),
left_join(df, q2, by='id') %>% filter(type=="q2"),
left_join(df, q3, by='id') %>% filter(type=="q3"))
If you have more vectors then probably you should loop through the names of vector types and execute left_join and bind_rows one by one as:
vecQs = c(paste("q", seq(1,3,1),sep="")) #Types of variables q1, q2 ...
result = tibble()
#Execute left_join for the types and store it in result.
for(i in vecQs) {
result = bind_rows(result, left_join(df,eval(as.symbol(i)) , by='id') %>% filter(type==!!i))
}
This will give:
> result
# A tibble: 6 x 3
type id value
<chr> <int> <chr>
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
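The answers above fetch q1, q2, ... from the global environment with get(), mget() or eval(). If the lookup tables can instead be kept in a named list, a base R sketch of the two-key join might look like the following. The list name lookups and the helper column row are assumptions made for this example.

```r
df <- data.frame(type = c("q1", "q1", "q2", "q2", "q3", "q3"),
                 id = c(1L, 2L, 1L, 3L, 1L, 2L),
                 stringsAsFactors = FALSE)

# Lookup tables held in a named list instead of separate global objects
lookups <- list(
  q1 = data.frame(id = 1:2, value = c("yes", "no"),
                  stringsAsFactors = FALSE),
  q2 = data.frame(id = 1:3,
                  value = c("one hour", "two hours", "more than two hours"),
                  stringsAsFactors = FALSE),
  q3 = data.frame(id = 1:2, value = c("blue", "yellow"),
                  stringsAsFactors = FALSE))

# Stack the lookups, tagging each row with the name of its table
value_lookup <- do.call(rbind, Map(
  function(tbl, nm) data.frame(type = nm, tbl, stringsAsFactors = FALSE),
  lookups, names(lookups)))

# merge() may reorder rows, so remember the original order first
df$row <- seq_len(nrow(df))
res <- merge(df, value_lookup, by = c("type", "id"), all.x = TRUE)
res <- res[order(res$row), c("type", "id", "value")]
rownames(res) <- NULL
res
```

With more than 20 types, only the list needs to grow; the join itself stays the same.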

identifying location of NA values in a data frame by ID (not row number) and column name

I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
Right now the output shows that rows 1, 3 and 5 did not answer at least one question, and it identifies the column with the missing value. I would like it to show me the same thing but with ID numbers 12, 14 and 16. It would be a bonus to have the column names (e.g. Q1, Q2, Q3) in the output as well, instead of the column numbers.
We can get the names of the columns that are NA in each row using apply, collapse them into a comma-separated string, and attach it to a new data frame along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
Similar equivalent would be
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert the data to long format using tidyr::gather, keep only the rows where the answer is missing, and finally summarise with toString:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be written as:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # keep only rows with at least one NA
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
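Staying close to the code in the question, the row and column indices returned by which(..., arr.ind = TRUE) can be translated into ID values and column names before tabulating. A minimal sketch follows; the names pos and missing are chosen only for this example.

```r
df <- data.frame(ID = c(12:16), Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA), Q3 = c(NA, "a", "a", "a", "b"))

# One row per NA cell, with columns "row" and "col"
pos <- which(is.na(df), arr.ind = TRUE)

# Translate positions into the participant ID and the question name
missing <- data.frame(ID = df$ID[pos[, "row"]],
                      Question = names(df)[pos[, "col"]])
missing <- missing[order(missing$ID), ]
rownames(missing) <- NULL
missing

# Same cross-tabulation as in the question, now labelled by ID and question
table(missing)
```

This assumes the ID column itself contains no NA, otherwise "ID" would also appear as a question name.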

Reshaping df into data panel model

I have the following sets of data:
df1 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2011,2012,2012), variable_1= c(1,3,5,7))
df2 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2012,2012,2013), variable_2= c(2,4,6,8))
df3 <- data.frame( country = c("A", "C","C"), year = c(2011,2011,2013), variable_3= c(9,9,9))
I want to reshape them into a panel data model, so I can get the following result:
df4 <- data.frame( country = c("A","A","A","B","B","B","C","C","C"), year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013), variable_1 = c(1,5,NA,3,7,NA,NA,NA,NA), variable_2 = c(2,6,NA,NA,4,8,NA,NA,NA), variable_3 = c(9,NA,NA,NA,NA,NA,9,NA,9) )
I have searched for this info, but the topics I found (Reshaping panel data) didn't help me.
Any ideas on how to do that? My real data sets have thousands of lines ("countries"), several variables, years and NAs, so please take that into account.
Try
library(tidyr)
library(dplyr)
Reduce(full_join, list(df1, df2, df3)) %>%
complete(country, year)
Which gives:
#Source: local data frame [9 x 5]
#
# country year variable_1 variable_2 variable_3
# (chr) (dbl) (dbl) (dbl) (dbl)
#1 A 2011 1 2 9
#2 A 2012 5 6 NA
#3 A 2013 NA NA NA
#4 B 2011 3 NA NA
#5 B 2012 7 4 NA
#6 B 2013 NA 8 NA
#7 C 2011 NA NA 9
#8 C 2012 NA NA NA
#9 C 2013 NA NA 9
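A base R counterpart of the full_join plus complete approach can be sketched with merge(..., all = TRUE) inside Reduce and an expand.grid of every country/year pair. The names merged, grid and panel are illustrative, and stringsAsFactors = FALSE is set explicitly so the keys stay character.

```r
df1 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2011, 2012, 2012),
                  variable_1 = c(1, 3, 5, 7), stringsAsFactors = FALSE)
df2 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2012, 2012, 2013),
                  variable_2 = c(2, 4, 6, 8), stringsAsFactors = FALSE)
df3 <- data.frame(country = c("A", "C", "C"), year = c(2011, 2011, 2013),
                  variable_3 = c(9, 9, 9), stringsAsFactors = FALSE)

# Full outer join of all tables on the two key columns
merged <- Reduce(
  function(x, y) merge(x, y, by = c("country", "year"), all = TRUE),
  list(df1, df2, df3))

# Every country/year combination, including ones absent from the data
grid <- expand.grid(country = unique(merged$country),
                    year = unique(merged$year),
                    stringsAsFactors = FALSE)

# Left join the merged data onto the full grid; unmatched cells become NA
panel <- merge(grid, merged, by = c("country", "year"), all.x = TRUE)
panel <- panel[order(panel$country, panel$year), ]
rownames(panel) <- NULL
panel
```

For thousands of countries the grid can get large, since it holds one row per country and year whether or not any data exists for that pair.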

Merging two data frames according to row values

I have two data frames, each with the same two columns: county codes and frequencies. They aren't identical, but some of the county code values show up in both data frames. Like this:
"county_code","freq"
"01011",2
"01051",1
"01073",9
"01077",1
"county_code","freq"
"01011",4
"01056",2
"01073",1
"01088",6
I want to merge them into a new data frame, such that if a county code appears in both data frames, their respective frequencies are added together. If the county code just appears in one or the other of the data frames, I want to add it (and its frequency) to the new data frame unchanged. The result should look like this:
"county_code","freq"
"01011",6
"01051",1
"01056",2
"01073",10
"01077",1
"01088",6
The result doesn't have to be ordered. I tried to use reshape for this, but I wasn't sure that was the right approach. Thoughts?
Combine the two data frames with rbind, then use aggregate to collapse multiple rows with the same county_code:
aggregate(freq~county_code, rbind(d1, d2) , FUN=sum)
## county_code freq
## 1 1011 6
## 2 1051 1
## 3 1073 10
## 4 1077 1
## 5 1056 2
## 6 1088 6
(Using the definitions in MrFlick's answer.)
Using base functions, you can do a merge() then transform(). here are your sample input data.frames
d1 <- data.frame(
county_code = c("1011", "1051", "1073", "1077"),
freq = c(2L, 1L, 9L, 1L)
)
d2 <- data.frame(
county_code = c("1011", "1056", "1073", "1088"),
freq = c(4L, 2L, 1L, 6L)
)
then you would just do
transform(merge(d1, d2, by="county_code", all=T),
freq = rowSums(cbind(freq.x, freq.y), na.rm=T),
freq.x = NULL, freq.y = NULL
)
to get
county_code freq
1 1011 6
2 1051 1
3 1056 2
4 1073 10
5 1077 1
6 1088 6
Here is one way. I used rbind(),merge() and dplyr.
# sample data
country <- c("01011", "01051", "01073", "01077")
value <- c(2,1,9,1)
foo <- data.frame(country, value, stringsAsFactors=F)
country <- c("01011","01056","01073","01088")
value <- c(4,2,1,6)
foo2 <- data.frame(country, value, stringsAsFactors=F)
library(dplyr)
# bind_rows() is used here because rbind_list() was deprecated in its favor
ana <- group_by(bind_rows(foo, foo2), country) %>%
summarize(count = sum(value))
ana
country count
1 01011 6
2 01051 1
3 01056 2
4 01073 10
5 01077 1
6 01088 6
The other idea I had was the following.
ana2 <- merge(foo, foo2, all = TRUE, by = "country")
country value.x value.y
1 01011 2 4
2 01051 1 NA
3 01056 NA 2
4 01073 9 1
5 01077 1 NA
6 01088 NA 6
bob2 <- ana2 %>%
rowwise() %>%
mutate(count = sum(value.x,value.y, na.rm = TRUE))
country value.x value.y count
1 01011 2 4 6
2 01051 1 NA 1
3 01056 NA 2 2
4 01073 9 1 10
5 01077 1 NA 1
6 01088 NA 6 6
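Another base R possibility, sketched here under the assumption that the county codes are stored as character strings (so the leading zeros from the question survive), is to stack the two data frames and sum with tapply. The names both, totals and result are illustrative.

```r
d1 <- data.frame(county_code = c("01011", "01051", "01073", "01077"),
                 freq = c(2L, 1L, 9L, 1L), stringsAsFactors = FALSE)
d2 <- data.frame(county_code = c("01011", "01056", "01073", "01088"),
                 freq = c(4L, 2L, 1L, 6L), stringsAsFactors = FALSE)

# Stack both tables, then sum freq within each county code
both <- rbind(d1, d2)
totals <- tapply(both$freq, both$county_code, sum)

# tapply returns a named array; turn it back into a two-column data frame
result <- data.frame(county_code = names(totals),
                     freq = as.vector(totals),
                     stringsAsFactors = FALSE)
result
```

Codes that appear in only one input simply keep their own frequency, because each group then contains a single value.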

Resources