Replace missing data using another data table for multiple columns in R

I have many columns in a table with missing data. When a value is missing for a particular record, I want to pull in the information from another table, matched by ID. I thought about joining the two tables and writing a condition per column (if column X is NA, pull in the information from column Y); however, I have many columns, which would mean writing many such conditions.
Instead, I want a function or a loop to which I can pass the names of the columns with missing data, along with the corresponding column names in the other table to take the information from.
Reproducible Example:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,NA,NA,1968,1992)
Month <- c(1,NA,8,12,NA,5)
Day <- c(3,NA,NA,NA,NA,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
ID <- c(2,3,4,5)
Year <- c(NA,1994,1967,NA)
Month <- c(4,NA,NA,10)
Day <- c(23,12,16,9)
Old_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Expected Output:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,1994,1967,1968,1992)
Month <- c(1,4,8,12,10,5)
Day <- c(3,23,12,16,9,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)

Combine the two data frames with rbind, then use group_by with summarise_all to take the first non-NA value per ID:
library(dplyr)
rbind(New_Data, Old_Data) %>%
  group_by(ID) %>%
  dplyr::summarise_all(function(x) x[!is.na(x)][1])
# A tibble: 6 x 4
     ID  Year Month   Day
  <dbl> <dbl> <dbl> <dbl>
1     1  1990     1     3
2     2  1987     4    23
3     3  1994     8    12
4     4  1967    12    16
5     5  1968    10     9
6     6  1992     5    30
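Note that summarise_all() is superseded in recent dplyr. Assuming dplyr 1.0 or later, a minimal sketch of the same idea with across():
bind_rows(New_Data, Old_Data) %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))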

An option using dplyr::left_join and dplyr::coalesce:
library(dplyr)
New_Data %>%
  left_join(Old_Data, by = "ID") %>%
  mutate(Year  = coalesce(Year.x, Year.y),
         Month = coalesce(Month.x, Month.y),
         Day   = coalesce(Day.x, Day.y)) %>%
  select(ID, Year, Month, Day)
#   ID Year Month Day
# 1  1 1990     1   3
# 2  2 1987     4  23
# 3  3 1994     8  12
# 4  4 1967    12  16
# 5  5 1968    10   9
# 6  6 1992     5  30
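Since the question mentions many columns, writing one coalesce() per column gets tedious. A sketch that generalizes the same join-and-coalesce idea over all shared columns (assuming every common column other than ID should be filled):
library(dplyr)
common <- setdiff(intersect(names(New_Data), names(Old_Data)), "ID")
joined <- left_join(New_Data, Old_Data, by = "ID")
for (nm in common) {
  # prefer the New_Data (.x) value, fall back to the Old_Data (.y) value
  joined[[nm]] <- coalesce(joined[[paste0(nm, ".x")]], joined[[paste0(nm, ".y")]])
}
joined[, c("ID", common)]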

Here's a solution using only base functions, adapted from another SO question.
I modified it to your needs (wrapped it in a function and added an argument for the key column name):
fill_missing_data <- function(df1, df2, keyColumn) {
  commonNames <- names(df1)[colnames(df1) %in% colnames(df2)]
  commonNames <- commonNames[commonNames != keyColumn]
  dfmerge <- merge(df1, df2, by = keyColumn, all = TRUE)
  for (i in commonNames) {
    left  <- paste(i, ".x", sep = "")
    right <- paste(i, ".y", sep = "")
    # where the left-hand (df1) value is NA, take the value from df2
    dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
    dfmerge[right] <- NULL
    colnames(dfmerge)[colnames(dfmerge) == left] <- i
  }
  return(dfmerge)
}
result = fill_missing_data(New_Data, Old_Data, "ID")
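For the example data this should reproduce the expected output:
result
#   ID Year Month Day
# 1  1 1990     1   3
# 2  2 1987     4  23
# 3  3 1994     8  12
# 4  4 1967    12  16
# 5  5 1968    10   9
# 6  6 1992     5  30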

Related

I have three datasets of libraries for the past 3 years. I want to take those datasets to make a new dataframe

I have three datasets of Ontario libraries for the past 3 years. The datasets have various information about the libraries: their address, city, card holders, etc. I combined all of the datasets into one new dataset called data_combined, like so:
data_2017<- read.csv("Downloads/2017.csv")
data_2016<- read.csv("Downloads/2016.csv")
data_2015<- read.csv("Downloads/2015.csv")
common_columns <- Reduce(intersect, list(colnames(data_2017), colnames(data_2016),colnames(data_2015)))
data_combined <- rbind(
  subset(data_2017, select = common_columns),
  subset(data_2016, select = common_columns),
  subset(data_2015, select = common_columns)
)
write.csv(data_combined, "Downloads.csv")
What I need help with is writing a sequence of code that creates a single data set which can be used to output a table listing the number of libraries in each city for the last 3 years. In Excel I would use the COUNT function to see how many libraries each city has; I need the equivalent in R. I want to make a new table with the city names as the rows and, as the columns, the number of libraries for each of 2015, 2016 and 2017.
I want the new dataframe laid out like the example table I found, but with 2015, 2016 and 2017 as the year columns instead of 1999, 2000 and 2001.
The datasets for 2015, 2016 and 2017 can be found here; use only those three years.
Thanks
This sounds like "Calculate the mean by group" for summarizing by group, then "Reshape multiple value columns to wide format" for pivoting from long to wide. However, it is complicated by the fact that some numbers contain commas, rendering those columns character instead of numeric, so rbinding them will be problematic. Here's a pipe that should take care of all of that.
I've downloaded those three files to my ~/Downloads/ directory, then
library(dplyr)
alldat <- lapply(grep("ontario", list.files("~/Downloads/", full.names = TRUE), value = TRUE), read.csv)
common_columns <- Reduce(intersect, sapply(alldat, names))
data_combined <- alldat %>%
  # columns that are really numbers-with-commas arrive as character; strip the
  # commas and convert them back to numeric so the frames can be row-bound
  lapply(function(dat) as.data.frame(
    lapply(dat, function(z) if (all(grepl("^[0-9.,]*$", z))) type.convert(gsub(",", "", z), as.is = TRUE) else z)
  )) %>%
  lapply(subset, select = common_columns) %>%
  bind_rows() %>%
  tibble() %>%
  count(City = A1.10.City.Town, Year = Survey.Year.From) %>%
  tidyr::pivot_wider(id_cols = City, names_from = Year, values_from = n)
data_combined
# # A tibble: 336 x 4
#    City         `2015` `2016` `2017`
#    <chr>         <int>  <int>  <int>
#  1 Addison           1      1      1
#  2 Ajax              1      1      1
#  3 Alderville        1      1      1
#  4 Algoma Mills      1      1      1
#  5 Alliston          2      2      2
#  6 Almonte           1      1      1
#  7 Amaranth          1      1      1
#  8 Angus             1      1      1
#  9 Apsley            1      1      1
# 10 Arnprior          2      2      2
# # ... with 326 more rows
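For comparison, base R can produce the same city-by-year counts from the combined long data. A sketch, assuming longdat holds the result of the pipeline above stopped just before the count() step (longdat is a name introduced here for illustration):
tab <- xtabs(~ A1.10.City.Town + Survey.Year.From, data = longdat)
head(as.data.frame.matrix(tab)) # cities as rows, years as columns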

Vectorised calculation

I have data in wide format with a Category column listing types of transport, and then one column per type of transport containing totals.
I want to create a Calc column in which each row is summed across the columns, excluding the value in the column whose name matches the Category.
So for the Car row, the sum would be Train + Bus; for the Train row, Car + Bus.
If a type of transport in the Category column isn't listed as a column name, the Calc column should contain NA.
The dataframe is below, with the expected results already filled in the Calc column.
Category<-c("Car","Train","Bus","Bicycle")
Car<-c(9,15,25,5)
Train<-c(8,22,1,7)
Bus<-c(5,2,4,8)
Calc<-c(13, 17,26,NA)
df<-data.frame(Category,Car,Train,Bus,Calc, stringsAsFactors = FALSE)
Can anyone suggest how to add the Calc column as per above? Ideally a vectorised calculation without a loop.
Here is an alternative in base R. You can use apply row-wise over your data.frame. If the Category value is one of the column names, calculate the sum while excluding both the Category column and the column named by Category; otherwise, use NA. Note that apply coerces each row of a mixed-type data frame to character, hence the as.numeric().
df <- data.frame(Category, Car, Train, Bus, stringsAsFactors = FALSE) # rebuild df without the pre-filled Calc column so it is not included in the row sums
df$Calc <- apply(
  df,
  1,
  \(x) {
    if (x["Category"] %in% names(x)) {
      sum(as.numeric(x[setdiff(names(x), c(x["Category"], "Category"))]))
    } else {
      NA_integer_
    }
  }
)
df
Output
  Category Car Train Bus Calc
1      Car   9     8   5   13
2    Train  15    22   2   17
3      Bus  25     1   4   26
4  Bicycle   5     7   8   NA
Here is a tidyverse solution:
df <- data.frame(Category, Car, Train, Bus, stringsAsFactors = FALSE)
library(dplyr)
library(tidyr)
df |>
  pivot_longer(cols = !Category,
               names_to = "cat2",
               values_to = "value") |>
  group_by(Category) |>
  mutate(value = case_when(Category %in% cat2 ~ value,
                           TRUE ~ NA_real_)) |>
  filter(cat2 != Category) |>
  summarize(Calc = sum(value)) |>
  left_join(df)
# A tibble: 4 × 5
  Category  Calc   Car Train   Bus
  <chr>    <dbl> <dbl> <dbl> <dbl>
1 Bicycle     NA     5     7     8
2 Bus         26    25     1     4
3 Car         13     9     8     5
4 Train       17    15    22     2
A vectorised base R approach: take rowSums over the numeric columns and subtract the row's own-category value, extracted with matrix indexing.
# Example data
Category <- c("Car", "Train", "Bus", "Bicycle")
Car   <- c(9, 15, 25, 5)
Train <- c(8, 22, 1, 7)
Bus   <- c(5, 2, 4, 8)
df <- data.frame(Category, Car, Train, Bus, stringsAsFactors = FALSE)
# add the "Calc" column: row total minus the element in the row's own category
# (match() returns NA for categories without a column, which propagates to Calc)
idx <- matrix(c(1:nrow(df), match(df$Category, colnames(df)[2:4])), ncol = 2)
df$Calc <- rowSums(df[, 2:4]) - df[, 2:4][idx]
df
#>   Category Car Train Bus Calc
#> 1      Car   9     8   5   13
#> 2    Train  15    22   2   17
#> 3      Bus  25     1   4   26
#> 4  Bicycle   5     7   8   NA
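The trick is that indexing with a two-column matrix of (row, column) pairs extracts one element per pair, and match() returns NA for Bicycle, which propagates through the subtraction. A tiny illustration of that indexing:
m <- matrix(1:9, nrow = 3)
m[cbind(c(1, 3), c(2, 1))] # elements (1,2) and (3,1): returns 4 and 3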

R join tables by variable with multiple year observations

I have multiple tables, all with the same variable names, that I want to join by an ID, but each table represents a different year. If I use an inner_join, it correctly keeps only the IDs present in every table, but it creates new variables for the observations (i.e. X becomes X.x and X.y in the same row). I could use rbind, but that would keep all the data, when I only want the IDs that appear in each table.
library(dplyr)
df1 <- data.frame(x1 = 1:3,
                  x2 = c(12, 14, 11),
                  year = 2020)
df2 <- data.frame(x1 = 2:4,
                  x2 = c(15, 17, 13),
                  year = 2021)
dfall <- inner_join(df1, df2, by = "x1")
This results in:
x1 x2.x year.x x2.y year.y
 2   14   2020   15   2021
 3   11   2020   17   2021
But I want this:
x1 x2 year
 2 14 2020
 2 15 2021
 3 11 2020
 3 17 2021
Is there a join where I can do this?
dplyr::bind_rows and then filter would work:
bind_rows(df1, df2) %>%
  filter(x1 %in% intersect(df1$x1, df2$x1))
You can pipe the output to arrange(x1) to sort it if needed, as shown after the output below.
Output
  x1 x2 year
1  2 14 2020
2  3 11 2020
3  2 15 2021
4  3 17 2021
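With the arrange(x1) step added, the rows come out in the order asked for in the question:
bind_rows(df1, df2) %>%
  filter(x1 %in% intersect(df1$x1, df2$x1)) %>%
  arrange(x1)
#   x1 x2 year
# 1  2 14 2020
# 2  2 15 2021
# 3  3 11 2020
# 4  3 17 2021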
library(tidyr) # pivot_longer
inner_join(df1, df2, by = "x1") %>%
  pivot_longer(-x1, names_pattern = "(.*)\\.(.*)",
               names_to = c(".value", "val")) %>%
  select(-val)
# # A tibble: 4 x 3
#      x1    x2  year
#   <int> <dbl> <dbl>
# 1     2    14  2020
# 2     2    15  2021
# 3     3    11  2020
# 4     3    17  2021
Try this. It's an inner join of your two approaches so far.
dfall <- inner_join(rbind(df1, df2), inner_join(df1, df2, by = "x1") %>% select(x1))
Here's another option. It creates a column n equal to the number of times each x1 appears, then keeps only those x1 which appear as many times as there are distinct values of year. You could change n == length(unique(year)) to n >= 2 if you wanted records that appear in more than one year/table, as opposed to those which appear in every year/table. This one is nice because it is easy to scale up to a large number of input tables.
dfall <- rbind(df1, df2) %>%
  add_count(x1) %>%
  filter(n == length(unique(year))) %>%
  select(-n)

How to find the first occurrence of a negative value for each factor

I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp < 0))
but it only returns the first value overall, not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- cbind(season,temp,ID)
Ideally I want a table that looks like this from the above dummy code:
     season id_firstnegative
[1,]      1                2
[2,]      2                4
[3,]      3                8
[4,]      4               10
[5,]      5               13
A base R option using subset and aggregate (converting the matrix df to a data frame first, since aggregate's formula method needs one):
aggregate(ID ~ season, subset(as.data.frame(df), temp < 0), head, 1)
#   season ID
# 1      1  2
# 2      2  4
# 3      3  8
# 4      4 10
# 5      5 13
library(dplyr)
season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- as.data.frame(cbind(season, temp, ID))
df %>%
  dplyr::filter(temp < 0) %>%
  group_by(season) %>%
  dplyr::filter(row_number() == 1) %>%
  ungroup()
As you said, I believe you could solve this by simply grouping on season and taking the first ID whose temperature is below zero within each group. However, the ordering of your data matters here, so ensure each season is correctly ordered before using this solution.
library(dplyr)
library(tibble)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<- tibble(season,temp,ID)
df <- df %>%
  group_by(season) %>%
  mutate(firstNeg = ID[which(temp < 0)][1]) %>%
  distinct(season, firstNeg) # keep only unique season/firstNeg rows for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups:   season [5]
  season firstNeg
   <dbl>    <dbl>
1      1        2
2      2        4
3      3        8
4      4       10
5      5       13

R: Consolidating duplicate observations?

I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations appear only once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID, with all the information in columns 2:150 concatenated. I do not need any calculations run on these observations, just a quick munging.
I've tried:
df.new <- group_by(df,"ID")
and also:
library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]
and unfortunately neither has worked. Any help is appreciated!
Using base R:
df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                day = c("1","2","3","4","5","6","7","8"),
                year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                stringsAsFactors = F)
> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016
Do:
z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x) { paste0(x, collapse = "/") }
)
Result:
> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016
EDIT
If you want to avoid "collapsing" NAs, do:
z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x) { paste0(x[!is.na(x)], collapse = "/") })
For a data frame like:
> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016
The result is:
> z
  id  day           year
1  a  1/2           2016
2  b  3/4 2017/2016/2017
3  c    6           2016
4  d  7/8      2017/2016
I have had a similar problem in the past, though I wasn't dealing with several copies of the same data; it was in most cases just 2 instances and in some cases 3. Below was my approach; hopefully it will help.
library(dplyr)
library(tidyr)
# index of every row whose key is duplicated (fromLast = TRUE also catches the first occurrence)
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]       # the duplicated rows
non_dupes <- df[!idx, ]  # the rows whose key appears only once
temp <- dupes %>%
  group_by(key) %>%                             # roll up the duplicated rows
  fill(everything(), .direction = "downup") %>% # fill NAs within each key in both directions (replaces the deprecated fill_)
  slice(1)                                      # keep one filled row per key
Then it is easy to merge back the temp and the non_dupes.
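For completeness, a minimal sketch of that last step, assuming temp and non_dupes have the same columns:
result <- bind_rows(ungroup(temp), non_dupes)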
EDIT
I would highly recommend filtering df down to only the population that is relevant for your end goal, as this process can take some time.
What about?
df %>%
  group_by(ID) %>%
  summarise_each(funs(paste0(., collapse = "/")))
Or, reproducibly with iris:
iris %>%
  group_by(Species) %>%
  summarise_each(funs(paste0(., collapse = "/")))
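Note that summarise_each() and funs() are deprecated in current dplyr. Assuming dplyr 1.0 or later, an equivalent with across() would be:
iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))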
