How to populate a column using multiple conditionals across 2 dataframes? - r

I'm trying to populate a column with values based on two conditionals across two separate dataframes. So,
if df1$day == df2$day & df1$hour == df2$hour, then fill df1$X with df2$depth.
I struggle because I am not asking it to populate the column with a generic value (i.e. if x == y, then y2 = 1). I am trying to get it to select values across multiple rows. A mock example:
df1                df2
day hour  X        day hour depth
  1   10 NA          1   10    50
  1   11 NA          1   11    10
  2    5 NA          1    3   100
  5    9 NA          5    9    50
  6   20 NA          7   17    80
  7   17 NA         10    4    65
Any help would be greatly appreciated.

An easier option is an update join from data.table:
library(data.table)
setDT(df1)[df2, X := depth, on = .(day, hour)]
df1
# day hour X
#1: 1 10 50
#2: 1 11 10
#3: 2 5 NA
#4: 5 9 50
#5: 6 20 NA
#6: 7 17 80
In base R, we can use match
df1$X <- with(df1, df2$depth[match(paste(day, hour), paste(df2$day, df2$hour))])
data
df1 <- data.frame(day = c(1, 1, 2, 5:7), hour = c(10:11, 5, 9, 20, 17),
                  X = NA_integer_)
df2 <- data.frame(day = c(1, 1, 1, 5, 7, 10), hour = c(10, 11, 3, 9, 17, 4),
                  depth = c(50, 10, 100, 50, 80, 65))

Using dplyr, we can do a left_join and then rename the depth column as X
library(dplyr)
left_join(df1, df2, by = c("day", "hour")) %>%
select(-X) %>%
rename(X = depth)
# day hour X
#1 1 10 50
#2 1 11 10
#3 2 5 NA
#4 5 9 50
#5 6 20 NA
#6 7 17 80
If the X column is not always NA you could use coalesce.
left_join(df1, df2, by = c("day", "hour")) %>%
mutate(X = coalesce(depth, X)) %>%
select(names(df1))
Or in base R (note that the matched column is then named depth rather than X):
merge(df1, df2, all.x = TRUE)[-3]
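If you want to keep the X name with the merge() route, a small variation (a sketch, not part of the original answer) copies depth into X and drops the helper column:
out <- merge(df1, df2, by = c("day", "hour"), all.x = TRUE) # adds a depth column
out$X <- out$depth   # carry the matched depths into X
out$depth <- NULL    # drop the helper column
out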

Related

R dataframe: how to "populate" missing data in df1 using df2

I am trying to populate the missing values of df1 with df2.
Whenever there is a valid value for the same cell in both data frames, I need to keep the value from df1.
If there is a column in df2 that is not present in df1, this new column (z) has to be added to df1.
This would be a simple example:
id <- c(1, 2, 3, 4, 5)
x <- c(10, NA, 20, 50, 70)
y <- c(3, 5, NA, 6, 9)
df1 <- data.frame(id, x, y)
id <- c(2, 3, 5)
x <- c(10, NA, NA)
z <- c(NA, 6, 7)
df2 <- data.frame(id, x, z)
I would like to obtain "df3":
id x y z
1 1 10 3 NA
2 2 10 5 NA
3 3 20 6 6
4 4 50 6 NA
5 5 70 9 7
I tried several "merge" options that didn't work.
A 'merge' option after several extract and replace steps could be
idx <- is.na(df1[df2$id,])
df1[df2$id,][idx] <- df2[idx]
out <- merge(df1, df2[, c("id", "z")], by = "id", all.x = TRUE)
Result
out
# id x y z
#1 1 10 3 NA
#2 2 10 5 NA
#3 3 20 6 6
#4 4 50 6 NA
#5 5 70 9 7
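A name-based alternative is a dplyr left_join() plus coalesce(); a sketch for comparison only, since it matches columns by name: it fills x from df2 and appends z, but leaves y at id 3 as NA instead of borrowing the 6 from df2$z the way the positional replacement above does.
library(dplyr)
df1 %>%
  left_join(df2, by = "id", suffix = c("", ".df2")) %>% # brings in x.df2 and z
  mutate(x = coalesce(x, x.df2)) %>%                    # keep df1's x unless it is NA
  select(-x.df2)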

How to create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample  99_Ape_1  93_Cat_1  87_Ape_2  84_Cat_2  90_Dog_1  92_Dog_2
A              2         3         1         7         4         6
B              5         9         7         0         3         7
C              6         8         9         2         3         0
D              3         9         0         5         8         3
I want to consolidate the dataframe by summing the values based on animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample  Ape  Cat  Dog
A         3   10   10
B        12    9   10
C        15   10    3
D         3   14   11
I have created a list of all the animals, called "animals_list".
I have then created a list of dataframes that subsets each animal into a separate dataframe with:
species_extract <- list()
for (i in 1:length(animals_list)){
species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, so a solution that uses my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the "number_animal_number" structure, so I can't rely on a single search pattern (sorry!).
a data.table approach
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Update: After clarification:
This may work depending on the other names of your animals, but this is a start:
library(dplyr)
library(tidyr)
library(stringr) # str_extract() below comes from stringr
df %>%
pivot_longer(
cols = -sample
) %>%
mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
group_by(sample, name1) %>%
summarise(sum=sum(value)) %>%
pivot_wider(
names_from = name1,
values_from= sum
)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df %>%
mutate(Cat = rowSums(select(., contains("Cat"))),
Ape = rowSums(select(., contains("Ape"))),
Dog = rowSums(select(., contains("Dog")))) %>%
select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to colums)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to colums)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
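For completeness, the same grouping idea can be sketched in plain base R, reusing the ".*_(.*)_.*" regex from the data.table answer (this assumes df1 is still a plain data.frame and that every column after sample follows the number_animal_number pattern):
grp <- gsub(".*_(.*)_.*", "\\1", names(df1)[-1]) # "Ape", "Cat", "Dog", ...
cbind(df1[1], sapply(split.default(df1[-1], grp), rowSums))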

Count number of times in a month and year that time series data is above a threshold

I have a large dataframe in R with daily time series data of rainfall for a number of locations (each in its own column). I would like to know the number of times the rainfall is below or above a threshold value for each location, in each month and also by year.
My dataframe is large so I have provided example data here:
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
How would I find out the number of times the values in A, B, C and D are greater than 4 for each month, and then also for each year?
I think I should then be able to summarise this into two new tables.
One like this (example, ignore numbers):
A B C D
2000-01 1 0 5 0
2000-02 2 16 25 0
2000-03 1 5 26 0
And one like this (example, ignore numbers):
A B C D
2000 44 221 67 0
2001 67 231 4 132
2002 99 111 66 4
2003 33 45 45 4
I think I should be using dplyr for this, but I'm not sure how to handle the dates.
A solution using the dplyr and lubridate package. The key is to create Year and Month columns, group by those columns, and use summarise_all to summarize the data.
# Create the example data frame, set the seed for reproducibility
set.seed(199)
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
library(dplyr)
library(lubridate)
# Summarise for each year and month
df2 <- df %>%
mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year, Month) %>%
summarise_all(funs(sum(. > 4))) %>%
ungroup()
df2
# # A tibble: 61 x 6
# Year Month A B C D
# <dbl> <dbl> <int> <int> <int> <int>
# 1 2000 1 13 8 13 0
# 2 2000 2 12 7 8 0
# 3 2000 3 7 9 9 0
# 4 2000 4 9 12 10 0
# 5 2000 5 11 12 8 0
# 6 2000 6 12 9 16 0
# 7 2000 7 10 11 10 0
# 8 2000 8 8 12 14 0
# 9 2000 9 12 12 12 0
# 10 2000 10 9 9 7 0
# # ... with 51 more rows
# Summarise for each year
df3 <- df %>%
mutate(Year = year(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year) %>%
summarise_all(funs(sum(. > 4)))
df3
# # A tibble: 6 x 5
# Year A B C D
# <dbl> <int> <int> <int> <int>
# 1 2000 120 119 125 0
# 2 2001 119 123 113 0
# 3 2002 135 122 105 0
# 4 2003 114 112 104 0
# 5 2004 115 125 124 0
# 6 2005 9 14 11 0
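As a side note, funs() has since been deprecated in dplyr (0.8.0 and later); for reference, here is a sketch of the same monthly count written with across() (dplyr 1.0+), using the same df as above:
library(dplyr)
library(lubridate)
df %>%
  mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
  select(-Date_ex) %>%
  group_by(Year, Month) %>%
  summarise(across(everything(), ~ sum(.x > 4)), .groups = "drop")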
Here are a few solutions.
1) aggregate This solution uses only base R. The new Date column is the date for the first of the month or first of the year.
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "month"))), sum)
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "year"))), sum)
1a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
aggregate(df[-1] > 4, list(Date = as.yearmon(df[[1]])), sum)
aggregate(df[-1] > 4, list(Date = toyear(df[[1]])), sum)
2) rowsum This is another base R solution. The year/month or year is given by the row names.
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y-%m"))
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y"))
2a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
rowsum((df[-1] > 4) + 0, as.yearmon(df[[1]]))
rowsum((df[-1] > 4) + 0, toyear(df[[1]]))
3) aggregate.zoo Convert to a zoo object and use aggregate.zoo. Note that yearmon class internally represents a year and month as the year plus 0 for Jan, 1/12 for Feb, 2/12 for March, etc. so taking the integer part gives the year.
library(zoo)
z <- read.zoo(df)
aggregate(z > 4, as.yearmon, sum)
toyear <- function(x) as.integer(as.yearmon(x))
aggregate(z > 4, toyear, sum)
The result is a zoo time series with a yearmon index in the first case and an integer index in the second. If you want a data frame use fortify.zoo(ag) where ag is the result of aggregate.
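For example, a minimal sketch of that conversion, reusing z and the yearmon aggregation from (3):
ag <- aggregate(z > 4, as.yearmon, sum) # zoo series of monthly counts
fortify.zoo(ag)                         # data frame with an Index column plus A, B, C, D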
4) dplyr toyear is from (3).
library(dplyr)
library(zoo)
df %>%
group_by(YearMonth = as.yearmon(Date_ex)) %>%
summarize_all(funs(sum(. > 4))) %>%
ungroup
df %>%
group_by(Year = toyear(Date_ex)) %>%
summarize_all(funs(sum(. > 4))) %>%
ungroup
A data.table solution is missing, so I'm adding one. Comments are in the code. I used set.seed(1) to generate the samples.
library(data.table)
setDT(df)
# add year and month to df
df[, `:=`(month = month(Date_ex),
year = year(Date_ex))]
# monthly counts, remove Date_ex
monthly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year, month), .SDcols = -("Date_ex")]
year month A B C D
1: 2000 1 10 10 11 0
2: 2000 2 10 11 8 0
3: 2000 3 11 11 11 0
4: 2000 4 10 11 8 0
5: 2000 5 7 10 8 0
6: 2000 6 9 6 7 0
.....
# yearly counts, remove Date_ex and month
yearly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year), .SDcols = -c("Date_ex", "month")]
year A B C D
1: 2000 114 118 113 0
2: 2001 127 129 120 0
3: 2002 122 108 126 0
4: 2003 123 128 125 0
5: 2004 123 132 131 0
6: 2005 14 15 15 0

Conditionally replace values in data frame from separate data frame in R [duplicate]

This question already has answers here:
Transfer values from one dataframe to another
(4 answers)
Closed 6 years ago.
I would like to conditionally replace values in a column (df1$y) of a data frame (df1) with values from another data frame (df2$y). These data frames have a shared ID column (x). df1 has more rows than df2 and has values that are not in df2.
The condition is: if df1$x == df2$x, replace df1$y with df2$y; if there is no match, do nothing.
Ex:
> df1 <- data.frame(x = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 7),
y = c(100, 100, 50, 50, 75, 75, 75, 50, 100, 25))
> df1
x y
1 1 100
2 1 100
3 2 50
4 3 50
5 4 75
6 4 75
7 4 75
8 5 50
9 6 100
10 7 25
> df2 <- data.frame(x = c(2, 4, 6, 7), y = c(25, 100, 75, 100))
> df2
x y
1 2 25
2 4 100
3 6 75
4 7 100
The desired output is:
df1
x y
1 100
1 25
2 50
3 50
4 100
4 100
4 100
5 50
6 75
7 100
This is my first question posted here, and please excuse me if this has been answered in another question.
This question is different from Transfer values from one dataframe to another because I am asking how to conditionally replace df1$y values but keep the existing values where df1$x != df2$x. In the linked question, values that do not meet the condition become NA. See below for an example using sqldf() based on an answer to the link above.
> sqldf('SELECT df1.x , df2.y
+ FROM df1
+ LEFT JOIN df2
+ ON df2.x = df1.x')
x y
1 1 NA
2 1 NA
3 2 25
4 3 NA
5 4 100
6 4 100
7 4 100
8 5 NA
9 6 75
10 7 100
You can first merge the two data frames and then use the dplyr package to replace the elements:
library(dplyr)
df1 <- merge(df1, df2, by = "x", all = TRUE) %>%
  mutate(y = ifelse(is.na(y.y), y.x, y.y)) %>%
  select(x, y)
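A base R alternative is match(), which only touches the rows of df1 that have a counterpart in df2 (a sketch of the same replacement):
m <- match(df1$x, df2$x)                # NA where df1$x has no match in df2$x
df1$y[!is.na(m)] <- df2$y[m[!is.na(m)]] # overwrite only the matched rows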

R: Replacing NA values by mean of hour with dplyr

I'm learning the dplyr package in R and I really like it. But now I'm dealing with NA values in my data.
I would like to replace any NA by the average of the corresponding hour, for example with this very easy example:
#create an example
day = c(1, 1, 2, 2, 3, 3)
hour = c(8, 16, 8, 16, 8, 16)
profit = c(100, 200, 50, 60, NA, NA)
shop.data = data.frame(day, hour, profit)
#calculate the average for each hour
library(dplyr)
mean.profit <- shop.data %>%
group_by(hour) %>%
summarize(mean=mean(profit, na.rm=TRUE))
> mean.profit
Source: local data frame [2 x 2]
hour mean
1 8 75
2 16 130
Can I use the dplyr transform command to replace the NAs of day 3 in profit with 75 (for 8:00) and 130 (for 16:00)?
Try
shop.data %>%
group_by(hour) %>%
mutate(profit= ifelse(is.na(profit), mean(profit, na.rm=TRUE), profit))
# day hour profit
#1 1 8 100
#2 1 16 200
#3 2 8 50
#4 2 16 60
#5 3 8 75
#6 3 16 130
Or you could use replace
shop.data %>%
group_by(hour) %>%
mutate(profit= replace(profit, is.na(profit), mean(profit, na.rm=TRUE)))
A (less elegant) approach with base functions:
transform(shop.data,
profit = ifelse(is.na(profit),
ave(profit, hour, FUN = function(x) mean(x, na.rm = TRUE)),
profit))
# day hour profit
# 1 1 8 100
# 2 1 16 200
# 3 2 8 50
# 4 2 16 60
# 5 3 8 75
# 6 3 16 130
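For completeness, a data.table sketch of the same group-wise fill (note that setDT() converts shop.data in place):
library(data.table)
setDT(shop.data)[, profit := replace(profit, is.na(profit),
                                     mean(profit, na.rm = TRUE)), by = hour][]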
