So I have a dataframe structured like this:
df <- data.frame(id = c(rep("a", 4), rep("b", 4)),
                 Year = c(2020, 2019, 2018, 2017,
                          2020, 2019, 2018, 2017),
                 value = c(30, 20, 0, 0,
                           70, 50, 30, 0))
> df
id Year value
1 a 2020 30
2 a 2019 20
3 a 2018 0
4 a 2017 0
5 b 2020 70
6 b 2019 50
7 b 2018 30
8 b 2017 0
What I want to do is create a new column with the same values as the value column, except that wherever there is a 0 it takes the value from the closest year with a non-zero value and applies it to all 0 rows, within each id. So the output should be:
> df
id Year value newoutput
1 a 2020 30 30
2 a 2019 20 20
3 a 2018 0 20
4 a 2017 0 20
5 b 2020 70 70
6 b 2019 50 50
7 b 2018 30 30
8 b 2017 0 30
So for id a, years 2018 and 2017 both have 0 values and need to be amended. The next year with a non-zero value is 2019, so we take that year's value, 20, and apply it to both 2018 and 2017. Similarly for id b.
Any ideas on how to do this using dplyr?
A possible solution, based on dplyr and cummax:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(newoutput = value + cummax((value == 0) * lag(value, default = TRUE))) %>%
  ungroup()
#> # A tibble: 8 × 4
#> id Year value newoutput
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 2020 30 30
#> 2 a 2019 20 20
#> 3 a 2018 0 20
#> 4 a 2017 0 20
#> 5 b 2020 70 70
#> 6 b 2019 50 50
#> 7 b 2018 30 30
#> 8 b 2017 0 30
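A caveat (my note, not part of the original answer): the cummax trick works here because the zeros trail the last non-zero value within each id; an interior zero followed by a non-zero would inflate later rows. A more general sketch, assuming tidyr is available, converts 0 to NA and carries the last non-zero value down:
library(dplyr)
library(tidyr)
# Treat 0 as missing, then carry the last non-zero value downward
# within each id (rows are ordered by descending Year, so "down"
# means toward earlier years).
df %>%
  group_by(id) %>%
  mutate(newoutput = na_if(value, 0)) %>%
  fill(newoutput, .direction = "down") %>%
  ungroup()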
gfg_data <- data.frame(
year = c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022),
Timings = c(5, 6, 4, 2, 3, 4, 11, 13, 15, 14, 17, 12)
)
This is a much simpler dataset than the one I'm actually using. Essentially, I'd like to find out which years are most similar in terms of timings. So, I'd like to be able to see that 2019 and 2020 are similar, and 2021/2022 are similar. My original dataset has 500 variables, so it won't be as simple as looking through the data and noting down what is similar.
Given distance 5 (exclusive) as the threshold for clustering the values, you can try igraph as below:
library(igraph)
library(dplyr)

gfg_data %>%
  mutate(group = graph_from_adjacency_matrix(as.matrix(dist(Timings)) < 5, "undirected") %>%
    components() %>%
    membership())
which gives
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
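As a side note (my observation, not from the answer): thresholding the distance matrix and taking graph components is equivalent to single-linkage clustering cut just below the same threshold:
# Single-linkage merges clusters at the minimum pairwise distance, so
# cutting just below 5 reproduces the graph-component grouping above.
cutree(hclust(dist(gfg_data$Timings), method = "single"), h = 4.9)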
If you already have the number of clusters in mind, say 2, you can use kmeans as below:
> transform(gfg_data, group = as.integer(factor(kmeans(Timings, 2)$cluster)))
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
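One caution (my addition, not from the answer): kmeans starts from random centers, so the labels, and occasionally the grouping itself, can vary between runs. Setting a seed and using several restarts makes it reproducible:
set.seed(42)  # any fixed seed; the value is arbitrary
# nstart = 25 runs kmeans from 25 random starts and keeps the best fit.
kmeans(gfg_data$Timings, centers = 2, nstart = 25)$cluster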
1) max absolute diff Assuming that each year has the same number of rows, in consistent order, we can calculate the maximum absolute difference between the timings in each pair of years and then sort the results. m is a matrix with one column per year, and out is a vector of the maximum absolute differences for each pair. outDF represents out as a data frame. From outDF and the bar plot we see that 2019/2020 and 2021/2022 are closer to each other than the other pairs are.
maxabs <- function(ix, i = ix[1], j = ix[2]) max(abs(m[, i] - m[, j]))
m <- do.call("cbind", with(gfg_data, split(Timings, year)))
out <- combn(colnames(m), 2, maxabs)
names(out) <- combn(colnames(m), 2, paste, collapse = ":")
outDF <- stack(sort(out))[2:1]
outDF
## ind values
## 1 2019:2020 3
## 2 2021:2022 4
## 3 2019:2021 11
## 4 2019:2022 11
## 5 2020:2021 11
## 6 2020:2022 14
with(outDF, barplot(values, names = ind))
2) multcomp Another possibility is to use multcomp to perform multiple-comparison significance tests on the group means. The data frame of years and letters shows that 2019 and 2020 are not significantly different, and similarly for 2021 and 2022. The plot at the end shows a boxplot for each year with the significance grouping letters above each one at the top.
library(multcomp)
mdl <- lm(Timings ~ year, transform(gfg_data, year = factor(year)))
comp <- glht(mdl, mcp(year = "Tukey"))
CLD <- cld(comp)
stack(CLD$mcletters$monospacedLetters)[2:1]
## ind values
## 1 2019 a
## 2 2020 a
## 3 2021 b
## 4 2022 b
plot(CLD)
3) emmeans Using emmeans we can create a plot of confidence intervals for the mean difference of each pair of years; the pairs whose interval crosses zero are not significantly different. mdl is from above.
library(emmeans)
p <- pairs(emmeans(mdl, "year"), adjust = "tukey")
plot(p)
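If you prefer a table to the plot, a minimal sketch (assuming mdl from above is still in scope):
library(emmeans)
# Tabulate the Tukey-adjusted pairwise contrasts; pairs with a large
# p.value (e.g. 2019 vs 2020) are the "similar" years.
summary(pairs(emmeans(mdl, "year"), adjust = "tukey"))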
Another approach is hierarchical clustering, e.g. with hclust using k=2:
cbind(gfg_data, grp = cutree(hclust(dist(gfg_data$Timings)), k=2))
year Timings grp
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
or with height h=7, which can be found by inspecting the dendrogram (the red line is drawn at height 7):
plot(hclust(dist(gfg_data$Timings)))
lines(c(1, 10), c(7, 7), col = "red", lwd = 3)
cbind(gfg_data, grp = cutree(hclust(dist(gfg_data$Timings)), h=7))
year Timings grp
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
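To see at a glance which years fall in which cluster, a small cross-tabulation sketch (my addition, reusing the grp column from above):
# Cross-tabulate year against cluster membership; each year's three
# rows should land in a single cluster if the years separate cleanly.
res <- cbind(gfg_data, grp = cutree(hclust(dist(gfg_data$Timings)), k = 2))
table(res$year, res$grp)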
I have a dataset where I observe individuals in different years (e.g., individual 1 is observed in 2012 and 2014, while individuals 2 and 3 are only observed in 2016). I would like to expand the data for each individual (i.e., each individual would have 3 rows: 2012, 2014 and 2016) in order to create panel data with an indicator for whether an individual is observed or not.
My initial dataset is:
year   individual_id   rank
2012   1               11
2014   1               16
2016   2               76
2016   3               125
And I would like to get something like that:
year   individual_id   rank   present
2012   1               11     1
2014   1               16     1
2016   1               .      0
2012   2               .      0
2014   2               .      0
2016   2               76     1
2012   3               .      0
2014   3               .      0
2016   3               125    1
So far I have tried to play with "expand":
bys researcher: egen count=count(year)
replace count=3-count+1
bys researcher: replace count=. if _n>1
expand count
which gives me 3 rows per individual. Unfortunately this just copies one of the initial rows, and I am unable to get from there to the final desired dataset.
Thanks in advance for your help!
You can use expand.grid to create a data frame of all combinations of your inputs, then full-join the tables together and add a condition to determine whether the individual was present in that year.
library(dplyr)
dt = data.frame(
year = c(2012,2014,2016,2016),
individual_id = c(1,1,2,3),
rank = c(11,16,76,125)
)
exp = expand.grid(year = c(2012,2014,2016), individual_id = c(1:3))
dt %>%
full_join(exp, by = c("year","individual_id")) %>%
mutate(present = ifelse(!is.na(rank), 1, 0)) %>%
arrange(individual_id, year)
year individual_id rank present
1 2012 1 11 1
2 2014 1 16 1
3 2016 1 NA 0
4 2012 2 NA 0
5 2014 2 NA 0
6 2016 2 76 1
7 2012 3 NA 0
8 2014 3 NA 0
9 2016 3 125 1
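An equivalent sketch with tidyr::complete, which builds the missing combinations in one step (my addition, not part of the original answer; it assumes the same dt as above):
library(dplyr)
library(tidyr)
# complete() expands dt to every year/individual_id combination,
# filling rank with NA where a row did not exist.
dt %>%
  complete(year = c(2012, 2014, 2016), individual_id) %>%
  mutate(present = as.integer(!is.na(rank))) %>%
  arrange(individual_id, year)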
Wasn't sure of the best way to word this, but I'd like to multiply / divide two columns by each other, lagged by one row (in my dataset this means varx/vary - 1 row).
The end result should be an additional column with one NA value (for the first row, where the lagged year isn't present).
I'm having trouble indexing it, but I think it would go something along these lines...
e.g.
df <- data_frame(year = c(2010:2020), var_x = c(20:30), var_y = c(2:12))
#not correct
diff <- df[,2, 2:ncol(df)-1] * df[,3, 1:ncol(df)]
dplyr would look something like...
df %>%
mutate(forecast = (var_x * ncol(var_y)-1))
incorrect result:
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 40
2 2011 21 3 63
3 2012 22 4 88
4 2013 23 5 115
5 2014 24 6 144
6 2015 25 7 175
7 2016 26 8 208
8 2017 27 9 243
9 2018 28 10 280
10 2019 29 11 319
11 2020 30 12 360
Error in mutate_impl(.data, dots) :
Column `forecast` must be length 11 (the number of rows) or one, not 0
Thanks, your guidance is appreciated.
From the recommended comment above:
df %>%
mutate(forecast = var_y * lag(var_x))
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 NA
2 2011 21 3 60
3 2012 22 4 84
4 2013 23 5 110
5 2014 24 6 138
6 2015 25 7 168
7 2016 26 8 200
8 2017 27 9 234
9 2018 28 10 270
10 2019 29 11 308
11 2020 30 12 348
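For the division variant mentioned in the question, the same lag() pattern applies; a minimal sketch (my addition):
library(dplyr)
# Divide the current var_x by the previous row's var_y; the first
# row has no predecessor, hence NA.
df %>%
  mutate(ratio = var_x / lag(var_y))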
I have coordinates for each site and the year each site was sampled (fake dataframe below).
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
> dfA
LAT LONG YEAR
1 1 11 2001
2 2 12 2002
3 3 13 2003
4 4 14 2004
5 5 15 2005
6 1 11 2006
7 2 12 2007
8 3 13 2008
9 4 14 2009
10 5 15 2010
11 1 11 2011
12 2 12 2012
13 3 13 2013
14 4 14 2014
15 5 15 2015
16 1 16 2016
17 2 17 2017
18 3 18 2018
19 4 19 2019
20 5 20 2020
I'm trying to pull out the years in which each unique location was sampled. So I first pulled out each unique location and the number of times it was sampled using the following code:
dfB <- dfA %>%
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB)
LAT LONG Freq
1 1 11 3
2 1 16 1
3 2 12 3
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
I am now trying to get the year for each unique location. I.e. I ultimately want this:
LAT LONG Freq . Year
1 1 11 3 . 2001,2006,2011
2 1 16 1 . 2016
3 2 12 3 . 2002,2007,2012
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
This is what I've tried:
1) Find which rows in dfA that corresponds with dfB:
dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))
> idx
[1] 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 2 4 6 8 10
So idx[1] means row 1 of dfA matches row 1 of dfB, and rows 6 and 11 of dfA also match row 1 of dfB.
I've tried this to extract info:
for (row in 1:20) {
  year <- as.character(dfA$YEAR[row])
  tmp <- dfB$obs_Year[idx[row]]
  if (isTRUE(is.na(dfB$obs_Year[idx[row]]))) {
    dfB$obs_Year[idx[row]] <- year
  }
  if (isFALSE(is.na(dfB$obs_Year[idx[row]]))) {
    dfB$obs_Year[idx[row]] <- as.list(append(tmp, year))
  }
}
I keep getting this error:
number of items to replace is not a multiple of replacement length
Does anyone know how to extract the years from the matching pairs of dfA and dfB? I don't know if this is the most efficient code, but this is as far as I've gotten. Thanks in advance!
You can do this with a dplyr chain that first builds your year column and then filters down to only unique observations.
The logic is to build the year variable by grouping your data by location and then pasting all the years for a given location into a single string variable, which we call year_string. We then also compute the frequency, but this is not strictly necessary.
The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see repeated values for each location. So we exclude the YEAR column and then ask R to return unique() values of the data frame. It will pick one of the observations per location where multiple occur, but since they are identical that doesn't matter.
Code below:
library(dplyr)
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
# We assign the output to dfB
dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
mutate( # The mutate verb is for building new variables.
year_string = paste(YEAR, collapse = ","), # the function paste()
# collapses the vector YEAR into a string
# the argument collapse = "," says to
# separate each element of the string with a comma
Freq = n()) %>% # I compute the frequency as you did
select(LAT, LONG, Freq, year_string) %>%
# Now I select only the columns that index
# location, frequency and the combined years
unique() # Now I filter for only unique observations. Since I have not picked
# YEAR in the select function only unique locations are retained
dfB
#> # A tibble: 10 x 4
#> # Groups: LAT, LONG [10]
#> LAT LONG Freq year_string
#> <int> <int> <int> <chr>
#> 1 1 11 3 2001,2006,2011
#> 2 2 12 3 2002,2007,2012
#> 3 3 13 3 2003,2008,2013
#> 4 4 14 3 2004,2009,2014
#> 5 5 15 3 2005,2010,2015
#> 6 1 16 1 2016
#> 7 2 17 1 2017
#> 8 3 18 1 2018
#> 9 4 19 1 2019
#> 10 5 20 1 2020
Created on 2019-01-21 by the reprex package (v0.2.1)
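A more direct variant (my addition, not from the answer) collapses each location with summarise() instead of the mutate()/select()/unique() round trip; it assumes dplyr >= 1.0 for the .groups argument:
library(dplyr)
# summarise() returns exactly one row per LAT/LONG group.
dfA %>%
  group_by(LAT, LONG) %>%
  summarise(Freq = n(),
            year_string = paste(YEAR, collapse = ","),
            .groups = "drop")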
Assume I have data:
data.frame(Plot = rep(1:2,3),Index = rep(1:3, each = 2), Val = c(1:6)*10)
Plot Index Val
1 1 1 10
2 2 1 20
3 1 2 30
4 2 2 40
5 1 3 50
6 2 3 60
I want to create new columns combining/aggregating all Val that share a common Index for a given Plot. I want to do this for each Index.
Plot Val1 Val2 Val3
1 1 10 30 50
2 2 20 40 60
I would like any remaining columns (e.g., just Plot in this simplified example) to remain in my final data.frame.
My Attempt
I know I can do this step-wise using aggregate() and merge(), but is there a way to do it in a single call (or minimal calls)?
Any approach is great, but I always like to see an elegant base R approach if one exists...
Update:
I'm looking for a solution that also works well when other columns are involved:
dat2 = data.frame(Plot = rep(1:2,each = 8),Year = rep(rep(2010:2011, each = 4),2),
Index = rep(rep(1:2,2),4), Val = rep(c(1:4)*10,4))
Plot Year Index Val
1 1 2010 1 10
2 1 2010 2 20
3 1 2010 1 30
4 1 2010 2 40
5 1 2011 1 10
6 1 2011 2 20
7 1 2011 1 30
8 1 2011 2 40
9 2 2010 1 10
10 2 2010 2 20
11 2 2010 1 30
12 2 2010 2 40
13 2 2011 1 10
14 2 2011 2 20
15 2 2011 1 30
16 2 2011 2 40
#Resulting in (if aggregating by sum, for example):
Plot Year Val1 Val2
1 1 2010 40 60
2 1 2011 40 60
3 2 2010 40 60
4 2 2011 40 60
Also, ideally, the new columns could be named based on the Index value.
So if my index were instead A:C, my new columns would be ValA, ValB, and ValC
It seems you want a base R solution; you can do something like:
m = aggregate(Val~.,dat2,sum)
reshape(m,v.names = "Val",idvar = c("Plot","Year"),timevar = "Index",direction = "wide")
Plot Year Val.1 Val.2
1 1 2010 40 60
2 2 2010 40 60
3 1 2011 40 60
4 2 2011 40 60
But you can use other functions:
do.call(data.frame,aggregate(Val~Plot+Year,m,I))
Plot Year Val.1 Val.2
1 1 2010 40 60
2 2 2010 40 60
3 1 2011 40 60
4 2 2011 40 60
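Why the do.call(data.frame, ...) wrapper is needed (my explanation, not from the answer): aggregate() with FUN = I stores each group's values as a single matrix column, and wrapping the result in data.frame() splits that matrix into ordinary columns. You can inspect the intermediate object directly:
# Without the wrapper, Val is one two-column matrix, not two columns.
a <- aggregate(Val ~ Plot + Year, m, I)
str(a$Val)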
Or using the reshape2 library, you can tackle the problem as:
library(reshape2)
dcast(dat2,Plot+Year~Index,sum,value.var = "Val")
Plot Year 1 2
1 1 2010 40 60
2 1 2011 40 60
3 2 2010 40 60
4 2 2011 40 60
One can use the gather, unite and spread functions to get the desired result, as requested by the OP.
library(tidyverse)
df <- data.frame(Plot = rep(1:2,3),Index = rep(1:3, each = 2), Val = c(1:6)*10)
df %>% gather(key, value, -Plot, -Index) %>%
unite("key", c(key,Index), sep="") %>%
spread(key, value)
# Plot Val1 Val2 Val3
# 1 1 10 30 50
# 2 2 20 40 60
Note: there are other, shorter options (as correctly pointed out by @Onyambu), but per the OP's request the columns need to be renamed, which these shorter calls do not do:
spread(df, Index, Val)
# Plot 1 2 3
# 1 1 10 30 50
# 2 2 20 40 60
aggregate(Val~Plot,df,I)
# Plot Val.1 Val.2 Val.3
# 1 1 10 30 50
# 2 2 20 40 60
Update: based on the 2nd data frame from the OP.
dat2 = data.frame(Plot = rep(1:2,each = 8),Year = rep(rep(2010:2011, each = 4),2),
Index = rep(rep(1:2,2),4), Val = rep(c(1:4)*10,4))
library(tidyverse)
library(reshape2)
dat2 %>% gather(key, value, -Plot, -Index, -Year) %>%
  unite("key", c(key, Index), sep = "") %>%
  dcast(Plot + Year ~ key, value.var = "value", fun.aggregate = sum)
#   Plot Year Val1 Val2
# 1    1 2010   40   60
# 2    1 2011   40   60
# 3    2 2010   40   60
# 4    2 2011   40   60
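As a closing note (my addition): in current tidyr the gather()/spread() pair is superseded, and pivot_wider() handles the reshaping, aggregation and column naming in one call; a sketch, assuming tidyr >= 1.0:
library(tidyr)
# values_fn = sum aggregates duplicate Plot/Year/Index rows;
# names_prefix renames the new columns to Val1 and Val2.
pivot_wider(dat2,
            names_from = Index,
            values_from = Val,
            values_fn = sum,
            names_prefix = "Val")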