I am relatively new to R, and the solution to this problem is probably rather simple.
Let's imagine that I have a nest dataset for two bird species (a and b) like this:
df
year nestid sp egg chick
2013 a1 a 2 1
2013 a2 a NA 1
2013 a3 a NA 0
2013 a4 a NA 1
2013 a5 a NA 0
2013 b1 b 2 0
2013 b2 b NA 1
2013 b3 b NA 2
2013 b4 b NA 1
2014 a1 a NA 1
2014 a2 a NA 1
2014 a3 a 1 1
2014 a4 a NA 1
2014 a5 a NA 1
2014 b1 b NA 1
2014 b2 b NA 2
2014 b3 b NA 2
2014 b4 b NA 1
I want to infer the number of eggs for those NAs from the number of chicks. It makes sense to replace NA with 2 if there were 2 chicks, as they lay 2 eggs at most.
But for species "a" in 2013, I want to replace NA with 2 for a randomly selected 80% of the nests with 1 chick, and with 1 for the remaining 20% of those nests. For species "a" in 2014, the ratio would be 40% and 60% for clutch sizes of 2 and 1 respectively.
I tried the following but could not work out how to code it properly.
df%>% mutate(egg=ifelse(egg==0 & chick==2, 2, egg))
df%>%
mutate(egg=ifelse(egg==0 & chick==1 & year==2013, sample_frac(.8)==2, egg))
Any help would be greatly appreciated!
Many thanks
One approach could be:
set.seed(123)
#missing egg & chick = 2
df$egg <- with(df,ifelse(is.na(egg) & chick == 2, 2, egg))
#2013 data having species = 'a', missing egg & chick = 1
x <- with(df, which(is.na(egg) & chick == 1 & sp == 'a' & year == 2013))
x_sample <- sample(x, round(0.8 * length(x)))
df$egg[x_sample] <- 2
df$egg[setdiff(x, x_sample)] <- 1
#2014 data having species = 'a', missing egg & chick = 1
x <- with(df, which(is.na(egg) & chick == 1 & sp == 'a' & year == 2014))
x_sample <- sample(x, round(0.4 * length(x)))
df$egg[x_sample] <- 2
df$egg[setdiff(x, x_sample)] <- 1
which gives
> df
year nestid sp egg chick
1 2013 a1 a 2 1
2 2013 a2 a 2 1
3 2013 a3 a NA 0
4 2013 a4 a 2 1
5 2013 a5 a NA 0
6 2013 b1 b 2 0
7 2013 b2 b NA 1
8 2013 b3 b 2 2
9 2013 b4 b NA 1
10 2014 a1 a 1 1
11 2014 a2 a 2 1
12 2014 a3 a 1 1
13 2014 a4 a 2 1
14 2014 a5 a 1 1
15 2014 b1 b NA 1
16 2014 b2 b 2 2
17 2014 b3 b 2 2
18 2014 b4 b NA 1
Sample data:
df <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L), nestid = c("a1", "a2", "a3", "a4", "a5",
"b1", "b2", "b3", "b4", "a1", "a2", "a3", "a4", "a5", "b1", "b2",
"b3", "b4"), sp = c("a", "a", "a", "a", "a", "b", "b", "b", "b",
"a", "a", "a", "a", "a", "b", "b", "b", "b"), egg = c(2L, NA,
NA, NA, NA, 2L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA
), chick = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 1L)), .Names = c("year", "nestid", "sp",
"egg", "chick"), class = "data.frame", row.names = c(NA, -18L
))
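The two year-specific blocks in the answer above follow the same pattern, so they can be wrapped in a small helper. The sketch below is only an illustration: `impute_eggs` and its arguments are hypothetical names, not part of the original answer. It also samples positions with `sample.int()`, because `sample(x, n)` treats a length-one numeric `x` as `1:x`, which would pick the wrong rows if only one nest matched.

```r
# Sample data rebuilt from the question
df <- data.frame(
  year   = rep(c(2013L, 2014L), each = 9),
  nestid = rep(c(paste0("a", 1:5), paste0("b", 1:4)), 2),
  sp     = rep(rep(c("a", "b"), c(5, 4)), 2),
  egg    = c(2, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),
  chick  = c(1, 1, 0, 1, 0, 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# Hypothetical helper: for one species/year, give a random share `p_two` of the
# one-chick nests with missing eggs a clutch of 2, and the rest a clutch of 1.
impute_eggs <- function(df, species, yr, p_two) {
  idx <- which(is.na(df$egg) & df$chick == 1 & df$sp == species & df$year == yr)
  if (length(idx) == 0) return(df)          # nothing to impute
  n_two <- round(p_two * length(idx))
  # sample positions rather than values, so a single matching row is handled correctly
  two <- idx[sample.int(length(idx), n_two)]
  df$egg[two] <- 2
  df$egg[setdiff(idx, two)] <- 1
  df
}

set.seed(123)
df$egg[is.na(df$egg) & df$chick == 2] <- 2   # two chicks imply two eggs
df <- impute_eggs(df, "a", 2013, 0.8)
df <- impute_eggs(df, "a", 2014, 0.4)
```

With so few nests the rounding matters: 80% of two nests rounds to two, so both 2013 nests receive a clutch of 2.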
I have a long-format repeated measures dataset, similar to this:
ID Stimuli Score Correct
<fct> <chr> <int> <int>
1 1 A1 0.046 1
2 1 A1 0.037 1
3 1 A2 -0.261 0
4 1 A2 0.213 0
5 1 A3 0.224 0
6 1 A3 0.001 1
7 2 A1 -1.38 0
8 2 A1 -0.81 0
9 2 A2 -0.03 1
10 2 A2 0.88 0
11 2 A3 -0.00 1
12 2 A3 0.49 0
I created the Correct variable based on whether the Score for each row was within a specific range (if Score is between -.10 and +.10 = 1, otherwise 0).
What I want now is to change the values in Correct for each stimulus (A1, A2, A3) in Stimuli and per ID number. Specifically, whenever there is a 1 in ANY of the rows of Correct, all values should become 1 BUT ONLY for that corresponding stimulus and ID. In other words, in the example above, rows 1-2 of Correct would stay the same (1,1), rows 3-4 would stay the same (0,0), but rows 5-6 would become 1s for Stimuli A3 for ID 1 only. For ID 2, the only change would be for stimulus A2 (that should become 1,1).
I've tried several things but I can't think of an easy way to do this. There are similar posts about replacing values in a data frame, but I haven't seen one where the replacement depends on specific values in other variables within the same data frame.
You can try using dplyr::group_by with any(Correct == 1)
library(dplyr)
df %>%
group_by(ID, Stimuli) %>%
mutate(Correct = +any(Correct == 1))
#------
ID Stimuli Score Correct
<int> <chr> <dbl> <dbl>
1 1 A1 0.046 1
2 1 A1 0.037 1
3 1 A2 -0.261 0
4 1 A2 0.213 0
5 1 A3 0.224 1
6 1 A3 0.001 1
7 2 A1 -1.38 0
8 2 A1 -0.81 0
9 2 A2 -0.03 1
10 2 A2 0.88 1
11 2 A3 0 1
12 2 A3 0.49 1
Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), Stimuli = c("A1", "A1", "A2", "A2", "A3", "A3", "A1",
"A1", "A2", "A2", "A3", "A3"), Score = c(0.046, 0.037, -0.261,
0.213, 0.224, 0.001, -1.38, -0.81, -0.03, 0.88, 0, 0.49), Correct = c(1L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
This should also work, more simply:
library(dplyr)
df %>%
group_by(ID, Stimuli) %>%
mutate(Correct = max(Correct))
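If dplyr is not available, the same grouped maximum can be sketched in base R with `ave()`. This is an alternative to the answers above rather than a restatement of them; the data frame is rebuilt from the question, with the Score column left out because it is not needed here.

```r
# Data from the question (Score omitted)
df <- data.frame(
  ID      = rep(1:2, each = 6),
  Stimuli = rep(rep(c("A1", "A2", "A3"), each = 2), 2),
  Correct = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L)
)

# ave() applies FUN within each ID/Stimuli combination and returns a vector
# of the same length as the input, so it can be assigned straight back
df$Correct <- ave(df$Correct, df$ID, df$Stimuli, FUN = max)
```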
I have two lists of data frames. Each list has 6 data frames.
The data frames have the same columns, but in list1 the data frames have info from 2015 to 2017 and in list2 they have info for 2018, like below:
List1$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
List2$A
Name Value Year
AAA 543 2018
BBB 248 2018
I want to merge the data frames from both lists, so that in the end I have just one list of data frames with all the info for all years.
Some data frames in list1 already have info for 2018, so when I merge them with the others I want those 2018 values to be replaced.
Newlist$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
BBB 248 2018
I tried this, but it didn't work:
data<- lapply(list1,list2, function (x,y) merge(x,y))
How can I do this?
It's always helpful to include a sample of data with dput, but here's an attempt that has not been tested against your data:
library(tidyverse)
map2(list1, list2, ~bind_rows(.y, .x) %>% group_by(Name, Year) %>% slice(1))
We bind the rows (with list2 first), then group by Name and Year and take the first occurrence with slice. For any Name/Year combination that appears in both data frames, this keeps the value from the 2nd data frame (list2).
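The same "newer rows first, then drop duplicates" idea can be sketched in base R. The lists below are small stand-ins modelled on the question; the stale 2018 value of 999 in list1 is made up here so that the replacement is visible. Note that `Map()` pairs the list elements by position, so both lists are assumed to hold their data frames in the same order.

```r
# Hypothetical stand-in data modelled on the question
list1 <- list(A = data.frame(Name  = c("AAA", "BBB", "CCC", "AAA"),
                             Value = c(123, 456, 789, 999),   # 999: made-up stale 2018 value
                             Year  = c(2015, 2016, 2017, 2018)))
list2 <- list(A = data.frame(Name  = c("AAA", "BBB"),
                             Value = c(543, 248),
                             Year  = c(2018, 2018)))

merged <- Map(function(old, new) {
  both <- rbind(new, old)                                # newer rows first, so they win
  both <- both[!duplicated(both[c("Name", "Year")]), ]   # keep first row per Name/Year
  both <- both[order(both$Year, both$Name), ]
  rownames(both) <- NULL
  both
}, list1, list2)
```

`merged$A` then holds five rows, with the 2018 value for AAA taken from list2.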
We could first bind everything into a long data frame and then remove the "2018" entries from list 1 whenever list 2 has a matching entry.
To do this, we put both lists into one list and rbind them after adding an ID column. With by/ave we can then remove the "2018" duplicates that stem from list 1, while keeping those that don't occur in list 2.
The trick for the latter is to use rev(seq_along(x)).
To demonstrate, I have created sample data that probably resembles your data.
# list the lists
L <- list(L1=L1, L2=L2)
# add id column to sublists
L <- lapply(seq(L), function(x)
Map(`[<-`, L[[x]], "list", value=substr(names(L)[x], 2, 2)))
# rbind lists to long data frame
d <- do.call(rbind, unlist(L, recursive=FALSE))
# remove 2018 duplicates of list L1, keep if no 2018 in list L2
do.call(rbind, by(d, d$name, function(y) {
i <- cbind(y, id=ave(y$year, y$year, FUN=function(z) rev(seq_along(z))))
i[!i$id == 2, ]
}))
Result
# name value year list id
# A.A.1 A 998 2015 1 1
# A.A.4 A 456 2016 1 1
# A.A.7 A 312 2017 1 1
# A.A.13 A 478 2018 2 1
# B.A.2 B 1592 2015 1 1
# B.A.5 B 1072 2016 1 1
# B.A.8 B 673 2017 1 1
# B.A.21 B 445 2018 2 1
# C.A.3 C 957 2015 1 1
# C.A.6 C 199 2016 1 1
# C.A.9 C 2165 2017 1 1
# C.A.31 C 342 2018 2 1
# D.B.1 D 877 2015 1 1
# D.B.4 D 876 2016 1 1
# D.B.7 D 482 2017 1 1
# D.B.13 D 1077 2018 2 1
# E.B.2 E 370 2015 1 1
# E.B.5 E 1475 2016 1 1
# E.B.8 E 768 2017 1 1
# E.B.11 E 385 2018 1 1 <- this stems from list 1!
# F.B.3 F 421 2015 1 1
# F.B.6 F 930 2016 1 1
# F.B.9 F 1105 2017 1 1
# F.B.31 F 1836 2018 2 1
Data
L1 <- list(A = structure(list(name = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(1371, 565, 363, 633, 404, 106, 1512, 95, 2018,
63, 1305, 2287), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)), B = structure(list(name = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("D", "E", "F"), class = "factor"),
value = c(1389, 279, 133, 636, 284, 2656, 2440, 1320, 307,
1781, 172, 1215), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)))
L2 <- list(A = structure(list(name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), value = c(1895, 430, 257), year = c(2018,
2018, 2018)), class = "data.frame", row.names = c(NA, -3L)),
B = structure(list(name = structure(c(1L, 3L), .Label = c("D",
"E", "F"), class = "factor"), value = c(1763, 640), year = c(2018,
2018)), row.names = c(1L, 3L), class = "data.frame"))
L2$B <- L2$B[-2, ] # intentionally remove a value
I have a dataset on this form:
set.seed(4561) # Make the results reproducible
df=data.frame(
colour=rep(c("green","red","blue"),each=3),
year=rep("2017",9),
month=rep(c(1,2,3),3),
price=c(200,254,188,450,434,490,100,99,97),
work=ceiling(runif(9,30,60)),
gain=ceiling(runif(9,1,10)),
work_weighed_price=NA,
gain_weighed_price=NA
)
For each colour, year, month I have a price (output variable) and two input variables called gain and work. In reality I have many more input variables, but this suffices to show what I desire to do with my dataframe.
> df
colour year month price work gain work_weighed_price gain_weighed_price
1 green 2017 1 200 33 9 NA NA
2 green 2017 2 254 56 5 NA NA
3 green 2017 3 188 42 8 NA NA
4 red 2017 1 450 39 3 NA NA
5 red 2017 2 434 45 2 NA NA
6 red 2017 3 490 36 8 NA NA
7 blue 2017 1 100 50 8 NA NA
8 blue 2017 2 99 45 8 NA NA
9 blue 2017 3 97 56 4 NA NA
I wish to calculate the weighted gain and work (and also the weighted price), where the weight is the price for that month and year, divided by the sum of price across colours:
desired_output=data.frame(
year=rep("2017",3),
month=rep(c(1,2,3),1),
price=c(200*(200/(200+450+100))+450*(450/(200+450+100))+100*(100/(200+450+100)),
254*(254/(254+434+99))+434*(434/(254+434+99))+99*(99/(254+434+99)),
188*(188/(188+490+97))+490*(490/(188+490+97))+97*(97/(188+490+97))),
work_weighed_price=c(47*(200/(200+450+100))+44*(450/(200+450+100))+52*(100/(200+450+100)),
44*(254/(254+434+99))+42*(434/(254+434+99))+32*(99/(254+434+99)),
38*(188/(188+490+97))+52*(490/(188+490+97))+52*(97/(188+490+97))) ,
gain_weighed_price=c(5*(200/(200+450+100))+8*(450/(200+450+100))+10*(100/(200+450+100)),
3*(254/(254+434+99))+7*(434/(254+434+99))+9*(99/(254+434+99)),
2*(188/(188+490+97))+4*(490/(188+490+97))+9*(97/(188+490+97)))
)
> desired_output
year month price work_weighed_price gain_weighed_price
1 2017 1 336.6667 45.86667 7.466667
2 2017 2 333.7649 41.38755 5.960610
3 2017 3 367.5523 48.60387 4.140645
How would I attack this in R?
You can use the weighted.mean function:
library(dplyr)
df %>%
group_by(year, month) %>%
summarise_at(vars(price, work, gain),
funs(price_weighted = weighted.mean(., price)))
# # A tibble: 3 x 5
# # Groups: year [?]
# year month price_price_weighted work_price_weighted gain_price_weighted
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 337 45.9 7.47
# 2 2017 2 334 41.4 5.96
# 3 2017 3 368 48.6 4.14
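`funs()` and the `summarise_at()` family have been deprecated or superseded in recent dplyr releases. Assuming a dplyr version that provides `across()`, a roughly equivalent sketch follows. The data frame is rebuilt from the dput version of `df` that matches the desired output. The `.names` argument matters here: without it the summarised `price` would overwrite the original column and then be used as the weight for the remaining columns.

```r
library(dplyr)

# df rebuilt from the dput that is consistent with the desired output
df <- data.frame(
  colour = rep(c("green", "red", "blue"), each = 3),
  year   = 2017L,
  month  = rep(1:3, 3),
  price  = c(200, 254, 188, 450, 434, 490, 100, 99, 97),
  work   = c(47, 44, 38, 44, 42, 52, 52, 32, 52),
  gain   = c(5, 3, 2, 8, 7, 4, 10, 9, 9)
)

res <- df %>%
  group_by(year, month) %>%
  summarise(
    across(c(price, work, gain),
           ~ weighted.mean(.x, price),
           .names = "{.col}_price_weighted"),   # new columns, so `price` stays intact
    .groups = "drop"
  )
```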
Or, in data.table
library(data.table)
setDT(df)
df[, lapply(.SD, weighted.mean, price)
, .SDcols = c('price', 'work', 'gain')
, by = .(year, month)]
# year month price work gain
# 1: 2017 1 336.6667 45.86667 7.466667
# 2: 2017 2 333.7649 41.38755 5.960610
# 3: 2017 3 367.5523 48.60387 4.140645
Here is an approach using dplyr. Your example df is built with runif, and its values do not line up with your desired output, which causes some confusion. In the code below, I use a df that is consistent with your desired output.
library(dplyr)
df %>%
group_by(year, month) %>%
mutate(weight = price / sum(price)) %>%
mutate_at(vars(price, work, gain), funs(weighed_price = . * weight)) %>%
summarise_at(vars(ends_with("weighed_price")), sum)
# # A tibble: 3 x 5
# # Groups: year [?]
# year month work_weighed_price gain_weighed_price price_weighed_price
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 45.9 7.47 337.
# 2 2017 2 41.4 5.96 334.
# 3 2017 3 48.6 4.14 368.
df:
structure(list(colour = c("green", "green", "green", "red", "red",
"red", "blue", "blue", "blue"), year = c(2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L), month = c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), price = c(200L, 254L, 188L, 450L,
434L, 490L, 100L, 99L, 97L), work = c(47L, 44L, 38L, 44L, 42L,
52L, 52L, 32L, 52L), gain = c(5L, 3L, 2L, 8L, 7L, 4L, 10L, 9L,
9L), work_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA
), gain_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("colour",
"year", "month", "price", "work", "gain", "work_weighed_price",
"gain_weighed_price"), class = "data.frame", row.names = c(NA,
-9L))
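A base R solution could be the following sequence of tapply and sapply instructions. Note that tapply splits only its first argument by month, so the weights have to be split together with the values; otherwise all nine prices are recycled against each group of three.
fun_price <- function(x){
s <- sum(x)
sum(x*(x/s))
}
fun_weighted <- function(x, w){
s <- sum(w)
sum(x*(w/s))
}
desired <- data.frame(year = unique(df$year), month = sort(unique(df$month)))
desired$price <- with(df, tapply(price, month, FUN = fun_price))
# split the rows by month so that work/gain and price stay aligned within each group
by_month <- split(df, df$month)
desired$work_weighed_price <- sapply(by_month, function(d) fun_weighted(d$work, d$price))
desired$gain_weighed_price <- sapply(by_month, function(d) fun_weighted(d$gain, d$price))
desired
# (using the df from the dput above, which is consistent with the desired output)
# year month price work_weighed_price gain_weighed_price
#1 2017 1 336.6667 45.86667 7.466667
#2 2017 2 333.7649 41.38755 5.960610
#3 2017 3 367.5523 48.60387 4.140645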
I have the following data frame df:
ID Year Var value
1 2011 x1 1.2
1 2011 x2 2
1 2012 x1 1.5
1 2012 x2 2.3
3 2013 x1 3
3 2014 x1 4
4 2015 x1 5
5 2016 x1 6
4 2016 x1 2
I want to transform the data into the following format:
ID Year x1 x2
1 2011 1.2 2
1 2011 2 NA
1 2012 1.5 2.3
3 2013 3 NA
3 2014 4 4
4 2015 5 NA
4 2016 2 NA
5 2016 6 NA
Please help
Using the tidyr library, I believe this is what you are looking for:
library(tidyr)
df <- data.frame(stringsAsFactors=FALSE,
ID = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
Year = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
Var = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2)
)
df2 <- df %>%
spread(Var, value)
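In current tidyr, `spread()` is superseded by `pivot_wider()`. Assuming a tidyr version that provides it, the reshaping above can be sketched as follows. The name `wide` is only illustrative.

```r
library(tidyr)

df <- data.frame(
  ID    = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
  Year  = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
  Var   = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
  value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2),
  stringsAsFactors = FALSE
)

# One row per ID/Year; the values of Var become column names,
# and combinations without an x2 measurement are filled with NA
wide <- pivot_wider(df, names_from = Var, values_from = value)
```

Each ID/Year pair appears once in the result, so the nine input rows collapse to seven.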
I am trying to migrate this activity from excel/SQL to R and I am stuck - any help is very much appreciated. Thanks !
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id, get one row of output. Use the names stored in the Variable_Name column to create one column per name, holding the sum of Amount for that customer. For each of those columns, also create a matching column (suffix _P) that is 1 or 0 depending on the presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 65539 4000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$","_P",names(result))
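As a variation on the answer above, the presence columns can be named before they are combined, which avoids the rename step. This is a sketch rather than the answerer's code: `sums`, `flags` and `out` are illustrative names, and `dat` is rebuilt from the dput in the question (Group and Year are left out since they are not used). It treats "presence" as a positive summed amount.

```r
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009", "A_2011",
                    "B_2011", "C_2010", "D_2010", "A_2013", "B_2013", "C_2007",
                    "D_2007"),
  Amount = c(2000, 100, 300, 20, 299090, 89778456, 884, 34894, 389849, 742,
             25661, 393, 23)
)

# Sum of Amount per customer (rows) and variable name (columns)
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence indicators, named with the _P suffix up front
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(sums), "_P")

out <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags,
                  row.names = NULL)
```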