This is based on the answer to a previous question.
df
year code
2009 a
2009 a
2009 b
2010 b
2010 b
2011 b
2011 c
2011 c
I want to select codes common to all years within df. Here it is "b". One solution is:
Reduce(intersect, list(unique(df$code[df$year==2009]),
unique(df$code[df$year==2010]),
unique(df$code[df$year==2011])))
In practice, df contains about 15 years, thousands of codes, millions of rows, and multiple columns. For starters, the above command becomes quite long when all the years are included. Plus it's memory-consuming and slow. Is there sparser/faster code to do this?
As another idea, instead of many pairwise intersections you could build a table of occurrences per year, which is more efficient and can be handy down the road:
lvls = list(y = unique(df$year), c = levels(df$code))
library(Matrix)
tab = sparseMatrix(i = match(df$year, lvls$y),
j = match(df$code, lvls$c),
x = TRUE,
dimnames = lvls)
tab
#3 x 3 sparse Matrix of class "lgCMatrix"
# c
#y a b c
# 2009 | | .
# 2010 . | .
# 2011 . | |
And then:
colSums(tab) == nrow(tab)
# a b c
#FALSE TRUE FALSE
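or, in this case, better (the p slot of the sparse matrix holds the column pointers, so diff(tab@p) counts the stored entries per column):
colnames(tab)[diff(tab@p) == nrow(tab)]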
#[1] "b"
"df" is:
df = structure(list(year = c(2009L, 2009L, 2009L, 2010L, 2010L, 2011L,
2011L, 2011L), code = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("year",
"code"), class = "data.frame", row.names = c(NA, -8L))
Using tidyverse functions and considering dft1 as your input, you can try:
dft1 %>%
unique() %>%
group_by(code) %>%
filter( n_distinct(year) == length(unique(dft1$year)))
which gives:
year code
<int> <chr>
1 2009 b
2 2010 b
3 2011 b
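For comparison, here is a minimal base R sketch of the same task that avoids typing out every year. It assumes code is stored as a character column (the df shown earlier uses a factor, which as.character(df$code) would convert). The names codes_by_year, common_split, years_per_code and common_count are illustrative and not taken from the answers above.

```r
# Example data mirroring the question (codes kept as character for simplicity)
df <- data.frame(
  year = c(2009, 2009, 2009, 2010, 2010, 2011, 2011, 2011),
  code = c("a", "a", "b", "b", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Option 1: one vector of unique codes per year, intersected across all years
codes_by_year <- lapply(split(df$code, df$year), unique)
common_split <- Reduce(intersect, codes_by_year)

# Option 2: count the distinct years per code and compare with the total
n_years <- length(unique(df$year))
years_per_code <- tapply(df$year, df$code, function(y) length(unique(y)))
common_count <- names(years_per_code)[years_per_code == n_years]

print(common_split)
print(common_count)
```

Both options should return the same codes. The counting variant makes a single pass per code rather than chaining intersections, which may matter with millions of rows, though that is worth benchmarking on the real data.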
I have two lists of data frames. Each list has 6 data frames.
The data frames have the same columns, but in list1 the data frames hold info from 2015 to 2017 and in list2 they hold info for 2018, like below:
List1$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
List2$A
Name Value Year
AAA 543 2018
BBB 248 2018
I want to merge the dataframes from both lists. So I want in the end just one list of dataframes with all the info for all years.
Some data frames in list1 already contain info for 2018, so when I merge them with the others I want those 2018 values to be replaced by the ones from list2.
Newlist$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
BBB 248 2018
I tried this, but it didn't work:
data<- lapply(list1,list2, function (x,y) merge(x,y))
How can I do this?
It's always helpful to include a sample of data with dput, but here's an attempt that is untested against your actual data:
library(tidyverse)
map2(list1, list2, ~bind_rows(.y, .x) %>% group_by(Name, Year) %>% slice(1))
We bind the rows (with list2 first), then group by Name and Year and take the first occurrence with slice. For any Name/Year pair that appears in both data frames, this keeps the value from the second list.
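The same "new rows first, then drop later duplicates" idea can be sketched in base R. This is only an illustration under assumptions: list1 and list2 are rebuilt by hand from the question's example, and the 2018 value for AAA in list1 is changed to a made-up 999 so that the replacement is visible.

```r
# Hand-built stand-ins for the question's lists (999 is a hypothetical stale value)
list1 <- list(A = data.frame(Name  = c("AAA", "BBB", "CCC", "AAA"),
                             Value = c(123, 456, 789, 999),
                             Year  = c(2015, 2016, 2017, 2018)))
list2 <- list(A = data.frame(Name  = c("AAA", "BBB"),
                             Value = c(543, 248),
                             Year  = c(2018, 2018)))

merged <- Map(function(old, new) {
  both <- rbind(new, old)                               # list2 rows come first
  both <- both[!duplicated(both[c("Name", "Year")]), ]  # keep first Name/Year hit
  both <- both[order(both$Year, both$Name), ]           # restore a readable order
  rownames(both) <- NULL
  both
}, list1, list2)

print(merged$A)
```

Map pairs the data frames by position, so both lists are assumed to hold the same elements in the same order.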
We could first bind everything into one long data frame and then remove the "2018" entries from list 1 wherever list 2 also has an entry.
To do this, we put both lists in a list and rbind them after adding an ID column. With by/ave, that column later helps to remove the "2018" duplicates that stem from list 1 while keeping those that don't occur in list 2.
The trick for the latter is to use rev(seq_along(x)).
To demonstrate I have created sample data that probably resembles your data.
# list the lists
L <- list(L1=L1, L2=L2)
# add id column to sublists
L <- lapply(seq(L), function(x)
Map(`[<-`, L[[x]], "list", value=substr(names(L)[x], 2, 2)))
# rbind lists to long data frame
d <- do.call(rbind, unlist(L, recursive=FALSE))
# remove 2018 duplicates of list L1, keep if no 2018 in list L2
do.call(rbind, by(d, d$name, function(y) {
i <- cbind(y, id=ave(y$year, y$year, FUN=function(z) rev(seq_along(z))))
i[!i$id == 2, ]
}))
Result
# name value year list id
# A.A.1 A 998 2015 1 1
# A.A.4 A 456 2016 1 1
# A.A.7 A 312 2017 1 1
# A.A.13 A 478 2018 2 1
# B.A.2 B 1592 2015 1 1
# B.A.5 B 1072 2016 1 1
# B.A.8 B 673 2017 1 1
# B.A.21 B 445 2018 2 1
# C.A.3 C 957 2015 1 1
# C.A.6 C 199 2016 1 1
# C.A.9 C 2165 2017 1 1
# C.A.31 C 342 2018 2 1
# D.B.1 D 877 2015 1 1
# D.B.4 D 876 2016 1 1
# D.B.7 D 482 2017 1 1
# D.B.13 D 1077 2018 2 1
# E.B.2 E 370 2015 1 1
# E.B.5 E 1475 2016 1 1
# E.B.8 E 768 2017 1 1
# E.B.11 E 385 2018 1 1 <- this stems from list 1!
# F.B.3 F 421 2015 1 1
# F.B.6 F 930 2016 1 1
# F.B.9 F 1105 2017 1 1
# F.B.31 F 1836 2018 2 1
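The rev(seq_along(z)) numbering used in this answer can be seen in isolation in the following minimal sketch. The year vector is made up for illustration and assumes the rows are ordered with list 1 entries before list 2 entries.

```r
# Hypothetical years for one name: the first 2018 row is from list 1, the second from list 2
year <- c(2015, 2016, 2018, 2018)

# Within each year, number the rows in reverse, so the LAST row always gets id 1
id <- ave(year, year, FUN = function(z) rev(seq_along(z)))
print(id)

# Dropping id == 2 removes the earlier (list 1) copy of a duplicated year,
# while a 2018 row that exists only in list 1 has id 1 and survives
keep <- id != 2
print(year[keep])
```

If a year could occur more than twice, filtering on id == 1 instead would keep only the last row of each year.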
Data
L1 <- list(A = structure(list(name = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(1371, 565, 363, 633, 404, 106, 1512, 95, 2018,
63, 1305, 2287), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)), B = structure(list(name = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("D", "E", "F"), class = "factor"),
value = c(1389, 279, 133, 636, 284, 2656, 2440, 1320, 307,
1781, 172, 1215), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)))
L2 <- list(A = structure(list(name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), value = c(1895, 430, 257), year = c(2018,
2018, 2018)), class = "data.frame", row.names = c(NA, -3L)),
B = structure(list(name = structure(c(1L, 3L), .Label = c("D",
"E", "F"), class = "factor"), value = c(1763, 640), year = c(2018,
2018)), row.names = c(1L, 3L), class = "data.frame"))
L2$B <- L2$B[-2, ] # intentionally remove a value
How would I go about summing the values of one column for all rows containing a name that is part of a target group, as well as a specific year which is in another column?
Example: I would like to sum the values of a, b, c for 2015 to make a new Category, "e", and to do the same for 2016.
Year Category Value
2015 a 2
2015 b 3
2015 c 2
2015 d 1
2016 a 7
2016 b 2
2016 c 1
2016 d 1
To give something like this:
Year Category Value
2015 d 1
2015 e 7
2016 d 1
2016 e 10
Thanks!
Try aggregate, defining first a group of categories.
target <- c("a", "b", "c")
group <- factor(dat$Category %in% target,
levels = c(TRUE, FALSE),
labels = c("e", "d"))
agg <- aggregate(Value ~ group + Year, dat, sum)[c(2, 1, 3)]
agg
# Year group Value
#1 2015 e 7
#2 2015 d 1
#3 2016 e 10
#4 2016 d 1
Edit.
If you have many categories and want to collapse some of them while leaving the others as they are, the fct_collapse function from the CRAN package forcats is a good way to do it.
group <- forcats::fct_collapse(dat$Category,
"e" = target)
group
#[1] e e e d e e e d
#Levels: e d
Then aggregate as above.
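If adding a package is not an option, a similar collapse can be sketched in base R by renaming factor levels, since assigning the same label to several levels merges them. This is a sketch, not part of the original answer; dat is rebuilt here so the block runs on its own.

```r
# Same data as in the question, rebuilt so the example is self-contained
dat <- data.frame(
  Year     = rep(c(2015L, 2016L), each = 4),
  Category = factor(rep(c("a", "b", "c", "d"), times = 2)),
  Value    = c(2L, 3L, 2L, 1L, 7L, 2L, 1L, 1L)
)
target <- c("a", "b", "c")

# Rename every target level to "e"; all other levels are left untouched
dat$group <- dat$Category
levels(dat$group)[levels(dat$group) %in% target] <- "e"

agg <- aggregate(Value ~ group + Year, data = dat, FUN = sum)
print(agg)
```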
Data.
dat <-
structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2016L), Category = structure(c(1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), Value = c(2L,
3L, 2L, 1L, 7L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-8L))
Here is a more compact dplyr option
dat %>%
mutate(Category = ifelse(Category %in% c("a", "b", "c"), "e", # put in c() the Categories you want to sum
as.character(Category))) %>%
group_by(Year, Category) %>%
summarise(Value = sum(Value))
# A tibble: 4 x 3
# Groups: Year [?]
Year Category Value
<int> <chr> <int>
1 2015 d 1
2 2015 e 7
3 2016 d 1
4 2016 e 10
I have a data frame like the following, covering different firms over the period from 1996 to 2016:
year firms
----------
1996 a
1996 b
1996 c
.......
2016 c
My question is: how can I select the firms that are present in every year of the period from 1996 to 2016? In other words, how do I set up a balanced panel from an unbalanced one?
The only way I can do so far is like:
Reduce(intersect, list(a,b,c))
if I extract the firms into multiple vectors according to the years. But it's obviously too fussy.
The following code first finds the names of the firms that have entries in all years, then subsets the data:
library(data.table)
#generate sample data
set.seed(1)
dt <- data.table(year = sample(1996:2016, 500, TRUE),
firms = sample(letters[1:10], 500, TRUE))
dt <- dt[!duplicated(dt)][order(year, firms)]
print(dt)
# find the common element
common_element <- dt[, length(unique(year)) == length(1996:2016), by = firms][V1 == TRUE, firms]
print(common_element)
## [1] "a" "j"
# subset the data
dt_subset <- dt[firms %in% common_element]
You can use table to count the rows per firm and keep the firms whose count equals the number of unique years. This assumes each firm appears at most once per year, i.e.
table(df$firm)
#a b c
#5 3 3
table(df$firm) == length(unique(df$year))
# a b c
# TRUE FALSE FALSE
t1 <- table(df$firm) == length(unique(df$year))
names(t1)[t1]
#[1] "a"
df[df$firm %in% names(t1)[t1],]
# year firm
#1 1996 a
#4 1997 a
#7 1998 a
#10 1999 a
#13 2000 a
DATA
dput(df)
structure(list(year = c(1996L, 1996L, 1996L, 1997L, 1997L, 1998L,
1998L, 1999L, 2000L, 2000L, 2000L), firm = c("a", "b", "c", "a",
"b", "a", "c", "a", "a", "b", "c")), .Names = c("year", "firm"
), row.names = c(1L, 2L, 3L, 4L, 5L, 7L, 8L, 10L, 13L, 14L, 15L
), class = "data.frame")
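When the data may contain repeated firm/year rows, the row counts from table no longer equal the number of years covered. A minimal sketch of one way around this is to deduplicate the firm/year pairs first. The duplicated row added below is hypothetical and only there to show the effect; pairs, counts and balanced_firms are illustrative names.

```r
# Same data as above, rebuilt so the example is self-contained
df <- data.frame(
  year = c(1996, 1996, 1996, 1997, 1997, 1998, 1998, 1999, 2000, 2000, 2000),
  firm = c("a", "b", "c", "a", "b", "a", "c", "a", "a", "b", "c"),
  stringsAsFactors = FALSE
)
df_dup <- rbind(df, df[2, ])  # hypothetical repeated row: firm b in 1996

# Count each firm once per year, then compare with the number of distinct years
pairs  <- unique(df_dup[c("firm", "year")])
counts <- table(pairs$firm)
balanced_firms <- names(counts)[counts == length(unique(pairs$year))]

balanced <- df_dup[df_dup$firm %in% balanced_firms, ]
print(balanced_firms)
```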
I have data like this:
Group Year Month Mean_Price
A 2013 6 200
A 2013 6 200
A 2014 2 100
A 2014 2 100
B 2014 1 130
I want to add another column which gets the last entry from the group above, like this:
Group Year Month Mean_Price Last_Mean_price
A 2013 6 200 x
A 2013 6 200 x
A 2014 2 100 200
A 2014 2 100 200 ---This is where I am facing a problem, as dplyr + lag just gets the previous row's entry and not the entry of the *last group's* last row.
B 2014 1 130 x
B 2014 4 140 130
All help will be appreciated. Thanks!
I had asked a related question here: Get the (t-1) data within groups
But then I wasn't grouping by years and months
This may be one way to go. I am not sure how you want to group your data. Here, I chose to group your data with GROUP, Year, and Month. First, I want to create a vector with all last elements from each group, which is foo.
group_by(mydf, Group, Year, Month) %>%
summarize(whatever = last(Mean_Price)) %>%
ungroup %>%
select(whatever) %>%
unlist -> foo
# whatever1 whatever2 whatever3 whatever4
# 200 100 130 140
Second, I arranged foo for our later process. Basically, I added x in the first position and deleted the last element in foo.
### Arrange a vector
foo <- c("x", foo[-length(foo)])
Third, I added row numbers for each group in mydf using mutate(). Then, I replaced all numbers but 1 with x.
group_by(mydf, Group, Year, Month) %>%
mutate(ind = row_number(),
ind = replace(ind, which(row_number(ind) != 1), "x")) -> temp
Finally, I identified rows which have 1 in ind and assigned the vector, foo to the rows.
temp$ind[temp$ind == 1] <- foo
temp
# Group Year Month Mean_Price ind
# (fctr) (int) (int) (int) (chr)
#1 A 2013 6 200 x
#2 A 2013 6 200 x
#3 A 2014 2 100 200
#4 A 2014 2 100 x
#5 B 2014 1 130 100
#6 B 2014 4 140 130
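Note that the output above does not quite match the table requested in the question (row 4 shows x, and the first B row picks up the value from group A). As a hedged alternative, here is a base R sketch that looks up the previous period's last price within each Group. It assumes the rows are already sorted chronologically within each Group and uses NA in place of x; key, last_row and periods are illustrative names.

```r
# Same data as mydf, rebuilt with character groups so the example is self-contained
mydf <- data.frame(
  Group      = c("A", "A", "A", "A", "B", "B"),
  Year       = c(2013L, 2013L, 2014L, 2014L, 2014L, 2014L),
  Month      = c(6L, 6L, 2L, 2L, 1L, 4L),
  Mean_Price = c(200L, 200L, 100L, 100L, 130L, 140L),
  stringsAsFactors = FALSE
)

# One key per Group/Year/Month period; keep the last row of each period
key      <- paste(mydf$Group, mydf$Year, mydf$Month)
last_row <- !duplicated(key, fromLast = TRUE)
periods  <- mydf[last_row, ]

# Shift the period-level prices down by one position within each Group
periods$Last_Mean_price <- ave(periods$Mean_Price, periods$Group,
                               FUN = function(p) c(NA, p[-length(p)]))

# Copy the shifted value back to every row of the matching period
mydf$Last_Mean_price <- periods$Last_Mean_price[match(key, key[last_row])]
print(mydf)
```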
DATA
mydf <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Year = c(2013L, 2013L, 2014L, 2014L,
2014L, 2014L), Month = c(6L, 6L, 2L, 2L, 1L, 4L), Mean_Price = c(200L,
200L, 100L, 100L, 130L, 140L)), .Names = c("Group", "Year", "Month",
"Mean_Price"), class = "data.frame", row.names = c(NA, -6L))
I am trying to migrate this activity from excel/SQL to R and I am stuck - any help is very much appreciated. Thanks !
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id - get one row of output. Use variable names stored in column and create columns - for each column assign sum of amount. Create a similar column and assign as 1 or 0 depending on presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 65539 4000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$","_P",names(result))
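As a variation, the indicator columns can be given their _P names before the pieces are combined, which avoids the renaming step. This is a sketch under assumptions: dat is rebuilt from the dput in the question (only the columns used here), and "presence" is taken to mean a summed amount greater than zero.

```r
# Columns of the question's data that are needed for the reshaping
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009", "A_2011",
                    "B_2011", "C_2010", "D_2010", "A_2013", "B_2013", "C_2007",
                    "D_2007"),
  Amount = c(2000, 100, 300, 20, 299090, 89778456, 884, 34894, 389849, 742,
             25661, 393, 23),
  stringsAsFactors = FALSE
)

# Sum of Amount per customer (rows) and variable name (columns)
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence flags, named with a _P suffix up front
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(sums), "_P")

wide <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags,
                   check.names = FALSE)
print(wide[, c("Cust_ID", "A_2009", "B_2009", "A_2009_P", "B_2009_P")])
```

The flags rely on amounts being non-negative; with refunds or negative entries, a count-based table would be a safer basis for the presence columns.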