Year sum dataframe per label in R

Given a data structure like this one:
dtest <- data.frame(
  label = c("yahoo", "google", "yahoo", "yahoo", "google", "google", "yahoo", "yahoo"),
  year  = c(2000, 2001, 2000, 2001, 2003, 2003, 2003, 2003)
)
How can I produce a new dataframe like this one:
doutput <- data.frame(
  label  = c("yahoo", "yahoo", "yahoo", "yahoo", "google", "google", "google", "google"),
  year   = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
  volume = c(2, 1, 0, 3, 0, 1, 0, 2)
)
> doutput
   label year volume
1  yahoo 2000      2
2  yahoo 2001      1
3  yahoo 2002      0
4  yahoo 2003      3
5 google 2000      0
6 google 2001      1
7 google 2002      0
8 google 2003      2

One way is with dplyr:
library(dplyr)
dtest %>%
  group_by(label, year) %>%
  tally(name = "volume")
# A tibble: 5 x 3
# Groups:   label [2]
  label   year volume
  <fct>  <dbl>  <int>
1 google  2001      1
2 google  2003      2
3 yahoo   2000      2
4 yahoo   2001      1
5 yahoo   2003      2
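Note that this keeps only the label/year combinations that actually occur, so the zero-volume rows from doutput are missing. To fill them in, tidyr's complete can be chained on; a minimal sketch, assuming the full year range 2000:2003 is known in advance (tidyr::full_seq(year, 1) could derive it instead):
library(dplyr)
library(tidyr)
dtest %>%
  count(label, year, name = "volume") %>%    # one row per observed label/year pair
  complete(label, year = 2000:2003,          # add the missing combinations...
           fill = list(volume = 0))          # ...with volume set to 0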

Here is a solution with base R:
as.data.frame(table(transform(dtest,
  year = factor(year, levels = seq(min(year), max(year))))))
The result:
   label year Freq
1 google 2000    0
2  yahoo 2000    2
3 google 2001    1
4  yahoo 2001    1
5 google 2002    0
6  yahoo 2002    0
7 google 2003    2
8  yahoo 2003    2
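If the column name and row order of doutput matter, a small post-processing step gets close; a sketch (label sorts alphabetically here, so relevel the factor if yahoo must come first):
res <- as.data.frame(table(transform(dtest,
  year = factor(year, levels = seq(min(year), max(year))))))
names(res)[names(res) == "Freq"] <- "volume"   # match the requested column name
res <- res[order(res$label, res$year), ]       # group rows by label, then year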

Related

Paste values in a column based on other observations in the dataframe in R

I have a very large (~30M observations) dataframe in R and I am having trouble with a new column I want to create.
The data is formatted like this:
  Country Year Value
1       A 2000     1
2       A 2001    NA
3       A 2002     2
4       B 2000     4
5       B 2001    NA
6       B 2002    NA
7       B 2003     3
My problem is that I would like to impute the NAs in the Value column based on other values in that column. Specifically, a non-NA value for a country should replace the NAs in later years for that country, until the next non-NA value appears.
The data above would therefore be transformed into this:
  Country Year Value
1       A 2000     1
2       A 2001     1
3       A 2002     2
4       B 2000     4
5       B 2001     4
6       B 2002     4
7       B 2003     3
To solve this, I first tried a loop with a lookup function and some if_else statements, but I couldn't get it to behave as expected. More generally, I am struggling to find a solution efficient enough to finish in minutes to hours rather than days.
Is there an easy way to do this?
Thanks!
Using tidyr's fill, which by default carries the last non-NA value downward (.direction = "down"):
library(tidyverse)
df %>%
  group_by(Country) %>%
  fill(Value)
Result:
# A tibble: 7 × 3
# Groups:   Country [2]
  Country  Year Value
  <chr>   <dbl> <dbl>
1 A        2000     1
2 A        2001     1
3 A        2002     2
4 B        2000     4
5 B        2001     4
6 B        2002     4
7 B        2003     3
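For a table of ~30M rows, a data.table version of the same last-observation-carried-forward fill is usually faster and avoids copying; a sketch assuming Value is numeric (data.table >= 1.12.4 for nafill, which handles numeric columns only):
library(data.table)
setDT(df)                                    # convert in place, no copy of the 30M rows
df[, Value := nafill(Value, type = "locf"),  # carry the last non-NA value forward
   by = Country]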

Cumulative sum for 2 criteria in R

I have a dataset for which I want to calculate cumulative sums based on 2 criteria:
dfdata = data.frame(
  car  = c("toyota", "toyota", "toyota", "toyota", "toyota",
           "honda", "honda", "honda", "honda",
           "lada", "lada", "lada", "lada"),
  year = c(2000, 2000, 2001, 2001, 2002, 2001, 2001, 2002, 2002, 2003, 2004, 2005, 2006),
  id   = c("a", "b", "a", "c", "a", "d", "d", "d", "e", "f", "f", "f", "f"))
The data looks like this:
dfdata
      car year id
1  toyota 2000  a
2  toyota 2000  b
3  toyota 2001  a
4  toyota 2001  c
5  toyota 2002  a
6   honda 2001  d
7   honda 2001  d
8   honda 2002  d
9   honda 2002  e
10   lada 2003  f
11   lada 2004  f
12   lada 2005  f
13   lada 2006  f
Imagine I was observing cars passing by, and the plate on each car is its "ID", so cars with the same id are the exact same car. I want three things:
1. the number of cars of each make I've seen in each year
2. the cumulative sum of cars of each make I've seen across the years
3. the cumulative sum of the cars I've seen more than once (one column counting recaptures within the same year as well as across years, AND another column counting only recaptures from other years)
Here is how I got points 1 and 2:
dfdata %>%
  group_by(car, year) %>%
  dplyr::summarise(nb = n()) %>%
  dplyr::mutate(cs = cumsum(nb)) %>%
  ungroup()
nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.
# A tibble: 9 x 4
  car     year    nb    cs
  <fct>  <dbl> <int> <int>
1 honda   2001     2     2
2 honda   2002     2     4
3 lada    2003     1     1
4 lada    2004     1     2
5 lada    2005     1     3
6 lada    2006     1     4
7 toyota  2000     2     2
8 toyota  2001     2     4
9 toyota  2002     1     5
But notice that I've lost the ID column. How can I also count the cars that I've seen multiple times with the same ID?
The final output should be based on grouping by ID as well (to answer point 3):
     car year nb cs curetrap curetrap.no.same.year
1  honda 2001  2  2        1                     0
2  honda 2002  2  4        2                     1
3   lada 2003  1  1        0                     0
4   lada 2004  1  2        1                     1
5   lada 2005  1  3        2                     2
6   lada 2006  1  4        3                     3
7 toyota 2000  2  2        0                     0
8 toyota 2001  2  4        1                     1
9 toyota 2002  1  5        2                     2
This is because "honda" has been seen 2 times in 2001 and 2 times in 2002, so the cumulative sum is 2 in 2001 and 2 + 2 in 2002. Within 2001 I saw the honda "d" twice, meaning that I "recaptured" the 2001 honda "d" once, hence the 1 in curetrap for 2001. In 2002 I recaptured the honda "d" again, so the cumulative sum increased. curetrap.no.same.year works the same way, except that recapturing the honda "d" within 2001 is ignored because it happened in the same year.
How is it possible to do that? Since I'm losing the ID information, do I need to do it in 2 steps?
So far this is what I have:
tab.df = cbind(table(dfdata$id, dfdata$year),
               car = as.character(dfdata[match(unique(dfdata$id), table = dfdata$id), "car"]))
df.df = as.data.frame(tab.df)
  2000 2001 2002 2003 2004 2005 2006    car
a    1    1    1    0    0    0    0 toyota
b    1    0    0    0    0    0    0 toyota
c    0    1    0    0    0    0    0 toyota
d    0    2    1    0    0    0    0  honda
e    0    0    1    0    0    0    0  honda
f    0    0    0    1    1    1    1   lada
This shows, for each ID, how many times I've seen that car in each year.
You can factor the problem into 2 steps: first add binary variables to the original dataset that flag the records you want to count, then simply compute the sum and cumsum of these flags.
The following code gives the result you want:
dfdata %>%
  group_by(car, id) %>%
  arrange(year, .by_group = TRUE) %>%
  dplyr::mutate(already_seen = row_number() > 1,
                already_seen_diff_year = year > year[1]) %>%
  group_by(car, year) %>%
  dplyr::summarise(nb = n(), cs = nb,
                   curetrap = sum(already_seen),
                   curetrap.no.same.year = sum(already_seen_diff_year)) %>%
  dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>%
  ungroup()
NB: duplicating the variable (cs = nb) is just a trick that makes the subsequent call to mutate_at easy to write.
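In dplyr >= 1.0.0, mutate_at is superseded by across(), so the final cumulative step can equivalently be written as:
dplyr::mutate(across(c(cs, curetrap, curetrap.no.same.year), cumsum))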

How many counts are in each type in each year? With either group_by() or split() [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame df as follows:
df
   Code Time Country Type
1  n001 2000  France    1
2  n002 2001   Japan    5
3  n003 2003     USA    2
4  n004 2004     USA    2
5  n005 2004  Canada    1
6  n006 2005 Britain    2
7  n007 2005     USA    1
8  n008 2005     USA    2
9  n010 2005     USA    1
10 n011 2005  Canada    1
11 n012 2005     USA    2
12 n013 2005     USA    5
13 n014 2005  Canada    1
14 n015 2006     USA    2
15 n017 2006  Canada    1
16 n018 2006 Britain    1
17 n019 2006  Canada    1
18 n020 2006     USA    1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
  Evaluation error: unique() applies only to vectors.
Then I tried split(), thinking it might separate the dataframe by year so that I could count the number of each type per year:
split(df, 'Time')
$Time
  Code Time Country Type
1 n001 2000  France    1
2 n002 2001   Japan    5
3 n003 2003     USA    2
4 n004 2004     USA    2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.
We can split the Type column by Time and calculate its frequency with table:
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
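If a single cross-tabulation is acceptable instead of a per-year list, base R can also produce the whole result in one call:
table(df$Type, df$Time)   # rows are Type, columns are Time, cells are counts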
How about this?
library(dplyr)
library(tidyr)
df %>%
  group_by(Time, Type) %>%
  count() %>%
  spread(Type, n)
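In tidyr >= 1.0.0, spread is superseded by pivot_wider; an equivalent sketch (values_fill = 0 needs tidyr >= 1.1.0 and turns the missing Time/Type cells into zeros):
library(dplyr)
library(tidyr)
df %>%
  count(Time, Type) %>%
  pivot_wider(names_from = Type, values_from = n, values_fill = 0)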
You could use something like this: split on Time, then group by Type and tally the result.
library(dplyr)
library(purrr)
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
   Type     n
  <int> <int>
1     1     1
2     2     1

$`2005`
# A tibble: 3 x 2
   Type     n
  <int> <int>
1     1     4
2     2     3
3     5     1

$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n:
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% summarise(count = n()))

Error spreading data set in R

I have a long data set broken down by geographical location and year, with about 5 variables of interest (see structure below). Every time I try to convert it to wide form, I get told that there's duplication, so the conversion fails.
df
  Yr   Geo  Obs1 Obs2
2001 Dist1    1    3
2002 Dist1    2    5
2003 Dist1    4    2
2004 Dist1    2    1
2001 Dist2    1    3
2002 Dist2   .9    5
2003 Dist2    6    8
2004 Dist2    2   .2
I want to convert it into something like this
yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
2001
2002
2003
2004
Looking for something like this...?
> reshape(df, v.names = c("Obs1", "Obs2"), idvar = "Yr", timevar = "Geo", direction = "wide")
    Yr Obs1.Dist1 Obs2.Dist1 Obs1.Dist2 Obs2.Dist2
1 2001          1          3        1.0        3.0
2 2002          2          5        0.9        5.0
3 2003          4          2        6.0        8.0
4 2004          2          1        2.0        0.2
Here is a solution using tidyr. Because spread works with a single key-value pair, you first need to gather the Obs columns and unite Geo with the observation name so that you have one key-value pair to work with. I also set the column names to lower case, as shown in the requested output.
library(tidyverse)
tbl <- read_table2(
  "Yr Geo Obs1 Obs2
  2001 Dist1 1 3
  2002 Dist1 2 5
  2003 Dist1 4 2
  2004 Dist1 2 1
  2001 Dist2 1 3
  2002 Dist2 .9 5
  2003 Dist2 6 8
  2004 Dist2 2 .2"
)
tbl %>%
  gather("obsnum", "obs", Obs1, Obs2) %>%
  unite(colname, Geo, obsnum, sep = "") %>%
  spread(colname, obs) %>%
  `colnames<-`(str_to_lower(colnames(.)))
#> # A tibble: 4 x 5
#>      yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
#>   <int>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1  2001        1.        3.      1.00      3.00
#> 2  2002        2.        5.     0.900      5.00
#> 3  2003        4.        2.      6.00      8.00
#> 4  2004        2.        1.      2.00     0.200
Created on 2018-04-19 by the reprex package (v0.2.0).
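With tidyr >= 1.0.0, gather/spread are superseded by pivot_longer/pivot_wider, and this reshape collapses into a single call; a sketch using names_glue to build the lower-case column names directly (column order may differ, and Yr keeps its original capitalisation):
library(tidyr)
tbl %>%
  pivot_wider(names_from = Geo,
              values_from = c(Obs1, Obs2),
              names_glue = "{tolower(Geo)}{tolower(.value)}")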

Assign unique ID based on two columns [duplicate]

This question already has answers here:
Add ID column by group [duplicate]
(4 answers)
How to create a consecutive group number
(13 answers)
Closed 5 years ago.
I have a dataframe (df) that looks like this:
School Student Year
     A      10 1999
     A      10 2000
     A      20 1999
     A      20 2000
     A      20 2001
     B      10 1999
     B      10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
 1      A      10 1999
 1      A      10 2000
 2      A      20 1999
 2      A      20 2000
 2      A      20 2001
 3      B      10 1999
 3      B      10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students in total).
I did df$ID <- df$Student and tried to increment the value by 1 whenever the c("School", "Student") combination was new. It isn't working. Help appreciated.
We can do this in base R without any group-by operation:
df$ID <- cumsum(!duplicated(df[1:2]))
df
#  School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3
NOTE: this assumes the data are already sorted so that each School/Student pair appears in contiguous rows.
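If the rows are not grouped contiguously, a first-appearance ID can still be built in base R without sorting; a minimal sketch (the separator is arbitrary, chosen only to avoid collisions between pasted values):
key <- paste(df$School, df$Student, sep = "\r")
df$ID <- match(key, unique(key))   # IDs numbered by order of first appearance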
Or using the tidyverse:
library(dplyr)
df %>%
  mutate(ID = group_indices_(df, .dots = c("School", "Student")))
#  School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3
As @radek mentioned, in recent versions (dplyr 0.8.0) group_indices_ is deprecated; use group_indices instead:
df %>%
  mutate(ID = group_indices(., School, Student))
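In dplyr >= 1.0.0, group_indices() is in turn superseded; the supported idiom is cur_group_id() inside a grouped mutate:
df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()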
Group by School and Student, then assign the group id to the ID variable.
library(data.table)
df[, ID := .GRP, by = .(School, Student)]
#    School Student Year ID
# 1:      A      10 1999  1
# 2:      A      10 2000  1
# 3:      A      20 1999  2
# 4:      A      20 2000  2
# 5:      A      20 2001  2
# 6:      B      10 1999  3
# 7:      B      10 2000  3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')
