I have a dataset that identifies observations based on two variables: Time and Country. The variable of interest is dichotomous, and has the value 0 if the event didn't occur and 1 if it did.
For some countries more than one observation is reported per year.
The data can be summarized like this:
Country  Time  Conflict  Bio Weapons
A        2000  1         0
A        2000  2         0
B        2000  3         1
C        2000  4         0
D        2000  5         1
D        2000  6         0
D        2000  7         0
D        2000  8         1
Is it possible to collapse these multiple observations into one observation per year and country, with outcome 0 if the event never occurred and 1 if it occurred at least once? Like this:
Country  Time  Bio Weapons
A        2000  0
B        2000  1
C        2000  0
D        2000  1
Thank you in advance!
Your expected output is a bit unclear since it doesn't match your description, but this is what I think you want:
library(dplyr)  # provides %>%
dat %>%
  dplyr::group_by(Country, Time) %>%
  dplyr::summarise(Bio_Weapons = dplyr::if_else(1 %in% Bio.Weapons, 1, 0))
# A tibble: 4 x 3
# Groups: Country [4]
Country Time Bio_Weapons
<chr> <int> <dbl>
1 A 2000 0
2 B 2000 1
3 C 2000 0
4 D 2000 1
And since I like data.table solutions:
library(data.table)
setDT(dat)[, .(Bio_Weapons = fifelse(1 %in% Bio.Weapons, 1, 0)), by = c("Country", "Time")]
Country Time Bio_Weapons
1: A 2000 0
2: B 2000 1
3: C 2000 0
4: D 2000 1
An option without ifelse
library(dplyr)
dat %>%
  group_by(Country, Time) %>%
summarise(Bio_Weapons = +(1 %in% Bio.Weapons))
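A base R sketch of the same idea, for anyone who prefers no packages: because the variable is 0/1, the per-group maximum is 1 exactly when the event occurred at least once. The dat below is my reconstruction of the question's table (with the column named Bio.Weapons as in the answers above), and collapsed is just a name picked for the example.

```r
# Hypothetical reconstruction of the example data from the question
dat <- data.frame(
  Country = c("A", "A", "B", "C", "D", "D", "D", "D"),
  Time = 2000L,
  Conflict = 1:8,
  Bio.Weapons = c(0, 0, 1, 0, 1, 0, 0, 1)
)

# One row per Country/Time; max() of a 0/1 column is 1 if any row is 1
collapsed <- aggregate(Bio.Weapons ~ Country + Time, data = dat, FUN = max)
print(collapsed)
```

Using max() here is only valid because the column contains nothing but 0 and 1; with other codings, something like any(x == 1) would be the safer test.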
I have a dataframe with more than 2 000 000 records. Here is sample data:
year <- c(2002, 2002, 2001, 2001, 2000)
type <- c("red", "red", "blue", "blue", "blue")
mydata <- data.frame(year, type)
I need to extract the type per year, which would look something like this:
2002:
“red”: 2, “blue”: 0
2001:
“red”: 0, “blue”: 2
2000:
“red”: 0, “blue”: 1
I am able to extract it separately using table():
table(mydata$year)
table(mydata$type)
However, I cannot come up with a way to do it in one table.
Try aggregate like below
aggregate(type ~ ., mydata, function(x) table(factor(x, levels = unique(type))))
which gives
year type.red type.blue
1 2000 0 1
2 2001 0 2
3 2002 2 0
Another base R option using xtabs
xtabs(~ year + type, mydata)
gives
type
year blue red
2000 1 0
2001 2 0
2002 0 2
Here's another approach
library(dplyr)
library(tidyr)  # pivot_wider() comes from tidyr, not dplyr
data.frame(table(mydata)) %>%
  pivot_wider(names_from = type, values_from = Freq)
# A tibble: 3 x 3
year blue red
<fct> <int> <int>
1 2000 1 0
2 2001 2 0
3 2002 0 2
We could also use table
table(mydata)
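If the counts are needed as a regular data frame rather than a table object, here is a small sketch: cross-tabulate the two columns and convert with as.data.frame.matrix(). The names counts and wide are assumptions made for illustration.

```r
# Sample data from the question
year <- c(2002, 2002, 2001, 2001, 2000)
type <- c("red", "red", "blue", "blue", "blue")
mydata <- data.frame(year, type)

# Two-way contingency table: rows are years, columns are types
counts <- table(mydata$year, mydata$type)

# Convert to a plain data frame, keeping the years as row names
wide <- as.data.frame.matrix(counts)
print(wide)
```

Combinations that never occur (for example "red" in 2000) show up as 0 rather than being dropped.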
So here is the data:
Year State Grade Yes
2000 AZ A 1
2000 AZ A 0
2000 AZ A 1
2000 AZ B 1
2000 AZ B 1
2000 CA A 1
2000 CA A 0
2000 CA B 0
2000 NY A 1
2000 NY A 1
2001 NY B 1
What I'm trying to do is create a table that shows the sum of the 1's in the Yes column as a fraction of each group. The resulting table will show a value for each group based upon year, state and grade. It will look like this:
Year Grade AZ CA NY
2000 A 0.667 0.5 1
2000 B 1 0 0
2001 A 0 0 0
2001 B 0 0 1
There is more to the data including multiple values for Year, Grade and State so the table will be much larger but essentially it will return a proportion for each Group based on these three variables.
My code so far looks like this:
library(tidyverse)
data %>%
group_by(Year, State, Grade) %>%
summarise(x = Yes / count(Yes)) %>%
spread(State, x)
You were close.
The complete() line (the second line of the code) is optional; it adds every Year and State/Grade combination.
Get the sum of Yes and divide it by the number of rows per group (= n()), then spread. If you want 0 instead of NA for empty cells, don't forget fill = 0 at the end.
df %>%
complete( Year, nesting( State, Grade ), fill = list( Yes = 0 ) ) %>%
group_by( Year, State, Grade ) %>%
summarise( x = sum( Yes ) / n() ) %>%
spread( State, x, fill = 0 )
# # A tibble: 4 x 5
# # Groups: Year [2]
# Year Grade AZ CA NY
# <int> <chr> <dbl> <dbl> <dbl>
# 1 2000 A 0.667 0.5 1
# 2 2000 B 1 0 0
# 3 2001 A 0 0 0
# 4 2001 B 0 0 1
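Since Yes only holds 0 and 1, sum(Yes) / n() is the same as mean(Yes), which makes a base R version short. This is a minimal sketch, assuming the data frame below matches the question's sample; df and wide are names chosen for the example.

```r
# Sample data from the question
df <- data.frame(
  Year  = c(rep(2000, 10), 2001),
  State = c(rep("AZ", 5), rep("CA", 3), rep("NY", 3)),
  Grade = c("A", "A", "A", "B", "B", "A", "A", "B", "A", "A", "B"),
  Yes   = c(1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1)
)

# Rows: Year/Grade combinations, columns: State, cells: proportion of 1s
wide <- tapply(df$Yes, list(paste(df$Year, df$Grade), df$State), mean)

# Combinations with no rows come back as NA; show them as 0
wide[is.na(wide)] <- 0
print(round(wide, 3))
```

Unlike the complete() version, this only produces rows for Year/Grade pairs that actually appear in the data, so there is no "2001 A" row here.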
I am looking for a way to drop the rows that do not belong to a complete run of years, without using a for loop. Every value in the year column is between 1999 and 2002, but some runs do not include all the years in that range. The initial data looks like this:
a <- data.frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
The processed dataset should only include runs of consecutive rows that cover 1999:2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get previous data.frame without any problem:
for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
if(a[i,1] == 1999 & a[i+3,1] == 2002){
b <- a[i:(i+3),]
}else{next}
if(!exists("d")){
d <- b
}else{
d <- rbind(d,b)
}
}
However, I have more than 1 million rows, so I need to do this without a for loop. Is there a faster way?
You could try this. First we create groups of consecutive numbers, then we join with the full date range, then we filter if any group is not full. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- tibble(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
             id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
df %>%
  # start a new group whenever the year does not increase by exactly 1
  mutate(groups = cumsum(c(0,diff(year)!=1))) %>%
  nest(data = -groups) %>%
  mutate(data = map(data, .f = ~full_join(.x, tibble(year = 1999:2002), by = "year")),
         # an NA id after the join means the group was missing a year
         drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>%
  unnest(data) %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
You can filter on the year values with dplyr.
First, install dplyr (or the whole tidyverse) with install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered = filter(a, year >= 1999 & year <= 2002).
This should be fast even when there are many rows. Note that it only filters on the value of year; it does not check that each run contains all four years.
We could also do this by creating a grouping column that starts a new group at every 'year' equal to 1999, then keeping a group only if its first 'year' is 1999, its last is 2002, and all the years in between are present for that 'grp'.
library(dplyr)
a %>%
group_by(grp = cumsum(year == 1999)) %>%
filter(dplyr::first(year) == 1999,
dplyr::last(year) == 2002,
all(1999:2002 %in% year)) %>%
ungroup %>% # in case to remove the 'grp'
select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
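The same grouping idea can be sketched in base R with cumsum() and ave(), which avoids both the loop and any package dependency. This is a minimal sketch under the question's assumption that every complete run is exactly 1999, 2000, 2001, 2002 in that order; grp, complete and result are names made up for the example.

```r
# Sample data from the question
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id   = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))

# A new group starts at every 1999; rows before the first 1999 fall in group 0
grp <- cumsum(a$year == 1999)

# Flag every row of a group with 1 if the group is exactly 1999:2002, else 0
complete <- ave(a$year, grp, FUN = function(y) {
  rep(identical(as.integer(y), 1999:2002), length(y))
}) == 1

result <- a[complete, ]
print(result)
```

The row names of result keep the original row numbers (4 to 11); use rownames(result) <- NULL if they should be renumbered.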
My sample dataset
df <- data.frame(period=rep(1:3,3),
product=c(rep('A',9)),
account= c(rep('1001',3),rep('1002',3),rep('1003',3)),
findme= c(0,0,0,1,0,1,4,2,0))
My Desired output dataset
output <- data.frame(period=rep(1:3,2),
product=c(rep('A',6)),
account= c(rep('1002',3),rep('1003',3)),
findme= c(1,0,1,4,2,0))
Here are my conditions:
I want to eliminate 3 of the 9 records based on the conditions below:
the “findme” value is equal to zero for all my periods (1, 2 and 3),
and that happens for the same product
and the same account.
Rule 1: The records should cover periods 1, 2 and 3
Rule 2: The findme value for all periods = 0
Rule 3: All 3 records (periods 1, 2, 3) should have the same product
Rule 4: All 3 records (periods 1, 2, 3) should have the same account.
If I understand correctly, you want to drop all records from a product-account combination where findme == 0, if all periods are included in this combination?
library(dplyr)
df %>%
group_by(product, account, findme) %>%
mutate(all.periods = all(1:3 %in% period)) %>%
ungroup() %>%
filter(!(findme == 0 & all.periods)) %>%
select(-all.periods)
# A tibble: 6 x 4
period product account findme
<int> <fctr> <fctr> <dbl>
1 1 A 1002 1
2 2 A 1002 0
3 3 A 1002 1
4 1 A 1003 4
5 2 A 1003 2
6 3 A 1003 0
Here is an option with data.table
library(data.table)
setDT(df)[df[, .I[all(1:3 %in% period) & !all(!findme)], .(product, account)]$V1]
# period product account findme
#1: 1 A 1002 1
#2: 2 A 1002 0
#3: 3 A 1002 1
#4: 1 A 1003 4
#5: 2 A 1003 2
#6: 3 A 1003 0
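For completeness, a base R sketch of the rules as I read them: a product/account group is dropped only when it covers periods 1 to 3 and every findme in it is 0. The names key, idx, drop_group and result are assumptions made for the example.

```r
# Sample data from the question
df <- data.frame(period  = rep(1:3, 3),
                 product = rep("A", 9),
                 account = c(rep("1001", 3), rep("1002", 3), rep("1003", 3)),
                 findme  = c(0, 0, 0, 1, 0, 1, 4, 2, 0))

# One key per product/account combination
key <- interaction(df$product, df$account, drop = TRUE)

# Row indices belonging to each combination
idx <- split(seq_len(nrow(df)), key)

# TRUE for a combination that has all three periods and only zeros in findme
drop_group <- vapply(idx, function(i) {
  all(1:3 %in% df$period[i]) && all(df$findme[i] == 0)
}, logical(1))

# Keep the rows whose combination is not flagged
result <- df[!drop_group[as.character(key)], ]
print(result)
```

Unlike the data.table one-liner, this keeps groups that do not cover all three periods, which seemed closer to the stated rules; that reading is an assumption on my part.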
I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
rank observations in group
I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how often the person has been observed so far.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]
person.obs$n.obs = 0
# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
print(unp[i])
person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
i=i+1
gc()
}
# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
e = 0
}
e=e+1
person.obs[i,]$n.obs = e
i=i+1
gc()
}
The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave() and seq_along().
foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))
foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Another option using plyr
library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))
person year obs obs2
1 pers1 1999 1 1
2 pers1 2000 2 2
3 pers1 2003 3 3
4 pers2 1998 1 1
5 pers2 2011 2 2
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
both give:
> foo
person year rn
1: pers1 1999 1
2: pers1 2000 2
3: pers1 2003 3
4: pers2 1998 1
5: pers2 2011 2
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
both giving a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]
person year rn
(fctr) (dbl) (int)
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Would by do the trick?
> foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011),obs=c(1,2,3,1,2))
> foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
> by(foo, foo$person, nrow)
foo$person: pers1
[1] 3
------------------------------------------------------------
foo$person: pers2
[1] 2
Another option using aggregate and rank in base R:
foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])
# person year obs
# 1 pers1 1999 1
# 2 pers1 2000 2
# 3 pers1 2003 3
# 4 pers2 1998 1
# 5 pers2 2011 2
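One caveat with the aggregate() line above: it returns the ranks in group order, so it only lines up with foo when the rows are already sorted by person. Here is a minimal sketch that ranks in place with ave(), so row order does not matter. The shuffled foo below is made-up example data, and ties.method = "first" is my choice so that repeated years still get distinct counts.

```r
# Made-up example data, deliberately not sorted by person or year
foo <- data.frame(person = c("pers2", "pers1", "pers1", "pers2", "pers1"),
                  year   = c(2011, 2000, 1999, 1998, 2003))

# Rank the years within each person; results are written back row by row
foo$obs <- ave(foo$year, foo$person,
               FUN = function(y) rank(y, ties.method = "first"))
print(foo)
```

Sorting afterwards with foo[order(foo$person, foo$year), ] gives the 1, 2, 3 / 1, 2 layout shown in the question.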