I have a dataframe with more than 2 000 000 records. Here is sample data:
year <- c(2002, 2002, 2001, 2001, 2000)
type <- c("red", "red", "blue", "blue", "blue")
mydata <- data.frame(year, type)
I need to extract the type per year, which would look something like this:
2002:
“red”: 2, “blue”: 0
2001:
“red”: 0, “blue”: 2
2000:
“red”: 0, “blue”: 1
I am able to extract it separately using table():
table(mydata$year)
table(mydata$type)
However, I cannot come up with a way to get both in one table.
Try aggregate like below
aggregate(type ~ ., mydata, function(x) table(factor(x, levels = unique(mydata$type))))
which gives
year type.red type.blue
1 2000 0 1
2 2001 0 2
3 2002 2 0
Another base R option using xtabs
xtabs(~ year + type, mydata)
gives
type
year blue red
2000 1 0
2001 2 0
2002 0 2
Here's another approach
library(dplyr)
library(tidyr)  # pivot_wider() comes from tidyr
data.frame(table(mydata)) %>%
  pivot_wider(names_from = type, values_from = Freq)
# A tibble: 3 x 3
year blue red
<fct> <int> <int>
1 2000 1 0
2 2001 2 0
3 2002 0 2
We could also use table
table(mydata)
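For readers who want the two-way table as an ordinary data frame, the following is a minimal base R sketch using the sample data from the question. The object names `counts` and `wide` are illustrative only and do not appear in the original answers.

```r
# Sample data from the question
year <- c(2002, 2002, 2001, 2001, 2000)
type <- c("red", "red", "blue", "blue", "blue")
mydata <- data.frame(year, type)

# Two-way contingency table: rows are years, columns are types
counts <- table(mydata$year, mydata$type)
print(counts)

# Convert the table to a plain data frame with one row per year
wide <- as.data.frame.matrix(counts)
wide$year <- rownames(wide)
print(wide)
```

Combinations that never occur (such as "blue" in 2002) appear as 0 rather than being dropped, which matches the output requested in the question.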
I'm fairly new to R and using the dplyr package currently. I have a dataframe that looks something like this simplified table:
year  category
2009  A
2009  B
2009  B
2010  A
2010  B
2011  A
2011  C
2011  C
I want to count how often each category occurs in each year, so I used:
df %>% count(year, category)
and got
year  category  count
2009  A         1
2009  B         2
2010  A         1
2010  B         1
2011  A         1
2011  C         2
However I would like to use the year as column names, to get the following:
   2009  2010  2011
A     1     1     1
B     2     1     0
C     0     0     2
What is an easy way to get this? I would like to get this in absolute numbers, and if possible as a normalized table (percentages of the total of each year).
I hope you guys can help me out!
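library(dplyr)
library(tidyr)
df %>% count(year, category) %>%
  pivot_wider(
    id_cols = category,       # named explicitly; recent tidyr versions require this
    names_from = year,
    names_prefix = "year_",
    values_from = n,
    values_fill = 0           # year/category pairs that never occur become 0
  )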
# A tibble: 3 x 4
category year_2009 year_2010 year_2011
<chr> <int> <int> <int>
1 A 1 1 1
2 B 2 1 0
3 C 0 0 2
Using reshape:
df2 = df %>% count(year, category)
df2 = reshape(df2, idvar='category', timevar='year', direction='wide')
rownames(df2) = df2$category
df2[is.na(df2)] = 0
df2 = df2[,c(2:4)]
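The question also asks for a normalized table (each category's share of its year's total). One possible base R sketch uses `prop.table()` on a two-way table. The data frame below is rebuilt from the simplified table in the question, and the names `counts` and `shares` are illustrative.

```r
# Data rebuilt from the simplified table in the question
df <- data.frame(
  year = c(2009, 2009, 2009, 2010, 2010, 2011, 2011, 2011),
  category = c("A", "B", "B", "A", "B", "A", "C", "C")
)

# Absolute counts: categories as rows, years as columns
counts <- table(df$category, df$year)
print(counts)

# Column-wise proportions: margin = 2 divides each cell by its column (year) total
shares <- prop.table(counts, margin = 2)

# Expressed as percentages, rounded to one decimal place
print(round(100 * shares, 1))
```

Each column of `shares` sums to 1, so the percentages within a year add up to 100.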
I have a dataset that identifies observations based on two variables: Time and Country. The variable of interest is dichotomous, and has the value 0 if the event didn't occur and 1 if it did.
For some countries more than one observation is reported per year.
The data can be summarized like this:
Country  Time  Conflict  Bio Weapons
A        2000  1         0
A        2000  2         0
B        2000  3         1
C        2000  4         0
D        2000  5         1
D        2000  6         0
D        2000  7         0
D        2000  8         1
Is it possible to collapse these multiple observations into one observation per year and country, with outcome 0 if the event never occurred or 1 if the event occurred at least once? Like this:
Country  Time  Bio Weapons
A        2000  0
B        2000  1
C        2000  0
D        2000  1
Thank you in advance !
Your output is a bit unclear since it doesn't match your description, but this is what I think you want:
dat %>%
dplyr::group_by(Country, Time) %>%
dplyr::summarise(Bio_Weapons = dplyr::if_else(1 %in% Bio.Weapons, 1, 0))
# A tibble: 4 x 3
# Groups: Country [4]
Country Time Bio_Weapons
<chr> <int> <dbl>
1 A 2000 0
2 B 2000 1
3 C 2000 0
4 D 2000 1
And since I like data.table solutions:
dat[, .(Bio_Weapons = fifelse(1 %in% Bio.Weapons, 1, 0)), by=c("Country", "Time")]
Country Time Bio_Weapons
1: A 2000 0
2: B 2000 1
3: C 2000 0
4: D 2000 1
An option without ifelse
library(dplyr)
dat %>%
  group_by(Country, Time) %>%
  summarise(Bio_Weapons = +(1 %in% Bio.Weapons))
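Because the indicator only takes the values 0 and 1, "occurred at least once" is the same as the group maximum being 1. Under that assumption, a base R sketch with `aggregate()` could look like the following. The data frame is rebuilt from the table in the question, with the column named `Bio.Weapons` as in the answers above, and `collapsed` is an illustrative name.

```r
# Data rebuilt from the table in the question
dat <- data.frame(
  Country = c("A", "A", "B", "C", "D", "D", "D", "D"),
  Time = 2000,
  Conflict = 1:8,
  Bio.Weapons = c(0, 0, 1, 0, 1, 0, 0, 1)
)

# One row per Country/Time; the maximum of a 0/1 column is 1
# exactly when the event occurred at least once in the group
collapsed <- aggregate(Bio.Weapons ~ Country + Time, data = dat, FUN = max)
print(collapsed)
```

This relies on the column containing only 0 and 1; if other codes or missing values are present, the `%in%` approach above is the safer choice.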
For my dataset I want one row per year for each ID, and I want to determine whether they lived in an urban area or not (0/1). Because some IDs moved within a year and therefore have two rows for that year, I want to identify the IDs that have two rows for a specific year, which means they lived in both an urban and a non-urban area in that year (so I can manually determine in Excel where they belong).
I’ve already excluded the exact double rows (so they moved in a certain year, but the urbanisation didn’t change).
df <- df %>% distinct(ID, YEAR, URBAN, .keep_all = TRUE)
structure(t2A)
# A tibble: 3,177,783 x 4
ID ZIPCODE YEAR URBAN
<dbl> <chr> <chr> <dbl>
1 1 1234AB 2013 0
2 1 1234AB 2014 0
3 1 1234AB 2015 0
4 1 1234AB 2016 0
5 1 1234AB 2017 0
6 1 1234AB 2018 0
7 2 5678CD 2013 0
8 2 5678CD 2014 0
9 2 5678CD 2015 0
10 2 5678CD 2016 0
# ... with 3,177,773 more rows
structure(list(ID= c(1, 1, 1, 1
), YEAR = c("2013", "2014", "2015", "2016"), URBAN = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Can you guys help me with identifying ID’s that have two rows for a specific year/have a 0 and 1 in the same year?
Edit: the example doesn't show any ID's with urbanisation 1, but there are and not all ID's are included all years :)
Below might be useful:
df <- df %>%
  dplyr::group_by(ID, YEAR) %>%
  dplyr::mutate(nIds = dplyr::n(),                        # number of rows for this ID and year combination
                URBAN_Flag = sum(URBAN),                  # greater than 0 if any row in the group is urban
                moved = dplyr::if_else(nIds > 1, 1, 0)) %>%  # 1 if the ID has more than one row in that year
  dplyr::select(-c(nIds))
You can deselect the columns if not needed
First, we create some dummy data
library(tidyverse)
db <- tibble(
id = c(1, 1, 1, 2, 2, 2),
year = c(2000, 2000, 2001, 2001, 2002, 2003),
urban = c(0, 1, 0, 0, 0, 0)
)
We see that person one moved in 2000.
id year urban
<dbl> <dbl> <dbl>
1 1 2000 0
2 1 2000 1
3 1 2001 0
4 2 2001 0
5 2 2002 0
6 2 2003 0
Now, we can group by id and year and count the number of rows. We can use the count value to create a dummy whether or not they moved in a given year.
db %>%
group_by(id, year) %>%
summarize(rows = n()) %>%
mutate(
moved = ifelse(rows == 2, 1, 0)
)
Which gives the result:
id year rows moved
<dbl> <dbl> <int> <dbl>
1 1 2000 2 1
2 1 2001 1 0
3 2 2001 1 0
4 2 2002 1 0
5 2 2003 1 0
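The question also asks specifically for IDs that have both a 0 and a 1 in the same year. Counting rows only works if exact duplicates were removed beforehand, so a slightly more direct check is to count the distinct urban values per id and year. A minimal base R sketch using the dummy data above (the names `mixed` and `flagged` are illustrative):

```r
# Dummy data from the answer above
db <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2000, 2001, 2001, 2002, 2003),
  urban = c(0, 1, 0, 0, 0, 0)
)

# Number of distinct urban values for each id/year combination
mixed <- aggregate(urban ~ id + year, data = db,
                   FUN = function(x) length(unique(x)))
names(mixed)[3] <- "n_urban_values"

# Keep only the combinations that contain both 0 and 1
flagged <- mixed[mixed$n_urban_values > 1, ]
print(flagged)
```

Here only id 1 in year 2000 is flagged, which is the person who moved in the dummy data.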
I have a CSV like this, saved as an object in R named df1.
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
I want to:
1. Count, for each column, the values that are not "0" for year 2014. For example, for column X the count is 2 (not 3, because I want 2014 data only). For column Y the count is 4. For column Z the count is 2.
2. Sum all the counts for each column
This is what I tried:
count_total <- sum(df1$x != 0 &
df1$y != 0 &
df1&z != 0 &
df1$Year == 2014)
count_total
I want the output to simply be 1 (i.e. only the 2nd row in df1 has no 0's)
However, this does not align with my COUNTIFS in Excel. In Excel, it's like this:
=COUNTIFS('df1'!$A$2:$A$859,"<>0",'df1'!$B$2:$B$859,"<>0",
'df1'!$C$2:$C$859,"<>0",'df1'!$D$2:$D$859,2014)
Wondering if I mistyped something in R? I'm a dplyr user but can't find anything particularly useful on Google.
Thank you very much!
One way is using rowSums on subset of data
sum(rowSums(subset(df1, Year == 2014) == 0) == 0)
#[1] 1
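As for the attempt in the question, two things stand out, assuming the columns are named X, Y and Z as in the printed CSV. R is case-sensitive, so `df1$x` and `df1$y` return NULL rather than the columns, and `df1&z` uses `&` where `$` was probably intended. A sketch of the attempt with only those points changed:

```r
# Data as posted in the question
df1 <- read.table(text = "
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
", header = TRUE)

# Same logic as the original attempt, with upper-case column names
# and `$` instead of `&` for the Z column
count_total <- sum(df1$X != 0 &
                   df1$Y != 0 &
                   df1$Z != 0 &
                   df1$Year == 2014)
print(count_total)
```

With the sample data this gives the same count as the `rowSums()` version above.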
You can do this with aggregate then colSums to get the totals by column.
agg <- aggregate(. ~ Year, df1, function(x) sum(x != 0))
agg
# Year X Y Z
#1 2014 2 4 2
#2 2015 1 1 0
colSums(agg[-1])
#X Y Z
#3 5 2
Data.
df1 <- read.table(text = "
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
",header = TRUE)
dplyr approach:
library(dplyr)
df1 %>%
group_by(Year) %>%
summarise_at(vars(X:Z), function (x) sum(x != 0))
Output:
# A tibble: 2 x 4
# Year X Y Z
# <int> <int> <int> <int>
# 1 2014 2 4 2
# 2 2015 1 1 0
Alternative using summaryBy.
library(doBy)
summaryBy(list(c('X','Y','Z'), c('Year')), df1, FUN= function(x) sum(x!=0), keep.names=T)
Year X Y Z
1 2014 2 4 2
2 2015 1 1 0
When needed use colSums as explained before.
Let's say I have the following data of sock increase per drawer
>socks
year drawer_nbr sock_total
1990 1 2
1991 1 2
1990 2 3
1991 2 4
1990 3 2
1991 3 1
I would like to have a binary variable that identifies if the socks have increased in each drawer. 1 if they increased and 0 if not. The result would be
>socks
drawer_nbr growth
<dbl> <factor>
1 0
2 1
3 0
I am getting stuck on comparing sock_total of one year vs sock_total of another year. I know that I need to use dplyr::summarise(), but I am having difficulty with what goes inside that function.
If you are comparing year 1991 with 1990, you can do:
socks %>%
group_by(drawer_nbr) %>%
summarise(growth = +(sock_total[year == 1991] - sock_total[year == 1990] > 0))
# A tibble: 3 x 2
# drawer_nbr growth
# <int> <int>
#1 1 0
#2 2 1
#3 3 0
You could use a mix of dplyr and tidyr:
library(tidyr)
library(dplyr)
socks %>%
group_by(drawer_nbr) %>%
spread(year, sock_total) %>%
mutate(growth = `1991` - `1990`)
Or if you only wanted growth to be binary:
socks %>%
group_by(drawer_nbr) %>%
spread(year, sock_total) %>%
mutate(growth = ifelse((`1991` - `1990`) > 0,
1, 0))
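If the year values should not be hard-coded, one possible base R sketch compares the latest year with the earliest year in each drawer. It assumes "growth" means exactly that comparison, which the question does not state explicitly; the data frame is rebuilt from the question, and the name `growth` for the result is illustrative.

```r
# Data rebuilt from the question
socks <- data.frame(
  year = c(1990, 1991, 1990, 1991, 1990, 1991),
  drawer_nbr = c(1, 1, 2, 2, 3, 3),
  sock_total = c(2, 2, 3, 4, 2, 1)
)

# Sort so that, within each drawer, rows run from earliest to latest year
socks <- socks[order(socks$drawer_nbr, socks$year), ]

# 1 if the last year's total exceeds the first year's total, otherwise 0
growth <- aggregate(
  sock_total ~ drawer_nbr, data = socks,
  FUN = function(x) as.integer(x[length(x)] > x[1])
)
names(growth)[2] <- "growth"
print(growth)
```

For the sample data only drawer 2 is flagged, in line with the expected result in the question.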