How to count levels of a factor in a data.frame, grouped by another value of that data.frame [GNU R] [duplicate] - r

This question already has answers here:
simple data.frame reshape
(3 answers)
Reshaping data frame with duplicates
(4 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 9 years ago.
I have a data.frame like this
VAR1 VAR2
1999 USA
1999 USA
1999 UK
2000 GER
2000 USA
2000 GER
2000 USA
2001 USA
How do I count any level of VAR2 for each year?
What I want is a plot where the x-axis is the year and the y-axis is the count of any level in VAR2.

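The data.table solution (this counts the number of distinct VAR2 levels per year):
library(data.table)
# column names are case-sensitive: the data uses VAR1 and VAR2
new.dat = data.table(dat)[,length(unique(VAR2)),by=VAR1]
new.dat=as.matrix(new.dat)
plot(x=new.dat[,1],y=new.dat[,2])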
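For comparison, here is a minimal base R sketch of the same idea: the number of distinct VAR2 values per year. The data frame is rebuilt from the question's sample, and the name n_levels is only an illustrative choice.

```r
# Sample data copied from the question
dat <- data.frame(
  VAR1 = c(1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001),
  VAR2 = c("USA", "USA", "UK", "GER", "USA", "GER", "USA", "USA")
)

# Number of distinct VAR2 values within each year
n_levels <- tapply(dat$VAR2, dat$VAR1, function(x) length(unique(x)))

# Year on the x-axis, number of distinct levels on the y-axis
plot(as.numeric(names(n_levels)), as.vector(n_levels),
     xlab = "Year", ylab = "Distinct levels of VAR2")
```

Note that this gives one number per year. If a separate count for each level is wanted instead, a cross-tabulation is the better fit.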

The simplest way I can think of:
let dat = your data frame
with(dat,table(VAR1,VAR2))
The output will look something like this:
      VAR2
VAR1   GER UK USA
  1999   0  1   2
  2000   2  0   2
  2001   0  0   1
Hope this helps.
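Since the question asks for a plot, the table can be passed to matplot to draw one line per level. This is only a sketch using base graphics: the data frame is rebuilt from the question's sample, and the colours, symbols and legend position are arbitrary choices.

```r
# Sample data copied from the question
dat <- data.frame(
  VAR1 = c(1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001),
  VAR2 = c("USA", "USA", "UK", "GER", "USA", "GER", "USA", "USA")
)

# Cross-tabulate year against level, then drop the "table" class
# so that a plain integer matrix is handed to matplot
counts <- with(dat, table(VAR1, VAR2))
count_mat <- unclass(counts)

# One line per level of VAR2, with year on the x-axis
years <- as.numeric(rownames(count_mat))
matplot(years, count_mat, type = "b", lty = 1, pch = 1:3, col = 1:3,
        xlab = "Year", ylab = "Count")
legend("topright", legend = colnames(count_mat), lty = 1, pch = 1:3, col = 1:3)
```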

There are a large number of ways and this question is undoubtedly a duplicate. What have you tried? You can use dcast in the reshape2 package (the example below assumes the columns are named Year and Country rather than VAR1 and VAR2).
require(reshape2)
dcast( df , Country ~ Year , length )
# Country 1999 2000 2001
#1 GER 0 2 0
#2 UK 1 0 0
#3 USA 2 2 1

Related

Create multiple columns of values that are in second column and fill new data frame with number of occurences according to first column [duplicate]

This question already has answers here:
Frequency counts in R [duplicate]
(2 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I am new to Stack Overflow, so sorry if I am not asking the question properly.
I have two columns country and Year.
INDIA 1970
USA 1970
USA 1971
INDIA 1970
.
.
UK 1972
I want new data frame like this and I need to fill it with occurrences.
1970 1971 1972....
INDIA 2
USA 1 1
UK 1
An option could be to use reshape2::dcast with fun.aggregate argument set as length:
library(reshape2)
dcast(df, Country~Year, length)
# Country 1970 1971 1972
# 1 INDIA 2 0 0
# 2 UK 0 0 1
# 3 USA 1 1 0
Data:
df <- read.table(text =
"Country Year
INDIA 1970
USA 1970
USA 1971
INDIA 1970
UK 1972",
header = TRUE, stringsAsFactors = FALSE)
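If adding a package is not desirable, a similar wide table can probably be built in base R with table. The sketch below rebuilds the sample data and converts the cross-tabulation to a data frame; the object names counts and wide are illustrative only.

```r
# Sample data from the question
df <- data.frame(
  Country = c("INDIA", "USA", "USA", "INDIA", "UK"),
  Year    = c(1970, 1970, 1971, 1970, 1972)
)

# Cross-tabulate: one row per country, one column per year
counts <- table(df$Country, df$Year)

# Convert to a regular data frame; countries become row names
wide <- as.data.frame.matrix(counts)
wide
```

Unlike the dcast result, the countries end up as row names rather than in a Country column.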

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
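For reference, base R's reshape can do the same long-to-wide step in a single call on a plain data frame. This is a sketch under the assumption that a data.frame (not a data.table) is used; the name wide_df is illustrative, and the column and row order differ from the dcast output.

```r
# Sample data from the question, as a plain data.frame
data_df <- read.table(header = TRUE, text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# Year stays as the ID; every other column is spread by Country,
# giving names such as VarA.USA and VarB.Canada
wide_df <- reshape(data_df, idvar = "Year", timevar = "Country",
                   direction = "wide")
wide_df
```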

Should I use for loop? OR apply? [duplicate]

This question already has answers here:
Split dataframe by levels of a factor and name dataframes by those levels
(3 answers)
Closed 5 years ago.
This is my first post.
I have this data frame of the NHL draft.
What I would like to do is to use some sort of recursive function to create 10 objects.
So, I want to create these 10 objects by subsetting the Nhl dataframe by Year.
Here are the first 6 rows of the data set (nhl_draft)
Year Overall Team
1 2000 1 New York Islanders
2 2000 2 Atlanta Thrashers
3 2000 3 Minnesota Wild
4 2000 4 Columbus Blue Jackets
5 2000 5 New York Islanders
6 2000 6 Nashville Predators
Player PS
1 Rick DiPietro 49.3
2 Dany Heatley 95.2
3 Marian Gaborik 103.6
4 Rostislav Klesla 34.5
5 Raffi Torres 28.4
6 Scott Hartnell 74.5
I want to create 10 objects by subsetting out the Years, 2000 ~ 2009.
I tried,
for (i in 2000:2009) {
nhl_draft.i <- subset(nhl_draft, Year == "i")
}
BUT this doesn't do anything. What's the problem with this for-loop? Can you suggest any other ways?
Please tell me if this is confusing after all, this is my first post......
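The loop fails for two reasons: Year == "i" compares Year with the literal string "i" rather than the loop variable, and nhl_draft.i is one fixed name, so every iteration overwrites the same object. The following code may fix your error by storing each subset in a list.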
# Create an empty list
nhl_list <- list()
for (i in 2000:2009) {
# Subset the data frame based on Year
nhl_draft_temp <- subset(nhl_draft, Year == i)
# Assign the subset to the list
nhl_list[[as.character(i)]] <- nhl_draft_temp
}
But you can consider split, which is more concise.
nhl_list <- split(nhl_draft, f = nhl_draft$Year)
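A short sketch of how the resulting list might be used. The data frame here is a small made-up stand-in for nhl_draft (the values are hypothetical), and rows_per_year is an illustrative name.

```r
# Hypothetical stand-in for nhl_draft with only two columns
nhl_draft <- data.frame(
  Year    = c(2000, 2000, 2001, 2002),
  Overall = c(1, 2, 1, 1)
)

# One data frame per year, named by the year
nhl_list <- split(nhl_draft, f = nhl_draft$Year)

# Access a single year by name
nhl_list[["2000"]]

# Apply a function to every year at once, here the number of rows
rows_per_year <- sapply(nhl_list, nrow)
rows_per_year
```

Keeping the subsets in one list usually makes later steps easier than ten separate objects, because lapply or sapply can process every year in one call.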

How do I melt or reshape binned data in R? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 7 years ago.
I have binned data reflecting the width of rivers across each continent. Below is a sample dataset. I pretty much just want to get the data into the form I have shown.
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10",header = TRUE)
How do I melt the above toy dataset to create this dataframe?
Bin Count Continent
10 3 Africa
10 1 Asia
20 1 Asia
30 1 Africa
We could use any of the aggregate-by-group approaches.
The data.table option is to convert the 'data.frame' to a 'data.table' (setDT(dat)), group by the 'continent' and 'bin' variables, and get the number of elements per group (.N).
library(data.table)
setDT(dat)[,list(Count=.N) ,.(continent,bin)]
# continent bin Count
#1: Africa 10 3
#2: Asia 20 1
#3: Africa 30 1
#4: Asia 10 1
Or a similar option with dplyr: group by the variables and then use n() instead of .N to get the count.
library(dplyr)
dat %>%
group_by(continent, bin) %>%
summarise(Count=n())
Or we can use aggregate from base R with the formula method and take the length of each group.
aggregate(cbind(Count=width)~., dat, FUN=length)
# continent bin Count
#1 Africa 10 3
#2 Asia 10 1
#3 Asia 20 1
#4 Africa 30 1
From @Frank's and @David Arenburg's comments, some additional options using data.table and dplyr. We convert the dataset to data.table (setDT(dat)), convert to 'wide' format with dcast, then convert it back to 'long' using melt, and subset the rows (value>0).
library(data.table)
melt(dcast(setDT(dat),continent~bin))[value>0]
Using count from dplyr
library(dplyr)
count(dat, bin, continent)
With sqldf:
library(sqldf)
sqldf("SELECT bin, continent, COUNT(continent) AS count
FROM dat
GROUP BY bin, continent")
Output:
bin continent count
1 10 Africa 3
2 10 Asia 1
3 20 Asia 1
4 30 Africa 1
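Another base R route, shown here only as a sketch, is to cross-tabulate with table and turn the result into a long data frame. Empty combinations come back with a count of zero, so they are filtered out at the end. The column name count is an arbitrary choice.

```r
# Sample data from the question
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10", header = TRUE)

# Long-format counts for every bin/continent combination
counts <- as.data.frame(table(bin = dat$bin, continent = dat$continent),
                        responseName = "count")

# Keep only the combinations that actually occur
counts <- counts[counts$count > 0, ]
counts
```

Note that bin is returned as a factor here, so convert it back with as.numeric(as.character(counts$bin)) if numeric bins are needed.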

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
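Your aggregate is almost correct. The original call passes the whole data frame to sum, including the character column country_txt, which sum cannot handle. Aggregate only the two numeric columns and pass the grouping columns in by: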
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is a common choice for this kind of grouped summary nowadays.
library(dplyr)
# use the cleaned subset so that V1 and V2 are numeric
aggGT1 <- GT2Omit %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
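As one example of the base R route, aggregate also has a formula interface that names the summed and grouping columns explicitly. The sketch below rebuilds the sample table from the question, using the column names iyear, country_txt, V1 and V2 from the asker's code; the object name agg is illustrative.

```r
# Sample data from the question, with the asker's column names
GT2 <- data.frame(
  iyear       = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1          = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2          = c(3, 3, 5, 3, 5, 2, 1, 5)
)

# Sum V1 and V2 within each year/country combination
agg <- aggregate(cbind(V1, V2) ~ iyear + country_txt, data = GT2, FUN = sum)

# Sort by year, then country, to match the layout asked for
agg <- agg[order(agg$iyear, agg$country_txt), ]
agg
```

The formula method drops rows with missing values by default, so a separate na.omit step is usually not needed with it.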
