How do I melt or reshape binned data in R? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 7 years ago.
I have binned data reflecting the width of rivers on each continent. Below is a sample dataset; I want to reshape it into the form shown further down.
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10",header = TRUE)
How do I melt the above toy dataset to create this dataframe?
Bin Count Continent
10 3 Africa
10 1 Asia
20 1 Asia
30 1 Africa

We could use any of several group-by aggregation approaches.
The data.table option is to convert the 'data.frame' to a 'data.table' (setDT(dat)) and, grouping by the 'continent' and 'bin' variables, get the number of rows per group with .N:
library(data.table)
setDT(dat)[, list(Count = .N), by = .(continent, bin)]
# continent bin Count
#1: Africa 10 3
#2: Asia 20 1
#3: Africa 30 1
#4: Asia 10 1
Or a similar option with dplyr: group by the same variables and use n() instead of .N to get the count.
library(dplyr)
dat %>%
  group_by(continent, bin) %>%
  summarise(Count = n())
Or we can use aggregate from base R with the formula method and take the length:
aggregate(cbind(Count = width) ~ ., dat, FUN = length)
# continent bin Count
#1 Africa 10 3
#2 Asia 10 1
#3 Asia 20 1
#4 Africa 30 1
From #Frank's and #David Arenburg's comments, here are some additional options using data.table and dplyr. We convert the dataset to a data.table (setDT(dat)), reshape to 'wide' format with dcast, reconvert to 'long' with melt, and subset the rows (value > 0):
library(data.table)
melt(dcast(setDT(dat), continent ~ bin))[value > 0]
Using count from dplyr
library(dplyr)
count(dat, bin, continent)

With sqldf:
library(sqldf)
sqldf("SELECT bin, continent, COUNT(continent) AS count
FROM dat
GROUP BY bin, continent")
Output:
bin continent count
1 10 Africa 3
2 10 Asia 1
3 20 Asia 1
4 30 Africa 1
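For completeness, base R's table() gives the same counts; a minimal sketch that cross-tabulates bin against continent, converts back to a long data frame, and drops the zero-count combinations:
# cross-tabulate, then keep only the bin/continent combinations that actually occur
tab <- as.data.frame(table(Bin = dat$bin, Continent = dat$continent))
tab[tab$Freq > 0, ]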

Related

dplyr, filter if both values are above a number [duplicate]

This question already has answers here:
dplyr filter with condition on multiple columns
(6 answers)
Closed 2 years ago.
I have a data set like such.
df = data.frame(Business = c('HR','HR','Finance','Finance','Legal','Legal','Research'), Country = c('Iceland','Iceland','Norway','Norway','US','US','France'), Gender=c('Female','Male','Female','Male','Female','Male','Male'), Value =c(10,5,20,40,10,20,50))
I need to keep only the rows where both the Male value and the Female value for a Business/Country are >= 10. For example, Iceland HR should be removed (its Male value is only 5), as should Research France (there is no Female row).
I've tried df %>% group_by(Business, Country) %>% filter(Value >= 10), but that only drops the individual rows with a value below 10. Any ideas?
Maybe this can help:
library(dplyr)
library(reshape2)
df2 <- reshape(df, idvar = c('Business','Country'), timevar = 'Gender', direction = 'wide')
df2 %>% mutate(Index = ifelse(Value.Female >= 10 & Value.Male >= 10, 1, 0)) %>%
  filter(Index == 1) -> df3
df4 <- reshape2::melt(df3[, -5], id.vars = c('Business','Country'))
df4
Business Country variable value
1 Finance Norway Value.Female 20
2 Legal US Value.Female 10
3 Finance Norway Value.Male 40
4 Legal US Value.Male 20
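If you prefer to stay entirely in dplyr, a grouped filter also works; a sketch along the lines of what the question attempted, keeping only the Business/Country groups that have both a Male and a Female row with Value >= 10:
library(dplyr)
# keep a group only if it has both gender rows and every Value in it is >= 10
df %>%
  group_by(Business, Country) %>%
  filter(n() == 2, all(Value >= 10)) %>%
  ungroup()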
You could just use two ave steps, one with length, one with min.
df <- df[with(df, ave(Value, Country, FUN=length)) == 2, ]
df[with(df, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
# 5 Legal US Female 10
# 6 Legal US Male 20
Notice that this also works if we shuffle the rows of the data frame.
set.seed(42)
df2 <- df[sample(1:nrow(df)), ]
df2 <- df2[with(df2, ave(Value, Country, FUN=length)) == 2, ]
df2[with(df2, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 5 Legal US Female 10
# 6 Legal US Male 20
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
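If the same Country could appear under more than one Business, the same two checks can be done on the Business/Country pair instead; a sketch (applied to the original df) using interaction() for the grouping:
# group on the Business/Country pair rather than Country alone
grp <- with(df, interaction(Business, Country))
df[ave(df$Value, grp, FUN = length) == 2 & ave(df$Value, grp, FUN = min) >= 10, ]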

Summing part of data frame over identical factor-levels to get rid of abundance in identical levels in data frame [duplicate]

This question already has answers here:
Aggregate by specific year in R
(2 answers)
Closed 5 years ago.
I have this as part of a dataset of about 6,000 rows:
ÅR LM RE AGE PA REC
1 2012 PKORT Stockholm <19 17973 35508
2 2012 PKORT Stockholm 20-24 31042 63229
3 2012 PKORT Stockholm 25-29 27305 64558
4 2012 PKORT Stockholm 30-34 18256 42726
5 2012 PKORT Stockholm 35-39 13200 32145
6 2012 PKORT Stockholm 40< 9458 24422
7 2012 PKORT Stockholm 40< 6123 16152
I want to sum the PA and REC values of all rows where AGE is "40<", to collapse the duplicated factor level in the data frame.
I have tried aggregate and tapply, and also assumed that R would understand that both "40<" rows should be summed when lm functions are applied.
This seems like a really easy operation; any help is appreciated.
We can do this with dplyr
library(dplyr)
df1 %>%
  filter(AGE == "40<") %>%
  group_by_at(names(df1)[1:3]) %>%
  summarise_at(vars(PA, REC), sum)
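If the goal is instead to collapse the duplicated "40<" rows inside the full data frame (rather than only extract their sums), a sketch, assuming the columns are named as in the excerpt, is to group by everything except PA and REC and sum those two:
library(dplyr)
# one row per ÅR/LM/RE/AGE combination, with PA and REC summed within it
df1 %>%
  group_by(ÅR, LM, RE, AGE) %>%
  summarise_at(vars(PA, REC), sum)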

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year, Balance) ~ Name, data = df, FUN = max)
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum Balance of the latest/most recent Year. In the first row of the output, John has the latest year (2016) but the balance of 2015, which is not what I need: it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
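As an aside, one way to actually make aggregate work here is a two-step approach: first find the latest Year per Name, then take the maximum Balance within that year. A sketch using merge, assuming df is still the plain data frame defined in the question:
# Step 1: latest Year for each Name
latest <- aggregate(Year ~ Name, data = df, FUN = max)
# Step 2: keep only the rows from each Name's latest year, then take the max Balance
aggregate(Balance ~ Name + Year, data = merge(df, latest), FUN = max)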
I suggest using the dplyr library:
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
tbl_df() %>% #convert it to dplyr format
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulted dataframe by Name
top_n(1,maxBalance) # return only the first record of each group
Here is another solution without the data.table package.
First sort the data frame:
df <- df[order(-df$Year, -df$Balance), ]
Then select the first row in each group with the same name:
df[!duplicated(df$Name), ]

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how to aggregate the data: group it first by year, then by country, with the sums of variable 1 and variable 2 following, so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried, to no avail (the real data frame is 125,000 rows by 30+ columns, hence the subset; please be kind, I'm new to R!):
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
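For instance, by() splits the data frame on the grouping columns and applies a function to each piece; a rough sketch that reproduces the sums and binds the pieces back together (empty year/country combinations are skipped by returning NULL):
# by() returns one result per year/country combination; rbind stacks them back up
res <- by(GT2Omit, GT2Omit[c("iyear", "country_txt")], function(d)
  if (nrow(d)) data.frame(d[1, c("iyear", "country_txt")],
                          V1 = sum(d$V1), V2 = sum(d$V2)))
do.call(rbind, res)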

How to count levels of a factor in a data.frame, grouped by another value of that data.frame [GNU R] [duplicate]

This question already has answers here:
simple data.frame reshape
(3 answers)
Reshaping data frame with duplicates
(4 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 9 years ago.
I have a data.frame like this
VAR1 VAR2
1999 USA
1999 USA
1999 UK
2000 GER
2000 USA
2000 GER
2000 USA
2001 USA
How do I count the occurrences of each level of VAR2 for each year?
What I want is a plot where the x-axis is the year and the y-axis is the count of each level of VAR2.
The data.table solution
library(data.table)
new.dat <- data.table(dat)[, length(unique(VAR2)), by = VAR1]
new.dat <- as.matrix(new.dat)
plot(x = new.dat[, 1], y = new.dat[, 2])
The simplest way I can think of:
let dat = your data frame
with(dat,table(VAR1,VAR2))
The output will look something like this:
VAR2
VAR1 GER UK USA
1999 0 1 2
2000 2 0 2
2001 0 0 1
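Since the question asks for a plot, the counts in such a table can go straight into barplot(); a quick sketch with the years on the x-axis and one bar per VAR2 level:
# columns of the table become the x-axis groups (years), rows become the bars
tab <- with(dat, table(VAR2, VAR1))
barplot(tab, beside = TRUE, legend.text = TRUE, xlab = "Year", ylab = "Count")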
Hope this helps.
There are a large number of ways, and this question is undoubtedly a duplicate. What have you tried? You can use dcast from the reshape2 package.
require(reshape2)
dcast(dat, VAR2 ~ VAR1, length)
# VAR2 1999 2000 2001
#1 GER 0 2 0
#2 UK 1 0 0
#3 USA 2 2 1
