cbind arguments in large dataframe - r

I have searched unsuccessfully for several days for an answer to this question: I have a dataframe with 279 columns and want to generate subtotals using aggregate(), or indeed, anything suitable. Here is a subset:
  LGA    off.cat         sub.cat                               Jan1995 Feb1995
1 Albury Homicide        Murder *                                    0       0
2 Albury Homicide        Attempted murder                            0       0
3 Albury Homicide        Murder accessory, conspiracy                0       0
4 Albury Homicide        Manslaughter *                              0       0
5 Albury Assault         Domestic violence related assault           7       7
6 Albury Assault         Non-domestic violence related assault      29      20
7 Albury Assault         Assault Police                              12       3
8 Albury Sexual offences Sexual assault                               4       3
The full dataframe contains dozens of LGA values, and many more date columns. I would like to obtain subtotals for each unique LGA value grouped by unique values of off.cat and sub.cat, summed over all dates. I tried using cbind in aggregate, but found no way to generate the 276 date column names that would not cause errors. Explicit column names worked fine. Apologies for the lack of clarity in the earlier post, and thanks to those who valiantly tried to interpret my meaning.

Your question is a bit unclear, but you may be successful using the formula syntax of aggregate. Here's an example:
df <- data.frame(group = letters[1:5],
x = 1:5,
y = 6:10,
z = 11:15)
group x y z
1 a 1 6 11
2 b 2 7 12
3 c 3 8 13
4 d 4 9 14
5 e 5 10 15
We now sum all three variables x, y and z by the levels of group, using setdiff to get a vector of column names except group, and pasting them together to use in as.formula:
aggregate(as.formula(paste(paste(setdiff(names(df), c("group")), collapse = "+"), "~ group")), data = df, sum)
group x + y + z
1 a 18
2 b 21
3 c 24
4 d 27
5 e 30
Hope this helps.
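Since the question asks for subtotals summed over all dates, another option is to collapse the date columns into one row total first and then aggregate on the three grouping columns. The following is only a sketch: the data frame crime and its values are made up for illustration, and id_cols and date_cols are hypothetical helper names. It assumes every column other than LGA, off.cat and sub.cat is a numeric date column.

```r
# Hypothetical miniature of the crime data (values invented for illustration)
crime <- data.frame(
  LGA     = c("Albury", "Albury", "Albury", "Ballina"),
  off.cat = c("Assault", "Assault", "Homicide", "Assault"),
  sub.cat = c("Assault Police", "Assault Police", "Murder", "Assault Police"),
  Jan1995 = c(12, 1, 0, 5),
  Feb1995 = c(3, 2, 0, 4)
)

# Every column that is not a grouping column is treated as a date column
id_cols   <- c("LGA", "off.cat", "sub.cat")
date_cols <- setdiff(names(crime), id_cols)

# Sum across all date columns row by row, then aggregate the row totals
crime$total <- rowSums(crime[date_cols])
subtotals <- aggregate(total ~ LGA + off.cat + sub.cat, data = crime, FUN = sum)
print(subtotals)
```

This avoids building a formula with 276 terms, because the date column names are never typed out.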

Related

Unable to Group and Sum Properly

I have data similar to this Sample Data:
   Cities Country Date    Cases
1  BE     A       2/12/20    12
2  BD     A       2/12/20   244
3  BF     A       2/12/20     1
4         V       2/12/20    13
5         Q       2/13/20     2
6         D       2/14/20     4
7  GH     N       2/15/20     6
8  DA     N       2/15/20   624
9  AG     J       2/15/20   204
10 FS     U       2/16/20   433
11 FR     U       2/16/20    38
I want to group the data by date and country and then sum each country's daily cases. However, when I try something like the following, it returns only the overall total:
my_data %>%
group_by(Country, Date)%>%
summarize(Cases=sum(Cases))
Your summarize function is likely being called from another package (plyr?). Try calling dplyr::summarize explicitly, like this:
my_data %>%
group_by(Country, Date)%>%
dplyr::summarize(Cases=sum(Cases))
# A tibble: 7 x 3
# Groups: Country [7]
Country Date Cases
<fct> <fct> <int>
1 A 2/12/20 257
2 D 2/14/20 4
3 J 2/15/20 204
4 N 2/15/20 630
5 Q 2/13/20 2
6 U 2/16/20 471
7 V 2/12/20 13
I sympathize: this can be very frustrating. I have gotten into the habit of always writing dplyr::select, dplyr::filter and dplyr::summarize. Otherwise you spend needless time wondering why your code isn't working.
We can also use aggregate:
aggregate(Cases ~ Country + Date, my_data, sum)

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so that every specimen of a species gets the same grade. The problem I'm having is that some rows belonging to the same species end up with different grades, especially in the case of grades "C" and "E". What I want to incorporate into my ifelse function is: whenever specimens of the same species are assigned "C" in one row and "E" in another, change the "C" to "E". If one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% operator and plain "==".
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Neither option worked. The desired result is that if one occurrence of a species name has the grade "E" assigned to it, so should all other occurrences of that species name.
I'm sorry if this was confusing, but I tried to be as clear as I could. Thank you in advance for any responses.
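Kind of a long-winded answer, but:
library(dplyr)
dat <- data.frame(species = c('a','b','c','a','a','b'), grade = c('E','E','C','C','C','D'))
# count the "E" grades per species, then join that count back onto every row
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade == 'E')), by = 'species')
Then you could use ifelse on sum_e > 0 to set the grade to "E" for every row of a species that has at least one "E".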
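The same idea can be sketched in base R without the join, by flagging every row whose species appears somewhere with grade "E". This is an illustration on the toy data above, not the asker's full grading function; has_e and grade_fixed are hypothetical names, and the sketch assumes that only "C" rows should be promoted (a "D" row of the same species is left alone).

```r
# Toy data from the answer above
dat <- data.frame(
  species = c("a", "b", "c", "a", "a", "b"),
  grade   = c("E", "E", "C", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose species has at least one "E" anywhere in the data
has_e <- dat$species %in% dat$species[dat$grade == "E"]

# Promote "C" to "E" only for those species; leave every other grade untouched
dat$grade_fixed <- ifelse(has_e & dat$grade == "C", "E", dat$grade)
print(dat)
```

Running this as a second step after the nested ifelse has filled in grade sidesteps the problem in the original attempts, where fish19 was referenced inside the mutate call that was still creating it.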

Calculating distance between two variables and generating new variable

I would like to create a variable called spill, defined for each row as the sum of the distances between that row's vector and each other row's vector, with each distance multiplied by the other row's stock value. For example, consider
firm us euro asia africa stock year
A 1 4 3 5 46 2001
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
I would like to create a vector that takes the distance between each pair of firms at time t and generates the spill variable. For example, for firm A in 2001, the cosine distance between A and B (comparing their investments in us, euro, asia and africa, i.e. (1,4,3,5) and (2,3,1,1)) is 0.2045883, which is multiplied by B's stock of 343. Likewise, the distance between A and C in 2001 is 0.1052075, which is multiplied by C's stock of 345. Hence the spill variable for firm A in 2001 is 0.2045883 * 343 + 0.1052075 * 345 = 106.4704.
I want to get a table including spill like this
firm us euro asia africa stock year spill
A 1 4 3 5 46 2001 106.4704
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
Can anyone please advise?
Here is the Stata code: https://www.statalist.org/forums/forum/general-stata-discussion/general/1409182-calculating-distance-between-two-variables-and-generating-new-variable. I have about 3,000 firms and 30 years. It runs well but very slowly.
dt <- data.frame(id=c("A","A","B","B","C"),us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
Given the minimal info on how to calculate the similarity distance, I've used a formula from "Find cosine similarity between two arrays", which may return different numbers than yours but should give the same resulting info.
I split the data by year so we can compare the unique ids. I take those individual lists and use lapply to run a for loop comparing all possibilities.
dt <- data.frame(id=c("A","A","B","B","C"), us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
geo <- c("us","euro","asia","africa")
s <- lapply(split(dt, dt$year), function(a) {
n <- nrow(a)
for(i in 1:n){
csim <- rep(0, n) # reset results of cosine similarity *stock vector
for(j in 1:n){
x <- unlist(a[i,geo])
y <- unlist(a[j,geo])
csim[j] <- (1-(x %*% y / sqrt(x%*%x * y%*%y)))*a[j,"stock"]
}
a$spill[i] <- sum(csim)
}
a
})
do.call(rbind, s)
# id us euro asia africa stock year spill
#2001.1 A 1 4 3 5 46 2001 106.47039
#2001.3 B 2 3 1 1 343 2001 77.93231
#2001.5 C 1 3 4 2 345 2001 72.96357
#2002.2 A 2 0 1 3 889 2002 12.28571
#2002.4 B 0 2 1 3 43 2002 254.00000
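Because the question mentions about 3,000 firms, the double for loop may become slow. The pairwise distances within a year can instead be computed with matrix algebra. The following is a sketch under the same cosine-distance definition as the loop above; spill_by_year is a hypothetical helper name, and the code assumes no firm has an all-zero investment vector (that would give a zero norm and NaN results).

```r
dt <- data.frame(id = c("A", "A", "B", "B", "C"),
                 us = c(1, 2, 2, 0, 1), euro = c(4, 0, 3, 2, 3),
                 asia = c(3, 1, 1, 1, 4), africa = c(5, 3, 1, 3, 2),
                 stock = c(46, 889, 343, 43, 345),
                 year = c(2001, 2002, 2001, 2002, 2001))
geo <- c("us", "euro", "asia", "africa")

# Hypothetical helper: spill for all firms of one year at once
spill_by_year <- function(d, geo) {
  m <- as.matrix(d[geo])                        # firms x regions
  norms <- sqrt(rowSums(m^2))                   # Euclidean norm of each firm's vector
  cos_sim <- (m %*% t(m)) / (norms %o% norms)   # all pairwise cosine similarities
  cos_dist <- 1 - cos_sim                       # the diagonal (firm vs itself) is ~0
  d$spill <- as.vector(cos_dist %*% d$stock)    # distance-weighted sum of stocks
  d
}

res <- do.call(rbind, lapply(split(dt, dt$year), spill_by_year, geo = geo))
print(res)
```

On the sample data this reproduces the spill values shown above. Note that the pairwise matrix has one row and one column per firm in a year, so memory use grows with the square of the number of firms.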

How to express a variable as a function of 2 others in a dataframe composed of 3 vectors

I know it is fundamental, but I can't find the trick...
Here is an example:
Species <- c("dark frog",rep(c("elephant","tiger","boa"),3),"black mamba")
Year <- c(rep(2011,4),rep(2012,3),rep(2013,4))
Abundance <- c(2,4,5,6,9,2,1,5,6,8,4)
df <- data.frame(Species, Year, Abundance)
I would like to obtain another dataframe (3 rows * 5 columns) holding the abundance values, with the species as the column names (each species appearing only once) and the years as the row names (each also appearing only once).
Could someone help me, please?
You mean something like this?
> xtabs(Abundance~Year+Species, data=df)
Species
Year black mamba boa dark frog elephant tiger
2011 0 6 2 4 5
2012 0 1 0 9 2
2013 4 8 0 5 6
The class for the above is a table, so if you prefer a data.frame instead, you can try:
library(tidyr)
new.df<- spread(df, key = Species, value = Abundance)
Year black mamba boa dark frog elephant tiger
1 2011 NA 6 2 4 5
2 2012 NA 1 NA 9 2
3 2013 4 8 NA 5 6
If you want 0s instead of NA add the following line:
new.df[is.na(new.df)]<- 0
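If you would rather stay in base R, the xtabs result can also be converted directly, which keeps the years as row names as the question asked. This is a minimal sketch on the example data; wide is a hypothetical name. Note that combinations missing from the data come out as 0 rather than NA.

```r
Species <- c("dark frog", rep(c("elephant", "tiger", "boa"), 3), "black mamba")
Year <- c(rep(2011, 4), rep(2012, 3), rep(2013, 4))
Abundance <- c(2, 4, 5, 6, 9, 2, 1, 5, 6, 8, 4)
df <- data.frame(Species, Year, Abundance)

# Cross-tabulate, then turn the 2-d table into a data frame:
# years become row names, species become columns (in alphabetical order)
wide <- as.data.frame.matrix(xtabs(Abundance ~ Year + Species, data = df))
print(wide)
```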

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
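Your aggregate is almost correct. Passing the whole data frame makes sum run on the character column country_txt as well, so pass only the columns you want summed: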
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
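For completeness, aggregate also has a formula interface in which cbind() on the left-hand side names the columns to sum. A small sketch on the example table from the question follows; the data frame GT is a hypothetical stand-in for GT2Omit, typed in by hand.

```r
# Hypothetical stand-in for GT2Omit, using the values from the question
GT <- data.frame(
  iyear       = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1          = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2          = c(3, 3, 5, 3, 5, 2, 1, 5)
)

# cbind() lists the columns to sum; the right-hand side lists the grouping columns
agg <- aggregate(cbind(V1, V2) ~ iyear + country_txt, data = GT, FUN = sum)
print(agg)
```

With many value columns, `. ~ iyear + country_txt` sums every column not named on the right-hand side, provided all of them are numeric.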
