Calculating distance between two variables and generating new variable - r

I would like to create a variable called spill, defined for each firm-year as the sum, over all other firms in the same year, of the cosine distance between the two firms' investment vectors multiplied by the other firm's stock value. For example, consider
firm us euro asia africa stock year
A 1 4 3 5 46 2001
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
The spill variable takes the distance between two firms at time t, weighted by stock. For example, for Firm A in 2001: the cosine distance between the investment vectors of A and B in 2001, i.e. between (1,4,3,5) and (2,3,1,1) across us, euro, asia and africa, is 0.2045883, which is multiplied by B's stock of 343; likewise the distance between A and C in 2001 is 0.1052075, which is multiplied by C's stock of 345. Hence spill = 0.2045883 * 343 + 0.1052075 * 345 = 106.4704 for Firm A in 2001.
I want to get a table including spill like this
firm us euro asia africa stock year spill
A 1 4 3 5 46 2001 106.4704
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
Can anyone please advise?
There is Stata code for the same task here: https://www.statalist.org/forums/forum/general-stata-discussion/general/1409182-calculating-distance-between-two-variables-and-generating-new-variable. I have about 3,000 firms and 30 years; that code runs correctly but very slowly.
dt <- data.frame(id=c("A","A","B","B","C"),us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))

Given the minimal info on how to calculate the similarity distance, I've used a formula from "Find cosine similarity between two arrays", which may return different numbers than yours but should give the same resulting info.
I split the data by year so we can compare the unique ids, then use lapply over the yearly subsets to run a double for loop comparing all pairs of firms.
dt <- data.frame(id = c("A", "A", "B", "B", "C"),
                 us = c(1, 2, 2, 0, 1),
                 euro = c(4, 0, 3, 2, 3),
                 asia = c(3, 1, 1, 1, 4),
                 africa = c(5, 3, 1, 3, 2),
                 stock = c(46, 889, 343, 43, 345),
                 year = c(2001, 2002, 2001, 2002, 2001))
geo <- c("us", "euro", "asia", "africa")
s <- lapply(split(dt, dt$year), function(a) {
  n <- nrow(a)
  for (i in 1:n) {
    csim <- rep(0, n)  # reset results of the cosine distance * stock vector
    for (j in 1:n) {
      x <- unlist(a[i, geo])
      y <- unlist(a[j, geo])
      csim[j] <- (1 - (x %*% y / sqrt(x %*% x * y %*% y))) * a[j, "stock"]
    }
    a$spill[i] <- sum(csim)
  }
  a
})
do.call(rbind, s)
# id us euro asia africa stock year spill
#2001.1 A 1 4 3 5 46 2001 106.47039
#2001.3 B 2 3 1 1 343 2001 77.93231
#2001.5 C 1 3 4 2 345 2001 72.96357
#2002.2 A 2 0 1 3 889 2002 12.28571
#2002.4 B 0 2 1 3 43 2002 254.00000
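Since the concern is speed with ~3,000 firms per year, here is a vectorized sketch (my addition, not part of the original answer): for each year, build the full cosine-distance matrix with a single tcrossprod() call and compute every firm's spill at once, avoiding the R-level double loop. It reuses dt and geo from above.
spill_fast <- function(a) {
  m <- as.matrix(a[, geo])
  norms <- sqrt(rowSums(m^2))
  cos_dist <- 1 - tcrossprod(m) / outer(norms, norms)  # n x n cosine distances; diagonal is 0
  a$spill <- as.vector(cos_dist %*% a$stock)           # row i: sum over j of dist(i, j) * stock_j
  a
}
do.call(rbind, lapply(split(dt, dt$year), spill_fast))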

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are remeasured every couple of years. The data looks roughly like the table below. I used the following code to slice out the initial measurement at t1. Now I want to slice t2, the remeasurement whose Cycle is one step greater than the minimum Cycle (or minimum Measured_year). This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order. Rank with row_number() rather than order() (order() returns a permutation of row indices, not ranks), and build t on the full df rather than on df_Time1, which has already been sliced down to the first measurement:
df %>%
  group_by(State, County, Plot) %>%
  mutate(t = row_number(Cycle))
You can then filter on t == 1 or t == 2, etc.
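For example, a quick sketch of slicing the second measurement under those assumptions:
df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = row_number(Cycle)) %>%
  filter(t == 2)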

How can I sum values of one column based on the categories of another column, multiple times, in R?

I guess my question is a little strange; let me try to explain it. I need to solve a simple equation for a longitudinal database (29 consecutive years) about food availability and international commerce: (imports - exports) / (production + imports - exports) * 100, the FAO equation for the food dependence coefficient. The big problem is that my database has the food products and their values of interest (production, imports and exports) disaggregated, so I need to find a way to apply that equation to the sum of the values of interest for every year, so I can get the coefficient for each year.
My data frame looks like this:
element product year value (metric tons)
Production Wheat 1990 16
Importation Wheat 1990 2
Exportation Wheat 1990 1
Production Apples 1990 80
Importation Apples 1990 0
Exportation Apples 1990 72
Production Wheat 1991 12
Importation Wheat 1991 20
Exportation Wheat 1991 0
I guess the solution is pretty simple, but I'm not good enough in R to solve this problem by myself. Any help is very welcome.
Thanks!
require(data.table)
# dummy table. Use setDT(df) if yours isn't a data.table already
df <- data.table(element = rep(c('p', 'i', 'e'), 3),
                 product = rep(c('w', 'a', 'w'), each = 3),
                 year = rep(c(1990, 1991), c(6, 3)),
                 value = c(16, 2, 1, 80, 0, 72, 12, 20, 0)); df
element product year value
1: p w 1990 16
2: i w 1990 2
3: e w 1990 1
4: p a 1990 80
5: i a 1990 0
6: e a 1990 72
7: p w 1991 12
8: i w 1991 20
9: e w 1991 0
# long to wide
df_1 <- dcast(df, product + year ~ element, value.var = 'value'); df_1
# apply calculation
df_1[, food_depend_coef := (i - e) / (p + i - e) * 100][]
product year e i p food_depend_coef
1: a 1990 72 0 80 -900.000000
2: w 1990 1 2 16 5.882353
3: w 1991 0 20 12 62.500000
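Note that this gives one coefficient per product and year. Since the question asks for a single coefficient per year summed across all products, here is a sketch of that variant on the same dummy data (dcast() can sum during reshaping via its fun.aggregate argument):
df_2 <- dcast(df, year ~ element, value.var = 'value', fun.aggregate = sum)
df_2[, food_depend_coef := (i - e) / (p + i - e) * 100][]
#    year  e  i  p food_depend_coef
# 1: 1990 73  2 96           -284.0
# 2: 1991  0 20 12             62.5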

Euclidean distance for distinct classes of factors iterated by groups

*Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), the computation stalls for a very long time (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes, according to the following formula:
El_Dist = sqrt( sum_i (X_i - Y_i)^2 / (sum_i X_i^2 * sum_i Y_i^2) )
where X_i is the number of patents belonging to class i in year t, Y_i is the number of patents belonging to class i in the previous year (t - 1), and the sum runs over the distinct classes appearing in either year.
To further illustrate this, consider the following dataset:
df <- data.table(Firm = rep(LETTERS[1:2], each = 6),
                 Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
                 Patent_Number = sample(184785:194785, 12, replace = FALSE),
                 Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to 1991, the distinct classes in that year (1991) and the previous year (1990) are 31, 5, and 12, so the formula is summed over these three distinct classes (there are three distinct i's). The formula's output will be:
sqrt( ((1 - 0)^2 + (0 - 1)^2 + (0 - 1)^2) / (1^2 * (1^2 + 1^2)) ) = sqrt(3/2) = 1.2247449
Following the same calculation and iterating over the firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X) {
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i) {
    j <- which(Year %in% (Year[i] - 1:0))   # rows from year t and year t - 1
    tbl <- table(Year[j], PatentClass[j])   # class counts per year
    if (NROW(tbl) == 1) {
      NA_real_                              # no previous year available
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer / denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
          by = .(Firm),
          .SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
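Regarding the update about 7 million observations, a possible faster route (a sketch of my own, untested at that scale): collapse the data to one count per Firm/Year/Patent_Class first, so each distance is computed on a pair of short count vectors instead of the raw patent rows.
cnt <- df[, .N, by = .(Firm, Year, Patent_Class)]   # patents per firm, year and class
dist_one <- function(f, yr) {
  x <- cnt[Firm == f & Year == yr]                  # class counts in year t
  y <- cnt[Firm == f & Year == yr - 1]              # class counts in year t - 1
  if (nrow(y) == 0L) return(NA_real_)               # no previous year observed
  cls <- union(x$Patent_Class, y$Patent_Class)      # distinct classes of both years
  xv <- x$N[match(cls, x$Patent_Class)]; xv[is.na(xv)] <- 0
  yv <- y$N[match(cls, y$Patent_Class)]; yv[is.na(yv)] <- 0
  sqrt(sum((xv - yv)^2) / (sum(xv^2) * sum(yv^2)))
}
df[, El_Dist := dist_one(.BY$Firm, .BY$Year), by = .(Firm, Year)]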

rowwise multiplication of two different dataframes dplyr

I have two dataframes, and I want to multiply one column of the first (pop$Population) with parts of the other, sometimes with the mean of one of its columns or of a subset (here, e.g., the mean of Df$energy). As I want my results per year, I additionally need to multiply by 365 (days). I need the results for each Year.
age <- c("6 Months", "9 Months", "12 Months")
energy <- c(2.5, NA, 2.9)
Df <- data.frame(age, energy)
Age <- 1
Year <- c(1990, 1991, 1993, 1994)
Population <- c(200, 300, 400, 250)
pop <- data.frame(Age, Year, Population)
pop:
Age Year Population
1 1 1990 200
2 1 1991 300
3 1 1993 400
4 1 1994 250
Df:
age energy
1 6 Months 2.5
2 9 Months NA
3 12 Months 2.9
My attempt was the following, but I got an error:
pop$energy<-pop$Population%>%
rowwise()%>%
transmute("energy_year"= .%*% mean(Df$energy, na.rm = T)%*%365)
Error in UseMethod("rowwise") :
no applicable method for 'rowwise' applied to an object of class "c('double', 'numeric')"
I want the result to be a dataframe like this:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
pop$Population is a vector, not a data frame, hence the error.
For your use case the simplest thing to do would be:
pop %>% mutate(energy_year= Population * mean(Df$energy, na.rm = T) * 365)
This will give you the output:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
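For completeness, the same computation works in base R without any rowwise machinery (same objects as above):
pop$energy_year <- pop$Population * mean(Df$energy, na.rm = TRUE) * 365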

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data, grouping it first by year and then by country, with the sums of Variable 1 and Variable 2 following, so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried, to no avail (the real dataframe is 125,000 rows by 30+ columns, hence the subset. Please be kind, I'm new to R!):
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[, 2] <- as.character(GT2[, 2])
GT2[, 3] <- as.numeric(as.character(GT2[, 3]))
GT2[, 4] <- as.numeric(as.character(GT2[, 4]))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <- aggregate(GT2Omit[3:4], by = GT2Omit[c("country_txt", "iyear")], FUN = sum, na.rm = TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
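On the example data this should return the same sums as the aggregate() call, roughly printed as:
#   iyear country_txt   sv1   sv2
# 1  1970 UK              1     3
# 2  1970 USA             1     3
# 3  1971 UK              5    13
# 4  1971 USA             2     2
# 5  1972 USA             3     6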
Having said that, it is good to learn basic R functions like aggregate and by.
