I have a dataset that shows each bank's investments and the dollar value associated with each investment. The inv and amt variables run from 1 to 43, so currently the data looks like this:
bankid year location inv1 amt1 inv2 amt2 ... inv43 amt43
1 1990 NYC AIG 2000 GM 4000 Ford 6000
but I want the data to look like this
bankid year location inv number amt
1 1990 NYC AIG 1 2000
1 1990 NYC GM 2 4000
...
1 1990 NYC Ford 43 6000
In Stata, I would use this code
reshape long inv amt, i(bankid location year) j(number)
What would be the equivalent code in R?
reshape can do this. Here I am using the posted subset of your data, where you have time variables 1, 2, and 43:
x <- read.table(header=TRUE, text='bankid year location inv1 amt1 inv2 amt2 inv43 amt43
1 1990 NYC AIG 2000 GM 4000 Ford 6000 ')
x
## bankid year location inv1 amt1 inv2 amt2 inv43 amt43
## 1 1 1990 NYC AIG 2000 GM 4000 Ford 6000
v <- outer(c('inv', 'amt'), c(1,2,43), FUN=paste0)
v
## [,1] [,2] [,3]
## [1,] "inv1" "inv2" "inv43"
## [2,] "amt1" "amt2" "amt43"
reshape(x, direction='long', varying=c(v), sep='')
## bankid year location time inv amt id
## 1.1 1 1990 NYC 1 AIG 2000 1
## 1.2 1 1990 NYC 2 GM 4000 1
## 1.43 1 1990 NYC 43 Ford 6000 1
For your full table, the varying argument would be c(outer(c('inv', 'amt'), 1:43, FUN=paste0)) (but that won't work for the small example, as columns are missing).
Here, reshape infers the 'time' variable by inspecting the varying argument and finding common elements (inv and amt) on the left, and other elements on the right (1, 2, and 43). The sep argument says that there is no separator character (default sep character is .).
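For example, on the full table the call would look like this (a sketch, assuming your full data frame is named x and all of inv1-inv43 and amt1-amt43 are present):
reshape(x, direction = 'long',
        varying = c(outer(c('inv', 'amt'), 1:43, FUN = paste0)),
        sep = '')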
I guess my question is a little strange; let me try to explain it. I need to solve a simple equation over a longitudinal database (29 consecutive years) on food availability and international commerce: (imports - exports) / (production + imports - exports) * 100 [the equation for the food dependence coefficient, per FAO]. The big problem is that my database has the food products and their values of interest (production, imports, and exports) disaggregated, so I need to find a way to apply that equation to the sum of the values of interest for every year, so that I can get the coefficient for each year.
My data frame looks like this:
element product year value (metric tons)
Production Wheat 1990 16
Importation Wheat 1990 2
Exportation Wheat 1990 1
Production Apples 1990 80
Importation Apples 1990 0
Exportation Apples 1990 72
Production Wheat 1991 12
Importation Wheat 1991 20
Exportation Wheat 1991 0
I guess the solution is pretty simple, but I am not good enough at R to solve this problem by myself. Any help is very welcome.
Thanks!
require(data.table)
# dummy table. Use setDT(df) if yours isn't a data table already
df <- data.table(element = (rep(c('p', 'i', 'e'), 3))
, product = (rep(c('w', 'a', 'w'), each=3))
, year = rep(c(1990, 1991), c(6,3))
, value = c(16,2,1,80,0,72,12,20,0)
); df
element product year value
1: p w 1990 16
2: i w 1990 2
3: e w 1990 1
4: p a 1990 80
5: i a 1990 0
6: e a 1990 72
7: p w 1991 12
8: i w 1991 20
9: e w 1991 0
# long to wide
df_1 <- dcast(df
, product + year ~ element
, value.var = 'value'
); df_1
# apply calculation
df_1[, food_depend_coef := (i-e) / (p+i-e)*100][]
product year e i p food_depend_coef
1: a 1990 72 0 80 -900.000000
2: w 1990 1 2 16 5.882353
3: w 1991 0 20 12 62.500000
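Since you said you need the coefficient for every year summed over all products, a small variation of the same approach (my sketch; df_2 is just an illustrative name) is to let dcast do the summing by dropping product from the formula and passing fun.aggregate = sum:
df_2 <- dcast(df
, year ~ element
, value.var = 'value'
, fun.aggregate = sum
); df_2
# same calculation as above, now one row per year
df_2[, food_depend_coef := (i-e) / (p+i-e)*100][]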
I would like to create a variable called spill, given as the sum of the distances between the vectors of each pair of rows, each multiplied by the stock value. For example, consider
firm us euro asia africa stock year
A 1 4 3 5 46 2001
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
I would like to create a vector that takes the distance between two firms at time t and generates the spill variable. For example, for firm A in the year 2001 the components are 0.2045883 (the cosine distance between firms A and B in 2001, i.e. between (1,4,3,5) and (2,3,1,1), the similarity between their investments in us, euro, asia, and africa) multiplied by 343, and then the distance between A and C in 2001, 0.1052075, multiplied by 345. Hence the spill variable is 0.2045883 * 343 + 0.1052075 * 345 = 106.4704 for firm A in the year 2001.
I want to get a table including spill like this
firm us euro asia africa stock year spill
A 1 4 3 5 46 2001 106.4704
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
Can anyone please advise? Here is the Stata code for this [https://www.statalist.org/forums/forum/general-stata-discussion/general/1409182-calculating-distance-between-two-variables-and-generating-new-variable]. I have about 3,000 firms and 30 years; it runs correctly but very slowly.
dt <- data.frame(id=c("A","A","B","B","C"),us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
Given the minimal info on how to calculate the similarity distance, I've used a formula from "Find cosine similarity between two arrays", which will return different numbers than yours but should give the same resulting info.
I split the data by year so we can compare the unique ids, then take those individual lists and use lapply to run a for loop comparing all possibilities.
dt <- data.frame(id=c("A","A","B","B","C"), us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
geo <- c("us","euro","asia","africa")
s <- lapply(split(dt, dt$year), function(a) {
  n <- nrow(a)
  for (i in 1:n) {
    csim <- rep(0, n) # reset results of cosine distance * stock vector
    for (j in 1:n) {
      x <- unlist(a[i, geo])
      y <- unlist(a[j, geo])
      # cosine distance (1 - cosine similarity) weighted by firm j's stock
      csim[j] <- (1 - (x %*% y / sqrt(x %*% x * y %*% y))) * a[j, "stock"]
    }
    a$spill[i] <- sum(csim) # sum over all other firms in the same year
  }
  a
})
do.call(rbind, s)
# id us euro asia africa stock year spill
#2001.1 A 1 4 3 5 46 2001 106.47039
#2001.3 B 2 3 1 1 343 2001 77.93231
#2001.5 C 1 3 4 2 345 2001 72.96357
#2002.2 A 2 0 1 3 889 2002 12.28571
#2002.4 B 0 2 1 3 43 2002 254.00000
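If the double for loop is still too slow with ~3,000 firms per year, here is a vectorized sketch of the same computation (my suggestion, not part of the original answer): build the whole cosine-similarity matrix for each year with matrix products instead of looping over pairs.
s2 <- lapply(split(dt, dt$year), function(a) {
  m <- as.matrix(a[, geo])
  # cosine similarity of every pair of rows: m %*% t(m), scaled by the row norms
  sim <- tcrossprod(m) / tcrossprod(sqrt(rowSums(m^2)))
  # cosine distance (1 - similarity) weighted by each firm's stock, summed per row
  a$spill <- as.vector((1 - sim) %*% a$stock)
  a
})
do.call(rbind, s2)
The diagonal contributes nothing because each firm's similarity with itself is 1, so this matches the loop version.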
I have a dataset and I want to perform something like SQL's GROUP BY ROLLUP to get aggregate values.
Below is a reproducible example. I know aggregate works really well, as explained here, but it is not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year-month-region) and append the aggregate rows to the existing data frame.
E.g., there should be two additional rows like below, with the new region name 'USA' for the aggregated rows:
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below), but I am quite sure there is a more optimal solution or a better workaround than mine:
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
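If you would rather see "USA" than "(all)" in the region column, one extra line (my addition) does it:
dout$region <- sub("(all)", "USA", as.character(dout$region), fixed = TRUE)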
In the recent development version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can substitute "USA" for the NAs using res[is.na(region), region := "USA"].
If you only need the per-group sums (without the 'USA' subtotal rows), plyr works too:
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then by country, with the sums of variable 1 and variable 2 coming afterwards, so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried, to no avail (the real data frame is 125,000 rows by 30+ columns, hence the subset; please be kind, I'm new to R!):
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
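For instance, here is a rough sketch of the same aggregation with by() (just for illustration, using the column names from your question):
aggGT3 <- do.call(rbind, by(GT2Omit, GT2Omit[c("iyear", "country_txt")],
    function(d) data.frame(iyear = d$iyear[1], country_txt = d$country_txt[1],
                           V1 = sum(d$V1), V2 = sum(d$V2))))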
I currently have a data frame of ~83,000 rows (13 columns) with crime data from the years 2000-2012. Each row is a crime, and the zip code is reported (so zip code xxxxx can be found in, for example, the years 2001, 2003, and 2007).
Here is an example of my data:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Right now, I am able to assign a global variable to each of my zip codes (I am using RStudio, and the list of data shown is very long, which has significantly slowed the program).
Here is how I have assigned global variables to all of my zip codes:
for (n in all.data$Zip) {
x <- subset(all.data, n == all.data$Zip) #subsets the data
u <- x[1,3] #gets the zip code value
assign(paste0("Zip", u), x, envir = .GlobalEnv) #assigns it to a global environment
#need something here, MasterList <<- ?
}
I would like to collect all of these variables in a list. For example, if all my zip code variables were stored in a list called "MasterList":
MasterList["Zip11111"]
would yield the data frame:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Is this possible? What would be an alternative/faster/better way to do this? I was hoping that storing these variables in a list would be more efficient.
Bonus points: I know that in my for loop I am reassigning variables that already exist to exactly the same thing, wasting processing time. Is there any quick line I could add to speed this up?
Thanks in advance for your help!
With only base R:
> dat <- read.table(text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
+ 2000 1 99502 1 3 5 2 9479
+ 2009 2 99502 2 3 4 3 3220
+ 2000 1 11111 1 3 5 2 3479
+ 2004 2 11111 2 3 4 3 1020", header = TRUE, sep = "")
> dats <- split(dat,dat$Zip)
> dats
$`11111`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$`99502`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
> names(dats) <- paste0('Zip',names(dats))
> dats
$Zip11111
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$Zip99502
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
You could change for (n in all.data$Zip) to for (n in unique(all.data$Zip)); that would cut down on the redundancy. Why don't you make a list before the loop, MasterList <- list(), and then add to the list with
MasterList[[paste0("Zip", n)]] <- x
Yes, I used n for the zip code number, because n is assigned each value in the vector you give it (in your case all.data$Zip, in mine unique(all.data$Zip)).
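Putting those two suggestions together, a minimal sketch (assuming your data frame is all.data, as in your question):
MasterList <- list()
for (n in unique(all.data$Zip)) {
  MasterList[[paste0("Zip", n)]] <- subset(all.data, all.data$Zip == n)
}
MasterList[["Zip11111"]] # returns the data frame for that zip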
Probably the easiest way to make your list is using plyr's dlply function, like so:
> set.seed(2)
> dat <- data.frame(zip=as.factor(sample(11111:22222,1000,replace=T)),var1=rnorm(1000),var2=rnorm(1000))
> head(dat)
zip var1 var2
1 13165 -0.4597894 -0.84724423
2 18915 0.6179261 0.07042928
3 17481 -0.7204224 1.58119491
4 12978 -0.5835119 0.02059799
5 21598 0.2163245 -0.12337051
6 21594 1.2449912 -1.25737890
> library(plyr)
> MasterList <- dlply(dat,.(zip))
> MasterList[["13165"]]
zip var1 var2
1 13165 -0.4597894 -0.8472442
However, it sounds like speed is your motivation here, and if so you'd probably be much better off not storing the data in a separate list object at all, but instead converting your data frame to a data.table:
> library(data.table)
> dat.dt <- data.table(dat)
> dat.dt[zip==13165]
zip var1 var2
1: 13165 -0.4597894 -0.8472442
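If you will be doing many of these lookups, you could also key the table on zip so data.table can use a fast binary search (my addition, not part of the original answer; zip is a factor here, so subset with a character value):
> setkey(dat.dt, zip)
> dat.dt["13165"]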