Interpretation of negative index when subsetting a data.frame [duplicate] - r

I am very new to R and sometimes get stuck reading code. I came across the line below. What does -7 mean in it?
round(cor(longley[,-7]),3)
I understand:
round for rounding,
longley as data.frame,
3: digits for rounding, but not the -7.

In the context [, -7] it means drop the 7th column from the data frame longley (or take all columns but the 7th from longley).
This is R 101 and you'd do well to read some introductory material. For example, this is covered very early on in the An Introduction to R manual that comes with R or is accessible from the R website. Or you could read ?Extract.
Here is an example
> head(longley)
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
> names(longley)
[1] "GNP.deflator" "GNP" "Unemployed" "Armed.Forces" "Population"
[6] "Year" "Employed"
> names(longley)[7]
[1] "Employed"
> head(longley[, -7])
GNP.deflator GNP Unemployed Armed.Forces Population Year
1947 83.0 234.289 235.6 159.0 107.608 1947
1948 88.5 259.426 232.5 145.6 108.632 1948
1949 88.2 258.054 368.2 161.6 109.773 1949
1950 89.5 284.599 335.1 165.0 110.929 1950
1951 96.2 328.975 209.9 309.9 112.075 1951
1952 98.1 346.999 193.2 359.4 113.270 1952

The command longley[,-7] means: All columns from longley except the 7th. This is called negative indexing.
Have a look at ?Extract for further information.
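As a minimal sketch of the rule described above: the vector x and the names no_employed and by_name are made up for illustration, while longley ships with R.

```r
# Negative indices drop elements instead of selecting them.
x <- c(10, 20, 30, 40)
x[-2]          # everything except the 2nd element
x[-c(1, 4)]    # everything except the 1st and 4th elements

# On a data frame, the index after the comma refers to columns.
no_employed <- longley[, -7]      # all columns except the 7th ("Employed")
names(no_employed)

# Dropping by name needs a logical test, because -"Employed" is an error.
by_name <- longley[, names(longley) != "Employed"]
identical(no_employed, by_name)
```

Positive and negative indices cannot be mixed in one subscript, so something like x[c(-1, 2)] raises an error.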


Joining two dataframes to plot a map with ggplot2

I want to make a world map visualization using a data frame that looks like this:
Country Year Sex Age Suicides Population Suicides_per_100k Country_Year HDI/Year Year_GDP
1 Albania 1987 Male 15-24 years 21 312900 6.71 Albania1987 NA 2156624900
2 Albania 1987 Male 35-54 years 16 308000 5.19 Albania1987 NA 2156624900
3 Albania 1987 Female 15-24 years 14 289700 4.83 Albania1987 NA 2156624900
4 Albania 1987 Male 75+ years 1 21800 4.59 Albania1987 NA 2156624900
5 Albania 1987 Male 25-34 years 9 274300 3.28 Albania1987 NA 2156624900
6 Albania 1987 Female 75+ years 1 35600 2.81 Albania1987 NA 2156624900
GDP_Per_Capita Generation Continent
1 796 Generation X Europe
2 796 Silent Europe
3 796 Generation X Europe
4 796 G.I. Generation Europe
5 796 Boomers Europe
6 796 G.I. Generation Europe
I tried to use the following code:
world <- ggplot2::map_data('world')
worldstart <- left_join(df, world, by = c("Country" = "region"))
This code created a new data frame with 14 million observations.
However, I'd like to keep the same number of rows as in the original data frame "df".
What is the best approach?
Indeed, the map_data function returns one row for each point of each multipolygon in the world (~10k rows), so every row of your data is matched with every outline point of its country. As mentioned earlier, you cannot choose which point to keep.
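The row explosion can be reproduced with base R alone. The sketch below uses made-up toy data (the coordinates and the names outline, centers, joined and joined_one are hypothetical) and merge() in place of left_join().

```r
# Hypothetical toy data: two data rows for one country ...
df <- data.frame(Country = c("Albania", "Albania"),
                 Suicides = c(21, 16))
# ... and a point table with three made-up outline points for that country.
outline <- data.frame(region = rep("Albania", 3),
                      long = c(19.3, 20.1, 20.6),
                      lat  = c(40.7, 41.5, 42.0))

# Every data row is paired with every outline point of its country.
joined <- merge(df, outline, by.x = "Country", by.y = "region", all.x = TRUE)
nrow(joined)       # 2 data rows x 3 points = 6 rows

# A lookup table with one row per country keeps the row count of df.
centers <- aggregate(cbind(long, lat) ~ region, data = outline, FUN = mean)
joined_one <- merge(df, centers, by.x = "Country", by.y = "region", all.x = TRUE)
nrow(joined_one)   # 2 rows, same as df
```

Averaging the points only serves to show the one-row-per-country idea; it does not give something you can draw as a map outline.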
You can use the sf library to work around this difficulty, keeping the geometry (here multipolygons) on one side and your data on the other.
My proposal would be the following:
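library(dplyr)
library(sf)
library(ggplot2)
# Toy data with a single row
df <- tibble(Country = "Albania",
             GDP_per_Capita = 796)
# Convert the map to an sf object; the polygons are stored in the geometry column
world <- maps::map('world', plot = FALSE, fill = TRUE) %>% st_as_sf(stringsAsFactors = FALSE)
world_df <- df %>%
  left_join(world, by = c("Country" = "ID"))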
In my example, you have only one row of data, but the geometry column contains all the information needed for plotting.
sf and ggplot2 are well linked so you are good to go.
Best regards

R: precipitation data seasonal (DJFM) sums for each station

I have a data.frame "n_com", which includes columns for "year" (1951-2010), "month" (1,2,3,12) and 81 further value-columns for monthly precipitation sums of 81 weather-stations.
Jahr Monat 12_NS_Monat 13_NS_Monat 14_NS_Monat 15_NS_Monat 16_NS_Monat
1 1951 1 397 2045 1447 2666 236
2 1951 2 528 1043 464 1397 202
3 1951 3 819 480 953 1634 665
4 1951 12 363 252 881 610 350
5 1952 1 391 530 557 1321 339
6 1952 2 683 684 920 1125 805
Now I need the seasonal sums per year over the months December, January, February and March (DJFM) for each station. Each seasonal sum should use December of the previous year, while the other months come from the current year.
(E.g., the seasonal sum for 1956 includes the December data of 1955, while the other months are from 1956.)
Finally, I want a data.frame with the following columns: "year", "station 1", "station 2" and so on.
It seems that the function "dm2seasonal" of the package "hydroTSM" is the right one for creating seasonal sums. My problem is that "hydroTSM" needs the data.frame in a specific layout (long format), but my data.frame is in wide format. Can anyone help me format my data for "hydroTSM", or suggest another way to compute the seasonal sums?
Greetings from Germany
More a hack than a solution, but you could probably just add 1 to the 'year' column for all rows with month == 12:
n_com$yeartemp = n_com$year
n_com$yeartemp[n_com$month == 12] = n_com$year[n_com$month == 12] + 1
To change column names, see 'names'.
Then, to change to long format, you can use melt from the reshape2 package, using yeartemp as the id variable.
Hth.
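If the long format is not needed for anything else, the shifted year can also be used directly for grouping. The following is only a sketch in base R, assuming the columns are called year and month as in the answer; the station names st12 and st13 are hypothetical, and the values are taken from the first two stations of the excerpt in the question.

```r
# Toy data in the wide layout of the question (two stations only).
n_com <- data.frame(
  year  = c(1951, 1951, 1951, 1951, 1952, 1952),
  month = c(1, 2, 3, 12, 1, 2),
  st12  = c(397, 528, 819, 363, 391, 683),
  st13  = c(2045, 1043, 480, 252, 530, 684)
)

# December counts towards the season of the following year.
n_com$season_year <- ifelse(n_com$month == 12, n_com$year + 1, n_com$year)

# Sum every station column within each season year.
station_cols <- c("st12", "st13")
djfm <- aggregate(n_com[station_cols],
                  by = list(year = n_com$season_year),
                  FUN = sum)
djfm
```

The first season year in the result (1951 here) has no preceding December in the data, and the last December ends up in a season year of its own, so both of those rows are incomplete and may need to be dropped.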

decompose() for yearly time series in R

I'm trying to analyse time series data of inflation rates from 1960 to 2015. The dataset is a yearly time series over 56 years with one real value per year, as follows:
Year Inflation percentage
1960 1.783264746
1961 1.752021563
1962 3.57615894
1963 2.941176471
1964 13.35403727
1965 9.479452055
1966 10.81081081
1967 13.0532972
1968 2.996404315
1969 0.574712644
1970 5.095238095
1971 3.081105573
1972 6.461538462
1973 16.92815855
1974 28.60169492
1975 5.738605162
1976 -7.63438068
1977 8.321619342
1978 2.517518817
1979 6.253164557
1980 11.3652609
1981 13.11510484
1982 7.887270664
1983 11.86886396
1984 8.32157969
1985 5.555555556
1986 8.730811404
1987 8.798689021
1988 9.384775808
1989 3.26256011
1990 8.971233545
1991 13.87024609
1992 11.78781925
1993 6.362038664
1994 10.21150033
1995 10.22488756
1996 8.977149075
1997 7.16425362
1998 13.2308409
1999 4.669821024
2000 4.009433962
2001 3.684807256
2002 4.392199745
2003 3.805865922
2004 3.76723848
2005 4.246353323
2006 6.145522388
2007 6.369996746
2008 8.351816444
2009 10.87739112
2010 11.99229692
2011 8.857845297
2012 9.312445605
2013 10.90764331
2014 6.353194544
2015 5.872426595
'stock1' contains my data where the first column stands for Year, and the second for 'Inflation.percentage', as follows:
stock1<-read.csv("India-Inflation time series.csv", header=TRUE, stringsAsFactors=FALSE, as.is=TRUE)
The following is my code for creating the time series object:
stock <- ts(stock1$Inflation.percentage,start=(1960), end=(2015),frequency=1)
Following this, I am trying to decompose the time series object 'stock' using the following line of code:
decom_add <- (decompose(stock, type ="additive"))
Here I get an error:
Error in decompose(stock, type = "additive") : time series has no
or less than 2 periods
Why is this so? I initially thought it has something to do with frequency, but since the data is annual, the frequency has to be 1 right? If it is 1, then aren't there definitely more than 2 periods in the data?
Why isn't decompose() working? What am I doing wrong?
Thanks a lot in advance!
Try frequency=2: decompose() needs a frequency greater than 1, because with one observation per period there is no seasonal component to estimate. Since changing the frequency also changes your model, the better way in my view is to load data that contains a month column as well, so that the frequency is 12.
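A small sketch of this behaviour follows. The annual values are the first eight inflation rates from the question, rounded to two decimals; AirPassengers is a monthly series that ships with R and only stands in for "data with a month column". The moving average at the end is one possible way to get a trend from annual data, not something the answer above proposes.

```r
# Annual series: one observation per period, so frequency = 1.
annual <- ts(c(1.78, 1.75, 3.58, 2.94, 13.35, 9.48, 10.81, 13.05),
             start = 1960, frequency = 1)

# decompose() stops with an error here; capture the message instead of failing.
msg <- tryCatch(decompose(annual, type = "additive"),
                error = function(e) conditionMessage(e))
msg

# A monthly series (frequency = 12) with several full years can be decomposed.
dec <- decompose(AirPassengers, type = "additive")
class(dec)

# For annual data, a centred 3-year moving average gives a simple trend estimate.
trend <- stats::filter(annual, rep(1 / 3, 3), sides = 2)
trend
```

Setting frequency=2 on annual data would make decompose() run, but it would treat odd and even years as two "seasons", which is rarely what is meant.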

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, with each row corresponding to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
                   "IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
             direction = "wide",
             idvar = "CountryName",
             timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
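For readers without the original file, here is a self-contained sketch of the reshape() approach on the four rows shown in the question. The shortened indicator names and the sub() call that strips the "X2010." prefix are additions for illustration, not part of the answer above.

```r
# The four rows from the question, with cleaned-up column names.
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  CountryCode   = c("TUR", "TUR", "AFG", "AFG"),
  IndicatorName = c("Inflation", "Unemployment", "Inflation", "Unemployment"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# Keep only the id, the "time" variable and the value, then go wide.
wide <- reshape(mydata[c("CountryName", "IndicatorCode", "X2010")],
                direction = "wide",
                idvar = "CountryName",
                timevar = "IndicatorCode")

# Optional: drop the "X2010." prefix that reshape() puts in front of each code.
names(wide) <- sub("^X2010\\.", "", names(wide))
wide
```

Unlike xtabs, this keeps the missing unemployment value for Afghanistan as NA.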

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate is made for:
Df <- data.frame(
  year = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1 = c(1:20, 51:70),
  var2 = c(20:1, 70:51)
)
Level <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)
id <- c("var1", "var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do, is to rename the categories and the variable names.
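If numbered periods like those in the question (1, 2, ...) are preferred over interval labels, integer division on the year is one possible alternative to cut. This is a sketch on the same toy data; the column names period, Avvar1 and Avvar2 are chosen here to mimic the desired output.

```r
Df <- data.frame(
  year = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1 = c(1:20, 51:70),
  var2 = c(20:1, 70:51)
)

# Period 1 covers the first five years, period 2 the next five, and so on.
Df$period <- (Df$year - min(Df$year)) %/% 5 + 1

# Average the value columns within each country and period.
avg <- aggregate(cbind(var1, var2) ~ country + period, data = Df, FUN = mean)
names(avg)[3:4] <- c("Avvar1", "Avvar2")
avg
```

With the 1951 to 2007 panel from the question this yields 12 periods, the last of which covers only 2006 and 2007.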
For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
  pwt6.3$country %in% c("Argentina", "Venezuela"),
  c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as the basis for a rolling 5-period mean
rollmean5 <- function(x){
  rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can get the left- or right-aligned moving mean instead by passing rollmean()'s align argument through the helper function.
EDIT:
@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
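The interval labels above are right-closed because that is cut's default, whereas the aggregate answer uses right = FALSE. A short sketch of the difference on a few made-up years (the names bins, years and both are hypothetical):

```r
years <- 1950:1956
breaks <- seq(1945, 1960, by = 5)

both <- data.frame(
  year         = years,
  right_closed = cut(years, breaks),                 # default: (a, b]
  left_closed  = cut(years, breaks, right = FALSE)   # [a, b)
)
both

# The bin number of each year under the two conventions.
bins <- data.frame(year = years,
                   right_closed = as.integer(both$right_closed),
                   left_closed  = as.integer(both$left_closed))
bins
```

With the default, 1950 falls into the same bin as 1946 to 1949 and 1951 to 1955 share a bin, which suits a panel that starts in 1951; with right = FALSE, 1950 to 1954 share a bin instead.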
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There are a lot of other answers that can show you how; see https://stackoverflow.com/questions/tagged/r+reshape
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
library(dplyr)
Df %>% mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
  group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53
