Deleting specific column/row values with if conditions - r

This is probably straightforward, but I am struggling big time.
I have a data frame with different industries between 1999 and 2019.
fyear industry employees
1 1999 Agriculture 132.260
2 2000 Agriculture 154.590
3 2001 Agriculture 147.725
4 2002 Agriculture 142.098
5 2003 Agriculture 77.169
6 2004 Agriculture 82.979
7 2005 Agriculture 99.625
8 2006 Agriculture 98.195
9 2007 Agriculture 95.193
10 2008 Agriculture 104.459
11 2009 Agriculture 182.930
12 2010 Agriculture 180.648
13 2011 Agriculture 173.408
14 2012 Agriculture 181.483
15 2013 Agriculture 109.842
16 2014 Agriculture 90.177
17 2015 Agriculture 92.067
18 2016 Agriculture 83.568
19 2017 Agriculture 70.251
20 2018 Agriculture 65.082
21 2019 Agriculture 82.754
22 1999 Aircraft 653.194
23 2000 Aircraft 692.918
24 2001 Aircraft 666.751
25 2002 Aircraft 633.565
26 2003 Aircraft 687.611
27 2004 Aircraft 701.827
28 2005 Aircraft 725.825
29 2006 Aircraft 751.171
30 2007 Aircraft 744.060
31 2008 Aircraft 750.319
32 2009 Aircraft 677.598
33 2010 Aircraft 690.605
34 2011 Aircraft 712.501
35 2012 Aircraft 716.985
36 2013 Aircraft 709.918
I am trying to create some growth variables:
df$employeegrowth <- df$employees / lag(df$employees) - 1
This naturally causes problems for every 1999 row, which I would like to replace with NA.
I am trying to solve this with an if statement:
df$employeegrowth <- if(df$fyear == "1999") {
df$employeegrowth <- "NA"
}
But this substitutes every value in the employee growth column with NA.
I do not want to delete the entire row as the other columns contain valuable information.
Could someone point me in the right direction on this?

Use lag by group:
library(dplyr)
df %>%
group_by(industry) %>%
mutate(employeegrowth = employees/lag(employees) - 1)
# fyear industry employees employeegrowth
# <int> <chr> <dbl> <dbl>
# 1 1999 Agriculture 132. NA
# 2 2000 Agriculture 155. 0.169
# 3 2001 Agriculture 148. -0.0444
# 4 2002 Agriculture 142. -0.0381
# 5 2003 Agriculture 77.2 -0.457
# 6 2004 Agriculture 83.0 0.0753
# 7 2005 Agriculture 99.6 0.201
# 8 2006 Agriculture 98.2 -0.0144
# 9 2007 Agriculture 95.2 -0.0306
#10 2008 Agriculture 104. 0.0973
# … with 26 more rows
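This gives NA for the first row of each industry (the 1999 row here), because lag() has no previous value within the group. It assumes the rows are sorted by fyear within each industry.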
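As a side note on why the original if() attempt fails: if() evaluates a single condition rather than one per row, and assigning "NA" in quotes stores a character string, not a missing value. Below is a minimal base R sketch of both the grouped growth calculation and a row-wise replacement. The toy data frame is made up for illustration, and the rows are assumed to be sorted by fyear within each industry.

```r
# Toy data (invented for illustration): two industries, three years each
df <- data.frame(
  fyear     = rep(1999:2001, times = 2),
  industry  = rep(c("Agriculture", "Aircraft"), each = 3),
  employees = c(100, 110, 121, 200, 190, 209)
)

# Growth within each industry: ave() applies FUN to each group separately.
# c(NA, head(x, -1)) is the previous value, i.e. a hand-rolled lag.
df$employeegrowth <- ave(
  df$employees, df$industry,
  FUN = function(x) x / c(NA, head(x, -1)) - 1
)

# Row-wise replacement with logical indexing instead of if():
# only rows where the condition is TRUE are overwritten.
# Note the unquoted NA; "NA" in quotes would be a character string.
df$employeegrowth[df$fyear == 1999] <- NA_real_

print(df)
```

With the grouped calculation the 1999 rows are already NA, so the last assignment is only there to show the indexing pattern the question was reaching for.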

Related

Rename/recode variable value in R based on condition using dplyr

I have a dataset dataExtended with the variables CountryOther and n, where n is the count of wines in that particular country. CountryOther is of type character and n is an integer. What I want to do is rename the values in CountryOther to Other whenever n <= 20. I would like to do it with the dplyr package, but I am not sure how, or whether to use mutate or mutate_at.
Since I wasn't able to write the condition as stated above, I tried to do it manually as follows, but it didn't work:
dataExtended$CountryOther <- dataExtended$Country
dataExtended %>%
mutate(CountryOther = recode(CountryOther,
China = "Other",
Mexico = "Other",
Slovakia = "Other",
Bulgaria = "Other",
Canada = "Other",
Croatia = "Other",
Uruguay = "Other",
Georgia = "Other",
Turkey = "Other",
Moldova = "Other",
Slovenia = "Other",
Hungary = "Other",
Switzerland = "Other",
Greece = "Other",
Israel = "Other",
Lebanon= "Other"))
Importing the Red.csv from your link with readr::read_csv() creates a data frame (tibble):
#> data
# A tibble: 8,666 × 8
Name Country Region Winery Rating NumberOf…¹ Price Year
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Pomerol 2011 France Pomerol Château La Providence 4.2 100 95 2011
2 Lirac 2017 France Lirac Château Mont-Redon 4.3 100 15.5 2017
3 Erta e China Rosso di Toscana 2015 Italy Toscana Renzo Masi 3.9 100 7.45 2015
4 Bardolino 2019 Italy Bardolino Cavalchina 3.5 100 8.72 2019
5 Ried Scheibner Pinot Noir 2016 Austria Carnuntum Markowitsch 3.9 100 29.2 2016
6 Gigondas (Nobles Terrasses) 2017 France Gigondas Vieux Clocher 3.7 100 19.9 2017
7 Marion's Vineyard Pinot Noir 2016 New Zealand Wairarapa Schubert 4 100 43.9 2016
8 Red Blend 2014 Chile Itata Valley Viña La Causa 3.9 100 17.5 2014
9 Chianti 2015 Italy Chianti Castello Montaùto 3.6 100 10.8 2015
10 Tradition 2014 France Minervois Domaine des Aires Hautes 3.5 100 6.9 2014
# … with 8,656 more rows, and abbreviated variable name ¹​NumberOfRatings
Now, with dplyr's help:
library(dplyr)
data %>%
add_count(Country, name = "WineCount") %>%
mutate(CountryOther = ifelse(WineCount <= 20, "Other", Country))
we get
# A tibble: 8,666 × 10
Name Country Region Winery Rating Numbe…¹ Price Year WineC…² Count…³
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int> <chr>
1 Pomerol 2011 France Pomerol Château La… 4.2 100 95 2011 2256 France
2 Lirac 2017 France Lirac Château Mo… 4.3 100 15.5 2017 2256 France
3 Erta e China Rosso di Toscana 2015 Italy Toscana Renzo Masi 3.9 100 7.45 2015 2650 Italy
4 Bardolino 2019 Italy Bardolino Cavalchina 3.5 100 8.72 2019 2650 Italy
5 Ried Scheibner Pinot Noir 2016 Austria Carnuntum Markowitsch 3.9 100 29.2 2016 220 Austria
6 Gigondas (Nobles Terrasses) 2017 France Gigondas Vieux Cloc… 3.7 100 19.9 2017 2256 France
7 Marion's Vineyard Pinot Noir 2016 New Zealand Wairarapa Schubert 4 100 43.9 2016 63 New Ze…
8 Red Blend 2014 Chile Itata Valley Viña La Ca… 3.9 100 17.5 2014 326 Chile
9 Chianti 2015 Italy Chianti Castello M… 3.6 100 10.8 2015 2650 Italy
10 Tradition 2014 France Minervois Domaine de… 3.5 100 6.9 2014 2256 France
# … with 8,656 more rows, and abbreviated variable names ¹​NumberOfRatings, ²​WineCount, ³​CountryOther
We can filter for WineCount <= 30:
# A tibble: 125 × 10
Name Country Region Winery Rating Numbe…¹ Price Year WineC…² Count…³
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int> <chr>
1 Steiner 2013 Hungary Sopron Wenin… 3.7 100 24.5 2013 9 Other
2 Viile Metamorfosis Merlot 2015 Romania Dealu Mare Vitis… 3.5 102 7.5 2015 23 Romania
3 Halkidiki Limnio - Merlot 2013 Greece Chalkidiki Tsant… 3.2 105 12.5 2013 13 Other
4 Cabernet Sauvignon 2013 Mexico Valle de Guad… L. A.… 3.4 1066 8.65 2013 1 Other
5 Driopi Classic Agiorgitiko Nemea 2017 Greece Nemea Κτημα… 3.7 107 11.5 2017 13 Other
6 Malbec de Purcari 2018 Moldova South Eastern Châte… 4.1 107 12.0 2018 8 Other
7 Cabernet Sauvignon de Purcari 2017 Moldova South Eastern Châte… 4.1 1082 13.0 2017 8 Other
8 Cabernet Sauvignon 2016 Romania Samburesti Caste… 3.3 112 7.9 2016 23 Romania
9 Aigle Les Murailles Rouge 2015 Switzerland Aigle Henri… 3.7 112 23.2 2015 12 Other
10 Γουμένισσα (Goumenissa) 2015 Greece Goumenissa Chatz… 3.7 115 20 2015 13 Other
This confirms the desired output: several rows now have "Other" in the CountryOther column.
In the end I created this code, which works:
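For readers without dplyr, the same count-then-recode idea can be sketched in base R. The toy data frame and the threshold of 1 below are invented for illustration; with the real data the threshold would be 20.

```r
# Toy data (invented for illustration): one row per wine
wines <- data.frame(
  Country = c("France", "France", "France", "Italy", "Italy", "Mexico"),
  stringsAsFactors = FALSE
)

# Number of wines per country, repeated on every row of that country.
# ave() recycles the single length() result across each group.
wines$WineCount <- ave(seq_along(wines$Country), wines$Country, FUN = length)

# Hypothetical threshold for the toy data; the question uses 20.
threshold <- 1

# Vectorized recode: rare countries become "Other", the rest keep their name
wines$CountryOther <- ifelse(wines$WineCount <= threshold, "Other", wines$Country)

print(wines)
```

The result has to be assigned back to the data frame (as done here with `<-`), which is the same point the asker ran into with mutate().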
#New table with wine count
wineCount <- data %>% count(Country)
#Joining two tables together
dataExtended <- inner_join(wineCount, data, by = "Country")
# Creating new variable CountryOther
dataExtended$CountryOther <- dataExtended$Country
# Renaming count from n to WineCount
dataExtended <- rename(dataExtended, WineCount = n)
# Replacement of countries with WineCount<=20 to Other
dataExtended <- dataExtended %>%
mutate(CountryOther = ifelse(WineCount<=20, "Other", CountryOther))
# Final check
unique(dataExtended$CountryOther)
The problem was I needed to store changes into the dataframe, which I didn't do before (as you can see in my last comment):
dataExtended <- rename(dataExtended, WineCount = n)
and
dataExtended <- dataExtended %>%
mutate(CountryOther = ifelse(WineCount<=20, "Other", CountryOther))
I also tested your code and it works as well and additionally it looks neater. So thank you very much for your help.

Merging two data frames with different rows in R

I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame plus a few more countries for the year 2018. It looks like this:
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66

Confused on percent difference calculations in R using dplyr::mutate

I'm attempting to find the percent differences of state characteristics (using a defined index created using factor analysis) between the years 2012 and 2017. However some states begin with a score of -0.617 (2012) and end with -1.25 (2017), creating a positive percent difference rather than a negative.
The only other thing I've tried is subtracting 1 from the fraction factor1/lag(factor1). Below is the code I'm currently working with:
STFACTOR %>>%
dplyr::select(FIPSst, Geography, Year, factor1) %>>%
filter(Year %in% c(2012, 2017)) %>>%
group_by(Geography) %>>%
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
These are the changes and results from each change in code
mutate(pct_change = (1-factor1/lag(factor1)) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 102.
I would expect the final result to look like this:
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1-lag(factor1))/lag(abs(factor1)) * 100)
The line above is the final solution to the problem: I subtracted the old number from the new one and then divided by the absolute value of the old number.
We can use:
mutate(pct_change =(factor1 - lag(factor1))/abs(lag(factor1)) * 100)
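To see why dividing by the absolute value fixes the sign, here is a small standalone sketch. The helper name pct_change_abs is made up for illustration; the first two pairs of values are taken from the tables above, the third is invented.

```r
# Percent change relative to the magnitude of the old value.
# Dividing by abs(old) keeps the sign of (new - old), so a decrease
# is always negative, even when the old value itself is negative.
pct_change_abs <- function(old, new) {
  (new - old) / abs(old) * 100
}

# Alabama: positive base, score fell, so the result is negative
pct_change_abs(1.82, 0.945)

# Colorado: negative base, score fell further, result is still negative
pct_change_abs(-0.617, -1.25)

# Invented example: negative base, score rose, result is positive
pct_change_abs(-0.617, -0.3)

# For comparison, the plain ratio flips the sign on a negative base
(-1.25 / -0.617 - 1) * 100
```

Note that a percent change from a base near zero is still unstable whichever denominator is used, so the absolute-value form only addresses the sign problem.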

Remove rows with NA values and delete those observations in another year [duplicate]

I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009 but then I want to remove the rows of those countries in 2017 as well. I would like to get the following results:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can use any() after grouping by 'country':
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
Base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
# logical indexing also behaves correctly when no row matches (negative which() would drop every row)
df[!(foo | bar), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
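A grouped variant in base R, closer in spirit to the dplyr answer, can be sketched with ave(). The data frame below re-creates the 12 rows shown in the question; the variable names has_na and complete_countries are just illustrative.

```r
# The 12 rows shown in the question
df <- data.frame(
  country   = rep(c("Canada", "France", "Germany", "Greece", "Hungary", "Italy"), 2),
  year      = rep(c(2017, 2009), each = 6),
  conf_perc = c(77, 45, 60, 33, 67, 38, 88, 91, 93, NA, NA, NA)
)

# For every row, TRUE if its country has an NA in any year
has_na <- ave(is.na(df$conf_perc), df$country, FUN = any)

# Keep only the countries that are complete in all years
complete_countries <- df[!has_na, ]

print(complete_countries)
```

Unlike the year-specific version, this drops a country if it has an NA in any year, which matches the behaviour of the filter(!any(is.na(conf_perc))) answer.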

correlation between two data frames in R

I have one data frame which has sales values for the time period Oct. 2000 to Dec. 2001 (15 months). Also I have profit values for the same time period as above and I want to find the correlation between these two data frames month wise for these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
I know that I cannot get a correlation for the first two months, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation using that month's values together with those of all previous months. So for Dec. 2000 I will consider the values for Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. In other words, for every month I also include all earlier months when calculating the correlation, and my output should look something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?
Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note that I call this runcor because it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.
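The same idea can be wrapped in a small reusable function. This is only a sketch: the name running_cor and the min_obs argument are made up for illustration, and the sample vectors are the ones from the answer above.

```r
# Running (expanding-window) correlation: element i is cor() of the
# first i pairs, or NA while fewer than min_obs pairs are available.
running_cor <- function(x, y, min_obs = 3) {
  stopifnot(length(x) == length(y))
  vapply(seq_along(x), function(i) {
    if (i < min_obs) NA_real_ else cor(x[1:i], y[1:i])
  }, numeric(1))
}

# Sample data from the answer above
sales  <- c(1, 4, 3, 2, 3, 4, 5, 6, 7, 6, 7, 5)
profit <- c(4, 3, 2, 3, 4, 5, 6, 7, 7, 7, 6, 5)

rc <- running_cor(sales, profit)
print(rc)
```

The default min_obs = 3 mirrors the question's requirement that the first two months stay empty; vapply() is used instead of sapply() so the result is always a numeric vector.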
Another possibility, assuming dat1 and dat2 are the initial datasets (sales and profit):
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
