Data transformation in R: columns

My dataframe with n dates
Date team_home team_away prob_home draw prob_away
01/01/2021 Brazil Germany 95.0 5.0 0.0
01/01/2021 England Belgium 50.0 10.0 40.0
02/01/2021 Belgium Canada 90.0 7.0 3.0
02/01/2021 Germany France 60.0 10.0 30.0
... .... ... ... ... ...
DESIRED DATAFRAME. Important: Only one date per row
Date prob_Brazil draw_Brazil_Germany prob_Germany prob_England draw_England_Belgium prob_Belgium ....
01/01/2021 95.0 5.0 0.0 50.0 10.0 40.0
02/01/2021 NA NA 60.0 NA NA 90.0
Thank you for your help!

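You can use pivot_wider() from tidyr, but the column names will not match the desired output exactly (they come out as prob_home_Brazil_Germany, draw_Brazil_Germany, and so on):
library(tidyr)
df %>%
  pivot_wider(names_from = c(team_home, team_away),
              values_from = c(prob_home, draw, prob_away))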
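If the aim is one prob_<team> column per team and one draw_<home>_<away> column per match, one option is to build those column names first and only then go wide. Here is a minimal base R sketch, assuming the four sample rows from the question and that each team plays at most once per date; the names long and wide are just illustrative:

```r
# Sample rows from the question
df <- data.frame(
  Date      = c("01/01/2021", "01/01/2021", "02/01/2021", "02/01/2021"),
  team_home = c("Brazil", "England", "Belgium", "Germany"),
  team_away = c("Germany", "Belgium", "Canada", "France"),
  prob_home = c(95, 50, 90, 60),
  draw      = c(5, 10, 7, 10),
  prob_away = c(0, 40, 3, 30)
)

# Stack the three value columns, giving each value its target column name
long <- rbind(
  data.frame(Date = df$Date, key = paste0("prob_", df$team_home),
             value = df$prob_home),
  data.frame(Date = df$Date, key = paste0("draw_", df$team_home, "_", df$team_away),
             value = df$draw),
  data.frame(Date = df$Date, key = paste0("prob_", df$team_away),
             value = df$prob_away)
)

# One row per Date, one column per key; combinations that never occur become NA
wide <- reshape(long, idvar = "Date", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))  # drop the "value." prefix
rownames(wide) <- NULL
wide
```

If a team could appear twice on the same date, the Date/key pairs would no longer be unique and this approach would need an extra match identifier.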

Related

Extract and combine columns with the same name from a list of dataframes

I got a list of 17 dataframes that contain multiple macroeconomic variables for several countries, and the dataframes' structure is like:
df$CPI
Date US Argentina Vietnam India Indonesia Philippines
1564531200 1.8 54.4 2.4 3.1 3.3 2.4
1561852800 1.6 55.8 2.2 3.2 3.3 2.7
1559260800 1.8 57.3 2.9 3.0 3.3 3.2
df$CapitalAccount
Date US Argentina Brazil China Turkey Thai
2019-06-30 0 13.8 49.0 -58.5 -7.2 27.9
2019-03-31 0 32.2 98.1 -26.3 21.4 0.0
2018-12-31 2721 16.2 59.8 -213.1 0.5 0.0
2018-06-30 -5 10.9 82.0 -50.6 -2.7 0.0
I'm trying to re-organize those dataframes by country names, like:
US
Date CPI CapitalAccount .......(the other 14 macro variables)
2019-06-30
2019-03-31
2018-12-31
Argentina
Date CPI CapitalAccount .......(the other 14 macro variables)
2019-06-30
2019-03-31
2018-12-31
.
.
.
.
I've tried using a for loop to go through each dataframe in the list of dataframes and grab the column by colnames() of that dataframe, but it's not working and the result gives me many duplicate NAs and Dates.
For US:
for (i in 1:length(df)){
NewUS <- df[[i]][,which(colnames(df[[i]])=='US')]
US <- merge(US, NewUS)
i <- i+1
}
US
For Argentina:
for (i in 1:length(df)){
NewArgentina <- df[[i]][,which(colnames(df[[i]])=='Argentina')]
Argentina <- merge(Argentina, NewArgentina)
i <- i+1
}
Argentina
EDIT: per @Gregor's suggestion, I use idcol and fill = TRUE to replace the for loop.
Hope this helps. In the code below, df1 and df2 are dummy data tables. In your case, they will be CPI, CapitalAccount...
First, we tag each data table in the list with a new column called type that names its economic variable (idcol does this from the list names). Next, rbindlist() binds the list into one table; fill = TRUE pads any column a table lacks with NA.
library(data.table)
# 25 distinct dates per table, so every date/country/type cell holds exactly one value
# (with repeated dates, dcast() would fall back to counting values per cell)
dates <- seq(from = as.Date('2019-01-01'), by = 'day', length.out = 25)
df1 <- data.table(date = dates,
                  US = runif(25),
                  Argentina = runif(25),
                  Thailand = runif(25),
                  China = runif(25))
df2 <- data.table(date = dates,
                  US = runif(25),
                  Argentina = runif(25),
                  Japan = runif(25))
l1 <- list(df1, df2)
names(l1) <- c('GDP', 'CPI')
x <- rbindlist(l1, idcol = 'type', fill = TRUE) # this works even when the columns are different for each table
Now that all the data tables are combined, we can reshape the table to look like the result you wanted.
x1 <- melt(x, id.vars = c('date', 'type'), measure.vars = c('US', 'Argentina'), variable.name = 'country', value.name = 'value')
dcast(x1, date + country ~ type, value.var = 'value')
Consider base R: reshape each data frame to long format, chain-merge the results, and split the merged data frame into a new named list of data frames by country. The higher-order functions Map and Reduce handle the looping.
proc_reshape <- function(df, nm) {
within(data.frame(reshape(df, varying = names(df)[-1], times = names(df)[-1],
v.names = nm, timevar = "Country", direction = "long"),
row.names = NULL), {
Date <- as.Date(as.POSIXct(Date, origin = "1970-01-01"))
rm(id)
})
}
# ELEMENTWISE LOOP THROUGH DFs AND THEIR NAMES
long_list <- Map(proc_reshape, my_list, names(my_list))
# CHAIN MERGE (FULL JOIN FOR MISMATCHED DATES BY COUNTRY)
merged_df <- Reduce(function(x, y) merge(x, y, by = c("Country", "Date"), all = TRUE),
long_list)
# CREATE NEW NAMED LIST OF DFs
new_list <- split(merged_df, merged_df$Country)
Output
new_list
$Argentina
Country Date CPI CapitalAccount
1 Argentina 2018-06-29 NA 10.9
2 Argentina 2018-12-30 NA 16.2
3 Argentina 2019-03-30 NA 32.2
4 Argentina 2019-05-31 57.3 NA
5 Argentina 2019-06-29 NA 13.8
6 Argentina 2019-06-30 55.8 NA
7 Argentina 2019-07-31 54.4 NA
$Brazil
Country Date CPI CapitalAccount
8 Brazil 2018-06-29 NA 82.0
9 Brazil 2018-12-30 NA 59.8
10 Brazil 2019-03-30 NA 98.1
11 Brazil 2019-06-29 NA 49.0
$China
Country Date CPI CapitalAccount
12 China 2018-06-29 NA -50.6
13 China 2018-12-30 NA -213.1
14 China 2019-03-30 NA -26.3
15 China 2019-06-29 NA -58.5
$India
Country Date CPI CapitalAccount
16 India 2019-05-31 3.0 NA
17 India 2019-06-30 3.2 NA
18 India 2019-07-31 3.1 NA
$Indonesia
Country Date CPI CapitalAccount
19 Indonesia 2019-05-31 3.3 NA
20 Indonesia 2019-06-30 3.3 NA
21 Indonesia 2019-07-31 3.3 NA
$Philippines
Country Date CPI CapitalAccount
22 Philippines 2019-05-31 3.2 NA
23 Philippines 2019-06-30 2.7 NA
24 Philippines 2019-07-31 2.4 NA
$Thai
Country Date CPI CapitalAccount
25 Thai 2018-06-29 NA 0.0
26 Thai 2018-12-30 NA 0.0
27 Thai 2019-03-30 NA 0.0
28 Thai 2019-06-29 NA 27.9
$Turkey
Country Date CPI CapitalAccount
29 Turkey 2018-06-29 NA -2.7
30 Turkey 2018-12-30 NA 0.5
31 Turkey 2019-03-30 NA 21.4
32 Turkey 2019-06-29 NA -7.2
$US
Country Date CPI CapitalAccount
33 US 2018-06-29 NA -5
34 US 2018-12-30 NA 2721
35 US 2019-03-30 NA 0
36 US 2019-05-31 1.8 NA
37 US 2019-06-29 NA 0
38 US 2019-06-30 1.6 NA
39 US 2019-07-31 1.8 NA
$Vietnam
Country Date CPI CapitalAccount
40 Vietnam 2019-05-31 2.9 NA
41 Vietnam 2019-06-30 2.2 NA
42 Vietnam 2019-07-31 2.4 NA
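To see the Map / Reduce / split chain in isolation, here is a cut-down sketch on toy data. The two small data frames, their values, and the helper to_long are made up for illustration (it stacks columns by hand rather than calling reshape), and Date is assumed to be a proper Date column already:

```r
# Toy inputs: two indicators covering partly different countries
cpi <- data.frame(Date = as.Date(c("2019-06-30", "2019-03-31")),
                  US = c(1.6, 1.8), Argentina = c(55.8, 57.3))
cap <- data.frame(Date = as.Date(c("2019-06-30", "2019-03-31")),
                  US = c(0, 0), Brazil = c(49.0, 98.1))
my_list <- list(CPI = cpi, CapitalAccount = cap)

# Hypothetical helper: stack the country columns into Country / Date / <indicator>
to_long <- function(df, nm) {
  countries <- names(df)[-1]
  out <- data.frame(
    Country = rep(countries, each = nrow(df)),
    Date    = rep(df$Date, times = length(countries)),
    value   = unlist(df[countries], use.names = FALSE)
  )
  names(out)[3] <- nm
  out
}

# Apply the helper to each data frame together with its list name
long_list <- Map(to_long, my_list, names(my_list))

# Full join on Country and Date, one indicator at a time
merged <- Reduce(function(x, y) merge(x, y, by = c("Country", "Date"), all = TRUE),
                 long_list)

# One data frame per country
by_country <- split(merged, merged$Country)
by_country$US
```

Countries that are missing from an indicator simply get NA in that column, which is what all = TRUE is for.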

Averaging by column for set number of rows

I have a panel dataset where I want to average over a specified number of time periods (t) by variable (column).
An example:
Country Year Var 1 Var 2 Var 3
Austria 1984 1 3.6 95
Austria 1985 2 4.1 94.6
Austria 1986 1 2.6 93.6
Austria 1987 1 3 94.4
Austria 1988 1 3.9 95.2
What I want then is a new column/new dataframe with a new variable for the average for the 5 year period (1984-1988) for Var 1, a variable for the average of Var 2 and var 3 etc.
I also want to loop the function over such that I can apply it to the other countries in my dataset. It would be great if I could avoid that the averaging mixes up countries, so I was thinking of adding some matching string pattern (for code %in% AUT in this case for instance, I have a variable with country codes) but I couldn't figure out how to do it.
Thank you very much in advance
1) Using the sample input in the Note at the end, read in the country and year from the row names and round the year up to the end of the current 5 year period so that each year from 1984 to 1988 gets rounded up to 1988, etc. Then use aggregate to calculate the means of each column by both country and year. No packages are used.
By0 <- read.table(text = rownames(DF), col.names = c("Country", "Year"))
By <- transform(By0, Year = 5 * ((Year - min(Year)) %/% 5) + min(Year) + 4)
aggregate(DF, By, mean)
giving the following:
Country Year Var 1 Var 2 Var 3
1 Australia 1988 1.6 18.46 95.52
2 Austria 1988 1.2 3.44 94.56
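If Country and Year are ordinary columns, as in the question, rather than row names, the same idea can be written with the formula interface of aggregate. This is only a sketch: the column names Var1 to Var3 (without spaces) and the Period column are assumptions made for the example.

```r
# Sample data with Country and Year as regular columns
panel <- data.frame(
  Country = rep(c("Austria", "Australia"), each = 5),
  Year    = rep(1984:1988, times = 2),
  Var1    = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 3),
  Var2    = c(3.6, 4.1, 2.6, 3, 3.9, 3.6, 4.1, 2.6, 3, 79),
  Var3    = c(95, 94.6, 93.6, 94.4, 95.2, 95, 94.6, 93.6, 94.4, 100)
)

# Label each row with the last year of its 5-year block (1984-1988 -> 1988, ...)
panel$Period <- 5 * ((panel$Year - min(panel$Year)) %/% 5) + min(panel$Year) + 4

# Mean of every variable within each Country / Period combination
means <- aggregate(cbind(Var1, Var2, Var3) ~ Country + Period,
                   data = panel, FUN = mean)
means
```

Because Country is part of the grouping, the averages never mix countries, so no string matching on country codes is needed.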
2) Alternatively, to append the means as new columns to the original data frame, lapply over the columns and use ave to take the mean within each Country and 5-year period:
out <- cbind(DF, lapply(DF, function(x) with(By, ave(x, Country, Year, FUN = mean))))
names(out) <- c(names(DF), paste("Mean", names(DF)))
giving:
> out
Var 1 Var 2 Var 3 Mean Var 1 Mean Var 2 Mean Var 3
Austria 1984 1 3.6 95.0 1.2 3.44 94.56
Austria 1985 2 4.1 94.6 1.2 3.44 94.56
Austria 1986 1 2.6 93.6 1.2 3.44 94.56
Austria 1987 1 3.0 94.4 1.2 3.44 94.56
Austria 1988 1 3.9 95.2 1.2 3.44 94.56
Australia 1984 1 3.6 95.0 1.6 18.46 95.52
Australia 1985 2 4.1 94.6 1.6 18.46 95.52
Australia 1986 1 2.6 93.6 1.6 18.46 95.52
Australia 1987 1 3.0 94.4 1.6 18.46 95.52
Australia 1988 3 79.0 100.0 1.6 18.46 95.52
Note
The input used, shown reproducibly, is:
Lines <- "
Var 1,Var 2,Var 3
Austria 1984,1,3.6,95
Austria 1985,2,4.1,94.6
Austria 1986,1,2.6,93.6
Austria 1987,1,3,94.4
Austria 1988,1,3.9,95.2
Australia 1984,1,3.6,95
Australia 1985,2,4.1,94.6
Australia 1986,1,2.6,93.6
Australia 1987,1,3,94.4
Australia 1988,3,79,100"
DF <- read.csv(text = Lines, check.names = FALSE)

Removing rows where data isn't sequential in R, dplyr

I have a data frame where I am trying to remove rows where the year is not sequential.
Here is a sample of my data frame:
Name Year Position Year_diff FBv ind1 velo_diff
1 Aaron Heilman 2005 RP 2 90.1 TRUE 0.0
2 Aaron Heilman 2003 SP NA 89.4 NA 0.0
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
5 Alexi Ogando 2015 RP 2 94.5 TRUE 0.0
6 Alexi Ogando 2013 SP NA 93.4 FALSE 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The expected output should be:
Name Year Position Year_diff FBv ind1 velo_diff
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The reason Alexi Ogando 2011-2012 is still there is because his sequence of SP to RP is met in line with consecutive years. Ogando's 2013-2015 SP to RP sequence is not met with consecutive years.
One thing that might help: for each sequence where the years aren't sequential, velo_diff is 0.0.
Would anybody know how to do this? All help is appreciated.
You can do a grouped filter, checking if the subsequent or previous year exists and if the Position matches accordingly:
library(dplyr)
df <- read.table(text = 'Name Year Position Year_diff FBv ind1 velo_diff
1 "Aaron Heilman" 2005 RP 2 90.1 TRUE 0.0
2 "Aaron Heilman" 2003 SP NA 89.4 NA 0.0
3 "Aaron Laffey" 2010 RP 1 86.8 TRUE -0.6
4 "Aaron Laffey" 2009 SP NA 87.4 NA 0.0
5 "Alexi Ogando" 2015 RP 2 94.5 TRUE 0.0
6 "Alexi Ogando" 2013 SP NA 93.4 FALSE 0.0
7 "Alexi Ogando" 2012 RP 1 97.0 TRUE 1.9
8 "Alexi Ogando" 2011 SP NA 95.1 NA 0.0', header = TRUE)
df %>% group_by(Name) %>%
filter(((Year - 1) %in% Year & Position == 'RP') |
((Year + 1) %in% Year & Position == 'SP'))
#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#>
#> Name Year Position Year_diff FBv ind1 velo_diff
#> <fctr> <int> <fctr> <int> <dbl> <lgl> <dbl>
#> 1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#> 2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#> 3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#> 4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
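For anyone who wants the same grouped check without dplyr, here is a rough base R sketch. It assumes only the Name, Year and Position columns matter, and the names keep and res are illustrative:

```r
# Reduced version of the sample data (only the columns the filter uses)
df <- data.frame(
  Name     = rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando"),
                 times = c(2, 2, 4)),
  Year     = c(2005, 2003, 2010, 2009, 2015, 2013, 2012, 2011),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP")
)

# For each player, keep an RP row if the previous year is present
# and an SP row if the following year is present
keep <- unlist(lapply(split(seq_len(nrow(df)), df$Name), function(idx) {
  yrs <- df$Year[idx]
  pos <- df$Position[idx]
  ok  <- ((yrs - 1) %in% yrs & pos == "RP") |
         ((yrs + 1) %in% yrs & pos == "SP")
  idx[ok]
}), use.names = FALSE)

res <- df[sort(keep), ]
res
```

Like the dplyr version, this only checks that a neighbouring year exists for the player; it does not verify that the neighbouring row has the opposite Position, which happens to be enough for the sample data.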
We can use data.table
library(data.table)
setDT(df1)[df1[, .I[abs(diff(Year))==1], .(Name, grp = cumsum(Position == "RP"))]$V1]
# Name Year Position Year_diff FBv ind1 velo_diff
#1: Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2: Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3: Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4: Alexi Ogando 2011 SP NA 95.1 NA 0.0
Or using the same methodology with dplyr
library(dplyr)
df1 %>%
group_by(Name, grp = cumsum(Position == "RP")) %>%
filter(abs(diff(Year))==1) %>% #below 2 steps may not be needed
ungroup() %>%
select(-grp)
# A tibble: 4 × 7
# Name Year Position Year_diff FBv ind1 velo_diff
# <chr> <int> <chr> <int> <dbl> <lgl> <dbl>
#1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4 Alexi Ogando 2011 SP NA 95.1 NA 0.0

R: Creating a moving difference variable

I'm attempting to create a new column that holds the moving difference between two values in panel data.
My data looks like this:
party_id year country position vote
101 1984 be 2.75 2.3
101 1988 be 2.75 0.8
101 1992 be 3.33 0.1
101 1996 be 3.67 0.1
102 1984 be 5.80 12.6
102 1988 be 5.80 15.7
I want a column which shows the difference in the vote share between two consecutive election years for a party, e.g. 1988 and 1984, so that it shows changes in vote share.
So my data would look like:
party_id year country position vote vote_difference
101 1984 be 2.75 2.3 NA
101 1988 be 2.75 0.8 -1.5
101 1992 be 3.33 0.1 -0.7
101 1996 be 3.67 0.1 0.0
102 1984 be 5.80 12.6 NA
102 1988 be 5.80 15.7 3.1
Any ideas?
Thanks for the help
Here is a base R solution that applies the indicated function to vote grouped by party_id:
transform(DF, diff = ave(vote, party_id, FUN = function(x) c(NA, diff(x))))
giving:
party_id year country position vote diff
1 101 1984 be 2.75 2.3 NA
2 101 1988 be 2.75 0.8 -1.5
3 101 1992 be 3.33 0.1 -0.7
4 101 1996 be 3.67 0.1 0.0
5 102 1984 be 5.80 12.6 NA
6 102 1988 be 5.80 15.7 3.1
Note: The input DF in reproducible form is:
Lines <- "party_id year country position vote
101 1984 be 2.75 2.3
101 1988 be 2.75 0.8
101 1992 be 3.33 0.1
101 1996 be 3.67 0.1
102 1984 be 5.80 12.6
102 1988 be 5.80 15.7 "
DF <- read.table(text = Lines, header = TRUE)
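One caveat: c(NA, diff(x)) takes differences in row order, so it relies on the rows being sorted by year within each party. A small sketch that sorts first, using a deliberately shuffled copy of the sample data (only three of the columns are kept to keep it short):

```r
# Same values as the sample, but with the rows shuffled
DF <- data.frame(
  party_id = c(101, 102, 101, 101, 102, 101),
  year     = c(1992, 1988, 1984, 1996, 1984, 1988),
  vote     = c(0.1, 15.7, 2.3, 0.1, 12.6, 0.8)
)

# Sort by party and year so that diff() compares consecutive elections
DF <- DF[order(DF$party_id, DF$year), ]

# First election of each party has no predecessor, hence the leading NA
DF$vote_difference <- ave(DF$vote, DF$party_id,
                          FUN = function(x) c(NA, diff(x)))
DF
```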
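Using data.table (diff() returns one value fewer than the group has rows, so the first difference in each group is padded with NA):
library(data.table)
setDT(data)
data[, vote_difference := c(NA, diff(vote)), by = party_id]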

counting -99.9 in r

I am new in using R and I have a question for which I am trying to find the answer. I have a file organized as follows (it has thousands of rows but I just show a sample for simplicity):
YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
I would like to know for each column (T1, T2, and R1) in which Year, Month and Day the -99.9 are located; e.g. from 1980/1/3 to 1980/1/27 there are X -99.9 for T1, from 1990/2/10 to 1990/3/30 there are Y -99.9 for T1... and so on. Same for T2 and R.
How can I do this in R?
This is only one file like this but I have almost 2000 files with the same problem (I know I need to loop it) but if I know how to do it for one file then I will just create a loop.
I really appreciate any help. Thank you very much in advance for reading and helping!!!
I took the liberty of renaming your last dataframe column "R1"
lapply(c('T1', 'T2', 'R1'), function(x) { dfrm[ dfrm[[x]]==-99.9 , # rows to select
1:3 ] }# columns to return
)
#-------------
[[1]]
YEAR Month day
3 1965 3 4
[[2]]
YEAR Month day
3 1965 3 4
[[3]]
YEAR Month day
3 1965 3 4
4 1965 3 5
5 1965 3 6
It wasn't clear whether you wanted the values or counts (and I don't think you can have both in the same report.) If you wanted to name the entries:
> misdates <- .Last.value
> names(misdates) <- c('T1', 'T2', 'R1')
If you wanted a count:
lapply(misdates, NROW)
$T1
[1] 1
$T2
[1] 1
$R1
[1] 3
(You might want to learn how to use NA values. Using numbers as missing values is not recommended data management.)
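Following up on the NA remark, here is a sketch of how the "from date X to date Y there are N missing values" report could look once -99.9 is recoded to NA. It uses rle() to find runs of consecutive missing rows; the function name missing_runs is made up, and it assumes the rows are in date order with one row per day:

```r
dfrm <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0", header = TRUE)

# Recode the -99.9 sentinel as a proper missing value
vars <- c("S1", "T1", "T2", "R1")
dfrm[vars] <- lapply(dfrm[vars], function(x) replace(x, x == -99.9, NA))

# Hypothetical helper: one row per run of consecutive NAs in column `col`
missing_runs <- function(df, col) {
  r     <- rle(is.na(df[[col]]))
  end   <- cumsum(r$lengths)
  start <- end - r$lengths + 1
  dates <- as.Date(paste(df$YEAR, df$Month, df$day, sep = "-"),
                   format = "%Y-%m-%d")
  data.frame(variable  = rep(col, sum(r$values)),
             from      = dates[start[r$values]],
             to        = dates[end[r$values]],
             n_missing = r$lengths[r$values])
}

runs <- do.call(rbind, lapply(c("T1", "T2", "R1"), missing_runs, df = dfrm))
runs
```

For many files, the same function could be applied inside a loop over the file names after reading each one.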
If I understand correctly, you want to know how many "-99.9"s you get per month and per column.
Here's my code for S1 using plyr. You'll note that I expanded your example to get one more month of data.
library(plyr)
my.table <-read.table(text="YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
1966 1 7 -99.9 7.2 1.1 0.0
1966 1 8 -99.9 7.2 1.1 0.0
", header=TRUE, as.is=TRUE,sep = " ")
#Create a year/month column to summarise per month
my.table$yearmonth <-paste(my.table$YEAR,"/",my.table$Month,sep="")
S1 <-count(my.table[my.table$S1==-99.9,],"yearmonth")
S1
yearmonth freq
1 1965/3 1
2 1966/1 2
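To get the counts for every column at once rather than S1 alone, one possibility is base R's aggregate on a table of TRUE/FALSE flags. This is a sketch on the same expanded example; the names flags and counts are illustrative:

```r
my.table <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
1966 1 7 -99.9 7.2 1.1 0.0
1966 1 8 -99.9 7.2 1.1 0.0", header = TRUE)

vars <- c("S1", "T1", "T2", "R1")

# TRUE wherever a measurement equals the -99.9 sentinel
flags <- as.data.frame(my.table[vars] == -99.9)

# Summing the flags within each YEAR / Month gives the count per column
counts <- aggregate(flags, by = my.table[c("YEAR", "Month")], FUN = sum)
counts
```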
