Creation of an index with a year base - r

I have simple features (sf) panel data on murders in the 32 Mexican states over 24 years. I want to create an index that uses the first year in my data (1994) as the base. To do so, I am running the following code:
#Taking the data of murders in 1994 from each state and then paste it for all the years
mexico.sf$murders1994 <- mexico.sf$murders[mexico.sf$year==1994]
#Use the murders from each year divided by the murders in 1994 per state to create an index
mexico.sf$murdersrelativeto1994 <- (mexico.sf$murders / mexico.sf$murders1994)
However, when I run the first line I get the following error:
Error: Assigned data `mexico.sf$murders[mexico.sf$year == 1994]` must be compatible with existing data.
x Existing data has 800 rows.
x Assigned data has 32 rows.
i Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
It is clear to me that it only takes 32 values because I am filtering by year. How can I copy those 32 values across the whole sample?

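Without looking at the actual data, I'm not sure I understand your purpose correctly. But if you only need to fill all 800 rows with the 32 values, you could repeat them. Note that this only lines up correctly if the rows are sorted by year and the states appear in the same order within every year: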
mexico.sf$murders1994 <- rep(mexico.sf$murders[mexico.sf$year==1994], 800/32)
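Repeating the 32 values depends on the row order of the data. A minimal sketch of an alternative that looks up each row's 1994 value by state with match(), so the order does not matter. The column name state and the toy data frame are assumptions standing in for mexico.sf, which is not shown in the question:

```r
# Toy stand-in for mexico.sf: 3 states x 3 years (made-up numbers)
toy <- data.frame(
  state   = rep(c("A", "B", "C"), times = 3),
  year    = rep(1994:1996, each = 3),
  murders = c(10, 20, 40, 15, 10, 60, 20, 30, 20)
)
# Shuffle the rows on purpose to show that the order is irrelevant
toy <- toy[c(9, 1, 5, 3, 7, 2, 8, 4, 6), ]

# One row per state holding the base-year value
base <- toy[toy$year == 1994, ]

# For every row, pick the 1994 value of the same state
toy$murders1994 <- base$murders[match(toy$state, base$state)]

# Index relative to 1994 (equals 1 in the base year)
toy$murdersrelativeto1994 <- toy$murders / toy$murders1994
toy[order(toy$state, toy$year), ]
```

If a state has no 1994 row, match() returns NA for it, so the gap shows up in the index instead of being silently filled with another state's value.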

Related

How to add only one observation at a time amongst several observations in R?

Say I have observations for several periods for financial data, how can I create a function in R that only adds one observation at a time throughout my dataset so that I can compare how a single observation impacts my original data?
Say for instance that I have something like this:
Apple Microsoft Tesla Amazon
2010 0.8533719 0.8078440 0.2620114 0.1869552
2011 0.7462573 0.5127501 0.5452448 0.1369686
2012 0.7580671 0.5062639 0.7847919 0.8362821
2013 0.3154078 0.6960258 0.7303597 0.6057027
2014 0.4741735 0.3906580 0.4515726 0.1396147
2015 0.4230036 0.4728911 0.1262413 0.7495193
2016 0.2396552 0.5001825 0.6732861 0.8535837
2017 0.2007575 0.8875209 0.5086837 0.2211072
#And I define my original covariance matrix as follows:
cov.m <- cov(x[1:5,])
#I would like to add only one new observation at a time, so the results should be:
cov(x[1:5,]), cov(x[1:6,]), cov(x[1:7,]), cov(x[1:8,])
I have tried using rbind and a repeat loop, but it seems I still have to specify every row to include in rbind. That is quite tedious if I want to test 100+ observations, since I would need to list them all manually, and the repeat loop would serve no purpose in that case either.
Does this get you closer to your expected output?
lapply(5:nrow(x), function(y) cov(x[1:y, ]))
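As a rough illustration of the lapply() call above, here is a sketch that rebuilds the example data from the question as a matrix and names each result by the number of rows it uses. The names n_start and cov_list are made up for this example:

```r
# Example data from the question, one row per year
x <- matrix(
  c(0.8533719, 0.8078440, 0.2620114, 0.1869552,
    0.7462573, 0.5127501, 0.5452448, 0.1369686,
    0.7580671, 0.5062639, 0.7847919, 0.8362821,
    0.3154078, 0.6960258, 0.7303597, 0.6057027,
    0.4741735, 0.3906580, 0.4515726, 0.1396147,
    0.4230036, 0.4728911, 0.1262413, 0.7495193,
    0.2396552, 0.5001825, 0.6732861, 0.8535837,
    0.2007575, 0.8875209, 0.5086837, 0.2211072),
  ncol = 4, byrow = TRUE,
  dimnames = list(2010:2017, c("Apple", "Microsoft", "Tesla", "Amazon"))
)

n_start <- 5                      # size of the original window
sizes <- seq(n_start, nrow(x))    # 5, 6, 7, 8

# One covariance matrix per window, each adding a single observation
cov_list <- lapply(sizes, function(n) cov(x[1:n, ]))
names(cov_list) <- paste0("rows_1_to_", sizes)

# Change caused by adding the sixth observation
cov_list[["rows_1_to_6"]] - cov_list[["rows_1_to_5"]]
```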

apply nested within lapply not working in R

Earlier today I received a very helpful answer to a problem I was running into, which allowed me to move on to the next step of one of my projects. However, I got stuck again later in the project, and I'm wondering if any of you can help me move forward.
Context
Currently, I have a list of data frames that are full of soccer matches called wc_match_dataframes. Here is what one of the data frames looks like:
type_id tourn_id day month year team_A score_A score_B team_B win loss
f wc_1934 27 5 1934 Germany 5 2 Belgium Germany Belgium
I wasn't able to fit the data for the final three columns, draw, drawA, and drawB but basically the draw column is TRUE if the match is a draw, if not, it is FALSE. In the case of a draw, the win and loss columns are just filled by Draw. The drawA column is filled by team_A if the match was a draw, and likewise, the drawB column is filled by team_B.
The type_id is either f or q depending on if the match was a World Cup qualifier or a World Cup finals match. The tourn_id refers to the tournament the match was for, whether it was a qualifier or finals.
There are a total of 39 of these data frames, with a "finals" data frame for each of the 20 World Cup tournaments, and a "qualifiers" data frame for 19 tournaments (the first World Cup did not have qualifying).
What I Want To Do
I'm trying to populate a different list of data frames wc_dataframes with data for each of the 20 World Cups at the country level as opposed to the match level. Each of these twenty data frames will have the countries that made it to the finals of said tournament and their data like so:
Country
Wins in qualifying
Wins in finals
Losses in qualifying
Losses in finals
... and so on.
I have been able to populate the first country column for every World Cup no problem, but I'm running into issues for the rest of the columns.
Here is what I'm doing
This is the unlooped (only works for one World Cup) version of my code that works successfully:
wc_dataframes$wc_1930$fw <- apply(wc_dataframes$wc_1930, MARGIN = 1, function(country)
sum(wc_match_dataframes$`wc_1930 f`$w == country, na.rm = TRUE))
This is successfully populating the finals win column in the wc_dataframes$wc_1930 data frame by counting the number of wins.
Now, when I try and nest this under lapply to do it across all World Cup years like so:
lapply(names(wc_dataframes), function(year)
wc_dataframes$year$fw <- apply(wc_dataframes$year, MARGIN = 1, function(country)
sum(wc_match_dataframes$`year f`$w == country, na.rm = TRUE)))
It does not work for me. I suspect that the issue has to do with how I define the year argument of the function and how it is used in the sum portion of my code. I come from a background in Stata, so I am more used to running for loops and the like. I'm still getting used to R and lists, so I really appreciate the help.
Thank you so much in advance, and happy holidays! :)
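The $ operator takes the name after it literally, so wc_dataframes$year looks for an element called "year" instead of using the value of the variable, and `year f` in backticks is likewise a literal name. Index with [[year]] and build the match name with paste instead. The function also needs to return the data frame it modified, and the result of lapply has to be assigned back:
wc_dataframes <- lapply(setNames(nm = names(wc_dataframes)), function(year) {
  df <- wc_dataframes[[year]]
  df$fw <- apply(df, MARGIN = 1, function(country)
    sum(wc_match_dataframes[[paste(year, "f")]]$w == country, na.rm = TRUE))
  df  # return the modified data frame for this year
})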
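To see why `$year` fails inside the function while `[[year]]` works, here is a small sketch on made-up stand-ins for the two lists (two tournaments, finals only, invented results). It uses vapply() over the country column instead of apply() over rows, which is a substitution for readability and not the approach in the question:

```r
# Made-up stand-ins for the real lists
wc_match_dataframes <- list(
  "wc_1930 f" = data.frame(win  = c("Uruguay", "Uruguay", "Argentina"),
                           loss = c("Peru", "Argentina", "Chile")),
  "wc_1934 f" = data.frame(win  = c("Italy", "Germany", "Italy"),
                           loss = c("Spain", "Belgium", "Austria"))
)
wc_dataframes <- list(
  wc_1930 = data.frame(country = c("Uruguay", "Argentina", "Peru")),
  wc_1934 = data.frame(country = c("Italy", "Germany", "Belgium"))
)

year <- "wc_1930"
is.null(wc_dataframes$year)     # TRUE: `$` looks for an element named "year"
is.null(wc_dataframes[[year]])  # FALSE: `[[` uses the value of the variable

# Count finals wins per country for every tournament, keeping the list names
wc_dataframes <- lapply(
  setNames(nm = names(wc_dataframes)),
  function(year) {
    df <- wc_dataframes[[year]]
    matches <- wc_match_dataframes[[paste(year, "f")]]
    df$fw <- vapply(
      df$country,
      function(country) sum(matches$win == country, na.rm = TRUE),
      integer(1),
      USE.NAMES = FALSE
    )
    df  # return the modified data frame
  }
)
wc_dataframes$wc_1934
```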

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and pare it down each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes to do this, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
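The "embedded null" entries in the warnings suggest that NUL bytes may be present in the written file. A minimal base R sketch for checking that, assuming nothing about the real data: count_nul_bytes is a hypothetical helper and the demo file is made up. It only diagnoses the file and does not fix it:

```r
# Hypothetical helper: count NUL (0x00) bytes in a file, reading it in chunks
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break          # end of file
    total <- total + sum(chunk == as.raw(0))
  }
  total
}

# Small made-up file with one NUL byte inside a field
demo <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("src,year\nge"), as.raw(0), charToRaw("n,2014\nres,2015\n")),
  demo
)
count_nul_bytes(demo)
```

A non-zero count for the real file would point to the text fields of the source data, not to the manually added src column.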

Discriminant analysis and column name in the code

I have been writing code to make it easier to perform a discriminant analysis with the lda function, but there is one step I cannot solve: supplying the name of the categorical column in the code. Imagine we have the following table (called Smoke), in which the column Factor represents the groups (in our case, smoker and nsmok).
Smoke
Factor Lung Heart Blood
1 smoker 7 22 15
2 smoker 8 21 12
3 nsmok 22 9 5
This is the code I have been preparing. Please, look at the XXXX's in the code (it appears twice). I want them to write automatically the name of the categorical column, instead of writing directly it twice.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it. But actually there still exist some errors when running the code.
Alternatively, I thought that introducing the name directly in this way:
Column_Factor-> Factor
and writing Column_Factor in the two places in the code would solve it. But it doesn't.
Any ideas?
You could do something like this:
library(MASS)
#gets the column name of the factor, maybe check if there is only one factor column first
Column_Factor <- names(Smoke)[sapply(Smoke, class)=="factor"]
#creates the formula by pasting the name and the RHS
lda <- lda(as.formula(paste(Column_Factor,"~.",sep="")),data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
#selects the column using the variable
Tabla=table(Smoke[,Column_Factor],lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
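A variation on the same idea, sketched with reformulate() to build the formula and [[ ]] to select the column by name. The built-in iris data is used here only as a stand-in for Smoke, because the three rows shown in the question are too few to fit a model; the object names fit, pred and tab are made up:

```r
library(MASS)

# Stand-in data: iris has exactly one factor column (Species)
dat <- iris

# Find the name of the factor column and check that there is only one
Column_Factor <- names(dat)[sapply(dat, is.factor)]
stopifnot(length(Column_Factor) == 1)

# Build the formula "<factor column> ~ ." from the column name
f <- reformulate(".", response = Column_Factor)

fit <- lda(f, data = dat)
pred <- predict(fit)

# Select the grouping column through the variable, not a hard-coded name
tab <- table(dat[[Column_Factor]], pred$class)
diag(prop.table(tab, 1))    # share classified correctly within each group
sum(diag(prop.table(tab)))  # overall share classified correctly
```

In R 4.0 and later, data.frame() and read.csv() no longer convert strings to factors by default, so the grouping column may need as.factor() before is.factor() can find it.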

Loop to create series of graphs from different files

I am trying to plot histograms with long term (several years) mean precipitation (pp) for each day of the month from a series of files. Each file has data collected from a different place (and has a different code). Each of my files looks like this:
X code year month day pp
1 2867 1945 1 1 0.0
2 2867 1945 1 2 0.0
...
And I am using the following code:
files <- list.files(pattern=".csv")
par(mfrow=c(4,6))
for (i in 1:24) {
obs <- read.table(files[i],sep=",", header=TRUE)
media.dia <- ddply(obs, .(day), summarise, daily.mean<-mean(pp))
codigo <- unique(obs$code)
hist(daily.mean, main=c("hist per day of month", codigo))
}
I get 24 histograms with 24 different codes in the title, but instead of 24 DIFFERENT histograms from 24 different locations, I get the same histogram 24 times (with 24 different titles). Can anybody tell me why? Thanks!
There are at least two errors I can see in your code.
There is an error in your ddply statement.
You are passing the wrong variable to hist, thus plotting something that may or may not exist depending on previous session actions.
The problem in your ddply statement is that you are assigning with <- inside summarise, which does not create a column named daily.mean in the result. Use = to name the column:
media.dia<- ddply(obs, .(day),summarise, daily.mean = mean(pp))
Then edit your hist statement:
hist(media.dia$daily.mean,main=c("hist per day of month",codigo))
I suspect the problem is that you are not passing the correct parameter to hist. The reason that your code actually produces a plot at all, is because in some previous step in your session you must have created a variable called daily.mean (as Brandon points out in the comment.)
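Putting both fixes together, a self-contained sketch of the loop. It swaps plyr::ddply for base R's aggregate() so that it runs without extra packages, and it writes two small made-up files to a temporary directory in place of the real station files. The helper daily_means and the precipitation values are assumptions for illustration:

```r
# Hypothetical helper: mean precipitation per day of month for one file
daily_means <- function(path) {
  obs <- read.csv(path)
  means <- aggregate(pp ~ day, data = obs, FUN = mean)
  list(code = unique(obs$code), daily.mean = means$pp)
}

# Two made-up station files standing in for the real ones
tmp <- tempfile()
dir.create(tmp)
for (code in c(2867, 2900)) {
  obs <- data.frame(
    code  = code,
    year  = rep(1945:1946, each = 3),
    month = 1,
    day   = rep(1:3, times = 2),
    pp    = seq_len(6) * (code - 2866)
  )
  write.csv(obs, file.path(tmp, paste0(code, ".csv")))
}

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
results <- lapply(files, daily_means)

pdf(NULL)  # discard the plots here; drop this line to draw on screen
par(mfrow = c(1, 2))
for (res in results) {
  # Each histogram uses the means computed from its own file
  hist(res$daily.mean, main = paste("hist per day of month", res$code))
}
invisible(dev.off())
```

Because the means are returned from the helper and passed to hist() explicitly, no plot can pick up a stale daily.mean left over in the workspace.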
I think the daily.mean calculated in the ddply function is assigned in a separate environment, and does not exist in an environment hist can see.
Try daily.mean<<-mean(pp)
