I have two datasets, data_issuer and ESG_data.
data_issuer contains bond pricing observations; the identity of the bond issuer is given by the EquityISIN variable. The same EquityISIN occurs multiple times in the dataset for two reasons: (1) each bond has multiple pricing observations, and (2) each issuer can issue multiple bonds.
The data_issuer dataset also contains a variable Year, which gives the year of the bond observation.
The first column of the ESG_data dataset, ISIN CODE, contains the EquityISINs of publicly listed companies. The remaining columns are years from 2000 to 2021; that is, the column names are 2000, 2001, 2002, ..., 2021.
If I want to extract the ESG score of a specific company from the ESG_data dataset, I look at the EquityISIN and the year: the intersection of that row and column holds the ESG score I am looking for.
I want to write code that adds a new variable called ESGscore to the data_issuer dataset. The ESG score has to be extracted from the ESG_data dataset using the EquityISIN value and the Year value.
I tried the following code but it returns an error:
merged_data <- merge(
data_issuer,
ESG_data,
by.x = c("EquityISIN", "Year"),
by.y = c("ISIN CODE", "2000":"2021")
)
# Error in merge.data.frame(data_issuer, ESG_data, by.x = c("EquityISIN", :
# 'by.x' and 'by.y' specify different numbers of columns
I also tried with the help of ChatGPT, but unfortunately that did not help.
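One way to read the description above is as a row/column lookup: find the row of ESG_data that matches the EquityISIN, find the column that matches the Year, and take the cell at their intersection. Below is a minimal base R sketch of that idea. The toy values and the Price column are made up for illustration, and it assumes the year columns of ESG_data are all numeric. merge() fails in the original attempt because by.x and by.y must name the same number of key columns, and the years here are column names rather than values.

```r
# Toy stand-ins for the two datasets (values are invented)
data_issuer <- data.frame(
  EquityISIN = c("AAA", "AAA", "BBB", "CCC"),
  Year       = c(2000, 2001, 2001, 2001),
  Price      = c(99.5, 100.2, 98.7, 101.1)
)
ESG_data <- data.frame(
  `ISIN CODE` = c("AAA", "BBB"),
  `2000`      = c(50, 60),
  `2001`      = c(55, 65),
  check.names = FALSE   # keep "ISIN CODE" and the year names unchanged
)

# Every column except the ISIN column is treated as a year column
year_cols  <- setdiff(names(ESG_data), "ISIN CODE")
esg_matrix <- as.matrix(ESG_data[year_cols])

# For each bond observation: which row (issuer) and which column (year)?
row_idx <- match(data_issuer$EquityISIN, ESG_data[["ISIN CODE"]])
col_idx <- match(as.character(data_issuer$Year), year_cols)

# Indexing with a two-column matrix picks one cell per observation;
# issuers or years without a match end up as NA
data_issuer$ESGscore <- esg_matrix[cbind(row_idx, col_idx)]
data_issuer
```

An alternative would be to reshape ESG_data to long format (one row per ISIN and year) and then merge on both keys; the lookup above avoids the reshape and keeps the row order of data_issuer.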
I'm trying to replace missing values in R with the value that follows. I have annual income data by country, and for a missing income value, say 2001 for country A, I want to pull in the next value (this is for time series analysis with multiple countries and separate columns for different variables; income is just one of them).
I wrote this code to replace the missing values with the mean, but statistically I think it makes more sense to replace each missing value with the value right below it (the next year's value), since the numbers differ a lot between countries and the mean below is taken over all years for all countries.
Social_data_R<-within(Social_data_R,incomeNAavg[is.na(income)]<-mean(income,na.rm=TRUE))
I tried replacing the mean part of the code above with income[i+1], but R didn't recognize 'i' (I imported the data from Excel, so I didn't create the data frame manually).
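The 'i' is not recognized because within() does not loop over rows, so there is no row counter to refer to. A minimal base R sketch of "take the next value" is below. The helper fill_with_next, the new column incomeNAnext, and the column names country and year are assumptions for illustration, and the rows are assumed to be sorted by year within each country.

```r
# Toy data; the real Social_data_R comes from Excel
Social_data_R <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2000, 2001, 2002, 2000, 2001, 2002),
  income  = c(100, NA, 120, NA, 210, NA)
)

# Hypothetical helper: replace each NA with the value that follows it
fill_with_next <- function(x) {
  # Walk from the second-to-last element back to the first, so that
  # runs of consecutive NAs are filled as well
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# ave() applies the helper separately within each country, so a value
# is never pulled in from a different country
Social_data_R$incomeNAnext <- ave(
  Social_data_R$income,
  Social_data_R$country,
  FUN = fill_with_next
)
Social_data_R
```

A missing value in the last year of a country has nothing after it and therefore stays NA. If the tidyr package is available, fill() with .direction = "up" on grouped data does the same kind of filling.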
I wish to separate a column of numeric data in R into 3 individual columns based on an additional variable in the data frame.
Here is some code to produce an example of the kind of data I am working with.
bus_reg_no <- c('G01', 'G01', 'G01', 'G02', 'G02', 'G02', 'G03', 'G03' , 'G03')
entity_name <- c('A','A','A','B','B','B','C','C','C')
TimePeriod <- c('2020','2019','2018','2020','2019','2018','2020','2019','2018')
market_test <- c(NA, 0.9330331232, 0.9969181046, 0.9429920482, 0.9689617356, 0.9764825438, NA, 0.2302289569, 0.7762837775)
dat <- cbind(bus_reg_no, entity_name, TimePeriod, market_test)
What I want to do with this data is convert the market_test variable into 3 columns, one for each of the years 2018, 2019 and 2020, with a single row per entity_name.
So in my desired output, each row will have one value of entity_name and the corresponding market_test values for 2018, 2019 and 2020 in separate columns. My final dataset will therefore have fewer rows than the original one.
I have tried using the dcast function. It works on this smaller sample data set, but when I apply it to my full data set it keeps returning the warning 'Aggregate function missing, defaulting to 'length'' and outputs 0s and 1s rather than the actual market_test values in the columns for the 3 years.
I have also tried the pivot_wider function:
pivot_wider(names_from = c(TimePeriod, entity_name), values_from = market_test)
but I am getting 'Error in rval[, idvar] : subscript out of bounds' with any combination.
I have looked at other related posts but none have helped me sort out the errors.
Any help is much appreciated.
Thanks
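A minimal base R sketch of the long-to-wide step described above, using the sample vectors from the question. Two things are assumptions on my part: the data is built with data.frame() rather than cbind() (cbind() on these vectors gives a character matrix, so market_test would no longer be numeric), and the dcast warning on the full data is taken to mean that some entity/year combinations occur more than once, which the duplicate check is meant to reveal.

```r
bus_reg_no  <- c('G01', 'G01', 'G01', 'G02', 'G02', 'G02', 'G03', 'G03', 'G03')
entity_name <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')
TimePeriod  <- c('2020', '2019', '2018', '2020', '2019', '2018', '2020', '2019', '2018')
market_test <- c(NA, 0.9330331232, 0.9969181046, 0.9429920482, 0.9689617356,
                 0.9764825438, NA, 0.2302289569, 0.7762837775)

# data.frame() keeps market_test numeric
dat <- data.frame(bus_reg_no, entity_name, TimePeriod, market_test)

# Rows whose entity/year combination already appeared earlier; on the
# full data set a non-empty result would explain the 'length' fallback
dups <- dat[duplicated(dat[c("bus_reg_no", "entity_name", "TimePeriod")]), ]
nrow(dups)

# One row per entity, one market_test column per year
wide <- reshape(
  dat,
  idvar     = c("bus_reg_no", "entity_name"),
  timevar   = "TimePeriod",
  direction = "wide"
)
wide
```

The new columns are named market_test.2020, market_test.2019 and market_test.2018. With tidyr, the analogous call would be pivot_wider(dat, names_from = TimePeriod, values_from = market_test); entity_name should not be in names_from, because it identifies the rows rather than the new columns.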
Need to create usable dataframe using R or Excel
Variable1 | ID | Variable2
Name of A person 1 | 002157 | NULL
Drugs used | NULL | 3.0
Days in hospital | NULL | 2
Name of a surgeon | NULL | JOHN T.
Name of A person 2 | 002158 | NULL
Drugs used | NULL | 4.0
Days in hospital | NULL | 5
Name of a surgeon | NULL | ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform this dataframe into a second dataframe that is suitable for analysis (one row per patient instead of several). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Days in hospital, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the rows where ID is filled in
for Name of a surgeon, Drugs used and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
In short, I have no idea how to do that. Could you help me write functions for R, or give me tips for Excel?
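For R, I guess you want something like this:
1) Load the table. Make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab etc.). header=TRUE assumes the first line of the file holds the column names, na.strings="NULL" makes R read the NULL cells as NA, and the colClasses entry keeps the leading zeros of the IDs.
df = read.table("path/to/file", sep=",", header=TRUE, na.strings="NULL", colClasses=c(ID="character"))
2) Create subset tables that each contain only one row per patient.
id = subset(df, !is.na(ID)) #is.null() does not test individual cells, so is.na() is used here
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
surgeon = subset(df, Variable1 %in% "Name of a surgeon")
#...etc...
3) Make a new data frame that contains this information.
new_df = data.frame(
id = id$ID,
drugs = drugs$Variable2,
days = days$Variable2,
surgeon = surgeon$Variable2 #...etc...no comma after the last!
)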
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might wanna do something like this:
Step 1.5) Change all NA values in the ID column (labeled NULL in your table; this assumes they were read in as NA, e.g. with na.strings="NULL" in read.table) to the patient ID from the row above. Note that the is.na() function in the code below is specifically for that, and will not work with NULL or "NULL" or other stuff:
for(i in seq_along(df$ID)[-1]){ #start at row 2, row 1 has no row above it
if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}
Then go again to step 2) above (you don't need the id subset though) and then you have to change each data frame a little. As an example for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then instead of doing step 3 as above, use merge and choose the ID column to be the merging column. Setting all=TRUE keeps patients that appear in only one of the two data frames; the default would drop them.
new_df = merge(drugs, days, by="ID", all=TRUE)
Repeat this for other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID", all=TRUE)
# etc...
That is much more robust: even if some patients have a line that others don't have (e.g. days), the respective column in this new data frame will just contain an NA for the patients who lack it.
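To make the merge-based approach concrete, here is a self-contained sketch on the eight sample rows from the question. The helper pick() and the output column names are made up for illustration, and the NULL cells are assumed to have been read in as NA already.

```r
# Sample rows from the question, with NULL already read as NA
raw <- data.frame(
  Variable1 = c("Name of A person 1", "Drugs used", "Days in hospital",
                "Name of a surgeon", "Name of A person 2", "Drugs used",
                "Days in hospital", "Name of a surgeon"),
  ID        = c("002157", NA, NA, NA, "002158", NA, NA, NA),
  Variable2 = c(NA, "3.0", "2", "JOHN T.", NA, "4.0", "5", "ADAM S."),
  stringsAsFactors = FALSE
)

# Carry each patient ID down to the rows that belong to that patient
for (i in seq_along(raw$ID)[-1]) {
  if (is.na(raw$ID[i])) raw$ID[i] <- raw$ID[i - 1]
}

# Hypothetical helper: rows for one label, reduced to ID plus a renamed value
pick <- function(label, new_name) {
  out <- raw[raw$Variable1 %in% label, c("ID", "Variable2")]
  names(out) <- c("ID", new_name)
  out
}

# Merge the pieces one after another; all = TRUE keeps patients that
# lack one of the lines and fills the gap with NA
pieces <- list(
  pick("Drugs used", "drugs"),
  pick("Days in hospital", "days"),
  pick("Name of a surgeon", "surgeon")
)
new_df <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), pieces)

# Variable2 mixes numbers and names, so the numeric columns are converted last
new_df$drugs <- as.numeric(new_df$drugs)
new_df$days  <- as.numeric(new_df$days)
new_df
```

On the full export it is worth checking that no ID has the same label twice, since merge() would then produce one row per combination for that patient.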
I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message and that new row appears in the table below the command) - but previously added rows where the Freq=0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
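The symptoms described above are what one would expect if the result of add_row() is never assigned: it returns a new data frame and leaves dfmap itself unchanged, so each call only prints a copy with that one extra row. Writing dfmap <- add_row(dfmap, Var1="AND", Freq=0) would keep the row. A base R sketch that avoids adding rows one by one is below; the vectors records and eligible are invented stand-ins for the real data.

```r
# Invented stand-ins: one country code per record, plus the full list
# of eligible countries (including those without any record)
records  <- c("GBR", "DEU", "DEU", "GBR", "DEU")
eligible <- c("AND", "DEU", "FRA", "GBR")

# Declaring all eligible countries as factor levels makes table()
# report a count of 0 for the ones that never occur
dfmap <- as.data.frame(
  table(factor(records, levels = eligible)),
  stringsAsFactors = FALSE
)
dfmap

# Adding a single row afterwards also works, as long as the result
# is assigned back to dfmap ("LIE" is just an example code)
dfmap <- rbind(dfmap, data.frame(Var1 = "LIE", Freq = 0))
```

as.data.frame() on a one-way table yields the columns Var1 and Freq, which matches the column names used in the question.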
Using the method suggested in the comment above, one can use a join and then the replace_na function to create a tibble with the complete list of countries, giving the ones without records a count of zero.
As there was no sample data in the question, I created two data frames below based on what I thought was implied by the question.
library(dplyr) # tibble(), right_join(), %>%
library(tidyr) # replace_na()
dfrm_counts = tibble(Country = c('England','Germany'),
Count = c(1,4))
dfrm_all = tibble(Country = c('England', 'Germany', 'France'))
dfrm_final = dfrm_counts %>%
right_join(dfrm_all, by = "Country") %>%
replace_na(list(Count = 0))
dfrm_final
# A tibble: 3 x 2
Country Count
<chr> <dbl>
1 England 1
2 Germany 4
3 France 0
I have a panel dataset containing data on civil war, with indices "side_a_id" and "year_month". Each observation is an individual 'event' of armed conflict, and variables include details on the actors involved, a unique ID for each individual event, and then for each event, the number of side_a deaths, side_b deaths, and civilian deaths.
Screenshot of sample dataset
I would like to aggregate each of the death variables ('deaths_a', 'deaths_b' and 'civilian_deaths') by the year_month they fall in. So taking an example from my dataset below: instead of having 3 separate rows for the interaction between Government of Haiti and Military Faction (dyad_id = 14), I would like one row that contains all the deaths of each party for a specific month. I have tried using the aggregate() function, which seems to work, until I try to re-merge it with my full dataset.
df <- aggregate(cbind(deaths_a, deaths_b, deaths_civilians) ~ side_a_id +
year_month, panel_data, sum)
rebel <- full_join(panel_data, df, by = c("side_a_id", "year_month"))
Can anyone suggest a solution?
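A minimal sketch of the aggregation on invented toy data; the column names follow the code in the question and event_id is an assumed name for the event identifier. The result of aggregate() already has one row per side_a_id and year_month. Joining it back onto the event-level data returns one row per event again, and because both tables contain columns called deaths_a, deaths_b and deaths_civilians, the join produces duplicated .x/.y columns. Renaming the totals first avoids that.

```r
# Invented toy data: three events for actor 14, one for actor 20
panel_data <- data.frame(
  side_a_id        = c(14, 14, 14, 20),
  year_month       = c("1989-04", "1989-04", "1989-05", "1989-04"),
  event_id         = 1:4,
  deaths_a         = c(1, 2, 0, 5),
  deaths_b         = c(0, 3, 1, 0),
  deaths_civilians = c(4, 0, 2, 1)
)

# One row per actor and month. Note that the formula interface drops
# events with an NA in any of the three death columns
monthly <- aggregate(
  cbind(deaths_a, deaths_b, deaths_civilians) ~ side_a_id + year_month,
  data = panel_data,
  FUN  = sum
)

# If the totals should sit next to the event-level rows, give them
# distinct names before merging so nothing clashes
death_cols <- c("deaths_a", "deaths_b", "deaths_civilians")
names(monthly)[match(death_cols, names(monthly))] <- paste0(death_cols, "_month")
rebel <- merge(panel_data, monthly, by = c("side_a_id", "year_month"))
rebel
```

If the goal is only the one-row-per-month table, monthly is already the final result and no join is needed; other actor-level variables could be added to the right-hand side of the formula as extra grouping columns, provided they are constant within each actor and month.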