Adding row values for specific entries in R

I have budget data on a set of districts. I also have a district, DH, that had 2 additional regions merged into it after 2012. The budget values are given separately in the data frame for the year 2011 for the three parts that were later merged into one. I want to add those values into the district DH's values for the year 2011.
I know I can use column sums, but I don't know how to apply a column sum to all variables under an if/else-style condition:
columnSums(df) if District==1 | District==2
The above code is definitely not going to work because it is not in the correct form, but it is the basic gist of what I want: sum all variables for districts 1 and 2 and add them to the values of district 'DH'.

You have to alter the district column or create a new one that identifies the districts that belong together. Here is some pseudocode:
library(dplyr)

df %>%
  mutate(District = if_else(District == 2, 1, District)) %>%  # fold district 2 into district 1
  group_by(District) %>%
  summarise(col_to_sum = sum(col_to_sum))
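If every budget column needs to be summed rather than a single one, the same idea works with across(). This is only a sketch: the question does not show the actual columns, so it assumes District is stored as character (with values like "1", "2", "DH") and that there is a Year column.

library(dplyr)

df %>%
  mutate(District = if_else(District %in% c("1", "2"), "DH", District)) %>%  # fold the merged regions into DH
  group_by(Year, District) %>%                                               # Year column is an assumption
  summarise(across(where(is.numeric), sum), .groups = "drop")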

Related

How to set a different "n" value than data frame length for prop.table (or a similar proportion/percentage table function)?

I have a dataframe of length 3000 with different occupations of 2500 different people (many respondents have multiple jobs). There is no ID var, it is just 1 column (Occupation) with the list of occupations (e.g., lobbyist, teacher, teacher, lobbyist, government employee, etc.).
I would like to see what percentage of my n = 2500 respondents holds each occupation. Since many people have multiple jobs, the percentages should add up to more than 100%.
Here is the proportion table I created; however, it bases the calculations on n = 3000. Is there a way to set prop.table to n = 2500? If not, is there another function I should use?
This is my code:
library(dplyr)

# Create the proportion table
Occupation_Perc <- t(prop.table(table(NewData$Occupation))) #* 100

# Filter out uncommon occupations (I'm only interested in common ones)
Occupation_Perc <- data.frame(Occupation_Perc) %>%
  filter(Freq > .01)

# Drop the unnecessary column produced by prop.table, and rename the other column
Occupation_Perc <- as.data.frame(Occupation_Perc) %>%
  select(-Var1) %>%
  rename(Occupation = Var2)
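One way to base the percentages on the number of respondents rather than the number of rows is to skip prop.table and divide the raw counts by the known respondent count yourself. A minimal sketch, using the n = 2500 stated in the question:

library(dplyr)

n_respondents <- 2500  # known number of respondents, from the question

Occupation_Perc <- as.data.frame(table(NewData$Occupation)) %>%
  rename(Occupation = Var1) %>%
  mutate(Freq = Freq / n_respondents) %>%  # proportion of respondents, not of rows
  filter(Freq > .01)                       # keep occupations held by more than 1% of respondents

Because respondents can hold several jobs, these proportions can legitimately sum to more than 1.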

How do I summarize the actual values of a data column instead of just the number of rows?

I have a conflict record where each entry/row shows a conflict event. Each event also shows the number of deaths. The dataset covers the years 2017-2022 and 10 different regions. Thus, there are different numbers of entries shown for each year and region. I now want to generate a new data frame that shows me one entry per year for each region, in which all deaths of that year in that precise region are added & shown (one row per year per region). Basically, as a result, each region has 6 entries (one for each year).
I know this command combination to get the number of entries displayed:
data_mali %>%
  dplyr::group_by(admin1, year) %>%
  dplyr::summarise(dplyr::n())
"
However, I now need the actual values of the entries summed. How do I do that?
summarise() works much like mutate(): you can create new variables using functions like sum(), mean(), and first(), except that each group is collapsed to a single row.
Here is a minimal reproducible example:
library(dplyr)

data(mtcars)  # example dataset

mtcars %>%
  group_by(cyl, wt) %>%
  summarise(mpg = mean(mpg),
            gear = first(gear),
            total_hp = sum(hp))
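Applied to the conflict data above, the same pattern sums the fatalities per region and year. The name of the deaths column is an assumption, since it is not shown in the question:

library(dplyr)

data_mali %>%
  group_by(admin1, year) %>%
  summarise(total_deaths = sum(deaths), .groups = "drop")  # 'deaths' is the assumed column name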

Adding rows to data frame with zero values

I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message, and the new row appears in the table printed below the command), but previously added rows where Freq = 0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
Using the method suggested in the comment above, one can use a join and then replace_na to create a tibble with the complete country list and give the missing countries a count of zero.
As there was no sample data in the question, I created two data frames below based on what I thought was implied by the question.
library(dplyr)
library(tidyr)

dfrm_counts = tibble(Country = c('England', 'Germany'),
                     Count = c(1, 4))

dfrm_all = tibble(Country = c('England', 'Germany', 'France'))

dfrm_final = dfrm_counts %>%
  right_join(dfrm_all, by = "Country") %>%
  replace_na(list(Count = 0))

dfrm_final
# A tibble: 3 x 2
Country Count
<chr> <dbl>
1 England 1
2 Germany 4
3 France 0
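As a side note on the original approach: add_row() returns a modified copy rather than changing dfmap in place, so unless each result is assigned back, earlier additions are lost, which would explain the disappearing rows described in the question. A small sketch:

library(tibble)  # add_row() comes from the tibble package

# Assign the result back so the added row is kept
dfmap <- add_row(dfmap, Var1 = "AND", Freq = 0)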

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)

dataset = data.frame(Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
                     YearN = c(2015, 2016, 2017, 2018, 2019,
                               2015, 2016, 2017, 2018, 2019,
                               2015, 2016, 2017, 2018,
                               2015, 2016, 2017, 2018,
                               2015, 2016, 2017,
                               2015, 2016, 2017),
                     Question = c(rep('Overall Satisfaction', 5),
                                  rep('Overall Cleanliness', 5),
                                  rep('Overall Satisfaction', 4),
                                  rep('Overall Cleanliness', 4),
                                  rep('Overall Satisfaction', 3),
                                  rep('Overall Cleanliness', 3)),
                     ScoreYearN = runif(24, min = 0.6, max = 1),
                     TotalYearN = round(runif(24, min = 1000, max = 5000), 0))
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year.
The second column contains the total number of respondents for the given question in the given hospital for the previous year.
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
You're close! Just use dplyr to group by hospital and question before lagging:
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
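One caveat: lag() works purely on row order, so if the rows are not already sorted by year within each Hospital/Question group, an explicit arrange() makes the result robust. A sketch of the same pipeline with that safeguard:

library(dplyr)

dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  arrange(YearN, .by_group = TRUE) %>%   # ensure years are in order before lagging
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN)) %>%
  ungroup()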

Return lowest 5 values and IDs for every variable in large df in R

I have a large dataframe (4631 rows x 2995 cols). The rows represent the zip codes of all the hospitals in the US and the columns represent the zip codes of patients. I have calculated the distance between the patient's home zips and the hospitals so that each cell value is a numeric value representing the miles between each patient's home and each hospital.
An example df is:
       10960  11040  56277   55379
37160 674.14 238.04  25.89    5.31
37091 162.62  71.25 428.56  672.11
89148 931.31   0.03 389.25 1000.05
91776  15.05 508.74 315.61  101.01
What I want to do now is extract the lowest five values for each patient, which would represent the five closest hospitals for each patient. But not only do I need to extract the cell values but I also need the row names so I can know which zip codes those hospitals are in.
So for example, if I was only looking for the lowest two values for each patient/column, I would like to know that for patient 10960 the closest hospital is 15.05 miles away and is in the 91776 zip code, and the second closest hospital is 162.62 miles away and is in the 37091 zip code.
I have this data transposed so if it would be easier to do this by swapping the rows and columns that's fine by me. I don't need the code to do that.
I've found ways to get the lowest values using apply-style functions, but they don't give me the corresponding zip codes.
I would appreciate any help!
Thanks!
Something like this should do the trick:
library(dplyr)
library(tidyr)
df %>%
  mutate(hospital = rownames(.)) %>%
  gather("patient", "distance", -hospital) %>%
  group_by(patient) %>%
  arrange(distance) %>%
  slice(1:5) %>%
  ungroup()
First add a hospital column from the rownames; then, in the gather step, the distance columns are turned into rows: each column name becomes an entry in the new patient column and the distances become the distance column. group_by and arrange sort the distances within each patient, and slice takes the first 5 rows of each. The ungroup isn't required, but it's nice to undo the group_by once the grouping is no longer needed.
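With newer tidyr/dplyr versions, the same idea can be written with pivot_longer() and slice_min(), which have superseded gather() and the arrange/slice pair. An equivalent sketch:

library(dplyr)
library(tidyr)

df %>%
  mutate(hospital = rownames(.)) %>%
  pivot_longer(-hospital, names_to = "patient", values_to = "distance") %>%
  group_by(patient) %>%
  slice_min(distance, n = 5) %>%  # keep the 5 smallest distances per patient
  ungroup()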
maybe this would work:
library(dplyr)
test <- lapply(seq_along(df), function(i) {
  ord <- order(df[[i]])  # hospitals sorted by distance to patient i
  tibble(HospitalZipCode = rownames(df)[ord][1:5],
         Distance = df[[i]][ord][1:5],
         Order = 1:5,
         PatientID = names(df)[i])
}) %>% bind_rows()
This should give you a table with 5 rows per patient. I added a column for the order of the hospitals (1 for closest, 2 for second, etc.)
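For testing either answer, the example table from the question can be reproduced as a data frame like this (check.names = FALSE keeps the numeric zip codes as column names):

# Example data from the question: rows are hospital zips, columns are patient zips
df <- data.frame(
  "10960" = c(674.14, 162.62, 931.31, 15.05),
  "11040" = c(238.04, 71.25, 0.03, 508.74),
  "56277" = c(25.89, 428.56, 389.25, 315.61),
  "55379" = c(5.31, 672.11, 1000.05, 101.01),
  row.names = c("37160", "37091", "89148", "91776"),
  check.names = FALSE
)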
