In my dataframe, see table attached here, I have three columns: country, results and eurosceptic.
I would like to know if it is possible to merge rows that are identical in every column except the results and eurosceptic columns.
For example, a function that would leave me with two rows for Belgium: one in which the eurosceptic value is 1 and one where it is 0. The results column in each of those rows would then be the sum of the results of the former rows that shared that eurosceptic value.
So, the eurosceptic = 0 Belgium row would have its results observation equal to the sum of the results observations of the rows in my current table that relate to Belgium and all have eurosceptic = 0.
In short, I want to transform my df into one with two rows per country (eurosceptic 0 and 1), where the results observation for each is the sum of the results observations of the previous rows with the corresponding country and eurosceptic values.
Is this possible?
Thanks for your help in advance!
My table as it is now
We can group by 'Country' and 'eurosceptic' and get the sum of 'results':
library(dplyr)
df1 %>%
  group_by(Country, eurosceptic) %>%
  summarise(results = sum(results))
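For example, with a small made-up data frame (column names taken from the question), this collapses Belgium to one row per eurosceptic value, with the results summed:
library(dplyr)

df1 <- data.frame(Country = c("Belgium", "Belgium", "Belgium"),
                  results = c(10, 5, 7),
                  eurosceptic = c(0, 0, 1))

df1 %>%
  group_by(Country, eurosceptic) %>%
  summarise(results = sum(results), .groups = "drop")
# Belgium / eurosceptic 0 -> results 15; Belgium / eurosceptic 1 -> results 7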
My data frame (df) has the columns ID, Adm_Date, ICD_10 and points, and it has 1,000,000 rows.
points represents the value of an ICD_10 code.
Each ID has many rows.
Adm_Date ranges from 2010-01-01 to 2018-01-01.
I want the sum of points, without duplicates, over the rows from each Adm_Date back to 2 years before that Adm_Date, for each ID.
The periods are like these:
01-01-2010 to 31-01-2012,
01-02-2010 to 29-02-2012,
01-03-2010 to 31-03-2012, and so on up to the last period, 01-12-2016 to 31-12-2018.
My problem is with the date filter. It does not filter the rows based on the period; it sums points for each ID, without duplicates, over all data from 2010 to 2018 instead of summing them per period for each ID.
I used this code:
library(dplyr)
library(lubridate)

start.date <- seq(as.Date(df$Adm_Date))
end.date <- seq(as.Date(df$Adm_Date + years(-2)))

Sum_df <- df %>%
  dplyr::filter(Adm_Date >= start.date & Adm_Date <= end.date) %>%
  group_by(ID) %>%
  mutate(sum_points = sum(points * !duplicated(ICD_10)))
but the filter did not work: it sums points for each ID across all dates from 2010 to 2018 instead of summing them per period for each ID.
sum_points should start from 01-01-2012; for any Adm_Date >= 01-01-2012 I need to get the sum.
For example, for the patient with ID = 11, I would sum points from row 3 to row 23, and I also need to ignore repeated ICD_10 codes (e.g. G81 and I69 are repeated in this period), so the result would look like this:
ID(11), Adm_Date(07-05-2012), sum_points(17). For the same patient at Adm_Date(13-06-2013), I would sum from row 11 to row 27, because I look back 2 years from the Adm_Date. So:
ID(11), Adm_Date(13-06-2013), sum_points(14.9)
I have about half a million IDs and more than a million rows.
I hope I explained it well. Thank you
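One possible sketch of the look-back sum, under assumptions: df has columns ID, Adm_Date (class Date), ICD_10 and points, and the hypothetical helper sum_window sums, for every admission, that ID's points over the 2 years ending at that Adm_Date, counting each ICD_10 only once. Looping row by row like this may be slow on a million rows; a data.table non-equi join would scale better.
library(dplyr)
library(lubridate)
library(purrr)

# Hypothetical helper: for each date d, sum points over the window (d - 2 years, d],
# keeping only the first occurrence of every ICD_10 code in that window.
sum_window <- function(dates, icd, pts) {
  map_dbl(dates, function(d) {
    in_window <- dates > (d - years(2)) & dates <= d
    keep <- !duplicated(icd[in_window])
    sum(pts[in_window][keep])
  })
}

Sum_df <- df %>%
  group_by(ID) %>%
  arrange(Adm_Date, .by_group = TRUE) %>%
  mutate(sum_points = sum_window(Adm_Date, ICD_10, points)) %>%
  ungroup()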
I have budget data on a set of districts. I also have a district, DH, that had 2 additional regions merged into it after 2012. The budget values are given separately in the data frame for the year 2011 for the three parts that were later merged into one. I want to add those values into the district DH's values for the year 2011.
I know I can use column sums, but I don't know how to apply a column sum across all variables with an if/else condition:
columnSums(df) if District==1 | District==2
The above code is definitely not going to work because it is not in the correct form, but it is the basic gist of what I want: sum all variables for districts 1 and 2 and add them to the values of the district 'DH'.
You have to alter the district column, or create a new one, that identifies the districts that belong together. Here is some pseudo code:
library(dplyr)
df %>%
  mutate(District = if_else(District == 2, 1, District)) %>%
  group_by(District) %>%
  summarise(col_to_sum = sum(col_to_sum))
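Since the question asks to sum all variables, a hedged variant with across() could look like the sketch below. It assumes District is a character column, that a Year column exists (so 2011 stays separate from other years), and that every other column is a numeric budget variable:
library(dplyr)

df_merged <- df %>%
  mutate(District = if_else(District %in% c("1", "2"), "DH", District)) %>%
  group_by(Year, District) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")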
I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset <- data.frame(Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
                      YearN = c(2015, 2016, 2017, 2018, 2019,
                                2015, 2016, 2017, 2018, 2019,
                                2015, 2016, 2017, 2018,
                                2015, 2016, 2017, 2018,
                                2015, 2016, 2017,
                                2015, 2016, 2017),
                      Question = c(rep('Overall Satisfaction', 5),
                                   rep('Overall Cleanliness', 5),
                                   rep('Overall Satisfaction', 4),
                                   rep('Overall Cleanliness', 4),
                                   rep('Overall Satisfaction', 3),
                                   rep('Overall Cleanliness', 3)),
                      ScoreYearN = runif(24, min = 0.6, max = 1),
                      TotalYearN = round(runif(24, min = 1000, max = 5000), 0))
MY OBJECTIVE
To add two columns to the dataset such that:
The first column contains the score for the given question in the given hospital for the previous year.
The second column contains the total number of respondents for the given question in the given hospital for the previous year.
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
This gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
You're close! Just use dplyr to group by Hospital and Question:
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
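If the rows are not already ordered by year within each group (they are in the example data), it may be safer to sort first; a small variation of the same idea:
dataset_lagged <- dataset %>%
  arrange(Hospital, Question, YearN) %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN)) %>%
  ungroup()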
Just wondering if there is a way to expand rows which have multiple observations into rows of unique observations using R? I have data in an Excel spreadsheet with the variable headings: Lease, Line, Bay, Date, Predators, Food.Index, DD, MM, YY.
On some dates, there have been multiple predators (from 1 to 4) recorded in the same row. Other days just have 0. On a day where 4 predators have been recorded, I would like to somehow transform the data to show four unique observations (instead of one row with 4 recorded under "Predators").
I have 1669 rows of data and multiple rows need to be expanded.
Example of Data set
Many thanks for your help in advance.
Assuming you have your data in a data.frame, df, one possible solution would be
df.expanded <- df[rep(row.names(df), df$Predators), ]
EDIT: If you also want to keep the rows with 0 predators, you can use pmax to always return at least one:
df.expanded <- df[rep(row.names(df), pmax(df$Predators, 1)),]
Here pmax(df$Predators, 1) returns the elementwise maximum of df$Predators and 1, i.e. a new vector in which each element is at least 1 but takes the value of df$Predators when that value is greater than 1.
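A hedged alternative, if you prefer tidyverse tools (assuming the column is named Predators as above), is tidyr::uncount, which repeats each row according to a weight:
library(tidyr)

df.expanded <- uncount(df, pmax(Predators, 1))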
I have a dataframe with multiple columns and I want to apply different functions on each column.
An example of my dataset -
I want to calculate the count of column pq110a for each country in the qcountry2 column (me = Mexico, br = Brazil, ar = Argentina). The problem I face is that I have to filter on these columns; for example, for the sample patients I want:
The count of pq110 when the values are 1 and 2 (for some patients)
The count of pq110 when the value is 3 (for other patients)
Similarly when the value is 6.
For the total patients I want the total count of pq110.
The output I am expecting is attached.
Similarly, I want this output for each country.
Please suggest how I can do this for the other columns as well, country-wise.
Thanks !!
I guess what you want to do is count how many rows of 'pq110' have each value within each 'qcountry2'.
So I'll use 'tapply' to split the data into subsets by country and then use 'table' to count the rows for each distinct value.
tapply(my_data[, "pq110"], INDEX = as.factor(my_data[, "qcountry2"]), FUN = table)
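A hedged dplyr alternative for the specific counts described above (assuming the column is pq110 and the values of interest are 1/2, 3 and 6):
library(dplyr)

my_data %>%
  group_by(qcountry2) %>%
  summarise(n_1_2   = sum(pq110 %in% c(1, 2), na.rm = TRUE),
            n_3     = sum(pq110 == 3, na.rm = TRUE),
            n_6     = sum(pq110 == 6, na.rm = TRUE),
            n_total = sum(!is.na(pq110)))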