Loop of missing data by year - r

I need to run a loop for my dataset, ds. The dim of ds is 4000 x 11. Each country of the world is represented, and each country has data from 1970 to 1999.
The data set has missing data in 8 of its columns. I need to run a loop that calculates how much missing data there is per year; the year is in ds$year.
I am pretty sure the years (1970, 1971, 1972...) are numeric values.
This is my current code
missingds <- c()
for (i in 1:length(ds)) {
  missingds[names(ds)[i]] <- sum(is.na(ds[i])) / 4000
}
This gives me the proportion of missing data per variable of ds. I just cannot figure out how to get it to report the proportion across all the variables per year.
I do have an indicator variable ds$missing which reports 1 if there is an NA in any of the columns of that row and 0 if not.
A picture of ds

To count the number of NA values in each column per year using dplyr you can do:
library(dplyr)
result <- data %>%
  group_by(Year) %>%
  summarise(across(gdp_growth:polity, ~ sum(is.na(.))))
In base R you can use aggregate (note na.action = NULL, so that rows containing NAs are not dropped before they can be counted):
aggregate(cbind(gdp_growth, gdp_per_capita, inf_mort, pop_density, polity) ~ year,
          data, function(x) sum(is.na(x)), na.action = NULL)
Replace sum with mean if you want the proportion of NA values in each year.
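For example, the dplyr call above then gives per-variable proportions by year:
# Proportion of NA values per variable per year
data %>%
  group_by(Year) %>%
  summarise(across(gdp_growth:polity, ~ mean(is.na(.))))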

Using data.table:
library(data.table)
setDT(data)[, lapply(.SD, function(x) sum(is.na(x))),
            by = Year, .SDcols = gdp_growth:polity]
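The calls above report one figure per variable per year. If instead you want a single proportion per year pooled over all variables, here is a rough base R sketch, assuming the year column is ds$year and the 0/1 indicator is the ds$missing described in the question:
# Share of rows per year that contain at least one NA (uses the 0/1 indicator)
aggregate(missing ~ year, data = ds, FUN = mean)

# Share of missing cells per year, pooled over all data columns
# (here assuming every column other than year is a data column)
vars <- setdiff(names(ds), "year")
tapply(rowSums(is.na(ds[vars])), ds$year,
       function(x) sum(x) / (length(x) * length(vars)))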

Related

R: Create column showing days leading up to / since the maximum value in another column was reached?

I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group, coded as zero.
How many days each measurement (in that same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is coded -2, and a measurement 5 days after the maximum is coded 5.
Here is an example of what I'm aiming for:
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and managed to get the maximum for each group with group_by, arrange(desc), and slice. I then recoded those maxima to zero and joined this dataframe with my original dataframe. However, I cannot manage to produce the 'sequence' of days leading up to and days from the maximum.
EDIT: sorry, I didn't include a reprex. This is the code I have used so far.
To find the maximum value, first order by date:
data <- data[with(data, order(G, Date)), ]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(c(G)), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by = c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in the New column as NA, but not yet the number of days leading up to and since the maximum. I have no idea how to get to this.
It would help if you could provide the data using dput() in your question, as opposed to an image.
It looked like you wanted to group_by(Group) in your example to compute the number of days before and after the maximum date in a Group. However, you have an ID of 3 in a Group of A, which suggests otherwise and maybe could be clarified.
Here is one approach using the tidyverse that I hope will be helpful. After grouping and arranging by Date, you can take the difference between each Date and the Date where G is at its maximum (the first maximum detected in date order).
Also note that as.numeric is included to return a plain number, as the result for New is otherwise a difftime (e.g., "7 days").
library(tidyverse)
data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
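A small made-up example (hypothetical values, only to show the output) illustrates the idea:
# Hypothetical toy data
data <- tibble(
  ID    = c(1, 1, 1, 2, 2),
  Group = c("A", "A", "A", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-03", "2020-01-06",
                    "2020-01-01", "2020-01-05")),
  G     = c(10, 50, 20, 5, 40)
)

data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
# Group A peaks on 2020-01-03, so its rows get New = -2, 0, 3
# Group B peaks on 2020-01-05, so its rows get New = -4, 0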

R - Creating a new column based on a conditional observation and applying it to the master df

I have a very large dataframe (with ~15 million observations of 10 variables). The df is essentially results for a set of cities under various scenarios (conditions). Here is a simplified view of the df:
State  City        Result  Year  Condition1  Condition2  Condition3
AL     Cottonwood  4.5     2000  p5          a10         d20
...
AL     Cottonwood  2.5     2010  p10         a20         d50
I am trying to create a new column ("base") that is equal to a given city's result under the various scenarios for the year 2000. Because of the many scenarios, I am having a lot of difficulty doing this.
Thank you!
So you want a comparison on each row against those same conditions, but for the year 2000?
The way I would go about it would be to join the dataframe onto itself, filtered to the year 2000. Assuming your dataframe is called df:
library(dplyr)
df_base <- df %>% left_join(
  df %>%
    filter(Year == 2000) %>%   # get just the year 2000 results
    select(-Year) %>%          # remove Year so that it does not join on it
    rename(base = Result)      # rename the Result column of the filtered dataframe to base
)
This will join by all the other columns that aren't Year, meaning the same State and City and all your conditionals, and return the full dataframe with a new column called "base" holding the year 2000 result for that state + city + conditions. If there are other columns you don't wish to join on, you can either remove them in the select or specify the join columns explicitly with the by argument of left_join.
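If you prefer to spell out the join keys rather than rely on joining by all common columns, the same idea can be written with an explicit by (using the column names from the example above):
# Explicit keys instead of relying on "join by all common columns"
df_base <- df %>%
  left_join(
    df %>%
      filter(Year == 2000) %>%
      select(State, City, Condition1, Condition2, Condition3, base = Result),
    by = c("State", "City", "Condition1", "Condition2", "Condition3")
  )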
Alternatively, in base R, consider ave to calculate across records that share the same groups, having Result return itself with identity():
# YEAR 2000 CALCULATION
df$Base <- with(df, ifelse(Year == 2000,
                           ave(Result, Condition1, Condition2, Condition3, FUN = identity),
                           NA))

# ASSIGN 2000 RESULT TO ALL OTHER YEARS
df$Base <- with(df, ave(Base, Condition1, Condition2, Condition3,
                        FUN = function(x) max(x, na.rm = TRUE)))
Not sure of performance across ~15 mill obs.

R: replace identical rows with average

I have data which looks like this:
patient day response
Bob "08/08/2011" 5
However, sometimes we have several responses for the same day (from the same patient). For all such rows, I want to replace them with a single row, where the patient and the day are whatever they happen to be for those rows, and the response is their average.
So if we also had
patient day response
Bob "08/08/2011" 6
then we'd remove both these rows and replace them with
patient day response
Bob "08/08/2011" 5.5
How do I write code in R to do this for a data frame that spans tens of thousands of rows?
EDIT: I might need the code to generalize to several covariates. So, for example, apart from day we might also have "location", in which case we'd only want to average the rows that correspond to the same patient on the same day at the same location.
The required output can be obtained with:
aggregate(a$response, by = list(patient = a$patient, day = a$day), FUN = mean)
You can do this with the dplyr package pretty easily:
library(dplyr)
df %>%
  group_by(patient, day) %>%
  summarize(response_avg = mean(response))
This groups by whatever variables you choose in the group_by so you can add more. I named the new variable "response_avg" but you can change that to what you want also.
Just to add a data.table solution, in case any reader is a data.table user:
library(data.table)
setDT(df)
df[, response := mean(response, na.rm = TRUE), by = .(patient, day)]
df <- unique(df) # remove the now-duplicated rows
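For the edit about extra covariates, either solution extends by adding the column to the grouping. For example, with the dplyr version (location being the hypothetical extra column from the edit):
# Average within patient x day x location instead of just patient x day
df %>%
  group_by(patient, day, location) %>%
  summarize(response_avg = mean(response))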

using aggregate to generate report based on multiple categories in r

I have a .dbf containing roughly 2.8 million records that contain residential parcel data with a year built category field, a county code field, and a windzone field (for building code restrictions). There are 3 year built categories and 5 wind zones. I need to get the number of parcels for each year built category in each windzone for each county. Basically I have a county (CNTY_ID = 11) with three year built categories (BUILT_CAT = "1" , "2" , "3") each that are also assigned to one of five windspeed categories (WINDSPEED = "100", "110", "120", etc.). I think I need to use the aggregate() function but haven't had any luck. Optimally the generated table would look something like:
CNTY_ID = 11
                BUILT_CAT
WINDSPEED       1    2    3
      100       x    x    x
      120       x    x    x
      ...
      150       x    x    x

CNTY_ID = 12
                BUILT_CAT
WINDSPEED       1    2    3
      100       x    x    x
      120       x    x    x
      ...
      150       x    x    x
Is this kind of task possible to accomplish?
Actually, you're better off using table; that's less hassle and more performant. You get an array back, and that is easily converted to a data frame.
Some test data:
n <- 10000
df <- data.frame(
  windspeed = sample(c(110, 120, 130), n, TRUE),
  built_cat = sample(c(1, 2, 3), n, TRUE),
  cnty_id   = sample(1:20, n, TRUE)
)
Constructing the table and converting to a data frame:
tbl <- with(df, table(windspeed, built_cat, cnty_id))
as.data.frame(tbl)
Note that I use with here so I have the variable names automatically as the dimnames of my table. That helps with the conversion.
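To view the counts one county at a time in roughly the layout sketched in the question, you can slice the array, or flatten the whole thing with ftable (the county label "11" is just an example):
# One WINDSPEED x BUILT_CAT slice per county
tbl[, , "11"]

# Or a compact printout of the whole array
ftable(tbl, row.vars = c("cnty_id", "windspeed"), col.vars = "built_cat")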
What you essentially need is a way to group your data.
I think dplyr is the way to go. You can use aggregate too.
Using dplyr
library(dplyr)
library(datasets)
temp <- airquality %>%
  group_by(Month, Day) %>%
  summarise(TOT = sum(Ozone))
View(temp)
This gives you the data in a normalized format, grouped first by Month and then by Day of the month, with the provided variable (Ozone in this case) summed. You can also count the values by using length instead.
Using aggregate
temp2 <- aggregate(Ozone ~ Month + Day, data = airquality, sum)
View(temp2)
The key difference between the approaches is the treatment of NAs. Base R's default handling here is not very intuitive: the formula interface of aggregate drops rows containing NA (na.action = na.omit), so a grouped entity whose values are all NA disappears from the result, whereas the dplyr sum returns NA for any group that contains an NA.
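If ignoring the NAs is what you actually want (an assumption about the desired output), both calls can be adjusted:
# dplyr: drop NAs inside the sum
airquality %>%
  group_by(Month, Day) %>%
  summarise(TOT = sum(Ozone, na.rm = TRUE))

# aggregate: keep NA rows (na.action) and let sum() skip them
aggregate(Ozone ~ Month + Day, data = airquality, FUN = sum,
          na.rm = TRUE, na.action = na.pass)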
There are other group-by approaches using data.table or ddply; you can also achieve this with plyr or tapply.

Counting Frequencies Using (logical?) Expressions

I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive data set of approximately 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and similarly 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, e.g. for CPUELE25399, for 1979, for month = 1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month per year) that the value >= that ID's 99th percentile.
I have tried at least 100 different approaches to this, but I think I am fundamentally misunderstanding something, maybe in the syntax. This is the snippet of code that has gotten me the farthest:
ddply(Total,
      c('Latitude', 'Longitude', 'ID', 'Year', 'Month'),
      function(x) c(Threshold = quantile(x$Value, probs = .99, na.rm = TRUE),
                    Frequency = nrow(x$Value >= quantile(x$Value, probs = .99, na.rm = TRUE))))
R throws a warning message saying that >= is not meaningful for factors.
If anyone out there understands this convoluted message, I would be supremely grateful for your help.
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month per year) that the value >= that ID's 99th percentile.
Does this mean you want to
calculate the 99th percentile for each ID (i.e. disregarding month year etc), and THEN
work out the number of times you exceed this value, but now split up by month and year as well as ID?
(note: your example code groups by lat/lon but this is not mentioned in your question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places).
In that case, you can use ddply to calculate the per-ID percentile first:
# calculate percentile for each ID
Total <- ddply(Total, .(ID), transform, Threshold = quantile(Value, probs = .99, na.rm = TRUE))
And now you can group by (ID, month and year) to see how many times you exceed:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq=sum(Value >= Threshold))
Note that summarize will return a dataframe with only as many rows as there are unique combinations of (ID, Month, Year), i.e. it will drop the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize; the Freq will then be repeated across all the different (Lat, Lon) rows within each (ID, Month, Year) combination.
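For reference, the transform version mentioned above would be (ddply is from the plyr package, as in the question):
# keeps Latitude/Longitude; Freq repeats within each (ID, Month, Year)
Total <- ddply(Total, .(ID, Month, Year), transform,
               Freq = sum(Value >= Threshold))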
Notes on ddply:
You can write .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done.
If you just want to add extra columns, using summarize, mutate, or transform lets you do it cleanly without having to write Total$ in front of every column name.
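As an aside, the warning that >= is not meaningful for factors suggests Value may have been read in as a factor; if so (an assumption about how the file was imported), convert it to numeric before running the code above:
# as.character first, otherwise you get the underlying level codes
Total$Value <- as.numeric(as.character(Total$Value))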
