R: Recoding several characters into one new factor

I am new to R, and could not find specific help for my question on this site.
I have (among others) ten character variables in my dataframe $grant_database, country_1 through country_10. Each contains either a country code, for example E20, F27 or G10, or an NA. Each case is a grant to a project. The ten country variables specify which country/countries a grant is benefitting. In my dataframe, most, but not all cases will have at least one country code, first marked in country_1, many will have one for country_2 as well, and some even for country_3 to _10. All empty fields are marked with an NA.
id country_1 country_2 country_3 country_4 country_5 country_6 ...new_binaryvar
1 F20 NA NA NA NA NA 0
2 E12 E17 E52 NA NA NA 0
3 O62 O33 NA NA NA NA 0
4 E21 E20 NA NA NA NA 1
5 NA NA NA NA NA NA 0
...
I wish to create a new factor flagging grants which benefit a defined subset of countries. This binary "dummy" variable should give the value "1" to each case that in at least one of the ten country variables corresponds with a list of country codes. It should give "0" to each case/grant that does not have a corresponding country code in any of its ten country variables. Let this subset of country codes to be flagged be: E20, F27 and G10 (in reality, there are about 40 to be flagged, from 150+).
Would you help me out by suggesting a way to program this? Thank you very much for your help!

Assuming you want to check whether any of a set of country codes appears in the "country" variables of each row, so that a row gets "1" if at least one code is present and "0" otherwise: create a vector (v1) of the country codes to check. Convert the dataset (df) to a matrix after removing the "id" column (as.matrix(df[,-1])) and compare it with "v1" using %in%, which returns a plain logical vector. Turn that vector back into a matrix by assigning it the dimensions of df[,-1] (dim<-), here c(5,7). Then take rowSums, double negate (!!) to turn the counts into TRUE/FALSE, and finally add 0 to get the binary dummy variable.
v1 <- c('E20', 'F27', 'G10')
(!!rowSums(`dim<-`(as.matrix(df[,-1]) %in% v1, dim(df[,-1]))))+0
#[1] 0 0 0 1 0
newdata
df <- structure(list(id = 1:5, country_1 = c("F20", "E12", "O62", "E21",
NA), country_2 = c(NA, "E17", "O33", "E20", NA), country_3 = c(NA,
"E52", NA, NA, NA), country_4 = c(NA, NA, NA, NA, NA), country_5 = c(NA,
NA, NA, NA, NA), country_6 = c(NA, NA, NA, NA, NA), country_7 = c(NA,
NA, NA, NA, NA)), .Names = c("id", "country_1", "country_2",
"country_3", "country_4", "country_5", "country_6", "country_7"
), class = "data.frame", row.names = c(NA, -5L))
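A more step-by-step variant of the same idea, as a minimal sketch: test each country column against the code list and combine the per-column results with a logical OR. The trimmed-down data frame and the names country_cols and flag are made up for illustration; only base R functions are used.

```r
# Illustrative data: a shortened version of the example above
df <- data.frame(
  id = 1:5,
  country_1 = c("F20", "E12", "O62", "E21", NA),
  country_2 = c(NA, "E17", "O33", "E20", NA),
  country_3 = c(NA, "E52", NA, NA, NA),
  stringsAsFactors = FALSE
)
v1 <- c("E20", "F27", "G10")  # codes to flag

# Pick the country columns by name instead of by position
country_cols <- df[grep("^country_", names(df))]

# One logical vector per column (NA never matches), combined row-wise with OR
flag <- as.integer(Reduce(`|`, lapply(country_cols, `%in%`, v1)))

# Store the result as a factor, as asked for in the question
df$new_binaryvar <- factor(flag, levels = c(0, 1))
```

Selecting the columns by name means the code does not depend on the number of rows or on the position of the id column.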

Related

filtering rows that only contain certain values among multiple columns in R

I have the following data, where a single row (an arrest event) can have up to 5 charges associated with it.
#statutes 1, 2, and 3 are marijuana related
#statute 4 is paraphernalia related.
caseID <- c("1", "2", "3", "4", "5", "6", "7", "8")
date <- c("2017-01-01", "2017-01-12", "2018-03-23", "2019-10-12", "2018-11-22", "2018-01-01", "2017-02-01", "2017-02-20")
charge1 <- c("Statute4", "Statute12", "Statute1", "Statute3", "Statute3", "Statute158", "Statute2", "Statute1")
charge2 <- c(NA, "Statute1", "Statute3", "Statute44", "Statute4", "Statute4", NA, "Statute4")
charge3 <- c(NA, "Statute12", NA, "Statute4", NA, NA, NA, "Statute19")
charge4 <- c(NA, "Statute6", NA, NA, NA, NA, NA, NA)
charge5 <- c(NA, "Statute8", NA, NA, NA, NA, NA, NA)
df <- data.frame(caseID, date, charge1, charge2, charge3, charge4, charge5)
I want to filter out any rows where the following conditions are true:
a record contains ONLY marijuana charges (which are statute1, statute2, and statute3) OR
a record contains ONLY marijuana charges (statute1, statute2, and statute3) and paraphernalia charges (statute4).
So by this logic, the following cases would be excluded/not excluded:
CaseID 1 excluded (paraphernalia only)
CaseID 2 not excluded
CaseID 3 excluded (marijuana only)
CaseID 4 not excluded
CaseID 5 excluded (marijuana and paraphernalia only)
CaseID 6 not excluded
CaseID 7 excluded (marijuana only)
CaseID 8 not excluded
There are many many statutes in my real example, so I need to figure out a way to do this where I find rows that contain only a very specific set of statutes.
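This might work for you. Note that %>% comes from magrittr, which has to be loaded separately.
library(data.table)
library(magrittr) # provides %>%
# set to data.table
setDT(df)
# melt long
df_long = melt(df, id.vars=c("caseID", "date"))[!is.na(value)]
#count all charges
total_charges = df_long[,.(totalcharges =.N), by=caseID]
# return the subset of the original wide dataset, with target
# rows excluded; is.na(N) keeps cases that have none of the listed
# statutes (they get N = NA in the join and would otherwise be dropped)
df[df_long[value %chin% c("Statute1","Statute2","Statute3","Statute4")] %>%
.[, .N, caseID] %>%
.[total_charges, on=.(caseID)] %>%
.[is.na(N) | N!=totalcharges,.(caseID)], on="caseID"]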
Output
caseID date charge1 charge2 charge3 charge4 charge5
1: 2 2017-01-12 Statute12 Statute1 Statute12 Statute6 Statute8
2: 4 2019-10-12 Statute3 Statute44 Statute4 <NA> <NA>
3: 6 2018-01-01 Statute158 Statute4 <NA> <NA> <NA>
4: 8 2017-02-20 Statute1 Statute4 Statute19 <NA> <NA>
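For comparison, the same rule can be sketched in base R: keep a row only if it has at least one non-missing charge outside the allowed set. The names allowed, outside and kept are illustrative and not part of the answer above. Note that under this rule a row with no charges at all would also be dropped.

```r
# Example data from the question (date column omitted for brevity)
df <- data.frame(
  caseID  = as.character(1:8),
  charge1 = c("Statute4", "Statute12", "Statute1", "Statute3",
              "Statute3", "Statute158", "Statute2", "Statute1"),
  charge2 = c(NA, "Statute1", "Statute3", "Statute44",
              "Statute4", "Statute4", NA, "Statute4"),
  charge3 = c(NA, "Statute12", NA, "Statute4", NA, NA, NA, "Statute19"),
  stringsAsFactors = FALSE
)

# Statutes that, on their own, lead to exclusion
allowed <- c("Statute1", "Statute2", "Statute3", "Statute4")

# Matrix of charges; TRUE where a charge exists and is NOT in the allowed set
charges <- as.matrix(df[grep("^charge", names(df))])
outside <- matrix(!is.na(charges) & !(charges %in% allowed),
                  nrow = nrow(charges))

# Keep rows with at least one charge outside the allowed set
kept <- df[rowSums(outside) > 0, ]
```

With the example data this keeps cases 2, 4, 6 and 8, matching the output shown above.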

Add chronological number to a column if empty

I have a dataframe with a column called 's_nummer'. This column is sometimes NA and in that case, I would like to add a number myself that can range from 700001 to 800000. So in this case, row numbers 3 and 4 do not contain a value in the s_nummer column and I would like to add the values 700001 to row 3 and 700002 to row 4.
dput:
structure(list(s_nummer = c(599999, 599999, NA, NA), eerste_voornaam = c("Debbie",
"Debbie", "Debbie", "Debbie"), tussenvoegsel = c(NA, NA, NA,
NA), geslachtsnaam = c("Oomen", "Oomen", "Oomen", "Oomen")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Hope you can help!
Thanks in advance
You can use which with is.na to get the rows with NA in x$s_nummer (x being your data frame) and overwrite them with 700000 + seq_along(i).
i <- which(is.na(x$s_nummer))
x$s_nummer[i] <- 700000 + seq_along(i)
# s_nummer eerste_voornaam tussenvoegsel geslachtsnaam
#1 599999 Debbie NA Oomen
#2 599999 Debbie NA Oomen
#3 700001 Debbie NA Oomen
#4 700002 Debbie NA Oomen
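Since the question restricts the new numbers to the range 700001 to 800000, a guard against running out of numbers may be worth adding. The following is a minimal sketch of that idea; first_new and last_new are illustrative names, and it assumes no existing s_nummer already lies in that range.

```r
# Shortened version of the example data
x <- data.frame(
  s_nummer = c(599999, 599999, NA, NA),
  eerste_voornaam = "Debbie",
  stringsAsFactors = FALSE
)

first_new <- 700001  # first number to hand out
last_new  <- 800000  # last number allowed

# Rows that still need a number
i <- which(is.na(x$s_nummer))

# Stop early if there are more missing rows than available numbers
stopifnot(length(i) <= last_new - first_new + 1)

# Hand out consecutive numbers starting at first_new
x$s_nummer[i] <- first_new - 1 + seq_along(i)
```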

Computing Growth Rates

I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to do the following: create a variable showing the annual wage growth rate for each worker, or the lack thereof.
The practical issue that I am facing is that each observation is in one row and while the first worker joined the program in 1990, others might have joined in say 1993 or 1992. Therefore, is there a way to apply the growth rate for each worker depending on the specific years they worked, rather than applying a general growth formula for all observations?
My expected output for each row would be having a new column
average wage growth rate
1- 15%
2- 9%
3- 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
Here is one approach:
library(tidyverse)
growth <- df %>%
rowid_to_column() %>%
gather(key, value, -rowid) %>%
drop_na() %>%
arrange(rowid, key) %>%
group_by(rowid) %>%
mutate(yoy = value / lag(value)-1) %>%
summarise(average_growth_rate = mean(yoy, na.rm=T))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here is the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you can see that, e.g., in the first row there was neither growth nor decline. In the second row, there was an increase between the second and third year (45000 to 49500, i.e. 10%), but none between the first and second, which averages to 5%. For the third row, again absolutely no change. Etc...
Also, finally, to add these results to the initial dataframe, you would do e.g.
df %>%
rowid_to_column() %>%
left_join(growth)
And just to answer the performance question, here is a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original dataframe.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
Base R is the clear winner in terms of performance: the median for akrun() is roughly five times lower than for cj().
We can use base R with apply. Loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratio of each element to the previous one, minus 1.
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack
# tail(., -1) holds the current values, head(., -1) the previous ones
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
mean(tail(na.omit(x), -1)/head(na.omit(x), -1) -1))))[2:1]
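On the Inf mean and standard deviation mentioned in the question: one plausible cause (an assumption, since the full data is not shown) is a wage of 0 in some year, because dividing the next year's wage by 0 gives Inf in R. A minimal sketch of a row-wise helper that skips such non-finite ratios follows; growth_rate and the small wages matrix are made up for illustration.

```r
# Average year-over-year growth for one worker, ignoring missing years
# and ratios that are not finite (e.g. division by a zero wage)
growth_rate <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(NA_real_)   # need at least two observed years
  yoy <- x[-1] / x[-length(x)] - 1      # current / previous - 1
  yoy <- yoy[is.finite(yoy)]            # drop Inf, -Inf and NaN
  if (length(yoy) == 0) return(NA_real_)
  mean(yoy)
}

# Illustrative wages: one worker per row, one year per column
wages <- rbind(
  c(45000, 45000, 49500, NA),  # 0% then 10% growth
  c(NA, 0, 1000, 1100),        # first ratio is 1000 / 0 = Inf
  c(NA, NA, NA, 5000)          # only one observed year
)

rates <- apply(wages, 1, growth_rate)
```

Whether dropping these ratios is appropriate depends on what a zero wage means in the data; treating it as missing is only one option.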

Iterate through columns' suffixes in a for loop. R

I am trying to modify my dataset with a for loop. I want to modify certain cells of some columns depending on the value of its "paired" column. My dataset could be:
data1989 <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, 0.589, 0.120),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 0.589 0.447 NA 66.897 66.097 NA
3 1987-01-19 0.120 NA NA 90.599 NA NA
Columns are "paired" by the suffix of each column, so NDVI_1 is paired with pixelQA_1, and so on. I want to modify the values in the NDVI columns depending on the values in their "paired" pixelQA columns, as follows:
if pixelQA is NA -> then NDVI should also be NA.
if pixelQA is 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
if pixelQA is anything other than 66±0.5 OR 130±0.5 -> then NDVI is set to NA (this is bad quality data which needs to be ignored).
Applying these very simple rules my data should look like:
data1989clean <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, NA, NA),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989clean
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 NA 0.447 NA 66.897 66.097 NA
3 1987-01-19 NA NA NA 90.599 NA NA
To reach my goal I am trying the following for loop:
for(i in 1:4){
data1989$NDVI_[i] <- ifelse(data1989$pixelQA_[i] < 66.5 & data1989$pixelQA_[i] > 65.5 |
data1989$pixelQA_[i] < 130.5 & data1989$pixelQA_[i] > 129.5,
data1989$NDVI_[i], NA)
}
But so far it is not working, as the dataset output looks exactly the same as the original one. Any suggestion will be welcomed.
As suggested by @George Savva, you can achieve this by pivoting longer, correcting the data, and pivoting back wider. So, using the tidyverse, that gives:
library(tidyverse)
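newdd1 <-
  # start from the wide data
  data1989 %>%
  # reshape to long: one row per date and suffix, with NDVI and pixelQA columns
  pivot_longer(cols = -date,
               names_to = c(".value", "set"),
               names_sep = "_") %>%
  # keep NDVI only where pixelQA is within 66±0.5 or 130±0.5
  mutate(NDVI = case_when(is.na(pixelQA) ~ NA_real_,
                          between(pixelQA, 65.5, 66.5) ~ NDVI,
                          between(pixelQA, 129.5, 130.5) ~ NDVI,
                          TRUE ~ NA_real_)) %>%
  # reshape back to wide, restoring the NDVI_* and pixelQA_* columns
  pivot_wider(names_from = set,
              values_from = c(NDVI, pixelQA))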
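If a for loop is preferred, the loop from the question can be made to work by building the column names as strings: data1989$NDVI_[i] does not paste the suffix onto the name, whereas data1989[[paste0("NDVI_", i)]] does. The sketch below assumes the same rules as in the question; ok_quality and suffixes are illustrative names.

```r
data1989 <- data.frame(
  date = c("1987-01-01", "1987-01-03", "1987-01-19"),
  NDVI_1 = c(NA, 0.589, 0.120),
  NDVI_3 = c(NA, 0.447, NA),
  NDVI_4 = c(NA, NA, NA),
  pixelQA_1 = c(NA, 66.897, 90.599),
  pixelQA_3 = c(NA, 66.097, NA),
  pixelQA_4 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE only where pixelQA is present and within 66±0.5 or 130±0.5
ok_quality <- function(qa) {
  !is.na(qa) & (abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5)
}

# Take the suffixes from the column names (here 1, 3 and 4; there is no 2)
suffixes <- sub("^NDVI_", "", grep("^NDVI_", names(data1989), value = TRUE))

for (s in suffixes) {
  ndvi_col <- paste0("NDVI_", s)
  qa_col   <- paste0("pixelQA_", s)
  # Keep NDVI where the paired pixelQA passes the check, otherwise NA
  data1989[[ndvi_col]] <- ifelse(ok_quality(data1989[[qa_col]]),
                                 data1989[[ndvi_col]], NA)
}
```

Reading the suffixes from the column names avoids looping over 1:4, which would try to access the non-existent NDVI_2 column.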

Looping through select rows based on input of one or more columns

I have a nested list of items such that I have 3 separate lists grouped into one. I would like to make changes to a specific column that is present in all the lists. I have more details below
X
$`Manufacturing`
Stage Days.Added Start.Date End.Date
Planning 2 1968-12-01 NA
Building 14 NA NA
Testing 3 NA NA
Implementation 15 NA NA
$`Project Analysis`
Stage Days.Added Start.Date End.Date
Initial Review 3 1968-12-01 NA
Building 14 NA NA
User Testing 20 NA NA
Implementation 15 NA NA
User Review 7 NA NA
Final Analysis 4 NA NA
lapply(X, '[', 'End.Date') gives me:
$`Manufacturing`
End.Date
NA
NA
NA
NA
$`Project Analysis`
End.Date
NA
NA
NA
NA
NA
NA
I want to create a loop whereby the 'End.Date' column is the addition of the 'Start.Date' and the 'Days.Added' column for the first row. The resulting value would be the 'Start.Date' entry for the second row which would have the 'Days.Added' column added to produce the new 'End.Date' for the second row and so forth. So basically something like this:
$`Manufacturing`
Stage Days.Added Start.Date End.Date
Planning 2 1968-12-01 1968-12-03
Building 14 1968-12-03 1968-12-17
Testing 3 1968-12-17 1968-12-20
Implementation 15 1968-12-20 1969-01-04
$`Project Analysis`
Stage Days.Added Start.Date End.Date
Initial Review 3 1968-12-01 1968-12-04
Building 15 1968-12-04 1968-12-19
User Testing 20 1968-12-19 1969-01-08
Implementation 15 1969-01-08 1969-01-23
User Review 7 1969-01-23 1969-01-30
Final Analysis 4 1969-01-30 1969-02-03
How do I achieve this?
Assuming 'Start.Date' is of Date class,
lapply(X, transform, Start.Date = Start.Date[1] +
c(0, cumsum(Days.Added[-length(Days.Added)])),
End.Date = Start.Date[1] + cumsum(Days.Added))
data
X <- list(Manufacturing = structure(list(Stage = c("Planning", "Building",
"Testing", "Implementation"), Days.Added = c(2L, 14L, 3L, 15L
), Start.Date = structure(c(-396, NA, NA, NA), class = "Date"),
End.Date = c(NA, NA, NA, NA)), row.names = c(NA, -4L), class = "data.frame"),
`Project Analysis` = structure(list(Stage = c("Initial Review",
"Building", "User Testing", "Implementation", "User Review",
"Final Analysis"), Days.Added = c(3L, 14L, 20L, 15L, 7L,
4L), Start.Date = structure(c(-396, NA, NA, NA, NA, NA), class = "Date"),
End.Date = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-6L), class = "data.frame"))
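Since the question asks for a loop, here is the same computation written as an explicit row-by-row loop, as a minimal sketch. The helper name fill_dates is made up, and the example list contains only the 'Manufacturing' table to keep it short.

```r
# Fill Start.Date and End.Date row by row:
# each row starts where the previous row ended
fill_dates <- function(d) {
  d$End.Date <- as.Date(d$End.Date)  # logical NA column -> Date column
  for (i in seq_len(nrow(d))) {
    if (i > 1) d$Start.Date[i] <- d$End.Date[i - 1]
    d$End.Date[i] <- d$Start.Date[i] + d$Days.Added[i]
  }
  d
}

# Illustrative input with only the first Start.Date known
X <- list(Manufacturing = data.frame(
  Stage = c("Planning", "Building", "Testing", "Implementation"),
  Days.Added = c(2L, 14L, 3L, 15L),
  Start.Date = as.Date(c("1968-12-01", NA, NA, NA)),
  End.Date = NA,
  stringsAsFactors = FALSE
))

result <- lapply(X, fill_dates)
```

The cumsum version above is more compact and avoids the loop; the loop version may be easier to adapt if later rows need special handling.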
