How to aggregate/list values of multiple columns by group? - r

I have a data frame describing the ownership levels of companies that looks like this:
Company Subsidiary1 Subsidiary2 Subsidiary3
DE5930 DE5931 NA NA
GB3489 GB3490 NA NA
GB3489 GB3490 GB3491 NA
US2036 US2037 NA NA
US2036 US2037 US2038 NA
US2036 US2037 US2038 GB3491
....# and so on
Now I would like to create one column of all subsidiaries per company that should look like this:
Company Subsidiaries
DE5930 DE5931
GB3489 GB3490
GB3489 GB3491
US2036 US2037
US2036 US2038
US2036 GB3491
The dataset is really large (more than 100,000 rows) and I could not figure out a solution using group_by or aggregate, as most examples are for numeric variables (e.g. averages).
One idea would be to remove the duplicates with df[ !duplicated(df$Subsidiary1), ] to retain the first occurrence of each subsidiary and then shift the values to the left, but the problem is that one subsidiary could belong to several companies (like "GB3491") and I do not want to lose these observations. Is there an elegant solution to this problem?
Thank you in advance!

I would suggest the following tidyverse approach:
library(tidyverse)
#Data
df <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))
The code:
df %>% pivot_longer(cols = -Company) %>% select(-name) %>%
filter(!is.na(value)) %>%
filter(!duplicated(paste(Company,value)))
Output:
# A tibble: 6 x 2
Company value
<chr> <chr>
1 DE5930 DE5931
2 GB3489 GB3490
3 GB3489 GB3491
4 US2036 US2037
5 US2036 US2038
6 US2036 GB3491
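A variation on the same idea, as a minimal sketch: pivot_longer() can drop the NA cells itself via values_drop_na = TRUE, and dplyr's distinct() removes repeated Company/subsidiary pairs without pasting the two columns together. The object name res and the column name Subsidiaries are my own choices for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # long format: one row per non-NA Company/subsidiary cell
  pivot_longer(cols = -Company, values_to = "Subsidiaries",
               values_drop_na = TRUE) %>%
  # keep each Company/subsidiary pair once; the "name" column is dropped
  distinct(Company, Subsidiaries)

print(res)
```

Because duplicates are judged per pair, GB3491 is kept under both GB3489 and US2036.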

We can use coalesce on the subsidiary columns in reverse order, which returns the last non-NA value in each row:
library(dplyr)
df1 %>%
transmute(Company, Subsidiaries =
coalesce(!!! rlang::syms(rev(names(df1)[-1]))))
# Company Subsidiaries
#1 DE5930 DE5931
#2 GB3489 GB3490
#3 GB3489 GB3491
#4 US2036 US2037
#5 US2036 US2038
#6 US2036 GB3491
Or with base R, using max.col to find the last non-NA column in each row:
cbind(df1[1], Subsidiaries = df1[-1][cbind(seq_len(nrow(df1)),
max.col(!is.na(df1[-1]), "last"))])
data
df1 <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))

Related

Match bigram frequencies to bigram tokens across multiple columns

I have two dataframes, one is a frequency list with bigram frequencies:
F_bigrams <- structure(list(word_tag = c("it_PNP 's_VBZ", "do_VDB n't_XX0",
"that_DT0 's_VBZ", "you_PNP know_VVB", "i_PNP 'm_VBB", "i_PNP do_VDB",
"in_PRP the_AT0", "i_PNP 've_VHB", "'ve_VHB got_VVN", "i_PNP mean_VVB"
), Freq_bigr = c(31831L, 26273L, 21691L, 14157L, 14010L, 12904L,
10994L, 10543L, 10089L, 9856L)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
The other contains bigram tokens:
df <- data.frame(
bigr_1_2 = c("i_PNP 'm_VBB", NA, NA, NA),
bigr_2_3 = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA, NA),
bigr_3_4 = c("you_PNP know_VVB", "it_PNP 's_VBZ", "'ve_VHB got_VVN", NA)
)
I want to match the frequencies from the frequency list F_bigrams to each bigram token in df. In df, which is a tiny snippet of the actual data, I can do this without problems using this base R method:
df[, paste0("f_bigr_", 1:3, "_", 2:4)] <- sapply(df[, 1:3], function(x) F_bigrams$Freq_bigr[match(x, F_bigrams$word_tag)])
However, in the actual data, which has far more columns and half a million rows, I consistently get the number 2 where there should be NA. Why is that? And, more importantly, is there an alternative way to match the frequencies to their respective bigram tokens?
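library(tidyverse)
df %>%
  rowid_to_column() %>%
  # reshape to long format, dropping the empty cells
  pivot_longer(-rowid, values_to = 'word_tag', values_drop_na = TRUE) %>%
  # look up the frequency of each bigram
  left_join(F_bigrams, by = 'word_tag') %>%
  # reshape back to one row per original row
  pivot_wider(id_cols = rowid, values_from = c(word_tag, Freq_bigr))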
rowid word_tag_bigr_1_2 word_tag_bigr_2_3 word_tag_bigr_3_4 Freq_bigr_bigr_1_2 Freq_bigr_bigr_2_3 Freq_bigr_bigr_3_4
<int> <chr> <chr> <chr> <int> <int> <int>
1 1 i_PNP 'm_VBB it_PNP 's_VBZ you_PNP know_VVB 14010 31831 14157
2 2 NA 've_VHB got_VVN it_PNP 's_VBZ NA 10089 31831
3 3 NA NA 've_VHB got_VVN NA NA 10089
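If you would rather stay in base R, another option is a named lookup vector. This is only a sketch: lookup and the f_ column prefix are names I made up, F_bigrams is cut down to the four bigrams that occur in df, and the as.character() call is a guard for the case that the real columns are factors, which is an assumption about the actual data.

```r
# Shortened copy of the frequency list from the question
F_bigrams <- data.frame(
  word_tag  = c("it_PNP 's_VBZ", "you_PNP know_VVB", "i_PNP 'm_VBB", "'ve_VHB got_VVN"),
  Freq_bigr = c(31831L, 14157L, 14010L, 10089L),
  stringsAsFactors = FALSE
)

df <- data.frame(
  bigr_1_2 = c("i_PNP 'm_VBB", NA, NA, NA),
  bigr_2_3 = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA, NA),
  bigr_3_4 = c("you_PNP know_VVB", "it_PNP 's_VBZ", "'ve_VHB got_VVN", NA),
  stringsAsFactors = FALSE
)

# Frequencies named by their bigram, so they can be indexed by string
lookup <- setNames(F_bigrams$Freq_bigr, F_bigrams$word_tag)

# Index the lookup by each token column; NA tokens give NA frequencies
bigram_cols <- names(df)
df[paste0("f_", bigram_cols)] <- lapply(
  df[bigram_cols],
  function(x) unname(lookup[as.character(x)])
)

print(df)
```

This keeps the row count of df unchanged, including rows where every bigram is NA.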

filtering rows that only contain certain values among multiple columns in R

I have the following data, where a single row (an arrest event) can have up to 5 charges associated with it.
#statutes 1, 2, and 3 are marijuana related
#statute 4 is paraphernalia related.
caseID <- c("1", "2", "3", "4", "5", "6", "7", "8")
date <- c("2017-01-01", "2017-01-12", "2018-03-23", "2019-10-12", "2018-11-22", "2018-01-01", "2017-02-01", "2017-02-20")
charge1 <- c("Statute4", "Statute12", "Statute1", "Statute3", "Statute3", "Statute158", "Statute2", "Statute1")
charge2 <- c(NA, "Statute1", "Statute3", "Statute44", "Statute4", "Statute4", NA, "Statute4")
charge3 <- c(NA, "Statute12", NA, "Statute4", NA, NA, NA, "Statute19")
charge4 <- c(NA, "Statute6", NA, NA, NA, NA, NA, NA)
charge5 <- c(NA, "Statute8", NA, NA, NA, NA, NA, NA)
df <- data.frame(caseID, date, charge1, charge2, charge3, charge4, charge5)
I want to filter out any rows where the following conditions are true:
a record contains ONLY marijuana charges (which are statute1, statute2, and statute3) OR
a record contains ONLY marijuana charges (statute1, statute2, and statute3) and paraphernalia charges (statute4).
So by this logic, the cases would be excluded or kept as follows:
CaseID 1 excluded (paraphernalia only)
CaseID 2 not excluded
CaseID 3 excluded (marijuana only)
CaseID 4 not excluded
CaseID 5 excluded (marijuana and paraphernalia only)
CaseID 6 not excluded
CaseID 7 excluded (marijuana only)
CaseID 8 not excluded
There are many many statutes in my real example, so I need to figure out a way to do this where I find rows that contain only a very specific set of statutes.
This might work for you
library(data.table)
library(magrittr)  # provides %>%, which the chain below relies on
# set to data.table
setDT(df)
# melt long
df_long = melt(df, id.vars=c("caseID", "date"))[!is.na(value)]
# count all charges
total_charges = df_long[,.(totalcharges =.N), by=caseID]
# return the subset of the original wide dataset, with target
# rows excluded
df[df_long[value %chin% c("Statute1","Statute2","Statute3","Statute4")] %>%
.[, .N, caseID] %>%
.[total_charges, on=.(caseID)] %>%
# is.na(N) keeps cases that have none of the target statutes
.[is.na(N) | N!=totalcharges,.(caseID)], on="caseID"]
Output
caseID date charge1 charge2 charge3 charge4 charge5
1: 2 2017-01-12 Statute12 Statute1 Statute12 Statute6 Statute8
2: 4 2019-10-12 Statute3 Statute44 Statute4 <NA> <NA>
3: 6 2018-01-01 Statute158 Statute4 <NA> <NA> <NA>
4: 8 2017-02-20 Statute1 Statute4 Statute19 <NA> <NA>
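The same rule can also be written without reshaping, as a rough base R sketch: a case is kept only if at least one of its non-NA charges falls outside the target set. The names allowed, outside and kept are mine, and the data is the example from the question.

```r
df <- data.frame(
  caseID  = c("1", "2", "3", "4", "5", "6", "7", "8"),
  date    = c("2017-01-01", "2017-01-12", "2018-03-23", "2019-10-12",
              "2018-11-22", "2018-01-01", "2017-02-01", "2017-02-20"),
  charge1 = c("Statute4", "Statute12", "Statute1", "Statute3",
              "Statute3", "Statute158", "Statute2", "Statute1"),
  charge2 = c(NA, "Statute1", "Statute3", "Statute44",
              "Statute4", "Statute4", NA, "Statute4"),
  charge3 = c(NA, "Statute12", NA, "Statute4", NA, NA, NA, "Statute19"),
  charge4 = c(NA, "Statute6", NA, NA, NA, NA, NA, NA),
  charge5 = c(NA, "Statute8", NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Marijuana (1-3) and paraphernalia (4) statutes
allowed <- c("Statute1", "Statute2", "Statute3", "Statute4")

# Character matrix of the charge columns, one row per case
charges <- as.matrix(df[, paste0("charge", 1:5)])

# TRUE where a cell holds a charge that is not in the allowed set
outside <- !is.na(charges) &
  !matrix(charges %in% allowed, nrow = nrow(charges))

# Keep cases with at least one charge outside the allowed set
kept <- df[rowSums(outside) > 0, ]
print(kept$caseID)
```

Note that a row with no charges at all would be dropped by this rule, since it has nothing outside the set.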

Add chronological number to a column if empty

I have a dataframe with a column called 's_nummer'. This column is sometimes NA and in that case, I would like to add a number myself that can range from 700001 to 800000. So in this case, row numbers 3 and 4 do not contain a value in the s_nummer column and I would like to add the values 700001 to row 3 and 700002 to row 4.
dput:
structure(list(s_nummer = c(599999, 599999, NA, NA), eerste_voornaam = c("Debbie",
"Debbie", "Debbie", "Debbie"), tussenvoegsel = c(NA, NA, NA,
NA), geslachtsnaam = c("Oomen", "Oomen", "Oomen", "Oomen")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Hope you can help!
Thanks in advance
Assuming the data frame is stored in x, you can use which with is.na to get the rows with NA in x$s_nummer and overwrite them with 700000 + seq_along(i).
i <- which(is.na(x$s_nummer))
x$s_nummer[i] <- 700000 + seq_along(i)
# s_nummer eerste_voornaam tussenvoegsel geslachtsnaam
#1 599999 Debbie NA Oomen
#2 599999 Debbie NA Oomen
#3 700001 Debbie NA Oomen
#4 700002 Debbie NA Oomen
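Since the question limits the new numbers to 700001 through 800000, it may be worth guarding against running out of that range. A small sketch of the same approach with such a check; first_new and last_new are names I introduced, and the data frame is rebuilt from the dput in the question.

```r
x <- data.frame(
  s_nummer        = c(599999, 599999, NA, NA),
  eerste_voornaam = c("Debbie", "Debbie", "Debbie", "Debbie"),
  tussenvoegsel   = c(NA, NA, NA, NA),
  geslachtsnaam   = c("Oomen", "Oomen", "Oomen", "Oomen"),
  stringsAsFactors = FALSE
)

first_new <- 700001  # first number that may be handed out
last_new  <- 800000  # last number that may be handed out

# Rows that still need a number
i <- which(is.na(x$s_nummer))

# Stop early if there are more missing values than free numbers
stopifnot(length(i) <= last_new - first_new + 1)

# Hand out consecutive numbers in row order
x$s_nummer[i] <- first_new + seq_along(i) - 1
print(x$s_nummer)
```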

Iterate through columns' suffixes in a for loop. R

I am trying to modify my dataset with a for loop. I want to modify certain cells of some columns depending on the value of its "paired" column. My dataset could be:
data1989 <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, 0.589, 0.120),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 0.589 0.447 NA 66.897 66.097 NA
3 1987-01-19 0.120 NA NA 90.599 NA NA
Columns are "paired" by the suffix of each column, so NDVI_1 is paired with pixelQA_1, and so on. I want to modify the values in the NDVI columns depending on the values in their "paired" pixelQA columns, as follows:
if pixelQA is NA -> then NDVI should also be NA.
if pixelQA is 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
if pixelQA is anything other than 66±0.5 OR 130±0.5 -> then NDVI is set to NA (this is bad-quality data which needs to be ignored).
Applying these very simple rules my data should look like:
data1989clean <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, NA, NA),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989clean
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 NA 0.447 NA 66.897 66.097 NA
3 1987-01-19 NA NA NA 90.599 NA NA
To reach my goal I am trying the following for loop:
for(i in 1:4){
data1989$NDVI_[i] <- ifelse(data1989$pixelQA_[i] < 66.5 & data1989$pixelQA_[i] > 65.5 |
data1989$pixelQA_[i] < 130.5 & data1989$pixelQA_[i] > 129.5,
data1989$NDVI_[i], NA)
}
But so far it is not working, as the dataset output looks exactly the same as the original one. Any suggestion will be welcomed.
As suggested by @George Savva, you can achieve this by pivoting longer, correcting the data, and pivoting back wider. Using the tidyverse, that gives:
library(tidyverse)
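newdd1 <-
  # start from the original wide data
  data1989 %>%
  # one row per date and suffix ("set"), with NDVI and pixelQA side by side
  pivot_longer(cols = -date,
               names_to = c(".value", "set"),
               names_sep = "_") %>%
  # keep NDVI only where pixelQA is within 66±0.5 or 130±0.5
  mutate(NDVI = case_when(is.na(pixelQA) ~ NA_real_,
                          between(pixelQA, 65.5, 66.5) ~ NDVI,
                          between(pixelQA, 129.5, 130.5) ~ NDVI,
                          TRUE ~ NA_real_)) %>%
  # back to the original wide layout
  pivot_wider(names_from = set,
              values_from = c(NDVI, pixelQA))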
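As for why the loop in the question changes nothing: data1989$NDVI_[i] does not paste i onto the column name, it looks for a column literally called NDVI_, which does not exist. If you prefer to keep a loop, a minimal base R sketch is to build the names with paste0() and use [[ ]]. The variable names suffixes, ndvi, qa and ok are my own.

```r
data1989 <- data.frame(
  date      = c("1987-01-01", "1987-01-03", "1987-01-19"),
  NDVI_1    = c(NA, 0.589, 0.120),
  NDVI_3    = c(NA, 0.447, NA),
  NDVI_4    = c(NA, NA, NA),
  pixelQA_1 = c(NA, 66.897, 90.599),
  pixelQA_3 = c(NA, 66.097, NA),
  pixelQA_4 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Take the suffixes from the column names instead of assuming 1:4
suffixes <- sub("^NDVI_", "", grep("^NDVI_", names(data1989), value = TRUE))

for (s in suffixes) {
  ndvi <- paste0("NDVI_", s)
  qa   <- paste0("pixelQA_", s)
  # TRUE where pixelQA is within 0.5 of 66 or of 130; NA where pixelQA is NA
  ok <- abs(data1989[[qa]] - 66) <= 0.5 | abs(data1989[[qa]] - 130) <= 0.5
  # Blank out NDVI wherever the quality flag is missing or out of range
  data1989[[ndvi]][is.na(ok) | !ok] <- NA
}

print(data1989)
```

On the example data this leaves only the 0.447 in NDVI_3, as in data1989clean.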

R: Recoding several characters into one new factor

I am new to R, and could not find specific help for my question on this site.
I have (among others) ten character variables in my dataframe $grant_database, country_1 through country_10. Each contains either a country code, for example E20, F27 or G10, or an NA. Each case is a grant to a project. The ten country variables specify which country/countries a grant is benefitting. In my dataframe, most, but not all cases will have at least one country code, first marked in country_1, many will have one for country_2 as well, and some even for country_3 to _10. All empty fields are marked with an NA.
id country_1 country_2 country_3 country_4 country_5 country_6 ...new_binaryvar
1 F20 NA NA NA NA NA 0
2 E12 E17 E52 NA NA NA 0
3 O62 O33 NA NA NA NA 0
4 E21 E20 NA NA NA NA 1
5 NA NA NA NA NA NA 0
...
I wish to create a new factor flagging grants which benefit a defined subset of countries. This binary "dummy" variable should give the value "1" to each case that in at least one of the ten country variables corresponds with a list of country codes. It should give "0" to each case/grant that does not have a corresponding country code in any of its ten country variables. Let this subset of country codes to be flagged be: E20, F27 and G10 (in reality, there are about 40 to be flagged, from 150+).
Would you help me out by suggesting a way to program this? Thank you very much for your help!
Assuming you want to check whether any of a subset of country codes appears in the "country" variables, so that a row gets "1" if at least one of the codes is present and "0" otherwise: the idea is to create a vector (v1) of the country codes to be checked. Convert the dataset (df) to a matrix after removing the "id" column (as.matrix(df[,-1])) and create a logical vector by comparing it with v1 (%in%). The vector can be changed back to a matrix by assigning (dim<-) the dimensions of df[,-1], i.e. c(5,7). Then take the rowSums, double negate (!!) to get TRUE/FALSE, and finally add 0 to get the binary dummy variable.
v1 <- c('E20', 'F27', 'G10')
(!!rowSums(`dim<-`(as.matrix(df[,-1]) %in% v1, c(5,7))))+0
#[1] 0 0 0 1 0
newdata
df <- structure(list(id = 1:5, country_1 = c("F20", "E12", "O62", "E21",
NA), country_2 = c(NA, "E17", "O33", "E20", NA), country_3 = c(NA,
"E52", NA, NA, NA), country_4 = c(NA, NA, NA, NA, NA), country_5 = c(NA,
NA, NA, NA, NA), country_6 = c(NA, NA, NA, NA, NA), country_7 = c(NA,
NA, NA, NA, NA)), .Names = c("id", "country_1", "country_2",
"country_3", "country_4", "country_5", "country_6", "country_7"
), class = "data.frame", row.names = c(NA, -5L))
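A variant that avoids hard-coding the dimensions c(5,7), offered as a sketch rather than a replacement: apply %in% to each country column with sapply, which returns a logical matrix with one column per variable, and flag the rows with at least one hit. The names hits and flag are mine.

```r
df <- data.frame(
  id        = 1:5,
  country_1 = c("F20", "E12", "O62", "E21", NA),
  country_2 = c(NA, "E17", "O33", "E20", NA),
  country_3 = c(NA, "E52", NA, NA, NA),
  country_4 = c(NA, NA, NA, NA, NA),
  country_5 = c(NA, NA, NA, NA, NA),
  country_6 = c(NA, NA, NA, NA, NA),
  country_7 = c(NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Country codes to be flagged
v1 <- c("E20", "F27", "G10")

# Logical matrix: TRUE where a cell matches one of the codes (NA gives FALSE)
hits <- sapply(df[-1], `%in%`, v1)

# 1 if any country column in the row matched, else 0
flag <- as.integer(rowSums(hits) > 0)
print(flag)
```

If a factor is needed, as in the question, factor(flag) can be assigned to the new column.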
