Identifying rows based on conditions in nested data - R

I am new to R, and I am having trouble manipulating my data into the form I need for my analysis. I would be grateful if anyone could help, because this is essential for my research.
I already asked a similar question, but the answer I got did not fully address my problem, so I will try to be clearer this time.
My data looks something like this:
df<- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA","USA"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU"),
"Product.cat" = c("1", "11", "111", "112", "12", "2"),
"Product Description" = c("Food", "Fruit", "Apple",
"Banana", "Meat", "Manifactured"),
"Year" = c(1970, 1970, 1970, 1970, 1970, 1970),
"trade value" = c( 100, 50, 30, 20, 50, 220),
stringsAsFactors = FALSE)
I have country-year observations about trade.
The column 'Product.cat' indicates what kind of commodity is exported. The more digits a Product.cat code has, the more disaggregated the trade information is.
For example, Product.cat 111 (e.g. apples) and 112 (e.g. bananas) are sub-categories of product category 11 (e.g. fruit).
The same holds at the higher levels of aggregation: product category 11 (fruit) is, together with Product.cat 12 (meat), a sub-category of Product.cat 1 (food).
Note that data in lower categories is nested in the higher levels of aggregation. Hence the value of Product.cat 11 (50) is equal to the value of Product.cat 111 (30) + Product.cat 112 (20).
For my analysis I need to identify the values that are not reported at the most disaggregated level possible, i.e. the data not reported at the 3-digit level.
My problem is that for some country-year observations I have data reported at all levels of aggregation (e.g. 1, 11, 111, 112), while for others I only have data at a higher level of aggregation (e.g. 12 and 2). For instance, in my example I only have Product.cat 12 (meat), but no data on what kind of meat it is: Product.cat 121 (pork) or Product.cat 122 (veal).
Similarly, in the example, data on Product.cat 2 (manufacturing) is not reported at lower levels:
we do not know whether it is Product.cat 21 (clothing) or Product.cat 22 (wood products).
In other words, I have data reported at the 2-digit level (12) or the 1-digit level (2) that could be reported at the 3-digit level. Note that every category should be disaggregated to the 3-digit level.
What I would like to do is identify all the data reported exclusively at a higher level of aggregation and change their Product.cat code by appending an "m".
After manipulation, Product.cat 12 should become 12m to indicate that the data was reported only at the 2-digit level.
Similarly, I would like to identify exports that are reported only at the first digit: Product.cat 2 should become 2mm to reflect that the data was reported only at the 1-digit level.
To be clear, only the data for which I have information exclusively at a higher level of aggregation (in the example, 12 and 2) should get "m"s.
For instance, in the example, I do not want to have 1mm, since I have data at a lower level of aggregation (11, 12). Similarly, I do not want to have 11m, because I have data at lower levels of aggregation (111, 112). What I would like to have is 12m and 2mm, because the data is reported only at a higher level of aggregation (12 and 2).
I know that this is a very specific question, but I would really appreciate it if anyone could help.
Note: in the real dataset, due to measurement errors, the sum of the disaggregated values does not always add up exactly to the higher level of aggregation (for instance, 111 + 112 can be > 11). Hence, ideally, I am looking for a function that decides when to add the "m" based on the number of digits within each country, partner and year, rather than on the sum of the traded values.
I really thank everyone who could help me with this; it would be a huge step forward for my research.
---- attempts
I have been working on this function, but it does not seem to do what I am looking for. Maybe someone can find out what is going wrong
fillLevel <- function(x, width = 3, fill = "m"){
sp <- split(x, substr(x, 1, 1))
sp <- lapply(seq_along(sp), function(i){
n <- nchar(sp[[i]])
if(all(n < 3)){
j <- which(n == max(n))
sp[[i]][j] <- gsub(" ", "m", formatC(sp[[i]][j], width = -3))
}
sp[[i]]
})
unname(unlist(sp))
}
library(dplyr)
df <- df %>% mutate(prdcat2 = fillLevel(Product.cat))
As you can see, it only identifies 2mm but not 12m. Moreover, when I run it on more complex codes it messes up the order of my data. I think this relates to sp <- lapply(seq_along(sp), ...), but I am not sure how to go about it.
Best

Here's one way to do it:
library(data.table)
setDT(df)
# tag levels
df[, lvl := nchar(Product.cat)]
df[lvl < 3L, has_subcat := FALSE]
# use level-3 observations to flag level-2s as okay
df[
df[lvl == 3, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 2))],
on=.(Reporter, Partner, Year, Product.cat),
has_subcat := TRUE
]
# use level-2 observations to flag level-1s as okay
df[
df[lvl == 2, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 1))],
on=.(Reporter, Partner, Year, Product.cat),
has_subcat := TRUE
]
# create new cat, flagging observations with no subcategories
df[, newcat := Product.cat]
df[has_subcat == FALSE, newcat := paste0(Product.cat, strrep("m", 3-lvl))]
   Reporter Partner Product.cat Product.Description Year trade.value lvl has_subcat newcat
1:      USA      EU           1                Food 1970         100   1       TRUE      1
2:      USA      EU          11               Fruit 1970          50   2       TRUE     11
3:      USA      EU         111               Apple 1970          30   3         NA    111
4:      USA      EU         112              Banana 1970          20   3         NA    112
5:      USA      EU          12                Meat 1970          50   2      FALSE    12m
6:      USA      EU           2        Manifactured 1970         220   1      FALSE    2mm
I'm assuming that this should be done separately per Reporter-Partner-Year.
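The level-by-level joins can also be written as a single prefix test, which extends to any number of digits and keeps the original row order. This is only a minimal base-R sketch under some assumptions: the codes are character strings, a "child" is any longer code in the same Reporter-Partner-Year group that starts with the parent code, and flag_missing_levels is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper (not from any package): pad codes that have no
# longer code starting with them inside the same group.
flag_missing_levels <- function(codes, group, width = 3L, fill = "m") {
  out <- codes
  for (idx in split(seq_along(codes), group)) {
    g <- codes[idx]
    # TRUE when some longer code in the group starts with this code
    has_child <- vapply(
      g,
      function(code) any(nchar(g) > nchar(code) & startsWith(g, code)),
      logical(1),
      USE.NAMES = FALSE
    )
    pad <- pmax(width - nchar(g), 0L)  # number of fill characters needed
    out[idx] <- ifelse(has_child, g, paste0(g, strrep(fill, pad)))
  }
  out
}

# Same codes as in the question's example
trade <- data.frame(
  Reporter    = "USA",
  Partner     = "EU",
  Year        = 1970,
  Product.cat = c("1", "11", "111", "112", "12", "2"),
  stringsAsFactors = FALSE
)
trade$newcat <- flag_missing_levels(
  trade$Product.cat,
  group = paste(trade$Reporter, trade$Partner, trade$Year)
)
trade$newcat
```

Because the result is written back by row index, the rows stay in their original order. If Product.cat is stored as a factor, it would need as.character() first, since startsWith() only accepts character vectors.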

Related

How to calculate co-occurrence matrices based on large dataframes?

I want to create a co-occurrence matrix based on the recommended code here (also see below). It works fine for most of the dataframes I work with. However, I get the following error messages for larger dataframes either if I use data.table::melt ...
negative length vectors are not allowed
... or later on using base::crossprod
error in crossprod: attempt to make a table with >=2^31 elements
Both are related to the size of the dataframe. In the first case, it relates to the number of rows, while in the latter case the size of the matrix exceeds the limit.
I'm aware of the solutions for the first issue (data.table::melt) proposed by [2], [3] and [4] as well as for the second issue (base::crossprod) by [5] and [6], and I've seen [7], but I'm not sure how to adapt them properly to my situation. I have tried to split the dataframe by ID into several dataframes, merge them and calculate the co-occurrence matrix, but this only produced additional error messages (e.g., cannot allocate vector of size 17.8 GB).
Reproducible Example
I have an assembled dataframe created by plyr::join that looks like this (but, of course, a lot larger):
df <- data.frame(ID = c(1,2,3,20000),
C1 = c("England", "England", "England", "China"),
C2 = c("England", "China", "China", "England"),
C5850 = c("England", "China", "China", "England"),
SC1 = c("FOO", "BAR", "EAT", "FOO"),
SC2 = c("MERCI", "EAT", "EAT", "EAT"),
SC5850 = c("FOO", "MERCI", "FOO", "FOO"))
    ID      C1      C2 ...   C5850 SC1   SC2 ... SC5850
     1 England England     England FOO MERCI        FOO
     2 England   China       China BAR   EAT      MERCI
     3 England   China       China EAT   EAT        FOO
 20000   China England     England FOO   EAT        FOO
Original Code
colnames(df)[-1] <- paste0("SCCOUNTRY", 2:7) # rename the six value columns, keep ID
library(data.table)
melt(setDT(df), id.vars = "ID", measure = patterns("^SCCOUNTRY"))[nchar(value) > 0 & complete.cases(value)] -> foo
unique(foo, by = c("ID", "value")) -> foo2
crossprod(table(foo2[, c(1,3)])) -> mymat
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)
Conditions (for the calculation of the co-occurrence matrix)
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
A combination/co-occurrence should be counted as 1.
Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
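One possible way around the 2^31 limit of a dense table() is to build the ID-by-value incidence matrix as a sparse matrix and take its cross-product. The following is a small sketch only, assuming the Matrix package is installed and that the data has already been brought into long format with one row per (ID, value); the toy data and the object names are made up for illustration, and whether the result fits in memory for the real data is not something this example can show.

```r
library(Matrix)

# Toy long-format data: one row per (ID, value) observation (illustrative only)
long <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 3),
  value = c("England", "England", "England", "China", "China", "FOO", "China"),
  stringsAsFactors = FALSE
)

long <- unique(long)  # each value counts at most once per ID

# Drop IDs that are left with a single distinct value (they count as 0)
n_per_id <- ave(seq_along(long$ID), long$ID, FUN = length)
long <- long[n_per_id > 1, ]

ids  <- factor(long$ID)
vals <- factor(long$value)

# Sparse 0/1 incidence matrix: rows are IDs, columns are values
m <- sparseMatrix(
  i = as.integer(ids),
  j = as.integer(vals),
  x = 1,
  dims = c(nlevels(ids), nlevels(vals)),
  dimnames = list(levels(ids), levels(vals))
)

# Sparse co-occurrence matrix; the diagonal counts the IDs in which a
# value appears together with at least one other value
co <- crossprod(m)
co
```

Only the non-zero cells are stored, so the memory needed grows with the number of observed combinations rather than with the square of the number of distinct values.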

Cleaning Origin and Destination data with duplicates but different factor level

I have some GIS data with origins and destinations (OD) and information about the time of day of each OD. I intend to make a map of this and to color the ODs by the time-of-day information.
One issue is that some ODs appear in the data set with both day and night, possibly in a different order. I would like to mark those differently, e.g. "Day/Night".
Is there an easy way to do this? My MWE has just one such OD, but I would need to identify it among several others. I can find the duplicates regardless of the order, but I don't know how to find out whether both time cases are present and how to replace them with "Day/Night".
library(data.table)
Origin<-c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination<-c("Paris", "London", "Berlin","Lisbon", "Lisbon", "Paris")
Time=factor(c("Day", "Night", "Day", "Day/Night","Day", "Day/Night"))
dt<-data.table(Origin=Origin, Destination=Destination, Time=Time)
#duplicates regardless of order
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast=TRUE),]
You can do that using the dplyr package as follows.
Feel free to change the conditions to fit your needs.
library(data.table)
library(dplyr)
# Creating data
dt <-
data.table(
Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
)
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
# pmin and pmax are used to sort the 2 columns
# in order to group by them regardless to their order
group_by(Origin2 = pmin(Origin, Destination),
Destination2 = pmax(Origin, Destination)) %>%
mutate(count = n(), # To check if Origin/destination are repeated or not
row = row_number(), # Placeholder to know whether it is the first or the second occurrence
# If not repeated then make Time = Day
# If repeated and first occurrence then Time = Day
# If repeated and second occurrence then Time = Night
Time = case_when(count == 1 ~ "Day",
count == 2 & row == 1 ~ "Day",
count == 2 & row == 2 ~ "Night")) %>%
ungroup() %>%
select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
Thanks for the dplyr solution by @Nareman Darwisch, which gave me the inspiration for my solution with data.table.
I am creating a new variable as a unique ID for each Origin-Destination pair:
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt.temp<-data.table(dat.sort)
dt.temp[,unique.name:=paste(V1,V2)]
dt$unique.name<-factor(dt.temp$unique.name)
Then I can either count the unique values of the factor by group, or check whether they match more than one of the 3 levels. Based on this, I can recode the labels to the "Day/Night" level whenever the count is > 1 or the other condition is TRUE.
dt[,No.levels:=length(unique(c(Time))), by=unique.name]
dt[,No.levels.logi:=sum(c(Time) %in% c(1:3))>1 , by=unique.name]
What I would like to understand is how I could use a logical condition that looks at the levels by group and compares them with the cases I want.
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
But I guess the levels command always gives me all three levels.
If I understand correctly, the OP wants to
identify city pairs regardless of the order of origin and destination, e.g. London-Paris belongs to the same city pair as Paris-London
collapse separate rows if a city pair is operated Day and Night or Day/Night
or update the original dataset
This is what I would do:
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
   Origin Destination      Time          Pair
1: London       Paris Day/Night  London-Paris
2:  Paris      London Day/Night  London-Paris
3: Lisbon      Berlin       Day Berlin-Lisbon
4: Madrid      Lisbon Day/Night Lisbon-Madrid
5: Berlin      Lisbon       Day Berlin-Lisbon
6: London       Paris Day/Night  London-Paris
The key point is to identify the city pairs which fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
Pair V1
1: London-Paris TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid TRUE
So, there is no need to deal with factor levels. BTW, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What does change is which of the levels are used in a subset or group.
pairs_DN contains the unique key of those city pairs
Pair
1: London-Paris
2: Lisbon-Madrid
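The remark about factor levels can be checked on the question's own Time vector. A small base-R illustration (the pair labels are typed in by hand here, and the object names are only for this example):

```r
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
pair <- c("London-Paris", "London-Paris", "Berlin-Lisbon",
          "Lisbon-Madrid", "Berlin-Lisbon", "London-Paris")

berlin <- Time[pair == "Berlin-Lisbon"]
levels(berlin)                # all three levels, although only "Day" occurs
as.character(unique(berlin))  # the values actually used in this group
levels(droplevels(berlin))    # levels after dropping the unused ones

# Per-pair check on the values rather than on the levels
dn <- tapply(as.character(Time), pair,
             function(x) all(c("Day", "Night") %in% x) || "Day/Night" %in% x)
dn
```

So a grouped test should look at the values (or at unique() of them) instead of levels(), which is why the levels-based attempt in the question always sees all three levels.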

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stackoverflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts<-melt(mydf,
idvars = c("City"),
measure.vars = c("Morality"))%>%
dcast(City~variable+value,
value.var="value",fill=0,fun.aggregate=length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of the books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this:
mydf %>%
melt(
idvars = c("City"),
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
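If the goal is the table sketched in the question (the counts per Morality plus one column with the total number of books per city), the two summaries can be computed separately and put side by side. This is a base-R sketch rather than a melt()/dcast() solution, and the object names counts, totals and result are only illustrative:

```r
# Same data as in the question
mydf <- data.frame(
  Person_ID = 1:10,
  City      = rep(c("NiceCity", "UglyCity"), 5),
  Morality  = rep(c("Bad guy", "Good guy"), each = 5),
  Books     = seq(0, 27, by = 3)
)

# Counts of each Morality category per city
counts <- as.data.frame.matrix(table(mydf$City, mydf$Morality))

# Total number of books per city
totals <- tapply(mydf$Books, mydf$City, sum)

result <- data.frame(
  City  = rownames(counts),
  counts,
  Books = as.vector(totals[rownames(counts)]),
  check.names = FALSE  # keep "Bad guy" / "Good guy" as column names
)
rownames(result) <- NULL
result
```

Indexing totals by rownames(counts) keeps the two summaries aligned on the same city order.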

How to identify observations with multiple matching patterns and create another variable in R?

I am trying to create a broad industry category from detailed categories in my data. I am wondering where I am going wrong in creating this with grepl in R.
My example data is as follows:
df <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"))
I am trying to create a variable called industry with 2 levels (e.g., agri, manufacturing) with the help of grepl or str_replace commands in R.
I have tried this:
newdf$industry <- ""
newdf[df$ind %>% grepl(c("^07|^08|^09", levels(df$ind), value = TRUE)), "industry"] <- "Agri"
But this gives me the following error:
argument 'pattern' has length > 1 and only the first element will be used
I want to get the following dataframe as my result:
newdf <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"),
industry = c("Agri", "Agri", "Agri", "Agri", "Manufacturing", "Manufacturing", "Agri", "Manufacturing"))
So my question is: how do I specify that if the variable 'ind' starts with 07, 08 or 09, my industry variable takes the value 'Agri', and if 'ind' starts with 10, 14 or 15, industry is 'Manufacturing'? There is a huge list of industry codes that I am trying to crunch into 10 categories, so I am looking for a solution that lets me do this with pattern matching.
Any help is appreciated! Thanks!
Try this:
library(dplyr)   # for %>% and mutate()
library(stringr) # for str_detect()
newdf = df %>%
mutate(industry = ifelse(str_detect(string = ind,
pattern = '^07|^08|^09'),
'Agri',
'Manufacturing'))
This works, using ifelse() to add desired column to df data.frame
df$industry <- ifelse(grepl(paste0("^", c('07','08','09'), collapse = "|"), df$ind), "Agri", "Manufacturing")
> df
county ind industry
1 1001 0700 Agri
2 1002 0701 Agri
3 2003 0780 Agri
4 4004 0980 Agri
5 8005 1000 Manufacturing
6 1002 1429 Manufacturing
7 2003 0840 Agri
8 4004 1500 Manufacturing
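Both answers treat everything that is not Agri as Manufacturing. For the ten categories mentioned in the question, one option is a named vector of patterns that is looped over. This is a base-R sketch; the two patterns come from the question, further categories would be added to the vector in the same way, and the names patterns and industry are only illustrative:

```r
ind <- c("0700", "0701", "0780", "0980", "1000", "1429", "0840", "1500")

# One regular expression per broad category; extend with more entries as needed
patterns <- c(
  Agri          = "^(07|08|09)",
  Manufacturing = "^(10|14|15)"
)

industry <- rep(NA_character_, length(ind))
for (label in names(patterns)) {
  # Only fill rows that have not been assigned a category yet
  hit <- is.na(industry) & grepl(patterns[[label]], ind)
  industry[hit] <- label
}
industry
```

Codes that match none of the patterns stay NA, which makes unmapped industry codes easy to spot.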

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
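## generating id for politician and party.affiliation
library(dplyr) # for arrange()
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
# factor() first: as.numeric() on a character column returns NA (since R 4.0, data.frame() keeps strings as character)
d$party.affiliation1<- as.numeric(factor(d$party.affiliation))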
The expected outcome should show the following: if a politician (identified by the column "id_pers") changes their value in the column "party.affiliation1", the value 1 is assigned in a new column called "switch", otherwise 0. The same procedure should be applied to every politician in the dataset, so the expected outcome should look like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. However, Anrita changes her party affiliation in the 6th legislature, where she was a candidate for party '2'. Therefore, the new column "switch" should contain the value '1' to reflect Anrita's change of party affiliation, and '0' to show that she did not change her party affiliation during the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
           names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla            Winner           1  majority Female          1955       1                 NA     NA
2: Anrita blabla            Winner           2  majority Female          1960       1                 NA      0
3: Anrita blabla            Winner           3  majority Female          1965       1                 NA      0
4: Anrita blabla            Winner           4        PR Female          1970       1                 NA      0
5: Anrita blabla            Winner           5        PR Female          1975       1                 NA      0
6: Anrita blabla         New party           6        PR Female          1980       1                 NA      1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
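The same "compare with the previous row within each politician" idea can be sketched in base R with ave(), for readers who do not want to load dplyr or data.table. The toy data and the names d_small, party_code and prev_code are made up for this illustration, and the rows are assumed to be already sorted by election date within each id_pers:

```r
# Toy data: two politicians, three legislatures each (illustrative only)
d_small <- data.frame(
  id_pers = c(1, 1, 1, 2, 2, 2),
  party   = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser"),
  stringsAsFactors = FALSE
)

# Integer code per party, as in party.affiliation1
d_small$party_code <- as.integer(factor(d_small$party))

# Previous party code within each politician (NA for the first entry)
d_small$prev_code <- ave(
  d_small$party_code, d_small$id_pers,
  FUN = function(x) c(NA, head(x, -1))
)

# 1 = switched party, 0 = stayed, NA = no previous legislature to compare with
d_small$switch <- as.integer(d_small$party_code != d_small$prev_code)
d_small$switch
```

As in the answers above, the first entry of each politician is NA because there is no earlier affiliation to compare against.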

Resources