R- compare different columns of a data frame with different values - r

I am currently working on microdata from a survey called SHARE. I want to use an education variable, but the way it is coded makes that hard.
In the survey, households are asked which degrees they hold. There is one column per degree, taking the value 1 if the respondent has the degree and 0 otherwise. The issue is that I have two countries with different degrees that share the same columns, so I had to go through the user manual to find out, for each country, which degree each column corresponds to. I was able to do so and then translate each degree to an international measure of education.
My idea was to sum the columns so that I end up with a single column per household. However, this does not work because some people hold several degrees. I would like to get the highest degree of each household and would appreciate your help with this.
Here are tables of what I have and what I would like:
Let's imagine that in Germany the first diploma is equivalent to the first diploma in international standards, the second and the third German diplomas both correspond to the second international diploma, and the last German diploma corresponds to the third international one. In France we have first = first int., second = second int., third = third int., and no fourth diploma. Then I have the table:
country= c( "Germany", "Germany", "Germany", "France" , "France", "France")
degree_one= c( 1, 1, 1, 1 , 1, 1)
degree_two = c( 0, 1, 0, 1 , 1, 0)
degree_three= c( 1, 0, 1, 1 , 1, 0)
degree_four = c( 1, 0, 0, NA ,NA, NA)
f = data.frame(country,degree_one,degree_two,degree_three,degree_four)
Then I can translate and try to create my degree variable by summing everything:
f$degree_one = ifelse(f$country == "Germany" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "Germany" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "Germany" & f$degree_three == 1,2,f$degree_three)
f$degree_four = ifelse(f$country == "Germany" & f$degree_four == 1,3,f$degree_four)
f$degree_one = ifelse(f$country == "France" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "France" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "France" & f$degree_three == 1,3,f$degree_three)
f$degree_four = ifelse(f$country == "France" & f$degree_four == "NA",0,f$degree_four)
f = replace(f, is.na(f), 0)
f2 = f %>% mutate(degree = degree_one + degree_two + degree_three + degree_four )
Unfortunately, it does not work and what I would like should look like this:
degree = c(3,2,2,3,3,1)
f3 = data.frame(f,degree)
I tried to do something with a while loop but it did not work. Does anyone have an idea how I can solve my problem? I tried to make it as clear as possible; I hope it is understandable and that someone has an idea on how to fix this.
Thanks :)

Here is an approach using data.table
library(data.table)
##
# create degree map by country
#
degreeMap <- data.table(country=c('France', 'Germany'))
degreeMap <- degreeMap[, .(degree=paste('degree', c('one', 'two', 'three', 'four'), sep='_')), by=.(country)]
degreeMap[country=='France', intlDegree:=c(1,2,3,NA)]
degreeMap[country=='Germany', intlDegree:=c(1,2,2,3)]
##
# process your data
#
setDT(f)
f[, indx:=1:.N] # need an index column to recover original order
f[, HH:=1:.N, by=.(country)] # need a HH column to distinguish different HH w/in country
maxDegree <- melt(f, id=c('country', 'HH', 'indx'), variable.name='degree', value.name = 'flag')
maxDegree <- maxDegree[flag > 0] # remove rows with flag=0 or NA
setorder(maxDegree, HH, degree)
maxDegree <- maxDegree[, .SD[.N], keyby=.(country, HH)]
maxDegree[degreeMap, intlDegree:=i.intlDegree, on=.(country, degree)]
setorder(maxDegree, indx)
maxDegree
## country HH indx degree flag intlDegree
## 1: Germany 1 1 degree_four 1 3
## 2: Germany 2 2 degree_two 1 2
## 3: Germany 3 3 degree_three 1 2
## 4: France 1 4 degree_three 1 3
## 5: France 2 5 degree_three 1 3
## 6: France 3 6 degree_one 1 1
So this converts your f to a data.table and adds an index column and a HH column to distinguish between HH within a country.
We then convert to long format using melt(...). In long format the four degree_ columns are reduced to two columns: a flag column indicating whether or not the degree applies, and a degree column indicating which degree.
Then we remove all rows with 0 or NA flags, and then extract the last remaining row (highest degree) for each country and HH.
Finally, we join to degreeMap to get the equivalent intlDegree.
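As a cross-check, the same logic (map each flag to its international level, then keep the largest level per row) can be sketched in base R. This is only an illustration under the mapping assumed in the question; intl_map, flags and levels_held are names introduced here and are not part of the answer above.

```r
# Rebuild the example data from the question
f <- data.frame(
  country      = c("Germany", "Germany", "Germany", "France", "France", "France"),
  degree_one   = c(1, 1, 1, 1, 1, 1),
  degree_two   = c(0, 1, 0, 1, 1, 0),
  degree_three = c(1, 0, 1, 1, 1, 0),
  degree_four  = c(1, 0, 0, NA, NA, NA)
)

# Assumed mapping: international level of each national degree column
intl_map <- rbind(
  Germany = c(1, 2, 2, 3),
  France  = c(1, 2, 3, NA)   # no fourth diploma in France
)

degree_cols <- c("degree_one", "degree_two", "degree_three", "degree_four")
flags <- as.matrix(f[degree_cols])
flags[is.na(flags)] <- 0

# Multiply each 0/1 flag by the international level for that row's country
levels_held <- flags * intl_map[as.character(f$country), ]
levels_held[is.na(levels_held)] <- 0   # 0 * NA is NA, so reset those cells

# Highest international level per row
f$degree <- apply(levels_held, 1, max)
f$degree
# 3 2 2 3 3 1
```

The result matches the degree vector the question asks for.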

Change NAs to 0 and then sum degree columns:
f <- f %>%
mutate(
degree_one = ifelse(is.na(degree_one), 0, degree_one),
degree_two = ifelse(is.na(degree_two), 0, degree_two),
degree_three = ifelse(is.na(degree_three), 0, degree_three),
degree_four = ifelse(is.na(degree_four), 0, degree_four),
degree_sum = degree_one + degree_two + degree_three + degree_four
)
Or, if you want to get fancy with dplyr:
f <- f %>%
mutate(across(contains("degree"), \(x) {ifelse(is.na(x), 0, x)})) %>%
mutate(degree_sum = select(., contains("degree")) %>% rowSums())

Related

How do I create a new variable from specific variables?

Noob question, but how would I create a separate variable that is formed from specific attributes of other variables? For example, I'm trying to find Asian countries in the "region" variable that have a "democracy" variable score of "3." I want to create a variable called "asia3" that selects those Asian countries with a democracy score of 3.
The which() function should do what you need:
asia3 <- your_data[ which(your_data$region=='Asia' & your_data$democracy == 3), ]
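One practical difference between subsetting with which() and subsetting with the bare logical condition shows up when the columns contain NA. The sketch below uses a small made-up data frame (the values are assumptions for illustration only).

```r
# Made-up data with a missing region in row 3
your_data <- data.frame(
  region    = c("Asia", "Asia", NA, "Europe"),
  democracy = c(3, 2, 3, 3)
)

cond <- your_data$region == "Asia" & your_data$democracy == 3
# cond is TRUE, FALSE, NA, FALSE

with_which    <- your_data[which(cond), ]  # which() drops the NA
without_which <- your_data[cond, ]         # NA yields an extra all-NA row

nrow(with_which)     # 1
nrow(without_which)  # 2
```

So which() silently discards rows where the condition cannot be evaluated, while plain logical indexing keeps them as rows of NA.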
In base R, you can create a new variable based on a condition using an ifelse statement, then assign to a new variable called asia3.
df$asia3 <- ifelse(df$region == "Asia" & df$democracy == 3, "yes", "no")
region democracy asia3
1 Asia 3 yes
2 Australia 3 no
3 Asia 2 no
4 Europe 1 no
Or if you only need a logical output, then you do not need the ifelse:
df$asia3 <- df$region == "Asia" & df$democracy == 3
region democracy asia3
1 Asia 3 TRUE
2 Australia 3 FALSE
3 Asia 2 FALSE
4 Europe 1 FALSE
or with tidyverse (inside mutate() the columns can be referenced directly, without df$)
library(tidyverse)
df %>%
mutate(asia3 = region == "Asia" & democracy == 3)
However, if you only want to keep the rows that meet those conditions, then you can:
#dplyr
df %>%
filter(region == "Asia" & democracy == 3)
#base R
df[df$region=='Asia' & df$democracy == 3, ]
# region democracy
#1 Asia 3
Data
df <-
structure(list(
region = c("Asia", "Australia", "Asia", "Europe"),
democracy = c(3, 3, 2, 1)
),
class = "data.frame",
row.names = c(NA,-4L))

How to get one row per unique ID with multiple columns per values of particular column

I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 1, 0)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see people who attended various ones. I want to get a list of all the unique people with columns for which events they did/didn't attend. I also got profile data from some of the events, but some had more data than others - so I also want to keep the most filled out data (i.e. couldn't identify Ryan's birthday from event D but could from event B).
I've tried looking up many different things but get confused between whether I should be looking at reshaping, vs. dcast, vs. spread/gather... new to R so any help is appreciated!
EDIT: Additional question: instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you count how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would say 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
A data.table option
dcast(
setDT(data),
person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
fun = length
)
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
A base R option using reshape
reshape(
transform(
data,
birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
cnt = 1
),
direction = "wide",
idvar = c("person", "birthday"),
timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
First, isolate the birthdays, which are not represented cleanly in your table; then reshape, and finally merge the birthdays back in.
Using the package reshape2:
birthdays <- unique(data[!is.na(data$birthday),c("person","birthday")])
reshaped <- reshape2::dcast(data,person ~ event, value.var = "event",fun.aggregate = length)
final <- merge(reshaped,birthdays)
Explanation: I told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of event (the aggregation function length does the counting).
EDIT: for your additional question it works just the same; just apply substr() to the event variable:
reshaped <- reshape2::dcast(data,person ~ substr(event,1,1), value.var = "event",fun.aggregate = length)
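For comparison, here is a base R sketch of the category count from the EDIT, assuming (as in the answer above) that the category is the first character of the event name. The small data set with events A1, A2, B1 and C1 is made up for illustration.

```r
# Made-up sign-up list with several events per category
data <- data.frame(
  event  = c("A1", "A2", "A1", "B1", "B1", "C1"),
  person = c("Ann", "Ann", "Ryan", "Ann", "Ryan", "Sally")
)

# Cross-tabulate person against the first character of the event name
counts <- table(data$person, substr(data$event, 1, 1))

# Turn the contingency table into a wide data frame, one row per person
counts_df <- as.data.frame.matrix(counts)
counts_df$person <- rownames(counts_df)
counts_df
#       A B C person
# Ann   2 1 0    Ann
# Ryan  1 1 0   Ryan
# Sally 0 0 1  Sally
```

Birthdays could then be merged back in with merge(), as in the reshape2 answer.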

Creating numeric variable based on string intersection in R

I'm attempting to create a numeric variable based on the intersection of strings with R's dplyr package.
I have a list of columns containing codes for thousands of individuals who made purchases at an auto dealership. The codes can represent a purchase of a car, internal parts for a car, or items for the exterior of a car. I want to denote codes identified as a car purchase with 2, items for the interior of a car with 1, and items for the exterior of a car with 0. If the customer purchased a car, I want the column LargestPurchase = 2; if the customer didn't buy a car but bought an interior component, I would like the column LargestPurchase = 1; and if the customer did not buy a car or interior component I would like the column LargestPurchase = 0.
The codes for a car purchase are located in a separate data frame with column CarCodes, and the codes for the interior components of a car are located in a separate data frame with column InteriorCodes. Each contain thousands of codes.
The data for the customers would look like the following (called customers):
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3
001 STW387 K987 W9333
002 AZ326 CP993 EN499
003 BKY98 A0091 C2001
Example:
df1$CarCodes = c('STW387', 'W9333')
df2$InteriorCodes = c('K987', 'AZ326')
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
001 STW387 K987 W9333 2
002 AZ326 CP993 EN499 1
003 BKY98 A0091 C2001 0
I attempted to use the following ifelse function with mutate, but it does not seem to work with strings:
customers <- customers %>% mutate(LargestPurchase =
(ifelse(intersect(customers$PurchaseCode1, df1$CarCodes) == TRUE |
intersect(customers$PurchaseCode2, df1$CarCodes) |
intersect(customers$PurchaseCode3, df1$CarCodes), 2, ifelse(
intersect(customers$PurchaseCode1, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode2, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode3, df3$InteriorCodes) == TRUE, 1, 0)))
Any insight would be great.
Here is a dplyr version (gather() comes from tidyr):
library(dplyr)
library(tidyr)
CarCodes = c('STW387', 'W9333')
InteriorCodes = c('K987', 'AZ326')
data.frame(customer = c(001, 002, 003),
code1 = c('STW387', 'AZ326', 'BKY98'),
code2 = c('K987', 'CP993', 'A0091'),
code3 = c('W9333', 'EN499', 'C2001')) %>%
gather(variable, value, -customer) %>%
mutate(purchase = case_when(value %in% CarCodes ~ 2,
value %in% InteriorCodes ~ 1,
TRUE ~ 0)) %>%
group_by(customer) %>%
summarise(largest_purchase = max(purchase))
Determine whether any of the CarCodes or InteriorCodes are contained in each row and then use the max value (df3 here is the customers data frame).
c2 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df2$InteriorCodes), 1, 0))
c1 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df1$CarCodes), 2, 0))
df3$LargestPurchase <- pmax(c1, c2)
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
1 1 STW387 K987 W9333 2
2 2 AZ326 CP993 EN499 1
3 3 BKY98 A0091 C2001 0
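The attempt in the question fails because intersect() returns the set of common elements rather than one TRUE/FALSE per row, whereas %in% is element-wise. A minimal base R sketch of the element-wise idea, using the three example customers from the question (codes and score are names introduced here):

```r
CarCodes      <- c("STW387", "W9333")
InteriorCodes <- c("K987", "AZ326")

customers <- data.frame(
  Customer1     = c("001", "002", "003"),
  PurchaseCode1 = c("STW387", "AZ326", "BKY98"),
  PurchaseCode2 = c("K987", "CP993", "A0091"),
  PurchaseCode3 = c("W9333", "EN499", "C2001")
)

# Score every purchase cell: 2 = car, 1 = interior, 0 = anything else
codes <- as.matrix(customers[, -1])
score <- matrix(0, nrow = nrow(codes), ncol = ncol(codes))
score[codes %in% InteriorCodes] <- 1
score[codes %in% CarCodes]      <- 2

# Largest score in each row
customers$LargestPurchase <- apply(score, 1, max)
customers$LargestPurchase
# 2 1 0
```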

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by numeric indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
## generating ids for politician and party.affiliation
library(dplyr) # for arrange()
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
# factor() first: as.numeric() on a character column would return NA
d$party.affiliation1<- as.numeric(factor(d$party.affiliation))
The expected outcome is the following: if a politician (identified by the column "id_pers") changes value in the column "party.affiliation1", a value of 1 is assigned in a new column called "switch", otherwise 0. The same procedure should be applied to every politician in the dataset, so the expected outcome looks like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. In the 6th legislature, however, "Anrita" changes her party affiliation and runs as a candidate for party '2'. Therefore, the new column "switch" should contain a '1' to reflect Anrita's change of party affiliation, and '0' to show that she did not change her party affiliation during the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
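The lag-and-compare idea used in both answers can also be sketched in base R without extra packages. The small data frame below is made up for illustration; party, prev and first_row are names introduced here.

```r
# Made-up example: two politicians over three legislatures
d <- data.frame(
  id_pers     = c(1, 1, 1, 2, 2, 2),
  legislature = c(1, 2, 3, 1, 2, 3),
  party       = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser")
)

# Rows must be ordered by person and time before lagging
d <- d[order(d$id_pers, d$legislature), ]

# Previous row's party, blanked out on each person's first row
prev <- c(NA, head(d$party, -1))
first_row <- !duplicated(d$id_pers)
prev[first_row] <- NA

# 1 = switched party, 0 = stayed, NA = no earlier record to compare with
d$switch <- as.integer(d$party != prev)
d$switch
# NA 0 1 NA 0 0
```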

Sliding window over data.frame with nested hierarchy

Description of the data
My data.frame represents the salary of people living in different cities (city) in different countries (country). city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.
Goal
For each country and each city, I want to consider a window of size WindowSize of salaries and calculate D = sum(arg1)/sum(arg2) over this window. Then the window slides by WindowStep and D should be recalculated, and so on. For example, let's consider a WindowSize = 1000 and WindowStep = 10. Within each country and within each city, I would like to get D for the range of salaries between 0 and 1000, for the range between 10 and 1010, for the range between 20 and 1020, etc.
At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
Note on performance
I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.
Example data
set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
country city salary arg1 arg2
1 1 1 22791 -1.4606212 1.07084528
2 1 1 34598 0.9244679 1.19519158
3 1 1 76411 0.8288587 0.86737330
4 1 1 76790 1.3013056 0.07380115
5 1 1 87297 -1.4021137 1.62395596
6 1 2 12581 1.3062181 -1.03360620
With this example, if windowSize = 70000 and windowStep = 30000, the first values of D are -0.236604 and 0.439462 which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.
Unless I've misunderstood something, the following might be helpful.
Define a simple function regardless of hierarchical groupings:
ff = function(salary, wSz, wSt, arg1, arg2)
{
froms = (wSt * (0:ceiling(max(salary) / wSt)))
tos = froms + wSz
Ds = mapply(function(from, to, salaries, args1, args2) {
inds = salaries > from & salaries < to
sum(args1[inds]) / sum(args2[inds])
},
from = froms, to = tos,
MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
list(from = froms, to = tos, D = Ds)
}
Compute on the groups with, for example, data.table:
library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
# country city from to D
# 1: 1 1 0 70000 -0.2366040
# 2: 1 1 30000 100000 0.4394620
# 3: 1 1 60000 130000 0.2838260
# 4: 1 1 90000 160000 NaN
# 5: 1 2 0 70000 1.8112196
# 6: 1 2 30000 100000 0.6134090
# 7: 1 2 60000 130000 0.5959344
# 8: 1 2 90000 160000 NaN
# 9: 1 3 0 70000 1.3216255
#10: 1 3 30000 100000 1.8812397
I.e. a faster equivalent of
lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
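Note that an empty window gives sum(numeric(0)) / sum(numeric(0)), which is NaN, while the question asks for NA. A hedged single-group sketch that returns NA explicitly is shown below; window_D is a hypothetical helper introduced here, and the five rows are the country 1, city 1 values printed in the question.

```r
# Hypothetical helper: D for every window of one (country, city) group
window_D <- function(salary, arg1, arg2, size, step) {
  starts <- seq(0, max(salary), by = step)
  D <- vapply(starts, function(from) {
    inside <- salary > from & salary < from + size
    if (!any(inside)) return(NA_real_)   # empty window: NA rather than NaN
    sum(arg1[inside]) / sum(arg2[inside])
  }, numeric(1))
  data.frame(from = starts, to = starts + size, D = D)
}

# Country 1, city 1 from the example data in the question
salary <- c(22791, 34598, 76411, 76790, 87297)
arg1   <- c(-1.4606212, 0.9244679, 0.8288587, 1.3013056, -1.4021137)
arg2   <- c(1.07084528, 1.19519158, 0.86737330, 0.07380115, 1.62395596)

res <- window_D(salary, arg1, arg2, size = 70000, step = 30000)
round(res$D, 4)
# -0.2366 0.4395 0.2838

# A sparse group, to show the NA for windows without any salary
sparse <- window_D(c(10, 50000), c(1, 1), c(2, 2), size = 1000, step = 20000)
sparse$D
# 0.5 NA NA
```

Unlike ff, this sketch stops at the last window start that does not exceed the maximum salary, so it produces one row fewer per group.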
Without your expected outcome it is a bit hard to guess whether my result is correct but it should give you a head start for the first step. From a performance point of view the data.table package is very fast. Much faster than loops.
set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)
WindowSize <- 10
WindowStep <- 3
# by must be an argument of [.data.table, outside the .() that builds D
dt[, .(D = rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
         rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
   by = .(country, city)]
You can achieve the latter part of your goal by melting the data and writing a custom summary function that you then use to dcast your data back together.
Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0,WindowSize)
WindowStep = 100
for (x in unique(dt$country)) {
  # subset of data for that country
  CountrySubset = dt[dt$country == x, , drop = FALSE]
  for (y in unique(CountrySubset$city)) {
    # subset of data for cities within the country
    CitySubset = CountrySubset[CountrySubset$city == y, , drop = FALSE]
    for (z in 1:StepNumber) {
      WinRange = WindowRange + (z * WindowStep)
      # subset of salaries within the city that fall inside the window
      WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
      CalcD = sum(WindowData$arg1) / sum(WindowData$arg2)
      Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
      Table = rbind(Table, Output)
    }
  }
}
Using your example data this should work; it is just a series of nested loops that append each result to Table with rbind. Iterating over the unique() values of country and city, rather than over every row, keeps each window from being written more than once.
WindowStep is the distance between the starts of consecutive windows.
StepNumber is how many steps you want to take in total; it might be best to find the maximum salary and adjust for that.
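Since performance matters in the question, one common alternative to growing Table with rbind inside the loop is to collect the pieces in a preallocated list and bind them once at the end, because each rbind call copies the whole table built so far. The sketch below shows only that pattern; compute_row is a hypothetical stand-in for the per-window computation, and its values are placeholders.

```r
# Hypothetical stand-in for the per-window computation in the loops
compute_row <- function(z) {
  data.frame(WinStart = z * 100, WinEnd = z * 100 + 1000, D = z / 2)
}

n_steps <- 5
rows <- vector("list", n_steps)   # preallocate one slot per step
for (z in seq_len(n_steps)) {
  rows[[z]] <- compute_row(z)
}

# Bind all pieces in a single call instead of once per iteration
Table <- do.call(rbind, rows)
nrow(Table)
# 5
```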
