I'm trying to recode multiple race variables into a single race variable. The variables are RaceVar1: Asian, RaceVar2: AIAN, RaceVar3: Black, RaceVar4: Native Hawaiian, RaceVar5: White. Each is coded 1 if the participant chose that race and 0 otherwise. I would like to create a new race variable that condenses these into one and also flags anyone who ticked multiple races.
I can do this in SAS, but I need to do it in R and am unsure how to perform the same task. SAS code below:
data want;
  set have;
  length race $40;
  if sum(of r_s_q61___1 - r_s_q61___5) > 1 then race='More than one race';
  else if r_s_q61___2 then race='American Indian or Alaska Native';
  else if r_s_q61___1 then race='Asian';
  else if r_s_q61___5 then race='White';
  else if r_s_q61___3 then race='Black or African American';
  else if r_s_q61___4 then race='Native Hawaiian or Other Pacific Islander';
  else race='Unknown';
run;
I'm not sure where to start, other than that I might use rowSums() and ifelse() within a mutate() call.
Yep! mutate and ifelse are your friends here. With dplyr, though, we've also got a neat function called case_when that lets us write a whole chain of nested ifelse statements in a single call.
library(dplyr)
# Sample data; RaceVar6 is an extra column so that row 6 has two boxes ticked
data.frame(`RaceVar1`=c(1,0,0,0,0,1),
           `RaceVar2`=c(0,1,0,0,0,0),
           `RaceVar3`=c(0,0,1,0,0,0),
           `RaceVar4`=c(0,0,0,1,0,0),
           `RaceVar5`=c(0,0,0,0,1,0),
           `RaceVar6`=c(0,0,0,0,0,1)) %>%
  # Count how many boxes each respondent ticked
  mutate(more_than_one=rowSums(.)) %>%
  mutate(Race=ifelse(
    more_than_one>1,
    'More than one race',
    case_when(
      RaceVar1 == 1 ~ "RaceVar1: Asian",
      RaceVar2 == 1 ~ "RaceVar2: AIAN",
      RaceVar3 == 1 ~ "RaceVar3: Black",
      RaceVar4 == 1 ~ "RaceVar4: Native Hawaiian",
      RaceVar5 == 1 ~ "RaceVar5: White",
      TRUE ~ "Unknown"  # none of the five matched, as in the SAS code
    )
  ))
  RaceVar1 RaceVar2 RaceVar3 RaceVar4 RaceVar5 RaceVar6 more_than_one                      Race
1        1        0        0        0        0        0             1           RaceVar1: Asian
2        0        1        0        0        0        0             1            RaceVar2: AIAN
3        0        0        1        0        0        0             1           RaceVar3: Black
4        0        0        0        1        0        0             1 RaceVar4: Native Hawaiian
5        0        0        0        0        1        0             1           RaceVar5: White
6        1        0        0        0        0        1             2        More than one race
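For readers who want to check the recoding rule outside R, here is a minimal Python sketch of the same logic, following the SAS code in the question: more than one tick gives 'More than one race', exactly one tick gives that race's label, and no tick gives 'Unknown'. The names `LABELS` and `collapse_race` are hypothetical and exist only for this illustration.

```python
# Labels for RaceVar1..RaceVar5 in column order, taken from the SAS code above.
LABELS = [
    "Asian",
    "American Indian or Alaska Native",
    "Black or African American",
    "Native Hawaiian or Other Pacific Islander",
    "White",
]


def collapse_race(indicators):
    """Collapse one row of 0/1 race indicators into a single label."""
    ticked = [label for flag, label in zip(indicators, LABELS) if flag == 1]
    if len(ticked) > 1:
        return "More than one race"
    if len(ticked) == 1:
        return ticked[0]
    return "Unknown"


# Three made-up respondents: one race, two races, nothing ticked.
rows = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
]
races = [collapse_race(row) for row in rows]
print(races)
```

Because the multi-race check comes first, the order of the single-race labels does not affect the result.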
Related
I have a dataset containing insurance pricing and coverage information. The first column refers to the policy identifier, and the remaining columns refer to premium, limit, deductible, and further details as dummy variables (State and coverage).
Identifier    Price    Limit  Deductible  Peril1  Peril2  Peril3  Peril4  Peril5  Peril6  State1  State2  State3  State4
POL1          250.0   100000       500.0       1       1       1       0       0       1       1       0       0       0
POL1          625.0   100000      1000.0       1       1       1       0       0       1       1       0       0       0
POL1         1650.0   500000      1000.0       1       1       1       0       0       1       1       0       0       0
POL1         2500.0  1000000      1000.0       1       1       1       0       0       1       1       0       0       0
POL1         4375.0  2000000      2000.0       1       1       1       0       0       1       1       0       0       0
POL2           25.0    50000       500.0       0       0       1       1       0       0       1       0       0       0
POL3          60.25    25000       500.0       1       1       1       1       1       1       1       0       0       0
POL3          73.25    50000       500.0       1       1       1       1       1       1       1       0       0       0
As the sample data frame shows, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described in a single row.
I am trying to run a multivariate regression:
reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df)
When I first ran the regression, the residuals did not follow a normal distribution, so I decided to log-transform the dependent variable. In addition, my data frame contains several outliers and shows heteroscedasticity.
For these reasons I thought WLS regression could be a solution, because it can help me assign an appropriate weight to each error term. To understand how WLS works and the theory behind it, I tried a simple weighted regression as explained here:
wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df, weights=wt)
Looking at the results, however, I don't think this is the correct approach to my problem, especially since many rows are dropped when I fit the model this way.
From my understanding, since the weights argument of lm should be a vector, I could assign a specific weight to each policy. For instance, each row pertaining to POL1 would get a weight of 1/5. Despite having read the documentation and relevant posts, and having searched for packages that could facilitate my work, it is not clear to me how to implement WLS in my case.
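The per-policy weighting idea described above (each of the five POL1 rows gets 1/5) can be sketched as follows. This is only a minimal Python illustration of how such a weight vector could be built; it does not address whether this weighting is statistically appropriate for the data. The function name `policy_weights` is hypothetical.

```python
from collections import Counter


def policy_weights(identifiers):
    """Return one weight per row: 1 / (number of rows sharing that policy id)."""
    counts = Counter(identifiers)
    return [1.0 / counts[policy_id] for policy_id in identifiers]


# Identifier column from the sample table: five POL1 rows, one POL2, two POL3.
ids = ["POL1"] * 5 + ["POL2"] + ["POL3"] * 2
weights = policy_weights(ids)
print(weights)
```

With weights built this way, every policy contributes a total weight of 1 no matter how many rows describe it. In R, a vector like this would be passed to lm through its weights argument.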
I have multiple-response questions with 5 categories (values). I want to find the respondents who answered only one category.
For example,
respondents who chose category 1 and none of categories 2, 3, 4, 5.
That is, I want only the pure "A" mentions: respondents who checked category A alone. I need the count of these.
Help, please.
The following solution assumes the data has 5 dichotomous variables, one for each of the multiple response categories.
* creating some sample data to demonstrate on.
data list list/cat1 to cat5.
begin data
1 0 0 0 1
0 1 1 0 0
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 0 1
1 0 0 0 0
1 1 1 0 0
end data.
* now checking in which cases only category 1 was chosen.
compute NumCats=sum(cat1 to cat5).
if cat1=1 and NumCats=1 onlyCat1=1.
execute.
* if instead you wish to do the same check for each of the 5 categories,
use `do repeat` this way.
do repeat cat=cat1 to cat5/only=only1 to only5.
compute only=(cat=1 and NumCats=1).
end repeat.
execute.
But ditch the EXECUTE commands. They just cause a useless data pass in this case except for immediately updating the Data Editor (instead of updating on the next data pass).
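Since the question asks for a count, here is a minimal Python sketch of the same "only this category" check, applied to the sample data from the SPSS answer. It tallies, for each category, the respondents who ticked that category and nothing else. The function name `only_counts` is hypothetical.

```python
# Sample data copied from the SPSS example: one row per respondent, cat1..cat5.
rows = [
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]


def only_counts(data):
    """Count, per category, the rows where that category is the sole one ticked."""
    counts = [0] * len(data[0])
    for row in data:
        if sum(row) == 1:  # same test as NumCats=1 in the SPSS syntax
            counts[row.index(1)] += 1
    return counts


counts = only_counts(rows)
print(counts)  # counts[0] is the number of respondents who chose only category 1
```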
I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is split each string at the commas into the units that start with Department, Division or Center, and turn each unit into its own indicator column. The final output should look like this:
Dept_Mechanical_Eng  Dept_Computer_Science  Div_Adv_Machining  Cntr_Mining_Metallurgy  Dept_Aerospace  Cntr_Science_Delivery
                  1                      1                  0                       0               0                      0
                  0                      0                  1                       1               0                      0
                  0                      0                  0                       0               1                      1
I have butchered the actual names just for aesthetic purpose in the expected output. Any help on parsing this string is much appreciated.
This is very similar to a question I just answered that tabulated another text example. Are you in the same class as the questioner here? "Count the number of times (frequency) a string occurs"
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text=inp,what="",sep=","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame( setNames( lapply(levels(inp2), function(ll) as.numeric(grepl(ll, inp3) ) ), trimws(levels(inp2) )) )
  Department.of.Aerospace Division.of.Advanced.Machining
1                       0                              0
2                       0                              1
3                       1                              0
  Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery
1                                0                               0
2                                1                               0
3                                0                               1
  Department.of.Computer.Science Department.of.Mechanical.Engineering
1                              1                                    1
2                              0                                    0
3                              0                                    0
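The same split-and-tabulate idea can be sketched in a few lines of Python for comparison. Unlike the grepl approach, this sketch matches each comma-separated unit exactly after trimming whitespace, and it keeps the columns in order of first appearance instead of sorting them. The function name `to_indicators` is hypothetical.

```python
lines = [
    "Department of Mechanical Engineering, Department of Computer Science",
    "Division of Advanced Machining, Center for Mining and Metallurgy",
    "Department of Aerospace, Center for Science and Delivery",
]


def to_indicators(rows):
    """Split each row on commas and build a 0/1 table with one column per unit."""
    parsed = [[part.strip() for part in row.split(",")] for row in rows]
    columns = []
    for parts in parsed:
        for part in parts:
            if part not in columns:  # keep first-appearance order
                columns.append(part)
    table = [[1 if col in parts else 0 for col in columns] for parts in parsed]
    return columns, table


columns, table = to_indicators(lines)
print(columns)
for row in table:
    print(row)
```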
I want to transform an ordinal variable (0-2), where 0 is no rights, 1 is some rights, and 2 is full rights, into a dichotomous variable.
The original ordinal variable is coded for each country and year (country-year unit).
I want to create a dichotomous variable (let's call it Improvement) capturing all annual positive changes for each country-year. So when it goes from 0 to 1 (or from 0 to 2, or from 1 to 2), I want it to be 1 for that year and country, and zero otherwise.
Below is an example of what my data looks like. "RIGHTS" is the original ordinal variable. "MY DICHOTOMOUS" is the variable I want to calculate in R. How can I do it?
COUNTRY  YEAR  RIGHTS  MY DICHOTOMOUS
A        1990       0               0
A        1991       0               0
A        1992       0               0
A        1993       1               1
A        1994       0               0
B        1990       1               1
B        1991       1               0
B        1992       1               0
B        1993       1               0
B        1994       1               0
Please note that the original data can go the other way as well, i.e. the change can be negative. I do not want to code negative changes in this dichotomous variable.
We can use diff:
# diff() > 0 flags any increase (0 to 1, 0 to 2, 1 to 2); it is not grouped by country
df1$dichotomous <- +c(FALSE, diff(df1$RIGHTS) > 0)
df1$dichotomous
#[1] 0 0 0 1 0 1 0 0 0 0
This assumes you don't consider starting with a 1 in rights as a 1 in dichotomous:
x <- rights
n <- length(x)
# compare each value with the one before it; > 0 also catches a jump from 0 to 2
dichotomous <- c(0, as.numeric(x[-1] - x[-n] > 0))
You might have to use a series of ifelse() statements. But then again, I might be misreading your question. An example is posted below.
# Hard-coded to the sample data: country A copies RIGHTS, country B is 1 only in 1990
MY.DATA$MY.DICHOTOMOUS <- with(MY.DATA, ifelse(COUNTRY=="A", RIGHTS, ifelse(COUNTRY=="B" & YEAR==1990, 1, 0)))
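None of the snippets above compares years strictly within a country, so here is a minimal Python sketch of that variant. It flags a country-year as 1 when RIGHTS is higher than in the previous row for the same country. Two assumptions are built in: the rows are already sorted by year within each country, and the first year of each country is coded 0 because there is no earlier year to compare with. The second assumption differs from the sample table in the question, where B 1990 is coded 1. The function name `improvement_flags` and country "C" are hypothetical.

```python
def improvement_flags(records):
    """records: (country, year, rights) tuples, sorted by year within country.

    Returns one 0/1 flag per record, in input order.
    """
    previous = {}  # country -> RIGHTS value in that country's preceding row
    flags = []
    for country, _year, rights in records:
        prior = previous.get(country)
        # Assumption: a country's first row has no prior value and is coded 0.
        flags.append(1 if prior is not None and rights > prior else 0)
        previous[country] = rights
    return flags


# Data from the question.
records = [
    ("A", 1990, 0), ("A", 1991, 0), ("A", 1992, 0), ("A", 1993, 1), ("A", 1994, 0),
    ("B", 1990, 1), ("B", 1991, 1), ("B", 1992, 1), ("B", 1993, 1), ("B", 1994, 1),
]
flags = improvement_flags(records)
print(flags)

# A made-up country C showing that a jump from 0 to 2 counts and a drop does not.
extra = improvement_flags([("C", 2000, 0), ("C", 2001, 2), ("C", 2002, 1)])
print(extra)
```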
My question is an extension of that found here: Construct new variable from given 5 categorical variables in Stata
I am an R user and I have been struggling to adjust to the Stata syntax. Also, I'm used to being able to Google for R documentation and examples online, and I haven't found as many resources for Stata, so I've come here.
I have a data set where the rows represent individual people and the columns record various attributes of these people. There are 5 categorical variables (white, hispanic, black, asian, other) that have binary response data, 0 or 1 ("No" or "Yes"). I want to create a mosaic plot of race vs response data using the spineplots package. However, I believe I must first combine all 5 of the categorical variables into a categorical variable with 5 levels that maintains the labels (so I can see the response rate for each ethnicity.) I've been playing around with the egen function but haven't been able to get it to work. Any help would be appreciated.
Edit: Added a depiction of what my data looks like and what I want it to look like.
my data right now:
person_id,black,asian,white,hispanic,responded
1,0,0,1,0,0
2,1,0,0,0,0
3,1,0,0,0,1
4,0,1,0,0,1
5,0,1,0,0,1
6,0,1,0,0,0
7,0,0,1,0,1
8,0,0,0,1,1
what I want is to produce a table through the tabulate command to make the following:
respond, black, asian, white, hispanic
responded to survey | 20, 30, 25, 10, 15
did not respond | 15, 20, 21, 23, 33
It seems like you want a single indicator variable rather than multiple {0,1} dummies. The easiest way is probably with a loop; another option is to use cond() to generate a new indicator variable (note that you may want to catch respondents for whom all the race dummies are 0 in an 'other' group), label its values (and the values of responded), and then create your frequency table:
clear
input person_id black asian white hispanic responded
1 0 0 1 0 0
2 1 0 0 0 0
3 1 0 0 0 1
4 0 1 0 0 1
5 0 1 0 0 1
6 0 1 0 0 0
7 0 0 1 0 1
8 0 0 0 1 1
9 0 0 0 0 1
end
gen race = "other"
foreach v of varlist black asian white hispanic {
    replace race = "`v'" if `v' == 1
}
* value labels must match the codes assigned by cond() below
label define race2 1 "black" 2 "asian" 3 "white" 4 "hispanic" 99 "other"
gen race2:race2 = cond(black == 1, 1, ///
                  cond(asian == 1, 2, ///
                  cond(white == 1, 3, ///
                  cond(hispanic == 1, 4, 99))))
label define responded 0 "did not respond" 1 "responded to survey"
label values responded responded
tab responded race
with the result
                    |                         race
          responded |     asian      black   hispanic      other      white |     Total
--------------------+-------------------------------------------------------+----------
    did not respond |         1          1          0          0          1 |         3
responded to survey |         2          1          1          1          1 |         6
--------------------+-------------------------------------------------------+----------
              Total |         3          2          1          1          2 |         9
tab responded race2 yields the same results with a different ordering (by the actual values of race2 rather than the alphabetical ordering of the value labels).
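As a cross-check of the frequencies, the same collapse-then-tabulate steps can be sketched in Python on the nine sample rows. Like the cond() chain, this sketch takes the first dummy that equals 1 and falls back to "other" when none is set. The names `RACE_COLUMNS` and `race_of` are hypothetical.

```python
from collections import Counter

RACE_COLUMNS = ["black", "asian", "white", "hispanic"]

# Sample rows from the Stata input: (black, asian, white, hispanic, responded).
people = [
    (0, 0, 1, 0, 0),
    (1, 0, 0, 0, 0),
    (1, 0, 0, 0, 1),
    (0, 1, 0, 0, 1),
    (0, 1, 0, 0, 1),
    (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1),
    (0, 0, 0, 1, 1),
    (0, 0, 0, 0, 1),
]


def race_of(dummies):
    """Return the name of the first dummy equal to 1, or 'other' if none is set."""
    for name, flag in zip(RACE_COLUMNS, dummies):
        if flag == 1:
            return name
    return "other"


# Key: (responded, race); value: number of people in that cell.
table = Counter((row[4], race_of(row[:4])) for row in people)
for key in sorted(table):
    print(key, table[key])
```

A Counter returns 0 for cells that never occur, such as (0, "hispanic") here.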