Dummy variables based on values from different columns - r

I currently have a data frame that looks like this:
dat2<-data.frame(
ID=c(100,101,102,103),
DEGREE_1=c("BA","BA","BA","BA"),
DEGREE_2=c(NA,"BA",NA,NA),
DEGREE_3=c(NA,"MS",NA,NA),
YEAR_DEGREE_1=c(1980,1990,2000,2004),
YEAR_DEGREE_2=c(NA,1992,NA,NA),
YEAR_DEGREE_3=c(NA,1996,NA,NA)
)
ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_DEGREE_1 YEAR_DEGREE_2 YEAR_DEGREE_3
100 BA <NA> <NA> 1980 NA NA
101 BA BA MS 1990 1992 1996
102 BA <NA> <NA> 2000 NA NA
103 BA <NA> <NA> 2004 NA NA
I would like to create dummy variables coded 0/1 based on what kind of degree was earned, using the completion of one BA degree as the base.
The completed data frame would have a second BA degree dummy, an MS degree dummy, and so on. For example, for ID 101, both dummies would have a value of 1. The completion of two MS degrees would not require a dummy, i.e. if someone completed two MS degrees, then the MS degree dummy would be 1 and there would be no dummy to signify completing two MS degrees.
Like such
This is a simple snapshot of a much bigger data frame that has many different degrees types besides BA and MS, so it isn't ideal for me to create if/else statements for every single degree type.
Any advice would be appreciated.

You could also include new columns and assign the value based on the DEGREE columns.
Including new columns, with all values equal 0:
dat2 <- cbind(dat2, BA_2nd = 0)
dat2 <- cbind(dat2, MS = 0)
Changing the value to 1, based on your conditions:
dat2[!is.na(dat2$DEGREE_2), "BA_2nd"] <- 1
dat2[!is.na(dat2$DEGREE_3) & dat2$DEGREE_3 == "MS", "MS"] <- 1
dat2
You can adapt it to all the conditions you have. This code generates only the output table that you included.
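Since the question mentions many degree types, here is a minimal sketch of how the same idea might be generalised without one line per degree. It assumes the degree columns all start with DEGREE_, that "BA" is the base category, and that a dummy should be 1 if the degree appears at least once (or at least twice for the base). The names degree_cols, base_degree, other_types and dummies are made up for illustration.

```r
dat2 <- data.frame(
  ID = c(100, 101, 102, 103),
  DEGREE_1 = c("BA", "BA", "BA", "BA"),
  DEGREE_2 = c(NA, "BA", NA, NA),
  DEGREE_3 = c(NA, "MS", NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: every degree column is named DEGREE_<n>
degree_cols <- grep("^DEGREE_", names(dat2), value = TRUE)
degrees <- as.matrix(dat2[degree_cols])  # character matrix, NAs preserved

base_degree <- "BA"
other_types <- setdiff(unique(na.omit(as.vector(degrees))), base_degree)

# Second-BA dummy: 1 when the base degree appears at least twice in a row
dummies <- data.frame(
  BA_2nd = as.integer(rowSums(degrees == base_degree, na.rm = TRUE) >= 2)
)
# One dummy per remaining degree type: 1 when it appears at least once
for (type in other_types) {
  dummies[[type]] <- as.integer(rowSums(degrees == type, na.rm = TRUE) >= 1)
}

result <- cbind(dat2, dummies)
result
```

Because the loop runs over whatever types occur in the data, a new degree type gets its own column automatically.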


Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries, the data are of dyadic format containing 19687 rows, three columns (reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year). rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair of (rid, pid) in which rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid to get their bilateral trade data. Not every pid has a numeric id, because I only assigned an id to a country if a list of relevant economic indicators is available for it; this is why there are NAs in pid even though a TradeValue exists between that country and the reporting country (rid). Something similar happens on the reporter side: a country that did not report any TradeValue with partners is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics confirms this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. I therefore want to preserve the complete list of 161 countries and transform the example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose that in a 5 x 5 adjacency matrix country 1 did not report any trade statistics, while the other four reported their bilateral trade with each other (but not with country 1). The original dataframe looks like this:
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
which I want to convert to a 5 x 5 adjacency matrix (as a data.frame). The desired output looks like this:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I would then use the same method on example_data to create the 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not get the conversion to work, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work from your 5-country example dataframe:
library(reshape2)  # provides dcast
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = setdiff(1:country_num, example_data$pid)
# if no pid is missing, pair the missing rid with pid 1 as a placeholder
if (length(pid_miss) == 0) pid_miss = 1
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
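As an alternative that avoids reshaping altogether, here is a minimal base R sketch: start from an all-zero matrix and fill it by matrix indexing. This is not the dcast approach above, just another way to reach the same output. It assumes ids run from 1 to country_num and that rows with an NA rid or pid should simply be dropped; the names keep, edges, adj and adj_df are made up for illustration.

```r
# 5-country example from the question
example_data <- data.frame(
  rid = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  pid = c(3, 4, 5, 2, 4, 5, 2, 3, 5, 2, 3, 4),
  TradeValue = c(223, 13, 9, 223, 57, 28, 13, 57, 82, 9, 28, 82)
)
country_num <- 5

# Assumption: rows without a usable id on either side are dropped
keep <- !is.na(example_data$rid) & !is.na(example_data$pid)
edges <- example_data[keep, ]

# Start with zeros everywhere, so non-reporting countries and missing
# pairs are already 0, then fill the reported cells in one step.
adj <- matrix(0, nrow = country_num, ncol = country_num)
adj[cbind(edges$rid, edges$pid)] <- edges$TradeValue

adj_df <- as.data.frame(adj)  # columns are named V1..V5 by default
adj_df
```

For the full data, setting country_num to 161 should be the only change needed, provided the ids really are 1 to 161.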

How can I fill in NA values based on the next real value but divide that value between the preceding NAs?

Please note: this is a hyper simplified explanation of where the 'data' comes from, but where the data is from is irrelevant to the coding question.
I have a data set created by collecting water in a tube everyday.
I can't go and measure the tube every day (but the tube keeps filling) so there are gaps in the water value records.
This dummy data set shows gaps on day 5 and on days 7 to 9. Because this is a dummy data set, I have assumed that 500 ml of water goes into the tube each day (the real data set is a lot messier!).
dummy data
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
value<-c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df<-data.frame(day,value)
Data explanation: I collected every day for days 1 to 4, so the value for each day is 500 ml. I missed day 5, so its value is NA, and collected on day 6, so that value is 1000 ml (the water from days 5 and 6 combined). I missed days 7, 8 and 9, so those values are NA, and collected on day 10 to give a value of 2000 ml for the 4 days. I then collected every day for the last two days.
I would like to fill in the NA gaps by taking the value of the next 'real' measurement and dividing it equally between the preceding NA days and the measurement day itself. Yes, I am assuming that the process is constant while I am not measuring, so that the measurement can be divided equally between the days.
this is what the output data should look like
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
corrected.value<-c(500,500,500,500,500,500,500,500,500,500,500,500)
corrected.df<-data.frame(day,corrected.value)
Again this is just a dummy data set otherwise the easiest way would just be replace NA with 500 with 'value[is.na(value)] <- 500', but in the real data set the values can be 457.6, 779, 376, etc.
Also tried to do a loop but keep getting stuck...
Any ideas on how I can do this?
Help is greatly appreciated
Here's a possible solution :
# Create test Data:
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
value<-c(NA,500,500,500,NA,3000,NA,NA,NA,5000,500,500,NA,NA,NA)
df<-data.frame(day,value)
# "Cleansing" starts here :
RLE <- rle(is.na(df$value))
# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
RLE$lengths <- head(RLE$lengths,-1)
RLE$values <- head(RLE$values,-1)
}
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
firstNA <- (cumsum(RLE$lengths)- RLE$lengths + 1)[RLE$values]
occurences <- afterNA - firstNA + 1
replacements <- df$value[afterNA] / occurences
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurences)
Result :
> df
day value
1 1 250
2 2 250
3 3 500
4 4 500
5 5 1500
6 6 1500
7 7 1250
8 8 1250
9 9 1250
10 10 1250
11 11 500
12 12 500
13 13 NA
14 14 NA
15 15 NA
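For comparison, here is a shorter sketch of the same idea using grouping instead of rle. It is not the code above, only an alternative under the same assumption: each real measurement is shared equally with the NAs directly before it, and trailing NAs are left alone. The function name fill_gaps is made up for illustration.

```r
fill_gaps <- function(value) {
  # Counting non-NA values from the end gives every measurement, plus the
  # NAs directly before it, the same group id. Trailing NAs get id 0.
  group <- rev(cumsum(rev(!is.na(value))))
  ave(value, group, FUN = function(v) {
    if (all(is.na(v))) {
      v                                  # trailing NAs: nothing to share
    } else {
      sum(v, na.rm = TRUE) / length(v)   # split the measurement evenly
    }
  })
}

# Data from the question
value <- c(500, 500, 500, 500, NA, 1000, NA, NA, NA, 2000, 500, 500)
fill_gaps(value)
```

On the question's data this should give 500 for every day, and on the test data above it should reproduce the 250 / 1500 / 1250 pattern.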

Check if a variable is time invariant in R

I searched for an answer to my question but only found one for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command to keep only the individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression, so I loaded the plm package and then set up the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm (pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=dd.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think there are some mistakes in the survey (for example, sex was not entered correctly and so is not the same across rows for the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df is composed up to 40k observations.
If you know another way to check if a variable is constant in R I will be glad.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
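To make that concrete, here is a small sketch on made-up data in which id 1 is recorded as both F and M. It uses tapply rather than by only because the result is a plain named vector, which makes it easy to pull out the offending ids; n_sex and bad_ids are illustrative names.

```r
# Made-up example: id 1 has inconsistent sex values
df <- data.frame(
  id  = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# Number of distinct sex values per id
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))

any(n_sex > 1)                       # is any id inconsistent?
bad_ids <- names(n_sex)[n_sex > 1]   # which ids are inconsistent
bad_ids
```

Note that bad_ids comes back as character, since it is taken from the names of the result.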
We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Id 1 is both F and M
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes>1)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2

Unable to set column names to a subset of a dataframe

I run the following code, p is the dataframe loaded.
a <- sort(table(p$Title))
a1 <- as.data.frame(a)
tail(a1, 7)
a
Maths 732
Science 737
Physics 737
Chemistry 776
Social Science 905
null 57374
88117
I want to do some manipulations on the above dataframe result. I want to add column names to the dataframe. I tried the colnames function.
colnames(a1) <- c("category", "count")
I get the below error:
Error in `colnames<-`(`*tmp*`, value = c("category", "count")) :
attempt to set 'colnames' on an object with less than two dimensions
Please suggest.
As I said in the comments to your question, the categories are rownames. A reproducible example:
# create dataframe p
x <- c("Maths","Science","Physics","Chemistry","Social Science","Languages","Economics","History")
set.seed(1)
p <- data.frame(title=sample(x, 100, replace=TRUE), y="some arbitrary value")
# create the data.frame as you did
a <- sort(table(p$title))
a1 <- as.data.frame(a)
The resulting dataframe:
> a1
a
Social Science 6
Maths 9
History 10
Science 11
Physics 12
Languages 15
Economics 17
Chemistry 20
Looking at the dimensions of dataframe a1, you get this:
> dim(a1)
[1] 8 1
which means that your dataframe has 8 rows and 1 column. Trying to assign two columnnames to the a1 dataframe will hence result in an error.
You can solve your problem in two ways:
1: assign just 1 columnname with colnames(a1) <- c("count")
2: convert the rownames to a category column and then assign the columnnames:
a1$category <- row.names(a1)
colnames(a1) <- c("count","category")
The resulting dataframe:
> a1
count category
Social Science 6 Social Science
Maths 9 Maths
History 10 History
Science 11 Science
Physics 12 Physics
Languages 15 Languages
Economics 17 Economics
Chemistry 20 Chemistry
You can remove the rownames with rownames(a1) <- NULL. This gives:
> a1
count category
1 6 Social Science
2 9 Maths
3 10 History
4 11 Science
5 12 Physics
6 15 Languages
7 17 Economics
8 20 Chemistry
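Another route, as a minimal sketch: as.data.frame has a method for table objects that returns the categories as a regular column, so naming the dimension and the count up front may save the rowname step. This assumes the object passed in still has class table (here the table is converted before any sorting); the toy p and the name counts are made up for illustration.

```r
# Toy stand-in for the real data
p <- data.frame(title = c("Maths", "Science", "Maths", "Physics", "Maths", "Science"))

# Name the table dimension "category" and the frequency column "count"
counts <- as.data.frame(table(category = p$title),
                        responseName = "count",
                        stringsAsFactors = FALSE)

# Sort by count afterwards, on the data frame rather than on the table
counts <- counts[order(counts$count), ]
counts
```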

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value across all rows (situations) of the same individual.
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid=1, and the next six rows (pid=2) are different observations for the second person.
This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but will deal with duplicated rows. replicate has its obvious use of performing a set operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of a do.call to rbind and lapply deal with the duplicates that are not handled well by the above code. Thus, instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
function(x) d[d$pid==x,])),
simplify=FALSE)
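Here is the same idea wrapped in a small function, as a sketch rather than a definitive implementation. It draws as many individuals as there are in d, and adds a draw column so that a person sampled twice can still be told apart in later panel code. The function name resample_panel and the draw column are made up for illustration.

```r
# Example data from the question
d <- data.frame(
  pid = rep(1:2, each = 6),
  x = seq(10, 32, by = 2),
  y = 2:13,
  z = seq(-5, 0.5, by = 0.5)
)

resample_panel <- function(d, id = "pid") {
  ids <- unique(d[[id]])
  # Sample individuals (not rows) with replacement; sample.int avoids the
  # surprise that sample(5, ...) samples from 1:5 when only one id exists
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(k) {
    block <- d[d[[id]] == drawn[k], , drop = FALSE]
    block$draw <- k  # hypothetical helper column: position of this draw
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(1)
boot <- replicate(100, resample_panel(d), simplify = FALSE)
```

boot is then a list of 100 data frames, each containing complete blocks of rows for the sampled individuals.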
