How to create a frequency table under multiple conditions in R?

I am a rookie in R, so I think my questions are basic ones. I want to know the frequency of a variable under a couple of conditions. I tried to use table(), but it does not work. I have searched a lot and still cannot find the answer.
My data looks like this
ID AGE LEVEL End_month
1 14 1 201005
2 25 2 201006
3 17 2 201006
4 16 1 201008
5 19 3 201007
6 33 2 201008
7 17 2 201006
8 15 3 201005
9 23 1 201004
10 25 2 201007
I want to know two things.
First, I want to know the frequency of AGE at each LEVEL, with certain ages shown individually and everything else aggregated into a "20+" category. It should look like this.
      level
age    1  2  3  sum
  14   1  0  0    1
  16   1  0  0    1
  15   0  0  1    1
  17   0  2  0    2
  19   0  0  1    1
  20+  1  3  0    4
  sum  3  5  2   10
Second, I want to know the frequency of each age in each End_month, separately for level 2 and level 3 customers. I want to get tables like this.
For level 2 customers
       End_month
age    201004 201005 201006 201007 201008  sum
  15        0      0      0      0      0    0
  19        0      0      0      0      0    0
  17        0      0      2      0      0    2
  19        0      0      0      0      0    0
  25        0      0      0      1      0    1
  33        0      0      0      1      1    2
  sum       0      0      2      2      1    5
For level 3 customers
       End_month
age    201004 201005 201006 201007 201008  sum
  15        0      1      0      0      0    1
  19        0      0      0      1      0    1
  17        0      0      0      0      0    0
  19        0      0      0      0      0    0
  25        0      0      0      0      0    0
  33        0      0      0      0      0    0
  sum       0      1      0      1      0    2
Many thanks in advance.

You can still achieve this with table(), because it can take more than one variable.
For example, assuming your data is in a data frame df, use
table(df$AGE, df$LEVEL)
to get the first two-way table.
When you want to produce such a table for each subset defined by LEVEL, you can do it this way (going for level 1 here):
keep <- df$LEVEL == 1
table(df$AGE[keep], df$End_month[keep])
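Putting both requests together, here is a minimal sketch, assuming the question's data is in a data frame df with the columns shown above; cut() builds the "20+" bucket and addmargins() appends the sums:
df <- data.frame(
  ID = 1:10,
  AGE = c(14, 25, 17, 16, 19, 33, 17, 15, 23, 25),
  LEVEL = c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2),
  End_month = c(201005, 201006, 201006, 201008, 201007,
                201008, 201006, 201005, 201004, 201007)
)

# bin ages: keep the observed ages below 20, pool everything else as "20+"
age_grp <- cut(df$AGE, breaks = c(-Inf, 14, 15, 16, 17, 19, Inf),
               labels = c("14", "15", "16", "17", "19", "20+"))

# first table: age group by level, with marginal sums
addmargins(table(age = age_grp, level = df$LEVEL))

# second request: age by End_month, one table per LEVEL;
# fixing the factor levels keeps all five month columns in every table
by(df, df$LEVEL, function(d)
  addmargins(table(age = d$AGE,
                   End_month = factor(d$End_month,
                                      levels = sort(unique(df$End_month))))))
The by() call returns one age-by-End_month table per LEVEL; look at the elements for levels 2 and 3.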

Related

Create an event dummy one year before/after an existing dummy variable (or as close as possible)

I am doing an event study in an unbalanced panel data set. The basic structure is that I have a different number of observations (deliveries) for each firm at different points over a period of around 15 years. I am interested in an event (a price increase), which is coded as a dummy variable when it occurs, plus some dummy leads and lags to check whether the effect of the price increase on my dependent variable becomes apparent around that event. As an example, for some firms the price increase occurs at 5 out of, e.g., 50 deliveries over 15 years.
However, now I also want to "simulate" the same event study one year before and one year after, to improve inference. So I want R to duplicate the event dummy for each firm at the delivery closest to one year before and one year after the actual event. Deliveries do not occur daily but on average every 25 days.
So, as code, the data looks something like this:
df <- data.frame(firm_id = c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4),
                 delivery_id = c(1,2,6,9,15,3,5,18,4,7,8,10,11,13,17,19,22,12,14,16,20,21),
                 date = c("2004-06-16", "2004-08-12", "2004-11-22", "2005-07-03", "2007-01-04",
                          "2004-09-07", "2005-02-01", "2006-01-17",
                          "2004-10-11", "2005-02-01", "2005-04-27", "2005-06-01", "2005-07-01",
                          "2006-01-03", "2007-01-06", "2007-03-24", "2007-05-03",
                          "2005-08-03", "2006-02-19", "2006-06-13", "2007-02-04", "2007-04-26"),
                 price_increase = c(0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0),
                 price_increase_year_before = c(1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0),
                 price_increase_year_after = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0))
Creating
firm_id delivery_id date price_increase price_increase_year_before price_increase_year_after
1 1 1 2004-06-16 0 1 0
2 1 2 2004-08-12 0 0 0
3 1 6 2004-11-22 0 0 0
4 1 9 2005-07-03 1 0 0
5 1 15 2007-01-04 0 0 0
6 2 3 2004-09-07 0 0 0
7 2 5 2005-02-01 0 0 0
8 2 18 2006-01-17 0 0 0
9 3 4 2004-10-11 0 0 0
10 3 7 2005-02-01 0 1 0
11 3 8 2005-04-27 0 0 0
12 3 10 2005-06-01 0 0 0
13 3 11 2005-07-01 0 0 0
14 3 13 2006-01-03 1 0 0
15 3 17 2007-01-06 0 0 1
16 3 19 2007-03-24 0 0 0
17 3 22 2007-05-03 0 0 0
18 3 12 2005-08-03 0 0 0
19 4 14 2006-02-19 0 0 0
20 4 16 2006-06-13 0 0 0
21 4 20 2007-02-04 0 0 0
22 4 21 2007-04-26 0 0 0
Here I want to create the two dummy columns on the right based on price_increase and date, for each firm. I would start with dplyr's group_by() and mutate() approach and an if_else() function, but I have no idea how to write a condition that becomes TRUE when a delivery is within +/- 1 month of the event date one year earlier or later, nor how to select the respective delivery. Do you have an idea?
Here is a possible approach using dplyr.
After group_by(firm_id), keep only the groups in which a price increase occurred.
Then create the two dummy variables wherever the date falls one year (+/- 30 days) before or after the date at which price_increase equals 1, and filter for the rows that meet these criteria.
Using distinct() prevents duplicate dummies within a group/firm: with deliveries roughly 25 days apart, two deliveries could otherwise both fall inside the 60-day window.
The rest is joining back to the original data, replacing the NAs in the dummy columns with zero, and sorting.
library(dplyr)
library(tidyr)  # for replace_na()

df$date <- as.Date(df$date)

df %>%
  group_by(firm_id) %>%
  filter(any(price_increase == 1)) %>%
  mutate(
    price_increase_year_before = ifelse(
      between(as.numeric(date[price_increase == 1] - date), 335, 395), 1, 0),
    price_increase_year_after = ifelse(
      between(as.numeric(date - date[price_increase == 1]), 335, 395), 1, 0)
  ) %>%
  filter(price_increase_year_before == 1 | price_increase_year_after == 1) %>%
  distinct(firm_id, price_increase_year_before, price_increase_year_after,
           .keep_all = TRUE) %>%
  right_join(df) %>%
  replace_na(list(price_increase_year_before = 0, price_increase_year_after = 0)) %>%
  arrange(firm_id, date)
Output
firm_id delivery_id date price_increase price_increase_year_before price_increase_year_after
<dbl> <dbl> <date> <dbl> <dbl> <dbl>
1 1 1 2004-06-16 0 1 0
2 1 2 2004-08-12 0 0 0
3 1 6 2004-11-22 0 0 0
4 1 9 2005-07-03 1 0 0
5 1 15 2007-01-04 0 0 0
6 2 3 2004-09-07 0 0 0
7 2 5 2005-02-01 0 0 0
8 2 18 2006-01-17 0 0 0
9 3 4 2004-10-11 0 0 0
10 3 7 2005-02-01 0 1 0
11 3 8 2005-04-27 0 0 0
12 3 10 2005-06-01 0 0 0
13 3 11 2005-07-01 0 0 0
14 3 12 2005-08-03 0 0 0
15 3 13 2006-01-03 1 0 0
16 3 17 2007-01-06 0 0 1
17 3 19 2007-03-24 0 0 0
18 3 22 2007-05-03 0 0 0
19 4 14 2006-02-19 0 0 0
20 4 16 2006-06-13 0 0 0
21 4 20 2007-02-04 0 0 0
22 4 21 2007-04-26 0 0 0
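One caveat: the code above assumes exactly one price increase per firm, since date[price_increase == 1] must be a single date for the subtraction to line up. If a firm can have several increases, a variation like the following sketch flags every delivery about one year from any of the firm's increase dates (note it does not include the closest-delivery distinct() step from the answer above):
df %>%
  group_by(firm_id) %>%
  mutate(
    # for each delivery date d, check whether ANY increase date in the group
    # lies roughly one year (335-395 days) after / before it
    price_increase_year_before = +sapply(date, function(d)
      any(between(as.numeric(date[price_increase == 1] - d), 335, 395))),
    price_increase_year_after = +sapply(date, function(d)
      any(between(as.numeric(d - date[price_increase == 1]), 335, 395)))
  ) %>%
  ungroup()
Groups without any increase simply get zeros, so the initial filter() is not needed here.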

Creating a DTM from a 3-column CSV file with R

I have a CSV file containing 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number, something like the sample below. I have roughly 4k diseases and 16k genes, so the disease and gene names are redundant across rows.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM (document-term matrix) from this using R. I've been trying to use the Corpus command from the tm library, but the corpus doesn't reduce the number of diseases and stays around 600k in size. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being more precise; as a bio guy I'm just getting started with computer science things :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
# simulate ~100k rows pairing ~6.5k possible gene names with 40 diseases
df <- data.frame(
  gene = sapply(1:100000, function(x)
    paste(c(sample(LETTERS, size = 2), sample(10, size = 1)), collapse = "")),
  disease = sample(40, size = 100000, replace = TRUE)
)
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first six rows (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
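Coming back to the original file: if you do want the third column as the cell values, here is a sketch using base R's xtabs(). The file name and column names are assumptions; adjust them to match your actual CSV:
# assuming no header and columns in the order disease, gene, count
dat <- read.csv("disease_gene.csv", header = FALSE,
                col.names = c("disease", "gene", "count"))

# xtabs() sums the count column per disease/gene cell, so duplicated
# disease-gene rows are aggregated automatically
dtm <- xtabs(count ~ disease + gene, data = dat)
dim(dtm)  # roughly 4k diseases x 16k genes

# a 4k x 16k dense matrix has 64M cells; a sparse result may be safer
dtm_sparse <- xtabs(count ~ disease + gene, data = dat, sparse = TRUE)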

How to compare rows of a data frame and store the result as a factor variable?

I have a data frame, please see below.
How do I compare the Volume where Purchase == 1 with the Volume at the previous Purchase == 1 row, and create a factor variable V1 as shown in Picture 2?
The df[5, "V1"] == 1 because df[5, "Volume"] > df[3, "Volume"], and so on.
How do I achieve this without loops, in a vectorized way, so the calculation is fast when dealing with millions of rows?
I've tried subsetting and then doing the comparison, but when I tried to put the result back into a factor variable, its number of rows differed from the number of rows of df, so I could not add the factor variable to the data frame.
Picture 2
Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0
With data.table:
library(data.table)
data <- data.table(read.table(text=' Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0', header=T))
data[, V1 := 0]
# fill the first purchase's "previous" value with itself so it yields 0, not NA
data[Purchase == 1, V1 := as.integer(Volume > shift(Volume, fill = Volume[1]))]
data[, V1 := as.factor(V1)]
Here, I filter the data to the rows where Purchase == 1, then bring in the previous purchase's Volume with the shift() function.
Finally, I compare Volume to the previous volume and assign 1 if Volume is larger.
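For completeness, the same comparison can be done fully vectorized in base R, without data.table; a sketch assuming the question's data frame is called df:
idx <- which(df$Purchase == 1)   # rows where a purchase happened
v1 <- integer(nrow(df))          # default 0 everywhere

# compare each purchase's Volume with the previous purchase's Volume;
# idx[-1] drops the first purchase, head(idx, -1) drops the last
v1[idx[-1]] <- as.integer(df$Volume[idx[-1]] > df$Volume[head(idx, -1)])

df$V1 <- factor(v1)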

Re-expanding a compressed dataframe to include zero values in missing rows

Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant: the ID numbers are just row numbers created by R, and in a real example the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
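As a side note, tidyr offers the expand-and-fill in one step; a sketch assuming the compressed data frame a from above:
library(tidyr)
# complete() expands Pos over the full integer sequence and fills new rows
restored <- complete(a, Pos = full_seq(Pos, period = 1),
                     fill = list(Watson = 0, Crick = 0, Total = 0))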
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
  All_Pos %>%
  left_join(Nonzero) %>%
  mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate(across(...)) call converts the NA values to zeros. If you only know the largest position MaxPos, you can construct it using
All_Pos <- data.frame(Pos = seq_len(MaxPos))
Note that the column must be named Pos for the join to match.

mlogit: missing value where TRUE/FALSE needed

I have data from a discrete choice experiment (DCE) looking at hiring preferences for individuals from different sectors, which I've formatted into long format. I want to model it using mlogit. I have exported the data and can successfully run the model in Stata using the asclogit command, but I'm having trouble getting it to run in R.
Here's a snapshot of the first 25 rows of data:
> data[1:25,]
userid chid item sector outcome cul fit ind led prj rel
1 11275 211275 2 1 1 0 1 0 1 1 1
2 11275 211275 2 2 0 1 0 0 0 0 0
3 11275 211275 2 0 0 0 0 1 1 0 1
4 11275 311275 3 0 1 1 1 0 0 0 1
5 11275 311275 3 2 0 0 1 0 0 0 1
6 11275 311275 3 1 0 0 1 0 0 0 0
7 11275 411275 4 0 0 1 0 1 1 0 0
8 11275 411275 4 2 1 0 1 1 1 1 0
9 11275 411275 4 1 0 0 1 0 1 0 0
10 11275 511275 5 1 1 1 0 1 0 1 1
11 11275 511275 5 2 0 0 0 1 1 0 0
12 11275 511275 5 0 0 0 0 1 1 1 0
13 11275 611275 6 0 0 0 1 1 0 0 1
14 11275 611275 6 1 1 1 1 1 0 0 1
15 11275 611275 6 2 0 1 1 1 0 1 0
16 11275 711275 7 1 0 0 0 0 0 1 0
17 11275 711275 7 0 0 1 0 0 1 1 0
18 11275 711275 7 2 1 1 0 0 1 1 1
19 11275 811275 8 0 1 0 1 0 0 1 1
20 11275 811275 8 1 0 1 0 1 1 1 1
21 11275 811275 8 2 0 0 0 0 0 1 1
22 11275 911275 9 0 0 1 1 0 0 1 0
23 11275 911275 9 2 1 1 1 1 1 0 1
24 11275 911275 9 1 0 1 0 1 1 0 0
25 11275 1011275 10 0 0 0 0 0 0 0 0
userid and chid are factor variables; the rest are numeric. The variables:
userid is the unique respondent ID
chid is the unique choice-set ID per respondent
item is the choice-set ID (repeated across respondents)
sector identifies the alternatives (3 different sectors)
outcome marks the alternative selected by the respondent in the given choice set
cul through rel are binary alternative-specific variables that vary across alternatives according to the experimental design
Here is my mlogit syntax:
mlogit(outcome ~ cul + fit + ind + led + prj + rel, shape = "long",
       data = data, id.var = userid, chid.var = "chid",
       choice = outcome, alt.var = "sector")
Here is the error I get:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed
I've made sure there are no missing data, and that each choice set has exactly 1 selected alternative.
Any ideas about why I'm getting this error, when the model runs fine in Stata with the exact same dataset? I've probably misread the mlogit syntax somewhere. If it helps, my Stata syntax is:
asclogit outcome cul fit rel ind fit led prj, case(chid) alternatives(sector)
Answering my own question here, as I figured it out.
R's mlogit can't handle a choice set in which none of the alternatives is selected. R also needs the data ordered properly: each alternative of a choice set must be in its own row, and the rows of a choice set must be consecutive. I hadn't done that due to some data management. Interestingly, Stata can handle both of these conditions, which is why my Stata command worked.
As an aside, for those interested, Stata's asclogit and R's mlogit give exactly the same results. Always nice when that happens.
You may need to use mlogit.data() to shape the data. There are examples at ?mlogit. Hope that helps.
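For illustration, a sketch of that reshaping step with mlogit.data(), assuming the data frame data from the question with one row per alternative, ordered within chid:
library(mlogit)

# tell mlogit which columns identify the choice, the alternatives,
# the choice sets, and the respondents
dat <- mlogit.data(data, choice = "outcome", shape = "long",
                   alt.var = "sector", chid.var = "chid", id.var = "userid")

m <- mlogit(outcome ~ cul + fit + ind + led + prj + rel, data = dat)
summary(m)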
