How to put information obtained by cast function of reshape package back in my original data frame in R - r

I have a data.frame in panel format (country-year) and I need to calculate the mean of a variable by country for each five-year period. I used the 'cast' function from the 'reshape' package and it worked. Now I need to put this information (the mean by quinquennium) back into the original data.frame, so I can run some regressions. How can I do that? Below I provide an example to illustrate what I want:
set.seed(2)
fake= data.frame(y=rnorm(20), x=rnorm(20), country=rep(letters[1:2], each=10), year=rep(1:10,2), quinquenio= rep(rep(1:2, each=5),2))
fake.m = melt.data.frame(fake, id.vars=c("country", "year", "quinquenio"))
cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=T)
So far everything is fine and I get what I wanted: the mean of the variable, by country and by quinquennium. Now I would like to put these means back in the data.frame fake, like this:
y x country year quinquenio mean.x
1 -0.89691455 2.090819205 a 1 1 0.8880242
2 0.18484918 -1.199925820 a 2 1 0.8880242
3 1.58784533 1.589638200 a 3 1 0.8880242
4 -1.13037567 1.954651642 a 4 1 0.8880242
5 -0.08025176 0.004937777 a 5 1 0.8880242
6 0.13242028 -2.451706388 a 6 2 -0.2978375
7 0.70795473 0.477237303 a 7 2 -0.2978375
8 -0.23969802 -0.596558169 a 8 2 -0.2978375
9 1.98447394 0.792203270 a 9 2 -0.2978375
10 -0.13878701 0.289636710 a 10 2 -0.2978375
11 0.41765075 0.738938604 b 1 1 0.2146461
12 0.98175278 0.318960401 b 2 1 0.2146461
13 -0.39269536 1.076164354 b 3 1 0.2146461
14 -1.03966898 -0.284157720 b 4 1 0.2146461
15 1.78222896 -0.776675274 b 5 1 0.2146461
16 -2.31106908 -0.595660499 b 6 2 -0.8059598
17 0.87860458 -1.725979779 b 7 2 -0.8059598
18 0.03580672 -0.902584480 b 8 2 -0.8059598
19 1.01282869 -0.559061915 b 9 2 -0.8059598
20 0.43226515 -0.246512567 b 10 2 -0.8059598
I appreciate any tip in the right direction. Thanks in advance.
ps.: the reason I need this is that I'll run a regression with quinquennial data, and for some variables (like per capita income) I have information for all years, so I decided to average them by 5 years.

I'm sure there's an easy way to do this with reshape, but my brain defaults to plyr first:
require(plyr)
ddply(fake, c("country", "quinquenio"), transform, mean.x = mean(x))
This is quite hackish, but one way to use reshape building off your earlier work:
zz <- cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=T)
merge(fake, melt(zz), by = c("country", "quinquenio"))
though I'm positive there has to be a better solution.

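Here's a more old school approach using ave and with. ave returns a vector aligned with the original rows, so the group means land on the right rows regardless of how the groups are ordered:
fake$mean.x <- with(fake, ave(x, country, quinquenio))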
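A minimal base-R sketch of attaching a per-group mean to every row, assuming the `fake` data frame from the question. `ave()` with its default `FUN = mean` returns one value per original row, and `aggregate()` is used here only as a cross-check of the four group means.

```r
set.seed(2)
fake <- data.frame(y = rnorm(20), x = rnorm(20),
                   country = rep(letters[1:2], each = 10),
                   year = rep(1:10, 2),
                   quinquenio = rep(rep(1:2, each = 5), 2))

# One mean per (country, quinquenio) group, repeated on every row of that group
fake$mean.x <- ave(fake$x, fake$country, fake$quinquenio)

# Cross-check: the same means as a 4-row summary table
grp <- aggregate(x ~ country + quinquenio, data = fake, FUN = mean)
```

Because the result has the same length and order as `fake$x`, no merge step is needed before running the regressions.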

Related

creating a dataframe of means of 5 randomly sampled observations

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 income values in a single vector. Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
Just apply the mean-function on the data.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
library(dplyr)
df %>%
group_by(index) %>%
summarise(mean=mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
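Another way to picture the grouping, as a rough sketch using the same toy data: lay the 50 sampled values out as a matrix with 5 rows, so that each column holds one group of 5 consecutive values. The column means then match the `tapply()` result. The variable names below are illustrative only.

```r
set.seed(1)
samp <- sample(runif(5000), 10 * 5)

# rep(1:10, rep(5, 10)) repeats each index 5 times, same as rep(1:10, each = 5)
idx <- rep(1:10, rep(5, 10))
same_index <- identical(idx, rep(1:10, each = 5))

by_tapply <- tapply(samp, idx, mean)

# matrix() fills column by column, so column k holds samples 5k-4 .. 5k
by_matrix <- colMeans(matrix(samp, nrow = 5))
```

Here `1000*5` in the original code plays the role of `10 * 5`: it is simply the total number of values needed for 1000 groups of 5.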

R - Skip columns in pmax command if they do not exist

I'd like to use the pmax command to create a new column. My code looks like this:
Master <- Master %>%
mutate(RAM = pmax(RAM1, RAM2, RAM3, RAM4, RAM5, RAM6, RAM7, RAM8, RAM9, RAM10,
RAM11, RAM12, RAM13, RAM14, RAM15, RAM16, RAM17, RAM18,
RAM19, RAM20, RAM21, RAM22, RAM23, RAM24, RAM25, RAM26,
RAM27, RAM28, RAM29, RAM30, RAM31, RAM32, RAM33, RAM34,
RAM35, RAM36, RAM37, RAM38, RAM39, RAM40, RAM41, RAM42,
RAM43, RAM44, RAM45, RAM46, RAM47, RAM48, RAM49, RAM50,
RAM51, RAM52, RAM53, RAM54, RAM55, RAM56, RAM57, RAM58,
RAM59, RAM60, RAM61, RAM62, RAM63, RAM64, RAM65, RAM66,
RAM67, RAM68, RAM69, RAM70, RAM71, RAM72, RAM73, RAM74,
RAM75, RAM76, RAM77, RAM78, RAM79, RAM80, RAM81, RAM82,
RAM83, RAM84, RAM85, RAM86, RAM87, RAM88, RAM89, RAM90,
RAM91, RAM92, na.rm =T))
In my current database, however, only the columns RAM1 to RAM8 exist. In this case, I want R to skip all the other columns mentioned in the statement and use only columns RAM1 to RAM8 (it is okay if R displays an error message, but I don't want the program to stop running the code).
Any ideas how to do so?
Thanks!
One way to do this would be as follows:
Set up some data to make a reproducible example
set.seed(0)
Master <- data.frame(Other=100,RAM1=1:10, RAM2=1:10, RAM3=1:10, RAM4=1:10,
RAM5=1:10, RAM6=1:10, RAM7=1:10, RAM8=rnorm(10)+5)
Master[5,5] <- NA
Select required columns of the dataframe:
Master[colnames(Master) %in% paste0("RAM",1:92)]
Use do.call to run pmax using the selected columns as arguments, and adding the argument na.rm=TRUE
Master$RAM <- do.call(pmax, c(Master[colnames(Master) %in% paste0("RAM",1:92)], na.rm=TRUE))
Sample output:
Master
# Other RAM1 RAM2 RAM3 RAM4 RAM5 RAM6 RAM7 RAM8 RAM
#1 100 1 1 1 1 1 1 1 6.262954 6.262954
#2 100 2 2 2 2 2 2 2 4.673767 4.673767
#3 100 3 3 3 3 3 3 3 6.329799 6.329799
#4 100 4 4 4 4 4 4 4 6.272429 6.272429
#5 100 5 5 5 NA 5 5 5 5.414641 5.414641
#6 100 6 6 6 6 6 6 6 3.460050 6.000000
#7 100 7 7 7 7 7 7 7 4.071433 7.000000
#8 100 8 8 8 8 8 8 8 4.705280 8.000000
#9 100 9 9 9 9 9 9 9 4.994233 9.000000
#10 100 10 10 10 10 10 10 10 7.404653 10.000000
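The column-selection step can also be sketched with `intersect()`, which keeps only those wanted names that actually exist in the data frame. This is a small illustrative example with made-up values, not the asker's data; the names `wanted` and `present` are assumptions.

```r
# Toy data: only RAM1 and RAM2 exist, with one missing value
Master <- data.frame(Other = 100, RAM1 = c(1, 5, NA), RAM2 = c(4, 2, 3))

wanted  <- paste0("RAM", 1:92)               # every column we might have
present <- intersect(wanted, names(Master))  # the ones that are really there

# Pass the existing columns to pmax as separate arguments, plus na.rm = TRUE
Master$RAM <- do.call(pmax, c(Master[present], na.rm = TRUE))
```

With `na.rm = TRUE`, the third row takes the value from `RAM2` because `RAM1` is missing there.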

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs mean?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or the package it is in, but here is a base R approach:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,100)
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
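The same idea, spelled out step by step as a sketch with named arguments. `cut()` uses right-closed intervals by default, so a break at 46 puts age 46 in the first bin and age 47 in the second. The names `breaks`, `labels` and `freq` are illustrative.

```r
Ages <- c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
          70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
          84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)

breaks <- seq(37, 100, by = 9)   # 37, 46, 55, ..., 100
# Label each interval (lo, hi] as "lo+1-hi", e.g. (37, 46] becomes "38-46"
labels <- paste(head(breaks, -1) + 1, breaks[-1], sep = "-")

freq <- as.data.frame(table(Age = cut(Ages, breaks = breaks, labels = labels)))
names(freq)[2] <- "Count"
```

The resulting `freq` has the columns `Age` and `Count` requested in the question.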

An easier way to get average of a table with some conditions in R

I am trying to get the average of all 6 quizzes for each male student.
Here is part of the code that I've tried:
a<-subset(mydf,Sex=="M")
b<-a[4:9]
b
sum(b[1:6])
My logic is to build a table containing only the male students and their 6 quizzes, then sum the table and divide by the number of male students. But I think there should be an easier way to do this.
Sample data:
df <- data.frame(Section=c(rep('A',9)),
Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
Sex=c(rep('M',5),'F','M','M','F'),
Quiz1=c(0,10,2,2,8,6,6,2,3),
Quiz2=c(0,1,4,4,1,5,0,3,9),
Quiz3=c(6,5,6,6,4,2,7,9,3),
Quiz4=c(5,4,5,5,10,5,7,7,3),
Quiz5=c(7,3,6,3,10,7,6,10,5),
Quiz6=c(3,8,6,6,5,8,10,10,5))
How about this:
data.frame(df[which(df$Sex=='M'),],QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
Note: "c(4:9)" in the code above selects the quiz columns (columns 4-9), and rowMeans takes the average across them for each row.
So we're calculating the mean quiz score for each individual this way.
Output:
Section Degree Sex Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz6 QuizMeans
1 A MBA M 0 0 6 5 7 3 3.500000
2 A MBA M 10 1 5 4 3 8 5.166667
3 A MBA M 2 4 6 5 6 6 4.833333
4 A MBA M 2 4 6 5 3 6 4.333333
5 A MS M 8 1 4 10 10 5 6.333333
7 A MBA M 6 0 7 7 6 10 6.000000
8 A MS M 2 3 9 7 10 10 6.833333
Then if you wanted to take the mean of their means (i.e. the grand mean), you could store the above as something like "df", then use mean() to calculate the mean of the column QuizMeans, like this:
df <- data.frame(df[which(df$Sex=='M'),],QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
mean(df$QuizMeans)
[1] 5.285714
If there are missing values in your data, you'll need to add na.rm=TRUE to either the mean() or rowMeans() function, like this:
mean(df$QuizMeans, na.rm=TRUE)
[1] 5.285714
You could use the following without specifying column positions
ans <- sum(df[df$Sex=="M", grepl("Quiz",names(df))])/sum(df$Sex=="M")
# 31.71429
If you know the column positions
ans <- sum(df[df$Sex=="M", 4:9])/sum(df$Sex=="M")
# 31.71429
Data
df <- data.frame(Section=c(rep('A',9)),
Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
Sex=c(rep('M',5),'F','M','M','F'),
Quiz1=c(0,10,2,2,8,6,6,2,3),
Quiz2=c(0,1,4,4,1,5,0,3,9),
Quiz3=c(6,5,6,6,4,2,7,9,3),
Quiz4=c(5,4,5,5,10,5,7,7,3),
Quiz5=c(7,3,6,3,10,7,6,10,5),
Quiz6=c(3,8,6,6,5,8,10,10,5))
Use dplyr.
library(dplyr)
mydf %>% filter(Sex == "M") %>%   # the data codes male students as "M"
summarise(avg_q6 = mean(Quiz6))

Replacing each value in a vector with its rank number for a data.frame

In this hypothetical scenario, I have performed 5 different analyses on 13 chemicals, resulting in a score assigned to each chemical within each analysis. I have created a table as follows:
---- Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
Chem_1 3.524797844 4.477695034 4.524797844 4.524797844 4.096698498
Chem_2 2.827511555 3.827511555 3.248136118 3.827511555 3.234398548
Chem_3 2.682144761 3.474646298 3.017780505 3.682144761 3.236152242
Chem_4 2.134137304 2.596921333 2.95181339 2.649076603 2.472875191
Chem_5 2.367736454 3.027814219 2.743137896 3.271122346 2.796607809
Chem_6 2.293110565 2.917318708 2.724156207 3.293110565 2.530967343
Chem_7 2.475709113 3.105794018 2.708222528 3.475709113 3.088819908
Chem_8 2.013451822 2.259454085 2.683273938 2.723554966 2.400976121
Chem_9 2.345123123 3.050074893 2.682845391 3.291851228 2.700844104
Chem_10 2.327658894 2.848729452 2.580415233 3.327658894 2.881490893
Chem_11 2.411243882 2.98131398 2.554456095 3.411243882 3.109205453
Chem_12 2.340778276 2.576860244 2.549707035 3.340778276 3.236545826
Chem_13 2.394698249 2.90682524 2.542599327 3.394698249 3.12936843
I would like to create columns corresponding to each analysis which contain the rank position of each chemical. For instance, under Analysis1, Chem_1 would have value "1", Chem_2 would have value "2", Chem_3 would have value "3", Chem_7 would have value "4", Chem_11 would have value "5", and so on.
We can use dense_rank from dplyr
library(dplyr)
df %>%
mutate(across(where(is.numeric), ~ dense_rank(-.x)))  # rank numeric columns, largest score = 1
In base R, we can do
df[] <- lapply(-df, rank, ties.method="min")
In data.table, we can use
library(data.table)
setDT(df)[, lapply(-.SD, frank, ties.method="dense")]
To avoid the copies made by negating with -, as @Arun mentioned in the comments
setDT(df)[, lapply(.SD, frankv, order=-1L, ties.method="dense")]
You can also do this in base R with rank (order would return the sorting permutation, not each chemical's rank position):
cbind("..." = df[,1], data.frame(do.call(cbind,
lapply(df[,-1], function(x) rank(-x)))))
... Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
1 Chem_1 1 1 1 1 1
2 Chem_2 2 2 2 2 4
3 Chem_3 3 3 3 3 3
4 Chem_4 12 11 4 13 12
5 Chem_5 7 6 5 11 9
6 Chem_6 11 8 6 9 11
7 Chem_7 4 4 7 4 7
8 Chem_8 13 13 8 12 13
9 Chem_9 8 5 9 10 10
10 Chem_10 10 10 10 8 8
11 Chem_11 5 7 11 5 6
12 Chem_12 9 12 12 7 2
13 Chem_13 6 9 13 6 5
If I'm not mistaken, you want to have the column-wise rank of your table. Here is my solution:
m=data.matrix(df) # converts data frame to matrix, convert your data to matrix accordingly
apply(m, 2, function(c) rank(c)) # increasingly
apply(m, 2, function(c) rank(-c)) # decreasingly
However, I believe you could solve it by yourself with the help of the answers to this question
Get rank of matrix entries?
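Since `rank()` and `order()` are easy to mix up, here is a tiny sketch on made-up scores showing how they differ. `rank(-x)` answers "what position does each element hold?", while `order(x, decreasing = TRUE)` answers "which element comes first, second, third?".

```r
scores <- c(a = 2.1, b = 3.5, c = 2.7)

# Position of each element when sorted from largest to smallest
r <- rank(-scores)

# Indices of the elements in sorted order (a permutation, not a rank)
o <- order(scores, decreasing = TRUE)

# Indexing with the permutation sorts the vector
sorted_names <- names(scores)[o]
```

For the question above, which asks for each chemical's position within an analysis, `rank()` is the function that matches.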
