When using subset on a data frame, my resulting data frame has some odd behavior.
df is the subset of a larger data frame
>df
buy_sell_count trt sector
1 1 0.023957 Apartment
2 1 0.026739 Strip Center
3 1 0.0705979999999999 Mall
4 1 0.0595650000000001 Office
5 1 0.0290539999999999 Industrial
I've tried the various drop-level practices shown in this question, but none have worked.
When I do mean(df$trt) I get the warning "argument is not numeric or logical: returning NA".
When I do as.numeric(df$trt) I get
[1] 8 9 12 11 10 1 4 6 3 5 7 2
I think it has to do with the levels:
df$trt produces
[1] 0.023957 0.026739 0.0705979999999999 0.0595650000000001 0.0290539999999999
[6] -0.01607 -0.188538 0.00279700000000016 -0.022502 0.00178300000000009
[11] 0.00770099999999996 -0.0191330000000001
12 Levels: -0.01607 -0.0191330000000001 -0.022502 -0.188538 0.00178300000000009 ... 0.0705979999999999
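A minimal sketch of what is probably going on, assuming trt was stored as a factor (the "12 Levels" line suggests so). Calling as.numeric() on a factor returns the internal level codes rather than the printed numbers, so the values have to pass through as.character() first. The trt vector below is a made-up stand-in, not the real data.

```r
# Hypothetical stand-in for df$trt: numbers that ended up stored as a factor
trt <- factor(c("0.023957", "0.026739", "0.0705979999999999", "-0.01607"))

# as.numeric() on a factor gives the integer level codes, not the values
codes <- as.numeric(trt)

# Going through character recovers the numbers that are printed
values <- as.numeric(as.character(trt))

# Equivalent alternative: index the converted levels by the factor
values_alt <- as.numeric(levels(trt))[trt]

mean(values)
```

If the column should have been numeric from the start, it may be worth checking how the larger data frame was read in, since a stray non-numeric entry is a common reason for a numeric column to become a factor.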
Say, I have a dataframe df in R as follows,
id inflam
1 1 0.03093764
2 2 0.50115406
3 3 0.82153770
4 4 0.01985961
5 5 0.04994588
6 6 0.91714810
7 7 0.83438400
8 8 0.80832225
9 9 0.12360681
10 10 0.08490079
I can access the entirety of the inflam column by indexing as df[,2] or df[2]. However, typeof(df[,2]) returns double, whereas typeof(df[2]) returns list. The comma seems to be the differentiator, but why is this the case? What is going on under the hood?
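A small sketch of the difference, using a three-row stand-in for df. With a comma, the data frame is indexed like a matrix and a single column is dropped to a plain vector by default; without a comma it is indexed like a list, so the result is still a (one-column) data frame, which is a list underneath.

```r
# Small stand-in for the data frame in the question
df <- data.frame(id = 1:3, inflam = c(0.03093764, 0.50115406, 0.82153770))

t_matrix_style <- typeof(df[, 2])              # "double": column dropped to a vector
t_list_style   <- typeof(df[2])                # "list": still a one-column data frame
t_extract      <- typeof(df[[2]])              # "double": [[ pulls out the element itself
c_no_drop      <- class(df[, 2, drop = FALSE]) # "data.frame": drop = FALSE keeps the frame

print(c(t_matrix_style, t_list_style, t_extract, c_no_drop))
```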
I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 income values in a single vector (loans_income$income). Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
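As a side note, the nested rep() can be written with the each argument, which may be easier to read. A short sketch checking that both forms build the same grouping vector:

```r
# Nested form from the book: the second argument is a vector of repeat counts,
# one count (5) for each of the values 1..10
idx_nested <- rep(1:10, rep(5, 10))

# Same grouping using each =
idx_each <- rep(1:10, each = 5)

same <- identical(idx_nested, idx_each)
print(same)
```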
FUN = mean
This simply applies the mean function to each group of 5 values.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
library(dplyr)
df %>%
group_by(index) %>%
summarise(mean=mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
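Putting the pieces back together at full size, here is a sketch of the original statement split into named steps. The loans_income data frame below is simulated (an assumption, since the book's data is not shown here), and n_groups, group_size, drawn and group are names introduced only for illustration.

```r
set.seed(1)

# Simulated stand-in for the book's data: 50,000 incomes (not the real data set)
loans_income <- data.frame(income = round(rlnorm(50000, meanlog = 11, sdlog = 0.5)))

n_groups   <- 1000  # number of sample means wanted
group_size <- 5     # number of incomes averaged in each mean

# 1000 * 5 = 5000 incomes drawn without replacement
drawn <- sample(loans_income$income, n_groups * group_size)

# Group labels 1,1,1,1,1,2,2,2,2,2,...,1000: same as rep(1:1000, rep(5, 1000))
group <- rep(1:n_groups, each = group_size)

# One mean per group label, then wrapped in a data frame with a constant type column
samp_mean_5 <- data.frame(income = tapply(drawn, group, FUN = mean),
                          type = "mean_of_5")

dim(samp_mean_5)
```

The first row of samp_mean_5 is the mean of the first five drawn incomes, the second row the mean of the next five, and so on.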
I have a dataset of customer ages and I want to make a frequency distribution using age bins that are 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If so, could you help me with the code, as I am not sure what bx and idxs are?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or even the package it is in, but here is a base R approach:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,100)  # bin edges; the intervals are (37,46], (46,55], ..., (91,100]
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")  # add one to the lower edges to get 38, 47, etc.
group=cut(Ages,lowerlimit,Labels)  # determine which group each age belongs to
tab=table(group)  # form a frequency table
as.data.frame(tab)  # transform the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
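On the binCounts part of the question: that function comes from the matrixStats package, where bx is the vector of bin edges and idxs optionally restricts the count to a subset of x. The sketch below does the same counting in base R with left-closed bins, and only calls binCounts if matrixStats happens to be installed (an assumption). The names bx and counts are chosen here for illustration.

```r
Ages <- c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
          70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
          84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)

# Bin edges 38, 47, ..., 92, 101; bins are [38,47), [47,56), ..., [92,101)
bx <- seq(38, 101, by = 9)

# Base R: findInterval() gives the bin number of each age, tabulate() counts them
counts <- tabulate(findInterval(Ages, bx), nbins = length(bx) - 1)

# Optional comparison with matrixStats::binCounts, only if the package is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  bc <- matrixStats::binCounts(Ages, bx = bx, right = FALSE)
  print(bc)
}

# Label each bin by its first and last whole-year age
result <- data.frame(Age = paste(head(bx, -1), bx[-1] - 1, sep = "-"),
                     Count = counts)
print(result)
```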
I would like to reshape my old data.frame data from long to wide using two variables as the columns for the new data.frame new.data. Specifically, I want to take the two variables data$assessment and data$question_id:
1) Figure out how many data$question_id are in each data$assessment, so that
2) Each data$question_id represents a column in the new data.frame, and
3) Relabel each data$question_id to indicate the assessment it belongs to (i.e. Assessment1 and Q1 is Assessment1_Q1, Assessment1 and Q3 is Assessment1_Q3).
However, there are two things to consider:
1) The assessments have different numbers of questions
2) Not all questions were filled out by the participant (i.e. missing data)
Here's the general structure of the old data.frame:
> dim(data)
[1] 42106 4
> colnames(data)
[1] "subjectid" "assessment" "question_id" "question_value"
> lapply(data, class)
$subjectid
[1] "integer"
$assessment
[1] "factor"
$question_id
[1] "factor"
$question_value
[1] "factor"
> length(unique(data$subjectid))
[1] 96
> table(data$assessment)
Assessment1 Assessment2
1362 2102
Assessment3 Assessment4
966 864
Assessment5 Assessment6
1183 2093
Assessment7 Assessment8
181 14208
Assessment9 Assessment10
6734 2044
Assessment11 Assessment12
3129 2185
Assessment13 Assessment14
3962 1093
> length(unique(data$question_id))
[1] 431
I want my new data.frame new.data to have rows representing participants (N=96), columns representing the assessment and question (i.e. Assessment1_Q1), and new.data$question_value representing each participant's score on a specific assessment/question. Using dim(new.data) should yield 96 432
It should look something like this
subjectid Assessment1_Q1 Assessment1_Q2 Assessment1_Q3 Assessment1_Q4 Assessment2_Q1 Assessment2_Q2 Assessment2_Q3 Assessment3_Q1 Assessment3_Q2 Assessment3_Q3 Assessment4_Q1 Assessment4_Q2
1 6 7 5 4 1 2 4 8 6
2 5 9 3 1 2 4 8 2 3
3 3 9 5 4 5 9 2 3 7 5 5
As you can see, the new data.frame's rows are participants, the columns are Assessments/Questions, and the values are the participants' responses (missing responses are left blank).
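One possible approach, sketched on a made-up six-row miniature of the long data (the values are invented): paste the assessment and question together into a single label, then reshape from long to wide with base R's reshape(). The item column is a helper name introduced here, and missing responses come out as NA rather than blank.

```r
# Hypothetical miniature of the long data; the real data has 42106 rows
data <- data.frame(
  subjectid      = c(1, 1, 1, 2, 2, 3),
  assessment     = factor(c("Assessment1", "Assessment1", "Assessment2",
                            "Assessment1", "Assessment2", "Assessment2")),
  question_id    = factor(c("Q1", "Q2", "Q1", "Q1", "Q1", "Q2")),
  question_value = factor(c("6", "7", "5", "5", "9", "3"))
)

# Store the responses as character so factor codes cannot leak into the result
data$question_value <- as.character(data$question_value)

# One label per assessment/question pair, e.g. "Assessment1_Q1"
data$item <- paste(data$assessment, data$question_id, sep = "_")

# Long to wide: one row per subject, one column per item
new.data <- reshape(data[, c("subjectid", "item", "question_value")],
                    idvar = "subjectid", timevar = "item", direction = "wide")

# reshape() prefixes the new columns with "question_value."; strip that prefix
names(new.data) <- sub("^question_value\\.", "", names(new.data))

print(new.data)
```

Because the labels combine assessment and question, assessments with different numbers of questions simply contribute different numbers of columns.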
I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpt of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id (id number of the respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6),
# note: rep(22,000, 10) is parsed as rep(22, 000, 10), so population holds 22, not 22000
population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1,0 or their own id number but I still get the same message. I have also tried eliminating function arguments (coupon.max, population) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
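The id arguments expect column names given as character strings, not the columns themselves; the check id %in% names(x) quoted in your warning shows this. Renaming the columns so that they match what appear to be the function's default names seems to work: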
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
Passing the original column names as strings also gives a result, though with a warning:
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
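The "Invalid id" error can be illustrated without the RDS package. The sketch below only mimics the check quoted in the warning message, id %in% names(x); it is not the package's actual code, and by_value and by_name are names made up for the illustration.

```r
df <- data.frame(id = 1:10,
                 recruit.id = c(-1, 1, 1, 2, 2, 4, 5, 3, 8, 3))

# What the failing call effectively tested: the id values against the column names
by_value <- df$id %in% names(df)   # ten FALSEs, since no column is called "1", "2", ...

# What the check is meant to receive: a single column name
by_name <- "id" %in% names(df)     # a single TRUE

print(by_value)
print(by_name)
```

With id=df$id the condition has length 10 and its first element is FALSE, which is consistent with both the warning about the condition length and the "Invalid id" stop.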