I have a data frame named df:
number value
1 5
2 5
3 5
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
13 8
14 9
15 9
I want to remove specific rows in case of a min and max level. I tried separate this:
df[df$value>5 , ]
and after that this:
df[df$value>8 , ]
After I tried this:
df[df$value>5 & df$value>8, ]
but it execute online the df$value>8
and another problem I observed is that when I type
df[df$value>5, ]
it eliminate the value however when I type df it contains the values I tried to remove before. What could be wrong and I don’t take a clear data frames without the removed values?
An example of the output data:
number value
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
If you want remove lines with level lower than min and higher than max, try this:
df[df$value<5 | df$value>8, ]
Edit
Look right code:
df <- df[df$value>5 & df$value<8,]
Its work for me.
Related
Say, I have a dataframe df in R as follows,
id inflam
1 1 0.03093764
2 2 0.50115406
3 3 0.82153770
4 4 0.01985961
5 5 0.04994588
6 6 0.91714810
7 7 0.83438400
8 8 0.80832225
9 9 0.12360681
10 10 0.08490079
I can access the entirety of the inflam column by indexing as df[,2] or df[2]. However, typeof(df[,2]) returns double, whereas typeof(df[2]) returns list. The comma seems to be the differentiator, but why is this the case? What is going on under the hood?
I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 loans_income in a single vector. Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
Just apply the mean-function on the data.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached the an index to the random numbers (your income). Now we calculate the mean per group:
df %>%
group_by(index) %>%
summarise(mean=mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
I have a dataset of Ages for the customer and I wanted to make a frequency distribution by 9 years of a gap of age.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to below-shared table, variable names can be differed(as you wish)
Could I use binCounts code into it ? if yes could you help me out using the code as not sure of bx and idxs in this code?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about the binCounts or even the package it is in but i have a bare r function:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,101)
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
From the following data frame, I am trying to output two tables, one for PASS and another one for FAIL. The condition is that the output for each table should contain only the ID and the Score. Can anyone help me with this? I am still starting to know the full capabilities of the table function. If anyone could suggest other alternatives I would greatly appreciate it as long as the conditions for the output is met.
> df <- data.frame(
ID <- as.factor(c(20260, 11893, 54216, 11716, 53368, 46196, 40007, 20970, 11802, 46166, 23615, 11865, 16138, 64789, 43211, 66539));
Score <- c(9,7,6,2,10,7,8,10,6,7,7,9,9,9,10,8)
Remark<- as.factor(c("PASS","PASS","FAIL","FAIL","PASS","PASS","PASS","PASS","FAIL","PASS","PASS","PASS","PASS","PASS","PASS","PASS"))
)
> df
ID Score Remark
1 20260 9 PASS
2 11893 7 PASS
3 54216 6 FAIL
4 11716 2 FAIL
5 53368 10 PASS
6 46196 7 PASS
7 40007 8 PASS
8 20970 10 PASS
9 11802 6 FAIL
10 46166 7 PASS
11 23615 7 PASS
12 11865 9 PASS
13 16138 9 PASS
14 64789 9 PASS
15 43211 10 PASS
16 66539 8 PASS
Something like this?
df <- data.frame(
ID = as.factor(c(20260, 11893, 54216, 11716, 53368, 46196, 40007, 20970, 11802, 46166, 23615, 11865, 16138, 64789, 43211, 66539)),
Score = c(9,7,6,2,10,7,8,10,6,7,7,9,9,9,10,8),
Remark = as.factor(c("PASS","PASS","FAIL","FAIL","PASS","PASS","PASS","PASS","FAIL","PASS","PASS","PASS","PASS","PASS","PASS","PASS"))
)
df[df$Remark == "PASS", 1:2]
ID Score
1 20260 9
2 11893 7
5 53368 10
6 46196 7
7 40007 8
8 20970 10
10 46166 7
11 23615 7
12 11865 9
13 16138 9
14 64789 9
15 43211 10
16 66539 8
This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
ID Tag Flowers
1 1 6871 1
2 2 6750 1
3 3 6859 1
4 4 6767 1
5 5 6747 1
6 6 6261 1
7 7 6750 1
8 8 6767 1
9 9 6812 1
10 10 6746 1
11 11 6496 4
12 12 6497 1
13 13 6495 4
14 14 6481 1
15 15 6485 1
Notice that in Lines 2 and 7, the tag 6750 appears twice. I observed one flower on plant number 6750 on two separate days, equaling two flowers in its lifetime. Basically, I want to add every flower that occurs for tag 6750, tag 6767, etc throughout ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little bit lost as to how I should do so.
Without any extra packages, you can use function aggregate():
res<-aggregate(data$Flowers, list(data$Tag), sum)
This calculates a sum of the values in Flowers column for every value in the Tag column.