Cut function creates too many levels

I have a list of integers that represent years of education:
education= 12 14 17 15 12 19 16 12 16 14 12 18 12 13 18 18 10 13 12 18
22 16 13 22 12 15 12 16 18 18 18 20 18 16 13 12 16 13 18 20 20 20 14 18
18 12 18 16 20 18 14 16 19 12 12 11 13 13
I am trying to categorize the years into 3 different levels:
9-12
13-17
18+
I have tried to use the cut function:
edulevels=cut(education,c(9,12,13,17,18,22))
but it creates 2 additional levels for 12-13 and 17-18:
Levels: (9,12] (12,13] (13,17] (17,18] (18,22]
How do I get it to only create these three levels?

The simplest solution is to place the breaks between the integers:
edulevels= cut(education,c(9,12.5,17.5,22), labels = c("9-12", "13-17", "18+"))

Intervals defined by the cut() function are closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up.
edulevels <- cut(education,
c(8.5, 12.5, 17.5, Inf),
labels=c('9-12','13-17','18+')
)
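
An alternative worth knowing, shown here as a minimal sketch on a made-up vector x (not the original data): cut() also accepts right = FALSE, which makes the intervals closed on the left, so integer break points can be used directly. The name edulevels_demo is only for illustration.

```r
# Hypothetical sample values sitting right on the category boundaries
x <- c(9, 12, 13, 17, 18, 22)

# right = FALSE gives left-closed bins: [9,13), [13,18), [18,Inf)
edulevels_demo <- cut(x,
                      breaks = c(9, 13, 18, Inf),
                      right  = FALSE,
                      labels = c("9-12", "13-17", "18+"))

# Count how many values fall into each of the three levels
table(edulevels_demo)
```

Either approach yields three levels; which one reads better is mostly a matter of taste.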

Related

Selecting columns with only one character in R

Here is my data
df<-read.table(text="A1 A2 AA2 A3 APP3 AA4 A4
17 17 14 18 18 14 17
16 15 13 16 19 15 19
17 14 12 19 15 18 14
17 16 16 18 19 19 20
19 18 12 18 13 17 17
12 19 17 18 16 20 18
20 18 14 13 15 15 16
18 20 12 20 12 12 18
12 15 18 14 16 18 18",h=T)
I want to select columns that have only one A, i.e.,
A1 A2 A3 A4
17 17 18 17
16 15 16 19
17 14 19 14
17 16 18 20
19 18 18 17
12 19 18 18
20 18 13 16
18 20 20 18
12 15 14 18
I have used the following code:
df1<- df%>%
select(contains("A"))
but it gives me all the columns whose names contain A.
Is it possible to get table 2? Thanks for your help.

You can use matches() with a regex pattern. A pattern for "contains exactly one 'A'" would be "^[^A]*A[^A]*$". Note that APP3 also contains exactly one A, so with your data it is selected as well:
df %>% select(matches("^[^A]*A[^A]*$"))
# A1 A2 A3 APP3 A4
# 1 17 17 18 18 17
# 2 16 15 16 19 19
# 3 17 14 19 15 14
# 4 17 16 18 19 20
# ...
Based on comments, my best guess for what you want is columns where the name starts with a P and after the P contains only numbers:
# single P followed by numbers
df %>% select(matches("^P[0-9]+$"))
# single A followed by numbers
df %>% select(matches("^A[0-9]+$"))
# single capital letter followed by numbers
df %>% select(matches("^[A-Z][0-9]+$"))
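
To see which names each pattern picks up, the patterns can be tried on the column names directly with base R's grepl(). This is only a quick check, and the variable names are made up for illustration.

```r
# Column names from the question
nms <- c("A1", "A2", "AA2", "A3", "APP3", "AA4", "A4")

# Exactly one "A" anywhere in the name
one_a <- nms[grepl("^[^A]*A[^A]*$", nms)]

# A single "A" followed only by digits
a_digits <- nms[grepl("^A[0-9]+$", nms)]

print(one_a)
print(a_digits)
```

The first pattern keeps APP3 because that name contains a single A; the second pattern drops it, which matches the table asked for in the question.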

If you're not very comfortable with regex, here's an alternative solution.
The first step is to create a function that counts the number of "A"s in a vector of strings. I will do this by creating a temporary vector of column names with all the As removed and then subtracting the new number of characters from the original.
count_a <- function(vector, char = "A") {
  # remove every occurrence of char, then compare the string lengths
  vec2 <- gsub(char, "", vector, fixed = TRUE)
  numb_As <- nchar(vector) - nchar(vec2)
  return(numb_As)
}
Once you have this function you simply apply it to the colnames of your dataset and then limit your data to the columns where the count is equal to one.
As<-count_a(colnames(df))
df[,As==1]

If you are not familiar with regular expressions, you can use a function of the popular package for analysing strings: stringr. With one line you get this:
library(stringr)
df[,str_count(names(df),'A')==1]
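
If installing stringr is not an option, the same count can be sketched in base R. This is an illustrative alternative rather than part of the answers above; the name B1 is a hypothetical column added to show a zero count.

```r
# Column names from the question plus a hypothetical name without any "A"
nms <- c("A1", "AA2", "APP3", "B1")

# gregexpr() finds every match, regmatches() extracts them, lengths() counts them
n_a <- lengths(regmatches(nms, gregexpr("A", nms, fixed = TRUE)))

print(n_a)

# Names with exactly one "A"
nms[n_a == 1]
```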

Trying to integrate over discrete points from a data frame

I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture, that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?

This solution subtracts your threshold (i.e., 15°), fits a function to the result, then integrates this function. Note that when the temperature is below the threshold, it contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp-15, 0)),
lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
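
Because approxfun() interpolates linearly between the hourly points, the same area can be approximated by hand with the trapezoidal rule. The following is a minimal sketch of that idea in base R, not the code used above; the variable names are made up.

```r
hour <- 1:24
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# Degrees above the 15-degree threshold, never negative
excess <- pmax(temp - 15, 0)

# Trapezoidal rule: interval width times the mean of the two end values
degree_hours <- sum(diff(hour) * (head(excess, -1) + tail(excess, -1)) / 2)

print(degree_hours)
```

For hourly data without gaps this should agree with the integrate() result up to its numerical error.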

The OP has requested to figure out the total number of hours above 15 degrees by integrating in R.
It is not fully clear to me what the expected result is. Does the OP want to count the number of hours above 15 degrees, or does the OP want to sum up the degrees greater than 15 ("integrate")?
However, the code below creates both figures. Assuming the data is sampled at each hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
.(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
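
For readers without data.table, the two totals can also be sketched in base R. This assumes the hourly temperatures are held in a plain numeric vector, here called temp for illustration.

```r
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# Sum of the degrees above 15 (degree hours)
deg_hrs_sum <- sum(pmax(0, temp - 15))

# Number of hours strictly above 15 degrees
deg_hrs_cnt <- sum(temp > 15)

# Running totals, one value per hour
running_sum <- cumsum(pmax(0, temp - 15))

print(c(deg_hrs_sum = deg_hrs_sum, deg_hrs_cnt = deg_hrs_cnt))
```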

trouble combining two numeric columns into one R

so I am having a bit of bother combining two columns into one. I have two columns of ages, which are split into child and adolescent columns. For example:
child adolescent
1 NA 12
2 NA 15
3 NA 12
4 NA 12
5 NA 13
6 NA 13
7 NA 13
8 NA 14
9 14 15
10 NA 12
11 12 13
12 NA 12
13 NA 13
14 NA 14
15 NA 14
16 12 13
17 NA 14
18 NA 13
19 NA 13
20 NA 14
21 NA 12
22 NA 13
23 12 15
24 NA 13
25 NA 15
26 NA 12
27 NA 15
28 NA 15
29 NA 13
30 NA 12
31 13 15
Now what I would like to do is combine them into one column called "age" and remove all the na values. However when I try the following code, I encounter a problem:
age<- c(na.omit(data$child),na.omit(data$adolescent))
The problem being that my original data has 514 rows, yet when I combine the two columns, removing the nas, I somehow end up with 543 values, not 514 and I don't know why.
So, if possible, could someone explain firstly why I am getting more values than I planned, and secondly what might be a better way to combine the two columns.
EDIT: I am looking for something like this
age
1 12
2 15
3 12
4 12
5 13
6 13
7 13
8 14
9 14
10 12
11 12
12 12
13 13
14 14
15 14
16 12
17 14
18 13
19 13
20 14
21 12
22 13
23 12
24 13
25 15
26 12
27 15
28 15
29 13
30 12
31 13
32 14
33 13
34 11
35 15
36 13
Thanks in advance

This line:
age<- c(na.omit(data$child),na.omit(data$adolescent))
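concatenates all the non-missing values from the child field to all the non-missing values from the adolescent field. Any row that has a value in both columns therefore contributes two values, which would explain why you end up with more values than rows. I suspect you want to use one of these solutions: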
# youngest age
age<- pmin(data$child,data$adolescent,na.rm=T)
# oldest age
age<- pmax(data$child,data$adolescent,na.rm=T)
# child age, replaced with adolescent if missing
age<- data$child
age[is.na(age)] <- data$adolescent[is.na(age)]
# ^ notice same logical index ^
# |_______________________________|
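
A small sketch on made-up data (data_demo is hypothetical, not the asker's 514-row data set) shows how the concatenation grows and how a row-wise replacement keeps one value per row.

```r
# Hypothetical data: rows 2 and 4 have a value in both columns
data_demo <- data.frame(child      = c(NA, 14, NA, 12),
                        adolescent = c(12, 15, 13, 13))

# Concatenating the non-missing values counts rows 2 and 4 twice
stacked <- c(na.omit(data_demo$child), na.omit(data_demo$adolescent))
length(stacked)

# Row-wise: take child where present, otherwise adolescent
age <- ifelse(is.na(data_demo$child), data_demo$adolescent, data_demo$child)
length(age)
```

Here length(stacked) exceeds the number of rows by exactly the number of rows that have both values filled in.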

Your code works on the example data, but you could try this:
age <- c(data$child, data$adolescent)
age <- age[!is.na(age)]
This combines the two columns from the data frame into a vector and removes all NA elements.

df$age <- ifelse( !(is.na(df$child)), df$child , df$adolescent)

how to deal with this kind of data type

I used igraph package to detect communities. When I used membership(community) function, the result is:
1 2 3 4 5 6 7 13 17 18 19 20 22 23 24 25
12 9 1 10 12 6 12 16 1 11 6 6 3 13 16 1
29 30 31 33 34 37 38 39 40 41 42 43 44 45 46 47
9 5 11 14 13 6 13 11 12 13 1 16 11 6 12 7
...
The first line is node ID and the second line is its corresponding community ID.
Suppose the name of the above result is X. I used Y=data.frame(X). The result is:
community
1 12
2 9
3 1
4 10
5 12
6 6
7 12
13 16
...
I want to use the first column (1,2,3,...), for instance, Y[13,]=16. But in this case, it is Y[8,]=16. How to do this?
This question may be very simple. But I do not know how to google it. Thanks.

Function as.data.frame() converts a named vector to a data frame, where the names of the vector elements are used as row names.
In other words, use a construct like rownames(Y)[8] to access the first column (or the row names, actually).
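
As a rough illustration, a plain named vector can stand in for the result of membership(community); this is an assumption, since the real object comes from igraph. Indexing by a character row name then looks up the node ID rather than the row position.

```r
# Stand-in for membership(community): values named by node ID (taken from the question)
X <- c("1" = 12, "2" = 9, "3" = 1, "4" = 10, "5" = 12, "6" = 6, "7" = 12, "13" = 16)

# The names of the vector become the row names of the data frame
Y <- data.frame(community = X)

Y[8, "community"]     # by position: the 8th row
Y["13", "community"]  # by row name: node ID 13
X[["13"]]             # the named vector can be indexed the same way
```

Quoting the ID ("13" rather than 13) is what switches the lookup from position to name.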

How to obtain all possible sub-samples of size n from a dataframe of size N in R?

I have a dataframe with 20 classrooms (indexed 1 to 20) and a different number of students in each class. How can I obtain all sub-samples of size n = 8 and store them? I want to use them later for calculations. I used combn(), but that takes only one vector. Can I use it with a dataframe, and how? (Sorry, but I'm new to R.)
dataframe below:
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
9 9 32
10 10 26
11 11 27
12 12 34
13 13 27
14 14 28
15 15 33
16 16 21
17 17 36
18 18 24
19 19 19
20 20 32

It is as simple as passing a function to combn. simplify = FALSE means that a list will be returned.
Assuming you want all possible combinations of 8 classrooms from the dataset classrooms
combinations <- combn(nrow(classrooms), 8, function(x,data) data[x,],
simplify = FALSE, data =classrooms )
head(combinations, n = 2)
[[1]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
[[2]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
9 9 32
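
Once the sub-samples are stored in a list, later calculations can be run over it with sapply() or lapply(). The sketch below uses a smaller, hypothetical data frame (5 classrooms, sub-samples of size 3) so that it runs quickly; the names classrooms_demo, subsamples and totals are made up.

```r
# Hypothetical reduced data: first five classrooms from the question
classrooms_demo <- data.frame(classrooms = 1:5,
                              students   = c(29, 30, 35, 28, 32))

# One data frame per combination of 3 row indices
subsamples <- combn(nrow(classrooms_demo), 3,
                    function(idx) classrooms_demo[idx, ],
                    simplify = FALSE)

# Number of sub-samples stored
length(subsamples)

# Example calculation: total number of students in each sub-sample
totals <- sapply(subsamples, function(s) sum(s$students))

# Size of the full problem in the question: 20 classrooms, 8 at a time
choose(20, 8)
```

With 20 classrooms and n = 8 the list holds choose(20, 8) = 125970 data frames, so storing only the row indices and subsetting on demand may be lighter on memory.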
