so I am having a bit of bother combining two columns into one. I have two columns of ages, which are split into child and adolescent columns. For example:
child adolescent
1 NA 12
2 NA 15
3 NA 12
4 NA 12
5 NA 13
6 NA 13
7 NA 13
8 NA 14
9 14 15
10 NA 12
11 12 13
12 NA 12
13 NA 13
14 NA 14
15 NA 14
16 12 13
17 NA 14
18 NA 13
19 NA 13
20 NA 14
21 NA 12
22 NA 13
23 12 15
24 NA 13
25 NA 15
26 NA 12
27 NA 15
28 NA 15
29 NA 13
30 NA 12
31 13 15
Now what I would like to do is combine them into one column called "age" and remove all the NA values. However, when I try the following code, I encounter a problem:
age<- c(na.omit(data$child),na.omit(data$adolescent))
The problem being that my original data has 514 rows, yet when I combine the two columns, removing the NAs, I somehow end up with 543 values, not 514, and I don't know why.
So, if possible, could someone explain firstly why I am getting more values than I planned, and secondly what might be a better way to combine the two columns.
EDIT: I am looking for something like this
age
1 12
2 15
3 12
4 12
5 13
6 13
7 13
8 14
9 14
10 12
11 12
12 12
13 13
14 14
15 14
16 12
17 14
18 13
19 13
20 14
21 12
22 13
23 12
24 13
25 15
26 12
27 15
28 15
29 13
30 12
31 13
32 14
33 13
34 11
35 15
36 13
Thanks in advance
This line:
age<- c(na.omit(data$child),na.omit(data$adolescent))
concatenates all the non-missing values from the child field with all the non-missing values from the adolescent field, so any row where both columns are filled contributes two values; that is why you end up with more values than rows. I suspect you want one of these solutions:
# youngest age
age<- pmin(data$child,data$adolescent,na.rm=T)
# oldest age
age<- pmax(data$child,data$adolescent,na.rm=T)
# child age, replaced with adolescent if missing
age<- data$child
age[is.na(age)] <- data$adolescent[is.na(age)]
# ^ notice same logical index ^
# |_______________________________|
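A quick check of the three variants on toy data (a minimal sketch; the column values are made up, not the OP's actual data):

```r
data <- data.frame(child      = c(14, NA, 12),
                   adolescent = c(15, 12, 13))

# youngest age per row
pmin(data$child, data$adolescent, na.rm = TRUE)   # 14 12 12

# oldest age per row
pmax(data$child, data$adolescent, na.rm = TRUE)   # 15 12 13

# child age, falling back to adolescent when missing
age <- data$child
age[is.na(age)] <- data$adolescent[is.na(age)]
age                                               # 14 12 12
```

All three are row-wise, so the result always has exactly `nrow(data)` values.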
Your code works on the example data, but you could try this:
age <- c(data$child, data$adolescent)
age <- age[!is.na(age)]
This combines the two columns from the data frame into a vector and removes all NA elements.
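To make the double-counting concrete, here is a minimal sketch on toy data (four rows standing in for the OP's 514): every row in which both columns are filled contributes two values to the concatenation.

```r
# Toy data (hypothetical, mirroring the child/adolescent structure):
data <- data.frame(child      = c(NA, 14, 12, NA),
                   adolescent = c(12, 15, 13, 14))

# Concatenating the two na.omit() results double-counts every row
# where BOTH columns are filled:
age_bad <- c(na.omit(data$child), na.omit(data$adolescent))
length(age_bad)                                      # 6, not 4

overlap <- sum(!is.na(data$child) & !is.na(data$adolescent))
overlap                                              # 2 rows counted twice
nrow(data) + overlap == length(age_bad)              # TRUE
```

In the OP's case the same arithmetic suggests 543 - 514 = 29 rows have both ages filled.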
df$age <- ifelse( !(is.na(df$child)), df$child , df$adolescent)
Related
I am trying to:
- calculate the mean of each row from column 2 to 11 of my dataframe "alpha"
- add the result into column 12 of my dataframe "alpha", which currently has NA values
Column 1 is "locs".
my df:
[,1][,2][,3][,4][,5][,6][,7][,8][,9][,10][,11][,12]...[,17]
[1,] A1 5 9 4 8 12 4 8 12 4 8 NA NA
[2,] C3 6 10 4 8 12 4 8 12 4 8 NA NA
[3,] P2 7 11 5 6 10 5 6 10 5 6 NA NA
[4,] 4 8 12 5 6 10 5 6 10 5 6 NA NA
[49,] 4 8 12 5 6 10 5 6 10 5 6 NA NA
I am not very familiar with R and I don't understand the problem.
Those are the two different for loops I tried and the warning message:
> for (j in 1:49){
+ alpha[j, 12] <- mean(alpha[j,2:11])
+ }
There were 49 warnings (use warnings() to see them)
>
> for (j in 1:length(locs)) {
+ alpha$mean[j] <- mean(alpha[j,2:11])
+ }
There were 49 warnings (use warnings() to see them)
>
> warnings()
Warning messages:
1: In mean.default(alpha[j, 2:11]) :
  argument is not numeric or logical: returning NA
2: In mean.default(alpha[j, 2:11]) :
  argument is not numeric or logical: returning NA
'data.frame': 49 obs. of 17 variables:
$ locs: Factor w/ 49 levels "A1","C3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ sum.2009 : num 12 11 12 15 22 18 14 18 8 9 ...
$ sum.2010 : num 14 11 13 18 22 21 15 21 16 17 ...
$ sum.2011 : num 15 12 20 18 26 25 22 18 25 14 ...
$ sum.2012 : num 15 13 17 25 24 20 24 28 26 20 ...
$ sum.2013 : num 14 9 21 21 28 20 14 19 23 21 ...
$ sum.2014 : num 21 16 28 24 32 26 19 22 7 12 ...
$ sum.2015 : num 27 27 31 23 17 6 14 26 19 19 ...
$ sum.2016 : num 18 18 14 23 25 22 24 39 32 15 ...
$ sum.2017 : num 18 18 23 35 22 7 12 27 15 16 ...
$ sum.2018 : num 25 23 25 26 20 11 12 13 7 8 ...
$ mean : num NA NA NA NA NA NA NA NA NA NA ...
Then I converted "locs" from factor to numeric using:
alpha$locs <- as.numeric(alpha$locs)
alpha$locs <- lapply(alpha$locs , as.numeric)
which both ran, but I still got the same warning messages after running the for loops.
alpha[1, 2:11] is a data frame with one row, not a vector, and mean doesn't know what to do with a data frame. A better approach would be alpha[, 12] = rowMeans(alpha[, 2:11])
Your approach would work just fine if alpha were a matrix: matrices can only have one data type, so a row or a column can always be converted to a vector. But data frames are all about columns, and columns can have different types. alpha[2:11, 1] is a vector because it comes from a single column, and each column is a vector. But alpha[1, 2:11] spans several columns, each of which might have a different type, so R keeps it as a data frame.
Another approach you could take would be to unlist each row to convert it to a vector, alpha[j, 12] <- mean(unlist(alpha[j,2:11])). This will work, but it will be very slow compared to the rowMeans approach.
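A minimal sketch of the difference (toy data frame with a non-numeric first column, as in the question):

```r
alpha <- data.frame(locs     = c("A1", "C3"),
                    sum.2009 = c(12, 11),
                    sum.2010 = c(14, 11))

# mean() on a one-row data frame warns and returns NA:
suppressWarnings(mean(alpha[1, 2:3]))   # NA

# rowMeans() handles the whole numeric block at once:
rowMeans(alpha[, 2:3])                  # 13 11

# unlist() turns the row into a plain vector first, so mean() works:
mean(unlist(alpha[1, 2:3]))             # 13
```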
Here is my data
df<-read.table(text="A1 A2 AA2 A3 APP3 AA4 A4
17 17 14 18 18 14 17
16 15 13 16 19 15 19
17 14 12 19 15 18 14
17 16 16 18 19 19 20
19 18 12 18 13 17 17
12 19 17 18 16 20 18
20 18 14 13 15 15 16
18 20 12 20 12 12 18
12 15 18 14 16 18 18",h=T)
I want to select columns that have only one A, i.e.,
A1 A2 A3 A4
17 17 18 17
16 15 16 19
17 14 19 14
17 16 18 20
19 18 18 17
12 19 18 18
20 18 13 16
18 20 20 18
12 15 14 18
I have used the following code:
df1<- df%>%
select(contains("A"))
but it gives me every column whose name contains an A.
Is it possible to get table 2? Thanks for your help.
You can use matches() with a regex pattern. A pattern for "contains exactly 1 'A'" would be this "^[^A]*A[^A]*$"
df %>% select(matches("^[^A]*A[^A]*$"))
# A1 A2 A3 A4
# 1 17 17 18 17
# 2 16 15 16 19
# 3 17 14 19 14
# 4 17 16 18 20
# ...
Based on comments, my best guess for what you want is columns where the name starts with a P and after the P contains only numbers:
# single P followed by numbers
df %>% select(matches("^P[0-9]+$"))
# single A followed by numbers
df %>% select(matches("^A[0-9]+$"))
# single capital letter followed by numbers
df %>% select(matches("^[A-Z][0-9]+$"))
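The same patterns also work without dplyr, using grepl() on the column names (a sketch using a shortened version of the df from the question):

```r
df <- read.table(text = "A1 A2 AA2 A3 APP3 AA4 A4
17 17 14 18 18 14 17
16 15 13 16 19 15 19", header = TRUE)

# single A followed by numbers, base R equivalent of matches("^A[0-9]+$")
df[, grepl("^A[0-9]+$", names(df))]
names(df)[grepl("^A[0-9]+$", names(df))]   # "A1" "A2" "A3" "A4"
```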
If you're not very comfortable with regex, here's an alternative solution.
The first step is to create a function that counts the number of "A"s in a vector of strings. I will do this by creating a temporary vector of column names with all the As removed, and then subtracting the new number of characters from the original.
count_a <- function(vector){
  vec2 <- gsub("A", "", vector, fixed = TRUE)
  numb_As <- nchar(vector) - nchar(vec2)
  return(numb_As)
}
Once you have this function you simply apply it to the colnames of your dataset and then limit your data to the columns where the count is equal to one.
As <- count_a(colnames(df))
df[, As == 1]
If you are not familiar with regular expressions, you can use a function of the popular package for analysing strings: stringr. With one line you get this:
library(stringr)
df[,str_count(names(df),'A')==1]
I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture, that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?
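The subtract-then-sum idea described above can be written in one expression with pmax(), which clamps the below-threshold hours to zero (a sketch on the example day's 24 temperatures):

```r
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# hourly excess over 15 degrees, negatives clamped to zero, then summed
degree_hours <- sum(pmax(temp - 15, 0))
degree_hours   # 53
```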
This solution subtracts your threshold (i.e., 15°), fits a function to the result, then integrates this function. Note that if the temperature is below the threshold, it contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp-15, 0)),
lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
The OP has requested to figure out the total number of hours above 15 degrees by integrating in R.
It is not fully clear to me what the expected result is. Does the OP want to count the number of hours above 15 degrees, or does the OP want to sum up the degrees greater than 15 ("integrate")?
However, the code below creates both figures. Provided the data is sampled at each hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
.(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
I have a data.frame consisting of about 300k rows, with 24 rows for each ID, each row representing an hourly observation of that ID. My problem is that for some IDs the observation ends before the 24 hours have gone by, yet they still have 24 rows, with the remaining rows having NA in their 3 observation variables.
A simplified table would be something like this:
ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
1 0 29 32 34 19 21
1 1 21 12 NA 19 21
1 2 NA 24 NA 19 21
1 3 NA NA NA 19 21
1 4 NA NA NA 19 21
2 0 41 16 21 13 24
2 1 NA NA NA 13 24
2 2 11 30 41 13 24
2 3 21 NA NA 13 24
2 4 24 35 21 13 24
2 5 NA NA NA 13 24
2 6 NA NA NA 13 24
3 0 NA NA NA 35 46
3 1 23 34 24 35 46
3 2 NA 26 NA 35 46
3 3 NA NA 24 35 46
3 4 12 29 42 35 46
3 5 NA NA NA 35 46
3 6 NA NA NA 35 46
In the table, each ID would represent a scenario that should be handled appropriately:
ID 1: Ordinary, with observations starting from hour 0 and ending before hour 3 - thus the rows with hours 3 and 4 for that group should be removed.
ID 2: Has an hour (1) where all three observation variables are NA, but observation resumes and ends before hour 5 - thus the row for hour 1 should be kept (a faulty registration, not the end of observation) and the rows with hours 5 and 6 should be removed.
ID 3: Starts out with a row with NA in all three observation variables, but observation begins the next hour and ends before hour 5. This is akin to the scenario for ID 2, but occurring at the very start instead of in the middle of the observations. It still represents a faulty registration, so the row should be kept, and the rows for hours 5 and 6 in this group should be removed.
Conceptually, I would think a possible solution would be do a group_by ID and then for R to go through the rows in a group in reverse (from bottom and up) until it encounters a row where "OBS_1", "OBS_2" and "OBS_3" are not all NA and remove the rows examined before reaching to this row and then move on to examine the next group.
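That reverse-scan idea can be sketched in base R: for each ID, find the last row where at least one OBS_* column is filled and drop everything after it. A minimal sketch on toy data (ID 1 from the question; it assumes every ID has at least one observed row):

```r
df <- data.frame(ID    = c(1, 1, 1, 1, 1),
                 HOUR  = 0:4,
                 OBS_1 = c(29, 21, NA, NA, NA),
                 OBS_2 = c(32, 12, 24, NA, NA),
                 OBS_3 = c(34, NA, NA, NA, NA))

# TRUE where at least one observation variable is filled
observed <- !(is.na(df$OBS_1) & is.na(df$OBS_2) & is.na(df$OBS_3))

# per ID, keep row indices up to the last observed row
keep <- unlist(lapply(split(seq_len(nrow(df)), df$ID), function(idx) {
  idx[idx <= max(idx[observed[idx]])]
}))
df_trimmed <- df[sort(keep), ]
df_trimmed   # hours 3 and 4 are dropped
```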
Any help would be greatly appreciated!
If your MISC and MISC_2 values are consistent for each ID, you could filter out all rows whose observation values are NA, then fill the missing data back in with complete() and fill().
library(dplyr)
library(tidyr)
df %>% filter(!(is.na(OBS_1)&is.na(OBS_2)&is.na(OBS_3))) %>%
group_by(ID) %>%
complete(HOUR=0:max(HOUR)) %>%
fill(MISC,MISC_2) %>% fill(MISC,MISC_2,.direction = "up")
# A tibble: 13 x 7
# Groups: ID [3]
# ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 29 32 34 19 21
# 2 1 1 21 12 NA 19 21
# 3 1 2 NA 24 NA 19 21
# 4 2 0 41 16 21 13 24
# 5 2 1 NA NA NA 13 24
# 6 2 2 11 30 41 13 24
# 7 2 3 21 NA NA 13 24
# 8 2 4 24 35 21 13 24
# 9 3 0 NA NA NA 35 46
# 10 3 1 23 34 24 35 46
# 11 3 2 NA 26 NA 35 46
# 12 3 3 NA NA 24 35 46
# 13 3 4 12 29 42 35 46
This removes missing-value rows only when no further observations exist for that day, and keeps all missing observations that do not indicate the end of the day's observations. It also allows your other variables to vary during the day, because rows are only dropped once the end of observations is reached:
df %>% arrange(rev(as.numeric(rownames(.)))) %>%
group_by(ID) %>%
mutate(rowNum = 1:n(),
naObs = cumsum((is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))),
missingBlock = naObs != rowNum) %>%
slice(min(which(missingBlock)):n()) %>%
ungroup() %>%
arrange(rev(as.numeric(rownames(.)))) %>%
select(-rowNum, -naObs, -missingBlock)
I used igraph package to detect communities. When I used membership(community) function, the result is:
1 2 3 4 5 6 7 13 17 18 19 20 22 23 24 25
12 9 1 10 12 6 12 16 1 11 6 6 3 13 16 1
29 30 31 33 34 37 38 39 40 41 42 43 44 45 46 47
9 5 11 14 13 6 13 11 12 13 1 16 11 6 12 7
...
The first line is node ID and the second line is its corresponding community ID.
Suppose the name of the above result is X. I used Y=data.frame(X). The result is:
community
1 12
2 9
3 1
4 10
5 12
6 6
7 12
13 16
...
I want to index by the node ID in the first column (1, 2, 3, ...), so that, for instance, Y[13,] gives 16. But in this case it is Y[8,] that gives 16. How can I do this?
This question may be very simple, but I do not know how to google it. Thanks.
Function as.data.frame() converts a named vector to a data frame, where the names of the vector elements are used as row names. In other words, your node IDs are stored as row names, not as a column.
So index by row name rather than by position: Y["13", ] (note the quotes - row names are character) returns 16, and rownames(Y) gives you the node IDs themselves, e.g. rownames(Y)[8].
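A short sketch of what that looks like (toy membership vector, not the full igraph result):

```r
# named vector, as returned by membership(): names are node IDs
X <- c(`1` = 12, `2` = 9, `3` = 1, `13` = 16)
Y <- data.frame(community = X)

rownames(Y)            # "1" "2" "3" "13" -- the node IDs
Y["13", "community"]   # 16: indexing by node ID (character row name)
Y[4, "community"]      # 16: positional indexing, row number 4
```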