I wrote a function that is supposed to count how many NAs there are per column. Before I packed everything into a function it worked; now it doesn't.
It is probably just a silly beginner mistake, but I could still use your help on this.
My guess is that the statement
x[nrow(x),i] <- aux_count
does not properly assign the count, though I don't understand why.
The following code shows my function, which demonstrates the problem.
check_Quandl_tibble <- function(x){
  for(i in 2:ncol(x)){
    aux_count <- 0
    for(j in 1:(nrow(x)-1)){
      if(is.na(x[j,i])){
        aux_count <- aux_count + 1
      }
    }
    x[nrow(x),i] <- aux_count
  }
}
library(tibble)
a <- matrix(c(1,4, NA, 81), nrow = 5, ncol = 5) # the 4 values are recycled to fill the 5x5 matrix
a <- rbind(a, rep(NA, ncol(a)))
a <- as_tibble(a)
# a now looks like this
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 NA NA NA NA NA
a <- check_Quandl_tibble(a)
# a now looks like this
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 NA NA NA NA NA
# instead I wanted
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 1 1 2 1 1 # this row is supposed to count the NA's per column.
We can take the colSums of the logical matrix is.na(a) and rbind the result to the matrix:
rbind(a, colSums(is.na(a)))
Here, it is assumed that 'a' is the matrix from the first line of code:
a <- matrix(c(1,4, NA, 81), nrow = 5, ncol = 5)
If we want to replace the last row after creating the tibble:
a %>%
mutate_all(list(~ replace(., n(), sum(is.na(.[-n()])))))
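As for why the original function appears to do nothing: R passes arguments by value, so the loop modifies only a local copy of x, which is then discarded; a for loop returns NULL, and the function never returns the modified copy. A minimal sketch of the corrected function:
check_Quandl_tibble <- function(x){
  for(i in 2:ncol(x)){  # note: starts at column 2, so V1 is never filled; use 1:ncol(x) to count it too
    aux_count <- 0
    for(j in 1:(nrow(x)-1)){
      if(is.na(x[j,i])){
        aux_count <- aux_count + 1
      }
    }
    x[nrow(x),i] <- aux_count
  }
  x  # return the modified copy; without this the caller gets the for loop's NULL
}
a <- check_Quandl_tibble(a)  # now reassigning works as intended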
I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of each neighbour into the main dataframe. (Instead of spatial units it could be any kind of relation within a dataframe - business partners, relatives, related genes, etc.)
Some simplified data look like this:
library(dplyr)
set.seed(999)
df_base <- data.frame(id = seq(1:100),
dum= sample(c(rep(0,50), rep(1,50)),100),
nbid_1=sample(1:100,100),
nbid_2=sample(1:100,100),
nbid_3=sample(1:100,100)) %>%
mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))
(In these simplified data, unlike the real data, neighbours 1, 2 and 3 can coincide, but that does not matter for the question.)
My approach was to duplicate and then join the data, which would look like this:
df1 <- df_base
df2 <- df_base %>%
select(-c(nbid_1,nbid_2,nbid_3)) %>%
rename(nbdum=dum)
df <- left_join(df1,df2,by=c("nbid_1"="id")) %>%
rename(nbdum1=nbdum) %>%
left_join(.,df2,by=c("nbid_2"="id")) %>%
rename(nbdum2=nbdum) %>%
left_join(.,df2,by=c("nbid_3"="id")) %>%
rename(nbdum3=nbdum)
df is the result that I am looking for - from here I can create an overall neighbour dummy or a count.
This approach, however, is neither elegant nor feasible to implement with the real data, which has many more neighbours.
How can I solve this in a less clumsy way?
Thanks in advance for your ideas!!
A key clue is that when you see var_1, var_2, ..., var_n, it suggests that the data can be reshaped into a longer format. See pivot_longer() or data.table::melt(), where this "molten" data layout is discussed frequently.
For your example, we can pivot longer and then join the df2 table back. I am unsure whether the wide format is needed, but after the join we can pivot back to wide with pivot_wider().
library(dplyr)
library(tidyr)
df1 %>%
  select(!id) %>%
  pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_") %>%
  mutate(original_id = rep(1:100, each = 3)) %>%
  left_join(df2, by = c("value" = "id")) %>%
  pivot_wider(id_cols = original_id, values_from = c(value, nbdum))
#> # A tibble: 100 × 7
#> original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 25 90 23 0 0 1
#> 2 2 12 NA NA 1 NA NA
#> 3 3 11 40 47 0 0 0
#> 4 4 94 87 NA 0 1 NA
#> 5 5 46 77 NA 1 0 NA
#> 6 6 98 82 NA 1 0 NA
#> 7 7 43 NA NA 1 NA NA
#> 8 8 74 NA 7 0 NA 1
#> 9 9 57 NA NA 1 NA NA
#> 10 10 49 72 NA 0 0 NA
#> # … with 90 more rows
## compare to original
as_tibble(df)
#> # A tibble: 100 × 8
#> id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 0 25 90 23 0 0 1
#> 2 2 1 12 NA NA 1 NA NA
#> 3 3 1 11 40 47 0 0 0
#> 4 4 1 94 87 NA 0 1 NA
#> 5 5 0 46 77 NA 1 0 NA
#> 6 6 1 98 82 NA 1 0 NA
#> 7 7 1 43 NA NA 1 NA NA
#> 8 8 0 74 NA 7 0 NA 1
#> 9 9 0 57 NA NA 1 NA NA
#> 10 10 0 49 72 NA 0 0 NA
#> # … with 90 more rows
As you just seem to be indexing dum with your neighbour variables, you should be able to do:
library(dplyr)
df_base %>%
  mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum{1:3}"))
id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1 1 0 25 90 23 0 0 1
2 2 1 12 NA NA 1 NA NA
3 3 1 11 40 47 0 0 0
4 4 1 94 87 NA 0 1 NA
5 5 0 46 77 NA 1 0 NA
6 6 1 98 82 NA 1 0 NA
7 7 1 43 NA NA 1 NA NA
8 8 0 74 NA 7 0 NA 1
9 9 0 57 NA NA 1 NA NA
10 10 0 49 72 NA 0 0 NA
...
Or the same idea in base R:
df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])
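To follow up on the question's note that an overall neighbour dummy or a count can be built from here, a hedged sketch (the names nb_count and nb_any are made up for illustration):
library(dplyr)
df %>%
  mutate(
    nb_count = rowSums(across(starts_with("nbdum")), na.rm = TRUE),  # number of neighbours with dum == 1
    nb_any   = +(nb_count > 0)                                       # 1 if any neighbour has dum == 1
  )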
I have a data frame like this:
df <- data.frame(student_name = c('U','V','X','Y','Z'),
grade = c('AA','CC','DD','AB','BB'),
math_marks = c(40,80,38,97,65),
eng_marks = c(95,78,36,41,25),
sci_marks = c(56,25,36,87,15),
Point_A=c(1,1,1,1,NA),
Point_B=c(NA,1,NA,1,1),
Point_C=c(NA,1,NA,NA,NA),
Point_D=c(NA,NA,NA,NA,1),
Point_E=c(NA,1,NA,NA,1))
I need to add a column called "Point" based on the values of the columns Point_A to Point_E: it should be 1 if at least 3 of those columns equal 1, and NA otherwise.
Expected output:
df <- data.frame(student_name = c('U','V','X','Y','Z'),
grade = c('AA','CC','DD','AB','BB'),
math_marks = c(40,80,38,97,65),
eng_marks = c(95,78,36,41,25),
sci_marks = c(56,25,36,87,15),
Point_A=c(1,1,1,1,NA),
Point_B=c(NA,1,NA,1,1),
Point_C=c(NA,1,NA,NA,NA),
Point_D=c(NA,NA,NA,NA,1),
Point_E=c(NA,1,NA,NA,1),
Point=c(NA,1,NA,NA,1))
So far I have been filtering on every possible combination of 3 columns, e.g.
df %>% filter(Point_A == 1, Point_B == 1, Point_C == 1)
Is there any other way to do this?
To subset down to the rows meeting that condition, use rowSums with across to sum the 1s by row:
df %>% filter(rowSums(across(starts_with("Point")), na.rm = TRUE) >= 3)
## student_name grade math_marks eng_marks sci_marks Point_A Point_B Point_C Point_D Point_E
## 1 V CC 80 78 25 1 1 1 NA 1
## 2 Z BB 65 25 15 NA 1 NA 1 1
or to add a 0/1 Point column indicating whether each row satisfies the condition:
df %>% mutate(Point = +(rowSums(across(starts_with("Point")), na.rm = TRUE) >= 3))
## student_name grade math_marks eng_marks sci_marks Point_A Point_B Point_C Point_D Point_E Point
## 1 U AA 40 95 56 1 NA NA NA NA 0
## 2 V CC 80 78 25 1 1 1 NA 1 1
## 3 X DD 38 36 36 1 NA NA NA NA 0
## 4 Y AB 97 41 87 1 1 NA NA NA 0
## 5 Z BB 65 25 15 NA 1 NA 1 1 1
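The same rowSums idea also works in base R and reproduces the question's NA/1 coding for Point (a sketch, assuming the Point_* columns are the only ones whose names start with "Point"):
pts <- df[startsWith(names(df), "Point")]
df$Point <- ifelse(rowSums(pts == 1, na.rm = TRUE) >= 3, 1, NA)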
Alternatively, ifelse with the rowwise option:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Point = ifelse(rowSums(across(Point_A:Point_E, ~ .x == 1), na.rm = TRUE) >= 3, 1, NA))
Output:
# A tibble: 5 × 11
# Rowwise:
student_name grade math_marks eng_marks sci_marks Point_A Point_B Point_C Point_D Point_E Point
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 U AA 40 95 56 1 NA NA NA NA NA
2 V CC 80 78 25 1 1 1 NA 1 1
3 X DD 38 36 36 1 NA NA NA NA NA
4 Y AB 97 41 87 1 1 NA NA NA NA
5 Z BB 65 25 15 NA 1 NA 1 1 1
mydata <-data.frame(group = c(1,1,1,1,1,2,2,2,2,2), score = c(10, NA, NA, 20, 30, 5, NA, NA, NA, 40))
From 'mydata' I am trying to use dplyr to interpolate 'score' with na.approx, but only when there are fewer than 3 consecutive NAs between the closest non-NA entries. The interpolated values are stored in 'score_approx'.
Without the condition on the number of consecutive NAs, I use this code:
library(dplyr)
library(zoo)
mydata %>%
  group_by(group) %>%
  mutate(score_approx = na.approx(score)) %>%
  mutate(score_approx = coalesce(score_approx, score))
# A tibble: 10 x 3
# Groups: group [2]
group score score_approx
<dbl> <dbl> <dbl>
1 1 10 10
2 1 NA 13.3
3 1 NA 16.7
4 1 20 20
5 1 30 30
6 2 5 5
7 2 NA 13.8
8 2 NA 22.5
9 2 NA 31.2
10 2 40 40
However, the desired data frame is:
# A tibble: 10 x 3
# Groups: group [2]
group score score_approx
<dbl> <dbl> <dbl>
1 1 10 10
2 1 NA 13.3
3 1 NA 16.7
4 1 20 20
5 1 30 30
6 2 5 5
7 2 NA NA
8 2 NA NA
9 2 NA NA
10 2 40 40
You can use the maxgap argument in na.approx; runs of more than maxgap consecutive NAs are left as NA:
library(dplyr)
library(zoo)
mydata %>%
  group_by(group) %>%
  mutate(score_approx = na.approx(score, maxgap = 2)) %>%
  ungroup()
# group score score_approx
# <dbl> <dbl> <dbl>
# 1 1 10 10
# 2 1 NA 13.3
# 3 1 NA 16.7
# 4 1 20 20
# 5 1 30 30
# 6 2 5 5
# 7 2 NA NA
# 8 2 NA NA
# 9 2 NA NA
#10 2 40 40
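For intuition, this is how maxgap behaves on plain vectors (a small made-up illustration):
library(zoo)
na.approx(c(1, NA, NA, 4), maxgap = 2)     # gap of 2 NAs is filled: 1 2 3 4
na.approx(c(1, NA, NA, NA, 5), maxgap = 2) # gap of 3 NAs is kept:   1 NA NA NA 5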
Here is an example of my data:
library(tidyverse)
set.seed(1234)
df <- tibble(
v1 = c(1:6),
v2 = rnorm(6, 5, 2) %>% round,
v3 = rnorm(6, 4, 2) %>% round,
v4 = rnorm(6, 4, 1) %>% round %>% lag(1),
v5 = rnorm(6, 6, 2) %>% round %>% lag(2),
v6 = rnorm(6, 5, 3) %>% round %>% lag(3),
v7 = rnorm(6, 5, 3) %>% round %>% lag(4))
v1 v2 v3 v4 v5 v6 v7
1 1 3 3 NA NA NA NA
2 2 6 3 3 NA NA NA
3 3 7 3 4 4 NA NA
4 4 0 2 5 11 3 NA
5 5 6 3 4 6 1 8
6 6 6 2 3 5 7 4
I want to shift the columns along the diagonal that separates the NAs from the filled data.
So the desired output looks like this:
v1 v2 v3 v4 v5 v6 v7
1 NA NA 3 3 4 3 8
2 NA 3 3 4 11 1 4
3 1 6 3 5 6 7 NA
4 2 7 2 4 5 NA NA
5 3 0 3 4 NA NA NA
6 4 6 2 NA NA NA NA
7 5 6 NA NA NA NA NA
8 6 NA NA NA NA NA NA
Each column left of v3 is shifted down and each column right of v3 is shifted up by 1, 2, 3, etc. rows.
I tried to achieve this inside dplyr::mutate_all(), but I failed to iterate it with the lag() and lead() functions.
EDIT: following @wibeasley's advice, I made this:
df %>%
  mutate(dummy1 = c(3:8)) %>%
  gather("var", "val", -dummy1) %>%
  mutate(
    dummy2 = sub("v", "", var, fixed = TRUE),
    dummy3 = dummy1 - as.numeric(dummy2) + 1) %>%
  select(-dummy1, -dummy2) %>%
  spread(var, val) %>%
  slice(-c(1:4)) %>%
  select(-dummy3)
Looks ugly, but works.
We can use lapply to handle each column, pushing the NAs to the back:
df[] <- lapply(df, function(x) c(x[!is.na(x)], x[is.na(x)]))
df
# # A tibble: 6 x 7
# v1 v2 v3 v4 v5 v6 v7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3 4 3 8
# 2 2 6 3 4 11 1 4
# 3 3 7 3 5 6 7 NA
# 4 4 0 2 4 5 NA NA
# 5 5 6 3 3 NA NA NA
# 6 6 6 2 NA NA NA NA
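Note that this keeps the original 6 rows rather than the 8-row layout shown in the question. If that exact layout is needed, a hedged sketch that shifts each column relative to v3 (assuming, as the question implies, that v3 is the pivot and each column's non-NA values form one contiguous block):
library(tibble)
pivot <- 3                        # v3 stays in place
shift <- pivot - seq_along(df)    # down-shift per column: 2, 1, 0, -1, ...
n_out <- nrow(df) + max(shift)    # 6 + 2 = 8 output rows
df_shifted <- as_tibble(Map(function(v, s) {
  vals <- v[!is.na(v)]            # keep only the observed values
  pad  <- max(s, 0)               # leading NAs for columns left of the pivot
  c(rep(NA, pad), vals, rep(NA, n_out - pad - length(vals)))
}, df, shift))
df_shifted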
I'm trying to get some statistics (min, max, mean) of repeated values by row in R.
My dataframe looks similar to this:
b <- as.data.frame(matrix(ncol=7, nrow=3,
c(3,NA,NA,4,5,NA,7,6,NA,7,NA,8,9,NA,NA,4,6,NA,NA,7,NA), byrow = TRUE))
For each row, I want to add columns with the min, max and mean of the lengths of the runs of consecutive NAs; it should look something like this:
V1 V2 V3 V4 V5 V6 V7 max min mean
1 3 NA NA 4 5 NA 7 2 1 1.5
2 6 NA 7 NA 8 9 NA 1 1 1.0
3 NA 4 6 NA NA 7 NA 2 1 1.33
This is just a small example of my dataset with 2000 rows and 48 columns.
Does anyone have some code for this?
You can apply over the rows and get the runs of NAs with rle. Once you have the run lengths, you can simply take their summary stats:
b[,c("mean", "max", "min")] <- do.call(rbind, apply(b, 1, function(x){
res <- rle(!is.na(x))
res2 <- res[["lengths"]][res[["values"]]]
data.frame(mean = mean(res2), max = max(res2), min = min(res2))
}
))
b
#  V1 V2 V3 V4 V5 V6 V7     mean max min
#1  3 NA NA  4  5 NA  7 1.500000   2   1
#2  6 NA  7 NA  8  9 NA 1.000000   1   1
#3 NA  4  6 NA NA  7 NA 1.333333   2   1
A dplyr solution with rle, which computes the lengths of runs of equal values in a vector.
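For intuition, rle on a small logical vector (made-up values):
rle(is.na(c(3, NA, NA, 4)))
#> Run Length Encoding
#>   lengths: int [1:3] 1 2 1
#>   values : logi [1:3] FALSE TRUE FALSE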
library(dplyr)
b %>%
  cbind(b %>%
          rowwise() %>%
          do(rl = rle(is.na(.))$lengths[rle(is.na(.))$values == TRUE])) %>%
  rowwise() %>%
  mutate(max = max(rl),
         min = min(rl),
         mean = mean(rl)) %>%
  select(-rl)
# V1 V2 V3 V4 V5 V6 V7 max min mean
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
# 1 3 NA NA 4 5 NA 7 2 1 1.50
# 2 6 NA 7 NA 8 9 NA 1 1 1.00
# 3 NA 4 6 NA NA 7 NA 2 1 1.33