R language check missing data for columns and rows

I have a data frame sells and I want to check for missing data in both rows and columns.
What I did for rows is:
sells[complete.cases(sells), ]
nrow(sells[complete.cases(sells), ])
but I don't know how to do it for columns.
Help please

First, let's take the iris data frame and randomly insert some NAs:
iris.demo <- iris
iris.nas <- matrix(sample(c(FALSE, TRUE), size = 150*5,
                          prob = c(.9,.1), replace = TRUE), ncol = 5)
iris.demo[iris.nas] <- NA
For rows, it is pretty straightforward:
sum(complete.cases(iris.demo))
# [1] 75
For columns, two possibilities (among several possible others):
Transposing the whole dataframe
sum(complete.cases(t(iris.demo)))
# [1] 0 # 0 columns are complete
Using lapply to count the "non-missing" on every column and see if it's equal to nrow:
sum(lapply(iris.demo, function(x) sum(!is.na(x))) == nrow(iris.demo))
# [1] 0
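A more direct route is to count the NAs with is.na() combined with colSums() and rowSums(). The following is a minimal sketch on a small hand-made data frame; the name toy and its contents are assumptions for illustration and do not come from the question:

```r
# Small illustrative data frame (hypothetical, not the asker's data)
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  c = 1:3,
                  stringsAsFactors = FALSE)

na_per_col <- colSums(is.na(toy))   # number of NAs in each column
na_per_row <- rowSums(is.na(toy))   # number of NAs in each row

complete_cols   <- names(toy)[na_per_col == 0]  # columns without any NA
n_complete_rows <- sum(na_per_row == 0)         # same as sum(complete.cases(toy))
```

Here complete_cols is "c" and n_complete_rows is 1, because only column c and only the first row contain no NA.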

You could do it like this:
set.seed(1)
(sells <- data.frame(replicate(2, sample(c(1:3, NA), 10, T)), x3 = 1:10))
# X1 X2 x3
# 1 NA 2 1
# 2 1 3 2
# 3 3 2 3
# 4 1 1 4
# 5 2 NA 5
# 6 2 3 6
# 7 1 NA 7
# 8 2 1 8
# 9 NA 3 9
# 10 2 2 10
Rows:
sells[complete.cases(sells), ]
# X1 X2 x3
# 2 1 3 2
# 3 3 2 3
# 4 1 1 4
# 6 2 3 6
# 8 2 1 8
# 10 2 2 10
nrow(sells[complete.cases(sells), ])
# [1] 6
Columns:
sells[, sapply(sells, function(col) any(is.na(col)))]
# X1 X2
# 1 NA 2
# 2 1 3
# 3 3 2
# 4 1 1
# 5 2 NA
# 6 2 3
# 7 1 NA
# 8 2 1
# 9 NA 3
# 10 2 2
sum(sapply(sells, function(col) any(is.na(col))))
# [1] 2

Related

R rename columns (hlookup)

I have the following data
data <- data.frame(c=1:5, ID=0,a=2,b=5:9)
naming <- data.frame(short=c("a","b","c", "d", "e"), long=c("aa","bb","cc", "dd", "ee"))
I would like to rename the columns in data frame data from c,ID,a,b to cc,ID,aa,bb.
I tried:
colnames(data) <- naming[match(naming$short, colnames(data)),2]
but this does not work because the two vectors are not of the same length. I would also like to keep the column names in data that are not in naming.
Any suggestions? Basically it's the hlookup function from Excel, but due to large data files I cannot do this in Excel.
Base R:
matches <- match(names(data), naming$short)
matches
# [1] 3 NA 1 2
ifelse(is.na(matches), names(data), naming$long[matches])
# [1] "cc" "ID" "aa" "bb"
names(data) <- ifelse(is.na(matches), names(data), naming$long[matches])
data
# cc ID aa bb
# 1 1 0 2 5
# 2 2 0 2 6
# 3 3 0 2 7
# 4 4 0 2 8
# 5 5 0 2 9
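The same lookup can also be written with a named character vector, which may read closer to the hlookup idea. This is only a sketch: the helper names lookup and hit are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that it behaves the same on older R versions:

```r
data <- data.frame(c = 1:5, ID = 0, a = 2, b = 5:9)
naming <- data.frame(short = c("a", "b", "c", "d", "e"),
                     long  = c("aa", "bb", "cc", "dd", "ee"),
                     stringsAsFactors = FALSE)

# Named character vector: names are the short labels, values the long ones
lookup <- setNames(naming$long, naming$short)

# Only rename columns that have an entry in the lookup; others keep their names
hit <- names(data) %in% names(lookup)
names(data)[hit] <- lookup[names(data)[hit]]

names(data)
# [1] "cc" "ID" "aa" "bb"
```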
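Using dplyr's rename_at and passing an anonymous function seems to fit your purposes. Note that this works here only because each long name is the short name doubled; it does not look anything up in naming.
library(dplyr)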
data <- data.frame(c=1:5, ID=0,a=2,b=5:9)
data
# c ID a b
# 1 1 0 2 5
# 2 2 0 2 6
# 3 3 0 2 7
# 4 4 0 2 8
# 5 5 0 2 9
cols_iwant_to_rename <- c("a","b","c")
data <-
data %>% rename_at(cols_iwant_to_rename , function(x) paste0(x, x))
data
# cc ID aa bb
# 1 1 0 2 5
# 2 2 0 2 6
# 3 3 0 2 7
# 4 4 0 2 8
# 5 5 0 2 9

split columns by character in list of data frames

I have the following list of data frames:
df1 <- data.frame(x = 1:3, y=c("1,2","1,2,3","1,5"))
df2 <- data.frame(x = 4:6, y=c("1,2","1,4","1,6,7,8"))
filelist <- list(df1,df2)
> filelist
[[1]]
x y
1 1 1,2
2 2 1,2,3
3 3 1,5
[[2]]
x y
1 4 1,2
2 5 1,4
3 6 1,6,7,8
Now I want to split each column 'y' by character ',' and store the output in new columns in the dataframe.
The output should look like this:
> filelist
[[1]]
  x y_ref y_alt1 y_alt2
1 1     1      2
2 2     1      2      3
3 3     1      5
[[2]]
  x y_ref y_alt1 y_alt2 y_alt3
1 4     1      2
2 5     1      4
3 6     1      6      7      8
How should I do this? I know there is 'strsplit' to split a string by character. But I don't see how I can store the output then in different columns.
apply cSplit on "y" column of each dataframe in filelist
lapply(filelist, splitstackshape::cSplit, "y")
#[[1]]
# x y_1 y_2 y_3
#1: 1 1 2 NA
#2: 2 1 2 3
#3: 3 1 5 NA
#[[2]]
# x y_1 y_2 y_3 y_4
#1: 4 1 2 NA NA
#2: 5 1 4 NA NA
#3: 6 1 6 7 8
Here's a solution that relies on tstrsplit from data.table
library(data.table)
lapply(filelist,
       function(DF) {
         commas <- max(nchar(as.character(DF$y)) - nchar(gsub(",", "", DF$y)))
         DF[, c('y_ind', paste0('y_alt', seq_len(commas)))] <- tstrsplit(as.character(DF$y), ',')
         DF
       })
#> [[1]]
#> x y y_ind y_alt1 y_alt2
#> 1 1 1,2 1 2 <NA>
#> 2 2 1,2,3 1 2 3
#> 3 3 1,5 1 5 <NA>
#>
#> [[2]]
#> x y y_ind y_alt1 y_alt2 y_alt3
#> 1 4 1,2 1 2 <NA> <NA>
#> 2 5 1,4 1 4 <NA> <NA>
#> 3 6 1,6,7,8 1 6 7 8
Created on 2019-09-17 by the reprex package (v0.3.0)
Also, with tidyr (loaded as part of the tidyverse) you can use separate() like this:
df %>%
  separate(y, into = c("y_ind", "y_alt1", ...), sep = ",")
Note that into can also be used more "programmatically" to generate the needed amount of resulting columns with a proper indexing without manually defining each result column.
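One way to build into programmatically is sketched below. It assumes tidyr is installed; the helper name split_y is hypothetical, and the y_ref / y_alt naming follows the output requested in the question:

```r
library(tidyr)

df1 <- data.frame(x = 1:3, y = c("1,2", "1,2,3", "1,5"), stringsAsFactors = FALSE)
df2 <- data.frame(x = 4:6, y = c("1,2", "1,4", "1,6,7,8"), stringsAsFactors = FALSE)
filelist <- list(df1, df2)

split_y <- function(DF) {
  # Largest number of comma-separated pieces in any row of y
  n_parts <- max(lengths(strsplit(DF$y, ",", fixed = TRUE)))
  # First piece is the reference, the remaining ones are the alternatives
  into <- c("y_ref", paste0("y_alt", seq_len(n_parts - 1)))
  # fill = "right" pads shorter rows with NA on the right
  separate(DF, y, into = into, sep = ",", fill = "right")
}

result <- lapply(filelist, split_y)
```

Each data frame gets as many y_alt columns as its longest entry needs, so the first element ends up with y_alt1 and y_alt2 and the second with y_alt1 to y_alt3.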

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
id = c(1:10),
trials = sample(1:3, 10, replace = T),
t1 = c(sample(c(1:9, NA), 10)),
t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 hold the scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
x[which(is.na(x[1:(num.sun + 2)]))] <- 0
return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
id = c(1:5),
trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
t1 = c(NA, 7, NA, 6, NA),
t2 = c(NA, NA, 3, 7, 12),
t3 = c(NA, NA, NA, 4, NA)
)
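The warning arises because apply() hands the whole df$trials vector to num.sun on every call, so 1:(num.sun + 2) only ever sees its first element. One way around it is to skip the row-wise function altogether and compare each cell's column position with the trial count of its row. This is a sketch on the corrected dataset; the names score_cols, scores and in_scope are made up for illustration:

```r
df <- data.frame(
  id = 1:5,
  trials = c(1, 1, 2, 3, 3),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)

score_cols <- c("t1", "t2", "t3")
scores <- as.matrix(df[score_cols])

# col(scores) holds each cell's column position (1 for t1, 2 for t2, ...).
# df$trials is recycled down the columns, so every cell is compared with
# the trial count of its own row.
in_scope <- col(scores) <= df$trials

# Only NAs that fall within the row's number of trials become 0
scores[is.na(scores) & in_scope] <- 0
df[score_cols] <- scores
```

With this data, t1 becomes 0, 7, 0, 6, 0 and the NA in t3 of row 5 becomes 0, while NAs beyond a row's trial count stay NA.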
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
w <- which(is.na(df), arr.ind = TRUE)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
df[i] <- 0
return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 and 10) have more values than trials. You could tackle that problem by rewriting the above function to:
replace.with.NA.or.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  df[w] <- 0
  # row index and last in-scope column (trials + 2) for every row
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  # columns after the last in-scope one, up to column 5 (t3)
  v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5,-1))
  ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
  df[ina] <- NA
  return(df)
}
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t',x['trials'])], which overcomes the problem the other two solutions have with row 6:
replace0 <- function(x){
  x_na <- x[paste0('t',x['trials'])]
  if(is.na(x_na)){x[paste0('t',x['trials'])] <- 0}
  return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic since they have more non-NA values than trials.
Here's a tidyverse way. Note that it doesn't give the same output as the other solutions:
your example data shows results for trials that "didn't happen", and I assumed your real data doesn't.
library(tidyverse)
df %>%
nest(matches("^t\\d")) %>%
mutate(data = map2(data,trials,~mutate_all(.,replace_na,0) %>% select(.,1:.y))) %>%
unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more commonly used gather strategy this would be:
df %>%
gather(k,v,matches("^t\\d")) %>%
arrange(id) %>%
group_by(id) %>%
slice(1:first(trials)) %>%
mutate_at("v",~replace(.,is.na(.),0)) %>%
spread(k,v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA

How to calculate number of specific values in a data frame in R? [duplicate]

I have a dataframe df:
a b c
1 5 5
2 3 5
3 3 5
3 3 3
3 3 2
4 2 2
1 2 2
I want to count how many 3s there are in each row. How can I do that?
For example row 2 = 1, row 3 = 2 etc.
Please advise.
The answer of @ManuelBickel is good if you want to count all of the values. If you really just want to know how many 3s there are, this might be simpler:
rowSums(data==3)
[1] 0 1 2 3
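For completeness, here is the same idea run on all seven rows from the question. It is a small sketch; the variable name threes_per_row is made up for illustration:

```r
df <- read.table(text = "
a b c
1 5 5
2 3 5
3 3 5
3 3 3
3 3 2
4 2 2
1 2 2
", header = TRUE)

# df == 3 is a logical matrix; each TRUE counts as 1 when summed row by row
threes_per_row <- rowSums(df == 3)
unname(threes_per_row)
# [1] 0 1 2 3 2 0 0
```

Replacing rowSums() with colSums() would give the count per column instead.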
If you want the counts returned in a more ordered fashion
set.seed(1)
m <- matrix(sample(c(1:3, 5), 15, replace=TRUE), 5, dimnames=list(LETTERS[1:5]))
m
# [,1] [,2] [,3]
# A 2 5 1
# B 2 5 1
# C 3 3 3
# D 5 3 2
# E 1 1 5
u <- sort(unique(as.vector(m)))
r <- sapply(setNames(u, u), function(x) rowSums(m == x))
r
# 1 2 3 5
# A 1 1 0 1
# B 1 1 0 1
# C 0 0 3 0
# D 0 1 1 1
# E 2 0 0 1
You can use apply and table for this. The output is a list giving you the counts of unique elements per row. (If this is of interest, setting the MARGIN of apply to 2 would give you the output per column.)
Update: Since others have provided solutions producing more "ordered" output in the meantime, I have amended my approach by using data.table::rbindlist for this purpose.
#I have skipped some of the last rows of your example
data <- read.table(text = "
a b c
1 5 5
2 3 5
3 3 5
3 3 3
", header = T, stringsAsFactors = F)
apply(data, 1, table)
# [[1]]
# 1 5
# 1 2
# [[2]]
# 2 3 5
# 1 1 1
# [[3]]
# 3 5
# 2 1
# [[4]]
# 3
# 3
#Update: output in more ordered fashion
library(data.table)
rbindlist(apply(data, 1, function(x) as.data.table(t(as.matrix(table(x)))))
,fill = TRUE
,use.names = TRUE)
# 1 5 2 3
# 1: 1 2 NA NA
# 2: NA 1 1 1
# 3: NA 1 NA 2
# 4: NA NA NA 3
#if necessary NA values might be replaced, see, e.g.,
##https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table

cbind in for loop with unequal number of rows

This is my data frame
>head(dat)
Var1 Freq
1 89 2
2 95 2
3 97 1
4 99 2
5 103 2
6 104 2
I want to iterate in a for loop and append dat$Freq using cbind. Is it possible to append the Freq values to the same Var1, with NA where Freq is missing?
I think OP is looking to merge a list of data.frames rather than cbind them.
The following should do the trick.
DF.LIST <- lapply(1:5, function(x) {
rows <- sample(1:5, 1)
data.frame(Var1 = sample(1:5, rows), Freq = sample(5:10, rows))
})
DF.LIST
## [[1]]
## Var1 Freq
## 1 2 6
## 2 4 7
## 3 3 9
## 4 5 10
##
## [[2]]
## Var1 Freq
## 1 3 10
## 2 2 9
##
## [[3]]
## Var1 Freq
## 1 4 5
## 2 3 6
##
## [[4]]
## Var1 Freq
## 1 1 6
## 2 2 10
## 3 5 7
## 4 3 9
## 5 4 8
##
## [[5]]
## Var1 Freq
## 1 5 10
##
OPTION 1
The problem with using the Reduce and merge combination directly on such a list is that merge would join on both the Var1 and Freq columns. To avoid that, we first rename the second column in each data.frame by appending an index number. After that, Reduce and merge should give what OP wants.
for (i in seq_along(DF.LIST)) {
  names(DF.LIST[[i]]) <- c("Var1", paste0("Freq", i))
}
Reduce(function(...) merge(..., all = TRUE), DF.LIST)
## Var1 Freq1 Freq2 Freq3 Freq4 Freq5
## 1 1 NA NA NA 6 NA
## 2 2 6 9 NA 10 NA
## 3 3 9 10 6 9 NA
## 4 4 7 NA 5 8 NA
## 5 5 10 NA NA 7 10
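Because the list above is generated randomly, here is a small deterministic sketch of the same rename-then-merge idea that can be checked by hand. The two data frames and the names df_list and merged are made up for illustration:

```r
df_list <- list(
  data.frame(Var1 = c(89, 95), Freq = c(2, 2)),
  data.frame(Var1 = c(95, 97), Freq = c(1, 3))
)

# Give every Freq column a unique name so that merge() joins on Var1 only
for (i in seq_along(df_list)) {
  names(df_list[[i]]) <- c("Var1", paste0("Freq", i))
}

# Full outer join of all data frames on Var1
merged <- Reduce(function(x, y) merge(x, y, by = "Var1", all = TRUE), df_list)
merged
#   Var1 Freq1 Freq2
# 1   89     2    NA
# 2   95     2     1
# 3   97    NA     3
```

A Var1 value that is missing from one of the data frames ends up with NA in that data frame's Freq column.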
OPTION 2
You can try the following on the original DF.LIST directly, but you then still need to take care of the column names in the result.
Reduce(function(...) merge(..., by = "Var1", all = T), DF.LIST)
## Var1 Freq.x Freq.y Freq.x Freq.y Freq
## 1 1 NA 6 NA 9 NA
## 2 2 7 NA NA 8 NA
## 3 3 NA 5 NA 7 7
## 4 4 NA 7 5 5 10
## 5 5 9 9 7 6 NA
Warning messages:
1: In merge.data.frame(..., by = "Var1", all = T) :
column names ‘Freq.x’, ‘Freq.y’ are duplicated in the result
2: In merge.data.frame(..., by = "Var1", all = T) :
column names ‘Freq.x’, ‘Freq.y’ are duplicated in the result
