colMeans not working in R - r

The data set I have as file Dummy.txt is as follows
A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6
(please note a text character in 5th row, 4th column)
I would like to obtain the mean of each column (i.e. column A, B, C and D).
The code I am using is as follows:
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = FALSE, row.names = NULL)
mydata_1 <- as.numeric(as.character(mydata_1))
colMeans(mydata_1, na.rm = TRUE,)
However, this doesn't seem to be working. Any suggestions please?

You need to set header = TRUE to have the A|B|C|D row be used for column names, otherwise they are included as values, and all columns are parsed as string columns.
Then, passing stringsAsFactors = FALSE prevents column D from being turned into a factor, so the value 'B6' is automatically turned into an NA when the column is converted to a numeric type.
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = TRUE,
row.names = NULL, stringsAsFactors = FALSE)
mydata_1[] <- lapply(mydata_1, as.numeric)
#> Warning message:
#> In lapply(mydata_1, as.numeric) : NAs introduced by coercion
colMeans(mydata_1, na.rm = TRUE)
#> A B C D
#> 3.875 37.250 149.725 2.800
The syntax mydata_1[] <- ... makes mydata_1 keep its data frame structure even though a list is being returned on the right-hand side.
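A small sketch of that last point, using an inline copy of the question's data (read through the text argument rather than from Dummy.txt, which is an assumption made here so the example is self-contained). The names as_list and kept are illustrative only.

```r
# Inline copy of the Dummy.txt contents from the question
txt <- "A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6"
df <- read.delim(text = txt, sep = "|", header = TRUE, stringsAsFactors = FALSE)

# Helper that converts one column and silences the coercion warning for 'B6'
to_num <- function(col) suppressWarnings(as.numeric(col))

# Without []: the data frame is replaced by a plain list
as_list <- lapply(df, to_num)
class(as_list)
#> [1] "list"

# With []: only the columns are replaced, the data frame structure is kept
kept <- df
kept[] <- lapply(kept, to_num)
class(kept)
#> [1] "data.frame"

colMeans(kept, na.rm = TRUE)
```

colMeans() needs a matrix-like or data frame input, which is why keeping the data frame structure matters here.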

The problem here is that as.numeric(as.character(mydata_1)) returns [1] NA NA NA NA.
My suggestion would be to first go through all columns and coerce the types using sapply(), and then calculate the means of the columns:
library(magrittr)
mydata_1 %>%
sapply(., function(col) as.numeric(as.character(col))) %>%
colMeans(na.rm = TRUE)
This will return:
A B C D
3.875 37.250 149.725 2.800
Note: I am using magrittr to make use of the pipe (%>%) operator to chain the operations so you can check the output of every step.
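As far as I understand it, the original call fails because as.character() applied to a whole data frame deparses each column into a single string, so as.numeric() receives one unparseable string per column. A minimal sketch with a made-up two-column data frame (toy is not from the question):

```r
toy <- data.frame(A = c(1, 2.5), D = c("5", "B6"), stringsAsFactors = FALSE)

# One deparsed string per column, not one value per cell
flat <- as.character(toy)
length(flat)
#> [1] 2

# Each string describes a whole column, so none of them parses as a number
whole <- suppressWarnings(as.numeric(flat))
whole
#> [1] NA NA

# Converting column by column gives a numeric matrix that colMeans() accepts
per_col <- sapply(toy, function(col) suppressWarnings(as.numeric(col)))
colMeans(per_col, na.rm = TRUE)
```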

Related

Data.Table R: error merging greater than four DTs with Reduce and merge

I have a list of 7 dts with the same dimensions (1 74 2) and column names ("variable" and "V1"), for example:
> head(dt1)
variable V1
1: A 4.213668
2: B 1474.040190
3: C 4.445173
4: D 76.960665
5: E 81.796707
6: F 215.312311
Based on the script by Michael Ohlrogge in Merging multiple data.tables, MergedDT = Reduce(function(...) merge(..., all = TRUE), List_of_DTs), I came up with the following:
list<-list(dt1, dt2, ....dt7)
merged<-Reduce(function(...) merge(..., by="variable", all=T, allow.cartesian=T),list)
This works with a list of 4 dts, but beyond 4 I get this error:
error in merge.data.table(..., by = "variable", no.dups = F, all = T, :
x has some duplicated column name(s): V1.k1,V1.k2. Please remove or rename the duplicate(s) and try again.
In addition: Warning message:
In merge.data.table(..., by = "variable", no.dups = F, all = T, :
column names 'V1.k1', 'V1.k2' are duplicated in the result
I'd appreciate any pointers to get around this issue.
If we need to change this in multiple datasets, load them into a list with mget. Here, we assume the dataset object names start with 'dt' followed by some digits. Renaming 'V1' to a different name in each dataset avoids the duplicated column names in the merge.
library(data.table)
lst1 <- mget(ls(pattern = '^dt\\d+$'))
lst1 <- Map(setnames, lst1, 'V1', paste0('V', seq_along(lst1)))
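A sketch of how the renaming and the merge could fit together, on five made-up tables (dts and merged are illustrative names, and the data is invented for the example). It assumes the only clash is the shared value column V1.

```r
library(data.table)

# Five made-up tables that all share the column names "variable" and "V1"
dts <- lapply(1:5, function(i) {
  data.table(variable = c("A", "B", "C"), V1 = i * c(1, 2, 3))
})

# Rename V1 to V1, V2, ... so no two tables share a value column name
dts <- Map(setnames, dts, "V1", paste0("V", seq_along(dts)))

# Full outer merge of all tables on the shared key column
merged <- Reduce(function(x, y) merge(x, y, by = "variable", all = TRUE), dts)
names(merged)
#> [1] "variable" "V1"       "V2"       "V3"       "V4"       "V5"
```

Because every value column is unique before merging, merge() never has to invent suffixes, which is where the duplicated names in the error came from.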

Select 1 column when DF has 2 similar column names in R

I have 2 problems. First, I have datasets with 2 column names that are similar. I want to select the first one and not use the second one. The numeric values in the column names are the serial number of the sensor and can vary and they can be in various columns.
How can I select the first column name of the 2 so I can plot it or use it in calculations?
How can I recover those long column names so I can use them? For example, how do I get "Depth_456" to use in depthmax2 without typing it in or making a subset named depth? The problem is the numeric value, which is the serial number of the sensor and changes from instrument to instrument and dataset to dataset. I am trying to write generic code that will work on all the different instruments.
My Data
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)
df2<-data.frame(sapply(df1, function(x) as.numeric(as.character(x))))
temp<- df2[grep("Temp", names(df2), value=TRUE)]
depth<- df2[grep("Depth", names(df2), value=TRUE)]
depthmax<- max(depth, na.rm = TRUE)
depthmax2<- max(df2$"Depth_456", na.rm = TRUE)
This doesn't work
depthmax2<- max(df2$grep("Depth", names(df2), value=TRUE), na.rm = TRUE)
We need [[ instead of $.
max(df2[[ grep("Depth", names(df2), value=TRUE)]], na.rm = TRUE)
#[1] 8
Or another option is startsWith
max(df2[[names(df2)[startsWith(names(df2), "Depth")]]], na.rm = TRUE)
#[1] 8
Also, max works on a vector. If there is more than one match, we may have to loop over the matching columns and get the max of each.
sapply(df2[ grep("Depth", names(df2), value=TRUE)], max, na.rm = TRUE)
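For the first part of the question (picking only the first of two similarly named columns), one possible sketch is to take the first element of the matched names. The names temp_cols and first_temp are illustrative, and anchoring the pattern with ^ is an assumption that the sensor type always comes first in the column name.

```r
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)

# All column names starting with "Temp", in their original order
temp_cols <- grep("^Temp", names(df1), value = TRUE)

# Keep only the first match; [[ ]] accepts the name stored in a variable
first_temp <- temp_cols[1]
first_temp
#> [1] "Temp_696"

max(df1[[first_temp]], na.rm = TRUE)
#> [1] 8
```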

Unite column names

I am trying to unite all columns of the dataframe df separating them with |.
However, for the name of the new column I would like to have all column names merged together separated the same way (eg S_n|S_s|S_b).
Here is what I tried; it fails with the error message Error: Must supply a symbol or a string as argument
S_n = c(2, 3, 5)
S_s = c("aa", "bb", "cc")
S_b = c(TRUE, FALSE, TRUE)
df = data.frame(S_n, S_s, S_b)
unite(df, S_n|S_s|S_b, sep="|", remove=TRUE)
Use unite(df, "S_n|S_s|S_b", sep="|", remove=TRUE). You need quotes around the column name because it's a non-standard (non-syntactic) name. (Standard names can't contain symbols other than . and _.)
One idea via base R can be,
df[paste(names(df), collapse = '|')] <- do.call(paste, c(df, sep = '|'))
# S_n S_s S_b S_n|S_s|S_b
#1 2 aa TRUE 2|aa|TRUE
#2 3 bb FALSE 3|bb|FALSE
#3 5 cc TRUE 5|cc|TRUE
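If the new column name should be built from whatever columns the data frame happens to have, a possible tidyr sketch is to construct the name as a string and unquote it with !!. This relies on unite() accepting an unquoted string for its col argument, which its documentation describes; new_name and out are illustrative names.

```r
library(tidyr)

df <- data.frame(S_n = c(2, 3, 5),
                 S_s = c("aa", "bb", "cc"),
                 S_b = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# Build the name "S_n|S_s|S_b" from the existing column names
new_name <- paste(names(df), collapse = "|")

# With no columns listed, unite() combines all of them
out <- unite(df, !!new_name, sep = "|", remove = TRUE)
names(out)
#> [1] "S_n|S_s|S_b"
```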

How to remove missing values (NA) when uniting columns?

I am trying to unite 5 columns into one new column using the unite function. However, all rows contain lots of NA values, creating values that look like
Mother|NA|NA|NA|NA
NA|NA|Father|Mother|NA
Mother|Father|NA|Stepmother|NA
I've tried to unite them using this code:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
But that gives me the following error:
Error: TRUE must evaluate to column positions or names, not a logical vector
I've also looked on the forum, and found that possibly the na.rm argument of unite is not active?
Here is some data to recreate my dataset
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', 'NA', 'Mother')
Parent2 <- c('NA', 'NA', 'Father')
Parent3 <- c('NA', 'Father', 'NA')
Parent4 <- c('NA', 'Mother', 'Stepmother')
Parent5 <- c('NA', 'NA', 'NA')
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5)
Would love to know how to unite my columns without NA's.
UPDATE:
I've now updated the tidyr package and I added "na = c("", "NA")" to my read_csv command.
Now the
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
command works; however, for some reason the NA at the end of the value stays. Now my columns look like this:
Mother|NA
Father|Mother|NA
Mother|Father|Stepmother|NA
Does anyone know what went wrong now?
You have a couple of problems:
1) the NAs are not real NAs (check is.na(df$Parent2))
2) your columns are factors
While constructing the dataframe use stringsAsFactors = FALSE
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4,
Parent5, stringsAsFactors = FALSE)
and then replace NA and use unite
library(dplyr)
df %>%
na_if('NA') %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)
# Name Postalcode Parent_full
#1 Paul 4732 Mother
#2 Edward 9045 Father|Mother
#3 Mary 3476 Mother|Father|Stepmother
If the data is already loaded, we can change them by using mutate_if
df %>%
mutate_if(is.factor, as.character) %>%
na_if('NA') %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)
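In newer dplyr releases (1.1.0 and later, as far as I know), na_if() is stricter about its input and no longer accepts a whole data frame. A sketch of an equivalent that applies na_if() column by column with across(), using the data from the question (out is an illustrative name):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Name       = c("Paul", "Edward", "Mary"),
  Postalcode = c("4732", "9045", "3476"),
  Parent     = c("Mother", "NA", "Mother"),
  Parent2    = c("NA", "NA", "Father"),
  Parent3    = c("NA", "Father", "NA"),
  Parent4    = c("NA", "Mother", "Stepmother"),
  Parent5    = c("NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

out <- df %>%
  # Turn the literal string "NA" into a real missing value, one column at a time
  mutate(across(Parent:Parent5, ~ na_if(.x, "NA"))) %>%
  unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)

out$Parent_full
#> [1] "Mother"                   "Father|Mother"            "Mother|Father|Stepmother"
```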
Your main problem here is that you haven't updated to tidyr 1.0 yet. That error message is the best that the previous version can do with the input na.rm = TRUE, since that argument didn't exist before. It thinks you're giving it a named argument as part of the ....
Specifically, just run install.packages("tidyr") and it should work. You might need to restart R first, so tidyr isn't currently loaded.
If your missing values are "NA" strings, then, as Ronak pointed out, you need to use na_if() on them first. It's strange to me because your initial code chunk makes it look like those are proper NAs, due to the red highlighting. But then your reprex code has 'NA' values which would definitely be strings. Anyway, you say you're reading in from CSV, so, it would be cleaner and quicker to run the CSV-reading code so as to read NAs in properly with an na argument or the like.
Response to Edit: That does seem like a bug, that NAs at the end of the united string don't get properly removed. Well, anyway, the fix is easy, and probably better than anything else we could do:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE) %>%
mutate_at("Parent_full", . %>%
str_remove("(^|\\|)NA$") %>%
na_if(""))
This ensures two things: 1) that the letters "NA" at the end of a string are only removed if they're there because of the unite(), with a pipe (if anything) in front of them; and 2) if there are no non-missing values on a line, then the value will be a proper NA rather than "NA", "", or what have you, which I assume is what you want.
Update: I've found that the bug applies to any column that contains nothing but NAs, i.e. na.rm = TRUE only removes NAs from columns that have at least one non-missing value. I've filed a bug report: https://github.com/tidyverse/tidyr/issues/765
Given this, though, the optimal solution is probably just to remove any columns that are all NA beforehand. If this is production code, though, then that gets real tricky, since you have to specify the unite() so as to not break if any or even all of the columns to be united are dropped by that prior step.
Update 2: As a response to the bug report pointed out, the issue is actually that that all-missing column is logicals. So that makes the optimal solution: read in such columns as character, or coerce them to character before uniting. Full reprex for that:
library(tidyverse)
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', NA, 'Mother')
Parent2 <- c(NA, NA, 'Father')
Parent3 <- c(NA, 'Father', NA)
Parent4 <- c(NA, 'Mother', 'Stepmother')
Parent5 <- c(NA, NA, NA)
(df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5))
#> Name Postalcode Parent Parent2 Parent3 Parent4 Parent5
#> 1 Paul 4732 Mother <NA> <NA> <NA> NA
#> 2 Edward 9045 <NA> <NA> Father Mother NA
#> 3 Mary 3476 Mother Father <NA> Stepmother NA
(df2 <- df %>%
mutate_at(vars(Parent:Parent5), as.character) %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE))
#> Name Postalcode Parent_full
#> 1 Paul 4732 Mother
#> 2 Edward 9045 Father|Mother
#> 3 Mary 3476 Mother|Father|Stepmother
Created on 2019-09-27 by the reprex package (v0.3.0)
unite() (and na.rm = TRUE) only works for character columns (as far as I can tell). This isn't made clear in the help docs.
For factors, it also returns the integer code rather than the factor level - something to watch out for.
Numeric: Doesn't remove NAs:
df <- data.frame("to.combine1" = c(NA, 1, 3),
"to.combine2" = c(2, NA, 3))
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "numeric" "numeric"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 NA_2
#> 2 1_NA
#> 3 3_3
Factor: Doesn't remove NAs and uses integer code rather than level:
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = TRUE)
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "factor" "factor"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 NA_1
#> 2 1_NA
#> 3 2_2
Character: Expected behaviour
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = FALSE)
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "character" "character"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 2
#> 2 1
#> 3 a_a
You can remove the NAs later with something like this
df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE) %>%
mutate(Parent_full = gsub("(?<![a-zA-Z])NA\\||\\|NA(?![a-zA-Z])|\\|NA$", '', Parent_full, perl = T))
Name Postalcode Parent_full
1 Paul 4732 Mother
2 Edward 9045 Father|Mother
3 Mary 3476 Mother|Father|Stepmother
It replaces NA| not preceded by a letter or |NA not followed by a letter or |NA at the end of the string, with an empty string
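To check the pattern in isolation, here is a small sketch that applies it directly to the united strings from the question, plus one made-up row where every value is missing (x and cleaned are illustrative names):

```r
x <- c("Mother|NA|NA|NA|NA",
       "NA|NA|Father|Mother|NA",
       "Mother|Father|NA|Stepmother|NA",
       "NA|NA|NA|NA|NA")   # made-up all-missing row, not in the question

pattern <- "(?<![a-zA-Z])NA\\||\\|NA(?![a-zA-Z])|\\|NA$"

# perl = TRUE is required for the lookbehind and lookahead assertions
cleaned <- gsub(pattern, "", x, perl = TRUE)
cleaned
#> [1] "Mother"                   "Father|Mother"
#> [3] "Mother|Father|Stepmother" "NA"
```

Note that the all-missing row ends up as the string "NA" rather than a real missing value, so it may need a separate step if such rows can occur.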

tidyr::separate() producing unexpected results

I am providing a data frame to tidyr::separate() and getting unexpected results. I have a minimal working example below where I show how I am using it, what I expect it to produce, and what it is actually producing. Why is this not working?
# Create toy data frame
dat <- data.frame(text = c("time_suffer|suffer_employ|suffer_sick"),
stringsAsFactors = FALSE)
# Separate variable into 3 columns a,b,c using | as a delimiter
dat %>% tidyr::separate(., col = "text", into = c("a","b","c"), sep = "|")
# What I'm expecting
data.frame(a = "time_suffer", b = "suffer_employ", c = "suffer_sick")
# What I'm actually getting:
data.frame(a = NA, b = "t", c = "1")
I am also getting the warning "Warning message: Expected 3 pieces. Additional pieces discarded in 1 rows [1]."
According to the documentation, the sep argument to separate is interpreted as a regular expression if it is a character (extremely useful if you have complicated separators). This does mean, however, that you need to escape characters with special meaning in regular expressions if you want to match on them literally. Use "\\|" as your separator:
library(tidyverse)
dat <- data.frame(text = c("time_suffer|suffer_employ|suffer_sick"),
stringsAsFactors = FALSE)
dat %>%
tidyr::separate(., col = "text", into = c("a","b","c"), sep = "\\|")
#> a b c
#> 1 time_suffer suffer_employ suffer_sick
Created on 2019-04-02 by the reprex package (v0.2.1)
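The same escaping rule can be seen in base R with strsplit(), which also treats its split argument as a regular expression by default. This is only a sketch of the idea and not part of the tidyr answer; the variable names are illustrative.

```r
x <- "time_suffer|suffer_employ|suffer_sick"

# Escape the pipe so the regex engine treats it as a literal character
escaped <- strsplit(x, "\\|")[[1]]
escaped
#> [1] "time_suffer"   "suffer_employ" "suffer_sick"

# A character class also makes the pipe literal
bracketed <- strsplit(x, "[|]")[[1]]

# Or switch regular expression interpretation off entirely
literal <- strsplit(x, "|", fixed = TRUE)[[1]]

identical(escaped, bracketed) && identical(escaped, literal)
#> [1] TRUE
```

The character-class form "[|]" should work as the sep argument of separate() as well, since it is an ordinary regular expression.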
