Removing " " (empty values) from a character vector of strings - r

I have been looking around for a few hours now and have not been able to remove the " " values from the character vector below.
c("Final", "A", "7.43", "8.50", "15.93", "2.00",
"1.00", "0.30", "0.37", " 7.43", " 8.50", "0.50", "0.67", " ",
" ", " ", " ", " ", " ", " ", "B", "7.00", "3.77", "10.77",
" 7.00", "1.67", "3.77", " ", " ", " ", " ", " ", " ", " ", " ",
I have many more of these empty values in this dataset and just want to get rid of them before organizing them as a data frame like
Final
A B
7.43 7.43
8.50 8.50
15.93 0.50
2.00 0.67
1.00
0.30
Thanks,

You can use the base grep with value = TRUE. That searches the character vector for a given regex pattern and returns all values where that pattern is found.
You can think about the logic of your pattern a couple ways. One might be to think of it as keeping values with a "word" character, which are letters, numbers, or underscores.
x <- c("Final", "A", "7.43", "8.50", "15.93", "2.00", "1.00", "0.30", "0.37", " 7.43", " 8.50", "0.50", "0.67", " ", " ", " ", " ", " ", " ", " ", "B", "7.00", "3.77", "10.77", " 7.00", "1.67", "3.77", " ", " ", " ", " ", " ", " ", " ", " ")
grep("\\w", x, value = TRUE)
#> [1] "Final" "A" "7.43" "8.50" "15.93" "2.00" "1.00" "0.30"
#> [9] "0.37" " 7.43" " 8.50" "0.50" "0.67" "B" "7.00" "3.77"
#> [17] "10.77" " 7.00" "1.67" "3.77"
Another way is to find values with a character that isn't a space (\\S is the negation of \\s):
grep("\\S", x, value = TRUE)
#> [1] "Final" "A" "7.43" "8.50" "15.93" "2.00" "1.00" "0.30"
#> [9] "0.37" " 7.43" " 8.50" "0.50" "0.67" "B" "7.00" "3.77"
#> [17] "10.77" " 7.00" "1.67" "3.77"
Created on 2018-12-10 by the reprex package (v0.2.1)
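If the goal is only to drop the blank entries, a regex is not strictly needed. Here is a minimal sketch, assuming the blanks consist of ordinary spaces; the shortened vector x is a made-up stand-in for the full data. trimws() strips surrounding whitespace and nzchar() flags the strings that are still non-empty afterwards.

```r
# Shortened stand-in for the vector in the question
# (assumption: the blank entries are made of plain spaces)
x <- c("Final", "A", "7.43", " 7.43", " ", " ", "B", "7.00", " ")

# Strip leading/trailing whitespace, then keep only the non-empty strings
trimmed <- trimws(x)
kept <- trimmed[nzchar(trimmed)]

print(kept)
```

Unlike the grep() calls above, this also removes the leading space from values such as " 7.43", which may or may not be what you want before converting to numbers.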

Related

How can I change columns with mutate(across()) due to a specific RegEx?

I have a problem with the mutate(across()) function.
In the tibble you can see below, I want to delete the "letter + underscores" (e.g. "p__", "c__" etc) in the columns.
A tibble: 2,477 x 4
Phylum Class Order Family
<chr> <chr> <chr> <chr>
1 " p__Proteobacteria" " c__Gammaproteobacter~ " o__Aeromonadales" " f__Aeromonadaceae"
2 " p__Bacteroidota" " c__Bacteroidia" " o__Bacteroidales" " f__Williamwhitmaniac~
3 " p__Fusobacteriota" " c__Fusobacteriia" " o__Fusobacterial~ " f__Leptotrichiaceae"
4 " p__Firmicutes" " c__Clostridia" " o__Clostridiales" " f__Clostridiaceae"
5 " p__Proteobacteria" " c__Gammaproteobacter~ " o__Enterobactera~ " f__Enterobacteriacea~
6 " p__Bacteroidota" " c__Bacteroidia" " o__Bacteroidales" " f__Williamwhitmaniac~
7 " p__Firmicutes" " c__Clostridia" " o__Lachnospirale~ " f__Lachnospiraceae"
8 " p__Bacteroidota" " c__Bacteroidia" " o__Cytophagales" " f__Spirosomaceae"
9 " p__Proteobacteria" " c__Gammaproteobacter~ " o__Burkholderial~ " f__Comamonadaceae"
10 " p__Actinobacteriot~ " c__Actinobacteria" " o__Frankiales" " f__Sporichthyaceae"
# ... with 2,467 more rows
A year ago I used the command
table <- table %>%
mutate_at(vars(Phylum, Class, Order, Family),funs(sub(pattern = "^([a-z])(_{2})", replacement = "", .)))
Now it tells me that the funs function is no longer supported, and the code does not work anymore.
Do you have some suggestions for me?
I thought about:
taxon <- c("Phylum", "Class", "Order", "Family")
table <- table %>%
mutate(across(taxon), gsub(pattern = "^([a-z])(_{2})", replacement = "", .))
But here I get the error:
Error: Invalid index: out of bounds
Thanks a lot :)
Kathrin
You can do:
library(dplyr)
taxon <- c("Phylum", "Class", "Order", "Family")
# all_of() selects the columns named in the character vector taxon
table <- table %>% mutate(across(all_of(taxon),
~gsub(pattern = "^([a-z])(_{2})", replacement = "", .)))
I don't have your data to confirm this, but there seems to be whitespace at the beginning of each string, which should be removed first; otherwise the ^ anchor in the pattern will not match.
table <- table %>% mutate(across(all_of(taxon),
~gsub(pattern = "^([a-z])(_{2})", replacement = "", trimws(.))))
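To check the idea without the real data, here is a small self-contained sketch. The two-row tibble tax is made up for illustration (it only mimics the leading space and the "p__" / "c__" prefixes shown in the question), and the simplified pattern "^[a-z]__" is my own shorthand for the pattern above.

```r
library(dplyr)

# Made-up sample that mimics the question's data (leading space + prefix)
tax <- tibble(
  Phylum = c(" p__Proteobacteria", " p__Firmicutes"),
  Class  = c(" c__Bacteroidia", " c__Clostridia")
)
taxon <- c("Phylum", "Class")

# Trim whitespace first so that ^ can match, then drop "<letter>__"
cleaned <- tax %>%
  mutate(across(all_of(taxon), ~ sub("^[a-z]__", "", trimws(.x))))

print(cleaned)
```

sub() is enough here because the prefix can occur only once at the start of each string; gsub() would give the same result.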

Remove whitespace in string using for loop

I learned about the stringr library and tried to use it to remove whitespace from a string, but I don't understand why it does not remove it when I index a string from a vector.
ex is a vector with strings similar to " 1950"
> library(stringr)
> str_replace_all(ex[1], fixed(" "), "")
[1] " 1950"
> str_replace_all(" 1950", fixed(" "), "")
[1] "1950"
> str(" 1950")
chr " 1950"
> str(ex[1])
chr " 1950"
I wanted to write a loop to remove the whitespaces, but I don't understand why stringr does not work when I use ex[1]
Following is dput(ex)
c(" 1950", " 1951", " 1952", " 1953", " 1954",
" 1955", " 1956", " 1957", " 1958", " 1959", " 1960",
" 1961", " 1962", " 1963", " 1964", " 1965", " 1966",
" 1967", " 1968", " 1969", " 1970", " 1971", " 1972",
" 1973", " 1974", " 1975", " 1976", " 1977", " 1978",
" 1979", " 1980", " 1981", " 1982", " 1983", " 1984",
" 1985", " 1986", " 1987", " 1988", " 1989", " 1990",
" 1991", " 1992", " 1993", " 1994", " 1995", " 1996",
" 1997", " 1998", " 1999", " 2000", " 2001", " 2002",
" 2003", " 2004", " 2005", " 2006", " 2007", " 2008",
" 2009", " 2010", " 2011", " 2012", " 2013", " 2014",
" 2015", " 2016", " 2017", " ", "Provided by : All China Marketing Resarch"
)
What library can I use in a for loop to remove whitespaces?
It is easier with trimws from base R:
trimws(ex)
The OP's issue is not reproducible with str_replace_all:
stringr::str_replace_all(ex[1], stringr::fixed(" "), "")
#[1] "1950"
str_trim from library(stringr) would also work, i.e.
trimmed <- str_trim(your_string, side = "left")
For a right trim, set side to "right"; for both sides, use "both" (the default). You also wouldn't need a loop in this case: a vector of strings can be passed directly to str_trim.
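As a quick illustration that no loop is needed, here is a minimal base R sketch on a shortened copy of ex (only four of the values from the dput output, assuming they contain ordinary spaces). trimws() is vectorised, and its which argument plays the same role as side in str_trim.

```r
# Shortened copy of the vector from the question
ex <- c(" 1950", " 1951", " ", "Provided by : All China Marketing Resarch")

# Trim both sides of every element in one call, no loop required
trimmed <- trimws(ex)

# Trim only the left side (stringr equivalent: str_trim(ex, side = "left"))
left_only <- trimws(ex, which = "left")

print(trimmed)
```

Note that the all-blank element becomes an empty string rather than disappearing, and spaces inside a string are left alone.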

Convert character matrix column to numeric matrix

I would like to draw a heatmap. I converted the data frame to a matrix. The first column of the matrix contains 51 state names as character strings, so when I run the heatmap I get an error ('x' must be numeric). If I convert the matrix to numeric, the state names are replaced by the numbers 1 to 51. I would like help converting the matrix to numeric without changing the values in that column.
I get the following error:
> heatmap.2(matrix)
Error in heatmap.2(matrix) : `x' must be a numeric matrix
dput(matrix[1:20,1:5])
structure(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE",
"FL", "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
" 156023.01", " 934292.20", " 565543.16", " 859246.77", "1802826.03",
" 236048.04", " 277419.16", " 44170.06", " 364245.19", "3059883.80",
"1032052.28", " 49148.00", " 484355.76", " 103032.97", "1501399.16",
"1098716.37", " 536964.81", " 714912.96", " 930454.92", "1006184.61",
NA, " 647281.97", " 243467.03", " 222016.05", "1955376.54", " 284157.80",
" 546510.14", " 310209.01", " 238855.76", "3055374.94", " 620487.04",
" 52286.08", " 183689.95", " 101198.95", "2299302.42", " 682522.43",
" 203429.06", " 566182.29", " 434137.97", "1269701.60", " 279984.88",
" 1785117.72", " 1210217.08", " 1738388.11", "12313826.52", " 1033786.31",
" 1905870.34", " 1589936.20", " 1177198.27", " 7379680.11", " 3182089.09",
" 539865.15", " 907408.47", " 706547.91", " 5616722.28", " 2793763.32",
" 751262.24", " 2620593.80", " 3327343.31", " 3423941.61", " 277346.4",
" 3231424.9", " 1784411.7", " 2539940.3", "13107647.6", " 1623508.4",
" 2475804.7", " 1382151.2", " 1362240.3", "10431341.9", " 4514651.7",
" 1081821.1", " 1653629.7", " 594605.5", " 9147134.3", " 4121661.9",
" 1292330.2", " 3252592.8", " 3360762.2", " 4269284.1"), .Dim = c(20L,
5L), .Dimnames = list(NULL, c("Provider.State", "039 ", "057 ",
"064 ", "065 ")))
(I named it m so that I don't override the matrix function.)
First, your first column is an identifier. I'm going to infer that the state codes have meaning, so I'll keep them around as row names, but that doesn't change the outcome.
head(m)
# Provider.State 039 057 064 065
# [1,] "AK" " 156023.01" NA " 279984.88" " 277346.4"
# [2,] "AL" " 934292.20" " 647281.97" " 1785117.72" " 3231424.9"
# [3,] "AR" " 565543.16" " 243467.03" " 1210217.08" " 1784411.7"
# [4,] "AZ" " 859246.77" " 222016.05" " 1738388.11" " 2539940.3"
# [5,] "CA" "1802826.03" "1955376.54" "12313826.52" "13107647.6"
# [6,] "CO" " 236048.04" " 284157.80" " 1033786.31" " 1623508.4"
rn <- m[,1]
m <- m[,-1]
rn
# [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA"
head(m)
# 039 057 064 065
# [1,] " 156023.01" NA " 279984.88" " 277346.4"
# [2,] " 934292.20" " 647281.97" " 1785117.72" " 3231424.9"
# [3,] " 565543.16" " 243467.03" " 1210217.08" " 1784411.7"
# [4,] " 859246.77" " 222016.05" " 1738388.11" " 2539940.3"
# [5,] "1802826.03" "1955376.54" "12313826.52" "13107647.6"
# [6,] " 236048.04" " 284157.80" " 1033786.31" " 1623508.4"
(We'll use rn in a minute.) Now we need to convert everything to numbers.
m <- apply(m, 2, as.numeric)
rownames(m) <- rn
head(m)
# 039 057 064 065
# AK 156023.0 NA 279984.9 277346.4
# AL 934292.2 647282.0 1785117.7 3231424.9
# AR 565543.2 243467.0 1210217.1 1784411.7
# AZ 859246.8 222016.0 1738388.1 2539940.3
# CA 1802826.0 1955376.5 12313826.5 13107647.6
# CO 236048.0 284157.8 1033786.3 1623508.4
And now the heatmap works.
heatmap(m)
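The same steps can be sketched on a two-row excerpt of the data. This variant is an alternative, not the method used above: it changes the storage mode of the matrix in place instead of calling apply(), which keeps the dimensions and dimnames intact. The object names m and num are just for illustration.

```r
# Two rows taken from the dput output in the question
m <- matrix(
  c("AK", "AL",
    " 156023.01", " 934292.20",
    NA, " 647281.97"),
  nrow = 2,
  dimnames = list(NULL, c("Provider.State", "039", "057"))
)

# Move the state codes into the row names and drop that column
num <- m[, -1, drop = FALSE]
rownames(num) <- m[, "Provider.State"]

# Convert character to double in place; dim and dimnames are preserved
storage.mode(num) <- "double"

print(num)
```

Leading spaces in the strings are tolerated by the conversion, and the missing value stays NA.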
It can also be done with the purrr package.
Try the following:
library(purrr)
df<-df %>%
map_if(is.factor,as.character) %>%
as.matrix

Can't convert factor to numeric in R

I've tried:
i <- as.numeric(as.character(Impress))
i <- as.numeric(as.character(levels(Impress)))
i <- as.numeric(paste(Impress))
I always get:
Warning message:
NAs introduced by coercion
> i
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
This is the data I want to be numeric:
> Impress
[1] 24,085,563.00 35,962,587.00 31,714,513.00 28,206,422.00 40,161,010.00 36,292,929.00 31,545,482.00
[8] 28,213,878.00 35,799,224.00 32,400,885.00 28,496,459.00 37,456,344.00 38,108,667.00 33,407,771.00
[15] 32,540,479.00 30,692,707.00 22,873,000.00 21,329,146.00 28,921,953.00 30,471,519.00 28,601,289.00
[22] 27,450,630.00 26,708,790.00 19,825,041.00 18,844,169.00 29,592,039.00 31,012,594.00 28,792,531.00
[29] 28,578,028.00 24,913,985.00
30 Levels: 18,844,169.00 19,825,041.00 21,329,146.00 22,873,000.00 24,085,563.00 24,913,985.00 ... 40,161,010.00
> paste(Impress)
[1] " 24,085,563.00 " " 35,962,587.00 " " 31,714,513.00 " " 28,206,422.00 " " 40,161,010.00 " " 36,292,929.00 " " 31,545,482.00 "
[8] " 28,213,878.00 " " 35,799,224.00 " " 32,400,885.00 " " 28,496,459.00 " " 37,456,344.00 " " 38,108,667.00 " " 33,407,771.00 "
[15] " 32,540,479.00 " " 30,692,707.00 " " 22,873,000.00 " " 21,329,146.00 " " 28,921,953.00 " " 30,471,519.00 " " 28,601,289.00 "
[22] " 27,450,630.00 " " 26,708,790.00 " " 19,825,041.00 " " 18,844,169.00 " " 29,592,039.00 " " 31,012,594.00 " " 28,792,531.00 "
[29] " 28,578,028.00 " " 24,913,985.00 "
and when I do i <- as.numeric(Impress), it returns the wrong values.
Thanks!
As far as the computer is concerned, a comma is not part of a number, so any string containing one cannot be converted to numeric, even if to a human these look like perfectly acceptable numbers.
Get rid of the commas and then it will work, e.g. using gsub():
i <- as.numeric(gsub(",", "", as.character(Impress)))
E.g.
Impress <- c("24,085,563.00", "35,962,587.00", "31,714,513.00", "28,206,422.00")
gsub(",", "", as.character(Impress))
i <- as.numeric(gsub(",", "", as.character(Impress)))
i
R> gsub(",", "", as.character(Impress))
[1] "24085563.00" "35962587.00" "31714513.00" "28206422.00"
R> i
[1] 24085563 35962587 31714513 28206422
R> is.numeric(i)
[1] TRUE
Because the data has commas, R cannot convert it to numeric. You have to remove the commas with gsub() first and then convert:
i <- as.numeric(gsub(",", "", as.character(Impress)))
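To see why as.numeric(Impress) gives the "wrong values", here is a small sketch on a made-up three-element factor (the values are taken from the question, including the surrounding spaces that paste() revealed). Calling as.numeric() directly on a factor returns the internal level codes, not the numbers the labels spell out.

```r
# Small factor built from three of the values in the question
Impress <- factor(c(" 24,085,563.00 ", " 35,962,587.00 ", " 18,844,169.00 "))

# Directly converting a factor yields its integer level codes
codes <- as.numeric(Impress)

# Convert labels to text, drop the commas, then convert to numeric
values <- as.numeric(gsub(",", "", as.character(Impress), fixed = TRUE))

print(codes)
print(values)
```

The surrounding spaces do not need separate handling here, since as.numeric() ignores leading and trailing whitespace.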
