I have a column containing a mixture of numbers, text, and NA values. I just want to extract the numeric values from col2.
col1 <- c('t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9', 't10')
col2 <- c(300, '>200m', NA, 'result 50 mg/g', NA, 'Not data', 'pending', NA, 'positive', 'data >20 mile/h')
df <- data.frame(col1, col2)
My intention is:
All numbers will remain numeric
NA values will remain NA
Pure character/text values will be converted to NA
Extract the number if it is mixed with text (e.g., 'data >20 mile/h' becomes 20)
The expected output (col3) will be like this:
col3 <- c(300, 200, NA, 50, NA, NA, NA, NA, NA, 20)
df2 <- data.frame(col1, col3)
Using str_extract from stringr to extract the numbers.
library(stringr)
cbind(df, col3 = as.numeric(str_extract(df$col2, "[\\d+.]+")))
col1 col2 col3
1 t1 300 300.0
2 t2 >200m 200.0
3 t3 wef3.2 wef 3.2
4 t4 result 50 mg/g 50.0
5 t5 <NA> NA
6 t6 Not data NA
7 t7 pending NA
8 t8 <NA> NA
9 t9 positive NA
10 t10 data >20 mile/h 20.0
Using gsub, removing everything but numbers.
cbind(df, col3 = as.numeric(
gsub("([.-])|[[:alpha:][:punct:] ]", "\\1", df$col2)))
col1 col2 col3
1 t1 300 300.0
2 t2 >200m 200.0
3 t3 wef3.2 wef 3.2
4 t4 result 50 mg/g 50.0
5 t5 <NA> NA
6 t6 Not data NA
7 t7 pending NA
8 t8 <NA> NA
9 t9 positive NA
10 t10 data >20 mile/h 20.0
Or use \\D (non-digits) instead of [:alpha:] and [:punct:] (thanks to #thelatemail and #onyambu)!
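The \\D variant mentioned above could look roughly like the sketch below. The vector x is made up for illustration (it mirrors the Data section), and the pattern assumes each string holds at most one number, because every surviving digit, dot, and minus sign is glued together into a single string.

```r
# Hypothetical sample mirroring col2 from the Data section
x <- c("300", ">200m", "wef3.2 wef", "result 50 mg/g", NA, "pending")

# The capture group keeps '.' and '-'; every other non-digit is dropped
col3 <- as.numeric(gsub("([.-])|\\D", "\\1", x))
col3
```

Strings with no digits collapse to "" and become NA in as.numeric(), which is what the question asks for.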
Data
df <- structure(list(col1 = c("t1", "t2", "t3", "t4", "t5", "t6", "t7",
"t8", "t9", "t10"), col2 = c("300", ">200m", "wef3.2 wef", "result 50 mg/g",
NA, "Not data", "pending", NA, "positive", "data >20 mile/h")),
row.names = c(NA, -10L), class = "data.frame")
One potential option is to use parse_number() from the readr package, e.g.
library(readr)
col1 <- c('t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9', 't10')
col2 <- c(300, '>200m', NA, 'result 50 mg/g', NA, 'Not data', 'pending', NA, 'positive', 'data >20 mile/h')
df <- data.frame(col1, col2)
df$col3 <- parse_number(df$col2)
#> Warning: 3 parsing failures.
#> row col expected actual
#> 6 -- a number Not data
#> 7 -- a number pending
#> 9 -- a number positive
df
#> col1 col2 col3
#> 1 t1 300 300
#> 2 t2 >200m 200
#> 3 t3 <NA> NA
#> 4 t4 result 50 mg/g 50
#> 5 t5 <NA> NA
#> 6 t6 Not data NA
#> 7 t7 pending NA
#> 8 t8 <NA> NA
#> 9 t9 positive NA
#> 10 t10 data >20 mile/h 20
Created on 2023-02-07 with reprex v2.0.2
Related
I would like to replace values in df1 based on two vectors of column names (g1.vec, g2.vec) and the column group.
If a column of df1 is listed in g1.vec and group == 1, replace the value with Y.
If a column of df1 is listed in g2.vec and group == 2, replace the value with X.
Here is part of an example:
> df1
ID group col1 col2 ...... col154
1 AMM115 2 C A ...... A+
2 ADM107 1 NA NA ...... B
3 AGM041 2 B C ...... C+
4 AGM132 1 A NA ...... A+
5 AQM007 1 NA A ...... B+
6 ARM028 2 NA B+ ...... A-
7 ASM019 1 A A+ ...... NA
8 AHM172 NA A A+ ...... NA
> vec
g1.vec <- c("col1", "col3", "col18", "col20", "col28", "col75", "col77", "col86", "col111")
g2.vec <- c("col2", "col13", "col37", "co38", "co44", "co87", "col123", "col41", "col154")
the output would look like:
> df2
ID group col1 col2 ...... col154
1 AMM115 2 C X ...... X
2 ADM107 1 Y NA ...... B
3 AGM041 2 B X ...... X
4 AGM132 1 Y NA ...... A+
5 AQM007 1 Y A ...... B+
6 ARM028 2 NA X ...... X
7 ASM019 1 Y A+ ...... NA
8 AHM172 NA A A+ ...... NA
I have tried mutate with ifelse, but I do not know how to mutate multiple columns at once to complete this.
Data
df1 <- structure(list(ID = c("AMM115", "ADM107", "AGM041", "AGM132",
"AQM007", "ARM028", "ASM019", "AHM172"), group = c(2L, 1L, 2L,
1L, 1L, 2L, 1L, NA), col1 = c("C", NA, "B", "A", NA, NA, "A",
"A"), col2 = c("A", NA, "C", NA, "A", "B+", "A+", "A+"), col154 = c("A+",
"B", "C+", "A+", "B+", "A-", NA, NA)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
Base R solution -
df1[df1$group %in% 1, g1.vec] <- 'Y'
df1[df1$group %in% 2, g2.vec] <- 'X'
df1
# ID group col1 col2 col154
#1 AMM115 2 C X X
#2 ADM107 1 Y <NA> B
#3 AGM041 2 B X X
#4 AGM132 1 Y <NA> A+
#5 AQM007 1 Y A B+
#6 ARM028 2 <NA> X X
#7 ASM019 1 Y A+ <NA>
#8 AHM172 NA A A+ <NA>
I have used %in% instead of == here to handle the NA values.
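To make the NA point concrete, here is a small sketch. The toy data frame d and the vector wanted are hypothetical, and the intersect() guard is an addition that is not part of the answer above; it restricts the assignment to columns that actually exist, which matters when the name vectors list columns that are absent from the data.

```r
group <- c(2L, 1L, NA)

group == 1      # FALSE  TRUE    NA
group %in% 1    # FALSE  TRUE FALSE

# Toy data frame (hypothetical) to show the effect on row selection
d <- data.frame(group = group, col1 = c("C", NA, "A"),
                stringsAsFactors = FALSE)

wanted  <- c("col1", "col3")              # "col3" is absent from d
present <- intersect(wanted, names(d))    # keep only existing columns

d[d$group %in% 1, present] <- "Y"
d
```

With ==, the NA in the row index would be passed on to the assignment, which data frames reject; %in% always returns TRUE or FALSE, so the row with a missing group is simply left alone.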
You may try
library(dplyr)
df1 %>%
mutate_at(g1.vec, list(~ifelse(group == 1 & !is.na(group), 'Y', .))) %>%
mutate_at(g2.vec, list(~ifelse(group == 2 & !is.na(group), 'X', .)))
Using across and case_match from dplyr
library(dplyr)# version 1.1.0
df1 %>%
mutate(across(any_of(g1.vec),~ case_match(group, 1 ~ 'Y', .default = .x)),
across(any_of(g2.vec), ~ case_match(group, 2 ~ 'X', .default = .x)))
-output
ID group col1 col2 col154
1 AMM115 2 C X X
2 ADM107 1 Y <NA> B
3 AGM041 2 B X X
4 AGM132 1 Y <NA> A+
5 AQM007 1 Y A B+
6 ARM028 2 <NA> X X
7 ASM019 1 Y A+ <NA>
8 AHM172 NA A A+ <NA>
I need to recode some data. Firstly, imagine that the original data looks something like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
s1 414234 244575 539645 436236
s2 NA 512342 644252 835325
s3 NA NA 816747 475295
s4 NA NA NA 125429
s5 NA NA NA NA
s6 617465 844526 NA 194262
which, secondly, is transformed into
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 4 2 5 4
s2 NA 5 6 8
s3 NA NA 8 4
s4 NA NA NA 1
s5 NA NA NA NA
s6 6 8 NA 1
because I am going to recode everything according to the first digit. When, thirdly, recoded (see recoding pattern in MWE below) it should look like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s5 NA NA NA NA
s6 4 5 NA 1
and, fourthly, entire rows should be removed if all columns except the first one are empty, that is
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s6 4 5 NA 1
which is the ultimate data.
The first and second steps were easily implemented, but I struggle with the third and fourth steps since I am new to R (see MWE below). For the third step, I tried to use mutate over multiple columns, but this error appeared: Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')". The fourth step is easily implemented in Python with thresh, but I am not sure if there is an equivalent in R.
How is this possible? Also, I work with huge data, so time-efficient solutions would also be highly appreciated.
library(dplyr)
df <- data.frame(
col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
col2 = c("414234", NA, NA, NA, NA, "617465"),
col3 = c("244575", "512342", NA, NA, NA, "844526"),
col4 = c("539645", "644252", "816747", NA, NA, NA),
col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
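For the fourth step, the thresh idea from pandas (keep a row only if it has at least a given number of non-missing values) can be approximated in base R with rowSums(). This is a minimal sketch: keep_rows_with is a hypothetical helper name and toy is made-up data, not the data from the question.

```r
# Hypothetical helper: keep rows with at least `thresh` non-NA values in `cols`
keep_rows_with <- function(df, cols, thresh = 1L) {
  df[rowSums(!is.na(df[cols])) >= thresh, , drop = FALSE]
}

# Made-up data for illustration
toy <- data.frame(col1 = c("s1", "s5", "s6"),
                  col2 = c(3, NA, 4),
                  col3 = c(1, NA, NA),
                  stringsAsFactors = FALSE)

kept <- keep_rows_with(toy, c("col2", "col3"), thresh = 1L)
kept
```

rowSums() works on the whole block of columns at once, so there is no per-row loop, which should help on large data.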
Base R way:
# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
# V1 V2 V3 V4 V5
# 1 s1 3 1 3 3
# 2 s2 NA 3 4 5
# 3 s3 NA NA 5 3
# 4 s4 NA NA NA 1
# 6 s6 4 5 NA 1
As you can see, it is not necessary to do the operations column-wise as they can be performed en bloc, which will be more efficient.
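The recode line works as a lookup table: position i + 1 of the vector holds the new code for digit i, so indexing with digit + 1 recodes everything in one step. A small sketch on a plain vector (the digits vector is made up for illustration):

```r
# Recode table from the question: position i + 1 holds the code for digit i
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)   # covers digits 0..8

digits  <- c(4, 2, 5, NA, 8, 0)
recoded <- lookup[digits + 1]
recoded
```

NA stays NA, and a digit without an entry in the table (9 here) also comes back as NA because the index runs past the end of the vector.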
You can do this with a combination of tidyverse packages. We generally avoid for loops in R unless we really need them. It's almost always preferable to vectorise.
library(dplyr)
library(stringr) # for str_sub
library(purrr) # for negate
mat = matrix(c( "s1", "s2", "s3", "s4", "s5", "s6",
"414234", NA, NA, NA, NA, "617465",
"244575", "512342", NA, NA, NA, "844526",
"539645", "644252", "816747", NA, NA, NA,
"436236", "835325", "475295", "125429", NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, \(x) str_sub(x, 1, 1)))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 4 2 5 4
#> 2 s2 <NA> 5 6 8
#> 3 s3 <NA> <NA> 8 4
#> 4 s4 <NA> <NA> <NA> 1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6 6 8 <NA> 1
## Step 3: Recode
df <- mutate(df,
across(V2:V5,
\(x) recode(x,
`0` = "1", `1` = "1", `2` = "1", `3` = "2",
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
)))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s5 NA NA NA NA
#> 6 s6 4 5 NA 1
## Step 4: keep only rows where at least one value is not missing
## By purrr::negate()-ing is.na, we select only rows where
## at least one value in V2:V5 is not NA
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s6 4 5 NA 1
Created on 2022-12-13 with reprex v2.0.2
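This one uses fancy math (it needs dplyr, tidyr, and purrr, and starts from the original df in the question)
library(dplyr)
library(tidyr)
library(purrr)
df |>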
pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
mutate(val = map_dbl(as.integer(val),
~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) +1])) |>
filter(!is.na(val)) |>
pivot_wider(values_from = val, names_from = col )
##> # A tibble: 5 × 5
##> col1 col2 col3 col4 col5
##> <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1 3 1 3 3
##> 2 s2 NA 3 4 5
##> 3 s3 NA NA 5 3
##> 4 s4 NA NA NA 1
##> 5 s6 4 5 NA 1
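The arithmetic in that pipeline takes the leading digit by integer-dividing by the largest power of ten that does not exceed the number. A standalone sketch, assuming strictly positive values (first_digit is a hypothetical helper name, and zero or negative input is not handled):

```r
# Leading digit of a positive number via integer division by a power of ten
first_digit <- function(x) x %/% 10^trunc(log10(x))

vals   <- c(414234, 844526, 125429, NA)
digits <- first_digit(vals)
digits                                      # 4 8 1 NA

# Same lookup as in the pipeline: index digit + 1 into the recode table
c(1, 1, 1, 2, 3, 3, 4, 5, 5)[digits + 1]    # 3 5 1 NA
```

Because first_digit() is already vectorised, it could be applied to a whole column directly rather than element by element.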
Hello coding community,
If my data frame looks like:
ID Col1 Col2 Col3 Col4
Per1 1 2 3 4
Per2 2 NA NA NA
Per3 NA NA 5 NA
Is there any syntax to delete the row associated with ID = Per2, on the basis that Col2, Col3, AND Col4 = NA? I am hoping for code that will allow me to delete a row on the basis that three specific columns (Col2, Col3, and Col4) ALL are NA. This code would NOT delete the row ID = Per3, even though there are three NAs.
Please note that I know how to delete a specific row, but my data frame is big so I do not want to manually sort through all rows/columns.
Big thanks!
Test for NA and delete rows with a number of NA's equal to the number of columns tested using rowSums.
dat[!rowSums(is.na(dat[c('Col2', 'Col3', 'Col4')])) == 3, ]
# ID Col1 Col2 Col3 Col4
# 1 Per1 1 2 3 4
# 3 Per3 NA NA 5 NA
You can use if_all
library(dplyr)
filter(df, !if_all(c(Col2, Col3, Col4), ~ is.na(.)))
# ID Col1 Col2 Col3 Col4
# 1 Per1 1 2 3 4
# 2 Per3 NA NA 5 NA
data
df <- structure(list(ID = c("Per1", "Per2", "Per3"), Col1 = c(1L, 2L,
NA), Col2 = c(2L, NA, NA), Col3 = c(3L, NA, 5L), Col4 = c(4L,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Using if_any
library(dplyr)
df %>%
filter(if_any(Col2:Col4, complete.cases))
ID Col1 Col2 Col3 Col4
1 Per1 1 2 3 4
2 Per3 NA NA 5 NA
Suppose I have a dataframe with several types of columns (character, numeric, ID, time,etc.). I'll provide a simple example as follows:
m <- data.frame(LETTERS[1:10], LETTERS[15:24],runif(10),runif(10),runif(10),runif(10),runif(10))
x<-c("Col1","Col2","Col3","Col4","Col5","Col6","Col7")
colnames(m)<-x
m<-as.data.frame(lapply(m, function(x) x[ sample(c(TRUE, NA), prob = c(0.75, 0.25), size = length(x), replace = TRUE) ]))
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 NA 0.07054851 0.65521042
3 C <NA> 0.75798665 NA 0.04483692 0.54671014 NA
4 D R 0.96825047 0.01875140 0.07383107 NA 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 NA 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 NA NA 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 NA 0.72085080 NA
10 J X 0.39093317 0.97107464 NA 0.86417719 0.39890170
For Col3-Col7, if a row has fewer than 3 NAs, I want to replace them with the row minimum from Col3-Col7; otherwise keep the NAs. So, I'd want the dataset to look as follows:
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 0.07054851 0.07054851 0.65521042
3 C <NA> 0.75798665 0.04483692 0.04483692 0.54671014 0.04483692
4 D R 0.96825047 0.01875140 0.07383107 0.01875140 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 0.04181401 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 0.69548691 0.69548691 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 0.63233902 0.72085080 0.63233902
10 J X 0.39093317 0.97107464 0.39093317 0.86417719 0.39890170
So every row except row 6 had its missing values imputed with the minimum value in that row across columns 3-7.
In my actual dataset, for every row between columns 18:27, if there are fewer than 4 NAs, replace them with the row minimum for columns 18:27; otherwise keep all the NAs.
I've tried using the dplyr pipes/mutate/replace method, but I'm not sure how to do it for a subset of columns (I'm under the impression you can only specify one column with mutate/replace). Some of the logic I've tried including in the if statement includes
rowSums(is.na(.[18:27]))<4 & rowSums(is.na(.[18:27]))>0
I've seen the rowMins function in the matrixStats package, but I'm just wondering if I can do this with dplyr/dataframe and not matrices.
I would suggest a tidyverse approach where you reshape the data, group by Col1 and Col2, and then rebuild the data. As we will use pipes, we can create the helper variables with mutate(): Flag counts the NAs in each row and MinVal holds the row minimum, and both are then used to evaluate the condition you want. Note that this relies on each Col1/Col2 combination identifying a single row. Here is the code:
library(tidyverse)
#Data
m <- structure(list(Col1 = c("A", "B", "C", "D", "<NA>", "F", "G",
"<NA>", "I", "J"), Col2 = c("O", "P", "<NA>", "R", "S", "<NA>",
"U", "<NA>", "W", "X"), Col3 = c(0.09929126, 0.50314123, 0.75798665,
0.96825047, 0.47079716, NA, 0.71947656, 0.90518907, 0.79208514,
0.39093317), Col4 = c(0.40435352, 0.81725456, NA, 0.0187514,
0.04181401, NA, NA, 0.20661633, 0.63233902, 0.97107464), Col5 = c(0.1536083,
NA, 0.04483692, 0.07383107, 0.21423046, NA, NA, 0.65788523, NA,
NA), Col6 = c(0.038304, 0.07054851, 0.54671014, NA, NA, 0.33702657,
0.99142181, 0.0553433, 0.7208508, 0.86417719), Col7 = c(0.80157985,
0.65521042, NA, 0.04498563, 0.55493444, 0.5498926, 0.69548691,
0.78420756, NA, 0.3989017)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
The code:
#Reshape
m %>% pivot_longer(cols = -c(Col1,Col2)) %>%
group_by(Col1,Col2) %>% mutate(MinVal=min(value,na.rm=T),
Flag=sum(is.na(value))) %>% ungroup() %>%
mutate(value=ifelse(is.na(value) & Flag<3,MinVal,value)) %>%
select(-c(MinVal,Flag)) %>%
pivot_wider(names_from = name,values_from=value)
Output:
# A tibble: 10 x 7
Col1 Col2 Col3 Col4 Col5 Col6 Col7
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A O 0.0993 0.404 0.154 0.0383 0.802
2 B P 0.503 0.817 0.0705 0.0705 0.655
3 C <NA> 0.758 0.0448 0.0448 0.547 0.0448
4 D R 0.968 0.0188 0.0738 0.0188 0.0450
5 <NA> S 0.471 0.0418 0.214 0.0418 0.555
6 F <NA> NA NA NA 0.337 0.550
7 G U 0.719 0.695 0.695 0.991 0.695
8 <NA> <NA> 0.905 0.207 0.658 0.0553 0.784
9 I W 0.792 0.632 0.632 0.721 0.632
10 J X 0.391 0.971 0.391 0.864 0.399
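If the key columns are not unique, or reshaping is too costly, the same rule can be sketched without pivoting. This is an alternative to the answer above, not its method: fill_row_min is a hypothetical helper, toy is made-up data, and the row minimum comes from pmin() applied across the chosen columns.

```r
# Hypothetical helper: fill NAs with the row minimum when a row has
# fewer than `max_na` missing values in `cols`
fill_row_min <- function(df, cols, max_na = 3L) {
  n_na <- rowSums(is.na(df[cols]))
  # pmin() over the columns gives the row-wise minimum, ignoring NAs
  row_min <- do.call(pmin, c(unname(as.list(df[cols])), na.rm = TRUE))
  for (cl in cols) {
    fill <- is.na(df[[cl]]) & n_na < max_na
    df[[cl]][fill] <- row_min[fill]
  }
  df
}

# Made-up data: row 1 has 1 NA, row 2 has 3 NAs, row 3 has 2 NAs
toy <- data.frame(id   = c("a", "b", "c"),
                  Col3 = c(1, NA, NA),
                  Col4 = c(NA, NA, 2),
                  Col5 = c(3, NA, 5),
                  Col6 = c(0.5, 4, NA),
                  Col7 = c(2, 1, 7),
                  stringsAsFactors = FALSE)

out <- fill_row_min(toy, paste0("Col", 3:7))
out
```

For the real data set described in the question, the call would presumably be along the lines of fill_row_min(m, names(m)[18:27], max_na = 4L).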
I have a data frame such as this (but of size 16 Billion):
structure(list(id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4), id2 = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), b1 = c(NA, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L), b2 = c(1, NA, NA, NA, NA, NA,
1, 1, 1, 1), b3 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1), b4 = c(NA,
NA, 1, NA, NA, NA, NA, NA, 1, 1)), .Names = c("id1", "id2", "b1",
"b2", "b3", "b4"), row.names = c(NA, 10L), class = "data.frame")
df
id1 id2 b1 b2 b3 b4
1 1 a NA 1 NA NA
2 2 b NA NA 1 NA
3 3 c NA NA NA 1
4 4 d 1 NA NA NA
5 4 e 1 NA NA NA
6 4 f 1 NA NA NA
7 4 g 1 1 NA NA
8 4 h 1 1 NA NA
9 4 i 1 1 1 1
10 4 j 1 1 1 1
I need to get it into long format while keeping ONLY values of 1. Of course, I tried using gather from tidyr and also melt from data.table, to no avail, as their memory requirements are explosive. My original data had zeros and ones, but I filled the zeros with NA and hoped the na.rm = TRUE option would help with the memory issue. But it does not.
With just the ones retained and lengthened, my data frame will easily fit in the memory I have.
Is there a better way to get at this than the standard methods? Reasonable extra compute as a tradeoff for a better memory fit is acceptable.
My desired output is the equivalent of:
library(dplyr)
library(tidyr)
df %>% gather(b, value, -id1, -id2, na.rm = TRUE)
id1 id2 b value
1 4 d b1 1
2 4 e b1 1
3 4 f b1 1
4 4 g b1 1
5 4 h b1 1
6 4 i b1 1
7 4 j b1 1
8 1 a b2 1
9 4 g b2 1
10 4 h b2 1
11 4 i b2 1
12 4 j b2 1
13 2 b b3 1
14 4 i b3 1
15 4 j b3 1
16 3 c b4 1
17 4 i b4 1
18 4 j b4 1
# or
reshape2::melt(df, id=c("id1","id2"), na.rm=TRUE)
# or
library(data.table)
melt(setDT(df), id=c("id1","id2"), na.rm=TRUE)
Currently, the call to gather on my full data set gives me this error, which I believe is due to memory issue:
Error in .Call("tidyr_melt_dataframe", PACKAGE = "tidyr", data, id_ind, :
negative length vectors are not allowed
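One direction that trades compute for memory is to go through the value columns one at a time and keep only the rows equal to 1, so the full long table with all the NAs is never built. The sketch below is untested at the scale described; long_ones is a hypothetical helper and the data frame is a cut-down copy of the example.

```r
# Hypothetical helper: build the long format one column at a time,
# keeping only rows where the value is 1
long_ones <- function(df, id_cols) {
  value_cols <- setdiff(names(df), id_cols)
  pieces <- lapply(value_cols, function(cl) {
    keep <- which(df[[cl]] == 1)          # which() silently drops NA
    if (length(keep) == 0L) return(NULL)  # nothing to keep for this column
    out <- df[keep, id_cols, drop = FALSE]
    out$b <- cl
    out$value <- 1
    out
  })
  do.call(rbind, pieces)                  # NULL pieces are ignored by rbind
}

# Cut-down copy of the example data
df <- data.frame(id1 = c(1, 2, 3, 4, 4),
                 id2 = c("a", "b", "c", "d", "i"),
                 b1  = c(NA, NA, NA, 1, 1),
                 b2  = c(1, NA, NA, NA, 1),
                 b3  = c(NA, 1, NA, NA, 1),
                 stringsAsFactors = FALSE)

res <- long_ones(df, c("id1", "id2"))
res
```

Peak memory is then driven by the retained rows plus one column's index vector rather than by rows times columns; the final rbind still copies the pieces once.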