How to replace NA's with specific data? [duplicate]

This question already has answers here:
How do I replace NA values with zeros in an R dataframe?
(29 answers)
Closed 11 months ago.
I have a data frame, df:
ID Number
3 9
4 10
2 3
1 NA
8 5
6 4
9 NA
I would like to change the NAs to 1's, like below.
ID Number
3 9
4 10
2 3
1 1
8 5
6 4
9 1

With base R:
df$Number[is.na(df$Number)] <- 1
Or with dplyr:
library(dplyr)
df %>%
  mutate(Number = ifelse(is.na(Number), 1, Number))
Or with data.table:
library(data.table)
setnafill(df, cols="Number", fill=1)
Output
ID Number
1 3 9
2 4 10
3 2 3
4 1 1
5 8 5
6 6 4
7 9 1
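If you already work in the tidyverse, two further options (not part of the answers above, so treat this as a sketch) are tidyr::replace_na() and dplyr::coalesce():
library(tidyr)
library(dplyr)
# replace_na() takes a list of per-column replacement values
df <- replace_na(df, list(Number = 1L))
# coalesce() returns the first non-NA value element-wise
df <- df %>% mutate(Number = coalesce(Number, 1L))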
Data
df <- structure(list(ID = c(3L, 4L, 2L, 1L, 8L, 6L, 9L), Number = c(9L,
10L, 3L, NA, 5L, 4L, NA)), class = "data.frame", row.names = c(NA,
-7L))

Related

How to remove part of a string without breaking up a data frame?

I have data that looks like this, but much bigger:
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
As an example, I am trying to remove -1 from all strings in the first column.
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
but the problem is that it returns only that column and drops all the other columns.
I don't want to split the data and merge it back again because I am afraid of creating a mismatch.
Is there any way to remove the specific strings without breaking up the data frame?
For instance, the output should look like this:
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Trye 1 2
Trified 3 3
Using gsub, match the literal -1 anchored to the end of the string with $ (escaping the hyphen as \\- is harmless, though not required here):
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
Using the stringr package:
library(stringr)
df$names <- str_remove_all(df$names, '-1')
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
We could use trimws from base R, passing a custom whitespace pattern:
df$names <- trimws(df$names, whitespace = "-\\d+")
Output
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
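For completeness, a sketch (assuming the suffix is always the literal -1 at the end of the value) that keeps the stringr approach inside the data frame via dplyr::mutate(), so the other columns are untouched:
library(dplyr)
library(stringr)
# replace the column in place
df <- df %>% mutate(names = str_remove(names, "-1$"))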

How to sum rows based on exact conditions on multiple columns and save edited rows in original dataset? [duplicate]

This question already has answers here:
Find nearest matches for each row and sum based on a condition
(4 answers)
Closed 3 years ago.
There are three parts to this problem:
1) I want to sum the values in columns b, c and d for any two adjacent rows that have the same values in those columns.
2) I would like to keep the values in the other columns the same. (Some other column, e.g. a, may contain character data.)
3) I would like to keep the changes by replacing the original values in columns b, c and d of the first row (of the two identical rows) with the new values (the sums) and deleting the second row.
Time a b c d id
1 2014/10/11 A 40 20 10 1
2 2014/10/12 A 40 20 10 2
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 20 7 8 6
7 2014/10/17 B 20 7 8 7
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
Expected outcome:
Time a b c d id
1 2014/10/11 A 80 40 20 1 *
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 40 14 16 6 *
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
id 1 and 2 combined to become id 1; id 6 and 7 combined to become id 6.
Thank you. Any contribution is greatly appreciated.
Using dplyr functions along with data.table::rleid. To detect adjacent rows with the same values in b, c and d, we paste those columns together and use rleid to build run-length groups. Within each group we sum b, c and d and keep only the first row.
library(dplyr)
df %>%
  mutate(temp_col = paste(b, c, d, sep = "-")) %>%
  group_by(group = data.table::rleid(temp_col)) %>%
  mutate_at(vars(b, c, d), sum) %>%
  slice(1L) %>%
  ungroup %>%
  select(-temp_col, -group)
# Time a b c d id
# <fct> <fct> <int> <int> <int> <int>
#1 2014/10/11 A 80 40 20 1
#2 2014/10/13 B 9 10 9 3
#3 2014/10/14 D 16 5 12 4
#4 2014/10/15 D 1 6 5 5
#5 2014/10/16 B 40 14 16 6
#6 2014/10/18 A 11 9 5 8
#7 2014/10/19 C 31 20 23 9
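In dplyr 1.0.0 and later, mutate_at() is superseded by across(); a sketch of the same pipeline under that assumption:
library(dplyr)
df %>%
  mutate(temp_col = paste(b, c, d, sep = "-")) %>%
  group_by(group = data.table::rleid(temp_col)) %>%
  mutate(across(c(b, c, d), sum)) %>%
  slice(1L) %>%
  ungroup() %>%
  select(-temp_col, -group)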
Data
df <- structure(list(Time = structure(1:9, .Label = c("2014/10/11",
"2014/10/12", "2014/10/13", "2014/10/14", "2014/10/15", "2014/10/16",
"2014/10/17", "2014/10/18", "2014/10/19"), class = "factor"),
a = structure(c(1L, 1L, 2L, 4L, 4L, 2L, 2L, 1L, 3L), .Label = c("A",
"B", "C", "D"), class = "factor"), b = c(40L, 40L, 9L, 16L,
1L, 20L, 20L, 11L, 31L), c = c(20L, 20L, 10L, 5L, 6L, 7L,
7L, 9L, 20L), d = c(10L, 10L, 9L, 12L, 5L, 8L, 8L, 5L, 23L
), id = 1:9), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9"))

R subsetting rows where values in multiple columns don't match

Apologies if this has been asked already, but I searched and could not find an exact example of what I am trying to do. I'm trying to subset a dataframe to exclude rows that have matching numerical values across five columns. For example, for the following dataframe, df, I'd want to return a new dataframe only with rows 1:2, 5:6, and 8:10:
Row A B C D E
1 1 1 2 3 1
2 4 1 2 3 5
3 2 2 2 2 2
4 5 5 5 5 5
5 4 4 2 3 4
6 2 1 3 5 2
7 3 3 3 3 3
8 3 2 5 3 3
9 2 1 2 2 4
10 3 3 3 2 3
I'm having trouble figuring out how to do this for more than two columns. I've tried the following and know they are not right.
df2 <- df[!duplicated(df, c("A", "B", "C", "D", "E"))]
and
df2 <- df[df$A==df$B==df$C==df$D==df$E,]
Thanks in advance.
Data frames are usually operated on column-wise rather than row-wise, which is why your duplicated attempt doesn't work: it checks for duplicate rows within those columns. Your == attempt doesn't work because == is a binary operator: df$A == df$B evaluates to TRUE or FALSE, and then (df$A == df$B) == df$C (the implied parentheses) tests whether df$C equals that TRUE or FALSE.
apply is a good way to run a function on each row. It converts your data frame to a matrix to run the function, but in this case that's fine since columns A through E are all numeric. Here's one way:
df[apply(df[, -1], 1, function(x) length(unique(x))) > 1, ]
# Row A B C D E
# 1 1 1 1 2 3 1
# 2 2 4 1 2 3 5
# 5 5 4 4 2 3 4
# 6 6 2 1 3 5 2
# 8 8 3 2 5 3 3
# 9 9 2 1 2 2 4
# 10 10 3 3 3 2 3
You could come up with all sorts of different functions to apply to test for all the elements being the same.
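For example, a sketch of one such alternative (using the same df, with the Row column) checks whether each row's values span more than a single value:
# keep rows where the values in A:E are not all identical
df[apply(df[, -1], 1, function(x) max(x) != min(x)), ]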
I assumed you actually have a column named Row. If that isn't the case, leave out the -1 in my code above.
Using this data, reproducibly shared with dput().
df = structure(list(Row = 1:10, A = c(1L, 4L, 2L, 5L, 4L, 2L, 3L,
3L, 2L, 3L), B = c(1L, 1L, 2L, 5L, 4L, 1L, 3L, 2L, 1L, 3L), C = c(2L,
2L, 2L, 5L, 2L, 3L, 3L, 5L, 2L, 3L), D = c(3L, 3L, 2L, 5L, 3L,
5L, 3L, 3L, 2L, 2L), E = c(1L, 5L, 2L, 5L, 4L, 2L, 3L, 3L, 4L,
3L)), .Names = c("Row", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-10L))
You can simply compare all the columns against a single column and check whether they are all the same. (Note: this answer works on a version of the data frame that contains only columns A to E, without the Row column, as the output below shows.)
df[rowSums(df[-1] == df[, 1]) < (ncol(df) - 1), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3
Or just df[rowSums(df == df[, 1]) < (ncol(df)), ]
Or, similarly, you can avoid matrix conversion altogether by combining Reduce and lapply:
df[!Reduce("&" , lapply(df, `==`, df[, 1])), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3

R: Sorting columns based on partial match of column names with row names

I have a data frame that can be simplified to look like this (included the dput at the end):
T2_KL_21 A1_LC_11 W3_FA_22 RR_BI_12 PL_EW_12 RT_LC_22 YU_BI_21
FA 1 2 3 4 5 6 7
BI 1 2 3 4 5 6 7
KL 1 2 3 4 5 6 7
EW 1 2 3 4 5 6 7
LC 1 2 3 4 5 6 7
I would like to sort the columns so that they follow the order of the row names (based on partial match). It would then look like this:
W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
FA 3 4 7 1 5 2 6
BI 3 4 7 1 5 2 6
KL 3 4 7 1 5 2 6
EW 3 4 7 1 5 2 6
LC 3 4 7 1 5 2 6
If more than one column name contains the string in the row names, they should be kept side by side, but the order does not matter.
I have already filtered the columns so that they all contain a match in the row names.
Here is the dput of the data frame:
structure(list(T2_KL_21 = c(1L, 1L, 1L, 1L, 1L), A1_LC_11 = c(2L,
2L, 2L, 2L, 2L), W3_FA_22 = c(3L, 3L, 3L, 3L, 3L), RR_BI_12 = c(4L,
4L, 4L, 4L, 4L), PL_EW_12 = c(5L, 5L, 5L, 5L, 5L), RT_LC_22 = c(6L,
6L, 6L, 6L, 6L), YU_BI_21 = c(7L, 7L, 7L, 7L, 7L)), .Names = c("T2_KL_21",
"A1_LC_11", "W3_FA_22", "RR_BI_12", "PL_EW_12", "RT_LC_22", "YU_BI_21"
), class = "data.frame", row.names = c("FA", "BI", "KL", "EW",
"LC"))
I have tried using pmatch, grep and match, with no success.
Any advice will be much appreciated! Thanks
We can loop through the row names, grep for each one in the column names to get the matching column indices, unlist the result, and use it to reorder the columns:
df1[unlist(lapply(gsub("\\d+", "", row.names(df1)), function(x) grep(x, names(df1))))]
#W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
#FA 3 4 7 1 5 2 6
#BI 3 4 7 1 5 2 6
#KL 3 4 7 1 5 2 6
#EW 3 4 7 1 5 2 6
#LC 3 4 7 1 5 2 6
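If the group code is always the middle token between underscores (an assumption about the naming pattern, not stated in the answer above), a non-looping sketch with sub() and match() gives the same ordering:
grp <- sub("^[^_]+_([^_]+)_.*$", "\\1", names(df1))  # "KL" "LC" "FA" "BI" "EW" "LC" "BI"
df1[order(match(grp, row.names(df1)))]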

Identify rows with complete data in R by adding details in additional column [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 7 years ago.
For a sample dataframe:
df1 <- structure(list(id = structure(1:5, .Label = c("a", "b", "c",
"d", "e"), class = "factor"), cat = c(5L, 7L, 6L, 2L, 8L), dog = c(7L,
NA, 6L, 13L, 2L), sheep = c(NA, 6L, 3L, 6L, 2L), cow = c(2L,
10L, 8L, 9L, 1L), rabbit = c(5L, 3L, NA, 2L, 4L), pig = c(7L,
NA, 12L, 5L, NA)), .Names = c("id", "cat", "dog", "sheep", "cow",
"rabbit", "pig"), class = "data.frame", row.names = c(NA, -5L
))
I want to add an extra column 'complete.farm' to identify which rows have values in columns 'sheep' AND 'cow' AND 'pig'. Any rows with NAs in one or more of these columns should get a 0, and rows with real values should get a 1.
If anyone could give me some advice on this, I would really appreciate it. I usually use complete.cases to subset my data frame, but this time I only want to add this information as a column.
This seems to work:
> df1$complete.farm <- ifelse( !is.na(df1$pig) & !is.na(df1$sheep) & !is.na(df1$cow), 1,0)
> df1
id cat dog sheep cow rabbit pig complete.farm
1 a 5 7 NA 2 5 7 0
2 b 7 NA 6 10 3 NA 0
3 c 6 6 3 8 NA 12 1
4 d 2 13 6 9 2 5 1
5 e 8 2 2 1 4 NA 0
ifelse is vectorised, so you pass the condition as the first argument, with 1 returned where it is TRUE and 0 where it is FALSE.
Another (simpler) way, as per @thelatemail's comment:
df1$col <- as.numeric(complete.cases(df1[c("sheep","cow","pig")]))
> df1
id cat dog sheep cow rabbit pig complete.farm col
1 a 5 7 NA 2 5 7 0 0
2 b 7 NA 6 10 3 NA 0 0
3 c 6 6 3 8 NA 12 1 1
4 d 2 13 6 9 2 5 1 1
5 e 8 2 2 1 4 NA 0 0
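A further base R sketch (not from the original answers) counts the NAs per row across the three columns directly:
# 1 when none of sheep, cow, pig is NA; 0 otherwise
df1$complete.farm <- as.numeric(rowSums(is.na(df1[c("sheep", "cow", "pig")])) == 0)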
