How can you transform empty strings to NA in a variable?
For context, I have 4 datasets (4 surveys) that are combined. In one of those datasets, and in one variable in particular, the missing values are stored as empty strings rather than NA. How do I convert those empty strings to NA without affecting the rest of the dataset?
What I saw that I think could work is this:
dat <- dat %>% mutate_all(na_if, "")
The problem, obviously, is that this applies to every column in the data.
What worked was
nav <- c('', ' ')
ECEMD <- transform(ECEMD, CHILDREN_DIM = replace (CHILDREN_DIM, CHILDREN_DIM %in% nav, NA))
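For readers who would rather stay with dplyr, the mutate_all attempt can be narrowed to a single column by calling na_if inside mutate. This is only a sketch: the small data frame below is a made-up stand-in for the real survey data, and OTHER is a hypothetical second column included to show that it is left alone.

```r
library(dplyr)

# Hypothetical stand-in for one of the combined surveys
ECEMD <- data.frame(
  CHILDREN_DIM = c("1", "", "2", ""),
  OTHER        = c("x", "", "y", "z"),
  stringsAsFactors = FALSE
)

# Only CHILDREN_DIM is touched; OTHER keeps its empty strings
ECEMD <- ECEMD %>%
  mutate(CHILDREN_DIM = na_if(CHILDREN_DIM, ""))
```

Note that na_if compares against one value at a time, so a column holding both '' and ' ' would need a second call or the %in% approach above.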
Assuming data roughly like this:
dat
# want_na non_na
# 1 A A
# 2 B B
# 3
# 4 D D
# 5
# 6 F F
# 7
# 8 H H
# 9 I I
# 10 J J
Then define a vector containing all the values you want to replace with NA, e.g.
nav <- c('', ' ')
and replace them in a defined variable, "want_na" in this example.
dat <- transform(dat, want_na=replace(want_na, want_na %in% nav, NA))
dat
# want_na non_na
# 1 A A
# 2 B B
# 3 <NA>
# 4 D D
# 5 <NA>
# 6 F F
# 7 <NA>
# 8 H H
# 9 I I
# 10 J J
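The nav vector only catches the exact strings listed in it. If the blanks might contain a varying number of spaces, one possible variation is to trim whitespace before comparing. The sketch below assumes a small made-up data frame with a three-space entry added for illustration; it is not the asker's data.

```r
# Example data with an empty string and a whitespace-only entry
# (the whitespace-only entry is an assumption added for illustration)
dat <- data.frame(
  want_na = c("A", "", "   ", "D"),
  non_na  = c("A", "", "   ", "D"),
  stringsAsFactors = FALSE
)

# trimws() strips leading/trailing whitespace, so "" and "   " both
# become "" after trimming; %in% never returns NA, so existing NAs are safe
blank <- trimws(dat$want_na) %in% ""
dat$want_na[blank] <- NA

dat$want_na
```

Only want_na is modified; non_na keeps its blank entries.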
Data:
dat <- structure(list(want_na = c("A", "B", "", "D", "", "F", "", "H",
"I", "J"), non_na = c("A", "B", "", "D", "", "F", "", "H", "I",
"J")), class = "data.frame", row.names = c(NA, -10L))
I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to get the elements in df1$ID that are also in df2$ID, and ifelse to recode the values that are both in df1 and in df2 with "E", and NA otherwise.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
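To see why this works, it may help to look at what match actually returns: the position of each df1$ID inside df2$ID, or NA when there is none. ifelse treats any position as TRUE. The sketch below also shows %in%, which gives plain TRUE/FALSE, and uses it to write an empty string instead of NA for non-matches, as in the table in the question (that choice is an assumption about what is wanted).

```r
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))

# Position of each df1$ID within df2$ID; NA where there is no match
pos <- match(df1$ID, df2$ID)
pos
# [1]  4  2 NA  1 NA

# %in% carries the same information as TRUE/FALSE
found <- df1$ID %in% df2$ID

# "E" where found, empty string otherwise
df1$Existing <- ifelse(found, "E", "")
```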
Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base-R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practice, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1) -> tab1
setDT(tab2) -> tab2
tab1[,Existing := ifelse(tab1$ID %in% tab2$ID, 'E',NA)]
Note that here mutate and := play roughly the same role. If you work more with R, you will probably develop an affinity for one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or, piggy-backing on @jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))
I must imagine this question is not unique, but I was struggling with which words to search for so if this is redundant please point me to the post!
I have a dataframe
test <- data.frame(x = c("a", "b", "c", "d", "e"))
x
1 a
2 b
3 c
4 d
5 e
And I'd like to replace SOME of the values using a separate data frame
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
Resulting in:
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
A base R solution using match + replace
test <- within(test,x <- replace(as.character(x),match(metadata$a,x),as.character(metadata$b)))
such that
> test
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Importing your data with stringsAsFactors = FALSE and using dplyr and stringr, you can do:
test %>%
mutate(x = str_replace_all(x, setNames(metadata$b, metadata$a)))
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Or using the basic idea from @Sotos:
test %>%
mutate(x = pmax(x, metadata$b[match(x, metadata$a, nomatch = x)], na.rm = TRUE))
You can do,
test$x[test$x %in% metadata$a] <- na.omit(metadata$b[match(test$x, metadata$a)])
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Here's one approach, though I presume there are shorter ones:
library(dplyr)
test %>%
left_join(metadata, by = c("x" = "a")) %>%
mutate(b = coalesce(b, x))
# x b
#1 a a
#2 b b
#3 c REPLACE_1
#4 d REPLACE_2
#5 e e
Note: I made the data types match by loading metadata as character, not factor:
metadata <- data.frame(stringsAsFactors = FALSE,
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
You can use match to make this update join.
i <- match(metadata$a, test$x)
test$x[i] <- metadata$b
# test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Or:
i <- match(test$x, metadata$a)
j <- !is.na(i)
test$x[j] <- metadata$b[i[j]]
test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Data:
test <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"), stringsAsFactors = FALSE)
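The second variant above looks up each value of test$x in the metadata keys, so it should also cope with repeated values in test$x and with metadata keys that never occur in test$x. A minimal sketch, wrapping it in a hypothetical helper called update_from_lookup (the name, the repeated "c", and the unused "z" key are all assumptions made for illustration):

```r
# Hypothetical helper: replace values of x that appear in `from`
# with the corresponding entries of `to`; everything else is kept
update_from_lookup <- function(x, from, to) {
  i <- match(x, from)   # position of each x in the lookup keys, NA if absent
  hit <- !is.na(i)      # which elements of x have a replacement
  x[hit] <- to[i[hit]]
  x
}

test <- data.frame(x = c("a", "c", "b", "c", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(
  a = c("c", "d", "z"),
  b = c("REPLACE_1", "REPLACE_2", "REPLACE_3"),
  stringsAsFactors = FALSE
)

test$x <- update_from_lookup(test$x, metadata$a, metadata$b)
test$x
```

Here both "c" entries become REPLACE_1, and the keys "d" and "z", which have no counterpart in test$x, are simply ignored.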
I am currently trying to figure out a vectorized way to match by two values in the same row. I have the following two simplified data frames:
# Dataframe 1: Displaying all my observations
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"))
colnames(df1) <- c("ID", "Number1", "Number2")
> df1
ID Number1 Number2
1 1 A B
2 2 B E
3 3 C D
4 4 D A
5 5 A C
6 6 B A
7 7 A D
8 8 C A
# Dataframe 2: Matrix of observations I am interested in
df2 <- matrix(c("A", "B",
"D", "A",
"C", "B",
"E", "D"),
ncol = 2,
byrow = TRUE)
> df2
[,1] [,2]
[1,] "A" "B"
[2,] "D" "A"
[3,] "C" "B"
[4,] "E" "D"
What I am trying to accomplish is to create a new column in df1 that states TRUE only if the exact combination is present in df2 (for example ID = 1 is equivalent to the first row in df2 because both of them consist of A and B). Additionally, if there is a shortcut, I would also like the status to be TRUE if the numbers are reversed, i.e. df1$Number1 matches df2[i,2] and df1$Number2 matches df2[i,1] (for example for ID = 7, the combination in df1 is A,D and in df2, the combination is D,A --> TRUE).
My desired output looks like this:
> df1
ID Number1 Number2 Status
1 1 A B TRUE
2 2 B E FALSE
3 3 C D FALSE
4 4 D A TRUE
5 5 A C FALSE
6 6 B A TRUE
7 7 A D TRUE
8 8 C A FALSE
All I have gotten so far is this:
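# StatusComb[i, j] records whether row i of df1 matches row j of df2
StatusComb <- matrix(FALSE, nrow = nrow(df1), ncol = nrow(df2))
for (i in 1:nrow(df1)) {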
for (j in 1:nrow(df2)) {
Status <- ifelse(df1$Number1[i] %in% df2[j,1] &&
df1$Number2[i] %in% df2[j,2], TRUE, FALSE)
StatusComb[i,j] <- Status
}
df1$Status[i] <- ifelse(any(StatusComb[i,]) == TRUE, TRUE, FALSE)
}
It is really inefficient (you can clearly tell I am new to R) and does not look very nice either. I would appreciate any help!
One method would be to merge things together.
Adapting your data, to account for reversed labels, I'll reverse df2 on itself and rbind it:
df2 <- rbind.data.frame(df2, df2[,c(2,1)])
colnames(df2) <- c("Number1", "Number2")
df2$a <- TRUE
df2
# Number1 Number2 a
# 1 A B TRUE
# 2 D A TRUE
# 3 C B TRUE
# 4 E D TRUE
# 5 B A TRUE
# 6 A D TRUE
# 7 B C TRUE
# 8 D E TRUE
I added the column a so that it is carried through the merge. From there:
df3 <- merge(df1, df2, all.x = TRUE)
df3$a <- !is.na(df3$a)
df3[ order(df3$ID), ]
# Number1 Number2 ID a
# 1 A B 1 TRUE
# 5 B E 2 FALSE
# 7 C D 3 FALSE
# 8 D A 4 TRUE
# 2 A C 5 FALSE
# 4 B A 6 TRUE
# 3 A D 7 TRUE
# 6 C A 8 FALSE
If you look at it before !is.na(df3$a), you'll see that the column is wholly TRUE and NA (the NA were not present in df2); if that is enough for you, then you can omit the middle step. The order step is just because row-order with merge is not assured (in fact I find it always inconveniently different). Since it was previously ordered by ID, I reverted to that, but it was entirely for aesthetics here to match your desired output.
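If keeping the original row order of df1 matters, a possible alternative to merging is to build a text key per row and test membership with %in%. This is a different technique from the merge above, sketched under the assumption that the values never contain the separator "|"; the object name pairs is a stand-in for the matrix from the question.

```r
df1 <- data.frame(
  ID = 1:8,
  Number1 = c("A", "B", "C", "D", "A", "B", "A", "C"),
  Number2 = c("B", "E", "D", "A", "C", "A", "D", "A"),
  stringsAsFactors = FALSE
)
pairs <- matrix(c("A", "B",
                  "D", "A",
                  "C", "B",
                  "E", "D"), ncol = 2, byrow = TRUE)

# Keys for the pairs of interest, in both orders
wanted <- c(paste(pairs[, 1], pairs[, 2], sep = "|"),
            paste(pairs[, 2], pairs[, 1], sep = "|"))

# TRUE where the row's pair (in its own order) is among the wanted keys
df1$Status <- paste(df1$Number1, df1$Number2, sep = "|") %in% wanted
df1
```

Because nothing is merged, df1 stays in its original order and no re-sorting step is needed.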
You could define the combination variable that you want to search for in alphabetical order as below:
combination <- apply(df2, 1, function(x) {
paste(sort(x), collapse = '')
})
combination
[1] "AB" "AD" "BC" "DE"
And then mutate the Status field based on the concatenation of the Number field
library(dplyr)
df1 %>%
rowwise() %>%
mutate(S = paste(sort(c(Number1, Number2)), collapse = "")) %>%
mutate(Status = S %in% combination)
Source: local data frame [8 x 5]
Groups: <by row>
# A tibble: 8 x 5
ID Number1 Number2 S Status
<dbl> <chr> <chr> <chr> <lgl>
1 1 A B AB TRUE
2 2 B E BE FALSE
3 3 C D CD FALSE
4 4 D A AD TRUE
5 5 A C AC FALSE
6 6 B A AB TRUE
7 7 A D AD TRUE
8 8 C A AC FALSE
Data:
I set stringsAsFactors = F in the dataframe
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"), stringsAsFactors = F)
colnames(df1) <- c("ID", "Number1", "Number2")
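The same "sort each pair alphabetically" idea can probably be done without rowwise() or apply(), since pmin and pmax work element-wise on character vectors. A base R sketch, where canon is a hypothetical helper name introduced here:

```r
df1 <- data.frame(
  ID = 1:8,
  Number1 = c("A", "B", "C", "D", "A", "B", "A", "C"),
  Number2 = c("B", "E", "D", "A", "C", "A", "D", "A"),
  stringsAsFactors = FALSE
)
df2 <- matrix(c("A", "B",
                "D", "A",
                "C", "B",
                "E", "D"), ncol = 2, byrow = TRUE)

# Hypothetical helper: put each pair into alphabetical order and glue it
# together, e.g. ("D", "A") becomes "AD"
canon <- function(a, b) paste0(pmin(a, b), pmax(a, b))

combination <- canon(df2[, 1], df2[, 2])
df1$Status <- canon(df1$Number1, df1$Number2) %in% combination
```

This avoids the per-row function calls, which may matter on larger data.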
I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the cumulative sum of the column value: starting from the first row, I want to keep rows until the running total of value exceeds 0.6. My desired output is:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
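Broken into steps, the one-liner above computes a running total, finds the first row where it reaches the threshold, and keeps everything up to that row. The sketch below spells this out and adds a guard for the case where the threshold is never reached; that fallback (keep all rows) is an assumption made here, not part of the answer.

```r
df2 <- data.frame(
  name  = letters[1:9],
  value = c(0.20019421, 0.17996454, 0.14257010, 0.14257010, 0.11258865,
            0.07228970, 0.05673759, 0.05319149, 0.03989362),
  stringsAsFactors = FALSE
)

running   <- cumsum(df2$value)           # running total down the rows
first_hit <- which(running >= 0.6)[1]    # first row where the total reaches 0.6

# If the threshold is never reached, first_hit is NA; keep all rows then
# (a design choice made for this sketch)
n_keep <- if (is.na(first_hit)) nrow(df2) else first_hit
result <- df2[seq_len(n_keep), ]
```

Using seq_len rather than seq avoids surprises if n_keep were ever 0.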
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362
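Neither subset above reproduces the desired output exactly, because row 4 is the one that pushes the total over 0.6. One way to include it, sketched here as a variation rather than as the answer's own method, is to compare against the running total of the earlier rows only:

```r
df <- data.frame(
  name  = letters[1:9],
  value = c(0.20019421, 0.17996454, 0.14257010, 0.14257010, 0.11258865,
            0.07228970, 0.05673759, 0.05319149, 0.03989362),
  stringsAsFactors = FALSE
)

# Total of all rows before the current one (0 for the first row)
before <- cumsum(df$value) - df$value

# Keep a row as long as the total had not yet reached 0.6 before it
kept <- df[before < 0.6, ]
```

This keeps rows a to d, matching the output requested in the question.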
I have a dataframe that is created by a for-loop with a changing number of columns.
In a different function I want to drop the last five columns.
The variable holding the length of the dataframe is "units"; it takes values between 10 and 150.
I have tried dropping the columns by name, but it is not working. (As soon as I try to open "newframe", RStudio crashes; viewing myframe is no problem.)
drops <- c("name1","name2","name3","name4","name5")
newframe <- results[,!(names(myframe) %in% drops)]
Is there any way to just drop the last five columns of a dataframe without relying on the names or numbers of the columns?
length(df) can also be used:
mydf[1:(length(mydf)-5)]
You could use the counts of columns (ncol()):
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10), ws = rnorm(10))
# rm last 2 columns
df[ , -((ncol(df) - 1):ncol(df))]
# or
df[ , -seq(ncol(df)-1, ncol(df))]
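One thing to watch with the df[ , -idx] form is that when only one column remains, R returns a plain vector unless drop = FALSE is given. A sketch of a small wrapper that selects the columns to keep by position instead; drop_last_cols is a hypothetical helper name, and returning zero columns when n exceeds ncol(df) is a choice made for this example.

```r
# Hypothetical helper: drop the last n columns by position
drop_last_cols <- function(df, n = 5) {
  keep <- seq_len(max(ncol(df) - n, 0))  # positions of the columns to keep
  df[, keep, drop = FALSE]               # drop = FALSE keeps a data frame
}

df <- data.frame(x = 1:3, y = 4:6, z = 7:9, ws = 10:12)

one_left  <- drop_last_cols(df, 3)    # still a data frame, with column x only
none_left <- drop_last_cols(df, 10)   # n larger than ncol(df): zero columns

names(drop_last_cols(mtcars))
```

Selecting what to keep, rather than negating what to drop, also sidesteps the edge cases of negative indexing when n is 0 or larger than the number of columns.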
You can take advantage of the list method for head() (which drops whole list elements, and works differently from the data.frame method, which drops rows):
# data.frame with 26 columns (named a-z):
df <- setNames( as.data.frame( as.list(1:26)) , letters )
# drop last 5 'columns':
as.data.frame( head(as.list(df),-5) )
# a b c d e f g h i j k l m n o p q r s t u
#1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
My preferred method is using rev, which makes the syntax cleaner. For the mtcars data set:
mtcars[-rev(seq_len(ncol(mtcars)))[1:5]]
Or using head (similar to Simon's suggestion):
mtcars[head(seq_len(ncol(mtcars)), -5)]
A tidyverse option is to use last_col: we select the range from the fifth-to-last column (i.e., last_col(offset = 4)) through the last column, then use - to remove the selected columns.
library(tidyverse)
df %>%
select(-(last_col(offset = 4):last_col()))
Output
x y z
1 1 10 5
2 2 9 5
3 3 8 5
4 4 7 5
5 5 6 5
6 6 5 5
7 7 4 5
8 8 3 5
9 9 2 5
10 10 1 5
Another option is to use ncol in the select:
df %>%
select(-((ncol(.) - 4):ncol(.)))
Or we could use tail with names:
df %>%
select(-tail(names(.), 5))
Data
df <- structure(list(x = 1:10, y = 10:1, z = c(5, 5, 5, 5, 5, 5, 5,
5, 5, 5), a = 11:20, b = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), c = c("t", "s", "r", "q", "p", "o", "n", "m",
"l", "k"), d = 30:39, e = 50:59), class = "data.frame", row.names = c(NA,
-10L))
If you are using the data.table package for your data processing, one nice way is:
drops <- c("name1","name2","name3","name4","name5")
df[, .SD, .SDcols=!drops]
In fact, this allows you to drop any variables as you like.