My two dataframes:
df1
Col1
A
B
C
df2
Col1
A
D
E
F
I would like to add a second column, Col2, to df1, where each value is 1 if its respective value in Col1 is also in Col1 of df2, and 0 otherwise. Thus df1 would look like this:
df1
Col1 Col2
A 1
B 0
C 0
Thanks!
Add Col2 to df2:
df2$Col2 <- 1
Perform a left-join merge:
df3 <- merge(df1, df2, all.x = TRUE, by = 'Col1')
Replace the NAs with zeros:
df3$Col2[is.na(df3$Col2)] <- 0
df3 is now
Col1 Col2
1 A 1
2 B 0
3 C 0
Edit: @ycw has done it more concisely using as.numeric and %in%. I like his answer, but I thought I'd edit mine to include a version of his approach that doesn't use dplyr:
It's as simple as df1$Col2 <- as.numeric(df1$Col1 %in% df2$Col1). Much better than mine!
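As a minimal base R sketch of the one-liner above, using the data from the question: it also shows one practical difference from the merge approach, on the assumption (not in the question) that df2 could contain repeated keys. The name df2_dup is hypothetical and only there for illustration.

```r
# Data from the question
df1 <- data.frame(Col1 = c("A", "B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(Col1 = c("A", "D", "E", "F"), stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per element of df1$Col1; as.integer maps it to 1/0
df1$Col2 <- as.integer(df1$Col1 %in% df2$Col1)

# Hypothetical lookup table with a repeated key: merge() returns one row
# per match, so "A" appears twice, while %in% never changes the row count
df2_dup <- data.frame(Col1 = c("A", "A", "D"), Col2 = 1)
merged <- merge(df1["Col1"], df2_dup, all.x = TRUE, by = "Col1")
n_merged <- nrow(merged)
```

With these inputs the %in% version keeps df1 at three rows, whereas the merge against df2_dup has four.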
df3 is the final output.
library(dplyr)
df1 <- tibble(Col1 = c("A", "B", "C"))  # tibble() replaces the deprecated data_frame()
df2 <- tibble(Col1 = c("A", "D", "E", "F"))
df3 <- df1 %>% mutate(Col2 = as.numeric(Col1 %in% df2$Col1))
Alternatively, the following approach is similar to HarlandMason's method but uses dplyr and tidyr.
library(dplyr)
library(tidyr)
df3 <- df2 %>%
mutate(Col2 = 1) %>%
right_join(df1, by = "Col1") %>%
replace_na(list(Col2 = 0))
Two options using data.table
The first one uses the %chin% operator:
library(data.table)
x = data.table(v = LETTERS[1:3])
y = data.table(v = c("A","D","E","F"))
x[, found:= v %chin% y$v]
x
#> v found
#> 1: A TRUE
#> 2: B FALSE
#> 3: C FALSE
The second one is built on merging behaviour:
library(data.table)
x = data.table(v = LETTERS[1:3])
y = data.table(v = c("A","D","E","F"))
y[, found := TRUE]
x[, found:= y[.SD, .(ifelse(is.na(found), FALSE, TRUE)), on = .(v)]]
x
#> v found
#> 1: A TRUE
#> 2: B FALSE
#> 3: C FALSE
EDIT: Based on @frank's comment, you can drop the ifelse; the result is the same:
x[, found:= y[.SD, !is.na(found), on = .(v)]]
x
#> v found
#> 1: A TRUE
#> 2: B FALSE
#> 3: C FALSE
To understand what happens, here is the join that the code above builds on:
x[, found := NULL]
y[x, on = .(v)]
#> v found
#> 1: A TRUE
#> 2: B NA
#> 3: C NA
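The same lookup-then-test-for-NA idea can be sketched in base R with match(). This is a swapped-in technique for illustration, not the data.table join itself: match() returns the position of each key in the lookup table, or NA when there is no match.

```r
# Plain data frames with the same contents as x and y above
x <- data.frame(v = LETTERS[1:3], stringsAsFactors = FALSE)
y <- data.frame(v = c("A", "D", "E", "F"), found = TRUE, stringsAsFactors = FALSE)

# Position of each x$v in y$v; NA where the key is absent from y
idx <- match(x$v, y$v)

# Indexing with NA yields NA, which mirrors the NA rows of the join output
looked_up <- y$found[idx]

# A key was found exactly when the lookup is not NA
x$found <- !is.na(looked_up)
```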
This is the sample dataset:
library(data.table)
df = data.table(x = c(1000,2000,10,2), y = c('A','A','B','B'))
I only want to divide x by 1000 in the rows where df$y == "A". The final dataset should appear as:
df = data.table(x = c(1,2,10,2), y = c('A','A','B','B'))
You need to create a conditional statement.
In base R:
df$x <- ifelse(df$y == "A", df$x/1000, df$x)
In dplyr:
library(dplyr)
df <- df |>
mutate(x = if_else(y == "A", x/1000, x))
data.table option using fifelse like this:
library(data.table)
df = data.table(x = c(1000,2000,10,2), y = c('A','A','B','B'))
df[, x := fifelse(y == "A", x / 1000, x)]
df
#> x y
#> 1: 1 A
#> 2: 2 A
#> 3: 10 B
#> 4: 2 B
Created on 2023-02-18 with reprex v2.0.2
We could use data.table methods as the input is a data.table
library(data.table)
df[y == 'A', x := x/1000]
-output
> df
x y
1: 1 A
2: 2 A
3: 10 B
4: 2 B
Base R: Subsetting with [:
df$x[df$y == "A"] <- df$x[df$y == "A"]/1000
x y
1: 1 A
2: 2 A
3: 10 B
4: 2 B
I have a simple problem, but I haven't figured out the solution yet. I don't know how to refer to a variable outside the data frame when I'm using dplyr. Here is a small chunk of code:
library(dplyr)
var <- 1
df <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df %>% mutate(col2 = ifelse(var == 1, col2 + var, col2))
Result:
col1 col2
1 a 2
2 b 2
3 c 2
Desired output:
col1 col2
1 a 2
2 b 3
3 c 4
This is not a dplyr-specific issue. When the condition you are checking has length 1, use if and else instead of the vectorized ifelse.
library(dplyr)
df %>% mutate(col2 = if(var == 1) col2 + var else col2)
# col1 col2
#1 a 2
#2 b 3
#3 c 4
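A minimal base R sketch of why this happens: ifelse returns a result with the same length as its test, so a length-1 test yields only the first element of the "yes" vector, and that single value is then recycled down the column. The column names col2_ifelse and col2_if are made up for this illustration.

```r
var <- 1
df <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))

# The test `var == 1` has length 1, so ifelse returns a single value
short <- ifelse(var == 1, df$col2 + var, df$col2)

# Assigning that single value recycles it over all three rows
df$col2_ifelse <- short

# if/else evaluates the whole chosen branch, keeping all three values
df$col2_if <- if (var == 1) df$col2 + var else df$col2
```

Here short has length 1, which explains the column of 2s in the question, while the if/else version keeps one value per row.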
We could use rowwise and sum
df %>%
rowwise() %>%
mutate(col2 = ifelse(var == 1, sum(col2,var), col2))
col1 col2
<chr> <dbl>
1 a 2
2 b 3
3 c 4
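We could also use base R. Note that this version updates only the rows where col2 equals var, so its result differs from the desired output in the question: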
i1 <- df$col2 == var
df$col2[i1] <- df$col2[i1] + var
-output
> df
col1 col2
1 a 2
2 b 2
3 c 3
Or use data.table
library(data.table)
setDT(df)[col2 == var, col2 := col2 + var]
I have the list below (with sublists). The sublists have unequal lengths: the "a" list has 2 columns and the "b" list has 3 columns.
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
I need to append this list keeping references like below. For example,
COl1 COl2 COl3 Col4
a 1 false NA
b 2 true 3
As you can see above, the first column references the name of the list element that the data was taken from. Please guide.
1) data.table Set names on the list giving the new list fnam and then use rbindlist from data.table:
library(data.table)
fnam <- lapply(f, function(x) setNames(x, paste0("COL", seq(2, length = length(x)))))
cbind(COL1 = names(f), rbindlist(fnam , fill = TRUE))
giving:
COL1 COL2 COL3 COL4
1: a 1 false <NA>
2: b 2 true 3
2) base R This alternative uses no packages. We create a character vector out of f and then read it in using read.table.
Lines <- paste(names(f), sapply(f, paste, collapse = " "))
nc <- max(lengths(f)) + 1
col.names <- paste0("COL", seq_len(nc))
read.table(text = Lines, header = FALSE, fill = TRUE, col.names = col.names)
giving:
COL1 COL2 COL3 COL4
1 a 1 false NA
2 b 2 true 3
Use some separator not appearing in the data if the data can contain spaces.
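A sketch of that separator variant, assuming "|" never occurs in the data. The value "3 x" is made up here only to show a field that contains a space; it is not part of the original data.

```r
# Hypothetical data: one value contains a space
f <- list(a = list(1, "false"), b = list(2, "true", "3 x"))

# Join the name and the values with "|" instead of a space
Lines <- paste(names(f), sapply(f, paste, collapse = "|"), sep = "|")

nc <- max(lengths(f)) + 1
col.names <- paste0("COL", seq_len(nc))

# sep = "|" keeps "3 x" in one field; listing "" in na.strings turns the
# blank fields added by fill = TRUE into NA
res <- read.table(text = Lines, header = FALSE, sep = "|", fill = TRUE,
                  col.names = col.names, na.strings = c("NA", ""),
                  stringsAsFactors = FALSE)
```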
One option would be to set the names of the list elements using map and specify the .id as 'COL1' to create a new column based on the names of 'f'. Note that map returns a list, while map_df returns a tbl_df/data.frame.
1)
library(tidyverse)
f %>%
map_df(~ set_names(., paste0("COL", seq_along(.)+1)), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <dbl> <chr> <chr>
#1 a 1 false <NA>
#2 b 2 true 3
2) If the types are different, apply retype (from hablar) to each element before binding:
library(hablar)
f1 %>%
map_df(~ set_names(.x, paste0("COL", seq_along(.)+1)) %>%
map(retype), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
3) Or with type.convert
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
4) If the mix of integer and numeric columns causes an issue, convert them to a common type, i.e. numeric:
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
map_if(is.integer, as.numeric) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
5) As the types are mixed up, it may be better to do the retype after converting to a single data.frame
f %>%
map_df(~ map(.x, as.character) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1") %>%
retype
data
f <- list(a = list(1, "false"), b = list(2, "true", "3"))
f1 <- list(a=list(1,"false"),b=list("2","true","3"))
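How about another simple base R solution? Note that unlist coerces mixed values to a common type, so the matrix below holds character strings.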
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
m = matrix(NA,ncol=max(sapply(f,length)),nrow=length(f))
for(i in 1:nrow(m)) {
u = unlist(f[[i]])
m[i,1:length(u)] = u
}
your_data_frame = as.data.frame(m)
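A loop-free variant of the same padding idea, which also keeps the list names as the reference column the question asks for. This is only a sketch: it uses the smaller f1 data from the answer above, converts every value to character, and assumes COL1 to COL4 are the wanted column names.

```r
f1 <- list(a = list(1, "false"), b = list("2", "true", "3"))

n <- max(lengths(f1))

# Flatten each sublist to character and pad it with NA up to length n
rows <- lapply(f1, function(x) {
  v <- as.character(unlist(x))
  length(v) <- n
  v
})

# Stack the padded rows and put the list names in front as COL1
m <- do.call(rbind, unname(rows))
out <- data.frame(COL1 = names(f1), m, stringsAsFactors = FALSE)
names(out)[-1] <- paste0("COL", seq_len(n) + 1)
```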
I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
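If the zero counts are unwanted and the count should come first, as in the requested "2 u" style, the table can be filtered before pasting. The sketch below uses base R only so that it is self-contained; count_label is a hypothetical helper name, and it could equally be called inside summarise.

```r
df <- data.frame(a = c(10, 20, 20, 20, 30),
                 b = c("u", "u", "u", "r", "r"),
                 c = c("a", "a", "b", "b", "b"))

# Hypothetical helper: count each value, drop zero counts, format as "n value"
count_label <- function(v) {
  tab <- table(v)
  tab <- tab[tab > 0]
  paste(tab, names(tab), collapse = ", ")
}

# One row per group of c: mean of a and the formatted counts of b
res <- data.frame(
  c = sort(unique(df$c)),
  a_m = as.numeric(tapply(df$a, df$c, mean)),
  counts_b = as.character(tapply(df$b, df$c, count_label))
)
```

The labels within a group follow the sort order of table(), so group b reads "2 r, 1 u" rather than "1 u, 2 r".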
Here is another solution using reshape2. The output format may be more convenient to work with, each value of b has its own column with the number of occurrences.
library(reshape2)
library(dplyr)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
I have the following data frame
>data.frame
col1 col2
A
x B
C
D
y E
I need a new data frame that looks like:
>new.data.frame
col1 col2
A
x
C
D
y
I just need a method for reading from col1 and, if there are ANY characters in col1, clearing the corresponding row value of col2. I was thinking about using an if statement and data.table for this, but I am unsure how to delete col2's values based on ANY characters being present in col1.
Something like this works:
# Create data frame
dat <- data.frame(col1=c(NA,"x", NA, NA, "y"), col2=c("A", "B", "C", "D", "E"))
# Create new data frame
dat_new <- dat
dat_new$col2[!is.na(dat_new$col1)] <- NA
# Check that it worked
dat
dat_new
This depends on what you mean by 'remove'. Here I'm assuming a blank string "". However, the same principle will apply for NAs.
## create data frame
df <- data.frame(col1 = c("", "x", "","", "y"),
col2 = LETTERS[1:5],
stringsAsFactors = FALSE)
df
# col1 col2
# 1 A
# 2 x B
# 3 C
# 4 D
# 5 y E
## subset by non-blank values in col1, and replace the values in col2
df[df$col1 != "",]$col2 <- ""
## or df$col2[df$col1 != ""] <- ""
df
# col1 col2
# 1 A
# 2 x
# 3 C
# 4 D
# 5 y
And as you mentioned data.table, the code for this would be
library(data.table)
setDT(df)
## filter by non-blank entries in col1, and update col2 by-reference (:=)
df[col1 != "", col2 := ""]
df
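If col1 might mix NA, empty strings and whitespace-only strings, one mask can cover all three cases. This is a base R sketch under that assumption; the question does not say which form the blanks take, and has_text is a name made up for the illustration.

```r
# Hypothetical col1 containing NA, "", and a whitespace-only entry
df <- data.frame(col1 = c(NA, "x", "", " ", "y"),
                 col2 = LETTERS[1:5],
                 stringsAsFactors = FALSE)

# TRUE only where col1 is not NA and still has characters after trimming
has_text <- !is.na(df$col1) & nzchar(trimws(df$col1))

# Clear col2 in exactly those rows
df$col2[has_text] <- ""
```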
Using dplyr
library(dplyr)
df %>%
mutate(col2 = replace(col2, col1!="", ""))
# col1 col2
#1 A
#2 x
#3 C
#4 D
#5 y