How to add calculated columns to the source DataFrame - r

In SparkR (Databricks) I can calculate, say, the mean of column b grouped by the values of columns a and c, like this:
library(SparkR)
df <- createDataFrame (
list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
c("a", "b", "c", "d"))
result <- gapplyCollect(
df,
c("a", "c"),
function(key, x) {
y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
colnames(y) <- c( "key_a", "key_c", "mean_b")
y
})
Here the source data frame, df, is used to produce a new one, result, which holds mean_b for each combination of key_a and key_c.
This works fine, but how can I do the same operation WITHOUT creating a new data frame, so that mean_b is added as a new column to df?

A left join can add the result$mean_b as a new column to the original df using the aggregation keys.
Observe the following code:
library(SparkR)
df <- createDataFrame (
list(list(1L, 1, "x", 0.1), list(1L, 2, "x", 0.2), list(3L, 3, "y", 0.3)),
c("a", "b", "c", "d"))
result_schema <- structType(
structField("key_a", "integer"),
structField("key_c", "string"),
structField("mean_b", "double"))
result <- gapply(
df,
c("a", "c"),
function(key, x) {
y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
colnames(y) <- c("key_a", "key_c", "mean_b")
y
},
result_schema)
Note that I slightly changed the values of df because you had numbers as strings. Also, gapply is used instead of gapplyCollect; it returns a Spark DataFrame like df and requires a schema for the resulting data frame. In summary:
df:
a b c d
1 1 1 x 0.1
2 1 2 x 0.2
3 3 3 y 0.3
result:
key_a key_c mean_b
1 1 x 1.5
2 3 y 3.0
Now you can do the join of both Spark Dataframes:
df2 <- join(df, result, (df$a == result$key_a) & (df$c == result$key_c), "left")
The extra columns can be removed with drop and the data returned to the driver with collect:
collect(drop(df2, c("key_a", "key_c")))
a b c d mean_b
1 1 1 x 0.1 1.5
2 1 2 x 0.2 1.5
3 3 3 y 0.3 3.0
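For readers without a Spark session at hand, the same aggregate-then-left-join pattern can be tried on a local data.frame. This is only a base R sketch of the idea, not SparkR code; the names df_local, result_local and df2_local are made up for illustration.

```r
# Local stand-in for the Spark DataFrame used above (hypothetical name).
df_local <- data.frame(
  a = c(1L, 1L, 3L),
  b = c(1, 2, 3),
  c = c("x", "x", "y"),
  d = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)

# Step 1: one row per (a, c) group holding the mean of b.
result_local <- aggregate(b ~ a + c, data = df_local, FUN = mean)
names(result_local)[names(result_local) == "b"] <- "mean_b"

# Step 2: left join the group means back onto every source row.
df2_local <- merge(df_local, result_local, by = c("a", "c"), all.x = TRUE)
df2_local
```

Because the join keys have the same names on both sides here, merge keeps a single copy of a and c, so no separate drop step is needed.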

Related

R Add values to existing column in data frame by merging with other data frame

Suppose I have the following data:
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
Now, I want to merge/join the data from dat2 and dat3 into dat1, one after the other, so that the dat1$y values are replaced by the dat2$y or dat3$y values instead of being added as new columns.
The problem is that merge or left_join would not add the values to the existing y column, but would add a y.y column and rename the one from dat1 to y.x.
I also thought I could use the rows_update function from the tidyverse. However, in my real-life case I'm not matching by one column (here: id) but by several id columns together, and rows_update only allows the by variable to be one vector.
NOTE: in my real-life use case I have ~50 data frames to merge, the uniqueness of my rows can only be determined through multiple id columns, and the id columns have different names in dat1 than in all the other data frames (dat2 to dat50).
The expected output after merging dat2 and dat3 to dat1 would be:
id x y
"a" 1 9
"b" 2 8
"c" 3 7
"d" 4 6
Try indexing with %in% to test the id variables. Note that this relies on the matching ids appearing in the same order in both data frames:
#Data
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
#Code
dat1$y[dat1$id %in% dat2$id] <- dat2$y[dat2$id %in% dat1$id]
dat1$y[dat1$id %in% dat3$id] <- dat3$y[dat3$id %in% dat1$id]
Output:
id x y
1 a 1 9
2 b 2 8
3 c 3 7
4 d 4 6
You can use a loop with a list to store the objects from dat2 to datn and then assign the values:
#Data
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
#Store Objects in a list
List <- list(dat2,dat3)
#Loop
for(i in seq_along(List))
{
#Data
df <- List[[i]]
#Assign
dat1$y[dat1$id %in% df$id] <- df$y[df$id %in% dat1$id]
}
Output:
dat1
id x y
1 a 1 9
2 b 2 8
3 c 3 7
4 d 4 6
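The %in% assignments above depend on matching rows appearing in the same order on both sides. A sketch that looks each row up by key with match() avoids that assumption and also handles several id columns with different names, as described in the question. The helper update_y, the columns id1/id2/key1/key2 and the "|" separator are assumptions made for this example; the separator must not occur in the id values.

```r
# Hypothetical data: rows are identified by two id columns,
# named differently in the target and in the sources.
target <- data.frame(id1 = c("a", "a", "b", "b"),
                     id2 = c(1, 2, 1, 2),
                     x = 1:4,
                     y = NA_real_)

sources <- list(
  data.frame(key1 = c("b", "a"), key2 = c(1, 2), y = c(8, 9)),
  data.frame(key1 = c("a", "b"), key2 = c(1, 2), y = c(7, 6))
)

update_y <- function(target, source, target_ids, source_ids) {
  # Build one composite key per row from the id columns.
  target_key <- do.call(paste, c(unname(as.list(target[target_ids])), sep = "|"))
  source_key <- do.call(paste, c(unname(as.list(source[source_ids])), sep = "|"))
  # For each target row, the position of its key in the source (NA if absent).
  idx <- match(target_key, source_key)
  hit <- !is.na(idx)
  # Overwrite y only where the source has a matching row.
  target$y[hit] <- source$y[idx[hit]]
  target
}

for (src in sources) {
  target <- update_y(target, src, c("id1", "id2"), c("key1", "key2"))
}
target
```

Rows with no match in a given source keep their current y value, so the sources can be applied one after the other in any order as long as their keys do not overlap.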
You can put the data frames in a list and left_join them using reduce. If every row has only one y value, we can use rowSums/rowMeans, ignoring NA values.
library(dplyr)
mget(paste0('dat', 1:3)) %>%
purrr::reduce(left_join, by = 'id') %>%
mutate(y = rowSums(select(., starts_with('y')), na.rm = TRUE)) %>%
select(id, x, y)
# id x y
#1 a 1 9
#2 b 2 8
#3 c 3 7
#4 d 4 6
A very simple answer - but maybe not too generalizable - would be:
dat1$y = c(dat2$y, dat3$y)
With a loop, to do this to several data frames:
newy = numeric()
for(i in 2:ndf){ # Where "ndf" is the number of data frames you have
newy = c(newy, eval(parse(text=paste("dat",i,"$y",sep=""))))}
OBS: evaluating objects from strings with eval(parse(text=...)) normally isn't the best way to do it in R. It would probably be better to create the data frames together in a list (listing them afterwards would be very manual, at least to my knowledge), in which case the loop would be:
newy = numeric()
for(i in 2:ndf){
newy = c(newy, df.list[[i]]$y)}

Add a column to dataframe in R based on greater/less than condition based on the values in existing columns

I have a dataframe and want to create a new column z populated with a value of "tw" or "ok" for each record. If x > y, z = "ok"; if x < y, z = "tw".
x y
a 1 2
b 2 3
c 5 1
result
x y z
a 1 2 tw
b 2 3 tw
c 5 1 ok
Maybe you can try ifelse() like below
df <- within(df,z <- ifelse(x>y,"ok","tw"))
If you do not define the output for the case "x==y", maybe you should add the following line
df$z[df$x==df$y] <- NA
Alternatively, you can do it on the dataframe directly:
# creation of dataframe
df = data.frame("x" = c(1, 2, 5), "y" = c(2, 3, 1))
# column creation of z
df$z[(df$x > df$y)] <- "ok"
df$z[(df$x < df$y)] <- "tw"
We can do this directly:
df$z <- c("tw", "ok")[(df$x > df$y) + 1]
df
# x y z
#a 1 2 tw
#b 2 3 tw
#c 5 1 ok
Not exactly clear what you want to do when x == y (the code above assigns "tw" in that case, because x > y is FALSE).
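To see how ties behave, a small sketch with one extra row where x == y can help. The data frame df_tie and the row name "e" are invented for this illustration.

```r
# Same data as in the question plus a hypothetical tie row "e".
df_tie <- data.frame(x = c(1, 2, 5, 4),
                     y = c(2, 3, 1, 4),
                     row.names = c("a", "b", "c", "e"))

# Logical indexing: FALSE + 1 = 1 picks "tw", TRUE + 1 = 2 picks "ok".
z_index <- c("tw", "ok")[(df_tie$x > df_tie$y) + 1]

# Nested ifelse: ties are left as NA instead of falling into "tw".
z_ifelse <- ifelse(df_tie$x > df_tie$y, "ok",
                   ifelse(df_tie$x < df_tie$y, "tw", NA_character_))

data.frame(df_tie, z_index, z_ifelse)
```

The two columns differ only in row "e", which makes the choice for the x == y case explicit.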
We can also use case_when from dplyr to assign values based on various conditions.
library(dplyr)
df %>%
mutate(z = case_when(x > y ~"ok",
x < y ~"tw",
TRUE ~ NA_character_))
data
df <- structure(list(x = c(1L, 2L, 5L), y = c(2L, 3L, 1L), z = c("tw",
"tw", "ok")), row.names = c("a", "b", "c"), class = "data.frame")

Removing same columns across tables in R [duplicate]

I made a list out of my dataframe, based on the factor levels in column A. In the list I would like to remove that column. My head is saying lapply, but not anything else :P
$A
ID Test
A 1
A 1
$B
ID Test
B 1
B 3
B 5
Into this
$A
Test
1
1
$B
Test
1
3
5
Assuming your list is called myList, something like this should work:
lapply(myList, function(x) { x["ID"] <- NULL; x })
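If the list was created with split(), which the question suggests but does not show, the grouping column can also be dropped before splitting. The data frame full_df below is an assumed reconstruction of the question's data.

```r
# Assumed source data frame, before it was turned into a list.
full_df <- data.frame(ID = c("A", "A", "B", "B", "B"),
                      Test = c(1, 1, 1, 3, 5))

# Split only the non-ID columns, grouped by ID.
# drop = FALSE keeps a one-column data frame from collapsing to a vector.
myList <- split(full_df[, setdiff(names(full_df), "ID"), drop = FALSE],
                full_df$ID)
myList
```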
Update
For a more general solution, you can also use something like this:
# Sample data
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Keep just the "ID" and "Value" columns
lapply(myList, function(x) x[(names(x) %in% c("ID", "Value"))])
# Drop the "ID" and "Value" columns
lapply(myList, function(x) x[!(names(x) %in% c("ID", "Value"))])
If you are a tidyverse user, there is an alternative solution that uses the map function from the purrr package.
library(dplyr)
library(purrr)
# Create same sample data as above
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Remove column by name in each element of the list
map(myList, ~ (.x %>% select(-ID)))
We can efficiently use the bracket function "[" here.
Example
L <- replicate(3, iris[1:3, 1:4], simplify=FALSE) # example list
Delete columns by numbers
lapply(L, "[", -c(2, 3))
Delete columns by names
lapply(L, "[", -grep(c("Sepal.Width|Petal.Length"), names(L[[1]])))
Result
# [[1]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
#
# [[2]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
If you had a data frame that didn't contain the ID column, you could use map_if to remove it only where it exists.
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3),
C = data.frame(Test = c(1, 3, 5),
Value = 1:3))
# map_if has no .depth argument, so it is omitted here
map_if(myList, ~ "ID" %in% names(.x), ~ .x %>% select(-ID))
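The same conditional removal can be written in base R without purrr. This is a sketch using setdiff on the column names, which leaves data frames without an ID column unchanged; the name cleaned is chosen here for illustration.

```r
myList <- list(A = data.frame(ID = c("A", "A"), Test = c(1, 1), Value = 1:2),
               B = data.frame(ID = c("B", "B", "B"), Test = c(1, 3, 5), Value = 1:3),
               C = data.frame(Test = c(1, 3, 5), Value = 1:3))

# Keep every column except "ID"; setdiff is a no-op when "ID" is absent.
cleaned <- lapply(myList, function(x) x[setdiff(names(x), "ID")])
lapply(cleaned, names)
```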

How to concatenate different values in different columns in a new column?

Suppose I have something like this:
dat <- data.frame(ID = c("A", "B", "C"),
value = c(1, 2, 3))
I would like to add an extra column in which I have a value like this:
[A,1]
It would be a column in which each value is the concatenation of "[" + A (the value in the first column) + "," + B (the value in the second column) + "]".
How can I do it? I tried with paste but I am doing something wrong.
Here's an approach that works with any number of columns:
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Using your example:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
ID value conc
1 A 1 [A,1]
2 B 2 [B,2]
3 C 3 [C,3]
Or if we have a dataframe with more columns:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3), value2 = c(4, 5, 6))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
ID value value2 conc
1 A 1 4 [A,1,4]
2 B 2 5 [B,2,5]
3 C 3 6 [C,3,6]
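One caveat with apply() is that it first converts the data frame to a character matrix, and numeric columns may be padded with spaces when their values have different widths. A sketch that pastes the columns directly avoids this; the values 1, 10 and 100 are chosen here only to make the difference visible.

```r
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 10, 100))

# Paste the columns element-wise; each column is converted to character
# on its own, so no width padding is introduced.
dat$conc <- paste0("[", do.call(paste, c(unname(as.list(dat)), sep = ",")), "]")
dat$conc
# "[A,1]"   "[B,10]"  "[C,100]"
```

Like the apply() version, this uses whatever columns are present when it runs, so it should be called before conc itself has been added.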
Assuming this is your data
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
This would work
dat$concat <- paste0("[", dat$ID, ", ", dat$value, "]")
ID value concat
1 A 1 [A, 1]
2 B 2 [B, 2]
3 C 3 [C, 3]
Or if you did not want the space after the comma:
dat$concat <- paste0("[", dat$ID, ",", dat$value, "]")
