Delete a column in a data frame within a list - r

I made a list out of my dataframe, based on the factor levels in column A. In the list I would like to remove that column. My head is saying lapply, but not anything else :P
$A
ID Test
A 1
A 1
$B
ID Test
B 1
B 3
B 5
Into this
$A
Test
1
1
$B
Test
1
3
5

Assuming your list is called myList, something like this should work:
lapply(myList, function(x) { x["ID"] <- NULL; x })
Update
For a more general solution, you can also use something like this:
# Sample data
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Keep just the "ID" and "Value" columns
lapply(myList, function(x) x[(names(x) %in% c("ID", "Value"))])
# Drop the "ID" and "Value" columns
lapply(myList, function(x) x[!(names(x) %in% c("ID", "Value"))])

If you are a tidyverse user, an alternative solution uses the map function from the purrr package together with select from dplyr.
library(purrr)
library(dplyr)
# Create same sample data as above
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Remove column by name in each element of the list
map(myList, ~ (.x %>% select(-ID)))

We can efficiently use the bracket function "[" here.
Example
L <- replicate(3, iris[1:3, 1:4], simplify=FALSE) # example list
Delete columns by numbers
lapply(L, "[", -c(2, 3))
Delete columns by names
lapply(L, "[", -grep("Sepal.Width|Petal.Length", names(L[[1]])))
Result
# [[1]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
#
# [[2]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
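One caveat with the negative grep() index: if the pattern matches no column name, grep() returns integer(0), and indexing with -integer(0) selects zero columns rather than all of them. Below is a minimal sketch of a logical-index variant that avoids this. drop_matching is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: drop every column whose name matches `pattern`.
drop_matching <- function(df, pattern) {
  # grepl() returns one TRUE/FALSE per column, so a pattern that
  # matches nothing keeps all columns instead of dropping them all.
  df[!grepl(pattern, names(df))]
}

L <- replicate(3, iris[1:3, 1:4], simplify = FALSE)  # same example list as above

# Drop the two named columns from every data frame in the list
res <- lapply(L, drop_matching, pattern = "Sepal.Width|Petal.Length")
names(res[[1]])

# A pattern with no match leaves the data frames untouched
unchanged <- lapply(L, drop_matching, pattern = "NoSuchColumn")
ncol(unchanged[[1]])
```

The logical index costs nothing extra here and behaves the same as the grep() version whenever at least one column matches.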

If one of the data frames in the list doesn't contain the ID column, you could use map_if to remove it only where it exists.
library(purrr)
library(dplyr)
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3),
C = data.frame(Test = c(1, 3, 5),
Value = 1:3))
# The predicate is TRUE only for data frames that have an ID column
map_if(myList, ~ "ID" %in% names(.x), ~ .x %>% select(-ID))
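For comparison, here is a base R sketch of the same conditional removal, without purrr or dplyr. It rebuilds the three-element myList so the block runs on its own. The name cleaned is only illustrative. setdiff() simply ignores names that are not present, so no explicit check is needed.

```r
myList <- list(A = data.frame(ID = c("A", "A"), Test = c(1, 1), Value = 1:2),
               B = data.frame(ID = c("B", "B", "B"), Test = c(1, 3, 5), Value = 1:3),
               C = data.frame(Test = c(1, 3, 5), Value = 1:3))

# Keep every column except "ID"; data frames without "ID" pass through unchanged.
cleaned <- lapply(myList, function(x) x[setdiff(names(x), "ID")])
cleaned
```

Because lapply() returns a new list, the result has to be assigned (here to cleaned) for the change to persist.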

Related

How to add calculated columns to the source DataFrame

In SparkR (Databricks) I am able to calculate, let's say, mean value for column B based on grouped values from columns A and C as in here:
library(SparkR)
df <- createDataFrame (
list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
c("a", "b", "c", "d"))
result <- gapplyCollect(
df,
c("a", "c"),
function(key, x) {
y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
colnames(y) <- c( "key_a", "key_c", "mean_b")
y
})
Here the source data frame, df, is used to produce a new one, result, which holds mean_b for each combination of key_a and key_c.
This works fine, but how can I do the same operation WITHOUT creating a new data frame, so that mean_b is added as a new column to df?
A left join can add the result$mean_b as a new column to the original df using the aggregation keys.
Observe the following code:
library(SparkR)
df <- createDataFrame (
list(list(1L, 1, "x", 0.1), list(1L, 2, "x", 0.2), list(3L, 3, "y", 0.3)),
c("a", "b", "c", "d"))
result_schema <- structType(
structField("key_a", "integer"),
structField("key_c", "string"),
structField("mean_b", "double"))
result <- gapply(
df,
c("a", "c"),
function(key, x) {
y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
colnames(y) <- c("key_a", "key_c", "mean_b")
y
},
result_schema)
Note that I slightly changed the values of df because you had numbers as strings. Also, gapply is used instead of gapplyCollect because it returns a Spark DataFrame like df. It also requires a schema for the resulting data frame. In summary:
df:
a b c d
1 1 1 x 0.1
2 1 2 x 0.2
3 3 3 y 0.3
result:
key_a key_c mean_b
1 1 x 1.5
2 3 y 3.0
Now you can do the join of both Spark Dataframes:
df2 <- join(df, result, (df$a == result$key_a) & (df$c == result$key_c), "left")
collect(drop(df2, c("key_a", "key_c")))
The extra key columns are removed with drop, and the data is returned to the driver with collect:
a b c d mean_b
1 1 1 x 0.1 1.5
2 1 2 x 0.2 1.5
3 3 3 y 0.3 3.0
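To see what the join accomplishes, here is the same operation on an ordinary local data frame in base R. This is not SparkR code, only an analogy. The name local_df is an assumption for illustration. ave() computes the mean of b within each (a, c) group and returns it aligned with the original rows, so it can be attached directly as a new column.

```r
# Local (non-Spark) copy of the example data
local_df <- data.frame(a = c(1L, 1L, 3L),
                       b = c(1, 2, 3),
                       c = c("x", "x", "y"),
                       d = c(0.1, 0.2, 0.3))

# ave() applies mean() within each combination of a and c
# and returns one value per original row
local_df$mean_b <- ave(local_df$b, local_df$a, local_df$c, FUN = mean)
local_df
```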

R Add values to existing column in data frame by merging with other data frame

Suppose I have the following data:
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
Now I want to merge/join the data from dat2 and dat3 into dat1, one after the other, so that the dat1$y values are replaced by the y values from dat2 or dat3 instead of being added as new columns.
The problem is that merge or left_join does not write the values into the existing y column; it adds a y.y column and renames the one from dat1 to y.x.
I also thought I could use the rows_update function from the tidyverse, but in my real-life case I'm not matching by one column (here: id) but by several id columns together, and rows_update only allows the by variable to be one vector.
NOTE: in my real-life use case I have
~50 data frames to merge
the uniqueness of my rows can only be determined through multiple id columns
the id columns have different names in my dat1 and all other dat2 to dat50 data frames.
The expected output after merging dat2 and dat3 to dat1 would be:
id x y
"a" 1 9
"b" 2 8
"c" 3 7
"d" 4 6
Try indexing with %in% to test the id variables (this assumes the matching ids appear in the same order in both data frames):
#Data
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
#Code
dat1$y[dat1$id %in% dat2$id] <- dat2$y[dat2$id %in% dat1$id]
dat1$y[dat1$id %in% dat3$id] <- dat3$y[dat3$id %in% dat1$id]
Output:
id x y
1 a 1 9
2 b 2 8
3 c 3 7
4 d 4 6
You can use a loop with a list that stores the objects from dat2 to datn and then assign the values:
#Data
dat1 <- data.frame(id = c("a", "b", "c", "d"),
x = c(1, 2, 3, 4),
y = rep(NA, 4))
dat2 <- data.frame(id = c("a", "b", "c"),
y = c(9, 8, 7))
dat3 <- data.frame(id = c("d"),
y = c(6))
#Store Objects in a list
List <- list(dat2,dat3)
#Loop
for(i in seq_along(List))
{
#Data
df <- List[[i]]
#Assign
dat1$y[dat1$id %in% df$id] <- df$y[df$id %in% dat1$id]
}
Output:
dat1
id x y
1 a 1 9
2 b 2 8
3 c 3 7
4 d 4 6
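The %in% assignment above relies on the matching rows appearing in the same order in both data frames, and it uses a single id column. Below is a minimal sketch that lifts both restrictions with match() on a composite key. The helper update_by_keys, the column names id1/id2/k1/k2, and the "\r" separator are all assumptions made for illustration.

```r
# Hypothetical helper: overwrite target[[value]] with source[[value]]
# for rows whose key columns match. Key columns may be named differently.
update_by_keys <- function(target, source, target_keys, source_keys, value = "y") {
  # Build one composite key per row; "\r" is assumed not to occur in the ids.
  key_t <- do.call(paste, c(unname(as.list(target[target_keys])), sep = "\r"))
  key_s <- do.call(paste, c(unname(as.list(source[source_keys])), sep = "\r"))
  idx <- match(key_t, key_s)   # source row for each target row, NA if none
  hit <- !is.na(idx)
  target[[value]][hit] <- source[[value]][idx[hit]]
  target
}

# Example with two id columns, different names, and shuffled row order
dat1 <- data.frame(id1 = c("a", "a", "b", "b"), id2 = c(1, 2, 1, 2),
                   x = 1:4, y = NA_real_)
dat2 <- data.frame(k1 = c("b", "a"), k2 = c(1, 2), y = c(8, 9))
dat3 <- data.frame(k1 = c("b", "a"), k2 = c(2, 1), y = c(6, 7))

for (src in list(dat2, dat3)) {
  dat1 <- update_by_keys(dat1, src, c("id1", "id2"), c("k1", "k2"))
}
dat1
```

With ~50 data frames, the same loop runs over a longer list. If the key columns differ between sources, the key names could be stored alongside each data frame.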
You can put the data frames in a list and left_join them using reduce. If every row has only one non-NA y value, we can use rowSums/rowMeans, ignoring NA values.
library(dplyr)
mget(paste0('dat', 1:3)) %>%
purrr::reduce(left_join, by = 'id') %>%
mutate(y = rowSums(select(., starts_with('y')), na.rm = TRUE)) %>%
select(id, x, y)
# id x y
#1 a 1 9
#2 b 2 8
#3 c 3 7
#4 d 4 6
A very simple answer - but maybe not too generalizable - would be:
dat1$y = c(dat2$y, dat3$y)
With a loop, to do this to several data frames:
newy = numeric()
for(i in 2:ndf){ # Where "ndf" is the number of data frames you have
newy = c(newy, eval(parse(text=paste("dat",i,"$y",sep=""))))}
Note: evaluating objects from strings with eval(parse(text=...)) normally isn't the best way to do it in R. It would probably be better if the data frames were created together in a list (listing them afterwards would be very manual, at least to my knowledge), in which case the loop would be:
newy = numeric()
for(i in 2:ndf){
newy = c(newy, df.list[[i]]$y)}
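If the data frames already exist as separate objects named dat2, dat3, and so on, mget() can collect them into a list by name without eval(parse(...)). This is a sketch under that naming assumption. Like the loop above, it also assumes the concatenated y values line up with the row order of dat1.

```r
dat2 <- data.frame(id = c("a", "b", "c"), y = c(9, 8, 7))
dat3 <- data.frame(id = "d", y = 6)

# Look up dat2 and dat3 by name; envir = environment() searches where this code runs
df.list <- mget(paste0("dat", 2:3), envir = environment())

# Pull the y column from each data frame and concatenate the pieces
newy <- unlist(lapply(df.list, `[[`, "y"), use.names = FALSE)
newy
```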

bind rows on list of elements to list of data.frame

I have a list of R elements and want to row-bind the elements within the list.
Rows should be bound into the same data.frame based on the column Class.
The actual data is quite large and each class has different columns. Here is a sample:
df_list <- list()
df_list[[1]] <- data.frame(Class = "x", y = 1, stringsAsFactors = F)
df_list[[2]] <- data.frame(Class = "x", y = 2, stringsAsFactors = F)
df_list[[3]] <- data.frame(Class = "a", y = 3, stringsAsFactors = F)
df_list[[4]] <- data.frame(Class = "x", y = 4, stringsAsFactors = F)
df_list[[5]] <- data.frame(Class = "a", y = 5, stringsAsFactors = F)
Desired output; I am looking for this to be done programmatically:
df_list_out <- list()
df_list_out[[1]] <- bind_rows(data.frame(Class = "x", y = 1,
stringsAsFactors = F),
data.frame(Class = "x", y = 2,
stringsAsFactors = F),
data.frame(Class = "x", y = 4,
stringsAsFactors = F))
df_list_out[[2]] <- bind_rows(data.frame(Class = "a", y = 3,
stringsAsFactors = F),
data.frame(Class = "a", y = 5,
stringsAsFactors = F))
One way would be to rbind the list of dataframes together and then split
temp <- do.call(rbind, df_list)
split(temp, temp$Class)
#$a
# Class y
#3 a 3
#5 a 5
#$x
# Class y
#1 x 1
#2 x 2
#4 x 4
In dplyr, we can do
library(dplyr)
df_list %>% bind_rows() %>% group_split(Class)
You could lapply() over a vector of "Class"es and thus achieve that only one "Class" is processed at a time.
lapply(c("x", "a"), function(x) do.call(rbind, df_list[Map(`[[`, df_list, "Class") == x]))
# [[1]]
# Class y
# 1 x 1
# 2 x 2
# 3 x 4
#
# [[2]]
# Class y
# 1 a 3
# 2 a 5
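Since the question mentions that each class has different columns in the real data, binding everything first may not be possible with rbind. Below is a sketch that groups the list by class first and only binds data frames of the same class, without hard-coding the class names. It assumes every element carries a single Class value. The names cls and out are illustrative.

```r
df_list <- list(
  data.frame(Class = "x", y = 1),
  data.frame(Class = "x", y = 2),
  data.frame(Class = "a", y = 3),
  data.frame(Class = "x", y = 4),
  data.frame(Class = "a", y = 5)
)

# One class label per list element (assumes Class is constant within an element)
cls <- vapply(df_list, function(d) as.character(d$Class[1]), character(1))

# Group the list by class, then row-bind within each group only
out <- lapply(split(df_list, cls), function(g) do.call(rbind, g))
out
```

The result is a named list with one data frame per class, so out$x and out$a can be accessed directly.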

How to concatenate different values in different columns in a new column?

suppose I have something like this:
dat <- data.frame(ID = c("A", "B", "C"),
value = c(1, 2, 3))
I would like to add an extra column in which I have a value like this:
[A,1]
It would be a column in which each value is the concatenation of "[" + A (value in the first column) + "," + B (value in the second column) + "]".
How can I do it? I tried with paste but I am doing something wrong.
Here's an approach that works with any number of columns:
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Using your example:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
ID value conc
1 A 1 [A,1]
2 B 2 [B,2]
3 C 3 [C,3]
Or if we have a dataframe with more columns:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3), value2 = c(4, 5, 6))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
ID value value2 conc
1 A 1 4 [A,1,4]
2 B 2 5 [B,2,5]
3 C 3 6 [C,3,6]
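One thing to watch with apply() on a data frame: it first converts the data frame to a character matrix, and numeric columns may be padded to a common width when their values differ in length (for example 1, 10, 100). Below is a sketch of an alternative that pastes the columns directly with do.call(), which also works for any number of columns. The example values and the names by_apply and by_paste are chosen for illustration.

```r
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 10, 100))

# apply() goes through as.matrix(), which can pad numbers with spaces
by_apply <- paste0("[", apply(dat, 1, paste, collapse = ","), "]")

# do.call(paste, ...) treats each column as a separate vector, so no padding occurs
by_paste <- paste0("[", do.call(paste, c(dat, sep = ",")), "]")

by_apply
by_paste
```

Note that by_paste must be computed before the new column is added to dat, otherwise the new column would be included in later pastes.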
Assuming this is your data
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
This would work
dat$concat <- paste0("[", dat$ID, ", ", dat$value, "]")
ID value concat
1 A 1 [A, 1]
2 B 2 [B, 2]
3 C 3 [C, 3]
Or if you did not want the space after the comma:
dat$concat <- paste0("[", dat$ID, ",", dat$value, "]")
