I have a list containing a number of data frames, all with the same number of columns.
E.g., for a list df_list with two data frames, df1 and df2:
>df_list
df1
a b c
1 1 1
2 2 2
3 3 3
df2
a b c
3 2 1
3 2 1
3 2 1
I want to rename the headers of every data frame to new_headings <- c("A", "B", "C").
I constructed a for loop:
for (i in 1:length(list)) {
names(list[[i]]) <- new_headings
}
However, this doesn't work: the headings remain as they were. Doing it individually instead of in a loop works fine, though; e.g., names(list[[1]]) <- new_headings changes the headings appropriately.
My actual list is very long with many data frames. Can anyone explain why this isn't working or what other approach I can use? Thank you.
We can use Map with setNames
df_listNew <- Map(setNames, df_list, list(new_headings))
Or using lapply
lapply(df_list, setNames, new_headings)
#$df1
# A B C
#1 1 1 1
#2 2 2 2
#3 3 3 3
#$df2
# A B C
#1 3 2 1
#2 3 2 1
#3 3 2 1
data
df_list <- list(df1 = structure(list(a = 1:3, b = 1:3, c = 1:3),
class = "data.frame", row.names = c(NA,
-3L)), df2 = structure(list(a = c(3, 3, 3), b = c(2, 2, 2), c = c(1,
1, 1)), class = "data.frame", row.names = c(NA, -3L)))
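Note that neither call changes df_list in place; each one returns a new list. A minimal sketch of writing the result back, assuming the df_list and new_headings from the question (rebuilt here with data.frame() so the block runs on its own):

```r
df_list <- list(
  df1 = data.frame(a = 1:3, b = 1:3, c = 1:3),
  df2 = data.frame(a = c(3, 3, 3), b = c(2, 2, 2), c = c(1, 1, 1))
)
new_headings <- c("A", "B", "C")

# lapply() returns a new list, so assign it back to keep the change
df_list <- lapply(df_list, setNames, new_headings)

names(df_list)      # the list names (df1, df2) are preserved by lapply()
names(df_list$df1)  # the column names are now the new headings
```

Overwriting df_list is a choice made for this sketch; assigning to a new name such as df_listNew, as above, keeps the original list untouched.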
You can use two for loops
a<-c(1,2,3)
b<-c(1,2,3)
c<-c(1,2,3)
df1<-as.data.frame(cbind(a,b,c))
a<-c(3,2,1)
b<-c(3,2,1)
c<-c(3,2,1)
df2<-as.data.frame(cbind(a,b,c))
df_list<-list(df1,df2)
new_headings <- c("A", "B", "C")
for (i in 1:length(df_list)) {
for (j in 1:length(df_list[[i]])) {
colnames(df_list[[i]])[j] <- new_headings[j]
}
}
df_list
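The inner loop is not strictly needed, since names() can replace all the headings of one data frame at once. A minimal sketch of a single loop, assuming every data frame has exactly as many columns as new_headings has entries:

```r
df_list <- list(
  data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, 3)),
  data.frame(a = c(3, 2, 1), b = c(3, 2, 1), c = c(3, 2, 1))
)
new_headings <- c("A", "B", "C")

# seq_along() is safe even when the list is empty, unlike 1:length()
for (i in seq_along(df_list)) {
  # replace all column names of the i-th data frame in one assignment
  names(df_list[[i]]) <- new_headings
}

df_list
```

The loop has to index the list object itself (df_list here) so that the assignment is stored back into the list.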
I have a data frame:
df = data.frame(a = c(1,2,3), b = c(4,5,6))
I want to add two columns holding the distance from the mean of a:
a b diff_a diff_b
1 4 -1 2
2 5 0 3
3 6 1 4
I don't want to write the columns separately in mutate, as that would calculate the mean multiple times (mean is just an example here; my actual function takes a long time to run). I want to use a single function like
calculates <- function(a, b){
e_a <- mean(a)
return(list(a - e_a, b - e_a))
}
We may need
library(dplyr)
df %>%
mutate(Meana = mean(a), across(a:b,
~ . - Meana, .names = "diff_{.col}"), Meana = NULL)
-output
a b diff_a diff_b
1 1 4 -1 2
2 2 5 0 3
3 3 6 1 4
data
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6)),
class = "data.frame", row.names = c(NA,
-3L))
You may return a named list from the function and use cbind to add new columns to the dataframe.
df = data.frame(a = c(1,2,3), b = c(4,5,6))
calculates <- function(a, b){
e_a <- mean(a)
return(list(diff_a = a - e_a, diff_b = b - e_a))
}
cbind(df, calculates(df$a, df$b))
# a b diff_a diff_b
#1 1 4 -1 2
#2 2 5 0 3
#3 3 6 1 4
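As a variation on the same idea, the named list can also be assigned straight into the data frame instead of going through cbind. This is only a sketch in base R, reusing the calculates function from the answer above; the intermediate name res is introduced here for illustration:

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

calculates <- function(a, b) {
  e_a <- mean(a)  # the expensive step runs only once
  list(diff_a = a - e_a, diff_b = b - e_a)
}

res <- calculates(df$a, df$b)
df[names(res)] <- res  # one new column per list element, named after it
df
```

This modifies df directly, whereas cbind returns a new data frame and leaves df as it was.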
I have a list of data frames (lst1). Each data frame in lst1 has some variables that look like test.x, test.y, try.x, try.y, etc.
I want to remove those variables, which were created by merging datasets without first dropping the shared variables (try, test, etc.). How can I filter them out now?
Thanks.
You can also try this:
#Data
List <- list(A=data.frame(a=1,b=5,test.x=NA,test.y=5),
B=data.frame(a=5,b=6,test.x=NA,try.x=7))
#Remove
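myfun <- function(x)
{
# Flag columns whose names end in ".x" or ".y" (the dot is escaped, so it is literal)
i <- grepl('\\.[xy]$', names(x))
# A logical mask keeps every column when nothing matches
x <- x[, !i, drop = FALSE]
return(x)
}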
#Apply
List2 <- lapply(List,myfun)
Output:
List2
$A
a b
1 1 5
$B
a b
1 5 6
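One thing to watch with index-based removal in general: if which() finds no matching columns, negating the empty index selects nothing rather than everything. A small sketch of the difference, using a hypothetical data frame that has no suffixed columns at all:

```r
x <- data.frame(a = 1, b = 5)  # no ".x" or ".y" columns here

i <- which(grepl("\\.[xy]$", names(x)))  # integer(0): nothing matched
dropped <- x[, -i, drop = FALSE]         # -integer(0) selects no columns

keep <- !grepl("\\.[xy]$", names(x))     # logical mask, all TRUE here
kept <- x[, keep, drop = FALSE]          # every column survives

ncol(dropped)  # 0
ncol(kept)     # 2
```

A logical mask therefore tends to be the safer choice when some data frames in the list may contain no merged columns.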
Here's a tidyverse approach:
We can use the dplyr::select function to select only the columns we want. matches() allows us to select columns using regular expressions. \\.[xy]$ matches columns that contain a period followed by x or y and $ anchors the match to the end of the string.
The purrr::map function allows us to apply the selection to each list element. ~ defines a formula which is automatically converted to a function.
library(tidyverse)
lst2 <- lst1 %>%
map(~dplyr::select(.,-matches("\\.[xy]$")))
map(lst2, head, 2)
#[[1]]
# ID name
#1 1 A
#2 2 B
#[[2]]
# ID name
#1 1 A
#2 2 B
#[[3]]
# ID name
#1 1 A
#2 2 B
#[[4]]
# ID name
#1 1 A
#2 2 B
#[[5]]
# ID name
#1 1 A
#2 2 B
Sample Data:
lst1 <- replicate(5,data.frame(ID = 1:15, name = LETTERS[1:15], test.x = runif(15), test.y = runif(15)),simplify = FALSE)
map(lst1, head, 2)
#[[1]]
# ID name test.x test.y
#1 1 A 0.03772391 0.2630905
#2 2 B 0.11844048 0.2929392
#[[2]]
# ID name test.x test.y
#1 1 A 0.398029 0.5151159
#2 2 B 0.348489 0.9534869
#[[3]]
# ID name test.x test.y
#1 1 A 0.7447383 0.6862136
#2 2 B 0.3623562 0.7542699
#
#[[4]]
# ID name test.x test.y
#1 1 A 0.9341495 0.8660333
#2 2 B 0.8383039 0.6299427
#[[5]]
# ID name test.x test.y
#1 1 A 0.02662444 0.04502225
#2 2 B 0.29855214 0.46189116
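The regular expression passed to matches() can be checked on its own in base R with grepl. The sketch below uses made-up column names (max and x.yz are hypothetical) to show what the escaped dot and the $ anchor buy compared with a looser pattern:

```r
nms <- c("ID", "name", "test.x", "try.y", "max", "x.yz")

# Escaped dot, anchored at the end: only true ".x" / ".y" suffixes match
strict <- grepl("\\.[xy]$", nms)
strict
# FALSE FALSE  TRUE  TRUE FALSE FALSE

# Unescaped dot matches any character, anywhere in the name
loose <- grepl(".x|.y", nms)
loose
# FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

With the looser pattern, a column such as max would be removed as well, which is usually not what is wanted.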
In base R, we can use endsWith
lapply(List, function(x) x[!(endsWith(names(x),
'.x')|endsWith(names(x), '.y'))])
-output
#$A
# a b
#1 1 5
#$B
# a b
#1 5 6
data
List <- list(A = structure(list(a = 1, b = 5, test.x = NA, test.y = 5), class = "data.frame", row.names = c(NA,
-1L)), B = structure(list(a = 5, b = 6, test.x = NA, try.x = 7), class = "data.frame", row.names = c(NA,
-1L)))
I have 5 data sets, each containing some columns. The data sets share some column names, but not every column is present in every data set. Whenever a column name that appears in at least one data set is missing from another, I want to create a column of all zeros with that name in that data set, so that all the data sets end up with the same number of columns (and the same column names).
Put the data frames in a list, get all the unique column names present across the data frames, and add each absent column to each data frame, filled with 0.
all_names <- unique(unlist(sapply(list_df, names)))
lst1 <- lapply(list_df, function(x) {x[setdiff(all_names, names(x))] <- 0;x})
lst1
#[[1]]
# a b c
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[2]]
# a c b
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[3]]
# a c b
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
If you need separate dataframes you can use lst1[[1]], lst1[[2]] individually again.
data
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 1:5, c = 6:10)
df3 <- data.frame(a = 1:5, c = 6:10, b = 11:15)
list_df <- list(df1, df2, df3)
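In the output above the data frames end up with the same columns but not in the same order (a b c versus a c b). If a shared order matters, for example before stacking with rbind, one possible extension is to index by all_names at the end. A sketch under that assumption, using the same sample data:

```r
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 1:5, c = 6:10)
df3 <- data.frame(a = 1:5, c = 6:10, b = 11:15)
list_df <- list(df1, df2, df3)

all_names <- unique(unlist(lapply(list_df, names)))

lst1 <- lapply(list_df, function(x) {
  x[setdiff(all_names, names(x))] <- 0  # add the absent columns as zeros
  x[all_names]                          # then put the columns in one shared order
})

lapply(lst1, names)
```

The order used is simply the order in which the names are first encountered; sort(all_names) would give an alphabetical layout instead.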
We can use a for loop to do this
un1 <- Reduce(union, lapply(lst1, names))
for(i in seq_along(lst1)) lst1[[i]][setdiff(un1, names(lst1[[i]]))] <- 0
data
lst1 <- list(structure(list(a = 1:5, b = 6:10, c = c(0, 0, 0, 0, 0)),
row.names = c(NA,
-5L), class = "data.frame"), structure(list(a = 1:5, c = 6:10,
b = c(0, 0, 0, 0, 0)),
row.names = c(NA, -5L), class = "data.frame"),
structure(list(a = 1:5, c = 6:10, b = 11:15),
class = "data.frame", row.names = c(NA,
-5L)))
I would use dplyr's bind_rows, which automatically fills missing values with NA. If you include .id = "df_id" a column will be added connecting each row to the original dataframe:
library(dplyr)
bind_rows(df1, df2, df3, .id = "df_id")
#### OUTPUT ####
df_id x y z
1 1 1 2 NA
2 2 3 NA 4
3 3 NA 5 6
If you want 0s instead of NAs, just run df[is.na(df)] <- 0. If you want a more informative df_id column, you can pass in a named list:
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = "df_id")
#### OUTPUT ####
df_id x y z
1 df1 1 2 NA
2 df2 3 NA 4
3 df3 NA 5 6
If you want your dataframes separate then simply split by df_id, which generates a list of dataframes:
df <- bind_rows(df1, df2, df3, .id = "df_id")
split(df, df$df_id)
#### OUTPUT ####
$`1`
df_id x y z
1 1 1 2 NA
$`2`
df_id x y z
2 2 3 NA 4
$`3`
df_id x y z
3 3 NA 5 6
Data:
df1 <- data.frame(x = 1, y = 2)
df2 <- data.frame(x = 3, z = 4)
df3 <- data.frame(y = 5, z = 6)
In addition to the previous answers, you can use the bind_rows function in order to quickly combine all your data frames, which will take care of differences in column names:
library(dplyr)
x <- data.frame(
a = 1:3,
b = 4:6
)
y <- data.frame(
a = 4:7
)
z <- data.frame(
c = 8:10
)
xyz <- bind_rows(x, y, z)
xyz %>% replace(., is.na(.), 0)
This is a follow-up to an earlier question.
Imagine the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("A",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
which gives
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 1
6 B 1
7 A 2
8 A 2
I reduce it to its unique rows with:
df_unique <- unique(df)
Now I am wondering how I can keep track of the merged rows. I would like to create a new column in which each entry holds the list of row names that were merged. Something like the following:
df_unique_informative =
a b track
1 A 1 [1,2]
3 A 2 [3,7,8]
4 B 4 [4]
5 B 1 [5,6]
res = aggregate(x = list(track = 1:NROW(df)), by = list(a = df$a, b = df$b), function(x) x)
# OR perhaps you want
#res = aggregate(x = list(track = 1:NROW(df)), by = list(a = df$a, b = df$b), function(x)
# paste(x, collapse = ", "))
res
# a b track
#1 A 1 1, 2
#2 B 1 5, 6
#3 A 2 3, 7, 8
#4 B 4 4
#Shorter code
res = aggregate(list(track = 1:NROW(df)), df[,1:2], '[')
Update
a <- c(rep("A", 3), rep("B", 3), rep("A",2))
b <- c(1,1,2,4,1,1,2,2)
c = letters[1:8]
df <-data.frame(a,b,c, stringsAsFactors = FALSE)
res = aggregate(x = list(track = 1:NROW(df)), by = list(a = df$a, b = df$b), function(x) df$c[x])
res
# a b track
#1 A 1 a, b
#2 B 1 e, f
#3 A 2 c, g, h
#4 B 4 d
Here is one option with tidyverse
library(tidyverse)
rownames_to_column(df, 'rn') %>%
group_by(a, b) %>%
summarise(track = list(rn))
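If the result should keep the row names of unique(df), as in the desired output in the question, the tracked row numbers can also be attached to the de-duplicated frame as a list column. This is a base R sketch rather than part of either answer; key, groups and first are names introduced here, and it assumes the values contain no "." that could make two different (a, b) pairs collide in the combined key:

```r
a <- c(rep("A", 3), rep("B", 3), rep("A", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

key <- interaction(df$a, df$b, drop = TRUE)  # one level per distinct (a, b) pair
groups <- split(seq_len(nrow(df)), key)      # row numbers belonging to each pair

first <- !duplicated(df)                     # the rows that unique(df) keeps
df_unique <- df[first, ]

# Look up each kept row's group and store it as a list column
df_unique$track <- unname(groups[as.character(key[first])])
df_unique
```

Each element of df_unique$track is an integer vector, so df_unique$track[[2]] returns the original row numbers that were merged into the second unique row.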
I want to transform a factor column of an R data frame into indicator variables, using another column as the index.
Given the following representation
StudentID Subject
1 A
1 B
2 A
2 C
3 A
3 B
I need the following representation, using StudentID as the index
StudentID SubjectA SubjectB SubjectC
1 1 1 0
2 1 0 1
3 1 1 0
We can use table
table(df1)
# Subject
#StudentID A B C
# 1 1 1 0
# 2 1 0 1
# 3 1 1 0
If we need a data.frame
as.data.frame.matrix(table(df1))
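Note that table returns counts, so a student listed twice for the same subject would get a 2, and the columns are named A, B, C rather than SubjectA and so on. A possible sketch that caps the counts at 1 and builds the layout from the question, assuming df1 holds the StudentID and Subject columns shown there:

```r
df1 <- data.frame(
  StudentID = c(1, 1, 2, 2, 3, 3),
  Subject = c("A", "B", "A", "C", "A", "B")
)

counts <- unclass(table(df1))            # plain count matrix, StudentID x Subject
ind <- as.data.frame((counts > 0) + 0L)  # cap counts at 1 to get 0/1 indicators

names(ind) <- paste0("Subject", names(ind))               # SubjectA, SubjectB, ...
ind <- cbind(StudentID = as.numeric(rownames(ind)), ind)  # index as a real column
rownames(ind) <- NULL
ind
```

The as.numeric() conversion assumes the IDs are numeric, as in the example; for character IDs the row names could be used as they are.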
Here's how I got it, using dcast from reshape2 as suggested in the comment above
library(reshape2)
ID <- c(1, 1, 2, 2, 3, 3)
Subject <- c('A', 'B', 'A', 'C', 'A', 'B')
data <- data.frame(ID, Subject)
data <- dcast(data, ID ~ Subject)
data[is.na(data)] <- 0
f <- function(x) {
x <- gsub('[A-Z]', 1, x)
}
as.data.frame(apply(data, 2, f))
# ID A B C
#1 1 1 1 0
#2 2 1 0 1
#3 3 1 1 0
Now that I look at this solution it may not be very efficient. But it is much more dynamic than some other solutions. There might also be a way to use data.table directly but I cannot figure it out. This might help though:
library(data.table)
df <- structure(list(StudentID = c(1, 1, 2, 2, 3, 3),
Subject = structure(c(1L,
2L, 1L, 3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("StudentID",
"Subject"), row.names = c(NA, -6L), class = "data.frame")
df <- data.table(df)
### here we pull the unique subjects, which become the indicator columns
subjects <- as.character(unique(df$Subject))
### here we group by student ID and paste together the subjects that exist
x <- df[,list("Values"=paste(Subject,collapse="_")),by=StudentID]
### then we go through each student and try to match the subjects against the unique vector
tmp <- strsplit(x$Values,"_")
res <- do.call(rbind,lapply(tmp,function(i) match(subjects,i)))
### change the results to the indicator variable desired
res[!is.na(res)] <- 1
res[is.na(res)] <- 0
res <- data.frame("StudentID"=x$StudentID,res)
colnames(res) <- c("StudentID",subjects)