Add new column to data frames in list - r

I have a set of data frames named df_1968, df_1969, df_1970, ..., df_2016 collected in a list called my_list.
I want to add a new column to each of these data frames that simply holds the corresponding year (1968 in df_1968 and so on). I've managed to do it by looping through the data frames, but I am looking for a neater solution. I've tried the following:
# Function to extract year from name of data frames
substrRight <- function(y, n) {
substr(y, nchar(y) - n + 1, nchar(y))
}
# Add variable "year" equal to 1968 in df_1968 and so on
my_list <- lapply(my_list, function(x) cbind(x, year <- as.numeric(substrRight(names(x), 4 ))))
However this throws the error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing numbers of rows: 18878, 7
I can see that the way I assign the value to the variable probably does not make sense but can't wrap my head around how to do it instead. Help appreciated.
Note that the substrRight function seems to be working perfectly fine and that
as.numeric(substrRight(names(x), 4 ))
yields the vector of years 1968-2016

This works in Base-R
years <- sub(".*([0-9]{4}$)","\\1",names(my_list))
new_list <- lapply(1:length(years), function(x) cbind(my_list[[x]],year=years[x]))
names(new_list) <- names(my_list)
with this self-made example data
df_1968 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1969 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1970 = data.frame(a=c(1,2,3),b=c(4,5,6))
my_list <- list(df_1968,df_1969,df_1970)
names(my_list) <- c("df_1968","df_1969","df_1970")
I get this output
> new_list
$df_1968
a b year
1 1 4 1968
2 2 5 1968
3 3 6 1968
$df_1969
a b year
1 1 4 1969
2 2 5 1969
3 3 6 1969
$df_1970
a b year
1 1 4 1970
2 2 5 1970
3 3 6 1970

The following function will loop through a named list of data frames and create a column year with the 4 last characters of the list's names.
I have simplified the function substrRight a bit. Since only the last characters are needed, it uses substring, which does not require the position of the last character.
substrRight <- function(y, n) {
substring(y, nchar(y) - n + 1)
}
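# sapply(..., simplify = FALSE) keeps the list names, which lapply(names(my_list), ...) would drop
my_list <- sapply(names(my_list), function(x){
my_list[[x]][["year"]] <- as.numeric(substrRight(x, 4))
my_list[[x]]
}, simplify = FALSE)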
Data creation code.
my_list <- lapply(1968:1970, function(i) data.frame(a = 1:5, b = letters[1:5]))
names(my_list) <- paste("df", 1968:1970, sep = "_")
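The error in the question likely arises because, inside lapply(my_list, ...), names(x) returns the column names of a single data frame rather than its name in the list. A minimal sketch of one way around this, assuming a named list like the one above, is to iterate over the data frames and their names in parallel with Map. The helper name add_year is hypothetical and used only for illustration.

```r
# Example data: a named list of data frames, shaped like the one in the question
my_list <- list(
  df_1968 = data.frame(a = 1:3, b = 4:6),
  df_1969 = data.frame(a = 1:3, b = 4:6)
)

# add_year is a hypothetical helper: it receives one data frame and its list name
add_year <- function(df, nm) {
  # Drop everything up to the last underscore, keeping the year suffix
  df$year <- as.numeric(sub(".*_", "", nm))
  df
}

# Map walks over the data frames and their names together and
# keeps the names of its first list argument
my_list <- Map(add_year, my_list, names(my_list))
```

Because the year is a single value, R recycles it to the number of rows of each data frame, so no row-count mismatch occurs.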

Related

How to add a column to every dataframe in the workspace based on its name?

Background
First I'll initialize some dummy dataframes (NOTE: in the real example there will be >40 dataframes):
colOne <- c(1,2,3)
colTwo <- c(6,5,4)
df_2004 <- data.frame(colOne,colTwo)
df_2005 <- data.frame(colTwo,colOne)
Problem
Now what I want to do is loop through every data frame in the workspace and add a column called year to them, filled with 2004 if the suffix is _2004 and 2005 if the suffix is _2005.
I can start by getting a list of all of the data frames in the workspace.
dfs <- ls()[sapply(ls(),function(t) is.data.frame(get(t)))]
dfs
[1] "df_2004" "df_2005"
But that's as far as I've managed to get.
Attempted Solution
This is what I tried:
for (d in dfs) {
d <- lapply(d, function(x){
t <- get(x)
if (grepl('2004',x)) {
t$year <- 2004
} else {
t$year <- 2005
}
t
})
}
This does not throw an error, but it doesn't do anything either other than set d to "2005".
If I add a line print(t) right before the line returning t, I get this output in the console:
colOne colTwo year
1 1 6 2004
2 2 5 2004
3 3 4 2004
colTwo colOne year
1 6 1 2005
2 5 2 2005
3 4 3 2005
suggesting that the code gets through that part fine, because that's exactly what I want df_2004 and df_2005 to look like respectively. But df_2004 and df_2005 are not actually changed, which is what I want.
Suppose you have two lists, d_2018 and d_2019. Using lapply,
d_2018 <- lapply(d_2018, function(x){
x$year <- 2018
x
})
d_2019 <- lapply(d_2019, function(x){
x$year <- 2019
x
})
will help.
Here is one way using purrr to add new column from the dataframe name.
library(purrr)
year_data <- list(data_2018, data_2019)
res <- map(year_data, function(x)
imap(x, function(data, name) {
transform(data, year = sub('.*?_', '', name))
}))
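The loop in the question modifies a local copy returned by get and never writes it back, which is probably why df_2004 and df_2005 stay unchanged. A base R sketch of one possible approach, assuming the data frame names are known as strings: gather the data frames into a named list with mget, add the column, and optionally write the results back with list2env. The variable df_names is an assumption made for this example.

```r
df_2004 <- data.frame(colOne = c(1, 2, 3), colTwo = c(6, 5, 4))
df_2005 <- data.frame(colTwo = c(6, 5, 4), colOne = c(1, 2, 3))

# df_names is assumed here; in the question it comes from filtering ls()
df_names <- c("df_2004", "df_2005")

# Collect the data frames into a named list
dfs <- mget(df_names, envir = environment())

# Add the year taken from the suffix of each name
dfs <- Map(function(df, nm) {
  df$year <- as.numeric(sub(".*_", "", nm))
  df
}, dfs, names(dfs))

# Optional: overwrite the original objects with the modified copies
invisible(list2env(dfs, envir = environment()))
```

Keeping the data frames in the list is usually easier to work with than writing them back, but the last line shows how to update the originals if that is what is needed.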

Importing and combining multiple CSV files in R with differing numbers and names of rows

I have a folder with a couple hundred .csv files that I'd like to import and merge.
Each file contains two columns of data, but there are different numbers of rows, and the rows have different names. The columns don't have names (For this, let's say they're named x and y).
How can I merge these all together? I'd like to just stick the x columns together, side-by-side, rather than matching on any criteria so that the first row is matched across all data sets and empty rows are given NA.
I'd like column x to go away.
Although, the rows should stay in the order they were originally in from the csv.
Here's an example:
Data frame 112_c1.csv:
x y
1 -0.5604
3 -0.2301
4 1.5587
5 0.0705
6 0.1292
Dataframe 112_c2.csv:
x y
2 -0.83476
3 -0.82764
8 1.32225
9 0.36363
13 0.9373
42 -1.5567
50 -0.12237
51 -0.4837
Dataframe 113_c1.csv:
x y
5 1.5783
6 0.7736
9 0.28273
15 1.44565
23 0.999878
29 -0.223756
Desired result
112_c1.y 112_c2.y 113_c1.y
-0.5604 -0.83476 1.5783
-0.2301 -0.82764 0.7736
1.5587 1.32225 0.28273
0.0705 0.36363 1.44565
0.1292 0.9373 0.999878
NA -1.5567 -0.223756
NA -0.12237 -0.223756
NA -0.12237 NA
NA -0.4837 NA
I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:
df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))
Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.
Suggested solution using a function to calculate summary statistics right when loading data:
readCalc <- function(file_path) {
df <- read.csv(file_path)
return(data.frame(file=file_path,
column = names(df),
averages = apply(df, 2, mean),
N = apply(df, 2, length),
min = apply(df, 2, min),
stringsAsFactors = FALSE, row.names = NULL))
}
df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))
If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.
Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.
library(tidyverse)
# store csv file paths
data_path <- "data" # path to the data
files <- dir(data_path, pattern = "*.csv") # get file names
files <- paste(data_path, '/', files, sep="")
# read csv files and combine into a single dataframe
compiled_data = tibble::tibble(File = files) %>% #create a tibble called compiled_data
tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% #extract the file names
mutate(Data = lapply(File, readr::read_csv, col_names = F)) %>% #create a list-column called Data that stores the contents of each file
tidyr::unnest(Data) %>% #unnest the Data column into multiple columns
select(-File) %>% #remove the File column
na.omit() %>% #remove the NA rows
spread(name, X2) %>% #reshape the dataframe from long to wide
select(-X1) %>% #remove the x column
mutate_all(funs(.[order(is.na(.))])) #reorganize dataframe to collapse the NA rows
Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?
x <- c(1:6)
y <- c(1:3)
z <- c(1:10)
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
df <- as.data.frame(cbind.fill(x,y,z))
colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")
112_c1.y 112_c2.y 113_c1.y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 NA 4
5 5 NA 5
6 6 NA 6
7 NA NA 7
8 NA NA 8
9 NA NA 9
10 NA NA 10
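The padding idea can also be applied directly to the files. The following is a base R sketch under stated assumptions: the files have no header, the second column is y, and the folder name csv_demo and the two small example files are made up for illustration. Each y column is padded with NA to the longest length and the columns are then bound side by side.

```r
# Write two small header-less example files to a temporary folder (illustrative data)
dir_path <- file.path(tempdir(), "csv_demo")
dir.create(dir_path, showWarnings = FALSE)
write.table(data.frame(c(1, 3, 4), c(-0.5604, -0.2301, 1.5587)),
            file.path(dir_path, "112_c1.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(2, 3), c(-0.83476, -0.82764)),
            file.path(dir_path, "112_c2.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)

files <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)

# Keep only the second column (y) of each file, in its original row order
ys <- lapply(files, function(f) read.csv(f, header = FALSE)[[2]])
names(ys) <- paste0(tools::file_path_sans_ext(basename(files)), ".y")

# Pad every vector with NA up to the longest length, then bind side by side
n <- max(lengths(ys))
combined <- data.frame(lapply(ys, `length<-`, n), check.names = FALSE)
```

Setting check.names = FALSE keeps column names such as 112_c1.y as they are; otherwise R would prefix them with X because they start with a digit.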

Find the mean of one variable subseted by another variable

I have a list of identically structured data frames. Each data frame contains columns with unique variables (temp, DO) and columns with repeated values (e.g. t1).
[[1]]
temp DO t1
1 4 1
3 9 1
5 7 1
I want to find the mean of DO when the temperature is equal to t1.
t1 represents a specific temperature, but the value varies for each data frame in the list so I can't specify an actual value.
So far I've tried writing a function
hvod<-function(DO, temp, depth){
hDO<-DO[which(temp==t1[1])]
mHDO<-mean(hDO)
htemp<-temp[which(temp=t1[1])]
mhtemp<-mean(htemp)
}
hfit<-hvod(data$DO, data$temp, data$depth)
But for whatever reason t1 is not recognized. Any ideas on the function OR
a way to combine select (dplyr function) and lapply to solve this?
I've seen similar posts, but none that apply to the issue of a specific value (t1) that changes for each data frame.
I would pass the whole data frame as the argument and do the rest of the logic inside the function, since that gives the function more control. Something like this would work:
hvod <- function(data){
temp <- data$temp
t1 <- data$t1
DO <- data$DO
idx <- which(temp == t1[1]) # rows where temp equals this data frame's t1
mHDO <- mean(DO[idx])
mhtemp <- mean(temp[idx])
c(mHDO = mHDO, mhtemp = mhtemp) # return both means
}
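A function that takes a whole data frame can then be applied to every element of the list. A minimal sketch, assuming each data frame has the columns temp, DO and t1 as in the question; the function name mean_do_at_t1 and the example values are made up for illustration.

```r
# Two example data frames with the same columns as in the question
df1 <- data.frame(temp = c(1, 3, 5), DO = c(4, 9, 7), t1 = c(1, 1, 1))
df2 <- data.frame(temp = c(3, 3, 5), DO = c(4, 9, 7), t1 = c(3, 3, 1))
ll <- list(df1, df2)

# Hypothetical helper: mean of DO over rows where temp equals the first t1 value
mean_do_at_t1 <- function(d) {
  mean(d$DO[d$temp == d$t1[1]])
}

# One mean per data frame in the list
res <- sapply(ll, mean_do_at_t1)
```

Because t1 is read from each data frame inside the function, no fixed temperature value has to be specified.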
You can try using the dplyr::bind_rows function to combine all data frames from the list into one data frame.
Then group by the data frame number to find the mean of DO for the rows where temp == t1:
library(dplyr)
bind_rows(ll, .id = "DF_Name") %>%
group_by(DF_Name) %>%
filter(temp==t1) %>%
summarise(MeanDO = mean(DO)) %>%
as.data.frame()
# DF_Name MeanDO
# 1 1 4.0
# 2 2 6.5
# 3 3 6.7
Data:
df1 <- read.table(text =
"temp DO t1
1 4 1
3 9 1
5 7 1",
header = TRUE)
df2 <- read.table(text =
"temp DO t1
3 4 3
3 9 3
5 7 1",
header = TRUE)
df3 <- read.table(text =
"temp DO t1
2 4 2
2 9 2
2 7 2",
header = TRUE)
ll <- list(df1, df2, df3)
Thank you Thiloshon and MKR for the help! I had initially combined the data I needed into one list of data frames, but to answer this I actually had my data in separate data frames (fitObs and df1).
The variables I was working with in the code were 1 to 1, so by finding the range where depth and d2 were the same (I used temp and t1 in the example), I could find the mean over that range.
for(i in 1:1044){
df1 <- GLNPOsurveyCTD$data[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
deptho <- -abs(df1$depth) #defining temp and depth in the loop
to <- df1$temp
do <- df1$DO
xx <- which(deptho <= fitObs$d2) #mean over range xx
mhtemp <- mean(to[xx], na.rm=TRUE)
mHDO <- mean(do[xx], na.rm=TRUE)
}

Applying function using multiple columns as argument, function returns a data.frame

I am trying to apply a function that uses multiple columns of a data frame as arguments, with the function returning a data frame for each row. I can use a for loop here, but wanted to check whether there is another way of doing this.
A simple example is provided here. My original problem is slightly more complicated.
DF1<-data.frame(start=seq(from=1, to=5, by=1),end=seq(from=10, to=14, by=1))
rep_fun <- function(x,y)
{
data.frame( A=seq(x, y)) #produces a sequence between x and y
}
DF2<-data.frame()
for (i in 1:nrow(DF1)){
temp<-data.frame(rep_fun(DF1$start[i],DF1$end[i]))
DF2<-rbind(temp,DF2) # this contains a dataframe that has a sequence between 'start' and 'end' for each row in DF1
}
The desired result, which I am able to obtain through a for loop, is shown below. Not all rows are shown here. Rows 1 to 10 show the sequence corresponding to row 5 in DF1.
> DF2
A
1 5
2 6
3 7
4 8
5 9
6 10
7 11
8 12
9 13
10 14
11 4
12 5
1) lapply. Split DF1 by nrow(DF1):1 so that it comes out in reverse order, then lapply over that list and rbind its components together. No packages are used.
DF3 <- do.call("rbind", lapply(split(DF1, nrow(DF1):1), with, rep_fun(start, end)))
rownames(DF3) <- NULL
identical(DF2, DF3)
## [1] TRUE
2) Map. Or this alternative:
fun <- function(x) with(x, rep_fun(start, end))
DF4 <- do.call("rbind", Map(fun, split(DF1, nrow(DF1):1), USE.NAMES = FALSE))
identical(DF4, DF2)
## [1] TRUE
3) Map/rev. Like (2), this uses Map, but this time with rep_fun directly. It also uses rev to reorder the output after the computation rather than split to reorder the input before the computation.
DF5 <- do.call("rbind", with(DF1, rev(Map(rep_fun, start, end))))
identical(DF5, DF2)
## [1] TRUE
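The reversed order in DF2 comes from rbind(temp, DF2) placing each new piece in front of the earlier ones. If the original row order of DF1 is acceptable, a simpler sketch is to build all pieces first and bind them once. This assumes the same rep_fun as in the question; DF1 is created here with integer sequences for brevity, and the name DF_fwd is made up for this example.

```r
DF1 <- data.frame(start = 1:5, end = 10:14)

# Same function as in the question: a one-column data frame with the sequence x..y
rep_fun <- function(x, y) {
  data.frame(A = seq(x, y))
}

# Build one data frame per row of DF1, then bind them all in a single call
pieces <- Map(rep_fun, DF1$start, DF1$end)
DF_fwd <- do.call(rbind, pieces)
```

Binding once at the end avoids growing a data frame inside a loop, which tends to become slow as the number of rows increases.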

How to use a string to refer to a data frame in R?

I have 3 data frames called 'first', 'second' and 'third'
I also have a string vector with their names in it
frames <- c("first","second","third")
I would like to loop through the vector and refer to the data frames
for (i in frames) {
#set first value be 0 in each data frame
i[1,1] <- 0
}
This does not work, what am I missing?
This is really not the optimal way to do this but this is one way to make your specific example work.
first <- data.frame(x = 1:5)
second <- data.frame(x = 1:5)
third <- data.frame(x = 1:5)
frames <- c("first","second","third")
for (i in frames) {
df <- get(i)
df[1,1] <- 45
assign(as.character(i), df, envir= .GlobalEnv)
}
> first
x
1 45
2 2
3 3
4 4
5 5
> second
x
1 45
2 2
3 3
4 4
5 5
> third
x
1 45
2 2
3 3
4 4
5 5
As Justin mentioned, the R way would be to use a list. Given that you only have the data frame names as strings, you can copy the data frames into a list.
frames <- lapply(c("first", "second", "third"), get)
(frames <- lapply(frames, function(x) {x[1,1] <- 0; x}))
However, you are working on a copy of first, second and third within frames.
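The copy behaviour can be seen in a short sketch, assuming the three data frames from the answer above. Here mget is used instead of lapply with get so that the list keeps the names, and list2env is shown as one optional way to write the modified copies back.

```r
first <- data.frame(x = 1:5)
second <- data.frame(x = 1:5)
third <- data.frame(x = 1:5)

# mget returns a named list of the objects with these names
frames <- mget(c("first", "second", "third"), envir = environment())

# Set the first value to 0 in every data frame of the list
frames <- lapply(frames, function(d) { d[1, 1] <- 0; d })

# The list holds modified copies; the original object is unchanged at this point
unchanged <- first[1, 1] == 1
changed <- frames$first[1, 1] == 0

# Optional: overwrite the originals with the modified copies
invisible(list2env(frames, envir = environment()))
```

After the last line, first, second and third in the current environment contain the modified values.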
