Merging many CSVs into different data frames - r

I have many CSVs, each corresponding to a day's worth of data, stored like this:
Day1.csv:
ID, height, weight, color
a1, 3, 45, blue
a2, 3, 44, green
a3, 4, 48, blue
Day2.csv:
ID, height, weight, color
a1, 4, 47, green
a2, 4, 44, green
a3, 5, 49, yellow
I want to make a separate data frame for each feature (i.e. height, weight, etc.) with information from each csv. The output would look like this for each feature:
height.df:
ID, Day1, Day2
a1, 3, 4
a2, 3, 4
a3, 4, 5
I have tried to use merge(), but that requires that I input only two columns at a time. I'm also not sure how to use the filename to label the column.

I would consider just putting all the data into a list and rbinding the data together (if the columns are of the same types).
Example:
## Assume you have read in files and saved them as `data.frame`s named
## "day1", "day2", and so on....
temp <- mget(ls(pattern = "day\\d+"))
long <- do.call(rbind, lapply(names(temp), function(x) cbind(Day = x, temp[[x]])))
From there, you can do transformations quite easily. For instance, make the entire dataset into a "wide" dataset:
reshape(long, direction = "wide", idvar = "ID", timevar = "Day")
# ID height.day1 weight.day1 color.day1 height.day2 weight.day2 color.day2
# 1 a1 3 45 blue 4 47 green
# 2 a2 3 44 green 4 44 green
# 3 a3 4 48 blue 5 49 yellow
Or, just a specific variable:
library(data.table)
dcast.data.table(as.data.table(long), ID ~ Day, value.var = "height")
# ID day1 day2
# 1: a1 3 4
# 2: a2 3 4
# 3: a3 4 5
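The question also asks how to use the file name as the column label. Here is a minimal base R sketch of one way to do that, assuming the files sit in one folder and share the same columns. The folder `csv_dir` and the names `files` and `by_feature` are made up for illustration, and the two example files are written to a temporary directory so the snippet runs on its own.

```r
# Stand-in for the real folder: write the two example files to a temp directory
csv_dir <- file.path(tempdir(), "day_csvs")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("ID,height,weight,color", "a1,3,45,blue", "a2,3,44,green", "a3,4,48,blue"),
           file.path(csv_dir, "Day1.csv"))
writeLines(c("ID,height,weight,color", "a1,4,47,green", "a2,4,44,green", "a3,5,49,yellow"),
           file.path(csv_dir, "Day2.csv"))

# Name each path after its file (without the extension), e.g. "Day1"
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))

# Stack all files into one long data frame, tagging rows with the file name
long <- do.call(rbind, lapply(names(files), function(day) {
  cbind(Day = day, read.csv(files[[day]], stringsAsFactors = FALSE))
}))

# One wide data frame per feature, with the file names as column labels
features <- setdiff(names(long), c("Day", "ID"))
by_feature <- lapply(setNames(features, features), function(f) {
  wide <- reshape(long[, c("ID", "Day", f)], direction = "wide",
                  idvar = "ID", timevar = "Day")
  names(wide) <- sub(paste0("^", f, "\\."), "", names(wide))  # "height.Day1" -> "Day1"
  rownames(wide) <- NULL
  wide
})

by_feature$height
```

Keeping the results in a named list (`by_feature$height`, `by_feature$weight`, ...) avoids creating many loose objects in the global environment.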

If you really want to make separate data frames, here's one way you could do it:
Day1.csv <- read.table(header=T, sep=",", text="
ID, height, weight, color
a1, 3, 45, blue
a2, 3, 44, green
a3, 4, 48, blue")
Day2.csv <- read.table(header=T, sep=",", text="
ID, height, weight, color
a1, 4, 47, green
a2, 4, 44, green
a3, 5, 49, yellow")
library(tidyr)
l <- mget(ls(pattern = "Day\\d+\\.csv"))
df <- do.call(rbind, lapply(seq(l), function(x) transform(l[[x]], Day = paste0("Day", gsub("\\D", "", names(l)[x])))))
df <- gather(df, variable, value, -ID, -Day)
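# gather() stores `variable` as character by default, so use unique() rather than levels()
vars <- unique(as.character(df$variable))
list2env(
  setNames(lapply(vars, function(x) {
    spread(df[df$variable == x, -which(names(df) == "variable")], Day, value, fill = 0)
  }), paste0(vars, ".df")), globalenv())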
weight.df
# ID Day1 Day2
# 1 a1 45 47
# 2 a2 44 44
# 3 a3 48 49
height.df
# ID Day1 Day2
# 1 a1 3 4
# 2 a2 3 4
# 3 a3 4 5
color.df
# ID Day1 Day2
# 1 a1 blue green
# 2 a2 green green
# 3 a3 blue yellow

This seems to work with two dataframes.
file_names <- list.files('C:/mydirectory/mycsvs', full.names = T)
file_list <- lapply(file_names, read.csv, stringsAsFactors = F)
So put your csvs in a directory, read them into a list.
library(dplyr)
new_dataframes <- list()
# Loop over every feature column (column 1 is ID)
for (i in 2:ncol(file_list[[1]])) {
  # Pull ID plus the i-th column out of each day's data frame
  new_list <- list()
  for (j in 1:length(file_list)) {
    new_list[[j]] <- file_list[[j]][, c(1, i)]
  }
  # Join the per-day pieces on ID
  joined_df <- new_list[[1]]
  for (j in 2:length(new_list)) {
    joined_df <- inner_join(joined_df, new_list[[j]], by = 'ID')
  }
  # Rename the value columns to "day 1", "day 2", ...
  for (j in 2:ncol(joined_df)) {
    colnames(joined_df)[j] <- paste0('day ', j - 1)
  }
  feature_name <- colnames(new_list[[1]])[2]
  new_dataframes[[feature_name]] <- joined_df
}
new_dataframes
$height
ID day 1 day 2
1 a1 3 4
2 a2 3 4
3 a3 4 5
$weight
ID day 1 day 2
1 a1 45 47
2 a2 44 44
3 a3 48 49
$color
ID day 1 day 2
1 a1 blue green
2 a2 green green
3 a3 blue yellow
This assumes that you have sequential days (e.g. day 1 through day n with no missing days). If that's not the case, it's not difficult to extract the day from the file name, assuming it's in there. It also assumes the data frames are more or less identical in the number and order of their columns. If they aren't, this won't work. Finally, it does an inner_join, so you only get records whose IDs appear in every file.
And if anyone has an idea for getting rid of those loops, I'd love to hear it, especially for that join step. That can probably get relatively slow depending on the size of the data.
Anyway, you end up with a list of dataframes, where the name of the list element is the feature. Depending on how much data you have, this list could get very large.
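On getting rid of the loops: one option is to fold the join step with `Reduce()` and build the per-feature list with `lapply()`. The sketch below uses base `merge()` in place of `inner_join()` (both keep only matching IDs by default). The small `file_list` here is made up to stand in for the list read from disk, and the columns are labelled `day1`, `day2`, ... by position, so it carries the same "sequential days" assumption.

```r
# Stand-in for the list produced by lapply(file_names, read.csv, ...)
file_list <- list(
  data.frame(ID = c("a1", "a2", "a3"), height = c(3, 3, 4), weight = c(45, 44, 48),
             stringsAsFactors = FALSE),
  data.frame(ID = c("a1", "a2", "a3"), height = c(4, 4, 5), weight = c(47, 44, 49),
             stringsAsFactors = FALSE)
)

features <- names(file_list[[1]])[-1]   # every column except ID

new_dataframes <- lapply(setNames(features, features), function(f) {
  # ID plus one feature per file, with the feature column renamed to dayN
  pieces <- lapply(seq_along(file_list), function(j) {
    setNames(file_list[[j]][, c("ID", f)], c("ID", paste0("day", j)))
  })
  # Fold the pieces together with successive inner joins on ID
  Reduce(function(x, y) merge(x, y, by = "ID"), pieces)
})

new_dataframes$weight
```

Renaming the columns before joining avoids the `.x`/`.y` suffixes that would otherwise appear when the same feature name occurs in both inputs.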

Related

Construct loop in R to search and replace values in data.frame A based on matched subsets of column values from data.frame B?

I have a Raw 'data.frame A' containing results from a set of measurements taken in a time course experiment. There are Control and Test Treatment variables, two Animals per Treatment, three measurements per Animal, and Day 1, 2, and 3 as Time points.
[image: data.frame A]
I have written code to generate a separate 'data.frame B' that converts a number of outliers into NA's. These NA's are associated with specific combinations of Treatment-Animal-Measure column values. My goal is to use a list of such combined values from 'data.frame B' to search for matched cases in 'data.frame A' and replace the number in the value column with NA, across all Timepoints in the data set.
[image: data.frame B]
I have looked into indexing, lapply(), and for loops to tackle this problem, but am getting stuck pretty early in each case. Here is an image of the desired 'data.frame C' showing the replacements I am after:
[image: resultant data.frame C]
Any guidance on best course of action, or a solution, would be much appreciated!
Here's one solution using dplyr. Make sure that dfb contains only the rows you want to change to NA. Then a left join followed by a simple case_when does the work.
dfa <- data.frame(
Treatment = rep(c(rep("Control", 6), rep("Test", 6)), 3),
Timepoint = c(rep("Day1", 12), rep("Day2", 12), rep("Day3", 12)),
Animal = rep(c(rep("A", 3), rep("B", 3)), 6),
Measure = rep(c(c("A1", "A2", "A3"), c("B1", "B2", "B3")), 6),
Value = c(10, 11, 9, 10, 2, 9, 10, 11, 9, 10, 2, 9, rep(10, 24))
)
Note the minor modifications to dfb...
dfb <- data.frame(
Treatment = c("Test", "Control"),
Animal = c("B", "B"),
Measure = c("B2", "B2"),
ReplaceValue = c(TRUE, TRUE)
)
dfb
Treatment Animal Measure ReplaceValue
1 Test B B2 TRUE
2 Control B B2 TRUE
library(dplyr)
dfc <-
left_join(dfa, dfb, by = c("Treatment", "Animal", "Measure")) %>%
mutate(Value = case_when(
is.na(ReplaceValue) ~ Value,
TRUE ~ NA_real_
)
) %>%
select(-ReplaceValue)
head(dfc, 12)
#> Treatment Timepoint Animal Measure Value
#> 1 Control Day1 A A1 10
#> 2 Control Day1 A A2 11
#> 3 Control Day1 A A3 9
#> 4 Control Day1 B B1 10
#> 5 Control Day1 B B2 NA
#> 6 Control Day1 B B3 9
#> 7 Test Day1 A A1 10
#> 8 Test Day1 A A2 11
#> 9 Test Day1 A A3 9
#> 10 Test Day1 B B1 10
#> 11 Test Day1 B B2 NA
#> 12 Test Day1 B B3 9
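For comparison, the same replacement can be sketched in base R without a join, by building a combined key from the three matching columns and blanking every row of dfa whose key appears in dfb. The `|` separator and the names `key_a` and `key_b` are choices made for this sketch. It assumes `|` never occurs in the column values.

```r
dfa <- data.frame(
  Treatment = rep(c(rep("Control", 6), rep("Test", 6)), 3),
  Timepoint = c(rep("Day1", 12), rep("Day2", 12), rep("Day3", 12)),
  Animal = rep(c(rep("A", 3), rep("B", 3)), 6),
  Measure = rep(c("A1", "A2", "A3", "B1", "B2", "B3"), 6),
  Value = c(10, 11, 9, 10, 2, 9, 10, 11, 9, 10, 2, 9, rep(10, 24)),
  stringsAsFactors = FALSE
)
dfb <- data.frame(
  Treatment = c("Test", "Control"),
  Animal = c("B", "B"),
  Measure = c("B2", "B2"),
  stringsAsFactors = FALSE
)

# One key per row, built from the Treatment-Animal-Measure combination
key_a <- paste(dfa$Treatment, dfa$Animal, dfa$Measure, sep = "|")
key_b <- paste(dfb$Treatment, dfb$Animal, dfb$Measure, sep = "|")

# Set Value to NA wherever the combination is listed in dfb, at every time point
dfc <- dfa
dfc$Value[key_a %in% key_b] <- NA

which(is.na(dfc$Value))
```

Because Timepoint is not part of the key, each listed combination is blanked on Day1, Day2 and Day3 alike.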

Add missing index in a dataframe

Hi I have a messy data frame as follows:
df <- data.frame(age.band = c("0-5","5-10"), beg.code = c("A1","B1"), end.code=c("A5","B3"),value = c(10,5))
age.band beg.code end.code value
0-5 A1 A5 10
5-10 B1 B3 5
I would like to transform it into a friendlier format such as:
index age.band value
A1 0-5 10
A2 0-5 10
A3 0-5 10
A4 0-5 10
A5 0-5 10
B1 5-10 5
B2 5-10 5
B3 5-10 5
Can anyone help me to find a way to add all the missing indexes for this dataframe? Thanks
A solution using dplyr and tidyr. Notice that I added stringsAsFactors = FALSE to avoid creating factor columns when creating your example data frame. If you run the code on your original data frame, you will receive a warning message due to the factor columns, but it will not affect the end result.
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Code, Value, ends_with("code")) %>%
extract(Value, into = c("Group", "Index"), regex = "^([A-Za-z]+)(\\d+)$",
convert = TRUE) %>%
select(-Code) %>%
group_by(Group) %>%
complete(Index = full_seq(Index, period = 1)) %>%
unite(Index, c("Group", "Index"), sep = "") %>%
fill(-Index)
df2
# # A tibble: 8 x 3
# Index age.band value
# * <chr> <chr> <dbl>
# 1 A1 0-5 10
# 2 A2 0-5 10
# 3 A3 0-5 10
# 4 A4 0-5 10
# 5 A5 0-5 10
# 6 B1 5-10 5
# 7 B2 5-10 5
# 8 B3 5-10 5
DATA
df <- data.frame(age.band = c("0-5","5-10"), beg.code = c("A1","B1"), end.code=c("A5","B3"),value = c(10,5),
stringsAsFactors = FALSE)
Here is one option with base R. The idea is to remove the non-numeric characters from the 'code' columns, convert the results to numeric, and store the sequence between each pair as a list. Then paste the non-numeric prefix back on. Finally, based on the lengths of the list elements, expand the rows of the original dataset with rep and create a new column 'index' by unlisting the list.
lst <- do.call(Map, c(f = `:`, lapply(df[2:3], function(x) as.numeric(sub("\\D+", "", x)))))
lst1 <- Map(paste0, substr(df[,2], 1, 1), lst)
data.frame(index = unlist(lst1), df[rep(seq_len(nrow(df)), lengths(lst1)), -(2:3)])
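The base R steps above take the prefix from the first character of the code. A variant wrapped in a function, which also copes with longer prefixes such as "AB12", could look like the sketch below. The function name `expand_codes` and its arguments are made up for illustration, and it assumes every code is a non-numeric prefix followed by digits, with the same prefix in both code columns.

```r
expand_codes <- function(df, beg = "beg.code", end = "end.code") {
  prefix <- sub("\\d+$", "", df[[beg]])              # "A1" -> "A"
  from <- as.integer(sub("^\\D+", "", df[[beg]]))    # "A1" -> 1
  to <- as.integer(sub("^\\D+", "", df[[end]]))      # "A5" -> 5
  # Repeat each original row once per index in its range
  rows <- rep(seq_len(nrow(df)), to - from + 1L)
  out <- data.frame(
    index = paste0(prefix[rows], unlist(Map(seq, from, to))),
    df[rows, setdiff(names(df), c(beg, end)), drop = FALSE],
    stringsAsFactors = FALSE
  )
  rownames(out) <- NULL
  out
}

df <- data.frame(age.band = c("0-5", "5-10"), beg.code = c("A1", "B1"),
                 end.code = c("A5", "B3"), value = c(10, 5),
                 stringsAsFactors = FALSE)
res <- expand_codes(df)
res
```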

R + Data frame name as string variable

I have a list of data frames, and I'm looking to add to each data frame within the list a column that is simply a character vector of that data frame's name.
data <- list(
d1 = data.frame(animal = sample(c("cat","dog","bird"), 5, replace = T)),
d2 = data.frame(animal = sample(c("cat","dog","bird"), 5, replace = T)),
d3 = data.frame(animal = sample(c("cat","dog","bird"), 5, replace = T))
)
This yields:
> data
$d1
animal
1 cat
2 bird
3 cat
4 cat
5 cat
$d2
animal
1 dog
2 cat
3 cat
4 cat
5 bird
$d3
animal
1 cat
2 dog
3 cat
4 cat
5 cat
What I want to do is create something like the following:
> newdata
$d1
animal newvar
1 cat d1
2 cat d1
3 cat d1
4 dog d1
5 cat d1
$d2
animal newvar
1 bird d2
2 cat d2
3 bird d2
4 cat d2
5 cat d2
$d3
animal newvar
1 bird d3
2 bird d3
3 cat d3
4 cat d3
5 bird d3
But I can't quite figure out how to actually reference the data frame name --in a list of data frames-- and turn it into a character vector appropriately.
Something like the following does not work:
namefunc <- function(x) {
x <- x %>% transform(newvar = as.character(x))
}
newdata <- namefunc(data)
We can use Map to cbind the corresponding list elements of 'data' with the names of 'data'
Map(cbind, data, newvar= names(data))
Or, with lapply over the names (the result is unnamed, so restore the names afterwards):
L <- lapply(names(data), function(d) transform(data[[d]], newvar = d))
names(L) <- names(data)
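A small check of the Map approach on fixed data (so the result is reproducible, unlike the `sample()` calls in the question) might look like this. The helper name `add_name` is made up for illustration.

```r
data <- list(
  d1 = data.frame(animal = c("cat", "bird"), stringsAsFactors = FALSE),
  d2 = data.frame(animal = c("dog", "cat"), stringsAsFactors = FALSE)
)

# Add the list element's name as a column; Map passes each data frame with its name
add_name <- function(df, nm) {
  df$newvar <- nm
  df
}
newdata <- Map(add_name, data, names(data))

names(newdata)   # Map keeps the names of its first list argument
newdata$d2
```

The original attempt failed because a function applied to the whole list never sees the individual element names. Passing `names(data)` alongside `data` makes them available.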

Convert columns i to j to percentage

Suppose I have the following data:
df1 <- data.frame(name=c("A1","A1","B1","B1"),
somevariable=c(0.134,0.5479,0.369,NA),
othervariable=c(0.534, NA, 0.369, 0.3333))
In this example, I want to convert columns 2 and 3 to percentages (with one decimal point). I can do it with this code:
library(scales)
df1 %>%
mutate(somevariable=try(percent(somevariable),silent = T),
othervariable=try(percent(othervariable),silent = T))
But I'm hoping there is a better way, particularly for the case where I have many columns instead of just 2.
I tried mutate_each but I'm doing something wrong...
df1 %>%
mutate_each(funs = try(percent(),silent = T), -name)
Thanks!
Here's an alternative approach using a custom function. This function only modifies numeric vectors, so there is no need to worry about try or about excluding non-numeric columns. It also handles NAs by default.
myfun <- function(x) {
if(is.numeric(x)){
ifelse(is.na(x), x, paste0(round(x*100L, 1), "%"))
} else x
}
df1 %>% mutate_each(funs(myfun))
# name somevariable othervariable
# 1 A1 13.4% 53.4%
# 2 A1 54.8% <NA>
# 3 B1 36.9% 36.9%
# 4 B1 <NA> 33.3%
Try
df1 %>%
mutate_each(funs(try(percent(.), silent=TRUE)), -name)
# name somevariable othervariable
#1 A1 13.4% 53.4%
#2 A1 54.8% NA%
#3 B1 36.9% 36.9%
#4 B1 NA% 33.3%
If you need to keep the NAs from being turned into percentages,
df1 %>%
mutate_each(funs(try(ifelse(!is.na(.), percent(.), NA),
silent=TRUE)),-name)
# name somevariable othervariable
#1 A1 13.4% 53.4%
#2 A1 54.8% <NA>
#3 B1 36.9% 36.9%
#4 B1 <NA> 33.3%
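`mutate_each()` and `funs()` have since been deprecated in dplyr, so a version that does not depend on them may be useful. Here is a base R sketch that formats a range of columns by position, which matches the "columns i to j" framing of the question. The helper name `to_pct` and the variable `cols` are made up for this sketch, and `sprintf()` is used in place of `scales::percent()`, so the output always has exactly one decimal place.

```r
df1 <- data.frame(name = c("A1", "A1", "B1", "B1"),
                  somevariable = c(0.134, 0.5479, 0.369, NA),
                  othervariable = c(0.534, NA, 0.369, 0.3333),
                  stringsAsFactors = FALSE)

# Format a numeric vector as a percentage string, leaving NA as NA
to_pct <- function(x) {
  ifelse(is.na(x), NA_character_, sprintf("%.1f%%", x * 100))
}

cols <- 2:3                               # columns i to j
df1[cols] <- lapply(df1[cols], to_pct)
df1
```

To pick the columns by type instead of position, `cols <- sapply(df1, is.numeric)` would select every numeric column.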

merge two data.frames and replace values of certain columns of df1 with values of df2

I have two data.frames that I want to merge, replacing the values of certain columns of df1
with values from df2. In this working example there are only 3 columns, but in the original data
there are about 20 columns that should remain in the final data.frame.
NO <- c(2, 4, 7, 18, 25, 36, 48)
WORD <- c("apple", "peach", "plum", "orange", "grape", "berry", "pear")
CLASS <- c("p", "x", "x", "n", "x", "p", "n")
ColA <- c("hot", "warm", "sunny", "rainy", "windy", "cloudy", "snow")
df1 <- data.frame(NO, WORD, CLASS, ColA)
df1
# NO WORD CLASS ColA
# 1 2 apple p hot
# 2 4 peach x warm
# 3 7 plum x sunny
# 4 18 orange n rainy
# 5 25 grape x windy
# 6 36 berry p cloudy
# 7 48 pear n snow
NO <- c(4, 18, 36)
WORD <- c("patricia", "oliver", "bob")
CLASS <- c("p", "n", "x")
df2 <- data.frame(NO, WORD, CLASS)
df2
# NO WORD CLASS
# 1 4 patricia p
# 2 18 oliver n
# 3 36 bob x
I want to merge the two data.frames and replace the values of WORD and CLASS in df1
with the values of WORD and CLASS from df2.
My data.frame should look like this:
# NO WORD CLASS ColA
# 1 2 apple p hot
# 2 4 patricia p warm
# 3 7 plum x sunny
# 4 18 oliver n rainy
# 5 25 grape x windy
# 6 36 bob x cloudy
# 7 48 pear n snow
Try this
auxind <- match(df2$NO, df1$NO) # Row positions in df1 whose NO also appears in df2
dfuni <- rbind(df1[, 1:3], df2)[-auxind, ] # Stacks both data.frames and drops the repeated rows from the first three columns of df1
dfuni <- dfuni[order(dfuni$NO), ] # Sorts the new data.frame
df1[, 1:3] <- dfuni
This approach works as well, though it is more an experiment than the best answer to the question:
library(qdap); library(qdapTools)
df1[, 2] <- as.character(df1[, 2])
trms <- strsplit(df1[, 1] %lc% colpaste2df(df2, 2:3, keep.orig = FALSE), "\\.")
df1[sapply(trms, function(x) !all(is.na(x))), 2:3] <-
do.call(rbind, trms[sapply(trms, function(x) !all(is.na(x)))])
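A more direct base R sketch of the same update is to look up each df2 row in df1 with `match()` and overwrite only the WORD and CLASS cells, which leaves any number of other columns untouched. It assumes NO is unique in df1 and that the columns are character (hence `stringsAsFactors = FALSE`). The names `idx` and `found` are made up for illustration.

```r
df1 <- data.frame(NO = c(2, 4, 7, 18, 25, 36, 48),
                  WORD = c("apple", "peach", "plum", "orange", "grape", "berry", "pear"),
                  CLASS = c("p", "x", "x", "n", "x", "p", "n"),
                  ColA = c("hot", "warm", "sunny", "rainy", "windy", "cloudy", "snow"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(NO = c(4, 18, 36),
                  WORD = c("patricia", "oliver", "bob"),
                  CLASS = c("p", "n", "x"),
                  stringsAsFactors = FALSE)

# Position in df1 of each NO from df2 (NA when df1 has no such NO)
idx <- match(df2$NO, df1$NO)
found <- !is.na(idx)

# Overwrite only the matching rows and only the columns being replaced
df1[idx[found], c("WORD", "CLASS")] <- df2[found, c("WORD", "CLASS")]
df1
```

Rows of df2 whose NO does not occur in df1 are skipped here, not appended.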
