Apply changes (group by) on multiple dataframes using for loop - r

I have around 33 dataframes (df1, df2, df3, df4 ...) that look like this:
Date Month Value
2018-07-16 2018-07 10
2018-07-17 2018-07 2
2018-07-18 2018-07 4
2018-07-19 2018-07 45
2018-07-20 2018-07 13
and I would like to group each data frame by month, like this:
df1 = df1 %>% group_by(Month)%>%
summarise(
sd_value = sd(Value)
)
How can I do this on all dataframes without repeating it?
Also, I will need to export the results as separate data frames.
I've tried to adapt other people's for-loop solutions, but they don't work.
Also, I have all the dataframes separately in my Environment, they are not in a list.

You can collect them into a list using mget with a name pattern, loop over the list with lapply, and aggregate each one:
list_name <- ls(pattern = "df\\d+")
list_df <- lapply(mget(list_name), function(x) aggregate(Value~Month, x, sd))
list_df
#$df1
# Month Value
#1 2018-07 17.45566
#$df2
# Month Value
#1 2018-07 185.8744
Or if you want to use tidyverse
library(tidyverse)
list_df <- map(mget(list_name),
. %>% group_by(Month) %>% summarise(sd_value = sd(Value)))
To write them to separate CSV files, we can use mapply:
mapply(function(x, y) write.csv(x,
paste0("path/to/file/", y, ".csv"), row.names = FALSE), list_df, list_name)
data
df1 <- structure(list(Date = structure(1:5, .Label = c("2018-07-16",
"2018-07-17", "2018-07-18", "2018-07-19", "2018-07-20"), class = "factor"),
Month = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "2018-07", class = "factor"),
Value = c(10L, 2L, 4L, 45L, 13L)), class = "data.frame", row.names =
c(NA, -5L))
df2 <- structure(list(Date = structure(1:5, .Label = c("2018-07-16",
"2018-07-17", "2018-07-18", "2018-07-19", "2018-07-20"), class = "factor"),
Month = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "2018-07", class = "factor"),
Value = c(11L, 2L, 4L, 423L, 13L)), class = "data.frame", row.names =
c(NA, -5L))
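The question also asks for the results as separate data frames rather than files. A minimal sketch of one way to do that, assuming base R only: the pattern is anchored so that only names of the form df1, df2, ... are collected, and the "_sd" suffix is a hypothetical naming choice so the originals are not overwritten.

```r
# Toy inputs with the same Month/Value columns as the data above
df1 <- data.frame(Month = "2018-07", Value = c(10L, 2L, 4L, 45L, 13L))
df2 <- data.frame(Month = "2018-07", Value = c(11L, 2L, 4L, 423L, 13L))

# Anchored pattern: matches df1, df2, ... but not names such as mydf1 or df1_old
list_name <- ls(pattern = "^df\\d+$")
list_df <- lapply(mget(list_name), function(x) aggregate(Value ~ Month, x, sd))

# Hypothetical "_sd" suffix keeps the original data frames intact
names(list_df) <- paste0(names(list_df), "_sd")

# Create one object per list element in the global environment
list2env(list_df, envir = .GlobalEnv)
```

Keeping the results in list_df is usually easier to work with afterwards; list2env is only needed if separate objects are really required.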

We can use data.table methods
library(data.table)
lapply(mget(ls(pattern = "df\\d+")), function(x)
setDT(x)[, .(sd_value = sd(Value)), by = Month])
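Since the question mentions a for loop, the same idea can be sketched as an explicit loop with get and assign. This is only an illustration in base R, assuming the data frames are named df1, df2, ... and that overwriting them (as in the question's df1 = df1 %>% ...) is acceptable.

```r
# Toy inputs with the same Month/Value columns as in the question
df1 <- data.frame(Month = "2018-07", Value = c(10L, 2L, 4L, 45L, 13L))
df2 <- data.frame(Month = "2018-07", Value = c(11L, 2L, 4L, 423L, 13L))

for (nm in ls(pattern = "^df\\d+$")) {
  d <- get(nm)                                          # fetch the data frame by name
  res <- aggregate(Value ~ Month, data = d, FUN = sd)   # sd of Value per Month
  names(res)[names(res) == "Value"] <- "sd_value"
  assign(nm, res)                                       # overwrite the original object
}
```

The lapply/mget versions above do the same work without get/assign and are generally easier to debug.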

Related

How do I write a function to manipulate several dataframes the same way?

Complete novice. I do not know how to write a function. I have several dataframes that all need to be manipulated in the same way and the output should be dataframes with the same names. I have functioning code that can manipulate a single dataframe. I would like to be able to manipulate several at once.
Here are 2 example df's:
ex1 <- structure(list(info1 = c("Day", "2018.04.03 10:47:33", "2018.04.03 11:20:04", "2018.04.03 11:35:04"), info2 = c("Status_0", "Ok", "Ok", "Ok"
), X = c(200L, 1L, 2L, 3L), X.1 = c(202.5, 1, 2, 3), X.2 = c(205L,
1L, 2L, 3L), X.3 = c(207.5, 1, 2, 3), X.4 = c(210L, 1L, 2L, 3L
), X.5 = c(212.5, 1, 2, 3), X.6 = c(215L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -4L))
ex2 <- structure(list(info1 = c("Day", "2018.04.10 12:47:33", "2018.04.10 13:20:04", "2018.04.10 13:35:04"), info2 = c("Status_0", "Ok", "Ok", "Ok"
), X = c(200L, 1L, 2L, 3L), X.1 = c(202.5, 1, 2, 3), X.2 = c(205L,
1L, 2L, 3L), X.3 = c(207.5, 1, 2, 3), X.4 = c(210L, 1L, 2L, 3L
), X.5 = c(212.5, 1, 2, 3), X.6 = c(215L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -4L))
Here is the functioning code to manipulate 'ex1'
library(tidyverse)
library(lubridate)
colnames(ex1) <- ex1[1,]
ex1 <- ex1 %>%
slice(-1) %>%
rename(Date.Time = "Date/Time") %>%
mutate(timestamp = parse_date_time(Date.Time, "%Y.%m.%d %H:%M:%S")) %>%
select(timestamp, Date.Time, everything()) %>% select(-Date.Time) %>%
select(-c(Status_0:"202.5", "212.5":"215"))
colnames(ex1)[-1] <- paste("raw", colnames(ex1)[-1], sep = "_")
Secondary question: let's say I wanted to change the function so it accepted a df, but also a type (i.e., raw or comp) and the function input would be tidydatafunc(df, type). If I input type=comp it would change the last line of the code where I have "raw" to "comp". How could I change the function to accommodate this?
Any help is greatly appreciated. I'm sure this is basic stuff for most of you!
Wrap your script in a function and specify its parameters.
my_fun <- function(df, type = 'comp') {
# basic input validation is extremely useful
stopifnot(is.data.frame(df))
stopifnot(is.character(type))
colnames(df) <- df[1,]
df <- df %>%
slice(-1) %>%
rename(Date.Time = "Date/Time") %>%
mutate(timestamp = parse_date_time(Date.Time, "%Y.%m.%d %H:%M:%S")) %>%
select(timestamp, Date.Time, everything()) %>% select(-Date.Time) %>%
select(-c(Status_0:"202.5", "212.5":"215"))
# pass the character type
colnames(df)[-1] <- paste(type, colnames(df)[-1], sep = "_")
return(df)
}
Then you can use it.
my_fun(ex1, "comp") # view
new_ex1 <- my_fun(ex1, "comp") # save to variable new_ex1
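To run the same function over several dataframes at once and keep their names, one option is a named list plus lapply. The sketch below is self-contained and uses a hypothetical, simplified stand-in (tidy_sketch) instead of my_fun, so it needs no packages; match.arg restricts type to "raw" or "comp".

```r
# Hypothetical simplified stand-in for my_fun: promotes the first row to
# column names and prefixes every column except the first with `type`
tidy_sketch <- function(df, type = c("raw", "comp")) {
  stopifnot(is.data.frame(df))
  type <- match.arg(type)                       # errors unless "raw" or "comp"
  names(df) <- as.character(unlist(df[1, ]))    # first row becomes the header
  df <- df[-1, , drop = FALSE]                  # drop the header row
  names(df)[-1] <- paste(type, names(df)[-1], sep = "_")
  df
}

# Cut-down toy versions of ex1 and ex2
ex1 <- data.frame(info1 = c("Day", "2018.04.03 10:47:33"), X = c("200", "1"),
                  stringsAsFactors = FALSE)
ex2 <- data.frame(info1 = c("Day", "2018.04.10 12:47:33"), X = c("200", "1"),
                  stringsAsFactors = FALSE)

# A named list keeps the original names attached to the results
cleaned <- lapply(list(ex1 = ex1, ex2 = ex2), tidy_sketch, type = "comp")
names(cleaned$ex1)
```

If the results must replace the original objects, list2env(cleaned, envir = .GlobalEnv) would write them back under the same names.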

Match a pattern within any element of the data using data table rather than plyr

I have a very big data set and have not used data.table before. I am finding the syntax a bit difficult to follow. My main question is: how can I reproduce the 'apply' function for a data table?
My data is as follows
dat1 <- structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"), diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor")), .Names = c("id", "diag1", "diag2", "diag3"), row.names = c(NA, -4L), class = "data.frame")
I want to add a variable flagging all records that have a diagnostic code of I20, I21 or I60 in any of the columns diag1, diag2 or diag3. Using apply and a regex, I have done the following:
code.list <- c("I20","I21","I60")
dat1$index <- apply(dat1[2:4],1, function(i) any(grep(paste(code.list,
collapse="|"), i)))
The final dataset that I want is shown below:
structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"),diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor"), index = c(TRUE, TRUE, FALSE, TRUE)), .Names = c("id","diag1", "diag2", "diag3", "index"), row.names = c(NA, -4L), class = "data.frame")
However this is going to take far too long using plyr. I was hoping to get the syntax for a data table. Would anybody be able to help?
Thanks in advance
A
We can do this with data.table
library(data.table)
setDT(dat1)[, index := Reduce(`|`, lapply(.SD, grepl,
pattern = paste(code.list, collapse="|"))), .SDcols = 2:4]
dat1
#    id diag1 diag2 diag3 index
# 1:  1 I20.1 I60.9        TRUE
# 2:  1 I21.3   I50 I38.1  TRUE
# 3:  2   I48             FALSE
# 4:  3 I60.8              TRUE
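The speed-up comes from calling grepl once per column (vectorised over rows) instead of once per row. A base R sketch of the same column-wise idea, without data.table; anchoring the pattern with ^ is an added assumption that the codes should match only at the start of each value.

```r
dat1 <- data.frame(
  id    = c(1L, 1L, 2L, 3L),
  diag1 = c("I20.1", "I21.3", "I48", "I60.8"),
  diag2 = c("I60.9", "I50", "", ""),
  diag3 = c("", "I38.1", "", ""),
  stringsAsFactors = FALSE
)
code.list <- c("I20", "I21", "I60")

# Assumption: codes are matched at the start of the value only
pat <- paste0("^(", paste(code.list, collapse = "|"), ")")

# One logical vector per diag column, then combine them element-wise with OR
hits <- lapply(dat1[2:4], grepl, pattern = pat)
dat1$index <- Reduce(`|`, hits)
dat1$index
```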

Simple text cleaning in all columns of a dataframe

I have a dataframe to which I would like to apply some basic formatting rules.
The dataframe is:
df <- structure(list(colname1 = structure(c(2L, 1L, 1L), .Label = c("",
"TEXTA"), class = "factor"), colname2 = structure(c(2L, 1L, 3L
), .Label = c("TEXTA", "TEXTB", "TEXTE"), class = "factor"),
colname3 = structure(c(2L, 3L, 1L), .Label = c("", "TEXTC",
"TEXTD"), class = "factor")), .Names = c("colname1", "colname2",
"colname3"), class = "data.frame", row.names = c(NA, -3L))
I tried to run the following on the whole dataframe:
df2 <- as.data.frame(tolower(df))
df2 <- as.data.frame(gsub("[[:punct:]]", "", df2))
but this converts the column names of the dataframe to rows. What can I do to convert the text to lower case and remove punctuation from all rows of the example dataframe (I am not interested in the colnames)?
We remove the punctuation characters from each column by looping through the columns (lapply(df, ..)) and assigning the output back to the original dataset:
df[] <- lapply(df, function(x) gsub("[[:punct:]]+", "", tolower(x)))
Using tidyverse, this can be done by
library(dplyr)
df %>%
# funs() is deprecated in current dplyr; across() with a lambda does the same job
mutate(across(everything(), ~ gsub("[[:punct:]]+", "", tolower(.x))))
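The example dataframe in the question contains no punctuation, so the effect is hard to see there. A small sketch with made-up values (the column contents below are invented for illustration) shows what the base R version does:

```r
# Invented toy data containing upper case and punctuation
df <- data.frame(
  colname1 = c("Text-A!", ""),
  colname2 = c("TEXT.B", "te,xt"),
  stringsAsFactors = TRUE
)

# df[] keeps the data.frame structure and column names while replacing contents
df[] <- lapply(df, function(x) gsub("[[:punct:]]+", "", tolower(x)))
df
```

Note that the columns come back as character vectors rather than factors.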

how to find similar strings within a data

My data looks like this
df<- structure(list(A = structure(c(7L, 6L, 5L, 4L, 3L, 2L, 1L, 1L,
1L), .Label = c("", "P42356;Q8N8J0;A4QPH2", "P67809;Q9Y2T7",
"Q08554", "Q13835", "Q5T749", "Q9NZT1"), class = "factor"), B = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("P62861", "P62906",
"P62979;P0CG47;P0CG48", "P63241;Q6IS14", "Q02413", "Q07955",
"Q08554", "Q5T749", "Q9UQ80"), class = "factor"), C = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("", "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876",
"P63241;Q6IS14", "Q02413", "Q16658", "Q5T750", "Q6P1N9", "Q99497",
"Q9UQ80"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA,
-9L))
I want to count how many elements are in each column, including those that are separated by a ";". For example, in this case the first column has 9 elements, the second column has 12 and the third column has 16. Then I want to check how many times an element is repeated in other columns, for example:
string    number of times    columns
Q5T749    2                  1,2
Then I want to remove the strings which are seen more than once from the df.
One way to approach this is to start by re-organizing the data into a form that is more convenient to work with. The tidyr and dplyr packages are useful for that sort of thing.
library(tidyr)
df$index <- 1:nrow(df)
df <- gather(df, key = 'variable', value = 'value', -index, na.rm = TRUE)
df <- separate(df, "value", into = paste("x", 1:(1 + max(nchar(gsub("[^;]", "", df$value)))), sep = ""), sep = ";", fill = "right")
df <- gather(df, "which", "value", -index, -variable)
Once you do that counting each element is easy:
addmargins(t(table(df[, c("variable", "value")])), margin = 2)
Dropping duplicates is also easy.
df <- df[!duplicated(df$value), ]
If you really want to put the data back into the original form, you can (though I don't recommend it).
df <- spread(df, key = "variable", value = "value")
library(dplyr)
summarize(group_by(df, index),
A = paste(na.omit(A), collapse = ";"),
B = paste(na.omit(B), collapse = ";"),
C = paste(na.omit(C), collapse = ";"))
For the count of elements in each column use this
sapply(df,function(x) length(unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";"))))
For counting the repetition use this
words <- lapply(df,function(x) unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";")))
dup_table <- table(unlist(words))
dup_table
Here is a crude (not recommended) way to remove the repetitions:
pat <- names(dup_table)[unname(dup_table)>1]
for(i in pat)
df <- as.data.frame.list(lapply(df,function(x) gsub(pattern = i,replacement = "",x)))
But there is one problem: it replaces every occurrence of a matched pattern, so no copy of a repeated string is kept.
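An alternative sketch in base R that avoids regex replacement: split each cell on ";", count the pieces, and drop repeated identifiers by exact match. The data below is a cut-down version of the question's df, and the helper names (ids, repeated, df_clean) are illustrative only. Like the loop above, it removes every copy of a repeated identifier.

```r
df <- data.frame(
  A = c("Q9NZT1", "Q5T749", "P67809;Q9Y2T7", ""),
  B = c("Q9UQ80", "Q5T749", "Q08554", "P63241;Q6IS14"),
  C = c("Q9UQ80", "Q99497", "Q02413", "P63241;Q6IS14"),
  stringsAsFactors = FALSE
)

# Split every cell on ";" so each identifier is its own element
ids <- lapply(df, function(x) strsplit(as.character(x), ";", fixed = TRUE))

# Number of identifiers per column (empty cells contribute nothing)
n_per_col <- sapply(ids, function(cells) length(unlist(cells)))

# Identifiers that occur more than once anywhere in the data
counts <- table(unlist(ids))
repeated <- names(counts)[counts > 1]

# Remove repeated identifiers by exact match, then re-join each cell with ";"
df_clean <- as.data.frame(
  lapply(ids, function(cells)
    vapply(cells, function(v) paste(setdiff(v, repeated), collapse = ";"),
           character(1))),
  stringsAsFactors = FALSE
)
```

Because the comparison is on whole identifiers, an ID that happens to be a substring of another one is left untouched.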

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example)
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
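library(data.table)
mydat.dt <- data.table(mydat)
# collapse only the string columns; tally and score are summed separately below
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse = " "), by = date, .SDcols = c("location", "value")]
cbind(mydat.dt, aggregate(mydat[, c("tally", "score")], by = list(mydat$date), FUN = sum, na.rm = TRUE)[2:3])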
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.
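A one-step variant can be sketched in base R by splitting on the date and building one row per day. This is only an illustration under stated assumptions: unique values are joined with ", " (the question asked for arrays, so a list column would be the alternative), and missing numbers are ignored when summing.

```r
mydat <- data.frame(
  date     = c("2016-06-30T09:30Z", "2016-06-30T12:30Z", "2016-06-30T19:30Z"),
  location = c("home", "work", "home"),
  value    = c("foo", "foo", "bar"),
  tally    = c(1L, 2L, NA),
  score    = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Keep only the calendar day (first 10 characters of the timestamp)
mydat$date <- as.Date(substr(mydat$date, 1, 10))

# One output row per day: unique strings joined, numbers summed
daily <- do.call(rbind, lapply(split(mydat, mydat$date), function(d) {
  data.frame(
    date     = d$date[1],
    location = paste(unique(d$location), collapse = ", "),
    value    = paste(unique(d$value), collapse = ", "),
    tally    = sum(d$tally, na.rm = TRUE),
    score    = sum(d$score, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}))
daily
```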
