R: manipulate multiple files in a folder and combine them

I have a folder with many .csv files that I want to a) manipulate and b) combine, so that each file becomes one row in a new file.
An example file is shown below. All the others have the same format:
data <- structure(list(view_history = c("[{\"page_index\":0,\"viewing_time\":3078.7250000284985},{\"page_index\":1,\"viewing_time\":1287.8200000268407}]",
NA, NA, NA, NA, NA), rt = c("4367.33", "32741.89", "84982.255",
"44164.12", "16395.195", "21816.545"), trial_type = c("instructions",
"html-button-response", "survey-multi-choice", "survey-multi-choice",
"survey-multi-choice", "survey-multi-choice"), trial_index = c(0,
1, 2, 3, 4, 5), time_elapsed = c(4369, 37115, 122101, 166268,
182665, 204484), internal_node_id = c("0.0-0.0", "0.0-1.0", "0.0-2.0",
"0.0-3.0", "0.0-4.0", "0.0-5.0"), stimulus = c(NA, "The price of hourly piano course has a mean of $100 with a standard devation of $20. Random samples are taken from the population from small to large sample sizes.</br><img src='LLI_wrong.png' style= 'width:25%; height:30%'><img src= 'LLI_graph_2.png' style= 'width:25%; height:30%'> <br/><img src= 'LLI_wrong2.png' style= 'width:25%; height:30%'><img src= 'LLI_wrong3.png' style= 'width:25%; height:30%'>",
NA, NA, NA, NA), button_pressed = c(NA, 3, NA, NA, NA, NA), responses = c(NA,
NA, "{\"WQ2\":\"<strong>B.</strong> You should go to the large office.\"}",
"{\"WQ3\":\"<strong>B.</strong> The number of days on which mean heights were over 71 inches would be greater for the large post office than for the small post office.\"}",
"{\"WQ4\":\"<strong>B.</strong> The large street\"}", "{\"R2\":\"<strong>A. </strong> As the sample size increases, its mean will tend to be closer to that of the population\"}"
), question_order = c(NA, NA, "[0]", "[0]", "[0]", "[0]"), correct_response = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), accuracy = c(NA,
NA, NA, NA, NA, NA), key_press = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Next, I'm organizing and manipulating this data:
#keep only the related columns
data2 <- select(data, time_elapsed, button_pressed, responses, accuracy)
#add time on task
data3 <- mutate(data2, time = tail(time_elapsed, 1))
#data shrunk + time on task added
transformed_data <- select(data3, -time_elapsed)
#select the necessary cells and turn the data into a vector
new_data <- c(transformed_data$button_pressed[2], transformed_data$responses[3:6],
transformed_data$button_pressed[7], transformed_data$time[1])
Next, I transpose the data and write it to a csv file:
new_data <- t(new_data)
write.csv(as.data.frame(new_data), "hello_data.csv")
What I want to do next, but couldn't figure out:
Loop this process over all .csv files in the folder so that each row in the new file corresponds to the data from one file.

Get all the files from the folder with list.files:
files <- list.files(path = "/path/to/folder", pattern = "\\.csv$", full.names = TRUE)
Then loop over the files, read each one, and reduce it to a single row:
library(dplyr)
library(purrr)
out <- map_dfr(files, ~ {
  transformed_data <- readr::read_csv(.x) %>%
    dplyr::select(time_elapsed, button_pressed, responses, accuracy) %>%
    dplyr::mutate(time = time_elapsed[n()], time_elapsed = NULL)
  # c() flattens the selected cells into one vector and t() turns it into a
  # single row; as.data.frame(list(...)) would recycle to four rows per file
  as.data.frame(t(c(transformed_data$button_pressed[2], transformed_data$responses[3:6],
    transformed_data$button_pressed[7], transformed_data$time[1])))
})
readr::write_csv(out, "hello_data.csv")
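The same "one file in, one row out" pattern can also be sketched in base R without extra packages. This is only an illustration under stated assumptions: summarise_file is a hypothetical helper, the two CSV files are made-up data written to a temporary folder, and only two of the question's cells are extracted.

```r
# Hypothetical helper: reduce one file to a single-row data frame
summarise_file <- function(path) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  data.frame(
    file = basename(path),
    button = d$button_pressed[2],      # second row, as in the question
    time = d$time_elapsed[nrow(d)],    # last elapsed time = time on task
    stringsAsFactors = FALSE
  )
}

# Demo data: two small made-up files in a temporary folder
folder <- file.path(tempdir(), "csv_demo")
dir.create(folder, showWarnings = FALSE)
for (i in 1:2) {
  write.csv(
    data.frame(time_elapsed = c(100, 200, 300) * i, button_pressed = c(NA, i, NA)),
    file.path(folder, paste0("subject", i, ".csv")),
    row.names = FALSE
  )
}

# One row per file, stacked into a single data frame
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
out <- do.call(rbind, lapply(files, summarise_file))
print(out)
```

Keeping the per-file logic in its own function makes it easy to test on a single file before looping over the whole folder.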

Related

Updating multiple date columns in R conditioned on a particular column

I have a table that consists of only columns of type Date. The data is about shopping behavior of customers on a website. The columns correspond to the first time an event is triggered by a customer (NULL if no occurrence of the event). One of the columns is the purchase motion.
Here's an MRE for the starting state of the table:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, 10613), class = "Date"), action_C = structure(c(10707,
10729, 10739, NA, NA), class = "Date")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"))
I want to update the table so that, in each row, every date that did not occur within the 30 days prior to the purchase is replaced with NULL. However, if the purchase date is NULL, I'd like to keep the dates of the other events.
So after my envisioned transformation, the above table should look like the following:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, NA), class = "Date"), action_C = structure(c(10707,
10729, NA, NA, NA), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I have yet to be able to achieve this transformation, and would appreciate the help!
Finally, I'd like to transform the above table into a binary format. I've achieved this via the below code segment; however, I'd like to know if I can do this in a simpler way.
df_c <- df_b %>%
is.na() %>%
magrittr::not() %>%
data.frame()
df_c <- df_c * 1
I assume that by saying "replaced by NULL" you actually mean "replaced by NA".
I also assume that the first structure in your question is df_a.
df_b <- df_a %>% mutate(across(starts_with("action"),
~ if_else(purchase - . > 30, as.Date(NA), .)))
mutate(across(cols, func)) applies func to all selected cols.
The real trick here is to use if_else and cast NA to the Date class. Otherwise, the dates will be converted to numeric vectors.
Result:
# Tibble (class tbl_df) 4 x 5:
│purchase │action_A│action_B │action_C
1│NA │NA │NA │NA
2│NA │NA │NA │NA
3│1999-05-18│NA │1999-05-02│1999-05-28
4│NA │NA │NA │NA
5│1999-05-26│NA │NA │NA
One problem which remains as a homework exercise: how do you modify the if_else such that you will keep the action if purchase is NA? (this should be now very simple!) I did not include that on purpose because you omitted it from the question.
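One possible way to finish the exercise is sketched below, together with a shorter route to the binary table asked about in the question. It rests on two assumptions that are not stated in the answer: dates are kept whenever purchase is NA by guarding the condition with !is.na(purchase), and dates that fall after the purchase are also blanked, which is inferred from the expected output in the question (action_C in row 3). df_a is rebuilt here from the question's values.

```r
library(dplyr)

# The question's starting table, rebuilt with the same values
df_a <- tibble(
  purchase = structure(c(NA, NA, 10729, NA, 10737), class = "Date"),
  action_A = structure(rep(NA_real_, 5), class = "Date"),
  action_B = structure(c(NA, NA, 10713, NA, 10613), class = "Date"),
  action_C = structure(c(10707, 10729, 10739, NA, NA), class = "Date")
)

# Blank an action only when a purchase exists AND the action is either
# more than 30 days before it or (assumption) after it
df_b <- df_a %>%
  mutate(across(
    starts_with("action"),
    ~ if_else(
      !is.na(purchase) & (purchase - .x > 30 | .x > purchase),
      as.Date(NA),
      .x
    )
  ))

# Binary version: 1 where a date is present, 0 where it is NA
df_c <- df_b %>%
  mutate(across(everything(), ~ as.integer(!is.na(.x))))

print(df_b)
print(df_c)
```

If events after the purchase should be kept instead, dropping the `.x > purchase` clause restores the answer's original 30-day condition.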

Joining 'n' number of lists and perform a function in R

I have a dataframe that contains many triplicates (sets of 3 columns), and I have grouped the dataframe so that each triplicate is a separate element of a list.
The example dataset is:
example_data <- structure(list(`1_3ng` = c(69648445400, 73518145600, NA, NA,
73529102400, 75481088000, NA, 73545910600, 74473949200, 77396199900
), `2_3ng` = c(71187990600, 70677690400, NA, 73675407400, 73215342700,
NA, NA, 69996254800, 69795686400, 76951318300), `3_3ng` = c(65032022000,
71248214000, NA, 72393058300, 72025550900, 71041067000, 73604692000,
NA, 73324202000, 75969608700), `4_7-5ng` = c(NA, 65845061600,
75009245100, 64021237700, 66960666600, 69055643600, NA, 64899540900,
NA, NA), `5_7-5ng` = c(65097201700, NA, NA, 69032126500, NA,
70189899800, NA, 74143529100, 69299087400, NA), `6_7-5ng` = c(71964413900,
69048485800, NA, 71281569700, 71167596500, NA, NA, 68389822800,
69322289200, NA), `7_10ng` = c(71420403700, 67552276500, 72888076300,
66491357100, NA, 68165019600, 70876631000, NA, 69174190100, 63782945300
), `8_10ng` = c(NA, 71179401200, 68959365100, 70570182700, 73032738800,
NA, 74807496700, NA, 71812102100, 73855098500), `9_10ng` = c(NA,
70403756100, NA, 70277421000, 69887731700, 69818871800, NA, 71353886700,
NA, 74115466700), `10_15ng` = c(NA, NA, 68487581700, NA, NA,
69056997400, NA, 67780479400, 66804467800, 72291939500), `11_15ng` = c(NA,
63599643700, NA, NA, 60752029700, NA, NA, 63403655600, NA, 64548492900
), `12_15ng` = c(NA, 67344750600, 61610182700, 67414425600, 65946654700,
66166118400, NA, 70830837700, 67288305700, 69911451300)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
After grouping I get four lists, since the example dataset contains 4 groups. I used the following R code to group the data:
grouping_data<-function(df){ #df= dataframe
df_col<-ncol(df) #calculates no. of columns in dataframe
groups<-sort(rep(0:((df_col/3)-1),3)) #creates user determined groups
id<-list() #creates empty list
for (i in 1:length(unique(groups))){
id[[i]]<-which(groups == unique(groups)[i])} #creates list of groups
names(id)<-paste0("id",unique(groups)) #assigns group based names to the list "id"
data<-list() #creates empty list
for (i in 1:length(id)){
data[[i]]<-df[,id[[i]]]} #creates list of dataframe columns sorted by groups
names(data)<-paste0("data",unique(groups)) #assigns group based names to the list "data"
return(data)}
group_data <-grouping_data(example_data)
Please suggest R code for applying a particular function to all the lists at once.
For example, I have applied the function below in the following way:
#VSN Normalization
vsnNorm <- function(dat) {
dat<-as.data.frame(dat)
vsnNormed <- suppressMessages(vsn::justvsn(as.matrix(dat)))
colnames(vsnNormed) <- colnames(dat)
row.names(vsnNormed) <- rownames(dat)
return(as.matrix(vsnNormed))
}
And I called it like this:
vsn.dat0 <- vsnNorm(group_data$data0)
vsn.dat1 <- vsnNorm(group_data$data1)
vsn.dat2 <- vsnNorm(group_data$data2)
vsn.dat3 <- vsnNorm(group_data$data3)
vsn.dat <- cbind (vsn.dat0,vsn.dat1,vsn.dat2,vsn.dat3)
It works well.
But the number of triplicates (sets of 3 columns) may change from dataset to dataset, and calling the function on each list by hand every time would be tedious.
So please share some code that applies a function to all the resulting lists and combines the results into a single object.
Thank you in advance.
The shortcut you are looking for is:
vsn.dat <- do.call("cbind", lapply(group_data, vsnNorm))
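The lapply() plus do.call() pattern can be tried without the vsn package. The sketch below is an illustration only: the data are made up, center_cols is a hypothetical stand-in for vsnNorm, and split.default() is used as a compact alternative to the grouping function in the question. The groups are combined column-wise to mirror the cbind() call in the question's manual version.

```r
# Toy data: 6 columns = 2 groups of 3 replicates (made-up values)
df <- data.frame(
  a1 = c(1, 2, 3), a2 = c(2, 3, 4), a3 = c(3, 4, 5),
  b1 = c(10, 20, 30), b2 = c(20, 30, 40), b3 = c(30, 40, 50)
)

# Split the columns into consecutive groups of `group_size`
group_size <- 3
groups <- rep(seq_len(ncol(df) / group_size), each = group_size)
group_data <- split.default(df, groups)

# Hypothetical stand-in for vsnNorm: centre each column on its mean
center_cols <- function(dat) scale(as.matrix(dat), center = TRUE, scale = FALSE)

# Apply the function to every group, then bind the results side by side
result <- do.call(cbind, lapply(group_data, center_cols))
print(result)
```

Because the number of groups is derived from ncol(df), the same lines work whether the dataset holds four triplicates or forty.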

CREATE MULTIPLE DATAFRAMES

I have a dataframe(df) that looks like below:
Objective: I want to create 52 dataframes, but I don't know how to do it with dplyr.
Assuming your dataframe is in variable df, try the following code:
library(dplyr)
columns_name = names(df) #names of the columns in your dataframe
df_list = list() #empty list to store output dataframes
#loop through the columns of the original dataframe,
#selecting the first and the (i+1)-th column, dropping rows with NA,
#and storing the resulting dataframe in a list
for (i in seq_len(length(columns_name) - 1)) {
  df_list[[i]] = df %>%
    select(all_of(c(columns_name[1], columns_name[i + 1]))) %>%
    filter(if_all(everything(), ~ !is.na(.x)))
}
#access smaller dataframes using the following code
df_list[[1]]
df_list[[2]]
Try the following code:
library(dplyr)
library(tidyr)
#Code
new <- df %>% pivot_longer(-1) %>%
group_by(name) %>%
filter(!is.na(value))
#List
List <- split(new,new$name)
#Set to envir
list2env(List,envir = .GlobalEnv)
Some data used:
#Data
df <- structure(list(id_unico = c("112172-1", "112195-1", "112257-1",
"112268-1", "112383-1", "112452-1", "112715-1", "112716-1", "112761-1",
"112989-1"), P101COD = c(NA, NA, NA, NA, NA, 411010106L, NA,
NA, 411010106L, NA), P102COD = c(421010102L, 421010102L, 421010102L,
421010102L, 421010102L, NA, 421010108L, 421010108L, NA, 421010102L
), P103COD = c(441010109L, 441010109L, 441010109L, 441010109L,
441010109L, 441010109L, 441010109L, 441010109L, 441010109L, 441010101L
), P110_52_COD = c(NA, 831020103L, 831020103L, NA, 831020103L,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))
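For comparison, the same split can be sketched in base R with lapply(), which returns a named list without writing objects into the global environment. The small data frame below is made up for illustration and only assumes, as in the answers above, that the first column is the id and every other column becomes its own data frame.

```r
# Made-up data with the same layout: one id column, several value columns
df <- data.frame(
  id = c("a", "b", "c"),
  P1 = c(1L, NA, 3L),
  P2 = c(NA, NA, 9L),
  stringsAsFactors = FALSE
)

value_cols <- names(df)[-1]

# One data frame per value column, keeping only rows where it is not NA
df_list <- lapply(value_cols, function(col) {
  sub <- df[, c(names(df)[1], col)]
  sub[!is.na(sub[[col]]), , drop = FALSE]
})
names(df_list) <- value_cols

print(df_list$P1)
print(df_list$P2)
```

Keeping the pieces in a list means they can be processed later with another lapply() call instead of being referenced by 52 separate names.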

Conditionally replace cells in data frame based on another data frame

In the interest of learning better coding practices, can anyone show me a more efficient way of solving my problem? Maybe one that doesn't require new columns...
Problem: I have two data frames: one is my main data table (t) and the other contains changes I need to replace in the main table (Manual_changes). Example: Sometimes the CaseID is matched with the wrong EmployeeID in the file.
I can't provide the main data table, but the Manual_changes file looks like this:
Manual_changes = structure(list(`Case ID` = c(46605, 25321, 61790, 43047, 12157,
16173, 94764, 38700, 41798, 56198, 79467, 61907, 89057, 34232,
100189), `Employee ID` = c(NA, NA, NA, NA, NA, NA, NA, NA, 906572,
164978, 145724, 874472, 654830, 846333, 256403), `Age in Days` = c(3,
3, 3, 12, 0, 0, 5, 0, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
temp = merge(t, Manual_changes, by = "Case ID", all.x = TRUE)
temp$`Employee ID.y` = ifelse(is.na(temp$`Employee ID.y`), temp$`Employee ID.x`, temp$`Employee ID.y`)
temp$`Age in Days.y`= ifelse(is.na(temp$`Age in Days.y`), temp$`Age in Days.x`, temp$`Age in Days.y`)
temp$`Age in Days.x` = NULL
temp$`Employee ID.x` = NULL
colnames(temp) = colnames(t)
t = temp
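We could use coalesce. Put the Manual_changes column (.y) first so that a manual change wins whenever it is not NA, as in your ifelse version:
library(dplyr)
left_join(t, Manual_changes, by = "Case ID") %>%
  mutate(`Employee ID` = coalesce(`Employee ID.y`, `Employee ID.x`),
         `Age in Days` = coalesce(`Age in Days.y`, `Age in Days.x`)) %>%
  select(-ends_with(".x"), -ends_with(".y"))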
Or with data.table
library(data.table)
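# In an update join, columns coming from Manual_changes carry the i. prefix;
# fcoalesce needs both columns to have the same type
setDT(t)[Manual_changes,
  c('Employee ID', 'Age in Days') :=
    .(fcoalesce(`i.Employee ID`, `Employee ID`),
      fcoalesce(`i.Age in Days`, `Age in Days`)),
  on = "Case ID"]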
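Since the main table is not shown in the question, the coalesce idea can be checked on a made-up stand-in. In the sketch below, the contents of t and Manual_changes are invented for illustration, and the manual value is assumed to take priority whenever it is not NA.

```r
library(dplyr)

# Made-up stand-in for the main table, which the question does not show
t <- tibble(
  `Case ID` = c(1, 2, 3),
  `Employee ID` = c(10, 20, 30),
  `Age in Days` = c(1, 2, 3)
)

# Made-up corrections: case 2 gets a new employee, case 3 a new age
Manual_changes <- tibble(
  `Case ID` = c(2, 3),
  `Employee ID` = c(99, NA),
  `Age in Days` = c(NA, 7)
)

# Join, then take the manual value when present, else the original
patched <- left_join(t, Manual_changes, by = "Case ID") %>%
  mutate(
    `Employee ID` = coalesce(`Employee ID.y`, `Employee ID.x`),
    `Age in Days` = coalesce(`Age in Days.y`, `Age in Days.x`)
  ) %>%
  select(`Case ID`, `Employee ID`, `Age in Days`)

print(patched)
```

Case 1 has no entry in Manual_changes, so both of its values pass through unchanged.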

linear regression model with dplyr on specified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
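If only the slope is needed, it can also be computed row by row without reshaping, because the least-squares slope of y on x equals cov(x, y) / var(x). The sketch below is an alternative to the lm() loop, not the answer's method: row_slope is a hypothetical helper, the data are made up, and rows with fewer than two complete (x, y) pairs are assumed to get NA.

```r
# Hypothetical helper: least-squares slope of y on x for one row,
# using only the pairs where both values are present
row_slope <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  if (sum(ok) < 2) return(NA_real_)
  cov(x[ok], y[ok]) / var(x[ok])
}

# Made-up data in the same wide layout (x1..x3, y1..y3)
toy <- data.frame(
  x1 = c(1, 1), x2 = c(2, 2), x3 = c(3, NA),
  y1 = c(2, 5), y2 = c(4, NA), y3 = c(6, NA)
)

# Pair the x and y columns by position and compute one slope per row
xs <- as.matrix(toy[, grep("^x", names(toy))])
ys <- as.matrix(toy[, grep("^y", names(toy))])
toy$Slope <- vapply(
  seq_len(nrow(toy)),
  function(i) row_slope(xs[i, ], ys[i, ]),
  numeric(1)
)

print(toy)
```

This relies on the x and y columns appearing in matching order, which holds for names like x1..x4 and y1..y4.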
