I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with ids 00–59). For each person, there are two values I need: time of response and picture response given (a number 1–16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])

# Creating long dataframe for times
pam[x]_long_times <- gather(
  select(pam[x]df, starts_with("resp")),
  key = "time",
  value = "resp_times"
)

# Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
  select(pam[x]df, starts_with("pic")),
  key = "picture",
  value = "pic_num"
)

# Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
  select(resp_times, pic_num) %>%
  add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) imported the files in the format I needed, without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
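(For reference, a minimal sketch of that jsonlite import, assuming the PAM_u00.json ... PAM_u59.json naming from the question; untested against the real files:)

library(jsonlite)
library(dplyr)

ids <- sprintf("%02d", 0:59) # "00" ... "59"

pam_all <- bind_rows(lapply(ids, function(id) {
  # jsonlite::fromJSON() takes the path as its first argument (no `file =`)
  df <- as.data.frame(fromJSON(paste0("PAM_u", id, ".json")))
  df$id <- id # keep track of the person each row belongs to
  df
}))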
I don't know the structure of your json files. If your working directory is not the same folder as the json files, try this:
library(jsonlite)
# setup - read files
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)
# import data
pam <- list()
pam_df <- list()

for (i in seq_along(files)) {
  # jsonlite's fromJSON() takes the path as its first argument (rjson's takes `file =`)
  pam[[i]] <- fromJSON(files[i])
  pam_df[[i]] <- as.data.frame(pam[[i]])
}
This lists all json files in the folder and builds a character vector, of length 60 in your case. Then you sequence along that vector and read each file.
At the end you can bind_rows() everything together, or add your reshaping code inside the for loop. But remember to initialise the result objects before the loop starts, e.g. pam_long_pics <- list().
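To sketch how your reshaping code could sit inside that loop (untested, since the structure of the json files is unknown; files comes from the list.files() call above):

library(dplyr)
library(tidyr)

pam_long <- list() # initialise the result before the loop starts

for (i in seq_along(files)) {
  df <- as.data.frame(fromJSON(files[i]))
  long_times <- gather(select(df, starts_with("resp")),
                       key = "time", value = "resp_times")
  long_pics <- gather(select(df, starts_with("pic")),
                      key = "picture", value = "pic_num")
  # one long data frame per person, tagged with the file it came from
  pam_long[[i]] <- bind_cols(long_times, long_pics) %>%
    select(resp_times, pic_num) %>%
    mutate(id = basename(files[i]))
}

pam_all <- bind_rows(pam_long)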
Hope that helped? Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "*.json", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
  mutate(Long_times = map(File_contents,
                          ~ gather(.x, key = "time", value = "resp_times", starts_with("resp"))),
         Long_pics = map(File_contents,
                         ~ gather(.x, key = "picture", value = "pic_num", starts_with("pic")))) %>%
  unnest(c(Long_times, Long_pics)) %>%
  select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.
I have around 400,000 objects, each with an id and a specific status. Every day (365 days a year) a status file is generated as follows:
Example of a status file:
n = 400000
df <- data.frame(id = sample(200000000000:299999999999, size = n, replace = FALSE),
                 status = sample(seq(100, 900, 10), size = n, replace = TRUE))
time <- paste0(Sys.Date(), "_", format(Sys.time(), "%H%M"))
writexl::write_xlsx(df, paste0("status_", time, ".xlsx"))
Now, my question is: how can I most efficiently put these files together so that I can analyse how certain statuses develop over time? To save some memory I didn't create a date variable within each status file, as it would be the same for all cases (the date is only in the filename).
One status file is around 6 or 7 MB, so over a year this would sum up to around 2.5 GB.
The easiest way would be to list the files, then read them in, and then join them.
So you could do something like this:
library(dplyr)
library(tidyr)

dir <- "path to directory"
files <- list.files(dir, full.names = TRUE)

# can change based on file type (if you convert to csv, you can use data.table::fread)
data <- purrr::map(files, readxl::read_excel)

# you'll probably want the date of each file (it only lives in the filename)
dates <- dir %>%
  list.files() %>%
  stringr::str_remove("\\.xlsx$") %>%
  stringr::str_remove("status_")

# attach the matching date string to every row of the corresponding file
data <- purrr::map2(data, dates, ~ mutate(.x, date = .y))

data_join <- purrr::reduce(data, rbind)

# you might want to convert the dates into date-times, here's an example
data_join <- data_join %>%
  separate(date, into = c("date", "time"), sep = "_") %>%
  mutate(date_time = lubridate::ymd_hm(paste(date, time)),
         date = lubridate::ymd(date))
This might not be the fastest solution in R, but it is one of the fastest ways to get up and running from the Excel files.
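As the comment in the snippet above hints, if you convert the files to csv once, the reading step gets much faster with data.table; a sketch (path and file layout assumed as before):

library(data.table)

csv_files <- list.files("path to directory", pattern = "\\.csv$", full.names = TRUE)

# fread() is much faster than read_excel(); rbindlist() stacks all files,
# and idcol records which file each row came from
data_join <- rbindlist(setNames(lapply(csv_files, fread), basename(csv_files)),
                       idcol = "file")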
So basically I am trying the following loop:
rawData = read.csv(file = "SampleData.csv")
companySplit = split(rawData, rawData$Company)

NameOfCompany <- numeric()
DateOfOrder <- character()
WhichProducts <- numeric()

for (i in 1:length(companySplit)){
  company_DateSplit = split(companySplit[[i]], companySplit[[i]]$Date)
  for (j in 1:length(company_DateSplit)){
    WhichProducts[j] <- (paste0(company_DateSplit[[j]]$ID, collapse=","))
    DateOfOrder[j] <- (paste0(company_DateSplit[[j]]$Date[1]))
    NameOfCompany[j] <- (paste0(companySplit[[i]]$Company[[1]]))
  }
}

df <- data.frame(NameOfCompany, DateOfOrder, WhichProducts)
write.csv(df, file = "basket.csv")
If you check basket.csv, there is output for company D only. I guess it is not being written because of the nesting of the for loops, but I am not able to work it out.
I need exactly the output of basket.csv, but for all companies.
Here are the CSVs:
Input Data: Link
Output of code basket.csv: Link
The output should look like this:
Company, Date, all IDs comma-separated.
e.g.
A,Jan-18,(1,2,4)
A,Feb-18,(1,4)
B,Jan-18,(2,3,4)
I'm able to get it from the above code, but I'm not able to save it to CSV for all companies A, B, C, D. It saves values only for company D, which is the last value in the loop (check the output file link).
The first issue is that you import your data without stringsAsFactors = FALSE, which trips people up all the time. The reason basket.csv only contains company D is that j restarts at 1 for every company, so each pass of the outer loop overwrites the same positions of your result vectors, and only the last company survives. Also, looping in R is usually less efficient and harder to reason about than a more functional approach. I think what you're trying to do can be done with the aggregate function:
rawData <- read.csv(file = "SampleData.csv", stringsAsFactors = FALSE)
df <- aggregate(ID ~ Company + Date, data = rawData, FUN = paste, collapse = ",")
colnames(df) <- c("NameOfCompany", "DateOfOrder", "ID")
df = split(df, df$NameOfCompany)
Or using a tidy approach:
library(dplyr)

df <- rawData %>%
  group_by(Company, Date) %>%
  summarise(WhichProducts = paste(ID, collapse = ",")) %>%
  rename(DateOfOrder = Date, NameOfCompany = Company) %>%
  group_split()
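Since the stated goal was a single basket.csv covering all companies, note that you can write the aggregated frame out before (or instead of) the final split step, for example:

# df from the aggregate() version above, before split()
write.csv(df, file = "basket.csv", row.names = FALSE)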
I need to run a script for each station (I was replacing the numbers one by one in the script), but there are more than 100 stations.
I thought a loop in the script could save me time. I've never written a loop before and don't know if it's possible to do what I want. I've tried the below, but it doesn't work.
Just a bit of my df08 data (txt):
RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0
Code:
station <- sample(50:150, 53, replace = F)

for (i in station)
{
  df08_1 <- filter(df08, V7==station [i])
  colnames(df08_1) <- c("Date", "gauging_station", "code", "precp")
  df08_1 <- unique(df08_1)
  final <- df08_1 %>%
    group_by(Date = floor_date(Date, "1 hour"), gauging_station, code) %>%
    summarize(precp = sum(precp))
  write.csv(final, file = "../station [i].csv", row.names = FALSE)
}
If you're not averse to using some tidyverse packages, I think you could simplify this a bit:
Updated with your new sample data - this runs ok on my computer:
Code:
library(dplyr)

dat %>%
  select(-RowNum) %>%
  distinct() %>%
  group_by(date_hour = lubridate::floor_date(date, 'hour'), gauging_station, code) %>%
  summarize(precp = sum(precp)) %>%
  split(.$gauging_station) %>%
  purrr::map(~ write.csv(.x,
                         # [1]: `file` must be a single string, not a vector
                         file = paste0('../', .x$gauging_station[1], '.csv'),
                         row.names = FALSE))
Data:
dat <- data.table::fread("RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0") %>%
mutate(date = as.POSIXct(date, format = '%m/%d/%Y %H:%M'))
Can't comment for a lack of reputation, but if the code works when you replace station [i] with the number of a station, it sounds like each station is part of the df08 object (dataframe) and has to be extracted from it.
If I understand you correctly, I would do this as follows:
stations <- c(1:100) # put your station IDs into a vector

for (i in stations) { # run the script for each entry in the vector
  # assuming that 'V7' is the name of the (unnamed) seventh column of df08,
  # it could work like this:
  df08_1 <- filter(df08, V7 == i) # if your station names are strings like
                                  # 'station 1', use paste("station", i)
  colnames(df08_1) <- c("Date", "gauging_station", "code", "precp")
  df08_1 <- unique(df08_1)
  final <- df08_1 %>%
    group_by(Date = floor_date(Date, "1 hour"), gauging_station, code) %>%
    summarize(precp = sum(precp)) # floor_date() is presumably lubridate::floor_date()
  # automatically generate file names; you can modify the string to whatever you want, of course
  write.csv(final, file = paste("../station", i, ".csv", sep = ""), row.names = FALSE)
}
If this and all of the other examples don't work, could you provide us with some dummy data to work with, just so we can see what the df08 dataframe looks like, and tell us which package your floor_date() function comes from?
Trying to figure out why, when I run this code, all the information from the columns is being written to the first file only. What I want is for only the data from the columns unique to a MO number to be written out. I believe the problem is in the third line, but I am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  df = MOs_Interest[MOs_Interest$MO_NUMBER == MO, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS", "Difference", "Sum")]
  submit.df <- data.frame(df)
  filename = paste("Variance", "Report", MO, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)

for (mo in unique.mos) {
  submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS", "Difference", "Sum")]
  filename <- paste("Variance", "Report", mo, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
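If you prefer to avoid the explicit loop, base R's split() gives you the same per-MO data frames in one step (a sketch, with the same assumed column names):

cols <- c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
          "ACTRUNHRS", "Difference", "Sum")

# one data frame per unique MO number
by_mo <- split(MOs_Interest[cols], MOs_Interest$MO_NUMBER)

for (mo in names(by_mo)) {
  write.csv(by_mo[[mo]], file = paste0("VarianceReport", mo, ".csv"), row.names = FALSE)
}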
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1, 2, 3),
                              "Item_No" = c(142, 423, 214),
                              "Desc" = c("Plate", "Book", "Table"))

for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO), ])
  filename = paste("This is number ", MO, ".csv", sep = "")
  write.csv(mydf, file = filename, row.names = FALSE)
}
This outputs three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MO_NUMBER  Item_No  Desc
1          142      Plate
I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and site. After reading, a particular column (the 13th, here) is assigned to a variable aot500. I want this variable returned from the function for each combination of product and site; for example, I need variables named aot500.AF, aot500.CM, aot500.RB in the list statement to be returned from this function. I am having trouble with the return statement: there is no error, but there is nothing in dat_in. I expect it to contain dat_in$aot500.AF etc. Please tell me what is wrong in the return statement. Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')

readf <- function(pp, kk) {
  fname.dsa <- paste("../data/site_data_", prod[pp], "/daily_", site[kk], ".dat", sep = "")
  inp.aod <- read.csv(fname.dsa, skip = 4, sep = ",", stringsAsFactors = F, na.strings = "N/A")
  aot500 <- inp.aod[, 13]
  return(list(assign(paste("aot500", siteabbr[kk], sep = "."), aot500)))
}
Almost always there is no need to use assign(); we can solve the problem in two steps: read the files into a list, then give the list names.
(Not tested, as we don't have your files.)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i) {
  fname.dsa <- paste0("../data/site_data_",
                      prod_site[i, "prod"],
                      "/daily_",
                      prod_site[i, "site"],
                      ".dat")
  inp.aod <- read.csv(fname.dsa,
                      skip = 4,
                      stringsAsFactors = FALSE,
                      na.strings = "N/A")
  inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
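Once the names are in place, each series can be pulled out of the list just like the variables you were trying to assign(), e.g.:

# the aot500 series for product "inv" at site "AF"
res$aot500.inv.AF

# or programmatically
res[["aot500.tot.CM"]]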
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements:
#'   the name of a product and the name of a site
readAOT_500 <- function(prodsite,
                        selectedcolumn = c("AOT_500"),
                        path = tempdir()){
  cat(path, prodsite)
  filename <- paste0(path, prodsite[1],
                     prodsite[2], ".csv")
  dtf <- read.csv(filename, stringsAsFactors = FALSE)
  dtf <- dtf[selectedcolumn]
  dtf$prod <- prodsite[1]
  dtf$site <- prodsite[2]
  return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv", "AF"),
                    c("tot", "AF"),
                    c("inv", "CM"),
                    c("tot", "CM"),
                    c("inv", "RB"),
                    c("tot", "RB"))

# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)

# Combine all data frames together
prodsitedata <- Reduce(rbind, prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided (you can skip this section when you have real data). I used varying sample lengths so that the number of observations differs from site to site.
prodsite$samplelength <- sample(1:495, nrow(prodsite))

prodsite %>%
  do(stuff = write.csv(sample_n(daily_CM, .$samplelength),
                       paste0(tempdir(), .$prod, .$site, ".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
  do(read.csv(paste0(tempdir(), .$prod, .$site, ".csv"),
              stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
  select(prod, site, AOT_500)