In R: How to perform a str() on multiple files

How can I run str() in R on all of the files loaded in the workspace at once? I simply want to export this information, in a batch-like process, to a .csv file. I have over 100 of them and want to compare one workspace with another to help locate incongruities in data structure and avoid mismatches.
I came painfully close to a solution via UCLA's R Code Fragment; however, it does not include instructions for building the read.dta call that loops through the files. That is the part I need help with.
What I have so far:
#Define the file path
f <- file.path("C:/User/Datastore/RData")
#List the files in the path
fn <- list.files(f)
#loop through file list, return str() of each .RData file
#write to .csv file with 4 columns (file name, length, size, value)
EDIT
Here is an example of what I am after (the view from RStudio, which simply lists the Name, Type, Length, Size, and Value of all of the RData files). I basically want to replicate this view, but export it to a .csv. I am adding the RStudio tag in case someone knows a way of exporting this table automatically; I couldn't find one.
Thanks in advance.

I've actually written a function for this already. I also asked a question about it, and dealing with promise objects with the function. That post might be of some use to you.
The issue with the last column is that str is not meant to do anything but print a compact description of objects and therefore I couldn't use it (but that's been changed with recent edits). This updated function gives a description for the values similar to that of the RStudio table. The data frames and lists are tricky because their str output is more than one line. This should be good.
objInfo <- function(env = globalenv())
{
    obj <- mget(ls(env), env)
    out <- lapply(obj, function(x) {
        vals1 <- c(
            Type = toString(class(x)),
            Length = length(x),
            Size = object.size(x)
        )
        val2 <- gsub("|^\\s+|'data.frame':\t", "", capture.output(str(x))[1])
        if(grepl("environment", val2)) val2 <- "Environment"
        c(vals1, Value = val2)
    })
    out <- do.call(rbind, out)
    rownames(out) <- seq_len(nrow(out))
    noquote(cbind(Name = names(obj), out))
}
And then we can test it out on a few objects..
x <- 1:10
y <- letters[1:5]
e <- globalenv()
df <- data.frame(x = 1, y = "a")
m <- matrix(1:6)
l <- as.list(1:5)
objInfo()
#   Name    Type        Length Size  Value
# 1 df      data.frame  2      1208  1 obs. of 2 variables
# 2 e       environment 11     56    Environment
# 3 l       list        5      328   List of 5
# 4 m       matrix      6      232   int [1:6, 1] 1 2 3 4 5 6
# 5 objInfo function    1      24408 function (env = globalenv())
# 6 x       integer     10     88    int [1:10] 1 2 3 4 5 6 7 8 9 10
# 7 y       character   5      328   chr [1:5] a b c d e
Which is pretty close I guess. Here's the screen shot of the environment in RStudio.

I would write a function, something like the one below, and then loop over the files with it, so you basically only write the code for a single dataset:
library(foreign)
giveSingleDataset <- function( oneFile ) {
    #Read .dta file
    df <- read.dta( oneFile )
    #Give e.g. structure
    s <- ls.str(df)
    #Return what you want
    return(s)
}
#Actually call the function; fn holds bare file names, so prepend the folder f
result <- lapply( file.path(f, fn), giveSingleDataset )
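For .RData files specifically, a rough sketch of the whole loop could look like the following. It loads each file into its own environment, records one row per object, and writes everything to a .csv. The helper name summarise_rdata, the demo folder, and the output file name are made up for illustration, and the Value column is just the first line of str(), so it only approximates the RStudio table.

```r
# Summarise the objects stored in one .RData file as a data frame
# (summarise_rdata is a hypothetical helper name)
summarise_rdata <- function(path) {
  env <- new.env()
  load(path, envir = env)          # keep the loaded objects out of the workspace
  rows <- lapply(ls(env), function(nm) {
    x <- get(nm, envir = env)
    data.frame(File   = basename(path),
               Name   = nm,
               Type   = toString(class(x)),
               Length = length(x),
               Size   = as.numeric(object.size(x)),
               Value  = trimws(capture.output(str(x))[1]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Demo on throwaway files in a temporary folder (replace f with your own path)
f <- file.path(tempdir(), "rdata_demo")
dir.create(f, showWarnings = FALSE)
x <- 1:10
y <- letters[1:5]
save(x, file = file.path(f, "a.RData"))
save(x, y, file = file.path(f, "b.RData"))

# full.names = TRUE returns paths that load() can open directly
fn <- list.files(f, pattern = "\\.RData$", full.names = TRUE)
info <- do.call(rbind, lapply(fn, summarise_rdata))

# One row per object per file
write.csv(info, file.path(f, "workspace_structure.csv"), row.names = FALSE)
```

Running the same script against a second folder and comparing the two .csv files (for example with merge on File and Name) should then show where the structures differ.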

Related

Can I aggregate with parameters taken from data frame?

I'd like to perform different aggregations in a loop to be applied to different row subsets of my data, but it seems tricky to achieve (if possible at all):
t <- data.frame(agg=c(list("field1"=field1, "field2"=field2), ...),
fun=c(mean, ...))
f <- function(x) {
    for (i in 1:nrow(t)) {
        y <- aggregate(x, by=t$agg[i], FUN=t$fun[i])
        # do something with y
    }
}
One problem is that the field list agg triggers an error when trying to build the data frame ("object 'field1' not found"), and the other problem is that R does not allow a function as the value of fun ("cannot coerce class "function" to a data.frame").
Appendix:
A concrete example for my data (just to match the definitions above) could be:
> d <- data.frame(field1=round(rnorm(5, 10, 1)),field2=letters[round(rnorm(5, 10, 1))], field3=1:5)
> d
field1 field2 field3
1 11 j 1
2 11 i 2
3 10 j 3
4 12 i 4
5 11 j 5
> with(d, aggregate(d$field3,by=list(field1, field2),FUN=mean))
Group.1 Group.2 x
1 11 i 2
2 12 i 4
3 10 j 3
4 11 j 3
Playing tricks with the variable names in the data frame, I still get this:
> with(d,t <- data.frame(agg=c(list("field1"=field1, "field2"=field2)),fun=c(mean)))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class "function" to a data.frame

There were several problems, mostly caused by R making exceptions to general processing:
First, an atomic vector cannot be nested; only lists can. In addition, all elements of an atomic vector must have the same type.
Second, data.frame does some special processing when constructing the variables (which makes it impossible to store closures), so it cannot be used.
Finally, I had to refer to the variables to aggregate by name.
So the definition looks like this (where , ... means "add more similar items"):
t <- list(agg=list(c("field1", "field2"), ...),
fun=list(mean, ...))
f <- function(x) {
    for (i in 1:length(t$agg)) {
        agg <- t$agg[[i]]
        aggList <- lapply(agg, FUN=function(e) x[[e]])
        names(aggList) <- agg
        y <- aggregate(x, by=aggList, FUN=t$fun[[i]])
        # do something with y
    }
}
Note: In the actual solution I added another list holding the names of the columns to select for the aggregated data frame to avoid warnings about mean returning NA.
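For reference, here is a small self-contained sketch of the same idea, including the extra list of value columns mentioned in the note. The data, the name spec, the val entry, and the helper run_spec are all made up for illustration; the actual solution may differ in detail.

```r
# Hypothetical data with the same column names as in the question
d <- data.frame(field1 = c(11, 11, 10, 12, 11),
                field2 = c("j", "i", "j", "i", "j"),
                field3 = 1:5)

# One entry per aggregation: grouping columns, value columns, function
spec <- list(
  agg = list(c("field1", "field2"), "field2"),
  val = list("field3", "field3"),
  fun = list(mean, sum)
)

run_spec <- function(x, spec) {
  lapply(seq_along(spec$agg), function(i) {
    # x[cols] keeps the column names, so the result has readable headers
    aggregate(x[spec$val[[i]]], by = x[spec$agg[[i]]], FUN = spec$fun[[i]])
  })
}

res <- run_spec(d, spec)
res[[2]]   # field3 summed per field2
```

Selecting only the value columns before aggregating is what avoids the warnings about mean being applied to non-numeric columns.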

Loop tasks by replacing one unique filename part by another

I am new to R and have just written some code, which works fine. I would like to loop it so that it also applies to the other 41 identically structured data.frames.
The input files are called "weatherdata.. + UNIQUE NUMBER"; I would like to call the output files "df + UNIQUE NUMBER".
The code I have written currently applies only to the file weatherdata..5341. I could just press CTRL + F, replace all occurrences of 5341, and run it again, which is easy to do. But could I also do this with some sort of loop? Or do you have a nice tutorial that could teach me how to do this? I have seen a tutorial on the for loop, but I couldn't figure out how to apply it to my code.
A small part of the code is provided below! I think that if the loop works on the code given below, it will also work for the rest of the code. All help appreciated! :)
#List of part of the datafiles just 4 out of 42 files
list.dat <- list(weatherdata..5341,weatherdata..5344, weatherdata..5347,
weatherdata..5350)
# add colum with date(month) as a decimal number
weatherdata..5341$Month <- format( as.Date(weatherdata..5341$Date) , "%m")
# convert to date if not already
weatherdata..5341$Date <- as.Date(weatherdata..5341$Date, "%d-%m-%Y")
#Try rename columns
colnames(weatherdata..5341)[colnames(weatherdata..5341)=="Max.Temperature"] <- "TMPMX"
# store as a vector
v1 <- unlist(Tot1)
# store in outputfile dataframe
Df5341<- as.data.frame.list(v1)

You can create a list of all the data frames and then use sapply to loop through each one of them. Here is some sample code:
> v1 <- list(data.frame(x = c(1,2), y = c('a', 'b')), data.frame(x = c(3,4), y = c('c', 'd')))
> v1
[[1]]
  x y
1 1 a
2 2 b

[[2]]
  x y
1 3 c
2 4 d

> sapply(v1 , function(x){(x$x <- x$x/4)})
     [,1] [,2]
[1,] 0.25 0.75
[2,] 0.50 1.00
You can then replace the body of the function with your own code. Note that sapply returns the computed values; it does not modify the data frames stored in v1. Hope this helps.

Something like this should work:
## Assuming that your files are CSV files and are alone in the folder
Fnames <- list.files() # this pulls the names of all the files
# in your working directory
Data <- lapply(Fnames, read.csv)
for (i in seq_along(Data)) {
    # Put your code in here, replacing the Df names with Data[[i]]
    # for example:
    # convert to date if not already (done first so the month can be extracted)
    Data[[i]]$Date <- as.Date(Data[[i]]$Date, "%d-%m-%Y")
    # add column with the month as a two-digit string, e.g. "05"
    Data[[i]]$Month <- format(Data[[i]]$Date, "%m")
    # rename columns
    colnames(Data[[i]])[colnames(Data[[i]])=="Max.Temperature"] <- "TMPMX"
    # And so on..
}
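To get the "df + UNIQUE NUMBER" naming from the question, one option is to keep the data frames in a named list and derive the output names from the input names. This is only a sketch: the two tiny data frames and the helper name process_one are stand-ins, and for objects already in the workspace the list could be built with mget(ls(pattern = "^weatherdata\\.\\.")) instead.

```r
# Stand-ins for the real weather data frames (values are made up)
inputs <- list(
  weatherdata..5341 = data.frame(Date = c("01-02-2015", "15-03-2015"),
                                 Max.Temperature = c(4.1, 9.8)),
  weatherdata..5344 = data.frame(Date = c("20-07-2015", "05-12-2015"),
                                 Max.Temperature = c(25.3, 2.0))
)

# The code for ONE data frame, written once
process_one <- function(d) {
  d$Date  <- as.Date(d$Date, format = "%d-%m-%Y")
  d$Month <- format(d$Date, "%m")
  colnames(d)[colnames(d) == "Max.Temperature"] <- "TMPMX"
  d
}

# Apply it to every data frame, then swap the name prefix for "df"
outputs <- lapply(inputs, process_one)
names(outputs) <- sub("^weatherdata\\.\\.", "df", names(inputs))

names(outputs)
```

Individual results are then available as outputs$df5341, outputs[["df5344"]], and so on, without creating 42 separate objects.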

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
    colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], '_')[[1]][1],
                                       colnames(get(element))[2], sep='_')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.

This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list, as suggested in a comment to the question, might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e:
e <- list(df = df)
then omit the line marked ## and the rest works as is.

Here is one way to accomplish this if the data.frames have systematic names (here, df1, df2, df3, etc.) and the prefix ends with "_" as in the example:
# as suggested by @Roland, roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
    # sub() keeps only the prefix up to and including the first "_"
    names(myDfList[[dfName]])[2] <- paste0(sub("^([^_]*_).*$", "\\1",
                                               names(myDfList[[dfName]])[1]),
                                           names(myDfList[[dfName]])[2])
}
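If the data frames have to stay as separate objects, the original error can also be avoided with a get()/assign() pair. An assignment such as colnames(get(element))[2] <- value makes R look for a replacement function called get<-, which does not exist; copying the object into a temporary variable, changing it, and assigning it back sidesteps that. A minimal sketch, using a separate environment env as a stand-in for the workspace (use globalenv() for real objects):

```r
# env stands in for the workspace so the example does not touch global objects
env <- new.env()
assign("df", data.frame(A_cat = 1:10, dog = 11:20), envir = env)

for (nm in grep("^df$", ls(env), value = TRUE)) {
  tmp <- get(nm, envir = env)                  # copy the data frame out
  prefix <- strsplit(names(tmp)[1], "_")[[1]][1]
  names(tmp)[2] <- paste(prefix, names(tmp)[2], sep = "_")
  assign(nm, tmp, envir = env)                 # write the modified copy back
}

names(get("df", envir = env))
```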

Merge and name data frames in for loop

I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I have some trouble getting the names of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might have an easier way, but I would like the for loop to work.
Perhaps I should clarify. The DFs are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.

You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=F)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=FALSE)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Please note that this is a base R solution that does essentially the same thing as your sample code.

If your data frames are in a list, writing a for loop is trivial:
# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
# the anchored pattern only matches lt1, lt2, ... and not other names containing "lt"
lt = mget(ls(pattern = "^lt[0-9]+$"))
merged_data = merge(lt[[1]], lt[[2]])
# seq_along(lt)[-(1:2)] is empty when there are only two data frames
for (i in seq_along(lt)[-(1:2)]) {
    merged_data = merge(merged_data, lt[[i]])
    save(merged_data, file = paste0("merging", i, ".rda"))
}
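If you really want the for loop from the question, with one named result per pair, the names can be built with paste0 and looked up with get() and assign(). This is only a sketch under assumptions: the toy data frames, the join column id, and the environment env (a stand-in for the workspace; use globalenv() for real objects) are made up.

```r
env <- new.env()   # stand-in for the workspace
N <- 2

# Create toy df1..dfN and lt1..ltN (hypothetical data)
for (X in seq_len(N)) {
  assign(paste0("df", X), data.frame(id = 1:3, a = letters[1:3 + X]), envir = env)
  assign(paste0("lt", X), data.frame(id = 1:3, b = LETTERS[1:3 + X]), envir = env)
}

# Merge dfX with ltX and store the result as outputX
for (X in seq_len(N)) {
  merged <- merge(get(paste0("df", X), envir = env),
                  get(paste0("lt", X), envir = env),
                  by = "id")
  assign(paste0("output", X), merged, envir = env)
  # to save each result as well, something like:
  # saveRDS(merged, file.path(tempdir(), paste0("output", X, ".rds")))
}

ls(env)
```

Only one merged data frame is built per iteration, so the memory cost is that of a single merge at a time.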

How to change multiple variable names in R?

I am new to R
I am trying to create a variable list with variables containing the log of some other variables. I managed to create the list, but I don't know how to rename each variable. Moreover, I don't know how to make these variables be a part of my dataset. Here is what I do:
Imagine somedata is a csv file of the form:
v1, v2, v3, ..., vn
1, 4, 6, ..., 1
...
Then here is my script
##################
## Import Data
###################
lights <- read.csv("somedata.csv")
##################
## Variable Lists
###################
lights.varlist1 <- subset(lights, select=c(v1,v2,...,vJ))
###########
## Logs
###########
lights.logsvarlist1=apply(lights.varlist1, 2, function(y) log(y))
This part seems to be working just fine, as the results of print(lights.logsvarlist1)[1,] make sense.
To change the names of the variables I do:
for (i in 1:length(lights.logsvarlist1[1,])) {
    name <-paste("l", names(lights.varlist1)[i], separator="")
    names(lights.logsvarlist1)[i]=name
}
I now have two problems.
When I print(lights.logsvarlist1[1,]), the names don't seem to have changed. I still have my old variable names as headers.
When I print(names(lights)), my newly created variables don't seem to be part of the dataset (they are not in the list).
What am I doing wrong? I am very new to R and I really want to continue, so I'd appreciate any help.

DF <- data.frame(a=1:3, v1=4:6, v2=7:9, v3=10:12)
sub <- c("v1", "v2", "v3")
DF[, paste0("l", sub)] <- lapply(DF[, sub], log)
#   a v1 v2 v3      lv1      lv2      lv3
# 1 1  4  7 10 1.386294 1.945910 2.302585
# 2 2  5  8 11 1.609438 2.079442 2.397895
# 3 3  6  9 12 1.791759 2.197225 2.484907

This works for me and avoids the for loop:
data = as.data.frame(matrix(abs(rnorm(100)), 10))
ldata = log(data)
names(ldata) = paste('log', names(ldata), sep = '')
Some other tips:
apply(lights.varlist1, 2, function(y) log(y))
can be replaced by
apply(lights.varlist1, 2, log)
since log is already a function, or simply by
log(lights.varlist1)
Instead of the following
for (i in 1:length(lights.logsvarlist1[1,]))
use
for (i in seq_len(ncol(lights.logsvarlist1)))
Your new variables aren't in the lights data frame. They are in a separate object called
lights.logsvarlist1
(a matrix rather than a data frame, because that is what apply returns). To put them in the data frame, use merge or cbind; see ?merge and ?cbind.
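A short sketch of both points, with a made-up two-column lights data frame: apply returns a matrix, so its columns are renamed with colnames rather than names, and cbind then attaches them to the original data frame.

```r
# Hypothetical stand-in for the data read from somedata.csv
lights <- data.frame(v1 = c(1, 2, 3), v2 = c(4, 5, 6))

logs <- apply(lights, 2, log)
is.matrix(logs)    # apply gives back a matrix, not a data frame

# Matrix columns are renamed via colnames(), not names()
colnames(logs) <- paste0("l", colnames(logs))

# Attach the log columns to the original data frame
lights <- cbind(lights, logs)
names(lights)
```

Calling names() on a matrix sets a per-element names attribute instead of the column headers, which is why the printed headers did not change in the question.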
