I have to read several external files, extract some columns, and fill in the missing values with zeros. So if the first file has a, b, c, d in the column Name (with the corresponding values in the column Area), the second file has b, d, e, f in the same column, and so on for the other files, I need to create a data frame like this:
      a     b     c     d     e     f
File1 value value value value 0     0
File2 0     value 0     value value value
This is the dummy code I wrote to try to better explain my problem:
listDFs <- list()
for(i in 1:10){
listDFs[[i]] <-
data.frame(Name=c(
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse=""))),
Area=runif(7))
}
lComposti <- sapply(listDFs, FUN = "[","Name")
dfComposti <- data.frame(matrix(unlist(lComposti),byrow=TRUE))
colnames(dfComposti) <- "Name"
dfComposti <- unique(dfComposti)
#
## The CORE of the code
lArea <- list()
for(i in 1:10){
lArea[[i]] <-
ifelse(dfComposti$Name %in% listDFs[[i]]$Name, listDFs[[i]]$Area, 0)}
#
mtxArea <- (matrix(unlist(lArea),nrow=c(10),ncol=dim(dfComposti)[1],byrow=TRUE))
The problem is the "synchronization" between the column names and their values: the areas do not end up under the right names.
Do you have any suggestions?
If my code is unclear, I can also upload the files I work with.
Best
The safest approach is never to lose track of the names; otherwise the values could be put back in the wrong order.
You can concatenate all your data.frames into a tall data.frame with do.call(rbind, ...), and then convert it to a wide data.frame with dcast from the reshape2 package.
# Add a File column to the data.frames
names( listDFs ) <- paste0( "File", seq_along(listDFs) )
for(i in seq_along(listDFs)) {
listDFs[[i]] <- data.frame( listDFs[[i]], file = names(listDFs)[i] )
}
# Concatenate them
d <- do.call( rbind, listDFs )
# Convert this tall data.frame to a wide one
# ("sum" is only needed if some names appear several times
# in the same file: since you used "replace=TRUE" for the
# sample data, it is likely to happen)
library(reshape2)
d <- dcast( d, file ~ Name, sum, value.var="Area" )
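If reshape2 is not available, the same tall-to-wide idea can be sketched in base R with xtabs, which sums Area per file and name and fills absent combinations with 0. The two small data frames below are made-up stand-ins for the files, and the names tall and wide are only illustrative.

```r
# Hypothetical stand-ins for two input files
listDFs <- list(
  File1 = data.frame(Name = c("a", "b", "c", "d"), Area = c(1, 2, 3, 4)),
  File2 = data.frame(Name = c("b", "d", "e", "f"), Area = c(5, 6, 7, 8))
)

# Tag every row with the file it came from, then stack the pieces
tall <- do.call(rbind, Map(function(df, f) cbind(df, file = f),
                           listDFs, names(listDFs)))

# Sum Area for each file/Name pair; missing pairs become 0
wide <- as.data.frame.matrix(xtabs(Area ~ file + Name, data = tall))
print(wide)
```

Because the names travel with the values until the final reshaping step, each area lands under its own column regardless of the order in which the names appear in a file.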
I am trying to write code that checks for outliers based on the IQR and changes the respective values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
IDX <- l[i]
BP <- boxplot.stats(df[IDX])
OutIDX <- which(df[IDX] %in% BP$out)
if (vector.is.empty(OutIDX)==FALSE){
for (u in 1:length(OutIDX)){
IDX2 <- OutIDX[u]
df[IDX2,IDX] <- NA
}
}
}
So, when I run this code, I get these error messages:
I've searched online for good answers, but I'm not sure why the messages claim that the column is unspecified. Any clues here?
I would do something like this to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
for (i in seq_len(ncol(df))){
l[[i]] <- boxplot.stats(df[,i])
# %in% flags every outlier; == would recycle the shorter outlier vector
df[df[,i] %in% l[[i]]$out, i] <- NA
}
# Which values have been replaced?
lapply(l, function(x) x$out)
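To restrict the replacement to selected columns, as in the question, a variant along these lines should work. It is only a sketch: cols and original are names introduced here, and the seed is arbitrary. The key point is that df[[j]] extracts a numeric vector, whereas df[j] is a one-column data frame, which neither boxplot.stats() nor %in% handles the way the question's loop expects.

```r
set.seed(1)
df <- data.frame(dt = rnorm(200), dg = rnorm(200), dh = rnorm(200))
cols <- c(1, 3)    # relevant columns, as in the question
original <- df     # untouched copy, kept only for comparison

for (j in cols) {
  out <- boxplot.stats(df[[j]])$out   # outliers of column j as a plain vector
  df[[j]][df[[j]] %in% out] <- NA     # blank out every matching value
}

# Number of values replaced per column (0 for columns that were skipped)
print(colSums(is.na(df)))
```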
I need to add rows to a data frame. I have many files with many rows, so I have converted the code to a function. When I step through each part of the code it works fine. When I wrap everything in a function, each row from my first loop gets added twice.
My code looks for a string (xx or x). If xx is present, it replaces the xx with the numbers 00-99 and 0-9 (one row for each number). If x is present, it replaces it with the numbers 0-9.
Create DF
a <- c("1.x", "2.xx", "3.1")
b <- c("single", "double", "nothing")
df <- data.frame(a, b, stringsAsFactors = FALSE)
names(df) <- c("code", "desc")
My dataframe
  code    desc
1  1.x  single
2 2.xx  double
3  3.1 nothing
My function
newdf <- function(df){
# If I run through my code chunk by chunk it works as I want it.
df$expanded <- 0 # a variable to let me know if the loop was run on the row
emp <- function(){ # This function creates empty vectors for my loop
assign("codes", c(), envir = .GlobalEnv)
assign("desc", c(), envir = .GlobalEnv)
assign("expanded", c(), envir = .GlobalEnv)
}
emp()
# I want to expand xx with numbers 00 - 99 and 0 - 9.
#Note: 2.0 is different than 2.00
# Identifies the rows to be expanded
xd <- grep("xx", df$code)
# I used chr vs. numeric so I wouldn't lose the trailing zero
# Create a vector to loop through
tens <- formatC(c(0:99)); tens <- tens[11:100]
ones <- c("00","01","02","03","04","05","06","07","08","09")
single <- as.character(c(0:9))
exp <- c(single, ones, tens)
# This loop appears to run twice when I run the function: newdf(df)
# Each row is there twice: 2.00, 2.00, 2.01 2.01...
# It runs as I want it to if I just highlight the code.
for (i in xd){
for (n in exp) {
codes <- c(codes, gsub("xx", n, df$code[i])) #expanding the number
desc <- c(desc, df$desc[i]) # repeating the description
expanded <- c(expanded, 1) # assigning 1 to indicated the row has been expanded
}
}
# Binds the df with the new expansion
df <- df[-xd, ]
df <- rbind(as.matrix(df),cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Empties the vector to begin another expansion
emp()
xs <- grep("x", df$code) # This is for the single digit expansion
# Expands the single digits. This part of the code works fine inside the function.
for (i in xs){
for (n in 0:9) {
codes <- c(codes, gsub("x", n, df$code[i]))
desc <- c(desc, df$desc[i])
expanded <- c(expanded, 1)
}
}
df <- df[-xs,]
df <- rbind(as.matrix(df), cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
assign("out", df, envir = .GlobalEnv) # This is how I view my dataframe after I run the function.
}
Calling my function
newdf(df)
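A likely explanation for the duplication: inside a function, codes <- c(codes, ...) creates a local codes, while emp() only resets the copy in the global environment. The second loop therefore starts from the still-populated local vectors, and the rows from the first expansion are bound a second time. The sketch below avoids global state altogether. expand_codes and fill are hypothetical names, and unlike the original the result keeps the input row order.

```r
# Hypothetical helper: expand "xx" to 0-9 and 00-99, and "x" to 0-9,
# keeping every intermediate vector local to the function.
expand_codes <- function(df) {
  one <- as.character(0:9)
  two <- c(one, sprintf("%02d", 0:99))   # character, so "00" keeps its zeros

  # Substitute each candidate value into the code, one result per value
  fill <- function(code, pattern, values) {
    vapply(values, function(v) sub(pattern, v, code, fixed = TRUE),
           character(1), USE.NAMES = FALSE)
  }

  rows <- lapply(seq_len(nrow(df)), function(i) {
    code <- df$code[i]
    new <- if (grepl("xx", code, fixed = TRUE)) {
      fill(code, "xx", two)
    } else if (grepl("x", code, fixed = TRUE)) {
      fill(code, "x", one)
    } else {
      code                                # nothing to expand
    }
    data.frame(code = new, desc = df$desc[i],
               expanded = as.integer(length(new) > 1),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

df <- data.frame(code = c("1.x", "2.xx", "3.1"),
                 desc = c("single", "double", "nothing"),
                 stringsAsFactors = FALSE)
out <- expand_codes(df)
```

Returning the data frame, rather than assigning it to the global environment, also lets the caller choose the name of the result.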
My first post here so please tell me if I'm missing any important information.
I am handling a lot of data in the form of time (1:30 = row ID) vs. value, all stored in a number of data frames, and I need to keep each one as a data.frame.
I wrote a function that gets dataframes from my global environment and sorts the columns in each set into new data frames depending on their values.
So I start with a list of names of my data frames as input for my function and then end with assigning the created new dataframes to my global environment while using the assign function.
The data frames I get are all 30 rows long, but have different numbers of columns depending on how often a case appears in a dataset. The name of each data frame represents one dataset and the column names inside represent one timeline each. I use data frames so that I don't lose the information in the column names.
This works when a case appears 0 times or more than once.
But if a data.frame ends up with only one column and I use the assign function, it appears as a vector in my global environment instead of a data frame. Therefore I lose the name of the column, and my other functions, which only accept data frames, stop at such a case and throw errors.
Here is a basic example of my problem:
#create two datasets with different cases
data1 <- data.frame(matrix(nrow=30, ncol=5))
data1[1] <- c(rep(1,each=30))
data1[2] <- c(rep(5, each=30))
data1[3] <- c(rep(5, each=30))
data1[4] <- c(rep(10, each=30))
data1[5] <- c(rep(10, each=30))
data2 <- data.frame(matrix(nrow=30, ncol=6))
data2[1] <- c(rep(5,each=30))
data2[2] <- c(rep(1, each=30))
data2[3] <- c(rep(1, each=30))
data2[4] <- c(rep(0, each=30))
data2[5] <- c(rep(0, each=30))
data2[6] <- c(rep(10, each=30))
#create list with names of datasets
names <- c('data1','data2')
#function for sorting
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else stop("new values")
}
#remove first column with weird name
data.0 <- data.0[,-1]
data.1 <- data.1[,-1]
data.5 <- data.5[,-1]
data.10 <- data.10[,-1]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
As explained before, if you run this you will end up with data frames of 0 variables and >1 variables.
And three vectors, where the data frame had only one column.
So my questions are:
1. Is there any way to keep the data type and force R to assign it as a data frame instead of a vector?
2. Or is there an alternative function I could use instead of assign()? If I use <<- how can I do the name assigning as above?
You can use drop = FALSE when subsetting:
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else stop("new values")
}
#remove first column with weird name
data.0 <- data.0[ , -1, drop = FALSE]
data.1 <- data.1[ , -1, drop = FALSE]
data.5 <- data.5[ , -1, drop = FALSE]
data.10 <- data.10[ , -1, drop = FALSE]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
Let's take a look at the one-column dataframes:
str(data1.1)
'data.frame': 30 obs. of 1 variable:
$ X1: num 1 1 1 1 1 1 1 1 1 1 ...
str(data2.10)
'data.frame': 30 obs. of 1 variable:
$ X6: num 10 10 10 10 10 10 10 10 10 10 ...
Now, all that said, I agree with Roland's comment -- you almost never want to take this approach of assigning to the global environment in a complicated way, and instead should return a list; that's best practice. However, you'd still need drop = FALSE to keep the column names.
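As a rough illustration of the return-a-list idea, the sorting could look something like the sketch below. split_by_first_value is a made-up name, and the column names a to e are placeholders. Unlike the original function, it only creates entries for values that actually occur, so there is no empty data frame for an absent case.

```r
# Hypothetical helper: group the columns of `data` by their first-row value
# and return a named list of data frames instead of assigning globals.
split_by_first_value <- function(data) {
  key <- unlist(data[1, ], use.names = FALSE)     # first-row value per column
  lapply(split(seq_along(data), key),
         function(idx) data[, idx, drop = FALSE]) # drop = FALSE keeps data frames
}

data1 <- data.frame(a = rep(1, 30), b = rep(5, 30), c = rep(5, 30),
                    d = rep(10, 30), e = rep(10, 30))
parts <- split_by_first_value(data1)
str(parts[["1"]])   # still a one-column data frame, column name preserved
```

The list names ("1", "5", "10") play the role of the suffixes that the original code pasted onto the variable names.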
Really, I suspect there is an entirely different and much better approach to whatever kind of data wrangling you want to do. I just don't have a good enough grasp of your task to make a suggestion.
I have a data.frame, and I want to subset it every 10 rows, apply a function to each subset, save the resulting object, and remove the previous object. Here is what I have so far:
L3 <- LETTERS[1:20]
df <- data.frame(1:391, "col", sample(L3, 391, replace = TRUE))
names(df) <- c("a", "b", "c")
b <- seq(from=1, to=391, by=10)
nsamp <- 0
for(i in seq_along(b)){
a <- i+1
nsamp <- nsamp+1
df_10 <- df[b[nsamp]:b[a], ]
res <- lapply(seq_along(df_10$b), function(x){...})
saveRDS(res, file="res.rds")
rm(res)
}
My problem is that the for loop crashes when it reaches the last element of my sequence b.
When partitioning data, split is your friend. It will create a list with each data subset as an item which is then easy to iterate over.
dfs = split(df, (seq_len(nrow(df)) - 1) %/% 10)
Then your for loop can be simplified to something like this (untested... I'm not exactly sure what you're doing because example data seems to switch from df to sc2_10 and I only hope your column named b is different from your vector named b):
for(i in seq_along(dfs)){
res <- lapply(seq_along(dfs[[i]]$b), function(x){...})
saveRDS(res, file = sprintf("res_%s.rds", i))
rm(res)
}
I also modified your save file name so that you aren't overwriting the same file every time.
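Here is a self-contained sketch of the whole pattern. Since the real per-chunk function is not shown, summary() stands in for it, and the files go to tempdir(); both are assumptions made only for illustration. Subtracting 1 before the integer division makes the chunks cover rows 1-10, 11-20, and so on, with the single leftover row in a chunk of its own.

```r
df <- data.frame(a = 1:391, b = "col")

# Chunk index for every row: rows 1-10 -> 0, rows 11-20 -> 1, ...
chunk_id <- (seq_len(nrow(df)) - 1) %/% 10
dfs <- split(df, chunk_id)
sizes <- vapply(dfs, nrow, integer(1))   # rows per chunk

out_dir <- tempdir()                     # placeholder output location
for (i in seq_along(dfs)) {
  res <- summary(dfs[[i]]$a)             # placeholder for the real work
  saveRDS(res, file = file.path(out_dir, sprintf("res_%s.rds", i)))
}
```

Because the loop iterates over the list itself, there is no b[i + 1] lookup that can run past the end of the sequence.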
Using R, I want to save the values of some variables for each group while running lapply().
Below is what I have tested so far:
list_C <- list()
list_D <- list()
n <- 1
data_partition <- split(data, with(data, paste(A, B, sep=":")))
final_result <- lapply(data_partition,
function(dat) {
if(... condition ...) {
<Some R codes to run>
list_C[[n]] <- dat$C
list_D[[n]] <- dat$D
n <- n + 1
}
})
However, after running the code, 'n' is still 1 and the lists are unchanged. How can I update 'n' so that 'list_C' and 'list_D' are filled correctly?
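Assignments made with <- inside the function passed to lapply() only modify local copies, which is why n and both lists look untouched afterwards. One possible way around this, sketched below under assumptions, is to return the values from the function and pull the lists apart afterwards. The sample data and the keep() condition are invented here, since the original data and condition are not shown.

```r
# Hypothetical sample data with the columns used in the question
data <- data.frame(A = c("x", "x", "y", "y"),
                   B = c("p", "p", "q", "q"),
                   C = 1:4, D = 5:8)
data_partition <- split(data, with(data, paste(A, B, sep = ":")))

keep <- function(dat) nrow(dat) > 0   # stand-in for the real condition

# Return what should be saved instead of writing to outer variables
results <- lapply(data_partition, function(dat) {
  if (keep(dat)) list(C = dat$C, D = dat$D) else NULL
})
results <- Filter(Negate(is.null), results)   # drop groups that failed the test

list_C <- lapply(results, `[[`, "C")
list_D <- lapply(results, `[[`, "D")
```

The resulting lists are named by group ("x:p", "y:q"), so a separate counter is not needed. Using <<- inside the function would also change the outer variables, but returning values is usually easier to follow.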