Looping a series of variables in R while doing a cbind()

I have already read a dataset into R through read.csv() and, after doing some calculations, created the following 10 similarly named variables: PI_1, PI_2, ..., PI_10.
Now I combine the newly formed variables with my existing dataset (TempData):
x<-cbind(TempData,PI_1,PI_2,PI_3,PI_4,PI_5,PI_6,PI_7,PI_8,PI_9,PI_10)
Is there any smarter way of doing this (maybe with a loop)? Any help is greatly appreciated.

Assuming that the files are in the working directory and all of them start with PI_ followed by some digits (\\d+), we can use list.files with the pattern argument in case there are other files in the directory as well. To check the working directory, use getwd().
files <- list.files(pattern='^PI_\\d+')
This will give the file names. Now, we can use lapply to read those files into a list with read.table. Once we are done with that part, use do.call(cbind, ...) to bind all the dataset columns together.
res <- do.call(cbind, lapply(files, function(x)
    read.table(x, header=TRUE)))
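If instead the PI_1, ..., PI_10 objects already live in your workspace (as the question suggests) rather than in files, a minimal sketch using mget() would be:
## collect the in-memory PI_* variables by name and bind them to TempData
x <- do.call(cbind, c(list(TempData), mget(paste0('PI_', 1:10))))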
Update
I guess you need to create 10 variables based on some PI. In the code provided in the comments, PI seems to be an object with some value inside it. As that value is not clear, I use the literal string 'PI' here, together with a dummy dataset (see data below).
TempData[paste0('PI_', 1:10)] <- Map(function(x, y) c('', 'PI')[(x == y) + 1],
                                     1:10, list(TempData$Concept))
head(TempData,3)
#   Concept        Val PI_1 PI_2 PI_3 PI_4 PI_5 PI_6 PI_7 PI_8 PI_9 PI_10
# 1      10 -0.4304691                                                 PI
# 2      10 -0.2572694                                                 PI
# 3       3 -1.7631631           PI
You could use write.table (or write.csv) to save the result.
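For instance, a minimal sketch (the output file name is an assumption):
## save the augmented dataset; the file name is hypothetical
write.csv(TempData, 'TempData_PI.csv', row.names=FALSE, quote=FALSE)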
data
set.seed(42)
dat <- data.frame(Concept=sample(1:10,50, replace=TRUE), Val=rnorm(50))
write.csv(dat, 'Concept.csv', row.names=FALSE, quote=FALSE)
TempData <- read.csv('Concept.csv')
str(TempData)
#'data.frame': 50 obs. of 2 variables:
# $ Concept: int 10 10 3 9 7 6 8 2 7 8 ...
# $ Val : num -0.43 -0.257 -1.763 0.46 -0.64 ...

Related

How can I read a double-semicolon-separated .txt in R?

I have this problem, but in R:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns generated. I'd like to know if there's a way to read a file separated by ;; without generating those additional columns.
Thanks!
Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc.), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4
It is usually recommended to properly clean your data before attempting to parse it, instead of cleaning it while parsing or, worse, afterwards. Either use Notepad++ to replace all ;; occurrences, or R itself, but do not delete the original files (a good rule of thumb: never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')
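If you prefer not to write an intermediate file, the same cleaning can be done in memory before parsing (a sketch reusing the paths above):
## read, collapse ';;' to ';', and parse in one go
my.text <- readLines('d:/tmp/readdelim-r.csv')
my.cleaned <- read.delim(text = gsub(';;', ';', my.text), header=FALSE, sep=';')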

Removing specific columns from multiple data frames (.tab) and then merging them in R

I have 24 ".tab" files in a folder with names file1.tab, file2.tab, ..., file24.tab. Each file is a data frame with 4 columns and 50,000 rows (an image showing the layout of one such file was attached to the original question).
The first column is the same in all 24 files, but columns 2, 3 and 4 have different values in each of them. For me, columns 3 and 4 of each data frame are irrelevant. I can get rid of these columns in each data frame individually with the following steps:
filenames <- Sys.glob("*.tab")  # lists all 24 file names
dataframe1 <- read.table(filenames[1], header = TRUE)
dataframe1 <- dataframe1[, -c(3, 4)]  # removes 3rd and 4th column of the data frame
However, this becomes very tedious when I have to repeat the above operation individually on 24 (or more) similar files. Is there a way to perform the above operation, i.e. removing the 3rd and 4th columns from all 24 files, with a single piece of code?
Second part:
After removing the 3rd and 4th columns from each of the 24 files, I want to create a new data frame with 25 columns, such that the first column is Column1 (which is the same in all the files) and the subsequent columns are Column2 from each of the files.
For two data frames df1 and df2, I use:
merge(df1,df2,1,1)
and it creates a new data frame. It would be extremely tedious to do the merge operation individually for 24 modified dataframes. Could you please help me?
PS - I tried to find answers to any similar question asked before and could not find one. So, in case this is marked as a duplicate, it would be very kind if you could put a link to where it has been answered.
I have just started learning R and have no prior experience.
Regards,
Kshitij
First, let's make a fake file (and then a list of such files):
fakefile <- 'a\tb\tc\td
1\t2\t3\t4'
# In your case, instead of the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))
## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4
# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)
## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...
In principle, all solutions follow the same underlying pattern: read into a list,
then merge (although the merge step might differ here and there).
Solution 1 - If you can rely on the order of column 1
If you can rely on the ordering of the columns, then you don't really need to
read columns 1 and 4 of each file; read column 1 once, then just read column 4 from every file and bind the results.
# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]
# Reading cols 4 in all files
# We first make a function that does our tasks (reading and removing cols)
reader_fun <- function(x) {
    read.table(text = x, header = TRUE)[, 4]
}
# Then we use lapply to apply that function to each element of our list
cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)
## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4
# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)
# And finally add column 1 to it
data.frame(col1, cols4_mat)
## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4
Solution 2 - If you cannot rely on the order
The implementation is easier, but it is a lot slower in most situations.
# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x, ...) read.table(x, ...)[, c(1,4)], header = TRUE)
# But since I'm passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)
# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4
# you will need to modify the column names, by the way ...
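For example, a quick renaming of the joined result (a sketch; the new names are arbitrary):
## rename the suffixed join columns to something uniform (hypothetical names)
merged <- Reduce(function(x, y) dplyr::full_join(x, y, by = 'a'), my_contents)
names(merged)[-1] <- paste0('d_', seq_len(ncol(merged) - 1))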
Bonus - And the most concise solution ...
Depending on how big your data sets are, you might want to ignore the extra
columns from the start (instead of reading them and then removing them).
You can use fread from the data.table package to do that for you.
reader_function <- function(x) {
    data.table::fread(x, select = c(1, 4))
}
my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
While the answer above by Sebastian worked perfectly fine, I figured out another way to solve the question using a for loop. I am sharing this solution in case anyone else has a similar question and feels more comfortable with this method.
First of all, I set the working directory to the folder which contains the files, using the setwd() command.
setwd("/absolute path to the folder containing files/") #set working directory to the folder containing files
Now, I define the path to the files so that I can list the files.
path <- "/absolute path to the folder containing files/" #define the path to the folder
I create the list of filenames that I am interested in.
filenames <- dir(path, pattern = "\\.tab$")  # list the .tab files in the folder
Now, I create a new data frame with Column 1 and Column 2 of the first file:
out_file <- read.table(filenames[1])[, 1:2]  # column 1 and column 2 of the first file
I write a for loop that reads only the second column of files 2 to 24 and adds that column from each file to the out_file defined above.
# iterate from the second file, as the first two columns of the first file
# have already been assigned to out_file
for (i in 2:length(filenames)) {
    file <- read.table(filenames[i], header = FALSE, stringsAsFactors = FALSE)  # read file i
    out_file <- cbind(out_file, file[, 2])  # add its second column
}
The above code iterates through the files, extracts column 2 from each and adds it to out_file, thereby creating the file of my interest.
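One small refinement (a sketch; it assumes the loop above has run): the cbind-ed columns get default names, so you may want to name them after the source files:
## name the combined columns after the source files (naming scheme is hypothetical)
colnames(out_file) <- c("Col1", sub("\\.tab$", "", filenames))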

Filter rows based on ID over multiple data frames with for loop

How can I filter 180 .csv files in my working directory based on a matching ID in another data frame named 'Camera' in R? When I tried to turn my one-by-one filtering code (see step 3b) into a for loop (see step 3a), I got the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loops, so I really appreciate your help! All 180 files have unique names and differ in length, but share the same column structure and names. They look like:
df 'File1'              df 'Camera'
ID  Speed  Location     ID  Time
1   30     4            1   10
2   35     5            3   11
3   40     6            5   12
4   30     7
5   35     8

Filtered df 'File1'
ID  Speed  Location
1   30     4
3   40     6
5   35     8
These are some samples of my code:
#STEP 1: read files
filenames <- list.files(path="06-06-2017_0900-1200uur",
pattern="*.csv")
# STEP 2: import files
for (i in filenames) {
    filepath <- file.path("06-06-2017_0900-1200uur", i)
    assign(i, read.csv2(filepath, header = TRUE, skip = 1))
}
# STEP 3a: delete rows that do not match ID in df 'Cameras'
for (i in filenames) {
    paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID, ]
}
#STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID,]
Here is an approach that makes use of lists (generally a better way to go than assign()ing many separate objects; note that paste("i") just creates the literal string "i", which is why the $ operator fails on it). First, utilize the full.names argument in list.files():
fns <- list.files(
    path = "06-06-2017_0900-1200uur",
    pattern = "\\.csv$",
    full.names = TRUE
)
Now you have a vector of your file names. Next, apply read.csv2 to each of the file names:
dat <- lapply(fns, read.csv2, header = TRUE, skip = 1)
Now you have a list of data frames (the output from calling read.csv2). Finally, apply subset() to each of the data frames to keep only those rows whose ID matches the Camera IDs:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
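If you then want a single combined data frame instead of a list, an optional follow-up (a sketch) is:
## keep track of the source file and stack everything
names(out) <- fns
combined <- do.call(rbind, out)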
If I understand the question, the output should be a data frame from file1 where the ID of every row matches one of the rows in the Camera file.
This is easily accomplished with the sqldf package and structured query language (SQL).
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles, function(x) {
    read.table(x, header = TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
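For comparison, the same filter can also be written without SQL; in base R it is a one-line subset (a sketch using the objects above):
## base R equivalent of the SQL statement
combinedData[combinedData$ID %in% Camera$ID, ]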

In R: How to perform a str() on multiple files

How could I go about performing str() in R on all of the files loaded in the workspace at the same time? I simply want to export this information, in a batch-like process, to a .csv file. I have over 100 of them, and want to compare one workspace with another to help locate incongruities in data structure and avoid mismatches.
I came painfully close to a solution via UCLA's R Code Fragment; however, it does not include instructions for writing the read.dta call that loops through the files. That is the part I need help with.
What I have so far:
#Define the file path
f <- file.path("C:/User/Datastore/RData")
#List the files in the path
fn <- list.files(f)
#loop through file list, return str() of each .RData file
#write to .csv file with 4 columns (file name, length, size, value)
EDIT
Here is an example of what I am after (the view from RStudio: it simply lists the Name, Type, Length, Size, and Value of all the RData files). I want to basically replicate this view, but export it to a .csv. I am adding the RStudio tag in case someone knows a way of exporting this table automatically; I couldn't find one.
Thanks in advance.
I've actually written a function for this already. I also asked a question about it, regarding how the function deals with promise objects. That post might be of some use to you.
The issue with the last column is that str is only meant to print a compact description of objects, so I couldn't use it directly (but that's been changed with recent edits). This updated function gives a description of the values similar to that of the RStudio table. Data frames and lists are tricky because their str output spans more than one line, but this should be good.
objInfo <- function(env = globalenv())
{
    obj <- mget(ls(env), env)
    out <- lapply(obj, function(x) {
        vals1 <- c(
            Type = toString(class(x)),
            Length = length(x),
            Size = object.size(x)
        )
        val2 <- gsub("|^\\s+|'data.frame':\t", "", capture.output(str(x))[1])
        if (grepl("environment", val2)) val2 <- "Environment"
        c(vals1, Value = val2)
    })
    out <- do.call(rbind, out)
    rownames(out) <- seq_len(nrow(out))
    noquote(cbind(Name = names(obj), out))
}
And then we can test it out on a few objects:
x <- 1:10
y <- letters[1:5]
e <- globalenv()
df <- data.frame(x = 1, y = "a")
m <- matrix(1:6)
l <- as.list(1:5)
objInfo()
# Name Type Length Size Value
# 1 df data.frame 2 1208 1 obs. of 2 variables
# 2 e environment 11 56 Environment
# 3 l list 5 328 List of 5
# 4 m matrix 6 232 int [1:6, 1] 1 2 3 4 5 6
# 5 objInfo function 1 24408 function (env = globalenv())
# 6 x integer 10 88 int [1:10] 1 2 3 4 5 6 7 8 9 10
# 7 y character 5 328 chr [1:5] a b c d e
Which is pretty close I guess. Here's the screen shot of the environment in RStudio.
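Since the goal was a .csv export, the resulting table can be written out directly (a sketch; the file name is an assumption, and unclass() drops the noquote wrapper before writing):
## export the summary table; "workspace_summary.csv" is a hypothetical name
write.csv(unclass(objInfo()), "workspace_summary.csv", row.names = FALSE)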
I would write a function, something like below, and then loop over the files with it, so you basically write the code for a single dataset only once.
library(foreign)
giveSingleDataset <- function(oneFile) {
    # Read .dta file
    df <- read.dta(oneFile)
    # Give e.g. the structure
    s <- ls.str(df)
    # Return what you want
    return(s)
}
#Actually call the function
result <- lapply( fn, giveSingleDataset )
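Note that fn from the question's list.files() call holds bare file names; if the working directory is not f, you would pass full paths instead (a sketch):
## same call, but with full paths built from the folder f (assumption)
result <- lapply(file.path(f, fn), giveSingleDataset)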

Change multiple dataframes in a loop

I have, for example, these three datasets (in my case there are many more, with a lot of variables):
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
To each data frame I want to add a variable resulting from a transformation of an existing variable in that data frame, and I would like to do this in a loop. For example:
datasets <- c("data_frame1","data_frame2","data_frame3")
vars <- c("a","b","c")
for (i in datasets){
    for (j in vars){
        # here I need code that creates a new variable with transformed values
        # I thought this would work, but it didn't...
        get(i)$new_var <- log(get(i)[, j])
    }
}
Do you have any suggestions about this?
Moreover, it would be great if it were also possible to assign the new column names (in this case new_var) from a character string, so I could create the new variables in another for loop nested inside the other two.
Hope I haven't been too tangled in explaining my problem.
Thanks in advance.
You can put your data frames in a list and use lapply to process them one by one, so there is no need for an explicit loop in this case.
For example you can do this :
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
lapply(ll, function(df){
    df$log_a <- log(df$a)               ## new column with the log of a
    df$trans_col <- df$a + df$b + df$c  ## new column with the sum of some
                                        ## columns, or any other transformation
    df
})
The first data frame then becomes:
[[1]]
a b c log_a trans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9
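To also build the new column names from character strings, as asked, the [[ operator together with paste0 works inside the same lapply (a sketch reusing the vars vector from the question):
## create one log column per variable, with names built from strings
vars <- c("a", "b", "c")
lapply(ll, function(df) {
    for (j in vars) df[[paste0("log_", j)]] <- log(df[[j]])
    df
})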
I had the same need and also wanted to change the columns in my actual list of data frames. I found a great method elsewhere (a purrr::map2-based method that also works for data frames with different columns), followed by
list2env(list_of_dataframes, .GlobalEnv)
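A minimal sketch of that last step (the list and object names are assumptions):
## transform each frame, keep their original names, and write them back
list_of_dataframes <- lapply(ll, function(df) { df$log_a <- log(df$a); df })
names(list_of_dataframes) <- c("data_frame1", "data_frame2", "data_frame3")
list2env(list_of_dataframes, .GlobalEnv)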
