R - extracting column in dataframes of a loop - r

I need to read a list of csv files and extract the values from the 13th row onward of a specific column (the second one) from each of the dataframes.
Here's my try:
temp <- list.files(FILEPATH, pattern="*\\.csv$", full.names = TRUE)
for (i in 1:length(temp)){
  assign(temp[i], read.csv(temp[i], header=TRUE, skip=13, na.strings=c("", "NA")))
  subset(temp[i], select=2) #extract the second column of the dataframe
  temp[i] <- na.omit(temp[i])
}
However, this doesn't work. On the one hand, I think that's because of the skip argument of the read.csv command, as it apparently ignores the headers. On the other hand, if skip is not used, the following error pops up:
Error in subset.default(temp[i], select = 2) : argument "subset" is
missing, with no default
When I insert the argument subset=TRUE in the subset command, it doesn't give any error, but no extraction is performed.
Any possible solution?

Without seeing the files it's not easy to tell, but I would use lapply rather than a for loop. Maybe you can get inspiration from something like the following. I use read.table because you skip 13 lines, and read.csv would read the first remaining line in as column headers. Note that I avoid the use of assign.
df_list <- lapply(temp, read.table, sep = ",", skip = 13, na.strings = c("", "NA"))
names(df_list) <- temp
col2_list <- lapply(df_list, `[[`, 2)
col2_list <- lapply(col2_list, na.omit)
names(col2_list) <- temp
col2_list
If you want col2_list to be a list of data frames with just one column each (column 2 of the original files), then, as I said in a comment, use
col2_list <- lapply(df_list, `[`, 2)
And to rename that one column and renumber the rows consecutively
new_name <- "the_column_of_choice" # change this!
col2_list <- lapply(col2_list, function(x){
names(x) <- new_name
row.names(x) <- NULL
x
})
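The original loop fails because temp[i] is a file-name string rather than a data frame, which is why the error comes from subset.default. The lapply approach above can be tried end to end with a small sketch like the one below. The folder name, file names, and file contents are made up for illustration, and skip = 2 stands in for the skip = 13 used in the answer.

```r
# Build two small CSV files in a temporary folder (made-up data).
demo_dir <- file.path(tempdir(), "col2_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (nm in c("a.csv", "b.csv")) {
  writeLines(c("id,value", "1,10", "2,", "3,30", "4,40"),
             file.path(demo_dir, nm))
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# skip = 2 drops the header line and the first data row,
# so no header is read and the columns are named V1, V2.
df_list <- lapply(files, read.table, sep = ",", skip = 2,
                  na.strings = c("", "NA"))
names(df_list) <- basename(files)

# Second column of each file as a plain vector, with NA values removed.
col2_list <- lapply(df_list, function(d) as.vector(na.omit(d[[2]])))
col2_list
```

Each element of col2_list is named after its file, so a single file's values can be reached with col2_list[["a.csv"]].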

Related

How do I write a R dataframe to a csv file when every row has its own dataframe?

I have a dataframe where the rows all have their own dataframes. When I use the write.csv() function to save this dataframe into a csv file, I receive the following error:
Error in write.table(staff, "Chiefs of Staff.csv", col.names = NA, sep = ",", :
unimplemented type 'list' in 'EncodeElement'
Here is the code I used
chiefs_of_staff<-jsonlite::fromJSON("http://www.infogo.gov.on.ca/infogo/v1/individuals/search?&keywords=chief%20of%20staff&topOrgId=0&locale=en&_=1569503878383")
staff<-chiefs_of_staff$individuals
write.csv(staff,'Chiefs of Staff.csv')
Any help would be much appreciated.
The following code does what the question asks for.
The problem is complicated by the fact that some of the dataframes in staff[[1]] or staff$assignments have more than 1 row and therefore the dataframe resulting from their rbinding has more than 49 rows.
Also, I have substituted underscores for the spaces in the output filename.
chiefs_of_staff <- jsonlite::fromJSON("http://www.infogo.gov.on.ca/infogo/v1/individuals/search?&keywords=chief%20of%20staff&topOrgId=0&locale=en&_=1569503878383")
staff <- chiefs_of_staff$individuals
assignments <- do.call(rbind, staff[[1]])
assignments$positionTitle <- gsub('<.*>', '', assignments$positionTitle)
assignments$positionTitle <- trimws(assignments$positionTitle)
l <- sapply(staff[[1]], nrow)
n <- nrow(staff[-1])
tmp <- lapply(seq_len(n), function(k){
sapply(staff[k, -1], rep, l[k])
})
tmp <- do.call(rbind, tmp)
out <- cbind(assignments, tmp)
write.csv(out,'Chiefs_of_Staff.csv')
rm(tmp, l, n) # final clean up
You have to convert your json file to a format that write.csv can work with: calling rbind to your list makes a matrix writable to csv.
staff_csv <- do.call("rbind", staff)
write.csv(staff_csv,'Chiefs of Staff.csv')
The assignments column is a list of data frames, and there are a number of ways to handle this. Here is one:
staff$assignments = as.character(staff$assignments)
write.csv(staff,'Chiefs_of_Staff.csv')
That will work.
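To see the list-column problem without calling the web service, here is a minimal sketch on made-up data. The staff data frame below is a hypothetical stand-in for chiefs_of_staff$individuals, and its column names (id, name, title) are assumptions, not the real fields. It repeats each parent row once per row of its nested data frame and then binds the two parts side by side.

```r
# Made-up stand-in for the JSON result: 'assignments' is a list column
# holding one small data frame per row.
staff <- data.frame(id = 1:2, name = c("Ann", "Bob"),
                    stringsAsFactors = FALSE)
staff$assignments <- list(
  data.frame(title = "Chief", stringsAsFactors = FALSE),
  data.frame(title = c("Chief", "Advisor"), stringsAsFactors = FALSE)
)

# Number of nested rows per parent row.
n_rows <- vapply(staff$assignments, nrow, integer(1))

# Repeat each parent row to match, then attach the stacked nested rows.
parent <- staff[rep(seq_len(nrow(staff)), n_rows), c("id", "name")]
flat <- cbind(parent, do.call(rbind, staff$assignments))
rownames(flat) <- NULL

out_file <- file.path(tempdir(), "staff_flat.csv")
write.csv(flat, out_file, row.names = FALSE)
flat
```

Because flat contains only atomic columns, write.csv no longer meets a list element to encode.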

rbindlist - how to get an additional column with info about a source?

I have more than 30 large .csv files stored in one folder. I would like to read them into R as one data.frame/data.table with the following criteria:
(1) first and last 25 rows of each file should be skipped (number of rows differs in each file)
(2) last column should contain unique information on the source of the row (e.g. filename.csv.rownumber from the raw file). The number of columns differs in each file as well.
So far I have this:
ASC_files <- list.files(pattern="*.csv")
read_ASC <- function(x){
ASC <-fread(x, skip=25)
return(ASC[1:(nrow(ASC)-25),])
}
ASC_list <-lapply(ASC_files, read_ASC)
ASC_all <- rbindlist(ASC_list, use.names=TRUE)
However, I have no idea how to get an additional column with information on the source of each row...
Thanks everyone for commenting on my question. Finally, I came up with this solution:
library(data.table) # rbindlist
library(purrr)      # map
library(tibble)     # rowid_to_column
ASC_files <- list.files(pattern="\\.asc$")
# simplify = FALSE keeps a named list of data frames instead of a matrix
ASC_all <- sapply(ASC_files, function(x) read.csv(x, header=FALSE,
  col.names = paste0('V', 1:1268), sep="", stringsAsFactors = FALSE),
  simplify = FALSE)
#adding a new column with name of the source file
ASC_all <- mapply(cbind, ASC_all, "source"=ASC_files, SIMPLIFY = FALSE)
#adding a new column with row number
ASC_all <- map(ASC_all, ~rowid_to_column(.x, var="row"))
#removing last and first 25 rows in each dataframe of the list
ASC_all <- lapply(ASC_all, function(x) x[x$row<=(nrow(x)-25),])
ASC_all <- lapply(ASC_all, function(x) x[x$row>25,])
#transforming the list into a dataframe with all data
ASC_all <- rbindlist(ASC_all)
#complementing the column source with the row number (result: filename.csv.rownumber)
ASC_all$source <- paste0(ASC_all$source, '.', ASC_all$row)
#removing column with row numbers
ASC_all$row <- NULL
Maybe it's not the most elegant and efficient code but at least it works.
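A more compact variant is sketched below, assuming the data.table package is available. The helper read_trimmed, the demo files, and n_trim = 2 (instead of 25) are made up for illustration. Note that the row number recorded here counts data rows after the header, which may differ from the line number in the raw file.

```r
library(data.table)

# Two small demo files with different columns (made-up data).
demo_dir <- file.path(tempdir(), "asc_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(a = 1:6, b = 11:16), file.path(demo_dir, "one.csv"))
fwrite(data.table(a = 1:5, c = 21:25), file.path(demo_dir, "two.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
n_trim <- 2  # rows to drop at each end (25 in the question)

# Hypothetical helper: read one file, drop the first and last n rows,
# and tag every kept row as "<file name>.<row number>".
read_trimmed <- function(path, n) {
  dt <- fread(path)
  keep <- seq_len(nrow(dt))
  keep <- keep[keep > n & keep <= nrow(dt) - n]
  dt <- dt[keep]
  set(dt, j = "source", value = paste0(basename(path), ".", keep))
  dt
}

# fill = TRUE pads columns that are missing in some files with NA.
all_rows <- rbindlist(lapply(files, read_trimmed, n = n_trim),
                      use.names = TRUE, fill = TRUE)

# Move the source column to the end.
setcolorder(all_rows, c(setdiff(names(all_rows), "source"), "source"))
all_rows
```

If only the file name is needed, without the row number, passing a named list to rbindlist together with its idcol argument adds that column directly.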

read.csv from list to get unique colnames

I am reading my files into file_list. The data is read using read.csv; however, I want the data in datalist to use the file names from file_list as column names. The original files do not have a header.
How do I change function(x) so that the second column has a colname similar to the file name? The first column does not have to be unique.
file_list = list.files(pattern="*.csv")
datalist = lapply(file_list, function(x){read.csv(file=x,header=F,sep = "\t")})
How do I change function(x) so that the second column has a colname similar to the file name?
datalist = lapply(file_list, function(x){
dat = read.csv(file=x, header=F, sep = "\t")
names(dat)[2] = x
return(dat)
})
This will put the name of the file as the name of the second column. If you want to edit the name, use gsub or substr (or similar) on x to modify the string.
You can just add another step.
names(datalist) <- file_list
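Both suggestions can be combined, as in the sketch below. The demo folder and the file names siteA.csv and siteB.csv are made up. tools::file_path_sans_ext is used here to drop the extension from the column name, which is one possible way to edit the name.

```r
# Two small tab-separated files without a header (made-up data).
demo_dir <- file.path(tempdir(), "names_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1\t5", "2\t6"), file.path(demo_dir, "siteA.csv"))
writeLines(c("1\t7", "2\t8"), file.path(demo_dir, "siteB.csv"))

file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

datalist <- lapply(file_list, function(path) {
  dat <- read.csv(path, header = FALSE, sep = "\t")
  # Name the second column after the file, without folder or extension.
  names(dat)[2] <- tools::file_path_sans_ext(basename(path))
  dat
})

# Name the list elements after the files as well.
names(datalist) <- basename(file_list)
lapply(datalist, names)
```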

How to get the column name with reference to an object in r

I have multiple csv files. For every file I want to access the second column and apply a regex that removes everything after ":". This pattern is the same for all the files.
I have referred to this question:
In R, how to get an object's name after it is sent to a function?
This is a sample of my file
ID POLL
1 1,2:ksd ksj
2 3:jj
3 6:ok0j
This is what I have tried
setwd("D:/Data/STN")
temp = list.files(pattern="*.csv")
for(i in 1:length(temp)){
  DF1=read.csv(temp[i])
  col2=colnames(DF1)[2]
  assign(paste(DF1,"$"),col2)
  DF1$col2 = gsub(":.*","",DF1$col2)
}
In temp I have all names of all files, I tried with assign but no output.
Thanks in advance
We can use lapply to loop over the list of data.frames and replace the suffix part in the second column using sub.
lst1 <- lapply(lst, function(x) {
  x[,2] <- sub(":.*", "", x[,2])
  x
})
As noted below, the data.frames are read into a list
data
temp <- list.files(pattern="*.csv")
lst <- lapply(temp, read.csv, stringsAsFactors=FALSE)
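The same idea can be checked on the sample from the question without any files. In this sketch the list element name f1 is made up, and the data frame simply reproduces the three sample rows.

```r
# One data frame built from the sample rows in the question.
lst <- list(
  f1 = data.frame(ID = 1:3,
                  POLL = c("1,2:ksd ksj", "3:jj", "6:ok0j"),
                  stringsAsFactors = FALSE)
)

# Strip the ":" and everything after it from the second column,
# addressing the column by position so its name does not matter.
cleaned <- lapply(lst, function(x) {
  x[[2]] <- sub(":.*", "", x[[2]])
  x
})

cleaned$f1$POLL  # "1,2" "3" "6"
```

Addressing the column by position avoids the DF1$col2 problem in the question, where col2 is taken literally as a column name rather than as a variable holding one.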

`rbind` unique entries of all columns of a data frame and write it to a csv file

##Initialise empty dataframe
g <-data.frame(x= character(), y= character(),z=numeric())
## Loop through each columns and list out unique values (with the column name)
for(i in 1:ncol(iris))
{
a<-data.frame(colnames(iris)[i],unique(iris[,i]),i)
g<-rbind(g,a)
setNames(g,c('x','y','z'))
}
## write the output to csv file
write.csv(g,"1.csv")
The output CSV file is something like this
The column headers are not what I want: they should be 'x', 'y', 'z' respectively. Also, the first column should not be there.
Also if you have any other efficient way to do this, let me know. Thanks!
This will do the work:
for(i in 1:ncol(iris))
{
a<-data.frame(colnames(iris)[i],unique(iris[,i]),i)
g<-rbind(g,a)
}
g <- setNames(g,c('x','y','z')) ## note the `g <-`
write.csv(g, file="1.csv", row.names = FALSE) ## don't write row names
setNames returns a new data frame with names "x", "y" and "z" rather than updating the input data frame g, so you need the explicit assignment <- to do the "replacement". You can hide that <- by using either of these two replacement forms:
names(g) <- c('x','y','z')
colnames(g) <- c('x','y','z')
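The difference is easy to confirm on a throwaway data frame. The columns a and b below are made up for the demonstration.

```r
g <- data.frame(a = 1, b = 2)

# The renamed copy is returned but never stored, so g is unchanged.
setNames(g, c("x", "y"))
names_before <- names(g)   # "a" "b"

# Storing the result is what actually renames g.
g <- setNames(g, c("x", "y"))
names_after <- names(g)    # "x" "y"
```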
Alternatively, you can use the col.names argument inside write.table:
for(i in 1:ncol(iris))
{
a<-data.frame(colnames(iris)[i],unique(iris[,i]),i)
g<-rbind(g,a)
}
write.table(g, file="a.csv", col.names=c("x","y","z"), sep =",", row.names=FALSE)
write.csv() does not support col.names, hence we use write.table(..., sep = ","). Trying to use col.names in write.csv will generate a warning.
A more efficient way
I would avoid using rbind inside a loop. I would do:
x <- lapply(iris, function (column) as.character(unique(column)))
g <- cbind.data.frame(stack(x), rep.int(1:ncol(iris), lengths(x)))
write.table(g, file="1.csv", row.names=FALSE, col.names=c("x","y","z"), sep=",")
Read ?lapply and ?stack for more.
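Note that stack() returns the values first and the column name second, so the column order differs from the loop version. If the loop's order (column name, unique value, column index) matters, a base R sketch like the following builds the three columns explicitly. The output file name unique_values.csv is made up, and the file is written to a temporary folder.

```r
# Unique values of every column, as character vectors.
x <- lapply(iris, function(column) as.character(unique(column)))

# One row per unique value: column name, value, column index.
g <- data.frame(
  x = rep(names(x), lengths(x)),
  y = unlist(x, use.names = FALSE),
  z = rep(seq_along(x), lengths(x)),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "unique_values.csv")
write.csv(g, out_file, row.names = FALSE)
head(g)
```

Since the names are set when the data frame is created, write.csv can be used directly and no col.names argument is needed.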
