Facing an issue with the unique() function in R and the class of objects - r

I'm merging multiple files together and then trying to get the unique values out of a particular column. This works perfectly fine when I run the code for a single pattern.
united_tweets <- load_data("united")
nrow(united_tweets)
united_unique <- unique(united_tweets[,2])
But when I run the same code inside a for loop, the unique function seems to fail. When I save a single column, or take the output of unique(), the class of the variable changes from 'list' to 'factor', and trying to find the unique values returns NULL. Can someone point out what is wrong here?
for(i in 1:length(airlines)){
tmp <- load_data(airlines[i])
tweet <- as.list(tmp$text)
print(class(tweet))
tmp1 <- as.list(unique.default(tweet))
print(nrow(tmp1))
}

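Here is the code I used. There are only two differences from yours: read.csv() in place of load_data(), and length(tmp1) in place of nrow(tmp1). A list has no dim attribute, so nrow() on it returns NULL; length() gives the number of elements.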
## file names
airlines = c("Delta03262017123126.csv", "Delta03262017124221.csv")
for(i in 1:length(airlines)){
tmp <- read.csv(airlines[i])
tweet <- as.list(tmp$text)
print(class(tweet))
tmp1 <- as.list(unique.default(tweet))
print(length(tmp1))
}
# [1] "list"
# [1] 1
# [1] "list"
# [1] 3495
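To make the nrow() versus length() point concrete, here is a minimal sketch on a toy data frame. The column names and values are made up for illustration; they stand in for whatever load_data() returns.

```r
# Toy stand-in for one loaded file (column names and values are assumptions).
tweets <- data.frame(
  id   = 1:4,
  text = c("late again", "great crew", "late again", "lost bag"),
  stringsAsFactors = FALSE
)

# Converting the column to a list, as in the question.
tweet_list <- as.list(tweets$text)

# A list has no dimensions, so nrow() returns NULL rather than a count.
n_rows <- nrow(tweet_list)

# unique() works on lists; length() counts the elements that remain.
unique_list <- unique(tweet_list)
n_unique <- length(unique_list)

# Usually the list step is unnecessary: unique() on the column itself
# returns a plain vector. If the column is a factor, convert it first.
unique_text <- unique(as.character(tweets$text))

print(n_rows)
print(n_unique)
print(unique_text)
```

Skipping as.list() altogether keeps the result a character vector, which is generally easier to work with afterwards.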

Related

Applying a function to multiple lists

I am doing research on U.S. lobbying. The data is published through an open API that is very poorly integrated and only seems to allow 250 observations to be downloaded at a time. I would like to compile the whole data set into one data table but am struggling with the last step. This is what I have thus far:
base_url <- sample("https://lda.senate.gov/api/v1/contributions/?page=", 10, rep = TRUE) #Set the number between the commas as how many pages you want
numbers <- 1:10 #Set the second number as how many pages you want
pagesize <- sample("&page_size=250", 10, rep = TRUE) #Set the number between the commas as how many pages you want
pages <- data.frame(base_url, numbers, pagesize)
pages$numbers <- as.character(pages$numbers)
pages$url <- with(pages, paste0(base_url, numbers, pagesize)) # creates list of pages you want. the list is titled pages$url
for (i in 1:length(pages$url)) assign(pages$url[i], GET(pages$url[i])) # Creates all the base lists in need of extraction
The last two things I need to do are extract the data table from each of the created lists and then full join all of them. I know how to join them, but extracting the data frames is proving to be challenging. Basically, I need to apply the function fromJSON(rawToChar(list$content)) to all the created lists. I have tried using lapply but have yet to figure it out. Any help would be greatly welcomed!
In your loop, assign(pages$url[i], GET(pages$url[i])) creates a separate global object named after each URL, which is awkward to loop over afterwards. It is better to store each response in a list and keep it as a response object:
library(httr)
library(jsonlite)
library(dplyr) # for bind_rows
page_content <- list()
for (i in 1:length(pages$url)) page_content[[i]] <- GET(pages$url[i]) # Creates all the base lists in need of extraction
Then you can use the code you had written - fromJSON(rawToChar()) - to extract it from raw bytes to characters:
results_list <- lapply(
page_content,
\(page) fromJSON(rawToChar(page[["content"]]))["results"][[1]]
)
results_table <- do.call(bind_rows, results_list)
dim(results_table) # 2500 27
names(results_table)
# [1] "url" "filing_uuid" "filing_type" "filing_type_display" "filing_year"
# [6] "filing_period" "filing_period_display" "filing_document_url" "filing_document_content_type" "filer_type"
# [11] "filer_type_display" "dt_posted" "contact_name" "comments" "address_1"
# [16] "address_2" "city" "state" "state_display" "zip"
# [21] "country" "country_display" "registrant" "lobbyist" "no_contributions"
# [26] "pacs" "contribution_items"
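As a side note, the page URLs can be built without sample() or an intermediate data frame, because paste0() is vectorised. The sketch below also shows the "list of responses" pattern without touching the network: mock_get() is a hypothetical stand-in for httr::GET() that returns a list with a raw content element, which is the only part of a response the extraction step uses.

```r
# Build the page URLs in one vectorised call.
base_url <- "https://lda.senate.gov/api/v1/contributions/?page="
n_pages  <- 3  # how many pages you want
urls <- paste0(base_url, seq_len(n_pages), "&page_size=250")

# Hypothetical stand-in for httr::GET(): returns a list whose `content`
# element holds raw bytes, like a real response object does.
mock_get <- function(url) {
  list(url = url, content = charToRaw(paste0('{"source": "', url, '"}')))
}

# One list element per page, instead of one global object per page.
page_content <- lapply(urls, mock_get)

# Convert each raw body back to a string; with real responses this is the
# string you would hand to jsonlite::fromJSON().
bodies <- vapply(
  page_content,
  function(page) rawToChar(page[["content"]]),
  character(1)
)

print(urls)
print(bodies[1])
```

Keeping the responses in a list means the same lapply() call works for 3 pages or 300.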

Apply function to all dataframes

I work with SAS files (sas7bdat = dataframes) and SAS formats (sas7bcat).
My sas7bdat files are in a "data" folder, so I can get a list of them in the object files_names.
Here is the first part of my code, working perfectly
files_names <- list.files(here("data"))
nb_files <- length(files_names)
data_names <- vector("list",length=nb_files)
for (i in 1 : nb_files) {
data_names[i] <- strsplit(files_names[i], split=".sas7bdat")
}
for (i in 1:nb_files) {
assign(data_names[[i]],
read_sas(paste(here("data", files_names[i])), "formats/formats.sas7bcat")
)}
But I run into problems when trying to apply the as_factor function from the haven package (to apply labels to my new dataframes and get, for example, SEX = "Male" instead of SEX = 1).
I can make it work dataframe by dataframe, like the code below:
df_labelled <- haven::as_factor(df, only_labelled = TRUE)
I would like to do this in a loop, but it didn't work because data_names[i] holds a name, not a dataframe, and as_factor requires a dataframe as its first argument.
I'm quite new to R, thank you very much if someone could help me.
You might want to think about using different data structures. For example, you can use a named list to store your dataframes; then you can easily loop through them.
In fact you could do everything in one loop. I'm sure there's a more efficient way to do this, but here's an example of one way without changing your code too much:
library(haven) # read_sas(), as_factor()
library(here) # here()
files_names <- list.files(here("data"))
raw_dfs <- list()
labelled_dfs <- list()
for (file_name in files_names) {
# strsplit returns a list, so you could extract its first element like this:
# df_name <- strsplit(file_name, split = ".sas7bdat", fixed = TRUE)[[1]]
# or use something else like sub:
df_name <- sub(".sas7bdat", "", file_name, fixed = TRUE)
# use [[ ]] so that the whole data frame is stored as a single list element
raw_dfs[[df_name]] <- read_sas(here("data", file_name), "formats/formats.sas7bcat")
labelled_dfs[[df_name]] <- haven::as_factor(raw_dfs[[df_name]], only_labelled = TRUE)
}
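The named-list pattern can be tried without any SAS files. In this sketch the data frames, the file names, and the 1/2 coding of SEX are all made up for illustration, and base factor() stands in for haven::as_factor().

```r
# Hypothetical file names; each gets a small made-up data frame in place
# of what read_sas() would return.
file_names <- c("patients.sas7bdat", "visits.sas7bdat")

raw_dfs <- list()
for (file_name in file_names) {
  df_name <- sub(".sas7bdat", "", file_name, fixed = TRUE)
  # [[ ]] stores the whole data frame as one named list element.
  raw_dfs[[df_name]] <- data.frame(SEX = c(1, 2, 1))
}

# Apply the same labelling step to every data frame in the list.
# The 1 = Male, 2 = Female coding is an assumption for this example.
labelled_dfs <- lapply(raw_dfs, function(df) {
  df$SEX <- factor(df$SEX, levels = c(1, 2), labels = c("Male", "Female"))
  df
})

print(names(labelled_dfs))
print(labelled_dfs[["patients"]])
```

Because lapply() keeps the names of its input, labelled_dfs[["patients"]] and raw_dfs[["patients"]] always refer to the same file.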

Errors in finding column mean of .csv file with NA cells in R

I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error: incorrect number of dimensions, even though the number of columns hasn't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parentheses and you definitely do not need the semicolon. In R the semicolon separates instructions; it does not end them. So this line is in fact two instructions: the assignment of a string to folder and, between the ; and the newline, the NULL instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]].
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
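The [ versus [[ distinction and the na.rm argument mentioned in the list of problems can be checked on a small data frame. The column names here are placeholders, not the ones from the sample file.

```r
# Small made-up data frame with one NA in the numeric column.
df <- data.frame(ColumnName = c(1, NA, 3), Other = c("a", "b", "c"))

# Single brackets with a name return a one-column data.frame ...
sub_df <- df["ColumnName"]
# ... while double brackets return the underlying numeric vector.
vec <- df[["ColumnName"]]

# mean() on a data.frame warns "argument is not numeric or logical"
# and returns NA; the warning is suppressed here to keep the output clean.
m_bad <- suppressWarnings(mean(sub_df))

# mean() on the vector works, and na.rm = TRUE skips the missing value.
m_good <- mean(vec, na.rm = TRUE)

# colMeans() accepts the data.frame form directly.
col_means <- colMeans(df["ColumnName"], na.rm = TRUE)

print(class(sub_df))
print(class(vec))
print(m_good)
```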
Try it like this:
setwd("U:/Playground/StackO/")
# Find the number of files in the folder
file_list = list.files(path=getwd(), pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- numeric(length(file_list)) # preallocate; rep(NULL, n) would just be NULL
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get() to retrieve the data.frame; otherwise file_list[i] just returns a string (the file name). I think this is an idiom used in other languages like Python. I tried to stay true to the way you were doing it, but there are easier ways than indexing like this.
Maybe this:
lapply(list.files(path=getwd(), pattern="*.csv"), function(f){ dt <- read.csv(f); mean(dt[,"Delta_TP10"]) })
PS: Be careful with na.omit(): it removes every row that contains an NA, which in your case is your whole data.frame, since the Elements column contains only NA.
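The na.omit() pitfall is easy to reproduce. This sketch uses a made-up data frame with one all-NA column, loosely modelled on the Elements column described above.

```r
# Made-up data: one numeric column with a gap, one column that is all NA.
df <- data.frame(Delta_TP10 = c(1.2, NA, 1.4), Elements = NA)

# na.omit() drops any row with at least one NA, so nothing survives.
rows_left <- nrow(na.omit(df))

# Handling NA per column keeps the usable values.
col_mean <- mean(df[["Delta_TP10"]], na.rm = TRUE)

print(rows_left)
print(col_mean)
```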

read multiple ENVI files and combine them in one csv

I'm fairly new to working with R but trying to get this done. I have dozens of ENVI spectral datasets stored in a directory. Each dataset is separated into two files. They all follow the same naming convention, i.e.:
ID_YYYYMMDD_350-200nm.asr
ID_YYYYMMDD_350-200nm.hdr
The task is to read the dataset, add two columns (ID and date from filename), and store the results in a *.csv-file. I got this to work for a single file (hardcoded).
library(caTools)
setwd("D:/some/path/software_scripts")
### filename without extension
name <- "011a_20100509_350-2500nm"
### split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
### get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
### add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
### CSV-Dataset with all values
write.csv(spectrum, file = paste0(name, ".csv"))
I want to combine all available files into one *.csv file. I know that I have to use list.files, but I have no idea how to apply the read.ENVI function to each file and append the resulting matrices to a CSV.
Update:
library(caTools)
setwd("D:/some/path/mean")
files <- list.files() # change or leave totally empty if setwd() put you in the right spot
all_names <- sub("^([^.]*).*", "\\1", files) # strip off extensions
name <- unique(all_names) # get rid of duplicates from .esl and .hdr
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(name, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "all_results.csv")
you can find a sample dataset here: Sample dataset
I work with a lot of lab data where I can rely on the output files being in a reliable format (same column order, column name, header format, etc). So this is assuming that the .ENVI files you have are similar to that. If your files are not like that, I'm happy to help with that too, I'd just need to see a dummy file or two.
Anyways here's the idea:
library(caTools)
library(lubridate)
library(magrittr)
setwd("~/Binfo/TST/Stack/") # adjust as needed
files <- list.files("data/", full.names = TRUE) # adjust as needed
all_names <- gsub("\\.\\D{3}", "", files) # strip off extensions
names1 <- unique(all_names) # get rid of duplicates
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
f <- gsub(".*\\/(\\d{3}\\D)_.*", "\\1", name)
d <- gsub(".*_(\\d+)_.*", "\\1", name) %>% ymd()
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=f,Datum= as.character(d),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(names1, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "data/all_results.csv")
Let me know if you have any problems and we can go from there. Cheers.
I edited my answer a bit. I think the problem you were hitting is in list.files(): it should have had the argument full.names = TRUE. I also adjusted your parsing method to be a little more defensive and use regex capture groups. I tested the code with your two example files (4 really), and it builds a large matrix (66743 elements). I also used lubridate, which I think is a better way to work with dates and times.
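The filename handling can be tested on its own, without caTools or real ENVI files. This is a base-R sketch of the same idea; the paths are made up to match the naming convention in the question, and parse_stem() is a hypothetical helper, not part of any package.

```r
# Made-up paths following the ID_YYYYMMDD_range naming convention.
files <- c(
  "data/011a_20100509_350-2500nm.esl",
  "data/011a_20100509_350-2500nm.hdr",
  "data/012b_20100611_350-2500nm.esl",
  "data/012b_20100611_350-2500nm.hdr"
)

# Strip the extension, then drop the duplicate left by each file pair.
stems <- unique(tools::file_path_sans_ext(files))

# Hypothetical helper: split the file name on "_" to get ID and date.
parse_stem <- function(stem) {
  parts <- strsplit(basename(stem), "_", fixed = TRUE)[[1]]
  data.frame(
    Flaeche = parts[1],
    Datum   = as.Date(parts[2], format = "%Y%m%d"),
    stringsAsFactors = FALSE
  )
}

# One row per dataset, stacked with do.call(rbind, ...).
meta <- do.call(rbind, lapply(stems, parse_stem))
print(meta)
```

Building the metadata as a data frame also avoids a side effect of cbind() on a numeric matrix: adding character columns there converts every value in the matrix to character.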

R loop perform function on multiple csv files

I have tried to create a for loop that does something for each of 4 csv files similar to this but with more files.
dat1<- read.csv("female.csv", header =T)
dat2<- read.csv("male.csv", header =T)
for (i in 1:2) {
message("Female, Male")
Temp <- dat[i][(dat[i]$NAME == "Temp"), ]
Temp <- Temp[complete.cases(Temp)]
print(mean(Temp$MEAN))
However, I get an error:
Error in Temp$MEAN : $ operator is invalid for atomic vectors
Not sure why this isn't working. Any help would be appreciated for looping through csv files!
Personally, I think the easiest way to do this is with the plyr package:
library(plyr)
myFiles <- c("male.csv", "female.csv")
dat <- ldply(myFiles, read.csv)
dat <- dat[complete.cases(dat), ]
mean(dat$MEAN)
The way this works is that you first create a vector of file names. Then the ldply() function performs the function read.csv() on the vector of filenames, and converts the output automatically to a data.frame. Then you do the complete.cases() and mean() in the usual way.
Edit:
But if you want the mean of each file then here is one way of doing it:
# create a vector of files
myFiles <- c("male.csv", "female.csv")
# create a function that properly handles ONLY ONE ELEMENT
readAndCalc <- function(x){ # pass in the filename
tmp <- read.csv(x) # read the single file
tmp <- tmp[complete.cases(tmp), ] # complete.cases()
mean(tmp$MEAN) # mean
}
x <- "male.csv"
readAndCalc(x) # test with ONE file
sapply(myFiles, readAndCalc) # run with all your files
The way this works is that you first create a vector of filenames, just like before. Then you create a function that processes ONLY ONE file at a time. Then you can test that the function works by calling readAndCalc() on a single file. Finally, run it on all your files with the sapply() function. Hope that helps.
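Here is a self-contained version of the same pattern that writes two small CSV files to a temporary folder first, so it can be run as-is. The file contents are made up, and vapply() is used in place of sapply() only because it checks that each file yields exactly one number.

```r
# Create a temporary folder with two small made-up CSV files.
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(NAME = "Temp", MEAN = c(1, 2, NA)),
          file.path(tmp_dir, "female.csv"), row.names = FALSE)
write.csv(data.frame(NAME = "Temp", MEAN = c(4, 6)),
          file.path(tmp_dir, "male.csv"), row.names = FALSE)

# Full paths, so the working directory does not matter.
my_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Handles exactly one file: read, drop incomplete rows, take the mean.
read_and_calc <- function(path) {
  tmp <- read.csv(path)
  tmp <- tmp[complete.cases(tmp), ]  # note the comma: this filters rows
  mean(tmp$MEAN)
}

# Run it on every file and label the results by file name.
file_means <- vapply(my_files, read_and_calc, numeric(1))
names(file_means) <- basename(my_files)
print(file_means)
```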
