Unexpected results when merging two vectors into a dataframe - r

I am extracting data from multiple CSV files and attempting to combine them into a single data frame. The source data is formatted weirdly, so I have to extract the data from specific locations in the source, then place them in a logical pattern in my resulting data frame.
I created two vectors of equal length and pulled the data from my source files. The end result is that I wind up with two vectors of length 3 (as expected), but instead of having a 3x2 data frame (3 observations of 2 variables), I wind up with a 1x6 data frame (1 observation of 6 variables).
What is curious to me is that although RStudio deems them both to be "List of 3", when I show them in the console, they display very differently:
The source code which doesn't work:
#set the working directory to where the data files are stored
setwd("/foo")
# identify how many data files are present
files = list.files("/foo")
# create vectors long enough to contain all the postal codes and income data
postalCodeData=vector(length=length(files))
medianIncomeData=vector(mode="character", length=length(files))
# loop through all the files, pulling data from rows 2 and 1585.
for(i in 1:length(files)) {
x = read.csv(files[i],skip=1,nrows= 1,header=F)
y = read.csv(files[i], skip = 1584, nrows = 1,header=F)
postalCodeData[i]=x
medianIncomeData[i]=y[2]
}
#create the data frame
Results=data.frame(postalCodeData,medianIncomeData)
#name the columns
names(Results)=c("FSA", "Median Income")
My data frame winds up looking like this:
Source code which does work:
setwd("/Users/Perry/Downloads/Postal Code Data/")
files = list.files("/Users/Perry/Downloads/Postal Code Data/")
postalCodeData=c("K0A","K0B","K0C")
medianIncomeData=c("10000","20000","30000")
Results=data.frame(postalCodeData,medianIncomeData)
names(Results)=c("FSA", "Median Income")
Unfortunately, I can't specify the values explicitly because I have a few hundred files to extract the information from. Any advice on how I can correct the loop to get the desired results would be appreciated.

The output of "read.csv" is a data frame, so, when you store
medianIncomeData[i]=y[2]
you are storing a column of a data frame, use
medianIncomeData[i]=y[2][1]
instead, to store only the value that you want, the same for x
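Putting it together, the loop could look something like this. This is only a sketch, not tested against the real files; it assumes the postal code is the first field of row 2 and the median income is the second field of row 1585, as implied by the original code:
setwd("/foo")
files <- list.files()
postalCodeData <- character(length(files))
medianIncomeData <- character(length(files))
for (i in seq_along(files)) {
  x <- read.csv(files[i], skip = 1, nrows = 1, header = FALSE)
  y <- read.csv(files[i], skip = 1584, nrows = 1, header = FALSE)
  postalCodeData[i] <- as.character(x[[1]])   # first field of row 2, as a plain string
  medianIncomeData[i] <- as.character(y[[2]]) # second field of row 1585
}
Results <- data.frame(postalCodeData, medianIncomeData)
names(Results) <- c("FSA", "Median Income")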

Related

Importing multiple JSON files in R into a single dataframe

Hi,
I want to import JSON files from a folder into a single R data frame (as a single matrix). I have about 40000 JSON files with one observation each and varying numbers of variables.
I tried the following code:
library(rjson)
jsonresults_all <- list.files("mydata", pattern="*.json", full.names=TRUE)
myJSON <- lapply(jsonresults_all, function(x) fromJSON(file=x))
myJSONmat <- as.data.frame(myJSON)
I want my data frame to have about 40000 observations (rows) and some 175 variables (columns), with some variable values being NA.
But I get a single row, with each observation appended to the right.
Many thanks for your suggestions.
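One common way to get one row per file is to convert each parsed record into a one-row data frame and bind them by row, so that variables missing from a given file become NA. A rough sketch, assuming each file parses to a flat named list of scalar values and that the dplyr package is installed:
library(rjson)
library(dplyr)

jsonresults_all <- list.files("mydata", pattern = "*.json", full.names = TRUE)
myJSON <- lapply(jsonresults_all, function(x) fromJSON(file = x))
# one-row data frame per file, then stack them; bind_rows() matches columns
# by name and fills any gaps with NA
myJSONmat <- bind_rows(lapply(myJSON, as.data.frame))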

rds file decompressed has inconsistent size

I have downloaded a .rds file that I have decompressed in R using:
t<-readRDS("myfile.rds")
The file is easily decompressed into a data frame: ncol(t)=24, nrow(t)=20.
When I view the object in RStudio, however, the table actually has 1572 columns and 20 rows.
I would like to know what I am actually dealing with here, mainly because when I try to save this data frame to a MySQL server using RMySQL and DBI (dbWriteTable()), R freezes.
For your information, class(t)='data.frame', typeof(t)='list'.
str(t) yields
tidyr::unnest(t) yields
Thank you for your assistance.
From your str call, consider walking down each nested element and flattening each one accordingly with [[, unlist, or cbind to build one main data frame. The recurring property is that most components appear to have a length of 20 items, matching the number of observations of t.
# FLATTEN LIST OF CHR VECTORS INTO DATA FRAME COLUMN
t$alt_names <- unlist(t$alt_names)
# FLATTEN ONE COLUMN DATA FRAME
t$official_sites <- t$official_sites[[1]]
# ADJUST COLUMNS TO ALL NAs DUE TO ALL EMPTY LISTS
t$stats$previous_seasons <- NA
# CREATE TWENTY NEW COLUMNS FROM LIST OF CHR VECTORS
t$stats[paste0("seasonGoals_away", 1:20)] <- unlist(t$stats$seasonGoals_away)
t$stats$seasonGoals_away <- NULL # REMOVE ORIGINAL COLUMN
# SEPARATE LARGE DATA FRAME PIECES
stats <- t$stats
t$stats <- NULL # REMOVE ORIGINAL COLUMN
# COMBINE LARGE DATA FRAME PIECES
main_df <- cbind(t, stats) # DATA FRAME OF 20 OBS BY 1053 COLUMNS
Apply the same kinds of steps to the other nested objects not shown in the screenshots. Ultimately, main_df should be a data frame of only flat, atomic types with 20 obs by 1053 columns.

How to read a cell from a data frame in a for loop where the data frame name increases with the loop

Sorry for the terrible title. First post here, and new to R.
I am trying to import data from multiple CSV files, extract a single row from each CSV into an individual data frame, and then make a new data frame holding a specific value from each of those initial data frames. I hope this makes sense.
Here is the code I have used so far:
# Take downloaded IFD csv's for 15 points, extract 1% AEP, 6 hour rainfall depths.
files <- list.files(path = "C:PATH")
for (i in 1:length(files)){ # Head of for-loop, length is 15 files
assign(paste0("data", i), # Read and store data frames for row containing 6 hour depths
read.csv2(paste0("C:PATH", files[i]), sep = ",", header = FALSE, nrows = 1, skip = 26))
}
#final value in data frame, position [1,9] is the 1% AEP depth for 6 hours. Extract all of these values from the initial 15 data frames into new dataframes.
for (i in 1:15) {
SixHourOnePercentAEP[i] <- data[i][1,9]
}
In the last loop, an error is returned when trying to call data[i][1,9], since dataframe[x,y] indexing cannot find the data frame for the current iteration of i. I'm looking for a way around this.
It seems that you are trying to create data frames such as data1, data2, etc. for each corresponding file. Then you are trying to access the i-th data frame with the syntax data[i].
But that's not how it works. "data" is not an array of data frames; instead you have distinct variables named data1, data2, etc. What you need is to access a given variable by name. You can do it this way:
SixHourOnePercentAEP <- numeric(15)  # initialise the result vector before filling it
for (i in 1:15) {
  SixHourOnePercentAEP[i] <- get(paste0("data", i))[1, 9]
}
The get() function gets a variable whose name has been passed as a character argument.
However, I find your code rather inefficient. Why gather the entire data frames beforehand when the only thing you need is one cell from each? You should rewrite your first loop to extract the desired value from each data frame immediately and store it, discarding the rest of the data right away, if I understand your purpose correctly; see the sketch below.
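A minimal sketch of that combined loop, reusing the placeholder path and the skip = 26 / cell [1, 9] logic from the question (row6h is just a temporary name for the row that is read):
files <- list.files(path = "C:PATH")
SixHourOnePercentAEP <- numeric(length(files))
for (i in seq_along(files)) {
  row6h <- read.csv2(paste0("C:PATH", files[i]), sep = ",", header = FALSE,
                     nrows = 1, skip = 26)
  SixHourOnePercentAEP[i] <- row6h[1, 9]   # keep only the cell of interest
}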

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and applying it over all data frames.
cleanDF <- function(mydf) {
  # stop if any of the required columns is missing
  if (!all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in% names(mydf)))
    stop("Check data frame names")
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are present. Then it defines the condition of an 'AlterPair_B' value of 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' to represent all of the data frames.
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result

Appending a single variable from multiple files to a data frame or other object type

I'm just learning R. I have 300 different files containing rainfall data. I want to create a function that takes a range of values (e.g., 20-40). It will then read the csv files named "020.csv", "021.csv", "022.csv" etc. up to "040.csv".
Each of these files has a variable named "rainfall". I want to open each csv file, extract the "rainfall" values and store (append) them to some sort of object, like a data frame (maybe something else is better?). So, when I'm done, I'll have a data frame or list with a single column containing rainfall data from all processed files.
This is what I have...
rainfallValues <- function(id = 1:300) {
df = data.frame()
# Read anywhere from 1 to 300 files
for(i in id) {
# Form a file name
fileName <- sprintf("%03d.csv",i)
# Read the csv file which has four variables (columns). I'm interested in
# a variable named "rainfall".
x <- read.csv(fileName,header=T)
# This is where I am stuck. I know how to extract the "rainfall" variable values from
# x, I just don't know how to append them to my data frame.
}
}
Here is a method using lapply that will return a list of rainfall vectors:
rainList <- lapply(id, function(i) {
temp <- read.csv(sprintf("%03d.csv",i))
temp$rainfall
})
To put this into a single vector:
rainVec <- unlist(rainList)
The unlist function preserves the order in which you read the files, so the first element of rainVec will be the first observation of the rainfall column from the first file in id, the second element will be the second observation from that file, and so on, through to the last observation of the last file.
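If you would rather end up with a data frame than a bare vector, you can wrap the result afterwards. A small sketch building on the objects defined above (the file column is optional and simply records which file each value came from):
rainDF <- data.frame(rainfall = rainVec)
# or, if you also want to record which file each value came from:
rainDF <- data.frame(file = rep(id, times = lengths(rainList)),
                     rainfall = rainVec)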
