Row count when converting a pandas DataFrame into a vDataFrame looks considerably lower than the individual pandas DataFrames - vertica-python

I am working on the Titanic data set. I read the train.csv and test.csv files using the read_csv function:
from verticapy import *
#drop table if already exists
drop('v_temp_schema.train', method='table')
train_vdf = read_csv("train.csv")
train_vdf
from verticapy import *
#drop table if already exists
drop('v_temp_schema.test', method='table')
test_vdf = read_csv("test.csv")
test_vdf
Each vDataFrame contains at least 100 rows.
Then I combine the two vDataFrames:
import pandas as pd
# combine is a list, but its elements are vDataFrames. pd.concat requires
# Series or DataFrames, so convert each vDataFrame to a pandas data frame first.
combine = [train_vdf.to_pandas(), test_vdf.to_pandas()] #gives a list
combine_pdf = pd.concat(combine)
combine_pdf
The output shows 1234 rows × 12 columns.
Now, when I convert the combined pandas DataFrame back to a vDataFrame, I get far fewer rows. I see only two rows in the output. Why?
set_option("sql_on", False)
combine_vdf = pandas_to_vertica(combine_pdf)
combine_vdf
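For anyone hitting the same symptom, a minimal diagnostic sketch, assuming the verticapy and pandas APIs used above: compare the reported shapes directly instead of trusting the printed preview, and reset the duplicated pandas index that pd.concat keeps from train and test.
import pandas as pd
from verticapy import pandas_to_vertica
# pd.concat keeps the original indexes, so test's row labels repeat train's;
# ignore_index=True rebuilds a clean RangeIndex before conversion.
combine_pdf = pd.concat(combine, ignore_index=True)
print(combine_pdf.shape)    # pandas attribute: (n_rows, n_cols)
combine_vdf = pandas_to_vertica(combine_pdf)
print(combine_vdf.shape())  # verticapy vDataFrame.shape() is a method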

Related

How to extract a common column from multiple tsv files and combine them into one dataframe in R?

I want to extract a common column named "framewise_displacement" from 162 tsv files arranged by subject ID numbers (e.g., sub-CC123_timeseries.tsv, sub-CC124_timeseries.tsv, etc.) that have a different number of columns but the same number of rows, and merge them into a single dataframe.
The new dataframe should have one "framewise_displacement" column per subject, labelled with the subject ID, and the same rows as the original files.
I tried the vroom function in R, but it failed because the files have a different number of columns.
I also tried this code, but the output stacked all the columns into one single column:
library(purrr)
library(vroom)
files <- fs::dir_ls(path = "Documents/subject_timeseries", glob = "*.tsv")
merged_df <- map_df(files, ~ vroom(.x, col_select = c(framewise_displacement)))
What should I do to merge them into one dataframe with the desired columns side by side?
Any suggestions would be appreciated.
Many thanks!!!
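One possible sketch, assuming the file layout described above (the subject-ID regular expression is illustrative): read only the target column from each file, rename it to the subject ID, and bind the one-column tables side by side.
library(dplyr)
library(purrr)
library(stringr)
library(vroom)
files <- fs::dir_ls(path = "Documents/subject_timeseries", glob = "*.tsv")
# Each list element is a one-column tibble holding framewise_displacement.
cols <- map(files, ~ vroom(.x, col_select = c(framewise_displacement)))
# Rename each column to the subject ID embedded in the file name,
# e.g. "sub-CC123" from "sub-CC123_timeseries.tsv".
ids <- str_extract(basename(files), "sub-[^_]+")
cols <- map2(cols, ids, ~ setNames(.x, .y))
merged_df <- bind_cols(cols)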

Importing multiple JSON files in R into a single dataframe

Hi,
I want to import JSON files from a folder into an R data frame (as a single matrix). I have about 40000 JSON files with one observation each and a varying number of variables.
I tried the following code:
library(rjson)
jsonresults_all <- list.files("mydata", pattern="*.json", full.names=TRUE)
myJSON <- lapply(jsonresults_all, function(x) fromJSON(file=x))
myJSONmat <- as.data.frame(myJSON)
I want my data frame to have about 40000 observations (rows) and some 175 variables (columns), with some variable values NA.
But I get a single row with each observation appended to the right.
Many thanks for your suggestions.
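A sketch of one common approach, assuming each file parses to a flat, named list: convert each parsed observation to a one-row data frame and row-bind them, so variables missing from a file are filled with NA.
library(rjson)
library(dplyr)
jsonresults_all <- list.files("mydata", pattern = "*.json", full.names = TRUE)
myJSON <- lapply(jsonresults_all, function(x) fromJSON(file = x))
# One row per file; bind_rows() aligns columns by name and fills
# variables missing from a file with NA.
rows <- lapply(myJSON, function(obs) as.data.frame(obs, stringsAsFactors = FALSE))
myJSONmat <- bind_rows(rows)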

rds file decompressed has inconsistent size

I downloaded a .rds file that I read into R using:
t<-readRDS("myfile.rds")
The file reads easily into a data frame: ncol(t) = 24, nrow(t) = 20.
When I view it in RStudio, the table actually has 1572 columns and 20 rows.
I would like to know what I am actually dealing with here, mainly because when I try to save this data frame to a MySQL server using RMySQL and DBI (dbWriteTable()), R freezes.
For your information, class(t)='data.frame', typeof(t)='list'.
str(t) and tidyr::unnest(t) yield the output shown in the attached screenshots.
Thank you for your assistance.
From your str call, consider walking down each nested element and flattening each one accordingly with [[, unlist, or cbind to generate a main data frame. The recurring property is that most components appear to have a length of 20 items, the number of observations in t.
# FLATTEN LIST OF CHR VECTORS INTO DATA FRAME COLUMN
t$alt_names <- unlist(t$alt_names)
# FLATTEN ONE COLUMN DATA FRAME
t$official_sites <- t$official_sites[[1]]
# ADJUST COLUMNS TO ALL NAs DUE TO ALL EMPTY LISTS
t$stats$previous_seasons <- NA
# CREATE TWENTY NEW COLUMNS FROM LIST OF CHR VECTORS
t$stats[paste0("seasonGoals_away", 1:20)] <- unlist(t$stats$seasonGoals_away)
t$stats$seasonGoals_away <- NULL # REMOVE ORIGINAL COLUMN
# SEPARATE LARGE DATA FRAME PIECES
stats <- t$stats
t$stats <- NULL # REMOVE ORIGINAL COLUMN
# COMBINE LARGE DATA FRAME PIECES
main_df <- cbind(t, stats) # DATA FRAME OF 20 OBS BY 1053 COLUMNS
Add similar steps for the other nested objects not shown in the screenshots. Ultimately, main_df should be a data frame of only flat, atomic types, 20 obs by 1053 columns.
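A more generic sketch of the same idea, assuming every problem column is either a list of length-one elements or a nested data frame (the helper name flatten_df is illustrative):
# Unlist simple list-columns in place, split out nested data frames,
# then cbind everything back into one flat data frame.
flatten_df <- function(df) {
  extra <- list()
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.data.frame(col)) {
      extra[[nm]] <- col
      df[[nm]] <- NULL
    } else if (is.list(col) && all(lengths(col) == 1)) {
      df[[nm]] <- unlist(col)
    }
  }
  do.call(cbind, c(list(df), extra))
}
main_df <- flatten_df(t)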

Comparing column headers of two files to fetch data in R

I have a large CSV file, say INPUT, with 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT and changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in DF and write it to another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe with the INPUT file to be automated, because I don't want to do this every day when the dataframe gets updated.
I'm not sure whether this is the answer:
input <- read.table(...)
input[colnames(input) %in% DF]  # DF holds the header names themselves, so match against DF directly
If I understand it correctly, you need to import the INPUT.csv file into R and then match the columns of your DF with the columns of your INPUT, is that correct?
You can either use the match function or just import the INPUT.csv file into RStudio via the "Import Dataset" button and subset it. Subsetting imported dataframes is fairly easy.
If you import your dataset as INPUT, you can subset columns in the following way: INPUT[,c(1,2,4)]
That will get you the first, second and fourth columns of the INPUT dataset.
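A sketch tying this to the question's example, assuming DF is the character vector of header names shown by print(DF):
# Keep ID plus the columns named in DF, in DF's order, then write out.
input <- read.csv("INPUT.csv", check.names = FALSE)
output <- input[, c("ID", DF)]
write.csv(output, "OUTPUT.csv", row.names = FALSE)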
First, reading the csv is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one dataframe's columns are always a subset of the other's, the code is as follows:
### Example dataframes
df1 <- data.frame(one = c(1, 3, 4), two = c(1, 2, 3), three = c(1, 2, 3))
df2 <- data.frame(one = c(1, 3, 4), three = c(1, 2, 3))
### Make new data frame with only df2's columns
df3 <- df1[, colnames(df2)]
### Write new dataframe
write.csv(df3, 'hello.csv')

Retaining numerical data from csv file

I am trying to import a csv dataset about the number of benefit recipients per month and district. The table (screenshot not reproduced here) has 43 variables (months) and 88 observations (districts).
Unfortunately, when I import the dataset with the following code:
D <- read.csv2(file="D.csv", header=TRUE, sep=";", dec=".")
all my numbers get converted to characters.
I tried the as.is=T argument, and to use read.delim, as suggested by Sam in this post: Imported a csv-dataset to R but the values becomes factors
but it did not help.
I also tried deleting the first two columns in the original csv file to get rid of the district names (the only truly non-numeric columns), but I still get characters in the imported data frame. Can you please help me figure out how to retain my numerics?
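A sketch of a diagnostic, assuming the usual cause of this symptom, European number formatting: read everything as character first, look at the raw values, then convert with the matching decimal mark.
# Read with no type conversion and inspect the raw values:
# decimal commas ("1234,5") or thousands separators will show up here.
D <- read.csv2(file = "D.csv", header = TRUE, colClasses = "character")
D[1:3, 1:4]
# If values use a decimal comma, read.csv2's default dec = "," already
# handles it; forcing dec = "." is what leaves the columns as character.
D <- read.csv2(file = "D.csv", header = TRUE)
str(D)
# Alternatively, convert the character columns in place
# (num_cols is hypothetical; adjust to your layout).
num_cols <- 3:ncol(D)
D[num_cols] <- lapply(D[num_cols], function(x) as.numeric(gsub(",", ".", x)))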
