How to bind filename as column in R

The code so far looks like this:
abc <- import_list(dir("MyData/", pattern = "*.xlsx", full.names = TRUE),
                   rbind = TRUE, rbind_label = "source")
Using the "rio" package this code imports many excel files at once putting one table under the other. The columns are sorted by column name (rbind = TRUE) in order to avoid a situation where data is put into the wrong columns (e.g. if some tables have more columns than others).
I want to have a FIRST column that contains the name of the Excel file so that I know where the data comes from. However, there are two problems with rbind_label = "source":
It creates a column, but that column holds the whole path of the file (pretty long), not just its name.
The column is not at the beginning of the newly created table, but somewhere in the middle.
How can I solve these two problems?

Assuming the source column is named source, this will make it the first column:
abc <- abc[c('source', setdiff(names(abc),'source'))]
This will change that column's values from the full path to just the filename:
abc$source <- basename(abc$source)
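If you prefer a pipeline, both fixes can be combined in one step. A minimal sketch, assuming dplyr (>= 1.0, which introduced relocate()) is acceptable here:
library(dplyr)
abc <- abc %>%
  mutate(source = basename(source)) %>%  # strip the directory part, keep only the file name
  relocate(source)                       # move the label column to the front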

Related

R: Read specific columns of a .dta file and convert variable names to lower case without reading the whole file

I have a folder with multiple .dta files and I'm using the read_dta() function of the haven library to bind them. The problem is that some of the files have their column names in lower case and others have them in upper case.
I was wondering if there is a way to read only the specific columns, converting their names to lower case in every case, without reading the whole file and then selecting the columns, since the files are really large and this would take forever.
I was hoping that the .name_repair = argument of read_dta() would let me do this, but I really don't know how.
I'm trying something like this:
# Set working directory:
setwd("T:/")
library(haven)
# List of .dta file names to bind (the list.files() call is assumed here):
list_names <- list.files()
list_names <- list_names[grepl("_sdem.dta", list_names)]
# Variable names to select from those files:
vars_select <- c("r_def", "c_res", "ur", "con", "n_hog", "v_sel", "n_pro_viv",
                 "fac", "n_ren", "upm", "eda", "clase1", "clase2", "clase3",
                 "ent", "sex", "e_con", "niv_ins", "eda7c", "tpg_p8a",
                 "emp_ppal", "tue_ppal", "sub_o")
# Read and bind ONLY the selected variables from the list of files
dataset <- data.frame()
for (i in 1:length(list_names)) {
  temp_data <- read_dta(list_names[i], col_select = vars_select)
  dataset <- rbind(dataset, temp_data)
}
The problem is that when some of the files have their variable names in upper case, those variables do not match the vars_select list, and therefore the following error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
I was trying to use the .name_repair = argument of read_dta() to correct this, by applying the tolower() function.
I was trying something like this with a specific file whose variable names are upper case:
example_data <- read_dta("T:/2017_2_sdem.dta", col_select = vars_select, .name_repair = tolower(names()))
But the same error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
Thanks so much for your help!
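One possible workaround (a sketch, not a tested answer): since col_select uses tidyselect semantics, you could ask for each wanted variable in either case with any_of() and then lower-case the names after reading. The doubled vars_both vector and the renaming step are my additions, not part of the original code:
library(haven)
library(tidyselect)
# accept both lower- and upper-case spellings of the wanted variables
vars_both <- c(vars_select, toupper(vars_select))
dataset <- data.frame()
for (i in 1:length(list_names)) {
  temp_data <- read_dta(list_names[i], col_select = any_of(vars_both))
  names(temp_data) <- tolower(names(temp_data))  # normalize before binding
  dataset <- rbind(dataset, temp_data)
}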

Changing the name of a dataframe inside of a dataframe

I am working with a folder of csv files. These files were imported using the following code:
data_frame <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data <- lapply(data_frame, read.csv)
names(csv_data) <- gsub(".csv", "",
                        list.files("path", pattern = ".csv", all.files = TRUE, full.names = FALSE),
                        fixed = TRUE)
After this runs, each data frame in the list is named after its csv file. Since I have over 3000 csv files, I was wondering how to change their names to keep track of them better.
For example, instead of 'City, State, US', it will generate 'City-State-US'.
I apologize if this has already been asked, but I cannot find anything that could help.
So, if I understand your question correctly, you have CSV files named "City, State, US.csv" (so "Chicago, IL, USA.csv", etc.), you are reading them into R and storing them in a list where the list element name is the CSV name, and you want to make some changes to that element name?
You can access the names of the list items using names(csv_data), as you did above, and then transform them however you like and write them back in the same way.
For instance, the example you gave:
names(csv_data) <- gsub(", ", "-", names(csv_data), fixed = TRUE)
This should do what you need. If you need to do something else, just change the gsub parameters or function to something else - the key is that you can extract and write back the list item names in one shot.
You are already sort of doing this with the third line, where you name the items - you could even make the treatment before you assign the names.
Edit: Also, a quick note: you are already storing the output of list.files() in the data_frame variable - you could just reuse that variable in the third line instead of calling list.files() again.
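Putting both notes together, a condensed version might look like this (a sketch; the basename() call and the combined gsub() steps are my additions):
files <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data <- lapply(files, read.csv)
# name each element after its file, dropping the extension and replacing ", " with "-"
element_names <- gsub(".csv", "", basename(files), fixed = TRUE)
names(csv_data) <- gsub(", ", "-", element_names, fixed = TRUE)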

How can I parse a json string in a txt file?

I need to parse two columns from a txt file in which I have:
first column: id (with only one value)
second column: parameters (which has several json-like fields inside, see below).
Here an example:
ID;PARAMETERS
Y;"""Pr""=>""22"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""99"", ""Opt1""=>""67"", ""Opt2""=>""0"",
S;"""Pr""=>""5"", ""Err""=>""255"", ""Opt1""=>""0"", ""Opt2""=>""0"", ""Opt3""=>""55"", ""Opt4""=>""0"",
K;"""Pr""=>""1"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""21"", ""Opt1""=>""0"", ""Opt2""=>""0"",
P;"""Pr""=>""90"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""20"", ""Opt1""=>""0"", ""Opt2""=>""0"",
My dataset is in csv format, but I have also tried transforming it into a txt file and into a json file, and I cannot parse the two columns when I import it into R.
I would like to obtain each parameter in its own column, and if an ID does not have a parameter I need NA.
Can you help me please?
I have tried this R code but it does not work:
setwd("D:/")
df <- read.delim("file name", header = TRUE, na.strings = -1)
install.packages("jsonlite")
library(jsonlite)
filepath <- "file name"
prova <- fromJSON(filepath)
Thanks
You can import the csv directly in RStudio in several ways:
data <- read.csv("your_data.csv")
Otherwise you can load it via 'Import Dataset' in the Environment tab. It will open a new window in which you can browse to your file without setting the working directory.
To set NA for 0 values, here is an example:
# I create new data for this example;
# you will have your own once you succeed in importing your data.
data <- data.frame(ID = c("H1","H2","H3"), PARAMETERS_Y = c(42,0,3))
data[which(data$PARAMETERS_Y == 0), 2] <- NA
This looks for rows in the PARAMETERS_Y column which are equal to 0 and replaces them with NAs. Don't forget to change the column name to match your previously loaded data.
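As for splitting the PARAMETERS field itself: it is not valid JSON but a series of "key"=>"value" pairs, so fromJSON() will not parse it directly. Here is a rough base-R sketch (assuming a semicolon-separated file like the excerpt; "file name" and the column names are placeholders):
raw <- read.csv("file name", sep = ";", stringsAsFactors = FALSE)
# pull every "key"=>"value" pair out of one PARAMETERS string
parse_pairs <- function(s) {
  m <- regmatches(s, gregexpr('"([^"]+)"=>"([^"]*)"', s))[[1]]
  keys <- sub('"([^"]+)"=>.*', "\\1", m)
  vals <- sub('.*=>"([^"]*)"', "\\1", m)
  setNames(as.list(vals), keys)
}
parsed <- lapply(raw$PARAMETERS, parse_pairs)
# build one column per parameter; IDs lacking a parameter get NA
all_keys <- unique(unlist(lapply(parsed, names)))
rows <- lapply(parsed, function(p) {
  row <- setNames(rep(NA_character_, length(all_keys)), all_keys)
  if (length(p)) row[names(p)] <- unlist(p)
  as.data.frame(as.list(row), stringsAsFactors = FALSE)
})
result <- cbind(ID = raw$ID, do.call(rbind, rows))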

read.csv importing two columns instead of one

I'm trying to import a csv file into a vector. There are 100 entries in this csv file, one value per row.
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up with a second column of values that I cannot explain. It is somehow creating a second column and I cannot figure out why. In addition, trying to write to a new csv file actually writes the contents of that second column as well.
The second column was enabled in Excel.
Option 1: manually delete the column in Excel.
Option 2: delete all columns that contain only NAs:
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
If you are only interested in reading the first column:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!
Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.
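A quick way to verify this from R itself (the demo file name is made up):
# write a tiny file with trailing commas, then read it back
writeLines(c("A,", "B,", "D,"), "choices_demo.csv")
read.csv("choices_demo.csv", header = FALSE)
#   V1 V2
# 1  A NA
# 2  B NA
# 3  D NA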
If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear why the issue happens. It could be some stray white space. If that is the case, specify strip.white = TRUE (by default it is FALSE):
read.csv("choices.csv", header = FALSE,
         fileEncoding = "UTF-8-BOM", strip.white = TRUE)
Or, as we commented, copy the columns of interest into a new file, save it, and then read it with read.csv.

Select multiple rows in multiple DFs with loop in R

I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
  p[i] <- paste("path", subjectlist[i], sep = "")
  files <- list.files(path = p, full.names = T, include.dirs = T)
  assign(paste("subject_", i, sep = ""),
         read.csv(paste("path", subjectlist[i], ".csv", sep = ""),
                  header = T, stringsAsFactors = T, row.names = NULL))
  assign(paste("subject_", i, "_t", sep = ""),
         sapply(paste("subject_", i, sep = ""),
                [c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
# Read in a csv file from the files vector
df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
# Add a column telling us the name of the csv file that the data came from
df$SourceFile = file
# Select only the rows we want
df = df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
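One small follow-up: do.call(rbind, ...) on a named list gives the result row names like "file.csv.3". If that bothers you, a possible tweak (my addition, not part of the answer above) is to suppress them:
# pass make.row.names = FALSE through to rbind.data.frame
allDFs = do.call(rbind, c(df.list, make.row.names = FALSE))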
