Fewer records than expected in data frame - r

I have found a few posts about similar issues, but I still can't solve my problem. The data frame was expected to contain 957463 records, but only 392400 were extracted.
I used read.delim2("test.csv", header = TRUE, sep = ",", quote = "\"", fill = TRUE) to create the data frame, but the output had fewer records than expected.
#set working directory --------------------------------
L <- setwd("C:/Users/abmo8004/Documents/R Project/csv/")
#List files in the path ------------------------
l <- list.files(L)
#form dataframe from csv file ---------------------------
df <- read.delim2("test.csv", header = TRUE, sep = ",", quote = "\"", fill = TRUE)
I expect 957463 records, but the actual output is 392400. Can anyone please look at the code?
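One common cause of an undercount like this is stray quote characters in the data: with quote = "\"", an unmatched quote makes the parser read across line breaks, collapsing many physical lines into one record. A minimal sketch (the file contents here are made up for illustration):

```r
# Two data rows, but the stray quote in row 1 makes the parser
# read across the line break when quoting is enabled.
csv <- 'id,comment\n1,5" pipe\n2,ok'
f <- tempfile(fileext = ".csv")
writeLines(csv, f)

nrow(read.delim2(f, sep = ",", quote = "\"", fill = TRUE))  # 1 -- rows were merged
nrow(read.delim2(f, sep = ",", quote = "",   fill = TRUE))  # 2 -- as intended

# quick sanity check: number of physical data lines in the file
length(readLines(f)) - 1  # 2
```

If disabling quoting fixes the count, the file contains unbalanced quotes; either clean them up or keep quote = "" when reading.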

Related

How to read multiple text files, add column headings to every file, and overwrite the old files with new ones in R?

I have multiple EEG data files in .txt format all saved in a single folder, and I would like R to read all the files in said folder, add column headings (i.e., electrode numbers denoted by ordered numbers from 1 to 129) to every file, and overwrite old files with new ones.
rm(list=ls())
setwd("C:/path/to/directory")
files <- Sys.glob("*.txt")
for (file in files){
  # read data:
  df <- read.delim(file, header = TRUE, sep = ",")
  # add header to every file:
  colnames(df) <- paste("electrode", 1:129, sep = "")
  # overwrite old text files with new text files:
  write.table(df, file, append = FALSE, quote = FALSE, sep = ",",
              row.names = FALSE, col.names = TRUE)
}
I expect the column headings of ordered numbers (i.e., electrode1 to electrode129) to appear on first row in every text file but the code doesn't seem to work.
I bet the solution is ridiculously simple, but I just haven't found any useful information regarding this issue...
Try this: your raw files have no header row, so read them with header = FALSE (with header = TRUE the first row of data is consumed as column names). Also write each result back to its own file rather than to one fixed name:
for (file in files) {
  df <- read.delim(file, header = FALSE, sep = ",")
  colnames(df) <- paste("electrode", 1:129, sep = "")
  write.table(df, file, sep = ",", quote = FALSE, row.names = FALSE)
}
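A quick way to see why header = FALSE matters for headerless files (a self-contained sketch using a throwaway temp file; the numbers are made up):

```r
f <- tempfile(fileext = ".txt")
writeLines(c("1,2,3", "4,5,6"), f)   # two data rows, no header line

nrow(read.delim(f, header = TRUE,  sep = ","))  # 1 -- first data row was eaten as the header
nrow(read.delim(f, header = FALSE, sep = ","))  # 2 -- all rows kept
```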

Why do string entries not get read into the data.frame in R?

I have a data.tsv file (tabs separate entries). The full file can be found here.
The entries in the file look like this:
">173D:C" "TVPGVXTVPGV" "CCSCCCCCCCC"
">173D:D" "TVPGVXTVPGV" "CCCCCCCCSCC"
">185D:A" "SAXVSAXV" "CCBCCCBC"
">1A0M:B" "GCCSDPRCNMNNPDYCX" "CCTTSHHHHHTCTTTCC"
">1A0M:A" "GCCSDPRCNMNNPDYCX" "CGGGSHHHHHHCTTTCC"
">1A0N:A" "PPRPLPVAPGSSKT" "CCCCCCCCSTTCCC"
I am trying to read string entries into the data frame (into a matrix
containing 3 columns):
data = data.frame(read.csv(file = './data.tsv', header = FALSE, sep = '\t'))
but only the first column is read. All other columns are empty.
I also tried different commands, such as
data = read.csv(file = './data.tsv', header = FALSE, sep = '\t')
data = read.csv(file = './data.tsv', sep = '\t')
data = data.frame(read.csv(file = './data.tsv'))
but without success. Can someone see why the input does not get read successfully?
Using the file defined reproducibly in the Note at the end this works:
DF <- read.table("myfile.dat", as.is = TRUE)
gives:
> DF
V1 V2 V3
1 >173D:C TVPGVXTVPGV CCSCCCCCCCC
2 >173D:D TVPGVXTVPGV CCCCCCCCSCC
3 >185D:A SAXVSAXV CCBCCCBC
4 >1A0M:B GCCSDPRCNMNNPDYCX CCTTSHHHHHTCTTTCC
5 >1A0M:A GCCSDPRCNMNNPDYCX CGGGSHHHHHHCTTTCC
6 >1A0N:A PPRPLPVAPGSSKT CCCCCCCCSTTCCC
Note
Lines <- '">173D:C" "TVPGVXTVPGV" "CCSCCCCCCCC"
">173D:D" "TVPGVXTVPGV" "CCCCCCCCSCC"
">185D:A" "SAXVSAXV" "CCBCCCBC"
">1A0M:B" "GCCSDPRCNMNNPDYCX" "CCTTSHHHHHTCTTTCC"
">1A0M:A" "GCCSDPRCNMNNPDYCX" "CGGGSHHHHHHCTTTCC"
">1A0N:A" "PPRPLPVAPGSSKT" "CCCCCCCCSTTCCC"'
writeLines(Lines, "myfile.dat")
Use sep = ''. The fields are most likely separated by whitespace rather than by literal tab characters, and sep = '' splits on any run of whitespace:
data = read.csv(file = './data.tsv', header = FALSE, sep = '')
See this answer.
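The difference is easy to reproduce. If the columns are separated by spaces rather than literal tabs, sep = '\t' sees a single column, while sep = '' (also the read.table default) splits on any run of whitespace. A sketch with two made-up rows in the same format:

```r
f <- tempfile()
writeLines(c('">173D:C" "TVPGVXTVPGV" "CCSCCCCCCCC"',
             '">173D:D" "TVPGVXTVPGV" "CCCCCCCCSCC"'), f)

ncol(read.csv(f, header = FALSE, sep = "\t"))  # 1 -- there are no tabs in the file
ncol(read.csv(f, header = FALSE, sep = ""))    # 3 -- split on whitespace
```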

Skip different number of rows in import

I'm importing a lot of datasets. All of them have some empty lines at the top (before the header); however, it's not always the same number of rows that I need to skip.
Right now I'm using:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE, skip = 9)
But sometimes I only need to skip 3 lines, for example.
Can I somehow set up a rule for when my column B (in Excel) begins with one of the following words:
Datastatistik
Overførte records
FI-CA
Oprettet
Column A is always empty but I delete this in a code after the import.
This is an example of my data (I have hidden personal numbers):
My first variable header is called "Bilagsnummer" or "Bilagsnr.".
I don't know if it's possible to set up a rule that says something like "the first occurrence of this word is my header"? Really I'm just brainstorming here, because I have no idea how to automate this data import.
---EDIT---
I looked at the post @Bram linked to, and it did solve some of my problem.
I changed some of it.
This is the code I used:
temp <- readLines("file.xls")
skipline <- which(grepl("\tDatastatistik", temp) |
                  grepl("\tOverførte", temp) |
                  grepl("FI-CA", temp) |
                  grepl("Oprettet", temp) |
                  temp == "")
So the skipline integer vector I made contains the lines that need to be skipped. These are matched correctly by the grepl calls (since the wording at the end of the line changes from time to time).
Now, I still have a problem though.
When I use skip = skipline in my read.delim, it only works for the first row.
I get the warning message:
In if (skip > 0L) readLines(file, skip) :
the condition has length > 1 and only the first element will be used
May have found a solution, but not the optimal one. Let's see.
Import your df with the empty lines:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE)
Find the number of empty rows at the beginning:
NonNAindex <- which(!is.na(df2[,2]))
lastEmpty <- (min(NonNAindex)-1)
Re-import your document using that info:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE, skip = lastEmpty)
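Since skip only accepts a single number, another approach is to locate the first line of the real table and skip everything before it. This is a sketch under the assumption, taken from the question, that the first header cell is "Bilagsnummer" or "Bilagsnr."; a toy file stands in for the real export:

```r
# toy file standing in for the real export (contents made up)
f <- tempfile(fileext = ".txt")
writeLines(c("Datastatistik", "", "Oprettet 2019", "",
             "Bilagsnummer\tBeloeb", "1001\t250", "1002\t300"), f)

temp <- readLines(f)
header_row <- min(grep("Bilagsnummer|Bilagsnr", temp))  # first line of the real table

df2 <- read.delim(f, sep = "\t", skip = header_row - 1)
names(df2)[1]  # "Bilagsnummer"
nrow(df2)      # 2
```

With readr, the same scalar works as skip = header_row - 1 in read_delim.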

How to Create Table from Irregular Length Element in R

I'm new to R and am seeking some digestible guidance. I wish to create a data.frame so I can set up columns and establish variables in my data. I start by reading a URL into R and saving the result to Excel:
data <- read.delim("http://weather.uwyo.edu/cgi-bin/wyowx.fcgi?TYPE=sflist&DATE=20170303&HOUR=current&UNITS=M&STATION=wmkd",
                   fill = TRUE, header = TRUE, sep = "\t", stringsAsFactors = TRUE,
                   na.strings = " ", strip.white = TRUE, nrows = 27, skip = 9)
write.xlsx(data, "E:/Self Tutorial/R/data.xls")
This data has missing values somewhere in the middle of the elements, which makes the row lengths irregular. Because of the irregular lengths I used write.table instead of data.frame.
As a first attempt, the object shows up in the global environment as a NULL value, not as data:
dat.table = write.table(data)
str(dat.table) # just checking #result NULL?
try again
dat.table = write.table(data,"E:/Self Tutorial/R/data.xls", sep = "\t", quote = FALSE)
dat.table ##print nothing
remove sep =
dat.table = write.table(data, "E:/Self Tutorial/R/data.xls", quote = FALSE)
dat.table ##print nothing
Since it's not working, I try read.table:
dat.read <- read.table("E:/Self Tutorial/R/data.xls", header = T, sep = "\t")
The data loads in the R console but, as expected, with an irregular column distribution (even though I already used na.strings = " " and strip.white = TRUE in the read.delim arguments).
What should I understand from this mistake, and where is it? Thank you.
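One point worth separating out: write.table is called for its side effect of writing a file and returns NULL invisibly, so assigning its result will never produce a data.frame; to inspect the data, read the file back instead. A small sketch:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

f <- tempfile(fileext = ".txt")
out <- write.table(df, f, sep = "\t", quote = FALSE, row.names = FALSE)
is.null(out)  # TRUE -- write.table returns nothing to keep

# to get a data.frame, read the file back:
dat.read <- read.table(f, header = TRUE, sep = "\t")
nrow(dat.read)  # 3
```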

R find maxima of multiple variables from multiple .CSV files

I have multiple csv's, each containing multiple observations for one participant on several variables. Let's say each csv file looks something like the below, and the name of the file indicates the participant's ID:
data.frame(
happy = sample(1:20, 10),
sad = sample(1:20, 10),
angry = sample(1:20, 10)
)
I found some code in an excellent stackoverflow answer that allows me to access all files saved into a specific folder, calculate the sums of these emotions, and output them into a file:
# access all csv files in the working directory
fileNames <- Sys.glob("*.csv")

for (fileName in fileNames) {
  # read original data:
  sample <- read.csv(fileName, header = TRUE, sep = ",")
  # create new data based on contents of original file:
  data.summary <- data.frame(
    File = fileName,
    happy.sum = sum(sample$happy),
    sad.sum = sum(sample$sad),
    angry.sum = sum(sample$angry))
  # write new data to separate file:
  write.table(data.summary, "sample-allSamples.csv",
              append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
However, I can ONLY get "sum" to work in this function. I would like to not only find the sums of each emotion for each participant, but also the maximum value of each.
When I try to modify the above:
for (fileName in fileNames) {
  # read original data:
  sample <- read.csv(fileName, header = TRUE, sep = ",")
  # create new data based on contents of original file:
  data.summary <- data.frame(
    File = fileName,
    happy.sum = sum(sample$happy),
    happy.max = max(sample$happy),
    sad.sum = sum(sample$sad),
    angry.sum = sum(sample$angry))
  # write new data to separate file:
  write.table(data.summary, "sample-allSamples.csv",
              append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
I get the following warning message:
In max(sample$happy) : no non-missing arguments to max; returning -Inf
Would sincerely appreciate any advice anyone can give me!
Using your test data, the max() statement works fine for me. Could it be a discrepancy between the sample code you posted and your actual csv file structure?
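That -Inf warning usually means max() received an empty vector, which is what happens when the column name doesn't match the file, because sample$happy is then NULL. Notably, sum() hides the same mistake by silently returning 0, which would explain why only the sums "work". A sketch of the mismatch (the capitalisation difference is a made-up example):

```r
d <- data.frame(Happy = 1:3)   # note the capital H

d$happy         # NULL -- there is no column named "happy"
sum(d$happy)    # 0    -- sum() silently returns 0 on empty input
# max(d$happy)  # would warn "no non-missing arguments to max" and return -Inf

max(d$Happy)    # 3 -- works once the name matches the file's header
```

A check like "happy" %in% names(sample) before summarising makes this kind of mismatch visible.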
