Read tab seperated data set with errors - r

I have a little problem with some datasets which are containing tab seperated data, but unfortunately there are some errors in the raw data, causing problems while reading into R.
A small example for better understanding, the dataset looks like this:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
The 7 8 9 part should be in one row, but is wrongly seperated into two (in the raw data). Is there any chance to correct this while reading in and not by manually changing this? Because the dataset is around 4m observations large, a manual correction would take a lot of time...

Try this example:
# read the file line by line:
x <- readLines("data.txt")
# Split by " " (or in your case "\t"), and convert to dataframe with 3 columns:
res <- data.frame(matrix(unlist(strsplit(x[-1], " "), recursive = TRUE),
ncol = 3, byrow = TRUE))
# Add column names to dataframe:
colnames(res) <- unlist(strsplit(x[1], " "))
res
# Col1 Col2 Col3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
# 4 10 11 12
Example data.txt file:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
Note: Just noticed your real data is 4 million rows, maybe this is not the most efficient way.

My solution is more complicated than the solution by user zx8754 but here it goes.
readWrong <- function(file, skip = 1){
txt <- readLines(file)
header <- txt[seq_len(skip)]
header <- scan(what = character(), textConnection(header))
txt <- txt[-seq_len(skip)]
data <- scan(textConnection(txt))
data <- matrix(data, ncol = length(header), byrow = TRUE)
data <- as.data.frame(data)
names(data) <- header
data
}
readWrong("data.txt")
# Col1 Col2 Col3
#1 1 2 3
#2 4 5 6
#3 7 8 9
#4 10 11 12

Related

How to create a parameter using $ to select from data frame

I have three different data frames that are similar in their columns such:
df1 df2 df3
Class 1 2 3 Class 1 2 3 Class 1 2 3
A 5 3 2 A 7 3 10 A 5 4 1
B 9 1 4 B 2 6 2 A 2 6 2
C 7 9 8 C 4 7 1 A 12 3 8
I would like to iterate through the three files and select the data from the columns with similar name. In other words, I want to iterate three times and everytime select data of column 1, then column 2, and then column 3 and merge them in one data frame.
To do that, I did the following:
df1 <- read.csv(R1)
df2 <- read.csv(R2)
df3 <- read.csv(R3)
df <- data.frame(Class=character(), B1_1=integer(), B1_2=integer(), B1_3=integer(), stringsAsFactors=FALSE)
for(i in 1:3){
nam <- paste("X", i, sep = "") #here I want to call the column name such as X1, X2, and X3
df[seq_along(df1[nam]), ]$B1_1 <- df1[nam]
df[seq_along(df2[nam]), ]$B1_2 <- df2[nam]
df[seq_along(df3[nam]), ]$B1_3 <- df3[nam]
df$Class <- df1$Class
}
In this line df[seq_along(df1[nam]), ]$B1_1 <- df1[nam], I followed the solution from this but this produces the following error:
Error in `$<-.data.frame`(`*tmp*`, "B1_1", value = list(X1 = c(5L, 7L, :
replacement has 10 rows, data has 1
Do you have any idea how to solve it?

Merge in loop R

I am using a for loop to merge multiple files with another file:
files <- list.files("path", pattern=".TXT", ignore.case=T)
for(i in 1:length(files))
{
data <- fread(files[i], header=T)
# Merge
mydata <- merge(mydata, data, by="ID", all.x=TRUE)
rm(data)
}
"mydata" looks as follows (simplified):
ID x1 x2
1 2 8
2 5 5
3 4 4
4 6 5
5 5 8
"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):
ID x3
1 8
2 4
ID x3
3 4
4 5
5 1
When I run my code I get the following dataset:
ID x1 x2 x3.x x3.y
1 2 8 8 NA
2 5 5 4 NA
3 4 4 NA 4
4 6 5 NA 5
5 5 8 NA 1
What I would like to get is:
ID x1 x2 x3
1 2 8 8
2 5 5 4
3 4 4 4
4 6 5 5
5 5 8 1
ID's are unique (never duplicates over the 600 files).
Any idea on how to achieve this as efficiently as possible much appreciated.
It's better suited as comment, But I can't comment yet.
Would it not be better to rbind instead of merge?
This seems to be what you want to acomplish.
Set fill argument TRUE to take care of different column numbers:
asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)
x1 x2
1: 1 4
2: 2 5
3: NA 5
Do this with data and then merge into mydata by ID.
Update for comment
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
So, this creates a function to read files and pushes it to lapply, so you will get a list containing all your data files, each on its own dataframe.
With ldply from plyr rbind all dataframes into one dataframe.
Don't touch mydata yet.
binded.data <- data.table(binded.data, key = ID)
Depending on your mydata you will perform different merge commands.
See:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
Update 2
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
Update 3
You can add cat to see the file the function is reading right now. So you can see after which file you are running out of memory. Which will point you to the direction on how many files you can read in one go.
ff <- function(input){
# This will print name of the file it is reading now
cat(input, "\n")
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}

Converting row names to data frame column

I want to be able to access b0.e7, c0.14,...,f8.d4. But right now these are not in a column, but are the "row names". How can I have the row names be 1,2,3,4,5,6,7 and b0.e7, c0.14,...,f8.d4 to be it's own column. Thanks for the help in advance.
df=as.data.frame(c)
df = subset(df, c>7)
df
c
b0.e7 11
c0.14 8
f8.d1 10
f8.d2 9
f8.d3 11
f8.d4 12
Try this. The first line assigns a new column that is just the current row names of the data frame. The second line resets the row names to NULL, resulting in a sequence.
> df$new <- rownames(df)
> rownames(df) <- NULL
Which should result in
> df
# c new
# 1 11 b0.e7
# 2 8 c0.14
# 3 10 f8.d1
# 4 9 f8.d2
# 5 11 f8.d3
# 6 12 f8.d4
And you can reverse the column order if needed with df[, c(2, 1)]
You can make use of the fact that cbind.data.frame can make use of arguments from data.frame, one of which is row.names. That argument can be set to NULL, meaning that a slightly more direct approach than proposed by Richard is:
cbind(rn = rownames(mydf), mydf, row.names = NULL)
# rn c
# 1 b0.e7 11
# 2 c0.14 8
# 3 f8.d1 10
# 4 f8.d2 9
# 5 f8.d3 11
# 6 f8.d4 12
You can try this as well.
rows = row.names(df)
df1 = cbind(rows,df)

Putting output in R into excel

Guys I have a code that generates 2 columns of data (e.g Number, Median) which refers to a particular person...but I have taken samples of 7 people
so basically I get this output:
[[1]
Number Median
1 5
2 3
.....
[[2]]
Number Median
1 6
2 4
....
[[3]]
Number Median
1 3
2 5
So I basically get this output....up til [[7]]
I tried transferring this output in excel using this code
write.csv(cbind(data),"data1.csv")
and I get this type of output:
list(c(Median =.......It lists all the median on the rows
But I want it to save the data referring to the 'median' and 'Number' in columns NOT ROWS
If I just type
write.csv(data,"data1.csv")
I get an error
arguments imply differing number of rows: 157, 179, 178, 180
As Marius said, you have a list of data.frames which can't be written to a .csv file. You need to do:
NewDataFrame <- do.call("rbind", YourList)
write.csv(NewDataFrame, "Data.csv")
do.call takes each of the elements from a list and applies whatever function you tell it (in this case rbind) to all of them.
Here are two options. Both use the following sample data:
myList <- list(data.frame(matrix(1:4, ncol = 2)),
data.frame(matrix(3:10, ncol = 2)),
data.frame(matrix(11:14, ncol =2)))
myList
# [[1]]
# X1 X2
# 1 1 3
# 2 2 4
#
# [[2]]
# X1 X2
# 1 3 7
# 2 4 8
# 3 5 9
# 4 6 10
#
# [[3]]
# X1 X2
# 1 11 13
# 2 12 14
Option 1: Write a csv file where the data.frames are presented as they are in the list
sink("list_of_dataframes.csv", type="output")
invisible(lapply(myList, function(x) dput(write.csv(x))))
sink()
If you open the resulting "list_of_dataframes.csv" file in a text editor, you will get something that looks like this. When you read this into a spreadsheet program, the first column will include the rownames and NULL separating each data.frame:
"","X1","X2"
"1",1,3
"2",2,4
NULL
"","X1","X2"
"1",3,7
"2",4,8
"3",5,9
"4",6,10
NULL
"","X1","X2"
"1",11,13
"2",12,14
NULL
Option 2: Write or search around for a version of cbind that accommodates binding data.frames with differing number of rows.
Here is one such function that I've written.
cbind2 <- function(datalist) {
nrows <- max(sapply(datalist, nrow))
expandmyrows <- function(mydata, rowsneeded) {
temp1 = names(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(data.frame(
matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
rbind(mydata, temp2)
}
do.call(cbind, lapply(datalist, expandmyrows, rowsneeded = nrows))
}
And here is that function applied to your list:
cbind2(myList)
# X1 X2 X1 X2 X1 X2
# 1 1 3 3 7 11 13
# 2 2 4 4 8 12 14
# 3 NA NA 5 9 NA NA
# 4 NA NA 6 10 NA NA
That output should be easy for you to use with write.csv and related functions.

data frame in user defined function in R

I'm trying to make a function that takes two arguments. One argument is the name of a data frame, and the second is the name of a column in that data frame. The goal is for the function to manipulate data in the whole frame based on information contained in the specified column.
My problem is that I can't figure out how to use the character expression entered into the second argument to access that particular column in the data frame within the function. Here's a super brief example,
datFunc <- function(dataFrame = NULL, charExpres = NULL) {
return(dataFrame$charExpress)
}
If, for instance you enter
datFunc(myData, "variable1")
this does not return myData$variable1. there HAS to be a simple way to do this. Sorry if the question is stupid, but i'd appreciate a little help here.
A related question would be, how do i use the character string "myData$variable1" to actually return variable1 from myData?
I think OP wants to pass name of dataframe as string too. If that is the case your function should be something like. (borrowed sample from other answer)
fooFunc <- function( dfNameStr, colNamestr, drop=TRUE) {
df <- get(dfNameStr)
return(df[,colNamestr, drop=drop])
}
> myData <- data.frame(ID=1:10, variable1=rnorm(10, 10, 1))
> myData
ID variable1
1 1 10.838590
2 2 9.596791
3 3 10.158037
4 4 9.816136
5 5 10.388900
6 6 10.873294
7 7 9.178112
8 8 10.828505
9 9 9.113271
10 10 10.345151
> fooFunc('myData', 'ID', drop=F)
ID
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> fooFunc('myData', 'ID', drop=T)
[1] 1 2 3 4 5 6 7 8 9 10
You're almost there, try using [ instead of $ for this kind of indexing
datFunc <- function(dataFrame = NULL, charExpres = NULL, drop=TRUE) {
return(dataFrame[, charExpres, drop=drop])
}
# An example
set.seed(1)
myData <- data.frame(ID=1:10, variable1=rnorm(10, 10, 1)) # DataFrame
datFunc(myData, "variable1") # dropping dimensions
[1] 9.373546 10.183643 9.164371 11.595281 10.329508 9.179532 10.487429 10.738325 10.575781 9.694612
datFunc(myData, "variable1", drop=FALSE) # keeping dimensions
variable1
1 9.373546
2 10.183643
3 9.164371
4 11.595281
5 10.329508
6 9.179532
7 10.487429
8 10.738325
9 10.575781
10 9.694612
Alternatively, you can find the column index of the dataframe:
df <- as.data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- sample(LETTERS, 10)
column.index.of.A <- grep("^A$", colnames(df))
df[, column.index.of.A]

Resources