How to split an ASCII file by integers/digits? (R)

If I have an ASCII text file that reads like this:
12345
and I want to separate it by integers so that it becomes
v1 v2 v3 v4 v5
1 2 3 4 5
In other words, each integer is a variable.
I know I can use read.fwf in R, but since I have nearly 500 variables in my dataset, is there a better way to split the integers into their own columns than specifying widths=c(1,...) and repeating the "1," 500 times?
I also tried importing the ASCII file into Excel and SPSS, but neither lets me place variable breaks at fixed intervals.

You could determine the width of the file by reading in the first row as-is, then use that width for read_fwf(). Using tidyverse functions:
library(readr)
library(stringr)
path <- "path_to_data.txt" # your path
# read just the first line to measure the row width
pass <- read_lines(path, n_max = 1) # one row, unparsed, so leading zeros survive
filewidth <- str_length(pass) # width of the first row
# use fwf with specified number of columns
df <- read_fwf(path, fwf_widths(rep(1, filewidth)))

Here's an option using read.fwf(), which was your initial choice.
# for the example only, a two line source with different line lengths
input <- textConnection("12345\n6789")
df1 <- read.fwf(input, widths = rep(1, 500))
ncol(df1)
# [1] 500
But suppose you actually have fewer than 500 columns (as you say, and as is the case in this example); then the extra columns, whose values are all NA, can be removed as follows. Using any() keeps every column with a value in at least one row, so your longest line determines the number of columns that are retained.
df1 <- df1[, apply(!is.na(df1), 2, any)]
df1
# V1 V2 V3 V4 V5
# 1 1 2 3 4 5
# 2 6 7 8 9 NA
However, if no missing values are acceptable, then use all() instead, so that your shortest line determines the number of columns that are retained.
df1 <- df1[, apply(!is.na(df1), 2, all)]
df1
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 6 7 8 9
Of course, if you know the exact line length and all lines are the same length, then just set widths = rep(1, x) with x set to the known length.
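For instance, a minimal call for the five-digit example above (the file path is a placeholder):
df1 <- read.fwf("path_to_data.txt", widths = rep(1, 5)) # five one-character columns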

If you are using Excel 2010 or later, you can import the file using Power Query (also known as Get & Transform). When you edit the imported data, there is a Split Column option that lets you split by a fixed number of characters.
This tool is built into Excel 2016 and later, and is available as a free Microsoft add-in for Excel 2010 and 2013.

Related

Rename columns in R

I am trying to rename columns, but I do not know in advance whether a given column will be present in the dataset. I have a large dataset, and if a certain column name is present I want to rename it. For example:
A B C D E
1 4 5 9 2
3 5 6 9 1
4 4 4 9 1
newNames <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E)
This works to rename what is in the dataset, but I am looking for the flexibility to add more potential name changes without an error occurring.
newNames2 <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E, `6` = F, `7` = G)
This will not work; it gives me an error because F and G are not in the dataset.
Is there any way to write a code to ignore the column change if the name does not exist?
Thanks!
There are plenty of ways to do this. One would be to create a named vector holding the old names as values and their corresponding new names as the vector's names, and use that, i.e.
# the vector v1 uses LETTERS as the old names (values) and 1:7 as the new ones (names)
v1 <- setNames(LETTERS[1:7], 1:7)
# look up each existing column in v1, so the column order doesn't matter
# (this assumes every column of df appears somewhere in v1)
names(df) <- names(v1)[match(names(df), v1)]
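If you prefer to stay in dplyr, here is a minimal sketch using rename() with any_of() (available in recent dplyr versions), which silently skips old names that are absent:
library(dplyr)
# named vector: new_name = old_name; F and G are simply ignored when missing
lookup <- c(`1` = "A", `2` = "B", `3` = "C", `4` = "D", `5` = "E", `6` = "F", `7` = "G")
newNames <- data %>% rename(any_of(lookup))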

Rounding decimals of an Index

I created a "trust in politics" index by aggregating 5 variables, all of which measured some form of trust on a scale from 1 to 10.
attach(ess_variablen)
aggr_trst <- (1/5)*(trst_prl+trst_leg+trst_part+trst_politic+trst_polit)
However, the results contain one decimal place, whereas I would like to round the numbers so that the index values have no decimals.
I have not been able to find a way to round numerical values created this way. Does anyone know how to achieve that? Thank you!
The round() function can be used to round to the nearest whole number when one uses 0 as the number of decimal places. The following example illustrates this by summing rows in a data frame of uniform random numbers.
set.seed(950141237)
data <- as.data.frame(matrix(runif(20),nrow=5,ncol=4))
data
data$index <- round(rowSums(data),0) # round to 0 decimal places to obtain whole numbers
data
...and the output.
> set.seed(950141237)
> data <- as.data.frame(matrix(runif(20),nrow=5,ncol=4))
> data
V1 V2 V3 V4
1 0.07515484 0.8874008 0.37130877 0.05977506
2 0.30097513 0.8178419 0.05203982 0.10694951
3 0.82328607 0.4182799 0.24034152 0.52173278
4 0.52849399 0.8690592 0.66814229 0.66475498
5 0.01914658 0.8322007 0.41399458 0.19649338
> data$index <- round(rowSums(data),0) # round to 0 decimal places to obtain whole numbers
> data
V1 V2 V3 V4 index
1 0.07515484 0.8874008 0.37130877 0.05977506 1
2 0.30097513 0.8178419 0.05203982 0.10694951 1
3 0.82328607 0.4182799 0.24034152 0.52173278 2
4 0.52849399 0.8690592 0.66814229 0.66475498 3
5 0.01914658 0.8322007 0.41399458 0.19649338 1
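Applied directly to the index from the question, a minimal sketch (assuming the ess_variablen columns are named as in the post):
aggr_trst <- round((trst_prl + trst_leg + trst_part + trst_politic + trst_polit) / 5, 0)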
Edit: it seems the previous solution did not work, sorry for the inconvenience.
You can try the ceiling() and floor() functions, depending on your needs:
notRound <- (1/5) * (1.2+2.3+3.3)
print(notRound)
roundUp <- ceiling(notRound)
print(roundUp)
roundDown <- floor(notRound)
print(roundDown)
A live snippet, if you want to try it:
http://rextester.com/WURVZ7294

Replace semicolon-separated values to tab

I am trying to convert the data that I have in a txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I have tried
mydata <- read.table("1.txt", header = FALSE)
separate_data <- strsplit(as.character(mydata), ";")
But it does not work. separate_data in this case consists of only one element:
[[1]]
[1] "1"
The OP does not directly state whether the raw data file contains multiple observations of a single variable or should be broken into n-tuples. Since the OP does state that read.table() results in a single row where they expect multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in the comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
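For that single-variable case, a minimal sketch (the comment itself isn't quoted above, but presumably something along these lines):
value <- scan("1.txt", sep = ";") # read every value into one numeric vector
df <- data.frame(value) # one value per row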
Here is an approach using scan() that reads the file into a vector and then breaks it into observations containing 5 variables each.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData),sep=";")
columns <- 5 # set desired # of columns
observations <- length(value) / columns # number of complete observations
observation <- rep(1:observations, each = columns) # observation index for each value
variable <- rep(1:columns, times = observations) # variable index within each observation
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
observation variable value
1 1 1 4.094573
2 1 2 4.079999
3 1 3 4.068667
4 1 4 4.059601
5 1 5 4.052183
6 2 1 4.094573
7 2 2 4.079999
8 2 3 4.068667
9 2 4 4.059601
10 2 5 4.052183
>
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
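For example, a minimal dcast() call (assuming the long data frame above has been saved as df):
library(reshape2)
df <- data.frame(observation, variable, value)
wide <- dcast(df, observation ~ variable, value.var = "value") # one row per observation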

R - remove rows from a data frame with empty lines (not only numbers)

This issue seems to have been treated before, but after checking I couldn't find any solution. I load a table from a file, and it can happen (I don't know how) that some entire lines are empty. So the data frame I get looks like this:
# id c1 c2
# 1 a 1 2
# 2 b 2 4
# 3 NA NA
# 4 d 6 1
# 5 e 7 5
# 6 NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, since the first column is not a number (the real table is much bigger, with mixed character and numeric columns), and I can't filter these lines out. I also cannot sort it out with na.omit or complete.cases.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv():
For instance, if the blank fields contain nothing or a single space, you could use
df <- read.csv(<your other logic here>, na.strings = c("NA", "", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
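A minimal sketch of that last step, assuming the empty cells were read in as NA:
# keep only the rows that have at least one non-NA value
df <- df[!apply(df, 1, function(x) all(is.na(x))), ]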

Read multidimensional group data in R

I have done a lot of googling, but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data contains 3 groups: A, B, C), while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and compute the column means:
dt <- read.table(file_name, header = TRUE) # gives warnings
apply(dt, 2, mean) # gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the column-wise mean for each group. Any help?
apply(dt, 2, mean) doesn't work because apply() coerces its first argument to an array via as.matrix() (as stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements of the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean) # drop the Tag column first; works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
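Another base-R option, a minimal sketch using aggregate() with the formula interface, gives the same per-group column means:
grpMeans3 <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)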
