I will use two files as an example to explain my question.
I have multiple text files like below:
### First file
GEORGIA file name first row not use
Col1 Col2
A 2
A 4
A 5
B 2
B 6
### Second file
New York file name first row not use
Col1 Col2
C 2
C 4
D 5
E 2
F 6
I use data.table to import text file and then extract information I want.
library(data.table)
my_read_data <- function(x){ data <- data.table::fread(x, header = T, strip.white = T, fill = T, skip = 1) }
file.list <- dir(path = "C:/Users/filesnames/", pattern='\\.txt', full.names = T)
dt.list <- sapply(file.list, my_read_data, simplify=FALSE)
cd <- rbindlist(dt.list, idcol = 'id')[, FileNo := substr(id, 24, 25)]
And the result is in the following:
Col1 Col2 FileNo
A 2 1
A 4 1
A 5 1
B 2 1
B 6 1
C 2 2
C 4 2
D 5 2
E 2 2
F 6 2
However, what I actually want is:
Col1 Col2 FileNo Name
A 2 1 GEORGIA
A 4 1 GEORGIA
A 5 1 GEORGIA
B 2 1 GEORGIA
B 6 1 GEORGIA
C 2 2 New York
C 4 2 New York
D 5 2 New York
E 2 2 New York
F 6 2 New York
Because I skip the first row, I cannot extract the name from it (an approach I found here).
But if I do not skip the first row, the data is imported incorrectly.
The text files look like this:
### First file
GEORGIA file name first row not use
Col1,Col2
A,2
A,4
A,5
B,2
B,6
### Second file
New York file name first row not use
Col1,Col2
C,2
C,4
D,5
E,2
F,6
We can read the first line separately and use it to create a column:
library(data.table)
rbindlist(lapply(setNames(file.list, file.list), function(x) {
  dat <- fread(x, header = TRUE, strip.white = TRUE, fill = TRUE, skip = 1)
  v1 <- readLines(x, n = 1)
  dat[, Name := sub("\\s+file name.*", "", v1)]
}), idcol = 'id')
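For readers without data.table, the same idea can be sketched in base R. This is only an illustration under assumptions: the temporary directory, the file names file1.txt and file2.txt, and the helper read_with_name are made up for the example, and FileNo is taken from the position of the file in the list rather than from the path.

```r
# Create two small sample files in a temporary directory (hypothetical names).
dir_path <- tempfile("states_")
dir.create(dir_path)
writeLines(c("GEORGIA file name first row not use", "Col1,Col2", "A,2", "A,4", "B,6"),
           file.path(dir_path, "file1.txt"))
writeLines(c("New York file name first row not use", "Col1,Col2", "C,2", "D,5"),
           file.path(dir_path, "file2.txt"))

file.list <- dir(dir_path, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: read the title line on its own, then the table below it.
read_with_name <- function(path, file_no) {
  first_line <- readLines(path, n = 1)
  dat <- read.csv(path, skip = 1, strip.white = TRUE, stringsAsFactors = FALSE)
  dat$FileNo <- file_no
  # Keep only the text before " file name ..."
  dat$Name <- sub("\\s+file name.*", "", first_line)
  dat
}

combined <- do.call(rbind, Map(read_with_name, file.list, seq_along(file.list)))
rownames(combined) <- NULL
combined
```

The title line is read twice per file (once by readLines, once skipped by read.csv), which is cheap because readLines stops after one line.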
I have a script written in R that is run weekly and produces a csv. I need to add headers above some of the column names, as the columns are grouped together.
Header1 Header2
A B C D E F
1 2 3 4 5 6
7 8 9 a b c
In this example ABC columns are under the "Header1" header, and DEF are under the "Header2" header. Obviously this can be done manually but I was curious if there was a package that can do this. "No" is an acceptable answer.
EDIT: I should have added that the file can also be an xlsx. Initially I write most of my files as CSVs since they usually get used by a script again at some point.
It is a bit ugly, but you can do this on a csv as long as you do not require any merging of cells. I used data.table in my example, but I am pretty sure you can use any other writing function as long as you write the headers with append = FALSE and col.names = FALSE, and then the data with both set to TRUE. Reading it back gets a bit ugly, but you can skip the first row.
dt <- fread("A B C D E F
1 2 3 4 5 6
7 8 9 a b c")
fwrite(data.table(t(c("Header1", NA, NA, "Header2", NA, NA))), "test.csv", append = FALSE, col.names = FALSE)
fwrite(dt, "test.csv", append = TRUE, col.names = TRUE)
fread("test.csv")
# V1 V2 V3 V4 V5 V6
# 1: Header1 Header2
# 2: A B C D E F
# 3: 1 2 3 4 5 6
# 4: 7 8 9 a b c
fread("test.csv", skip = 1L)
# A B C D E F
# 1: 1 2 3 4 5 6
# 2: 7 8 9 a b c
If you happen to want your header information back, you can do something like this: read the first line, find the positions of the headers, and extract the headers themselves.
headers <- strsplit(readLines("test.csv", n = 1L), ",")[[1]]
which(headers != "")
# [1] 1 4
headers[which(headers != "")]
# [1] "Header1" "Header2"
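The same two-step write can also be sketched in base R by keeping one connection open, so nothing has to be appended. This is an assumption-laden illustration rather than the approach above: the data frame df, the temporary file out, and the group_row vector are all made up for the example.

```r
# Sample data with all columns as text, mirroring the A..F example.
df <- data.frame(A = c("1", "7"), B = c("2", "8"), C = c("3", "9"),
                 D = c("4", "a"), E = c("5", "b"), F = c("6", "c"),
                 stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")

# One entry per column; empty strings leave the cell blank under a group.
group_row <- c("Header1", "", "", "Header2", "", "")

con <- file(out, open = "w")
writeLines(paste(group_row, collapse = ","), con)   # grouped header row
write.table(df, con, sep = ",", quote = FALSE,      # column names + data
            row.names = FALSE, col.names = TRUE)
close(con)

readLines(out)

# Skip the grouped header row when reading the data back.
back <- read.csv(out, skip = 1, stringsAsFactors = FALSE)
```

Writing through a single open connection avoids the warning that write.table gives when asked to append column names to an existing file.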
I have a list of dataframes, all of which contain a user column and another column called 'VD'. I want to add a new column 'VD_z' to every dataframe in the list, holding the scaled values of the VD column.
df1 <- data.frame(VD = 1:3, user=letters[1:3])
df2 <- data.frame(VD = 4:6, user=letters[4:6])
filelist <- list(df1,df2)
I read several similar questions, finally trying:
filelist <- mapply(cbind(filelist, VD_z= lapply(filelist, function(df) scale(df$VD))))
What I expect is that all dataframes in the list now have the new VD_z column with the scaled values, like this:
df1 <- data.frame(VD = 1:3, user=letters[1:3], VD_z=c(-1,0,1))
df2 <- data.frame(VD = 4:6, user=letters[4:6], VD_z=c(-1,0,1))
What I get is an Error message 'Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'
Thanks for your help!
We can use map from purrr to loop through the list and mutate to create the 'VD_z'
library(tidyverse)
filelist %>%
  map( ~ .x %>%
    mutate(VD_z = scale(VD)))
or using base R with lapply/transform
filelist1 <- lapply(filelist, transform, VD_z = scale(VD))
filelist1
#[[1]]
# VD user VD_z
#1 1 a -1
#2 2 b 0
#3 3 c 1
#[[2]]
# VD user VD_z
#1 4 d -1
#2 5 e 0
#3 6 f 1
If we use the logic from the OP's post, assign the output of scale to a new column 'VD_z' and then return 'df'
filelist1 <- lapply(filelist, function(df) {df$VD_z <- scale(df$VD); df})
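One detail worth knowing: scale() returns a one-column matrix (with centering and scaling attributes), so the new column is stored as a matrix inside the data frame. If a plain numeric vector is preferred, a small variation is to wrap the call in as.numeric(). The sketch below reuses the question's sample data; the name scaled is just for illustration.

```r
df1 <- data.frame(VD = 1:3, user = letters[1:3])
df2 <- data.frame(VD = 4:6, user = letters[4:6])
filelist <- list(df1, df2)

# scale() gives a matrix, not a plain vector
is.matrix(scale(df1$VD))

# Drop the matrix structure so VD_z is an ordinary numeric column
scaled <- lapply(filelist, function(df) {
  df$VD_z <- as.numeric(scale(df$VD))
  df
})

str(scaled[[1]])
```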
A data.table approach can be,
library(data.table)
dd <- rbindlist(filelist, idcol = 'id')[, VD_z := scale(VD), by = id]
# id VD user VD_z
#1: 1 1 a -1
#2: 1 2 b 0
#3: 1 3 c 1
#4: 2 4 d -1
#5: 2 5 e 0
#6: 2 6 f 1
You can then use split() to split the data frame to a list, i.e.
split(dd, dd$id)
which gives,
$`1`
id VD user VD_z
1: 1 1 a -1
2: 1 2 b 0
3: 1 3 c 1
$`2`
id VD user VD_z
1: 2 4 d -1
2: 2 5 e 0
3: 2 6 f 1
I have a small problem with some datasets that contain tab-separated data: unfortunately there are some errors in the raw data, which cause problems when reading it into R.
A small example for better understanding, the dataset looks like this:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
The 7 8 9 part should be in one row, but it is wrongly separated into two (in the raw data). Is there any way to correct this while reading the file in, rather than by changing it manually? Because the dataset has around 4 million observations, a manual correction would take a lot of time...
Try this example:
# read the file line by line:
x <- readLines("data.txt")
# Split by " " (or in your case "\t"), and convert to dataframe with 3 columns:
res <- data.frame(matrix(unlist(strsplit(x[-1], " "), recursive = TRUE),
ncol = 3, byrow = TRUE))
# Add column names to dataframe:
colnames(res) <- unlist(strsplit(x[1], " "))
res
# Col1 Col2 Col3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
# 4 10 11 12
Example data.txt file:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
Note: Just noticed your real data is 4 million rows, maybe this is not the most efficient way.
My solution is more complicated than the solution by user zx8754 but here it goes.
readWrong <- function(file, skip = 1){
  txt <- readLines(file)
  header <- txt[seq_len(skip)]
  header <- scan(what = character(), textConnection(header))
  txt <- txt[-seq_len(skip)]
  data <- scan(textConnection(txt))
  data <- matrix(data, ncol = length(header), byrow = TRUE)
  data <- as.data.frame(data)
  names(data) <- header
  data
}
readWrong("data.txt")
# Col1 Col2 Col3
#1 1 2 3
#2 4 5 6
#3 7 8 9
#4 10 11 12
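Both answers rely on the same observation: if the tokens are read as one long stream, the wrong line breaks stop mattering. A variation that may suit a large file is to let scan() read the tokens straight from the file instead of going through readLines() and textConnection(). This is a sketch under assumptions: no field is empty or contains whitespace, the total token count is a multiple of the column count, and the temporary file stands in for the real data.

```r
# Recreate the broken sample file (tab-separated, with "7" split off).
path <- tempfile(fileext = ".txt")
writeLines(c("Col1\tCol2\tCol3", "1\t2\t3", "4\t5\t6", "7", "8\t9", "10\t11\t12"), path)

# Header tokens from the first line only
header <- scan(path, what = "", nlines = 1, quiet = TRUE)

# All remaining tokens, ignoring where the line breaks fall
tokens <- scan(path, what = "", skip = 1, quiet = TRUE)

# Guard: the stream must fill complete rows
stopifnot(length(tokens) %% length(header) == 0)

fixed <- as.data.frame(matrix(tokens, ncol = length(header), byrow = TRUE),
                       stringsAsFactors = FALSE)
names(fixed) <- header

# Tokens were read as text; convert each column to its natural type
fixed[] <- lapply(fixed, type.convert, as.is = TRUE)
fixed
```

Reading tokens as character and converting afterwards keeps the approach usable when some columns are not numeric.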
I have a text file in which data is stored as given below
{{2,3,4},{1,3},{4},{1,2} .....}
I want to remove the brackets and convert it to two column format where first column is bracket number and followed by the term
1 2
1 3
1 4
2 1
2 3
3 4
4 1
4 2
So far I have read the file:
tab <- read.table("test.txt",header=FALSE,sep="}")
This gives a dataframe
V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2 .....
How to proceed ?
We read it with readLines and then remove the {} with strsplit and convert it to two column dataframe with index and reshape to 'long' format with separate_rows
library(tidyverse)
v1 <- setdiff(unlist(strsplit(lines, "[{}]")), c("", ","))
tibble(index = seq_along(v1), Col = v1) %>%
separate_rows(Col, convert = TRUE)
# A tibble: 8 x 2
# index Col
# <int> <int>
#1 1 2
#2 1 3
#3 1 4
#4 2 1
#5 2 3
#6 3 4
#7 4 1
#8 4 2
Or a base R method would be to replace the , after each } with another delimiter, split by , into a list, and stack it into a two-column data.frame
v1 <- scan(text=gsub("[{}]", "", gsub("},", ";", lines)), what = "", sep=";", quiet = TRUE)
stack(setNames(lapply(strsplit(v1, ","), as.integer), seq_along(v1)))[2:1]
data
lines <- readLines(textConnection("{{2,3,4},{1,3},{4},{1,2}}"))
#reading from file
lines <- readLines("yourfile.txt")
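Another base R variation, shown here only as a sketch, is to pull out each innermost {...} group with regmatches() and then expand the groups into index/term pairs. It assumes the groups are not nested any deeper and contain only comma-separated integers; the names groups, terms and result are made up for the example.

```r
lines <- "{{2,3,4},{1,3},{4},{1,2}}"

# Each match is one innermost group such as "{2,3,4}"
groups <- regmatches(lines, gregexpr("\\{[^{}]+\\}", lines, perl = TRUE))[[1]]

# Strip the braces and split each group on commas
terms <- strsplit(gsub("[{}]", "", groups), ",", fixed = TRUE)

# Repeat each group number once per term it contains
result <- data.frame(index = rep(seq_along(terms), lengths(terms)),
                     term  = as.integer(unlist(terms)))
result
```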
Data:
tab <- read.table(text=' V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2
2 {{2,3,4 {1,3 {4 {1,2 ')
Code: using gsub, remove { and split the string by ,, then make a data frame. The column names are removed. Finally, the list of dataframes in df1 is combined using rbindlist
df1 <- lapply( seq_along(tab), function(x) {
  temp <- data.frame( x, strsplit( gsub( "{", "", tab[[x]], fixed = TRUE ), split = "," ),
                      stringsAsFactors = FALSE)
  colnames(temp) <- NULL
  temp
} )
Output:
data.table::rbindlist(df1)
# V1 V2 V3
# 1: 1 2 2
# 2: 1 3 3
# 3: 1 4 4
# 4: 2 1 1
# 5: 2 3 3
# 6: 3 4 4
# 7: 4 1 1
# 8: 4 2 2
I have the following dataframes:
a <- c(1,1,1)
b<- c(10,8,2)
c<- c(2,2)
d<- c(3,5)
AB<- data.frame(a,b)
CD<- data.frame(c,d)
I would like to join AB and CD, where the first column of CD is equal to the second column of AB. Please note that my actual data will have a varying number of columns, with varying names, so I am really looking for a way to join based on position only. I have been trying this:
#Get the name of the last column in AB
> colnames(AB)[ncol(AB)]
[1] "b"
#Get the name of the first column in CD
> colnames(CD)[1]
[1] "c"
Then I attempt to join like this:
> abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]=colnames(CD)[1]))
Error: unexpected '=' in "abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]="
The behavior I am looking for is essentially this:
> abcd<- full_join(AB, CD, by = c("b" = "c"))
> abcd
a b d
1 1 10 NA
2 1 8 NA
3 1 2 3
4 1 2 5
We can use setNames to build the named vector for by
full_join(AB, CD, setNames(colnames(CD)[1], colnames(AB)[ncol(AB)]))
# a b d
#1 1 10 NA
#2 1 8 NA
#3 1 2 3
#4 1 2 5
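The original attempt fails because names inside c() have to be written literally; an expression such as colnames(AB)[ncol(AB)] cannot appear on the left of =. setNames() builds the same named vector from computed values. The sketch below shows that, and also a base R alternative using merge() with column positions; note that merge() puts the key column first and sorts by it, so the layout differs from the full_join output.

```r
a <- c(1, 1, 1)
b <- c(10, 8, 2)
c <- c(2, 2)
d <- c(3, 5)
AB <- data.frame(a, b)
CD <- data.frame(c, d)

# Named vector built from computed names: equivalent to c(b = "c")
by_vec <- setNames(colnames(CD)[1], colnames(AB)[ncol(AB)])
by_vec

# Base R: join purely by position (last column of AB, first column of CD)
abcd <- merge(AB, CD, by.x = ncol(AB), by.y = 1, all = TRUE)
abcd
```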
We can replace the target column names with a common name, such as "Target", and then do full_join. Finally, replace the "Target" name with the original column name.
library(dplyr)
AB_name <- names(AB)
target_name <- AB_name[ncol(AB)] # Store the original column name
AB_name[ncol(AB)] <- "Target" # Set a common name
names(AB) <- AB_name
CD_name <- names(CD)
CD_name[1] <- "Target" # Set a common name
names(CD) <- CD_name
abcd <- full_join(AB, CD, by = "Target") %>% # Merge based on the common name
rename(!!target_name := Target) # Replace the common name with the original name
abcd
# a b d
# 1 1 10 NA
# 2 1 8 NA
# 3 1 2 3
# 4 1 2 5