Unformatted Excel data import? - r

I'm trying to read an Excel file with over 30 tabs of data. The complication is that each tab actually has 2 tables in it. There is a table at the top of the sheet, then a few blank rows, then a second table below with completely different column titles.
I'm aware of the openxlsx and readxl packages, but they seem to assume that the Excel data is formatted into tidy tables.
If I can get the raw data into R (perhaps in a text matrix...), I'm confident I can do the dirty work of parsing it into data frames. Any advice? Many thanks.

You can use the XLConnect package to access an arbitrary region of an Excel worksheet and then extract a list of data frames. Please see below:
Simulation:
library(XLConnect)
# simulate xlsx-file
df1 <- data.frame(x = 1:10, y = 0:9)
df2 <- data.frame(x = 1:20, y = 0:19)
wb <- loadWorkbook("temp.xlsx", create = TRUE )
createSheet(wb, "sh1")
writeWorksheet(wb, df1, "sh1", startRow = 1)
writeWorksheet(wb, df2, "sh1", startRow = 15)
lapply(2:30, function(x) cloneSheet(wb, "sh1", paste0("sh", x)))
saveWorkbook(wb)
Extract Data
# read.data
wb <- loadWorkbook("temp.xlsx")
df1s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 1, endRow = 11))
df2s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 15, endRow = 35))
df1s[[1]]
df2s[[2]]
Output data.frame #1 from the first sheet and data.frame #2 from the second one:
> df1s[[1]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
> df2s[[2]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
16 16 15
17 17 16
18 18 17
19 19 18
20 20 19
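If the row ranges are not known in advance, another option is to read each sheet as raw text and split it wherever blank rows occur. The following is only a sketch in base R: split_on_blank_rows is a hypothetical helper, and the raw data frame is simulated here. In practice it might come from something like readxl::read_excel(path, sheet = i, col_names = FALSE, col_types = "text"), which is an assumption about how the sheet is read, not part of the answer above.

```r
# Split a raw sheet (no header, blank cells as NA or "") into separate
# tables wherever one or more fully blank rows occur.
split_on_blank_rows <- function(raw) {
  blank <- apply(raw, 1, function(r) all(is.na(r) | r == ""))
  # Consecutive non-blank rows share the same block id
  block <- cumsum(blank)[!blank]
  pieces <- split(raw[!blank, , drop = FALSE], block)
  lapply(unname(pieces), function(p) {
    # Drop columns that are empty within this block
    keep <- vapply(p, function(col) !all(is.na(col) | col == ""), logical(1))
    p <- p[, keep, drop = FALSE]
    # Promote the first row of the block to column names
    out <- p[-1, , drop = FALSE]
    names(out) <- as.character(unlist(p[1, ]))
    rownames(out) <- NULL
    out
  })
}

# Simulated raw sheet: a 2-column table, two blank rows, a 3-column table
raw <- data.frame(
  V1 = c("x", "1", "2", NA, NA, "name", "a", "b", "c"),
  V2 = c("y", "0", "1", NA, NA, "score", "10", "20", "30"),
  V3 = c(NA, NA, NA, NA, NA, "flag", "T", "F", "T"),
  stringsAsFactors = FALSE
)
tables <- split_on_blank_rows(raw)
tables[[1]]
tables[[2]]
```

All values stay as character in this sketch, so numeric columns would still need converting afterwards. Applying the helper with lapply over the sheet indices would give one list of tables per tab.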

Related

How to use a fulljoin on my dataframes and rename columns with the same name R

I have two dataframes and they both have the exact same column names, however the data in the columns is different in each dataframe. I am trying to join the two frames (as seen below) by a full join. However, the hard part for me is the fact that I have to rename the columns so that the columns corresponding to my one dataset have some text added to the end while adding different text to the end of the columns that correspond to the second data set.
combined_df <- full_join(any.drinking, binge.drinking, by = ?)
A look at one of my df's:
Without custom function and shorter:
df <- cbind(cars, cars)
colnames(df) <- c(paste0(colnames(cars), "_any"), paste0(colnames(cars), "_binge"))
Output:
> head(df)
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
Certainly not the most elegant way but maybe it is what you want:
custom_bind <- function(df1, suffix1, df2, suffix2){
  colnames(df1) <- paste(colnames(df1), suffix1, sep = "_")
  colnames(df2) <- paste(colnames(df2), suffix2, sep = "_")
  df <- cbind(df1, df2)
  return(df)
}
custom_bind(cars, "any", cars, "binge")
I made it as a function in case you want to do it with other tables. If not then it is not necessary.
Output:
> head(custom_bind(cars, "any", cars, "binge"))
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
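Both answers above bind the columns side by side, which only lines up if the rows are in the same order. If an actual full join on a key is wanted, base R's merge() can add the suffixes itself. This is a sketch under assumptions: the key column state and all values below are made up for illustration, since the question does not show its data.

```r
# Hypothetical data: same column names, different values, partly
# different keys
any.drinking <- data.frame(state = c("AL", "AK", "AZ"),
                           both_sexes = c(50.1, 61.2, 55.3),
                           females = c(45.0, 58.4, 50.2))
binge.drinking <- data.frame(state = c("AK", "AZ", "CA"),
                             both_sexes = c(20.5, 17.1, 16.8),
                             females = c(15.2, 12.9, 12.0))

# all = TRUE keeps unmatched rows from both sides (a full join);
# suffixes are appended to the non-key columns that share a name
combined_df <- merge(any.drinking, binge.drinking,
                     by = "state", all = TRUE,
                     suffixes = c("_any", "_binge"))
combined_df
```

With dplyr, full_join() takes a suffix argument that plays the same role, for example suffix = c("_any", "_binge").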

Moving down columns in data frames in R

Suppose I have the next data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? consider that more steps can be included (example, 20 steps).
Thanks!!
We can design a function to achieve this task. df_final is the final output. Note that bin is an argument that lets users specify how many columns to combine into each new column.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
  # Calculate the number of new columns
  new_ncol <- (ncol(df) - bin) + 1
  # Create a list to store all data frames
  df_list <- lapply(1:new_ncol, function(num){
    return(df[, num:(num + bin - 1)])
  })
  # Convert each data frame to a vector
  dt_list2 <- lapply(df_list, unlist)
  # Convert dt_list2 to data frame
  df_final <- as.data.frame(dt_list2)
  # Set the column and row names of df_final
  colnames(df_final) <- paste0("col", 1:new_ncol)
  rownames(df_final) <- 1:nrow(df_final)
  return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[, -ncol(df)] %>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the dataframe
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the work:
df2 <- data.frame(col1 = 1:(length(df$step1) + length(df$step2) + length(df$step3)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Things to note:
The important point in the first line is that the table must be created with the right number of rows (here 12, one per value of step1 to step3).
Assigning to a column that does not exist creates it under that name.
To delete a column in R, use df2$col <- NULL.
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[,-ncol(df)]),
                  col2 = unlist(df[,-1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
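The question mentions that there may be more steps, for example 20. As a quick check, here is a sketch of the same drop-the-last-column / drop-the-first-column idea on a simulated 20-step data frame. The data are generated only for illustration.

```r
# Simulated input: 4 rows, 20 columns named step1..step20
df <- as.data.frame(matrix(1:80, nrow = 4,
                           dimnames = list(NULL, paste0("step", 1:20))))

# col1 stacks step1..step19, col2 stacks step2..step20
df2 <- data.frame(col1 = unlist(df[, -ncol(df)], use.names = FALSE),
                  col2 = unlist(df[, -1], use.names = FALSE))
dim(df2)
head(df2)
```

Nothing in the code depends on the number of steps, so it works unchanged for any data frame with at least two columns.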

Convert a table without any col and row name created from a matrix to data frame

I create my matrix with this code:
tstp<-matrix(1:200, ncol = 4, byrow = TRUE)
Then I wrote this code to print it in the format I need:
write.table(tstp, row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
Here are the first four rows of the output:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
This is exactly the output I need, except that its class should be data frame. So I converted it to a data frame like this:
> timestp<-data.frame(tstp)
But the result has column names and row numbers, which I do not want, as shown below.
> timestp
X1 X2 X3 X4
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
4 13 14 15 16
It has the class that I need:
> class(timestp)
[1] "data.frame"
But I want output like the one below, with class data.frame:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
We can use as.data.frame with the optional argument set to TRUE, which keeps it from generating column names:
timestp <- as.data.frame(tstp, optional = TRUE)
colnames(timestp)
#NULL
You can do this:
rownames(timestp) <- NULL
colnames(timestp) <- NULL
or if you only want to exclude row names, then use this:
timestp<-data.frame(tstp, row.names = NULL)
However, when you print it will show the numbers (as indices, not names). Refer to "Removing display of R row names from data frame".
You have successfully removed the rownames. The print.data.frame method just shows the row numbers if no rownames are present.
If you want to exclude the row numbers while printing then this will help you:
print(timestp, row.names = FALSE)
This will be the output:
> print(head(timestp), row.names = FALSE)
# 1 2 3 4
# 5 6 7 8
# 9 10 11 12
# 13 14 15 16
# 17 18 19 20
# 21 22 23 24
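Another option is to keep the default names on the data frame and suppress them only when writing it out. A minimal sketch in base R, using a smaller matrix than the one in the question; capture.output is used here only to make the emitted lines inspectable.

```r
tstp <- matrix(1:16, ncol = 4, byrow = TRUE)
timestp <- as.data.frame(tstp)   # default column names are kept internally

# Collect the lines write.table emits when both kinds of names are
# suppressed
out <- capture.output(
  write.table(timestp, row.names = FALSE, col.names = FALSE,
              quote = FALSE, sep = "\t")
)
class(timestp)
out
```

The object stays a regular data frame, and the names simply never reach the output.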

Create multiple data frames from one based off values with a for loop

I have a large data frame that I would like to convert into smaller subset data frames using a for loop. I want the new data frames to be based on the values in a column of the large/parent data frame. Here is an example:
x<- 1:20
y <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","C","C","C")
df <- as.data.frame(cbind(x,y))
OK, now I want three data frames: one with columns x and y but only where y == "A", the second where y == "B", etc. So the end result will be 3 new data frames: df.A, df.B, and df.C. I realize that this would be easy to do without a for loop, but my actual data has a lot of levels of y, so using a for loop (or similar) would be nice.
Thanks!
If you want to create separate objects in a loop, you can use assign. I used unique because you said you had many levels.
for(i in unique(df$y)) {
  nam <- paste("df", i, sep = ".")
  assign(nam, df[df$y==i,])
}
> df.A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
> df.B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
I think you just need the split function:
split(df, df$y)
$A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
$B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
$C
x y
18 18 C
19 19 C
20 20 C
It is then just a matter of subsetting the output of split and storing the pieces in objects, for example dfA <- split(df, df$y)[[1]] and dfB <- split(df, df$y)[[2]], and so on.
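With many levels of y, indexing the pieces one by one gets tedious. Here is a sketch of two alternatives in base R: access the pieces by name from the list, or, if separate objects really are needed, create them all at once with list2env. The df.A style names follow the question. The data frame is rebuilt with data.frame() here so that x stays numeric.

```r
x <- 1:20
y <- rep(c("A", "B", "C"), times = c(8, 9, 3))
df <- data.frame(x = x, y = y)

# One list element per level of y, named after the level
pieces <- split(df, df$y)
names(pieces) <- paste0("df.", names(pieces))

# Access by name without creating separate objects
nrow(pieces[["df.B"]])

# Or create df.A, df.B, df.C as objects in the global environment
invisible(list2env(pieces, envir = .GlobalEnv))
df.C
```

Keeping the data in the list is usually easier to work with afterwards, since lapply can then run the same step on every subset.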

Combine data from different txt

I have 20 different txt files which all have the same columns with the same names BUT different values,
for example
TXT1
a b c d
1 4 5 6
3 4 5 3
TXT2
a b c d
2 4 8 6
3 5 2 9
How can I create a new txt file which will have all the values from both TXT1 and TXT2 in the correct columns?
thank you
Anna
Including the step of reading the data, I would solve your problem like this:
library(plyr)
# list_src_files is assumed to be a character vector of the input file paths
large_table = ldply(list_src_files, read.table, header = TRUE)
write.table(large_table, file = "large_table.txt")
Here is some R magic to make your life very easy:
Create some data in the format you described:
TXT1 <- data.frame(a = 1:4,b = 5:8,c = 9:12)
TXT2 <- data.frame(a = 11:14,b = 15:18,c = 19:22)
TXT3 <- data.frame(a = 21:24,b = 25:28,c = 29:32)
TXT4 <- data.frame(a = 31:34,b = 35:38,c = 39:42)
Stitch it together:
x <- ls(pattern = "TXT[[:digit:]]", all.names=TRUE)
do.call(rbind, lapply(x, get))
The results:
a b c
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 11 15 19
6 12 16 20
7 13 17 21
8 14 18 22
9 21 25 29
10 22 26 30
11 23 27 31
12 24 28 32
13 31 35 39
14 32 36 40
15 33 37 41
16 34 38 42
Assuming your column names are identical, per your above example:
TXT3 <- rbind(TXT1,TXT2)
write.table(TXT3,file="TXT3.txt")
Once you have read in your files, use rbind().
Example:
dat.in.1 <- read.delim(dat.1)
dat.in.2 <- read.delim(dat.2)
dat.in.3 <- read.delim(dat.3)
dat.in.4 <- read.delim(dat.4)
dat.in.5 <- read.delim(dat.5)
dat.total <- rbind(dat.in.1, dat.in.2, dat.in.3, dat.in.4, dat.in.5)
You should also give this a look:
R Data Import/Export Manual
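The answers above can be combined into one end-to-end sketch using only base R: find the files, read each with its header, stack them, and write the result. The directory and file names below are made up for the demonstration and written to a temporary folder. With real data, only the list.files call would need to point at the actual files.

```r
# Set up two demo files matching the question's example
dir <- file.path(tempdir(), "txt_demo")
dir.create(dir, showWarnings = FALSE)
write.table(data.frame(a = c(1, 3), b = c(4, 4), c = c(5, 5), d = c(6, 3)),
            file.path(dir, "TXT1.txt"), row.names = FALSE, quote = FALSE)
write.table(data.frame(a = c(2, 3), b = c(4, 5), c = c(8, 2), d = c(6, 9)),
            file.path(dir, "TXT2.txt"), row.names = FALSE, quote = FALSE)

# Collect the input files, read each one, and stack the rows
files <- list.files(dir, pattern = "^TXT[0-9]+\\.txt$", full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE)
combined <- do.call(rbind, tables)

# Write the combined table without row names
write.table(combined, file.path(dir, "combined.txt"),
            row.names = FALSE, quote = FALSE)
combined
```

rbind matches data frame columns by name, so this relies on every file having the same column names, as the question states.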
