Combine data from different txt files - r

I have 20 different txt files which all have the same columns with the same names but different values,
for example
TXT1
a b c d
1 4 5 6
3 4 5 3
TXT2
a b c d
2 4 8 6
3 5 2 9
How can I create a new txt file which will have all the values from both TXT1 and TXT2 in the correct columns?
thank you
Anna

Including the step of reading the data, I would solve your problem like this:
library(plyr)
# list_src_files is a character vector of your input file names
large_table <- ldply(list_src_files, read.table, header = TRUE)
write.table(large_table, file = "large_table.txt")
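If you prefer to avoid plyr, the same pipeline works in base R with lapply plus do.call(rbind). This is a self-contained sketch: it first writes two toy files matching the question's format into a temporary directory (the file names TXT1.txt and TXT2.txt are made up for the demo), then reads and stacks everything matching the pattern.

```r
# Hypothetical base-R sketch: write two toy files, then read and stack
# every matching .txt file in a directory.
dir <- tempdir()
write.table(data.frame(a = 1, b = 4, c = 5, d = 6),
            file.path(dir, "TXT1.txt"), row.names = FALSE)
write.table(data.frame(a = 2, b = 4, c = 8, d = 6),
            file.path(dir, "TXT2.txt"), row.names = FALSE)

# Collect the file names, read each one, and bind the rows together
src_files <- list.files(dir, pattern = "^TXT.*\\.txt$", full.names = TRUE)
large_table <- do.call(rbind, lapply(src_files, read.table, header = TRUE))
write.table(large_table, file.path(dir, "large_table.txt"), row.names = FALSE)
```

With 20 real files you would only change the directory and the pattern; rbind matches the columns by name, which is why identical headers matter.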

Here is some R magic to make your life very easy:
Create some data in the format you described:
TXT1 <- data.frame(a = 1:4,b = 5:8,c = 9:12)
TXT2 <- data.frame(a = 11:14,b = 15:18,c = 19:22)
TXT3 <- data.frame(a = 21:24,b = 25:28,c = 29:32)
TXT4 <- data.frame(a = 31:34,b = 35:38,c = 39:42)
Stitch it together:
x <- ls(pattern = "TXT[[:digit:]]", all.names=TRUE)
do.call(rbind, lapply(x, get))
The results:
a b c
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 11 15 19
6 12 16 20
7 13 17 21
8 14 18 22
9 21 25 29
10 22 26 30
11 23 27 31
12 24 28 32
13 31 35 39
14 32 36 40
15 33 37 41
16 34 38 42

Assuming your column names are identical, per your example above:
TXT3 <- rbind(TXT1,TXT2)
write.table(TXT3,file="TXT3.txt")

Once you read in your files, use rbind().
Example (dat.1 through dat.5 hold the file paths):
dat.in.1 <- read.delim(dat.1)
dat.in.2 <- read.delim(dat.2)
dat.in.3 <- read.delim(dat.3)
dat.in.4 <- read.delim(dat.4)
dat.in.5 <- read.delim(dat.5)
dat.total <- rbind(dat.in.1, dat.in.2, dat.in.3, dat.in.4, dat.in.5)
You should also give this a look:
R Data Import/Export Manual

Related

R - how to select elements from sublists of a list by their name

I have a list of lists that looks like this:
list(list("A[1]" = data.frame(W = 1:5),
"A[2]" = data.frame(X = 6:10),
B = data.frame(Y = 11:15),
C = data.frame(Z = 16:20)),
list("A[1]" = data.frame(W = 21:25),
"A[2]" = data.frame(X = 26:30),
B = data.frame(Y = 31:35),
C = data.frame(Z = 36:40)),
list("A[1]" = data.frame(W = 41:45),
"A[2]" = data.frame(X = 46:50),
B = data.frame(Y = 51:55),
C = data.frame(Z = 56:60))) -> dflist
I need my output to also be a list of lists of length 3, so that each sublist retains the elements whose names start with A[ while dropping the other elements.
Based on some previous questions, I am trying to use this:
dflist %>%
map(keep, names(.) %in% "A[")
but that gives the following error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
Trying to select a single element, for example just A[1] like this:
dflist %>%
map(keep, names(.) %in% "A[1]")
also doesn't work. How can I achieve the desired output?
I think you want:
purrr::map(dflist, ~.[stringr::str_starts(names(.), "A\\[")])
What this does is:
For each sublist (purrr::map)
Select all elements of that sublist (.[], where . is the sublist)
Whose names start with A[ (stringr::str_starts(names(.), "A\\["))
You got the top level map correct, since you want to modify the sublists. However, map(keep, names(.) %in% "A[") has some issues:
names(.) %in% "A[" should be a function or a formula (starting with ~)
purrr::keep applies the filtering function to each element of the sublist, namely to the data frames directly, so it never "sees" the names of each data frame. Actually I don't think you can use keep for this problem at all
Anyway this produces:
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
If we want to use keep, use
library(dplyr)
library(purrr)
library(stringr)
map(dflist, ~ keep(.x, str_detect(names(.x), fixed("A["))))
Here is a base R solution:
lapply(dflist, function(x) x[grep("A\\[",names(x))] )
This returns the same output as shown above.
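The reason the predicate never sees the names can be illustrated with base R's Filter, the closest base analogue of purrr::keep: the filtering function receives each element's value, never its name, so name-based selection has to go through names() indexing instead. A minimal demo with a made-up two-element list:

```r
# Minimal demo (base R): the filtering predicate sees values, not names.
lst <- list(a = 1, b = 2)
kept <- Filter(function(x) x > 1, lst)        # predicate only ever sees 1 and 2
# To select by name, index the list with a logical vector over its names:
by_name <- lst[startsWith(names(lst), "a")]
```

Here kept retains only the element named b (because its value exceeds 1), while by_name retains a, purely by its name; this is exactly the distinction the keep-based attempts ran into.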

Error in FUN(left, right) : non-numeric argument to binary operator

I'm trying to add values to each individual column in specific rows, using a loop, but it keeps giving the error "non-numeric argument to binary operator". Maybe the program is reading the index value of the column?
This is my code:
col1st <- colnames(NB1stRow)[5:74]
for(i in seq_along(col1st)){
NB1stRow[i] <- NB1stRow[i]*2
}
Here's what a column looks like:
NB1stRow[6]
X417.897
1 21.29759
2 22.52447
3 25.59260
4 29.67289
5 34.45366
6 30.30945
7 28.02665
8 28.13356
9 31.67405
10 28.65952
11 28.49534
12 32.18732
13 35.24368
14 32.02267
15 30.92876
I am using base R.
First check whether the class of the data is numeric; str(NB1stRow) will show you the class of each column. If they are not numeric, convert them to numeric with:
cols <- 5:74
NB1stRow[cols] <- lapply(NB1stRow[cols], as.numeric)
Multiplication (*) doesn't require a loop; it can be applied to a data frame directly, so you can do
NB1stRow[cols] <- NB1stRow[cols] * 2
For example,
library(magrittr)
dummy <- matrix(1:25, nrow = 5, byrow = TRUE) %>% as.data.frame()
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
4 16 17 18 19 20
5 21 22 23 24 25
With this data, it works:
for (i in seq_along(colnames(dummy))){
dummy[i] <- dummy[i]*2
}
V1 V2 V3 V4 V5
1 2 4 6 8 10
2 12 14 16 18 20
3 22 24 26 28 30
4 32 34 36 38 40
5 42 44 46 48 50
When we change one value to text, like this:
dummy <- matrix(c(1:25), nrow = 5, byrow = TRUE) %>% as.data.frame()
dummy[3,4] <- "a"
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 a 15
4 16 17 18 19 20
5 21 22 23 24 25
for (i in seq_along(colnames(dummy))){
dummy[i] <- dummy[i]*2
}
Error in FUN(left, right) : non-numeric argument to binary operator
Try checking str(NB1stRow[, 5:74]) to see whether any of the columns are character, factor, etc.
Assuming that columns 5:74 are numeric, the problem is that seq_along(col1st) is 1:70 so it is trying to double those columns, not columns 5:74. Using DF to represent the data frame we want:
ix <- 5:74
for(i in ix) DF[i] <- 2 * DF[i]
or just
DF[ix] <- 2 * DF[ix]
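The indexing mistake is easy to reproduce on a small, made-up frame (the column names here are hypothetical, chosen so the first columns are character like the question's data):

```r
# Minimal sketch of the indexing bug with a small hypothetical frame.
DF <- data.frame(id = "x", label = "y", n1 = 1, n2 = 2, n3 = 3, n4 = 4,
                 stringsAsFactors = FALSE)
ix <- 3:6                      # the numeric columns we mean to double
# Buggy version: seq_along(ix) is 1:4, so DF[i] hits the character columns,
# producing "non-numeric argument to binary operator":
# for (i in seq_along(ix)) DF[i] <- DF[i] * 2
for (i in ix) DF[i] <- 2 * DF[i]   # correct: loop over the real positions
```

Looping over ix itself (or skipping the loop entirely with DF[ix] <- 2 * DF[ix]) doubles only the intended columns.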

Binning with quantiles adding exception in r

I need to create 10 bins, each with approximately equal frequency; for this,
I am using the function classIntervals from the library classInt with the style
'quantile'. This works for most columns, but when a column has one number repeated many times, I get an error saying that some breaks are not unique. That makes sense: when 30%+ of the column is the same number, the function doesn't know how to split the bins.
What I would like is: if a number occurs more often than 10% of the length of the column, treat it as its own bin; otherwise, use the function as is.
For example, let's assume we have this DF:
df <- read.table(text="
X
1 5
2 29
3 4
4 26
5 4
6 17
7 4
8 4
9 4
10 25
11 4
12 4
13 5
14 14
15 18
16 13
17 29
18 4
19 13
20 6
21 26
22 11
23 2
24 23
25 4
26 21
27 7
28 4
29 18
30 4",h=T,strin=F)
So in this case 10% of the length would be 3, and a table containing the frequency of each number looks like this:
2 1
4 11
5 2
6 1
7 1
11 1
13 2
14 1
17 1
18 2
21 1
23 1
25 1
26 2
29 2
With this info, first we should treat "4" as a unique bin.
So we have a final output more or less like this:
X Bins
1 5 [2,6)
2 29 [27,30)
3 4 [4]
4 26 [26,27)
5 4 [4]
6 17 [15,19)
7 4 [4]
8 4 [4]
9 4 [4]
10 25 [19,26)
11 4 [4]
12 4 [4]
13 5 [2,6)
14 14 [12,15)
15 18 [15,19)
16 13 [12,15)
17 29 [27,30)
18 4 [4]
19 13 [12,15)
20 6 [6,12)
21 26 [26,27)
22 11 [6,12)
23 2 [2,6)
24 23 [19,26)
25 4 [4]
26 21 [19,26)
27 7 [6,12)
28 4 [4]
29 18 [15,19)
30 4 [4]
Until now, my approach has been something like this:
Moda <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Binner <- function(df) {
library(classInt)
# Input is a data frame whose numeric columns are to be binned
for (c in 1:ncol(df)) {
if (sapply(df, class)[c] == "numeric") {
VectorTest <- df[, c]
# Here I get 10% of the number of values
TenPer <- floor(length(VectorTest)/10)
Counter <- 0
while ((sum(VectorTest == Moda(VectorTest))) >= TenPer) {
# In this loop I manage to remove the values that are repeated more
# than 10% of the time, but I still don't know how to add them as a
# special bin
VectorTest <- VectorTest[VectorTest != Moda(VectorTest)]
Counter <- Counter + 1
}
binsTest <- classIntervals(VectorTest, 10 - Counter, style = 'quantile')
binsBrakets <- cut(VectorTest, breaks = binsTest$brks)
df[, paste0("Binned_", colnames(df)[c])] <- binsBrakets
}
}
return(df)
}
Can someone help me?
You could use cutr::smart_cut:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
#
# [2,4) [4,5) [5,6) [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29]
# 1 11 2 2 3 2 2 2 3 2
more on cutr and smart_cut
You can create two different data frames: one for the values frequent enough to get their own bin, and one with the bins created by cut for the rest. Then bind them together (making sure the bins are strings).
library(magrittr)
#lets find the numbers that appear more than 10% of the time
large <- table(df$X) %>%
.[. >= length(df$X)/10] %>%
names()
#these numbers appear less than 10% of the time
left_over <- df$X[!df$X %in% large]
#we want a total of 10 bins, so we'll cut the data into 10 - the number of 10%
left_over_bins <- cut(left_over, 10 - length(large))
#Let's combine the information into a single data frame
numbers_bins <- rbind(
data.frame(
n = left_over,
bins = left_over_bins %>% as.character,
stringsAsFactors = F
),
data.frame(
n = df$X[df$X %in% large],
bins = df$X[df$X %in% large] %>% as.character,
stringsAsFactors = F
)
)
If you table the information, you'll get something like this:
table(numbers_bins$bins) %>% sort(T)
4 (1.97,5] (11,14] (23,26] (17,20]
11 3 3 3 2
(20,23] (26,29] (5,8] (14,17] (8,11]
2 2 2 1 1
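The same idea also fits in a few lines of base R without magrittr. This is a sketch, not a drop-in for the Binner function above: it takes the question's X values as a plain vector, gives any value occurring in at least 10% of the rows its own bin, and cuts the remaining values into the leftover number of bins (using cut with equal-width breaks rather than classInt quantiles, for simplicity):

```r
# Base-R sketch: frequent values become their own bin; the rest are cut.
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)
freq    <- table(x)
special <- as.numeric(names(freq)[freq >= length(x) / 10])   # here just 4

bins <- character(length(x))
is_special <- x %in% special
bins[is_special]  <- as.character(x[is_special])             # own bin per value
bins[!is_special] <- as.character(cut(x[!is_special],        # remaining bins
                                      10 - length(special)))
```

The result keeps the original row order, so bins can be attached directly as a column, and at most 10 distinct bin labels are produced.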

Unformatted Excel data import?

I'm trying to read an Excel file with over 30 tabs of data. The complication is that each tab actually has 2 tables in it. There is a table at the top of the sheet, then a few blank rows, then a second table below with completely different column titles.
I'm aware of the openxlsx and readxl packages, but they seem to assume that the Excel data is formatted into tidy tables.
If I can get the raw data into R (perhaps in a text matrix...), I'm confident I can do the dirty work of parsing it into data frames. Any advice? Many thanks.
You can use the XLConnect package to access arbitrary regions in an Excel worksheet, and then extract a list of data frames. Please see below:
Simulation:
library(XLConnect)
# simulate xlsx-file
df1 <- data.frame(x = 1:10, y = 0:9)
df2 <- data.frame(x = 1:20, y = 0:19)
wb <- loadWorkbook("temp.xlsx", create = TRUE )
createSheet(wb, "sh1")
writeWorksheet(wb, df1, "sh1", startRow = 1)
writeWorksheet(wb, df2, "sh1", startRow = 15)
lapply(2:30, function(x) cloneSheet(wb, "sh1", paste0("sh", x)))
saveWorkbook(wb)
Extract Data
# read.data
wb <- loadWorkbook("temp.xlsx")
df1s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 1, endRow = 11))
df2s <- lapply(1:30, function(x) readWorksheet(wb, x, startRow = 15, endRow = 35))
df1s[[1]]
df2s[[2]]
Output data.frame #1 from the first sheet and data.frame #2 from the second one:
> df1s[[1]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
> df2s[[2]]
x y
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
16 16 15
17 17 16
18 18 17
19 19 18
20 20 19

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable, which I need to apply to columns w, y, and z:
y <- qnorm((rank(x, na.last="keep") - 0.5)/sum(!is.na(x)))
For example, if I wanted to run this function on column w and have the output column appended to data frame m, then:
m$w_n <- qnorm((rank(m$w, na.last="keep") - 0.5)/sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There's probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284
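Applied to the question's actual m (re-created here by hand from the table above, including the y column the demo frame omitted), the whole transformation is one lapply over the chosen columns, with the _n suffix following the question's naming convention:

```r
# Re-create the question's data frame m
m <- data.frame(id = 1:6,
                w = c(2, 18, 1, 52, 5, 3),
                y = c(5, 5, 25, 25, 5, 3),
                z = c(8, 98, 5, 8, 4, 5))

# Rank-based normal transform of each chosen column, appended as <col>_n
norm_cols <- c("w", "y", "z")
m[paste0(norm_cols, "_n")] <- lapply(
  m[norm_cols],
  function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x))))
```

To cover more letter columns, only norm_cols needs to change; rank handles ties (as in y) by averaging, and na.last = "keep" propagates missing values.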
