finding .txt files with incomplete information in R

I have a folder of .txt files. One of the columns in each .txt file is called "Row." (That's confusing, I'm sorry.) The "Row" column contains values A through H.
I'm trying to write something that I can run on each .txt file to check whether all the values from A through H are present in the file. I want this function to tell me which .txt file is incomplete (missing some of the values). I don't have to know what's missing... I just have to know which .txt file doesn't have all the values from A to H.
Is there a way to do this?
Thank you in advance~
EDIT
Here is a sample of the data I'm working with. This is from the first data frame in a list of data frames. So this is after I already made each .txt file into a data frame.
row col TOF EXT time green yellow red worm call50 norm.red stage
1 A 1 20 20 0 2 0 0 1.922668e-02 bubble 0.000000000 L1
2 A 1 32 45 358 6 6 3 9.637690e-01 worm 0.093750000 L1
3 A 1 24 30 1185 6 1 0 2.246214e-02 bubble 0.000000000 L1
4 A 1 139 230 2433 39 49 31 1.000000e+00 worm 0.223021583 L2
5 A 1 27 23 2433 3 4 2 8.262885e-01 worm 0.074074074 L1
6 A 1 24 25 3946 3 4 3 9.077824e-01 worm 0.125000000 L1
7 A 2 40 55 0 30 46 29 1.000000e+00 worm 0.725000000 L1
8 A 2 31 34 2793 3 2 0 1.100591e-01 bubble 0.000000000 L1
9 A 3 37 42 0 8 9 5 9.996614e-01 worm 0.135135135 L1
10 A 3 89 172 562 28 38 20 1.000000e+00 worm 0.224719101 L1
...
648 B 1 124 160 0 11 8 4 9.999695e-01 worm 0.032258065 L2
649 B 1 125 211 47 13 11 4 9.999610e-01 worm 0.032000000 L2
650 B 1 65 112 141 6 4 3 9.404593e-01 worm 0.046153846 L1

Try:
ldf <- list(df1 = data.frame(row=LETTERS[1:8], col=1:8), df2 = data.frame(row=LETTERS[1:7], col=1:7))
> ldf
$df1
row col
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
$df2
row col
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
> lapply(ldf, function(x) sum(LETTERS[1:8] %in% x$row) != 8)
$df1
[1] FALSE
$df2
[1] TRUE
EDIT
To label the results by file, you can also name the list elements after the files: names(ldf) <- c("file1.txt", "file2.txt"). Rerunning the check then gives:
$file1.txt
[1] FALSE
$file2.txt
[1] TRUE
UPDATE
A nicer way would be (note the flipped logic: TRUE now means the file is complete; this version also checks that col contains 1 through 8):
lapply(ldf, function(x) all(LETTERS[1:8] %in% x$row) & all(1:8 %in% x$col))
$file1.txt
[1] TRUE
$file2.txt
[1] FALSE
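To tie this back to the original folder of .txt files, here is a minimal sketch; the folder path is a placeholder, and read.delim assumes tab-delimited files with a header row (adjust the read call to match your files):
files <- list.files("path/to/folder", pattern = "\\.txt$", full.names = TRUE)
ldf <- lapply(files, read.delim)
names(ldf) <- basename(files)
# file names missing at least one of A through H in their "row" column
names(ldf)[!sapply(ldf, function(x) all(LETTERS[1:8] %in% x$row))]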

Related

Reading in space separated dataset in R for frequent items

I have a .txt file that consists of numbers separated by spaces. Each row has a different number of values in it. I need to do market basket analysis on the data, but I can't seem to load the data properly (especially because there is a different number of items in each 'basket'). What is the best way to store the data so I can find the frequent items and then check for frequent items in each basket?
Example of data:
1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54
You should be able to read the file with readLines and then use lapply to convert each line into numerics. Assume the data is in a file named txt.txt:
dat <- lapply( readLines("txt.txt"), function(Line) scan(text=Line) )
The reason I didn't suggest read.table with fill=TRUE (which would give you something similar to the other answer that has appeared) is that the column structure was not needed, unless there was information encoded in the position of those numbers. I'm wondering whether there might be additional information encoded in the individual lines, such as regions or stores or some other entity as the source of particular numbered items. That would be a reason to keep the data in a list structure with an uneven count. You can get a global enumeration just with table:
table( unlist(dat) )
1 2 3 4 5 6 7 8 21 23 32 34 43 54 67 145 154 456
1 4 2 3 2 1 1 1 2 1 2 1 1 1 1 1 1 1
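As a hedged follow-up sketch (the threshold of 2 occurrences is an arbitrary choice for illustration), you could use those counts to pick the frequent items and then flag which baskets contain any of them:
counts <- table(unlist(dat))
frequent <- as.numeric(names(counts[counts >= 2]))  # items occurring at least twice overall
sapply(dat, function(basket) any(basket %in% frequent))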
library(dplyr) # provides the %>% pipe
library(tidyr) # provides separate()
my_text = '1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54'
my_text2 <- strsplit(my_text, split = '\n')
my_text2 <- lapply(my_text2, trimws)
my_text2 %>%
  do.call('rbind', .) %>%
  t %>%
  as.data.frame() %>%
  separate(V1, sep = ' ', into = paste('col_', 1:10))
col_ 1 col_ 2 col_ 3 col_ 4 col_ 5 col_ 6 col_ 7 col_ 8 col_ 9 col_ 10
1 1 2 4 3 67 43 154 <NA> <NA> <NA>
2 4 5 3 21 2 <NA> <NA> <NA> <NA> <NA>
3 2 4 5 32 145 <NA> <NA> <NA> <NA> <NA>
4 2 6 7 8 23 456 32 21 34 54
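One detail worth noting: separate() warns when a row has fewer pieces than the into columns. Its documented fill argument pads short rows with NA on the right and skips the warning:
my_text2 %>%
  do.call('rbind', .) %>%
  t %>%
  as.data.frame() %>%
  separate(V1, sep = ' ', into = paste('col_', 1:10), fill = 'right')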

R - Issues while calling a user-defined function

I have the following dataframe named "dataset"
> dataset
V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450
I have two helper functions as below
get_A <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[nrow(dataset), (p+1)]))
} # Pulls the value from the last row for column p (offset by 1)
get_P <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[p, ncol(dataset)]))
} # Pulls the value from the last column for row p
I have the following data frame on which I need to run the above helper functions. There will be NAs because I'm reading this "data_sub" data frame from an Excel file, which can have an unequal number of rows in the two columns.
> data_sub
Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA
When I call the helper functions, I get some strange results as shown below:
> get_P(data_sub[complete.cases(data_sub$Key_P),]$Key_P)
Scorecard Results
1 2 1837
2 3 315
3 4 621
> get_A(data_sub[complete.cases(data_sub$Key_A),]$Key_A)
Scorecard Results.V2 Results.V4 Results.V6
1 1 12 8 11
2 3 12 8 11
3 5 12 8 11
Warning message:
In data.frame(Scorecard = p, Results = dataset[nrow(dataset), (p + :
row names were found from a short variable and have been discarded
The call to the get_P() helper function is working the way I want. I'm getting the "Results" for each non-NA value in data_sub$Key_P as a dataframe.
But the call to the get_A() helper function is giving strange results and also a warning. I was expecting it to give a data frame similar to the one returned by get_P(). Why is this happening, and how can I make get_A() give the correct data frame? Basically, the output should be
Scorecard Results
1 1 12
2 3 8
3 5 11
I found this link related to the warning but it's unhelpful in solving my issue.
The following works
get_P <- function(df, data_sub) {
  data_sub <- data_sub[complete.cases(data_sub), ]
  data.frame(
    Scorecard = data_sub$Key_P,
    Results = df[data_sub$Key_P, ncol(df)])
}
get_P(df, data_sub)
# Scorecard Results
#1 2 1837
#2 3 315
#3 4 621
get_A <- function(df, data_sub) {
  data_sub <- data_sub[complete.cases(data_sub), ]
  data.frame(
    Scorecard = data_sub$Key_A,
    Results = as.numeric(df[nrow(df), data_sub$Key_A + 1]))
}
get_A(df, data_sub)
# Scorecard Results
#1 1 12
#2 3 8
#3 5 11
To avoid the warning, we need to strip the row names with as.numeric in get_A. With a vector of keys, df[nrow(df), data_sub$Key_A + 1] is a one-row data frame rather than a vector, so data.frame() treats each of its columns as a separate variable and recycles the single row across all Scorecard values, which is exactly the wrong output shown in the question; as.numeric flattens it into a plain vector first.
Another tip: it's better coding practice to make get_P and get_A functions of both df and data_sub, to avoid relying on global variables.
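A quick way to see the one-row data frame problem in the console (a minimal illustration using the sample df defined below, not part of the original answer):
p <- c(1, 3, 5)
str(df[nrow(df), p + 1])              # a 1-row data.frame with 3 variables
str(as.numeric(df[nrow(df), p + 1])) # a plain numeric vector of length 3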
Sample data
df <- read.table(text =
" V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450", header = T, row.names = 1)
data_sub <- read.table(text =
" Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA", header = T, row.names = 1)

Rearranging order for a pair in R

I have a column with 10 random numbers. From that, I want to create a new column in which every pair of values has switched places; see the example for what I mean. How would you do that?
column newcolumn
1 5
5 1
7 6
6 7
25 67
67 25
-10 2
2 -10
-50 36
36 -50
Taking advantage of the fact that R recycles shorter vectors when adding them to longer vectors, you can:
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- a$column[seq(nrow(a)) + c(1, -1)]
Something like this.
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- 0
a[seq(1,nrow(a),by=2),"newColumn"]<-a[seq(2,nrow(a),by=2),"column"]
a[seq(2,nrow(a),by=2),"newColumn"]<-a[seq(1,nrow(a),by=2),"column"]
# results
column newColumn
1 1 5
2 5 1
3 7 6
4 6 7
5 25 67
6 67 25
7 -10 2
8 2 -10
9 50 36
10 36 50
Here is a base R one-liner: we can reshape column into a 2 x (nrow(df)/2) matrix, swap the two rows, and flatten back into a vector.
df$newcolumn <- c(matrix(df$column, ncol = nrow(df) / 2)[c(2,1), ])
# column newcolumn
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
Sample data
df <- read.table(text =
"column
1
5
7
6
25
67
-10
2
-50
36", header = T)
Another option would be to use ave and rev
transform(df, newCol = ave(x = df$column, rep(1:5, each = 2), FUN = rev))
# column newCol
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
The part rep(1:5, each = 2) creates a grouping variable ("pairs") for each of which we reverse the elements.
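The same idea generalizes to any even number of rows by deriving the grouping variable from nrow(df) (a small sketch, not part of the original answer):
transform(df, newCol = ave(df$column, rep(seq_len(nrow(df) / 2), each = 2), FUN = rev))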
Here's a compact way:
a$new_col <- c(matrix(a$column,2)[2:1,])
# column new_col
# 1 1 5
# 2 5 1
# 3 7 6
# 4 6 7
# 5 25 67
# 6 67 25
# 7 -10 2
# 8 2 -10
# 9 50 36
# 10 36 50
The idea is to write the values into a 2-row matrix, switch the rows, and unfold back into a vector.

Merge two datasets in R

I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
The number of rows in each dataset varies, but both contain data for two samples (for example, column A: 13 and 1 in both datasets). I want the C, D and E values of dataset 1 to be placed into dataset 2 for rows having the same values of A and B in both datasets. So the join should be based on A and B. I need to do this for about 47,560 rows.
I am new to R, so I would be thankful for code that saves the new merged dataset.
Use the merge function in R.
Reference: http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets; CSV is a good format.
> dataset1 <- read.csv(file="dataset1.csv", header=TRUE, sep=",")
> dataset2 <- read.csv(file="dataset2.csv", header=TRUE, sep=",")
If you just type the variable names now and hit enter you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should work... I may be wrong...
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
EDIT 2:
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference: http://www.statmethods.net/input/exportingdata.html
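For a self-contained sketch, the same merge can be tried on abridged values from the question. One thing to watch: duplicated (A, B) pairs (such as 13/2 or 1/7 in dataset 2) each produce one output row per matching combination, so the merged result can have more rows than either input:
dataset1 <- read.table(text = "
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
1 6 27 9 1
1 7 17 9 1", header = TRUE)
dataset2 <- read.table(text = "
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
1 6 27 650
1 7 27 650
1 7 27 650", header = TRUE)
dataset1_2 <- merge(dataset1, dataset2, by = c("A", "B"))
dataset1_2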

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have millions of rows, you should give a data.table solution a try:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
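As an alternative sketch (assuming data.table 1.9.8 or later, where non-equi joins were introduced; not part of the original answer), a non-equi join avoids materializing the cartesian merge; the x.time prefix keeps the observation's own time column:
library(data.table)
OBS[SAM,
    .(id, time = x.time, measurement, location),
    on = .(id, time > time1, time < time2),
    nomatch = 0L]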
