r summary of row over multiple files

I have around 100 text files which I have loaded into R:
myFiles <- (Sys.glob("C:/../.../*.txt"))
dataFiles <- lapply(myFiles, read.table)
The files have different numbers of rows, but all have 4 columns: the first column is the name and the last three are coordinates.
Example of the files:
[[1]]
n x y z
1 Bal 0.459405 -238.3565 -653.5304
2 tri 0.028990 -224.5127 -600.0000
.....
14 mon 24.514049 -264.7673 -627.0550
[[2]]
n x y z
1 bal 2.220795 -284.1022 -651.8112
2 reg 2.077444 -290.4326 -631.3667
...
8 tri 32.837284 -347.2596 -633.0000
There is one row which is present in all files, e.g. the row with name "tri". I want to find the summary (median, mean, max, min) of that row's coordinates (x, y, z) over all 100 files.
I found quite a few examples of summarizing a row in one file, but not over multiple files.
I think I need to use lapply, but I am not sure how to start.
I also need the summary to create classes later, based on the values I have. I found the "summary" function to be useful for that. If there is any other function that might be of more use for that purpose, a suggestion would be helpful.
Any help would be great!
Thanks!

For pulling all those "tri" rows together you can do:
df <- do.call("rbind", lapply(dataFiles, function(z) z[z$n=="tri",]))
summary(df)
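If you want the four statistics as a small table rather than the printed summary() output, one option is to apply them to each coordinate column. This is only a sketch: the two data frames below are made-up stand-ins for dataFiles (built from the rows shown in the question), and it assumes the columns are named n, x, y and z.

```r
# Toy stand-in for dataFiles, using rows from the question (assumed column names)
dataFiles <- list(
  data.frame(n = c("Bal", "tri"),
             x = c(0.459405, 0.028990),
             y = c(-238.3565, -224.5127),
             z = c(-653.5304, -600.0000)),
  data.frame(n = c("bal", "tri"),
             x = c(2.220795, 32.837284),
             y = c(-284.1022, -347.2596),
             z = c(-651.8112, -633.0000))
)

# Collect the "tri" row from every file into one data frame
tri <- do.call(rbind, lapply(dataFiles, function(d) d[d$n == "tri", ]))

# One column per coordinate, one row per statistic
stats <- sapply(tri[, c("x", "y", "z")], function(col) {
  c(min = min(col), median = median(col), mean = mean(col), max = max(col))
})
print(stats)
```

The result is a numeric matrix, so a single value can be picked out with, for example, stats["median", "y"], which may be handier than summary() when building classes from the values.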

Related

how to check the same ID in a different dataframe and make the merge file?

I want to modify longitudinal data based on ID.
I want to check whether the IDs in wave 1 (A) and wave 2 (B) match properly. I also want to combine A and B into one file based on ID.
I tried to merge the files with merge() and to check whether the IDs matched through the sex variable. However, this check is not possible when the two waves have no variable in common, and it does not check each ID directly.
ID <- c(1012,1102,1033,1204,1555)
sex <- c(1,0,1,0,1)
A <- cbind(ID,sex)
A <- as.data.frame(A)
ID <- c(1006,1102,1001,1033,1010,1234,1506,1999)
sex <- c(1,0,1,1,1,0,0,0)
B <- cbind(ID,sex)
B <- as.data.frame(B)
merge.AB<-merge(A,B,by="ID")
all(merge.AB$sex.x == merge.AB$sex.y)
1. Is there any way to merge the A (wave 1) and B (wave 2) files by ID other than merge()?
Since there are 2 or 3 wave 1 files other than A, it would be nice to be able to combine several files at once.
2. Is there a way to check directly whether two data frames have the same IDs?
I tried to combine A and B with cbind() to check for matching IDs, but I couldn't compare them side by side because A and B have different numbers of rows.
A loop (e.g. if, for, etc.) would be fine, but it would be nice if there was a way to do it with a package or simple code.
3. How do I locate the row number of a mismatched ID?
I want to know all the row numbers of the mismatched IDs in the example.
e.g.
mismatched ID in A: 1012,1204,1555
mismatched ID in B: 1006,1001,1010,1234,1506,1999

Question 1 : you can merge multiple data frames with merge. You first need to create a list of the data frames you want to merge, and then you can use Reduce.
df_list <- list(df1,df2,...dfn)
data=Reduce(function(x,y) merge(x=x,y=y,by="ID",all=TRUE),df_list)
Alternatively using tidyverse:
library(tidyverse)
df_list %>% reduce(full_join, by='ID')
In your example, note that merging by ID alone duplicates any other column the two data frames share (here sex becomes sex.x and sex.y), even though it carries the same information. You could simply omit by, so that merge joins on all shared columns:
data=Reduce(function(x,y) merge(x=x,y=y,all=TRUE), df_list)
This removes the redundant columns from the merged data frame. An ID whose sex differs between the two waves then shows up as two separate rows.
Question 2 : check IDs with setdiff() and intersect()
intersect() gives you the common values between two vectors
setdiff(x,y) gives you the values in x that are not present in y
intersect(A$ID,B$ID)
[1] 1102 1033
setdiff(A$ID,B$ID)
[1] 1012 1204 1555
setdiff(B$ID,A$ID)
[1] 1006 1001 1010 1234 1506 1999
Question 3 : a simple which() including %in% test will give you the position in the dataframe
which(!(A$ID %in% B$ID))
[1] 1 4 5
which(!(B$ID %in% A$ID))
[1] 1 3 5 6 7 8
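Putting the three parts together on the data from the question, a sketch could look like the following. The suffixes and the in_A / in_B flag columns are names I made up for illustration; they are not required by merge().

```r
# Data from the question
A <- data.frame(ID = c(1012, 1102, 1033, 1204, 1555),
                sex = c(1, 0, 1, 0, 1))
B <- data.frame(ID = c(1006, 1102, 1001, 1033, 1010, 1234, 1506, 1999),
                sex = c(1, 0, 1, 1, 1, 0, 0, 0))

# Full outer join: keep every ID from either wave
merged <- merge(A, B, by = "ID", all = TRUE, suffixes = c(".A", ".B"))

# Flag where each ID came from (hypothetical helper columns)
merged$in_A <- merged$ID %in% A$ID
merged$in_B <- merged$ID %in% B$ID

# IDs that appear in only one of the two waves
unmatched <- merged[!(merged$in_A & merged$in_B), ]
print(unmatched)
```

Flagging with %in% on the original ID vectors avoids relying on NA values in sex.A or sex.B, which could also be genuinely missing data.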

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k + lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...

Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, counting the rows with missing values in each
for(filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(file.path(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values
  # (MARGIN = 1 makes apply() call anyNA on each row),
  # ** fill in <fieldName#> with strings of variable names **
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
                                    function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
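If the data are already in one table with an ID column, as described in the question, the loop is not needed. Here is a minimal sketch with a made-up five-row table; the column names Date, Field1 and Field2 are assumptions, and "NA lines" is taken to mean rows with at least one NA.

```r
# Made-up stand-in for the "Data" table (column names are assumptions)
Data <- data.frame(
  ID     = c(1, 1, 2, 2, 3),
  Date   = c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02", "2020-01-01"),
  Field1 = c(NA, 1.5, NA, NA, 2.0),
  Field2 = c(NA, 2.0, 3.0, NA, 1.0)
)

# TRUE for every row that contains at least one NA
has_na <- !complete.cases(Data)

# Sum the TRUE values within each ID
na_counts <- aggregate(list(NA_Count = has_na), by = list(ID = Data$ID), FUN = sum)
print(na_counts)
```

To count individual NA cells instead of rows, rowSums(is.na(Data)) could replace has_na.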

Make dataframe from a list of lists in R

I would like to make a data frame from a list of n elements. Each element contains 3 lists, and I am only interested in one of them: a data.frame with 12 obs. of 12 variables.
The input tmp to my lapply function is a list of n elements with 5 observations each.
2 of those observations are the Latitude and Longitude. This is what my lapply function looks like:
DF_Google_Places<- lapply(tmp, function(tmp){
Latitude<-tmp$Latitude
Longitude<-tmp$Longitude
LatLon<- paste(Latitude,Longitude, sep=",")
res<-GET(paste("https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=",LatLon,"&radius=200&types=food&key=AIzaSyDS6usHdhdoGIj0ILtXJKCjuj7FBmDEzpM", sep=""))
jsonAnsw<-content(res,"text")
myDataframe<- jsonlite::fromJSON(content(res,"text"))
})
My question is: how do I get this list of 12 obs of 12 variables into a dataframe from a list of n ?
Could anyone help me out? Thanks

I'm just posting my comment as an answer so I can show output to show you the idea:
x <- list(a=list(b=1,c=2),d=list(b=3,c=4))
So x is a nested list structure, in this case with consistent naming / structure one level down.
> x
$a
$a$b
[1] 1
$a$c
[1] 2
$d
$d$b
[1] 3
$d$c
[1] 4
Now we'll use do.call to build the data.frame. We need to pass it a named list of arguments, so we'll use list(sapply to get the named list. We'll walk the higher level of the list by position, and the inner level by name since the names are consistent across sub-lists at the inner level. Note here that the key idea is essentially to reverse what would be the intuitive way of indexing; since I want to pull observations at the second level from across observations at the first level, the inner call to sapply traverses multiple values of level one for each value of the name at level two.
y <- do.call(data.frame,
             list(sapply(names(x[[1]]),
                         function(t) sapply(seq_along(x),
                                            function(j) x[[j]][[t]]))))
> y
b c
1 1 2
2 3 4
Try breaking apart the command to see what each step does. If there is any consistency in your sub-list structure, you should be able to adapt this approach to walk that structure in the right order and fold the data you need.
On a large dataset, this would not be efficient, but for 12x12 it should be fine.
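If the piece you want from each element is already a data frame, a shorter route is to extract it and stack the pieces with rbind. The sketch below uses mock data: the element names results and status and the columns name and rating are assumptions for illustration, not taken from the real API response. It also assumes every extracted data frame has the same flat columns, since rbind needs matching column names.

```r
# Mock stand-in for the list returned by lapply();
# element and column names are assumptions for illustration
DF_Google_Places <- list(
  list(results = data.frame(name = c("Cafe A", "Cafe B"), rating = c(4.1, 3.8)),
       status = "OK"),
  list(results = data.frame(name = "Diner C", rating = 4.5),
       status = "OK")
)

# Pull the data frame out of each element, then stack them row-wise
places <- do.call(rbind, lapply(DF_Google_Places, function(el) el$results))
print(places)
```

If some elements have extra or nested columns, this will fail, and you would need to select a common set of columns inside the function first.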

How to format R data.frame output?

Let me start by saying I am brand new to R, so any solution with a detailed explanation would be appreciated so I can learn from it.
I have a set of csv files with the following columns of information:
"ID" "Date" "A" "B" (where A and B are some data points)
I am attempting to get the output in a meaningful manner and am stuck on what I am missing.
observations <- function(directory, id = 1:10){
  #get all file names in a vector
  all_files <- list.files(directory, full.names=TRUE)
  #get the subset of files we want to read
  file_contents <- lapply(all_files[id], read.csv)
  #rbind the file contents
  output <- do.call(rbind, file_contents)
  #remove all rows containing NA values
  output <- output[complete.cases(output), ]
  #at this point output is a data.frame so display the counts per ID
  table(output[["ID"]])
}
My current output is :
2 4 8 10 12
1000 500 200 150 100
which is correct but I need it in column form so it can be understood by looking at it. The output I am trying to get to is below:
id obs_total
1 2 1000
2 4 500
3 8 200
4 10 150
5 12 100
What am I missing here?

table outputs a contingency table. You want a data frame. You can wrap as.data.frame(...) around your output to convert it.
as.data.frame(table(ID = output[["ID"]]))
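By default the count column is called Freq. To get the column names from the question, the table dimension can be named id and the count column renamed through the responseName argument of as.data.frame. A small sketch with made-up IDs:

```r
# Made-up ID vector standing in for output[["ID"]]
ids <- rep(c(2, 4, 8), times = c(3, 2, 1))

# Name the table dimension "id" and the count column "obs_total"
counts <- as.data.frame(table(id = ids), responseName = "obs_total")
print(counts)
```

Note that the id column comes back as a factor; as.numeric(as.character(counts$id)) converts it back to numbers if needed.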

Assuming that the numbers are correct, looks like you have everything you need, just transpose the data frame. Try this:
mat<-matrix(round(runif(10),3),nrow=2)
df<-as.data.frame(mat)
colnames(df)=c("1","2","3","4","5")
t(df)

Split data in R and perform operation

I have a very large file that simply contains wave heights for different tidal scenarios at different locations. My file is organized into 13 wave heights x 9941 events, for 5153 locations.
What I want to do is read in this very long data file, which looks like this:
0.0
0.1
0.2
0.4
1.2
1.5
2.1
.....
Then split it into segments of length 129,233 (corresponds to 13 tidal scenarios for 9941 events at a specific location). On this subset of the data I'd like to perform some statistical functions to calculate exceedance probability, among other things. I will then join it to the file containing location information, and print some output files.
My code so far is not working, although I've tried many things. It seems to read the data just fine, however it is having trouble with the split. I suspect it may have something to do with the format of the input data from the file.
# read files with return period wave heights at defense points
#Read wave heights for 13 tides per 9941 events, for 5143 points
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRP <- c(WaveRPtable)
#colnames(WaveRP) <- c("WaveHeight")
print(paste(WaveRP))
#Read X,Y information for defense points
DefPT.file <- paste('DefXYevery10thpt.out')
DefPT <- read.table(DefPT.file, head=FALSE)
colnames(DefPT) <- c("X_UTM", "Y_UTM")
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, 129233)
print(paste(length(WaveByDefPt[[1]])))
for (i in 1:length(WaveByDefPt)/129233){
print(paste("i",i))
}
I have also tried
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, ceiling(seq_along(WaveRP)/129233))
No matter how I seem to perform the split, I am simply getting the original data as one long subset. Any help would be appreciated!
Thanks :)
Kimberly

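Try cut to build groups. Note that cut(v, ...) bins by value, which coincides with position only because this toy data is sorted; to group by position instead, apply cut to seq_along(v):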
v <- as.numeric(readLines(n = 7))
0.0
0.1
0.2
0.4
1.2
1.5
2.1
groups <- cut(v, breaks = 3) # breaks is the number of groups (one per location for the full data), not the segment length
aggregate(x = v, by = list(groups), FUN = mean) # e.g. means per group
# Group.1 x
# 1 (-0.0021,0.699] 0.175
# 2 (0.699,1.4] 1.200
# 3 (1.4,2.1] 1.800
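For fixed-length segments by position, which is what the question describes, a grouping index built from seq_along does the job regardless of the values. A minimal sketch on the same seven numbers, with a toy segment length of 3 in place of 129233:

```r
v <- c(0.0, 0.1, 0.2, 0.4, 1.2, 1.5, 2.1)

# Toy segment length; the question would use 129233
seg_len <- 3

# Positions 1..seg_len get group 1, the next seg_len get group 2, and so on
groups <- ceiling(seq_along(v) / seg_len)

# Split the plain numeric vector (not a list or data frame) by group
segments <- split(v, groups)

# Example statistic per segment
seg_means <- sapply(segments, mean)
print(seg_means)
```

The last segment is shorter whenever the vector length is not a multiple of seg_len, so it is worth checking lengths(segments) on the real data.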

You are kind of shuffling the data into various data types here.
When the file is originally read, it is a dataframe with 1 column (V1). Then you pass it to c(), which results in a list with a single vector in it. This means if you try and do anything to WaveRP you will probably fail because that's the name of the list. The numeric vector is WaveRP[[1]].
Instead, just extract the numeric vector using the $ operator and then you can work with it. Or just work with it inside the data frame. The fun part will be thinking of a way to create the grouping vector. I'll give an example.
Something like this:
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRPtable$group <- ceiling(seq_along(WaveRPtable$V1)/129233)
SplitWave <- split(WaveRPtable,WaveRPtable$group)
Now you will have a list with one data frame per location, each holding 129,233 rows (13 tidal scenarios x 9941 events). Look at each one using double bracket indexing: SplitWave[[2]], for example, to look at the second group. You can merge the location information file with these data frames individually.