reading a table in R? - r

I have a txt file with the following structure:
NAME DATA1 DATA2
a 10 1,2,3
b 6 8,9
c 20 5,6,7 ,8
The first line represents the header and the data is separated by tabs. I need to put the elements of DATA1 in a list or vector in a way that I can traverse the elements one by one.
Also I need to extract the elements of DATA2 for each NAME and put them in a list so I can traverse them individually, e.g. get the elements 8,9 for NAME b and put them into a list. (Note that the third record has a space in the list in DATA2 between the 7 and the comma.)
How can I do both operations? I know that I can use read.table and $ for accessing individual elements, but I am stuck.
info<-read.table("table1", header=FALSE,sep="\t")
namelist<-list(info$NAME)

Run this demo and look at the structure of n, d1, and d2 -- that should help you get going:
df = read.table(text="NAME\tDATA1\tDATA2
a\t10\t1,2,3
b\t6\t8,9
c\t20\t5,6,7 ,8",
header= TRUE,
stringsAsFactors=FALSE,
sep='\t')
n = df$NAME
d1 = df$DATA1
d2 = lapply(strsplit(df$DATA2, ","),
as.numeric)
names(d2) = n
d2[['b']][1] # first value of the element named 'b' ([[ ]] extracts the vector; [ ] would return a one-element list)
lapply(d2, FUN=mean) # mean of each element of d2
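To make the traversal part concrete, here is a minimal sketch of the difference between single and double brackets on a named list, followed by a loop over every value. The list is typed in by hand to mirror the demo data rather than read from a file, and the variable names sub_list, values and visited are only illustrative.

```r
# Hand-built stand-in for the d2 list produced by the demo above
d2 <- list(a = c(1, 2, 3), b = c(8, 9), c = c(5, 6, 7, 8))

sub_list <- d2["b"]    # single brackets: still a list, of length 1
values   <- d2[["b"]]  # double brackets: the numeric vector itself

# Traverse every value of every NAME, one by one
visited <- character(0)
for (nm in names(d2)) {
  for (v in d2[[nm]]) {
    visited <- c(visited, paste(nm, v))
  }
}
print(visited)
```

Since as.numeric tolerates surrounding whitespace, the "7 ,8" entry in the third record should not need special handling.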

Related

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
}
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
In that last example, notice the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). As a matter of defensive programming, if you think it at all possible that your filtering will return a single column, make sure you do not inadvertently convert to a vector: append ,drop=FALSE to your bracket-subsetting.
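Since the original WaFramesCosts data was never posted, here is a small self-contained sketch of the exact-match approach on a made-up list. The list frames, its element names east and west, and the Other column are assumptions for illustration only.

```r
# Made-up stand-in for a list of data frames with differing columns
frames <- list(
  east = data.frame(Cost_Center = 1:2, Department = c("A", "B"), Other = 3:4),
  west = data.frame(Cost_Center = 5:6, Domestic_Anytime_Min_Used = c(10, 20))
)
wanted <- c("Cost_Center", "Domestic_Anytime_Min_Used", "Department")

# Keep only the wanted columns that actually exist in each data frame;
# drop = FALSE guarantees a data frame even if one column survives
subsetted <- lapply(frames, function(d) {
  d[, intersect(wanted, colnames(d)), drop = FALSE]
})

# Anchored regular expression: whole-name matches only
exact <- grep("^(Cost_Center|Department)$", colnames(frames$east), value = TRUE)

str(subsetted)
```

Because lapply is applied to the list itself rather than to its names, the result keeps the element names east and west.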
Based on your description, this is an example of using the dplyr package to combine a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns. (Providing your data in a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202

Adding single rows of a data frame as columns to a large number of other datasets matching 1 by 1

I have 23 data frames each containing ~20 observations over 200 variables and another data frame containing 13 variables and 23 observations. These 13 variables hold information about the 23 data frames.
What I'm trying to do is to find a way to add the information from the lone data frame to each corresponding data frame in the list of 23, so that each observation in one out of the 23 data frames will hold the same value (e.g. the timepoint the whole data frame has been recorded).
The first line in the lone data frame corresponds to the information for the first data frame of the list of 23 and so on.
ls()
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df3" "df4" "df5" "df6" "df7" "df8" "df9" "i"
[25] "lf"
After some research I tried putting this into a list but realized that I have actually no idea in which order the list stores my data. I know that df1 matches row one of the lone frame "lf" (and if the list just flips things I'll match it the wrong way).
So on a single example I tried combine which worked somewhat (but not all too well):
> testdf <- c(df1,lf[1,])
> is.data.frame(testdf)
[1] FALSE
> testdf <- as.data.frame(testdf)
> is.data.frame(testdf)
[1] TRUE
At first it was a list, but using as.data.frame and having a look at the specific columns using View() it was the result I need. e.g. a new column at the end of the frame containing a variable like "time" that has values 13:37 for all observations in "df1".
Next I tried a loop...
for (i in 1:23){
+ assign(paste0("df",i), cbind(paste0("df",i),lf[i,], row.names = NULL))
+ }
...basically just trying to do what I did first multiple times (as.data.frame() is missing here, but it doesn't change a thing). What happens is that each data frame now only has 1 observation containing the 13 variables I wanted to add at the end of the original frame.
After that everything has gone to s*** basically. I've tried to google for hours, but couldn't get anything to work really. Mostly I've tried playing around with it as a list until I realized this was a bad idea without getting the order right first (I actually know now how I can get that sorted out but right now I don't have the energy to do that. If you have a solution with a list that contains the name of each data frame as stored in the list, I'm sure I can get up to that point).
EDIT So I tried to make an example and show where I'm coming from. I hope it's more clear. I'm aware that I sadly don't solve it the "R-way" like this, which is why I tried looking at lists and apply a lot, but wasn't able to come up with a solution still.
> #create 3 data frames, 5 observations and 10 variables each
> df1 <- as.data.frame(matrix(rnorm(50, mean = 50, sd = 10), ncol = 10, nrow = 5))
> df2 <- as.data.frame(matrix(rnorm(50, mean = 50, sd = 10), ncol = 10, nrow = 5))
> df3 <- as.data.frame(matrix(rnorm(50, mean = 50, sd = 10), ncol = 10, nrow = 5))
>
> #create lone data frame with 3 observations (1 per data frame) and 2 variables
> df4 <- as.data.frame(matrix(rnorm(6, mean = 5, sd = 1), ncol = 2, nrow = 3))
>
> #create colnames for better explanation
> cn <- c()
> for (i in 1:12){
+ cn[i] <- paste0("Var",i)
+ }
> colnames(df1) <- cn[1:10]
> colnames(df2) <- cn[1:10]
> colnames(df3) <- cn[1:10]
> colnames(df4) <- cn[11:12]
>
> #working example for 1 out of 3 matches
> #adding the first row of the lone data frame "df4" containing
> #Var11 and Var12 to df1. Result is as desired
> newdf1 <- c(df1,df4[1,])
> as.data.frame(newdf1)
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12
1 52.37538 48.47529 41.93258 45.93547 41.71611 58.86811 40.70888 41.87981 56.80464 49.73488 5.233276 4.417211
2 51.90261 61.72404 44.96621 48.59473 51.61673 51.07525 55.02000 43.48264 34.03446 48.93913 5.233276 4.417211
3 39.85056 48.72688 49.93816 60.41899 54.90524 56.84387 53.92486 55.92178 50.81779 66.03640 5.233276 4.417211
4 41.61915 53.22312 47.96660 50.79573 34.98073 41.81004 46.43976 45.49678 32.48257 58.65475 5.233276 4.417211
5 58.52455 39.70007 51.26386 39.92583 47.08723 31.41743 45.34423 63.06964 61.07181 55.44908 5.233276 4.417211
> df4
Var11 Var12
1 5.233276 4.417211
2 5.309388 5.375850
3 6.342876 5.318077
Really grateful for any help offered :)
PS: My first post here, I hope it's readable.
Having a bunch of data.frames lying around with names that have numbers in them is a sign that you're not doing things the "R way". Another sign that things aren't looking good is the use of assign(). You generally should keep such objects in a list in R. That makes everything easier to work with.
But let's say you have such data frames
df1<-data.frame(id=1:10, a=1:10)
df2<-data.frame(id=1:10, b=1:10)
df3<-data.frame(id=1:10, c=1:10)
lf<-data.frame(x=1:3, y=1:3)
We can use ls() to get their names and mget() to return them in a list. Then we can use Map() to cbind() each data.frame in the list to each row of lf. This will return a new list with all the updated data.frames
Map(function(a,b) {row.names(b)<-NULL; cbind(a, b)} ,
mget(ls(pattern="^df\\d+")),
split(lf, 1:nrow(lf))
)
Given the lack of reproducible example it's hard to know exactly what you wanted. You should provide small input data sets and show the desired output. This would make it easier to test solutions.
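One caveat on ordering, which the question explicitly worries about: ls() returns names alphabetically, so with more than nine data frames "df10" sorts before "df2" and the rows of lf would be matched to the wrong frames. A minimal sketch of one way around this, assuming the names follow the df<number> pattern and using generated toy data (the source column and the times are made up):

```r
# Names in the alphabetical order ls() would return them
nms <- sort(paste0("df", 1:11))

# Toy list of data frames; each one records its own name
dfs <- lapply(nms, function(nm) {
  data.frame(id = 1:2, source = nm, stringsAsFactors = FALSE)
})
names(dfs) <- nms

# Lone frame: row i describes data frame number i
lf <- data.frame(time = sprintf("%02d:00", 1:11), stringsAsFactors = FALSE)

# Reorder by the numeric suffix so that position i holds "df<i>"
ord <- order(as.integer(sub("^df", "", names(dfs))))
dfs <- dfs[ord]

# Bind row i of lf onto every row of the i-th data frame
combined <- Map(function(d, info) {
  row.names(info) <- NULL
  cbind(d, info)
}, dfs, split(lf, seq_len(nrow(lf))))

combined$df10
```

With real objects in the workspace, the same reordering could be applied to the character vector from ls(pattern = "^df\\d+$") before passing it to mget().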

Combine, Order, Dedup over Multiple Files in R

I have a large number of CSV files that look like this:
var val1 val2
a 2 1
b 2 2
c 3 3
d 9 2
e 1 1
I would like to:
1. Read them in
2. Take the top 3 from each CSV
3. Make a list of the variable names only (3 x number of files)
4. Keep only the unique names on the list
I think I have managed to get to point 3 by doing this:
csvList <- list.files(path = "mypath", pattern = "*.csv", full.names = T)
bla <- lapply(lapply(csvList, read.csv), function(x) x[order(x$val1, decreasing=T)[1:3], ])
lapply(bla,"[", , 1, drop=FALSE)
Now, I have a list of the top 3 variables in each CSV. However, I don't know how to convert this list to a string and keep only the unique values.
Any help is welcome.
Thank you!
The issue is in extracting the first column of each element of bla with drop=FALSE. This keeps each result as a one-column data frame instead of coercing it to its lowest dimension, which is a vector. Use drop=TRUE instead and then unlist followed by unique, as @Frank suggests:
unique(unlist(lapply(bla,"[", , 1, drop=TRUE)))
As you know, drop=TRUE is the default, so you don't even have to include it.
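For a version that can be run without any files on disk, here is a sketch of the whole pipeline on two in-memory data frames standing in for the result of read.csv. The values are made up (and chosen without ties so the ordering is unambiguous), and head(..., 3) is swapped in for [1:3] so that a file with fewer than three rows does not produce NA rows.

```r
# Two made-up data frames standing in for lapply(csvList, read.csv)
csv1 <- data.frame(var = c("a", "b", "c", "d", "e"),
                   val1 = c(2, 4, 3, 9, 1),
                   val2 = c(1, 2, 3, 2, 1),
                   stringsAsFactors = FALSE)
csv2 <- data.frame(var = c("a", "d", "f", "g"),
                   val1 = c(7, 8, 1, 5),
                   val2 = c(1, 1, 1, 1),
                   stringsAsFactors = FALSE)
frames <- list(csv1, csv2)

# Top 3 rows of each data frame by val1
top3 <- lapply(frames, function(x) {
  head(x[order(x$val1, decreasing = TRUE), ], 3)
})

# Pull out the names as plain vectors, flatten, and de-duplicate
top_names <- unique(unlist(lapply(top3, function(x) x$var)))
top_names
```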
Update to new requirements in comments.
To keep the first two columns var and val1 and remove duplicates in var (keep only the unique vars), do the following:
## unlist each column in turn and form a data frame
res <- data.frame(lapply(c(1,2), function(x) unlist(lapply(bla,"[", , x))))
colnames(res) <- c("var","val1") ## restore the two column names
## remove duplicates
res <- res[!duplicated(res[,1]),]
Note that this will only keep the first row for each unique var. This is the definition of removing duplicates here.
Hope this helps.

Using R to list and mark multiple csv files with characters from the title of those files, and put those in a dataframe

I have a large number of files that are all numbered and labeled from a CTD cast. These files all contain 3 columns, for bottle number fired, Depth, and Conductivity, and 3 rows, one for each water bottle fired.
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547
These files are named after the cast number as such: "OS1505xxx.csv", where the xxx is the cast number. I would like to take the data from multiple casts, label the data with the cast number (which I presume would go in another column for each bottle sample), and then merge that data together in one dataframe.
1,68.93,0.2123,001
2,14.28,0.3139,001
3,8.683,0.3547,001
1,109.5,0.2062,002
2,27.98,0.4842,002
3,5.277,0.3705,002
One other thing: some files only have 1 or 2 bottles fired, while others have 4 bottles fired. I tried finding files with only 3 rows and making a list of the filenames repeated three times, and then merging that with the bound csv files that had three rows into a dataframe, but I am very new to R and couldn't figure it out. Any help is appreciated.
This gets all of them into one data frame in order (001-100), and from there you can export it however you want.
df <- NULL # start empty; seeding with a 1-row matrix would leave an all-NA first row
for(i in 1:100) {
file_name <- sprintf("OS1505%03d.csv", i) # zero-padded cast number, e.g. OS1505001.csv
if(file.exists(file_name)) {
print("match found")
df_tmp <- read.csv(file_name, header = FALSE, sep = ",",fill = TRUE)
df_tmp$file <- sprintf("%03d", i) # label every row with its cast number
df <- rbind(df, df_tmp)
}
}
Try this:
files <- list.files(pattern="OS1505")
lst <- lapply(files, read.csv)
ids <- substr(files, 7,9)
for(i in 1:length(lst)) lst[[i]][,4] <- ids[i]
do.call(rbind, lst)
# X V1 V2 V3
#1 1 1 68.930 001
#2 2 2 14.280 001
#3 3 3 8.683 001
#4 1 1 109.500 002
#5 2 2 27.980 002
#6 3 3 5.277 002
We start by first creating two dummy files to try and save them as csv files to test. I named them in a way to match your files. (i.e. "OS1505001.csv"):
file1 <- read.table(text="
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547", sep=',')
file2 <- read.table(text="
1,109.5,0.2062
2,27.98,0.4842
3,5.277,0.3705", sep=',')
write.csv(file1, "OS1505001.csv")
write.csv(file2, "OS1505002.csv")
Going through the code, files checks the directory for any files that have OS1505 in them. There are two files that match that description "OS1505001.csv" "OS1505002.csv". We bring those two files into R with read.csv. It is wrapped in lapply so that the process can happen to all of the files in the files vector at once and saved in a list called lst. Now ids is a way to grab the id numbers from the file names. In a for loop we assign each id to the 4th column of the data frames. Lastly, do.call brings it all together with the rbind function.
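Note that in the output above the ids land in V3 and replace the conductivity values, because write.csv added a row-name column (X) to the dummy files and the data frames therefore already had four columns. For files shaped like the ones in the question (no header, three columns), a sketch along these lines keeps all three measurements and adds the cast number as a separate column. The column names bottle, depth, conductivity and cast are assumed labels, and the files are written to a temporary directory purely for the demonstration.

```r
# Write two headerless demo files into a temporary directory
demo_dir <- tempfile("ctd_")
dir.create(demo_dir)
writeLines(c("1,68.93,0.2123", "2,14.28,0.3139", "3,8.683,0.3547"),
           file.path(demo_dir, "OS1505001.csv"))
writeLines(c("1,109.5,0.2062", "2,27.98,0.4842"),   # only two bottles fired
           file.path(demo_dir, "OS1505002.csv"))

files <- list.files(demo_dir, pattern = "^OS1505[0-9]{3}\\.csv$", full.names = TRUE)

casts <- lapply(files, function(f) {
  d <- read.csv(f, header = FALSE,
                col.names = c("bottle", "depth", "conductivity"))
  # Cast number = the three digits after "OS1505", kept as text ("001")
  d$cast <- sub("^OS1505([0-9]{3})\\.csv$", "\\1", basename(f))
  d
})

all_casts <- do.call(rbind, casts)
all_casts
```

Because the cast label is added per file before binding, files with 1, 2 or 4 bottles need no special treatment.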

how can i read a csv file containing some additional text data

I need to read a csv file in R. But the file contains some text information in some rows instead of comma-separated values, so I cannot read that file using the read.csv(fileName) method.
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values of each name,date pair as a data frame. How can I read that file to do that?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would read the entire file first as a character vector, i.e. one string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for the ,, separator lines; see grep for that. Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
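The same idea can be sketched in base R alone, under the assumption that every block starts with a line beginning with "name:" and that separator lines consist only of commas. The variable names is_header, block, keep and frames are illustrative; with a real file, readLines(fileName) would replace the strsplit call.

```r
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"

lines <- strsplit(text, "\n")[[1]]

is_header <- grepl("^name:", lines)       # lines that open a new block
block     <- cumsum(is_header)            # block number for every line
keep      <- !is_header & !grepl("^,*$", lines)  # drop headers and ",," lines

# Parse each block's remaining lines as a small CSV
frames <- lapply(split(lines[keep], block[keep]), function(x) {
  read.csv(text = paste(x, collapse = "\n"), header = FALSE,
           stringsAsFactors = FALSE)
})

# Label each data frame with the name taken from its header line
headers <- lines[is_header][as.integer(names(frames))]
names(frames) <- sub("^name:([^ ]+).*$", "\\1", headers)

frames
```

Each block is typed independently, so the second column is character in the first data frame (it contains "ds") and integer in the second.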
