It's hard to explain what exactly I want to achieve with my script but let me try.
I have 20 different csv files, so I loaded them into R:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
then with your help I combined them into one and removed all of the duplicates:
data_rd <- subset(transform(all_data, X = sub("\\..*", "", X)),
!duplicated(X))
I now have one master table which includes all of the names (Accession):
Accession
AT1G19570
AT5G38480
AT1G07370
AT4G23670
AT5G10450
AT4G09000
AT1G22300
AT1G16080
AT1G78300
AT2G29570
Now I would like to find each accession in the other csv files and put that accession's data in the same row. There are about 20 csv files, and each csv has about 20 columns, so in some cases this might give me 400 columns. It doesn't matter how long it takes; it has to be done. Is it even possible to do in R?
Example:
            First csv            Second csv           Third csv
Accession   Size Length Weight   Size Length Weight   Size Length Weight
AT1G19570   12   23     43       22   77     666      656  565    33
AT5G38480
AT1G07370   33   22     33       34   22
AT4G23670
AT5G10450
AT4G09000   12   45     32
AT1G22300
AT1G16080
AT1G78300   44   22     222
AT2G29570
It looks like a hard task. Probably it has to be done with a loop. Any ideas?
This is a merge loop. Here is rough R code that grows the table inefficiently with each merge.
Begin as before:
tbls = list.files(pattern="*.csv")
list_of_data = lapply(tbls, read.csv)
tbl = list_of_data[[1]]
for (i in 2:length(list_of_data)) {
  tbl = merge(tbl, list_of_data[[i]], by = "Accession", all = TRUE)
}
Matching column names (other than the key) will be renamed with a suffix (.x, .y, and so on), and the all=TRUE argument ensures that whenever a new Accession key is merged, a new row is created and the missing cells are filled with NA.
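The same fold-over-merge idea can be written without an explicit loop using Reduce. A minimal sketch on two made-up data frames (the column names Size and Weight here are just placeholders, not from your files):

```r
# Two toy data frames sharing an "Accession" key.
list_of_data <- list(
  data.frame(Accession = c("AT1G19570", "AT1G07370"), Size = c(12, 33)),
  data.frame(Accession = c("AT1G19570", "AT4G09000"), Weight = c(43, 32))
)

# Reduce() folds merge() over the list; all = TRUE keeps every key,
# filling missing cells with NA, just like the loop version.
merged <- Reduce(function(x, y) merge(x, y, by = "Accession", all = TRUE),
                 list_of_data)
print(merged)
```

merge sorts on the key by default, so the result has one row per unique Accession across all inputs.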
I'm trying to replace some of the columns in my data frame with extracted strings from each column name. This is my current data frame:
Date     Time  Temp ActivityLevelActivity ExplainActivityvalues4 AppetiteLevelAppetite
10/22/21 10:26 76   4                     Activity was low       8
10/23/21 8:42  79   3                     Activity was low again 7
I would like to replace the "ActivityLevelActivity" and "AppetiteLevelAppetite" column names with just "Activity" and "Appetite". I would like to change the "ExplainActivityvalues4" to "Activity_Comments".
I have tried:
gsub("Level", "[^L]+", names(df))
gsub("Explain", "(?<=\\n)[[:alpha:]]+(?<=\\v)", names(df))
I used "Level" and "Explain" as the patterns because "Level" appears in every column name where I just want to keep the first word, and "Explain" appears in every column name where I want to keep the middle word and append "_Comments".
Essentially, I would like the new data frame to look like this:
Date     Time  Temp Activity Activity_Comments      Appetite
10/22/21 10:26 76   4        Activity was low       8
10/23/21 8:42  79   3        Activity was low again 7
EDIT:
To explain further, here are all of my column names:
names(df) <- c("Date", "Time", "Temp", "ActivityLevelActivity", "ExplainActivityvalues4", "AppetiteLevelAppetite", "ExplainAppetitevalues4", "ComfortLevelComfort", "ExplainComfortvalues4", "DemeanorLevelDemeanor", "ExplainDemeanorvalues4", "CooperationLevelCooperation", "ExplainCooperationvalues4", "HygieneLevelHygiene", "ExplainHygienevalues4", "MobilityLevelMobility", "ExplainMobilityvalues4")
Since you only have three columns and there's not much of a shared pattern among just those, it would be easier to use rename directly.
# library(dplyr)
dd %>%
  rename(
    Activity = ActivityLevelActivity,
    Appetite = AppetiteLevelAppetite,
    Activity_Comments = ExplainActivityvalues4
  )
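If the full set of columns from the edit is in play, the shared pattern can be exploited with sub and a backreference instead of listing every rename by hand. A sketch, assuming every "Level" column has the form WordLevelWord and every "Explain" column has the form ExplainWordvalues4 (the vector below is a shortened stand-in for your real names(df)):

```r
# Shortened stand-in for the real column names from the question's edit.
nms <- c("Date", "Time", "Temp",
         "ActivityLevelActivity", "ExplainActivityvalues4",
         "AppetiteLevelAppetite")

# "<Word>Level<Word>"    -> "<Word>"        (\\1 backreferences the word)
nms <- sub("^([[:alpha:]]+)Level\\1$", "\\1", nms)
# "Explain<Word>values4" -> "<Word>_Comments"
nms <- sub("^Explain([[:alpha:]]+)values4$", "\\1_Comments", nms)
print(nms)
```

Columns that match neither pattern (Date, Time, Temp) pass through unchanged, so this could be applied to names(df) in one go.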
I have a problem importing a file into R. Correctly organized, the file must contain 5 million records and 22 columns. I cannot separate the data into columns properly. I tried it with this code:
content <- scan("filepath", 'character', sep='~')  # Read the file as a single line
# Split content into lines: each line must have 211 characters, 5 million rows in total
lines <- regmatches(content, gregexpr(".{211}", content))
x <- tempfile()
library(erer)
write.list(lines, x)
data <- read.fwf(x, widths = c(12,9,9,3,4,8,1,1,3,3,3,1,12,14,13,30,8,9,12,6,6,27))
unlink(x)
Each record contains numbers and letters. I don't know what to correct to separate the columns properly.
All rows look like this:
1000100060040000000000808040512000000188801072010010010000000000000 CABANILLAS GONZALES MARIA MANUEL CABANILLAS MARIA GONZALES 00000000000000000000000
I want to separate it according to the widths specified in the function. It includes some spaces that I cannot get into the final view.
I have read every post like this that I could find, but I did not succeed.
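For what it's worth, read.fwf (in base R's utils package) can read a fixed-width file directly, without the scan/write.list round trip. A minimal sketch on a tiny made-up file with two fields (the real widths vector from the question would replace the one here):

```r
# Write a tiny fixed-width sample: a 5-char ID followed by a 10-char name.
tmp <- tempfile()
writeLines(c("10001MARIA     ",
             "10002MANUEL    "), tmp)

# read.fwf splits each line at the given column widths;
# strip.white = TRUE removes the padding spaces from character fields.
dat <- read.fwf(tmp, widths = c(5, 10), strip.white = TRUE,
                col.names = c("id", "name"))
unlink(tmp)
print(dat)
```

On a 5-million-row file this is slow but straightforward; for speed, readr::read_fwf follows the same idea.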
I need to extract tables of different layouts from a single sheet in excel, for each sheet of the file.
Any help or ideas that can be provided would be greatly appreciated.
A sample of the data file and its structure can be found here
I would use readxl. The code below reads just one sheet, but it is easy enough to adapt to read multiple or different sheets.
First we just want to read the sheet. Obviously you should change the path to reflect where you saved your file:
library(readxl)
sheet = read_excel("~/Downloads/try.xlsx", col_names = LETTERS[1:12])
If you didn't know you had 12 columns, then using read_excel without specifying the column names would give you enough information to find that out. The different tables in the sheet are separated by one or two blank rows. You can find the blank rows by testing each row to see if all of the cells in that row are NA using the apply function.
blanks = which(apply(sheet, 1, function(row) all(is.na(row))))
blanks
# [1]  7  8 17 26 35 41 50 59 65 74 80 86 95 98
So you could extract the first table by taking rows 1-6 (everything before the first blank row at 7), the second table by taking rows 9-16, and so on.
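Building on those blank-row indices, the sheet can be cut into separate tables programmatically rather than by hand. A sketch on a toy data frame standing in for the sheet:

```r
# Toy stand-in for the sheet: two "tables" separated by an all-NA row.
sheet <- data.frame(A = c(1, 2, NA, 3, 4),
                    B = c("a", "b", NA, "c", "d"))

blanks <- which(apply(sheet, 1, function(row) all(is.na(row))))

# Give each row a group id that increments at every blank row,
# then drop the blank rows themselves and split by group.
grp <- cumsum(seq_len(nrow(sheet)) %in% blanks)
tables <- split(sheet[-blanks, ], grp[-blanks])
length(tables)
```

Each element of tables is one table from the sheet, ready for further cleaning (the real sheet's consecutive blank rows just produce empty groups that split drops).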
I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import into SQL. In the past, one field was not in the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields for this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection; below, I supplied a textConnection. For large files, you would probably want to feed the result into table:
table(count.fields(tmp, sep=","))
Note that count.fields can also be used to count the number of rows in a file via length, similar to the output of wc -l on *nix systems.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
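To locate the offending rows rather than just tally them, compare each row's field count against the expected 22. A sketch on a small in-memory connection with three-field rows, one of which is short:

```r
# Two good rows and one row with a missing field.
tmp <- textConnection(
  "1,2,3
1,2,3
1,2"
)
expected <- 3
counts <- count.fields(tmp, sep = ",")
close(tmp)

# Row numbers whose field count is wrong.
bad <- which(counts != expected)
print(bad)
```

For the real file, expected would be 22, and bad gives the line numbers to inspect before importing into SQL.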
Assuming df is your data frame:
apply(df, 1, length)
This gives the length of each row. Note, though, that once the file has been parsed into a data frame, every row has the same length (ncol(df)), so this will not catch rows that were short in the raw file; for that, check the file itself with count.fields as above.
c <- read.table("sid-110-20130826T164704.csv", sep = ',', fill = TRUE)
I use the above code to read about 300 csv files, and some of the files look like this:
65792,1,round-5,72797,140,yellow,75397,192,red,75497,194,crash
86267,1,round6,92767,130,yellow,94702,168,brake,95457,178,go,95807,185,red,96057,190,brake,97307,200,crash
108092,1,round-7,116157,130,yellow,117907,165,red
120108,1,round-8,130173,130,yellow,130772,142,brake,133173,152,red
137027,1,round-9,147097,130,yellow,148197,152,brake,148597,160,red
As you can see, the second line is longer than the others (the third element of each row is supposed to be round#), and when I do read.table R cuts that line in half; below I copied the first 5 columns from R:
9 86267 1 round-6 92767 130
10 95807 185 red 96057 190
11 108092 1 round-7 116157 130
12 120108 1 round-8 130173 130
Is there a way to fix this so that the row stays on one line instead of being split?
You can prime the width of the data.frame by specifying the col.names argument along with fill=TRUE, as in:
c <- read.table("sid-110-20130826T164704.csv", sep = ',', fill = TRUE,
                col.names = paste("V", 1:21, sep = ""))
That's assuming you know how many columns you have. If you don't know, you might want to make a single pass through the file to find the maximum width.
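That single pass can be done with count.fields, echoing the earlier answer. A sketch on a small in-memory file with ragged rows (a real file on disk could simply be read twice by name):

```r
# Ragged rows: 3, 5, and 2 fields.
tmp <- textConnection(
  "a,1,x
a,1,x,2,y
a,1"
)
# First pass: find the widest row.
max_fields <- max(count.fields(tmp, sep = ","))
close(tmp)

# Second pass: read with enough columns so no row gets wrapped.
tmp <- textConnection(
  "a,1,x
a,1,x,2,y
a,1"
)
dat <- read.table(tmp, sep = ",", fill = TRUE,
                  col.names = paste0("V", 1:max_fields))
close(tmp)
dim(dat)
```

Short rows are padded out to max_fields columns instead of being split across two lines.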