Avoid quotation marks in column and row names when using write.table

I have the following data in a file called "data.txt":
pid 1 2 4 15 18 20
1_at 100 200 89 189 299 788
2_at 8 78 33 89 90 99
3_xt 300 45 53 234 89 34
4_dx 49 34 88 8 9 15
The data is separated by tabs.
I want to extract some columns from that table, based on the contents of a CSV file called "vector.csv", which contains the following data:
18,1,4,20
So I want to end up with a tab-separated file, "datamod.txt", that looks like this:
pid 18 1 4 20
1_at 299 100 89 788
2_at 90 8 33 99
3_xt 89 300 53 34
4_dx 9 49 88 15
With some help, I have written the following code:
fileName="vector.csv"
con=file(fileName,open="r")
controlfile<-readLines(con)
controls<-controlfile[1]
controlins<-controlfile[2]
test<-paste("pid",controlins,sep=",")
test2<-c(strsplit(test,","))
test3<-c(do.call("rbind",test2))
df<-read.table("data.txt",header=T,check.names=F)
CC <- sapply(df, class)
CC[!names(CC) %in% test3] <- "NULL"
df <- read.table("data.txt", header=T, colClasses=CC,check.names=F)
df<-df[,test3]
write.table(df,"datamod.txt",row.names=FALSE,sep="\t")
The problem is that my resulting file has the following format:
"pid" "18" "1" "4" "20"
"1_at" 299 100 89 788
"2_at" 90 8 33 99
"3_xt" 89 300 53 34
"4_dx" 9 49 88 15
My question is how to avoid the quotation marks ("") that appear in the saved file, so that the data is written the way I want.
Any help?
Thanks

To quote from the help file for write.table:
quote
a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
any character or factor columns will be surrounded by double quotes.
If a numeric vector, its elements are taken as the indices of columns
to quote. In both cases, row and column names are quoted if they are
written. If FALSE, nothing is quoted.
Therefore
write.table(df,"datamod.txt",row.names=FALSE,sep="\t", quote = FALSE)
should work nicely.
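As a minimal sketch of the effect, the snippet below writes a small made-up data frame (a cut-down, hypothetical version of the table in the question) to a temporary file with quote = FALSE and reads the raw lines back. The names df_small and out are illustrative only.

```r
# Hypothetical two-row subset of the question's table; check.names = FALSE
# keeps the numeric column names "18" and "1" unchanged.
df_small <- data.frame(pid = c("1_at", "2_at"),
                       `18` = c(299, 90),
                       `1` = c(100, 8),
                       check.names = FALSE)

# Write to a temporary file so nothing in the working directory is touched.
out <- tempfile(fileext = ".txt")
write.table(df_small, out, row.names = FALSE, sep = "\t", quote = FALSE)

# Inspect the raw lines: neither the header nor the pid column is quoted.
written <- readLines(out)
print(written)
```

With quote = TRUE (the default), the header and the character column pid would be wrapped in double quotes, as in the output shown in the question.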


How to extend a hash with multiple values in R

I understand that in R, a hash() is similar to a dictionary. I would like to extract specific values from my data frame and put them into a hash.
The componentindex column is where I have my keys, and the cluster.index and UniqueFileSourcesCount columns contain my values. So for the same key I have multiple values, e.g. hash {91: [1,15], [22,99], ...}.
I would like to create a hash that contains each key with multiple values, but I'm not sure how to do that.
mini_df <- head(df,10) #using a small df
compID <- unique(mini_df$componentindex) #list with unique keys
h1 <- hash()
for (i in 1:length(mini_df)){
  if(compID == mini_df[i,"componentindex"]){
    h1 <- hash(mini_df[i,"componentindex"] ,c(mini_df[i,"cluster.index"],mini_df[i,"UniqueFileSourcesCount"]))
  }
  #h2 <- append(h2,h1)
}
If I print h1, I end up with only the last value:
<hash> containing 1 key-value pair(s).
91 : 42 5
I understand why: I don't append to this hash but overwrite it. I'm not sure how to append to or expand hashes in R, and I have not been able to find a solution yet.
mini_df:
   UniqueFileSourcesCount cluster.index componentindex
1                      15             1             91
2                      15            10             -1
3                      99            22             91
4                      63            23           1675
5                      12            25             91
6                       6            27             91
7                      50            37             91
8                       5            42             91
9                       2            43             -1
10                      2            69             -1
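One possible way to collect several values per key is sketched below in base R, using split() to build a named list instead of the hash package. This is a swapped-in technique, not the hash API. The data frame is retyped from the mini_df printout above, and the names groups and pairs_91 are hypothetical.

```r
# mini_df retyped from the printout in the question.
mini_df <- data.frame(
  UniqueFileSourcesCount = c(15, 15, 99, 63, 12, 6, 50, 5, 2, 2),
  cluster.index          = c(1, 10, 22, 23, 25, 27, 37, 42, 43, 69),
  componentindex         = c(91, -1, 91, 1675, 91, 91, 91, 91, -1, -1)
)

# split() groups the value columns by key and returns a named list:
# one element per distinct componentindex, each holding all of its
# (cluster.index, UniqueFileSourcesCount) pairs as rows.
groups <- split(mini_df[, c("cluster.index", "UniqueFileSourcesCount")],
                mini_df$componentindex)

# Look up a key by its name (keys become character strings).
pairs_91 <- groups[["91"]]
print(pairs_91)
```

Because every row is assigned to its key in one step, nothing is overwritten inside a loop. The resulting list can be indexed by key much like a dictionary.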

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882
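The attempt in the question fails because the `:` operator is not vectorised: samples$row:10000 uses only the first element of samples$row, so every row gets the minimum of the whole column. As a rough check, the sketch below retypes the sample data above (rather than relying on set.seed) and confirms that the cummin approach and the sapply approach agree. The names n, fast and slow are illustrative only.

```r
# Sample values retyped from the "Data" printout above.
samples <- data.frame(val = c(561, 997, 321, 153, 74, 228, 146, 634, 49, 128,
                              303, 24, 839, 356, 601, 165, 622, 532, 410, 882))
n <- nrow(samples)

# Reverse, take the running minimum, reverse again: the minimum of each
# row and every row after it.
fast <- rev(cummin(rev(samples$val)))

# The explicit version: for each start row a, take min over rows a..n.
slow <- sapply(seq_len(n), function(a) min(samples$val[a:n]))

# Store the result as a new column, as the question intended.
samples$processed <- fast
print(isTRUE(all.equal(fast, slow)))
```

The cummin version makes a single pass over the column, while the sapply version rescans the remaining rows for every row, so the former is the better fit for a large data set.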

R: Reformatting data file

I have what I suspect is a simple data reformatting question. The data file (txt) is structured with the observation numbers on separate lines,
1
45 65
78 56
2
89 34
39 55
The desired output is,
1 45 65
1 78 56
2 89 34
2 39 55
Suggestions on how to make that conversion would be most appreciated. Thanks.
We could read the file with readLines, create an index variable that marks the observation-number lines, and split the lines by group. Then we remove the first element of each list element (the observation number), parse the remaining lines with read.table, and unnest:
lines <- readLines('file.txt')
library(stringr)
#remove leading/trailing spaces if any
lines <- str_trim(lines)
#flag the observation-number lines: they contain no white space
indx <- !grepl('\\s+', lines)
#cumsum the flags to create a grouping variable
indx1 <- cumsum(indx)
#split the lines by group and name the list elements after the observation numbers
lst <- setNames(split(lines, indx1), lines[indx])
#Use unnest after reading with read.table
library(tidyr)
unnest(lapply(lst, function(x) read.table(text=x[-1])), gr)
# gr V1 V2
#1 1 45 65
#2 1 78 56
#3 2 89 34
#4 2 39 55
Or we can use a base R approach with Map:
do.call(rbind,Map(cbind, gr=names(lst),
lapply(lst, function(x) read.table(text=x[-1]))))
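For a version that can be run without a file on disk, here is a self-contained sketch of the same base R idea. It assumes the six example lines from the question are supplied as a character vector; the names is_header, group and res are illustrative, and trimws stands in for str_trim.

```r
# The example lines from the question, inlined instead of read from file.txt.
lines <- c("1", "45 65", "78 56", "2", "89 34", "39 55")
lines <- trimws(lines)

# Observation-number lines contain no white space after trimming.
is_header <- !grepl("\\s", lines)

# Each header starts a new group; cumsum turns the flags into group ids.
group <- cumsum(is_header)

# One list element per observation, named after its observation number.
lst <- setNames(split(lines, group), lines[is_header])

# Drop the header line of each group, parse the rest, and prepend the id.
res <- do.call(rbind, Map(function(id, x) {
  cbind(gr = id, read.table(text = x[-1]))
}, names(lst), lst))
rownames(res) <- NULL
print(res)
```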

How to remove special character from data frame

I have imported data from a url and converted it to a data frame using the following code:
url <-"http://apims.doe.gov.my/v2/hourly2.php"
tables<- readHTMLTable(url)
try<-do.call(rbind, lapply(tables, data.frame, stringsAsFactors=FALSE))
The data has '*' next to the numbers. I would like to isolate the numbers only.
So instead of
52* 45* 67* 55*
I have
52 45 67 55
I have tried several methods to remove the * special character from the 3rd through 8th columns and convert those columns to numeric, but since this character also has a meaning in R, these are not working. I have tried:
x <- "~!##$%^&*"
str_replace_all(x, as.character(try[,3:8]), " ")
I have also tried:
gsub("*","",try[,3:8])
The only functions that have identified the * characters correctly are grep and grepl, but I need another function that will use the grep output to remove the '*' special character.
grep('*',try)
Try this:
dat<-do.call(rbind, lapply(tables, data.frame, stringsAsFactors=FALSE))
dat[, -(1:2)] <- sapply(dat[, -(1:2)], function(col) {
as.numeric(sub("[*]$", "", col))
})
head(dat)
# NEGERI...STATE KAWASAN.AREA MASA.TIME06.00AM MASA.TIME07.00AM MASA.TIME08.00AM MASA.TIME09.00AM MASA.TIME10.00AM MASA.TIME11.00AM
# NULL.1 Johor Kota Tinggi 52 53 52 50 50 49
# NULL.2 Johor Larkin Lama 51 51 51 NA 51 51
# NULL.3 Johor Muar 45 45 45 45 45 45
# NULL.4 Johor Pasir Gudang 56 56 55 56 56 56
# NULL.5 Kedah Alor Setar 50 50 50 50 50 49
# NULL.6 Kedah Bakar Arang, Sg. Petani NA NA NA NA NA NA
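The asterisk is a regular-expression quantifier, which is why it has to be placed in a character class, escaped, or matched literally. The sketch below shows three equivalent ways on a made-up character vector x (hypothetical values taken from the question's example, not the downloaded table).

```r
# Hypothetical values with a trailing asterisk, as in the question.
x <- c("52*", "45*", "67*", "55*")

# 1. Character class: inside [] the asterisk is a literal character.
a <- as.numeric(sub("[*]$", "", x))

# 2. Backslash escape (doubled, because R strings also use backslashes).
b <- as.numeric(sub("\\*$", "", x))

# 3. fixed = TRUE turns off regular-expression interpretation entirely.
cc <- as.numeric(gsub("*", "", x, fixed = TRUE))

print(a)
```

The first two forms remove only a trailing asterisk because of the $ anchor, while the fixed = TRUE form removes every asterisk in the string.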

R: How to divide a data frame by column values?

Suppose I have a data frame with 3 columns and 10 rows as follows.
# V1 V2 V3
# 10 24 92
# 13 73 100
# 25 91 120
# 32 62 95
# 15 43 110
# 28 54 84
# 30 56 71
# 20 82 80
# 23 19 30
# 12 64 89
I want to create sub-dataframes that divide the original by the values of V1.
For example,
the first data frame will have the rows with values of V1 from 10-14,
the second will have the rows with values of V1 from 15-19,
the third from 20-24, etc.
What would be the simplest way to make this?
So if this is your data
dd<-data.frame(
V1=c(10,13,25,32,15,28,30,20,23,12),
V2=c(24,73,91,62,43,54,56,82,19,64),
V3=c(92,100,120,95,110,84,71,80,30,89)
)
then the easiest way to split is using the split() command. And since you want to split in ranges, you can use the cut() command to create those ranges. A simple split can be done with
ss<-split(dd, cut(dd$V1, breaks=seq(10,35,by=5)-1)); ss
split returns a list where each item is the subsetted data.frame. So to get at the data.frame with the values for 10-14, use ss[[1]], and for 15-19, use ss[[2]] etc.
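To make the grouping concrete, here is a self-contained sketch that uses the values from the question's table and shows which rows land in which range. The variable name breaks is illustrative; the interval labels are the defaults that cut() generates.

```r
# Data retyped from the table in the question.
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 28, 30, 20, 23, 12),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)

# Breaks at 9, 14, 19, 24, 29, 34 give the right-closed intervals
# (9,14], (14,19], ... which cover 10-14, 15-19, and so on.
breaks <- seq(10, 35, by = 5) - 1
ss <- split(dd, cut(dd$V1, breaks = breaks))

# The list elements are named after the intervals.
print(names(ss))

# Rows with V1 between 10 and 14.
print(ss[["(9,14]"]])
```

Values of V1 that fall outside the break range become NA in cut() and are dropped by split(), so the breaks should span the full range of the column.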
