My sample data frame would look like the following:
1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y
I want to make rows 1, 3, and 5 column names and have the data below each fall into each column, respectively. I was looking into the reshape function, but I only saw examples where an entire column of values needed to be individual columns. So I wasn't sure what to do in this scenario--apologies if it's obvious.
Here is the desired output:
1 Number Type Code Reason Date Amount Damage Act State City Zip Phone
2 0123 06 09 010 08/31/16 10,000 Y N WI GB 1234 Y
Thanks
As some people have commented, you could build a data frame out of the rows of your starting data frame, but I think it's a little easier to work on the lines of text.
If your starting file looks something like this
Number , Type , Code ,Reason
0123 , 06 , 09 , 010
Date , Amount , Damage , Act
08/31/16 , 10000 , Y , N
State , City , Zip , Phone
WI , GB , 1234, Y
we can read it in with each line as an element of a character vector:
lines <- readLines("start.csv")
make all the odd lines into a single line:
oddind <- seq(from=1, to= length(lines), by=2)
namelines <- paste(lines[oddind], collapse=",")
make all the even lines into a single line:
datlines <- paste(lines[oddind+1], collapse=",")
make those lines into a new CSV to read:
writeLines(text= c(namelines, datlines), con= "nice.csv")
print(read.csv("nice.csv"))
This gives
Number Type Code Reason Date Amount Damage Act State
1 123 6 9 10 08/31/16 10000 Y N WI
City Zip Phone
1 GB 1234 Y
So, it's all in one row of the data frame and all the variable names show up correctly in the data frame.
The benefits of this strategy are:
It will work for starting CSV files where the number of variables isn't a multiple of 4.
It will work for starting CSV files with any number of rows.
There is no chance of weird dynamic casting happening with unlist() or as.character().
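The steps above can also be run without the intermediate file, since read.csv() accepts a text= argument; a minimal sketch with the six sample lines inlined as a character vector (in place of the readLines() call):

```r
# The six lines of the starting file, inlined for a self-contained example
lines <- c("Number , Type , Code ,Reason",
           "0123 , 06 , 09 , 010",
           "Date , Amount , Damage , Act",
           "08/31/16 , 10000 , Y , N",
           "State , City , Zip , Phone",
           "WI , GB , 1234, Y")
oddind    <- seq(from = 1, to = length(lines), by = 2)
namelines <- paste(lines[oddind], collapse = ",")
datlines  <- paste(lines[oddind + 1], collapse = ",")
# read.csv(text = ...) skips the temporary file; strip.white trims the
# padding around the comma-separated fields
res <- read.csv(text = paste(namelines, datlines, sep = "\n"),
                strip.white = TRUE)
dim(res)  # 1 row, 12 columns
```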
First, create a dataframe roughly like the one shown (although it will necessarily have column names). Those are probably factor columns if you used one of the standard read.* functions without stringsAsFactors=FALSE, hence the need to convert with as.character.
dat=read.table(text="1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y")
Then you can set the odd-numbered rows as the names of the vector of values taken from the even-numbered rows:
setNames(unlist(lapply(dat[!c(TRUE, FALSE), ], as.character)),
         unlist(lapply(dat[c(TRUE, FALSE), ], as.character)))
1 3 5 Number Date State Type
"2" "4" "6" "0123" "08/31/16" "WI" "06"
Amount City Code Damage Zip Reason Act
"10,000" "GB" "09" "Y" "1234" "010" "N"
Phone
"Y"
The !c(TRUE,FALSE) and its logical complement in the next extract operation get magically recycled along all the possible rows. Obviously there would be better ways of doing this if you posted a version of a text file rather than saying that the starting point was a dataframe. You would need to remove what were probably row names. If you want a "clean" solution, then post either dput(.) from your dataframe or the raw text file.
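If the goal is the one-row data frame from the question rather than a named vector, the same odd/even indexing can feed t() and as.data.frame(); a sketch that flattens row-by-row to keep the original left-to-right column order, and uses colClasses = "character" so values like "0123" keep their leading zero:

```r
dat <- read.table(text = "1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y", colClasses = "character")

# drop the old row numbers (column 1), then flatten row-by-row so the
# original left-to-right order is kept
vals        <- as.vector(t(as.matrix(dat[!c(TRUE, FALSE), -1])))
names(vals) <- as.vector(t(as.matrix(dat[c(TRUE, FALSE), -1])))
res <- as.data.frame(t(vals), stringsAsFactors = FALSE)
names(res)[1:4]  # "Number" "Type" "Code" "Reason"
res$Number       # "0123" -- leading zero intact thanks to colClasses
```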
c <- read.table("sid-110-20130826T164704.csv", sep = ',', fill = TRUE)
so I use the above code to read about 300 csv files.
and some files look like this
65792,1,round-5,72797,140,yellow,75397,192,red,75497,194,crash
86267,1,round6,92767,130,yellow,94702,168,brake,95457,178,go,95807,185,red,96057,190,brake,97307,200,crash
108092,1,round-7,116157,130,yellow,117907,165,red
120108,1,round-8,130173,130,yellow,130772,142,brake,133173,152,red
137027,1,round-9,147097,130,yellow,148197,152,brake,148597,160,red
As you can see, the second line is longer than the others (for each row the third element is supposed to have round#), and when I do read.table R cuts the line in half. Below I copied the first 5 columns from R:
9 86267 1 round-6 92767 130
10 95807 185 red 96057 190
11 108092 1 round-7 116157 130
12 120108 1 round-8 130173 130
is there a way to edit that so that the row is one line instead of being split?
You can prime the data.frame width by specifying the "col.names" argument along with "fill=TRUE", as in:
c <- read.table("sid-110-20130826T164704.csv", sep = ',', fill=TRUE,
col.names=paste("V", 1:21,sep=""))
That's assuming you know how many columns you have. If you don't know, you might want to make a single pass through the file to find the maximum width.
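That single pass can be done with count.fields(), which returns the number of fields on every line of a file; a sketch using a hypothetical two-line sample file shaped like the question's data:

```r
# Two lines shaped like the question's data (hypothetical file name)
writeLines(c("65792,1,round-5,72797,140,yellow",
             "86267,1,round-6,92767,130,yellow,94702,168,brake,95457,178,go"),
           "rounds.csv")
# One pass to find the widest line, then read with that many columns
nfields <- max(count.fields("rounds.csv", sep = ","))
dat <- read.table("rounds.csv", sep = ",", fill = TRUE,
                  col.names = paste0("V", seq_len(nfields)))
dim(dat)  # 2 rows, 12 columns
```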
I've had some temperature measurements in .csv format and am trying to analyse them in R. For some reason the data files contain temperature with a degree C following the numeric value. Is there a way to remove the degree C symbol and return the numeric value? I thought of producing an example here but did not know how to generate a degree symbol in a string in R. Anyhow, this is what the data looks like:
> head(mm)
dateTime Temperature
1 2009-04-23 17:01:00 15.115 °C
2 2009-04-23 17:11:00 15.165 °C
3 2009-04-23 17:21:00 15.183 °C
where the class of mm[,2] is 'factor'
Can anyone suggest a method for converting the second column to 15.115 etc?
You can remove the unwanted part and convert the rest to numeric all at the same time with scan(). Setting flush = TRUE makes scan() flush to the end of the line after reading the requested field, so the trailing "°C" is discarded (scan() splits on whitespace by default, so the number and the symbol are separate fields).
mm <- read.table(text = "dateTime Temperature
1 '2009-04-23 17:01:00' '15.115 °C'
2 '2009-04-23 17:11:00' '15.165 °C'
3 '2009-04-23 17:21:00' '15.183 °C'", header = TRUE)
replace(mm, 2, scan(text = as.character(mm$Temp), flush = TRUE))
# dateTime Temperature
# 1 2009-04-23 17:01:00 15.115
# 2 2009-04-23 17:11:00 15.165
# 3 2009-04-23 17:21:00 15.183
Or you can use a Unicode general category to match the unicode characters for the degree symbol.
type.convert(sub("\\p{So}C", "", mm$Temp, perl = TRUE))
# [1] 15.115 15.165 15.183
Here, the regular expression \p{So} matches various symbols that are not math symbols, currency signs, or combining characters. C matches the character C literally (case sensitive). And type.convert() takes care of the extra whitespace.
If all of your temperature values have the same number of digits you can make left and right functions (similar to those in Excel) to select the digits that you want. Such as in this answer from a different post: https://stackoverflow.com/a/26591121/4459730
First make the left function:
left <- function(string, char) {
  substr(string, 1, char)
}
Then recreate your Temperature string using just the digits you want:
mm$Temperature<-left(mm$Temperature,6)
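Note that left() returns a character vector, so a numeric conversion is still needed afterwards; a minimal sketch on made-up values matching the question:

```r
left <- function(string, char) {
  substr(string, 1, char)
}
# Hypothetical values in the same shape as the question's Temperature column
temps <- c("15.115 \u00b0C", "15.165 \u00b0C", "15.183 \u00b0C")
as.numeric(left(temps, 6))
# [1] 15.115 15.165 15.183
```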
The degree symbol is represented as \u00b0, hence the following should work (sub() coerces a factor column to character before substituting):
mm$Temperature <- as.numeric(sub("\u00b0C", "", mm$Temperature))
I have an ASCII file that includes a set of MODIS data containing a series of pixel values for each acquisition date. The data format is:
ASCII values are comma delimited
Data values start after header rows and are space delimited.
An example of two dates from the data is shown below:
----------------------------------------------------------------------------
MODIS HDF Tile MOD13Q1.A2003273.h11v03.005.2008260032604.hdf
Scientific Data Set (Band) 250m_16_days_EVI
Number of Values Passing QA Filter 81 of 81
Applying the Scale of .0001 MEAN: 0.24070987654321, STD-DEV: 0.0257345931611507
Unscaled MEAN: 2407.0987654321, STD-DEV: 257.345931611507
2213,2160,2206,2408,2369,2362,2423,2466,2318,2160,2429,2316,2260,2362,2431,2172,2021,2254,2424,2391,2427,2331,1934,2220,2235,2254,2186,2325,2046,1956,2273,2220,2235,2257,2425,2534,2141,2288,2273,2263,2436,2568,2603,2470,2561,2288,2369,2628,2725,2730,2603,2704,2744,2732,2624,2606,2694,2730,2718,2765,2771,2732,2771,2726,2694,2637,2699,2806,2712,2384,1904,1982,2747,2788,2610,2647,2408,2096,1946,1858,1791
----------------------------------------------------------------------------
MODIS HDF Tile MOD13Q1.A2003289.h11v03.005.2008263131227.hdf
Scientific Data Set (Band) 250m_16_days_EVI
Number of Values Passing QA Filter 81 of 81
Applying the Scale of .0001 MEAN: 0.261756790123457, STD-DEV: 0.0232843291670261
Unscaled MEAN: 2617.56790123457, STD-DEV: 232.843291670261
2074,2323,2382,2574,2614,2661,2631,2599,2525,2399,2548,2545,2541,2599,2415,2428,2417,2518,2549,2471,2539,2520,2407,2358,2426,2461,2575,2427,2412,2518,2500,2394,2509,2567,2569,2648,2414,2573,2498,2626,2509,2708,2694,2654,2702,2536,2750,2804,2917,2926,2942,2938,2844,2839,2863,2985,3006,2991,2997,2937,2830,2838,2607,3101,3093,3085,2950,2881,2608,2570,2499,2233,2912,2833,2819,2348,2426,2541,2243,2239,2071
A typical ASCII file has about 900 dates included, i.e. 900 "tiles" of information in exactly the same format as listed above, one after another. The number of pixels is the same in each, i.e. 81 values for each date.
What I would like to do is to read in the file and, for each date, extract the "MODIS HDF Tile" name, e.g. MOD13Q1.A2003289.h11v03.005.2008263131227.hdf, and each pixel value to individual columns, something like:
MODIS HDF Tile Scientific Data Set (Band) V2 V3 V4 V5 V6 V7...
MOD13Q1.A2003273.h11v03.005.2008263131227.hdf 250m_16_days_ENVI 2213 2160 2206 2408 2369 .......
MOD13Q1.A2003289.h11v03.005.2008263131227.hdf 250m_16_days_ENVI 2074 2323 2382 2574 2614 .....
Any help would be greatly appreciated!
Perhaps something like this could work
modis <- readLines("modis.txt")
headers <- grep("^MODIS", modis)
headtiles <- sapply(strsplit(modis[headers[1]],"\\s{2,}"), '[',1 )
headbands <- sapply(strsplit(modis[headers[1]+1],"\\s{2,}"), '[',1 )
tiles <- sapply(strsplit(modis[headers],"\\s{2,}"), '[',2 )
bands <- sapply(strsplit(modis[headers+1],"\\s{2,}"), '[',2 )
pxlines <- grep("(,.*?){5,}", modis)
pixels <- do.call(rbind, lapply(strsplit(modis[pxlines], ","), as.numeric))
dd <- data.frame(tiles, bands, pixels)
names(dd) <- c(headtiles, headbands, paste0("pixel", seq.int(ncol(pixels))))
Here we grep through all the lines to find the header line, then we assume the next line is the band line. And then we look for lines with lots of commas for the pixel values. This is making a lot of assumptions about the data file based on the limited sample you provided.
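As a self-contained check of those assumptions, here is the core of the same pipeline run on a two-date miniature of the file (only six pixel values per date, the mean/std-dev lines omitted, and two spaces between each header label and its value, as the \s{2,} split requires):

```r
# Miniature stand-in for readLines("modis.txt")
modis <- c("MODIS HDF Tile  MOD13Q1.A2003273.h11v03.005.2008260032604.hdf",
           "Scientific Data Set (Band)  250m_16_days_EVI",
           "Number of Values Passing QA Filter  81 of 81",
           "2213,2160,2206,2408,2369,2362",
           "MODIS HDF Tile  MOD13Q1.A2003289.h11v03.005.2008263131227.hdf",
           "Scientific Data Set (Band)  250m_16_days_EVI",
           "Number of Values Passing QA Filter  81 of 81",
           "2074,2323,2382,2574,2614,2661")
headers <- grep("^MODIS", modis)                      # header line positions
tiles   <- sapply(strsplit(modis[headers], "\\s{2,}"), "[", 2)
bands   <- sapply(strsplit(modis[headers + 1], "\\s{2,}"), "[", 2)
pxlines <- grep("(,.*?){5,}", modis)                  # lines with 5+ commas
pixels  <- do.call(rbind, lapply(strsplit(modis[pxlines], ","), as.numeric))
dd <- data.frame(tiles, bands, pixels)
dim(dd)  # 2 rows, 8 columns (tile, band, 6 pixels)
```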