Data Cleanup with R: Getting rid of extra fullstops - r

I am cleaning the data using R.
Below is my data format
Input
1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |
expected output:
1) 100 | 101.25 | 102.25 |NA | NA |201.5
2) 200.05|200.26 | 205 |NA | NA |3000
3) 300.98|300.26 |2001.26 |NA |0.2 |5.65
there are extra full stops at in the table, which I am trying to get cleaned, but to retain decimal numbers in its format
I tried replace all in R, which clears all the full stops, and decimal numbers are distorted.

If the trailing full stop is really the only manifestation of the problem, then you may try just removing it with sub:
x <- c("101.25", "200.56.", "300.26")
x <- sub("\\.$", "", x)

You can use look-ahead to replace dot(.) which are not before space or | as:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'
y <- gsub("([.]+)(?=[[:blank:]|])","",x,perl = TRUE)
cat(y)
# 1) 100 | 101.25 | 102.25 | | | 201.5 |
# 2) 200.05 | 200.56 | 205 | | | 3000 |
# 3) 300.98 | 300.26 | 2001.56| | 0.2| 5.65 |
Regex explanation:
([.]+) - Group any number of . before look-ahead
(?=[[:blank:]|]) - Look-ahead before :blank: or |
Data:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'

Related

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurence of numbers
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))

Converting comma separated list to dataframe

If I have a list similar to x <- c("Name,Age,Gender", "Rob,21,M", "Matt,30,M"), how can I convert to a dataframe where Name, Age, and Gender become the column headers.
Currently my approach is,
dataframe <- data.frame(matrix(unlist(x), nrow=3, byrow=T))
which gives me
matrix.unlist.user_data...nrow...num_rows..byrow...T.
1 Name,Age,Gender
2 Rob,21,M
3 Matt,30,M
and doesn't help me at all.
How can I get something which resembles the following from the list mentioned above?
+---------------------------------------------+
| name | age | gender |
| | | |
+---------------------------------------------+
| | | |
| | | |
| ... | ... | ... |
| | | |
| | | ++
+---------------------------------------------+
| | | |
| ... | ... | ... |
| | | |
| | | |
+---------------------------------------------+
We paste the strings into a single string with \n and use either read.csv or read.table from base R
read.table(text=paste(x, collapse='\n'), header = TRUE, stringsAsFactors = FALSE, sep=',')
Alternatively,
data.table::fread(paste(x, collapse = "\n"))
Name Age Gender
1: Rob 21 M
2: Matt 30 M

How to make a multiple corpora in R

This is a car review data which has more than 40,000 rows and each review has more than 500 characters. This is sample data : https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge review column by brands like this :
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So, I tired to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But, It takes very long.
Is there any simple way to merge that?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
newDf <- data.frame(brand = names(brand.split),
text <- favorite.grouped,
favorite <- favorite.grouped,
stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.

How to find correlations in a dataset containing over 350 columns in R

I have a dataset with ~360 measurement types listed as columns and has 200 rows each with unique ID.
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| | ID | M1 | M2 | M3 | M4 | M5 | … | M360 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1 | 6F0ZC | 0.068 | 0.0691 | 37.727 | 42.6139 | 41.7356 | … | 44.9293 |
| 2 | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3 | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262 | … | 45.7038 |
| 4 | 6F108 | 0.0685 | 0.069 | 38.3026 | 43.5699 | 42.3 | … | 46.1701 |
| 5 | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6 | 6F19W | 0.0682 | 0.071 | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7 | 6F1A0 | 0.0681 | 0.069 | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8 | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438 | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528 | 44.606 | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
I am trying to find correlations within these measurements and check for highly correlated features and visualize them. With so many columns, I am not able to do the regular correlation plots. (chart.Correlation,corrgram,etc..)
I also tried using qgraph but the measurements get cluttered at one place and is not very intuitive.
library(qgraph)
qgraph(cor(df[-c(1)], use="pairwise"),
layout="spring",
label.cex=0.9,
minimum = 0.90,
label.scale=FALSE)
Is there a good approach to visualize it & tell how these measurements are correlated with each other?
As mentioned in a comment, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heat map, where color represents the correlation coefficient.
# create artificial dataset - you have this already
set.seed(1) # for reproducible example
df <- matrix(rnorm(180*100),nr=100)
df <- do.call(cbind,lapply(1:180,function(i)cbind(df[,i],2*df[,i])))
# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels=rownames(cor.df))
gg.df <- melt(cor.df,id="x",variable.name="y", value.name="cor")
# tiles colored continuously based on correlation coefficient
ggplot(gg.df, aes(x,y,fill=cor))+
geom_tile()+
scale_fill_gradientn(colours=rev(heat.colors(10)))
coord_fixed()
# tiles colors based on increments in correlation coefficient
gg.df$level <- cut(gg.df$cor,breaks=6)
ggplot(gg.df, aes(x,y,fill=level))+
geom_tile()+
scale_fill_manual(values=rev(heat.colors(5)))+
coord_fixed()
Note the diagonal. This is by design - the contrived data is set up so that rows i and i+1 are perfectly correlated, for every other row.

Extracting columns from text file

I load a text file (tree.txt) to R, with the below content (copy pasted from JWEKA - J48 command).
I use the following command to load the text file:
data3 <-read.table (file.choose(), header = FALSE,sep = ",")
I would like to insert each column into a separate variables named like the following format COL1, COL2 ... COL8 (in this example since we have 8 columns). If you load it to EXCEL with delimited separation each row will be separated in one column (this is the required result).
Each COLn will contain the relevant characters of the tree in this example.
How can separate and insert the text file into these columns automatically while ignoring the header and footer content of the file?
Here is the text file content:
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
An example of the COL1 content will be:
MSTV
|
|
|
|
|
|
|
|
MSTV
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COL2 content will be:
MLTV
MLTV
|
|
|
|
|
|
>
DP
|
|
|
|
|
|
|
|
|
|
|
|
DP
|
|
|
|
|
|
Try this:
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
header = FALSE,
widths = rep.int(4, max(nchar(cleaned.txt)/4)),
strip.white= TRUE
)
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
For the cleaning process, I end up using a combination of head and tail to remove the 4 spaces on the top and the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Generally, I'm just making the file readable to R.
Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.
Finally, I'm not sure how your data is actually structured, but when I copied it from stackoverflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess at how long the file is, and removing extraneous columns over here
widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
Next, I'm creating the data in the way you would like it structured.
for (i in colnames(cleaned.df)) {
assign(i, subset(cleaned.df, select=i))
assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}
rm(i)
rm(cleaned.df)
rm(cleaned.txt)
What this does is it creates a loop for each column header in your data frame.
From there it uses assign() to put all the data in each column into its' own data frame. In your case, they are named V1 through V15.
Next, it uses a combination of cat() and paste() with unlist() an capture.output() to concatenate your list into a single character vectors, for each of the data frames, so they are now character vectors, instead of data frames.
Keep in mind that because you wanted a space at each new character, I'm using a space as a separator. But because this is a fixed-width file, some columns are completely blank, which I'm removing using
get(i)[get(i)!=""]
(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).
If we just use get(i), there will be a leading whitespace in the output.

Resources