Simple text cleaning into all columns of a dataframe frame - r

I have a dataframe which I would like to implement some basic formation rules.
The dataframe is:
df <- structure(list(colname1 = structure(c(2L, 1L, 1L), .Label = c("",
"TEXTA"), class = "factor"), colname2 = structure(c(2L, 1L, 3L
), .Label = c("TEXTA", "TEXTB", "TEXTE"), class = "factor"),
colname3 = structure(c(2L, 3L, 1L), .Label = c("", "TEXTC",
"TEXTD"), class = "factor")), .Names = c("colname1", "colname2",
"colname3"), class = "data.frame", row.names = c(NA, -3L))
I try to run the following for the whole dataframe data:
df2 <- as.data.frame(tolower(df))
df2 <- as.data.frame(gsub("[[:punct:]]", "", df2))
but this converts the column names of dataframe to rows. What can I do to make in lower case and remove punctuation from all rows of the example dataframe (I am not interesting for colnames)?

We remove the punctuation characters on each column by looping through the columns (lapply(df, ..), assign the output back to the original dataset
df[] <- lapply(df, function(x) gsub("[[:punct:]]+", "", tolower(x)))
Using tidyverse, this can be done by
library(dplyr)
df %>%
mutate_all(funs(gsub("[[:punct:]]+", "", tolower(.))))

Related

trying to summarize survey data for questions with 'select all that apply' using R

We have a survey that asks for 'select all that apply' so the result is a string inside quotes with the values separated by commas. i.e. "red, black,green"
There are other question about income so I have a factor with 'low, medium, high'
I want to be able to answer questions: What percent selected 'Red', then group that by income.
I can split the string with
'''df4 <- c("black,silver,green")'''
I can create a data frame with a timestamp and the split string with
'''t2 <- as.data.frame(c(df2[2],l2))'''
I am not able to understand how to do this for all rows at one time.
Here is a DPUT of the input:
structure(list(RespData = structure(1:2, .Label = c("1/20/2020",
"1/21/2020"), class = "factor"), CarColor = c("red,blue,green,yellow",
"black,silver,green")), row.names = c(NA, -2L), class = "data.frame")
and here is a DPUT of the desired output:
structure(list(RespData = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("1/20/2020", "1/21/2020"), class = "factor"),
Cars = structure(c(3L, 1L, 2L, 4L, 5L, 6L, 2L), .Label = c("blue",
"green", "red", "yellow", "black", "silver"), class = "factor")), row.names = c(NA,
-7L), class = "data.frame")
Example of Function:
MySplitFunc <- function(ListIn) {
# build an empty data frame and set the column names
x1.all <- ListIn[0,]
names(x1.all) <- c("ResponseTime", "Descriptive")
# for each row build the data and combine to growing list
for(x in 1:nrow(ListIn)) {
#print(x)
r1 <- ListIn[x,1]
c1 <- strsplit(ListIn[x,2],",")
x1 <- as.data.frame(c(r1,c1))
# set the names and combine to all
names(x1) <- c("ResponseTime", "Descriptive")
x1.all <- rbind(x1.all,x1)
}
# strip the whitespace
x1.all <- data.frame(lapply(x1.all, trimws), stringsAsFactors = TRUE)
return(x1.all)
}

summary and descriptive table for mixed data in R

I want to make a function that calculates some pre-determined summary statistic measures that I can apply to any dataset. I'll start off with an example here, but this is for datasets that could have a variety of datatypes - such as character, factor, numerical, dates, containing null values, etc.
I can do this easy enough if the data is all numeric - but handling the IF scenarios w/ apply, sapply, etc is where I run into trouble with the syntax.
When its all numeric I'm great since I can just do new_df = data.frame(min = sapply(mydf, 2,min).....etc....etc). I just can't get the syntax right when its more complicated like in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type...(character, factor, date, numeric, etc)
mean...when the data-type is numeric obviously , and excluding nulls
number of null values in the dataset
I think this is simple enough and I can run with it from here..
copy and paste this code and name as a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")
We can use sapply to loop over the columns, get the class, mean and number of NA elements, concatenate (c() and convert to data.frame
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)

how to find similar strings within a data

My data looks like this
df<- structure(list(A = structure(c(7L, 6L, 5L, 4L, 3L, 2L, 1L, 1L,
1L), .Label = c("", "P42356;Q8N8J0;A4QPH2", "P67809;Q9Y2T7",
"Q08554", "Q13835", "Q5T749", "Q9NZT1"), class = "factor"), B = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("P62861", "P62906",
"P62979;P0CG47;P0CG48", "P63241;Q6IS14", "Q02413", "Q07955",
"Q08554", "Q5T749", "Q9UQ80"), class = "factor"), C = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("", "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876",
"P63241;Q6IS14", "Q02413", "Q16658", "Q5T750", "Q6P1N9", "Q99497",
"Q9UQ80"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA,
-9L))
I want to count how many elements are in each columns including those that are separated with a ; , for example in this case
first column has 9, second column has 12 elements and the third column has 16 elements. then I want to check how many times a element is repeated in other columns . for example
string number of times columns
Q5T749 2 1,2
then remove the strings which are seen more than once from the df
One way to approach this is to start by re-organizing the data into a form that is more convenient to work with. The tidyr and dplyr packages are useful for that sort of thing.
library(tidyr)
df$index <- 1:nrow(df)
df <- gather(df, key = 'variable', value = 'value', -index, na.rm = TRUE)
df <- separate(df, "value", into = paste("x", 1:(1 + max(nchar(gsub("[^;]", "", df$value)))), sep = ""), sep = ";", fill = "right")
df <- gather(df, "which", "value", -index, -variable)
Once you do that counting each element is easy:
addmargins(t(table(df[, c("variable", "value")])), margin = 2)
Dropping duplicates is also easy.
df <- df[!duplicated(df$value), ]
If you really want to put the data back into the original for you can (though I don't recommend it).
df <- spread(df, key = "variable", value = "value")
library(dplyr)
summarize(group_by(df, index),
A = paste(na.omit(A), collapse = ";"),
B = paste(na.omit(B), collapse = ";"),
C = paste(na.omit(C), collapse = ";"))
For the count of elements in each column use this
sapply(df,function(x) length(unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";"))))
For counting the repetition use this
words <- lapply(df,function(x) unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";")))
dup_table <- table(unlist(words))
dup_table
There is a very bad approach to remove the repetition
pat <- names(dup_table)[unname(dup_table)>1]
for(i in pat)
df <- as.data.frame.list(lapply(df,function(x) gsub(pattern = i,replacement = "",x)))
But, there is only one problem. It will replace all the occurences of a particular pattern.

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foor, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
require(data.table)
mydat.dt <- data.table(mydat)
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date]
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=T)[2:3])
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.

Replacing rows in R

In R am reading a file with comments as csv using
read.data.raw = read.csv(inputfile, sep='\t', header=F, comment.char='')
The file looks like this:
#comment line 1
data 1<tab>x<tab>y
#comment line 2
data 2<tab>x<tab>y
data 3<tab>x<tab>y
Now I extract the uncommented lines using
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data = read.data.raw[-comment_ind,]
Which leaves me:
data 1<tab>x<tab>y
data 2<tab>x<tab>y
data 3<tab>x<tab>y
I am modifying this data through some separate script which maintains the number of rows/cols and would like to put it back into the original read data (with the user comments) and return it to the user like this
#comment line 1
modified data 1<tab>x<tab>y
#comment line 2
modified data 2<tab>x<tab>y
modified data 3<tab>x<tab>y
Since the data I extracted in read.data preserves the row names row.names(read.data), I tried
original.read.data[as.numeric(row.names(read.data)),] = read.data
But that didn't work, and I got a bunch of NA/s
Any ideas?
Does this do what you want?
read.data.raw <- structure(list(V1 = structure(c(1L, 3L, 2L, 4L, 5L),
.Label = c("#comment line 1", "#comment line 2", "data 1", "data 2",
"data 3"), class = "factor"), V2 = structure(c(1L, 2L, 1L, 2L, 2L),
.Label = c("", "x"), class = "factor"), V3 = structure(c(1L, 2L, 1L,
2L, 2L), .Label = c("", "y"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data <- read.data.raw[-comment_ind,]
# modify V1
read.data$V1 <- gsub("data", "DATA", read.data$V1)
# rbind() and then order() comments into original places
new.data <- rbind(read.data.raw[comment_ind,], read.data)
new.data <- new.data[order(as.numeric(rownames(new.data))),]

Resources