Summary and descriptive table for mixed data in R

I want to write a function that calculates a set of predetermined summary statistics and that I can apply to any dataset. I'll start with an example here, but it needs to work for datasets with a variety of data types: character, factor, numeric, dates, columns containing null values, etc.
I can do this easily enough if the data is all numeric, but handling the IF scenarios with apply, sapply, etc. is where I run into trouble with the syntax.
When it's all numeric I'm fine, since I can just do new_df = data.frame(min = sapply(mydf, min), ...etc.). I just can't get the syntax right when it's more complicated, as in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type (character, factor, date, numeric, etc.)
mean (only when the data type is numeric, and excluding nulls)
number of null values in each column
I think this is simple enough, and I can run with it from here.
Copy and paste this code and assign it to a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")

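We can use sapply to loop over the columns, get the class, the mean, and the number of NA elements of each column, concatenate them with c(), and convert the result to a data.frame. For a non-numeric column, mean() returns NA with a warning.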
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)
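A variant of the same idea that avoids the warning from mean() on non-numeric columns could look like the sketch below. The helper name summarise_col is hypothetical, and the data frame is rebuilt by hand from the question's dput, so treat it as an illustration under those assumptions rather than the definitive approach.

```r
# Hypothetical helper: summarise one column as type, mean (numeric only), NA count.
# Everything is returned as character so the three values fit in one vector.
summarise_col <- function(x) {
  c(type      = class(x)[1],
    mean      = if (is.numeric(x)) as.character(mean(x, na.rm = TRUE)) else NA_character_,
    num_nulls = as.character(sum(is.na(x))))
}

# Data frame rebuilt by hand from the question's dput
df1 <- data.frame(
  allnumeric     = c(10, 20, 30, 40),
  char_or_factor = factor(c("cat", "dog", "dog", "bird")),
  num_with_null  = c(10, 100, NA, NA)
)

# sapply() returns a character matrix with one column per input column;
# the names of the returned vector become the row names.
res <- as.data.frame(sapply(df1, summarise_col), stringsAsFactors = FALSE)
print(res)
```

Because every cell is stored as character, numeric summaries would need as.numeric() before further arithmetic.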

Related

How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column

So what I have is two different data frames: the one I've been working on (df1) and the one with all the new data I need to put in the first one (df2). Df1 has several columns of zeroes, waiting for the data to be added in. Df2 has the data I need, and several more rows and columns that I don't care about beyond that data. Here is a small subset of the type of data I'm working with.
This is my first time posting my data so I hope I'm doing it right. Let me know if you need a different format.
df1:
structure(list(season = c(" FA15", " FA15", " FA15", " FA15",
" FA15", " FA15", " FA15", " FA15", " FA15", " FA15"), year = c("2015",
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015",
"2015"), territory.name = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), plot = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0"), color.band = c("APGBY",
"APGGU", "APGPW", "APGPW", "APGR", "APGUO", "APGUO", "APGUO",
"APGUO", "APGYR")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df2:
structure(list(bandnum = c(157328052, 160379101, 157328094, 151313455,
170364680, 160379104, 151373458, 157328066, 160379103, 160379105
), project = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("*ISSJ", "ISSJ"), class = "factor"), color.band = c("PAWR",
"WYWAR", "APGP", "APGO", "ABYG", "URYAR", "APBW", "WABG", "OBWAR",
"GBGAR"), sex = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 1L,
2L), .Label = c("?", "F", "M"), class = "factor"), age = structure(c(2L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("AHY", "ASY",
"HY", "N", "SY"), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
I've been chewing on this problem for a few days, trying different things and reading many answers on Stack Overflow, but I'm failing to come up with a clear answer on how to take data from one data frame and copy it into existing columns in another data frame based on the shared ID of a third column.
In short, I want R to see that both data frames have a listing for the band ABCDEF in the color.band column, and then take the value from df2$bandnum in the same row as ABCDEF and copy it to df1$bandnum in the ABCDEF row there.
I don't want to copy rows that are in df2 but not df1 into df1. I want to mark entries that exist in df1 but not df2 as NA in the bandnum column.
Column names and data format for color band and band number have been standardized between the two data frames so everything should line up.
What I have so far with code is this:
> practicedf <- left_join(x=df1, y=df2, by = "color.band", all.x = TRUE)
%>% mutate(y = ifelse(is.na(df1$color.band), df1$bandnum, df1$color.band)) %>% select(df2$bandnum)
left_join seems to be the right one because it keeps all rows in the left (df1) data frame and only matching rows from the right (df2) data frame.
I get this error though:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c("APGBY", "APGGU", "APGPW", :
replacement has 1261 rows, data has 2559
color.band is a character vector while bandnum is numerical, is that a problem? What could be the problem here?
Edit: I had an error with having the column bandnum in both dataframes so I changed df2$bandnum to bandnum.y. My code is now
df1_test <- left_join(x=df1, y=df2, by = "color.band") %>% mutate(y =
ifelse(is.na(color.band), bandnum, color.band)) %>% select(bandnum.y)
but when I view(df1_test) it only shows me the column bandnum.y and it's not the same number of entries as my original df1
Here's a subset of df1_test (not the whole thing because it's 2600 entries)
Any way I can make it show the rest of my data as well?
structure(list(bandnum.y = c("171324972", "171324972", "171324972",
"178324697", "178324697", "178324697", "178324697", "178324697",
"178324697", "178324697", "170364505", "170364505", "170364505",
"170364505", "170364505", "170364505", NA, "178324692", "178324692",
"178324692")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
We cannot refer to the columns of the original dataset with df1$ after the join, because the result of left_join is a new data frame that can have a different number of rows. In tidyverse, we specify the unquoted column names. Also, left_join has no all.x argument; that argument belongs to base R merge.
library(dplyr)
left_join(x=df1, y=df2, by = "color.band") %>%
mutate(y = ifelse(is.na(color.band), bandnum, color.band))
left_join does not have an all.x = TRUE argument; that is part of base R merge.
You could do the following in base R :
df1_test <- transform(merge(df1, df2, by = "color.band", all.x = TRUE),
y = ifelse(is.na(color.band), bandnum, color.band))
If I understand correctly, you want to update an old data frame (df1) with information from a new one (df2).
In data.table, you can try this:
library(data.table)
setDT(df1)
setDT(df2)
update.vars = intersect(names(df1), names(df2)) # update only common variables
df1[df2, c(update.vars) := df2[,update.vars, with=FALSE], on= 'color.band']
Generally this should work. However, in the given data the join key (the color.band column) is not unique, which may affect the results.
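For the stated goal (copy bandnum from df2 into df1 by color.band, keep every row of df1, and get NA where there is no match), a base R lookup with match() is another option. The sketch below uses small made-up data frames, not the real data; the band numbers 101 to 103 are placeholders.

```r
# Toy stand-ins for df1 and df2; the bandnum values are made up for illustration
df1 <- data.frame(color.band = c("APGBY", "APGGU", "APGPW", "APGPW"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(color.band = c("APGPW", "APGBY", "WYWAR"),
                  bandnum    = c(101, 102, 103),
                  stringsAsFactors = FALSE)

# match() gives, for each row of df1, the position of the first matching
# color.band in df2, or NA when the band is absent from df2.
idx <- match(df1$color.band, df2$color.band)

# Indexing with NA yields NA, so unmatched bands are marked NA automatically
df1$bandnum <- df2$bandnum[idx]
print(df1)
```

Unlike a join, this never changes the number of rows in df1. If a color.band value occurs more than once in df2, only its first occurrence is used.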

Simple text cleaning into all columns of a dataframe frame

I have a data frame to which I would like to apply some basic formatting rules.
The dataframe is:
df <- structure(list(colname1 = structure(c(2L, 1L, 1L), .Label = c("",
"TEXTA"), class = "factor"), colname2 = structure(c(2L, 1L, 3L
), .Label = c("TEXTA", "TEXTB", "TEXTE"), class = "factor"),
colname3 = structure(c(2L, 3L, 1L), .Label = c("", "TEXTC",
"TEXTD"), class = "factor")), .Names = c("colname1", "colname2",
"colname3"), class = "data.frame", row.names = c(NA, -3L))
I try to run the following for the whole dataframe data:
df2 <- as.data.frame(tolower(df))
df2 <- as.data.frame(gsub("[[:punct:]]", "", df2))
but this converts the column names of the data frame to rows. What can I do to convert all rows of the example data frame to lower case and remove punctuation from them (I am not interested in the column names)?
We convert each column to lower case and remove the punctuation characters by looping through the columns with lapply(df, ...), then assign the output back to the original dataset
df[] <- lapply(df, function(x) gsub("[[:punct:]]+", "", tolower(x)))
Using tidyverse, this can be done with across() (funs() is deprecated and mutate_all() is superseded in current dplyr)
library(dplyr)
df %>%
mutate(across(everything(), ~ gsub("[[:punct:]]+", "", tolower(.x))))

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include all the rows for that day (2016-06-30 in my example above)
Some columns (strings) will result in an array containing all the values present on that day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
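# Keep only the day part of the timestamp
mydat$date <- as.Date(mydat$date)
library(data.table)
mydat.dt <- data.table(mydat)
# Collapse the string columns per day; .SDcols keeps tally and score out of the paste
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date, .SDcols = c("location", "value")]
# Sum the integer columns per day and bind them on
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=TRUE)[2:3])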
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that you could probably do it all in one step within the data.table call, but I found it quicker and easier to achieve the same thing in two steps.
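If repeated values should appear only once per day (as in the question's desired [home, work]), the same two-step idea can be sketched in base R alone. The helper collapse_unique and the column name day are assumptions made for this illustration, and the data is retyped from the question with plain character columns.

```r
# Data retyped from the question, with strings instead of factors
mydat <- data.frame(
  date     = c("2016-06-30T09:30Z", "2016-06-30T12:30Z", "2016-06-30T19:30Z"),
  location = c("home", "work", "home"),
  value    = c("foo", "foo", "bar"),
  tally    = c(1L, 2L, NA),
  score    = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Take the first 10 characters (YYYY-MM-DD) as the grouping day
mydat$day <- as.Date(substr(mydat$date, 1, 10))

# Hypothetical helper: distinct values of a group joined into one string
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# Step 1: collapse the string columns per day
strings <- aggregate(mydat[c("location", "value")],
                     by = list(day = mydat$day), FUN = collapse_unique)

# Step 2: sum the integer columns per day, ignoring NA
sums <- aggregate(mydat[c("tally", "score")],
                  by = list(day = mydat$day), FUN = sum, na.rm = TRUE)

# Combine both summaries on the shared day column
daily <- merge(strings, sums, by = "day")
print(daily)
```

This produces collapsed strings rather than true list columns; keeping the values as vectors would need a list column instead.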

R: Find identical strings in two lists

I have two lists of strings and want to find which strings are in both lists.
I tried converting the lists to vectors so that I could use intersect or setequal but that converted all the strings to numbers and (apologies if there's an obvious answer I can't figure out), I can't seem to convert the lists without that happening.
What's the best way forward?
EDIT:
I have these data frames:
dput(s)
structure(list(V1 = structure(c(3L, 2L, 1L, 4L), .Label = c("24d2afb212410711de0e237e5435e104",
"2a3d9ca791a579a14883de538a012e24", "a90b03209a8095ec406809d89d5035c3",
"f271eb38cc409c6bfe9dcf2bfcab8471"), class = "factor")), .Names = "V1", row.names = c(NA,
-4L), class = "data.frame")
dput(r)
structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("24d2afb212410711de0e237e5435e104",
"2a3d9ca791a579a14883de538a012e24", "7320e2e921df862968954d4b60e2a80a",
"a9f47ec7c488d2bcddf2c1adc2bf6305"), class = "factor")), .Names = "V1", row.names = c(NA,
-4L), class = "data.frame")
I want to find the strings that are in both, i.e.
2a3d9ca791a579a14883de538a012e24 and 24d2afb212410711de0e237e5435e104.
as.character() doesn't work for preserving those strings; is there something else that would work for converting them into factors or is there another operation that would work better?
You need to specify the columns within your data frames too.
With intersect,
intersect(r$V1, s$V1)
#[1] "2a3d9ca791a579a14883de538a012e24" "24d2afb212410711de0e237e5435e104"
With grep,
unlist(sapply(r$V1, function(i)grep(i, s$V1, value = TRUE)))
#[1] "2a3d9ca791a579a14883de538a012e24" "24d2afb212410711de0e237e5435e104"
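The "strings turned into numbers" symptom is typical of factors: coercing a factor to numeric returns its internal integer codes rather than its labels. A minimal sketch that converts explicitly with as.character() before comparing is shown below; the short strings are made-up stand-ins for the full hashes in the question.

```r
# Shortened, made-up stand-ins for the hashes, stored as factors like the original
s <- data.frame(V1 = factor(c("a90b", "2a3d", "24d2", "f271")))
r <- data.frame(V1 = factor(c("2a3d", "24d2", "a9f4", "7320")))

# Coercing a factor to integer exposes the codes, not the strings
codes <- as.integer(s$V1)

# Convert the labels to character first, then compare
r_chr <- as.character(r$V1)
s_chr <- as.character(s$V1)

common  <- intersect(r_chr, s_chr)   # distinct values present in both
in_both <- r_chr[r_chr %in% s_chr]   # same idea, keeping the order and duplicates of r

print(common)
```

Exact comparison with intersect() or %in% also avoids the partial matches that a regular expression search with grep() can produce.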

as.numeric is rounding off values

I am trying to convert a character column of a data frame to numeric. However, what I get as a result are rounded values.
Nothing I have tried from other questions of the same nature on SO has worked for me. I have checked the class of the column vector I am trying to convert, and it is character, not factor.
Here is my code snippet:
data1 <- read.csv("file.csv", nrows = 100, colClasses = c("factor", "factor", "character", "character"))
# Strip every character that is not a digit or a decimal point
y <- Vectorize(function(x) gsub("[^\\.\\d]", "", x, perl = TRUE))
data1$colC <- y(data1$colC)
data1$colD <- y(data1$colD)
data1$colC <- as.numeric(data1$colC)
data1$colD <- as.numeric(data1$colD)
Edit:
> dput(head(data1))
structure(list(colA = structure(c(2L, 2L, 5L, 6L, 5L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), colB = structure(c(1L,
2L, 2L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
colC = c("4023107.87", "3180863.42", "2558777.81", "2393736.25",
"1333148.48", "1275627.13"), colD = c("49731596.35", "33604210.26",
"20807573.12", "20061467.30", "10488358.77", "10442249.09"
)), .Names = c("colA", "colB", "colC", "colD"), row.names = c(NA,
6L), class = "data.frame")
I think this is a representation problem, not an actual rounding problem ...
options("digits") ## 7
From ?options:
‘digits’: controls the number of digits to print when printing numeric values. It is a suggestion only. Valid values are 1...22 with default 7. See the note in ‘print.default’ about values greater than 15.
digits can be reset either on a one-off basis, i.e. print(object,digits=...), or globally, i.e. options(digits=20) (20 is probably overkill but helps you see what's happening: based on the results below, 10 might serve your needs well.)
as.numeric(data1$colC)
[1] 4023108 3180863 2558778 2393736 1333148 1275627
print(as.numeric(data1$colC),digits=10)
[1] 4023107.87 3180863.42 2558777.81 2393736.25 1333148.48 1275627.13
print(as.numeric(data1$colC),digits=20)
[1] 4023107.8700000001118 3180863.4199999999255 2558777.8100000000559
[4] 2393736.2500000000000 1333148.4799999999814 1275627.1299999998882
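To show a fixed number of decimals without changing the global option, the values can also be formatted explicitly. A small sketch, using two of the values from the dput above; the variable names are only for illustration:

```r
# Two of the character values from the question's data
x <- as.numeric(c("4023107.87", "49731596.35"))

# Printing uses 7 significant digits by default, so the decimals are hidden
print(x)

# sprintf() formats to exactly two decimals; the result is character, for display only
formatted <- sprintf("%.2f", x)
print(formatted)

# The stored number itself was never rounded to an integer
unchanged <- x[1] == 4023107.87
print(unchanged)
```

The numeric column should stay numeric for calculations; the formatted strings are only meant for output.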
