My initial data looks like this:
ID<-c(1,2,3,4,5)
Value<-c("1,2","0,-1",1,"","")
Data<-data.frame(ID, Value)
I want to create a MeanValue from Value for every row. If a row has no entry in Value, I would like to use the mean of the values computed for the other rows instead.
My idea for computing the mean in the first step was:
library(stringr)
AverageMean<-mean(as.numeric(str_split(Data$Value, ",")))
But it throws an error.
The final data should look roughly like this:
ID<-c(1,2,3,4,5)
Value<-c("1,2","0,-1",1,"","")
AverageMean<-c(1.5,-0.5,1,0.666,0.666)
FinalData<-data.frame(ID, Value, AverageMean)
Working from your code: str_split on the column of interest returns a list with one element per row. To get the mean of each list element, use lapply with mean, then unlist the result. Finally, replace the last value, Val[length(Val)], with the mean of all the other values and store the result in a new column, AverageMean.
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
Val[length(Val)] <- mean(Val[-length(Val)], na.rm=TRUE)
Data$AverageMean <- Val
Data
#  ID Value AverageMean
#1  1   1,2   1.5000000
#2  2  0,-1  -0.5000000
#3  3     1   1.0000000
#4  4         0.6666667
Update
If you have multiple missing values and want to replace each of them with the mean of the column:
Data <- data.frame(ID=1:5, Value=c("1,2", "0,-1", 1, "", ""))
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
The steps above are the same as before. Next, create a logical index with is.na and replace all the missing values with the mean of the non-missing values, which are selected by negating the index with !is.na.
Val[is.na(Val)] <- mean(Val[!is.na(Val)])
Data$AverageMean <- Val
Data
#  ID Value AverageMean
#1  1   1,2   1.5000000
#2  2  0,-1  -0.5000000
#3  3     1   1.0000000
#4  4         0.6666667
#5  5         0.6666667
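For reference, the whole approach can be sketched in base R without stringr. This is only an illustration under stated assumptions: strsplit stands in for str_split, vapply for unlist(lapply(...)), and the five-row data from the update is rebuilt inline so the block runs on its own.

```r
# Self-contained sketch in base R (strsplit stands in for stringr::str_split).
Data <- data.frame(ID = 1:5, Value = c("1,2", "0,-1", "1", "", ""))

# Split each row on commas, then average the numeric pieces of that row.
# An empty string yields no numbers, so its mean comes out as NaN.
pieces <- strsplit(as.character(Data$Value), ",", fixed = TRUE)
Val <- vapply(pieces, function(x) mean(as.numeric(x)), numeric(1))

# Fill every missing row mean with the mean of the rows that do have one.
# is.na() is TRUE for NaN as well, and na.rm = TRUE drops NaN too.
Val[is.na(Val)] <- mean(Val, na.rm = TRUE)
Data$AverageMean <- Val
Data
```

Using vapply with numeric(1) makes the code fail loudly if any row does not reduce to a single number, which is a slightly stricter choice than unlist.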
Related
Let's say I have the following list of data frames (in reality I have many more).
seq <- c("12345","67890")
li <- list()
for (i in 1:length(seq)){
li[[i]] <- list()
names(li)[i] <- seq[i]
li[[i]] <- data.frame(A = c(1,2,3),
B = c(2,4,6))
}
What I would like to do is calculate the mean of each cell position across the data frames in the list, keeping the same number of rows and columns as the originals. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output for this example (unsurprisingly, since both data frames are identical):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each list are not necessarily the same.
If there are no NAs, then we can Reduce to get the sum of observations for each element and divide by the length of the list
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, it may be better to use mean, which has an na.rm argument. For this, we can convert the list to a 3D array and then use apply, passing na.rm = TRUE through to mean:
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean, na.rm = TRUE)
The tidyverse equivalent of the Reduce approach would be
library(tidyverse)
reduce(li, `+`)/length(li)
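To see how the array approach behaves when NAs are present, here is a small sketch. The values in the second data frame are made up for illustration, and the step that restores the column names is an addition not present in the answer above.

```r
# Two data frames with the same shape; the second one has made-up NAs.
li <- list(
  "12345" = data.frame(A = c(1, 2, 3),  B = c(2, 4, 6)),
  "67890" = data.frame(A = c(3, NA, 5), B = c(4, 4, NA))
)

# Stack the data frames into a rows x cols x n array. unlist() flattens
# column by column, which matches the column-major layout array() expects.
arr <- array(unlist(li), dim = c(dim(li[[1]]), length(li)))

# Average over the third dimension for every (row, col) cell, ignoring NAs.
cell_means <- apply(arr, c(1, 2), mean, na.rm = TRUE)

# apply() drops the column names, so put them back.
dimnames(cell_means) <- list(NULL, names(li[[1]]))
result <- as.data.frame(cell_means)
result
```

Note that Reduce(`+`, li)/length(li) would return NA for any cell where at least one data frame has an NA, which is why the mean-based version is preferable in that case.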
With a data frame like below:
set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
prop1 = rep(c("A", "B"), 3),
prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
prop3=sample(3:6, 6, replace = TRUE))
> df
  id prop1 prop2 prop3
1  2     A FALSE     4
2  2     B  TRUE     4
3  3     A FALSE     6
4  1     B  TRUE     5
5  3     A FALSE     3
6  3     B FALSE     4
I need to aggregate by id such that, for each column prop1 to propN, histogram data is generated as follows.
For each id,
prop1 needs to capture the ratio of each discrete value ("A", "B") across all records with the same id, accessible via names like prop1[["A"]] and prop1[["B"]]
prop2 needs to capture the ratio of each discrete value ("TRUE", "FALSE") across all records with the same id, accessible via names like prop2[["TRUE"]] and prop2[["FALSE"]]
prop3 needs to capture the ratio of each discrete value ("3", "4", "5", "6") across all records with the same id, accessible via names like prop3[["3"]], prop3[["4"]], prop3[["5"]], prop3[["6"]]
How can I get this aggregation done for prop1 to propN in the above format using base R?
Update: adding a representation of the output.
I'm not certain about the right data type for the output and its components. However, a spreadsheet view of the output would be as follows. In reality, the output should be in a form that can be used as a lookup table for the per-id distribution in further computation.
Here is an idea which uses a custom function defined as follows:
It splits the data frame based on the id and applies the formula (prop.table(table(...))) for finding the ratio. The n acts as an index so as to identify for which column you need the ratio. If n is 2 for example, then fun1 will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id). Finally, we apply the function via looping through 2:ncol(df) (so in your case 2:4) in order to get the ratio for all columns of interest, for each id.
#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)
fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i)
prop.table(table(i[,n])))))}
data.frame(id = unique(sort(df$id)),
do.call(cbind, sapply(2:ncol(df), function(i)fun1(df, i))))
#  id         A         B FALSE. TRUE.        X3        X4 X5        X6
#1  1 0.0000000 1.0000000    0.0   1.0 0.0000000 0.0000000  1 0.0000000
#2  2 0.5000000 0.5000000    0.5   0.5 0.0000000 1.0000000  0 0.0000000
#3  3 0.6666667 0.3333333    1.0   0.0 0.3333333 0.3333333  0 0.3333333
Another way to structure this would be to create a list and name each element after the corresponding column of your original df, i.e.
l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])
#so you can extract each one separately,
l1[['prop1']]
# A B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333
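Since the goal is a per-id lookup table, a slightly different base R sketch may also be worth considering. It swaps the split/sapply approach for a two-way table normalised by row. The name lookup is hypothetical, and the data frame is rebuilt literally from the printed rows in the question rather than from set.seed, since sample() can give different results across R versions.

```r
# Rebuild the example data from the printed rows in the question.
df <- data.frame(id    = c(2, 2, 3, 1, 3, 3),
                 prop1 = c("A", "B", "A", "B", "A", "B"),
                 prop2 = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
                 prop3 = c(4, 4, 6, 5, 3, 4))

# Factors keep zero-frequency levels in the tables.
df[-1] <- lapply(df[-1], as.factor)

# For each property column, cross-tabulate id against the values and
# normalise each row so the proportions for one id sum to 1.
lookup <- lapply(df[-1], function(col) prop.table(table(df$id, col), margin = 1))

# Look up by id and value, both given as character names.
lookup$prop1["3", "A"]   # share of "A" among the records with id 3
lookup$prop3["2", "4"]   # share of 4 among the records with id 2
```

Each element of lookup is a matrix-like table indexed by id (rows) and value (columns), so no column-name mangling such as X3 or FALSE. occurs.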
I think you want this:
library(reshape)
# first, convert the property columns to factors
df[-1] <- lapply(df[-1],as.factor)
# second, melt to long format: one row per id, property and value
df <- melt(df,id=c("id"),variable_name = "prop")
df$prop <- as.factor(df$prop)
#third, make the histograms with ggplot2
library(ggplot2)
h <- ggplot(df,aes(x=id))
h + geom_bar(stat="count", aes(fill=id)) + facet_grid(~ prop + value)
How do I calculate column medians using the colMedians function? I get an error: Argument 'x' must be a matrix or a vector.
col_medians <- round(colMedians(impute_marks[,-1], na.rm=TRUE),0)
k <- which(is.na(impute_marks), arr.ind=TRUE)
impute_marks[k] <- col_medians[k[,-1]]
I need to do the operation below for every column in the data frame except the first.
The code below works fine for a single column, but when I put it in a for loop it fails with an error about unknown courses.
impute_marks$c1[is.na(impute_marks$c1)] <- round(mean(impute_marks$c1[!is.na(impute_marks$c1)]),0)
Here, impute_marks is the name of the dataset and c1 is a column name.
Using the above operation I can compute the mean and replace all NA values in column c1. But I have 30+ columns. How can I write this operation as a for loop that goes through each course and replaces the NA values with the mean?
My function for the operation:
impute_marks$F27SA[is.na(impute_marks$F27SA)] <- round(mean(impute_marks$F27SA[!is.na(impute_marks$F27SA)]),0)
imputing_using_mean <- function()
{
courses <- names(impute_marks)[2:26]
for(i in seq_along(courses))
{
impute_marks$courses[[i]][is.na(impute_marks$courses[[i]])] <- round(mean(impute_marks$courses[[i]][!is.na(impute_marks$courses[[i]])]),0)
}
}
imputing_using_mean()
Essentially the same as the answer from @Aaron on "Replace NA values by row means", tweaked to account for the first column.
marks <- read.table(text="
a 1 NA 3
b 1 2 3
c NA NA NA
")
col_means <- round(colMeans(marks[,-1], na.rm=TRUE), 0)
k <- which(is.na(marks), arr.ind=TRUE)
marks[k] <- col_means[k[,2]-1]
# V1 V2 V3 V4
#1 a 1 2 3
#2 b 1 2 3
#3 c 1 2 3
Below is a solution that calculates the median of each column and replaces each NA value with the median of its column. The same approach works for the mean, except that the conversion to a matrix is not required.
library(matrixStats)
# first convert the numeric columns (everything but the first) to a matrix
matrix_marks <- as.matrix(impute_marks[,-1])
# calculate the median for each column
col_medians <- round(colMedians(matrix_marks, na.rm=TRUE),0)
# get the row and column index of each NA value
k <- which(is.na(matrix_marks), arr.ind=TRUE)
# finally replace each NA with the median of its column (second column of k)
matrix_marks[k] <- col_medians[k[,2]]
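If an explicit loop is preferred, as in the question, a minimal sketch could look like the following. The loop in the question fails because impute_marks$courses looks for a column literally named "courses"; indexing with [[ ]] and the column name held in a variable avoids that. The function in the question also modifies a local copy and never returns it. The sample data here is made up for illustration.

```r
# Made-up sample data: first column is an identifier, the rest are courses.
impute_marks <- data.frame(student = c("a", "b", "c"),
                           c1 = c(1, NA, 3),
                           c2 = c(NA, 4, 8))

imputing_using_mean <- function(marks) {
  courses <- names(marks)[-1]          # every column except the first
  for (course in courses) {
    col <- marks[[course]]             # [[ ]] accepts a name stored in a variable
    col[is.na(col)] <- round(mean(col, na.rm = TRUE), 0)
    marks[[course]] <- col
  }
  marks                                # return the modified copy
}

impute_marks <- imputing_using_mean(impute_marks)
impute_marks
```

Passing the data frame in and assigning the result back keeps the function free of side effects on global variables.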
I have managed to aggregate data successfully using the following pattern:
newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]
However, the moment I try to do anything more complicated, although the code runs, it no longer aggregates by Z: it seems to create a dataframe with the same number of observations as the original df so I know that no grouping is actually occurring.
The custom function I would like to apply is to find the n-quantile for the current list of values and then do some other stuff with it. I saw use of sdcols in another SO answer and tried something like:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c(column, someOtherColumn)]
Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.
Literal example as requested:
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
>
As you can see, DF is not grouped. There should just be two rows, one for "abc", and another for "def" since I am trying to group by Z.
As guided by eddi's point above, the basic problem is thinking that your custom function is being called inside a loop and that 'dt$column' will mysteriously give you the 'current value at the current row'. Instead it gives you the entire column (a vector). The function is passed the entire data table, not row-wise bits of data.
So, replacing the value in the return statement with something that represents a single value works. Example:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/length(dt$someOtherColumn))
}
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.3333333
2: def 1.0000000
Now the data is aggregated correctly.
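As a side note, the same aggregation can probably be written without passing .SD to a function at all, by computing directly on the grouped columns. This is a sketch under the assumption that the intended result is the share of rows at or below the 25% quantile, as in the corrected function above; dt and res are illustrative names.

```r
library(data.table)

dt <- data.table(Z = c("abc", "abc", "def", "abc"),
                 column = c(1, 2, 3, 4),
                 someOtherColumn = c(5, 6, 7, 8))

# Inside j, 'column' is the vector of values for the current group and
# .N is the number of rows in that group, so the expression yields one
# value per group.
res <- dt[, .(V1 = sum(column <= quantile(column, 0.25)) / .N), by = Z]
res
```

Because j returns a single value per group, the result has exactly one row per distinct Z.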
Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
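The two steps described above can be run end to end with the following sketch. It assumes the missing entry is a real NA rather than the literal string "<NA>", builds dat inline so the block is self-contained, and uses lengths() in place of sapply(out, length).

```r
# Example data; the missing interest_string is assumed to be a real NA.
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
                  stringsAsFactors = FALSE)

# Step 1: split each string on the literal "{". A trailing separator does
# not produce an empty piece, and an NA input stays a single NA.
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Step 2: repeat each id once per piece and flatten the pieces to a vector.
out <- data.frame(id = rep(dat$id, times = lengths(parts)),
                  interest = unlist(parts, use.names = FALSE))
out
```

Because the NA row contributes exactly one piece, id 3 is kept in the output with a missing interest instead of being dropped.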
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT