How to replace NA in R based on values in two columns

I am trying to replace missing (NA) values based on two columns. I have company codes in one column and their respective values in the second. I need to replace each missing value with the mean of the values for that company code rather than the mean of the complete column. How do I do it in R?

Assuming your data is in a data frame called 'myData', you can use the ddply function from the plyr package to compute the mean per company code. ddply splits a data frame by one or more columns, applies a function to each piece, and combines the results.
library(plyr)
# Find the entries where the value is missing (NA).
# If your data marks missing values differently, adjust this test.
nullMatches <- is.na(myData$Values)
# Compute the mean for each company from the non-missing rows.
# This returns a data frame whose first column is "Symbol".
# With 'Values' as the only numeric column, the second column will be the mean for each 'Symbol'.
meanPerCompany <- ddply(myData[!nullMatches, ], "Symbol", numcolwise(mean))
# Match each missing row's company symbol and fill in that company's mean
myData$Values[nullMatches] <- meanPerCompany[match(myData$Symbol[nullMatches], meanPerCompany[, 1]), 2]
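The same fill can be sketched in base R with ave(), which needs no extra package. This is only an illustration: the toy data is made up for the example, and it assumes missing values are stored as NA.

```r
# Toy data; the column names Symbol and Values follow the question.
myData <- data.frame(
  Symbol = c("NXCDX", "ALX", "ALX", "BESOQ", "BESOQ", "BESOQ"),
  Values = c(2345, 8654, NA, 6394, 8549, NA),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the mean of that row's group.
groupMean <- ave(myData$Values, myData$Symbol,
                 FUN = function(v) mean(v, na.rm = TRUE))

# Overwrite only the missing entries with their group's mean.
isMissing <- is.na(myData$Values)
filled <- myData
filled$Values[isMissing] <- groupMean[isMissing]
```

Working on a copy (filled) keeps the original column available for comparison; assigning back into myData works the same way.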

Do you need something like this?
library(dplyr)
df <- data.frame(Symbol = c("NXCDX", "ALX", "ALX", "BESOQ", "BESOQ", "BESOQ"),
Values = c(2345, 8654, NA, 6394, 8549, NA))
df %>% dplyr::group_by(Symbol) %>% dplyr::summarise(mean_values = mean(Values, na.rm = TRUE))

Using data.table:
library(data.table)
setDT(df)[, replace(Values, is.na(Values), mean(Values, na.rm = TRUE)), by = Symbol]

Related

Mutate a dataframe by a vector which should match variable names

I have a dataframe with a vector of years and several columns which contain the gdp_per_head_values of different countries at a specific point in time. I want to mutate this dataframe to get a variable which contains only the values of the variable of the specific point in time defined by the vector of years.
My data.frame looks like this :
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000),
)
How can I mutate this dataframe as explained above, for example by adding a variable gdp_head?
EDIT: The output should look like this:
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000)) %>%
mutate(gdp_head =c(.$'1940'[1],.$'1940'[2],.$'1960'[3],
.$'1950'[4],.$'1940'[5],.$'1960'[6],
.$'1960'[7],.$'1950'[8] ))
Here is one approach:
First, since you are going to compare the year_vector column with column names (which will be character), you can convert year_vector to character as well:
dataset$year_vector <- as.character(dataset$year_vector)
You currently have a tibble defined - but if you have it as a plain data.frame you can subset based on a [row, column] matrix and add the matched results as gdp_head:
dataset <- as.data.frame(dataset)
dataset$gdp_head <- as.numeric(dataset[cbind(1:nrow(dataset), match(dataset$year_vector, names(dataset)))])
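The matrix-indexing step above can be tried on a small stand-in data set. One caveat worth knowing: indexing a mixed-type data frame with a matrix goes through as.matrix(), which formats numeric columns as text, so the round trip through as.numeric() can round the values. The sketch below restricts the lookup to the numeric year columns instead. The data values and the helper name year_cols are made up for illustration.

```r
# Small stand-in for the question's data; the numbers are invented.
dataset <- data.frame(
  country     = c("Austria", "Germany", "Sweden"),
  year_vector = c(1950, 1940, 1960),
  `1940`      = c(10, 20, 30),
  `1950`      = c(11, 21, 31),
  `1960`      = c(12, 22, 32),
  check.names = FALSE,          # keep "1940" etc. as column names
  stringsAsFactors = FALSE
)

year_cols <- c("1940", "1950", "1960")   # assumed names of the year columns
gdp <- as.matrix(dataset[year_cols])     # purely numeric matrix

# One (row, column) pair per row: row i and the column named by year_vector[i].
idx <- cbind(seq_len(nrow(dataset)),
             match(as.character(dataset$year_vector), year_cols))

dataset$gdp_head <- gdp[idx]
```

Because gdp stays numeric, no as.numeric() conversion is needed and the values are copied unchanged.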
I came up with the following solution, which works as well:
dataset %>%
do(.,mutate(.,gdp_head = pmap(list(1:nrow(.), year_vector),
function(x,y) .[x,(y-1901+16)]) %>%
unlist() ))
In this solution I turn each year into a column index: I subtract the first year from the value in year_vector and add the column position of the first year variable. In this case the year variables start at 1901, which sits at column index 16.

How can I select certain columns in a dataframe based on their number of valid values (except NA) in R?

I'm using R, and I have a dataframe with multiple columns. I want to run code that automatically checks the number of valid (non-NA) values in each column. It should then select the columns in which at least 50% of the rows hold valid values, and save them in a new dataframe.
Can anybody help me do this? Thank you very much.
Is there a way to apply the code to an arbitrary number of columns?
Using the purrr package, you can compute the fraction of missing values in each column:
pct_missing <- purrr::map_dbl(df,~mean(is.na(.x)))
After that, you can select those columns that have less than 50% missing values by their names.
selected_column <- colnames(df)[pct_missing < 0.5]
To create a new dataset, you may use:
library(dplyr)
df_new <- df %>% select(one_of(selected_column))
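You can also write a function in base R that automatically retrieves the columns matching the criterion:
Function:
ColSel <- function(df){
  # TRUE for columns with less than 50% missing values
  vals <- apply(df, 2, function(fo) mean(is.na(fo))) < .5
  # drop = FALSE keeps a data frame even if only one column qualifies
  return(df[, vals, drop = FALSE])
}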
Some toy data
## example
df1 <- data.frame(
a = c(runif(19),NA),
b = c(rep(NA,11),runif(9)),
d = rep(NA,20),
e = runif(20)
)
Test
df2 <- ColSel(df1)
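As a quick check of the 50% rule on the toy data, here is a minimal base R sketch that uses colMeans(is.na(...)) in place of apply(). The seed is only there to make the random values reproducible and is not part of the original answers.

```r
set.seed(1)  # assumption: fixed seed so the toy values are reproducible
df1 <- data.frame(
  a = c(runif(19), NA),          # 5% missing
  b = c(rep(NA, 11), runif(9)),  # 55% missing
  d = rep(NA, 20),               # 100% missing
  e = runif(20)                  # 0% missing
)

# Share of missing values per column.
pct_missing <- colMeans(is.na(df1))

# Keep columns with less than 50% missing values.
df2 <- df1[, pct_missing < 0.5, drop = FALSE]
names(df2)
```

Only columns a and e pass the threshold here; the approach works unchanged for any number of columns.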

Get column names and frequency from a table that looks like a matrix

I have a data frame that looks like:
'Part Number' 'Person Working'
'A' 'James'
'B' 'Brian'
'A' 'Andrea'
'C' 'Tiffany'
and so on for thousands of rows. The same part can have multiple people assigned to it. I'm pretty bad at summarizing data in R, but I'm able to produce (in the console) a table that looks like a frequency matrix by typing:
table(df$partnumber, df$personworking)
and it spits out unique items as rows, and every person working's name as a column. The values are a 0 or a 1 depending on if they are working that part.
What I'm looking for is a way to summarize this information in a digestible format that says, per item:
Part Number NumWorkers Names
A 3 "James, Andrea"
B 1 "Brian"
C 1 "Tiffany"
I'm also struggling with getting my table into a data frame. I've tried:
thedataframe <- data.frame(thetable[,])
but I'm not getting very far. I want to sum the amount of people working each unique part, and concat and print each column name that has a one as a value for a given part.
What is the best way to summarize this data in Base R?
Here is a method you could use in base R with aggregate:
dfAgg <- do.call(data.frame,
aggregate(df$Person, list(df$Parts),
FUN=function(x) c(length(x), paste(x, collapse=", "))))
# add nicer names
names(dfAgg) <- c("Parts", "Count", "Person")
Aggregate allows you to run a function over groups. In this instance, we are running a function that returns both the count of individuals (via length) and their names (via paste).
Here is the sample data I used to test this.
data
set.seed(1234)
df <- data.frame("Parts"=sample(LETTERS[1:3], 10, replace=T),
"Person"=sample(c("James", "Brian", "Sam", "Tiff", "Sandy"),
10, replace=T), stringsAsFactors=F)
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'partnumber', get the number of rows (.N) and paste the 'personworking' in each 'partnumber'.
library(data.table)
setDT(df)[,.(NumWorkers = .N, Names = toString(personworking)) , by = partnumber]
or we could use dplyr
library(dplyr)
df %>%
group_by(partnumber) %>%
summarise(NumWorkers = n(), Names = toString(personworking))
Or using base R
do.call(rbind, by(df, df$partnumber, FUN = function(x)
data.frame(NumWorkers = length(x$personworking), Names = toString(x$personworking))))
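For comparison, here is a small base R sketch built on split(), run on the four rows shown in the question. The column names partnumber and personworking follow the question's table() call. The result name summaryDF is made up for the example.

```r
df <- data.frame(
  partnumber    = c("A", "B", "A", "C"),
  personworking = c("James", "Brian", "Andrea", "Tiffany"),
  stringsAsFactors = FALSE
)

# Split the names by part, then summarise each group.
byPart <- split(df$personworking, df$partnumber)

summaryDF <- data.frame(
  PartNumber = names(byPart),
  NumWorkers = unname(lengths(byPart)),                        # count per part
  Names      = unname(vapply(byPart, toString, character(1))), # "x, y" per part
  stringsAsFactors = FALSE
)
```

Unlike returning c(length(x), paste(...)) from a single function, this keeps NumWorkers as a number rather than coercing it to text.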

In R, I have two columns and would like to take the sum if a condition is met

I am trying to write a script in R that takes the sum of the values corresponding to a condition in another column.
Say I have two columns, fakeVector & fakeVector1 of table "total"
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=rbind(fakeVector, fakeVector1)
I want to get the values of fakeVector1 where fakeVector equals a specific value, for example "NTC.H3".
How would I do that?
We can try
sum(as.numeric(total["fakeVector1",][total["fakeVector",]=="NTC.H3"]))
total[2,][which(total[1,] == "NTC.H3")]
#[1] "1"
v1 <- c('NTC.H3', 'NTC.F22', 'abc123')
sum(as.numeric(total[2,][which(total[1,] %in% v1)]))
#[1] 8
If your data set is organized as a data.frame and you want the sum of one column for every value of another column, you can use the fast data.table package.
# load library
library(data.table)
# get your data
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=cbind(fakeVector, fakeVector1)
total <- as.data.table(total)
total$fakeVector1 <- as.numeric(total$fakeVector1)
# Solution
total[, .(mysum = sum(fakeVector1)), by=.(fakeVector)]
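If the two vectors are stored as columns of a data frame rather than as rows of a character matrix, the lookup and the conditional sum reduce to plain logical indexing. This is a base R sketch under that assumption. The names wanted and selectedSum are made up for the example.

```r
total <- data.frame(
  fakeVector  = c("NTC.H3", "NTC.F2", "NTC.F22", "abc123", "sample1"),
  fakeVector1 = c(1, 2, 3, 4, 5),   # stored as numbers, so no as.numeric() needed
  stringsAsFactors = FALSE
)

# Value(s) of fakeVector1 for a single key.
oneValue <- total$fakeVector1[total$fakeVector == "NTC.H3"]

# Sum of fakeVector1 over several keys.
wanted <- c("NTC.H3", "NTC.F22", "abc123")
selectedSum <- sum(total$fakeVector1[total$fakeVector %in% wanted])
```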

sum different columns in a data.frame

I have a very big data.frame and want to sum the values in every column.
So I used the following code:
sum(production[,4],na.rm=TRUE)
or
sum(production$X1961,na.rm=TRUE)
The problem is that the data.frame is very big. And I only want to sum 40 certain columns with different names of my data.frame. And I don't want to list every single column. Is there a smarter solution?
At the end I also want to store the sum of every column in a new data.frame.
Thanks in advance!
Try this:
colSums(df[sapply(df, is.numeric)], na.rm = TRUE)
where sapply(df, is.numeric) is used to detect all the columns that are numeric.
If you just want to sum a few columns, then do:
colSums(df[c("X1961", "X1962", "X1999")], na.rm = TRUE)
res <- unlist(lapply(production, function(x) if(is.numeric(x)) sum(x, na.rm=T)))
will return the sum of each numeric column.
You could create a new data frame based on the result with
data.frame(t(res))
If you don't want to include every single column, you have to indicate which ones to include (or, alternatively, which to exclude):
colsInclude <- c("X1961", "X1962", "X1963") # by name
# or #
colsInclude <- paste0("X", 1961:2003) # by name
# or #
colsInclude <- c(10:19, 23, 55, 147) # by column number
To put those columns in a new data frame, simply use [ ] as you've done:
newDF <- oldDF[, colsInclude]
To sum up each column, simply use colSums
sums <- colSums(newDF, na.rm=T)
# or #
sums <- colSums(oldDF[, colsInclude], na.rm=T)
Note that sums will be a named vector, not a data frame.
You can make it into a data frame using as.data.frame
sumsDF <- as.data.frame(sums)
# or, to append the totals as a row to the data frame from which they came #
withTotals <- rbind(newDF, "totals"=sums)
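Putting the steps together, here is a minimal self-contained sketch. The production data and the Item column are invented for illustration, and only three year columns are used to keep it short.

```r
# Invented stand-in for the 'production' data frame from the question.
production <- data.frame(
  Item  = c("wheat", "rice", "maize"),
  X1961 = c(10, NA, 30),
  X1962 = c(1, 2, 3),
  X1963 = c(5, 5, NA),
  stringsAsFactors = FALSE
)

# Build the column names instead of listing them one by one.
colsInclude <- paste0("X", 1961:1963)

# Column sums, ignoring missing values; the result is a named numeric vector.
sums <- colSums(production[, colsInclude], na.rm = TRUE)

# Store the result in a new data frame, one row per summed column.
sumsDF <- data.frame(column = names(sums), total = unname(sums),
                     stringsAsFactors = FALSE)
```

A long layout with one row per column tends to be easier to merge or plot later than a single wide row, though t(sums) gives the wide form if that is preferred.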
