Imputing missing character values in R

I have a dataset called credit_df with dimensions 32561*15. It has a column, native.country, with 1843 missing values. Missing values are coded as "?".
I have created a factor variable with the list of countries using the code below:
country <- unique(credit_df$native.country)
The result also included the "?" value, since it is part of the dataset, so I removed just that value using the code below:
country <- as.data.frame(country)
country %>% filter(country != "?")
Now the country factor variable has all the country names in the dataset. I would like to assign those names randomly to the missing values in the column. How do I do it?
I tried the code below, following one of the suggested methods:
credit_df$native.country[credit_df$native.country %in% c("?")] <-
sample(country, NROW(credit_df$native.country[credit_df$native.country %in% c("?")]), replace = T)
but all the "?" values turned into missing values (NA):
sum(is.na(credit_df$native.country))
[1] 583
NOTE: Setting this example aside, I would also welcome any general suggestion for imputing character values randomly.
Example: I have a country column with missing values, and I have a vector or data frame holding a set of country names. How do I assign those names randomly to the missing values in the country column?

You could try using sample()
credit_df$native.country[credit_df$native.country %in% c("?")] <-
sample(country, NROW(credit_df$native.country[credit_df$native.country %in% c("?")]), replace = T)
The sample() call here creates a vector of random values drawn from country. The generated vector has the same length as the number of rows you want to replace. The replace = T argument is only required when the sample is larger than the population; I included it because I did not know how many rows need replacing or how many values country holds.
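As a minimal sketch of the same idea on toy data (the small credit_df below is a made-up stand-in, not the real dataset): it assumes native.country is a character column and that country is a plain character vector. One plausible reason for ending up with NA values, which the question does not confirm, is that country had been turned into a data frame or that the column was a factor; in the factor case, converting with as.character() first should help.

```r
set.seed(1)  # only so that the illustration is reproducible

# Toy stand-in for the real data; "?" marks a missing country
credit_df <- data.frame(
  native.country = c("United-States", "?", "India", "?", "Cuba", "?"),
  stringsAsFactors = FALSE
)

# Logical index of the rows to impute
is_missing <- credit_df$native.country == "?"

# Candidate values: every observed country except the "?" placeholder,
# kept as a plain character vector (not a data frame)
country <- unique(credit_df$native.country[!is_missing])

# Draw one random country per missing row and write it back
credit_df$native.country[is_missing] <-
  sample(country, sum(is_missing), replace = TRUE)
```

Sampling from unique() gives every country the same chance; sampling from the non-missing column itself instead would keep the observed frequencies.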

Related

Fill columns with data via 2 defined parameters

I have a sample working data set (called df) which I have added columns to in R, and I would like to fill these columns with data according to very specific conditions.
I ran samples in a lab with 8 different variables, and always ran each sample with each variable twice (sample column). From this, I calculated an average result, called Cq_mean.
The columns I have added in R below refer to each variable name.
I would like to fill these columns with positive or negative based on 2 conditions :
Variable
Cq_mean
As you see with my code below, I am able to create positive or negative results based on Cq_mean, however this logically runs it over the entire dataset, not taking into account variable as well, and it fills in cells with data that I would like to remain empty. I am not sure how to ask R to take these two conditions into account at the same time.
positive: Cq_mean <= 37.1
negative: Cq_mean >= 37
Helpful information:
Under sample, the data is always separated by a dash (-) with sample number in front, and variable name after. Somehow I need to isolate what comes after the dash.
Please refer to my desired results table to visualize what I am aiming for.
df <- read.table("https://pastebin.com/raw/ZPJS9Vjg", header=T,sep="")
# add column names respective to variables
df$TypA <- ""
df$TypB <- ""
df$TypC <- ""
df$RP49 <- ""
df$RPS5 <- ""
df$H20 <- ""
df$F1409B <-""
df$F1430A <- ""
# fill columns with data
df$TypA <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
df$TypB <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
# ...and so on for each variable
desired results (subset of entire dataset done by hand in excel):
desired_outcome <- read.table("https://pastebin.com/raw/P3PPbiwr", header = T, sep="\t")

Something like this will do the trick:
df$TypA[grepl('TypA', df$sample1)] <- ifelse(df$Cq_mean[grepl('TypA', df$sample1)] >= 37.1,
'neg', 'pos')
You'll need to do this once per new column you want.
The grepl call selects only the rows where your string of choice (here TypA) is present in the sample variable, so the other rows of that column stay empty.
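To avoid repeating that line for every column, a loop along these lines may work. This is only a sketch on made-up data: it assumes the column is called sample, that the variable name is everything after the first dash, and that the 37.1 cut-off from the question's code is the intended one.

```r
# Toy stand-in for the real data (values are invented)
df <- data.frame(
  sample  = c("1-TypA", "1-TypB", "2-TypA", "2-RP49"),
  Cq_mean = c(30.2, 38.5, 37.1, 25.0),
  stringsAsFactors = FALSE
)

# Isolate what comes after the first dash, e.g. "1-TypA" -> "TypA"
variable <- sub("^[^-]*-", "", df$sample)

# Classify every row once, using the cut-off from the question's code
result <- ifelse(df$Cq_mean >= 37.1, "negative", "positive")

# One column per variable: filled only on that variable's rows, "" elsewhere
for (v in unique(variable)) {
  df[[v]] <- ifelse(variable == v, result, "")
}
```

Comparing the extracted name with == rather than grepl also avoids accidental partial matches, should one variable name be contained in another.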

Loop through dataframe rows and add value in a new column (R)

I have a dataframe (df) with a column of Latitude (Lat), and I need to match up the corresponding Longitude value (based off relationships in another dataset). New column name is 'Long_matched'.
Here, I am trying to write a new value in the column 'Long_matched' at the rows whose latitudes fall between -33.9238 and -33.9236. The data in 'Lat' has many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (so the long decimals will vary slightly), I want to write Longitude values for latitudes that fall within a 0.0002 degree range.
Some attempts of code I have tried include:
df$Long_matched <- ifelse(df$Lat< -33.9236 & df$Lat> -33.9238, 151.2279 , "N/A")
or
df$Long_matched[df$Lat< -33.9236 & df$Lat> -33.9238] <- 151.2279
I think I need to use a for loop to go through the rows with an if statement, but I am struggling to figure this out. Any help would be appreciated!
Resulting output should look something like this:
Lat Long_matched
-33.9238026666667 151.2279
-33.9236026666667 (new long value will go here)

Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace text with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002) # you need to replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment) # you need to replace `first`, `last` and `increment` with numbers
df <- within(df, {
# make a categorical version of `Lat`
Lat_cat <- cut(Lat, Latitude_breaks)
Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
Values of Lat between min_latitude and the next break (min_latitude + 0.0002) are assigned the longitude marked first.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside the range of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers. So we can use them to index another vector - in this case, Longitude_values.
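Here is a small worked version of the trick with the placeholders filled in. All numbers are invented for illustration (in particular the longitudes), and the breaks are written out by hand instead of generated with seq() so the bin edges are easy to see.

```r
# Four breaks define three latitude bins of width 0.0002
Latitude_breaks  <- c(-33.9240, -33.9238, -33.9236, -33.9234)

# One (hypothetical) longitude per bin: one fewer than the number of breaks
Longitude_values <- c(151.2277, 151.2279, 151.2281)

# Toy data: two latitudes inside the breaks, one outside
df <- data.frame(Lat = c(-33.9237, -33.9235, -33.9300))

# cut() returns a factor whose integer codes are the bin numbers
Lat_cat <- cut(df$Lat, Latitude_breaks)

# Use the bin number to look up the longitude; out-of-range rows become NA
df$Long_matched <- Longitude_values[as.integer(Lat_cat)]
```

By default cut() builds intervals that are open on the left and closed on the right, so a latitude exactly equal to the lowest break is not assigned to any bin unless include.lowest = TRUE is set.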

Merging Dataframes based on values

Apologies if I lack enough info in the question, first time posting here
I have two data frames, one with 12,000 rows (GPS) and a second with 196 rows (Details).
The GPS data frame has repeated values in a "names" column.
The Details data frame has a "position" column and a name column, with a different position value for each name.
I need the GPS data frame to have a "position" column which pulls from Details$position and repeats each time a name appears.
I tried to do this by creating a list of the names and then using a combination of setDT & setDF using a line of code given to me by someone trying something similar:
Weigh_in_check <- setDF(setDT(Weigh_in_check)[setDT(Weight_first),
Weight_initial := Weight_first$Weight, on=c("Name")])
However, I cannot adapt it to work for my data. My attempt is as follows:
Name_check <- setDF(setDT(Name_check)[setDT(GPSReview2), Position :=
PlayerDetails$Position, on=c("Player Name")])
New code following comment by Flo.P
GPSReview4[,"Position"] <- NA
GPSReview4$Position <- as.character(GPSReview4$Position)
GPSReview4$Position <- left_join(GPSReview4, PlayerDetails, by ="Position" )
which gives the following error:
Error in $<-.data.frame(*tmp*, Position, value = list(Full session = c("Yes", :
replacement has 132235 rows, data has 26447
EDIT:
These are the 2 dataframes
GPS Review4
Detail

R - Removing rows in data frame by list of column values

I have two data frames, one containing the predictors and one containing the different categories I want to predict. Both data frames contain a column named geoid. Some of the rows of my predictors contain NA values, and I need to remove these.
After extracting the geoid value of the rows containing NA values, and removing them from the predictors data frame I need to remove the corresponding rows from the categories data frame as well.
It seems like a rather basic operation but the code won't work.
categories <- as.data.frame(read.csv("files/cat_df.csv"))
predictors <- as.data.frame(read.csv("files/radius_100.csv"))
NA_rows <- predictors[!complete.cases(predictors),]
geoids <- NA_rows['geoid']
clean_categories <- categories[!(categories$geoid %in% geoids),]
None of the rows in categories/clean_categories are removed.
A typical geoid value is US06140231. typeof(categories$geoid) returns integer.

I can't say for certain this is the cause, but a very basic typo would stop the code from doing what you want. Try this correction:
clean_categories <- categories[!(categories$geoid %in% geoids),]
Almost certainly this is what you meant to happen in that line: you want to negate the result of the %in% operator. You didn't include a reproducible example, so I can't say whether the whole thing will do what you want.
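One further thing that may be worth checking, offered as an assumption since there is no reproducible example: NA_rows['geoid'] returns a one-column data frame, not a vector, and %in% against a data frame compares each value with the column as a whole. The sketch below, on invented toy data, contrasts that with extracting the column as a plain vector via $.

```r
# Toy stand-ins for the two files (IDs and values are invented)
categories <- data.frame(
  geoid = c("US01", "US02", "US03", "US04"),
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
predictors <- data.frame(
  geoid = c("US01", "US02", "US03", "US04"),
  x     = c(1.5, NA, 2.5, NA),
  stringsAsFactors = FALSE
)

# Rows of predictors that contain at least one NA
NA_rows <- predictors[!complete.cases(predictors), ]

geoids_df <- NA_rows["geoid"]  # single brackets: a one-column data frame
geoids    <- NA_rows$geoid     # dollar sign: a plain vector of IDs

# Against the data frame nothing matches, so no rows would be dropped
matches_df <- categories$geoid %in% geoids_df

# Against the vector the matching rows are found and removed
clean_categories <- categories[!(categories$geoid %in% geoids), ]
```

If geoid was read in as a factor, wrapping both sides in as.character() keeps the comparison on the text values.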

Put column sums in a new row in a matrix

I have a data frame that consists of municipality names (factors) in the first column and number of projects (integers) in columns two and three.
Var.1<-c("Andover", "Avon", "Bethany")
Freq.x<-c(2,NA,10)
Freq.y<-c(4,2,9)
Projects<-data.frame(Var.1,as.integer(as.numeric(Freq.y)),as.integer(as.numeric(Freq.x)))
[Note: I am making the second and third columns as integers here because that's how they are categorized in my actual data set.]
I was able to take the row sums of the rows using:
Projects$Sum<-rowSums(Projects[,2:3])
However, I'm unable to figure out how to take the column sums. I tried using the following formula:
Projects[Total,]<-colSums(Projects[2:3,])
I get the error:
Error in colSums(Projects[2:3, ]) : 'x' must be numeric
Even when I convert the second and third columns to as.numeric, I get the same response.
Can someone advise how to obtain the column sums and create a new row at the bottom to hold the results?

You can do something like this:
Var.1<-c("Andover", "Avon", "Bethany")
Freq.x<-c(2,NA,10)
Freq.y<-c(4,2,9)
freq <- cbind(Freq.x, Freq.y)
freq <- rbind(freq, colSums(freq, na.rm=TRUE))
Projects <- data.frame(name=c(Var.1, "Total"), freq)
In particular: keep the numeric part separate and compute its sums; add "Total" to the character vector before it is converted to a factor, and only then build the data.frame.
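If the data frame already exists, a variation on the same idea is to compute the sums from the numeric columns only and append them as a new row. This is a sketch that assumes the name column is stored as character (stringsAsFactors = FALSE) and that NA should be ignored in the totals.

```r
# Same toy data as in the question, with names kept as character
Projects <- data.frame(
  Var.1  = c("Andover", "Avon", "Bethany"),
  Freq.x = c(2, NA, 10),
  Freq.y = c(4, 2, 9),
  stringsAsFactors = FALSE
)

# colSums only accepts numeric input, so select the numeric columns by name
totals <- colSums(Projects[, c("Freq.x", "Freq.y")], na.rm = TRUE)

# Build a one-row data frame with matching column names and append it
total_row <- data.frame(
  Var.1  = "Total",
  Freq.x = totals[["Freq.x"]],
  Freq.y = totals[["Freq.y"]],
  stringsAsFactors = FALSE
)
Projects <- rbind(Projects, total_row)
```

Note that the question's Projects[2:3,] selects rows 2 to 3 with all columns, including the names, which is why colSums complained that 'x' must be numeric; Projects[, 2:3] selects the two numeric columns.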