Add new column based on coincidence of two columns - r

I have these two starter data frames:
df1 <- data.frame("Location" = c('NE', 'SW', 'NW'), "Time" = c('0400', '1620', '2110'), "Assignment" = c('Painter', 'Astronaut', 'Bartender'), "Frequency" = c(84, 122, 139))
df1
Location Time Assignment Frequency
1 NE 0400 Painter 84
2 SW 1620 Astronaut 122
3 NW 2110 Bartender 139
df2 <- data.frame("Location" = c('NE', 'SW', 'NW', 'NW', 'SE'), "Time" = c('0400', '1620', '2110', '2240', '1410'), "Assignment" = c('Scripter', 'Port Patrol', 'Lawyer', 'Supplier', 'Youtuber'), "Frequency" = c(82, 126, 144, 94, 102))
df2
Location Time Assignment Frequency
1 NE 0400 Scripter 82
2 SW 1620 Port Patrol 126
3 NW 2110 Lawyer 144
4 NW 2240 Supplier 94
5 SE 1410 Youtuber 102
Suppose I didn't know which data frame was larger; in this case df2 is larger than df1. I now want to see which values of the columns 'Location' AND 'Time' coincide between the two. For these matches, add a new column stating 'Coincide'. If not, this column should be NA.
For this, I tried:
df3$NewCol <- NA
df3$NewCol[df1$Location == df2$Location & df1$Time == df2$Time] <- 'Coincide'
or
if(df1$Location == df3$Location & df1$Time == df3$Time) {
df3$NewCol <- 'Coincide'
}
(In these attempts I created a new df3, which is a merge of df1 and df2.)
But on both of these tries I get the error:
longer object length is not a multiple of shorter object length
I believe this is a problem with the two data frames having different lengths, but how can I overcome it?
Thanks in advance

Answering the first question of adding a new column with 'Coincide'.
We can do a full join of df1 and df2, which gives all the entries present in either data frame, irrespective of their sizes. We can then check for NA values and assign 'Coincide' or NA based on that.
all_data <- merge(df1, df2, by = c('Location', 'Time'), all = TRUE)
all_data$new_col <- c('Coincide', NA)[(rowSums(is.na(all_data[-c(1:2)])) > 0) + 1]
all_data
# Location Time Assignment.x Frequency.x Assignment.y Frequency.y new_col
#1 NE 0400 Painter 84 Scripter 82 Coincide
#2 NW 2110 Bartender 139 Lawyer 144 Coincide
#3 NW 2240 <NA> NA Supplier 94 <NA>
#4 SE 1410 <NA> NA Youtuber 102 <NA>
#5 SW 1620 Astronaut 122 Port Patrol 126 Coincide
You can then select only the columns that you need for further analysis.
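The NA check can equivalently be written with complete.cases(), which some may find easier to read. A base-R sketch of the same idea:

```r
all_data <- merge(df1, df2, by = c('Location', 'Time'), all = TRUE)
# rows present in both data frames have no NAs after the full join,
# so complete.cases() flags exactly the coinciding Location/Time pairs
all_data$new_col <- ifelse(complete.cases(all_data), 'Coincide', NA)
```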

Related

How can I differentiate between months using pairs in R

data(airquality)
a=airquality
convert_fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}
a[,4] = convert_fahr_to_kelvin(a[,4])
oz=a[,1]
sr=a[,2]
wv=a[,3]
te=a[,4]
pairs(~oz+sr+wv+te,
col = c("orange") ,
pch = c(18),
labels = c("Ozone", "Solar Irradiance", "Wind Speed", "Temperature"),
main = "Pairwise scatter plots")
This is the graphic that I get.
This is what I am doing, but I would actually like to differentiate between months: for example, the first 31 rows of my matrix a (across all columns) should be green, and so on for each month. I tried to separate the numbers into groups using group:
group <- NA
group[sr[1:31]]<-1
group[sr[32:61]]<-2
group[sr[62:92]]<-3
group[sr[93:123]]<-4
group[sr[124:153]]<-5
group[sr[1:31]]
group[sr[32:61]]
group[sr[62:92]]
group[sr[93:123]]
group[sr[124:153]]
Here the numbers are repeated.
But what I get is that if the numbers in a column are the same, they end up in the same group. I have been trying to solve it in other ways, but I still don't get what I want.
It is easier to create a group with gl
group <- as.integer(gl(length(sr), 31, length(sr)))
table(group)
#group
#1 2 3 4 5
#31 31 31 31 29
In the OP's code, 'group' is initialized as a length-1 NA. Then, it is indexed by the values of 'sr' instead of just
group <- integer(length(sr))
group[1:31] <- 1
group[32:61] <- 2
...
whereas if we use the sr values as the index
sr[1:31]
#[1] 190 118 149 313 NA NA 299 99 19 194 NA 256 290 274 65 334 307 78 322 44 8 320 25 92 66 266 NA 13 252 223 279
then the group values that are changed to 1 are those at positions 190, 118, 149, 313, ....
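With a correct group vector in hand, it can drive the point colours in pairs(). A sketch of the idea (colour choices are arbitrary; note that airquality also ships its own Month column, which could be used instead):

```r
group <- as.integer(gl(length(sr), 31, length(sr)))
# or, using the data set's own Month column (months 5..9 -> 1..5):
# group <- a$Month - 4
cols <- c("green", "orange", "blue", "red", "purple")
pairs(~oz + sr + wv + te,
      col = cols[group],   # one colour per month group
      pch = 18,
      labels = c("Ozone", "Solar Irradiance", "Wind Speed", "Temperature"),
      main = "Pairwise scatter plots by month")
```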

If() statement in R

I am not very experienced with if statements and loops in R.
Perhaps you can help me solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, up until the moment when sum(df$fz) reaches 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but that is not what I want, because in that case I would have to pick the top manually after checking sum(df$fz).
From what I understood from #Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind these subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(df$fz),]
>
> leftover <- 450 - cumsum(df$fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134
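The same logic can be condensed further. A sketch in base R, assuming the df from the question: order by fz, then add 1 to the first k rows, where k is how many increments remain before the sum hits 450:

```r
k <- 450 - sum(df$fz)                     # number of +1 increments available
df <- df[order(df$fz), ]                  # smallest fz values first
df$new_value <- df$fz + c(rep(1, k), rep(0, nrow(df) - k))
sum(df$new_value)                         # 450
```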

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from the same data frame by matching the first 5 letters of the column names, and if they are equal, subset them and store them in a new variable.
Here is a small explanation of my required output. It is described below.
Let's say the data frame is eatable
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function matches the string correctly, but I am confused about how to do the matching among the column names in the dataset.
Edit: I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 1, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
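As a further base-R alternative, a sketch assuming the eatable data frame above: because a data frame is a list of columns, split.default() can partition the columns by name prefix in a single call, returning a named list of sub-data-frames:

```r
# split columns by the first 5 characters of their names
groups <- split.default(eatable, substr(names(eatable), 1, 5))
# groups$fruit -> fruits_area, fruits_production
# groups$veget -> vegetables_area, vegetable_production
```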

Creating empty R dataframe and adding data row-by-row

I'm new to R and this hurdle may be a case of me crossing my R and Python wires - I apologise if that's the case.
I have some data that is supplied as individual rows. I'd like to create an empty dataframe and add each row of data one at a time. I read several posts that recommend not doing this if possible but, in this case, I think it should be easier. I've read several posts giving solutions to the same problem and I think I've followed them. The code I have so far is:
# Create empty dataframe with 1 column for string and several integer columns:
df = data.frame(name=character(), int_a=integer(), int_b=integer(), int_c=integer(), int_d=integer(), int_e=integer(), stringsAsFactors=FALSE)
# Create a series of lists containing the data
r1 = list(name="Row1", int_a=13234, int_b=567, int_c=566, int_d=53, int_e=11)
r2 = list(name="Row2", int_a=34454, int_b=34, int_c=643, int_d=33, int_e=56)
r3 = list(name="Row3", int_a=73857, int_b=3, int_c=226, int_d=4, int_e=55)
r4 = list(name="Row4", int_a=86754, int_b=346, int_c=384, int_d=35, int_e=59)
r5 = list(name="Row5", int_a=33748, int_b=456, int_c=461, int_d=6, int_e=85)
r6 = list(name="Row6", int_a=97865, int_b=34654, int_c=65, int_d=35, int_e=148)
r7 = list(name="Row7", int_a=36475, int_b=3444, int_c=365, int_d=55, int_e=34)
r8 = list(name="Row8", int_a=84748, int_b=454, int_c=345, int_d=148, int_e=884)
r9 = list(name="Row9", int_a=94848, int_b=23454, int_c=6548, int_d=7, int_e=566)
# Add row by row:
df = rbind(df, r1)
df = rbind(df, r2)
df = rbind(df, r3)
df = rbind(df, r4)
df = rbind(df, r5)
df = rbind(df, r6)
df = rbind(df, r7)
df = rbind(df, r8)
df = rbind(df, r9)
The end result is almost right but there are some errors – it looks like this:
name int_a int_b int_c int_d int_e
2 Row1 13234 567 566 53 11
21 <NA> 34454 34 643 33 56
3 <NA> 73857 3 226 4 55
4 <NA> 86754 346 384 35 59
5 <NA> 33748 456 461 6 85
6 <NA> 97865 34654 65 35 148
7 <NA> 36475 3444 365 55 34
8 <NA> 84748 454 345 148 884
9 <NA> 94848 23454 6548 7 566
And a series of warnings is generated of the form:
1: In `[<-.factor`(`*tmp*`, ri, value = "Row2") :
invalid factor level, NA generated
Can anyone explain why the strings are not being entered into the dataframe and why the row names are a bit odd?
Thanks in advance.
options(stringsAsFactors = F)
your code ....
options(stringsAsFactors = T)
This will work. I'm not sure why you can't just specify it in the data.frame() call as the OP did; I would appreciate clarification on this as well.
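One way to sidestep the factor issue entirely is to build the data frame in a single call rather than growing it row by row. A sketch assuming the r1..r9 lists from the question (on R versions before 4.0, where strings default to factors, stringsAsFactors = FALSE must be passed explicitly):

```r
rows <- list(r1, r2, r3, r4, r5, r6, r7, r8, r9)
# convert each named list to a one-row data frame, then stack them
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
rownames(df) <- NULL   # reset the row names rbind accumulates
```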

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has less than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those that count up to less than 20? Maybe I am supposed to use the table() function, remove the undesired data, and put the result into a new data frame using something like lapply(), but I'm not sure due to my lack of experience in programming with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29
Here is one way to do it. Starting with the following data frame :
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do :
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives :
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one line solution with the plyr package :
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})
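Applied to the hospital data from the question, a sketch assuming the outcome and hospital_nos objects created above:

```r
# keep only rows whose State occurs at least 20 times
keep_states <- names(hospital_nos)[hospital_nos >= 20]
outcome20 <- outcome[outcome$State %in% keep_states, ]
```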