I have a dataframe with a column called 's_nummer'. This column is sometimes NA, and in that case I would like to fill in a number myself, taken from the range 700001 to 800000. In the example below, rows 3 and 4 have no value in the s_nummer column, and I would like to assign 700001 to row 3 and 700002 to row 4.
dput:
structure(list(s_nummer = c(599999, 599999, NA, NA), eerste_voornaam = c("Debbie",
"Debbie", "Debbie", "Debbie"), tussenvoegsel = c(NA, NA, NA,
NA), geslachtsnaam = c("Oomen", "Oomen", "Oomen", "Oomen")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Hope you can help!
Thanks in advance
You can use which with is.na to get the positions of the NA values in x$s_nummer and overwrite them with 700000 + seq_along(i).
i <- which(is.na(x$s_nummer))
x$s_nummer[i] <- 700000 + seq_along(i)
# s_nummer eerste_voornaam tussenvoegsel geslachtsnaam
#1 599999 Debbie NA Oomen
#2 599999 Debbie NA Oomen
#3 700001 Debbie NA Oomen
#4 700002 Debbie NA Oomen
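The question also gives an upper bound of 800000 for the generated numbers. A minimal sketch of how the same idea could be wrapped with a range check follows. The helper name fill_s_nummer and its start/end arguments are made up for illustration, and the example uses a plain data.frame with fewer columns than the dput above.

```r
# Hypothetical helper: fill the NA values of a vector with consecutive
# numbers beginning at `start`, and stop if `end` would be exceeded.
fill_s_nummer <- function(v, start = 700001, end = 800000) {
  i <- which(is.na(v))                  # positions of the missing values
  if (length(i) > end - start + 1) {
    stop("more missing values than numbers available in the range")
  }
  v[i] <- start + seq_along(i) - 1      # start, start + 1, ...
  v
}

x <- data.frame(s_nummer = c(599999, 599999, NA, NA),
                eerste_voornaam = "Debbie",
                geslachtsnaam = "Oomen",
                stringsAsFactors = FALSE)
x$s_nummer <- fill_s_nummer(x$s_nummer)
x$s_nummer
```

This does not check whether a number in the 700001 to 800000 range is already present in the column; if that can happen, the existing values would have to be excluded first.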
I have one dataframe looking as follows:
Date Element Problem Losses
1 2020-09-29 54 Energy loss NA
2 2020-09-30 54 Fault NA
3 2020-09-30 54 Energy loss NA
4 2020-09-29 40 Cooling NA
5 2020-09-29 50 Voltage NA
I would like to insert certain values in the Losses column whenever the problem column has the substring "Energy".
The values I need to insert are in another dataframe, looking like this:
Date Element Losses
1 2020-09-29 54 13.24
2 2020-09-30 54 12.16
This is just an example, as the actual dataframes I'm using are pretty big, so I'd like to do this with some type of merge by the Date and Element columns, instead of looping through both dataframes.
EDIT:
I've tried using a merge by the Element column, so that I first get the Losses repeated for all the corresponding elements, and then set the rows that don't contain my desired substring back to NA.
My problem here is that merging by Element deletes all my other rows, getting only the following:
Date Element Problem Losses
1 2020-09-29 54 Energy loss 13.24
2 2020-09-30 54 Fault NA
3 2020-09-30 54 Energy loss 12.16
Base R solution:
transform(df, Losses = insert_df$Losses[match(paste0(Date, Element, grepl("Energy", Problem)),
paste0(insert_df$Date, insert_df$Element, "TRUE"))])
Data:
df <- structure(list(Date = structure(c(18534, 18535, 18535, 18534,
18534), class = "Date"), Element = c(54L, 54L, 54L, 40L, 50L),
Problem = c("Energy loss", "Fault", "Energy loss", "Cooling",
"Voltage"), Losses = c(NA, NA, NA, NA, NA)), row.names = c(NA,
-5L), class = "data.frame")
insert_df <- structure(list(Date = structure(18534:18535, class = c("IDate",
"Date")), Element = c(54L, 54L), Losses = c(13.24, 12.16)), class = "data.frame", row.names = c(NA,
-2L))
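Since the question asked for a merge, here is a sketch of how that could look in base R: a left join on Date and Element, followed by blanking out the rows whose Problem does not mention "Energy". It assumes the all-NA Losses column is dropped from df before joining, and it rebuilds the example data with plain Date columns rather than the IDate class in the dput above.

```r
df <- data.frame(
  Date = as.Date(c("2020-09-29", "2020-09-30", "2020-09-30",
                   "2020-09-29", "2020-09-29")),
  Element = c(54L, 54L, 54L, 40L, 50L),
  Problem = c("Energy loss", "Fault", "Energy loss", "Cooling", "Voltage"),
  stringsAsFactors = FALSE
)
insert_df <- data.frame(
  Date = as.Date(c("2020-09-29", "2020-09-30")),
  Element = c(54L, 54L),
  Losses = c(13.24, 12.16)
)

# Left join: all.x = TRUE keeps the rows of df without a match in insert_df
res <- merge(df, insert_df, by = c("Date", "Element"), all.x = TRUE)

# Keep the joined value only where Problem mentions "Energy"
res$Losses[!grepl("Energy", res$Problem)] <- NA
res
```

Note that merge() sorts the result by the key columns, so the row order differs from df. The missing rows described in the question are what an inner join produces; all.x = TRUE is what retains them.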
I have a data frame describing the ownership levels of companies that looks like this:
Company Subsidiary1 Subsidiary2 Subsidiary3
DE5930 DE5931 NA NA
GB3489 GB3490 NA NA
GB3489 GB3490 GB3491 NA
US2036 US2037 NA NA
US2036 US2037 US2038 NA
US2036 US2037 US2038 GB3491
....# and so on
Now I would like to create one column of all subsidiaries per company that should look like this:
Company Subsidiaries
DE5930 DE5931
GB3489 GB3490
GB3489 GB3491
US2036 US2037
US2036 US2038
US2036 GB3491
The dataset is really large (more than 100.000 rows) and I could not figure out a solution using group_by or aggregate, as most examples are for numeric variables (e.g. computing an average).
One idea would be to remove the duplicates with df[ !duplicated(df$Subsidiary1), ] to retain the first occurrence of each subsidiary and then shift the values to the left, but the problem is that one subsidiary could belong to several companies (like "GB3491") and I do not want to lose these observations. Is there any elegant solution to this problem?
Thank you in advance!
I would suggest the following tidyverse approach:
library(tidyverse)
#Data
df <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))
The code:
df %>% pivot_longer(cols = -Company) %>% select(-name) %>%
filter(!is.na(value)) %>%
filter(!duplicated(paste(Company,value)))
Output:
# A tibble: 6 x 2
Company value
<chr> <chr>
1 DE5930 DE5931
2 GB3489 GB3490
3 GB3489 GB3491
4 US2036 US2037
5 US2036 US2038
6 US2036 GB3491
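If the tidyverse is not available, the same reshape, filter and de-duplicate steps can be sketched in base R. This is only an illustration under the assumption that every column after Company holds subsidiaries; the name long is made up here.

```r
df <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

# Stack the subsidiary columns: Company is repeated once per stacked column
long <- data.frame(
  Company      = rep(df$Company, times = ncol(df) - 1),
  Subsidiaries = unlist(df[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop empty cells and repeated company/subsidiary pairs
long <- unique(long[!is.na(long$Subsidiaries), ])

# Restore the company order of the input; ties keep their stacking order
long <- long[order(match(long$Company, unique(df$Company))), ]
rownames(long) <- NULL
long
```

Because duplicates are removed per company/subsidiary pair, a subsidiary such as "GB3491" that belongs to two companies is kept once for each of them.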
We can use coalesce
library(dplyr)
df1 %>%
transmute(Company, Subsidiaries =
coalesce(!!! rlang::syms(rev(names(df1)[-1]))))
# Company Subsidiaries
#1 DE5930 DE5931
#2 GB3489 GB3490
#3 GB3489 GB3491
#4 US2036 US2037
#5 US2036 US2038
#6 US2036 GB3491
Or with base R using max.col
cbind(df1[1], Subsidiaries = df1[-1][cbind(seq_len(nrow(df1)),
max.col(!is.na(df1[-1]), "last"))])
data
df1 <- structure(list(Company = c("DE5930", "GB3489", "GB3489", "US2036",
"US2036", "US2036"), Subsidiary1 = c("DE5931", "GB3490", "GB3490",
"US2037", "US2037", "US2037"), Subsidiary2 = c(NA, NA, "GB3491",
NA, "US2038", "US2038"), Subsidiary3 = c(NA, NA, NA, NA, NA,
"GB3491")), class = "data.frame", row.names = c(NA, -6L))
I am trying to modify my dataset with a for loop. I want to modify certain cells of some columns depending on the value of its "paired" column. My dataset could be:
data1989 <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, 0.589, 0.120),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 0.589 0.447 NA 66.897 66.097 NA
3 1987-01-19 0.120 NA NA 90.599 NA NA
Columns are "paired" by their suffix, so NDVI_1 is paired with pixelQA_1, and so on. I want to modify the values in the NDVI columns depending on the values in their "paired" pixelQA columns, as follows:
if pixelQA is NA -> then NDVI should also be NA.
if pixelQA is 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
if pixelQA is anything other than 66±0.5 or 130±0.5 -> then NDVI is set to NA (this is bad quality data which needs to be ignored).
Applying these very simple rules my data should look like:
data1989clean <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, NA, NA),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989clean
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 NA 0.447 NA 66.897 66.097 NA
3 1987-01-19 NA NA NA 90.599 NA NA
To reach my goal I am trying the following for loop:
for(i in 1:4){
data1989$NDVI_[i] <- ifelse(data1989$pixelQA_[i] < 66.5 & data1989$pixelQA_[i] > 65.5 |
data1989$pixelQA_[i] < 130.5 & data1989$pixelQA_[i] > 129.5,
data1989$NDVI_[i], NA)
}
But so far it is not working, as the dataset output looks exactly the same as the original one. Any suggestion will be welcomed.
As suggested by @George Savva, you can achieve this by pivoting longer, correcting the data, and pivoting back to wide format. Using the tidyverse, that gives:
library(tidyverse)
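newdd1 <-
  # start from the wide data
  data1989 %>%
  # one row per date and suffix, with NDVI and pixelQA as columns
  pivot_longer(cols = -date,
               names_to = c(".value", "set"),
               names_sep = "_") %>%
  # keep NDVI only where pixelQA is within 66 +/- 0.5 or 130 +/- 0.5
  mutate(NDVI = case_when(is.na(pixelQA) ~ NA_real_,
                          between(pixelQA, 65.5, 66.5) ~ NDVI,
                          between(pixelQA, 129.5, 130.5) ~ NDVI,
                          TRUE ~ NA_real_)) %>%
  # back to the original wide layout (NDVI_1, ..., pixelQA_4)
  pivot_wider(names_from = set,
              values_from = c(NDVI, pixelQA))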
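For completeness, the loop from the question can also be made to work in base R. The `$` operator takes a literal column name, so `data1989$NDVI_[i]` does not build the name NDVI_1; the name has to be assembled with paste0() and used with `[[ ]]`. The sketch below also loops over the suffixes that actually exist (1, 3 and 4) instead of 1:4. The variable names suffixes, qa and ok are made up for illustration, and the bounds are treated as inclusive, as between() does.

```r
data1989 <- data.frame(date = c("1987-01-01", "1987-01-03", "1987-01-19"),
                       NDVI_1 = c(NA, 0.589, 0.120),
                       NDVI_3 = c(NA, 0.447, NA),
                       NDVI_4 = c(NA, NA, NA),
                       pixelQA_1 = c(NA, 66.897, 90.599),
                       pixelQA_3 = c(NA, 66.097, NA),
                       pixelQA_4 = c(NA, NA, NA),
                       stringsAsFactors = FALSE)

# Suffixes that actually occur in the column names (here "1", "3" and "4")
suffixes <- sub("^NDVI_", "", grep("^NDVI_", names(data1989), value = TRUE))

for (s in suffixes) {
  ndvi <- paste0("NDVI_", s)
  qa   <- data1989[[paste0("pixelQA_", s)]]
  # TRUE only where pixelQA is within 66 +/- 0.5 or 130 +/- 0.5;
  # a missing pixelQA counts as bad quality
  ok <- !is.na(qa) & (abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5)
  data1989[[ndvi]][!ok] <- NA
}
data1989
```

On the example data this leaves only the NDVI_3 value of 0.447 in place, matching data1989clean from the question.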
I have a nested list of items such that I have 3 separate lists grouped into one. I would like to make changes to a specific column that is present in all the lists. I have more details below
X
$`Manufacturing`
Stage Days.Added Start.Date End.Date
Planning 2 1968-12-01 NA
Building 14 NA NA
Testing 3 NA NA
Implementation 15 NA NA
$`Project Analysis`
Stage Days.Added Start.Date End.Date
Initial Review 3 1968-12-01 NA
Building 14 NA NA
User Testing 20 NA NA
Implementation 15 NA NA
User Review 7 NA NA
Final Analysis 4 NA NA
lapply(X, '[', 'End.Date') gives me:
$`Manufacturing`
End.Date
NA
NA
NA
NA
$`Project Analysis`
End.Date
NA
NA
NA
NA
NA
NA
I want to create a loop whereby the 'End.Date' column is the addition of the 'Start.Date' and the 'Days.Added' column for the first row. The resulting value would be the 'Start.Date' entry for the second row which would have the 'Days.Added' column added to produce the new 'End.Date' for the second row and so forth. So basically something like this:
$`Manufacturing`
Stage Days.Added Start.Date End.Date
Planning 2 1968-12-01 1968-12-03
Building 14 1968-12-03 1968-12-17
Testing 3 1968-12-17 1968-12-20
Implementation 15 1968-12-20 1969-01-04
$`Project Analysis`
Stage Days.Added Start.Date End.Date
Initial Review 3 1968-12-01 1968-12-04
Building 15 1968-12-04 1968-12-19
User Testing 20 1968-12-19 1969-01-08
Implementation 15 1969-01-08 1969-01-23
User Review 7 1969-01-23 1969-01-30
Final Analysis 4 1969-01-30 1969-02-03
How do I achieve this?
Assuming that 'Start.Date' is of Date class,
lapply(X, transform, Start.Date = Start.Date[1] +
c(0, cumsum(Days.Added[-length(Days.Added)])),
End.Date = Start.Date[1] + cumsum(Days.Added))
data
X <- list(Manufacturing = structure(list(Stage = c("Planning", "Building",
"Testing", "Implementation"), Days.Added = c(2L, 14L, 3L, 15L
), Start.Date = structure(c(-396, NA, NA, NA), class = "Date"),
End.Date = c(NA, NA, NA, NA)), row.names = c(NA, -4L), class = "data.frame"),
`Project Analysis` = structure(list(Stage = c("Initial Review",
"Building", "User Testing", "Implementation", "User Review",
"Final Analysis"), Days.Added = c(3L, 14L, 20L, 15L, 7L,
4L), Start.Date = structure(c(-396, NA, NA, NA, NA, NA), class = "Date"),
End.Date = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-6L), class = "data.frame"))
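To see why the cumsum expressions reproduce the chained dates, here is a step-by-step sketch on the Manufacturing table alone. It rebuilds that table by hand, and the names m, first and offset are only illustrative.

```r
m <- data.frame(
  Stage      = c("Planning", "Building", "Testing", "Implementation"),
  Days.Added = c(2L, 14L, 3L, 15L),
  Start.Date = as.Date(c("1968-12-01", NA, NA, NA)),
  End.Date   = as.Date(NA),
  stringsAsFactors = FALSE
)

first  <- m$Start.Date[1]        # the only date that is known up front
offset <- cumsum(m$Days.Added)   # 2, 16, 19, 34 days after `first`

# Every end date is the first start date plus the running total of days
m$End.Date <- first + offset

# Each stage starts where the previous one ended; the first keeps `first`
m$Start.Date <- c(first, head(m$End.Date, -1))
m
```

With the posted data, where Building is 14 days in 'Project Analysis', the second End.Date there comes out as 1968-12-18 rather than the 1968-12-19 shown in the desired output, which lists 15 days for that stage.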
I am new to R, and could not find specific help for my question on this site.
I have (among others) ten character variables in my dataframe $grant_database, country_1 through country_10. Each contains either a country code, for example E20, F27 or G10, or an NA. Each case is a grant to a project. The ten country variables specify which country/countries a grant is benefitting. In my dataframe, most, but not all cases will have at least one country code, first marked in country_1, many will have one for country_2 as well, and some even for country_3 to _10. All empty fields are marked with an NA.
id country_1 country_2 country_3 country_4 country_5 country_6 ...new_binaryvar
1 F20 NA NA NA NA NA 0
2 E12 E17 E52 NA NA NA 0
3 O62 O33 NA NA NA NA 0
4 E21 E20 NA NA NA NA 1
5 NA NA NA NA NA NA 0
...
I wish to create a new factor flagging grants which benefit a defined subset of countries. This binary "dummy" variable should give the value "1" to each case that in at least one of the ten country variables corresponds with a list of country codes. It should give "0" to each case/grant that does not have a corresponding country code in any of its ten country variables. Let this subset of country codes to be flagged be: E20, F27 and G10 (in reality, there are about 40 to be flagged, from 150+).
Would you help me out by suggesting a way to program this? Thank you very much for your help!
Assuming you want to check whether any of a subset of country codes occurs in the "country" variables of each row, so that a row gets "1" if at least one code is present and "0" otherwise: create a vector (v1) of the country codes to check. Convert the dataset (df) to a matrix after removing the "id" column (as.matrix(df[,-1])) and compare it with "v1" using %in%, which returns a plain logical vector. Turn that vector back into a matrix by assigning the dimensions (dim<-) of df[,-1], here c(5,7). Then take rowSums, double negate (!!) to get TRUE for any non-zero count, and finally add 0 to convert the logical values into the binary dummy variable.
v1 <- c('E20', 'F27', 'G10')
(!!rowSums(`dim<-`(as.matrix(df[,-1]) %in% v1, c(5,7))))+0
#[1] 0 0 0 1 0
newdata
df <- structure(list(id = 1:5, country_1 = c("F20", "E12", "O62", "E21",
NA), country_2 = c(NA, "E17", "O33", "E20", NA), country_3 = c(NA,
"E52", NA, NA, NA), country_4 = c(NA, NA, NA, NA, NA), country_5 = c(NA,
NA, NA, NA, NA), country_6 = c(NA, NA, NA, NA, NA), country_7 = c(NA,
NA, NA, NA, NA)), .Names = c("id", "country_1", "country_2",
"country_3", "country_4", "country_5", "country_6", "country_7"
), class = "data.frame", row.names = c(NA, -5L))
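The dimensions c(5,7) above are hard-coded for the example. A sketch that avoids them, assuming the country columns can be picked out by the prefix "country_", could look like this. It uses a shortened version of the example data, and country_cols and flag are names chosen for illustration.

```r
df <- data.frame(id = 1:5,
                 country_1 = c("F20", "E12", "O62", "E21", NA),
                 country_2 = c(NA, "E17", "O33", "E20", NA),
                 country_3 = c(NA, "E52", NA, NA, NA),
                 country_4 = c(NA, NA, NA, NA, NA),
                 stringsAsFactors = FALSE)
v1 <- c("E20", "F27", "G10")

# Select the country columns by name instead of by position
country_cols <- grep("^country_", names(df), value = TRUE)

# One logical vector per country column (TRUE where the code is in v1),
# combined with "or" so a single hit anywhere in the row is enough
flag <- Reduce(`|`, lapply(df[country_cols], `%in%`, v1))

# as.integer() turns TRUE/FALSE into 1/0
df$new_binaryvar <- as.integer(flag)
df$new_binaryvar
```

Since %in% returns FALSE rather than NA for missing values, rows that are entirely NA end up with 0. If a factor is needed, as mentioned in the question, the result can be wrapped in factor().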