I have raw data with multiple observations, and a cleaning log that contains new values for specific columns of the raw data. I want to replace the old values with these new ones.
My raw data is:
raw_df<- data.frame(
id=c(1,2,3,4),
name=c("a","b","c","d"),
age=c(15,16,20,22),
add=c("xyz","bc","no","da")
)
My cleaning log is:
cleaning_log <- data.frame(
id=c(2,4),
question=c("name","age"),
old_value=c("b",22),
new_value=c("bob",25)
)
And my expected result is:
result<-data.frame(
id=c(1,2,3,4),
name=c("a","bob","c","d"),
age=c(15,16,20,25),
add=c("xyz","bc","no","da")
)
Note: at the end, how can I check whether the new values were replaced properly?
In addition, the question column of the cleaning log may refer to many more columns (10 to 20) that could receive new values; I give just two column names here as an example.
Thanks in advance for your help.
Use match to find the row and column indices to change in raw_df, then replace those cells with cleaning_log$new_value.
row_inds <- match(cleaning_log$id, raw_df$id)
col_inds <- match(cleaning_log$question, names(raw_df))
raw_df[cbind(row_inds, col_inds)] <- cleaning_log$new_value
raw_df
# id name age add
#1 1 a 15 xyz
#2 2 bob 16 bc
#3 3 c 20 no
#4 4 d 25 da
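On the question of checking the result, here is one possible sketch. It assumes the text columns are stored as character (hence stringsAsFactors = FALSE) and works on a copy called cleaned, a name introduced here only for illustration. Because new_value is a character vector, the matrix assignment also turns age into a character column, so the sketch converts it back to numeric before comparing each changed cell with the log.

```r
raw_df <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("a", "b", "c", "d"),
  age  = c(15, 16, 20, 22),
  add  = c("xyz", "bc", "no", "da"),
  stringsAsFactors = FALSE
)
cleaning_log <- data.frame(
  id        = c(2, 4),
  question  = c("name", "age"),
  old_value = c("b", "22"),
  new_value = c("bob", "25"),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw data stays available for comparison
cleaned  <- raw_df
row_inds <- match(cleaning_log$id, cleaned$id)
col_inds <- match(cleaning_log$question, names(cleaned))
cleaned[cbind(row_inds, col_inds)] <- cleaning_log$new_value

# The character replacement coerced age to character; restore its type
cleaned$age <- as.numeric(cleaned$age)

# Read back every cell named in the log and compare it with new_value
current <- mapply(
  function(r, q) as.character(cleaned[[q]][r]),
  row_inds, cleaning_log$question
)
all(current == cleaning_log$new_value)  # TRUE when every replacement landed
```

If the log touches 10 to 20 columns, the same check works unchanged; only the type restoration has to be repeated for each numeric column that received a replacement.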
I'm new to R Studio and am learning about dataframes.
I'm trying to add the new column "uniqueID" to my dataframe "Populations" with unique values for each row in this new column. No problem, I can append a new column like this: Populations$uniqueID
However I'm having trouble adding unique values to each row under this new column. The values should be a combination of the values in each row from the existing columns "location", "variant", and "time". So, for each row the value for the new column uniqueID should be something like "LocationVariantTime" (e.g. "CaliforniaMedium1953"). Here's the code I'm trying, using paste(), but it's definitely wrong. I need to figure out how to grab the values for each row.
Populations$uniqueID <- paste(Populations$location, Populations$variant, Populations$time)
Here's the output when I view the dataframe. There is no new column with data: https://share.getcloudapp.com/7Kuykdg4
The error that I get reads:
Error in $<-.data.frame(*tmp*, uniqueID, value = character(0)) :
replacement has 0 rows, data has 280932
Thank you in advance for helping someone who is learning.
Your code doesn't seem far off. You might have to convert the values in paste() to character first though, like this:
Populations$uniqueID <- paste(as.character(Populations$location), as.character(Populations$variant), as.character(Populations$time), sep = "")
You could row-wise apply paste on the id columns.
Example
dat <- transform(dat, un.id=apply(dat[1:3], 1, paste, collapse=""))
head(dat)
# id type year value un.id
# 1 A Mmedium 2018 1.3709584 AMmedium2018
# 2 B Mmedium 2018 -0.5646982 BMmedium2018
# 3 C Mmedium 2018 0.3631284 CMmedium2018
# 4 A Large 2018 0.6328626 ALarge2018
# 5 B Large 2018 0.4042683 BLarge2018
# 6 C Large 2018 -0.1061245 CLarge2018
Data:
set.seed(42)
dat <- cbind(expand.grid(id=LETTERS[1:3],
type=c("Mmedium", "Large"),
year=2018:2020), value=rnorm(18))
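One caveat with the row-wise apply: it first converts the data frame to a character matrix, and numeric columns are formatted to a common width, so numbers of different lengths pick up padding spaces. A vectorized alternative, sketched here on a small made-up data frame (toy and its columns are illustrative only), is to hand the columns straight to paste0 with do.call.

```r
# Hypothetical data: the numeric column has values of different widths
toy <- data.frame(
  a = c("x", "y"),
  n = c(5, 10),
  stringsAsFactors = FALSE
)

# Row-wise apply goes through as.matrix(), which pads 5 to " 5"
ids_apply <- apply(toy, 1, paste, collapse = "")

# paste0 is vectorized over columns and converts each value on its own
ids_vec <- do.call(paste0, unname(as.list(toy)))
ids_vec  # "x5" "y10"
```

The do.call version also avoids looping over rows, which matters for a data frame with hundreds of thousands of rows.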
According to the output, the column names are capitalized:
Populations$uniqueID <- paste(Populations$Location, Populations$Variant, Populations$Time)
The solution? A simple case change! Thanks, everyone.
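To see why the wrong case produced that particular error, here is a small sketch on made-up data (the two rows are illustrative, not the real Populations table). Asking a data frame for a column that does not exist returns NULL, paste() on only NULL inputs returns character(0), and assigning a zero-length vector as a column triggers "replacement has 0 rows".

```r
# Hypothetical stand-in for the real data, with capitalized column names
Populations <- data.frame(
  Location = c("California", "Texas"),
  Variant  = c("Medium", "High"),
  Time     = c(1953, 1960),
  stringsAsFactors = FALSE
)

# Column names are case-sensitive, so the lowercase name finds nothing
is.null(Populations$location)  # TRUE

# Pasting only NULL inputs gives a zero-length character vector
bad <- paste(Populations$location, Populations$variant, Populations$time)
length(bad)  # 0

# With the right names, paste0 joins the parts without separators
Populations$uniqueID <- paste0(
  Populations$Location, Populations$Variant, Populations$Time
)
Populations$uniqueID  # "CaliforniaMedium1953" "TexasHigh1960"
```

Note that plain paste() uses a space as its default separator, so paste0() or sep = "" is needed to get "CaliforniaMedium1953" rather than "California Medium 1953".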
I want to map the FactorName in the dataframe FName to the column header names of Stack, i.e., Factor1 in Stack is actually named Value, Factor2 is Leverage, etc. I have a large dataset, so manually renaming is not an option.
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(FactorID=c("Factor1","Factor2","Factor3"), FactorName=c("Value","Leverage","Growth"))
Thanks.
How about this using match:
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(
FactorID=c("Factor1","Factor2","Factor3"),
FactorName=c("Value","Leverage","Growth"))
# Matching entries from FName
colnames(Stack) <- ifelse(
!is.na(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
as.character(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
colnames(Stack));
Stack;
# rowid Value Leverage Growth
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
Explanation: We match column names of Stack and entries from FName$FactorID. If there is a match, replace with FName$FactorName, else keep the original column name.
If the factor names are handy and listed in the same order as the Factor columns of Stack, we can assign them directly:
colnames(Stack)[-1] <- as.character(FName$FactorName)
Another approach using match, but using indexing instead of ifelse
# Get indices of matches
m <- match(names(Stack), FName$FactorID)
# replace names where a match is found.
names(Stack)[!is.na(m)] <- as.character(FName$FactorName[m[!is.na(m)]])
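The same idea can be written with a named lookup vector, which some find easier to read. This is a sketch under the assumption that FName may be incomplete or in a different order than the columns of Stack; the variable names lookup and hit are introduced here for illustration.

```r
Stack <- data.frame(rowid = 1:3, Factor1 = 2:4, Factor2 = 3:5, Factor3 = 4:6)

# Deliberately out of order and missing Factor2, to show both cases
FName <- data.frame(
  FactorID   = c("Factor3", "Factor1"),
  FactorName = c("Growth", "Value"),
  stringsAsFactors = FALSE
)

# Named vector: old name -> new name
lookup <- setNames(as.character(FName$FactorName), FName$FactorID)

# Rename only the columns that appear in the lookup; leave the rest alone
hit <- names(Stack) %in% names(lookup)
names(Stack)[hit] <- unname(lookup[names(Stack)[hit]])
names(Stack)  # "rowid" "Value" "Factor2" "Growth"
```

Columns without an entry in FName (here rowid and Factor2) keep their original names, which matches the behavior of the match-based answers.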
Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
if(!exists("MergedData")){
MergedData <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(MergedData) <- c("id", "name", file)
}
else if(exists("MergedData")){
temp_data <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(temp_data) <- c("id", "name", file)
MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
rm(temp_data)
}
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
dat = read_csv(file)
dat$source.file = file
return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with Aggregation function missing: defaulting to length warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2,2),
name=rep(c("A","B"), 2),
source.file=rep(c("001","002"), each=2),
value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
mutate(counter=1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).
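Since the question asks to keep only the first entry and ignore later duplicates, another option is to drop the repeated rows before reshaping. The sketch below uses base R only and rebuilds the fake data from the addendum; first_only and wide are names introduced here, and base reshape() stands in for dcast so the example runs without extra packages.

```r
# Same fake data as in the addendum, including the duplicated row
df <- data.frame(
  id          = c(1, 2, 1, 2, 1),
  name        = c("A", "B", "A", "B", "A"),
  source.file = c("001", "001", "002", "002", "002"),
  value       = c(11, 12, 13, 14, 50),
  stringsAsFactors = FALSE
)

# Keep the first row for each id/name/source.file combination
keys <- c("id", "name", "source.file")
first_only <- df[!duplicated(df[keys]), ]

# Long to wide: one value column per source file
wide <- reshape(
  first_only,
  idvar     = c("id", "name"),
  timevar   = "source.file",
  direction = "wide"
)
names(wide)  # "id" "name" "value.001" "value.002"
```

The same duplicated() filter could be applied to each file right after reading it, in which case the original merge loop would no longer multiply rows.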
I have a function called notes_count(id) that takes a vector as a parameter (for example the function can accept different arguments 5, c(1,2,3), 6:20, or 5:1 to name a few) and returns the ID and "count" of the notes. I have a data frame with the following contents:
"ID" "Date" "Notes"
that contains an unknown amount of entries per "ID" for example:
ID Date Notes
1 xxx "This is a note"
1 xxx "More notes here"
...
8 xxx "Hello World"
The problem I am running into is that I want the output to be ordered in the same way as the input vector meaning notes_count(3:1) should list the results in reverse order as a data frame:
ID notes_count
1 3 6
2 2 288
3 1 102
and calling notes_count(1:3) would result in:
ID notes_count
1 1 102
2 2 288
3 3 6
However, table always reorders from min to max regardless of the order it is given originally. Is there a way to do what table is doing directly on the data frame, but using other functions, so that I can control the output?
Currently my code is this:
#Before calling table I have data frame "notes" in the order I want but table reorders it
notes_count <- as.data.frame(table(notes[["ID"]]))
which seems silly to make the original data frame a table and then convert it back.
EDIT:
Here is my code as basic as it is as requested
notes_count <- function(id){
## notes.csv format
## "ID","Date","Notes"
## 1,"2016-01-01","Some notes"
#read the csv to a data frame
notes <- read.csv("notes.csv")
#remove all NA values
notes <- notes[complete.cases(notes), ]
#here is where you can order the data but it won't matter when aggregating the notes to a "count" using table on the next line
notes <- notes[id, ]
#convert the table back to a data frame
notes_count <- as.data.frame(table(notes[["ID"]]))
notes_count
}
Here's a simplified example that should get you going:
set.seed(1234)
notes <- data.frame(id=sample(2:10,size = 100, replace = TRUE), Note="Some note")
notes_count <- function(id) {
counts <- table(notes[notes$id %in% id,])
return(data.frame(count=counts[as.character(id),]))
}
notes_count(c(10,2,5))
# Results
count
10 8
2 12
5 2
If I understand correctly, you want to sort the dataframe by the notes_count variable?
Then use the order function to rearrange the rows of the data frame:
your_data_frame[order(your_data_frame$notes_count,decreasing=TRUE),]
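Another way to keep the requested order is to turn the ID column into a factor whose levels are exactly the requested ids, since table() reports counts in level order. This is a minimal sketch on made-up notes; it passes the data frame in as an argument instead of reading notes.csv, and it assumes the ids in the input vector are unique. Note also that notes[id, ] in the question selects rows by position, not by ID value, which this version avoids.

```r
# Hypothetical notes: ID 1 has two notes, ID 2 has one, ID 3 has three
notes <- data.frame(
  ID    = c(1, 1, 2, 3, 3, 3),
  Notes = "Some note",
  stringsAsFactors = FALSE
)

notes_count <- function(notes, id) {
  # Levels follow the order of `id`; IDs not requested become NA and are dropped
  counts <- table(factor(notes$ID, levels = id))
  data.frame(ID = id, notes_count = as.vector(counts))
}

res <- notes_count(notes, 3:1)
res$ID           # 3 2 1
res$notes_count  # 3 1 2
```

An id with no notes at all shows up with a count of 0 rather than being silently omitted.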
I have a dataset loaded in R, and I have one of the columns that has text. This text is not unique (any row can have the same value) but it represents a specific condition of a row, and so the first 3-5 letters of this field will represent the group where the row belongs. Let me explain with an example.
Having 3 different rows, only showing the id and the column I need to group by:
ID........... TEXTFIELD
1............ VGH2130
2............ BFGF2345
3............ VGH3321
Given the previous example, I would like to create a new column in the dataframe that holds the group, such as
ID........... TEXTFIELD........... NEWCOL
1............ VGH2130............. VGH
2............ BFGF2345............ BFGF
3............ VGH3321............. VGH
To determine the groups that would be formed in this new column, I would like to supply a vector of the possible groups (since every row will belong to one of these groups), for example groups <- c("VGH", "BFGF", ......).
Can anyone shed any light on how to do this efficiently? (Without a for loop, since I have millions of rows and that would take ages.)
You can also try (with the stringr package loaded):
> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
ID TEXTFIELD group
1 1 VGH2130 VGH
2 2 BFGF2345 BFGF
3 3 VGH3321 VGH
You can try this, if df is your data.frame:
df$NEWCOL <- gsub("([A-Z]+)\\d+.*","\\1", df$TEXTFIELD)
> df
# ID TEXTFIELD NEWCOL
#1 1 VGH2130 VGH
#2 2 BFGF2345 BFGF
#3 3 VGH3321 VGH
Does the text field always have 3 or 4 letters preceding the numbers?
You can check that by doing:
sum(grepl("^[A-Za-z]{3,4}\\d+", data$TEXTFIELD)) # number of rows where TEXTFIELD starts with 3 or 4 letters followed by digits
If that equals nrow(data), then:
require(stringr)
data$NEWCOL <- str_extract(data$TEXTFIELD, "^[A-Za-z]{3,4}")
Final Step:
data$group <- ifelse(data$NEWCOL == "VGH", "Group Name", ifelse(data$NEWCOL == "BFGF", "Group Name", ifelse . . . . ))
# Complete the ifelse statement to classify all groups
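As an alternative to a long nested ifelse, the vector of known groups from the question can drive a vectorized lookup. The sketch below uses base R only; group_labels and its "Group 1"/"Group 2" values are hypothetical placeholders for whatever group names apply.

```r
data <- data.frame(
  ID        = 1:3,
  TEXTFIELD = c("VGH2130", "BFGF2345", "VGH3321"),
  stringsAsFactors = FALSE
)

# Known prefixes, as described in the question
groups <- c("VGH", "BFGF")

# Leading run of letters; strings without one are returned unchanged by sub()
prefix <- sub("^([A-Za-z]+).*$", "\\1", data$TEXTFIELD)

# Keep the prefix only when it is a known group, otherwise NA
data$NEWCOL <- ifelse(prefix %in% groups, prefix, NA_character_)

# Hypothetical labels: a named vector works as a lookup table
group_labels <- c(VGH = "Group 1", BFGF = "Group 2")
data$group <- unname(group_labels[data$NEWCOL])
data$NEWCOL  # "VGH" "BFGF" "VGH"
```

Both sub() and the named-vector lookup operate on whole columns at once, so no explicit loop over the rows is needed.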