Allocating values from one dataframe to another - r

I have the following dataframe
OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1
As illustrated above it contains a long list of occupations assigned to a specific skill level.
My actual dataframe is a household survey with millions of rows, including a column which is also named OCC1990.
My goal is to implement my assigned skill levels from the above-listed data frame into the household survey.
I applied in the past already the following code for smaller dataframes, which is a pretty manual way
cps_data[cps_data$OCC1990 %in% 3,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 4:7,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 8,"skilllevel"] <- 2
But due to the fact that I don't wanna spend hours copying pasting as well as it increases the probability of making mistakes I'm searching for a different, more direct way.
I've already tried to merge both dataframes, but this result in an error related to the size of the vector.
Is there another way than merging just the two dataframes to assign the skill level also to the occupations in the survey?
Many thanks in advance
Xx freddy

Using data.table for large dataset
create two vectors: levels and labels. The levels contains unique values of OCC1990 and labels contains the new skill levels you want to apply.
Now use levels and labels inside the factor function to modify the skill level. (I used Skilllevel = 3 for OCC1990 = 8 )
library(data.table)
setDT(df)
levels <- c(3:7,8) # unique values of OCC1990
labels <- c(rep(1,5), 3) # new Skill levels corresponding to OCC1990
setkey(df, OCC1990) # sort OCC1990 for speed before filtering
df[ OCC1990 %in% levels, Skilllevel := as.integer(as.character(factor(OCC1990, levels = levels, labels = labels)))]
head(df)
# OCC1990 Skilllevel
#1: 3 1
#2: 8 3
#3: 12 2
#4: 14 3
#5: 15 1
If you are still facing memory size issues, read in chunks of data from IO (use fread) and apply the above operation and then append data to a new file.
Data:
df <- read.table(text='OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1 ', header=TRUE)

Related

How can I group rows into categories in R [duplicate]

I would like to add a column to my dataframe that contains categorical data based on numbers in another column. I found a similar question at Create categorical variable in R based on range, but the solution provided there didn't provide the solution that I need. Basically, I need a result like this:
x group
3 0-5
4 0-5
6 6-10
12 > 10
The solutions suggested using cut() and shingle(), and while those are useful for dividing the data based on ranges, they do not create the new categorical column that I need.
I have also tried using something like (please don't laugh)
data$group <- "0-5"==data[data$x>0 & data$x<5, ]
but that of course didn't work. Does anyone know how I might do this correctly?
Why didn't cut work? Did you not assign to a new column or something?
> data=data.frame(x=c(3,4,6,12))
> data$group = cut(data$x,c(0,5,10,15))
> data
x group
1 3 (0,5]
2 4 (0,5]
3 6 (5,10]
4 12 (10,15]
What you've created there is a factor object in a column of your data frame. The text displayed is the levels of the factor, and you can change them by assignment:
levels(data$group) = c("0-5","6-10",">10")
data
x group
1 3 0-5
2 4 0-5
3 6 6-10
4 12 >10
Read some basic R docs on factors and you'll get it.

Extract multiple data.frames from one with selection criteria

Let this be my data set:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))
So, I have some variables (in my case, 15), and criteria by which I want to split the data.frame into multiple data.frames.
My criteria is the following: each other time the 'SPLITMEHERE' appears I want to take all the values, or all 'OBS' below it and get a data.frame from just these observations. So, if there's 20 'SPLITMEHERE's in starting data.frame, I want to end up with 10 data.frames in the end.
I know it sounds confusing and like it doesn't have much sense, but this is the result from extracting the raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' denotes the new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) for each county.
In the hope I will make it more clear, here is the example of exactly what I need. Let's say the first 20 observations are:
x1 x2 x3 split
1 0.307379064 0.400526799 0.2898194543 SPLITMEHERE
2 0.465236674 0.915204924 0.5168274657 OBS
3 0.063814420 0.110380201 0.9564822116 OBS
4 0.401881416 0.581895095 0.9443995396 OBS
5 0.495227871 0.054014926 0.9059893533 SPLITMEHERE
6 0.091463620 0.945452614 0.9677482590 OBS
7 0.876123151 0.702328031 0.9739113525 OBS
8 0.413120761 0.441159673 0.4725571219 OBS
9 0.117764512 0.390644966 0.3511555807 OBS
10 0.576699384 0.416279417 0.8961428872 OBS
11 0.854786077 0.164332814 0.1609375612 OBS
12 0.336853841 0.794020157 0.0647337821 SPLITMEHERE
13 0.122690541 0.700047133 0.9701538396 OBS
14 0.733926139 0.785366852 0.8938749305 OBS
15 0.520766503 0.616765349 0.5136788010 OBS
16 0.628549288 0.027319848 0.4509875809 OBS
17 0.944188977 0.913900539 0.3767973795 OBS
18 0.723421337 0.446724318 0.0925365961 OBS
19 0.758001243 0.530991725 0.3916394396 SPLITMEHERE
20 0.888036748 0.862066601 0.6501050976 OBS
What I would like to get is this:
data.frame1:
1 0.465236674 0.915204924 0.5168274657 OBS
2 0.063814420 0.110380201 0.9564822116 OBS
3 0.401881416 0.581895095 0.9443995396 OBS
4 0.091463620 0.945452614 0.9677482590 OBS
5 0.876123151 0.702328031 0.9739113525 OBS
6 0.413120761 0.441159673 0.4725571219 OBS
7 0.117764512 0.390644966 0.3511555807 OBS
8 0.576699384 0.416279417 0.8961428872 OBS
9 0.854786077 0.164332814 0.1609375612 OBS
And
data.frame2:
1 0.122690541 0.700047133 0.9701538396 OBS
2 0.733926139 0.785366852 0.8938749305 OBS
3 0.520766503 0.616765349 0.5136788010 OBS
4 0.628549288 0.027319848 0.4509875809 OBS
5 0.944188977 0.913900539 0.3767973795 OBS
6 0.723421337 0.446724318 0.0925365961 OBS
7 0.888036748 0.862066601 0.6501050976 OBS
Therefore, split column only shows me where to split, data in columns where 'SPLITMEHERE' is written is meaningless. But, this is no bother, as I can delete this rows later, the point is in separating multiple data.frames based on this criteria.
Obviously, just the split() function and filter() from dplyr wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear in regular fashion, but just like in my above example. Once there is a gap of 3 lines, and other times it could be 10 or 15 lines.
Is there any way to extract this efficiently in R?
The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use a split to get your result.
With that said, you can use a cumsum for the groups. Here I divide the cumsum by 2 and use a ceiling so that any groups of 2 SPLITMEHERE's will be collapsed into one. I also use an ifelse to exclude the rows with SPLITMEHERE:
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
The result is a list with a dataframe for each group. The groups with 0 are ones you want throw out.

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly eg.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
totaldata[totaldata$LactationNo == 2,],
by.x="AnimalId",
by.y="AnimalId")[,"AnimalId"]

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows which feature the Participant to a survey and 158 columns, each which represents the question number. The answers for each are 1-5. The raw data includes the number "99" to indicate that a question was not answered. I need to exclude any questions where a Participant did not answer without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
Which works fine when I am working with sets where all my answers are contained in one column. Then this would just delete the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set would end up having 'blanks' when the "99" occured so that these 99's would not inflate or otherwise interfere with the statistics I run of the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and T-Tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible to do in R? I've tried to filter it before submitting to R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better code or package to use)
Any assistance would be greatly appreciated!
You could replace the "99" by "NA" and the calculate the colMeans omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))
colMeans(df) # nono
dfc <- df
dfc[dfc == 99] <- NA
colMeans(dfc, na.rm = TRUE)
You can also indicate which values are NA's when you read your data base. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")

Unit of Analysis Conversion

We are working on a social capital project so our data set has a list of an individual's organizational memberships. So each person gets a numeric ID and then a sub ID for each group they are in. The unit of analysis, therefore, is the group they are in. One of our variables is a three point scale for the type of group it is. Sounds simple enough?
We want to bring the unit of analysis to the individual level and condense the type of group it is into a variable signifying how many different types of groups they are in.
For instance, person one is in eight groups. Of those groups, three are (1s), three are (2s), and two are (3s). What the individual level variable would look like, ideally, is 3, because she is in all three types of groups.
Is this possible in the least?
##simulate data
##individuals
n <- 10
## groups
g <- 5
## group types
gt <- 3
## individuals*group membership
N <- 20
## inidividuals data frame
di <- data.frame(individual=sample(1:n,N,replace=TRUE),
group=sample(1:g,N, replace=TRUE))
## groups data frame
dg <- data.frame(group=1:g, type=sample(1:gt,g,replace=TRUE))
## merge
dm <- merge(di,dg)
## order - not necessary, but nice
dm <- dm[order(dm$individual),]
## group type per individual
library(plyr)
dr <- ddply(dm, "individual", function(x) length(unique(x$type)))
> head(dm)
group individual type
2 2 1 2
8 2 1 2
20 5 1 1
9 3 3 2
12 3 3 2
17 4 3 2
> head(dr)
individual V1
1 1 2
2 3 1
3 4 2
4 5 1
5 6 1
6 7 1
I think what you're asking is whether it is possible to count the number of unique types of group to which an individual belongs.
If so, then that is certainly possible.
I wouldn't be able to tell you how to do it in R since I don't know a lot of R, and I don't know what your data looks like. But there's no reason why it wouldn't be possible.
Is this data coming from a database? If so, then it might be easier to write a SQL query to compute the value you want, rather than to do it in R. If you describe your schema, there should be lots of people here who could give you the query you need.

Resources