Monday, 15 January 2007

Cases to Variables and Variables to Cases

Some SPSS techiques are based on the assumption that cases (rows) represent independent obervations and/or that related observations are recorded in separate variables rather than separate cases. If data file contains groups of related cases, you may not be able to use the appropriate statistical techniques (Paired Samples T Test of Repeated Measures GLM) because the data are not organized in the required fashion for those techniques.

The CASETOVARS command combines the related cases and produces the new variables.
[Syntax]
GET FILE='c:\sample.sav'.
SORT CASES BY ID_number.
CASETOVARS
/ID = ID_number
/INDEX = ID_person
/SEPARATOR = "_"
/COUNT = famsize.
VARIABLE LABELS
Income_1 "30K+"
Income_2 "40K+"
Income_3 "50K+".
[/Syntax]

Sort Cases sorts the data file by the variable that will be used to group cases in CASETOVARS command. The data file must be sorted by the variable specified on the ID subcommand of the CASETOVARS command. The ID subcommand of the CASETOVARS indicates the variables that will be used to group cases together. In this example, all cases with the same value for ID_number will become a single case in the restructured file.

Optional INDEX allocate all unique values of all non-ID variables. E.g.: only values of ID_number will be used to generate new variable. SEPARATOR subcommand specifies the character that will be used to separate original variable names and the values appended to those names for the new variable names in the restructured file. By default, a period is used. The COUNT subcommand will create a new variable that indicates the number of original cases represented by each combined case in the restructured file.

VARSTOCASES command creates the exact opposite. A simple excel file contains two columns of information: income for males and income for females. There is no known or assumed relationship between male and female values that are recorded in the same row; the two columns represent independent (unrelated) observations, and we want to create cases (rows) from the columns (variables) and create new variable that indicate the gender for each case.

[Excel input data]
Male_income Female_income
123 567
234 678
345 789
456 890
[/Excel input data]

[Syntax]
GET DATA /TYPE=XLS
/FILE = 'c:\sample.xls'
/READNAMES = ON.
VARSTOCASES
/MAKE Income FROM Male_income Female_income
/INDEX = Gender
VALUE LABELS Gender 1 'Male' 2 'Female'.
[/Syntax]

The MAKE subcommand creates a sinble income variable from the two original income variables. The INDEX subcommand creates a new variable named Gender with integer values that represent the sequential order in which the original variables are specified on the MAKE subcommand.

[SPSS dataview]
Gender Income
1 123
2 567
1 234
2 678
1 345
2 789
1 456
2 890
[/SPSS dataview]

Friday, 12 January 2007

Transposing cases and variables

Two most common commands are generally used for data transposing. FLIP command creates a new data file in which the rows adn columns in the original data are transposed so that cases (rows) become variables (columns) or vice versa. CASETOVAR command is in this manner more complex method where one can namely restructure data not only by flipping rows and columns.

Let's say you want to read (or import) the Excel spreadsheet and transpose the rows and columns:

[Syntax]
GET DATA /TYPE=XSL
/FILE='C:\example.xsl'
/READNAMES=ON.
FLIP VARIABLES=Sport Devices Cars
/NEWNAME=V1.
RENAME VARIABLES (CASE_LBL= Type).
[/Syntax]

Readnames=on reads the first row of the excel spreadsheet as variable names. The original variables names (sport, devices, cars) are automatically stored as values in a new variable called CASE_LBL. Subsequent rename variables command changed the name of this vvariable to Type.

[EXCEL DATA]
SPORT DEVICES CARS
ID 12 32 33
STORE 2 5 4
COST 12,000 32,200 1,000
[/EXCEL DATA]

[Transposed data in SPSS]
Type ID STORE COST
1 SPORT 12 2 12,000
2 DEVICES 32 5 32,200
3 CARS 33 4 1,000
[/Transposed data in SPSS]*
*in DataView

Thursday, 11 January 2007

Weighting data

The WEIGHT command stimulates case replication by treating each case as if it were actually the number of cases indicated by the value of the weight variable. Weight variable can be used to (1) adjust the distribution of cases to more accurately reflect the larger population or (2) to simulate raw data from aggregated data.

In first case, sample data contains following distribution; 52% males and 48% females, but in reality gender distributes as 49% males and 51% females.

[Syntax]
DO IF gender='male'.
- COMPUTE weightvar=49/52.
ELSE IF gender='female'.
- COMPUTE weightvar=51/48.
END IF.
WEIGHT By weightvar.
FREQ variables=gender.
[/Syntax]

DO IF structure sets value of weightvar for males (49/52; 0.94) and different value for females (51/48). The formula used is: desired proportion/observed proportion.

In second case of simulating raw data from aggregated file, variable count will be weight variable.

[Syntax]
DATA LIST LIST /income gender count.
Begin data
1, 1, 25
1, 2, 35
2, 1, 30
2, 2, 10
END DATA.
Value labels
Income 1 'under $50K' 2 '$50K+'
Gender 1 'Male' 2 'Female'.

Weight BY count.

CROSSTABS TABLES=income by gender.

[/Syntax]

Variable count corresponds with the original frequency of the cases.

Tuesday, 9 January 2007

Finding invalid values

Data might contain errors, which can be cought by simple cleaning routines. Invalid data values can include anything from simple out-of-range values to complex combinations of values that should not occur.

This sample contains a variable quantity that represents only valid values (e.g. integer values). Command syntax exlude invalid data from analysis.

[Syntax]
DATA LIST FREE /quantity.
BEGIN DATA
1 1.1 2 5 8.01 2.31 4.11 5.85
END DATA.
COMPUTE filtervar=(MOD(quantity,1)=0).
FILTER by filtervar.
SUMMARIZE
/TABLES=quantity
/FORMAT=LIST CASENUM NOTOTAL
/CELLS=COUNT.
FILTER OFF.
[/Syntax]

Compute command creates a new variable with MOD function. If quantity divided by 1 is 0, then the expression is true and filtervar will have a value of 0. For integer values, filtervar is set to 1.
*This solution filters out the entire case, including valid values for other variables in data file!


Slightly better solution is to assign invalid values to a user-missing category, which indentifies values that shold be excluded or treated in a special manner for that variable.

[Syntax]
DATA LIST FREE /quantity.
BEGIN DATA
1 1.1 2 5 8.01 2.31 4.11 5.85
END DATA.
IF (MOD(quantity,1)>0) quantity = (-9).
MISSING VALUES quantity (-9).
VALUE LABELS quantity -9 "Non-integer values".
SUMMARIZE
/TABLES=quantity
/FORMAT=LIST CASENUM NOTOTAL
/CELLS=COUNT.
[/Syntax]

Monday, 8 January 2007

Temporary variable SCRATCH

Any variable name that begins with a pound sign (#) is treated as a scratch variable that is discardd at the end of the series of transformation commands.
Simple sample:

[Syntax]
DATA LIST FREE /var1.
BEGIN DATA
1 2 3 4 5 6
END DATA.
COMPUTE factor=1.
LOOP #tempvar=1 TO var1.
- COMPUTE factor=factor * #tempvar.
END LOOP.
EXECUTE.
[/Syntax]

The scratch variable #tempvar is used as an index variable for the loop structure. For each case, the COMPUTE command is run iteratively up to the value of var1. For each iteration the current value of the variable factor is multiplied by the current loop iteration number stored in #tempvar.

Result will be:
1 1
2 2
3 6
4 24
5 120
6 720

Using TEMPORARY for transforming existing variables

The Temporary transformations remain effective through the first command that reads the data, after which the variables revert to original values. Here is a quick example how to use Temporary.

[Syntax]
DATA LIST FREE /var1 var2.
BEGIN DATA
1 2
3 4
5 6
7 8
9 10
END DATA.
TEMPORARY.
COMPUTE var1 = var1+5.
RECODE var2 (1 thru 5=1) (6 thru 10=2).
FREQ
/VARIABLES = var1 var2
/STATISTICS = MEAN STDDEV MIN MAX.
DESCRIPTIVES
/VARIABLES = var1 var2
/STATISTICS = MEAN STDDEV MIN MAX.
[/Syntax]

Saturday, 6 January 2007

Checking input data

A simple Syntax like this can make your *.sav file coherent, logic and might even help you when you need to locate any cases that does not follow your research pattern.

Here is a fairly simple check:

[Syntax]
DATA LIST FREE
/age gender pregnant.
BEGIN DATA
25 0 0
12 1 0
80 1 1
47 0 0
34 0 1
9 1 1
19 0 0
27 0 1
END DATA.

VALUE LABELS
/gender 0 'male' 1 'female'
/pregnant 0 'No' 1 'Yes'.

COMPUTE valueCheck = 0.
DO IF pregnant = 1.
DO IF gender = 0.
Compute valuecheck = 1.
ELSE IF gender = 1.
DO IF age > 55.
Compute valuecheck = 2.
ELSE IF age <>
Compute valuecheck = 3.
END IF.
END IF.
END IF.

VALUE LABELS valuecheck
0 'NO problems detected'
1 'Male and pregnant'
2 ' age > 55 and pregnant'
3 ' age <>

FREQ variables = valueCheck.
[/Syntax]

Life becomes even more easier when you use this in a script that runs immediately, when you open *.sav file.

Thursday, 4 January 2007

Getting data into SPSS

Several ways of introducing data to SPSS!

1. Type data by yourself and die!
2. Open data file (*.sav) and be happy!
3. Import data from Excel file (*.xls).
4. Import data from Database (via ODBC, etc.).

In the last case, Syntax should look like this:
[Syntax]
GET DATA
/TYPE=ODBC
/CONNECT='DNS=MS Access Database;DBQ=C:\demo.mdb;'+'DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;'
/SQL =
'SELECT Age, Education, Income'
'FROM GeneralInfosTable'
'WHERE ([Marital Status] <> 2 AND MobilePhone = 1)'.
EXECUTE.
[Syntax]
As we can see that we import data from Microsoft Access Database under certain conditions.

If you want to do the Excel way, you might consider this:
[Syntax]
GET DATA
/TYPE=XLS
/FILE='c:\demo.xls'
/SHEET=NAME 'Income'
/CELLRANGE= RANGE 'A1:K23'
/READNAMES=on.
EXECUTE.
[Syntax]

And if you want to read from a text (*.txt) file use:
[Syntax]
DATA LIST FREE
FILE ='c:\delimited_numbers.txt'
/id (f4) sex (a1) age (F2) opinion1 to opinion5 (5F).
EXECUTE.
[/Syntax]

Or you can - god noooo - type in all the data (directly to your SPSS Data View or you can use smth. like this):
[Syntax]
DATA LIST FREE
/var1(f2) var2(f2) var3(f2) var4(f3) var5(f2) name(a10).
BEGIN DATA
12 22 31 42 51 john
22 23 42 44 21 marry
55 53 32 11 12 fred
END DATA.
EXECUTE.

[/Syntax]